Spider

Reference
#dataset #benchmark #EX #EM #TS

Introduction

Spider is a large-scale, complex and cross-domain semantic parsing and Text2SQL dataset annotated by 11 college students. It consists of 10,181 questions and 5,693 unique complex SQL queries on 200 databases with multiple tables, covering 138 different domains.

The goal of Spider is converting the textual representation to an SQL query.

Assumptions & Guarantees

To create Spider, some assumptions on the dataset were made:

SQL Coverage: The dataset should cover all practical SQL queries
SQL Consistency: The annotation protocol is deterministic. All annotators will create the same SQL query given a question.
Question Clarity: For every question, there is exactly one valid query (up to semantic equivalence). That is, one should be able to infer the query from the question without resulting to any additional knowledge.

Also, to avoid overfitting over a particular database. It is guaranteed that training, dev, and test do not share a database.

SQL Hardness Criteria

To better understand the model performance on different queries, we divide SQL queries into 4 levels: easy, medium, hard, extra hard. We define the difficulty as the following:

First, we define:

SQL components 1: WHERE, GROUP BY, ORDER BY, LIMIT, JOIN, OR, LIKE，HAVING
SQL components 2: EXCEPT, UNION, INTERSECT, NESTED
Others: number of agg > 1, number of select columns > 1, number of where conditions > 1, number of group by clauses > 1, number of group by clauses > 1 (no consider col1-col2 math equations etc.)

Then different hardness levels are determined as follows.

Easy: if SQL key words have ZERO or exact ONE from [SQL components 1] and SQL do not satisfy any conditions in [Others] above. AND no word from [SQL components 2].
Medium: Either of below:
- SQL satisfies no more than two rules in [Others], and does not have more than one word from [SQL components 1], and no word from [SQL components 2].
- SQL has exact 2 words from [SQL components 1] and less than 2 rules in [Others], and no word from [SQL components 2]
Hard: Either of below:
- SQL satisfies more than two rules in [Others], with no more than 2 key words in [SQL components 1] and no word in [SQL components 2].
- SQL has 2 < number key words in [SQL components 1] <= 3 and satisfies no more than two rules in [Others] but no word in [SQL components 2].
- SQL has no more than 1 key word in [SQL components 1] and no rule in [Others], but exact one key word in [SQL components 2].
Extra Hard: All others left.
All: all SQL-question pairs to be scored.

Evaluation Metrics

Four main evaluation metrics were used in Spider:

Partial Matching without Values

In order to know model's performance on different SQL components, we provide the detailed scores on each part. Since models in our paper do not predict value string, our Partial and Exact Matching evaluation metrics do not take value strings into account.

For each SQL, we compute accuracy and F1 scores for all following components:

SELECT COLUMN: e.g. gold: ([select, col1, none], [select, col2, max]) and predicted: ([select, col1, none], [select, col3, min]) compute accuracy, recall, precision and F1 scores.
SELECT COLUMN WITHOUT AGG: e.g. gold: ([select, col1], [select, col2]) and predicted: ([select, col1], [select, col3]) compute accuracy, recall, precision and F1 scores.
WHERE COLUMN: ([where, col4, NOT IN, NESTED SQL], [where, col1, >=, novalue], [where, col2, =, novalue])
WHERE COLUMN WITHOUT OP: ([where, col1], [where, col4])
GROUP BY: ([groupby, col2], [groupby, col5])
GROUP BY HAVING: ([groupby, col2, having col1, count, >=])
ORDER BY: ([orderby, col1, no agg, desc, no limit], [orderby, *, count, asc, 3])
AND/OR: ([where, col1, col2, and], [where, col3, col2, or])
EXCEPT, UNION, INTERSECT, NESTED SQL: get the except/union/intersect/nested part in all SQLs containing except/union/intersect/nested, check if predicted except/union/intersect/nested part equals to the gold except/union/intersect/nested part.
SQL KEY WORDS: for gold and predicted sql, create a set of SQL key words if they are in [where, group by, having, desc, asc, order by, limit, except, union, intersect, not in, in, or, like].

Exact Matching without Values

In Spider, SQL matching is done component-wise using the SQL skeletons. If the predicted result gets all SQL parts right, then the score of Exact Matching without Values for this predicted example is 1, otherwise 0.