WikiSQL

#dataset #benchmark #EX #EM

Introduction

WikiSQL is a large crowd-sourced dataset for developing natural language interfaces for relational databases.

It contains 80,654 hand-annotated pairs of questions and SQL queries, distributed across 24,241 tables from Wikipedia.

Assumptions & Guarantees

  • SQL labels cover only a single SELECT column with an optional aggregation, plus WHERE conditions.
  • Every database contains a single table, so there are no JOINs.
  • No GROUP BY, ORDER BY, or other clauses are included.

To summarise, the ground-truth SQL query is guaranteed to have the following form:

SELECT OPT-AGG(COL) FROM TABLENAME
	WHERE CONDITIONS

with:

  • OPT-AGG is one of MAX, MIN, COUNT, SUM, AVG, or nothing.
  • COL is a column name.
  • TABLENAME is the table name.
  • CONDITIONS is a list of conditions, joined only by AND, in the following BNF form:
    CONDITIONS ::= CONDITION | CONDITIONS AND CONDITION
    CONDITION ::= COL CMP VALUE
    CMP ::= = | > | <
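
As a rough illustration of the query form described above, the sketch below renders a WikiSQL-style label into a SQL string. It assumes the label layout used in the released data files (a dict with `sel`, `agg`, and `conds`, where each condition is a `[column index, operator index, value]` triple) and the operator tables from the WikiSQL repository. The function names `render_query` and `format_value`, the table name `players`, and the toy column list are hypothetical and exist only for this example.

```python
# Operator tables, assumed to match the WikiSQL repository
# (the integer stored in a label is an index into these lists).
AGG_OPS = ["", "MAX", "MIN", "COUNT", "SUM", "AVG"]
COND_OPS = ["=", ">", "<"]


def format_value(value):
    """Render a condition value as a SQL literal (numbers bare, text quoted)."""
    if isinstance(value, (int, float)):
        return str(value)
    return "'" + str(value).replace("'", "''") + "'"


def render_query(label, columns, table_name="table"):
    """Render a label dict {'sel', 'agg', 'conds'} as a single-table SQL string."""
    selected = '"' + columns[label["sel"]] + '"'
    agg = AGG_OPS[label["agg"]]
    # An empty aggregation means the column is selected as-is.
    select_expr = f"{agg}({selected})" if agg else selected
    sql = f"SELECT {select_expr} FROM {table_name}"

    conditions = []
    for col_idx, op_idx, value in label["conds"]:
        column = '"' + columns[col_idx] + '"'
        conditions.append(f"{column} {COND_OPS[op_idx]} {format_value(value)}")
    if conditions:
        # Conditions are only ever combined with AND.
        sql += " WHERE " + " AND ".join(conditions)
    return sql


# Toy table header, invented for illustration.
columns = ["Player", "Position", "School", "Years"]
label = {"sel": 0, "agg": 3, "conds": [[1, 0, "Guard"], [3, 1, 5]]}
print(render_query(label, columns, "players"))
```

Because the label stores only indices, the same three fields are enough to reconstruct any query in the dataset, which is what makes slot-filling approaches to WikiSQL practical.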
    

Evaluation Metrics

Three main evaluation metrics are used for WikiSQL: