Text2SQL Datasets

Reference

| Dataset | Description | Paper | Implementation |
| --- | --- | --- | --- |
| WikiSQL | A large crowd-sourced dataset for developing natural language interfaces for relational databases. It was released along with Seq2SQL. | | |
| Spider | A large-scale, complex, cross-domain semantic parsing and text-to-SQL dataset annotated by 11 college students. | | |
| BIRD | A pioneering cross-domain dataset that examines the impact of extensive database contents on text-to-SQL parsing. | | |
| CoSQL | A corpus for building cross-domain, general-purpose database (DB) querying dialogue systems. | | |

Features

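| Dataset | Questions | Queries | Databases | Domains | Size | Size (Rows/DB) |
| --- | --- | --- | --- | --- | --- | --- |
| WikiSQL | 80,654 | 77,840 | 22,241 | 1 | 154.74 MB | 17 |
| Spider | 10,181 | 5,693 | 200 | 138 | 1.8 GB | 2K |
| BIRD | 12,751 | 12,751 | 95 | 37 | 33.4 GB | 280K |
| CoSQL | 15,598* | 3,007 | 200 | 138 | | |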

Difficulty

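| Dataset | Junction | Nesting | Reasoning | Knowledge | Context |
| --- | --- | --- | --- | --- | --- |
| WikiSQL | | | | | |
| Spider | | | | | |
| BIRD | | | | | |
| CoSQL | | | | | |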

State of the Art

See Benchmarks for more details.

| Dataset | Top Execution Accuracy | Top Exact Set Match |
| --- | --- | --- |
| Spider | 91.2% | 81.5% |
| BIRD | 65.45% | 71.35% |
| CoSQL | 66.3%* | 57.8%* |
| WikiSQL | 89.2% | 83.7% |
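
The two metrics reported here measure different things. Execution accuracy runs the predicted and reference queries against the database and compares their results, while exact set match compares the SQL clauses themselves without executing anything. The following is a minimal sketch of the first idea only, using Python's built-in `sqlite3` module. The `singer` table, the example queries, and the helper names `execution_match` and `execution_accuracy` are hypothetical and exist purely for illustration. Each benchmark ships its own evaluation script, and those scripts may treat row order, duplicates, and value normalization differently than this sketch does.

```python
import sqlite3
from collections import Counter


def execution_match(conn, predicted_sql, gold_sql):
    """Return True when both queries yield the same multiset of rows.

    Assumption: row order is ignored; official evaluators may differ.
    """
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # a predicted query that fails to run counts as a miss
    gold_rows = conn.execute(gold_sql).fetchall()
    return Counter(predicted_rows) == Counter(gold_rows)


def execution_accuracy(conn, pairs):
    """Fraction of (predicted, gold) query pairs whose results match."""
    if not pairs:
        return 0.0
    hits = sum(execution_match(conn, pred, gold) for pred, gold in pairs)
    return hits / len(pairs)


# Toy in-memory database, invented for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO singer VALUES (?, ?)",
    [("Ann", 31), ("Bo", 45), ("Cy", 27)],
)

pairs = [
    # Different SQL text, same result: counts as correct.
    ("SELECT name FROM singer WHERE age > 30",
     "SELECT name FROM singer WHERE age >= 31"),
    # Valid SQL, different result: counts as wrong.
    ("SELECT name FROM singer",
     "SELECT name FROM singer WHERE age < 30"),
    # Unknown column, so execution fails: counts as wrong.
    ("SELECT nam FROM singer",
     "SELECT name FROM singer"),
]

accuracy = execution_accuracy(conn, pairs)
print(f"{accuracy:.2f}")
```

The first pair shows why the two metrics can diverge: the queries differ textually, so a clause-level comparison would likely reject the prediction, yet they return the same rows on this database and therefore count as an execution match.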