Text2SQL Datasets

Reference

Dataset	Description	Paper	Implementation
WikiSQL	A large crowd-sourced dataset for developing natural language interfaces for relational databases. It was released along with Seq2SQL.	✅	✅
Spider	A large-scale, complex and cross-domain semantic parsing and text-to-SQL dataset annotated by 11 college students.	✅	✅
BIRD	A pioneering cross-domain dataset that examines the impact of extensive database contents on text-to-SQL parsing.	✅	✅
CoSQL	A corpus for building cross-domain, general-purpose database (DB) querying dialogue systems.	✅	✅

Features

Dataset	Questions	Queries	Databases	Domains	Size	Size (Rows/DB)
WikiSQL	80654	77840	22241	1	154.74 MB	17
Spider	10181	5693	200	138	1.8 GB	2K
BIRD	12751	12751	95	37	33.4 GB	280K
CoSQL	15598*	3007	200	138

Difficulty

Dataset	Joins	Nesting	Reasoning	Knowledge	Context
WikiSQL	❌	❌	❌	❌	❌
Spider	✅	✅	❌	❌	❌
BIRD	✅	✅	✅	✅	❌
CoSQL	✅	✅	✅	✅	✅

State of the Art

See Benchmarks for more details.

Dataset	Top Execution Accuracy	Top Exact Set Match
Spider	91.2%	81.5%
BIRD	65.45%	71.35%
CoSQL	66.3%*	57.8%*
WikiSQL	89.2%	83.7%
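
As a rough illustration of the execution accuracy column: this metric generally counts a prediction as correct when running it returns the same result as the gold query, whereas exact set match compares the SQL clauses themselves. The sketch below is a simplified approximation using Python's built-in `sqlite3` module, not the official evaluation script of any benchmark listed here. The `singer` table, its rows, the query pairs, and the helper names `execution_match` and `execution_accuracy` are all hypothetical, and details such as row ordering and value handling are assumptions that the real evaluators treat differently.

```python
import sqlite3
from collections import Counter


def execution_match(conn, predicted_sql, gold_sql):
    """Return True if both queries yield the same multiset of rows."""
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # a prediction that fails to run counts as a miss
    gold_rows = conn.execute(gold_sql).fetchall()
    # Assumption: row order is ignored, duplicates are significant.
    return Counter(predicted_rows) == Counter(gold_rows)


def execution_accuracy(conn, pairs):
    """Fraction of (predicted, gold) pairs whose results match."""
    if not pairs:
        return 0.0
    hits = sum(execution_match(conn, p, g) for p, g in pairs)
    return hits / len(pairs)


# Hypothetical toy database, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
    INSERT INTO singer VALUES (1, 'Ann', 31), (2, 'Bo', 45), (3, 'Cy', 27);
""")

pairs = [
    # Different SQL text, same result: counted as correct.
    ("SELECT name FROM singer WHERE age > 30",
     "SELECT name FROM singer WHERE NOT age <= 30"),
    # Wrong filter, different result: counted as incorrect.
    ("SELECT name FROM singer WHERE age > 40",
     "SELECT name FROM singer WHERE age > 30"),
]
print(execution_accuracy(conn, pairs))
```

The first pair shows why execution accuracy can exceed exact set match: two differently written queries can still return identical rows.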