CodeS was trained in two major steps:
StarCoder is a large language model trained on permissively licensed data from GitHub, spanning more than 80 programming languages (including SQL) as well as Git commits, GitHub issues, and Jupyter notebooks.
StarCoder was trained as a generalist across a large number of tasks, so it may not be well suited to complex SQL queries. To address this, StarCoder was incrementally pre-trained on SQL-centric data, yielding CodeS.
This data is used to enhance the SQL generation capability of language models.
For CodeS, the SQL segment of StarCoder's pre-training corpus was selected, and training on it ran for two epochs.
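As a rough illustration, the sketch below shows what such continued causal-LM pre-training could look like with the Hugging Face `transformers` Trainer; the corpus path and hyperparameters are placeholders, not the authors' actual configuration.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder")
tokenizer.pad_token = tokenizer.eos_token  # StarCoder's tokenizer has no pad token
model = AutoModelForCausalLM.from_pretrained("bigcode/starcoder")

# Hypothetical file holding the SQL segment, one snippet per line.
corpus = load_dataset("text", data_files={"train": "sql_segment.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

train_set = corpus["train"].map(tokenize, batched=True, remove_columns=["text"])

args = TrainingArguments(
    output_dir="codes-sql-pretrain",
    num_train_epochs=2,             # the SQL segment was trained for two epochs
    per_device_train_batch_size=4,  # illustrative value
    learning_rate=5e-5,             # illustrative value
)

Trainer(
    model=model,
    args=args,
    train_dataset=train_set,
    # mlm=False -> standard next-token (causal) language-modeling objective
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```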
To bridge the gap between natural language questions and SQL queries, four datasets were incorporated into the pre-training corpus: CoNaLa, StaQC, CodeAlpaca-20k, and NL-SQL-458K.
NL-SQL-458K contains 458K text-to-SQL pairs. The SQL side was collected by using regular expressions to extract all "SELECT" queries from three extensive open-source corpora: The Pile, The Stack, and GitHub Code. The corresponding natural language questions were then generated for each extracted query.
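A minimal sketch of that harvesting step follows; the pattern and sample text are illustrative and far simpler than a production extractor would need to be.

```python
import re

# Match statements that start with SELECT and run to the next semicolon.
SELECT_RE = re.compile(r"\bSELECT\b.+?;", re.IGNORECASE | re.DOTALL)

def extract_select_queries(text: str) -> list[str]:
    """Return every ``SELECT ...;`` statement found in a raw text blob."""
    return [m.group(0).strip() for m in SELECT_RE.finditer(text)]

sample = """
-- schema dump
CREATE TABLE users (id INT, name TEXT);
SELECT name FROM users WHERE id = 42;
INSERT INTO users VALUES (1, 'a');
SELECT COUNT(*)
FROM users;
"""

for query in extract_select_queries(sample):
    print(query)
# SELECT name FROM users WHERE id = 42;
# SELECT COUNT(*)
# FROM users;
```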
To bolster the model's natural language comprehension, high-quality dialog data were sampled from three sources: Alpaca-cleaned, Unnatural-instructions, and UltraChat.