CodeS: Pretraining

Introduction

CodeS was trained in two major steps:

  1. First, it was pretrained from the StarCoder model on multiple SQL, Text-to-SQL, and text datasets.
  2. Second, it was fine-tuned on the target Text-to-SQL dataset (a minimal sketch of both stages follows the list).
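
In practice the recipe amounts to two standard causal-LM training runs: the first starts from a StarCoder checkpoint, the second starts from the output of the first. Below is a minimal sketch using Hugging Face Transformers; the checkpoint name, file names, and hyperparameters are illustrative assumptions, not the authors' exact setup.

```python
# Minimal two-stage sketch: incremental pretraining, then fine-tuning.
# Checkpoint, file names, and hyperparameters are illustrative assumptions.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "bigcode/starcoderbase-1b"  # assumed StarCoder starting checkpoint
tok = AutoTokenizer.from_pretrained(base)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

def tokenize(batch):
    return tok(batch["text"], truncation=True, max_length=2048)

collator = DataCollatorForLanguageModeling(tok, mlm=False)  # causal-LM objective

def train_stage(data_file, output_dir, epochs):
    ds = load_dataset("json", data_files=data_file, split="train")
    ds = ds.map(tokenize, batched=True, remove_columns=ds.column_names)
    args = TrainingArguments(output_dir=output_dir, num_train_epochs=epochs,
                             per_device_train_batch_size=4, learning_rate=5e-5)
    Trainer(model=model, args=args, train_dataset=ds, data_collator=collator).train()

# Stage 1: incremental pretraining on the SQL-centric corpus (hypothetical file).
train_stage("sql_centric_corpus.jsonl", "codes-pretrained", epochs=2)
# Stage 2: fine-tuning on the target Text-to-SQL dataset (hypothetical file).
train_stage("target_text2sql.jsonl", "codes-finetuned", epochs=1)
```
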
StarCoder

StarCoder is a large language model trained on permissively licensed data from GitHub, covering 80+ programming languages (including SQL), Git commits, GitHub issues, and Jupyter notebooks.

Because StarCoder was trained for a broad range of tasks, it may not be well suited for generating complex SQL queries. To address this, StarCoder was further pretrained on SQL and Text-to-SQL data, yielding CodeS.

SQL Data (11GB)

This data is used to enhance the SQL generation capability of language models.

For CodeS, the SQL segment of StarCoder’s pre-training corpus was used, and the model was trained on it for 2 epochs.
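
The SQL segment of that corpus comes from The Stack. Below is a hedged sketch of streaming it from the Hugging Face Hub; the dataset ID and the "data/sql" directory layout are assumptions about how this slice would be selected, not a record of the authors' exact data pipeline.

```python
# Sketch: stream the SQL slice of The Stack (StarCoder's source corpus) for pretraining.
# The dataset ID and "data/sql" directory are assumptions, not the authors' exact setup.
from datasets import load_dataset

sql_ds = load_dataset(
    "bigcode/the-stack-dedup",  # gated dataset; requires accepting its terms on the Hub
    data_dir="data/sql",        # per-language directory containing .sql files
    split="train",
    streaming=True,             # ~11 GB of SQL text, so stream instead of downloading it all
)

for example in sql_ds.take(3):
    print(example["content"][:200])  # raw SQL used as causal-LM pretraining text
```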

Text2SQL Data (4.5GB)

Datasets

This data is used to improve the model's ability to translate natural language questions into SQL queries.

To bridge the gap between natural language questions and SQL queries, four datasets were incorporated into the pre-training corpus (a loading sketch follows the list):

  1. CoNaLa and StaQC, which are derived automatically from Stack Overflow, encompass many NL-to-Python and NL-to-SQL pairs.
  2. CodeAlpaca 20k, which encompasses a wealth of instruction-following data related to code, created using the self-instruct methodology.
  3. Jupyter-structured-clean-dedup, a subset of StarCoder’s pre-training corpus, comprises a vast collection of structured Jupyter notebooks containing both code and accompanying natural language explanations.
  4. NL-SQL-458K, a brand-new dataset specifically crafted by the authors of CodeS, described below.
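
For intuition, the sketch below shows one way to pull two of these sources from the Hugging Face Hub and mix them into a single "text" stream. The Hub IDs, column names, and 50/50 sampling ratio are assumptions; the remaining sources would be normalized the same way.

```python
# Hedged sketch: normalize heterogeneous NL-to-code sources into one "text" field and
# interleave them. Hub IDs, column names, and the 50/50 ratio are assumptions.
from datasets import load_dataset, interleave_datasets

conala = load_dataset("neulab/conala", "curated", split="train")  # NL-to-Python pairs
alpaca = load_dataset("sahil2801/CodeAlpaca-20k", split="train")  # assumed Hub ID

def to_text(example, question_key, answer_key):
    return {"text": f"Question: {example[question_key]}\nAnswer: {example[answer_key]}"}

conala = conala.map(lambda e: to_text(e, "rewritten_intent", "snippet"),
                    remove_columns=conala.column_names)
alpaca = alpaca.map(lambda e: to_text(e, "instruction", "output"),
                    remove_columns=alpaca.column_names)

# StaQC, the Jupyter subset, and NL-SQL-458K would be normalized the same way and
# added to this list with their own sampling probabilities.
mixture = interleave_datasets([conala, alpaca], probabilities=[0.5, 0.5], seed=42)
print(mixture[0]["text"])
```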

NL-SQL-458K

NL-SQL-458K contains a vast number of Text-to-SQL pairs. It was built by using regular expressions to extract all “SELECT” queries from three extensive open-source corpora:

  • The Pile
  • The Stack
  • GitHub Code

Then, SQL queries with syntax errors were filtered out, resulting in 458K SQL queries. To generate a corresponding natural language question for each SQL query, GPT-3.5 was prompted with eight paired (SQL, question) demonstrations; a sketch of this pipeline follows.
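
The sketch below is hedged: the regex, the sqlglot-based syntax check, and the prompt wording are stand-ins for whatever the authors actually used; only the overall shape (extract, filter, ask GPT-3.5 with eight demonstrations) comes from the description above.

```python
# Hedged sketch of the NL-SQL-458K pipeline: regex-extract SELECT statements, drop
# queries that fail a syntax check, then ask GPT-3.5 for a question per query using an
# eight-shot prompt. Helper names, regex, and prompt text are illustrative assumptions.
import re

import sqlglot
from sqlglot.errors import ParseError
from openai import OpenAI

SELECT_RE = re.compile(r"SELECT\b.*?;", re.IGNORECASE | re.DOTALL)

def extract_queries(raw_text: str) -> list[str]:
    """Pull candidate SELECT statements and keep only the syntactically valid ones."""
    valid = []
    for candidate in SELECT_RE.findall(raw_text):
        try:
            sqlglot.parse_one(candidate)   # stand-in for the authors' syntax filter
            valid.append(candidate.strip())
        except ParseError:
            continue                        # queries with syntax errors are discarded
    return valid

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
FEW_SHOT = "..."   # the eight (SQL, question) demonstration pairs, omitted here

def generate_question(sql: str) -> str:
    """Ask GPT-3.5 to write the natural language question behind a SQL query."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Write the natural language question answered by the given SQL query."},
            {"role": "user", "content": f"{FEW_SHOT}\nSQL: {sql}\nQuestion:"},
        ],
    )
    return resp.choices[0].message.content.strip()
```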

Text Data (4.5GB)

To bolster the model's natural language comprehension capability, high-quality dialog data were sampled from three sources:

  1. Alpaca-cleaned is designed for developing an instruction-following language model. This dataset is constructed using the self-instruct technique, aided by OpenAI’s text-davinci-003 model.
  2. Unnatural-instructions is also a large instruction-following dataset collected with almost no human labor.
  3. UltraChat is a multi-turn dialogue dataset, produced by iteratively invoking two distinct GPT-3.5 APIs.