CodeS

#model #pretraining #SFT #prompting #augmentation #schema-linking

Introduction

CodeS[1] is a fully open-source language model based on StarCoder[2], which achieves high Text2SQL accuracy with much smaller parameter sizes than general-purpose large language models.

(Figure: CodeS architecture)

To achieve state-of-the-art results, CodeS implements the following steps:

  1. Incremental pre-training: the model is further pre-trained on SQL-centric data to adapt its knowledge base.
  2. Database prompt construction: extracting a good textual representation of a database given a question.
  3. Bi-directional augmentation: augmenting the target dataset with two complementary approaches.
  4. Supervised fine-tuning: training the model on the augmented dataset.
  5. Few-shot in-context learning: building examples into the prompt to guide the model.

(Figure: CodeS pipeline)

Incremental Pre-training

StarCoder was trained on many programming languages, but it is not specialized for Text2SQL tasks.

To improve its capabilities in SQL generation and natural language understanding, it was pre-trained sequentially on three datasets: SQL-related data, natural-language-to-code data, and natural-language dialogue data.

Database Prompt construction

Beyond model advancements, a suitable prompt is required in a Text2SQL task. High-quality prompts furnish the language model with valuable insights, enabling it to generate precise SQL queries more efficiently.

To craft these superior database prompts, two key strategies were employed:

Schema Linking

Schema linking is achieved using two classification models: one predicts a relevance score for each table and the other for each column, given the tokens of the natural language question.

Once the scores are predicted:

  • The tables with the top relevance scores predicted by the table classifier are considered the relevant tables.
  • The columns with the top relevance scores predicted by the column classifier are considered the relevant columns.

If fewer than the required number of tables are deemed relevant, the list is padded by uniformly drawing additional tables from the database. The same strategy is followed to guarantee the required number of columns.
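The top-score selection with uniform padding can be sketched as follows. This is an illustrative implementation, not the paper's code; the function name, the example scores, and the table names are made up:

```python
import random

def link_schema(scores, k, all_items, seed=0):
    """Keep the k items with the highest predicted relevance scores;
    if fewer than k items were scored as relevant, pad the list by
    uniformly sampling from the remaining items of the database."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    selected = ranked[:k]
    if len(selected) < k:
        remaining = [it for it in all_items if it not in selected]
        random.Random(seed).shuffle(remaining)
        selected += remaining[:k - len(selected)]
    return selected

# Hypothetical relevance scores for tables in a banking database
table_scores = {"client": 0.92, "account": 0.88, "district": 0.75}
all_tables = ["client", "account", "district", "loan", "card"]
print(link_schema(table_scores, 4, all_tables))
```

Only three tables were scored as relevant here, so the fourth slot is filled by a uniformly drawn table.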

Value Retrieving

Introduction

Value retrieving is a method used to extract database values mentioned in the natural language question.

Example

Database: BIRD
Question: How many clients who opened their accounts in the Jesenik branch were women?
Value retrieving: "Jesenik" is found in the column district.a2
We can incorporate that information into the prompt as district.a2 = 'Jesenik'

For value retrieving, a coarse-to-fine approach was followed. The essence of
this method lies in:

  1. Using indexes for a fast yet coarse-grained initial search.
  2. Then, a fine-grained matching process using the LCS[3] algorithm.

Indexing

Lucene was used to build a BM25 index over all values stored in each database. When a user's question is received:

  1. The index first extracts hundreds of potentially relevant values from the whole database based on the question.
  2. Then, the LCS method is employed to calculate the degree of match between the question and these potentially relevant values.
  3. The most relevant values are then extracted.
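The coarse stage can be illustrated with a small pure-Python BM25 scorer, a stand-in for the Lucene index the paper uses. The database values, the tokenization, and the parameter defaults are assumptions for the demo:

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs, k1=1.5, b=0.75):
    """Score each document (a list of tokens) against the query with BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter()                      # document frequency per term
    for d in docs:
        for t in set(d):
            df[t] += 1
    scores = []
    for d in docs:
        tf = Counter(d)                 # term frequency within this document
        s = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

# Hypothetical cell values from a banking database, tokenized naively
values = ["jesenik", "decin", "hradec kralove", "south bohemia"]
docs = [v.split() for v in values]
query = "how many clients opened their accounts in jesenik branch".split()
scores = bm25_scores(query, docs)
best = max(range(len(values)), key=lambda i: scores[i])
print(values[best])  # prints "jesenik"
```

The coarse stage only has to surface plausible candidates; the fine-grained LCS pass then picks the exact match.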

LCS

With the candidate values extracted, the top candidates are selected using the Longest Common Substring between each candidate and the original question.
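The fine-grained matching step can be sketched with a standard dynamic-programming LCS. The normalization by candidate length is one reasonable scoring choice, not necessarily the paper's exact formula:

```python
def lcs_length(a, b):
    """Length of the longest common substring (contiguous) of a and b."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

question = "How many clients opened their accounts in Jesenik branch were women?"
candidates = ["Jesenik", "Jesenice", "Decin"]
# Rank candidates by LCS length, normalized by candidate length
ranked = sorted(candidates,
                key=lambda v: lcs_length(question.lower(), v.lower()) / len(v),
                reverse=True)
print(ranked[0])  # prints "Jesenik"
```

"Jesenik" matches the question exactly (normalized score 1.0), so it outranks the near-miss "Jesenice".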

Database Representation

To remove potential ambiguities, the following metadata about the database was included in the prompt:

Primary Keys & Foreign Keys

This is very common in recent Text2SQL models. In CodeS, the representation of keys is specified as follows:

  • A unique identifier column is chosen to represent a primary key, for example a column named id.
  • A foreign key relationship between two columns is represented in the prompt as an equality between the two qualified column names.
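A minimal serializer for this key metadata might look as follows. The function name, the input layout, and the exact line format are assumptions; the real CodeS prompt serialization may differ:

```python
def format_keys(primary_keys, foreign_keys):
    """Serialize key metadata into prompt lines.
    primary_keys: {table: column}; foreign_keys: [("t1.c1", "t2.c2"), ...]."""
    lines = []
    for table, col in primary_keys.items():
        lines.append(f"primary key: {table}.{col}")
    for left, right in foreign_keys:
        # a foreign key is written as an equality between qualified columns
        lines.append(f"foreign key: {left} = {right}")
    return "\n".join(lines)

print(format_keys({"client": "id", "account": "id"},
                  [("account.district_id", "district.id")]))
```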

Representative Database Values

For each column C of a table T, K distinct example values are sampled as follows:

SELECT DISTINCT C
FROM T
WHERE C IS NOT NULL
LIMIT K

Default Value

By default, the authors set K = 2 for all tables and all columns: only two sample values are selected per column.
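The sampling query above can be exercised against an in-memory SQLite database. The table and its contents are made-up demo data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE district (id INTEGER PRIMARY KEY, a2 TEXT)")
conn.executemany("INSERT INTO district (a2) VALUES (?)",
                 [("Jesenik",), ("Jesenik",), ("Decin",), (None,)])

def sample_values(conn, table, column, k=2):
    """Fetch up to k distinct non-NULL example values for one column,
    mirroring the SELECT DISTINCT ... LIMIT K query above.
    (Illustrative only: table/column names are interpolated unchecked.)"""
    cur = conn.execute(
        f"SELECT DISTINCT {column} FROM {table} "
        f"WHERE {column} IS NOT NULL LIMIT {k}")
    return [row[0] for row in cur]

print(sample_values(conn, "district", "a2"))  # two distinct non-NULL examples
```

Duplicates and NULLs are filtered out, so the prompt gets two representative values per column.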

Notes & References


  1. Haoyang Li, Jing Zhang, Hanbing Liu, Ju Fan, Xiaokang Zhang, Jun Zhu, Renjie Wei, Hongyan Pan, Cuiping Li, & Hong Chen. (2024). CodeS: Towards Building Open-source Language Models for Text-to-SQL.↩︎
  2. Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, and et al. 2023. StarCoder: may the source be with you! CoRR abs/2305.06161 (2023). arXiv:2305.06161↩︎
  3. Longest Common Substring↩︎