CodeS[1] is a fully open-source language model based on StarCoder[2], which achieves high accuracy with far fewer parameters than large language models.
To achieve state-of-the-art results, CodeS implements the following steps:
StarCoder was trained on multiple programming languages and is not tailored to Text2SQL tasks.
To improve its capabilities in SQL generation and natural language understanding, it was pre-trained sequentially on three different datasets.
Beyond model advancements, a suitable prompt is required in a Text2SQL task. High-quality prompts furnish the language model with valuable insights, enabling it to generate precise SQL queries more efficiently.
To craft these superior database prompts, two key strategies were employed:
Schema linking was achieved using two classification models.
Once the scores are predicted:
If less than
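As a concrete illustration, the score-based filtering step can be sketched as follows. The function name, the default top-k values, and the toy scores are assumptions for illustration, not taken from the CodeS codebase:

```python
# Sketch of the schema filter, assuming relevance scores have already been
# predicted by the two classifiers (parameter defaults are illustrative).

def filter_schema(table_scores, column_scores, top_tables=4, top_columns=5):
    """Keep the top-scoring tables and, for each kept table, its top columns.

    table_scores:  dict  table_name -> relevance score
    column_scores: dict  table_name -> {column_name -> relevance score}
    If the database has fewer than `top_tables` tables, all of them are kept.
    """
    kept_tables = sorted(table_scores, key=table_scores.get, reverse=True)[:top_tables]
    filtered = {}
    for table in kept_tables:
        cols = column_scores.get(table, {})
        filtered[table] = sorted(cols, key=cols.get, reverse=True)[:top_columns]
    return filtered

# Toy example with made-up scores:
tables = {"district": 0.9, "client": 0.8, "loan": 0.1}
columns = {
    "district": {"a2": 0.95, "a1": 0.4, "a3": 0.2},
    "client": {"gender": 0.9, "client_id": 0.7},
    "loan": {"amount": 0.3},
}
print(filter_schema(tables, columns, top_tables=2, top_columns=2))
# {'district': ['a2', 'a1'], 'client': ['gender', 'client_id']}
```

Only the retained tables and columns are then serialized into the prompt, which keeps the schema description short for large databases.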
Value retrieval is a method used to extract database values mentioned in the natural language question.
Database: BIRD
Query: How many clients opened their accounts in Jesenik branch were women?
Retrieved value: "Jesenik", which is found in the column district.a2
We can incorporate that information into the prompt as follows: district.a2 = "Jesenik"
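A minimal sketch of how such matches might be injected into the prompt; the helper name and the exact "table.column = 'value'" format are assumptions, and the concrete template CodeS uses may differ:

```python
# Turn retrieved (table, column, value) matches into a prompt hint.
# The "t.c = 'v'" notation is illustrative, not the official CodeS template.

def value_hint(matches):
    """matches: list of (table, column, value) triples found in the question."""
    return " ; ".join(f"{t}.{c} = '{v}'" for t, c, v in matches)

print(value_hint([("district", "a2", "Jesenik")]))
# district.a2 = 'Jesenik'
```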
For value retrieval, a coarse-to-fine approach was followed: a coarse step first narrows the set of candidate values cheaply, and a fine-grained matching step then pinpoints the exact ones.
Lucene was used to build the BM25 index for all values stored in each database. When a user’s question is received:
With the
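The coarse-to-fine idea can be sketched with the standard library alone. CodeS builds a Lucene BM25 index for the coarse stage; here, simple token overlap stands in for BM25, and difflib's longest matching block plays the fine-grained matcher, so treat this as an approximation of the pipeline rather than the actual implementation:

```python
# Stdlib-only approximation of coarse-to-fine value retrieval.
from difflib import SequenceMatcher

def coarse_candidates(question, values, top_n=10):
    """Cheap stand-in for the BM25 index: rank stored values by token overlap."""
    q_tokens = set(question.lower().split())
    scored = [(len(q_tokens & set(v.lower().split())), v) for v in values]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [v for score, v in scored[:top_n] if score > 0]

def fine_match(question, candidates):
    """Pick the candidate sharing the longest common substring with the question."""
    def lcs_len(a, b):
        a, b = a.lower(), b.lower()
        m = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
        return m.size
    return max(candidates, key=lambda v: lcs_len(question, v), default=None)

values = ["Jesenik", "Praha", "Brno", "Jesenice"]
question = "How many clients opened their accounts in Jesenik branch were women?"
print(fine_match(question, coarse_candidates(question, values)))
# Jesenik
```

The coarse step keeps the expensive character-level comparison off the full set of stored values, which matters when a database column holds millions of distinct strings.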
To remove potential ambiguities, the following metadata on the database was included in the prompt:
This practice is very present in recent Text2SQL models. In CodeS, the representation of primary keys was specified as follows:
id
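As an illustration, such metadata might be serialized into the prompt along the following lines; the helper and the exact "primary key:" / "foreign key:" notation are assumptions for this sketch, not the literal CodeS format:

```python
# Hypothetical serialization of table metadata (columns, primary key,
# foreign keys) into a textual schema description for the prompt.

def serialize_table(name, columns, primary_key, foreign_keys=()):
    """columns: list of (column, type); foreign_keys: (col, ref_table, ref_col)."""
    cols = ", ".join(f"{c} {t}" for c, t in columns)
    lines = [f"table {name} ( {cols} )", f"primary key: {name}.{primary_key}"]
    for col, ref_table, ref_col in foreign_keys:
        lines.append(f"foreign key: {name}.{col} = {ref_table}.{ref_col}")
    return "\n".join(lines)

print(serialize_table(
    "client",
    [("client_id", "INTEGER"), ("gender", "TEXT"), ("district_id", "INTEGER")],
    primary_key="client_id",
    foreign_keys=[("district_id", "district", "district_id")],
))
# table client ( client_id INTEGER, gender TEXT, district_id INTEGER )
# primary key: client.client_id
# foreign key: client.district_id = district.district_id
```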
For each column C of a table T, representative values are collected with the query:
SELECT DISTINCT C FROM T WHERE C IS NOT NULL LIMIT K
By default, the authors have set
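This sampling query can be sketched against an in-memory SQLite database standing in for a real BIRD/Spider database; the helper name and the toy data are assumptions for illustration:

```python
# Sample up to k distinct non-NULL values per column, as in the query above.
import sqlite3

def sample_column_values(conn, table, column, k):
    """Run SELECT DISTINCT <column> FROM <table> WHERE <column> IS NOT NULL LIMIT k."""
    sql = f'SELECT DISTINCT "{column}" FROM "{table}" WHERE "{column}" IS NOT NULL LIMIT ?'
    return [row[0] for row in conn.execute(sql, (k,))]

# Toy database standing in for a real Text2SQL benchmark database:
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE district (district_id INTEGER PRIMARY KEY, a2 TEXT)")
conn.executemany(
    "INSERT INTO district (a2) VALUES (?)",
    [("Jesenik",), ("Praha",), ("Praha",), (None,), ("Brno",)],
)
print(sample_column_values(conn, "district", "a2", k=2))
```

Keeping K small bounds the prompt length while still showing the model the value format of each column (e.g. city names vs. numeric codes).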