Most frequently, the representation of the database schema may influence the model's performances. In the literature, the database representation is considered a sub-task of question representation. And thus, they are described there.
In this page, we will focus on some database related information that may be relevant for the LLM.
The primary key uniquely identifies each row in a table, while the foreign key creates associations between tables.
For simplicity, Primary Keys are atomic, which means that they are identified by only one column. The same also applies for foreign keys
Incorporating primary and foreign key information can guide the model to deduce the appropriate join path, ensuring accurate joins.
It is very recommended to include information about primary keys and foreign keys. It is almost a standard practice.
The data type of a column dictates its validation rules and permissible operations. For instance, numeric types like INTEGER and REAL support arithmetic operations, whereas
string types don’t.
If certain data is stored as a string type, the CAST function must be used in the SQL query before performing arithmetic operations on it.
In many cases, terms are encoded in a database. The encoding itself maybe trivial for a human, but that is not the case for a Text2SQL model.
Generally, the genders "male" and "female" are encoded as M
and respectfully F
in a database. Sometimes it is encoded as
Without retrieving some examples from the database, the LLM cannot infer the encoding apriori.
Real-world database usually contain ambiguous schemas. optional comments are incorporated as an additional input to both the Schema Linking model and the LLM