Database Representation

Introduction

Most frequently, the representation of the database schema may influence the model's performances. In the literature, the database representation is considered a sub-task of question representation. And thus, they are described there.

In this page, we will focus on some database related information that may be relevant for the LLM.

Primary Keys & Foreign Keys

The primary key uniquely identifies each row in a table, while the foreign key creates associations between tables.

Assumption

For simplicity, Primary Keys are atomic, which means that they are identified by only one column. The same also applies for foreign keys

Incorporating primary and foreign key information can guide the model to deduce the appropriate join path, ensuring accurate joins.

Tip

It is very recommended to include information about primary keys and foreign keys. It is almost a standard practice.

Column Data Types

The data type of a column dictates its validation rules and permissible operations. For instance, numeric types like INTEGER and REAL support arithmetic operations, whereas
string types don’t.

If certain data is stored as a string type, the CAST function must be used in the SQL query before performing arithmetic operations on it.

Representative Database Values

In many cases, terms are encoded in a database. The encoding itself maybe trivial for a human, but that is not the case for a Text2SQL model.

Example

Generally, the genders "male" and "female" are encoded as M and respectfully F in a database. Sometimes it is encoded as and respectfully.

Without retrieving some examples from the database, the LLM cannot infer the encoding apriori.

Comments

Real-world database usually contain ambiguous schemas. optional comments are incorporated as an additional input to both the Schema Linking model and the LLM