Test Suites

Quick-start

The test suite accuracy^[1] is defined as the fraction of predictions whose execution results match those of the ground truth on every database in a given set of databases.

Test suites were introduced as a refinement of Execution Matching in which each query is executed against multiple databases, rather than a single one, to reduce false positives^[2].

Formula

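Notations:

Let $\hat{Q} = (\hat{q}_1, \dots, \hat{q}_N)$ be the predicted queries, $Q = (q_1, \dots, q_N)$ the corresponding ground truth queries, and $D$ a set of databases.

The test suite accuracy between $\hat{Q}$ and $Q$ over $D$ is defined as:

$$\mathrm{TS}(\hat{Q}, Q; D) = \frac{1}{N} \sum_{i=1}^{N} \prod_{d \in D} \mathbb{1}\left[\mathrm{exec}(\hat{q}_i, d) = \mathrm{exec}(q_i, d)\right]$$

Where:

$\mathrm{exec}(q, d)$ is the result of executing query $q$ on database $d$, and $\mathbb{1}[\cdot]$ equals 1 when its condition holds and 0 otherwise.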
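As a minimal sketch of how the metric could be computed, assuming the databases are SQLite connections and that results are compared as multisets of rows (ignoring row order). The helper names `execute`, `matches`, `suite_accuracy` and `make_db`, and the `User` table contents, are hypothetical and chosen only for illustration:

```python
import sqlite3
from collections import Counter


def execute(query, conn):
    """Run a query and return its rows as a multiset (row order is ignored)."""
    try:
        return Counter(conn.execute(query).fetchall())
    except sqlite3.Error:
        return None  # assumption: a query that fails to run never matches


def matches(predicted, truth, conn):
    """True if both queries run and return the same rows on this database."""
    predicted_rows = execute(predicted, conn)
    return predicted_rows is not None and predicted_rows == execute(truth, conn)


def suite_accuracy(predictions, ground_truths, databases):
    """Fraction of predictions that match the ground truth on every database."""
    if not ground_truths:
        return 0.0
    correct = sum(
        all(matches(predicted, truth, conn) for conn in databases)
        for predicted, truth in zip(predictions, ground_truths)
    )
    return correct / len(ground_truths)


def make_db(rows):
    """Build an in-memory database with a single illustrative User table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE User (Name TEXT, Age INTEGER)")
    conn.executemany("INSERT INTO User VALUES (?, ?)", rows)
    return conn


databases = [
    make_db([("Ann", 30), ("Bob", 20)]),  # no user aged exactly 25
    make_db([("Cid", 25), ("Dee", 40)]),  # one user aged exactly 25
]
ground_truths = ["SELECT Name FROM User WHERE Age >= 25"] * 2
predictions = [
    "SELECT Name FROM User WHERE Age >= 25",  # identical to the ground truth
    "SELECT Name FROM User WHERE Age > 25",   # differs only on the second database
]
accuracy = suite_accuracy(predictions, ground_truths, databases)
print(accuracy)
```

The second prediction agrees with the ground truth on the first database and is rejected only because of the second one, which is the kind of false positive the metric is meant to remove.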

Randomisation

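Usually, it is not practical to have a set of databases available. Instead, for a ground truth query $q$, a random database $d_q$ is constructed such that it is able to distinguish $q$ from its neighbour queries^[3], that is, executing $q$ and any of its neighbours on $d_q$ yields different results.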

Example

Query $q_1$
```
SELECT Name FROM User WHERE Age >= 25
```

Query $q_2$
```
SELECT Name FROM User WHERE Age > 25
```
Both queries are considered to be neighbours, since their parse trees differ only in the comparison operator (`>=` versus `>`).
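To illustrate why the choice of database matters for these two neighbours, the following sketch runs both queries on two small SQLite tables. The rows and the helper `results_differ` are assumptions made for this example; only a table containing a user aged exactly 25 can tell the queries apart:

```python
import sqlite3

QUERY_GE = "SELECT Name FROM User WHERE Age >= 25"
QUERY_GT = "SELECT Name FROM User WHERE Age > 25"


def results_differ(rows):
    """Return True if the two queries give different results on the given User rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE User (Name TEXT, Age INTEGER)")
    conn.executemany("INSERT INTO User VALUES (?, ?)", rows)
    result_ge = sorted(conn.execute(QUERY_GE).fetchall())
    result_gt = sorted(conn.execute(QUERY_GT).fetchall())
    conn.close()
    return result_ge != result_gt


# No row sits on the boundary Age = 25, so the queries look equivalent.
without_boundary = results_differ([("Ann", 30), ("Bob", 20)])
# A row with Age = 25 is returned by >= but not by >, exposing the difference.
with_boundary = results_differ([("Ann", 30), ("Cid", 25)])
print(without_boundary, with_boundary)
```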

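With that, the (randomised) test suite accuracy is defined as:

$$\mathrm{TS}_{\mathrm{rand}}(\hat{Q}, Q) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[\mathrm{exec}(\hat{q}_i, d_{q_i}) = \mathrm{exec}(q_i, d_{q_i})\right]$$

Here $\hat{q}_i$ is the prediction for the ground truth query $q_i$, $d_{q_i}$ is the random database constructed for $q_i$, and $\mathrm{exec}(q, d)$ is the result of executing query $q$ on database $d$.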
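One way to picture the randomisation step is a simple rejection loop: draw random table contents until the ground truth query returns something different from each of its neighbours. This is only a sketch under stated assumptions, not the procedure of the referenced paper; the `User` schema, the age range, the list of neighbours and the helper names `run`, `random_rows` and `find_distinguishing_rows` are all hypothetical:

```python
import random
import sqlite3


def run(query, rows):
    """Execute a query on a fresh in-memory User table filled with the given rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE User (Name TEXT, Age INTEGER)")
    conn.executemany("INSERT INTO User VALUES (?, ?)", rows)
    result = sorted(conn.execute(query).fetchall())
    conn.close()
    return result


def random_rows(rng, size=5):
    """Draw a small random table; ages stay near the constant used in the query."""
    return [(f"user{i}", rng.randint(20, 30)) for i in range(size)]


def find_distinguishing_rows(truth, neighbours, rng, max_tries=1000):
    """Return rows on which every neighbour disagrees with the ground truth, or None."""
    for _ in range(max_tries):
        rows = random_rows(rng)
        expected = run(truth, rows)
        if all(run(neighbour, rows) != expected for neighbour in neighbours):
            return rows
    return None


truth = "SELECT Name FROM User WHERE Age >= 25"
neighbours = [
    "SELECT Name FROM User WHERE Age > 25",
    "SELECT Name FROM User WHERE Age <= 25",
]
rows = find_distinguishing_rows(truth, neighbours, random.Random(0))
print(rows)
```

Sampling ages close to the constant in the query makes boundary rows likely; drawing from a much wider range would make a distinguishing table far rarer, which is why the way random values are generated matters in practice.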

Advantages

Test Suites usually have far fewer false positives than Execution Accuracy. In fact, the former's false positives constitute a subset of the latter's.
If randomisation is applied properly, Test Suites are believed to be a very good approximation of Query Matching.

Shortcomings

This metric depends on the database system, since every query must be executed, and thus it may take a long time to run if the databases are large.
On large databases, and without randomisation, the gains introduced by Test Suites become less and less significant.

Notes & References

1. Ruiqi Zhong, Tao Yu, & Dan Klein. (2020). Semantic Evaluation for Text-to-SQL with Distilled Test Suites.
2. A case in which $\mathrm{exec}(\hat{q}, d) = \mathrm{exec}(q, d)$ but $\hat{q}$ and $q$ are not semantically equivalent.
3. Informally, a query $q'$ is said to be a neighbour of $q$ if their parse trees are close.