Test Suites

#evaluation #TS #matching

Quick-start

The test suite accuracy[1] is defined as the fraction of predictions that match the ground truth, as measured by executing both on a set of databases.

Test suites were introduced as a refinement of Execution Matching, in which a query is executed against multiple databases to reduce false positives[2].

Formula

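The test suite accuracy between a set of predicted queries $\hat{Q} = \{\hat{q}_1, \dots, \hat{q}_N\}$ and the corresponding ground truth queries $Q = \{q_1, \dots, q_N\}$ over a set of databases $\mathcal{D}$ is defined as:

$$\mathrm{TS}(\hat{Q}, Q; \mathcal{D}) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[\forall d \in \mathcal{D}:\ \hat{q}_i(d) = q_i(d)\right]$$

Where:

  • $N$ is the number of (prediction, ground truth) pairs.
  • $q(d)$ is the result of executing the query $q$ on the database $d$.
  • $\mathbb{1}[\cdot]$ is the indicator function, equal to 1 when its argument holds and 0 otherwise.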
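As an illustration of the definition, the following is a minimal Python sketch that counts a prediction as correct only if its result matches the ground truth on every database in the suite. It assumes SQLite in-memory databases and compares results as unordered multisets of rows, which is a simplification. The helpers `denotation`, `ts_accuracy` and `make_db`, and the `User(Name, Age)` schema, are hypothetical names chosen for this sketch, not part of any published evaluation script.

```python
import sqlite3
from collections import Counter


def denotation(db, query):
    """Run a query and return its rows as a multiset (row order is ignored)."""
    try:
        return Counter(db.execute(query).fetchall())
    except sqlite3.Error:
        return None  # a query that fails to execute never matches


def agrees(db, pred, ref):
    """True if the predicted and ground truth queries return the same rows on db."""
    expected = denotation(db, ref)
    return expected is not None and denotation(db, pred) == expected


def ts_accuracy(predicted, gold, databases):
    """Fraction of pairs whose results agree on every database of the suite."""
    correct = sum(
        all(agrees(db, pred, ref) for db in databases)
        for pred, ref in zip(predicted, gold)
    )
    return correct / len(gold)


def make_db(ages):
    """Build an in-memory database with a hypothetical User(Name, Age) table."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE User (Name TEXT, Age INTEGER)")
    db.executemany(
        "INSERT INTO User VALUES (?, ?)",
        [(f"user{i}", age) for i, age in enumerate(ages)],
    )
    return db


gold = ["SELECT Name FROM User WHERE Age >= 25"]
pred = ["SELECT Name FROM User WHERE Age > 25"]
suite = [make_db([30, 40]), make_db([25, 30])]

# One database misses the difference; adding the second one exposes it.
print(ts_accuracy(pred, gold, suite[:1]), ts_accuracy(pred, gold, suite))  # 1.0 0.0
```

With only the first database the wrong prediction is accepted, which is the kind of false positive that testing on several databases is meant to remove.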

Randomisation

Usually, it is not practical to have a set of databases $\mathcal{D}$ readily available. Instead, for a ground truth query $q$, a random database $d_q$ is constructed such that $d_q$ is able to distinguish the neighbour queries[3] of $q$ from $q$ itself.
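One naive way to realise such a construction is rejection sampling: draw random tables until the ground truth query and each of its neighbours return different results. The sketch below is written under that assumption and is not claimed to be the procedure of the cited work. The helper names, the `User(Name, Age)` schema, the age range and the retry budget are all hypothetical.

```python
import random
import sqlite3


def execute(rows, query):
    """Run a query on a fresh in-memory User(Name, Age) table and sort the rows."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE User (Name TEXT, Age INTEGER)")
    db.executemany("INSERT INTO User VALUES (?, ?)", rows)
    result = sorted(db.execute(query).fetchall())
    db.close()
    return result


def find_distinguishing_rows(gold, neighbours, n_rows=5, max_tries=1000, seed=0):
    """Sample random tables until the gold query differs from every neighbour."""
    rng = random.Random(seed)
    for _ in range(max_tries):
        rows = [(f"user{i}", rng.randint(20, 30)) for i in range(n_rows)]
        expected = execute(rows, gold)
        if all(execute(rows, q) != expected for q in neighbours):
            return rows
    return None  # no distinguishing table found within the budget


gold = "SELECT Name FROM User WHERE Age >= 25"
neighbours = ["SELECT Name FROM User WHERE Age > 25"]
rows = find_distinguishing_rows(gold, neighbours)
print(rows)
```

For this pair of queries, any table returned by the search must contain at least one row with Age equal to 25, since that is the only value on which the two conditions disagree.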

Example
  1. Query
    SELECT Name FROM User WHERE Age >= 25

  2. Query
    SELECT Name FROM User WHERE Age > 25

Both queries are considered neighbours, since their parse trees differ only in the comparison operator (>= versus >).
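To make the example concrete, the following Python sketch runs both queries on two small tables using SQLite. The table contents and the helper `run` are made up for illustration. Only the table that contains a row with Age equal to 25 separates the two queries.

```python
import sqlite3


def run(rows, query):
    """Execute a query on a fresh in-memory User(Name, Age) table built from rows."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE User (Name TEXT, Age INTEGER)")
    db.executemany("INSERT INTO User VALUES (?, ?)", rows)
    result = sorted(db.execute(query).fetchall())
    db.close()
    return result


q1 = "SELECT Name FROM User WHERE Age >= 25"
q2 = "SELECT Name FROM User WHERE Age > 25"

no_boundary = [("Ana", 30), ("Bo", 19)]                # nobody is exactly 25
with_boundary = [("Ana", 30), ("Bo", 19), ("Cy", 25)]  # Cy sits on the boundary

same = run(no_boundary, q1) == run(no_boundary, q2)
differ = run(with_boundary, q1) != run(with_boundary, q2)
print(same, differ)  # True True
```

A database like `no_boundary` would let the second query pass as a false positive, which is why the random database has to be chosen so that it tells neighbours apart.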

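With that, the (randomised) test suite accuracy is defined as:

$$\mathrm{TS}_{\mathrm{rand}}(\hat{Q}, Q) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[\hat{q}_i(d_{q_i}) = q_i(d_{q_i})\right]$$

where $\hat{q}_i$ is the prediction for the ground truth query $q_i$, $d_{q_i}$ is the random database constructed for $q_i$, and $q(d)$ is the result of executing the query $q$ on the database $d$.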

Advantages

  • Test Suites usually yield far fewer false positives than Execution Accuracy. In fact, the former's false positives constitute a subset of the latter's, provided the suite includes the database used by Execution Accuracy.
  • If randomisation is applied properly, Test Suites are believed to be a very good approximation of Query Matching.

Short-comings

  • This metric depends on the database system, so evaluation may take a long time when the databases are large.
  • On large databases, and without randomisation, the gains introduced by Test Suites will be less and less significant.

Notes & References


  1. Ruiqi Zhong, Tao Yu, & Dan Klein. (2020). Semantic Evaluation for Text-to-SQL with Distilled Test Suites.
  2. A case in which $\hat{q}(d) = q(d)$ on the tested database $d$, but the prediction $\hat{q}$ and the ground truth $q$ are not semantically equivalent.
  3. Informally, a query $q'$ is said to be a neighbour of $q$ if their parse trees are close.