TransWikia.com

Correlated subquery with common table expressions in SQLite

Database Administrators Asked by nalzok on December 7, 2020

Yesterday I achieved a 500x performance boost after rewriting an EXISTS with IN. I did some investigation to find out why and finally came to the following example. Basically, we draw 5 lucky numbers out of 0 – 999, and count how many lucky numbers are drawn.

CREATE TABLE IF NOT EXISTS natural_numbers AS
WITH RECURSIVE natural_numbers(x) AS (
    SELECT 0
    UNION
    SELECT x + 1
    FROM natural_numbers
    LIMIT 1000
)
SELECT x
FROM natural_numbers;


.eqp   ON
.timer ON

WITH lucky_numbers(lucky_number) AS (
    SELECT x
    FROM natural_numbers
    ORDER BY RANDOM()
    LIMIT 5
)
SELECT SUM(EXISTS(
        SELECT 1
        FROM lucky_numbers
        WHERE x = lucky_number
))
FROM natural_numbers;


WITH lucky_numbers(lucky_number) AS (
    SELECT x
    FROM natural_numbers
    ORDER BY RANDOM()
    LIMIT 5
)
SELECT SUM(x IN lucky_numbers)
FROM natural_numbers;

The output is

QUERY PLAN
|--SCAN TABLE natural_numbers
`--CORRELATED SCALAR SUBQUERY 2
   |--CO-ROUTINE 1
   |  |--SCAN TABLE natural_numbers
   |  `--USE TEMP B-TREE FOR ORDER BY
   `--SCAN SUBQUERY 1
4
Run Time: real 0.095 user 0.094819 sys 0.000000
QUERY PLAN
|--SCAN TABLE natural_numbers
`--LIST SUBQUERY 2
   |--SCAN TABLE natural_numbers
   `--USE TEMP B-TREE FOR ORDER BY
5
Run Time: real 0.001 user 0.000289 sys 0.000007

It seems that EXISTS re-executes the common table expression lucky_numbers for each x (and is consequently much slower), whereas IN only executes it once for the whole table (the result is always 5, no more, no less). This is kind of surprising because common table expressions are supposed to be a view instead of a temporary table, so I would expect the behavior of EXISTS.


To complicate things further, if you replace LIMIT 5 with LIMIT X where X >= 8 in the EXISTS clause, its execution plan would become

QUERY PLAN
|--SCAN TABLE natural_numbers
`--CORRELATED SCALAR SUBQUERY 2
   |--CO-ROUTINE 1
   |  |--SCAN TABLE natural_numbers
   |  `--USE TEMP B-TREE FOR ORDER BY
   `--SEARCH SUBQUERY 1 USING AUTOMATIC COVERING INDEX (lucky_number=?)

and the result will always be X instead of a random variable for X <= 7. I suspect this is because the common table expression lucky_numbers is no longer re-run for each row, but instead cached in the AUTOMATIC COVERING INDEX. In other words, the query planner is changing the result of our query here!


So, when are common table expressions run for each row / the whole table? Looks like with IN it’s guaranteed to run only once for the whole table (which is somehow inconsistent with the document’s words that an ordinary common table expression is essentially a pre-packaged SELECT statement), but with EXISTS things are more complicated and the decision is left to the query planner.

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP