PostgreSQL Antipatterns: chained EXISTS

I have already talked about the specifics of evaluating conditions in SQL in general and in PostgreSQL, in particular. Today we will continue the topic and try to write and optimize a simple query - which of the employees has “super-priority” tasks to complete.



CREATE TABLE task AS
SELECT
  id
, (random() * 100)::integer person --  100 
, least(trunc(-ln(random()) / ln(2)), 10)::integer priority --     2   
FROM
  generate_series(1, 1e5) id; -- 100K 
CREATE INDEX ON task(person, priority);

      
      





The word "is" in SQL turns into EXISTS



- here's the simplest version and let's start:



SELECT
  *
FROM
  generate_series(0, 99) pid
WHERE
  EXISTS(
    SELECT
      NULL
    FROM
      task
    WHERE
      person = pid AND
      priority = 10
  );

      
      







all plan pictures are clickable



So far, everything looks good, but ...



EXISTS + IN



... then they came to us and asked to include not only priority = 10



8 and 9 as "super" :



SELECT
  *
FROM
  generate_series(0, 99) pid
WHERE
  EXISTS(
    SELECT
      NULL
    FROM
      task
    WHERE
      person = pid AND
      priority IN (10, 9, 8)
  );
      
      









They read 1.5 times more, and it also affected the execution time.



OR + EXISTS



Let's try to use our knowledge that it is priority = 8



much more likely to meet a record with than 10:



SELECT
  *
FROM
  generate_series(0, 99) pid
WHERE
  EXISTS(
    SELECT
      NULL
    FROM
      task
    WHERE
      person = pid AND
      priority = 8
  ) OR
  EXISTS(
    SELECT
      NULL
    FROM
      task
    WHERE
      person = pid AND
      priority = 9
  ) OR
  EXISTS(
    SELECT
      NULL
    FROM
      task
    WHERE
      person = pid AND
      priority = 10
  );

      
      









Note that PostgreSQL 12 is already smart enough to make subsequent EXISTS



-subqueries only for those "not found" by the previous ones after 100 searches for value 8 - only 13 for value 9, and only 4 for 10.



CASE + EXISTS + ...



On previous versions, a similar result can be achieved by "hiding under CASE" the following queries:



SELECT
  *
FROM
  generate_series(0, 99) pid
WHERE
  CASE
    WHEN
      EXISTS(
        SELECT
          NULL
        FROM
          task
        WHERE
          person = pid AND
          priority = 8
      ) THEN TRUE
    ELSE
      CASE
        WHEN
          EXISTS(
            SELECT
              NULL
            FROM
              task
            WHERE
              person = pid AND
              priority = 9
          ) THEN TRUE
        ELSE
          EXISTS(
            SELECT
              NULL
            FROM
              task
            WHERE
              person = pid AND
              priority = 10
          )
      END
  END;

      
      





EXISTS + UNION ALL + LIMIT



The same, but you can get a little faster if you use the "hack" UNION ALL + LIMIT



:



SELECT
  *
FROM
  generate_series(0, 99) pid
WHERE
  EXISTS(
    (
      SELECT
        NULL
      FROM
        task
      WHERE
        person = pid AND
        priority = 8
      LIMIT 1
    )
    UNION ALL
    (
      SELECT
        NULL
      FROM
        task
      WHERE
        person = pid AND
        priority = 9
      LIMIT 1
    )
    UNION ALL
    (
      SELECT
        NULL
      FROM
        task
      WHERE
        person = pid AND
        priority = 10
      LIMIT 1
    )
    LIMIT 1
  );

      
      









Correct indexes are the key to the health of the database



Now let's look at the problem from a completely different side. If we know for sure that the task



number of records that we want to find are several times less than the rest , then we will make a suitable partial index. At the same time, let's go straight from the "dot" enumeration 8, 9, 10



to >= 8



:



CREATE INDEX ON task(person) WHERE priority >= 8;
      
      





SELECT
  *
FROM
  generate_series(0, 99) pid
WHERE
  EXISTS(
    SELECT
      NULL
    FROM
      task
    WHERE
      person = pid AND
      priority >= 8
  );
      
      









I had to read 2 times faster and 1.5 times less!



But, probably, to subtract all suitable ones at task



once - will it be even faster? ..



SELECT DISTINCT
  person
FROM
  task
WHERE
  priority >= 8;
      
      









Far from always, and certainly not in this case - because instead of 100 reads of the first available records, we have to read more than 400!


And in order not to guess which of the query options will be more effective, and to know it with confidence - use explain.tensor.ru .



All Articles