PostgreSQL Antipatterns: "There should be only one left!"

In SQL, you describe “what” you want to get, not “how” it should be done. Therefore, the problem of developing SQL queries in the “as you hear it as you write” style takes its place of honor, along with the peculiarities of calculating conditions in SQL .



Today, using extremely simple examples, let's see what this can lead to in the context of use GROUP/DISTINCTand LIMITtogether with them.



Now, if you wrote in the request "first connect these plates, and then discard all duplicates, there should be only one copy for each key" - this is how it will work, even if the connection was not needed at all.



And sometimes you are lucky and it "just works", sometimes it has an unpleasant effect on performance, and sometimes it gives absolutely unexpected effects from the developer's point of view.





Well, maybe not so spectacular, but ...



"Sweet couple": JOIN + DISTINCT



SELECT DISTINCT
  X.*
FROM
  X
JOIN
  Y
    ON Y.fk = X.pk
WHERE
  Y.bool_condition;


How would it be clear that we wanted to select such records X, for which Y has those associated with a fulfilled condition . We wrote a request through JOIN- we got some pk values ​​several times (exactly how many matching records in Y turned out to be). How to remove? Of course DISTINCT!



It is especially "happy" when for each X-record there are several hundred linked Y-records, and then duplicates are heroically removed ...







How to fix it? To begin with, realize that the task can be modified to “select such records X for which Y has AT LEAST ONE associated with a running condition” - after all, we do not need anything from the Y record itself.



Nested EXISTS



SELECT
  *
FROM
  X
WHERE
  EXISTS(
    SELECT
      NULL
    FROM
      Y
    WHERE
      fk = X.pk AND
      bool_condition
    LIMIT 1
  );


Some PostgreSQL versions understand that it is enough to find the first available record in EXISTS, older ones do not. Therefore, I prefer to always specify LIMIT 1inside EXISTS.



LATERAL JOIN



SELECT
  X.*
FROM
  X
, LATERAL (
    SELECT
      Y.*
    FROM
      Y
    WHERE
      fk = X.pk AND
      bool_condition
    LIMIT 1
  ) Y
WHERE
  Y IS DISTINCT FROM NULL;


The same option allows, if necessary, at the same time to immediately return some data from the found linked Y-record. A similar option is discussed in the article "PostgreSQL Antipatterns: a rare entry will fly to the middle of a JOIN" .


"Why pay more": DISTINCT [ON] + LIMIT 1



An additional advantage of such query transformations is the ability to easily restrict the iteration over records if only one / several of them are needed, as in the following case:



SELECT DISTINCT ON(X.pk)
  *
FROM
  X
JOIN
  Y
    ON Y.fk = X.pk
LIMIT 1;


Now we read the request and try to understand what the DBMS is proposing to do:



  • we connect the plates
  • uniqueizable by X.pk
  • select one from the remaining records


That is, you got what? “Some one record” from the unique ones - and if we take this one of the non-uniqueized ones, will the result change somehow? .. “And if there is no difference, why pay more?”



SELECT
  *
FROM
  (
    SELECT
      *
    FROM
      X
    --     
    LIMIT 1 -- +1 Limit
  ) X
JOIN
  Y
    ON Y.fk = X.pk
LIMIT 1;


And exactly the same topic with GROUP BY + LIMIT 1.



"I just ask": implicit GROUP + LIMIT



Similar things are encountered during various checks for the non-emptyness of the plate or CTE during the execution of the request:



...
CASE
  WHEN (
    SELECT
      count(*)
    FROM
      X
    LIMIT 1
  ) = 0 THEN ...


Aggregate functions ( count/min/max/sum/...) execute successfully on the entire set, even without explicit indication GROUP BY. Only LIMITthey are not very friendly with them.



The developer may think “if there are records there, then I need no more LIMIT” . But don't! Because for the base it is:



  • count what you want on all records
  • give as many lines as you ask


Depending on the target conditions, it is appropriate to make one of the substitutions here:



  • (count + LIMIT 1) = 0 on NOT EXISTS(LIMIT 1)
  • (count + LIMIT 1) > 0 on EXISTS(LIMIT 1)
  • count >= N on (SELECT count(*) FROM (... LIMIT N))


"How much to hang in grams": DISTINCT + LIMIT



SELECT DISTINCT
  pk
FROM
  X
LIMIT $1


A naive developer can honestly think that the query will stop as soon as we find the first $ 1 different values ​​that come across .



Sometime in the future, this may and will work thanks to the new Index Skip Scan node , the implementation of which is currently being worked out, but not yet.



So far, at first, all-all records will be retrieved , uniqueized, and only from them will be returned how many requested. It is especially sad if we wanted something like $ 1 = 4 , and there are hundreds of thousands of records in the table ...



In order not to be sad in vain, we will use the recursive query "DISTINCT for the poor" from the PostgreSQL Wiki :






All Articles