Today, using extremely simple examples, let's see what this can lead to in the context of use
GROUP/DISTINCT
and LIMIT
together with them.
Now, if you wrote in the request "first connect these plates, and then discard all duplicates, there should be only one copy for each key" - this is how it will work, even if the connection was not needed at all.
And sometimes you are lucky and it "just works", sometimes it has an unpleasant effect on performance, and sometimes it gives absolutely unexpected effects from the developer's point of view.
Well, maybe not so spectacular, but ...
"Sweet couple": JOIN + DISTINCT
SELECT DISTINCT
X.*
FROM
X
JOIN
Y
ON Y.fk = X.pk
WHERE
Y.bool_condition;
How would it be clear that we wanted to select such records X, for which Y has those associated with a fulfilled condition . We wrote a request through
JOIN
- we got some pk values several times (exactly how many matching records in Y turned out to be). How to remove? Of course DISTINCT
!
It is especially "happy" when for each X-record there are several hundred linked Y-records, and then duplicates are heroically removed ...
How to fix it? To begin with, realize that the task can be modified to “select such records X for which Y has AT LEAST ONE associated with a running condition” - after all, we do not need anything from the Y record itself.
Nested EXISTS
SELECT
*
FROM
X
WHERE
EXISTS(
SELECT
NULL
FROM
Y
WHERE
fk = X.pk AND
bool_condition
LIMIT 1
);
Some PostgreSQL versions understand that it is enough to find the first available record in EXISTS, older ones do not. Therefore, I prefer to always specify
LIMIT 1
inside EXISTS
.
LATERAL JOIN
SELECT
X.*
FROM
X
, LATERAL (
SELECT
Y.*
FROM
Y
WHERE
fk = X.pk AND
bool_condition
LIMIT 1
) Y
WHERE
Y IS DISTINCT FROM NULL;
The same option allows, if necessary, at the same time to immediately return some data from the found linked Y-record. A similar option is discussed in the article "PostgreSQL Antipatterns: a rare entry will fly to the middle of a JOIN" .
"Why pay more": DISTINCT [ON] + LIMIT 1
An additional advantage of such query transformations is the ability to easily restrict the iteration over records if only one / several of them are needed, as in the following case:
SELECT DISTINCT ON(X.pk)
*
FROM
X
JOIN
Y
ON Y.fk = X.pk
LIMIT 1;
Now we read the request and try to understand what the DBMS is proposing to do:
- we connect the plates
- uniqueizable by X.pk
- select one from the remaining records
That is, you got what? “Some one record” from the unique ones - and if we take this one of the non-uniqueized ones, will the result change somehow? .. “And if there is no difference, why pay more?”
SELECT
*
FROM
(
SELECT
*
FROM
X
--
LIMIT 1 -- +1 Limit
) X
JOIN
Y
ON Y.fk = X.pk
LIMIT 1;
And exactly the same topic with
GROUP BY + LIMIT 1
.
"I just ask": implicit GROUP + LIMIT
Similar things are encountered during various checks for the non-emptyness of the plate or CTE during the execution of the request:
...
CASE
WHEN (
SELECT
count(*)
FROM
X
LIMIT 1
) = 0 THEN ...
Aggregate functions (
count/min/max/sum/...
) execute successfully on the entire set, even without explicit indication GROUP BY
. Only LIMIT
they are not very friendly with them.
The developer may think “if there are records there, then I need no more LIMIT” . But don't! Because for the base it is:
- count what you want on all records
- give as many lines as you ask
Depending on the target conditions, it is appropriate to make one of the substitutions here:
(count + LIMIT 1) = 0
onNOT EXISTS(LIMIT 1)
(count + LIMIT 1) > 0
onEXISTS(LIMIT 1)
count >= N
on(SELECT count(*) FROM (... LIMIT N))
"How much to hang in grams": DISTINCT + LIMIT
SELECT DISTINCT
pk
FROM
X
LIMIT $1
A naive developer can honestly think that the query will stop as soon as we find the first $ 1 different values that come across .
Sometime in the future, this may and will work thanks to the new Index Skip Scan node , the implementation of which is currently being worked out, but not yet.
So far, at first, all-all records will be retrieved , uniqueized, and only from them will be returned how many requested. It is especially sad if we wanted something like $ 1 = 4 , and there are hundreds of thousands of records in the table ...
In order not to be sad in vain, we will use the recursive query "DISTINCT for the poor" from the PostgreSQL Wiki :