How does a relational DBMS do a JOIN?

The original is here





What is this article about and to whom is it addressed?

Almost everyone works with SQL, but even experienced developers sometimes cannot answer a simple question. How does the DBMS execute the most common INNER JOIN?





On the other hand, developers in C # or other OOP languages ​​often perceive a DBMS as just a repository. And posting some business rules in SQL is bad. In contrast to them, libraries like Linq2Db are created   (not to be confused with  Linq2Sql  - completely different authors and different libraries). When you use it, all your code is written in C # and you get all the benefits of a typed language. But this is a formality. Then this code is translated into SQL and executed on the DBMS side.





In order to better understand how the same code works in SQL and in C #, we will try to implement the same code on the first and second, and then we will analyze how it works. If you know well what  Nested LoopMerge JoinHash Join are  - it probably makes sense for you to read the article diagonally. But if you don't know, the article should be useful for you.





Working with multiple collections

Suppose we have some kind of service center for the maintenance of cars - a service station (SRT). There are two entities:  Person  - customers of the service center and  Visit  - a specific visit to this center.  Person,  in addition to the identifier, contains the first name, last name and activity status (for example, if a client changed the car for another brand, he is transferred to the inactive status and will not visit us in the near future).  Visit,  in addition to the identifier, contains a link to the client, the date of the visit and the amount that the client paid for this visit. All of the above could be styled using the following C # classes for the simplest case:





internal sealed class Person
{
    internal int Id { get; set; }
    internal string FirstName { get; set; }
    internal string LastName { get; set; }
    internal bool IsActive { get; set; }
}

internal sealed class Visit
{
    internal int Id { get; set; }
    internal int PersonId { get; set; }
    internal DateTime Date { get; set; }
    internal decimal Spent { get; set; }
}

// ...
internal Person[] persons = new Person[];
internal Visit[] visits = new Visit[];
// ...	
      
      



(  PostgreSQL) :





create table public.visit
(
    id integer,
    person_id integer,
    visit_datetime timestamp without time zone,
    spent money
) tablespace pg_default;

create table public.person
(
    id integer,
    first_name character varying(100) COLLATE pg_catalog."default",
    last_name character varying(100) COLLATE pg_catalog."default",
    is_active boolean
) tablespace pg_default;
      
      



 . - - .





- -  2020 , , . ?





Nested Loop

, . . - .





public decimal NestedLoop()
{
    decimal result = 0;
    var upperLimit = new DateTime(2020, 12, 31);

    foreach (var person in persons)
    {
        if (person.IsActive == false)
        {
            continue;
        }
        
        foreach (var visit in visits)
        {
            if (person.Id == visit.PersonId && visit.Date <= upperLimit)
            {
                result += visit.Spent;
            }
        }
    }
    return result;
}
      
      



:





, .  O(N²), - , .





, SQL :





select setseed(0.777);

delete from public.person;
insert into public.person(id, first_name, last_name, is_active)
select row_number() over () as id,
substr(md5(random()::text), 1, 10) as first_name,
substr(md5(random()::text), 1, 10) as last_name,
((row_number() over ()) % 5 = 0) as is_active
from generate_series(1, 5000);/*<-- 5000   */

delete from public.visit;
insert into public.visit(id, person_id, visit_datetime, spent)
select row_number() over () as id,
(random()*5000)::integer as person_id, /*<-- 5000   */
DATE '2020-01-01' + (random() * 500)::integer as visit_datetime,
(random()*10000)::integer as spent
from generate_series(1, 10000); /* 10000 -      */
      
      



CTO P  5000,  V - 10000. , . . , . - .  (P,V)   (5000, 10000). : C# (Nested Loop) . .  20.040 , .  20.27 .  40 . SQL .





select sum(v.spent) from public.visit v
                    join public.person p on p.id = v.person_id
where v.visit_datetime <= '2020-12-31' and p.is_active = True
      
      



 2.1   . . .. 10 , .





Merge Join

20 . Nested Loop - . …  Merge Join  Sort-Merge Join. , . . - . , - . , , , .  O(N*log(N)).





1.4   C#. .  20 . , , . ? ! Hash Join  .





Hash Join

. . , , Join. :





Hash Join ( . )





 O(N). .NET Linq . (Grace hash joinHybrid hash join) - . C# ,  0.9 .





! , . . . Nested Loop - , Merge Join - ( ). Hash Join - .





- . -. (P, V) (50, 100). : Nested Loop  - 2.202 , Merge Join - 4.715 , Hash Join - 7.638 . :





C# :





Method





Nested Loop





Merge Join





Hash Join





(10, 10)





62.89 ns





293.22 ns





1092.98 ns





(50, 100)





2.168 us





4.818 us





7.342 us





(100, 200)





8.767 us





10.909 us





16.911 us





(200, 500)





38.77 us





32.75 us





40.75 us





(300, 700)





81.36 us





52.54 us





54.29 us





(500, 1000)





189.58 us





87.10 us





82.85 us





(800, 2000)





606.8 us





173.4 us





172.7 us





(750, 5000)





1410.6 us





428.2 us





397.9 us





X1 X2 ? . . , 2020 ? C#. , , . . , . , . , Array List? , .. API. , - . … . . LinkedList? . .





Method





Nested Loop





Nested Loop with Linked List





(10, 10)





62.89 ns





262.97 ns





(50, 100)





2.188 us





8.160 us





(100, 200)





8.196 us





32.738 us





(200, 500)





39.24 us





150.92 us





(300, 700)





80.99 us





312.71 us





(500, 1000)





196.3 us





805.20 us





(800, 2000)





599.3 us





2359.1 us





(750, 5000)





1485.0 us





5750.0 us





. :





, Nested Loop. Indexed Nested Loop Hash Join. ( ).





, . , . - .. - . , . PostgreSQL (page), (tuples). . , .





 





, :





 .





,  buffer pool  buffer manager. .  Nested LoopMerge Join  Hash Join. , . (Query Plan).





C#. (P, V) (50000, 100000). C#  145.13 .  Nested Loop  - 305.38 Hash Join - 36.59 . :





set enable_hashjoin to 'off';--   Nested Loop
set enable_mergejoin to 'off';
set enable_material to 'off';

select sum(v.spent) from public.visit v
join public.person p on p.id = v.person_id
where v.visit_datetime <= '2020-12-31' and p.is_active = True
      
      



 Nested Loop   11247.022 . :





,  Nested Loop. :





set enable_hashjoin to 'on';
set enable_mergejoin to 'on';
set enable_material to 'on';

select sum(v.spent) from public.visit v
join public.person p on p.id = v.person_id
where v.visit_datetime <= '2020-12-31' and p.is_active = True
      
      



- ,  Hash Join:





,  25.806 , C# .





, , .   . , , ..





JOIN. C# SQL , .    ( , ). , .





, - . , . C# … , .. . Linq Join  Hash Join, , . .. .





- , , - .








All Articles