🐯 🚤 🚣🏻 Using window functions and CTEs in MySQL 8.0 to implement a cumulative total without hacks 🙏🏻 🤲🏾 ♑️

Approx. transl. : In this article, the team leader of the UK company Ticketsolve shares a solution to his very specific problem, while demonstrating general approaches to creating so-called accumulating functions using modern MySQL 8.0 features. Its listings are clear and provided with detailed explanations, which helps to understand the essence of the problem, even for those who did not dive into it so deeply.

A common strategy for performing updates using cumulative functions in MySQL is using custom variables and patternUPDATE [...] SET mycol = (@myvar := EXPRESSION(@myvar, mycol)).

This pattern does not work well with the optimizer (leading to non-deterministic behavior), so they decided to abandon it. The result is a kind of emptiness because (relatively) complex logic is now harder to implement (at least with the same simplicity).

The article will discuss two ways to implement it: using window functions (canonical approach) and using recursive CTEs (general table expressions).

Requirements and background

Although CTEs are fairly intuitive, for those who are not very familiar with them, I recommend referring to my previous post on this topic .

The same is true for window functions: I will comment on queries / concepts in detail, but a general idea still doesn't hurt. There are a huge number of books and publications devoted to window functions (which is why I still have not written about them); however, in most examples, calculations are performed either on financial results or on demographic indicators. However, in this article I will use a real case.

On the software side, I recommend using MySQL 8.0.19 (but not required). All expressions must be executed in the same console to be reused @venue_id.

In the software world, there is a famous architectural dilemma: should logic be implemented at the application level or at the database level? While this is a perfectly valid question, in our case I am assuming that the logic should remain at the base level; the reason for this may be, for example, speed requirements (as was the case in our case).

Task

In this task, we allocate seats in a certain hall (theater).

For business purposes, each location needs to be assigned a so-called "grouping" - an additional number that represents it.

Here is the algorithm for determining the grouping value:

start at 0 and top left;
if there is an empty space between the current and the previous one, or this is a new row, then we add 2 to the previous value (if this is not the absolute first place), otherwise, we increase the value by 1;
assign a grouping to a place;
go to a new place in the same row or to the next row (if the previous one is over) and repeat from point 2; we continue everything until the places run out.

Algorithm in pseudocode:

current_grouping = 0

for each row:
  for each number:
    if (is_there_a_space_after_last_seat or is_a_new_row) and is_not_the_first_seat:
      current_grouping += 2
    else
      current_grouping += 1

    seat.grouping = current_grouping

In real life, we want the configuration on the left to give the values shown on the right:

 x→  0   1   2        0   1   2
y   ╭───┬───┬───╮    ╭───┬───┬───╮
↓ 0 │ x │ x │   │    │ 1 │ 2 │   │
    ├───┼───┼───┤    ├───┼───┼───┤
  1 │ x │   │ x │    │ 4 │   │ 6 │
    ├───┼───┼───┤    ├───┼───┼───┤
  2 │ x │   │   │    │ 8 │   │   │
    ╰───┴───┴───╯    ╰───┴───┴───╯

Training

Let the base table have the following minimalist structure:

CREATE TABLE seats (
  id         INT AUTO_INCREMENT PRIMARY KEY,
  venue_id   INT,
  y          INT,
  x          INT,
  `row`      VARCHAR(16),
  number     INT,
  `grouping` INT,
  UNIQUE venue_id_y_x (venue_id, y, x)
);

We don't really need the rowand columns number. On the other hand, we don't want to use a table whose records are completely contained in the index (just to be closer to real problems).

Based on the diagram above, the coordinates of each location are (y, x):

(0, 0), (0, 1)
(1, 0), (1, 2)
(20)

Note that we are using y as the first coordinate as it makes it easier to keep track of the rows.

You must load a large enough number of records to prevent the optimizer from finding unexpected short paths. Of course, we use recursive CTEs:

INSERT INTO seats(venue_id, y, x, `row`, number)
WITH RECURSIVE venue_ids (id) AS
(
  SELECT 0
  UNION ALL
  SELECT id + 1 FROM venue_ids WHERE id + 1 < 100000
)
SELECT /*+ SET_VAR(cte_max_recursion_depth = 1M) */
  v.id,
  c.y, c.x,
  CHAR(ORD('A') + FLOOR(RAND() * 3) USING ASCII) `row`,
  FLOOR(RAND() * 3) `number`
FROM venue_ids v
     JOIN (
       VALUES
         ROW(0, 0),
         ROW(0, 1),
         ROW(1, 0),
         ROW(1, 2),
         ROW(2, 0)
     ) c (y, x)
;

ANALYZE TABLE seats;

A few notes:

Here, CTE is being used in an interesting (hopefully!) Way: each loop represents a venue_id, but since we want multiple locations to be generated for each venue, we do a cross join with the table containing the locations.
Used the row constructor from v8.0.19 ( VALUES ROW()...) to represent a ( joinable ) table without actually creating it.
Generates random values for row and number as placeholders.
For the sake of simplicity, we did not do any optimizations (for example, data types are wider than necessary; indexes are added before inserting records, etc.).

Old approach

The good old approach is quite straightforward and straightforward:

SET @venue_id = 5000; --  venue id;  () id 

SET @grouping = -1;
SET @y = -1;
SET @x = -1;

WITH seat_groupings (id, y, x, `grouping`, tmp_y, tmp_x) AS
(
  SELECT
    id, y, x,
    @grouping := @grouping + 1 + (seats.x > @x + 1 OR seats.y != @y),
    @y := seats.y,
    @x := seats.x
  FROM seats
  WHERE venue_id = @venue_id
  ORDER BY y, x
)
UPDATE
  seats s
  JOIN seat_groupings sg USING (id)
SET s.grouping = sg.grouping
;

-- Query OK, 5 rows affected, 3 warnings (0,00 sec)

Well that was easy (but don't forget the warnings)!

A small digression: in this case, I use the properties of Boolean arithmetic. The following expressions are equivalent:

SELECT seats.x > @x + 1 OR seats.y != @y `increment`;

SELECT IF (
  seats.x > @x + 1 OR seats.y != @y,
  1,
  0
) `increment`;

Some find this intuitive, others not; it's a matter of taste. From now on I will use a more compact expression.

Let's see the result:

SELECT id, y, x, `grouping` FROM seats WHERE venue_id = @venue_id ORDER BY y, x;

-- +-------+------+------+----------+
-- | id    | y    | x    | grouping |
-- +-------+------+------+----------+
-- | 24887 |    0 |    0 |        1 |
-- | 27186 |    0 |    1 |        2 |
-- | 29485 |    1 |    0 |        4 |
-- | 31784 |    1 |    2 |        6 |
-- | 34083 |    2 |    0 |        8 |
-- +-------+------+------+----------+

Great approach!

Alas, it has a "minor" drawback: it works great except when it doesn't work ...

The point is that the query optimizer does not necessarily perform calculations from left to right, so assignments (: =) may be performed in the wrong order leading to the wrong result. People often face this problem after updating MySQL.

In MySQL 8.0, this functionality is indeed deprecated:

--    UPDATE.
--
SHOW WARNINGS\G
-- *************************** 1. row ***************************
--   Level: Warning
--    Code: 1287
-- Message: Setting user variables within expressions is deprecated and will be removed in a future release. Consider alternatives: 'SET variable=expression, ...', or 'SELECT expression(s) INTO variables(s)'.
-- [...]

Well, let's fix the situation!

Modern Approach # 1: Windowing Functions

The introduction of window functions has been a highly anticipated event in the MySQL world.

Generally speaking, the "sliding" nature of window functions works well with cumulative functions. However, some complex cumulative functions require the results of the last expression — functionality that window functions do not support because they operate on columns.

This does not mean that the problem cannot be solved; it just needs to be rethought.

In our case, the task can be divided into two parts. The grouping for each location can be considered as the sum of two values:

the serial number of each place,
the cumulative value of the increments of all places preceding this one.

Those familiar with windowing functions will recognize the typical patterns here.

The sequence number of each seat is a built-in function:

ROW_NUMBER() OVER <window>

But with the cumulative value, everything is much more interesting ... To calculate it, we perform two actions:

count the increment for each place and write it down to the table (or CTE),
then, for each location, we sum up the increments for that location using the window function.

Let's take a look at SQL:

WITH
increments (id, increment) AS
(
  SELECT
    id,
    x > LAG(x, 1, x - 1) OVER tzw + 1 OR y != LAG(y, 1, y) OVER tzw
  FROM seats
  WHERE venue_id = @venue_id
  WINDOW tzw AS (ORDER BY y, x)
)
SELECT
  s.id, y, x,
  ROW_NUMBER() OVER tzw + SUM(increment) OVER tzw `grouping`
FROM seats s
     JOIN increments i USING (id)
WINDOW tzw AS (ORDER BY y, x)
;

-- +-------+---+---+----------+
-- | id    | y | x | grouping |
-- +-------+---+---+----------+
-- | 24887 | 0 | 0 |        1 |
-- | 27186 | 0 | 1 |        2 |
-- | 29485 | 1 | 0 |        4 |
-- | 31784 | 1 | 2 |        6 |
-- | 34083 | 2 | 1 |        8 |
-- +-------+---+---+----------+

Great!

(Note that I am omitting the UPDATE from now on for the sake of simplicity.)

Let's analyze the request.

High-level logic

The following CTE (edited) :

SELECT
  id,
  x > LAG(x, 1, x - 1) OVER tzw + 1 OR y != LAG(y, 1, y) OVER tzw `increment`
FROM seats
WHERE venue_id = @venue_id
WINDOW tzw AS (ORDER BY y, x)
;

-- +-------+-----------+
-- | id    | increment |
-- +-------+-----------+
-- | 24887 |         0 |
-- | 27186 |         0 |
-- | 29485 |         1 |
-- | 31784 |         1 |
-- | 34083 |         1 |
-- +-------+-----------+

… Calculates the increments for each location from the previous one (more on LAG()later). It works on every record and the one that precedes it, and is not cumulative.

Now, to calculate the cumulative increments, we'll simply use a window function to calculate the sum up to and including each location:

-- (CTE here...)
SELECT
  s.id, y, x,
  ROW_NUMBER() OVER tzw `pos.`,
  SUM(increment) OVER tzw `cum.incr.`
FROM seats s
     JOIN increments i USING (id)
WINDOW tzw AS (ORDER BY y, x);

-- +-------+---+---+------+-----------+
-- | id    | y | x | pos. | cum.incr. | (grouping)
-- +-------+---+---+------+-----------+
-- | 24887 | 0 | 0 |    1 |         0 | = 1 + 0 (curr.)
-- | 27186 | 0 | 1 |    2 |         0 | = 2 + 0 (#24887) + 0 (curr.)
-- | 29485 | 1 | 0 |    3 |         1 | = 3 + 0 (#24887) + 0 (#27186) + 1 (curr.)
-- | 31784 | 1 | 2 |    4 |         2 | = 4 + 0 (#24887) + 0 (#27186) + 1 (#29485) + 1 (curr.)
-- | 34083 | 2 | 1 |    5 |         3 | = 5 + 0 (#24887) + 0 (#27186) + 1 (#29485) + 1 (#31784)↵
-- +-------+---+---+------+-----------+     + 1 (curr.)

LAG () window function

The LAG function, in its simplest form ( LAG(x)), returns the previous value of a given column. The classic inconvenience with such functions is handling the first entry (s) in the window. Since there is no previous record, they return NULL. In the case of LAG, you can specify the desired value as the third parameter:

LAG(x, 1, x - 1) --    `x -1`
LAG(y, 1, y)     --    `y`

By specifying the default values, we ensure that the very first place in the window's bounds will have the same logic as for the next place (x-1) and without changing the row (y).

An alternative solution is to use IFNULL, however, the expressions are quite cumbersome:

--  ,  !
--
IFNULL(x > LAG(x) OVER tzw + 1 OR y != LAG(y) OVER tzw, 0)
IFNULL(x > LAG(x) OVER tzw + 1, FALSE) OR IFNULL(y != LAG(y) OVER tzw, FALSE)

The second parameter LAG()is the number of positions to move back within the window; 1 is the previous value (it is also the default).

Technical aspects

Named windows

Our query uses the same window many times. The following two queries are formally equivalent:

SELECT
  id,
  x > LAG(x, 1, x - 1) OVER tzw + 1
    OR y != LAG(y, 1, y) OVER tzw
FROM seats
WHERE venue_id = @venue_id
WINDOW tzw AS (ORDER BY y, x);

SELECT
  id,
  x > LAG(x, 1, x - 1) OVER (ORDER BY y, x) + 1
    OR y != LAG(y, 1, y) OVER (ORDER BY y, x)
FROM seats
WHERE venue_id = @venue_id;

However, the second can lead to non-optimal behavior (which I have encountered - at least in the past): the optimizer can consider the windows independent and calculate each separately. For this reason, I advise you to always use named windows (at least when they are repeated).

PARTITION BY statement

Usually windowing functions are performed on a partition. In our case, it will look like this:

SELECT
  id,
  x > LAG(x, 1, x - 1) OVER tzw + 1
    OR y != LAG(y, 1, y) OVER tzw
FROM seats
WHERE venue_id = @venue_id
WINDOW tzw AS (PARTITION BY venue_id ORDER BY y, x); -- !

Since the window matches the full set of records (which is filtered by the condition WHERE), we do not need to specify it (the partition).

But if this query had to be run on the entire table seats, then it would have to be done so that the window was reset for everyone venue_id.

Sorting

The request ORDER BYis set at the window level:

SELECT
  id,
  x > LAG(x, 1, x - 1) OVER tzw + 1
    OR y != LAG(y, 1, y) OVER tzw
FROM seats
WHERE venue_id = @venue_id
WINDOW tzw AS (ORDER BY y, x)

In this case, window sorting is separate from SELECT. It is very important! The behavior of this request:

SELECT
  id,
  x > LAG(x, 1, x - 1) OVER tzw + 1
    OR y != LAG(y, 1, y) OVER tzw
FROM seats
WHERE venue_id = @venue_id
WINDOW tzw AS ()
ORDER BY y, x

… undefined. Let's turn to the manual :

The query result strings are determined from the FROM clause after the WHERE, GROUP BY, and HAVING clauses have been executed, and the execution within the window occurs before ORDER BY, LIMIT, and SELECT DISTINCT.

Some considerations

In general terms, for this type of problem, it makes sense to calculate the state change for each record and then sum them - instead of representing each record as a function of the previous one.

This solution is more complex than the functionality it replaces, but at the same time it is reliable. Alas, this approach is not always possible or easy to implement. This is where recursive CTEs come into play.

Modern Approach # 2: Recursive CTEs

This approach requires a little trickery due to the limited capabilities of the CTE in MySQL. On the other hand, it is a one-size-fits-all, direct solution, so it does not require any rethinking of a global approach.

Let's start with a simplified version of the final request:

-- `p_`  `Previous`    
--
WITH RECURSIVE groupings (p_id, p_venue_id, p_y, p_x, p_grouping) AS
(
  (
    SELECT id, venue_id, y, x, 1
    FROM seats
    WHERE venue_id = @venue_id
    ORDER BY y, x
    LIMIT 1
  )

  UNION ALL

  SELECT
    s.id, s.venue_id, s.y, s.x,
    p_grouping + 1 + (s.x > p_x + 1 OR s.y != p_y)
  FROM groupings, seats s
  WHERE s.venue_id = p_venue_id AND (s.y, s.x) > (p_y, p_x)
  ORDER BY s.venue_id, s.y, s.x
  LIMIT 1
)
SELECT * FROM groupings;

Bingo! This query is (relatively) simple, but more importantly, it expresses the cumulative grouping function in the simplest way possible:

p_grouping + 1 + (s.x > p_x + 1 OR s.y != p_y)

--   :

@grouping := @grouping + 1 + (seats.x > @x + 1 OR seats.y != @y),
@y := seats.y,
@x := seats.x

The logic is clear even for those who are not too familiar with CTE. The first row is the first seat in the hall, in order:

SELECT id, venue_id, y, x, 1
FROM seats
WHERE venue_id = @venue_id
ORDER BY y, x
LIMIT 1

In the recursive part, we iterate:

SELECT
  s.id, s.venue_id, s.y, s.x,
  p_grouping + 1 + (s.x > p_x + 1 OR s.y != p_y)
FROM groupings, seats s
WHERE s.venue_id = p_venue_id AND (s.y, s.x) > (p_y, p_x)
ORDER BY s.venue_id, s.y, s.x
LIMIT 1

Condition WHEREtogether with operators ORDER BYand LIMITsimply find the next place, a place with the same venue_id, but used for lshimi coordinates (x, y) in the sequence (venue_id, x, y).

The part s.venue_idin the sort expression is very important! It allows us to use an index.

Operator SELECT:

performs accumulation (calculates (p_)grouping),
provides values for the current position ( s.id, s.venue_id, s.y, s.x) in the next cycle.

We choose FROM groupingsto fulfill the requirements for CTE recursiveness.

What's interesting here is that we use a recursive CTE as an iterator, fetching from a table groupingsin a recursive subquery and joining it with seatsto find the data for further processing.

JOINis formally cross, but LIMITonly one record is returned because of the operator .

Working version

Unfortunately, the above query does not work as it ORDER BYis not currently supported in recursive subqueries. In addition, the semantics LIMITas used here differ from the typical semantics that apply to an external query :

LIMIT is now supported [..] The effect on the resulting dataset is the same as using LIMIT with an external SELECT

However, this is not such a serious problem. Let's take a look at a working version:

WITH RECURSIVE groupings (p_id, p_venue_id, p_y, p_x, p_grouping) AS
(
  (
    SELECT id, venue_id, y, x, 1
    FROM seats
    WHERE venue_id = @venue_id
    ORDER BY y, x
    LIMIT 1
  )

  UNION ALL

  SELECT
    s.id, s.venue_id, s.y, s.x,
    p_grouping + 1 + (s.x > p_x + 1 OR s.y != p_y)
  FROM groupings, seats s WHERE s.id = (
    SELECT si.id
    FROM seats si
    WHERE si.venue_id = p_venue_id AND (si.y, si.x) > (p_y, p_x)
    ORDER BY si.venue_id, si.y, si.x
    LIMIT 1
  )
)
SELECT * FROM groupings;

-- +-------+------+------+------------+
-- | p_id  | p_y  | p_x  | p_grouping |
-- +-------+------+------+------------+
-- | 24887 |    0 |    0 |          1 |
-- | 27186 |    0 |    1 |          2 |
-- | 29485 |    1 |    0 |          4 |
-- | 31784 |    1 |    2 |          6 |
-- | 34083 |    2 |    0 |          8 |
-- +-------+------+------+------------+

It is a little uncomfortable to use a subquery, but this approach works and the boilerplate is minimal here, since several expressions are required anyway.

Here, instead of doing the ordering and limiting associated with the union groupingsand seats, we do it within the subquery and pass it to the outer query, which then selects only the target record.

Reflections on performance

Let's examine the query execution plan using EXPLAIN ANALYZE:

mysql> EXPLAIN ANALYZE WITH RECURSIVE groupings [...]

-> Table scan on groupings  (actual time=0.000..0.001 rows=5 loops=1)
    -> Materialize recursive CTE groupings  (actual time=0.140..0.141 rows=5 loops=1)
        -> Limit: 1 row(s)  (actual time=0.019..0.019 rows=1 loops=1)
            -> Index lookup on seats using venue_id_y_x (venue_id=(@venue_id))  (cost=0.75 rows=5) (actual time=0.018..0.018 rows=1 loops=1)
        -> Repeat until convergence
            -> Nested loop inner join  (cost=3.43 rows=2) (actual time=0.017..0.053 rows=2 loops=2)
                -> Scan new records on groupings  (cost=2.73 rows=2) (actual time=0.001..0.001 rows=2 loops=2)
                -> Filter: (s.id = (select #5))  (cost=0.30 rows=1) (actual time=0.020..0.020 rows=1 loops=5)
                    -> Single-row index lookup on s using PRIMARY (id=(select #5))  (cost=0.30 rows=1) (actual time=0.014..0.014 rows=1 loops=5)
                    -> Select #5 (subquery in condition; dependent)
                        -> Limit: 1 row(s)  (actual time=0.007..0.008 rows=1 loops=9)
                            -> Filter: ((si.y,si.x) > (groupings.p_y,groupings.p_x))  (cost=0.75 rows=5) (actual time=0.007..0.007 rows=1 loops=9)
                                -> Index lookup on si using venue_id_y_x (venue_id=groupings.p_venue_id)  (cost=0.75 rows=5) (actual time=0.006..0.006 rows=4 loops=9)

The plan is in line with expectations. In this case, the basis of the optimal plan lies in index searches:

-> Nested loop inner join  (cost=3.43 rows=2) (actual time=0.017..0.053 rows=2 loops=2)
-> Single-row index lookup on s using PRIMARY (id=(select #5))  (cost=0.30 rows=1) (actual time=0.014..0.014 rows=1 loops=5)
-> Index lookup on si using venue_id_y_x (venue_id=groupings.p_venue_id)  (cost=0.75 rows=5) (actual time=0.006..0.006 rows=4 loops=9)

... of paramount importance. The performance will drop significantly if you perform an index scan (that is, linearly scan the index records instead of looking for the necessary ones at once).

Thus, for this strategy to work, the linked indexes need to be in place and to be used as efficiently as possible by the optimizer.

If the restrictions are lifted in the future, then there will be no need to use a subquery, which will greatly simplify the task for the optimizer.

Alternative for suboptimal plans

In case the optimal plan cannot be determined, use a temporary table:

CREATE TEMPORARY TABLE selected_seats (
  id INT NOT NULL PRIMARY KEY,
  y INT,
  x INT,
  UNIQUE (y, x)
)
SELECT id, y, x
FROM seats WHERE venue_id = @venue_id;

WITH RECURSIVE
groupings (p_id, p_y, p_x, p_grouping) AS
(
  (
    SELECT id, y, x, 1
    FROM seats
    WHERE venue_id = @venue_id
    ORDER BY y, x
    LIMIT 1
  )

  UNION ALL

  SELECT
    s.id, s.y, s.x,
    p_grouping + 1 + (s.x > p_x + 1 OR s.y != p_y)
  FROM groupings, seats s WHERE s.id = (
    SELECT ss.id
    FROM selected_seats ss
    WHERE (ss.y, ss.x) > (p_y, p_x)
    ORDER BY ss.y, ss.x
    LIMIT 1
    )
)
SELECT * FROM groupings;

Even if index scans are passed in this query, they cost a selected_seatslot , since the table is very small.

Conclusion

I am very pleased that the efficient but flawed workflow can now be replaced with the fairly simple functionality introduced in MySQL 8.0.

In the meantime, the development of new features for 8.0 continues, which makes an already successful release even better.

Successful recursion!

Using window functions and CTEs in MySQL 8.0 to implement a cumulative total without hacks