Clickhouse - window functions that don't exist ...

I started working with columnar databases with BigQuery. When I had to "move" to Clickhouse I was unpleasantly surprised by the lack of full-fledged window functions. There are, of course, many functions for working with arrays, functions of higher order and other functions (one function runningDifferenceStartingWithFirstValue is worth it). The winner of 1999 for the title of the longest word Donaudampfschifffahrtsgesellschaftskapitänswitwe immediately comes to mind. Which in translation from German means "the widow of the captain of the shipping company on the Danube."



Searching for “window functions in Clickhouse” does not return meaningful results. This article is an attempt to summarize scattered data from the Internet, examples with ClickHouseMeetup and my own experience.



Window functions - syntax



Let me remind you of the syntax of window functions and the type of result we get. In the examples, we will use the Standart SQL Google BigQuery dialect. Here is a link to the documentation about window functions (they are called analytic function in the documentation - more accurate translation sounds like analytic functions). And here is the list of functions itself.



The general syntax looks like this:



analytic_function_name ( [ argument_list ] ) OVER over_clause
over_clause:
  { named_window | ( [ window_specification ] ) }
window_specification:
  [ named_window ]
  [ PARTITION BY partition_expression [, ...] ]
  [ ORDER BY expression [ { ASC | DESC }  ] [, ...] ]
  [ window_frame_clause ]
window_frame_clause:
  { rows_range } { frame_start | frame_between }
rows_range:
  { ROWS | RANGE }


Let's take it step by step:



  1. The window function is applied to the recordset defined in the over_clause expression,
  2. The recordset is defined by the PARTITION BY clause. Here you can list one or more fields by which the recordset will be determined. Works similar to GROUP BY.

    The sorting of records within a set is specified using ORDER BY.
  3. You can additionally restrict a predefined set of records as a window. The window can be defined statically. For example, you can take 5 records as a window, 2 before and 2 after the current record and the current record itself. It will look like this: ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING.

    An example of a construct for specifying a dynamically defined window looks like this - RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. This construct defines a window from the first to the current record according to the specified sort order.


As an example, consider the calculation of the cumulative sum (example from the documentation):



SELECT item, purchases, category, SUM(purchases)
  OVER (
    PARTITION BY category
    ORDER BY purchases
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS total_purchases
FROM Produce


Result:



+-------------------------------------------------------+
| item      | purchases  | category   | total_purchases |
+-------------------------------------------------------+
| orange    | 2          | fruit      | 2               |
| apple     | 8          | fruit      | 10              |
| leek      | 2          | vegetable  | 2               |
| cabbage   | 9          | vegetable  | 11              |
| lettuce   | 10         | vegetable  | 21              |
| kale      | 23         | vegetable  | 44              |
+-------------------------------------------------------+


What can be done in Clickhouse



Let's try to repeat this example in ClickHouse. Of course, ClickHouse has the runningAccumulate , arrayCumSum and groupArrayMovingSum functions . But in the first case, you need to determine the state in a subquery ( more details ), and in the second case, the function returns an array, which then needs to be expanded.



We will construct the most general query. The request itself might look like this:



SELECT
   items,
   summ as purchases,
   category,
   sumArray(cum_summ) as total_purchases
FROM (SELECT
         category,
         groupArray(item) AS items,
         groupArray(purchases) AS summ,
         arrayMap(x -> arraySlice(summ, 1, x), arrayEnumerate(summ)) AS cum_summ
     FROM (SELECT
               item,
               purchases,
               category
           FROM produce
           ORDER BY category, purchases)
     GROUP BY category)
   ARRAY JOIN items, summ, cum_summ
GROUP BY category, items, summ
ORDER BY category, purchases


Let's take it step by step:



  1. First, we construct a subquery, within which the required data sorting takes place (ORDER BY category, purchases). The sort must match the fields in the PARTITION BY and ORDER BY expressions of the window function.
  2. , , PARTITION BY. item .

    purchases , summ .
  3. — ArrayMap. , func arr.

    arr — [1, 2, …, length(summ)], arrayEnumerate.

    func — arraySlice(summ, 1, x), x — arr, . summ x. , cum_sum , , .



    ArrayMap arrayEnumerate , , . ( 3), ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING.



    arrayMap(x -> arraySlice(summ, if(x-1 > 0, x-1, 1), if(x-1 > 0, 3, 2)), arrayEnumerate(summ))


    , . 2 ClickHouse:



    • — [edited] — . [/edited]. , , arrayMap arrayFilter. . — ( — ) (alias) arrayMap, arrayFilter .
    • — . , , arrayReverse arraySlice.


  4. The final step is to expand the arrays into a table using ARRAY JOIN. We also need to apply the sum aggregate function with the -Array modifier (as a result, the aggregate function looks like sumArray) to the result returned by the ArrayMap function.


Output



It is possible to emulate the operation of window functions in ClickHouse. Not very fast and not very pretty. Briefly, the pipeline consists of 3 steps:



  1. Sorted query. This step prepares the recordset.
  2. Grouping into arrays and performing array operations. This step defines the window of our window function.
  3. Expanding back into a table using aggregate functions.



All Articles