Searching for “window functions in Clickhouse” does not return meaningful results. This article is an attempt to summarize scattered data from the Internet, examples with ClickHouseMeetup and my own experience.
Window functions - syntax
Let me remind you of the syntax of window functions and the type of result we get. In the examples, we will use the Standart SQL Google BigQuery dialect. Here is a link to the documentation about window functions (they are called analytic function in the documentation - more accurate translation sounds like analytic functions). And here is the list of functions itself.
The general syntax looks like this:
analytic_function_name ( [ argument_list ] ) OVER over_clause
over_clause:
{ named_window | ( [ window_specification ] ) }
window_specification:
[ named_window ]
[ PARTITION BY partition_expression [, ...] ]
[ ORDER BY expression [ { ASC | DESC } ] [, ...] ]
[ window_frame_clause ]
window_frame_clause:
{ rows_range } { frame_start | frame_between }
rows_range:
{ ROWS | RANGE }
Let's take it step by step:
- The window function is applied to the recordset defined in the over_clause expression,
- The recordset is defined by the PARTITION BY clause. Here you can list one or more fields by which the recordset will be determined. Works similar to GROUP BY.
The sorting of records within a set is specified using ORDER BY. - You can additionally restrict a predefined set of records as a window. The window can be defined statically. For example, you can take 5 records as a window, 2 before and 2 after the current record and the current record itself. It will look like this: ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING.
An example of a construct for specifying a dynamically defined window looks like this - RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. This construct defines a window from the first to the current record according to the specified sort order.
As an example, consider the calculation of the cumulative sum (example from the documentation):
SELECT item, purchases, category, SUM(purchases)
OVER (
PARTITION BY category
ORDER BY purchases
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS total_purchases
FROM Produce
Result:
+-------------------------------------------------------+
| item | purchases | category | total_purchases |
+-------------------------------------------------------+
| orange | 2 | fruit | 2 |
| apple | 8 | fruit | 10 |
| leek | 2 | vegetable | 2 |
| cabbage | 9 | vegetable | 11 |
| lettuce | 10 | vegetable | 21 |
| kale | 23 | vegetable | 44 |
+-------------------------------------------------------+
What can be done in Clickhouse
Let's try to repeat this example in ClickHouse. Of course, ClickHouse has the runningAccumulate , arrayCumSum and groupArrayMovingSum functions . But in the first case, you need to determine the state in a subquery ( more details ), and in the second case, the function returns an array, which then needs to be expanded.
We will construct the most general query. The request itself might look like this:
SELECT
items,
summ as purchases,
category,
sumArray(cum_summ) as total_purchases
FROM (SELECT
category,
groupArray(item) AS items,
groupArray(purchases) AS summ,
arrayMap(x -> arraySlice(summ, 1, x), arrayEnumerate(summ)) AS cum_summ
FROM (SELECT
item,
purchases,
category
FROM produce
ORDER BY category, purchases)
GROUP BY category)
ARRAY JOIN items, summ, cum_summ
GROUP BY category, items, summ
ORDER BY category, purchases
Let's take it step by step:
- First, we construct a subquery, within which the required data sorting takes place (ORDER BY category, purchases). The sort must match the fields in the PARTITION BY and ORDER BY expressions of the window function.
- , , PARTITION BY. item .
purchases , summ . - — ArrayMap. , func arr.
arr — [1, 2, …, length(summ)], arrayEnumerate.
func — arraySlice(summ, 1, x), x — arr, . summ x. , cum_sum , , .
ArrayMap arrayEnumerate , , . ( 3), ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING.
arrayMap(x -> arraySlice(summ, if(x-1 > 0, x-1, 1), if(x-1 > 0, 3, 2)), arrayEnumerate(summ))
, . 2 ClickHouse:
- —
, , arrayMap arrayFilter.. — ( — ) (alias) arrayMap, arrayFilter . - — . , , arrayReverse arraySlice.
- —
- The final step is to expand the arrays into a table using ARRAY JOIN. We also need to apply the sum aggregate function with the -Array modifier (as a result, the aggregate function looks like sumArray) to the result returned by the ArrayMap function.
Output
It is possible to emulate the operation of window functions in ClickHouse. Not very fast and not very pretty. Briefly, the pipeline consists of 3 steps:
- Sorted query. This step prepares the recordset.
- Grouping into arrays and performing array operations. This step defines the window of our window function.
- Expanding back into a table using aggregate functions.