🤛 ⚾️ 🕤 Choosing a hash function in the data sharding problem 🧚 👐🏽 👨🏼‍⚕️

We at Miro are working on the process of sharding Postgres databases and use different approaches depending on business requirements. Recently, we faced the task of sharding new databases, during which we chose a new approach to sharding for us, based on consistent hashing .

During the implementation of this approach, one of the central questions was which implementation of the non-cryptographic hash function we should choose and use. In this article, I will describe the criteria and the comparison algorithm that we developed and used in practice to find the best implementation.

About the architectural approach

There are many products ( mongo , redis , etc.) that use consistent hashing for sharding, and our implementation will be very similar to them.

Let, at the input, we have a set of entities with selected sharding keys, of a string type. For these keys, using the hash function, we get a hash code of a certain length, for which we define the required slot through the modulo operation. The number of slots and the correspondence of entities to slots is fixed. It is also necessary to keep the correspondence between the ranges of slots and shards, which is not a difficult task, and a configuration file is quite suitable for the storage location.

The advantages of this approach are:

even distribution of entities across shards;
determining the correspondence of entities and shards without additional storage with a minimum of resource costs;
the ability to add new shards to the cluster.

Cons:

inefficiency of some search operations, in which it is necessary to make queries for all shards;
rather complicated resharding process.

Requirements

java- -.

- , 256 , - - , 4 . - 2 4 .

, , ;
. , , ;
~~( );~~
~~. , .~~

: - ; - , .

, .

java- - -:

DJB2 (32-);
SDBM (32-);
LoseLose (32-);
FNV-1 / FNV-1a (32-);
CRC16 (16-) ;
Murmur2/Murmur3 (32-).

, 216,553 ;
, UTF-8.

(- ) - "2", "4", "8", "16", "32", "64", "128", "256".

, - ops/ms (- );
- . . , - , .

JMH. :

, 256 . - , .

- warmup- - 50;
- measurement- - 100;
- throughput
-Xms1G, -Xmx8G
GCProfiler

, α=0,05, . .

, , ;
- , , ;
$\ overline {x_ {b}}$ ,

n — , $p_ {i}$ — , -

$x_ {length}$ - , a a b -
,

$\ chi_ {obs} ^ 2 = \ sum \ frac {n_ {i} - \ hat {n_ {i}}} {\ hat {n_ {i}}}$ ,

$n_ {i}$ - , , $\ hat {n_ {i}}$ - , ;
$\ chi_ {cr} ^ 2 (\ alpha, k)$ , α k ;
$\ chi_ {obs} ^ 2 <\ chi_ {cr} ^ 2$ , , — .