Accidents as experience #1. How to break two ClickHouse clusters by overlooking one nuance

This is the first article in our series of failure stories.

The setup: a ClickHouse cluster on HDDs with 2 shards of 2 replicas each. Its tables fall into two groups, one of which holds raw data.

, . . : , .





At some point, it was decided to move ClickHouse to faster disks (NVMe).

The new servers were connected to the same ZooKeeper as the old cluster.

Errors like this appeared in the ClickHouse (CH) logs:

2020.10.24 18:50:28.105250 [ 75 ] {} <Error> enriched_distributed.Distributed.DirectoryMonitor: Code: 252, e.displayText() = DB::Exception: Received from 192.168.1.4:9000. DB::Exception: Too many parts (300). Merges are processing significantly slower than inserts.. Stack trace:

We looked at replication_queue, and the error log had entries like this:

/var/log/clickhouse-server/clickhouse-server.err.log.0.gz:2020.10.24 16:18:33.233639 [ 16 ] {} <Error> DB.TABLENAME: DB::StorageReplicatedMergeTree::queueTask()::<lambda(DB::StorageReplicatedMergeTree::LogEntryPtr&)>: Poco::Exception. Code: 1000, e.code() = 0, e.displayText() = DNS error: Temporary DNS error while resolving: clickhouse2-0 (version 19.15.2.2 (official build)

There were also errors like these:

2020.10.24 18:38:51.192075 [ 53 ] {} <Error> search_analyzer_distributed.Distributed.DirectoryMonitor: Code: 210, e.displayText() = DB::NetException: Connection refused (192.168.1.3:9000), Stack trace:
2020.10.24 18:38:57.871637 [ 58 ] {} <Error> raw_distributed.Distributed.DirectoryMonitor: Code: 210, e.displayText() = DB::NetException: Connection refused (192.168.1.3:9000), Stack trace:

Running SHOW CREATE TABLE made the cause clear: it was the ZKPATH!

ENGINE = ReplicatedMergeTree('/clickhouse/tables/DBNAME/{shard}/TABLENAME', '{replica}')

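In other words:

  • both clusters use the same ZK;
  • the replica name is the CH host name;
  • the shard is just the shard number;
  • the ZKPATH contains nothing that identifies the cluster,
    so the new cluster's replicas joined the replication of the old cluster's tables.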
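
A minimal bash sketch of why this breaks: it expands {shard} in the path from the engine definition and appends the replicas/<name> node under which ReplicatedMergeTree registers each replica (the same layout as the rmr path further down). The function name replica_node is hypothetical, and treating clickhouse-1 as an old-cluster host and clickhouse2-1 as a new-cluster host is an assumption based on the names used in this article.

```shell
#!/usr/bin/env bash
# Illustrative sketch: expand the {shard} macro in the table's ZooKeeper path and show
# where each replica registers itself. ClickHouse takes the real macro values from the
# <macros> section of its server config; this only mimics the substitution.
set -euo pipefail

# Usage: replica_node <zk path template> <shard> <replica>   (hypothetical helper)
replica_node() {
  local template=$1 shard=$2 replica=$3
  local table_path=${template//\{shard\}/$shard}
  # ReplicatedMergeTree keeps each replica under <table path>/replicas/<replica name>
  printf '%s/replicas/%s\n' "$table_path" "$replica"
}

template='/clickhouse/tables/DBNAME/{shard}/TABLENAME'

# Assumed: old and new cluster share shard numbers but have different host names.
old=$(replica_node "$template" 2 clickhouse-1)
new=$(replica_node "$template" 2 clickhouse2-1)
echo "old cluster replica: $old"
echo "new cluster replica: $new"

# Both nodes share the same parent .../replicas, i.e. they are replicas of one table.
if [ "$(dirname "$old")" = "$(dirname "$new")" ]; then
  echo "same replication group: $(dirname "$old")"
fi
```

Since nothing in the path identifies the cluster, both hosts end up as siblings under one replicas node. A path component that differs per cluster would keep the two replication groups apart.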





In short: two clusters, one ZK, one ZKPATH.

The fix: DROP. The plan was:

  • run DROP TABLE on the new cluster's replicas;
  • remove whatever remained of them in ZK.

In our case these were the replicas clickhouse2-0 and clickhouse2-1. DROP TABLE went through on 2-0, but 2-1 stayed registered in ZK.

So we tried to remove it from ZK manually:

rmr /clickhouse/tables/DBNAME/2/TABLENAME/replicas/clickhouse2-1

… node not empty



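Newer ClickHouse versions have a statement for removing a stale replica's data from ZK, but it only appeared in the 20.x releases, and we were running ClickHouse 19. In newer CH versions it looks like this: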

SYSTEM DROP REPLICA 'replica_name' FROM ZKPATH '/path/to/table/in/zk';
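
On a version that supports the statement, it can be generated for several stale replicas at once. A minimal bash sketch that only prints the SQL so it can be reviewed before running; print_drop_replica_sql and its arguments are hypothetical, with values modelled on the paths in this article.

```shell
#!/usr/bin/env bash
# Sketch: print one SYSTEM DROP REPLICA statement per stale replica for review.
# Executing them requires a ClickHouse release that has this statement (not 19.x).
set -euo pipefail

# Usage: print_drop_replica_sql <zk table path> <replica> [<replica> ...]
print_drop_replica_sql() {
  local zk_path=$1 replica
  shift
  for replica in "$@"; do
    printf "SYSTEM DROP REPLICA '%s' FROM ZKPATH '%s';\n" "$replica" "$zk_path"
  done
}

# Placeholder values modelled on the paths in this article.
print_drop_replica_sql '/clickhouse/tables/DBNAME/2/TABLENAME' clickhouse2-1
```

The printed statements can then be piped into clickhouse-client --multiquery. Per the ClickHouse documentation, the statement will not drop a replica that is local to the server executing it; DROP TABLE is meant for that case.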

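Even with the replica removed from ZK, the remaining CH replica kept trying to work with it. Why? Its replication_queue still contained tasks referencing the deleted replica.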

These tasks are stored in ZK, under a path like this:

/clickhouse/tables/DBNAME/1/TABLENAME/replicas/clickhouse-0/queue/

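Each node in this queue corresponds to a row of the replication_queue system table in CH, so a SELECT of the node_name column returns exactly the names of the nodes to delete.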

First, save the node names of all queue entries whose source is the removed replica:

clickhouse-client -h 127.0.0.1 --password PASS -q "select node_name from system.replication_queue where database='DBNAME' and table='TABLENAME' and source_replica='clickhouse2-1'" > bad_queueid

Then delete those nodes from ZK one by one:

for id in $(cat bad_queueid); do /usr/share/zookeeper/bin/zkCli.sh rmr /clickhouse/tables/DBNAME/2/TABLENAME/replicas/clickhouse-1/queue/$id; sleep 2; done
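
Deleting queue nodes by hand cannot be undone, so it can help to generate the commands first and read them before running anything. A dry-run variant of the loop above in bash; the function name print_queue_cleanup and the QUEUE_PATH value are placeholders, and the check that node names look like queue-<digits> is an assumption about how ReplicatedMergeTree names its queue entries.

```shell
#!/usr/bin/env bash
# Dry-run sketch: turn the node names saved in bad_queueid into zkCli.sh commands
# so they can be reviewed before anything is deleted from ZooKeeper.
set -euo pipefail

# Placeholder: the queue of the replica being cleaned (copied from the loop above).
QUEUE_PATH='/clickhouse/tables/DBNAME/2/TABLENAME/replicas/clickhouse-1/queue'

# Usage: print_queue_cleanup <queue path> <file with one node name per line>
print_queue_cleanup() {
  local queue_path=$1 ids_file=$2 id
  while IFS= read -r id; do
    # Skip blank lines and anything that does not look like a queue entry name
    # (assumption: entries are named queue-<digits>).
    [[ $id =~ ^queue-[0-9]+$ ]] || continue
    printf '%s\n' "/usr/share/zookeeper/bin/zkCli.sh rmr $queue_path/$id"
  done < "$ids_file"
}

if [ -f bad_queueid ]; then
  print_queue_cleanup "$QUEUE_PATH" bad_queueid
fi
```

Once the printed list looks right, the same lines can be executed (for example, with a pause between them as in the original loop).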

After that, the same SELECT returned nothing.


What are the takeaways?

The main one: do not hardcode anything that has to differ between clusters, such as ZooKeeper paths and replica names.

