How to build a modern analytical data warehouse based on Cloudera Hadoop

Hello.





At the end of last year, GlowByte and Gazprombank gave a large joint presentation at the Big Data Days conference on building a modern analytical data warehouse on the Cloudera Hadoop ecosystem. In this article, we describe our experience building the system and the difficulties and challenges we had to overcome to make the project a success.





Hadoop . โ€” ยซ ?ยป. . - , - , , , , , Hadoop.





โ€” Cloudera , โ€œโ€ . .





โ€œโ€ โ€”   . -3 .





, 2017 โ€œ โ€ . 





,   , data driven .





. , : , . . .





:





  • ( , );





  • ;





  • ;





  • -;





  • ;





  • Self-service ;





  • Data Science .





. :





-





  • -: CRM, Real Time Offer, Next Best Offer, ;





  • - as is ( Data Lake);





  • ;





  • ;





  • ;





  • ( );





  • ;





  • ;





  • .





 





  • ;





  • ;





  • SLA;





  • ELT ;





  • Enterprise (, SAP Business Objects, SAS);





  • .





, , open source , โ€” \ .





Hadoop Cloudera Data Hub





.





Fig. Architecture

Cloudera Data Hub.  





1.

. ETL . โ€œโ€ . .





Hadoop 40- - t-1 t-15 batch , real-time . : 





  • CRM;





  • ;





  • ;





  • ;





  • Collection;





  • MDM;





  • ;





  • ;





  • BI





2. โ€œ โ€

, , , . . Disaster Recovery . 





science , , - . . , . . . 





, , .





, , K8S, GPU .





, , ETL,  , Cloudera.





CDH 5.16.1. .





Data : CPU 2x22 Cores, 768 GB RAM, SAS HDD 12x4 TB. HPE DL380 Cloudera Enterprise Reference Architecture for Bare Metal Deployments. “”, - , ETL . . , “100500” , , “”.
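As a rough illustration of what the data node specification above means for capacity planning, the sketch below works through the arithmetic. The core, RAM, and disk figures come from the text. The replication factor of 3 is the HDFS default and is an assumption here, and the node count in the usage line is purely hypothetical. Space reserved for the OS and temporary data is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataNodeSpec:
    """Per-node hardware figures quoted in the text."""
    sockets: int = 2            # CPU 2x22 cores
    cores_per_socket: int = 22
    ram_gb: int = 768
    disks: int = 12             # SAS HDD 12x4 TB
    disk_tb: int = 4

    @property
    def cores(self) -> int:
        return self.sockets * self.cores_per_socket

    @property
    def raw_tb(self) -> int:
        return self.disks * self.disk_tb


def usable_tb(spec: DataNodeSpec, nodes: int, replication: int = 3) -> float:
    """Approximate HDFS capacity after replication (assumed factor, default 3)."""
    return spec.raw_tb * nodes / replication


spec = DataNodeSpec()
print(spec.cores, spec.raw_tb)      # 44 cores and 48 TB raw per node
print(usable_tb(spec, nodes=10))    # hypothetical 10-node cluster: 160.0 TB
```

With these numbers, each node contributes 48 TB of raw disk, or about 16 TB of replicated HDFS capacity under the assumed replication factor.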





, , .





  • Hadoop;





  • (ETL);





  • «- –> Hadoop» «Hadoop –> Hadoop»;





  • ;





  • ;









  .





Hadoop 1.0 , java , , , ยซ ยป ยซ ยป. , ,   SQL.





, ,   โ€“ SQL  SQL. . SQL- ยซ , ยป.





ยซยป SQL Hadoop. Impala . Impala Cloudera Hadoop .





Impala ?





Impala – , HDFS, MapReduce, TEZ SPARK.





Impala โ€“ . 





Impala Parquet, (bloom , ), . Impala , MPP Teradata GreenPlum.





Impala , , ETL .





Hadoop  YARN . . 





SQL , , SQL , 3-4 . 





Hadoop :





Fig. Working with Impala SQL in Hue

- Hue, Cloudera. , SQL Excel.





Fig. SQL access to Hadoop in a local thick client

Cloudera, โ€“ Impala ETL , ad-hoc BI ? - Impala ยซ ยป Hive. E , . 





  โ€“ ETL .





ETL :





  • ;





  • ;





  • jobโ€™ .





- , , Hadoop , . Hadoop - SQL. โ€œ โ€ ( , ), Hadoop โ€œ โ€.





, . metadata driven E-L-T ETL , SQL . SQL . ETL , SQL. SAS Data Integration.





ETL metadata driven ELT. Airflow!





 





  • ;





  • lineage ETL , API;





  • .. jobโ€™ ETL .





  • CI/CD





Fig. Examples of ETL process diagrams

SAS DI API .





Fig. Object dependency graph

โ€“ .





โ€“ Data Replicator. Hadoop. 









  • ;





  • ;





  • .. , ( ), ..





, , . , SLA Hadoop.





Data Replicator’ - Hadoop DR . , - , API. ETL , API . , DR , , «» .





, Hadoop ( Hadoop ) , , Kafka, Flume, ETL tool.





Hadoop . , , ( Hive) ( Impala). 





– , . 24/7 . .. \ , ( , ..). .





, Hive 3 ACID , , Hive ( MapReduce), ACID Impala Hadoop .





HDFS snapshot VIEW.





HDFS, , VIEW.





VIEW, , . 





โ€“ VIEW HDFS , Hadoop. UNDO Oracle, retention .





,   HDFS , DDL VIEW .. metastore. .. VIEW .





HDFS Snapshot .





Data Replicator’. , , ETL API. , ETL API VIEW.





, 24/7 . HDFS HDFS. , 25%.
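A minimal sketch of how HDFS snapshots and a VIEW could be combined so that readers keep seeing a consistent copy while new data is loaded. This is one hypothetical reading of the approach, not the project's actual implementation. The function name, paths, and table names are invented for illustration, and the table is assumed to be unpartitioned. The function only builds command strings: `hdfs dfs -createSnapshot` and the Impala statements `CREATE EXTERNAL TABLE ... LIKE ... LOCATION` and `ALTER VIEW ... AS SELECT` do exist, but whether Impala reads a `.snapshot` location as intended should be verified on the target version. The directory must already be snapshottable (`hdfs dfsadmin -allowSnapshot`).

```python
from typing import List


def publish_plan(table_dir: str, base_table: str, load_id: str, view: str) -> List[str]:
    """Build the ordered commands for publishing one completed load.

    1. Freeze the table directory in an HDFS snapshot.
    2. Expose the frozen files through an external table.
    3. Repoint the user-facing VIEW to that table (a metastore-only change).
    """
    snapshot = f"load_{load_id}"                       # hypothetical naming scheme
    snap_table = f"{base_table}__{snapshot}"
    snap_path = f"{table_dir}/.snapshot/{snapshot}"    # where HDFS exposes snapshots
    return [
        f"hdfs dfs -createSnapshot {table_dir} {snapshot}",
        f"CREATE EXTERNAL TABLE {snap_table} LIKE {base_table} LOCATION '{snap_path}'",
        f"ALTER VIEW {view} AS SELECT * FROM {snap_table}",
    ]


for step in publish_plan("/data/dwh/accounts", "dwh.accounts", "20200101", "dwh_v.accounts"):
    print(step)
```

Because only the last step is visible to users, queries that started before it continue against the previous snapshot, and old snapshots can be dropped after a retention period.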





โ€“ .









  • ;





  • ;





  • , ;









  • Hadoop cgroups;





  • Hadoop;





  • Hadoop, YARN Impala;





  • Impala โ€“ .





โ€“ ETL Cloudera.





. SQL , .





900 SQL . . 





Fig. Average CPU utilization per day

, . 1,5 2 . .





, , , . Hadoop , , , open source ( Apache Big Top) .





Cloudera :









    • Active Directory (AD) ;





    • AD Sentry;





    • Sentry Impala HDFS;





    • Target VIEW ;





  • ;





  • SSL . .





  • Hadoop ( )





    • ;





    • ETL;





    • Hadoop ;





  • , , .





โ€“ . 





Hadoop ( ) โ€“ , . .





. , Hadoop, , , .





ad-hoc   , , .





, :





  • ;





  • ;





  • ;





  • ;





  • ;





  • ;





  • MDM;





  • ;





  • ;





  • ;





  • ;





  • ;





  • ;





  • ;





  • ;





  • ;





  • .





, 177 2350 -. snappy 20 ( 100 RAW).





2010 . , . , . , , . . , , .





- -, . 40 , 550 13200 .





, Hadoop. Cloudera Data Hub - , . , .





, metastore ( ).





Impala. โ€œโ€ . โ€“ ( , ETL, , ) , . sqoop export. Impala .





, , decommission , , .





. 36 500 . 





Cloudera Data Impact 2020 Data For Enterprise AI.





, Hadoop Cloudera . - . โ€œ โ€. โ€œ โ€ , .





.





โ€œโ€, โ€œโ€, โ€œโ€ . . , , .   ยซยป . 





  time to market , data driven .





. โ€œโ€ , t - 3-5 - . , , CRM. , , . .   - !





Hadoop. Hadoop . SQL MPP, โ€œโ€ , โ€œ โ€ .





Cloudera Data Platform 7.1. , CDP . , , , , Impala 3.4, Parquet, Zstd . Atlas Cloudera Data Flow « ». Cloudera BI - Cloudera Data Visualization.





Hadoop:





  • Real-time Kudu (real-time , ). Kudu, Parquet, «» SQL Impala. - .





  • ODS





ODS Oracle GoldenGate , Hadoop «» «» .









    • property Hadoop;





    • Arango;





    • Arango;





    • ( );









    • ( , , );





    • ,









    • , ;









  • , . - , โ€œ โ€.





  • K8S





, . , .





:





, .





, ().







