Hello.
At the end of last year, GlowByte and Gazprombank gave a large joint presentation at the Big Data Days conference on building a modern analytical data warehouse on the Cloudera Hadoop ecosystem. In this article, we describe our experience building the system and the difficulties and challenges we had to overcome to make the project a success.
Hadoop . โ ยซ ?ยป. . - , - , , , , , Hadoop.
โ Cloudera , โโ . .
โโ โ . -3 .
, 2017 โ โ .
, , data driven .
. , : , . . .
:
( , );
;
;
-;
;
Self-service ;
Data Science .
. :
-
-: CRM, Real Time Offer, Next Best Offer, ;
- as is ( Data Lake);
;
;
;
( );
;
;
.
;
;
SLA;
ELT ;
Enterprise (, SAP Business Objects, SAS);
.
, , open source , โ \ .
Hadoop Cloudera Data Hub
.
Cloudera Data Hub.
1.
. ETL . โโ . .
Hadoop 40- - t-1 t-15 batch , real-time . :
CRM;
;
;
;
Collection;
MDM;
;
;
BI
2. โ โ
, , , . . Disaster Recovery .
science , , - . . , . . .
, , .
, , K8S, GPU .
, , ETL, , Cloudera.
CDH 5.16.1. .
Data : CPU 2x22 cores, 768 GB RAM, SAS HDD 12x4 TB. HPE DL380 Cloudera Enterprise Reference Architecture for Bare Metal Deployments. " ", - , ETL . . , "100500" , , " ".
, , .
Hadoop;
(ETL);
«- -> Hadoop» «Hadoop -> Hadoop»;
;
;
.
Hadoop 1.0 , Java , , , « » « ». , , SQL.
, , โ SQL SQL. . SQL- ยซ , ยป.
«» SQL Hadoop. Impala . Impala Cloudera Hadoop .
Impala ?
Impala - , HDFS, MapReduce, Tez Spark.
Impala โ .
Impala Parquet, (bloom , ), . Impala , MPP Teradata GreenPlum.
Impala , , ETL .
Hadoop YARN . .
SQL , , SQL , 3-4 .
Hadoop :
- Hue, Cloudera. , SQL Excel.
Cloudera, โ Impala ETL , ad-hoc BI ? - Impala ยซ ยป Hive. E , .
โ ETL .
ETL :
;
;
jobโ .
- , , Hadoop , . Hadoop - SQL. โ โ ( , ), Hadoop โ โ.
, . metadata driven E-L-T ETL , SQL . SQL . ETL , SQL. SAS Data Integration.
ETL metadata-driven ELT. Airflow!
;
lineage ETL , API;
.. jobโ ETL .
CI/CD
SAS DI API .
โ .
โ Data Replicator. Hadoop.
;
;
.. , ( ), ..
, , . , SLA Hadoop.
Data Replicator - Hadoop DR . , - , API. ETL , API . , DR , , «» .
, Hadoop ( Hadoop ) , , Kafka, Flume, ETL tool.
Hadoop . , , ( Hive) ( Impala).
โ , . 247 . .. \ , ( , ..). .
, Hive 3 ACID , , Hive ( MapReduce), ACID Impala Hadoop .
HDFS snapshot VIEW.
HDFS, , VIEW.
VIEW, , .
โ VIEW HDFS , Hadoop. UNDO Oracle, retention .
, HDFS , DDL VIEW .. metastore. .. VIEW .
HDFS Snapshot .
Data Replicator. , , ETL API. , ETL API VIEW.
, 247 . HDFS HDFS. , 25%.
โ .
;
;
, ;
Hadoop cgroups;
Hadoop;
Hadoop, YARN Impala;
Impala โ .
โ ETL Cloudera.
. SQL , .
900 SQL . .
, . 1,5 2 . .
, , , . Hadoop , , , open source ( Apache Bigtop) .
Cloudera :
Active Directory (AD) ;
AD Sentry;
Sentry Impala HDFS;
Target VIEW ;
;
SSL . .
Hadoop ( )
;
ETL;
Hadoop ;
, , .
โ .
Hadoop ( ) โ , . .
. , Hadoop, , , .
ad-hoc , , .
, :
;
;
;
;
;
;
MDM;
;
;
;
;
;
;
;
;
;
.
, 177 2350 -. snappy 20 ( 100 RAW).
2010 . , . , . , , . . , , .
- -, . 40 , 550 13200 .
, Hadoop. Cloudera Data Hub - , . , .
, metastore ( ).
Impala. โโ . โ ( , ETL, , ) , . sqoop export. Impala .
, , decommission , , .
. 36 500 .
Cloudera Data Impact 2020 Data For Enterprise AI.
, Hadoop Cloudera . - . โ โ. โ โ , .
โโ, โโ, โโ . . , , . ยซยป .
time to market , data driven .
. โโ , t - 3-5 - . , , CRM. , , . . - !
Hadoop. Hadoop . SQL MPP, โโ , โ โ .
Cloudera Data Platform 7.1. , CDP . , , , , Impala 3.4, Parquet, Zstd . Atlas Cloudera Data Flow « ». Cloudera BI - Cloudera Data Visualization.
Hadoop:
Real-time Kudu (real-time , ). Kudu, Parquet, «» SQL Impala. - .
ODS
ODS Oracle GoldenGate , Hadoop «» «» .
property Hadoop;
Arango;
Arango;
( );
( , , );
,
-
, ;
, . - , โ โ.
K8S
, . , .
:
, .
, ().