Hello, Habr!
Artemy Kozyr from the Analytics team is in touch, and I continue to introduce you to Wheely. In this issue:
- Fundamentals of Agile Cluster Computing
- Column storage and data compression
- Instead of indexes: segmentation and sort keys
- Management of access, rights, resources
- Integration with S3 or Dataleyk out of the blue
Wheely , Redshift Amazon. ?
. -, . .
leader node () compute node ( ). , , . โ , .
Redshift
MPP (massively parallel processing, - ). , : BigQuery, Vertica, Greenplum, Teradata, Azure SQL DW.
compute node : CPU, memory, disk, .
Wheely 8 dc2.large, 16 vCPU, 120 GiB , 1.3 TB . - , , : , , .
(elasticity), . : (resize), (node upgrade).
concurrency scaling, compute nodes, "" .
(PostgreSQL, MySQL, SQL Server), (row-oriented storage), Redshift (column-oriented storage). , .
. ( - ). 100+ ; 5. 5 . 100+ , I/O .
( !)
-, , (compression). , (Business, First, Luxe) . , : {100, Luxe}. , . : () , .
, , Amazon Redshift. Compressing Redshift columnar data even further with proper encodings.
21%
Redshift () (B-Tree, Bitmap) . , :
, : (EVEN), (ALL), (KEY). (distribution style).
3 : KEY, ALL, EVEN
, . , , . , distribution key, sort key โ SORT MERGE JOIN:
{{ config( dist='request_id', sort='request_id' ) }}
ORDER BY SQL-, . โ (compound sort key).
{{ config( materialized='table', dist="city", sort=['city', 'date', 'product_name'] ) }}
, Interleaved sort key, ( ) . , .
{{ config( materialized='table', dist="journey_id", sort_type='interleaved', sort=["completed_ts_loc" , "city" , "country" , "service_group" , "is_airport" , "is_wheely_journey"] ) }}
.
, , :
. :
- -
- :)
Amazon Redshift PostgreSQL :
----------------------
-- USER MANAGEMENT ---
----------------------
CREATE USER etl WITH PASSWORD '' ;
CREATE USER hevo WITH PASSWORD '' ;
CREATE USER dbt WITH PASSWORD '' ;
CREATE USER da WITH PASSWORD '' ;
CREATE USER nb WITH PASSWORD '' ;
CREATE USER looker WITH PASSWORD '' SYSLOG ACCESS UNRESTRICTED ;
CREATE USER ar WITH PASSWORD '' ;
CREATE USER ak WITH PASSWORD '' ;
-------------------------
--- SCHEMA MANAGEMENT ---
-------------------------
CREATE SCHEMA IF NOT EXISTS hevo AUTHORIZATION hevo ;
CREATE SCHEMA IF NOT EXISTS ext AUTHORIZATION etl ;
CREATE SCHEMA IF NOT EXISTS flatten AUTHORIZATION dbt ;
CREATE SCHEMA IF NOT EXISTS staging AUTHORIZATION dbt ;
CREATE SCHEMA IF NOT EXISTS intermediate AUTHORIZATION dbt ;
CREATE SCHEMA IF NOT EXISTS analytics AUTHORIZATION dbt ;
CREATE SCHEMA IF NOT EXISTS ad_hoc AUTHORIZATION dbt ;
CREATE SCHEMA IF NOT EXISTS meta AUTHORIZATION dbt ;
CREATE SCHEMA IF NOT EXISTS looker_scratch AUTHORIZATION looker ;
-----------------------
-- GROUP MANAGEMENT ---
-----------------------
CREATE GROUP etl
WITH USER etl, hevo ;
CREATE GROUP dbt
WITH USER dbt ;
CREATE GROUP analytics
WITH USER da, nb ;
CREATE GROUP bi
WITH USER looker ;
CREATE GROUP business_users
WITH USER ar, ak ;
----------------------------
-- PRIVILEGES MANAGEMENT ---
----------------------------
-- GROUP etl
GRANT USAGE ON SCHEMA hevo, staging, flatten, analytics, intermediate TO GROUP etl ;
-- GROUP dbt
GRANT USAGE ON SCHEMA hevo, snapshots TO GROUP dbt ;
GRANT USAGE, CREATE ON SCHEMA ext TO GROUP dbt ;
GRANT SELECT ON ALL tables IN SCHEMA hevo TO GROUP dbt ;
GRANT ALL PRIVILEGES ON ALL tables IN SCHEMA meta, ext, snapshots TO GROUP dbt ;
ALTER DEFAULT PRIVILEGES FOR USER hevo IN SCHEMA hevo
GRANT SELECT ON tables TO GROUP dbt ;
ALTER DEFAULT PRIVILEGES FOR USER etl IN SCHEMA ext
GRANT ALL PRIVILEGES ON tables TO GROUP dbt ;
-- GROUP analytics
GRANT USAGE ON SCHEMA hevo, ext, flatten, staging, intermediate, restricted, analytics, meta TO GROUP analytics ;
GRANT USAGE, CREATE ON SCHEMA ad_hoc TO GROUP analytics ;
GRANT SELECT ON ALL tables IN SCHEMA hevo, ext, flatten, staging, intermediate, restricted, analytics, meta TO GROUP analytics ;
GRANT ALL PRIVILEGES ON ALL tables IN SCHEMA ad_hoc TO GROUP analytics ;
ALTER DEFAULT PRIVILEGES FOR USER dbt IN SCHEMA flatten, staging, intermediate, restricted, analytics, meta
GRANT SELECT ON tables TO GROUP analytics ;
ALTER DEFAULT PRIVILEGES FOR USER dbt IN SCHEMA ad_hoc
GRANT ALL PRIVILEGES ON tables TO GROUP analytics ;
-- GROUP bi
GRANT USAGE ON SCHEMA analytics, meta, ad_hoc, snapshots TO GROUP bi ;
GRANT SELECT ON ALL tables IN SCHEMA analytics, meta, ad_hoc TO GROUP bi ;
ALTER DEFAULT PRIVILEGES FOR USER dbt IN SCHEMA analytics, meta, ad_hoc
GRANT SELECT ON tables TO GROUP bi ;
-- GROUP business users
GRANT USAGE ON SCHEMA analytics TO GROUP business_users ;
GRANT USAGE, CREATE ON SCHEMA ad_hoc TO GROUP business_users ;
GRANT SELECT ON ALL tables IN SCHEMA analytics TO GROUP business_users ;
GRANT ALL PRIVILEGES ON ALL tables IN SCHEMA ad_hoc TO GROUP business_users ;
ALTER DEFAULT PRIVILEGES FOR USER dbt IN SCHEMA analytics
GRANT SELECT ON tables TO GROUP business_users ;
Amazon Redshift automatic workload management (WLM), .. . - . Redshift :
- ELT
- BI Ad-hoc
- โโ
- ,
, .
short-query acceleration. Amazon Redshift ML- , (~ 30 .) , . , Wheely , BI .
Data Lake
? ! Redshift Spectrum :
- SQL DDL ( S3)
- : (SELECT), (WHERE), (JOINs), (INSERT)
- - : JSON, ORC, PARQUET
EXTERNAL TABLE _Hive. _ : Hive Metastore.
Spectrum Redshift
Wheely Spectrum:
1. Data Quality Pipeline
S3 Spectrum - (MongoDB) (Redshift). : QA , .
2. S3
>3 S3 parquet. , Redshift , .
UNLOAD ('SELECT * FROM "hevo"."events_prod_clickstream_archive"')
TO 's3://wheely_analytics/dwh/hevo/events_prod/clickstream_archive'
IAM_ROLE ''
MANIFEST
FORMAT AS PARQUET
NULL AS ''
MAXFILESIZE AS 256
ENCRYPTED AUTO
;
: -. .
, . . , . production-, ad-hoc .
, (target table). . , . . , , , .
, , 7 .
: in-database ML, native semi-structured data support
. Wheely .
, Amazon Redshift, preview ( ):
- Using machine learning in Amazon Redshift
- Ingesting and querying semistructured data in Amazon Redshift
, . ! , , .
? Wheely. โ .
.