Amazon Redshift Analytics Engine + Cloud Benefits

Amazon Redshift Analytics Engine + Cloud Benefits







Hello, Habr!







Artemy Kozyr from the Analytics team is in touch, and I continue to introduce you to Wheely. In this issue:







  • Fundamentals of Agile Cluster Computing
  • Column storage and data compression
  • Instead of indexes: segmentation and sort keys
  • Management of access, rights, resources
  • Integration with S3 or Dataleyk out of the blue


Wheely , Redshift Amazon. ?







. -, . .









leader node () compute node ( ). , , . โ€” , .







Redshift provides parallel data processing on a cluster of machines







Redshift







MPP (massively parallel processing, - ). , : BigQuery, Vertica, Greenplum, Teradata, Azure SQL DW.







compute node : CPU, memory, disk, .







Specification of compute node types available for Amazon Redshift







compute node Amazon Redshift







Wheely 8 dc2.large, 16 vCPU, 120 GiB , 1.3 TB . - , , : , , .







(elasticity), . : (resize), (node upgrade).







Concurrency Scaling will allow you to raise a temporary cluster in order to cope with the peak load







Concurrency Scaling







concurrency scaling, compute nodes, "" .









(PostgreSQL, MySQL, SQL Server), (row-oriented storage), Redshift (column-oriented storage). , .







. ( - ). 100+ ; 5. 5 . 100+ , I/O .







Column storage organizes single column data into blocks (not rows!)







( !)







-, , (compression). , (Business, First, Luxe) . , : {100, Luxe}. , . : () , .







An example of data compression using the Runlength encoding algorithm







Runlength encoding







, , Amazon Redshift. Compressing Redshift columnar data even further with proper encodings.







Revision of data compression algorithms saved 21% of disk space







21%









Redshift () (B-Tree, Bitmap) . , :







, : (EVEN), (ALL), (KEY). (distribution style).







3 types of distribution of each row of data in the cluster: KEY, ALL, EVEN







3 : KEY, ALL, EVEN







, . , , . , distribution key, sort key โ€” SORT MERGE JOIN:







ORDER BY request_id:
{{
   config(
       dist='request_id',
       sort='request_id'
   )
}}
      
      





ORDER BY SQL-, . โ€” (compound sort key).







ORDER BY city, date, product_name:
{{
   config(
       materialized='table',
       dist="city",
       sort=['city', 'date', 'product_name']
   )
}}
      
      





, Interleaved sort key, ( ) . , .







INTERLEAVED SORT BY completed_ts_loc, city, country, service_group, is_airport, is_wheely_journey:
{{
   config(
       materialized='table',
       dist="journey_id",
       sort_type='interleaved',
       sort=["completed_ts_loc"
               , "city"
               , "country"
               , "service_group"
               , "is_airport"
               , "is_wheely_journey"]
   )
}}
      
      





.









, , :









. :







  • -
  • :)


Amazon Redshift PostgreSQL :







:
----------------------
-- USER MANAGEMENT ---
----------------------

CREATE USER etl WITH PASSWORD '' ;
CREATE USER hevo WITH PASSWORD '' ;

CREATE USER dbt WITH PASSWORD '' ;

CREATE USER da WITH PASSWORD '' ;
CREATE USER nb WITH PASSWORD '' ;

CREATE USER looker WITH PASSWORD '' SYSLOG ACCESS UNRESTRICTED ;

CREATE USER ar WITH PASSWORD '' ;
CREATE USER ak WITH PASSWORD '' ;

-------------------------
--- SCHEMA MANAGEMENT ---
-------------------------

CREATE SCHEMA IF NOT EXISTS hevo AUTHORIZATION hevo ;
CREATE SCHEMA IF NOT EXISTS ext AUTHORIZATION etl ;

CREATE SCHEMA IF NOT EXISTS flatten AUTHORIZATION dbt ;
CREATE SCHEMA IF NOT EXISTS staging AUTHORIZATION dbt ;
CREATE SCHEMA IF NOT EXISTS intermediate AUTHORIZATION dbt ;
CREATE SCHEMA IF NOT EXISTS analytics AUTHORIZATION dbt ;
CREATE SCHEMA IF NOT EXISTS ad_hoc AUTHORIZATION dbt ;
CREATE SCHEMA IF NOT EXISTS meta AUTHORIZATION dbt ;

CREATE SCHEMA IF NOT EXISTS looker_scratch AUTHORIZATION looker ;

-----------------------
-- GROUP MANAGEMENT ---
-----------------------

CREATE GROUP etl
WITH USER etl, hevo ;

CREATE GROUP dbt
WITH USER dbt ;

CREATE GROUP analytics
WITH USER da, nb ;

CREATE GROUP bi
WITH USER looker ;

CREATE GROUP business_users
WITH USER ar, ak ;

----------------------------
-- PRIVILEGES MANAGEMENT ---
----------------------------

-- GROUP etl
GRANT USAGE ON SCHEMA hevo, staging, flatten, analytics, intermediate TO GROUP etl ;

-- GROUP dbt
GRANT USAGE ON SCHEMA hevo, snapshots TO GROUP dbt ;
GRANT USAGE, CREATE ON SCHEMA ext TO GROUP dbt ;

GRANT SELECT ON ALL tables IN SCHEMA hevo TO GROUP dbt ;
GRANT ALL PRIVILEGES ON ALL tables IN SCHEMA meta, ext, snapshots TO GROUP dbt ;

ALTER DEFAULT PRIVILEGES FOR USER hevo IN SCHEMA hevo
   GRANT SELECT ON tables TO GROUP dbt ;

ALTER DEFAULT PRIVILEGES FOR USER etl IN SCHEMA ext
   GRANT ALL PRIVILEGES ON tables TO GROUP dbt ;

-- GROUP analytics
GRANT USAGE ON SCHEMA hevo, ext, flatten, staging, intermediate, restricted, analytics, meta TO GROUP analytics ;
GRANT USAGE, CREATE ON SCHEMA ad_hoc TO GROUP analytics ;

GRANT SELECT ON ALL tables IN SCHEMA hevo, ext, flatten, staging, intermediate, restricted, analytics, meta TO GROUP analytics ;
GRANT ALL PRIVILEGES ON ALL tables IN SCHEMA ad_hoc TO GROUP analytics ;

ALTER DEFAULT PRIVILEGES FOR USER dbt IN SCHEMA flatten, staging, intermediate, restricted, analytics, meta
   GRANT SELECT ON tables TO GROUP analytics ;

ALTER DEFAULT PRIVILEGES FOR USER dbt IN SCHEMA ad_hoc
   GRANT ALL PRIVILEGES ON tables TO GROUP analytics ;

-- GROUP bi
GRANT USAGE ON SCHEMA analytics, meta, ad_hoc, snapshots TO GROUP bi ;

GRANT SELECT ON ALL tables IN SCHEMA analytics, meta, ad_hoc TO GROUP bi ;

ALTER DEFAULT PRIVILEGES FOR USER dbt IN SCHEMA analytics, meta, ad_hoc
   GRANT SELECT ON tables TO GROUP bi ;

-- GROUP business users
GRANT USAGE ON SCHEMA analytics TO GROUP business_users ;
GRANT USAGE, CREATE ON SCHEMA ad_hoc TO GROUP business_users ;

GRANT SELECT ON ALL tables IN SCHEMA analytics TO GROUP business_users ;
GRANT ALL PRIVILEGES ON ALL tables IN SCHEMA ad_hoc TO GROUP business_users ;

ALTER DEFAULT PRIVILEGES FOR USER dbt IN SCHEMA analytics
   GRANT SELECT ON tables TO GROUP business_users ;
      
      





Amazon Redshift automatic workload management (WLM), .. . - . Redshift :







  • ELT
  • BI Ad-hoc
  • โ€œโ€
  • ,


Queue configuration (WLM) allows flexible management of cluster load







(WLM)







, .







short-query acceleration. Amazon Redshift ML- , (~ 30 .) , . , Wheely , BI .







Data Lake



? ! Redshift Spectrum :







  • SQL DDL ( S3)
  • : (SELECT), (WHERE), (JOINs), (INSERT)
  • - : JSON, ORC, PARQUET


EXTERNAL TABLE _Hive. _ : Hive Metastore.







Spectrum provides a wide range of popular data formats in Redshift







Spectrum Redshift







Wheely Spectrum:







1. Data Quality Pipeline







S3 Spectrum - (MongoDB) (Redshift). : QA , .







2. S3







>3 S3 parquet. , Redshift , .







S3 ( parquet)
UNLOAD ('SELECT * FROM "hevo"."events_prod_clickstream_archive"')
 TO 's3://wheely_analytics/dwh/hevo/events_prod/clickstream_archive' 
 IAM_ROLE ''
 MANIFEST
 FORMAT AS PARQUET
 NULL AS ''
 MAXFILESIZE AS 256 
 ENCRYPTED AUTO
 ;
      
      





: -. .









Instead of a thousand words













, . . , . production-, ad-hoc .







, (target table). . , . . , , , .







Recovering a deleted or damaged table from a backup is now easier and faster than ever













, , 7 .







: in-database ML, native semi-structured data support



. Wheely .







, Amazon Redshift, preview ( ):









, . ! , , .







? Wheely. โ€” .







: https://t.me/enthusiastech.







.








All Articles