Who Is Responsible for the Quality of Analytics: QA for the Data Warehouse

Trust your eyes and what you see on the Dashboard

At Wheely, we rely heavily on data to make operational and strategic decisions, from paying weekly bonuses to partners to expanding into new cities and countries.

Every manager or Product Owner knows their area inside out, and any deviation in the numbers raises questions. That puts high demands on the reliability of dashboards and metrics, and we in the Analytics team aim to find and fix problems before anyone reports them.

As you know, prevention is easier than cure, so I decided to approach the problem systematically and proactively. Naturally, the first thing I did was create a Slack channel and route notifications about every error in our pipelines to it.

Confidence in the freshness of data marts

, :

  • 10

  • 8

  • DWH

, QA :

  • ,

:

  • In the .yml file, set the freshness parameter:

freshness:
  warn_after: {count: 4, period: hour}
  error_after: {count: 8, period: hour}
loaded_at_field: "__etl_loaded_at"

  • SQL-:

select
 max({{ loaded_at_field }}) as max_loaded_at,
 {{ current_timestamp() }} as snapshotted_at
from {{ source }}
where {{ filter }}
  • :
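The way the two thresholds relate to the query result can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `freshness_status` and the sample timestamps are hypothetical, and the comparison is a plain reading of `warn_after` and `error_after`, not a copy of dbt's internal logic.

```python
from datetime import datetime, timedelta

# Thresholds mirror the freshness config: warn after 4 hours, error after 8.
WARN_AFTER = timedelta(hours=4)
ERROR_AFTER = timedelta(hours=8)


def freshness_status(max_loaded_at: datetime, snapshotted_at: datetime) -> str:
    """Classify a source by the age of its newest row.

    The two arguments correspond to the two columns the freshness query selects.
    """
    age = snapshotted_at - max_loaded_at
    if age > ERROR_AFTER:
        return "error"
    if age > WARN_AFTER:
        return "warn"
    return "pass"


# Hypothetical snapshot time, used only to make the example deterministic.
now = datetime(2020, 1, 1, 12, 0)
for hours in (1, 5, 9):
    print(hours, freshness_status(now - timedelta(hours=hours), now))
```

With these thresholds, a source loaded 1 hour ago passes, one loaded 5 hours ago raises a warning, and one loaded 9 hours ago is an error.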

, , :

  • (edge cases),

  • (bottleneck)

:

  • : , Out of Memory, Disk Full

  • SLA

:

  • , + ( )

  • CPU

  • - IO, network

.

:

  • ,

:

+pre-hook: "{{ logging.log_model_start_event() }}"
+post-hook: "{{ logging.log_model_end_event() }}"
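Assuming the two hooks record a start and an end event for each model in an audit table, run times can be derived with a single query. The sketch below uses an in-memory SQLite database; the table `audit_log`, its columns, the event labels and the model `fct_rides` are all hypothetical and stand in for whatever the logging macros actually write.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical audit table: one row per logged event.
conn.execute("create table audit_log (model text, event text, event_ts text)")
conn.executemany(
    "insert into audit_log values (?, ?, ?)",
    [
        ("dim_cars", "model start", "2020-01-01 10:00:00"),
        ("dim_cars", "model end", "2020-01-01 10:02:30"),
        ("fct_rides", "model start", "2020-01-01 10:02:30"),
        ("fct_rides", "model end", "2020-01-01 10:12:30"),
    ],
)

# Pair start and end events per model and list the slowest models first.
durations = conn.execute(
    """
    select
        model,
        cast(strftime('%s', max(case when event = 'model end' then event_ts end)) as integer)
      - cast(strftime('%s', max(case when event = 'model start' then event_ts end)) as integer)
        as seconds
    from audit_log
    group by model
    order by seconds desc
    """
).fetchall()

for model, seconds in durations:
    print(model, seconds)
```

Sorting by duration puts the slowest models at the top, which is where to look first when a run starts to exceed its time budget.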

, , . - , , , , PRIMARY KEY, FOREIGN KEY, NOT NULL, UNIQUE.

DWH . . .. , .

:

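  • Are there missing (NULL) values where they are not expected?

  • Are identifiers unique (UNIQUE ID)?

  • Is referential integrity preserved (PRIMARY - FOREIGN KEYS)?

  • Do columns contain only allowed values (ACCEPTED VALUES)?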

QA :

  • ,

:

  • In the .yml file, add the tests parameter:

- name: dim_cars
  description: Wheely partners cars.
  columns:
    - name: car_id
      tests:
        - not_null
        - unique
    - name: status
      tests:
        - not_null
        - accepted_values:
            values: ['deleted', 'unknown', 'active', 'end_of_life', 'pending', 'rejected',
                     'blocked', 'expired_docs', 'partner_blocked', 'new_partner']

  • SQL-

-- NOT NULL test
select count(*) as validation_errors
from "wheely"."dbt_test"."dim_cars"
where car_id is null
 
-- UNIQUE test
select count(*) as validation_errors
from (
   select
       car_id
   from "wheely"."dbt_test"."dim_cars"
   where car_id is not null
   group by car_id
   having count(*) > 1
) validation_errors
 
-- ACCEPTED VALUES test
with all_values as (
   select distinct
       status as value_field
   from "wheely"."dbt_test"."dim_cars"
),
validation_errors as (
   select
       value_field
   from all_values
   where value_field not in (
       'deleted','unknown','active','end_of_life','pending','rejected','blocked','expired_docs','partner_blocked','new_partner'
   )
)
select count(*) as validation_errors
from validation_errors
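To see what these three queries report on imperfect data, they can be run against a small in-memory SQLite table. This is a sketch under stated assumptions: the sample rows are invented, the table is a two-column stand-in for `dim_cars`, and the queries are lightly condensed versions of the ones above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table dim_cars (car_id integer, status text)")
# Invented rows: one NULL car_id, one duplicated car_id, one unexpected status.
conn.executemany(
    "insert into dim_cars values (?, ?)",
    [(1, "active"), (2, "blocked"), (2, "pending"), (None, "active"), (3, "sold")],
)

TESTS = {
    "not_null": "select count(*) from dim_cars where car_id is null",
    "unique": """
        select count(*) from (
            select car_id
            from dim_cars
            where car_id is not null
            group by car_id
            having count(*) > 1
        ) validation_errors
    """,
    "accepted_values": """
        select count(*) from (
            select distinct status
            from dim_cars
            where status not in (
                'deleted','unknown','active','end_of_life','pending','rejected',
                'blocked','expired_docs','partner_blocked','new_partner'
            )
        ) validation_errors
    """,
}

# Each query returns the number of validation errors; zero means the test passed.
results = {name: conn.execute(sql).fetchone()[0] for name, sql in TESTS.items()}
for name, errors in results.items():
    print(name, "PASSED" if errors == 0 else "FAILED", errors)
```

Each test finds exactly one problem here: the NULL `car_id`, the duplicated `car_id` 2, and the status `sold`, which is not in the accepted list.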

-

- - , . -, .

:

  • ,

  • %

( ), .

QA :

  • , -.

:

  • SQL ,

  • SQL-

  • The test passes (PASSED) if the query returns 0 rows and fails (FAILED) if it returns >= 1
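The PASSED / FAILED convention can be sketched as a small helper that runs a SQL query and counts the offending rows it returns. Everything specific in this example is hypothetical: the table `fct_payments`, the rule that a payment amount is never negative, and the helper name `run_data_test`.

```python
import sqlite3


def run_data_test(conn, sql):
    """Run a test query that selects rows violating a rule.

    Returns ("PASSED", 0) when no rows come back, otherwise ("FAILED", n).
    """
    rows = conn.execute(sql).fetchall()
    return ("PASSED" if not rows else "FAILED", len(rows))


conn = sqlite3.connect(":memory:")
conn.execute("create table fct_payments (payment_id integer, amount real)")
conn.executemany(
    "insert into fct_payments values (?, ?)",
    [(1, 10.0), (2, 25.5), (3, -4.0)],
)

# Hypothetical business rule: a payment amount is never negative.
NEGATIVE_AMOUNTS = "select payment_id from fct_payments where amount < 0"

status, failures = run_data_test(conn, NEGATIVE_AMOUNTS)
print(status, failures)
```

Writing the test as "select the rows that break the rule" keeps the failing records one query away when a check goes red.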

Continuous Integration - DWH

, . DWH . . , , , PROD- PR Merge:

  • DEV- PROD-

  • (, Out of Memory)

- Continuous Integration (CI). !

:

  • master- PROD- DWH  .

:

  • CI (, PROD-, 7 )

  • feature- master

- DWH

( ) :

  • DWH ,

  • (, , ) --

, , (, ).

:

  • , () .

, :

  • , : , , (, , ), (, , ).

  • ,

  • DWH

drill-down :

, . , :

  • ,

  • Continuous Integration and Testing

  • ( )

, Wheely. , .


:

  1. Data Build Tool (dbt)

  2. The farm-to-table testing framework

  3. Tests - Related reference docs (dbt documentation)

  4. How to get started with data testing (dbt Discourse)

  5. Data testing: why you need it

  6. Manual Work is a Bug - DRY



