Hello, Habr! My name is Sergey. I am a Lead Software Engineer / Stream Lead at EPAM and a Google Cloud certified engineer and architect. For over 10 years I have been doing commercial development for global companies, mainly with a focus on backend. I also love to share my knowledge. Today I want to talk about Apache Airflow, which, in my opinion, is a good tool for building pipelines.
What's the plan?
In a nutshell, I'll tell you about Airflow for those who haven't worked with it yet. All this can be found in more detail on the Internet, so I will go through only the basic concepts.
Let's see what Google Cloud Composer is, how it uses Airflow, and how it simplifies development on real projects.
Let's take a look at the development and deployment practices within Google Cloud Composer, as well as the difficulties and limitations that can be encountered when launching Airflow in Cloud Composer.
And, of course, I will share useful tools that you can use in your work.
Airflow in a few paragraphs
Airflow is a tool, written in Python, for scheduling, building, and monitoring pipelines. There are other out-of-the-box solutions for process orchestration, such as Luigi, but here let's focus on the advantages of Airflow:
It is well suited to building pipelines. At its core is a directed acyclic graph, which lets you run tasks sequentially or in parallel and manage their order and dependencies.
It is open source.
-: , .
REST API, API.
ETL-pipeline ββ . , GCS . , , , . , , .
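Key Airflow concepts. A DAG (Directed Acyclic Graph) is the central one: the collection of tasks you want to run, organized to reflect their relationships and dependencies.
Tasks are the nodes of a DAG: the units of work to be done. Operators are the templates for those tasks and define what each one actually does. Examples here are DataLoadOperator, GoogleCloudStorageListOperator, and UpdateStatusOperator.
There are also DAG Runs and Task Instances. A DAG Run is an instance of a DAG executed at a particular time. A Task Instance is an instance of a task that belongs to a DAG Run. Both DAG Runs and Task Instances are associated with an execution date.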
DAG , , pipeline-.
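How a directed acyclic graph determines task order and parallelism can be illustrated without Airflow itself. The sketch below uses only the Python standard library (graphlib, Python 3.9+); the task names are hypothetical and this is not the Airflow API, just a model of the idea that a task becomes runnable once everything it depends on has finished.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "list_files": set(),
    "load_data": {"list_files"},
    "validate_data": {"list_files"},
    "update_status": {"load_data", "validate_data"},
}


def execution_batches(graph):
    """Group tasks into batches; tasks within one batch could run in parallel."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the graph contains a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose dependencies are all done
        batches.append(ready)
        sorter.done(*ready)  # mark them finished so their successors become ready
    return batches


batches = execution_batches(dependencies)
print(batches)
```

Here load_data and validate_data land in the same batch because both depend only on list_files, while update_status has to wait for both of them.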
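The main components of Airflow:
The Scheduler monitors DAGs and triggers the tasks whose dependencies have been met.
The Executor is the mechanism by which tasks get run. There are several, including SequentialExecutor and CeleryExecutor. CeleryExecutor requires a Queue Broker.
Workers are the processes that execute the tasks (in the case of Celery).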
- , , , HTTP-, DAG- . , , Airflow, DAG-, . .
Logs. Airflow writes logs for what it runs. In a cloud setup they can be stored remotely, for example in Stackdriver or a GCS bucket.
Admin Panel / DAG-, , , (, ββ).
?
Airflow , DAG-. , . - , . , task execution , Admin Panel. , . , deployment-.
Google Cloud Composer
Google Cloud Composer (Composer from here on) is a fully managed workflow orchestration service built on Apache Airflow that runs in the cloud.
, . , , storage A storage B. Airflow DAG, Composer . , . - , retry, . , . , cron jobs, 100 , β Airflow Composer , .
Composer:
-, . Composer Airflow Google Console UI β β, , DAG- . , DAG-, , , Composer bucket GCS. .
-, Composer , UI. Airflow, .
-, Composer security- , Google Cloud. , Private IPs, Authorization . .
-, Composer Console .
Composer. β, fully managed ββ?β , , . , :
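The Tenant Project is managed by Google and provides unified access control through Identity and Access Management. It hosts App Engine Flexible, which runs the Airflow web server, and Cloud SQL, which stores the Airflow metadata. Cloud Storage holds the Composer bucket, which is used for DAGs, plugins, and logs. The core components (the scheduler, the workers, and Redis, which serves as the message broker for CeleryExecutor) run in Google Kubernetes Engine. Finally, Composer integrates with Stackdriver for logging and monitoring.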
, , β β. , , , , DevOps-, , , , . .
, :
-, Tenant Project, Cloud SQL . β . , , .
Deployment and development
β , . :
Airflow adds DAGS_FOLDER and PLUGINS_FOLDER to sys.path, so modules placed in them can be imported. Their purposes differ: DAGS_FOLDER is for DAG files, and PLUGINS_FOLDER is for Airflow plugins.
, , libs utils. , . plugins Airflow PLUGINS, Airflow. operators, hooks, macros β , Airflow β β.
. pip requirements.txt . Composer, UI, . , CI/CD , gcloud. pip Composer.
Airflow DAGS_FOLDER PLUGINS_FOLDER sys.path, , , DAGS_FOLDER, PLUGINS_FOLDER. , Airflow, , . , Airflow , , DAG-. PLUGINS_FOLDER . plugins β . . , , .
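The effect of a folder being on sys.path can be shown with plain Python. This is a minimal sketch under stated assumptions: a temporary directory stands in for PLUGINS_FOLDER, and the module name vnd_helpers_demo and the function import_from_folder are hypothetical, made up for illustration.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def import_from_folder(folder, module_name):
    """Import a module after making sure its folder is on sys.path."""
    folder = str(folder)
    if folder not in sys.path:
        sys.path.insert(0, folder)
    importlib.invalidate_caches()  # pick up files created after interpreter start
    return importlib.import_module(module_name)


# A temporary directory stands in for a plugins folder (assumption for the demo).
plugins_folder = Path(tempfile.mkdtemp())
(plugins_folder / "vnd_helpers_demo.py").write_text(
    "def greet():\n    return 'hello from plugins'\n"
)

helpers = import_from_folder(plugins_folder, "vnd_helpers_demo")
print(helpers.greet())
```

Because every folder on sys.path shares one module namespace, a file whose name matches another importable module can shadow it, which is one reason to keep helper code under a distinctive top-level package name.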
Airflow: . Composer read-only. .
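Use .airflowignore. It works much like .gitignore: it lists the files and directories that Airflow should skip when it scans for code to load. It can also be placed in PLUGINS_FOLDER. With .airflowignore, Airflow does not try to parse Python files that are not DAGs.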
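To get a feel for how an ignore file filters paths, here is a simplified model in plain Python. It assumes the regular-expression syntax that the Airflow documentation describes for .airflowignore; the patterns, file names, and helper functions are hypothetical, and Airflow's real matching has more details than this sketch covers.

```python
import re


def load_ignore_patterns(text):
    """Compile one regular expression per non-empty, non-comment line."""
    patterns = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if line:
            patterns.append(re.compile(line))
    return patterns


def is_ignored(path, patterns):
    """A path is ignored if any pattern matches anywhere in it."""
    return any(pattern.search(path) for pattern in patterns)


# Hypothetical ignore file: helper packages and test files contain no DAGs.
ignore_file = r"""
# helper packages
libs/
utils/
.*_test\.py
"""

patterns = load_ignore_patterns(ignore_file)
files = ["etl_dag.py", "libs/gcs.py", "utils/dates.py", "etl_dag_test.py"]
kept = [name for name in files if not is_ignored(name, patterns)]
print(kept)
```

Only etl_dag.py survives the filter, so it is the only file that would be parsed as a potential DAG in this model.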
Airflow. Airflow , UI-, View. operators, hooks . , Airflow PLUGINS_FOLDER sys.path, , , from vnd.operators.my_operator. , , Airflow. , :
from vnd.operators.my_operator import MyOperator
from vnd.sensors.my_sensor import MySensor
, . , , Airflow plugins AirflowPlugin .
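Keep heavy work out of the top level of DAG files. Imports and computations that are needed only inside a callable should be loaded lazily, inside that callable, because top-level code runs every time the DAG file is parsed.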
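A lazy import inside a callable looks like this. The sketch is plain Python with a hypothetical function name, and the standard-library statistics module stands in for a genuinely heavy dependency; the point is only that the import runs when the function is called, not when the file is loaded.

```python
def summarize(values):
    """Callable meant to run at task execution time."""
    # Imported lazily: loaded on the first call, not when this file is parsed.
    import statistics

    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }


result = summarize([1, 2, 3, 10])
print(result)
```

The trade-off is that an import error surfaces only when the task runs, so it is worth covering such callables with unit tests.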
CI/CD. Linters and isort can run in GitLab CI/CD. Any CI/CD tool will do: Jenkins, GitLab pipelines, Spinnaker.
, linters, unit , β . Composer, gcloud rsync.
Composer , rsync, Airflow. , gcloud composer, DAG-, - . , , - . rsync, , /.
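The two upload routes mentioned here can be sketched as command builders for a CI/CD step. This is an assumption-laden illustration: the function names, folder, bucket, environment, and location are hypothetical, and the code only builds and prints the commands rather than running them. The gsutil rsync flags (-r, -d) and the gcloud composer environments storage dags import command are existing CLI features.

```python
import shlex


def rsync_dags_command(local_dags_dir, bucket):
    """gsutil command that mirrors a local folder into the bucket's dags/ folder.

    -r recurses into subfolders; -d deletes remote files missing locally,
    so double-check the destination before running it.
    """
    return ["gsutil", "-m", "rsync", "-r", "-d",
            local_dags_dir, f"gs://{bucket}/dags"]


def import_dags_command(environment, location, source):
    """gcloud command that copies files into the environment's dags/ folder."""
    return ["gcloud", "composer", "environments", "storage", "dags", "import",
            "--environment", environment,
            "--location", location,
            "--source", source]


# Hypothetical names, for illustration only.
rsync_cmd = rsync_dags_command("dags/", "example-composer-bucket")
import_cmd = import_dags_command("example-env", "europe-west1", "dags/etl_dag.py")

print(shlex.join(rsync_cmd))
print(shlex.join(import_cmd))
```

The import route only adds or overwrites files, while rsync with -d also removes files that no longer exist in the repository, so the bucket ends up mirroring the source folder.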
, Airflow . , , . , , Airflow. . afctl β CLI-, Airflow . , , DAG- . , afctl , , ( ).
, , Airflow. , Google Cloud Platform, AWS, Azure, . providers Airflow.
Airflow Plugins β , CRM , . . GitHub: Airflow , , .
:
, Airflow self-service, Cloud Composer.
Cloud Composer, , ββ (scaling to zero). , , . Composer : , A/B-, . , GKE , .
Composer, , Prod/Dev/Staging. . Composer Airflow, , Google Airflow. Composer : , image, Airflow.
DAG β , . , pip.
, , . , .
AirflowPlugin β UI , Airflow. .
.airflowignore , Airflow. , .airflowignore DAG- , Airflow DAG- DAG-. , DAG- , , , .
DAG- Cloud Composer. DAG- . β ββ.
Airflow. , GoogleCloudStorageListOperator , , . - , .
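KubernetesPodOperator lets you run code that is not written in Python. It works with Kubernetes: Airflow launches the task in a Pod. It can also be used in a Composer environment.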
Composer, read-only Airflow. .
, , Airflow Composer :)