Open Sourcing DataHub: LinkedIn's Metadata Search and Discovery Platform
Finding the right data quickly is essential for any company that relies on large amounts of data to make decisions. This affects not only the productivity of data users (including analysts, machine learning developers, data scientists, and data engineers), but also the end products that depend on a quality machine learning (ML) pipeline. In addition, the trend towards adopting or building machine learning platforms naturally raises the question: how do you internally discover features, models, metrics, datasets, and so on?
In this article, we share how we open sourced DataHub, our metadata search and discovery platform, a journey that began in the early days of the WhereHows project. LinkedIn maintains its own version of DataHub separately from the open source version. We start by explaining why we need two separate development environments, then discuss our early approaches to open sourcing WhereHows and compare our internal (production) version of DataHub with the version on GitHub. We also share details about our new automated solution for pushing and pulling open source updates to keep both repositories in sync. Finally, we give instructions on how to get started with the open source DataHub and briefly discuss its architecture.
WhereHows is now DataHub!
LinkedIn DataHub ( WhereHows), LinkedIn, . - DataHub . , . DataHub GitHub.
WhereHows, LinkedIn , ; 2016 . — , LinkedIn, , LinkedIn, . , WhereHows (, . .), . WhereHows , . , .
: « »
« », , . , GitHub, . , . , .
, , , , . , , , , .
, , -. , , , WhereHows -. , , .
: « »
** « », , . , . , — , . . , , , , .
!
, WhereHows GitHub . , WhereHows LinkedIn , . — DataHub. .
LinkedIn , . , LinkedIn (ELR). , , , .
, DataHub, . , . , . ( DataHub), . .
DataHub , . :
- LinkedIn / , rsync.
- , Apache Rat.
- .
- , , .
, .
DataHub , GitHub, DataHub LinkedIn ( multiproducts). DataHub, , LinkedIn. DataHub .
1: LinkedIn DataHub DataHub
, , . , , .
{
  "datahub-dao": [
    "${datahub-frontend}/datahub-dao"
  ],
  "gms/impl": [
    "${dataset-gms}/impl",
    "${user-gms}/impl"
  ],
  "metadata-dao": [
    "${metadata-models}/metadata-dao"
  ],
  "metadata-builders": [
    "${metadata-models}/metadata-builders"
  ]
}
— JSON, , — LinkedIn. . Bash. , , .
{
  "${metadata-models}/metadata-builders/src/main/java/com/linkedin/Foo.java":
    "metadata-builders/src/main/java/com/linkedin/Foo.java",
  "${metadata-models}/metadata-builders/src/main/java/com/linkedin/Bar.java":
    "metadata-builders/src/main/java/com/linkedin/Bar.java",
  "${metadata-models}/metadata-builders/build.gradle": null
}
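To illustrate how the two JSON mappings above relate, here is a minimal Python sketch that expands a module-level mapping into a file-level one. It is not LinkedIn's internal tooling: the function name `expand_mapping`, the `excluded` parameter, and the in-memory file listing are assumptions made for this example. Source paths keep the `${repo}` placeholder form shown above, and an excluded file maps to `None` (JSON `null`).

```python
import json
from typing import Dict, Iterable, List, Optional


def expand_mapping(
    module_mapping: Dict[str, List[str]],
    source_files: Iterable[str],
    excluded: Iterable[str] = (),
) -> Dict[str, Optional[str]]:
    """Expand a module-level mapping (target module -> source modules)
    into a file-level mapping (source file -> target file, or None)."""
    excluded_set = set(excluded)
    file_mapping: Dict[str, Optional[str]] = {}
    for path in sorted(source_files):
        for target_module, source_modules in module_mapping.items():
            for source_module in source_modules:
                prefix = source_module + "/"
                if not path.startswith(prefix):
                    continue
                if path in excluded_set:
                    # Hypothetical convention: excluded files map to null.
                    file_mapping[path] = None
                else:
                    file_mapping[path] = target_module + "/" + path[len(prefix):]
    return file_mapping


# Hypothetical input: one module and two files taken from the examples above.
module_mapping = {
    "metadata-builders": ["${metadata-models}/metadata-builders"],
}
source_files = [
    "${metadata-models}/metadata-builders/src/main/java/com/linkedin/Foo.java",
    "${metadata-models}/metadata-builders/build.gradle",
]
file_mapping = expand_mapping(
    module_mapping,
    source_files,
    excluded=["${metadata-models}/metadata-builders/build.gradle"],
)
print(json.dumps(file_mapping, indent=2))
```

Note that a target such as `gms/impl` receives files from several source modules, so a real tool would also need to detect two source files that land on the same target path; this sketch does not handle that case.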
; . 1: 1 LinkedIn . , :
- , , FQCN, . « ».
- «null» , .
- . .
metadata-models 29.0.0 -> 30.0.0
Added aspect model foo
Fixed issue bar
dataset-gms 2.3.0 -> 2.3.4
Added rest.li API to serve foo aspect
MP_VERSION=dataset-gms:2.3.4
MP_VERSION=metadata-models:30.0.0
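The commit message above has a regular shape: one header line per product with its version bump, the change notes for that product, and a trailing block of `MP_VERSION=` lines. As a hedged illustration only, the following Python sketch assembles a message of that shape. The `VersionBump` type and `format_commit_message` function are hypothetical names, and sorting the `MP_VERSION` lines by product name is an assumption that happens to match the sample.

```python
from typing import List, NamedTuple


class VersionBump(NamedTuple):
    product: str
    old_version: str
    new_version: str
    changes: List[str]


def format_commit_message(bumps: List[VersionBump]) -> str:
    """Build a commit message in the shape of the sample above."""
    lines: List[str] = []
    for bump in bumps:
        lines.append(f"{bump.product} {bump.old_version} -> {bump.new_version}")
        lines.extend(bump.changes)
    # Assumption: MP_VERSION lines are ordered by product name.
    for bump in sorted(bumps, key=lambda b: b.product):
        lines.append(f"MP_VERSION={bump.product}:{bump.new_version}")
    return "\n".join(lines)


message = format_commit_message([
    VersionBump("metadata-models", "29.0.0", "30.0.0",
                ["Added aspect model foo", "Fixed issue bar"]),
    VersionBump("dataset-gms", "2.3.0", "2.3.4",
                ["Added rest.li API to serve foo aspect"]),
])
print(message)
```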
LinkedIn , , . DataHub , - , , DataHub , .. , (, , ) , DataHub , . , , -, .
, , , . , , DataHub .
DataHub
DataHub, , . DataHub LinkedIn, .
, , LinkedIn's Offspring ( LinkedIn). Offspring , . ; DataHub .
. LinkedIn, LinkedIn . , . , DataHub . , , , .
DataHub. — . , () , .
— GMS ( ) , GMS. GMA ( ) — DataHub, GMS — GMA. GMA — , (, , . .) , , GMS . GMS, DataHub .
.
Product Features | LinkedIn DataHub | Open Source DataHub |
---|---|---|
Supported Data Constructs | 1) Datasets 2) Users 3) Metrics 4) ML Features 5) Charts 6) Dashboards | 1) Datasets 2) Users |
Supported Metadata Sources for Datasets | 1) Ambry 2) Couchbase 3) Dalids 4) Espresso 5) HDFS 6) Hive 7) Kafka 8) MongoDB 9) MySQL 10) Oracle 11) Pinot 12) Presto 13) Seas 14) Teradata 15) Vector 16) Venice | 1) Hive 2) Kafka 3) RDBMS |
Pub-sub | LinkedIn Kafka | Confluent Kafka |
Stream Processing | Managed | Embedded (standalone) |
Dependency Injection & Dynamic Configuration | LinkedIn Offspring | Spring |
Build Tooling | Ligradle (LinkedIn’s internal Gradle wrapper) | Gradlew |
CI/CD | CRT (LinkedIn’s internal CI/CD) | TravisCI and Docker Hub |
Metadata Stores | Distributed multiple GMS: 1) Dataset GMS 2) User GMS 3) Metric GMS 4) Feature GMS 5) Chart/Dashboard GMS | Single GMS for: 1) Datasets 2) Users |
Docker
Docker . DataHub , , Kafka, Elasticsearch, Neo4j MySQL, Docker. Docker Docker Compose.
2: DataHub * **
DataHub . , Docker:
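datahub-gms: the metadata store service (GMS).
datahub-frontend: a Play application that serves the DataHub interface.
datahub-mce-consumer: a Kafka Streams application that consumes the Metadata Change Event (MCE) stream and updates the metadata store.
datahub-mae-consumer: a Kafka Streams application that consumes the Metadata Audit Event (MAE) stream and builds the search index and graph database.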
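As a rough illustration of how the two consumer containers relate, the following toy Python sketch replaces Kafka, the metadata store, and the search index with in-memory structures. It assumes, in line with DataHub's public documentation, that processing an MCE updates the metadata store and results in an MAE, which is then used to update the search index. All names and the simplified URN are hypothetical, and the graph database is omitted.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple

# In-memory stand-ins for Kafka topics and storage (hypothetical names).
mce_topic: Deque[Tuple[str, Dict[str, str]]] = deque()
mae_topic: Deque[Tuple[str, Dict[str, str], Dict[str, str]]] = deque()
metadata_store: Dict[str, Dict[str, str]] = {}
search_index: Dict[str, List[str]] = {}


def run_mce_consumer() -> None:
    """Apply each proposed change to the store, then emit an audit event."""
    while mce_topic:
        urn, proposed = mce_topic.popleft()
        old = dict(metadata_store.get(urn, {}))
        new = {**old, **proposed}
        metadata_store[urn] = new
        mae_topic.append((urn, old, new))


def run_mae_consumer() -> None:
    """Index the new metadata values of each audit event by lowercase token."""
    while mae_topic:
        urn, _old, new = mae_topic.popleft()
        for value in new.values():
            for token in value.lower().split():
                urns = search_index.setdefault(token, [])
                if urn not in urns:
                    urns.append(urn)


# A simplified, made-up URN and description for demonstration.
mce_topic.append(("urn:li:dataset:foo", {"description": "Page view events"}))
run_mce_consumer()
run_mae_consumer()
print(metadata_store)
print(search_index["page"])
```

In this sketch the MCE consumer writes to the store and emits the audit event itself; in a real deployment those responsibilities may be split differently between the consumer and the metadata service.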
DataHub .
CI / CD DataHub
DataHub TravisCI Docker Hub . GitHub . , (, Confluent), Docker, Docker Hub . Docker, Docker Hub, docker pull.
DataHub Docker Docker Hub «latest». Docker Hub , Docker Hub.
DataHub
DataHub :
- Docker docker-compose docker-compose .
- , , , .
- DataHub .
Gitter . GitHub. , !
DataHub Docker, docker-compose. Kubernetes, Kubernetes .
We also plan to provide a turnkey solution for deploying DataHub on a public cloud service such as Azure, AWS, or Google Cloud. Given the recent announcement of LinkedIn's migration to Azure, this aligns with the internal priorities of the metadata team.
Last but not least, thanks to all the early adopters of DataHub in the open source community who evaluated the alpha versions of DataHub and helped us identify issues and improve the documentation.