DataHub: a versatile metadata search and discovery tool




LinkedIn operates the world's largest professional network and the Economic Graph, and our data team is constantly working to scale its infrastructure to meet the demands of our ever-growing big data ecosystem. As data grows in volume and variety, it becomes increasingly difficult for data scientists and engineers to discover the available data assets, understand their provenance, and take appropriate action based on the data. To help us keep scaling productivity and innovation with data, we created a comprehensive metadata search and discovery tool, DataHub.



Editor's Note: Since this blog post was published, the team open sourced DataHub in February 2020. Learn more about the journey of open sourcing the platform here.



Scaling metadata



To improve the productivity of LinkedIn's data teams, we previously developed and open sourced WhereHows, a central metadata repository and portal for datasets. The metadata it stores includes both technical metadata (e.g., location, schema, partitions, ownership) and process metadata (e.g., lineage, job execution). WhereHows also includes a search engine that helps users find datasets of interest.



Since we first released WhereHows in 2016, the industry has shown growing interest in using metadata to improve the productivity of data scientists. Tools developed in this space include Airbnb's Dataportal, Uber's Databook, Netflix's Metacat, Lyft's Amundsen, and most recently Google's Data Catalog. At LinkedIn, we have also been busy expanding our metadata collection to support new use cases while preserving privacy. However, we concluded that WhereHows had fundamental limitations that prevented it from meeting our growing metadata needs. Here is what we learned from scaling WhereHows:



  1. Push , pull: . API . push .
  2. , : WhereHows , . API, . . , , , . , , , .
  3. , . , , , . — , Hadoop, . , . , , .
  4. . (, , ), , , , . .
  5. : , , ( ). , ( , , , , API , , , , , . .), .


DataHub



WhereHows , . LinkedIn , , . , , DataHub, : LinkedIn , .



WhereHows : . , . DataHub , 19 , , , , , , . , , , API .





- DataHub — , . Ember Framework Play. , -, ES9, ES.Next, TypeScript, Yarn with Yarn Workspaces, , Prettier ESLint. , , .





, - DataHub , , . Yarn Workspaces Ember Ember. , (, ) (, Ember npm / Yarn), - DataHub .



, . , , , , .



DataHub



: (1) , (2) (3) / . :















, , . , . , OR, NOT , .



DataHub , . , , . , « », .



— / — . « », . , , , , . , , , , , , , . . , , .





DataHub, , . :



  1. : .
  2. : API, .
  3. : , .
  4. : , .




, — «, ». , :



  1. — : , , , .
  2. : , . , , (ACL) , , , . , , .


, , Pegasus, , LinkedIn. Pegasus . , Pegasus , .



, Pegasus , , - (ERD).





— , — . , OwnedBy, HasMember HasAdmin. , , , , .



ERD, , . , « ». , , . : , . . , , . .



, , , , , firstName , firstName . , « » .



Pegasus, , Pegasus (PDSC). . -, PDSC User:



{
  "type": "record",
  "name": "User",
  "fields": [
    {
      "name": "urn",
      "type": "com.linkedin.common.UserUrn"
    },
    {
      "name": "firstName",
      "type": "string",
      "optional": true
    },
    {
      "name": "lastName",
      "type": "string",
      "optional": true
    },
    {
      "name": "ldap",
      "type": "com.linkedin.common.LDAP",
      "optional": true
    }
  ]
}
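As a rough illustration of how the `optional` flags in the User schema above might be read, the following Python sketch checks a record against the schema's field list. The `missing_required_fields` helper, the sample values, and the `urn:li:corpuser:...` string form are assumptions made for this example; they are not part of Pegasus or DataHub. Real Pegasus validation also checks field types, which this sketch ignores.

```python
import json
from typing import Dict, List

# The User schema from above, kept as plain JSON.
USER_SCHEMA = json.loads("""
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "urn", "type": "com.linkedin.common.UserUrn"},
    {"name": "firstName", "type": "string", "optional": true},
    {"name": "lastName", "type": "string", "optional": true},
    {"name": "ldap", "type": "com.linkedin.common.LDAP", "optional": true}
  ]
}
""")


def missing_required_fields(schema: Dict, record: Dict) -> List[str]:
    """Return names of non-optional schema fields absent from the record.

    Hypothetical helper: it only checks presence, not field types.
    """
    return [
        field["name"]
        for field in schema["fields"]
        if not field.get("optional", False) and field["name"] not in record
    ]


# Sample values are made up; the URN string format is an assumption.
complete = {"urn": "urn:li:corpuser:jdoe", "firstName": "Jane"}
incomplete = {"firstName": "Jane", "ldap": "jdoe"}

print(missing_required_fields(USER_SCHEMA, complete))    # []
print(missing_required_fields(USER_SCHEMA, incomplete))  # ['urn']
```

Only `urn` lacks the `optional` flag, so it is the only field a record must carry in this reading of the schema.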


URN, GUID. User , , LDAP, .



PDSC OwnedBy:



{
  "type": "record",
  "name": "OwnedBy",
  "fields": [
    {
      "name": "source",
      "type": "com.linkedin.common.Urn"
    },
    {
      "name": "destination",
      "type": "com.linkedin.common.Urn"
    },
    {
      "name": "type",
      "type": "com.linkedin.common.OwnershipType"
    }
  ],
  "pairings": [
    {
      "source": "com.linkedin.common.urn.DatasetUrn",
      "destination": "com.linkedin.common.urn.UserUrn"
    }
  ]
}
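The `pairings` entry in the OwnedBy schema above restricts which URN types may appear as source and destination. A minimal Python sketch of such a check could look like the following. It assumes URN strings of the form `urn:li:<entityType>:<id>` and a hand-written mapping from URN class names to entity types; the mapping, the helper names, and the sample URNs are all hypothetical and only illustrate the idea.

```python
from typing import Dict, List

# The "pairings" entry from the OwnedBy schema above.
OWNED_BY_PAIRINGS: List[Dict[str, str]] = [
    {
        "source": "com.linkedin.common.urn.DatasetUrn",
        "destination": "com.linkedin.common.urn.UserUrn",
    }
]

# ASSUMPTION: how each URN class maps to the entity-type segment of a
# URN string. This mapping is invented for the example.
URN_CLASS_TO_ENTITY_TYPE: Dict[str, str] = {
    "com.linkedin.common.urn.DatasetUrn": "dataset",
    "com.linkedin.common.urn.UserUrn": "corpuser",
}


def entity_type(urn: str) -> str:
    """Extract <entityType> from an assumed 'urn:li:<entityType>:<id>' string."""
    parts = urn.split(":", 3)
    if len(parts) != 4 or parts[0] != "urn" or parts[1] != "li":
        raise ValueError(f"not a recognized URN: {urn!r}")
    return parts[2]


def is_allowed_pairing(
    source_urn: str, destination_urn: str, pairings: List[Dict[str, str]]
) -> bool:
    """Return True if the (source, destination) types match any declared pairing."""
    src = entity_type(source_urn)
    dst = entity_type(destination_urn)
    return any(
        URN_CLASS_TO_ENTITY_TYPE[p["source"]] == src
        and URN_CLASS_TO_ENTITY_TYPE[p["destination"]] == dst
        for p in pairings
    )


# A dataset owned by a user matches the declared pairing; the reverse does not.
print(is_allowed_pairing(
    "urn:li:dataset:page_views", "urn:li:corpuser:jdoe", OWNED_BY_PAIRINGS))  # True
print(is_allowed_pairing(
    "urn:li:corpuser:jdoe", "urn:li:dataset:page_views", OWNED_BY_PAIRINGS))  # False
```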


, , «» « », URN. , , «». , «», URN. OwnedBy .



, . , type ldap. , PDSC. « — », .



{
  "type": "record",
  "name": "Ownership",
  "fields": [
    {
      "name": "owners",
      "type": {
        "type": "array",
        "items": {
          "name": "owner",
          "type": "record",
          "fields": [
            {
              "name": "type",
              "type": "com.linkedin.common.OwnershipType"
            },
            {
              "name": "ldap",
              "type": "string"
            }
          ]
        }
      }
    }
  ]
}
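To show how the Ownership record above lines up with the OwnedBy relationship, here is a hedged Python sketch that turns one Ownership value into OwnedBy-shaped records. The `owned_by_edges` function, the `DATA_OWNER` ownership type, the sample URNs, and the rule that an LDAP name maps to `urn:li:corpuser:<ldap>` are assumptions for illustration only, not DataHub's actual logic.

```python
from typing import Dict, List


def owned_by_edges(dataset_urn: str, ownership: Dict) -> List[Dict[str, str]]:
    """Build one OwnedBy-shaped record per owner in an Ownership value.

    ASSUMPTION: a user URN can be formed from the owner's LDAP name.
    """
    return [
        {
            "source": dataset_urn,
            "destination": f"urn:li:corpuser:{owner['ldap']}",
            "type": owner["type"],
        }
        for owner in ownership["owners"]
    ]


# Hypothetical Ownership value following the schema above.
ownership = {
    "owners": [
        {"type": "DATA_OWNER", "ldap": "jdoe"},
        {"type": "DATA_OWNER", "ldap": "asmith"},
    ]
}

edges = owned_by_edges("urn:li:dataset:page_views", ownership)
for edge in edges:
    print(edge["source"], "-OwnedBy->", edge["destination"], edge["type"])
```

Each element of `owners` yields one record with the same `source`, `destination`, and `type` fields that the OwnedBy schema declares.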


, , : , ERD. « » .





DataHub : API, Kafka. , , , .



API DataHub Rest.li, RESTful, LinkedIn. Rest.li Pegasus , , , . , API — API .



, Kafka (MCE), , URN . MCE Apache Avro, Pegasus.



API Kafka . , , , . .



LinkedIn Kafka - , . MCE , , , . , Apache Samza . Samza , . Avro Pegasus API Rest.li .







, , . DataHub :



  1. -
  2. ,


DataHub , . , Espresso — NoSQL LinkedIn, - CRUD. Galene . , , , . , , .



DataHub (DAO), DAO -, DAO DAO. DAO , - - DataHub. DataHub , LinkedIn.





DAO — (CDC). , DAO «-» (MAE). MAE URN , . -, MAE , . MCE, MAE .





— . , . - Samza, MAE. MAE. , .



, URN MAE .







DataHub, LinkedIn. .



DataHub LinkedIn . 1500 , , . LinkedIn , 23 , 25 , 500 , , LinkedIn, , .



We continue to improve DataHub by adding more interesting user stories and relevance algorithms to the product. We also plan to add native GraphQL support and to use the Pegasus Domain Specific Language (PDL) to automate code generation in the near future. In the meantime, we are actively working on sharing this evolution of WhereHows with the open source community, and we will make an announcement following the public release of DataHub.



