Evgenii Karimov - Notes about data catalog solutions

Data catalogs have become an essential part of modern data infrastructure and management. A catalog gives an overview of the data present in an organisation by storing metadata about it, such as location, format, and columns/attributes. This matters especially with modern data team architectures such as Data Mesh, where every team can contribute to the data catalog.
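To make the idea of "metadata about data" concrete, here is a minimal sketch of what a catalog entry could look like. It is not modelled on any of the products mentioned in this post; the class names (`Column`, `CatalogEntry`, `DataCatalog`), the `owner_team` field, and the example bucket path are all hypothetical and only serve the illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Column:
    """One column/attribute of a dataset (hypothetical structure)."""
    name: str
    dtype: str
    description: str = ""


@dataclass
class CatalogEntry:
    """Metadata describing a dataset, not the data itself."""
    name: str
    location: str      # where the data lives (assumed example path below)
    data_format: str   # e.g. "parquet" or "csv"
    owner_team: str    # assumed field: the team contributing the entry
    columns: List[Column] = field(default_factory=list)


class DataCatalog:
    """A toy in-memory catalog keyed by dataset name."""

    def __init__(self) -> None:
        self._entries: Dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        # Any team can add or update the entry for a dataset it owns.
        self._entries[entry.name] = entry

    def find_by_column(self, column_name: str) -> List[str]:
        # Return the names of datasets exposing the given column, sorted.
        return sorted(
            entry.name
            for entry in self._entries.values()
            if any(col.name == column_name for col in entry.columns)
        )


catalog = DataCatalog()
catalog.register(CatalogEntry(
    name="orders",
    location="s3://example-bucket/orders/",
    data_format="parquet",
    owner_team="sales",
    columns=[Column("order_id", "bigint"), Column("customer_id", "bigint")],
))
catalog.register(CatalogEntry(
    name="customers",
    location="s3://example-bucket/customers/",
    data_format="parquet",
    owner_team="crm",
    columns=[Column("customer_id", "bigint"), Column("email", "string")],
))

matches = catalog.find_by_column("customer_id")
print(matches)
```

Even a toy like this shows the main benefit: a question such as "which datasets contain `customer_id`?" is answered from metadata alone, without touching the underlying storage.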

There is growing interest in the industry in using metadata to make data engineers and data scientists more productive. The following projects have been released over the past several years:

Dataportal by Airbnb
Databook by Uber
Amundsen by Lyft
Metacat by Netflix

Naturally, a few startups have appeared in this area as well; my favorite is Tree Schema.

So far I have only tried Amundsen together with Apache Atlas for tracking data lineage, and they make a great pair. However, one piece is missing from almost every product: the ability to define and track data quality, which I would like to cover in another article.
