Data catalogs have become an essential part of modern data infrastructure and management. A catalog gives an overview of the data available in an organisation by storing metadata about it, such as location, format, and columns/attributes. This is especially important in combination with modern data team architectures such as Data Mesh, where every team can contribute to the data catalog.
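To make the idea of "metadata about data" concrete, here is a minimal sketch of what a catalog entry could look like. It is not how any particular catalog product models its data: the `Column`, `DatasetEntry`, and `InMemoryCatalog` names, the `owner_team` field, and the example paths are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Column:
    name: str
    dtype: str
    description: str = ""


@dataclass
class DatasetEntry:
    # All fields are illustrative; real catalogs track far more metadata.
    name: str
    location: str      # where the data lives, e.g. a bucket path or table name
    data_format: str   # e.g. "parquet" or "csv"
    owner_team: str    # hypothetical: the team that contributed this entry
    columns: List[Column] = field(default_factory=list)


class InMemoryCatalog:
    """A toy catalog: it stores metadata only, never the data itself."""

    def __init__(self) -> None:
        self._entries: Dict[str, DatasetEntry] = {}

    def register(self, entry: DatasetEntry) -> None:
        # Any team can add or update the entry for a dataset it owns.
        self._entries[entry.name] = entry

    def find_by_column(self, column_name: str) -> List[str]:
        # Return the names of all datasets that expose the given column.
        return sorted(
            entry.name
            for entry in self._entries.values()
            if any(col.name == column_name for col in entry.columns)
        )


catalog = InMemoryCatalog()
catalog.register(DatasetEntry(
    name="orders",
    location="s3://example-bucket/orders/",  # made-up path
    data_format="parquet",
    owner_team="sales",
    columns=[Column("order_id", "string"), Column("customer_id", "string")],
))
catalog.register(DatasetEntry(
    name="payments",
    location="s3://example-bucket/payments/",  # made-up path
    data_format="csv",
    owner_team="finance",
    columns=[Column("payment_id", "string"), Column("customer_id", "string")],
))

print(catalog.find_by_column("customer_id"))
```

Even this toy version shows the main benefit: someone looking for customer data can discover which datasets contain a `customer_id` column without knowing in advance which team owns them or where they are stored.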
There’s a growing interest in the industry in improving the productivity of data engineers and scientists with metadata. The following projects were released over the past several years:
Naturally, a few startups have appeared in this area as well; my favorite is Tree Schema.
So far I’ve only tried Amundsen together with Apache Atlas for tracking data lineage, and they make a great pair. However, one part is lacking in almost every product: the ability to set and track data quality, which I would like to cover in another article.