Enterprise Data Catalog


A Data Catalog Can Improve Productivity: Gus Segura

The database catalog of a database instance consists of metadata that stores the definitions of database objects such as data lake source interfaces, source files, base tables, views (virtual tables), synonyms, value ranges, indexes, users, and user groups. Put another way: “An enterprise metadata catalog is a set of relational tables that contain information about how to convert data from non-relational to relational formats.”

The Data Catalog can be used by DBAs, developers and users to locate the data in your enterprise. Do you build or implement data platforms? Demand for data catalogs is increasing, and so is the number of different data sources in an enterprise. The following is a brief article on the reasons to build and/or implement a data catalog.

Most modern enterprises implement many data warehouses and data lakes. The data varies widely: structured, semi-structured and unstructured. Every new and/or updated data source should be identified in this catalog, including source interfaces, source files (avro, parquet, seq, csv, etc.) [and their contents at the field level], tables, columns, attributes, transforms, aggregations, facts, dimensions, hierarchies, data marts, data models and business implementation.
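
A minimal sketch of what a field-level catalog entry might look like in Python. The class and method names (`CatalogEntry`, `FieldEntry`, `DataCatalog`, `find_by_field`) and the sample sources are assumptions for illustration, not taken from any particular catalog tool.

```python
from dataclasses import dataclass, field


@dataclass
class FieldEntry:
    """One field (column) of a cataloged source, described at the field level."""
    name: str
    data_type: str
    description: str = ""


@dataclass
class CatalogEntry:
    """One cataloged data source: an interface, file, or table (hypothetical shape)."""
    name: str
    kind: str            # e.g. "source_file", "table", "data_mart"
    file_format: str     # e.g. "avro", "parquet", "csv"; "" if not a file
    fields: list[FieldEntry] = field(default_factory=list)


class DataCatalog:
    """In-memory registry; a real catalog would persist entries in a database."""

    def __init__(self) -> None:
        self._entries: dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        # Registering again under the same name records the updated definition.
        self._entries[entry.name] = entry

    def find_by_field(self, field_name: str) -> list[str]:
        """Return the names of all entries that contain the given field."""
        return sorted(
            e.name for e in self._entries.values()
            if any(f.name == field_name for f in e.fields)
        )


catalog = DataCatalog()
catalog.register(CatalogEntry(
    name="orders_raw", kind="source_file", file_format="parquet",
    fields=[FieldEntry("order_id", "string"), FieldEntry("amount", "decimal")],
))
catalog.register(CatalogEntry(
    name="shipments", kind="table", file_format="",
    fields=[FieldEntry("shipment_id", "string"), FieldEntry("order_id", "string")],
))
print(catalog.find_by_field("order_id"))  # ['orders_raw', 'shipments']
```

Even this small structure lets a user ask "where does `order_id` live?" without opening each file or table.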

Beyond structure, it’s important to understand the data profile at scale, the lineage of the data and its origins (provenance). Finally, we’d like to track who has access and why, how the data is being used, and how best to meet any SLA, security and compliance requirements. You can’t simply build a data platform and walk away.
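
To make lineage and provenance concrete, here is a rough sketch that walks a lineage map back to its root sources. The `LINEAGE` map, the data set names and the `trace_origins` helper are all hypothetical; a real catalog would load this from its own metadata store.

```python
# Hypothetical lineage map: each data set points at the data sets it was derived from.
LINEAGE = {
    "sales_dashboard": ["sales_mart"],
    "sales_mart": ["orders_clean", "customers_clean"],
    "orders_clean": ["orders_raw"],
    "customers_clean": ["crm_export"],
}


def trace_origins(dataset: str, lineage: dict[str, list[str]]) -> set[str]:
    """Return the root sources feeding `dataset`.

    A root is any data set with no recorded upstream; a root traces to itself.
    """
    origins: set[str] = set()
    seen: set[str] = set()
    stack = [dataset]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guard against cycles and repeated visits
        seen.add(current)
        upstream = lineage.get(current, [])
        if not upstream:
            origins.add(current)
        stack.extend(upstream)
    return origins


print(sorted(trace_origins("sales_dashboard", LINEAGE)))  # ['crm_export', 'orders_raw']
```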

This is the start of a series I’m doing on the data pipeline (real-time and batch). I wanted to explore some of the reasons why you would implement a catalog during development and how it can increase productivity throughout the life cycle of the data warehouse, business intelligence and data science platform. If you’re thinking of implementing a new project and have questions about implementing data governance, data quality and a metadata catalog, please email me at support@blueskymetrics.com.

Top Reasons to Implement a Data Catalog

    • Support – Integration of your Business Intelligence Tool(s) – Environment
      • Most BI and data warehousing tools have integrated metadata repositories. They work great! However, most users will not have access to the BI metadata tool for a number of reasons (see Security). So, in order to maximize self-service (enabling users to find the data), I find it best to have a separate Data Catalog tool that is more focused on delivering business value to users and developers. Most newer BI tools have APIs that enable integration.
    • DevOps: Tighter integration with Release Process, Model Enhancements, Automation
      • DevOps tools are getting better at automation and integration, along the lines mentioned previously. The development process starts with the pipeline, source files, model definitions, etc. Automate your data pipeline and feature delivery to make the process as optimized as possible. Again, script the processes that you do over and over again.
    • Allow for Multiple Interpretations of the Models and Data
      • I’ve been at many places where people look at the same data set in different ways. Accounting is looking at cost, Sales may be looking at volume, Shipping may be looking at logistics. It’s all the same data. A good Data Catalog will support multiple interpretations, business models and data flows of the same data sets. It’s important to understand that each business unit has its own perceptions and requirements. Capture them in the Enterprise Data Catalog and increase your productivity.
    • Data Quality
      • This is a whole other discussion and I will circle back. The point: data quality starts before the pipeline. It starts with the business understanding of the source data. The volume, value, velocity and variation are the key points to store in your data catalog. Set the expectation from the start. DQ can be your single biggest cost in compute, data transfer rates, network and team resources. If you can’t write it down, you can’t track and measure it.
    • Security
      • Access – Who can see what? Authentication – How are the users and groups authenticated to access data? Some parts of the pipeline at the attribute level may not be available to all users. You may not even want some users to ‘know’ that the data, fields or elements exist. So, help make your security administrator’s job easier and include the team early in the process. Use the Data Catalog to help track this valuable metadata.
    • Compliance: Master Data Management
      • Remove duplication (the 2nd biggest cost in a lot of enterprises). Standardize business definitions. Track root causes for problem resolution. MDM should be a journey toward collecting, aggregating, matching, consolidating, quality-assuring, persisting and distributing data. It does not have to happen on day one.
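
The attribute-level access idea from the Security point above can be sketched roughly as follows. The `FIELD_ACCESS` rules, the group names and the `visible_fields` helper are assumptions for illustration; in practice these rules would come from your security administrator and identity provider.

```python
# Hypothetical rule set: which groups may see each field of one cataloged table.
FIELD_ACCESS = {
    "order_id": {"sales", "accounting", "shipping"},
    "amount": {"sales", "accounting"},
    "customer_ssn": {"compliance"},
}


def visible_fields(user_groups: set[str], field_access: dict[str, set[str]]) -> list[str]:
    """List only the fields a user may know about.

    Fields the user cannot access are omitted entirely, so the user
    does not even learn that they exist.
    """
    return sorted(
        name for name, allowed in field_access.items()
        if user_groups & allowed  # at least one shared group grants visibility
    )


print(visible_fields({"shipping"}, FIELD_ACCESS))    # ['order_id']
print(visible_fields({"accounting"}, FIELD_ACCESS))  # ['amount', 'order_id']
```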

Next month, I will be reviewing the top data catalog tools I’ve implemented and perhaps one or two that look very interesting. Remember, a LOT of BI tools have a metadata repository and/or data catalog integrated into the tool. A big chunk of the market just uses the features of the BI tool, and that works for them…

Remember, The Data Warehouse is Different

  • Data Discovery – Enabling Data Sets for Analysis, Reporting, Data Science.
  • Data Quality – 1st Step, A Data Catalog of your Enterprise Data.
  • There are many ways to move data in a Data Lake.
  • A Data Engineer is not the same as an SDE – The data pipeline uses different tools.
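
Since data quality starts with what the catalog says about a source, here is a rough sketch of recording an expected profile up front and checking a daily load against it. The `Expectation` fields, the thresholds and the `check_profile` helper are assumptions for illustration, not a specific tool's API.

```python
from dataclasses import dataclass


@dataclass
class Expectation:
    """Expected profile of a source, written down in the catalog from the start."""
    min_rows_per_day: int
    max_rows_per_day: int
    max_null_fraction: float


def check_profile(expected: Expectation, rows_today: int, null_fraction: float) -> list[str]:
    """Compare an observed daily profile with the catalog's expectation."""
    problems = []
    if rows_today < expected.min_rows_per_day:
        problems.append(f"volume too low: {rows_today} rows")
    if rows_today > expected.max_rows_per_day:
        problems.append(f"volume too high: {rows_today} rows")
    if null_fraction > expected.max_null_fraction:
        problems.append(f"too many nulls: {null_fraction:.1%}")
    return problems


# Hypothetical expectation for an orders feed.
orders = Expectation(min_rows_per_day=10_000, max_rows_per_day=50_000, max_null_fraction=0.02)
print(check_profile(orders, rows_today=4_200, null_fraction=0.05))
# ['volume too low: 4200 rows', 'too many nulls: 5.0%']
```

Once the expectation is written down, deviations can be tracked and measured instead of discovered by users.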

However, for self-service, MDM, the data engineering development process, data quality and data governance:

  1. I find having a specific data catalog tool that is feature rich for the enterprise gives me the best productivity in my data pipeline design, maintenance and enabling users to find the data.
  2. Using just a BI tool may not be enough for your enterprise needs, or maybe it’s too much. There are a lot of ways to implement a Data Catalog that don’t include significant costs and/or architectural commitments. Check back for the review.
  3. I believe in redundancy in the enterprise (the right amount) – Most of the tools will sync (most of the time automatically).

Again, please contact me if you have a specific question and/or need. Thank you, – Gus

Do you need help with your data warehouse or data pipeline project? Please contact us.
Email: support@blueskymetrics.com – Phone: 765.325.8373 ( call / text ).
