Data Engineering DevOps


Data Engineering DevOps – It’s Different…

DevOps is the combination of Development and Operations in an Agile environment, or “DevOps is the practice of operations and development engineers participating together in the entire service lifecycle, from design through the development process to production support.” I think it’s slightly different for Data Engineering (DE) and the data pipeline tasks related to ETL. Yes, we use a lot of the same tools; however, we use them in different ways. The following is why I think Data Engineering DevOps is different from traditional software engineering, along with some cautions to remember when making the transition to DevOps for Data Engineering.

Generally, a DevOps environment will include some or all of the following practices:

  • Maintain Code Repository
  • Automate the Build
  • Test the Build
  • Commit Changes Often
  • Build after each commit
  • Fix bugs right away
  • Test in a Cloned Environment
  • Monitor Environments – Health Checks
  • Elastically Expand and/or Contract based on demand

Different Development Paradigm

The development paradigm is different. In DE, you are all over the map in the types of elements you build and where you build them. For example, you may make a database model change (DDL). This change may have to be made in your DB modeling tool and the model saved into a repository. Next, you may have to update a stored procedure, ETL job and/or trigger running in the DB and save those changes to another repository. Finally, you will have to change the UI and/or mobile app front end.

A good Business Intelligence Manager will recognize this superhuman requirement and will segment and coordinate their team. During the planning phase, the DE, UI Developer, Testers, Development DBA, Data Architect, Ops DBA and Ops Resource Team work together to understand the new feature story items, so that each team member knows what they are responsible for and when they will need to be ready to ‘put the elements together’.

In contrast, most UI or app dev team members can work on their own for some portion of the sprint cycle. DE dev teams need to communicate constantly. There are many tools to help make this happen and augment the process without creating blockers.
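To make the database model change from earlier in this section a bit more concrete, here is a minimal sketch of one way a DDL change can be kept in a repository and applied automatically: a versioned migration list. This is only an illustration under my own assumptions. The `customer` table, the `schema_version` bookkeeping table and the `apply_migrations` helper are hypothetical names, and SQLite stands in for whatever database you actually run.

```python
import sqlite3

# Hypothetical, ordered DDL changes. In a real project each entry would
# typically be a file committed to the repository alongside the model.
MIGRATIONS = [
    (1, "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"),
    (2, "ALTER TABLE customer ADD COLUMN email TEXT"),
]


def apply_migrations(conn, migrations):
    """Apply migrations newer than the recorded version; return the versions applied."""
    # Bookkeeping table (an assumption of this sketch) that records what has run.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_version (version INTEGER PRIMARY KEY)"
    )
    current = conn.execute(
        "SELECT COALESCE(MAX(version), 0) FROM schema_version"
    ).fetchone()[0]

    applied = []
    for version, ddl in sorted(migrations):
        if version <= current:
            continue  # already applied in an earlier run
        conn.execute(ddl)
        conn.execute("INSERT INTO schema_version (version) VALUES (?)", (version,))
        conn.commit()
        applied.append(version)
    return applied


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    print(apply_migrations(conn, MIGRATIONS))  # first run applies both changes
    print(apply_migrations(conn, MIGRATIONS))  # second run has nothing left to do
```

Because each change carries a version number, every team member (and the build server) can bring their own copy of the database to the same state by running the same script, which takes one coordination step off the table.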


So, you made it through Planning, Code and Build. Now you need to Test. Basic smoke, functional and regression testing is probably the same in most cases; I say “probably” because it depends on your development environment and tools. However, I want to focus on DB-specific testing like QA, UAT, Scale, Data Quality and Security. Here, I’m talking about creating multiple “snapshot” replicas of your production environment and keeping them in sync. Cloud vendors would say, “This is no problem; we can create as many replicas as you need.” You will most likely need to maintain at least a Blue-Green production environment with subset Development, Test and UAT environments. Security and compliance environments have even higher requirements. This can be a very resource-intensive process.

For any Big Data, clustered (shared-nothing) and/or RDBMS data warehouse database, this may not be practical or cost effective. The simple reason is SIZE. Creating multiple replicas of a multi-petabyte database is not even possible for some enterprises.

There is no simple answer. It’s a balance between having the right DB snapshots and making sure they are synced with production to meet your dev and release cycle. Here are some best practices:

  • Explore a Blue-Green production system: use one side or the other for QA, Security and Data Quality testing.
  • Learn how to subset the database model and the data needed to support dev on the current story item.
  • Automate the code pipeline (newer BI tools are getting closer; don’t be afraid to break out the API and build something).
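As a rough illustration of the subsetting idea in the list above, the sketch below copies a handful of customers and only the orders that belong to them from a “production” database into a small dev database. Everything here is an assumption made for the example: the `customer` and `orders` tables, the `copy_subset` and `build_demo_source` helpers, and the use of in-memory SQLite in place of a real warehouse.

```python
import sqlite3


def build_demo_source():
    """Create a tiny stand-in for production (hypothetical schema and data)."""
    src = sqlite3.connect(":memory:")
    src.executescript(
        """
        CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customer(id),
            amount REAL NOT NULL
        );
        INSERT INTO customer VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
        INSERT INTO orders VALUES
            (10, 1, 5.0), (11, 1, 7.5), (12, 2, 3.0), (13, 3, 9.0);
        """
    )
    return src


def copy_subset(src, dst, customer_ids):
    """Copy the chosen customers and only their orders; return the row counts."""
    customer_ids = list(customer_ids)
    placeholders = ",".join("?" for _ in customer_ids)

    # Subset the model: recreate just the two tables, using the source's own DDL.
    for (ddl,) in src.execute(
        "SELECT sql FROM sqlite_master "
        "WHERE type = 'table' AND name IN ('customer', 'orders')"
    ):
        dst.execute(ddl)

    # Subset the data: child rows are filtered by the same keys as the parents,
    # so the copy never contains an order without its customer.
    customers = src.execute(
        f"SELECT id, name FROM customer WHERE id IN ({placeholders})",
        customer_ids,
    ).fetchall()
    orders = src.execute(
        f"SELECT id, customer_id, amount FROM orders "
        f"WHERE customer_id IN ({placeholders})",
        customer_ids,
    ).fetchall()

    dst.executemany("INSERT INTO customer VALUES (?, ?)", customers)
    dst.executemany("INSERT INTO orders VALUES (?, ?, ?)", orders)
    dst.commit()
    return len(customers), len(orders)


if __name__ == "__main__":
    source = build_demo_source()
    dev = sqlite3.connect(":memory:")
    print(copy_subset(source, dev, [1, 3]))
```

The point of the design is that the subset is driven by a list of parent keys, so the dev copy stays small while remaining internally consistent. On a real warehouse the same idea usually needs to follow every foreign key chain that the current story item touches.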

Monitoring – Operations

Monitoring the database is very similar to monitoring the other services in your architecture. You will have health checks, resource monitoring and elastic determination points for expansion and contraction, like you would for other services. You will also have to monitor database-specific elements like disk storage capacity for data, indexes and logs. You may have to monitor IOPS and data transfer traffic in and out.
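A disk capacity health check of the kind described above can be sketched in a few lines. The thresholds (80% warning, 90% critical) and the `classify` and `disk_status` helper names are my own assumptions for the example; the path you check would be wherever your data, index or log volumes are mounted.

```python
import shutil


def classify(used_pct, warn_pct=80.0, crit_pct=90.0):
    """Map a used-space percentage to a simple health status."""
    if used_pct >= crit_pct:
        return "critical"
    if used_pct >= warn_pct:
        return "warning"
    return "ok"


def disk_status(path, warn_pct=80.0, crit_pct=90.0):
    """Return (status, used percentage) for the volume that holds `path`."""
    usage = shutil.disk_usage(path)
    # Guard against a zero-sized volume so the check itself never crashes.
    used_pct = 100.0 * usage.used / usage.total if usage.total else 0.0
    return classify(used_pct, warn_pct, crit_pct), round(used_pct, 1)


if __name__ == "__main__":
    # "." is a placeholder; point this at your data, index and log volumes.
    status, used = disk_status(".")
    print(f"{status}: {used}% used")
```

A scheduler or monitoring agent could run a check like this against each volume and raise an alert, or trigger an expansion, when the status leaves "ok".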

You may need to have replica read servers being fed from one write server. Yes, you can automate a lot of this, especially if you’re running the right enterprise tools for your Hadoop cluster and/or relational database. Also, most of them today offer a REST API that makes creating automated tools specific to your environment straightforward.
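The “one write server feeding replica read servers” layout can be pictured with a small routing sketch. This is not how any particular product does it; the `ReadWriteRouter` class and the host names are hypothetical, and deciding by the first SQL keyword is a deliberate simplification.

```python
import itertools


class ReadWriteRouter:
    """Send SELECT statements to read replicas (round-robin), the rest to the writer."""

    def __init__(self, writer, replicas):
        if not replicas:
            raise ValueError("at least one read replica is required")
        self.writer = writer
        self._replicas = itertools.cycle(replicas)

    def route(self, sql):
        """Return the name of the server that should run this statement."""
        words = sql.split()
        first_word = words[0].upper() if words else ""
        if first_word == "SELECT":
            return next(self._replicas)
        # Anything that might change data goes to the single write server.
        return self.writer


if __name__ == "__main__":
    # Hypothetical host names, for illustration only.
    router = ReadWriteRouter("db-writer", ["db-read-1", "db-read-2"])
    for statement in (
        "SELECT * FROM sales",
        "INSERT INTO sales VALUES (1)",
        "select count(*) from sales",
    ):
        print(router.route(statement), "<-", statement)
```

In practice this routing usually lives in a driver, proxy or managed service rather than in your own code, and it also has to account for replication lag; the sketch only shows the shape of the decision.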

If you need help fine-tuning your DevOps Data Engineering environment, optimizing your spend on cloud resources and/or migrating your data platform to the cloud, please give us a call or send an email today.

At Blueskymetrics, we help customers build metrics in the cloud. These are Cloud Data Warehouse solutions that include Big Data and RDBMS like Oracle, SQL Server, Redshift, Aurora and others. We love data mining. Our solutions focus on Data Science and Machine Learning. If you’re starting a new project, or need help migrating your data warehouse to the cloud or evolving your existing data solution, please contact us.
Email: – Phone: 765.325.8373 ( call / text ).
