Data Quality, Top 5 Don’ts!



Data Quality

Things to avoid in your data quality efforts

Data Quality – There are many definitions of data quality, but data are generally considered high quality if “they are fit for their intended uses in operations, decision making and planning.” Talk about an arbitrary definition. I have this conversation a lot with new customers. It boils down to something like this: “I can’t really tell you when it’s right per se, but I can tell you when it’s wrong.” So, let’s start with the Top 5 things you DON’T want to hear or do with your customer’s or internal client’s data.

Before We Start…

Data Quality should be a primary feature that you deliver in your project. It should not be an afterthought or something that you pass on to someone else to clean up. There have been times when I’m building a new Data Mining – Machine Learning product that is based on existing data, and my project team and I have to expend extra effort cleaning someone else’s data and/or dealing with a data quality issue at the beginning. This makes me a “Sad Data Scientist”. To turn my frown upside down, I make sure to filter out some of the biggest gotchas at the beginning. Here is my list of Top 5 Don’ts.

Top 5 Don’ts

  1. DON’T Be Passive.
  2. DON’T Let your Customers and/or Clients be your Data Quality Filter.
  3. DON’T Assume that quality, scalability and repeatability are built in.
  4. DON’T Play the Blame game – even if it’s not your fault.
  5. DON’T Think it’s your responsibility to “Fix” everything.

OK, let’s break this down a bit more.

Point 1 – Have an active plan for dealing with Data Quality issues. They will bubble to the top from time to time, and when they do you should have a recovery plan. If you have a general plan, you can refine it on the fly and be ready to act when issues arise.

This should include [more on this next month]:

  • Identification
  • Verification
  • Resolution
  • Confirmation
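
As a rough sketch only, the four steps above could be tracked as an ordered lifecycle for each issue. The names `Stage` and `next_stage` are hypothetical, and the rule that a failed confirmation sends an issue back to resolution is my own assumption rather than something the plan above prescribes.

```python
from enum import Enum


class Stage(Enum):
    # The four steps from the list above, in order.
    IDENTIFICATION = 1
    VERIFICATION = 2
    RESOLUTION = 3
    CONFIRMATION = 4


def next_stage(stage, passed=True):
    """Return the next stage for an issue, or None once it is closed.

    Assumption: if confirmation fails, the issue goes back to resolution.
    """
    if stage is Stage.CONFIRMATION:
        return None if passed else Stage.RESOLUTION
    return Stage(stage.value + 1)
```

Keeping the stage on each ticket makes it easy to see which issues were identified but never confirmed as fixed.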

Point 2 – This goes back to the opening paragraph: don’t wait for a customer to call you and your pager to go off because you knew there was a data quality issue and ignored it. Even if you don’t (or the enterprise doesn’t) have the time or resources to fix an issue, identifying the edge cases will go a long way toward establishing credibility with customers and working on a real solution down the road.

Point 3 – Thinking that a Data Quality issue is a one-time occurrence and that your fix is a fix for all time is probably wrong. What? Yes. You may have identified an edge case and created a fix, but the issue may have more to do with infrastructure or some source system integration error that you don’t have control over. Revisit the edge case, test the fix, and make sure it’s still working. Or, for fun, check your trouble tracking system and see how many times you have fixed the same problem. Hint: if you have fixed the problem many times, you may not have found the true “Root Cause”.
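
Here is a minimal sketch of that “how many times did we fix this?” check, assuming you can export tickets from your tracking system as a list of dictionaries. The `tag` and `status` fields and the function name `recurring_issues` are hypothetical; your tracker will have its own export format.

```python
from collections import Counter


def recurring_issues(tickets, min_count=2):
    """Return {tag: count} for problems marked fixed at least min_count times."""
    counts = Counter(t["tag"] for t in tickets if t.get("status") == "fixed")
    return {tag: n for tag, n in counts.items() if n >= min_count}


# Made-up sample export, for illustration only.
tickets = [
    {"tag": "null-customer-id", "status": "fixed"},
    {"tag": "null-customer-id", "status": "fixed"},
    {"tag": "bad-date-format", "status": "fixed"},
    {"tag": "null-customer-id", "status": "open"},
]
repeats = recurring_issues(tickets)
```

Any tag that shows up in `repeats` is a candidate for a proper root cause review rather than another patch.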

Point 4 – Avoid the blame game. If you’re building a Machine Learning – Predictive Analytic product and your data is expected to arrive with a certain type, amount, value, etc., you should be checking this during your ingestion process, just as in an OLTP system we verify that a field is a number, string or date of a certain length or value. [BTW: SAS and R have some great tools for this.] We’ve all been in meetings where we’ve reviewed a Sev 1 or Sev 2 Trouble Ticket [affecting the enterprise] and 30 minutes of the hour-long meeting is folks blaming this or that. Just stop and find the “root cause candidates”, eliminate them logically and test the ones that are left.
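
To make the ingestion check concrete, here is a small sketch in plain Python (not SAS or R) that tests each inbound record for type, length, range and date format. The field names, limits and the `RULES` table are invented for illustration; swap in whatever your own product expects.

```python
from datetime import datetime


def _is_iso_date(value):
    """True if value is a YYYY-MM-DD date string."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False


# Hypothetical schema: field name -> (expected type, value check).
RULES = {
    "customer_id": (str, lambda v: 0 < len(v) <= 12),
    "order_total": (float, lambda v: 0.0 <= v <= 1_000_000.0),
    "order_date": (str, _is_iso_date),
}


def validate_record(record):
    """Return a list of problems found in one inbound record (empty if clean)."""
    problems = []
    for field, (expected_type, check) in RULES.items():
        if field not in record:
            problems.append(f"{field}: missing")
            continue
        value = record[field]
        if not isinstance(value, expected_type):
            problems.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(value).__name__}"
            )
        elif not check(value):
            problems.append(f"{field}: value {value!r} failed its check")
    return problems
```

Records that come back with a non-empty list can be quarantined and reported at ingestion time, which gives the Sev 1 meeting a list of facts to work from instead of opinions.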

Point 5 – That’s right! You don’t have to fix everything. Data Quality should be a company-wide exercise. If you have identified a problem that is larger than your product, project and/or team, it’s your responsibility to identify the problem and escalate it to the right team for resolution. Participate in the fix. Provide any information you can during the process. Help verify the fix once the resolution team has implemented it. Communication is key. Keep updating the ticket or your trouble tracking system when you have new information.

Realize that Data Quality is more about fixing several smaller issues for a cumulative increase in overall quality. It is a commitment you make to your product and customers to provide actionable insights that are reliable, repeatable and measurable.

Please contact us or subscribe to our blog if you found this interesting or would like more information.
