Newsletter

Ensuring data quality within your organisation

The quality of your data impacts every aspect of your analyses. Read on for ways to minimize issues and implement tests.

Thibaut Collette

August 30, 2022 · 5 min read


A few weeks ago, I had the chance to catch up with an old colleague of mine, a Software Quality Assurance Engineer. His job, in particular, was to ensure the quality of the product we shipped. We both had the same end goal, deploying the best possible product for our users, but the way to get there was different, and I learned a trick or two that I am about to share.

SQA (Software Quality Assurance) is "a means and practice of monitoring all software engineering processes to ensure compliance". Ouch. In other words, how do you ensure quality in the software you ship? The topic has now been discussed for decades and is very mature.

Data quality, on the other hand, is still pretty new, and its definition might continue to evolve. Data quality is not only about the process but also about the sources, the privacy, ...

While it might sound scary and costly, a few small steps can take you a long way and ensure ease of use and greater impact.

PEBCAK

First, some truths need to be told. (Data) Quality issues are very often created between the chair and the keyboard.

Meme with text: "Word of the day: PEBCAK, Problem Exists Between Chair and Keyboard"

In order to minimize those kinds of "problems", here are a few suggestions:

  • Limit copy-pasting. Copy-pasting silently introduces a lot of mistakes. When you can, work directly in your data source or import your data electronically.
  • Split complex logic. You might be proud of your 250-line SQL query or your 300-character spreadsheet formula. Split those into smaller parts that can be proofread later, by you or by a teammate (see the sketch after this list).
  • Take a step back. Is the source of the data trusted? Are there expected biases in the cohort I am looking at? Does the result make sense?
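For instance, a long query can be split into named CTEs, each small enough to proofread on its own. A minimal sketch, with hypothetical table and column names:

```sql
-- Instead of one opaque 250-line query, name each step as a CTE.
-- Tables (events, orders) and columns below are made up for illustration.
WITH active_users AS (
    -- Step 1: who was active in the last 30 days?
    SELECT user_id
    FROM events
    WHERE event_date >= CURRENT_DATE - INTERVAL '30 days'
    GROUP BY user_id
),

revenue_per_user AS (
    -- Step 2: how much did each active user spend?
    SELECT o.user_id, SUM(o.amount) AS total_amount
    FROM orders AS o
    JOIN active_users AS a ON a.user_id = o.user_id
    GROUP BY o.user_id
)

-- Step 3: a final aggregation that is now trivial to review.
SELECT AVG(total_amount) AS avg_revenue_per_active_user
FROM revenue_per_user;
```

Each step can now be run, checked, and reviewed in isolation.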

Finally, as your team grows, make sure you enforce code review. The advantages are long-lasting and wide-ranging: not only does it remove most small mistakes, it also ensures consistency between teammates and a faster learning curve for your new recruits.

Good documentation helps a lot with quality, even more so with data quality. However, it is often very costly to maintain.

Naming conventions are key to reducing the need for extensive documentation. They will require consistency, but here is a starting point: Design Tip #168 - What's in a name? from the Kimball Group.
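To make this concrete, here is one possible set of conventions, sketched on a hypothetical orders table; the exact rules matter less than applying them everywhere:

```sql
-- Hypothetical example: self-describing, consistent names reduce
-- the need for documentation (d, amt, flag tell you nothing).
SELECT
    order_id,            -- surrogate keys end in _id
    order_date,          -- dates end in _date
    order_amount_usd,    -- amounts carry their unit or currency
    is_refunded          -- booleans start with is_ / has_
FROM fct_orders;         -- fact tables prefixed fct_, dimensions dim_
```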

Then, the idea is to document the most critical operations and any content that would not be understood without context. Setting up a "Documentation hour" can be helpful: sit down with a colleague or two, order something to eat, and document. It brings fun into something that otherwise looks boring.

If you already use dbt, then dbt documentation is a great place to start, as it will always stay up-to-date with your transformations.

Manual testing plan, alerting, automatic testing...

Basically, all software QA techniques can be applied to data.

You can start by running manual tests.

There is extensive tooling out there, but you can start with a single document where you list some actions, like visiting a specific dashboard, running specific queries, ... Go through this list, or part of it, on a regular basis (weekly?).
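A checklist entry can be as simple as a saved sanity-check query that you re-run and eyeball each week. A hypothetical example:

```sql
-- Weekly manual check (hypothetical fct_orders table): compare these
-- numbers with last week's run and investigate any surprise.
SELECT
    COUNT(*)                AS total_orders,
    COUNT(DISTINCT user_id) AS distinct_buyers,
    MIN(order_date)         AS first_order,
    MAX(order_date)         AS last_order  -- should be very recent
FROM fct_orders;
```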

Then, as you grow, you can automate part of these tests: dbt itself (I would recommend adopting it very early) but also Great Expectations. They will help you enforce consistency in the tables and columns you'll query.
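In dbt, for instance, a singular test is nothing more than a SQL file in the tests/ directory that selects the rows violating your expectation; the test fails whenever the query returns rows. A minimal sketch against a hypothetical fct_orders model:

```sql
-- tests/assert_no_negative_amounts.sql
-- dbt runs this query and fails the test if any row comes back.
SELECT
    order_id,
    order_amount_usd
FROM {{ ref('fct_orders') }}
WHERE order_amount_usd < 0
```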

Finally, alerting helps you and your team react faster when there is an issue, or at least a suspicion of one. The initial risk is too many alerts that are either false positives or low-impact. Internally, we initially decided to start without any alerting. Then, as issues arose, we defined new alerts along with our fixes, in order to monitor that everything is back to normal and stays that way.

The DevOps team in your company often has a monitoring platform, and very often they are happy to share their toys when it comes to "improved quality". Adding simple thresholds (too many lines added in 24 hours, not enough lines, too much time spent in queries, no activity, ...) can be done very quickly when sitting together.
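Such a threshold can stay very simple: a scheduled query that returns rows only when something looks wrong. A hypothetical sketch, with made-up table name and limits:

```sql
-- Hypothetical alerting query: run it on a schedule and trigger an
-- alert whenever it returns a row.
SELECT COUNT(*) AS rows_last_24h
FROM fct_orders
WHERE loaded_at >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
HAVING COUNT(*) < 1000      -- not enough lines: is the pipeline stuck?
    OR COUNT(*) > 1000000;  -- too many lines: a duplication bug?
```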

---

(Obviously) we are big supporters of notebooks.

Notebooks allow you to document around your code in a precise way and to split your code into small cells that can be reviewed and reused.

When the team grows, you'll most probably need dedicated tooling around data observability and lineage. But again, as always, you can start small and iterate towards your long-term solution.

Oh! I almost forgot: I should warn you about the "snowball effect". Let's say an issue or a bug occurs and you're notified. It's critical, and you decide to hotfix it since the fix looks easy. You deploy your change. Great job! Unfortunately, you now receive many notifications: the hotfix broke two other workflows you didn't think about while rushing. Now your problem is much bigger. It snowballs. Data quality starts with a cool mind.


Learn more about Husprey

Husprey is a powerful, yet simple, platform that provides tools for Data Analysts to create SQL notebooks effortlessly, collaborate with their team and share their analyses with anyone.