Why do you need data observability?
The truth is, data and analytics teams have a tough job. They usually get data from systems they don’t control, yet they are responsible for correct insights. Even if you do everything right, upstream systems will break and business processes will change. Working in a fast-moving environment, and being human, will result in data issues.
These issues will come your way. You should expect them.
So it is better to be prepared: to know when issues arise, what they affect, and what caused them, so you can resolve them faster. Data observability helps you be prepared.
Outcomes you can expect by investing in data observability:
- Building up trust in data with transparency and fast reactions to errors
- Higher adoption of data because of more trust
- More time for analysts to do the work they love [1]
What is data observability?
Data observability is the discipline of collecting metadata about data systems.
The main goal is to detect data issues, understand their impact and provide context to resolve them.
Observability can also help with cost optimization and with refactoring data pipelines.
Example
A table in Snowflake is used to power a recommender system in an e-commerce system. This table should be refreshed every 3 hours, so recommendations can pick up short-term trends. Usually there are a lot of moving pieces that generate this table: an event collection platform, a data integration tool, a data modelling tool, a machine learning system, …
It is hard to monitor each of them in enough detail to ensure everything works as expected.
A monitoring service that just checks whether the data in the table was updated is a simple way to ensure freshness. If this service also checks that the row count is steady and that important columns have stable statistical attributes (like the ratio of null values or the maximum value), we can focus on something else.
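The checks described above can be sketched in a few lines of Python. This is only an illustration, not how any particular tool does it: the `TableSnapshot` shape, the `check_table` function, and the tolerance values are all assumptions made up for this example. In practice, the snapshot values would come from metadata queries against the warehouse.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class TableSnapshot:
    """Metadata about the table at one point in time (hypothetical shape)."""
    last_updated: datetime
    row_count: int
    null_ratio: float  # share of NULLs in one important column


def check_table(
    current: TableSnapshot,
    previous: TableSnapshot,
    now: datetime,
    max_age: timedelta = timedelta(hours=3),  # refresh interval from the example
    row_tolerance: float = 0.2,               # assumed: 20% relative change allowed
    null_tolerance: float = 0.05,             # assumed: 5 points absolute change allowed
) -> List[str]:
    """Return the names of the checks that failed (empty list means healthy)."""
    issues = []

    # Freshness: was the table updated within the expected interval?
    if now - current.last_updated > max_age:
        issues.append("freshness")

    # Volume: is the row count steady compared to the previous snapshot?
    if previous.row_count > 0:
        change = abs(current.row_count - previous.row_count) / previous.row_count
        if change > row_tolerance:
            issues.append("row_count")

    # Column statistics: did the ratio of null values shift noticeably?
    if abs(current.null_ratio - previous.null_ratio) > null_tolerance:
        issues.append("null_ratio")

    return issues


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0)
    previous = TableSnapshot(datetime(2024, 1, 1, 6, 0), 1000, 0.01)
    current = TableSnapshot(datetime(2024, 1, 1, 8, 0), 400, 0.30)
    print(check_table(current, previous, now))
```

Comparing against the previous snapshot keeps the sketch simple. A real setup would more likely compare against a longer history to account for seasonality.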
3 Pillars of Data Observability
Software engineers have learned a lot of valuable lessons over the last 50 years of building software.
Data engineering/analytics engineering is still young compared to that. Let’s see if we can apply some of the principles from software observability to data observability. [2]
Have you had data issues you would like to avoid in the future? Reach out!
____
[1] Dealing with data issues takes too much time. Check out Mikkel Dengsøe's thoughts on the topic:
https://mikkeldengsoe.substack.com/p/time-allocation
Image Courtesy: Mikkel Dengsøe
[2] Does it make sense to think about data systems differently from software? In theory, both are IT systems that consist of storage, compute, and network. In practice, data can get expensive to monitor, just because there is so much of it. Also, the emphasis for software is that servers are running and serving requests, while the emphasis for data is good quality.
[3] Data-Ware-House-Keeping or housekeeping for your data is about cleaning unused/useless tables or columns, reducing cost for exploding data models or managing access to reflect company policies.