SUP! Hubert’s Substack

OpenLineage with Streaming Data

A spicy look into why data lineage is important

Hubert Dulay
Mar 18, 2023

Is data lineage important? Many don’t seem to think so. Case in point, TJ’s tweet below.

TJ @teej_m
What is the actual problem that a "data lineage tool" solves? If you say data mesh, you're fired.
7:37 PM ∙ Jul 8, 2022

Fine!! I won’t mention data mesh in this post, TJ 😉. Follow that thread and you’ll see comments saying that data lineage solves provenance and observability, but not much else.

David is at data-folks.masto.host @DSJayatillake
@teej_m It doesn't solve much on its own other than helping people understand the provenance of their data... Which is helpful. However in conjunction with other systems, like observability, it's very powerful. You can get to the root cause of incidents rather than only seeing symptoms.
11:53 PM ∙ Jul 8, 2022

Data Products

Fine, I lied 😬. I’m going to talk about data mesh a little by talking a bit about data products. The easiest way to define data products is to compare them to the produce you buy at the grocery store. If you care about what you put into your body, then you probably care about how your food is grown and processed. When shopping for produce, people tend to inspect the food they buy at least a little bit. The organic label that comes with your food tells you something about how it was processed, specifically whether it may have been exposed to pesticides. If you know what it takes to earn an organic label, then you have some idea of the food’s processing lineage. The label gives you a hint of that lineage. We don’t have this for data. You’ll need data lineage for that.

Image: USDA organic certification labels.

Data lineage gives consumers of data the assurance that it was properly cleansed, enriched, and secured. We should find a way to provide this information as part of the data product’s metadata. This is especially important for data scientists, who need to explain the insights they provide to the business. If data scientists are unsure of the provenance of the data products and the processes they passed through, how can they be sure their metrics are safe to use for critical changes in the business? This should resonate especially with those in the healthcare or pharmaceutical industries, where changes could be life-threatening.

ABC @Ubunta
Data Lineage is very underrated but a complex important requirement for Data Engineering and Machine Learning systems.
1:21 PM ∙ May 20, 2022

Streaming Data Lineage

If you’ve tried to build a complete picture of data lineage from source system to sink system, you’ve probably had to work with incomplete information. You may have to stitch together multiple lineage graphs, each provided by a separate tool, to get the full picture from source to sink.
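
To make the stitching idea a bit more concrete, here is a minimal Python sketch. It assumes each tool can export its lineage as a list of (upstream, downstream) dataset pairs and that the tools agree on dataset names. The tool and dataset names are invented for illustration, and this is not the OpenLineage API.

```python
from collections import defaultdict, deque

# Hypothetical exports from two separate tools; every name here is invented.
connector_lineage = [("postgres.orders", "kafka.orders_raw")]
stream_processor_lineage = [
    ("kafka.orders_raw", "kafka.orders_enriched"),
    ("kafka.orders_enriched", "warehouse.orders"),
]


def stitch(*edge_lists):
    """Merge several edge lists into one adjacency map (upstream -> downstreams)."""
    graph = defaultdict(set)
    for edges in edge_lists:
        for upstream, downstream in edges:
            graph[upstream].add(downstream)
    return graph


def find_path(graph, source, sink):
    """Breadth-first search for one path from source to sink, or None if there is none."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == sink:
            return path
        for nxt in sorted(graph.get(path[-1], ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


# With both graphs stitched, the source connects all the way to the sink.
full_graph = stitch(connector_lineage, stream_processor_lineage)
print(find_path(full_graph, "postgres.orders", "warehouse.orders"))

# With only one tool's graph, the same question has no answer.
partial_graph = stitch(stream_processor_lineage)
print(find_path(partial_graph, "postgres.orders", "warehouse.orders"))
```

The second lookup returns None because the stream processor's graph alone never mentions the source database, which is the incomplete-information problem in miniature. In practice the hard part is usually getting the tools to agree on dataset names, which this sketch simply assumes.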

© 2025 Hubert Dulay