SUP! Hubert’s Substack

OpenLineage with Streaming Data

A spicy look into why data lineage is important

Hubert Dulay
Mar 18, 2023

Is data lineage important? Many don’t seem to think so. Case in point, TJ’s tweet below.

TJ (@teej_m)
What is the actual problem that a "data lineage tool" solves? If you say data mesh, you're fired.
7:37 PM · Jul 8, 2022

Fine!! I won’t mention data mesh in this post, TJ 😉. Follow that thread and you’ll see replies saying that data lineage solves for provenance and observability, but not much else.

David is at data-folks.masto.host (@DSJayatillake)
@teej_m It doesn't solve much on its own other than helping people understand the provenance of their data... Which is helpful. However in conjunction with other systems, like observability, it's very powerful. You can get to the root cause of incidents rather than only seeing symptoms.
11:53 PM · Jul 8, 2022

Data Products

Fine, I lied 😬. I’m going to talk about data mesh a little by way of data products. The easiest way to define data products is to compare them to the produce you buy at the grocery store. If you care about what you put into your body, then you probably care about how your food is grown and processed. When shopping for produce, people tend to inspect what they buy at least a little. The organic label that comes with your food tells you something about how it was processed, specifically whether it may have been treated with pesticides. If you know what it takes to earn an organic label, then you have some idea of the food’s processing lineage. The label gives you a hint of that lineage. We don’t have this for data. You’ll need data lineage for that.

[Image: USDA organic certification labels]

Data lineage gives consumers of data the assurance that it was properly cleansed, enriched, and secured. We should find a way to provide this information as part of the metadata of the data product. This is especially important for data scientists. They need to explain the insights they provide to the business. If data scientists are unsure of the provenance of the data products and the processes they passed through, how can they be sure their metrics are safe to use for critical changes in the business? This should resonate especially with those in the healthcare or pharmaceutical industries, where changes could be life-threatening.

ABC (@Ubunta)
Data Lineage is very underrated but a complex important requirement for Data Engineering and Machine Learning systems.
1:21 PM · May 20, 2022

Streaming Data Lineage

If you’ve tried to build a complete picture of data lineage from source system to sink system, you’ve probably had to work with incomplete information. You may have to stitch together multiple lineage graphs (each provided by a separate tool) to get the full picture from source to sink.
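
To make the stitching idea concrete, here is a minimal sketch in Python. It assumes each tool can export its lineage as a list of (upstream, downstream) edges and that the tools agree on dataset names, which is rarely true without some normalization. The dataset names and the `stitch` and `downstream_of` helpers are hypothetical and made up for illustration; they are not part of any lineage tool’s API.

```python
from collections import defaultdict, deque


def stitch(*graphs):
    """Merge edge lists from separate tools into one adjacency map."""
    merged = defaultdict(set)
    for edges in graphs:
        for upstream, downstream in edges:
            merged[upstream].add(downstream)
    return merged


def downstream_of(graph, start):
    """Return every dataset reachable from `start` (breadth-first)."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Hypothetical lineage exported by three separate tools.
connector_edges = [("postgres.orders", "kafka.orders_raw")]
stream_edges = [("kafka.orders_raw", "kafka.orders_enriched")]
warehouse_edges = [("kafka.orders_enriched", "warehouse.orders_daily")]

# Shared dataset names are the stitch points between the graphs.
full = stitch(connector_edges, stream_edges, warehouse_edges)
print(sorted(downstream_of(full, "postgres.orders")))
```

On its own, the connector’s graph only reaches the raw topic. The sink in the warehouse shows up as downstream of the source database only after all three graphs are merged, and that is the gap you hit when each tool sees just its own hop.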
