Documenting Data Lineage Comforts, Explains and Conditions a Data Team
- Christian Steinert
Data Catalogue Series #2: How Information Flows to Recommend Profitable Design
We are sticking with the cost-effective theme from last week. Our target is small to medium-sized healthcare companies. Data lineage is an interesting topic because it’s so core to our experience as data engineers & consultants. It’s also a favorite exercise in my consulting process.
Without data lineage, data professionals and consultants stepping into a new company are lost. To make the best architecture and engineering decisions, one must understand the current state of how information flows. It gives us the compass we need to build the right ingestion patterns and data models that are aligned to the business.
IBM defines data lineage as “the process of tracking the flow of data over time, providing a clear understanding of where the data originated, how it has changed, and its ultimate destination within the data pipeline” (https://www.ibm.com/think/topics/data-lineage).
I think this definition is spot on. The level of detail you get into during a data audit may vary with your preference and the lineage asset you are creating, but the definition concisely captures the purpose.
What is one of the foundational topics of any academic information systems curriculum? Systems analysis & design. This was my favorite technical course in undergraduate studies.
It gave me a foundation for how information flows through systems and how it’s stored in an RDBMS. Learning the skill of systems thinking was the biggest takeaway. Additionally, we extensively covered a diverse range of design documentation types through diagrams, my two favorites being Data Flow Diagrams (DFD) and Entity Relationship Diagrams (ERD). We’ll focus on DFDs in a bit.
Data Lineage Discovery
Whenever we step into an organization’s environment - two technical questions immediately surface:
1. What source systems are driving the business today?
At least for me, this question begins to carve structure for an engagement. Due to internal terminology and semantics, data environments get lost in translation.
Furthermore, we all know how widely data ecosystems vary from company to company. A discovery and/or data audit can feel open-ended. In the beginning of my consulting journey, I felt lost in the chaotic forest of data pipelines and SELECT statements. Asking about source systems provides a definitive number, something I can latch onto that is familiar and closest to the business process. I begin mentally picturing each source system as its own object on the left side of a Data Flow Diagram.
This is a rocket-booster start to a comprehensive data audit that will produce a profit-driving recommendation.
2. How are you reporting on the data from these systems?
What, if anything, is the client doing today to obtain, digest and act on information? I have them walk through their entire workflow and update process.
For clients in mid-sized healthcare orgs, it’s typically manual reports in Excel. They’re either exporting data from an EHR or pulling that same EHR data from an ad hoc Power BI dashboard connected directly to the API or database, supplemented by a few crosswalk Excel workbooks.
This further illustrates the current state’s flow of information from source to intermediary (storage) to destination (reporting).
From here, you can begin piecing together high-level documentation.
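Before any diagram is drawn, the source to intermediary to destination picture can be jotted down as a simple list of flows. Below is a minimal Python sketch of that idea. The node names (`EHR`, `Excel export`, `Crosswalk workbooks` and so on) are hypothetical placeholders loosely based on the scenario above, not a real client environment.

```python
from collections import defaultdict

# Hypothetical current-state flows as (from, to) pairs. Names are illustrative only.
FLOWS = [
    ("EHR", "Excel export"),
    ("EHR", "Power BI dashboard"),
    ("Crosswalk workbooks", "Power BI dashboard"),
    ("Excel export", "Manual Excel report"),
]


def downstream(start, flows):
    """Return every node reachable from `start`, i.e. everything it feeds."""
    graph = defaultdict(list)
    for src, dst in flows:
        graph[src].append(dst)

    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return sorted(seen)


# Everything that depends on the EHR, directly or indirectly.
print(downstream("EHR", FLOWS))
```

Even a list this small answers a useful question: if the EHR changes, which exports and reports are affected?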
Now let’s get into the assets created during Data Lineage documentation.
Data Lineage Assets
Data Flow Diagram
IBM defines a Data Flow Diagram as “a visual representation of the flow of data through an information system or business process. DFDs make complex systems easier to understand and are a popular resource for software engineering, systems analysis, process improvement, business management and agile software development” (https://www.ibm.com/think/topics/data-flow-diagram).
Constructing DFDs helps you quickly grasp the foundations of a data ecosystem. Furthermore, they make explaining a current vs. future state far easier for stakeholders. I’ve learned that, in IT, diagrams are everything because they keep your talk track concise and make it easier for executives to follow along as you speak.
Leverage a tool like excalidraw.com, Miro or Google Drawings.
Similar to the hybrid data dictionary and data glossary from last week, I don’t follow an incredibly strict rule for a DFD. Technically, different shapes and objects have standardized meanings. For example, a square is an entity and a circle is a process step.
I’ve combined aspects of the business process with the technology components in DFDs I’ve created. I’ve seen other DFDs that do this as well, so I don’t think I’m off base here. I’m just sharing my own experience and what has been well received by my clients.
Below is a DFD we built for a client’s current state data ecosystem. We documented pieces of business processes that flow into the actual databases.

It should go without saying that you should create a DFD for the future state process/architecture as well. However, we find that focusing on the current state DFD during an audit is what baselines our understanding at the beginning of an engagement.
Data Lineage Flow Document
Essentially, take the diagram and log each individual data asset from source to destination in a Google Sheet or Excel workbook.
Typically this spreadsheet contains the following columns:
Agent Job or Data Pipeline
Stored Procedure or dbt model name
SQL Transformation
Source database tables
BI report
This paints the picture for each report: which pipeline it’s part of, the data transformation logic under the hood, the database tables it reads from and the report(s) those tables feed.
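As a rough illustration, the same five columns can be kept as plain rows and exported to a file that opens in Google Sheets or Excel. This is a minimal sketch using Python's standard `csv` module. The column keys, pipeline names, model names and SQL snippets are all hypothetical examples, not taken from a real engagement.

```python
import csv
import io

# Column keys mirror the five spreadsheet columns listed above (names are assumptions).
COLUMNS = ["pipeline", "model_name", "sql_transformation", "source_tables", "bi_report"]

# Hypothetical rows: one row per data asset feeding a report.
ROWS = [
    {
        "pipeline": "nightly_claims_job",
        "model_name": "stg_claims",
        "sql_transformation": "SELECT claim_id, amount FROM ehr.claims WHERE status = 'posted'",
        "source_tables": "ehr.claims",
        "bi_report": "Revenue Cycle Dashboard",
    },
    {
        "pipeline": "nightly_claims_job",
        "model_name": "fct_revenue",
        "sql_transformation": "SELECT payer_id, SUM(amount) FROM stg_claims GROUP BY payer_id",
        "source_tables": "stg_claims; ehr.payers",
        "bi_report": "Revenue Cycle Dashboard",
    },
    {
        "pipeline": "weekly_quality_job",
        "model_name": "fct_readmissions",
        "sql_transformation": "SELECT patient_id, admit_date FROM ehr.encounters",
        "source_tables": "ehr.encounters",
        "bi_report": "Quality Report",
    },
]


def to_csv(rows):
    """Serialize the lineage rows to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def lineage_for_report(rows, report):
    """Return every lineage row that feeds the given BI report."""
    return [row for row in rows if row["bi_report"] == report]


print(to_csv(ROWS))
for row in lineage_for_report(ROWS, "Revenue Cycle Dashboard"):
    print(row["pipeline"], "->", row["model_name"], "->", row["bi_report"])
```

Keeping one row per asset, rather than one row per report, is a design choice that makes it easy to filter the sheet by any column.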
Making Data Lineage Work for Healthcare Organizations
Here's the reality check: most healthcare organizations I work with have never documented their data lineage. They're flying blind, making decisions based on reports they don't fully trust because they can't trace the data back to its source.
This is where the magic happens. Once you have your DFD and flow document in place, you can start asking the right questions:
Why is patient readmission data showing up differently in the executive dashboard versus the quality report?
Which transformation is causing our revenue cycle metrics to be delayed by two days?
If we need to add a new data source for social determinants of health, where does it fit in our current architecture?
Without data lineage documentation, these conversations turn into finger-pointing sessions. With it, they become productive problem-solving discussions.
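To make the first question concrete, here is a small Python sketch of how lineage rows could be compared for two reports that disagree. The report names, model names and table names are hypothetical, and the comparison only shows where the two lineages diverge. It does not say which report is correct.

```python
# Hypothetical lineage rows as (report, model, source_table). Illustrative only.
LINEAGE = [
    ("Executive Dashboard", "fct_readmissions_v1", "ehr.encounters"),
    ("Executive Dashboard", "fct_readmissions_v1", "ehr.patients"),
    ("Quality Report", "fct_readmissions_v2", "ehr.encounters"),
    ("Quality Report", "fct_readmissions_v2", "ehr.discharges"),
]


def upstream(report, lineage):
    """Return the sets of models and source tables that feed a report."""
    models = {model for rep, model, _ in lineage if rep == report}
    tables = {table for rep, _, table in lineage if rep == report}
    return models, tables


def diff_lineage(report_a, report_b, lineage):
    """List the models and tables that only one of the two reports depends on."""
    models_a, tables_a = upstream(report_a, lineage)
    models_b, tables_b = upstream(report_b, lineage)
    return {
        "models_only_in_a": sorted(models_a - models_b),
        "models_only_in_b": sorted(models_b - models_a),
        "tables_only_in_a": sorted(tables_a - tables_b),
        "tables_only_in_b": sorted(tables_b - tables_a),
    }


result = diff_lineage("Executive Dashboard", "Quality Report", LINEAGE)
for key, values in result.items():
    print(key, values)
```

In this made-up case the two reports rely on different models and partly different tables, which gives the team a specific place to start looking instead of a debate.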
The Consultant's Secret Weapon
I'll let you in on something that's transformed how I approach healthcare data engagements: I always lead with data lineage documentation, even when clients think they need something else.
They'll call asking for a new dashboard or wanting to migrate to a different EHR reporting tool. But I've learned that jumping straight into building without understanding the current state is like performing surgery without reviewing the patient's medical history.
The data lineage exercise becomes my diagnostic tool. It reveals the hidden complexity, identifies the real bottlenecks, and - most importantly - builds trust with the data team, who finally feel heard and understood.
Quick Wins for Your Team
If you're reading this as a healthcare data leader, here are three immediate actions you can take:
Start small with your most critical business process. Pick your revenue cycle or quality reporting workflow and map just that one flow from source to destination.
Get your team involved in the documentation process. The analysts who build the reports know where the bodies are buried. Their institutional knowledge is gold during this exercise.
Use the documentation as a communication tool with leadership. Nothing gets executives more excited about data investments than showing them a clear before-and-after picture of how information flows (or doesn't flow) through their organization.
Next Week Preview
Not sure yet. We’ve covered two key areas of data cataloguing - data dictionaries/glossaries and lineage. I’m pondering a few content ideas. May revisit this mini series later. 🙂
Until then, start asking those two foundational discovery questions in your next team meeting. You might be surprised by what you discover about your own data ecosystem.
Christian Steinert is the founder of Steinert Analytics, helping healthcare & roofing organizations turn data into actionable insights. Subscribe to Rooftop Insights for weekly perspectives on analytics and business intelligence in these industries.