Data lineage tracks the complete journey of data from its origin through all the transformations, processes, and systems of the data lifecycle until it reaches its final destination.
Just as a family tree traces the ancestry of individuals from their origin to the present day, data lineage traces data from its source to its final destination. A family tree gives a comprehensive view of a family's history; data lineage gives the same view of data. Both help us understand where something came from and how it changed along the way, and both enable more informed decisions based on that understanding.
And just like families, cloud environments can be complex. For example, data can be replicated from multiple production sources to support research and data science tasks, or to serve as a backup. Each replica increases the attack surface, because every copy must be protected against unauthorized access and kept in secured storage. Another common example is the same schema deployed to multiple environments. Both patterns can also create dependencies between environments, making them harder to change, update, or properly protect.
Data lineage and data supply chains are closely related, overlapping terms. A data supply chain refers to the entire process of managing how data enters and moves within an organization, while data lineage is a specific aspect of the data supply chain that focuses on the history and movement of data. Together, they help an organization maintain data security over time. This article focuses on data lineage.
As data processing pipelines grow more complex, cloud security platforms with data lineage are becoming increasingly important. Data lineage should not be confused with data provenance, which focuses on the origin of data collection; data lineage provides a view into the entire lifecycle of an organization’s data. Organizations that adopt data lineage gain several benefits:
Data lineage is a direct way to understand who owns what data, including abandoned datastores. Because it tracks the evolution of data as it flows from source to destination, it exposes the connections between different data sources and users. This means answering important questions about where user data comes from, what transformations occur on it along the way, and how it is ultimately used. By understanding these connections, you can hold team members accountable.
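To make those questions concrete, here is a minimal sketch of lineage as a directed graph, assuming each hop is recorded as a (source, destination, transformation) triple. The datastore names, team names, and helper functions are all hypothetical and chosen only for illustration; a real lineage platform would discover these edges rather than have them written by hand.

```python
from collections import defaultdict

# Hypothetical lineage edges: (source, destination, transformation).
EDGES = [
    ("prod.users", "staging.users_clean", "drop test accounts"),
    ("staging.users_clean", "analytics.user_stats", "aggregate by country"),
    ("prod.orders", "analytics.user_stats", "join on user_id"),
]

# Hypothetical owner of each datastore.
OWNERS = {
    "prod.users": "identity-team",
    "prod.orders": "commerce-team",
    "staging.users_clean": "data-eng",
    "analytics.user_stats": "data-science",
}


def build_parents(edges):
    """Map each destination to the (source, transformation) pairs feeding it."""
    parents = defaultdict(list)
    for source, destination, transformation in edges:
        parents[destination].append((source, transformation))
    return parents


def trace_upstream(dataset, edges):
    """Return every (source, destination, transformation) hop upstream of `dataset`."""
    parents = build_parents(edges)
    hops, seen, stack = [], {dataset}, [dataset]
    while stack:
        current = stack.pop()
        for source, transformation in parents.get(current, []):
            hops.append((source, current, transformation))
            if source not in seen:  # visit each datastore once, even with cycles
                seen.add(source)
                stack.append(source)
    return hops


def origins(dataset, edges):
    """Return the upstream datastores that have no parents of their own."""
    parents = build_parents(edges)
    upstream = {source for source, _, _ in trace_upstream(dataset, edges)}
    return sorted(store for store in upstream if store not in parents)


if __name__ == "__main__":
    for source, destination, transformation in trace_upstream("analytics.user_stats", EDGES):
        print(f"{source} ({OWNERS[source]}) -> {destination}: {transformation}")
    print("Origins:", origins("analytics.user_stats", EDGES))
```

Walking the graph backwards from a dataset answers where its data comes from and which transformations it passed through; joining each hop against an ownership table answers who is accountable for it.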
Reducing your attack surface reduces your organization’s risk of a data breach. Once you understand your data lineage, you can delete data that is no longer needed, detect abandoned datastores, and remove unused permissions to data.
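As a rough illustration of that cleanup, the sketch below flags abandoned datastores and unused permissions from two hypothetical inputs: a table of granted permissions and a table of last access times. The 90-day idle threshold, the data, and the function names are assumptions made for this example, not a recommended policy.

```python
from datetime import datetime, timedelta

# Hypothetical grants: datastore -> principals allowed to read it.
GRANTS = {
    "prod.users": {"alice", "bob"},
    "backup.users_2021": {"carol"},
}

# Hypothetical access log summary: datastore -> {principal: last access time}.
LAST_ACCESS = {
    "prod.users": {"alice": datetime(2023, 12, 20)},
    "backup.users_2021": {"carol": datetime(2022, 3, 1)},
}


def abandoned_datastores(last_access, now, max_idle_days=90):
    """Datastores that nobody has accessed within the idle window."""
    cutoff = now - timedelta(days=max_idle_days)
    return sorted(
        store
        for store, by_principal in last_access.items()
        # all() is True for an empty mapping, so never-accessed stores count too.
        if all(timestamp < cutoff for timestamp in by_principal.values())
    )


def unused_permissions(grants, last_access, now, max_idle_days=90):
    """(datastore, principal) grants not exercised within the idle window."""
    cutoff = now - timedelta(days=max_idle_days)
    unused = []
    for store, principals in grants.items():
        accessed = last_access.get(store, {})
        for principal in principals:
            timestamp = accessed.get(principal)
            if timestamp is None or timestamp < cutoff:
                unused.append((store, principal))
    return sorted(unused)


if __name__ == "__main__":
    now = datetime(2024, 1, 1)
    print("Abandoned:", abandoned_datastores(LAST_ACCESS, now))
    print("Unused grants:", unused_permissions(GRANTS, LAST_ACCESS, now))
```

Passing the current time in as a parameter keeps the check reproducible, which is useful when the same report needs to be rerun for an audit.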
With data lineage, organizations can track their data through a single dynamic control plane and understand who actually has access to it, which is necessary for compliance and auditing purposes. It helps ensure that data is being used in accordance with regulations, controls, and internal policies. Privacy, risk, compliance, and security teams can manage a single dynamic dashboard of policy requirements and see in real time when policies are not met.
Organizations can quickly identify the source of any data-related risks or errors and fix them more efficiently.
With a complete understanding of their data's journey, organizations can make better-informed decisions about how to use their data to drive business growth and success. Engineers can share ownership with security teams, which enables them to responsibly use any cloud data store without constraints.