...
End-to-End Data Lineage Management
Data Lineage is at the core of understanding Big Data acquired from various sources. Without knowledge of its lineage, Big Data is like the ‘pass the message’ game, where the last person to receive the message gets a badly degraded version of the original. The same degradation happens with poor Data Lineage as an enterprise’s data assets flow through its Data Architecture.
Businesses need data that is secure and compliant, and that is available where it is needed. Clean Big Data is also a requirement when there are multiple end users, many platforms, and sources in different formats such as video, text, images, and audio. When Big Data is stored remotely in the Cloud, it becomes less clear how the data got there. Understanding Data Lineage addresses these types of problems and more. Let us look at the basics of Data Lineage and of Data Lineage Management.
What is Data Lineage?
Gathering insights about the origin, movement, characteristics, and quality of the data is the first step in Data Lineage. Data Lineage typically begins at the data's origin, and the understanding then follows the data through to its final change. This has been the traditional approach to Data Lineage. For instance, in a project to create a new clinician/patient system at an established technology company, project members would refer to a map of tables and joins to decide what SQL to use for selecting, summarizing, or grouping the data. The programmers involved would update the code to generate the needed values, and QA would read these plans to anticipate ways to break the software. This method was just the beginning. Data Lineage is much more than that and needs to be understood in detail.
The traditional approach to Data Lineage presents many roadblocks for the data, especially for Master Data, i.e., information about the people, processes, and things that form the core of any business. Take an example from the banking industry: project team members are developing a new checking program for a large bank division that handles foreign transactions, and QA and software engineers have trouble obtaining a valid set of test data from other bank divisions. Including additional Data Lineage facets, such as who uses the Big Data, what it means, when the data is accessed, why the data is stored, and how the data elements are related, makes Data Lineage more meaningful. With these facets in place, the obstacles the engineers faced could have been mitigated, shortening the time frame for development and testing. Meaningful Data Lineage needs to contain multiple dimensions: who, what, when, where, why, and how.
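As a rough illustration, these dimensions could be captured in one record per data element. The following is only a minimal Python sketch: the class LineageRecord, its field names, and the sample values are hypothetical and are not taken from any particular lineage tool.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class LineageRecord:
    """One data element described along several lineage dimensions (hypothetical schema)."""
    element: str                   # name of the data element
    meaning: str                   # what: business meaning of the element
    used_by: List[str]             # who: teams or roles that use it
    stored_in: str                 # where: system that holds it
    purpose: str                   # why: reason it is stored
    derived_from: List[str] = field(default_factory=list)      # how: related source elements
    accessed_at: List[datetime] = field(default_factory=list)  # when: known access times

    def missing_dimensions(self) -> List[str]:
        """Return the dimensions that have not been filled in yet."""
        checks = {
            "who": self.used_by,
            "what": self.meaning,
            "when": self.accessed_at,
            "where": self.stored_in,
            "why": self.purpose,
        }
        return [name for name, value in checks.items() if not value]


# Illustrative values only, loosely based on the banking example above.
record = LineageRecord(
    element="fx_transaction_amount",
    meaning="Amount of a foreign transaction in the original currency",
    used_by=["QA", "software engineering"],
    stored_in="checking_db",
    purpose="",  # nobody has documented why this element is stored
)
print(record.missing_dimensions())
```

A check like missing_dimensions makes gaps visible early, for example before a test team asks another division for data whose purpose or access history nobody has written down.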
Why Keep Track of the Data Lineage?
Tracking Data Lineage has several benefits:
Data Governance:
Data Governance requires Metadata Management to ensure that Big Data meets business standards. A Metadata Management solution aims to follow data from its original source all the way to its final destination. A Data Lineage solution connects the metadata together, making it possible to understand and validate how the data is used.
Compliance:
Customers, staff members, and auditors need to trust the reported data when they respond to business opportunities and other challenges. A report needs to answer the question, “How did the information get there?” In such a case, tracking Data Lineage provides proof that the reports are consistent with the underlying data.
Data Quality:
Data Quality is challenged by data movement, transformation, interpretation, and selection through people and processes. It has become mandatory for businesses to report the origin of data and its transformations in an organized manner. A Data Lineage Management solution comes into the picture when you want to look at the end-to-end flow of a process. It is essential to share information on when the data has been transformed, what it means, and how it moves from one place to another.
Business Impact Analysis:
For businesses, it is important to understand how users within the organization and external users share the Big Data or the Master Data, and how the data is transformed. For example, an employee must be able to get the data needed to revisit an old decision: why it was made, what its consequences were, and so on. Answering such questions requires going back and forth in time with your data, which in turn requires understanding the Data Lineage.
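Going "back and forth" through the data can be pictured as walking a lineage graph in both directions. The sketch below is one hedged way to do that in Python; the dataset names and the helpers build_graph and reachable are made up for illustration, and a real lineage tool would expose its own interface.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, List, Set, Tuple


def build_graph(edges: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Map each node to the set of nodes it feeds directly."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for source, target in edges:
        graph[source].add(target)
    return graph


def reachable(graph: Dict[str, Set[str]], start: str) -> List[str]:
    """Breadth-first walk: every node reachable from `start`, sorted by name."""
    seen: Set[str] = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen and nxt != start:
                seen.add(nxt)
                queue.append(nxt)
    return sorted(seen)


# Hypothetical flows: (source, target) means data moves from source to target.
EDGES = [
    ("crm_export", "customer_master"),
    ("billing_db", "customer_master"),
    ("customer_master", "churn_report"),
    ("customer_master", "audit_extract"),
]

downstream = build_graph(EDGES)                      # forward: what is affected?
upstream = build_graph((t, s) for s, t in EDGES)     # backward: where did it come from?

print(reachable(downstream, "crm_export"))   # impact of a change in crm_export
print(reachable(upstream, "churn_report"))   # origins behind churn_report
```

The forward walk answers "what is affected if this source changes", and the backward walk answers "which sources stand behind this report or decision".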
How to Create and Use Data Lineage in Your Business?
Creating and using Data Lineage effectively helps a business make better decisions and respond more rapidly to business opportunities and regulations. Some good strategies for creating effective Data Lineage are:
Record the Where and How of Your Data:
Determine the technical lineage of the physical data through the underlying applications, services, and data stores. Thoroughly trace the data through the key business processes and flows. Track the movement of the data and the changes it has gone through.
Investigate the 5 W’s:
Meaningful data is multidimensional and is not only about where and how. One needs to find out who is using the data, what it means, when it was captured, when it is being used, and why it is stored and/or used.
Understand Relationships:
Data movement, storage, and so on are all related to one another. It is important to know how data originates and how it moves between people, processes, services, and products.
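One simple way to put these strategies into practice is to log every movement of data together with how it was changed, by whom, and when. The following Python sketch assumes a hypothetical Hop record and LineageLog class; the dataset names, actors, and timestamps are illustrative only.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class Hop:
    """One movement of data between two places (hypothetical structure)."""
    source: str           # where the data came from
    target: str           # where the data went
    transformation: str   # how it was changed on the way
    performed_by: str     # who or what moved it
    at: datetime          # when the movement happened


class LineageLog:
    """In-memory log of hops; a real system would persist these."""

    def __init__(self) -> None:
        self._hops: List[Hop] = []

    def record(self, hop: Hop) -> None:
        self._hops.append(hop)

    def history(self, dataset: str) -> List[Hop]:
        """All hops that read from or wrote to `dataset`, oldest first."""
        touching = [h for h in self._hops if dataset in (h.source, h.target)]
        return sorted(touching, key=lambda h: h.at)


log = LineageLog()
log.record(Hop("orders_api", "orders_raw", "ingest JSON as-is",
               "ingest_service", datetime(2024, 1, 1, 8, 0)))
log.record(Hop("orders_raw", "orders_clean", "drop duplicates, normalise currency",
               "etl_job", datetime(2024, 1, 1, 9, 0)))
log.record(Hop("orders_clean", "revenue_report", "sum by month",
               "bi_team", datetime(2024, 1, 2, 7, 30)))

for hop in log.history("orders_clean"):
    print(f"{hop.at:%Y-%m-%d %H:%M} {hop.source} -> {hop.target}: "
          f"{hop.transformation} ({hop.performed_by})")
```

Because each hop names both ends of the movement, the same log also documents the relationships between the systems and teams involved.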
ServiceNow, a partner of SPM Global Technology, offers data management plugins that help maintain Data Lineage efficiently:
Data Archiving: Provides the ability to archive records to minimize performance issues.
Database Rotations: Provides tools for managing large tables to minimize performance issues.
Many to Many task relations: Provides the ability to define many-to-many relationships between task tables.
Quora questions:
1. What is data lineage?
Also called the pedigree of the data, data lineage is a concern in information fusion. It deals with how derived data is based on the sources that were fused together. Losing track of that derivation essentially throws the assumptions of most algorithms, techniques, and methodologies in information fusion overboard and can reduce the results to worse than random. Therefore, in decentralized environments with unreliable sources, where some sources may be based on sensor data and others on human reports, and where reporting may be offset in time, it is strongly desirable to keep track of the source and of how data was derived from the source data.
2. What is the difference between data lineage and data tracing?
Data lineage includes the data's origins, what happens to it, and where it moves over time. Data lineage gives visibility and makes it much easier to trace errors back to their root cause in a data analytics process.
In summary, data lineage is the documentation of the data life cycle, while data traceability is the process of verifying that the data follows its life cycle as expected. Many data-quality projects require data traceability to track information and ensure that it is used properly.
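To make the distinction concrete, a traceability check can be pictured as comparing the path a dataset actually took against the life cycle that the lineage documents. The Python sketch below is a simplified assumption of how such a check might look; the function trace_deviations and the stage names are hypothetical.

```python
from typing import List, Sequence


def trace_deviations(expected: Sequence[str], observed: Sequence[str]) -> List[str]:
    """List where an observed path departs from the documented life cycle."""
    problems: List[str] = []
    position = {stage: i for i, stage in enumerate(expected)}
    last = -1  # index of the furthest expected stage seen so far
    for stage in observed:
        if stage not in position:
            problems.append(f"unexpected stage: {stage}")
            continue
        if position[stage] < last:
            problems.append(f"out of order: {stage}")
        last = max(last, position[stage])
    for stage in expected:
        if stage not in observed:
            problems.append(f"missing stage: {stage}")
    return problems


# Documented life cycle (the lineage) versus what actually happened (the trace).
EXPECTED = ["ingest", "validate", "transform", "publish"]
OBSERVED = ["ingest", "transform", "manual_fix", "publish"]

for problem in trace_deviations(EXPECTED, OBSERVED):
    print(problem)
```

Here the expected list plays the role of the lineage documentation, and the function plays the role of traceability: it reports that validation was skipped and that an undocumented manual step was introduced.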