The Technology
Data lineage (a type of provenance) consists of metadata added to the data in a database. It enables tracking down errors within the data and can also be used for justification for the existence of data in the database. Data lineage can be implemented using several methods. Annotations were proved useful in explaining query results at various resolution levels. However, tracking the lineage of data tuples throughout their database lifetime requires more sophisticated methods. For scenarios where data tuples affect other tuples, lineage annotations become deeply nested and highly complex in terms of space consumption, lineage querying time, as well as clarity and readability with time.
We use Machine Learning (ML) and Natural Language Processing (NLP) techniques for approximating lineage tracking. Word embedding is used to endow an explicitly inserted tuple with a small set of vectors that “encode” its content, and an algebra on such sets of vectors that derives a set of vectors which encodes the lineage of a query-inserted tuple. The new technique requires only a constant additional space per tuple for recording lineage information. During the execution of a query, the lineage vectors of the final (and intermediate) result tuples are constructed in a similar fashion to that of semiring-based exact provenance calculations.
Advantages
- “Natural ranking” of explanations
- No space complexity blow-up over time
- Lifelong lineage
Applications and Opportunities
- Data provenance/data lineage