ML-based lineage in databases

Researcher:

Prof. Emeritus Oded Shmueli | Computer Science

Categories:

AI & Data Science | Computer Science & Electrical Engineering

The Technology

Data lineage (a type of provenance) is metadata attached to the data in a database. It helps trace errors in the data back to their source and can justify why a given piece of data exists in the database. Lineage can be implemented in several ways. Annotations have proved useful for explaining query results at various levels of resolution. However, tracking the lineage of tuples throughout their lifetime in the database requires more sophisticated methods. When tuples affect other tuples, lineage annotations become deeply nested, and over time they grow costly in space and in lineage-query time while losing clarity and readability.
We use Machine Learning (ML) and Natural Language Processing (NLP) techniques to approximate lineage tracking. Word embeddings endow each explicitly inserted tuple with a small set of vectors that “encode” its content, and an algebra on such sets of vectors derives a set of vectors that encodes the lineage of each query-inserted tuple. The technique requires only constant additional space per tuple to record lineage information. During query execution, the lineage vectors of the final (and intermediate) result tuples are constructed in a fashion similar to semiring-based exact provenance calculations.
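
The idea of a bounded set of lineage vectors per tuple can be illustrated with a minimal Python sketch. It is not the researchers' implementation: the cap `MAX_VECTORS`, the toy embedding table, the merge-by-averaging rule, and the `add`/`mul` operations are all assumptions chosen for illustration, loosely mirroring the addition and multiplication of a provenance semiring.

```python
import math
from itertools import combinations

MAX_VECTORS = 2  # hypothetical cap; keeps lineage space per tuple constant

# Toy 3-dimensional "word embeddings" (assumption); a real system would
# use a trained embedding model instead of this hand-made table.
TOY_EMBEDDINGS = {
    "apple": (1.0, 0.0, 0.0),
    "truck": (0.0, 1.0, 0.0),
    "paris": (0.0, 0.0, 1.0),
}

def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean(u, v):
    """Element-wise average of two vectors."""
    return tuple((a + b) / 2 for a, b in zip(u, v))

def cap(vectors):
    """Merge the two most similar vectors until at most MAX_VECTORS remain."""
    vectors = list(vectors)
    while len(vectors) > MAX_VECTORS:
        i, j = max(combinations(range(len(vectors)), 2),
                   key=lambda p: cosine(vectors[p[0]], vectors[p[1]]))
        merged = mean(vectors[i], vectors[j])
        vectors = [v for k, v in enumerate(vectors) if k not in (i, j)] + [merged]
    return vectors

def embed_tuple(words):
    """Lineage vectors of an explicitly inserted tuple: its word embeddings."""
    return cap(TOY_EMBEDDINGS[w] for w in words)

def add(a, b):
    """Alternative derivations (e.g. union): pool both vector sets."""
    return cap(a + b)

def mul(a, b):
    """Joint derivation (e.g. join): combine the two sets pairwise."""
    return cap(mean(u, v) for u in a for v in b)

def similarity(a, b):
    """Similarity of two vector sets: the best pairwise cosine."""
    return max(cosine(u, v) for u in a for v in b)

def rank(lineage, candidates):
    """Order candidate source tuples by similarity to a lineage vector set."""
    return sorted(candidates,
                  key=lambda name: similarity(lineage, candidates[name]),
                  reverse=True)

base = {
    "t1": embed_tuple(["apple"]),
    "t2": embed_tuple(["truck"]),
    "t3": embed_tuple(["paris"]),
}
derived = mul(base["t1"], base["t2"])   # tuple produced by joining t1 and t2
ranking = rank(derived, base)           # t3, which took no part, ranks last
pooled = add(add(base["t1"], base["t2"]), base["t3"])  # still capped in size
```

Ranking candidate tuples by similarity to the stored vectors is one way such a scheme could yield an ordered list of explanations. However many tuples contribute, the stored set never exceeds `MAX_VECTORS`, which is what makes the lineage approximate rather than exact.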

Advantages

“Natural ranking” of explanations
No space complexity blow-up over time
Lifelong lineage

Applications and Opportunities

Data provenance/data lineage

Business Development Contacts

Dr. Arkadiy Morgenshtein

Director of Business Development, ICT
