ML-based lineage in databases

Prof. Oded Shmueli | Computer Science


Information and Computer Science

The Technology

Data lineage (a type of provenance) is metadata attached to the data in a database. It makes it possible to trace errors in the data back to their source, and it can also be used to justify why a given piece of data exists in the database. Data lineage can be implemented in several ways. Annotations have proved useful for explaining query results at various levels of resolution. However, tracking the lineage of data tuples throughout their lifetime in the database requires more sophisticated methods. In scenarios where data tuples affect other tuples, lineage annotations become deeply nested, and over time they grow costly in space consumption and lineage querying time and become harder to read.
We use Machine Learning (ML) and Natural Language Processing (NLP) techniques to approximate lineage tracking. Word embeddings endow each explicitly inserted tuple with a small set of vectors that “encode” its content, and an algebra on such sets of vectors derives the set of vectors that encodes the lineage of a query-inserted tuple. The technique requires only constant additional space per tuple for recording lineage information. During the execution of a query, the lineage vectors of the final (and intermediate) result tuples are constructed in a fashion similar to semiring-based exact provenance calculations.
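The idea of a bounded set of lineage vectors per tuple, combined by an algebra, can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration and not the method's actual definition: the cap MAX_VECTORS, the toy 3-dimensional "embeddings", the averaging used for the two operations, and the helper names (cap, plus, times, rank). Where a real system would likely cluster vectors to stay within the cap, this sketch simply averages round-robin groups.

```python
import math
from itertools import product

MAX_VECTORS = 2  # hypothetical per-tuple cap; this is what keeps space constant


def mean(vectors):
    """Component-wise average of equally sized vectors."""
    n = len(vectors)
    return tuple(sum(component) / n for component in zip(*vectors))


def cap(vectors, limit=MAX_VECTORS):
    """Reduce a collection of vectors to at most `limit` vectors.

    Assumption: averaging round-robin groups stands in for a proper
    clustering step.
    """
    unique = sorted(set(vectors))
    if len(unique) <= limit:
        return unique
    groups = [unique[i::limit] for i in range(limit)]
    return [mean(group) for group in groups]


def plus(a, b):
    """Alternative derivations (semiring-style '+'): pool both vector sets."""
    return cap(a + b)


def times(a, b):
    """Joint derivation (semiring-style '*'): average every cross pair."""
    return cap([mean([u, v]) for u, v in product(a, b)])


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity(a, b):
    """Similarity of two vector sets: best cosine over all pairs."""
    return max(cosine(u, v) for u in a for v in b)


def rank(target, candidates):
    """Order candidate tuples by how likely they are in the target's lineage."""
    return sorted(candidates,
                  key=lambda name: similarity(target, candidates[name]),
                  reverse=True)


# Toy vectors for three explicitly inserted tuples (not real word embeddings).
base = {
    "t1": [(1.0, 0.0, 0.0)],
    "t2": [(0.0, 1.0, 0.0)],
    "t3": [(0.0, 0.0, 1.0)],
}

# A query-inserted tuple produced by joining t1 and t2.
joined = times(base["t1"], base["t2"])

# A tuple with two alternative derivations, later combined with a third.
merged = plus(plus(joined, base["t3"]), base["t1"])

ranking = rank(joined, base)
print(ranking)
print(len(merged))
```

In this toy run, t3 takes no part in producing the joined tuple and so ranks last among the candidate explanations, while the merged tuple still carries no more than MAX_VECTORS vectors regardless of how many derivation steps fed into it.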

Advantages

  • “Natural ranking” of explanations
  • No space complexity blow-up over time
  • Lifelong lineage

Applications and Opportunities

  • Data provenance/data lineage

Business Development Contacts
Shikma Litmanovitz
Director of Business Development, Physical Science