DataSys: Data-Intensive Distributed Systems Laboratory

Illinois Institute of Technology
Department of Computer Science

Integrating Data Provenance Support in Database Systems

Boris Glavic

Department of Computer Science

Bahen Center for Information Technology

University of Toronto

Stuart Building 113
Monday, January 9th, 2012


Abstract: Current data management technologies, including scientific databases, data warehouses, data integration frameworks, web technologies, and workflow management systems, have enabled the recording and rapid sharing of enormous amounts of information. A large portion of such data is no longer the direct result of measurements or manual entry by a user, but is instead derived from existing data through complex automated transformations. In such settings it is of utmost importance to understand the origin and creation process of data in order to estimate its quality, to gain additional insights about it, or to trace errors in transformed data back to their origins. This kind of information is often referred to as data provenance.

In this talk I give an overview of how I address these challenges in my work on database systems. I will focus mainly on two of my projects in data provenance:

(1) Perm is a scalable system for generating and querying provenance information over relational data. The two key ideas behind this approach are representing data and its provenance together in a single relation and rewriting queries to generate this representation. Perm supports fully integrated, efficient, on-demand provenance generation and querying.

(2) The TRAMP system enables debugging of information integration scenarios based on provenance information. The system supports tracing errors with different types of causes (the data, inconsistencies between data sources, the schemas, schema constraints, the mappings, or the transformations). TRAMP combines data provenance with two novel notions, transformation provenance and mapping provenance, to explain the relationship between transformed data and the transformations and mappings that produced it.
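The two key ideas named for Perm in the abstract (data and provenance in a single relation, produced by rewriting the query) can be illustrated with a small hand-written sketch. This is not Perm's implementation or its actual rewrite rules: the schema, the data, the `prov_` column naming, and the rewritten SQL are all assumptions made up for illustration, and SQLite is used only because it ships with Python.

```python
import sqlite3

# Hypothetical example schema and data (not taken from the talk).
SETUP = """
CREATE TABLE shop(name TEXT, num_empl INTEGER);
CREATE TABLE sales(shop_name TEXT, item_id INTEGER);
INSERT INTO shop VALUES ('Joba', 14), ('Merdies', 3);
INSERT INTO sales VALUES ('Joba', 3), ('Merdies', 1), ('Merdies', 2);
"""

# The query a user would normally write: number of sales per shop.
ORIGINAL_QUERY = """
SELECT s.name AS name, COUNT(*) AS num_sales
FROM shop s JOIN sales a ON a.shop_name = s.name
GROUP BY s.name
"""

# A manually rewritten version (an assumption, not Perm's output): each
# result tuple is repeated once per contributing combination of input
# tuples, and the input attributes are appended as prov_* columns.
REWRITTEN_QUERY = f"""
SELECT q.name, q.num_sales,
       s.name AS prov_shop_name, s.num_empl AS prov_shop_num_empl,
       a.shop_name AS prov_sales_shop_name, a.item_id AS prov_sales_item_id
FROM ({ORIGINAL_QUERY}) q
JOIN shop s ON s.name = q.name
JOIN sales a ON a.shop_name = s.name
ORDER BY q.name, a.item_id
"""


def run(query):
    """Run a query against a fresh in-memory copy of the example data."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(SETUP)
        return conn.execute(query).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in run(REWRITTEN_QUERY):
        print(row)
```

Because the result of the rewritten query is itself an ordinary relation, provenance can be filtered or joined with plain SQL, for example by wrapping it in a further `SELECT ... WHERE prov_sales_item_id = 2`.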

Bio: Dr. Boris Glavic is a postdoc in the Department of Computer Science at the University of Toronto, where he is a member of the Database Research Group under Renée J. Miller. His current research focuses mainly on data provenance and information integration. He received a Diploma (Master) in Computer Science from RWTH Aachen in Germany and a PhD in Computer Science from the University of Zurich in Switzerland, where he was advised by Michael Böhlen and Gustavo Alonso.