DataSys: Data-Intensive Distributed Systems Laboratory

Illinois Institute of Technology
Department of Computer Science

Tracing SPADE's Lineage

Dr. Tanu Malik
Computation Institute
University of Chicago

In this talk, I will present the architecture and design choices behind SPADE, a tool for recording and querying provenance in distributed environments. SPADE provides filesystem support to transparently generate and certify data provenance. SPADEv2, the current version of SPADE, is built atop a FUSE interface to efficiently record the provenance of files. Details about the processes that read and write such files, along with information about the files themselves, are stored in a local database that can be queried via a SQL interface. An overloaded namespace allows provenance metadata to be transferred seamlessly between hosts without modifying applications. The included ‘lineage’ tool can be used to verify the provenance of a file. I will give a short demo in which we will install SPADE and show various capabilities of the tool. We will then discuss some potential research projects and ways to extend SPADE.
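
To give a feel for what querying file provenance through a SQL interface can look like, here is a minimal sketch in Python using the standard sqlite3 module. It is an illustration only, not SPADE's actual schema or API: the tables `process` and `file_access`, the sample rows, and the `lineage` function are all hypothetical names chosen for this example. The sketch records which processes read and wrote which files, then walks backwards from a file to every file it was derived from.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical provenance schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE process (pid INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE file_access (
            pid  INTEGER REFERENCES process(pid),
            path TEXT,
            mode TEXT CHECK (mode IN ('read', 'write'))
        );
        """
    )
    # Made-up sample data: 'sort' turns raw.csv into sorted.csv,
    # and 'plot' turns sorted.csv into figure.png.
    conn.executemany(
        "INSERT INTO process VALUES (?, ?)",
        [(1, "sort"), (2, "plot")],
    )
    conn.executemany(
        "INSERT INTO file_access VALUES (?, ?, ?)",
        [
            (1, "raw.csv", "read"),
            (1, "sorted.csv", "write"),
            (2, "sorted.csv", "read"),
            (2, "figure.png", "write"),
        ],
    )
    return conn


# A file's ancestors are the files read by any process that wrote it,
# plus the ancestors of those files. UNION removes duplicates, so the
# recursion terminates even if the access records contain a cycle.
LINEAGE_SQL = """
WITH RECURSIVE ancestors(path) AS (
    SELECT r.path
    FROM file_access AS w
    JOIN file_access AS r ON r.pid = w.pid
    WHERE w.mode = 'write' AND r.mode = 'read' AND w.path = ?
    UNION
    SELECT r.path
    FROM ancestors AS a
    JOIN file_access AS w ON w.path = a.path AND w.mode = 'write'
    JOIN file_access AS r ON r.pid = w.pid AND r.mode = 'read'
)
SELECT path FROM ancestors ORDER BY path
"""


def lineage(conn, path):
    """Return the sorted list of files that `path` was derived from."""
    return [row[0] for row in conn.execute(LINEAGE_SQL, (path,))]


if __name__ == "__main__":
    conn = build_demo_db()
    print(lineage(conn, "figure.png"))
```

With the sample rows above, the lineage of figure.png is raw.csv and sorted.csv, while raw.csv has no recorded ancestors. A recursive query is used here because a provenance chain can be arbitrarily long; a real deployment would likely also track timestamps and hosts, which this sketch leaves out.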