DataSys: Data-Intensive Distributed Systems Laboratory

Illinois Institute of Technology
Department of Computer Science

XSearch: Distributed Indexing and Search in Large-Scale File Systems

Increasing data volumes, particularly in science and engineering, have resulted in the widespread adoption of parallel and distributed file systems for storing and accessing data efficiently. However, as file system sizes and the amount of data “owned” by users grow, it becomes increasingly difficult to discover and locate data among the terabytes or petabytes of accessible data. While it is now routine to search for data on a PC or discover data online at the click of a button, there is no equivalent method for discovering data on large parallel and distributed file systems.

The XSearch project argues the need for new methods to support search in the context of large-scale storage systems. This work discusses why current models are inadequate and investigates the increasing size and complexity of production parallel and distributed file systems to outline the scale of the challenge to be addressed.

Throughout this project I have explored popular search data structures (hash maps, tries, trees, skip lists), information retrieval libraries (Apache Lucene, Xapian), and cloud search platforms (Apache Solr, Elasticsearch), all of them implemented and evaluated in C, C++, Java, or Python. The project has also given me deep insight into I/O and data-intensive programming, database and search engine design, and advanced multithreaded synchronization techniques (lock-free synchronization, atomic operations), as well as experience with various parallel and distributed file systems (Lustre, Ceph, HDFS, FusionFS).

XSearch aims to implement a scalable distributed indexing system designed to support powerful free-text search across file-based data. The long-term goal is to integrate XSearch with existing parallel and distributed file systems to provide efficient search.
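To make the idea of free-text search over file-based data concrete, the following is a minimal single-node sketch of an inverted index built on a hash map, one of the data structures mentioned above. It is an illustration under stated assumptions, not XSearch's implementation: the names `tokenize` and `InvertedIndex` and the example paths are hypothetical, and a distributed design would additionally need to partition the index across nodes.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class InvertedIndex:
    """Hypothetical sketch: maps each term to the set of file paths containing it."""

    def __init__(self):
        # Hash map from term to the set of paths whose contents include that term.
        self.postings = defaultdict(set)

    def add_document(self, path, text):
        """Index the contents of one file under its path."""
        for term in tokenize(text):
            self.postings[term].add(path)

    def search(self, query):
        """Return the paths that contain every term in the query (AND semantics)."""
        terms = tokenize(query)
        if not terms:
            return set()
        # .get avoids creating empty entries for terms that were never indexed.
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result


# Example usage with made-up paths and contents.
index = InvertedIndex()
index.add_document("/data/run1/notes.txt", "Climate simulation output, 2 km grid")
index.add_document("/data/run2/readme.txt", "Genome assembly notes")
print(sorted(index.search("simulation grid")))
```

The sketch indexes file contents only, so a query matches a file when all query terms appear in its text. Swapping the hash map for a trie or skip list would change lookup and memory characteristics but not the overall structure.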