XSearch: Distributed Indexing and Search in Large-Scale File Systems
Increasing data volumes, particularly in science and engineering, have resulted in the widespread adoption of parallel and distributed file systems for storing and accessing data efficiently. However, as file system sizes and the amount of data "owned" by users grow, it becomes increasingly difficult to discover and locate data among the terabytes or petabytes of accessible data. While it is now routine to search for data on a PC or discover data online at the click of a button, there is no equivalent method for discovering data on large parallel and distributed file systems. The XSearch project argues for new methods to support search in the context of large-scale storage systems. This work discusses why current models are inadequate and investigates the increasing size and complexity of production parallel and distributed file systems to outline the scale of the challenge to be addressed.
Throughout this project I have explored popular search data structures (Hashmaps, Tries, Trees, Skip Lists), information retrieval libraries (Apache Lucene, Xapian) and cloud search platforms (Apache Solr, Elasticsearch), all of them implemented and evaluated in C, C++, Java or Python. The project has likewise allowed me to gain deep insight into I/O and data-intensive programming, database and search engine design, and advanced multithreaded synchronization techniques (lock-free synchronization, atomic operations), as well as experience with various parallel and distributed systems (Lustre, Ceph, HDFS, FusionFS).
XSearch aims to implement a scalable distributed indexing system designed to support powerful free-text search mechanisms across file-based data. The long-term goal is to integrate XSearch with existing parallel and distributed file systems to provide efficient search.
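To illustrate the kind of structure that sits behind free-text search over file contents, here is a minimal single-node sketch of a hashmap-based inverted index in Python. It is not the XSearch implementation: the `InvertedIndex` class, the `tokenize` helper, and the sample file paths are hypothetical names chosen for this example, and a distributed system would additionally need to partition and merge such indexes across nodes.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class InvertedIndex:
    """Hypothetical example: maps each term to the set of file ids containing it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add_document(self, doc_id, text):
        """Index one file's text under the given identifier."""
        for term in tokenize(text):
            self.postings[term].add(doc_id)

    def search(self, query):
        """Return the sorted ids of files that contain every query term."""
        terms = tokenize(query)
        if not terms:
            return []
        # Intersect posting sets, smallest first, to keep intermediate results small.
        sets = sorted((self.postings.get(t, set()) for t in terms), key=len)
        result = set(sets[0])
        for s in sets[1:]:
            result &= s
        return sorted(result)


# Sample data (made up for illustration).
index = InvertedIndex()
index.add_document("run1/output.log", "Simulation converged after 42 iterations")
index.add_document("run2/output.log", "Simulation diverged; restart required")
index.add_document("notes/readme.txt", "Restart the simulation with a smaller step")

print(index.search("simulation restart"))
```

A hashmap keeps lookups for exact terms cheap; prefix or range queries would instead favor one of the ordered structures mentioned above, such as a trie or a skip list.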
- Period: 01/2017 - Present
- Languages: C/C++, Python
- Features: TBA
- Technologies: TBA
- OS: Linux
- Testbeds: Mystic
- Scalability: TBA
- Performance: TBA
- Funding: TBA
Publications
- Alexandru Iulian Orhean, Ioan Raicu. "Distributed Indexing and Search in Large-Scale Storage Systems", Illinois Institute of Technology, Computer Science Department, PhD Proposal, December 2022
- Alexandru Iulian Orhean, Anna Giannakou, Lavanya Ramakrishnan, Kyle Chard, Ioan Raicu. "SCANNS: Towards Scalable and Concurrent Data Indexing and Searching in High-End Computing System", The 22nd IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGrid) 2022
- Alexandru Iulian Orhean, Anna Giannakou, Katie Antypas, Ioan Raicu, Lavanya Ramakrishnan. "Evaluation of a Scientific Data Search Infrastructure", Concurrency and Computation: Practice and Experience (CCPE), 2022
- Alexandru Orhean, Kyle Chard, Ioan Raicu. "XSearch: Distributed Information Retrieval in Large-Scale Storage Systems", Illinois Institute of Technology, Department of Computer Science, PhD Oral Qualifier, 2018
- Anna Blue Keleher, Kyle Chard, Ian Foster, Alexandru Iulian Orhean, Ioan Raicu. "Finding a Needle in a Field of Haystacks: Metadata Search for Distributed Research Repositories", IEEE/ACM Supercomputing/SC 2017
- Alexandru Iulian Orhean, Itua Ijagbone, Dongfang Zhao, Kyle Chard, Ioan Raicu. "Toward Scalable Indexing and Search on Distributed and Unstructured Data", IEEE Big Data Congress 2017
- Jonathan Wu, Suraj Chafle, Ioan Raicu, Kyle Chard. "Optimizing Search in Un-Sharded Large-Scale Distributed Systems", IEEE/ACM Supercomputing/SC 2016 (Poster)
- Itua Ijagbone, Ioan Raicu (advisor). "Scalable Indexing and Searching on Distributed File Systems", Department of Computer Science, Illinois Institute of Technology, MS Thesis, 2016
Presentation
- TBA