Project -- Data Analytics and Systems (Kyle Chard)
Increasing data sizes and system heterogeneity are leading to a world in which data and computation are distributed across a continuum of computing devices. As a result, new systems, such as Globus and funcX, have been developed to make it easier to manage data and computation. Our projects combine novel systems research with cutting-edge data analytics to understand and improve system performance.
Globus and funcX have each managed millions of transfers and computations on behalf of users. We will apply data analytics to this collection of current and historical information to improve performance: for example, predicting which endpoints and files specific users will access, which error conditions will arise given data and network conditions, transfer performance between wide-area storage locations, and execution performance on heterogeneous computing infrastructure. In this project, students will analyze historical data, apply various data analytics techniques to understand those data, identify features and develop predictive heuristics, and apply prediction techniques such as collaborative filtering and neural networks to improve performance. We have prior work in this space predicting data transfers (ACM SRC 2016) and predicting function invocation dependencies for container caching algorithms (ACM SRC 2021).
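To make the collaborative filtering idea concrete, here is a minimal sketch in Python. The user-by-endpoint access-count matrix, the rank k, and the values are all illustrative stand-ins for counts that would be mined from real Globus logs; the truncated SVD is a classic matrix-factorization baseline, not the method the project prescribes.

```python
import numpy as np

# Hypothetical user x endpoint access-count matrix (rows: users, cols: endpoints).
# In practice this would be built from Globus transfer logs.
access = np.array([
    [5, 0, 3, 0],
    [4, 0, 0, 1],
    [0, 2, 0, 6],
    [0, 3, 1, 5],
], dtype=float)

# Low-rank factorization via truncated SVD: reconstructed scores estimate a
# user's affinity for endpoints they have not yet accessed.
k = 2
U, s, Vt = np.linalg.svd(access, full_matrices=False)
scores = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# For each user, rank the unseen endpoints by predicted score.
for user in range(access.shape[0]):
    unseen = np.where(access[user] == 0)[0]
    ranked = unseen[np.argsort(-scores[user, unseen])]
    print(f"user {user}: suggested endpoints {ranked.tolist()}")
```

A real pipeline would replace the toy matrix with log-derived counts and validate the rankings against held-out accesses.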
Our second project focuses on extracting valuable metadata from distributed data collections with the aim of making data findable, accessible, interoperable, and reusable (FAIR). Unfortunately, scientific data are stored in myriad (often opaque) formats, are large, and are distributed among many storage locations. We are developing an automated system that applies a set of customized extractors to files to derive metadata. We will work with students to define methods to infer file types and predict extractor utility, develop new extractors for different data formats (e.g., using NLP and computer vision techniques), compute metadata utility (e.g., completeness, uniqueness, readability), and scale these methods to large data volumes. We have conducted prior work to develop extractors (ACM SRC 2017), measure the cleanliness of repositories (ACM SRC 2018), and develop probabilistic extraction pipelines (ACM SRC 2021).
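As a rough illustration of the infer-type-then-extract pattern (a sketch, not the actual system: the extractor functions and the MIME-type dispatch table below are hypothetical), a pipeline might guess each file's type and apply the matching extractor:

```python
import json
import mimetypes
from pathlib import Path

# Hypothetical extractors: each takes a path and returns a metadata dict.
def extract_tabular(path):
    header = path.read_text(errors="ignore").splitlines()[:1]
    return {"type": "tabular", "columns": header[0].split(",") if header else []}

def extract_text(path):
    words = path.read_text(errors="ignore").split()
    return {"type": "text", "word_count": len(words), "preview": " ".join(words[:10])}

# Map inferred MIME types to extractors; unknown types fall through to None.
EXTRACTORS = {
    "text/csv": extract_tabular,
    "text/plain": extract_text,
}

def extract_metadata(path):
    mime, _ = mimetypes.guess_type(path.name)  # cheap type inference from the name
    extractor = EXTRACTORS.get(mime)
    return extractor(path) if extractor else None

if __name__ == "__main__":
    for f in Path(".").glob("*"):
        if f.is_file():
            meta = extract_metadata(f)
            if meta:
                print(json.dumps({"file": f.name, **meta}))
```

The project's harder questions sit on top of this skeleton: inferring types for files whose names and headers are uninformative, predicting which extractors are worth running, and scoring the utility of what they return.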
These projects are appropriate for undergraduate students with some background in data analytics, machine learning, or natural language processing (NLP). Students will explore approaches such as rule-based methods, classification models, supervised machine learning, deep learning, and crowdsourcing. Although projects in these areas are challenging, the scalable machine learning software stack has advanced significantly in recent years, offering highly efficient and usable libraries and hardware.
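As a hedged sketch of the supervised-learning entry point, the snippet below trains a logistic regression to predict transfer failures. The features (file size, concurrent load, hour of day) and the synthetic data-generating process are assumptions for illustration only; students would fit such a model to real log-derived features instead.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for transfer logs: three assumed features and a
# failure label with an assumed (not measured) relationship to them.
rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([
    rng.exponential(2.0, n),   # file size (GB)
    rng.integers(1, 20, n),    # concurrent transfers
    rng.integers(0, 24, n),    # hour of day
])
p_fail = 1 / (1 + np.exp(-(0.3 * X[:, 0] + 0.1 * X[:, 1] - 3)))
y = rng.random(n) < p_fail

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression().fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```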