DataSys: Data-Intensive Distributed Systems LaboratoryData-Intensive Distributed Systems Laboratory

Illinois Institute of Technology
Department of Computer Science

Correlation Aware Optimizations for Analytic Databases    Hideaki Kimura

Hideaki Kimura

Computer Science Department

Brown University

Stuart Building TBA
Friday, January 13th, 2012


Abstract: Recent years have seen that the analysis of large data-sets is crucially important in a wide range of business, governmental, and scientific applications. For example, research projects in astronomy need to analyze petabytes of image data taken from telescopes. Providing a fast and scalable analytical data management system for such users has become increasingly important. The major bottlenecks for analytics on such big data are disk- and network-I/O. Because the data is too large to fit in RAM, each query causes substantial disk I/O. Traditional database systems provide indexes to speed up disk reads, but many analytic queries do not benefit from indexes because data is scattered over a large number of disk blocks and disk seeks are prohibitively expensive. Furthermore, such huge data sets need to be partitioned and distributed over hundreds or many thousands of nodes. When a query requires more than one data at once, such as a query involving a JOIN operation, the data management system must transmit a large amount of data over the network. For example, the Shuffle phase in Map-Reduce systems copies file blocks over the network and causes a significant bottleneck in many cases. Our approach to tackling these challenges in big data analytics is to exploit correlations. I will describe our correlation-aware indexing, replication, and data placement which make big data analytics faster and more scalable. Finally, if time allows, I will also introduce another on-going project to develop a scalable transactional processing system on modern hardware in collaboration with Hewlett-Packard Laboratories.  

Bio: Hideaki Kimura is a doctoral candidate in the Computer Science Department at Brown University.  His main research interests are in data management systems. His dissertation research with Prof. Stan Zdonik is on correlation-based optimizations for large analytic databases. He also worked on transaction processing systems exploiting modern hardware at HP Labs.