DataSys: Data-Intensive Distributed Systems Laboratory

Illinois Institute of Technology
Department of Computer Science

Iterative MapReduce Enabling HPC-Cloud Interoperability

Dr. Judy Qiu
Assistant Professor

Computer Science and Informatics

Indiana University

Stuart Building 113
Friday, November 4th, 2011
12:45PM - 1:45PM

Slides

Abstract: Clouds and MapReduce have proven broadly useful for scientific computing, especially for parallel data-intensive applications. However, they have limited applicability to some areas, such as data mining, because MapReduce performs poorly on problems with the iterative structure present in the linear algebra that underlies much data analysis. Such problems can be run efficiently on HPC clusters using MPI, which leads to a hybrid cloud and cluster environment. This motivates the design and implementation of Twister, an open source Iterative MapReduce system. Comparisons of Amazon, Azure, and traditional Linux and Windows environments on common applications have shown encouraging performance and usability in several important non-iterative cases. These are linked to HPC applications for the final stages of the data analysis. Further, we have released the open source Twister Iterative MapReduce and benchmarked it against basic MapReduce and MPI in information retrieval and life sciences applications. The major overall challenge for this research is building a production environment capable of handling the rapid increases in dataset sizes while solving the technical challenges of portability with scaling performance and fault tolerance, using an attractive and powerful programming model. We show our preliminary results for Mapreduce4Azure, the first MapReduce on the Microsoft Azure Cloud Platform. Twister interpolates between HPC (MPI) and Clouds (MapReduce) and promises a uniform programming environment for many data-intensive applications.

Bio: Dr. Judy Qiu is an Assistant Professor of Computer Science at Indiana University. Her areas of study include parallel and distributed systems, Cloud/Grid computing, and high performance computing. She started the multicore project with Microsoft, Inc. in 2006, with initial postdoctoral work focusing on the performance of threading versus MPI in both kernels and data mining applications. This research effort has evolved into the current SALSA project, which encompasses data-intensive computing at the intersection of Cloud and multicore technologies. Her research interests involve the architecture and use of leading-edge technologies, with special emphasis on their value to important applications such as life science applications and data-intensive technologies. Extending her research beyond MapReduce, she works to support iterative algorithms in data mining and machine learning, and her research team has released both Java and Azure versions of the Twister iterative MapReduce system. Her work has been funded by NSF, NIH, Microsoft, and the Indiana University Faculty Research Support Program. Prof. Qiu is an Assistant Director of the Digital Science Center, where she leads the SALSA research team and supervises the research activities of both professional staff and PhD students from the IU School of Informatics and Computing. Prof. Qiu is also active in program service. She was the founder and chair of the Multicore2010, ECMLS2010, ECMLS2011, and BigDataforScience2010 workshops and served as a Program Co-Chair of the 2nd IEEE International Conference on Cloud Computing Technology and Science 2010. She is on the editorial board of the International Journal of Cloud Computing.