MATRIX: MAny-Task computing execution fabRIc at eXascales
MATRIX is a distributed, data-aware execution fabric that supports both high-performance computing (HPC) and many-task computing (MTC) workloads. The fabric achieves fault tolerance by having all compute nodes participate in job submission and handling, and it uses work stealing for efficient distributed load balancing. It guarantees that jobs are executed and that their dependencies are honored, and it relies on an underlying scalable distributed storage system (e.g., FusionFS) for inter-process communication. Data-aware scheduling maximizes data locality by placing computational tasks close to the data they use. Computation is overlapped with I/O to hide latency and reduce wasted resources. The fabric is elastic: its resource usage grows and shrinks with application demand. The fabric also supports a compact task representation that alleviates task submission bottlenecks for common patterns (e.g., “for each x do y”). MATRIX is integrated with several other projects, including FusionFS (a distributed file system), D^3 (direct distributed data-structure), and Swift (a parallel programming system). Through the Swift project collaboration, the work will be evaluated on the largest high-end computing (HEC) systems with applications from many domains (e.g., bioinformatics, medicine, pharmaceuticals, astronomy, physics, climate modeling, economics, and analytics).
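The work-stealing idea mentioned above can be illustrated with a minimal, single-process C++ sketch. This is not MATRIX's implementation: the `Task` and `NodeQueue` types, the `steal_from_busiest` function, and the "steal half from the most loaded peer" policy are all assumptions made for illustration, and a real fabric would exchange tasks between nodes over the network rather than through shared memory.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical task record; the real MATRIX task format is not described here.
struct Task {
    int id;
};

// One task queue per compute node, simulated in a single process.
struct NodeQueue {
    std::deque<Task> tasks;

    // The owning node takes work from the front of its own queue.
    // Returns false if the queue is empty.
    bool pop_local(Task& out) {
        if (tasks.empty()) return false;
        out = tasks.front();
        tasks.pop_front();
        return true;
    }

    // A thief removes half of the tasks (rounded down) from the back,
    // so the owner and the thief work on opposite ends of the queue.
    std::vector<Task> steal_half() {
        std::size_t n = tasks.size() / 2;
        std::vector<Task> stolen;
        stolen.reserve(n);
        for (std::size_t i = 0; i < n; ++i) {
            stolen.push_back(tasks.back());
            tasks.pop_back();
        }
        return stolen;
    }
};

// An idle node (the thief) picks the most loaded peer and steals from it.
// Returns the number of tasks moved; 0 means no peer had work to spare.
std::size_t steal_from_busiest(std::vector<NodeQueue>& nodes, std::size_t thief) {
    assert(thief < nodes.size());
    std::size_t victim = thief;
    std::size_t best = 0;
    for (std::size_t i = 0; i < nodes.size(); ++i) {
        if (i != thief && nodes[i].tasks.size() > best) {
            best = nodes[i].tasks.size();
            victim = i;
        }
    }
    if (victim == thief) return 0;  // every peer is empty
    std::vector<Task> stolen = nodes[victim].steal_half();
    for (const Task& t : stolen) nodes[thief].tasks.push_back(t);
    return stolen.size();
}
```

As a usage example, if three `NodeQueue` objects are created and eight tasks are pushed onto the first, calling `steal_from_busiest(nodes, 2)` moves four tasks to the third node and leaves four on the first. Stealing from the opposite end of the queue from the owner is a common design choice because it reduces contention between owner and thief.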
-
Executive Summary (MATRIX)
-
Period: 01/2011 - Present
-
Web Site: TBA
-
Languages: C/C++
-
Features: TCP, UDP, Threads
-
Technologies: TBA
-
OS: Linux
-
Testbeds: Linux cluster
-
Scalability: TBA
-
Performance: TBA
-
Funding: TBA