Project -- Resource Management and Job Scheduling (Zhiling Lan)
In any large-scale system, the workload management system (also known as resource management and job scheduling) plays a critical role in efficient resource utilization and workload execution. Current workload management systems concentrate on CPU utilization and make allocation decisions based solely on an application's processor footprint. This approach is at odds with emerging hybrid system architectures and diverse workload requirements. Future machines are expected to comprise millions of computing cores and to embody extreme heterogeneity in many aspects (e.g., CPUs and accelerators/co-processors, multi-level networking, memory hierarchy, burst buffers, and storage). Meanwhile, the exponential growth in computing power has provided the enabling infrastructure to attack scientific problems that are much larger and more complex, and the emerging workloads comprise not only compute-intensive applications but also memory-intensive, data-intensive, and on-demand applications. These applications have diverse resource requirements and exhibit different execution characteristics. The extreme heterogeneity of hardware devices, combined with workload evolution, will render the current resource management infrastructure obsolete and force a disruptive change in the form, function, and interoperability of future resource management.
We propose the development of a flexible, intelligent, and multi-dimensional workload management system for extreme-scale high-performance computing. Flexibility means that resource management is driven by an adaptive process in which diverse workload requirements are specified and exposed, allowing differentiated services for mixed workloads. Multi-dimensional capability means integrated and coordinated management of a variety of on-chip and off-chip resources, and possibly other fine-grained system resources. Intelligence means that information about system resources and workload requirements is automatically gathered, analyzed, and acted on to achieve efficient resource utilization and application performance.
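To make the multi-dimensional idea concrete, the following is a minimal Python sketch of an allocation check that considers several resource types at once rather than processor count alone. The resource names (cpu_cores, gpus, burst_buffer_gb) and the functions fits and allocate are hypothetical, chosen only for illustration; they are not part of any existing scheduler.

```python
def fits(request, available):
    """Return True if every requested resource dimension can be satisfied."""
    return all(available.get(name, 0) >= amount
               for name, amount in request.items())


def allocate(request, available):
    """Return the remaining resources after allocation, or None if the
    request does not fit in at least one dimension."""
    if not fits(request, available):
        return None
    return {name: amount - request.get(name, 0)
            for name, amount in available.items()}


# Hypothetical node pool and job requests, for illustration only.
pool = {"cpu_cores": 64, "gpus": 4, "burst_buffer_gb": 100}
compute_job = {"cpu_cores": 32, "gpus": 2}
gpu_heavy_job = {"cpu_cores": 16, "gpus": 8}

remaining = allocate(compute_job, pool)      # fits in every dimension
rejected = allocate(gpu_heavy_job, pool)     # fails on the GPU dimension
```

A CPU-only scheduler would accept the second request because 16 cores are available; checking every dimension shows that it cannot run on this pool.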
A critical challenge in developing a workload management system is the inability to study the impact of job scheduling and allocation at scale. To remedy this problem, we will extend CQSim, the open-source scheduling simulator developed by the team led by Lan. CQSim emulates a real batch scheduler: whereas a real system takes jobs from user submissions, CQSim takes jobs by reading the job arrival information in a trace. Rather than executing jobs on a system, CQSim simulates job execution by advancing the simulation clock according to the job runtime information in the trace. We will expand CQSim by adding new functionality and features to support a flexible, intelligent, and multi-dimensional workload management system for current and next-generation computer clusters.
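The trace-driven approach described above can be sketched in a few lines of Python. This is not CQSim's code or interface; it is a minimal illustration under assumed conventions: a hypothetical trace format of (job_id, submit_time, runtime, nodes) tuples, a single node-count resource, and plain first-come, first-served scheduling with no backfilling. The function name simulate_fcfs is likewise an assumption.

```python
import heapq
from collections import deque


def simulate_fcfs(trace, total_nodes):
    """Replay a job trace under first-come, first-served scheduling.

    trace: iterable of (job_id, submit_time, runtime, nodes) tuples
           (a hypothetical format used only for this sketch).
    Returns a dict mapping job_id to (start_time, end_time).
    """
    jobs = sorted(trace, key=lambda job: job[1])
    for job_id, _, _, nodes in jobs:
        if nodes > total_nodes:
            raise ValueError(f"job {job_id} requests more nodes than exist")

    arrivals = deque(jobs)   # jobs not yet submitted, ordered by submit time
    waiting = deque()        # submitted jobs waiting to start
    running = []             # min-heap of (end_time, job_id, nodes)
    free = total_nodes
    schedule = {}

    while arrivals or waiting or running:
        # Jump the simulation clock to the next event: arrival or completion.
        next_arrival = arrivals[0][1] if arrivals else float("inf")
        next_finish = running[0][0] if running else float("inf")
        clock = min(next_arrival, next_finish)

        # Release the nodes of jobs that have finished by now.
        while running and running[0][0] <= clock:
            _, _, nodes = heapq.heappop(running)
            free += nodes

        # Move newly submitted jobs into the wait queue.
        while arrivals and arrivals[0][1] <= clock:
            waiting.append(arrivals.popleft())

        # Start jobs strictly in arrival order while the head job fits.
        while waiting and waiting[0][3] <= free:
            job_id, _, runtime, nodes = waiting.popleft()
            free -= nodes
            schedule[job_id] = (clock, clock + runtime)
            heapq.heappush(running, (clock + runtime, job_id, nodes))

    return schedule


# Hypothetical three-job trace on a six-node system.
example = simulate_fcfs(
    [("a", 0, 10, 4), ("b", 1, 5, 4), ("c", 2, 3, 2)], total_nodes=6
)
```

No job is ever executed: the clock jumps directly from one event to the next, so a trace spanning months can be replayed quickly. In the example, job "c" would fit on the two idle nodes at time 2 but waits behind "b" because the sketch does no backfilling; policies of that kind are what a simulator lets one compare.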
This project is suitable for undergraduate students with a general computer systems background, including data structures, systems programming, and Python programming. With guidance from the mentor, Lan, and her senior PhD students working on workload management, we believe that students with a solid systems background will be productive on this topic over a 10-week period.