Job packing is an effective technique for harvesting idle resources that are allocated to deep learning (DL) training jobs but not fully utilized, a common situation because clusters often run at low utilization and users tend to overestimate their resource needs. However, existing job packing techniques are conservative due to the mismatch in scope and granularity between job packing and cluster scheduling. In particular, tapping the potential of job packing in a training cluster requires a local and fine-grained coordination mechanism. To this end, we propose Gimbal, a novel job-packing middleware that operates between the cluster scheduler and the hardware resources. As middleware, Gimbal must not only facilitate coordination among packed jobs but also support the diverse scheduling objectives of different schedulers. Gimbal achieves this dual functionality by introducing a set of worker calibration primitives that adjust workers’ execution status in a fine-grained manner. The primitives hide the complexity of the underlying job and resource management mechanisms, thus offering the generality and extensibility needed to craft coordination policies tailored to various scheduling objectives. We implement Gimbal on a real-world GPU cluster and evaluate it with a set of representative DL training jobs. The results show that Gimbal improves different scheduling objectives by up to 1.32× compared with state-of-the-art job packing techniques.