CODA: Improving Resource Utilization by Slimming and Co-locating DNN and CPU Jobs

Han Zhao, Weihao Cui, Quan Chen, Jingwen Leng, Kai Yu, Deze Zeng, Chao Li, Minyi Guo

February 2020

Abstract

Deep neural network (DNN) models are often trained on GPUs, and many companies and research institutes build GPU clusters that are shared by different groups. On such GPU clusters, DNN training jobs also require CPU cores to run pre-processing and gradient synchronization. Our investigation shows that the number of cores allocated to a training job significantly impacts its performance. To this end, we characterize representative deep learning models on their requirement for CPU cores under different GPU resource configurations, and study the sensitivity of these models to other CPU-side shared resources. Based on the characterization, we propose CODA, a scheduling system that comprises an adaptive CPU allocator, a real-time contention eliminator, and a multi-array job scheduler. Experimental results show that CODA improves GPU utilization by 20.8% on average without increasing the queuing time of CPU jobs.

Type

Conference paper

Publication

In IEEE 40th International Conference on Distributed Computing Systems