Exploiting all intra-SM parallelism to maximize throughput while ensuring QoS

Abstract

To meet the growing demand for computational capacity, GPUs integrate an array of general-purpose and specialized computing units, including FP32 cores, INT32 cores, FP64 cores, Tensor Cores, and RT Cores, within each streaming multiprocessor (SM). Different GPU models may include only a subset of these units. Although multiple computing units coexist within an SM, the achievable parallelism among them is not documented in the hardware design specifications. Moreover, the official scheduling interfaces cannot exploit these units in parallel by co-running kernels that use different computing units, nor do they support runtime scheduling that optimizes overall system throughput. To address these problems, we propose Hato, a hardware-aware, throughput-oriented kernel scheduling method. First, Hato designs a parallelism-probing tool that discovers all intra-SM parallelism on any GPU. Second, Hato proposes a kernel co-running model that enables existing scheduling interfaces to exploit intra-SM parallelism and accurately predicts the duration of co-running kernels. Finally, Hato proposes a throughput-oriented scheduling strategy that exploits all available intra-SM parallelism to maximize overall system throughput while guaranteeing quality of service for latency-sensitive applications. Experimental results show that, compared with the state-of-the-art scheduling system Tacker, Hato improves system throughput by 19.2% on average and by up to 54.1%.
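To illustrate the kind of intra-SM parallelism probing the abstract describes, the following is a minimal CUDA sketch, not taken from Hato itself: it co-runs an FP32-heavy kernel and an INT32-heavy kernel on separate streams and times them together. All kernel names, loop counts, and launch sizes are illustrative assumptions; if the co-run time approaches that of the slower kernel run alone, the two unit types likely execute in parallel within the SM.

```cuda
// Hypothetical intra-SM parallelism probe (illustrative, not Hato's tool):
// launch an FP32-heavy kernel and an INT32-heavy kernel on separate streams
// so that blocks of both can be resident on the same SM simultaneously.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void fp32_kernel(float *out, int iters) {
    float v = threadIdx.x * 0.5f;
    for (int i = 0; i < iters; ++i)   // FP32 multiply-add chain keeps FP32 cores busy
        v = v * 1.000001f + 0.5f;
    out[blockIdx.x * blockDim.x + threadIdx.x] = v;  // prevent dead-code elimination
}

__global__ void int32_kernel(int *out, int iters) {
    int v = threadIdx.x;
    for (int i = 0; i < iters; ++i)   // integer multiply-add chain keeps INT32 cores busy
        v = v * 3 + 7;
    out[blockIdx.x * blockDim.x + threadIdx.x] = v;
}

int main() {
    const int blocks = 1024, threads = 256, iters = 1 << 16;
    float *f; int *n;
    cudaMalloc(&f, blocks * threads * sizeof(float));
    cudaMalloc(&n, blocks * threads * sizeof(int));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Co-run: one kernel per stream; the stop event on the default stream
    // completes only after both kernels finish, so it times the overlap.
    cudaEventRecord(start);
    fp32_kernel<<<blocks, threads, 0, s1>>>(f, iters);
    int32_kernel<<<blocks, threads, 0, s2>>>(n, iters);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    printf("co-run time: %.2f ms\n", ms);

    cudaFree(f); cudaFree(n);
    return 0;
}
```

Timing each kernel alone and then the pair together gives the three measurements needed to judge overlap; per the abstract, Hato's actual probing tool generalizes this idea to all unit pairs on any GPU.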

Publication
In Science China Information Sciences