Exploiting all intra-SM parallelism to maximize throughput while ensuring QoS

Abstract

To meet the growing demand for computational capacity, GPUs integrate an array of general-purpose and specialized computing units, including FP32 cores, INT32 cores, FP64 cores, Tensor Cores, and RT Cores, within each streaming multiprocessor (SM). Different GPU models may include only a subset of these units. Although multiple computing units coexist within an SM, the achievable parallelism among them is not documented in the hardware design specifications. Moreover, the official scheduling interfaces cannot exploit these units in parallel by co-running kernels that use different computing units, nor do they support runtime scheduling that optimizes overall system throughput. To address these problems, we propose Hato, a hardware-aware, throughput-oriented kernel scheduling method. First, Hato designs a parallelism-probing tool that discovers all intra-SM parallelism on any GPU. Second, Hato proposes a kernel co-running model that enables existing scheduling interfaces to exploit intra-SM parallelism and accurately predicts the duration of co-running kernels. Finally, Hato proposes a throughput-oriented scheduling strategy that exploits all available intra-SM parallelism to maximize overall system throughput while guaranteeing quality of service for latency-sensitive applications. Experimental results show that, compared with the state-of-the-art scheduling system Tacker, Hato improves system throughput by 19.2% on average and by up to 54.1%.
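To illustrate the kind of intra-SM parallelism probing the abstract describes, the following is a minimal CUDA sketch, not taken from Hato itself: it co-runs an FP32-heavy kernel and an INT32-heavy kernel on separate streams and times them together. All kernel names, loop counts, and launch sizes are illustrative assumptions; if the co-run time approaches that of the slower kernel run alone, the two unit types likely execute in parallel within the SM.

```cuda
// Hypothetical intra-SM parallelism probe (illustrative, not Hato's tool):
// launch an FP32-heavy kernel and an INT32-heavy kernel on separate streams
// so that blocks of both can be resident on the same SM simultaneously.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void fp32_kernel(float *out, int iters) {
    float v = threadIdx.x * 0.5f;
    for (int i = 0; i < iters; ++i)   // FP32 multiply-add chain keeps FP32 cores busy
        v = v * 1.000001f + 0.5f;
    out[blockIdx.x * blockDim.x + threadIdx.x] = v;  // prevent dead-code elimination
}

__global__ void int32_kernel(int *out, int iters) {
    int v = threadIdx.x;
    for (int i = 0; i < iters; ++i)   // integer multiply-add chain keeps INT32 cores busy
        v = v * 3 + 7;
    out[blockIdx.x * blockDim.x + threadIdx.x] = v;
}

int main() {
    const int blocks = 1024, threads = 256, iters = 1 << 16;
    float *f; int *n;
    cudaMalloc(&f, blocks * threads * sizeof(float));
    cudaMalloc(&n, blocks * threads * sizeof(int));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Co-run: one kernel per stream; the stop event on the default stream
    // completes only after both kernels finish, so it times the overlap.
    cudaEventRecord(start);
    fp32_kernel<<<blocks, threads, 0, s1>>>(f, iters);
    int32_kernel<<<blocks, threads, 0, s2>>>(n, iters);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    printf("co-run time: %.2f ms\n", ms);

    cudaFree(f); cudaFree(n);
    return 0;
}
```

Timing each kernel alone and then the pair together gives the three measurements needed to judge overlap; per the abstract, Hato's actual probing tool generalizes this idea to all unit pairs on any GPU.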

Publication
In Science China Information Sciences