The proliferation of machine learning applications has driven rapid advances in GPU architecture. Modern GPUs integrate more CUDA Cores within each streaming multiprocessor (SM), along with larger L2 caches and higher memory bandwidth. Moreover, GPUs now include Tensor Cores dedicated to matrix multiplication. Although studies have shown that task co-location can effectively improve system throughput, existing work focuses only on resource scheduling at the SM level and cannot improve resource utilization within an SM. In this paper, we propose Aker, a static kernel fusion and scheduling approach that improves resource utilization inside the SM while ensuring the QoS (Quality-of-Service) of co-located tasks. Aker consists of a static kernel fuser, a duration predictor for fused kernels, an adaptive fused kernel selector, and an enhanced QoS-aware kernel manager. The kernel fuser statically and flexibly fuses a pair of kernels; a pair can consist of a Tensor Core kernel and a CUDA Core kernel, or a compute-preferring CUDA Core kernel and a memory-preferring CUDA Core kernel. After the kernel fuser produces multiple fused versions of a kernel pair, the duration predictor accurately predicts the duration of each fused kernel, and the adaptive fused kernel selector identifies the optimal version. Finally, the kernel manager invokes either the fused kernel or the original kernels, based on the QoS headroom of latency-critical tasks, to improve system throughput. Our experimental results show that Aker improves the throughput of best-effort applications by 50.1% on average compared with state-of-the-art solutions, while ensuring the QoS of latency-critical tasks.