Tacker:Tensor-CUDA Core Kernel Fusion for Improving the GPU Utilization while Ensuring QoS

Abstract

The proliferation of machine learning applications has promoted both CUDA Cores and Tensor Cores’ integration to meet their acceleration demands. While studies have shown that co-locating multiple tasks on the same GPU can effectively improve system throughput and resource utilization, existing schemes focus on scheduling the resources of traditional CUDA Cores and thus lack the ability to exploit the parallelism between Tensor Cores and CUDA Cores. In this paper, we propose Tacker, a static kernel fusion and scheduling approach to improve GPU utilization of both types of cores while ensuring the QoS (Quality-of-Service) of co-located tasks. Tacker consists of a Tensor-CUDA Core kernel fuser, a duration predictor for fused kernels, and a runtime QoS-aware kernel manager. The kernel fuser enables the flexible fusion of kernels that use Tensor Cores and CUDA Cores, respectively. The duration predictor precisely predicts the duration of the fused kernels. Finally, the kernel manager invokes the fused kernel or the original kernel based on the QoS headroom of latency-critical tasks to improve the system throughput. Our experimental results show that Tacker improves the throughput of best-effort applications compared with state-of-the-art solutions by 18.6% on average, while ensuring the QoS of latency-critical tasks.

Publication
In The 28th IEEE International Symposium on High-Performance Computer Architecture