Bandwidth and Locality Aware Task-stealing for Manycore Architectures with Bandwidth-Asymmetric Memory

Abstract

Emerging GPUs have multiple Streaming Multiprocessors (SMs), each comprising CUDA Cores and Tensor Cores. CUDA Cores perform general-purpose computation, while Tensor Cores are designed to accelerate matrix multiplication for deep learning applications. However, a GPU kernel typically uses either CUDA Cores or Tensor Cores, leaving the other processing units idle. Although much prior work co-locates kernels to improve GPU utilization, it cannot leverage Intra-SM CUDA Core-Tensor Core Parallelism. ISPA targets this parallelism. Specifically, ISPA uses persistent and elastic blocks to resolve contention for thread slots and shared memory between co-located kernels, and adopts a register allocation method to manage register contention. These resource management methods apply to both white-box kernels and cuDNN kernels. Experimental results on an NVIDIA 2080Ti GPU show that ISPA improves system-wide throughput by 15.3% for white-box workloads and by 7.1% for cuDNN-based workloads compared with prior co-location work.

Publication
In ACM Transactions on Architecture and Code Optimization