Improving the Multi-Tenancy GPU Performance through Adaptive Bubbleless Spatial-Temporal Sharing

Abstract

As GPUs become more powerful, cloud providers have started to allow multiple tenants, who often have lightweight workloads, to share a GPU. Existing temporal or spatial sharing systems struggle to provide efficient and accurate quota assignments for tenants. We observe that the performance of a multi-tenancy system is often underestimated because of unused GPU “bubbles”, and that it can be enhanced by squeezing out these bubbles. Based on this observation, we design Bless, a bubble-less spatial-temporal sharing GPU system that fine-tunes GPU resource allocation to improve multi-tenancy performance. Bless leverages precise computing resource management and fine-grained kernel scheduling to ensure stringent quota guarantees and to reduce latency fairly for tenants with varying GPU quotas. We implement and evaluate Bless with multiple applications and workloads. Our results show that Bless achieves an average latency reduction of 21.1% to 37.3% over the state of the art while guaranteeing the promised quota for all tenants.

Publication
In *European Conference on Computer Systems*