E2bird: Enhanced Elastic Batch for Improving Responsiveness and Throughput of Deep Learning Services

Abstract

We aim to tackle existing problems in deep learning serving on GPUs from a systems perspective. GPUs have been widely adopted to serve online deep learning-based services with stringent QoS (Quality-of-Service) requirements. However, existing deep learning serving systems often suffer from poor responsiveness and low inference throughput, which damage the user experience and increase the number of GPUs required to host an online service. Our investigation shows that inefficient batching and the lack of data transfer-computation overlap are the root causes of the poor responsiveness and low throughput. To this end, we propose E2bird, a deep learning serving system comprising a GPU-resident memory pool, a multi-granularity inference engine, and an elastic batch scheduler. The memory pool eliminates unnecessary waiting in the batching operation and enables data transfer-computation overlap. The inference engine enables concurrent execution of different batches, improving GPU resource utilization. The batch scheduler organizes inferences elastically to guarantee the QoS. Experimental results on an Nvidia Titan RTX GPU show that, compared with TensorFlow Serving, E2bird reduces the response latency of inferences by up to 82.4% and improves throughput by up to 62.8% while guaranteeing the QoS target.
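The core idea of elastic batching, forming a batch that grows with incoming requests but never waits past a latency budget, can be illustrated with a minimal sketch. This is not E2bird's actual implementation; the function name, parameters, and slack budget below are hypothetical, chosen only to show the trade-off between batch size (throughput) and waiting time (QoS):

```python
import queue
import time

def form_elastic_batch(req_q, max_batch=8, slack_ms=5.0):
    """Illustrative sketch (hypothetical API, not E2bird's): collect up to
    max_batch requests from req_q, but stop once the slack budget expires,
    so queued inferences are never delayed past their QoS allowance."""
    batch = [req_q.get()]  # block until at least one request arrives
    deadline = time.monotonic() + slack_ms / 1000.0
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # slack exhausted: dispatch what we have
        try:
            # wait only as long as the remaining slack allows
            batch.append(req_q.get(timeout=remaining))
        except queue.Empty:
            break  # no more requests within the budget
    return batch
```

Under load the scheduler dispatches full batches for throughput; under light load it dispatches small batches quickly, keeping response latency within the QoS target.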

Publication
In IEEE Transactions on Parallel and Distributed Systems