Publications

(2025). Taming Flexible Job Packing in Deep Learning Training Clusters. In TACO2025 (CCF-A).

(2025). XPUTIMER: Anomaly Diagnostics for Divergent LLM Training in GPU Clusters of Thousand-Plus Scale. In arXiv (under review).

(2025). Improving GPU Sharing Performance through Adaptive Bubbleless Spatial-Temporal Sharing. In EuroSys2025 (CCF-A).

(2025). ARACHNE: Optimizing Distributed Parallel Applications with Reduced Inter-Process Communication. In TACO2025 (CCF-A).

(2024). Potamoi: Accelerating Neural Rendering via a Unified Streaming Architecture. In TACO2024 (CCF-A).

(2024). Adaptive Kernel Fusion for Improving the GPU Utilization while Ensuring QoS. In TC2024 (CCF-A).

(2024). Exploiting All Intra-SM Parallelism to Maximize the Throughput while Ensuring QoS. In Science China Information Sciences 2024 (CCF-A).

(2024). FaaSMem: Improving Memory Efficiency of Serverless Computing with Memory Pool Architecture. In ASPLOS2024 (CCF-A).

(2023). Maximizing the Utilization of GPUs Used by Cloud Gaming through Adaptive Co-location with Combo. In SoCC2023 (CCF-B) (Corresponding author).

(2023). Improving Cluster Utilization Through Adaptive Resource Management for Deep Neural Network and CPU Jobs Colocation. In TC2023 (CCF-A).

(2022). ISPA: Exploiting Intra-SM Parallelism in GPUs via Fine-grained Resource Management. In TC2022 (CCF-A).

(2022). DVABatch: Diversity-aware Multi-Entry Multi-Exit Batching for Efficient Processing of DNN Services on GPUs. In ATC2022 (CCF-A).

(2022). Tacker: Tensor-CUDA Core Kernel Fusion for Improving the GPU Utilization while Ensuring QoS. In HPCA2022 (CCF-A).

(2021). Enable Simultaneous DNN Services Based on Deterministic Operator Overlap and Precise Latency Prediction. In SC2021 (CCF-A).

(2021). Exploiting Intra-SM Parallelism in GPUs via Persistent and Elastic Blocks. In ICCD2021 (CCF-B).

(2020). E2bird: Enhanced Elastic Batch for Improving Responsiveness and Throughput of Deep Learning Services. In TPDS2020 (CCF-A).

(2020). CODA: Improving Resource Utilization by Slimming and Co-locating DNN and CPU Jobs. In ICDCS2020 (CCF-B).

(2019). Bandwidth and Locality Aware Task-stealing for Manycore Architectures with Bandwidth-Asymmetric Memory. In TACO2019 (CCF-A).
