The landscape of transformer model inference is increasingly diverse in model size, model characteristics, latency and throughput requirements, hardware requirements, etc. With such diversity, designing a versatile inference system is challenging. This work addresses these challenges with (1) a multi-GPU inference solution that minimizes latency while maximizing throughput for both dense and sparse transformers when the model fits in aggregate GPU memory, and (2) a heterogeneous inference solution that leverages CPU, NVMe, and GPU memory to enable high-throughput inference for models larger than aggregate GPU memory. The system reduces latency by 6.4x and increases throughput by 4x over the state of the art while achieving 260 TFLOPS per GPU (over 80% of A100 peak). It enables trillion-parameter-scale inference under real-time latency constraints by leveraging hundreds of GPUs, an unprecedented scale for inference, and it can run inference on models 25x larger than GPU-only solutions allow while delivering a high throughput of 84 TFLOPS (over 50% of A6000 peak).
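
As a rough illustration of the two-path design described above, the sketch below selects between a GPU-only tensor-parallel path and a heterogeneous CPU/NVMe offload path based on whether the model's weight footprint fits in aggregate GPU memory. The function, parameter names, and thresholds (choose_inference_path, gpu_mem_per_device_gb, the 80% headroom factor, the CPU/NVMe cutoff) are illustrative assumptions, not an API or policy taken from the source.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class InferenceConfig:
    path: str                 # "multi-gpu" or "heterogeneous"
    tensor_parallel: int      # number of GPUs the model is partitioned across
    offload: Optional[str]    # "cpu", "nvme", or None

def choose_inference_path(num_params_billions: float,
                          bytes_per_param: int,
                          num_gpus: int,
                          gpu_mem_per_device_gb: float) -> InferenceConfig:
    """Hypothetical path selection based on the model's weight footprint."""
    model_gb = num_params_billions * bytes_per_param            # weights only, approximate
    aggregate_gpu_gb = num_gpus * gpu_mem_per_device_gb * 0.8   # assumed headroom for activations/KV cache

    if model_gb <= aggregate_gpu_gb:
        # Model fits across the GPUs: partition it with tensor parallelism.
        return InferenceConfig(path="multi-gpu", tensor_parallel=num_gpus, offload=None)
    # Otherwise keep weights in CPU or NVMe memory and stream them to the GPUs.
    offload_target = "cpu" if model_gb <= 1024 else "nvme"      # arbitrary illustrative cutoff
    return InferenceConfig(path="heterogeneous", tensor_parallel=num_gpus, offload=offload_target)

if __name__ == "__main__":
    # A 530B-parameter model in fp16 (2 bytes/param) on 8 x 80 GB GPUs
    # exceeds aggregate GPU memory, so the heterogeneous path is chosen.
    print(choose_inference_path(530, 2, 8, 80))
```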