Cheng Li (李程)

Senior Researcher


About Me

I am a senior researcher at Microsoft. My research lies in the field of GPU-accelerated applications, with an emphasis on Deep Learning (DL). My work has focused on understanding and optimizing DL workloads. At Microsoft, I work on improving the performance/usability of DL applications in production, building systematic profiling/optimization stacks for DL, and integrating system technology into Microsoft DeepSpeed.

I received my PhD in CS from the University of Illinois at Urbana-Champaign. During my PhD, I developed a number of open-source tools to benchmark, profile, and summarize DL training and inference across hardware and software stacks. The tools have been used to inform system design for DL model serving and to develop highly tuned GPU kernels for model inference.


  • GPU
  • Deep Learning
  • System Design and Engineering for Deep Learning


  • PhD in Computer Science, 2020

    University of Illinois Urbana-Champaign

  • MS in Computer Science and Engineering, 2015

    University of Michigan

  • BS in Computer Engineering, 2013

    University of Michigan

  • BS in Electrical Engineering, 2013

    Shanghai Jiao Tong University



Senior Researcher


Aug 2020 – Present Bellevue, WA

Research Intern

Alibaba Group

May 2019 – Aug 2019 Sunnyvale, CA

Teaching Assistant for the 9th Programming and Tuning Massively Parallel Systems + Artificial Intelligence summer school (PUMPS+AI)


Jul 2018 – Jul 2018 Barcelona, Spain

Research Intern

IBM TJ Watson Research Center

May 2018 – Aug 2018 Yorktown Heights, NY

Research Intern

IBM TJ Watson Research Center

May 2017 – Aug 2017 Yorktown Heights, NY

Head Teaching Assistant for ECE408/CS483: Applied Parallel Programming


Aug 2016 – Dec 2016 Champaign, IL


The Design and Implementation of a Scalable DL Benchmarking Platform (Best Paper Award)

The current Deep Learning (DL) landscape is fast-paced and rife with non-uniform models and hardware/software (HW/SW) stacks, but lacks a DL benchmarking platform to facilitate evaluation and comparison of DL innovations, be they models, frameworks, libraries, or hardware. Without such a platform, the current practice of evaluating the benefits of proposed DL innovations is both arduous and error-prone, stifling the adoption of the innovations.

In this work, we first identify 10 design features which are desirable within a DL benchmarking platform. These features include: performing the evaluation in a consistent, reproducible, and scalable manner, being framework and hardware agnostic, supporting real-world benchmarking workloads, providing in-depth model execution inspection across the HW/SW stack levels, etc. We then propose MLModelScope, a DL benchmarking platform design that realizes the 10 objectives. MLModelScope proposes a specification to define DL model evaluations and techniques to provision the evaluation workflow using the user-specified HW/SW stack. MLModelScope defines abstractions for frameworks and supports a broad range of DL models and evaluation scenarios. We implement MLModelScope as an open-source project with support for all major frameworks and hardware architectures. Through MLModelScope’s evaluation and automated analysis workflows, we performed case-study analyses of 37 models across 4 systems and show how model, hardware, and framework selection affects model accuracy and performance under different benchmarking scenarios. We further demonstrated how MLModelScope’s tracing capability gives a holistic view of model execution and helps pinpoint bottlenecks.

Benanza: Automatic μBenchmark Generation to Compute “Lower-bound” Latency and Inform Optimizations of Deep Learning Models on GPUs

As Deep Learning (DL) models have been increasingly used in latency-sensitive applications, there has been a growing interest in improving their response time. An important avenue for such improvement is to profile the execution of these models and characterize their performance to identify possible optimization opportunities. However, the current profiling tools lack the highly desired abilities to characterize ideal performance, identify sources of inefficiency, and quantify the benefits of potential optimizations. Such deficiencies have led to slow characterization/optimization cycles that cannot keep up with the fast pace at which new DL models are introduced.

We propose Benanza, a sustainable and extensible benchmarking and analysis design that speeds up the characterization/optimization cycle of DL models on GPUs. Benanza consists of four major components: a model processor that parses models into an internal representation, a configurable benchmark generator that automatically generates micro-benchmarks given a set of models, a database of benchmark results, and an analyzer that computes the “lower-bound” latency of DL models using the benchmark data and informs optimizations of model execution. The “lower-bound” latency metric estimates the ideal model execution on a GPU system and serves as the basis for identifying optimization opportunities in frameworks or system libraries. We used Benanza to evaluate 30 ONNX models in MXNet, ONNX Runtime, and PyTorch on 7 GPUs ranging from Kepler to the latest Turing, and identified optimizations in parallel layer execution, cuDNN convolution algorithm selection, framework inefficiency, layer fusion, and using Tensor Cores.
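
As a rough illustration of how a “lower-bound” latency could be derived from a database of micro-benchmark results, the sketch below takes each layer's fastest benchmarked algorithm and combines the layers either sequentially or along the critical path of the layer graph (when data-independent layers are allowed to overlap). This is a minimal sketch, not Benanza's implementation: the function names, the layer graph, the algorithm labels, and every latency value are hypothetical.

```python
# Hypothetical benchmark database: layer signature -> {algorithm: latency in microseconds}.
bench_us = {
    "conv3x3": {"implicit_gemm": 300, "winograd": 200},
    "conv1x1": {"implicit_gemm": 80},
    "concat": {"default": 20},
}

# Hypothetical model: layer name -> layer signature, plus data dependencies.
layers = {"conv1": "conv3x3", "branch_a": "conv3x3",
          "branch_b": "conv1x1", "concat": "concat"}
deps = {"branch_a": ["conv1"], "branch_b": ["conv1"],
        "concat": ["branch_a", "branch_b"]}


def best_latency(signature, bench):
    """Latency of the fastest benchmarked algorithm for one layer signature."""
    return min(bench[signature].values())


def sequential_lower_bound(layers, bench):
    """Layers run one after another, each using its fastest algorithm."""
    return sum(best_latency(sig, bench) for sig in layers.values())


def parallel_lower_bound(layers, deps, bench):
    """Critical-path latency when data-independent layers may overlap."""
    finish = {}

    def finish_time(name):
        if name not in finish:
            # A layer starts once all of its inputs have finished.
            start = max((finish_time(d) for d in deps.get(name, [])), default=0)
            finish[name] = start + best_latency(layers[name], bench)
        return finish[name]

    return max(finish_time(name) for name in layers)


seq = sequential_lower_bound(layers, bench_us)
par = parallel_lower_bound(layers, deps, bench_us)
print(f"sequential: {seq} us, parallel: {par} us")
```

Comparing a measured latency against bounds of this kind is one way to see how much headroom algorithm selection or parallel layer execution might offer.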

XSP: Across-Stack Profiling and Analysis of Machine Learning Models on GPUs (Best Paper Award)

There has been a rapid proliferation of machine learning/deep learning (ML) models and wide adoption of them in many application domains. This has made profiling and characterization of ML model performance an increasingly pressing task for both hardware designers and system providers, as they would like to offer the best possible system to serve ML models with the target latency, throughput, cost, and energy requirements while maximizing resource utilization. Such an endeavor is challenging as the characteristics of an ML model depend on the interplay between the model, framework, system libraries, and the hardware (or the HW/SW stack). Existing profiling tools are disjoint, however, and only focus on profiling within a particular level of the stack, which limits the thoroughness and usefulness of the profiling results.

This paper proposes XSP, an across-stack profiling design that gives a holistic and hierarchical view of ML model execution. XSP leverages distributed tracing to aggregate and correlate profile data from different sources. XSP introduces a leveled and iterative measurement approach that accurately captures the latencies at all levels of the HW/SW stack in spite of the profiling overhead. We couple the profiling design with an automated analysis pipeline to systematically analyze 65 state-of-the-art ML models. We demonstrate that XSP provides insights which would be difficult to discern otherwise.

DLBricks: Composable Benchmark Generation to Reduce Deep Learning Benchmarking Effort on CPUs

The past few years have seen a surge of applying Deep Learning (DL) models for a wide array of tasks such as image classification, object detection, machine translation, etc. While DL models provide an opportunity to solve otherwise intractable tasks, their adoption relies on them being optimized to meet latency and resource requirements. Benchmarking is a key step in this process but has been hampered in part due to the lack of representative and up-to-date benchmarking suites. This is exacerbated by the fast-evolving pace of DL models.

This paper proposes DLBricks, a composable benchmark generation design that reduces the effort of developing, maintaining, and running DL benchmarks on CPUs. DLBricks decomposes DL models into a set of unique runnable networks and constructs the original model’s performance using the performance of the generated benchmarks. DLBricks leverages two key observations: DL layers are the performance building blocks of DL models and layers are extensively repeated within and across DL models. Since benchmarks are generated automatically and the benchmarking time is minimized, DLBricks can keep up-to-date with the latest proposed models, relieving the pressure of selecting representative DL models. Moreover, DLBricks allows users to represent proprietary models within benchmark suites. We evaluate DLBricks using 50 MXNet models spanning 5 DL tasks on 4 representative CPU systems. We show that DLBricks provides an accurate performance estimate for the DL models and reduces the benchmarking time across systems (e.g. within 95% accuracy and up to 4.4× benchmarking time speedup on Amazon EC2 c5.xlarge).
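
The two observations above (layers as performance building blocks, and heavy layer repetition) can be illustrated with a small sketch: benchmark each unique layer once, then reconstruct a model's latency from the per-layer results. This is a simplified sketch under stated assumptions, not DLBricks itself: it works at single-layer granularity, assumes sequential execution, and the models, layer signatures, and latencies are all hypothetical.

```python
from collections import Counter

# Hypothetical models, each a sequence of layer signatures.
models = {
    "model_a": [("conv", 3, 64), ("relu",), ("conv", 3, 64), ("relu",), ("fc", 1000)],
    "model_b": [("conv", 3, 64), ("relu",), ("fc", 1000)],
}


def unique_layers(models):
    """Distinct layer signatures across all models; each needs one benchmark."""
    return {layer for layers in models.values() for layer in layers}


def estimate_latency_us(layers, benchmark_us):
    """Reconstruct a model's latency as occurrence count times per-layer latency."""
    counts = Counter(layers)
    return sum(benchmark_us[layer] * n for layer, n in counts.items())


# Hypothetical results of benchmarking each unique layer once (microseconds).
benchmark_us = {("conv", 3, 64): 2000, ("relu",): 100, ("fc", 1000): 500}

total_layers = sum(len(layers) for layers in models.values())
to_benchmark = unique_layers(models)
print(f"{len(to_benchmark)} benchmarks cover {total_layers} layer instances")
for name, layers in models.items():
    print(name, estimate_latency_us(layers, benchmark_us), "us")
```

Because repeated layers are benchmarked only once, the number of benchmarks grows with the number of unique layers rather than with the number of models.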


Python, C/C++, CUDA, Go, JavaScript, Bash, LaTeX, Mathematica

Chinese, English


  • 555 110th Ave NE, Bellevue, WA 98004