Cheng Li

Member of Technical Staff

Black Forest Labs

About Me

I am a Member of Technical Staff at Black Forest Labs, specializing in optimizing the training and inference efficiency of Large Language Models (LLMs) and Large Vision Models (LVMs).

Previously, I worked at Databricks Mosaic AI, where I played a key role in developing the DBRX model by optimizing memory utilization, computational efficiency, and communication strategies during training to achieve state-of-the-art performance (Building DBRX-class Custom LLMs with Mosaic AI Training). I collaborated with NVIDIA to resolve FP8 training challenges in TransformerEngine, enabling FP8 training for Mosaic AI models. Additionally, I led the technical effort to optimize the inference of Llama and DBRX models.

Prior to Databricks, I was part of Microsoft DeepSpeed, where I enhanced the performance and usability of LLMs in production systems such as GitHub Copilot and DALL·E2. My work included developing cutting-edge AI system technologies and scaling Microsoft DeepSpeed into a leading AI framework.

I created llm-analysis, an open-source tool for analyzing latency and memory in transformer models, helping with resource planning and optimization. Check it out!

Interests

Large Language Models
System Optimization and Engineering for Deep Learning
GPU and Parallel Computing

Education

PhD in Computer Science, 2020
University of Illinois Urbana-Champaign
MS in Computer Science and Engineering, 2015
University of Michigan
BS in Computer Engineering, 2013
University of Michigan
BS in Electrical Engineering, 2013
Shanghai Jiao Tong University

Experience

Member of Technical Staff

Black Forest Labs

Nov 2024 – Present Bellevue, WA

Research Engineer

Databricks

Aug 2023 – Nov 2024 Bellevue, WA

Researcher

Microsoft

Aug 2020 – Aug 2023 Bellevue, WA

Research Intern

Alibaba Group

May 2019 – Aug 2019 Sunnyvale, CA

Teaching Assistant for the 9th Programming and Tuning Massively Parallel Systems + Artificial Intelligence summer school (PUMPS+AI)

BSC, UPC and UIUC

Jul 2018 – Jul 2018 Barcelona, Spain

Research Intern

IBM TJ Watson Research Center

May 2018 – Aug 2018 Yorktown Heights, NY

Research Intern

IBM TJ Watson Research Center

May 2017 – Aug 2017 Yorktown Heights, NY

Head Teaching Assistant for ECE408/CS483: Applied Parallel Programming

UIUC

Aug 2016 – Dec 2016 Champaign, IL

Publications

Quickly discover relevant content by filtering publications.

Zhewei Yao, Xiaoxia Wu, Cheng Li, Stephen Youn, Yuxiong He. Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation. AAAI, 2024.

PDF Source Document

Conglong Li, Zhewei Yao, Xiaoxia Wu, Minjia Zhang, Connor Holmes, Cheng Li, Yuxiong He. Deepspeed data efficiency: Improving deep learning model quality and training efficiency via efficient data sampling and routing. AAAI, 2024.

PDF Source Document

Zhewei Yao, Cheng Li, Xiaoxia Wu, Stephen Youn, Yuxiong He. A Comprehensive Study on Post-Training Quantization for Large Language Models. In arXiv, 2023.

Cheng Li, Xiaoxia Wu, Reza Yazdani Aminabadi, Zhewei Yao, Yuxiong He. Understanding INT4 Quantization for Transformer Models: Latency Speedup, Composability, and Failure Cases. ICML, 2023.

Syed Zawad, Cheng Li, Zhewei Yao, Elton Zheng, Yuxiong He, Feng Yan. DySR: Adaptive Super-Resolution via Algorithm and System Co-design. ICLR, 2023.

PDF Source Document

Zhewei Yao, Xiaoxia Wu, Conglong Li, Connor Holmes, Minjia Zhang, Cheng Li, Yuxiong He. Random-LTD: Random and Layerwise Token Dropping Brings Efficient Training for Large-scale Transformers. In arXiv, 2022.

PDF Project Source Document

Reza Yazdani Aminabadi, Samyam Rajbhandari, Minjia Zhang, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Jeff Rasley, Shaden Smith, Olatunji Ruwase, Yuxiong He. Deepspeed inference: Enabling efficient inference of transformer models at unprecedented scale. Super Computing, 2022.

PDF Project Source Document

Cheng Li, Abdul Dakkak, Jinjun Xiong, Wen-mei Hwu. The Design and Implementation of a Scalable DL Benchmarking Platform (Best Paper Award). IEEE CLOUD, 2020.

PDF Code Project Slides

Cheng Li, Abdul Dakkak, Jinjun Xiong, Wen-mei Hwu. DLSpec: A Deep Learning Task Exchange Specification. USENIX OpML, 2020.

PDF Project Slides Source Document

Cheng Li, Abdul Dakkak, Jinjun Xiong, Wen-mei Hwu. Benanza: Automatic μBenchmark Generation to Compute ''Lower-bound'' Latency and Inform Optimizations of Deep Learning Models on GPUs. IPDPS, 2020.

PDF Project Slides Source Document

See all publications

Talks & Posters

SC 2019 - Across-Stack Profiling and Characterization of State-of-the-Art Machine Learning Models on GPUs

Nov 18, 2019 3:30 PM Denver, CO

Tutorial at IISWC 2019 - Challenges and Solutions for End-to-End and Across Stack ML Benchmarking

Nov 3, 2019 3:30 PM Orlando, FL

HotChips 2019 - MLModelScope: Evaluate and Profile ML Models at Scale and Across Stack

Aug 19, 2019 3:30 PM Palo Alto, California

IEEE Services 2019 - MLModelScope: Evaluate and Introspect Cognitive Pipelines

Jul 10, 2019 3:30 PM Milan, Italy

Tutorial at ISCA 2019 - Benchmarking Deep Learning Systems

Jun 22, 2019 3:30 PM Phoenix, AZ

Languages

Chinese, English