Accelerating Reduction and Scan Using Tensor Core Units

Abstract

Driven by deep learning, there has been a surge of specialized processors for matrix multiplication, referred to as Tensor Core Units (TCUs). These TCUs perform matrix multiplications on small matrices (usually 4 × 4 or 16 × 16) to accelerate HPC and deep learning workloads. Although TCUs are prevalent, promise gains in performance and/or energy efficiency, and are heavily used within supercomputers to achieve exascale performance, they suffer from over-specialization: only general matrix multiplication (GEMM) operations on small matrices are supported.

In this paper, we express both reduction and scan in terms of matrix multiplication operations and map them onto TCUs. To our knowledge, this paper is the first to try to broaden the class of algorithms expressible as TCU operations and the first to show the benefits of this mapping in terms of program simplicity, efficiency, and performance. We implement the algorithms using NVIDIA V100 TCUs and achieve 89% to 98% of peak memory copy bandwidth. For small segment sizes (common in HPC and deep learning applications), our implementation is up to 100x faster for reduction and up to 3x faster for scan than state-of-the-art methods, while decreasing power consumption by up to 22% for reduction and 16% for scan.
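To make the idea of expressing reduction and scan as matrix multiplications concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's implementation: nested lists stand in for TCU tiles, the tile edge `N = 4` is chosen for readability, and the helper names (`matmul`, `segmented_reduce`, `full_reduce`, `inclusive_scan`) are hypothetical. Row sums come from multiplying by an all-ones matrix, and row-wise prefix sums come from multiplying by an upper-triangular matrix of ones.

```python
from itertools import accumulate

N = 4  # tile edge; an assumed small stand-in for a hardware tile size

def matmul(a, b):
    """Plain triple-loop product of two N x N matrices (stands in for a TCU op)."""
    return [[sum(a[i][k] * b[k][j] for k in range(N)) for j in range(N)]
            for i in range(N)]

def add(a, b):
    """Element-wise sum of two N x N matrices."""
    return [[a[i][j] + b[i][j] for j in range(N)] for i in range(N)]

# Constant matrices that encode the reduction and scan patterns.
ONES = [[1] * N for _ in range(N)]
UPPER = [[1 if i <= j else 0 for j in range(N)] for i in range(N)]
STRICT_LOWER = [[1 if j < i else 0 for j in range(N)] for i in range(N)]

def to_tile(values):
    """Lay out N*N values row-major in an N x N tile."""
    return [list(values[r * N:(r + 1) * N]) for r in range(N)]

def segmented_reduce(tile):
    """Treat each row as a segment: tile @ ONES puts each row sum in every column."""
    return [row[0] for row in matmul(tile, ONES)]

def full_reduce(tile):
    """ONES @ tile @ ONES places the sum of all elements in every entry."""
    return matmul(matmul(ONES, tile), ONES)[0][0]

def inclusive_scan(tile):
    """Row-major inclusive prefix sum of the whole tile."""
    row_scans = matmul(tile, UPPER)                    # prefix sums within each row
    carry = matmul(matmul(STRICT_LOWER, tile), ONES)   # total of all earlier rows
    return [x for row in add(row_scans, carry) for x in row]

values = list(range(1, 17))
tile = to_tile(values)
print(segmented_reduce(tile))
print(full_reduce(tile))
print(inclusive_scan(tile) == list(accumulate(values)))
```

The sketch spends extra multiply-adds on constant matrices in exchange for phrasing everything as a matrix product. That trade only makes sense on hardware where small matrix products are very cheap, which is the setting the abstract describes.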

Publication
International Conference on Supercomputing
Cheng Li
Senior Software Engineer

My work focuses on optimizing training and inference of deep learning models, particularly LLMs and LMMs.
