Accelerating Reduction Using Tensor Core Units

Abstract

Driven by deep learning, there has been a surge of specialized processors for matrix multiplication, referred to as Tensor Core Units (TCUs). These TCUs are marketed under different names and perform matrix multiplication on small matrices (usually 4x4 or 16x16) to accelerate the convolutional and recurrent neural networks in deep learning workloads. Although TCUs are prevalent and promise gains in performance and/or energy efficiency, they suffer from over-specialization: only general matrix-matrix multiplication (GEMM) is supported. This limits their applicability to general algorithms and confines them to narrowly specialized libraries and application domains. In this work, we leverage NVIDIA’s TCU to express reduction in terms of matrix multiplication and show the benefits in program simplicity, efficiency, and performance compared to state-of-the-art reduction methods on the GPU. Although this work targets GPUs, the motivation, methods, and observations apply to a wide range of TCU implementations and microarchitectures.
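
To illustrate what "expressing reduction in terms of matrix multiplication" can mean, here is a minimal sketch in plain Python. It is not the kernel from the paper: the function names (`matmul`, `reduce_via_matmul`), the 16x16 tile size, and the zero-padding scheme are assumptions made for illustration, and an ordinary triple loop stands in for the hardware multiply. The idea is that multiplying a tile on the left by a matrix whose first row is all ones yields the column sums, and multiplying that result on the right by a matrix whose first column is all ones collapses them into a single total.

```python
TILE = 16  # assumed tile edge, matching the 16x16 size mentioned in the abstract


def matmul(a, b):
    """Plain triple-loop matrix product, standing in for one TCU multiply."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [sum(a[i][p] * b[p][j] for p in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]


def reduce_via_matmul(values):
    """Sum up to TILE*TILE numbers using two matrix multiplications."""
    if len(values) > TILE * TILE:
        raise ValueError("this sketch handles a single tile only")

    # Zero-pad the input and lay it out as one TILE x TILE matrix.
    padded = list(values) + [0] * (TILE * TILE - len(values))
    tile = [padded[r * TILE:(r + 1) * TILE] for r in range(TILE)]

    # First row all ones, remaining rows zero.
    ones_first_row = [[1] * TILE] + [[0] * TILE for _ in range(TILE - 1)]
    # First column all ones, remaining columns zero.
    ones_first_col = [[1] + [0] * (TILE - 1) for _ in range(TILE)]

    # Row 0 of the product holds the sum of each column of the tile.
    column_sums = matmul(ones_first_row, tile)
    # Element [0][0] of the second product is the sum of those column sums.
    total = matmul(column_sums, ones_first_col)
    return total[0][0]
```

For example, `reduce_via_matmul(list(range(1, 257)))` sums the integers 1 through 256 with two tile-sized multiplications. Inputs larger than one tile would need the partial sums combined across tiles, which this sketch does not attempt.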

Publication
Intersection of High Performance Computing and Machine Learning 2019
Cheng Li
Member of Technical Staff

I specialize in building efficient AI training and inference systems using GPUs, with a focus on optimizing performance for Large Language Models (LLMs) and Large Vision Models (LVMs).
