DeepSpeed is an easy-to-use deep learning optimization software suite that enables unprecedented scale and speed for Deep Learning Training and Inference.
Composable Benchmark Generation to Reduce Deep Learning Benchmarking Effort on CPUs.
Automatic μBenchmark Generation to Compute “Lower-bound” Latency and Inform Optimizations of Deep Learning Models on GPUs.
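The “lower-bound” idea can be sketched as follows: if every layer of a model were executed by its fastest available kernel, back-to-back and with no framework overhead, the sum of those per-layer minima bounds the model’s achievable latency from below. The per-layer numbers and the helper below are illustrative assumptions, not the project’s actual data or API.

```python
# Hypothetical per-layer μbenchmark results: for each layer, the measured
# latencies (in ms) of the candidate kernels that can implement it.
layer_kernel_latencies = {
    "conv1": [0.42, 0.35, 0.51],
    "relu1": [0.05],
    "conv2": [0.88, 0.79],
    "fc":    [0.21, 0.30],
}

def lower_bound_latency(per_layer):
    # Assume each layer runs its fastest kernel with no gaps between layers;
    # the sum of per-layer minima is a sequential lower bound on model latency.
    return sum(min(lats) for lats in per_layer.values())

print(f"{lower_bound_latency(layer_kernel_latencies):.2f} ms")
```

Comparing this bound against the measured end-to-end latency shows how much room a framework leaves on the table, which is what informs the optimizations.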
An open-source, framework- and hardware-agnostic, extensible, and customizable distributed platform designed for evaluating and profiling ML models across datasets, frameworks, and systems.
Leveraging NVIDIA’s Tensor Cores to express collectives as matrix multiplications, and exploring the benefits in program simplicity, efficiency, and performance.
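The mapping above can be illustrated in a few lines: a segmented reduction is one GEMM against a column of ones, and a segmented inclusive scan is one GEMM against an upper-triangular matrix of ones. This NumPy sketch only shows the algebraic identity that lets a Tensor Core GEMM perform the collective; shapes and names are illustrative, not the project’s API.

```python
import numpy as np

# Eight elements laid out as two segments of four.
x = np.arange(1, 9, dtype=np.float32)
segments = x.reshape(2, 4)

# Reduction as matmul: multiplying by a ones column sums each segment,
# so a single GEMM computes all segment reductions at once.
ones = np.ones((4, 1), dtype=np.float32)
partial_sums = segments @ ones            # [[10.], [26.]]

# Scan as matmul: an upper-triangular ones matrix turns each row into
# its inclusive prefix sum in the same single-GEMM fashion.
tri = np.triu(np.ones((4, 4), dtype=np.float32))
prefix = segments @ tri                   # row 0 -> [1., 3., 6., 10.]

print(partial_sums.ravel(), prefix[0])
```

On hardware, the same identity lets the half-precision matrix units execute these collectives instead of idle while scalar cores do the work.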
An extensible and customizable GPU benchmarking framework.
A Scalable Project Submission System for Parallel Programming Courses.
Kernel Launch Aggregation and Promotion (KLAP), a set of compiler techniques that improve the performance of GPU kernels that use dynamic parallelism.
Transparent and Isolated Model Sharing for Low Latency Deep Learning Inference in Function as a Service Environments.