Automatic μBenchmark Generation to Compute “Lower-bound” Latency and Inform Optimizations of Deep Learning Models on GPUs.
An open-source, framework- and hardware-agnostic, extensible, and customizable distributed platform for evaluating and profiling ML models across datasets, frameworks, and systems.
Leveraging NVIDIA’s Tensor Cores to express collective communication operations as matrix multiplications, and exploring the benefits in program simplicity, efficiency, and performance.
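As a rough illustration of the idea behind this project (a sketch, not the project's actual Tensor Core implementation), common collectives such as reduce and all-reduce can be phrased as matrix multiplications, which is what lets them map onto GEMM hardware:

```python
import numpy as np

# Sketch: n "ranks" each contribute a vector of length d. Stacking the
# contributions into an n x d matrix X, a sum-reduce is a (1 x n) all-ones
# row vector times X, and an all-reduce (every rank receives the sum) is an
# (n x n) all-ones matrix times X. On Tensor Core hardware each of these
# would correspond to a single GEMM; n and d here are illustrative.
n, d = 4, 8
X = np.arange(n * d, dtype=np.float32).reshape(n, d)

reduce_result = np.ones((1, n), dtype=np.float32) @ X  # sum across ranks
allreduce_result = np.ones((n, n), dtype=np.float32) @ X  # sum, replicated to all ranks

assert np.allclose(reduce_result[0], X.sum(axis=0))
assert np.allclose(allreduce_result, np.tile(X.sum(axis=0), (n, 1)))
```

Broadcast works analogously: an (n x 1) all-ones column vector times the root's (1 x d) row replicates it to every rank.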
An extensible and customizable GPU benchmarking framework.
A Scalable Project Submission System for Parallel Programming Courses.
Kernel Launch Aggregation and Promotion (KLAP), a set of compiler techniques that improve the performance of GPU kernels that use dynamic parallelism.
Transparent and Isolated Model Sharing for Low Latency Deep Learning Inference in Function as a Service Environments.