# Anshuman Agrawal
I am a researcher and engineer focused on High-Performance Computing (HPC) and Deep Learning Systems.
While many focus on training larger models, I focus on making them run faster and leaner. My research bridges the gap between high-level AI frameworks and hardware reality, optimizing the "plumbing" of AI: from writing custom GPU kernels to analyzing distributed clusters.
I am currently an Undergraduate Researcher at the UPES AI Center, where I work on:
- Kernel Optimization: Writing custom OpenAI Triton kernels that fuse operations to reduce memory traffic, outperforming standard PyTorch eager execution.
- Efficient Inference: Implementing 4-bit Quantization (AWQ/GPTQ) pipelines to deploy 7B+ parameter models on consumer-grade hardware.
- Distributed Systems: Analyzing NCCL communication primitives (Ring All-Reduce) to characterize bottlenecks in multi-node training.
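To make the quantization bullet concrete, here is a minimal sketch of group-wise round-to-nearest 4-bit weight quantization in NumPy. AWQ and GPTQ are considerably more sophisticated (activation-aware scaling, error compensation); this only illustrates the storage format they share — int4 codes plus one scale per group. The function names are illustrative, not taken from my repositories.

```python
import numpy as np

def quantize_4bit(weights: np.ndarray, group_size: int = 8):
    """Group-wise symmetric round-to-nearest 4-bit quantization (sketch).

    Returns integer codes in [-8, 7] and one float scale per group.
    Assumes the weight count is divisible by group_size.
    """
    w = weights.reshape(-1, group_size)
    # One scale per group: map the group's max magnitude onto the int4 range.
    scales = np.abs(w).max(axis=1, keepdims=True) / 7.0
    scales = np.where(scales == 0, 1.0, scales)  # avoid divide-by-zero on all-zero groups
    q = np.clip(np.round(w / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize_4bit(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Recover an approximation of the original weights."""
    return (q.astype(np.float32) * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(64).astype(np.float32)
q, s = quantize_4bit(w)
w_hat = dequantize_4bit(q, s)
```

Packing two int4 codes per byte (the part that actually shrinks the memory footprint by ~4x and lets 7B+ models fit on consumer GPUs) is omitted here for brevity.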
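The Ring All-Reduce pattern mentioned above can be sketched as a pure-Python simulation: a reduce-scatter phase followed by an all-gather phase, 2(N−1) steps in total, with each link carrying only 1/N of the data per step — the property that makes the algorithm bandwidth-optimal. This is an illustrative model of the communication schedule, not NCCL's actual implementation, and the function name is my own.

```python
def ring_all_reduce(ranks):
    """Simulate Ring All-Reduce over n ranks, each holding an n-element vector.

    Phase 1 (reduce-scatter): in step s, rank r sends chunk (r - s) % n to
    rank (r + 1) % n, which adds it into its own copy. After n - 1 steps,
    each rank owns the fully reduced value of exactly one chunk.
    Phase 2 (all-gather): the reduced chunks circulate for n - 1 more steps
    so every rank ends up with the full element-wise sum.
    """
    n = len(ranks)
    data = [list(v) for v in ranks]  # copy so the input is untouched
    # Phase 1: reduce-scatter
    for step in range(n - 1):
        for r in range(n):
            c = (r - step) % n
            data[(r + 1) % n][c] += data[r][c]
    # Phase 2: all-gather (rank r starts this phase owning reduced chunk r+1)
    for step in range(n - 1):
        for r in range(n):
            c = (r + 1 - step) % n
            data[(r + 1) % n][c] = data[r][c]
    return data

out = ring_all_reduce([[1.0, 2.0, 3.0],
                       [4.0, 5.0, 6.0],
                       [7.0, 8.0, 9.0]])
# every rank now holds the element-wise sum [12.0, 15.0, 18.0]
```

Counting steps in a model like this is a simple way to reason about where the latency term (2(N−1) link traversals) starts to dominate the bandwidth term in multi-node runs.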
## 🛠 Active Research
My work is open-source and documented in my primary engineering log:
- high-performance-deep-learning
A collection of my implementations, including Fused Softmax kernels, Vectorized Monte Carlo engines, and Distributed Training simulations.
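For readers unfamiliar with the fused-softmax idea: a naive implementation launches separate kernels for max, subtract, exp, sum, and divide, writing intermediates to global memory between each one, while a fused Triton kernel keeps each row in on-chip memory and does all five steps in one pass. The NumPy reference below defines the numerically stable output such a kernel must reproduce; it is a correctness oracle, not the kernel itself.

```python
import numpy as np

def softmax_rows(x: np.ndarray) -> np.ndarray:
    """Numerically stable row-wise softmax (reference for a fused kernel).

    Subtracting the row max before exponentiating prevents overflow for
    large logits without changing the result.
    """
    row_max = x.max(axis=1, keepdims=True)
    e = np.exp(x - row_max)
    return e / e.sum(axis=1, keepdims=True)

y = softmax_rows(np.array([[1.0, 2.0, 3.0],
                           [1000.0, 1000.0, 1000.0]]))
```

The second row (all logits equal to 1000) would overflow a naive `exp` in float32; the max-subtraction trick returns the exact uniform distribution instead.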
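The vectorized Monte Carlo idea can be sketched in a few lines: rather than looping over samples in Python, draw them all at once and reduce with array operations. A pi estimator (my choice of toy problem here, not necessarily the engine in the repository) shows the pattern.

```python
import numpy as np

def estimate_pi(n_samples: int, seed: int = 0) -> float:
    """Vectorized Monte Carlo estimate of pi.

    Samples n points uniformly in the unit square and measures the
    fraction falling inside the quarter circle of radius 1; that
    fraction converges to pi/4. All work is done in batched array
    ops, so throughput scales with the BLAS/SIMD backend rather
    than the Python interpreter.
    """
    rng = np.random.default_rng(seed)
    xy = rng.random((n_samples, 2))        # points in [0, 1)^2
    inside = (xy ** 2).sum(axis=1) <= 1.0  # inside the quarter circle?
    return 4.0 * inside.mean()
```

The standard error shrinks as 1/sqrt(n), so a million samples already pins the estimate to roughly two decimal places.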

