# Anshuman Agrawal
I am a researcher and engineer focused on High-Performance Computing (HPC) and Deep Learning Systems.
While many focus on training larger models, I focus on making them run faster and leaner. My research bridges the gap between high-level AI frameworks and hardware reality, optimizing the "plumbing" of AI: from writing custom GPU kernels to analyzing distributed clusters.
I am currently an Undergraduate Researcher at the UPES AI Center, where I work on:
- Kernel Optimization: Writing custom OpenAI Triton kernels that fuse operations to reduce memory traffic, outperforming standard PyTorch eager execution.
- Efficient Inference: Implementing 4-bit Quantization (AWQ/GPTQ) pipelines to deploy 7B+ parameter models on consumer-grade hardware.
- Distributed Systems: Analyzing NCCL communication primitives (Ring All-Reduce) to characterize bottlenecks in multi-node training.
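To make the quantization bullet concrete, here is a minimal sketch of group-wise round-to-nearest 4-bit weight quantization in NumPy. AWQ and GPTQ are considerably more sophisticated (activation-aware scaling, error compensation); this only illustrates the storage format they share — int4 codes plus one scale per group. The function names are illustrative, not taken from my repositories.

```python
import numpy as np

def quantize_4bit(weights: np.ndarray, group_size: int = 8):
    """Group-wise symmetric round-to-nearest 4-bit quantization (sketch).

    Returns integer codes in [-8, 7] and one float scale per group.
    Assumes the weight count is divisible by group_size.
    """
    w = weights.reshape(-1, group_size)
    # One scale per group: map the group's max magnitude onto the int4 range.
    scales = np.abs(w).max(axis=1, keepdims=True) / 7.0
    scales = np.where(scales == 0, 1.0, scales)  # avoid divide-by-zero on all-zero groups
    q = np.clip(np.round(w / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize_4bit(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Recover an approximation of the original weights."""
    return (q.astype(np.float32) * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(64).astype(np.float32)
q, s = quantize_4bit(w)
w_hat = dequantize_4bit(q, s)
```

Packing two int4 codes per byte (the part that actually shrinks the memory footprint by ~4x and lets 7B+ models fit on consumer GPUs) is omitted here for brevity.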
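The Ring All-Reduce pattern mentioned above can be sketched as a pure-Python simulation: a reduce-scatter phase followed by an all-gather phase, 2(N−1) steps in total, with each link carrying only 1/N of the data per step — the property that makes the algorithm bandwidth-optimal. This is an illustrative model of the communication schedule, not NCCL's actual implementation, and the function name is my own.

```python
def ring_all_reduce(ranks):
    """Simulate Ring All-Reduce over n ranks, each holding an n-element vector.

    Phase 1 (reduce-scatter): in step s, rank r sends chunk (r - s) % n to
    rank (r + 1) % n, which adds it into its own copy. After n - 1 steps,
    each rank owns the fully reduced value of exactly one chunk.
    Phase 2 (all-gather): the reduced chunks circulate for n - 1 more steps
    so every rank ends up with the full element-wise sum.
    """
    n = len(ranks)
    data = [list(v) for v in ranks]  # copy so the input is untouched
    # Phase 1: reduce-scatter
    for step in range(n - 1):
        for r in range(n):
            c = (r - step) % n
            data[(r + 1) % n][c] += data[r][c]
    # Phase 2: all-gather (rank r starts this phase owning reduced chunk r+1)
    for step in range(n - 1):
        for r in range(n):
            c = (r + 1 - step) % n
            data[(r + 1) % n][c] = data[r][c]
    return data

out = ring_all_reduce([[1.0, 2.0, 3.0],
                       [4.0, 5.0, 6.0],
                       [7.0, 8.0, 9.0]])
# every rank now holds the element-wise sum [12.0, 15.0, 18.0]
```

Counting steps in a model like this is a simple way to reason about where the latency term (2(N−1) link traversals) starts to dominate the bandwidth term in multi-node runs.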
## 🛠 Active Research
My work is open-source and documented in my primary engineering log:
- high-performance-deep-learning
A collection of my implementations, including Fused Softmax kernels, Vectorized Monte Carlo engines, and Distributed Training simulations.
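For readers unfamiliar with the fused-softmax idea: a naive implementation launches separate kernels for max, subtract, exp, sum, and divide, writing intermediates to global memory between each one, while a fused Triton kernel keeps each row in on-chip memory and does all five steps in one pass. The NumPy reference below defines the numerically stable output such a kernel must reproduce; it is a correctness oracle, not the kernel itself.

```python
import numpy as np

def softmax_rows(x: np.ndarray) -> np.ndarray:
    """Numerically stable row-wise softmax (reference for a fused kernel).

    Subtracting the row max before exponentiating prevents overflow for
    large logits without changing the result.
    """
    row_max = x.max(axis=1, keepdims=True)
    e = np.exp(x - row_max)
    return e / e.sum(axis=1, keepdims=True)

y = softmax_rows(np.array([[1.0, 2.0, 3.0],
                           [1000.0, 1000.0, 1000.0]]))
```

The second row (all logits equal to 1000) would overflow a naive `exp` in float32; the max-subtraction trick returns the exact uniform distribution instead.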
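The vectorized Monte Carlo idea can be sketched in a few lines: rather than looping over samples in Python, draw them all at once and reduce with array operations. A pi estimator (my choice of toy problem here, not necessarily the engine in the repository) shows the pattern.

```python
import numpy as np

def estimate_pi(n_samples: int, seed: int = 0) -> float:
    """Vectorized Monte Carlo estimate of pi.

    Samples n points uniformly in the unit square and measures the
    fraction falling inside the quarter circle of radius 1; that
    fraction converges to pi/4. All work is done in batched array
    ops, so throughput scales with the BLAS/SIMD backend rather
    than the Python interpreter.
    """
    rng = np.random.default_rng(seed)
    xy = rng.random((n_samples, 2))        # points in [0, 1)^2
    inside = (xy ** 2).sum(axis=1) <= 1.0  # inside the quarter circle?
    return 4.0 * inside.mean()
```

The standard error shrinks as 1/sqrt(n), so a million samples already pins the estimate to roughly two decimal places.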

