Passionate about parallel computing, inference systems, kernel-level optimization, and building efficient AI infrastructure close to the metal.
Local simulation of a multi-node HPC cluster for job scheduling and parallel workload validation.
A cross-platform benchmarking tool to identify bandwidth vs. compute bottlenecks across diverse hardware (CPU, CUDA, Metal).
Ported and modernized legacy DSP blocks to the new modular 4.0 architecture, enabling high-throughput signal chains.
Safety middleware for autonomous agents. Enforces deterministic output constraints and behavioral bounds for LLM-driven actions.
Request coalescing backend. Implements Redis-based locking to reduce redundant upstream LLM calls and improve tail latency.
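As a rough illustration of the request coalescing idea, here is a minimal single-process sketch. It is not the project's code: the `Coalescer` class and the `upstream` callable are hypothetical names, and an in-process `threading.Lock` stands in for the Redis-based lock so the example runs without a Redis server.

```python
import threading
from concurrent.futures import Future


class Coalescer:
    """Collapse concurrent identical requests into one upstream call."""

    def __init__(self, upstream):
        self._upstream = upstream      # hypothetical callable: key -> result
        self._lock = threading.Lock()  # stands in for a Redis lock in this sketch
        self._inflight = {}            # key -> Future shared by all waiters

    def get(self, key):
        # Decide under the lock whether this caller leads or follows.
        with self._lock:
            fut = self._inflight.get(key)
            leader = fut is None
            if leader:
                fut = Future()
                self._inflight[key] = fut

        if leader:
            # Only the leader performs the upstream call.
            try:
                fut.set_result(self._upstream(key))
            except Exception as exc:
                # Followers see the same failure as the leader.
                fut.set_exception(exc)
            finally:
                with self._lock:
                    self._inflight.pop(key, None)

        # Followers block here until the leader publishes a result.
        return fut.result()
```

Callers that arrive while a request for the same key is in flight wait on the shared future instead of issuing their own upstream call, which is where the reduction in redundant calls would come from. A multi-process deployment would need the lock and the result hand-off to live in a shared store such as Redis.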
LLVM project member with commit access, contributing to the Flang frontend with a focus on semantic correctness, diagnostics, and compiler reliability.
Contributor to the vLLM inference engine, focused on serving and runtime correctness, reproducible debugging, and systems-level reliability.