Passionate about parallel computing, optimizing hardware, and building efficient AI infrastructure.
A cross-platform benchmarking tool to identify bandwidth vs. compute bottlenecks across diverse hardware (CPU, CUDA, Metal).
Local simulation of a multi-node HPC cluster for job scheduling and parallel workload validation.
Ported and modernized legacy DSP blocks to the new modular 4.0 architecture, enabling high-throughput signal chains.
Safety middleware for autonomous agents. Enforces deterministic output constraints and behavioral bounds for LLM-driven actions.
Reliability: Request coalescing backend. Implements Redis-based locking to reduce redundant upstream LLM calls and improve tail latency.
Performance: Contributing to OpenMP support in the Flang frontend, with a focus on semantic correctness and diagnostics.
Core contributor to the next-gen SDR runtime (GSoC '25).
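
For readers unfamiliar with request coalescing, the idea behind the backend described above can be sketched in a few lines. This is a minimal, single-process illustration and not the project's actual code: it swaps the Redis-based lock for an in-process `threading.Lock`, and the names `Coalescer` and `fetch` are hypothetical. In a distributed setup the leader election would instead typically rely on a Redis lock (for example, a key set with `SET ... NX` and an expiry).

```python
import threading
from concurrent.futures import Future


class Coalescer:
    """Collapse concurrent requests for the same key into one upstream call.

    Assumption: `fetch` is a hypothetical callable standing in for the
    expensive upstream LLM request. An in-process lock replaces the
    Redis-based lock mentioned in the project description.
    """

    def __init__(self, fetch):
        self._fetch = fetch
        self._lock = threading.Lock()
        self._in_flight = {}  # key -> Future shared by all concurrent callers

    def get(self, key):
        # Decide under the lock whether this caller is the leader for `key`.
        with self._lock:
            future = self._in_flight.get(key)
            leader = future is None
            if leader:
                future = Future()
                self._in_flight[key] = future

        if leader:
            # Only the leader performs the upstream call; followers wait below.
            try:
                future.set_result(self._fetch(key))
            except Exception as exc:
                future.set_exception(exc)
            finally:
                # Clear the entry so later requests trigger a fresh call.
                with self._lock:
                    self._in_flight.pop(key, None)

        # Leader and followers all read the same result (or re-raised error).
        return future.result()
```

Callers that arrive while a request for the same key is in flight block on the shared future instead of issuing their own upstream call, which is where the reduction in redundant calls comes from. Results are not cached after completion, so this sketch deduplicates only overlapping requests.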