"When something is important enough, you do it even if the odds are not in your favor."
krxgu@macbook-pro:~
whoami
Krish Gupta
cat role.json
{
  "role": "Systems + Inference Engineer",
  "focus": ["Inference Infrastructure", "Kernel Optimization", "Compiler Correctness"],
  "stack": ["LLVM/Flang", "vLLM", "CUDA", "C++23", "Python"]
}

Krish Gupta

Passionate about parallel computing, inference systems, kernel-level optimization, and building efficient AI infrastructure close to the metal.

"How can we make inference hardware that is 10, 20, 50, 1,000 times more efficient than what we have today?"
— Jeff Dean (Google Chief Scientist)

Selected Engineering

Slurm Mini-Cluster

HPC Operations & Infrastructure

Local simulation of a multi-node HPC cluster for job scheduling and parallel workload validation.

  • Deployed multi-node Ubuntu cluster with Slurm + Munge authentication.
  • Validated scheduling policies with real MPI/OpenMP workloads.
  • Created comprehensive ops runbook for job triage and service recovery.
Linux Slurm MPI Bash

GPU Roofline Benchmark

Performance Engineering

A cross-platform benchmarking tool to identify bandwidth vs. compute bottlenecks across diverse hardware (CPU, CUDA, Metal).

  • Built micro-benchmarks (SAXPY/Triad/SGEMM) for peak performance measurement.
  • Automated device selection and roofline plotting logic.
  • Supports heterogeneous backends (CUDA, Metal, OpenMP).
C++17 CUDA Python CMake
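
For readers new to the roofline model, the classification such a tool performs can be sketched in a few lines. This is an illustration only, not code from the project: the function names and the device figures (10,000 GFLOP/s peak compute, 500 GB/s peak bandwidth) are hypothetical, and the SGEMM intensity assumes the idealized case where each matrix crosses the memory bus once.

```python
def attainable_gflops(peak_gflops, peak_bandwidth_gbs, intensity):
    """Roofline bound: min(compute roof, bandwidth roof * arithmetic intensity)."""
    return min(peak_gflops, peak_bandwidth_gbs * intensity)


def classify(peak_gflops, peak_bandwidth_gbs, intensity):
    """Compare a kernel's intensity (FLOP/byte) against the ridge point."""
    ridge = peak_gflops / peak_bandwidth_gbs
    return "bandwidth-bound" if intensity < ridge else "compute-bound"


def saxpy_intensity(bytes_per_element=4):
    # y[i] = a * x[i] + y[i]: 2 FLOPs per element,
    # with x[i] and y[i] read and y[i] written (3 memory accesses).
    return 2 / (3 * bytes_per_element)


def sgemm_intensity(n, bytes_per_element=4):
    # C = A @ B for n x n matrices: 2 * n^3 FLOPs.
    # Idealized traffic (assumption): A and B read once, C written once.
    return (2 * n**3) / (3 * n * n * bytes_per_element)


if __name__ == "__main__":
    # Hypothetical device figures, chosen only for illustration.
    peak_gflops, peak_bw = 10_000.0, 500.0
    kernels = {"SAXPY": saxpy_intensity(), "SGEMM n=1024": sgemm_intensity(1024)}
    for name, ai in kernels.items():
        bound = classify(peak_gflops, peak_bw, ai)
        perf = attainable_gflops(peak_gflops, peak_bw, ai)
        print(f"{name}: {ai:.3f} FLOP/byte, {bound}, <= {perf:.1f} GFLOP/s")
```

On the hypothetical device the ridge point sits at 20 FLOP/byte, so SAXPY (about 0.167 FLOP/byte) falls under the bandwidth roof while a large SGEMM falls under the compute roof.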

GNU Radio 4.0 Expansion (GSoC '25)

Signal Processing & C++23

Ported and modernized legacy DSP blocks to the new modular 4.0 architecture, enabling high-throughput signal chains.

  • Implemented Math/Analog/Digital families using C++23 template-registry patterns.
  • Achieved ~2x throughput in targeted benchmarks vs legacy blocks.
  • Added 75+ GoogleTest units and established CI regression baselines.
C++23 SIMD GoogleTest

Building

ai-agent-guardrails ↗

Safety middleware for autonomous agents. Enforces deterministic output constraints and behavioral bounds for LLM-driven actions.

Reliability

vercel-fluid-mvp ↗

Request coalescing backend. Implements Redis-based locking to reduce redundant upstream LLM calls and improve tail latency.

Performance
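
The idea behind request coalescing can be sketched as follows. This is a simplified, in-process illustration using `asyncio`, not the project's implementation: the project describes Redis-based locking, which coordinates across processes, whereas this sketch only shares work inside one event loop. The `Coalescer` class and `fake_upstream` function are hypothetical names.

```python
import asyncio


class Coalescer:
    """Share one in-flight upstream call among concurrent callers with the same key."""

    def __init__(self):
        self._inflight = {}  # key -> asyncio.Task for the pending upstream call

    async def fetch(self, key, upstream):
        task = self._inflight.get(key)
        if task is None:
            # First caller for this key starts the upstream call.
            task = asyncio.ensure_future(upstream(key))
            self._inflight[key] = task
            # Forget the key once the call finishes so later requests start fresh.
            task.add_done_callback(lambda _t: self._inflight.pop(key, None))
        # shield() keeps one cancelled caller from cancelling the shared call.
        return await asyncio.shield(task)


async def demo():
    calls = 0

    async def fake_upstream(prompt):
        # Stand-in for a slow LLM request (hypothetical).
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)
        return f"response:{prompt}"

    coalescer = Coalescer()
    results = await asyncio.gather(
        *(coalescer.fetch("same prompt", fake_upstream) for _ in range(5))
    )
    return calls, results


if __name__ == "__main__":
    calls, results = asyncio.run(demo())
    print(f"{len(results)} callers served by {calls} upstream call(s)")
```

Five concurrent callers with the same key trigger a single upstream call here. A multi-process deployment would need a shared lock and result store in place of the in-memory dictionary.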

Open Source

LLVM / Flang (OpenMP) ↗

LLVM project member with commit access, contributing to the Flang frontend with a focus on semantics correctness, diagnostics, and compiler reliability.

  • Merged fixes across OpenMP semantics and diagnostic behavior.
  • Expanded FileCheck and regression coverage for compiler correctness.

vLLM ↗

Contributor to the vLLM inference engine, focused on serving and runtime correctness, reproducible debugging, and systems-level reliability.

  • Closed PRs across real inference-engine issues, grounded in reproducible local debugging.
  • Working in the overlap of inference systems, CUDA-aware performance thinking, and runtime correctness.

Notes

View all on Medium →

Inspiration

Elon Musk

"Be a net contributor to society."

— Elon Musk

Steve Jobs

"I’m convinced that about half of what separates the successful entrepreneurs from the non-successful ones is pure perseverance."

— Steve Jobs (1995)

Thomas Edison

"Our greatest weakness lies in giving up. The most certain way to succeed is always to try just one more time."

— Thomas Edison