Closing the bare-metal gap
Led B200 SKU enablement end to end; validated 397 TFLOPS single-GPU under VFIO passthrough, matching the NVIDIA SU1 bare-metal baseline; unlocked multi-tenant GPU VMs.
Director of Engineering · Senior Staff SWE
Hands-on architect and engineering leader with 17 years in large-scale distributed systems, now owning fleet-scale GPU and LLM inference infrastructure at Coupang Intelligence Cloud: the Kubernetes-native substrate beneath the company's model training and serving, spanning ~5,000 H200/B200 GPUs across 200+ clusters at 99.99% availability. Close enough to the code to debug an OVS flow pipeline or an NCCL collective on a B200 fabric.
Kirkland, WA · Staff / Principal IC + Eng leadership
Designed and wrote CIC's custom Kubernetes GPU scheduler in Go (inspired by NVIDIA KAI-Scheduler): gang scheduling with transactional all-or-nothing allocation, so no job is ever partially placed, plus an async Binder controller (BindRequest CRD) that decouples placement from API-server latency.
CompositeApplication CRD (patent-pending): one declarative spec that the control plane composes and reconciles across compute, storage, networking, and identity; the basis for the tenant-facing API.
KVM + OVS + VXLAN overlay with BGP EVPN; per-tenant DNS identity, IPAM for InfiniBand/SR-IOV, and egress accounting; served as incident commander for fleet networking.
Offloaded the entire host network and storage data plane to BlueField-3 DPUs (hardware-offloaded OVS): near-line-rate throughput with negligible host CPU overhead. Owned the DPU lifecycle end to end (firmware/OS via Redfish and clusterware, network boot via NVIDIA DOCA SNAP, qemu-nbd → virtio-blk), along with tenant IP/OS mobility, dual-path RAID-1 to DPU block devices, and active/standby dual-DPU failover tied to host UEFI boot order. Partnered with NVIDIA engineering on converged networking.
Migrated etcd out of NVIDIA Base Command Manager and transferred Day 0 / Day 2 ownership in-house; eliminated BCM licensing for the Kubernetes layer.
Built a duplicate-item-matching platform: parallel image and text deep-embedding pipelines with FAISS vector search across 50M+ catalogs at 3,500 RPS. The two pipelines returned complementary match sets, so a union-of-candidates design delivered a 106% recall lift over Elasticsearch.
Coupang Intelligence Cloud
Director of Engineering · Senior Staff SWE
Owning fleet-scale GPU and LLM inference infrastructure — the Kubernetes-native control plane beneath model training and serving across ~5,000 H200/B200 GPUs and 200+ clusters.
AWS
Senior SWE
WAFV2 + Firewall Manager: built edge security control planes operating at billions of requests per day.
Microsoft
Senior SWE
Dynamics CRM Online reliability: hardened a large multi-tenant SaaS platform.
Intel
Lead SWE · Foundry Services
Led engineering in Foundry Services, driving $3M in new revenue.
Open to the right room
Staff / Principal IC · Eng leadership · GPU, ML & distributed-systems infrastructure