Director of Engineering · Senior Staff SWE

I build the control plane underneath the GPUs.

Hands-on architect and engineering leader with 17 years in large-scale distributed systems, now owning fleet-scale GPU and LLM inference infrastructure at Coupang Intelligence Cloud: the Kubernetes-native substrate beneath the company's model training and serving, ~5,000 H200/B200 GPUs across 200+ clusters, held at 99.99% availability. Close enough to the code to debug an OVS flow pipeline or an NCCL collective on a B200 fabric.

Kirkland, WA · Staff / Principal IC + Eng leadership

Experience: 17 yrs
Fleet: ~5,000 GPUs
Clusters: 200+
Availability: 99.99%
Cost saved: $8M+
01

Engineering stories

GPU Virtualization · VFIO
01 / 07

Closing the bare-metal gap

Led B200 SKU enablement end to end; validated 397 TFLOPS single-GPU under VFIO passthrough, matching the NVIDIA SU1 bare-metal baseline; unlocked multi-tenant GPU VMs.

397 TFLOPS = bare metal
B200 · VFIO · SR-IOV

Custom Scheduler · Go
02 / 07

A scheduler that refuses to half-allocate

Designed and wrote CIC's custom K8s GPU scheduler in Go (inspired by NVIDIA KAI-Scheduler): gang scheduling with transactional all-or-nothing allocation (zero partial allocations); an async Binder controller (BindRequest CRD) decoupling placement from API-server latency.
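
To make "transactional all-or-nothing" concrete, here is a minimal sketch of the idea in Go. It is an illustration only, not CIC's scheduler: the `Node`, `PodRequest`, and `PlaceGang` names are hypothetical, and placement is a naive first-fit that can miss packings a real scheduler would find. The point it shows is that the plan is built against a scratch copy of free capacity and committed only when every gang member fits.

```go
package main

import (
	"errors"
	"fmt"
)

// Node tracks free GPUs on one host (hypothetical type for illustration).
type Node struct {
	Name     string
	FreeGPUs int
}

// PodRequest is one member of a gang, e.g. one worker of a training job.
type PodRequest struct {
	Name string
	GPUs int
}

// ErrGangUnschedulable means at least one member could not be placed.
var ErrGangUnschedulable = errors.New("gang does not fit; nothing allocated")

// PlaceGang plans placements against a scratch copy of free capacity and
// commits to the nodes only if every pod in the gang fits. On failure the
// nodes are left exactly as they were, so no partial allocation can leak.
func PlaceGang(nodes []*Node, gang []PodRequest) (map[string]string, error) {
	free := make([]int, len(nodes))
	for i, n := range nodes {
		free[i] = n.FreeGPUs
	}

	placement := make(map[string]string, len(gang))
	for _, pod := range gang {
		placed := false
		for i, n := range nodes {
			if free[i] >= pod.GPUs { // naive first-fit on the scratch copy
				free[i] -= pod.GPUs
				placement[pod.Name] = n.Name
				placed = true
				break
			}
		}
		if !placed {
			return nil, ErrGangUnschedulable // scratch copy is discarded
		}
	}

	// Commit phase: every member fit, so apply the plan in one pass.
	for i, n := range nodes {
		n.FreeGPUs = free[i]
	}
	return placement, nil
}

func main() {
	nodes := []*Node{
		{Name: "node-a", FreeGPUs: 8},
		{Name: "node-b", FreeGPUs: 4},
	}

	// Two 8-GPU workers: the first would fit, the second would not,
	// so the whole gang is rejected and node-a keeps all 8 GPUs.
	_, err := PlaceGang(nodes, []PodRequest{
		{Name: "w0", GPUs: 8},
		{Name: "w1", GPUs: 8},
	})
	fmt.Println("gang 1:", err, "| node-a free:", nodes[0].FreeGPUs)

	// An 8-GPU and a 4-GPU worker fit together and are committed.
	placement, err := PlaceGang(nodes, []PodRequest{
		{Name: "w0", GPUs: 8},
		{Name: "w1", GPUs: 4},
	})
	fmt.Println("gang 2:", placement, err)
}
```

In a real control plane the commit step would also have to be atomic with respect to concurrent scheduling cycles; this sketch assumes a single caller.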

0 partial allocations
Gang · Fractional · Preempt-reclaim

Patent-Pending · CRD
03 / 07

One spec for a whole AI application

CompositeApplication CRD: one declarative spec the control plane composes and reconciles across compute, storage, networking, and identity; patent-pending; the basis for the tenant-facing API.

Patent-pending
Declarative · Reconciled · Multi-resource
Patent filing · link on grant

Multi-Tenant SDN
04 / 07

Isolating tenants down to the flow

KVM + OVS + VXLAN overlay with BGP EVPN; per-tenant DNS identity, IPAM for InfiniBand/SR-IOV, and egress accounting; served as incident commander for fleet networking.

OVS/OVN · BGP EVPN · VXLAN
Per-tenant isolation

DPU-Assisted Bare-Metal Cloud
05 / 07

Moving the whole data plane onto the DPU

Offloaded the entire host network and storage data plane to BlueField-3 DPUs (hardware-offloaded OVS): near-line-rate throughput with negligible host CPU overhead. Owned the DPU lifecycle end to end (firmware/OS via Redfish and clusterware, network boot via NVIDIA DOCA SNAP, qemu-nbd → virtio-blk), with tenant IP/OS mobility, dual-path RAID-1 to DPU block devices, and active/standby dual-DPU failover tied to host UEFI boot order. Partnered with NVIDIA engineering on converged networking.

Near-line-rate · negligible host CPU
BlueField-3 · DOCA SNAP · hw-offload OVS
Dual-DPU failover

Platform Ownership
06 / 07

Cutting the vendor cord

Migrated etcd out of NVIDIA Base Command Manager and transferred Day 0 / Day 2 ownership in-house; eliminated BCM licensing for the Kubernetes layer.

Vendor licensing eliminated
etcd migration · Self-owned cadence

Applied ML Platform
07 / 07

Finding the same product twice — at catalog scale

Built a duplicate-item-matching platform: parallel image + text deep-embedding pipelines with FAISS vector search across 50M+ catalogs at 3,500 RPS; complementary match sets enabled a union-of-candidates design giving a 106% recall lift over Elasticsearch.
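
The union-of-candidates idea can be sketched in a few lines of Go. This is a toy illustration under stated assumptions, not the production pipeline: the `Union`, `Recall`, and `RelativeLiftPct` helpers and the item IDs are hypothetical, and the numbers it prints come from the made-up data, not from the platform's results. It shows why complementary match sets matter: when two retrievers find different true duplicates, the union recalls more than either alone.

```go
package main

import "fmt"

// Union merges any number of candidate sets into one set of item IDs.
func Union(sets ...map[string]bool) map[string]bool {
	out := make(map[string]bool)
	for _, s := range sets {
		for id, ok := range s {
			if ok {
				out[id] = true
			}
		}
	}
	return out
}

// Recall is the fraction of ground-truth duplicates found in candidates.
// It returns 0 when there is no ground truth to recall.
func Recall(candidates map[string]bool, truth []string) float64 {
	if len(truth) == 0 {
		return 0
	}
	hits := 0
	for _, id := range truth {
		if candidates[id] {
			hits++
		}
	}
	return float64(hits) / float64(len(truth))
}

// RelativeLiftPct is the relative improvement of improved over base,
// in percent. The caller is expected to pass base > 0.
func RelativeLiftPct(base, improved float64) float64 {
	return (improved - base) / base * 100
}

func main() {
	// Toy ground truth: four known duplicates of a query item.
	truth := []string{"d1", "d2", "d3", "d4"}

	// Hypothetical candidate sets from three retrievers.
	keyword := map[string]bool{"d1": true}          // keyword baseline
	image := map[string]bool{"d1": true, "d2": true} // image-embedding search
	text := map[string]bool{"d3": true}              // text-embedding search

	base := Recall(keyword, truth)
	merged := Recall(Union(image, text), truth)

	fmt.Printf("baseline recall %.2f, union recall %.2f, relative lift %.0f%%\n",
		base, merged, RelativeLiftPct(base, merged))
}
```

A relative lift above 100% means the union recalls more than twice what the baseline does, which is how a figure such as 106% can arise.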

106% recall ↑
Embeddings · FAISS · Vector search
Published
02

The path

2020 — Now

Coupang Intelligence Cloud

Director of Engineering · Senior Staff SWE

Owning fleet-scale GPU and LLM inference infrastructure — the Kubernetes-native control plane beneath model training and serving across ~5,000 H200/B200 GPUs and 200+ clusters.

2018 — 2020

AWS

Senior SWE

WAFV2 + Firewall Manager — building edge security control planes operating at billions of requests per day.

2015 — 2018

Microsoft

Senior SWE

Dynamics CRM Online reliability — hardening a large multi-tenant SaaS platform.

2009 — 2015

Intel

Lead SWE · Foundry Services

Led engineering in Foundry Services, driving $3M in new revenue.

03

The stack

GPU & LLM Compute

  • vLLM
  • H200 / B200
  • VFIO
  • Fractional GPU
  • InfiniBand / SR-IOV
  • NCCL
  • RunAI

Kubernetes Control Plane

  • Custom CRDs / controllers
  • Custom scheduler
  • Gang / preempt-reclaim
  • Admission webhooks
  • etcd
  • Self-healing reconciliation

SDN & Overlay

  • KVM / OVS / OVN
  • VXLAN
  • BGP EVPN
  • OpenFlow
  • CNI / Calico
  • IPAM

Languages & Tooling

  • Go
  • Java
  • Python
  • gRPC / Protobuf
  • Linux internals
  • Prometheus / Grafana
  • Terraform / Helm / ArgoCD

Open to the right room

Let's talk control planes, GPUs, and teams.

Staff / Principal IC · Eng leadership · GPU, ML & distributed-systems infrastructure