Machine Learning Engineer
Sherwin Wang
Building fast, efficient inference for large language models.
I work on the systems that make LLMs fast: custom CUDA kernels, KV cache management, speculative decoding, and low-precision compute. I care about closing the gap between research prototypes and production throughput.
Skills & Tools
Python
C++
CUDA
PyTorch
Triton
FlashAttention
LLM Inference
ML Systems
vLLM
Quantization