Machine Learning Engineer

Sherwin Wang

Building fast, efficient inference for large language models.

I work on the systems that make LLMs fast: custom CUDA kernels, KV cache management, speculative decoding, and low-precision compute. I care about closing the throughput gap between research prototypes and production serving.

Skills & Tools

Python · C++ · CUDA · PyTorch · Triton · FlashAttention · vLLM · LLM Inference · Quantization · ML Systems