Period: Fall 2025
FlashAttention with KV-Cache Inference on Hugging Face GPT-2
Highlights
- Developed a custom FlashAttention kernel in CUDA, fusing the attention operations into a single kernel to minimize off-chip (HBM) memory traffic
- Designed a tiling scheme that stages query, key, and value blocks in shared memory and applies online softmax for numerical stability at scale
- Optimized KV-cache management to reduce time-to-first-token (TTFT), time-between-tokens (TBT), and overall latency, and to extend the usable context window
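The online softmax mentioned above can be sketched in a few lines. This is a minimal NumPy illustration (function and variable names are mine, not the project's kernel): scores are processed tile by tile while a running max and normalizer are maintained, the same trick FlashAttention uses to keep softmax numerically stable without materializing a full attention row at once.

```python
import numpy as np

def online_softmax(scores, tile_size=4):
    # Streaming pass: keep a running max m and normalizer l. Whenever a
    # tile raises the max, the partial sum accumulated so far is rescaled
    # by exp(m_old - m_new), so no tile ever overflows exp().
    m, l = -np.inf, 0.0
    for start in range(0, len(scores), tile_size):
        tile = scores[start:start + tile_size]
        m_new = max(m, float(tile.max()))
        l = l * np.exp(m - m_new) + float(np.exp(tile - m_new).sum())
        m = m_new
    # With the global max and normalizer fixed, the probabilities follow.
    return np.exp(scores - m) / l
```

In the fused kernel the same rescaling is applied to the partial output accumulator held in registers, so only one pass over the tiles is needed.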
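The KV-cache idea behind the last bullet can be shown with a toy single-query attention in NumPy. This is a hypothetical sketch (the class and its layout are assumptions, not the project's implementation): one key/value row is appended per decoded token, so each step attends over the full prefix without recomputing earlier keys and values, which is what lowers per-token latency.

```python
import numpy as np

def attend(q, K, V):
    # Single-query scaled dot-product attention over all cached rows.
    s = K @ q / np.sqrt(len(q))
    p = np.exp(s - s.max())
    p /= p.sum()
    return p @ V

class KVCache:
    # Hypothetical preallocated cache: max_len bounds the context window,
    # and append() adds one token's key/value pair per decode step.
    def __init__(self, max_len, d):
        self.K = np.zeros((max_len, d))
        self.V = np.zeros((max_len, d))
        self.n = 0

    def append(self, k, v):
        self.K[self.n] = k
        self.V[self.n] = v
        self.n += 1

    def kv(self):
        # Views of only the filled portion of the cache.
        return self.K[:self.n], self.V[:self.n]
```

At decode time each new token's key/value pair is appended and attention runs against `kv()`; preallocating up to `max_len` trades memory for a larger usable context window.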
Technologies
CUDA, PyTorch, GPU Architecture