Period: Fall 2025
FlashAttention with KV-Cache Inference on Hugging Face GPT-2
Highlights
- Developed a custom FlashAttention kernel in CUDA, fusing the attention operations into a single kernel to minimize off-chip (HBM) memory traffic
- Designed a tiling scheme that stages query, key, and value blocks in shared memory and applies online softmax for numerical stability at scale
- Optimized KV-cache management to reduce time-to-first-token (TTFT), time-between-tokens (TBT), and overall latency, and to extend the usable context window
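The online softmax mentioned above can be sketched in a few lines. This is a minimal NumPy illustration (function and variable names are mine, not the project's kernel): scores are processed tile by tile while a running max and normalizer are maintained, the same trick FlashAttention uses to keep softmax numerically stable without materializing a full attention row at once.

```python
import numpy as np

def online_softmax(scores, tile_size=4):
    # Streaming pass: keep a running max m and normalizer l. Whenever a
    # tile raises the max, the partial sum accumulated so far is rescaled
    # by exp(m_old - m_new), so no tile ever overflows exp().
    m, l = -np.inf, 0.0
    for start in range(0, len(scores), tile_size):
        tile = scores[start:start + tile_size]
        m_new = max(m, float(tile.max()))
        l = l * np.exp(m - m_new) + float(np.exp(tile - m_new).sum())
        m = m_new
    # With the global max and normalizer fixed, the probabilities follow.
    return np.exp(scores - m) / l
```

In the fused kernel the same rescaling is applied to the partial output accumulator held in registers, so only one pass over the tiles is needed.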
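The KV-cache idea behind the last bullet can be shown with a toy single-query attention in NumPy. This is a hypothetical sketch (the class and its layout are assumptions, not the project's implementation): one key/value row is appended per decoded token, so each step attends over the full prefix without recomputing earlier keys and values, which is what lowers per-token latency.

```python
import numpy as np

def attend(q, K, V):
    # Single-query scaled dot-product attention over all cached rows.
    s = K @ q / np.sqrt(len(q))
    p = np.exp(s - s.max())
    p /= p.sum()
    return p @ V

class KVCache:
    # Hypothetical preallocated cache: max_len bounds the context window,
    # and append() adds one token's key/value pair per decode step.
    def __init__(self, max_len, d):
        self.K = np.zeros((max_len, d))
        self.V = np.zeros((max_len, d))
        self.n = 0

    def append(self, k, v):
        self.K[self.n] = k
        self.V[self.n] = v
        self.n += 1

    def kv(self):
        # Views of only the filled portion of the cache.
        return self.K[:self.n], self.V[:self.n]
```

At decode time each new token's key/value pair is appended and attention runs against `kv()`; preallocating up to `max_len` trades memory for a larger usable context window.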
Technologies
CUDA, PyTorch, GPU Architecture