Period: Fall '25

FlashAttention with KV Caching for Inference on Hugging Face GPT-2

Highlights

  • Developed a custom FlashAttention kernel in CUDA, using kernel fusion to minimize off-chip (HBM) memory traffic
  • Designed a shared-memory tiling scheme with online softmax for numerical stability and scalability to long sequences
  • Optimized KV cache management to reduce time-to-first-token (TTFT), time-between-tokens (TBT), and end-to-end latency, and to extend the usable context window
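The online softmax mentioned above is the trick that lets FlashAttention process attention scores tile by tile without ever materializing the full score row: it keeps a running maximum and a rescaled running sum, so earlier partial results stay valid when a larger score arrives. A minimal single-pass sketch in pure Python (illustrative only; the real kernel applies this per tile in shared memory):

```python
import math

def online_softmax(scores):
    """Numerically stable softmax in one streaming pass.

    Tracks a running max and a running sum of exponentials; whenever a
    new maximum appears, the accumulated sum is rescaled so all earlier
    terms remain consistent with the new reference point.
    """
    running_max = float("-inf")
    running_sum = 0.0
    for s in scores:
        new_max = max(running_max, s)
        # Rescale the sum accumulated under the old max, then add the new term.
        running_sum = running_sum * math.exp(running_max - new_max) + math.exp(s - new_max)
        running_max = new_max
    # Second pass only to emit normalized weights; the streaming state
    # (running_max, running_sum) is what a tiled kernel actually carries.
    return [math.exp(s - running_max) / running_sum for s in scores]
```

Because subtraction of the running max bounds every exponent at zero, this stays finite even for large logits where a naive `exp` would overflow.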
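The KV-cache idea behind the latency numbers can be sketched as follows: the prompt is processed once (prefill, which sets TTFT), and each subsequent decode step appends a single key/value pair and attends over the cached prefix instead of recomputing it, which is what drives TBT down. This is a hypothetical toy single-head version with Python lists, not the project's CUDA implementation:

```python
import math

def attend(q, cache_k, cache_v):
    """One decode step of single-head attention over cached keys/values
    (scalar dot products; a real kernel works on tiles of these)."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in cache_k]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    dim = len(cache_v[0])
    return [sum(w * v[d] for w, v in zip(weights, cache_v)) / z
            for d in range(dim)]

class KVCache:
    """Append-only per-layer key/value store: prefill fills it once,
    then each generated token adds exactly one entry, so decode cost
    per token grows with context length instead of being recomputed."""
    def __init__(self):
        self.k, self.v = [], []

    def step(self, q, k, v):
        # Cache this token's key/value, then attend over the full prefix.
        self.k.append(k)
        self.v.append(v)
        return attend(q, self.k, self.v)
```

Managing this cache efficiently (layout, reuse, eviction) is what trades memory for the longer context window noted above.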

Technologies

CUDA, PyTorch, GPU Architecture