Period: Fall '25

Scaled BERT Training with Pipeline Parallelism and Fully Sharded Data Parallelism

Highlights

  • Scaled BERT training with pipeline parallelism, tuning the number of micro-batch chunks to minimize pipeline bubbles
  • Optimized host-to-device transfers using pinned (page-locked) memory and CUDA streams to overlap copies with compute asynchronously
  • Implemented FSDP to shard parameters, gradients, and optimizer states across ranks, reducing peak GPU memory usage per rank versus DDP
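The micro-batch tuning in the first bullet follows the standard GPipe-style bubble model: with p pipeline stages and m micro-batch chunks per mini-batch, the idle ("bubble") fraction of the schedule is (p - 1) / (m + p - 1), so more chunks shrink the bubble at the cost of smaller per-chunk kernels. A minimal sketch (the stage and chunk counts are illustrative, not the project's actual values):

```python
def bubble_fraction(stages: int, micro_batches: int) -> float:
    """Idle fraction of a GPipe-style pipeline schedule: (p - 1) / (m + p - 1)."""
    return (stages - 1) / (micro_batches + stages - 1)

# Example: 4 stages. Increasing the number of micro-batch chunks shrinks the bubble.
for m in (1, 4, 16, 64):
    print(m, round(bubble_fraction(4, m), 3))
# 1  -> 0.75, 4 -> 0.429, 16 -> 0.158, 64 -> 0.045
```

In practice m cannot grow without bound: very small chunks under-utilize the GPU, so the sweet spot is found empirically.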
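The payoff of pinned memory plus CUDA streams in the second bullet is overlap: page-locked host buffers allow truly asynchronous host-to-device copies on a side stream, so per-step time drops from copy + compute toward max(copy, compute). A back-of-envelope model of that effect (the millisecond figures are made-up illustrative numbers, not measurements from the project):

```python
def step_time(copy_ms: float, compute_ms: float, overlapped: bool) -> float:
    """A serial H2D copy adds to compute time; an async copy issued on a
    separate CUDA stream hides behind compute, leaving only the longer of
    the two on the critical path."""
    return max(copy_ms, compute_ms) if overlapped else copy_ms + compute_ms

# Hypothetical step: 3 ms transfer, 10 ms forward/backward.
print(step_time(3.0, 10.0, overlapped=False))  # 13.0 (blocking copy)
print(step_time(3.0, 10.0, overlapped=True))   # 10.0 (copy fully hidden)
```

The copy is fully hidden only while transfer time stays below compute time, which is why pinned buffers are typically paired with prefetching the next batch during the current step.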
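The FSDP savings in the third bullet can be estimated with the usual ZeRO-style accounting: under mixed-precision Adam, each parameter carries roughly 16 bytes of persistent state (fp16 weight and gradient plus fp32 master weight and two fp32 Adam moments). DDP replicates all of it on every rank; full sharding divides it by world size. A rough sketch using BERT-large's roughly 340M parameters (the 16 bytes/param figure is the standard estimate, not a measured number, and activations are excluded):

```python
def model_state_gb(num_params: float, world_size: int, sharded: bool) -> float:
    """Persistent per-rank model-state memory (weights + grads + optimizer
    states) under the ~16 bytes/param mixed-precision Adam estimate."""
    bytes_per_param = 16
    total_bytes = num_params * bytes_per_param
    return (total_bytes / world_size if sharded else total_bytes) / 1e9

n = 340e6  # approximate BERT-large parameter count
print(round(model_state_gb(n, 8, sharded=False), 2))  # DDP: full copy on every rank
print(round(model_state_gb(n, 8, sharded=True), 2))   # FSDP: 1/8 of it per rank
```

Activation memory is not sharded this way, so the observed peak-memory reduction per rank is smaller than the raw factor-of-world-size drop in model state.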

Technologies

PyTorch · Slurm · Distributed Computing