Period: Fall '25
Scaled BERT training with pipeline parallelism and fully sharded data parallelism (FSDP)
Highlights
- Scaled BERT training with pipeline parallelism, tuning the number of micro-batch chunks to minimize pipeline bubbles
- Optimized host-to-device transfers with pinned (page-locked) memory and CUDA streams for asynchronous execution
- Implemented FSDP to shard parameters, gradients, and optimizer states, reducing peak GPU memory usage per rank versus DDP
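The bubble-tuning claim in the first bullet follows the standard GPipe-style schedule arithmetic: with p pipeline stages and m micro-batches, the idle (bubble) fraction is (p - 1) / (m + p - 1), so splitting each batch into more micro-batch chunks shrinks the bubble. A minimal sketch (the function name is illustrative, not from the original work):

```python
def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Idle (bubble) fraction of a GPipe-style pipeline schedule:
    (p - 1) / (m + p - 1) for p stages and m micro-batches."""
    p, m = num_stages, num_microbatches
    return (p - 1) / (m + p - 1)

# More micro-batch chunks -> smaller bubble, at the cost of smaller
# per-chunk kernels; the sweet spot is found empirically.
for m in (4, 8, 16, 32):
    print(f"stages=4, microbatches={m}: bubble={bubble_fraction(4, m):.2%}")
```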
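The second bullet combines two standard PyTorch mechanisms: staging host tensors in pinned (page-locked) memory so `cudaMemcpyAsync` can run truly asynchronously, and issuing the copy on a side CUDA stream so it overlaps with compute on the default stream. A sketch assuming a CUDA device is available (the function and its double-buffering omissions are illustrative, not the original code):

```python
import torch

def async_h2d(batches, device="cuda"):
    """Overlap host-to-device copies with compute: stage each batch in
    pinned host memory, then issue the copy on a dedicated stream so it
    runs concurrently with kernels on the default stream. A sketch; real
    code would also double-buffer to prefetch the next batch."""
    copy_stream = torch.cuda.Stream()
    for batch in batches:
        pinned = batch.pin_memory()  # page-locked staging buffer
        with torch.cuda.stream(copy_stream):
            on_gpu = pinned.to(device, non_blocking=True)
        # Make the compute stream wait for the copy before using the tensor.
        torch.cuda.current_stream().wait_stream(copy_stream)
        yield on_gpu
```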
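The memory claim in the FSDP bullet comes from ZeRO-3-style sharding arithmetic: DDP replicates the full model state on every rank, while FSDP partitions it across the world size. With mixed-precision Adam, model state is roughly 16 bytes per parameter (fp16 param + fp16 grad + fp32 master param, momentum, and variance). A back-of-the-envelope sketch (function name and the 16 bytes/param default are assumptions for illustration):

```python
def per_rank_state_bytes(num_params: int, world_size: int,
                         bytes_per_param: int = 16,
                         sharded: bool = True) -> float:
    """Approximate per-rank model-state memory.

    DDP keeps a full replica per rank; FSDP (ZeRO-3) shards parameters,
    gradients, and optimizer states across all ranks, dividing the
    footprint by the world size (activations excluded).
    """
    total = num_params * bytes_per_param
    return total / world_size if sharded else total

# BERT-base (~110M params) on 8 ranks: replicated vs sharded state.
gb = 1024 ** 3
print(f"DDP : {per_rank_state_bytes(110_000_000, 8, sharded=False) / gb:.2f} GiB/rank")
print(f"FSDP: {per_rank_state_bytes(110_000_000, 8, sharded=True) / gb:.2f} GiB/rank")
```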
Technologies
PyTorch, Slurm, Distributed Computing