Period: Fall '25
Scaled BERT training with pipeline parallelism and fully sharded data parallelism (FSDP)
Highlights
- Scaled BERT training with pipeline parallelism, tuning the number of micro-batch chunks to minimize pipeline bubbles
- Optimized host-to-device transfers with pinned (page-locked) memory and CUDA streams for asynchronous execution
- Implemented FSDP to shard parameters, gradients, and optimizer states, reducing peak GPU memory usage per rank versus DDP
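The bubble-tuning claim in the first bullet follows the standard GPipe-style schedule arithmetic: with p pipeline stages and m micro-batches, the idle (bubble) fraction is (p - 1) / (m + p - 1), so splitting each batch into more micro-batch chunks shrinks the bubble. A minimal sketch (the function name is illustrative, not from the original work):

```python
def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Idle (bubble) fraction of a GPipe-style pipeline schedule:
    (p - 1) / (m + p - 1) for p stages and m micro-batches."""
    p, m = num_stages, num_microbatches
    return (p - 1) / (m + p - 1)

# More micro-batch chunks -> smaller bubble, at the cost of smaller
# per-chunk kernels; the sweet spot is found empirically.
for m in (4, 8, 16, 32):
    print(f"stages=4, microbatches={m}: bubble={bubble_fraction(4, m):.2%}")
```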
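The second bullet combines two standard PyTorch mechanisms: staging host tensors in pinned (page-locked) memory so `cudaMemcpyAsync` can run truly asynchronously, and issuing the copy on a side CUDA stream so it overlaps with compute on the default stream. A sketch assuming a CUDA device is available (the function and its double-buffering omissions are illustrative, not the original code):

```python
import torch

def async_h2d(batches, device="cuda"):
    """Overlap host-to-device copies with compute: stage each batch in
    pinned host memory, then issue the copy on a dedicated stream so it
    runs concurrently with kernels on the default stream. A sketch; real
    code would also double-buffer to prefetch the next batch."""
    copy_stream = torch.cuda.Stream()
    for batch in batches:
        pinned = batch.pin_memory()  # page-locked staging buffer
        with torch.cuda.stream(copy_stream):
            on_gpu = pinned.to(device, non_blocking=True)
        # Make the compute stream wait for the copy before using the tensor.
        torch.cuda.current_stream().wait_stream(copy_stream)
        yield on_gpu
```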
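The memory claim in the FSDP bullet comes from ZeRO-3-style sharding arithmetic: DDP replicates the full model state on every rank, while FSDP partitions it across the world size. With mixed-precision Adam, model state is roughly 16 bytes per parameter (fp16 param + fp16 grad + fp32 master param, momentum, and variance). A back-of-the-envelope sketch (function name and the 16 bytes/param default are assumptions for illustration):

```python
def per_rank_state_bytes(num_params: int, world_size: int,
                         bytes_per_param: int = 16,
                         sharded: bool = True) -> float:
    """Approximate per-rank model-state memory.

    DDP keeps a full replica per rank; FSDP (ZeRO-3) shards parameters,
    gradients, and optimizer states across all ranks, dividing the
    footprint by the world size (activations excluded).
    """
    total = num_params * bytes_per_param
    return total / world_size if sharded else total

# BERT-base (~110M params) on 8 ranks: replicated vs sharded state.
gb = 1024 ** 3
print(f"DDP : {per_rank_state_bytes(110_000_000, 8, sharded=False) / gb:.2f} GiB/rank")
print(f"FSDP: {per_rank_state_bytes(110_000_000, 8, sharded=True) / gb:.2f} GiB/rank")
```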
Technologies
PyTorch, Slurm, Distributed Computing