PyTorch GPU Acceleration: 5 PyTorch Parallel Training Methods Every Graduate Student Should Master (Single Machine, Multiple GPUs)

Author: 啊沙发的非飞 | Source: Internet | 2023-09-17 19:37

Author: 纵横 @ Zhihu | Source: https://zhuanlan.zhihu.com/p/98535650 | Editor: 极市平台

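Editor's Introduction

Using PyTorch, the author wrote single-machine multi-GPU usage examples for several acceleration libraries on ImageNet, ready for readers to reuse.

It is Friday again, a fine day for slacking off: the machines are learning and the humans are bored. Before opening Bilibili to "study", I looked at my half-idle GPUs and decided to write something to feed them~ So I took 4 cards each from the V100-PCIe/V100/K80 machines to test which distributed training library is fastest! Now I can finally use up the remaining GPU memory and be my advisor's diligent student again (what a clever little rascal I am)!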

Take-Away

I used PyTorch to write usage examples of different acceleration libraries on ImageNet (single machine, multiple GPUs). Use them as a quickstart and copy the parts you need into your own project (the GitHub links are below):

1. The simple and convenient nn.DataParallel

https://github.com/tczhangzhi/pytorch-distributed/blob/master/dataparallel.py

2. Accelerating parallel training with torch.distributed

https://github.com/tczhangzhi/pytorch-distributed/blob/master/distributed.py

3. Replacing the launcher with torch.multiprocessing

https://github.com/tczhangzhi/pytorch-distributed/blob/master/multiprocessing_distributed.py

4. Further acceleration with apex

https://github.com/tczhangzhi/pytorch-distributed/blob/master/apex_distributed.py

5. An elegant implementation with horovod

https://github.com/tczhangzhi/pytorch-distributed/blob/master/horovod_distributed.py

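Here I recorded runtime tests on ImageNet using 4 Tesla V100-PCIe cards. The results show that Apex gives the best speedup, but the difference from Horovod/Distributed is small, so the built-in Distributed can be used directly in everyday work. DataParallel is slower and not recommended. (Results on V100/K80 will be added later; the tests were interrupted by some other experiments.)

Below is a brief record of how each library does distributed training, to serve as a README for the code (what a clever little rascal I am)~

The simple and convenient nn.DataParallel

DataParallel helps us (under single-process control) load the model and data onto multiple GPUs, control the flow of data between GPUs, and coordinate the model replicas on different GPUs for parallel training (the finer-grained operations include scatter, gather, and so on).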
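The scatter/gather flow mentioned above can be pictured with a small pure-Python sketch. It does not call PyTorch: `scatter`, `gather`, `toy_model`, and `data_parallel_forward` are hypothetical stand-ins written for this illustration, and the ceil-sized chunking is an assumption about how a batch is split along its first dimension.

```python
import math

def scatter(batch, num_devices):
    """Split a batch into contiguous, ceil-sized chunks, one per device."""
    chunk = max(1, math.ceil(len(batch) / num_devices))
    return [batch[i:i + chunk] for i in range(0, len(batch), chunk)]

def toy_model(x):
    """Stand-in for the forward pass on a single sample."""
    return 2 * x

def gather(per_device_outputs):
    """Concatenate per-device outputs back into one batch on the output device."""
    return [y for outputs in per_device_outputs for y in outputs]

def data_parallel_forward(batch, device_ids):
    # One controlling process scatters the batch, runs each replica, then gathers.
    chunks = scatter(batch, len(device_ids))
    per_device = [[toy_model(x) for x in chunk] for chunk in chunks]
    return gather(per_device)

batch = list(range(10))
print(scatter(batch, 4))  # [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
print(data_parallel_forward(batch, device_ids=[0, 1, 2, 3]))
```

Note that everything funnels through one process and one output device, which is one way to think about why this approach tends to load the first GPU more heavily than the others.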

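DataParallel is very convenient to use: we only need to wrap the model with DataParallel and set a few parameters. The parameters to define are which GPUs take part in training, device_ids=gpus, and which GPU the outputs are gathered on, output_device=gpus[0]. DataParallel automatically splits the data and loads it onto the corresponding GPUs, replicates the model onto those GPUs, runs the forward pass, and computes and aggregates the gradients:

    model = nn.DataParallel(model.cuda(), device_ids=gpus, output_device=gpus[0])

Note that the model must be loaded onto the GPU before DataParallel can process it, otherwise an error is raised; the data is likewise moved to the GPU inside the training loop: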

    # model.cuda() is required here
    model = nn.DataParallel(model.cuda(), device_ids=gpus, output_device=gpus[0])

    for epoch in range(100):
        for batch_idx, (images, target) in enumerate(train_loader):
            # images.cuda() / target.cuda() are required here
            images = images.cuda(non_blocking=True)
            target = target.cuda(non_blocking=True)
            ...
            output = model(images)
            loss = criterion(output, target)
            ...
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

To summarize, the DataParallel part of parallel training mainly involves the following code:

    # main.py
    import torch
    import torch.nn as nn
    import torch.optim as optim

    gpus = [0, 1, 2, 3]
    torch.cuda.set_device('cuda:{}'.format(gpus[0]))

    train_dataset = ...
    train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=...)

    model = ...
    # Move the model to the GPU first, then wrap it
    model = nn.DataParallel(model.cuda(), device_ids=gpus, output_device=gpus[0])

    optimizer = optim.SGD(model.parameters())

    for epoch in range(100):
        for batch_idx, (images, target) in enumerate(train_loader):
            images = images.cuda(non_blocking=True)
            target = target.cuda(non_blocking=True)
            ...
            output = model(images)
            loss = criterion(output, target)
            ...
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

To run it, simply execute the script with python:

    python main.py

For the complete training code on ImageNet, see the corresponding GitHub link in the Take-Away list.

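Accelerating parallel training with torch.distributed

Since PyTorch 1.0, the official library has finally wrapped the common distributed primitives, supporting all-reduce, broadcast, send, receive, and so on. CPU communication goes through backends such as MPI or Gloo, and GPU communication goes through NCCL. The PyTorch team has also recommended DistributedDataParallel as the fix for DataParallel's slowness and unbalanced GPU load, and it is now quite mature~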
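As a rough picture of what the all-reduce and broadcast primitives mean, here is a pure-Python sketch in which each "rank" is just an entry in a list. The functions `all_reduce_sum` and `broadcast` are hypothetical helpers for illustration only; they model the result of the collectives, not how torch.distributed implements them.

```python
def all_reduce_sum(values_per_rank):
    """After an all-reduce (sum), every rank holds the element-wise sum over all ranks."""
    total = [sum(column) for column in zip(*values_per_rank)]
    return [list(total) for _ in values_per_rank]

def broadcast(values_per_rank, src):
    """After a broadcast, every rank holds a copy of the source rank's values."""
    return [list(values_per_rank[src]) for _ in values_per_rank]

# Four ranks, each holding a two-element "gradient".
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

summed = all_reduce_sum(grads)
averaged = [[g / len(grads) for g in rank] for rank in summed]
print(averaged[0])  # [4.0, 5.0]
print(broadcast(grads, src=0)[3])  # [1.0, 2.0]
```

Dividing the summed result by the number of ranks gives the averaged gradient that data-parallel training typically applies on every replica.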

Unlike DataParallel, where a single process controls multiple GPUs, with distributed we only need to write one script, and torch automatically assigns it to n processes running on n GPUs respectively. At the API level, PyTorch provides the torch.distributed.launch launcher for running Python files in a distributed manner from the command line. During execution, the launcher passes the index of the current process (which is in fact the GPU index) to Python as an argument, and we can obtain it like this:

    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument('--local_rank', default=-1, type=int,
                        help='node rank for distributed training')
    args = parser.parse_args()

    print(args.local_rank)

Next, use init_process_group to set the backend and port used for communication between GPUs:

    dist.init_process_group(backend='nccl')

After that, use DistributedSampler to partition the dataset. It splits the data into several partitions, and the current process only needs to fetch the partition that corresponds to its rank for training:

    train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)

    train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=..., sampler=train_sampler)

Then wrap the model with DistributedDataParallel, which performs an all-reduce on the gradients computed on different GPUs (that is, it aggregates the gradients from all GPUs and synchronizes the result). After the all-reduce, the gradient on every GPU equals the mean of the per-GPU gradients from before the all-reduce:

    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank])

Finally, load the data and the model onto the GPU used by the current process and run the forward and backward passes as usual:

    torch.cuda.set_device(args.local_rank)

    model.cuda()

    for epoch in range(100):
        for batch_idx, (images, target) in enumerate(train_loader):
            images = images.cuda(non_blocking=True)
            target = target.cuda(non_blocking=True)
            ...
            output = model(images)
            loss = criterion(output, target)
            ...
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

To summarize, the torch.distributed part of parallel training mainly involves the following code:

    # main.py
    import argparse
    import torch
    import torch.distributed as dist
    import torch.optim as optim

    parser = argparse.ArgumentParser()
    parser.add_argument('--local_rank', default=-1, type=int,
                        help='node rank for distributed training')
    args = parser.parse_args()

    dist.init_process_group(backend='nccl')
    torch.cuda.set_device(args.local_rank)

    train_dataset = ...
    train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)

    train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=..., sampler=train_sampler)

    model = ...
    # Move the model to this process's GPU before wrapping it
    model.cuda()
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank])

    optimizer = optim.SGD(model.parameters())

    for epoch in range(100):
        for batch_idx, (images, target) in enumerate(train_loader):
            images = images.cuda(non_blocking=True)
            target = target.cuda(non_blocking=True)
            ...
            output = model(images)
            loss = criterion(output, target)
            ...
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

To run it, start the script with the torch.distributed.launch launcher:

    CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 main.py

For the complete training code on ImageNet, see the corresponding GitHub link in the Take-Away list.
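To make the DistributedSampler step in this section more concrete, the sketch below shows one way the sample indices could be divided among ranks. It is a simplified pure-Python illustration under stated assumptions (no shuffling, the dataset is at least as large as the number of replicas, and the index list is padded by wrapping around so every rank gets the same count); `shard_indices` is a hypothetical helper, not a PyTorch function.

```python
import math

def shard_indices(dataset_len, num_replicas, rank):
    """Return the sample indices that one rank would iterate over in an epoch."""
    num_samples = math.ceil(dataset_len / num_replicas)
    total_size = num_samples * num_replicas
    indices = list(range(dataset_len))
    # Pad by reusing the first few indices so the list divides evenly.
    indices += indices[: total_size - dataset_len]
    # Each rank takes every num_replicas-th index, starting at its own rank.
    return indices[rank:total_size:num_replicas]

for rank in range(4):
    print(rank, shard_indices(10, 4, rank))
# 0 [0, 4, 8]
# 1 [1, 5, 9]
# 2 [2, 6, 0]
# 3 [3, 7, 1]
```

Each process therefore sees a disjoint slice of the dataset (apart from the few padded indices), so the effective global batch size is the per-process batch size times the number of processes.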

Replacing the launcher with torch.multiprocessing

Some readers may be more familiar with torch.multiprocessing and can use it to manage the processes by hand, sidestepping a few small quirks in how torch.distributed.launch starts and exits processes automatically~

To use it, we only need to call torch.multiprocessing.spawn, and torch.multiprocessing creates the processes for us. As the code below shows, spawn starts nprocs=4 processes; each process runs main_worker and receives local_rank (the current process index) and args (here, 4 and myargs) as arguments:

    import torch.multiprocessing as mp

    mp.spawn(main_worker, nprocs=4, args=(4, myargs))

Here we take what torch.distributed.launch would otherwise manage and wrap it in the main_worker function, where proc corresponds to local_rank (the current process index), ngpus_per_node corresponds to 4, and args corresponds to myargs:

    def main_worker(proc, ngpus_per_node, args):
        # proc is the process index supplied by mp.spawn; it doubles as the GPU index
        dist.init_process_group(backend='nccl', init_method='tcp://127.0.0.1:23456', world_size=4, rank=proc)
        torch.cuda.set_device(proc)

        train_dataset = ...
        train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)

        train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=..., sampler=train_sampler)

        model = ...
        model.cuda()
        model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[proc])

        optimizer = optim.SGD(model.parameters())

        for epoch in range(100):
            for batch_idx, (images, target) in enumerate(train_loader):
                images = images.cuda(non_blocking=True)
                target = target.cuda(non_blocking=True)
                ...
                output = model(images)
                loss = criterion(output, target)
                ...
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

Note that in the code above, because the environment variables that torch.distributed.launch would provide as default configuration are not available, we have to pass the parameters to init_process_group by hand:

    dist.init_process_group(backend='nccl', init_method='tcp://127.0.0.1:23456', world_size=4, rank=proc)

To summarize, with multiprocessing added, the parallel-training part mainly involves the following code:

    # main.py
    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp
    import torch.optim as optim

    def main_worker(proc, ngpus_per_node, args):
        # proc is the process index supplied by mp.spawn; it doubles as the GPU index
        dist.init_process_group(backend='nccl', init_method='tcp://127.0.0.1:23456', world_size=4, rank=proc)
        torch.cuda.set_device(proc)

        train_dataset = ...
        train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)

        train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=..., sampler=train_sampler)

        model = ...
        model.cuda()
        model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[proc])

        optimizer = optim.SGD(model.parameters())

        for epoch in range(100):
            for batch_idx, (images, target) in enumerate(train_loader):
                images = images.cuda(non_blocking=True)
                target = target.cuda(non_blocking=True)
                ...
                output = model(images)
                loss = criterion(output, target)
                ...
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

    if __name__ == '__main__':
        # myargs stands for whatever arguments your own script defines
        mp.spawn(main_worker, nprocs=4, args=(4, myargs))

To run it, just execute the script with python:

    python main.py

For the complete training code on ImageNet, see the corresponding GitHub link in the Take-Away list.
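The remark in this section about missing environment variables can be illustrated by comparing the two configuration styles side by side. The sketch below is pure Python and only builds dictionaries. The variable names MASTER_ADDR, MASTER_PORT, WORLD_SIZE, RANK, and LOCAL_RANK, the default port 29500, and the rank formula reflect my understanding of what the launcher exports; treat them as assumptions to check against your PyTorch version. `launcher_env` and `manual_init_kwargs` are hypothetical helpers.

```python
def launcher_env(nproc_per_node, local_rank, nnodes=1, node_rank=0,
                 master_addr="127.0.0.1", master_port=29500):
    """Environment a launcher would export for one worker process (assumed names)."""
    world_size = nproc_per_node * nnodes
    # Global rank: offset by the number of processes on earlier nodes.
    rank = nproc_per_node * node_rank + local_rank
    return {
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
        "WORLD_SIZE": str(world_size),
        "RANK": str(rank),
        "LOCAL_RANK": str(local_rank),
    }

def manual_init_kwargs(nprocs, proc, port=23456):
    """Keyword arguments to pass explicitly when no launcher sets the environment."""
    return {
        "backend": "nccl",
        "init_method": "tcp://127.0.0.1:{}".format(port),
        "world_size": nprocs,
        "rank": proc,
    }

print(launcher_env(nproc_per_node=4, local_rank=2))
print(manual_init_kwargs(nprocs=4, proc=2))
```

With the launcher, init_process_group(backend='nccl') can read the rendezvous information from the environment; with mp.spawn, the same information has to be passed in explicitly, as in the manual keyword arguments above.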

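Further acceleration with Apex

Apex is NVIDIA's open-source library for mixed-precision training and distributed training. Apex wraps the mixed-precision training process, so changing two or three lines of configuration is enough to train in mixed precision, which substantially reduces GPU memory usage and saves compute time. In addition, Apex provides a wrapper for distributed training that is optimized for NVIDIA's NCCL communication library.

For mixed-precision training, the Apex wrapper is quite elegant. Simply wrap the model and optimizer with amp.initialize, and apex automatically manages the precision of the model parameters and the optimizer; other configuration parameters can be passed depending on the precision requirements.

    from apex import amp

    model, optimizer = amp.initialize(model, optimizer)

For the distributed-training wrapper, Apex changes little at the glue layer and mainly optimizes NCCL communication, so most of the code stays the same as with torch.distributed. To use it, simply replace torch.nn.parallel.DistributedDataParallel with apex.parallel.DistributedDataParallel when wrapping the model. At the API level, compared with torch.distributed, it manages some parameters automatically (so fewer need to be passed):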

    from apex.parallel import DistributedDataParallel

    model = DistributedDataParallel(model)
    # With torch.distributed this would be one of:
    # model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank])
    # model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank], output_device=args.local_rank)

When back-propagating the loss, Apex requires wrapping it with amp.scale_loss, which automatically scales the loss so that the gradients stay representable in reduced precision:

    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()

To summarize, the Apex part of parallel training mainly involves the following code:

    # main.py
    import argparse
    import torch
    import torch.distributed as dist
    import torch.optim as optim

    from apex import amp
    from apex.parallel import DistributedDataParallel

    parser = argparse.ArgumentParser()
    parser.add_argument('--local_rank', default=-1, type=int,
                        help='node rank for distributed training')
    args = parser.parse_args()

    dist.init_process_group(backend='nccl')
    torch.cuda.set_device(args.local_rank)

    train_dataset = ...
    train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)

    train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=..., sampler=train_sampler)

    model = ...
    model.cuda()
    # The optimizer must exist before amp.initialize wraps it
    optimizer = optim.SGD(model.parameters())

    model, optimizer = amp.initialize(model, optimizer)
    model = DistributedDataParallel(model)

    for epoch in range(100):
        for batch_idx, (images, target) in enumerate(train_loader):
            images = images.cuda(non_blocking=True)
            target = target.cuda(non_blocking=True)
            ...
            output = model(images)
            loss = criterion(output, target)
            optimizer.zero_grad()
            with amp.scale_loss(loss, optimizer) as scaled_loss:
                scaled_loss.backward()
            optimizer.step()

To run it, start the script with the torch.distributed.launch launcher:

    CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 main.py

For the complete training code on ImageNet, see the corresponding GitHub link in the Take-Away list.
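Why the loss is scaled before the backward pass can be seen with a little half-precision arithmetic. The sketch below uses only the standard library (`struct` format "e" is IEEE half precision) rather than Apex itself; the gradient value and the scale factor 65536 are arbitrary choices for illustration, and `to_fp16` is a hypothetical helper.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

grad = 1e-8        # a small gradient, representable in fp32 but not in fp16
scale = 65536.0    # illustrative loss-scale factor

# Stored directly in fp16, the gradient underflows to zero.
print(to_fp16(grad))  # 0.0

# Scaling the loss scales every gradient by the same factor, so the value
# survives the fp16 round trip; dividing afterwards recovers it.
scaled = to_fp16(grad * scale)
recovered = scaled / scale
print(recovered)
```

The recovered value is only approximately equal to the original because fp16 keeps about three significant decimal digits, but it is no longer lost entirely.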

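An elegant implementation with Horovod

Horovod is Uber's open-source deep learning tool. Its development drew on the strengths of Facebook's "Training ImageNet In 1 Hour" and Baidu's "Ring Allreduce", and it integrates painlessly with deep learning frameworks such as PyTorch and TensorFlow for parallel training.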
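For readers unfamiliar with the ring all-reduce idea mentioned above, here is a small pure-Python simulation. It is a textbook-style sketch, not Horovod's implementation: each worker's data is split into as many chunks as there are workers (one number per chunk here), and `ring_allreduce` is a hypothetical function name.

```python
def ring_allreduce(data):
    """data[i][c] is chunk c on worker i; returns buffers where every worker holds the sums."""
    n = len(data)
    buf = [list(chunks) for chunks in data]

    # Phase 1 (scatter-reduce): in each step, worker i sends one chunk to its
    # right-hand neighbour, which adds it to its own copy of that chunk.
    for step in range(n - 1):
        sends = [((i - step) % n, buf[i][(i - step) % n]) for i in range(n)]
        for i, (chunk, value) in enumerate(sends):
            buf[(i + 1) % n][chunk] += value

    # Phase 2 (all-gather): each worker now owns one fully reduced chunk and
    # passes completed chunks around the ring, overwriting stale copies.
    for step in range(n - 1):
        sends = [((i + 1 - step) % n, buf[i][(i + 1 - step) % n]) for i in range(n)]
        for i, (chunk, value) in enumerate(sends):
            buf[(i + 1) % n][chunk] = value

    return buf

workers = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
print(ring_allreduce(workers))
# [[111, 222, 333], [111, 222, 333], [111, 222, 333]]
```

Each worker only ever talks to its neighbour and sends one chunk per step, so the amount of data each worker transfers stays roughly constant as more workers are added.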

At the API level, Horovod and torch.distributed are very similar. On top of mpirun, Horovod provides its own horovodrun wrapper as the launcher. As with torch.distributed.launch, we only need to write one script, and the horovodrun launcher automatically assigns it to n processes running on n GPUs respectively. During execution, the launcher injects the index of the current process (which is in fact the GPU index) into hvd, and we can obtain it like this:

    import horovod.torch as hvd

    hvd.local_rank()

Similar to init_process_group, Horovod uses init to set the backend and port used for communication between GPUs:

    hvd.init()

Next, use DistributedSampler to partition the dataset. It splits the data into several partitions, and the current process only needs to fetch the partition that corresponds to its rank for training:

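    # Horovod does not initialize torch.distributed, so pass the size and rank explicitly
    train_sampler = torch.utils.data.distributed.DistributedSampler(
        train_dataset, num_replicas=hvd.size(), rank=hvd.rank())

    train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=..., sampler=train_sampler)

After that, use broadcast_parameters on the model parameters to copy them from the GPU numbered root_rank to all other GPUs:

    hvd.broadcast_parameters(model.state_dict(), root_rank=0)

Then wrap the optimizer with DistributedOptimizer, which performs an all-reduce on the gradients computed on different GPUs (that is, it aggregates the gradients from all GPUs and synchronizes the result). After the all-reduce, the gradient on every GPU equals the mean of the per-GPU gradients from before the all-reduce:

    optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters(), compression=hvd.Compression.fp16)

Finally, load the data onto the current GPU. When writing the code, we only need to care about running the forward and backward passes as usual: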

    torch.cuda.set_device(hvd.local_rank())

    for epoch in range(100):
        for batch_idx, (images, target) in enumerate(train_loader):
            images = images.cuda(non_blocking=True)
            target = target.cuda(non_blocking=True)
            ...
            output = model(images)
            loss = criterion(output, target)
            ...
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

To summarize, the Horovod part of parallel training mainly involves the following code:

    # main.py
    import torch
    import torch.optim as optim
    import horovod.torch as hvd

    hvd.init()
    torch.cuda.set_device(hvd.local_rank())

    train_dataset = ...
    train_sampler = torch.utils.data.distributed.DistributedSampler(
        train_dataset, num_replicas=hvd.size(), rank=hvd.rank())

    train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=..., sampler=train_sampler)

    model = ...
    model.cuda()

    optimizer = optim.SGD(model.parameters())

    optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)

    for epoch in range(100):
        for batch_idx, (images, target) in enumerate(train_loader):
            images = images.cuda(non_blocking=True)
            target = target.cuda(non_blocking=True)
            ...
            output = model(images)
            loss = criterion(output, target)
            ...
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

To run it, start the script with the horovodrun launcher:

    CUDA_VISIBLE_DEVICES=0,1,2,3 horovodrun -np 4 -H localhost:4 --verbose python main.py

For the complete training code on ImageNet, see the corresponding GitHub link in the Take-Away list.
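The statement in this section that every GPU ends up with the mean of the per-GPU gradients can be checked on a toy problem. The sketch below is pure Python with exact fractions; the one-parameter model y = w * x, the data, and the learning rate are made up for illustration, and `grad_mse` is a hypothetical helper. It assumes every worker receives a shard of the same size.

```python
from fractions import Fraction

def grad_mse(w, samples):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in samples) / Fraction(len(samples))

samples = [(1, 2), (2, 3), (3, 7), (4, 9), (5, 9), (6, 13), (7, 15), (8, 15)]
world_size = 4
w = Fraction(1, 2)  # identical on every worker after the initial broadcast

# Each worker computes a gradient on its own shard only.
shards = [samples[rank::world_size] for rank in range(world_size)]
local_grads = [grad_mse(w, shard) for shard in shards]

# The all-reduce leaves every worker with the mean of the local gradients.
averaged = sum(local_grads) / world_size
print(averaged == grad_mse(w, samples))  # True

# Every worker applies the same update, so the replicas stay identical.
lr = Fraction(1, 100)
new_weights = [w - lr * averaged for _ in range(world_size)]
print(len(set(new_weights)) == 1)  # True
```

With equal shard sizes the averaged gradient matches the gradient over the full batch, which is why the data-parallel run behaves like single-GPU training with a batch that is world_size times larger.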

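Endnotes

Configuration of the V100-PCIe machine (first 4 GPUs) used in this article:

Figure 2: Configuration details

Configuration of the V100 machine (first 4 GPUs) used in this article:

Figure 3: Configuration details

Configuration of the K80 machine (first 4 GPUs) used in this article:

Figure 4: Configuration details

I am a CV graduate student myself; I looked into this on a whim while slacking off today and will improve it gradually~ Readers in industry probably have their own best practices, so feel free to open a PR or leave a comment~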
