Author: rge4688618 | Source: Internet | 2023-08-13 14:51
Suppose the training command you used before was python train.py --device gpu --save_dir ./checkpoints
Add -m paddle.distributed.launch
and you get distributed training: python -m paddle.distributed.launch train.py --device gpu --save_dir ./checkpoints
Then it failed with: error code is libnccl.so: cannot open shared object file: No such file or directory
According to the message, NCCL is missing, and the message also gives a download address: https://developer.nvidia.com/nccl/nccl-download
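Before installing anything, it can help to confirm that the dynamic linker really cannot find the library. The following is a minimal sketch, assuming a Linux system where ldconfig is available; the helper name has_nccl is made up for illustration.

```shell
# Return success if the dynamic linker cache lists a libnccl shared object.
# has_nccl is a hypothetical helper name, not part of Paddle or NCCL.
has_nccl() {
    ldconfig -p 2>/dev/null | grep -q 'libnccl\.so'
}

if has_nccl; then
    echo "libnccl found"
else
    echo "libnccl missing"
fi
```

If this prints "libnccl missing", the error above is expected and NCCL needs to be installed.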
You have to register before you can download it, so I am recording the steps here:
Network Installer for Ubuntu18.04
$ wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
$ sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
$ sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
$ sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/ /"
$ sudo apt-get update
Network Installer for Ubuntu16.04
$ wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-ubuntu1604.pin
$ sudo mv cuda-ubuntu1604.pin /etc/apt/preferences.d/cuda-repository-pin-600
$ sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/7fa2af80.pub
$ sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/ /"
$ sudo apt-get update
Then run the following command to install NCCL (pick the build whose +cuda suffix matches your installed CUDA version):
For Ubuntu:
$ sudo apt install libnccl2=2.11.4-1+cuda10.2 libnccl-dev=2.11.4-1+cuda10.2
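The pinned version 2.11.4-1+cuda10.2 is just what matched my machine. As a rough sketch, assuming the NVIDIA repository above has already been added, the versions available to apt can be listed like this; list_nccl_versions is a hypothetical helper name.

```shell
# List the libnccl2 / libnccl-dev versions that apt can currently see.
# list_nccl_versions is a hypothetical helper name used only for illustration.
list_nccl_versions() {
    if command -v apt-cache >/dev/null 2>&1; then
        apt-cache madison libnccl2 libnccl-dev
    else
        echo "apt-cache not available on this system" >&2
        return 1
    fi
}

# Usage: list_nccl_versions
```

Pick an entry whose CUDA suffix matches your toolkit and substitute it into the apt install command.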
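Haha, running it again worked.
You can see that 4 cards are being used at the same time.
To avoid disturbing other jobs that are already running, I recommend first running export CUDA_VISIBLE_DEVICES=2,3
to restrict which cards are visible.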
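If you would rather not change the variable for the whole shell session, it can be set for a single command instead. A minimal sketch follows; with_gpus is a hypothetical wrapper name, and the card IDs 2,3 are the ones from this post.

```shell
# Run one command with only the listed GPU IDs visible to it.
# with_gpus is a hypothetical helper; the variable is set for that command only.
with_gpus() {
    ids=$1
    shift
    CUDA_VISIBLE_DEVICES="$ids" "$@"
}

# Usage (same training command as above, limited to cards 2 and 3):
#   with_gpus 2,3 python -m paddle.distributed.launch train.py --device gpu --save_dir ./checkpoints
```

Inside the process the two visible cards are renumbered as devices 0 and 1.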
You can also check how each card is doing: a log folder is generated in the current directory.
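A small sketch for looking through that folder. It assumes the launcher writes one file per worker named workerlog.0, workerlog.1 and so on, which is what I have seen; the file naming is an assumption that may differ between Paddle versions, and show_worker_logs is a hypothetical helper name.

```shell
# Print the last lines of every per-worker log in a directory (default: ./log).
# Assumes files are named workerlog.<N>; adjust the pattern if yours differ.
show_worker_logs() {
    log_dir="${1:-log}"
    for f in "$log_dir"/workerlog.*; do
        [ -e "$f" ] || continue   # skip when the pattern matched nothing
        echo "== $f =="
        tail -n 20 "$f"
    done
}

# Usage: show_worker_logs        (reads ./log)
#        show_worker_logs ./log  (explicit directory)
```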