
Installing torch from a replacement conda source and debugging distributed training in VS Code

Installing torch from a replacement conda source (Linux + Anaconda)

1. Locate the .condarc file (by default it lives at ~/.condarc in your home directory).

2. Replace its contents with the following:

channels:
  - /anaconda/cloud/pytorch/
  - defaults
show_channel_urls: true
default_channels:
  - /anaconda/pkgs/main
  - /anaconda/pkgs/r
  - /anaconda/pkgs/msys2
custom_channels:
  conda-forge: /anaconda/cloud
  msys2: /anaconda/cloud
  bioconda: /anaconda/cloud
  menpo: /anaconda/cloud
  pytorch: /anaconda/cloud
  simpleitk: /anaconda/cloud

3. Run conda info to check which channels are currently in use.

4. Run nvidia-smi to check the highest CUDA version your driver supports.

5. Run conda search pytorch to see which pytorch versions are currently available for conda installation. Check not only the version number but also the build string listed after it, since that is what distinguishes the CUDA and CPU variants.

6. Install the GPU build of pytorch: conda install pytorch=1.12.1=gpu_cuda113py38h19ae3d8_1

7. The command above can be run directly inside a virtual environment, and there is no need to install CUDA and cuDNN separately, but torchvision still has to be installed. As before, run conda search torchvision, find the build string that matches your CUDA version (cu113 here), and install it: conda install torchvision=0.13.1=py38_cu113

8. That completes the installation; of the methods I have tried, this is the simplest. A quick verification sketch is shown after this list.

9. In the past I only paid attention to the version number when installing (e.g. 1.12.1), but one version number can correspond to many builds, and installing without specifying one often pulls a build that does not match. That is why the build string is pinned explicitly above.
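Once torch and torchvision are installed, a quick check like the following confirms that the GPU build was actually picked up. This is only a minimal sketch; the printed values simply reflect whichever builds you installed:

import torch
import torchvision

# versions should match the builds chosen above (1.12.1 / 0.13.1, CUDA 11.3)
print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("built with CUDA:", torch.version.cuda)

# True means the GPU build is installed and the driver is visible
print("GPU available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))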

Distributed training and debugging in VS Code

Training on a single card is easy to debug, but what if you are running on multiple cards? It is actually simple: modify the debug configuration in .vscode/launch.json as follows:

{
    // Use IntelliSense to learn about possible attributes.
    // Hover to view descriptions of existing attributes.
    // For more information, visit: /fwlink/?linkid=830387
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Python: Current File",
            "type": "python",
            "request": "launch",
            "program": "/home/{Your own username}/.conda/envs/{Virtual environment name}/lib/python3.7/site-packages/torch/distributed/",
            "console": "integratedTerminal",
            "args": [
                "--nproc_per_node=1",
                "",
            ],
            "env": {"CUDA_VISIBLE_DEVICES":"0"},
        }
    ]
}

In short, you point "program" at the torch.distributed launcher inside your current virtual environment, and pass the main file you would normally execute as the first entry in "args".
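To make this concrete, below is a minimal sketch of the kind of training script that could sit behind that "args" entry. The file name, model, and data are illustrative assumptions, not from the original article; it assumes the NCCL backend and torch 1.x's launch.py, which passes --local_rank to the script and exports the rendezvous environment variables used by init_process_group:

# train.py (hypothetical) - minimal DDP entry point for debugging under launch.py
import argparse

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    parser = argparse.ArgumentParser()
    # launch.py passes --local_rank to the script by default
    parser.add_argument("--local_rank", type=int, default=0)
    args = parser.parse_args()

    # MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE are exported by launch.py,
    # so the default env:// initialization needs no extra arguments
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(args.local_rank)

    # toy model and batch, just enough to set breakpoints on
    model = DDP(torch.nn.Linear(10, 1).cuda(args.local_rank),
                device_ids=[args.local_rank])
    x = torch.randn(4, 10).cuda(args.local_rank)
    loss = model(x).sum()
    loss.backward()
    print(f"rank {dist.get_rank()}: loss {loss.item():.4f}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()

With "--nproc_per_node=1" and "CUDA_VISIBLE_DEVICES": "0" as in the configuration above, only a single worker process is started on GPU 0, which keeps single-stepping manageable; increase both once the script behaves as expected.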

This concludes the article on installing torch from a replacement conda source and debugging distributed training in VS Code. For more on replacing conda sources, please search my previous articles or continue browsing the related articles below. I hope you will keep supporting me!