Introduction
CUDA version compatibility issues are a perennial headache when developing deep learning projects with PyTorch. This article walks through a real troubleshooting session in which the CUDA runtime libraries bundled in a PyTorch virtual environment conflicted with the system's global CUDA installation. We reproduce the problem, analyze it step by step, locate the root cause, and finally present a solution.
Reproducing the problem: ImportError: undefined symbol
Everything starts with a seemingly simple import torch statement:
```
(modelforger) wangh@ubuntu:~/codes/ModelForger/.venv/lib/python3.12/site-packages/nvidia/cusparse/lib$ python
Python 3.12.9 (main, Feb 12 2025, 14:50:50) [Clang 19.1.6 ] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/wangh/codes/ModelForger/.venv/lib/python3.12/site-packages/torch/__init__.py", line 367, in <module>
    from torch._C import *  # noqa: F403
    ^^^^^^^^^^^^^^^^^^^^^^
ImportError: /home/wangh/codes/ModelForger/.venv/lib/python3.12/site-packages/torch/lib/../../nvidia/cusparse/lib/libcusparse.so.12: undefined symbol: __nvJitLinkComplete_12_4, version libnvJitLink.so.12
```
The error message is clear: libcusparse.so.12 cannot find the symbol __nvJitLinkComplete_12_4 in libnvJitLink.so.12, which usually indicates a version mismatch between the two libraries.
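Before digging further, it is worth confirming exactly what the venv's copy of libcusparse expects. Below is a minimal check with readelf; the file and symbol names are taken from the error message above, and the exact output format will vary:

```bash
cd ~/codes/ModelForger/.venv/lib/python3.12/site-packages/nvidia/cusparse/lib

# List the dynamic symbols of libcusparse.so.12 and filter for the one in the error.
# An "UND" entry means the symbol is *required* from another library, here tagged
# with the version string libnvJitLink.so.12.
readelf --dyn-syms libcusparse.so.12 | grep nvJitLinkComplete
```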
Preliminary Troubleshooting: Environment & CUDA Version
First, let's check the environment and the CUDA versions involved:
```
(modelforger) wangh@ubuntu:~/codes/ModelForger/.venv/lib/python3.12/site-packages/nvidia/cusparse/lib$ echo $LD_LIBRARY_PATH
/usr/local/cuda/lib64:
(modelforger) wangh@ubuntu:~/codes/ModelForger/.venv/lib/python3.12/site-packages/nvidia/cusparse/lib$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Nov_22_10:17:15_PST_2023
Cuda compilation tools, release 12.3, V12.3.107
Build cuda_12.3.r12.3/compiler.33567101_0
(modelforger) wangh@ubuntu:~/codes/ModelForger/.venv/lib/python3.12/site-packages/nvidia/cusparse/lib$ uv pip list | grep nvidia
Using Python 3.12.9 environment at: /home/wangh/codes/ModelForger/.venv
nvidia-cublas-cu12        12.4.5.8
nvidia-cuda-cupti-cu12    12.4.127
nvidia-cuda-nvrtc-cu12    12.4.127
nvidia-cuda-runtime-cu12  12.4.127
nvidia-cudnn-cu12         9.1.0.70
nvidia-cufft-cu12         11.2.1.3
nvidia-curand-cu12        10.3.5.147
nvidia-cusolver-cu12      11.6.1.9
nvidia-cusparse-cu12      12.3.1.170
nvidia-nccl-cu12          2.21.5
nvidia-nvjitlink-cu12     12.4.127
nvidia-nvtx-cu12          12.4.127
```
Two key pieces of information emerge:

- nvcc --version shows that the CUDA toolkit installed system-wide is 12.3.
- The nvidia-* packages in the virtual environment are at 12.4.x; in particular, nvidia-nvjitlink-cu12 is version 12.4.127.
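As a quick sanity check (paths assumed from the session above), the two copies of the nvJitLink library can be listed side by side:

```bash
# System copy, installed together with the CUDA 12.3 toolkit
ls -l /usr/local/cuda/lib64/libnvJitLink.so*

# Copy shipped by the pip package nvidia-nvjitlink-cu12 (12.4.127) inside the venv
ls -l ~/codes/ModelForger/.venv/lib/python3.12/site-packages/nvidia/nvjitlink/lib/
```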
According to the error message, the libcusparse.so.12 shipped in the PyTorch virtual environment needs the __nvJitLinkComplete_12_4 symbol from a CUDA 12.4 libnvJitLink.so.12. Since the pip-installed dependency packages are mutually consistent in version, the suspicion is that libcusparse.so.12 is instead being resolved against the system's CUDA 12.3 copy of libnvJitLink.so.12, which does not provide that symbol.
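This hypothesis is easy to test directly: check whether each copy of libnvJitLink.so.12 actually exports the symbol. A minimal sketch, assuming the file names reconstructed above:

```bash
# The venv's 12.4 copy should list the symbol as defined (type "T").
nm -D ~/codes/ModelForger/.venv/lib/python3.12/site-packages/nvidia/nvjitlink/lib/libnvJitLink.so.12 \
  | grep __nvJitLinkComplete_12_4

# The system's 12.3 copy is expected to print nothing: the symbol does not exist
# before CUDA 12.4, which is exactly what the ImportError complains about.
nm -D /usr/local/cuda/lib64/libnvJitLink.so.12 | grep __nvJitLinkComplete_12_4
```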
Analysis: Dynamic link library loading path
To verify this conjecture, we use the patchelf and ldd commands to inspect how libcusparse.so.12 is dynamically linked:
```
(modelforger) wangh@ubuntu:~/codes/ModelForger/.venv/lib/python3.12/site-packages/nvidia/cusparse/lib$ patchelf --print-rpath libcusparse.so.12
$ORIGIN:$ORIGIN/../../nvjitlink/lib
(modelforger) wangh@ubuntu:~/codes/ModelForger/.venv/lib/python3.12/site-packages/nvidia/cusparse/lib$ ldd libcusparse.so.12
        linux-vdso.so.1 (0x00007ffc507e2000)
        libnvJitLink.so.12 => /usr/local/cuda/lib64/libnvJitLink.so.12 (0x00007f867a399000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f867a353000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f867a349000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f867a343000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f867a1f4000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f867a1d7000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f8679fe5000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f868e62d000)
```
Sure enough: even though the rpath points at $ORIGIN/../../nvjitlink/lib inside the virtual environment, the libnvJitLink.so.12 that libcusparse.so.12 depends on is resolved from the system CUDA directory (/usr/local/cuda/lib64) instead of from the PyTorch virtual environment.
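Why does LD_LIBRARY_PATH beat the $ORIGIN entry baked into the wheel? With glibc, the search order is roughly DT_RPATH, then LD_LIBRARY_PATH, then DT_RUNPATH; modern wheels typically record a RUNPATH, which LD_LIBRARY_PATH can therefore override. A quick way to check which one this library actually carries (a sketch, run from the same directory):

```bash
# Show the dynamic section entries and look for RPATH vs RUNPATH.
# If the $ORIGIN/../../nvjitlink/lib entry is a RUNPATH, every directory listed in
# LD_LIBRARY_PATH is searched before it, which explains the ldd result above.
readelf -d libcusparse.so.12 | grep -E 'RPATH|RUNPATH'
```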
The root of the problem: LD_LIBRARY_PATH priority
```
(modelforger) wangh@ubuntu:~/codes/ModelForger/.venv/lib/python3.12/site-packages/nvidia/cusparse/lib$ echo $LD_LIBRARY_PATH
/usr/local/cuda/lib64:
```
At this point the root cause is clear: the LD_LIBRARY_PATH environment variable makes the dynamic loader search the system CUDA library path before the CUDA libraries bundled in the PyTorch virtual environment. The older system copy of libnvJitLink.so.12 is picked up first, the versions no longer match, and PyTorch cannot find the symbol it needs.
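To watch this precedence in action, glibc's loader can log its search decisions. A hedged sketch (the output is verbose and varies with glibc version, so it is filtered here):

```bash
# Trace every library lookup performed while importing torch and keep only the
# nvJitLink-related lines. With /usr/local/cuda/lib64 still in LD_LIBRARY_PATH,
# a "trying file=/usr/local/cuda/lib64/..." candidate appears before the venv path.
LD_DEBUG=libs python -c "import torch" 2>&1 | grep -i nvjitlink | head -n 20
```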
Solution: unset LD_LIBRARY_PATH
The most direct fix is to remove the system CUDA path from LD_LIBRARY_PATH:
```
(modelforger) wangh@ubuntu:~/codes/ModelForger/.venv/lib/python3.12/site-packages/nvidia/cusparse/lib$ unset LD_LIBRARY_PATH
```
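If other entries in LD_LIBRARY_PATH are still needed, a gentler alternative (a sketch, not part of the original session) is to strip only the CUDA path instead of unsetting the whole variable:

```bash
# Drop every /usr/local/cuda* entry from LD_LIBRARY_PATH and keep the rest.
export LD_LIBRARY_PATH=$(echo "$LD_LIBRARY_PATH" | tr ':' '\n' \
  | grep -v '^/usr/local/cuda' | paste -sd: -)
```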
Then check the dynamic links of libcusparse.so.12 again:
```
(modelforger) wangh@ubuntu:~/codes/ModelForger/.venv/lib/python3.12/site-packages/nvidia/cusparse/lib$ ldd libcusparse.so.12
        linux-vdso.so.1 (0x00007fff959a7000)
        libnvJitLink.so.12 => /home/wangh/codes/ModelForger/.venv/lib/python3.12/site-packages/nvidia/cusparse/lib/./../../nvjitlink/lib/libnvJitLink.so.12 (0x00007f303000e000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f302ffc8000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f302ffbe000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f302ffb8000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f302fe69000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f302fe4c000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f302fc5a000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f30443f9000)
```
Now libnvJitLink.so.12 is correctly resolved from the nvjitlink directory inside the PyTorch virtual environment.
Verification: the problem is solved
```
(modelforger) wangh@ubuntu:~/codes/ModelForger/.venv/lib/python3.12/site-packages/nvidia/cusparse/lib$ python
Python 3.12.9 (main, Feb 12 2025, 14:50:50) [Clang 19.1.6 ] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
```
import torch now succeeds, and the problem is solved.
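As a final sanity check (not part of the original session), it is worth confirming that the import really picks up the CUDA 12.4 runtime and can see the GPU:

```bash
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
# Expected: a CUDA 12.4 build, with True printed when a GPU and a recent enough driver are present.
```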
Best Practices and Summary
- Avoid setting LD_LIBRARY_PATH globally: exporting it from global shell configuration such as .bashrc or .bash_profile undermines the isolation of virtual environments, because it overrides the library paths that pip-installed packages ship with. If the system CUDA toolkit is only needed occasionally (for example to compile with nvcc), prefer enabling it per shell session, as in the sketch after this list.
- Understand the dynamic linking mechanism: knowing how LD_LIBRARY_PATH interacts with RPATH/RUNPATH and the loader's search order makes it much easier to locate and resolve similar problems quickly.
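One way to follow the first point while keeping the system toolkit available for nvcc builds is to opt into it per shell session instead of exporting it globally. A minimal sketch (the helper name is made up for illustration):

```bash
# Put this in your shell configuration instead of a global export.
# Virtual environments stay untouched unless you explicitly call use-system-cuda.
use-system-cuda() {
    export PATH=/usr/local/cuda/bin:$PATH
    export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
    echo "System CUDA enabled for this shell: $(nvcc --version | grep release)"
}
```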
That concludes this detailed walkthrough of a CUDA version conflict in a PyTorch environment. For more on PyTorch CUDA version conflicts, please see my other related articles!