Installing the CUDA Toolkit

Download

$ wget https://developer.download.nvidia.com/compute/cuda/11.5.1/local_installers/cuda_11.5.1_495.29.05_linux.run

Install

$ sudo sh cuda_11.5.1_495.29.05_linux.run
===========
= Summary =
===========

Driver:   Installed
Toolkit:  Installed in /usr/local/cuda-11.5/
Samples:  Installed in /home/lnsoft/, but missing recommended libraries

Please make sure that
 -   PATH includes /usr/local/cuda-11.5/bin
 -   LD_LIBRARY_PATH includes /usr/local/cuda-11.5/lib64, or, add /usr/local/cuda-11.5/lib64 to /etc/ld.so.conf and run ldconfig as root

To uninstall the CUDA Toolkit, run cuda-uninstaller in /usr/local/cuda-11.5/bin
To uninstall the NVIDIA Driver, run nvidia-uninstall
Logfile is /var/log/cuda-installer.log
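
The installer summary asks for two environment changes. A minimal sketch of them follows, assuming the install prefix reported above (/usr/local/cuda-11.5). The variable name CUDA_HOME is a convenience chosen for this example, not something the installer sets.

```shell
# Assumption: install prefix taken from the installer summary above.
CUDA_HOME=/usr/local/cuda-11.5

# Prepend the toolkit directories, keeping any entries that are already set.
export PATH="$CUDA_HOME/bin${PATH:+:$PATH}"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

# Confirm the compiler is now found, if the toolkit is actually installed here.
if command -v nvcc >/dev/null 2>&1; then
    nvcc --version
else
    echo "nvcc not found under $CUDA_HOME/bin" >&2
fi
```

To make the change persistent, the two export lines can be appended to a shell startup file such as ~/.bashrc.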

Check GPU information

$ nvidia-smi
Tue Feb  8 09:12:28 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05    Driver Version: 495.29.05    CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:43:00.0 Off |                    0 |
| N/A   35C    P8    10W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            Off  | 00000000:47:00.0 Off |                    0 |
| N/A   36C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla T4            Off  | 00000000:8E:00.0 Off |                    0 |
| N/A   34C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla T4            Off  | 00000000:92:00.0 Off |                    0 |
| N/A   32C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
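
The table above is meant for reading, not parsing. For scripts, nvidia-smi can emit selected fields as CSV instead. The sketch below wraps that in a hypothetical helper, list_gpus; the field list is just one possible selection.

```shell
# Print one CSV line per GPU: index, name, temperature, memory used, memory total.
list_gpus() {
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found; is the driver installed?" >&2
        return 1
    fi
    nvidia-smi --query-gpu=index,name,temperature.gpu,memory.used,memory.total \
               --format=csv,noheader
}

# Do not abort the calling script when no driver is present.
list_gpus || true
```

Run nvidia-smi --help-query-gpu to see every field name that the installed driver supports.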

Check the NVIDIA driver version

$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  495.29.05  Thu Sep 30 16:00:29 UTC 2021
GCC version:  gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)
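
When only the version number is needed, for example to compare it with the installer file name, it can be cut out of the NVRM line. This is a sketch that assumes the line format shown above; loaded_driver_version is a hypothetical helper name.

```shell
# Extract the driver version from the "NVRM version" line.
# $1 = file to read (defaults to the proc file shown above)
loaded_driver_version() {
    sed -n 's/^NVRM version:.*Kernel Module  *\([0-9][0-9.]*\).*/\1/p' \
        "${1:-/proc/driver/nvidia/version}"
}

# Only query the proc file when the kernel module is actually loaded.
if [ -r /proc/driver/nvidia/version ]; then
    loaded_driver_version
fi
```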

Uninstalling the driver

Find the processes using the NVIDIA GPUs

$ sudo lsof /dev/nvidia*
COMMAND    PID USER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
kubelet 744987 root   31u   CHR 195,255      0t0  565 /dev/nvidiactl
kubelet 744987 root   42u   CHR   195,0      0t0  569 /dev/nvidia0
kubelet 744987 root   45u   CHR   195,1      0t0  572 /dev/nvidia1
kubelet 744987 root   46u   CHR   195,2      0t0  579 /dev/nvidia2
kubelet 744987 root   47u   CHR   195,3      0t0  584 /dev/nvidia3
kubelet 744987 root   48u   CHR   195,0      0t0  569 /dev/nvidia0
kubelet 744987 root   49u   CHR   195,0      0t0  569 /dev/nvidia0
kubelet 744987 root   50u   CHR   195,1      0t0  572 /dev/nvidia1
kubelet 744987 root   51u   CHR   195,1      0t0  572 /dev/nvidia1
kubelet 744987 root   52u   CHR   195,2      0t0  579 /dev/nvidia2
kubelet 744987 root   53u   CHR   195,2      0t0  579 /dev/nvidia2
kubelet 744987 root   54u   CHR   195,3      0t0  584 /dev/nvidia3
kubelet 744987 root   55u   CHR   195,3      0t0  584 /dev/nvidia3
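
lsof prints one row per open file descriptor, so a single process can appear many times, as kubelet does above. A small filter can reduce the listing to one line per process. This is a sketch; gpu_holders is a hypothetical helper name.

```shell
# Reduce lsof output (read from stdin) to one "PID COMMAND" line per process.
gpu_holders() {
    # Skip the header row, keep the PID and COMMAND columns, drop duplicates.
    awk 'NR > 1 { print $2, $1 }' | sort -u
}

# Usage:
#   sudo lsof /dev/nvidia* | gpu_holders
```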

Kill the process

$ sudo kill -9 744987
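
kill -9 gives the process no chance to clean up. A gentler variant sends SIGTERM first and escalates to SIGKILL only after a grace period. The sketch below assumes the shell already has enough privileges to signal the target (for example a root shell); stop_process is a hypothetical helper name. If the process is managed by a service manager, it may be restarted as soon as it exits, in which case stopping the service is the more reliable route.

```shell
# Ask a process to exit, then force it only if it is still alive afterwards.
# $1 = PID, $2 = grace period in seconds (default 10)
stop_process() {
    pid=$1
    grace=${2:-10}
    kill -TERM "$pid" 2>/dev/null || return 0     # already gone
    while [ "$grace" -gt 0 ]; do
        kill -0 "$pid" 2>/dev/null || return 0    # exited after SIGTERM
        sleep 1
        grace=$((grace - 1))
    done
    kill -KILL "$pid" 2>/dev/null || true         # last resort
}

# Usage (PID taken from the lsof listing):
#   stop_process 744987
```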

Unload the NVIDIA driver module

$ sudo rmmod nvidia

FAQ

$ nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
$ sudo rmmod nvidia
rmmod: ERROR: Module nvidia is in use
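
The "Module nvidia is in use" error usually means that a process still holds /dev/nvidia* open, or that other kernel modules depending on nvidia are still loaded. The sketch below handles the second case by listing the dependent modules in an order suitable for unloading. The module set is an assumption based on common driver builds, so check lsmod | grep nvidia on your machine first; loaded_nvidia_modules is a hypothetical helper name.

```shell
# Modules that commonly depend on "nvidia", in the order they should be unloaded.
# Assumption: this set matches your driver build.
NVIDIA_MODULES="nvidia_uvm nvidia_drm nvidia_modeset nvidia"

# Print the modules from NVIDIA_MODULES that appear in lsmod output on stdin.
loaded_nvidia_modules() {
    loaded=$(awk 'NR > 1 { print $1 }')           # skip the lsmod header row
    for m in $NVIDIA_MODULES; do
        if printf '%s\n' "$loaded" | grep -qx "$m"; then
            echo "$m"
        fi
    done
}

# Usage (as root), once no process holds /dev/nvidia* open:
#   for m in $(lsmod | loaded_nvidia_modules); do rmmod "$m"; done
```

After the old modules are unloaded, running nvidia-smi again normally loads the newly installed module. If the modules cannot be unloaded, a reboot has the same effect.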

References