NVIDIA Driver 安装
类别: NVIDIA-Driver 标签: NVIDIA-Driver NVIDIA CUDA GPU lsof apt dpkg libc6-dev build-essential目录
困难重重 😭
服务器是 NVIDIA Tesla T4,系统是 Ubuntu 20.04,从 Kubernetes 集群中分离出来的,因 Tabby 请求 CUDA >= 11.7,需要重新安装新版本的驱动。
下载 NVIDIA Driver
安装 NVIDIA Driver
sudo sh NVIDIA-Linux-x86_64-535.129.03.run
就两步就完成了,简单吧 😄
实际安装过程 😭
安装驱动
sudo sh NVIDIA-Linux-x86_64-535.129.03.run
日志查看错误信息
vim /var/log/nvidia-installer.log
nvidia-installer log file '/var/log/nvidia-installer.log'
creation time: Fri Jan 5 09:38:12 2024
installer version: 535.129.03
PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin
nvidia-installer command line:
./nvidia-installer
Using: nvidia-installer ncurses v6 user interface
-> Detected 64 CPUs online; setting concurrency level to 32.
ERROR: An NVIDIA kernel module 'nvidia' appears to already be loaded in your kernel. This may be because it is in use (for example, by an X server, a CUDA program, or the NVIDIA Persistence Daemon), but this may also happen if your kernel was configured without support for module unloading. Please be sure to exit any programs that may be using the GPU(s) before attempting to upgrade your driver. If no GPU-based programs are running, you know that your kernel supports module unloading, and you still receive this message, then an error may have occurred that has corrupted an NVIDIA kernel module's usage count, for which the simplest remedy is to reboot your computer.
ERROR: Installation has failed. Please see the file '/var/log/nvidia-installer.log' for details. You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.
Google 搜索、ChatGPT、之前的解决方法都不行,最后通过运行 sudo lsof -n -w /dev/nvidia*
找到了问题所在,原来是 Kubernetes 的 kubelet 进程占用了 GPU。
sudo lsof -n -w /dev/nvidia*
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
kubelet 47259 root 40u CHR 195,255 0t0 505 /dev/nvidiactl
kubelet 47259 root 44u CHR 195,0 0t0 548 /dev/nvidia0
kubelet 47259 root 47u CHR 195,1 0t0 551 /dev/nvidia1
kubelet 47259 root 48u CHR 195,2 0t0 556 /dev/nvidia2
kubelet 47259 root 49u CHR 195,3 0t0 561 /dev/nvidia3
kubelet 47259 root 50u CHR 195,0 0t0 548 /dev/nvidia0
kubelet 47259 root 51u CHR 195,0 0t0 548 /dev/nvidia0
kubelet 47259 root 52u CHR 195,1 0t0 551 /dev/nvidia1
kubelet 47259 root 53u CHR 195,1 0t0 551 /dev/nvidia1
kubelet 47259 root 54u CHR 195,2 0t0 556 /dev/nvidia2
kubelet 47259 root 55u CHR 195,2 0t0 556 /dev/nvidia2
kubelet 47259 root 56u CHR 195,3 0t0 561 /dev/nvidia3
kubelet 47259 root 57u CHR 195,3 0t0 561 /dev/nvidia3
lsof
是一个在 Unix 和类 Unix 系统(如 Linux)上的命令行工具,用于列出当前系统打开的文件。在这里,”文件” 的概念很广泛,除了常见的文件和目录,还包括网络套接字、设备、管道等。
- -n 参数告诉 lsof 不要将网络号转换为主机名,这可以加快 lsof 的运行速度。
- -w 参数告诉 lsof 不要抑制警告信息。
- /dev/nvidia* 是要查看的文件的路径,* 是通配符,表示所有以 /dev/nvidia 开头的文件。在这里,这些文件通常代表 NVIDIA 的设备。
所以,sudo lsof -n -w /dev/nvidia* 命令的作用是查看哪些进程正在使用 NVIDIA 设备。
需要停止 kubelet 服务,这里不能简单的杀掉进程,需要通过 systemctl
停止服务,否则 kubelet 会自动重启。
sudo systemctl stop kubelet
再次运行安装命令出现了新的错误,提示缺少 libc 头文件,需要安装 libc 开发包。
ERROR: You do not appear to have libc header files installed on your system. Please install your distribution's libc development package.
运行 sudo apt update
更新软件源,发现使用的是 arm64 架构的软件源,需要切换到 amd64 架构的软件源。
从其它服务器拷贝 /etc/apt/sources.list
文件到当前服务器,然后重新运行 sudo apt update
更新软件源。
sudo apt update
sudo apt-get install libc6-dev
Reading package lists... Done
Building dependency tree
Reading state information... Done
Package libc6-dev is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source
However the following packages replace it:
libcrypt-dev
E: Package 'libc6-dev' has no installation candidate
使用 Docker 下载 build-essential
开发包。先模拟安装,找到需要的开发包,然后下载。
mkdir offline
docker run --rm -it -v `pwd`/offline:/offline ubuntu:20.04 bash
root@a7ce40aba423:/# apt update
root@a7ce40aba423:/# cd offline/
root@a7ce40aba423:/offline# apt-get install -s build-essential
root@a7ce40aba423:/offline# apt-get download binutils binutils-common binutils-x86-64-linux-gnu build-essential cpp cpp-9 dirmngr dpkg-dev fakeroot g++ g++-9 gcc gcc-9 gcc-9-base gnupg gnupg-l10n gnupg-utils gpg gpg-agent gpg-wks-client gpg-wks-server gpgconf gpgsm libalgorithm-diff-perl libalgorithm-diff-xs-perl libalgorithm-merge-perl libasan5 libasn1-8-heimdal libassuan0 libatomic1 libbinutils libc-dev-bin libc6-dev libcc1-0 libcrypt-dev libctf-nobfd0 libctf0 libdpkg-perl libfakeroot libfile-fcntllock-perl libgcc-9-dev libgdbm-compat4 libgdbm6 libgomp1 libgssapi3-heimdal libhcrypto4-heimdal libheimbase1-heimdal libheimntlm0-heimdal libhx509-5-heimdal libisl22 libitm1 libkrb5-26-heimdal libksba8 libldap-2.4-2 libldap-common liblocale-gettext-perl liblsan0 libmpc3 libmpfr6 libnpth0 libperl5.30 libquadmath0 libreadline8 libroken18-heimdal libsasl2-2 libsasl2-modules libsasl2-modules-db libsqlite3-0 libssl1.1 libstdc++-9-dev libtsan0 libubsan1 libwind0-heimdal linux-libc-dev make manpages manpages-dev netbase patch perl perl-modules-5.30 pinentry-curses readline-common xz-utils
安装本地开发包。
sudo dpkg -i offline/*.deb
查看 CUDA 版本。
nvidia-smi
Mon Jan 8 10:01:16 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla T4 Off | 00000000:43:00.0 Off | 0 |
| N/A 49C P0 27W / 70W | 2MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 Tesla T4 Off | 00000000:47:00.0 Off | 0 |
| N/A 49C P0 28W / 70W | 2MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 Tesla T4 Off | 00000000:8E:00.0 Off | 0 |
| N/A 47C P0 27W / 70W | 2MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 Tesla T4 Off | 00000000:92:00.0 Off | 0 |
| N/A 47C P0 27W / 70W | 2MiB / 15360MiB | 6% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
FAQ
每张显卡都运行进程 /usr/lib/xorg/Xorg
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.146.02 Driver Version: 535.146.02 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla T4 Off | 00000000:43:00.0 Off | 0 |
| N/A 35C P8 9W / 70W | 11MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 Tesla T4 Off | 00000000:47:00.0 Off | 0 |
| N/A 36C P8 9W / 70W | 11MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 Tesla T4 Off | 00000000:8E:00.0 Off | 0 |
| N/A 34C P8 10W / 70W | 11MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 Tesla T4 Off | 00000000:92:00.0 Off | 0 |
| N/A 34C P8 9W / 70W | 11MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1875 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 3189 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 1875 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 3189 G /usr/lib/xorg/Xorg 4MiB |
| 2 N/A N/A 1875 G /usr/lib/xorg/Xorg 4MiB |
| 2 N/A N/A 3189 G /usr/lib/xorg/Xorg 4MiB |
| 3 N/A N/A 1875 G /usr/lib/xorg/Xorg 4MiB |
| 3 N/A N/A 3189 G /usr/lib/xorg/Xorg 4MiB |
+---------------------------------------------------------------------------------------+
/usr/lib/xorg/Xorg
是 X Window 系统的核心组件,它负责与硬件交互并提供图形环境。如果你正在使用图形界面,那么停止这个进程可能会导致你的图形界面停止工作。
使用以下的命令来查看你的系统正在使用哪个显示管理器:
cat /etc/X11/default-display-manager
/usr/sbin/gdm3
这个命令会输出你的默认显示管理器的完整路径。然后,你可以使用相应的命令来停止你的显示管理器。你可以使用以下的命令来停止它:
sudo service gdm stop
sudo service lightdm stop
这个命令会停止 lightdm 服务,lightdm 是 Ubuntu 默认的显示管理器,它负责启动 Xorg 进程。请注意,这个命令会立即停止你的图形界面,你需要保存所有的工作并关闭所有的程序。
如果你的系统使用的是其他的显示管理器,例如 gdm 或 kdm,你需要将 lightdm 替换为你的显示管理器的名称。