目录

我的 GPU:GP106 [GeForce GTX 1060 6GB]

安装 NVIDIA 驱动

查看哪些进程正在使用 NVIDIA 设备

lsof -n -w /dev/nvidia*

lsof 是一个在 Unix 和类 Unix 系统(如 Linux)上的命令行工具,用于列出当前系统打开的文件。在这里,”文件” 的概念很广泛,除了常见的文件和目录,还包括网络套接字、设备、管道等。

  • -n 参数告诉 lsof 不要将网络号转换为主机名,这可以加快 lsof 的运行速度。
  • -w 参数告诉 lsof 不要抑制警告信息。
  • /dev/nvidia* 是要查看的文件的路径,* 是通配符,表示所有以 /dev/nvidia 开头的文件。在这里,这些文件通常代表 NVIDIA 的设备。

所以,sudo lsof -n -w /dev/nvidia* 命令的作用是查看哪些进程正在使用 NVIDIA 设备。

杀死使用 NVIDIA 设备的进程或停止服务

  • kill -9 <pid>
  • sudo systemctl stop <service_name>

列出系统中所有需要驱动的设备

sudo ubuntu-drivers devices
WARNING:root:_pkg_get_support nvidia-driver-525: package has invalid Support PBheader, cannot determine support level
WARNING:root:_pkg_get_support nvidia-driver-525-server: package has invalid Support PBheader, cannot determine support level
WARNING:root:_pkg_get_support nvidia-driver-545: package has invalid Support PBheader, cannot determine support level
WARNING:root:_pkg_get_support nvidia-driver-390: package has invalid Support Legacyheader, cannot determine support level
WARNING:root:_pkg_get_support nvidia-driver-515-server: package has invalid Support PBheader, cannot determine support level
WARNING:root:_pkg_get_support nvidia-driver-535: package has invalid Support PBheader, cannot determine support level
== /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0 ==
modalias : pci:v000010DEd00001C03sv00001043sd00008618bc03sc00i00
vendor   : NVIDIA Corporation
model    : GP106 [GeForce GTX 1060 6GB]
manual_install: True
driver   : nvidia-driver-410 - third-party non-free
driver   : nvidia-driver-418-server - distro non-free
driver   : nvidia-driver-525 - third-party non-free
driver   : nvidia-driver-470 - third-party non-free recommended
driver   : nvidia-driver-525-server - distro non-free
driver   : nvidia-driver-545 - third-party non-free
driver   : nvidia-driver-390 - distro non-free
driver   : nvidia-driver-515-server - distro non-free
driver   : nvidia-driver-470-server - distro non-free
driver   : nvidia-driver-535 - third-party non-free
driver   : nvidia-driver-450-server - distro non-free
driver   : xserver-xorg-video-nouveau - distro free builtin

安装 NVIDIA 驱动

sudo apt install nvidia-driver-525 -y

重启系统

sudo reboot

查看 NVIDIA 驱动是否安装成功

nvidia-smi
Tue Jan  9 21:48:15 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   47C    P8     5W / 120W |      2MiB /  6144MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

安装 Docker

  • 设置 Docker 的 apt 源
# Add Docker's official GPG key:
sudo apt-get update
sudo apt-get install ca-certificates curl gnupg
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
sudo chmod a+r /etc/apt/keyrings/docker.gpg

# Add the repository to Apt sources:
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
  • 安装 Docker 包
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
  • 设置无需 sudo 权限就可以运行 docker 命令
sudo usermod -aG docker wjunjian

这个命令是用来将用户 wjunjian 添加到 docker 用户组中。

  • sudo:用于以 root 用户权限运行命令。
  • usermod:是一个用于修改用户账户的命令。
  • -aG:这是 usermod 命令的两个选项,-a 表示追加用户到指定的组,而不是替换用户所在的组,-G 用于指定组名。
  • docker:这是要添加的用户组的名称。
  • wjunjian:这是要添加到用户组的用户名。

执行这个命令后,用户 wjunjian 将成为 docker 用户组的成员。这通常用于允许用户无需 sudo 权限就可以运行 docker 命令,因为 docker 用户组的成员被授予了执行 docker 命令的权限。

卸载 Docker

  • Uninstall the Docker Engine, CLI, containerd, and Docker Compose packages:
sudo apt-get purge docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin docker-ce-rootless-extras
  • Images, containers, volumes, or custom configuration files on your host aren’t automatically removed. To delete all images, containers, and volumes:
sudo rm -rf /var/lib/docker
sudo rm -rf /var/lib/containerd

安装 NVIDIA Container Toolkit

运行 docker run --gpus all ...,出现错误: docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].

设置 nvidia-container-runtime 的 apt 源

curl -s -L https://nvidia.github.io/nvidia-container-runtime/gpgkey | \
	  sudo apt-key add -
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.list | \
	  sudo tee /etc/apt/sources.list.d/nvidia-container-runtime.list
sudo apt-get update

安装 nvidia-container-runtime

sudo apt install nvidia-container-runtime

重启 Docker

sudo systemctl restart docker

运行 Tabby 服务

代码模型:DeepseekCoder-1.3B

docker run --rm -it --gpus all --name tabby -p 8080:8080 \
    -v ~/.tabby:/data tabbyml/tabby \
    serve --model TabbyML/DeepseekCoder-1.3B  --device cuda
2024-01-09T15:59:08.283475Z  INFO tabby::serve: crates/tabby/src/serve.rs:111: Starting server, this might takes a few minutes...
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1060 6GB, compute capability 6.1
2024-01-09T15:59:10.445063Z  INFO tabby::routes: crates/tabby/src/routes/mod.rs:35: Listening at 0.0.0.0:8080

查看 NVIDIA 显卡使用情况

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   49C    P8     5W / 120W |   2606MiB /  6144MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     26131      C   /opt/tabby/bin/tabby             2600MiB |
+-----------------------------------------------------------------------------+

测试代码生成服务

curl -X 'POST' \
  'http://gpu1060:8080/v1/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "language": "python",
  "segments": {
    "prefix": "def swap(x, y):\n    ",
    "suffix": "\n    return x, y"
  }
}'
{
  "id": "cmpl-4f3f4974-7c19-4a17-86ad-d6cb92515181",
  "choices": [
    {
      "index": 0,
      "text": "x, y = y, x"
    }
  ]
}