
Configure Docker on Each NVIDIA GPU Node

  1. Add "default-runtime": "nvidia"
    $ sudo vim /etc/docker/daemon.json
    {
     "registry-mirrors": ["https://75oltije.mirror.aliyuncs.com"],
     "default-runtime": "nvidia",
     "runtimes": {
         "nvidia": {
             "path": "nvidia-container-runtime",
             "runtimeArgs": []
         }
     }
    }
    
  2. Restart the Docker service (a quick verification follows)
    sudo systemctl restart docker
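    
    To verify that Docker actually picked up the new default runtime, check docker info (a quick sanity check; it should report something like "Default Runtime: nvidia"):
    $ docker info | grep -i "default runtime"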
    

Set Taints on Each Node

GPU nodes

kubectl taint node gpu1 nvidia.com/gpu:NoSchedule
kubectl taint node gpu2 nvidia.com/gpu:NoSchedule

CPU nodes

kubectl taint node ln2 node-type=production:NoSchedule
kubectl taint node ln3 node-type=production:NoSchedule
kubectl taint node ln4 node-type=production:NoSchedule
kubectl taint node ln5 node-type=production:NoSchedule
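
Pods that do not tolerate these taints will not be scheduled onto the tainted nodes. A minimal toleration sketch for a Pod that must run on a GPU node is shown below; the nvidia-device-plugin Helm chart already ships such a toleration for its DaemonSet, and clusters with the ExtendedResourceToleration admission controller add it automatically to Pods that request nvidia.com/gpu.

  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule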

Installation

  1. Add and update the Helm repository
    helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
    helm repo update
    
  2. Install with Helm (a check of the release follows)
    helm install --generate-name nvdp/nvidia-device-plugin
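    
    To confirm the release was created, list the Helm releases (a sketch; the generated release name and namespace depend on how the install command was run):
    $ helm list --all-namespaces | grep nvidia-device-plugin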
    

Check the Status

Check the DaemonSet

$ kubectl get daemonset -n kube-system nvidia-device-plugin-1614240442 
NAME                              DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
nvidia-device-plugin-1614240442   2         2         1       2            1           <none>          22h

Check the Pods

$ kubectl get pod -n kube-system -o wide | grep nvidia-device-plugin
NAME                                    READY   STATUS              RESTARTS   AGE     IP              NODE   NOMINATED NODE   READINESS GATES
nvidia-device-plugin-1614240442-6qz7w   1/1     Running             14         22h     10.46.0.5       gpu1   <none>           <none>
nvidia-device-plugin-1614240442-wfh6c   0/1     CrashLoopBackOff    273        22h     10.34.0.4       gpu2   <none>           <none>

Check the Pod Logs

  • Success
    $ kubectl logs -n kube-system nvidia-device-plugin-1614240442-6qz7w 
    2021/02/25 08:53:42 Loading NVML
    2021/02/25 08:53:42 Starting FS watcher.
    2021/02/25 08:53:42 Starting OS watcher.
    2021/02/25 08:53:42 Retreiving plugins.
    2021/02/25 08:53:42 Starting GRPC server for 'nvidia.com/gpu'
    2021/02/25 08:53:42 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
    2021/02/25 08:53:42 Registered device plugin for 'nvidia.com/gpu' with Kubelet
    
  • Failure (Docker on node gpu2 is not configured correctly; a recovery step follows the log)
    $ kubectl logs -n kube-system nvidia-device-plugin-1614240442-wfh6c 
    2021/02/26 07:03:48 Loading NVML
    2021/02/26 07:03:48 Failed to initialize NVML: could not load NVML library.
    2021/02/26 07:03:48 If this is a GPU node, did you set the docker default runtime to `nvidia`?
    2021/02/26 07:03:48 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
    2021/02/26 07:03:48 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
    2021/02/26 07:03:48 If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
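
After fixing the Docker configuration on gpu2 (see the first section), you can delete the failing Pod so the DaemonSet recreates it; a sketch using the Pod name from the output above:

kubectl delete pod -n kube-system nvidia-device-plugin-1614240442-wfh6c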
    

Check the GPU Resources on Each Node

  • GPU resources available
    $ kubectl describe node gpu1
    ......
    Capacity:
    cpu:                64
    ephemeral-storage:  575261800Ki
    hugepages-1Gi:      0
    hugepages-2Mi:      0
    memory:             263781748Ki
    nvidia.com/gpu:     4
    pods:               110
    Allocatable:
    cpu:                64
    ephemeral-storage:  530161274003
    hugepages-1Gi:      0
    hugepages-2Mi:      0
    memory:             263679348Ki
    nvidia.com/gpu:     4
    pods:               110
    ......
    
  • GPU resources missing (a quicker query is shown after this list)
    $ kubectl describe node gpu2
    ......
    Capacity:
    cpu:                64
    ephemeral-storage:  575261800Ki
    hugepages-1Gi:      0
    hugepages-2Mi:      0
    memory:             263781756Ki
    pods:               110
    Allocatable:
    cpu:                64
    ephemeral-storage:  530161274003
    hugepages-1Gi:      0
    hugepages-2Mi:      0
    memory:             263679356Ki
    pods:               110
    ......
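
Rather than scanning the full describe output, you can read the GPU count directly with a JSONPath query (a sketch; the dots in the resource name must be escaped):

kubectl get node gpu1 -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'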
    

Test

  1. Edit gpu-pod.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-pod
    spec:
      restartPolicy: OnFailure
      containers:
      - image: nvcr.io/nvidia/cuda:9.0-devel
        name: cuda9
        command: ["sleep"]
        args: ["100000"]
        resources:
          limits:
            nvidia.com/gpu: 1
    
  2. Create the Pod
    kubectl apply -f gpu-pod.yaml 
    
  3. Check the Pod status (a GPU check inside the container follows)
    $ kubectl get pod gpu-pod -o wide
    NAME      READY   STATUS    RESTARTS   AGE   IP          NODE   NOMINATED NODE   READINESS GATES
    gpu-pod   1/1     Running   0          22h   10.46.0.8   gpu1   <none>           <none>
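    
    Once the Pod is Running, you can confirm the GPU is visible from inside the container (a sketch; this assumes the NVIDIA runtime mounts nvidia-smi into the container, which it does by default):
    $ kubectl exec gpu-pod -- nvidia-smi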
    

References