Install NVIDIA device plugin for Kubernetes
Configure Docker on each NVIDIA GPU node
- Add "default-runtime": "nvidia"
$ sudo vim /etc/docker/daemon.json
{
    "registry-mirrors": ["https://75oltije.mirror.aliyuncs.com"],
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
- Restart the service
sudo systemctl restart docker
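To confirm the new default runtime took effect after the restart, check the Docker daemon info (exact output formatting varies by Docker version):
$ docker info | grep -i 'default runtime'
 Default Runtime: nvidia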
Set taints on each node
GPU nodes
kubectl taint node gpu1 nvidia.com/gpu:NoSchedule
kubectl taint node gpu2 nvidia.com/gpu:NoSchedule
CPU nodes
kubectl taint node ln2 node-type=production:NoSchedule
kubectl taint node ln3 node-type=production:NoSchedule
kubectl taint node ln4 node-type=production:NoSchedule
kubectl taint node ln5 node-type=production:NoSchedule
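With these taints in place, only pods carrying a matching toleration can be scheduled onto the nodes. To verify a taint was applied (node name as above):
$ kubectl describe node gpu1 | grep Taints
Taints:             nvidia.com/gpu:NoSchedule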
Installation
- Add and update the Helm repository
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
- Install with Helm
helm install --generate-name nvdp/nvidia-device-plugin
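--generate-name yields a timestamped release name like the one shown below. To pin a stable release name and chart version instead, the standard Helm flags also work (the version here is only an example; list available versions with helm search repo nvdp):
helm install nvdp nvdp/nvidia-device-plugin --version 0.8.2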
Check the status
Check the DaemonSet
$ kubectl get daemonset -n kube-system nvidia-device-plugin-1614240442
NAME                              DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
nvidia-device-plugin-1614240442   2         2         1       2            1           <none>          22h
Check the Pods
$ kubectl get pod -n kube-system -o wide | grep nvidia-device-plugin
NAME                                    READY   STATUS             RESTARTS   AGE   IP          NODE   NOMINATED NODE   READINESS GATES
nvidia-device-plugin-1614240442-6qz7w   1/1     Running            14         22h   10.46.0.5   gpu1   <none>           <none>
nvidia-device-plugin-1614240442-wfh6c   0/1     CrashLoopBackOff   273        22h   10.34.0.4   gpu2   <none>           <none>
Check the Pod logs
- Success
$ kubectl logs -n kube-system nvidia-device-plugin-1614240442-6qz7w
2021/02/25 08:53:42 Loading NVML
2021/02/25 08:53:42 Starting FS watcher.
2021/02/25 08:53:42 Starting OS watcher.
2021/02/25 08:53:42 Retreiving plugins.
2021/02/25 08:53:42 Starting GRPC server for 'nvidia.com/gpu'
2021/02/25 08:53:42 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2021/02/25 08:53:42 Registered device plugin for 'nvidia.com/gpu' with Kubelet
- Failure (Docker on node gpu2 was not configured correctly)
$ kubectl logs -n kube-system nvidia-device-plugin-1614240442-wfh6c
2021/02/26 07:03:48 Loading NVML
2021/02/26 07:03:48 Failed to initialize NVML: could not load NVML library.
2021/02/26 07:03:48 If this is a GPU node, did you set the docker default runtime to `nvidia`?
2021/02/26 07:03:48 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2021/02/26 07:03:48 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
2021/02/26 07:03:48 If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
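The fix is to apply the same /etc/docker/daemon.json change from the first section on gpu2, restart Docker, and then delete the crashing pod so the DaemonSet recreates it:
$ sudo systemctl restart docker    # on gpu2, after fixing /etc/docker/daemon.json
$ kubectl delete pod -n kube-system nvidia-device-plugin-1614240442-wfh6c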
Check a node's GPU resources
- GPU resources present
$ kubectl describe node gpu1
......
Capacity:
  cpu:                64
  ephemeral-storage:  575261800Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             263781748Ki
  nvidia.com/gpu:     4
  pods:               110
Allocatable:
  cpu:                64
  ephemeral-storage:  530161274003
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             263679348Ki
  nvidia.com/gpu:     4
  pods:               110
......
- GPU resources missing
$ kubectl describe node gpu2
......
Capacity:
  cpu:                64
  ephemeral-storage:  575261800Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             263781756Ki
  pods:               110
Allocatable:
  cpu:                64
  ephemeral-storage:  530161274003
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             263679356Ki
  pods:               110
......
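To compare GPU capacity across all nodes at a glance, a custom-columns query also works (the dots in the resource name must be escaped; a node without the resource shows <none> — output below is abbreviated to the two GPU nodes):
$ kubectl get nodes -o custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu
NAME   GPU
gpu1   4
gpu2   <none>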
Test
- Edit gpu-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: OnFailure
  containers:
    - image: nvcr.io/nvidia/cuda:9.0-devel
      name: cuda9
      command: ["sleep"]
      args: ["100000"]
      resources:
        limits:
          nvidia.com/gpu: 1
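Note that gpu1 and gpu2 were tainted with nvidia.com/gpu:NoSchedule earlier. If the cluster does not enable the ExtendedResourceToleration admission controller (which adds such a toleration automatically for pods requesting nvidia.com/gpu), add an explicit toleration to the pod spec, for example:
spec:
  tolerations:
    - key: nvidia.com/gpu    # matches the taint set on the GPU nodes above
      operator: Exists
      effect: NoSchedule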
- Create the Pod
kubectl apply -f gpu-pod.yaml
- Check the Pod status
$ kubectl get pod gpu-pod -o wide
NAME      READY   STATUS    RESTARTS   AGE   IP          NODE   NOMINATED NODE   READINESS GATES
gpu-pod   1/1     Running   0          22h   10.46.0.8   gpu1   <none>           <none>
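To confirm the container actually sees the GPU, run nvidia-smi inside it (the binary is injected by the NVIDIA container runtime; if it is missing, recheck the Docker runtime configuration from the first section):
kubectl exec gpu-pod -- nvidia-smi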
References
- NVIDIA device plugin for Kubernetes
- Schedule GPUs
- Device Plugins
- Install Kubernetes
- Supporting Multi-Instance GPUs (MIG) in Kubernetes
- Virtual GPU device plugin for inference workloads in Kubernetes
- kubernetes 1.10 Failed to initialize NVML: could not load NVML library. #60
- Taints and Tolerations
- Google Kubernetes Engine (GKE): Control scheduling with node taints
- Getting started with K8s from scratch: GPU management and how the Device Plugin works