GPU Sharing in Kubernetes
Category: Kubernetes  Tags: Nvidia, GPU, Docker, kubectl
Building the Applications
Scheduler Extender
git clone https://github.com/AliyunContainerService/gpushare-scheduler-extender.git && cd gpushare-scheduler-extender
docker build -t gouchicao/gpushare-scheduler-extender .
Device Plugin
git clone https://github.com/AliyunContainerService/gpushare-device-plugin.git && cd gpushare-device-plugin
docker build -t gouchicao/gpushare-device-plugin .
Kubectl Extension
wget https://github.com/AliyunContainerService/gpushare-device-plugin/releases/download/v0.3.0/kubectl-inspect-gpushare
Installation
Deploy the GPU share scheduler extender on the control plane
cd /etc/kubernetes
sudo wget https://raw.githubusercontent.com/AliyunContainerService/gpushare-scheduler-extender/master/config/scheduler-policy-config.json
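The downloaded scheduler-policy-config.json is what registers the extender with kube-scheduler. At the time of writing it looks roughly like the sketch below (the NodePort 32766, the verb paths, and the field values come from the gpushare project and may change between releases, so check the file you actually downloaded):

```json
{
  "kind": "Policy",
  "apiVersion": "v1",
  "extenders": [
    {
      "urlPrefix": "http://127.0.0.1:32766/gpushare-scheduler",
      "filterVerb": "filter",
      "bindVerb": "bind",
      "enableHttps": false,
      "nodeCacheCapable": true,
      "managedResources": [
        { "name": "aliyun.com/gpu-mem", "ignoredByScheduler": false }
      ],
      "ignorable": false
    }
  ]
}
```

kube-scheduler calls the extender's filter endpoint while selecting a node and the bind endpoint when committing the Pod, which is how the extender keeps its per-GPU memory accounting consistent.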
mkdir ~/gpu-sharing
cd ~/gpu-sharing
wget https://raw.githubusercontent.com/AliyunContainerService/gpushare-scheduler-extender/master/config/gpushare-schd-extender.yaml
sed -i 's|registry.cn-hangzhou.aliyuncs.com/acs/k8s-gpushare-schd-extender:1.11-d170d8a|gouchicao/gpushare-scheduler-extender|g' gpushare-schd-extender.yaml
kubectl apply -f gpushare-schd-extender.yaml
Configure the scheduler policy
Edit /etc/kubernetes/manifests/kube-scheduler.yaml and configure it to use the GPU scheduling policy in /etc/kubernetes/scheduler-policy-config.json.
sudo vim /etc/kubernetes/manifests/kube-scheduler.yaml
- Add the policy configuration file flag:
- --policy-config-file=/etc/kubernetes/scheduler-policy-config.json
- Add the volume mount and the volume to the Pod:
- mountPath: /etc/kubernetes/scheduler-policy-config.json
  name: scheduler-policy-config
  readOnly: true
- hostPath:
    path: /etc/kubernetes/scheduler-policy-config.json
    type: FileOrCreate
  name: scheduler-policy-config
The final file looks like this (kube-scheduler runs as a static Pod, so kubelet restarts it automatically once the manifest is saved):
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    component: kube-scheduler
    tier: control-plane
  name: kube-scheduler
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-scheduler
    - --authentication-kubeconfig=/etc/kubernetes/scheduler.conf
    - --authorization-kubeconfig=/etc/kubernetes/scheduler.conf
    - --bind-address=127.0.0.1
    - --kubeconfig=/etc/kubernetes/scheduler.conf
    - --leader-elect=true
    - --policy-config-file=/etc/kubernetes/scheduler-policy-config.json
    image: k8s.gcr.io/kube-scheduler:v1.16.2
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10251
        scheme: HTTP
      initialDelaySeconds: 15
      timeoutSeconds: 15
    name: kube-scheduler
    resources:
      requests:
        cpu: 100m
    volumeMounts:
    - mountPath: /etc/kubernetes/scheduler.conf
      name: kubeconfig
      readOnly: true
    - mountPath: /etc/kubernetes/scheduler-policy-config.json
      name: scheduler-policy-config
      readOnly: true
  hostNetwork: true
  priorityClassName: system-cluster-critical
  volumes:
  - hostPath:
      path: /etc/kubernetes/scheduler.conf
      type: FileOrCreate
    name: kubeconfig
  - hostPath:
      path: /etc/kubernetes/scheduler-policy-config.json
      type: FileOrCreate
    name: scheduler-policy-config
status: {}
Deploy the Device Plugin
wget https://raw.githubusercontent.com/AliyunContainerService/gpushare-device-plugin/master/device-plugin-rbac.yaml
kubectl apply -f device-plugin-rbac.yaml
wget https://raw.githubusercontent.com/AliyunContainerService/gpushare-device-plugin/master/device-plugin-ds.yaml
sed -i 's|registry.cn-hangzhou.aliyuncs.com/acs/k8s-gpushare-plugin:v2-1.11-aff8a23|gouchicao/gpushare-device-plugin|g' device-plugin-ds.yaml
kubectl apply -f device-plugin-ds.yaml
Add the label gpushare to each node that should share its GPUs:
kubectl label node <target_node> gpushare=true
For example:
kubectl label node k8s-gpu1 gpushare=true
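The label matters because the plugin's DaemonSet schedules only onto matching nodes: in device-plugin-ds.yaml the Pod template carries a node selector roughly like the fragment below (paraphrased; check the downloaded manifest for the exact form).

```yaml
# illustrative fragment of device-plugin-ds.yaml
spec:
  template:
    spec:
      nodeSelector:
        gpushare: "true"   # the device plugin only runs on labeled nodes
```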
Install the kubectl extension
(kubectl discovers any executable named kubectl-* on the PATH as a plugin, so the kubectl-inspect-gpushare binary becomes available as the subcommand kubectl inspect gpushare.)
cd /usr/bin/
sudo wget https://github.com/AliyunContainerService/gpushare-device-plugin/releases/download/v0.3.0/kubectl-inspect-gpushare
sudo chmod u+x /usr/bin/kubectl-inspect-gpushare
Usage
Request GPU memory in your application through the extended resource aliyun.com/gpu-mem.
Request 4 GiB of GPU memory
apiVersion: apps/v1
kind: Deployment
metadata:
  name: binpack-4
  labels:
    app: binpack-4
spec:
  replicas: 1
  selector: # define how the deployment finds the pods it manages
    matchLabels:
      app: binpack-4
  template: # define the pods specifications
    metadata:
      labels:
        app: binpack-4
    spec:
      containers:
      - name: binpack-4
        image: cheyang/gpu-player:v2
        resources:
          limits:
            # GiB
            aliyun.com/gpu-mem: 4
Request 6 GiB of GPU memory
apiVersion: apps/v1
kind: Deployment
metadata:
  name: binpack-6
  labels:
    app: binpack-6
spec:
  replicas: 1
  selector: # define how the deployment finds the pods it manages
    matchLabels:
      app: binpack-6
  template: # define the pods specifications
    metadata:
      labels:
        app: binpack-6
    spec:
      containers:
      - name: binpack-6
        image: cheyang/gpu-player:v2
        resources:
          limits:
            # GiB
            aliyun.com/gpu-mem: 6
Request 10 GiB of GPU memory
apiVersion: apps/v1
kind: Deployment
metadata:
  name: binpack-10
  labels:
    app: binpack-10
spec:
  replicas: 1
  selector: # define how the deployment finds the pods it manages
    matchLabels:
      app: binpack-10
  template: # define the pods specifications
    metadata:
      labels:
        app: binpack-10
    spec:
      containers:
      - name: binpack-10
        image: cheyang/gpu-player:v2
        resources:
          limits:
            # GiB
            aliyun.com/gpu-mem: 10
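Because the limit is a soft, scheduler-enforced quota rather than a hardware partition, the application itself should stay within its share. Per the gpushare project, the device plugin exposes the granted and total GPU memory to each container through the environment variables ALIYUN_COM_GPU_MEM_CONTAINER and ALIYUN_COM_GPU_MEM_DEV (both in GiB); a minimal sketch, assuming those variable names, that turns them into a memory fraction for the ML framework:

```python
import os

# The gpushare device plugin injects the granted and total GPU memory
# into each container via environment variables (names taken from the
# gpushare-device-plugin docs; treat them as assumptions):
#   ALIYUN_COM_GPU_MEM_CONTAINER - GiB granted to this container
#   ALIYUN_COM_GPU_MEM_DEV       - GiB on the whole physical GPU
def gpu_mem_fraction(env=None):
    """Fraction of the assigned GPU's memory this container may use."""
    env = os.environ if env is None else env
    granted = float(env.get("ALIYUN_COM_GPU_MEM_CONTAINER", 0))
    total = float(env.get("ALIYUN_COM_GPU_MEM_DEV", 0))
    if total <= 0:
        return 1.0  # no limit information: let the framework take the GPU
    return granted / total

# e.g. with TensorFlow 1.x:
#   config = tf.ConfigProto()
#   config.gpu_options.per_process_gpu_memory_fraction = gpu_mem_fraction()
```

This is essentially the pattern the cheyang/gpu-player demo image relies on: cooperative memory capping inside the container, with the extender accounting for the shares across Pods.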
Deploy the applications
kubectl apply -f binpack-6G.yaml
kubectl apply -f binpack-10G.yaml
kubectl apply -f binpack-4G.yaml
Check GPU usage
kubectl inspect gpushare -d
NAME: k8s-gpu1
IPADDRESS: 172.16.33.69
NAME NAMESPACE GPU0(Allocated) GPU1(Allocated) GPU2(Allocated) GPU3(Allocated)
binpack-6-7d9f48946c-9strc default 6 0 0 0
binpack-10-6b5fcd9ddc-wzdzv default 0 10 0 0
binpack-4-5c857c579-5shjc default 0 4 0 0
Allocated : 20 (35%)
Total : 56
----------------------------------------------------------------------------------------------------------
Allocated/Total GPU Memory In Cluster: 20/56 (35%)
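The per-pod table can also be post-processed in scripts. A small sketch (a hypothetical helper; the tool already prints the totals itself) that sums the per-GPU "Allocated" columns from output like the above:

```python
def gpu_totals(text):
    """Sum the per-GPU Allocated columns of the pod rows in the
    output of `kubectl inspect gpushare -d` (hypothetical helper)."""
    totals = []
    for line in text.splitlines():
        if ":" in line:  # skip NAME:/IPADDRESS:/summary lines
            continue
        parts = line.split()
        # a pod row is: pod-name namespace, then one integer per GPU
        if len(parts) >= 3 and all(p.isdigit() for p in parts[2:]):
            vals = [int(p) for p in parts[2:]]
            if len(totals) < len(vals):
                totals += [0] * (len(vals) - len(totals))
            for i, v in enumerate(vals):
                totals[i] += v
    return totals
```

For the sample output above this yields [6, 14, 0, 0], i.e. 20 GiB allocated in total, matching the tool's own summary line.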