40 articles tagged “Docker”

GPU Sharing in Kubernetes

  1. Add the policy configuration file
- --policy-config-file=/etc/kubernetes/scheduler-policy-config.json
  1. Add the volume mount and the volume to the Pod
- mountPath: /etc/kubernetes/scheduler-policy-config.json
  name: scheduler-policy-config
  readOnly: true
- hostPath:
      path: /etc/kubernetes/scheduler-policy-config.json
      type: FileOrCreate
  name: scheduler-policy-config

The final manifest:
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    component: kube-scheduler
    tier: control-plane
  name: kube-scheduler
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-scheduler
    - --authentication-kubeconfig=/etc/kubernetes/scheduler.
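The three additions above must agree with each other. The path in `--policy-config-file` has to be the `mountPath` of a volume mount, and that mount's `name` has to match a Pod-level volume. Below is a minimal sketch of that check. It assumes the manifest has already been loaded into a Python dict, for example with PyYAML's `yaml.safe_load` (not part of the standard library and not used here). The helper name is hypothetical, and the sketch does not validate the policy file's contents.

```python
# Hypothetical helper, not from the original post.
POLICY_FLAG = "--policy-config-file="


def check_policy_mount(manifest: dict) -> list:
    """Return problems found; an empty list means flag, mount and volume agree."""
    spec = manifest["spec"]
    container = spec["containers"][0]  # kube-scheduler is the only container

    flags = [a for a in container.get("command", []) if a.startswith(POLICY_FLAG)]
    if not flags:
        return ["missing --policy-config-file flag"]
    policy_path = flags[0][len(POLICY_FLAG):]

    # The file the flag points to must be visible inside the container.
    mounts = [m for m in container.get("volumeMounts", []) if m.get("mountPath") == policy_path]
    if not mounts:
        return [f"no volumeMount at {policy_path}"]

    # The mount must refer to a volume declared at the Pod level.
    name = mounts[0]["name"]
    declared = {v.get("name") for v in spec.get("volumes", [])}
    if name not in declared:
        return [f"volume {name!r} is not declared"]
    return []


# The fragments from the steps above, assembled into one dict.
EXAMPLE_MANIFEST = {
    "spec": {
        "containers": [{
            "command": ["kube-scheduler",
                        "--policy-config-file=/etc/kubernetes/scheduler-policy-config.json"],
            "volumeMounts": [{"mountPath": "/etc/kubernetes/scheduler-policy-config.json",
                              "name": "scheduler-policy-config", "readOnly": True}],
        }],
        "volumes": [{"hostPath": {"path": "/etc/kubernetes/scheduler-policy-config.json",
                                  "type": "FileOrCreate"},
                     "name": "scheduler-policy-config"}],
    }
}

print(check_policy_mount(EXAMPLE_MANIFEST))  # []
```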

Install NVIDIA device plugin for Kubernetes

  1. Restart the service
sudo systemctl restart docker
  1. Install with Helm
helm install --generate-name nvdp/nvidia-device-plugin

Failed (Docker on node gpu2 was not configured correctly):
$ kubectl logs -n kube-system nvidia-device-plugin-1614240442-wfh6c
2021/02/26 07:03:48 Loading NVML
2021/02/26 07:03:48 Failed to initialize NVML: could not load NVML library.
2021/02/26 07:03:48 If this is a GPU node, did you set the docker default runtime to nvidia?
2021/02/26 07:03:48 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2021/02/26 07:03:48 You can learn how to set the runtime at: https://github.
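The log points to the usual cause: Docker on the GPU node does not use the NVIDIA runtime by default. The prerequisites page linked in the log describes a daemon.json that sets `default-runtime` to `nvidia` and adds an `nvidia` entry under `runtimes`. The sketch below merges those keys into an existing daemon.json dict, so other settings such as registry mirrors survive. The helper name is hypothetical. The runtime path is the one that page shows and may differ on your system.

```python
import json

# Runtime entry as shown in the NVIDIA k8s-device-plugin prerequisites;
# adjust the path if nvidia-container-runtime is installed elsewhere.
NVIDIA_RUNTIME = {"path": "/usr/bin/nvidia-container-runtime", "runtimeArgs": []}


# Hypothetical helper, not from the original post.
def with_nvidia_default_runtime(config: dict) -> dict:
    """Return a copy of a daemon.json dict with nvidia as Docker's default runtime."""
    updated = json.loads(json.dumps(config))  # deep copy of plain JSON data
    runtimes = updated.setdefault("runtimes", {})
    runtimes.setdefault("nvidia", dict(NVIDIA_RUNTIME))  # keep an existing entry
    updated["default-runtime"] = "nvidia"
    return updated


print(json.dumps(with_nvidia_default_runtime({}), indent=2))
```

To apply it, write the result to /etc/docker/daemon.json as root and restart Docker with `sudo systemctl restart docker`. Then let the device plugin pod on that node restart.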

Creating a Private Python Package Repository with PyPIServer

  • Clients can upload packages as well as download them (useful when we develop our own Python packages)
# Create a username and password
sudo apt install apache2-utils -y
sudo mkdir /data/pypi-packages
sudo htpasswd -sc /data/pypi-packages/htpasswd.txt wjj
# When creating additional users, omit -c (it recreates the file and would discard existing users)
sudo htpasswd -s /data/pypi-packages/htpasswd.txt test
# Deploy the container
docker run -d --restart=always --name pypiserver -p 8080:8080 \
    -v /data/pypi-packages/:/data/packages \
    pypiserver/pypiserver:latest -P /data/packages/htpasswd.txt
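`htpasswd -s` stores each password as `{SHA}` followed by the Base64-encoded SHA-1 digest of the password. The sketch below reproduces that format with the standard library. It can help when inspecting the file pypiserver reads (mounted at /data/packages/htpasswd.txt). The helper name is hypothetical. The scheme is unsalted SHA-1, so treat the file as sensitive.

```python
import base64
import hashlib


# Hypothetical helper, not from the original post.
def htpasswd_sha_entry(user: str, password: str) -> str:
    """Build one htpasswd line in the {SHA} format that `htpasswd -s` writes."""
    digest = hashlib.sha1(password.encode("utf-8")).digest()
    return f"{user}:{{SHA}}{base64.b64encode(digest).decode('ascii')}"


print(htpasswd_sha_entry("test", "password"))
# test:{SHA}W6ph5Mm5Pz8GgiULbPgzG37mj9g=
```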

Install:
$ pip3 install tensorflow
Looking in indexes: http://172.16.33.174:8080/simple/, https://mirrors.aliyun.com/pypi/simple/
Collecting tensorflow
  Downloading http://172.16.33.
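The log shows pip looking in both the private server and the Aliyun mirror. One way to get this is a pip configuration with `index-url` and `extra-index-url`. Because the private server speaks plain HTTP, its host must also be listed under `trusted-host`. The post does not show the configuration it used, so the sketch below (hypothetical helper) is an assumption. Note that pip searches all configured indexes and picks the best matching version; it does not treat the extra index as a fallback. The same keys can be set one at a time with `pip3 config set global.<key> <value>`.

```python
import configparser
import io
from urllib.parse import urlparse


# Hypothetical helper, not from the original post.
def pip_conf(index_url: str, extra_index_url: str = "") -> str:
    """Render pip.conf text that points pip at index_url plus extra_index_url."""
    cfg = configparser.ConfigParser(interpolation=None)
    cfg["global"] = {"index-url": index_url}
    if extra_index_url:
        cfg["global"]["extra-index-url"] = extra_index_url
    # pip ignores plain-HTTP indexes unless their host is marked as trusted.
    plain_http = [u for u in (index_url, extra_index_url) if u.startswith("http://")]
    if plain_http:
        cfg["global"]["trusted-host"] = " ".join(urlparse(u).hostname for u in plain_http)
    buf = io.StringIO()
    cfg.write(buf)
    return buf.getvalue()


print(pip_conf("http://172.16.33.174:8080/simple/",
               "https://mirrors.aliyun.com/pypi/simple/"))
```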

Dockerfile OpenCV4 Ubuntu20.04

  1. ImportError: libgthread-2.0.so.0: cannot open shared object file: No such file or directory
>>> import cv2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.8/dist-packages/cv2/__init__.py", line 5, in <module>
    from .cv2 import *
ImportError: libgthread-2.0.so.0: cannot open shared object file: No such file or directory

apt install -y libglib2.0-dev

This step requires interactive input, so the installation cannot be automated:
$ apt install libglib2.0-dev
Setting up tzdata (2020f-0ubuntu0.20.04.1) ...
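The missing file in the traceback is a shared library, not a Python module. A small helper (hypothetical) can extract the library name from the ImportError, so you know what to look up with `dpkg -S` (installed packages) or `apt-file search` (all packages). On Ubuntu 20.04, libgthread-2.0.so.0 comes from the runtime package libglib2.0-0, which libglib2.0-dev depends on. Running apt with `DEBIAN_FRONTEND=noninteractive` should avoid the tzdata prompt.

```python
import re

# Loader messages look like "libfoo.so.0: cannot open shared object file: ...".
_MISSING_SO = re.compile(r"(\S+\.so(?:\.\d+)*): cannot open shared object file")


# Hypothetical helpers, not from the original post.
def missing_shared_library(error_message: str):
    """Return the shared-library name from a loader ImportError message, or None."""
    match = _MISSING_SO.search(error_message)
    return match.group(1) if match else None


def diagnose_cv2_import():
    """Return (imported_ok, missing_library_or_None) for `import cv2`."""
    try:
        import cv2  # noqa: F401
    except ImportError as exc:
        return False, missing_shared_library(str(exc))
    return True, None


print(diagnose_cv2_import())
```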

Building ONNX Runtime

  • Pull the image (build environment)
docker pull nvidia/cuda:11.1-cudnn8-devel-ubuntu20.04
  • Run the container
docker run -it --name build-onnxruntime-gpu --runtime nvidia \
    -v $(pwd)/onnxruntime:/onnxruntime -w /onnxruntime \
    nvidia/cuda:11.1-cudnn8-devel-ubuntu20.04
  • Switch the apt mirror
sed -i 's/archive.ubuntu.com/mirrors.aliyun.com/g' /etc/apt/sources.list
apt-get update
  • Install dependencies
apt-get install language-pack-en git cmake python3 python3-pip -y
  • Configure the locale
locale-gen en_US.UTF-8
update-locale LANG=en_US.UTF-8
  • Switch the pip mirror
pip3 config set global.index-url https://mirrors.aliyun.com/pypi/simple/
  • Install numpy
pip3 install numpy
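The sed command in the apt-mirror step swaps the Ubuntu archive host for the Aliyun mirror. Below is an equivalent sketch for setup scripts already written in Python (hypothetical helpers). It keeps a `.bak` copy and reports how many lines changed. It differs from sed in two ways: the replacement is literal (sed treats the dots as regex wildcards). Like the sed command, it leaves `security.ubuntu.com` lines untouched.

```python
from pathlib import Path


# Hypothetical helpers, not from the original post.
def rewrite_sources(text: str, old: str = "archive.ubuntu.com",
                    new: str = "mirrors.aliyun.com") -> tuple:
    """Return (new_text, number_of_changed_lines) for an apt sources.list body."""
    lines = text.splitlines(keepends=True)
    updated = [line.replace(old, new) for line in lines]
    changed = sum(a != b for a, b in zip(lines, updated))
    return "".join(updated), changed


def switch_apt_mirror(path: str = "/etc/apt/sources.list", **kwargs) -> int:
    """Rewrite the file in place (needs root), keeping a .bak copy of the original."""
    src = Path(path)
    original = src.read_text()
    updated, changed = rewrite_sources(original, **kwargs)
    if changed:
        src.with_name(src.name + ".bak").write_text(original)
        src.write_text(updated)
    return changed
```

Run `apt-get update` afterwards, as in the steps above.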

Build: ./build.

Dockerfile ONNXRuntime GPU

FROM nvidia/cuda:11.1-cudnn8-devel-ubuntu20.04 AS builder
LABEL maintainer="wang-junjian@qq.com"

#E: Failed to fetch https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2004/x86_64/by-hash/SHA256/f10fc2a7a0d072ddcf141af2ef28f1e97ab4b3a5c3b9bbe34ed845d174fb4979  404  Not Found [IP: 61.155.167.2 443]
#E: Some index files failed to download. They have been ignored, or old ones used instead.
RUN rm /etc/apt/sources.list.d/cuda.list /etc/apt/sources.list.d/nvidia-ml.list

RUN sed -i 's/archive.ubuntu.com/mirrors.aliyun.com/g' /etc/apt/sources.list && \
    apt-get update && \
    apt-get install language-pack-en git python3 python3-pip -y && \
    DEBIAN_FRONTEND=noninteractive apt-get install cmake -y && \
    locale-gen en_US.UTF-8 && \
    update-locale LANG=en_US.UTF-8

RUN pip3 install numpy -i https://mirrors.aliyun.com/pypi/simple/
// ...
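After the image builds, you can confirm that the installed wheel is really a GPU build by asking onnxruntime which execution providers it exposes. This is a minimal sketch that assumes the wheel produced by this build has been installed with pip3. `get_available_providers()` is part of onnxruntime's Python API, and `CUDAExecutionProvider` is the name its CUDA provider reports. The helper name is hypothetical.

```python
# Hypothetical helper, not from the original post.
def cuda_provider_available() -> bool:
    """True if an installed onnxruntime build exposes the CUDA execution provider."""
    try:
        import onnxruntime as ort
    except ImportError:
        return False  # onnxruntime is not installed at all
    return "CUDAExecutionProvider" in ort.get_available_providers()


if __name__ == "__main__":
    print("CUDAExecutionProvider available:", cuda_provider_available())
```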

Downloading Offline Installation Packages for docker and nvidia-docker2 on Ubuntu

The following steps are performed inside a container.

Determine which dependency packages need to be downloaded:
$ apt-get install -s nvidia-docker2
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following additional packages will be installed:
  libcap2 libnvidia-container-tools libnvidia-container1 nvidia-container-runtime nvidia-container-toolkit
The following NEW packages will be installed:
  libcap2 libnvidia-container-tools libnvidia-container1 nvidia-container-runtime nvidia-container-toolkit nvidia-docker2
0 upgraded, 6 newly installed, 0 to remove and 0 not upgraded.
Inst libcap2 (1:2.32-1 Ubuntu:20.
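The `Inst` lines of the simulation list exactly what apt would install. The sketch below (hypothetical helper) turns them into `name=version` specs, which `apt-get download` accepts. `apt-get download` fetches only the packages it is given, not their dependencies, which is why the full `Inst` list is needed for the offline machine. The regex assumes the line shapes shown above, plus an optional `[old-version]` for upgrades.

```python
import re

# "Inst <name> [<old-version>] (<new-version> ..." ; the bracket part is optional.
_INST = re.compile(r"^Inst (\S+) (?:\[[^\]]*\] )?\((\S+)")


# Hypothetical helper, not from the original post.
def packages_to_download(simulation_output: str) -> list:
    """Turn `apt-get install -s` output into name=version specs for `apt-get download`."""
    specs = []
    for line in simulation_output.splitlines():
        match = _INST.match(line)
        if match:
            specs.append(f"{match.group(1)}={match.group(2)}")
    return specs


print(packages_to_download("Inst libcap2 (1:2.32-1 Ubuntu:20."))
# ['libcap2=1:2.32-1']
```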

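Training YOLOv5 on a Custom Dataset

Train the model in a container on Ubuntu 20.04 with 4 GPUs.

  • Run the container (replace project_dir with the absolute path of your project directory; Docker treats a bare name as a named volume, not a bind mount)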
$ docker run --ipc=host --runtime=nvidia -it --name project_name-yolov5 \
    -v project_dir:/usr/src/app/project ultralytics/yolov5:latest
  • Change the number of classes in every model network config
$ sed -i 's/nc: 80/nc: 2/g' project/models/yolov5?.yaml

Verify the replacement:
$ head -n 2 project/models/yolov5?.yaml
==> project/models/yolov5l.yaml <==
# parameters
nc: 2  # number of classes

==> project/models/yolov5m.yaml <==
# parameters
nc: 2  # number of classes

==> project/models/yolov5s.yaml <==
# parameters
nc: 2  # number of classes

==> project/models/yolov5x.
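The sed one-liner matches the literal text `nc: 80`, so it silently does nothing on a file whose class count has already been changed. The Python sketch below (hypothetical helpers) rewrites whatever number follows `nc:` at the start of a line in the same `yolov5?.yaml` files. It keeps the trailing comment.

```python
import re
from pathlib import Path

# Only an "nc:" key at the very start of a line is touched.
_NC_LINE = re.compile(r"^(nc:\s*)\d+", re.MULTILINE)


# Hypothetical helpers, not from the original post.
def set_num_classes(text: str, num_classes: int) -> str:
    """Rewrite the nc value in a YOLOv5 model YAML, keeping any trailing comment."""
    return _NC_LINE.sub(lambda m: f"{m.group(1)}{num_classes}", text)


def update_model_configs(models_dir: str, num_classes: int) -> list:
    """Apply set_num_classes to every yolov5?.yaml in models_dir; return changed names."""
    changed = []
    for path in sorted(Path(models_dir).glob("yolov5?.yaml")):
        old = path.read_text()
        new = set_num_classes(old, num_classes)
        if new != old:
            path.write_text(new)
            changed.append(path.name)
    return changed


print(set_num_classes("# parameters\nnc: 80  # number of classes\n", 2), end="")
```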

Configuring a Docker Registry Mirror

Speed up image pulls from Docker Hub.

## Alternative method
cat << EOF >/etc/docker/daemon.json
{
  "registry-mirrors": ["https://75oltije.mirror.aliyuncs.com"]
}
EOF
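The heredoc replaces the whole of /etc/docker/daemon.json, which drops anything already in it, such as a `default-runtime`/`runtimes` entry for the NVIDIA runtime. The sketch below (hypothetical helpers) merges the mirror into the existing file instead. Docker picks up the change only after a restart (`sudo systemctl restart docker`). Afterwards, `docker info` should list the mirror under "Registry Mirrors".

```python
import json
from pathlib import Path


# Hypothetical helpers, not from the original post.
def merge_registry_mirror(config: dict, mirror: str) -> bool:
    """Append mirror to config["registry-mirrors"] in place; True if it was added."""
    mirrors = config.setdefault("registry-mirrors", [])
    if mirror in mirrors:
        return False
    mirrors.append(mirror)
    return True


def add_registry_mirror(mirror: str, path: str = "/etc/docker/daemon.json") -> bool:
    """Merge a mirror into daemon.json (needs root) instead of overwriting the file."""
    file = Path(path)
    text = file.read_text() if file.exists() else ""
    config = json.loads(text) if text.strip() else {}
    added = merge_registry_mirror(config, mirror)
    if added:
        file.write_text(json.dumps(config, indent=2) + "\n")
    return added
```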

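Installing Harbor

  1. Generate the CA certificate (this command expects the CA private key ca.key to exist already, e.g. created with openssl genrsa -out ca.key 4096)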
openssl req -x509 -new -nodes -sha512 -days 3650 \
 -subj "/C=CN/ST=Shandong/L=Jinan/O=LNSoft/OU=AI/CN=lnsoft.com" \
 -key ca.key \
 -out ca.crt
  1. Generate the server private key
openssl genrsa -out lnsoft.com.key 4096
  1. Generate a certificate signing request (CSR)

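Adjust the values in the -subj option to reflect your organization. If you connect to the Harbor host by FQDN, you must specify the FQDN as the common name (CN) attribute and use it in the key and CSR file names.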

openssl req -sha512 -new \
    -subj "/C=CN/ST=Shandong/L=Jinan/O=LNSoft/OU=AI/CN=lnsoft.com" \
    -key lnsoft.com.key \
    -out lnsoft.com.csr

Generate an x509 v3 extension file: cat > v3.
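The CN naming rule above is easy to break when adapting these commands. The sketch below (hypothetical helpers) parses the `-subj` string and checks that the key and CSR file names are built from the CN. It assumes the simple `/KEY=value/...` form used here and does not handle escaped slashes inside values.

```python
# Hypothetical helpers, not from the original post.
def parse_subj(subj: str) -> dict:
    """Split an OpenSSL -subj string such as "/C=CN/.../CN=host" into a dict."""
    fields = {}
    for part in subj.strip("/").split("/"):
        key, _, value = part.partition("=")
        fields[key] = value
    return fields


def check_harbor_names(subj: str, key_file: str, csr_file: str) -> list:
    """Check that the key and CSR file names are <CN>.key and <CN>.csr."""
    cn = parse_subj(subj).get("CN", "")
    problems = [] if cn else ["no CN in -subj"]
    for name, suffix in ((key_file, ".key"), (csr_file, ".csr")):
        if name != cn + suffix:
            problems.append(f"{name} should be {cn}{suffix}")
    return problems


subj = "/C=CN/ST=Shandong/L=Jinan/O=LNSoft/OU=AI/CN=lnsoft.com"
print(check_harbor_names(subj, "lnsoft.com.key", "lnsoft.com.csr"))  # []
```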

Training RetinaNet on a Custom Dataset

# Directory structure after annotation
project
└── labelimg
    ├── 20190128155421222575013.jpg
    ├── 20190128155421222575013.xml
    ├── 20190128155703035712899.jpg
    ├── 20190128155703035712899.xml
    ├── 20190129091126392737624.jpg
    └── 20190129091126392737624.xml
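LabelImg saves one Pascal VOC XML file next to each image. keras-retinanet is the fizyr project whose ResNet-50 weights the build that follows downloads. It can also read a CSV with one `path,x1,y1,x2,y2,class_name` row per box, plus a separate `class_name,id` mapping file. The excerpt does not say which input format the post uses, so the converter below (hypothetical helpers) is a sketch that assumes the CSV route. Images without objects become a `path,,,,,` row, which is how that CSV format marks negatives.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


# Hypothetical helpers, not from the original post.
def voc_to_csv_rows(xml_text: str, image_path: str) -> list:
    """Convert one LabelImg/Pascal VOC annotation into keras-retinanet CSV rows."""
    root = ET.fromstring(xml_text)
    rows = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        # VOC stores pixel corners; float() also tolerates values like "12.0".
        coords = [int(float(box.find(tag).text)) for tag in ("xmin", "ymin", "xmax", "ymax")]
        rows.append(",".join([image_path, *map(str, coords), obj.find("name").text]))
    if not rows:
        rows.append(f"{image_path},,,,,")  # image without any objects
    return rows


def convert_folder(folder: str) -> list:
    """Pair every *.xml in folder with the .jpg of the same name and convert it."""
    rows = []
    for xml_path in sorted(Path(folder).glob("*.xml")):
        rows.extend(voc_to_csv_rows(xml_path.read_text(), str(xml_path.with_suffix(".jpg"))))
    return rows
```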

Manual build:
FROM gouchicao/tensorflow:2.2.0-gpu-jupyter-opencv4-pillow-wget-curl-git-nano
LABEL maintainer="wang-junjian@qq.com"

WORKDIR /
RUN mkdir -p /root/.keras/models/ && \
    wget -O /root/.keras/models/ResNet-50-model.keras.h5 https://github.com/fizyr/keras-models/releases/download/v0.0.1/ResNet-50-model.keras.

Building a YOLOv4 Container for a Custom Dataset

FROM nvidia/cuda:10.0-cudnn7-devel-ubuntu18.04
LABEL maintainer="wang-junjian@qq.com"

# Let tzdata (an OpenCV dependency) install without prompting
ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update && apt-get install -y \
    git wget nano \
    libopencv-dev python3-opencv \
    && rm -rf /var/lib/apt/lists/*

# Set your local time zone
RUN ln -fs /usr/share/zoneinfo/Asia/Shanghai /etc/localtime

WORKDIR /
// ...
  • Build the image
docker build -t darknet:latest-gpu-yolov4 .
  • Training samples: train.txt
images/IMG_9255.JPG
images/IMG_9266.JPG
images/IMG_9280.JPG
  • Validation samples: valid.txt
images/IMG_9263.JPG
  • Annotation classes: voc.names
close
open
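train.txt and valid.txt are plain lists of image paths, one per line. With more images than the four shown here, writing them by hand gets tedious. Below is a sketch of a reproducible random split (hypothetical helpers). The 25% validation share and the fixed seed are arbitrary choices, not values from the post. For four images they yield the same 3/1 split sizes as above, though not necessarily the same files.

```python
import random


# Hypothetical helpers, not from the original post.
def split_dataset(images, valid_ratio=0.25, seed=0):
    """Shuffle image paths reproducibly and split them into (train, valid) lists."""
    items = sorted(images)
    random.Random(seed).shuffle(items)
    n_valid = max(1, round(len(items) * valid_ratio)) if items else 0
    return items[n_valid:], items[:n_valid]


def write_list(path, items):
    """Write one image path per line, the layout of train.txt / valid.txt."""
    with open(path, "w") as f:
        f.write("\n".join(items) + "\n")


train, valid = split_dataset(["images/IMG_9255.JPG", "images/IMG_9263.JPG",
                              "images/IMG_9266.JPG", "images/IMG_9280.JPG"])
print(train, valid)
```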

Training MaskRCNN on a Custom Dataset with Detectron

  1. Edit the network configuration file
nano /detectron/project/12_2017_baselines/e2e_mask_rcnn_R-101-FPN_2x.yaml
MODEL:
  TYPE: generalized_rcnn
  CONV_BODY: FPN.add_fpn_ResNet101_conv5_body
  NUM_CLASSES: 2
  FASTER_RCNN: True
  MASK_ON: True
NUM_GPUS: 1
SOLVER:
  WEIGHT_DECAY: 0.0001
  LR_POLICY: steps_with_decay
  BASE_LR: 0.002
  GAMMA: 0.1
  MAX_ITER: 4000
  STEPS: [0, 3000, 4000]
FPN:
  FPN_ON: True
  MULTILEVEL_ROIS: True
  MULTILEVEL_RPN: True
FAST_RCNN:
  ROI_BOX_HEAD: fast_rcnn_heads.add_roi_2mlp_head
  ROI_XFORM_METHOD: RoIAlign
  ROI_XFORM_RESOLUTION: 7
  ROI_XFORM_SAMPLING_RATIO: 2
MRCNN:
  ROI_MASK_HEAD: mask_rcnn_heads.mask_rcnn_fcn_head_v1up4convs
  RESOLUTION: 28  # (output mask resolution) default 14
  ROI_XFORM_METHOD: RoIAlign
  ROI_XFORM_RESOLUTION: 14  # default 7
  ROI_XFORM_SAMPLING_RATIO: 2  # default 0
  DILATION: 1  # default 2
  CONV_INIT: MSRAFill  # default GaussianFill
TRAIN:
  WEIGHTS: https://dl.fbaipublicfiles.com/detectron/ImageNetPretrained/MSRA/R-101.pkl
  DATASETS: ('coco_helmet_train', 'coco_helmet_val')
  SCALES: (800,)
  MAX_SIZE: 1333
  BATCH_SIZE_PER_IM: 512
  RPN_PRE_NMS_TOP_N: 2000  # Per FPN level
TEST:
  DATASETS: ('coco_2014_minival',)
  SCALE: 800
  MAX_SIZE: 1333
  NMS: 0.5
  RPN_PRE_NMS_TOP_N: 1000  # Per FPN level
  RPN_POST_NMS_TOP_N: 1000
OUTPUT_DIR: .
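With `LR_POLICY: steps_with_decay`, Detectron uses a learning rate of BASE_LR × GAMMA^i. Here i is the index of the last `STEPS` entry the current iteration has reached (`STEPS` starts at 0, so i = 0 at the beginning). The sketch below re-implements that rule for this config, with hypothetical function names. The rate is 0.002 until iteration 3000 and 0.0002 from then on. The step at 4000 coincides with `MAX_ITER`, so it never takes effect. Detectron also applies a linear warm-up at the start of training by default, which the sketch leaves out.

```python
# Hypothetical re-implementation for illustration, not Detectron's own code.
def get_step_index(cur_iter, steps, max_iter):
    """Index of the last step boundary that cur_iter has reached."""
    boundaries = list(steps) + [max_iter]
    for ind, step in enumerate(boundaries):
        if cur_iter < step:
            break
    return ind - 1


def lr_steps_with_decay(cur_iter, base_lr=0.002, gamma=0.1,
                        steps=(0, 3000, 4000), max_iter=4000):
    """Learning rate under steps_with_decay for the config above (no warm-up)."""
    return base_lr * gamma ** get_step_index(cur_iter, steps, max_iter)


for it in (0, 2999, 3000, 3999):
    print(it, lr_steps_with_decay(it))
```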

A Containerized Product for Training and Deploying YOLOv3 Models with the Darknet Framework

  • Example: platen-switch is used here
    platen-switch/
    ├── cfg
    │   └── voc.names
    ├── images
    │   ├── IMG_9255.JPG
    │   ├── IMG_9263.JPG
    │   ├── IMG_9266.JPG
    │   └── IMG_9280.JPG
    ├── labels
    │   ├── IMG_9255.txt
    │   ├── IMG_9263.txt
    │   ├── IMG_9266.txt
    │   └── IMG_9280.txt
    └── test
        ├── IMG_9256.JPG
        └── IMG_9271.JPG
    
  1. Run the darknet container
    • Bind-mount the project directory into the container
    # Set the variable project_dir to the absolute path of your project
    $ project_dir='/home/wjunjian/github/gouchicao/darknet/model-zoo/platen-switch'
    $ sudo docker run --runtime=nvidia -it --name=darknet \
        --volume=$project_dir:/darknet/project \
        gouchicao/darknet:latest-gpu
    

Create a project for model training. Creating the project automatically generates the data needed before training:
$ python3 create_project
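Each file under labels/ holds one line per object. The line gives the class index (the 0-based line number of the class in cfg/voc.names), then the box center x, center y, width and height, all normalized by the image size. Below is a sketch (hypothetical helpers) that converts a pixel box into such a line. It also derives the label path for an image, roughly as darknet does: it swaps the images directory for labels and the extension for .txt.

```python
from pathlib import PurePosixPath


# Hypothetical helpers, not from the original post.
def yolo_label_line(class_id, box, image_size):
    """Convert a pixel box (x1, y1, x2, y2) into a Darknet/YOLO label line.

    image_size is (width, height); all four box values are normalized to [0, 1].
    """
    x1, y1, x2, y2 = box
    width, height = image_size
    cx = (x1 + x2) / 2 / width
    cy = (y1 + y2) / 2 / height
    bw = (x2 - x1) / width
    bh = (y2 - y1) / height
    return f"{class_id} {cx:.6f} {cy:.6f} {bw:.6f} {bh:.6f}"


def label_path_for(image_path: str) -> str:
    """images/IMG_9255.JPG -> labels/IMG_9255.txt (sibling labels directory)."""
    p = PurePosixPath(image_path)
    return str(p.parent.parent / "labels" / (p.stem + ".txt"))


print(yolo_label_line(0, (50, 25, 150, 75), (200, 100)))
# 0 0.500000 0.500000 0.500000 0.500000
```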