
DeepSeek-OCR: Contexts Optical Compression

DeepSeek-OCR Architecture

Training Data

Data Composition

Data Annotation

Training Pipeline

Training DeepEncoder

  • Method: Following Vary, DeepEncoder is trained with a compact language model under the next-token-prediction framework.
  • Data: All OCR 1.0 and OCR 2.0 data, plus 100M general-purpose samples drawn from the LAION dataset.
  • Training details: 2 epochs, batch size 1280, AdamW optimizer with a cosine-annealing scheduler at a learning rate of 5e-5, and a training sequence length of 4096 (a minimal sketch of this optimizer setup follows the list).
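
A minimal PyTorch sketch of that recipe (AdamW plus cosine annealing at 5e-5). This is purely illustrative: the Linear layer is a stand-in for DeepEncoder, and steps_per_epoch is a made-up placeholder, since the real value is fixed by the dataset size and batch size of 1280.

import torch

# Stand-in module; the real model is DeepEncoder trained via next-token prediction.
model = torch.nn.Linear(1024, 1024)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

steps_per_epoch = 1000                      # hypothetical placeholder
total_steps = 2 * steps_per_epoch           # 2 epochs, as described above
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)

for step in range(total_steps):
    optimizer.zero_grad()
    loss = model(torch.randn(8, 1024)).pow(2).mean()  # dummy loss
    loss.backward()
    optimizer.step()
    scheduler.step()                        # anneal the learning rate each step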

Training DeepSeek-OCR

  • Timing: begins once DeepEncoder is ready.
  • Data: the training data described above.
  • Parallelism: pipeline parallelism (PP), with the model split into 4 stages:
    • DeepEncoder (PP0, PP1):
      • PP0: the SAM and compressor components (acting as a vision tokenizer); parameters frozen.
      • PP1: the CLIP component (acting as the input embedding layer); weights unfrozen and trained.
    • Language model (PP2, PP3): DeepSeek3B-MoE has 12 layers; PP2 and PP3 each hold 6.
  • Hardware and batching: 20 nodes with 8 A100-40G GPUs each, data parallelism (DP) of 40, and a global batch size of 640 (these numbers are cross-checked in the sketch after this list).
  • Optimizer: AdamW with a step-based scheduler and an initial learning rate of 3e-5.
  • Training speed: 90B tokens/day on text-only data; 70B tokens/day on multimodal data.
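
The stated numbers are mutually consistent: 20 nodes × 8 GPUs gives 160 GPUs, 4 pipeline stages leave a data-parallel degree of 160 / 4 = 40, and each DP replica then handles 640 / 40 = 16 samples of the global batch. A quick cross-check in Python:

# Cross-check of the parallel training configuration quoted above.
nodes, gpus_per_node = 20, 8
total_gpus = nodes * gpus_per_node          # 160 GPUs in total
pp_stages = 4                               # PP0..PP3
dp = total_gpus // pp_stages                # data-parallel degree
global_batch = 640
samples_per_replica = global_batch // dp    # per-replica share of the batch
print(total_gpus, dp, samples_per_replica)  # -> 160 40 16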

Prompts

Example prompts for the model's different task modes, kept verbatim (the last line is a text-only prompt with no image):

<image>\nFree OCR.
<image>\n<|grounding|>Convert the document to markdown.
<image>\nParse the figure.
<image>\nLocate <|ref|>11-2=<|/ref|> in the image.
<image>\nDescribe this image in detail.
<image>\n请详细描述这张图片。
<image>\nLocate <|ref|>the teacher<|/ref|> in the image.
<image>\nIdentify all objects in the image and output them in bounding boxes.
<image>\n这是一张
<image>\n<|grounding|>OCR the images.
君不见,黄河之水天上来

Evaluation

Fox Benchmark

OmniDocBench Benchmark

In Practice

Image-to-Markdown Conversion

Deep Parsing

General Visual Understanding

<image>\n<|grounding|>Convert the document to markdown.

Simulating the Human Memory Forgetting Mechanism

Setting Up the Runtime Environment

Run the vLLM Container

docker run -it \
  --ipc=host \
  --net=host \
  --runtime=nvidia \
  --name=vllm \
  -v /home/lnsoft/wjj/models:/models \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v ~/.cache/modelscope:/root/.cache/modelscope \
  nvcr.io/nvidia/vllm:25.09-py3 \
  bash
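
Once inside the container, it is worth confirming that the GPU is actually visible before going further (this assumes the NVIDIA container runtime is set up on the host):

python -c "import torch; print(torch.cuda.is_available())"  # should print True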

Clone the DeepSeek-OCR Repository

git clone https://github.com/deepseek-ai/DeepSeek-OCR.git

Install Dependencies

cd DeepSeek-OCR

pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
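
The Transformers sample below loads the model with _attn_implementation='flash_attention_2', which also needs the flash-attn package. The upstream README installs it separately; the version pin below follows that README and is worth re-checking there:

pip install flash-attn==2.7.3 --no-build-isolation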

Run DeepSeek-OCR

Transformers Inference

  • run_dpsk_ocr.py
from transformers import AutoModel, AutoTokenizer
import torch
import os

os.environ["CUDA_VISIBLE_DEVICES"] = '0'
model_name = 'deepseek-ai/DeepSeek-OCR'

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, _attn_implementation='flash_attention_2', trust_remote_code=True, use_safetensors=True)
model = model.eval().cuda().to(torch.bfloat16)

# prompt = "<image>\nFree OCR. "
prompt = "<image>\n<|grounding|>Convert the document to markdown. "
image_file = 'your_image.jpg'
output_path = 'outputs'

# base_size/image_size/crop_mode below match the repo's dynamic-resolution
# ("Gundam") setting: a 1024 global view plus 640-pixel crops.
res = model.infer(
    tokenizer,
    prompt=prompt,
    image_file=image_file,
    output_path=output_path,
    base_size=1024,
    image_size=640,
    crop_mode=True,
    save_results=True,
    test_compress=True,
)

Then run the script from the cloned repository:

cd DeepSeek-OCR/DeepSeek-OCR-hf
python run_dpsk_ocr.py

FAQ

/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
Traceback (most recent call last):
  File "/models/deepseek-ai/DeepSeek-OCR/demo.py", line 8, in <module>
    model = AutoModel.from_pretrained(model_name, _attn_implementation='flash_attention_2', trust_remote_code=True, use_safetensors=True)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/models/auto/auto_factory.py", line 547, in from_pretrained
    config, kwargs = AutoConfig.from_pretrained(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/models/auto/configuration_auto.py", line 1264, in from_pretrained
    config_class = get_class_from_dynamic_module(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/dynamic_module_utils.py", line 582, in get_class_from_dynamic_module
    return get_class_in_module(class_name, final_module, force_reload=force_download)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/dynamic_module_utils.py", line 277, in get_class_in_module
    module_spec.loader.exec_module(module)
  File "<frozen importlib._bootstrap_external>", line 995, in exec_module
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "/root/.cache/huggingface/modules/transformers_modules/DeepSeek-OCR/modeling_deepseekocr.py", line 1, in <module>
    from .modeling_deepseekv2 import DeepseekV2Model, DeepseekV2ForCausalLM
  File "/root/.cache/huggingface/modules/transformers_modules/DeepSeek-OCR/modeling_deepseekv2.py", line 37, in <module>
    from transformers.models.llama.modeling_llama import (
ImportError: cannot import name 'LlamaFlashAttention2' from 'transformers.models.llama.modeling_llama' (/usr/local/lib/python3.12/dist-packages/transformers/models/llama/modeling_llama.py). Did you mean: 'LlamaAttention'?
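
The traceback above usually means the installed transformers release is newer than the model's remote code expects: the LlamaFlashAttention2 class was removed in the transformers 4.48 attention refactor. Pinning the older release listed in the repo's requirements file is the likely fix:

pip install transformers==4.46.3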

