23 篇文章带有标签 “vLLM”

2024年3月15日星期五

vLLM 部署 Qwen1.5 LLM

下载模型

git clone https://www.modelscope.cn/qwen/Qwen1.5-7B-Chat-GPTQ-Int4.git

启动服务

python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 --port 9000 \
    --model Qwen/Qwen1.5-7B-Chat-GPTQ-Int4 \
    --quantization gptq \
    --tensor-parallel-size 2 \
    --dtype=half \
    --gpu-memory-utilization 0.95

可以使用环境变量 CUDA_VISIBLE_DEVICES=2,3 来指定使用的 GPU。
--dtype=half T4 不支持 bfloat16，可以使用 float16。
--gpu-memory-utilization 默认为 0.9，这里因为 Qwen 的上下文为 32k，0.9 还不能满足，也可以通过 max-model-len 参数来调整上下文长度。

使用 curl 测试

chat completions curl http://localhost:9000/v1/chat/completions \ -H "Content-Type: application/json" \ -d &

2024年3月15日 2 分钟 427 字

2024年1月17日星期三

LLM 的基准测试

Ampere, Ada, or Hopper GPUs (e.g., A100, RTX 3090, RTX 4090, H100). Support for Turing GPUs (T4, RTX 2080) is coming soon, please use FlashAttention 1.x for Turing GPUs for now.
Datatype fp16 and bf16 (bf16 requires Ampere, Ada, or Hopper GPUs).
All head dimensions up to 256. Head dim > 192 backward requires A100/A800 or H100/H800.

Turing GPU T4 不支持，需要使用 FlashAttention 1.x，否则会报错 ❌：

data: {
  "text": "**NETWORK ERROR DUE TO HIGH TRAFFIC. PLEASE REGENERATE OR REFRESH THIS PAGE.**\n\n(FlashAttention only supports Ampere GPUs or newer.)", 
  "error_code": 50001
}

2024年1月17日 4 分钟 958 字

LLM Benchmark 测速 wrk Qwen FastChat vLLM TeslaT4

2024年1月16日星期二

使用 FastChat 在 CUDA 上部署 LLM

pip install "fschat[model_worker,webui]"

如果不支持 bfloat16，则降至 float16

📌 模型变动需要重新部署聊天机器人

2024年1月16日 1 分钟 277 字

FastChat vLLM CUDA