2 篇文章带有标签 “TeslaT4”

2024年3月15日星期五

vLLM 部署 Qwen1.5 LLM

下载模型

git clone https://www.modelscope.cn/qwen/Qwen1.5-7B-Chat-GPTQ-Int4.git

启动服务

python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 --port 9000 \
    --model Qwen/Qwen1.5-7B-Chat-GPTQ-Int4 \
    --quantization gptq \
    --tensor-parallel-size 2 \
    --dtype=half \
    --gpu-memory-utilization 0.95

可以使用环境变量 CUDA_VISIBLE_DEVICES=2,3 来指定使用的 GPU。
--dtype=half T4 不支持 bfloat16，可以使用 float16。
--gpu-memory-utilization 默认为 0.9，这里因为 Qwen 的上下文为 32k，0.9 还不能满足，也可以通过 max-model-len 参数来调整上下文长度。

使用 curl 测试

chat completions curl http://localhost:9000/v1/chat/completions \ -H "Content-Type: application/json" \ -d &

2024年3月15日 2 分钟 427 字

2024年1月17日星期三

LLM 的基准测试

Ampere, Ada, or Hopper GPUs (e.g., A100, RTX 3090, RTX 4090, H100). Support for Turing GPUs (T4, RTX 2080) is coming soon, please use FlashAttention 1.x for Turing GPUs for now.
Datatype fp16 and bf16 (bf16 requires Ampere, Ada, or Hopper GPUs).
All head dimensions up to 256. Head dim > 192 backward requires A100/A800 or H100/H800.

Turing GPU T4 不支持，需要使用 FlashAttention 1.x，否则会报错 ❌：

data: {
  "text": "**NETWORK ERROR DUE TO HIGH TRAFFIC. PLEASE REGENERATE OR REFRESH THIS PAGE.**\n\n(FlashAttention only supports Ampere GPUs or newer.)", 
  "error_code": 50001
}

2024年1月17日 4 分钟 958 字

LLM Benchmark 测速 wrk Qwen FastChat vLLM TeslaT4

2 篇文章带有标签 “TeslaT4”

2024年3月15日 星期五

vLLM 部署 Qwen1.5 LLM

2024年1月17日 星期三

LLM 的基准测试

2024年3月15日星期五

2024年1月17日星期三