vllm - Page 2 - Tags - 军舰的日志

Thursday, October 10, 2024

LLM Inference Stress Testing on the Huawei Atlas 800I A2 Server

LLM inference stress-testing tool

git clone https://github.com/modelscope/evalscope
cd evalscope

pip install -e .

Using the stress-test command

evalscope perf \
    --api openai \
    --url 'http://127.0.0.1:1025/v1/chat/completions' \
    --model 'qwen' \
    --dataset openqa \
    --dataset-path './datasets/open_qa.jsonl' \
    --max-prompt-length 8000 \
    --stop '<|im_end|>' \
    --read-timeout=120 \
    --parallel 100 \
    -n 1000

❌ Do not add --stream; it frequently causes problems.

--read-timeout: network read timeout
--parallel: number of concurrent requests
-n: total number of requests
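
To make the roles of --parallel and -n concrete, here is a minimal Python sketch of a load generator that issues a fixed number of requests while keeping at most a fixed number in flight. It is not how evalscope is implemented; run_load and fake_send are hypothetical names used only for illustration, and fake_send merely sleeps instead of calling a real server.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_load(send_request, total_requests, parallel):
    """Call send_request(i) total_requests times with at most `parallel` in flight.

    Returns (per-request latencies in seconds, total wall-clock seconds).
    """
    def timed_call(i):
        start = time.perf_counter()
        send_request(i)
        return time.perf_counter() - start

    wall_start = time.perf_counter()
    # The pool size plays the role of --parallel; the range plays the role of -n.
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        latencies = list(pool.map(timed_call, range(total_requests)))
    return latencies, time.perf_counter() - wall_start


def fake_send(i):
    """Hypothetical stand-in for one HTTP request to the inference server."""
    time.sleep(0.01)


if __name__ == "__main__":
    latencies, wall = run_load(fake_send, total_requests=20, parallel=5)
    print(f"requests: {len(latencies)}")
    print(f"mean latency: {sum(latencies) / len(latencies):.4f} s")
    print(f"throughput: {len(latencies) / wall:.1f} req/s")
```

With a fixed request count, raising the pool size shortens the wall-clock time until the server saturates, which is the regime a stress test is meant to find.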

Dataset: HC3-Chinese (Chinese chat)

mkdir datasets
wget https://modelscope.cn/datasets/AI-ModelScope/HC3-Chinese/resolve/master/open_qa.

2024-10-10 10:00

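Tuesday, October 8, 2024

Setting Up an LLM Inference Service on the Huawei Atlas 800I A2 Server

Layer-by-layer comparison of the Huawei Ascend NPU and NVIDIA GPU ecosystems: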

NPU	GPU
CANN	CUDA
MindSpore	PyTorch
MindFormers	Transformers
MindIE	vLLM

Download the models

cd /home/luruan/disk1/models

Large language models

Qwen1.5-7B

git clone https://www.modelscope.cn/Qwen/Qwen1.5-7B-Chat.git

Qwen2-7B ❌

git clone https://www.modelscope.cn/Qwen/Qwen2-7B-Instruct.git

Qwen2-72B

git clone https://www.modelscope.cn/Qwen/Qwen2-72B-Instruct.git

Code models

DeepSeek-Coder-6.7B

git clone https://www.modelscope.cn/deepseek-ai/deepseek-coder-6.7b-instruct.git

StarCoder2-15B ❌

git clone https://www.modelscope.cn/AI-ModelScope/starcoder2-15b.git

CodeGeeX2-6B ❌

git clone https://www.modelscope.cn/ZhipuAI/codegeex2-6b.git

2024-10-08 10:00

huawei-atlas ascend-npu mindie vllm qwen modelscope docker 国产化 llm-inference

Thursday, October 3, 2024

Deploying an LLM Inference Service with Multiple LoRA Adapters

Text Generation Inference

conda create -n text-generation-inference python=3.9
conda activate text-generation-inference

git clone https://github.com/huggingface/text-generation-inference.git && cd text-generation-inference
BUILD_EXTENSIONS=True make install

vLLM

conda create -n vllm python=3.10 -y
conda activate vllm
pip install vllm

cd ~/HuggingFace/mistralai/Mistral-7B-v0.1
git clone https://huggingface.co/predibase/magicoder adapters/magicoder

vllm serve `pwd` \
    --enable-lora \
    --lora-modules magicoder=`pwd`/adapters/magicoder
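
Once the server is up, a LoRA adapter is selected per request by passing its registered name (here magicoder) in the model field of the OpenAI-compatible API. The following is a minimal Python sketch using only the standard library. It assumes the server listens on localhost:8000 (vLLM's default port); build_completion_request and complete are hypothetical helper names, not part of vLLM.

```python
import json
import urllib.request


def build_completion_request(base_url, model_name, prompt, max_tokens=64):
    """Build a POST request for an OpenAI-compatible /v1/completions endpoint."""
    payload = {"model": model_name, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(request, timeout=60):
    """Send the request and return the text of the first completion."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


if __name__ == "__main__":
    # Assumption: the server started above is reachable at localhost:8000.
    req = build_completion_request(
        "http://localhost:8000", "magicoder", "def fibonacci(n):"
    )
    print(req.full_url)
    print(req.data.decode("utf-8"))
    # Uncomment once the server is running:
    # print(complete(req))
```

Sending the same prompt with the base model's name instead of magicoder should route the request to the unadapted weights, which makes side-by-side comparison straightforward.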

2024-10-03 10:00

lora vllm text-generation-inference multi-lora llm-inference hugging-face mistral nvidia-nim

Tuesday, October 1, 2024

Speculative Decoding

Speculative Decoding

Fast Inference from Transformers via Speculative Decoding

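Draft generation: a small, fast model (called Mq) generates a sequence of draft tokens. Because this model is cheap to run, the drafts are produced quickly.
Parallel evaluation: a larger target model (called Mp) then evaluates all of the tokens generated by Mq at once. Mp computes the probability of each token and accepts those it considers sufficiently likely.
Output correction: for low-probability tokens that Mq generated but Mp rejected, Mp supplies a replacement token. This step preserves output quality while still speeding up generation.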

Serving AI models faster with speculative decoding
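1. Generate multiple guesses: use a smaller, more efficient "draft" model, or the last layer of the main model itself, to generate several candidate next tokens as guesses.
2. Evaluate the guesses in parallel: use the main large language model (LLM) to evaluate these guesses in parallel, computing a probability distribution for each guess.
3. Accept or reject each guess: compare the probability of each guess under the LLM and under the draft model, and draw a random number to decide whether to accept it.
4. Adjust and resample: if all guesses are accepted, sample the next token directly from the LLM. If a guess is rejected, resample that position from the adjusted probability distribution.
5. Output the result: the final output consists of all accepted guesses plus the token sampled or resampled from the LLM.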
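
The accept/reject and resampling steps above can be sketched on toy distributions in Python. This is an illustrative sketch, not code from either article: it uses the acceptance rule from "Fast Inference from Transformers via Speculative Decoding" (accept a draft token x with probability min(1, p(x)/q(x)), and on rejection resample from the normalized max(0, p - q)), represents each distribution as a plain list of probabilities, and all function names are hypothetical.

```python
import random


def residual_distribution(p, q):
    """Return normalized max(0, p - q), the adjusted distribution used after a rejection."""
    raw = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(raw)
    if total == 0.0:
        # p and q are identical, so fall back to the target distribution.
        return list(p)
    return [r / total for r in raw]


def sample(dist, rng):
    """Draw one token index from a list of probabilities."""
    return rng.choices(range(len(dist)), weights=dist, k=1)[0]


def speculative_step(draft_tokens, q_dists, p_dists, rng):
    """Run one accept/reject pass over the draft tokens.

    draft_tokens[i] was sampled from q_dists[i] (draft model).
    p_dists holds the target model's distributions and has one more entry
    than draft_tokens, for the position after the last draft token.
    """
    output = []
    for i, token in enumerate(draft_tokens):
        p, q = p_dists[i], q_dists[i]
        accept_probability = min(1.0, p[token] / q[token])
        if rng.random() < accept_probability:
            output.append(token)
        else:
            # First rejection: resample this position and drop later drafts.
            output.append(sample(residual_distribution(p, q), rng))
            return output
    # Every draft token was accepted: take one extra token from the target model.
    output.append(sample(p_dists[len(draft_tokens)], rng))
    return output


if __name__ == "__main__":
    rng = random.Random(0)
    q_dists = [[0.6, 0.4], [0.5, 0.5]]
    p_dists = [[0.5, 0.5], [0.2, 0.8], [0.9, 0.1]]
    print(speculative_step([0, 1], q_dists, p_dists, rng))
```

Each pass emits between one token and len(draft_tokens) + 1 tokens, which is where the speedup comes from when the draft model agrees with the target model most of the time.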

2024-10-01 10:00

speculative-decoding llm-inference inference-acceleration draft-model vllm text-generation-inference qwen

Friday, September 6, 2024

SGLang LLM Serving Framework

SGLang is a fast serving framework for large language models and vision language models. It makes your interaction with models faster and more controllable by co-designing the backend runtime and frontend language.

The core features include:

Fast Backend Runtime: Efficient serving with RadixAttention for prefix caching, jump-forward constrained decoding, continuous batching, token attention (paged attention), tensor parallelism, FlashInfer kernels, and quantization (AWQ/FP8/GPTQ/Marlin).

2024-09-06 08:00

sglang vllm llm-serving flashinfer tensor-parallelism quantization qwen2 cuda

Friday, March 15, 2024

Deploying Qwen1.5 LLMs with vLLM

Install vLLM

# (Optional) Create a new conda environment.
conda create -n vllm python=3.9 -y
conda activate vllm

# Install vLLM with CUDA 12.1.
pip install vllm

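vLLM help

vLLM OpenAI-compatible RESTful API server.

Optional arguments:
  -h, --help: show this help message and exit
  --host HOST: host name
  --port PORT: port number
  --allow-credentials: allow credentials
  --allowed-origins ALLOWED_ORIGINS: allowed origins
  --allowed-methods ALLOWED_METHODS: allowed methods
  --allowed-headers ALLOWED_HEADERS: allowed headers
  --api-key API_KEY: if provided, the server requires this key to be presented in the request header
  --served-model-name SERVED_MODEL_NAME: the model name used in the API; if not specified, it is the same as the Hugging Face name
  --lora-modules LORA_MODULES [LORA_MODULES ...]: LoRA module configurations in the format name=path; multiple modules can be specified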
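
The --api-key and --served-model-name options both affect what a client must send. The sketch below, in Python with only the standard library, builds a chat request that carries the key as a Bearer token and uses the served name as the model. The URL, key, and model name are placeholders chosen for illustration, and build_chat_request is a hypothetical helper, not a vLLM function.

```python
import json
import urllib.request


def build_chat_request(base_url, api_key, served_model_name, user_message):
    """Build an authenticated POST request for /v1/chat/completions."""
    payload = {
        # Must match --served-model-name (or the model path if that flag is unset).
        "model": served_model_name,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Must match the value passed to --api-key.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder values; replace them with the ones used to start the server.
    req = build_chat_request("http://localhost:8000", "my-secret-key", "qwen", "Hello")
    print(req.full_url)
    print(req.get_header("Authorization"))
```

Passing the resulting request to urllib.request.urlopen sends it; a missing or wrong key should be rejected by the server before the model is invoked.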

2024-03-15 10:00

vllm llm qwen qwen1.5 deployment model-serving quantization tensor-parallelism gpu tesla-t4

Wednesday, January 17, 2024

Benchmarking LLMs

Install FastChat & vLLM

Install FastChat

Install FlashAttention

FlashAttention-2 currently supports:

Ampere, Ada, or Hopper GPUs (e.g., A100, RTX 3090, RTX 4090, H100). Support for Turing GPUs (T4, RTX 2080) is coming soon, please use FlashAttention 1.x for Turing GPUs for now.
Datatype fp16 and bf16 (bf16 requires Ampere, Ada, or Hopper GPUs).
All head dimensions up to 256. Head dim > 192 backward requires A100/A800 or H100/H800.

The Turing-based T4 GPU is not supported by FlashAttention-2; use FlashAttention 1.x instead, otherwise an error is raised ❌:

2024-01-17 08:00

llm benchmarking 测速 fastchat vllm qwen wrk tesla-t4

Tuesday, January 16, 2024

Deploying LLMs on CUDA with FastChat

Install FastChat & vLLM

Install FastChat

pip install "fschat[model_worker,webui]"

Install FlashAttention

The Turing-based T4 GPU does not support FlashAttention 2; use FlashAttention 1.x instead.
The Turing-based T4 GPU does not support bf16; use fp16 instead.
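
These two constraints can be encoded as a small lookup on the CUDA compute capability. This is a minimal Python sketch under stated assumptions: pick_inference_settings is a hypothetical helper, the T4 reports compute capability (7, 5), Ampere and newer report a major version of 8 or higher, and the fallback for older GPUs is a guess made for this sketch rather than something the notes above state.

```python
def pick_inference_settings(compute_capability):
    """Choose a dtype and FlashAttention major version from a (major, minor) tuple.

    With PyTorch installed, the tuple can be obtained from
    torch.cuda.get_device_capability().
    """
    major, _minor = compute_capability
    if major >= 8:
        # Ampere, Ada, Hopper: bf16 and FlashAttention 2 are available.
        return {"dtype": "bfloat16", "flash_attention": 2}
    if compute_capability == (7, 5):
        # Turing (for example the T4): fp16 only, FlashAttention 1.x.
        return {"dtype": "float16", "flash_attention": 1}
    # Older GPUs: assume fp16 and no FlashAttention (an assumption of this sketch).
    return {"dtype": "float16", "flash_attention": None}


if __name__ == "__main__":
    print(pick_inference_settings((7, 5)))  # Tesla T4
    print(pick_inference_settings((8, 0)))  # A100
```

The returned dtype string is the kind of value typically passed to a serving framework's dtype option when launching a model on the card in question.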

Install vLLM

pip install vllm -i https://mirrors.aliyun.com/pypi/simple/

Upgrade FastChat & vLLM

git pull
pip install -e ".[model_worker,webui]"
pip install -U vllm

Deploy the LLM

Run the Controller

python -m fastchat.serve.controller

Run the OpenAI API Server

python -m fastchat.serve.openai_api_server

Run the Model Worker

Qwen-1_8B-Chat

export CUDA_VISIBLE_DEVIC

2024-01-16 08:00

fastchat vllm cuda qwen chatglm llm-deployment openai-api flash-attention

28 posts are tagged "vllm"
