llm-inference - Tag - 军舰的日志

Thursday, October 10, 2024

LLM Inference Stress Testing on the Huawei Atlas 800I A2 Server

LLM inference stress-testing tool

git clone https://github.com/modelscope/evalscope
cd evalscope

pip install -e .

Using the stress-test command

evalscope perf \
    --api openai \
    --url 'http://127.0.0.1:1025/v1/chat/completions' \
    --model 'qwen' \
    --dataset openqa \
    --dataset-path './datasets/open_qa.jsonl' \
    --max-prompt-length 8000 \
    --stop '<|im_end|>' \
    --read-timeout=120 \
    --parallel 100 \
    -n 1000

❌ Do not add --stream; it frequently causes problems.

--read-timeout: network read timeout
--parallel: number of concurrent requests
-n: total number of requests
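
To make the roles of --parallel and -n concrete, here is a minimal Python sketch of the same idea: a fixed total number of requests pushed through a bounded pool of workers. It is not how evalscope is implemented. `run_load` and `fake_request` are hypothetical names, and the request function is a stub that sleeps instead of calling a server.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_request(i: int) -> float:
    """Stand-in for one chat-completion call; returns its latency in seconds."""
    start = time.perf_counter()
    time.sleep(0.01)  # pretend the server needed about 10 ms
    return time.perf_counter() - start


def run_load(request_fn, total: int, parallel: int) -> dict:
    """Send `total` requests with at most `parallel` of them in flight at once."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        latencies = list(pool.map(request_fn, range(total)))
    elapsed = time.perf_counter() - start
    return {
        "requests": len(latencies),
        "elapsed_s": elapsed,
        "throughput_rps": len(latencies) / elapsed,
        "avg_latency_s": sum(latencies) / len(latencies),
    }


# Analogous to `--parallel 10 -n 50`, but against the stub above.
stats = run_load(fake_request, total=50, parallel=10)
print(stats)
```

Raising `parallel` while keeping `total` fixed shortens the run only until the server saturates, which is the point a stress test is trying to find.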

Dataset: HC3-Chinese (Chinese chat)

mkdir datasets
wget https://modelscope.cn/datasets/AI-ModelScope/HC3-Chinese/resolve/master/open_qa.

2024-10-10 10:00

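Tuesday, October 8, 2024

Setting Up an LLM Inference Service on the Huawei Atlas 800I A2 Server

Layer-by-layer comparison of the Huawei Ascend NPU and NVIDIA GPU ecosystems: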

NPU	GPU
CANN	CUDA
MindSpore	PyTorch
MindFormers	Transformers
MindIE	vLLM

Downloading the models

cd /home/luruan/disk1/models

Large language models

Qwen1.5-7B

git clone https://www.modelscope.cn/Qwen/Qwen1.5-7B-Chat.git

Qwen2-7B ❌

git clone https://www.modelscope.cn/Qwen/Qwen2-7B-Instruct.git

Qwen2-72B

git clone https://www.modelscope.cn/Qwen/Qwen2-72B-Instruct.git

Code models

DeepSeek-Coder-6.7B

git clone https://www.modelscope.cn/deepseek-ai/deepseek-coder-6.7b-instruct.git

StarCoder2-15B ❌

git clone https://www.modelscope.cn/AI-ModelScope/starcoder2-15b.git

CodeGeeX2-6B ❌

git clone https://www.modelscope.cn/ZhipuAI/codegeex2-6b.git

2024-10-08 10:00

huawei-atlas ascend-npu mindie vllm qwen modelscope docker domestic-substitution llm-inference

Monday, October 7, 2024

OpenAI API Compatibility

Set the API key

export LITELLM_API_KEY=sk-1234

Service ports

Ollama: 11434
LiteLLM: 4000
Xinference: 9997
MindIE: 1025

models

Ollama

curl -s http://localhost:11434/v1/models \
    | jq -r '.data[].id'

curl -s: the -s option enables silent mode, which suppresses progress output.
jq -r: the -r option emits raw output, without the surrounding quotes.

LiteLLM

curl -s http://localhost:4000/v1/models \
    -H "Authorization: Bearer $LITELLM_API_KEY" \
    | jq -r '.data[].id'

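In Bash, single and double quotes behave differently in important ways. This is why the Authorization header above is written in double quotes: the shell must substitute $LITELLM_API_KEY before curl runs.

Single quotes (')
- Fully literal: everything inside single quotes is taken literally; no character is expanded or interpreted.
- No variable expansion: variables inside single quotes are not substituted. For example, '$LITELLM_API_KEY' is treated as the literal string $LITELLM_API_KEY, not as the variable's value.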
```
echo '$LITELLM_API_KEY'  # Output: $LITELLM_API_KEY
```
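
The model-listing calls earlier in this post can also be made from Python, which is handy when the result feeds a script. The sketch below mirrors `curl ... | jq -r '.data[].id'` using only the standard library. `extract_model_ids` and `list_models` are hypothetical helper names, and the model ids in `sample` are made up for the offline check.

```python
import json
import os
import urllib.request


def extract_model_ids(payload: dict) -> list:
    """Same selection as jq's '.data[].id'."""
    return [item["id"] for item in payload.get("data", [])]


def list_models(base_url: str, api_key: str = "") -> list:
    """GET {base_url}/v1/models and return the model ids."""
    request = urllib.request.Request(f"{base_url}/v1/models")
    if api_key:
        # Equivalent to: -H "Authorization: Bearer $LITELLM_API_KEY"
        request.add_header("Authorization", f"Bearer {api_key}")
    with urllib.request.urlopen(request, timeout=10) as response:
        return extract_model_ids(json.load(response))


# Offline check of the parsing step with a hand-written sample response.
sample = {"object": "list", "data": [{"id": "qwen"}, {"id": "llama3"}]}
print(extract_model_ids(sample))

# Against running services (ports taken from the list in this post):
# print(list_models("http://localhost:11434"))                              # Ollama
# print(list_models("http://localhost:4000", os.environ["LITELLM_API_KEY"]))  # LiteLLM
```

Reading the key from `os.environ` plays the same role as the double-quoted `$LITELLM_API_KEY` in the curl command: the value is substituted at run time rather than written into the script.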

2024-10-07 10:00

openai-api ollama litellm xinference mindie api-compatibility curl llm-inference

Thursday, October 3, 2024

Deploying an LLM Inference Service with Multiple LoRA Adapters

Text Generation Inference

conda create -n text-generation-inference python=3.9
conda activate text-generation-inference

git clone https://github.com/huggingface/text-generation-inference.git && cd text-generation-inference
BUILD_EXTENSIONS=True make install

vLLM

conda create -n vllm python=3.10 -y
conda activate vllm
pip install vllm

cd ~/HuggingFace/mistralai/Mistral-7B-v0.1
git clone https://huggingface.co/predibase/magicoder adapters/magicoder

vllm serve "$(pwd)" \
    --enable-lora \
    --lora-modules magicoder="$(pwd)/adapters/magicoder"
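
Once the server is up, a client selects the adapter through the `model` field of an OpenAI-style request, using the name given on the left of `--lora-modules`. The following is a minimal sketch with the standard library. It assumes vLLM's default port 8000 and uses the completions endpoint because Mistral-7B-v0.1 is a base model. `build_completion_request` and `complete` are hypothetical helper names, and the prompt is only an example.

```python
import json
import urllib.request


def build_completion_request(model: str, prompt: str, max_tokens: int = 64) -> dict:
    """Body for POST /v1/completions; `model` picks the adapter or the base model."""
    return {"model": model, "prompt": prompt, "max_tokens": max_tokens}


def complete(base_url: str, body: dict) -> str:
    """Send the request and return the text of the first completion."""
    request = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(body).encode("utf-8"),  # a request with data is sent as POST
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)["choices"][0]["text"]


# "magicoder" is the adapter name registered with --lora-modules.
body = build_completion_request("magicoder", "def fibonacci(n):")
print(json.dumps(body))

# To actually call the server (assumption: default port 8000):
# print(complete("http://localhost:8000", body))
```

Registering more adapters on the command line and switching the `model` value per request is what makes a single server a multi-LoRA service.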

2024-10-03 10:00

lora vllm text-generation-inference multi-lora llm-inference hugging-face mistral nvidia-nim

Tuesday, October 1, 2024

Speculative Decoding

Speculative Decoding

Fast Inference from Transformers via Speculative Decoding

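Draft generation: a small, fast model (called Mq) generates a short run of draft tokens. Because this model is cheap to run, the drafts arrive quickly.
Parallel evaluation: a larger target model (called Mp) then evaluates all of the tokens drafted by Mq at once. Mp computes the probability of each token and keeps those that are sufficiently likely.
Output correction: for a low-probability token that Mq drafted but Mp rejected, Mp supplies a replacement token. This step preserves output quality while increasing generation speed.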
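
The accept/reject rule behind the evaluation and correction steps can be written compactly. The following is a sketch in the notation of the paper named above, where q(x) and p(x) are the probabilities that Mq and Mp assign to a drafted token x at one position. It is included to show why the correction step leaves the output distribution equal to that of Mp.

```latex
\begin{align*}
\Pr[\text{accept } x] &= \min\!\left(1, \frac{p(x)}{q(x)}\right), \\
p'(x) &= \frac{\max\bigl(0,\, p(x) - q(x)\bigr)}{\sum_{y} \max\bigl(0,\, p(y) - q(y)\bigr)}
  \quad \text{(resampling distribution after a rejection)}, \\
\Pr[\text{output } x] &= \underbrace{q(x)\min\!\left(1, \tfrac{p(x)}{q(x)}\right)}_{=\,\min(p(x),\,q(x))}
  + \underbrace{\Bigl(1 - \sum_{y}\min\bigl(p(y), q(y)\bigr)\Bigr)}_{\text{rejection probability}}\, p'(x) \\
&= \min\bigl(p(x), q(x)\bigr) + \max\bigl(0,\, p(x) - q(x)\bigr) = p(x).
\end{align*}
```

The last line uses the identity that the sum over y of max(0, p(y) - q(y)) equals 1 minus the sum over y of min(p(y), q(y)), which holds because both p and q sum to 1.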

Serving AI models faster with speculative decoding
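1. Generate several candidate guesses: use a smaller, more efficient "draft" model, or the last layer of the main model itself, to generate several possible next tokens as guesses.
2. Evaluate the guesses in parallel: use the main large language model (LLM) to evaluate these guesses in parallel, computing the probability distribution for each guess.
3. Accept or reject each guess: compare the probability of each guess under the LLM and under the draft model, and draw a random number, to decide whether to accept the guess.
4. Adjust and resample: if every guess is accepted, sample the next token directly from the LLM. If a guess is rejected, resample that token from the adjusted probability distribution.
5. Output the result: the final output consists of all accepted guesses plus the token sampled or resampled from the LLM.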
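
The five steps above can be sketched in Python with toy models. This is an illustration under stated assumptions, not any serving framework's implementation: `draft` and `target` are hypothetical stand-ins that return a fixed distribution over a 3-token vocabulary, and the per-position loop over `target` stands in for what a real system does in one batched forward pass.

```python
import random


def sample(dist, rng):
    """Draw an index from a list of probabilities."""
    return rng.choices(range(len(dist)), weights=dist, k=1)[0]


def residual(p, q):
    """Adjusted distribution norm(max(0, p - q)) used after a rejection."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    return [d / total for d in diff]


def speculative_step(prefix, draft, target, gamma, rng):
    """Return the tokens produced by one draft-then-verify round."""
    # Step 1: the draft model proposes gamma tokens autoregressively.
    guesses, q_dists = [], []
    for _ in range(gamma):
        q = draft(prefix + guesses)
        q_dists.append(q)
        guesses.append(sample(q, rng))
    # Step 2: the target model scores every position.
    p_dists = [target(prefix + guesses[:i]) for i in range(gamma + 1)]
    # Steps 3 and 4: accept with probability min(1, p/q); stop at the first rejection.
    out = []
    for i, tok in enumerate(guesses):
        p, q = p_dists[i], q_dists[i]
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
        else:
            out.append(sample(residual(p, q), rng))
            return out  # later guesses are discarded
    # Every guess was accepted: take one extra token from the target model.
    out.append(sample(p_dists[gamma], rng))
    return out  # Step 5: accepted guesses plus one token from the target


def draft(tokens):   # hypothetical small model
    return [0.6, 0.3, 0.1]


def target(tokens):  # hypothetical large model
    return [0.3, 0.5, 0.2]


rng = random.Random(0)
print(speculative_step([], draft, target, gamma=4, rng=rng))
```

Each round emits between 1 and gamma + 1 tokens for a single verification pass of the large model, which is where the speedup comes from when the draft model agrees with the target often.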

2024-10-01 10:00

speculative-decoding llm-inference inference-acceleration draft-model vllm text-generation-inference qwen

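5 posts tagged "llm-inference"

Thursday, October 10, 2024

LLM Inference Stress Testing on the Huawei Atlas 800I A2 Server

Tuesday, October 8, 2024

Setting Up an LLM Inference Service on the Huawei Atlas 800I A2 Server

Monday, October 7, 2024

OpenAI API Compatibility

Thursday, October 3, 2024

Deploying an LLM Inference Service with Multiple LoRA Adapters

Tuesday, October 1, 2024

Speculative Decoding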
