3 篇文章带有标签 “tesla-t4”

2024年3月15日星期五

vLLM 部署 Qwen1.5 LLM

# (Optional) Create a new conda environment.
conda create -n vllm python=3.9 -y
conda activate vllm

# Install vLLM with CUDA 12.1.
pip install vllm

vLLM 帮助 vLLM 兼容 OpenAI 的 RESTful API 服务器。可选参数： -h, --help 显示此帮助信息并退出 --host HOST 主机名 --port PORT 端口号 --allow-credentials 允许凭证 --allowed-origins ALLOWED_ORIGINS 允许的来源 --allowed-methods ALLOWED_METHODS 允许的方法 --allowed-headers ALLOWED_HEADERS 允许的头部 --api-key API_KEY 如果提供，服务器将要求在头部中呈现此密钥。 --served-model-name SERVED_MODEL_NAME 在API中使用的模型名称。如果没有指定，模型名称将与huggingface名称相同。 --lora-modules LORA_MODULES [LORA_MODULES ...] LoRA模块配置，格式为名称=路径。可以指定多个模块。

2024-03-15 10:00

2024年1月19日星期五

使用 llama.cpp 构建兼容 OpenAI API 服务

[llama.cpp][llama.cpp]

使用 llama.cpp 构建本地聊天服务

模型量化量化类型 ./quantize --help Allowed quantization types: 2 or Q4_0 : 3.56G, +0.2166 ppl @ LLaMA-v1-7B 3 or Q4_1 : 3.90G, +0.1585 ppl @ LLaMA-v1-7B 8 or Q5_0 : 4.33G, +0.0683 ppl @ LLaMA-v1-7B 9 or Q5_1 : 4.70G, +0.0349 ppl @ LLaMA-v1-7B 19 or IQ2_XXS : 2.06 bpw quantization 20 or IQ2_XS : 2.31 bpw quantization 10 or Q2_K : 2.63G, +0.6717 ppl @ LLaMA-v1-7B 21 or Q2_K_S : 2.16G, +9.0634 ppl @ LLaMA-v1-7B 12 or Q3_K : alias for Q3_K_M 11 or Q3_K_S : 2.75G, +0.5551 ppl @ LLaMA-v1-7B 12 or Q3_K_M : 3.07G, +0.2496 ppl @ LLaMA-v1-7B 13 or Q3_K_L : 3.35G, +0.

2024-01-19 08:00

llama.cpp llama-cpp-python quantization qwen deepseek openai-api perplexity cuda tesla-t4 macbook-pro-m2-max

2024年1月17日星期三

LLM 的基准测试

安装 FastChat & vLLM

安装 FastChat

安装 FlashAttention

FlashAttention-2 currently supports:

Ampere, Ada, or Hopper GPUs (e.g., A100, RTX 3090, RTX 4090, H100). Support for Turing GPUs (T4, RTX 2080) is coming soon, please use FlashAttention 1.x for Turing GPUs for now.
Datatype fp16 and bf16 (bf16 requires Ampere, Ada, or Hopper GPUs).
All head dimensions up to 256. Head dim > 192 backward requires A100/A800 or H100/H800.

Turing GPU T4 不支持，需要使用 FlashAttention 1.x，否则会报错 ❌：

2024-01-17 08:00

llm benchmarking 测速 fastchat vllm qwen wrk tesla-t4