3 篇文章带有标签 “qwen2”

2024年9月23日星期一

Qwen2 Technical Report

Abstract(摘要)

This report introduces the Qwen2 series, the latest addition to our large language models and large multimodal models. We release a comprehensive suite of foundational and instruction-tuned language models, encompassing a parameter range from 0.5 to 72 billion, featuring dense models and a Mixture-of-Experts model. Qwen2 surpasses most prior open-weight models, including its predecessor Qwen1.

2024-09-23 08:00

2024年9月6日星期五

SGLang 大模型服务框架

SGLang

SGLang is a fast serving framework for large language models and vision language models. It makes your interaction with models faster and more controllable by co-designing the backend runtime and frontend language.

SGLang 是用于大型语言模型和视觉语言模型的快速服务框架。通过协同设计后端运行时和前端语言，使您与模型的交互更快速、更可控。

The core features include:

核心功能包括： Fast Backend Runtime: Efficient serving with RadixAttention for prefix caching, jump-forward constrained decoding, continuous batching, token attention (paged attention), tensor parallelism, FlashInfer kernels, and quantization (AWQ/FP8/GPTQ/Marlin).

2024-09-06 08:00

sglang vllm llm-serving flashinfer tensor-parallelism quantization qwen2 cuda

2024年9月3日星期二

大模型推理需要多少显存？

基于 Qwen2 效率评估计算大模型推理需要的显存.xlsx
这里计算的显存都是指使用 transformers 库进行推理，对于 vLLM，由于 GPU 显存预分配，实际显存使用难以评估。

计算加载模型需要的显存

模型参数（B）	参数使用的位数（bits）	加载需要显存（G）
0.5	16	1
1.5	16	3
7	16	14
9	16	18
22	16	44
72	16	144

计算支持不同长度的上下文需要的显存

2024-09-03 08:00

llm gpu vram inference qwen2 transformers 显存计算大模型推理