Atlas800 Benchmark

LLM inference performance stress test on the Huawei Atlas 800I A2 server

LLM inference stress-testing tool

Install EvalScope

git clone https://github.com/modelscope/evalscope
cd evalscope

pip install -e .

Using the stress-test command

evalscope perf \
    --api openai \
    --url 'http://127.0.0.1:1025/v1/chat/completions' \
    --model 'qwen' \
    --dataset openqa \
    --dataset-path './datasets/open_qa.jsonl' \
    --max-prompt-length 8000 \
    --stop '<|im_end|>' \
    --read-timeout=120 \
    --parallel 100 \
    -n 1000

Do not add --stream; it frequently causes problems.

  • --read-timeout: network read timeout
  • --parallel: number of concurrent requests
  • -n: total number of requests
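To compare several concurrency levels, the same command can be generated once per level. The following is a minimal sketch, not part of EvalScope: the helper name build_perf_command and the list of concurrency levels are illustrative assumptions, and only the flags shown in the command above are used. It prints the commands rather than running them.

```python
import shlex


def build_perf_command(parallel, n,
                       url="http://127.0.0.1:1025/v1/chat/completions",
                       model="qwen",
                       dataset_path="./datasets/open_qa.jsonl"):
    """Return the `evalscope perf` argument list for one concurrency level.

    Hypothetical helper: it only reuses the flags shown in the command above
    and deliberately leaves out --stream.
    """
    return [
        "evalscope", "perf",
        "--api", "openai",
        "--url", url,
        "--model", model,
        "--dataset", "openqa",
        "--dataset-path", dataset_path,
        "--max-prompt-length", "8000",
        "--stop", "<|im_end|>",
        "--read-timeout=120",
        "--parallel", str(parallel),
        "-n", str(n),
    ]


if __name__ == "__main__":
    # The concurrency levels below are an assumption chosen for illustration.
    for parallel in (8, 16, 32, 64, 128):
        print(shlex.join(build_perf_command(parallel, 1000)))
```

Each printed line can be pasted into a shell, or the argument list can be passed to subprocess.run to execute the sweep directly.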

Datasets

Chinese chat: HC3-Chinese

mkdir datasets
wget https://modelscope.cn/datasets/AI-ModelScope/HC3-Chinese/resolve/master/open_qa.jsonl \
    -O datasets/open_qa.jsonl

Stress-test command

evalscope perf \
    --api openai \
    --url 'http://127.0.0.1:1025/v1/chat/completions' \
    --model 'qwen' \
    --dataset openqa \
    --dataset-path './datasets/open_qa.jsonl' \
    --max-prompt-length 8000 \
    --stop '<|im_end|>' \
    --read-timeout=120 \
    --parallel 1 \
    -n 1

Code Q&A: Codefuse-Evol-Instruct-Clean

wget https://modelscope.cn/datasets/Banksy235/Codefuse-Evol-Instruct-Clean/resolve/master/data.json \
    -O datasets/Codefuse-Evol-Instruct-Clean-data.jsonl

# Rename the "input" field to "question" so the file matches EvalScope's openqa dataset format
sed -i 's/"input"/"question"/g' datasets/Codefuse-Evol-Instruct-Clean-data.jsonl
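The sed command rewrites every occurrence of "input" in the file. As a hedged alternative, the sketch below renames only the top-level key of each record. It assumes the downloaded file holds one JSON object per line; the function names rename_key and convert_file are illustrative and not part of EvalScope.

```python
import json


def rename_key(line, old="input", new="question"):
    """Rename a top-level key in one JSON Lines record, leaving values untouched."""
    record = json.loads(line)
    if old in record:
        record[new] = record.pop(old)
    # ensure_ascii=False keeps non-ASCII text readable in the output file.
    return json.dumps(record, ensure_ascii=False)


def convert_file(src, dst):
    """Convert a whole file, skipping blank lines (assumes one JSON object per line)."""
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            if line.strip():
                fout.write(rename_key(line) + "\n")
```

Unlike the in-place sed edit, convert_file writes to a separate destination path, so the original download is preserved.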

Stress-test command

evalscope perf \
    --api openai \
    --url 'http://127.0.0.1:1025/v1/chat/completions' \
    --model 'qwen' \
    --dataset openqa \
    --dataset-path './datasets/Codefuse-Evol-Instruct-Clean-data.jsonl' \
    --max-prompt-length 4000 \
    --stop '<|im_end|>' \
    --read-timeout=120 \
    --parallel 1 \
    -n 1

Building a dataset with long inputs and outputs

Edit the file: datasets/long.jsonl

{"question":"Learning to Reason with LLMs\nWe are introducing OpenAI o1, a new large language model trained with reinforcement learning to perform complex reasoning. o1 thinks before it answers—it can produce a long internal chain of thought before responding to the user.\n\nContributions\nOpenAI o1 ranks in the 89th percentile on competitive programming questions (Codeforces), places among the top 500 students in the US in a qualifier for the USA Math Olympiad (AIME), and exceeds human PhD-level accuracy on a benchmark of physics, biology, and chemistry problems (GPQA). While the work needed to make this new model as easy to use as current models is still ongoing, we are releasing an early version of this model, OpenAI o1-preview, for immediate use in ChatGPT and to trusted API users⁠(opens in a new window).\n\nOur large-scale reinforcement learning algorithm teaches the model how to think productively using its chain of thought in a highly data-efficient training process. We have found that the performance of o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute). The constraints on scaling this approach differ substantially from those of LLM pretraining, and we are continuing to investigate them.\n\nThe image shows two scatter plots comparing o1 AIME accuracy during training and at test time. Both charts have pass@1 accuracy on the y-axis and compute (log scale) on the x-axis. The dots indicate increasing accuracy with more compute time.\no1 performance smoothly improves with both train-time and test-time compute\n\nEvals\nTo highlight the reasoning improvement over GPT-4o, we tested our models on a diverse set of human exams and ML benchmarks. We show that o1 significantly outperforms GPT-4o on the vast majority of these reasoning-heavy tasks. 
Unless otherwise specified, we evaluated o1 on the maximal test-time compute setting.\n\nCompetition evals for Math (AIME 2024), Code (CodeForces), and PhD-Level Science Questions (GPQA Diamond)\no1 greatly improves over GPT-4o on challenging reasoning benchmarks. Solid bars show pass@1 accuracy and the shaded region shows the performance of majority vote (consensus) with 64 samples.\nBreakdown of the accuracy and raw score of gpt-4o vs. o1 on various competition evals\no1 improves over GPT-4o on a wide range of benchmarks, including 54/57 MMLU subcategories. Seven are shown for illustration.\nIn many reasoning-heavy benchmarks, o1 rivals the performance of human experts. Recent frontier models1 do so well on MATH2 and GSM8K that these benchmarks are no longer effective at differentiating models.\nA score of 13.9 places it among the top 500 students nationally and above the cutoff for the USA Mathematical Olympiad.\n\nWe also evaluated o1 on GPQA diamond, a difficult intelligence benchmark which tests for expertise in chemistry, physics and biology. In order to compare models to humans, we recruited experts with PhDs to answer GPQA-diamond questions. We found that o1 surpassed the performance of those human experts, becoming the first model to do so on this benchmark. These results do not imply that o1 is more capable than a PhD in all respects — only that the model is more proficient in solving some problems that a PhD would be expected to solve. On several other ML benchmarks, o1 improved over the state-of-the-art. With its vision perception capabilities enabled, o1 scored 78.2 on MMMU, making it the first model to be competitive with human experts. It also outperformed GPT-4o on 54 out of 57 MMLU subcategories.\n\nChain of Thought\nSimilar to how a human may think for a long time before responding to a difficult question, o1 uses a chain of thought when attempting to solve a problem. 
Through reinforcement learning, o1 learns to hone its chain of thought and refine the strategies it uses. It learns to recognize and correct its mistakes. It learns to break down tricky steps into simpler ones. It learns to try a different approach when the current one isn’t working. This process dramatically improves the model’s ability to reason. To illustrate this leap forward, we showcase the chain of thought from o1-preview on several difficult problems below.\n\nCoding\nWe trained a model that scored 213 points and ranked in the 49th percentile in the 2024 International Olympiad in Informatics (IOI), by initializing from o1 and training to further improve programming skills. This model competed in the 2024 IOI under the same conditions as the human contestants. It had ten hours to solve six challenging algorithmic problems and was allowed 50 submissions per problem.\n\nFor each problem, our system sampled many candidate submissions and submitted 50 of them based on a test-time selection strategy. Submissions were selected based on performance on the IOI public test cases, model-generated test cases, and a learned scoring function. If we had instead submitted at random, we would have only scored 156 points on average, suggesting that this strategy was worth nearly 60 points under competition constraints.\n\nWith a relaxed submission constraint, we found that model performance improved significantly. When allowed 10,000 submissions per problem, the model achieved a score of 362.14 – above the gold medal threshold – even without any test-time selection strategy.  \n\nFinally, we simulated competitive programming contests hosted by Codeforces to demonstrate this model’s coding skill. Our evaluations closely matched competition rules and allowed for 10 submissions. 
GPT-4o achieved an Elo rating3 of 808, which is in the 11th percentile of human competitors.\nHuman preference evaluation\nIn addition to exams and academic benchmarks, we also evaluated human preference of o1-preview vs GPT-4o on challenging, open-ended prompts in a broad spectrum of domains. In this evaluation, human trainers were shown anonymized responses to a prompt from o1-preview and GPT-4o, and voted for which response they preferred. o1-preview is preferred to gpt-4o by a large margin in reasoning-heavy categories like data analysis, coding, and math. However, o1-preview is not preferred on some natural language tasks, suggesting that it is not well-suited for all use cases.\n\nThe image shows a horizontal bar chart comparing five models scores with error bars representing confidence intervals. The x-axis ranges from 0 to 100, with a dashed line as a reference point for performance.\nSafety\nChain of thought reasoning provides new opportunities for alignment and safety. We found that integrating our policies for model behavior into the chain of thought of a reasoning model is an effective way to robustly teach human values and principles.\nWhat does all of this mean for founders in the AI market? What does this mean for incumbent software companies? And where do we, as investors, see the most promising layer for returns in the Generative AI stack?\nIn our latest essay on the state of the Generative AI market, we’ll explore how the consolidation of the foundational LLM layer has set the stage for the race to scale these higher-order reasoning and agentic capabilities, and discuss a new generation of “killer apps” with novel cognitive architectures and user interfaces.\nThis is where System 2 thinking comes in, and it’s the focus of the latest wave of AI research. When a model “stops to think,” it isn’t just generating learned patterns or spitting out predictions based on past data. 
It’s generating a range of possibilities, considering potential outcomes and making a decision based on reasoning. \n\nTranslate to France."}

Input and output together come to roughly 3,500 tokens.
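A JSON Lines file needs each record on a single physical line, which is easy to get wrong when pasting a long article by hand. The sketch below builds such a record with json.dumps, using the same {"question": ...} shape as the record above. The function names and the idea of appending a short instruction to the document are assumptions for illustration.

```python
import json


def make_record(document, instruction):
    """Serialize one openqa-style record as a single JSON Lines row.

    json.dumps escapes embedded newlines as \\n, so the row never spans
    more than one physical line.
    """
    question = document.strip() + "\n\n" + instruction
    return json.dumps({"question": question}, ensure_ascii=False)


def write_dataset(path, rows):
    """Write already-serialized rows, one per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(row + "\n")
```

For example, write_dataset("datasets/long.jsonl", [make_record(article_text, "Translate to French.")]) would produce a one-record file, where article_text is a placeholder for whatever long passage is used.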

Stress-test command

evalscope-perf http://127.0.0.1:1025/v1/chat/completions qwen \
     ./datasets/long.jsonl \
     --max-prompt-length 8000 \
     --read-timeout=120 \
     --parallels 1 \
     --n 1

Comparison of experimental results

🏆 vLLM ⚔️ XInference (T4: 4X16G)

Based on these results, vLLM remains the better choice for production: it delivers higher inference performance and is more stable.

Code

import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties

# Set the font used for the plot labels
font_path = '/System/Library/Fonts/Hiragino Sans GB.ttc'  # replace with the path to your font file
font_prop = FontProperties(fname=font_path)

# Data
batch_sizes = [8, 16, 32, 64, 100, 128, 150, 200, 300, 400, 500]

vllm_qps = [0.970, 1.588, 2.443, 3.503, 3.593, 3.580, 3.509, 3.076, 2.847, 2.820, 3.060]
xinf_qps = [0.783, 1.288, 1.958, 2.472, 2.353, 2.334, 2.046, 1.750, 1.664, 1.254, 1.163]

vllm_latency = [8.213, 9.944, 12.846, 16.913, 19.182, 19.831, 17.806, 16.743, 16.285, 16.285, 17.649]
xinf_latency = [10.128, 12.307, 15.749, 19.225, 19.235, 19.151, 17.479, 15.949, 16.718, 14.750, 15.771]

vllm_throughput = [298.860, 496.458, 753.176, 1073.697, 1039.610, 1005.881, 1032.794, 925.476, 839.437, 816.049, 875.565]
xinf_throughput = [254.695, 424.701, 631.260, 753.852, 700.681, 700.324, 637.260, 550.200, 510.211, 392.697, 362.167]

vllm_failures = [0, 0, 0, 15, 73, 115, 26, 11, 19, 23, 26]
xinf_failures = [1, 0, 12, 71, 101, 72, 32, 25, 66, 47, 89]

# Create the subplots
fig, axs = plt.subplots(2, 2, figsize=(15, 10))

# QPS
axs[0, 0].plot(batch_sizes, vllm_qps, label='vLLM', marker='o')
axs[0, 0].plot(batch_sizes, xinf_qps, label='XInference(vLLM)', marker='o')
axs[0, 0].set_title('QPS', fontproperties=font_prop)
axs[0, 0].set_xlabel('Concurrency', fontproperties=font_prop)
axs[0, 0].set_ylabel('QPS', fontproperties=font_prop)
axs[0, 0].legend()

# Latency
axs[0, 1].plot(batch_sizes, vllm_latency, label='vLLM', marker='o')
axs[0, 1].plot(batch_sizes, xinf_latency, label='XInference(vLLM)', marker='o')
axs[0, 1].set_title('Latency', fontproperties=font_prop)
axs[0, 1].set_xlabel('Concurrency', fontproperties=font_prop)
axs[0, 1].set_ylabel('Latency (s)', fontproperties=font_prop)
axs[0, 1].legend()

# Throughput
axs[1, 0].plot(batch_sizes, vllm_throughput, label='vLLM', marker='o')
axs[1, 0].plot(batch_sizes, xinf_throughput, label='XInference(vLLM)', marker='o')
axs[1, 0].set_title('Throughput', fontproperties=font_prop)
axs[1, 0].set_xlabel('Concurrency', fontproperties=font_prop)
axs[1, 0].set_ylabel('Throughput (tokens/s)', fontproperties=font_prop)
axs[1, 0].legend()

# Failed requests (the data are counts, not rates)
axs[1, 1].plot(batch_sizes, vllm_failures, label='vLLM', marker='o')
axs[1, 1].plot(batch_sizes, xinf_failures, label='XInference(vLLM)', marker='o')
axs[1, 1].set_title('Failed requests', fontproperties=font_prop)
axs[1, 1].set_xlabel('Concurrency', fontproperties=font_prop)
axs[1, 1].set_ylabel('Failed requests (count)', fontproperties=font_prop)
axs[1, 1].legend()

# Adjust the layout
plt.tight_layout()
plt.show()
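
The measurements plotted above can also be summarized numerically. This sketch copies the QPS and failure lists from the plotting script and reports, for each engine, the concurrency at which QPS peaks and the total number of failed requests across the sweep. The helper name peak is an illustrative assumption.

```python
# Measurements copied from the plotting script above.
batch_sizes = [8, 16, 32, 64, 100, 128, 150, 200, 300, 400, 500]
vllm_qps = [0.970, 1.588, 2.443, 3.503, 3.593, 3.580, 3.509, 3.076, 2.847, 2.820, 3.060]
xinf_qps = [0.783, 1.288, 1.958, 2.472, 2.353, 2.334, 2.046, 1.750, 1.664, 1.254, 1.163]
vllm_failures = [0, 0, 0, 15, 73, 115, 26, 11, 19, 23, 26]
xinf_failures = [1, 0, 12, 71, 101, 72, 32, 25, 66, 47, 89]


def peak(values, sizes):
    """Return (concurrency, value) at which the series reaches its maximum."""
    index = max(range(len(values)), key=lambda i: values[i])
    return sizes[index], values[index]


for name, qps, failures in (("vLLM", vllm_qps, vllm_failures),
                            ("XInference", xinf_qps, xinf_failures)):
    concurrency, best = peak(qps, batch_sizes)
    print(f"{name}: peak QPS {best} at concurrency {concurrency}, "
          f"{sum(failures)} failed requests in total")
```

On these numbers vLLM peaks at a higher QPS and accumulates fewer failures, which is consistent with the conclusion stated above.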

🏆 MindIE (910B4: 8X32G) ⚔️ vLLM (T4: 4X16G)

Performance comparison against our existing T4 server

Code

import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties

# Set the font used for the plot labels
font_path = '/System/Library/Fonts/Hiragino Sans GB.ttc'  # replace with the path to your font file
font_prop = FontProperties(fname=font_path)

# Data
batch_sizes_mindie = [8, 16, 32, 64, 128, 150, 200, 256, 300, 400, 512, 720]
batch_sizes_vllm = [8, 16, 32, 64, 100, 128, 150, 200, 300, 400, 500]

mindie_qps = [2.474, 4.649, 8.273, 12.065, 18.924, 20.457, 22.294, 23.392, 22.868, 23.328, 24.007, 24.643]
vllm_qps = [0.970, 1.588, 2.443, 3.503, 3.593, 3.580, 3.509, 3.076, 2.847, 2.820, 3.060]

mindie_latency = [3.213, 3.391, 3.724, 4.974, 6.108, 6.517, 7.805, 9.208, 10.904, 13.628, 16.031, 18.790]
vllm_latency = [8.213, 9.944, 12.846, 16.913, 19.182, 19.831, 17.806, 16.743, 16.285, 16.285, 17.649]

mindie_throughput = [594.929, 1119.102, 1989.159, 2903.023, 4559.489, 4920.856, 5354.701, 5636.846, 5506.567, 5618.110, 5772.230, 5940.622]
vllm_throughput = [298.860, 496.458, 753.176, 1073.697, 1039.610, 1005.881, 1032.794, 925.476, 839.437, 816.049, 875.565]

mindie_failures = [0] * len(batch_sizes_mindie)  # assumes MindIE (910B4 8*32) had no failed requests
vllm_failures = [0, 0, 0, 15, 73, 115, 26, 11, 19, 23, 26]

# Create the subplots
fig, axs = plt.subplots(2, 2, figsize=(15, 10))

# QPS
axs[0, 0].plot(batch_sizes_mindie, mindie_qps, label='MindIE(910B4 8*32)', marker='o')
axs[0, 0].plot(batch_sizes_vllm, vllm_qps, label='vLLM(T4 4*16)', marker='o')
axs[0, 0].set_title('QPS', fontproperties=font_prop)
axs[0, 0].set_xlabel('Concurrency', fontproperties=font_prop)
axs[0, 0].set_ylabel('QPS', fontproperties=font_prop)
axs[0, 0].legend()

# Latency
axs[0, 1].plot(batch_sizes_mindie, mindie_latency, label='MindIE(910B4 8*32)', marker='o')
axs[0, 1].plot(batch_sizes_vllm, vllm_latency, label='vLLM(T4 4*16)', marker='o')
axs[0, 1].set_title('Latency', fontproperties=font_prop)
axs[0, 1].set_xlabel('Concurrency', fontproperties=font_prop)
axs[0, 1].set_ylabel('Latency (s)', fontproperties=font_prop)
axs[0, 1].legend()

# Throughput
axs[1, 0].plot(batch_sizes_mindie, mindie_throughput, label='MindIE(910B4 8*32)', marker='o')
axs[1, 0].plot(batch_sizes_vllm, vllm_throughput, label='vLLM(T4 4*16)', marker='o')
axs[1, 0].set_title('Throughput', fontproperties=font_prop)
axs[1, 0].set_xlabel('Concurrency', fontproperties=font_prop)
axs[1, 0].set_ylabel('Throughput (tokens/s)', fontproperties=font_prop)
axs[1, 0].legend()

# Failed requests (the data are counts, not rates)
axs[1, 1].plot(batch_sizes_mindie, mindie_failures, label='MindIE(910B4 8*32)', marker='o')
axs[1, 1].plot(batch_sizes_vllm, vllm_failures, label='vLLM(T4 4*16)', marker='o')
axs[1, 1].set_title('Failed requests', fontproperties=font_prop)
axs[1, 1].set_xlabel('Concurrency', fontproperties=font_prop)
axs[1, 1].set_ylabel('Failed requests (count)', fontproperties=font_prop)
axs[1, 1].legend()

# Adjust the layout
plt.tight_layout()
plt.show()
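
The two sweeps above use different concurrency grids, so a direct comparison is only meaningful at the levels both share. The sketch below, using the QPS lists from the plotting script, computes the MindIE-to-vLLM QPS ratio at those shared levels. The ratio compares whole systems (8 cards versus 4 cards), not individual accelerators, and the variable names mindie, vllm, and ratios are illustrative.

```python
# QPS measurements copied from the plotting script above.
batch_sizes_mindie = [8, 16, 32, 64, 128, 150, 200, 256, 300, 400, 512, 720]
batch_sizes_vllm = [8, 16, 32, 64, 100, 128, 150, 200, 300, 400, 500]
mindie_qps = [2.474, 4.649, 8.273, 12.065, 18.924, 20.457, 22.294, 23.392, 22.868, 23.328, 24.007, 24.643]
vllm_qps = [0.970, 1.588, 2.443, 3.503, 3.593, 3.580, 3.509, 3.076, 2.847, 2.820, 3.060]

# Index each series by concurrency so the two grids can be intersected.
mindie = dict(zip(batch_sizes_mindie, mindie_qps))
vllm = dict(zip(batch_sizes_vllm, vllm_qps))

shared = sorted(set(mindie) & set(vllm))
ratios = {c: mindie[c] / vllm[c] for c in shared}

for c in shared:
    print(f"concurrency {c:>3}: MindIE {mindie[c]:.3f} QPS, "
          f"vLLM {vllm[c]:.3f} QPS, ratio {ratios[c]:.2f}x")
```

The ratio grows with concurrency because the T4 setup stops scaling well before the 910B4 setup does.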

Experimental results (MindIE)

Qwen1.5-7B-Chat

Concurrency | 8 | 16 | 32 | 64 | 128 | 150 | 200 | 256 | 300 | 400 | 512 | 720
Time taken (s) | 404.284 | 215.085 | 120.876 | 82.884 | 52.844 | 48.884 | 44.856 | 42.750 | 43.729 | 42.866 | 41.655 | 40.580
QPS | 2.474 | 4.649 | 8.273 | 12.065 | 18.924 | 20.457 | 22.294 | 23.392 | 22.868 | 23.328 | 24.007 | 24.643
Average latency (s) | 3.213 | 3.391 | 3.724 | 4.974 | 6.108 | 6.517 | 7.805 | 9.208 | 10.904 | 13.628 | 16.031 | 18.790
Throughput (output tokens/s) | 594.929 | 1119.102 | 1989.159 | 2903.023 | 4559.489 | 4920.856 | 5354.701 | 5636.846 | 5506.567 | 5618.110 | 5772.230 | 5940.622
p50 latency (s) | 3.2461 | 3.4271 | 3.7514 | 5.0248 | 6.1491 | 6.5487 | 7.7782 | 9.1754 | 10.9164 | 13.9850 | 17.0675 | 20.4161
p90 latency (s) | 4.9771 | 5.2905 | 5.8484 | 7.7522 | 9.5705 | 10.1980 | 12.3493 | 13.6320 | 15.8667 | 19.2667 | 22.7640 | 26.8183
  • Average input tokens per request: 40

  • Average output tokens per request: 240

  • parallel 8

Benchmarking summary: 
 Time taken for tests: 404.284 seconds
 Expected number of requests: 1000
 Number of concurrency: 8
 Total requests: 1000
 Succeed requests: 1000
 Failed requests: 0
 Average QPS: 2.474
 Average latency: 3.213
 Throughput(average output tokens per second): 594.929
 Average time to first token: 3.213
 Average input tokens per request: 40.296
 Average output tokens per request: 240.520
 Average time per output token: 0.00168
 Average package per request: 1.000
 Average package latency: 3.213
 Percentile of time to first token: 
     p50: 3.2461
     p66: 3.7587
     p75: 4.2213
     p80: 4.4208
     p90: 4.9771
     p95: 5.6460
     p98: 6.3678
     p99: 6.8545
 Percentile of request latency: 
     p50: 3.2461
     p66: 3.7587
     p75: 4.2213
     p80: 4.4208
     p90: 4.9771
     p95: 5.6460
     p98: 6.3678
     p99: 6.8545
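
The figures in the parallel 8 summary above can be cross-checked against each other. The sketch below assumes that QPS is total requests divided by wall-clock time and that throughput is QPS multiplied by the average output tokens per request; both assumptions reproduce the reported values closely. It also applies Little's law (in-flight requests ≈ QPS × average latency) as a sanity check against the configured concurrency. The function name derived_metrics is illustrative.

```python
def derived_metrics(total_requests, seconds, avg_output_tokens, avg_latency):
    """Recompute summary figures from the raw counts in a benchmark report.

    Assumed relationships (not taken from EvalScope's source):
      QPS        = total requests / wall-clock seconds
      throughput = QPS * average output tokens per request
      in-flight  = QPS * average latency   (Little's law)
    """
    qps = total_requests / seconds
    throughput = qps * avg_output_tokens
    in_flight = qps * avg_latency
    return qps, throughput, in_flight


# Values taken from the parallel 8 summary above.
qps, throughput, in_flight = derived_metrics(1000, 404.284, 240.520, 3.213)
print(f"QPS {qps:.3f}, throughput {throughput:.3f} tokens/s, "
      f"about {in_flight:.2f} requests in flight")
```

The in-flight estimate lands just under the configured concurrency of 8, which suggests the client kept the server saturated for almost the whole run. Time to first token equals the request latency in every summary, presumably because the runs were made without --stream.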
  • parallel 16
Benchmarking summary: 
 Time taken for tests: 215.085 seconds
 Expected number of requests: 1000
 Number of concurrency: 16
 Total requests: 1000
 Succeed requests: 1000
 Failed requests: 0
 Average QPS: 4.649
 Average latency: 3.391
 Throughput(average output tokens per second): 1119.102
 Average time to first token: 3.391
 Average input tokens per request: 40.296
 Average output tokens per request: 240.702
 Average time per output token: 0.00089
 Average package per request: 1.000
 Average package latency: 3.391
 Percentile of time to first token: 
     p50: 3.4271
     p66: 3.9816
     p75: 4.3792
     p80: 4.6188
     p90: 5.2905
     p95: 5.9389
     p98: 6.7555
     p99: 7.2478
 Percentile of request latency: 
     p50: 3.4271
     p66: 3.9816
     p75: 4.3792
     p80: 4.6188
     p90: 5.2905
     p95: 5.9389
     p98: 6.7555
     p99: 7.2478
  • parallel 32
Benchmarking summary: 
 Time taken for tests: 120.876 seconds
 Expected number of requests: 1000
 Number of concurrency: 32
 Total requests: 1000
 Succeed requests: 1000
 Failed requests: 0
 Average QPS: 8.273
 Average latency: 3.724
 Throughput(average output tokens per second): 1989.159
 Average time to first token: 3.724
 Average input tokens per request: 40.296
 Average output tokens per request: 240.442
 Average time per output token: 0.00050
 Average package per request: 1.000
 Average package latency: 3.724
 Percentile of time to first token: 
     p50: 3.7514
     p66: 4.3989
     p75: 4.8352
     p80: 5.1087
     p90: 5.8484
     p95: 6.5664
     p98: 7.3057
     p99: 8.0644
 Percentile of request latency: 
     p50: 3.7514
     p66: 4.3989
     p75: 4.8352
     p80: 5.1087
     p90: 5.8484
     p95: 6.5664
     p98: 7.3057
     p99: 8.0644
  • parallel 64
Benchmarking summary: 
 Time taken for tests: 82.884 seconds
 Expected number of requests: 1000
 Number of concurrency: 64
 Total requests: 1000
 Succeed requests: 1000
 Failed requests: 0
 Average QPS: 12.065
 Average latency: 4.974
 Throughput(average output tokens per second): 2903.023
 Average time to first token: 4.974
 Average input tokens per request: 40.296
 Average output tokens per request: 240.615
 Average time per output token: 0.00034
 Average package per request: 1.000
 Average package latency: 4.974
 Percentile of time to first token: 
     p50: 5.0248
     p66: 5.8985
     p75: 6.4676
     p80: 6.8351
     p90: 7.7522
     p95: 8.8266
     p98: 9.9534
     p99: 10.6036
 Percentile of request latency: 
     p50: 5.0248
     p66: 5.8985
     p75: 6.4676
     p80: 6.8351
     p90: 7.7522
     p95: 8.8266
     p98: 9.9534
     p99: 10.6036
  • parallel 128
Benchmarking summary: 
 Time taken for tests: 52.844 seconds
 Expected number of requests: 1000
 Number of concurrency: 128
 Total requests: 1000
 Succeed requests: 1000
 Failed requests: 0
 Average QPS: 18.924
 Average latency: 6.108
 Throughput(average output tokens per second): 4559.489
 Average time to first token: 6.108
 Average input tokens per request: 40.296
 Average output tokens per request: 240.943
 Average time per output token: 0.00022
 Average package per request: 1.000
 Average package latency: 6.108
 Percentile of time to first token: 
     p50: 6.1491
     p66: 7.2622
     p75: 7.9560
     p80: 8.3894
     p90: 9.5705
     p95: 10.7209
     p98: 12.2300
     p99: 13.1657
 Percentile of request latency: 
     p50: 6.1491
     p66: 7.2622
     p75: 7.9560
     p80: 8.3894
     p90: 9.5705
     p95: 10.7209
     p98: 12.2300
     p99: 13.1657
  • parallel 150
Benchmarking summary: 
 Time taken for tests: 48.884 seconds
 Expected number of requests: 1000
 Number of concurrency: 150
 Total requests: 1000
 Succeed requests: 1000
 Failed requests: 0
 Average QPS: 20.457
 Average latency: 6.517
 Throughput(average output tokens per second): 4920.856
 Average time to first token: 6.517
 Average input tokens per request: 40.296
 Average output tokens per request: 240.550
 Average time per output token: 0.00020
 Average package per request: 1.000
 Average package latency: 6.517
 Percentile of time to first token: 
     p50: 6.5487
     p66: 7.7580
     p75: 8.4394
     p80: 8.9248
     p90: 10.1980
     p95: 11.4446
     p98: 13.0906
     p99: 13.7333
 Percentile of request latency: 
     p50: 6.5487
     p66: 7.7580
     p75: 8.4394
     p80: 8.9248
     p90: 10.1980
     p95: 11.4446
     p98: 13.0906
     p99: 13.7333
  • parallel 200
Benchmarking summary: 
 Time taken for tests: 44.856 seconds
 Expected number of requests: 1000
 Number of concurrency: 200
 Total requests: 1000
 Succeed requests: 1000
 Failed requests: 0
 Average QPS: 22.294
 Average latency: 7.805
 Throughput(average output tokens per second): 5354.701
 Average time to first token: 7.805
 Average input tokens per request: 40.296
 Average output tokens per request: 240.188
 Average time per output token: 0.00019
 Average package per request: 1.000
 Average package latency: 7.805
 Percentile of time to first token: 
     p50: 7.7782
     p66: 9.2457
     p75: 10.0596
     p80: 10.8689
     p90: 12.3493
     p95: 13.7108
     p98: 15.1361
     p99: 16.3464
 Percentile of request latency: 
     p50: 7.7782
     p66: 9.2457
     p75: 10.0596
     p80: 10.8689
     p90: 12.3493
     p95: 13.7108
     p98: 15.1361
     p99: 16.3464
  • parallel 256
Benchmarking summary: 
 Time taken for tests: 42.750 seconds
 Expected number of requests: 1000
 Number of concurrency: 256
 Total requests: 1000
 Succeed requests: 1000
 Failed requests: 0
 Average QPS: 23.392
 Average latency: 9.208
 Throughput(average output tokens per second): 5636.846
 Average time to first token: 9.208
 Average input tokens per request: 40.296
 Average output tokens per request: 240.975
 Average time per output token: 0.00018
 Average package per request: 1.000
 Average package latency: 9.208
 Percentile of time to first token: 
     p50: 9.1754
     p66: 10.6423
     p75: 11.5348
     p80: 12.1507
     p90: 13.6320
     p95: 14.9237
     p98: 16.5329
     p99: 18.0215
 Percentile of request latency: 
     p50: 9.1754
     p66: 10.6423
     p75: 11.5348
     p80: 12.1507
     p90: 13.6320
     p95: 14.9237
     p98: 16.5329
     p99: 18.0215
  • parallel 300
Benchmarking summary: 
 Time taken for tests: 43.729 seconds
 Expected number of requests: 1000
 Number of concurrency: 300
 Total requests: 1000
 Succeed requests: 1000
 Failed requests: 0
 Average QPS: 22.868
 Average latency: 10.904
 Throughput(average output tokens per second): 5506.567
 Average time to first token: 10.904
 Average input tokens per request: 40.296
 Average output tokens per request: 240.795
 Average time per output token: 0.00018
 Average package per request: 1.000
 Average package latency: 10.904
 Percentile of time to first token: 
     p50: 10.9164
     p66: 12.5896
     p75: 13.6076
     p80: 14.2442
     p90: 15.8667
     p95: 17.2967
     p98: 18.8841
     p99: 20.2304
 Percentile of request latency: 
     p50: 10.9164
     p66: 12.5896
     p75: 13.6076
     p80: 14.2442
     p90: 15.8667
     p95: 17.2967
     p98: 18.8841
     p99: 20.2304
  • parallel 400
Benchmarking summary: 
 Time taken for tests: 42.866 seconds
 Expected number of requests: 1000
 Number of concurrency: 400
 Total requests: 1000
 Succeed requests: 1000
 Failed requests: 0
 Average QPS: 23.328
 Average latency: 13.628
 Throughput(average output tokens per second): 5618.110
 Average time to first token: 13.628
 Average input tokens per request: 40.296
 Average output tokens per request: 240.828
 Average time per output token: 0.00018
 Average package per request: 1.000
 Average package latency: 13.628
 Percentile of time to first token: 
     p50: 13.9850
     p66: 15.7791
     p75: 16.8451
     p80: 17.6249
     p90: 19.2667
     p95: 20.8091
     p98: 22.5674
     p99: 23.6675
 Percentile of request latency: 
     p50: 13.9850
     p66: 15.7791
     p75: 16.8451
     p80: 17.6249
     p90: 19.2667
     p95: 20.8091
     p98: 22.5674
     p99: 23.6675
  • parallel 512
Benchmarking summary: 
 Time taken for tests: 41.655 seconds
 Expected number of requests: 1000
 Number of concurrency: 512
 Total requests: 1000
 Succeed requests: 1000
 Failed requests: 0
 Average QPS: 24.007
 Average latency: 16.031
 Throughput(average output tokens per second): 5772.230
 Average time to first token: 16.031
 Average input tokens per request: 40.296
 Average output tokens per request: 240.440
 Average time per output token: 0.00017
 Average package per request: 1.000
 Average package latency: 16.031
 Percentile of time to first token: 
     p50: 17.0675
     p66: 18.9757
     p75: 20.0632
     p80: 20.8715
     p90: 22.7640
     p95: 24.0828
     p98: 25.3913
     p99: 26.6549
 Percentile of request latency: 
     p50: 17.0675
     p66: 18.9757
     p75: 20.0632
     p80: 20.8715
     p90: 22.7640
     p95: 24.0828
     p98: 25.3913
     p99: 26.6549
  • parallel 720
Benchmarking summary: 
 Time taken for tests: 40.580 seconds
 Expected number of requests: 1000
 Number of concurrency: 720
 Total requests: 1000
 Succeed requests: 1000
 Failed requests: 0
 Average QPS: 24.643
 Average latency: 18.790
 Throughput(average output tokens per second): 5940.622
 Average time to first token: 18.790
 Average input tokens per request: 40.296
 Average output tokens per request: 241.071
 Average time per output token: 0.00017
 Average package per request: 1.000
 Average package latency: 18.790
 Percentile of time to first token: 
     p50: 20.4161
     p66: 22.8199
     p75: 24.1332
     p80: 24.8298
     p90: 26.8183
     p95: 28.8718
     p98: 30.2479
     p99: 31.1723
 Percentile of request latency: 
     p50: 20.4161
     p66: 22.8199
     p75: 24.1332
     p80: 24.8298
     p90: 26.8183
     p95: 28.8718
     p98: 30.2479
     p99: 31.1723

Qwen1.5-7B-Chat (long.jsonl)

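Concurrency | 32 | 64 | 80 | 100
Time taken (s) | 227.466 | 176.405 | 177.402 | 176.430
QPS | 0.879 | 1.134 | 1.127 | 1.134
Average latency (s) | 34.012 | 51.059 | 62.032 | 73.359
Throughput (output tokens/s) | 1534.689 | 1889.768 | 1864.759 | 1869.820
p50 latency (s) | 34.7112 | 48.4820 | 59.2757 | 80.3454
p90 latency (s) | 36.6736 | 68.2014 | 84.9306 | 93.7045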
  • Average input tokens per request: 1614

  • Average output tokens per request: 1654

  • parallel 32

Benchmarking summary: 
 Time taken for tests: 227.466 seconds
 Expected number of requests: 200
 Number of concurrency: 32
 Total requests: 200
 Succeed requests: 200
 Failed requests: 0
 Average QPS: 0.879
 Average latency: 34.012
 Throughput(average output tokens per second): 1534.689
 Average time to first token: 34.012
 Average input tokens per request: 1614.000
 Average output tokens per request: 1745.450
 Average time per output token: 0.00065
 Average package per request: 1.000
 Average package latency: 34.012
 Percentile of time to first token: 
     p50: 34.7112
     p66: 36.2749
     p75: 36.4008
     p80: 36.4714
     p90: 36.6736
     p95: 36.7247
     p98: 36.7508
     p99: 36.7635
 Percentile of request latency: 
     p50: 34.7112
     p66: 36.2749
     p75: 36.4008
     p80: 36.4714
     p90: 36.6736
     p95: 36.7247
     p98: 36.7508
     p99: 36.7635
  • parallel 64
Benchmarking summary: 
 Time taken for tests: 176.405 seconds
 Expected number of requests: 200
 Number of concurrency: 64
 Total requests: 200
 Succeed requests: 200
 Failed requests: 0
 Average QPS: 1.134
 Average latency: 51.059
 Throughput(average output tokens per second): 1889.768
 Average time to first token: 51.059
 Average input tokens per request: 1614.000
 Average output tokens per request: 1666.820
 Average time per output token: 0.00053
 Average package per request: 1.000
 Average package latency: 51.059
 Percentile of time to first token: 
     p50: 48.4820
     p66: 52.8716
     p75: 58.2624
     p80: 60.0386
     p90: 68.2014
     p95: 74.7256
     p98: 78.7428
     p99: 79.3227
 Percentile of request latency: 
     p50: 48.4820
     p66: 52.8716
     p75: 58.2624
     p80: 60.0386
     p90: 68.2014
     p95: 74.7256
     p98: 78.7428
     p99: 79.3227
  • parallel 80
Benchmarking summary: 
 Time taken for tests: 177.402 seconds
 Expected number of requests: 200
 Number of concurrency: 80
 Total requests: 200
 Succeed requests: 200
 Failed requests: 0
 Average QPS: 1.127
 Average latency: 62.032
 Throughput(average output tokens per second): 1864.759
 Average time to first token: 62.032
 Average input tokens per request: 1614.000
 Average output tokens per request: 1654.060
 Average time per output token: 0.00054
 Average package per request: 1.000
 Average package latency: 62.032
 Percentile of time to first token: 
     p50: 59.2757
     p66: 71.6039
     p75: 74.4594
     p80: 76.9160
     p90: 84.9306
     p95: 91.9959
     p98: 95.0497
     p99: 98.3784
 Percentile of request latency: 
     p50: 59.2757
     p66: 71.6039
     p75: 74.4594
     p80: 76.9160
     p90: 84.9306
     p95: 91.9959
     p98: 95.0497
     p99: 98.3784
  • parallel 100
Benchmarking summary: 
 Time taken for tests: 176.430 seconds
 Expected number of requests: 200
 Number of concurrency: 100
 Total requests: 200
 Succeed requests: 200
 Failed requests: 0
 Average QPS: 1.134
 Average latency: 73.359
 Throughput(average output tokens per second): 1869.820
 Average time to first token: 73.359
 Average input tokens per request: 1614.000
 Average output tokens per request: 1649.460
 Average time per output token: 0.00053
 Average package per request: 1.000
 Average package latency: 73.359
 Percentile of time to first token: 
     p50: 80.3454
     p66: 85.7741
     p75: 88.7250
     p80: 90.7941
     p90: 93.7045
     p95: 97.3420
     p98: 99.5837
     p99: 101.2294
 Percentile of request latency: 
     p50: 80.3454
     p66: 85.7741
     p75: 88.7250
     p80: 90.7941
     p90: 93.7045
     p95: 97.3420
     p98: 99.5837
     p99: 101.2294

Qwen1.5-14B-Chat

Concurrency | 8 | 16 | 32 | 64 | 128 | 150 | 200 | 256 | 512
Time taken (s) | 578.571 | 361.169 | 253.040 | 204.961 | 170.001 | 169.981 | 162.999 | 159.840 | 153.937
QPS | 1.727 | 2.766 | 3.952 | 4.874 | 5.882 | 5.877 | 6.129 | 6.250 | 6.490
Average latency (s) | 3.712 | 3.928 | 4.581 | 5.628 | 7.223 | 8.004 | 9.205 | 11.446 | 22.695
Throughput (output tokens/s) | 480.043 | 897.511 | 915.133 | 2333.955 | 1363.656 | 3621.096 | 4310.525 | 4333.013 | 3806.748
p50 latency (s) | 3.7038 | 3.9261 | 4.4544 | 5.6047 | 7.0534 | 7.9484 | 9.0271 | 11.3066 | 23.6591
p90 latency (s) | 5.7184 | 6.0562 | 6.9198 | 8.7597 | 11.1194 | 12.5181 | 14.5017 | 16.8646 | 33.3052
  • parallel 8
Benchmarking summary: 
 Time taken for tests: 578.571 seconds
 Expected number of requests: 1000
 Number of concurrency: 8
 Total requests: 1000
 Succeed requests: 999
 Failed requests: 1
 Average QPS: 1.727
 Average latency: 3.712
 Throughput(average output tokens per second): 480.043
 Average time to first token: 3.712
 Average input tokens per request: 40.287
 Average output tokens per request: 224.138
 Average time per output token: 0.00208
 Average package per request: 1.000
 Average package latency: 3.712
 Percentile of time to first token: 
     p50: 3.7038
     p66: 4.4215
     p75: 4.8051
     p80: 5.0476
     p90: 5.7184
     p95: 6.2956
     p98: 7.0707
     p99: 7.4415
 Percentile of request latency: 
     p50: 3.7038
     p66: 4.4215
     p75: 4.8051
     p80: 5.0476
     p90: 5.7184
     p95: 6.2956
     p98: 7.0707
     p99: 7.4415
  • parallel 16
Benchmarking summary: 
 Time taken for tests: 361.169 seconds
 Expected number of requests: 1000
 Number of concurrency: 16
 Total requests: 1000
 Succeed requests: 999
 Failed requests: 1
 Average QPS: 2.766
 Average latency: 3.928
 Throughput(average output tokens per second): 897.511
 Average time to first token: 3.928
 Average input tokens per request: 40.287
 Average output tokens per request: 223.750
 Average time per output token: 0.00111
 Average package per request: 1.000
 Average package latency: 3.928
 Percentile of time to first token: 
     p50: 3.9261
     p66: 4.6890
     p75: 5.1204
     p80: 5.3290
     p90: 6.0562
     p95: 6.6784
     p98: 7.3113
     p99: 7.8980
 Percentile of request latency: 
     p50: 3.9261
     p66: 4.6890
     p75: 5.1204
     p80: 5.3290
     p90: 6.0562
     p95: 6.6784
     p98: 7.3113
     p99: 7.8980
  • parallel 32
Benchmarking summary: 
 Time taken for tests: 253.040 seconds
 Expected number of requests: 1000
 Number of concurrency: 32
 Total requests: 1000
 Succeed requests: 1000
 Failed requests: 0
 Average QPS: 3.952
 Average latency: 4.581
 Throughput(average output tokens per second): 915.133
 Average time to first token: 4.581
 Average input tokens per request: 40.296
 Average output tokens per request: 231.565
 Average time per output token: 0.00109
 Average package per request: 1.000
 Average package latency: 4.581
 Percentile of time to first token: 
     p50: 4.4544
     p66: 5.2905
     p75: 5.8235
     p80: 6.1074
     p90: 6.9198
     p95: 7.5185
     p98: 8.5143
     p99: 9.3296
 Percentile of request latency: 
     p50: 4.4544
     p66: 5.2905
     p75: 5.8235
     p80: 6.1074
     p90: 6.9198
     p95: 7.5185
     p98: 8.5143
     p99: 9.3296
  • parallel 64
Benchmarking summary: 
 Time taken for tests: 204.961 seconds
 Expected number of requests: 1000
 Number of concurrency: 64
 Total requests: 1000
 Succeed requests: 999
 Failed requests: 1
 Average QPS: 4.874
 Average latency: 5.628
 Throughput(average output tokens per second): 2333.955
 Average time to first token: 5.628
 Average input tokens per request: 40.287
 Average output tokens per request: 223.930
 Average time per output token: 0.00043
 Average package per request: 1.000
 Average package latency: 5.628
 Percentile of time to first token: 
     p50: 5.6047
     p66: 6.6600
     p75: 7.3423
     p80: 7.7040
     p90: 8.7597
     p95: 9.6390
     p98: 10.7844
     p99: 11.6003
 Percentile of request latency: 
     p50: 5.6047
     p66: 6.6600
     p75: 7.3423
     p80: 7.7040
     p90: 8.7597
     p95: 9.6390
     p98: 10.7844
     p99: 11.6003
  • parallel 128
Benchmarking summary: 
 Time taken for tests: 170.001 seconds
 Expected number of requests: 1000
 Number of concurrency: 128
 Total requests: 1000
 Succeed requests: 1000
 Failed requests: 0
 Average QPS: 5.882
 Average latency: 7.223
 Throughput(average output tokens per second): 1363.656
 Average time to first token: 7.223
 Average input tokens per request: 40.296
 Average output tokens per request: 231.823
 Average time per output token: 0.00073
 Average package per request: 1.000
 Average package latency: 7.223
 Percentile of time to first token: 
     p50: 7.0534
     p66: 8.4098
     p75: 9.3191
     p80: 9.7640
     p90: 11.1194
     p95: 12.2800
     p98: 13.8248
     p99: 14.6733
 Percentile of request latency: 
     p50: 7.0534
     p66: 8.4098
     p75: 9.3191
     p80: 9.7640
     p90: 11.1194
     p95: 12.2800
     p98: 13.8248
     p99: 14.6733
  • parallel 150
Benchmarking summary: 
 Time taken for tests: 169.981 seconds
 Expected number of requests: 1000
 Number of concurrency: 150
 Total requests: 1000
 Succeed requests: 999
 Failed requests: 1
 Average QPS: 5.877
 Average latency: 8.004
 Throughput(average output tokens per second): 3621.096
 Average time to first token: 8.004
 Average input tokens per request: 40.287
 Average output tokens per request: 224.225
 Average time per output token: 0.00028
 Average package per request: 1.000
 Average package latency: 8.004
 Percentile of time to first token: 
     p50: 7.9484
     p66: 9.3969
     p75: 10.5065
     p80: 11.0508
     p90: 12.5181
     p95: 13.6289
     p98: 15.3693
     p99: 16.5225
 Percentile of request latency: 
     p50: 7.9484
     p66: 9.3969
     p75: 10.5065
     p80: 11.0508
     p90: 12.5181
     p95: 13.6289
     p98: 15.3693
     p99: 16.5225
  • parallel 200
Benchmarking summary: 
 Time taken for tests: 162.999 seconds
 Expected number of requests: 1000
 Number of concurrency: 200
 Total requests: 1000
 Succeed requests: 999
 Failed requests: 1
 Average QPS: 6.129
 Average latency: 9.205
 Throughput(average output tokens per second): 4310.525
 Average time to first token: 9.205
 Average input tokens per request: 40.287
 Average output tokens per request: 223.865
 Average time per output token: 0.00023
 Average package per request: 1.000
 Average package latency: 9.205
 Percentile of time to first token: 
     p50: 9.0271
     p66: 10.8104
     p75: 12.0957
     p80: 12.8212
     p90: 14.5017
     p95: 15.7233
     p98: 17.6891
     p99: 19.2401
 Percentile of request latency: 
     p50: 9.0271
     p66: 10.8104
     p75: 12.0957
     p80: 12.8212
     p90: 14.5017
     p95: 15.7233
     p98: 17.6891
     p99: 19.2401
  • parallel 256
Benchmarking summary: 
 Time taken for tests: 159.840 seconds
 Expected number of requests: 1000
 Number of concurrency: 256
 Total requests: 1000
 Succeed requests: 999
 Failed requests: 1
 Average QPS: 6.250
 Average latency: 11.446
 Throughput(average output tokens per second): 4333.013
 Average time to first token: 11.446
 Average input tokens per request: 40.287
 Average output tokens per request: 224.384
 Average time per output token: 0.00023
 Average package per request: 1.000
 Average package latency: 11.446
 Percentile of time to first token: 
     p50: 11.3066
     p66: 13.1698
     p75: 14.4698
     p80: 15.2733
     p90: 16.8646
     p95: 18.3524
     p98: 20.0468
     p99: 21.3758
 Percentile of request latency: 
     p50: 11.3066
     p66: 13.1698
     p75: 14.4698
     p80: 15.2733
     p90: 16.8646
     p95: 18.3524
     p98: 20.0468
     p99: 21.3758
  • parallel 512
Benchmarking summary: 
 Time taken for tests: 153.937 seconds
 Expected number of requests: 1000
 Number of concurrency: 512
 Total requests: 1000
 Succeed requests: 999
 Failed requests: 1
 Average QPS: 6.490
 Average latency: 22.695
 Throughput(average output tokens per second): 3806.748
 Average time to first token: 22.695
 Average input tokens per request: 40.287
 Average output tokens per request: 224.177
 Average time per output token: 0.00026
 Average package per request: 1.000
 Average package latency: 22.695
 Percentile of time to first token: 
     p50: 23.6591
     p66: 27.2753
     p75: 29.3330
     p80: 30.5309
     p90: 33.3052
     p95: 35.7777
     p98: 37.4897
     p99: 38.1186
 Percentile of request latency: 
     p50: 23.6591
     p66: 27.2753
     p75: 29.3330
     p80: 30.5309
     p90: 33.3052
     p95: 35.7777
     p98: 37.4897
     p99: 38.1186
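
The summary tables in this post line up with fields in logs like the ones above. A minimal sketch of pulling those fields out of one `Benchmarking summary` block, assuming the log layout shown here; `parse_summary` and `SAMPLE` are hypothetical names for illustration, not part of EvalScope:

```python
import re

# Shortened copy of the "parallel 8" log above, used as sample input.
SAMPLE = """\
Benchmarking summary:
 Time taken for tests: 578.571 seconds
 Number of concurrency: 8
 Average QPS: 1.727
 Average latency: 3.712
 Throughput(average output tokens per second): 480.043
 Percentile of request latency:
     p50: 3.7038
     p90: 5.7184
"""


def parse_summary(text):
    """Turn one 'Benchmarking summary' block into a flat {name: float} dict."""
    result = {}
    section = None  # most recent header line that had no value after the colon
    for raw in text.splitlines():
        key, sep, value = raw.strip().partition(":")
        key, value = key.strip(), value.strip()
        if not sep:
            continue  # not a "name: value" line
        if not value:
            section = key  # e.g. "Percentile of request latency"
            continue
        number = re.match(r"-?\d+(?:\.\d+)?", value)
        if number is None:
            continue  # value is not numeric
        # Percentile keys (p50, p90, ...) repeat per section, so prefix them.
        if section and re.fullmatch(r"p\d+", key):
            key = f"{section} {key}"
        result[key] = float(number.group())
    return result


row = parse_summary(SAMPLE)
print(row["Number of concurrency"], row["Average QPS"],
      row["Percentile of request latency p90"])
```

Running this over each log block and collecting the dicts gives one table column per concurrency level.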

Qwen2-72B-Chat

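| Concurrency (--parallel) | 8 | 16 | 32 | 64 | 128 | 150 | 200 | 256 | 512 |
|---|---|---|---|---|---|---|---|---|---|
| Time taken (s) | 1569.707 | 909.001 | 567.479 | 382.247 | 179.015 | 270.054 | 251.060 | 237.063 | 206.734 |
| QPS | 0.636 | 1.099 | 1.759 | 2.613 | 5.586 | 3.699 | 3.975 | 4.214 | 4.832 |
| Average latency (s) | 11.705 | 12.806 | 14.526 | 17.296 | 21.041 | 23.856 | 28.198 | 33.176 | 59.912 |
| Throughput (output tokens/s) | 188.589 | 342.764 | 588.262 | 973.795 | 1549.069 | 1595.176 | 1748.784 | 1866.155 | 1828.363 |
| p50 latency (s) | 11.8443 | 12.8275 | 14.7115 | 17.5290 | 21.3863 | 23.8551 | 28.4407 | 32.7329 | 63.0404 |
| p90 latency (s) | 16.2106 | 17.5466 | 20.0371 | 23.9890 | 29.5537 | 33.7459 | 40.4210 | 45.7666 | 81.6602 |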
  • Average input tokens per request: 40

  • Average output tokens per request: 277

  • parallel 8

Benchmarking summary: 
 Time taken for tests: 1569.707 seconds
 Expected number of requests: 1000
 Number of concurrency: 8
 Total requests: 1000
 Succeed requests: 999
 Failed requests: 1
 Average QPS: 0.636
 Average latency: 11.705
 Throughput(average output tokens per second): 188.589
 Average time to first token: 11.705
 Average input tokens per request: 40.303
 Average output tokens per request: 277.618
 Average time per output token: 0.00530
 Average package per request: 1.000
 Average package latency: 11.705
 Percentile of time to first token: 
     p50: 11.8443
     p66: 13.1671
     p75: 13.9665
     p80: 14.5981
     p90: 16.2106
     p95: 17.8844
     p98: 20.0471
     p99: 23.0309
 Percentile of request latency: 
     p50: 11.8443
     p66: 13.1671
     p75: 13.9665
     p80: 14.5981
     p90: 16.2106
     p95: 17.8844
     p98: 20.0471
     p99: 23.0309
  • parallel 16
Benchmarking summary: 
 Time taken for tests: 909.001 seconds
 Expected number of requests: 1000
 Number of concurrency: 16
 Total requests: 1000
 Succeed requests: 999
 Failed requests: 1
 Average QPS: 1.099
 Average latency: 12.806
 Throughput(average output tokens per second): 342.764
 Average time to first token: 12.806
 Average input tokens per request: 40.303
 Average output tokens per request: 278.224
 Average time per output token: 0.00292
 Average package per request: 1.000
 Average package latency: 12.806
 Percentile of time to first token: 
     p50: 12.8275
     p66: 14.3998
     p75: 15.3983
     p80: 16.1443
     p90: 17.5466
     p95: 19.6906
     p98: 22.2533
     p99: 25.1283
 Percentile of request latency: 
     p50: 12.8275
     p66: 14.3998
     p75: 15.3983
     p80: 16.1443
     p90: 17.5466
     p95: 19.6906
     p98: 22.2533
     p99: 25.1283
  • parallel 32
Benchmarking summary: 
 Time taken for tests: 567.479 seconds
 Expected number of requests: 1000
 Number of concurrency: 32
 Total requests: 1000
 Succeed requests: 998
 Failed requests: 2
 Average QPS: 1.759
 Average latency: 14.526
 Throughput(average output tokens per second): 588.262
 Average time to first token: 14.526
 Average input tokens per request: 40.297
 Average output tokens per request: 277.259
 Average time per output token: 0.00170
 Average package per request: 1.000
 Average package latency: 14.526
 Percentile of time to first token: 
     p50: 14.7115
     p66: 16.2993
     p75: 17.4013
     p80: 18.2002
     p90: 20.0371
     p95: 21.8216
     p98: 24.5539
     p99: 27.3373
 Percentile of request latency: 
     p50: 14.7115
     p66: 16.2993
     p75: 17.4013
     p80: 18.2002
     p90: 20.0371
     p95: 21.8216
     p98: 24.5539
     p99: 27.3373
  • parallel 64
Benchmarking summary: 
 Time taken for tests: 382.247 seconds
 Expected number of requests: 1000
 Number of concurrency: 64
 Total requests: 1000
 Succeed requests: 999
 Failed requests: 1
 Average QPS: 2.613
 Average latency: 17.296
 Throughput(average output tokens per second): 973.795
 Average time to first token: 17.296
 Average input tokens per request: 40.303
 Average output tokens per request: 276.968
 Average time per output token: 0.00103
 Average package per request: 1.000
 Average package latency: 17.296
 Percentile of time to first token: 
     p50: 17.5290
     p66: 19.5218
     p75: 20.7063
     p80: 21.6443
     p90: 23.9890
     p95: 26.0998
     p98: 29.7887
     p99: 32.3975
 Percentile of request latency: 
     p50: 17.5290
     p66: 19.5218
     p75: 20.7063
     p80: 21.6443
     p90: 23.9890
     p95: 26.0998
     p98: 29.7887
     p99: 32.3975
  • parallel 128
Benchmarking summary: 
 Time taken for tests: 179.015 seconds
 Expected number of requests: 1000
 Number of concurrency: 128
 Total requests: 1000
 Succeed requests: 1000
 Failed requests: 0
 Average QPS: 5.586
 Average latency: 21.041
 Throughput(average output tokens per second): 1549.069
 Average time to first token: 21.041
 Average input tokens per request: 40.296
 Average output tokens per request: 277.307
 Average time per output token: 0.00065
 Average package per request: 1.000
 Average package latency: 21.041
 Percentile of time to first token: 
     p50: 21.3863
     p66: 23.7687
     p75: 25.1877
     p80: 26.3636
     p90: 29.5537
     p95: 32.3925
     p98: 36.5261
     p99: 39.3924
 Percentile of request latency: 
     p50: 21.3863
     p66: 23.7687
     p75: 25.1877
     p80: 26.3636
     p90: 29.5537
     p95: 32.3925
     p98: 36.5261
     p99: 39.3924
  • parallel 150
Benchmarking summary: 
 Time taken for tests: 270.054 seconds
 Expected number of requests: 1000
 Number of concurrency: 150
 Total requests: 1000
 Succeed requests: 999
 Failed requests: 1
 Average QPS: 3.699
 Average latency: 23.856
 Throughput(average output tokens per second): 1595.176
 Average time to first token: 23.856
 Average input tokens per request: 40.303
 Average output tokens per request: 277.760
 Average time per output token: 0.00063
 Average package per request: 1.000
 Average package latency: 23.856
 Percentile of time to first token: 
     p50: 23.8551
     p66: 26.6484
     p75: 28.7350
     p80: 29.8586
     p90: 33.7459
     p95: 36.6390
     p98: 41.2772
     p99: 47.8515
 Percentile of request latency: 
     p50: 23.8551
     p66: 26.6484
     p75: 28.7350
     p80: 29.8586
     p90: 33.7459
     p95: 36.6390
     p98: 41.2772
     p99: 47.8515
  • parallel 200
Benchmarking summary: 
 Time taken for tests: 251.060 seconds
 Expected number of requests: 1000
 Number of concurrency: 200
 Total requests: 1000
 Succeed requests: 998
 Failed requests: 2
 Average QPS: 3.975
 Average latency: 28.198
 Throughput(average output tokens per second): 1748.784
 Average time to first token: 28.198
 Average input tokens per request: 40.308
 Average output tokens per request: 276.789
 Average time per output token: 0.00057
 Average package per request: 1.000
 Average package latency: 28.198
 Percentile of time to first token: 
     p50: 28.4407
     p66: 31.6658
     p75: 34.0785
     p80: 35.5489
     p90: 40.4210
     p95: 43.0363
     p98: 48.1876
     p99: 52.8204
 Percentile of request latency: 
     p50: 28.4407
     p66: 31.6658
     p75: 34.0785
     p80: 35.5489
     p90: 40.4210
     p95: 43.0363
     p98: 48.1876
     p99: 52.8204
  • parallel 256
Benchmarking summary: 
 Time taken for tests: 237.063 seconds
 Expected number of requests: 1000
 Number of concurrency: 256
 Total requests: 1000
 Succeed requests: 999
 Failed requests: 1
 Average QPS: 4.214
 Average latency: 33.176
 Throughput(average output tokens per second): 1866.155
 Average time to first token: 33.176
 Average input tokens per request: 40.303
 Average output tokens per request: 276.399
 Average time per output token: 0.00054
 Average package per request: 1.000
 Average package latency: 33.176
 Percentile of time to first token: 
     p50: 32.7329
     p66: 37.0212
     p75: 39.6246
     p80: 41.3947
     p90: 45.7666
     p95: 49.4765
     p98: 54.4858
     p99: 58.0206
 Percentile of request latency: 
     p50: 32.7329
     p66: 37.0212
     p75: 39.6246
     p80: 41.3947
     p90: 45.7666
     p95: 49.4765
     p98: 54.4858
     p99: 58.0206
  • parallel 512
Benchmarking summary: 
 Time taken for tests: 206.734 seconds
 Expected number of requests: 1000
 Number of concurrency: 512
 Total requests: 1000
 Succeed requests: 999
 Failed requests: 1
 Average QPS: 4.832
 Average latency: 59.912
 Throughput(average output tokens per second): 1828.363
 Average time to first token: 59.912
 Average input tokens per request: 40.303
 Average output tokens per request: 277.592
 Average time per output token: 0.00055
 Average package per request: 1.000
 Average package latency: 59.912
 Percentile of time to first token: 
     p50: 63.0404
     p66: 69.5273
     p75: 73.3044
     p80: 75.7455
     p90: 81.6602
     p95: 87.3050
     p98: 92.2078
     p99: 97.6189
 Percentile of request latency: 
     p50: 63.0404
     p66: 69.5273
     p75: 73.3044
     p80: 75.7455
     p90: 81.6602
     p95: 87.3050
     p98: 92.2078
     p99: 97.6189

Qwen2-72B-Chat (long.jsonl)

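| Concurrency (--parallel) | 8 | 12 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|---|
| Time taken (s) | 3091.468 | 2385.800 | 1598.805 | 1542.828 | 1509.713 | 1408.587 |
| QPS | 0.032 | 0.042 | 0.063 | 0.063 | 0.056 | 0.043 |
| Average latency (s) | 238.540 | 268.873 | 294.636 | 382.291 | 414.746 | 403.061 |
| Throughput (output tokens/s) | 158.986 | 206.011 | 307.418 | 299.114 | 268.391 | 199.552 |
| p50 latency (s) | 239.3327 | 270.9418 | 292.3093 | 348.5759 | 396.9046 | 350.8392 |
| p90 latency (s) | 239.6762 | 271.3905 | 313.2425 | 514.8233 | 567.8100 | 597.3057 |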
  • Average input tokens per request: 6385

  • Average output tokens per request: 4915

  • parallel 8

Benchmarking summary: 
 Time taken for tests: 3091.468 seconds
 Expected number of requests: 100
 Number of concurrency: 8
 Total requests: 100
 Succeed requests: 100
 Failed requests: 0
 Average QPS: 0.032
 Average latency: 238.540
 Throughput(average output tokens per second): 158.986
 Average time to first token: 238.540
 Average input tokens per request: 6385.000
 Average output tokens per request: 4915.000
 Average time per output token: 0.00629
 Average package per request: 1.000
 Average package latency: 238.540
 Percentile of time to first token: 
     p50: 239.3327
     p66: 239.5209
     p75: 239.5622
     p80: 239.6446
     p90: 239.6762
     p95: 240.0079
     p98: 240.0081
     p99: 240.0121
 Percentile of request latency: 
     p50: 239.3327
     p66: 239.5209
     p75: 239.5622
     p80: 239.6446
     p90: 239.6762
     p95: 240.0079
     p98: 240.0081
     p99: 240.0121
  • parallel 12
Benchmarking summary: 
 Time taken for tests: 2385.800 seconds
 Expected number of requests: 100
 Number of concurrency: 12
 Total requests: 100
 Succeed requests: 100
 Failed requests: 0
 Average QPS: 0.042
 Average latency: 268.873
 Throughput(average output tokens per second): 206.011
 Average time to first token: 268.873
 Average input tokens per request: 6385.000
 Average output tokens per request: 4915.000
 Average time per output token: 0.00485
 Average package per request: 1.000
 Average package latency: 268.873
 Percentile of time to first token: 
     p50: 270.9418
     p66: 271.2864
     p75: 271.3597
     p80: 271.3610
     p90: 271.3905
     p95: 271.3978
     p98: 271.4502
     p99: 271.4786
 Percentile of request latency: 
     p50: 270.9418
     p66: 271.2864
     p75: 271.3597
     p80: 271.3610
     p90: 271.3905
     p95: 271.3978
     p98: 271.4502
     p99: 271.4786
  • parallel 20
Benchmarking summary: 
 Time taken for tests: 1598.805 seconds
 Expected number of requests: 100
 Number of concurrency: 20
 Total requests: 100
 Succeed requests: 100
 Failed requests: 0
 Average QPS: 0.063
 Average latency: 294.636
 Throughput(average output tokens per second): 307.418
 Average time to first token: 294.636
 Average input tokens per request: 6385.000
 Average output tokens per request: 4915.020
 Average time per output token: 0.00325
 Average package per request: 1.000
 Average package latency: 294.636
 Percentile of time to first token: 
     p50: 292.3093
     p66: 293.7218
     p75: 296.3762
     p80: 296.4460
     p90: 313.2425
     p95: 323.9384
     p98: 334.5282
     p99: 357.6738
 Percentile of request latency: 
     p50: 292.3093
     p66: 293.7218
     p75: 296.3762
     p80: 296.4460
     p90: 313.2425
     p95: 323.9384
     p98: 334.5282
     p99: 357.6738
  • parallel 30
Benchmarking summary: 
 Time taken for tests: 1542.828 seconds
 Expected number of requests: 100
 Number of concurrency: 30
 Total requests: 97
 Succeed requests: 97
 Failed requests: 0
 Average QPS: 0.063
 Average latency: 382.291
 Throughput(average output tokens per second): 299.114
 Average time to first token: 382.291
 Average input tokens per request: 6385.000
 Average output tokens per request: 4757.546
 Average time per output token: 0.00334
 Average package per request: 1.000
 Average package latency: 382.291
 Percentile of time to first token: 
     p50: 348.5759
     p66: 378.9814
     p75: 420.5971
     p80: 443.5496
     p90: 514.8233
     p95: 548.8156
     p98: 559.4441
     p99: 590.6074
 Percentile of request latency: 
     p50: 348.5759
     p66: 378.9814
     p75: 420.5971
     p80: 443.5496
     p90: 514.8233
     p95: 548.8156
     p98: 559.4441
     p99: 590.6074
  • parallel 40
Benchmarking summary: 
 Time taken for tests: 1509.713 seconds
 Expected number of requests: 100
 Number of concurrency: 40
 Total requests: 87
 Succeed requests: 85
 Failed requests: 2
 Average QPS: 0.056
 Average latency: 414.746
 Throughput(average output tokens per second): 268.391
 Average time to first token: 414.746
 Average input tokens per request: 6385.000
 Average output tokens per request: 4766.976
 Average time per output token: 0.00373
 Average package per request: 1.000
 Average package latency: 414.746
 Percentile of time to first token: 
     p50: 396.9046
     p66: 458.9677
     p75: 482.9745
     p80: 521.3878
     p90: 567.8100
     p95: 580.6507
     p98: 586.6928
     p99: 587.2930
 Percentile of request latency: 
     p50: 396.9046
     p66: 458.9677
     p75: 482.9745
     p80: 521.3878
     p90: 567.8100
     p95: 580.6507
     p98: 586.6928
     p99: 587.2930
  • parallel 50
Benchmarking summary: 
 Time taken for tests: 1408.587 seconds
 Expected number of requests: 100
 Number of concurrency: 50
 Total requests: 78
 Succeed requests: 60
 Failed requests: 18
 Average QPS: 0.043
 Average latency: 403.061
 Throughput(average output tokens per second): 199.552
 Average time to first token: 403.061
 Average input tokens per request: 6385.000
 Average output tokens per request: 4684.783
 Average time per output token: 0.00501
 Average package per request: 1.000
 Average package latency: 403.061
 Percentile of time to first token: 
     p50: 350.8392
     p66: 386.2204
     p75: 516.6384
     p80: 558.8747
     p90: 597.3057
     p95: 597.3098
     p98: 597.5291
     p99: 598.8612
 Percentile of request latency: 
     p50: 350.8392
     p66: 386.2204
     p75: 516.6384
     p80: 558.8747
     p90: 597.3057
     p95: 597.3098
     p98: 597.5291
     p99: 598.8612
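
One way to read the long.jsonl results above is to pick the concurrency with the highest throughput among the runs that stay inside a latency budget and have no failed requests. A small sketch of that selection, using the table values and the failure counts from the logs; the function `best_run` and the 320 s and 600 s budgets are assumptions chosen only for illustration:

```python
# Qwen2-72B-Chat (long.jsonl): values copied from the table and logs above.
runs = [
    {"parallel": 8,  "throughput": 158.986, "p90": 239.6762, "failed": 0},
    {"parallel": 12, "throughput": 206.011, "p90": 271.3905, "failed": 0},
    {"parallel": 20, "throughput": 307.418, "p90": 313.2425, "failed": 0},
    {"parallel": 30, "throughput": 299.114, "p90": 514.8233, "failed": 0},
    {"parallel": 40, "throughput": 268.391, "p90": 567.8100, "failed": 2},
    {"parallel": 50, "throughput": 199.552, "p90": 597.3057, "failed": 18},
]


def best_run(runs, max_p90, max_failed=0):
    """Return the highest-throughput run within the p90 and failure limits."""
    candidates = [
        r for r in runs
        if r["p90"] <= max_p90 and r["failed"] <= max_failed
    ]
    if not candidates:
        return None  # no run satisfies the constraints
    return max(candidates, key=lambda r: r["throughput"])


tight = best_run(runs, max_p90=320.0)   # hypothetical 320 s budget
loose = best_run(runs, max_p90=600.0)   # hypothetical 600 s budget
print(tight["parallel"], loose["parallel"])
```

With these numbers both budgets select `--parallel 20`: raising concurrency further increases p90 latency without increasing throughput.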

DeepSeek-Coder-6.7B-Instruct

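| Concurrency (--parallel) | 8 | 16 | 32 | 64 | 128 | 150 | 200 | 300 | 400 | 500 | 600 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Time taken (s) | 621.642 | 325.248 | 178.007 | 109.977 | 70.124 | 67.204 | 61.252 | 63.928 | 70.753 | 72.668 | 75.559 |
| QPS | 1.609 | 3.075 | 5.618 | 9.093 | 14.261 | 14.880 | 16.326 | 15.643 | 12.494 | 7.527 | 5.294 |
| Average latency (s) | 4.967 | 5.153 | 5.590 | 6.847 | 8.644 | 9.535 | 12.010 | 17.516 | 20.797 | 18.579 | 21.424 |
| Throughput (output tokens/s) | 643.457 | 1229.830 | 2247.103 | 3637.131 | 5704.202 | 5952.054 | 6530.446 | 6257.033 | 4997.658 | 3010.939 | 2117.557 |
| p50 latency (s) | 4.9568 | 5.1310 | 5.5455 | 6.9095 | 8.6159 | 9.7244 | 12.1211 | 14.5084 | 20.3585 | 17.9382 | 22.6927 |
| p90 latency (s) | 5.0456 | 5.3116 | 6.0241 | 6.9913 | 8.9456 | 10.0026 | 12.3709 | 23.8041 | 28.1569 | 28.4858 | 24.4160 |
| Failed requests | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 184 | 216 |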
  • Average input tokens per request: 157

  • Average output tokens per request: 400

  • parallel 8

Benchmarking summary: 
 Time taken for tests: 621.642 seconds
 Expected number of requests: 1000
 Number of concurrency: 8
 Total requests: 1000
 Succeed requests: 1000
 Failed requests: 0
 Average QPS: 1.609
 Average latency: 4.967
 Throughput(average output tokens per second): 643.457
 Average time to first token: 4.967
 Average input tokens per request: 157.292
 Average output tokens per request: 400.000
 Average time per output token: 0.00155
 Average package per request: 1.000
 Average package latency: 4.967
 Percentile of time to first token: 
     p50: 4.9568
     p66: 4.9916
     p75: 5.0074
     p80: 5.0173
     p90: 5.0456
     p95: 5.0796
     p98: 5.1067
     p99: 5.1223
 Percentile of request latency: 
     p50: 4.9568
     p66: 4.9916
     p75: 5.0074
     p80: 5.0173
     p90: 5.0456
     p95: 5.0796
     p98: 5.1067
     p99: 5.1223
  • parallel 16
Benchmarking summary: 
 Time taken for tests: 325.248 seconds
 Expected number of requests: 1000
 Number of concurrency: 16
 Total requests: 1000
 Succeed requests: 1000
 Failed requests: 0
 Average QPS: 3.075
 Average latency: 5.153
 Throughput(average output tokens per second): 1229.830
 Average time to first token: 5.153
 Average input tokens per request: 157.292
 Average output tokens per request: 400.000
 Average time per output token: 0.00081
 Average package per request: 1.000
 Average package latency: 5.153
 Percentile of time to first token: 
     p50: 5.1310
     p66: 5.1707
     p75: 5.2303
     p80: 5.2545
     p90: 5.3116
     p95: 5.4090
     p98: 5.4987
     p99: 5.5322
 Percentile of request latency: 
     p50: 5.1310
     p66: 5.1707
     p75: 5.2303
     p80: 5.2545
     p90: 5.3116
     p95: 5.4090
     p98: 5.4987
     p99: 5.5322
  • parallel 32
Benchmarking summary: 
 Time taken for tests: 178.007 seconds
 Expected number of requests: 1000
 Number of concurrency: 32
 Total requests: 1000
 Succeed requests: 1000
 Failed requests: 0
 Average QPS: 5.618
 Average latency: 5.590
 Throughput(average output tokens per second): 2247.103
 Average time to first token: 5.590
 Average input tokens per request: 157.292
 Average output tokens per request: 400.000
 Average time per output token: 0.00045
 Average package per request: 1.000
 Average package latency: 5.590
 Percentile of time to first token: 
     p50: 5.5455
     p66: 5.5729
     p75: 5.6628
     p80: 5.7434
     p90: 6.0241
     p95: 6.0393
     p98: 6.1004
     p99: 6.1034
 Percentile of request latency: 
     p50: 5.5455
     p66: 5.5729
     p75: 5.6628
     p80: 5.7434
     p90: 6.0241
     p95: 6.0393
     p98: 6.1004
     p99: 6.1034
  • parallel 64
Benchmarking summary: 
 Time taken for tests: 109.977 seconds
 Expected number of requests: 1000
 Number of concurrency: 64
 Total requests: 1000
 Succeed requests: 1000
 Failed requests: 0
 Average QPS: 9.093
 Average latency: 6.847
 Throughput(average output tokens per second): 3637.131
 Average time to first token: 6.847
 Average input tokens per request: 157.292
 Average output tokens per request: 400.000
 Average time per output token: 0.00027
 Average package per request: 1.000
 Average package latency: 6.847
 Percentile of time to first token: 
     p50: 6.9095
     p66: 6.9292
     p75: 6.9419
     p80: 6.9551
     p90: 6.9913
     p95: 7.0102
     p98: 7.0201
     p99: 7.0224
 Percentile of request latency: 
     p50: 6.9095
     p66: 6.9292
     p75: 6.9419
     p80: 6.9551
     p90: 6.9913
     p95: 7.0102
     p98: 7.0201
     p99: 7.0224
  • parallel 128
Benchmarking summary: 
 Time taken for tests: 70.124 seconds
 Expected number of requests: 1000
 Number of concurrency: 128
 Total requests: 1000
 Succeed requests: 1000
 Failed requests: 0
 Average QPS: 14.261
 Average latency: 8.644
 Throughput(average output tokens per second): 5704.202
 Average time to first token: 8.644
 Average input tokens per request: 157.292
 Average output tokens per request: 400.000
 Average time per output token: 0.00018
 Average package per request: 1.000
 Average package latency: 8.644
 Percentile of time to first token: 
     p50: 8.6159
     p66: 8.7015
     p75: 8.7256
     p80: 8.8329
     p90: 8.9456
     p95: 8.9532
     p98: 8.9652
     p99: 8.9727
 Percentile of request latency: 
     p50: 8.6159
     p66: 8.7015
     p75: 8.7256
     p80: 8.8329
     p90: 8.9456
     p95: 8.9532
     p98: 8.9652
     p99: 8.9727
  • parallel 150
Benchmarking summary: 
 Time taken for tests: 67.204 seconds
 Expected number of requests: 1000
 Number of concurrency: 150
 Total requests: 1000
 Succeed requests: 1000
 Failed requests: 0
 Average QPS: 14.880
 Average latency: 9.535
 Throughput(average output tokens per second): 5952.054
 Average time to first token: 9.535
 Average input tokens per request: 157.292
 Average output tokens per request: 400.000
 Average time per output token: 0.00017
 Average package per request: 1.000
 Average package latency: 9.535
 Percentile of time to first token: 
     p50: 9.7244
     p66: 9.8244
     p75: 9.8670
     p80: 9.8891
     p90: 10.0026
     p95: 10.0508
     p98: 10.0876
     p99: 10.1092
 Percentile of request latency: 
     p50: 9.7244
     p66: 9.8244
     p75: 9.8670
     p80: 9.8891
     p90: 10.0026
     p95: 10.0508
     p98: 10.0876
     p99: 10.1092
  • parallel 200
Benchmarking summary: 
 Time taken for tests: 61.252 seconds
 Expected number of requests: 1000
 Number of concurrency: 200
 Total requests: 1000
 Succeed requests: 1000
 Failed requests: 0
 Average QPS: 16.326
 Average latency: 12.010
 Throughput(average output tokens per second): 6530.446
 Average time to first token: 12.010
 Average input tokens per request: 157.292
 Average output tokens per request: 400.000
 Average time per output token: 0.00015
 Average package per request: 1.000
 Average package latency: 12.010
 Percentile of time to first token: 
     p50: 12.1211
     p66: 12.2211
     p75: 12.2520
     p80: 12.2641
     p90: 12.3709
     p95: 12.3958
     p98: 12.4472
     p99: 12.4868
 Percentile of request latency: 
     p50: 12.1211
     p66: 12.2211
     p75: 12.2520
     p80: 12.2641
     p90: 12.3709
     p95: 12.3958
     p98: 12.4472
     p99: 12.4868
  • parallel 300
Benchmarking summary: 
 Time taken for tests: 63.928 seconds
 Expected number of requests: 1000
 Number of concurrency: 300
 Total requests: 1000
 Succeed requests: 1000
 Failed requests: 0
 Average QPS: 15.643
 Average latency: 17.516
 Throughput(average output tokens per second): 6257.033
 Average time to first token: 17.516
 Average input tokens per request: 157.292
 Average output tokens per request: 400.000
 Average time per output token: 0.00016
 Average package per request: 1.000
 Average package latency: 17.516
 Percentile of time to first token: 
     p50: 14.5084
     p66: 21.8597
     p75: 22.9786
     p80: 23.3034
     p90: 23.8041
     p95: 25.4313
     p98: 25.8759
     p99: 26.0190
 Percentile of request latency: 
     p50: 14.5084
     p66: 21.8597
     p75: 22.9786
     p80: 23.3034
     p90: 23.8041
     p95: 25.4313
     p98: 25.8759
     p99: 26.0190
  • parallel 400
Benchmarking summary: 
 Time taken for tests: 70.753 seconds
 Expected number of requests: 1000
 Number of concurrency: 400
 Total requests: 884
 Succeed requests: 884
 Failed requests: 0
 Average QPS: 12.494
 Average latency: 20.797
 Throughput(average output tokens per second): 4997.658
 Average time to first token: 20.797
 Average input tokens per request: 157.958
 Average output tokens per request: 400.000
 Average time per output token: 0.00020
 Average package per request: 1.000
 Average package latency: 20.797
 Percentile of time to first token: 
     p50: 20.3585
     p66: 25.7757
     p75: 26.3887
     p80: 27.0304
     p90: 28.1569
     p95: 28.6731
     p98: 29.6462
     p99: 29.8135
 Percentile of request latency: 
     p50: 20.3585
     p66: 25.7757
     p75: 26.3887
     p80: 27.0304
     p90: 28.1569
     p95: 28.6731
     p98: 29.6462
     p99: 29.8135
  • parallel 500
Benchmarking summary: 
 Time taken for tests: 72.668 seconds
 Expected number of requests: 1000
 Number of concurrency: 500
 Total requests: 731
 Succeed requests: 547
 Failed requests: 184
 Average QPS: 7.527
 Average latency: 18.579
 Throughput(average output tokens per second): 3010.939
 Average time to first token: 18.579
 Average input tokens per request: 156.399
 Average output tokens per request: 400.000
 Average time per output token: 0.00033
 Average package per request: 1.000
 Average package latency: 18.579
 Percentile of time to first token: 
     p50: 17.9382
     p66: 19.5846
     p75: 20.2549
     p80: 20.4220
     p90: 28.4858
     p95: 29.5889
     p98: 29.9512
     p99: 30.0487
 Percentile of request latency: 
     p50: 17.9382
     p66: 19.5846
     p75: 20.2549
     p80: 20.4220
     p90: 28.4858
     p95: 29.5889
     p98: 29.9512
     p99: 30.0487
  • parallel 600
Benchmarking summary: 
 Time taken for tests: 75.559 seconds
 Expected number of requests: 1000
 Number of concurrency: 600
 Total requests: 616
 Succeed requests: 400
 Failed requests: 216
 Average QPS: 5.294
 Average latency: 21.424
 Throughput(average output tokens per second): 2117.557
 Average time to first token: 21.424
 Average input tokens per request: 157.625
 Average output tokens per request: 400.000
 Average time per output token: 0.00047
 Average package per request: 1.000
 Average package latency: 21.424
 Percentile of time to first token: 
     p50: 22.6927
     p66: 23.5150
     p75: 24.1050
     p80: 24.2716
     p90: 24.4160
     p95: 24.5849
     p98: 29.1271
     p99: 30.2161
 Percentile of request latency: 
     p50: 22.6927
     p66: 23.5150
     p75: 24.1050
     p80: 24.2716
     p90: 24.4160
     p95: 24.5849
     p98: 29.1271
     p99: 30.2161
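
As a rough cross-check of the DeepSeek-Coder logs above, the reported QPS and throughput can be recomputed from the other fields. The sketch below assumes QPS ≈ succeeded requests / time taken and throughput ≈ QPS × average output tokens. This is an observation that fits these rows, not a statement of how EvalScope computes its metrics, and `derived` is a hypothetical helper:

```python
# (parallel, seconds, succeeded, avg_output_tokens, reported_qps, reported_throughput)
# Values copied from the DeepSeek-Coder-6.7B-Instruct logs above.
rows = [
    (8,   621.642, 1000, 400.0, 1.609,  643.457),
    (200, 61.252,  1000, 400.0, 16.326, 6530.446),
    (500, 72.668,  547,  400.0, 7.527,  3010.939),
    (600, 75.559,  400,  400.0, 5.294,  2117.557),
]


def derived(seconds, succeeded, avg_output_tokens):
    """Recompute QPS and output-token throughput from the raw counts."""
    qps = succeeded / seconds
    return qps, qps * avg_output_tokens


def relative_error(value, reference):
    """Relative difference between a recomputed value and the reported one."""
    return abs(value - reference) / reference


checks = []
for parallel, seconds, ok, tokens, rep_qps, rep_tp in rows:
    qps, tp = derived(seconds, ok, tokens)
    checks.append((parallel, relative_error(qps, rep_qps), relative_error(tp, rep_tp)))
    print(f"parallel={parallel}: qps={qps:.3f} throughput={tp:.3f}")
```

The recomputed values agree with the reported ones to within rounding for these rows. Several of the Qwen runs with failed requests report a throughput that this relation does not reproduce, so treat it as a sanity check rather than a definition.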

Install dependencies

pip install evalscope-perf
pip install evalscope

Run the command

evalscope-perf http://127.0.0.1:1025/v1/chat/completions qwen \
    ./datasets/Codefuse-Evol-Instruct-Clean-data.jsonl \
    --parallels 32 \
    --parallels 64 \
    --parallels 100 \
    --parallels 128 \
    --parallels 150 \
    --parallels 200 \
    --parallels 256 \
    --parallels 300 \
    --parallels 400 \
    --parallels 500 \
    --parallels 600 \
    --parallels 700 \
    --parallels 800 \
    --parallels 900 \
    --parallels 1000 \
    --n 2000
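
Each run prints a summary in a regular `Label: value` layout, so the numbers can be collected programmatically instead of being copied by hand. The following is a minimal sketch, assuming the log text looks exactly like the summaries shown in this post; `parse_summary` and the short key names are hypothetical helpers, not part of EvalScope or evalscope-perf.

```python
import re

# Labels copied from the "Benchmarking summary" logs in this post.
# The short keys on the right are an arbitrary choice for this sketch.
FIELDS = {
    "Time taken for tests": "time",
    "Number of concurrency": "concurrency",
    "Succeed requests": "succeeded",
    "Failed requests": "failed",
    "Average QPS": "qps",
    "Average latency": "latency",
    "Throughput(average output tokens per second)": "throughput",
}


def parse_summary(text):
    """Parse one summary block into a dict of floats."""
    result = {}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Percentile of"):
            # Remember which percentile table the following pXX lines belong to.
            section = "latency" if "request latency" in line else "ttft"
            continue
        match = re.match(r"^(.+?):\s*([0-9.]+)", line)
        if not match:
            continue  # headers, bullets, blank lines
        label, value = match.group(1), float(match.group(2))
        if label in FIELDS:
            result[FIELDS[label]] = value
        elif section == "latency" and re.fullmatch(r"p\d+", label):
            # Keep only the request-latency percentiles (p50, p90, ...).
            result[label] = value
    return result


# Abridged copy of the "parallel 600" summary shown above.
sample = """
Benchmarking summary:
 Time taken for tests: 75.559 seconds
 Number of concurrency: 600
 Succeed requests: 400
 Failed requests: 216
 Average QPS: 5.294
 Average latency: 21.424
 Throughput(average output tokens per second): 2117.557
 Percentile of time to first token:
     p50: 22.6927
     p90: 24.4160
 Percentile of request latency:
     p50: 22.6927
     p90: 24.4160
"""

row = parse_summary(sample)
print(row["concurrency"], row["qps"], row["throughput"], row["p50"], row["p90"])
```

Calling `parse_summary` once per run and appending the values to lists gives the `concurrency`, `qps`, `latency`, `throughput`, `p50` and `p90` series in the same shape the plotting code expects.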

Plotting code

import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties

# Set a font that can render Chinese characters
font_path =  '/System/Library/Fonts/Hiragino Sans GB.ttc' # replace with the path to a font file on your system
font_prop = FontProperties(fname=font_path)

# Data: Qwen1.5-7B-Chat
concurrency = [8, 16, 32, 64, 128, 150, 200, 256, 300, 400, 512, 720]
time = [404.284, 215.085, 120.876, 82.884, 52.844, 48.884, 44.856, 42.750, 43.729, 42.866, 41.655, 40.580]
qps = [2.474, 4.649, 8.273, 12.065, 18.924, 20.457, 22.294, 23.392, 22.868, 23.328, 24.007, 24.643]
latency = [3.213, 3.391, 3.724, 4.974, 6.108, 6.517, 7.805, 9.208, 10.904, 13.628, 16.031, 18.790]
throughput = [594.929, 1119.102, 1989.159, 2903.023, 4559.489, 4920.856, 5354.701, 5636.846, 5506.567, 5618.110, 5772.230, 5940.622]
p50 = [3.2461, 3.4271, 3.7514, 5.0248, 6.1491, 6.5487, 7.7782, 9.1754, 10.9164, 13.9850, 17.0675, 20.4161]
p90 = [4.9771, 5.2905, 5.8484, 7.7522, 9.5705, 10.1980, 12.3493, 13.6320, 15.8667, 19.2667, 22.7640, 26.8183]

# Data: Qwen1.5-14B-Chat
# concurrency = [8, 16, 32, 64, 128, 150, 200, 256, 512]
# time = [578.571, 361.169, 253.040, 204.961, 170.001, 169.981, 162.999, 159.840, 153.937]
# qps = [1.727, 2.766, 3.952, 4.874, 5.882, 5.877, 6.129, 6.250, 6.490]
# latency = [3.712, 3.928, 4.581, 5.628, 7.223, 8.004, 9.205, 11.446, 22.695]
# throughput = [480.043, 897.511, 915.133, 2333.955, 1363.656, 3621.096, 4310.525, 4333.013, 3806.748]
# p50 = [3.7038, 3.9261, 4.4544, 5.6047, 7.0534, 7.9484, 9.0271, 11.3066, 23.6591]
# p90 = [5.7184, 6.0562, 6.9198, 8.7597, 11.1194, 12.5181, 14.5017, 16.8646, 33.3052]

# Data: Qwen2-72B-Chat
# concurrency = [8, 16, 32, 64, 128, 150, 200, 256, 512]
# time = [1569.707, 909.001, 567.479, 382.247, 179.015, 270.054, 251.060, 237.063, 206.734]
# qps = [0.636, 1.099, 1.759, 2.613, 5.586, 3.699, 3.975, 4.214, 4.832]
# latency = [11.705, 12.806, 14.526, 17.296, 21.041, 23.856, 28.198, 33.176, 59.912]
# throughput = [188.589, 342.764, 588.262, 973.795, 1549.069, 1595.176, 1748.784, 1866.155, 1828.363]
# p50 = [11.8443, 12.8275, 14.7115, 17.5290, 21.3863, 23.8551, 28.4407, 32.7329, 63.0404]
# p90 = [16.2106, 17.5466, 20.0371, 23.9890, 29.5537, 33.7459, 40.4210, 45.7666, 81.6602]


# Plot each metric against concurrency
plt.figure(figsize=(12, 8))

# (title, y-values, y-axis label) for each of the six subplots
panels = [
    ('Time vs concurrency', time, 'Time (s)'),
    ('QPS vs concurrency', qps, 'QPS'),
    ('Latency vs concurrency', latency, 'Latency (s)'),
    ('Throughput vs concurrency', throughput, 'Throughput (output tokens/s)'),
    ('p50 vs concurrency', p50, 'p50 (s)'),
    ('p90 vs concurrency', p90, 'p90 (s)'),
]

# Draw the panels on a 2 x 3 grid, one metric per subplot
for i, (title, values, ylabel) in enumerate(panels, start=1):
    plt.subplot(2, 3, i)
    plt.plot(concurrency, values, marker='o')
    plt.title(title, fontproperties=font_prop)
    plt.xlabel('Concurrency', fontproperties=font_prop)
    plt.ylabel(ylabel, fontproperties=font_prop)

# Show the figure
plt.tight_layout()
plt.show()
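
The curves flatten out at high concurrency, and a rough way to put a number on that is to look for the first step where QPS grows by less than some relative threshold. The sketch below does this for the Qwen1.5-7B-Chat series listed in the plotting code; the function name `saturation_point` and the 5% threshold are assumptions made for illustration, and the concurrency steps are unevenly spaced, so treat the result as a coarse indication only.

```python
# Qwen1.5-7B-Chat series copied from the plotting code in this post.
concurrency = [8, 16, 32, 64, 128, 150, 200, 256, 300, 400, 512, 720]
qps = [2.474, 4.649, 8.273, 12.065, 18.924, 20.457,
       22.294, 23.392, 22.868, 23.328, 24.007, 24.643]


def saturation_point(concurrency, qps, min_gain=0.05):
    """Return the concurrency level after which the next step
    improves QPS by less than `min_gain` (relative gain)."""
    for i in range(1, len(qps)):
        gain = (qps[i] - qps[i - 1]) / qps[i - 1]
        if gain < min_gain:
            return concurrency[i - 1]
    # QPS kept improving by at least min_gain at every step.
    return concurrency[-1]


print("QPS stops improving noticeably after concurrency:",
      saturation_point(concurrency, qps))
```

A stricter or looser `min_gain` moves the answer, so it is worth trying a few values against the plotted curve rather than relying on a single threshold.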

Results (vLLM)

Qwen1.5-7B-Chat

Metric                 8         16        32        64        100       128       150       200
Time (s)               2555.302  1355.736  800.953   515.309   403.138   375.187   386.202   355.307
QPS                    0.391     0.738     1.249     1.941     2.481     2.660     2.569     2.730
Latency (s)            20.326    21.475    24.877    31.015    37.778    43.181    52.514    61.304
Throughput (tokens/s)  94.803    177.014   300.603   469.235   595.103   638.980   612.597   640.172
p50 (s)                20.5326   21.6749   24.9076   31.2051   37.8700   42.3652   52.1732   61.0796
p90 (s)                31.8381   33.7150   38.8008   48.5248   59.2335   68.8249   84.6935   96.6051
Failed                 0         0         0         0         0         1         6         28
  • Average input tokens per request: 40
  • Average output tokens per request: 240
# Data
concurrency = [8, 16, 32, 64, 100, 128, 150, 200]
time = [2555.302, 1355.736, 800.953, 515.309, 403.138, 375.187, 386.202, 355.307]
qps = [0.391, 0.738, 1.249, 1.941, 2.481, 2.660, 2.569, 2.730]
latency = [20.326, 21.475, 24.877, 31.015, 37.778, 43.181, 52.514, 61.304]
throughput = [94.803, 177.014, 300.603, 469.235, 595.103, 638.980, 612.597, 640.172]
p50 = [20.5326, 21.6749, 24.9076, 31.2051, 37.8700, 42.3652, 52.1732, 61.0796]
p90 = [31.8381, 33.7150, 38.8008, 48.5248, 59.2335, 68.8249, 84.6935, 96.6051]
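
The reported figures appear to be consistent with two simple relations: QPS is roughly the number of successful requests divided by the total time, and throughput is roughly QPS multiplied by the average output tokens per request. In addition, QPS multiplied by average latency (Little's law) estimates how many requests were in flight on average, which should be close to the configured concurrency when the server keeps up. The sketch below checks this against the `parallel 8` run in this section; `derived_metrics` is a hypothetical helper and this is an observation about the numbers, not a statement of how EvalScope computes them internally.

```python
def derived_metrics(succeeded, seconds, avg_output_tokens, avg_latency):
    """Recompute summary metrics from the raw counts in one log."""
    qps = succeeded / seconds                 # successful requests per second
    throughput = qps * avg_output_tokens      # output tokens per second
    in_flight = qps * avg_latency             # Little's law: L = lambda * W
    return qps, throughput, in_flight


# Values taken from the "parallel 8" summary of Qwen1.5-7B-Chat on vLLM:
# 1000 succeeded requests, 2555.302 s total, 242.251 output tokens per
# request on average, 20.326 s average latency.
qps, throughput, in_flight = derived_metrics(1000, 2555.302, 242.251, 20.326)

print(f"QPS        {qps:.3f}  (reported 0.391)")
print(f"Throughput {throughput:.3f}  (reported 94.803)")
print(f"In flight  {in_flight:.2f}  (configured concurrency 8)")
```

Running the same check on the `parallel 200` run gives about 167 requests in flight rather than 200, which is one way to see that the server is no longer keeping all client slots busy with successful requests at that load.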
  • parallel 8
Benchmarking summary: 
 Time taken for tests: 2555.302 seconds
 Expected number of requests: 1000
 Number of concurrency: 8
 Total requests: 1000
 Succeed requests: 1000
 Failed requests: 0
 Average QPS: 0.391
 Average latency: 20.326
 Throughput(average output tokens per second): 94.803
 Average time to first token: 20.326
 Average input tokens per request: 40.296
 Average output tokens per request: 242.251
 Average time per output token: 0.01055
 Average package per request: 1.000
 Average package latency: 20.326
 Percentile of time to first token: 
     p50: 20.5326
     p66: 23.7691
     p75: 26.5282
     p80: 28.1502
     p90: 31.8381
     p95: 35.6152
     p98: 40.9497
     p99: 45.7076
 Percentile of request latency: 
     p50: 20.5326
     p66: 23.7691
     p75: 26.5282
     p80: 28.1502
     p90: 31.8381
     p95: 35.6152
     p98: 40.9497
     p99: 45.7076
  • parallel 16
Benchmarking summary: 
 Time taken for tests: 1355.736 seconds
 Expected number of requests: 1000
 Number of concurrency: 16
 Total requests: 1000
 Succeed requests: 1000
 Failed requests: 0
 Average QPS: 0.738
 Average latency: 21.475
 Throughput(average output tokens per second): 177.014
 Average time to first token: 21.475
 Average input tokens per request: 40.296
 Average output tokens per request: 239.984
 Average time per output token: 0.00565
 Average package per request: 1.000
 Average package latency: 21.475
 Percentile of time to first token: 
     p50: 21.6749
     p66: 25.2886
     p75: 27.4429
     p80: 29.0391
     p90: 33.7150
     p95: 37.1781
     p98: 42.4568
     p99: 45.9629
 Percentile of request latency: 
     p50: 21.6749
     p66: 25.2886
     p75: 27.4429
     p80: 29.0391
     p90: 33.7150
     p95: 37.1781
     p98: 42.4568
     p99: 45.9629
  • parallel 32
Benchmarking summary: 
 Time taken for tests: 800.953 seconds
 Expected number of requests: 1000
 Number of concurrency: 32
 Total requests: 1000
 Succeed requests: 1000
 Failed requests: 0
 Average QPS: 1.249
 Average latency: 24.877
 Throughput(average output tokens per second): 300.603
 Average time to first token: 24.877
 Average input tokens per request: 40.296
 Average output tokens per request: 240.769
 Average time per output token: 0.00333
 Average package per request: 1.000
 Average package latency: 24.877
 Percentile of time to first token: 
     p50: 24.9076
     p66: 29.1147
     p75: 32.1535
     p80: 34.1477
     p90: 38.8008
     p95: 43.0980
     p98: 48.4589
     p99: 53.6278
 Percentile of request latency: 
     p50: 24.9076
     p66: 29.1147
     p75: 32.1535
     p80: 34.1477
     p90: 38.8008
     p95: 43.0980
     p98: 48.4589
     p99: 53.6278
  • parallel 64
Benchmarking summary: 
 Time taken for tests: 515.309 seconds
 Expected number of requests: 1000
 Number of concurrency: 64
 Total requests: 1000
 Succeed requests: 1000
 Failed requests: 0
 Average QPS: 1.941
 Average latency: 31.015
 Throughput(average output tokens per second): 469.235
 Average time to first token: 31.015
 Average input tokens per request: 40.296
 Average output tokens per request: 241.801
 Average time per output token: 0.00213
 Average package per request: 1.000
 Average package latency: 31.015
 Percentile of time to first token: 
     p50: 31.2051
     p66: 36.5305
     p75: 40.1962
     p80: 42.1270
     p90: 48.5248
     p95: 53.4686
     p98: 60.3826
     p99: 65.7931
 Percentile of request latency: 
     p50: 31.2051
     p66: 36.5305
     p75: 40.1962
     p80: 42.1270
     p90: 48.5248
     p95: 53.4686
     p98: 60.3826
     p99: 65.7931
  • parallel 100
Benchmarking summary: 
 Time taken for tests: 403.138 seconds
 Expected number of requests: 1000
 Number of concurrency: 100
 Total requests: 1000
 Succeed requests: 1000
 Failed requests: 0
 Average QPS: 2.481
 Average latency: 37.778
 Throughput(average output tokens per second): 595.103
 Average time to first token: 37.778
 Average input tokens per request: 40.296
 Average output tokens per request: 239.909
 Average time per output token: 0.00168
 Average package per request: 1.000
 Average package latency: 37.778
 Percentile of time to first token: 
     p50: 37.8700
     p66: 44.4992
     p75: 49.1464
     p80: 52.3330
     p90: 59.2335
     p95: 64.9315
     p98: 74.1723
     p99: 80.4544
 Percentile of request latency: 
     p50: 37.8700
     p66: 44.4992
     p75: 49.1464
     p80: 52.3330
     p90: 59.2335
     p95: 64.9315
     p98: 74.1723
     p99: 80.4544
  • parallel 128
Benchmarking summary: 
 Time taken for tests: 375.187 seconds
 Expected number of requests: 1000
 Number of concurrency: 128
 Total requests: 999
 Succeed requests: 998
 Failed requests: 1
 Average QPS: 2.660
 Average latency: 43.181
 Throughput(average output tokens per second): 638.980
 Average time to first token: 43.181
 Average input tokens per request: 40.297
 Average output tokens per request: 240.217
 Average time per output token: 0.00156
 Average package per request: 1.000
 Average package latency: 43.181
 Percentile of time to first token: 
     p50: 42.3652
     p66: 51.6154
     p75: 56.2693
     p80: 59.2260
     p90: 68.8249
     p95: 75.4546
     p98: 85.2362
     p99: 93.3652
 Percentile of request latency: 
     p50: 42.3652
     p66: 51.6154
     p75: 56.2693
     p80: 59.2260
     p90: 68.8249
     p95: 75.4546
     p98: 85.2362
     p99: 93.3652
  • parallel 150
Benchmarking summary: 
 Time taken for tests: 386.202 seconds
 Expected number of requests: 1000
 Number of concurrency: 150
 Total requests: 998
 Succeed requests: 992
 Failed requests: 6
 Average QPS: 2.569
 Average latency: 52.514
 Throughput(average output tokens per second): 612.597
 Average time to first token: 52.514
 Average input tokens per request: 40.303
 Average output tokens per request: 238.494
 Average time per output token: 0.00163
 Average package per request: 1.000
 Average package latency: 52.514
 Percentile of time to first token: 
     p50: 52.1732
     p66: 61.4450
     p75: 67.9417
     p80: 72.0179
     p90: 84.6935
     p95: 92.2286
     p98: 101.7002
     p99: 107.7996
 Percentile of request latency: 
     p50: 52.1732
     p66: 61.4450
     p75: 67.9417
     p80: 72.0179
     p90: 84.6935
     p95: 92.2286
     p98: 101.7002
     p99: 107.7996
  • parallel 200
Benchmarking summary: 
 Time taken for tests: 355.307 seconds
 Expected number of requests: 1000
 Number of concurrency: 200
 Total requests: 998
 Succeed requests: 970
 Failed requests: 28
 Average QPS: 2.730
 Average latency: 61.304
 Throughput(average output tokens per second): 640.172
 Average time to first token: 61.304
 Average input tokens per request: 40.287
 Average output tokens per request: 234.493
 Average time per output token: 0.00156
 Average package per request: 1.000
 Average package latency: 61.304
 Percentile of time to first token: 
     p50: 61.0796
     p66: 74.1724
     p75: 80.9575
     p80: 84.8420
     p90: 96.6051
     p95: 105.2873
     p98: 114.3178
     p99: 115.7257
 Percentile of request latency: 
     p50: 61.0796
     p66: 74.1724
     p75: 80.9575
     p80: 84.8420
     p90: 96.6051
     p95: 105.2873
     p98: 114.3178
     p99: 115.7257

Qwen2.5-72B-Chat

Metric                 16        32        64
Time (s)               406.618   223.379   159.293
QPS                    0.236     0.448     0.609
Latency (s)            53.359    54.293    60.959
Throughput (tokens/s)  69.242    129.193   185.140
p50 (s)                52.8675   56.3690   58.7318
p90 (s)                80.6415   87.1464   90.6065
  • Average input tokens per request: 50
  • Average output tokens per request: 290
# Data
concurrency = [16, 32, 64]
time = [406.618, 223.379, 159.293]
qps = [0.236, 0.448, 0.609]
latency = [53.359, 54.293, 60.959]
throughput = [69.242, 129.193, 185.140]
p50 = [52.8675, 56.3690, 58.7318]
p90 = [80.6415, 87.1464, 90.6065]
  • parallel 1
Benchmarking summary: 
 Time taken for tests: 80.089 seconds
 Expected number of requests: 1
 Number of concurrency: 1
 Total requests: 1
 Succeed requests: 1
 Failed requests: 0
 Average QPS: 0.012
 Average latency: 79.353
 Throughput(average output tokens per second): 5.819
 Average time to first token: 79.353
 Average input tokens per request: 44.000
 Average output tokens per request: 466.000
 Average time per output token: 0.17186
 Average package per request: 1.000
 Average package latency: 79.353
  • parallel 16
Benchmarking summary: 
 Time taken for tests: 406.618 seconds
 Expected number of requests: 100
 Number of concurrency: 16
 Total requests: 98
 Succeed requests: 96
 Failed requests: 2
 Average QPS: 0.236
 Average latency: 53.359
 Throughput(average output tokens per second): 69.242
 Average time to first token: 53.359
 Average input tokens per request: 50.156
 Average output tokens per request: 293.281
 Average time per output token: 0.01444
 Average package per request: 1.000
 Average package latency: 53.359
 Percentile of time to first token: 
     p50: 52.8675
     p66: 62.7250
     p75: 70.2725
     p80: 73.0569
     p90: 80.6415
     p95: 91.7408
     p98: 96.5681
     p99: 114.6619
 Percentile of request latency: 
     p50: 52.8675
     p66: 62.7250
     p75: 70.2725
     p80: 73.0569
     p90: 80.6415
     p95: 91.7408
     p98: 96.5681
     p99: 114.6619
  • parallel 32
Benchmarking summary: 
 Time taken for tests: 223.379 seconds
 Expected number of requests: 100
 Number of concurrency: 32
 Total requests: 100
 Succeed requests: 100
 Failed requests: 0
 Average QPS: 0.448
 Average latency: 54.293
 Throughput(average output tokens per second): 129.193
 Average time to first token: 54.293
 Average input tokens per request: 49.890
 Average output tokens per request: 288.590
 Average time per output token: 0.00774
 Average package per request: 1.000
 Average package latency: 54.293
 Percentile of time to first token: 
     p50: 56.3690
     p66: 61.8036
     p75: 73.1621
     p80: 77.8210
     p90: 87.1464
     p95: 97.1933
     p98: 102.0623
     p99: 107.4325
 Percentile of request latency: 
     p50: 56.3690
     p66: 61.8036
     p75: 73.1621
     p80: 77.8210
     p90: 87.1464
     p95: 97.1933
     p98: 102.0623
     p99: 107.4325
  • parallel 64
Benchmarking summary: 
 Time taken for tests: 159.293 seconds
 Expected number of requests: 100
 Number of concurrency: 64
 Total requests: 99
 Succeed requests: 97
 Failed requests: 2
 Average QPS: 0.609
 Average latency: 60.959
 Throughput(average output tokens per second): 185.140
 Average time to first token: 60.959
 Average input tokens per request: 50.093
 Average output tokens per request: 296.392
 Average time per output token: 0.00540
 Average package per request: 1.000
 Average package latency: 60.959
 Percentile of time to first token: 
     p50: 58.7318
     p66: 72.0510
     p75: 77.6354
     p80: 82.2268
     p90: 90.6065
     p95: 98.0913
     p98: 108.4727
     p99: 112.9350
 Percentile of request latency: 
     p50: 58.7318
     p66: 72.0510
     p75: 77.6354
     p80: 82.2268
     p90: 90.6065
     p95: 98.0913
     p98: 108.4727
     p99: 112.9350

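Results (XInference - MindIE)

  • ❌ After the service has been deployed for a long time, requests get no response.
  • ❌ With multiple replicas deployed, the service crashes after a single benchmark run.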

Results (Nvidia T4: XInference - vLLM)

Qwen2-7B-Instruct

Metric                 8         16        32        64        100       128       150       200       300       400       500
Time (s)               1276.077  776.170   504.480   372.118   363.016   365.024   418.874   459.912   401.385   458.450   375.796
QPS                    0.783     1.288     1.958     2.472     2.353     2.334     2.046     1.750     1.664     1.254     1.163
Latency (s)            10.128    12.307    15.749    19.225    19.235    19.151    17.479    15.949    16.718    14.750    15.771
Throughput (tokens/s)  254.695   424.701   631.260   753.852   700.681   700.324   637.260   550.200   510.211   392.697   362.167
p50 (s)                10.3076   12.5569   16.3129   20.0818   20.1519   20.0230   17.9699   16.2526   17.0123   14.8173   15.8847
p90 (s)                14.3874   17.6026   22.5020   26.8947   26.5725   26.8284   24.7138   23.0809   24.4670   21.4278   23.7326
Failed                 1         0         12        71        101       72        32        25        66        47        89
  • Average input tokens per request: 40
  • Average output tokens per request: 300
evalscope-perf http://172.16.33.66:9997/v1/chat/completions gpt-4-32k \
    ./datasets/open_qa.jsonl \
    --parallels 8 \
    --parallels 16 \
    --parallels 32 \
    --parallels 64 \
    --parallels 100 \
    --parallels 128 \
    --parallels 150 \
    --parallels 200 \
    --parallels 300 \
    --parallels 400 \
    --parallels 500 \
    --n 1000
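
In these runs the number of requests actually sent drops well below the expected 1000 as concurrency grows, so the raw failure count alone understates how much work was lost. A small sketch of two complementary ratios, using the `Total requests` and `Failed requests` values from the logs in this section; the function name `reliability` is an illustrative choice.

```python
# Values copied from the per-run summaries of this section
# (Nvidia T4, XInference - vLLM, Qwen2-7B-Instruct).
concurrency = [8, 16, 32, 64, 100, 128, 150, 200, 300, 400, 500]
total = [1000, 1000, 1000, 991, 955, 924, 889, 830, 734, 622, 526]
failed = [1, 0, 12, 71, 101, 72, 32, 25, 66, 47, 89]
expected = 1000  # value passed to --n


def reliability(total, failed, expected):
    """Return (failure_rate, completion_rate) for each run.

    failure_rate    = failed / requests actually sent
    completion_rate = succeeded / requests that were expected
    """
    rows = []
    for sent, bad in zip(total, failed):
        succeeded = sent - bad
        rows.append((bad / sent, succeeded / expected))
    return rows


for c, (fail_rate, completion) in zip(concurrency,
                                      reliability(total, failed, expected)):
    print(f"{c:>4}  failure rate {fail_rate:6.1%}  completed {completion:6.1%}")
```

Reading the completion rate next to QPS helps avoid comparing a run that finished all of its requests with one that finished fewer than half of them.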
  • parallel 8
Benchmarking summary: 
 Time taken for tests: 1276.077 seconds
 Expected number of requests: 1000
 Number of concurrency: 8
 Total requests: 1000
 Succeed requests: 999
 Failed requests: 1
 Average QPS: 0.783
 Average latency: 10.128
 Throughput(average output tokens per second): 254.695
 Average time to first token: 10.128
 Average input tokens per request: 40.295
 Average output tokens per request: 325.335
 Average time per output token: 0.00393
 Average package per request: 1.000
 Average package latency: 10.128
 Percentile of time to first token: 
     p50: 10.3076
     p66: 11.5736
     p75: 12.3458
     p80: 12.6923
     p90: 14.3874
     p95: 15.7724
     p98: 17.3099
     p99: 18.2167
 Percentile of request latency: 
     p50: 10.3076
     p66: 11.5736
     p75: 12.3458
     p80: 12.6923
     p90: 14.3874
     p95: 15.7724
     p98: 17.3099
     p99: 18.2167
  • parallel 16
Benchmarking summary: 
 Time taken for tests: 776.170 seconds
 Expected number of requests: 1000
 Number of concurrency: 16
 Total requests: 1000
 Succeed requests: 1000
 Failed requests: 0
 Average QPS: 1.288
 Average latency: 12.307
 Throughput(average output tokens per second): 424.701
 Average time to first token: 12.307
 Average input tokens per request: 40.296
 Average output tokens per request: 329.640
 Average time per output token: 0.00235
 Average package per request: 1.000
 Average package latency: 12.307
 Percentile of time to first token: 
     p50: 12.5569
     p66: 13.8731
     p75: 14.9178
     p80: 15.6629
     p90: 17.6026
     p95: 19.0937
     p98: 21.7158
     p99: 22.9113
 Percentile of request latency: 
     p50: 12.5569
     p66: 13.8731
     p75: 14.9178
     p80: 15.6629
     p90: 17.6026
     p95: 19.0937
     p98: 21.7158
     p99: 22.9113
  • parallel 32
Benchmarking summary: 
 Time taken for tests: 504.480 seconds
 Expected number of requests: 1000
 Number of concurrency: 32
 Total requests: 1000
 Succeed requests: 988
 Failed requests: 12
 Average QPS: 1.958
 Average latency: 15.749
 Throughput(average output tokens per second): 631.260
 Average time to first token: 15.749
 Average input tokens per request: 40.277
 Average output tokens per request: 322.326
 Average time per output token: 0.00158
 Average package per request: 1.000
 Average package latency: 15.749
 Percentile of time to first token: 
     p50: 16.3129
     p66: 18.2424
     p75: 19.2717
     p80: 19.9721
     p90: 22.5020
     p95: 24.8820
     p98: 26.8200
     p99: 27.5026
 Percentile of request latency: 
     p50: 16.3129
     p66: 18.2424
     p75: 19.2717
     p80: 19.9721
     p90: 22.5020
     p95: 24.8820
     p98: 26.8200
     p99: 27.5026
  • parallel 64
Benchmarking summary: 
 Time taken for tests: 372.118 seconds
 Expected number of requests: 1000
 Number of concurrency: 64
 Total requests: 991
 Succeed requests: 920
 Failed requests: 71
 Average QPS: 2.472
 Average latency: 19.225
 Throughput(average output tokens per second): 753.852
 Average time to first token: 19.225
 Average input tokens per request: 40.204
 Average output tokens per request: 304.915
 Average time per output token: 0.00133
 Average package per request: 1.000
 Average package latency: 19.225
 Percentile of time to first token: 
     p50: 20.0818
     p66: 22.5724
     p75: 23.7237
     p80: 24.5372
     p90: 26.8947
     p95: 28.2781
     p98: 29.2713
     p99: 29.5592
 Percentile of request latency: 
     p50: 20.0818
     p66: 22.5724
     p75: 23.7237
     p80: 24.5372
     p90: 26.8947
     p95: 28.2781
     p98: 29.2713
     p99: 29.5592
  • parallel 100
Benchmarking summary: 
 Time taken for tests: 363.016 seconds
 Expected number of requests: 1000
 Number of concurrency: 100
 Total requests: 955
 Succeed requests: 854
 Failed requests: 101
 Average QPS: 2.353
 Average latency: 19.235
 Throughput(average output tokens per second): 700.681
 Average time to first token: 19.235
 Average input tokens per request: 40.218
 Average output tokens per request: 297.843
 Average time per output token: 0.00143
 Average package per request: 1.000
 Average package latency: 19.235
 Percentile of time to first token: 
     p50: 20.1519
     p66: 22.4520
     p75: 23.9718
     p80: 24.6514
     p90: 26.5725
     p95: 28.0242
     p98: 29.1395
     p99: 29.5591
 Percentile of request latency: 
     p50: 20.1519
     p66: 22.4520
     p75: 23.9718
     p80: 24.6514
     p90: 26.5725
     p95: 28.0242
     p98: 29.1395
     p99: 29.5591
  • parallel 128
Benchmarking summary: 
 Time taken for tests: 365.024 seconds
 Expected number of requests: 1000
 Number of concurrency: 128
 Total requests: 924
 Succeed requests: 852
 Failed requests: 72
 Average QPS: 2.334
 Average latency: 19.151
 Throughput(average output tokens per second): 700.324
 Average time to first token: 19.151
 Average input tokens per request: 40.188
 Average output tokens per request: 300.041
 Average time per output token: 0.00143
 Average package per request: 1.000
 Average package latency: 19.151
 Percentile of time to first token: 
     p50: 20.0230
     p66: 22.2805
     p75: 23.6442
     p80: 24.5745
     p90: 26.8284
     p95: 28.1978
     p98: 29.3051
     p99: 29.6282
 Percentile of request latency: 
     p50: 20.0230
     p66: 22.2805
     p75: 23.6442
     p80: 24.5745
     p90: 26.8284
     p95: 28.1978
     p98: 29.3051
     p99: 29.6282
  • parallel 150
Benchmarking summary: 
 Time taken for tests: 418.874 seconds
 Expected number of requests: 1000
 Number of concurrency: 150
 Total requests: 889
 Succeed requests: 857
 Failed requests: 32
 Average QPS: 2.046
 Average latency: 17.479
 Throughput(average output tokens per second): 637.260
 Average time to first token: 17.479
 Average input tokens per request: 40.272
 Average output tokens per request: 311.473
 Average time per output token: 0.00157
 Average package per request: 1.000
 Average package latency: 17.479
 Percentile of time to first token: 
     p50: 17.9699
     p66: 20.0651
     p75: 21.5481
     p80: 22.2761
     p90: 24.7138
     p95: 27.1316
     p98: 28.8103
     p99: 29.3229
 Percentile of request latency: 
     p50: 17.9699
     p66: 20.0651
     p75: 21.5481
     p80: 22.2761
     p90: 24.7138
     p95: 27.1316
     p98: 28.8103
     p99: 29.3229
  • parallel 200
Benchmarking summary: 
 Time taken for tests: 459.912 seconds
 Expected number of requests: 1000
 Number of concurrency: 200
 Total requests: 830
 Succeed requests: 805
 Failed requests: 25
 Average QPS: 1.750
 Average latency: 15.949
 Throughput(average output tokens per second): 550.200
 Average time to first token: 15.949
 Average input tokens per request: 40.337
 Average output tokens per request: 314.340
 Average time per output token: 0.00182
 Average package per request: 1.000
 Average package latency: 15.949
 Percentile of time to first token: 
     p50: 16.2526
     p66: 18.0521
     p75: 19.4394
     p80: 20.1756
     p90: 23.0809
     p95: 25.5287
     p98: 28.5592
     p99: 29.4651
 Percentile of request latency: 
     p50: 16.2526
     p66: 18.0521
     p75: 19.4394
     p80: 20.1756
     p90: 23.0809
     p95: 25.5287
     p98: 28.5592
     p99: 29.4651
  • parallel 300
Benchmarking summary: 
 Time taken for tests: 401.385 seconds
 Expected number of requests: 1000
 Number of concurrency: 300
 Total requests: 734
 Succeed requests: 668
 Failed requests: 66
 Average QPS: 1.664
 Average latency: 16.718
 Throughput(average output tokens per second): 510.211
 Average time to first token: 16.718
 Average input tokens per request: 40.383
 Average output tokens per request: 306.573
 Average time per output token: 0.00196
 Average package per request: 1.000
 Average package latency: 16.718
 Percentile of time to first token: 
     p50: 17.0123
     p66: 19.2765
     p75: 20.8575
     p80: 21.9712
     p90: 24.4670
     p95: 26.6873
     p98: 28.2457
     p99: 29.1315
 Percentile of request latency: 
     p50: 17.0123
     p66: 19.2765
     p75: 20.8575
     p80: 21.9712
     p90: 24.4670
     p95: 26.6873
     p98: 28.2457
     p99: 29.1315
  • parallel 400
Benchmarking summary: 
 Time taken for tests: 458.450 seconds
 Expected number of requests: 1000
 Number of concurrency: 400
 Total requests: 622
 Succeed requests: 575
 Failed requests: 47
 Average QPS: 1.254
 Average latency: 14.750
 Throughput(average output tokens per second): 392.697
 Average time to first token: 14.750
 Average input tokens per request: 40.310
 Average output tokens per request: 313.099
 Average time per output token: 0.00255
 Average package per request: 1.000
 Average package latency: 14.750
 Percentile of time to first token: 
     p50: 14.8173
     p66: 16.7375
     p75: 18.0656
     p80: 18.7752
     p90: 21.4278
     p95: 24.3040
     p98: 28.1092
     p99: 29.1603
 Percentile of request latency: 
     p50: 14.8173
     p66: 16.7375
     p75: 18.0656
     p80: 18.7752
     p90: 21.4278
     p95: 24.3040
     p98: 28.1092
     p99: 29.1603
  • parallel 500
Benchmarking summary: 
 Time taken for tests: 375.796 seconds
 Expected number of requests: 1000
 Number of concurrency: 500
 Total requests: 526
 Succeed requests: 437
 Failed requests: 89
 Average QPS: 1.163
 Average latency: 15.771
 Throughput(average output tokens per second): 362.167
 Average time to first token: 15.771
 Average input tokens per request: 40.268
 Average output tokens per request: 311.444
 Average time per output token: 0.00276
 Average package per request: 1.000
 Average package latency: 15.771
 Percentile of time to first token: 
     p50: 15.8847
     p66: 17.4772
     p75: 18.8302
     p80: 20.0079
     p90: 23.7326
     p95: 26.3385
     p98: 28.6426
     p99: 29.0919
 Percentile of request latency: 
     p50: 15.8847
     p66: 17.4772
     p75: 18.8302
     p80: 20.0079
     p90: 23.7326
     p95: 26.3385
     p98: 28.6426
     p99: 29.0919

Results (Nvidia T4: vLLM)

Qwen2-7B-Instruct

Metric                 8         16        32        64        100       128       150       200       300       400       500
Time (s)               1030.632  629.843   409.251   280.896   252.708   236.597   254.208   274.376   256.799   222.726   175.166
QPS                    0.970     1.588     2.443     3.503     3.593     3.580     3.509     3.076     2.847     2.820     3.060
Latency (s)            8.213     9.944     12.846    16.913    19.182    19.831    17.806    16.743    16.285    16.285    17.649
Throughput (tokens/s)  298.860   496.458   753.176   1073.697  1039.610  1005.881  1032.794  925.476   839.437   816.049   875.565
p50 (s)                8.4703    10.2975   13.3060   17.5850   20.1962   20.9986   18.6905   17.0203   16.6503   16.4550   18.0830
p90 (s)                11.8119   14.2264   18.5003   23.7261   27.0515   27.5461   25.4481   24.0718   23.5927   24.0938   25.7348
Failed                 0         0         0         15        73        115       26        11        19        23        26
  • Average input tokens per request: 40
  • Average output tokens per request: 300
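
To compare the two T4 setups side by side, the throughput rows of the XInference - vLLM table and of the table above can be divided column by column. This is only a sketch of the arithmetic: the two deployments may differ in configuration that is not recorded here, so the ratios describe these particular runs rather than the frameworks in general. The variable names are illustrative.

```python
# Throughput rows (output tokens per second) copied from the two T4 tables.
concurrency = [8, 16, 32, 64, 100, 128, 150, 200, 300, 400, 500]
xinference_vllm = [254.695, 424.701, 631.260, 753.852, 700.681, 700.324,
                   637.260, 550.200, 510.211, 392.697, 362.167]
vllm_direct = [298.860, 496.458, 753.176, 1073.697, 1039.610, 1005.881,
               1032.794, 925.476, 839.437, 816.049, 875.565]

# Ratio > 1 means the direct vLLM deployment produced more tokens per second.
ratios = [direct / wrapped
          for direct, wrapped in zip(vllm_direct, xinference_vllm)]

for c, r in zip(concurrency, ratios):
    print(f"concurrency {c:>3}: direct vLLM / XInference-vLLM = {r:.2f}x")
```

In these measurements the gap widens as concurrency increases, which matches the tables: the XInference-backed throughput peaks at 64 and then declines, while the direct deployment stays closer to its peak.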

Deployment

python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 --port 8008 \
    --model /data/models/llm/qwen/Qwen2-7B-Instruct/ \
    --served-model-name qwen2-7b \
    --tensor-parallel-size 4 \
    --dtype=float16 \
    --max-model-len 16000

Benchmark

evalscope-perf http://172.16.33.66:8008/v1/chat/completions qwen2-7b \
    ./datasets/open_qa.jsonl \
    --parallels 8 \
    --parallels 16 \
    --parallels 32 \
    --parallels 64 \
    --parallels 100 \
    --parallels 128 \
    --parallels 150 \
    --parallels 200 \
    --parallels 300 \
    --parallels 400 \
    --parallels 500 \
    --n 1000
  • parallel 8
Benchmarking summary: 
 Time taken for tests: 1030.632 seconds
 Expected number of requests: 1000
 Number of concurrency: 8
 Total requests: 1000
 Succeed requests: 1000
 Failed requests: 0
 Average QPS: 0.970
 Average latency: 8.213
 Throughput(average output tokens per second): 298.860
 Average time to first token: 8.213
 Average input tokens per request: 40.296
 Average output tokens per request: 308.015
 Average time per output token: 0.00335
 Average package per request: 1.000
 Average package latency: 8.213
 Percentile of time to first token: 
     p50: 8.4703
     p66: 9.3882
     p75: 10.1059
     p80: 10.5404
     p90: 11.8119
     p95: 12.8131
     p98: 14.0085
     p99: 15.0006
 Percentile of request latency: 
     p50: 8.4703
     p66: 9.3882
     p75: 10.1059
     p80: 10.5404
     p90: 11.8119
     p95: 12.8131
     p98: 14.0085
     p99: 15.0006
  • parallel 16
Benchmarking summary: 
 Time taken for tests: 629.843 seconds
 Expected number of requests: 1000
 Number of concurrency: 16
 Total requests: 1000
 Succeed requests: 1000
 Failed requests: 0
 Average QPS: 1.588
 Average latency: 9.944
 Throughput(average output tokens per second): 496.458
 Average time to first token: 9.944
 Average input tokens per request: 40.296
 Average output tokens per request: 312.691
 Average time per output token: 0.00201
 Average package per request: 1.000
 Average package latency: 9.944
 Percentile of time to first token: 
     p50: 10.2975
     p66: 11.4685
     p75: 12.2523
     p80: 12.7172
     p90: 14.2264
     p95: 15.5556
     p98: 17.0453
     p99: 18.0745
 Percentile of request latency: 
     p50: 10.2975
     p66: 11.4685
     p75: 12.2523
     p80: 12.7172
     p90: 14.2264
     p95: 15.5556
     p98: 17.0453
     p99: 18.0745
  • parallel 32
Benchmarking summary: 
 Time taken for tests: 409.251 seconds
 Expected number of requests: 1000
 Number of concurrency: 32
 Total requests: 1000
 Succeed requests: 1000
 Failed requests: 0
 Average QPS: 2.443
 Average latency: 12.846
 Throughput(average output tokens per second): 753.176
 Average time to first token: 12.846
 Average input tokens per request: 40.296
 Average output tokens per request: 308.238
 Average time per output token: 0.00133
 Average package per request: 1.000
 Average package latency: 12.846
 Percentile of time to first token: 
     p50: 13.3060
     p66: 14.7707
     p75: 15.6738
     p80: 16.5227
     p90: 18.5003
     p95: 20.2770
     p98: 22.3723
     p99: 23.2546
 Percentile of request latency: 
     p50: 13.3060
     p66: 14.7707
     p75: 15.6738
     p80: 16.5227
     p90: 18.5003
     p95: 20.2770
     p98: 22.3723
     p99: 23.2546
  • parallel 64
Benchmarking summary: 
 Time taken for tests: 280.896 seconds
 Expected number of requests: 1000
 Number of concurrency: 64
 Total requests: 999
 Succeed requests: 984
 Failed requests: 15
 Average QPS: 3.503
 Average latency: 16.913
 Throughput(average output tokens per second): 1073.697
 Average time to first token: 16.913
 Average input tokens per request: 40.278
 Average output tokens per request: 306.501
 Average time per output token: 0.00093
 Average package per request: 1.000
 Average package latency: 16.913
 Percentile of time to first token: 
     p50: 17.5850
     p66: 19.7422
     p75: 20.8753
     p80: 21.5526
     p90: 23.7261
     p95: 25.7272
     p98: 28.0227
     p99: 28.5129
 Percentile of request latency: 
     p50: 17.5850
     p66: 19.7422
     p75: 20.8753
     p80: 21.5526
     p90: 23.7261
     p95: 25.7272
     p98: 28.0227
     p99: 28.5129
  • parallel 100
Benchmarking summary: 
 Time taken for tests: 252.708 seconds
 Expected number of requests: 1000
 Number of concurrency: 100
 Total requests: 981
 Succeed requests: 908
 Failed requests: 73
 Average QPS: 3.593
 Average latency: 19.182
 Throughput(average output tokens per second): 1039.610
 Average time to first token: 19.182
 Average input tokens per request: 40.166
 Average output tokens per request: 289.337
 Average time per output token: 0.00096
 Average package per request: 1.000
 Average package latency: 19.182
 Percentile of time to first token: 
     p50: 20.1962
     p66: 22.6797
     p75: 23.9937
     p80: 24.8991
     p90: 27.0515
     p95: 28.3231
     p98: 29.3312
     p99: 29.6593
 Percentile of request latency: 
     p50: 20.1962
     p66: 22.6797
     p75: 23.9937
     p80: 24.8991
     p90: 27.0515
     p95: 28.3231
     p98: 29.3312
     p99: 29.6593
  • parallel 128
Benchmarking summary: 
 Time taken for tests: 236.597 seconds
 Expected number of requests: 1000
 Number of concurrency: 128
 Total requests: 962
 Succeed requests: 847
 Failed requests: 115
 Average QPS: 3.580
 Average latency: 19.831
 Throughput(average output tokens per second): 1005.881
 Average time to first token: 19.831
 Average input tokens per request: 40.038
 Average output tokens per request: 280.979
 Average time per output token: 0.00099
 Average package per request: 1.000
 Average package latency: 19.831
 Percentile of time to first token: 
     p50: 20.9986
     p66: 23.7704
     p75: 25.2322
     p80: 25.8841
     p90: 27.5461
     p95: 28.7326
     p98: 29.5878
     p99: 29.8692
 Percentile of request latency: 
     p50: 20.9986
     p66: 23.7704
     p75: 25.2322
     p80: 25.8841
     p90: 27.5461
     p95: 28.7326
     p98: 29.5878
     p99: 29.8692
  • parallel 150
Benchmarking summary: 
 Time taken for tests: 254.208 seconds
 Expected number of requests: 1000
 Number of concurrency: 150
 Total requests: 918
 Succeed requests: 892
 Failed requests: 26
 Average QPS: 3.509
 Average latency: 17.806
 Throughput(average output tokens per second): 1032.794
 Average time to first token: 17.806
 Average input tokens per request: 40.286
 Average output tokens per request: 294.333
 Average time per output token: 0.00097
 Average package per request: 1.000
 Average package latency: 17.806
 Percentile of time to first token: 
     p50: 18.6905
     p66: 20.7610
     p75: 22.0715
     p80: 23.0934
     p90: 25.4481
     p95: 27.6829
     p98: 29.1659
     p99: 29.6321
 Percentile of request latency: 
     p50: 18.6905
     p66: 20.7610
     p75: 22.0715
     p80: 23.0934
     p90: 25.4481
     p95: 27.6829
     p98: 29.1659
     p99: 29.6321
  • parallel 200
Benchmarking summary: 
 Time taken for tests: 274.376 seconds
 Expected number of requests: 1000
 Number of concurrency: 200
 Total requests: 855
 Succeed requests: 844
 Failed requests: 11
 Average QPS: 3.076
 Average latency: 16.743
 Throughput(average output tokens per second): 925.476
 Average time to first token: 16.743
 Average input tokens per request: 40.307
 Average output tokens per request: 300.863
 Average time per output token: 0.00108
 Average package per request: 1.000
 Average package latency: 16.743
 Percentile of time to first token: 
     p50: 17.0203
     p66: 19.1086
     p75: 20.4436
     p80: 21.6162
     p90: 24.0718
     p95: 26.2962
     p98: 28.3664
     p99: 29.4007
 Percentile of request latency: 
     p50: 17.0203
     p66: 19.1086
     p75: 20.4436
     p80: 21.6162
     p90: 24.0718
     p95: 26.2962
     p98: 28.3664
     p99: 29.4007
  • parallel 300
Benchmarking summary: 
 Time taken for tests: 256.799 seconds
 Expected number of requests: 1000
 Number of concurrency: 300
 Total requests: 750
 Succeed requests: 731
 Failed requests: 19
 Average QPS: 2.847
 Average latency: 16.285
 Throughput(average output tokens per second): 839.437
 Average time to first token: 16.285
 Average input tokens per request: 40.453
 Average output tokens per request: 294.893
 Average time per output token: 0.00119
 Average package per request: 1.000
 Average package latency: 16.285
 Percentile of time to first token: 
     p50: 16.6503
     p66: 18.6659
     p75: 19.8392
     p80: 20.8286
     p90: 23.5927
     p95: 26.8649
     p98: 28.0731
     p99: 29.3460
 Percentile of request latency: 
     p50: 16.6503
     p66: 18.6659
     p75: 19.8392
     p80: 20.8286
     p90: 23.5927
     p95: 26.8649
     p98: 28.0731
     p99: 29.3460
  • parallel 400
Benchmarking summary: 
 Time taken for tests: 222.726 seconds
 Expected number of requests: 1000
 Number of concurrency: 400
 Total requests: 651
 Succeed requests: 628
 Failed requests: 23
 Average QPS: 2.820
 Average latency: 16.285
 Throughput(average output tokens per second): 816.049
 Average time to first token: 16.285
 Average input tokens per request: 40.247
 Average output tokens per request: 289.419
 Average time per output token: 0.00123
 Average package per request: 1.000
 Average package latency: 16.285
 Percentile of time to first token: 
     p50: 16.4550
     p66: 18.4089
     p75: 20.1452
     p80: 20.9587
     p90: 24.0938
     p95: 26.5017
     p98: 28.0750
     p99: 28.6327
 Percentile of request latency: 
     p50: 16.4550
     p66: 18.4089
     p75: 20.1452
     p80: 20.9587
     p90: 24.0938
     p95: 26.5017
     p98: 28.0750
     p99: 28.6327
  • parallel 500
Benchmarking summary: 
 Time taken for tests: 175.166 seconds
 Expected number of requests: 1000
 Number of concurrency: 500
 Total requests: 562
 Succeed requests: 536
 Failed requests: 26
 Average QPS: 3.060
 Average latency: 17.649
 Throughput(average output tokens per second): 875.565
 Average time to first token: 17.649
 Average input tokens per request: 40.192
 Average output tokens per request: 286.136
 Average time per output token: 0.00114
 Average package per request: 1.000
 Average package latency: 17.649
 Percentile of time to first token: 
     p50: 18.0830
     p66: 20.6764
     p75: 22.3065
     p80: 23.0424
     p90: 25.7348
     p95: 27.9985
     p98: 29.2158
     p99: 29.5974
 Percentile of request latency: 
     p50: 18.0830
     p66: 20.6764
     p75: 22.3065
     p80: 23.0424
     p90: 25.7348
     p95: 27.9985
     p98: 29.2158
     p99: 29.5974
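
The headline figures in each summary appear to be related by simple arithmetic. The sketch below recomputes them for the parallel 64 run above; the relations are inferred from the logged numbers, not taken from EvalScope's source, and the `derive` helper is a hypothetical name used only for illustration.

```python
def derive(succeeded, avg_output_tokens, elapsed_seconds):
    """Recompute QPS, throughput and time per output token from a summary.

    Assumed relations (inferred from the logs, not from EvalScope's code):
      QPS        = succeeded requests / elapsed time
      throughput = succeeded requests * average output tokens / elapsed time
      time/token = 1 / throughput
    """
    qps = succeeded / elapsed_seconds
    throughput = succeeded * avg_output_tokens / elapsed_seconds
    time_per_token = 1.0 / throughput
    return qps, throughput, time_per_token


# Values copied from the "parallel 64" summary.
qps, throughput, time_per_token = derive(984, 306.501, 280.896)
print(f"QPS={qps:.3f} throughput={throughput:.3f} time/token={time_per_token:.5f}")
```

The recomputed values agree with the logged 3.503 QPS, 1073.697 tokens/s and 0.00093 s per token to within rounding of the logged inputs. Note that "time per output token" here is an aggregate across all concurrent requests, not the decode speed seen by a single request.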

Qwen2-7B-Instruct (long.jsonl)

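Metric                                 4         8        12        16        20        25        30        35        40
Time taken (s)                  1501.129   831.393   661.167   553.051   492.972   482.926   503.931   708.094  1708.086
QPS                                0.066     0.120     0.151     0.181     0.203     0.197     0.169     0.102     0.036
Average latency (s)               58.530    63.844    75.483    81.761    93.514    95.232    82.990    67.363    55.090
Throughput (output tokens/s)     150.200   268.802   340.991   411.723   450.299   437.290   369.642   224.586    79.384
Latency p50 (s)                  61.1869   67.7709   79.9282   85.2388  101.0449  100.2105   84.4174   67.4104   57.0599
Latency p90 (s)                  63.1877   70.4871   84.9531   89.7831  106.6341  113.0055  106.2575   81.7474   59.3524
Not succeeded (of 100)                 1         0         0         0         0         5        15        28        38
  • Columns are the concurrency level (--parallels); each run expects 100 requests.
  • "Not succeeded" is the expected request count minus the succeeded count, so it includes requests that never completed. At concurrency 25 to 40 the logs report 0 failed requests, but the total request count falls short of 100.
  • Average input tokens per request: about 1600
  • Average output tokens per request: about 2200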
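
To read the sweet spot off the table, one option is to pick the concurrency with the highest throughput and compare it with the highest concurrency at which all 100 requests still succeeded. The following is a minimal sketch with the table values copied in by hand; the variable names are illustrative only.

```python
# Values copied from the summary table above (long.jsonl, 100 requests per run).
concurrency = [4, 8, 12, 16, 20, 25, 30, 35, 40]
throughput = [150.200, 268.802, 340.991, 411.723, 450.299,
              437.290, 369.642, 224.586, 79.384]
not_succeeded = [1, 0, 0, 0, 0, 5, 15, 28, 38]

# Concurrency level with the highest output-token throughput.
peak_index = max(range(len(concurrency)), key=lambda i: throughput[i])
peak_concurrency = concurrency[peak_index]
peak_throughput = throughput[peak_index]

# Highest concurrency at which every expected request succeeded.
max_clean_concurrency = max(
    c for c, missing in zip(concurrency, not_succeeded) if missing == 0
)

print(peak_concurrency, peak_throughput, max_clean_concurrency)
```

For this data set both criteria point at concurrency 20; beyond that, throughput drops and requests start going missing.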

Deployment

python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 --port 8008 \
    --model /data/models/llm/qwen/Qwen2-7B-Instruct/ \
    --served-model-name qwen2-7b \
    --tensor-parallel-size 4 \
    --dtype=float16 \
    --max-model-len 16000
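
Before starting a long benchmark it can help to confirm that the server answers a single request. The sketch below assumes the vLLM server exposes the usual OpenAI-compatible chat completions route at the URL used in the load test; `build_request` and `smoke_test` are hypothetical helper names, and the prompt and `max_tokens` value are arbitrary.

```python
import json
import urllib.request


def build_request(url, model, prompt, max_tokens=32):
    """Build a non-streaming chat completion request (OpenAI-compatible format)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def smoke_test(url, model, timeout=120):
    """Send one short request and return the generated text."""
    request = build_request(url, model, "Hello")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the deployment above, calling `smoke_test("http://172.16.33.66:8008/v1/chat/completions", "qwen2-7b")` should return a short reply if the server is up; a connection error or timeout suggests the service is not ready for load testing yet.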

Load test

evalscope-perf http://172.16.33.66:8008/v1/chat/completions qwen2-7b \
    ./datasets/long.jsonl \
    --parallels 4 \
    --parallels 8 \
    --parallels 12 \
    --parallels 16 \
    --parallels 20 \
    --parallels 25 \
    --parallels 30 \
    --parallels 35 \
    --parallels 40 \
    --n 100
  • parallel 4
Benchmarking summary: 
 Time taken for tests: 1501.129 seconds
 Expected number of requests: 100
 Number of concurrency: 4
 Total requests: 100
 Succeed requests: 99
 Failed requests: 1
 Average QPS: 0.066
 Average latency: 58.530
 Throughput(average output tokens per second): 150.200
 Average time to first token: 58.530
 Average input tokens per request: 1614.000
 Average output tokens per request: 2277.475
 Average time per output token: 0.00666
 Average package per request: 1.000
 Average package latency: 58.530
 Percentile of time to first token: 
     p50: 61.1869
     p66: 61.9394
     p75: 62.4250
     p80: 62.7114
     p90: 63.1877
     p95: 63.8551
     p98: 64.0589
     p99: 64.3522
 Percentile of request latency: 
     p50: 61.1869
     p66: 61.9394
     p75: 62.4250
     p80: 62.7114
     p90: 63.1877
     p95: 63.8551
     p98: 64.0589
     p99: 64.3522
  • parallel 8
Benchmarking summary: 
 Time taken for tests: 831.393 seconds
 Expected number of requests: 100
 Number of concurrency: 8
 Total requests: 100
 Succeed requests: 100
 Failed requests: 0
 Average QPS: 0.120
 Average latency: 63.844
 Throughput(average output tokens per second): 268.802
 Average time to first token: 63.844
 Average input tokens per request: 1614.000
 Average output tokens per request: 2234.800
 Average time per output token: 0.00372
 Average package per request: 1.000
 Average package latency: 63.844
 Percentile of time to first token: 
     p50: 67.7709
     p66: 68.7757
     p75: 69.2135
     p80: 69.4998
     p90: 70.4871
     p95: 71.4362
     p98: 74.7053
     p99: 77.3827
 Percentile of request latency: 
     p50: 67.7709
     p66: 68.7757
     p75: 69.2135
     p80: 69.4998
     p90: 70.4871
     p95: 71.4362
     p98: 74.7053
     p99: 77.3827
  • parallel 12
Benchmarking summary: 
 Time taken for tests: 661.167 seconds
 Expected number of requests: 100
 Number of concurrency: 12
 Total requests: 100
 Succeed requests: 100
 Failed requests: 0
 Average QPS: 0.151
 Average latency: 75.483
 Throughput(average output tokens per second): 340.991
 Average time to first token: 75.483
 Average input tokens per request: 1614.000
 Average output tokens per request: 2254.520
 Average time per output token: 0.00293
 Average package per request: 1.000
 Average package latency: 75.483
 Percentile of time to first token: 
     p50: 79.9282
     p66: 81.6170
     p75: 82.3043
     p80: 82.8302
     p90: 84.9531
     p95: 87.1365
     p98: 94.9953
     p99: 96.4239
 Percentile of request latency: 
     p50: 79.9282
     p66: 81.6170
     p75: 82.3043
     p80: 82.8302
     p90: 84.9531
     p95: 87.1365
     p98: 94.9953
     p99: 96.4239
  • parallel 16
Benchmarking summary: 
 Time taken for tests: 553.051 seconds
 Expected number of requests: 100
 Number of concurrency: 16
 Total requests: 100
 Succeed requests: 100
 Failed requests: 0
 Average QPS: 0.181
 Average latency: 81.761
 Throughput(average output tokens per second): 411.723
 Average time to first token: 81.761
 Average input tokens per request: 1614.000
 Average output tokens per request: 2277.040
 Average time per output token: 0.00243
 Average package per request: 1.000
 Average package latency: 81.761
 Percentile of time to first token: 
     p50: 85.2388
     p66: 86.6946
     p75: 87.8569
     p80: 88.2254
     p90: 89.7831
     p95: 91.9183
     p98: 93.1188
     p99: 94.1187
 Percentile of request latency: 
     p50: 85.2388
     p66: 86.6946
     p75: 87.8569
     p80: 88.2254
     p90: 89.7831
     p95: 91.9183
     p98: 93.1188
     p99: 94.1187
  • parallel 20
Benchmarking summary: 
 Time taken for tests: 492.972 seconds
 Expected number of requests: 100
 Number of concurrency: 20
 Total requests: 100
 Succeed requests: 100
 Failed requests: 0
 Average QPS: 0.203
 Average latency: 93.514
 Throughput(average output tokens per second): 450.299
 Average time to first token: 93.514
 Average input tokens per request: 1614.000
 Average output tokens per request: 2219.850
 Average time per output token: 0.00222
 Average package per request: 1.000
 Average package latency: 93.514
 Percentile of time to first token: 
     p50: 101.0449
     p66: 103.1628
     p75: 104.5066
     p80: 105.2498
     p90: 106.6341
     p95: 109.0586
     p98: 112.6580
     p99: 114.0485
 Percentile of request latency: 
     p50: 101.0449
     p66: 103.1628
     p75: 104.5066
     p80: 105.2498
     p90: 106.6341
     p95: 109.0586
     p98: 112.6580
     p99: 114.0485
  • parallel 25
Benchmarking summary: 
 Time taken for tests: 482.926 seconds
 Expected number of requests: 100
 Number of concurrency: 25
 Total requests: 95
 Succeed requests: 95
 Failed requests: 0
 Average QPS: 0.197
 Average latency: 95.232
 Throughput(average output tokens per second): 437.290
 Average time to first token: 95.232
 Average input tokens per request: 1614.000
 Average output tokens per request: 2222.937
 Average time per output token: 0.00229
 Average package per request: 1.000
 Average package latency: 95.232
 Percentile of time to first token: 
     p50: 100.2105
     p66: 103.2044
     p75: 104.4999
     p80: 105.5968
     p90: 113.0055
     p95: 117.3441
     p98: 119.4187
     p99: 119.9363
 Percentile of request latency: 
     p50: 100.2105
     p66: 103.2044
     p75: 104.4999
     p80: 105.5968
     p90: 113.0055
     p95: 117.3441
     p98: 119.4187
     p99: 119.9363
  • parallel 30
Benchmarking summary: 
 Time taken for tests: 503.931 seconds
 Expected number of requests: 100
 Number of concurrency: 30
 Total requests: 85
 Succeed requests: 85
 Failed requests: 0
 Average QPS: 0.169
 Average latency: 82.990
 Throughput(average output tokens per second): 369.642
 Average time to first token: 82.990
 Average input tokens per request: 1614.000
 Average output tokens per request: 2191.459
 Average time per output token: 0.00271
 Average package per request: 1.000
 Average package latency: 82.990
 Percentile of time to first token: 
     p50: 84.4174
     p66: 86.3143
     p75: 88.0736
     p80: 91.9183
     p90: 106.2575
     p95: 109.8099
     p98: 118.7324
     p99: 119.7411
 Percentile of request latency: 
     p50: 84.4174
     p66: 86.3143
     p75: 88.0736
     p80: 91.9183
     p90: 106.2575
     p95: 109.8099
     p98: 118.7324
     p99: 119.7411
  • parallel 35
Benchmarking summary: 
 Time taken for tests: 708.094 seconds
 Expected number of requests: 100
 Number of concurrency: 35
 Total requests: 72
 Succeed requests: 72
 Failed requests: 0
 Average QPS: 0.102
 Average latency: 67.363
 Throughput(average output tokens per second): 224.586
 Average time to first token: 67.363
 Average input tokens per request: 1614.000
 Average output tokens per request: 2208.722
 Average time per output token: 0.00445
 Average package per request: 1.000
 Average package latency: 67.363
 Percentile of time to first token: 
     p50: 67.4104
     p66: 67.9889
     p75: 68.3605
     p80: 68.5264
     p90: 81.7474
     p95: 116.1246
     p98: 118.3541
     p99: 119.0251
 Percentile of request latency: 
     p50: 67.4104
     p66: 67.9889
     p75: 68.3605
     p80: 68.5264
     p90: 81.7474
     p95: 116.1246
     p98: 118.3541
     p99: 119.0251
  • parallel 40
Benchmarking summary: 
 Time taken for tests: 1708.086 seconds
 Expected number of requests: 100
 Number of concurrency: 40
 Total requests: 62
 Succeed requests: 62
 Failed requests: 0
 Average QPS: 0.036
 Average latency: 55.090
 Throughput(average output tokens per second): 79.384
 Average time to first token: 55.090
 Average input tokens per request: 1614.000
 Average output tokens per request: 2187.000
 Average time per output token: 0.01260
 Average package per request: 1.000
 Average package latency: 55.090
 Percentile of time to first token: 
     p50: 57.0599
     p66: 57.4411
     p75: 58.2631
     p80: 58.5435
     p90: 59.3524
     p95: 62.2720
     p98: 96.8205
     p99: 99.6312
 Percentile of request latency: 
     p50: 57.0599
     p66: 57.4411
     p75: 58.2631
     p80: 58.5435
     p90: 59.3524
     p95: 62.2720
     p98: 96.8205
     p99: 99.6312
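
Copying numbers from these logs into a summary table by hand is error-prone. As a rough sketch, the output can be split on the "Benchmarking summary:" marker and each block turned into a dictionary. The parser below assumes the exact line layout shown in the logs above; `parse_summaries` is a hypothetical helper, not part of EvalScope.

```python
import re

# "<name>: <number>" with optional trailing text such as "seconds".
NUMBER_LINE = re.compile(r"([^:]+):\s*(-?\d+(?:\.\d+)?)")


def parse_summaries(log_text):
    """Return one dict of metrics per 'Benchmarking summary' block."""
    runs = []
    for block in log_text.split("Benchmarking summary:")[1:]:
        metrics = {}
        section = None  # current "Percentile of ..." section, if any
        for raw_line in block.splitlines():
            line = raw_line.strip()
            if line.startswith("Percentile of") and line.endswith(":"):
                section = line[len("Percentile of "):-1]
                continue
            match = NUMBER_LINE.match(line)
            if not match:
                continue
            key = match.group(1).strip()
            # Prefix percentile keys so the two sections do not collide.
            if section and re.fullmatch(r"p\d+", key):
                key = f"{section} {key}"
            metrics[key] = float(match.group(2))
        runs.append(metrics)
    return runs


# Shortened excerpt of the "parallel 4" summary.
SAMPLE = """\
Benchmarking summary:
 Time taken for tests: 1501.129 seconds
 Number of concurrency: 4
 Succeed requests: 99
 Average latency: 58.530
 Throughput(average output tokens per second): 150.200
 Percentile of request latency:
     p50: 61.1869
     p90: 63.1877
"""

runs = parse_summaries(SAMPLE)
row = runs[0]
print(row["Number of concurrency"], row["Time taken for tests"],
      row["request latency p90"])
```

Feeding the full log through `parse_summaries` yields one dictionary per concurrency level, from which the rows of the summary table can be filled in.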

References