目录

大模型推理性能压测工具

安装 EvalScope

git clone https://github.com/modelscope/evalscope
cd evalscope

pip install -e .

压测命令的使用

evalscope perf \
    --api openai \
    --url 'http://127.0.0.1:1025/v1/chat/completions' \
    --model 'qwen' \
    --dataset openqa \
    --dataset-path './datasets/open_qa.jsonl' \
    --max-prompt-length 8000 \
    --stop '<|im_end|>' \
    --read-timeout=120 \
    --parallel 100 \
    -n 1000

–stream 不要加,经常出问题。

  • --read-timeout: 网络读取超时
  • --parallel: 并发数
  • -n: 请求数

数据集

中文聊天 HC3-Chinese

mkdir datasets
wget https://modelscope.cn/datasets/AI-ModelScope/HC3-Chinese/resolve/master/open_qa.jsonl \
    -O datasets/open_qa.jsonl

压测命令

evalscope perf \
    --api openai \
    --url 'http://127.0.0.1:1025/v1/chat/completions' \
    --model 'qwen' \
    --dataset openqa \
    --dataset-path './datasets/open_qa.jsonl' \
    --max-prompt-length 8000 \
    --stop '<|im_end|>' \
    --read-timeout=120 \
    --parallel 1 \
    -n 1

代码问答 Codefuse-Evol-Instruct-Clean

wget https://modelscope.cn/datasets/Banksy235/Codefuse-Evol-Instruct-Clean/resolve/master/data.json \
    -O datasets/Codefuse-Evol-Instruct-Clean-data.jsonl

# 修改数据集格式,将 "input" 改为 "question",以适应 EvalScope 的数据集格式 openqa
sed -i 's/"input"/"question"/g' datasets/Codefuse-Evol-Instruct-Clean-data.jsonl

压测命令

evalscope perf \
    --api openai \
    --url 'http://127.0.0.1:1025/v1/chat/completions' \
    --model 'qwen' \
    --dataset openqa \
    --dataset-path './datasets/Codefuse-Evol-Instruct-Clean-data.jsonl' \
    --max-prompt-length 4000 \
    --stop '<|im_end|>' \
    --read-timeout=120 \
    --parallel 1 \
    -n 1

构造长输入和输出的数据集

编辑文件:datasets/long.jsonl

{"question":"Learning to Reason with LLMs\nWe are introducing OpenAI o1, a new large language model trained with reinforcement learning to perform complex reasoning. o1 thinks before it answers—it can produce a long internal chain of thought before responding to the user.\n\nContributions\nOpenAI o1 ranks in the 89th percentile on competitive programming questions (Codeforces), places among the top 500 students in the US in a qualifier for the USA Math Olympiad (AIME), and exceeds human PhD-level accuracy on a benchmark of physics, biology, and chemistry problems (GPQA). While the work needed to make this new model as easy to use as current models is still ongoing, we are releasing an early version of this model, OpenAI o1-preview, for immediate use in ChatGPT and to trusted API users⁠(opens in a new window).\n\nOur large-scale reinforcement learning algorithm teaches the model how to think productively using its chain of thought in a highly data-efficient training process. We have found that the performance of o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute). The constraints on scaling this approach differ substantially from those of LLM pretraining, and we are continuing to investigate them.\n\nThe image shows two scatter plots comparing o1 AIME accuracy during training and at test time. Both charts have pass@1 accuracy on the y-axis and compute (log scale) on the x-axis. The dots indicate increasing accuracy with more compute time.\no1 performance smoothly improves with both train-time and test-time compute\n\nEvals\nTo highlight the reasoning improvement over GPT-4o, we tested our models on a diverse set of human exams and ML benchmarks. We show that o1 significantly outperforms GPT-4o on the vast majority of these reasoning-heavy tasks. Unless otherwise specified, we evaluated o1 on the maximal test-time compute setting.\n\nCompetition evals for Math (AIME 2024), Code (CodeForces), and PhD-Level Science Questions (GPQA Diamond)\no1 greatly improves over GPT-4o on challenging reasoning benchmarks. Solid bars show pass@1 accuracy and the shaded region shows the performance of majority vote (consensus) with 64 samples.\nBreakdown of the accuracy and raw score of gpt-4o vs. o1 on various competition evals\no1 improves over GPT-4o on a wide range of benchmarks, including 54/57 MMLU subcategories. Seven are shown for illustration.\nIn many reasoning-heavy benchmarks, o1 rivals the performance of human experts. Recent frontier models1 do so well on MATH2 and GSM8K that these benchmarks are no longer effective at differentiating models.\nA score of 13.9 places it among the top 500 students nationally and above the cutoff for the USA Mathematical Olympiad.\n\nWe also evaluated o1 on GPQA diamond, a difficult intelligence benchmark which tests for expertise in chemistry, physics and biology. In order to compare models to humans, we recruited experts with PhDs to answer GPQA-diamond questions. We found that o1 surpassed the performance of those human experts, becoming the first model to do so on this benchmark. These results do not imply that o1 is more capable than a PhD in all respects — only that the model is more proficient in solving some problems that a PhD would be expected to solve. On several other ML benchmarks, o1 improved over the state-of-the-art. With its vision perception capabilities enabled, o1 scored 78.2 on MMMU, making it the first model to be competitive with human experts. It also outperformed GPT-4o on 54 out of 57 MMLU subcategories.\n\nChain of Thought\nSimilar to how a human may think for a long time before responding to a difficult question, o1 uses a chain of thought when attempting to solve a problem. Through reinforcement learning, o1 learns to hone its chain of thought and refine the strategies it uses. It learns to recognize and correct its mistakes. It learns to break down tricky steps into simpler ones. It learns to try a different approach when the current one isn’t working. This process dramatically improves the model’s ability to reason. To illustrate this leap forward, we showcase the chain of thought from o1-preview on several difficult problems below.\n\nCoding\nWe trained a model that scored 213 points and ranked in the 49th percentile in the 2024 International Olympiad in Informatics (IOI), by initializing from o1 and training to further improve programming skills. This model competed in the 2024 IOI under the same conditions as the human contestants. It had ten hours to solve six challenging algorithmic problems and was allowed 50 submissions per problem.\n\nFor each problem, our system sampled many candidate submissions and submitted 50 of them based on a test-time selection strategy. Submissions were selected based on performance on the IOI public test cases, model-generated test cases, and a learned scoring function. If we had instead submitted at random, we would have only scored 156 points on average, suggesting that this strategy was worth nearly 60 points under competition constraints.\n\nWith a relaxed submission constraint, we found that model performance improved significantly. When allowed 10,000 submissions per problem, the model achieved a score of 362.14 – above the gold medal threshold – even without any test-time selection strategy.  \n\nFinally, we simulated competitive programming contests hosted by Codeforces to demonstrate this model’s coding skill. Our evaluations closely matched competition rules and allowed for 10 submissions. GPT-4o achieved an Elo rating3 of 808, which is in the 11th percentile of human competitors.\nHuman preference evaluation\nIn addition to exams and academic benchmarks, we also evaluated human preference of o1-preview vs GPT-4o on challenging, open-ended prompts in a broad spectrum of domains. In this evaluation, human trainers were shown anonymized responses to a prompt from o1-preview and GPT-4o, and voted for which response they preferred. o1-preview is preferred to gpt-4o by a large margin in reasoning-heavy categories like data analysis, coding, and math. However, o1-preview is not preferred on some natural language tasks, suggesting that it is not well-suited for all use cases.\n\nThe image shows a horizontal bar chart comparing five models scores with error bars representing confidence intervals. The x-axis ranges from 0 to 100, with a dashed line as a reference point for performance.\nSafety\nChain of thought reasoning provides new opportunities for alignment and safety. We found that integrating our policies for model behavior into the chain of thought of a reasoning model is an effective way to robustly teach human values and principles.\nWhat does all of this mean for founders in the AI market? What does this mean for incumbent software companies? And where do we, as investors, see the most promising layer for returns in the Generative AI stack?\nIn our latest essay on the state of the Generative AI market, we’ll explore how the consolidation of the foundational LLM layer has set the stage for the race to scale these higher-order reasoning and agentic capabilities, and discuss a new generation of “killer apps” with novel cognitive architectures and user interfaces.\nThis is where System 2 thinking comes in, and it’s the focus of the latest wave of AI research. When a model “stops to think,” it isn’t just generating learned patterns or spitting out predictions based on past data. It’s generating a range of possibilities, considering potential outcomes and making a decision based on reasoning. \n\nTranslate to France."}

输入和输出 Tokens 大约在 3500

压测命令

evalscope-perf http://127.0.0.1:1025/v1/chat/completions qwen \
     ./datasets/long.jsonl \
     --max-prompt-length 8000 \
     --read-timeout=120 \
     --parallels 1 \
     --n 1

实验结果对比

🏆 vLLM ⚔️ XInference (T4: 4X16G)

从结果看,在生产环境中还是要使用 vLLM,推理性能更好且稳定更棒。

代码

import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties

# 设置中文字体
font_path = '/System/Library/Fonts/Hiragino Sans GB.ttc'  # 替换为你的字体文件路径
font_prop = FontProperties(fname=font_path)

# 数据
batch_sizes = [8, 16, 32, 64, 100, 128, 150, 200, 300, 400, 500]

vllm_qps = [0.970, 1.588, 2.443, 3.503, 3.593, 3.580, 3.509, 3.076, 2.847, 2.820, 3.060]
xinf_qps = [0.783, 1.288, 1.958, 2.472, 2.353, 2.334, 2.046, 1.750, 1.664, 1.254, 1.163]

vllm_latency = [8.213, 9.944, 12.846, 16.913, 19.182, 19.831, 17.806, 16.743, 16.285, 16.285, 17.649]
xinf_latency = [10.128, 12.307, 15.749, 19.225, 19.235, 19.151, 17.479, 15.949, 16.718, 14.750, 15.771]

vllm_throughput = [298.860, 496.458, 753.176, 1073.697, 1039.610, 1005.881, 1032.794, 925.476, 839.437, 816.049, 875.565]
xinf_throughput = [254.695, 424.701, 631.260, 753.852, 700.681, 700.324, 637.260, 550.200, 510.211, 392.697, 362.167]

vllm_failures = [0, 0, 0, 15, 73, 115, 26, 11, 19, 23, 26]
xinf_failures = [1, 0, 12, 71, 101, 72, 32, 25, 66, 47, 89]

# 创建子图
fig, axs = plt.subplots(2, 2, figsize=(15, 10))

# QPS
axs[0, 0].plot(batch_sizes, vllm_qps, label='vLLM', marker='o')
axs[0, 0].plot(batch_sizes, xinf_qps, label='XInferencev(LLM)', marker='o')
axs[0, 0].set_title('QPS', fontproperties=font_prop)
axs[0, 0].set_xlabel('并行数', fontproperties=font_prop)
axs[0, 0].set_ylabel('QPS', fontproperties=font_prop)
axs[0, 0].legend()

# 延迟
axs[0, 1].plot(batch_sizes, vllm_latency, label='vLLM', marker='o')
axs[0, 1].plot(batch_sizes, xinf_latency, label='XInference(vLLM)', marker='o')
axs[0, 1].set_title('延迟', fontproperties=font_prop)
axs[0, 1].set_xlabel('并行数', fontproperties=font_prop)
axs[0, 1].set_ylabel('延迟 (秒)', fontproperties=font_prop)
axs[0, 1].legend()

# 吞吐量
axs[1, 0].plot(batch_sizes, vllm_throughput, label='vLLM', marker='o')
axs[1, 0].plot(batch_sizes, xinf_throughput, label='XInference(vLLM)', marker='o')
axs[1, 0].set_title('吞吐量', fontproperties=font_prop)
axs[1, 0].set_xlabel('并行数', fontproperties=font_prop)
axs[1, 0].set_ylabel('吞吐量 (每秒Tokens)', fontproperties=font_prop)
axs[1, 0].legend()

# 失败率
axs[1, 1].plot(batch_sizes, vllm_failures, label='vLLM', marker='o')
axs[1, 1].plot(batch_sizes, xinf_failures, label='XInference(vLLM)', marker='o')
axs[1, 1].set_title('失败率', fontproperties=font_prop)
axs[1, 1].set_xlabel('并行数', fontproperties=font_prop)
axs[1, 1].set_ylabel('失败数', fontproperties=font_prop)
axs[1, 1].legend()

# 调整布局
plt.tight_layout()
plt.show()

🏆 MindIE (910B4: 8X32G) ⚔️ vLLM (T4: 4X16G)

和我们现有服务器 T4 的性能对比

代码

import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties

# 设置中文字体
font_path = '/System/Library/Fonts/Hiragino Sans GB.ttc'  # 替换为你的字体文件路径
font_prop = FontProperties(fname=font_path)

# 数据
batch_sizes_mindie = [8, 16, 32, 64, 128, 150, 200, 256, 300, 400, 512, 720]
batch_sizes_vllm = [8, 16, 32, 64, 100, 128, 150, 200, 300, 400, 500]

mindie_qps = [2.474, 4.649, 8.273, 12.065, 18.924, 20.457, 22.294, 23.392, 22.868, 23.328, 24.007, 24.643]
vllm_qps = [0.970, 1.588, 2.443, 3.503, 3.593, 3.580, 3.509, 3.076, 2.847, 2.820, 3.060]

mindie_latency = [3.213, 3.391, 3.724, 4.974, 6.108, 6.517, 7.805, 9.208, 10.904, 13.628, 16.031, 18.790]
vllm_latency = [8.213, 9.944, 12.846, 16.913, 19.182, 19.831, 17.806, 16.743, 16.285, 16.285, 17.649]

mindie_throughput = [594.929, 1119.102, 1989.159, 2903.023, 4559.489, 4920.856, 5354.701, 5636.846, 5506.567, 5618.110, 5772.230, 5940.622]
vllm_throughput = [298.860, 496.458, 753.176, 1073.697, 1039.610, 1005.881, 1032.794, 925.476, 839.437, 816.049, 875.565]

mindie_failures = [0] * len(batch_sizes_mindie)  # 假设 MindIE(910B4 8*32) 没有失败数据
vllm_failures = [0, 0, 0, 15, 73, 115, 26, 11, 19, 23, 26]

# 创建子图
fig, axs = plt.subplots(2, 2, figsize=(15, 10))

# QPS
axs[0, 0].plot(batch_sizes_mindie, mindie_qps, label='MindIE(910B4 8*32)', marker='o')
axs[0, 0].plot(batch_sizes_vllm, vllm_qps, label='vLLM(T4 4*16)', marker='o')
axs[0, 0].set_title('QPS', fontproperties=font_prop)
axs[0, 0].set_xlabel('并行数', fontproperties=font_prop)
axs[0, 0].set_ylabel('QPS', fontproperties=font_prop)
axs[0, 0].legend()

# 延迟
axs[0, 1].plot(batch_sizes_mindie, mindie_latency, label='MindIE(910B4 8*32)', marker='o')
axs[0, 1].plot(batch_sizes_vllm, vllm_latency, label='vLLM(T4 4*16)', marker='o')
axs[0, 1].set_title('延迟', fontproperties=font_prop)
axs[0, 1].set_xlabel('并行数', fontproperties=font_prop)
axs[0, 1].set_ylabel('延迟 (秒)', fontproperties=font_prop)
axs[0, 1].legend()

# 吞吐量
axs[1, 0].plot(batch_sizes_mindie, mindie_throughput, label='MindIE(910B4 8*32)', marker='o')
axs[1, 0].plot(batch_sizes_vllm, vllm_throughput, label='vLLM(T4 4*16)', marker='o')
axs[1, 0].set_title('吞吐量', fontproperties=font_prop)
axs[1, 0].set_xlabel('并行数', fontproperties=font_prop)
axs[1, 0].set_ylabel('吞吐量 (每秒Tokens)', fontproperties=font_prop)
axs[1, 0].legend()

# 失败率
axs[1, 1].plot(batch_sizes_mindie, mindie_failures, label='MindIE(910B4 8*32)', marker='o')
axs[1, 1].plot(batch_sizes_vllm, vllm_failures, label='vLLM(T4 4*16)', marker='o')
axs[1, 1].set_title('失败率', fontproperties=font_prop)
axs[1, 1].set_xlabel('并行数', fontproperties=font_prop)
axs[1, 1].set_ylabel('失败数', fontproperties=font_prop)
axs[1, 1].legend()

# 调整布局
plt.tight_layout()
plt.show()

实验结果(MindIE)

Qwen1.5-7B-Chat

指标 8 16 32 64 128 150 200 256 300 400 512 720
用时 404.284 215.085 120.876 82.884 52.844 48.884 44.856 42.750 43.729 42.866 41.655 40.580
QPS 2.474 4.649 8.273 12.065 18.924 20.457 22.294 23.392 22.868 23.328 24.007 24.643
延迟 3.213 3.391 3.724 4.974 6.108 6.517 7.805 9.208 10.904 13.628 16.031 18.790
吞吐量 594.929 1119.102 1989.159 2903.023 4559.489 4920.856 5354.701 5636.846 5506.567 5618.110 5772.230 5940.622
p50 3.2461 3.4271 3.7514 5.0248 6.1491 6.5487 7.7782 9.1754 10.9164 13.9850 17.0675 20.4161
p90 4.9771 5.2905 5.8484 7.7522 9.5705 10.1980 12.3493 13.6320 15.8667 19.2667 22.7640 26.8183
  • 平均每个请求的输入 token 数: 40
  • 平均每个请求的输出 token 数: 240

  • parallel 8
    Benchmarking summary: 
     Time taken for tests: 404.284 seconds
     Expected number of requests: 1000
     Number of concurrency: 8
     Total requests: 1000
     Succeed requests: 1000
     Failed requests: 0
     Average QPS: 2.474
     Average latency: 3.213
     Throughput(average output tokens per second): 594.929
     Average time to first token: 3.213
     Average input tokens per request: 40.296
     Average output tokens per request: 240.520
     Average time per output token: 0.00168
     Average package per request: 1.000
     Average package latency: 3.213
     Percentile of time to first token: 
       p50: 3.2461
       p66: 3.7587
       p75: 4.2213
       p80: 4.4208
       p90: 4.9771
       p95: 5.6460
       p98: 6.3678
       p99: 6.8545
     Percentile of request latency: 
       p50: 3.2461
       p66: 3.7587
       p75: 4.2213
       p80: 4.4208
       p90: 4.9771
       p95: 5.6460
       p98: 6.3678
       p99: 6.8545
    
  • parallel 16
    Benchmarking summary: 
     Time taken for tests: 215.085 seconds
     Expected number of requests: 1000
     Number of concurrency: 16
     Total requests: 1000
     Succeed requests: 1000
     Failed requests: 0
     Average QPS: 4.649
     Average latency: 3.391
     Throughput(average output tokens per second): 1119.102
     Average time to first token: 3.391
     Average input tokens per request: 40.296
     Average output tokens per request: 240.702
     Average time per output token: 0.00089
     Average package per request: 1.000
     Average package latency: 3.391
     Percentile of time to first token: 
       p50: 3.4271
       p66: 3.9816
       p75: 4.3792
       p80: 4.6188
       p90: 5.2905
       p95: 5.9389
       p98: 6.7555
       p99: 7.2478
     Percentile of request latency: 
       p50: 3.4271
       p66: 3.9816
       p75: 4.3792
       p80: 4.6188
       p90: 5.2905
       p95: 5.9389
       p98: 6.7555
       p99: 7.2478
    
  • parallel 32
    Benchmarking summary: 
     Time taken for tests: 120.876 seconds
     Expected number of requests: 1000
     Number of concurrency: 32
     Total requests: 1000
     Succeed requests: 1000
     Failed requests: 0
     Average QPS: 8.273
     Average latency: 3.724
     Throughput(average output tokens per second): 1989.159
     Average time to first token: 3.724
     Average input tokens per request: 40.296
     Average output tokens per request: 240.442
     Average time per output token: 0.00050
     Average package per request: 1.000
     Average package latency: 3.724
     Percentile of time to first token: 
       p50: 3.7514
       p66: 4.3989
       p75: 4.8352
       p80: 5.1087
       p90: 5.8484
       p95: 6.5664
       p98: 7.3057
       p99: 8.0644
     Percentile of request latency: 
       p50: 3.7514
       p66: 4.3989
       p75: 4.8352
       p80: 5.1087
       p90: 5.8484
       p95: 6.5664
       p98: 7.3057
       p99: 8.0644
    
  • parallel 64
    Benchmarking summary: 
     Time taken for tests: 82.884 seconds
     Expected number of requests: 1000
     Number of concurrency: 64
     Total requests: 1000
     Succeed requests: 1000
     Failed requests: 0
     Average QPS: 12.065
     Average latency: 4.974
     Throughput(average output tokens per second): 2903.023
     Average time to first token: 4.974
     Average input tokens per request: 40.296
     Average output tokens per request: 240.615
     Average time per output token: 0.00034
     Average package per request: 1.000
     Average package latency: 4.974
     Percentile of time to first token: 
       p50: 5.0248
       p66: 5.8985
       p75: 6.4676
       p80: 6.8351
       p90: 7.7522
       p95: 8.8266
       p98: 9.9534
       p99: 10.6036
     Percentile of request latency: 
       p50: 5.0248
       p66: 5.8985
       p75: 6.4676
       p80: 6.8351
       p90: 7.7522
       p95: 8.8266
       p98: 9.9534
       p99: 10.6036
    
  • parallel 128
    Benchmarking summary: 
     Time taken for tests: 52.844 seconds
     Expected number of requests: 1000
     Number of concurrency: 128
     Total requests: 1000
     Succeed requests: 1000
     Failed requests: 0
     Average QPS: 18.924
     Average latency: 6.108
     Throughput(average output tokens per second): 4559.489
     Average time to first token: 6.108
     Average input tokens per request: 40.296
     Average output tokens per request: 240.943
     Average time per output token: 0.00022
     Average package per request: 1.000
     Average package latency: 6.108
     Percentile of time to first token: 
       p50: 6.1491
       p66: 7.2622
       p75: 7.9560
       p80: 8.3894
       p90: 9.5705
       p95: 10.7209
       p98: 12.2300
       p99: 13.1657
     Percentile of request latency: 
       p50: 6.1491
       p66: 7.2622
       p75: 7.9560
       p80: 8.3894
       p90: 9.5705
       p95: 10.7209
       p98: 12.2300
       p99: 13.1657
    
  • parallel 150
    Benchmarking summary: 
     Time taken for tests: 48.884 seconds
     Expected number of requests: 1000
     Number of concurrency: 150
     Total requests: 1000
     Succeed requests: 1000
     Failed requests: 0
     Average QPS: 20.457
     Average latency: 6.517
     Throughput(average output tokens per second): 4920.856
     Average time to first token: 6.517
     Average input tokens per request: 40.296
     Average output tokens per request: 240.550
     Average time per output token: 0.00020
     Average package per request: 1.000
     Average package latency: 6.517
     Percentile of time to first token: 
       p50: 6.5487
       p66: 7.7580
       p75: 8.4394
       p80: 8.9248
       p90: 10.1980
       p95: 11.4446
       p98: 13.0906
       p99: 13.7333
     Percentile of request latency: 
       p50: 6.5487
       p66: 7.7580
       p75: 8.4394
       p80: 8.9248
       p90: 10.1980
       p95: 11.4446
       p98: 13.0906
       p99: 13.7333
    
  • parallel 200
    Benchmarking summary: 
     Time taken for tests: 44.856 seconds
     Expected number of requests: 1000
     Number of concurrency: 200
     Total requests: 1000
     Succeed requests: 1000
     Failed requests: 0
     Average QPS: 22.294
     Average latency: 7.805
     Throughput(average output tokens per second): 5354.701
     Average time to first token: 7.805
     Average input tokens per request: 40.296
     Average output tokens per request: 240.188
     Average time per output token: 0.00019
     Average package per request: 1.000
     Average package latency: 7.805
     Percentile of time to first token: 
       p50: 7.7782
       p66: 9.2457
       p75: 10.0596
       p80: 10.8689
       p90: 12.3493
       p95: 13.7108
       p98: 15.1361
       p99: 16.3464
     Percentile of request latency: 
       p50: 7.7782
       p66: 9.2457
       p75: 10.0596
       p80: 10.8689
       p90: 12.3493
       p95: 13.7108
       p98: 15.1361
       p99: 16.3464
    
  • parallel 256
    Benchmarking summary: 
     Time taken for tests: 42.750 seconds
     Expected number of requests: 1000
     Number of concurrency: 256
     Total requests: 1000
     Succeed requests: 1000
     Failed requests: 0
     Average QPS: 23.392
     Average latency: 9.208
     Throughput(average output tokens per second): 5636.846
     Average time to first token: 9.208
     Average input tokens per request: 40.296
     Average output tokens per request: 240.975
     Average time per output token: 0.00018
     Average package per request: 1.000
     Average package latency: 9.208
     Percentile of time to first token: 
       p50: 9.1754
       p66: 10.6423
       p75: 11.5348
       p80: 12.1507
       p90: 13.6320
       p95: 14.9237
       p98: 16.5329
       p99: 18.0215
     Percentile of request latency: 
       p50: 9.1754
       p66: 10.6423
       p75: 11.5348
       p80: 12.1507
       p90: 13.6320
       p95: 14.9237
       p98: 16.5329
       p99: 18.0215
    
  • parallel 300
    Benchmarking summary: 
     Time taken for tests: 43.729 seconds
     Expected number of requests: 1000
     Number of concurrency: 300
     Total requests: 1000
     Succeed requests: 1000
     Failed requests: 0
     Average QPS: 22.868
     Average latency: 10.904
     Throughput(average output tokens per second): 5506.567
     Average time to first token: 10.904
     Average input tokens per request: 40.296
     Average output tokens per request: 240.795
     Average time per output token: 0.00018
     Average package per request: 1.000
     Average package latency: 10.904
     Percentile of time to first token: 
       p50: 10.9164
       p66: 12.5896
       p75: 13.6076
       p80: 14.2442
       p90: 15.8667
       p95: 17.2967
       p98: 18.8841
       p99: 20.2304
     Percentile of request latency: 
       p50: 10.9164
       p66: 12.5896
       p75: 13.6076
       p80: 14.2442
       p90: 15.8667
       p95: 17.2967
       p98: 18.8841
       p99: 20.2304
    
  • parallel 400
    Benchmarking summary: 
     Time taken for tests: 42.866 seconds
     Expected number of requests: 1000
     Number of concurrency: 400
     Total requests: 1000
     Succeed requests: 1000
     Failed requests: 0
     Average QPS: 23.328
     Average latency: 13.628
     Throughput(average output tokens per second): 5618.110
     Average time to first token: 13.628
     Average input tokens per request: 40.296
     Average output tokens per request: 240.828
     Average time per output token: 0.00018
     Average package per request: 1.000
     Average package latency: 13.628
     Percentile of time to first token: 
       p50: 13.9850
       p66: 15.7791
       p75: 16.8451
       p80: 17.6249
       p90: 19.2667
       p95: 20.8091
       p98: 22.5674
       p99: 23.6675
     Percentile of request latency: 
       p50: 13.9850
       p66: 15.7791
       p75: 16.8451
       p80: 17.6249
       p90: 19.2667
       p95: 20.8091
       p98: 22.5674
       p99: 23.6675
    
  • parallel 512
    Benchmarking summary: 
     Time taken for tests: 41.655 seconds
     Expected number of requests: 1000
     Number of concurrency: 512
     Total requests: 1000
     Succeed requests: 1000
     Failed requests: 0
     Average QPS: 24.007
     Average latency: 16.031
     Throughput(average output tokens per second): 5772.230
     Average time to first token: 16.031
     Average input tokens per request: 40.296
     Average output tokens per request: 240.440
     Average time per output token: 0.00017
     Average package per request: 1.000
     Average package latency: 16.031
     Percentile of time to first token: 
       p50: 17.0675
       p66: 18.9757
       p75: 20.0632
       p80: 20.8715
       p90: 22.7640
       p95: 24.0828
       p98: 25.3913
       p99: 26.6549
     Percentile of request latency: 
       p50: 17.0675
       p66: 18.9757
       p75: 20.0632
       p80: 20.8715
       p90: 22.7640
       p95: 24.0828
       p98: 25.3913
       p99: 26.6549
    
  • parallel 720
    Benchmarking summary: 
     Time taken for tests: 40.580 seconds
     Expected number of requests: 1000
     Number of concurrency: 720
     Total requests: 1000
     Succeed requests: 1000
     Failed requests: 0
     Average QPS: 24.643
     Average latency: 18.790
     Throughput(average output tokens per second): 5940.622
     Average time to first token: 18.790
     Average input tokens per request: 40.296
     Average output tokens per request: 241.071
     Average time per output token: 0.00017
     Average package per request: 1.000
     Average package latency: 18.790
     Percentile of time to first token: 
       p50: 20.4161
       p66: 22.8199
       p75: 24.1332
       p80: 24.8298
       p90: 26.8183
       p95: 28.8718
       p98: 30.2479
       p99: 31.1723
     Percentile of request latency: 
       p50: 20.4161
       p66: 22.8199
       p75: 24.1332
       p80: 24.8298
       p90: 26.8183
       p95: 28.8718
       p98: 30.2479
       p99: 31.1723
    

Qwen1.5-7B-Chat (long.jsonl)

指标 32 64 80 100
用时 227.466 176.405 177.402 176.430
QPS 0.879 1.134 1.127 1.134
延迟 34.012 51.059 62.032 73.359
吞吐量 1534.689 1889.768 1864.759 1869.820
p50 34.7112 48.4820 59.2757 80.3454
p90 36.6736 68.2014 84.9306 93.7045
  • 平均每个请求的输入 token 数: 1614
  • 平均每个请求的输出 token 数: 1654

  • parallel 32
    Benchmarking summary: 
     Time taken for tests: 227.466 seconds
     Expected number of requests: 200
     Number of concurrency: 32
     Total requests: 200
     Succeed requests: 200
     Failed requests: 0
     Average QPS: 0.879
     Average latency: 34.012
     Throughput(average output tokens per second): 1534.689
     Average time to first token: 34.012
     Average input tokens per request: 1614.000
     Average output tokens per request: 1745.450
     Average time per output token: 0.00065
     Average package per request: 1.000
     Average package latency: 34.012
     Percentile of time to first token: 
       p50: 34.7112
       p66: 36.2749
       p75: 36.4008
       p80: 36.4714
       p90: 36.6736
       p95: 36.7247
       p98: 36.7508
       p99: 36.7635
     Percentile of request latency: 
       p50: 34.7112
       p66: 36.2749
       p75: 36.4008
       p80: 36.4714
       p90: 36.6736
       p95: 36.7247
       p98: 36.7508
       p99: 36.7635
    
  • parallel 64
    Benchmarking summary: 
     Time taken for tests: 176.405 seconds
     Expected number of requests: 200
     Number of concurrency: 64
     Total requests: 200
     Succeed requests: 200
     Failed requests: 0
     Average QPS: 1.134
     Average latency: 51.059
     Throughput(average output tokens per second): 1889.768
     Average time to first token: 51.059
     Average input tokens per request: 1614.000
     Average output tokens per request: 1666.820
     Average time per output token: 0.00053
     Average package per request: 1.000
     Average package latency: 51.059
     Percentile of time to first token: 
       p50: 48.4820
       p66: 52.8716
       p75: 58.2624
       p80: 60.0386
       p90: 68.2014
       p95: 74.7256
       p98: 78.7428
       p99: 79.3227
     Percentile of request latency: 
       p50: 48.4820
       p66: 52.8716
       p75: 58.2624
       p80: 60.0386
       p90: 68.2014
       p95: 74.7256
       p98: 78.7428
       p99: 79.3227
    
  • parallel 80
    Benchmarking summary: 
     Time taken for tests: 177.402 seconds
     Expected number of requests: 200
     Number of concurrency: 80
     Total requests: 200
     Succeed requests: 200
     Failed requests: 0
     Average QPS: 1.127
     Average latency: 62.032
     Throughput(average output tokens per second): 1864.759
     Average time to first token: 62.032
     Average input tokens per request: 1614.000
     Average output tokens per request: 1654.060
     Average time per output token: 0.00054
     Average package per request: 1.000
     Average package latency: 62.032
     Percentile of time to first token: 
       p50: 59.2757
       p66: 71.6039
       p75: 74.4594
       p80: 76.9160
       p90: 84.9306
       p95: 91.9959
       p98: 95.0497
       p99: 98.3784
     Percentile of request latency: 
       p50: 59.2757
       p66: 71.6039
       p75: 74.4594
       p80: 76.9160
       p90: 84.9306
       p95: 91.9959
       p98: 95.0497
       p99: 98.3784
    
  • parallel 100
    Benchmarking summary: 
     Time taken for tests: 176.430 seconds
     Expected number of requests: 200
     Number of concurrency: 100
     Total requests: 200
     Succeed requests: 200
     Failed requests: 0
     Average QPS: 1.134
     Average latency: 73.359
     Throughput(average output tokens per second): 1869.820
     Average time to first token: 73.359
     Average input tokens per request: 1614.000
     Average output tokens per request: 1649.460
     Average time per output token: 0.00053
     Average package per request: 1.000
     Average package latency: 73.359
     Percentile of time to first token: 
       p50: 80.3454
       p66: 85.7741
       p75: 88.7250
       p80: 90.7941
       p90: 93.7045
       p95: 97.3420
       p98: 99.5837
       p99: 101.2294
     Percentile of request latency: 
       p50: 80.3454
       p66: 85.7741
       p75: 88.7250
       p80: 90.7941
       p90: 93.7045
       p95: 97.3420
       p98: 99.5837
       p99: 101.2294
    

Qwen1.5-14B-Chat

指标 8 16 32 64 128 150 200 256 512
用时 578.571 361.169 253.040 204.961 170.001 169.981 162.999 159.840 153.937
QPS 1.727 2.766 3.952 4.874 5.882 5.877 6.129 6.250 6.490
延迟 3.712 3.928 4.581 5.628 7.223 8.004 9.205 11.446 22.695
吞吐量 480.043 897.511 915.133 2333.955 1363.656 3621.096 4310.525 4333.013 3806.748
p50 3.7038 3.9261 4.4544 5.6047 7.0534 7.9484 9.0271 11.3066 23.6591
p90 5.7184 6.0562 6.9198 8.7597 11.1194 12.5181 14.5017 16.8646 33.3052
  • parallel 8
    Benchmarking summary: 
     Time taken for tests: 578.571 seconds
     Expected number of requests: 1000
     Number of concurrency: 8
     Total requests: 1000
     Succeed requests: 999
     Failed requests: 1
     Average QPS: 1.727
     Average latency: 3.712
     Throughput(average output tokens per second): 480.043
     Average time to first token: 3.712
     Average input tokens per request: 40.287
     Average output tokens per request: 224.138
     Average time per output token: 0.00208
     Average package per request: 1.000
     Average package latency: 3.712
     Percentile of time to first token: 
       p50: 3.7038
       p66: 4.4215
       p75: 4.8051
       p80: 5.0476
       p90: 5.7184
       p95: 6.2956
       p98: 7.0707
       p99: 7.4415
     Percentile of request latency: 
       p50: 3.7038
       p66: 4.4215
       p75: 4.8051
       p80: 5.0476
       p90: 5.7184
       p95: 6.2956
       p98: 7.0707
       p99: 7.4415
    
  • parallel 16
    Benchmarking summary: 
     Time taken for tests: 361.169 seconds
     Expected number of requests: 1000
     Number of concurrency: 16
     Total requests: 1000
     Succeed requests: 999
     Failed requests: 1
     Average QPS: 2.766
     Average latency: 3.928
     Throughput(average output tokens per second): 897.511
     Average time to first token: 3.928
     Average input tokens per request: 40.287
     Average output tokens per request: 223.750
     Average time per output token: 0.00111
     Average package per request: 1.000
     Average package latency: 3.928
     Percentile of time to first token: 
       p50: 3.9261
       p66: 4.6890
       p75: 5.1204
       p80: 5.3290
       p90: 6.0562
       p95: 6.6784
       p98: 7.3113
       p99: 7.8980
     Percentile of request latency: 
       p50: 3.9261
       p66: 4.6890
       p75: 5.1204
       p80: 5.3290
       p90: 6.0562
       p95: 6.6784
       p98: 7.3113
       p99: 7.8980
    
  • parallel 32
    Benchmarking summary: 
     Time taken for tests: 253.040 seconds
     Expected number of requests: 1000
     Number of concurrency: 32
     Total requests: 1000
     Succeed requests: 1000
     Failed requests: 0
     Average QPS: 3.952
     Average latency: 4.581
     Throughput(average output tokens per second): 915.133
     Average time to first token: 4.581
     Average input tokens per request: 40.296
     Average output tokens per request: 231.565
     Average time per output token: 0.00109
     Average package per request: 1.000
     Average package latency: 4.581
     Percentile of time to first token: 
       p50: 4.4544
       p66: 5.2905
       p75: 5.8235
       p80: 6.1074
       p90: 6.9198
       p95: 7.5185
       p98: 8.5143
       p99: 9.3296
     Percentile of request latency: 
       p50: 4.4544
       p66: 5.2905
       p75: 5.8235
       p80: 6.1074
       p90: 6.9198
       p95: 7.5185
       p98: 8.5143
       p99: 9.3296
    
  • parallel 64
    Benchmarking summary: 
     Time taken for tests: 204.961 seconds
     Expected number of requests: 1000
     Number of concurrency: 64
     Total requests: 1000
     Succeed requests: 999
     Failed requests: 1
     Average QPS: 4.874
     Average latency: 5.628
     Throughput(average output tokens per second): 2333.955
     Average time to first token: 5.628
     Average input tokens per request: 40.287
     Average output tokens per request: 223.930
     Average time per output token: 0.00043
     Average package per request: 1.000
     Average package latency: 5.628
     Percentile of time to first token: 
       p50: 5.6047
       p66: 6.6600
       p75: 7.3423
       p80: 7.7040
       p90: 8.7597
       p95: 9.6390
       p98: 10.7844
       p99: 11.6003
     Percentile of request latency: 
       p50: 5.6047
       p66: 6.6600
       p75: 7.3423
       p80: 7.7040
       p90: 8.7597
       p95: 9.6390
       p98: 10.7844
       p99: 11.6003
    
  • parallel 128
    Benchmarking summary: 
     Time taken for tests: 170.001 seconds
     Expected number of requests: 1000
     Number of concurrency: 128
     Total requests: 1000
     Succeed requests: 1000
     Failed requests: 0
     Average QPS: 5.882
     Average latency: 7.223
     Throughput(average output tokens per second): 1363.656
     Average time to first token: 7.223
     Average input tokens per request: 40.296
     Average output tokens per request: 231.823
     Average time per output token: 0.00073
     Average package per request: 1.000
     Average package latency: 7.223
     Percentile of time to first token: 
       p50: 7.0534
       p66: 8.4098
       p75: 9.3191
       p80: 9.7640
       p90: 11.1194
       p95: 12.2800
       p98: 13.8248
       p99: 14.6733
     Percentile of request latency: 
       p50: 7.0534
       p66: 8.4098
       p75: 9.3191
       p80: 9.7640
       p90: 11.1194
       p95: 12.2800
       p98: 13.8248
       p99: 14.6733
    
  • parallel 150
    Benchmarking summary: 
     Time taken for tests: 169.981 seconds
     Expected number of requests: 1000
     Number of concurrency: 150
     Total requests: 1000
     Succeed requests: 999
     Failed requests: 1
     Average QPS: 5.877
     Average latency: 8.004
     Throughput(average output tokens per second): 3621.096
     Average time to first token: 8.004
     Average input tokens per request: 40.287
     Average output tokens per request: 224.225
     Average time per output token: 0.00028
     Average package per request: 1.000
     Average package latency: 8.004
     Percentile of time to first token: 
       p50: 7.9484
       p66: 9.3969
       p75: 10.5065
       p80: 11.0508
       p90: 12.5181
       p95: 13.6289
       p98: 15.3693
       p99: 16.5225
     Percentile of request latency: 
       p50: 7.9484
       p66: 9.3969
       p75: 10.5065
       p80: 11.0508
       p90: 12.5181
       p95: 13.6289
       p98: 15.3693
       p99: 16.5225
    
  • parallel 200
    Benchmarking summary: 
     Time taken for tests: 162.999 seconds
     Expected number of requests: 1000
     Number of concurrency: 200
     Total requests: 1000
     Succeed requests: 999
     Failed requests: 1
     Average QPS: 6.129
     Average latency: 9.205
     Throughput(average output tokens per second): 4310.525
     Average time to first token: 9.205
     Average input tokens per request: 40.287
     Average output tokens per request: 223.865
     Average time per output token: 0.00023
     Average package per request: 1.000
     Average package latency: 9.205
     Percentile of time to first token: 
       p50: 9.0271
       p66: 10.8104
       p75: 12.0957
       p80: 12.8212
       p90: 14.5017
       p95: 15.7233
       p98: 17.6891
       p99: 19.2401
     Percentile of request latency: 
       p50: 9.0271
       p66: 10.8104
       p75: 12.0957
       p80: 12.8212
       p90: 14.5017
       p95: 15.7233
       p98: 17.6891
       p99: 19.2401
    
  • parallel 256
    Benchmarking summary: 
     Time taken for tests: 159.840 seconds
     Expected number of requests: 1000
     Number of concurrency: 256
     Total requests: 1000
     Succeed requests: 999
     Failed requests: 1
     Average QPS: 6.250
     Average latency: 11.446
     Throughput(average output tokens per second): 4333.013
     Average time to first token: 11.446
     Average input tokens per request: 40.287
     Average output tokens per request: 224.384
     Average time per output token: 0.00023
     Average package per request: 1.000
     Average package latency: 11.446
     Percentile of time to first token: 
       p50: 11.3066
       p66: 13.1698
       p75: 14.4698
       p80: 15.2733
       p90: 16.8646
       p95: 18.3524
       p98: 20.0468
       p99: 21.3758
     Percentile of request latency: 
       p50: 11.3066
       p66: 13.1698
       p75: 14.4698
       p80: 15.2733
       p90: 16.8646
       p95: 18.3524
       p98: 20.0468
       p99: 21.3758
    
  • parallel 512
    Benchmarking summary: 
     Time taken for tests: 153.937 seconds
     Expected number of requests: 1000
     Number of concurrency: 512
     Total requests: 1000
     Succeed requests: 999
     Failed requests: 1
     Average QPS: 6.490
     Average latency: 22.695
     Throughput(average output tokens per second): 3806.748
     Average time to first token: 22.695
     Average input tokens per request: 40.287
     Average output tokens per request: 224.177
     Average time per output token: 0.00026
     Average package per request: 1.000
     Average package latency: 22.695
     Percentile of time to first token: 
       p50: 23.6591
       p66: 27.2753
       p75: 29.3330
       p80: 30.5309
       p90: 33.3052
       p95: 35.7777
       p98: 37.4897
       p99: 38.1186
     Percentile of request latency: 
       p50: 23.6591
       p66: 27.2753
       p75: 29.3330
       p80: 30.5309
       p90: 33.3052
       p95: 35.7777
       p98: 37.4897
       p99: 38.1186
    

Qwen2-72B-Chat

指标 8 16 32 64 128 150 200 256 512
用时 1569.707 909.001 567.479 382.247 179.015 270.054 251.060 237.063 206.734
QPS 0.636 1.099 1.759 2.613 5.586 3.699 3.975 4.214 4.832
延迟 11.705 12.806 14.526 17.296 21.041 23.856 28.198 33.176 59.912
吞吐量 188.589 342.764 588.262 973.795 1549.069 1595.176 1748.784 1866.155 1828.363
p50 11.8443 12.8275 14.7115 17.5290 21.3863 23.8551 28.4407 32.7329 63.0404
p90 16.2106 17.5466 20.0371 23.9890 29.5537 33.7459 40.4210 45.7666 81.6602
  • 平均每个请求的输入 token 数: 40
  • 平均每个请求的输出 token 数: 277

  • parallel 8
    Benchmarking summary: 
     Time taken for tests: 1569.707 seconds
     Expected number of requests: 1000
     Number of concurrency: 8
     Total requests: 1000
     Succeed requests: 999
     Failed requests: 1
     Average QPS: 0.636
     Average latency: 11.705
     Throughput(average output tokens per second): 188.589
     Average time to first token: 11.705
     Average input tokens per request: 40.303
     Average output tokens per request: 277.618
     Average time per output token: 0.00530
     Average package per request: 1.000
     Average package latency: 11.705
     Percentile of time to first token: 
       p50: 11.8443
       p66: 13.1671
       p75: 13.9665
       p80: 14.5981
       p90: 16.2106
       p95: 17.8844
       p98: 20.0471
       p99: 23.0309
     Percentile of request latency: 
       p50: 11.8443
       p66: 13.1671
       p75: 13.9665
       p80: 14.5981
       p90: 16.2106
       p95: 17.8844
       p98: 20.0471
       p99: 23.0309
    
  • parallel 16
    Benchmarking summary: 
     Time taken for tests: 909.001 seconds
     Expected number of requests: 1000
     Number of concurrency: 16
     Total requests: 1000
     Succeed requests: 999
     Failed requests: 1
     Average QPS: 1.099
     Average latency: 12.806
     Throughput(average output tokens per second): 342.764
     Average time to first token: 12.806
     Average input tokens per request: 40.303
     Average output tokens per request: 278.224
     Average time per output token: 0.00292
     Average package per request: 1.000
     Average package latency: 12.806
     Percentile of time to first token: 
       p50: 12.8275
       p66: 14.3998
       p75: 15.3983
       p80: 16.1443
       p90: 17.5466
       p95: 19.6906
       p98: 22.2533
       p99: 25.1283
     Percentile of request latency: 
       p50: 12.8275
       p66: 14.3998
       p75: 15.3983
       p80: 16.1443
       p90: 17.5466
       p95: 19.6906
       p98: 22.2533
       p99: 25.1283
    
  • parallel 32
    Benchmarking summary: 
     Time taken for tests: 567.479 seconds
     Expected number of requests: 1000
     Number of concurrency: 32
     Total requests: 1000
     Succeed requests: 998
     Failed requests: 2
     Average QPS: 1.759
     Average latency: 14.526
     Throughput(average output tokens per second): 588.262
     Average time to first token: 14.526
     Average input tokens per request: 40.297
     Average output tokens per request: 277.259
     Average time per output token: 0.00170
     Average package per request: 1.000
     Average package latency: 14.526
     Percentile of time to first token: 
       p50: 14.7115
       p66: 16.2993
       p75: 17.4013
       p80: 18.2002
       p90: 20.0371
       p95: 21.8216
       p98: 24.5539
       p99: 27.3373
     Percentile of request latency: 
       p50: 14.7115
       p66: 16.2993
       p75: 17.4013
       p80: 18.2002
       p90: 20.0371
       p95: 21.8216
       p98: 24.5539
       p99: 27.3373
    
  • parallel 64
    Benchmarking summary: 
     Time taken for tests: 382.247 seconds
     Expected number of requests: 1000
     Number of concurrency: 64
     Total requests: 1000
     Succeed requests: 999
     Failed requests: 1
     Average QPS: 2.613
     Average latency: 17.296
     Throughput(average output tokens per second): 973.795
     Average time to first token: 17.296
     Average input tokens per request: 40.303
     Average output tokens per request: 276.968
     Average time per output token: 0.00103
     Average package per request: 1.000
     Average package latency: 17.296
     Percentile of time to first token: 
       p50: 17.5290
       p66: 19.5218
       p75: 20.7063
       p80: 21.6443
       p90: 23.9890
       p95: 26.0998
       p98: 29.7887
       p99: 32.3975
     Percentile of request latency: 
       p50: 17.5290
       p66: 19.5218
       p75: 20.7063
       p80: 21.6443
       p90: 23.9890
       p95: 26.0998
       p98: 29.7887
       p99: 32.3975
    
  • parallel 128
    Benchmarking summary: 
     Time taken for tests: 179.015 seconds
     Expected number of requests: 1000
     Number of concurrency: 128
     Total requests: 1000
     Succeed requests: 1000
     Failed requests: 0
     Average QPS: 5.586
     Average latency: 21.041
     Throughput(average output tokens per second): 1549.069
     Average time to first token: 21.041
     Average input tokens per request: 40.296
     Average output tokens per request: 277.307
     Average time per output token: 0.00065
     Average package per request: 1.000
     Average package latency: 21.041
     Percentile of time to first token: 
       p50: 21.3863
       p66: 23.7687
       p75: 25.1877
       p80: 26.3636
       p90: 29.5537
       p95: 32.3925
       p98: 36.5261
       p99: 39.3924
     Percentile of request latency: 
       p50: 21.3863
       p66: 23.7687
       p75: 25.1877
       p80: 26.3636
       p90: 29.5537
       p95: 32.3925
       p98: 36.5261
       p99: 39.3924
    
  • parallel 150
    Benchmarking summary: 
     Time taken for tests: 270.054 seconds
     Expected number of requests: 1000
     Number of concurrency: 150
     Total requests: 1000
     Succeed requests: 999
     Failed requests: 1
     Average QPS: 3.699
     Average latency: 23.856
     Throughput(average output tokens per second): 1595.176
     Average time to first token: 23.856
     Average input tokens per request: 40.303
     Average output tokens per request: 277.760
     Average time per output token: 0.00063
     Average package per request: 1.000
     Average package latency: 23.856
     Percentile of time to first token: 
       p50: 23.8551
       p66: 26.6484
       p75: 28.7350
       p80: 29.8586
       p90: 33.7459
       p95: 36.6390
       p98: 41.2772
       p99: 47.8515
     Percentile of request latency: 
       p50: 23.8551
       p66: 26.6484
       p75: 28.7350
       p80: 29.8586
       p90: 33.7459
       p95: 36.6390
       p98: 41.2772
       p99: 47.8515
    
  • parallel 200
    Benchmarking summary: 
     Time taken for tests: 251.060 seconds
     Expected number of requests: 1000
     Number of concurrency: 200
     Total requests: 1000
     Succeed requests: 998
     Failed requests: 2
     Average QPS: 3.975
     Average latency: 28.198
     Throughput(average output tokens per second): 1748.784
     Average time to first token: 28.198
     Average input tokens per request: 40.308
     Average output tokens per request: 276.789
     Average time per output token: 0.00057
     Average package per request: 1.000
     Average package latency: 28.198
     Percentile of time to first token: 
       p50: 28.4407
       p66: 31.6658
       p75: 34.0785
       p80: 35.5489
       p90: 40.4210
       p95: 43.0363
       p98: 48.1876
       p99: 52.8204
     Percentile of request latency: 
       p50: 28.4407
       p66: 31.6658
       p75: 34.0785
       p80: 35.5489
       p90: 40.4210
       p95: 43.0363
       p98: 48.1876
       p99: 52.8204
    
  • parallel 256
    Benchmarking summary: 
     Time taken for tests: 237.063 seconds
     Expected number of requests: 1000
     Number of concurrency: 256
     Total requests: 1000
     Succeed requests: 999
     Failed requests: 1
     Average QPS: 4.214
     Average latency: 33.176
     Throughput(average output tokens per second): 1866.155
     Average time to first token: 33.176
     Average input tokens per request: 40.303
     Average output tokens per request: 276.399
     Average time per output token: 0.00054
     Average package per request: 1.000
     Average package latency: 33.176
     Percentile of time to first token: 
       p50: 32.7329
       p66: 37.0212
       p75: 39.6246
       p80: 41.3947
       p90: 45.7666
       p95: 49.4765
       p98: 54.4858
       p99: 58.0206
     Percentile of request latency: 
       p50: 32.7329
       p66: 37.0212
       p75: 39.6246
       p80: 41.3947
       p90: 45.7666
       p95: 49.4765
       p98: 54.4858
       p99: 58.0206
    
  • parallel 512
    Benchmarking summary: 
     Time taken for tests: 206.734 seconds
     Expected number of requests: 1000
     Number of concurrency: 512
     Total requests: 1000
     Succeed requests: 999
     Failed requests: 1
     Average QPS: 4.832
     Average latency: 59.912
     Throughput(average output tokens per second): 1828.363
     Average time to first token: 59.912
     Average input tokens per request: 40.303
     Average output tokens per request: 277.592
     Average time per output token: 0.00055
     Average package per request: 1.000
     Average package latency: 59.912
     Percentile of time to first token: 
       p50: 63.0404
       p66: 69.5273
       p75: 73.3044
       p80: 75.7455
       p90: 81.6602
       p95: 87.3050
       p98: 92.2078
       p99: 97.6189
     Percentile of request latency: 
       p50: 63.0404
       p66: 69.5273
       p75: 73.3044
       p80: 75.7455
       p90: 81.6602
       p95: 87.3050
       p98: 92.2078
       p99: 97.6189
    

Qwen2-72B-Chat (long.jsonl)

指标 8 12 20 30 40 50
用时 3091.468 2385.800 1598.805 1542.828 1509.713 1408.587
QPS 0.032 0.042 0.063 0.063 0.056 0.043
延迟 238.540 268.873 294.636 382.291 414.746 403.061
吞吐量 158.986 206.011 307.418 299.114 268.391 199.552
p50 239.3327 270.9418 292.3093 348.5759 396.9046 350.8392
p90 239.6762 271.3905 313.2425 514.8233 567.8100 597.3057
  • 平均每个请求的输入 token 数: 6385
  • 平均每个请求的输出 token 数: 4915

  • parallel 8
    Benchmarking summary: 
     Time taken for tests: 3091.468 seconds
     Expected number of requests: 100
     Number of concurrency: 8
     Total requests: 100
     Succeed requests: 100
     Failed requests: 0
     Average QPS: 0.032
     Average latency: 238.540
     Throughput(average output tokens per second): 158.986
     Average time to first token: 238.540
     Average input tokens per request: 6385.000
     Average output tokens per request: 4915.000
     Average time per output token: 0.00629
     Average package per request: 1.000
     Average package latency: 238.540
     Percentile of time to first token: 
       p50: 239.3327
       p66: 239.5209
       p75: 239.5622
       p80: 239.6446
       p90: 239.6762
       p95: 240.0079
       p98: 240.0081
       p99: 240.0121
     Percentile of request latency: 
       p50: 239.3327
       p66: 239.5209
       p75: 239.5622
       p80: 239.6446
       p90: 239.6762
       p95: 240.0079
       p98: 240.0081
       p99: 240.0121
    
  • parallel 12
    Benchmarking summary: 
     Time taken for tests: 2385.800 seconds
     Expected number of requests: 100
     Number of concurrency: 12
     Total requests: 100
     Succeed requests: 100
     Failed requests: 0
     Average QPS: 0.042
     Average latency: 268.873
     Throughput(average output tokens per second): 206.011
     Average time to first token: 268.873
     Average input tokens per request: 6385.000
     Average output tokens per request: 4915.000
     Average time per output token: 0.00485
     Average package per request: 1.000
     Average package latency: 268.873
     Percentile of time to first token: 
       p50: 270.9418
       p66: 271.2864
       p75: 271.3597
       p80: 271.3610
       p90: 271.3905
       p95: 271.3978
       p98: 271.4502
       p99: 271.4786
     Percentile of request latency: 
       p50: 270.9418
       p66: 271.2864
       p75: 271.3597
       p80: 271.3610
       p90: 271.3905
       p95: 271.3978
       p98: 271.4502
       p99: 271.4786
    
  • parallel 20
    Benchmarking summary: 
     Time taken for tests: 1598.805 seconds
     Expected number of requests: 100
     Number of concurrency: 20
     Total requests: 100
     Succeed requests: 100
     Failed requests: 0
     Average QPS: 0.063
     Average latency: 294.636
     Throughput(average output tokens per second): 307.418
     Average time to first token: 294.636
     Average input tokens per request: 6385.000
     Average output tokens per request: 4915.020
     Average time per output token: 0.00325
     Average package per request: 1.000
     Average package latency: 294.636
     Percentile of time to first token: 
       p50: 292.3093
       p66: 293.7218
       p75: 296.3762
       p80: 296.4460
       p90: 313.2425
       p95: 323.9384
       p98: 334.5282
       p99: 357.6738
     Percentile of request latency: 
       p50: 292.3093
       p66: 293.7218
       p75: 296.3762
       p80: 296.4460
       p90: 313.2425
       p95: 323.9384
       p98: 334.5282
       p99: 357.6738
    
  • parallel 30
    Benchmarking summary: 
     Time taken for tests: 1542.828 seconds
     Expected number of requests: 100
     Number of concurrency: 30
     Total requests: 97
     Succeed requests: 97
     Failed requests: 0
     Average QPS: 0.063
     Average latency: 382.291
     Throughput(average output tokens per second): 299.114
     Average time to first token: 382.291
     Average input tokens per request: 6385.000
     Average output tokens per request: 4757.546
     Average time per output token: 0.00334
     Average package per request: 1.000
     Average package latency: 382.291
     Percentile of time to first token: 
       p50: 348.5759
       p66: 378.9814
       p75: 420.5971
       p80: 443.5496
       p90: 514.8233
       p95: 548.8156
       p98: 559.4441
       p99: 590.6074
     Percentile of request latency: 
       p50: 348.5759
       p66: 378.9814
       p75: 420.5971
       p80: 443.5496
       p90: 514.8233
       p95: 548.8156
       p98: 559.4441
       p99: 590.6074
    
  • parallel 40
    Benchmarking summary: 
     Time taken for tests: 1509.713 seconds
     Expected number of requests: 100
     Number of concurrency: 40
     Total requests: 87
     Succeed requests: 85
     Failed requests: 2
     Average QPS: 0.056
     Average latency: 414.746
     Throughput(average output tokens per second): 268.391
     Average time to first token: 414.746
     Average input tokens per request: 6385.000
     Average output tokens per request: 4766.976
     Average time per output token: 0.00373
     Average package per request: 1.000
     Average package latency: 414.746
     Percentile of time to first token: 
       p50: 396.9046
       p66: 458.9677
       p75: 482.9745
       p80: 521.3878
       p90: 567.8100
       p95: 580.6507
       p98: 586.6928
       p99: 587.2930
     Percentile of request latency: 
       p50: 396.9046
       p66: 458.9677
       p75: 482.9745
       p80: 521.3878
       p90: 567.8100
       p95: 580.6507
       p98: 586.6928
       p99: 587.2930
    
  • parallel 50
    Benchmarking summary: 
     Time taken for tests: 1408.587 seconds
     Expected number of requests: 100
     Number of concurrency: 50
     Total requests: 78
     Succeed requests: 60
     Failed requests: 18
     Average QPS: 0.043
     Average latency: 403.061
     Throughput(average output tokens per second): 199.552
     Average time to first token: 403.061
     Average input tokens per request: 6385.000
     Average output tokens per request: 4684.783
     Average time per output token: 0.00501
     Average package per request: 1.000
     Average package latency: 403.061
     Percentile of time to first token: 
       p50: 350.8392
       p66: 386.2204
       p75: 516.6384
       p80: 558.8747
       p90: 597.3057
       p95: 597.3098
       p98: 597.5291
       p99: 598.8612
     Percentile of request latency: 
       p50: 350.8392
       p66: 386.2204
       p75: 516.6384
       p80: 558.8747
       p90: 597.3057
       p95: 597.3098
       p98: 597.5291
       p99: 598.8612
    

DeepSeek-Coder-6.7B-Instruct

指标 8 16 32 64 128 150 200 300 400 500 600
用时 621.642 325.248 178.007 109.977 70.124 67.204 61.252 63.928 70.753 72.668 75.559
QPS 1.609 3.075 5.618 9.093 14.261 14.880 16.326 15.643 12.494 7.527 5.294
延迟 4.967 5.153 5.590 6.847 8.644 9.535 12.010 17.516 20.797 18.579 21.424
吞吐量 643.457 1229.830 2247.103 3637.131 5704.202 5952.054 6530.446 6257.033 4997.658 3010.939 2117.557
p50 4.9568 5.1310 5.5455 6.9095 8.6159 9.7244 12.1211 14.5084 20.3585 17.9382 22.6927
p90 5.0456 5.3116 6.0241 6.9913 8.9456 10.0026 12.3709 23.8041 28.1569 28.4858 24.4160
失败 0 0 0 0 0 0 0 0 0 184 216
  • 平均每个请求的输入 token 数: 157
  • 平均每个请求的输出 token 数: 400

  • parallel 8
    Benchmarking summary: 
     Time taken for tests: 621.642 seconds
     Expected number of requests: 1000
     Number of concurrency: 8
     Total requests: 1000
     Succeed requests: 1000
     Failed requests: 0
     Average QPS: 1.609
     Average latency: 4.967
     Throughput(average output tokens per second): 643.457
     Average time to first token: 4.967
     Average input tokens per request: 157.292
     Average output tokens per request: 400.000
     Average time per output token: 0.00155
     Average package per request: 1.000
     Average package latency: 4.967
     Percentile of time to first token: 
       p50: 4.9568
       p66: 4.9916
       p75: 5.0074
       p80: 5.0173
       p90: 5.0456
       p95: 5.0796
       p98: 5.1067
       p99: 5.1223
     Percentile of request latency: 
       p50: 4.9568
       p66: 4.9916
       p75: 5.0074
       p80: 5.0173
       p90: 5.0456
       p95: 5.0796
       p98: 5.1067
       p99: 5.1223
    
  • parallel 16
    Benchmarking summary: 
     Time taken for tests: 325.248 seconds
     Expected number of requests: 1000
     Number of concurrency: 16
     Total requests: 1000
     Succeed requests: 1000
     Failed requests: 0
     Average QPS: 3.075
     Average latency: 5.153
     Throughput(average output tokens per second): 1229.830
     Average time to first token: 5.153
     Average input tokens per request: 157.292
     Average output tokens per request: 400.000
     Average time per output token: 0.00081
     Average package per request: 1.000
     Average package latency: 5.153
     Percentile of time to first token: 
       p50: 5.1310
       p66: 5.1707
       p75: 5.2303
       p80: 5.2545
       p90: 5.3116
       p95: 5.4090
       p98: 5.4987
       p99: 5.5322
     Percentile of request latency: 
       p50: 5.1310
       p66: 5.1707
       p75: 5.2303
       p80: 5.2545
       p90: 5.3116
       p95: 5.4090
       p98: 5.4987
       p99: 5.5322
    
  • parallel 32
    Benchmarking summary: 
     Time taken for tests: 178.007 seconds
     Expected number of requests: 1000
     Number of concurrency: 32
     Total requests: 1000
     Succeed requests: 1000
     Failed requests: 0
     Average QPS: 5.618
     Average latency: 5.590
     Throughput(average output tokens per second): 2247.103
     Average time to first token: 5.590
     Average input tokens per request: 157.292
     Average output tokens per request: 400.000
     Average time per output token: 0.00045
     Average package per request: 1.000
     Average package latency: 5.590
     Percentile of time to first token: 
       p50: 5.5455
       p66: 5.5729
       p75: 5.6628
       p80: 5.7434
       p90: 6.0241
       p95: 6.0393
       p98: 6.1004
       p99: 6.1034
     Percentile of request latency: 
       p50: 5.5455
       p66: 5.5729
       p75: 5.6628
       p80: 5.7434
       p90: 6.0241
       p95: 6.0393
       p98: 6.1004
       p99: 6.1034
    
  • parallel 64
    Benchmarking summary: 
     Time taken for tests: 109.977 seconds
     Expected number of requests: 1000
     Number of concurrency: 64
     Total requests: 1000
     Succeed requests: 1000
     Failed requests: 0
     Average QPS: 9.093
     Average latency: 6.847
     Throughput(average output tokens per second): 3637.131
     Average time to first token: 6.847
     Average input tokens per request: 157.292
     Average output tokens per request: 400.000
     Average time per output token: 0.00027
     Average package per request: 1.000
     Average package latency: 6.847
     Percentile of time to first token: 
       p50: 6.9095
       p66: 6.9292
       p75: 6.9419
       p80: 6.9551
       p90: 6.9913
       p95: 7.0102
       p98: 7.0201
       p99: 7.0224
     Percentile of request latency: 
       p50: 6.9095
       p66: 6.9292
       p75: 6.9419
       p80: 6.9551
       p90: 6.9913
       p95: 7.0102
       p98: 7.0201
       p99: 7.0224
    
  • parallel 128
    Benchmarking summary: 
     Time taken for tests: 70.124 seconds
     Expected number of requests: 1000
     Number of concurrency: 128
     Total requests: 1000
     Succeed requests: 1000
     Failed requests: 0
     Average QPS: 14.261
     Average latency: 8.644
     Throughput(average output tokens per second): 5704.202
     Average time to first token: 8.644
     Average input tokens per request: 157.292
     Average output tokens per request: 400.000
     Average time per output token: 0.00018
     Average package per request: 1.000
     Average package latency: 8.644
     Percentile of time to first token: 
       p50: 8.6159
       p66: 8.7015
       p75: 8.7256
       p80: 8.8329
       p90: 8.9456
       p95: 8.9532
       p98: 8.9652
       p99: 8.9727
     Percentile of request latency: 
       p50: 8.6159
       p66: 8.7015
       p75: 8.7256
       p80: 8.8329
       p90: 8.9456
       p95: 8.9532
       p98: 8.9652
       p99: 8.9727
    
  • parallel 150
    Benchmarking summary: 
     Time taken for tests: 67.204 seconds
     Expected number of requests: 1000
     Number of concurrency: 150
     Total requests: 1000
     Succeed requests: 1000
     Failed requests: 0
     Average QPS: 14.880
     Average latency: 9.535
     Throughput(average output tokens per second): 5952.054
     Average time to first token: 9.535
     Average input tokens per request: 157.292
     Average output tokens per request: 400.000
     Average time per output token: 0.00017
     Average package per request: 1.000
     Average package latency: 9.535
     Percentile of time to first token: 
       p50: 9.7244
       p66: 9.8244
       p75: 9.8670
       p80: 9.8891
       p90: 10.0026
       p95: 10.0508
       p98: 10.0876
       p99: 10.1092
     Percentile of request latency: 
       p50: 9.7244
       p66: 9.8244
       p75: 9.8670
       p80: 9.8891
       p90: 10.0026
       p95: 10.0508
       p98: 10.0876
       p99: 10.1092
    
  • parallel 200
    Benchmarking summary: 
     Time taken for tests: 61.252 seconds
     Expected number of requests: 1000
     Number of concurrency: 200
     Total requests: 1000
     Succeed requests: 1000
     Failed requests: 0
     Average QPS: 16.326
     Average latency: 12.010
     Throughput(average output tokens per second): 6530.446
     Average time to first token: 12.010
     Average input tokens per request: 157.292
     Average output tokens per request: 400.000
     Average time per output token: 0.00015
     Average package per request: 1.000
     Average package latency: 12.010
     Percentile of time to first token: 
       p50: 12.1211
       p66: 12.2211
       p75: 12.2520
       p80: 12.2641
       p90: 12.3709
       p95: 12.3958
       p98: 12.4472
       p99: 12.4868
     Percentile of request latency: 
       p50: 12.1211
       p66: 12.2211
       p75: 12.2520
       p80: 12.2641
       p90: 12.3709
       p95: 12.3958
       p98: 12.4472
       p99: 12.4868
    
  • parallel 300
    Benchmarking summary: 
     Time taken for tests: 63.928 seconds
     Expected number of requests: 1000
     Number of concurrency: 300
     Total requests: 1000
     Succeed requests: 1000
     Failed requests: 0
     Average QPS: 15.643
     Average latency: 17.516
     Throughput(average output tokens per second): 6257.033
     Average time to first token: 17.516
     Average input tokens per request: 157.292
     Average output tokens per request: 400.000
     Average time per output token: 0.00016
     Average package per request: 1.000
     Average package latency: 17.516
     Percentile of time to first token: 
       p50: 14.5084
       p66: 21.8597
       p75: 22.9786
       p80: 23.3034
       p90: 23.8041
       p95: 25.4313
       p98: 25.8759
       p99: 26.0190
     Percentile of request latency: 
       p50: 14.5084
       p66: 21.8597
       p75: 22.9786
       p80: 23.3034
       p90: 23.8041
       p95: 25.4313
       p98: 25.8759
       p99: 26.0190
    
  • parallel 400
    Benchmarking summary: 
     Time taken for tests: 70.753 seconds
     Expected number of requests: 1000
     Number of concurrency: 400
     Total requests: 884
     Succeed requests: 884
     Failed requests: 0
     Average QPS: 12.494
     Average latency: 20.797
     Throughput(average output tokens per second): 4997.658
     Average time to first token: 20.797
     Average input tokens per request: 157.958
     Average output tokens per request: 400.000
     Average time per output token: 0.00020
     Average package per request: 1.000
     Average package latency: 20.797
     Percentile of time to first token: 
       p50: 20.3585
       p66: 25.7757
       p75: 26.3887
       p80: 27.0304
       p90: 28.1569
       p95: 28.6731
       p98: 29.6462
       p99: 29.8135
     Percentile of request latency: 
       p50: 20.3585
       p66: 25.7757
       p75: 26.3887
       p80: 27.0304
       p90: 28.1569
       p95: 28.6731
       p98: 29.6462
       p99: 29.8135
    
  • parallel 500
    Benchmarking summary: 
     Time taken for tests: 72.668 seconds
     Expected number of requests: 1000
     Number of concurrency: 500
     Total requests: 731
     Succeed requests: 547
     Failed requests: 184
     Average QPS: 7.527
     Average latency: 18.579
     Throughput(average output tokens per second): 3010.939
     Average time to first token: 18.579
     Average input tokens per request: 156.399
     Average output tokens per request: 400.000
     Average time per output token: 0.00033
     Average package per request: 1.000
     Average package latency: 18.579
     Percentile of time to first token: 
       p50: 17.9382
       p66: 19.5846
       p75: 20.2549
       p80: 20.4220
       p90: 28.4858
       p95: 29.5889
       p98: 29.9512
       p99: 30.0487
     Percentile of request latency: 
       p50: 17.9382
       p66: 19.5846
       p75: 20.2549
       p80: 20.4220
       p90: 28.4858
       p95: 29.5889
       p98: 29.9512
       p99: 30.0487
    
  • parallel 600
    Benchmarking summary: 
     Time taken for tests: 75.559 seconds
     Expected number of requests: 1000
     Number of concurrency: 600
     Total requests: 616
     Succeed requests: 400
     Failed requests: 216
     Average QPS: 5.294
     Average latency: 21.424
     Throughput(average output tokens per second): 2117.557
     Average time to first token: 21.424
     Average input tokens per request: 157.625
     Average output tokens per request: 400.000
     Average time per output token: 0.00047
     Average package per request: 1.000
     Average package latency: 21.424
     Percentile of time to first token: 
       p50: 22.6927
       p66: 23.5150
       p75: 24.1050
       p80: 24.2716
       p90: 24.4160
       p95: 24.5849
       p98: 29.1271
       p99: 30.2161
     Percentile of request latency: 
       p50: 22.6927
       p66: 23.5150
       p75: 24.1050
       p80: 24.2716
       p90: 24.4160
       p95: 24.5849
       p98: 29.1271
       p99: 30.2161
    

安装依赖库

pip install evalscope-perf
pip install evalscope

执行命令

evalscope-perf http://127.0.0.1:1025/v1/chat/completions qwen \
    ./datasets/Codefuse-Evol-Instruct-Clean-data.jsonl \
    --parallels 32 \
    --parallels 64 \
    --parallels 100 \
    --parallels 128 \
    --parallels 150 \
    --parallels 200 \
    --parallels 256 \
    --parallels 300 \
    --parallels 400 \
    --parallels 500 \
    --parallels 600 \
    --parallels 700 \
    --parallels 800 \
    --parallels 900 \
    --parallels 1000 \
    --n 2000

绘图代码

import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties

# 设置中文字体
font_path =  '/System/Library/Fonts/Hiragino Sans GB.ttc' # 替换为你的字体文件路径
font_prop = FontProperties(fname=font_path)

# 数据: Qwen1.5-7B-Chat
concurrency = [8, 16, 32, 64, 128, 150, 200, 256, 300, 400, 512, 720]
time = [404.284, 215.085, 120.876, 82.884, 52.844, 48.884, 44.856, 42.750, 43.729, 42.866, 41.655, 40.580]
qps = [2.474, 4.649, 8.273, 12.065, 18.924, 20.457, 22.294, 23.392, 22.868, 23.328, 24.007, 24.643]
latency = [3.213, 3.391, 3.724, 4.974, 6.108, 6.517, 7.805, 9.208, 10.904, 13.628, 16.031, 18.790]
throughput = [594.929, 1119.102, 1989.159, 2903.023, 4559.489, 4920.856, 5354.701, 5636.846, 5506.567, 5618.110, 5772.230, 5940.622]
p50 = [3.2461, 3.4271, 3.7514, 5.0248, 6.1491, 6.5487, 7.7782, 9.1754, 10.9164, 13.9850, 17.0675, 20.4161]
p90 = [4.9771, 5.2905, 5.8484, 7.7522, 9.5705, 10.1980, 12.3493, 13.6320, 15.8667, 19.2667, 22.7640, 26.8183]

# 数据: Qwen1.5-14B-Chat
# concurrency = [8, 16, 32, 64, 128, 150, 200, 256, 512]
# time = [578.571, 361.169, 253.040, 204.961, 170.001, 169.981, 162.999, 159.840, 153.937]
# qps = [1.727, 2.766, 3.952, 4.874, 5.882, 5.877, 6.129, 6.250, 6.490]
# latency = [3.712, 3.928, 4.581, 5.628, 7.223, 8.004, 9.205, 11.446, 22.695]
# throughput = [480.043, 897.511, 915.133, 2333.955, 1363.656, 3621.096, 4310.525, 4333.013, 3806.748]
# p50 = [3.7038, 3.9261, 4.4544, 5.6047, 7.0534, 7.9484, 9.0271, 11.3066, 23.6591]
# p90 = [5.7184, 6.0562, 6.9198, 8.7597, 11.1194, 12.5181, 14.5017, 16.8646, 33.3052]

# 数据: Qwen2-72B-Chat
# concurrency = [8, 16, 32, 64, 128, 150, 200, 256, 512]
# time = [1569.707, 909.001, 567.479, 382.247, 179.015, 270.054, 251.060, 237.063, 206.734]
# qps = [0.636, 1.099, 1.759, 2.613, 5.586, 3.699, 3.975, 4.214, 4.832]
# latency = [11.705, 12.806, 14.526, 17.296, 21.041, 23.856, 28.198, 33.176, 59.912]
# throughput = [188.589, 342.764, 588.262, 973.795, 1549.069, 1595.176, 1748.784, 1866.155, 1828.363]
# p50 = [11.8443, 12.8275, 14.7115, 17.5290, 21.3863, 23.8551, 28.4407, 32.7329, 63.0404]
# p90 = [16.2106, 17.5466, 20.0371, 23.9890, 29.5537, 33.7459, 40.4210, 45.7666, 81.6602]


# 绘制曲线
plt.figure(figsize=(12, 8))

# 用时 vs 并行数
plt.subplot(2, 3, 1)
plt.plot(concurrency, time, marker='o')
plt.title('用时 vs 并行数', fontproperties=font_prop)
plt.xlabel('并行数', fontproperties=font_prop)
plt.ylabel('用时 (秒)', fontproperties=font_prop)

# QPS vs 并行数
plt.subplot(2, 3, 2)
plt.plot(concurrency, qps, marker='o')
plt.title('QPS vs 并行数', fontproperties=font_prop)
plt.xlabel('并行数', fontproperties=font_prop)
plt.ylabel('QPS', fontproperties=font_prop)

# 延迟 vs 并行数
plt.subplot(2, 3, 3)
plt.plot(concurrency, latency, marker='o')
plt.title('延迟 vs 并行数', fontproperties=font_prop)
plt.xlabel('并行数', fontproperties=font_prop)
plt.ylabel('延迟 (秒)', fontproperties=font_prop)

# 吞吐量 vs 并行数
plt.subplot(2, 3, 4)
plt.plot(concurrency, throughput, marker='o')
plt.title('吞吐量 vs 并行数', fontproperties=font_prop)
plt.xlabel('并行数', fontproperties=font_prop)
plt.ylabel('吞吐量 (每秒输出的token数)', fontproperties=font_prop)

# p50 vs 并行数
plt.subplot(2, 3, 5)
plt.plot(concurrency, p50, marker='o')
plt.title('p50 vs 并行数', fontproperties=font_prop)
plt.xlabel('并行数', fontproperties=font_prop)
plt.ylabel('p50 (秒)', fontproperties=font_prop)

# p90 vs 并行数
plt.subplot(2, 3, 6)
plt.plot(concurrency, p90, marker='o')
plt.title('p90 vs 并行数', fontproperties=font_prop)
plt.xlabel('并行数', fontproperties=font_prop)
plt.ylabel('p90 (秒)', fontproperties=font_prop)

# 显示图表
plt.tight_layout()
plt.show()

实验结果(vLLM)

Qwen1.5-7B-Chat

指标 8 16 32 64 100 128 150 200
用时 2555.302 1355.736 800.953 515.309 403.138 375.187 386.202 355.307
QPS 0.391 0.738 1.249 1.941 2.481 2.660 2.569 2.730
延迟 20.326 21.475 24.877 31.015 37.778 43.181 52.514 61.304
吞吐量 94.803 177.014 300.603 469.235 595.103 638.980 612.597 640.172
p50 20.5326 21.6749 24.9076 31.2051 37.8700 42.3652 52.1732 61.0796
p90 31.8381 33.7150 38.8008 48.5248 59.2335 68.8249 84.6935 96.6051
            6 28
  • 平均每个请求的输入 token 数: 40
  • 平均每个请求的输出 token 数: 240
# 数据
concurrency = [8, 16, 32, 64, 100, 128, 150, 200]
time = [2555.302, 1355.736, 800.953, 515.309, 403.138, 375.187, 386.202, 355.307]
qps = [0.391, 0.738, 1.249, 1.941, 2.481, 2.660, 2.569, 2.730]
latency = [20.326, 21.475, 24.877, 31.015, 37.778, 43.181, 52.514, 61.304]
throughput = [94.803, 177.014, 300.603, 469.235, 595.103, 638.980, 612.597, 640.172]
p50 = [20.5326, 21.6749, 24.9076, 31.2051, 37.8700, 42.3652, 52.1732, 61.0796]
p90 = [31.8381, 33.7150, 38.8008, 48.5248, 59.2335, 68.8249, 84.6935, 96.6051]
  • parallel 8
    Benchmarking summary: 
     Time taken for tests: 2555.302 seconds
     Expected number of requests: 1000
     Number of concurrency: 8
     Total requests: 1000
     Succeed requests: 1000
     Failed requests: 0
     Average QPS: 0.391
     Average latency: 20.326
     Throughput(average output tokens per second): 94.803
     Average time to first token: 20.326
     Average input tokens per request: 40.296
     Average output tokens per request: 242.251
     Average time per output token: 0.01055
     Average package per request: 1.000
     Average package latency: 20.326
     Percentile of time to first token: 
       p50: 20.5326
       p66: 23.7691
       p75: 26.5282
       p80: 28.1502
       p90: 31.8381
       p95: 35.6152
       p98: 40.9497
       p99: 45.7076
     Percentile of request latency: 
       p50: 20.5326
       p66: 23.7691
       p75: 26.5282
       p80: 28.1502
       p90: 31.8381
       p95: 35.6152
       p98: 40.9497
       p99: 45.7076
    
  • parallel 16
    Benchmarking summary: 
     Time taken for tests: 1355.736 seconds
     Expected number of requests: 1000
     Number of concurrency: 16
     Total requests: 1000
     Succeed requests: 1000
     Failed requests: 0
     Average QPS: 0.738
     Average latency: 21.475
     Throughput(average output tokens per second): 177.014
     Average time to first token: 21.475
     Average input tokens per request: 40.296
     Average output tokens per request: 239.984
     Average time per output token: 0.00565
     Average package per request: 1.000
     Average package latency: 21.475
     Percentile of time to first token: 
       p50: 21.6749
       p66: 25.2886
       p75: 27.4429
       p80: 29.0391
       p90: 33.7150
       p95: 37.1781
       p98: 42.4568
       p99: 45.9629
     Percentile of request latency: 
       p50: 21.6749
       p66: 25.2886
       p75: 27.4429
       p80: 29.0391
       p90: 33.7150
       p95: 37.1781
       p98: 42.4568
       p99: 45.9629
    
  • parallel 32
    Benchmarking summary: 
     Time taken for tests: 800.953 seconds
     Expected number of requests: 1000
     Number of concurrency: 32
     Total requests: 1000
     Succeed requests: 1000
     Failed requests: 0
     Average QPS: 1.249
     Average latency: 24.877
     Throughput(average output tokens per second): 300.603
     Average time to first token: 24.877
     Average input tokens per request: 40.296
     Average output tokens per request: 240.769
     Average time per output token: 0.00333
     Average package per request: 1.000
     Average package latency: 24.877
     Percentile of time to first token: 
       p50: 24.9076
       p66: 29.1147
       p75: 32.1535
       p80: 34.1477
       p90: 38.8008
       p95: 43.0980
       p98: 48.4589
       p99: 53.6278
     Percentile of request latency: 
       p50: 24.9076
       p66: 29.1147
       p75: 32.1535
       p80: 34.1477
       p90: 38.8008
       p95: 43.0980
       p98: 48.4589
       p99: 53.6278
    
  • parallel 64
    Benchmarking summary: 
     Time taken for tests: 515.309 seconds
     Expected number of requests: 1000
     Number of concurrency: 64
     Total requests: 1000
     Succeed requests: 1000
     Failed requests: 0
     Average QPS: 1.941
     Average latency: 31.015
     Throughput(average output tokens per second): 469.235
     Average time to first token: 31.015
     Average input tokens per request: 40.296
     Average output tokens per request: 241.801
     Average time per output token: 0.00213
     Average package per request: 1.000
     Average package latency: 31.015
     Percentile of time to first token: 
       p50: 31.2051
       p66: 36.5305
       p75: 40.1962
       p80: 42.1270
       p90: 48.5248
       p95: 53.4686
       p98: 60.3826
       p99: 65.7931
     Percentile of request latency: 
       p50: 31.2051
       p66: 36.5305
       p75: 40.1962
       p80: 42.1270
       p90: 48.5248
       p95: 53.4686
       p98: 60.3826
       p99: 65.7931
    
  • parallel 100
    Benchmarking summary: 
     Time taken for tests: 403.138 seconds
     Expected number of requests: 1000
     Number of concurrency: 100
     Total requests: 1000
     Succeed requests: 1000
     Failed requests: 0
     Average QPS: 2.481
     Average latency: 37.778
     Throughput(average output tokens per second): 595.103
     Average time to first token: 37.778
     Average input tokens per request: 40.296
     Average output tokens per request: 239.909
     Average time per output token: 0.00168
     Average package per request: 1.000
     Average package latency: 37.778
     Percentile of time to first token: 
       p50: 37.8700
       p66: 44.4992
       p75: 49.1464
       p80: 52.3330
       p90: 59.2335
       p95: 64.9315
       p98: 74.1723
       p99: 80.4544
     Percentile of request latency: 
       p50: 37.8700
       p66: 44.4992
       p75: 49.1464
       p80: 52.3330
       p90: 59.2335
       p95: 64.9315
       p98: 74.1723
       p99: 80.4544
    
  • parallel 128
    Benchmarking summary: 
     Time taken for tests: 375.187 seconds
     Expected number of requests: 1000
     Number of concurrency: 128
     Total requests: 999
     Succeed requests: 998
     Failed requests: 1
     Average QPS: 2.660
     Average latency: 43.181
     Throughput(average output tokens per second): 638.980
     Average time to first token: 43.181
     Average input tokens per request: 40.297
     Average output tokens per request: 240.217
     Average time per output token: 0.00156
     Average package per request: 1.000
     Average package latency: 43.181
     Percentile of time to first token: 
       p50: 42.3652
       p66: 51.6154
       p75: 56.2693
       p80: 59.2260
       p90: 68.8249
       p95: 75.4546
       p98: 85.2362
       p99: 93.3652
     Percentile of request latency: 
       p50: 42.3652
       p66: 51.6154
       p75: 56.2693
       p80: 59.2260
       p90: 68.8249
       p95: 75.4546
       p98: 85.2362
       p99: 93.3652
    
  • parallel 150
    Benchmarking summary: 
     Time taken for tests: 386.202 seconds
     Expected number of requests: 1000
     Number of concurrency: 150
     Total requests: 998
     Succeed requests: 992
     Failed requests: 6
     Average QPS: 2.569
     Average latency: 52.514
     Throughput(average output tokens per second): 612.597
     Average time to first token: 52.514
     Average input tokens per request: 40.303
     Average output tokens per request: 238.494
     Average time per output token: 0.00163
     Average package per request: 1.000
     Average package latency: 52.514
     Percentile of time to first token: 
       p50: 52.1732
       p66: 61.4450
       p75: 67.9417
       p80: 72.0179
       p90: 84.6935
       p95: 92.2286
       p98: 101.7002
       p99: 107.7996
     Percentile of request latency: 
       p50: 52.1732
       p66: 61.4450
       p75: 67.9417
       p80: 72.0179
       p90: 84.6935
       p95: 92.2286
       p98: 101.7002
       p99: 107.7996
    
  • parallel 200
    Benchmarking summary: 
     Time taken for tests: 355.307 seconds
     Expected number of requests: 1000
     Number of concurrency: 200
     Total requests: 998
     Succeed requests: 970
     Failed requests: 28
     Average QPS: 2.730
     Average latency: 61.304
     Throughput(average output tokens per second): 640.172
     Average time to first token: 61.304
     Average input tokens per request: 40.287
     Average output tokens per request: 234.493
     Average time per output token: 0.00156
     Average package per request: 1.000
     Average package latency: 61.304
     Percentile of time to first token: 
       p50: 61.0796
       p66: 74.1724
       p75: 80.9575
       p80: 84.8420
       p90: 96.6051
       p95: 105.2873
       p98: 114.3178
       p99: 115.7257
     Percentile of request latency: 
       p50: 61.0796
       p66: 74.1724
       p75: 80.9575
       p80: 84.8420
       p90: 96.6051
       p95: 105.2873
       p98: 114.3178
       p99: 115.7257
    

Qwen2.5-72B-Chat

指标 16 32 64
用时 406.618 223.379 159.293
QPS 0.236 0.448 0.609
延迟 53.359 54.293 60.959
吞吐量 69.242 129.193 185.140
p50 52.8675 56.3690 58.7318
p90 80.6415 87.1464 90.6065
  • 平均每个请求的输入 token 数: 50
  • 平均每个请求的输出 token 数: 290
# 数据
concurrency = [16, 32, 64]
time = [406.618, 223.379, 159.293]
qps = [0.236, 0.448, 0.609]
latency = [53.359, 54.293, 60.959]
throughput = [69.242, 129.193, 185.140]
p50 = [52.8675, 56.3690, 58.7318]
p90 = [80.6415, 87.1464, 90.6065]
  • parallel 1
    Benchmarking summary: 
     Time taken for tests: 80.089 seconds
     Expected number of requests: 1
     Number of concurrency: 1
     Total requests: 1
     Succeed requests: 1
     Failed requests: 0
     Average QPS: 0.012
     Average latency: 79.353
     Throughput(average output tokens per second): 5.819
     Average time to first token: 79.353
     Average input tokens per request: 44.000
     Average output tokens per request: 466.000
     Average time per output token: 0.17186
     Average package per request: 1.000
     Average package latency: 79.353
    
  • parallel 16
    Benchmarking summary: 
     Time taken for tests: 406.618 seconds
     Expected number of requests: 100
     Number of concurrency: 16
     Total requests: 98
     Succeed requests: 96
     Failed requests: 2
     Average QPS: 0.236
     Average latency: 53.359
     Throughput(average output tokens per second): 69.242
     Average time to first token: 53.359
     Average input tokens per request: 50.156
     Average output tokens per request: 293.281
     Average time per output token: 0.01444
     Average package per request: 1.000
     Average package latency: 53.359
     Percentile of time to first token: 
       p50: 52.8675
       p66: 62.7250
       p75: 70.2725
       p80: 73.0569
       p90: 80.6415
       p95: 91.7408
       p98: 96.5681
       p99: 114.6619
     Percentile of request latency: 
       p50: 52.8675
       p66: 62.7250
       p75: 70.2725
       p80: 73.0569
       p90: 80.6415
       p95: 91.7408
       p98: 96.5681
       p99: 114.6619
    
  • parallel 32
    Benchmarking summary: 
     Time taken for tests: 223.379 seconds
     Expected number of requests: 100
     Number of concurrency: 32
     Total requests: 100
     Succeed requests: 100
     Failed requests: 0
     Average QPS: 0.448
     Average latency: 54.293
     Throughput(average output tokens per second): 129.193
     Average time to first token: 54.293
     Average input tokens per request: 49.890
     Average output tokens per request: 288.590
     Average time per output token: 0.00774
     Average package per request: 1.000
     Average package latency: 54.293
     Percentile of time to first token: 
       p50: 56.3690
       p66: 61.8036
       p75: 73.1621
       p80: 77.8210
       p90: 87.1464
       p95: 97.1933
       p98: 102.0623
       p99: 107.4325
     Percentile of request latency: 
       p50: 56.3690
       p66: 61.8036
       p75: 73.1621
       p80: 77.8210
       p90: 87.1464
       p95: 97.1933
       p98: 102.0623
       p99: 107.4325
    
  • parallel 64
    Benchmarking summary: 
     Time taken for tests: 159.293 seconds
     Expected number of requests: 100
     Number of concurrency: 64
     Total requests: 99
     Succeed requests: 97
     Failed requests: 2
     Average QPS: 0.609
     Average latency: 60.959
     Throughput(average output tokens per second): 185.140
     Average time to first token: 60.959
     Average input tokens per request: 50.093
     Average output tokens per request: 296.392
     Average time per output token: 0.00540
     Average package per request: 1.000
     Average package latency: 60.959
     Percentile of time to first token: 
       p50: 58.7318
       p66: 72.0510
       p75: 77.6354
       p80: 82.2268
       p90: 90.6065
       p95: 98.0913
       p98: 108.4727
       p99: 112.9350
     Percentile of request latency: 
       p50: 58.7318
       p66: 72.0510
       p75: 77.6354
       p80: 82.2268
       p90: 90.6065
       p95: 98.0913
       p98: 108.4727
       p99: 112.9350
    

实验结果(XInference - MindIE)

  • ❌ 部署时间长了,请求无响应。
  • ❌ 部署多副本,压测一次后,服务挂掉。

实验结果(Nvidia T4: XInference - vLLM)

Qwen2-7B-Instruct

指标 8 16 32 64 100 128 150 200 300 400 500
用时 1276.077 776.170 504.480 372.118 363.016 365.024 418.874 459.912 401.385 458.450 375.796
QPS 0.783 1.288 1.958 2.472 2.353 2.334 2.046 1.750 1.664 1.254 1.163
延迟 10.128 12.307 15.749 19.225 19.235 19.151 17.479 15.949 16.718 14.750 15.771
吞吐量 254.695 424.701 631.260 753.852 700.681 700.324 637.260 550.200 510.211 392.697 362.167
p50 10.3076 12.5569 16.3129 20.0818 20.1519 20.0230 17.9699 16.2526 17.0123 14.8173 15.8847
p90 14.3874 17.6026 22.5020 26.8947 26.5725 26.8284 24.7138 23.0809 24.4670 21.4278 23.7326
失败 1 0 12 71 101 72 32 25 66 47 89
  • 平均每个请求的输入 token 数: 40
  • 平均每个请求的输出 token 数: 300
evalscope-perf http://172.16.33.66:9997/v1/chat/completions gpt-4-32k \
    ./datasets/open_qa.jsonl \
    --parallels 8 \
    --parallels 16 \
    --parallels 32 \
    --parallels 64 \
    --parallels 100 \
    --parallels 128 \
    --parallels 150 \
    --parallels 200 \
    --parallels 300 \
    --parallels 400 \
    --parallels 500 \
    --n 1000
  • parallel 8
    Benchmarking summary: 
     Time taken for tests: 1276.077 seconds
     Expected number of requests: 1000
     Number of concurrency: 8
     Total requests: 1000
     Succeed requests: 999
     Failed requests: 1
     Average QPS: 0.783
     Average latency: 10.128
     Throughput(average output tokens per second): 254.695
     Average time to first token: 10.128
     Average input tokens per request: 40.295
     Average output tokens per request: 325.335
     Average time per output token: 0.00393
     Average package per request: 1.000
     Average package latency: 10.128
     Percentile of time to first token: 
       p50: 10.3076
       p66: 11.5736
       p75: 12.3458
       p80: 12.6923
       p90: 14.3874
       p95: 15.7724
       p98: 17.3099
       p99: 18.2167
     Percentile of request latency: 
       p50: 10.3076
       p66: 11.5736
       p75: 12.3458
       p80: 12.6923
       p90: 14.3874
       p95: 15.7724
       p98: 17.3099
       p99: 18.2167
    
  • parallel 16
    Benchmarking summary: 
     Time taken for tests: 776.170 seconds
     Expected number of requests: 1000
     Number of concurrency: 16
     Total requests: 1000
     Succeed requests: 1000
     Failed requests: 0
     Average QPS: 1.288
     Average latency: 12.307
     Throughput(average output tokens per second): 424.701
     Average time to first token: 12.307
     Average input tokens per request: 40.296
     Average output tokens per request: 329.640
     Average time per output token: 0.00235
     Average package per request: 1.000
     Average package latency: 12.307
     Percentile of time to first token: 
       p50: 12.5569
       p66: 13.8731
       p75: 14.9178
       p80: 15.6629
       p90: 17.6026
       p95: 19.0937
       p98: 21.7158
       p99: 22.9113
     Percentile of request latency: 
       p50: 12.5569
       p66: 13.8731
       p75: 14.9178
       p80: 15.6629
       p90: 17.6026
       p95: 19.0937
       p98: 21.7158
       p99: 22.9113
    
  • parallel 32
    Benchmarking summary: 
     Time taken for tests: 504.480 seconds
     Expected number of requests: 1000
     Number of concurrency: 32
     Total requests: 1000
     Succeed requests: 988
     Failed requests: 12
     Average QPS: 1.958
     Average latency: 15.749
     Throughput(average output tokens per second): 631.260
     Average time to first token: 15.749
     Average input tokens per request: 40.277
     Average output tokens per request: 322.326
     Average time per output token: 0.00158
     Average package per request: 1.000
     Average package latency: 15.749
     Percentile of time to first token: 
       p50: 16.3129
       p66: 18.2424
       p75: 19.2717
       p80: 19.9721
       p90: 22.5020
       p95: 24.8820
       p98: 26.8200
       p99: 27.5026
     Percentile of request latency: 
       p50: 16.3129
       p66: 18.2424
       p75: 19.2717
       p80: 19.9721
       p90: 22.5020
       p95: 24.8820
       p98: 26.8200
       p99: 27.5026
    
  • parallel 64
    Benchmarking summary: 
     Time taken for tests: 372.118 seconds
     Expected number of requests: 1000
     Number of concurrency: 64
     Total requests: 991
     Succeed requests: 920
     Failed requests: 71
     Average QPS: 2.472
     Average latency: 19.225
     Throughput(average output tokens per second): 753.852
     Average time to first token: 19.225
     Average input tokens per request: 40.204
     Average output tokens per request: 304.915
     Average time per output token: 0.00133
     Average package per request: 1.000
     Average package latency: 19.225
     Percentile of time to first token: 
       p50: 20.0818
       p66: 22.5724
       p75: 23.7237
       p80: 24.5372
       p90: 26.8947
       p95: 28.2781
       p98: 29.2713
       p99: 29.5592
     Percentile of request latency: 
       p50: 20.0818
       p66: 22.5724
       p75: 23.7237
       p80: 24.5372
       p90: 26.8947
       p95: 28.2781
       p98: 29.2713
       p99: 29.5592
    
  • parallel 100
    Benchmarking summary: 
     Time taken for tests: 363.016 seconds
     Expected number of requests: 1000
     Number of concurrency: 100
     Total requests: 955
     Succeed requests: 854
     Failed requests: 101
     Average QPS: 2.353
     Average latency: 19.235
     Throughput(average output tokens per second): 700.681
     Average time to first token: 19.235
     Average input tokens per request: 40.218
     Average output tokens per request: 297.843
     Average time per output token: 0.00143
     Average package per request: 1.000
     Average package latency: 19.235
     Percentile of time to first token: 
       p50: 20.1519
       p66: 22.4520
       p75: 23.9718
       p80: 24.6514
       p90: 26.5725
       p95: 28.0242
       p98: 29.1395
       p99: 29.5591
     Percentile of request latency: 
       p50: 20.1519
       p66: 22.4520
       p75: 23.9718
       p80: 24.6514
       p90: 26.5725
       p95: 28.0242
       p98: 29.1395
       p99: 29.5591
    
  • parallel 128
    Benchmarking summary: 
     Time taken for tests: 365.024 seconds
     Expected number of requests: 1000
     Number of concurrency: 128
     Total requests: 924
     Succeed requests: 852
     Failed requests: 72
     Average QPS: 2.334
     Average latency: 19.151
     Throughput(average output tokens per second): 700.324
     Average time to first token: 19.151
     Average input tokens per request: 40.188
     Average output tokens per request: 300.041
     Average time per output token: 0.00143
     Average package per request: 1.000
     Average package latency: 19.151
     Percentile of time to first token: 
       p50: 20.0230
       p66: 22.2805
       p75: 23.6442
       p80: 24.5745
       p90: 26.8284
       p95: 28.1978
       p98: 29.3051
       p99: 29.6282
     Percentile of request latency: 
       p50: 20.0230
       p66: 22.2805
       p75: 23.6442
       p80: 24.5745
       p90: 26.8284
       p95: 28.1978
       p98: 29.3051
       p99: 29.6282
    
  • parallel 150
    Benchmarking summary: 
     Time taken for tests: 418.874 seconds
     Expected number of requests: 1000
     Number of concurrency: 150
     Total requests: 889
     Succeed requests: 857
     Failed requests: 32
     Average QPS: 2.046
     Average latency: 17.479
     Throughput(average output tokens per second): 637.260
     Average time to first token: 17.479
     Average input tokens per request: 40.272
     Average output tokens per request: 311.473
     Average time per output token: 0.00157
     Average package per request: 1.000
     Average package latency: 17.479
     Percentile of time to first token: 
       p50: 17.9699
       p66: 20.0651
       p75: 21.5481
       p80: 22.2761
       p90: 24.7138
       p95: 27.1316
       p98: 28.8103
       p99: 29.3229
     Percentile of request latency: 
       p50: 17.9699
       p66: 20.0651
       p75: 21.5481
       p80: 22.2761
       p90: 24.7138
       p95: 27.1316
       p98: 28.8103
       p99: 29.3229
    
  • parallel 200
    Benchmarking summary: 
     Time taken for tests: 459.912 seconds
     Expected number of requests: 1000
     Number of concurrency: 200
     Total requests: 830
     Succeed requests: 805
     Failed requests: 25
     Average QPS: 1.750
     Average latency: 15.949
     Throughput(average output tokens per second): 550.200
     Average time to first token: 15.949
     Average input tokens per request: 40.337
     Average output tokens per request: 314.340
     Average time per output token: 0.00182
     Average package per request: 1.000
     Average package latency: 15.949
     Percentile of time to first token: 
       p50: 16.2526
       p66: 18.0521
       p75: 19.4394
       p80: 20.1756
       p90: 23.0809
       p95: 25.5287
       p98: 28.5592
       p99: 29.4651
     Percentile of request latency: 
       p50: 16.2526
       p66: 18.0521
       p75: 19.4394
       p80: 20.1756
       p90: 23.0809
       p95: 25.5287
       p98: 28.5592
       p99: 29.4651
    
  • parallel 300
    Benchmarking summary: 
     Time taken for tests: 401.385 seconds
     Expected number of requests: 1000
     Number of concurrency: 300
     Total requests: 734
     Succeed requests: 668
     Failed requests: 66
     Average QPS: 1.664
     Average latency: 16.718
     Throughput(average output tokens per second): 510.211
     Average time to first token: 16.718
     Average input tokens per request: 40.383
     Average output tokens per request: 306.573
     Average time per output token: 0.00196
     Average package per request: 1.000
     Average package latency: 16.718
     Percentile of time to first token: 
       p50: 17.0123
       p66: 19.2765
       p75: 20.8575
       p80: 21.9712
       p90: 24.4670
       p95: 26.6873
       p98: 28.2457
       p99: 29.1315
     Percentile of request latency: 
       p50: 17.0123
       p66: 19.2765
       p75: 20.8575
       p80: 21.9712
       p90: 24.4670
       p95: 26.6873
       p98: 28.2457
       p99: 29.1315
    
  • parallel 400
    Benchmarking summary: 
     Time taken for tests: 458.450 seconds
     Expected number of requests: 1000
     Number of concurrency: 400
     Total requests: 622
     Succeed requests: 575
     Failed requests: 47
     Average QPS: 1.254
     Average latency: 14.750
     Throughput(average output tokens per second): 392.697
     Average time to first token: 14.750
     Average input tokens per request: 40.310
     Average output tokens per request: 313.099
     Average time per output token: 0.00255
     Average package per request: 1.000
     Average package latency: 14.750
     Percentile of time to first token: 
       p50: 14.8173
       p66: 16.7375
       p75: 18.0656
       p80: 18.7752
       p90: 21.4278
       p95: 24.3040
       p98: 28.1092
       p99: 29.1603
     Percentile of request latency: 
       p50: 14.8173
       p66: 16.7375
       p75: 18.0656
       p80: 18.7752
       p90: 21.4278
       p95: 24.3040
       p98: 28.1092
       p99: 29.1603
    
  • parallel 500
    Benchmarking summary: 
     Time taken for tests: 375.796 seconds
     Expected number of requests: 1000
     Number of concurrency: 500
     Total requests: 526
     Succeed requests: 437
     Failed requests: 89
     Average QPS: 1.163
     Average latency: 15.771
     Throughput(average output tokens per second): 362.167
     Average time to first token: 15.771
     Average input tokens per request: 40.268
     Average output tokens per request: 311.444
     Average time per output token: 0.00276
     Average package per request: 1.000
     Average package latency: 15.771
     Percentile of time to first token: 
       p50: 15.8847
       p66: 17.4772
       p75: 18.8302
       p80: 20.0079
       p90: 23.7326
       p95: 26.3385
       p98: 28.6426
       p99: 29.0919
     Percentile of request latency: 
       p50: 15.8847
       p66: 17.4772
       p75: 18.8302
       p80: 20.0079
       p90: 23.7326
       p95: 26.3385
       p98: 28.6426
       p99: 29.0919
    

实验结果(Nvidia T4: vLLM)

Qwen2-7B-Instruct

指标 8 16 32 64 100 128 150 200 300 400 500
用时 1030.632 629.843 409.251 280.896 252.708 236.597 254.208 274.376 256.799 222.726 175.166
QPS 0.970 1.588 2.443 3.503 3.593 3.580 3.509 3.076 2.847 2.820 3.060
延迟 8.213 9.944 12.846 16.913 19.182 19.831 17.806 16.743 16.285 16.285 17.649
吞吐量 298.860 496.458 753.176 1073.697 1039.610 1005.881 1032.794 925.476 839.437 816.049 875.565
p50 8.4703 10.2975 13.3060 17.5850 20.1962 20.9986 18.6905 17.0203 16.6503 16.4550 18.0830
p90 11.8119 14.2264 18.5003 23.7261 27.0515 27.5461 25.4481 24.0718 23.5927 24.0938 25.7348
失败 0 0 0 15 73 115 26 11 19 23 26
  • 平均每个请求的输入 token 数: 40
  • 平均每个请求的输出 token 数: 300

部署

python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 --port 8008 \
    --model /data/models/llm/qwen/Qwen2-7B-Instruct/ \
    --served-model-name qwen2-7b \
    --tensor-parallel-size 4 \
    --dtype=float16 \
    --max-model-len 16000

压测

evalscope-perf http://172.16.33.66:8008/v1/chat/completions qwen2-7b \
    ./datasets/open_qa.jsonl \
    --parallels 8 \
    --parallels 16 \
    --parallels 32 \
    --parallels 64 \
    --parallels 100 \
    --parallels 128 \
    --parallels 150 \
    --parallels 200 \
    --parallels 300 \
    --parallels 400 \
    --parallels 500 \
    --n 1000
  • parallel 8
    Benchmarking summary: 
     Time taken for tests: 1030.632 seconds
     Expected number of requests: 1000
     Number of concurrency: 8
     Total requests: 1000
     Succeed requests: 1000
     Failed requests: 0
     Average QPS: 0.970
     Average latency: 8.213
     Throughput(average output tokens per second): 298.860
     Average time to first token: 8.213
     Average input tokens per request: 40.296
     Average output tokens per request: 308.015
     Average time per output token: 0.00335
     Average package per request: 1.000
     Average package latency: 8.213
     Percentile of time to first token: 
       p50: 8.4703
       p66: 9.3882
       p75: 10.1059
       p80: 10.5404
       p90: 11.8119
       p95: 12.8131
       p98: 14.0085
       p99: 15.0006
     Percentile of request latency: 
       p50: 8.4703
       p66: 9.3882
       p75: 10.1059
       p80: 10.5404
       p90: 11.8119
       p95: 12.8131
       p98: 14.0085
       p99: 15.0006
    
  • parallel 16
    Benchmarking summary: 
     Time taken for tests: 629.843 seconds
     Expected number of requests: 1000
     Number of concurrency: 16
     Total requests: 1000
     Succeed requests: 1000
     Failed requests: 0
     Average QPS: 1.588
     Average latency: 9.944
     Throughput(average output tokens per second): 496.458
     Average time to first token: 9.944
     Average input tokens per request: 40.296
     Average output tokens per request: 312.691
     Average time per output token: 0.00201
     Average package per request: 1.000
     Average package latency: 9.944
     Percentile of time to first token: 
       p50: 10.2975
       p66: 11.4685
       p75: 12.2523
       p80: 12.7172
       p90: 14.2264
       p95: 15.5556
       p98: 17.0453
       p99: 18.0745
     Percentile of request latency: 
       p50: 10.2975
       p66: 11.4685
       p75: 12.2523
       p80: 12.7172
       p90: 14.2264
       p95: 15.5556
       p98: 17.0453
       p99: 18.0745
    
  • parallel 32
    Benchmarking summary: 
     Time taken for tests: 409.251 seconds
     Expected number of requests: 1000
     Number of concurrency: 32
     Total requests: 1000
     Succeed requests: 1000
     Failed requests: 0
     Average QPS: 2.443
     Average latency: 12.846
     Throughput(average output tokens per second): 753.176
     Average time to first token: 12.846
     Average input tokens per request: 40.296
     Average output tokens per request: 308.238
     Average time per output token: 0.00133
     Average package per request: 1.000
     Average package latency: 12.846
     Percentile of time to first token: 
       p50: 13.3060
       p66: 14.7707
       p75: 15.6738
       p80: 16.5227
       p90: 18.5003
       p95: 20.2770
       p98: 22.3723
       p99: 23.2546
     Percentile of request latency: 
       p50: 13.3060
       p66: 14.7707
       p75: 15.6738
       p80: 16.5227
       p90: 18.5003
       p95: 20.2770
       p98: 22.3723
       p99: 23.2546
    
  • parallel 64
    Benchmarking summary: 
     Time taken for tests: 280.896 seconds
     Expected number of requests: 1000
     Number of concurrency: 64
     Total requests: 999
     Succeed requests: 984
     Failed requests: 15
     Average QPS: 3.503
     Average latency: 16.913
     Throughput(average output tokens per second): 1073.697
     Average time to first token: 16.913
     Average input tokens per request: 40.278
     Average output tokens per request: 306.501
     Average time per output token: 0.00093
     Average package per request: 1.000
     Average package latency: 16.913
     Percentile of time to first token: 
       p50: 17.5850
       p66: 19.7422
       p75: 20.8753
       p80: 21.5526
       p90: 23.7261
       p95: 25.7272
       p98: 28.0227
       p99: 28.5129
     Percentile of request latency: 
       p50: 17.5850
       p66: 19.7422
       p75: 20.8753
       p80: 21.5526
       p90: 23.7261
       p95: 25.7272
       p98: 28.0227
       p99: 28.5129
    
  • parallel 100
    Benchmarking summary: 
     Time taken for tests: 252.708 seconds
     Expected number of requests: 1000
     Number of concurrency: 100
     Total requests: 981
     Succeed requests: 908
     Failed requests: 73
     Average QPS: 3.593
     Average latency: 19.182
     Throughput(average output tokens per second): 1039.610
     Average time to first token: 19.182
     Average input tokens per request: 40.166
     Average output tokens per request: 289.337
     Average time per output token: 0.00096
     Average package per request: 1.000
     Average package latency: 19.182
     Percentile of time to first token: 
       p50: 20.1962
       p66: 22.6797
       p75: 23.9937
       p80: 24.8991
       p90: 27.0515
       p95: 28.3231
       p98: 29.3312
       p99: 29.6593
     Percentile of request latency: 
       p50: 20.1962
       p66: 22.6797
       p75: 23.9937
       p80: 24.8991
       p90: 27.0515
       p95: 28.3231
       p98: 29.3312
       p99: 29.6593
    
  • parallel 128
    Benchmarking summary: 
     Time taken for tests: 236.597 seconds
     Expected number of requests: 1000
     Number of concurrency: 128
     Total requests: 962
     Succeed requests: 847
     Failed requests: 115
     Average QPS: 3.580
     Average latency: 19.831
     Throughput(average output tokens per second): 1005.881
     Average time to first token: 19.831
     Average input tokens per request: 40.038
     Average output tokens per request: 280.979
     Average time per output token: 0.00099
     Average package per request: 1.000
     Average package latency: 19.831
     Percentile of time to first token: 
       p50: 20.9986
       p66: 23.7704
       p75: 25.2322
       p80: 25.8841
       p90: 27.5461
       p95: 28.7326
       p98: 29.5878
       p99: 29.8692
     Percentile of request latency: 
       p50: 20.9986
       p66: 23.7704
       p75: 25.2322
       p80: 25.8841
       p90: 27.5461
       p95: 28.7326
       p98: 29.5878
       p99: 29.8692
    
  • parallel 150
    Benchmarking summary: 
     Time taken for tests: 254.208 seconds
     Expected number of requests: 1000
     Number of concurrency: 150
     Total requests: 918
     Succeed requests: 892
     Failed requests: 26
     Average QPS: 3.509
     Average latency: 17.806
     Throughput(average output tokens per second): 1032.794
     Average time to first token: 17.806
     Average input tokens per request: 40.286
     Average output tokens per request: 294.333
     Average time per output token: 0.00097
     Average package per request: 1.000
     Average package latency: 17.806
     Percentile of time to first token: 
       p50: 18.6905
       p66: 20.7610
       p75: 22.0715
       p80: 23.0934
       p90: 25.4481
       p95: 27.6829
       p98: 29.1659
       p99: 29.6321
     Percentile of request latency: 
       p50: 18.6905
       p66: 20.7610
       p75: 22.0715
       p80: 23.0934
       p90: 25.4481
       p95: 27.6829
       p98: 29.1659
       p99: 29.6321
    
  • parallel 200
    Benchmarking summary: 
     Time taken for tests: 274.376 seconds
     Expected number of requests: 1000
     Number of concurrency: 200
     Total requests: 855
     Succeed requests: 844
     Failed requests: 11
     Average QPS: 3.076
     Average latency: 16.743
     Throughput(average output tokens per second): 925.476
     Average time to first token: 16.743
     Average input tokens per request: 40.307
     Average output tokens per request: 300.863
     Average time per output token: 0.00108
     Average package per request: 1.000
     Average package latency: 16.743
     Percentile of time to first token: 
       p50: 17.0203
       p66: 19.1086
       p75: 20.4436
       p80: 21.6162
       p90: 24.0718
       p95: 26.2962
       p98: 28.3664
       p99: 29.4007
     Percentile of request latency: 
       p50: 17.0203
       p66: 19.1086
       p75: 20.4436
       p80: 21.6162
       p90: 24.0718
       p95: 26.2962
       p98: 28.3664
       p99: 29.4007
    
  • parallel 300
    Benchmarking summary: 
     Time taken for tests: 256.799 seconds
     Expected number of requests: 1000
     Number of concurrency: 300
     Total requests: 750
     Succeed requests: 731
     Failed requests: 19
     Average QPS: 2.847
     Average latency: 16.285
     Throughput(average output tokens per second): 839.437
     Average time to first token: 16.285
     Average input tokens per request: 40.453
     Average output tokens per request: 294.893
     Average time per output token: 0.00119
     Average package per request: 1.000
     Average package latency: 16.285
     Percentile of time to first token: 
       p50: 16.6503
       p66: 18.6659
       p75: 19.8392
       p80: 20.8286
       p90: 23.5927
       p95: 26.8649
       p98: 28.0731
       p99: 29.3460
     Percentile of request latency: 
       p50: 16.6503
       p66: 18.6659
       p75: 19.8392
       p80: 20.8286
       p90: 23.5927
       p95: 26.8649
       p98: 28.0731
       p99: 29.3460
    
  • parallel 400
    Benchmarking summary: 
     Time taken for tests: 222.726 seconds
     Expected number of requests: 1000
     Number of concurrency: 400
     Total requests: 651
     Succeed requests: 628
     Failed requests: 23
     Average QPS: 2.820
     Average latency: 16.285
     Throughput(average output tokens per second): 816.049
     Average time to first token: 16.285
     Average input tokens per request: 40.247
     Average output tokens per request: 289.419
     Average time per output token: 0.00123
     Average package per request: 1.000
     Average package latency: 16.285
     Percentile of time to first token: 
       p50: 16.4550
       p66: 18.4089
       p75: 20.1452
       p80: 20.9587
       p90: 24.0938
       p95: 26.5017
       p98: 28.0750
       p99: 28.6327
     Percentile of request latency: 
       p50: 16.4550
       p66: 18.4089
       p75: 20.1452
       p80: 20.9587
       p90: 24.0938
       p95: 26.5017
       p98: 28.0750
       p99: 28.6327
    
  • parallel 500
    Benchmarking summary: 
     Time taken for tests: 175.166 seconds
     Expected number of requests: 1000
     Number of concurrency: 500
     Total requests: 562
     Succeed requests: 536
     Failed requests: 26
     Average QPS: 3.060
     Average latency: 17.649
     Throughput(average output tokens per second): 875.565
     Average time to first token: 17.649
     Average input tokens per request: 40.192
     Average output tokens per request: 286.136
     Average time per output token: 0.00114
     Average package per request: 1.000
     Average package latency: 17.649
     Percentile of time to first token: 
       p50: 18.0830
       p66: 20.6764
       p75: 22.3065
       p80: 23.0424
       p90: 25.7348
       p95: 27.9985
       p98: 29.2158
       p99: 29.5974
     Percentile of request latency: 
       p50: 18.0830
       p66: 20.6764
       p75: 22.3065
       p80: 23.0424
       p90: 25.7348
       p95: 27.9985
       p98: 29.2158
       p99: 29.5974
    

Qwen2-7B-Instruct (long.jsonl)

指标 4 8 12 16 20 25 30 35 40
用时 1501.129 831.393 661.167 553.051 492.972 482.926 503.931 708.094 1708.086
QPS 0.066 0.120 0.151 0.181 0.203 0.197 0.169 0.102 0.036
延迟 58.530 63.844 75.483 81.761 93.514 95.232 82.990 67.363 55.090
吞吐量 150.200 268.802 340.991 411.723 450.299 437.290 369.642 224.586 79.384
p50 61.1869 67.7709 79.9282 85.2388 101.0449 100.2105 84.4174 67.4104 57.0599
p90 63.1877 70.4871 84.9531 89.7831 106.6341 113.0055 106.2575 81.7474 59.3524
失败 1 0 0 0 0 5 15 28 38
  • 平均每个请求的输入 token 数: 1600
  • 平均每个请求的输出 token 数: 2200

部署

python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 --port 8008 \
    --model /data/models/llm/qwen/Qwen2-7B-Instruct/ \
    --served-model-name qwen2-7b \
    --tensor-parallel-size 4 \
    --dtype=float16 \
    --max-model-len 16000

压测

evalscope-perf http://172.16.33.66:8008/v1/chat/completions qwen2-7b \
    ./datasets/open_qa.jsonl \
    --parallels 4 \
    --parallels 8 \
    --parallels 12 \
    --parallels 16 \
    --parallels 20 \
    --parallels 25 \
    --parallels 30 \
    --parallels 35 \
    --parallels 40 \
    --n 100
  • parallel 4
    Benchmarking summary: 
     Time taken for tests: 1501.129 seconds
     Expected number of requests: 100
     Number of concurrency: 4
     Total requests: 100
     Succeed requests: 99
     Failed requests: 1
     Average QPS: 0.066
     Average latency: 58.530
     Throughput(average output tokens per second): 150.200
     Average time to first token: 58.530
     Average input tokens per request: 1614.000
     Average output tokens per request: 2277.475
     Average time per output token: 0.00666
     Average package per request: 1.000
     Average package latency: 58.530
     Percentile of time to first token: 
       p50: 61.1869
       p66: 61.9394
       p75: 62.4250
       p80: 62.7114
       p90: 63.1877
       p95: 63.8551
       p98: 64.0589
       p99: 64.3522
     Percentile of request latency: 
       p50: 61.1869
       p66: 61.9394
       p75: 62.4250
       p80: 62.7114
       p90: 63.1877
       p95: 63.8551
       p98: 64.0589
       p99: 64.3522
    
  • parallel 8
    Benchmarking summary: 
     Time taken for tests: 831.393 seconds
     Expected number of requests: 100
     Number of concurrency: 8
     Total requests: 100
     Succeed requests: 100
     Failed requests: 0
     Average QPS: 0.120
     Average latency: 63.844
     Throughput(average output tokens per second): 268.802
     Average time to first token: 63.844
     Average input tokens per request: 1614.000
     Average output tokens per request: 2234.800
     Average time per output token: 0.00372
     Average package per request: 1.000
     Average package latency: 63.844
     Percentile of time to first token: 
       p50: 67.7709
       p66: 68.7757
       p75: 69.2135
       p80: 69.4998
       p90: 70.4871
       p95: 71.4362
       p98: 74.7053
       p99: 77.3827
     Percentile of request latency: 
       p50: 67.7709
       p66: 68.7757
       p75: 69.2135
       p80: 69.4998
       p90: 70.4871
       p95: 71.4362
       p98: 74.7053
       p99: 77.3827
    
  • parallel 12
    Benchmarking summary: 
     Time taken for tests: 661.167 seconds
     Expected number of requests: 100
     Number of concurrency: 12
     Total requests: 100
     Succeed requests: 100
     Failed requests: 0
     Average QPS: 0.151
     Average latency: 75.483
     Throughput(average output tokens per second): 340.991
     Average time to first token: 75.483
     Average input tokens per request: 1614.000
     Average output tokens per request: 2254.520
     Average time per output token: 0.00293
     Average package per request: 1.000
     Average package latency: 75.483
     Percentile of time to first token: 
       p50: 79.9282
       p66: 81.6170
       p75: 82.3043
       p80: 82.8302
       p90: 84.9531
       p95: 87.1365
       p98: 94.9953
       p99: 96.4239
     Percentile of request latency: 
       p50: 79.9282
       p66: 81.6170
       p75: 82.3043
       p80: 82.8302
       p90: 84.9531
       p95: 87.1365
       p98: 94.9953
       p99: 96.4239
    
  • parallel 16
    Benchmarking summary: 
     Time taken for tests: 553.051 seconds
     Expected number of requests: 100
     Number of concurrency: 16
     Total requests: 100
     Succeed requests: 100
     Failed requests: 0
     Average QPS: 0.181
     Average latency: 81.761
     Throughput(average output tokens per second): 411.723
     Average time to first token: 81.761
     Average input tokens per request: 1614.000
     Average output tokens per request: 2277.040
     Average time per output token: 0.00243
     Average package per request: 1.000
     Average package latency: 81.761
     Percentile of time to first token: 
       p50: 85.2388
       p66: 86.6946
       p75: 87.8569
       p80: 88.2254
       p90: 89.7831
       p95: 91.9183
       p98: 93.1188
       p99: 94.1187
     Percentile of request latency: 
       p50: 85.2388
       p66: 86.6946
       p75: 87.8569
       p80: 88.2254
       p90: 89.7831
       p95: 91.9183
       p98: 93.1188
       p99: 94.1187
    
  • parallel 20
    Benchmarking summary: 
     Time taken for tests: 492.972 seconds
     Expected number of requests: 100
     Number of concurrency: 20
     Total requests: 100
     Succeed requests: 100
     Failed requests: 0
     Average QPS: 0.203
     Average latency: 93.514
     Throughput(average output tokens per second): 450.299
     Average time to first token: 93.514
     Average input tokens per request: 1614.000
     Average output tokens per request: 2219.850
     Average time per output token: 0.00222
     Average package per request: 1.000
     Average package latency: 93.514
     Percentile of time to first token: 
       p50: 101.0449
       p66: 103.1628
       p75: 104.5066
       p80: 105.2498
       p90: 106.6341
       p95: 109.0586
       p98: 112.6580
       p99: 114.0485
     Percentile of request latency: 
       p50: 101.0449
       p66: 103.1628
       p75: 104.5066
       p80: 105.2498
       p90: 106.6341
       p95: 109.0586
       p98: 112.6580
       p99: 114.0485
    
  • parallel 25
    Benchmarking summary: 
     Time taken for tests: 482.926 seconds
     Expected number of requests: 100
     Number of concurrency: 25
     Total requests: 95
     Succeed requests: 95
     Failed requests: 0
     Average QPS: 0.197
     Average latency: 95.232
     Throughput(average output tokens per second): 437.290
     Average time to first token: 95.232
     Average input tokens per request: 1614.000
     Average output tokens per request: 2222.937
     Average time per output token: 0.00229
     Average package per request: 1.000
     Average package latency: 95.232
     Percentile of time to first token: 
       p50: 100.2105
       p66: 103.2044
       p75: 104.4999
       p80: 105.5968
       p90: 113.0055
       p95: 117.3441
       p98: 119.4187
       p99: 119.9363
     Percentile of request latency: 
       p50: 100.2105
       p66: 103.2044
       p75: 104.4999
       p80: 105.5968
       p90: 113.0055
       p95: 117.3441
       p98: 119.4187
       p99: 119.9363
    
  • parallel 30
    Benchmarking summary: 
     Time taken for tests: 503.931 seconds
     Expected number of requests: 100
     Number of concurrency: 30
     Total requests: 85
     Succeed requests: 85
     Failed requests: 0
     Average QPS: 0.169
     Average latency: 82.990
     Throughput(average output tokens per second): 369.642
     Average time to first token: 82.990
     Average input tokens per request: 1614.000
     Average output tokens per request: 2191.459
     Average time per output token: 0.00271
     Average package per request: 1.000
     Average package latency: 82.990
     Percentile of time to first token: 
       p50: 84.4174
       p66: 86.3143
       p75: 88.0736
       p80: 91.9183
       p90: 106.2575
       p95: 109.8099
       p98: 118.7324
       p99: 119.7411
     Percentile of request latency: 
       p50: 84.4174
       p66: 86.3143
       p75: 88.0736
       p80: 91.9183
       p90: 106.2575
       p95: 109.8099
       p98: 118.7324
       p99: 119.7411
    
  • parallel 35
    Benchmarking summary: 
     Time taken for tests: 708.094 seconds
     Expected number of requests: 100
     Number of concurrency: 35
     Total requests: 72
     Succeed requests: 72
     Failed requests: 0
     Average QPS: 0.102
     Average latency: 67.363
     Throughput(average output tokens per second): 224.586
     Average time to first token: 67.363
     Average input tokens per request: 1614.000
     Average output tokens per request: 2208.722
     Average time per output token: 0.00445
     Average package per request: 1.000
     Average package latency: 67.363
     Percentile of time to first token: 
       p50: 67.4104
       p66: 67.9889
       p75: 68.3605
       p80: 68.5264
       p90: 81.7474
       p95: 116.1246
       p98: 118.3541
       p99: 119.0251
     Percentile of request latency: 
       p50: 67.4104
       p66: 67.9889
       p75: 68.3605
       p80: 68.5264
       p90: 81.7474
       p95: 116.1246
       p98: 118.3541
       p99: 119.0251
    
  • parallel 40
    Benchmarking summary: 
     Time taken for tests: 1708.086 seconds
     Expected number of requests: 100
     Number of concurrency: 40
     Total requests: 62
     Succeed requests: 62
     Failed requests: 0
     Average QPS: 0.036
     Average latency: 55.090
     Throughput(average output tokens per second): 79.384
     Average time to first token: 55.090
     Average input tokens per request: 1614.000
     Average output tokens per request: 2187.000
     Average time per output token: 0.01260
     Average package per request: 1.000
     Average package latency: 55.090
     Percentile of time to first token: 
       p50: 57.0599
       p66: 57.4411
       p75: 58.2631
       p80: 58.5435
       p90: 59.3524
       p95: 62.2720
       p98: 96.8205
       p99: 99.6312
     Percentile of request latency: 
       p50: 57.0599
       p66: 57.4411
       p75: 58.2631
       p80: 58.5435
       p90: 59.3524
       p95: 62.2720
       p98: 96.8205
       p99: 99.6312
    

参考资料