目录

沐曦训练芯片 MXC500 介绍

曦云®C500是沐曦面向通用计算的旗舰产品,提供强大高精度及多精度混合算力,配备大规格高带宽显存,片间互联MetaXLink无缝链接多GPU系统,自主研发的MXMACA®软件栈可兼容主流GPU生态,能够全面满足数字经济建设和产业数字化的算力需求。

2023 年 6 月 14 日,沐曦官宣 AI 训练 GPU MXC500 完成芯片功能测试,MXMACA 2.0 计算平台基础测试完成,意味着公司首款 AI 训练芯片 MXC500成功点亮,该芯片采用 7nm 制程,GPGPU 架构,能够兼容 CUDA,目标对标英伟达 A100/A800 芯片。

沐曦主要有三大产品线:

  1. 用于 AI 推理的 MXN 系列;
  2. 用于 AI 训练及通用计算的 MXC 系列;
  3. 用于图形渲染的 MXG 系列。

研发实力强大,软件生态布局完善。沐曦的研发团队阵容豪华,三位创始人均在 AMD 拥有 20 年左右的 GPU 研发经验,其中两位为 AMD 科学家(Fellow)。沐曦采用了完全自主研发的 GPU IP,有效提高了产品的开发效率,同时拥有完全自主知识产权的指令集和架构,可以对每个独立的计算实例进行灵活配置,从而优化数据中心计算资源的效率。同时,沐曦配有兼容主流 GPU 生态 的 完 整 软 件 栈 ( MXMACA ) 平 台 , 支 持 AI 神 经 网 络 框 架 ( 如TensorFlow/PyTorch 等)、库(如 Blas/DNN 等)和 Linux Kernel 等技术,并持续优化平台来实现更高的性能和可扩展性。此外,沐曦还成立了“曦思应用生态联盟”,联合多家生态合作伙伴,包括芯驰技术、中恒讯通科、清华大学苏州汽车研究院等,推动 MXN 系列产品和解决方案的应用落地。

2023 年 3 月,公司的 MXN100 芯片与百度飞桨完成 I 级兼容性测试。

部署模型

登录 JumpServer

选择 MXC500 服务器,4卡64G。

下载模型

进入 /data/models 目录。

cd /data/models

Qwen2.5-72B-Instruct

git clone https://www.modelscope.cn/Qwen/Qwen2.5-72B-Instruct.git

DeepSeek-R1-Distill-Qwen-32B

git clone https://www.modelscope.cn/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B.git

运行容器(定制的 vLLM 镜像)

docker run -itd --restart=always \
    --device=/dev/dri \
    --device=/dev/mxcd \
    --group-add video \
    --network=host \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    --shm-size 256gb \
    --ulimit memlock=-1 \
    -v /data/models:/data \
    --hostname vllm \
    --name vllm \
    cr.metax-tech.com/public-ai-release/c500/vllm:maca2.27.0.9-py310-kylin2309a-arm64 \
    bash

进入容器

docker exec -it vllm bash

部署模型

  • Qwen2.5-72B-Instruct
vllm serve /data/Qwen2.5-72B-Instruct --served-model-name qwen2.5 --tensor-parallel-size 4
  • DeepSeek-R1-Distill-Qwen-32B
vllm serve /data/DeepSeek-R1-Distill-Qwen-32B --served-model-name qwen2.5 --tensor-parallel-size 4

查看 GPU 状态

运行 mx-smi 命令查看 GPU 状态。

mx-smi
mx-smi  version: 2.1.6

=================== MetaX System Management Interface Log ===================
Timestamp                                         : Fri Feb 14 16:26:46 2025

Attached GPUs                                     : 4
+---------------------------------------------------------------------------------+
| MX-SMI 2.1.6                        Kernel Mode Driver Version: 2.5.014         |
| MACA Version: 2.23.0.1018           BIOS Version: 1.13.5.0                      |
|------------------------------------+---------------------+----------------------+
| GPU         NAME                   | Bus-id              | GPU-Util             |
| Temp        Power                  | Memory-Usage        |                      |
|====================================+=====================+======================|
| 0           MXC500                 | 0000:05:00.0        | 0%                   |
| 51C         71W                    | 60982/65536 MiB     |                      |
+------------------------------------+---------------------+----------------------+
| 1           MXC500                 | 0000:08:00.0        | 0%                   |
| 51C         76W                    | 60534/65536 MiB     |                      |
+------------------------------------+---------------------+----------------------+
| 2           MXC500                 | 0000:0e:00.0        | 0%                   |
| 50C         74W                    | 60534/65536 MiB     |                      |
+------------------------------------+---------------------+----------------------+
| 3           MXC500                 | 0000:0f:00.0        | 0%                   |
| 50C         71W                    | 60534/65536 MiB     |                      |
+------------------------------------+---------------------+----------------------+

+---------------------------------------------------------------------------------+
| Process:                                                                        |
|  GPU                    PID         Process Name                 GPU Memory     |
|                                                                  Usage(MiB)     |
|=================================================================================|
|  0                    15652         python                       60032          |
|  1                    22854         ray::RayWorkerW              59584          |
|  2                    22945         ray::RayWorkerW              59584          |
|  3                    23034         ray::RayWorkerW              59584          |
+---------------------------------------------------------------------------------+

测试模型

聊天

curl 'http://localhost:8000/v1/chat/completions' \
    -H "Content-Type: application/json" \
    -d '{
        "model": "qwen2.5",
        "messages": [ 
            { "role": "system", "content": "你是位人工智能专家。" }, 
            { "role": "user", "content": "解释人工智能" } 
        ]
    }'

文本补全

curl 'http://localhost:8000/v1/completions' \
    -H "Content-Type: application/json" \
    -d '{
        "model": "qwen2.5",
        "prompt": "你是谁?"
    }'

压力测试

工具安装

pip install evalscope==0.5.5
pip install evalscope-perf

数据集下载

中文聊天 HC3-Chinese

mkdir datasets
wget https://modelscope.cn/datasets/AI-ModelScope/HC3-Chinese/resolve/master/open_qa.jsonl \
    -O datasets/open_qa.jsonl

代码问答 Codefuse-Evol-Instruct-Clean

wget https://modelscope.cn/datasets/Banksy235/Codefuse-Evol-Instruct-Clean/resolve/master/data.json \
    -O datasets/Codefuse-Evol-Instruct-Clean-data.jsonl

# 修改数据集格式,将 "input" 改为 "question",以适应 EvalScope 的数据集格式 openqa
sed -i 's/"input"/"question"/g' datasets/Codefuse-Evol-Instruct-Clean-data.jsonl

压力测试

evalscope-perf http://127.0.0.1:8000/v1/chat/completions qwen2.5 \
    ./datasets/open_qa.jsonl \
    --max-prompt-length 8000 \
    --read-timeout=120 \
    --parallels 128 \
    --n 1000
Benchmarking summary: 
 Time taken for tests: 630.612 seconds
 Expected number of requests: 1000
 Number of concurrency: 128
 Total requests: 988
 Succeed requests: 938
 Failed requests: 50
 Average QPS: 1.487
 Average latency: 65.745
 Throughput(average output tokens per second): 404.683
 Average time to first token: 65.745
 Average input tokens per request: 50.303
 Average output tokens per request: 272.066
 Average time per output token: 0.00247
 Average package per request: 1.000
 Average package latency: 65.745
 Percentile of time to first token: 
     p50: 68.1765
     p66: 78.5570
     p75: 84.1820
     p80: 88.1003
     p90: 97.8666
     p95: 107.1864
     p98: 114.8085
     p99: 116.9974
 Percentile of request latency: 
     p50: 68.1765
     p66: 78.5570
     p75: 84.1820
     p80: 88.1003
     p90: 97.8666
     p95: 107.1864
     p98: 114.8085
     p99: 116.9974

📌 Metrics: {'Average QPS': 1.487, 'Average latency': 65.745, 'Throughput': 404.683}

拷贝文件

cp performance_metrics.png /tmp/systemd-private-7a1518cf39464adb821c6af0a9b6902e-chronyd.service-zua6qJ/tmp/

实验结果

总的测试数量为 1000 。

Qwen2.5-72B-Instruct

压测命令:

evalscope-perf http://127.0.0.1:8000/v1/chat/completions qwen2.5 \
    ./datasets/open_qa.jsonl \
    --read-timeout=120 \
    --parallels 8 \
    --parallels 16 \
    --parallels 32 \
    --parallels 64 \
    --parallels 100 \
    --parallels 128 \
    --parallels 150 \
    --parallels 200 \
    --n 1000

指标 8 16 32 64 100 128 150 200
失败数 0 0 0 13 21 69 100 158

显存使用及利用率:

=================== MetaX System Management Interface Log ===================
Timestamp                                         : Sat Feb 15 16:17:48 2025

Attached GPUs                                     : 4
+---------------------------------------------------------------------------------+
| MX-SMI 2.1.6                        Kernel Mode Driver Version: 2.5.014         |
| MACA Version: 2.23.0.1018           BIOS Version: 1.13.5.0                      |
|------------------------------------+---------------------+----------------------+
| GPU         NAME                   | Bus-id              | GPU-Util             |
| Temp        Power                  | Memory-Usage        |                      |
|====================================+=====================+======================|
| 0           MXC500                 | 0000:05:00.0        | 36%                  |
| 54C         122W                   | 60278/65536 MiB     |                      |
+------------------------------------+---------------------+----------------------+
| 1           MXC500                 | 0000:08:00.0        | 36%                  |
| 54C         125W                   | 59830/65536 MiB     |                      |
+------------------------------------+---------------------+----------------------+
| 2           MXC500                 | 0000:0e:00.0        | 24%                  |
| 53C         122W                   | 59830/65536 MiB     |                      |
+------------------------------------+---------------------+----------------------+
| 3           MXC500                 | 0000:0f:00.0        | 35%                  |
| 54C         122W                   | 59830/65536 MiB     |                      |
+------------------------------------+---------------------+----------------------+

+---------------------------------------------------------------------------------+
| Process:                                                                        |
|  GPU                    PID         Process Name                 GPU Memory     |
|                                                                  Usage(MiB)     |
|=================================================================================|
|  0                   343683         python                       59328          |
|  1                   350950         ray::RayWorkerW              58880          |
|  2                   351042         ray::RayWorkerW              58880          |
|  3                   351132         ray::RayWorkerW              58880          |
+---------------------------------------------------------------------------------+

DeepSeek-R1-Distill-Qwen-32B

evalscope-perf http://127.0.0.1:8000/v1/chat/completions qwen2.5 \
    ./datasets/open_qa.jsonl \
    --read-timeout=120 \
    --parallels 8 \
    --parallels 16 \
    --parallels 32 \
    --parallels 64 \
    --parallels 100 \
    --parallels 128 \
    --parallels 150 \
    --parallels 200 \
    --parallels 256 \
    --parallels 300 \
    --n 1000

指标 8 16 32 64 100 128 150 200 256 300
失败数 12 10 17 38 109 129 211 197 161 266

显存使用及利用率:

=================== MetaX System Management Interface Log ===================
Timestamp                                         : Sat Feb 15 11:45:50 2025

Attached GPUs                                     : 4
+---------------------------------------------------------------------------------+
| MX-SMI 2.1.6                        Kernel Mode Driver Version: 2.5.014         |
| MACA Version: 2.23.0.1018           BIOS Version: 1.13.5.0                      |
|------------------------------------+---------------------+----------------------+
| GPU         NAME                   | Bus-id              | GPU-Util             |
| Temp        Power                  | Memory-Usage        |                      |
|====================================+=====================+======================|
| 0           MXC500                 | 0000:05:00.0        | 34%                  |
| 54C         112W                   | 61292/65536 MiB     |                      |
+------------------------------------+---------------------+----------------------+
| 1           MXC500                 | 0000:08:00.0        | 34%                  |
| 54C         118W                   | 60716/65536 MiB     |                      |
+------------------------------------+---------------------+----------------------+
| 2           MXC500                 | 0000:0e:00.0        | 33%                  |
| 53C         114W                   | 60716/65536 MiB     |                      |
+------------------------------------+---------------------+----------------------+
| 3           MXC500                 | 0000:0f:00.0        | 33%                  |
| 53C         111W                   | 60716/65536 MiB     |                      |
+------------------------------------+---------------------+----------------------+

+---------------------------------------------------------------------------------+
| Process:                                                                        |
|  GPU                    PID         Process Name                 GPU Memory     |
|                                                                  Usage(MiB)     |
|=================================================================================|
|  0                   279777         python                       60352          |
|  1                   287035         ray::RayWorkerW              59776          |
|  2                   287125         ray::RayWorkerW              59776          |
|  3                   287215         ray::RayWorkerW              59776          |
+---------------------------------------------------------------------------------+

NUMA 配置(加速推理性能)

安装 numactl

yum install numactl

查看 NUMA 节点

numactl --hardware
numactl --hardware
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 0 size: 64992 MB
node 0 free: 4950 MB
node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 1 size: 65461 MB
node 1 free: 12524 MB
node 2 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 2 size: 65461 MB
node 2 free: 9299 MB
node 3 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
node 3 size: 65461 MB
node 3 free: 11394 MB
node 4 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79
node 4 size: 65461 MB
node 4 free: 7189 MB
node 5 cpus: 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
node 5 size: 65461 MB
node 5 free: 6740 MB
node 6 cpus: 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111
node 6 size: 65461 MB
node 6 free: 7317 MB
node 7 cpus: 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
node 7 size: 64436 MB
node 7 free: 11067 MB
node distances:
node   0   1   2   3   4   5   6   7 
  0:  10  22  22  22  28  22  28  28 
  1:  22  10  22  22  22  28  28  28 
  2:  22  22  10  22  28  28  28  22 
  3:  22  22  22  10  28  28  22  28 
  4:  28  22  28  28  10  22  22  22 
  5:  22  28  28  28  22  10  22  22 
  6:  28  28  28  22  22  22  10  22 
  7:  28  28  22  28  22  22  22  10

查看 GPU 与 CPU 的拓扑关系

mx-smi topo -m
Attached GPUs                                     : 4
Device link type matrix
        GPU0    GPU1    GPU2    GPU3    Node Affinity  CPU Affinity
GPU0    X       MX      MX      MX      0              0-15
GPU1    MX      X       MX      MX      0              0-15
GPU2    MX      MX      X       MX      0              0-15
GPU3    MX      MX      MX      X       0              0-15

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  MX   = Connection traversing MetaXLink
  NA   = Connection type is unknown

NUMA 绑定

从上面的拓扑关系可以看出,GPU 与 CPU 的拓扑关系是 MX,表示连接通过 MetaXLink,这种连接方式是最快的。所有GPU都连接到 NUMA 节点 0 上,所以可以将进程绑定到 NUMA 节点 0 上,这样进程只能使用 NUMA 节点 0 上的 CPU 和内存,避免了 NUMA 交叉访问,提高了性能。

numactl --cpunodebind=0 --membind=0 vllm serve /data/Qwen2.5-72B-Instruct --served-model-name qwen2.5 --tensor-parallel-size 4

可以简写为:

numactl -N0 -m0 vllm serve /data/Qwen2.5-72B-Instruct --served-model-name qwen2.5 --tensor-parallel-size 4

测试

部署模型

numactl -N0 -m0 vllm serve /data/Qwen2.5-72B-Instruct \
    --served-model-name qwen2.5 \
    --tensor-parallel-size 4

压力测试

evalscope-perf http://127.0.0.1:8000/v1/chat/completions qwen2.5 \
     ./datasets/open_qa.jsonl \
     --max-prompt-length 8000 \
     --read-timeout=120 \
     --parallels 100 \
     --n 500

没有配置 NUMA

Benchmarking summary: 
 Time taken for tests: 555.873 seconds
 Expected number of requests: 500
 Number of concurrency: 100
 Total requests: 474
 Succeed requests: 426
 Failed requests: 48
 Average QPS: 0.766
 Average latency: 73.051
 Throughput(average output tokens per second): 538.497
 Average time to first token: 73.051
 Average input tokens per request: 24.338
 Average output tokens per request: 678.622
 Average time per output token: 0.00186
 Average package per request: 1.000
 Average package latency: 73.051

📌 Metrics: {'Average QPS': 0.766, 'Average latency': 73.051, 'Throughput': 538.497}

numactl -N0 -m0

Benchmarking summary: 
 Time taken for tests: 510.021 seconds
 Expected number of requests: 500
 Number of concurrency: 100
 Total requests: 480
 Succeed requests: 443
 Failed requests: 37
 Average QPS: 0.869
 Average latency: 71.212
 Throughput(average output tokens per second): 609.616
 Average time to first token: 71.212
 Average input tokens per request: 24.413
 Average output tokens per request: 688.063
 Average time per output token: 0.00164
 Average package per request: 1.000
 Average package latency: 71.212

📌 Metrics: {'Average QPS': 0.869, 'Average latency': 71.212, 'Throughput': 609.616}

运行 htop 查看 CPU 使用情况,可以看到进程只使用了 NUMA 节点 0 上的 CPU。

numactl –cpunodebind=0,1,2,3 –membind=0,1,2,3

Benchmarking summary: 
 Time taken for tests: 647.449 seconds
 Expected number of requests: 500
 Number of concurrency: 100
 Total requests: 460
 Succeed requests: 423
 Failed requests: 37
 Average QPS: 0.653
 Average latency: 71.199
 Throughput(average output tokens per second): 443.555
 Average time to first token: 71.199
 Average input tokens per request: 24.333
 Average output tokens per request: 678.910
 Average time per output token: 0.00225
 Average package per request: 1.000
 Average package latency: 71.199

📌 Metrics: {'Average QPS': 0.653, 'Average latency': 71.199, 'Throughput': 443.555}

numactl –cpunodebind=0,2,4,6 –membind=0,2,4,6

Benchmarking summary: 
 Time taken for tests: 641.340 seconds
 Expected number of requests: 500
 Number of concurrency: 100
 Total requests: 461
 Succeed requests: 411
 Failed requests: 50
 Average QPS: 0.641
 Average latency: 73.287
 Throughput(average output tokens per second): 415.901
 Average time to first token: 73.287
 Average input tokens per request: 24.421
 Average output tokens per request: 648.988
 Average time per output token: 0.00240
 Average package per request: 1.000
 Average package latency: 73.287

📌 Metrics: {'Average QPS': 0.641, 'Average latency': 73.287, 'Throughput': 415.901}

结果对比

NUMA 绑定 Average QPS Average latency Throughput
0.766 73.051 538.497
numactl -N0 -m0 0.869 71.212 609.616
numactl –cpunodebind=0,1,2,3 –membind=0,1,2,3 0.653 71.199 443.555
numactl –cpunodebind=0,2,4,6 –membind=0,2,4,6 0.641 73.287 415.901

结论:NUMA 绑定可以提高推理性能,但是绑定的节点需要根据实际情况(GPU和CPU的拓扑关系)进行调整,否则可能会降低性能。

还验证了数据集存放在固态硬盘和普通硬盘的区别,发现没有明显差异。

实验结果(NUMA 绑定)

使用上面部署的 Qwen2.5-72B-Instruct 模型,总的测试数量为 1000 。

evalscope-perf

evalscope-perf http://127.0.0.1:8000/v1/chat/completions qwen2.5 \
    ./datasets/open_qa.jsonl \
    --read-timeout=120 \
    --parallels 8 \
    --parallels 16 \
    --parallels 32 \
    --parallels 64 \
    --parallels 100 \
    --parallels 128 \
    --parallels 150 \
    --parallels 200 \
    --n 1000

指标 8 16 32 64 100 128 150 200
QPS 0.37 0.61 0.89 1.08 1.37 1.44 1.48 1.47
延迟 20.94 25.73 35.16 55.09 61.88 67.18 70.76 73.50
吞吐量 109.62 179.37 260.32 309.04 385.08 383.47 393.34 364.43
失败数 0 0 0 14 35 78 89 164

没有看到明显的性能提升,但并发数 150 时性能达到峰值。

vllm benchmark

下载数据集 ShareGPT

wget https://modelscope.cn/datasets/gliang1001/ShareGPT_V3_unfiltered_cleaned_split/resolve/master/ShareGPT_V3_unfiltered_cleaned_split.json \
    -O datasets/ShareGPT_V3_unfiltered_cleaned_split.json

克隆 vllm 项目

git clone https://github.com/vllm-project/vllm
cd vllm

运行 benchmark

python3 ./benchmarks/benchmark_serving.py --backend vllm \
    --model qwen2.5 \
    --tokenizer /data/models/Qwen2.5-72B-Instruct \
    --dataset-name "sharegpt" \
    --dataset-path "/data/datasets/ShareGPT_V3_unfiltered_cleaned_split.json" \
    --base-url http://0.0.0.0:8000 --trust-remote-code
Namespace(backend='vllm', base_url='http://0.0.0.0:8000', host='127.0.0.1', port=8000, endpoint='/v1/completions', dataset=None, dataset_name='sharegpt', dataset_path='/data/datasets/ShareGPT_V3_unfiltered_cleaned_split.json', max_concurrency=None, model='qwen2.5', tokenizer='/data/models/Qwen2.5-72B-Instruct', best_of=1, use_beam_search=False, num_prompts=1000, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=True, disable_tqdm=False, profile=False, save_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1024, random_output_len=128, random_range_ratio=1.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, tokenizer_mode='auto', served_model_name=None, lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None

============ Serving Benchmark Result ============
Successful requests:                     1000      
Benchmark duration (s):                  384.69    
Total input tokens:                      217393    
Total generated tokens:                  197245    
Request throughput (req/s):              2.60      
Output token throughput (tok/s):         512.73    
Total Token throughput (tok/s):          1077.84   
---------------Time to First Token----------------
Mean TTFT (ms):                          129740.24 
Median TTFT (ms):                        118243.29 
P99 TTFT (ms):                           319822.78 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          487.16    
Median TPOT (ms):                        501.16    
P99 TPOT (ms):                           1076.45   
---------------Inter-token Latency----------------
Mean ITL (ms):                           428.46    
Median ITL (ms):                         565.66    
P99 ITL (ms):                            1021.33   
==================================================

运行 benchmark(并发 100)

python3 ./benchmarks/benchmark_serving.py --backend vllm \
    --model qwen2.5 \
    --tokenizer /data/models/Qwen2.5-72B-Instruct \
    --dataset-name "sharegpt" \
    --dataset-path "/data/datasets/ShareGPT_V3_unfiltered_cleaned_split.json" \
    --base-url http://0.0.0.0:8000 --trust-remote-code \
    --max-concurrency 100
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 100

============ Serving Benchmark Result ============
Successful requests:                     1000      
Benchmark duration (s):                  552.55    
Total input tokens:                      217393    
Total generated tokens:                  198142    
Request throughput (req/s):              1.81      
Output token throughput (tok/s):         358.60    
Total Token throughput (tok/s):          752.04    
---------------Time to First Token----------------
Mean TTFT (ms):                          1109.58   
Median TTFT (ms):                        566.98    
P99 TTFT (ms):                           5961.61   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          272.72    
Median TPOT (ms):                        271.69    
P99 TPOT (ms):                           419.77    
---------------Inter-token Latency----------------
Mean ITL (ms):                           260.09    
Median ITL (ms):                         139.25    
P99 ITL (ms):                            623.20    
==================================================

效果不好。❌