wrk

Installation

git clone https://github.com/wg/wrk.git
cd wrk
# Speed up the build by running one make job per processor core
make -j $(nproc)
cp wrk /usr/local/bin/

Tabby Server

Server: 4 × NVIDIA T4 (16 GB each)

Deployment

  • Model: TabbyML/DeepseekCoder-6.7B
    docker run -d --gpus all -p 8080:8080 \
    -v /data/zhw/tabby/data:/data \
    tabbyml/tabby:latest \
    serve --model TabbyML/DeepseekCoder-6.7B \
    --device cuda --parallelism 4
    
  • Model: TabbyML/DeepseekCoder-1.3B
    docker run -d --gpus all -p 8080:8080 \
    -v /data/zhw/tabby/data:/data \
    tabbyml/tabby:latest \
    serve --model TabbyML/DeepseekCoder-1.3B \
    --device cuda --parallelism 12
    

curl test

curl http://127.0.0.1:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
  "language": "python",
  "segments": {
    "prefix": "#实现一个快速排序\n  def "
  }
}' | jq
{
  "id": "cmpl-6ef400f4-86da-43cc-b27a-eeae9394c316",
  "choices": [
    {
      "index": 0,
      "text": "quick_sort(arr):\n    if len(arr) <= 1:\n      return arr\n    pivot = arr[0]\n    left = [x for x in arr[1:] if x <= pivot]\n    right = [x for x in arr[1:] if x > pivot]\n    return quick_sort(left) + [pivot] + quick_sort(right)"
    }
  ]
}
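
The same request can be built programmatically, which avoids hand-escaping the prompt. The following is a minimal sketch using only the Python standard library; the helper names `build_completion_request` and `complete` are hypothetical, and the endpoint and payload shape are copied from the curl call above. It assumes the response has the `choices[0].text` layout shown in the sample output.

```python
import json
import urllib.request

# Same endpoint as the curl call above.
TABBY_URL = "http://127.0.0.1:8080/v1/completions"


def build_completion_request(prefix, language="python"):
    """Build the POST request; json.dumps escapes newlines in the prefix."""
    payload = {"language": language, "segments": {"prefix": prefix}}
    return urllib.request.Request(
        TABBY_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prefix, language="python"):
    """Send the request and return the first completion text.

    Assumes a Tabby server is listening on TABBY_URL.
    """
    request = build_completion_request(prefix, language)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)["choices"][0]["text"]
```

Calling `complete("#实现一个快速排序\n  def ")` against a running server should return a completion similar to the one shown above, although the exact text may differ between runs and models.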

Token count

Input: 12 tokens (the comment in the prompt means "implement a quick sort")

#实现一个快速排序
  def 

Output: 74 tokens

quick_sort(arr):
    if len(arr) <= 1:
      return arr
    pivot = arr[0]
    left = [x for x in arr[1:] if x <= pivot]
    right = [x for x in arr[1:] if x > pivot]
    return quick_sort(left) + [pivot] + quick_sort(right)

Preparation

Create the test script post_json.lua:

wrk.method = "POST"
wrk.body   = "{\"language\": \"python\", \"segments\": {\"prefix\": \"#Implement a quick sort\\n  def \"}}"
wrk.headers["Content-Type"] = "application/json"

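Write \\n, not \n, in the Lua string. With \n, Lua puts a literal line feed inside the JSON string, and the server rejects the request with: Failed to parse the request body as JSON: segments.prefix: control character (\u0000-\u001F) found while parsing a string at line 2 column 0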
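
The escaping rule can be reproduced outside the server. The sketch below uses Python's `json` module as a stand-in for the server's JSON parser (an assumption: the two are different implementations, but both reject raw control characters inside a JSON string). The variable names and the `is_valid_json` helper are hypothetical.

```python
import json

# Body sent when the Lua source contains \\n: the JSON text holds a
# backslash followed by "n", which is a valid JSON escape sequence.
escaped_body = '{"segments": {"prefix": "#Implement a quick sort\\n  def "}}'

# Body sent when the Lua source contains \n: the JSON text holds a raw
# line feed (U+000A) inside the string, which JSON does not allow.
raw_body = '{"segments": {"prefix": "#Implement a quick sort\n  def "}}'


def is_valid_json(body):
    """Return True if the body parses as JSON, False otherwise."""
    try:
        json.loads(body)
        return True
    except json.JSONDecodeError:
        return False
```

Here `is_valid_json(escaped_body)` is true and `is_valid_json(raw_body)` is false, which mirrors the server's behavior for the two spellings.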

Monitor port 8080

sudo tcpdump -i any -A 'tcp port 8080 and (((ip[2:2] - ((ip[0]&0xf)<<2)) - ((tcp[12]&0xf0)>>2)) != 0)' -w -

Benchmark

  • Benchmark tool: wrk
  • Duration: 1 minute
  • Timeout: 10 seconds
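
Because every run uses the same settings and only the concurrency changes, the sweep is easy to script. The following is a rough sketch, assuming `wrk` is on the PATH and `post_json.lua` is in the working directory; `build_wrk_cmd` and `run_sweep` are hypothetical helper names. It sets the thread count equal to the connection count, as the runs in this document do.

```python
import subprocess

URL = "http://127.0.0.1:8080/v1/completions"


def build_wrk_cmd(connections, duration="1m", timeout="10s", script="post_json.lua"):
    """Build one wrk command line, with one thread per connection."""
    return [
        "wrk",
        f"-c{connections}",
        f"-t{connections}",
        f"-d{duration}",
        "--timeout", timeout,
        "--latency",
        "-s", script,
        URL,
    ]


def run_sweep(levels):
    """Run wrk once per concurrency level and collect the raw reports."""
    reports = {}
    for connections in levels:
        result = subprocess.run(
            build_wrk_cmd(connections),
            capture_output=True,
            text=True,
            check=True,
        )
        reports[connections] = result.stdout
    return reports
```

For example, `run_sweep([1, 2, 3, 4, 5, 6, 8])` would take about seven minutes and return one report per concurrency level.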

Summary

TabbyML/DeepseekCoder-6.7B (parallelism 4)

   Connections  Threads  Avg latency  Max latency  Completed requests  Timed-out requests  Avg requests/sec
   1            1        3.75s        3.81s        16                  0                   0.27
   2            2        5.00s        5.11s        24                  0                   0.40
   3            3        5.80s        5.99s        30                  0                   0.50
👍 4            4        5.43s        5.62s        43                  0                   0.72
   5            5        6.12s        6.30s        40                  9                   0.67
   6            6        7.42s        9.14s        41                  9                   0.68
   8            8        7.29s        9.85s        40                  34                  0.67

TabbyML/DeepseekCoder-1.3B (parallelism 12)

   Connections  Threads  Avg latency  Max latency  Completed requests  Timed-out requests  Avg requests/sec
   1            1        1.03s        1.11s        57                  0                   0.95
   4            4        1.46s        1.79s        161                 0                   2.68
   8            8        1.95s        2.18s        241                 0                   4.02
👍 12           12       2.79s        3.06s        251                 0                   4.18
   16           16       4.19s        6.40s        221                 0                   3.68
   20           20       6.43s        8.16s        177                 0                   2.95
   24           24       9.01s        10.00s       143                 42                  2.38
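
As a sanity check on the summary figures, throughput should equal completed requests divided by the 60-second duration. For runs without timeouts it should also be close to connections divided by average latency (Little's law, assuming each connection sends its next request as soon as the previous one completes). The sketch below applies both formulas to the timeout-free 6.7B rows; the function names are hypothetical.

```python
DURATION_S = 60  # every run lasted 1 minute

# (connections, avg latency in s, completed requests, reported requests/sec)
# Timeout-free rows of the DeepseekCoder-6.7B summary table.
ROWS_6_7B = [
    (1, 3.75, 16, 0.27),
    (2, 5.00, 24, 0.40),
    (3, 5.80, 30, 0.50),
    (4, 5.43, 43, 0.72),
]


def throughput(completed, duration_s=DURATION_S):
    """Measured requests per second over the whole run."""
    return completed / duration_s


def littles_law_estimate(connections, avg_latency_s):
    """Expected requests per second if all connections stay busy."""
    return connections / avg_latency_s


for connections, latency, completed, reported in ROWS_6_7B:
    print(
        f"c={connections}: reported {reported:.2f}, "
        f"completed/60 = {throughput(completed):.2f}, "
        f"c/latency = {littles_law_estimate(connections, latency):.2f}"
    )
```

The estimate stops being meaningful once requests time out, because timed-out requests occupy a connection without counting as completed.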

Test data

TabbyML/DeepseekCoder-6.7B

  • 1 concurrent connection, 1 thread, 1-minute duration, 10-second timeout
    wrk -c1 -t1 -d1m --timeout 10s --latency -s post_json.lua http://127.0.0.1:8080/v1/completions
    
    Running 1m test @ http://127.0.0.1:8080/v1/completions
    1 threads and 1 connections
    Thread Stats   Avg      Stdev     Max   +/- Stdev
      Latency     3.75s    29.55ms   3.81s    68.75%
      Req/Sec     0.00      0.00     0.00    100.00%
    Latency Distribution
       50%    3.75s 
       75%    3.78s 
       90%    3.79s 
       99%    3.81s 
    16 requests in 1.00m, 9.05KB read
    Requests/sec:      0.27
    Transfer/sec:     154.21B
    
  • 2 concurrent connections, 2 threads, 1-minute duration, 10-second timeout
    wrk -c2 -t2 -d1m --timeout 10s --latency -s post_json.lua http://127.0.0.1:8080/v1/completions
    
    Running 1m test @ http://127.0.0.1:8080/v1/completions
    2 threads and 2 connections
    Thread Stats   Avg      Stdev     Max   +/- Stdev
      Latency     5.00s    43.20ms   5.11s    87.50%
      Req/Sec     0.00      0.00     0.00    100.00%
    Latency Distribution
       50%    5.00s 
       75%    5.01s 
       90%    5.01s 
       99%    5.11s 
    24 requests in 1.00m, 13.57KB read
    Requests/sec:      0.40
    Transfer/sec:     231.57B
    
  • 3 concurrent connections, 3 threads, 1-minute duration, 10-second timeout
    wrk -c3 -t3 -d1m --timeout 10s --latency -s post_json.lua http://127.0.0.1:8080/v1/completions
    
    Running 1m test @ http://127.0.0.1:8080/v1/completions
    3 threads and 3 connections
    Thread Stats   Avg      Stdev     Max   +/- Stdev
      Latency     5.80s    91.00ms   5.99s    73.33%
      Req/Sec     0.00      0.00     0.00    100.00%
    Latency Distribution
       50%    5.81s 
       75%    5.84s 
       90%    5.88s 
       99%    5.99s 
    30 requests in 1.00m, 16.96KB read
    Requests/sec:      0.50
    Transfer/sec:     289.48B
    
  • 4 concurrent connections, 4 threads, 1-minute duration, 10-second timeout
    wrk -c4 -t4 -d1m --timeout 10s --latency -s post_json.lua http://127.0.0.1:8080/v1/completions
    
    Running 1m test @ http://127.0.0.1:8080/v1/completions
    4 threads and 4 connections
    Thread Stats   Avg      Stdev     Max   +/- Stdev
      Latency     5.43s   143.39ms   5.62s    72.09%
      Req/Sec     0.00      0.00     0.00    100.00%
    Latency Distribution
       50%    5.46s 
       75%    5.52s 
       90%    5.60s 
       99%    5.62s 
    43 requests in 1.00m, 24.31KB read
    Requests/sec:      0.72
    Transfer/sec:     414.28B
    
  • 5 concurrent connections, 5 threads, 1-minute duration, 10-second timeout
    wrk -c5 -t5 -d1m --timeout 10s --latency -s post_json.lua http://127.0.0.1:8080/v1/completions
    
    Running 1m test @ http://127.0.0.1:8080/v1/completions
    5 threads and 5 connections
    Thread Stats   Avg      Stdev     Max   +/- Stdev
      Latency     6.12s    98.11ms   6.30s    64.52%
      Req/Sec     0.00      0.00     0.00    100.00%
    Latency Distribution
       50%    6.13s 
       75%    6.17s 
       90%    6.25s 
       99%    6.30s 
    40 requests in 1.00m, 22.62KB read
    Socket errors: connect 0, read 0, write 0, timeout 9
    Requests/sec:      0.67
    Transfer/sec:     385.36B
    
  • 6 concurrent connections, 6 threads, 1-minute duration, 10-second timeout
    wrk -c6 -t6 -d1m --timeout 10s --latency -s post_json.lua http://127.0.0.1:8080/v1/completions
    
    Running 1m test @ http://127.0.0.1:8080/v1/completions
    6 threads and 6 connections
    Thread Stats   Avg      Stdev     Max   +/- Stdev
      Latency     7.42s     1.62s    9.14s    71.88%
      Req/Sec     0.00      0.00     0.00    100.00%
    Latency Distribution
       50%    7.83s 
       75%    9.01s 
       90%    9.12s 
       99%    9.14s 
    41 requests in 1.00m, 22.94KB read
    Socket errors: connect 0, read 0, write 0, timeout 9
    Requests/sec:      0.68
    Transfer/sec:     390.83B
    
  • 8 concurrent connections, 8 threads, 1-minute duration, 10-second timeout
    wrk -c8 -t8 -d1m --timeout 10s --latency -s post_json.lua http://127.0.0.1:8080/v1/completions
    
    Running 1m test @ http://127.0.0.1:8080/v1/completions
    8 threads and 8 connections
    Thread Stats   Avg      Stdev     Max   +/- Stdev
      Latency     7.29s     1.89s    9.85s    66.67%
      Req/Sec     0.00      0.00     0.00    100.00%
    Latency Distribution
       50%    6.29s 
       75%    9.58s 
       90%    9.85s 
       99%    9.85s 
    40 requests in 1.00m, 22.47KB read
    Socket errors: connect 0, read 0, write 0, timeout 34
    Requests/sec:      0.67
    Transfer/sec:     382.82B
    

TabbyML/DeepseekCoder-1.3B

  • 1 concurrent connection, 1 thread, 1-minute duration, 10-second timeout
    wrk -c1 -t1 -d1m --timeout 10s --latency -s post_json.lua http://127.0.0.1:8080/v1/completions
    
    Running 1m test @ http://127.0.0.1:8080/v1/completions
    1 threads and 1 connections
    Thread Stats   Avg      Stdev     Max   +/- Stdev
      Latency     1.03s    12.40ms   1.11s    89.47%
      Req/Sec     0.04      0.19     1.00     96.49%
    Latency Distribution
       50%    1.03s 
       75%    1.04s 
       90%    1.04s 
       99%    1.11s 
    57 requests in 1.00m, 32.40KB read
    Requests/sec:      0.95
    Transfer/sec:     552.85B
    
  • 4 concurrent connections, 4 threads, 1-minute duration, 10-second timeout
    wrk -c4 -t4 -d1m --timeout 10s --latency -s post_json.lua http://127.0.0.1:8080/v1/completions
    
    Running 1m test @ http://127.0.0.1:8080/v1/completions
    4 threads and 4 connections
    Thread Stats   Avg      Stdev     Max   +/- Stdev
      Latency     1.46s    93.34ms   1.79s    85.09%
      Req/Sec     0.00      0.00     0.00    100.00%
    Latency Distribution
       50%    1.40s 
       75%    1.55s 
       90%    1.56s 
       99%    1.79s 
    161 requests in 1.00m, 92.53KB read
    Requests/sec:      2.68
    Transfer/sec:      1.54KB
    
  • 8 concurrent connections, 8 threads, 1-minute duration, 10-second timeout
    wrk -c8 -t8 -d1m --timeout 10s --latency -s post_json.lua http://127.0.0.1:8080/v1/completions
    
    Running 1m test @ http://127.0.0.1:8080/v1/completions
    8 threads and 8 connections
    Thread Stats   Avg      Stdev     Max   +/- Stdev
      Latency     1.95s   115.96ms   2.18s    59.34%
      Req/Sec     0.00      0.00     0.00    100.00%
    Latency Distribution
       50%    1.98s 
       75%    2.05s 
       90%    2.09s 
       99%    2.16s 
    241 requests in 1.00m, 138.75KB read
    Requests/sec:      4.02
    Transfer/sec:      2.31KB
    
  • 12 concurrent connections, 12 threads, 1-minute duration, 10-second timeout
    wrk -c12 -t12 -d1m --timeout 10s --latency -s post_json.lua http://127.0.0.1:8080/v1/completions
    
    Running 1m test @ http://127.0.0.1:8080/v1/completions
    12 threads and 12 connections
    Thread Stats   Avg      Stdev     Max   +/- Stdev
      Latency     2.79s   173.15ms   3.06s    58.57%
      Req/Sec     0.00      0.00     0.00    100.00%
    Latency Distribution
       50%    2.78s 
       75%    2.95s 
       90%    3.01s 
       99%    3.05s 
    251 requests in 1.00m, 144.62KB read
    Requests/sec:      4.18
    Transfer/sec:      2.41KB
    
  • 16 concurrent connections, 16 threads, 1-minute duration, 10-second timeout
    wrk -c16 -t16 -d1m --timeout 10s --latency -s post_json.lua http://127.0.0.1:8080/v1/completions
    
    Running 1m test @ http://127.0.0.1:8080/v1/completions
    16 threads and 16 connections
    Thread Stats   Avg      Stdev     Max   +/- Stdev
      Latency     4.19s   874.57ms   6.40s    61.99%
      Req/Sec     0.00      0.00     0.00    100.00%
    Latency Distribution
       50%    3.82s 
       75%    5.06s 
       90%    5.49s 
       99%    6.04s 
    221 requests in 1.00m, 127.01KB read
    Requests/sec:      3.68
    Transfer/sec:      2.11KB
    
  • 20 concurrent connections, 20 threads, 1-minute duration, 10-second timeout
    wrk -c20 -t20 -d1m --timeout 10s --latency -s post_json.lua http://127.0.0.1:8080/v1/completions
    
    Running 1m test @ http://127.0.0.1:8080/v1/completions
    20 threads and 20 connections
    Thread Stats   Avg      Stdev     Max   +/- Stdev
      Latency     6.43s     1.25s    8.16s    61.58%
      Req/Sec     0.00      0.00     0.00    100.00%
    Latency Distribution
       50%    6.98s 
       75%    7.38s 
       90%    7.72s 
       99%    8.09s 
    177 requests in 1.00m, 102.04KB read
    Requests/sec:      2.95
    Transfer/sec:      1.70KB
    
  • 24 concurrent connections, 24 threads, 1-minute duration, 10-second timeout
    wrk -c24 -t24 -d1m --timeout 10s --latency -s post_json.lua http://127.0.0.1:8080/v1/completions
    
    Running 1m test @ http://127.0.0.1:8080/v1/completions
    24 threads and 24 connections
    Thread Stats   Avg      Stdev     Max   +/- Stdev
      Latency     9.01s     1.39s   10.00s    88.12%
      Req/Sec     0.00      0.00     0.00    100.00%
    Latency Distribution
       50%    9.52s 
       75%    9.74s 
       90%    9.82s 
       99%   10.00s 
    143 requests in 1.00m, 82.42KB read
    Socket errors: connect 0, read 0, write 0, timeout 42
    Requests/sec:      2.38
    Transfer/sec:      1.37KB
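
The summary-table columns can be extracted from these raw reports with a few regular expressions. The following is a minimal sketch, assuming the report layout shown above; `parse_wrk_output` is a hypothetical helper, not part of wrk. Latencies are kept as strings with their units, as in the summary tables.

```python
import re


def parse_wrk_output(text):
    """Extract the summary-table fields from one wrk report."""
    threads, connections = re.search(
        r"(\d+) threads and (\d+) connections", text
    ).groups()
    # Thread Stats row: Latency <Avg> <Stdev> <Max>. The "Latency
    # Distribution" heading has no columns on its line, so it cannot match.
    avg_latency, max_latency = re.search(
        r"Latency[ \t]+(\S+)[ \t]+\S+[ \t]+(\S+)", text
    ).groups()
    completed = re.search(r"(\d+) requests in", text).group(1)
    # wrk prints the "Socket errors" line only when errors occurred.
    errors = re.search(r"Socket errors:.*timeout (\d+)", text)
    requests_per_sec = re.search(r"Requests/sec:\s+([\d.]+)", text).group(1)
    return {
        "connections": int(connections),
        "threads": int(threads),
        "avg_latency": avg_latency,
        "max_latency": max_latency,
        "completed": int(completed),
        "timed_out": int(errors.group(1)) if errors else 0,
        "requests_per_sec": float(requests_per_sec),
    }
```

Anchoring the timeout pattern on "Socket errors:" keeps it from matching the `--timeout 10s` flag if the command line is included in the parsed text.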