7 篇文章带有标签 “benchmarks”

Kimi K2.5:首个开源多模态智能体集群

感觉 Kimi K2.5 在国内被低估了,让子弹飞一会儿 🚀🚀🚀

基准测试(Benchmarks)

Agent Swarm 基准测试

为了严格评估智能体集群(Agent Swarm)框架的有效性,选择了三个具有代表性的基准测试,它们共同涵盖了深度推理大规模检索以及真实世界的复杂性

  • BrowseComp:一项具有挑战性的深度研究基准,需要多步推理和复杂的信息综合。
  • WideSearch:旨在评估在不同来源中进行广泛、多步信息寻求和推理能力的基准。
  • In-house Swarm Bench:一项内部开发的集群基准,旨在评估智能体集群在真实世界、高复杂度条件下的性能。 它涵盖了四个领域:
    • WildSearch(开放网络上不受约束的真实世界信息检索);
    • Batch Download(大规模获取多样化资源);
    • WideRead(涉及 100 多个输入文档的大规模文档理解);
    • Long-Form Writing(连贯生成超过 10 万字的海量内容)。 该基准整合了极端规模的场景,旨在压力测试基于智能体系统的编排(Orchestration)、可扩展性(Scalability)和协作能力

主要基准测试

Kimi K2.5 评估涵盖了多个领域的基准测试,下面是按能力维度分类的各基准测试说明:

推理与通用能力 (Reasoning & General) Humanity’s Last Exam

Qwen2.5-Coder Technical Report

Abstract(摘要)

In this report, we introduce the Qwen2.5-Coder series, a significant upgrade from its predecessor, CodeQwen1.5. This series includes two models: Qwen2.5-Coder-1.5B and Qwen2.5-Coder-7B. As a code-specific model, Qwen2.5-Coder is built upon the Qwen2.5 architecture and continues pretrained on a vast corpus of over 5.5 trillion tokens. Through meticulous data cleaning, scalable synthetic data generation, and balanced data mixing, Qwen2.

开源 OCR 引擎基准测试

OCR 引擎

EasyOCR

EasyOCR 支持 80+ 语言。

Abaza = 'abq'
Adyghe = 'ady'
Afrikaans = 'af'
Angika = 'ang'
Arabic = 'ar'
Assamese = 'as'
Avar = 'ava'
Azerbaijani = 'az'
Belarusian = 'be'
Bulgarian = 'bg'
Bihari = 'bh'
Bhojpuri = 'bho'
Bengali = 'bn'
Bosnian = 'bs'
Simplified_Chinese = 'ch_sim'
// ...

安装

pip install torch==2.0.1 torchvision==0.15.2 -i https://download.pytorch.org/whl/cpu
pip install easyocr

代码示例 import easyocr languages = ['ch_sim', 'en'] model = easyocr.

LLM Leaderboard

Leaderboards

LLM

Open LLM Leaderboard

Qwen-7B

ChatGLM2-6B

Embedding 模型

Massive Text Embedding Benchmark (MTEB) Leaderboard

sensenova/piccolo-large-zh

piccolo是一个通用embedding模型(中文), 由来自商汤科技的通用模型组完成训练。piccolo借鉴了E5以及GTE的训练流程,采用了两阶段的训练方式。 在第一阶段中,我们搜集和爬取了4亿的中文文本对(可视为弱监督文本对数据),并采用二元组的softmax对比学习损失来优化模型。 在第二阶段中,我们搜集整理了2000万人工标注的中文文本对(精标数据),并采用带有难负样本的三元组的softmax对比学习损失来帮助模型更好地优化。

BAAI/bge-large-zh

FlagEmbedding 将任意文本映射为低维稠密向量,以用于检索、分类、聚类或语义匹配等任务,并可支持为大模型调用外部知识。

不同的任务

参考资料

Ultralytics YOLOv8 推理速度对比

CPU

服务器信息

lscpu

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          40
On-line CPU(s) list:             0-39
Thread(s) per core:              2
Core(s) per socket:              10
Socket(s):                       2
NUMA node(s):                    2
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           79
Model name:                      Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz
Stepping:                        1
CPU MHz:                         1201.687
CPU max MHz:                     3400.0000
CPU min MHz:                     1200.0000
BogoMIPS:                        4788.86
Virtualization:                  VT-x
L1d cache:                       640 KiB
L1i cache:                       640 KiB
L2 cache:                        5 MiB
L3 cache:                        50 MiB
NUMA node0 CPU(s):               0-9,20-29
NUMA node1 CPU(s):               10-19,30-39
Vulnerability Itlb multihit:     KVM: Mitigation: Split huge pages
Vulnerability L1tf:              Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Vulnerability Mds:               Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Meltdown:          Mitigation; PTI
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full generic retpoline, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Mitigation; Clear CPU buffers; SMT vulnerable
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nop
                                 l xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c 
                                 rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 sme
                                 p bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap intel_pt xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts md_clear flush_l1d

OpenVINO Benchmark Python Tool

性能指标评测工具

该工具使用卷积网络执行推理。性能可以测量两种推理模式:

  • 同步(面向延迟 Latency)
  • 异步(面向吞吐量 Throughput)

帮助信息 -i PATHS_TO_INPUT [PATHS_TO_INPUT ...], --paths_to_input PATHS_TO_INPUT [PATHS_TO_INPUT ...] Optional. Path to a folder with images and/or binaries or to specific image or binary file.It is also allowed to map files to network inputs: input_1:file_1/dir1,file_2/dir2,input_4:file_4/dir4 input_2:file_3/dir3 -m PATH_TO_MODEL, --path_to_model PATH_TO_MODEL Required. Path to an .xml/.onnx file with a trained model or to a .blob file with a trained compiled model. -d TARGET_DEVICE, --target_device TARGET_DEVICE Optional.

HTTP 基准测试工具

wrk

wrk 使用的是 HTTP/1.1

安装

需要从 GitHub 上克隆代码自己编译,编译前需要安装 git, gcc。

git clone https://github.com/wg/wrk.git
cd wrk
#使用多线程(机器的处理器核数)加速编译,
make -j $(nproc)
cp wrk /usr/local/bin/

测试

10 个线程,保持打开 100 个并发连接,持续 10 秒。

wrk -t10 -c100 -d10 http://www.baidu.com/
Running 10s test @ http://www.baidu.com/
  10 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   138.46ms  219.05ms   1.89s    92.11%
    Req/Sec   128.07     79.83   700.00     94.40%
  12776 requests in 10.02s, 128.22MB read
  Socket errors: connect 0, read 57, write 0, timeout 3
Requests/sec:   1275.30
Transfer/sec:     12.80MB

打印详细的延时分布信息 wrk -c10 -t4 --latenc