evaluation - Tags - 军舰的日志

Model type	Model	Evaluation result
Language model	Qwen2.5-0.5B	❌
	Qwen2.5-1.5B	✅
	Qwen2.5-7B	✅
	Qwen2.5-14B-Instruct	✅
	Qwen2.5-32B-Instruct	✅
Reasoning model	DeepSeek-R1-Distill-Qwen2.5-1.5B	❌
	DeepSeek-R1-Distill-Qwen2.5-7B	❌
	DeepSeek-R1-Distill-Qwen2.5-14B	✅
	DeepSeek-R1-Distill-Qwen2.5-32B	✅
	Qwen/QwQ-32B	✅
	Qwen/QwQ-32B-Preview	✅
	Qwen/QwQ-32B-AWQ	❌
Code model	Qwen2.5-Coder-0.5B	❌
	Qwen2.5-Coder-1.5B	✅
	Qwen2.5-Coder-3B	✅
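
As a quick way to read the table, the sketch below tallies how many models in each category received a ✅. The rows are transcribed from the table; treating ✅ as "passed" and ❌ as "failed" is an assumption, since this excerpt does not define the marks, and the names `RESULTS` and `pass_counts` are made up for illustration.

```python
from collections import defaultdict

# Rows transcribed from the table: (model type, model, got a check mark).
# Reading the check mark as "passed" is an assumption, not stated in the post.
RESULTS = [
    ("language", "Qwen2.5-0.5B", False),
    ("language", "Qwen2.5-1.5B", True),
    ("language", "Qwen2.5-7B", True),
    ("language", "Qwen2.5-14B-Instruct", True),
    ("language", "Qwen2.5-32B-Instruct", True),
    ("reasoning", "DeepSeek-R1-Distill-Qwen2.5-1.5B", False),
    ("reasoning", "DeepSeek-R1-Distill-Qwen2.5-7B", False),
    ("reasoning", "DeepSeek-R1-Distill-Qwen2.5-14B", True),
    ("reasoning", "DeepSeek-R1-Distill-Qwen2.5-32B", True),
    ("reasoning", "Qwen/QwQ-32B", True),
    ("reasoning", "Qwen/QwQ-32B-Preview", True),
    ("reasoning", "Qwen/QwQ-32B-AWQ", False),
    ("code", "Qwen2.5-Coder-0.5B", False),
    ("code", "Qwen2.5-Coder-1.5B", True),
    ("code", "Qwen2.5-Coder-3B", True),
]


def pass_counts(rows):
    """Return {model_type: (passed, total)} for the given result rows."""
    counts = defaultdict(lambda: [0, 0])
    for model_type, _model, passed in rows:
        counts[model_type][1] += 1  # every row counts toward the total
        if passed:
            counts[model_type][0] += 1
    return {model_type: tuple(pair) for model_type, pair in counts.items()}


if __name__ == "__main__":
    for model_type, (passed, total) in pass_counts(RESULTS).items():
        print(f"{model_type}: {passed}/{total}")
```

In every category the smallest model is among the ones marked ❌; the only large model marked ❌ is the AWQ-quantized QwQ-32B.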

Additional Information on the CUA Evaluation

This document provides additional information on how we evaluated our Computer-Using Agent (CUA), including the (browser/VM) environments, prompts, sampling parameters, and scoring procedures. For more details, read https://openai.com/index/computer-using-agent/.

1 Environment

For WebArena and WebVoyager, we run the evals in the operator browser instead of Playwright browsers, since our model relies on the visual action space for navigation (search bar, back/forward buttons). Our model does not have access to tool calls that control navigation.
For OSWorld, we use the VMware Ubuntu VM distributed by the authors. Our environment has the dock on the right side of the screen instead of the left side, which we found improves performance slightly.
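
The environment choices above can be summarized as a small configuration sketch. This is only an illustration under stated assumptions: the action names, `EvalEnvironment`, `ENVIRONMENTS`, and `is_allowed` are hypothetical and do not come from the original evaluation harness; only the benchmark names, the runtimes, and the dock position are taken from the text.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical action names, for illustration only. The text says only that
# navigation goes through the visual action space (search bar, back/forward
# buttons) and that no navigation tool calls are available.
VISUAL_ACTIONS = frozenset({"click", "type", "scroll", "keypress"})
NAVIGATION_TOOL_CALLS = frozenset({"goto_url", "go_back", "go_forward"})


@dataclass(frozen=True)
class EvalEnvironment:
    benchmark: str
    runtime: str
    dock_position: Optional[str] = None  # only meaningful for the OSWorld VM


ENVIRONMENTS = {
    "WebArena": EvalEnvironment("WebArena", "operator browser"),
    "WebVoyager": EvalEnvironment("WebVoyager", "operator browser"),
    "OSWorld": EvalEnvironment("OSWorld", "VMware Ubuntu VM", dock_position="right"),
}


def is_allowed(action: str) -> bool:
    """Only visual actions are exposed; navigation tool calls are rejected."""
    return action in VISUAL_ACTIONS


if __name__ == "__main__":
    for env in ENVIRONMENTS.values():
        print(env)
    print(is_allowed("click"), is_allowed("goto_url"))
```

Under this sketch, going back a page would be a `click` on the browser's back button rather than a `go_back` tool call.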

2025-01-26 10:00

cua benchmark openai osworld webarena webvoyager evaluation prompt-engineering

2 posts tagged “evaluation”

Monday, March 17, 2025

Hands-On LLM Evaluation: Language vs. Reasoning vs. Code

Sunday, January 26, 2025

Additional Information on the CUA Evaluation
