5 posts tagged “osworld”

Saturday, February 1, 2025

Claude: Developing a computer use model

Developing a computer use model

Claude can now use computers. The latest version of Claude 3.5 Sonnet can, when run through the appropriate software setup, follow a user’s commands to move a cursor around their computer’s screen, click on relevant locations, and input information via a virtual keyboard, emulating the way people interact with their own computer.


We think this skill—which is currently in public beta—represents a significant breakt

2025-02-01 10:00

Friday, January 31, 2025

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

Reference

Abstract

Autonomous agents that accomplish complex computer tasks with minimal human interventions have the potential to transform human-computer interaction, significantly enhancing accessibility and productivity. However, existing benchmarks either lack an interactive environment or are limited to environments specific to certain applications or domains, failing to reflect the diverse and complex nature of real-world computer use, thereby limiting the scope of tasks and agent scalability.

2025-01-31 10:00

osworld benchmark agent multimodal-agent vlm llm gui cli pyautogui

Monday, January 27, 2025

UI-TARS: Pioneering Automated GUI Interaction with Native Agents

UI-TARS: Pioneering Automated GUI Interaction with Native Agents

Abstract

This paper introduces UI-TARS, a native GUI agent model that solely perceives the screenshots as input and performs human-like interactions (e.g., keyboard and mouse operations). Unlike prevailing agent frameworks that depend on heavily wrapped commercial models (e.g., GPT-4o) with expert-crafted prompts and workflows, UI-TARS is an end-to-end model that outperforms these sophisticated frameworks.

2025-01-27 10:00

ui-tars agent gui llm native-agent bytedance qwen-2-vl osworld androidworld system-2-reasoning

Sunday, January 26, 2025

CUA eval extra information

CUA eval extra information

This document includes extra information on how we evaluated our Computer Using Agent, including (browser/VM) environments, prompts, sampling parameters, and scoring procedures. For more details, read https://openai.com/index/computer-using-agent/.


1 Environment

For WebArena and WebVoyager, we run the evals in operator browser instead of playwright browsers since our model relies on the visual action space for navigation (search bar, backward/forward button). Our model does not have access to tool calls that control the navigation.
For OSWorld, we use the VMWare Ubuntu VM distributed by the authors. Our environment has the dock on the right side of the screen instead of the left side, which we have found to improve the performance slightly.

2025-01-26 10:00

cua benchmark openai osworld webarena webvoyager evaluation prompt-engineering

Saturday, January 25, 2025

Computer-Using Agent

Computer-Using Agent (CUA)

A universal interface for AI to interact with the digital world.

Today we introduced a research preview of Operator, an agent that can go to the web to perform tasks for you. Powering Operator is Computer-Using Agent (CUA), a model that combines GPT-4o's vision capabilities with advanced reasoning through reinforcement learning. CUA is trained to interact with graphical user interfaces (GUIs): the buttons, menus, and text fields people see on a screen, just as humans do.

2025-01-25 10:00

cua operator computer-using-agent openai gui agent osworld webarena webvoyager
