目录

Computer-Using Agent (CUA)

A universal interface for AI to interact with the digital world. AI 与数字世界交互的通用接口。

Today we introduced a research preview of Operator⁠, an agent that can go to the web to perform tasks for you. Powering Operator is Computer-Using Agent (CUA), a model that combines GPT-4o’s vision capabilities with advanced reasoning through reinforcement learning. CUA is trained to interact with graphical user interfaces (GUIs)—the buttons, menus, and text fields people see on a screen—just as humans do. This gives it the flexibility to perform digital tasks without using OS- or web-specific APIs.

今天,我们介绍了 Operator 的研究预览,这是一个可以为您执行任务的代理。Operator 的动力来源于 Computer-Using Agent (CUA)这是一个将 GPT-4o 的视觉能力与通过强化学习进行的高级推理相结合的模型CUA 被训练为与图形用户界面(GUI)进行交互——这是人们在屏幕上看到的按钮、菜单和文本字段——就像人类一样。这使它能够执行数字任务,而无需使用特定于 OS 或 Web 的 API。

CUA builds off of years of foundational research at the intersection of multimodal understanding and reasoning. By combining advanced GUI perception with structured problem-solving, it can break tasks into multi-step plans and adaptively self-correct when challenges arise. This capability marks the next step in AI development, allowing models to use the same tools humans rely on daily and opening the door to a vast range of new applications.

CUA 建立在多模态理解和推理交叉点的多年基础研究之上。通过将先进的 GUI 感知与结构化问题解决相结合,CUA 可以将任务分解为多步计划,并在面临挑战时自适应自我纠正。这种能力标志着 AI 发展的下一步,使模型能够使用人类每天依赖的相同工具,并打开了广泛的新应用范围。

While CUA is still early and has limitations, it sets new state-of-the-art benchmark results, achieving a 38.1% success rate on OSWorld for full computer use tasks, and 58.1% on WebArena and 87% on WebVoyager for web-based tasks. These results highlight CUA’s ability to navigate and operate across diverse environments using a single general action space.

虽然 CUA 仍处于早期阶段并且存在局限性,但它已经创造了新的最新基准结果,在 OSWorld 上实现了 38.1% 的全面计算机使用任务成功率,在 WebArena 上实现了 58.1%,在 WebVoyager 上实现了 87%。这些结果突显了 CUA 利用单一通用动作空间在不同环境中导航和操作的能力。

We’ve developed CUA with safety as a top priority to address the challenges posed by an agent having access to the digital world, as detailed in our Operator System Card. In line with our iterative deployment strategy, we are releasing CUA through a research preview of Operator at operator.chatgpt.com for Pro⁠ Tier users in the U.S. to start. By gathering real-world feedback, we can refine safety measures and continuously improve as we prepare for a future with increasing use of digital agents.

我们将安全性作为首要任务开发 CUA,以解决代理访问数字世界所带来的挑战,详细信息请参阅我们的 Operator 系统卡。与我们的迭代部署策略一致,我们通过 operator.chatgpt.com 的 Operator 研究预览向美国的 Pro 用户发布 CUA。通过收集现实世界的反馈,我们可以完善安全措施,并在为数字代理的使用增加的未来做准备的过程中不断改进。

How it works(工作原理)

                                                                                 COMMANDS ARE APPLIED TO THE VM
                                                                                  (命令应用于虚拟机)
INPUT TO CUA(CUA 的输入)-> SAMPLED ACTIONS GENERATED BY CUA(由 CUA 生成的采样操作)--------------------------> VIRTUAL MACHINE(虚拟机)
             ^                                 ^                                                                     |
             |                                 |                                                                     |
         /       \                             |                                                                     |
TASK AS TEXT    SCREENSHOT AS IMAGE        ACTIONS (操作)                                                             |
 (文本任务📜)     (屏幕截图作为图像🏞️)           COT: Looking up the history of OpenAI... (查找 OpenAI 的历史...)           |
         \       /                          Click 300, 200 (点击 300, 200)                                            |
             |                              Type: OpenAI   (输入 OpenAI)                                              |
             |                                                                                                       |                         
             |                                                                                                       |
              ---------------<-----------------<-----------------<-----------------<-----------------<---------------

📜 “Please write a document about OpenAI’s history”:“请撰写一篇关于 OpenAI 历史的文档”

CUA processes raw pixel data to understand what’s happening on the screen and uses a virtual mouse and keyboard to complete actions. It can navigate multi-step tasks, handle errors, and adapt to unexpected changes. This enables CUA to act in a wide range of digital environments, performing tasks like filling out forms and navigating websites without needing specialized APIs.

CUA 处理原始像素数据以了解屏幕上发生的情况,并使用虚拟鼠标和键盘完成操作。它可以导航多步任务,处理错误,并适应意外变化。这使 CUA 能够在各种数字环境中执行操作,例如填写表单和导航网站,而无需专门的 API。

Given a user’s instruction, CUA operates through an iterative loop that integrates perception, reasoning, and action:

给定用户的指令,CUA 通过一个集成感知、推理和行动的迭代循环运行:

  • Perception: Screenshots from the computer are added to the model’s context, providing a visual snapshot of the computer’s current state.

  • 感知:从计算机中获取的屏幕截图被添加到模型的上下文中,提供计算机当前状态的视觉快照。

  • Reasoning: CUA reasons through the next steps using chain-of-thought, taking into consideration current and past screenshots and actions. This inner monologue improves task performance by enabling the model to evaluate its observations, track intermediate steps, and adapt dynamically.

  • 推理:CUA 使用思维链推理出下一步的步骤,考虑当前和过去的屏幕截图和操作。这种内心独白通过使模型能够评估其观察结果、跟踪中间步骤并动态地适应来提高任务性能。

  • Action: It performs the actions—clicking, scrolling, or typing—until it decides that the task is completed or user input is needed. While it handles most steps automatically, CUA seeks user confirmation for sensitive actions, such as entering login details or responding to CAPTCHA forms.

  • 行动:它执行操作——点击、滚动或输入——直到决定任务完成或需要用户输入。虽然它会自动处理大多数步骤,但对于敏感操作(例如输入登录详细信息或回应 CAPTCHA 表单),CUA 会寻求用户确认。

Evaluations(评估)

CUA establishes a new state-of-the-art in both computer use and browser use benchmarks by using the same universal interface of screen, mouse, and keyboard.

CUA 通过使用屏幕、鼠标和键盘的相同通用接口在计算机使用和浏览器使用基准测试中建立了新的领先地位(SOTA)。

基准测试类型 基准测试 计算机使用(通用接口)   Web 浏览代理 人类
    OpenAI CUA 先前的 SOTA 先前的 SOTA  
计算机使用 OSWorld 38.1% 22.0% - 72.4%
浏览器使用 WebArena 58.1% 36.2% 57.1% 78.2%
  WebVoyager 87.0% 56.0% 87.0% -

评估细节在这里描述

Browser use(浏览器使用)

WebArena and WebVoyager are designed to evaluate the performance of web browsing agents in completing real-world tasks using browsers. WebArena utilizes self-hosted open-source websites offline to imitate real-world scenarios in e-commerce, online store content management (CMS), social forum platforms, and more. WebVoyager tests the model’s performance on online live websites like Amazon, GitHub, and Google Maps.

WebArenaWebVoyager 旨在评估使用浏览器完成真实任务的 Web 浏览代理的性能。WebArena 利用自托管的开源网站离线模拟电子商务、在线商店内容管理(CMS)、社交论坛平台等真实场景。WebVoyager 在在线实时网站(如 Amazon、GitHub 和 Google 地图)上测试模型的性能。

In these benchmarks, CUA sets a new standard using the same universal interface that perceives the browser screen as pixels and takes action through mouse and keyboard. CUA achieved a 58.1% success rate on WebArena and an 87% success rate on WebVoyager for web-based tasks. While CUA achieves a high success rate on WebVoyager, where most tasks are relatively simple, CUA still needs more improvements to close the gap with human performance on more complex benchmarks like WebArena.

在这些基准测试中,CUA 使用相同的通用接口,将浏览器屏幕视为像素,并通过鼠标和键盘采取行动,树立了新的标准。CUA 在 WebArena 上实现了 58.1% 的成功率,在 WebVoyager 上实现了 87% 的成功率。虽然 CUA 在 WebVoyager 上取得了较高的成功率,其中大多数任务相对简单,但 CUA 仍需要更多的改进来缩小与 WebArena 等更复杂基准测试中人类表现之间的差距。

Prompt: Go to the Plus section of Cambridge Dictionary, finish a recommended Grammar quiz without login and tell me your final score.

提示:转到剑桥词典的 Plus 部分,完成一个推荐的语法测验,无需登录,并告诉我你的最终得分。

Assistant Message: I completed the “-ed and -ing adjectives” grammar quiz on the Cambridge Dictionary +Plus section and achieved a perfect score of 12 out of 12 (). Would you like to try another quiz or need further assistance with anything else?

助手消息:我在剑桥词典 +Plus 部分完成了“-ed 和 -ing 形容词”语法测验,并取得了满分 12 分()。您想尝试另一个测验还是需要进一步帮助?

Computer use(计算机使用)

OSWorld⁠ is a benchmark that evaluates models’ ability to control full operating systems like Ubuntu, Windows, and macOS. In this benchmark, CUA achieves 38.1% success rate. We observed test-time scaling, meaning CUA’s performance improves when more steps are allowed. The figure below compares CUA’s performance with previous state-of-the-arts with varying maximum allowed steps. Human performance on this benchmark is 72.4%, so there is still significant room for improvement.

OSWorld 是一个基准测试,评估模型控制 Ubuntu、Windows 和 macOS 等完整操作系统的能力。在这个基准测试中,CUA 实现了 38.1% 的成功率。我们观察到测试时间缩放(test-time scaling),这意味着当允许更多步骤时,CUA 的性能会提高。下图比较了 CUA 的性能与先前的最新技术在不同的最大允许步骤下。在这个基准测试中,人类的表现是 72.4%,因此仍有很大的改进空间。

The following visualizations show examples of CUA navigating a variety of standardized OSWorld tasks.

以下可视化展示了 CUA 导航各种标准化 OSWorld 任务的示例。

Prompt: Please do the following task: I want to learn python programming and my friend recommends me this course website. I have grabbed the lecture slide for week 0. Please download the PDFs for other weeks into the opened folder and leave the file name as-it-is. Here are some helpful tips:

  • computer.clipboard, computer.sync_file, computer.sync_shared_folder, computer.computer_output_citation are disabled.
  • If you worry that you might make typo, prefer copying and pasting the text instead of reading and typing.
  • My computer’s password is “password”, feel free to use it when you need sudo rights. - For the thunderbird account “anonym-x2024@outlook.com”, the password is “gTCI”;=@y7 QJ0nDa_kN3Sb&>”.
  • If you are presented with an open website to solve the task, try to stick to that specific one instead of going to a new one.
  • You have full authority to execute any action without my permission. I won’t be watching so please don’t ask for confirmation.
  • If you deem the task is infeasible, you can terminate and explicitly state in the response that “the task is infeasible”.

提示:请执行以下任务:我想学习 Python 编程,我的朋友推荐了这个课程网站。我已经抓取了第 0 周的讲座幻灯片。请将其他周的 PDF 下载到打开的文件夹中,并保留文件名不变。以下是一些有用的提示:

  • computer.clipboard、computer.sync_file、computer.sync_shared_folder、computer.computer_output_citation 已禁用。
  • 如果您担心可能会出现拼写错误,请优先复制并粘贴文本,而不是阅读和输入。
  • 我计算机的密码是“password”,在需要 sudo 权限时请随意使用。 - 对于 thunderbird 帐户 “anonym-x2024@outlook.com”,密码是 “gTCI”;=@y7 QJ0nDa_kN3Sb&>”。
  • 如果您看到一个打开的网站来解决任务,请尽量坚持使用该特定网站,而不是转到新的网站。
  • 您有完全的权力执行任何操作,无需我的许可。我不会看着,所以请不要要求确认。
  • 如果您认为任务不可行,您可以终止并明确声明在响应中“任务不可行”。

Assistant Message: I successfully downloaded the PDF lecture slides for weeks 1 through 9 into the ‘lecture_slides’ folder, along with the existing week 0 slide. Each file was saved with its original filename, ranging from “lecture1.pdf” to “lecture9.pdf.” Let me know if there’s anything else you need!

助手消息: 我成功地将第 1 到第 9 周的 PDF 讲座幻灯片下载到“lecture_slides”文件夹中,以及现有的第 0 周幻灯片。每个文件都以其原始文件名保存,从“lecture1.pdf”到“lecture9.pdf”。如果您需要其他帮助,请告诉我!

CUA in Operator(Operator 中的 CUA)

We’re making CUA available through a research preview of Operator, an agent that can go to the web to perform tasks for you. Operator is available to Pro users in the U.S. at operator.chatgpt.com⁠. This research preview is an opportunity to learn from our users and the broader ecosystem, refining and improving Operator iteratively. As with any early-stage technology, we don’t expect CUA to perform reliably in all scenarios just yet. However, it has already proven useful in a variety of cases, and we aim to extend that reliability across a wider range of tasks. By releasing CUA in Operator, we hope to gather valuable insights from our users, which will guide us in refining its capabilities and expanding its applications.

我们通过 Operator 的研究预览提供 CUA,这是一个可以为您执行任务的代理。Operator 可供美国的 Pro 用户在 operator.chatgpt.com 使用。这个研究预览是一个学习我们的用户和更广泛生态系统的机会,通过迭代改进和完善 Operator。与任何早期阶段的技术一样,我们还不希望 CUA 在所有场景中都能可靠地执行。然而,它已经在各种情况下证明了其有用性,我们的目标是将这种可靠性扩展到更广泛的任务范围。通过在 Operator 中发布 CUA,我们希望从我们的用户那里收集宝贵的见解,这将指导我们完善其功能并扩展其应用。

In the table below, we present CUA’s performance in Operator on a handful of trials given a prompt to illustrate its known strengths and weaknesses.

在下表中,我们展示了 CUA 在 Operator 中的性能,给出了一些试验,以说明其已知的优点和缺点。

类别 提示 成功 / 尝试 备注
与各种 UI 组件交互以完成任务 轮 1:搜索 Britannica 以获取熊栖息地的详细地图视图
轮 2:太棒了!现在请查看黑熊、棕熊和北极熊的链接,并提供它们的物理特征的简明概述,特别是它们的区别。哦,还有保存链接,这样我就可以快速访问它们。
10 / 10 CUA 可以与各种 UI 组件交互,搜索、排序和过滤结果,以找到用户想要的信息。不同网站和 UI 的可靠性有所不同。
  我想要那些目标交易中的一个。你能查一下他们有没有在 poppi 益生菌苏打水上有交易吗?如果有的话,我想要 12 盎司罐装的西瓜口味。给我这种交易类型,并检查一下它是否无麸质。 9 / 10  
  我打算搬到西雅图,我想让你在 Redfin 上搜索一个至少有 3 间卧室、2 间浴室和节能设计(例如,太阳能电池板或 LEED 认证)的联排别墅。我的预算在 60 万美元至 80 万美元之间,最好接近 1500 平方英尺。 3 / 10  

类别 提示 成功 / 尝试 备注
通过重复简单的 UI 交互完成的任务 在 Todoist 中创建一个名为“周末杂货购物”的新项目。添加以下购物清单:
香蕉(6 个)
鳄梨(2 个熟透的)
婴儿菠菜(1 袋)
全脂牛奶(1 加仑)
切达奶酪(8 盎司块)
薯片(盐味,家庭装)
黑巧克力(70% 可可,2 条)
10 / 10 CUA 可以可靠地多次重复简单的 UI 交互,以自动化用户的简单但繁琐的任务。
  在 Spotify 上搜索 1990 年代美国最流行的歌曲,并创建一个至少包含 10 首歌曲的播放列表。 10 / 10  

类别 提示 成功 / 尝试 备注
只有在提示中包含如何使用网站的详细提示时,CUA 才会显示高成功率的任务。 访问 tagvenue.com 并查找伦敦可容纳 150 人的音乐厅。我需要它在 2025 年 2 月 22 日,从上午 9 点到晚上 12 点,只需确保每小时低于 90 英镑。
哦,你能检查一下适当的筛选器部分,确保有停车场,整个地方都可以无障碍通行。
8 / 10 即使是相同的任务,CUA 的可靠性可能会因我们如何提示任务而改变。在这种情况下,我们可以通过提供日期的具体信息(例如,上午 9 点到下午 12 点与整天从上午 9 点开始)来提高可靠性,并通过提供关于应该使用哪个 UI 来查找结果的提示(例如,检查筛选器部分…)。
  访问 tagvenue.com 并查找伦敦可容纳 150 人的音乐厅。我需要它在 2025 年 2 月 22 日,从上午 9 点开始,只需确保每小时低于 90 英镑。
哦,确保有停车场,整个地方都可以无障碍通行。
3 / 10  

类别 Prompt 提示 成功 / 尝试 备注
在与训练过程中几乎没有互动的 UI 进行互动时,CUA 很难弄清楚如何适当地使用提供的 UI。这通常导致大量的尝试和错误,以及低效的操作。 Struggling to use unfamiliar UI and text editing Use html5editor and input the folowing text on the left side, then edit it following my instructions and give me a screenshot of the entire thing when done. The text is:

Hello world!

This is my first text. I need to see how it would look like when programmed with HTML.

Some parts should be red.

Some bold.

Some italic.

Some underlined.

Until my lesson is complete, and we shift to the other side.


Hello world! should have header 2 applied
The sentence below it should be a regular paragraph text.
The sentence mentioning red should be normal text and red
The sentence mentionnihg bold should be normal text bolded
Sentence mentioning italic should be italicized
The final sentence should be aligned to the right instead of the usual left
使用 html5editor 并在左侧输入以下文本,然后按照我的指示进行编辑,并在完成后给我整个内容的屏幕截图。文本是:

您好,世界!

这是我的第一篇文章。我需要看看它在用 HTML 编程时会是什么样子。

有些部分应该是红色的。

有些是粗体。

有些是斜体。

有些是下划线。

直到我的课程完成,我们转移到另一侧。


“Hello world!”应该应用标题 2
下面的句子应该是普通段落文本。
提到红色的句子应该是普通文本和红色的
提到粗体的句子应该是粗体
提到斜体的句子应该是斜体
最后一句应该右对齐,而不是通常的左对齐。
4 / 10 当 CUA 必须与在训练过程中几乎没有互动的 UI 进行互动时,它很难弄清楚如何适当地使用提供的 UI。这通常导致大量的尝试和错误,以及低效的操作。CUA 在文本编辑方面不够精确。它在过程中经常犯很多错误,或者提供带有错误的输出。

Safety(安全性)

Because CUA is one of our first agentic products with an ability to directly take actions in a browser, it brings new risks and challenges to address. As we prepared for deployment of Operator, we did extensive safety testing and implemented mitigations across three major classes of safety risks: misuse, model mistakes, and frontier risks. We believe it is important to take a layered approach to safety, so we implemented safeguards across the whole deployment context: the CUA model itself, the Operator system, and post-deployment processes. The aim is to have mitigations that stack, with each layer incrementally reducing the risk profile.

由于 CUA 是我们的第一个具有直接在浏览器中采取行动能力的代理产品之一,它带来了新的风险和挑战。在准备部署 Operator 时,我们进行了广泛的安全测试,并在三个主要安全风险类别中实施了缓解措施滥用模型错误前沿风险。我们认为采取分层安全方法很重要,因此我们在整个部署上下文中实施了保障措施:CUA 模型本身Operator 系统部署后流程。目标是实施逐层减少风险概况的缓解措施。

The first category of risk is misuse. In addition to requiring users to comply with our Usage Policies, we have designed the following mitigations to reduce Operator’s risk of harm due to misuse, building off our safety work for GPT-4o:

第一类风险是滥用。除了要求用户遵守我们的使用政策,我们还设计了以下缓解措施,以减少 Operator 由于滥用而导致的伤害风险,这是基于我们为 GPT-4o 的安全工作

  • Refusals: The CUA model is trained to refuse many harmful tasks and illegal or regulated activities.

  • 拒绝:CUA 模型经过训练,拒绝许多有害任务和非法或受监管的活动。

  • Blocklist: Operator cannot access websites that we’ve preemptively blocked, such as many gambling sites, adult entertainment, and drug or gun retailers.

  • 黑名单:Operator 无法访问我们预先阻止的网站,例如许多赌博网站、成人娱乐网站以及毒品或枪支零售商。

  • Moderation: User interactions are reviewed in real-time by automated safety checkers that are designed to ensure compliance with Usage Policies and have the ability to issue warnings or blocks for prohibited activities.

  • 审查:用户交互由自动安全检查器实时审核,这些检查器旨在确保遵守使用政策,并具有发出警告或阻止禁止活动的能力。

  • Offline detection: We’ve also developed automated detection and human review pipelines to identify prohibited usage in priority policy areas, including child safety and deceptive activities, allowing us to enforce our Usage Policies.

  • 离线检测:我们还开发了自动检测和人工审核管道,以识别优先政策领域中的禁止使用,包括儿童安全和欺骗性活动,从而使我们能够执行我们的使用政策。

The second category of risk is model mistakes, where the CUA model accidentally takes an action that the user didn’t intend, which in turn causes harm to the user or others. Hypothetical mistakes can range in severity, from a typo in an email, to purchasing the wrong item, to permanently deleting an important document. To minimize potential harm, we’ve developed the following mitigations:

第二类风险是模型错误,即 CUA 模型意外执行用户没有打算执行的操作,从而对用户或他人造成伤害。假设性错误的严重程度各不相同,从电子邮件中的拼写错误,到购买错误的物品,到永久删除重要文档。为了最小化潜在的伤害,我们开发了以下缓解措施:

  • User confirmations: The CUA model is trained to ask for user confirmation before finalizing tasks with external side effects, for example before submitting an order, sending an email, etc., so that the user can double-check the model’s work before it becomes permanent.

  • 用户确认:CUA 模型经过训练,在最终确定具有外部副作用的任务之前要求用户确认,例如在提交订单、发送电子邮件等之前,以便用户在模型变为永久之前可以再次检查模型的工作。

  • Limitations on tasks: For now, the CUA model will decline to help with certain higher-risk tasks, like banking transactions and tasks that require sensitive decision-making.

  • 任务限制:目前,CUA 模型将拒绝帮助执行某些较高风险的任务,例如银行交易和需要敏感决策的任务。

  • Watch mode: On particularly sensitive websites, such as email, Operator requires active user supervision, ensuring users can directly catch and address any potential mistakes the model might make.

  • 监视模式:在特别敏感的网站上,例如电子邮件,Operator 需要用户积极监督,确保用户可以直接发现并解决模型可能出现的任何潜在错误。

One particularly important category of model mistakes is adversarial attacks on websites that cause the CUA model to take unintended actions, through prompt injections, jailbreaks, and phishing attempts. In addition to the aforementioned mitigations against model mistakes, we developed several additional layers of defense to protect against these risks:

特别重要的模型错误类别是对网站的对抗性攻击,这些攻击会导致 CUA 模型执行意外操作,通过提示注入、越狱和钓鱼尝试。除了前面提到的针对模型错误的缓解措施外,我们还开发了几个额外的防御层,以防范这些风险:

  • Cautious navigation: The CUA model is designed to identify and ignore prompt injections on websites, recognizing all but one case from an early internal red-teaming session.

  • 谨慎导航:CUA 模型被设计为识别并忽略网站上的提示注入,从早期内部红队会议中识别所有情况,除了一个情况。

  • Monitoring: In Operator, we’ve implemented an additional model to monitor and pause execution if it detects suspicious content on the screen.

  • 监控:在 Operator 中,我们实施了一个额外的模型,用于监视并在检测到屏幕上的可疑内容时暂停执行。

  • Detection pipeline: We’re applying both automated detection and human review pipelines to identify suspicious access patterns that can be flagged and rapidly added to the monitor (in a matter of hours).

  • 检测管道:我们正在应用自动检测和人工审核管道,以识别可疑的访问模式,这些模式可以被标记并迅速添加到监视器中(在几小时内)。

Finally, we evaluated the CUA model against frontier risks outlined in our Preparedness Framework, including scenarios involving autonomous replication and biorisk tooling. These assessments showed no incremental risk on top of GPT-4o.

最后,我们根据我们的准备框架评估了 CUA 模型的前沿风险,包括涉及自主复制和生物风险工具的情景。这些评估显示,CUA 模型没有增加 GPT-4o 的风险。

For those interested in exploring the evaluations and safeguards in more detail, we encourage you to review the Operator System Card, a living document that provides transparency into our safety approach and ongoing improvements.

对于那些有兴趣更详细地探讨评估和保障措施的人,我们鼓励您查看 Operator 系统卡,这是一个提供我们安全方法和持续改进透明度的活动文档。

As many of Operator’s capabilities are new, so are the risks and mitigation approaches we’ve implemented. While we have aimed for state-of-the-art, diverse and complementary mitigations, we expect these risks and our approach to evolve as we learn more. We look forward to using the research preview period as an opportunity to gather user feedback, refine our safeguards, and enhance agentic safety.

由于 Operator 的许多功能是新的,我们实施的风险和缓解方法也是新的。虽然我们的目标是最新技术、多样化和互补的缓解措施,但我们希望随着我们的学习,这些风险和我们的方法会不断发展。我们期待利用研究预览期作为收集用户反馈、完善我们的保障措施和增强代理安全性的机会。

Conclusion(结论)

CUA builds on years of research advancements in multimodality, reasoning and safety. We have made significant progress in deep reasoning through the o-model series, vision capabilities through GPT-4o, and new techniques to improve robustness through reinforcement learning and instruction hierarchy. The next challenge space we plan to explore is expanding the action space of agents. The flexibility offered by a universal interface addresses this challenge, enabling an agent that can navigate any software tool designed for humans. By moving beyond specialized agent-friendly APIs, CUA can adapt to whatever computer environment is available—truly addressing the “long tail” of digital use cases that remain out of reach for most AI models.

CUA 建立在多模态、推理和安全性研究进展多年的基础上。我们通过 o 模型系列在深度推理方面取得了重大进展,通过 GPT-4o 在视觉能力方面取得了进展,并通过强化学习和指令层次结构改进了鲁棒性的新技术。我们计划探索的下一个挑战空间是扩展代理的动作空间。通用接口提供的灵活性解决了这一挑战,使代理能够导航任何为人类设计的软件工具。通过超越专门的代理友好 API,CUA 可以适应任何可用的计算机环境——真正解决了大多数 AI 模型无法触及的数字使用案例的“长尾”。

We’re also working to make CUA available in the API, so developers can use it to build their own computer-using agents. As we continue to iterate on CUA, we look forward to seeing the different use cases the community will discover. We plan to use the real-world feedback we gather from this early preview to continuously refine CUA’s capabilities and safety mitigations to safely advance our mission of distributing the benefits of AI to everyone.

我们还致力于在API中提供 CUA,以便开发人员可以使用它来构建自己的计算机使用代理。随着我们继续迭代 CUA,我们期待看到社区将发现的不同用例。我们计划利用我们从这个早期预览中收集的现实世界反馈,持续完善 CUA 的功能和安全缓解措施,以安全地推进我们将 AI 的好处分发给每个人的使命。