5 篇文章带有标签 “arXiv”

2025年2月23日星期日

Qwen2.5-VL Technical Report

We introduce Qwen2.5-VL, the latest flagship model of Qwen vision-language series, which demonstrates significant advancements in both foundational capabilities and innovative functionalities. Qwen2.5-VL achieves a major leap forward in understanding and interacting with the world through enhanced visual recognition, precise object localization, robust document parsing, and long-video comprehension. A standout feature of Qwen2.5-VL is its ability to localize objects using bounding boxes or points accurately.

2025年2月23日 83 分钟 20,369 字

2025年2月6日星期四

Introducing deep research

An agent that uses reasoning to synthesize large amounts of online information and complete multi-step research tasks for you.

一个代理，使用推理来综合大量在线信息，并为您完成多步研究任务。

Today we’re launching deep research in ChatGPT, a new agentic capability that conducts multi-step research on the internet for complex tasks. It accomplishes in tens of minutes what would take a human many hours.

今天我们在 ChatGPT 中推出了 deep research，这是一种新的代理能力，可以在互联网上进行复杂任务的多步研究。它可以在几十分钟内完成人类需要花费数小时才能完成的任务。

Deep research is OpenAI's ne

2025年2月6日 24 分钟 5,862 字

DeepResearch Agent o3 arXiv OpenAI

2025年2月4日星期二

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Language models have outpaced our ability to evaluate them effectively, but for their future development it is essential to study the frontier of their capabilities. We find real-world software engineering to be a rich, sustainable, and challenging testbed for evaluating the next generation of language models. To this end, we introduce SWE-bench, an evaluation framework consisting of 2,294 software engineering problems drawn from real GitHub issues and corresponding pull requests across 12 popular Python repositories. Given a codebase along with a description of an issue to be resolved, a language model is tasked with editing the codebase to address the issue. Resolving issues in SWE-bench frequently requires understanding and coordinating changes across multiple functions, classes, and even files simultaneously, calling for models to interact with execution environments, process extremely long contexts and perform complex reasoning that goes far beyond traditional code generation tasks. Our evaluations show that both state-ofthe-art proprietary models and our fine-tuned model SWE-Llama can resolve only the simplest issues. The best-performing model, Claude 2, is able to solve a mere 1.96% of the issues. Advances on SWE-bench represent steps towards LMs that are more practical, intelligent, and autonomous.

2025年2月4日 62 分钟 15,379 字

SWE-bench Benchmark arXiv LLM

2025年2月1日星期六

Claude: Developing a computer use model

Claude can now use computers. The latest version of Claude 3.5 Sonnet can, when run through the appropriate software setup, follow a user’s commands to move a cursor around their computer’s screen, click on relevant locations, and input information via a virtual keyboard, emulating the way people interact with their own computer.

Claude现在可以使用计算机了。最新版本的Claude 3.5 Sonnet可以在通过适当的软件设置后，按照用户的命令在计算机屏幕上移动光标，单击相关位置，并通过虚拟键盘输入信息，模拟人们与自己的计算机交互的方式。

We think this skill—which is currently in public beta—represents a significant breakthrough in AI progress.

2025年2月1日 14 分钟 3,406 字

Claude arXiv Agent LLM Anthropics

2025年1月31日星期五

OSWorld：在真实计算机环境中为开放式任务进行多模态代理基准测试

Autonomous agents that accomplish complex computer tasks with minimal human interventions have the potential to transform human-computer interaction, significantly enhancing accessibility and productivity. However, existing benchmarks either lack an interactive environment or are limited to environments specific to certain applications or domains, failing to reflect the diverse and complex nature of real-world computer use, thereby limiting the scope of tasks and agent scalability. To address this issue, we introduce OSWORLD, the first-of-its-kind scalable, real computer environment for multimodal agents, supporting task setup, execution-based evaluation, and interactive learning across various operating systems such as Ubuntu, Windows, and macOS. OSWORLD can serve as a unified, integrated computer environment for assessing open-ended computer tasks that involve arbitrary applications. Building upon OSWORLD, we create a benchmark of 369 computer tasks involving real web and desktop apps in open domains, OS file I/O, and workflows spanning multiple applications. Each task example is derived from real-world computer use cases and includes a detailed initial state setup configuration and a custom execution-based evaluation script for reliable, reproducible evaluation. Extensive evaluation of state-of-the-art LLM/VLM-based agents on OSWORLD reveals significant deficiencies in their ability to serve as computer assistants. While humans can accomplish over 72.36% of the tasks, the best model achieves only 12.24% success, primarily struggling with GUI grounding and operational knowledge. Comprehensive analysis using OSWORLD provides valuable insights for developing multimodal generalist agents that were not possible with previous benchmarks. Our code, environment, baseline models, and data are publicly available at https://os-world.github.io.

2025年1月31日 107 分钟 26,700 字

OSWorld Benchmark Agent arXiv

5 篇文章带有标签 “arXiv”

2025年2月23日 星期日