通义灵码2.0













































CPU: Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz(64核)GPU: NVIDIA T4(16GB)X 4内存: 256GBconda create -n eval-llm python==3.12 -y
conda activate eval-llm
cd /data/wjj
mkdir eval-llm
cd eval-llm
pip install vllm==0.7.3 pandas
git clone https://github.com/vllm-project/vllm
docker pull lmsysorg/sglang:latest
pip install evalscope-perf==1.0.0
通过设置环境变量没有生效。
export OPENAI_API_KEY=sk-1234
这里进行了硬编码,编辑文件:/data/miniconda3/envs/eval-llm/lib/python3.12/site-packages/evalscope_perf/main.py
conda create -n litellm python==3.12.9 -y
conda activate litellm
pip install "litellm[proxy]" langfuse openai
git clone https://github.com/langfuse/langfuse.git
cd langfuse
docker compose up
注册用户
浏览器访问 http://localhost:3000/,单击 Sign up 注册一个新账户。



# 模型列表
model_list:
## 语言模型
- model_name: gpt-4
litellm_params:
model: openai/qwen2.5:7b
api_base: http://localhost:11434/v1
api_key: NONE
## 视觉模型
- model_name: gpt-4o
litellm_params:
model: openai/llama3.2-vision
api_base: http://localhost:11434/v1
api_key: NONE
## 代码模型
// ...
lscpu
架构: x86_64
CPU 运行模式: 32-bit, 64-bit
字节序: Little Endian
Address sizes: 48 bits physical, 48 bits virtual
CPU: 256
在线 CPU 列表: 0-254
离线 CPU 列表: 255
每个核的线程数: 1
每个座的核数: 64
座: 2
NUMA 节点: 8
厂商 ID: HygonGenuine
BIOS Vendor ID: Chengdu Hygon
CPU 系列: 24
型号: 4
// ...
DCU:Hygon K100_AI 64G X 8
lspci -v | grep -A22 'Co-processor'
We introduce Qwen2.5-VL, the latest flagship model of Qwen vision-language series, which demonstrates significant advancements in both foundational capabilities and innovative functionalities. Qwen2.5-VL achieves a major leap forward in understanding and interacting with the world through enhanced visual recognition, precise object localization, robust document parsing, and long-video comprehension. A standout feature of Qwen2.5-VL is its ability to localize objects using bounding boxes or points accurately.
git clone https://github.com/cline/cline.git
code cline
npm run install:all
如果构建项目时遇到问题,请安装 esbuild problem matchers 扩展。
Activating task providers npm
错误: problemMatcher 引用无效: $esbuild-watch
打开 运行和调试 侧边栏,运行 Run Extension,或者按 F5 键启动调试,打开一个新的 VSCode 窗口,加载扩展。

这里想探索使用多模态大模型答题的技术方案,包含单选题、多选题、判断题,最终构建自主答题的智能体。
工作流程:🏞️ -> MLM(多模态大模型)-> 答案
直接使用多模态大模型读题(转成文字),然后检索答案,把题和答案组合的提示词输入给语言大模型。
我使用了 Ollama 调用多模态大模型
minicpm-v:8b来生成文字。llava:7b的效果不好。
代码示例:
import ollama
response = ollama.chat(
model="minicpm-v:8b",
messages=[
{
'role': 'user',
'content': '读取图像中的题。',
'images': ['ti.png']
}
]
)
print(response['message']['content'])
T4 GPU 服务器,4卡16G。
conda create -n deepseek-r1 python=3.12 -y
conda activate deepseek-r1
pip install vllm
曦云®C500是沐曦面向通用计算的旗舰产品,提供强大高精度及多精度混合算力,配备大规格高带宽显存,片间互联MetaXLink无缝链接多GPU系统,自主研发的MXMACA®软件栈可兼容主流GPU生态,能够全面满足数字经济建设和产业数字化的算力需求。
2023 年 6 月 14 日,沐曦官宣 AI 训练 GPU MXC500 完成芯片功能测试,MXMACA 2.0 计算平台基础测试完成,意味着公司首款 AI 训练芯片 MXC500成功点亮,该芯片采用 7nm 制程,GPGPU 架构,能够兼容 CUDA,目标对标英伟达 A100/A800 芯片。
沐曦主要有三大产品线:
研发实力强大,软件生态布局完善。沐曦的研发团队阵容豪华,三位创始人均在 AMD 拥有 20 年左右的 GPU 研发经验,其中两位为 AMD 科学家(Fellow)。沐曦采用了完全自主研发的 GPU IP,有效提高了产品的开发效率,同时拥有完全自主知识产权的指令集和架构,可以对每个独立的计算实例进行灵活配置,从而优化数据中心计算资源的效率。
Yesterday, OpenAI released Deep Research, a system that browses the web to summarize content and answer questions based on the summary. The system is impressive and blew our minds when we tried it for the first time.
昨天,OpenAI 发布了 Deep Research,这是一个浏览网页以总结内容并根据总结回答问题的系统。当我们第一次尝试时,这个系统给我们留下了深刻的印象。
One of the main results in the blog post is a strong improvement of performances on the General AI Assistants benchmark (GAIA), a benchmark we’ve been playing with recently as well, where they successfully reached near 67% correct answers on 1-shot on average, and 47.
An agent that uses reasoning to synthesize large amounts of online information and complete multi-step research tasks for you.
一个代理,使用推理来综合大量在线信息,并为您完成多步研究任务。
Today we’re launching deep research in ChatGPT, a new agentic capability that conducts multi-step research on the internet for complex tasks. It accomplishes in tens of minutes what would take a human many hours.
今天我们在 ChatGPT 中推出了 deep research,这是一种新的代理能力,可以在互联网上进行复杂任务的多步研究。 它可以在几十分钟内完成人类需要花费数小时才能完成的任务。
Deep research is OpenAI's ne
Language models have outpaced our ability to evaluate them effectively, but for their future development it is essential to study the frontier of their capabilities. We find real-world software engineering to be a rich, sustainable, and challenging testbed for evaluating the next generation of language models. To this end, we introduce SWE-bench, an evaluation framework consisting of 2,294 software engineering problems drawn from real GitHub issues and corresponding pull requests across 12 popular Python repositories. Given a codebase along with a description of an issue to be resolved, a language model is tasked with editing the codebase to address the issue. Resolving issues in SWE-bench frequently requires understanding and coordinating changes across multiple functions, classes, and even files simultaneously, calling for models to interact with execution environments, process extremely long contexts and perform complex reasoning that goes far beyond traditional code generation tasks. Our evaluations show that both state-ofthe-art proprietary models and our fine-tuned model SWE-Llama can resolve only the simplest issues. The best-performing model, Claude 2, is able to solve a mere 1.96% of the issues. Advances on SWE-bench represent steps towards LMs that are more practical, intelligent, and autonomous.
从实验来看,需要用英文描述,中文描述生成的效果不好。
This year is the Year of the Snake. I want to create a lifelike snake, wearing a fiery red new outfit, holding its head high, floating in the air, and writing "Happy New Year 2025" in snake-like font.

今年是蛇年,我想生成一只栩栩如生的蛇,穿着火红色的新衣,高昂着头,悬浮于空,用蛇体字型写上“2025年新年快乐”。

下面的图是快手可灵生成的。

I wanted to create a lifelike snake, with its head held high, suspended in the air.

我想生成一只栩栩如生的蛇,高昂着头,悬浮于空。

Modern abstract digital artwork with a split layout, black on the left and beige on the right. The subject is a beautiful snake woman with smooth skin and bright colors.

Get started quickly with our computer use reference implementation that includes a web interface, Docker container, example tool implementations, and an agent loop.
快速开始使用我们的计算机使用参考实现,其中包括Web界面、Docker容器、示例工具实现和代理循环。
Here’s an example of how to provide computer use tools to Claude using the Messages API:
以下是如何使用消息API为Claude提供计算机使用工具的示例:
Claude can now use computers. The latest version of Claude 3.5 Sonnet can, when run through the appropriate software setup, follow a user’s commands to move a cursor around their computer’s screen, click on relevant locations, and input information via a virtual keyboard, emulating the way people interact with their own computer.
Claude现在可以使用计算机了。最新版本的Claude 3.5 Sonnet可以在通过适当的软件设置后,按照用户的命令在计算机屏幕上移动光标,单击相关位置,并通过虚拟键盘输入信息,模拟人们与自己的计算机交互的方式。
We think this skill—which is currently in public beta—represents a significant breakthrough in AI progress.
Autonomous agents that accomplish complex computer tasks with minimal human interventions have the potential to transform human-computer interaction, significantly enhancing accessibility and productivity. However, existing benchmarks either lack an interactive environment or are limited to environments specific to certain applications or domains, failing to reflect the diverse and complex nature of real-world computer use, thereby limiting the scope of tasks and agent scalability. To address this issue, we introduce OSWORLD, the first-of-its-kind scalable, real computer environment for multimodal agents, supporting task setup, execution-based evaluation, and interactive learning across various operating systems such as Ubuntu, Windows, and macOS. OSWORLD can serve as a unified, integrated computer environment for assessing open-ended computer tasks that involve arbitrary applications. Building upon OSWORLD, we create a benchmark of 369 computer tasks involving real web and desktop apps in open domains, OS file I/O, and workflows spanning multiple applications. Each task example is derived from real-world computer use cases and includes a detailed initial state setup configuration and a custom execution-based evaluation script for reliable, reproducible evaluation. Extensive evaluation of state-of-the-art LLM/VLM-based agents on OSWORLD reveals significant deficiencies in their ability to serve as computer assistants. While humans can accomplish over 72.36% of the tasks, the best model achieves only 12.24% success, primarily struggling with GUI grounding and operational knowledge. Comprehensive analysis using OSWORLD provides valuable insights for developing multimodal generalist agents that were not possible with previous benchmarks. Our code, environment, baseline models, and data are publicly available at https://os-world.github.io.
UI-TARS: Pioneering Automated GUI Interaction with Native Agents(与本地代理进行自动化 GUI 交互的先驱)
This paper introduces UI-TARS, a native GUI agent model that solely perceives the screenshots as input and performs human-like interactions (e.g., keyboard and mouse operations). Unlike prevailing agent frameworks that depend on heavily wrapped commercial models (e.g., GPT-4o) with expert-crafted prompts and workflows, UI-TARS is an end-to-end model that outperforms these sophisticated frameworks.
This document includes extra information to how we evaluated our Computer Using Agent, including (browser/VM) environments, prompts, sampling parameters, and scoring procedures. For more details, read https://openai.com/index/computer-using-agent/.
本文档包括我们如何评估我们的计算机使用代理的额外信息,包括(浏览器/VM)环境,提示,采样参数和评分程序。有关更多详细信息,请阅读 https://openai.com/index/computer-using-agent/ 。
For WebArena and WebVoyager, we run the evals in operator browser instead of playwright browsers since our model relies on the visual action space for navigation (search bar, backward/forward button).
A universal interface for AI to interact with the digital world. AI 与数字世界交互的通用接口。
Today we introduced a research preview of Operator, an agent that can go to the web to perform tasks for you. Powering Operator is Computer-Using Agent (CUA), a model that combines GPT-4o's vision capabilities with advanced reasoning through reinforcement learning. CUA is trained to interact with graphical user interfaces (GUIs)—the buttons, menus, and text fields people see on a screen—just as humans do.
没有找到匹配的文章