---
layout: single
title:  "[Machine Learning in the Era of Generative AI (2025)] Lecture 9: A Few Things About Evaluating Large Language Models"
date:   2025-06-07 06:00:00 +0800
categories: [AI and Large Models, Data Science]
tags: [Model Evaluation, Machine Learning in the Era of Generative AI 2025, Generative AI, Machine Learning, Hung-yi Lee]
---

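This post covers the **evaluation of large language models**, with a focus on **reasoning ability** and **memorization effects**. It presents benchmark results, such as how **DeepSeek** and **OpenAI models perform on reasoning tasks**, and shows how **accuracy drops** in cases where a model's answers may come from "memorization" rather than reasoning. It also introduces **ARC-AGI (the Abstraction and Reasoning Corpus for Artificial General Intelligence)** as an evaluation framework, and discusses **Chatbot Arena and its Elo rating system**, which is used to **measure and compare models based on real user interactions**, including **sentiment and style control**.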

<!-- more -->

![](/images/2025/HungYiLee/09-ModelEvaluation/01.jpg)

![](/images/2025/HungYiLee/09-ModelEvaluation/02.jpg)

![](/images/2025/HungYiLee/09-ModelEvaluation/03.jpg)

![](/images/2025/HungYiLee/09-ModelEvaluation/04.jpg)

![](/images/2025/HungYiLee/09-ModelEvaluation/05.jpg)

![](/images/2025/HungYiLee/09-ModelEvaluation/06.jpg)

![](/images/2025/HungYiLee/09-ModelEvaluation/07.jpg)

![](/images/2025/HungYiLee/09-ModelEvaluation/08.jpg)

![](/images/2025/HungYiLee/09-ModelEvaluation/09.jpg)

![](/images/2025/HungYiLee/09-ModelEvaluation/10.jpg)

![](/images/2025/HungYiLee/09-ModelEvaluation/11.jpg)

![](/images/2025/HungYiLee/09-ModelEvaluation/12.jpg)

![](/images/2025/HungYiLee/09-ModelEvaluation/13.jpg)

![](/images/2025/HungYiLee/09-ModelEvaluation/14.jpg)

![](/images/2025/HungYiLee/09-ModelEvaluation/15.jpg)

![](/images/2025/HungYiLee/09-ModelEvaluation/16.jpg)
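
To make the Elo rating idea behind Chatbot Arena more concrete, here is a minimal sketch of the textbook Elo update applied to pairwise model "battles". It is an illustration under stated assumptions, not Chatbot Arena's actual implementation: the function names, the `k=32` step size, the starting rating of 1000, and the model names `model-x` and `model-y` are all hypothetical choices made for this example.

```python
from typing import Dict, Iterable, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(
    rating_a: float, rating_b: float, score_a: float, k: float = 32.0
) -> Tuple[float, float]:
    """Return updated (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A wins, 0.0 if A loses, and 0.5 for a tie.
    k is an assumed step size; larger values react faster to new results.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # Whatever A gains, B loses, so the total rating is conserved.
    return rating_a + delta, rating_b - delta


def rate_battles(
    battles: Iterable[Tuple[str, str, float]],
    initial: float = 1000.0,
    k: float = 32.0,
) -> Dict[str, float]:
    """Process (model_a, model_b, score_a) records in order and return ratings."""
    ratings: Dict[str, float] = {}
    for model_a, model_b, score_a in battles:
        ra = ratings.setdefault(model_a, initial)
        rb = ratings.setdefault(model_b, initial)
        ratings[model_a], ratings[model_b] = update_elo(ra, rb, score_a, k)
    return ratings


if __name__ == "__main__":
    # Hypothetical votes: model-x wins once, then the two models tie.
    votes = [("model-x", "model-y", 1.0), ("model-y", "model-x", 0.5)]
    print(rate_battles(votes))
```

With two models that both start at 1000, a single win moves the winner to 1016 and the loser to 984, because the expected score of each side is 0.5. Note that this sequential update depends on the order in which votes are processed, and it only sees who won, so factors such as response style are not separated out by this basic formula.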

- [【生成式AI時代下的機器學習(2025)】第九講：你這麽認這個評分系統幹什麽啊？談談有關大型語言模型評估的幾件事](https://www.youtube.com/watch?v=s266BzGNKKc)