---
layout: single
title:  "audio2sub: An Audio-to-Subtitle Tool"
date:   2026-06-10 08:00:00 +0800
categories: [Tools]
tags: [audio2sub, OpenAI Whisper, audio-to-subtitle, VTT, SRT, Python, ffmpeg]
---

A command-line tool built on [OpenAI Whisper](https://github.com/openai/whisper) that batch-transcribes audio files into VTT or SRT subtitles.

---

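## Requirements

| Dependency | Notes |
|------|------|
| Python | ≥ 3.9 (the script uses `argparse.BooleanOptionalAction`, added in Python 3.9) |
| PyTorch | Runtime dependency of Whisper; installed automatically |
| openai-whisper | Speech recognition engine |
| ffmpeg | Audio decoding; installed at the system level |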

---

## Installation

### 1. Install ffmpeg

- macOS:

```bash
brew install ffmpeg
```

- Ubuntu / Debian:

```bash
sudo apt update && sudo apt install ffmpeg
```

### 2. Install openai-whisper

```bash
pip install openai-whisper
```

> This command also pulls in `torch` and other dependencies. On first run, the Whisper model file is downloaded to `~/.cache/whisper/`.
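
Before running the script, it can help to confirm that the prerequisites are in place. The following is a minimal sketch, not part of `audio2sub.py`; the function name `check_environment` and the result keys are made up for illustration. It only checks that the packages can be located, not that they work.

```python
import importlib.util
import shutil
import sys


def check_environment() -> dict:
    """Report whether each prerequisite appears to be present."""
    return {
        # The script relies on argparse.BooleanOptionalAction (Python 3.9+)
        "python_ok": sys.version_info >= (3, 9),
        # ffmpeg must be discoverable on PATH for audio decoding
        "ffmpeg_found": shutil.which("ffmpeg") is not None,
        # find_spec only locates the package; it does not import it
        "whisper_found": importlib.util.find_spec("whisper") is not None,
        "torch_found": importlib.util.find_spec("torch") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'yes' if ok else 'NO'}")
```

Run it with the same interpreter you plan to use for `audio2sub.py`, since packages installed for one Python are not visible to another.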

### ⚠️ Notes for macOS

Install whisper with either miniconda or the system Python:

```bash
# miniconda (recommended; torch was already installed in this environment)
/opt/miniconda/bin/pip install openai-whisper

# or the system Python
/usr/bin/python3 -m pip install openai-whisper
```

---

## The script

Save the following as `audio2sub.py`:

```py
#!/usr/bin/env python3
"""
audio2sub.py - Transcribe audio files into subtitle files (VTT / SRT) with OpenAI Whisper

Usage:
    python audio2sub.py <audio_path> [options]

Arguments:
    audio_path          Path to an audio file or directory (required)
                        For a directory, audio files are collected recursively and subtitled one by one
    --model MODEL       Whisper model name, default base
                        Choices: tiny, base, small, medium, large, large-v2, large-v3
    --language LANG     Audio language, default en (English)
                        e.g. zh (Chinese), ja (Japanese), auto (auto-detect)
    --format FMT        Subtitle format, vtt or srt (default: vtt)
    --device DEVICE     Compute device, auto / cpu / mps / cuda (default: auto, which prefers MPS on Apple Silicon)
    --recursive / --no-recursive
                        Scan subdirectories recursively (default: recursive)
    --skip-existing     Skip audio that already has a matching subtitle file (avoids re-transcribing)
    --output OUTPUT     Output file path (single-file mode only; ignored in directory mode)

Examples:
    python audio2sub.py "Unit 1.mp3"
    python audio2sub.py "Unit 1.mp3" --model small --format srt
    python audio2sub.py "Unit 1.mp3" --model base --language zh
    python audio2sub.py "Unit 1.mp3" --output /tmp/Unit1.vtt
    python audio2sub.py ./SeniorHighSchool/                     # recursive scan
    python audio2sub.py ./SeniorHighSchool/ --no-recursive      # top level only
    python audio2sub.py ./SeniorHighSchool/ --skip-existing     # skip existing subtitles
    python audio2sub.py ./words/ --model small --format srt --language auto
    python audio2sub.py "Unit 1.mp3" --device mps               # force MPS (GPU)
    python audio2sub.py "Unit 1.mp3" --device cpu               # force CPU
"""

import argparse
import os
import sys

AUDIO_EXTENSIONS = {".mp3", ".wav", ".flac", ".m4a", ".ogg", ".wma", ".aac", ".opus"}


def resolve_device(device: str) -> str:
    """Resolve the compute device; in auto mode prefer MPS (Apple Silicon GPU), then CUDA, then CPU."""
    import torch

    if device == "auto":
        if torch.backends.mps.is_available() and torch.backends.mps.is_built():
            return "mps"
        elif torch.cuda.is_available():
            return "cuda"
        else:
            return "cpu"
    return device


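def format_timestamp_vtt(seconds: float) -> str:
    """Convert seconds to a VTT timestamp, HH:MM:SS.mmm."""
    # Round once on total milliseconds so float error cannot drop a millisecond
    total_ms = round(seconds * 1000)
    h, rem = divmod(total_ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"


def format_timestamp_srt(seconds: float) -> str:
    """Convert seconds to an SRT timestamp, HH:MM:SS,mmm (comma before the milliseconds)."""
    # Same conversion as VTT; only the millisecond separator differs
    total_ms = round(seconds * 1000)
    h, rem = divmod(total_ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"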


def build_subtitle_content(segments: list, fmt: str) -> str:
    """Build the subtitle file content for the given format."""
    if fmt == "srt":
        lines = []
        for i, seg in enumerate(segments, 1):
            start = format_timestamp_srt(seg["start"])
            end = format_timestamp_srt(seg["end"])
            text = seg["text"].strip()
            lines.append(str(i))
            lines.append(f"{start} --> {end}")
            lines.append(text)
            lines.append("")
    else:  # vtt
        lines = ["WEBVTT", ""]
        for i, seg in enumerate(segments, 1):
            start = format_timestamp_vtt(seg["start"])
            end = format_timestamp_vtt(seg["end"])
            text = seg["text"].strip()
            lines.append(str(i))
            lines.append(f"{start} --> {end}")
            lines.append(text)
            lines.append("")
    return "\n".join(lines)


def get_output_path(audio_path: str, fmt: str, output_arg: str = None) -> str:
    """Compute the output file path."""
    if output_arg:
        return os.path.abspath(output_arg)
    audio_dir = os.path.dirname(audio_path)
    audio_basename = os.path.splitext(os.path.basename(audio_path))[0]
    ext = ".vtt" if fmt == "vtt" else ".srt"
    return os.path.join(audio_dir, audio_basename + ext)


def collect_audio_files(path: str, recursive: bool = True) -> list:
    """Collect the audio files under a directory (sorted by path), optionally recursing into subdirectories."""
    files = []
    if recursive:
        for root, _dirs, entries in os.walk(path):
            for entry in entries:
                if os.path.splitext(entry)[1].lower() in AUDIO_EXTENSIONS:
                    files.append(os.path.join(root, entry))
    else:
        for entry in os.listdir(path):
            full = os.path.join(path, entry)
            if os.path.isfile(full) and os.path.splitext(entry)[1].lower() in AUDIO_EXTENSIONS:
                files.append(full)
    files.sort(key=lambda f: f.lower())
    return files


def transcribe_one(model, audio_path: str, language: str, fmt: str, output_path: str,
                   skip_existing: bool = False) -> bool:
    """Transcribe one audio file and save its subtitles; return True on success."""
    basename = os.path.basename(audio_path)

    # Skip audio that already has a subtitle file
    if skip_existing and os.path.isfile(output_path):
        print(f"  ⊘ {basename} → subtitle already exists, skipping")
        return True

    try:
        lang_arg = None if language == "auto" else language
        print(f"  Transcribing: {basename} (language={language}) ...")
        result = model.transcribe(audio_path, language=lang_arg, task="transcribe", verbose=False)

        detected_lang = result.get("language", "unknown")
        segments = result["segments"]
        print(f"  Detected language: {detected_lang}, {len(segments)} segments")

        content = build_subtitle_content(segments, fmt)
        with open(output_path, "w", encoding="utf-8") as f:
            f.write(content)

        size_kb = os.path.getsize(output_path) / 1024
        print(f"  ✓ {basename} → {os.path.basename(output_path)}  ({size_kb:.1f} KB, {len(segments)} cues)")
        return True
    except Exception as e:
        print(f"  ✗ {basename} transcription failed: {e}", file=sys.stderr)
        return False


def run(input_path: str, model_name: str, language: str, fmt: str,
        output_arg: str = None, recursive: bool = True, skip_existing: bool = False,
        device: str = "auto") -> None:
    try:
        import whisper
    except ImportError:
        print("Error: openai-whisper not found; install it first: pip install openai-whisper", file=sys.stderr)
        sys.exit(1)

    input_path = os.path.abspath(input_path)

    # Collect the list of audio files to process
    if os.path.isdir(input_path):
        audio_files = collect_audio_files(input_path, recursive=recursive)
        if not audio_files:
            print(f"No audio files found in directory: {input_path}", file=sys.stderr)
            sys.exit(1)
        mode = "dir"
    elif os.path.isfile(input_path):
        audio_files = [input_path]
        mode = "single"
    else:
        print(f"Error: path does not exist: {input_path}", file=sys.stderr)
        sys.exit(1)

    # Load the model once up front so it is not reloaded for every file
    resolved_device = resolve_device(device)
    print(f"Loading model: {model_name} (device={resolved_device}) ...")
    model = whisper.load_model(model_name, device=resolved_device)

    total = len(audio_files)
    success = 0
    fail = 0

    for idx, audio_path in enumerate(audio_files, 1):
        if total > 1:
            rel_path = os.path.relpath(audio_path, input_path) if os.path.isdir(input_path) else os.path.basename(audio_path)
            print(f"\n[{idx}/{total}] {rel_path}")

        out = get_output_path(audio_path, fmt, output_arg if mode == "single" else None)
        ok = transcribe_one(model, audio_path, language, fmt, out, skip_existing=skip_existing)
        if ok:
            success += 1
        else:
            fail += 1

    # Summary
    if total > 1:
        print(f"\n{'='*40}")
        print(f"Done! {success} succeeded, {fail} failed, {total} files total")


def main():
    parser = argparse.ArgumentParser(
        description="Transcribe audio files into subtitle files (based on OpenAI Whisper; supports VTT / SRT)",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog=__doc__.split("Usage:")[1] if "Usage:" in __doc__ else ""
    )
    parser.add_argument("audio_path", help="Path to an audio file or directory")
    parser.add_argument(
        "--model", "-m",
        default="base",
        choices=["tiny", "base", "small", "medium", "large", "large-v2", "large-v3"],
        help="Whisper model (default: base)"
    )
    parser.add_argument(
        "--language", "-l",
        default="en",
        help="Audio language code such as en, zh, ja, or auto to auto-detect (default: en)"
    )
    parser.add_argument(
        "--format", "-f",
        default="vtt",
        choices=["vtt", "srt"],
        help="Subtitle format (default: vtt)"
    )
    parser.add_argument(
        "--output", "-o",
        default=None,
        help="Output file path (single-file mode only; ignored in directory mode)"
    )
    parser.add_argument(
        "--device", "-d",
        default="auto",
        choices=["auto", "cpu", "mps", "cuda"],
        help="Compute device (default: auto, which prefers MPS GPU acceleration on Apple Silicon)"
    )
    parser.add_argument(
        "--recursive", "-r",
        action=argparse.BooleanOptionalAction,
        default=True,
        help="Scan subdirectories recursively (default: recursive; use --no-recursive to disable)"
    )
    parser.add_argument(
        "--skip-existing", "-s",
        action="store_true",
        default=False,
        help="Skip audio that already has a matching subtitle file"
    )

    args = parser.parse_args()
    run(args.audio_path, args.model, args.language, args.format, args.output,
        recursive=args.recursive, skip_existing=args.skip_existing, device=args.device)


if __name__ == "__main__":
    main()
```

---

## Usage

### Basic syntax

```bash
python audio2sub.py <audio_path> [options]
```

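### Parameters

| Parameter | Short | Default | Description |
|------|------|--------|------|
| `audio_path` | — | required | Path to an audio file or directory |
| `--model` | `-m` | `base` | Whisper model (see the table below) |
| `--language` | `-l` | `en` | Language code; `auto` enables auto-detection |
| `--format` | `-f` | `vtt` | Subtitle format: `vtt` or `srt` |
| `--device` | `-d` | `auto` | Compute device: `auto` / `cpu` / `mps` / `cuda` |
| `--recursive` / `--no-recursive` | `-r` | recursive | Whether directory mode scans subdirectories |
| `--skip-existing` | `-s` | off | Skip audio that already has a subtitle file |
| `--output` | `-o` | same directory, same name | Output path (single-file mode only) |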

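### Choosing a model

| Model | Parameters | Approx. model file size | Relative speed |
|------|--------|---------------|---------|
| tiny | 39M | ~75 MB | ★★★★★ |
| base | 74M | ~142 MB | ★★★★ |
| small | 244M | ~466 MB | ★★★ |
| medium | 769M | ~1.5 GB | ★★ |
| large | 1550M | ~2.9 GB | ★ |
| large-v2 | 1550M | ~2.9 GB | ★ |
| large-v3 | 1550M | ~2.9 GB | ★ |

> For English vocabulary audio, `base` is recommended as a balance of speed and accuracy; for Chinese speech, `small` or larger is recommended. Whisper also ships English-only `.en` variants for tiny through medium at about the same size, but the script's `--model` choices do not include them.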

### Supported audio formats

`.mp3` `.wav` `.flac` `.m4a` `.ogg` `.wma` `.aac` `.opus`
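
The script decides whether a file counts as audio purely by its extension, compared case-insensitively. A minimal sketch of that check follows; the helper name `is_supported_audio` is hypothetical, while the extension set is copied from `AUDIO_EXTENSIONS` in the script.

```python
import os

# Same set as AUDIO_EXTENSIONS in audio2sub.py
AUDIO_EXTENSIONS = {".mp3", ".wav", ".flac", ".m4a", ".ogg", ".wma", ".aac", ".opus"}


def is_supported_audio(path: str) -> bool:
    """Return True if the file extension (case-insensitive) is in the supported set."""
    return os.path.splitext(path)[1].lower() in AUDIO_EXTENSIONS


print(is_supported_audio("Unit 1.MP3"))   # extension matched case-insensitively
print(is_supported_audio("notes.txt"))    # not an audio extension
```

Because only the extension is inspected, a mislabeled file passes this check and fails later, when ffmpeg tries to decode it.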

### GPU acceleration

The `--device` option selects the compute device:

| Device | Description |
|------|------|
| `auto` | Automatic selection (default), in the order MPS > CUDA > CPU |
| `mps` | macOS Apple Silicon GPU (Metal Performance Shaders) |
| `cuda` | NVIDIA GPU |
| `cpu` | CPU only |

**macOS Apple Silicon users** need no extra configuration; the default `auto` enables MPS GPU acceleration automatically:

```bash
# Default auto; Apple Silicon picks MPS automatically
python audio2sub.py "Unit 1.mp3"

# Select MPS explicitly
python audio2sub.py "Unit 1.mp3" --device mps

# Force CPU
python audio2sub.py "Unit 1.mp3" --device cpu
```

> **Prerequisites:** PyTorch ≥ 1.12 and macOS ≥ 12.3. Check with:
> ```bash
> python3 -c "import torch; print('MPS available:', torch.backends.mps.is_available())"
> ```

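**Measured performance** (base model, the same 14 MB MP3 file, Apple M-series):

| Device | Transcription time | Notes |
|------|---------|------|
| MPS (GPU) | ~19s | GPU accelerated, no FP16 warning |
| CPU | ~13s | Multi-core CPU, with the FP16 fallback warning |

> Note: with a model as small as `base`, multi-core CPU execution can be faster, as in this measurement; models from `small` upward are more likely to benefit from the GPU, so it is worth timing both on your own hardware.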
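
To repeat this comparison on your own machine, a small timing helper is enough. This is a sketch under stated assumptions: `time_call` is a made-up name, and the Whisper usage in the comments assumes `openai-whisper` is installed and that `Unit 1.mp3` exists.

```python
import time
from typing import Any, Callable, Tuple


def time_call(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Run fn once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Hypothetical usage with Whisper (not executed here):
#   import whisper
#   for device in ("cpu", "mps"):
#       model = whisper.load_model("base", device=device)  # load outside the timed region
#       _, seconds = time_call(model.transcribe, "Unit 1.mp3", language="en")
#       print(f"{device}: {seconds:.1f}s")
```

Loading the model outside the timed region keeps download and initialization costs out of the transcription figure. A single run is noisy, so repeating each measurement a few times gives a fairer picture.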

---

## Examples

### Single file

```bash
# Simplest form (defaults: base model, English, VTT)
python audio2sub.py "Unit 1.mp3"

# Choose the model and format
python audio2sub.py "Unit 1.mp3" --model small --format srt

# Chinese audio
python audio2sub.py "对话.mp3" --language zh

# Auto-detect the language
python audio2sub.py "audio.mp3" --language auto

# Set the output path
python audio2sub.py "Unit 1.mp3" -o /tmp/Unit1.vtt
```

### Batch over a directory

```bash
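# Scan subdirectories recursively (default)
python audio2sub.py ./SeniorHighSchool/

# Scan the top-level directory only
python audio2sub.py ./SeniorHighSchool/ --no-recursive

# Skip existing subtitles (resume an interrupted run)
python audio2sub.py ./SeniorHighSchool/ --skip-existing

# Several options combined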
python audio2sub.py ./SeniorHighSchool/ -m small -f srt -l auto -s
```

### Sample output

```
Loading model: base (device=mps) ...

[1/295] Compulsory1/texts/Unit 1 Listening and speaking 2.mp3
  Transcribing: Unit 1 Listening and speaking 2.mp3 (language=en) ...
  Detected language: en, 174 segments
  ✓ Unit 1 Listening and speaking 2.mp3 → Unit 1 Listening and speaking 2.vtt  (7.8 KB, 174 cues)

[2/295] Compulsory1/texts/Unit 1 Listening and speaking 3.mp3
  Transcribing: Unit 1 Listening and speaking 3.mp3 (language=en) ...
  ...

========================================
Done! 293 succeeded, 2 failed, 295 files total
```

---

## Output formats

### VTT (WebVTT)

```
WEBVTT

1
00:00:00.000 --> 00:00:04.000
Unit 1, precise.

2
00:00:04.000 --> 00:00:06.000
Precise.
```

### SRT (SubRip)

```
1
00:00:00,000 --> 00:00:04,000
Unit 1, precise.

2
00:00:04,000 --> 00:00:06,000
Precise.
```

The differences: VTT has a `WEBVTT` header and uses `.` before the milliseconds; SRT has no header and uses `,`.
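
Since the two timestamp styles differ only in the millisecond separator, one helper can serve both. The sketch below is an illustrative consolidation, not the script's own code; `format_timestamp` and its `separator` parameter are hypothetical names.

```python
def format_timestamp(seconds: float, separator: str = ".") -> str:
    """Format seconds as HH:MM:SS<separator>mmm ('.' for VTT, ',' for SRT)."""
    # Round once on total milliseconds to avoid float truncation
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{separator}{ms:03d}"


print(format_timestamp(4.0))        # VTT style: 00:00:04.000
print(format_timestamp(4.0, ","))   # SRT style: 00:00:04,000
```

Working in integer milliseconds matters because a value such as `1.001` is stored slightly below its decimal form, so truncating the fractional part directly can lose a millisecond.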

---

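## Output file rules

- **Single-file mode**: by default the subtitle is written next to the audio file, with the same name and the format's extension (e.g. `Unit 1.mp3` → `Unit 1.vtt`)
- **Directory mode**: each audio file gets its subtitle file in its own directory, so the directory structure is preserved
- `--output` takes effect only in single-file mode; directory mode ignores it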
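
The naming rule above can be sketched with `pathlib`. This is a simplified illustration: `subtitle_path` is a hypothetical helper and, unlike the script, it does not convert the explicit output path to an absolute one.

```python
from pathlib import Path
from typing import Optional


def subtitle_path(audio: str, fmt: str = "vtt", output: Optional[str] = None) -> Path:
    """Mirror the naming rule: same directory, same stem, subtitle extension."""
    if output is not None:
        # An explicit path wins (single-file mode only)
        return Path(output)
    # with_suffix replaces only the last extension, so "a.b.mp3" becomes "a.b.vtt"
    return Path(audio).with_suffix(f".{fmt}")


print(subtitle_path("lessons/Unit 1.mp3"))
print(subtitle_path("lessons/Unit 1.mp3", "srt"))
```

One consequence of the rule is that two audio files sharing a stem in the same directory, such as `Unit 1.mp3` and `Unit 1.wav`, map to the same subtitle file.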

---

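## FAQ

### Q: Why do I see an `FP16 is not supported on CPU` warning at runtime?

This is expected: on CPU, Whisper falls back to FP32 automatically and the results are unaffected. On a machine with a supported GPU, `--device mps` or `--device auto` (the default) avoids the warning and enables GPU acceleration.

### Q: How do I confirm the GPU is being used?

Check the device shown in the output at startup: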

```
Loading model: base (device=mps) ...    ← GPU in use
Loading model: base (device=cpu) ...    ← GPU not in use
```

### Q: What if MPS mode raises an error?

Some operations may have compatibility problems on MPS; fall back to the CPU:

```bash
python audio2sub.py "Unit 1.mp3" --device cpu
```
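
The manual fallback above could also be automated. The following is a sketch of one possible approach, not something `audio2sub.py` does; `run_with_fallback` and the `task` callable are hypothetical.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def run_with_fallback(task: Callable[[str], T], devices: Sequence[str] = ("mps", "cpu")) -> T:
    """Call task(device) for each device in order; return the first result that succeeds."""
    last_error = None
    for device in devices:
        try:
            return task(device)
        except Exception as exc:  # broad on purpose: backend errors vary by operation
            print(f"{device} failed ({exc}); trying next device")
            last_error = exc
    raise RuntimeError("all devices failed") from last_error


# Hypothetical usage with Whisper (not executed here):
#   import whisper
#   result = run_with_fallback(
#       lambda d: whisper.load_model("base", device=d).transcribe("Unit 1.mp3")
#   )
```

The trade-off is that a failure on the first device costs a second model load, so for large batches it may be simpler to pass `--device cpu` once the problem is known.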

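### Q: How can I improve recognition accuracy?

1. Use a larger model: `--model small` or `--model medium`
2. Specify the correct language: `--language en` (an explicit language is more reliable than auto-detection)
3. Ensure good audio quality: low background noise and clear speech give better results

### Q: What if a large batch run is interrupted?

Re-run with `--skip-existing`; audio that already has subtitles is skipped automatically: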

```bash
python audio2sub.py ./SeniorHighSchool/ -s
```

---

## Tested environment

This tool was verified in the following environment:

| Item | Value |
|------|------|
| OS | macOS ARM64 (Apple Silicon) |
| Python | 3.10.9 (miniconda) |
| PyTorch | 2.11.0 |
| openai-whisper | 20250625 |
| ffmpeg | installed via brew |