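audio2sub: an audio-to-subtitle tool

A command-line tool built on OpenAI Whisper that batch-transcribes audio files into VTT or SRT subtitles.

Requirements

Dependency	Notes
Python	≥ 3.9 (the script uses argparse.BooleanOptionalAction, which was added in Python 3.9)
PyTorch	Runtime dependency of Whisper; installed automatically
openai-whisper	Speech recognition engine
ffmpeg	Audio decoding; installed at the system level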

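Installation

1. Install ffmpeg

macOS:

brew install ffmpeg

Ubuntu / Debian:

sudo apt update && sudo apt install ffmpeg

2. Install openai-whisper

pip install openai-whisper

This command also pulls in torch and the other dependencies. The first time a given model is used, Whisper downloads its weights to ~/.cache/whisper/.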
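Before the first run it can help to confirm that both the system-level ffmpeg and the Python packages are visible to the interpreter you plan to use. The following is a minimal sketch using only the standard library; `check_dependencies` is a hypothetical helper name, not part of audio2sub.py.

```python
import importlib.util
import shutil


def check_dependencies(modules=("whisper", "torch"), binaries=("ffmpeg",)):
    """Return a list of problems found; an empty list means everything was located."""
    problems = []
    for name in binaries:
        # shutil.which returns None when the executable is not on PATH
        if shutil.which(name) is None:
            problems.append(f"executable not found on PATH: {name}")
    for name in modules:
        # find_spec locates a top-level module without importing it
        if importlib.util.find_spec(name) is None:
            problems.append(f"Python module not installed: {name}")
    return problems


if __name__ == "__main__":
    for line in check_dependencies() or ["all dependencies found"]:
        print(line)
```

Run it with the same interpreter that will run the script, since a package installed for one Python is not visible to another.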

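⚠️ Notes for macOS

Install whisper with the pip that belongs to the Python interpreter you will run the script with, either miniconda or the system Python (adjust the miniconda path to your installation):

# miniconda (recommended; torch was already installed in this environment)
/opt/miniconda/bin/pip install openai-whisper

# or the system Python
/usr/bin/python3 -m pip install openai-whisper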

Script

Save the following as audio2sub.py:

#!/usr/bin/env python3
"""
audio2sub.py - Transcribe audio files into subtitle files (VTT / SRT) with OpenAI Whisper

Usage:
    python audio2sub.py <audio_path> [options]

Arguments:
    audio_path          Path to an audio file or a directory (required)
                        For a directory, every audio file found in it
                        (recursively by default) gets its own subtitle file
    --model MODEL       Whisper model name (default: base)
                        Choices: tiny, base, small, medium, large, large-v2, large-v3
    --language LANG     Audio language (default: en, English)
                        e.g. zh (Chinese), ja (Japanese), auto (auto-detect)
    --format FMT        Subtitle format, vtt or srt (default: vtt)
    --device DEVICE     Compute device: auto / cpu / mps / cuda
                        (default: auto, which tries MPS, then CUDA, then CPU)
    --recursive / --no-recursive
                        Scan subdirectories (default: recursive)
    --skip-existing     Skip audio files whose subtitle file already exists
    --output OUTPUT     Output file path (single-file mode only; ignored in directory mode)

Examples:
    python audio2sub.py "Unit 1.mp3"
    python audio2sub.py "Unit 1.mp3" --model small --format srt
    python audio2sub.py "Unit 1.mp3" --model base --language zh
    python audio2sub.py "Unit 1.mp3" --output /tmp/Unit1.vtt
    python audio2sub.py ./SeniorHighSchool/                     # recursive scan
    python audio2sub.py ./SeniorHighSchool/ --no-recursive      # top level only
    python audio2sub.py ./SeniorHighSchool/ --skip-existing     # skip existing subtitles
    python audio2sub.py ./words/ --model small --format srt --language auto
    python audio2sub.py "Unit 1.mp3" --device mps               # force the Apple GPU
    python audio2sub.py "Unit 1.mp3" --device cpu               # force the CPU
"""

import argparse
import os
import sys

AUDIO_EXTENSIONS = {".mp3", ".wav", ".flac", ".m4a", ".ogg", ".wma", ".aac", ".opus"}


def resolve_device(device: str) -> str:
    """Resolve the compute device; in auto mode prefer MPS (Apple Silicon GPU), then CUDA, then CPU."""
    import torch

    if device == "auto":
        if torch.backends.mps.is_available() and torch.backends.mps.is_built():
            return "mps"
        elif torch.cuda.is_available():
            return "cuda"
        else:
            return "cpu"
    return device


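def _split_timestamp(seconds: float) -> tuple:
    """Split seconds into (hours, minutes, seconds, milliseconds).

    Rounds the total to the nearest millisecond first, so float error
    (e.g. 4.6 % 1 is slightly below 0.6) cannot truncate 600 ms to 599 ms.
    """
    total_ms = max(0, int(round(seconds * 1000)))
    s, ms = divmod(total_ms, 1000)
    m, s = divmod(s, 60)
    h, m = divmod(m, 60)
    return h, m, s, ms


def format_timestamp_vtt(seconds: float) -> str:
    """Convert seconds to a VTT timestamp, HH:MM:SS.mmm"""
    h, m, s, ms = _split_timestamp(seconds)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"


def format_timestamp_srt(seconds: float) -> str:
    """Convert seconds to an SRT timestamp, HH:MM:SS,mmm (comma separator)"""
    h, m, s, ms = _split_timestamp(seconds)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"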


def build_subtitle_content(segments: list, fmt: str) -> str:
    """Build the subtitle text for the given format ("srt", otherwise VTT)."""
    if fmt == "srt":
        lines = []
        format_timestamp = format_timestamp_srt
    else:  # vtt
        lines = ["WEBVTT", ""]
        format_timestamp = format_timestamp_vtt

    # Each cue is: index, timing line, text, blank separator line
    for i, seg in enumerate(segments, 1):
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        lines.append(str(i))
        lines.append(f"{start} --> {end}")
        lines.append(seg["text"].strip())
        lines.append("")
    return "\n".join(lines)


def get_output_path(audio_path: str, fmt: str, output_arg: str = None) -> str:
    """Compute the output path: output_arg if given, else the audio path with the subtitle extension."""
    if output_arg:
        return os.path.abspath(output_arg)
    audio_dir = os.path.dirname(audio_path)
    audio_basename = os.path.splitext(os.path.basename(audio_path))[0]
    ext = ".vtt" if fmt == "vtt" else ".srt"
    return os.path.join(audio_dir, audio_basename + ext)


def collect_audio_files(path: str, recursive: bool = True) -> list:
    """Collect the audio files in a directory (sorted by path), optionally descending into subdirectories."""
    files = []
    if recursive:
        for root, _dirs, entries in os.walk(path):
            for entry in entries:
                if os.path.splitext(entry)[1].lower() in AUDIO_EXTENSIONS:
                    files.append(os.path.join(root, entry))
    else:
        for entry in os.listdir(path):
            full = os.path.join(path, entry)
            if os.path.isfile(full) and os.path.splitext(entry)[1].lower() in AUDIO_EXTENSIONS:
                files.append(full)
    files.sort(key=lambda f: f.lower())
    return files


def transcribe_one(model, audio_path: str, language: str, fmt: str, output_path: str,
                   skip_existing: bool = False) -> bool:
    """Transcribe one audio file and write its subtitle file; return True on success."""
    basename = os.path.basename(audio_path)

    # Skip files that already have a subtitle; returns True, so the caller counts it as a success
    if skip_existing and os.path.isfile(output_path):
        print(f"  ⊘ {basename} → subtitle already exists, skipping")
        return True

    try:
        # Whisper auto-detects the language when language is None
        lang_arg = None if language == "auto" else language
        print(f"  Transcribing: {basename} (language={language}) ...")
        result = model.transcribe(audio_path, language=lang_arg, task="transcribe", verbose=False)

        detected_lang = result.get("language", "unknown")
        segments = result["segments"]
        print(f"  Detected language: {detected_lang}, {len(segments)} segments")

        content = build_subtitle_content(segments, fmt)
        with open(output_path, "w", encoding="utf-8") as f:
            f.write(content)

        size_kb = os.path.getsize(output_path) / 1024
        print(f"  ✓ {basename} → {os.path.basename(output_path)}  ({size_kb:.1f} KB, {len(segments)} cues)")
        return True
    except Exception as e:
        # Report the failure and continue with the next file instead of aborting the batch
        print(f"  ✗ {basename} transcription failed: {e}", file=sys.stderr)
        return False


def run(input_path: str, model_name: str, language: str, fmt: str,
        output_arg: str = None, recursive: bool = True, skip_existing: bool = False,
        device: str = "auto") -> None:
    """Collect the audio files, load the model once, and transcribe each file."""
    try:
        import whisper
    except ImportError:
        print("Error: openai-whisper not found; install it first: pip install openai-whisper", file=sys.stderr)
        sys.exit(1)

    input_path = os.path.abspath(input_path)

    # Build the list of audio files to process
    if os.path.isdir(input_path):
        audio_files = collect_audio_files(input_path, recursive=recursive)
        if not audio_files:
            print(f"No audio files found in directory: {input_path}", file=sys.stderr)
            sys.exit(1)
        mode = "dir"
    elif os.path.isfile(input_path):
        audio_files = [input_path]
        mode = "single"
    else:
        print(f"Error: path does not exist: {input_path}", file=sys.stderr)
        sys.exit(1)

    # Load the model once and reuse it for every file
    resolved_device = resolve_device(device)
    print(f"Loading model: {model_name} (device={resolved_device}) ...")
    model = whisper.load_model(model_name, device=resolved_device)

    total = len(audio_files)
    success = 0
    fail = 0

    for idx, audio_path in enumerate(audio_files, 1):
        if total > 1:
            # Show paths relative to the input directory to keep the log readable
            rel_path = os.path.relpath(audio_path, input_path) if mode == "dir" else os.path.basename(audio_path)
            print(f"\n[{idx}/{total}] {rel_path}")

        # --output is honored only in single-file mode
        out = get_output_path(audio_path, fmt, output_arg if mode == "single" else None)
        ok = transcribe_one(model, audio_path, language, fmt, out, skip_existing=skip_existing)
        if ok:
            success += 1
        else:
            fail += 1

    # Summary
    if total > 1:
        print(f"\n{'='*40}")
        print(f"Done! {success} succeeded, {fail} failed, {total} files in total")


def main():
    # Reuse the module docstring, from its "Usage:" section onward, as the --help epilog.
    # partition() yields an empty string when the marker is absent.
    epilog = (__doc__ or "").partition("Usage:")[2]
    parser = argparse.ArgumentParser(
        description="Transcribe audio files into subtitle files (based on OpenAI Whisper; VTT / SRT)",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog=epilog,
    )
    parser.add_argument("audio_path", help="path to an audio file or a directory")
    parser.add_argument(
        "--model", "-m",
        default="base",
        choices=["tiny", "base", "small", "medium", "large", "large-v2", "large-v3"],
        help="Whisper model (default: base)"
    )
    parser.add_argument(
        "--language", "-l",
        default="en",
        help="audio language code such as en, zh, ja, or auto to auto-detect (default: en)"
    )
    parser.add_argument(
        "--format", "-f",
        default="vtt",
        choices=["vtt", "srt"],
        help="subtitle format (default: vtt)"
    )
    parser.add_argument(
        "--output", "-o",
        default=None,
        help="output file path (single-file mode only; ignored in directory mode)"
    )
    parser.add_argument(
        "--device", "-d",
        default="auto",
        choices=["auto", "cpu", "mps", "cuda"],
        help="compute device (default: auto, which prefers MPS GPU acceleration on Apple Silicon)"
    )
    # BooleanOptionalAction (Python 3.9+) generates the matching --no-recursive flag
    parser.add_argument(
        "--recursive", "-r",
        action=argparse.BooleanOptionalAction,
        default=True,
        help="scan subdirectories (default: recursive; use --no-recursive to disable)"
    )
    parser.add_argument(
        "--skip-existing", "-s",
        action="store_true",
        default=False,
        help="skip audio files whose subtitle file already exists"
    )

    args = parser.parse_args()
    run(args.audio_path, args.model, args.language, args.format, args.output,
        recursive=args.recursive, skip_existing=args.skip_existing, device=args.device)


if __name__ == "__main__":
    main()

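Usage

Basic syntax

python audio2sub.py <audio_path> [options]

Parameters

Parameter	Short	Default	Description
`audio_path`	(none)	required	Path to an audio file or a directory
`--model`	`-m`	`base`	Whisper model (see the table below)
`--language`	`-l`	`en`	Language code; `auto` enables auto-detection
`--format`	`-f`	`vtt`	Subtitle format: `vtt` or `srt`
`--device`	`-d`	`auto`	Compute device: `auto` / `cpu` / `mps` / `cuda`
`--recursive` / `--no-recursive`	`-r`	recursive	Whether directory mode scans subdirectories
`--skip-existing`	`-s`	off	Skip audio files that already have a subtitle file
`--output`	`-o`	same directory and base name as the audio	Output path (single-file mode only)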

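Model selection

Model	Parameters	Download size (approx.)	Relative speed
tiny	39M	~75 MB	★★★★★
base	74M	~142 MB	★★★★
small	244M	~466 MB	★★★
medium	769M	~1.5 GB	★★
large	1550M	~2.9 GB	★
large-v2	1550M	~2.9 GB	★
large-v3	1550M	~2.9 GB	★

Sizes are for the multilingual checkpoints, which are the only ones the script's --model option accepts. For English vocabulary audio, base is recommended as a balance of speed and accuracy; for Chinese speech, use small or larger.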

Supported audio formats

.mp3 .wav .flac .m4a .ogg .wma .aac .opus
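The script decides whether a file is audio purely from its extension, compared case-insensitively. A minimal sketch of that check, using pathlib instead of the script's os.path calls; `is_audio_file` is a hypothetical helper name introduced here for illustration.

```python
from pathlib import Path

# Same extension set as AUDIO_EXTENSIONS in the script
AUDIO_EXTENSIONS = {".mp3", ".wav", ".flac", ".m4a", ".ogg", ".wma", ".aac", ".opus"}


def is_audio_file(path) -> bool:
    """True when the path's final extension, lowercased, is in AUDIO_EXTENSIONS."""
    return Path(path).suffix.lower() in AUDIO_EXTENSIONS


print(is_audio_file("Unit 1.MP3"))       # upper-case extension still matches
print(is_audio_file("archive.mp3.bak"))  # only the final extension counts
```

Because only the extension is inspected, a mislabeled file is still handed to Whisper and fails at decode time; the script reports that as a failed file and moves on.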

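GPU acceleration

The script selects the compute device through the --device option:

Device	Description
`auto`	Automatic selection (default), in the order MPS > CUDA > CPU
`mps`	macOS Apple Silicon GPU (Metal Performance Shaders)
`cuda`	NVIDIA GPU
`cpu`	CPU only

On macOS with Apple Silicon no extra configuration is needed; the default auto enables MPS GPU acceleration:

# Default auto; Apple Silicon uses MPS automatically
python audio2sub.py "Unit 1.mp3"

# Explicitly select MPS
python audio2sub.py "Unit 1.mp3" --device mps

# Force the CPU
python audio2sub.py "Unit 1.mp3" --device cpu

Prerequisites: PyTorch ≥ 1.12 and macOS ≥ 12.3. To check:

python3 -c "import torch; print('MPS available:', torch.backends.mps.is_available())"

Measured timings (base model, the same 14 MB MP3 file, Apple M-series):

Device	Transcription time	Notes
MPS (GPU)	~19s	GPU-accelerated; no FP16 warning
CPU	~13s	Multi-core CPU; emits the FP16 fallback warning

Note: with a model as small as base, multi-core CPU execution can be faster, as it was here; with small and larger models the GPU advantage is clear.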
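To repeat this kind of comparison on your own hardware, a small timing wrapper is enough. This is a sketch under stated assumptions: `time_call` is a hypothetical helper, and the commented usage assumes a model already loaded with whisper.load_model, as the script does.

```python
import time


def time_call(fn, *args, repeats=3, **kwargs):
    """Call fn `repeats` times and return (best_seconds, last_result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        # Keep the fastest run to reduce noise from warm-up and background load
        best = min(best, time.perf_counter() - start)
    return best, result


# Hypothetical usage with a loaded Whisper model (not executed here):
# seconds, result = time_call(model.transcribe, "Unit 1.mp3", language="en", repeats=1)
```

Time the same file with the same model on each device, and exclude model loading from the measurement, so that only transcription is compared.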

Examples

Single file

# Simplest form (defaults: base model, English, VTT)
python audio2sub.py "Unit 1.mp3"

# Choose the model and the format
python audio2sub.py "Unit 1.mp3" --model small --format srt

# Chinese audio
python audio2sub.py "对话.mp3" --language zh

# Auto-detect the language
python audio2sub.py "audio.mp3" --language auto

# Choose the output path
python audio2sub.py "Unit 1.mp3" -o /tmp/Unit1.vtt

Directory batch

# Recurse into subdirectories (default behavior)
python audio2sub.py ./SeniorHighSchool/

# Scan the top-level directory only
python audio2sub.py ./SeniorHighSchool/ --no-recursive

# Skip audio that already has subtitles (resume an interrupted run)
python audio2sub.py ./SeniorHighSchool/ --skip-existing

# All options combined
python audio2sub.py ./SeniorHighSchool/ -m small -f srt -l auto -s

Sample output

Loading model: base (device=mps) ...

[1/295] Compulsory1/texts/Unit 1 Listening and speaking 2.mp3
  Transcribing: Unit 1 Listening and speaking 2.mp3 (language=en) ...
  Detected language: en, 174 segments
  ✓ Unit 1 Listening and speaking 2.mp3 → Unit 1 Listening and speaking 2.vtt  (7.8 KB, 174 cues)

[2/295] Compulsory1/texts/Unit 1 Listening and speaking 3.mp3
  Transcribing: Unit 1 Listening and speaking 3.mp3 (language=en) ...
  ...

========================================
Done! 293 succeeded, 2 failed, 295 files in total

Output formats

VTT (WebVTT)

WEBVTT

1
00:00:00.000 --> 00:00:04.000
Unit 1, precise.

2
00:00:04.000 --> 00:00:06.000
Precise.

SRT (SubRip)

1
00:00:00,000 --> 00:00:04,000
Unit 1, precise.

2
00:00:04,000 --> 00:00:06,000
Precise.

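The two differ in two ways: VTT starts with a WEBVTT header and separates milliseconds with a period (.), while SRT has no header and separates milliseconds with a comma (,).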
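Since those are the only two differences in the files this script writes, converting one of its VTT files to SRT is mechanical. The sketch below assumes exactly that layout (numbered cues, HH:MM:SS.mmm timings, no cue settings); it is not a general WebVTT parser, and `vtt_to_srt` is a hypothetical helper name.

```python
import re

# Matches a cue timing line such as "00:00:04.000 --> 00:00:06.000"
_TIMING = re.compile(r"^(\d{2,}:\d{2}:\d{2})\.(\d{3}) --> (\d{2,}:\d{2}:\d{2})\.(\d{3})$")


def vtt_to_srt(vtt_text: str) -> str:
    """Convert numbered-cue WebVTT text, as written by the script, to SRT."""
    lines = vtt_text.splitlines()
    # Drop the "WEBVTT" header line and the blank line(s) that follow it
    if lines and lines[0].startswith("WEBVTT"):
        lines = lines[1:]
        while lines and not lines[0].strip():
            lines = lines[1:]
    # Swap "." for "," on timing lines only, leaving cue text untouched
    converted = [_TIMING.sub(r"\1,\2 --> \3,\4", line) for line in lines]
    return "\n".join(converted)


sample = "WEBVTT\n\n1\n00:00:00.000 --> 00:00:04.000\nUnit 1, precise.\n"
print(vtt_to_srt(sample))
```

Restricting the substitution to lines that match the timing pattern matters: a cue whose text contains a period, such as "Precise.", must come through unchanged.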

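File output rules

Single-file mode: by default the subtitle is written next to the audio file, with the same base name plus the format extension (e.g. Unit 1.mp3 → Unit 1.vtt).
Directory mode: each audio file gets its subtitle file in its own directory, so the directory structure is preserved.
--output only takes effect in single-file mode; directory mode ignores it.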
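The naming rule can be expressed in a few lines. This is a sketch with pathlib rather than the script's os.path code, and unlike the script it does not make the explicit output path absolute; `subtitle_path` is a hypothetical name.

```python
from pathlib import Path
from typing import Optional


def subtitle_path(audio_path: str, fmt: str = "vtt", output: Optional[str] = None) -> Path:
    """An explicit output path wins; otherwise keep the directory and base name."""
    if output is not None:
        return Path(output)
    # with_suffix replaces only the final extension: "Unit 1.mp3" -> "Unit 1.vtt"
    return Path(audio_path).with_suffix("." + fmt)


print(subtitle_path("lessons/Unit 1.mp3"))
print(subtitle_path("lessons/Unit 1.mp3", "srt"))
```

One consequence of this rule is that two audio files in the same directory that share a base name, such as a.mp3 and a.wav, map to the same subtitle file, so the later one overwrites the earlier one.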

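FAQ

Q: I get a `FP16 is not supported on CPU` warning at runtime?

This is expected: on the CPU, Whisper falls back to FP32 automatically and the result is unaffected. On Apple Silicon, --device mps or the default --device auto avoids the warning and enables GPU acceleration.

Q: How do I confirm that the GPU is being used?

Check the device reported when the model is loaded:

Loading model: base (device=mps) ...    ← GPU in use
Loading model: base (device=cpu) ...    ← GPU not in use

Q: What if MPS mode fails with an error?

Some operations may have compatibility problems on MPS; fall back to the CPU: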

python audio2sub.py "Unit 1.mp3" --device cpu

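Q: How can I improve recognition accuracy?

Use a larger model: --model small or --model medium
Specify the correct language: --language en (an explicit language is more reliable than auto detection)
Make sure the audio is clean: low background noise and clear pronunciation give better results

Q: A large batch run was interrupted. What now?

Re-run with --skip-existing; audio files that already have subtitles are skipped automatically: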

python audio2sub.py ./SeniorHighSchool/ -s
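Before resuming, it can be useful to see how much work is left. The sketch below lists audio files that do not yet have a sibling subtitle file, using the same extension set and naming rule as the script; `pending_audio` is a hypothetical helper, not part of audio2sub.py.

```python
from pathlib import Path

# Same extension set as AUDIO_EXTENSIONS in the script
AUDIO_EXTENSIONS = {".mp3", ".wav", ".flac", ".m4a", ".ogg", ".wma", ".aac", ".opus"}


def pending_audio(root: str, fmt: str = "vtt") -> list:
    """List audio files under root that have no sibling subtitle file yet."""
    pending = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in AUDIO_EXTENSIONS:
            # The subtitle lives next to the audio, same base name, new extension
            if not path.with_suffix("." + fmt).is_file():
                pending.append(path)
    return pending


if __name__ == "__main__":
    remaining = pending_audio(".")
    print(f"{len(remaining)} audio file(s) still need subtitles")
```

Pass the same format you transcribe with: a directory fully covered by .vtt files still counts as entirely pending for srt.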

Tested environment

The tool was verified in the following environment:

Item	Value
OS	macOS ARM64 (Apple Silicon)
Python	3.10.9 (miniconda)
PyTorch	2.11.0
openai-whisper	20250625
ffmpeg	installed via brew
