whisper.cpp
Category: Whisper Tags: Whisper NEON MPS CoreML MacBookProM2Max
NEON & MPS 🆚 CoreML
Download the model (large-v3)
models/download-ggml-model.sh large-v3
NEON & MPS
Build
make clean
make -j
main help
./main --help
usage: ./main [options] file0.wav file1.wav ...
options:
-h, --help [default] show this help message and exit
-t N, --threads N [4 ] number of threads to use during computation
-p N, --processors N [1 ] number of processors to use during computation
-ot N, --offset-t N [0 ] time offset in milliseconds
-on N, --offset-n N [0 ] segment index offset
-d N, --duration N [0 ] duration of audio to process in milliseconds
-mc N, --max-context N [-1 ] maximum number of text context tokens to store
-ml N, --max-len N [0 ] maximum segment length in characters
-sow, --split-on-word [false ] split on word rather than on token
-bo N, --best-of N [5 ] number of best candidates to keep
-bs N, --beam-size N [5 ] beam size for beam search
-wt N, --word-thold N [0.01 ] word timestamp probability threshold
-et N, --entropy-thold N [2.40 ] entropy threshold for decoder fail
-lpt N, --logprob-thold N [-1.00 ] log probability threshold for decoder fail
-debug, --debug-mode [false ] enable debug mode (eg. dump log_mel)
-tr, --translate [false ] translate from source language to english
-di, --diarize [false ] stereo audio diarization
-tdrz, --tinydiarize [false ] enable tinydiarize (requires a tdrz model)
-nf, --no-fallback [false ] do not use temperature fallback while decoding
-otxt, --output-txt [false ] output result in a text file
-ovtt, --output-vtt [false ] output result in a vtt file
-osrt, --output-srt [false ] output result in a srt file
-olrc, --output-lrc [false ] output result in a lrc file
-owts, --output-words [false ] output script for generating karaoke video
-fp, --font-path [/System/Library/Fonts/Supplemental/Courier New Bold.ttf] path to a monospace font for karaoke video
-ocsv, --output-csv [false ] output result in a CSV file
-oj, --output-json [false ] output result in a JSON file
-ojf, --output-json-full [false ] include more information in the JSON file
-of FNAME, --output-file FNAME [ ] output file path (without file extension)
-ps, --print-special [false ] print special tokens
-pc, --print-colors [false ] print colors
-pp, --print-progress [false ] print progress
-nt, --no-timestamps [false ] do not print timestamps
-l LANG, --language LANG [en ] spoken language ('auto' for auto-detect)
-dl, --detect-language [false ] exit after automatically detecting language
--prompt PROMPT [ ] initial prompt
-m FNAME, --model FNAME [models/ggml-base.en.bin] model path
-f FNAME, --file FNAME [ ] input WAV file path
-oved D, --ov-e-device DNAME [CPU ] the OpenVINO device used for encode inference
-ls, --log-score [false ] log best decoder scores of tokens
-ng, --no-gpu [false ] disable GPU
Speech recognition
time ./main -m models/ggml-large-v3.bin -f test.wav -l auto
whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-large-v3.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51866
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 1280
whisper_model_load: n_text_head = 20
whisper_model_load: n_text_layer = 32
whisper_model_load: n_mels = 128
whisper_model_load: ftype = 1
whisper_model_load: qntvr = 0
whisper_model_load: type = 5 (large v3)
whisper_model_load: adding 1609 extra tokens
whisper_model_load: n_langs = 100
whisper_backend_init: using Metal backend
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Max
ggml_metal_init: picking default device: Apple M2 Max
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: loading '/Users/junjian/GitHub/ggerganov/whisper.cpp/ggml-metal.metal'
ggml_metal_init: GPU name: Apple M2 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 51539.61 MB
ggml_metal_init: maxTransferRate = built-in GPU
ggml_metal_add_buffer: allocated 'backend ' buffer, size = 3117.88 MB, ( 3118.53 / 51539.61)
whisper_model_load: Metal buffer size = 3117.87 MB
whisper_model_load: model size = 3117.39 MB
whisper_backend_init: using Metal backend
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Max
ggml_metal_init: picking default device: Apple M2 Max
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: loading '/Users/junjian/GitHub/ggerganov/whisper.cpp/ggml-metal.metal'
ggml_metal_init: GPU name: Apple M2 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 51539.61 MB
ggml_metal_init: maxTransferRate = built-in GPU
ggml_metal_add_buffer: allocated 'backend ' buffer, size = 220.20 MB, ( 3338.73 / 51539.61)
whisper_init_state: kv self size = 220.20 MB
ggml_metal_add_buffer: allocated 'backend ' buffer, size = 245.76 MB, ( 3584.49 / 51539.61)
whisper_init_state: kv cross size = 245.76 MB
ggml_metal_add_buffer: allocated 'backend ' buffer, size = 0.02 MB, ( 3584.51 / 51539.61)
whisper_init_state: compute buffer (conv) = 32.36 MB
ggml_metal_add_buffer: allocated 'backend ' buffer, size = 0.02 MB, ( 3584.52 / 51539.61)
whisper_init_state: compute buffer (encode) = 212.36 MB
ggml_metal_add_buffer: allocated 'backend ' buffer, size = 0.02 MB, ( 3584.54 / 51539.61)
whisper_init_state: compute buffer (cross) = 9.32 MB
ggml_metal_add_buffer: allocated 'backend ' buffer, size = 0.02 MB, ( 3584.56 / 51539.61)
whisper_init_state: compute buffer (decode) = 99.17 MB
ggml_metal_add_buffer: allocated 'backend ' buffer, size = 30.72 MB, ( 3615.28 / 51539.61)
ggml_metal_add_buffer: allocated 'backend ' buffer, size = 210.73 MB, ( 3826.01 / 51539.61)
ggml_metal_add_buffer: allocated 'backend ' buffer, size = 7.68 MB, ( 3833.69 / 51539.61)
ggml_metal_add_buffer: allocated 'backend ' buffer, size = 97.53 MB, ( 3931.23 / 51539.61)
system_info: n_threads = 4 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | METAL = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0 |
main: processing 'test.wav' (6939648 samples, 433.7 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = auto, task = transcribe, timestamps = 1 ...
whisper_full_with_state: auto-detected language: zh (p = 0.916022)
[00:00:00.000 --> 00:00:12.760] 三的前面的一个数字
[00:00:12.760 --> 00:00:21.520] 九的后面的一个数字
[00:00:21.520 --> 00:00:28.480] 八的前面的三个数字
[00:00:28.480 --> 00:00:34.940] 八的后面三个数字
[00:00:34.940 --> 00:00:46.080] 长方形 正方形 正方形 圆柱球
[00:00:46.080 --> 00:00:53.440] 同左边数D
[00:00:53.440 --> 00:00:57.340] 是D
[00:00:57.340 --> 00:01:05.080] 同左 同右边数
[00:01:05.080 --> 00:01:08.020] 是D几个
[00:01:08.020 --> 00:01:22.080] 有个 然后这个数
[00:01:22.080 --> 00:01:24.840] 这什么字啊 答案都不清楚
[00:01:24.840 --> 00:01:25.840] 最
[00:01:25.840 --> 00:01:27.080] 最多
[00:01:27.080 --> 00:01:34.820] 最多
[00:01:34.820 --> 00:01:40.820] 一共有几个减法
[00:01:40.820 --> 00:01:45.620] 还剩几个用
[00:01:45.620 --> 00:01:49.620] 用
[00:01:49.620 --> 00:01:55.620] 加法 减法
[00:01:55.620 --> 00:01:56.820] 一个
[00:01:56.820 --> 00:01:59.080] 三个数
[00:01:59.080 --> 00:02:02.180] 个位上是五
[00:02:02.180 --> 00:02:05.220] 十位上是一
[00:02:05.220 --> 00:02:12.620] 这个数是十个十
[00:02:12.620 --> 00:02:18.880] 和四个一合起来
[00:02:18.880 --> 00:02:26.560] 是一个十和九个一合起来
[00:02:26.560 --> 00:02:48.320] 是一个十六里面有个十和一个一二十里面有个十
[00:02:48.320 --> 00:02:56.300] 二十五和十几中间的数是
[00:02:56.300 --> 00:03:06.400] 和九相邻的两个数是
[00:03:06.400 --> 00:03:14.560] 比六比十六多一的数是
[00:03:14.560 --> 00:03:19.560] 比十六少三的数是
[00:03:19.560 --> 00:03:20.100] 比十六少三的数是
[00:03:20.100 --> 00:03:20.660] 比十六少三的数是
[00:03:20.660 --> 00:03:21.660] 比十六少三的数是
[00:03:21.660 --> 00:03:22.660] 比十六少三的数是
[00:03:22.660 --> 00:03:24.040] 比十六少三的数是
[00:03:24.040 --> 00:03:26.040] 比十六少三的数是
[00:03:26.040 --> 00:03:26.780] 比十六少三的数是
[00:03:26.780 --> 00:03:29.100] 被减数
[00:03:29.100 --> 00:03:31.500] 二是十一
[00:03:31.500 --> 00:03:34.280] 减数是三
[00:03:34.280 --> 00:03:39.380] 差是五
[00:03:39.380 --> 00:03:43.900] 最大的疑问数是
[00:03:43.900 --> 00:03:52.280] 最小的两位数是
[00:03:52.280 --> 00:03:55.780] 九
[00:03:55.780 --> 00:04:05.640] 十九这个数在位置
[00:04:05.640 --> 00:04:07.980] 漂亮的字我看
[00:04:07.980 --> 00:04:19.580] 表表快点就片不是吗
[00:04:19.580 --> 00:04:22.820] 表是
[00:04:22.820 --> 00:04:27.040] 个在位
[00:04:27.040 --> 00:04:32.040] 表示个
[00:04:32.040 --> 00:04:36.280] 这个
[00:04:36.280 --> 00:04:38.340] 算
[00:04:38.340 --> 00:04:44.680] 是中间数是和
[00:04:44.680 --> 00:04:47.740] 和是
[00:04:47.740 --> 00:04:49.320] 这个
[00:04:49.320 --> 00:04:57.760] 算是中
[00:04:57.760 --> 00:05:01.160] 被减数是
[00:05:01.160 --> 00:05:06.160] 减数是差是
[00:05:06.160 --> 00:05:12.880] 一个加数是十二
[00:05:12.880 --> 00:05:19.060] 一个加数是六
[00:05:19.060 --> 00:05:21.100] 和是十二
[00:05:21.100 --> 00:05:27.260] 被减数是十五
[00:05:27.260 --> 00:05:34.900] 减数是十三差是十五
[00:05:34.900 --> 00:05:48.800] 一个数从右边起第一位是位第二
[00:05:48.800 --> 00:06:03.140] 一个数是十二加数是十三里面有个十和个一
[00:06:03.140 --> 00:06:11.840] 一个数是十二加数是十二加数是十三里面有个十和个一
[00:06:11.840 --> 00:06:22.580] 和十二相邻的两个数是
[00:06:22.580 --> 00:06:34.880] 个三个一和一个十合起来是
[00:06:34.880 --> 00:06:41.580] 再过两小对
[00:06:41.580 --> 00:06:55.920] 一个数是十二加数是十三里面有个十和个一
[00:06:55.920 --> 00:07:00.780] 一个数是十二加数是十三里面有个十和个一
[00:07:00.780 --> 00:07:05.380] 一个数是十二加数是十三里面有个十和个一
[00:07:05.380 --> 00:07:11.320] 一个数是十二加数是十三里面有个十和个一
[00:07:11.320 --> 00:07:13.320] 謝謝
whisper_print_timings: load time = 1007.19 ms
whisper_print_timings: fallbacks = 5 p / 4 h
whisper_print_timings: mel time = 216.87 ms
whisper_print_timings: sample time = 3550.35 ms / 12205 runs ( 0.29 ms per run)
whisper_print_timings: encode time = 7821.69 ms / 17 runs ( 460.10 ms per run)
whisper_print_timings: decode time = 2958.22 ms / 198 runs ( 14.94 ms per run)
whisper_print_timings: batchd time = 88241.95 ms / 11913 runs ( 7.41 ms per run)
whisper_print_timings: prompt time = 1618.32 ms / 3783 runs ( 0.43 ms per run)
whisper_print_timings: total time = 105432.62 ms
ggml_metal_free: deallocating
ggml_metal_free: deallocating
./main -m models/ggml-large-v3.bin -f test.wav -l auto 11.65s user 1.89s system 12% cpu 1:45.50 total
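The numbers in the log above can be turned into a real-time factor (processing time divided by audio duration). The sketch below is only an illustration: it assumes 16 kHz input, which is consistent with the sample count and duration that `main` reports, and the helper names `audio_seconds` and `real_time_factor` are hypothetical, not part of whisper.cpp.

```python
SAMPLE_RATE_HZ = 16_000  # assumption: 16 kHz input, consistent with 6939648 samples ~ 433.7 sec


def audio_seconds(n_samples: int, sample_rate: int = SAMPLE_RATE_HZ) -> float:
    """Duration of the audio in seconds."""
    return n_samples / sample_rate


def real_time_factor(total_ms: float, n_samples: int) -> float:
    """Processing time divided by audio duration; below 1.0 means faster than real time."""
    return (total_ms / 1000.0) / audio_seconds(n_samples)


if __name__ == "__main__":
    n_samples = 6939648   # from the "main: processing 'test.wav'" log line
    total_ms = 105432.62  # from "whisper_print_timings: total time"
    print(f"audio duration:   {audio_seconds(n_samples):.1f} s")
    print(f"real-time factor: {real_time_factor(total_ms, n_samples):.3f}")
```

A factor of roughly 0.24 means this run transcribed the audio about four times faster than real time.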
CoreML
Install dependencies
pip install openai-whisper coremltools ane-transformers
Generate the CoreML model
models/generate-coreml-model.sh large-v3
ModelDimensions(n_mels=128, n_audio_ctx=1500, n_audio_state=1280, n_audio_head=20, n_audio_layer=32, n_vocab=51866, n_text_ctx=448, n_text_state=1280, n_text_head=20, n_text_layer=32)
/Users/junjian/GitHub/ggerganov/whisper.cpp/env/lib/python3.10/site-packages/whisper/model.py:166: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert x.shape[1:] == self.positional_embedding.shape, "incorrect audio shape"
/Users/junjian/GitHub/ggerganov/whisper.cpp/env/lib/python3.10/site-packages/whisper/model.py:97: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
scale = (n_state // self.n_head) ** -0.25
Converting PyTorch Frontend ==> MIL Ops: 100%|███████████████████████████████████████████████████████████████████████▉| 2611/2612 [00:00<00:00, 4137.49 ops/s]
Running MIL frontend_pytorch pipeline: 100%|███████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 34.67 passes/s]
Running MIL default pipeline: 100%|██████████████████████████████████████████████████████████████████████████████████████| 71/71 [00:15<00:00, 4.46 passes/s]
Running MIL backend_mlprogram pipeline: 100%|███████████████████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 331.10 passes/s]
done converting
/Users/junjian/GitHub/ggerganov/whisper.cpp/models/coreml-encoder-large-v3.mlmodelc/coremldata.bin
models/coreml-encoder-large-v3.mlmodelc -> models/ggml-large-v3-encoder.mlmodelc
Build
make clean
WHISPER_COREML=1 make -j
Speech recognition
time ./main -m models/ggml-large-v3.bin -f test.wav -l auto
whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-large-v3.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51866
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 1280
whisper_model_load: n_text_head = 20
whisper_model_load: n_text_layer = 32
whisper_model_load: n_mels = 128
whisper_model_load: ftype = 1
whisper_model_load: qntvr = 0
whisper_model_load: type = 5 (large v3)
whisper_model_load: adding 1609 extra tokens
whisper_model_load: n_langs = 100
whisper_backend_init: using Metal backend
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Max
ggml_metal_init: picking default device: Apple M2 Max
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: loading '/Users/junjian/GitHub/ggerganov/whisper.cpp/ggml-metal.metal'
ggml_metal_init: GPU name: Apple M2 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 51539.61 MB
ggml_metal_init: maxTransferRate = built-in GPU
ggml_metal_add_buffer: allocated 'backend ' buffer, size = 3117.88 MB, ( 3118.53 / 51539.61)
whisper_model_load: Metal buffer size = 3117.87 MB
whisper_model_load: model size = 3117.39 MB
whisper_backend_init: using Metal backend
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Max
ggml_metal_init: picking default device: Apple M2 Max
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: loading '/Users/junjian/GitHub/ggerganov/whisper.cpp/ggml-metal.metal'
ggml_metal_init: GPU name: Apple M2 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 51539.61 MB
ggml_metal_init: maxTransferRate = built-in GPU
ggml_metal_add_buffer: allocated 'backend ' buffer, size = 220.20 MB, ( 3338.73 / 51539.61)
whisper_init_state: kv self size = 220.20 MB
ggml_metal_add_buffer: allocated 'backend ' buffer, size = 245.76 MB, ( 3584.49 / 51539.61)
whisper_init_state: kv cross size = 245.76 MB
whisper_init_state: loading Core ML model from 'models/ggml-large-v3-encoder.mlmodelc'
whisper_init_state: first run on a device may take a while ...
whisper_init_state: Core ML model loaded
ggml_metal_add_buffer: allocated 'backend ' buffer, size = 0.02 MB, ( 3584.51 / 51539.61)
whisper_init_state: compute buffer (conv) = 10.85 MB
ggml_metal_add_buffer: allocated 'backend ' buffer, size = 0.02 MB, ( 3584.52 / 51539.61)
whisper_init_state: compute buffer (cross) = 9.32 MB
ggml_metal_add_buffer: allocated 'backend ' buffer, size = 0.02 MB, ( 3584.54 / 51539.61)
whisper_init_state: compute buffer (decode) = 99.17 MB
ggml_metal_add_buffer: allocated 'backend ' buffer, size = 9.22 MB, ( 3593.76 / 51539.61)
ggml_metal_add_buffer: allocated 'backend ' buffer, size = 7.68 MB, ( 3601.45 / 51539.61)
ggml_metal_add_buffer: allocated 'backend ' buffer, size = 97.53 MB, ( 3698.98 / 51539.61)
system_info: n_threads = 4 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | METAL = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | CUDA = 0 | COREML = 1 | OPENVINO = 0 |
main: processing 'test.wav' (6939648 samples, 433.7 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = auto, task = transcribe, timestamps = 1 ...
whisper_full_with_state: auto-detected language: zh (p = 0.916203)
[00:00:00.000 --> 00:00:12.760] 三的前面的一个数字
[00:00:12.760 --> 00:00:21.520] 九的后面的一个数字
[00:00:21.520 --> 00:00:28.480] 八的前面的三个数字
[00:00:28.480 --> 00:00:34.940] 八的后面三个数字
[00:00:34.940 --> 00:00:43.720] 长方形正方形
[00:00:43.720 --> 00:00:46.080] 圆柱球
[00:00:46.080 --> 00:00:53.740] 同左边数地
[00:00:53.740 --> 00:00:57.500] 是地
[00:00:57.500 --> 00:01:02.880] 同左
[00:01:02.880 --> 00:01:05.120] 同右边数
[00:01:05.120 --> 00:01:06.860] 是地
[00:01:06.860 --> 00:01:08.040] 几个
[00:01:08.040 --> 00:01:17.120] 有个
[00:01:17.120 --> 00:01:18.640] 有个
[00:01:18.640 --> 00:01:20.680] 然后这个
[00:01:20.680 --> 00:01:22.120] 数
[00:01:22.120 --> 00:01:24.840] 这什么字啊答案都不清楚
[00:01:24.840 --> 00:01:25.820] 最
[00:01:25.820 --> 00:01:27.460] 最多
[00:01:27.460 --> 00:01:27.480] 最多
[00:01:27.480 --> 00:01:38.520] 一共有几个减法
[00:01:38.520 --> 00:01:42.720] 还剩几个
[00:01:42.720 --> 00:01:44.080] 用
[00:01:44.080 --> 00:01:46.000] 用
[00:01:46.000 --> 00:01:50.960] 加法减法
[00:01:50.960 --> 00:01:51.480] 好
[00:01:51.480 --> 00:01:53.160] 这个
[00:01:53.160 --> 00:01:56.660] 一个
[00:01:56.660 --> 00:01:57.460] 一个
[00:01:57.460 --> 00:01:59.140] 这个数
[00:01:59.140 --> 00:02:02.200] 个位上是五
[00:02:02.200 --> 00:02:05.260] 十位上是一
[00:02:05.260 --> 00:02:08.060] 这个数是
[00:02:08.060 --> 00:02:12.820] 十个十
[00:02:12.820 --> 00:02:21.900] 和四个一合起来是一个
[00:02:21.900 --> 00:02:29.660] 十和九个一合起来是
[00:02:29.660 --> 00:02:33.980] 十六里面有
[00:02:33.980 --> 00:02:35.780] 个
[00:02:35.780 --> 00:02:38.540] 十和
[00:02:38.540 --> 00:02:42.460] 个一二十里面
[00:02:42.460 --> 00:02:44.620] 有
[00:02:44.620 --> 00:02:48.340] 个十
[00:02:48.340 --> 00:02:51.620] 五十五合十一
[00:02:51.620 --> 00:03:05.100] 中间的数是和9相邻的两个数是
[00:03:05.100 --> 00:03:19.280] 比16多1的数是比16少3的数是
[00:03:19.280 --> 00:03:35.280] 被解数是11解数是3差是
[00:03:35.280 --> 00:03:46.880] 最大的疑问数是最小的两位数是
[00:03:46.880 --> 00:03:47.280] 5
[00:03:47.280 --> 00:03:49.260] 最大的疑问数是最小的两位数是
[00:03:49.260 --> 00:03:49.260] 最大的疑问数是最小的两位数是
[00:03:49.260 --> 00:03:49.260] 最大的疑问数是最小的两位数是
[00:03:49.260 --> 00:03:49.260] 最大的疑问数是最小的两位数是
[00:03:49.260 --> 00:03:49.260] 最大的疑问数是最小的两位数是
[00:03:49.260 --> 00:04:19.240] 最大的疑问数是最小的两位数是
[00:04:19.240 --> 00:04:49.220] 最大的疑问数是最小的两位数是
[00:04:49.220 --> 00:05:19.200] 最大的疑问数是最小的两位数是
[00:05:19.200 --> 00:05:49.180] 最大的疑问数是最小的两位数是
[00:05:49.180 --> 00:06:19.160] 最大的疑问数是最小的两位数是
[00:06:19.160 --> 00:06:49.140] 最大的疑问数是最小的两位数是最小的两位数是
[00:06:49.140 --> 00:07:13.760] 最大的疑问数是最小的两位数是最小的两位数是
whisper_print_timings: load time = 859.73 ms
whisper_print_timings: fallbacks = 3 p / 1 h
whisper_print_timings: mel time = 224.71 ms
whisper_print_timings: sample time = 2659.66 ms / 8154 runs ( 0.33 ms per run)
whisper_print_timings: encode time = 5801.61 ms / 16 runs ( 362.60 ms per run)
whisper_print_timings: decode time = 4105.18 ms / 279 runs ( 14.71 ms per run)
whisper_print_timings: batchd time = 54016.19 ms / 7797 runs ( 6.93 ms per run)
whisper_print_timings: prompt time = 1218.58 ms / 2721 runs ( 0.45 ms per run)
whisper_print_timings: total time = 71318.02 ms
ggml_metal_free: deallocating
ggml_metal_free: deallocating
./main -m models/ggml-large-v3.bin -f test.wav -l auto 8.81s user 2.29s system 15% cpu 1:11.44 total
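Both transcripts contain stretches where the same text is emitted for many consecutive segments. A rough way to quantify this is to parse the timestamped lines and look for runs of identical text. The following is a minimal sketch, not a whisper.cpp feature; the function names and the `min_run` threshold of 3 are arbitrary assumptions.

```python
import re
from itertools import groupby

# Matches transcript lines such as "[00:03:19.560 --> 00:03:20.100] text"
SEGMENT_RE = re.compile(
    r"^\[(\d{2}:\d{2}:\d{2}\.\d{3}) --> (\d{2}:\d{2}:\d{2}\.\d{3})\]\s*(.*)$"
)


def parse_segments(log: str):
    """Return (start, end, text) tuples for every transcript line in the log."""
    segments = []
    for line in log.splitlines():
        m = SEGMENT_RE.match(line.strip())
        if m:
            segments.append((m.group(1), m.group(2), m.group(3).strip()))
    return segments


def repeated_runs(segments, min_run: int = 3):
    """Find runs of at least min_run consecutive segments with identical text.

    Returns (first_start, last_end, run_length, text) tuples.
    """
    runs = []
    for text, group in groupby(segments, key=lambda s: s[2]):
        group = list(group)
        if len(group) >= min_run:
            runs.append((group[0][0], group[-1][1], len(group), text))
    return runs
```

Saving each run's output to a file and passing its contents through `repeated_runs(parse_segments(...))` gives a quick, if crude, indicator of where the decoder got stuck in a loop.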
Encoder only
The result looked the same as the run above, in both accuracy and speed.
models/generate-coreml-model.sh large-v3 --encoder-only True
time ./main -m models/ggml-large-v3.bin -f test.wav -l auto
Summary
| | NEON & MPS 👍 | CoreML 🚀 (47%) |
|---|---|---|
| load time | 1007.19 ms | 859.73 ms |
| mel time | 216.87 ms | 224.71 ms |
| sample time | 3550.35 ms | 2659.66 ms |
| encode time | 7821.69 ms | 5801.61 ms |
| decode time | 2958.22 ms | 4105.18 ms |
| batchd time | 88241.95 ms | 54016.19 ms |
| prompt time | 1618.32 ms | 1218.58 ms |
| total time | 105432.62 ms | 71318.02 ms |
| wall-clock time (`time` total) | 1:45.50 | 1:11.44 |
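The CoreML build is faster (total time 71.3 s versus 105.4 s), but transcription quality dropped: from about 03:47 onward, the CoreML transcript degenerates into a single repeated segment for the rest of the audio.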
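The 47% figure in the table header presumably comes from the ratio of the two total times. A short worked calculation, with hypothetical helper names, shows both common ways of expressing the gain:

```python
def speedup_percent(baseline_ms: float, candidate_ms: float) -> float:
    """Throughput gain of the candidate over the baseline, in percent."""
    return (baseline_ms / candidate_ms - 1.0) * 100.0


def time_saved_percent(baseline_ms: float, candidate_ms: float) -> float:
    """Share of the baseline run time that the candidate saves, in percent."""
    return (1.0 - candidate_ms / baseline_ms) * 100.0


if __name__ == "__main__":
    neon_mps_ms = 105432.62  # total time, NEON & MPS build
    coreml_ms = 71318.02     # total time, CoreML build
    print(f"speedup:    {speedup_percent(neon_mps_ms, coreml_ms):.1f}%")
    print(f"time saved: {time_saved_percent(neon_mps_ms, coreml_ms):.1f}%")
```

The first number (just under 48%) matches the header; the second (about 32%) is the same result expressed as a reduction in run time.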
Performance comparison (NEON & MPS)
Download the models from the Hugging Face repository ggerganov/whisper.cpp
git clone https://huggingface.co/ggerganov/whisper.cpp ggerganov/whisper.cpp
Create model symlinks
Write the script ln-models.sh
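#!/bin/bash
# Source directory (the cloned Hugging Face repository)
src_dir="/Users/junjian/HuggingFace/ggerganov/whisper.cpp"
# Destination directory (the models directory of the whisper.cpp checkout)
dst_dir="models"
# Iterate over every file in the source directory
for src_file in "$src_dir"/*
do
    # File name without the directory part
    file_name=$(basename "$src_file")
    # File extension (text after the last dot)
    extension="${file_name##*.}"
    # Create a symlink unless the file is README.md or a zip archive
    if [ "$file_name" != "README.md" ] && [ "$extension" != "zip" ]
    then
        ln -s "$src_file" "$dst_dir/$file_name"
    fi
done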
Run the script
sh ln-models.sh
The benchmarks below use a 5-minute audio file, test.wav.
Model tiny
time ./main -f test.wav -l zh -m models/ggml-tiny.bin
whisper_print_timings: load time = 73.39 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 132.73 ms
whisper_print_timings: sample time = 1967.16 ms / 8244 runs ( 0.24 ms per run)
whisper_print_timings: encode time = 259.27 ms / 13 runs ( 19.94 ms per run)
whisper_print_timings: decode time = 69.91 ms / 38 runs ( 1.84 ms per run)
whisper_print_timings: batchd time = 5104.22 ms / 8130 runs ( 0.63 ms per run)
whisper_print_timings: prompt time = 55.85 ms / 2175 runs ( 0.03 ms per run)
whisper_print_timings: total time = 7675.59 ms
./main -f test.wav -l zh -m models/ggml-tiny.bin 5.40s user 0.50s system 76% cpu 7.704 total
Model tiny-q5_1
time ./main -f test.wav -l zh -m models/ggml-tiny-q5_1.bin
whisper_print_timings: load time = 68.97 ms
whisper_print_timings: fallbacks = 1 p / 0 h
whisper_print_timings: mel time = 134.27 ms
whisper_print_timings: sample time = 2650.25 ms / 10960 runs ( 0.24 ms per run)
whisper_print_timings: encode time = 232.78 ms / 11 runs ( 21.16 ms per run)
whisper_print_timings: decode time = 7.82 ms / 5 runs ( 1.56 ms per run)
whisper_print_timings: batchd time = 7218.69 ms / 10898 runs ( 0.66 ms per run)
whisper_print_timings: prompt time = 69.42 ms / 2452 runs ( 0.03 ms per run)
whisper_print_timings: total time = 10395.01 ms
./main -f test.wav -l zh -m models/ggml-tiny-q5_1.bin 7.20s user 0.63s system 75% cpu 10.422 total
Model base
time ./main -f test.wav -l zh -m models/ggml-base.bin
whisper_print_timings: load time = 81.90 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 133.21 ms
whisper_print_timings: sample time = 2283.01 ms / 9560 runs ( 0.24 ms per run)
whisper_print_timings: encode time = 396.49 ms / 11 runs ( 36.04 ms per run)
whisper_print_timings: decode time = 7.37 ms / 3 runs ( 2.46 ms per run)
whisper_print_timings: batchd time = 8629.71 ms / 9505 runs ( 0.91 ms per run)
whisper_print_timings: prompt time = 88.45 ms / 2226 runs ( 0.04 ms per run)
whisper_print_timings: total time = 11631.52 ms
./main -f test.wav -l zh -m models/ggml-base.bin 6.29s user 0.60s system 59% cpu 11.664 total
Model base-q5_1
time ./main -f test.wav -l zh -m models/ggml-base-q5_1.bin
whisper_print_timings: load time = 63.39 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 132.60 ms
whisper_print_timings: sample time = 2266.66 ms / 9567 runs ( 0.24 ms per run)
whisper_print_timings: encode time = 424.01 ms / 11 runs ( 38.55 ms per run)
whisper_print_timings: decode time = 7.25 ms / 3 runs ( 2.42 ms per run)
whisper_print_timings: batchd time = 8911.47 ms / 9512 runs ( 0.94 ms per run)
whisper_print_timings: prompt time = 98.58 ms / 2227 runs ( 0.04 ms per run)
whisper_print_timings: total time = 11916.36 ms
./main -f test.wav -l zh -m models/ggml-base-q5_1.bin 6.18s user 0.56s system 56% cpu 11.948 total
Model small
time ./main -f test.wav -l zh -m models/ggml-small.bin
whisper_print_timings: load time = 200.65 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 132.19 ms
whisper_print_timings: sample time = 2134.68 ms / 8277 runs ( 0.26 ms per run)
whisper_print_timings: encode time = 1222.70 ms / 12 runs ( 101.89 ms per run)
whisper_print_timings: decode time = 24.96 ms / 5 runs ( 4.99 ms per run)
whisper_print_timings: batchd time = 15979.48 ms / 8208 runs ( 1.95 ms per run)
whisper_print_timings: prompt time = 218.47 ms / 2191 runs ( 0.10 ms per run)
whisper_print_timings: total time = 19925.94 ms
./main -f test.wav -l zh -m models/ggml-small.bin 6.21s user 0.67s system 34% cpu 19.968 total
Model small-q5_1
time ./main -f test.wav -l zh -m models/ggml-small-q5_1.bin
whisper_print_timings: load time = 99.42 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 131.74 ms
whisper_print_timings: sample time = 2121.60 ms / 8218 runs ( 0.26 ms per run)
whisper_print_timings: encode time = 1419.51 ms / 13 runs ( 109.19 ms per run)
whisper_print_timings: decode time = 147.85 ms / 33 runs ( 4.48 ms per run)
whisper_print_timings: batchd time = 15960.53 ms / 8116 runs ( 1.97 ms per run)
whisper_print_timings: prompt time = 266.62 ms / 2419 runs ( 0.11 ms per run)
whisper_print_timings: total time = 20160.34 ms
./main -f test.wav -l zh -m models/ggml-small-q5_1.bin 6.03s user 0.59s system 32% cpu 20.191 total
Model medium
time ./main -f test.wav -l zh -m models/ggml-medium.bin
whisper_print_timings: load time = 476.85 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 132.37 ms
whisper_print_timings: sample time = 2233.07 ms / 9028 runs ( 0.25 ms per run)
whisper_print_timings: encode time = 2951.15 ms / 11 runs ( 268.29 ms per run)
whisper_print_timings: decode time = 42.86 ms / 4 runs ( 10.72 ms per run)
whisper_print_timings: batchd time = 38405.30 ms / 8972 runs ( 4.28 ms per run)
whisper_print_timings: prompt time = 550.54 ms / 2232 runs ( 0.25 ms per run)
whisper_print_timings: total time = 44803.71 ms
./main -f test.wav -l zh -m models/ggml-medium.bin 7.14s user 1.01s system 18% cpu 44.848 total
Model medium-q5_0
time ./main -f test.wav -l zh -m models/ggml-medium-q5_0.bin
whisper_print_timings: load time = 203.72 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 132.13 ms
whisper_print_timings: sample time = 2251.24 ms / 9072 runs ( 0.25 ms per run)
whisper_print_timings: encode time = 3288.98 ms / 11 runs ( 299.00 ms per run)
whisper_print_timings: decode time = 55.99 ms / 6 runs ( 9.33 ms per run)
whisper_print_timings: batchd time = 39768.11 ms / 9014 runs ( 4.41 ms per run)
whisper_print_timings: prompt time = 624.49 ms / 2234 runs ( 0.28 ms per run)
whisper_print_timings: total time = 46336.09 ms
./main -f test.wav -l zh -m models/ggml-medium-q5_0.bin 7.02s user 0.81s system 16% cpu 46.372 total
Model large-v3
time ./main -f test.wav -l zh -m models/ggml-large-v3.bin
whisper_print_timings: load time = 859.78 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 151.76 ms
whisper_print_timings: sample time = 2201.30 ms / 8640 runs ( 0.25 ms per run)
whisper_print_timings: encode time = 5561.91 ms / 12 runs ( 463.49 ms per run)
whisper_print_timings: decode time = 1033.94 ms / 67 runs ( 15.43 ms per run)
whisper_print_timings: batchd time = 55377.50 ms / 8503 runs ( 6.51 ms per run)
whisper_print_timings: prompt time = 820.89 ms / 1975 runs ( 0.42 ms per run)
whisper_print_timings: total time = 66020.98 ms
./main -f test.wav -l zh -m models/ggml-large-v3.bin 7.45s user 1.40s system 13% cpu 1:06.08 total
Model large-v3-q5_0
time ./main -f test.wav -l zh -m models/ggml-large-v3-q5_0.bin
whisper_print_timings: load time = 341.02 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 159.70 ms
whisper_print_timings: sample time = 2230.36 ms / 8311 runs ( 0.27 ms per run)
whisper_print_timings: encode time = 5895.91 ms / 11 runs ( 535.99 ms per run)
whisper_print_timings: decode time = 344.94 ms / 25 runs ( 13.80 ms per run)
whisper_print_timings: batchd time = 57613.93 ms / 8239 runs ( 6.99 ms per run)
whisper_print_timings: prompt time = 1062.95 ms / 2198 runs ( 0.48 ms per run)
whisper_print_timings: total time = 67659.70 ms
./main -f test.wav -l zh -m models/ggml-large-v3-q5_0.bin 7.25s user 1.03s system 12% cpu 1:07.71 total
Summary
| | tiny | tiny-q5_1 | base | base-q5_1 | small | small-q5_1 | medium | medium-q5_0 | large-v3 | large-v3-q5_0 |
|---|---|---|---|---|---|---|---|---|---|---|
| load time | 73.39 ms | 68.97 ms | 81.90 ms | 63.39 ms | 200.65 ms | 99.42 ms | 476.85 ms | 203.72 ms | 859.78 ms | 341.02 ms |
| mel time | 132.73 ms | 134.27 ms | 133.21 ms | 132.60 ms | 132.19 ms | 131.74 ms | 132.37 ms | 132.13 ms | 151.76 ms | 159.70 ms |
| sample time | 1967.16 ms | 2650.25 ms | 2283.01 ms | 2266.66 ms | 2134.68 ms | 2121.60 ms | 2233.07 ms | 2251.24 ms | 2201.30 ms | 2230.36 ms |
| encode time | 259.27 ms | 232.78 ms | 396.49 ms | 424.01 ms | 1222.70 ms | 1419.51 ms | 2951.15 ms | 3288.98 ms | 5561.91 ms | 5895.91 ms |
| decode time | 69.91 ms | 7.82 ms | 7.37 ms | 7.25 ms | 24.96 ms | 147.85 ms | 42.86 ms | 55.99 ms | 1033.94 ms | 344.94 ms |
| batchd time | 5104.22 ms | 7218.69 ms | 8629.71 ms | 8911.47 ms | 15979.48 ms | 15960.53 ms | 38405.30 ms | 39768.11 ms | 55377.50 ms | 57613.93 ms |
| prompt time | 55.85 ms | 69.42 ms | 88.45 ms | 98.58 ms | 218.47 ms | 266.62 ms | 550.54 ms | 624.49 ms | 820.89 ms | 1062.95 ms |
| total time | 7675.59 ms | 10395.01 ms | 11631.52 ms | 11916.36 ms | 19925.94 ms | 20160.34 ms | 44803.71 ms | 46336.09 ms | 66020.98 ms | 67659.70 ms |
| wall-clock time (`time` total) | 0:07.70 | 0:10.42 | 0:11.66 | 0:11.94 | 0:19.96 | 0:20.19 | 0:44.84 | 0:46.37 | 1:06.08 | 1:07.71 |
| | | | 🚀 | 🚀 | 🚀🚀 | 🚀🚀 | 🚀🚀🚀🚀 | 🚀🚀🚀🚀 | 🚀🚀🚀🚀🚀🚀 | 🚀🚀🚀🚀🚀🚀 |
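Copying timings by hand into a table like this is error-prone. One possible approach, sketched below, is to extract the `whisper_print_timings` lines from each saved log with a regular expression. The function names are hypothetical, and the pattern is based only on the log format shown in this post.

```python
import re

# Matches e.g. "whisper_print_timings: encode time = 259.27 ms / 13 runs (...)"
TIMING_RE = re.compile(r"whisper_print_timings:\s+(\w+) time\s+=\s+([\d.]+) ms")


def parse_timings(log: str) -> dict[str, float]:
    """Map each timing name (load, mel, sample, ...) to its value in milliseconds."""
    return {name: float(ms) for name, ms in TIMING_RE.findall(log)}


def table_rows(results: dict[str, dict[str, float]]) -> list[str]:
    """Render one pipe-delimited row per timing, with one column per model."""
    models = list(results)
    names = list(results[models[0]])
    rows = ["| | " + " | ".join(models) + " |"]
    rows.append("|" + "---|" * (len(models) + 1))
    for name in names:
        cells = " | ".join(f"{results[m][name]:.2f} ms" for m in models)
        rows.append(f"| {name} time | {cells} |")
    return rows
```

Given a dictionary such as `{"tiny": parse_timings(tiny_log), "base": parse_timings(base_log)}`, `table_rows` returns the lines of a table in the same layout as the one above. The `fallbacks` line is skipped because it does not report a time.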
Performance comparison (MLX)
Write the script test-speed.py
import argparse
# Assumed to be the whisper package from the MLX examples, not openai-whisper
import whisper
# Create the argument parser
parser = argparse.ArgumentParser(description='Transcribe a speech file using a specific model.')
parser.add_argument('speech_file', type=str, help='The path to the speech file.')
parser.add_argument('model', type=str, help='The model to use for transcription.')
# Parse the command-line arguments
args = parser.parse_args()
# Transcribe the given audio file with the given model
text = whisper.transcribe(args.speech_file, model=args.model, initial_prompt='大家好!')["text"]
print(text)
Run the script
time python test-speed.py test.wav base
The benchmarks below use a 5-minute audio file, test.wav.
Model tiny
9.83s user 8.39s system 142% cpu 12.813 total
Model base
7.93s user 6.82s system 143% cpu 10.297 total
Model small
14.49s user 9.87s system 129% cpu 18.812 total
Model medium
30.05s user 17.96s system 122% cpu 39.291 total
Model large-v3
47.01s user 28.10s system 119% cpu 1:02.96 total
Summary
| | tiny | base | small | medium | large-v3 |
|---|---|---|---|---|---|
| wall-clock time (`time` total) | 0:12.81 | 0:10.30 | 0:18.81 | 0:39.29 | 1:02.96 |
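Instead of running `time` by hand for each model, the runs could be driven from a small harness. This is only a sketch: `time_command` and `benchmark` are hypothetical helpers, and it assumes `test-speed.py` and `test.wav` sit in the current directory as in the command shown earlier.

```python
import subprocess
import sys
import time


def time_command(cmd: list[str]) -> float:
    """Run cmd to completion and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    # Discard the transcript; only the elapsed time matters here.
    subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
    return time.perf_counter() - start


def benchmark(models: list[str], audio: str = "test.wav",
              script: str = "test-speed.py") -> dict[str, float]:
    """Time one transcription per model and return {model: seconds}."""
    results = {}
    for model in models:
        results[model] = time_command([sys.executable, script, audio, model])
    return results
```

Calling `benchmark(["tiny", "base", "small", "medium", "large-v3"])` would reproduce the five runs above. The first run of each model may include download or cache warm-up time, so repeating the measurement is advisable.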
Performance comparison (NEON & MPS 🆚 MLX)
| | tiny | base | small | medium | large-v3 |
|---|---|---|---|---|---|
| NEON & MPS | 0:07.70 | 0:11.66 | 0:19.96 | 0:44.84 | 1:06.08 |
| MLX | 0:12.81 | 0:10.30 | 0:18.81 | 0:39.29 | 1:02.96 |
MLX now outperforms whisper.cpp on every model except tiny, where whisper.cpp remains faster.
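The elapsed times in the comparison table can be converted to seconds to make the per-model ratios explicit. The sketch below uses the values from the table; the function names are hypothetical.

```python
def to_seconds(elapsed: str) -> float:
    """Convert an elapsed time such as '1:06.08' (M:SS.ss) to seconds."""
    seconds = 0.0
    for part in elapsed.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


# Wall-clock times copied from the comparison table
WHISPER_CPP = {"tiny": "0:07.70", "base": "0:11.66", "small": "0:19.96",
               "medium": "0:44.84", "large-v3": "1:06.08"}
MLX = {"tiny": "0:12.81", "base": "0:10.30", "small": "0:18.81",
       "medium": "0:39.29", "large-v3": "1:02.96"}


def ratios(a: dict[str, str], b: dict[str, str]) -> dict[str, float]:
    """Per-model ratio a/b; a value above 1.0 means b is the faster one."""
    return {model: to_seconds(a[model]) / to_seconds(b[model]) for model in a}


if __name__ == "__main__":
    for model, ratio in ratios(WHISPER_CPP, MLX).items():
        print(f"{model:>8}: whisper.cpp / MLX = {ratio:.2f}")
```

A ratio above 1.0 means MLX finished sooner. Note that these are single runs, so small differences may fall within run-to-run noise.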