
NEON & MPS 🆚 CoreML

Download the model (large-v3)

models/download-ggml-model.sh large-v3
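
Optional sanity check (a minimal sketch, assuming the script downloaded into models/ as used in the commands below): the large-v3 weights are about 3.1 GB.

# verify the downloaded weights exist and have a plausible size (~3.1 GB)
ls -lh models/ggml-large-v3.bin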

NEON & MPS

Build

make clean
make -j
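
Optional check (macOS only, a rough sketch): when the Metal backend is compiled in, otool should list the Metal frameworks the binary was linked against; the exact list depends on your build flags.

# inspect the linked frameworks of the resulting binary
otool -L ./main | grep -i metal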

main help

./main --help

usage: ./main [options] file0.wav file1.wav ...

options:
  -h,        --help              [default] show this help message and exit
  -t N,      --threads N         [4      ] number of threads to use during computation
  -p N,      --processors N      [1      ] number of processors to use during computation
  -ot N,     --offset-t N        [0      ] time offset in milliseconds
  -on N,     --offset-n N        [0      ] segment index offset
  -d  N,     --duration N        [0      ] duration of audio to process in milliseconds
  -mc N,     --max-context N     [-1     ] maximum number of text context tokens to store
  -ml N,     --max-len N         [0      ] maximum segment length in characters
  -sow,      --split-on-word     [false  ] split on word rather than on token
  -bo N,     --best-of N         [5      ] number of best candidates to keep
  -bs N,     --beam-size N       [5      ] beam size for beam search
  -wt N,     --word-thold N      [0.01   ] word timestamp probability threshold
  -et N,     --entropy-thold N   [2.40   ] entropy threshold for decoder fail
  -lpt N,    --logprob-thold N   [-1.00  ] log probability threshold for decoder fail
  -debug,    --debug-mode        [false  ] enable debug mode (eg. dump log_mel)
  -tr,       --translate         [false  ] translate from source language to english
  -di,       --diarize           [false  ] stereo audio diarization
  -tdrz,     --tinydiarize       [false  ] enable tinydiarize (requires a tdrz model)
  -nf,       --no-fallback       [false  ] do not use temperature fallback while decoding
  -otxt,     --output-txt        [false  ] output result in a text file
  -ovtt,     --output-vtt        [false  ] output result in a vtt file
  -osrt,     --output-srt        [false  ] output result in a srt file
  -olrc,     --output-lrc        [false  ] output result in a lrc file
  -owts,     --output-words      [false  ] output script for generating karaoke video
  -fp,       --font-path         [/System/Library/Fonts/Supplemental/Courier New Bold.ttf] path to a monospace font for karaoke video
  -ocsv,     --output-csv        [false  ] output result in a CSV file
  -oj,       --output-json       [false  ] output result in a JSON file
  -ojf,      --output-json-full  [false  ] include more information in the JSON file
  -of FNAME, --output-file FNAME [       ] output file path (without file extension)
  -ps,       --print-special     [false  ] print special tokens
  -pc,       --print-colors      [false  ] print colors
  -pp,       --print-progress    [false  ] print progress
  -nt,       --no-timestamps     [false  ] do not print timestamps
  -l LANG,   --language LANG     [en     ] spoken language ('auto' for auto-detect)
  -dl,       --detect-language   [false  ] exit after automatically detecting language
             --prompt PROMPT     [       ] initial prompt
  -m FNAME,  --model FNAME       [models/ggml-base.en.bin] model path
  -f FNAME,  --file FNAME        [       ] input WAV file path
  -oved D,   --ov-e-device DNAME [CPU    ] the OpenVINO device used for encode inference
  -ls,       --log-score         [false  ] log best decoder scores of tokens
  -ng,       --no-gpu            [false  ] disable GPU
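
An example combining a few of the options above (model and audio paths as used later in this post): auto-detect the language and write both .txt and .srt files next to the given output prefix.

# transcribe test.wav, auto-detect language, save text and subtitles as test.txt / test.srt
./main -m models/ggml-large-v3.bin -f test.wav -l auto -otxt -osrt -of test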

Speech recognition

time ./main -m models/ggml-large-v3.bin -f test.wav -l auto
whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-large-v3.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51866
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 32
whisper_model_load: n_mels        = 128
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 5 (large v3)
whisper_model_load: adding 1609 extra tokens
whisper_model_load: n_langs       = 100
whisper_backend_init: using Metal backend
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Max
ggml_metal_init: picking default device: Apple M2 Max
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: loading '/Users/junjian/GitHub/ggerganov/whisper.cpp/ggml-metal.metal'
ggml_metal_init: GPU name:   Apple M2 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 51539.61 MB
ggml_metal_init: maxTransferRate               = built-in GPU
ggml_metal_add_buffer: allocated 'backend         ' buffer, size =  3117.88 MB, ( 3118.53 / 51539.61)
whisper_model_load:    Metal buffer size =  3117.87 MB
whisper_model_load: model size    = 3117.39 MB
whisper_backend_init: using Metal backend
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Max
ggml_metal_init: picking default device: Apple M2 Max
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: loading '/Users/junjian/GitHub/ggerganov/whisper.cpp/ggml-metal.metal'
ggml_metal_init: GPU name:   Apple M2 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 51539.61 MB
ggml_metal_init: maxTransferRate               = built-in GPU
ggml_metal_add_buffer: allocated 'backend         ' buffer, size =   220.20 MB, ( 3338.73 / 51539.61)
whisper_init_state: kv self size  =  220.20 MB
ggml_metal_add_buffer: allocated 'backend         ' buffer, size =   245.76 MB, ( 3584.49 / 51539.61)
whisper_init_state: kv cross size =  245.76 MB
ggml_metal_add_buffer: allocated 'backend         ' buffer, size =     0.02 MB, ( 3584.51 / 51539.61)
whisper_init_state: compute buffer (conv)   =   32.36 MB
ggml_metal_add_buffer: allocated 'backend         ' buffer, size =     0.02 MB, ( 3584.52 / 51539.61)
whisper_init_state: compute buffer (encode) =  212.36 MB
ggml_metal_add_buffer: allocated 'backend         ' buffer, size =     0.02 MB, ( 3584.54 / 51539.61)
whisper_init_state: compute buffer (cross)  =    9.32 MB
ggml_metal_add_buffer: allocated 'backend         ' buffer, size =     0.02 MB, ( 3584.56 / 51539.61)
whisper_init_state: compute buffer (decode) =   99.17 MB
ggml_metal_add_buffer: allocated 'backend         ' buffer, size =    30.72 MB, ( 3615.28 / 51539.61)
ggml_metal_add_buffer: allocated 'backend         ' buffer, size =   210.73 MB, ( 3826.01 / 51539.61)
ggml_metal_add_buffer: allocated 'backend         ' buffer, size =     7.68 MB, ( 3833.69 / 51539.61)
ggml_metal_add_buffer: allocated 'backend         ' buffer, size =    97.53 MB, ( 3931.23 / 51539.61)

system_info: n_threads = 4 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | METAL = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0 | 

main: processing 'test.wav' (6939648 samples, 433.7 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = auto, task = transcribe, timestamps = 1 ...

whisper_full_with_state: auto-detected language: zh (p = 0.916022)

[00:00:00.000 --> 00:00:12.760]  三的前面的一个数字
[00:00:12.760 --> 00:00:21.520]  九的后面的一个数字
[00:00:21.520 --> 00:00:28.480]  八的前面的三个数字
[00:00:28.480 --> 00:00:34.940]  八的后面三个数字
[00:00:34.940 --> 00:00:46.080]  长方形 正方形 正方形 圆柱球
[00:00:46.080 --> 00:00:53.440]  同左边数D
[00:00:53.440 --> 00:00:57.340]  是D
[00:00:57.340 --> 00:01:05.080]  同左 同右边数
[00:01:05.080 --> 00:01:08.020]  是D几个
[00:01:08.020 --> 00:01:22.080]  有个 然后这个数
[00:01:22.080 --> 00:01:24.840]  这什么字啊 答案都不清楚
[00:01:24.840 --> 00:01:25.840]  最
[00:01:25.840 --> 00:01:27.080]  最多
[00:01:27.080 --> 00:01:34.820]  最多
[00:01:34.820 --> 00:01:40.820]  一共有几个减法
[00:01:40.820 --> 00:01:45.620]  还剩几个用
[00:01:45.620 --> 00:01:49.620]  用
[00:01:49.620 --> 00:01:55.620]  加法 减法
[00:01:55.620 --> 00:01:56.820]  一个
[00:01:56.820 --> 00:01:59.080]  三个数
[00:01:59.080 --> 00:02:02.180]  个位上是五
[00:02:02.180 --> 00:02:05.220]  十位上是一
[00:02:05.220 --> 00:02:12.620]  这个数是十个十
[00:02:12.620 --> 00:02:18.880]  和四个一合起来
[00:02:18.880 --> 00:02:26.560]  是一个十和九个一合起来
[00:02:26.560 --> 00:02:48.320]  是一个十六里面有个十和一个一二十里面有个十
[00:02:48.320 --> 00:02:56.300]  二十五和十几中间的数是
[00:02:56.300 --> 00:03:06.400]  和九相邻的两个数是
[00:03:06.400 --> 00:03:14.560]  比六比十六多一的数是
[00:03:14.560 --> 00:03:19.560]  比十六少三的数是
[00:03:19.560 --> 00:03:20.100]  比十六少三的数是
[00:03:20.100 --> 00:03:20.660]  比十六少三的数是
[00:03:20.660 --> 00:03:21.660]  比十六少三的数是
[00:03:21.660 --> 00:03:22.660]  比十六少三的数是
[00:03:22.660 --> 00:03:24.040]  比十六少三的数是
[00:03:24.040 --> 00:03:26.040]  比十六少三的数是
[00:03:26.040 --> 00:03:26.780]  比十六少三的数是
[00:03:26.780 --> 00:03:29.100]  被减数
[00:03:29.100 --> 00:03:31.500]  二是十一
[00:03:31.500 --> 00:03:34.280]  减数是三
[00:03:34.280 --> 00:03:39.380]  差是五
[00:03:39.380 --> 00:03:43.900]  最大的疑问数是
[00:03:43.900 --> 00:03:52.280]  最小的两位数是
[00:03:52.280 --> 00:03:55.780]  九
[00:03:55.780 --> 00:04:05.640]  十九这个数在位置
[00:04:05.640 --> 00:04:07.980]  漂亮的字我看
[00:04:07.980 --> 00:04:19.580]  表表快点就片不是吗
[00:04:19.580 --> 00:04:22.820]  表是
[00:04:22.820 --> 00:04:27.040]  个在位
[00:04:27.040 --> 00:04:32.040]  表示个
[00:04:32.040 --> 00:04:36.280]  这个
[00:04:36.280 --> 00:04:38.340]  算
[00:04:38.340 --> 00:04:44.680]  是中间数是和
[00:04:44.680 --> 00:04:47.740]  和是
[00:04:47.740 --> 00:04:49.320]  这个
[00:04:49.320 --> 00:04:57.760]  算是中
[00:04:57.760 --> 00:05:01.160]  被减数是
[00:05:01.160 --> 00:05:06.160]  减数是差是
[00:05:06.160 --> 00:05:12.880]  一个加数是十二
[00:05:12.880 --> 00:05:19.060]  一个加数是六
[00:05:19.060 --> 00:05:21.100]  和是十二
[00:05:21.100 --> 00:05:27.260]  被减数是十五
[00:05:27.260 --> 00:05:34.900]  减数是十三差是十五
[00:05:34.900 --> 00:05:48.800]  一个数从右边起第一位是位第二
[00:05:48.800 --> 00:06:03.140]  一个数是十二加数是十三里面有个十和个一
[00:06:03.140 --> 00:06:11.840]  一个数是十二加数是十二加数是十三里面有个十和个一
[00:06:11.840 --> 00:06:22.580]  和十二相邻的两个数是
[00:06:22.580 --> 00:06:34.880]  个三个一和一个十合起来是
[00:06:34.880 --> 00:06:41.580]  再过两小对
[00:06:41.580 --> 00:06:55.920]  一个数是十二加数是十三里面有个十和个一
[00:06:55.920 --> 00:07:00.780]  一个数是十二加数是十三里面有个十和个一
[00:07:00.780 --> 00:07:05.380]  一个数是十二加数是十三里面有个十和个一
[00:07:05.380 --> 00:07:11.320]  一个数是十二加数是十三里面有个十和个一
[00:07:11.320 --> 00:07:13.320]  謝謝


whisper_print_timings:     load time =  1007.19 ms
whisper_print_timings:     fallbacks =   5 p /   4 h
whisper_print_timings:      mel time =   216.87 ms
whisper_print_timings:   sample time =  3550.35 ms / 12205 runs (    0.29 ms per run)
whisper_print_timings:   encode time =  7821.69 ms /    17 runs (  460.10 ms per run)
whisper_print_timings:   decode time =  2958.22 ms /   198 runs (   14.94 ms per run)
whisper_print_timings:   batchd time = 88241.95 ms / 11913 runs (    7.41 ms per run)
whisper_print_timings:   prompt time =  1618.32 ms /  3783 runs (    0.43 ms per run)
whisper_print_timings:    total time = 105432.62 ms
ggml_metal_free: deallocating
ggml_metal_free: deallocating
./main -m models/ggml-large-v3.bin -f test.wav -l auto  11.65s user 1.89s system 12% cpu 1:45.50 total

CoreML

Install dependencies

pip install openai-whisper coremltools ane-transformers
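
The conversion log below shows a project-local virtualenv at env/ (Python 3.10); a minimal sketch for setting one up first, if you want to keep these packages isolated:

# optional: create and activate a local virtualenv before running pip install above
python3 -m venv env
source env/bin/activate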

Generate the CoreML model

models/generate-coreml-model.sh large-v3
ModelDimensions(n_mels=128, n_audio_ctx=1500, n_audio_state=1280, n_audio_head=20, n_audio_layer=32, n_vocab=51866, n_text_ctx=448, n_text_state=1280, n_text_head=20, n_text_layer=32)
/Users/junjian/GitHub/ggerganov/whisper.cpp/env/lib/python3.10/site-packages/whisper/model.py:166: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert x.shape[1:] == self.positional_embedding.shape, "incorrect audio shape"
/Users/junjian/GitHub/ggerganov/whisper.cpp/env/lib/python3.10/site-packages/whisper/model.py:97: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  scale = (n_state // self.n_head) ** -0.25
Converting PyTorch Frontend ==> MIL Ops: 100%|███████████████████████████████████████████████████████████████████████▉| 2611/2612 [00:00<00:00, 4137.49 ops/s]
Running MIL frontend_pytorch pipeline: 100%|███████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 34.67 passes/s]
Running MIL default pipeline: 100%|██████████████████████████████████████████████████████████████████████████████████████| 71/71 [00:15<00:00,  4.46 passes/s]
Running MIL backend_mlprogram pipeline: 100%|███████████████████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 331.10 passes/s]
done converting
/Users/junjian/GitHub/ggerganov/whisper.cpp/models/coreml-encoder-large-v3.mlmodelc/coremldata.bin
models/coreml-encoder-large-v3.mlmodelc -> models/ggml-large-v3-encoder.mlmodelc
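
Optional check, based on the output path printed above: the conversion should leave a compiled Core ML encoder package in models/.

# the renamed package the loader expects
ls -d models/ggml-large-v3-encoder.mlmodelc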

Build

make clean
WHISPER_COREML=1 make -j
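
A quick way to confirm the Core ML build took effect (a sketch, assuming the model and test.wav from above): the system_info line should report COREML = 1, as in the log below. The -d 1000 flag limits processing to the first second; logs may go to stderr, hence 2>&1.

# check the capability flags of the freshly built binary
./main -m models/ggml-large-v3.bin -f test.wav -d 1000 2>&1 | grep system_info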

Speech recognition

time ./main -m models/ggml-large-v3.bin -f test.wav -l auto 
whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-large-v3.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51866
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 32
whisper_model_load: n_mels        = 128
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 5 (large v3)
whisper_model_load: adding 1609 extra tokens
whisper_model_load: n_langs       = 100
whisper_backend_init: using Metal backend
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Max
ggml_metal_init: picking default device: Apple M2 Max
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: loading '/Users/junjian/GitHub/ggerganov/whisper.cpp/ggml-metal.metal'
ggml_metal_init: GPU name:   Apple M2 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 51539.61 MB
ggml_metal_init: maxTransferRate               = built-in GPU
ggml_metal_add_buffer: allocated 'backend         ' buffer, size =  3117.88 MB, ( 3118.53 / 51539.61)
whisper_model_load:    Metal buffer size =  3117.87 MB
whisper_model_load: model size    = 3117.39 MB
whisper_backend_init: using Metal backend
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Max
ggml_metal_init: picking default device: Apple M2 Max
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: loading '/Users/junjian/GitHub/ggerganov/whisper.cpp/ggml-metal.metal'
ggml_metal_init: GPU name:   Apple M2 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 51539.61 MB
ggml_metal_init: maxTransferRate               = built-in GPU
ggml_metal_add_buffer: allocated 'backend         ' buffer, size =   220.20 MB, ( 3338.73 / 51539.61)
whisper_init_state: kv self size  =  220.20 MB
ggml_metal_add_buffer: allocated 'backend         ' buffer, size =   245.76 MB, ( 3584.49 / 51539.61)
whisper_init_state: kv cross size =  245.76 MB
whisper_init_state: loading Core ML model from 'models/ggml-large-v3-encoder.mlmodelc'
whisper_init_state: first run on a device may take a while ...
whisper_init_state: Core ML model loaded
ggml_metal_add_buffer: allocated 'backend         ' buffer, size =     0.02 MB, ( 3584.51 / 51539.61)
whisper_init_state: compute buffer (conv)   =   10.85 MB
ggml_metal_add_buffer: allocated 'backend         ' buffer, size =     0.02 MB, ( 3584.52 / 51539.61)
whisper_init_state: compute buffer (cross)  =    9.32 MB
ggml_metal_add_buffer: allocated 'backend         ' buffer, size =     0.02 MB, ( 3584.54 / 51539.61)
whisper_init_state: compute buffer (decode) =   99.17 MB
ggml_metal_add_buffer: allocated 'backend         ' buffer, size =     9.22 MB, ( 3593.76 / 51539.61)
ggml_metal_add_buffer: allocated 'backend         ' buffer, size =     7.68 MB, ( 3601.45 / 51539.61)
ggml_metal_add_buffer: allocated 'backend         ' buffer, size =    97.53 MB, ( 3698.98 / 51539.61)

system_info: n_threads = 4 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | METAL = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | CUDA = 0 | COREML = 1 | OPENVINO = 0 | 

main: processing 'test.wav' (6939648 samples, 433.7 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = auto, task = transcribe, timestamps = 1 ...

whisper_full_with_state: auto-detected language: zh (p = 0.916203)

[00:00:00.000 --> 00:00:12.760]  三的前面的一个数字
[00:00:12.760 --> 00:00:21.520]  九的后面的一个数字
[00:00:21.520 --> 00:00:28.480]  八的前面的三个数字
[00:00:28.480 --> 00:00:34.940]  八的后面三个数字
[00:00:34.940 --> 00:00:43.720]  长方形正方形
[00:00:43.720 --> 00:00:46.080]  圆柱球
[00:00:46.080 --> 00:00:53.740]  同左边数地
[00:00:53.740 --> 00:00:57.500]  是地
[00:00:57.500 --> 00:01:02.880]  同左
[00:01:02.880 --> 00:01:05.120]  同右边数
[00:01:05.120 --> 00:01:06.860]  是地
[00:01:06.860 --> 00:01:08.040]  几个
[00:01:08.040 --> 00:01:17.120]  有个
[00:01:17.120 --> 00:01:18.640]  有个
[00:01:18.640 --> 00:01:20.680]  然后这个
[00:01:20.680 --> 00:01:22.120]  数
[00:01:22.120 --> 00:01:24.840]  这什么字啊答案都不清楚
[00:01:24.840 --> 00:01:25.820]  最
[00:01:25.820 --> 00:01:27.460]  最多
[00:01:27.460 --> 00:01:27.480]  最多
[00:01:27.480 --> 00:01:38.520]  一共有几个减法
[00:01:38.520 --> 00:01:42.720]  还剩几个
[00:01:42.720 --> 00:01:44.080]  用
[00:01:44.080 --> 00:01:46.000]  用
[00:01:46.000 --> 00:01:50.960]  加法减法
[00:01:50.960 --> 00:01:51.480]  好
[00:01:51.480 --> 00:01:53.160]  这个
[00:01:53.160 --> 00:01:56.660]  一个
[00:01:56.660 --> 00:01:57.460]  一个
[00:01:57.460 --> 00:01:59.140]  这个数
[00:01:59.140 --> 00:02:02.200]  个位上是五
[00:02:02.200 --> 00:02:05.260]  十位上是一
[00:02:05.260 --> 00:02:08.060]  这个数是
[00:02:08.060 --> 00:02:12.820]  十个十
[00:02:12.820 --> 00:02:21.900]  和四个一合起来是一个
[00:02:21.900 --> 00:02:29.660]  十和九个一合起来是
[00:02:29.660 --> 00:02:33.980]  十六里面有
[00:02:33.980 --> 00:02:35.780]  个
[00:02:35.780 --> 00:02:38.540]  十和
[00:02:38.540 --> 00:02:42.460]  个一二十里面
[00:02:42.460 --> 00:02:44.620]  有
[00:02:44.620 --> 00:02:48.340]  个十
[00:02:48.340 --> 00:02:51.620]  五十五合十一
[00:02:51.620 --> 00:03:05.100]  中间的数是和9相邻的两个数是
[00:03:05.100 --> 00:03:19.280]  比16多1的数是比16少3的数是
[00:03:19.280 --> 00:03:35.280]  被解数是11解数是3差是
[00:03:35.280 --> 00:03:46.880]  最大的疑问数是最小的两位数是
[00:03:46.880 --> 00:03:47.280]  5
[00:03:47.280 --> 00:03:49.260]  最大的疑问数是最小的两位数是
[00:03:49.260 --> 00:03:49.260]  最大的疑问数是最小的两位数是
[00:03:49.260 --> 00:03:49.260]  最大的疑问数是最小的两位数是
[00:03:49.260 --> 00:03:49.260]  最大的疑问数是最小的两位数是
[00:03:49.260 --> 00:03:49.260]  最大的疑问数是最小的两位数是
[00:03:49.260 --> 00:04:19.240]  最大的疑问数是最小的两位数是
[00:04:19.240 --> 00:04:49.220]  最大的疑问数是最小的两位数是
[00:04:49.220 --> 00:05:19.200]  最大的疑问数是最小的两位数是
[00:05:19.200 --> 00:05:49.180]  最大的疑问数是最小的两位数是
[00:05:49.180 --> 00:06:19.160]  最大的疑问数是最小的两位数是
[00:06:19.160 --> 00:06:49.140]  最大的疑问数是最小的两位数是最小的两位数是
[00:06:49.140 --> 00:07:13.760]  最大的疑问数是最小的两位数是最小的两位数是


whisper_print_timings:     load time =   859.73 ms
whisper_print_timings:     fallbacks =   3 p /   1 h
whisper_print_timings:      mel time =   224.71 ms
whisper_print_timings:   sample time =  2659.66 ms /  8154 runs (    0.33 ms per run)
whisper_print_timings:   encode time =  5801.61 ms /    16 runs (  362.60 ms per run)
whisper_print_timings:   decode time =  4105.18 ms /   279 runs (   14.71 ms per run)
whisper_print_timings:   batchd time = 54016.19 ms /  7797 runs (    6.93 ms per run)
whisper_print_timings:   prompt time =  1218.58 ms /  2721 runs (    0.45 ms per run)
whisper_print_timings:    total time = 71318.02 ms
ggml_metal_free: deallocating
ggml_metal_free: deallocating
./main -m models/ggml-large-v3.bin -f test.wav -l auto  8.81s user 2.29s system 15% cpu 1:11.44 total

Encoder only

The result feels the same as above, in both transcription quality and speed.

models/generate-coreml-model.sh large-v3 --encoder-only True
time ./main -m models/ggml-large-v3.bin -f test.wav -l auto 

Summary

| | NEON & MPS 👍 | CoreML 🚀 (~47% faster) |
|---|---|---|
| load time | 1007.19 ms | 859.73 ms |
| mel time | 216.87 ms | 224.71 ms |
| sample time | 3550.35 ms | 2659.66 ms |
| encode time | 7821.69 ms | 5801.61 ms |
| decode time | 2958.22 ms | 4105.18 ms |
| batchd time | 88241.95 ms | 54016.19 ms |
| prompt time | 1618.32 ms | 1218.58 ms |
| total time | 105432.62 ms | 71318.02 ms |
| cpu time | 1:45.50 | 1:11.44 |
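
Where the "~47% faster" figure comes from, plus the real-time factors implied by the total times (a quick sketch with bc; numbers copied from the table):

echo "scale=3; 105432.62 / 71318.02 - 1" | bc     # ≈ .478, i.e. CoreML is ~47% faster
echo "scale=2; 433.7 / 105.43" | bc               # NEON & MPS ≈ 4.11x real time
echo "scale=2; 433.7 / 71.32"  | bc               # CoreML     ≈ 6.08x real time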

Speed improved, but transcription quality got worse.

Performance comparison (NEON & MPS)

Download the models from ggerganov/whisper.cpp

git clone https://huggingface.co/ggerganov/whisper.cpp ggerganov/whisper.cpp

Create symlinks to the models

Write the script ln-models.sh

#!/bin/bash

# Source directory
src_dir="/Users/junjian/HuggingFace/ggerganov/whisper.cpp"

# Destination directory
dst_dir="models"

# Iterate over all files in the source directory
for src_file in "$src_dir"/*
do
    # Get the file name
    file_name=$(basename "$src_file")

    # Get the file extension
    extension="${file_name##*.}"

    # Create a symlink unless the file is README.md or has a .zip extension
    if [ "$file_name" != "README.md" ] && [ "$extension" != "zip" ]
    then
        ln -s "$src_file" "$dst_dir/$file_name"
    fi
done

Run the script

sh ln-models.sh
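
Optional check: list the symlinks that now point from models/ into the Hugging Face clone.

# each model file should show up as "name -> /Users/.../ggerganov/whisper.cpp/name"
ls -l models | grep ' -> '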

The performance tests below all use the same audio file, test.wav (433.7 s, per the log above).

Model: tiny

time ./main -f test.wav -l zh -m models/ggml-tiny.bin
whisper_print_timings:     load time =    73.39 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   132.73 ms
whisper_print_timings:   sample time =  1967.16 ms /  8244 runs (    0.24 ms per run)
whisper_print_timings:   encode time =   259.27 ms /    13 runs (   19.94 ms per run)
whisper_print_timings:   decode time =    69.91 ms /    38 runs (    1.84 ms per run)
whisper_print_timings:   batchd time =  5104.22 ms /  8130 runs (    0.63 ms per run)
whisper_print_timings:   prompt time =    55.85 ms /  2175 runs (    0.03 ms per run)
whisper_print_timings:    total time =  7675.59 ms
./main -f test.wav -l zh -m models/ggml-tiny.bin  5.40s user 0.50s system 76% cpu 7.704 total

Model: tiny-q5_1

time ./main -f test.wav -l zh -m models/ggml-tiny-q5_1.bin
whisper_print_timings:     load time =    68.97 ms
whisper_print_timings:     fallbacks =   1 p /   0 h
whisper_print_timings:      mel time =   134.27 ms
whisper_print_timings:   sample time =  2650.25 ms / 10960 runs (    0.24 ms per run)
whisper_print_timings:   encode time =   232.78 ms /    11 runs (   21.16 ms per run)
whisper_print_timings:   decode time =     7.82 ms /     5 runs (    1.56 ms per run)
whisper_print_timings:   batchd time =  7218.69 ms / 10898 runs (    0.66 ms per run)
whisper_print_timings:   prompt time =    69.42 ms /  2452 runs (    0.03 ms per run)
whisper_print_timings:    total time = 10395.01 ms
./main -f test.wav -l zh -m models/ggml-tiny-q5_1.bin  7.20s user 0.63s system 75% cpu 10.422 total

Model: base

time ./main -f test.wav -l zh -m models/ggml-base.bin
whisper_print_timings:     load time =    81.90 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   133.21 ms
whisper_print_timings:   sample time =  2283.01 ms /  9560 runs (    0.24 ms per run)
whisper_print_timings:   encode time =   396.49 ms /    11 runs (   36.04 ms per run)
whisper_print_timings:   decode time =     7.37 ms /     3 runs (    2.46 ms per run)
whisper_print_timings:   batchd time =  8629.71 ms /  9505 runs (    0.91 ms per run)
whisper_print_timings:   prompt time =    88.45 ms /  2226 runs (    0.04 ms per run)
whisper_print_timings:    total time = 11631.52 ms
./main -f test.wav -l zh -m models/ggml-base.bin  6.29s user 0.60s system 59% cpu 11.664 total

Model: base-q5_1

time ./main -f test.wav -l zh -m models/ggml-base-q5_1.bin
whisper_print_timings:     load time =    63.39 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   132.60 ms
whisper_print_timings:   sample time =  2266.66 ms /  9567 runs (    0.24 ms per run)
whisper_print_timings:   encode time =   424.01 ms /    11 runs (   38.55 ms per run)
whisper_print_timings:   decode time =     7.25 ms /     3 runs (    2.42 ms per run)
whisper_print_timings:   batchd time =  8911.47 ms /  9512 runs (    0.94 ms per run)
whisper_print_timings:   prompt time =    98.58 ms /  2227 runs (    0.04 ms per run)
whisper_print_timings:    total time = 11916.36 ms
./main -f test.wav -l zh -m models/ggml-base-q5_1.bin  6.18s user 0.56s system 56% cpu 11.948 total

Model: small

time ./main -f test.wav -l zh -m models/ggml-small.bin
whisper_print_timings:     load time =   200.65 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   132.19 ms
whisper_print_timings:   sample time =  2134.68 ms /  8277 runs (    0.26 ms per run)
whisper_print_timings:   encode time =  1222.70 ms /    12 runs (  101.89 ms per run)
whisper_print_timings:   decode time =    24.96 ms /     5 runs (    4.99 ms per run)
whisper_print_timings:   batchd time = 15979.48 ms /  8208 runs (    1.95 ms per run)
whisper_print_timings:   prompt time =   218.47 ms /  2191 runs (    0.10 ms per run)
whisper_print_timings:    total time = 19925.94 ms
./main -f test.wav -l zh -m models/ggml-small.bin  6.21s user 0.67s system 34% cpu 19.968 total

Model: small-q5_1

time ./main -f test.wav -l zh -m models/ggml-small-q5_1.bin
whisper_print_timings:     load time =    99.42 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   131.74 ms
whisper_print_timings:   sample time =  2121.60 ms /  8218 runs (    0.26 ms per run)
whisper_print_timings:   encode time =  1419.51 ms /    13 runs (  109.19 ms per run)
whisper_print_timings:   decode time =   147.85 ms /    33 runs (    4.48 ms per run)
whisper_print_timings:   batchd time = 15960.53 ms /  8116 runs (    1.97 ms per run)
whisper_print_timings:   prompt time =   266.62 ms /  2419 runs (    0.11 ms per run)
whisper_print_timings:    total time = 20160.34 ms
./main -f test.wav -l zh -m models/ggml-small-q5_1.bin  6.03s user 0.59s system 32% cpu 20.191 total

Model: medium

time ./main -f test.wav -l zh -m models/ggml-medium.bin
whisper_print_timings:     load time =   476.85 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   132.37 ms
whisper_print_timings:   sample time =  2233.07 ms /  9028 runs (    0.25 ms per run)
whisper_print_timings:   encode time =  2951.15 ms /    11 runs (  268.29 ms per run)
whisper_print_timings:   decode time =    42.86 ms /     4 runs (   10.72 ms per run)
whisper_print_timings:   batchd time = 38405.30 ms /  8972 runs (    4.28 ms per run)
whisper_print_timings:   prompt time =   550.54 ms /  2232 runs (    0.25 ms per run)
whisper_print_timings:    total time = 44803.71 ms
./main -f test.wav -l zh -m models/ggml-medium.bin  7.14s user 1.01s system 18% cpu 44.848 total

Model: medium-q5_0

time ./main -f test.wav -l zh -m models/ggml-medium-q5_0.bin
whisper_print_timings:     load time =   203.72 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   132.13 ms
whisper_print_timings:   sample time =  2251.24 ms /  9072 runs (    0.25 ms per run)
whisper_print_timings:   encode time =  3288.98 ms /    11 runs (  299.00 ms per run)
whisper_print_timings:   decode time =    55.99 ms /     6 runs (    9.33 ms per run)
whisper_print_timings:   batchd time = 39768.11 ms /  9014 runs (    4.41 ms per run)
whisper_print_timings:   prompt time =   624.49 ms /  2234 runs (    0.28 ms per run)
whisper_print_timings:    total time = 46336.09 ms
./main -f test.wav -l zh -m models/ggml-medium-q5_0.bin  7.02s user 0.81s system 16% cpu 46.372 total

Model: large-v3

time ./main -f test.wav -l zh -m models/ggml-large-v3.bin
whisper_print_timings:     load time =   859.78 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   151.76 ms
whisper_print_timings:   sample time =  2201.30 ms /  8640 runs (    0.25 ms per run)
whisper_print_timings:   encode time =  5561.91 ms /    12 runs (  463.49 ms per run)
whisper_print_timings:   decode time =  1033.94 ms /    67 runs (   15.43 ms per run)
whisper_print_timings:   batchd time = 55377.50 ms /  8503 runs (    6.51 ms per run)
whisper_print_timings:   prompt time =   820.89 ms /  1975 runs (    0.42 ms per run)
whisper_print_timings:    total time = 66020.98 ms
./main -f test.wav -l zh -m models/ggml-large-v3.bin  7.45s user 1.40s system 13% cpu 1:06.08 total

Model: large-v3-q5_0

time ./main -f test.wav -l zh -m models/ggml-large-v3-q5_0.bin
whisper_print_timings:     load time =   341.02 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   159.70 ms
whisper_print_timings:   sample time =  2230.36 ms /  8311 runs (    0.27 ms per run)
whisper_print_timings:   encode time =  5895.91 ms /    11 runs (  535.99 ms per run)
whisper_print_timings:   decode time =   344.94 ms /    25 runs (   13.80 ms per run)
whisper_print_timings:   batchd time = 57613.93 ms /  8239 runs (    6.99 ms per run)
whisper_print_timings:   prompt time =  1062.95 ms /  2198 runs (    0.48 ms per run)
whisper_print_timings:    total time = 67659.70 ms
./main -f test.wav -l zh -m models/ggml-large-v3-q5_0.bin  7.25s user 1.03s system 12% cpu 1:07.71 total

Summary

| | tiny | tiny-q5_1 | base | base-q5_1 | small | small-q5_1 | medium | medium-q5_0 | large-v3 | large-v3-q5_0 |
|---|---|---|---|---|---|---|---|---|---|---|
| load time | 73.39 ms | 68.97 ms | 81.90 ms | 63.39 ms | 200.65 ms | 99.42 ms | 476.85 ms | 203.72 ms | 859.78 ms | 341.02 ms |
| mel time | 132.73 ms | 134.27 ms | 133.21 ms | 132.60 ms | 132.19 ms | 131.74 ms | 132.37 ms | 132.13 ms | 151.76 ms | 159.70 ms |
| sample time | 1967.16 ms | 2650.25 ms | 2283.01 ms | 2266.66 ms | 2134.68 ms | 2121.60 ms | 2233.07 ms | 2251.24 ms | 2201.30 ms | 2230.36 ms |
| encode time | 259.27 ms | 232.78 ms | 396.49 ms | 424.01 ms | 1222.70 ms | 1419.51 ms | 2951.15 ms | 3288.98 ms | 5561.91 ms | 5895.91 ms |
| decode time | 69.91 ms | 7.82 ms | 7.37 ms | 7.25 ms | 24.96 ms | 147.85 ms | 42.86 ms | 55.99 ms | 1033.94 ms | 344.94 ms |
| batchd time | 5104.22 ms | 7218.69 ms | 8629.71 ms | 8911.47 ms | 15979.48 ms | 15960.53 ms | 38405.30 ms | 39768.11 ms | 55377.50 ms | 57613.93 ms |
| prompt time | 55.85 ms | 69.42 ms | 88.45 ms | 98.58 ms | 218.47 ms | 266.62 ms | 550.54 ms | 624.49 ms | 820.89 ms | 1062.95 ms |
| total time | 7675.59 ms | 10395.01 ms | 11631.52 ms | 11916.36 ms | 19925.94 ms | 20160.34 ms | 44803.71 ms | 46336.09 ms | 66020.98 ms | 67659.70 ms |
| cpu time | 0:07.70 | 0:10.42 | 0:11.66 | 0:11.94 | 0:19.96 | 0:20.19 | 0:44.84 | 0:46.37 | 1:06.08 | 1:07.71 |
| | | | 🚀 | 🚀 | 🚀🚀 | 🚀🚀 | 🚀🚀🚀🚀 | 🚀🚀🚀🚀 | 🚀🚀🚀🚀🚀🚀 | 🚀🚀🚀🚀🚀🚀 |
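
A rough view of the same data as real-time factors (433.7 s of audio divided by the total time row, converted to seconds); a sketch covering the non-quantized models only, with values copied from the table:

# print the real-time factor for a few models
printf '%s\n' "tiny 7.68" "base 11.63" "small 19.93" "medium 44.80" "large-v3 66.02" |
while read -r model secs; do
    printf '%-9s %sx real time\n' "$model" "$(echo "scale=1; 433.7 / $secs" | bc)"
done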

Performance comparison (MLX)

Write the script test-speed.py

import argparse
import whisper

# Create an argument parser
parser = argparse.ArgumentParser(description='Transcribe a speech file using a specific model.')
parser.add_argument('speech_file', type=str, help='The path to the speech file.')
parser.add_argument('model', type=str, help='The model to use for transcription.')

# Parse the command-line arguments
args = parser.parse_args()

# Transcribe the given audio file with the specified model
text = whisper.transcribe(args.speech_file, model=args.model, initial_prompt='大家好!')["text"]
print(text)

Run the script

time python test-speed.py test.wav base

The performance tests below use the same test.wav audio file as above (433.7 s).

Model: tiny

9.83s user 8.39s system 142% cpu 12.813 total

Model: base

7.93s user 6.82s system 143% cpu 10.297 total

Model: small

14.49s user 9.87s system 129% cpu 18.812 total

Model: medium

30.05s user 17.96s system 122% cpu 39.291 total

Model: large-v3

47.01s user 28.10s system 119% cpu 1:02.96 total

Summary

| | tiny | base | small | medium | large-v3 |
|---|---|---|---|---|---|
| cpu time | 0:12.81 | 0:10.30 | 0:18.81 | 0:39.29 | 1:02.96 |

Performance comparison (NEON & MPS 🆚 MLX)

| | tiny | base | small | medium | large-v3 |
|---|---|---|---|---|---|
| NEON & MPS | 0:07.70 | 0:11.66 | 0:19.96 | 0:44.84 | 1:06.08 |
| MLX | 0:12.81 | 0:10.30 | 0:18.81 | 0:39.29 | 1:02.96 |

Except for the tiny model, MLX now outperforms whisper.cpp.

References