whisper.cpp
NEON & MPS vs CoreML
Download the model (large-v3)
models/download-ggml-model.sh large-v3
NEON & MPS
Build
make clean
make -j
main help
./main --help
usage: ./main [options] file0.wav file1.wav ...
options:
-h, --help [default] show this help message and exit
-t N, --threads N [4 ] number of threads to use during computation
-p N, --processors N [1 ] number of processors to use during computation
-ot N, --offset-t N [0 ] time offset in milliseconds
-on N, --offset-n N [0 ] segment index offset
-d N, --duration N [0 ] duration of audio to process in milliseconds
-mc N, --max-context N [-1 ] maximum number of text context tokens to store
-ml N, --max-len N [0 ] maximum segment length in characters
-sow, --split-on-word [false ] split on word rather than on token
-bo N, --best-of N [5 ] number of best candidates to keep
-bs N, --beam-size N [5 ] beam size for beam search
-wt N, --word-thold N [0.01 ] word timestamp probability threshold
-et N, --entropy-thold N [2.40 ] entropy threshold for decoder fail
-lpt N, --logprob-thold N [-1.00 ] log probability threshold for decoder fail
-debug, --debug-mode [false ] enable debug mode (eg. dump log_mel)
-tr, --translate [false ] translate from source language to english
-di, --diarize [false ] stereo audio diarization
-tdrz, --tinydiarize [false ] enable tinydiarize (requires a tdrz model)
-nf, --no-fallback [false ] do not use temperature fallback while decoding
-otxt, --output-txt [false ] output result in a text file
-ovtt, --output-vtt [false ] output result in a vtt file
-osrt, --output-srt [false ] output result in a srt file
-olrc, --output-lrc [false ] output result in a lrc file
-owts, --output-words [false ] output script for generating karaoke video
-fp, --font-path [/System/Library/Fonts/Supplemental/Courier New Bold.ttf] path to a monospace font for karaoke video
-ocsv, --output-csv [false ] output result in a CSV file
-oj, --output-json [false ] output result in a JSON file
-ojf, --output-json-full [false ] include more information in the JSON file
-of FNAME, --output-file FNAME [ ] output file path (without file extension)
-ps, --print-special [false ] print special tokens
-pc, --print-colors [false ] print colors
-pp, --print-progress [false ] print progress
-nt, --no-timestamps [false ] do not print timestamps
-l LANG, --language LANG [en ] spoken language ('auto' for auto-detect)
-dl, --detect-language [false ] exit after automatically detecting language
--prompt PROMPT [ ] initial prompt
-m FNAME, --model FNAME [models/ggml-base.en.bin] model path
-f FNAME, --file FNAME [ ] input WAV file path
-oved D, --ov-e-device DNAME [CPU ] the OpenVINO device used for encode inference
-ls, --log-score [false ] log best decoder scores of tokens
-ng, --no-gpu [false ] disable GPU
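When the same binary is run many times with different models, it can help to assemble the command line in one place. The following is a minimal sketch that uses only flags listed in the help text above (`-m`, `-f`, `-l`, `-t`, `-osrt`, `-of`); the function name `build_main_command` and its parameters are hypothetical and not part of whisper.cpp.

```python
from typing import List, Optional


def build_main_command(
    wav_path: str,
    model_path: str = "models/ggml-base.en.bin",  # default shown in the help text
    language: str = "en",                          # default shown in the help text
    threads: int = 4,                              # default shown in the help text
    output_srt: bool = False,
    output_file: Optional[str] = None,
) -> List[str]:
    """Return an argv list for ./main built only from flags listed in --help."""
    cmd = ["./main", "-m", model_path, "-f", wav_path, "-l", language, "-t", str(threads)]
    if output_srt:
        cmd.append("-osrt")          # also write the result as an .srt file
    if output_file is not None:
        cmd += ["-of", output_file]  # output path without file extension
    return cmd


if __name__ == "__main__":
    # Same model, file and language as the runs in this post.
    print(" ".join(build_main_command("test.wav", "models/ggml-large-v3.bin", "auto")))
```

The list form can be passed directly to `subprocess.run`, which avoids shell quoting problems with paths that contain spaces.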
Speech recognition
time ./main -m models/ggml-large-v3.bin -f test.wav -l auto
whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-large-v3.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51866
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 1280
whisper_model_load: n_text_head = 20
whisper_model_load: n_text_layer = 32
whisper_model_load: n_mels = 128
whisper_model_load: ftype = 1
whisper_model_load: qntvr = 0
whisper_model_load: type = 5 (large v3)
whisper_model_load: adding 1609 extra tokens
whisper_model_load: n_langs = 100
whisper_backend_init: using Metal backend
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Max
ggml_metal_init: picking default device: Apple M2 Max
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: loading '/Users/junjian/GitHub/ggerganov/whisper.cpp/ggml-metal.metal'
ggml_metal_init: GPU name: Apple M2 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 51539.61 MB
ggml_metal_init: maxTransferRate = built-in GPU
ggml_metal_add_buffer: allocated 'backend ' buffer, size = 3117.88 MB, ( 3118.53 / 51539.61)
whisper_model_load: Metal buffer size = 3117.87 MB
whisper_model_load: model size = 3117.39 MB
whisper_backend_init: using Metal backend
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Max
ggml_metal_init: picking default device: Apple M2 Max
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: loading '/Users/junjian/GitHub/ggerganov/whisper.cpp/ggml-metal.metal'
ggml_metal_init: GPU name: Apple M2 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 51539.61 MB
ggml_metal_init: maxTransferRate = built-in GPU
ggml_metal_add_buffer: allocated 'backend ' buffer, size = 220.20 MB, ( 3338.73 / 51539.61)
whisper_init_state: kv self size = 220.20 MB
ggml_metal_add_buffer: allocated 'backend ' buffer, size = 245.76 MB, ( 3584.49 / 51539.61)
whisper_init_state: kv cross size = 245.76 MB
ggml_metal_add_buffer: allocated 'backend ' buffer, size = 0.02 MB, ( 3584.51 / 51539.61)
whisper_init_state: compute buffer (conv) = 32.36 MB
ggml_metal_add_buffer: allocated 'backend ' buffer, size = 0.02 MB, ( 3584.52 / 51539.61)
whisper_init_state: compute buffer (encode) = 212.36 MB
ggml_metal_add_buffer: allocated 'backend ' buffer, size = 0.02 MB, ( 3584.54 / 51539.61)
whisper_init_state: compute buffer (cross) = 9.32 MB
ggml_metal_add_buffer: allocated 'backend ' buffer, size = 0.02 MB, ( 3584.56 / 51539.61)
whisper_init_state: compute buffer (decode) = 99.17 MB
ggml_metal_add_buffer: allocated 'backend ' buffer, size = 30.72 MB, ( 3615.28 / 51539.61)
ggml_metal_add_buffer: allocated 'backend ' buffer, size = 210.73 MB, ( 3826.01 / 51539.61)
ggml_metal_add_buffer: allocated 'backend ' buffer, size = 7.68 MB, ( 3833.69 / 51539.61)
ggml_metal_add_buffer: allocated 'backend ' buffer, size = 97.53 MB, ( 3931.23 / 51539.61)
system_info: n_threads = 4 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | METAL = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0 |
main: processing 'test.wav' (6939648 samples, 433.7 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = auto, task = transcribe, timestamps = 1 ...
whisper_full_with_state: auto-detected language: zh (p = 0.916022)
[00:00:00.000 --> 00:00:12.760] äžçåé¢çäžäžªæ°å
[00:00:12.760 --> 00:00:21.520] ä¹çåé¢çäžäžªæ°å
[00:00:21.520 --> 00:00:28.480] å
«çåé¢çäžäžªæ°å
[00:00:28.480 --> 00:00:34.940] å
«çåé¢äžäžªæ°å
[00:00:34.940 --> 00:00:46.080] é¿æ¹åœ¢ æ£æ¹åœ¢ æ£æ¹åœ¢ åæ±ç
[00:00:46.080 --> 00:00:53.440] å巊蟹æ°D
[00:00:53.440 --> 00:00:57.340] æ¯D
[00:00:57.340 --> 00:01:05.080] åå·Š åå³èŸ¹æ°
[00:01:05.080 --> 00:01:08.020] æ¯Då 䞪
[00:01:08.020 --> 00:01:22.080] æäžª ç¶åè¿äžªæ°
[00:01:22.080 --> 00:01:24.840] è¿ä»ä¹åå çæ¡éœäžæž
æ¥
[00:01:24.840 --> 00:01:25.840] æ
[00:01:25.840 --> 00:01:27.080] æå€
[00:01:27.080 --> 00:01:34.820] æå€
[00:01:34.820 --> 00:01:40.820] äžå
±æå äžªåæ³
[00:01:40.820 --> 00:01:45.620] è¿å©å 䞪çš
[00:01:45.620 --> 00:01:49.620] çš
[00:01:49.620 --> 00:01:55.620] å æ³ åæ³
[00:01:55.620 --> 00:01:56.820] äžäžª
[00:01:56.820 --> 00:01:59.080] äžäžªæ°
[00:01:59.080 --> 00:02:02.180] 䞪äœäžæ¯äº
[00:02:02.180 --> 00:02:05.220] åäœäžæ¯äž
[00:02:05.220 --> 00:02:12.620] è¿äžªæ°æ¯å䞪å
[00:02:12.620 --> 00:02:18.880] åå䞪äžåèµ·æ¥
[00:02:18.880 --> 00:02:26.560] æ¯äžäžªååä¹äžªäžåèµ·æ¥
[00:02:26.560 --> 00:02:48.320] æ¯äžäžªåå
éé¢æäžªååäžäžªäžäºåéé¢æäžªå
[00:02:48.320 --> 00:02:56.300] äºåäºååå äžéŽçæ°æ¯
[00:02:56.300 --> 00:03:06.400] åä¹çžé»çäž€äžªæ°æ¯
[00:03:06.400 --> 00:03:14.560] æ¯å
æ¯åå
å€äžçæ°æ¯
[00:03:14.560 --> 00:03:19.560] æ¯åå
å°äžçæ°æ¯
[00:03:19.560 --> 00:03:20.100] æ¯åå
å°äžçæ°æ¯
[00:03:20.100 --> 00:03:20.660] æ¯åå
å°äžçæ°æ¯
[00:03:20.660 --> 00:03:21.660] æ¯åå
å°äžçæ°æ¯
[00:03:21.660 --> 00:03:22.660] æ¯åå
å°äžçæ°æ¯
[00:03:22.660 --> 00:03:24.040] æ¯åå
å°äžçæ°æ¯
[00:03:24.040 --> 00:03:26.040] æ¯åå
å°äžçæ°æ¯
[00:03:26.040 --> 00:03:26.780] æ¯åå
å°äžçæ°æ¯
[00:03:26.780 --> 00:03:29.100] è¢«åæ°
[00:03:29.100 --> 00:03:31.500] äºæ¯åäž
[00:03:31.500 --> 00:03:34.280] åæ°æ¯äž
[00:03:34.280 --> 00:03:39.380] å·®æ¯äº
[00:03:39.380 --> 00:03:43.900] æå€§çç鮿°æ¯
[00:03:43.900 --> 00:03:52.280] æå°çäž€äœæ°æ¯
[00:03:52.280 --> 00:03:55.780] ä¹
[00:03:55.780 --> 00:04:05.640] åä¹è¿äžªæ°åšäœçœ®
[00:04:05.640 --> 00:04:07.980] æŒäº®çåæç
[00:04:07.980 --> 00:04:19.580] 衚衚快ç¹å°±çäžæ¯å
[00:04:19.580 --> 00:04:22.820] 衚æ¯
[00:04:22.820 --> 00:04:27.040] 䞪åšäœ
[00:04:27.040 --> 00:04:32.040] 衚瀺䞪
[00:04:32.040 --> 00:04:36.280] è¿äžª
[00:04:36.280 --> 00:04:38.340] ç®
[00:04:38.340 --> 00:04:44.680] æ¯äžéŽæ°æ¯å
[00:04:44.680 --> 00:04:47.740] 忝
[00:04:47.740 --> 00:04:49.320] è¿äžª
[00:04:49.320 --> 00:04:57.760] ç®æ¯äž
[00:04:57.760 --> 00:05:01.160] è¢«åæ°æ¯
[00:05:01.160 --> 00:05:06.160] åæ°æ¯å·®æ¯
[00:05:06.160 --> 00:05:12.880] äžäžªå æ°æ¯åäº
[00:05:12.880 --> 00:05:19.060] äžäžªå æ°æ¯å
[00:05:19.060 --> 00:05:21.100] 忝åäº
[00:05:21.100 --> 00:05:27.260] è¢«åæ°æ¯åäº
[00:05:27.260 --> 00:05:34.900] åæ°æ¯åäžå·®æ¯åäº
[00:05:34.900 --> 00:05:48.800] äžäžªæ°ä»å³èŸ¹èµ·ç¬¬äžäœæ¯äœç¬¬äº
[00:05:48.800 --> 00:06:03.140] äžäžªæ°æ¯åäºå æ°æ¯åäžéé¢æäžªåå䞪äž
[00:06:03.140 --> 00:06:11.840] äžäžªæ°æ¯åäºå æ°æ¯åäºå æ°æ¯åäžéé¢æäžªåå䞪äž
[00:06:11.840 --> 00:06:22.580] ååäºçžé»çäž€äžªæ°æ¯
[00:06:22.580 --> 00:06:34.880] 䞪äžäžªäžåäžäžªååèµ·æ¥æ¯
[00:06:34.880 --> 00:06:41.580] åè¿äž€å°å¯¹
[00:06:41.580 --> 00:06:55.920] äžäžªæ°æ¯åäºå æ°æ¯åäžéé¢æäžªåå䞪äž
[00:06:55.920 --> 00:07:00.780] äžäžªæ°æ¯åäºå æ°æ¯åäžéé¢æäžªåå䞪äž
[00:07:00.780 --> 00:07:05.380] äžäžªæ°æ¯åäºå æ°æ¯åäžéé¢æäžªåå䞪äž
[00:07:05.380 --> 00:07:11.320] äžäžªæ°æ¯åäºå æ°æ¯åäžéé¢æäžªåå䞪äž
[00:07:11.320 --> 00:07:13.320] è¬è¬
whisper_print_timings: load time = 1007.19 ms
whisper_print_timings: fallbacks = 5 p / 4 h
whisper_print_timings: mel time = 216.87 ms
whisper_print_timings: sample time = 3550.35 ms / 12205 runs ( 0.29 ms per run)
whisper_print_timings: encode time = 7821.69 ms / 17 runs ( 460.10 ms per run)
whisper_print_timings: decode time = 2958.22 ms / 198 runs ( 14.94 ms per run)
whisper_print_timings: batchd time = 88241.95 ms / 11913 runs ( 7.41 ms per run)
whisper_print_timings: prompt time = 1618.32 ms / 3783 runs ( 0.43 ms per run)
whisper_print_timings: total time = 105432.62 ms
ggml_metal_free: deallocating
ggml_metal_free: deallocating
./main -m models/ggml-large-v3.bin -f test.wav -l auto 11.65s user 1.89s system 12% cpu 1:45.50 total
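The `whisper_print_timings` lines above are what the summary tables later in this post are built from. Copying them by hand is error-prone, so here is a small sketch that extracts them from a captured log. It assumes the line format shown above (`<stage> time = <value> ms`); the name `parse_timings` is made up for this example.

```python
import re
from typing import Dict

# Matches lines such as:
#   whisper_print_timings: encode time = 7821.69 ms / 17 runs ( 460.10 ms per run)
# The "fallbacks = 5 p / 4 h" line has no "time" field and is skipped.
_TIMING_RE = re.compile(r"whisper_print_timings:\s+(\w+) time\s*=\s*([\d.]+) ms")


def parse_timings(log_text: str) -> Dict[str, float]:
    """Return {stage: milliseconds} for every '<stage> time = N ms' line."""
    return {m.group(1): float(m.group(2)) for m in _TIMING_RE.finditer(log_text)}


if __name__ == "__main__":
    sample = (
        "whisper_print_timings: load time = 1007.19 ms\n"
        "whisper_print_timings: total time = 105432.62 ms\n"
    )
    print(parse_timings(sample))
```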
CoreML
Install dependencies
pip install openai-whisper coremltools ane-transformers
Generate the CoreML model
models/generate-coreml-model.sh large-v3
ModelDimensions(n_mels=128, n_audio_ctx=1500, n_audio_state=1280, n_audio_head=20, n_audio_layer=32, n_vocab=51866, n_text_ctx=448, n_text_state=1280, n_text_head=20, n_text_layer=32)
/Users/junjian/GitHub/ggerganov/whisper.cpp/env/lib/python3.10/site-packages/whisper/model.py:166: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert x.shape[1:] == self.positional_embedding.shape, "incorrect audio shape"
/Users/junjian/GitHub/ggerganov/whisper.cpp/env/lib/python3.10/site-packages/whisper/model.py:97: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
scale = (n_state // self.n_head) ** -0.25
Converting PyTorch Frontend ==> MIL Ops: 100%|████████████████████████████████████████████████████████████████████████| 2611/2612 [00:00<00:00, 4137.49 ops/s]
Running MIL frontend_pytorch pipeline: 100%|███████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 34.67 passes/s]
Running MIL default pipeline: 100%|██████████████████████████████████████████████████████████████████████████████████████| 71/71 [00:15<00:00, 4.46 passes/s]
Running MIL backend_mlprogram pipeline: 100%|███████████████████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 331.10 passes/s]
done converting
/Users/junjian/GitHub/ggerganov/whisper.cpp/models/coreml-encoder-large-v3.mlmodelc/coremldata.bin
models/coreml-encoder-large-v3.mlmodelc -> models/ggml-large-v3-encoder.mlmodelc
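The last line of the conversion output shows where the compiled encoder ends up: next to the ggml model, under the name `ggml-large-v3-encoder.mlmodelc`. Before rebuilding with CoreML support it can be worth checking that this directory exists. The sketch below assumes the encoder path is always the model file name with `.bin` replaced by `-encoder.mlmodelc`; that rule is inferred from this single example and may not hold for every model name. Both function names are hypothetical.

```python
from pathlib import Path


def coreml_encoder_path(model_path: str) -> Path:
    """Map models/ggml-large-v3.bin -> models/ggml-large-v3-encoder.mlmodelc.

    Assumption: the naming rule is inferred from the output above.
    """
    p = Path(model_path)
    return p.with_name(p.stem + "-encoder.mlmodelc")


def has_coreml_encoder(model_path: str) -> bool:
    """True if the compiled Core ML encoder directory exists next to the model."""
    return coreml_encoder_path(model_path).is_dir()


if __name__ == "__main__":
    model = "models/ggml-large-v3.bin"
    print(coreml_encoder_path(model), has_coreml_encoder(model))
```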
Build
make clean
WHISPER_COREML=1 make -j
Speech recognition
time ./main -m models/ggml-large-v3.bin -f test.wav -l auto
whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-large-v3.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51866
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 1280
whisper_model_load: n_text_head = 20
whisper_model_load: n_text_layer = 32
whisper_model_load: n_mels = 128
whisper_model_load: ftype = 1
whisper_model_load: qntvr = 0
whisper_model_load: type = 5 (large v3)
whisper_model_load: adding 1609 extra tokens
whisper_model_load: n_langs = 100
whisper_backend_init: using Metal backend
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Max
ggml_metal_init: picking default device: Apple M2 Max
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: loading '/Users/junjian/GitHub/ggerganov/whisper.cpp/ggml-metal.metal'
ggml_metal_init: GPU name: Apple M2 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 51539.61 MB
ggml_metal_init: maxTransferRate = built-in GPU
ggml_metal_add_buffer: allocated 'backend ' buffer, size = 3117.88 MB, ( 3118.53 / 51539.61)
whisper_model_load: Metal buffer size = 3117.87 MB
whisper_model_load: model size = 3117.39 MB
whisper_backend_init: using Metal backend
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Max
ggml_metal_init: picking default device: Apple M2 Max
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: loading '/Users/junjian/GitHub/ggerganov/whisper.cpp/ggml-metal.metal'
ggml_metal_init: GPU name: Apple M2 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 51539.61 MB
ggml_metal_init: maxTransferRate = built-in GPU
ggml_metal_add_buffer: allocated 'backend ' buffer, size = 220.20 MB, ( 3338.73 / 51539.61)
whisper_init_state: kv self size = 220.20 MB
ggml_metal_add_buffer: allocated 'backend ' buffer, size = 245.76 MB, ( 3584.49 / 51539.61)
whisper_init_state: kv cross size = 245.76 MB
whisper_init_state: loading Core ML model from 'models/ggml-large-v3-encoder.mlmodelc'
whisper_init_state: first run on a device may take a while ...
whisper_init_state: Core ML model loaded
ggml_metal_add_buffer: allocated 'backend ' buffer, size = 0.02 MB, ( 3584.51 / 51539.61)
whisper_init_state: compute buffer (conv) = 10.85 MB
ggml_metal_add_buffer: allocated 'backend ' buffer, size = 0.02 MB, ( 3584.52 / 51539.61)
whisper_init_state: compute buffer (cross) = 9.32 MB
ggml_metal_add_buffer: allocated 'backend ' buffer, size = 0.02 MB, ( 3584.54 / 51539.61)
whisper_init_state: compute buffer (decode) = 99.17 MB
ggml_metal_add_buffer: allocated 'backend ' buffer, size = 9.22 MB, ( 3593.76 / 51539.61)
ggml_metal_add_buffer: allocated 'backend ' buffer, size = 7.68 MB, ( 3601.45 / 51539.61)
ggml_metal_add_buffer: allocated 'backend ' buffer, size = 97.53 MB, ( 3698.98 / 51539.61)
system_info: n_threads = 4 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | METAL = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | CUDA = 0 | COREML = 1 | OPENVINO = 0 |
main: processing 'test.wav' (6939648 samples, 433.7 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = auto, task = transcribe, timestamps = 1 ...
whisper_full_with_state: auto-detected language: zh (p = 0.916203)
[00:00:00.000 --> 00:00:12.760] äžçåé¢çäžäžªæ°å
[00:00:12.760 --> 00:00:21.520] ä¹çåé¢çäžäžªæ°å
[00:00:21.520 --> 00:00:28.480] å
«çåé¢çäžäžªæ°å
[00:00:28.480 --> 00:00:34.940] å
«çåé¢äžäžªæ°å
[00:00:34.940 --> 00:00:43.720] é¿æ¹åœ¢æ£æ¹åœ¢
[00:00:43.720 --> 00:00:46.080] åæ±ç
[00:00:46.080 --> 00:00:53.740] å巊蟹æ°å°
[00:00:53.740 --> 00:00:57.500] æ¯å°
[00:00:57.500 --> 00:01:02.880] åå·Š
[00:01:02.880 --> 00:01:05.120] åå³èŸ¹æ°
[00:01:05.120 --> 00:01:06.860] æ¯å°
[00:01:06.860 --> 00:01:08.040] å 䞪
[00:01:08.040 --> 00:01:17.120] æäžª
[00:01:17.120 --> 00:01:18.640] æäžª
[00:01:18.640 --> 00:01:20.680] ç¶åè¿äžª
[00:01:20.680 --> 00:01:22.120] æ°
[00:01:22.120 --> 00:01:24.840] è¿ä»ä¹ååçæ¡éœäžæž
æ¥
[00:01:24.840 --> 00:01:25.820] æ
[00:01:25.820 --> 00:01:27.460] æå€
[00:01:27.460 --> 00:01:27.480] æå€
[00:01:27.480 --> 00:01:38.520] äžå
±æå äžªåæ³
[00:01:38.520 --> 00:01:42.720] è¿å©å 䞪
[00:01:42.720 --> 00:01:44.080] çš
[00:01:44.080 --> 00:01:46.000] çš
[00:01:46.000 --> 00:01:50.960] å æ³åæ³
[00:01:50.960 --> 00:01:51.480] 奜
[00:01:51.480 --> 00:01:53.160] è¿äžª
[00:01:53.160 --> 00:01:56.660] äžäžª
[00:01:56.660 --> 00:01:57.460] äžäžª
[00:01:57.460 --> 00:01:59.140] è¿äžªæ°
[00:01:59.140 --> 00:02:02.200] 䞪äœäžæ¯äº
[00:02:02.200 --> 00:02:05.260] åäœäžæ¯äž
[00:02:05.260 --> 00:02:08.060] è¿äžªæ°æ¯
[00:02:08.060 --> 00:02:12.820] å䞪å
[00:02:12.820 --> 00:02:21.900] åå䞪äžåèµ·æ¥æ¯äžäžª
[00:02:21.900 --> 00:02:29.660] ååä¹äžªäžåèµ·æ¥æ¯
[00:02:29.660 --> 00:02:33.980] åå
é颿
[00:02:33.980 --> 00:02:35.780] 䞪
[00:02:35.780 --> 00:02:38.540] åå
[00:02:38.540 --> 00:02:42.460] 䞪äžäºåéé¢
[00:02:42.460 --> 00:02:44.620] æ
[00:02:44.620 --> 00:02:48.340] 䞪å
[00:02:48.340 --> 00:02:51.620] äºåäºååäž
[00:02:51.620 --> 00:03:05.100] äžéŽçæ°æ¯å9çžé»çäž€äžªæ°æ¯
[00:03:05.100 --> 00:03:19.280] æ¯16å€1çæ°æ¯æ¯16å°3çæ°æ¯
[00:03:19.280 --> 00:03:35.280] è¢«è§£æ°æ¯11è§£æ°æ¯3å·®æ¯
[00:03:35.280 --> 00:03:46.880] æå€§çç鮿°æ¯æå°çäž€äœæ°æ¯
[00:03:46.880 --> 00:03:47.280] 5
[00:03:47.280 --> 00:03:49.260] æå€§çç鮿°æ¯æå°çäž€äœæ°æ¯
[00:03:49.260 --> 00:03:49.260] æå€§çç鮿°æ¯æå°çäž€äœæ°æ¯
[00:03:49.260 --> 00:03:49.260] æå€§çç鮿°æ¯æå°çäž€äœæ°æ¯
[00:03:49.260 --> 00:03:49.260] æå€§çç鮿°æ¯æå°çäž€äœæ°æ¯
[00:03:49.260 --> 00:03:49.260] æå€§çç鮿°æ¯æå°çäž€äœæ°æ¯
[00:03:49.260 --> 00:04:19.240] æå€§çç鮿°æ¯æå°çäž€äœæ°æ¯
[00:04:19.240 --> 00:04:49.220] æå€§çç鮿°æ¯æå°çäž€äœæ°æ¯
[00:04:49.220 --> 00:05:19.200] æå€§çç鮿°æ¯æå°çäž€äœæ°æ¯
[00:05:19.200 --> 00:05:49.180] æå€§çç鮿°æ¯æå°çäž€äœæ°æ¯
[00:05:49.180 --> 00:06:19.160] æå€§çç鮿°æ¯æå°çäž€äœæ°æ¯
[00:06:19.160 --> 00:06:49.140] æå€§çç鮿°æ¯æå°çäž€äœæ°æ¯æå°çäž€äœæ°æ¯
[00:06:49.140 --> 00:07:13.760] æå€§çç鮿°æ¯æå°çäž€äœæ°æ¯æå°çäž€äœæ°æ¯
whisper_print_timings: load time = 859.73 ms
whisper_print_timings: fallbacks = 3 p / 1 h
whisper_print_timings: mel time = 224.71 ms
whisper_print_timings: sample time = 2659.66 ms / 8154 runs ( 0.33 ms per run)
whisper_print_timings: encode time = 5801.61 ms / 16 runs ( 362.60 ms per run)
whisper_print_timings: decode time = 4105.18 ms / 279 runs ( 14.71 ms per run)
whisper_print_timings: batchd time = 54016.19 ms / 7797 runs ( 6.93 ms per run)
whisper_print_timings: prompt time = 1218.58 ms / 2721 runs ( 0.45 ms per run)
whisper_print_timings: total time = 71318.02 ms
ggml_metal_free: deallocating
ggml_metal_free: deallocating
./main -m models/ggml-large-v3.bin -f test.wav -l auto 8.81s user 2.29s system 15% cpu 1:11.44 total
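Both transcripts above contain stretches where the same text is emitted for several consecutive segments. A rough way to quantify this is to find the longest run of identical consecutive segments in the output. The sketch below assumes one segment per line in the `[start --> end] text` format printed by `./main`; the helper names are made up for this example, and the run length is only a heuristic for quality, not a real accuracy metric.

```python
import re
from typing import List, Tuple

# Matches transcript lines such as "[00:03:19.560 --> 00:03:20.100] some text"
_SEGMENT_RE = re.compile(
    r"^\[(\d{2}:\d{2}:\d{2}\.\d{3}) --> (\d{2}:\d{2}:\d{2}\.\d{3})\]\s*(.*)$"
)


def parse_segments(log_text: str) -> List[Tuple[str, str, str]]:
    """Return (start, end, text) for each timestamped line; other lines are skipped."""
    segments = []
    for line in log_text.splitlines():
        m = _SEGMENT_RE.match(line.strip())
        if m:
            segments.append((m.group(1), m.group(2), m.group(3).strip()))
    return segments


def longest_repeat(segments: List[Tuple[str, str, str]]) -> Tuple[int, str]:
    """Return (run_length, start_time) of the longest run of identical consecutive texts."""
    best_len, best_start = 0, ""
    run_len, run_start, prev = 0, "", None
    for start, _end, text in segments:
        if text and text == prev:
            run_len += 1                      # same text as the previous segment
        else:
            run_len, run_start, prev = 1, start, text
        if run_len > best_len:
            best_len, best_start = run_len, run_start
    return best_len, best_start
```

Running `longest_repeat(parse_segments(captured_output))` on each run gives a single number per backend that can be compared alongside the timings.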
Encoder only
This feels the same as the run above, in both quality and speed.
models/generate-coreml-model.sh large-v3 --encoder-only True
time ./main -m models/ggml-large-v3.bin -f test.wav -l auto
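Summary
 | NEON & MPS | CoreML (47% faster) |
---|---|---|
load time | 1007.19 ms | 859.73 ms |
mel time | 216.87 ms | 224.71 ms |
sample time | 3550.35 ms | 2659.66 ms |
encode time | 7821.69 ms | 5801.61 ms |
decode time | 2958.22 ms | 4105.18 ms |
batchd time | 88241.95 ms | 54016.19 ms |
prompt time | 1618.32 ms | 1218.58 ms |
total time | 105432.62 ms | 71318.02 ms |
wall-clock time | 1:45.50 | 1:11.44 |
The CoreML build is faster (71.3 s total versus 105.4 s), but transcription quality is worse: from about 03:35 onward the CoreML transcript repeats a single segment until the end of the file. Part of the time saved comes from decoding fewer tokens (7797 batched decode runs versus 11913), while encode time per run drops from 460.10 ms to 362.60 ms.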
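The 47% in the table header is consistent with the ratio of the two total times. As a worked check, the sketch below computes that ratio and also the real-time factor, using the audio length of 433.7 s reported in the `processing 'test.wav'` log line. The function names are hypothetical.

```python
def speedup_percent(baseline_ms: float, candidate_ms: float) -> float:
    """Percent increase in throughput when going from baseline to candidate."""
    return (baseline_ms / candidate_ms - 1.0) * 100.0


def real_time_factor(processing_ms: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    return (processing_ms / 1000.0) / audio_seconds


if __name__ == "__main__":
    neon_total_ms, coreml_total_ms = 105432.62, 71318.02  # "total time" row of the table
    audio_seconds = 433.7                                  # from the main: processing ... line
    print(f"speed-up: {speedup_percent(neon_total_ms, coreml_total_ms):.1f}%")
    print(f"RTF NEON & MPS: {real_time_factor(neon_total_ms, audio_seconds):.3f}")
    print(f"RTF CoreML:     {real_time_factor(coreml_total_ms, audio_seconds):.3f}")
```

Keep in mind that the two runs did not decode the same number of tokens, so this ratio mixes backend speed with differences in the decoded output.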
Performance comparison (NEON & MPS)
Download the models: ggerganov/whisper.cpp
git clone https://huggingface.co/ggerganov/whisper.cpp ggerganov/whisper.cpp
Create model links
Write the script ln-models.sh
#!/bin/bash

# Source directory
src_dir="/Users/junjian/HuggingFace/ggerganov/whisper.cpp"
# Destination directory
dst_dir="models"

# Iterate over all files in the source directory
for src_file in "$src_dir"/*
do
    # Get the file name
    file_name=$(basename "$src_file")
    # Get the file extension
    extension="${file_name##*.}"
    # If the file is not README.md and its extension is not zip, create a symlink
    if [ "$file_name" != "README.md" ] && [ "$extension" != "zip" ]
    then
        ln -s "$src_file" "$dst_dir/$file_name"
    fi
done
Run the script
sh ln-models.sh
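Running the shell script a second time makes `ln -s` complain about links that already exist. For comparison, here is a sketch of the same linking logic in Python that skips existing entries, so it can be re-run safely. It is an illustrative alternative, not a replacement for the script above; `link_models` is a hypothetical name, and hidden files are skipped to mirror the shell glob `*`.

```python
from pathlib import Path
from typing import List


def link_models(src_dir: str, dst_dir: str = "models") -> List[Path]:
    """Symlink every entry of src_dir into dst_dir, skipping README.md and *.zip."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    created = []
    for src in sorted(Path(src_dir).iterdir()):
        if src.name.startswith("."):
            continue  # the shell glob "*" does not match hidden files either
        if src.name == "README.md" or src.suffix == ".zip":
            continue
        link = dst / src.name
        if link.exists() or link.is_symlink():
            continue  # already linked, or a real file is in the way: leave it alone
        link.symlink_to(src.resolve())
        created.append(link)
    return created


if __name__ == "__main__":
    # Same directories as in ln-models.sh
    for path in link_models("/Users/junjian/HuggingFace/ggerganov/whisper.cpp", "models"):
        print("linked", path)
```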
The performance tests below use a 5-minute audio file, test.wav.
Model tiny
time ./main -f test.wav -l zh -m models/ggml-tiny.bin
whisper_print_timings: load time = 73.39 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 132.73 ms
whisper_print_timings: sample time = 1967.16 ms / 8244 runs ( 0.24 ms per run)
whisper_print_timings: encode time = 259.27 ms / 13 runs ( 19.94 ms per run)
whisper_print_timings: decode time = 69.91 ms / 38 runs ( 1.84 ms per run)
whisper_print_timings: batchd time = 5104.22 ms / 8130 runs ( 0.63 ms per run)
whisper_print_timings: prompt time = 55.85 ms / 2175 runs ( 0.03 ms per run)
whisper_print_timings: total time = 7675.59 ms
./main -f test.wav -l zh -m models/ggml-tiny.bin 5.40s user 0.50s system 76% cpu 7.704 total
Model tiny-q5_1
time ./main -f test.wav -l zh -m models/ggml-tiny-q5_1.bin
whisper_print_timings: load time = 68.97 ms
whisper_print_timings: fallbacks = 1 p / 0 h
whisper_print_timings: mel time = 134.27 ms
whisper_print_timings: sample time = 2650.25 ms / 10960 runs ( 0.24 ms per run)
whisper_print_timings: encode time = 232.78 ms / 11 runs ( 21.16 ms per run)
whisper_print_timings: decode time = 7.82 ms / 5 runs ( 1.56 ms per run)
whisper_print_timings: batchd time = 7218.69 ms / 10898 runs ( 0.66 ms per run)
whisper_print_timings: prompt time = 69.42 ms / 2452 runs ( 0.03 ms per run)
whisper_print_timings: total time = 10395.01 ms
./main -f test.wav -l zh -m models/ggml-tiny-q5_1.bin 7.20s user 0.63s system 75% cpu 10.422 total
Model base
time ./main -f test.wav -l zh -m models/ggml-base.bin
whisper_print_timings: load time = 81.90 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 133.21 ms
whisper_print_timings: sample time = 2283.01 ms / 9560 runs ( 0.24 ms per run)
whisper_print_timings: encode time = 396.49 ms / 11 runs ( 36.04 ms per run)
whisper_print_timings: decode time = 7.37 ms / 3 runs ( 2.46 ms per run)
whisper_print_timings: batchd time = 8629.71 ms / 9505 runs ( 0.91 ms per run)
whisper_print_timings: prompt time = 88.45 ms / 2226 runs ( 0.04 ms per run)
whisper_print_timings: total time = 11631.52 ms
./main -f test.wav -l zh -m models/ggml-base.bin 6.29s user 0.60s system 59% cpu 11.664 total
Model base-q5_1
time ./main -f test.wav -l zh -m models/ggml-base-q5_1.bin
whisper_print_timings: load time = 63.39 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 132.60 ms
whisper_print_timings: sample time = 2266.66 ms / 9567 runs ( 0.24 ms per run)
whisper_print_timings: encode time = 424.01 ms / 11 runs ( 38.55 ms per run)
whisper_print_timings: decode time = 7.25 ms / 3 runs ( 2.42 ms per run)
whisper_print_timings: batchd time = 8911.47 ms / 9512 runs ( 0.94 ms per run)
whisper_print_timings: prompt time = 98.58 ms / 2227 runs ( 0.04 ms per run)
whisper_print_timings: total time = 11916.36 ms
./main -f test.wav -l zh -m models/ggml-base-q5_1.bin 6.18s user 0.56s system 56% cpu 11.948 total
Model small
time ./main -f test.wav -l zh -m models/ggml-small.bin
whisper_print_timings: load time = 200.65 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 132.19 ms
whisper_print_timings: sample time = 2134.68 ms / 8277 runs ( 0.26 ms per run)
whisper_print_timings: encode time = 1222.70 ms / 12 runs ( 101.89 ms per run)
whisper_print_timings: decode time = 24.96 ms / 5 runs ( 4.99 ms per run)
whisper_print_timings: batchd time = 15979.48 ms / 8208 runs ( 1.95 ms per run)
whisper_print_timings: prompt time = 218.47 ms / 2191 runs ( 0.10 ms per run)
whisper_print_timings: total time = 19925.94 ms
./main -f test.wav -l zh -m models/ggml-small.bin 6.21s user 0.67s system 34% cpu 19.968 total
Model small-q5_1
time ./main -f test.wav -l zh -m models/ggml-small-q5_1.bin
whisper_print_timings: load time = 99.42 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 131.74 ms
whisper_print_timings: sample time = 2121.60 ms / 8218 runs ( 0.26 ms per run)
whisper_print_timings: encode time = 1419.51 ms / 13 runs ( 109.19 ms per run)
whisper_print_timings: decode time = 147.85 ms / 33 runs ( 4.48 ms per run)
whisper_print_timings: batchd time = 15960.53 ms / 8116 runs ( 1.97 ms per run)
whisper_print_timings: prompt time = 266.62 ms / 2419 runs ( 0.11 ms per run)
whisper_print_timings: total time = 20160.34 ms
./main -f test.wav -l zh -m models/ggml-small-q5_1.bin 6.03s user 0.59s system 32% cpu 20.191 total
Model medium
time ./main -f test.wav -l zh -m models/ggml-medium.bin
whisper_print_timings: load time = 476.85 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 132.37 ms
whisper_print_timings: sample time = 2233.07 ms / 9028 runs ( 0.25 ms per run)
whisper_print_timings: encode time = 2951.15 ms / 11 runs ( 268.29 ms per run)
whisper_print_timings: decode time = 42.86 ms / 4 runs ( 10.72 ms per run)
whisper_print_timings: batchd time = 38405.30 ms / 8972 runs ( 4.28 ms per run)
whisper_print_timings: prompt time = 550.54 ms / 2232 runs ( 0.25 ms per run)
whisper_print_timings: total time = 44803.71 ms
./main -f test.wav -l zh -m models/ggml-medium.bin 7.14s user 1.01s system 18% cpu 44.848 total
Model medium-q5_0
time ./main -f test.wav -l zh -m models/ggml-medium-q5_0.bin
whisper_print_timings: load time = 203.72 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 132.13 ms
whisper_print_timings: sample time = 2251.24 ms / 9072 runs ( 0.25 ms per run)
whisper_print_timings: encode time = 3288.98 ms / 11 runs ( 299.00 ms per run)
whisper_print_timings: decode time = 55.99 ms / 6 runs ( 9.33 ms per run)
whisper_print_timings: batchd time = 39768.11 ms / 9014 runs ( 4.41 ms per run)
whisper_print_timings: prompt time = 624.49 ms / 2234 runs ( 0.28 ms per run)
whisper_print_timings: total time = 46336.09 ms
./main -f test.wav -l zh -m models/ggml-medium-q5_0.bin 7.02s user 0.81s system 16% cpu 46.372 total
Model large-v3
time ./main -f test.wav -l zh -m models/ggml-large-v3.bin
whisper_print_timings: load time = 859.78 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 151.76 ms
whisper_print_timings: sample time = 2201.30 ms / 8640 runs ( 0.25 ms per run)
whisper_print_timings: encode time = 5561.91 ms / 12 runs ( 463.49 ms per run)
whisper_print_timings: decode time = 1033.94 ms / 67 runs ( 15.43 ms per run)
whisper_print_timings: batchd time = 55377.50 ms / 8503 runs ( 6.51 ms per run)
whisper_print_timings: prompt time = 820.89 ms / 1975 runs ( 0.42 ms per run)
whisper_print_timings: total time = 66020.98 ms
./main -f test.wav -l zh -m models/ggml-large-v3.bin 7.45s user 1.40s system 13% cpu 1:06.08 total
Model large-v3-q5_0
time ./main -f test.wav -l zh -m models/ggml-large-v3-q5_0.bin
whisper_print_timings: load time = 341.02 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 159.70 ms
whisper_print_timings: sample time = 2230.36 ms / 8311 runs ( 0.27 ms per run)
whisper_print_timings: encode time = 5895.91 ms / 11 runs ( 535.99 ms per run)
whisper_print_timings: decode time = 344.94 ms / 25 runs ( 13.80 ms per run)
whisper_print_timings: batchd time = 57613.93 ms / 8239 runs ( 6.99 ms per run)
whisper_print_timings: prompt time = 1062.95 ms / 2198 runs ( 0.48 ms per run)
whisper_print_timings: total time = 67659.70 ms
./main -f test.wav -l zh -m models/ggml-large-v3-q5_0.bin 7.25s user 1.03s system 12% cpu 1:07.71 total
Summary
 | tiny | tiny-q5_1 | base | base-q5_1 | small | small-q5_1 | medium | medium-q5_0 | large-v3 | large-v3-q5_0 |
---|---|---|---|---|---|---|---|---|---|---|
load time | 73.39 ms | 68.97 ms | 81.90 ms | 63.39 ms | 200.65 ms | 99.42 ms | 476.85 ms | 203.72 ms | 859.78 ms | 341.02 ms |
mel time | 132.73 ms | 134.27 ms | 133.21 ms | 132.60 ms | 132.19 ms | 131.74 ms | 132.37 ms | 132.13 ms | 151.76 ms | 159.70 ms |
sample time | 1967.16 ms | 2650.25 ms | 2283.01 ms | 2266.66 ms | 2134.68 ms | 2121.60 ms | 2233.07 ms | 2251.24 ms | 2201.30 ms | 2230.36 ms |
encode time | 259.27 ms | 232.78 ms | 396.49 ms | 424.01 ms | 1222.70 ms | 1419.51 ms | 2951.15 ms | 3288.98 ms | 5561.91 ms | 5895.91 ms |
decode time | 69.91 ms | 7.82 ms | 7.37 ms | 7.25 ms | 24.96 ms | 147.85 ms | 42.86 ms | 55.99 ms | 1033.94 ms | 344.94 ms |
batchd time | 5104.22 ms | 7218.69 ms | 8629.71 ms | 8911.47 ms | 15979.48 ms | 15960.53 ms | 38405.30 ms | 39768.11 ms | 55377.50 ms | 57613.93 ms |
prompt time | 55.85 ms | 69.42 ms | 88.45 ms | 98.58 ms | 218.47 ms | 266.62 ms | 550.54 ms | 624.49 ms | 820.89 ms | 1062.95 ms |
total time | 7675.59 ms | 10395.01 ms | 11631.52 ms | 11916.36 ms | 19925.94 ms | 20160.34 ms | 44803.71 ms | 46336.09 ms | 66020.98 ms | 67659.70 ms |
wall-clock time | 0:07.70 | 0:10.42 | 0:11.66 | 0:11.94 | 0:19.96 | 0:20.19 | 0:44.84 | 0:46.37 | 1:06.08 | 1:07.71 |
 |  |  | ● | ● | ●● | ●● | ●●●● | ●●●● | ●●●●●● | ●●●●●● |
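A table with this many cells is tedious to keep in sync with the logs. As a sketch, the function below renders a dictionary of per-model timings in the same pipe-table layout used in this post. It assumes the timings have already been collected as `{model: {stage: milliseconds}}`; the name `render_table` and that data shape are choices made for this example.

```python
from typing import Dict, List


def render_table(results: Dict[str, Dict[str, float]], stages: List[str]) -> str:
    """Render {model: {stage: ms}} as a pipe table with one row per stage."""
    models = list(results)
    lines = [
        " | " + " | ".join(models) + " |",   # header row with an empty first cell
        "---|" * (len(models) + 1),           # separator: one column per model plus the label
    ]
    for stage in stages:
        cells = [f"{results[m][stage]:.2f} ms" for m in models]
        lines.append(f"{stage} time | " + " | ".join(cells) + " |")
    return "\n".join(lines)


if __name__ == "__main__":
    results = {
        "tiny": {"load": 73.39, "total": 7675.59},
        "base": {"load": 81.90, "total": 11631.52},
    }
    print(render_table(results, ["load", "total"]))
```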
Performance comparison (MLX)
Write the script test-speed.py
import argparse
import whisper

# Create the argument parser
parser = argparse.ArgumentParser(description='Transcribe a speech file using a specific model.')
parser.add_argument('speech_file', type=str, help='The path to the speech file.')
parser.add_argument('model', type=str, help='The model to use for transcription.')

# Parse the command-line arguments
args = parser.parse_args()

# Transcribe the given audio file with the given model
# (the initial prompt is Chinese for "Hello, everyone!")
text = whisper.transcribe(args.speech_file, model=args.model, initial_prompt='大家好！')["text"]
print(text)
Run the script
time python test-speed.py test.wav base
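To repeat this measurement for several models without typing each command, the wall-clock time can also be taken from Python. The sketch below is one possible harness, assuming `test-speed.py` is in the current directory and takes the arguments shown above; `time_command` and `benchmark` are hypothetical helpers. Like the shell `time`, each measurement includes interpreter start-up and model loading.

```python
import subprocess
import sys
import time
from typing import Dict, List


def time_command(cmd: List[str]) -> float:
    """Run cmd to completion and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
    return time.perf_counter() - start


def benchmark(wav: str, models: List[str]) -> Dict[str, float]:
    """Time test-speed.py once per model name and return {model: seconds}."""
    return {m: time_command([sys.executable, "test-speed.py", wav, m]) for m in models}


if __name__ == "__main__":
    for model, seconds in benchmark("test.wav", ["tiny", "base"]).items():
        print(f"{model}: {seconds:.2f} s")
```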
The performance tests below use a 5-minute audio file, test.wav.
Model tiny
9.83s user 8.39s system 142% cpu 12.813 total
Model base
7.93s user 6.82s system 143% cpu 10.297 total
Model small
14.49s user 9.87s system 129% cpu 18.812 total
Model medium
30.05s user 17.96s system 122% cpu 39.291 total
Model large-v3
47.01s user 28.10s system 119% cpu 1:02.96 total
Summary
 | tiny | base | small | medium | large-v3 |
---|---|---|---|---|---|
wall-clock time | 0:12.81 | 0:10.30 | 0:18.81 | 0:39.29 | 1:02.96 |
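The values in this row are the `total` field of the shell's `time` summary, which is elapsed time; CPU time is the sum of the `user` and `system` fields. A minimal sketch for parsing such summary lines into seconds follows, assuming the zsh output format seen in this post; the function names are made up for this example.

```python
import re
from typing import Dict

# Matches the summary printed by `time` in this post, for example:
#   9.83s user 8.39s system 142% cpu 12.813 total
#   47.01s user 28.10s system 119% cpu 1:02.96 total
_TIME_RE = re.compile(r"([\d.]+)s user ([\d.]+)s system (\d+)% cpu ([\d:.]+) total")


def parse_elapsed(value: str) -> float:
    """Convert '12.813', '1:02.96' or '1:02:03.4' to seconds."""
    seconds = 0.0
    for part in value.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def parse_time_line(line: str) -> Dict[str, float]:
    """Return user, system and elapsed seconds plus the CPU percentage."""
    m = _TIME_RE.search(line)
    if m is None:
        raise ValueError(f"not a time summary line: {line!r}")
    return {
        "user": float(m.group(1)),
        "system": float(m.group(2)),
        "cpu_percent": int(m.group(3)),
        "elapsed": parse_elapsed(m.group(4)),
    }


if __name__ == "__main__":
    print(parse_time_line("47.01s user 28.10s system 119% cpu 1:02.96 total"))
```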
Performance comparison (NEON & MPS vs MLX)
 | tiny | base | small | medium | large-v3 |
---|---|---|---|---|---|
NEON & MPS | 0:07.70 | 0:11.66 | 0:19.96 | 0:44.84 | 1:06.08 |
MLX | 0:12.81 | 0:10.30 | 0:18.81 | 0:39.29 | 1:02.96 |
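MLX is now faster than whisper.cpp for every model from base to large-v3 (by roughly 5% to 12% in wall-clock time); only tiny is still faster with whisper.cpp.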
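The per-model differences can be worked out directly from the comparison table. The sketch below copies the elapsed times from that table (converted to seconds) and prints the relative change for each model; the dictionary and function names are chosen for this example only.

```python
# Elapsed seconds copied from the NEON & MPS vs MLX table above.
NEON_MPS = {"tiny": 7.70, "base": 11.66, "small": 19.96, "medium": 44.84, "large-v3": 66.08}
MLX = {"tiny": 12.81, "base": 10.30, "small": 18.81, "medium": 39.29, "large-v3": 62.96}


def mlx_change_percent(model: str) -> float:
    """Percent change in elapsed time from whisper.cpp to MLX (negative means MLX is faster)."""
    return (MLX[model] / NEON_MPS[model] - 1.0) * 100.0


if __name__ == "__main__":
    for model in NEON_MPS:
        print(f"{model:>8}: {mlx_change_percent(model):+.1f}%")
```

These are single runs on one machine, so small differences such as the one for large-v3 may fall within run-to-run variation.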