gadersd / whisper-burn
A Rust implementation of OpenAI's Whisper model using the burn framework
License: MIT License
OS: Mac Ventura
It seems that transcription works with the tiny model, but the medium model produces a buffer size error. Perhaps we could do chunking.
Running `target/release/whisper audio.wav medium`
thread 'main' panicked at 'wgpu error: Validation Error
Caused by:
In Device::create_bind_group
Buffer binding 0 range 212439040 exceeds `max_*_buffer_binding_size` limit 134217728
', /Users/botch/.cargo/registry/src/index.crates.io-6f17d22bba15001f/wgpu-0.17.0/src/backend/direct.rs:3056:5
stack backtrace:
0: rust_begin_unwind
at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/std/src/panicking.rs:578:5
1: core::panicking::panic_fmt
at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/core/src/panicking.rs:67:14
2: core::ops::function::Fn::call
3: <wgpu::backend::direct::Context as wgpu::context::Context>::device_create_bind_group
4: <T as wgpu::context::DynContext>::device_create_bind_group
5: wgpu::Device::create_bind_group
6: burn_wgpu::context::base::Context::execute
7: burn_wgpu::kernel::index::select::select
8: burn_tensor::tensor::ops::modules::base::ModuleOps::embedding
9: whisper::model::Whisper<B>::forward_decoder
10: whisper::main
Using a six-minute audio file with the tiny model produces the same issue.
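The chunking idea can be sketched with a little arithmetic: the panic reports a 212,439,040-byte binding against wgpu's default 134,217,728-byte (128 MiB) limit, so the work would have to be split into at least two pieces. A minimal illustration (the helper name is mine, not part of whisper-burn):

```rust
// Hedged sketch: how many chunks keep each bound buffer under wgpu's
// default max_*_buffer_binding_size (128 MiB = 134_217_728 bytes)?
fn chunks_needed(buffer_bytes: u64, limit_bytes: u64) -> u64 {
    (buffer_bytes + limit_bytes - 1) / limit_bytes // ceiling division
}

fn main() {
    let limit = 134_217_728; // default limit quoted in the error message
    // The 212_439_040-byte buffer from the panic needs at least two chunks.
    assert_eq!(chunks_needed(212_439_040, limit), 2);
    // A buffer exactly at the limit still fits in one binding.
    assert_eq!(chunks_needed(134_217_728, limit), 1);
}
```

An alternative would be requesting higher device limits from wgpu, but not all adapters grant them, so chunking is the more portable fix.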
This project fails to produce any output on a different test file (the file works with whisper, sounds normal when I listen to it, and was created from an m4a via ffmpeg according to the whisper instructions):
$ ffmpeg -i ../whisper/20220922\ 084923.m4a -ar 16000 -ac 1 -c:a pcm_s16le output.wav
$ cargo run --release output.wav tiny_en
Running `target/release/whisper output.wav tiny_en`
<|notimestamps|>
Transcribed text: <|notimestamps|>
GitHub refuses to allow me to upload a wav file (even base64-encoded as .txt). Not sure what the best way to share it is.
I ran into some errors due to a missing system install of libtorch at the expected path. I was able to trace these errors back to https://github.com/LaurentMazare/tch-rs#libtorch-manual-install and the need to set environment variables such as LIBTORCH or LIBTORCH_USE_PYTORCH.
I'm trying to get this working with nix (on aarch64-darwin) but am not having any luck so far.
Running on Windows, I tried to select my Intel GPU with let-env WGPU_ADAPTER_NAME = 'intel' via nushell, with no success. I also tried changing the device selection to let device = WgpuDevice::DiscreteGpu(0); and it too did not work. In the end I had to set type Backend = WgpuBackend<burn_wgpu::Vulkan, f32, i32>; and use VK_ICD_FILENAMES = '\windows\system32\DriverStore\FileRepository\iigd_dch_d.inf_amd64_50b98d237e0753a8\igvk64.json' to use the Intel GPU. (For anyone stumbling upon this: the path may differ depending on your GPU, so you will need to find igvk64.json manually.)
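For context, WGPU_ADAPTER_NAME is, as far as I can tell, matched by wgpu as a case-insensitive substring of the adapter's reported name, so 'intel' should normally match an adapter like "Intel(R) UHD Graphics". A dependency-free sketch of that matching rule (illustrative, not wgpu's actual code):

```rust
// Hedged sketch of name-based adapter filtering as wgpu's env-var
// selection appears to do it: case-insensitive substring match.
fn matches_adapter(requested: &str, adapter_name: &str) -> bool {
    adapter_name
        .to_lowercase()
        .contains(&requested.to_lowercase())
}

fn main() {
    // 'intel' matches a typical Intel adapter name regardless of case.
    assert!(matches_adapter("intel", "Intel(R) UHD Graphics 630"));
    // It does not match an NVIDIA adapter, so that adapter is skipped.
    assert!(!matches_adapter("intel", "NVIDIA GeForce RTX 3060"));
}
```

If the match succeeds but the adapter still fails, the VK_ICD_FILENAMES workaround above points Vulkan at the right driver JSON directly.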
Not sure if this is an issue with whisper-burn or the wgpu backend for burn. I think it's a burn-wgpu issue, but thought it would be safer to report it here first.
thread 'main' panicked at 'slice index starts at 172409 but ends at 168511',
/tmp/whisper-burn/src/transcribe.rs:101:22
whisper-burn/src/transcribe.rs, line 95 in 3757c15:
Here waveform.len() could be less than n_samples_per_tensor, which causes iter_len to become extremely large:
[src/transcribe.rs:97] n_samples_per_tensor = 238559
[src/transcribe.rs:97] waveform.len() = 168511
[src/transcribe.rs:97] waveform.len() - n_samples_per_tensor = 18446744073709481568
Replacing the subtraction with saturating_sub fixes the issue.
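The underflow is easy to reproduce with the numbers from the debug output above: usize subtraction panics in debug builds and wraps around in release builds, producing the huge iter_len, while saturating_sub clamps at zero. A minimal sketch:

```rust
fn main() {
    let waveform_len: usize = 168_511; // waveform.len() from the log
    let n_samples_per_tensor: usize = 238_559;

    // A plain `-` would panic in debug builds and wrap in release builds;
    // wrapping_sub makes the release-build behavior explicit here.
    let iter_len = waveform_len.wrapping_sub(n_samples_per_tensor);
    assert_eq!(iter_len, 18_446_744_073_709_481_568); // the value in the log

    // saturating_sub clamps at zero, so the chunking loop degrades
    // gracefully when the waveform is shorter than one tensor of samples.
    assert_eq!(waveform_len.saturating_sub(n_samples_per_tensor), 0);
}
```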
The latest tinygrad 0.7.0 moved state to under tinygrad.nn.state, which broke the conversion script. Updating the import path fixed the problem.
I am following the example in the README on my Mac M1. When I reach the point of running it, I choose the wgpu backend and it panics. Before the refactor you did some days ago, I was able to run this project locally on my Mac (no torch installed whatsoever).
cargo run --release --features wgpu-backend --bin transcribe tiny_en audio16k.wav transcription.txt
warning: unused imports: `Bool`, `Int`, `activation::relu`
--> src/model/load.rs:6:9
|
6 | activation::relu,
| ^^^^^^^^^^^^^^^^
7 | Tensor,
8 | Int,
| ^^^
9 | Bool,
| ^^^^
|
= note: `#[warn(unused_imports)]` on by default
warning: unused import: `Conv1dRecord`
--> src/model/mod.rs:8:45
|
8 | nn::{self, conv::{Conv1d, Conv1dConfig, Conv1dRecord}, PaddingConfig1d},
| ^^^^^^^^^^^^
warning: unused import: `Tokenizer`
--> src/token.rs:4:18
|
4 | use tokenizers::{Tokenizer, AddedToken};
| ^^^^^^^^^
warning: unused import: `crate::helper::*`
--> src/transcribe.rs:2:5
|
2 | use crate::helper::*;
| ^^^^^^^^^^^^^^^^
warning: unused imports: `Float`, `Int`, `config::Config`, `self`
--> src/transcribe.rs:9:5
|
9 | config::Config,
| ^^^^^^^^^^^^^^
...
13 | backend::{self, Backend},
| ^^^^
...
16 | Int,
| ^^^
17 | Float,
| ^^^^^
warning: unused variable: `n_batch`
--> src/model/mod.rs:122:14
|
122 | let [n_batch, seq_len] = x.dims();
| ^^^^^^^ help: if this is intentional, prefix it with an underscore: `_n_batch`
|
= note: `#[warn(unused_variables)]` on by default
warning: unused variable: `new_text`
--> src/transcribe.rs:38:14
|
38 | let (new_text, new_tokens) = mels_to_text(whisper, bpe, mel, &prev_normal_tokens[..], padding)?;
| ^^^^^^^^ help: if this is intentional, prefix it with an underscore: `_new_text`
warning: unused variable: `n_channel`
--> src/transcribe.rs:119:10
|
119 | let [n_channel, n_mel, n_ctx] = mels.dims();
| ^^^^^^^^^ help: if this is intentional, prefix it with an underscore: `_n_channel`
warning: unused variable: `start_of_prev_token`
--> src/transcribe.rs:130:9
|
130 | let start_of_prev_token = bpe.special_token(SpecialToken::StartofPrev).unwrap();
| ^^^^^^^^^^^^^^^^^^^ help: if this is intentional, prefix it with an underscore: `_start_of_prev_token`
warning: unused variable: `n_batch`
--> src/transcribe.rs:158:14
|
158 | let [n_batch, n_token, n_dict] = out.dims();
| ^^^^^^^ help: if this is intentional, prefix it with an underscore: `_n_batch`
warning: unused variable: `n_dict`
--> src/transcribe.rs:158:32
|
158 | let [n_batch, n_token, n_dict] = out.dims();
| ^^^^^^ help: if this is intentional, prefix it with an underscore: `_n_dict`
warning: unused variable: `prev_normal_tokens`
--> src/transcribe.rs:113:92
|
113 | ...Tokenizer, mels: Tensor<B, 3>, prev_normal_tokens: &[usize], padding: usize) -> token::Result<(St...
| ^^^^^^^^^^^^^^^^^^ help: if this is intentional, prefix it with an underscore: `_prev_normal_tokens`
warning: function `get_mel_filters` is never used
--> src/audio.rs:66:4
|
66 | fn get_mel_filters<B: Backend>(sample_rate: f64, n_fft: usize, n_mels: usize, htk: bool) -> Tensor<B,...
| ^^^^^^^^^^^^^^^
|
= note: `#[warn(dead_code)]` on by default
warning: function `fft_frequencies` is never used
--> src/audio.rs:128:4
|
128 | fn fft_frequencies<B: Backend>(sample_rate: f64, n_fft: usize) -> Tensor<B, 1> {
| ^^^^^^^^^^^^^^^
warning: function `test_fft_frequencies` is never used
--> src/audio.rs:137:4
|
137 | fn test_fft_frequencies<B: Backend>() {
| ^^^^^^^^^^^^^^^^^^^^
warning: function `test_mel_frequencies` is never used
--> src/audio.rs:144:4
|
144 | fn test_mel_frequencies<B: Backend>(htk: bool) {
| ^^^^^^^^^^^^^^^^^^^^
warning: function `mel_frequencies` is never used
--> src/audio.rs:152:4
|
152 | fn mel_frequencies<B: Backend>(n_mels: usize, fmin: f64, fmax: f64, htk: bool) -> Tensor<B, 1> {
| ^^^^^^^^^^^^^^^
warning: `whisper` (lib) generated 17 warnings (run `cargo fix --lib -p whisper` to apply 12 suggestions)
warning: unused import: `std::collections::HashMap`
--> src/bin/transcribe/main.rs:1:5
|
1 | use std::collections::HashMap;
| ^^^^^^^^^^^^^^^^^^^^^^^^^
|
= note: `#[warn(unused_imports)]` on by default
warning: unused import: `std::iter`
--> src/bin/transcribe/main.rs:2:5
|
2 | use std::iter;
| ^^^^^^^^^
warning: unused import: `whisper::helper::*`
--> src/bin/transcribe/main.rs:5:5
|
5 | use whisper::helper::*;
| ^^^^^^^^^^^^^^^^^^
warning: unused import: `whisper::token`
--> src/bin/transcribe/main.rs:6:5
|
6 | use whisper::token;
| ^^^^^^^^^^^^^^
warning: unused imports: `Data`, `Float`, `Int`, `Tensor`, `self`, `self`
--> src/bin/transcribe/main.rs:21:9
|
21 | self,
| ^^^^
22 | backend::{self, Backend},
| ^^^^
23 | Data,
| ^^^^
24 | Tensor,
| ^^^^^^
25 | Int,
| ^^^
26 | Float,
| ^^^^^
warning: unused import: `num_traits::ToPrimitive`
--> src/bin/transcribe/main.rs:60:5
|
60 | use num_traits::ToPrimitive;
| ^^^^^^^^^^^^^^^^^^^^^^^
warning: unused import: `whisper::audio::prep_audio`
--> src/bin/transcribe/main.rs:61:5
|
61 | use whisper::audio::prep_audio;
| ^^^^^^^^^^^^^^^^^^^^^^^^^^
warning: unused import: `SpecialToken`
--> src/bin/transcribe/main.rs:62:37
|
62 | use whisper::token::{Gpt2Tokenizer, SpecialToken};
| ^^^^^^^^^^^^
warning: unused variable: `duration`
--> src/bin/transcribe/main.rs:36:9
|
36 | let duration = reader.duration() as usize;
| ^^^^^^^^ help: if this is intentional, prefix it with an underscore: `_duration`
|
= note: `#[warn(unused_variables)]` on by default
warning: unused variable: `bits_per_sample`
--> src/bin/transcribe/main.rs:39:9
|
39 | let bits_per_sample = spec.bits_per_sample;
| ^^^^^^^^^^^^^^^ help: if this is intentional, prefix it with an underscore: `_bits_per_sample`
warning: variable does not need to be mutable
--> src/bin/transcribe/main.rs:33:9
|
33 | let mut reader = hound::WavReader::open(filename)?;
| ----^^^^^^
| |
| help: remove this `mut`
|
= note: `#[warn(unused_mut)]` on by default
warning: unused variable: `tokens`
--> src/bin/transcribe/main.rs:132:16
|
132 | let (text, tokens) = match waveform_to_text(&whisper, &bpe, waveform, sample_rate) {
| ^^^^^^ help: if this is intentional, prefix it with an underscore: `_tokens`
warning: `whisper` (bin "transcribe") generated 12 warnings (run `cargo fix --bin "transcribe"` to apply 12 suggestions)
Finished release [optimized] target(s) in 0.28s
warning: the following packages contain code that will be rejected by a future version of Rust: nom v1.2.4, nom v3.2.1
note: to see what the problems were, use the option `--future-incompat-report`, or run `cargo report future-incompatibilities --id 2`
Running `target/release/transcribe tiny_en audio16k.wav transcription.txt`
Loading waveform...
Loading model...
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Torch("Could not run 'aten::empty_strided' with arguments from the 'CUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'aten::empty_strided' is only available for these backends: [CPU, MPS, Meta, QuantizedCPU, BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradHIP, AutogradXLA, AutogradMPS, AutogradIPU, AutogradXPU, AutogradHPU, AutogradVE, AutogradLazy, AutogradMeta, AutogradMTIA, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, AutogradNestedTensor, Tracer, AutocastCPU, AutocastCUDA, FuncTorchBatched, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PythonDispatcher].\n\nCPU: registered at /Users/runner/work/pytorch/pytorch/pytorch/build/aten/src/ATen/RegisterCPU.cpp:31034 [kernel]\nMPS: registered at /Users/runner/work/pytorch/pytorch/pytorch/build/aten/src/ATen/RegisterMPS.cpp:22748 [kernel]\nMeta: registered at /Users/runner/work/pytorch/pytorch/pytorch/build/aten/src/ATen/RegisterMeta.cpp:26824 [kernel]\nQuantizedCPU: registered at /Users/runner/work/pytorch/pytorch/pytorch/build/aten/src/ATen/RegisterQuantizedCPU.cpp:929 [kernel]\nBackendSelect: registered at /Users/runner/work/pytorch/pytorch/pytorch/build/aten/src/ATen/RegisterBackendSelect.cpp:726 [kernel]\nPython: registered at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/core/PythonFallbackKernel.cpp:144 [backend fallback]\nFuncTorchDynamicLayerBackMode: registered at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/functorch/DynamicLayer.cpp:491 [backend fallback]\nFunctionalize: 
registered at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/FunctionalizeFallbackKernel.cpp:280 [backend fallback]\nNamed: registered at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/core/NamedRegistrations.cpp:7 [backend fallback]\nConjugate: fallthrough registered at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/ConjugateFallback.cpp:21 [kernel]\nNegative: fallthrough registered at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/native/NegateFallback.cpp:23 [kernel]\nZeroTensor: fallthrough registered at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/ZeroTensorFallback.cpp:90 [kernel]\nADInplaceOrView: fallthrough registered at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/core/VariableFallbackKernel.cpp:63 [backend fallback]\nAutogradOther: registered at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/autograd/generated/VariableType_2.cpp:17488 [autograd kernel]\nAutogradCPU: registered at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/autograd/generated/VariableType_2.cpp:17488 [autograd kernel]\nAutogradCUDA: registered at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/autograd/generated/VariableType_2.cpp:17488 [autograd kernel]\nAutogradHIP: registered at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/autograd/generated/VariableType_2.cpp:17488 [autograd kernel]\nAutogradXLA: registered at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/autograd/generated/VariableType_2.cpp:17488 [autograd kernel]\nAutogradMPS: registered at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/autograd/generated/VariableType_2.cpp:17488 [autograd kernel]\nAutogradIPU: registered at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/autograd/generated/VariableType_2.cpp:17488 [autograd kernel]\nAutogradXPU: registered at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/autograd/generated/VariableType_2.cpp:17488 [autograd kernel]\nAutogradHPU: registered at 
/Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/autograd/generated/VariableType_2.cpp:17488 [autograd kernel]\nAutogradVE: registered at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/autograd/generated/VariableType_2.cpp:17488 [autograd kernel]\nAutogradLazy: registered at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/autograd/generated/VariableType_2.cpp:17488 [autograd kernel]\nAutogradMeta: registered at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/autograd/generated/VariableType_2.cpp:17488 [autograd kernel]\nAutogradMTIA: registered at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/autograd/generated/VariableType_2.cpp:17488 [autograd kernel]\nAutogradPrivateUse1: registered at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/autograd/generated/VariableType_2.cpp:17488 [autograd kernel]\nAutogradPrivateUse2: registered at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/autograd/generated/VariableType_2.cpp:17488 [autograd kernel]\nAutogradPrivateUse3: registered at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/autograd/generated/VariableType_2.cpp:17488 [autograd kernel]\nAutogradNestedTensor: registered at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/autograd/generated/VariableType_2.cpp:17488 [autograd kernel]\nTracer: registered at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/autograd/generated/TraceType_2.cpp:16726 [kernel]\nAutocastCPU: fallthrough registered at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/autocast_mode.cpp:487 [backend fallback]\nAutocastCUDA: fallthrough registered at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/autocast_mode.cpp:354 [backend fallback]\nFuncTorchBatched: registered at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/functorch/LegacyBatchingRegistrations.cpp:815 [backend fallback]\nFuncTorchVmapMode: fallthrough registered at 
/Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/functorch/VmapModeRegistrations.cpp:28 [backend fallback]\nBatched: registered at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/LegacyBatchingRegistrations.cpp:1073 [backend fallback]\nVmapMode: fallthrough registered at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/VmapModeRegistrations.cpp:33 [backend fallback]\nFuncTorchGradWrapper: registered at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/functorch/TensorWrapper.cpp:210 [backend fallback]\nPythonTLSSnapshot: registered at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/core/PythonFallbackKernel.cpp:152 [backend fallback]\nFuncTorchDynamicLayerFrontMode: registered at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/functorch/DynamicLayer.cpp:487 [backend fallback]\nPythonDispatcher: registered at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/core/PythonFallbackKernel.cpp:148 [backend fallback]\n\nException raised from reportError at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/core/dispatch/OperatorEntry.cpp:548 (most recent call first):\nframe #0: c10::impl::OperatorEntry::reportError(c10::DispatchKey) const + 588 (0x10bef7020 in libtorch_cpu.dylib)\nframe #1: c10::impl::OperatorEntry::lookup(c10::DispatchKeySet) const + 164 (0x10bbbfd4c in libtorch_cpu.dylib)\nframe #2: at::Tensor c10::Dispatcher::redispatch<at::Tensor, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool> >(c10::TypedOperatorHandle<at::Tensor (c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>)> const&, c10::DispatchKeySet, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>) const + 
88 (0x10cd1a6d4 in libtorch_cpu.dylib)\nframe #3: at::_ops::empty_strided::redispatch(c10::DispatchKeySet, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>) + 172 (0x10cc3af58 in libtorch_cpu.dylib)\nframe #4: at::_ops::empty_strided::call(c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>) + 300 (0x10cc3aaec in libtorch_cpu.dylib)\nframe #5: at::native::_to_copy(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 2624 (0x10c3c5244 in libtorch_cpu.dylib)\nframe #6: at::_ops::_to_copy::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 188 (0x10c8f9f70 in libtorch_cpu.dylib)\nframe #7: at::_ops::_to_copy::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 188 (0x10c8f9f70 in libtorch_cpu.dylib)\nframe #8: c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>), &(torch::autograd::VariableType::(anonymous namespace)::_to_copy(c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>))>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, 
at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat> > >, at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 1104 (0x10e673864 in libtorch_cpu.dylib)\nframe #9: at::_ops::_to_copy::call(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 340 (0x10c8f9c30 in libtorch_cpu.dylib)\nframe #10: at::_ops::to_dtype_layout::call(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, bool, c10::optional<c10::MemoryFormat>) + 348 (0x10cacdc30 in libtorch_cpu.dylib)\nframe #11: atg_to + 104 (0x101004cdc in transcribe)\nframe #12: tch::wrappers::tensor_generated::_$LT$impl$u20$tch..wrappers..tensor..Tensor$GT$::to::hfeb248388ea0a533 + 88 (0x100ff6ba8 in transcribe)\nframe #13: burn_tch::ops::tensor::_$LT$impl$u20$burn_tensor..tensor..ops..tensor..TensorOps$LT$burn_tch..backend..TchBackend$LT$E$GT$$GT$$u20$for$u20$burn_tch..backend..TchBackend$LT$E$GT$$GT$::to_device::h39a91cf7f00bf6ef + 64 (0x100fc4108 in transcribe)\nframe #14: _$LT$burn_core..nn..conv..conv1d..Conv1d$LT$B$GT$$u20$as$u20$burn_core..module..base..Module$LT$B$GT$$GT$::map::h58a9338e15b1b63e + 64 (0x100fbc49c in transcribe)\nframe #15: _$LT$whisper..model..Whisper$LT$B$GT$$u20$as$u20$burn_core..module..base..Module$LT$B$GT$$GT$::map::h8a2e896e5556644c + 104 (0x100fe515c in transcribe)\nframe #16: transcribe::main::hfd8ee1d1f4f6714c + 1044 
(0x100fc64d0 in transcribe)\nframe #17: std::sys_common::backtrace::__rust_begin_short_backtrace::h847fc7e56d1202dc + 12 (0x100fec7b8 in transcribe)\nframe #18: std::rt::lang_start::_$u7b$$u7b$closure$u7d$$u7d$::h9b3e7ad23a57bf45 + 16 (0x100fec7d0 in transcribe)\nframe #19: std::rt::lang_start_internal::hdd06e3566639fc5b + 648 (0x1013301d4 in transcribe)\nframe #20: main + 52 (0x100fc6b20 in transcribe)\nframe #21: start + 520 (0x1019dd08c in dyld)\n")', /Users/tiero/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tch-0.13.0/src/wrappers/tensor_generated.rs:17243:27
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
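The panic indicates the tch backend was asked for a CUDA device on a libtorch build (macOS) that only ships CPU and MPS kernels. A dependency-free sketch of the fallback logic that avoids this class of panic (the names are illustrative, not burn's or tch's actual API):

```rust
// Hedged sketch: only request a device the current libtorch build provides.
// On an M1 Mac there is no CUDA, so Cuda(0) must never be requested.
#[derive(Debug, PartialEq)]
enum Device {
    Cpu,
    Mps,
    Cuda(usize),
}

fn pick_device(has_cuda: bool, has_mps: bool) -> Device {
    if has_cuda {
        Device::Cuda(0)
    } else if has_mps {
        Device::Mps
    } else {
        Device::Cpu
    }
}

fn main() {
    // Apple Silicon: no CUDA, MPS available -> fall back to MPS, not Cuda(0).
    assert_eq!(pick_device(false, true), Device::Mps);
    // No accelerator at all -> plain CPU.
    assert_eq!(pick_device(false, false), Device::Cpu);
}
```

In tch-rs terms the equivalent would be probing availability before constructing the device rather than hard-coding CUDA; the exact call sites in whisper-burn would need checking against the current code.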
Hi. Would you consider splitting out the functionality for computing the Mel spectrogram (inside audio.rs) into a separate crate?
This would be useful for other speech-centric models. There are a few libraries for this with ndarray, such as mel-spec and mfcc-rust (I'm a contributor), but this is the first implementation I've seen for burn.
Note: a PR is coming.
System: Win10
Rust version: 1.71.0 stable
I followed the instructions in the README file; after cargo run, I encountered this failure:
Failed to load tokenizer: Model "gpt2" on the Hub doesn't have a tokenizer
error: process didn't exit successfully: `target\release\whisper.exe audio.wav tiny` (exit code: 1)
It seems a dependency named tokenizers can't be initialized. How can I fix this error?
Hi there, first of all, awesome library and thank you for making it.
When I record a short, clear .wav file saying "this is a test, this is a test" (link below), the waveform_to_text function does not successfully decode it. In essence, it usually shows something like "muffled" as the decoded text. I have used the medium-sized model too. However, it does work for the audio.wav file provided in the repo example.
I have spent much time trying to analyze why this is failing, including analyzing the metadata of both files and recording new audio files from different sources (just in case this was to do with my own machine).
Do you have any experience or knowledge on the exact requirements for the .wav file in order for it to be successfully extracted using the library?
Here is the audio file which is failing: https://drive.google.com/file/d/1aaWL-mBrRaGtFvL_Re8r4WVtsMP3BmAI/view?usp=sharing
Again, this works great with the example audio file provided in the docs, but just not with any new custom file I record.
It seems the audio decoding is picky about what gets input to it.
Audio mediainfo
General
Complete name : C:\Users\Quack\code\whisper-burn\slap.wav
Format : Wave
File size : 788 KiB
Duration : 4 s 203 ms
Overall bit rate mode : Constant
Overall bit rate : 1 536 kb/s
Writing application : Lavf58.29.100
Audio
Format : PCM
Format settings : Little / Signed
Codec ID : 1
Duration : 4 s 203 ms
Bit rate mode : Constant
Bit rate : 1 536 kb/s
Channel(s) : 2 channels
Sampling rate : 48.0 kHz
Bit depth : 16 bits
Stream size : 788 KiB (100%)
Audio file: https://cdn.discordapp.com/attachments/615105639567589376/1141946730485665893/slap.wav
target\release\whisper.exe .\slap.wav small_en  (08/18/2023 12:07:51 AM)
Loading waveform...
Loading model...
Chunk 0: (screaming)
Chunk 1: (screeching)
Transcribed text: (screeching)
whisper-ctranslate2:
whisper-ctranslate2.exe slap.wav --model tiny.en 08/18/2023 12:10:23 AM
Detected language 'English' with probability 1.000000
[00:00.000 --> 00:04.000] Also, it's not always useful.
Transcription results written to 'C:\Users\Quack\code\whisper-burn' directory
EDIT: transcoding the audio file using ffmpeg -i .\slap.wav -ar SAMPLE_RATE -ac 1 slap-edit.wav seems to make it work. It needs to be both single-channel and 41 kHz or less.
at 41khz the audio output was
Chunk 0: Oh, son, it's not all you are.
Transcribed text: Oh, son, it's not all you are.
at 24khz and below it is
Chunk 0: also it's not always useful.
Transcribed text: also it's not always useful
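The preprocessing that made the file work (downmix to mono, lower the sample rate) can be sketched without any audio crates: average the interleaved stereo samples, then resample by linear interpolation. This illustrates the transformation only; it is not whisper-burn's actual pipeline, and a real setup would keep using ffmpeg:

```rust
// Hedged sketch: downmix interleaved 16-bit stereo to mono floats.
fn stereo_to_mono(interleaved: &[i16]) -> Vec<f32> {
    interleaved
        .chunks_exact(2)
        .map(|lr| (lr[0] as f32 + lr[1] as f32) / 2.0)
        .collect()
}

// Hedged sketch: naive linear-interpolation resampler (no anti-alias filter).
fn resample_linear(samples: &[f32], from_hz: u32, to_hz: u32) -> Vec<f32> {
    let out_len = (samples.len() as u64 * to_hz as u64 / from_hz as u64) as usize;
    (0..out_len)
        .map(|i| {
            let pos = i as f64 * from_hz as f64 / to_hz as f64;
            let idx = pos as usize;
            let frac = (pos - idx as f64) as f32;
            let a = samples[idx];
            let b = *samples.get(idx + 1).unwrap_or(&a);
            a + (b - a) * frac // interpolate between neighboring samples
        })
        .collect()
}

fn main() {
    // One second of 48 kHz stereo silence -> 16 kHz mono:
    // sample count drops by 2 (channels) and then by 3 (sample rate).
    let stereo_48k: Vec<i16> = vec![0; 48_000 * 2];
    let mono = stereo_to_mono(&stereo_48k);
    assert_eq!(mono.len(), 48_000);
    let mono_16k = resample_linear(&mono, 48_000, 16_000);
    assert_eq!(mono_16k.len(), 16_000);
}
```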
whisper-burn % cargo run --release --bin transcribe tiny_en audio16k.wav en transcription.txt
warning: unused imports: `Bool`, `Float`, `Int`
--> src/helper.rs:2:51
|
2 | activation::relu, backend::Backend, BasicOps, Bool, Element, Float, Int, Numeric, Tensor,
| ^^^^ ^^^^^ ^^^
|
= note: `#[warn(unused_imports)]` on by default
warning: unused imports: `Bool`, `Int`, `activation::relu`
--> src/model/load.rs:8:14
|
8 | tensor::{activation::relu, backend::Backend, Bool, Int, Tensor},
| ^^^^^^^^^^^^^^^^ ^^^^ ^^^
warning: unused import: `Conv1dRecord`
--> src/model/mod.rs:10:38
|
10 | conv::{Conv1d, Conv1dConfig, Conv1dRecord},
| ^^^^^^^^^^^^
warning: unused import: `Tokenizer`
--> src/token.rs:4:30
|
4 | use tokenizers::{AddedToken, Tokenizer};
| ^^^^^^^^^
warning: unused import: `crate::helper::*`
--> src/transcribe.rs:2:5
|
2 | use crate::helper::*;
| ^^^^^^^^^^^^^^^^
warning: unused import: `num_traits::ToPrimitive`
--> src/transcribe.rs:7:5
|
7 | use num_traits::ToPrimitive;
| ^^^^^^^^^^^^^^^^^^^^^^^
warning: unused imports: `Float`, `Int`, `config::Config`, `self`
--> src/transcribe.rs:12:5
|
12 | config::Config,
| ^^^^^^^^^^^^^^
...
16 | backend::{self, Backend},
| ^^^^
17 | Data, Float, Int, Tensor,
| ^^^^^ ^^^
warning: unused import: `std::cmp::Ordering`
--> src/beam.rs:1:5
|
1 | use std::cmp::Ordering;
| ^^^^^^^^^^^^^^^^^^
warning: unused variable: `n_batch`
--> src/model/mod.rs:132:14
|
132 | let [n_batch, seq_len] = x.dims();
| ^^^^^^^ help: if this is intentional, prefix it with an underscore: `_n_batch`
|
= note: `#[warn(unused_variables)]` on by default
warning: variable does not need to be mutable
--> src/token.rs:15:13
|
15 | let mut tokenizer = tokenizers::Tokenizer::from_file("tokenizer.json")?;
| ----^^^^^^^^^
| |
| help: remove this `mut`
|
= note: `#[warn(unused_mut)]` on by default
warning: unused variable: `new_text`
--> src/transcribe.rs:53:14
|
53 | let (new_text, new_tokens) =
| ^^^^^^^^ help: if this is intentional, prefix it with an underscore: `_new_text`
warning: unused variable: `n_ctx_max_decoder`
--> src/transcribe.rs:159:9
|
159 | let n_ctx_max_decoder = whisper.decoder_ctx_size();
| ^^^^^^^^^^^^^^^^^ help: if this is intentional, prefix it with an underscore: `_n_ctx_max_decoder`
warning: unused variable: `n_channel`
--> src/transcribe.rs:161:10
|
161 | let [n_channel, n_mel, n_ctx] = mels.dims();
| ^^^^^^^^^ help: if this is intentional, prefix it with an underscore: `_n_channel`
warning: unused variable: `first_timestamp_token`
--> src/transcribe.rs:183:9
|
183 | let first_timestamp_token = bpe.special_token(SpecialToken::Timestamp(0.0)).unwrap();
| ^^^^^^^^^^^^^^^^^^^^^ help: if this is intentional, prefix it with an underscore: `_first_timestamp_token`
warning: unused variable: `initial_tokens`
--> src/transcribe.rs:195:13
|
195 | let mut initial_tokens = if prev_nonspecial_tokens.len() > 0 {
| ^^^^^^^^^^^^^^ help: if this is intentional, prefix it with an underscore: `_initial_tokens`
warning: unused variable: `n_batch`
--> src/transcribe.rs:263:14
|
263 | let [n_batch, n_token, n_dict] = log_probs.dims();
| ^^^^^^^ help: if this is intentional, prefix it with an underscore: `_n_batch`
warning: unused variable: `n_token`
--> src/transcribe.rs:263:23
|
263 | let [n_batch, n_token, n_dict] = log_probs.dims();
| ^^^^^^^ help: if this is intentional, prefix it with an underscore: `_n_token`
warning: unused variable: `n_dict`
--> src/transcribe.rs:263:32
|
263 | let [n_batch, n_token, n_dict] = log_probs.dims();
| ^^^^^^ help: if this is intentional, prefix it with an underscore: `_n_dict`
warning: variable does not need to be mutable
--> src/transcribe.rs:195:9
|
195 | let mut initial_tokens = if prev_nonspecial_tokens.len() > 0 {
| ----^^^^^^^^^^^^^^
| |
| help: remove this `mut`
warning: unused variable: `end_node`
--> src/beam.rs:74:17
|
74 | let end_node = continuations[end_node_index].clone();
| ^^^^^^^^ help: if this is intentional, prefix it with an underscore: `_end_node`
warning: unused variable: `tok1`
--> src/beam.rs:77:39
|
77 | ..._unstable_by(|(tok1, log_prob1), (tok2, log_prob2)| log_prob1.partial_cmp(&log_prob2).unwrap());
| ^^^^ help: if this is intentional, prefix it with an underscore: `_tok1`
warning: unused variable: `tok2`
--> src/beam.rs:77:58
|
77 | ..., log_prob1), (tok2, log_prob2)| log_prob1.partial_cmp(&log_prob2).unwrap());
| ^^^^ help: if this is intentional, prefix it with an underscore: `_tok2`
warning: function `get_mel_filters` is never used
--> src/audio.rs:58:4
|
58 | fn get_mel_filters<B: Backend>(
| ^^^^^^^^^^^^^^^
|
= note: `#[warn(dead_code)]` on by default
warning: function `fft_frequencies` is never used
--> src/audio.rs:145:4
|
145 | fn fft_frequencies<B: Backend>(sample_rate: f64, n_fft: usize) -> Tensor<B, 1> {
| ^^^^^^^^^^^^^^^
warning: function `test_fft_frequencies` is never used
--> src/audio.rs:159:4
|
159 | fn test_fft_frequencies<B: Backend>() {
| ^^^^^^^^^^^^^^^^^^^^
warning: function `test_mel_frequencies` is never used
--> src/audio.rs:166:4
|
166 | fn test_mel_frequencies<B: Backend>(htk: bool) {
| ^^^^^^^^^^^^^^^^^^^^
warning: function `mel_frequencies` is never used
--> src/audio.rs:174:4
|
174 | fn mel_frequencies<B: Backend>(n_mels: usize, fmin: f64, fmax: f64, htk: bool) -> Tensor<B, 1> {
| ^^^^^^^^^^^^^^^
warning: function `construct_special_tokens` is never used
--> src/token.rs:297:4
|
297 | fn construct_special_tokens() -> Vec<AddedToken> {
| ^^^^^^^^^^^^^^^^^^^^^^^^
warning: field `log_prob` is never read
--> src/transcribe.rs:145:5
|
143 | struct BeamSearchToken {
| --------------- field in this struct
144 | token: usize,
145 | log_prob: f64,
| ^^^^^^^^
|
= note: `BeamSearchToken` has a derived impl for the trait `Clone`, but this is intentionally ignored during dead code analysis
warning: function `first_repetition_end` is never used
--> src/transcribe.rs:370:4
|
370 | fn first_repetition_end(tokens: &[usize], period: usize) -> usize {
| ^^^^^^^^^^^^^^^^^^^^
warning: function `repetition_period` is never used
--> src/transcribe.rs:380:4
|
380 | fn repetition_period(
| ^^^^^^^^^^^^^^^^^
warning: function `find_repeated_tokens_index` is never used
--> src/transcribe.rs:404:4
|
404 | fn find_repeated_tokens_index(
| ^^^^^^^^^^^^^^^^^^^^^^^^^^
warning: `whisper` (lib) generated 32 warnings (run `cargo fix --lib -p whisper` to apply 22 suggestions)
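Most of these are benign unused-binding/dead-code lints, and `cargo fix` applies the underscore-prefix suggestions automatically. For illustration, the fix the compiler proposes looks like this (names are hypothetical, not taken from the repo):

```rust
/// Returns the log-probability of the first continuation; the token itself is
/// deliberately ignored, so the binding is underscore-prefixed.
fn first_log_prob(continuations: &[(usize, f64)]) -> f64 {
    // `_tok` silences the `unused variable` warning that a plain `tok` produces.
    let (_tok, log_prob) = continuations[0];
    log_prob
}

fn main() {
    println!("{}", first_log_prob(&[(42, -0.5)]));
}
```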
Compiling whisper v0.1.0 (/Users/davidgortega/Documents/projects/kunzite/whisper-burn)
warning: unused import: `std::collections::HashMap`
--> src/bin/transcribe/main.rs:1:5
|
1 | use std::collections::HashMap;
| ^^^^^^^^^^^^^^^^^^^^^^^^^
|
= note: `#[warn(unused_imports)]` on by default
warning: unused import: `std::iter`
--> src/bin/transcribe/main.rs:2:5
|
2 | use std::iter;
| ^^^^^^^^^
warning: unused import: `whisper::helper::*`
--> src/bin/transcribe/main.rs:4:5
|
4 | use whisper::helper::*;
| ^^^^^^^^^^^^^^^^^^
warning: unused import: `token`
--> src/bin/transcribe/main.rs:6:15
|
6 | use whisper::{token, token::Language};
| ^^^^^
warning: unused imports: `Data`, `Float`, `Int`, `Tensor`, `self`, `self`
--> src/bin/transcribe/main.rs:23:9
|
23 | self,
| ^^^^
24 | backend::{self, Backend},
| ^^^^
25 | Data, Float, Int, Tensor,
| ^^^^ ^^^^^ ^^^ ^^^^^^
warning: unused import: `num_traits::ToPrimitive`
--> src/bin/transcribe/main.rs:57:5
|
57 | use num_traits::ToPrimitive;
| ^^^^^^^^^^^^^^^^^^^^^^^
warning: unused import: `whisper::audio::prep_audio`
--> src/bin/transcribe/main.rs:58:5
|
58 | use whisper::audio::prep_audio;
| ^^^^^^^^^^^^^^^^^^^^^^^^^^
warning: unused import: `SpecialToken`
--> src/bin/transcribe/main.rs:59:37
|
59 | use whisper::token::{Gpt2Tokenizer, SpecialToken};
| ^^^^^^^^^^^^
warning: unused variable: `duration`
--> src/bin/transcribe/main.rs:35:9
|
35 | let duration = reader.duration() as usize;
| ^^^^^^^^ help: if this is intentional, prefix it with an underscore: `_duration`
|
= note: `#[warn(unused_variables)]` on by default
warning: unused variable: `bits_per_sample`
--> src/bin/transcribe/main.rs:38:9
|
38 | let bits_per_sample = spec.bits_per_sample;
| ^^^^^^^^^^^^^^^ help: if this is intentional, prefix it with an underscore: `_bits_per_sample`
warning: variable does not need to be mutable
--> src/bin/transcribe/main.rs:32:9
|
32 | let mut reader = hound::WavReader::open(filename)?;
| ----^^^^^^
| |
| help: remove this `mut`
|
= note: `#[warn(unused_mut)]` on by default
warning: unused variable: `tokens`
--> src/bin/transcribe/main.rs:145:16
|
145 | let (text, tokens) = match waveform_to_text(&whisper, &bpe, lang, waveform, sample_rate) {
| ^^^^^^ help: if this is intentional, prefix it with an underscore: `_tokens`
warning: `whisper` (bin "transcribe") generated 12 warnings (run `cargo fix --bin "transcribe"` to apply 12 suggestions)
Finished release [optimized] target(s) in 3.43s
warning: the following packages contain code that will be rejected by a future version of Rust: nom v1.2.4, nom v3.2.1
note: to see what the problems were, use the option `--future-incompat-report`, or run `cargo report future-incompatibilities --id 1`
Running `target/release/transcribe tiny_en audio16k.wav en transcription.txt`
Loading waveform...
Loading model...
Depth: 0
Chunk 0:
Transcription finished.
The transcription file is empty.
If I debug it at
let (text, tokens) = match waveform_to_text(&whisper, &bpe, lang, waveform, sample_rate)
both text and tokens are empty.
This is my file, generated by sox.
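Before suspecting the decoder, it's worth confirming that the WAV really is 16 kHz mono PCM, since sox and ffmpeg defaults differ and Whisper expects exactly that format. A std-only sanity check (this is not the repo's loader, which uses hound; it assumes the canonical 44-byte header layout):

```rust
/// Builds a canonical 44-byte header for 16 kHz mono 16-bit PCM (no samples),
/// purely for demonstration.
fn sample_header() -> Vec<u8> {
    let mut h = Vec::new();
    h.extend_from_slice(b"RIFF");
    h.extend_from_slice(&36u32.to_le_bytes()); // remaining chunk size
    h.extend_from_slice(b"WAVE");
    h.extend_from_slice(b"fmt ");
    h.extend_from_slice(&16u32.to_le_bytes()); // fmt chunk size
    h.extend_from_slice(&1u16.to_le_bytes()); // audio format: 1 = PCM
    h.extend_from_slice(&1u16.to_le_bytes()); // channels: mono
    h.extend_from_slice(&16000u32.to_le_bytes()); // sample rate
    h.extend_from_slice(&32000u32.to_le_bytes()); // byte rate
    h.extend_from_slice(&2u16.to_le_bytes()); // block align
    h.extend_from_slice(&16u16.to_le_bytes()); // bits per sample
    h.extend_from_slice(b"data");
    h.extend_from_slice(&0u32.to_le_bytes()); // data size
    h
}

/// Returns (channels, sample_rate, bits_per_sample) if `bytes` starts with a
/// canonical PCM WAV header, else None.
fn wav_spec(bytes: &[u8]) -> Option<(u16, u32, u16)> {
    if bytes.len() < 44
        || &bytes[0..4] != b"RIFF"
        || &bytes[8..12] != b"WAVE"
        || &bytes[12..16] != b"fmt "
    {
        return None;
    }
    let u16_at = |i: usize| u16::from_le_bytes([bytes[i], bytes[i + 1]]);
    let u32_at = |i: usize| u32::from_le_bytes([bytes[i], bytes[i + 1], bytes[i + 2], bytes[i + 3]]);
    if u16_at(20) != 1 {
        return None; // not uncompressed PCM
    }
    Some((u16_at(22), u32_at(24), u16_at(34)))
}

fn main() {
    // Whisper expects 16 kHz mono; anything else here could explain empty output.
    println!("{:?}", wav_spec(&sample_header())); // Some((1, 16000, 16))
}
```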
As a TVM user I'm very excited about this project because of its use of burn and its access to WGPU native. Personally speaking, this is the way to go.
However, my tests are very discouraging: WGPU seems to be performing worse than CPU.
cargo run --release --bin transcribe --features wgpu-backend medium audio16k.wav transcription.txt
Running `target/release/transcribe medium audio16k.wav en transcription.txt`
Loading waveform...
Loading model...
Depth: 0
...
Chunk 0: Hello, I am the whisper machine learning model. If you see this as text, then I am working properly.
infer took: 49665 ms
cargo run --release --bin transcribe medium audio16k.wav en transcription.txt
Running `target/release/transcribe medium audio16k.wav en transcription.txt`
Loading waveform...
Loading model...
Depth: 0
...
Chunk 0: Hello, I am the whisper machine learning model. If you see this as text, then I am working properly.
infer took: 19517 ms
Transcription finished.
The code was slightly modified:
fn main() {
    cfg_if::cfg_if! {
        if #[cfg(feature = "wgpu-backend")] {
            type Backend = WgpuBackend<AutoGraphicsApi, f32, i32>;
            let device = WgpuDevice::BestAvailable;
        } else if #[cfg(feature = "torch-backend")] {
            type Backend = TchBackend<f32>;
            let device = TchDevice::Cpu;
        }
    }
    ...
    let start_time = Instant::now();
    let (text, tokens) = match waveform_to_text(&whisper, &bpe, lang, waveform, sample_rate) {
        Ok((text, tokens)) => (text, tokens),
        Err(e) => {
            eprintln!("Error during transcription: {}", e);
            process::exit(1);
        }
    };
    let end_time = Instant::now();
    let elapsed_time_ms = end_time.duration_since(start_time).as_millis();
    println!("infer took: {} ms", elapsed_time_ms);
The same roughly 3x gap holds for tiny on CPU vs tiny on WGPU.
Might it not be optimised for my machine? Or is it simply not working?
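One possible confound (an assumption on my part, not something verified against this repo): the first WGPU inference can include one-time costs such as shader compilation, so timing a single run penalizes the GPU backend. A std-only timing helper that discards a warm-up call would separate the two:

```rust
use std::time::{Duration, Instant};

/// Averages the wall-clock time of `f` over `runs` calls after one discarded
/// warm-up call (the warm-up absorbs one-time costs like shader compilation).
fn time_avg_ms<F: FnMut()>(mut f: F, runs: u32) -> u128 {
    f(); // warm-up, not timed
    let start = Instant::now();
    for _ in 0..runs {
        f();
    }
    start.elapsed().as_millis() / runs as u128
}

fn main() {
    // Stand-in workload; in the real benchmark `f` would call waveform_to_text.
    let avg = time_avg_ms(|| std::thread::sleep(Duration::from_millis(5)), 3);
    println!("avg: {} ms", avg);
}
```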
This library is awesome, thank you. Incredibly fast and a much nicer API than alternatives.
I was hoping it would be the magic bullet that works on M2 and CUDA so that it can be deployed (running services from a MacBook seems the only option with these models!).
I tried last night on AWS with TchBackend and ran into:
Could not run 'aten::empty_strided' with arguments from the 'CUDA' backend.
After that I noticed your chunk branch used the same settings I'd used.
It looks like empty_strided isn't available on CUDA at all, and models using it need to be moved to the CPU.
Is it possible to use alternative methods in the tensor constructors so that it's compatible with both WGPU and CUDA? Or do you have any pointers? Did you get it working with Tch initially?
Hi, thanks for developing this awesome Whisper implementation! I'm looking to deploy a small
Whisper model I finetuned using HuggingFace transformers. The model is supposed to generate Cantonese romanizations; the language was set to English during training because they share the same ASCII letters. The primary motivation is to take advantage of burn's wgpu backend for cross-platform deployment to both iOS and Android users. Prior to trying your library, I managed to get my finetuned model running on iOS using whisper.cpp, but I'd prefer a Rust backend for portability.
For my experiment with importing the model into whisper-burn, I first converted the HuggingFace model to Whisper's pt format using a script (see step 1 of this issue). I then followed the steps in the README and successfully converted the model to the burn format. However, when I ran inference, my model produced garbage transcripts on the provided audio16k.wav as well as on my own test audio. For example, audio16k.wav produced a transcript of "onbed", when the model should normally recognize English inputs in addition to Cantonese.
I'm wondering if it's possible for you to support importing HuggingFace models directly into whisper-burn? That way it's easier to eliminate intermediate bugs in the conversion pipeline. Maybe the convert-h5-to-ggml script from whisper.cpp can come in handy? Thanks.
May I ask why Tinygrad is needed for the weight conversion? The script seems to dump the weights with np, and afterwards they are read by tinygrad.
Not sure if this is related to loading the model or to the transcription process. Also, restoring the checkpoint into VRAM seems to take much longer than in the Python version.
RUST_BACKTRACE=1 cargo run --release audio.wav large-v2
Caused by:
In Device::create_bind_group
Buffer binding 0 range 265548800 exceeds `max_*_buffer_binding_size` limit 134217728
', /home/username/.cargo/registry/src/index.crates.io-6f17d22bba15001f/wgpu-0.17.0/src/backend/direct.rs:3056:5
stack backtrace:
0: rust_begin_unwind
at /rustc/eb26296b556cef10fb713a38f3d16b9886080f26/library/std/src/panicking.rs:593:5
1: core::panicking::panic_fmt
at /rustc/eb26296b556cef10fb713a38f3d16b9886080f26/library/core/src/panicking.rs:67:14
2: core::ops::function::Fn::call
3: <wgpu::backend::direct::Context as wgpu::context::Context>::device_create_bind_group
4: <T as wgpu::context::DynContext>::device_create_bind_group
5: wgpu::Device::create_bind_group
6: burn_wgpu::context::base::Context::execute
7: burn_wgpu::kernel::index::select::select
8: burn_tensor::tensor::ops::modules::base::ModuleOps::embedding
9: whisper::model::TextDecoder<B>::forward
10: whisper::transcribe::waveform_to_text
11: whisper::main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
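The chunking idea floated earlier can be sketched with plain slice windows. A caveat: the failing binding here is in the embedding lookup, whose buffer size depends on the model's vocabulary table rather than the audio length, so chunking the input may not be enough on its own and the 134217728-byte (128 MiB) wgpu binding limit may need raising instead. The 30-second window and 16 kHz rate below are Whisper's usual values, not taken from this repo's code:

```rust
/// Splits a mono waveform into windows of at most `chunk_secs` seconds each;
/// the final window holds whatever remains.
fn chunk_waveform(waveform: &[f32], sample_rate: usize, chunk_secs: usize) -> Vec<Vec<f32>> {
    waveform
        .chunks(sample_rate * chunk_secs)
        .map(|c| c.to_vec())
        .collect()
}

fn main() {
    let waveform = vec![0.0f32; 16000 * 65]; // 65 s of silence at 16 kHz
    let chunks = chunk_waveform(&waveform, 16000, 30);
    // 65 s splits into 30 s + 30 s + 5 s
    println!("{} chunks, last = {} samples", chunks.len(), chunks[2].len());
}
```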