
silero-vad's Introduction



Silero VAD


Silero VAD - pre-trained enterprise-grade Voice Activity Detector (also see our STT models).


Real Time Example

(video: real-time-example.mp4)

Key Features


  • Stellar accuracy

    Silero VAD has excellent results on speech detection tasks.

  • Fast

    One audio chunk (30+ ms) takes less than 1 ms to process on a single CPU thread. Batching or using a GPU can improve performance considerably. Under certain conditions, ONNX may even run 4-5x faster.

  • Lightweight

    JIT model is around one megabyte in size.

  • General

    Silero VAD was trained on huge corpora covering over 100 languages, and it performs well on audio from different domains with various levels of background noise and quality.

  • Flexible sampling rate

    Silero VAD supports 8000 Hz and 16000 Hz sampling rates.

  • Flexible chunk size

    The model was trained on 30 ms chunks. Longer chunks are supported directly; other sizes may work as well.

  • Highly Portable

    Silero VAD reaps the benefits of the rich ecosystems built around PyTorch and ONNX and runs everywhere these runtimes are available.

  • No Strings Attached

    Published under a permissive license (MIT), Silero VAD has zero strings attached: no telemetry, no keys, no registration, no built-in expiration, no vendor lock-in.
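
For reference, a minimal usage sketch (assuming the torch.hub entry point and the five-element utils tuple of recent releases; older releases returned a different tuple, as several issues below show; 'example.wav' is a placeholder):

import torch

# load the JIT model and helper utilities from torch.hub
model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
                              model='silero_vad')
(get_speech_timestamps, save_audio, read_audio,
 VADIterator, collect_chunks) = utils

wav = read_audio('example.wav', sampling_rate=16000)

# start/end sample indices of the detected speech segments
speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=16000)
print(speech_timestamps)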


Typical Use Cases


  • Voice activity detection for IoT / edge / mobile use cases
  • Data cleaning and preparation, voice detection in general
  • Telephony and call-center automation, voice bots
  • Voice interfaces

Get In Touch


Try our models, create an issue, start a discussion, join our telegram chat, email us, read our news.

Please see our wiki and tiers for relevant information and email us directly.

Citations

@misc{Silero_VAD,
  author = {Silero Team},
  title = {Silero VAD: pre-trained enterprise-grade Voice Activity Detector (VAD), Number Detector and Language Classifier},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/snakers4/silero-vad}},
  commit = {insert_some_commit_here},
  email = {[email protected]}
}

Examples and VAD-based Community Apps


  • Example of VAD ONNX Runtime model usage in C++

  • Voice activity detection for the browser using ONNX Runtime Web


silero-vad's Issues

Comma missing in README.md VAD examples

After the README.md updates, the VAD example code looks like this:

...

(get_speech_ts,
 get_speech_ts_adaptive
 _, read_audio,
 _, _, _) = utils

...

So commas are missing after get_speech_ts_adaptive in several snippets; the corrected unpacking is shown below.
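
The fix, with the comma restored:

(get_speech_ts,
 get_speech_ts_adaptive,
 _, read_audio,
 _, _, _) = utils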

❓ Questions / Help / Support

❓ Questions and Help

This looks great; I saw your post on the KAIST VAD repo! I have two questions:

  • Do you plan to release any more information about the network architecture you used for training, or the training framework itself?
  • Have you also checked https://github.com/ina-foss/inaSpeechSegmenter? That framework is also excellent, and its gender and music classification is fantastic. However, it is offline by default, not online.

❓ VAD Training Data Used

❓ VAD Training Data

I am not sure whether it is appropriate to ask this, so forgive me if I am wrong. Could you let me know what data was used to train silero_vad? Is it proprietary data or a public dataset? If the latter, could you please name it?
Are there any datasets you would recommend for training a VAD?

Bug report - torch.cat with an empty list of Tensors

🐛 Bug

torch.cat is called with an empty list of tensors in utils_vad.py.

Process Process-13:
Traceback (most recent call last):
  File "/lium/raid01_b/pchampi/lab/sidekit-for-vpc/venv/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/lium/raid01_b/pchampi/lab/sidekit-for-vpc/venv/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/lium/raid01_b/pchampi/lab/sidekit-for-vpc/tools/apply_vad_on_csv.py", line 49, in job
    vad_wav = collect_chunks(speech_timestamps, wav)
  File "/lium/home/pchampi/.cache/torch/hub/snakers4_silero-vad_a345715/utils_vad.py", line 622, in collect_chunks
    return torch.cat(chunks)
RuntimeError: There were no tensor arguments to this function (e.g., you passed an empty list of Tensors), but no fallback function is registered for schema aten::_cat.  This usually means that this function requires a non-empty list of Tensors.  Available functions are [CPU, CUDA, QuantizedCPU, BackendSelect, Named, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradNestedTensor, UNKNOWN_TENSOR_TYPE_ID, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, Tracer, Autocast, Batched, VmapMode].

CPU: registered at aten/src/ATen/RegisterCPU.cpp:5925 [kernel]
CUDA: registered at aten/src/ATen/RegisterCUDA.cpp:7100 [kernel]
QuantizedCPU: registered at aten/src/ATen/RegisterQuantizedCPU.cpp:641 [kernel]
BackendSelect: fallthrough registered at ../aten/src/ATen/core/BackendSelectFallbackKernel.cpp:3 [backend fallback]
Named: registered at ../aten/src/ATen/core/NamedRegistrations.cpp:7 [backend fallback]
AutogradOther: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:9122 [autograd kernel]
AutogradCPU: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:9122 [autograd kernel]
AutogradCUDA: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:9122 [autograd kernel]
AutogradXLA: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:9122 [autograd kernel]
AutogradNestedTensor: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:9122 [autograd kernel]
UNKNOWN_TENSOR_TYPE_ID: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:9122 [autograd kernel]
AutogradPrivateUse1: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:9122 [autograd kernel]
AutogradPrivateUse2: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:9122 [autograd kernel]
AutogradPrivateUse3: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:9122 [autograd kernel]
Tracer: registered at ../torch/csrc/autograd/generated/TraceType_2.cpp:10525 [kernel]
Autocast: registered at ../aten/src/ATen/autocast_mode.cpp:254 [kernel]
Batched: registered at ../aten/src/ATen/BatchingRegistrations.cpp:1016 [backend fallback]
VmapMode: fallthrough registered at ../aten/src/ATen/VmapModeRegistrations.cpp:33 [backend fallback]
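
For what it's worth, a minimal guard against this case; a sketch assuming timestamps is the list of {'start': ..., 'end': ...} dicts (in samples) returned by the speech-timestamp helpers:

import torch

def collect_chunks_safe(timestamps, wav):
    # torch.cat raises on an empty list, so return an empty tensor
    # when no speech was detected instead of crashing
    if not timestamps:
        return torch.zeros(0)
    return torch.cat([wav[ts['start']:ts['end']] for ts in timestamps])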

Env

Collecting environment information...
PyTorch version: 1.8.2+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Debian GNU/Linux 10 (buster) (x86_64)
GCC version: (Debian 8.3.0-6) 8.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.28

Python version: 3.8.5 (default, Sep  4 2020, 07:30:14)  [GCC 7.3.0] (64-bit runtime)
Python platform: Linux-4.19.0-8-amd64-x86_64-with-glibc2.10
Is CUDA available: False
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.21.3
[pip3] torch==1.8.2+cu102
[pip3] torchaudio==0.8.2
[pip3] torchvision==0.9.2+cu102
[conda] numpy                     1.21.3                   pypi_0    pypi
[conda] torch                     1.8.2+cu102              pypi_0    pypi
[conda] torchaudio                0.8.2                    pypi_0    pypi
[conda] torchvision               0.9.2+cu102              pypi_0    pypi

❓ How to set a max speech duration?

As the title says: using the default VAD parameters for speech segmentation, some segments come out longer than 3 minutes. Is there a parameter such as max_speech_duration that can be set?

start= 0.00, dur= 0.94
start= 2.75, dur= 9.88
start= 13.50, dur= 6.56
start= 20.75, dur= 3.06
start= 23.88, dur= 41.44
start= 66.62, dur= 1.06
start= 69.56, dur=144.50
start=214.31, dur= 61.56
start=276.12, dur=257.50
start=533.62, dur= 27.00
start=560.81, dur= 54.19
start=614.94, dur=213.44
start=828.31, dur= 6.44

thanks
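
For reference, newer releases of silero-vad expose a max_speech_duration_s argument on get_speech_timestamps; a sketch assuming a recent version (verify against the installed release):

# segments longer than max_speech_duration_s are split (recent releases only)
speech_timestamps = get_speech_timestamps(wav, model,
                                          sampling_rate=16000,
                                          max_speech_duration_s=60)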

False negatives & false positives on 8k Chinese phone recordings

Hi @snakers4, really impressive project!
I am using this awesome VAD in my project, but I have found some false negative and false positive examples in my experiments, and it is really hard to find good parameters (trig, neg_trig, min_speech, min_silence).
My dataset consists of Chinese phone recordings at an 8 kHz sample rate. I really believe this project can work; can you give me some advice on what I can do using only the existing models?
Best regards

Values in the output

Good afternoon. I am trying the ONNX model, and after running session.run (on a 4000-element sample) I get two values in the output, e.g. 0.94601 and 0.0567758 for the first sample of files_ru.wav. As I understand it, the first value is the probability that the sample is speech, correct? And what does the second value represent?

How to deploy to Android?

  1. How to do feature extraction on Android?
  2. How to do inference on Android?

Is there any C++ code reference?
Thanks

Bug report - [timestamps overlap]

🐛 Bug

Hi,
I use the get_speech_ts_adaptive() function in utils_vad.py to find speech regions. When I test an audio file with a sample rate of 16000 Hz and a bitrate of 256000 (about 1 minute long), some timestamps overlap: the end of one timestamp overlaps with the start of the next.

Environment

  • PyTorch Version: 1.8.1
  • OS: Ubuntu 20.04
  • How you installed PyTorch: pip
  • Python version: 3.8.5
  • CUDA/cuDNN version: no CUDA
  • GPU models and configuration: CPU, 1 thread by default

❓ Model Structure and Training

❓ Training details

Great, very exciting work. Unfortunately, I couldn't find details of the model structure and training. I'm very interested; could you share them?

Why does the "Single Audio Stream" example run slowly?

The "Single Audio Stream" example takes 7.769 s:

wav = f'{files_dir}/en.wav'

for batch in single_audio_stream(model, wav):
    if batch:
        print(batch)

but this is slow compared with the "Full Audio" example (2.879 s).

Feature request - [speech, music, noise]

🚀 Feature

Extend the VAD to cover speech, music, and noise.

Motivation

As music is common these days, a VAD that only separates speech from noise is not enough.

Pitch

Detect speech, music, and noise in an audio stream.

Error when running the ONNX VAD example

When running the ONNX VAD example, it showed the error below:

import torch
import onnxruntime
from pprint import pprint

_, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
                          model='silero_vad',
                          force_reload=True)

(get_speech_ts,
 _,_,
 read_audio,
 _, _, _) = utils

files_dir = torch.hub.get_dir() + '/snakers4_silero-vad_master/files'

def init_onnx_model(model_path: str):
    return onnxruntime.InferenceSession(model_path)

def validate_onnx(model, inputs):
    with torch.no_grad():
        ort_inputs = {'input': inputs.cpu().numpy()}
        outs = model.run(None, ort_inputs)
        outs = [torch.Tensor(x) for x in outs]
    return outs

model = init_onnx_model(f'{files_dir}/model.onnx')
wav = read_audio(f'{files_dir}/en.wav')

# get speech timestamps from full audio file
speech_timestamps = get_speech_ts(wav, model, num_steps=4, run_function=validate_onnx)
pprint(speech_timestamps)

Traceback (most recent call last):
  File "1.py", line 30, in <module>
    speech_timestamps = get_speech_ts(wav, model, num_steps=4, run_function=validate_onnx)
  File "/home/gary/.cache/torch/hub/snakers4_silero-vad_master/utils_vad.py", line 112, in get_speech_ts
    outs = torch.cat(outs, dim=0)
TypeError: expected Tensor as element 0 in argument 0, but got list

Originally posted by @garymmi in #58
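
A likely cause, judging by the traceback: get_speech_ts concatenates whatever run_function returns with torch.cat, so validate_onnx should return a single tensor rather than a list of tensors. A sketch of the corrected helper:

def validate_onnx(model, inputs):
    with torch.no_grad():
        ort_inputs = {'input': inputs.cpu().numpy()}
        outs = model.run(None, ort_inputs)
    # return the first output as a tensor instead of a list of tensors
    return torch.Tensor(outs[0])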

After release of silero-mini, colab examples became outdated

After the release of silero-mini, a new function was added to utils_vad.py, which made the following piece of code invalid, raising a 'Too many values to unpack' exception:

(get_speech_ts,
 _, read_audio,
 _, _, _) = utils

It should be the following instead:

(get_speech_ts,
 _, _,
 read_audio,
 _, _, _) = utils

This is true for all examples.

GPU inference

[W:onnxruntime:Default, fallback_cpu_capability.h:140 GetCpuPreferedNodes] Force fallback to CPU execution for node: Equal_890

Hello. Any chance to run inference on the GPU? After several tries I got an error saying that this model is quantized; maybe you can share a non-quantized version?

Mobile / Edge / ARM / ONNX Use Cases

While the VAD (especially the micro one) was explicitly designed for IoT / edge / mobile use cases, we do not have the resources or expertise to provide instructions for the corresponding ARM / mobile builds of PyTorch and / or ONNX.

The ONNX guides were refurbished recently, and it is implied that ARM binaries will be made available (but they are not yet).

People from the community (see the telegram chat) have also reported successful builds and use of silero-models on PyTorch, replacing MKL with CBLAS.

In any case, sharing such dockerized builds (e.g. based on Debian / Ubuntu / Alpine) for your tested use cases would be of great value to the community; PRs are greatly encouraged and appreciated.

Please see some examples here: https://github.com/microsoft/onnxruntime/blob/master/dockerfiles/README.md#arm-32v7

If you feel like doing something like this, please provide the build as a Dockerfile, together with some background info: which arch / device / processor you ran it on, whether this hardware is generally available, what the end performance is, etc.

Bug report - loading error

🐛 Bug

Loading the model with hub.load fails

To Reproduce

(base) $ python3
Python 3.8.5 (default, Sep  4 2020, 07:30:14)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> import torchaudio
>>> import soundfile
>>>
>>> torch.__version__
'1.8.2+cu102'
>>> torchaudio.__version__
'0.8.2'
>>> soundfile.__version__
'0.10.3'
>>> model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad:a345715',
...                               model='silero_vad')
Using cache found in /lium/home/pchampi/.cache/torch/hub/snakers4_silero-vad_a345715
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/lium/raid01_b/pchampi/lab/sidekit-for-vpc/venv/lib/python3.8/site-packages/torch/hub.py", line 339, in load
    model = _load_local(repo_or_dir, model, *args, **kwargs)
  File "/lium/raid01_b/pchampi/lab/sidekit-for-vpc/venv/lib/python3.8/site-packages/torch/hub.py", line 368, in _load_local
    model = entry(*args, **kwargs)
  File "/lium/home/pchampi/.cache/torch/hub/snakers4_silero-vad_a345715/hubconf.py", line 24, in silero_vad
    model = init_jit_model(model_path=f'{hub_dir}/snakers4_silero-vad_master/files/model.jit')
  File "/lium/home/pchampi/.cache/torch/hub/snakers4_silero-vad_a345715/utils_vad.py", line 74, in init_jit_model
    model = torch.jit.load(model_path, map_location=device)
  File "/lium/raid01_b/pchampi/lab/sidekit-for-vpc/venv/lib/python3.8/site-packages/torch/jit/_serialization.py", line 161, in load
    cpp_module = torch._C.import_ir_module(cu, str(f), map_location, _extra_files)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__/torch/nn/quantized/modules/linear.py", line 17, in __setstate__
    state: Tuple[Tensor, Optional[Tensor], bool, int]) -> None:
    self.dtype = (state)[3]
    _1 = (self).set_weight_bias((state)[0], (state)[1], )
          ~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    self.training = (state)[2]
    return None
  File "code/__torch__/torch/nn/quantized/modules/linear.py", line 40, in set_weight_bias
    _10 = "Unsupported dtype on dynamic quantized linear!"
    if torch.eq(self.dtype, 12):
      _11 = ops.quantized.linear_prepack(weight, bias)
            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
      self._packed_params = _11
    else:

Traceback of TorchScript, original code (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/quantized/modules/linear.py", line 93, in __setstate__
    def __setstate__(self, state):
        self.dtype = state[3]
        self.set_weight_bias(state[0], state[1])
        ~~~~~~~~~~~~~~~~~~~~ <--- HERE
        self.training = state[2]
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/quantized/modules/linear.py", line 23, in set_weight_bias
    def set_weight_bias(self, weight: torch.Tensor, bias: Optional[torch.Tensor]) -> None:
        if self.dtype == torch.qint8:
            self._packed_params = torch.ops.quantized.linear_prepack(weight, bias)
                                  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
        elif self.dtype == torch.float16:
            self._packed_params = torch.ops.quantized.linear_prepack_fp16(weight, bias)
RuntimeError: Didn't find engine for operation quantized::linear_prepack NoQEngine

>>>

Environment

Collecting environment information...
PyTorch version: 1.8.2+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Debian GNU/Linux 10 (buster) (x86_64)
GCC version: (Debian 8.3.0-6) 8.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.28

Python version: 3.8.5 (default, Sep  4 2020, 07:30:14)  [GCC 7.3.0] (64-bit runtime)
Python platform: Linux-4.19.0-8-amd64-x86_64-with-glibc2.10
Is CUDA available: False
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.21.3
[pip3] torch==1.8.2+cu102
[pip3] torchaudio==0.8.2
[pip3] torchvision==0.9.2+cu102
[conda] numpy                     1.21.3                   pypi_0    pypi
[conda] torch                     1.8.2+cu102              pypi_0    pypi
[conda] torchaudio                0.8.2                    pypi_0    pypi
[conda] torchvision               0.9.2+cu102              pypi_0    pypi

Any relevant paper?

Dear all:
Thank you for your contribution to the whole voice community. We really appreciate the method and the pretrained models.
I wonder, is there any related paper or document illustrating the inner details of your work? Specifically, I would like to know what kind of network architecture and what datasets you used. Thank you again.

Finetuning VAD model

Is there a way to finetune the provided pretrained model on my own data? Can you share some training code? Thanks!

Migrating examples to new models

@Kai-Karren @Bontempogianpaolo1

In a few days we will be radically changing the models:

  • Probably dropping ONNX VAD models (we have not decided yet);
  • Reducing the chunk size to 30 ms (chunk size will be flexible, but no smaller than 30 ms);
  • Removing the separate 8 / 16 kHz models; all models will now work with both 8 and 16 kHz;
  • Most likely deprecating the micro, mini and ordinary models in favor of a single mini-sized model (still running the last experiments);
  • New models will be compatible with mobile builds of PyTorch;
  • Dropping the batched buffering approach we used because of large chunks;

i.e. radically simplifying and speeding up the models.

You have contributed to the examples; would you like to help improve them using the new models?

Is it possible to limit the languages within the language detection?

Hi, I am trying to use the Language Classifier 95 model, but the accuracy is not so good.
I have tried increasing the top_n value, but it did not help much.
I thought I could ignore most of the languages (which I do not care about) by specifying a reduced set of languages in the lang_dict and lang_group_dict parameters in the following line:
languages, language_groups = get_language_and_group(wav, model, lang_dict, lang_group_dict, top_n=2)
but it does not work.
Is it possible to somehow specify a subset of the languages for this model?
Thanks!

ONNX model fails to load in browser

Hello, I'm trying to load the ONNX model using JS in the browser. I'm using the official example from the ONNX.js GitHub:

<html>
  <head> </head>

  <body>
    <!-- Load ONNX.js -->
    <script src="https://cdn.jsdelivr.net/npm/onnxjs/dist/onnx.min.js"></script>
    <!-- Code that consume ONNX.js -->
    <script>
      // create a session
      const myOnnxSession = new onnx.InferenceSession();
      // load the ONNX model file
      myOnnxSession.loadModel("./my-model.onnx").then(() => {
        // generate model input
        const inferenceInputs = getInputs();
        // execute the model
        myOnnxSession.run(inferenceInputs).then((output) => {
          // consume the output
          const outputTensor = output.values().next().value;
          console.log(`model output tensor: ${outputTensor.data}.`);
        });
      });
    </script>
  </body>
</html>

But loading fails with the following message. Do you have any idea what might cause this? I found no information on the ONNX website.

(error screenshot omitted)

Bug report - Device cuda does not work

🐛 Bug

I would like to use CUDA to compute the VAD. Your toolkit has an argument for it:

device='cpu'):

But it crashes when I set device to 'cuda' (the input wav is also correctly moved .to("cuda")).
Does your toolkit support this?
BTW, thanks for your awesome work on this toolkit! 👍

Traceback (most recent call last):
  File "/lium/raid01_b/pchampi/lab/venv/bin/extract_xvectors.py", line 7, in <module>
    exec(compile(f.read(), __file__, 'exec'))
  File "/lium/raid01_b/pchampi/lab/sidekit/bin/extract_xvectors.py", line 157, in <module>
    main(xtractor, args.wav_scp, args.out_scp, args.device, args.vad, args.vad_num_samples_per_window, args.vad_min_silence_samples)
  File "/lium/raid01_b/pchampi/labvenv/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/lium/raid01_b/pchampi/lab/sidekit/bin/extract_xvectors.py", line 123, in main
    speech_timestamps = get_speech_ts_adaptive(signal.to("cuda"), model,
  File "/lium/home/pchampi/.cache/torch/hub/snakers4_silero-vad_a345715/utils_vad.py", line 227, in get_speech_ts_adaptive
    chunks = torch.Tensor(torch.cat(to_concat, dim=0)).to(device)
TypeError: expected CPU (got CUDA)

Environment

Collecting environment information...
PyTorch version: 1.8.2+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Debian GNU/Linux 10 (buster) (x86_64)
GCC version: (Debian 8.3.0-6) 8.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.28

Python version: 3.8.5 (default, Sep  4 2020, 07:30:14)  [GCC 7.3.0] (64-bit runtime)
Python platform: Linux-4.19.0-8-amd64-x86_64-with-glibc2.10
Is CUDA available: False
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.21.3
[pip3] torch==1.8.2+cu102
[pip3] torchaudio==0.8.2
[pip3] torchvision==0.9.2+cu102
[conda] numpy                     1.21.3                   pypi_0    pypi
[conda] torch                     1.8.2+cu102              pypi_0    pypi
[conda] torchaudio                0.8.2                    pypi_0    pypi
[conda] torchvision               0.9.2+cu102              pypi_0    pypi

❓ PyTorch 1.5.1 load() missing 1 required positional argument: 'github'

❓ Questions and Help

We have a wiki available for our users. Please make sure you have checked it out first.

I ran the VAD example and got the error below:

Traceback (most recent call last):
  File "vad.py", line 7, in <module>
    force_reload=True)
TypeError: load() missing 1 required positional argument: 'github'

I use torch 1.5.1.

thanks
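
A likely explanation: torch.hub.load's first argument was named github before it was renamed to repo_or_dir in PyTorch 1.6, so the repo_or_dir keyword is unknown on 1.5.1. A sketch of a workaround, passing the arguments positionally:

model, utils = torch.hub.load('snakers4/silero-vad',  # positional on torch <= 1.5
                              'silero_vad',
                              force_reload=True)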

Changelog

Just a handy issue to get notified of the latest changes and micro-releases (we will mostly be changing the models).

I want to know some details about Silero-VAD

❓ Help

Hello, thank you for the VAD tool. Our company is preparing to use it for cutting long audio, but I have not found the implementation details and principles of this VAD, which would help us better optimize the algorithm and model.

I want to know the principles behind Silero-VAD. Can you provide some relevant documents? Thank you.

❓ Questions / Help / Support RuntimeError: Backend "soundfile" is not one of available backends: ['sox', 'sox_io'].

❓ Questions and Help

We have a wiki available for our users. Please make sure you have checked it out first.

I used torch 1.7.1 and ran the VAD example; I got the error below:

Downloading: "https://github.com/snakers4/silero-vad/archive/master.zip" to /home/nick/.cache/torch/hub/master.zip
/data/nick/Python-3.6.3/lib/python3.6/site-packages/torchaudio/backend/utils.py:54: UserWarning: "sox" backend is being deprecated. The default backend will be changed to "sox_io" backend in 0.8.0 and "sox" backend will be removed in 0.9.0. Please migrate to "sox_io" backend. Please refer to pytorch/audio#903 for the detail.
'"sox" backend is being deprecated. '
Traceback (most recent call last):
File "vad.py", line 7, in <module>
force_reload=True)
File "/data/nick/Python-3.6.3/lib/python3.6/site-packages/torch/hub.py", line 370, in load
model = _load_local(repo_or_dir, model, *args, **kwargs)
File "/data/nick/Python-3.6.3/lib/python3.6/site-packages/torch/hub.py", line 396, in _load_local
hub_module = import_module(MODULE_HUBCONF, hubconf_path)
File "/data/nick/Python-3.6.3/lib/python3.6/site-packages/torch/hub.py", line 71, in import_module
spec.loader.exec_module(module)
File "", line 678, in exec_module
File "", line 219, in _call_with_frames_removed
File "/data/nick/.cache/torch/hub/snakers4_silero-vad_master/hubconf.py", line 3, in
from utils_vad import (init_jit_model,
File "/data/inck/.cache/torch/hub/snakers4_silero-vad_master/utils_vad.py", line 9, in
torchaudio.set_audio_backend("soundfile") # switch backend
File "/data/nick/Python-3.6.3/lib/python3.6/site-packages/torchaudio/backend/utils.py", line 47, in set_audio_backend
f'Backend "{backend}" is not one of '
RuntimeError: Backend "soundfile" is not one of available backends: ['sox', 'sox_io'].
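
A likely fix, assuming the backend is missing simply because the SoundFile package is not installed (torchaudio only lists 'soundfile' among the available backends when it can import it):

# pip install soundfile
import torchaudio

torchaudio.set_audio_backend("soundfile")   # should now succeed
print(torchaudio.get_audio_backend())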

Making Conda Packages

We are planning to upload even better models shortly and to reduce the repo size by moving audio samples and larger, less popular models to external links.

Since the utilities we provide now are really minimalistic, there is very little difference between something like conda install and torch.hub.load, since in Python having PyTorch is a must anyway. Essentially a package is just a model plus maybe 50-100 lines of code.

Nevertheless, many people like using packaged libraries, especially for a VAD, which seems like a "solved" task, unlike STT, which may involve some moving parts (for production-grade quality, of course).

Internally, we do not really maintain Python packages (we usually favour a pre-built Docker image approach for our own production). Maybe someone who has published a few conda or pip packages could lend a hand and help us set up a quick CI-based package export around GitHub Actions?

If we keep the VAD minimalistic in the future, maintaining this seems like a no-brainer, and maybe with conda it will even play nicely with the builds of PyTorch for other platforms provided by the PyTorch team itself?

Inconsistent output from onnx and jit

Hi @snakers4

I tried both lang_classifier_95.onnx and lang_classifier_95.jit and found that, when fed the same input, their outputs differ (by a large enough margin). Based on the names, I guess they were exported from the same PyTorch model. Why do the outcomes differ? Please help!

Thanks!
Junjie

Parameters for the 8k model?

Are the default sample-based parameters tuned for the 8k models, or should I halve the sample sizes for things like num_samples_per_window, min_speech_samples, etc.?
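
For what it's worth, one common heuristic (an assumption, not an official recommendation): if the defaults are expressed in samples at 16 kHz, halve the sample-based parameters at 8 kHz so each window covers the same duration. A hypothetical sketch (wav_8k and model_8k are placeholders):

# a 4000-sample window is 250 ms at 16 kHz; 2000 samples keep 250 ms at 8 kHz
speech_timestamps = get_speech_ts(wav_8k, model_8k,
                                  num_samples_per_window=2000)
# other sample-denominated parameters (min_speech_samples, min_silence_samples)
# would scale by the same factor of two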

Kindly look into this issue and provide an example of how to run inference with the ONNX model on a live audio stream

This is the code:

import io
import numpy as np
import torch
import torchaudio
import matplotlib
import matplotlib.pylab as plt
import pyaudio
import onnxruntime  # missing from the original snippet

torch.set_num_threads(1)
torchaudio.set_audio_backend("soundfile")

model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
                              model='silero_vad',
                              force_reload=True)

(get_speech_ts,
 get_speech_ts_adaptive,
 save_audio,
 read_audio,
 state_generator,
 single_audio_stream,
 collect_chunks) = utils

def init_onnx_model(model_path: str):
    return onnxruntime.InferenceSession(model_path)

# note: this overwrites the torch.hub model with the ONNX session
model = init_onnx_model(model_path='./model.onnx')

def validate(model, inputs: torch.Tensor):
    with torch.no_grad():
        ort_inputs = {'input': inputs.cpu().numpy()}
        outs = model.run(None, ort_inputs)
        outs = [torch.Tensor(x) for x in outs]
    return outs[0]

def int2float(sound):
    abs_max = np.abs(sound).max()
    sound = sound.astype('float32')
    if abs_max > 0:
        sound *= 1 / abs_max
    sound = sound.squeeze()
    return sound

FORMAT = pyaudio.paInt16
CHANNELS = 1
SAMPLE_RATE = 16000
CHUNK = int(SAMPLE_RATE / 10)

audio = pyaudio.PyAudio()
stream = audio.open(format=FORMAT,
                    channels=CHANNELS,
                    rate=SAMPLE_RATE,
                    input=True,
                    frames_per_buffer=CHUNK)
data = []
voiced_confidences = []

frames_to_record = 100    # not defined in the original snippet; example value
frame_duration_ms = 250   # not defined in the original snippet; example value

print("Started Recording")
for i in range(0, frames_to_record):

    audio_chunk = stream.read(int(SAMPLE_RATE * frame_duration_ms / 1000.0))

    # in case you want to save the audio later
    data.append(audio_chunk)

    audio_int16 = np.frombuffer(audio_chunk, np.int16)

    audio_float32 = int2float(audio_int16)

    # get the confidences and add them to the list to plot them later
    vad_outs = validate(model, torch.from_numpy(audio_float32))
    # only keep the confidence for the speech
    voiced_confidences.append(vad_outs[:, 1])

print("Stopped the recording")

# plot the confidences for the speech
plt.figure(figsize=(20, 6))
plt.plot(voiced_confidences)
plt.show()

The error I'm getting is:

vad_outs1 = validate(model, torch.from_numpy(audio_int161))

Traceback (most recent call last):
  File "testimport.py", line 76, in <module>
    vad_outs1 = validate(model, torch.from_numpy(audio_int161))
  File "testimport.py", line 47, in validate
    outs = model.run(None, ort_inputs)
  File "/usr/local/lib/python3.6/dist-packages/onnxruntime/capi/session.py", line 110, in run
    return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.InvalidArgument: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Unexpected input data type. Actual: (N11onnxruntime17PrimitiveDataTypeIsEE) , expected: (N11onnxruntime17PrimitiveDataTypeIfEE)

@snakers4 Kindly provide a solution.
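
A likely cause, judging by the traceback: the failing call feeds the raw int16 buffer to the ONNX session, while the model input must be float32. Converting first, as the loop above already does, avoids the type mismatch:

audio_int16 = np.frombuffer(audio_chunk, np.int16)
audio_float32 = int2float(audio_int16)   # normalize to float32 first
vad_outs = validate(model, torch.from_numpy(audio_float32))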

Tensorflow or Tensorflow Lite model of Silero VAD

🚀 Feature

Publish the open-source Silero VAD model for TensorFlow or TensorFlow Lite.

Motivation

I wish to use your Silero VAD model in a production environment where only TF is supported.

Pitch

Silero VAD would be very useful on mobile and embedded devices. TensorFlow Lite is the best option in the context of devices with limited memory and capacity.

Alternatives

I tried to convert both published models to TF but ran into different problems, maybe because of the mutable input size, or because of TorchScript and onnxruntime being used instead of classical torch and ONNX types.

Additional context

Thank you in advance.

Are ONNX Models Necessary?

In a few days we will be radically changing the models:

  • Probably dropping ONNX VAD models (we have not decided yet);
  • Reducing the chunk size to 30 ms (chunk size will be flexible, but no smaller than 30 ms);
  • Removing the separate 8 / 16 kHz models; all models will now work with both 8 and 16 kHz;
  • Most likely deprecating the micro, mini and ordinary models in favor of a single mini-sized model (still running the last experiments);
  • New models will be compatible with mobile builds of PyTorch;
  • Dropping the batched buffering approach we used because of large chunks;

i.e. we will be radically simplifying everything.

We have seen limited use of the ONNX models, hence the question.

silero-models and silero-vad combined lead to ImportError

When using both silero-models and silero-vad together in one function, only the first of the two loads works; the second leads to an ImportError:

ImportError: cannot import name 'get_speech_ts'

I assume I am missing something trivial here, but I couldn't figure out how to solve this so far. Any ideas?
