NISQA-s: Speech Quality and Naturalness Assessment for Online Inference

NISQA-s is a heavily stripped-down and optimized version of the original NISQA metric. It aims to provide a universal set of metrics for both offline and online evaluation of audio quality.

This version supports only the CNN+LSTM variant of the original model (the other modifications either don't support streaming or are too slow). It uses the same architecture with some tweaks for streaming purposes. There is also no separate MOS-only model, since the main model already predicts MOS (which keeps the code and repo simpler).

Installation

(Optional) Create a new venv or conda environment

Then install the dependencies with pip install -r requirements.txt

Note that there may be problems with the torch installation. If so, follow the official PyTorch instructions.

Quick start

To run this repo with the provided config and samples:

python -m scripts.run_infer_file

To test online inference from your microphone:

python -m scripts.run_infer_mic

Inference results are logged to the terminal, so keep an eye on it.

Config options

The default config is config/nisqa_s.yaml. All configuration for training and inference happens here. Each parameter has a detailed comment in the file, so we'll only cover the most important ones for inference:

  • ckp: path to the trained checkpoint (weights/nisqa_s.tar by default)

  • sample: path to the file being evaluated

If you plan to run online inference, pay close attention to the last four arguments in this config:

  • frame sets the length of the buffer fed into the model;

  • updates makes the model emit metrics more often (see the argument description);

  • sd_device is the ID of the input device; provide it if you want to run on a different input device (e.g. a sound-card mic). The first run of run_infer_mic.py will print the available IDs;

  • sd_dump lets you save the mic input so you can check the results offline later.
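The interplay between frame and updates can be sketched in plain Python. This is an illustrative sliding-buffer model, not the repo's actual code; stream_frames and its parameter names are hypothetical:

```python
from collections import deque

def stream_frames(samples, frame_len, update_len):
    """Yield an analysis buffer of `frame_len` samples every `update_len`
    incoming samples, once the buffer has filled (illustrative sketch)."""
    buf = deque(maxlen=frame_len)
    since_update = 0
    for s in samples:
        buf.append(s)
        since_update += 1
        if len(buf) == frame_len and since_update >= update_len:
            yield list(buf)
            since_update = 0

# With frame_len=4 and update_len=2, 8 samples produce 3 overlapping buffers:
# [0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7]
frames = list(stream_frames(range(8), frame_len=4, update_len=2))
```

A smaller update step yields metrics more frequently at the cost of more model invocations per second, which is the trade-off the updates parameter controls.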

Finally, you can run a custom config for your experiments: add the --yaml argument to python -m scripts.run_infer_file or python -m scripts.run_infer_mic and provide the path to your own config:

python -m scripts.run_infer_file --yaml path/to/custom/config.yaml

Training

We provide a simple interface for training your own version of NISQA-s.

First, you will need the dataset. You can obtain it from the official NISQA repo. This is probably the only (but definitely the best) way to train this model, since the data needs to be labeled in a very specific way for training to work.

To train the same version as the one provided:

python -m scripts.run_train

Remember to check the experiment name in nisqa_s.yaml, the path to the NISQA Corpus in data_dir, and the path where the model will be saved (output_dir).
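For example, the relevant entries of nisqa_s.yaml might look like this (data_dir, output_dir, and ckp are named in this README; the experiment-name key and all values are illustrative placeholders):

```yaml
name: my_nisqa_s_run              # experiment name (check before training)
data_dir: /path/to/NISQA_Corpus   # root of the NISQA Corpus
output_dir: /path/to/checkpoints  # where trained models are saved
ckp: weights/nisqa_s.tar          # checkpoint used for inference
```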

Training and model parameters in config

  • Since you're most likely using the NISQA Corpus, there is no need to change anything in the Dataset options. If you use a hand-made dataset, you will need to refer to this guide.

  • Training options contains all parameters related to the training setup (learning rate, batch size, etc.).

  • You can also experiment with bias loss by enabling the Bias loss options.

  • Change the Mel-Specs options to experiment with different sample rates, Fourier lengths, or sample lengths for training (although lowering ms_max_length is strongly discouraged because of the NISQA Corpus labeling).

  • CNN parameters and LSTM parameters: change these to experiment with different settings of the convolutional and recurrent layers.

Note that the provided checkpoint was trained with the provided config.

Citations

@inproceedings{Mittag_Naderi_Chehadi_Möller_2021,
  title={{NISQA}: A Deep CNN-Self-Attention Model for Multidimensional Speech Quality Prediction with Crowdsourced Datasets},
  doi={10.21437/interspeech.2021-299},
  booktitle={Interspeech 2021},
  author={Mittag, Gabriel and Naderi, Babak and Chehadi, Assmaa and Möller, Sebastian},
  year={2021}
}
@misc{deepvk2024nisqa,
  author = {Beskrovnyi, Ivan},
  title = {nisqa-s},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {https://github.com/deepvk/nisqa-s}
}


nisqa-s's Issues

"Maximum size for tensor at dimension" error for some audio files

For some files, the program fails with an error like this:

  File "/Users/agershun/repo/whisper/NISQA-s/src/utils/process_utils.py", line 62, in segment_specs
    unfolded_x = transposed_x.unfold(0, seg_length, 1)
RuntimeError: maximum size for tensor at dimension 0 is 2 but size is 15

Here is a sample file:
https://drive.google.com/file/d/1n2CwhTZsp8DJdlFQfXWFGvL9rASFBrXQ/view?usp=sharing

I tried converting the .ogg file to .wav, but the problem still occurs.

I use a MacBook Pro M1.
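The trace points at Tensor.unfold in segment_specs: unfold cannot extract seg_length-frame windows (15 here) from a spectrogram that only has 2 frames, which suggests the clip is too short for the model's segment length. Below is a plain-Python sketch of that windowing plus a zero-padding workaround; segment, segment_padded, and the padding strategy are illustrative, not the repo's code:

```python
def segment(frames, seg_length):
    """Sliding windows of seg_length frames with hop 1,
    mimicking Tensor.unfold(0, seg_length, 1)."""
    if len(frames) < seg_length:
        raise RuntimeError(
            f"maximum size for tensor at dimension 0 is {len(frames)} "
            f"but size is {seg_length}"
        )
    return [frames[i:i + seg_length] for i in range(len(frames) - seg_length + 1)]

def segment_padded(frames, seg_length, pad_value=0.0):
    """Workaround sketch: zero-pad inputs shorter than one segment."""
    if len(frames) < seg_length:
        frames = list(frames) + [pad_value] * (seg_length - len(frames))
    return segment(frames, seg_length)

# A 2-frame spectrogram fails outright, but padding yields one (mostly zero) window:
windows = segment_padded([0.5, 0.5], seg_length=15)
```

Whether padding, repeating the clip, or simply rejecting too-short inputs is the right fix depends on how the model was trained, so treat this only as a diagnosis aid.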

"Padding size should be less.." error for some WAV files

Another problem occurs for some WAV files:

/Users/agershun/repo/whisper/NISQA-s/.venv/lib/python3.10/site-packages/torch/nn/modules/rnn.py:82: UserWarning: dropout option adds dropout after all but last recurrent layer, so non-zero dropout expects num_layers greater than 1, but got dropout=1 and num_layers=1
  warnings.warn("dropout option adds dropout after all but last "
NOI    COL   DISC  LOUD  MOS
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/Users/agershun/repo/whisper/NISQA-s/scripts/run_infer_file.py", line 33, in <module>
    out, h0, c0 = process(audio, sr, model, h0, c0, args)
  File "/Users/agershun/repo/whisper/NISQA-s/src/utils/process_utils.py", line 79, in process
    audio = get_ta_melspec(
  File "/Users/agershun/repo/whisper/NISQA-s/src/utils/process_utils.py", line 39, in get_ta_melspec
    S = melSpec(y)
  File "/Users/agershun/repo/whisper/NISQA-s/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/Users/agershun/repo/whisper/NISQA-s/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/agershun/repo/whisper/NISQA-s/.venv/lib/python3.10/site-packages/torchaudio/transforms/_transforms.py", line 619, in forward
    specgram = self.spectrogram(waveform)
  File "/Users/agershun/repo/whisper/NISQA-s/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/Users/agershun/repo/whisper/NISQA-s/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/agershun/repo/whisper/NISQA-s/.venv/lib/python3.10/site-packages/torchaudio/transforms/_transforms.py", line 110, in forward
    return F.spectrogram(
  File "/Users/agershun/repo/whisper/NISQA-s/.venv/lib/python3.10/site-packages/torchaudio/functional/functional.py", line 126, in spectrogram
    spec_f = torch.stft(
  File "/Users/agershun/repo/whisper/NISQA-s/.venv/lib/python3.10/site-packages/torch/functional.py", line 648, in stft
    input = F.pad(input.view(extended_shape), [pad, pad], pad_mode)
RuntimeError: Argument #4: Padding size should be less than the corresponding input dimension, but got: padding (480, 480) at dimension 2 of input [1, 88200, 2]

This is a sample of the file with the problem:
https://drive.google.com/file/d/1CHZ4TZaILu-K5rsA8XkPAEDbt_2VGXVd/view?usp=sharing
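The failing input shape [1, 88200, 2] is a hint: the trailing dimension of size 2 looks like a stereo channel axis, so torch.stft ends up trying to pad a 2-sample dimension instead of the 88200-sample time axis. One plausible workaround is to downmix the file to mono before inference; to_mono below is an illustrative stdlib sketch, not the repo's API:

```python
def to_mono(frames):
    """Downmix per-frame multichannel audio ([[L, R], ...]) to mono
    by averaging channels (illustrative workaround sketch)."""
    return [sum(frame) / len(frame) for frame in frames]

stereo = [[0.25, 0.75], [1.0, -1.0], [0.5, 0.5]]
mono = to_mono(stereo)  # [0.5, 0.0, 0.5]
```

Equivalently, converting the WAV to mono beforehand (e.g. with ffmpeg's -ac 1 option) should sidestep the padding error, assuming stereo input really is the cause.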
