The introduction, technical details, and results are presented in the wandb report.
To get started, install the requirements:

```shell
pip install -r ./requirements.txt
```
This project implements the SpEX+ architecture with an additional speaker classification head.
To train the model from scratch, run:

```shell
python3 train.py -c final_model/config.json
```
To fine-tune a pretrained model from a checkpoint, use the `--resume` (`-r`) parameter. For example:

```shell
python3 train.py -c final_model/finetune.json -r saved/models/pretrain_final/<run_id>/model_best.pth
```
This command generates a new mixed dataset. Generation can be disabled by setting `"reuse": true` for the train dataset in the `final_model/finetune.json` config.
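The flag might sit in the config as sketched below; the surrounding keys are an assumption made for illustration, and only the `"reuse": true` flag itself comes from this repository:

```json
{
  "data": {
    "train": {
      "reuse": true
    }
  }
}
```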
Before running the model, the pretrained checkpoint is downloaded with the following Python code:

```python
import gdown

gdown.download("https://drive.google.com/uc?id=19i4NIk8R8AlkGvMfhQl8ex-eCg4g2Isv", "default_test_model/checkpoint.pth")
```
Model evaluation is executed by the command:

```shell
python test.py \
    -c default_test_model/config.json \
    -r default_test_model/checkpoint.pth \
    -t test_data \
    -o test_result.csv \
    -g <output_dir> \
    -s <interval_len>
```
Here `-o` specifies the output `.csv` file, which reports the following metrics:

- PESQ (Perceptual Evaluation of Speech Quality)
- SI-SDR (Scale-Invariant Signal-to-Distortion Ratio)
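For reference, the SI-SDR metric can be computed as in the following minimal numpy sketch. This is an illustration of the metric's definition, not the evaluation code used by `test.py`:

```python
import numpy as np

def si_sdr(estimate, target, eps=1e-8):
    """Scale-Invariant Signal-to-Distortion Ratio in dB."""
    # Zero-mean both signals so the metric ignores any DC offset.
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target to get the scaled reference.
    alpha = np.dot(estimate, target) / (np.dot(target, target) + eps)
    s_target = alpha * target
    e_noise = estimate - s_target
    # Ratio of target energy to residual-noise energy, in decibels.
    return 10 * np.log10((np.dot(s_target, s_target) + eps)
                         / (np.dot(e_noise, e_noise) + eps))
```

Because of the projection step, rescaling the estimate leaves the score unchanged, which is the "scale-invariant" part of the name.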
The remaining command-line arguments are described in the sections below.
Important remark: the experiments below require the `test-clean` part of the LibriSpeech dataset. It is downloaded automatically after at least one execution of `python3 test.py` with default arguments.
To evaluate the model on custom data, run `test.py` with a path to the dataset:

```shell
python3 test.py -t path/to/custom/dir
```

This command runs the model on a custom dataset folder, which must contain `mix`, `target`, and `ref` subdirectories with filenames `*-mixed.wav`, `*-target.wav`, and `*-ref.wav` respectively. Such a directory is created in `data/datasets/mixed/data/custom` from the mixed `test-clean` dataset after running:

```shell
bash custom_set.sh
```
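A custom dataset folder can be sanity-checked before evaluation with a sketch like the one below (`check_custom_dataset` is a hypothetical helper, not part of this repository; it only encodes the layout described above):

```python
from pathlib import Path

def check_custom_dataset(root):
    """Verify a custom dataset folder matches the expected layout:
    mix/*-mixed.wav, target/*-target.wav, ref/*-ref.wav with shared IDs."""
    root = Path(root)
    ids = {}
    for sub, suffix in [("mix", "-mixed.wav"),
                        ("target", "-target.wav"),
                        ("ref", "-ref.wav")]:
        files = sorted((root / sub).glob(f"*{suffix}"))
        # Strip the suffix to recover the utterance ID shared across subdirs.
        ids[sub] = {f.name[: -len(suffix)] for f in files}
    if not (ids["mix"] == ids["target"] == ids["ref"]):
        raise ValueError(f"Mismatched utterance IDs across subdirectories: {ids}")
    return sorted(ids["mix"])
```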
Extracted audio for the test set can be gathered into one directory by executing:

```shell
python3 test.py -g path/to/output/dir
```
These results can be compared with speech recognition run directly on the mixed audio. The speech recognition pipeline was taken from the asr repository. The comparison of mixed and extracted audio quality is carried out by:

```shell
bash asr_score.sh
```
For training stability, audio was split into 3-second intervals. Test audio, however, has arbitrary length; it can be divided into intervals at the inference stage with:

```shell
python3 test.py -s <interval_len_in_seconds>
```
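The splitting idea can be sketched in numpy as follows; this is an illustration of chunking with zero-padding, not the exact logic inside `test.py`:

```python
import numpy as np

def split_into_intervals(waveform, sample_rate, interval_len):
    """Split a 1-D waveform into fixed-length chunks, zero-padding the last one."""
    chunk = int(interval_len * sample_rate)
    n_chunks = -(-len(waveform) // chunk)  # ceiling division
    # Pad the waveform up to a whole number of chunks, then reshape.
    padded = np.zeros(n_chunks * chunk, dtype=waveform.dtype)
    padded[: len(waveform)] = waveform
    return padded.reshape(n_chunks, chunk)
```

After inference, the per-chunk outputs would be concatenated and trimmed back to the original length.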
The WHAM! dataset provides diverse background noise, which can also be mixed with the input audio. To download it:

```shell
wget https://my-bucket-a8b4b49c25c811ee9a7e8bba05fa24c7.s3.amazonaws.com/wham_noise.zip
unzip wham_noise.zip
```
To create the noised dataset and evaluate the model:

```shell
python3 test.py -c wham_test/config.json
```
The described pipeline only involves the `tt` (test) part of WHAM!, therefore the other directories are not required.
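Mixing noise into speech at a chosen signal-to-noise ratio can be sketched as below; `mix_at_snr` is an illustrative helper, not the repository's actual mixing code:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db, eps=1e-8):
    """Scale the noise so that the speech/noise power ratio equals snr_db, then mix."""
    noise = noise[: len(speech)]  # assumes the noise clip is at least as long
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + eps
    # Solve for the gain that yields the requested SNR in dB.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise
```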
This repository is based on the asr-template repository.