
voicefilter's People

Contributors

seungwonpark, stegben

voicefilter's Issues

Question when preprocessing wav files

Hi all, I ran into a problem when I tried to preprocess wav files.
When I run this:

python generator.py -c [config yaml] -d [data directory] -o [output directory] -p [processes to run]

the command line displays error messages like these:

Traceback (most recent call last):
  File "generator.py", line 98, in <module>
    os.makedirs(args.out_dir, exist_ok=False)
TypeError: makedirs() got an unexpected keyword argument 'exist_ok'

Traceback (most recent call last):
  File "generator.py", line 128, in <module>
    for spk in train_folders]
TypeError: glob() got an unexpected keyword argument 'recursive'
Traceback (most recent call last):
  File "generator.py", line 150, in <module>
    with Pool(cpu_num) as p:
AttributeError: __exit__

When I looked these errors up, I found that they are all due to Python version incompatibilities. My Python version is 2.7. Is generator.py only suitable for a Python 3.5 environment? Is there any way to run this code in Python 2.7?
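
For reference, each failing call has a Python 2.7 workaround in the standard library; a rough sketch (the helper names are hypothetical, not repo code):

import errno
import os
from contextlib import closing
from multiprocessing import Pool

def makedirs_py2(path):
    # os.makedirs() only gained the exist_ok= argument in Python 3.2
    try:
        os.makedirs(path)
    except OSError as e:
        if e.errno != errno.EEXIST:
            raise

def recursive_wavs(root):
    # glob.glob(..., recursive=True) needs Python >= 3.5;
    # os.walk() gives the same recursion on 2.7
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if name.endswith('.wav'):
                yield os.path.join(dirpath, name)

def run_pool(func, jobs, cpu_num):
    # Pool became a context manager only in Python 3.3;
    # contextlib.closing() calls p.close() on exit instead
    with closing(Pool(cpu_num)) as p:
        return p.map(func, jobs)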

Looking forward to your reply!

Training setting problem

Hi,

Thank you for publishing your code!
I am encountering a training problem. As an initial phase, I tried to train on only 1,000 samples from the LibriSpeech train-clean-100 dataset. I am using the default configuration as published in your VoiceFilter repo; the only difference is a batch size of 6, due to memory limitations. Is it possible that the problem is related to the small batch size?

Another question relates to the generation of the training and testing sets. I noticed that there is an option to use a VAD when generating the training set, but by default it is not used. What is the best practice: to use the VAD or not?

I appreciate your help!

Cannot reproduce reported SDR & retrain the speaker embedding

Hello, I have two questions about the implementation.

  1. I cannot reproduce the results reported in the README.
    I trained for more than 400k steps on the LibriSpeech 360h + 100h clean datasets, using the embedder provided in this repo.
    However, I can only reach a maximum SDR of 5.5.

To obtain data from the LibriSpeech 360h + 100h sets, I generated the mixed audios for 360h and 100h separately, then merged them into another folder. Is this the right way when I want to use more data to train the VoiceFilter module?

  2. I got worse results when retraining the speaker embedding.
    I retrained the embedder using the following repo: Speaker verification on 3 datasets: Librispeech, VoxCeleb1, VoxCeleb2.

Theoretically, I expected the VoiceFilter module to benefit from an embedder trained on more data, but the results got even worse. Can you share how you trained this embedder?

Thank you in advance!

Model implementation comprehension

Hello, I'm a master's student at ITMO University in Saint Petersburg, Russia.

Could you please explain what exactly this model implementation does?
As I understand it (variant 1), it takes as input the mixed sound of the voices of person A and person B, plus the clean voice of A saying the same thing as in the mix, and tries to extract it from the mixture. (That would be really strange, because it would be useless.)
In the paper (variant 2), it is said that the model should take the mixture and a clean recording of the target person that is NOT the same utterance as in the mix. And this is the point.

When I looked at the train/test data made by the generator, I found that next to every ******-mixed.wav there is a ******-target.wav with another voice (but not another phrase of the target person, as I thought there should be)!

Am I right? Or what's going on here?

Waiting for your answer,
thank you!

Try partial convolution padding scheme

Train loss of the initial implementation with nn.Conv2d converged at 6e-3.

Now I'm trying the partial convolution padding scheme as a replacement for naive zero-padding. Work in progress on the pconv branch.
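
For readers unfamiliar with the scheme, here is a rough PyTorch sketch of partial-convolution-based padding (Liu et al., 2018); an illustration only, not the code on the pconv branch:

import torch
import torch.nn as nn
import torch.nn.functional as F

class PartialConvPad2d(nn.Conv2d):
    # Rescale outputs near the border by (full window area / valid window
    # area), so zero-padded positions do not dilute the activations.
    def forward(self, x):
        out = super(PartialConvPad2d, self).forward(x)
        with torch.no_grad():
            ones = torch.ones(1, 1, x.size(2), x.size(3), device=x.device)
            window = torch.ones(1, 1, *self.kernel_size, device=x.device)
            valid = F.conv2d(ones, window, stride=self.stride,
                             padding=self.padding, dilation=self.dilation)
            ratio = window.numel() / valid  # == 1 in the interior, > 1 at edges
        bias = 0 if self.bias is None else self.bias.view(1, -1, 1, 1)
        return (out - bias) * ratio + bias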

Question about normalize-resample.sh

Thank you for your great work! I have a question from trying to run the project.
I set 'N' to my CPU core count, then ran 'chmod a+x normalize-resample.sh'.
However, after I ran './normalize-resample.sh', there was no output on the command line. Is this normal?
Furthermore, what is the function of this script?

Next, copy utils/normalize-resample.sh to the root directory of the unzipped data folder. Then:

vim normalize-resample.sh # set "N" as your CPU core number.
chmod a+x normalize-resample.sh
./normalize-resample.sh # this may take long
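
For what it's worth, judging by its name the script loudness-normalizes and resamples each file; a hypothetical per-file equivalent in Python (assuming the ffmpeg-normalize CLI, with a placeholder file name):

import subprocess

# loudness-normalize and resample to the 16 kHz the config expects;
# the output name matches the '*-norm.wav' pattern in config.yaml
subprocess.check_call([
    'ffmpeg-normalize', 'input.flac',
    '-ar', '16000',
    '-o', 'input-norm.wav',
])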

Looking forward to your reply!

Inference

Why, at inference, is the target recording the same utterance as in the mixed recording?
As I see it, the whole point of VoiceFilter is that the reference recording is not the same utterance as in the mix; it merely has the voice of the same person.

Question about utils/evaluation.py

Hello @seungwonpark, thank you greatly for your work!
I noticed that utils/evaluation.py has a "break" in the loop over the test dataloader.
That is, in the evaluation process, only the first case generated by the test dataloader is taken into account when computing test loss and test SDR. Could this cause problems like #5 and #9?
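
For context, the SDR metric can be computed with mir_eval; a minimal sketch for one test case (array names hypothetical). Averaging over only the first case, as the break causes, makes the reported SDR high-variance:

import numpy as np
from mir_eval.separation import bss_eval_sources

# target_wav / est_wav: 1-D numpy waveforms of equal length (hypothetical)
sdr, sir, sar, _ = bss_eval_sources(target_wav[np.newaxis, :],
                                    est_wav[np.newaxis, :])
print('SDR = %.2f dB' % sdr[0])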

Looking forward to your response.

Final model

Is it possible to provide the final checkpoint? The training takes too much time.

Real-time inference

Hi, I'd like to use this voice filtering in real time. Would it be possible to modify the inference code to run the model in real time on PCM audio data?

question about ffmpeg-normalize

Hi~ I ran into a problem when running ./normalize-resample.sh: it seems the wav file in /tmp did not exist. I tried to fix it but failed; does anyone know where the problem is? I also ran the command "ffmpeg-normalize 1.wav -o 1-norm.wav" to test the normalization tool and got the same error. How can I make ffmpeg-normalize work?

[screenshot of the error]

VoiceFilter realization problem

Seungwon, hello.

My name is Vladimir. I am a researcher at Speech Technology Center in St. Petersburg, Russia. Your implementation of the VoiceFilter algorithm (https://github.com/mindslab-ai/voicefilter) is very interesting to me and my colleagues. Unfortunately, we could not reproduce SDR dynamics like yours using your code with the standard settings in the default.yaml file. SDR converged to 4.5 dB after 200k iterations (see figure below), not to 10 dB after 65k as in your results. Could you tell us your training settings, as well as the neural network architecture that you used to get your result?

[figure: VoiceFilter training dynamics]

Our Python environment:

  1. tqdm (ver. 4.32.1);
  2. numpy (ver. 1.16.3);
  3. torch (ver. 1.1.0);
  4. pyyaml (ver. 5.1);
  5. librosa (ver. 0.6.3);
  6. mir_eval (ver. 0.5);
  7. matplotlib (ver. 3.1.0);
  8. tensorboardX (ver. 1.7);
  9. ffmpeg (ver. 4.1.3);
  10. ffmpeg_normalize (1.14.0);
  11. python (ver. 3.6).

We use four Nvidia GeForce GTX 1080 Ti GPUs to train one VoiceFilter model. The train-clean-100, train-clean-360, and train-other-500 subsets of the LibriSpeech dataset are used for training, and dev-clean is used for testing. We use the pretrained d-vector model to encode the target speaker.

We used your default configuration file:

audio:
  n_fft: 1200
  num_freq: 601
  sample_rate: 16000
  hop_length: 160
  win_length: 400
  min_level_db: -100.0
  ref_level_db: 20.0
  preemphasis: 0.97
  power: 0.30

model:
  lstm_dim: 400
  fc1_dim: 600
  fc2_dim: 601

data:
  train_dir: 'path/to/train/data'
  test_dir: 'path/to/test/data'
  audio_len: 3.0

form:
  input: '*-norm.wav'
  dvec: '*-dvec.txt' 
  target:
    wav: '*-target.wav'
    mag: '*-target.pt'
  mixed:
    wav: '*-mixed.wav'
    mag: '*-mixed.pt'

train:
  batch_size: 8
  num_workers: 16
  optimizer: 'adam'
  adam: 0.001
  adabound:
    initial: 0.001
    final: 0.05
  summary_interval: 1
  checkpoint_interval: 1000

log:
  chkpt_dir: 'chkpt'
  log_dir: 'logs'

embedder:
  num_mels: 40
  n_fft: 512
  emb_dim: 256
  lstm_hidden: 768
  lstm_layers: 3
  window: 80
  stride: 40
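
As a sanity check, the audio section above maps directly onto an STFT call; a sketch using librosa (which is in the environment list), with a hypothetical file name:

import librosa

y, sr = librosa.load('example-mixed.wav', sr=16000)  # audio.sample_rate
spec = librosa.stft(y, n_fft=1200, hop_length=160, win_length=400)
print(spec.shape[0])  # 601 frequency bins == num_freq == n_fft // 2 + 1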

The neural network architecture was standard and followed your implementation.

embedder.pt with new dataset

Hi, if I wanted to use another dataset of audio files for training and testing (not the one used here), how can I generate the embedder.pt that I have to pass when running trainer.py, or which one should I use? Thank you.

Question about the starting point of SDR

Dear @seungwonpark

First of all, I would like to thank you for this great open-source release.
I wanted to test your code, so I tried to train VoiceFilter.

But I have a problem with SDR. In the SDR graph on the voicefilter GitHub page,
the SDR goes from 2 to 10 dB, but in my case it goes from -0.8 to 1.2.

[figure: SDR graph]

I am trying to find the cause of the problem but cannot find it.

Can you help me find it?

I used the default yaml and generator.py (train-clean-100, train-clean-360, and dev-clean are used for training).

Could you let me know what I can check?

Thank you!

Question when training VoiceFilter

Hi, it's me again :)
Because of insufficient computer storage, I skipped the following step:


Preprocess wav files

In order to boost training speed, perform the STFT for each file before training by:

python generator.py -c [config yaml] -d [data directory] -o [output directory] -p [processes to run]

This will create 100,000 (train) + 1,000 (test) examples (about 160 GB).

Then I downloaded embedder.pt, train-clean-100.tar.gz, and dev-clean.tar.gz. I unzipped the tar.gz files and put the unzipped folders in the root directory of voicefilter. I also specified train_dir and test_dir in config.yaml, such as:

  train_dir: '/home/../voicefilter/train/train-clean-100'
  test_dir: '/home/../voicefilter/dev/dev-clean'

After that, when I enter this instruction:

python trainer.py -c [config yaml] -e [path of embedder pt file] -m [name]

An error pops up on the screen: AssertionError: no training file found

I want to know at which step I made a mistake, or what configuration is missing. Thanks! REALLY looking forward to your reply!
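
For reference, the assertion most likely fires because the trainer globs for generator.py's outputs (the patterns under form: in the config), which do not exist in raw LibriSpeech folders; a hypothetical reconstruction of the check:

import glob
import os

train_dir = '/home/../voicefilter/train/train-clean-100'
# the trainer looks for generated files like '*-mixed.pt' / '*-target.pt',
# which only exist after running generator.py
mixed = glob.glob(os.path.join(train_dir, '*-mixed.pt'))
assert len(mixed) > 0, 'no training file found'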

Out of memory when inferencing a single file

I tried the trained model on a single input and it gave an OOM error on GCP with one Nvidia P100:
RuntimeError: CUDA out of memory. Tried to allocate 4.66 GiB (GPU 0; 15.90 GiB total capacity; 14.37 GiB already allocated; 889.81 MiB free; 19.21 MiB cached)
The mixed wav file (19 MB) was about 5 minutes long, and the reference file was 11 seconds.
I don't know why it shows 14.37 GiB allocated when not even training. I tried restarting the instance, but it did not help.
Can you please suggest a way to reduce the memory required during inference?
Thank you!
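
One common workaround (a sketch under assumptions; the function, parameters, and the model's call signature below are hypothetical, not repo options) is to run the model over fixed-length spectrogram chunks so memory no longer grows with file length:

import torch

def infer_chunked(model, dvec, mag, chunk_frames=301):
    # mag: (1, num_freq, T) magnitude spectrogram; dvec: speaker embedding
    outputs = []
    with torch.no_grad():
        for start in range(0, mag.size(-1), chunk_frames):
            chunk = mag[..., start:start + chunk_frames]
            mask = model(chunk, dvec)           # assumed call signature
            outputs.append((chunk * mask).cpu())
    return torch.cat(outputs, dim=-1)

Chunk boundaries can introduce audible seams; overlapping windows with cross-fading would mitigate that.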
