
vgg-speaker-recognition's Introduction

README

This repo contains the Keras implementation of our ICASSP 2019 speaker identification work (http://www.robots.ox.ac.uk/~vgg/research/speakerID/):

Utterance-level Aggregation For Speaker Recognition In The Wild (Xie et al., ICASSP 2019) (Oral).

New challenge on speaker recognition: the VoxCeleb Speaker Recognition Challenge (VoxSRC).

Dependencies

There seems to be a bug in this version of librosa that makes loading wav files cripplingly slow (about 1 second per short file). You can work around it by replacing read_wav with read_wav_fast in utils.py, but be aware that the resulting sample rate is then not constant.
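For reference, a minimal sketch of what the two loaders might look like (the bodies below are assumptions; only the names read_wav and read_wav_fast come from utils.py):

import librosa
import scipy.io.wavfile as wavfile

def read_wav(path, sr=16000):
    # librosa decodes and resamples to a fixed rate; correct but slow
    # in the affected librosa versions.
    wav, _ = librosa.load(path, sr=sr)
    return wav

def read_wav_fast(path):
    # scipy reads the raw samples without resampling: fast, but the
    # sample rate is whatever the file was recorded at, hence the
    # "sample rate is not constant" caveat above.
    _, wav = wavfile.read(path)
    return wav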

Data

The datasets used for the experiments are VoxCeleb1 and VoxCeleb2.

Training the model

To train the model on the VoxCeleb2 dataset, you can run

  • python main.py --net resnet34s --batch_size 160 --gpu 2,3 --lr 0.001 --warmup_ratio 0.1 --optimizer adam --epochs 128 --multiprocess 8 --loss softmax --data_path ../path_to_voxceleb2

Model

  • Models:

    google drive: https://drive.google.com/open?id=1M_SXoW1ceKm3LghItY2ENKKUn3cWYfZm
    
    dropbox: https://www.dropbox.com/sh/n96ekf7ilsvkjdp/AACXKDesS2ju5rp6Cyxh2PCva?dl=0
    
  • Download the models and put them in the folder model/.

Testing the model

To test a specific model on the VoxCeleb1 dataset, for example the ResNet34s model trained with Adam and softmax loss, with feature dimension 512:

  • python predict.py --gpu 1 --net resnet34s --ghost_cluster 2 --vlad_cluster 8 --loss softmax --resume ../model/gvlad_softmax/resnet34_vlad8_ghost2_bdim512_deploy/weights.h5

  • Results (EER, %):

      VoxCeleb1-Test: 3.22        VoxCeleb1-Test-Cleaned: 3.24
      VoxCeleb1-Test-E: 3.24      VoxCeleb1-Test-E-Cleaned: 3.13
      VoxCeleb1-Test-H: 5.17      VoxCeleb1-Test-H-Cleaned: 5.06
    

Fine Tuning the model

The weights provided do not include the weights of the final prediction layer, so this layer must be randomly initialised; load the checkpoint with network.load_weights(os.path.join(args.resume), by_name=True, skip_mismatch=True) in main.py.
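A minimal sketch of that loading step in context (assuming network is the Keras model built in main.py and resume_path is the downloaded checkpoint path):

import os

def maybe_resume(network, resume_path):
    # by_name matches layers between the checkpoint and the model;
    # skip_mismatch leaves the (absent) final prediction layer
    # randomly initialised instead of raising an error.
    if resume_path:
        network.load_weights(os.path.join(resume_path),
                             by_name=True, skip_mismatch=True)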

  • python main.py --net resnet34s --gpu 0 --ghost_cluster 2 --vlad_cluster 8 --batch_size 16 --lr 0.001 --warmup_ratio 0.1 --optimizer adam --epochs 128 --multiprocess 8 --loss softmax --resume=../model/gvlad_softmax/resnet34_vlad8_ghost2_bdim512_deploy/weights.h5

Note that --data_path /path_to_your_dataset/dataset/ can be used to point to your own dataset, but you will need to write a small function in toolkits.py that returns the contents of the corresponding datalist file, as sketched below.
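A hypothetical hook of that kind (the function name and datalist filename are assumptions; the space-separated "path label" line format mirrors the meta/voxlb2_*.txt split files used by this repo):

import os
import numpy as np

def get_my_datalist(args):
    # Read "relative/path.wav label" pairs, one per line.
    paths, labels = [], []
    with open(os.path.join(args.data_path, 'datalist.txt')) as f:
        for line in f:
            path, label = line.strip().split(' ')
            paths.append(os.path.join(args.data_path, path))
            labels.append(int(label))
    return np.array(paths), np.array(labels)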

Licence

The code and models are available to download for commercial/research purposes under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/).

  Downloading this code implies agreement to follow the same conditions for any modification 
  and/or re-distribution of the dataset in any form.

  Additionally any entity using this code agrees to the following conditions:

  THIS CODE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
  IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
  TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
  PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
  HOLDER BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
  EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
  PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
  PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
  LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
  NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
  SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

  Please cite the papers below if you make use of the dataset and code.

Citation

@InProceedings{Xie19,
  author       = "W. Xie and A. Nagrani and J. S. Chung and A. Zisserman",
  title        = "Utterance-level Aggregation For Speaker Recognition In The Wild.",
  booktitle    = "ICASSP, 2019",
  year         = "2019",
}

@Article{Nagrani19,
  author       = "A. Nagrani and J. S. Chung and W. Xie and A. Zisserman",
  title        = "VoxCeleb: Large-scale Speaker Verification in the Wild.",
  journal      = "Computer Speech & Language",
  year         = "2019",
}

vgg-speaker-recognition's People

Contributors

a-nagrani, bml1g12, seungwonpark, w4-jonghoon, weidixie


vgg-speaker-recognition's Issues

trained slowly

In tool/toolkits.py, I modified

os.environ["CUDA_VISIBLE_DEVICES"] = args.gpu

to

os.environ['TF_CPP_MIN_LOG_LEVEL'] = args.gpu

Epoch 1/128
Learning rate for epoch 1 is 0.001.
1/7492 [..............................] - ETA: 294:55:57 - loss: 9.3743 - acc: 0.0000e+00

but training is very slow and the GPU is not being used.

The initialization method

Hi,

I saw that you used 'orthogonal' initialization for the whole network. Is there a special reason behind that?

how to implement a transfer learning solution for classification problems?

Hi, I want to train some layers and leave the others frozen (I don't need to train the entire model), but I don't know how to adjust this:
"python main.py --net resnet34s --batch_size 160 --gpu 2,3 --lr 0.001 --warmup_ratio 0.1 --optimizer adam --epochs 128 --multiprocess 8 --loss softmax --data_path ../path_to_voxceleb2" ??

using tripletLoss function

Hi Weidixie:
Thanks for sharing. I have a question: have you ever tried combining gVLAD with the triplet loss or Google's GE2E loss to train the model, and if so, how did it perform?
I look forward to your reply, and thank you very much again!

Training accuracy oscillates between 51-52% early while loss decreases slowly.

I have 4.5k speakers and 88k utterances in total (150k text pairs), each between 4 and 10 seconds, in mixed Hindi-English (98% Hindi).

I tried to run VGG-Speaker-Recognition on my dataset, and it gives me the following result:

Epoch 1/50
Learning rate for epoch 1 is 0.0001.
1130/1130 [==============================] - 11718s 10s/step - loss: 1.6074 - acc: 0.5193
Epoch 2/50
Learning rate for epoch 2 is 0.0001.
1130/1130 [==============================] - 11616s 10s/step - loss: 1.2483 - acc: 0.5243
Epoch 3/50
Learning rate for epoch 3 is 0.0001.
1130/1130 [==============================] - 11591s 10s/step - loss: 1.2341 - acc: 0.5260
Epoch 4/50
Learning rate for epoch 4 is 0.0001.
1130/1130 [==============================] - 11512s 10s/step - loss: 1.2289 - acc: 0.5239
Epoch 5/50
Learning rate for epoch 5 is 0.0001.
1130/1130 [==============================] - 11470s 10s/step - loss: 1.2255 - acc: 0.5281
Epoch 6/50
Learning rate for epoch 6 is 0.0001.
1130/1130 [==============================] - 11548s 10s/step - loss: 1.2246 - acc: 0.5264
Epoch 7/50
Learning rate for epoch 7 is 0.0001.
1130/1130 [==============================] - 11550s 10s/step - loss: 1.2228 - acc: 0.5278
Epoch 8/50
Learning rate for epoch 8 is 0.0001.
1130/1130 [==============================] - 11602s 10s/step - loss: 1.2223 - acc: 0.5273
Epoch 9/50
Learning rate for epoch 9 is 0.0001.
1130/1130 [==============================] - 11620s 10s/step - loss: 1.2211 - acc: 0.5292
Epoch 10/50
Learning rate for epoch 10 is 0.0001.
1130/1130 [==============================] - 11581s 10s/step - loss: 1.2206 - acc: 0.5284
Epoch 11/50
Learning rate for epoch 11 is 0.0001.
1130/1130 [==============================] - 11544s 10s/step - loss: 1.2203 - acc: 0.5272
Epoch 12/50
Learning rate for epoch 12 is 0.0001.
1130/1130 [==============================] - 11467s 10s/step - loss: 1.2196 - acc: 0.5286
Epoch 13/50
Learning rate for epoch 13 is 0.0001.
1130/1130 [==============================] - 11399s 10s/step - loss: 1.2191 - acc: 0.5294

As shown above, training accuracy oscillates between 51-52% and the model seems stuck there.
Earlier I had 500 speakers (a subset of the 4.5k speakers) and got the same result then as well.

What could be the reason for this result? Please help @WeidiXie.

Training is quite slow.

I am also facing this issue: the model trains very slowly. I run other code and projects on the same GPU and they run fine, with the GPU being used, but VGG-Speaker runs slowly.
I tried it on two NVIDIA GTX-1060s installed in my computer, and on a P100 on Google Cloud as well.

I tried everything to resolve this issue, without success.

Epoch 1/10
Learning rate for epoch 1 is 0.0001.
17/305810 [..............................] - ETA: 3354:00:12 - loss: 0.8716 - acc: 0.9531

Please help.
Thanks.

What accuracy did you get when training the model?

I use the default training code to train the model. During training, the accuracy reaches 90.09%.
I also use the default testing code to test the model; the EER is 0.0357370095445.
What were your results during training?

problem with testing my own trained model

Hi,
I found your code really complete and tried to use it, but:
when I use the model referenced in the README to predict, it works; however, when I train on my own data with "python main.py ...." and then try to use that model for prediction, I get an error about mismatched shapes.
Would you please help me with that?

Got core dumped when training with GPU option

Hi Weidi,
Everything is fine if I train a new model on the CPU. But when I choose the GPU for training, the training script crashes at the first iteration with the message "segmentation fault (core dumped)".
The following is the log that I got:
Epoch 1/128
Learning rate for epoch 1 is 0.0001.
Segmentation fault (core dumped)
Do you have any comment or suggestion for this?

audio feature was not tensor

src/predict.py

In specs = ut.load_data(...),
specs is the feature obtained via STFT, mean subtraction, and division by the time-wise variance; it is a NumPy array, not a tensor.
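For reference, a sketch of the preprocessing the issue describes (the STFT parameter values are illustrative, not necessarily the repo's):

import librosa
import numpy as np

def load_data_sketch(wav, n_fft=512, win_length=400, hop_length=160):
    # Linear spectrogram via STFT, then mean subtraction and division
    # by the time-wise statistics; the result is a NumPy array, not a
    # tensor, exactly as the issue observes.
    spec = np.abs(librosa.stft(wav, n_fft=n_fft,
                               win_length=win_length,
                               hop_length=hop_length))
    mu = spec.mean(axis=1, keepdims=True)
    sigma = spec.std(axis=1, keepdims=True)
    return (spec - mu) / (sigma + 1e-5)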

TypeError: 'int' object is not callable

Hi, Weidi.
Thank you for your prompt reply!!!
And sorry to interrupt again.
As you said yesterday, I can now successfully load the pre-trained model.
But I encountered a new error, as the title shows.
This time I used the following command line:

python main.py --net resnet34s --batch_size 3 --gpu 0 --lr 0.001 --ghost_cluster 2 --vlad_cluster 8 --warmup_ratio 0.1 --optimizer adam --epochs 20 --multiprocess 1 --loss softmax --data_path ''

I can tell from the terminal output that the model loaded successfully.
But after it printed Learning rate for epoch 1 is 0.0001., a new error occurred.

Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/local/anaconda3/envs/py3.6/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/local/anaconda3/envs/py3.6/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/anaconda3/envs/py3.6/lib/python3.6/site-packages/keras/utils/data_utils.py", line 559, in _run
    sequence = list(range(len(self.sequence)))
TypeError: 'int' object is not callable

I think it's a multi-process or multi-threading problem.
So I tried setting --multiprocess 0, and commenting out the code that involves multiple processes.
But nothing changed.
I found an issue where someone set --workers 0, so I changed the fit_generator function's workers=0 in main.py.
A "new" error occurred:

Traceback (most recent call last):
  File "main.py", line 223, in <module>
    main()
  File "main.py", line 162, in main
    verbose=0)
  File "/usr/local/anaconda3/envs/py3.6/lib/python3.6/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/anaconda3/envs/py3.6/lib/python3.6/site-packages/keras/engine/training.py", line 1418, in fit_generator
    initial_epoch=initial_epoch)
  File "/usr/local/anaconda3/envs/py3.6/lib/python3.6/site-packages/keras/engine/training_generator.py", line 181, in fit_generator
    generator_output = next(output_generator)
  File "/usr/local/anaconda3/envs/py3.6/lib/python3.6/site-packages/keras/engine/training_utils.py", line 590, in iter_sequence_infinite
    for item in seq:
  File "/usr/local/anaconda3/envs/py3.6/lib/python3.6/site-packages/keras/utils/data_utils.py", line 372, in __iter__
    for item in (self[i] for i in range(len(self))):
TypeError: 'int' object is not callable

The same TypeError: 'int' object is not callable occurred. Someone said this can happen when a custom variable name shadows a built-in function or class name, but
I can't solve the problem, so I would like to ask if you have come across it, or have any ideas for solving it.

Sorry again for the interruption, and thanks in advance !!!

AttributeError: module 'dask.dataframe' has no attribute 'Series'?

Traceback (most recent call last):
  File "src/main.py", line 201, in <module>
    main()
  File "src/main.py", line 99, in main
    update_freq=args.batch_size * 16)
  File "/data/anaconda3/lib/python3.6/site-packages/keras/callbacks.py", line 745, in __init__
    from tensorflow.contrib.tensorboard.plugins import projector
  File "/data/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/__init__.py", line 37, in <module>
    from tensorflow.contrib import distributions
  File "/data/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/distributions/__init__.py", line 39, in <module>
    from tensorflow.contrib.distributions.python.ops.estimator import *
  File "/data/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/distributions/python/ops/estimator.py", line 21, in <module>
    from tensorflow.contrib.learn.python.learn.estimators.head import _compute_weighted_loss
  File "/data/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/learn/__init__.py", line 95, in <module>
    from tensorflow.contrib.learn.python.learn import *
  File "/data/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/learn/python/__init__.py", line 28, in <module>
    from tensorflow.contrib.learn.python.learn import *
  File "/data/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/__init__.py", line 30, in <module>
    from tensorflow.contrib.learn.python.learn import estimators
  File "/data/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/estimators/__init__.py", line 302, in <module>
    from tensorflow.contrib.learn.python.learn.estimators.dnn import DNNClassifier
  File "/data/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/estimators/dnn.py", line 35, in <module>
    from tensorflow.contrib.learn.python.learn.estimators import dnn_linear_combined
  File "/data/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/estimators/dnn_linear_combined.py", line 36, in <module>
    from tensorflow.contrib.learn.python.learn.estimators import estimator
  File "/data/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 52, in <module>
    from tensorflow.contrib.learn.python.learn.learn_io import data_feeder
  File "/data/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/learn_io/__init__.py", line 26, in <module>
    from tensorflow.contrib.learn.python.learn.learn_io.dask_io import extract_dask_data
  File "/data/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/learn_io/dask_io.py", line 34, in <module>
    allowed_classes = (dd.Series, dd.DataFrame)
AttributeError: module 'dask.dataframe' has no attribute 'Series'

Please specify license for this code/model

Thanks for open sourcing the code and pretrained weights!

I'd like to be able to use/modify this awesome code in our company. However, I can't find any license specified here.
Specifically, we need to know whether this code and model can be used commercially or not.

Again, thank you for open sourcing the code of your paper.

Question on train loss

Hi, sorry for bothering you again.

What was the final train loss of the model? Actually, I'm worried about our model being overfitted to VoxCeleb2 data, so I'm not sure whether I should use early-stopping here, or just wait until the convergence.

ValueError: Layer #125 (named "gvlad_center_assignment"), weight <tf.Variable 'gvlad_center_assignment/kernel:0' shape=(7, 1, 512, 12) dtype=float32_ref> has shape (7, 1, 512, 12), but the saved weight has shape (10, 512, 7, 1).

Thanks for sharing this great work!!!
The pre-trained model handles identity encoding effectively.
But when I switch to my own data to fine-tune this pre-trained model, I get an error.
My python command line is:
python main.py --net resnet34s --batch_size 3 --gpu 0 --lr 0.001 --warmup_ratio 0.1 --optimizer adam --epochs 20 --multiprocess 4 --loss softmax --data_path ''
The Error is:
Traceback (most recent call last):
  File "main.py", line 212, in <module>
    main()
  File "main.py", line 84, in main
    network.load_weights(os.path.join(args.resume), by_name=True)
  File "/usr/local/anaconda3/envs/py3.6/lib/python3.6/site-packages/keras/engine/network.py", line 1163, in load_weights
    reshape=reshape)
  File "/usr/local/anaconda3/envs/py3.6/lib/python3.6/site-packages/keras/engine/saving.py", line 1149, in load_weights_from_hdf5_group_by_name
    str(weight_values[i].shape) + '.')
ValueError: Layer #125 (named "gvlad_center_assignment"), weight <tf.Variable 'gvlad_center_assignment/kernel:0' shape=(7, 1, 512, 12) dtype=float32_ref> has shape (7, 1, 512, 12), but the saved weight has shape (10, 512, 7, 1).

I've tried reshape, but it doesn't work.
I have no idea for it. So how can I fix it?

Is there any qualitative analysis showing that the network learns a good embedding, or any visualization tools for this?

In computer vision, there are tools to visualize what the network has learned for the final classification, such as Grad-CAM/CAM and so on.
In speaker recognition, how can one analyse which parts of the input activate the output, so as to say directly that the network has learned a good feature? What is common in the input spectrograms across different contexts of the same speaker?

An example of Grad-CAM on audio:
[image]

The training set contains the validation set

The dataset split file meta/voxlb2_train.txt contains audio files that are also in meta/voxlb2_val.txt.
The number of training examples decreases from 1,198,728 to 985,290 when the validation examples are removed.

I guess people using this repository are suffering from overfitting because of the split error.
Please remove the duplicated examples and re-upload the two split files!

The code below is the one that I used to remove the duplicates using Pandas:

import pandas as pd

df_valid = pd.read_csv('meta/voxlb2_val.txt', sep=' ', names=['path', 'label'])
df_train = pd.read_csv('meta/voxlb2_train.txt', sep=' ', names=['path', 'label'])
# Keep only training rows whose path does not appear in the validation set.
df_train = df_train[~df_train.path.isin(df_valid.path)]
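To persist the fix, the filtered list can be written back in the same space-separated, header-less format it was read in (the output filename here is illustrative):

# Write the de-duplicated training list back out.
df_train.to_csv('meta/voxlb2_train_dedup.txt', sep=' ',
                header=False, index=False)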

Training under Windows

Hello @WeidiXie ,

Thanks for this awesome work, and for sharing it with the open-source community! I am trying to adapt the code for training on VoxCeleb1 (just because it is a smaller dataset, I decided to play with it first).
I have prepared the file lists, plugged in your weights, frozen the first layers up to the bottleneck in the code, and tried to run main.py. However, I get an annoying error prior to training, likely related to the fact that you're using multiprocessing to speed up data generation:

File "D:\Repos\VGG-Speaker-Recognition\tool\toolkits.py", line 45, in set_mp pool = mp.Pool(processes=processes, initializer=init_worker) File "C:\Users\abiryukov\AppData\Local\Continuum\anaconda3\envs\pyBK\lib\multiprocessing\context.py", line 119, in Pool context=self.get_context()) File "C:\Users\abiryukov\AppData\Local\Continuum\anaconda3\envs\pyBK\lib\multiprocessing\pool.py", line 175, in __init__ self._repopulate_pool() File "C:\Users\abiryukov\AppData\Local\Continuum\anaconda3\envs\pyBK\lib\multiprocessing\pool.py", line 236, in _repopulate_pool self._wrap_exception) File "C:\Users\abiryukov\AppData\Local\Continuum\anaconda3\envs\pyBK\lib\multiprocessing\pool.py", line 255, in _repopulate_pool_static w.start() File "C:\Users\abiryukov\AppData\Local\Continuum\anaconda3\envs\pyBK\lib\multiprocessing\process.py", line 105, in start self._popen = self._Popen(self) File "C:\Users\abiryukov\AppData\Local\Continuum\anaconda3\envs\pyBK\lib\multiprocessing\context.py", line 322, in _Popen return Popen(process_obj) File "C:\Users\abiryukov\AppData\Local\Continuum\anaconda3\envs\pyBK\lib\multiprocessing\popen_spawn_win32.py", line 65, in __init__ reduction.dump(process_obj, to_child) File "C:\Users\abiryukov\AppData\Local\Continuum\anaconda3\envs\pyBK\lib\multiprocessing\reduction.py", line 60, in dump ForkingPickler(file, protocol).dump(obj) AttributeError: Can't pickle local object 'set_mp.<locals>.init_worker'

I also tried to set the number of processes to 1, and that did not help either.
I wonder if you have any suggestions on how to alleviate this.

Thanks again,

Anton.

Why do we need to normalize outputs in eval mode?

I'm rewriting this project in PyTorch and got confused by the code below:

if mode == 'eval':
    y = keras.layers.Lambda(lambda x: keras.backend.l2_normalize(x, 1))(x)
Is there some special reason behind that?
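For the PyTorch rewrite, the equivalent of that Lambda layer is a plain L2 normalisation over the feature dimension (a sketch):

import torch
import torch.nn.functional as F

def normalize_embedding(x: torch.Tensor) -> torch.Tensor:
    # Matches keras.backend.l2_normalize(x, 1): unit L2 norm along dim 1.
    return F.normalize(x, p=2, dim=1)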

There is something wrong with my model

@WeidiXie First, I converted the m4a files to wav files, then changed nothing in the code.

Epoch 1/64
Learning rate for epoch 1 is 0.001.
9365/9365 [==============================] - 6966s 744ms/step - loss: 8.4926 - acc: 3.9876e-04
Epoch 2/64
Learning rate for epoch 2 is 0.001.
9365/9365 [==============================] - 6759s 722ms/step - loss: 8.4370 - acc: 4.4548e-04
Epoch 3/64
Learning rate for epoch 3 is 0.001.
9365/9365 [==============================] - 6732s 719ms/step - loss: 8.4284 - acc: 4.2629e-04
Epoch 4/64
Learning rate for epoch 4 is 0.001.
9365/9365 [==============================] - 6789s 725ms/step - loss: 8.4265 - acc: 4.6633e-04
Epoch 5/64
Learning rate for epoch 5 is 0.001.
9365/9365 [==============================] - 6795s 726ms/step - loss: 8.4266 - acc: 4.2295e-04
Epoch 6/64
Learning rate for epoch 6 is 0.001.
9365/9365 [==============================] - 6742s 720ms/step - loss: 8.4259 - acc: 4.2629e-04
Epoch 7/64
Learning rate for epoch 7 is 0.001.
9365/9365 [==============================] - 6726s 718ms/step - loss: 8.4257 - acc: 4.4381e-04
Epoch 8/64
Learning rate for epoch 8 is 0.001.
9365/9365 [==============================] - 6729s 718ms/step - loss: 8.4256 - acc: 4.3630e-04
Epoch 9/64
Learning rate for epoch 9 is 0.001.
9365/9365 [==============================] - 6733s 719ms/step - loss: 8.4255 - acc: 4.3046e-04
Epoch 10/64
Learning rate for epoch 10 is 0.001.
9365/9365 [==============================] - 6768s 723ms/step - loss: 8.4254 - acc: 4.4548e-04
Epoch 11/64
Learning rate for epoch 11 is 0.001.
9365/9365 [==============================] - 6743s 720ms/step - loss: 8.4253 - acc: 4.2462e-04
Epoch 12/64
Learning rate for epoch 12 is 0.001.
9365/9365 [==============================] - 6757s 722ms/step - loss: 8.4252 - acc: 4.4631e-04
Epoch 13/64
Learning rate for epoch 13 is 0.001.
9365/9365 [==============================] - 6751s 721ms/step - loss: 8.4253 - acc: 4.2379e-04
Epoch 14/64
Learning rate for epoch 14 is 0.001.
9365/9365 [==============================] - 6754s 721ms/step - loss: 8.4253 - acc: 4.4214e-04
Epoch 15/64
Learning rate for epoch 15 is 0.001.
9365/9365 [==============================] - 6796s 726ms/step - loss: 8.4253 - acc: 4.0960e-04
Epoch 16/64
Learning rate for epoch 16 is 0.001.
9365/9365 [==============================] - 6755s 721ms/step - loss: 8.4252 - acc: 4.1628e-04
Epoch 17/64
Learning rate for epoch 17 is 0.0001.
9365/9365 [==============================] - 6743s 720ms/step - loss: 8.4183 - acc: 4.2796e-04
Epoch 18/64
Learning rate for epoch 18 is 0.0001.
9365/9365 [==============================] - 6741s 720ms/step - loss: 8.4151 - acc: 4.2629e-04

error in data generation MP 2

I tried to train with voxceleb2

5272/7492 [====================>.........] - ETA: 49:13 - loss: 7.6936 - acc: 0.0133

and got this message while training:

multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/data/ghostvlad-speaker-original/src/utils.py", line 28, in load_data
    linear_spect = lin_spectogram_from_wav(wav, hop_length, win_length, n_fft)
  File "/data/ghostvlad-speaker-original/src/utils.py", line 22, in lin_spectogram_from_wav
    linear = librosa.stft(wav, n_fft=n_fft, win_length=win_length, hop_length=hop_length) # linear spectrogram
  File "/usr/local/lib/python3.5/dist-packages/librosa/core/spectrum.py", line 165, in stft
    y = np.pad(y, int(n_fft // 2), mode=pad_mode)
  File "/usr/local/lib/python3.5/dist-packages/numpy/lib/arraypad.py", line 1290, in pad
    " in axis {} of `array`".format(axis))
ValueError: There aren't any elements to reflect in axis 0 of `array`
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "main.py", line 196, in <module>
    main()
  File "main.py", line 136, in main
    verbose=1)
  File "/usr/local/lib/python3.5/dist-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/keras/engine/training.py", line 1418, in fit_generator
    initial_epoch=initial_epoch)
  File "/usr/local/lib/python3.5/dist-packages/keras/engine/training_generator.py", line 181, in fit_generator
    generator_output = next(output_generator)
  File "/usr/local/lib/python3.5/dist-packages/keras/utils/data_utils.py", line 601, in get
    six.reraise(*sys.exc_info())
  File "/usr/local/lib/python3.5/dist-packages/six.py", line 693, in reraise
    raise value
  File "/usr/local/lib/python3.5/dist-packages/keras/utils/data_utils.py", line 595, in get
    inputs = self.queue.get(block=True).get()
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 608, in get
    raise self._value
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/lib/python3.5/dist-packages/keras/utils/data_utils.py", line 401, in get_index
    return _SHARED_SEQUENCES[uid][i]
  File "/data/ghostvlad-speaker-original/src/generator.py", line 42, in __getitem__
    X, y = self.__data_generation_mp(list_IDs_temp, indexes)
  File "/data/ghostvlad-speaker-original/src/generator.py", line 58, in __data_generation_mp
    X = np.expand_dims(np.array([p.get() for p in X]), -1)
  File "/data/ghostvlad-speaker-original/src/generator.py", line 58, in <listcomp>
    X = np.expand_dims(np.array([p.get() for p in X]), -1)
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 608, in get
    raise self._value
ValueError: There aren't any elements to reflect in axis 0 of `array`
I0528 18:57:06.038911 22079 executor.cpp:675] Container exited with status 1
W0528 18:57:06.038911 22072 logging.cpp:93] RAW: Received signal SIGTERM from process 16635 of user 0; exiting

Also, I think the accuracy is too low. What was it in your case?

how to increase gpu usage?

I use 6 GPUs to train the model with batch size 200, but each GPU's utilization is only between 0-22%, and an epoch costs 2 hours, even though memory usage is high. The epoch time is not improved compared with 2 GPUs, where each GPU's utilization is about 80%.

I then changed the multi-threading to 128; utilization increased to 30-50% and an epoch costs 1 hour. Is there any way to increase GPU usage further? (See also the sketch below.)
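If data loading is the bottleneck, one lever in Keras is the generator worker settings; a sketch (the values are illustrative and the actual call in main.py may differ):

def fit_with_more_workers(model, train_generator, epochs):
    # Raising workers/max_queue_size can keep GPUs fed when the input
    # pipeline, not compute, is the bottleneck.
    model.fit_generator(train_generator,
                        epochs=epochs,
                        use_multiprocessing=True,
                        workers=16,
                        max_queue_size=32)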

ValueError: Range cannot be empty (low >= high) unless no samples are taken

18/3521 [..............................] - ETA: 5:33:17 - loss: 9.4843 - acc: 0.0017
Traceback (most recent call last):
  File "train.py", line 270, in <module>
    main()
  File "train.py", line 207, in main
    verbose=1)
  File "/home/gongke/anaconda3/envs/py27/lib/python2.7/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/home/gongke/anaconda3/envs/py27/lib/python2.7/site-packages/keras/engine/training.py", line 1418, in fit_generator
    initial_epoch=initial_epoch)
  File "/home/gongke/anaconda3/envs/py27/lib/python2.7/site-packages/keras/engine/training_generator.py", line 181, in fit_generator
    generator_output = next(output_generator)
  File "/home/gongke/anaconda3/envs/py27/lib/python2.7/site-packages/keras/utils/data_utils.py", line 601, in get
    six.reraise(*sys.exc_info())
  File "/home/gongke/anaconda3/envs/py27/lib/python2.7/site-packages/keras/utils/data_utils.py", line 595, in get
    inputs = self.queue.get(block=True).get()
  File "/home/gongke/anaconda3/envs/py27/lib/python2.7/multiprocessing/pool.py", line 572, in get
    raise self._value
ValueError: Range cannot be empty (low >= high) unless no samples are taken


I got this error while training the model on my own data. Have you ever encountered it?

error in data generation MP

I tried to add my own dataset and build a model, but I get a data shape error:

4/14671 [..............................] - ETA: 29:30:39 - loss: 11.9053 - acc: 0.0000e+00

Traceback (most recent call last):
  File "src/main.py", line 195, in <module>
    main()
  File "src/main.py", line 135, in main
    verbose=1)
  File "/home/rohit/anaconda3/envs/condaenv/lib/python3.7/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/home/rohit/anaconda3/envs/condaenv/lib/python3.7/site-packages/keras/engine/training.py", line 1418, in fit_generator
    initial_epoch=initial_epoch)
  File "/home/rohit/anaconda3/envs/condaenv/lib/python3.7/site-packages/keras/engine/training_generator.py", line 181, in fit_generator
    generator_output = next(output_generator)
  File "/home/rohit/anaconda3/envs/condaenv/lib/python3.7/site-packages/keras/utils/data_utils.py", line 601, in get
    six.reraise(*sys.exc_info())
  File "/home/rohit/anaconda3/envs/condaenv/lib/python3.7/site-packages/six.py", line 693, in reraise
    raise value
  File "/home/rohit/anaconda3/envs/condaenv/lib/python3.7/site-packages/keras/utils/data_utils.py", line 595, in get
    inputs = self.queue.get(block=True).get()
  File "/home/rohit/anaconda3/envs/condaenv/lib/python3.7/multiprocessing/pool.py", line 657, in get
    raise self._value
  File "/home/rohit/anaconda3/envs/condaenv/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/home/rohit/anaconda3/envs/condaenv/lib/python3.7/site-packages/keras/utils/data_utils.py", line 401, in get_index
    return _SHARED_SEQUENCES[uid][i]
  File "/media/rohit/b92be869-bd56-4ed4-9306-12a754f7065f/diarization-package/VGG-Speaker-Recognition/src/generator.py", line 43, in __getitem__
    X, y = self.__data_generation_mp(list_IDs_temp, indexes)
  File "/media/rohit/b92be869-bd56-4ed4-9306-12a754f7065f/diarization-package/VGG-Speaker-Recognition/src/generator.py", line 59, in __data_generation_mp
    X = np.expand_dims(np.array([p.get() for p in X]), -1)
ValueError: could not broadcast input array from shape (257,250) into shape (257)

Is it an issue with the duration of some audio files?

TypeError: unsupported operand type(s) for *: 'int' and 'Dimension'

src/model.py

Line 70:

outputs = K.reshape(cluster_l2, [-1, self.k_centers * num_features])
TypeError: unsupported operand type(s) for *: 'int' and 'Dimension'

Modified to:

outputs = K.reshape(cluster_l2, [-1, int(self.k_centers) * int(num_features)])

----- debugged successfully

Duration of audio file for the trained model?

Hello Weidi

Thank you for the excellent paper and for making your work open-source.

I have a couple of questions:

  1. What is the duration of the audio files you used for the pre-trained model? In the paper you mention that audio files have a duration of 2.5 seconds during the training phase. But you also mention that for 'in the wild' sequences, longer utterances (4 seconds or more) give a significant improvement over shorter segments. I would imagine that for best results I should use audio chunks of the same length as used for training, so I want to know the duration used for the trained model.

  2. Did you try Mel STFTs instead of linear STFTs? Typically, Mel STFTs are known to provide far better results than linear STFTs.

Many thanks.

questions about downloading pre-trained model

Hi, I downloaded your pre-trained weights into the model, then trained it on VoxCeleb2, but the loss is 9 and the acc is 0.001. It's as if the weights were never loaded; the loss should be low and the acc should be about 0.92. Do you know why?
Looking forward to your reply.

Difference between average and VLAD pooling

Hi,

This paper and its idea are pretty interesting! May I ask two questions about the details, please?

  1. I found the idea of LDE (learnable dictionary encoding, cited in the paper as [Cai et al.]) to be very similar to NetVLAD (if not the same). I'm wondering what your opinion is on the difference between LDE and the NetVLAD used in this paper?

  2. After going through the code, I found that the forward propagation of VLAD and average pooling seems different.
    For average pooling, the output of resnet_2D_v1/v2 is used directly, which makes the shape
    [batch, 7, 16, D] -> [batch, 84, D] (after pooling, no additional layer)

For VLAD, the output is processed by an additional Conv2D layer, making the shape:
[batch, 7, 16, D] -> [batch, 1, 16, D] (feat, via Conv2D) / [batch, 1, 16, n_clusters] (cluster_score) -> [batch, D * n_clusters] (after VLAD)

The additional layer may lead to better performance. Maybe this is part of the reason why TAP performs poorly in the paper?

Last, the performance comparison in the paper is really useful. Good work :-)

Error in loading the pretrained weights

When loading the pretrained model, I am getting the following error:

Traceback (most recent call last):
  File "main.py", line 201, in <module>
    main()
  File "main.py", line 82, in main
    if mgpu == 1: network.load_weights(os.path.join(args.resume))
  File "/home/ultron/miniconda3/envs/tf/lib/python3.7/site-packages/keras/engine/network.py", line 1166, in load_weights
    f, self.layers, reshape=reshape)
  File "/home/ultron/miniconda3/envs/tf/lib/python3.7/site-packages/keras/engine/saving.py", line 1030, in load_weights_from_hdf5_group
    str(len(filtered_layers)) + ' layers.')
ValueError: You are trying to load a weight file containing 80 layers into a model with 81 layers.

Why extend audio?

In the function load_wav_Predict, why did you need to extend the audio file (see code below)?

I cannot think of a reason why one would need to extend the time signal.

def load_wav_Predict(vid_path, sr):
    wav, sr_ret = librosa.load(vid_path, sr=sr)
    assert sr_ret == sr
    # Appends a time-reversed copy of the signal to itself.
    extended_wav = np.append(wav, wav[::-1])
    return extended_wav

RuntimeWarning: divide by zero encountered in true_divide

Environment:

  • Ubuntu 18.04
  • Python 2.7/3.6
  • TensorFlow 1.13
  • keras 2.2

Before testing, I truncated voxceleb1_veri_test.txt to only 50 lines to speed up the test.

Command:

python predict.py --gpu 0 --net resnet34s --ghost_cluster 2 --vlad_cluster 8 --loss softmax --resume ../model/gvlad_softmax/resnet34_vlad8_ghost2_bdim512_deploy/weights.h5

output:

Instructions for updating:
Colocations handled automatically by placer.
==> successfully loading model ../model/gvlad_softmax/resnet34_vlad8_ghost2_bdim512_deploy/weights.h5.
==> start testing.
Finish extracting features for 0/100th wav.
2019-05-29 17:26:34.142088: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
Finish extracting features for 50/100th wav.
scores : 0.848808407784, gt : 1
scores : 0.635771036148, gt : 0
scores : 0.896652877331, gt : 1
scores : 0.666865587234, gt : 0
scores : 0.86113858223, gt : 1
scores : 0.649969100952, gt : 0
scores : 0.882004976273, gt : 1
scores : 0.675234436989, gt : 0
scores : 0.814641714096, gt : 1
scores : 0.612952053547, gt : 0
scores : 0.841726779938, gt : 1
scores : 0.691002070904, gt : 0
scores : 0.875537037849, gt : 1
scores : 0.760980308056, gt : 0
scores : 0.862766265869, gt : 1
scores : 0.595528423786, gt : 0
scores : 0.872184753418, gt : 1
scores : 0.580520808697, gt : 0
scores : 0.866317629814, gt : 1
scores : 0.861166357994, gt : 1
scores : 0.735198259354, gt : 0
scores : 0.846519947052, gt : 1
scores : 0.634202837944, gt : 0
scores : 0.879867553711, gt : 1
scores : 0.617964744568, gt : 0
scores : 0.866540849209, gt : 1
scores : 0.502503097057, gt : 0
scores : 0.884967088699, gt : 1
scores : 0.568573653698, gt : 0
scores : 0.926931381226, gt : 1
scores : 0.637345910072, gt : 0
scores : 0.834380090237, gt : 1
scores : 0.620291650295, gt : 0
scores : 0.912857890129, gt : 1
scores : 0.626294493675, gt : 0
scores : 0.952058196068, gt : 1
scores : 0.640718281269, gt : 0
scores : 0.933943748474, gt : 1
scores : 0.51838862896, gt : 0
scores : 0.861519873142, gt : 1
scores : 0.771008253098, gt : 0
scores : 0.881197452545, gt : 1
scores : 0.641325950623, gt : 0
scores : 0.885362446308, gt : 1
scores : 0.748977065086, gt : 0
scores : 0.839608311653, gt : 1
scores : 0.611160635948, gt : 0
/home/ubuntu/anaconda3/envs/python27/lib/python2.7/site-packages/scipy/interpolate/interpolate.py:610: RuntimeWarning: divide by zero encountered in true_divide
slope = (y_hi - y_lo) / (x_hi - x_lo)[:, None]
/home/ubuntu/anaconda3/envs/python27/lib/python2.7/site-packages/scipy/interpolate/interpolate.py:613: RuntimeWarning: invalid value encountered in multiply
y_new = slope*(x_new - x_lo)[:, None] + y_lo
==> model : ../model/gvlad_softmax/resnet34_vlad8_ghost2_bdim512_deploy/weights.h5, EER: 0.0
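For context, a common way to compute EER from (score, ground-truth) pairs like those printed above; this is a sketch using sklearn/scipy, not necessarily the repo's exact code:

from scipy.interpolate import interp1d
from scipy.optimize import brentq
from sklearn.metrics import roc_curve

def compute_eer(scores, labels):
    # EER is the operating point where the false-accept rate equals
    # the false-reject rate.
    fpr, tpr, _ = roc_curve(labels, scores, pos_label=1)
    # With only 50 trials the ROC has very few distinct points, and the
    # interpolation can hit a zero-width interval -- the likely source
    # of the divide-by-zero warning and the EER of 0.0 above.
    return brentq(lambda x: 1. - x - interp1d(fpr, tpr)(x), 0., 1.)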

Questions about the vladpooling implementation compared to the NetVLAD paper

[image: slide from the NetVLAD author's presentation]
This is from the NetVLAD author's presentation; as shown by the yellow circle, the feature map x is the same for both branches. But your code has some differences:
1: From the feature map, x --> x_fc and x --> x_k_center, and these two are passed to vladpooling, which does the softmax and normalization. Compared to NetVLAD, the x --> fc step looks unnecessary.
2: Before computing the softmax, why do you need to subtract the max first? This seems uncommon.
3: NetVLAD first does intra-normalization and then L2-normalization (one paper reports this improves accuracy), but here there is only one L2-normalization.

What benefit do these differences bring?
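(On question 2: subtracting the max before exponentiating is the standard numerically-stable softmax; it cancels out mathematically but prevents overflow. A minimal sketch:)

import numpy as np

def stable_softmax(x):
    # exp(x - max) / sum(exp(x - max)) equals exp(x) / sum(exp(x)),
    # but never overflows for large activations.
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)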

I trained with both NetVLAD and vladpooling on my dataset with the same optimization parameters, and used Grad-CAM to see how the feature-map activations differ; sometimes vladpooling activates on the background noise, but NetVLAD does not:

Original code:
[image]
Original code with self-attention after the feature map and before vladpooling:
[image]
Self-attention after the feature map and before NetVLAD:
[image]

toolkits.initialize_GPU(args) error

I'm getting this error:

Using TensorFlow backend.
Traceback (most recent call last):
  File "src/main.py", line 195, in <module>
    main()
  File "src/main.py", line 41, in main
    toolkits.initialize_GPU(args)
AttributeError: 'module' object has no attribute 'initialize_GPU'

It makes sense, since it only has import toolkits

Run command:
python src/main.py --net resnet34s --batch_size 160 --gpu 2,3 --lr 0.001 --warmup_ratio 0.1 --optimizer adam --epochs 128 --loss softmax --data_path ../../data/voxceleb1

My libraries:

tensorflow          1.8.0
toolkits            0.1.28

Answer about hardware

I've not found any information about hardware. Can you tell me which GPU model you used? In my implementation (approximately 3.6M parameters, as in the paper), a batch of 35 elements takes ~11 GB of memory (1080 Ti). Maybe I made a mistake.

Some questions about the thin-ResNet

The thin-ResNet described in the paper defines each stage as a block repeated X times, but in the code it is conv2d + identity_block_2d * times. Why do it this way, differently from the original ResNet architecture?
[image]

Also, did you try other architectures such as SE-ResNet or SE-ResNeXt? An attention mechanism may be helpful for feature extraction.

Question on last ReLU layer for evaluating 512-dim vector

Hi, thanks again for open-sourcing this Speaker Recognition system and kindly replying to every issue.

I have a question about the model shown here. When evaluating the final 512-dimensional output (the embedding vector), a ReLU activation is applied last, as shown in https://github.com/WeidiXie/VGG-Speaker-Recognition/blob/master/src/model.py#L139-L147. Hence, the output looks like:

[[1.15972664e-02 5.04462933e-03 2.85871420e-02 0.00000000e+00
  3.08723319e-02 0.00000000e+00 3.42872031e-02 4.36003655e-02
  0.00000000e+00 1.12573527e-01 5.46368458e-31 5.64192347e-02
  2.56476291e-02 0.00000000e+00 2.51553692e-02 4.77801599e-02
  0.00000000e+00 3.06680351e-02 2.24540825e-03 0.00000000e+00
  1.33734914e-02 2.91635211e-31 2.31502447e-02 5.39273359e-02
  9.22401696e-02 0.00000000e+00 3.31045166e-02 5.57319149e-02
  1.24792336e-02 4.04326282e-02 6.75894767e-02 0.00000000e+00
  6.08060285e-02 4.47864346e-02 2.85473187e-02 0.00000000e+00
... (truncated)

Here, we can observe that some values are 0.

In my opinion, the last ReLU layer eliminates some information by erasing all negative values. Moreover, it limits the region of the hypersphere where embeddings can exist by a factor of 1/2^512. So, my question is: was the last ReLU layer necessary?

I strongly believe that it was necessary (since it's currently SotA on Speaker Recognition in the wild!), but I couldn't work out why. I would like to kindly ask you about that. Thanks in advance.
