philipperemy / deep-speaker
Deep Speaker: an End-to-End Neural Speaker Embedding System.
License: MIT License
While training the model, during the 231st batch the training loss was 0.10000002384185791, but from the next batch onwards the training loss was NaN. What could be the reason for this, and what is the fix? It doesn't look like divergence to me, since the training loss jumped abruptly from 0.1 to NaN. Some clarification, please.
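One common mitigation for a sudden jump to NaN (a hedged suggestion, not this repo's code) is to clip gradient norms and stop training as soon as a NaN loss appears, so the last good checkpoint can be inspected; `model`, `my_triplet_loss`, `x`, and `y` below are placeholders:

```python
import tensorflow as tf

# Clip gradient norms so a single bad batch cannot blow up the weights.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4, clipnorm=1.0)
model.compile(optimizer=optimizer, loss=my_triplet_loss)

# Stop immediately when the loss becomes NaN, keeping the last good state.
model.fit(x, y, callbacks=[tf.keras.callbacks.TerminateOnNaN()])
```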
When I test with two different wavs from the same unseen speaker, the result is almost equal to the result produced by two different unseen speakers.
Can you tell me why?
How can I test data from these files on the Improvements branch?
Hi,
I'm trying to use your model to create a real-time voice identification system.
Correct me if I'm wrong, but when you convert the audio into MFCCs, you use the whole audio to construct the features and then randomly sample a window of NUM_FRAMES.
I'm now investigating which sample size I should pass into the MFCC/fbank conversion. I haven't done extensive testing yet, but on an initial trial, 50,000 samples passed to the fbank() function works well.
This figure was pretty much a shot in the dark.
Would you have any advice as to the minimum required audio length?
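For reference, here is a minimal sketch of the pipeline as described above, assuming python_speech_features; NUM_FBANKS = 64 and NUM_FRAMES = 160 are assumptions, and at fbank's default 10 ms frame step, 160 frames correspond to roughly 1.6 s of audio:

```python
import numpy as np
from python_speech_features import fbank

NUM_FBANKS = 64   # assumption: number of filter banks fed to the model
NUM_FRAMES = 160  # assumption: window length sampled at training time

def random_feature_window(signal, sample_rate=8000):
    # Filter banks are computed over the whole utterance first...
    filter_banks, energies = fbank(signal, samplerate=sample_rate, nfilt=NUM_FBANKS)
    # ...then a contiguous window of NUM_FRAMES frames is sampled at
    # random (assumes the utterance is longer than NUM_FRAMES).
    start = np.random.randint(0, len(filter_banks) - NUM_FRAMES)
    return filter_banks[start:start + NUM_FRAMES]
```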
I want to get the embedding from a raw wav file whose speaker has been trained on. If I remove the cache, run a cache update, and then compute the embedding, the result is different from directly running the get_embeddings command. So which one is right? And when getting the embedding from a raw wav that was not trained on, does the cache file need to be deleted every time?
Hi Philip,
I have two questions about the way you prepare the MFCC features (audio.py).
filter_banks, energies = fbank(signal, samplerate=sample_rate, nfilt=NUM_FBANKS)
frames_features = normalize_frames(filter_banks)
...
def normalize_frames(m, epsilon=1e-12):
return [(v - np.mean(v)) / max(np.std(v), epsilon) for v in m]
But I think your code only normalizes the 26-D fbank features within each frame. That is actually instance normalization, not the commonly-used batch normalization. If we want to normalize the data at the batch (or whole-training-set) level, I believe we should do something like
return [(v - np.mean(m, axis=0)) / np.std(m, axis=0) for v in m]
or
from sklearn.preprocessing import StandardScaler
s1 = StandardScaler()
return s1.fit_transform(m)
Could you please explain why you normalize data in this way?
Thank you!
Hi, as mentioned in the title, I am also implementing Deep Speaker, using the same data as you. I am wondering: was your triplet loss able to get smaller than the margin value? Mine seems to be stuck at the margin.
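For context, here is a minimal sketch of the standard triplet loss (not necessarily this repo's exact formulation): if the model collapses and emits the same embedding for every input, the positive and negative distances are equal and the loss sits exactly at the margin.

```python
import tensorflow as tf

def triplet_loss(anchor, positive, negative, margin=0.1):
    # Squared Euclidean distances anchor/positive and anchor/negative.
    d_pos = tf.reduce_sum(tf.square(anchor - positive), axis=-1)
    d_neg = tf.reduce_sum(tf.square(anchor - negative), axis=-1)
    # If all embeddings are identical, d_pos == d_neg, so loss == margin.
    return tf.reduce_mean(tf.maximum(d_pos - d_neg + margin, 0.0))
```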
Can a speaker verification model be trained now?
I am running on a single GPU now and it's slow. How can I run the project on multiple GPUs?
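A hedged sketch, assuming a TF 2.x backend (build_model, my_triplet_loss, and dataset are placeholders): wrapping model construction in a MirroredStrategy scope makes Keras replicate the model across all local GPUs and split each batch between them.

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print('Replicas in sync:', strategy.num_replicas_in_sync)

with strategy.scope():
    model = build_model()  # placeholder: your model factory
    model.compile(optimizer='adam', loss=my_triplet_loss)

model.fit(dataset, epochs=10)  # batches are split across the GPUs
```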
Is there any way to compute features faster? It takes around 15 minutes per speaker, or even more. Am I doing something wrong? I adapted the code so that I first build the cache, then compute the features, save them to a file, and only afterwards calculate the cosine similarity. But saving the features takes nearly 20 minutes per cache on an i7-7700.
It looks as if the model is learning during the feature extraction process. I think the file should just be fed into the NN without any learning, with the features extracted near the final layer. Is it different here?
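One way to speed this up (a hedged sketch, not the repo's cache code): the fbank step is CPU-bound and independent per file, so it can be precomputed in parallel and pickled once; the 'audio' directory and the 8 kHz rate are assumptions.

```python
import glob
import pickle
from multiprocessing import Pool

import librosa
from python_speech_features import fbank

def extract(wav_path):
    # Load at the assumed 8 kHz rate and compute filter banks once.
    signal, sr = librosa.load(wav_path, sr=8000)
    feats, _ = fbank(signal, samplerate=sr, nfilt=64)
    with open(wav_path + '.pkl', 'wb') as f:
        pickle.dump(feats, f)

if __name__ == '__main__':
    wav_paths = glob.glob('audio/**/*.wav', recursive=True)  # placeholder path
    with Pool() as pool:
        pool.map(extract, wav_paths)
```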
Great work.
How do I fine-tune this model? Thank you.
I only have about 2,500 different speakers.
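One common fine-tuning recipe, sketched under assumptions (this is not an official CLI of the repo; model, the checkpoint path, my_triplet_loss, and new_speaker_batches are placeholders): load the pretrained weights, freeze the lower layers, and continue training on the new speakers with a small learning rate.

```python
from tensorflow.keras.optimizers import Adam

model.load_weights('ResCNN_checkpoint.h5')  # placeholder checkpoint path
for layer in model.layers[:-2]:             # freeze everything but the top
    layer.trainable = False
model.compile(optimizer=Adam(learning_rate=1e-4), loss=my_triplet_loss)
model.fit(new_speaker_batches, epochs=5)    # placeholder: your new data
```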
python cli.py --unseen_speakers p363,p363 --audio_dir $AUDIO_DIR --cache_output_dir $CACHE_DIR
2019-03-08 17:32:52,348 - INFO - audio_dir = /home/deep-speaker-data/VCTK-Corpus/
2019-03-08 17:32:52,348 - INFO - cache_dir = /home/deep-speaker-data/cache/
2019-03-08 17:32:52,348 - INFO - sample_rate = 8000
Using TensorFlow backend.
Traceback (most recent call last):
File "cli.py", line 83, in
main()
File "cli.py", line 71, in main
inference_unseen_speakers(audio_reader, unseen_speakers[0], unseen_speakers[1])
File "/home/deep-speaker/unseen_speakers.py", line 33, in inference_unseen_speakers
sp1_feat = generate_features_for_unseen_speakers(audio_reader, target_speaker=sp1)
File "/home/deep-speaker/unseen_speakers.py", line 22, in generate_features_for_unseen_speakers
assert target_speaker in audio_reader.all_speaker_ids
AssertionError
When I run the code, I get the error "no module named 'namedtupled'". I tried to look for it, but I couldn't find it. If you can help me, please do! Thanks!
BATCH_NUM_TRIPLETS = 6 # should be a multiple of 3
There is no need for it to be a multiple of 3.
From your source code, it seems that files to predict should also be packed into batches of the same size as during training. What if I have just one file and want to compute the vector for it alone? How should that be done? Thank you.
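For what it's worth, Keras's model.predict() accepts any leading batch size, so a single utterance only needs a batch axis of 1; a minimal sketch, assuming a feature window of shape (NUM_FRAMES, NUM_FBANKS, 1) and a placeholder extract_features helper:

```python
import numpy as np

features = extract_features('one_file.wav')  # placeholder: -> (160, 64, 1)
batch = np.expand_dims(features, axis=0)     # shape (1, 160, 64, 1)
embedding = model.predict(batch)[0]          # a single speaker embedding
```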
Hi,
Thanks for such a great work,
I wonder if this model (trained on English) can perform well on a different language?
Which versions of TensorFlow and Keras are you using?
I am getting this error while executing your code:
Traceback (most recent call last):
File "train_cli.py", line 245, in
start_training()
File "train_cli.py", line 207, in start_training
kx_train, ky_train, kx_test, ky_test, categorical_speakers = data_to_keras(data)
File "/home/deepankar/deep-speaker/utils.py", line 15, in data_to_keras
categorical_speakers = SpeakersToCategorical(data)
File "/home/deepankar/deep-speaker/utils.py", line 169, in init
self.speaker_categories = to_categorical(self.int_speaker_ids, num_classes=len(self.speaker_ids), dtype='float32')
File "/home/deepankar/venv-speaker/lib/python3.6/site-packages/keras/utils/np_utils.py", line 31, in to_categorical
num_classes = np.max(y) + 1
File "/home/deepankar/venv-speaker/lib/python3.6/site-packages/numpy/core/fromnumeric.py", line 2505, in amax
initial=initial)
File "/home/deepankar/venv-speaker/lib/python3.6/site-packages/numpy/core/fromnumeric.py", line 86, in _wrapreduction
return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
ValueError: zero-size array to reduction operation maximum which has no identity
If we swap the embeddings produced by softmax for the d-vector embeddings created by pytorch-speaker-verification, can it work?
Could you tell me about --counts_per_speaker? I want to train on my own dataset, but I don't know how to use it.
I've been having fun playing with your pre-trained model and implementation!
I've noticed a phenomenon that could be a point for improvement. When you record silence or background noise and extract the features from it, say silent_features, it has a strong cosine_similarity to anything. I was wondering: if you trained the model with various background noises / silences included in the train set, all labelled silent_features, would it learn to predict silent_features and distinguish them from voices?
Hi, when I run models_train.py with python3 I get the following error:
FileNotFoundError: [Errno 2] No such file or directory: 'checkpoints/losses.txt'
It doesn't look like a file I should create manually; what should be done about it? Thank you.
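A hedged workaround, assuming the script appends to checkpoints/losses.txt without ever creating it: create the directory and an empty file before launching training.

```python
import os

os.makedirs('checkpoints', exist_ok=True)
open(os.path.join('checkpoints', 'losses.txt'), 'a').close()
```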
Hello, after training, I used the checkpoint with the best training performance. Following the test example in README.md, I changed it slightly and tested 100 speakers outside the training set for cross-validation. The resulting EER reached ~23%! Do you have any thoughts on this? Is the model not suitable for testing on data outside the training set?
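For reference, here is a minimal sketch of how EER is commonly computed (not necessarily this repo's evaluation code), assuming 0/1 same-speaker labels and cosine scores for a list of trials: the EER is the point where the false positive and false negative rates cross on the ROC curve.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Placeholder trial lists: 1 = same speaker, 0 = different speaker.
labels = np.array([1, 1, 0, 0])
scores = np.array([0.9, 0.7, 0.6, 0.2])

fpr, tpr, _ = roc_curve(labels, scores)
fnr = 1 - tpr
eer = fpr[np.nanargmin(np.abs(fnr - fpr))]  # point where FPR ~ FNR
print(f'EER ~ {eer:.2%}')
```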
It takes several minutes to compute the value.
May I ask what it should take? Several seconds?
Hello,
I am having a problem using a trained model to perform prediction. I was able to train with your code, but I'm having trouble with the shape of the input to feed to model.predict.
Can someone help me?
models_train.py can't find these two methods in utils.py.
So I decided to go through your code, which I have been enjoying since yesterday! My intention is to try to help improve it and align it with the original paper.
I noticed that the utilization of my GPU is quite low, close to 0%, whereas its memory peaks at 8 GB, and I get all the relevant GPU messages from the TensorFlow backend.
I also noticed your comments that the code is not GPU-efficient.
As you are more comfortable with the code, do you think this is because each training "epoch" doesn't really use batches of positive and negative speakers?
Any ideas how to fix this?
Thanks!
Problem: I am trying the code on a single GPU with 2 GB of memory and it gives an out-of-memory error.
def normalize_frames(m):
    return [(v - np.mean(v)) / np.std(v) for v in m]
This divides by zero, as np.std(v) is zero in the case of a long silence.
[Errno 2] No such file or directory: '$CACHE_DIR\audio_cache_pkl\VCTK Corpus\wav48\p240\p240_034_cache.pkl'
Link: https://github.com/philipperemy/deep-speaker/tree/master/v3
I could not find a file to test with there.
In addition, how do I use model.predict()?
Also, why is the Keras convolution layer used rather than TensorFlow's own?
How can I quickly train the model, starting from the original model, on audio data from a new speaker?
Is there any reason why the function generate_features_for_unseen_speakers in unseen_speakers.py always uses 'p363' as its target_speaker parameter?
I'm working on testing the pretrained model on some public datasets, but the code requires putting every piece of audio into the cache, so I have to modify it to read directly from the audio files. Before making that modification, I'd like to know the performance of the pretrained model. If it is bad, I would train from scratch before testing, or maybe switch to another approach.
Thanks.
Hi @philipperemy ,
Thank you for the implementation. Can you help us with the files speaker-change-detection-norm.pkl and speaker-change-detection-categorical_speakers.pkl? I couldn't find these resources in the repo. Also, can you explain how the trained model can be used for inference?
Cheers!
As per the title: I noticed the trained model produces near-identical embeddings for different input audios (even randomly generated values). Is anyone else having this issue?
If it helps, I'm training with
I've tried a few different datasets/settings and had the same results.
Probably related to #9. Somehow the model decides the best way to go is to make all embeddings the same, hence loss = margin...
I have checked the paper and the code, and I realized that there is no implementation of the convolutional neural network stack (such as the ResNet blocks) in the code.
I have installed tensorflow-gpu, but when I run './deepspeaker train_softmax', I find it only uses the CPU.
I don't know why. Can you give me some pointers?
```
$ pip list | grep tensorflow
tensorflow-estimator 2.2.0
tensorflow-gpu       2.2.0
```
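A quick check, assuming the TF 2.2 install listed above: if the following prints an empty list, TensorFlow cannot see the GPU (often a CUDA/cuDNN version mismatch) and silently falls back to the CPU.

```python
import tensorflow as tf

print(tf.config.list_physical_devices('GPU'))
```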
Hello, I ran this project with my own dataset but did not get good results:
for the same speaker (same wav that was trained on), the cosine value = 0.0;
for the same speaker (different wav), the cosine value = x·e-06 or x·e-07;
for different speakers, the cosine value = x·e-05.
There does not seem to be much difference between them. Can you give me some suggestions?
Hello, when I run python cli.py --unseen_speakers p363,p364 --audio_dir $AUDIO_DIR --cache_output_dir $CACHE_DIR, it gives me an error:
$ python cli.py --unseen_speakers p363,p364 --audio_dir $AUDIO_DIR --cache_output_dir $CACHE_DIR
2019-10-24 21:13:18,326 - INFO - audio_dir = /Users/obsidian/deep-speaker-data/VCTK-Corpus/
2019-10-24 21:13:18,326 - INFO - cache_dir = /Users/obsidian/deep-speaker-data/cache/
2019-10-24 21:13:18,326 - INFO - sample_rate = 8000
Using TensorFlow backend.
Traceback (most recent call last):
File "cli.py", line 83, in <module>
main()
File "cli.py", line 71, in main
inference_unseen_speakers(audio_reader, unseen_speakers[0], unseen_speakers[1])
File "/Users/obsidian/source/deep-speaker/unseen_speakers.py", line 33, in inference_unseen_speakers
sp1_feat = generate_features_for_unseen_speakers(audio_reader, target_speaker=sp1)
File "/Users/obsidian/source/deep-speaker/unseen_speakers.py", line 22, in generate_features_for_unseen_speakers
assert target_speaker in audio_reader.all_speaker_ids
AssertionError
How can I solve this problem?
Thank you!
Can I use this model with only a CPU? And how do I configure that?
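A minimal sketch of one common way to force CPU-only execution: hide all CUDA devices before TensorFlow is imported.

```python
import os

os.environ['CUDA_VISIBLE_DEVICES'] = '-1'  # must be set before importing TF

import tensorflow as tf  # now only sees the CPU
```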
@philipperemy I want to know why the cosine similarity is so close when I run python cli.py --get_embeddings xxx --cache_output_dir $CACHE_DIR --audio_dir $AUDIO_DIR to get embeddings for two different speakers. It is ~0.97, and I don't understand how to solve it.
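For reference, the cosine similarity between two embedding vectors can be computed as below (a minimal numpy sketch); values near 1.0 for different speakers usually mean the embeddings are nearly collinear, e.g. a collapsed or wrongly loaded model.

```python
import numpy as np

def cosine_similarity(a, b):
    # In [-1, 1]; 1.0 means the two embeddings point in the same direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```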
Sir, can you explain your x_train, y_train, x_test, y_test? It seems that the y_* arrays are not the labels for the data.
When I run the train_triplet phase of this project on Ubuntu 18.04 with a Tesla GPU, nvidia-smi shows GPU usage of only 305 MiB. I wonder how to set the GPU usage options in the code. Thanks.
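A hedged sketch, assuming TF 2.x: these options control how TensorFlow reserves GPU memory. Note that low reported memory use is normal for small models and batch sizes and does not by itself mean the GPU is idle.

```python
import tensorflow as tf

# Let TensorFlow grow its GPU memory allocation on demand instead of
# pre-reserving a fixed amount.
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)
```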