philipperemy / deep-speaker
Deep Speaker: an End-to-End Neural Speaker Embedding System.
License: MIT License
While training the model, during the 231st batch the training loss was 0.10000002384185791, but from the next batch onwards the training loss was NaN. What could be the reason for this, and what is the fix? It doesn't look like divergence to me, since the training loss jumped abruptly from 0.1 to NaN. Some clarification, please.
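One common mitigation for a sudden jump to NaN (a hedged suggestion, not this repo's code) is to clip gradient norms and stop training as soon as a NaN loss appears, so the last good checkpoint can be inspected; `model`, `my_triplet_loss`, `x`, and `y` below are placeholders:

```python
import tensorflow as tf

# Clip gradient norms so a single bad batch cannot blow up the weights.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4, clipnorm=1.0)
model.compile(optimizer=optimizer, loss=my_triplet_loss)

# Stop immediately when the loss becomes NaN, keeping the last good state.
model.fit(x, y, callbacks=[tf.keras.callbacks.TerminateOnNaN()])
```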
When I test with two different wavs from the same unseen speaker, the result is almost equal to the result produced by two different unseen speakers.
Can you tell me why?
How can I test data from these files on the Improvements branch?
Hi,
I'm trying to use your model to create a real-time voice identification system.
Correct me if I'm wrong, but when you convert the audio into MFCCs, you use the whole audio to construct the features and then randomly sample a window of NUM_FRAMES.
I'm now investigating which sample size I should pass into the MFCC/fbank conversion. I haven't done extensive testing yet, but on an initial trial, 50,000 samples passed to the fbank() function works well.
This figure was pretty much a shot in the dark.
Would you have any advice as to the minimum required audio length?
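For reference, here is a minimal sketch of the pipeline as described above, assuming python_speech_features; NUM_FBANKS = 64 and NUM_FRAMES = 160 are assumptions, and at fbank's default 10 ms frame step, 160 frames correspond to roughly 1.6 s of audio:

```python
import numpy as np
from python_speech_features import fbank

NUM_FBANKS = 64   # assumption: number of filter banks fed to the model
NUM_FRAMES = 160  # assumption: window length sampled at training time

def random_feature_window(signal, sample_rate=8000):
    # Filter banks are computed over the whole utterance first...
    filter_banks, energies = fbank(signal, samplerate=sample_rate, nfilt=NUM_FBANKS)
    # ...then a contiguous window of NUM_FRAMES frames is sampled at
    # random (assumes the utterance is longer than NUM_FRAMES).
    start = np.random.randint(0, len(filter_banks) - NUM_FRAMES)
    return filter_banks[start:start + NUM_FRAMES]
```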
I want to get the embedding from a raw wav file whose speaker has been trained on. If I remove the cache, run a cache update, and then compute the embedding, the result is different from directly running the get_embeddings command. So which one is right? And when getting the embedding from a raw wav that was not trained on, does the cache file need to be deleted every time?
Hi Philip,
I have two questions about the way you prepare the MFCC features (audio.py).
filter_banks, energies = fbank(signal, samplerate=sample_rate, nfilt=NUM_FBANKS)
frames_features = normalize_frames(filter_banks)
...
def normalize_frames(m, epsilon=1e-12):
return [(v - np.mean(v)) / max(np.std(v), epsilon) for v in m]
But I think your code only normalizes the 26-D fbank features within each frame. That is actually instance normalization, not the commonly-used batch normalization. If we want to normalize the data at the batch (or whole-training-set) level, I believe we should do something like
return [(v - np.mean(m, axis=0)) / np.std(m, axis=0) for v in m]
or
from sklearn.preprocessing import StandardScaler
s1 = StandardScaler()
return s1.fit_transform(m)
Could you please explain why you normalize data in this way?
Thank you!
Hi, as mentioned in the title, I am also implementing Deep Speaker, using the same data as you. I am wondering: was your triplet loss able to get smaller than the margin value? Mine seems to be stuck at the margin.
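For context, here is a minimal sketch of the standard triplet loss (not necessarily this repo's exact formulation): if the model collapses and emits the same embedding for every input, the positive and negative distances are equal and the loss sits exactly at the margin.

```python
import tensorflow as tf

def triplet_loss(anchor, positive, negative, margin=0.1):
    # Squared Euclidean distances anchor/positive and anchor/negative.
    d_pos = tf.reduce_sum(tf.square(anchor - positive), axis=-1)
    d_neg = tf.reduce_sum(tf.square(anchor - negative), axis=-1)
    # If all embeddings are identical, d_pos == d_neg, so loss == margin.
    return tf.reduce_mean(tf.maximum(d_pos - d_neg + margin, 0.0))
```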
Can a speaker verification model be trained now?
I am running on a single GPU now and it's slow. How can I run the project on multiple GPUs?
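A hedged sketch, assuming a TF 2.x backend (build_model, my_triplet_loss, and dataset are placeholders): wrapping model construction in a MirroredStrategy scope makes Keras replicate the model across all local GPUs and split each batch between them.

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print('Replicas in sync:', strategy.num_replicas_in_sync)

with strategy.scope():
    model = build_model()  # placeholder: your model factory
    model.compile(optimizer='adam', loss=my_triplet_loss)

model.fit(dataset, epochs=10)  # batches are split across the GPUs
```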
Is there any way to compute features faster? It takes around 15 minutes per speaker, or even more. Am I doing something wrong? I adapted the code so that I first build the cache, then compute the features, save them to a file, and only afterwards calculate the cosine similarity. But saving the features takes nearly 20 minutes per cache on an i7-7700.
It looks as if the model is learning during the feature extraction process. I think the file should just be fed into the NN without any learning, with the features extracted near the final layer. Is it different here?
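One way to speed this up (a hedged sketch, not the repo's cache code): the fbank step is CPU-bound and independent per file, so it can be precomputed in parallel and pickled once; the 'audio' directory and the 8 kHz rate are assumptions.

```python
import glob
import pickle
from multiprocessing import Pool

import librosa
from python_speech_features import fbank

def extract(wav_path):
    # Load at the assumed 8 kHz rate and compute filter banks once.
    signal, sr = librosa.load(wav_path, sr=8000)
    feats, _ = fbank(signal, samplerate=sr, nfilt=64)
    with open(wav_path + '.pkl', 'wb') as f:
        pickle.dump(feats, f)

if __name__ == '__main__':
    wav_paths = glob.glob('audio/**/*.wav', recursive=True)  # placeholder path
    with Pool() as pool:
        pool.map(extract, wav_paths)
```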
Great work.
How do I fine-tune this model? Thank you.
I only have about 2,500 different speakers.
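One common fine-tuning recipe, sketched under assumptions (this is not an official CLI of the repo; model, the checkpoint path, my_triplet_loss, and new_speaker_batches are placeholders): load the pretrained weights, freeze the lower layers, and continue training on the new speakers with a small learning rate.

```python
from tensorflow.keras.optimizers import Adam

model.load_weights('ResCNN_checkpoint.h5')  # placeholder checkpoint path
for layer in model.layers[:-2]:             # freeze everything but the top
    layer.trainable = False
model.compile(optimizer=Adam(learning_rate=1e-4), loss=my_triplet_loss)
model.fit(new_speaker_batches, epochs=5)    # placeholder: your new data
```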
python cli.py --unseen_speakers p363,p363 --audio_dir $AUDIO_DIR --cache_output_dir $CACHE_DIR
2019-03-08 17:32:52,348 - INFO - audio_dir = /home/deep-speaker-data/VCTK-Corpus/
2019-03-08 17:32:52,348 - INFO - cache_dir = /home/deep-speaker-data/cache/
2019-03-08 17:32:52,348 - INFO - sample_rate = 8000
Using TensorFlow backend.
Traceback (most recent call last):
File "cli.py", line 83, in
main()
File "cli.py", line 71, in main
inference_unseen_speakers(audio_reader, unseen_speakers[0], unseen_speakers[1])
File "/home/deep-speaker/unseen_speakers.py", line 33, in inference_unseen_speakers
sp1_feat = generate_features_for_unseen_speakers(audio_reader, target_speaker=sp1)
File "/home/deep-speaker/unseen_speakers.py", line 22, in generate_features_for_unseen_speakers
assert target_speaker in audio_reader.all_speaker_ids
AssertionError
When I run the code, I get the error "no module named 'namedtupled'". I tried to look for it, but I couldn't find it. If you can help me, please do! Thanks!
BATCH_NUM_TRIPLETS = 6 # should be a multiple of 3
There is no need for it to be a multiple of 3.
From your source code, it seems that files to predict should also be packed into batches of the same size as during training. What if I have just one file and want to compute the vector for it alone? How should that be done? Thank you.
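For what it's worth, Keras's model.predict() accepts any leading batch size, so a single utterance only needs a batch axis of 1; a minimal sketch, assuming a feature window of shape (NUM_FRAMES, NUM_FBANKS, 1) and a placeholder extract_features helper:

```python
import numpy as np

features = extract_features('one_file.wav')  # placeholder: -> (160, 64, 1)
batch = np.expand_dims(features, axis=0)     # shape (1, 160, 64, 1)
embedding = model.predict(batch)[0]          # a single speaker embedding
```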
Hi,
Thanks for such a great work,
I wonder if this model (trained on English) can perform well on a different language?
Which versions of TensorFlow and Keras are you using?
I am getting this error while executing your code:
Traceback (most recent call last):
File "train_cli.py", line 245, in
start_training()
File "train_cli.py", line 207, in start_training
kx_train, ky_train, kx_test, ky_test, categorical_speakers = data_to_keras(data)
File "/home/deepankar/deep-speaker/utils.py", line 15, in data_to_keras
categorical_speakers = SpeakersToCategorical(data)
File "/home/deepankar/deep-speaker/utils.py", line 169, in init
self.speaker_categories = to_categorical(self.int_speaker_ids, num_classes=len(self.speaker_ids), dtype='float32')
File "/home/deepankar/venv-speaker/lib/python3.6/site-packages/keras/utils/np_utils.py", line 31, in to_categorical
num_classes = np.max(y) + 1
File "/home/deepankar/venv-speaker/lib/python3.6/site-packages/numpy/core/fromnumeric.py", line 2505, in amax
initial=initial)
File "/home/deepankar/venv-speaker/lib/python3.6/site-packages/numpy/core/fromnumeric.py", line 86, in _wrapreduction
return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
ValueError: zero-size array to reduction operation maximum which has no identity
If we swap the embeddings produced by softmax for the d-vector embeddings created by pytorch-speaker-verification, can it work?
Could you tell me about --counts_per_speaker? I want to train on my own dataset, but I don't know how to use it.
I've been having fun playing with your pre-trained model and implementation!
I've noticed a phenomenon that could be a point for improvement. When you record silence or background noise and extract the features from it, say silent_features, it has a strong cosine_similarity to anything. I was wondering: if you trained the model with various background noises / silences included in the train set, all labelled silent_features, would it learn to predict silent_features and distinguish them from voices?
Hi, when I run models_train.py with python3 I get the following error:
FileNotFoundError: [Errno 2] No such file or directory: 'checkpoints/losses.txt'
It doesn't look like a file I should create manually; what should be done about it? Thank you.
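A hedged workaround, assuming the script appends to checkpoints/losses.txt without ever creating it: create the directory and an empty file before launching training.

```python
import os

os.makedirs('checkpoints', exist_ok=True)
open(os.path.join('checkpoints', 'losses.txt'), 'a').close()
```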
Hello, after training, I used the checkpoint with the best training performance. Following the test example in README.md, I changed it slightly and tested 100 speakers outside the training set for cross-validation. The resulting EER reached ~23%! Do you have any thoughts on this? Is the model not suitable for testing on data outside the training set?
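For reference, here is a minimal sketch of how EER is commonly computed (not necessarily this repo's evaluation code), assuming 0/1 same-speaker labels and cosine scores for a list of trials: the EER is the point where the false positive and false negative rates cross on the ROC curve.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Placeholder trial lists: 1 = same speaker, 0 = different speaker.
labels = np.array([1, 1, 0, 0])
scores = np.array([0.9, 0.7, 0.6, 0.2])

fpr, tpr, _ = roc_curve(labels, scores)
fnr = 1 - tpr
eer = fpr[np.nanargmin(np.abs(fnr - fpr))]  # point where FPR ~ FNR
print(f'EER ~ {eer:.2%}')
```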
It takes several minutes to compute the value.
May I ask what it should take? Several seconds?
Hello,
I am having a problem using a trained model to perform prediction. I was able to train with your code, but I'm having trouble with the shape of the input to feed to model.predict.
Can someone help me?
models_train.py can't find these two methods in utils.py.
So I decided to go through your code, which I have been enjoying since yesterday! My intention is to try to help improve it and align it with the original paper.
I noticed that the utilization of my GPU is quite low, close to 0%, whereas its memory peaks at 8 GB, and I get all the relevant GPU messages from the TensorFlow backend.
I also noticed your comments that the code is not GPU-efficient.
As you are more comfortable with the code, do you think this is because each training "epoch" doesn't really use batches of positive and negative speakers?
Any ideas how to fix this?
Thanks!
Problem: I am trying the code on a single GPU with 2 GB of memory and it gives an out-of-memory error.
def normalize_frames(m):
    return [(v - np.mean(v)) / np.std(v) for v in m]
This divides by zero, as np.std(v) is zero in the case of a long silence.
[Errno 2] No such file or directory: '$CACHE_DIR\audio_cache_pkl\VCTK Corpus\wav48\p240\p240_034_cache.pkl'
Link: https://github.com/philipperemy/deep-speaker/tree/master/v3
I could not find a file to test with there.
In addition, how do I use model.predict()?
Also, why is the Keras convolution layer used rather than TensorFlow's own?
How can I quickly train the model, starting from the original model, on audio data from a new speaker?
Is there any reason why the function generate_features_for_unseen_speakers in unseen_speakers.py always uses 'p363' as its target_speaker parameter?
I'm working on testing the pretrained model on some public datasets, but the code requires putting every piece of audio into the cache, so I have to modify it to read directly from the audio files. Before making that modification, I'd like to know the performance of the pretrained model. If it is bad, I would train from scratch before testing, or maybe switch to another approach.
Thanks.
Hi @philipperemy ,
Thank you for the implementation. Can you help us with the files speaker-change-detection-norm.pkl and speaker-change-detection-categorical_speakers.pkl? I couldn't find these resources in the repo. Also, can you explain how the trained model can be used for inference?
Cheers!
As per the title: I noticed the trained model produces near-identical embeddings for different input audios (even randomly generated values). Is anyone else having this issue?
If it helps, I'm training with
I've tried a few different datasets/settings and had the same results.
Probably related to #9. Somehow the model decides the best way to go is to make all embeddings the same, hence loss = margin...
I have checked the paper and the code, and I realized that there is no implementation of the convolutional neural network stack (such as the ResNet blocks) in the code.
I have installed tensorflow-gpu, but when I run './deepspeaker train_softmax', I find it only uses the CPU.
I don't know why. Can you give me some pointers?
```
$ pip list | grep tensorflow
tensorflow-estimator 2.2.0
tensorflow-gpu       2.2.0
```
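A quick check, assuming the TF 2.2 install listed above: if the following prints an empty list, TensorFlow cannot see the GPU (often a CUDA/cuDNN version mismatch) and silently falls back to the CPU.

```python
import tensorflow as tf

print(tf.config.list_physical_devices('GPU'))
```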
Hello, I ran this project with my own dataset but did not get good results:
for the same speaker (same wav that was trained on), the cosine value = 0.0;
for the same speaker (different wav), the cosine value = x·e-06 or x·e-07;
for different speakers, the cosine value = x·e-05.
There does not seem to be much difference between them. Can you give me some suggestions?
Hello, when I run python cli.py --unseen_speakers p363,p364 --audio_dir $AUDIO_DIR --cache_output_dir $CACHE_DIR, it gives me an error:
$ python cli.py --unseen_speakers p363,p364 --audio_dir $AUDIO_DIR --cache_output_dir $CACHE_DIR
2019-10-24 21:13:18,326 - INFO - audio_dir = /Users/obsidian/deep-speaker-data/VCTK-Corpus/
2019-10-24 21:13:18,326 - INFO - cache_dir = /Users/obsidian/deep-speaker-data/cache/
2019-10-24 21:13:18,326 - INFO - sample_rate = 8000
Using TensorFlow backend.
Traceback (most recent call last):
File "cli.py", line 83, in <module>
main()
File "cli.py", line 71, in main
inference_unseen_speakers(audio_reader, unseen_speakers[0], unseen_speakers[1])
File "/Users/obsidian/source/deep-speaker/unseen_speakers.py", line 33, in inference_unseen_speakers
sp1_feat = generate_features_for_unseen_speakers(audio_reader, target_speaker=sp1)
File "/Users/obsidian/source/deep-speaker/unseen_speakers.py", line 22, in generate_features_for_unseen_speakers
assert target_speaker in audio_reader.all_speaker_ids
AssertionError
How can I solve this problem?
Thank you!
Can I use this model with only a CPU? And how do I configure that?
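A minimal sketch of one common way to force CPU-only execution: hide all CUDA devices before TensorFlow is imported.

```python
import os

os.environ['CUDA_VISIBLE_DEVICES'] = '-1'  # must be set before importing TF

import tensorflow as tf  # now only sees the CPU
```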
@philipperemy I want to know why the cosine similarity is so close when I run python cli.py --get_embeddings xxx --cache_output_dir $CACHE_DIR --audio_dir $AUDIO_DIR to get embeddings for two different speakers. It is ~0.97, and I don't understand how to solve it.
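For reference, the cosine similarity between two embedding vectors can be computed as below (a minimal numpy sketch); values near 1.0 for different speakers usually mean the embeddings are nearly collinear, e.g. a collapsed or wrongly loaded model.

```python
import numpy as np

def cosine_similarity(a, b):
    # In [-1, 1]; 1.0 means the two embeddings point in the same direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```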
Sir, can you explain your x_train, y_train, x_test, y_test? It seems that the y_* arrays are not the labels for the data.
When I run the train_triplet phase of this project on Ubuntu 18.04 with a Tesla GPU, nvidia-smi shows GPU usage of only 305 MiB. I wonder how to set the GPU usage options in the code. Thanks.
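A hedged sketch, assuming TF 2.x: these options control how TensorFlow reserves GPU memory. Note that low reported memory use is normal for small models and batch sizes and does not by itself mean the GPU is idle.

```python
import tensorflow as tf

# Let TensorFlow grow its GPU memory allocation on demand instead of
# pre-reserving a fixed amount.
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)
```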