taoruijie / ecapa-tdnn

Unofficial reimplementation of ECAPA-TDNN for speaker recognition (EER=0.86 on Vox1_O when trained only on Vox2)

License: MIT License

Python 100.00%

ecapa-tdnn speaker-recognition speaker-verification voxceleb1 voxceleb2

ecapa-tdnn's Introduction

Hi 👋🏽, I'm Tao Ruijie!

I am a PhD student at the National University of Singapore (NUS), supervised by Prof. Li Haizhou. Prior to that, I received my Master's Degree from NUS in 2019 and my Bachelor's Degree from Soochow University in 2018.

My research interest is audio-visual speech processing, including speaker recognition, active speaker detection, and self-supervised learning. I have published more than 10 papers at top international AI conferences such as ACM MM, ICASSP, and INTERSPEECH.

ecapa-tdnn's Issues

training set is not 5 times bigger after augmentation

I notice that in the dataloader, the training set has the same size after augmentation as the original audio set.

So adding augmentation does not increase the amount of training data, it only increases its diversity?
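
One way to read the behaviour described above is that augmentation is applied on the fly: each time a sample is fetched, one augmentation is drawn at random, so the number of items per epoch stays equal to the number of original utterances while the variants differ across epochs. A minimal sketch of that idea in plain Python follows. The class name `OnTheFlyAugmentedSet` and the augmentation labels are hypothetical and not taken from the repository.

```python
import random

class OnTheFlyAugmentedSet:
    """Toy dataset: one randomly chosen augmentation per fetch (hypothetical)."""

    def __init__(self, utterances, augmentations, seed=0):
        self.utterances = list(utterances)
        self.augmentations = list(augmentations)
        self.rng = random.Random(seed)

    def __len__(self):
        # Length is the number of original utterances,
        # not utterances times augmentations.
        return len(self.utterances)

    def __getitem__(self, index):
        # A different augmentation may be drawn each time the same index is read.
        aug = self.rng.choice(self.augmentations)
        return self.utterances[index], aug

dataset = OnTheFlyAugmentedSet(
    ["utt_a.wav", "utt_b.wav", "utt_c.wav"],          # placeholder file names
    ["clean", "reverb", "noise", "speech", "music", "tv_noise"],  # placeholder labels
)
seen = {dataset[0][1] for _ in range(200)}  # augmentations drawn for one utterance
print(len(dataset), sorted(seen))
```

Under this reading, the dataset length never changes, but over many epochs each utterance is seen under several different augmentations.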

Thank you so much! I will try it.

Originally posted by @KAI-LI-JAIST in #7 (comment)

A question: does this test data need VAD?

Questions about reproduced ECAPA-Tdnn paper

I found some differences between your code configuration and the original ECAPA configuration.

The most important one is that your code randomly chooses 1 of the 6 noise types to add, whereas ECAPA uses all 6 noise methods, which means they have a larger dataset.

I trained the 512-channel model, which only achieves 1.16 EER (1.01 in ECAPA), but your 1024-channel result is even better than ECAPA's. So is there a training trick you have not shared, or did you change the configuration in the uploaded code? (I just copied your project and changed the channel number; everything else stays the same.) Or do the small differences in your code make it work better on a large model?

And thank you for your excellent work! Any help will be appreciated!

Best

AS-Norm

Has anyone implemented AS-norm? Can you share it with me?
Thank you very much!

Dimension issue about time and feature.

https://github.com/TaoRuijie/ECAPATDNN/blob/6bbf8c56e5b7c4be1df4d5aabcf526856aa36a57/model.py#L106
I think the dim 2 is the time dimension and dim 1 is the feature dimension. Can you check it further? Of course, it has no effect on the result.

EER 15%?

I got EER:15.98% with pretrain.model and Vox1_O(https://www.robots.ox.ac.uk/~vgg/data/voxceleb/meta/veri_test2.txt)
Is it right?

result EER

I test the result in vox1-O (veri_test.txt) and I get the result below:
EER 1.12%, minDCF 0.0745%
I noticed that the result reported in the README is actually evaluated in Vox1(clean) veri_test2.txt. Still a great work.
By the way, without TTA, I got VEER 1.0052 MinDCF 0.08051 in Vox1(clean).

KeyError: 'id10004/JKMfqmjUwMU/00003.wav'

Hi,
Thanks for this great work.
I get the following error when running trainSpeakerNet.py with a customized dataset (a smaller number of samples from VoxCeleb1) for training and evaluation, and I could not resolve it.

Please help me to resolve this issue.

Reading 0 of 37: 0.00 Hz, embedding size 512
Computing 0 of 143: 0.00 Hz
Traceback (most recent call last):
  File "./trainSpeakerNet.py", line 310, in <module>
    main()
  File "./trainSpeakerNet.py", line 304, in main
    mp.spawn(main_worker, nprocs=n_gpus, args=(n_gpus, args))
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/workspace/work/voxceleb_trainer-master/trainSpeakerNet.py", line 193, in main_worker
    sc, lab, _ = trainer.evaluateFromList(**vars(args))
  File "/workspace/work/voxceleb_trainer-master/SpeakerNet.py", line 221, in evaluateFromList
    com_feat = feats[data[2]].cuda()
KeyError: 'id10004/JKMfqmjUwMU/00003.wav'
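
The `KeyError` is raised at `feats[data[2]]`, which suggests that a trial in the evaluation list names a file for which no embedding was computed. This can happen when the trial list is kept unchanged while the audio set is reduced. A hedged sketch of a pre-check follows, assuming each trial line has the form `label file1 file2`. The helper name `missing_trial_files` is hypothetical and not part of the trainer.

```python
def missing_trial_files(trial_lines, available_files):
    """Return trial files that have no entry in available_files, in first-seen order."""
    available = set(available_files)
    missing = []
    for line in trial_lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        for name in parts[1:]:
            if name not in available and name not in missing:
                missing.append(name)
    return missing

trials = [
    "1 id10001/Y8hIVOBuels/00001.wav id10001/1zcIwhmdeo4/00001.wav",
    "0 id10001/Y8hIVOBuels/00001.wav id10004/JKMfqmjUwMU/00003.wav",
]
# Files assumed to be present in the reduced dataset (illustrative).
on_disk = [
    "id10001/Y8hIVOBuels/00001.wav",
    "id10001/1zcIwhmdeo4/00001.wav",
]
print(missing_trial_files(trials, on_disk))
```

If the returned list is not empty, either the missing files need to be added to the evaluation set or the corresponding trials removed from the list.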

ECAPA-TDNN

Hi!

I'm having trouble understanding ECAPA-TDNN architecture.

To be specific, I don't understand what the elements of ECAPA-TDNN (PreEmphasis, MelSpectrogram, FBankAug, conv1d, relu, batchNorm1d, bottleneck, Attention...) do in the context of speaker verification.

What about the AAMsoftmax classifier, the Adam optimizer and the StepLR scheduler?

Thanks for your attention and time!

About the training time

Hello, thank you so much for contributing this project.
I have been training this model recently. I also use one 3090 and the same settings as you, but each epoch takes about 20 hours. Do you know the reason?
Thank you in advance for your answer.

file structure of the dataset

Could you please post the file structure of the dataset, it would be great if you could upload a demo of the dataset, thanks

Inconsistent model input?

I see this in the inference code:
with torch.no_grad():
    embedding_1 = self.speaker_encoder.forward(data_1, aug = False)
    embedding_1 = F.normalize(embedding_1, p=2, dim=1)
    embedding_2 = self.speaker_encoder.forward(data_2, aug = False)
    embedding_2 = F.normalize(embedding_2, p=2, dim=1)
embeddings[file] = [embedding_1, embedding_2]
Here, data_1 is the full utterance and data_2 is the utterance split into segments and then stacked. For utterances of different lengths, is there no fixed length for data_1 and data_2? Can both be passed to self.speaker_encoder.forward to compute embeddings?

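A question about AS-norm

Hi Ruijie, I have been following you on Bilibili for a long time! I have recently been working on the SASVC competition and found that your repository is used as the ASV baseline code. Your README says that the ECAPA-TDNN result is the result after AS-norm, but I could not find any backend norm code in your repository. Is this a typo, or did you not add that code to this repository?

GPU question

Does this program require the GPU build of PyTorch? Can it run with the CPU build?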

Question about Res2Net module configuration

Thank you for a good resource. :)

Is there any special reason to implement it differently from the original paper in the multi-scale (res2net) module part of the ECAPA-TDNN model?
(i.e., in the Res2Net paper the first split is the one passed through unchanged, but in your implementation it is the last split)

ECAPA-TDNN/model.py

Lines 59 to 72 in a229093

spx = torch.split(out, self.width, 1)
for i in range(self.nums):
    if i==0:
        sp = spx[i]
    else:
        sp = sp + spx[i]
    sp = self.convs[i](sp)
    sp = self.relu(sp)
    sp = self.bns[i](sp)
    if i==0:
        out = sp
    else:
        out = torch.cat((out, sp), 1)
out = torch.cat((out, spx[self.nums]),1)

can not prepare the dataset

When I followed the Data preparation part in the link and ran python3 dataprep.py --save_path data --download --user USERNAME --password PASSWORD, I got the following error.

--2021-11-26 14:04:56-- http://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_dev_wav_partaa
Resolving www.robots.ox.ac.uk (www.robots.ox.ac.uk)... 129.67.94.2
Connecting to www.robots.ox.ac.uk (www.robots.ox.ac.uk)|129.67.94.2|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_dev_wav_partaa [following]
--2021-11-26 14:04:58-- https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_dev_wav_partaa
Connecting to www.robots.ox.ac.uk (www.robots.ox.ac.uk)|129.67.94.2|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2021-11-26 14:04:59 ERROR 404: Not Found.

Traceback (most recent call last):
  File "Downloads/voxceleb_trainer-master/dataprep.py", line 176, in <module>
    download(args,fileparts)
  File "Downloads/voxceleb_trainer-master/dataprep.py", line 58, in download
    raise ValueError('Download failed %s. If download fails repeatedly, use alternate URL on the VoxCeleb website.'%url)
ValueError: Download failed http://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_dev_wav_partaa. If download fails repeatedly, use alternate URL on the VoxCeleb website.

How can I solve this problem? Thanks!

The feature passed into model is not MFCC

In the original paper the features passed into the model are MFCC in 80 dimensions, but in your code I don't find the original speech converted to MFCC. I'm not quite sure if I misunderstood or if you directly used the original speech as a feature, does that make any difference? Waiting for your reply.

Accelerating evaluation speed

During evaluation, the current implementation calculates the similarity scores one by one in a for loop, which could be slow when "lines" gets large. Is there an elegant way to vectorize it?

How to use pretrain.model for continuing training?

I want to add some Chinese audio to the training data.

Can I use your pretrain.model and continue training with my data,

or do I have to download all of the VoxCeleb1 data, add my data, and train from scratch?

Thank you for your reply.

The content format of these two files train_list_with_len.txt and veri_test2.txt

I want to use my own data to train the model, but I ran into some problems and wonder whether my training list format is wrong. So I would like to ask about the content format of these two files. I followed Data preparation.
Here is the content of my /data08/VoxCeleb2/train_list_with_len.txt

d10001 id10001/Y8hIVOBuels/00001.wav
id10001 id10001/Y8hIVOBuels/00001.wav
id10001 id10001/Y8hIVOBuels/00001.wav
id10001 id10001/Y8hIVOBuels/00001.wav
id10001 id10001/Y8hIVOBuels/00002.wav
id10001 id10001/Y8hIVOBuels/00002.wav

Here is the content of my /data08/VoxCeleb1/veri_test2.txt

1 id10001/Y8hIVOBuels/00001.wav id10001/1zcIwhmdeo4/00001.wav
0 id10001/Y8hIVOBuels/00001.wav id10943/vNCVj7yLWPU/00005.wav
1 id10001/Y8hIVOBuels/00001.wav id10001/7w0IBEWc9Qw/00004.wav
0 id10001/Y8hIVOBuels/00001.wav id10999/G5R2-Hl7YX8/00008.wav
1 id10001/Y8hIVOBuels/00002.wav id10001/7w0IBEWc9Qw/00002.wav
0 id10001/Y8hIVOBuels/00002.wav id10787/qZInQxuCSVo/00008.wav
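
Going by the two excerpts above, a training-list line holds a speaker ID followed by a relative path, and a trial line holds a 0/1 label followed by two relative paths. A small parser sketch under that assumption follows. The function names are hypothetical, and whether the repository expects extra columns (for example a length field) is not confirmed here.

```python
def parse_train_line(line):
    """Split 'speaker_id relative/path.wav' into a (speaker, path) tuple."""
    speaker, path = line.split()[:2]
    return speaker, path

def parse_trial_line(line):
    """Split 'label enroll.wav test.wav' into (int label, enroll path, test path)."""
    label, enroll, test = line.split()
    if label not in ("0", "1"):
        raise ValueError("label must be 0 or 1, got %r" % label)
    return int(label), enroll, test

print(parse_train_line("id10001 id10001/Y8hIVOBuels/00001.wav"))
print(parse_trial_line("1 id10001/Y8hIVOBuels/00001.wav id10001/1zcIwhmdeo4/00001.wav"))
```

Running every line of a list through such a parser before training is a cheap way to catch malformed entries early.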

Do you have any open source plans for the Stage II in your latest paper?

I have read your great work, Self-Supervised Speaker Recognition with Loss-Gated Learning (ICASSP 2022).

I am attempting to follow Stage II in your paper, which shows great gains in your experiments.

If this part of the code were available, it would help a lot.

Thanks a lot.

dataset

Could I use this code with my own dataset? I want to run it on the UrbanSound8K dataset.

How do you apply AS-NORM?

Hi, thanks for sharing your code. You say the best performance is achieved with AS-NORM. Can you share how you apply the AS-NORM with which set?

Online augmentation affects speed

Hello, when training I find that data augmentation takes a lot of time, which causes the GPU (3090) to run intermittently. Augmentation takes almost 20 seconds per batch (batch_size = 400). Why is yours so fast? Looking forward to your reply.

Vox1_E and Vox1_H

Hello, what are the test metrics on Vox1_E and Vox1_H without norm?

About split utterance to matrix

Excuse me, I noticed that you split utterances into the matrix in the evaluation stage. Could you please explain why you do that?
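
A common reason for this kind of split, offered here as an assumption rather than the author's stated rationale, is test-time augmentation: several fixed-length segments are cut at evenly spaced offsets, stacked so they can be embedded in one batch, and their scores are combined with the full-utterance score. A plain-Python sketch follows. The name `split_into_segments`, the segment count, and the wrap-around padding are illustrative choices.

```python
def split_into_segments(samples, seg_len, num_segments=5):
    """Cut num_segments windows of seg_len samples at evenly spaced start offsets."""
    samples = list(samples)
    if not samples:
        raise ValueError("empty utterance")
    # Pad short utterances by repeating them until seg_len is reached.
    while len(samples) < seg_len:
        samples = samples + samples
    last_start = len(samples) - seg_len
    if num_segments == 1:
        starts = [0]
    else:
        starts = [round(i * last_start / (num_segments - 1))
                  for i in range(num_segments)]
    # Every row has the same length, so the result can be stacked into a matrix.
    return [samples[s:s + seg_len] for s in starts]

matrix = split_into_segments(range(10), seg_len=4, num_segments=3)
print(matrix)
```

Because every row has the same length, the segments of one utterance can be passed through the encoder as a single batch, whatever the length of the original recording.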

AS-norm

When will you release the AS-norm code?

train and eval on cpu?

Hello, do you train and evaluate on CPU?

Code using

Could I use your code for R&E at high school, please?

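Question: do the training results of this model include a weights file?

I want to ask: did you apply VAD to the test dataset before computing the EER?

How to fix slow training

Hello, training is slow when I run experiments with your code. It seems that the CPU spends a long time loading data (with 4 loader threads, there is a wait of about half a minute after every four batches). Is there any solution?
Here are screenshots of my experiment and configuration:

Do different versions of torch/torchaudio give different results?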

How to evaluate your nn

Hi!
I'm new to neural networks and I'm having trouble figuring out how to evaluate your implementation.
Currently I'm using an audio dataset that is different from your --eval_path and --eval_list, so I'm running this command:

python trainECAPAModel.py --eval --initial_model exps/pretrain.model --eval_list /eval_list_directory --eval_path /eval_path_directory

Is this the correct way to evaluate your implementation? Should I use any different argument? The point is I don't think I understand what exps/pretrain.model is, so I don't know how to use it.

Looking forward to your response!
Thanks

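Converting single-GPU to multi-GPU

Have you tried changing the code to run on multiple GPUs? After my change, convergence is very slow.
89 epoch, LR 0.000069, LOSS 2.890798, ACC 48.38%, EER 1.90%, bestEER 1.76%
Performance does not improve in later epochs either.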

why does training and testing audio have different length?

Hi,
I notice that at training time, num_frames is 200, so the segment of training audio is 2 seconds.
But at eval time, the segment of test audio is 3 seconds (ECAPAModel.py line 63).
Why are the training and testing lengths not the same?
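
For reference, the arithmetic behind those durations can be sketched as follows, assuming 16 kHz audio, a 10 ms frame shift (160 samples) and a 25 ms window (400 samples). These values are typical for VoxCeleb recipes but are assumptions here, not read from the repository.

```python
SAMPLE_RATE = 16000   # assumed sampling rate in Hz
HOP = 160             # assumed frame shift in samples (10 ms)
WINDOW = 400          # assumed analysis window in samples (25 ms)

def frames_to_samples(num_frames):
    """Samples that yield exactly num_frames frames when no padding is applied."""
    return num_frames * HOP + (WINDOW - HOP)

def frames_to_seconds(num_frames):
    """Duration in seconds of a segment that spans num_frames frames."""
    return frames_to_samples(num_frames) / SAMPLE_RATE

for frames in (200, 300):
    print(frames, frames_to_samples(frames), frames_to_seconds(frames))
```

A mismatch between training and test segment lengths is generally workable for models with temporal pooling, since the pooling layer maps any number of frames to a fixed-size embedding.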

Fail to download voxceleb1&2 dataset

Hi there, do you have any idea how to download the requested dataset for training? I tried to download VoxCeleb1 and VoxCeleb2 as described in the README, but it fails.

Could you add a license to this code?

Hi,
if this code is freely available for use, could you add a license (like MIT, or Apache) to it?
Thanks!

Train data?

Hi, what is the training data: only VoxCeleb2, or VoxCeleb1 + VoxCeleb2?

How to speed up training?

My GPU is also RTX 3090. Why is your training speed so fast? Mine is about 6 times slower!

pip install -r requirements.txt fails

ERROR: Invalid requirement: 'torch==1.7.1+cu110 torchvision==0.8.2+cu110 torchaudio==0.7.2' (from line 1 of requirements.txt)
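
pip expects one requirement per line, so the error is consistent with three pinned packages sharing a single line. A hedged sketch that splits such a line into separate entries follows. `split_requirements` is a hypothetical helper, and the `+cu110` builds may additionally need PyTorch's own wheel index, which this sketch does not handle.

```python
def split_requirements(line):
    """Turn a whitespace-separated run of pins into one requirement per entry."""
    return [token for token in line.split() if not token.startswith("#")]

bad_line = "torch==1.7.1+cu110 torchvision==0.8.2+cu110 torchaudio==0.7.2"
fixed = split_requirements(bad_line)
# Each entry would go on its own line in requirements.txt.
print("\n".join(fixed))
```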

the motivation of the two part

I noticed that there are two modifications here:

self.attention = nn.Sequential(
    nn.Conv1d(96, 128, kernel_size=1),
    nn.ReLU(),
    nn.BatchNorm1d(128),
    nn.Tanh(), # I add this layer
    nn.Conv1d(128, 96, kernel_size=1),
    nn.Softmax(dim=2),
)

self.se = nn.Sequential(
    nn.AdaptiveAvgPool1d(1),
    nn.Conv1d(channels, bottleneck, kernel_size=1, padding=0),
    nn.ReLU(),
    # nn.BatchNorm1d(bottleneck), # I remove this layer
    nn.Conv1d(bottleneck, channels, kernel_size=1, padding=0),
    nn.Sigmoid(),
)

What is the motivation here? And do these changes benefit the performance?
Thanks

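Threshold selection and score distribution

Hello, I noticed that some test scores are negative. What is the score range? I assumed it was 0 to 1. And how is the threshold determined?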
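
If the score is a cosine similarity between L2-normalized embeddings (an assumption, though consistent with the `F.normalize` calls quoted in another issue), its range is [-1, 1], so negative values are expected. One common way to pick a threshold is the operating point where the false-acceptance and false-rejection rates are closest (the EER point). A plain-Python sketch follows, with hypothetical helper names and made-up scores.

```python
import math

def cosine_score(a, b):
    """Cosine similarity of two vectors; always within [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def eer_threshold(scores, labels):
    """Sweep candidate thresholds; return (threshold, FAR, FRR) with the smallest |FAR - FRR|."""
    targets = [s for s, l in zip(scores, labels) if l == 1]
    nontargets = [s for s, l in zip(scores, labels) if l == 0]
    best = None
    for thr in sorted(set(scores)):
        far = sum(s >= thr for s in nontargets) / len(nontargets)  # impostors accepted
        frr = sum(s < thr for s in targets) / len(targets)         # targets rejected
        if best is None or abs(far - frr) < abs(best[1] - best[2]):
            best = (thr, far, frr)
    return best

# Vectors pointing in opposite directions give a negative score.
print(cosine_score([1.0, 0.0], [-1.0, 0.0]))

scores = [0.82, 0.64, 0.31, 0.12, -0.05, -0.40]  # illustrative values only
labels = [1, 1, 0, 1, 0, 0]
print(eer_threshold(scores, labels))
```

In practice the threshold is usually tuned on a held-out development set and then fixed, rather than chosen on the test trials themselves.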

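Is the algorithm's accuracy not very high?

80 epoch, LR 0.000090, LOSS 1.371865, ACC 74.04%, EER 1.03%, bestEER 0.96%
The training log shows that at epoch 80 the EER has already reached 1.03, but ACC is only 74.04. Isn't the accuracy rather low?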
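
One possible explanation, stated as an assumption about how ACC is computed rather than a confirmed reading of the code, is that ACC is the closed-set classification accuracy on augmented training segments, measured on logits that already carry the additive angular margin. EER, in contrast, is an open-set verification metric on test trials. Under an AAM-softmax loss the target-class logit is `s * cos(theta + m)` instead of `s * cos(theta)`, which can flip a correct prediction into a counted error. A small numeric sketch follows, with illustrative values for the margin, scale and cosines.

```python
import math

def aam_target_logit(cos_theta, margin=0.2, scale=30.0):
    """Target-class logit under an additive angular margin: scale * cos(theta + margin)."""
    theta = math.acos(max(-1.0, min(1.0, cos_theta)))
    return scale * math.cos(theta + margin)

SCALE = 30.0                          # illustrative scale factor
cos_target, cos_rival = 0.60, 0.50    # illustrative cosines to the true and a rival speaker

# Without the margin, the true speaker has the larger logit.
plain = (SCALE * cos_target, SCALE * cos_rival)
# With the margin, only the true-speaker logit is penalized.
penalized = (aam_target_logit(cos_target, scale=SCALE), SCALE * cos_rival)

print(plain[0] > plain[1], penalized[0] > penalized[1])
```

In this toy case the sample is classified correctly without the margin but counted as wrong with it, so a moderate training accuracy does not by itself contradict a low EER.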

GPU utilization error!

Hi, author. I am training the ECAPA-TDNN model end-to-end by using: python trainECAPAModel.py
However, I found that while training, the training time per epoch is very long. After checking, I found that the GPU memory is occupied, but its utilization is 0. I manually set model.cuda(), but it does not work. I'm wondering what part of the program should I change to make the model load successfully.
