
ecapa-tdnn's Introduction

Hi 👋🏽, I'm Tao Ruijie!


I am a PhD student at the National University of Singapore (NUS), supervised by Prof. Li Haizhou. Prior to that, I received my Master's degree from NUS in 2019 and my Bachelor's degree from Soochow University in 2018.

My research interest is audio-visual speech processing, including speaker recognition, active speaker detection, and self-supervised learning. I have published more than 10 papers at top international AI conferences such as ACM MM, ICASSP, and INTERSPEECH.


ecapa-tdnn's People

Contributors

taoruijie avatar


ecapa-tdnn's Issues

training set is not 5 times bigger after augmentation

I notice that in the dataloader, the training set after augmentation is the same size as the original set of audio files.

So augmentation is not meant to increase the amount of training data, only its diversity?
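The behaviour described in this question can be illustrated with a minimal sketch of on-the-fly augmentation. It is a toy model, not the repository's dataloader: the class name, the augmentation labels, and the seed are all hypothetical. The dataset length stays equal to the number of original files, while the augmentation applied to a given file is redrawn on every access.

```python
import random

# Hypothetical labels standing in for six augmentation options.
AUGMENTATIONS = ["clean", "reverb", "babble", "music", "noise", "tv_noise"]


class OnTheFlyDataset:
    """Toy dataset: one item per original utterance, augmented at access time."""

    def __init__(self, utterances, seed=0):
        self.utterances = list(utterances)
        self.rng = random.Random(seed)

    def __len__(self):
        # Length equals the number of original files; augmentation adds none.
        return len(self.utterances)

    def __getitem__(self, index):
        # A new augmentation type is drawn every time the item is requested,
        # so the same file can look different in each epoch.
        aug = self.rng.choice(AUGMENTATIONS)
        return self.utterances[index], aug


dataset = OnTheFlyDataset(["a.wav", "b.wav", "c.wav"])
# Augmentation types observed for the first file over many accesses.
seen = {dataset[0][1] for _ in range(200)}
```

Under this scheme an epoch still visits each file once, but over many epochs the model sees each file under several conditions. An offline scheme that stores every augmented copy would instead multiply the list length.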

Questions about reproduced ECAPA-Tdnn paper

Hi

I found some differences between your code's configuration and the original ECAPA configuration.

The most important one is that your code randomly chooses just 1 of the 6 noise types to add, whereas ECAPA uses all 6 noise methods, which means they have a larger dataset.

I trained the 512-channel model, which only achieves 1.16 EER (1.01 in ECAPA), but your 1024-channel result is even better than ECAPA's. Is there a training trick you have not shared, or did you change the configuration in the uploaded code? (I just copied your project and changed the number of channels; everything else stays the same.) Or do the small differences in your code make it work better on a large model?

And thank you for your excellent work! Any help will be appreciated!

Best

AS-Norm

Has anyone implemented AS-norm? Could you share it with me?
Thank you very much!
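For readers with the same question, here is a minimal sketch of adaptive symmetric score normalization (AS-norm) as it is commonly described. It is not taken from this repository. The function names, the `top_n` default, and the choice of cohort embeddings are all assumptions. Each raw cosine score is normalized by the mean and standard deviation of the top-N cohort scores of the enrollment side and of the test side, and the two results are averaged.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_n_stats(embedding, cohort, top_n):
    """Mean and std of the top_n highest cohort scores for one embedding."""
    scores = sorted((cosine(embedding, c) for c in cohort), reverse=True)[:top_n]
    mean = sum(scores) / len(scores)
    var = sum((s - mean) ** 2 for s in scores) / len(scores)
    return mean, math.sqrt(var) + 1e-8  # epsilon guards against zero spread


def as_norm(enroll, test, cohort, top_n=2):
    """Average of the enrollment-side and test-side normalized scores."""
    raw = cosine(enroll, test)
    mean_e, std_e = top_n_stats(enroll, cohort, top_n)
    mean_t, std_t = top_n_stats(test, cohort, top_n)
    return 0.5 * ((raw - mean_e) / std_e + (raw - mean_t) / std_t)
```

In practice the cohort is usually a set of embeddings from training speakers (for example one averaged embedding per speaker), and `top_n` is in the hundreds. Both are tuning choices rather than fixed parts of the method.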

result EER

I tested on Vox1-O (veri_test.txt) and got the result below:
EER 1.12%, minDCF 0.0745%
I noticed that the result reported in the README is actually evaluated in Vox1(clean) veri_test2.txt. Still a great work.
By the way, without TTA, I got VEER 1.0052 MinDCF 0.08051 in Vox1(clean).

KeyError: 'id10004/JKMfqmjUwMU/00003.wav'

Hi,
Thanks for this great work.
I'm getting the following error while running trainSpeakerNet.py with a customized dataset (a smaller number of samples from VoxCeleb1) for training and evaluation, and I couldn't solve it.

Please help me resolve this issue.

Reading 0 of 37: 0.00 Hz, embedding size 512
Computing 0 of 143: 0.00 Hz
Traceback (most recent call last):
  File "./trainSpeakerNet.py", line 310, in <module>
    main()
  File "./trainSpeakerNet.py", line 304, in main
    mp.spawn(main_worker, nprocs=n_gpus, args=(n_gpus, args))
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/workspace/work/voxceleb_trainer-master/trainSpeakerNet.py", line 193, in main_worker
    sc, lab, _ = trainer.evaluateFromList(**vars(args))
  File "/workspace/work/voxceleb_trainer-master/SpeakerNet.py", line 221, in evaluateFromList
    com_feat = feats[data[2]].cuda()
KeyError: 'id10004/JKMfqmjUwMU/00003.wav'
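The traceback shows a dictionary lookup failing for an utterance named in the trial list. One way to narrow this down, sketched below under assumptions, is to compare the trial list against the set of files that actually have embeddings before scoring starts. The function name and the three-field trial format ("label enroll test") are hypothetical here and are not taken from the trainer's code.

```python
def find_missing_trial_files(trial_lines, available_files):
    """Return trial-list utterances that have no entry in available_files."""
    available = set(available_files)
    missing = []
    for line in trial_lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        _, enroll, test = fields
        for name in (enroll, test):
            if name not in available and name not in missing:
                missing.append(name)
    return missing
```

If the returned list is non-empty, the reduced dataset and the trial list disagree, and either the list or the data needs to be regenerated so that every trial refers to a file that exists.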

ECAPA-TDNN

Hi!

I'm having trouble understanding the ECAPA-TDNN architecture.


To be specific, I don't understand what the elements of ECAPA-TDNN (PreEmphasis, MelSpectrogram, FBankAug, conv1d, relu, batchNorm1d, bottleneck, Attention...) do in the context of speaker verification.

What about the classifier AAMsoftmax, the optimizer Adam, and the scheduler stepLR?

Thanks for your attention and time!

About the training time

Hello, thank you so much for contributing this project.
I have been training this model recently, also with one 3090 and the same settings as you, but each epoch takes about 20 hours. Do you know the reason?
Thank you in advance for your answer.

file structure of the dataset

Could you please post the file structure of the dataset? It would be great if you could upload a demo of the dataset. Thanks.

Inconsistent model input?

I saw this in the inference code:
with torch.no_grad():
    embedding_1 = self.speaker_encoder.forward(data_1, aug = False)
    embedding_1 = F.normalize(embedding_1, p=2, dim=1)
    embedding_2 = self.speaker_encoder.forward(data_2, aug = False)
    embedding_2 = F.normalize(embedding_2, p=2, dim=1)
embeddings[file] = [embedding_1, embedding_2]
Here, data_1 is the full utterance and data_2 is the utterance split into segments and then stacked. For utterances of different lengths, do data_1 and data_2 have no fixed length? Can both be fed to self.speaker_encoder.forward to compute embeddings?

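Question about AS-norm

Hi Ruijie, I have been following you on Bilibili for a long time! I have been working on the SASVC challenge recently and found that it uses this repository as the ASV baseline code. Your README says the ECAPA-TDNN result is obtained after AS-norm, but I could not find any backend normalization in your code. Is that a typo, or did you not add that part of the code to this repository?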

GPU question

May I ask whether this program requires the GPU version of PyTorch? Can it run with the CPU version?

Question about Res2Net module configuration

Thank you for a good resource. :)

Is there a particular reason the multi-scale (Res2Net) module of the ECAPA-TDNN model is implemented differently from the original paper?
(i.e., the split that is passed through unchanged is the first one in the Res2Net paper, but the last one in your implementation)


ECAPA-TDNN/model.py

Lines 59 to 72 in a229093

spx = torch.split(out, self.width, 1)
for i in range(self.nums):
    if i==0:
        sp = spx[i]
    else:
        sp = sp + spx[i]
    sp = self.convs[i](sp)
    sp = self.relu(sp)
    sp = self.bns[i](sp)
    if i==0:
        out = sp
    else:
        out = torch.cat((out, sp), 1)
out = torch.cat((out, spx[self.nums]),1)

can not prepare the dataset

When I followed the Data preparation part in the link and ran python3 dataprep.py --save_path data --download --user USERNAME --password PASSWORD, I got the following error.

--2021-11-26 14:04:56-- http://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_dev_wav_partaa
Resolving www.robots.ox.ac.uk (www.robots.ox.ac.uk)... 129.67.94.2
Connecting to www.robots.ox.ac.uk (www.robots.ox.ac.uk)|129.67.94.2|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_dev_wav_partaa [following]
--2021-11-26 14:04:58-- https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_dev_wav_partaa
Connecting to www.robots.ox.ac.uk (www.robots.ox.ac.uk)|129.67.94.2|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2021-11-26 14:04:59 ERROR 404: Not Found.

Traceback (most recent call last):
  File "Downloads/voxceleb_trainer-master/dataprep.py", line 176, in <module>
    download(args,fileparts)
  File "Downloads/voxceleb_trainer-master/dataprep.py", line 58, in download
    raise ValueError('Download failed %s. If download fails repeatedly, use alternate URL on the VoxCeleb website.'%url)
ValueError: Download failed http://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_dev_wav_partaa. If download fails repeatedly, use alternate URL on the VoxCeleb website.

How can I solve this problem? Thanks!

The feature passed into model is not MFCC

In the original paper the features passed into the model are MFCC in 80 dimensions, but in your code I don't find the original speech converted to MFCC. I'm not quite sure if I misunderstood or if you directly used the original speech as a feature, does that make any difference? Waiting for your reply.

Accelerating evaluation speed

During evaluation, the current implementation calculates the similarity scores one by one in a for loop, which could be slow when "lines" gets large. Is there an elegant way to vectorize it?
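One possible way to vectorize the scoring is sketched below with NumPy. It assumes one embedding vector per file and a list of (enroll, test) name pairs. The function name and this data layout are hypothetical and simpler than what the evaluation code stores per file. All embeddings are stacked and L2-normalized once, then every trial score is computed as a row-wise dot product in a single call.

```python
import numpy as np


def score_trials(embeddings, trials):
    """Score all (enroll, test) pairs with one batched cosine computation."""
    names = sorted(embeddings)
    index = {name: i for i, name in enumerate(names)}
    matrix = np.stack([np.asarray(embeddings[n], dtype=np.float64) for n in names])
    # L2-normalize rows so that a dot product equals cosine similarity.
    matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)
    enroll_idx = np.array([index[e] for e, _ in trials])
    test_idx = np.array([index[t] for _, t in trials])
    # Row-wise dot products for every trial at once.
    return np.einsum("ij,ij->i", matrix[enroll_idx], matrix[test_idx])
```

For example, `score_trials({"a": [1, 0], "b": [0, 1]}, [("a", "b"), ("a", "a")])` gives one score per trial in list order. When a file has several embeddings, as with stacked segments, the same idea applies with a matrix product per pair followed by a mean.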

How to use pretrain.model for continuing training?

I want to add some Chinese audio to the training data.

Can I use your pretrain.model and continue training with my data,

or do I have to download all of the VoxCeleb1 data, add my own data, and train from the beginning?

Thank you for your reply.

The content format of these two files train_list_with_len.txt and veri_test2.txt

I want to train the model on my own data, but I ran into some problems and wonder whether my training list format is wrong. What is the expected content format of these two files? I followed Data preparation.
Here is the content of my /data08/VoxCeleb2/train_list_with_len.txt

d10001 id10001/Y8hIVOBuels/00001.wav
id10001 id10001/Y8hIVOBuels/00001.wav
id10001 id10001/Y8hIVOBuels/00001.wav
id10001 id10001/Y8hIVOBuels/00001.wav
id10001 id10001/Y8hIVOBuels/00002.wav
id10001 id10001/Y8hIVOBuels/00002.wav

Here is the content of my /data08/VoxCeleb1/veri_test2.txt

1 id10001/Y8hIVOBuels/00001.wav id10001/1zcIwhmdeo4/00001.wav
0 id10001/Y8hIVOBuels/00001.wav id10943/vNCVj7yLWPU/00005.wav
1 id10001/Y8hIVOBuels/00001.wav id10001/7w0IBEWc9Qw/00004.wav
0 id10001/Y8hIVOBuels/00001.wav id10999/G5R2-Hl7YX8/00008.wav
1 id10001/Y8hIVOBuels/00002.wav id10001/7w0IBEWc9Qw/00002.wav
0 id10001/Y8hIVOBuels/00002.wav id10787/qZInQxuCSVo/00008.wav

dataset

Could I use this code with my own dataset? I want to run it on the UrbanSound8K dataset.

How do you apply AS-NORM?

Hi, thanks for sharing your code. You say the best performance is achieved with AS-norm. Can you share how you apply AS-norm, and with which set?

Online augmentation affects speed

Hello, when training, I find that data augmentation takes a lot of time, which causes the GPU (3090) to run intermittently. Augmentation takes almost 20 seconds per batch (batch_size = 400). Why is yours so fast? Looking forward to your reply.

Vox1_E and Vox1_H

Hello, what are the results on Vox1_E and Vox1_H without norm?

About split utterance to matrix

Excuse me, I noticed that you split utterances into the matrix in the evaluation stage. Could you please explain why you do that?
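For context, one common reason for this kind of split is that fixed-length segments can be stacked into a single batch and their embeddings or scores averaged. A minimal sketch of such a split is shown below. The function name, the default of five segments, and the wrap-around handling of short inputs are assumptions and are not taken from the repository.

```python
def split_into_segments(samples, segment_len, num_segments=5):
    """Cut evenly spaced, possibly overlapping, fixed-length segments."""
    samples = list(samples)
    if not samples:
        raise ValueError("empty input")
    # Inputs shorter than one segment are extended by repeating themselves.
    while len(samples) < segment_len:
        samples = samples + samples
    last_start = len(samples) - segment_len
    if num_segments == 1:
        starts = [0]
    else:
        # Start offsets spread evenly from 0 to the last valid start.
        starts = [int(i * last_start / (num_segments - 1)) for i in range(num_segments)]
    return [samples[s:s + segment_len] for s in starts]
```

Every returned segment has the same length, so the list can be converted directly into a 2-D array (the "matrix") regardless of the original utterance length.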

Code using

Could I use your code for R&E at high school, please?

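How to fix slow training

Hello, when I run experiments with your code, training is slow. It seems that a lot of time is spent on the CPU reading data (with the number of loader threads set to 4, there is a wait of about half a minute after every four batches). Is there any way to fix this?
Here are screenshots of my experiment and configuration:
Screenshot 20230307100011
Screenshot 20230307100509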

How to evaluate your nn

Hi!
I'm new to neural networks and I'm having trouble figuring out how to evaluate your implementation.
For now I'm using an audio dataset that is different from your --eval_path and --eval_list, so I'm running this command:

python trainECAPAModel.py --eval --initial_model exps/pretrain.model --eval_list /eval_list_directory --eval_path /eval_path_directory

Is this the correct way to evaluate your implementation? Should I use any different arguments? I don't think I understand what exps/pretrain.model is, so I don't know how to use it.

Looking forward to your response!
Thanks

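Converting single-GPU to multi-GPU

Have you tried changing the code to run on multiple GPUs? After my changes, convergence is very slow.
89 epoch, LR 0.000069, LOSS 2.890798, ACC 48.38%, EER 1.90%, bestEER 1.76%
Performance does not improve in later epochs either.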

why does training and testing audio have different length?

Hi,
I notice that at training time, num_frames is 200, so each training segment is 2 seconds.
But at eval time, each test segment is 3 seconds (ECAPAModel.py line 63).
Why are the training and testing lengths not the same?
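The relation between a frame count and a segment duration can be worked out with a small sketch. The sample rate (16 kHz), window length (25 ms, 400 samples), and hop (10 ms, 160 samples) used here are assumptions typical of this kind of front end and are not read from the repository.

```python
SAMPLE_RATE = 16000  # assumed sample rate in Hz
WINDOW = 400         # assumed 25 ms analysis window, in samples
HOP = 160            # assumed 10 ms hop, in samples


def frames_to_samples(num_frames):
    """Samples needed so that exactly num_frames windows fit without padding."""
    return (num_frames - 1) * HOP + WINDOW


def frames_to_seconds(num_frames):
    """Duration in seconds of a segment holding num_frames frames."""
    return frames_to_samples(num_frames) / SAMPLE_RATE
```

Under these assumptions, 200 frames correspond to 32240 samples (about 2.0 s) and 300 frames to 48240 samples (about 3.0 s), which matches the 2-second and 3-second figures in the question.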

Fail to download voxceleb1&2 dataset

Hi there, do you have any idea how to download the requested datasets for training? I tried to download VoxCeleb1 and VoxCeleb2 as described in the README, but it fails.

Train data?

Hi, what is the training data: only VoxCeleb2, or VoxCeleb1 + VoxCeleb2?

the motivation of the two part

I noticed that there are two modifications here:

  1. self.attention = nn.Sequential(
         nn.Conv1d(96, 128, kernel_size=1),
         nn.ReLU(),
         nn.BatchNorm1d(128),
         nn.Tanh(), # I add this layer
         nn.Conv1d(128, 96, kernel_size=1),
         nn.Softmax(dim=2),
         )
  2. self.se = nn.Sequential(
         nn.AdaptiveAvgPool1d(1),
         nn.Conv1d(channels, bottleneck, kernel_size=1, padding=0),
         nn.ReLU(),
         # nn.BatchNorm1d(bottleneck), # I remove this layer
         nn.Conv1d(bottleneck, channels, kernel_size=1, padding=0),
         nn.Sigmoid(),
         )

What's the motivation here? And do these changes benefit performance?
Thanks

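Threshold selection and score distribution

Hello, I noticed that some of the test scores are negative. What is the range of the score? I understood it to be 0 to 1. And how is the threshold determined?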
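As background for this question: cosine similarity between two embeddings lies in [-1, 1], so negative scores are expected when scoring is cosine-based. One common way to choose an operating threshold is the equal error rate (EER) point on a labeled development set. The sketch below illustrates both points. The function names are hypothetical, and it is not the repository's scoring or metric code.

```python
import math


def cosine(a, b):
    """Cosine similarity; the result lies in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def eer_threshold(scores, labels):
    """Return (threshold, far, frr) where the two error rates are closest.

    A trial is accepted when its score is >= threshold; label 1 marks a
    same-speaker trial and label 0 a different-speaker trial.
    """
    targets = [s for s, l in zip(scores, labels) if l == 1]
    nontargets = [s for s, l in zip(scores, labels) if l == 0]
    best = None
    for thr in sorted(set(scores)):
        far = sum(s >= thr for s in nontargets) / len(nontargets)  # false accepts
        frr = sum(s < thr for s in targets) / len(targets)         # false rejects
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, thr, far, frr)
    return best[1], best[2], best[3]
```

The threshold found this way depends on the development data, so it should be re-estimated when the test conditions change or when score normalization is added.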

Accuracy is not high?

80 epoch, LR 0.000090, LOSS 1.371865, ACC 74.04%, EER 1.03%, bestEER 0.96%
In the training log, the EER has already reached 1.03 at epoch 80, but ACC is only 74.04. Isn't the accuracy rather low?

GPU utilization error!

Hi, author. I am training the ECAPA-TDNN model end-to-end using: python trainECAPAModel.py
However, the training time per epoch is very long. After checking, I found that the GPU memory is occupied but GPU utilization is 0. I manually set model.cuda(), but it does not help. Which part of the program should I change to make the model load successfully?

what's the effect of PreEmphasis

I think the ECAPA-TDNN paper doesn't mention any pre-processing of the wav signal, and voxceleb_trainer doesn't have this step either. Does it affect performance?
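For reference, pre-emphasis is a first-order high-pass filter, y[t] = x[t] - coef * x[t-1]. The sketch below shows what it does to a signal. The coefficient 0.97 is a commonly used default and an assumption here, and keeping the first sample unchanged is one of several boundary conventions, so this is not necessarily identical to the repository's PreEmphasis module.

```python
def pre_emphasis(samples, coef=0.97):
    """Apply y[0] = x[0] and y[t] = x[t] - coef * x[t-1] for t >= 1."""
    samples = list(samples)
    if not samples:
        return []
    out = [float(samples[0])]
    for prev, cur in zip(samples, samples[1:]):
        out.append(cur - coef * prev)
    return out
```

A constant (low-frequency) input such as [1.0, 1.0, 1.0] is attenuated after the first sample, while an alternating (high-frequency) input such as [1.0, -1.0, 1.0] is amplified. This is the usual motivation: it boosts the high-frequency part of speech before the spectrogram is computed.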
