Describe the question I trained the uis-rnn model on embeddings ob

uis-rnn gives different results on broken audios and continuous audios (CLOSED)

ashu170292 commented on May 19, 2024

uis-rnn gives different results on broken audios and continuous audios.


Comments (5)

wq2012 commented on May 19, 2024

How did you train your uis-rnn network?

Did you also train it on continuous audio?


ashu170292 commented on May 19, 2024

Q) How did you train your uis-rnn network?
Answer) To train the uis-rnn network, I built the train sequence as a single 2-dim numpy array. I used around 4000 utterances from TIMIT. Each utterance has only one speaker and is 4 to 6 seconds long. For each utterance, embeddings were computed and appended to the train sequence.

Q) Did you also train it on continuous audio?
Answer) No
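
For readers following along, the training setup described above could be sketched roughly as follows. This is only an illustration under assumptions, not the poster's actual code: the helper name `build_train_sequence`, the 256-dim embeddings, the toy data, and the choice to prefix each label with the utterance index are all hypothetical. The resulting pair of arrays has the general shape that uis-rnn training expects (one 2-dim float array plus one label per row).

```python
import numpy as np

def build_train_sequence(utterance_embeddings, speaker_labels):
    """Concatenate per-utterance embeddings into one (N, D) array.

    utterance_embeddings: list of 2-dim arrays, each (num_embeddings, D).
    speaker_labels: one speaker name per utterance (single-speaker utterances).
    """
    sequences = []
    cluster_ids = []
    for utt_idx, (emb, speaker) in enumerate(
            zip(utterance_embeddings, speaker_labels)):
        emb = np.asarray(emb, dtype=float)
        if emb.ndim != 2:
            raise ValueError("each utterance must be a 2-dim array")
        sequences.append(emb)
        # One label per embedding row. The utterance-index prefix is an
        # assumption made here so that labels stay distinct per utterance.
        cluster_ids.extend([f"{utt_idx}_{speaker}"] * emb.shape[0])
    train_sequence = np.concatenate(sequences, axis=0)
    train_cluster_id = np.array(cluster_ids)
    return train_sequence, train_cluster_id

# Toy stand-in for real speaker embeddings (hypothetical sizes).
rng = np.random.default_rng(0)
utterances = [rng.normal(size=(5, 256)), rng.normal(size=(7, 256))]
speakers = ["spk_a", "spk_b"]

train_sequence, train_cluster_id = build_train_sequence(utterances, speakers)
print(train_sequence.shape, len(train_cluster_id))
```

Note that a sequence built this way only ever changes speaker at utterance boundaries, which is worth keeping in mind when reading the replies below.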


wq2012 commented on May 19, 2024

The whole point of UIS-RNN is to learn conversational information from examples. If your UIS-RNN is trained on single-speaker utterances only, the trained model will be useless on multi-speaker audio.


ashu170292 commented on May 19, 2024

Thanks for your help, Quan.

The uis-rnn model trained on single-speaker utterances performs badly on multi-speaker utterances (this is explained by the answer above). I am still finding it hard to build intuition around the following:

If I split the multi-speaker audio into per-speaker pieces and concatenate the embeddings of those pieces in sequence, I get different and fairly accurate predicted ids (around 91% accuracy).

Any idea why that would happen?


wq2012 commented on May 19, 2024

UIS-RNN is an algorithm for supervised learning. This means that if you train it on multi-speaker data, it will perform well on multi-speaker data. If you train it on single-speaker data only, it will only perform well on single-speaker data. It's not supposed to perform well on scenarios that never appeared during training.

"When I use the model on continuous audios, I get only one or two speaker ids. But if I split the audio into per-speaker pieces and concatenate the embeddings of those pieces in sequence, I get different and fairly accurate cluster ids."

This seems unrelated to UIS-RNN. Sounds like a bug in your speaker embedding implementation. If you extract speaker embeddings from sliding windows, whether it is continuous audio or broken audio should not make much difference.
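
One way to check the sliding-window claim above on your own pipeline is sketched below. It is a toy illustration under stated assumptions: `toy_embedding` is a hypothetical stand-in for a real speaker encoder (it just averages each window), the features are random, and the speaker boundary is deliberately placed on a multiple of the hop so that windows line up. Under those assumptions, the continuous and broken versions yield the same embeddings except for the windows that straddle the speaker change.

```python
import numpy as np

def sliding_windows(frames, window, hop):
    """Return overlapping windows, shape (num_windows, window, dim)."""
    starts = range(0, len(frames) - window + 1, hop)
    return np.stack([frames[s:s + window] for s in starts])

def toy_embedding(windows):
    """Hypothetical stand-in for a speaker encoder: window mean, L2-normalised."""
    emb = windows.mean(axis=1)
    return emb / np.linalg.norm(emb, axis=1, keepdims=True)

rng = np.random.default_rng(0)
# Two toy "speakers" with different feature means: 100 frames each, 40 dims.
spk_a = rng.normal(loc=1.0, size=(100, 40))
spk_b = rng.normal(loc=-1.0, size=(100, 40))
window, hop = 20, 10

# Continuous audio: one pass of sliding windows over the whole recording.
continuous = np.concatenate([spk_a, spk_b])
emb_continuous = toy_embedding(sliding_windows(continuous, window, hop))

# Broken audio: windows computed per piece, embeddings concatenated afterwards.
emb_broken = np.concatenate([
    toy_embedding(sliding_windows(spk_a, window, hop)),
    toy_embedding(sliding_windows(spk_b, window, hop)),
])

# Indices of continuous windows that contain frames from both speakers.
boundary = len(spk_a)
starts = range(0, len(continuous) - window + 1, hop)
straddling = [i for i, s in enumerate(starts) if s < boundary < s + window]

# Dropping the straddling windows leaves the two sets of embeddings identical.
same_elsewhere = np.allclose(
    np.delete(emb_continuous, straddling, axis=0), emb_broken)
print(len(emb_continuous), len(emb_broken), straddling, same_elsewhere)
```

If a real encoder substituted for `toy_embedding` gives very different embeddings away from the boundaries, that would point at the embedding stage (for example per-file normalisation or a single embedding per file) rather than at UIS-RNN.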
