google / uis-rnn

This is the library for the Unbounded Interleaved-State Recurrent Neural Network (UIS-RNN) algorithm, corresponding to the paper Fully Supervised Speaker Diarization.

Home Page: https://arxiv.org/abs/1810.04719

License: Apache License 2.0

Python 98.10% Shell 1.90%
speaker-diarization uis-rnn speaker-recognition supervised-learning clustering supervised-clustering machine-learning

uis-rnn's Introduction

UIS-RNN

Overview

This is the library for the Unbounded Interleaved-State Recurrent Neural Network (UIS-RNN) algorithm. UIS-RNN solves the problem of segmenting and clustering sequential data by learning from examples.

This algorithm was originally proposed in the paper Fully Supervised Speaker Diarization.

This work was introduced on the Google AI Blog.

Disclaimer

This open-source implementation is slightly different from the internal one we used to produce the results in the paper, due to dependencies on some internal libraries.

We CANNOT share the data, code, or model for the speaker recognition system (d-vector embeddings) used in the paper, since the speaker recognition system heavily depends on Google's internal infrastructure and proprietary data.

This library is NOT an official Google product.

We welcome community contributions (guidelines) to the uisrnn/contrib folder. But we won't be responsible for the correctness of any community contributions.

Dependencies

This library depends on:

  • python 3.5+
  • numpy 1.15.1
  • pytorch 1.3.0
  • scipy 1.1.0 (for evaluation only)

Getting Started

A video introduction is available on YouTube.

Install the package

Without downloading the repository, you can install the package by:

pip3 install uisrnn

or

python3 -m pip install uisrnn

Run the demo

To get started, simply run this command:

python3 demo.py --train_iteration=1000 -l=0.001

This will train a UIS-RNN model using data/toy_training_data.npz, then store the model on disk, perform inference on data/toy_testing_data.npz, print the inference results, and save the averaged accuracy in a text file.

Note: the files under data/ are manually generated toy data, for demonstration purposes only. The data are very simple, so you should get 100% accuracy on the testing data.
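
If you want to inspect the toy data before running the demo, a minimal sketch (run from the repository root; the code simply lists whatever arrays the archive contains) is:

import numpy as np

# Load the toy data archive and list the arrays it contains.
train_data = np.load('data/toy_training_data.npz', allow_pickle=True)
print(train_data.files)
for name in train_data.files:
    print(name, np.shape(train_data[name]))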

Run the tests

You can also verify the correctness of this library by running:

bash run_tests.sh

If you fork this library and make local changes, be sure to use these tests as a sanity check.

These tests are also great examples for learning the APIs, especially tests/integration_test.py.

Core APIs

Glossary

General Machine Learning    Speaker Diarization
------------------------    --------------------
Sequence                    Utterance
Observation / Feature       Embedding / d-vector
Label / Cluster ID          Speaker

Arguments

In your main script, call this function to get the arguments:

model_args, training_args, inference_args = uisrnn.parse_arguments()

Model construction

All algorithms are implemented as the UISRNN class. First, construct a UISRNN object by:

model = uisrnn.UISRNN(args)

The definitions of the args are described in uisrnn/arguments.py. See model_parser.
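
For example, you might adjust a couple of model arguments before construction. The attribute names below follow uisrnn/arguments.py (check model_parser for the full list); the specific values are placeholders, not recommendations:

import uisrnn

model_args, training_args, inference_args = uisrnn.parse_arguments()

# Adjust model arguments before constructing the model (values are illustrative).
model_args.observation_dim = 256   # must match the dimension of your embeddings
model_args.rnn_hidden_size = 512
model_args.rnn_depth = 1

model = uisrnn.UISRNN(model_args)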

Training

Next, train the model by calling the fit() function:

model.fit(train_sequences, train_cluster_ids, args)

The definitions of the args are described in uisrnn/arguments.py. See training_parser.

The fit() function accepts two types of input, as described below.

Input as list of sequences (recommended)

Here, train_sequences is a list of observation sequences. Each observation sequence is a 2-dim numpy array of type float.

  • The first dimension is the length of this sequence. And the length can vary from one sequence to another.
  • The second dimension is the size of each observation. This must be consistent among all sequences. For speaker diarization, the observation could be the d-vector embeddings.

train_cluster_ids is also a list, which has the same length as train_sequences. Each element of train_cluster_ids is a 1-dim list or numpy array of strings, containing the ground truth labels for the corresponding sequence in train_sequences. For speaker diarization, these labels are the speaker identifiers for each observation.
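
A minimal sketch of this calling convention, using random synthetic data (the shapes, labels, and lowered iteration count are all illustrative; real inputs would be d-vectors produced by a speaker encoder):

import numpy as np
import uisrnn

model_args, training_args, _ = uisrnn.parse_arguments()
model_args.observation_dim = 256        # feature size of each observation
training_args.train_iteration = 100     # keep this smoke test short
model = uisrnn.UISRNN(model_args)

# Two synthetic observation sequences with different lengths but the same feature size.
train_sequences = [
    np.random.rand(100, 256),   # sequence 1: 100 observations
    np.random.rand(80, 256),    # sequence 2: 80 observations
]
# One string label per observation in each sequence.
train_cluster_ids = [
    ['A'] * 60 + ['B'] * 40,
    ['A'] * 50 + ['C'] * 30,
]

model.fit(train_sequences, train_cluster_ids, training_args)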

When calling fit() in this way, please be very careful with the argument --enforce_cluster_id_uniqueness.

For example, assume:

train_cluster_ids = [['a', 'b'], ['a', 'c']]

If the label 'a' from the two sequences refers to the same cluster across the entire dataset, then we should have enforce_cluster_id_uniqueness=False; otherwise, if 'a' is only a local indicator to distinguish from 'b' in the 1st sequence, and to distinguish from 'c' in the 2nd sequence, then we should have enforce_cluster_id_uniqueness=True.

Also note that when calling fit() in this way, we internally concatenate all sequences and all cluster IDs, and delegate to the approach described in the next section.

Input as single concatenated sequence

Here, train_sequences should be a single 2-dim numpy array of type float, for the concatenated observation sequences.

For example, suppose you have M training utterances, each utterance is a sequence of L embeddings, and each embedding is a vector of D numbers. Then the shape of train_sequences is N * D, where N = M * L.

train_cluster_ids is a 1-dim list or numpy array of strings, of length N. It is the concatenated ground truth labels of all training data.

Since we are concatenating observation sequences, the ground truth labels in train_cluster_ids must be globally unique across different sequences.

For example, if the set of labels in the first sequence is {'A', 'B', 'C'} and the set of labels in the second sequence is {'B', 'C', 'D'}, then before concatenation we should rename them to something like {'1_A', '1_B', '1_C'} and {'2_B', '2_C', '2_D'}, unless 'B' and 'C' in the two sequences are meaningfully identical (in speaker diarization, this means they are the same speakers across utterances). This renaming is automatically taken care of by the --enforce_cluster_id_uniqueness argument when using the list input described in the previous section.
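
As a concrete sketch with synthetic numbers (M = 2 utterances, L = 50 embeddings each, D = 4, so N = 100; all values are illustrative):

import numpy as np
import uisrnn

model_args, training_args, _ = uisrnn.parse_arguments()
model_args.observation_dim = 4          # D
training_args.train_iteration = 100     # keep the example short
model = uisrnn.UISRNN(model_args)

# Concatenate M = 2 utterances, each with L = 50 embeddings of dimension D = 4.
utterance_1 = np.random.rand(50, 4)
utterance_2 = np.random.rand(50, 4)
train_sequence = np.concatenate([utterance_1, utterance_2], axis=0)   # shape (100, 4)

# Globally unique labels: prefix each label with its utterance index.
train_cluster_id = np.array(
    ['1_A'] * 30 + ['1_B'] * 20 +    # labels for utterance 1
    ['2_B'] * 25 + ['2_C'] * 25)     # labels for utterance 2

model.fit(train_sequence, train_cluster_id, training_args)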

The reason we concatenate all training sequences is that we resample and block-wise shuffle the training data as a data augmentation process, so that the resulting model is robust even when the number of training sequences is small.

Training on large datasets

For large datasets, the data usually cannot be loaded into memory at once. In such cases, the fit() function needs to be called multiple times.

Here we provide a few guidelines as our suggestions:

  1. Do not feed different datasets into different calls of fit(). Instead, each call of fit() should cover sequences from all of your datasets.
  2. For each call to fit(), make the size of the input roughly the same, and don't make the input size too small. A rough sketch of this chunked training pattern is shown below.
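
The sketch below assumes the training data has been pre-sharded into several .npz files that each mix sequences from all datasets; the file layout and key names are assumptions for illustration, not a convention of this library:

import glob
import numpy as np
import uisrnn

model_args, training_args, _ = uisrnn.parse_arguments()
model = uisrnn.UISRNN(model_args)

# Each hypothetical shard holds a concatenated sequence and its cluster IDs,
# with roughly the same total size per shard.
for shard_path in sorted(glob.glob('shards/*.npz')):
    shard = np.load(shard_path, allow_pickle=True)
    model.fit(shard['train_sequence'], shard['train_cluster_id'], training_args)

model.save('saved_model.uisrnn')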

Prediction

Once we are done with training, we can run the trained model to perform inference on new sequences by calling the predict() function:

predicted_cluster_ids = model.predict(test_sequences, args)

Here test_sequences should be a list of 2-dim numpy arrays of type float, corresponding to the observation sequences for testing.

The returned predicted_cluster_ids is a list of the same size as test_sequences. Each element of predicted_cluster_ids is a list of integers, with the same length as the corresponding test sequence.

You can also use a single test sequence for test_sequences. Then the returned predicted_cluster_ids will also be a single list of integers.

The definitions of the args are described in uisrnn/arguments.py. See inference_parser.
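
Putting the pieces together, a prediction call might look like the sketch below (the model file name, the sequence shapes, and the 256-dim features are placeholders):

import numpy as np
import uisrnn

model_args, _, inference_args = uisrnn.parse_arguments()
model = uisrnn.UISRNN(model_args)
model.load('saved_model.uisrnn')   # a model previously stored with model.save()

# A list of test sequences; each row is one observation (e.g. a d-vector).
test_sequences = [np.random.rand(50, 256), np.random.rand(70, 256)]
predicted_cluster_ids = model.predict(test_sequences, inference_args)

for sequence, cluster_ids in zip(test_sequences, predicted_cluster_ids):
    # One integer label per observation in the corresponding sequence.
    print(len(sequence), len(cluster_ids))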

Citations

Our paper is cited as:

@inproceedings{zhang2019fully,
  title={Fully supervised speaker diarization},
  author={Zhang, Aonan and Wang, Quan and Zhu, Zhenyao and Paisley, John and Wang, Chong},
  booktitle={International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={6301--6305},
  year={2019},
  organization={IEEE}
}

References

Baseline diarization system

To learn more about our baseline diarization system based on unsupervised clustering algorithms, check out this site.

A Python re-implementation of the spectral clustering algorithm used in this paper is available here.

The ground truth labels for the NIST SRE 2000 dataset (Disk6 and Disk8) can be found here.

For more public resources on speaker diarization, check out awesome-diarization.

Speaker recognizer/encoder

To learn more about our speaker embedding system, check out this site.

We are aware of several third-party implementations of this work:

Please use your own judgement to decide whether you want to use these implementations.

We are NOT responsible for the correctness of any third-party implementations.

Variants

Here we list repositories that are based on UIS-RNN but integrate other technologies or add improvements.

Link                            Description
taylorlu/Speaker-Diarization    Speaker diarization using UIS-RNN and GhostVLAD. An easier way to support open-set speakers.
DonkeyShot21/uis-rnn-sml        A variant of UIS-RNN, for the paper Supervised Online Diarization with Sample Mean Loss for Multi-Domain Data.

uis-rnn's People

Contributors

aluminumbox, anzcol, donkeyshot21, mattmatters, wq2012


uis-rnn's Issues

handle overlapped speech

In your paper, during evaluation, you exclude overlapped speech. Which of the options below is your solution?

  1. Your method processes the whole audio and ignores errors during the overlapped speech.
  2. Your method first trims the overlapped parts from the audio, then processes the trimmed audio.

And during training, which is the solution?

Dataset

Which dataset should I use for training the network?

uis-rnn can't work on datasets with long utterances?

Describe the question

In the diarization task, I train on the AMI train-dev set and the ICSI corpus, and test on the AMI test set. Both datasets include audios of 3-5 speakers lasting 50-70 minutes. My d-vector embedding is trained on VoxCeleb 1 and 2 with EER = 4.55%. I train uis-rnn with a window size of 240ms, 50% overlap, and a segment size of 400ms. The results are poor on both the training and test sets.
I also read all the uis-rnn code, and I don't understand: 1) why do you split up the original utterances, concatenate them by speaker, and then use that as input for training? 2) why does the input ignore which audio an utterance belongs to, merging all utterances into one single audio? This process seems completely different from the inference process, and it also reduces the usable batch size if one speaker talks too much.
For a 1-hour audio, the output has 20-30 speakers instead of 3-5 speakers, no matter how small crp_alpha is.

My background

Have I read the README.md file?

  • yes

Have I searched for similar questions from closed issues?

  • yes

Have I tried to find the answers in the paper Fully Supervised Speaker Diarization?

  • yes

Have I tried to find the answers in the reference Speaker Diarization with LSTM?

  • yes

Have I tried to find the answers in the reference Generalized End-to-End Loss for Speaker Verification?

  • yes

[Question] Which feature was used for VAD?

Describe the question

Hi, thanks for open-sourcing this awesome project.
Which feature was used for VAD? d-vector or PLP features (as you mentioned in "Speaker Diarization With LSTM") ?

My background

Have I read the README.md file?
yes
Have I searched for similar questions from closed issues?
yes
Have I tried to find the answers in the paper Fully Supervised Speaker Diarization?
yes
Have I tried to find the answers in the reference Speaker Diarization with LSTM?
yes
Have I tried to find the answers in the reference Generalized End-to-End Loss for Speaker Verification?
yes

Better fit() API, to accept list of sequences

fit() should accept list of arrays for both training sequences and training cluster IDs, and do the concatenation for the user.

The current API requires the user to do the concatenation themselves, which could cause unnecessary confusion.

Question on training loss

During the fit process, the loss has 3 parts. Could anyone tell me the meaning of loss2? And why does the code count the number of non-zero elements (just above the calculation of loss2)?

[Question] How to prepare embedding data for training UIS-RNN?

Describe the question

Hi, thank you for open-sourcing this!
I have read the README.md file and almost all the issues under this repo, but I'm still puzzled about data pre-processing.

My understanding is that before training the UIS-RNN, a speaker embedding network should be trained in advance on single-speaker utterance-level features, as mentioned in the GE2E loss paper. After that, frame-level features generated from raw data are fed to the embedding network to produce frame-level embeddings, and I can then use them to train my UIS-RNN. Am I right about that? I'm wondering whether these frame-level embeddings are the 'continuous d-vector embeddings (as sequences)' you mentioned.

I am a newcomer to speaker diarization and this question really confused me, so I'd be very grateful if you could help. Thanks :)

My background

Have I read the README.md file?

  • yes

Have I searched for similar questions from closed issues?

  • yes

Have I tried to find the answers in the paper Fully Supervised Speaker Diarization?

  • yes

Have I tried to find the answers in the reference Speaker Diarization with LSTM?

  • yes

Have I tried to find the answers in the reference Generalized End-to-End Loss for Speaker Verification?

  • yes

Model predicts new cluster for each input after calling load()

Hi,

I've loaded a saved model trained on a custom dataset using model.load(). When I predict the test set with model.predict(), instead of labels I'm getting a sequence of numbers that appears to start from the length of the sequences passed to predict(). A screenshot is attached below for your reference.

(screenshot omitted)

Thank you in advance.
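
For context, a minimal save/load round trip with this library might look like the sketch below (tiny synthetic data; the file name, sizes, and iteration count are illustrative):

import numpy as np
import uisrnn

model_args, training_args, inference_args = uisrnn.parse_arguments()
model_args.observation_dim = 16
training_args.train_iteration = 50

# Train on tiny synthetic data and save the model to disk.
model = uisrnn.UISRNN(model_args)
train_sequence = np.random.rand(200, 16)
train_cluster_id = np.array(['A'] * 50 + ['B'] * 50 + ['A'] * 50 + ['B'] * 50)
model.fit(train_sequence, train_cluster_id, training_args)
model.save('custom_model.uisrnn')

# Later: rebuild with the same model args, load the weights, then predict.
restored = uisrnn.UISRNN(model_args)
restored.load('custom_model.uisrnn')
print(restored.predict(np.random.rand(20, 16), inference_args))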

2 space indentation

Hi Team,

Thanks for open-sourcing this code. On a lighter note, is there a reason for coding with 2-space indentation rather than 4-space?

Thanks,
Dalon

Any plan on upgrading to pytorch 1.0+?

Describe the question

I wanted to use this on a Tesla T4 on a Google Cloud VM. Apparently, the T4 does not support CUDA versions lower than CUDA 10, while PyTorch 0.4.1 requires a lower version of CUDA.

My background

I just finished my masters and working as a speech recognition engineer.

Have I read the README.md file?

  • yes

Have I searched for similar questions from closed issues?

  • yes

Have I tried to find the answers in the paper Fully Supervised Speaker Diarization?

  • yes

Have I tried to find the answers in the reference Speaker Diarization with LSTM?

  • yes

Have I tried to find the answers in the reference Generalized End-to-End Loss for Speaker Verification?

  • yes

Obscurity involved in sampling rate information of datasets used

The paper specifies that VoxCeleb (which has a sampling rate of 16kHz) is one of the public datasets used for training the d-vector model, while the testing dataset is 2000 CALLHOME (which is 8kHz).
Please clarify this sampling rate mismatch.

Information about the Data

Hello,

Thank you for open-sourcing this! I haven't found information regarding the data you provide. What is it / where does it come from?

Add a `partial_fit()` API

Add a new API named partial_fit(), which allows the user to train on a single sequence.

This API would target advanced users. The user would need to write their own code for things like data shuffling.

Ideally, the fit() function should be calling the partial_fit() function.

Batch prediction? - or allow prediction using multiprocessing

Describe the question

The documentation states that one can only apply prediction to one sequence at a time.

On my machine (with GPU), it takes more than 10s to process one sequence with 100 samples.
It would be nice to support batch prediction to make processing large collections of sequences faster.
Beam search probably makes it impossible, though.

Anyway, thanks for open-sourcing this implementation. This is really appreciated!

My background

Have I read the README.md file?

  • yes

Have I searched for similar questions from closed issues?

  • yes

Have I tried to find the answers in the paper Fully Supervised Speaker Diarization?

  • yes

Have I tried to find the answers in the reference Speaker Diarization with LSTM?

  • not applicable

Have I tried to find the answers in the reference Generalized End-to-End Loss for Speaker Verification?

  • not applicable

The number of speakers and whether the speakers' content needs to be the same.

Describe the question

A clear and concise description of what the question is.

My background

Have I read the README.md file?

  • yes/no - if you answered no, please stop filing the issue, and read it first

Have I searched for similar questions from closed issues?

  • yes/no - if you answered no, please do it first

Have I tried to find the answers in the paper Fully Supervised Speaker Diarization?

  • yes/no

Have I tried to find the answers in the reference Speaker Diarization with LSTM?

  • yes/no

Have I tried to find the answers in the reference Generalized End-to-End Loss for Speaker Verification?

  • yes/no
    Hello, our preparations are complete and we now want to train on our own voice data. I want to ask three questions: 1. At least how many speakers do we need? 2. How long does each speaker need to speak? 3. Does each speaker need to speak the same sentence? Thank you for your guidance.

ValueError: not enough values to unpack (expected 2, got 1)

@wq2012 I recorded audio of myself and three other speakers, 4 seconds each, mono channel, 16kHz WAV format.
Note: are there any restrictions on audio duration, format, or size?
I passed my raw audio arrays as train_data, as well as test_data:

label_to_center = {
    'A': np.array(a[1], dtype=float),
    'B': np.array(b[1], dtype=float),
    'C': np.array(c[1], dtype=float),
    'D': np.array(d[1], dtype=float),
}
python3 integration_test.py 
(63488,)
[  0.   0.  -2. ... 700. 687. 679.]
(64884,)
[ 8. 16.  2. ... 43. 44. 50.]
(63488,)
[  0.   0.  -2. ... 350. 392. 424.]
(63488,)
[  0.   2.  -8. ... 364. 343. 421.]
E
======================================================================
ERROR: test_four_clusters (__main__.TestIntegration)
Four clusters on vertices of a square.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "integration_test.py", line 86, in test_four_clusters
    train_cluster_id, label_to_center, sigma=0.01)
  File "integration_test.py", line 42, in _generate_random_sequence
    result = np.vstack((result, label_to_center[id]))
  File "/home/dell/Pictures/dp-0.1.1/12-09-2018/voice_reg/mycroft-precise/.venv/lib/python3.6/site-packages/numpy/core/shape_base.py", line 234, in vstack
    return _nx.concatenate([atleast_2d(_m) for _m in tup], 0)
ValueError: all the input array dimensions except for the concatenation axis must match exactly

----------------------------------------------------------------------
Ran 1 test in 0.038s

FAILED (errors=1)

ValueError: all the input array dimensions except for the concatenation axis must match exactly
result = np.concatenate((result, label_to_center[id]))

Then it shows this error:

ValueError: not enough values to unpack (expected 2, got 1)

ERROR: test_four_clusters (__main__.TestIntegration)
Four clusters on vertices of a square.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "integration_test.py", line 112, in test_four_clusters
    model.fit(train_sequence, np.array(train_cluster_id), training_args)
  File "/home/dell/Pictures/dp-0.1.1/12-09-2018/voice_reg/mycroft-precise/uis-rnn/uis-rnn/model/uisrnn.py", line 180, in fit
    train_total_length, observation_dim = train_sequence.shape
ValueError: not enough values to unpack (expected 2, got 1)

my audio sequence shape:
np.shape(np.array(a[1],dtype=float)) --> (63488,)
np.array(a[1],dtype=float) --> [ 0. 0. -2. ... 700. 687. 679.]

uis-rnn ./data/training_data.npz sequence shape:
np.shape(sequence[sampled_idx_sets[j], :]) --> (39, 256) shape of sequence

If the numpy shape were the issue, it would be resolved automatically by utils.py,
but execution never reaches the utils.resize_sequence function.

Issue with the audio sequence:

train_sequence = _generate_random_sequence(train_cluster_id, label_to_center, sigma=0.01)
print("train_seq...............", train_sequence)

train_seq............... [ 4.17022005e-03  7.20324493e-03 -1.99999886e+00 ...  7.00003598e+02
  6.87002227e+02  6.79005481e+02]

(63906800, ) train_sequence.shape

train_total_length, observation_dim = train_sequence.shape
ValueError: not enough values to unpack (expected 2, got 1)

How can I resolve this issue, @wq2012? Thanks in advance.

Test model

Sorry if this sounds like a dumb question; I am not an expert in either Python or speaker diarization. After I have trained the model, how can I use it to determine who is speaking in a wave file? I am trying to determine who is speaking in a telephone conversation recorded as a single audio file.

Could I, for example, use test_sequence = wavfile.read(mywav) as an input to
predicted_cluster_id = model.predict(test_sequence, args), and get a prediction of who spoke in this file?

My question is more about the use of the code. I hope you can help!

How to assign a speaker to each segment from overlapped windows of frames? Prediction on real-time data?

Describe the question

A clear and concise description of what the question is.
Summary of work:
The audio signal is transformed into frames (log-mel-filterbank energy features) with a frame width of 25ms and a step of 10ms. The frames are then grouped into overlapping windows of size 240ms with 50% overlap. A window-level d-vector is calculated, and the d-vectors are then grouped into segments of 400ms or more, so that each segment contains a single speaker's d-vectors.
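
A rough numpy sketch of the window-to-segment aggregation described above (the d-vectors are random placeholders, and averaging consecutive windows is just one common aggregation choice, not something prescribed by this library):

import numpy as np

# Window-level d-vectors, e.g. one 256-dim vector per 240ms window with 50% overlap.
window_dvectors = np.random.rand(200, 256)   # placeholder for real encoder outputs
windows_per_segment = 4                      # windows grouped per segment (illustrative)

# Average consecutive window d-vectors into segment-level embeddings.
num_segments = len(window_dvectors) // windows_per_segment
trimmed = window_dvectors[:num_segments * windows_per_segment]
segment_dvectors = trimmed.reshape(num_segments, windows_per_segment, -1).mean(axis=1)
print(segment_dvectors.shape)   # (50, 256)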

Questions:
During testing, since each audio file contains utterances of different speakers, if we make overlapping windows of frames:

  1. How can we be sure that a 400ms segment will represent a single speaker?
  2. If we fix each segment at 400ms, won't that affect the accuracy?
  3. How do I perform real-time prediction? If I have an audio file, how do I get speaker-wise timestamps for each utterance?

Help appreciated.

My background

Have I read the README.md file?

  • yes

Have I searched for similar questions from closed issues?

  • yes

Have I tried to find the answers in the paper Fully Supervised Speaker Diarization?

  • yes

Have I tried to find the answers in the reference Speaker Diarization with LSTM?

  • yes

Have I tried to find the answers in the reference Generalized End-to-End Loss for Speaker Verification?

  • yes

Refactor fit() and predict()

Current fit() and predict() functions are too long.

We should refactor them to move some components out and make them more readable.

[Question] Can I use speaker annotated datasets in other language rather than English?

Describe the question

Hi,
This might be a naive question, but can I use speaker-annotated speech corpora from different languages (English, Mandarin, etc.), combine them, and train the speaker embedding component? Are speaker embeddings / UIS-RNN language-independent?

My background

Have I read the README.md file?
yes

Have I searched for similar questions from closed issues?
yes

Have I tried to find the answers in the paper Fully Supervised Speaker Diarization?
yes

Have I tried to find the answers in the reference Speaker Diarization with LSTM?
yes

Have I tried to find the answers in the reference Generalized End-to-End Loss for Speaker Verification?
yes

Understanding diarization labels

Describe the question

A clear and concise description of what the question is.

My background

Have I read the README.md file?

  • yes/no - if you answered no, please stop filing the issue, and read it first

Have I searched for similar questions from closed issues?

  • yes/no - if you answered no, please do it first

Have I tried to find the answers in the paper Fully Supervised Speaker Diarization?

  • yes/no

Have I tried to find the answers in the reference Speaker Diarization with LSTM?

  • yes/no

Have I tried to find the answers in the reference Generalized End-to-End Loss for Speaker Verification?

  • yes/no

Hello, we used third-party tools to generate train_sequence and train_cluster_id, and completed the training. We trained on 46 people and tested on one of them. The prediction accuracy of the model was 98%. However, we can't understand the relationship between the real labels and the predicted labels. Although the accuracy is high, we cannot find out who each speaker is.
We also don't understand the labels in the demo you gave us. Thank you for your guidance.

run_tests.sh problem

Hi,
I ran demo.py twice. The first time it worked well and its accuracy was 1. But when I deleted the model and tried again, it still ran, but the result was only about 0.8. I'm sure I didn't change the program. I have tried deleting the whole program and cloning it again; the result is still about 0.8. Then I ran run_tests.sh and got an error:

======================================================================
FAIL: test_four_clusters (main.TestIntegration)
Four clusters on vertices of a square.

Traceback (most recent call last):
File "tests/integration_test.py", line 99, in test_four_clusters
self.assertEqual(1.0, accuracy)
AssertionError: 1.0 != 0.9


Ran 1 test in 17.543s

FAILED (failures=1)

Something strange must be happening. Could anyone tell me what could lead to this?
Thanks.

Pretrained Models

Hello,

I know this is not exactly the model used in the paper, but I was wondering if you planned to release pretrained models on the datasets of section 4.4 of the paper?

Many thanks.

about the training loss and the batch size

I want to know whether the loss below is normal or not. I set the batch size to 10; then, no matter how I change the dataset, the loss converges to about 900.
(screenshot omitted)

My background

Have I read the README.md file?

  • yes

Have I searched for similar questions from closed issues?

  • yes

Have I tried to find the answers in the paper Fully Supervised Speaker Diarization?

  • yes

Have I tried to find the answers in the reference Speaker Diarization with LSTM?

  • yes

Have I tried to find the answers in the reference Generalized End-to-End Loss for Speaker Verification?

  • yes

[Question]Performance degrade for different win size

Describe the question

I train the embedding LSTM with variable-length batches of about [24, 160] frames. When I extract d-vectors from this model with a window size of 160 frames and use them to train uis-rnn, the prediction result is good (90%). But when I use a window size of 40 frames to get the d-vectors and train uis-rnn, the prediction result is bad (40%). What could cause this problem?

Interpreting the loss values

Hi Team,

I'm training the model on a custom dataset, and I'm finding it confusing to interpret the various losses displayed during training. For example, is a large negative training loss better, or should I concentrate on the negative log likelihood? To sum up, how will I know that the model is converging? The image below is from my training process.

(screenshot omitted)

Your help is greatly appreciated.

Performance degrade for multi-person meeting

During experiments, the model's performance is fine for conversational telephone speech, but it degrades seriously for multi-person meeting scenarios such as ICSI, where the confusion error can be 30%. Only the DER for NIST SRE 2000 CALLHOME is provided in the paper. Since you use ICSI as part of the training set in the paper, did you test the performance of the model on ICSI?

Large datasets cause training machine to run out of memory

Hi,

I am working on training a uis-rnn model with dataset voxceleb2: http://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox2.html.

  • Step 1: use my embedding model to generate npz files.
  • Step 2: then load npz with uis-rnn api to start training.

But loading the npz files as training data causes out-of-memory issues, which takes down my machine. There are over 1,000,000 training clips in the dataset. Is it possible to make large datasets work with this API?

Thanks,
Xin

Allow controlling verbosity level

We should add a flag to control the verbosity level.

A low level would only print the most necessary information, while a high level would print lots of logging information to help debugging.

How to embed audio stream data into a d-vector (D = 512)

Hi, thank you for open source it !

I read your paper and tests/integration_test.py. My question is: how do you embed the audio stream data with D = 512?
It's essentially the same as the question here:
how do you generate training data or test data from an audio stream?

Is it something like librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=40)?
Your paper says:
In this system, audio signals are first transformed into frames of width 25ms and step 10ms, and log-mel-filterbank energies of dimension 40 are extracted from each frame as the network input. These frames form overlapping sliding windows of a fixed length, on which we run the LSTM network. The last-frame output of the LSTM is then used as the d-vector representation of this sliding window.
How can I reproduce this part?
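
One possible way to reproduce the frame-level features described in that quote is with librosa; this is an assumption about tooling (the paper's internal front-end is not public), and the file name is a placeholder:

import numpy as np
import librosa

# Compute 40-dim log-mel-filterbank energies with 25ms frames and a 10ms step.
y, sr = librosa.load('some_audio.wav', sr=16000)
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=int(0.025 * sr), hop_length=int(0.010 * sr), n_mels=40)
log_mel = np.log(mel + 1e-6)   # shape: (40, num_frames)

# These frames would then be grouped into sliding windows and fed to a separately
# trained LSTM speaker encoder to produce d-vectors (not provided by this repo).
print(log_mel.shape)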

I appreciate it, waiting for your response!
Thanks,
Bo

[Invalid][Cloud] Speaker tag is not accurate

Describe the bug

I have tested my audio file for speaker diarization and the result is not accurate. I have attached the audio file (speaker_tag issue.wav) and my Python code.
Is there any problem with my Python code or audio file?

To Reproduce

This is my python code for speaker diarization.

from google.cloud import speech_v1p1beta1 as speech
from google.oauth2 import service_account
import os
client = speech.SpeechClient(credentials=service_account.Credentials.from_service_account_file(os.getenv("GOOGLE_APPLICATION_CREDENTIALS")))


#audio = speech.types.RecognitionAudio(content=content)

audio = speech.types.RecognitionAudio(uri = 'STORAGE_AUDIO_URL')

config = speech.types.RecognitionConfig(
    encoding=speech.enums.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=48000,
    language_code='en-US',
    enable_speaker_diarization=True,
    diarization_speaker_count=2)


operation = client.long_running_recognize(config, audio)

response = operation.result(timeout=1000)

result = response.results[-1]

words_info = result.alternatives[0].words

# Printing out the output:
for word_info in words_info:
    print("word: '{}', speaker_tag: {}".format(word_info.word,
                                               word_info.speaker_tag))

Data samples

Audio file google drive link here

Above audio file Output:-
word: 'he', speaker_tag: 2
word: 'sighed', speaker_tag: 2
word: 'what', speaker_tag: 2
word: 'brings', speaker_tag: 2
word: 'you', speaker_tag: 2
word: 'in', speaker_tag: 2
word: 'today', speaker_tag: 2
word: 'I', speaker_tag: 2
word: 'have', speaker_tag: 2
word: 'a', speaker_tag: 2
word: 'really', speaker_tag: 2
word: 'severe', speaker_tag: 2
word: 'cough', speaker_tag: 2
word: 'really', speaker_tag: 2
word: 'severe', speaker_tag: 2
word: 'headache', speaker_tag: 2
word: 'and', speaker_tag: 2
word: 'my', speaker_tag: 1
word: 'throat', speaker_tag: 2
word: 'really', speaker_tag: 2
word: 'itchy', speaker_tag: 2
word: 'okay', speaker_tag: 2
word: 'let', speaker_tag: 2
word: 'me', speaker_tag: 2
word: 'check', speaker_tag: 2
word: 'seems', speaker_tag: 2
word: 'like', speaker_tag: 2
word: 'you', speaker_tag: 2
word: 'have', speaker_tag: 2
word: 'one', speaker_tag: 2
word: 'or', speaker_tag: 2
word: 'two', speaker_tag: 2
word: 'temperature', speaker_tag: 2
word: 'to', speaker_tag: 2
word: 'did', speaker_tag: 2
word: 'you', speaker_tag: 2
word: 'take', speaker_tag: 2
word: 'any', speaker_tag: 2
word: 'medication', speaker_tag: 2
word: 'what', speaker_tag: 2
word: 'dosage', speaker_tag: 2
word: 'will', speaker_tag: 2
word: 'you', speaker_tag: 2
word: 'take', speaker_tag: 2
word: 'in', speaker_tag: 2
word: 'animal', speaker_tag: 1
word: 'okay', speaker_tag: 1
word: 'let', speaker_tag: 2
word: 'me', speaker_tag: 2
word: 'take', speaker_tag: 2
word: 'a', speaker_tag: 2
word: 'look', speaker_tag: 2
word: 'at', speaker_tag: 2
word: 'it', speaker_tag: 2
word: 'it's', speaker_tag: 2
word: 'like', speaker_tag: 2
word: 'a', speaker_tag: 2
word: 'you', speaker_tag: 2
word: 'got', speaker_tag: 2
word: 'a', speaker_tag: 2
word: 'flu', speaker_tag: 2
word: 'did', speaker_tag: 2
word: 'you', speaker_tag: 2
word: 'take', speaker_tag: 2
word: 'your', speaker_tag: 2
word: 'flu', speaker_tag: 2
word: 'shot', speaker_tag: 2
word: 'so', speaker_tag: 2
word: 'the', speaker_tag: 2
word: 'intensity', speaker_tag: 2
word: 'might', speaker_tag: 2
word: 'be', speaker_tag: 2
word: 'low', speaker_tag: 2
word: 'why', speaker_tag: 2
word: 'don't', speaker_tag: 2
word: 'you', speaker_tag: 2
word: 'continue', speaker_tag: 2
word: 'taking', speaker_tag: 2
word: 'your', speaker_tag: 2
word: 'Tylenol', speaker_tag: 2
word: 'for', speaker_tag: 2
word: 'your', speaker_tag: 2
word: 'draw', speaker_tag: 2
word: 'temperature', speaker_tag: 2
word: 'in', speaker_tag: 2
word: 'your', speaker_tag: 2
word: 'headache', speaker_tag: 2
word: 'and', speaker_tag: 2
word: 'write', speaker_tag: 2
word: 'some', speaker_tag: 2
word: 'cough', speaker_tag: 2
word: 'syrup', speaker_tag: 2
word: 'so', speaker_tag: 2
word: 'if', speaker_tag: 2
word: 'you', speaker_tag: 2
word: 'can', speaker_tag: 2
word: 'get', speaker_tag: 2
word: 'it', speaker_tag: 2
word: 'you', speaker_tag: 2
word: 'can', speaker_tag: 2
word: 'get', speaker_tag: 2
word: 'it', speaker_tag: 2
word: 'in', speaker_tag: 2
word: 'the', speaker_tag: 2
word: 'pharmacy', speaker_tag: 2
word: 'thank', speaker_tag: 2
word: 'you', speaker_tag: 2

The above output is not accurate for the audio file: nearly all words show speaker tag 2. Please check the audio file against the output.

Versions

google-cloud-speech==0.36.0

Have you train the model on one database and test it on another database?

In your paper, "We randomly partition the dataset into five subsets, and each time leave one subset for evaluation, and train UIS-RNN on the other four subsets. Then we combine the evaluation on five subsets and report the averaged DER." So the persons in traindatas also appeare in the testdata.
Have you train the model on one database and test the model on another database? The persons in traindatas dont appeare in the testdata.
I have done it ,but get the bad result.

Use a real sequence accuracy evaluation instead of an approximate accuracy

Currently in the demo we use a greedy method to compute sequence match accuracy. It's just an approximation, an estimate of how well the predicted sequence matches the ground truth.

The correct way would be to use the Hungarian algorithm to compute the optimal match, which is more meaningful.
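
For reference, a minimal sketch of an optimal-match accuracy using scipy's Hungarian solver (scipy.optimize.linear_sum_assignment); the tiny label arrays at the bottom are purely illustrative:

import numpy as np
from scipy.optimize import linear_sum_assignment

def optimal_match_accuracy(ground_truth, prediction):
    """Accuracy under the best one-to-one mapping between the two label sets."""
    ground_truth = np.asarray(ground_truth)
    prediction = np.asarray(prediction)
    gt_labels = sorted(set(ground_truth.tolist()))
    pr_labels = sorted(set(prediction.tolist()))
    # Cost matrix: negated co-occurrence counts for each (truth, prediction) label pair.
    cost = np.zeros((len(gt_labels), len(pr_labels)))
    for i, g in enumerate(gt_labels):
        for j, p in enumerate(pr_labels):
            cost[i, j] = -np.sum((ground_truth == g) & (prediction == p))
    row_ind, col_ind = linear_sum_assignment(cost)
    return -cost[row_ind, col_ind].sum() / len(ground_truth)

print(optimal_match_accuracy(['A', 'A', 'B', 'B'], [0, 0, 0, 1]))   # 0.75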

Is the GRU really needed to predict mu_t ?

I spent some time trying to figure out what the GRU really does.
My understanding is that it is used to estimate the running mean (mu_t in the paper) of each cluster.

I can see the benefit of an RNN for this (it can learn not to take some noisy samples into account), but I am wondering whether you had the chance to compare it to an actual running mean.

Question on time cost during each iteration

Hi,
I've tried to use LibriSpeech to train the model, and I found that the "backward" step (loss.backward()) takes the longest time in each iteration (almost 95% of the time). And the larger the dataset, the more time is consumed. Is that normal? Why is the backward pass affected by the amount of data?
Thank you in advance.

How to convert audio data into test data of algorithm for testing

Describe the question

A clear and concise description of what the question is.

My background

Have I read the README.md file?

  • yes

Have I searched for similar questions from closed issues?

  • yes

Have I tried to find the answers in the paper Fully Supervised Speaker Diarization?

  • yes

Hello, I have read README.md. I want to convert my audio data into the test and training data needed by the model. How can I do this? I have also tried the third-party methods provided in README.md; however, they are for specific datasets, such as TIMIT, and cannot run successfully on our own audio data. Thank you very much for your guidance.

The clustering performance influenced by overlap window size

@wq2012
The overlap rate seems to strongly influence the number of speakers.
When the overlap is larger, the speaker embeddings change more smoothly and the change points become harder to detect, so the model tends to generate fewer speakers.
The size of the sliding window also matters a lot, although that problem is caused by the speaker embedding algorithm.
This is my project integrating the vgg-speaker-recognition algorithm: Speaker-Diarization
Thanks a lot.

How can i train my own data on this ?

I want to use my own dataset and test on it, or let's take the VoxCeleb dataset. Please guide me on how to make the embeddings, create the .npz file in the data folder, and then train on it.

[Question] Would it be possible to publish trained models

Would it be possible to publish trained models for other languages, like Japanese and Mandarin?

Would such models be useful for general speaker isolation on sequences of unknown speakers not in the training set?

How robust is this model to background ambient sound, like traffic noise?

[Question]test on speakers untrained

Describe the question

When I tested on my own datasets, whose speakers have not been trained on, the results are really bad. Can you give me some suggestions?

My background

Have I read the README.md file?

  • yes

Have I searched for similar questions from closed issues?

  • yes

Have I tried to find the answers in the paper Fully Supervised Speaker Diarization?

  • yes

Have I tried to find the answers in the reference Speaker Diarization with LSTM?

  • yes

Have I tried to find the answers in the reference Generalized End-to-End Loss for Speaker Verification?

  • yes

about the version of pytorch and tensorflow

TensorFlow implementation by Janghyun1230
PyTorch implementation by HarryVolek - with UIS-RNN integration
Hello, do the two versions above use the same algorithm as uis-rnn?
What is the difference between them?
Can all of them identify both enrolled and unenrolled utterances?
Can all of them identify new speakers that were not trained on?

Add a `online_predict()` API for streaming input

UIS-RNN is an online algorithm, but the current predict() API of this library is not.

If people want to deploy this library to a production environment for online use cases, an online_predict() API is going to be necessary.

Its usage should be like this:

# Feed the first sequence, and continuously make use of the label.
label = model.online_predict(X1)
label = model.online_predict(X2)
label = model.online_predict(X3)
model.online_predict(reset=True)
# Feed the second sequence, and continuously make use of the label.
label = model.online_predict(Y1)
label = model.online_predict(Y2)
label = model.online_predict(Y3)
label = model.online_predict(Y4)
model.online_predict(reset=True)

However, we may not have the bandwidth to work on this any time soon.
