jymsuper / speakerrecognition_tutorial

Simple d-vector based Speaker Recognition (verification and identification) using Pytorch

License: MIT License

Python 100.00%

deep-learning pytorch speaker-identification speaker-recognition speaker-verification

speakerrecognition_tutorial's Issues

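From the project member who posted a question one or two months ago.

I have built a speaker recognition program that runs in real time, with speaker enrollment and recognition as separate functions. With a good microphone in a quiet environment the recognition rate is not bad, but unexpected variables come up and I have started to doubt the recognition rate. So I wonder whether swapping the CNN-based ResNet model would give better results. It currently seems to run with the default 'Resnet18' model, and I noticed that, as in the picture posted here, there are also 34, 50, 102, and 152 variants. Could you tell me how to switch to one of them?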

Error

Sir, I am unable to convert a single wav file to a .p file. Could you please help?

verification.py

Sir, the speaker is different but the verification still reports a match. How is that possible?

how to calculate EER in this code?

Hi @jymsuper ,

Thanks for sharing this excellent code.

I have gone through the identification.py and verification.py files to calculate the performance after the enrollment process.
Could you give me some ideas about how to calculate the EER?

Many thanks
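For readers with the same question, here is a minimal sketch of an equal error rate (EER) computation. It assumes you have already collected one similarity score per verification trial (higher means "more likely the same speaker") and a 0/1 label per trial; the function name `compute_eer` and the toy scores are illustrative and are not part of the repository. It also assumes that both target and impostor trials are present.

```python
import numpy as np

def compute_eer(scores, labels):
    """Return (eer, threshold) for a list of verification trials.

    scores: similarity scores, higher means "more likely the same speaker"
    labels: 1 for target (same-speaker) trials, 0 for impostor trials
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    n_target = np.sum(labels == 1)
    n_impostor = np.sum(labels == 0)

    best_gap, eer, best_thr = np.inf, None, None
    # Try every observed score as a candidate accept threshold.
    for thr in np.unique(scores):
        accept = scores >= thr
        far = np.sum(accept & (labels == 0)) / n_impostor   # false acceptance rate
        frr = np.sum(~accept & (labels == 1)) / n_target    # false rejection rate
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer, best_thr = gap, (far + frr) / 2, thr
    return eer, best_thr

# Toy trial list (made-up numbers, for illustration only).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 1, 0, 0, 0]
eer, thr = compute_eer(scores, labels)
print(f"EER = {eer:.2%} at threshold {thr}")
```

The EER is the operating point where the false acceptance rate and the false rejection rate are equal; with a finite trial list the two rarely match exactly, so the sketch reports the midpoint at the threshold where they are closest.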

Can you explain how to convert a wav file to a .p file?

Please provide a file.

performance

@jymsuper I want to know whether speakers can be verified (not identified) in the open-set case, that is, when the test speakers are not in the training dataset. If that is possible, I would like to know the performance.

enroll_per_spk does not average

SpeakerRecognition_tutorial/enroll.py

Line 73 in 6dce646

Output the averaged d-vector for each speaker (enrollment)

states that it outputs the averaged d-vector

however, there is no averaging operation; the code only aggregates (sums) the embeddings

SpeakerRecognition_tutorial/enroll.py

Line 90 in 6dce646

embeddings[spk] += activation

is that true?
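To make the distinction concrete, here is a small sketch of what an explicit average would look like. It is not the repository's code: `average_embeddings` and the toy vectors are hypothetical, and it assumes each speaker's per-utterance embeddings are available as NumPy arrays of equal length.

```python
import numpy as np

def average_embeddings(activations_per_spk):
    """Map each speaker to the mean of that speaker's utterance embeddings."""
    averaged = {}
    for spk, activations in activations_per_spk.items():
        stacked = np.stack(activations)                   # (n_utts, embedding_dim)
        averaged[spk] = stacked.sum(axis=0) / len(activations)
    return averaged

# Toy data (made-up values): two utterances for spk_a, one for spk_b.
demo = {
    "spk_a": [np.array([1.0, 2.0]), np.array([3.0, 4.0])],
    "spk_b": [np.array([0.0, 2.0])],
}
result = average_embeddings(demo)
print(result["spk_a"])
```

If the downstream score is cosine similarity (an assumption about how the embeddings are compared), the sum and the mean point in the same direction, so the scores would be identical either way. The division only matters for scale-sensitive measures such as Euclidean distance.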

Error

When a new speaker comes in and I test speaker verification, the output is wrong.

ValueError: threshold must be non-NAN

python3 SpeakerRecognition_tutorial/identification.py

Traceback (most recent call last):
  File "SpeakerRecognition_tutorial/identification.py", line 10, in <module>
    from DB_wav_reader import read_feats_structure
  File "/content/SpeakerRecognition_tutorial/DB_wav_reader.py", line 11, in <module>
    np.set_printoptions(threshold=np.nan)
  File "/usr/local/lib/python3.6/dist-packages/numpy/core/arrayprint.py", line 259, in set_printoptions
    floatmode, legacy)
  File "/usr/local/lib/python3.6/dist-packages/numpy/core/arrayprint.py", line 95, in _make_options_dict
    raise ValueError("threshold must be non-NAN, try "
ValueError: threshold must be non-NAN,
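Newer NumPy releases reject `np.nan` as a print threshold, which is what the traceback above reports. A minimal sketch of one workaround, assuming the intent of that line was simply to print arrays without truncation, is to pass `sys.maxsize` instead (deleting the line also works if full printing is not needed):

```python
import sys
import numpy as np

# Print arrays in full instead of summarizing them with "...".
np.set_printoptions(threshold=sys.maxsize)

# A 2000-element array would be summarized under the default threshold of 1000.
text = np.array2string(np.arange(2000))
print(len(text))
```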

remove denoising noise

Hello sir, how can I remove noise (denoise the audio) before feature extraction?

Are there any related papers for this code that can be used as a reference?

TruncatedInputfromMFB

Sorry, my English is bad.

class TruncatedInputfromMFB(object):
    """
    input size : (n_frames, dim=40)
    output size : (1, n_win=40, dim=40) => one context window is chosen randomly
    """
    def __init__(self, input_per_file=1):
        super(TruncatedInputfromMFB, self).__init__()
        self.input_per_file = input_per_file
    
    def __call__(self, frames_features):
        network_inputs = []
        num_frames = len(frames_features)
        
        win_size = c.NUM_WIN_SIZE
        half_win_size = int(win_size/2)
        #if num_frames - half_win_size < half_win_size:
        while num_frames - half_win_size <= half_win_size:
            frames_features = np.append(frames_features, frames_features[:num_frames,:], axis=0)
            num_frames =  len(frames_features)
            
        for i in range(self.input_per_file):
            j = random.randrange(half_win_size, num_frames - half_win_size)
            if not j:
                frames_slice = np.zeros(num_frames, c.FILTER_BANK, 'float64')
                frames_slice[0:(frames_features.shape)[0]] = frames_features.shape
            else:
                frames_slice = frames_features[j - half_win_size:j + half_win_size]
            network_inputs.append(frames_slice)
        return np.array(network_inputs)

frames_slice = np.zeros(num_frames, c.FILTER_BANK, 'float64')
Is this line wrong? Should it be
frames_slice = np.zeros((num_frames, c.FILTER_BANK), 'float64')
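The question can be checked in isolation. `np.zeros` takes the shape as its first argument, so a 2-D shape has to be passed as a tuple; with separate positional arguments the second value is interpreted as the dtype. The sketch below uses illustrative stand-ins (`num_frames = 5`, `filter_bank = 40`) for the values in the quoted class.

```python
import numpy as np

num_frames, filter_bank = 5, 40   # illustrative stand-ins for the real values

# Shape given as separate arguments: 40 is taken as the dtype and the call fails.
try:
    np.zeros(num_frames, filter_bank, 'float64')
    positional_call_failed = False
except (TypeError, ValueError):
    positional_call_failed = True

# Shape given as a tuple: this is the form the question proposes.
frames_slice = np.zeros((num_frames, filter_bank), 'float64')
print(positional_call_failed, frames_slice.shape)
```

In the quoted class, `j` is drawn with `random.randrange(half_win_size, ...)`, so whenever `half_win_size` is greater than zero `j` cannot be 0 and the `if not j:` branch containing this line is never reached. That would explain why the error does not normally surface.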

I have a question about configure.py.

What do TRAIN_WAV_DIR and DEV_WAV_DIR in the code below refer to?

# Wave path

TRAIN_WAV_DIR = '/home/admin/Desktop/read_25h_2/train'
DEV_WAV_DIR = '/home/admin/Desktop/read_25h_2/dev'

run train.py error

Before training, I modified SR_Dataset.py line 206 to train_DB = read_DB_structure(c.TRAIN_WAV_DIR), and I deleted line 20 in DB_wav_reader.py following issue #2, but when I run train.py, an error occurs.

Traceback (most recent call last):
  File "train.py", line 328, in <module>
    main()
  File "train.py", line 92, in main
    train_dataset, valid_dataset, n_classes = load_dataset(val_ratio)
  File "train.py", line 22, in load_dataset
    train_DB, valid_DB = split_train_dev(c.TRAIN_WAV_DIR, val_ratio)
  File "train.py", line 65, in split_train_dev
    (train_len / total_len) * 100))
ZeroDivisionError: division by zero

I don't know how to fix it. Could you suggest how to prepare the dataset? I am using a different dataset.

Thank you. @jymsuper
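The traceback shows a division by `total_len`, which suggests that no utterances were found under the configured directory. As a rough diagnostic, and assuming the features are stored as `.p` files somewhere below that directory (an assumption based on the other issues here), a check like the following can be run before training. The helper names are hypothetical and not part of the repository.

```python
import os
import tempfile

def count_feature_files(root, extension=".p"):
    """Count files ending in `extension` anywhere under `root`."""
    total = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        total += sum(1 for name in filenames if name.endswith(extension))
    return total

def check_dataset_dir(root, extension=".p"):
    """Raise a readable error instead of a later ZeroDivisionError."""
    n_files = count_feature_files(root, extension)
    if n_files == 0:
        raise FileNotFoundError(f"No '{extension}' files found under {root!r}")
    return n_files

# Demonstration on a temporary directory with a made-up speaker folder.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "speaker_01"))
    try:
        check_dataset_dir(root)
        empty_dir_detected = False
    except FileNotFoundError:
        empty_dir_detected = True
    open(os.path.join(root, "speaker_01", "utt_001.p"), "wb").close()
    n_found = check_dataset_dir(root)

print(empty_dir_detected, n_found)
```

If the check reports zero files for your own path, the likely causes are a wrong directory in the configuration or a file extension that differs from what the reader expects.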

(EPOCH_DEPRECATION_WARNING, UserWarning) error

Hello, I am interested in speaker recognition and am working on a project using this repository.
I started with almost no knowledge of this field, so even when an error message appears I do not really understand what it means, which is why I am asking. I ran the scripts in the documented order, train -> enroll -> identification -> verification, but in the final result, when checking whether the voices match, it shows "test" instead of the speaker's name. I would like to know the cause.
Also, after epoch 1 in train.py, the following message appears, and I do not understand its cause:
(The epoch parameter in scheduler.step() was not necessary and is being deprecated where possible. Please use scheduler.step() to step the scheduler. During the deprecation, if epoch is different from None, the closed form is used instead of the new chainable form, where available. Please open an issue if you are unable to replicate your use case: https://github.com/pytorch/pytorch/issues/new/choose.)
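The quoted message is a PyTorch deprecation warning rather than an error: it is emitted when an epoch number is passed to `scheduler.step()`. A minimal sketch of the call without the epoch argument follows. The model, optimizer, and `StepLR` scheduler here are placeholders chosen for illustration and are not necessarily what train.py uses.

```python
import torch

model = torch.nn.Linear(4, 2)                       # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.5)

for epoch in range(2):
    # ... one epoch of training would go here ...
    optimizer.step()
    scheduler.step()        # no epoch argument, so no deprecation warning

current_lr = optimizer.param_groups[0]["lr"]
print(current_lr)
```

Schedulers that depend on a monitored value, such as `ReduceLROnPlateau`, still take that value as an argument (`scheduler.step(valid_loss)`); only the epoch number is dropped.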

How to add new speaker for enrollment?

What modifications do I need to make in order to add a new speaker using the enroll.py file? @jymsuper

Features Computation

Hello,
thanks for this great tutorial!
I'm not able to reproduce the feature extraction step. Can you please point me in the right direction?

Right now I'm using logfbank from the python_speech_features library, with sr=16000 and n_filters=40.

Many thanks!

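A question about the ResNet model.

The picture above shows the ResNet model code in this speaker recognition repository's resnet.py. Comparing its channel counts with the actual ResNet34, I see the following difference:

ResNet34    this repository's ResNet
64          16
128         32
256         64
512         128

Did you design it this way because the model would otherwise be too large?
When I switched to the original ResNet model and ran it, it did report running out of GPU memory. If there is a particular reason for this change, I would like to know.
Thank you.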

train with own dataset

I got this when training with my own .p files:


Training set 21600 utts (90.0%)
Validation set 2400 utts (10.0%)
Total 24000 utts

Number of classes (speakers):
240

<torch.utils.data.dataloader.DataLoader object at 0x00000218F4F8A710>
Train Epoch:   1 [       0/   21600 (  0%)]	Time 3.002 (3.002)	Loss 5.5635	Acc 0.0000
Train Epoch:   1 [    5376/   21600 ( 25%)]	Time 0.095 (0.113)	Loss 5.2005	Acc 1.3910
Train Epoch:   1 [   10752/   21600 ( 50%)]	Time 0.032 (0.098)	Loss 4.7337	Acc 2.0177
Traceback (most recent call last):
  File "D:\Python\train.py", line 290, in <module>
    main()
  File "D:\Python\train.py", line 120, in main
    train_loss = train(train_loader, model, criterion, optimizer, use_cuda, epoch, n_classes)
  File "D:\Python\train.py", line 157, in train
    for batch_idx, (data) in enumerate(train_loader):
  File "C:\Users\TA\anaconda3\envs\Python\lib\site-packages\torch\utils\data\dataloader.py", line 652, in __next__
    data = self._next_data()
  File "C:\Users\TA\anaconda3\envs\Python\lib\site-packages\torch\utils\data\dataloader.py", line 692, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "C:\Users\TA\anaconda3\envs\Python\lib\site-packages\torch\utils\data\_utils\fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "C:\Users\TA\anaconda3\envs\Python\lib\site-packages\torch\utils\data\_utils\fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "D:\Python\SR_Dataset.py", line 195, in __getitem__
    label = self.spk_to_idx[label]
KeyError: 'wav'

Process finished with exit code 1

Please help me.
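One possible explanation, offered only as a guess because the label-parsing code is not shown in the traceback: if the speaker label is taken from a directory name in each file's path, an extra folder level (for example a `wav` subfolder inside each speaker folder) would make every label come out as `wav`. The sketch below illustrates that failure mode with a hypothetical parsing rule and made-up paths.

```python
import os

def label_from_path(path):
    """Hypothetical rule: the speaker label is the parent directory name."""
    return os.path.basename(os.path.dirname(path))

# Hypothetical speaker-to-index table built from the speaker folder names.
spk_to_idx = {"spk001": 0, "spk002": 1}

expected_layout = "feat/train/spk001/utt01.p"       # speaker folder is the parent
nested_layout = "feat/train/spk001/wav/utt01.p"     # extra "wav" level in between

label_ok = label_from_path(expected_layout)
label_bad = label_from_path(nested_layout)
print(label_ok, label_bad, label_bad in spk_to_idx)
```

If your dataset has such an extra level, flattening it so that feature files sit directly inside each speaker's folder may resolve the KeyError. Comparing your layout with the one the repository's reader expects is the safer first step.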

train.py error

ZeroDivisionError: division by zero
How can I fix this?

Extract embedding from the .wav

Hi, @jymsuper!
I want to extract an embedding from a .wav file. Could you please tell me how I can do this?
