yistlin / dvector Goto Github PK
View Code? Open in Web Editor NEWSpeaker embedding (d-vector) trained with GE2E loss
Speaker embedding (d-vector) trained with GE2E loss
Hi, awesome repo! I am wondering, how should I cut my data? Is this setup good for sentence-level utterances?
In the code, it seems that the model is trained on short chunks of voiced audio, but in the viasualization script, the embeddings are extracted from whole utterances (so a single embedding vector for the whole sentence). Did you experiment with different setups? For example, did you try to extract embeddings for segments of the given utterance and average them? My data is in an ASR setup, so it is pretty much sentence based. Any advice how to further segment such data? Regards, Jan
Running visualize("preprocessed", "preprocessed/wav2mel.pt", "dvector-step5000.pt", ".")
Getting the following error while torch.jit.load
RuntimeError:
Unknown builtin op: torchaudio_sox::apply_effects_tensor.
Could not find any similar ops to torchaudio_sox::apply_effects_tensor. This op may not exist or may not be currently supported in TorchScript.
:
File "code/__torch__/torchaudio/sox_effects/sox_effects.py", line 5
effects: List[List[str]],
channels_first: bool=True) -> Tuple[Tensor, int]:
_0, _1 = ops.torchaudio_sox.apply_effects_tensor(tensor, sample_rate, effects, channels_first)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
return (_0, _1)
'apply_effects_tensor' is being compiled since it was called from 'SoxEffects.forward'
Serialized File "code/__torch__/data/wav2mel.py", line 33
wav_tensor: Tensor,
sample_rate: int) -> Tensor:
_0 = __torch__.torchaudio.sox_effects.sox_effects.apply_effects_tensor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
effects = self.effects
_1 = _0(wav_tensor, sample_rate, effects, True, )
I run the code in preprocess.py and it has an error:
Traceback (most recent call last): File "preprocess.py", line 92, in <module> preprocess(**vars(PARSER.parse_args())) File "preprocess.py", line 71, in preprocess for speaker_name, mel_tensor in tqdm(dataloader, ncols=0, desc="Preprocess"): File "/Users/xinyuewang/opt/anaconda3/lib/python3.8/site-packages/tqdm/std.py", line 1178, in __iter__ for obj in iterable: File "/Users/xinyuewang/opt/anaconda3/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 359, in __iter__ return self._get_iterator() File "/Users/xinyuewang/opt/anaconda3/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 305, in _get_iterator return _MultiProcessingDataLoaderIter(self) File "/Users/xinyuewang/opt/anaconda3/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 918, in __init__ w.start() File "/Users/xinyuewang/opt/anaconda3/lib/python3.8/multiprocessing/process.py", line 121, in start self._popen = self._Popen(self) File "/Users/xinyuewang/opt/anaconda3/lib/python3.8/multiprocessing/context.py", line 224, in _Popen return _default_context.get_context().Process._Popen(process_obj) File "/Users/xinyuewang/opt/anaconda3/lib/python3.8/multiprocessing/context.py", line 283, in _Popen return Popen(process_obj) File "/Users/xinyuewang/opt/anaconda3/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__ super().__init__(process_obj) File "/Users/xinyuewang/opt/anaconda3/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__ self._launch(process_obj) File "/Users/xinyuewang/opt/anaconda3/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 47, in _launch reduction.dump(process_obj, fp) File "/Users/xinyuewang/opt/anaconda3/lib/python3.8/multiprocessing/reduction.py", line 60, in dump ForkingPickler(file, protocol).dump(obj) RuntimeError: Tried to serialize object __torch__.data.wav2mel.Wav2Mel which does not have a __getstate__ method defined!
does anyone know how to solve this?
my version:
python==3.8 torch==1.8.0 torchaudio==0.8.0
I am using pre-trained model to get the embedding but getting below error:
RuntimeError Traceback (most recent call last)
<ipython-input-5-4bcf886e7e7f> in <module>
2 import torchaudio
3
----> 4 wav2mel = torch.jit.load("wav2mel.pt")
5 dvector = torch.jit.load("dvector.pt").eval()
6
~/anaconda3/lib/python3.8/site-packages/torch/jit/_serialization.py in load(f, map_location, _extra_files)
159 cu = torch._C.CompilationUnit()
160 if isinstance(f, str) or isinstance(f, pathlib.Path):
--> 161 cpp_module = torch._C.import_ir_module(cu, str(f), map_location, _extra_files)
162 else:
163 cpp_module = torch._C.import_ir_module_from_buffer(
RuntimeError:
Class Namespace cannot be used as a value:
Serialized File "code/__torch__/torchaudio/sox_effects/sox_effects.py", line 5
effects: List[List[str]],
channels_first: bool=True) -> Tuple[Tensor, int]:
in_signal = __torch__.torch.classes.torchaudio.TensorSignal.__new__(__torch__.torch.classes.torchaudio.TensorSignal)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
_0 = (in_signal).__init__(tensor, sample_rate, channels_first, )
out_signal = ops.torchaudio.sox_effects_apply_effects_tensor(in_signal, effects)
'apply_effects_tensor' is being compiled since it was called from 'SoxEffects.forward'
Serialized File "code/__torch__/data/wav2mel.py", line 29
wav_tensor: Tensor,
sample_rate: int) -> Tensor:
_0 = __torch__.torchaudio.sox_effects.sox_effects.apply_effects_tensor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
_1 = _0(wav_tensor, sample_rate, self.effects, True, )
wav_tensor1, _2, = _1
When the input tensor shape is [1, 800] or [1, 320] and When I use the following code
mel_tensor = wav2mel(wav_tensor, 16000) # 16000 is the sample rate
I met with the following error:
Traceback of TorchScript, serialized code (most recent call last):
File "code/torch/data/wav2mel.py", line 20, in forward
sample_rate: int) -> Tensor:
wav_tensor0 = (self.sox_effects).forward(wav_tensor, sample_rate, )
mel_tensor = (self.log_melspectrogram).forward(wav_tensor0, )
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
return mel_tensor
class SoxEffects(Module):
File "code/torch/data/wav2mel.py", line 43, in forward
def forward(self: torch.data.wav2mel.LogMelspectrogram,
wav_tensor: Tensor) -> Tensor:
_3 = (self.melspectrogram).forward(wav_tensor, )
~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
mel_tensor = torch.numpy_T(torch.squeeze(_3, 0))
_4 = torch.clamp(mel_tensor, 1.0000000000000001e-09, None)
File "code/torch/torchaudio/transforms.py", line 20, in forward
def forward(self: torch.torchaudio.transforms.MelSpectrogram,
waveform: Tensor) -> Tensor:
specgram = (self.spectrogram).forward(waveform, )
~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
mel_specgram = (self.mel_scale).forward(specgram, )
return mel_specgram
File "code/torch/torchaudio/transforms.py", line 41, in forward
waveform: Tensor) -> Tensor:
_0 = torch.torchaudio.functional.functional.spectrogram
_1 = _0(waveform, 0, self.window, 400, 160, 400, 2., False, self.center, self.pad_mode, self.onesided, )
~~ <--- HERE
return _1
class MelScale(Module):
File "code/torch/torchaudio/functional/functional.py", line 18, in spectrogram
waveform0 = waveform
shape = torch.size(waveform0)
waveform2 = torch.reshape(waveform0, [-1, shape[-1]])
~~~~~~~~~~~~~ <--- HERE
spec_f = torch.torch.functional.stft(waveform2, n_fft, hop_length, win_length, window, center, pad_mode, False, onesided, True, )
_0 = torch.slice(shape, 0, -1, 1)Traceback of TorchScript, original code (most recent call last):
File "/home/yist/.pyenv/versions/3.8.5/lib/python3.8/site-packages/torchaudio/transforms.py", line 96, in forward
Fourier bins, and time is the number of window hops (n_frame).
"""
return F.spectrogram(
~~~~~~~~~~~~~ <--- HERE
waveform,
self.pad,
File "/home/yist/.pyenv/versions/3.8.5/lib/python3.8/site-packages/torchaudio/functional/functional.py", line 88, in spectrogram
# pack batch
shape = waveform.size()
waveform = waveform.reshape(-1, shape[-1])
~~~~~~~~~~~~~~~~ <--- HERE# default values are consistent with librosa.core.spectrum._spectrogram
RuntimeError: cannot reshape tensor of 0 elements into shape [-1, 0] because the unspecified dimension size -1 can be any value and is ambiguous
How can I solve this problem?
I am run visualization and get error:
[INFO] model loaded.
Preprocess: 1% 12/889 [00:00<00:27, 31.83it/s]
Traceback (most recent call last):
File "visualize.py", line 89, in
visualize(**vars(PARSER.parse_args()))
File "visualize.py", line 43, in visualize
mel_tensor = wav2mel(wav_tensor, sample_rate)
File "/home/vvs/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
File "code/torch/data/wav2mel.py", line 20, in forward
sample_rate: int) -> Tensor:
wav_tensor0 = (self.sox_effects).forward(wav_tensor, sample_rate, )
mel_tensor = (self.log_melspectrogram).forward(wav_tensor0, )
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
return mel_tensor
.....
Traceback of TorchScript, original code (most recent call last):
File "/home/yist/.pyenv/versions/3.8.5/lib/python3.8/site-packages/torchaudio/transforms.py", line 96, in forward
Fourier bins, and time is the number of window hops (n_frame).
"""
return F.spectrogram(
~~~~~~~~~~~~~ <--- HERE
waveform,
self.pad,
File "/home/yist/.pyenv/versions/3.8.5/lib/python3.8/site-packages/torchaudio/functional/functional.py", line 88, in spectrogram
# pack batch
shape = waveform.size()
waveform = waveform.reshape(-1, shape[-1])
~~~~~~~~~~~~~~~~ <--- HERE
# default values are consistent with librosa.core.spectrum._spectrogram
RuntimeError: cannot reshape tensor of 0 elements into shape [-1, 0] because the unspecified dimension size -1 can be any value and is ambiguous
I wanna get a 601 dim dvector,and follow the step "Train from scratch"
then there is an error Traceback to FILE "modules/dvector.py"
"""Forward a batch through network."""
lstm_outs, _ = self.lstm(inputs) # (batch, seg_len, dim_cell)
embeds = torch.tanh(self.embedding(lstm_outs)) # (batch, seg_len, dim_emb)
~~~~~~~~~~~~~~ <--- HERE
RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)
How to realize the window size is drawn from a uniform distribution within [240ms, 1600ms] during training?
In your source code dvector.py, there are two questions. One is the conditional judgment: if utterance. size (1) < = self. seg _ len:, which should be compared with the 0 th dimension, because the 1 ST dimension is 40, so the horizontal dimension is smaller than seg_len=160, and the following sliding window part unfold cannot be reached; Second, the output shape of unfold is [bacth_size, 40, seg_len], while the input shape of AttentivePooledLSTMDvector should be [bacth_size, seg_len, 40], that is, size(-1) must be 40.
As for the uniform distribution seg_len, can I directly add the evenly distributed seg_len when traversing each utterance?
I appreciate your efforts, nice work.
But your audio_toolkit was implement in librosa and numpy, which was not differentiable.
It might limited the application. Eg. If I have an TTS model to generated Mel spectrogram, and if your dvector if fully differentiable, we can use this like a discriminator, to force the TTS model output exactly as expected person.
From waveform to Melspectrogram, you can make preprocessing fully differentiable with torchaudio, and it seems it can keep consitency with librosa
System:Windows 11
torch :1.11.0+cu113
torchaudio:0.11.0+cu113
I saw that windows may have problems with sox,so is there any solvement to the problem?Like changing a package or something?
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.