
adaptive_voice_conversion's People

Contributors

jjery2243542


adaptive_voice_conversion's Issues

VC for Chinese, but result not similar

Hello, I used the pre-trained model you provided to perform voice conversion for Chinese. When I checked the results, I found that the non-linguistic (speaker) information of the output file is not similar to that of the target file. According to the paper, the model should achieve the same effect on any data. What should I do?

Doc

Hi,
can you add some documentation,
maybe just the commands I need to type to test?

About train_index_file train_samples_128.json

Hello,
thanks for releasing the code publicly.
Just one question: the file referenced by -train_index_file, train_samples_128.json, is not available. Can you add it or describe how to generate it?
I followed the steps as described, but the problem is in train.sh:

python3 main.py -c config.yaml -d /groups/jjery2243542/data/vctk/trimmed_vctk_spectrograms/sr_24000_mel_norm -train_set train_128 -train_index_file train_samples_128.json -store_model_path /groups/jjery2243542/model/adaptive_vc/vctk_model -t vctk_model -iters 500000 -summary_step 500

Speaker Encoder Loss

An additional question: I didn't find any constraint on the speaker encoder. I have an idea: add a speaker encoder (Es) loss to keep the speaker embedding consistent. If Ec encodes the content of speaker 1 and Es encodes speaker 2, the reconstructed X1_2 should have the same speaker embedding as speaker 2: c1 = Ec(X1), s2 = Es(X2), X1_2 = D(c1, s2).
So the loss is torch.nn.L1Loss()(Es(X1_2), s2).
Do you think this may help?
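
A minimal sketch of this proposal in PyTorch (the encoder/decoder names follow the model interface quoted in another issue below; the helper itself is hypothetical, not code from the repo):

    import torch.nn.functional as F

    def speaker_consistency_loss(model, x1, x2):
        # content code of speaker 1 (use the posterior mean, as in inference)
        mu1, _ = model.content_encoder(x1)
        # speaker embedding of speaker 2
        s2 = model.speaker_encoder(x2)
        # convert speaker 1's content to speaker 2's voice
        x1_2 = model.decoder(mu1, s2)
        # re-encode the converted utterance and pull its embedding towards s2
        return F.l1_loss(model.speaker_encoder(x1_2), s2)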

some problems about this repo

1. In the paper "One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization", the architecture of the encoders and decoder differs from the implementation in this repo.
2. In the source code, what does the LatentDiscriminator class do?
3. When will the documentation be available?

Training still wouldn't start

Hi @jjery2243542 ,
I ran the updated code and the previous problem is solved. However, when I run main.py, the training does not start, and I get the screen as shown. Do I need to run any other file to start the training, or am I missing something?
I am using this command to run main.py:

python main.py -c config.yaml -train_set train -d ./features -train_index_file train_samples_128.json -store_model_path ./model -load_model_path ./model

No training speaker encoder?

Don't we need to train a speaker encoder?
There seems to be no such step in the code. Do we need to use a pre-trained model?

About the implementation in the code

Hello, I have two questions from reading your code. Could you please help me answer them when you have time?

  1. Why is the forward propagation different between training and inference?

        def forward(self, x):
            emb = self.speaker_encoder(x)
            mu, log_sigma = self.content_encoder(x)
            eps = log_sigma.new(*log_sigma.size()).normal_(0, 1)
            # training: sample the content code with the reparameterization trick
            dec = self.decoder(mu + torch.exp(log_sigma / 2) * eps, emb)
            return mu, log_sigma, emb, dec

        def inference(self, x, x_cond):
            emb = self.speaker_encoder(x_cond)
            mu, _ = self.content_encoder(x)
            # inference: use the posterior mean directly, without sampling
            dec = self.decoder(mu, emb)
            return dec

  2. How is the KL-divergence loss calculated?
    loss_kl = 0.5 * torch.mean(torch.exp(log_sigma) + mu ** 2 - 1 - log_sigma)

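For reference, this is the closed form of KL(N(mu, sigma^2) || N(0, 1)), 0.5 * (sigma^2 + mu^2 - 1 - log(sigma^2)), assuming log_sigma denotes log(sigma^2). A minimal sketch under that assumption, cross-checked against torch.distributions:

    import torch
    from torch.distributions import Normal, kl_divergence

    mu = torch.randn(4, 8)
    log_sigma = torch.randn(4, 8)

    # closed form, averaged over all elements
    loss_kl = 0.5 * torch.mean(torch.exp(log_sigma) + mu ** 2 - 1 - log_sigma)

    # cross-check: the posterior standard deviation is exp(log_sigma / 2)
    q = Normal(mu, torch.exp(log_sigma / 2))
    p = Normal(torch.zeros_like(mu), torch.ones_like(mu))
    assert torch.allclose(kl_divergence(q, p).mean(), loss_kl, atol=1e-5)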

About the number of mel-scale spectrogram bins

I found that several (most) vocoders and other TTS models use 80 mel-spectrogram channels.

In this work, the model uses 512 channels.

Why does this model use 512 channels, which is far more than other TTS and vocoder models?
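
For illustration only (this is not the repo's preprocessing script), the number of mel bins is just the n_mels argument of the mel filterbank, e.g. with librosa:

    import librosa

    # placeholder audio clip; substitute your own wav file
    y, sr = librosa.load(librosa.example('trumpet'), sr=24000)
    mel_80 = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)
    mel_512 = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=512)
    print(mel_80.shape, mel_512.shape)  # (80, T) vs (512, T)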

Question about preprocessing

Hi @jjery2243542 .
I have a question about preprocessing.

What is the role of the "sample_single_segments.py"?

for i, utt_ind in enumerate(sample_utt_index_list):
    if i % 500 == 0:
        print(f'sample {i} samples')
    # look up the utterance id for this sampled index
    utt_id = utt_list[utt_ind]
    # pick a random start frame so a fixed-size segment fits inside the utterance
    t = random.randint(0, len(data[utt_id]) - segment_size)
    samples.append((utt_id, t))

In particular, I can't understand the part above.

loss doesn't decrease

Is something wrong? At training step 40,000, loss_rec = 0.25 and loss_kl = 0.28, and they don't decrease any more. The training step is 100,000 now, but loss_rec still stays at 0.25 and loss_kl at 0.28. Is this normal? (lambda stays at 1 after training step 20,000)

Difference in parameters between config.yaml and paper

Config.yaml vs Paper

  • lambda_kl: 1 vs 0.01
  • batch_size: 128 vs 256
  • dropout_rate: 0 vs 0.5 on all layers

Does anyone know the reason for this? Are these updated values that should be used?

Most notably, the batch_size affects the effective amount of training: the current 200k iterations with 128 samples per batch see 200,000 × 128 ≈ 25.6M samples, only half of the 200,000 × 256 ≈ 51.2M samples seen with a batch size of 256.

Question Language

Hey guys,

are there any plans to transfer this research to other languages as well?

Best regards
Chris

Where is AdaIN?

Looking at model.py, I can't find where AdaIN is in the decoder part.
Can somebody tell me where it is used?

More training?

Is this model trained only on the VCTK dataset?
If so, is there a chance to improve performance by training with more data, such as LibriSpeech?

License of the code?

Hi,
I would be grateful if you could consider adding a license to the code. Apache or MIT would be great, so that it can be used as flexibly as possible for research and beyond.

thanks

Can't find pickle files (e.g. train.pkl, test.pkl)

Hello, I am a newbie in this field. I encountered some problems during preprocessing: it says that it cannot find the train.pkl, dev.pkl, and test.pkl files. I don't know if you can help me. Thanks.

train-test split

Is the train-test split used for the pretrained model available?

Cannot find Python Pickle File

Hi, I am new to this field but am trying to run this for my recent research. However, after changing the preprocessing config file, I still cannot run it. The error is a missing Python pickle file (e.g. train.pkl). Do you have any advice on how to generate or find the needed pickle files? Thanks a lot.

About training time

Hello, I wonder how long it takes to train the model on VCTK in your experimental environment?

normalization & denormalization with attr

In inference.py, I can't understand why we should normalize and denormalize with the mean and variance from the attribute file. Could anyone explain it to me?
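
A minimal sketch of what such mean/variance normalization typically looks like (the attr.pkl name and its keys are assumptions, not necessarily the repo's exact format):

    import pickle

    with open('attr.pkl', 'rb') as f:
        attr = pickle.load(f)
    mean, std = attr['mean'], attr['std']

    def normalize(mel):
        # the model is trained on zero-mean, unit-variance features
        return (mel - mean) / std

    def denormalize(mel):
        # map the model output back to the original feature scale before vocoding
        return mel * std + mean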

Steps and documentation

Dear developer,
please provide detailed steps on how we can reproduce the results in a concise way. The code is not making sense to me, and after a lot of hard work I still cannot get it to work.

downsample in content encoder

Hi, in the content encoder you use 1-D average pooling to downsample the content representation, so it is downsampled by a factor of 8 compared with the original mel spectrogram. I am wondering whether this can effectively retain the content information. Can the high-frequency information be recovered well? And if so, why?
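
As a sketch of the mechanism being asked about (the exact layer layout is an assumption), three stride-2 average poolings reduce the time axis by a factor of 8:

    import torch
    import torch.nn as nn

    x = torch.randn(1, 256, 128)        # (batch, channels, time)
    pool = nn.AvgPool1d(kernel_size=2)  # halves the time dimension
    for _ in range(3):
        x = pool(x)
    print(x.shape)                      # torch.Size([1, 256, 16])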

about preprocessing

Hello, I read your paper with great interest, but I encountered some difficulties in the experiment; I don't know if you can help me.
I use LibriTTS for preprocessing, but there is a data_dir entry in the configuration file that stores the path to the preprocessed files. I don't know what the preprocessed files refer to here.

voice conversion result can't replicate the paper and demo webpage

Hello, I ran your code, but I can't reproduce the results from your paper and demo.

After training the model, I performed the following conversion:
The source is male: p259_263.wav. The target is female: p250_269.wav.

Running evaluate.py:
The reconstruction of the source (output.rec_src.wav) sounds like a male voice.
The reconstruction of the target (output.rec_tar.wav) sounds like a male voice.
The result of the conversion (output.src2tar.wav) is the same as output.rec_src.wav.
In short, the converted result is not clear, not transformed, and not the desired result.

The paper says the spectrogram is used as the feature, but the waveform is used directly as input.
The paper says a VAE is used, but an AE is used in the code.
The decoder uses AdaIN in the paper, but it is not in the code.
.....

The distribution of speakers is not as diverse as in the paper, either.

The converted examples mentioned above are in the attached zip file.
result.zip

I feel that this code can't reproduce the results claimed in the paper. Can you improve it?

How to train on other dataset

Hello, I used this code to train on a Chinese dataset. The loss on the training set is:
AE:[425993/2000000], loss_rec=0.21, loss_kl=0.27, lambda=1.0e+0.
And I find that the loss hardly decreases any more.
Then I used data from the training set for voice conversion and found that the result was terrible. How can I improve it?

AdaIN

How did you apply AdaIN?
In the Decoder in model.py, I see that the IN layers are not adaptive, and they are commented out.

About preprocessing

Hello, I read your paper with great interest, but I encountered some difficulties in the experiment; I don't know if you can help me.
I use LibriTTS for preprocessing, but there is a data_dir entry in the configuration file that stores the path to the preprocessed files. I don't know what the preprocessed files refer to here.

Where is the file?

Dear @jjery2243542,
in the testing phase, I am not sure what you mean by this file. Can you please help?

-a: the attribute file for normalization and denormalization.
