
adaptive_voice_conversion's People

Contributors

jjery2243542


adaptive_voice_conversion's Issues

VC for Chinese, but result not similar

Hello, I used the pre-trained model you provided to perform voice conversion for Chinese. When I checked the results, I found that the non-linguistic (speaker) information of the output file is not similar to that of the target file. According to the paper, the model should achieve the same effect on any data. What should I do?

Doc

Hi,
can you add some documentation,
maybe just the commands I need to type to test?

About train_index_file train_samples_128.json

Hello,
thanks for releasing the code publicly.
Just one question: the file referenced by -train_index_file, train_samples_128.json, is not available. Can you add it or describe how to generate it?
I followed the steps as described, but the problem is in train.sh:

python3 main.py -c config.yaml -d /groups/jjery2243542/data/vctk/trimmed_vctk_spectrograms/sr_24000_mel_norm -train_set train_128 -train_index_file train_samples_128.json -store_model_path /groups/jjery2243542/model/adaptive_vc/vctk_model -t vctk_model -iters 500000 -summary_step 500

Speaker Encoder Loss

An additional question: I didn't find any constraint on the speaker encoder. I have an idea: add a speaker encoder (Es) loss to keep the speaker embedding consistent. If Ec encodes the content of speaker 1 and Es encodes speaker 2, the reconstructed X1_2 should have the same speaker embedding as speaker 2: c1 = Ec(X1), s2 = Es(X2), X1_2 = D(c1, s2).
So the loss is torch.nn.L1Loss()(Es(X1_2), s2).
Do you think this may help?
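
A minimal sketch of this proposal in PyTorch (the encoder/decoder names follow the model interface quoted in another issue below; the helper itself is hypothetical, not code from the repo):

    import torch.nn.functional as F

    def speaker_consistency_loss(model, x1, x2):
        # content code of speaker 1 (use the posterior mean, as in inference)
        mu1, _ = model.content_encoder(x1)
        # speaker embedding of speaker 2
        s2 = model.speaker_encoder(x2)
        # convert speaker 1's content to speaker 2's voice
        x1_2 = model.decoder(mu1, s2)
        # re-encode the converted utterance and pull its embedding towards s2
        return F.l1_loss(model.speaker_encoder(x1_2), s2)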

some problems about this repo

1. In the paper "One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization", the architecture of the encoders and decoder differs from the implementation in this repo.
2. In the source code, what does the LatentDiscriminator class do?
3. When will the documentation be available?

Training still wouldn't start

Hi @jjery2243542 ,
I ran the updated code and the previous problem is solved. However, when I run main.py, the training does not start, and I get the screen as shown. Do I need to run any other file to start the training, or am I missing something?
I am using this command to run main.py:

python main.py -c config.yaml -train_set train -d ./features -train_index_file train_samples_128.json -store_model_path ./model -load_model_path ./model

No training speaker encoder?

Don't we need to train a speaker encoder?
There seems to be no such step in the code. Do we need to use a pre-trained model?

About the implementation in the code

Hello, I have two questions from reading your code. Could you please help me answer them when you have time?

  1. Why is the forward propagation different between training and inference?

        def forward(self, x):
            emb = self.speaker_encoder(x)
            mu, log_sigma = self.content_encoder(x)
            eps = log_sigma.new(*log_sigma.size()).normal_(0, 1)
            # training: sample the content code with the reparameterization trick
            dec = self.decoder(mu + torch.exp(log_sigma / 2) * eps, emb)
            return mu, log_sigma, emb, dec

        def inference(self, x, x_cond):
            emb = self.speaker_encoder(x_cond)
            mu, _ = self.content_encoder(x)
            # inference: use the posterior mean directly, without sampling
            dec = self.decoder(mu, emb)
            return dec

  2. How is the KL-divergence loss calculated?
    loss_kl = 0.5 * torch.mean(torch.exp(log_sigma) + mu ** 2 - 1 - log_sigma)

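For reference, this is the closed form of KL(N(mu, sigma^2) || N(0, 1)), 0.5 * (sigma^2 + mu^2 - 1 - log(sigma^2)), assuming log_sigma denotes log(sigma^2). A minimal sketch under that assumption, cross-checked against torch.distributions:

    import torch
    from torch.distributions import Normal, kl_divergence

    mu = torch.randn(4, 8)
    log_sigma = torch.randn(4, 8)

    # closed form, averaged over all elements
    loss_kl = 0.5 * torch.mean(torch.exp(log_sigma) + mu ** 2 - 1 - log_sigma)

    # cross-check: the posterior standard deviation is exp(log_sigma / 2)
    q = Normal(mu, torch.exp(log_sigma / 2))
    p = Normal(torch.zeros_like(mu), torch.ones_like(mu))
    assert torch.allclose(kl_divergence(q, p).mean(), loss_kl, atol=1e-5)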

About the number of mel-scale spectrogram bins

I found that several (most) vocoders and other TTS models use 80 mel-spectrogram channels.

In this work, the model uses 512 channels.

Why does this model use 512 channels, which is far more than other TTS and vocoder models?
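
For illustration only (this is not the repo's preprocessing script), the number of mel bins is just the n_mels argument of the mel filterbank, e.g. with librosa:

    import librosa

    # placeholder audio clip; substitute your own wav file
    y, sr = librosa.load(librosa.example('trumpet'), sr=24000)
    mel_80 = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)
    mel_512 = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=512)
    print(mel_80.shape, mel_512.shape)  # (80, T) vs (512, T)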

Question about preprocessing

Hi @jjery2243542 .
I have a question about preprocessing.

What is the role of the "sample_single_segments.py"?

for i, utt_ind in enumerate(sample_utt_index_list):
    if i % 500 == 0:
        print(f'sample {i} samples')
    # look up the utterance id for this sampled index
    utt_id = utt_list[utt_ind]
    # pick a random start frame so a fixed-size segment fits inside the utterance
    t = random.randint(0, len(data[utt_id]) - segment_size)
    samples.append((utt_id, t))

In particular, I can't understand the part above.

loss doesn't decrease

Is something wrong? At training step 40,000, loss_rec = 0.25 and loss_kl = 0.28, and they don't decrease any more. The training step is 100,000 now, but loss_rec still stays at 0.25 and loss_kl at 0.28. Is this normal? (lambda stays at 1 after training step 20,000)

Difference in parameters between config.yaml and paper

Config.yaml vs Paper

  • lambda_kl: 1 vs 0.01
  • batch_size: 128 vs 256
  • dropout_rate: 0 vs 0.5 on all layers

Does anyone know the reason for this? Are these updated values that should be used?

Most notably, the batch_size affects the effective amount of training: the current 200k iterations with 128 samples per batch see 200,000 × 128 ≈ 25.6M samples, only half of the 200,000 × 256 ≈ 51.2M samples seen with a batch size of 256.

Question Language

Hey guys,

are there any plans to transfer this research to other languages as well?

Best regards
Chris

Where is AdaIN?

Looking at model.py, I can't find where AdaIN is in the decoder part.
Can somebody tell me where it is used?

More training?

Is this model trained only on the VCTK dataset?
If so, is there a chance to improve performance by training with more data, such as LibriSpeech?

License of the code?

Hi,
I would be grateful if you could consider adding a license to the code. Apache or MIT would be great, so that it can be used as flexibly as possible for research and beyond.

thanks

Can't find pickle files (e.g. train.pkl, test.pkl)

Hello, I am a newbie in this field. I encountered some problems during preprocessing: it says that it cannot find the train.pkl, dev.pkl, and test.pkl files. I don't know if you can help me. Thanks.

train-test split

Is the train-test split used for the pretrained model available?

Cannot find Python Pickle File

Hi, I am new to this field but am trying to run this for my recent research. However, after changing the preprocessing config file, I still cannot run it. The error is a missing Python pickle file (e.g. train.pkl). Do you have any advice on how to generate or find the needed pickle files? Thanks a lot.

About training time

Hello, I wonder how long it takes to train the model on VCTK in your experimental environment?

normalization & denormalization with attr

In inference.py, I can't understand why we should normalize and denormalize with the mean and variance from the attribute file. Could anyone explain it to me?
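
A minimal sketch of what such mean/variance normalization typically looks like (the attr.pkl name and its keys are assumptions, not necessarily the repo's exact format):

    import pickle

    with open('attr.pkl', 'rb') as f:
        attr = pickle.load(f)
    mean, std = attr['mean'], attr['std']

    def normalize(mel):
        # the model is trained on zero-mean, unit-variance features
        return (mel - mean) / std

    def denormalize(mel):
        # map the model output back to the original feature scale before vocoding
        return mel * std + mean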

Steps and documentation

Dear developer,
please provide detailed steps on how we can reproduce the results in a concise way. The code is not making sense to me, and after a lot of hard work I still cannot get it to work.

downsample in content encoder

Hi, in the content encoder you use 1-D average pooling to downsample the content representation, so it is downsampled by a factor of 8 compared with the original mel spectrogram. I am wondering whether this can effectively retain the content information. Can the high-frequency information be recovered well? And if so, why?
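
As a sketch of the mechanism being asked about (the exact layer layout is an assumption), three stride-2 average poolings reduce the time axis by a factor of 8:

    import torch
    import torch.nn as nn

    x = torch.randn(1, 256, 128)        # (batch, channels, time)
    pool = nn.AvgPool1d(kernel_size=2)  # halves the time dimension
    for _ in range(3):
        x = pool(x)
    print(x.shape)                      # torch.Size([1, 256, 16])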

about preprocessing

Hello, I read your paper with great interest, but I encountered some difficulties in the experiment; I don't know if you can help me.
I use LibriTTS for preprocessing, but there is a data_dir entry in the configuration file that stores the path to the preprocessed files. I don't know what the preprocessed files refer to here.

voice conversion result can't replicate the paper and demo webpage

Hello, I ran your code, but I can't reproduce the results from your paper and demo.

After training the model, I performed the following conversion:
The source is male: p259_263.wav. The target is female: p250_269.wav.

Running evaluate.py:
The reconstruction of the source (output.rec_src.wav) sounds like a male voice.
The reconstruction of the target (output.rec_tar.wav) sounds like a male voice.
The result of the conversion (output.src2tar.wav) is the same as output.rec_src.wav.
In short, the converted result is not clear, not transformed, and not the desired result.

The paper says the spectrogram is used as the feature, but the waveform is used directly as input.
The paper says a VAE is used, but an AE is used in the code.
The decoder uses AdaIN in the paper, but it is not in the code.
.....

The distribution of speakers is not as diverse as in the paper, either.

The converted examples mentioned above are in the attached zip file.
result.zip

I feel that this code can't reproduce the results claimed in the paper. Can you improve it?

How to train on other dataset

Hello, I used this code to train on a Chinese dataset. The loss on the training set is:
AE:[425993/2000000], loss_rec=0.21, loss_kl=0.27, lambda=1.0e+0.
And I find that the loss hardly decreases any more.
Then I used data from the training set for voice conversion and found that the result was terrible. How can I improve it?

AdaIN

How did you apply AdaIN?
In the Decoder in model.py, I see that the IN layers are not adaptive, and they are commented out.

About preprocessing

Hello, I read your paper with great interest, but I encountered some difficulties in the experiment; I don't know if you can help me.
I use LibriTTS for preprocessing, but there is a data_dir entry in the configuration file that stores the path to the preprocessed files. I don't know what the preprocessed files refer to here.

Where is the file?

Dear @jjery2243542,
in the testing phase, I am not sure what you mean by this file. Can you please help?

-a: the attribute file for normalization and denormalization.
