
Comments (7)

james20141606 commented on June 21, 2024

Another thing that is weird: when I quantify the Pearson correlation between spectrograms of the original and reconstructed waveforms, I find the correlation coefficients fall within a very narrow range. Why is the model so stable at reconstructing the waveform and its corresponding spectrogram?
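
For reference, here is a minimal sketch of the kind of comparison I mean (file names and STFT settings here are placeholders, not necessarily the exact ones I used):

```python
# Pearson correlation between log-magnitude spectrograms of the original
# and reconstructed waveforms. File names and STFT settings are placeholders.
import numpy as np
import librosa
from scipy.stats import pearsonr

def log_mag_spectrogram(path, sr=16000, n_fft=2048, hop_length=512):
    audio, _ = librosa.load(path, sr=sr)
    stft = librosa.stft(audio, n_fft=n_fft, hop_length=hop_length)
    return np.log(np.abs(stft) + 1e-6)

orig = log_mag_spectrogram("original.wav")        # placeholder path
recon = log_mag_spectrogram("reconstructed.wav")  # placeholder path

# Trim to the shorter signal in case the two differ by a frame or two.
n_frames = min(orig.shape[1], recon.shape[1])
r, _ = pearsonr(orig[:, :n_frames].ravel(), recon[:, :n_frames].ravel())
print(f"Pearson r between spectrograms: {r:.4f}")
```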


james20141606 commented on June 21, 2024

Another question I am really curious about: if we'd like to do human voice reconstruction from multiple sources (different people), should we consider timbre and include z in the model?
Also, since the model is doing a really good job on waveform reconstruction, have you considered using it for TTS? Could we use an encoder to generate features like f0 and loudness from text or some other signal, and then generate the waveform?


jesseengel commented on June 21, 2024

Hi, glad it's working for you. I'd be happy to hear an example reconstruction if you want to share one. My guess is that the model is probably overfitting quite a lot to a small dataset. In that case, a given segment of loudness and f0 corresponds to a specific phoneme because the dataset doesn't have enough variation. For a large dataset, there will be one-to-many mappings that the model can't handle without more conditioning (latent or labels). We don't use the latent "z" variables in the models in the timbre_transfer and train_autoencoder colabs, but the encoders and decoders are in the code base and used in models/nsynth_ae.gin as an example.
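
Schematically, the decoder just consumes a stack of per-frame features, so "more conditioning" means concatenating more channels at its input. A toy sketch of that idea (illustrative only, not the actual DDSP code; the tensor shapes are assumptions):

```python
import tensorflow as tf

def decoder_inputs(f0_scaled, loudness_scaled, z=None):
    """Stack per-frame conditioning features along the channel axis.

    f0_scaled, loudness_scaled: [batch, time, 1] tensors.
    z: optional [batch, time, z_dims] latent from an encoder.
    """
    features = [f0_scaled, loudness_scaled]
    if z is not None:
        features.append(z)  # extra conditioning is just extra channels
    return tf.concat(features, axis=-1)
```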

My intuition is that the model should work well for TTS (the sinusoidal model it's based on is used in audio codecs, so we know it should be able to fit the signal), but you just need to add grapheme or phoneme conditioning.
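
For instance, a hedged sketch of what phoneme conditioning could look like, following the same pattern; the vocabulary size, embedding width, and frame-level alignment are assumptions for illustration, not something DDSP ships with:

```python
import tensorflow as tf

NUM_PHONEMES = 64  # assumed vocabulary size
EMBED_DIMS = 32    # assumed embedding width

phoneme_embedding = tf.keras.layers.Embedding(NUM_PHONEMES, EMBED_DIMS)

def condition_with_phonemes(f0_scaled, loudness_scaled, phoneme_ids):
    """phoneme_ids: [batch, time] integer, frame-aligned phoneme labels."""
    ph = phoneme_embedding(phoneme_ids)  # -> [batch, time, EMBED_DIMS]
    return tf.concat([f0_scaled, loudness_scaled, ph], axis=-1)
```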


james20141606 commented on June 21, 2024

Thanks a lot for your reply!

  1. I put the reconstruction result analysis here: https://drive.google.com/file/d/1DgjxlMLd-hYtYq4_O99oclqgfliL3Cqx/view
  2. On the overfitting point: I use the SHTOOKA dataset, which contains around 1 hour and 30 minutes of audio; isn't that large enough that the model shouldn't simply be overfitting? I am still amazed that the model handles the data so well, since I tried the Parrotron model for spectrogram reconstruction on the SHTOOKA dataset and it could not converge…
  3. I am not sure I understood "more conditioning (latent or labels)" here:

For a large dataset, there will be one to many mappings that the model can’t handle without more conditioning (latent or labels).

Do you mean we can add conditioning besides z, f0 and loudness? You also mentioned I could add grapheme or phoneme conditioning for a TTS task; do you mean using an encoder to extract phoneme, grapheme, or other conditioning, concatenating it with z, f0 and loudness (do we even have f0 and loudness in a TTS task?), and then feeding them to the decoder?

  4. I am also curious whether I can further improve the result by adding z conditioning and using ResNet instead of the CREPE model, or whether that will be harder to train. Have you tried more complicated models like a VAE or GAN with DDSP?


jesseengel commented on June 21, 2024

There are a lot of options to try; we only have results based on our published work. If you want control over the output, you need to condition on variables that you know how to control. For instance, most TTS systems only use phonemes or text as conditioning, and then let the network figure out what to do with them. You can try to figure out how to interpret z, but it is not trained to be interpretable as is.


james20141606 commented on June 21, 2024

Thanks for your reply! For conditioning, do you mean the features after the encoder part? If we want more conditioning, do you mean we could try to use some network to encode phonemes or graphemes as conditioning? Should I try to make the conditioning similar for similar words? Is there a rule to follow to find proper conditioning?


jesseengel commented on June 21, 2024

The Tacotron papers (https://google.github.io/tacotron/) have extensively investigated different types of TTS conditioning. I suggest you check out some of their work.

