
Comments (7)

james20141606 commented on June 21, 2024

Another thing that is weird: when I quantify the Pearson correlation between spectrograms of the original and reconstructed waveforms, I find the correlation coefficients fall within a very narrow range. Why is the model so stable at reconstructing the waveform and its corresponding spectrogram?
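
For reference, here is a minimal sketch of the kind of comparison I mean (file names and STFT settings here are placeholders, not necessarily the exact ones I used):

```python
# Pearson correlation between log-magnitude spectrograms of the original
# and reconstructed waveforms. File names and STFT settings are placeholders.
import numpy as np
import librosa
from scipy.stats import pearsonr

def log_mag_spectrogram(path, sr=16000, n_fft=2048, hop_length=512):
    audio, _ = librosa.load(path, sr=sr)
    stft = librosa.stft(audio, n_fft=n_fft, hop_length=hop_length)
    return np.log(np.abs(stft) + 1e-6)

orig = log_mag_spectrogram("original.wav")        # placeholder path
recon = log_mag_spectrogram("reconstructed.wav")  # placeholder path

# Trim to the shorter signal in case the two differ by a frame or two.
n_frames = min(orig.shape[1], recon.shape[1])
r, _ = pearsonr(orig[:, :n_frames].ravel(), recon[:, :n_frames].ravel())
print(f"Pearson r between spectrograms: {r:.4f}")
```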


james20141606 commented on June 21, 2024

Another question I am really curious about: if we'd like to do human voice reconstruction from multiple sources (different people), should we consider timbre and include z in the model?
Also, since the model is doing a really good job on waveform reconstruction, have you considered using it for TTS? Could we use an encoder to generate features like f0 and loudness from text or some other signal, and then generate the waveform?


jesseengel commented on June 21, 2024

Hi, glad it's working for you. I'd be happy to hear an example reconstruction if you want to share one. My guess is that the model is probably overfitting quite a lot to a small dataset. In that case, a given segment of loudness and f0 corresponds to a specific phoneme because the dataset doesn't have enough variation. For a large dataset, there will be one-to-many mappings that the model can't handle without more conditioning (latent or labels). We don't use the latent "z" variables in the models in the timbre_transfer and train_autoencoder colabs, but the encoders and decoders are in the code base and used in models/nsynth_ae.gin as an example.
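
Schematically, the decoder just consumes a stack of per-frame features, so "more conditioning" means concatenating more channels at its input. A toy sketch of that idea (illustrative only, not the actual DDSP code; the tensor shapes are assumptions):

```python
import tensorflow as tf

def decoder_inputs(f0_scaled, loudness_scaled, z=None):
    """Stack per-frame conditioning features along the channel axis.

    f0_scaled, loudness_scaled: [batch, time, 1] tensors.
    z: optional [batch, time, z_dims] latent from an encoder.
    """
    features = [f0_scaled, loudness_scaled]
    if z is not None:
        features.append(z)  # extra conditioning is just extra channels
    return tf.concat(features, axis=-1)
```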

My intuition is that the model should work well for TTS (the sinusoidal model it's based on is used in audio codecs, so we know it should be able to fit the signal), but you just need to add grapheme or phoneme conditioning.
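
For instance, a hedged sketch of what phoneme conditioning could look like, following the same pattern; the vocabulary size, embedding width, and frame-level alignment are assumptions for illustration, not something DDSP ships with:

```python
import tensorflow as tf

NUM_PHONEMES = 64  # assumed vocabulary size
EMBED_DIMS = 32    # assumed embedding width

phoneme_embedding = tf.keras.layers.Embedding(NUM_PHONEMES, EMBED_DIMS)

def condition_with_phonemes(f0_scaled, loudness_scaled, phoneme_ids):
    """phoneme_ids: [batch, time] integer, frame-aligned phoneme labels."""
    ph = phoneme_embedding(phoneme_ids)  # -> [batch, time, EMBED_DIMS]
    return tf.concat([f0_scaled, loudness_scaled, ph], axis=-1)
```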


james20141606 commented on June 21, 2024

Thanks a lot for your reply!

  1. I put the reconstruction result analysis here: https://drive.google.com/file/d/1DgjxlMLd-hYtYq4_O99oclqgfliL3Cqx/view
  2. On the overfitting point: I use the SHTOOKA dataset, which contains around 1 hour and 30 minutes of audio; isn't that large enough that the model shouldn't simply be overfitting? I am still amazed that the model handles the data so well, since I tried the Parrotron model for spectrogram reconstruction on the SHTOOKA dataset and it could not converge…
  3. I am not sure I understood "more conditioning (latent or labels)" here:

For a large dataset, there will be one to many mappings that the model can’t handle without more conditioning (latent or labels).

Do you mean we can add conditioning besides z, f0 and loudness? You also mentioned I could add grapheme or phoneme conditioning for a TTS task; do you mean using an encoder to extract phoneme, grapheme, or other conditioning, concatenating it with z, f0 and loudness (do we even have f0 and loudness in a TTS task?), and then feeding them to the decoder?

  4. I am also curious whether I can further improve the result by adding z conditioning and using ResNet instead of the CREPE model, or whether that will be harder to train. Have you tried more complicated models like a VAE or GAN with DDSP?


jesseengel commented on June 21, 2024

There are a lot of options to try; we only have results based on our published work. If you want control over the output, you need to condition on variables that you know how to control. For instance, most TTS systems only use phonemes or text as conditioning, and then let the network figure out what to do with them. You can try to figure out how to interpret z, but it is not trained to be interpretable as is.


james20141606 commented on June 21, 2024

Thanks for your reply! For conditioning, do you mean the features after the encoder part? If we want more conditioning, do you mean we could try to use some network to encode phonemes or graphemes as conditioning? Should I try to make the conditioning similar for similar words? Is there a rule to follow to find proper conditioning?


jesseengel commented on June 21, 2024

The Tacotron papers (https://google.github.io/tacotron/) have extensively investigated different types of TTS conditioning. I suggest you check out some of their work.

