Comments (3)

teticio commented on May 26, 2024

Sorry for the delay - I was on holiday.

The sample rate of the models I uploaded to HF is 22,050 Hz. You can use 44,100 Hz if you like, but you need twice the resolution on the x-axis to get the same sample length in seconds. On your machine you should be able to go to 512x512 easily.
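To make the sample rate versus x-axis resolution trade-off concrete, here is a rough sketch of the arithmetic. It assumes each spectrogram column advances by a fixed hop of 512 audio samples; that hop value and the function name are illustrative assumptions, not something stated in this thread.

```python
def spectrogram_duration_seconds(x_res: int, sample_rate: int, hop_length: int = 512) -> float:
    """Approximate audio length covered by a spectrogram that is x_res columns wide."""
    # Assumption: each column advances hop_length audio samples.
    return x_res * hop_length / sample_rate


# 256 columns at 22,050 Hz versus 256 and 512 columns at 44,100 Hz.
base = spectrogram_duration_seconds(256, 22_050)
same_width_44k = spectrogram_duration_seconds(256, 44_100)
double_width_44k = spectrogram_duration_seconds(512, 44_100)

print(f"{base:.2f} s, {same_width_44k:.2f} s, {double_width_44k:.2f} s")
```

Under that assumption, doubling the sample rate at the same width halves the clip length, and doubling the width as well restores it.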

I found that around 20,000 samples worked well, but it does depend on how homogeneous they are. Also, some genres appear to work better than others.

Regarding conditional training with a text prompt, it can be done with the codebase, but the model expects a vector of numbers (an encoding), which can be a text embedding or whatever. You would just need to provide this as a dictionary as described in the README (i.e., it is not as convenient as a HF pipeline that takes the text as an input). I am not sure that you will get great results with these kinds of descriptions. I would suggest things that a pretrained language model might better "understand", like "Fast-paced, exciting, orchestral with drums". Obviously you are not going to label 20,000 samples by hand - the same description can be used for each "slice" of the 30s previews. Perhaps you can find a way to get a meaningful description from the Spotify API / scraping?
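As a rough illustration of reusing one description for every slice of a track, the sketch below builds a dictionary from slice file name to encoding vector. Everything in it is an assumption for illustration: `embed_text` is a placeholder that only hashes the text (it is not a real text embedding and should be swapped for a pretrained encoder), and the file names, key layout, and use of pickle are hypothetical. The README remains the reference for the format the training script actually expects.

```python
import hashlib
import pickle


def embed_text(text: str, dim: int = 8) -> list[float]:
    """Placeholder encoder: NOT a real text embedding, swap in a pretrained model."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:dim]]


def build_encodings(descriptions: dict[str, str], slices: dict[str, list[str]]) -> dict[str, list[float]]:
    """Give every slice of a track the encoding of that track's single description."""
    encodings = {}
    for track, text in descriptions.items():
        vector = embed_text(text)
        for slice_name in slices.get(track, []):
            encodings[slice_name] = vector
    return encodings


# Hypothetical track and slice names.
descriptions = {"track_a": "Fast-paced, exciting, orchestral with drums"}
slices = {"track_a": ["track_a_0.wav", "track_a_1.wav", "track_a_2.wav"]}

encodings = build_encodings(descriptions, slices)
payload = pickle.dumps(encodings)  # bytes that could be written to a file
```

Only one description per track needs to be written or generated; the slices inherit it.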

from audio-diffusion.

dustyatx commented on May 26, 2024

The descriptions come from audio analysis algorithms, so my hope is that they will describe the audio more accurately, and with a stronger signal, than something like genre or feel. I think if I feed the descriptions I have into an LLM with some good prompt engineering, I should get more natural (human-like) descriptions. I have a few installed and I can use LangChain to batch them.

For this experiment I will only be using single-shot kick drums (0.1-2 secs long, all padded to 2 secs). I have a pretty good set that ranges from natural real-world kick drums (like a Pearl drum set) to more experimental industrial sounds that are completely synthetic. I have a metadata store that I can use to create a more evenly distributed set, so I don't have an overabundance of 808s or 909s, etc. So it is monolithic in that it's only one class of sound, but with quite a lot of variety.
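For reference, the padding step could look something like the sketch below. It assumes mono audio already decoded into a list of float samples and trailing-silence padding; the function name and the 22,050 Hz rate used here are illustrative choices.

```python
def pad_to_length(samples: list[float], sample_rate: int, seconds: float = 2.0) -> list[float]:
    """Right-pad a mono clip with silence so it lasts exactly `seconds`."""
    target = int(round(sample_rate * seconds))
    if len(samples) > target:
        raise ValueError("clip is longer than the target length")
    return samples + [0.0] * (target - len(samples))


sample_rate = 22_050
kick = [0.5, -0.25, 0.1] * 1000  # stand-in for a short decoded kick drum
padded = pad_to_length(kick, sample_rate)
```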

For the training, could you tell me what type of GPU you used, how long it took to train on 20k samples, and what I can expect for memory usage? I have a 4090 24GB but I can rent an A100 80GB if necessary.

For vectorizing the text and passing in a dictionary, I can't seem to find anything about that in the README. I'm not sure how I'm missing it. Can you point me to that? Which tokenizer should I use?

Really appreciate the guidance.

teticio commented on May 26, 2024

I think I mention somewhere in the README that I used one 2080 Ti and it took about 40 hours to train on 20,000 samples. So you will be fine with a 4090, at least at the same resolution (256x256). If you go up to 512x512 (which can allow for higher quality, longer samples) you should still be able to train a VAE (I am doing exactly this right now).

Bear in mind that the LLMs are trained on different texts from the ones you are using, but you can only know by trying.
