Comments (3)
Sorry for the delay - I was on holiday.
The sample rate of the models I uploaded to HF is 22,050 Hz. You can use 44,100 Hz if you like, but you need twice the resolution in the x-axis to get the same sample length in seconds. On your machine you should be able to go to 512x512 easily.
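To make the x-axis arithmetic concrete, here is a rough back-of-the-envelope sketch; the hop length of 512 is an assumption (check the settings used by audio_to_images.py), not something stated above:

```python
# Approximate duration of audio represented by one mel spectrogram image:
# each pixel column covers hop_length samples of audio.
def spectrogram_seconds(width, sample_rate, hop_length=512):
    return width * hop_length / sample_rate

# 256 columns at 22,050 Hz and 512 columns at 44,100 Hz cover the same
# duration (~5.94 s with these assumed settings) -- hence "twice the
# resolution in the x-axis" when doubling the sample rate.
print(spectrogram_seconds(256, 22050))
print(spectrogram_seconds(512, 44100))
```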
I found that around 20,000 samples worked well, but it does depend on how homogeneous they are. Also, some genres appear to work better than others. Regarding conditional training with a text prompt, it can be done with the codebase, but the model expects a vector of numbers (an encoding), which can be a text embedding or whatever. You would just need to provide this as a dictionary as described in the README (i.e., it is not as convenient as a HF pipeline that takes the text as an input). I am not sure that you will get great results with these kinds of descriptions. I would suggest things that a pretrained language model might better "understand", like "Fast-paced, exciting, orchestral with drums". Obviously you are not going to label 20,000 samples by hand - the same description can be used for each "slice" of the 30s previews. Perhaps you can find a way to get a meaningful description from the Spotify API / scraping?
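As a sketch of what such a dictionary could look like: the file names, the stub `embed` function, and the pickled blob below are all illustrative assumptions, not the exact format - check the README for what the training script actually expects.

```python
import pickle

# Stub embedder for illustration only -- in practice you would swap in a
# real text-embedding model that returns a fixed-size vector of floats.
def embed(text):
    return [float(len(word)) for word in text.split()]

# The same description can be reused for every slice of one 30s preview.
description = "Fast-paced, exciting, orchestral with drums"
encodings = {f"preview_slice_{i}.png": embed(description) for i in range(3)}

# Serialize the mapping so it can be handed to the training script.
blob = pickle.dumps(encodings)
```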
from audio-diffusion.
The descriptions come from audio analysis algorithms, so my hope is that they will be a more accurate description of the audio, with a stronger signal than something like genre or a feel. I think if I feed the descriptions I have into an LLM with some good prompt engineering, I should get more natural (human-like) descriptions. I have a few installed, and I can use LangChain to batch them.
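For example, a handful of analysis fields (names modeled loosely on the Spotify API's audio features, purely illustrative) could be mapped to more natural wording before, or instead of, an LLM pass:

```python
# Hypothetical audio-analysis fields for one sample; the thresholds are
# arbitrary assumptions, just to show the shape of the mapping.
features = {"tempo": 174.0, "energy": 0.92, "acousticness": 0.03}

def describe(features):
    pace = "fast-paced" if features["tempo"] > 120 else "slow"
    feel = "high-energy" if features["energy"] > 0.7 else "mellow"
    timbre = "acoustic" if features["acousticness"] > 0.5 else "electronic"
    return f"{pace}, {feel}, {timbre}"

print(describe(features))  # -> "fast-paced, high-energy, electronic"
```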
For this experiment I will only be using single-shot kick drums (0.1-2 secs long, all padded to 2 secs). I have a pretty good set that ranges from natural real-world kick drums (like a Pearl drum set) to more experimental industrial sounds that are completely synthetic. I have a metadata store that I can use to create a more evenly distributed set, so I don't have an overabundance of 808s or 909s, etc. So it is monolithic in that it's only one class of sound, but with quite a lot of variety.
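A minimal sketch of that even-distribution step, assuming the metadata store can yield (file, class) pairs - all names here are made up:

```python
import random
from collections import defaultdict

# Hypothetical metadata: each kick-drum sample tagged with a source class.
samples = [("kick_001.wav", "808"), ("kick_002.wav", "808"),
           ("kick_003.wav", "909"), ("kick_004.wav", "acoustic"),
           ("kick_005.wav", "industrial"), ("kick_006.wav", "808")]

def balanced_subset(samples, per_class, seed=0):
    """Draw at most per_class samples from each class, shuffled."""
    by_class = defaultdict(list)
    for name, cls in samples:
        by_class[cls].append(name)
    rng = random.Random(seed)
    chosen = []
    for cls, names in sorted(by_class.items()):
        rng.shuffle(names)
        chosen.extend(names[:per_class])
    return chosen

print(balanced_subset(samples, per_class=1))  # one kick per class
```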
For the training, could you tell me what type of GPU you used, how long it took to train on 20k samples, and what I can expect for memory usage? I have a 4090 24GB, but I can rent an A100 80GB if necessary.
For vectorizing the text and passing in a dictionary, I can't seem to find anything about that in the README. I'm not sure how I'm missing it. Can you point me to it? Which tokenizer should I use?
Really appreciate the guidance.
from audio-diffusion.
I think I mention somewhere in the README that I used one 2080 Ti and it took about 40 hours to train on 20,000 samples. So you will be fine with a 4090, at least at the same resolution (256x256). If you go up to 512x512 (which can allow for higher quality, longer samples) you should still be able to train a VAE (I am doing exactly this right now).
Bear in mind that the LLMs are trained on different texts from the ones you are using, but you can only know by trying.
from audio-diffusion.
Related Issues (20)
- Recommended training hyperparameters for 44.1Khz & 48Khz Samplerate
- diffusers v-0.12.0 causes import issues
- Diffusers v0.12 removed the `ema_model.averaged_model` attribute
- Increasing input size
- how does the audio_to_images.py file work?
- Whether the longer music sample is the repetition of a shorted sample?
- NameError: name 'transformers' is not defined upon running model via Gradio
- Dataset constriants
- Training own music samples?
- Can I input audio file then generate image
- Numpy Error
- AttributeError: 'AutoencoderKL' object has no attribute 'sample_size'
- teticio/audio-diffusion-256 is really good
- multi-gpu training
- [Little Feedback] Thank you! :)
- is it possible to use the train_unet.py script as a regular ldm?
- whats the difference between 256 and 512 dataset
- Duration of generated audio
- WARNING: audio_to_images: No valid audio files were found error!