Comments (3)
Sorry for the delay - I was on holiday.
The sample rate of the models I uploaded to HF is 22,050 Hz. You can use 44,100 Hz if you like, but you need twice the resolution in the x-axis to get the same sample length in seconds. On your machine you should be able to go to 512x512 easily.
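To make the x-axis arithmetic concrete, here is a rough back-of-the-envelope sketch; the hop length of 512 is an assumption (check the settings used by audio_to_images.py), not something stated above:

```python
# Approximate duration of audio represented by one mel spectrogram image:
# each pixel column covers hop_length samples of audio.
def spectrogram_seconds(width, sample_rate, hop_length=512):
    return width * hop_length / sample_rate

# 256 columns at 22,050 Hz and 512 columns at 44,100 Hz cover the same
# duration (~5.94 s with these assumed settings) -- hence "twice the
# resolution in the x-axis" when doubling the sample rate.
print(spectrogram_seconds(256, 22050))
print(spectrogram_seconds(512, 44100))
```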
I found that around 20,000 samples worked well, but it does depend on how homogeneous they are. Also, some genres appear to work better than others. Regarding conditional training with a text prompt, it can be done with the codebase, but the model expects a vector of numbers (an encoding), which can be a text embedding or whatever. You would just need to provide this as a dictionary as described in the README (i.e., it is not as convenient as a HF pipeline that takes the text as an input). I am not sure that you will get great results with these kinds of descriptions. I would suggest things that a pretrained language model might better "understand", like "Fast-paced, exciting, orchestral with drums". Obviously you are not going to label 20,000 samples by hand - the same description can be used for each "slice" of the 30s previews. Perhaps you can find a way to get a meaningful description from the Spotify API / scraping?
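As a sketch of what such a dictionary could look like: the file names, the stub `embed` function, and the pickled blob below are all illustrative assumptions, not the exact format - check the README for what the training script actually expects.

```python
import pickle

# Stub embedder for illustration only -- in practice you would swap in a
# real text-embedding model that returns a fixed-size vector of floats.
def embed(text):
    return [float(len(word)) for word in text.split()]

# The same description can be reused for every slice of one 30s preview.
description = "Fast-paced, exciting, orchestral with drums"
encodings = {f"preview_slice_{i}.png": embed(description) for i in range(3)}

# Serialize the mapping so it can be handed to the training script.
blob = pickle.dumps(encodings)
```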
from audio-diffusion.
The descriptions come from audio analysis algorithms, so my hope is that they will be a more accurate description of the audio, with a stronger signal than something like genre or a feel. I think if I feed the descriptions I have into an LLM with some good prompt engineering, I should get more natural (human-like) descriptions. I have a few installed, and I can use LangChain to batch them.
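For example, a handful of analysis fields (names modeled loosely on the Spotify API's audio features, purely illustrative) could be mapped to more natural wording before, or instead of, an LLM pass:

```python
# Hypothetical audio-analysis fields for one sample; the thresholds are
# arbitrary assumptions, just to show the shape of the mapping.
features = {"tempo": 174.0, "energy": 0.92, "acousticness": 0.03}

def describe(features):
    pace = "fast-paced" if features["tempo"] > 120 else "slow"
    feel = "high-energy" if features["energy"] > 0.7 else "mellow"
    timbre = "acoustic" if features["acousticness"] > 0.5 else "electronic"
    return f"{pace}, {feel}, {timbre}"

print(describe(features))  # -> "fast-paced, high-energy, electronic"
```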
For this experiment I will only be using single-shot kick drums (0.1-2 secs long, all padded to 2 secs). I have a pretty good set that ranges from natural real-world kick drums (like a Pearl drum set) to more experimental industrial sounds that are completely synthetic. I have a metadata store that I can use to create a more evenly distributed set, so I don't have an overabundance of 808s or 909s, etc. So it is monolithic in that it's only one class of sound, but with quite a lot of variety.
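A minimal sketch of that even-distribution step, assuming the metadata store can yield (file, class) pairs - all names here are made up:

```python
import random
from collections import defaultdict

# Hypothetical metadata: each kick-drum sample tagged with a source class.
samples = [("kick_001.wav", "808"), ("kick_002.wav", "808"),
           ("kick_003.wav", "909"), ("kick_004.wav", "acoustic"),
           ("kick_005.wav", "industrial"), ("kick_006.wav", "808")]

def balanced_subset(samples, per_class, seed=0):
    """Draw at most per_class samples from each class, shuffled."""
    by_class = defaultdict(list)
    for name, cls in samples:
        by_class[cls].append(name)
    rng = random.Random(seed)
    chosen = []
    for cls, names in sorted(by_class.items()):
        rng.shuffle(names)
        chosen.extend(names[:per_class])
    return chosen

print(balanced_subset(samples, per_class=1))  # one kick per class
```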
For the training, could you tell me what type of GPU you used, how long it took to train on 20k samples, and what I can expect for memory usage? I have a 4090 24GB, but I can rent an A100 80GB if necessary.
For vectorizing the text and passing in a dictionary, I can't seem to find anything about that in the README. I'm not sure how I'm missing it. Can you point me to it? Which tokenizer should I use?
Really appreciate the guidance.
from audio-diffusion.
I think I mention somewhere in the README that I used one 2080 Ti and it took about 40 hours to train on 20,000 samples. So you will be fine with a 4090, at least at the same resolution (256x256). If you go up to 512x512 (which can allow for higher quality, longer samples) you should still be able to train a VAE (I am doing exactly this right now).
Bear in mind that the LLMs are trained on different texts from the ones you are using, but you can only know by trying.
from audio-diffusion.
Related Issues (20)
- Recommended training hyperparameters for 44.1Khz & 48Khz Samplerate
- diffusers v-0.12.0 causes import issues
- Diffusers v0.12 removed the `ema_model.averaged_model` attribute
- Increasing input size
- how does the audio_to_images.py file work?
- Whether the longer music sample is the repetition of a shorted sample?
- NameError: name 'transformers' is not defined upon running model via Gradio
- Dataset constriants
- Training own music samples?
- Can I input audio file then generate image
- Numpy Error
- AttributeError: 'AutoencoderKL' object has no attribute 'sample_size'
- teticio/audio-diffusion-256 is really good
- multi-gpu training
- [Little Feedback] Thank you! :)
- is it possible to use the train_unet.py script as a regular ldm?
- whats the difference between 256 and 512 dataset
- Duration of generated audio
- WARNING: audio_to_images: No valid audio files were found error!