
audio-diffusion's Introduction

Hi there 👋 I'm Robert Smith. Please have a look around and feel free to drop me a line if you find anything interesting.

Robert Dargavel Smith's GitHub stats · Robert Dargavel Smith's programming languages

Robert Dargavel Smith's LinkedIn profile · Robert Dargavel Smith's Medium articles · Robert Dargavel Smith's email · Robert Dargavel Smith's Hugging Face

Top Repositories

Deej-AI · audio-diffusion


audio-diffusion's Issues

WARNING: audio_to_images: No valid audio files were found error!

I have a directory named dataSet containing 4 MP3s and a WAV, and I can't get past the 'no valid audio files' error.

The script seems to be seeing the files, so I'm at a bit of a loss. I'm running on Windows 10.

CMD log:

python scripts/audio_to_images.py ^
More? --resolution 256,256 ^
More? --input_dir dataSet ^
More? --output_dir data
100%|████████████████████████████████████████| 5/5 [00:00<00:00, 12.69it/s]
audio-diffusion\scripts\audio_to_images.py:64: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
logger.warn("No valid audio files were found.")
WARNING:audio_to_images:No valid audio files were found.
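
As a quick sanity check (a minimal sketch, not part of the repository; the directory name matches the command above), you can verify whether librosa can actually decode each file, since on Windows MP3 support depends on a working audioread/ffmpeg backend and a file that plays fine in Explorer may still fail to load:

# check_audio.py -- hypothetical helper script
import os
import librosa

input_dir = "dataSet"  # same directory passed to audio_to_images.py
for name in os.listdir(input_dir):
    path = os.path.join(input_dir, name)
    try:
        y, sr = librosa.load(path, mono=True)
        print(f"OK   {name}: {len(y) / sr:.1f} s at {sr} Hz")
    except Exception as e:  # unreadable or undecodable file
        print(f"FAIL {name}: {e}")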

Starting Point for Dataset

First, love the work, been playing around with it for the past day or two.

I saw in another issue that you mentioned needing to play around to find a happy medium in terms of # of samples, sample length, dataset totals, epochs, etc... but I'm hoping that a rough rubric could come out, as I've struggled to get anything meaningful from a few different approaches.

My latest approach was to take 5-second 'clips' every 10% of a song, since my corpus for generation is rather small (~50 songs). But I'm struggling to understand the relationship between the number of samples and epochs, i.e. you mentioned that getting 100 epochs across 20K samples was positive. Is that a roughly linear relationship, e.g. since I have 500 samples (2.5%), should I multiply my epoch total by an equivalent scaling factor, e.g. 10,000 epochs? Or is there no clear relationship, and I just need to find more samples to get closer to the 20K target?

Thank you again for any advice you can provide.

Diffusion steps 1000.

Hi!

I wonder why you set the number of diffusion steps to 1000. Could it be too big or too small?

Best,
Tristan
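
For context, 1000 training timesteps is the standard DDPM choice and the default in the diffusers schedulers this repo builds on; a smaller number of steps can still be used at generation time. A minimal sketch (values are illustrative):

from diffusers import DDIMScheduler

scheduler = DDIMScheduler(num_train_timesteps=1000)  # library default
scheduler.set_timesteps(50)  # e.g. only 50 denoising steps at inference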

[Question] What's the toll on a local desktop computer?

Hello!

I've installed and can use Stable Diffusion on my desktop computer.
Setup:
Windows 10 pro
Nvidia Quadro M4000 8GB GDDR5
AMD Ryzen 7 5800X 8-Core Processor, 4601 Mhz, 8 Core(s), 16 Logical Processor(s)
32GB RAM
Python/powershell install of AUTOMATIC1111/stable-diffusion-webui with Gradio WebUI
Python 3.10.7 for Windows 1
Renders are slow, but achievable in about 90 seconds for 512x1020.

It works great.

I'd like to install audio-diffusion, but am concerned about render times/file sizes.
Given a library of various genres, how likely is it that my local computer can handle this?
If I had a thousand songs to train in various genres/styles, how long would it take?
Would my computer pretty much be tied up for that time?
How long does it take to render a prompt based on trained input?

If you have a link to where all these answers are already, I'd be grateful.

Thanks for your time.

Chris

IndexError: list index out of range

I'm getting this error with both Hugging Face and local training. I don't know exactly what the problem is; everything seems fine.

accelerate launch --config_file accelerate_local.yaml \ 
train_unconditional.py \
  --dataset_name mertcobanov/audio-diffusion-256 \
  --resolution 256 \
  --output_dir ddpm-ema-audio-256 \
  --num_epochs 100 \
  --train_batch_size 2 \
  --eval_batch_size 2 \
  --gradient_accumulation_steps 8 \
  --learning_rate 1e-4 \
  --lr_warmup_steps 500 \
  --mixed_precision no \
  --push_to_hub True \
  --hub_model_id audio-diffusion-256 \
  --hub_token $(cat $HOME/.huggingface/token)
The following values were not passed to `accelerate launch` and had defaults used instead:
        `--num_cpu_threads_per_process` was set to `12` to improve out-of-box performance
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
Downloading: 100%|████████████████████| 699/699 [00:00<00:00, 507kB/s]
Using custom data configuration mertcobanov--audio-diffusion-256-1545067e5255003f
Downloading and preparing dataset None/None (download: 9.98 MiB, generated: 9.98 MiB, post-processed: Unknown size, total: 19.96 MiB) to /home/mert/.cache/huggingface/datasets/mertcobanov___parquet/mertcobanov--audio-diffusion-256-1545067e5255003f/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec...
Downloading data: 100%|████████████████████| 10.5M/10.5M [00:00<00:00, 18.0MB/s]
Downloading data files: 100%|████████████████████| 1/1 [00:01<00:00,  1.89s/it]
Extracting data files: 100%|████████████████████| 1/1 [00:00<00:00, 1764.54it/s]
Dataset parquet downloaded and prepared to /home/mert/.cache/huggingface/datasets/mertcobanov___parquet/mertcobanov--audio-diffusion-256-1545067e5255003f/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec. Subsequent calls will reuse this data.
Cloning https://huggingface.co/mertcobanov/audio-diffusion-256 into local empty directory.
Epoch 0: 100%|█| 129/129 [00:38<00:00,  3.31it/s, ema_decay=0.974, loss=0.0125, lr=2.58e-5,
100%|████████████████████| 1000/1000 [01:26<00:00, 11.53it/s]
Traceback (most recent call last):
  File "/home/mert/development/audio-diffusion/train_unconditional.py", line 319, in <module>
    main(args)
  File "/home/mert/development/audio-diffusion/train_unconditional.py", line 241, in main
    accelerator.trackers[0].writer.add_images(
IndexError: list index out of range
Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 313172]].
Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 313172]].
Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 313172]].
Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 313172]].
Traceback (most recent call last):
  File "/home/mert/anaconda3/envs/audio-generation/bin/accelerate", line 10, in <module>
    sys.exit(main())
  File "/home/mert/anaconda3/envs/audio-generation/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 43, in main
    args.func(args)
  File "/home/mert/anaconda3/envs/audio-generation/lib/python3.10/site-packages/accelerate/commands/launch.py", line 837, in launch_command
    simple_launcher(args)
  File "/home/mert/anaconda3/envs/audio-generation/lib/python3.10/site-packages/accelerate/commands/launch.py", line 354, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/mert/anaconda3/envs/audio-generation/bin/python3.10', 'train_unconditional.py', '--dataset_name', 'mertcobanov/audio-diffusion-256', '--resolution', '256', '--output_dir', 'ddpm-ema-audio-256', '--num_epochs', '100', '--train_batch_size', '2', '--eval_batch_size', '2', '--gradient_accumulation_steps', '8', '--learning_rate', '1e-4', '--lr_warmup_steps', '500', '--mixed_precision', 'no', '--push_to_hub', 'True', '--hub_model_id', 'audio-diffusion-256', '--hub_token', 'my_token']' returned non-zero exit status 1.
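
The failing line indexes accelerator.trackers[0], which is empty when no experiment tracker (e.g. tensorboard) has been initialised for the run, so the logging step blows up even though training itself succeeded. A hedged workaround, sketched here with illustrative arguments, is to guard that access (or make sure tensorboard is installed so the tracker is actually created):

# hypothetical guard around the image-logging call in train_unconditional.py
if accelerator.trackers:
    accelerator.trackers[0].writer.add_images(
        "test_samples", images, epoch)  # tag/arguments are illustrative
else:
    print("No tracker configured; skipping image logging.")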

Music generation conditioned on text and music

Is it possible to generate music conditioned on both text and music?
Issue running conditional_generation.ipynb:
While loading audio_diffusion = AudioDiffusion(model_id="teticio/conditional-latent-audio-diffusion-512"), I get the error:
"ValueError: mel/audio_diffusion.py as defined in model_index.json does not exist in teticio/conditional-latent-audio-diffusion-512 and is not a module in 'diffusers/pipelines'"

Asking about the VAE losses

Hello, I am running train_vae.py on my own dataset. Can you provide the loss metrics of the VAE? I would like to know what loss the model converges to.

train_vae.py reports error: 'gpu' is not a valid DistributedType

When I test train_vae.py with the command "python scripts/train_vae.py --dataset_name data/audio-diffusion-256 --batch_size 2 --gradient_accumulation_steps 12", the following error occurs:
ValueError: 'gpu' is not a valid DistributedType.
Any suggestions for solving it? Thanks.

Recommended training hyperparameters for 44.1 kHz & 48 kHz sample rates

Hi,

Thanks for the great repository and code, been having some fun training some models with it using the default parameters.

I'm trying to experiment with higher sample rates, such as 44.1 kHz and 48 kHz.

What configurations of hop_length and n_fft would be needed to achieve good results?

Thanks for the tips!
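
As a rough starting point (a sketch under assumed defaults, not a recommendation from the author), the clip duration, sample rate, hop length and spectrogram width are tied together by duration ≈ x_res × hop_length / sample_rate, so raising the sample rate while keeping the image resolution fixed means either shortening the clips or increasing hop_length (with n_fft usually a power of two a few times larger than the hop length):

# back-of-the-envelope helper for choosing hop_length (illustrative values)
def hop_length_for(sample_rate, clip_seconds, x_res):
    """Hop length so that clip_seconds of audio spans x_res spectrogram frames."""
    return round(sample_rate * clip_seconds / x_res)

print(hop_length_for(22050, 5, 256))  # ~431, close to the commonly used 512
print(hop_length_for(44100, 5, 256))  # ~861 for 44.1 kHz
print(hop_length_for(48000, 5, 256))  # ~938 for 48 kHz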

training notebook

@teticio ,

it would be amazing if you could make a training Colab notebook, so that users could upload their samples and have them trained on a Colab T4 GPU.

Also, how many samples would you recommend to get good results? For example:

If a user is training on 1 s audio clips of different bird chirps, how many audio samples would be needed as input for training with DDPM? Have you done any experiments?

What if we only have a small number of audio files, say 10 WAV files of 2 s each? Would that work? If so, how many epochs should one train for?

If you do make a training notebook, I hope you mention the recommended number of samples and training epochs in the notebook instructions.

Is it possible to use the train_unet.py script as a regular LDM?

I tried to use your code as a normal LDM, but the model doesn't converge even after 12 epochs (10,000+ data examples).
(screenshot)
I used these settings:

vae = "vae"
weight_decay = 1e-6
train_batch_size = 32
num_epochs = 100
learning_rate = 5e-5
resolution = (256,256)
train_data_dir = "/content/data/images"
num_train_steps=2000
lr_scheduler = "cosine"
lr_warmup_steps=500

other code:

model = UNet2DModel(
    sample_size=latent_resolution,
    in_channels=vqvae.config["latent_channels"],
    out_channels=vqvae.config["latent_channels"],
    layers_per_block=2,
    block_out_channels=(128, 128, 256, 256, 512, 512),
    down_block_types=(
                    "DownBlock2D",
                    "DownBlock2D",
                    "DownBlock2D",
                    "DownBlock2D",
                    "AttnDownBlock2D",
                    "DownBlock2D",
    ),
    up_block_types=(
                    "UpBlock2D",
                    "AttnUpBlock2D",
                    "UpBlock2D",
                    "UpBlock2D",
                    "UpBlock2D",
                    "UpBlock2D",
    ),
)
noise_scheduler = DDIMScheduler(num_train_timesteps=num_train_steps)
        clean_images = batch
        clean_images = clean_images.to(model.device)
        clean_images = clean_images * 0.18215

        # Sample noise that we'll add to the images
        noise = torch.randn(clean_images.shape).to(clean_images.device)
        bsz = clean_images.shape[0]
        # Sample a random timestep for each image
        timesteps = torch.randint(0,noise_scheduler.config.num_train_timesteps,(bsz, ),device=clean_images.device,).long()

        # Add noise to the clean images according to the noise magnitude at each timestep
        # (this is the forward diffusion process)
        noisy_images = noise_scheduler.add_noise(clean_images,noise,timesteps)
        

        model_output = model(noisy_images, timesteps)["sample"]
        loss = F.mse_loss(model_output, noise)
        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()

google colab notebook:
https://colab.research.google.com/drive/1AjH6ujmLIsU-4hP0bX0jtaFv_WGHpu2o?usp=sharing

[Little Feedback] Thank you! :)

Hello teticio! (and the community 😄),

I'm a sound and music enthusiast, and I recently discovered your project on GitHub. I wanted to express my immense gratitude for your work and for sharing it. It's been a joy to experiment with the diffusers.

Below are some specific details about my experience with your project:

  • Dataset: I used 500 tracks; a hundred of them were my own production (spanning genres like minimalism, jazz, broken beat, and break beat), another hundred from some of my producer friends in a similar style, and the remaining 300 tracks were a curated collection of the best music used in DJ sets and podcasts. All files were in .wav format with a sample rate of 48,000 Hz, and the spectrogram dimensions were 448×448.

  • Hyperparameters: The training was conducted with the following settings:

    --num_epochs 15
    --train_batch_size 1
    --eval_batch_size 1
    --gradient_accumulation_steps 16
    --learning_rate 1e-4
    --lr_scheduler cosine
    --lr_warmup_steps 500
    --mixed_precision fp16
    --adam_beta1 0.9
    --adam_beta2 0.999
    --adam_weight_decay 1e-6
    --adam_epsilon 1e-08
    --use_ema True
    --ema_inv_gamma 1.0
    --ema_power 0.75
    --ema_max_decay 0.9999
    --hop_length 469
    --sample_rate 48000
    --n_fft 4096
    --start_epoch 0
    --num_train_steps 1000

This amounted to almost 40 hours of training time. The results were fascinating; while some of the generated sounds were unusable, some grooves (especially at the low end) were impressively clear and precise.

This training configuration led to the best results I've seen among all my trials, even with relatively few epochs and limited computational power. It seems that the key to the best possible outcome lies in ensuring perfect consistency of the dataset and prioritizing quality over quantity.

I'd love to share my audio results here (with your permission, of course). Currently, I'm running a new 20-epoch training primarily on jazz sounds and softer old-school trip-hop. I've drastically reduced the size of the dataset (down to 46 tracks) but fine-tuned the hyperparameters accordingly. I'm hopeful for satisfying results.

To conclude, I believe that the creative possibilities with substantial computational power could lead to extraordinary generations. I'm curious to know if anyone else has ventured far with substantial resources.

Once again, thank you! Your work has led me to explore a whole new world of creativity. I'm excited to see what you'll develop next.

(P.S.: I apologize for any issues with the formatting, I'm still getting used to GitHub, and I officially joined today)

Best,
Fonk

Couldn't cast because column names don't match

I'm attempting to run the command

python audio_to_images.py --resolution 64 --hop_length 1024 --input_dir data-in --output_dir data-out

on a directory of .wav files, but I keep receiving the error

ValueError: Couldn't cast
__index_level_0__: null
-- schema metadata --
pandas: '{"index_columns": ["__index_level_0__"], "column_indexes": [{"na' + 310
to
{'image': Image(decode=True, id=None), 'audio_file': Value(dtype='string', id=None), 'slice': Value(dtype='int16', id=None)}
because column names don't match

Is there a specific filetype for the audio I should be using?
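
One hedged pointer (an assumption based on the error text, not a confirmed diagnosis): the stray __index_level_0__ column comes from a pandas index being written into the Arrow schema, not from the audio format itself, so WAV files should be fine. If you are adapting the script, dropping the index when the dataset is built from a DataFrame avoids that column, e.g.:

import pandas as pd
from datasets import Dataset

# illustrative: df stands for the frame of generated examples
df = pd.DataFrame({"image": [], "audio_file": [], "slice": []})
ds = Dataset.from_pandas(df, preserve_index=False)  # no __index_level_0__ column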

Too many open files during preprocessing

Hi!

I just found your repository. I was intrigued from the beginning, and after a closer look I am now training on my AI station 🤗 Great work!

I tried to preprocess a dataset with a lot of audio files. The audio-to-image preprocessor dies with an exception telling me that there are too many open files. Does the preprocessor close all the files it opens?

Looking forward to working more with your repo!

Best,
Tristan
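
If the limit is being hit at the operating-system level rather than by a leak, one hedged workaround (Linux/macOS sketch, values illustrative) is to raise the per-process open-file limit before running the preprocessor:

import resource

# raise the soft limit on open file descriptors up to the hard limit
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (min(4096, hard), hard))
print(f"open-file limit raised from {soft} to {min(4096, hard)}")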

Increasing input size

Have you tried doing it with samples longer than 5 seconds? If so, what is the memory requirement for something like that?

multi-gpu training

Wondering how to train the model with multiple GPUs? Thanks! (I'm new to PyTorch Lightning.)

NameError: name 'transformers' is not defined upon running model via Gradio

I'm currently using the colab at https://colab.research.google.com/github/teticio/audio-diffusion/blob/master/notebooks/gradio_app.ipynb

Steps to reproduce:

  1. Run the first cell
try:
    # are we running on Google Colab?
    import google.colab
    !git clone -q https://github.com/teticio/audio-diffusion.git
    %cd audio-diffusion
    %pip install -q -r requirements.txt
except:
    pass

NOTE: Saw this error at the end of running the first cell, not sure if it is related:
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts. ipython 7.9.0 requires jedi>=0.10, which is not installed.

  2. Run the second cell
import os
import sys
sys.path.insert(0, os.path.dirname(os.path.abspath("")))
  3. Launch the app

  4. Press Submit from the web UI with any of the models; in this example, I used the teticio/audio-diffusion-instrumental-hiphop-256 model.

  5. The web UI errors out, and Colab provides this traceback:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/gradio/routes.py", line 374, in run_predict
    output = await app.get_blocks().process_api(
  File "/usr/local/lib/python3.8/dist-packages/gradio/blocks.py", line 1017, in process_api
    result = await self.call_function(
  File "/usr/local/lib/python3.8/dist-packages/gradio/blocks.py", line 835, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/usr/local/lib/python3.8/dist-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/usr/local/lib/python3.8/dist-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/usr/local/lib/python3.8/dist-packages/anyio/_backends/_asyncio.py", line 857, in run
    item = self.queue.get()
  File "/content/audio-diffusion/app.py", line 9, in generate_spectrogram_audio_and_loop
    audio_diffusion = AudioDiffusion(model_id=model_id)
  File "/content/audio-diffusion/audiodiffusion/__init__.py", line 30, in __init__
    self.pipe = AudioDiffusionPipeline.from_pretrained(self.model_id)
  File "/usr/local/lib/python3.8/dist-packages/diffusers/pipelines/pipeline_utils.py", line 828, in from_pretrained
    transformers_version = version.parse(version.parse(transformers.__version__).base_version)
NameError: name 'transformers' is not defined
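
A likely workaround (an assumption based on the traceback, in which diffusers references transformers that is not importable in the Colab runtime) is to install it in an extra notebook cell and restart the runtime before launching the app:

%pip install -q transformers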

Mind if I share?

Hi Robert,

Your results so far are amazing, so much better than any others I've seen, so I'm very surprised this repo hasn't blown up in popularity. Mind if I share this with the LAION community? I feel like they'd find it really interesting. They're currently working on a huge audio dataset, so I wonder if there could be potential to use it, alongside implementing some sort of text guidance.

You definitely deserve loads of recognition for this, but I'd understand if you're happy it's not well known yet, if that lets you develop it quietly without too much interference!

Otherwise, I'm looking forward to seeing how far you can take these methods,

Cheers

Duration of generated audio

Hi, I tried to train the VAE with my own audio dataset.
But when I generate audio with my model, it only produces a 3-second audio file even though the original audio was 26 seconds.
How can I solve this problem?

Numpy Error

Which version of NumPy does this require?

How to run these scripts in Linux?

(screenshot)

Hi! I'm running app.py on Linux, and it shows that it is running on a local URL.
Is this right? I also want to know how long this script should take to run.

Questions on conditional generations

Hello,

I've been using your script for a while. I have two questions about the conditional generations:

  • First, using it drastically changes the memory requirements. In the example with small mels ("_64"), it fits on my 2080 Ti without encodings, but with encodings I have to use 16 steps of gradient accumulation (batch size 1!!). Is this expected? Have you actually managed to train something meaningful on a reasonable cluster, such as 2 V100s, for example?

  • Secondly, I see that with the model you offer, the encodings are between -200 and 200, but they are never really normalized. I checked the train_unet.py script and that is the case. Is this expected? The images are normalized, though.

(Pdb) noisy_images,
(tensor([[[[ 0.5455,  1.1415, -1.1012,  ...,  0.6916, -0.6215,  0.9976],
          [-0.2963,  0.7279,  1.5120,  ...,  0.8808,  0.7156, -1.0958],
          [ 1.8480, -0.7409, -0.0349,  ...,  1.5505,  1.2365,  1.9463],
          ...,
          [-0.5625,  1.0219,  0.9210,  ...,  2.3733,  1.0292, -0.5215],
          [ 1.9254, -1.2007, -0.2931,  ..., -0.0562,  1.5366,  0.3513],
          [ 0.3291, -0.5048, -0.1630,  ...,  0.7993, -0.9637,  0.0415]]]],
       device='cuda:0'),)
(Pdb) batch["encoding"]
tensor([[[ 31.4098,  72.1091, -80.4822,  83.4312, -93.8468,  13.4121,   0.1180,
           23.4481, -98.8069,  16.9313]]], device='cuda:0')

Regards,

Training with size 64 works. Training with size 256 does not yield good results.

Hi!

I am digging deeper and finding a lot of insights!

I did two training runs on the same dataset. The dataset has around 140 samples, which is rather small; it is just for me to get started.

Training with image size 64 yields some promising results. A model trained on image size 256 yields only noise. Here is a comparison:
(comparison screenshot)

The loss curves look normal.

Any ideas?

Working with your repo is fantastic!

Best,
Tristan

Question about add_noise

I have a batch of data, and the value range of clean_image is [0,1]. Do I need to modify add_noise to adapt to the value range of clean_image?
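
For reference (a hedged note, not an answer from the maintainer): DDPM-style training code generally expects images scaled to [-1, 1] rather than [0, 1], so the usual approach is to rescale the batch rather than modify add_noise, e.g.:

# illustrative rescaling from [0, 1] to the [-1, 1] range typically used for diffusion training
clean_images = clean_images * 2.0 - 1.0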

teticio/audio-diffusion-256 is really good

The results of the model trained on the teticio/audio-diffusion-256 data are very good.
I am curious about this dataset.
What is the final size of this dataset? If you cut 20,000 songs into 5-second pieces, that's a much larger scale than 20,000 samples. If the final number of 5-second pieces is 20,000, how many original songs were used?

Second, in the script file provided by the author, audio is clipped to 10 seconds, so why was this dataset clipped to 5 seconds for training? Does clipping at 5 seconds perform better than clipping at 10 seconds?

High fidelity training?

I'd love your thoughts on whether this is a viable experiment to run, or whether I'm missing something important about how diffusion models work with audio/spectrograms. I'm really excited to give your project a test, but I'm an ML enthusiast, so my knowledge is limited to conceptual understandings. Any guidance you can give would be greatly appreciated.

I've noticed that your project, as well as Riffusion, generates low-fidelity audio (<= 10 kHz, about half of hi-fi). I would like to run an experiment to produce short, high-fidelity single-hit sounds.

I've done some experimenting with spectrograms, and I found that if I set the n_fft, sample rate, mels, etc. appropriately, I can create spectrograms that convert from audio to image and back to audio while maintaining high-fidelity sound up to 20,000 Hz. In one experiment I was able to get that hi-fi result with an image of 512x256; not sure if that is useful, but it seemed like a smaller image size would train faster.
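
For anyone wanting to reproduce that audio → spectrogram → audio round trip, here is a minimal sketch with illustrative settings (the exact n_fft / hop_length / n_mels needed for a 512x256 image depend on your clip length and sample rate, and kick.wav is a placeholder file name):

import librosa

# illustrative wide-band mel settings; tune to match the target image size
sr, n_fft, hop_length, n_mels = 44100, 2048, 256, 256

y, _ = librosa.load("kick.wav", sr=sr)
S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                   hop_length=hop_length, n_mels=n_mels)
y_hat = librosa.feature.inverse.mel_to_audio(S, sr=sr, n_fft=n_fft,
                                             hop_length=hop_length)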

I have around 110,000 drum kick sounds that I generated using samples and sound design techniques (FX, blending, pitch shifting, stretching, etc.), ranging from analog to electronic to experimental kick drum sounds.

I then used sound analysis algorithms to get the features of the sound and then mapped those to a variety of natural language phrases. I varied them in length from 30-75 tokens, with a nice bell curve with a peak around the 50-60 token length mark.

I have a Core i9 24-core CPU and a 4090 with 24 GB of memory. I don't mind letting it run for a week or so to get the best results, though I imagine I'd start off with something small like 128x128 and then scale up to 512x512. Any advice on training would be super helpful; it's so hard to get good answers about training on datasets > 100 images.

Here is an example of my descriptions. They are all different; sometimes they include the scientific measure and its value, and other times a phrase that represents that measure. In this case it's an acoustic kick with a good amount of bass, high frequencies and distortion.

Class dustkick
Very low brightness
Crest 9
Key C1
Medium-high crest factor
Presence 0
Medium-high distortion
Tonal powerful
Brightness dark
Noisiness 3
Harmonicity 7
Types kicks
Distortion intriguing
Length epic
Headroom exciting
RMS 2
Moderate loudness
Duration ultra long
Tags hit acoustic drum
Ultra high harmonicity
Loudness bright

Training own music samples?

Hi, first of all, I loved the remixes and loops that you shared; great work. I was curious to train a model on my own music samples, and I am trying to follow a methodology similar to the one described in your Medium article.

So far I have downloaded 30 s samples (from the preview URL provided by Spotify), stored under /content/playlist, and also generated spectrograms of the downloaded samples using librosa under /content/playlist_spectrograms, but I think this is not quite what you had described.

I am wondering if the next step is to follow this notebook:
https://github.com/teticio/audio-diffusion/blob/main/notebooks/train_model.ipynb (this produces a model, right?)

and then to use the resulting model with the following notebook?
https://github.com/teticio/audio-diffusion/blob/main/notebooks/audio_diffusion_pipeline.ipynb

I am wondering if this is the right way to move forward, or if there are some steps that I am missing?

Regards Tim
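
If it helps, the preprocessing step that normally precedes the training notebook is the audio_to_images.py script shown in the issues above; something along these lines (the output path is illustrative) converts the downloaded previews into the image dataset the notebook expects:

python scripts/audio_to_images.py --resolution 256,256 --input_dir /content/playlist --output_dir /content/playlist_data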

Dataset constraints

@teticio Thanks for making this nice music generation model available. It helped me a lot in my project. I played around with the pre-trained models and the results are very sensible. I would like to train my own model using my library of music recordings. I have a few doubts regarding that; please help me clarify them.

How many audio files were used to train the teticio/audio-diffusion-256 model? What is the duration of each audio file? Can the music recordings be in MP3 or WAV format?
