
Comments (45)

rsxdalv avatar rsxdalv commented on August 22, 2024 2

I will close this issue as Stable Audio has been added. In the future it will be added to the React UI too. I optimized the memory a bit: it still spikes, but only very briefly, so you can use the remaining VRAM freely. I tested this by running Stable Diffusion alongside Stable Audio. (Edit: by using 'half', the consistent memory consumption is only 6 GB, but there is still a spike of 14 GB lasting a few seconds, which could perhaps be reduced to allow running on smaller GPUs.)

Finally, I invested some GPU resources to generate Stable Audio samples and test different prompts at https://promptecho.com/stableaudio . The parameters are quite useful:

  • A different sampler can generate different audio; however, the speed is basically the same.
  • CFG scale makes the audio 'fluid' when low, but at 0.5 it becomes nonsense. When it's too high, the audio becomes repetitive and unnatural.
  • Sampling steps can save time for those who want to tweak them. Some genres can generate with as few as 50 steps, while Electro music seems to do better with 100-200 steps. Going up to 500 steps seems to change almost nothing in many cases.
  • Seconds Total does almost nothing: it neither makes generation faster nor seems to change the output much. I think it might be useful if you want to have everything your prompt contains within a short duration, i.e., 'wind chimes' within 3 seconds rather than spread over 47 seconds.
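
To compare settings like these side by side, a small grid of runs can help. This is only a minimal sketch: the step counts come from the notes above, while the sampler names and CFG values are placeholders (not recommendations), and passing each dict to an actual generation call is left out because that call is not shown in this thread.

```python
from itertools import product

# Step counts mentioned in the notes above.
STEPS = [50, 100, 200]
# Placeholder values, chosen only for illustration.
CFG_SCALES = [3.0, 7.0]
SAMPLERS = ["sampler-a", "sampler-b"]


def build_grid(prompt, seed=0):
    """Return one settings dict per (sampler, cfg_scale, steps) combination."""
    return [
        {
            "prompt": prompt,
            "seed": seed,  # fixed seed so only the swept parameters vary
            "sampler": sampler,
            "cfg_scale": cfg,
            "steps": steps,
        }
        for sampler, cfg, steps in product(SAMPLERS, CFG_SCALES, STEPS)
    ]


grid = build_grid("wind chimes")
print(len(grid))  # 2 samplers * 2 CFG values * 3 step counts = 12
```

Keeping the seed fixed means any audible difference should come from the swept parameters rather than from the random start.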

from tts-generation-webui.

chlowden avatar chlowden commented on August 22, 2024 2

image
river works as you expect ..


mykeehu avatar mykeehu commented on August 22, 2024 2

Great work! I tested it, the save works perfectly! Thank you!


rsxdalv avatar rsxdalv commented on August 22, 2024 1

OK, never mind, Stable Audio is amazing sometimes. If you have the GPU for it, it generates quickly (anything below the 'default size', which I think is 47 seconds, is not going to generate faster, but if you want a full sample it's so quick), and it often generates without needing nearly as much steering as you would expect with MusicGen. That being said, the license is still the way it is.


rsxdalv avatar rsxdalv commented on August 22, 2024 1

The filenames and folder names are really long, with a prompt you can easily reach the Windows 255 character path limit. I'd rather say the date-seed format could be more manageable, since the file you're describing is next to it anyway.

Fixed the filenames:

>>> def get_name(prompt):
...     return (
...         prompt.replace(" ", "_")
...         .replace(":", "_")
...         .replace("'", "_")
...         .replace('"', "_")
...         .replace("\\", "_")
...         .replace(",", "_")
...         .replace("(", "_")
...         .replace(")", "_")
...         .replace("?", "_")
...         .replace("!", "_")
...         .replace("&", "_")
...         # only first 15 characters
...         .replace("__", "_")[:15]
...     )
...
>>> # test get_name
>>> get_name("bamboo flute, zed, reiki, meditation music")
'bamboo_flute_ze'
>>> get_name("funk, disco, R&B, AOR, soft rock, and boogie")
'funk_disco_R_B_'
>>> get_name("Electro House, 320kbps")
'Electro_House_3'
>>> get_name("minimalism, piano, acoustic key E minor, 120bpm and 108bpm piano, Classical, Avant-Garde, dynamic rhythm")
'minimalism_pian'


rsxdalv avatar rsxdalv commented on August 22, 2024 1

Even the file names need a little fix. Stable Audio Generator produced such a prompt, and it is not saved because of the characters it contains:

Genre: Pop
Mood: Romantic, Atmospheric
Style: 90s
Instruments: Lead, Lead-off
Beats per Minute (BPM): 100
Additional Details: Create a dreamy and nostalgic atmosphere with lush synth pads, gentle piano melodies, a prominent lead melody, and a textural lead-off supporting the track. The music should build up to a cathartic moment filled with emotion and passion, capturing the essence of romanticism in the style of 90s pop.

The problem is with the \n:

Traceback (most recent call last):
  File "I:\tts-generation-webui\installer_files\env\lib\site-packages\gradio\queueing.py", line 407, in call_prediction
    output = await route_utils.call_process_api(
  File "I:\tts-generation-webui\installer_files\env\lib\site-packages\gradio\route_utils.py", line 226, in call_process_api
    output = await app.get_blocks().process_api(
  File "I:\tts-generation-webui\installer_files\env\lib\site-packages\gradio\blocks.py", line 1550, in process_api
    result = await self.call_function(
  File "I:\tts-generation-webui\installer_files\env\lib\site-packages\gradio\blocks.py", line 1185, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "I:\tts-generation-webui\installer_files\env\lib\site-packages\anyio\to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "I:\tts-generation-webui\installer_files\env\lib\site-packages\anyio\_backends\_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "I:\tts-generation-webui\installer_files\env\lib\site-packages\anyio\_backends\_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "I:\tts-generation-webui\installer_files\env\lib\site-packages\gradio\utils.py", line 661, in wrapper
    response = f(*args, **kwargs)
  File "I:\tts-generation-webui\tts-generation-webui\src\stable_audio\stable_audio.py", line 282, in save_result
    os.makedirs(base_dir, exist_ok=True)
  File "I:\tts-generation-webui\installer_files\env\lib\os.py", line 225, in makedirs
    mkdir(name, mode)
OSError: [WinError 123] File name, directory name or volume label syntax is incorrect: 'outputs-rvc\\Stable Audio\\2024-07-11_21-34-59_Format_Band_\nGe'

Should be fixed in the latest update #342
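
For reference, one way to make a prompt-derived name robust against newlines and other awkward characters is to whitelist characters instead of listing the bad ones. This is a minimal sketch, not the actual change made in the update mentioned above; `safe_name` and its 15-character default are hypothetical and only mirror the `get_name` example from this thread.

```python
import re


def safe_name(prompt, max_len=15):
    """Turn a free-form prompt into a short, filesystem-friendly name."""
    # Replace every run of characters outside [A-Za-z0-9-] (spaces, newlines,
    # slashes, colons, parentheses, ...) with a single underscore.
    name = re.sub(r"[^A-Za-z0-9-]+", "_", prompt)
    # Trim underscores at the ends, then cut to the length limit.
    return name.strip("_")[:max_len]


print(safe_name("Genre: Pop\nMood: Romantic, Atmospheric"))
```

Because anything not on the whitelist is replaced, characters nobody thought of in advance (such as `\n` here) cannot slip into a folder name.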


rsxdalv avatar rsxdalv commented on August 22, 2024

Hi, thanks for requesting this! I have been procrastinating with it actually. One question - such a model would require a huggingface account and a login to be used, since this https://huggingface.co/stabilityai/stable-audio-open-1.0 cannot be automatically downloaded. Would you be ok with that?

Please respond as this is a matter that could really determine whether or not people use it.


mykeehu avatar mykeehu commented on August 22, 2024

I don't have a problem downloading the model this way, maybe you could ask for the login to download it? So those who have it can use it, those who don't can't. I don't know why it's tied to a license, but I've seen a video of it making quite good sound effects, so after the login the model would be downloaded.


chlowden avatar chlowden commented on August 22, 2024

I'd be interested in trying this out too, please.


ke1ne avatar ke1ne commented on August 22, 2024

Hi, thanks for requesting this! I have been procrastinating with it actually. One question - such a model would require a huggingface account and a login to be used, since this https://huggingface.co/stabilityai/stable-audio-open-1.0 cannot be automatically downloaded. Would you be ok with that?

Please respond as this is a matter that could really determine whether or not people use it.

For instance, I'm ok with it. Thanks!


dairydaddy avatar dairydaddy commented on August 22, 2024


chlowden avatar chlowden commented on August 22, 2024

I've already downloaded the checkpoint. I presume that those who are enjoying your interface are the sort of people who already have a huggingface account.


rsxdalv avatar rsxdalv commented on August 22, 2024

Stable audio has been added but is causing some problems so it might be added-removed a few times until it's 'stable'.


rsxdalv avatar rsxdalv commented on August 22, 2024

Also, I just want to clarify - with extensive research - stable audio is not a 'stable diffusion 1.5' moment because it has a restrictive, potentially dangerous license (which might be legally unenforceable or impossible to defend in court; it's the very same infamous SD3 license) and I saw comments about Facebook's (notably similarly non-commercially licensed) AudioGen/MusicGen performing similarly.

My biggest issue so far is that running the 'official' inference code results in ~14gb RAM usage, where due to memory management my 24 gb RAM & 24 gb VRAM system would often just fail.

That being said, I really appreciate receiving information about what people want to try and see.


chlowden avatar chlowden commented on August 22, 2024

I concur on the VRAM issue. I often saturate my RTX 3090 with 24GB of RAM using MusicGen. I have not been able to test MultiBandDiffusion due to VRAM saturation. I have seen that python will not release the VRAM it takes up so it blocks the GPU. I have to restart the machine to liberate the VRAM.
If Stable Audio is even worse than MusicGen, that makes it problematic for me to test.


rsxdalv avatar rsxdalv commented on August 22, 2024

Restarting the webui should be enough. Additionally, after I fix the bugs arising from adding this new model, I can spend more time on 'unload model' buttons throughout the UI; however, there will always be some leftovers that aren't unloaded.
As for Stable Audio - generating a 47 second or a 1 second clip seems to use the same amount of VRAM, unless they can somehow fix that themselves. Honestly, there are multiple improvements to the model itself that are waiting to be done by somebody; perhaps they are hoping the community will do them.
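
As a rough idea of what an 'unload model' action can do, here is a minimal sketch. It assumes the loaded model is kept in a dict under a hypothetical `"model"` key, which is not how the webui necessarily stores it; `torch.cuda.empty_cache()` is a real PyTorch call, and it is skipped if PyTorch is not installed.

```python
import gc


def unload_model(holder, key="model"):
    """Drop the reference stored in holder[key] and ask PyTorch to free cached VRAM."""
    holder[key] = None  # remove our reference so the object can be collected
    gc.collect()
    try:
        import torch  # optional: only needed for the CUDA cache step
    except ImportError:
        return
    if torch.cuda.is_available():
        # Returns cached, unused blocks to the driver. Memory that is still
        # referenced elsewhere (the "leftovers") is not affected.
        torch.cuda.empty_cache()


state = {"model": object()}  # stand-in for a loaded model
unload_model(state)
```

This only helps when nothing else still holds a reference to the model, which is why restarting the webui remains the reliable option.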


chlowden avatar chlowden commented on August 22, 2024

And as we are talking of other models ... maybe people are interested in ... Toucan TTS with 7000 languages
https://github.com/DigitalPhonetics/IMS-Toucan


rsxdalv avatar rsxdalv commented on August 22, 2024

And as we are talking of other models ... maybe people are interested in ... Toucan TTS with 7000 languages https://github.com/DigitalPhonetics/IMS-Toucan

For this project it seems decent but could be hard to handle if it means everyone has to install espeak.


mykeehu avatar mykeehu commented on August 22, 2024

Thank you for fantastic work and this addon! I will use it!


chlowden avatar chlowden commented on August 22, 2024

Well done. Thank you so much. I have downloaded the latest version. I am getting an error in the Stable_Audio tab
`
Failed to load Stable Audio demo. Please check your configuration.

Error: expected an indented block after 'if' statement on line 548 (stable_audio.py, line 550)
`
I've run update.py and pip install -r requirements_stable_audio.txt

Any ideas how I can resolve this?


rsxdalv avatar rsxdalv commented on August 22, 2024

Thanks for reporting, fixed it, just update normally or do a git pull for a very quick update.


chlowden avatar chlowden commented on August 22, 2024


chlowden avatar chlowden commented on August 22, 2024


rsxdalv avatar rsxdalv commented on August 22, 2024

Ok I think I figured it out:
There needs to be another folder, since you can have 100s of different Stable Audio models, so in your example moving the files to a new folder like this should work:

data/models/stable-audio/my-first-stable-audio-model/model.ckpt


chlowden avatar chlowden commented on August 22, 2024

Got it working thanks to you.
I created a specific subfolder and put the model.ckpt in it ... but that did not work. The subfolder should also have the model_config.json file from the same huggingface page as the ckpt and then it all worked great. The webui shows the sub folder name, not the ckpt.
image
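
The folder layout described above (one subfolder per model, holding both `model.ckpt` and `model_config.json`) can be checked with a short script. A minimal sketch, assuming the `data/models/stable-audio` root from this thread; `list_stable_audio_models` is a hypothetical helper, not a function from the webui.

```python
from pathlib import Path


def list_stable_audio_models(root="data/models/stable-audio"):
    """Return names of subfolders that contain both required files."""
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(
        d.name
        for d in root.iterdir()
        if d.is_dir()
        and (d / "model.ckpt").is_file()
        and (d / "model_config.json").is_file()
    )


print(list_stable_audio_models())
```

A folder that has the checkpoint but no config is left out of the result, which matches the failure described above.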

Another issue is that the output file is always overwritten in the tts-generation-webui folder. It would be great if it could show up in the outputs folder like the TTS and Musicgen do.

A passing note, the init audio button is not working for me.

Many thanks


chlowden avatar chlowden commented on August 22, 2024

Concerning GPU memory, it has a very low memory footprint in comparison to MusicGen. My RTX 3090 has no problem with SD audio for the moment.


rsxdalv avatar rsxdalv commented on August 22, 2024

So about the outputs - I want to avoid spending a huge amount of time on integrating with the old favorites system and move on to a new system.
When you want them saved, what is the main wish - full integration with the favorites and history and collections, or do you just want to have a folder with all of the files and reasonable filenames/metadata?


rsxdalv avatar rsxdalv commented on August 22, 2024

Now files are being saved to outputs-rvc/stableaudio/...


rsxdalv avatar rsxdalv commented on August 22, 2024

Commercial use is now OK for most people, which makes Stable Audio one of the best, if not the best, open source models we have! (Many other famous models are not open source, are non-commercial, etc.) https://stability.ai/news/license-update


chlowden avatar chlowden commented on August 22, 2024

Thank you for sharing this update. This is excellent news from SD. I was starting to worry that the SD project would fold to the GAFA pressure ... which is still a possibility ...


chlowden avatar chlowden commented on August 22, 2024

Now files are being saved to outputs-rvc/stableaudio/...

Hello
I have been doing tests. I now get individually named folders in outputs-rvc but they are empty. But the audio file still appears at the folder root and is overwritten each time.
I replaced the stableaudio file in src with the new one but maybe there is something else to swap too?
Many thanks


rsxdalv avatar rsxdalv commented on August 22, 2024


chlowden avatar chlowden commented on August 22, 2024

Here is the error
FileNotFoundError: [Errno 2] No such file or directory: 'outputs-rvc/Stable Audio/2024-07-10_16-30-30_((piano_solo))__acoustic__key_E_minor__minimalist_high_energy_4/4__120bpm__320kbps__48.0kHz_Stereo__Studio__sorrow__minimalism_genre__Classical__Avant-Garde__dynamic_rhythm_/2024-07-10_16-30-30_((piano_solo))__acoustic__key_E_minor__minimalist_high_energy_4/4__120bpm__320kbps__48.0kHz_Stereo__Studio__sorrow__minimalism_genre__Classical__Avant-Garde__dynamic_rhythm_.wav'
The system creates a folder and sub-folder that use the prompt and it seems that for the export, the system does not find the path with folder names that are not same ...
image


chlowden avatar chlowden commented on August 22, 2024

No problem with TTS or bark


chlowden avatar chlowden commented on August 22, 2024

I'm using Rocky Linux 8.10


chlowden avatar chlowden commented on August 22, 2024

Incidentally, with the latest stable diffusion file, all my outputs are now 48secs long, even if the total seconds are above or below 48secs.


chlowden avatar chlowden commented on August 22, 2024

seconds_total_slider = gr.Slider(
    minimum=0,
    maximum=512,
    step=1,
    value=sample_size // sample_rate,
    label="Seconds total",
    visible=has_seconds_total,

Maybe the value=sample_size // sample_rate explains why everything is 48 seconds long.


rsxdalv avatar rsxdalv commented on August 22, 2024

@chlowden

Here is the error
FileNotFoundError: [Errno 2] No such file or directory: 'outputs-rvc/Stable Audio/2024-07-10_16-30-30_((piano_solo))acoustic__key_E_minor__minimalist_high_energy_4/4__120bpm__320kbps__48.0kHz_Stereo__Studio__sorrow__minimalism_genre__Classical__Avant-Garde__dynamic_rhythm/2024-07-10_16-30-30((piano_solo))_acoustic__key_E_minor__minimalist_high_energy_4/4__120bpm__320kbps__48.0kHz_Stereo__Studio__sorrow__minimalism_genre__Classical__Avant-Garde__dynamic_rhythm.wav'
The system creates a folder and sub-folder that use the prompt and it seems that for the export, the system does not find the path with folder names that are not same ...

So I think it could be the parenthesis and file system names, I will add a fix for removing the parenthesis, but could you try just a simple 'water' and see if that generation gets saved?
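
Besides the parentheses, the `/` in `4/4` may matter: a slash inside a name is treated as a path separator, so one prompt-derived name turns into a folder plus a sub-folder. A minimal sketch using a shortened, illustrative version of the folder name from the error above (the reporter is on Linux, so `PurePosixPath` is used):

```python
from pathlib import PurePosixPath

# Shortened stand-in for the prompt-derived folder name in the error message.
folder = "2024-07-10_16-30-30_high_energy_4/4__120bpm"

parts = PurePosixPath("outputs-rvc", "Stable Audio", folder).parts
print(parts)
# ('outputs-rvc', 'Stable Audio', '2024-07-10_16-30-30_high_energy_4', '4__120bpm')

# Replacing the slash keeps the name as a single path component.
fixed_parts = PurePosixPath("outputs-rvc", "Stable Audio", folder.replace("/", "_")).parts
print(len(parts), len(fixed_parts))  # 4 3
```

If the folder and the file name are both built from the same unsanitized prompt, the file path ends up one level deeper than the folder that was created, which would fit the FileNotFoundError.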

No problem with TTS or bark

Ok, that helps a lot to know.

Incidentally, with the latest stable diffusion file, all my outputs are now 48secs long, even if the total seconds are above or below 48secs.

Yes, I always saw that behaviour, have you ever seen it generate a different length? To me, if I put say 10s the audio will be silent but still output 48 seconds. I tried online demos and saw the same; so I was waiting for stable diffusion to fix this.

seconds_total_slider = gr.Slider( minimum=0, maximum=512, step=1, value=sample_size // sample_rate, label="Seconds total", visible=has_seconds_total, Maybe the value=sample_size // sample_rate, explains why everything is 48secs long

I will check this part. value here means the default value, but it could be related.


chlowden avatar chlowden commented on August 22, 2024

Or maybe not as the sample_rate = 32000


rsxdalv avatar rsxdalv commented on August 22, 2024

I checked the source of Stable Audio and it does seem like sample_size as defined within their code could determine the output length, but the gradio API they have made does not allow changing the length. They have a more internal API but it still seems like their model generates the audio equivalent of 512 by 512.
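
The arithmetic behind the slider default can be checked quickly. A small sketch, assuming `sample_rate = 44100` and `sample_size = 2097152`; these values are assumptions and should be verified against the `model_config.json` that sits next to your checkpoint.

```python
# Assumed values; check them against the model_config.json next to your checkpoint.
sample_rate = 44100      # samples per second
sample_size = 2097152    # samples generated per clip (2**21)

default_seconds = sample_size // sample_rate   # what the slider default computes
exact_seconds = sample_size / sample_rate

print(default_seconds)          # 47
print(round(exact_seconds, 2))  # 47.55
```

Under these assumptions a clip is about 47.55 seconds long, which floors to 47 and rounds to 48. That may be why both numbers show up for the same output.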


chlowden avatar chlowden commented on August 22, 2024

I've done so many tests that I am probably getting confused as to what I can do with what service.


rsxdalv avatar rsxdalv commented on August 22, 2024

Got it, I see we are doing a lot of back and forth so it might be useful to go on the new discord server. I will be busy for a while but hopefully can do more from there.

https://discord.com/invite/3JbBrKrH


mykeehu avatar mykeehu commented on August 22, 2024

seconds_total_slider = gr.Slider( minimum=0, maximum=512, step=1, value=sample_size // sample_rate, label="Seconds total", visible=has_seconds_total, Maybe the value=sample_size // sample_rate, explains why everything is 48secs long

The 48 seconds is interesting because the official limit is 47 seconds. I'm sorry that you can't set a custom length, but the website says it's fixed:
https://stability.ai/news/introducing-stable-audio-open
"Stable Audio Open is an open source text-to-audio model for generating up to 47 seconds of samples and sound effects."

The filenames and folder names are really long, with a prompt you can easily reach the Windows 255 character path limit. I'd rather say the date-seed format could be more manageable, since the file you're describing is next to it anyway.
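
A date-seed name could look like the sketch below. The timestamp format is copied from the folder names seen in this thread; `date_seed_name` and the seed value are hypothetical, used only to illustrate the idea.

```python
from datetime import datetime


def date_seed_name(seed, when=None, ext=".wav"):
    """Build a short name like '2024-07-11_21-34-59_12345.wav'."""
    when = when or datetime.now()
    return f"{when.strftime('%Y-%m-%d_%H-%M-%S')}_{seed}{ext}"


print(date_seed_name(12345, datetime(2024, 7, 11, 21, 34, 59)))
# 2024-07-11_21-34-59_12345.wav
```

The name stays the same length no matter how long the prompt is, so the path limit is much harder to hit.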


mykeehu avatar mykeehu commented on August 22, 2024

Fixed the filenames:

Great! Now you can better manage your folders and files, thank you!


mykeehu avatar mykeehu commented on August 22, 2024

Even the file names need a little fix. Stable Audio Generator produced such a prompt, and it is not saved because of the characters it contains:

Genre: Pop
Mood: Romantic, Atmospheric
Style: 90s
Instruments: Lead, Lead-off
Beats per Minute (BPM): 100
Additional Details: Create a dreamy and nostalgic atmosphere with lush synth pads, gentle piano melodies, a prominent lead melody, and a textural lead-off supporting the track. The music should build up to a cathartic moment filled with emotion and passion, capturing the essence of romanticism in the style of 90s pop.

The problem is with the \n:

Traceback (most recent call last):
  File "I:\tts-generation-webui\installer_files\env\lib\site-packages\gradio\queueing.py", line 407, in call_prediction
    output = await route_utils.call_process_api(
  File "I:\tts-generation-webui\installer_files\env\lib\site-packages\gradio\route_utils.py", line 226, in call_process_api
    output = await app.get_blocks().process_api(
  File "I:\tts-generation-webui\installer_files\env\lib\site-packages\gradio\blocks.py", line 1550, in process_api
    result = await self.call_function(
  File "I:\tts-generation-webui\installer_files\env\lib\site-packages\gradio\blocks.py", line 1185, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "I:\tts-generation-webui\installer_files\env\lib\site-packages\anyio\to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "I:\tts-generation-webui\installer_files\env\lib\site-packages\anyio\_backends\_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "I:\tts-generation-webui\installer_files\env\lib\site-packages\anyio\_backends\_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "I:\tts-generation-webui\installer_files\env\lib\site-packages\gradio\utils.py", line 661, in wrapper
    response = f(*args, **kwargs)
  File "I:\tts-generation-webui\tts-generation-webui\src\stable_audio\stable_audio.py", line 282, in save_result
    os.makedirs(base_dir, exist_ok=True)
  File "I:\tts-generation-webui\installer_files\env\lib\os.py", line 225, in makedirs
    mkdir(name, mode)
OSError: [WinError 123] File name, directory name or volume label syntax is incorrect: 'outputs-rvc\\Stable Audio\\2024-07-11_21-34-59_Format_Band_\nGe'
