
metavoice-src's People

Contributors

eltociear, fakerybakery, l4b4r4b4b4, lama-thematique, lucapericlp, shhossain, sidroopdaska, vatsalaggarwal, vshulman, ya0guang

metavoice-src's Issues

sample.py script doesn't work on OSX

According to #1 it should work on M1/M2 Macs, but running python fam/llm/sample.py doesn't seem to work even after switching between float16, bfloat16, and every other obvious dtype. Any suggestions?

(venv) ➜  metavoice-src git:(main) python fam/llm/sample.py --device="cpu" --spk_cond_path="assets/bria.mp3" --text="This is a demo of text to speech by MetaVoice-1B, an open-source foundational audio model." --dtype="bfloat16"
objc[19467]: Class AVFFrameReceiver is implemented in both /Users/username/projects/ai/metavoice-src/venv/lib/python3.10/site-packages/av/.dylibs/libavdevice.60.1.100.dylib (0x102cd4760) and /opt/homebrew/Cellar/ffmpeg/6.1.1_3/lib/libavdevice.60.3.100.dylib (0x144574370). One of the two will be used. Which one is undefined.
objc[19467]: Class AVFAudioReceiver is implemented in both /Users/username/projects/ai/metavoice-src/venv/lib/python3.10/site-packages/av/.dylibs/libavdevice.60.1.100.dylib (0x102cd47b0) and /opt/homebrew/Cellar/ffmpeg/6.1.1_3/lib/libavdevice.60.3.100.dylib (0x1445743c0). One of the two will be used. Which one is undefined.
WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
    PyTorch 2.1.0 with CUDA None (you have 2.2.0)
    Python  3.10.11 (you have 3.10.11)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details
/Users/username/projects/ai/metavoice-src/venv/lib/python3.10/site-packages/torch/nn/utils/weight_norm.py:28: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
  warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
/Users/username/projects/ai/metavoice-src/venv/lib/python3.10/site-packages/df/io.py:9: UserWarning: `torchaudio.backend.common.AudioMetaData` has been moved to `torchaudio.AudioMetaData`. Please update the import path.
  from torchaudio.backend.common import AudioMetaData
/Users/username/projects/ai/metavoice-src/fam/llm/layers/attn.py:10: UserWarning: flash_attn not installed, make sure to replace attention mechanism with torch_attn
  warnings.warn("flash_attn not installed, make sure to replace attention mechanism with torch_attn")
Fetching 6 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 119837.26it/s]
number of parameters: 1239.00M
number of parameters: 14.07M
getting cached speaker ref files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  2.18it/s]
calculating speaker embeddings: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1766.77it/s]
batch:   0%|          | 0/1 [00:00<?, ?it/s]
[hack!!!!] Guidance is on, so we're doubling/tripling batch size!
tokens:   0%|                                                                                                                                                                                                        | 0/1728 [00:00<?, ?it/s]
batch:   0%|                                                                                                                                                                                                            | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/Users/username/projects/ai/metavoice-src/fam/llm/sample.py", line 700, in <module>
    sample_utterance(
  File "/Users/username/projects/ai/metavoice-src/fam/llm/sample.py", line 544, in sample_utterance
    return _sample_utterance_batch(
  File "/Users/username/projects/ai/metavoice-src/fam/llm/sample.py", line 475, in _sample_utterance_batch
    b_tokens = first_stage_model(
  File "/Users/username/projects/ai/metavoice-src/fam/llm/sample.py", line 354, in __call__
    return self.causal_sample(
  File "/Users/username/projects/ai/metavoice-src/fam/llm/sample.py", line 229, in causal_sample
    y = self.model.generate(
  File "/Users/username/projects/ai/metavoice-src/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/Users/username/projects/ai/metavoice-src/fam/llm/model.py", line 369, in generate
    return self._causal_sample(
  File "/Users/username/projects/ai/metavoice-src/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/Users/username/projects/ai/metavoice-src/fam/llm/mixins/causal.py", line 410, in _causal_sample
    batch_idx = self._sample_batch(
  File "/Users/username/projects/ai/metavoice-src/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/Users/username/projects/ai/metavoice-src/fam/llm/mixins/causal.py", line 264, in _sample_batch
    idx_next = self._sample_next_token(
  File "/Users/username/projects/ai/metavoice-src/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/Users/username/projects/ai/metavoice-src/fam/llm/mixins/causal.py", line 85, in _sample_next_token
    list_logits, _ = self(
  File "/Users/username/projects/ai/metavoice-src/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/Users/username/projects/ai/metavoice-src/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/username/projects/ai/metavoice-src/fam/llm/model.py", line 282, in forward
    x = block(x)
  File "/Users/username/projects/ai/metavoice-src/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/Users/username/projects/ai/metavoice-src/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/username/projects/ai/metavoice-src/fam/llm/layers/combined.py", line 50, in forward
    x = x + self.attn(self.ln_1(x))
  File "/Users/username/projects/ai/metavoice-src/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/Users/username/projects/ai/metavoice-src/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/username/projects/ai/metavoice-src/fam/llm/layers/attn.py", line 221, in forward
    y = self._torch_attn(c_x)
  File "/Users/username/projects/ai/metavoice-src/fam/llm/layers/attn.py", line 189, in _torch_attn
    y = torch.nn.functional.scaled_dot_product_attention(
RuntimeError: Expected query, key, and value to have the same dtype, but got query.dtype: float key.dtype: c10::BFloat16 and value.dtype: c10::BFloat16 instead.
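
The mismatch is that the query is float32 while the key/value are bfloat16, so the CPU attention path rejects them. A possible workaround (untested; assuming the --dtype flag accepts it) is to keep everything in float32 on CPU:

python fam/llm/sample.py --device="cpu" --spk_cond_path="assets/bria.mp3" --text="This is a demo of text to speech by MetaVoice-1B, an open-source foundational audio model." --dtype="float32"

Alternatively, casting the inputs to a common dtype just before the scaled_dot_product_attention call in fam/llm/layers/attn.py should sidestep the error; a hypothetical local patch (q/k/v are assumed variable names, not necessarily what the repo uses):

  q, k, v = q.float(), k.float(), v.float()  # force a single dtype on the CPU path
  y = torch.nn.functional.scaled_dot_product_attention(q, k, v)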

Optimizations

Hey! Thank you so, so much for this repo and the great work. This is what the world needs right now; I have been waiting for a foundation model like this for years!

When trying to use the vanilla KV cache (I suppose that's the fastest inference?), I get this error:

/home/ai/.mconda3/envs/metavoice/lib/python3.11/site-packages/df/io.py:9: UserWarning: `torchaudio.backend.common.AudioMetaData` has been moved to `torchaudio.AudioMetaData`. Please update the import path.
  from torchaudio.backend.common import AudioMetaData
Fetching 5 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 38130.04it/s]
number of parameters: 1239.00M
Traceback (most recent call last):
  File "/home/ai/ml/voice/metavoice/metavoice-src/fam/llm/sample.py", line 690, in <module>
    smodel, llm_first_stage, llm_second_stage = build_models(
                                                ^^^^^^^^^^^^^
  File "/home/ai/ml/voice/metavoice/metavoice-src/fam/llm/sample.py", line 565, in build_models
    llm_first_stage = Model(
                      ^^^^^^
  File "/home/ai/ml/voice/metavoice/metavoice-src/fam/llm/sample.py", line 92, in __init__
    self._init_model()
  File "/home/ai/ml/voice/metavoice/metavoice-src/fam/llm/sample.py", line 159, in _init_model
    raise Exception(
Exception: kv_cache only supported for flash attention 2 but found torch_attn inside model!

I would be super grateful for help, thanks!
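
For what it's worth, the exception text suggests the KV-cache path in this build is only wired up for Flash Attention 2, so installing flash-attn (CUDA-only) may be the intended route. This is an inference from the error message, not confirmed from the docs:

pip install flash-attn --no-build-isolation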

bardia hack Iran

I will do it with code for anyone who wants to produce speech with a personal tone.

Typing error with Python 3.9 (`TypeError: unsupported operand type(s) for ...`)

Python 3.9.15
Trying to execute sample.py produced 3 typing errors. I simply removed the type annotations and it proceeded to the next step (Hugging Face credentials verification).

Please specify the required Python version or remove these type annotations (as I did).

python fam/llm/sample.py --huggingface_repo_id="metavoiceio" --spk_cond_path="assets/bria.mp3"
2024-02-09 19:17:47.280515: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
Traceback (most recent call last):
  File "/home/paul/myapps/data_science/text-to-speech/metavoice-src/fam/llm/sample.py", line 21, in <module>
    from fam.llm.decoders import Decoder, EncodecDecoder
  File "/home/paul/myapps/data_science/text-to-speech/metavoice-src/fam/llm/decoders.py", line 19, in <module>
    class EncodecDecoder(Decoder):
  File "/home/paul/myapps/data_science/text-to-speech/metavoice-src/fam/llm/decoders.py", line 66, in EncodecDecoder
    ) -> str | torch.Tensor:
TypeError: unsupported operand type(s) for |: 'type' and 'torch._C._TensorMeta'
python fam/llm/sample.py --huggingface_repo_id="metavoiceio" --spk_cond_path="assets/bria.mp3"
2024-02-09 19:20:55.558639: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
/home/paul/anaconda3/envs/tf-gpu/lib/python3.9/site-packages/df/io.py:9: UserWarning: `torchaudio.backend.common.AudioMetaData` has been moved to `torchaudio.AudioMetaData`. Please update the import path.
  from torchaudio.backend.common import AudioMetaData
Traceback (most recent call last):
  File "/home/paul/myapps/data_science/text-to-speech/metavoice-src/fam/llm/sample.py", line 455, in <module>
    enhancer: Optional[Literal["df"] | BaseEnhancer],
TypeError: unsupported operand type(s) for |: '_LiteralGenericAlias' and 'ABCMeta'
 python fam/llm/sample.py --huggingface_repo_id="metavoiceio" --spk_cond_path="assets/bria.mp3"
2024-02-09 19:23:50.126062: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
/home/paul/anaconda3/envs/tf-gpu/lib/python3.9/site-packages/df/io.py:9: UserWarning: `torchaudio.backend.common.AudioMetaData` has been moved to `torchaudio.AudioMetaData`. Please update the import path.
  from torchaudio.backend.common import AudioMetaData
Traceback (most recent call last):
  File "/home/paul/myapps/data_science/text-to-speech/metavoice-src/fam/llm/sample.py", line 533, in <module>
    enhancer: Optional[Literal["df"] | BaseEnhancer],
TypeError: unsupported operand type(s) for |: '_LiteralGenericAlias' and 'ABCMeta'
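
For background, the X | Y union syntax in annotations (PEP 604) only works at runtime from Python 3.10, which is why these lines fail on 3.9. Besides deleting the annotations, 3.9-compatible spellings would look like the following hypothetical patch (method name and context are placeholders, not upstream code):

from __future__ import annotations  # option 1: defer evaluation of all annotations in the module

# option 2: spell the unions with typing.Union
from typing import Literal, Optional, Union
import torch

def some_method(self) -> Union[str, torch.Tensor]:  # instead of -> str | torch.Tensor
    ...

enhancer: Optional[Union[Literal["df"], BaseEnhancer]]  # instead of Optional[Literal["df"] | BaseEnhancer]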

Speech to Speech example

The studio has speech-to-speech (voice conversion). Presumably that is possible with the OSS model?

If so, I'd love to see a few lines of code demonstrating how it can be done.
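
Not an official answer, but one way to approximate voice conversion with only the TTS entry point is ASR → TTS: transcribe the source clip with an external recognizer, then re-synthesize with the target speaker as the reference. A rough sketch (whisper is an external tool and an assumption here; the sample.py flags are taken from other issues on this page):

whisper source_clip.wav --output_format txt
python fam/llm/sample.py --spk_cond_path="target_speaker.wav" --text="$(cat source_clip.txt)"

This loses the source prosody, so it is only a crude stand-in for true speech-to-speech.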

Missing file/dependency?

Hi,

When I follow the instructions in the README I am getting No module named 'fam.llm.mixins.gpt2loading'. Seems like a missing file and/or dependency.

How to change similarity and stability in sampling.py?

Hi, great implementation. I'm impressed by the accuracy of one-shot cloning and looking forward to the fine-tuning code release.
In the meantime, could you tell me how to change similarity and stability in sample.py? What do they relate to? I am thinking top-p and top-k?
Or guidance_scale: Optional[Tuple[float, float]] = (3.0, 1.0)
"""Guidance scale for sampling: (speaker conditioning guidance_scale, prompt conditioning guidance scale)."""

Are you using some sort of controlnet?

Thanks
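
For what it's worth, the docstring quoted above suggests the two tuple elements control speaker-conditioning and prompt-conditioning guidance, which looks like the closest analogue to the studio's similarity/stability sliders; top-p/top-k are separate sampling knobs. If that reading is right (an assumption), a closer voice match might come from raising the first element:

guidance_scale: Optional[Tuple[float, float]] = (4.0, 1.0)  # hypothetical: stronger speaker conditioning than the (3.0, 1.0) default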

TypeError: 'type' object is not subscriptable on running sample.py

Hi,

Just followed your installation step and when I tried the following command:
python fam/llm/sample.py --huggingface_repo_id="metavoiceio/metavoice-1B-v0.1" --spk_cond_path="assets/ava.flac"

The results are:

Traceback (most recent call last):
  File "fam/llm/sample.py", line 19, in <module>
    from fam.llm.adapters import FlattenedInterleavedEncodec2Codebook, TiltedEncodec
  File "/home/leocd/metavoice/metavoice-src/fam/llm/adapters/__init__.py", line 1, in <module>
    from fam.llm.adapters.flattened_encodec import FlattenedInterleavedEncodec2Codebook
  File "/home/leocd/metavoice/metavoice-src/fam/llm/adapters/flattened_encodec.py", line 4, in <module>
    class FlattenedInterleavedEncodec2Codebook(BaseDataAdapter):
  File "/home/leocd/metavoice/metavoice-src/fam/llm/adapters/flattened_encodec.py", line 8, in FlattenedInterleavedEncodec2Codebook
    def decode(self, tokens: list[list[int]]) -> tuple[list[int], list[list[int]]]:
TypeError: 'type' object is not subscriptable

Thank you
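
For context, built-in generics such as list[list[int]] (PEP 585) are only subscriptable at runtime from Python 3.9, so this traceback points to Python 3.8 or older. Upgrading to Python 3.10+ (which the other issues here also seem to need for the X | Y syntax) should fix it; alternatively, a 3.8-compatible spelling of the failing signature would be (a hypothetical patch, not upstream code):

from typing import List, Tuple

def decode(self, tokens: List[List[int]]) -> Tuple[List[int], List[List[int]]]:
    ...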

it would be interesting to let the model make other speech sounds, like laughing

Bark also did this and it is quite helpful.

We could use semantics like these for the sounds:
[laughter]
[laughs]
[sighs]
[gasps]
[clears throat]
— or ... for hesitations
♪ for song lyrics
CAPITALIZATION for emphasis of a word

Maybe other emotional words would also be interesting, like sad/happy.

But it might be too much work. Do you think it would be possible to add something like this through fine-tuning?

deepspeed inference

Great work. Now the world has two AI voice giants, ElevenLabs and MetaVoice. A great win for the open-source community.

Would you consider supporting DeepSpeed to improve inference speed?

Better Output

Use:
Speech Stability: 10
Speaker Similarity: 4
for better results in the demo.
Thank you

MPS Support

Hi,
Congrats on the launch!
Is MPS (Apple Silicon) or MLX support planned?
Thank you!

Will it natively support other languages?

I saw that the README describes the training data as mainly English, and I worry the model won't learn prosody in other languages; for example, prosody in Chinese should be very different from English.
Will it be possible to learn other languages in the future just by fine-tuning?

Encoder checkpoint missing

When cloning the repo and running just the sample code (python fam/llm/sample.py --huggingface_repo_id="metavoiceio/metavoice-1B-v0.1" --spk_cond_path="assets/ava.flac") without any changes, the model successfully downloads but I get the following error:

Fetching 5 files: 100%|█████████████████████████| 5/5 [00:00<00:00, 8053.58it/s]
Traceback (most recent call last):
  File "/kaggle/working/metavoice-src/fam/llm/sample.py", line 690, in <module>
    smodel, llm_first_stage, llm_second_stage = build_models(
  File "/kaggle/working/metavoice-src/fam/llm/sample.py", line 563, in build_models
    smodel = SpeakerEncoder(device=device, eval=True, verbose=False)
  File "/kaggle/working/metavoice-src/fam/quantiser/audio/speaker_encoder/model.py", line 50, in __init__
    checkpoint = torch.load(weights_fpath, map_location="cpu")
  File "/opt/conda/lib/python3.10/site-packages/torch/serialization.py", line 998, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/opt/conda/lib/python3.10/site-packages/torch/serialization.py", line 445, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/opt/conda/lib/python3.10/site-packages/torch/serialization.py", line 426, in __init__
    super().__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: '/kaggle/working/metavoice-src/fam/quantiser/audio/speaker_encoder/ckpt/ckpt.pt'

Is there a separate encoder checkpoint that's needed for this to work?
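
One thing worth checking (an assumption; I have not confirmed how the repo ships this file): if fam/quantiser/audio/speaker_encoder/ckpt/ckpt.pt is stored with Git LFS, a plain git clone leaves only a pointer file behind, and you would need:

git lfs install
git lfs pull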

Model VRAM

How much VRAM does the loaded model require?
I was told an RTX 4080 with 12 GB of VRAM is not enough.

Quantization Support?

I'd like to quantize the model to 8-bit or 4-bit for running on a GPU with 6 GB of memory, and for a possible speed boost. Is this possible?
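
There is no official quantized build that I know of. As a rough CPU-only experiment, PyTorch's dynamic quantization can shrink the Linear weights to int8; a minimal sketch, assuming model is the loaded first-stage nn.Module (note this path runs on CPU, so it will not reduce GPU VRAM by itself):

import torch

quantized = torch.ao.quantization.quantize_dynamic(
    model,               # loaded nn.Module
    {torch.nn.Linear},   # quantize only the Linear layers
    dtype=torch.qint8,
)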

Possible to get timing info?

Is it possible to have this model also generate millisecond-level timestamps for the words (or phonemes) in the prompt?

I currently use speech marks from AWS Polly, and if this model could generate the same format, that would be very helpful!
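
For reference, Polly's word-level speech marks are newline-delimited JSON, with time in milliseconds from the start of the audio and start/end as offsets into the input text, e.g.:

{"time": 6, "type": "word", "start": 0, "end": 4, "value": "This"}
{"time": 373, "type": "word", "start": 5, "end": 7, "value": "is"}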

I created a working Dockerfile to build it consistently (and on my system)...

Hi, I created this Dockerfile just now to make building easier (and to allow building on my system, a Mac, where Python seems unhappy)...

I am wondering whether it could be included and improved upon; I haven't gotten it running locally yet, but at least it builds for me...

Branch with changes: main...groovybits:metavoice-src:docker

Also, which port does it use, and how do I use it? Or is there more information on how to run it locally as an HTTP server, if that is what it does?

Thanks! I am reading up / exploring as I can, since this looks amazing :)

Updated: it builds in docker compose now; not sure if it is "working" yet (need to look closer).

Any timeline for fine-tuning code

I'm looking to fine-tune this model. As you mentioned you will open-source the fine-tuning code, do you have any timeline in mind for that?
And if possible, please provide some info regarding the training dataset.

ModuleNotFoundError: No module named 'df'

When I run sample.py, the following error occurs:

python /root/metavoice-src/fam/llm/sample.py
Traceback (most recent call last):
  File "/root/metavoice-src/fam/llm/sample.py", line 22, in <module>
    from fam.llm.enhancers import BaseEnhancer, get_enhancer
  File "/root/metavoice-src/fam/llm/enhancers.py", line 5, in <module>
    from df.enhance import enhance, init_df, load_audio, save_audio
ModuleNotFoundError: No module named 'df'

It seems that df.enhance is not a third-party library I recognize, and I couldn't find a corresponding file in the folder either.
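
Judging by the df.enhance import (enhance, init_df, load_audio, save_audio), df appears to be the module shipped by the DeepFilterNet package, so installing it may resolve this. This is an assumption based on the import names, not confirmed from the repo's requirements:

pip install deepfilternet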

Acknowledgement

Hey, I found that your speaker encoder code is similar to this code (link).

Can you please credit them?

Hassan

What do you do every day? How is life going?

Rough fine-tuning guidance

I know the repo ReadMe says "soon", but would it be possible to give some very rough advice on how to fine-tune to improve on the voice's match with a custom speaker?

I guess the demo is just extracting embeddings from bria.mp3, but I'd like to go one step further to get a better voice match. Thanks.

Run the model on cpu

Hello, is there a way to run the model on CPU? The model size is small at just 1.2B parameters, so I was wondering if there is a way to run this model on CPU only.
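
For what it's worth, the first issue on this page invokes the script with --device="cpu", so a CPU path appears to exist, though it hit a dtype error there; combining it with --dtype="float32" might help:

python fam/llm/sample.py --device="cpu" --dtype="float32" --spk_cond_path="assets/bria.mp3" --text="This is a demo of text to speech by MetaVoice-1B, an open-source foundational audio model."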

pip install doesn't work on Nvidia RTX 2070 Super on Ubuntu 20.04.3 LTS

NVIDIA-SMI 525.147.05
Driver Version: 525.147.05
CUDA Version: 12.0 (also tested with 12.2)

Fresh install of Ubuntu.
Used virtualenv to create a "packages" folder, then source packages/bin/activate.
Then pip install -r requirements.txt.

Using this command: python fam/llm/sample.py --huggingface_repo_id="metavoiceio/metavoice-1B-v0.1" --spk_cond_path="assets/ava.flac"

I get this:

/home/user/metavoice-src/packages/lib/python3.10/site-packages/df/io.py:9: UserWarning: `torchaudio.backend.common.AudioMetaData` has been moved to `torchaudio.AudioMetaData`. Please update the import path.
  from torchaudio.backend.common import AudioMetaData
Traceback (most recent call last):
  File "/home/user/metavoice-src/fam/llm/sample.py", line 22, in <module>
    from fam.llm.model import GPT, GPTConfig
  File "/home/user/metavoice-src/fam/llm/model.py", line 12, in <module>
    from fam.llm.layers import Block, LayerNorm, RMSNorm
  File "/home/user/metavoice-src/fam/llm/layers/__init__.py", line 1, in <module>
    from fam.llm.layers.attn import SelfAttention
  File "/home/user/metavoice-src/fam/llm/layers/attn.py", line 3, in <module>
    from flash_attn import (  # type: ignore
  File "/home/user/metavoice-src/packages/lib/python3.10/site-packages/flash_attn/__init__.py", line 3, in <module>
    from flash_attn.flash_attn_interface import (
  File "/home/user/metavoice-src/packages/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 10, in <module>
    import flash_attn_2_cuda as flash_attn_cuda
ImportError: /home/user/metavoice-src/packages/lib/python3.10/site-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops15sum_IntList_out4callERKNS_6TensorEN3c1016OptionalArrayRefIlEEbSt8optionalINS5_10ScalarTypeEERS2_

This occurs even when trying a new virtualenv. Is there a specific Python/CUDA version I should be on?
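
That undefined-symbol error usually means the prebuilt flash-attn binary was compiled against a different PyTorch ABI than the one installed in the virtualenv. A common (not guaranteed) fix is to rebuild it against the installed torch:

pip uninstall flash-attn
pip install flash-attn --no-build-isolation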

Error during installation of deps on runpod instances

Any suggestions to fix this error?

Collecting xformers
  Using cached xformers-0.0.23.post1-cp310-cp310-manylinux2014_x86_64.whl (213.0 MB)
  Using cached xformers-0.0.23-cp310-cp310-manylinux2014_x86_64.whl (213.0 MB)
  Using cached xformers-0.0.22.post7-cp310-cp310-manylinux2014_x86_64.whl (211.8 MB)
  Using cached xformers-0.0.22-cp310-cp310-manylinux2014_x86_64.whl (211.6 MB)
  Using cached xformers-0.0.21-cp310-cp310-manylinux2014_x86_64.whl (167.0 MB)
  Using cached xformers-0.0.20-cp310-cp310-manylinux2014_x86_64.whl (109.1 MB)
  Using cached xformers-0.0.19-cp310-cp310-manylinux2014_x86_64.whl (108.2 MB)
Collecting pyre-extensions==0.0.29
  Using cached pyre_extensions-0.0.29-py3-none-any.whl (12 kB)
Collecting xformers
  Using cached xformers-0.0.18-cp310-cp310-manylinux2014_x86_64.whl (123.8 MB)
Collecting pyre-extensions==0.0.23
  Using cached pyre_extensions-0.0.23-py3-none-any.whl (11 kB)
Collecting xformers
  Using cached xformers-0.0.17-cp310-cp310-manylinux2014_x86_64.whl (123.6 MB)
  Using cached xformers-0.0.16-cp310-cp310-manylinux2014_x86_64.whl (50.9 MB)
  Using cached xformers-0.0.13.tar.gz (292 kB)
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error
  
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [6 lines of output]
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-install-mu60tz0e/xformers_4aa19d62bcc44da58b0f254d6573d196/setup.py", line 18, in <module>
          import torch
      ModuleNotFoundError: No module named 'torch'
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed
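
What is happening here: pip keeps falling back through older xformers releases looking for one it can install, and eventually reaches 0.0.13, which builds from source and imports torch in its setup.py before torch is present. Installing torch first usually sidesteps this:

pip install torch
pip install -r requirements.txt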

Pip install not working

Following the install instructions, when I run
pip install -r requirements.txt
I get the error:

Collecting flash-attn
  Using cached flash_attn-2.5.2.tar.gz (2.5 MB)
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error
  
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [6 lines of output]
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-install-tx9pk_0o/flash-attn_5537406baed04038937f619dc63a1f9f/setup.py", line 17, in <module>
          from wheel.bdist_wheel import bdist_wheel as _bdist_wheel
      ModuleNotFoundError: No module named 'wheel'
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

I have tried installing via pip install packaging and also adding

[build-system]
requires = ["packaging"]

but it does not fix the problem.

Running on Ubuntu 22.04 LTS 64-bit
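
Since the traceback says No module named 'wheel' (not packaging), installing wheel into the same environment before retrying may be all that is needed:

pip install wheel
pip install -r requirements.txt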

bfloat16 not supported on Google Colab T4

I'm trying to get this working in Google Colab based on the info in the readme, but I get the following error:

  File "/content/metavoice-src/fam/llm/sample.py", line 696, in <module>
    smodel, llm_first_stage, llm_second_stage = build_models(
  File "/content/metavoice-src/fam/llm/sample.py", line 571, in build_models
    llm_first_stage = Model(
  File "/content/metavoice-src/fam/llm/sample.py", line 81, in __init__
    nullcontext() if device_type == "cpu" else torch.amp.autocast(device_type=device_type, dtype=ptdtype)
  File "/usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py", line 305, in __init__
    raise RuntimeError(
RuntimeError: Current CUDA Device does not support bfloat16. Please switch dtype to float16.
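
Background: T4 GPUs (compute capability 7.5) predate hardware bfloat16 support, so this is expected on Colab's free tier. As the error itself suggests, switching to float16 should work, e.g. (flags as used in the other issues here):

python fam/llm/sample.py --huggingface_repo_id="metavoiceio/metavoice-1B-v0.1" --spk_cond_path="assets/bria.mp3" --dtype="float16"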

Voice presets

Are voice presets other than ava available for download somewhere?

windows support / Temp wav file error

When I run sample.py locally on Windows 11, the code crashes at line 84 of enhancers.py.

Exception has occurred: LibsndfileError
Error opening 'C:\\Users\\user\\AppData\\Local\\Temp\\tmpms4y14wd.wav': System error.
  File "D:\metavoice-src\fam\llm\enhancers.py", line 84, in __call__
    save_audio(output_file, enhanced, self.df_state.sr())
  File "D:\metavoice-src\fam\llm\sample.py", line 506, in _sample_utterance_batch
    enhancer(str(wav_file) + ".wav", enhanced_tmp.name)
  File "D:\metavoice-src\fam\llm\sample.py", line 544, in sample_utterance
    return _sample_utterance_batch(
  File "D:\metavoice-src\fam\llm\sample.py", line 695, in <module>
    sample_utterance(
soundfile.LibsndfileError: Error opening 'C:\\Users\\user\\AppData\\Local\\Temp\\tmpms4y14wd.wav': System error.

After adding a few breakpoints, I discovered that a zero-length WAV file appears at the temporary path before the crash. However, it gets deleted at the moment of the crash. On the other hand, the WAV file in the samples directory is being created successfully. Could it be an issue with the enhancer?
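
A plausible cause (well documented for tempfile, though I have not verified this repo's exact code path): on Windows, a NamedTemporaryFile created with the default delete=True stays open and cannot be reopened by name by another writer such as soundfile. A hypothetical Windows-friendly pattern around the call in _sample_utterance_batch:

import tempfile

enhanced_tmp = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
enhanced_tmp.close()  # close it so save_audio can reopen the path on Windows
enhancer(str(wav_file) + ".wav", enhanced_tmp.name)
# remember to os.remove(enhanced_tmp.name) once done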

long-form/streaming support?

I want to use it in role plays, and the text is mostly 500+ characters, so generation takes a long time.
Is a streaming mode planned?
Like in XTTS?
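
Until native streaming lands, a rough workaround is to split long text into sentence-sized chunks and synthesize them sequentially, playing each chunk as it finishes. A minimal chunker sketch (synthesize() is a hypothetical stand-in for whatever entry point you call):

import re

def chunks(text, max_chars=200):
    # naive sentence splitter that keeps each chunk under max_chars
    out, cur = [], ""
    for sent in re.split(r"(?<=[.!?])\s+", text):
        if cur and len(cur) + len(sent) + 1 > max_chars:
            out.append(cur)
            cur = sent
        else:
            cur = (cur + " " + sent).strip()
    if cur:
        out.append(cur)
    return out

for piece in chunks(long_text):
    synthesize(piece)  # hypothetical: call the sampling entry point per chunk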
