
metavoice-src's People

Contributors

eltociear, fakerybakery, l4b4r4b4b4, lama-thematique, lucapericlp, shhossain, sidroopdaska, vatsalaggarwal, vshulman, ya0guang

metavoice-src's Issues

sample.py script doesn't work on OSX

According to #1 it should work on M1/M2 Macs, but running python fam/llm/sample.py doesn't seem to work even after switching between float16, bfloat16, and every other obvious dtype. Any suggestions?

(venv) ➜  metavoice-src git:(main) python fam/llm/sample.py --device="cpu" --spk_cond_path="assets/bria.mp3" --text="This is a demo of text to speech by MetaVoice-1B, an open-source foundational audio model." --dtype="bfloat16"
objc[19467]: Class AVFFrameReceiver is implemented in both /Users/username/projects/ai/metavoice-src/venv/lib/python3.10/site-packages/av/.dylibs/libavdevice.60.1.100.dylib (0x102cd4760) and /opt/homebrew/Cellar/ffmpeg/6.1.1_3/lib/libavdevice.60.3.100.dylib (0x144574370). One of the two will be used. Which one is undefined.
objc[19467]: Class AVFAudioReceiver is implemented in both /Users/username/projects/ai/metavoice-src/venv/lib/python3.10/site-packages/av/.dylibs/libavdevice.60.1.100.dylib (0x102cd47b0) and /opt/homebrew/Cellar/ffmpeg/6.1.1_3/lib/libavdevice.60.3.100.dylib (0x1445743c0). One of the two will be used. Which one is undefined.
WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
    PyTorch 2.1.0 with CUDA None (you have 2.2.0)
    Python  3.10.11 (you have 3.10.11)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details
/Users/username/projects/ai/metavoice-src/venv/lib/python3.10/site-packages/torch/nn/utils/weight_norm.py:28: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
  warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
/Users/username/projects/ai/metavoice-src/venv/lib/python3.10/site-packages/df/io.py:9: UserWarning: `torchaudio.backend.common.AudioMetaData` has been moved to `torchaudio.AudioMetaData`. Please update the import path.
  from torchaudio.backend.common import AudioMetaData
/Users/username/projects/ai/metavoice-src/fam/llm/layers/attn.py:10: UserWarning: flash_attn not installed, make sure to replace attention mechanism with torch_attn
  warnings.warn("flash_attn not installed, make sure to replace attention mechanism with torch_attn")
Fetching 6 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 119837.26it/s]
number of parameters: 1239.00M
number of parameters: 14.07M
getting cached speaker ref files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  2.18it/s]
calculating speaker embeddings: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1766.77it/s]
batch:   0%|          | 0/1 [00:00<?, ?it/s]
[hack!!!!] Guidance is on, so we're doubling/tripling batch size!
tokens:   0%|                                                                                                                                                                                                        | 0/1728 [00:00<?, ?it/s]
batch:   0%|                                                                                                                                                                                                            | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/Users/username/projects/ai/metavoice-src/fam/llm/sample.py", line 700, in <module>
    sample_utterance(
  File "/Users/username/projects/ai/metavoice-src/fam/llm/sample.py", line 544, in sample_utterance
    return _sample_utterance_batch(
  File "/Users/username/projects/ai/metavoice-src/fam/llm/sample.py", line 475, in _sample_utterance_batch
    b_tokens = first_stage_model(
  File "/Users/username/projects/ai/metavoice-src/fam/llm/sample.py", line 354, in __call__
    return self.causal_sample(
  File "/Users/username/projects/ai/metavoice-src/fam/llm/sample.py", line 229, in causal_sample
    y = self.model.generate(
  File "/Users/username/projects/ai/metavoice-src/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/Users/username/projects/ai/metavoice-src/fam/llm/model.py", line 369, in generate
    return self._causal_sample(
  File "/Users/username/projects/ai/metavoice-src/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/Users/username/projects/ai/metavoice-src/fam/llm/mixins/causal.py", line 410, in _causal_sample
    batch_idx = self._sample_batch(
  File "/Users/username/projects/ai/metavoice-src/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/Users/username/projects/ai/metavoice-src/fam/llm/mixins/causal.py", line 264, in _sample_batch
    idx_next = self._sample_next_token(
  File "/Users/username/projects/ai/metavoice-src/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/Users/username/projects/ai/metavoice-src/fam/llm/mixins/causal.py", line 85, in _sample_next_token
    list_logits, _ = self(
  File "/Users/username/projects/ai/metavoice-src/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/Users/username/projects/ai/metavoice-src/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/username/projects/ai/metavoice-src/fam/llm/model.py", line 282, in forward
    x = block(x)
  File "/Users/username/projects/ai/metavoice-src/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/Users/username/projects/ai/metavoice-src/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/username/projects/ai/metavoice-src/fam/llm/layers/combined.py", line 50, in forward
    x = x + self.attn(self.ln_1(x))
  File "/Users/username/projects/ai/metavoice-src/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/Users/username/projects/ai/metavoice-src/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/username/projects/ai/metavoice-src/fam/llm/layers/attn.py", line 221, in forward
    y = self._torch_attn(c_x)
  File "/Users/username/projects/ai/metavoice-src/fam/llm/layers/attn.py", line 189, in _torch_attn
    y = torch.nn.functional.scaled_dot_product_attention(
RuntimeError: Expected query, key, and value to have the same dtype, but got query.dtype: float key.dtype: c10::BFloat16 and value.dtype: c10::BFloat16 instead.
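
The mismatch is that the query is float32 while the key/value are bfloat16, so the CPU attention path rejects them. A possible workaround (untested; assuming the --dtype flag accepts it) is to keep everything in float32 on CPU:

python fam/llm/sample.py --device="cpu" --spk_cond_path="assets/bria.mp3" --text="This is a demo of text to speech by MetaVoice-1B, an open-source foundational audio model." --dtype="float32"

Alternatively, casting the inputs to a common dtype just before the scaled_dot_product_attention call in fam/llm/layers/attn.py should sidestep the error; a hypothetical local patch (q/k/v are assumed variable names, not necessarily what the repo uses):

  q, k, v = q.float(), k.float(), v.float()  # force a single dtype on the CPU path
  y = torch.nn.functional.scaled_dot_product_attention(q, k, v)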

Optimizations

Hey! Thank you so, so much for this repo and the great work. This is what the world needs right now; I have been waiting for a foundation model like this for years!

When trying to use the vanilla KV cache (I suppose that's the fastest inference?), I get this error:

/home/ai/.mconda3/envs/metavoice/lib/python3.11/site-packages/df/io.py:9: UserWarning: `torchaudio.backend.common.AudioMetaData` has been moved to `torchaudio.AudioMetaData`. Please update the import path.
  from torchaudio.backend.common import AudioMetaData
Fetching 5 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 38130.04it/s]
number of parameters: 1239.00M
Traceback (most recent call last):
  File "/home/ai/ml/voice/metavoice/metavoice-src/fam/llm/sample.py", line 690, in <module>
    smodel, llm_first_stage, llm_second_stage = build_models(
                                                ^^^^^^^^^^^^^
  File "/home/ai/ml/voice/metavoice/metavoice-src/fam/llm/sample.py", line 565, in build_models
    llm_first_stage = Model(
                      ^^^^^^
  File "/home/ai/ml/voice/metavoice/metavoice-src/fam/llm/sample.py", line 92, in __init__
    self._init_model()
  File "/home/ai/ml/voice/metavoice/metavoice-src/fam/llm/sample.py", line 159, in _init_model
    raise Exception(
Exception: kv_cache only supported for flash attention 2 but found torch_attn inside model!

I would be super grateful for help, thanks!
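
For what it's worth, the exception text suggests the KV-cache path in this build is only wired up for Flash Attention 2, so installing flash-attn (CUDA-only) may be the intended route. This is an inference from the error message, not confirmed from the docs:

pip install flash-attn --no-build-isolation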

bardia hack Iran

I will do it with code for anyone who wants to produce speech with a personal tone.

Typing error with Python 3.9 (`TypeError: unsupported operand type(s) for ...`)

Python 3.9.15
Trying to execute sample.py produced 3 typing errors. I simply removed the type annotations and it proceeded to the next step (Hugging Face credentials verification).

Please specify the required Python version or remove these type annotations (as I did).

python fam/llm/sample.py --huggingface_repo_id="metavoiceio" --spk_cond_path="assets/bria.mp3"
2024-02-09 19:17:47.280515: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
Traceback (most recent call last):
  File "/home/paul/myapps/data_science/text-to-speech/metavoice-src/fam/llm/sample.py", line 21, in <module>
    from fam.llm.decoders import Decoder, EncodecDecoder
  File "/home/paul/myapps/data_science/text-to-speech/metavoice-src/fam/llm/decoders.py", line 19, in <module>
    class EncodecDecoder(Decoder):
  File "/home/paul/myapps/data_science/text-to-speech/metavoice-src/fam/llm/decoders.py", line 66, in EncodecDecoder
    ) -> str | torch.Tensor:
TypeError: unsupported operand type(s) for |: 'type' and 'torch._C._TensorMeta'
python fam/llm/sample.py --huggingface_repo_id="metavoiceio" --spk_cond_path="assets/bria.mp3"
2024-02-09 19:20:55.558639: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
/home/paul/anaconda3/envs/tf-gpu/lib/python3.9/site-packages/df/io.py:9: UserWarning: `torchaudio.backend.common.AudioMetaData` has been moved to `torchaudio.AudioMetaData`. Please update the import path.
  from torchaudio.backend.common import AudioMetaData
Traceback (most recent call last):
  File "/home/paul/myapps/data_science/text-to-speech/metavoice-src/fam/llm/sample.py", line 455, in <module>
    enhancer: Optional[Literal["df"] | BaseEnhancer],
TypeError: unsupported operand type(s) for |: '_LiteralGenericAlias' and 'ABCMeta'
 python fam/llm/sample.py --huggingface_repo_id="metavoiceio" --spk_cond_path="assets/bria.mp3"
2024-02-09 19:23:50.126062: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
/home/paul/anaconda3/envs/tf-gpu/lib/python3.9/site-packages/df/io.py:9: UserWarning: `torchaudio.backend.common.AudioMetaData` has been moved to `torchaudio.AudioMetaData`. Please update the import path.
  from torchaudio.backend.common import AudioMetaData
Traceback (most recent call last):
  File "/home/paul/myapps/data_science/text-to-speech/metavoice-src/fam/llm/sample.py", line 533, in <module>
    enhancer: Optional[Literal["df"] | BaseEnhancer],
TypeError: unsupported operand type(s) for |: '_LiteralGenericAlias' and 'ABCMeta'
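
For background, the X | Y union syntax in annotations (PEP 604) only works at runtime from Python 3.10, which is why these lines fail on 3.9. Besides deleting the annotations, 3.9-compatible spellings would look like the following hypothetical patch (method name and context are placeholders, not upstream code):

from __future__ import annotations  # option 1: defer evaluation of all annotations in the module

# option 2: spell the unions with typing.Union
from typing import Literal, Optional, Union
import torch

def some_method(self) -> Union[str, torch.Tensor]:  # instead of -> str | torch.Tensor
    ...

enhancer: Optional[Union[Literal["df"], BaseEnhancer]]  # instead of Optional[Literal["df"] | BaseEnhancer]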

Speech to Speech example

The studio has speech-to-speech (voice conversion). Presumably that is possible with the OSS model?

If so, I'd love to see a few lines of code demonstrating how it can be done.
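
Not an official answer, but one way to approximate voice conversion with only the TTS entry point is ASR → TTS: transcribe the source clip with an external recognizer, then re-synthesize with the target speaker as the reference. A rough sketch (whisper is an external tool and an assumption here; the sample.py flags are taken from other issues on this page):

whisper source_clip.wav --output_format txt
python fam/llm/sample.py --spk_cond_path="target_speaker.wav" --text="$(cat source_clip.txt)"

This loses the source prosody, so it is only a crude stand-in for true speech-to-speech.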

Missing file/dependency?

Hi,

When I follow the instructions in the README I am getting No module named 'fam.llm.mixins.gpt2loading'. Seems like a missing file and/or dependency.

How to change similarity and stability in sampling.py?

Hi, great implementation. I'm impressed by the accuracy of one-shot cloning and looking forward to the fine-tuning code release.
In the meantime, could you tell me how to change similarity and stability in sample.py? What do they relate to? I am thinking top-p and top-k?
Or guidance_scale: Optional[Tuple[float, float]] = (3.0, 1.0)
"""Guidance scale for sampling: (speaker conditioning guidance_scale, prompt conditioning guidance scale)."""

Are you using some sort of controlnet?

Thanks
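
For what it's worth, the docstring quoted above suggests the two tuple elements control speaker-conditioning and prompt-conditioning guidance, which looks like the closest analogue to the studio's similarity/stability sliders; top-p/top-k are separate sampling knobs. If that reading is right (an assumption), a closer voice match might come from raising the first element:

guidance_scale: Optional[Tuple[float, float]] = (4.0, 1.0)  # hypothetical: stronger speaker conditioning than the (3.0, 1.0) default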

TypeError: 'type' object is not subscriptable on running sample.py

Hi,

Just followed your installation step and when I tried the following command:
python fam/llm/sample.py --huggingface_repo_id="metavoiceio/metavoice-1B-v0.1" --spk_cond_path="assets/ava.flac"

The results are:

Traceback (most recent call last):
  File "fam/llm/sample.py", line 19, in <module>
    from fam.llm.adapters import FlattenedInterleavedEncodec2Codebook, TiltedEncodec
  File "/home/leocd/metavoice/metavoice-src/fam/llm/adapters/__init__.py", line 1, in <module>
    from fam.llm.adapters.flattened_encodec import FlattenedInterleavedEncodec2Codebook
  File "/home/leocd/metavoice/metavoice-src/fam/llm/adapters/flattened_encodec.py", line 4, in <module>
    class FlattenedInterleavedEncodec2Codebook(BaseDataAdapter):
  File "/home/leocd/metavoice/metavoice-src/fam/llm/adapters/flattened_encodec.py", line 8, in FlattenedInterleavedEncodec2Codebook
    def decode(self, tokens: list[list[int]]) -> tuple[list[int], list[list[int]]]:
TypeError: 'type' object is not subscriptable

Thank you
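
For context, built-in generics such as list[list[int]] (PEP 585) are only subscriptable at runtime from Python 3.9, so this traceback points to Python 3.8 or older. Upgrading to Python 3.10+ (which the other issues here also seem to need for the X | Y syntax) should fix it; alternatively, a 3.8-compatible spelling of the failing signature would be (a hypothetical patch, not upstream code):

from typing import List, Tuple

def decode(self, tokens: List[List[int]]) -> Tuple[List[int], List[List[int]]]:
    ...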

it would be interesting to let the model make other speech sounds, like laughing

Bark also did this and it is quite helpful.

We could use semantics like these for the sounds:
[laughter]
[laughs]
[sighs]
[gasps]
[clears throat]
— or ... for hesitations
♪ for song lyrics
CAPITALIZATION for emphasis of a word

Maybe other emotional words would also be interesting, like sad/happy.

But it might be too much work. Do you think it would be possible to add something like this through fine-tuning?

deepspeed inference

Great work. Now the world has two AI voice giants, ElevenLabs and MetaVoice. A great win for the open-source community.

Would you consider supporting DeepSpeed to improve inference speed?

Better Output

Use:
Speech Stability: 10
Speaker Similarity: 4
for better results in the demo.
Thank you

MPS Support

Hi,
Congrats on the launch!
Is MPS (Apple Silicon) or MLX support planned?
Thank you!

Will it natively support other languages?

I saw that the README describes the training data as mainly English, and I worry the model won't learn prosody in other languages; for example, prosody in Chinese should be very different from English.
Will it be possible to learn other languages in the future just by fine-tuning?

Encoder checkpoint missing

When cloning the repo and running just the sample code (python fam/llm/sample.py --huggingface_repo_id="metavoiceio/metavoice-1B-v0.1" --spk_cond_path="assets/ava.flac") without any changes, the model successfully downloads but I get the following error:

Fetching 5 files: 100%|█████████████████████████| 5/5 [00:00<00:00, 8053.58it/s]
Traceback (most recent call last):
  File "/kaggle/working/metavoice-src/fam/llm/sample.py", line 690, in <module>
    smodel, llm_first_stage, llm_second_stage = build_models(
  File "/kaggle/working/metavoice-src/fam/llm/sample.py", line 563, in build_models
    smodel = SpeakerEncoder(device=device, eval=True, verbose=False)
  File "/kaggle/working/metavoice-src/fam/quantiser/audio/speaker_encoder/model.py", line 50, in __init__
    checkpoint = torch.load(weights_fpath, map_location="cpu")
  File "/opt/conda/lib/python3.10/site-packages/torch/serialization.py", line 998, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/opt/conda/lib/python3.10/site-packages/torch/serialization.py", line 445, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/opt/conda/lib/python3.10/site-packages/torch/serialization.py", line 426, in __init__
    super().__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: '/kaggle/working/metavoice-src/fam/quantiser/audio/speaker_encoder/ckpt/ckpt.pt'

Is there a separate encoder checkpoint that's needed for this to work?
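
One thing worth checking (an assumption; I have not confirmed how the repo ships this file): if fam/quantiser/audio/speaker_encoder/ckpt/ckpt.pt is stored with Git LFS, a plain git clone leaves only a pointer file behind, and you would need:

git lfs install
git lfs pull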

Model VRAM

How much VRAM does the loaded model require?
I was told an RTX 4080 with 12 GB of VRAM is not enough.

Quantization Support?

I'd like to quantize the model to 8-bit or 4-bit for running on a GPU with 6 GB of memory, and for a possible speed boost. Is this possible?
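
There is no official quantized build that I know of. As a rough CPU-only experiment, PyTorch's dynamic quantization can shrink the Linear weights to int8; a minimal sketch, assuming model is the loaded first-stage nn.Module (note this path runs on CPU, so it will not reduce GPU VRAM by itself):

import torch

quantized = torch.ao.quantization.quantize_dynamic(
    model,               # loaded nn.Module
    {torch.nn.Linear},   # quantize only the Linear layers
    dtype=torch.qint8,
)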

Possible to get timing info?

Is it possible to have this model also generate millisecond-level timestamps for the words (or phonemes) in the prompt?

I currently use speech marks from AWS Polly, and if this model could generate the same format, that would be very helpful!
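
For reference, Polly's word-level speech marks are newline-delimited JSON, with time in milliseconds from the start of the audio and start/end as offsets into the input text, e.g.:

{"time": 6, "type": "word", "start": 0, "end": 4, "value": "This"}
{"time": 373, "type": "word", "start": 5, "end": 7, "value": "is"}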

I created a working Dockerfile to build it consistently (and on my system)...

Hi, I created this Dockerfile just now to make building easier (and to allow building on my system, a Mac, where Python seems unhappy)...

I am wondering whether it could be included and improved upon; I haven't gotten it running locally yet, but at least it builds for me...

Branch with changes: main...groovybits:metavoice-src:docker

Also, which port does it use, and how do I use it? Or is there more information on how to run it locally as an HTTP server, if that is what it does?

Thanks! I am reading up / exploring as I can, since this looks amazing :)

Updated: it builds in docker compose now; not sure if it is "working" yet (need to look closer).

Any timeline for fine-tuning code

I'm looking to fine-tune this model. As you mentioned you will open-source the fine-tuning code, do you have any timeline in mind for that?
And if possible, please provide some info regarding the training dataset.

ModuleNotFoundError: No module named 'df'

When I run sample.py, the following error occurs:

python /root/metavoice-src/fam/llm/sample.py
Traceback (most recent call last):
  File "/root/metavoice-src/fam/llm/sample.py", line 22, in <module>
    from fam.llm.enhancers import BaseEnhancer, get_enhancer
  File "/root/metavoice-src/fam/llm/enhancers.py", line 5, in <module>
    from df.enhance import enhance, init_df, load_audio, save_audio
ModuleNotFoundError: No module named 'df'

It seems that df.enhance is not a third-party library I recognize, and I couldn't find a corresponding file in the folder either.
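
Judging by the df.enhance import (enhance, init_df, load_audio, save_audio), df appears to be the module shipped by the DeepFilterNet package, so installing it may resolve this. This is an assumption based on the import names, not confirmed from the repo's requirements:

pip install deepfilternet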

Acknowledgement

Hey, I found that your speaker encoder code is similar to this code (link).

Can you please credit them?

Hassan

What do you do every day? How is life going?

Rough fine-tuning guidance

I know the repo ReadMe says "soon", but would it be possible to give some very rough advice on how to fine-tune to improve on the voice's match with a custom speaker?

I guess the demo is just extracting embeddings from bria.mp3, but I'd like to go one step further to get a better voice match. Thanks.

Run the model on cpu

Hello, is there a way to run the model on CPU? The model size is small at just 1.2B parameters, so I was wondering if there is a way to run this model on CPU only.
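
For what it's worth, the first issue on this page invokes the script with --device="cpu", so a CPU path appears to exist, though it hit a dtype error there; combining it with --dtype="float32" might help:

python fam/llm/sample.py --device="cpu" --dtype="float32" --spk_cond_path="assets/bria.mp3" --text="This is a demo of text to speech by MetaVoice-1B, an open-source foundational audio model."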

pip install doesn't work on Nvidia RTX 2070 Super on Ubuntu 20.04.3 LTS

NVIDIA-SMI 525.147.05
Driver Version: 525.147.05
CUDA Version: 12.0 (also tested with 12.2)

Fresh install of Ubuntu.
Used virtualenv to create a "packages" folder, then source packages/bin/activate.
Then pip install -r requirements.txt.

Using this command: python fam/llm/sample.py --huggingface_repo_id="metavoiceio/metavoice-1B-v0.1" --spk_cond_path="assets/ava.flac"

I get this:

/home/user/metavoice-src/packages/lib/python3.10/site-packages/df/io.py:9: UserWarning: `torchaudio.backend.common.AudioMetaData` has been moved to `torchaudio.AudioMetaData`. Please update the import path.
  from torchaudio.backend.common import AudioMetaData
Traceback (most recent call last):
  File "/home/user/metavoice-src/fam/llm/sample.py", line 22, in <module>
    from fam.llm.model import GPT, GPTConfig
  File "/home/user/metavoice-src/fam/llm/model.py", line 12, in <module>
    from fam.llm.layers import Block, LayerNorm, RMSNorm
  File "/home/user/metavoice-src/fam/llm/layers/__init__.py", line 1, in <module>
    from fam.llm.layers.attn import SelfAttention
  File "/home/user/metavoice-src/fam/llm/layers/attn.py", line 3, in <module>
    from flash_attn import (  # type: ignore
  File "/home/user/metavoice-src/packages/lib/python3.10/site-packages/flash_attn/__init__.py", line 3, in <module>
    from flash_attn.flash_attn_interface import (
  File "/home/user/metavoice-src/packages/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 10, in <module>
    import flash_attn_2_cuda as flash_attn_cuda
ImportError: /home/user/metavoice-src/packages/lib/python3.10/site-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops15sum_IntList_out4callERKNS_6TensorEN3c1016OptionalArrayRefIlEEbSt8optionalINS5_10ScalarTypeEERS2_

This occurs even when trying a new virtualenv. Is there a specific Python/CUDA version I should be on?
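
That undefined-symbol error usually means the prebuilt flash-attn binary was compiled against a different PyTorch ABI than the one installed in the virtualenv. A common (not guaranteed) fix is to rebuild it against the installed torch:

pip uninstall flash-attn
pip install flash-attn --no-build-isolation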

Error during installation of deps on runpod instances

Any suggestions to fix this error?

Collecting xformers
  Using cached xformers-0.0.23.post1-cp310-cp310-manylinux2014_x86_64.whl (213.0 MB)
  Using cached xformers-0.0.23-cp310-cp310-manylinux2014_x86_64.whl (213.0 MB)
  Using cached xformers-0.0.22.post7-cp310-cp310-manylinux2014_x86_64.whl (211.8 MB)
  Using cached xformers-0.0.22-cp310-cp310-manylinux2014_x86_64.whl (211.6 MB)
  Using cached xformers-0.0.21-cp310-cp310-manylinux2014_x86_64.whl (167.0 MB)
  Using cached xformers-0.0.20-cp310-cp310-manylinux2014_x86_64.whl (109.1 MB)
  Using cached xformers-0.0.19-cp310-cp310-manylinux2014_x86_64.whl (108.2 MB)
Collecting pyre-extensions==0.0.29
  Using cached pyre_extensions-0.0.29-py3-none-any.whl (12 kB)
Collecting xformers
  Using cached xformers-0.0.18-cp310-cp310-manylinux2014_x86_64.whl (123.8 MB)
Collecting pyre-extensions==0.0.23
  Using cached pyre_extensions-0.0.23-py3-none-any.whl (11 kB)
Collecting xformers
  Using cached xformers-0.0.17-cp310-cp310-manylinux2014_x86_64.whl (123.6 MB)
  Using cached xformers-0.0.16-cp310-cp310-manylinux2014_x86_64.whl (50.9 MB)
  Using cached xformers-0.0.13.tar.gz (292 kB)
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error
  
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [6 lines of output]
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-install-mu60tz0e/xformers_4aa19d62bcc44da58b0f254d6573d196/setup.py", line 18, in <module>
          import torch
      ModuleNotFoundError: No module named 'torch'
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed
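
What is happening here: pip keeps falling back through older xformers releases looking for one it can install, and eventually reaches 0.0.13, which builds from source and imports torch in its setup.py before torch is present. Installing torch first usually sidesteps this:

pip install torch
pip install -r requirements.txt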

Pip install not working

Following the install instructions, when I run
pip install -r requirements.txt
I get the error:

Collecting flash-attn
  Using cached flash_attn-2.5.2.tar.gz (2.5 MB)
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error
  
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [6 lines of output]
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-install-tx9pk_0o/flash-attn_5537406baed04038937f619dc63a1f9f/setup.py", line 17, in <module>
          from wheel.bdist_wheel import bdist_wheel as _bdist_wheel
      ModuleNotFoundError: No module named 'wheel'
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

I have tried installing via pip install packaging and also adding

[build-system]
requires = ["packaging"]

but it does not fix the problem.

Running on Ubuntu 22.04 LTS 64-bit
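
Since the traceback says No module named 'wheel' (not packaging), installing wheel into the same environment before retrying may be all that is needed:

pip install wheel
pip install -r requirements.txt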

bfloat16 not supported on Google Colab T4

I'm trying to get this working in Google Colab based on the info in the readme, but I get the following error:

  File "/content/metavoice-src/fam/llm/sample.py", line 696, in <module>
    smodel, llm_first_stage, llm_second_stage = build_models(
  File "/content/metavoice-src/fam/llm/sample.py", line 571, in build_models
    llm_first_stage = Model(
  File "/content/metavoice-src/fam/llm/sample.py", line 81, in __init__
    nullcontext() if device_type == "cpu" else torch.amp.autocast(device_type=device_type, dtype=ptdtype)
  File "/usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py", line 305, in __init__
    raise RuntimeError(
RuntimeError: Current CUDA Device does not support bfloat16. Please switch dtype to float16.
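
Background: T4 GPUs (compute capability 7.5) predate hardware bfloat16 support, so this is expected on Colab's free tier. As the error itself suggests, switching to float16 should work, e.g. (flags as used in the other issues here):

python fam/llm/sample.py --huggingface_repo_id="metavoiceio/metavoice-1B-v0.1" --spk_cond_path="assets/bria.mp3" --dtype="float16"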

Voice presets

Are voice presets other than ava available for download somewhere?

windows support / Temp wav file error

When I run sample.py locally on Windows 11, the code crashes at line 84 of enhancers.py.

Exception has occurred: LibsndfileError
Error opening 'C:\\Users\\user\\AppData\\Local\\Temp\\tmpms4y14wd.wav': System error.
  File "D:\metavoice-src\fam\llm\enhancers.py", line 84, in __call__
    save_audio(output_file, enhanced, self.df_state.sr())
  File "D:\metavoice-src\fam\llm\sample.py", line 506, in _sample_utterance_batch
    enhancer(str(wav_file) + ".wav", enhanced_tmp.name)
  File "D:\metavoice-src\fam\llm\sample.py", line 544, in sample_utterance
    return _sample_utterance_batch(
  File "D:\metavoice-src\fam\llm\sample.py", line 695, in <module>
    sample_utterance(
soundfile.LibsndfileError: Error opening 'C:\\Users\\user\\AppData\\Local\\Temp\\tmpms4y14wd.wav': System error.

After adding a few breakpoints, I discovered that a zero-length WAV file appears at the temporary path before the crash. However, it gets deleted at the moment of the crash. On the other hand, the WAV file in the samples directory is being created successfully. Could it be an issue with the enhancer?
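
A plausible cause (well documented for tempfile, though I have not verified this repo's exact code path): on Windows, a NamedTemporaryFile created with the default delete=True stays open and cannot be reopened by name by another writer such as soundfile. A hypothetical Windows-friendly pattern around the call in _sample_utterance_batch:

import tempfile

enhanced_tmp = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
enhanced_tmp.close()  # close it so save_audio can reopen the path on Windows
enhancer(str(wav_file) + ".wav", enhanced_tmp.name)
# remember to os.remove(enhanced_tmp.name) once done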

long-form/streaming support?

I want to use it in role plays, and the text is mostly 500+ characters, so generation takes a long time.
Is a streaming mode planned?
Like in XTTS?
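
Until native streaming lands, a rough workaround is to split long text into sentence-sized chunks and synthesize them sequentially, playing each chunk as it finishes. A minimal chunker sketch (synthesize() is a hypothetical stand-in for whatever entry point you call):

import re

def chunks(text, max_chars=200):
    # naive sentence splitter that keeps each chunk under max_chars
    out, cur = [], ""
    for sent in re.split(r"(?<=[.!?])\s+", text):
        if cur and len(cur) + len(sent) + 1 > max_chars:
            out.append(cur)
            cur = sent
        else:
            cur = (cur + " " + sent).strip()
    if cur:
        out.append(cur)
    return out

for piece in chunks(long_text):
    synthesize(piece)  # hypothetical: call the sampling entry point per chunk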
