
anygpt's People

Contributors

any-gpt, eltociear, junzhan2000


anygpt's Issues

Hi, when will the pre-training related code & scripts be released?

I see you have released the mmpretrain file. When do you plan to release all the code and scripts? I believe this would benefit the community a lot and attract many researchers' interest in your great work, like MOSS. Besides, how many A100 GPUs were used in the pre-training stage, and how much time did it take?

About input formats for training and inference

AnyGPT is trained only with the next-token-prediction task.
Taking text-to-image as an example, is the training input a sequence of speech tokens, text tokens, image tokens, and music tokens?
I want to know the input formats for training and inference.
training input: <sos> speech tokens <eos> text tokens <soi> image tokens <eoi> <som> music tokens,
training label: speech tokens <eos> text tokens <soi> image tokens <eoi> <som> music tokens <eom>.
Is my understanding of the training input and label correct?
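
For reference, a minimal sketch (not from the repo) of the shift-by-one relationship described above, assuming a flat token-ID sequence with the modality boundary tokens already inserted; all IDs and names below are illustrative placeholders:

    import torch

    # Hypothetical placeholder IDs; real values come from the multimodal tokenizer's vocabulary.
    SOS, EOS, SOI, EOI, SOM, EOM = 1, 2, 3, 4, 5, 6
    speech, text, image, music = [10, 11], [20, 21], [30, 31], [40, 41]  # toy token IDs

    # Flat sequence: <sos> speech <eos> text <soi> image <eoi> <som> music <eom>
    sequence = torch.tensor([SOS, *speech, EOS, *text, SOI, *image, EOI, SOM, *music, EOM])

    # Plain next-token prediction: the model reads positions 0..T-2 and is trained to
    # predict positions 1..T-1, i.e. the labels are the inputs shifted left by one token.
    input_ids = sequence[:-1]
    labels = sequence[1:]

(With Hugging Face causal-LM models you would normally pass labels equal to input_ids and let the model perform this shift internally.)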

RuntimeError: Error(s) in loading state_dict for SoundStorm

log:

Missing key(s) in state_dict: "net.conformer.layers.0.conv.net.0.weight", "net.conformer.layers.0.conv.net.0.bias", "net.conformer.layers.0.conv.net.2.weight", "net.conformer.layers.0.conv.net.2.bias", "net.conformer.layers.0.conv.net.4.conv.weight", "net.conformer.layers.0.conv.net.4.conv.bias", "net.conformer.layers.0.conv.net.6.gamma", "net.conformer.layers.0.conv.net.7.weight", "net.conformer.layers.0.conv.net.7.bias", "net.conformer.layers.1.conv.net.0.weight", "net.conformer.layers.1.conv.net.0.bias", "net.conformer.layers.1.conv.net.2.weight", "net.conformer.layers.1.conv.net.2.bias", "net.conformer.layers.1.conv.net.4.conv.weight", "net.conformer.layers.1.conv.net.4.conv.bias", "net.conformer.layers.1.conv.net.6.gamma", "net.conformer.layers.1.conv.net.7.weight", "net.conformer.layers.1.conv.net.7.bias", "net.conformer.layers.2.conv.net.0.weight", "net.conformer.layers.2.conv.net.0.bias", "net.conformer.layers.2.conv.net.2.weight", "net.conformer.layers.2.conv.net.2.bias", "net.conformer.layers.2.conv.net.4.conv.weight", "net.conformer.layers.2.conv.net.4.conv.bias", "net.conformer.layers.2.conv.net.6.gamma", "net.conformer.layers.2.conv.net.7.weight", "net.conformer.layers.2.conv.net.7.bias", "net.conformer.layers.3.conv.net.0.weight", "net.conformer.layers.3.conv.net.0.bias", "net.conformer.layers.3.conv.net.2.weight", "net.conformer.layers.3.conv.net.2.bias", "net.conformer.layers.3.conv.net.4.conv.weight", "net.conformer.layers.3.conv.net.4.conv.bias", "net.conformer.layers.3.conv.net.6.gamma", "net.conformer.layers.3.conv.net.7.weight", "net.conformer.layers.3.conv.net.7.bias", "net.conformer.layers.4.conv.net.0.weight", "net.conformer.layers.4.conv.net.0.bias", "net.conformer.layers.4.conv.net.2.weight", "net.conformer.layers.4.conv.net.2.bias", "net.conformer.layers.4.conv.net.4.conv.weight", "net.conformer.layers.4.conv.net.4.conv.bias", "net.conformer.layers.4.conv.net.6.gamma", "net.conformer.layers.4.conv.net.7.weight", "net.conformer.layers.4.conv.net.7.bias", "net.conformer.layers.5.conv.net.0.weight", "net.conformer.layers.5.conv.net.0.bias", "net.conformer.layers.5.conv.net.2.weight", "net.conformer.layers.5.conv.net.2.bias", "net.conformer.layers.5.conv.net.4.conv.weight", "net.conformer.layers.5.conv.net.4.conv.bias", "net.conformer.layers.5.conv.net.6.gamma", "net.conformer.layers.5.conv.net.7.weight", "net.conformer.layers.5.conv.net.7.bias", "net.conformer.layers.6.conv.net.0.weight", "net.conformer.layers.6.conv.net.0.bias", "net.conformer.layers.6.conv.net.2.weight", "net.conformer.layers.6.conv.net.2.bias", "net.conformer.layers.6.conv.net.4.conv.weight", "net.conformer.layers.6.conv.net.4.conv.bias", "net.conformer.layers.6.conv.net.6.gamma", "net.conformer.layers.6.conv.net.7.weight", "net.conformer.layers.6.conv.net.7.bias", "net.conformer.layers.7.conv.net.0.weight", "net.conformer.layers.7.conv.net.0.bias", "net.conformer.layers.7.conv.net.2.weight", "net.conformer.layers.7.conv.net.2.bias", "net.conformer.layers.7.conv.net.4.conv.weight", "net.conformer.layers.7.conv.net.4.conv.bias", "net.conformer.layers.7.conv.net.6.gamma", "net.conformer.layers.7.conv.net.7.weight", "net.conformer.layers.7.conv.net.7.bias", "net.conformer.layers.8.conv.net.0.weight", "net.conformer.layers.8.conv.net.0.bias", "net.conformer.layers.8.conv.net.2.weight", "net.conformer.layers.8.conv.net.2.bias", "net.conformer.layers.8.conv.net.4.conv.weight", "net.conformer.layers.8.conv.net.4.conv.bias", "net.conformer.layers.8.conv.net.6.gamma", 
"net.conformer.layers.8.conv.net.7.weight", "net.conformer.layers.8.conv.net.7.bias", "net.conformer.layers.9.conv.net.0.weight", "net.conformer.layers.9.conv.net.0.bias", "net.conformer.layers.9.conv.net.2.weight", "net.conformer.layers.9.conv.net.2.bias", "net.conformer.layers.9.conv.net.4.conv.weight", "net.conformer.layers.9.conv.net.4.conv.bias", "net.conformer.layers.9.conv.net.6.gamma", "net.conformer.layers.9.conv.net.7.weight", "net.conformer.layers.9.conv.net.7.bias", "net.conformer.layers.10.conv.net.0.weight", "net.conformer.layers.10.conv.net.0.bias", "net.conformer.layers.10.conv.net.2.weight", "net.conformer.layers.10.conv.net.2.bias", "net.conformer.layers.10.conv.net.4.conv.weight", "net.conformer.layers.10.conv.net.4.conv.bias", "net.conformer.layers.10.conv.net.6.gamma", "net.conformer.layers.10.conv.net.7.weight", "net.conformer.layers.10.conv.net.7.bias", "net.conformer.layers.11.conv.net.0.weight", "net.conformer.layers.11.conv.net.0.bias", "net.conformer.layers.11.conv.net.2.weight", "net.conformer.layers.11.conv.net.2.bias", "net.conformer.layers.11.conv.net.4.conv.weight", "net.conformer.layers.11.conv.net.4.conv.bias", "net.conformer.layers.11.conv.net.6.gamma", "net.conformer.layers.11.conv.net.7.weight", "net.conformer.layers.11.conv.net.7.bias". 
	Unexpected key(s) in state_dict: "net.conformer.layers.0.conv.net1.0.weight", "net.conformer.layers.0.conv.net1.0.bias", "net.conformer.layers.0.conv.net1.2.weight", "net.conformer.layers.0.conv.net1.2.bias", "net.conformer.layers.0.conv.ds_conv.conv.weight", "net.conformer.layers.0.conv.ds_conv.conv.bias", "net.conformer.layers.0.conv.net2.1.gamma", "net.conformer.layers.0.conv.net2.2.weight", "net.conformer.layers.0.conv.net2.2.bias", "net.conformer.layers.1.conv.net1.0.weight", "net.conformer.layers.1.conv.net1.0.bias", "net.conformer.layers.1.conv.net1.2.weight", "net.conformer.layers.1.conv.net1.2.bias", "net.conformer.layers.1.conv.ds_conv.conv.weight", "net.conformer.layers.1.conv.ds_conv.conv.bias", "net.conformer.layers.1.conv.net2.1.gamma", "net.conformer.layers.1.conv.net2.2.weight", "net.conformer.layers.1.conv.net2.2.bias", "net.conformer.layers.2.conv.net1.0.weight", "net.conformer.layers.2.conv.net1.0.bias", "net.conformer.layers.2.conv.net1.2.weight", "net.conformer.layers.2.conv.net1.2.bias", "net.conformer.layers.2.conv.ds_conv.conv.weight", "net.conformer.layers.2.conv.ds_conv.conv.bias", "net.conformer.layers.2.conv.net2.1.gamma", "net.conformer.layers.2.conv.net2.2.weight", "net.conformer.layers.2.conv.net2.2.bias", "net.conformer.layers.3.conv.net1.0.weight", "net.conformer.layers.3.conv.net1.0.bias", "net.conformer.layers.3.conv.net1.2.weight", "net.conformer.layers.3.conv.net1.2.bias", "net.conformer.layers.3.conv.ds_conv.conv.weight", "net.conformer.layers.3.conv.ds_conv.conv.bias", "net.conformer.layers.3.conv.net2.1.gamma", "net.conformer.layers.3.conv.net2.2.weight", "net.conformer.layers.3.conv.net2.2.bias", "net.conformer.layers.4.conv.net1.0.weight", "net.conformer.layers.4.conv.net1.0.bias", "net.conformer.layers.4.conv.net1.2.weight", "net.conformer.layers.4.conv.net1.2.bias", "net.conformer.layers.4.conv.ds_conv.conv.weight", "net.conformer.layers.4.conv.ds_conv.conv.bias", "net.conformer.layers.4.conv.net2.1.gamma", "net.conformer.layers.4.conv.net2.2.weight", "net.conformer.layers.4.conv.net2.2.bias", "net.conformer.layers.5.conv.net1.0.weight", "net.conformer.layers.5.conv.net1.0.bias", "net.conformer.layers.5.conv.net1.2.weight", "net.conformer.layers.5.conv.net1.2.bias", "net.conformer.layers.5.conv.ds_conv.conv.weight", "net.conformer.layers.5.conv.ds_conv.conv.bias", "net.conformer.layers.5.conv.net2.1.gamma", "net.conformer.layers.5.conv.net2.2.weight", "net.conformer.layers.5.conv.net2.2.bias", "net.conformer.layers.6.conv.net1.0.weight", "net.conformer.layers.6.conv.net1.0.bias", "net.conformer.layers.6.conv.net1.2.weight", "net.conformer.layers.6.conv.net1.2.bias", "net.conformer.layers.6.conv.ds_conv.conv.weight", "net.conformer.layers.6.conv.ds_conv.conv.bias", "net.conformer.layers.6.conv.net2.1.gamma", "net.conformer.layers.6.conv.net2.2.weight", "net.conformer.layers.6.conv.net2.2.bias", "net.conformer.layers.7.conv.net1.0.weight", "net.conformer.layers.7.conv.net1.0.bias", "net.conformer.layers.7.conv.net1.2.weight", "net.conformer.layers.7.conv.net1.2.bias", "net.conformer.layers.7.conv.ds_conv.conv.weight", "net.conformer.layers.7.conv.ds_conv.conv.bias", "net.conformer.layers.7.conv.net2.1.gamma", "net.conformer.layers.7.conv.net2.2.weight", "net.conformer.layers.7.conv.net2.2.bias", "net.conformer.layers.8.conv.net1.0.weight", "net.conformer.layers.8.conv.net1.0.bias", "net.conformer.layers.8.conv.net1.2.weight", "net.conformer.layers.8.conv.net1.2.bias", "net.conformer.layers.8.conv.ds_conv.conv.weight", 
"net.conformer.layers.8.conv.ds_conv.conv.bias", "net.conformer.layers.8.conv.net2.1.gamma", "net.conformer.layers.8.conv.net2.2.weight", "net.conformer.layers.8.conv.net2.2.bias", "net.conformer.layers.9.conv.net1.0.weight", "net.conformer.layers.9.conv.net1.0.bias", "net.conformer.layers.9.conv.net1.2.weight", "net.conformer.layers.9.conv.net1.2.bias", "net.conformer.layers.9.conv.ds_conv.conv.weight", "net.conformer.layers.9.conv.ds_conv.conv.bias", "net.conformer.layers.9.conv.net2.1.gamma", "net.conformer.layers.9.conv.net2.2.weight", "net.conformer.layers.9.conv.net2.2.bias", "net.conformer.layers.10.conv.net1.0.weight", "net.conformer.layers.10.conv.net1.0.bias", "net.conformer.layers.10.conv.net1.2.weight", "net.conformer.layers.10.conv.net1.2.bias", "net.conformer.layers.10.conv.ds_conv.conv.weight", "net.conformer.layers.10.conv.ds_conv.conv.bias", "net.conformer.layers.10.conv.net2.1.gamma", "net.conformer.layers.10.conv.net2.2.weight", "net.conformer.layers.10.conv.net2.2.bias", "net.conformer.layers.11.conv.net1.0.weight", "net.conformer.layers.11.conv.net1.0.bias", "net.conformer.layers.11.conv.net1.2.weight", "net.conformer.layers.11.conv.net1.2.bias", "net.conformer.layers.11.conv.ds_conv.conv.weight", "net.conformer.layers.11.conv.ds_conv.conv.bias", "net.conformer.layers.11.conv.net2.1.gamma", "net.conformer.layers.11.conv.net2.2.weight", "net.conformer.layers.11.conv.net2.2.bias".

qformer_quantizer.py missing keys: 511 unexpected keys: 146

code

    model = cls(
        vit_model=vit_model,
        img_size=img_size,
        drop_path_rate=drop_path_rate,
        use_grad_checkpoint=use_grad_checkpoint,
        vit_precision=vit_precision,
        freeze_vit=freeze_vit,
        num_query_token=num_query_token,
        cross_attention_freq=cross_attention_freq,
        max_txt_len=max_txt_len,
    )

    if pretrained_model_path.startswith('http'):
        print('start download seed model...')
        cached_file = download_cached_file(pretrained_model_path, check_hash=False, progress=True)
        print(cached_file)
        ckpt = torch.load(cached_file, map_location="cpu")
    else:
        ckpt = torch.load(pretrained_model_path, map_location="cpu")
    # load_state_dict with strict=False returns (missing_keys, unexpected_keys)
    missing, unexpected = model.load_state_dict(ckpt, strict=False)
    print('missing keys: ', len(missing), 'unexpected keys:', len(unexpected))
    return model
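
Not from the repo, but a small diagnostic sketch for cases like this: comparing the checkpoint's keys with the keys the model expects usually shows whether the mismatch is just a naming/version difference (as in the SoundStorm report above) or a genuinely different architecture. It assumes `model` is the quantizer built by the excerpt above and that the checkpoint path matches the one passed on the command line:

    import torch

    ckpt = torch.load("models/seed-tokenizer-2/seed_quantizer.pt", map_location="cpu")
    # If the file wraps its weights (e.g. under a "state_dict" or "model" key), unwrap it first.
    ckpt_keys = set(ckpt.keys())
    model_keys = set(model.state_dict().keys())

    print("keys only in checkpoint:", sorted(ckpt_keys - model_keys)[:10])
    print("keys only in model:     ", sorted(model_keys - ckpt_keys)[:10])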

Consider next version to use LLM instead of UNet

It would be a VERY interesting project to utilize the UNet in combination with the LLM + Time-Aware Semantic Connector for generating images, as the ELLA project has shown.

Not an issue, but it would be fantastic as the LLM is already loaded in memory.

Loss Masking

Thank you for providing the model code and checkpoints.

I'm planning to fine-tune the base model you provided on a downstream task. From what I've seen in the code you shared, there doesn't seem to be any separate loss masking (i.e., the prompt tokens would not contribute to the loss; only the target tokens would produce loss and propagate gradients).

I'm curious whether you really computed the loss over all tokens, without masking, when doing instruction tuning (i.e., while building the -chat model).
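
For context, here is a minimal sketch (not the repo's code) of the kind of loss masking referred to above, using the Hugging Face convention that label value -100 is ignored by the cross-entropy loss; the token IDs and prompt length are illustrative:

    import torch

    # input_ids = prompt tokens followed by target tokens (toy IDs)
    input_ids = torch.tensor([101, 102, 103, 104, 201, 202, 203])
    prompt_len = 4  # number of prompt tokens

    labels = input_ids.clone()
    labels[:prompt_len] = -100  # -100 is the ignore_index used by the HF causal-LM loss

    # loss = model(input_ids=input_ids.unsqueeze(0), labels=labels.unsqueeze(0)).loss
    # -> only the target tokens contribute to the loss and hence to the gradients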

Collaboration request

Hi AnyGPT team,
Thank you for sharing such an amazing project.
I would like to collaborate on a project with any of your team members.
Here is my Gmail address: [email protected]
You can connect with me via Discord (ID: ainerd777).
It would be a great opportunity, and I would love to hear from you soon.
Best regards,

Question about the implementation of the music vocabulary size of 8192

Looking at the code that generates the music codes:

    tokens = encode_music_by_path(music.strip(), self.music_sample_rate, self.music_tokenizer, self.music_processor, self.device, segment_duration=self.music_segment_duration, one_channel=True, start_from_begin=True)
    tokens = tokens[0][0]
    processed_inputs = modality_tokens_to_string(tokens=tokens, modality="music")

The paper mentions 'quantized using an RVQ with four quantizers, each with a codebook size of 2048, resulting in a combined music vocabulary size of 8192.' Is this implemented in the line processed_inputs = modality_tokens_to_string(tokens=tokens, modality="music")? And is the reason for using four layers that four codebooks are needed?
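
For intuition, a hedged sketch (not the repo's actual implementation) of how a 250 × 4 RVQ code matrix with per-layer codebooks of size 2048 can be flattened into a single 8192-entry vocabulary by offsetting each layer's codes:

    import numpy as np

    CODEBOOK_SIZE = 2048  # per-quantizer codebook size from the paper
    NUM_LAYERS = 4        # four RVQ quantizers

    # Toy 250-frame x 4-layer code matrix, standing in for the music tokenizer output.
    codes = np.random.randint(0, CODEBOOK_SIZE, size=(250, NUM_LAYERS))

    # Give layer k the ID range [k*2048, (k+1)*2048), so 4 x 2048 = 8192 distinct IDs.
    combined = codes + np.arange(NUM_LAYERS) * CODEBOOK_SIZE
    flat_ids = combined.flatten()  # interleave the four layers frame by frame
    print(flat_ids.min(), flat_ids.max())  # always within [0, 8192)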

huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/mnt/petrelfs/zhanjun.p/mllm/models/bert-base-uncased'. Use `repo_type` argument if needed.

There is an error below.

2024-03-26 12:57:36.254249: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-03-26 12:57:36.254297: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-03-26 12:57:36.255652: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-03-26 12:57:37.344594: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Since the GPL-licensed package `unidecode` is not installed, using Python's `unicodedata` package which yields worse results.
 NeMo-text-processing :: INFO     :: Creating ClassifyFst grammars.
loading image tokenzier
Traceback (most recent call last):
  File "/content/AnyGPT/anygpt/src/infer/cli_infer_base_model.py", line 337, in <module>
    infer = AnyGPTInference(
  File "/content/AnyGPT/anygpt/src/infer/cli_infer_base_model.py", line 46, in __init__
    self.image_tokenizer = ImageTokenizer(model_path=image_tokenizer_path, load_diffusion=True,
  File "/content/AnyGPT/./seed2/seed_llama_tokenizer.py", line 39, in __init__
    model = Blip2QformerQuantizer.from_pretrained(pretrained_model_path=model_path,
  File "/content/AnyGPT/./seed2/seed_qformer/qformer_quantizer.py", line 354, in from_pretrained
    model = cls(
  File "/content/AnyGPT/./seed2/seed_qformer/qformer_quantizer.py", line 182, in __init__
    self.tokenizer = self.init_tokenizer()
  File "/content/AnyGPT/./seed2/seed_qformer/blip2.py", line 38, in init_tokenizer
    tokenizer = BertTokenizer.from_pretrained("/mnt/petrelfs/zhanjun.p/mllm/models/bert-base-uncased", truncation_side=truncation_side)
  File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 1940, in from_pretrained
    resolved_config_file = cached_file(
  File "/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py", line 429, in cached_file
    resolved_file = hf_hub_download(
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 111, in _inner_fn
    validate_repo_id(arg_value)
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 159, in validate_repo_id
    raise HFValidationError(
huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/mnt/petrelfs/zhanjun.p/mllm/models/bert-base-uncased'. Use `repo_type` argument if needed.
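
The traceback shows that seed2/seed_qformer/blip2.py hard-codes an internal cluster path for the BERT tokenizer. A possible workaround (an assumption on my part, not an official fix) is to point that call at the public bert-base-uncased model, or at your own local copy; "right" below is only an illustrative truncation_side value:

    from transformers import BertTokenizer

    # Original line in seed2/seed_qformer/blip2.py (cluster path that only exists on the authors' machines):
    # tokenizer = BertTokenizer.from_pretrained(
    #     "/mnt/petrelfs/zhanjun.p/mllm/models/bert-base-uncased",
    #     truncation_side=truncation_side)

    # Possible replacement: load from the Hub, or substitute a local download path.
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", truncation_side="right")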

Some weights of BertLMHeadModel were not initialized from the model checkpoint at bert-base-uncased and are newly initialized

Command I run:

    !python anygpt/src/infer/cli_infer_base_model.py \
        --model-name-or-path AnyGPT-base \
        --image-tokenizer-path models/seed-tokenizer-2/seed_quantizer.pt \
        --speech-tokenizer-path models/speechtokenizer/ckpt.dev \
        --speech-tokenizer-config models/speechtokenizer/config.json \
        --soundstorm-path models/soundstorm/speechtokenizer_soundstorm_mls.pt \
        --output-dir "infer_output/base"

Below is the error

NeMo-text-processing :: INFO :: Creating ClassifyFst grammars.
Using device: cuda
loading image tokenzier
/home//.cache/torch/hub/checkpoints/eva_vit_g.pth
INFO:root:freeze vision encoder
Some weights of BertLMHeadModel were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['bert.encoder.layer.11.output_query.dense.weight', 'bert.encoder.layer.0.crossattention.self.value.weight', 'bert.encoder.layer.5.output_query.LayerNorm.bias', 'bert.encoder.layer.8.output_query.LayerNorm.weight', 'bert.encoder.layer.2.crossattention.self.query.weight', 'bert.encoder.layer.10.crossattention.output.dense.bias', 'bert.encoder.layer.5.output_query.dense.weight', 'bert.encoder.layer.2.output_query.LayerNorm.weight', 'bert.encoder.layer.7.output_query.LayerNorm.bias', 'bert.encoder.layer.7.intermediate_query.dense.bias', 'bert.encoder.layer.6.output_query.LayerNorm.bias', 'bert.encoder.layer.11.output_query.dense.bias', 'bert.encoder.layer.1.intermediate_query.dense.bias', 'bert.encoder.layer.6.output_query.dense.bias', 'bert.encoder.layer.9.intermediate_query.dense.bias', 'bert.encoder.layer.11.intermediate_query.dense.weight', 'bert.encoder.layer.6.crossattention.output.dense.weight', 'bert.encoder.layer.3.output_query.LayerNorm.bias', 'bert.encoder.layer.8.crossattention.self.key.weight', 'bert.encoder.layer.0.crossattention.output.LayerNorm.weight', 'bert.encoder.layer.2.output_query.dense.weight', 'bert.encoder.layer.0.crossattention.self.key.bias', 'bert.encoder.layer.6.crossattention.self.query.weight', 'bert.encoder.layer.8.crossattention.self.value.weight', 'bert.encoder.layer.8.crossattention.output.dense.weight', 'bert.encoder.layer.8.crossattention.output.dense.bias', 'bert.encoder.layer.10.output_query.LayerNorm.weight', 'bert.encoder.layer.10.output_query.dense.weight', 'bert.encoder.layer.6.crossattention.self.query.bias', 'bert.encoder.layer.6.output_query.LayerNorm.weight', 'bert.encoder.layer.6.crossattention.self.value.bias', 'bert.encoder.layer.2.crossattention.self.value.weight', 'bert.encoder.layer.8.intermediate_query.dense.weight', 'bert.encoder.layer.2.output_query.LayerNorm.bias', 'bert.encoder.layer.6.crossattention.output.dense.bias', 'bert.encoder.layer.4.intermediate_query.dense.bias', 'bert.encoder.layer.10.crossattention.output.LayerNorm.weight', 'bert.encoder.layer.2.crossattention.output.LayerNorm.weight', 'bert.encoder.layer.1.intermediate_query.dense.weight', 'bert.encoder.layer.4.crossattention.self.key.weight', 'bert.encoder.layer.2.crossattention.self.query.bias', 'bert.encoder.layer.7.intermediate_query.dense.weight', 'bert.encoder.layer.10.crossattention.self.query.weight', 'bert.encoder.layer.9.intermediate_query.dense.weight', 'bert.encoder.layer.6.crossattention.output.LayerNorm.weight', 'bert.encoder.layer.9.output_query.LayerNorm.bias', 'bert.encoder.layer.3.intermediate_query.dense.weight', 'bert.encoder.layer.0.crossattention.self.query.weight', 'bert.encoder.layer.0.crossattention.self.value.bias', 'bert.encoder.layer.8.output_query.LayerNorm.bias', 'bert.encoder.layer.4.output_query.dense.bias', 'bert.encoder.layer.2.crossattention.self.key.bias', 'bert.encoder.layer.1.output_query.LayerNorm.bias', 'bert.encoder.layer.4.crossattention.output.LayerNorm.bias', 'bert.encoder.layer.6.crossattention.self.value.weight', 'bert.encoder.layer.4.crossattention.self.value.weight', 'bert.encoder.layer.0.output_query.LayerNorm.bias', 'bert.encoder.layer.9.output_query.LayerNorm.weight', 'bert.encoder.layer.4.crossattention.output.LayerNorm.weight', 'bert.encoder.layer.0.crossattention.output.dense.weight', 'bert.encoder.layer.7.output_query.LayerNorm.weight', 'bert.encoder.layer.8.crossattention.self.key.bias', 
'bert.encoder.layer.8.output_query.dense.bias', 'bert.encoder.layer.0.intermediate_query.dense.weight', 'bert.encoder.layer.2.intermediate_query.dense.weight', 'bert.encoder.layer.0.crossattention.output.dense.bias', 'bert.encoder.layer.0.crossattention.self.query.bias', 'bert.encoder.layer.3.output_query.LayerNorm.weight', 'bert.encoder.layer.6.crossattention.output.LayerNorm.bias', 'bert.encoder.layer.10.crossattention.self.value.weight', 'bert.encoder.layer.2.crossattention.self.value.bias', 'bert.encoder.layer.11.output_query.LayerNorm.bias', 'bert.encoder.layer.6.crossattention.self.key.weight', 'bert.encoder.layer.4.crossattention.self.key.bias', 'bert.encoder.layer.0.output_query.dense.weight', 'bert.encoder.layer.4.crossattention.self.query.weight', 'bert.encoder.layer.6.crossattention.self.key.bias', 'bert.encoder.layer.5.intermediate_query.dense.weight', 'bert.encoder.layer.1.output_query.dense.weight', 'bert.encoder.layer.5.output_query.LayerNorm.weight', 'bert.encoder.layer.2.crossattention.output.LayerNorm.bias', 'bert.encoder.layer.9.output_query.dense.weight', 'bert.encoder.layer.4.crossattention.self.query.bias', 'bert.encoder.layer.11.intermediate_query.dense.bias', 'bert.encoder.layer.6.output_query.dense.weight', 'bert.encoder.layer.5.output_query.dense.bias', 'bert.encoder.layer.6.intermediate_query.dense.weight', 'bert.encoder.layer.2.output_query.dense.bias', 'bert.encoder.layer.8.crossattention.output.LayerNorm.bias', 'bert.encoder.layer.4.crossattention.self.value.bias', 'bert.encoder.layer.1.output_query.LayerNorm.weight', 'bert.encoder.layer.10.output_query.LayerNorm.bias', 'bert.encoder.layer.3.output_query.dense.weight', 'bert.encoder.layer.4.output_query.LayerNorm.weight', 'bert.encoder.layer.8.crossattention.self.value.bias', 'bert.encoder.layer.8.crossattention.output.LayerNorm.weight', 'bert.encoder.layer.10.output_query.dense.bias', 'bert.encoder.layer.8.crossattention.self.query.bias', 'bert.encoder.layer.4.intermediate_query.dense.weight', 'bert.encoder.layer.0.crossattention.output.LayerNorm.bias', 'bert.encoder.layer.0.crossattention.self.key.weight', 'bert.encoder.layer.10.crossattention.self.key.bias', 'bert.encoder.layer.0.output_query.dense.bias', 'bert.encoder.layer.4.output_query.LayerNorm.bias', 'bert.encoder.layer.3.output_query.dense.bias', 'bert.encoder.layer.7.output_query.dense.bias', 'bert.encoder.layer.3.intermediate_query.dense.bias', 'bert.encoder.layer.1.output_query.dense.bias', 'bert.encoder.layer.4.output_query.dense.weight', 'bert.encoder.layer.10.crossattention.output.dense.weight', 'bert.encoder.layer.8.crossattention.self.query.weight', 'bert.encoder.layer.10.crossattention.self.query.bias', 'bert.encoder.layer.9.output_query.dense.bias', 'bert.encoder.layer.4.crossattention.output.dense.bias', 'bert.encoder.layer.7.output_query.dense.weight', 'bert.encoder.layer.2.intermediate_query.dense.bias', 'bert.encoder.layer.10.crossattention.self.value.bias', 'bert.encoder.layer.10.crossattention.output.LayerNorm.bias', 'bert.encoder.layer.0.output_query.LayerNorm.weight', 'bert.encoder.layer.5.intermediate_query.dense.bias', 'bert.encoder.layer.4.crossattention.output.dense.weight', 'bert.encoder.layer.8.output_query.dense.weight', 'bert.encoder.layer.6.intermediate_query.dense.bias', 'bert.encoder.layer.2.crossattention.output.dense.weight', 'bert.encoder.layer.10.intermediate_query.dense.weight', 'bert.encoder.layer.0.intermediate_query.dense.bias', 'bert.encoder.layer.2.crossattention.output.dense.bias', 
'bert.encoder.layer.10.intermediate_query.dense.bias', 'bert.encoder.layer.8.intermediate_query.dense.bias', 'bert.encoder.layer.2.crossattention.self.key.weight', 'bert.encoder.layer.10.crossattention.self.key.weight', 'bert.encoder.layer.11.output_query.LayerNorm.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
missing keys: 511 unexpected keys: 146
loading music tokenizer
Could not find image processor class in the image processor config or the model config. Loading based on pattern matching with the model's feature extractor configuration.
loading audio tokenizer
Could not find image processor class in the image processor config or the model config. Loading based on pattern matching with the model's feature extractor configuration.
loading llm

Questions about the paper's music tokenization and the music generation examples

Hello, I really like this work that unifies all the modalities! I have two small questions I hope you can answer:

First, the paper states 'resulting in a combined music vocabulary size of 8192. We encode 5 seconds music into 250 latent frames, ultimately generating a 250 × 4 codes matrix.' This seems inconsistent with the Music parameters in Table 1?

Also, the paper mentions that the music part uses metadata including lyrics, but the examples do not show any audio with lyrics; what is the reason for this? (By the way, the contrast between the example music and Kumiko is pretty big, haha 😂)

Question about training stage and dataset

If my understanding is correct, you trained the model through 2 stages:

  1. Pretraining, which used the data you list in Paper Table 7.
  2. Fine-tuning with instruction data.

However, here is a detail that confuses me: is the base model you released trained only with the pretraining stage? If yes, why can the model handle the TTS task, which is not part of that training?

Another question: I noticed there is a lot of code about the audio modality, so I assume your team already prepared the relevant data and generated instructions for it. Why was it removed in the end? Did it hurt performance on speech- or music-related tasks, or were there other reasons?

Thanks, and I look forward to your response.

ModuleNotFoundError: No module named 'mmgpt.src'

There is an error below.

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.
0it [00:00, ?it/s]
2024-03-26 10:15:35.821675: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-03-26 10:15:35.821777: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-03-26 10:15:35.952168: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-03-26 10:15:38.385135: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Traceback (most recent call last):
  File "/content/AnyGPT/anygpt/src/infer/cli_infer_base_model.py", line 24, in <module>
    from infer.pre_post_process import extract_text_between_tags
  File "/content/AnyGPT/./anygpt/src/infer/pre_post_process.py", line 7, in <module>
    from mmgpt.src.m_utils.prompter_mmgpt import Prompter
ModuleNotFoundError: No module named 'mmgpt.src'

It occurs in the Google Colab notebook below:

https://colab.research.google.com/drive/13_gZPIRG6ShkAbI76-hC_etvfGhry0DZ?usp=sharing
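
Not an official fix, but one possible workaround sketch, assuming the package was simply renamed from mmgpt to anygpt and the m_utils module still exists under anygpt/src (which the CLI script already puts on the path); the patched import is an assumption, not the repo's confirmed layout:

    # anygpt/src/infer/pre_post_process.py -- illustrative patch only
    # Original import, which fails because no `mmgpt` package exists in the repo:
    # from mmgpt.src.m_utils.prompter_mmgpt import Prompter

    # If the module lives under anygpt/src/m_utils, importing it relative to that
    # directory (already on sys.path when the CLI script runs) may work instead:
    from m_utils.prompter_mmgpt import Prompter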

code/dataset/model

I don't get why it now seems common to make empty GitHub/HF repos to farm stars?

Are there regulatory bodies that prevent a direct release?
