
st-llm's Introduction

ST-LLM


News 📢

  • [2024/3/28] All code and weights are now available! Feel free to watch this repository for the latest updates.

Introduction 💡

  • ST-LLM is a temporal-sensitive video large language model. Our model incorporates three key architectural designs:
    • (1) Joint spatial-temporal modeling within large language models for effective video understanding.
    • (2) A dynamic masking strategy and masked video modeling for efficiency and robustness.
    • (3) A global-local input module for long video understanding (see the illustrative sketch after the results table below).
  • ST-LLM has established new state-of-the-art results on MVBench, VideoChatGPT Bench and VideoQA Bench:
Method          MVBench   VcgBench                                             VideoQA Bench
                          Avg    Correct  Detail  Context  Temporal  Consist   MSVD   MSRVTT  ANet
VideoChatGPT    32.7      2.38   2.40     2.52    2.62     1.98      2.37      64.9   49.3    35.7
LLaMA-VID       -         2.89   2.96     3.00    3.53     2.46      2.51      69.7   57.7    47.4
Chat-UniVi      -         2.99   2.89     2.91    3.46     2.89      2.81      65.0   54.6    45.8
VideoChat2      51.1      2.98   3.02     2.88    3.51     2.66      2.81      70.0   54.1    49.1
ST-LLM          54.9      3.15   3.23     3.05    3.74     2.93      2.81      74.6   63.2    50.9
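
For intuition only, below is a minimal, purely illustrative PyTorch sketch of what a global-local video input and a dynamic mask over video tokens could look like. The function names, tensor shapes, uniform frame sampling, and mean pooling are our own assumptions for illustration, not the implementation in this repository; please refer to the paper and the stllm package for the actual code.

import torch

def global_local_input(frame_tokens: torch.Tensor, num_local_frames: int = 4) -> torch.Tensor:
    # frame_tokens: (T, N, D) per-frame visual tokens (assumed shape).
    # Global part: average the tokens over all T frames to summarize the whole video.
    # Local part: keep the full tokens of a few uniformly sampled frames for fine detail.
    T, N, D = frame_tokens.shape
    global_part = frame_tokens.mean(dim=0)                        # (N, D)
    idx = torch.linspace(0, T - 1, steps=num_local_frames).long()
    local_part = frame_tokens[idx].reshape(-1, D)                 # (num_local_frames * N, D)
    return torch.cat([global_part, local_part], dim=0)

def random_token_mask(video_tokens: torch.Tensor, mask_ratio: float = 0.5) -> torch.Tensor:
    # Drop a random subset of video tokens, as a stand-in for a dynamic masking strategy.
    num_tokens = video_tokens.shape[0]
    keep = torch.rand(num_tokens).argsort()[: int(num_tokens * (1 - mask_ratio))]
    return video_tokens[keep.sort().values]  # keep surviving tokens in their original order

# Dummy example: 64 frames, 32 tokens per frame, 4096-dim embeddings.
tokens = torch.randn(64, 32, 4096)
llm_video_inputs = random_token_mask(global_local_input(tokens))
print(llm_video_inputs.shape)  # torch.Size([80, 4096]) with the defaults above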

Demo 🤗

Please download the conversation weights from here and follow the instructions in Installation first. Then run the Gradio demo:

CUDA_VISIBLE_DEVICES=0 python3 demo_gradio.py --ckpt-path /path/to/STLLM_conversation_weight

We have also prepared local scripts that are easy to modify: demo.py

Examples 👀

  • Video Description: even for difficult videos with complex scene changes, ST-LLM can accurately describe their content.

  • Action Identification: ST-LLM can accurately and comprehensively describe the actions occurring in the video.

  • Reasoning: for challenging open-ended reasoning questions, ST-LLM also provides reasonable answers.

Installation 🛠️

Git clone our repository, create a Python environment, and activate it via the following commands:

git clone https://github.com/farewellthree/ST-LLM.git
cd ST-LLM
conda create --name stllm python=3.10
conda activate stllm
pip install -r requirement.txt
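
As a quick post-install sanity check, the short snippet below prints the installed PyTorch and Transformers versions and whether a GPU is visible. It only assumes that requirement.txt pulls in torch and transformers, which the inference code shown elsewhere on this page depends on.

import torch
import transformers

# Confirm the environment before running the demo or evaluation scripts.
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)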

Training & Validation 📊

Instructions for data preparation, training, and evaluation can be found in trainval.md.

Acknowledgement 👍

Citation ✍️

If you find the code and paper useful for your research, please consider starring this repo and citing our paper:

@article{liu2023one,
  title={One for all: Video conversation is feasible without video instruction tuning},
  author={Liu, Ruyang and Li, Chen and Ge, Yixiao and Shan, Ying and Li, Thomas H and Li, Ge},
  journal={arXiv preprint arXiv:2309.15785},
  year={2023}
}
@article{liu2024st,
  title={ST-LLM: Large Language Models Are Effective Temporal Learners},
  author={Liu, Ruyang and Li, Chen and Tang, Haoran and Ge, Yixiao and Shan, Ying and Li, Ge},
  journal={arXiv preprint arXiv:2404.00308},
  year={2024}
}

st-llm's People

Contributors

farewellthree


st-llm's Issues

Inference Error

Hi,

I ran scripts/inference/vcgbench/test_general.sh following your README, but I get the following error, which I have not been able to solve.

Traceback (most recent call last):
File "/mnt_alipayshnas/zirui.lgp/ST-LLM/stllm/test/vcgbench/videochatgpt_benchmark_general.py", line 127, in
run_inference(args)
File "/mnt_alipayshnas/zirui.lgp/ST-LLM/stllm/test/vcgbench/videochatgpt_benchmark_general.py", line 109, in run_inference
llm_message = chat.answer(conv=chat_state,
File "/mnt_alipayshnas/zirui.lgp/ST-LLM/stllm/conversation/conversation.py", line 231, in answer
outputs = llama_model.generate(
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/transformers/generation/utils.py", line 1609, in generate
result = self._beam_search(
File "/opt/conda/lib/python3.10/site-packages/transformers/generation/utils.py", line 3062, in _beam_search
outputs = self(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt_alipayshnas/zirui.lgp/ST-LLM/stllm/models/st_llm.py", line 119, in forward
return super(STLLMForCausalLM, self).forward(inputs_embeds=inputs_embeds, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1196, in forward
outputs = self.model(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt_alipayshnas/zirui.lgp/ST-LLM/stllm/models/st_llm.py", line 58, in forward
return super(STLLMLlamaModel, self).forward(inputs_embeds=inputs_embeds, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 990, in forward
causal_mask = self._update_causal_mask(attention_mask, inputs_embeds, cache_position)
File "/opt/conda/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1071, in _update_causal_mask
attention_mask.shape[-1] if isinstance(attention_mask, torch.Tensor) else cache_position[-1] + 1
IndexError: index -1 is out of bounds for dimension 0 with size 0

Some weights of the model checkpoint at stllm/output/instructblipbase_stllm_conversation/ were not used?

Dear author:
When loading the pretrained checkpoint, some weights are not used. Is this normal?

Loading checkpoint shards: 100%|██████████| 2/2 [00:14<00:00,  7.15s/it]
Some weights of the model checkpoint at stllm/output/instructblipbase_stllm_conversation/checkpoint-5490/ were not used when initializing STLLMForCausalLM: ['model.stllm_model.Qformer.bert.encoder.layer.4.attention.output.LayerNorm.weight', 'model.stllm_model.Qformer.bert.encoder.layer.6.crossattention.self.key.bias', 'model.stllm_model.Qformer.bert.encoder.layer.5.attention.self.key.weight', 'model.stllm_model.visual_encoder.blocks.22.mlp.fc1.bias', 'model.stllm_model.Qformer.bert.embeddings.position_ids', 'model.stllm_model.visual_encoder.blocks.24.norm1.bias', 'model.stllm_model.Qformer.bert.encoder.layer.4.attention.output.dense.bias', 
'model.stllm_model.Qformer.bert.encoder.layer.1.attention.self.query.bias', 'model.stllm_model.visual_encoder.blocks.33.norm2.bias', 'model.stllm_model.visual_encoder.blocks.37.attn.proj.bias', 'model.stllm_model.Qformer.bert.encoder.layer.5.attention.self.value.bias', 'model.stllm_model.visual_encoder.blocks.24.mlp.fc2.bias', 'model.stllm_model.Qformer.bert.encoder.layer.3.attention.self.query.bias', 'model.stllm_model.Qformer.bert.encoder.layer.4.attention.self.value.bias', 'model.stllm_model.Qformer.bert.encoder.layer.4.crossattention.output.LayerNorm.bias', 'model.stllm_model.Qformer.bert.encoder.layer.4.intermediate.dense.bias', 'model.stllm_model.visual_encoder.blocks.14.attn.proj.weight', 'model.stllm_model.visual_encoder.blocks.4.mlp.fc2.weight', 'model.stllm_model.Qformer.bert.encoder.layer.10.crossattention.self.value.bias', 'model.stllm_model.visual_encoder.blocks.29.norm1.weight', 'model.stllm_model.Qformer.bert.encoder.layer.2.attention.self.query.weight', 'model.stllm_model.visual_encoder.blocks.32.attn.v_bias', 'model.stllm_model.visual_encoder.blocks.12.attn.proj.bias', 'model.stllm_model.visual_encoder.blocks.37.mlp.fc2.weight', 'model.stllm_model.visual_encoder.blocks.33.mlp.fc1.bias', 'model.stllm_model.visual_encoder.blocks.36.norm2.weight', 'model.stllm_model.visual_encoder.blocks.20.norm1.bias', 'model.stllm_model.visual_encoder.blocks.16.mlp.fc1.weight', 'model.stllm_model.visual_encoder.blocks.17.mlp.fc2.weight', 'model.stllm_model.visual_encoder.blocks.10.attn.proj.bias', 'model.stllm_model.visual_encoder.blocks.6.mlp.fc1.weight', 'model.stllm_model.Qformer.bert.encoder.layer.10.crossattention.self.query.bias', 'model.stllm_model.Qformer.bert.encoder.layer.8.output_query.dense.weight', 'model.stllm_model.Qformer.bert.encoder.layer.8.crossattention.self.value.bias', 'model.stllm_model.visual_encoder.blocks.33.attn.q_bias', 'model.stllm_model.Qformer.bert.encoder.layer.5.attention.output.dense.bias', 'model.stllm_model.Qformer.bert.encoder.layer.6.intermediate_query.dense.weight', 'model.stllm_model.visual_encoder.blocks.15.mlp.fc2.weight', 'model.stllm_model.Qformer.bert.encoder.layer.0.attention.self.value.weight', 'model.stllm_model.visual_encoder.blocks.0.norm2.weight', 'model.stllm_model.visual_encoder.blocks.26.mlp.fc2.weight', 'model.stllm_model.Qformer.bert.encoder.layer.10.attention.output.LayerNorm.bias', 'model.stllm_model.Qformer.bert.encoder.layer.0.attention.output.LayerNorm.weight', 'model.stllm_model.visual_encoder.blocks.28.mlp.fc2.bias', 'model.stllm_model.visual_encoder.blocks.5.attn.q_bias', 'model.stllm_model.visual_encoder.blocks.31.norm2.bias', 'model.stllm_model.visual_encoder.blocks.18.mlp.fc2.weight', 'model.stllm_model.visual_encoder.blocks.9.attn.qkv.weight', 'model.stllm_model.Qformer.bert.encoder.layer.8.output.LayerNorm.weight', 'model.stllm_model.visual_encoder.blocks.24.norm2.weight', 'model.stllm_model.visual_encoder.blocks.9.norm2.weight', 'model.stllm_model.visual_encoder.blocks.25.norm2.weight', 'model.stllm_model.visual_encoder.blocks.4.norm2.bias', 'model.stllm_model.visual_encoder.blocks.15.mlp.fc1.bias', 'model.stllm_model.Qformer.bert.encoder.layer.6.crossattention.output.dense.bias', 'model.stllm_model.visual_encoder.blocks.27.norm1.weight', 'model.stllm_model.visual_encoder.blocks.25.attn.proj.bias', 'model.stllm_model.visual_encoder.pos_embed', 'model.stllm_model.visual_encoder.blocks.11.attn.q_bias', 'model.stllm_model.visual_encoder.blocks.16.norm1.bias', 
'model.stllm_model.Qformer.bert.encoder.layer.2.intermediate_query.dense.bias', 'model.stllm_model.Qformer.bert.encoder.layer.1.attention.self.value.weight', 'model.stllm_model.Qformer.bert.encoder.layer.6.output_query.dense.bias', 'model.stllm_model.visual_encoder.blocks.22.norm1.bias', 'model.stllm_model.Qformer.bert.encoder.layer.8.crossattention.output.LayerNorm.bias', 'model.stllm_model.visual_encoder.blocks.34.norm1.bias', 'model.stllm_model.Qformer.bert.encoder.layer.11.attention.self.query.weight', 'model.stllm_model.visual_encoder.blocks.0.mlp.fc2.weight', 'model.stllm_model.visual_encoder.blocks.17.mlp.fc1.weight', 'model.stllm_model.visual_encoder.blocks.8.attn.proj.bias', 'model.stllm_model.visual_encoder.blocks.32.mlp.fc2.bias', 'model.stllm_model.Qformer.bert.encoder.layer.2.attention.self.value.weight', 'model.stllm_model.visual_encoder.blocks.14.attn.proj.bias', 'model.stllm_model.visual_encoder.blocks.22.attn.proj.bias', 'model.stllm_model.visual_encoder.blocks.1.mlp.fc1.bias', 'model.stllm_model.Qformer.bert.encoder.layer.2.crossattention.self.key.bias', 'model.stllm_model.visual_encoder.blocks.17.attn.proj.bias', 'model.stllm_model.Qformer.bert.encoder.layer.8.attention.output.LayerNorm.bias', 'model.stllm_model.visual_encoder.blocks.23.norm1.weight', 'model.stllm_model.visual_encoder.blocks.26.norm2.bias', 'model.stllm_model.visual_encoder.blocks.38.mlp.fc2.bias', 'model.stllm_model.visual_encoder.blocks.5.attn.proj.bias', 'model.stllm_model.Qformer.bert.encoder.layer.7.intermediate_query.dense.bias', 'model.stllm_model.visual_encoder.blocks.19.mlp.fc1.weight', 'model.stllm_model.Qformer.bert.encoder.layer.0.attention.self.key.weight', 'model.stllm_model.visual_encoder.blocks.17.norm1.weight', 'model.stllm_model.Qformer.bert.encoder.layer.11.attention.self.key.weight', 'model.stllm_model.visual_encoder.blocks.19.mlp.fc1.bias', 'model.stllm_model.Qformer.bert.encoder.layer.9.output.dense.weight', 'model.stllm_model.up_proj.weight', 'model.stllm_model.Qformer.bert.encoder.layer.8.attention.output.dense.bias', 'model.stllm_model.visual_encoder.blocks.12.norm2.weight', 'model.stllm_model.visual_encoder.blocks.11.mlp.fc1.bias', 'model.stllm_model.Qformer.bert.encoder.layer.1.output.dense.weight', 'model.stllm_model.Qformer.bert.encoder.layer.0.output.LayerNorm.bias', 'model.stllm_model.visual_encoder.blocks.28.attn.qkv.weight', 'model.stllm_model.Qformer.bert.encoder.layer.6.intermediate.dense.bias', 'model.stllm_model.visual_encoder.blocks.26.norm1.bias', 'model.stllm_model.visual_encoder.blocks.34.mlp.fc1.bias', 'model.stllm_model.Qformer.bert.encoder.layer.10.attention.self.query.weight', 'model.stllm_model.Qformer.bert.encoder.layer.6.output_query.dense.weight', 'model.stllm_model.Qformer.bert.encoder.layer.8.attention.self.query.weight', 'model.stllm_model.Qformer.bert.encoder.layer.8.attention.self.value.weight', 'model.stllm_model.visual_encoder.blocks.18.attn.proj.bias', 'model.stllm_model.visual_encoder.blocks.27.attn.qkv.weight', 'model.stllm_model.visual_encoder.blocks.7.norm1.weight', 'model.stllm_model.Qformer.bert.encoder.layer.7.output.dense.bias', 'model.stllm_model.visual_encoder.blocks.24.attn.qkv.weight', 'model.stllm_model.visual_encoder.blocks.20.norm1.weight', 'model.stllm_model.visual_encoder.blocks.36.attn.proj.bias', 'model.stllm_model.Qformer.bert.encoder.layer.5.attention.output.LayerNorm.weight', 'model.stllm_model.Qformer.bert.encoder.layer.11.output.LayerNorm.bias', 'model.stllm_model.visual_encoder.blocks.0.norm1.weight', 
'model.stllm_model.visual_encoder.blocks.7.norm1.bias', 'model.stllm_model.visual_encoder.blocks.33.attn.qkv.weight', 'model.stllm_model.visual_encoder.blocks.33.norm1.bias', 'model.stllm_model.visual_encoder.blocks.4.attn.v_bias', 'model.stllm_model.visual_encoder.blocks.7.attn.v_bias', 'model.stllm_model.visual_encoder.blocks.23.attn.qkv.weight', 'model.stllm_model.Qformer.bert.encoder.layer.6.output_query.LayerNorm.weight', 'model.stllm_model.visual_encoder.blocks.31.mlp.fc1.bias', 'model.stllm_model.Qformer.bert.encoder.layer.10.crossattention.self.key.bias', 'model.stllm_model.Qformer.bert.encoder.layer.11.attention.self.query.bias', 'model.stllm_model.visual_encoder.blocks.26.mlp.fc1.bias', 'model.stllm_model.visual_encoder.blocks.3.mlp.fc1.weight', 'model.stllm_model.visual_encoder.blocks.33.mlp.fc2.weight', 'model.stllm_model.Qformer.bert.encoder.layer.3.output.LayerNorm.weight', 'model.stllm_model.visual_encoder.blocks.11.norm2.bias', 'model.stllm_model.visual_encoder.blocks.33.attn.proj.bias', 'model.stllm_model.Qformer.bert.encoder.layer.6.attention.output.LayerNorm.weight', 'model.stllm_model.Qformer.bert.encoder.layer.7.attention.output.LayerNorm.bias', 'model.stllm_model.Qformer.bert.encoder.layer.6.intermediate.dense.weight', 'model.stllm_model.visual_encoder.blocks.16.attn.v_bias', 'model.stllm_model.Qformer.bert.encoder.layer.8.output.LayerNorm.bias', 'model.stllm_model.visual_encoder.blocks.25.mlp.fc2.weight', 'model.stllm_model.visual_encoder.blocks.15.attn.v_bias', 'model.stllm_model.visual_encoder.blocks.36.mlp.fc1.weight', 'model.stllm_model.visual_encoder.blocks.38.norm1.weight', 'model.stllm_model.visual_encoder.blocks.19.mlp.fc2.weight', 'model.stllm_model.Qformer.bert.encoder.layer.1.intermediate.dense.bias', 'model.stllm_model.visual_encoder.blocks.8.norm1.weight', 'model.stllm_model.visual_encoder.blocks.27.attn.proj.bias', 'model.stllm_model.visual_encoder.blocks.9.norm2.bias', 'model.stllm_model.mvm_decoder.head.weight', 'model.stllm_model.visual_encoder.blocks.12.attn.q_bias', 'model.stllm_model.visual_encoder.blocks.2.attn.qkv.weight', 'model.stllm_model.visual_encoder.blocks.3.mlp.fc2.bias', 'model.stllm_model.visual_encoder.blocks.34.norm2.bias', 'model.stllm_model.Qformer.bert.encoder.layer.4.attention.self.key.weight', 'model.stllm_model.visual_encoder.blocks.20.attn.q_bias', 'model.stllm_model.visual_encoder.blocks.5.norm2.weight', 'model.stllm_model.visual_encoder.blocks.35.norm1.bias', 'model.stllm_model.visual_encoder.blocks.38.attn.qkv.weight', 'model.stllm_model.Qformer.bert.encoder.layer.7.output_query.dense.bias', 'model.stllm_model.visual_encoder.blocks.29.attn.proj.weight', 'model.stllm_model.visual_encoder.blocks.23.attn.v_bias', 'model.stllm_model.visual_encoder.blocks.4.attn.proj.weight', 'model.stllm_model.Qformer.bert.encoder.layer.3.attention.self.query.weight', 'model.stllm_model.visual_encoder.blocks.13.attn.q_bias', 'model.stllm_model.visual_encoder.blocks.26.attn.proj.weight', 'model.stllm_model.visual_encoder.blocks.21.mlp.fc1.bias', 'model.stllm_model.visual_encoder.blocks.11.mlp.fc1.weight', 'model.stllm_model.visual_encoder.blocks.30.attn.q_bias', 'model.stllm_model.Qformer.bert.encoder.layer.0.crossattention.self.value.weight', 'model.stllm_model.visual_encoder.blocks.12.mlp.fc2.weight', 'model.stllm_model.Qformer.bert.encoder.layer.2.crossattention.self.query.weight', 'model.stllm_model.Qformer.bert.encoder.layer.2.output.dense.weight', 'model.stllm_model.Qformer.bert.encoder.layer.10.attention.output.dense.weight', 
'model.stllm_model.visual_encoder.blocks.12.attn.qkv.weight', 'model.stllm_model.visual_encoder.blocks.25.mlp.fc2.bias', 'model.stllm_model.Qformer.bert.encoder.layer.4.attention.self.key.bias', 'model.stllm_model.visual_encoder.blocks.29.attn.v_bias', 'model.stllm_model.visual_encoder.blocks.3.norm2.weight', 'model.stllm_model.Qformer.bert.encoder.layer.9.output_query.LayerNorm.weight', 'model.stllm_model.Qformer.bert.encoder.layer.6.crossattention.output.LayerNorm.bias', 'model.stllm_model.mvm_decoder.head.bias', 'model.stllm_model.visual_encoder.blocks.10.mlp.fc2.bias', 'model.stllm_model.Qformer.bert.encoder.layer.4.output_query.LayerNorm.bias', 'model.stllm_model.visual_encoder.blocks.2.mlp.fc2.bias', 'model.stllm_model.visual_encoder.blocks.17.norm2.bias', 'model.stllm_model.visual_encoder.blocks.28.attn.v_bias', 'model.stllm_model.Qformer.bert.encoder.layer.2.attention.self.key.weight', 'model.stllm_model.visual_encoder.blocks.29.mlp.fc1.weight', 'model.stllm_model.Qformer.bert.encoder.layer.9.attention.self.query.weight', 'model.stllm_model.Qformer.bert.encoder.layer.5.output_query.dense.weight', 'model.stllm_model.visual_encoder.blocks.13.mlp.fc1.bias', 'model.stllm_model.visual_encoder.blocks.16.norm2.weight', 'model.stllm_model.Qformer.bert.encoder.layer.7.output.dense.weight', 'model.stllm_model.Qformer.bert.encoder.layer.4.attention.self.query.weight', 'model.stllm_model.Qformer.bert.encoder.layer.2.crossattention.self.key.weight', 'model.stllm_model.visual_encoder.blocks.2.norm2.bias', 'model.stllm_model.Qformer.bert.encoder.layer.4.intermediate_query.dense.bias', 'model.stllm_model.Qformer.bert.encoder.layer.8.output.dense.bias', 'model.stllm_model.visual_encoder.blocks.15.attn.proj.weight', 'model.stllm_model.Qformer.bert.encoder.layer.2.output_query.LayerNorm.bias', 'model.stllm_model.Qformer.bert.encoder.layer.9.attention.self.query.bias', 'model.stllm_model.visual_encoder.blocks.6.attn.proj.weight', 'model.stllm_model.Qformer.bert.encoder.layer.11.attention.self.value.bias', 'model.stllm_model.visual_encoder.blocks.31.attn.qkv.weight', 'model.stllm_model.Qformer.bert.encoder.layer.0.crossattention.output.LayerNorm.weight', 'model.stllm_model.visual_encoder.blocks.4.attn.proj.bias', 'model.stllm_model.visual_encoder.blocks.24.mlp.fc1.weight', 'model.stllm_model.visual_encoder.blocks.37.mlp.fc1.weight', 'model.stllm_model.Qformer.bert.encoder.layer.7.attention.self.value.bias', 'model.stllm_model.visual_encoder.blocks.19.norm2.bias', 'model.stllm_model.Qformer.bert.encoder.layer.4.attention.output.LayerNorm.bias', 'model.stllm_model.Qformer.bert.encoder.layer.4.attention.self.value.weight', 'model.stllm_model.Qformer.bert.encoder.layer.7.attention.output.dense.bias', 'model.stllm_model.visual_encoder.blocks.3.norm2.bias', 'model.stllm_model.visual_encoder.blocks.7.attn.proj.weight', 'model.stllm_model.visual_encoder.blocks.25.attn.proj.weight', 'model.stllm_model.Qformer.bert.encoder.layer.1.attention.output.LayerNorm.bias', 'model.stllm_model.visual_encoder.blocks.27.mlp.fc1.bias', 'model.stllm_model.visual_encoder.blocks.14.mlp.fc1.bias', 'model.stllm_model.Qformer.bert.encoder.layer.8.output_query.LayerNorm.bias', 'model.stllm_model.Qformer.bert.encoder.layer.2.attention.self.query.bias', 'model.stllm_model.visual_encoder.blocks.7.mlp.fc2.weight', 'model.stllm_model.visual_encoder.blocks.17.attn.qkv.weight', 'model.stllm_model.down_proj.bias', 'model.stllm_model.Qformer.bert.encoder.layer.4.crossattention.output.dense.bias', 
'model.stllm_model.Qformer.bert.encoder.layer.10.intermediate.dense.weight', 'model.stllm_model.visual_encoder.blocks.0.mlp.fc2.bias', 'model.stllm_model.Qformer.bert.encoder.layer.4.output.LayerNorm.weight', 'model.stllm_model.visual_encoder.blocks.28.norm2.weight', 'model.stllm_model.visual_encoder.blocks.34.attn.v_bias', 'model.stllm_model.visual_encoder.blocks.34.attn.qkv.weight', 'model.stllm_model.Qformer.bert.encoder.layer.10.crossattention.self.value.weight', 'model.stllm_model.Qformer.bert.encoder.layer.0.attention.output.LayerNorm.bias', 'model.stllm_model.visual_encoder.blocks.6.norm2.weight', 'model.stllm_model.visual_encoder.blocks.21.attn.qkv.weight', 'model.stllm_model.Qformer.bert.encoder.layer.4.crossattention.self.query.bias', 'model.stllm_model.visual_encoder.blocks.23.mlp.fc1.weight', 'model.stllm_model.visual_encoder.blocks.3.attn.v_bias', 'model.stllm_model.Qformer.bert.encoder.layer.3.intermediate_query.dense.weight', 'model.stllm_model.Qformer.bert.encoder.layer.11.attention.self.key.bias', 'model.stllm_model.Qformer.bert.encoder.layer.0.attention.self.key.bias', 'model.stllm_model.Qformer.bert.encoder.layer.7.attention.self.value.weight', 'model.stllm_model.visual_encoder.blocks.2.mlp.fc2.weight', 'model.stllm_model.visual_encoder.blocks.27.mlp.fc2.bias', 'model.stllm_model.Qformer.bert.encoder.layer.2.crossattention.output.dense.bias', 'model.stllm_model.visual_encoder.blocks.0.mlp.fc1.weight', 'model.stllm_model.Qformer.bert.encoder.layer.0.attention.output.dense.weight', 'model.stllm_model.Qformer.bert.encoder.layer.6.attention.self.query.bias', 'model.stllm_model.visual_encoder.blocks.23.attn.proj.weight', 'model.stllm_model.visual_encoder.blocks.7.mlp.fc1.weight', 'model.stllm_model.visual_encoder.blocks.11.norm1.weight', 'model.stllm_model.Qformer.bert.encoder.layer.2.crossattention.self.value.bias', 'model.stllm_model.visual_encoder.blocks.31.attn.proj.weight', 'model.stllm_model.visual_encoder.blocks.22.norm1.weight', 'model.stllm_model.visual_encoder.blocks.11.attn.qkv.weight', 'model.stllm_model.Qformer.bert.encoder.layer.5.attention.self.value.weight', 'model.stllm_model.Qformer.bert.encoder.layer.2.crossattention.output.dense.weight', 'model.stllm_model.visual_encoder.blocks.8.attn.qkv.weight', 'model.stllm_model.Qformer.bert.encoder.layer.6.attention.self.key.bias', 'model.stllm_model.visual_encoder.blocks.37.norm2.weight', 'model.stllm_model.visual_encoder.blocks.28.attn.proj.bias', 'model.stllm_model.visual_encoder.blocks.14.attn.v_bias', 'model.stllm_model.visual_encoder.blocks.19.norm2.weight', 'model.stllm_model.visual_encoder.blocks.23.attn.proj.bias', 'model.stllm_model.visual_encoder.blocks.21.mlp.fc2.bias', 'model.stllm_model.visual_encoder.blocks.14.norm2.weight', 'model.stllm_model.visual_encoder.blocks.1.norm2.bias', 'model.stllm_model.visual_encoder.blocks.10.attn.v_bias', 'model.stllm_model.visual_encoder.blocks.16.mlp.fc2.weight', 'model.stllm_model.Qformer.bert.encoder.layer.0.attention.self.query.weight', 'model.stllm_model.Qformer.bert.encoder.layer.5.output.dense.bias', 'model.stllm_model.Qformer.bert.encoder.layer.8.crossattention.output.dense.bias', 'model.stllm_model.visual_encoder.blocks.1.norm2.weight', 'model.stllm_model.visual_encoder.blocks.23.mlp.fc2.weight', 'model.stllm_model.visual_encoder.blocks.38.mlp.fc1.weight', 'model.stllm_model.visual_encoder.blocks.30.attn.proj.bias', 'model.stllm_model.visual_encoder.blocks.30.norm2.weight', 'model.stllm_model.visual_encoder.blocks.8.mlp.fc1.weight', 
'model.stllm_model.visual_encoder.blocks.5.norm2.bias', 'model.stllm_model.visual_encoder.blocks.31.norm1.bias', 'model.stllm_model.visual_encoder.blocks.15.norm1.bias', 'model.stllm_model.visual_encoder.blocks.35.norm2.bias', 'model.stllm_model.visual_encoder.blocks.29.mlp.fc2.bias', 'model.stllm_model.visual_encoder.blocks.8.norm2.weight', 'model.stllm_model.Qformer.bert.encoder.layer.3.attention.self.key.bias', 'model.stllm_model.visual_encoder.blocks.38.attn.q_bias', 'model.stllm_model.Qformer.bert.encoder.layer.3.output_query.LayerNorm.bias', 'model.stllm_model.Qformer.bert.encoder.layer.6.intermediate_query.dense.bias', 'model.stllm_model.Qformer.bert.encoder.layer.8.crossattention.self.value.weight', 'model.stllm_model.visual_encoder.blocks.26.norm2.weight', 'model.stllm_model.Qformer.bert.encoder.layer.8.crossattention.output.dense.weight', 'model.stllm_model.visual_encoder.blocks.36.attn.v_bias', 'model.stllm_model.visual_encoder.blocks.21.attn.proj.weight', 'model.stllm_model.visual_encoder.blocks.0.attn.q_bias', 'model.stllm_model.visual_encoder.blocks.15.norm2.bias', 'model.stllm_model.Qformer.bert.encoder.layer.2.attention.self.key.bias', 'model.stllm_model.visual_encoder.blocks.37.attn.qkv.weight', 'model.stllm_model.visual_encoder.blocks.4.norm1.bias', 'model.stllm_model.visual_encoder.blocks.19.attn.proj.bias', 'model.stllm_model.Qformer.bert.encoder.layer.8.attention.self.query.bias', 'model.stllm_model.Qformer.bert.encoder.layer.7.output_query.dense.weight', 'model.stllm_model.Qformer.bert.encoder.layer.8.intermediate_query.dense.weight', 'model.stllm_model.visual_encoder.blocks.29.norm2.weight', 'model.stllm_model.visual_encoder.blocks.34.attn.q_bias', 'model.stllm_model.Qformer.bert.encoder.layer.6.attention.self.query.weight', 'model.stllm_model.Qformer.bert.encoder.layer.11.attention.output.LayerNorm.bias', 'model.stllm_model.visual_encoder.blocks.37.attn.proj.weight', 'model.stllm_model.Qformer.bert.encoder.layer.4.crossattention.self.key.bias', 'model.stllm_model.visual_encoder.blocks.25.mlp.fc1.bias', 'model.stllm_model.visual_encoder.blocks.25.norm2.bias', 'model.stllm_model.visual_encoder.blocks.8.norm2.bias', 'model.stllm_model.visual_encoder.blocks.16.attn.proj.weight', 'model.stllm_model.visual_encoder.blocks.23.mlp.fc2.bias', 'model.stllm_model.visual_encoder.blocks.13.attn.proj.weight', 'model.stllm_model.visual_encoder.blocks.16.attn.proj.bias', 'model.stllm_model.Qformer.bert.encoder.layer.7.attention.self.key.bias', 'model.stllm_model.visual_encoder.blocks.20.attn.proj.weight', 'model.stllm_model.visual_encoder.blocks.28.norm1.bias', 'model.stllm_model.visual_encoder.blocks.1.attn.q_bias', 'model.stllm_model.visual_encoder.blocks.12.attn.v_bias', 'model.stllm_model.Qformer.bert.encoder.layer.3.attention.self.key.weight', 'model.stllm_model.visual_encoder.blocks.30.mlp.fc1.weight', 'model.stllm_model.Qformer.bert.encoder.layer.7.output_query.LayerNorm.weight', 'model.stllm_model.visual_encoder.blocks.6.attn.v_bias', 'model.stllm_model.Qformer.bert.encoder.layer.0.output.LayerNorm.weight', 'model.stllm_model.visual_encoder.blocks.4.mlp.fc1.bias', 'model.stllm_model.visual_encoder.blocks.9.attn.q_bias', 'model.stllm_model.visual_encoder.blocks.14.attn.qkv.weight', 'model.stllm_model.Qformer.bert.encoder.layer.1.attention.output.dense.weight', 'model.stllm_model.visual_encoder.blocks.24.attn.proj.bias', 'model.stllm_model.visual_encoder.blocks.27.mlp.fc1.weight', 'model.stllm_model.visual_encoder.blocks.33.attn.proj.weight', 
'model.stllm_model.Qformer.bert.encoder.layer.11.intermediate.dense.bias', 'model.stllm_model.Qformer.bert.encoder.layer.3.intermediate.dense.weight', 'model.stllm_model.visual_encoder.blocks.21.attn.v_bias', 'model.stllm_model.Qformer.bert.encoder.layer.8.crossattention.self.key.bias', 'model.stllm_model.visual_encoder.blocks.24.attn.q_bias', 'model.stllm_model.visual_encoder.blocks.25.norm1.weight', 'model.stllm_model.Qformer.bert.encoder.layer.11.output_query.dense.weight', 'model.stllm_model.visual_encoder.blocks.0.norm1.bias', 'model.stllm_model.Qformer.bert.encoder.layer.0.output.dense.bias', 'model.stllm_model.Qformer.bert.encoder.layer.8.crossattention.output.LayerNorm.weight', 'model.stllm_model.visual_encoder.blocks.10.attn.q_bias', 'model.stllm_model.Qformer.bert.encoder.layer.0.crossattention.self.key.weight', 'model.stllm_model.Qformer.bert.encoder.layer.2.crossattention.output.LayerNorm.bias', 'model.stllm_model.visual_encoder.blocks.38.mlp.fc1.bias', 'model.stllm_model.visual_encoder.blocks.26.mlp.fc2.bias', 'model.stllm_model.Qformer.bert.encoder.layer.5.output_query.LayerNorm.weight', 'model.stllm_model.visual_encoder.blocks.12.mlp.fc1.bias', 'model.stllm_model.Qformer.bert.encoder.layer.0.output_query.LayerNorm.bias', 'model.stllm_model.visual_encoder.blocks.32.attn.proj.bias', 'model.stllm_model.visual_encoder.blocks.2.attn.v_bias', 'model.stllm_model.visual_encoder.blocks.32.attn.qkv.weight', 'model.stllm_model.visual_encoder.blocks.14.mlp.fc2.bias', 'model.stllm_model.visual_encoder.blocks.4.mlp.fc2.bias', 'model.stllm_model.Qformer.bert.encoder.layer.2.attention.self.value.bias', 'model.stllm_model.Qformer.bert.encoder.layer.2.attention.output.LayerNorm.weight', 'model.stllm_model.visual_encoder.blocks.35.norm2.weight', 'model.stllm_model.visual_encoder.patch_embed.proj.weight', 'model.stllm_model.visual_encoder.blocks.36.norm1.bias', 'model.stllm_model.visual_encoder.blocks.14.mlp.fc2.weight', 'model.stllm_model.visual_encoder.blocks.27.attn.proj.weight', 'model.stllm_model.visual_encoder.blocks.0.attn.proj.bias', 'model.stllm_model.Qformer.bert.encoder.layer.8.attention.output.dense.weight', 'model.stllm_model.visual_encoder.blocks.15.attn.qkv.weight', 'model.stllm_model.visual_encoder.blocks.18.mlp.fc1.bias', 'model.stllm_model.Qformer.bert.encoder.layer.0.crossattention.output.dense.weight', 'model.stllm_model.Qformer.bert.encoder.layer.7.attention.output.LayerNorm.weight', 'model.stllm_model.visual_encoder.blocks.1.norm1.bias', 'model.stllm_model.visual_encoder.blocks.0.mlp.fc1.bias', 'model.stllm_model.Qformer.bert.encoder.layer.10.intermediate_query.dense.bias', 'model.stllm_model.visual_encoder.blocks.15.norm2.weight', 'model.stllm_model.Qformer.bert.encoder.layer.5.intermediate_query.dense.bias', 'model.stllm_model.visual_encoder.blocks.24.mlp.fc1.bias', 'model.stllm_model.Qformer.bert.encoder.layer.6.output.LayerNorm.weight', 'model.stllm_model.visual_encoder.blocks.20.mlp.fc1.weight', 'model.stllm_model.visual_encoder.blocks.26.mlp.fc1.weight', 'model.stllm_model.Qformer.bert.encoder.layer.10.crossattention.self.key.weight', 'model.stllm_model.Qformer.bert.encoder.layer.3.output.dense.bias', 'model.stllm_model.Qformer.bert.encoder.layer.2.attention.output.dense.weight', 'model.stllm_model.Qformer.bert.encoder.layer.3.attention.output.LayerNorm.bias', 'model.stllm_model.Qformer.bert.encoder.layer.7.attention.self.query.weight', 'model.stllm_model.visual_encoder.blocks.1.mlp.fc2.weight', 'model.stllm_model.visual_encoder.blocks.11.attn.v_bias', 
'model.stllm_model.visual_encoder.blocks.15.norm1.weight', 'model.stllm_model.visual_encoder.blocks.32.mlp.fc2.weight', 'model.stllm_model.visual_encoder.blocks.23.attn.q_bias', 'model.stllm_model.Qformer.bert.encoder.layer.0.crossattention.self.query.weight', 'model.stllm_model.visual_encoder.blocks.14.norm2.bias', 'model.stllm_model.visual_encoder.blocks.16.attn.qkv.weight', 'model.stllm_model.visual_encoder.blocks.13.norm1.bias', 'model.stllm_model.visual_encoder.blocks.2.attn.proj.weight', 'model.stllm_model.mvm_decoder.norm.bias', 'model.stllm_model.visual_encoder.blocks.37.attn.v_bias', 'model.stllm_model.visual_encoder.blocks.25.attn.qkv.weight', 'model.stllm_model.visual_encoder.blocks.20.attn.qkv.weight', 'model.stllm_model.Qformer.bert.encoder.layer.2.output.LayerNorm.bias', 'model.stllm_model.visual_encoder.blocks.35.attn.v_bias', 'model.stllm_model.visual_encoder.blocks.9.norm1.bias', 'model.stllm_model.visual_encoder.blocks.36.attn.q_bias', 'model.stllm_model.visual_encoder.blocks.14.norm1.bias', 'model.stllm_model.visual_encoder.blocks.23.norm2.weight', 'model.stllm_model.Qformer.bert.encoder.layer.8.crossattention.self.query.weight', 'model.stllm_model.visual_encoder.blocks.21.attn.proj.bias', 'model.stllm_model.visual_encoder.blocks.13.mlp.fc2.bias', 'model.stllm_model.Qformer.bert.encoder.layer.2.crossattention.self.value.weight', 'model.stllm_model.Qformer.bert.encoder.layer.0.crossattention.self.value.bias', 'model.stllm_model.Qformer.bert.encoder.layer.0.attention.self.value.bias', 'model.stllm_model.visual_encoder.blocks.1.attn.qkv.weight', 'model.stllm_model.Qformer.bert.encoder.layer.10.intermediate_query.dense.weight', 'model.stllm_model.Qformer.bert.encoder.layer.5.output_query.dense.bias', 'model.stllm_model.visual_encoder.blocks.2.attn.q_bias', 'model.stllm_model.visual_encoder.blocks.38.norm2.weight', 'model.stllm_model.Qformer.bert.encoder.layer.2.output.dense.bias', 'model.stllm_model.Qformer.bert.encoder.layer.7.output.LayerNorm.weight', 'model.stllm_model.Qformer.bert.encoder.layer.5.output.dense.weight', 'model.stllm_model.Qformer.bert.encoder.layer.2.output_query.dense.bias', 'model.stllm_model.visual_encoder.blocks.18.norm2.weight', 'model.stllm_model.Qformer.bert.encoder.layer.3.output_query.dense.bias', 'model.stllm_model.Qformer.bert.encoder.layer.10.output.dense.bias', 'model.stllm_model.Qformer.bert.encoder.layer.6.crossattention.self.query.bias', 'model.stllm_model.visual_encoder.blocks.2.mlp.fc1.weight', 'model.stllm_model.visual_encoder.blocks.13.attn.qkv.weight', 'model.stllm_model.Qformer.bert.encoder.layer.6.attention.self.value.bias', 'model.stllm_model.Qformer.bert.encoder.layer.6.attention.self.value.weight', 'model.stllm_model.visual_encoder.blocks.7.attn.proj.bias', 'model.stllm_model.Qformer.bert.encoder.layer.5.attention.self.query.weight', 'model.stllm_model.visual_encoder.blocks.1.mlp.fc2.bias', 'model.stllm_model.visual_encoder.blocks.14.norm1.weight', 'model.stllm_model.Qformer.bert.encoder.layer.6.crossattention.self.value.weight', 'model.stllm_model.Qformer.bert.encoder.layer.11.attention.output.dense.bias', 'model.stllm_model.visual_encoder.blocks.34.mlp.fc2.bias', 'model.stllm_model.visual_encoder.blocks.37.mlp.fc1.bias', 'model.stllm_model.visual_encoder.blocks.29.norm2.bias', 'model.stllm_model.Qformer.bert.encoder.layer.0.crossattention.self.query.bias', 'model.stllm_model.Qformer.bert.encoder.layer.1.intermediate_query.dense.weight', 'model.stllm_model.visual_encoder.blocks.21.norm2.bias', 
'model.stllm_model.Qformer.bert.encoder.layer.11.attention.output.dense.weight', 'model.stllm_model.visual_encoder.blocks.7.attn.qkv.weight', 'model.stllm_model.Qformer.bert.encoder.layer.7.output.LayerNorm.bias', 'model.stllm_model.Qformer.bert.encoder.layer.9.attention.self.value.weight', 'model.stllm_model.Qformer.bert.encoder.layer.10.attention.self.query.bias', 'model.stllm_model.visual_encoder.blocks.29.mlp.fc2.weight', 'model.stllm_model.Qformer.bert.encoder.layer.0.output.dense.weight', 'model.stllm_model.visual_encoder.blocks.5.attn.proj.weight', 'model.stllm_model.Qformer.bert.encoder.layer.8.intermediate.dense.weight', 'model.stllm_model.visual_encoder.blocks.32.mlp.fc1.weight', 'model.stllm_model.visual_encoder.blocks.5.attn.v_bias', 'model.stllm_model.visual_encoder.blocks.1.attn.proj.weight', 'model.stllm_model.visual_encoder.blocks.8.norm1.bias', 'model.stllm_model.Qformer.bert.encoder.layer.1.output.LayerNorm.weight', 'model.stllm_model.visual_encoder.blocks.38.mlp.fc2.weight', 'model.stllm_model.visual_encoder.blocks.13.norm2.weight', 'model.stllm_model.visual_encoder.blocks.19.attn.v_bias', 'model.stllm_model.visual_encoder.blocks.27.norm2.bias', 'model.stllm_model.Qformer.bert.encoder.layer.11.attention.self.value.weight', 'model.stllm_model.visual_encoder.blocks.36.norm2.bias', 'model.stllm_model.Qformer.bert.encoder.layer.11.attention.output.LayerNorm.weight', 'model.stllm_model.visual_encoder.blocks.36.attn.proj.weight', 'model.stllm_model.Qformer.bert.encoder.layer.5.attention.output.LayerNorm.bias', 'model.stllm_model.Qformer.bert.embeddings.position_embeddings.weight', 'model.stllm_model.Qformer.bert.encoder.layer.9.output_query.dense.bias', 'model.stllm_model.ln_vision.weight', 'model.stllm_model.visual_encoder.blocks.3.norm1.bias', 'model.stllm_model.Qformer.bert.encoder.layer.3.attention.output.dense.weight', 'model.stllm_model.Qformer.bert.encoder.layer.6.crossattention.output.LayerNorm.weight', 'model.stllm_model.visual_encoder.blocks.34.attn.proj.bias', 'model.stllm_model.Qformer.bert.encoder.layer.0.intermediate_query.dense.bias', 'model.stllm_model.visual_encoder.blocks.35.attn.proj.weight', 'model.stllm_model.visual_encoder.blocks.31.attn.q_bias', 'model.stllm_model.visual_encoder.blocks.3.attn.q_bias', 'model.stllm_model.visual_encoder.blocks.34.mlp.fc2.weight', 'model.stllm_model.Qformer.bert.encoder.layer.4.output.dense.weight', 'model.stllm_model.Qformer.bert.encoder.layer.7.intermediate.dense.weight', 'model.stllm_model.visual_encoder.blocks.2.attn.proj.bias', 'model.stllm_model.Qformer.bert.encoder.layer.9.intermediate.dense.bias', 'model.stllm_model.Qformer.bert.encoder.layer.8.attention.self.key.weight', 'model.stllm_model.visual_encoder.blocks.6.mlp.fc1.bias', 'model.stllm_model.Qformer.bert.encoder.layer.7.intermediate_query.dense.weight', 'model.stllm_model.visual_encoder.blocks.34.norm2.weight', 'model.stllm_model.Qformer.bert.encoder.layer.7.output_query.LayerNorm.bias', 'model.stllm_model.visual_encoder.blocks.8.mlp.fc2.weight', 'model.stllm_model.visual_encoder.blocks.12.norm1.bias', 'model.stllm_model.llama_proj.weight', 'model.stllm_model.Qformer.bert.encoder.layer.0.crossattention.self.key.bias', 'model.stllm_model.visual_encoder.blocks.37.attn.q_bias', 'model.stllm_model.visual_encoder.blocks.30.attn.proj.weight', 'model.stllm_model.Qformer.bert.encoder.layer.5.attention.output.dense.weight', 'model.stllm_model.visual_encoder.blocks.6.mlp.fc2.weight', 'model.stllm_model.visual_encoder.blocks.17.mlp.fc2.bias', 
'model.stllm_model.Qformer.bert.encoder.layer.4.attention.self.query.bias', 'model.stllm_model.visual_encoder.blocks.8.attn.proj.weight', 'model.stllm_model.visual_encoder.blocks.10.attn.proj.weight', 'model.stllm_model.query_tokens', 'model.stllm_model.visual_encoder.blocks.17.norm1.bias', 'model.stllm_model.visual_encoder.blocks.18.norm2.bias', 'model.stllm_model.visual_encoder.blocks.27.mlp.fc2.weight', 'model.stllm_model.visual_encoder.blocks.16.mlp.fc1.bias', 'model.stllm_model.visual_encoder.blocks.31.attn.proj.bias', 'model.stllm_model.visual_encoder.blocks.4.attn.qkv.weight', 'model.stllm_model.visual_encoder.blocks.4.mlp.fc1.weight', 'model.stllm_model.Qformer.bert.encoder.layer.3.intermediate.dense.bias', 'model.stllm_model.Qformer.bert.encoder.layer.9.attention.self.key.weight', 'model.stllm_model.visual_encoder.blocks.38.attn.proj.bias', 'model.stllm_model.visual_encoder.blocks.35.attn.q_bias', 'model.stllm_model.mvm_decoder.norm.weight', 'model.stllm_model.ln_vision.bias', 'model.stllm_model.visual_encoder.blocks.17.norm2.weight', 'model.stllm_model.visual_encoder.blocks.19.attn.qkv.weight', 'model.stllm_model.visual_encoder.blocks.30.mlp.fc2.bias', 'model.stllm_model.visual_encoder.blocks.31.mlp.fc1.weight', 'model.stllm_model.visual_encoder.patch_embed.proj.bias', 'model.stllm_model.Qformer.bert.encoder.layer.1.output_query.LayerNorm.bias', 'model.stllm_model.visual_encoder.blocks.8.attn.q_bias', 'model.stllm_model.Qformer.bert.encoder.layer.6.crossattention.self.query.weight', 'model.stllm_model.visual_encoder.blocks.3.attn.proj.weight', 'model.stllm_model.visual_encoder.blocks.36.mlp.fc1.bias', 'model.stllm_model.Qformer.bert.encoder.layer.4.crossattention.self.value.weight', 'model.stllm_model.visual_encoder.blocks.18.attn.proj.weight', 'model.stllm_model.Qformer.bert.encoder.layer.10.attention.self.key.bias', 'model.stllm_model.visual_encoder.blocks.12.mlp.fc1.weight', 'model.stllm_model.Qformer.bert.encoder.layer.11.output_query.LayerNorm.weight', 'model.stllm_model.Qformer.bert.encoder.layer.8.output_query.LayerNorm.weight', 'model.stllm_model.Qformer.bert.encoder.layer.10.crossattention.self.query.weight', 'model.stllm_model.Qformer.bert.encoder.layer.8.output_query.dense.bias', 'model.stllm_model.visual_encoder.blocks.28.mlp.fc1.weight', 'model.stllm_model.visual_encoder.blocks.33.attn.v_bias', 'model.stllm_model.visual_encoder.blocks.7.mlp.fc2.bias', 'model.stllm_model.visual_encoder.blocks.13.attn.proj.bias', 'model.stllm_model.Qformer.bert.encoder.layer.9.attention.self.key.bias', 'model.stllm_model.llama_proj.bias', 'model.stllm_model.Qformer.bert.encoder.layer.9.output.LayerNorm.bias', 'model.stllm_model.visual_encoder.blocks.29.attn.qkv.weight', 'model.stllm_model.Qformer.bert.encoder.layer.9.intermediate_query.dense.weight', 'model.stllm_model.visual_encoder.blocks.32.attn.proj.weight', 'model.stllm_model.visual_encoder.blocks.7.norm2.bias', 'model.stllm_model.Qformer.bert.encoder.layer.11.output.dense.weight', 'model.stllm_model.Qformer.bert.encoder.layer.10.attention.output.dense.bias', 'model.stllm_model.visual_encoder.blocks.8.mlp.fc1.bias', 'model.stllm_model.Qformer.bert.encoder.layer.3.attention.output.dense.bias', 'model.stllm_model.visual_encoder.blocks.22.norm2.bias', 'model.stllm_model.Qformer.bert.encoder.layer.8.crossattention.self.query.bias', 'model.stllm_model.visual_encoder.blocks.6.attn.proj.bias', 'model.stllm_model.visual_encoder.blocks.10.mlp.fc2.weight', 'model.stllm_model.Qformer.bert.encoder.layer.3.output.LayerNorm.bias', 
'model.stllm_model.visual_encoder.blocks.11.norm2.weight', 'model.stllm_model.Qformer.bert.encoder.layer.3.attention.self.value.bias', 'model.stllm_model.visual_encoder.blocks.29.attn.proj.bias', 'model.stllm_model.Qformer.bert.encoder.layer.1.output_query.dense.weight', 'model.stllm_model.visual_encoder.blocks.11.attn.proj.weight', 'model.stllm_model.visual_encoder.blocks.28.mlp.fc2.weight', 'model.stllm_model.Qformer.bert.encoder.layer.5.intermediate.dense.bias', 'model.stllm_model.Qformer.bert.encoder.layer.0.intermediate_query.dense.weight', 'model.stllm_model.visual_encoder.blocks.28.attn.q_bias', 'model.stllm_model.visual_encoder.blocks.33.norm1.weight', 'model.stllm_model.Qformer.bert.encoder.layer.1.output.LayerNorm.bias', 'model.stllm_model.visual_encoder.blocks.35.attn.qkv.weight', 'model.stllm_model.Qformer.bert.encoder.layer.1.intermediate_query.dense.bias', 'model.stllm_model.Qformer.bert.encoder.layer.4.output.LayerNorm.bias', 'model.stllm_model.Qformer.bert.encoder.layer.1.attention.output.LayerNorm.weight', 'model.stllm_model.visual_encoder.blocks.37.mlp.fc2.bias', 'model.stllm_model.Qformer.bert.encoder.layer.11.output_query.LayerNorm.bias', 'model.stllm_model.visual_encoder.blocks.12.norm2.bias', 'model.stllm_model.visual_encoder.blocks.5.mlp.fc1.bias', 'model.stllm_model.visual_encoder.blocks.9.attn.v_bias', 'model.stllm_model.visual_encoder.blocks.30.norm2.bias', 'model.stllm_model.Qformer.bert.encoder.layer.0.crossattention.output.dense.bias', 'model.stllm_model.visual_encoder.blocks.22.attn.proj.weight', 'model.stllm_model.Qformer.bert.encoder.layer.1.output_query.dense.bias', 'model.stllm_model.visual_encoder.blocks.15.attn.q_bias', 'model.stllm_model.visual_encoder.blocks.22.attn.v_bias', 'model.stllm_model.Qformer.bert.encoder.layer.4.output_query.dense.bias', 'model.stllm_model.Qformer.bert.encoder.layer.10.output.LayerNorm.bias', 'model.stllm_model.Qformer.bert.encoder.layer.11.output.LayerNorm.weight', 'model.stllm_model.visual_encoder.blocks.30.attn.v_bias', 'model.stllm_model.visual_encoder.blocks.29.norm1.bias', 'model.stllm_model.Qformer.bert.encoder.layer.2.output.LayerNorm.weight', 'model.stllm_model.Qformer.bert.encoder.layer.5.intermediate_query.dense.weight', 'model.stllm_model.Qformer.bert.encoder.layer.1.attention.self.query.weight', 'model.stllm_model.visual_encoder.blocks.25.norm1.bias', 'model.stllm_model.Qformer.bert.encoder.layer.8.intermediate.dense.bias', 'model.stllm_model.Qformer.bert.encoder.layer.4.output_query.LayerNorm.weight', 'model.stllm_model.visual_encoder.blocks.22.mlp.fc2.weight', 'model.stllm_model.Qformer.bert.encoder.layer.4.intermediate_query.dense.weight', 'model.stllm_model.Qformer.bert.encoder.layer.6.output.dense.bias', 'model.stllm_model.Qformer.bert.encoder.layer.10.output.dense.weight', 'model.stllm_model.visual_encoder.blocks.13.mlp.fc2.weight', 'model.stllm_model.visual_encoder.blocks.31.norm1.weight', 'model.stllm_model.down_proj.weight', 'model.stllm_model.visual_encoder.blocks.13.attn.v_bias', 'model.stllm_model.visual_encoder.blocks.35.norm1.weight', 'model.stllm_model.visual_encoder.blocks.1.attn.v_bias', 'model.stllm_model.visual_encoder.blocks.37.norm2.bias', 'model.stllm_model.visual_encoder.blocks.35.mlp.fc1.bias', 'model.stllm_model.visual_encoder.blocks.6.attn.qkv.weight', 'model.stllm_model.visual_encoder.blocks.37.norm1.bias', 'model.stllm_model.visual_encoder.blocks.9.mlp.fc1.bias', 'model.stllm_model.visual_encoder.blocks.28.mlp.fc1.bias', 'model.stllm_model.visual_encoder.blocks.35.mlp.fc1.weight', 
'model.stllm_model.visual_encoder.blocks.35.attn.proj.bias', 'model.stllm_model.Qformer.bert.encoder.layer.6.output.dense.weight', 'model.stllm_model.visual_encoder.blocks.26.attn.q_bias', 'model.stllm_model.visual_encoder.blocks.11.mlp.fc2.bias', 'model.stllm_model.visual_encoder.blocks.30.norm1.weight', 'model.stllm_model.Qformer.bert.encoder.layer.10.crossattention.output.dense.bias', 'model.stllm_model.visual_encoder.blocks.34.norm1.weight', 'model.stllm_model.visual_encoder.blocks.36.mlp.fc2.bias', 'model.stllm_model.Qformer.bert.encoder.layer.9.attention.output.dense.bias', 'model.stllm_model.visual_encoder.blocks.24.norm1.weight', 'model.stllm_model.Qformer.bert.encoder.layer.10.attention.output.LayerNorm.weight', 'model.stllm_model.visual_encoder.blocks.12.mlp.fc2.bias', 'model.stllm_model.Qformer.bert.encoder.layer.11.output_query.dense.bias', 'model.stllm_model.Qformer.bert.encoder.layer.7.attention.output.dense.weight', 'model.stllm_model.Qformer.bert.encoder.layer.3.attention.output.LayerNorm.weight', 'model.stllm_model.visual_encoder.blocks.3.attn.qkv.weight', 'model.stllm_model.visual_encoder.blocks.1.norm1.weight', 'model.stllm_model.visual_encoder.blocks.35.mlp.fc2.bias', 'model.stllm_model.visual_encoder.blocks.9.attn.proj.bias', 'model.stllm_model.visual_encoder.blocks.23.mlp.fc1.bias', 'model.stllm_model.Qformer.bert.encoder.layer.9.intermediate.dense.weight', 'model.stllm_model.visual_encoder.blocks.22.mlp.fc2.bias', 'model.stllm_model.visual_encoder.blocks.9.mlp.fc2.weight', 'model.stllm_model.Qformer.bert.encoder.layer.3.output_query.LayerNorm.weight', 'model.stllm_model.visual_encoder.blocks.31.norm2.weight', 'model.stllm_model.Qformer.bert.encoder.layer.6.crossattention.self.value.bias', 'model.stllm_model.Qformer.bert.encoder.layer.1.attention.self.key.bias', 'model.stllm_model.visual_encoder.blocks.1.attn.proj.bias', 'model.stllm_model.visual_encoder.blocks.27.attn.q_bias', 'model.stllm_model.Qformer.bert.encoder.layer.6.crossattention.self.key.weight', 'model.stllm_model.visual_encoder.blocks.20.mlp.fc2.weight', 'model.stllm_model.residual_index', 'model.stllm_model.visual_encoder.blocks.19.norm1.bias', 'model.stllm_model.visual_encoder.blocks.13.norm1.weight', 'model.stllm_model.Qformer.bert.encoder.layer.1.attention.self.key.weight', 'model.stllm_model.visual_encoder.blocks.18.norm1.bias', 'model.stllm_model.visual_encoder.blocks.20.norm2.weight', 'model.stllm_model.visual_encoder.blocks.38.attn.v_bias', 'model.stllm_model.Qformer.bert.embeddings.LayerNorm.bias', 'model.stllm_model.visual_encoder.blocks.30.attn.qkv.weight', 'model.stllm_model.Qformer.bert.encoder.layer.10.crossattention.output.LayerNorm.bias', 'model.stllm_model.visual_encoder.blocks.19.mlp.fc2.bias', 'model.stllm_model.visual_encoder.blocks.7.attn.q_bias', 'model.stllm_model.visual_encoder.blocks.21.norm1.weight', 'model.stllm_model.visual_encoder.blocks.30.mlp.fc2.weight', 'model.stllm_model.Qformer.bert.encoder.layer.2.intermediate_query.dense.weight', 'model.stllm_model.visual_encoder.blocks.19.attn.q_bias', 'model.stllm_model.Qformer.bert.encoder.layer.9.attention.output.dense.weight', 'model.stllm_model.Qformer.bert.encoder.layer.4.output.dense.bias', 'model.stllm_model.visual_encoder.blocks.26.attn.v_bias', 'model.stllm_model.Qformer.bert.encoder.layer.10.output.LayerNorm.weight', 'model.stllm_model.visual_encoder.blocks.21.norm1.bias', 'model.stllm_model.visual_encoder.blocks.13.norm2.bias', 'model.stllm_model.visual_encoder.blocks.13.mlp.fc1.weight', 
'model.stllm_model.visual_encoder.blocks.21.attn.q_bias', 'model.stllm_model.visual_encoder.blocks.16.norm2.bias', 'model.stllm_model.visual_encoder.blocks.18.attn.qkv.weight', 'model.stllm_model.visual_encoder.blocks.31.attn.v_bias', 'model.stllm_model.Qformer.bert.encoder.layer.4.crossattention.self.query.weight', 'model.stllm_model.Qformer.bert.encoder.layer.8.output.dense.weight', 'model.stllm_model.visual_encoder.blocks.18.mlp.fc1.weight', 'model.stllm_model.Qformer.bert.encoder.layer.10.crossattention.output.dense.weight', 'model.stllm_model.visual_encoder.blocks.36.mlp.fc2.weight', 'model.stllm_model.Qformer.bert.encoder.layer.6.attention.self.key.weight', 'model.stllm_model.Qformer.bert.encoder.layer.0.output_query.dense.weight', 'model.stllm_model.visual_encoder.blocks.32.norm1.bias', 'model.stllm_model.visual_encoder.blocks.22.attn.qkv.weight', 'model.stllm_model.visual_encoder.blocks.27.norm2.weight', 'model.stllm_model.visual_encoder.blocks.34.mlp.fc1.weight', 'model.stllm_model.visual_encoder.blocks.5.norm1.bias', 'model.stllm_model.visual_encoder.blocks.32.attn.q_bias', 'model.stllm_model.visual_encoder.blocks.8.attn.v_bias', 'model.stllm_model.Qformer.bert.encoder.layer.5.intermediate.dense.weight', 'model.stllm_model.visual_encoder.blocks.18.attn.q_bias', 'model.stllm_model.Qformer.bert.encoder.layer.0.attention.self.query.bias', 'model.stllm_model.Qformer.bert.encoder.layer.10.crossattention.output.LayerNorm.weight', 'model.stllm_model.visual_encoder.blocks.15.mlp.fc1.weight', 'model.stllm_model.visual_encoder.blocks.10.attn.qkv.weight', 'model.stllm_model.visual_encoder.blocks.11.norm1.bias', 'model.stllm_model.visual_encoder.blocks.26.norm1.weight', 'model.stllm_model.visual_encoder.blocks.20.mlp.fc1.bias', 'model.stllm_model.visual_encoder.blocks.33.norm2.weight', 'model.stllm_model.visual_encoder.blocks.38.attn.proj.weight', 'model.stllm_model.visual_encoder.blocks.32.mlp.fc1.bias', 'model.stllm_model.visual_encoder.blocks.16.mlp.fc2.bias', 'model.stllm_model.visual_encoder.blocks.30.norm1.bias', 'model.stllm_model.visual_encoder.blocks.28.attn.proj.weight', 'model.stllm_model.visual_encoder.blocks.10.norm1.weight', 'model.stllm_model.visual_encoder.blocks.36.attn.qkv.weight', 'model.stllm_model.up_proj.bias', 'model.stllm_model.visual_encoder.blocks.10.norm2.weight', 'model.stllm_model.visual_encoder.blocks.38.norm2.bias', 'model.stllm_model.Qformer.bert.encoder.layer.5.output_query.LayerNorm.bias', 'model.stllm_model.visual_encoder.blocks.5.attn.qkv.weight', 'model.stllm_model.visual_encoder.blocks.6.attn.q_bias', 'model.stllm_model.visual_encoder.blocks.29.attn.q_bias', 'model.stllm_model.visual_encoder.blocks.17.mlp.fc1.bias', 'model.stllm_model.visual_encoder.blocks.5.norm1.weight', 'model.stllm_model.Qformer.bert.encoder.layer.0.intermediate.dense.bias', 'model.stllm_model.Qformer.bert.encoder.layer.10.output_query.dense.bias', 'model.stllm_model.visual_encoder.blocks.10.norm1.bias', 'model.stllm_model.Qformer.bert.encoder.layer.4.crossattention.output.LayerNorm.weight', 'model.stllm_model.Qformer.bert.encoder.layer.6.crossattention.output.dense.weight', 'model.stllm_model.Qformer.bert.encoder.layer.0.output_query.dense.bias', 'model.stllm_model.Qformer.bert.encoder.layer.1.attention.output.dense.bias', 'model.stllm_model.Qformer.bert.encoder.layer.3.intermediate_query.dense.bias', 'model.stllm_model.Qformer.bert.encoder.layer.6.attention.output.dense.bias', 'model.stllm_model.visual_encoder.blocks.0.attn.proj.weight', 
'model.stllm_model.Qformer.bert.encoder.layer.1.output.dense.bias', 'model.stllm_model.Qformer.bert.embeddings.LayerNorm.weight', 'model.stllm_model.visual_encoder.blocks.17.attn.proj.weight', 'model.stllm_model.visual_encoder.blocks.3.mlp.fc2.weight', 'model.stllm_model.visual_encoder.blocks.32.norm2.bias', 'model.stllm_model.visual_encoder.blocks.34.attn.proj.weight', 'model.stllm_model.embed_tokens.weight', 'model.stllm_model.visual_encoder.blocks.9.mlp.fc2.bias', 'model.stllm_model.visual_encoder.blocks.21.mlp.fc1.weight', 'model.stllm_model.visual_encoder.blocks.27.attn.v_bias', 'model.stllm_model.Qformer.bert.encoder.layer.2.crossattention.output.LayerNorm.weight', 'model.stllm_model.Qformer.bert.encoder.layer.2.crossattention.self.query.bias', 'model.stllm_model.visual_encoder.blocks.32.norm1.weight', 'model.stllm_model.Qformer.bert.encoder.layer.6.output_query.LayerNorm.bias', 'model.stllm_model.visual_encoder.blocks.4.attn.q_bias', 'model.stllm_model.visual_encoder.blocks.12.attn.proj.weight', 'model.stllm_model.visual_encoder.blocks.18.attn.v_bias', 'model.stllm_model.visual_encoder.blocks.22.norm2.weight', 'model.stllm_model.Qformer.bert.encoder.layer.11.intermediate_query.dense.bias', 'model.stllm_model.visual_encoder.blocks.30.mlp.fc1.bias', 'model.stllm_model.Qformer.bert.encoder.layer.2.output_query.dense.weight', 'model.stllm_model.visual_encoder.blocks.4.norm1.weight', 'model.stllm_model.Qformer.bert.encoder.layer.2.output_query.LayerNorm.weight', 'model.stllm_model.Qformer.bert.encoder.layer.8.crossattention.self.key.weight', 'model.stllm_model.visual_encoder.blocks.15.attn.proj.bias', 'model.stllm_model.Qformer.bert.encoder.layer.9.attention.output.LayerNorm.bias', 'model.stllm_model.Qformer.bert.encoder.layer.11.intermediate.dense.weight', 'model.stllm_model.visual_encoder.blocks.28.norm2.bias', 'model.stllm_model.Qformer.bert.encoder.layer.7.attention.self.query.bias', 'model.stllm_model.Qformer.bert.encoder.layer.1.output_query.LayerNorm.weight', 'model.stllm_model.Qformer.bert.encoder.layer.2.attention.output.LayerNorm.bias', 'model.stllm_model.Qformer.bert.encoder.layer.9.attention.output.LayerNorm.weight', 'model.stllm_model.visual_encoder.blocks.2.norm1.weight', 'model.stllm_model.visual_encoder.blocks.17.attn.q_bias', 'model.stllm_model.visual_encoder.blocks.36.norm1.weight', 'model.stllm_model.Qformer.bert.encoder.layer.5.attention.self.key.bias', 'model.stllm_model.visual_encoder.blocks.5.mlp.fc2.weight', 'model.stllm_model.visual_encoder.blocks.23.norm2.bias', 'model.stllm_model.visual_encoder.blocks.31.mlp.fc2.bias', 'model.stllm_model.Qformer.bert.encoder.layer.3.output.dense.weight', 'model.stllm_model.Qformer.bert.encoder.layer.3.output_query.dense.weight', 'model.stllm_model.Qformer.bert.encoder.layer.8.attention.self.value.bias', 'model.stllm_model.visual_encoder.blocks.27.norm1.bias', 'model.stllm_model.Qformer.bert.encoder.layer.0.crossattention.output.LayerNorm.bias', 'model.stllm_model.visual_encoder.blocks.11.mlp.fc2.weight', 'model.stllm_model.Qformer.bert.encoder.layer.9.output.LayerNorm.weight', 'model.stllm_model.visual_encoder.blocks.0.attn.v_bias', 'model.stllm_model.Qformer.bert.encoder.layer.7.intermediate.dense.bias', 'model.stllm_model.visual_encoder.blocks.21.mlp.fc2.weight', 'model.stllm_model.Qformer.bert.encoder.layer.4.intermediate.dense.weight', 'model.stllm_model.visual_encoder.blocks.33.mlp.fc2.bias', 'model.stllm_model.Qformer.bert.encoder.layer.1.intermediate.dense.weight', 
'model.stllm_model.visual_encoder.blocks.15.mlp.fc2.bias', 'model.stllm_model.Qformer.bert.encoder.layer.4.crossattention.self.value.bias', 'model.stllm_model.visual_encoder.blocks.20.mlp.fc2.bias', 'model.stllm_model.visual_encoder.blocks.6.norm2.bias', 'model.stllm_model.visual_encoder.blocks.0.attn.qkv.weight', 'model.stllm_model.visual_encoder.blocks.1.mlp.fc1.weight', 'model.stllm_model.Qformer.bert.encoder.layer.2.intermediate.dense.bias', 'model.stllm_model.Qformer.bert.encoder.layer.9.intermediate_query.dense.bias', 'model.stllm_model.visual_encoder.blocks.9.attn.proj.weight', 'model.stllm_model.Qformer.bert.encoder.layer.9.output_query.LayerNorm.bias', 'model.stllm_model.visual_encoder.blocks.2.mlp.fc1.bias', 'model.stllm_model.Qformer.bert.encoder.layer.8.attention.self.key.bias', 'model.stllm_model.Qformer.bert.encoder.layer.6.output.LayerNorm.bias', 'model.stllm_model.visual_encoder.blocks.4.norm2.weight', 'model.stllm_model.Qformer.bert.encoder.layer.7.attention.self.key.weight', 'model.stllm_model.visual_encoder.blocks.31.mlp.fc2.weight', 'model.stllm_model.Qformer.bert.encoder.layer.11.output.dense.bias', 'model.stllm_model.visual_encoder.blocks.22.attn.q_bias', 'model.stllm_model.visual_encoder.blocks.24.norm2.bias', 'model.stllm_model.visual_encoder.blocks.14.attn.q_bias', 'model.stllm_model.Qformer.bert.encoder.layer.9.attention.self.value.bias', 'model.stllm_model.visual_encoder.blocks.22.mlp.fc1.weight', 'model.stllm_model.visual_encoder.blocks.2.norm1.bias', 'model.stllm_model.visual_encoder.blocks.16.attn.q_bias', 'model.stllm_model.visual_encoder.blocks.14.mlp.fc1.weight', 'model.stllm_model.Qformer.bert.encoder.layer.10.output_query.LayerNorm.bias', 'model.stllm_model.Qformer.bert.encoder.layer.8.attention.output.LayerNorm.weight', 'model.stllm_model.visual_encoder.blocks.21.norm2.weight', 'model.stllm_model.visual_encoder.blocks.38.norm1.bias', 'model.stllm_model.visual_encoder.blocks.18.norm1.weight', 'model.stllm_model.visual_encoder.blocks.10.norm2.bias', 'model.stllm_model.Qformer.bert.encoder.layer.3.attention.self.value.weight', 'model.stllm_model.visual_encoder.blocks.25.mlp.fc1.weight', 'model.stllm_model.visual_encoder.blocks.24.attn.v_bias', 'model.stllm_model.visual_encoder.blocks.29.mlp.fc1.bias', 'model.stllm_model.Qformer.bert.encoder.layer.5.output.LayerNorm.weight', 'model.stllm_model.visual_encoder.blocks.6.norm1.bias', 'model.stllm_model.visual_encoder.blocks.26.attn.proj.bias', 'model.stllm_model.visual_encoder.blocks.33.mlp.fc1.weight', 'model.stllm_model.Qformer.bert.encoder.layer.10.intermediate.dense.bias', 'model.stllm_model.visual_encoder.blocks.8.mlp.fc2.bias', 'model.stllm_model.Qformer.bert.embeddings.word_embeddings.weight', 'model.stllm_model.visual_encoder.blocks.18.mlp.fc2.bias', 'model.stllm_model.Qformer.bert.encoder.layer.10.attention.self.value.bias', 'model.stllm_model.visual_encoder.blocks.23.norm1.bias', 'model.stllm_model.Qformer.bert.encoder.layer.10.attention.self.value.weight', 'model.stllm_model.visual_encoder.blocks.6.mlp.fc2.bias', 'model.stllm_model.visual_encoder.blocks.12.norm1.weight', 'model.stllm_model.visual_encoder.blocks.25.attn.v_bias', 'model.stllm_model.visual_encoder.blocks.10.mlp.fc1.bias', 'model.stllm_model.visual_encoder.blocks.7.norm2.weight', 'model.stllm_model.visual_encoder.blocks.17.attn.v_bias', 'model.stllm_model.Qformer.bert.encoder.layer.10.output_query.LayerNorm.weight', 'model.stllm_model.visual_encoder.blocks.5.mlp.fc2.bias', 
'model.stllm_model.Qformer.bert.encoder.layer.4.attention.output.dense.weight', 'model.stllm_model.Qformer.bert.encoder.layer.4.crossattention.output.dense.weight', 'model.stllm_model.visual_encoder.blocks.5.mlp.fc1.weight', 'model.stllm_model.visual_encoder.blocks.19.attn.proj.weight', 'model.stllm_model.visual_encoder.blocks.24.attn.proj.weight', 'model.stllm_model.visual_encoder.blocks.2.norm2.weight', 'model.stllm_model.visual_encoder.blocks.20.attn.proj.bias', 'model.stllm_model.visual_encoder.blocks.32.norm2.weight', 'model.stllm_model.visual_encoder.blocks.10.mlp.fc1.weight', 'model.stllm_model.Qformer.bert.encoder.layer.9.output.dense.bias', 'model.stllm_model.Qformer.bert.encoder.layer.0.intermediate.dense.weight', 'model.stllm_model.visual_encoder.blocks.20.norm2.bias', 'model.stllm_model.visual_encoder.blocks.37.norm1.weight', 'model.stllm_model.visual_encoder.blocks.3.mlp.fc1.bias', 'model.stllm_model.Qformer.bert.encoder.layer.10.attention.self.key.weight', 'model.stllm_model.Qformer.bert.encoder.layer.2.intermediate.dense.weight', 'model.stllm_model.visual_encoder.blocks.26.attn.qkv.weight', 'model.stllm_model.visual_encoder.blocks.3.norm1.weight', 'model.stllm_model.Qformer.bert.encoder.layer.6.attention.output.dense.weight', 'model.stllm_model.visual_encoder.blocks.25.attn.q_bias', 'model.stllm_model.Qformer.bert.encoder.layer.2.attention.output.dense.bias', 'model.stllm_model.visual_encoder.blocks.20.attn.v_bias', 'model.stllm_model.visual_encoder.blocks.3.attn.proj.bias', 'model.stllm_model.Qformer.bert.encoder.layer.0.output_query.LayerNorm.weight', 'model.stllm_model.visual_encoder.blocks.7.mlp.fc1.bias', 'model.stllm_model.Qformer.bert.encoder.layer.1.attention.self.value.bias', 'model.stllm_model.Qformer.bert.encoder.layer.11.intermediate_query.dense.weight', 'model.stllm_model.visual_encoder.blocks.24.mlp.fc2.weight', 'model.stllm_model.visual_encoder.blocks.9.norm1.weight', 'model.stllm_model.Qformer.bert.encoder.layer.5.output.LayerNorm.bias', 'model.stllm_model.Qformer.bert.encoder.layer.4.crossattention.self.key.weight', 'model.stllm_model.visual_encoder.blocks.28.norm1.weight', 'model.stllm_model.Qformer.bert.encoder.layer.6.attention.output.LayerNorm.bias', 'model.stllm_model.visual_encoder.blocks.6.norm1.weight', 'model.stllm_model.Qformer.bert.encoder.layer.9.output_query.dense.weight', 'model.stllm_model.visual_encoder.blocks.16.norm1.weight', 'model.stllm_model.visual_encoder.blocks.35.mlp.fc2.weight', 'model.stllm_model.Qformer.bert.encoder.layer.10.output_query.dense.weight', 'model.stllm_model.Qformer.bert.encoder.layer.5.attention.self.query.bias', 'model.stllm_model.Qformer.bert.encoder.layer.4.output_query.dense.weight', 'model.stllm_model.Qformer.bert.encoder.layer.8.intermediate_query.dense.bias', 'model.stllm_model.visual_encoder.blocks.11.attn.proj.bias', 'model.stllm_model.visual_encoder.cls_token', 'model.stllm_model.Qformer.bert.encoder.layer.0.attention.output.dense.bias', 'model.stllm_model.visual_encoder.blocks.19.norm1.weight', 'model.stllm_model.visual_encoder.blocks.0.norm2.bias', 'model.stllm_model.visual_encoder.blocks.9.mlp.fc1.weight']
- This IS expected if you are initializing STLLMForCausalLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing STLLMForCausalLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
/media/sdb/long/conda/envs/stllm/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Loading VIT
Loading VIT Done
Loading Q-Former pretrained/instruct_blip_vicuna7b_trimmed.pth
Loading Q-Former Done

How to modify the code to support Llama 3?

Hi, dear all:
Though the paper achieved superior performance with only Vicuna-7B models, I want to explore the potential of stronger LLMs, such as Llama 3 or Yi. Could anyone give some tips on how to modify the code to support Llama 3 training with the ST-LLM datasets? Thank you ...

Inconsistent MVBench test results

Hello, I set up the EVA ViT-g + InstructBLIP + Vicuna 1.1 model locally and got a test result of 35.45% on MVBench. The detailed results are in the attachment; could you help me check where the problem is?
instructblipbase_stllm_qa_mvbench_fps1.json

GPU out of memory

Hello, thank you for your excellent code.

I am trying to reproduce your results, but I keep encountering a GPU OOM (Out of Memory) error. I am using 16 A100 GPUs (each with 40GB of memory) for training. Even after reducing the batch size from 16 to 8, I still face CUDA OOM errors. Your paper mentions that you used 8 A100 GPUs for training. Could you please share the specific GPU settings you used?

Thank you.
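For reference, the usual memory levers when reproducing on tighter GPU budgets are a smaller per-device micro-batch combined with gradient accumulation, gradient checkpointing, and a more aggressive ZeRO stage. Below is only a generic Hugging Face Trainer sketch with illustrative values, not the authors' actual settings; in this repo the equivalent keys live in the model: and run: sections of the YAML config.

from transformers import TrainingArguments

# Illustrative values only; tune them to your hardware.
training_args = TrainingArguments(
    output_dir="./output/oom_debug",             # hypothetical output dir
    per_device_train_batch_size=1,               # smallest micro-batch per GPU
    gradient_accumulation_steps=8,               # 1 x 16 GPUs x 8 = global batch of 128
    gradient_checkpointing=True,                 # trade recompute for activation memory
    bf16=True,
    deepspeed="stllm/train/zero3_offload.json",  # ZeRO-3 with CPU offload shards optimizer/params
)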

Training Time Discrepancies

Hi @farewellthree,

Thank you for sharing your work. I've been trying to train your ST-LLM models using the provided configurations - specifically instructblipbase_stllm_conversation.yaml and instructblipbase_stllm_qa.yaml - on 8 A100 GPUs. I observed a training time of around 16 hours for the conversation model, whereas the paper suggests approximately "6 hours for 2 epochs using Deepspeed's zero-2 setting".

Here is the command I use to initiate training:
deepspeed --master_port=20000 --include=localhost:0,1,2,3,4,5,6,7

I'm looking for clarification on whether there might be any configuration adjustments (e.g., batch size, optimizer settings) that could help align the training time more closely with what is reported in the paper. Additionally, could you provide the expected training duration for both configurations?

Any suggestions to improve training efficiency would be greatly appreciated. Thank you.

Question about eval

Hi @farewellthree @yxgeee @xinntao @yeliudev @ARCer , thanks for your great project.

I have reproduced the training. However, when I try to use the trained model (output/instructblipbase_stllm_qa) with the official eval script, I get very low accuracy on the MVBench eval (acc < 10%).

Is there anything wrong with my eval script? Or do I need to modify something (e.g., the config) for the trained model?

python stllm/test/mvbench/mv_bench_infer.py \
    --cfg-path config/instructblipbase_stllm_qa.yaml \
    --ckpt-path output/instructblipbase_stllm_qa \
    --anno-path /path/to/mvbench/json \
    --output_dir $output \
    --output_name instructblipbase_stllm_qa_mvbench_fps1 \
    --num-frames 0 \
    --ask_simple

How to fine-tune on image data

I want to add some image QA data during training, and I only changed the config file, but the log looks abnormal:

{'loss': 0.963, 'learning_rate': 1.9985566339747023e-05, 'epoch': 0.09}                                                                          
  5%|โ–ˆโ–ˆโ–ˆโ–ˆโ–Š                                                                                                  | 201/4280 [18:45<6:21:24,  5.61s/it]
Invalidate trace cache @ step 1026: expected module 1451, but got module 4
  5%|โ–ˆโ–ˆโ–ˆโ–ˆโ–‰                                                                                                  | 204/4280 [18:53<4:11:27,  3.70s/it]
Invalidate trace cache @ step 1026: expected module 4, but got module 1451
  5%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ                                                                                                  | 208/4280 [19:21<7:03:28,  6.24s/it]
Invalidate trace cache @ step 1026: expected module 1451, but got module 4
  5%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ                                                                                                  | 212/4280 [19:44<6:54:53,  6.12s/it]
Invalidate trace cache @ step 1026: expected module 1451, but got module 4
  5%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–                                                                                                 | 214/4280 [19:53<6:06:23,  5.41s/it]
Invalidate trace cache @ step 1026: expected module 1451, but got module 4
  5%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–                                                                                                 | 228/4280 [21:23<7:40:36,  6.82s/it]
Invalidate trace cache @ step 1026: expected module 1451, but got module 4
  5%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Œ                                                                                                 | 231/4280 [21:32<4:42:21,  4.18s/it]
Invalidate trace cache @ step 1026: expected module 4, but got module 1451
  6%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‹                                                                                                 | 237/4280 [22:13<7:21:49,  6.56s/it]
Invalidate trace cache @ step 1026: expected module 1451, but got module 4
  6%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Š                                                                                                 | 240/4280 [22:29<6:57:20,  6.20s/it]
Invalidate trace cache @ step 1026: expected module 1451, but got module 4
  6%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Š                                                                                                 | 242/4280 [22:39<6:24:21,  5.71s/it]

Why is there a message like "Invalidate trace cache @ step 1026: expected module 1451, but got module 4"?

MVM loss

Is the MVM loss an MSE loss or a cosine-similarity loss? In the paper, the MVM loss is described as MSE, but the loss in the code looks different:

loss_mvm = (2 - 2 * (mask_img_output * unmask_img_output).sum(dim=-1)).mean()
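The two formulations coincide if the features are L2-normalized: for unit vectors a and b, ||a - b||^2 = 2 - 2 * a.b, so the "2 - 2 * cosine" form in the code equals the squared-L2 (MSE-style) distance up to a constant factor. A minimal sketch of this identity, assuming both feature tensors are normalized (whether ST-LLM normalizes them is not shown here):

import torch
import torch.nn.functional as F

# Hypothetical stand-ins for the masked / unmasked visual features.
torch.manual_seed(0)
mask_img_output = F.normalize(torch.randn(4, 256), dim=-1)
unmask_img_output = F.normalize(torch.randn(4, 256), dim=-1)

# Form used in the code: 2 - 2 * cosine similarity.
loss_cos = (2 - 2 * (mask_img_output * unmask_img_output).sum(dim=-1)).mean()

# Squared L2 distance between the normalized features.
loss_l2 = ((mask_img_output - unmask_img_output) ** 2).sum(dim=-1).mean()

print(torch.allclose(loss_cos, loss_l2, atol=1e-6))  # True for unit-norm features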

MVBench evaluation results

Hi authors, I reproduced the model's training locally, using the same training set as VideoChat2 and modifying the annotations of the two datasets you mentioned (videochat1 and videochatgpt).
With 4 epochs, the performance on MVBench is about 51.2% (my local evaluation of the released model gives 54.85%).
Since this gap is fairly large,
is there anything I should pay attention to during training?

AttributeError: 'MetaLoader' object has no attribute 'dataset'

When I try to resume from the checkpoint of the last training run, it raises an error:

[rank12]: Traceback (most recent call last):
[rank12]:   File "/cpfs/29f69eb5e2e60f26/user/GPT/pretrain/mm_intern/duyifan/ST-LLM-temp/stllm/train/train_hf.py", line 278, in <module>
[rank12]:     train()
[rank12]:   File "/cpfs/29f69eb5e2e60f26/user/GPT/pretrain/mm_intern/duyifan/ST-LLM-temp/stllm/train/train_hf.py", line 267, in train
[rank12]:     trainer.train(resume_from_checkpoint=True)
[rank12]:   File "/cpfs/29f69eb5e2e60f26/user/GPT/pretrain/mm_intern/duyifan/miniconda3/envs/stllm/lib/python3.10/site-packages/transformers/trainer.py", line 1662, in train
[rank12]:     return inner_training_loop(
[rank12]:   File "/cpfs/29f69eb5e2e60f26/user/GPT/pretrain/mm_intern/duyifan/miniconda3/envs/stllm/lib/python3.10/site-packages/transformers/trainer.py", line 1893, in _inner_training_loop
[rank12]:     epoch_iterator = skip_first_batches(epoch_iterator, steps_trained_in_current_epoch)
[rank12]:   File "/cpfs/29f69eb5e2e60f26/user/GPT/pretrain/mm_intern/duyifan/miniconda3/envs/stllm/lib/python3.10/site-packages/accelerate/data_loader.py", line 1086, in skip_first_batches
[rank12]:     dataset = dataloader.dataset
[rank12]: AttributeError: 'MetaLoader' object has no attribute 'dataset'
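The traceback shows that accelerate's skip_first_batches reads dataloader.dataset, which the repo's MetaLoader wrapper does not expose. Two possible workarounds are sketched below under the assumption that standard Hugging Face TrainingArguments are used; neither is the authors' official fix.

from transformers import TrainingArguments

# Option 1: skip the dataloader fast-forward entirely when resuming, so
# skip_first_batches (and hence .dataset) is never called.
training_args = TrainingArguments(
    output_dir="./output/instructblipbase_stllm_qa",  # hypothetical output dir
    ignore_data_skip=True,
)

# Option 2 (sketch only): add a `dataset` property to the MetaLoader class that
# returns one of the wrapped loaders' datasets; the exact attribute holding the
# wrapped loaders depends on the repo's implementation, so this is illustrative.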

HF weights

Many thanks for the great contribution! What is the difference between the QA_weight and the Conversation_weight in the Hugging Face repo?

Cuda Version

Hello,
I am trying to run the scripts on a machine with CUDA 11.7 installed, but I keep getting the error:
RuntimeError: The NVIDIA driver on your system is too old (found version 11070). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver.

So I was wondering which CUDA version I should install to solve this. Thanks in advance for your help.
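This error usually means the installed PyTorch wheel was built against a newer CUDA toolkit than the driver supports (the driver here reports CUDA 11.7). A quick, generic way to check the mismatch:

import torch

print("PyTorch version:   ", torch.__version__)
print("Built against CUDA:", torch.version.cuda)   # e.g. '12.1' is too new for an 11.7 driver
print("CUDA available:    ", torch.cuda.is_available())

# If torch.version.cuda is newer than the driver supports, either update the NVIDIA
# driver or reinstall a PyTorch build compiled for CUDA 11.7 (the cu117 wheels).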

Loss issue during fine-tuning

During training, the loss quickly jumps to 0, and lowering the learning rate does not solve the problem.
The config file is as follows:
model:
  arch: st_llm_hf
  model_type: instructblip_vicuna0
  use_grad_checkpoint: True
  max_txt_len: 256
  end_sym: "###"
  #prompt_path: "prompts/alignment.txt"
  prompt_template: '###Human: {} ###Assistant: '
  llama_model: '/root/qfs/lmm/weights/stllm/pretrained/vicuna-7b-v1.1/'
  ckpt: '/root/qfs/lmm/weights/stllm/pretrained/instruct_blip_vicuna7b_trimmed.pth'
  q_former_model: '/root/qfs/lmm/weights/stllm/pretrained/instruct_blip_vicuna7b_trimmed.pth'
  qformer_text_input: True
  freeze_LLM: False
  video_input: "residual"
  residual_size: 16
  use_mask: True
  mvm_decode: True

datasets:
  caption_体育240402_en:
    num_frames: 64

run:
  task: video_text_it
  bf16: True
  tf32: False
  output_dir: "./output/instructblipbase_stllm_conversation"
  num_train_epochs: 4
  dataloader_num_workers: 2
  per_device_train_batch_size: 2
  per_device_eval_batch_size: 2
  gradient_accumulation_steps: 1
  evaluation_strategy: "no"

  learning_rate: 2e-5

  learning_rate: 1e-10
  weight_decay: 0.

  warmup_ratio: 0.03

  warmup_ratio: 0.3
  lr_scheduler_type: 'cosine'
  logging_steps: 1
  model_max_length: 1024
  save_steps: 3000
  #save_strategy: "epoch"
  save_total_limit: 10
  deepspeed: 'stllm/train/zero2.json'

  deepspeed: 'stllm/train/zero3.json'

  deepspeed: 'stllm/train/zero3_offload.json'

The specific settings of LoRA finetuning

Thank you for your open-source work!
Can you share the specific settings used for the LoRA finetuning results reported in the paper? I tried training with LoRA, but the results differed from those reported in the paper (InstructBLIP).
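For comparison while waiting for the authors' settings, a generic PEFT-style LoRA configuration is sketched below; the rank, alpha, dropout, and target modules are common defaults for LLaMA/Vicuna backbones, not the values used in the ST-LLM paper.

from peft import LoraConfig, get_peft_model

# Illustrative defaults, NOT the paper's settings.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # typical Vicuna attention projections
    task_type="CAUSAL_LM",
)
# model = get_peft_model(llm, lora_config)  # `llm` = the loaded Vicuna-7B backbone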

Improve HF artifacts

Hi there,

Niels here from the open-source team at HF. Congrats on your work! I discovered it here: https://huggingface.co/papers/2406.06040 (feel free to claim the paper so that it appears under your HF profile).

I see the ST-LLM models are available here: https://huggingface.co/farewellthree/ST_LLM_weight/tree/main, however we've got some suggestions to improve the usability/visibility.

  • It would be great to make the checkpoints Transformers compatible, by following this guide: https://huggingface.co/docs/transformers/custom_models. Basically this allows people to directly use your models through the Transformers API, along with trust_remote_code=True (the code itself would live on the hub).
  • Perhaps they can be pushed to https://huggingface.co/TencentARC?
  • It would be great to add pipeline_tag: video-text-to-text to the model card's metadata, so people can find the model easily (see the sketch just after this list).
  • Moreover, we encourage pushing each model checkpoint to its own model repository, with the weights at the root of the repo, so that download stats work for your models (currently the model repo says "downloads aren't tracked").
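For the pipeline_tag point above, a minimal sketch with huggingface_hub (the repo id is the existing weight repo; this is only an illustration and requires the owner's write token):

from huggingface_hub import metadata_update

# Add or update the model card front matter so the model shows up under the
# video-text-to-text task filter on the Hub.
metadata_update(
    "farewellthree/ST_LLM_weight",
    {"pipeline_tag": "video-text-to-text"},
    repo_type="model",
    overwrite=True,
)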

Let me know if you need any help regarding this.

Cheers,

Niels
