moe-llava's Issues

Stage 2: what is a reasonable loss to converge to?

When I run stage 2, training is very slow and GPU utilization fluctuates sharply.

I also see that the loss starts at around 1.1; after 20 steps it is still around 1.2, so the loss does not seem to be decreasing. Is this normal?

When you trained, what loss did stage 2 converge to?

{'loss': 0.0, 'learning_rate': 1.6877637130801689e-07, 'epoch': 0.0}

{'loss': 2.1559, 'learning_rate': 8.438818565400844e-08, 'epoch': 0.0}                                                                                                                                               
{'loss': 0.0, 'learning_rate': 1.6877637130801689e-07, 'epoch': 0.0}                                                                                                                                                 
{'loss': 0.0, 'learning_rate': 2.5316455696202533e-07, 'epoch': 0.0}                                                                                                                                                 
{'loss': 0.0, 'learning_rate': 3.3755274261603377e-07, 'epoch': 0.0}                                                                                                                                                 
{'loss': 0.0, 'learning_rate': 4.219409282700422e-07, 'epoch': 0.0}                                                                                                                                                  
{'loss': 0.0, 'learning_rate': 5.063291139240507e-07, 'epoch': 0.0}                                                                                                                                                  
{'loss': 0.0, 'learning_rate': 5.907172995780591e-07, 'epoch': 0.0}                                                                                                                                                  
{'loss': 0.0, 'learning_rate': 6.751054852320675e-07, 'epoch': 0.0}                                                                                                                                                  
{'loss': 0.0, 'learning_rate': 7.59493670886076e-07, 'epoch': 0.0} 

During the pre-training phase, I can obtain the correct loss convergence. However, in the finetuning stage, except for the first iteration, the rest of the losses are all 0.0. Could you please tell me where the problem might be?
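One quick thing to check: a loss of exactly 0.0 usually means every target token in the batch is masked out. A minimal check, assuming the repo follows the LLaVA/Hugging Face convention of masking labels with -100:

import torch

IGNORE_INDEX = -100  # Hugging Face / LLaVA convention for masked-out label positions

def supervised_fraction(labels: torch.Tensor) -> float:
    """Fraction of label positions that actually contribute to the loss.

    If this is 0.0 for (almost) every batch, the conversation template and the
    tokenizer disagree and all targets are being masked, which makes the
    reported loss exactly 0.0.
    """
    return (labels != IGNORE_INDEX).sum().item() / max(labels.numel(), 1)

# Example with a fully masked batch, as suspected in this issue:
labels = torch.full((1, 32), IGNORE_INDEX)
print(supervised_fraction(labels))  # 0.0 -> no token is supervised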

Wrong dependencies: why is DeepSpeed a dependency for inference? Better transformers integration

Hi - thanks for a great repo and model! I got it working today, but there seem to be many unnecessary hurdles:

  1. The minimum Python version is wrong - I have run it natively on Python 3.8.10; please correct the README.
  2. Why is there a dependency on DeepSpeed in the inference code? The model does not need DeepSpeed for inference.
  3. Perhaps I misunderstand, but torch weights are loaded as part of inference even after the safetensors - all other multimodal models load the tensors directly. If it has to be done this way, where is the .bin model located so I can save it locally rather than store it in the cache?
  4. Rather than having to clone the repo, the Hugging Face repo could support trust_remote_code=True; vikhyat has done this for his custom model, see vikhyat/moondream#18. (A hedged loading sketch follows this list.)
  5. Running inference.py directly works exactly the same as deepspeed inference.py, so why run it this way?
  6. Also regarding quantisation - there seems to be a fault with the transformers/accelerate/deepspeed integration and versioning, so I suspect quantisation won't be usable until those libraries have corrected things.
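On point 4, a hedged sketch of what the user side would look like if the Hub checkpoint bundled its custom modelling code (which is exactly the request, so today this may simply not work); the repo id is the published Phi2 checkpoint:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical usage if the checkpoint shipped its own modelling code on the Hub:
# no repo clone and no DeepSpeed launcher would be needed just to load the weights.
model_id = "LanguageBind/MoE-LLaVA-Phi2-2.7B-4e"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,  # executes modelling code stored alongside the weights
    device_map="auto",
)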

Eval on MMVET

Hi,

I am trying to evaluate the model on MM-Vet, but there is no such file as scripts/eval_gpt_mmvet.py in your code.

Error during training on custom dataset

Describe the issue

Hello,

I am training llava-mistral on a custom dataset, but somewhere during training I encounter the following error:

  train()
  File "/home/ubuntu/scripts/MoE-LLaVA/moellava/train/train.py", line 1465, in train
    trainer.train()
  File "/opt/conda/envs/moellava/lib/python3.10/site-packages/transformers/trainer.py", line 1537, in train
    return inner_training_loop(
  File "/opt/conda/envs/moellava/lib/python3.10/site-packages/transformers/trainer.py", line 1854, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/opt/conda/envs/moellava/lib/python3.10/site-packages/transformers/trainer.py", line 2735, in training_step
    loss = self.compute_loss(model, inputs)
  File "/opt/conda/envs/moellava/lib/python3.10/site-packages/transformers/trainer.py", line 2758, in compute_loss
    outputs = model(**inputs)
  File "/opt/conda/envs/moellava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/moellava/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/opt/conda/envs/moellava/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1735, in forward
    loss = self.module(*inputs, **kwargs)
  File "/opt/conda/envs/moellava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/home/ubuntu/scripts/MoE-LLaVA/moellava/model/language_model/llava_mistral.py", line 68, in forward
    ) = self.prepare_inputs_labels_for_multimodal(
  File "/home/ubuntu/scripts/MoE-LLaVA/moellava/model/llava_arch.py", line 302, in prepare_inputs_labels_for_multimodal
    cur_image_features = image_features[cur_image_idx].to(self.device)
IndexError: list index out of range

So I was wondering if anyone could help. Thanks!
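This IndexError usually means a conversation references more <image> placeholders than images actually supplied for that sample (or an image failed to load and was dropped). A minimal sanity check over a LLaVA-style annotation JSON, assuming the usual conversations/image fields; the path is a placeholder:

import json

def check_image_placeholders(json_path: str) -> None:
    """Flag samples whose <image> token count disagrees with the images they provide."""
    with open(json_path) as f:
        data = json.load(f)
    for i, sample in enumerate(data):
        text = " ".join(turn.get("value", "") for turn in sample.get("conversations", []))
        n_tokens = text.count("<image>")
        image_field = sample.get("image")
        if image_field is None:
            n_images = 0
        elif isinstance(image_field, list):
            n_images = len(image_field)
        else:
            n_images = 1
        if n_tokens != n_images:
            print(f"sample {i}: {n_tokens} <image> tokens vs {n_images} image entries")

check_image_placeholders("path/to/your_train_data.json")  # hypothetical path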

Inference error in LlavaMistral

Question

Hello,

I have trained a LlavaMistralForCausalLM model based on OpenChat (not the MoE version), but when I use predict.py
I get the following error:

File ~/scripts/MoE-LLaVA/moellava/model/language_model/llava_mistral.py:94, in LlavaMistralForCausalLM.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, images, return_dict)
     76     (
     77         input_ids,
     78         position_ids,
   (...)
     89         images
     90     )
     92 # dist.barrier()
     93 # print(f'rank {dist.get_rank()}', 'after prepare_inputs_labels_for_multimodal')
---> 94 out = super().forward(
     95     input_ids=input_ids,
     96     attention_mask=attention_mask,
     97     position_ids=position_ids,
     98     past_key_values=past_key_values,
     99     inputs_embeds=inputs_embeds,
    100     labels=labels,
    101     use_cache=use_cache,
    102     output_attentions=output_attentions,
    103     output_hidden_states=output_hidden_states,
    104     return_dict=return_dict
    105 )
    106 # dist.barrier()
    107 # print(f'rank {dist.get_rank()}', 'after LLM')
    108 return out

File /opt/conda/envs/moellava/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py:1053, in MistralForCausalLM.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict)
   1050 return_dict = return_dict if return_dict is not None else self.config.use_return_dict
   1052 # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
-> 1053 outputs = self.model(
   1054     input_ids=input_ids,
   1055     attention_mask=attention_mask,
   1056     position_ids=position_ids,
   1057     past_key_values=past_key_values,
   1058     inputs_embeds=inputs_embeds,
   1059     use_cache=use_cache,
   1060     output_attentions=output_attentions,
   1061     output_hidden_states=output_hidden_states,
   1062     return_dict=return_dict,
   1063 )
   1065 hidden_states = outputs[0]
   1066 logits = self.lm_head(hidden_states)

File /opt/conda/envs/moellava/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File /opt/conda/envs/moellava/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py:908, in MistralModel.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, use_cache, output_attentions, output_hidden_states, return_dict)
    905     attention_mask = attention_mask if (attention_mask is not None and 0 in attention_mask) else None
    906 else:
    907     # 4d mask is passed through the layers
--> 908     attention_mask = _prepare_4d_causal_attention_mask(
    909         attention_mask,
    910         (batch_size, seq_length),
    911         inputs_embeds,
    912         past_key_values_length,
    913         sliding_window=self.config.sliding_window,
    914     )
    916 hidden_states = inputs_embeds
    918 # decoder layers

File /opt/conda/envs/moellava/lib/python3.10/site-packages/transformers/modeling_attn_mask_utils.py:306, in _prepare_4d_causal_attention_mask(attention_mask, input_shape, inputs_embeds, past_key_values_length, sliding_window)
    304 # 4d mask is passed through the layers
    305 if attention_mask is not None:
--> 306     attention_mask = attn_mask_converter.to_4d(
    307         attention_mask, input_shape[-1], key_value_length=key_value_length, dtype=inputs_embeds.dtype
    308     )
    309 else:
    310     attention_mask = attn_mask_converter.to_causal_4d(
    311         input_shape[0], input_shape[-1], key_value_length, dtype=inputs_embeds.dtype, device=inputs_embeds.device
    312     )

File /opt/conda/envs/moellava/lib/python3.10/site-packages/transformers/modeling_attn_mask_utils.py:136, in AttentionMaskConverter.to_4d(self, attention_mask_2d, query_length, dtype, key_value_length)
    132 expanded_attn_mask = self._expand_mask(attention_mask_2d, dtype, tgt_len=input_shape[-1]).to(
    133     attention_mask_2d.device
    134 )
    135 if causal_4d_mask is not None:
--> 136     expanded_attn_mask = causal_4d_mask.masked_fill(expanded_attn_mask.bool(), torch.finfo(dtype).min)
    138 # expanded_attn_mask + causal_4d_mask can cause some overflow
    139 expanded_4d_mask = expanded_attn_mask

RuntimeError: The size of tensor a (622) must match the size of tensor b (1243) at non-singleton dimension 3

Should I change the inference code for non-MoE models?
Thanks for the help
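A guess at the cause rather than an official fix: the 622-vs-1243 mismatch looks like the 2D attention mask keeping its pre-expansion text length while inputs_embeds already contains the inserted image tokens. A minimal sketch of the kind of realignment that would be needed before the language model is called, valid only for unpadded single-sequence inference:

import torch

def align_attention_mask(attention_mask, inputs_embeds):
    """Rebuild a 2D attention mask whose length matches the expanded multimodal sequence.

    Assumes every expanded position is a real (non-padded) token, which holds for
    single-sample inference but not for batched, right-padded inputs.
    """
    if attention_mask is None or attention_mask.shape[1] == inputs_embeds.shape[1]:
        return attention_mask
    return torch.ones(inputs_embeds.shape[:2],
                      dtype=attention_mask.dtype,
                      device=attention_mask.device)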

Replacing the LLM with Qwen2 using the official LLaVA scripts and training with the mpt template gives loss 0

Has the repo author run into a similar situation?
{'loss': 0.0, 'learning_rate': 0.001435114503816794, 'epoch': 0.02}
2%|██▊ | 188/8720 [14:45<11:05:28, 4.68s/it]WARNING: tokenization mismatch: 58 vs. 59. (ignored)
WARNING: tokenization mismatch: 41 vs. 42. (ignored)
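A check that often explains "tokenization mismatch: N vs. N+1": the preprocessing counts tokens per turn while the tokenizer silently adds or omits a BOS/EOS special token. The model id below is just an example Qwen checkpoint, not necessarily the one used here:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-1.8B")  # substitute your base model
sample = "A chat between a curious user and an AI assistant."

with_special = tok(sample).input_ids
without_special = tok(sample, add_special_tokens=False).input_ids
print(len(with_special), len(without_special))         # an off-by-one here matches the warning
print(tok.bos_token, tok.eos_token, tok.padding_side)  # specials the template must account for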

Inference without DeepSpeed

Describe the issue

Describe the bug
I am having issues running inference with the model after installing DeepSpeed for Windows. My configuration is listed below. Is there any way to perform inference with just a Python script? I tried to call python moellava/serve/cli.py directly, but it looks like the code still needs DeepSpeed. I have no issues when I perform inference with the base LLaVA model. (A plain-Python launch sketch follows the system info below.)

To Reproduce

deepspeed --include localhost:0 moellava/serve/cli.py --model-path "LanguageBind/MoE-LLaVA-StableLM-1.6B-4e" --image-file "other/test2.jpg"

When I run the above, I get this error:

'deepspeed' is not recognized as an internal or external command,

Even when I add the path to the DeepSpeed library from the conda env, I still get this error.

System info

  • OS: Windows 10
  • 1 GPU: NVIDIA Quadro RTX 5000, 16 GB VRAM
  • DeepSpeed installed following the Windows instructions here: https://github.com/microsoft/DeepSpeed
  • Python version = 3.10
  • conda env with torch==2.1.2 using CUDA 12.1
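A hedged workaround to try, given that the deepspeed console script has no Windows entry point: launch the same CLI module with the current Python interpreter. The script may still import deepspeed internally, so this only helps if that import succeeds on a Windows install:

import subprocess
import sys

# Same arguments as the failing command, launched without the deepspeed wrapper.
subprocess.run(
    [
        sys.executable, "moellava/serve/cli.py",
        "--model-path", "LanguageBind/MoE-LLaVA-StableLM-1.6B-4e",
        "--image-file", "other/test2.jpg",
    ],
    check=True,
)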

Wrong CUDA allocation

Hi,

when I try to run inference with any MoE-LLaVA model on a node with 4x A100, I run into an issue with tensor allocation:

I have installed MoE-LLaVA from the latest main commit (188d462)

Traceback (most recent call last):
  File "/workspace/models.py", line 235, in forward
    output_ids = self.model.generate(
  File "/workspace/data/conda/envs/moellava/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/data/conda/envs/moellava/lib/python3.10/site-packages/transformers/generation/utils.py", line 1764, in generate
    return self.sample(
  File "/workspace/data/conda/envs/moellava/lib/python3.10/site-packages/transformers/generation/utils.py", line 2861, in sample
    outputs = self(
  File "/workspace/data/conda/envs/moellava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/data/conda/envs/moellava/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/workspace/MoE-LLaVA/moellava/model/language_model/llava_phi_moe.py", line 336, in forward
    ) = self.prepare_inputs_labels_for_multimodal(
  File "/workspace/MoE-LLaVA/moellava/model/llava_arch.py", line 198, in prepare_inputs_labels_for_multimodal
    image_features_minibatch = self.encode_images(images_minibatch)  # [mini_b, l, c]
  File "/workspace/MoE-LLaVA/moellava/model/llava_arch.py", line 153, in encode_images
    image_features = self.get_model().mm_projector.forward_image(image_features)
  File "/workspace/MoE-LLaVA/moellava/model/multimodal_projector/builder.py", line 138, in forward_image
    return self.image_spatial_proj(image_feature)
  File "/workspace/data/conda/envs/moellava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/data/conda/envs/moellava/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/workspace/data/conda/envs/moellava/lib/python3.10/site-packages/torch/nn/modules/container.py", line 217, in forward
    input = module(input)
  File "/workspace/data/conda/envs/moellava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/data/conda/envs/moellava/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/workspace/data/conda/envs/moellava/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:3! (when checking argument for argument mat1 in method wrapper_CUDA_addmm)

All tensors (input_ids and the image tensor), including the model, are located on cuda:0 before I run model.generate. Could this be related to haotian-liu/LLaVA#769?
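A hedged workaround rather than a root-cause fix: hide all but one GPU before torch is imported, so accelerate cannot scatter sub-modules such as the mm_projector across cuda:0..cuda:3:

import os

# Must run before `import torch` / model loading in the same process.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
assert torch.cuda.device_count() == 1  # every module and tensor now lands on cuda:0

If the loader forwards keyword arguments to from_pretrained, passing device_map={'': 0} should achieve the same thing without hiding the other GPUs, though that is untested here.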

Images for training

Could you share the sampled images for MoE training, i.e. the images for Stage II: SViT-157k, LVIS-220k, LRV-331k, MIMIC-IT-256k?

[Usage] The training always gets stuck after formatting inputs

Describe the issue

Issue:
In pretraining or finetuning, the training always gets stuck after the log "Formatting inputs...Skip in lazy mode". Every time I have to force-shut down my GPU server because it becomes unresponsive; both the GUI and SSH stop responding entirely.

Command (scripts/v1/phi2/pretrain.sh):

deepspeed --num_gpus=2 moellava/train/train_mem.py \
    --deepspeed ./scripts/zero2.json \
    --model_name_or_path microsoft/phi-2 \
    --version plain \
    --data_path ${JSON_FOLDER}/llava_image_.json \
    --image_folder ${IMAGE_FOLDER} \
    --image_tower google/siglip-so400m-patch14-384 \
    --image_projector_type mlp2x_gelu \
    --tune_mm_mlp_adapter True \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --bf16 True \
    --output_dir ./checkpoints/llavaphi-2.7b-pretrain \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 24000 \
    --save_total_limit 1 \
    --learning_rate 1e-3 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 0 \
    --lazy_preprocess True \
    --cache_dir ${CACHE_FOLDER}

When using the zero2.json, the log stops here
Log:

    (mm_projector): build_projector(
      (image_spatial_proj): Sequential(
        (0): Linear(in_features=1024, out_features=2560, bias=True)
        (1): GELU(approximate='none')
        (2): Linear(in_features=2560, out_features=2560, bias=True)
      )
      (video_patch_proj): Identity()
      (video_spatial_proj): Identity()
      (video_temproal_proj): Identity()
      (video_global_proj): Identity()
    )
  )
  (lm_head): Linear(in_features=2560, out_features=51200, bias=False)
)
Formatting inputs...Skip in lazy mode

When using the zero2_offload.json, the log stops here
Log:

    (mm_projector): build_projector(
      (image_spatial_proj): Sequential(
        (0): Linear(in_features=1024, out_features=2560, bias=True)
        (1): GELU(approximate='none')
        (2): Linear(in_features=2560, out_features=2560, bias=True)
      )
      (video_patch_proj): Identity()
      (video_spatial_proj): Identity()
      (video_temproal_proj): Identity()
      (video_global_proj): Identity()
    )
  )
  (lm_head): Linear(in_features=2560, out_features=51200, bias=False)
)
Formatting inputs...Skip in lazy mode
Using /home/xxx/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Creating extension directory /home/xxx/.cache/torch_extensions/py310_cu117/cpu_adam...
Using /home/xxx/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/xxx/.cache/torch_extensions/py310_cu117/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/4] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=cpu_adam -D ...
[2/4] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam ...
[3/4] c++ -MMD -MF cpu_adam_impl.o.d -DTORCH_EXTENSION_NAME=cpu_adam ...
[4/4] c++ cpu_adam.o cpu_adam_impl.o custom_cuda_kernel.cuda.o - ...
Loading extension module cpu_adam...
Time to load cpu_adam op: 39.41885256767273 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 39.50555872917175 seconds

I tried removing the folders under ~/.cache/torch_extensions, but the hang still happens.

My machine has 10x RTX 3090 GPUs and 512 GB to 1 TB of RAM. In the command above I only try to use 2 cards. Since the machine breaks down every time I start training, I have no idea what is wrong. Please help!

The environment exactly follows the instructions, except for two additional commands, "pip install deepspeed -U" and "pip install accelerate -U"; I updated these two packages while trying to solve the same problem.
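A hedged diagnostic for hangs right after "Formatting inputs" on multi-GPU consumer cards: NCCL peer-to-peer stalls are a common culprit on RTX 3090 boxes without NVLink. These are standard NCCL environment variables; whether they resolve this particular hang is only a guess, and they must be set before the process group is created (for example at the top of train_mem.py, or exported before launching deepspeed):

import os

os.environ["NCCL_DEBUG"] = "INFO"        # log transport selection to see where the stall happens
os.environ["NCCL_P2P_DISABLE"] = "1"     # fall back from GPU peer-to-peer to shared-memory transport
os.environ["NCCL_IB_DISABLE"] = "1"      # skip InfiniBand probing on a single-node machine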

Screenshots: (screenshot dated 2024-02-20 omitted)

[Discussion] Implementation of Qwen1.5 for the project

Discussion

First of all, I wish you a happy Chinese New Year.
I am currently catching up with your progress in integrating Qwen1.5 into this project. Since Qwen1.5 shares a similar structure with the Qwen1 models, I followed the Qwen1 template to integrate the code.
So far I have succeeded in training and fine-tuning the model, but I ran into a problem when evaluating the models on TextVQA.

builder.py

                if 'qwen2' in model_base.lower():
                    model = LlavaQwen2ForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=lora_cfg_pretrained, **kwargs)
                    model.config.eos_token_id = tokenizer.eos_token_id
                    model.generation_config = GenerationConfig.from_pretrained(model_base, pad_token_id=tokenizer.pad_token_id)
                    # model.generation_config.repetition_penalty = None
                    model.generation_config.do_sample = False  # use greedy decoding
                    model.generation_config.repetition_penalty = 1.0  # disable repetition penalty

The rest of the code follows the Qwen1 settings.

Unfortunately, the code produces the following error.

    attention_mask = _prepare_4d_causal_attention_mask(
  File "/root/anaconda3/envs/llava/lib/python3.10/site-packages/transformers/modeling_attn_mask_utils.py", line 307, in _prepare_4d_causal_attention_mask
    attention_mask = attn_mask_converter.to_4d(
  File "/root/anaconda3/envs/llava/lib/python3.10/site-packages/transformers/modeling_attn_mask_utils.py", line 137, in to_4d
    expanded_attn_mask = causal_4d_mask.masked_fill(expanded_attn_mask.bool(), torch.finfo(dtype).min)
RuntimeError: The size of tensor a (654) must match the size of tensor b (1307) at non-singleton dimension 3

May I consult you for detailed instructions? Or could we work together to enhance the performance?

I reviewed the code and suspect the cause may be padding_side='right', which is not supported by Qwen1.5 with flash attention; the problem is shown in the official implementation.
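If right padding is indeed the issue, switching the tokenizer to left padding for evaluation is a one-line experiment (standard transformers attributes; the model id below is only illustrative):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-1.8B")  # substitute the finetuned base model
tok.padding_side = "left"                                  # batched generation expects left padding
if tok.pad_token is None:
    tok.pad_token = tok.eos_token                          # Qwen tokenizers ship without a pad token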

Is llava_llama MoE supported?

Hi, have you tested the results for the llava_llama version? Would an extra MoE stage improve the original LLaVA results?

Training MoE-LLaVA on my own data: the loss drops very fast in the pretraining stage

I am using ByteDance data to train MoE-LLaVA. My dataset is fairly large, roughly 30 million records plus the MoE-LLaVA data.

There are about 500,000 steps in total.

In the pretraining stage, from step 0 to step 3000 the loss dropped to 1.4.

From step 3000 to step 7000, the loss dropped to 1.1.

Should I keep waiting? Only about 2% of the steps have run so far. Should I wait until the loss drops to around 0.3?

In your pretraining stage, at what loss value did you stop?

LanguageBind video model may hang?

When I integrated the LanguageBind_Video_merge model, training hangs partway through.

After 30 minutes it errors out with an NCCL timeout. If I remove the video-related data, training runs perfectly fine.

root@A03-R40-I16-12-8000045:/export/App/training_platform/PinoModel# py-spy dump -p 3261644
Process 3261644: /usr/bin/python -u moellava/train/train_mem.py --local_rank=5 --deepspeed ./scripts/zero3_offload.json --model_name_or_path /export/App/training_platform/PinoModel/mixtral/Mixtral-8x7B-Instruct-v0.1 --version mixtral --data_path /mnt/moe/moe/dataset/data_root/train_json/pretrain/valley_llavaimage.json --image_folder /mnt/moe/moe/dataset/data_root --image_tower /export/App/training_platform/PinoModel/openai/clip-vit-large-patch14-336 --image_projector_type mlp2x_gelu --video_tower /export/App/training_platform/PinoModel/LanguageBind/LanguageBind_Video_merge --video_folder /mnt/moe/moe/dataset/data_root --tune_mm_mlp_adapter True --mm_vision_select_layer -2 --mm_use_im_start_end False --mm_use_im_patch_token False --bf16 True --output_dir ./checkpoints/llavamixtral-7b-pretrain --num_train_epochs 1 --per_device_train_batch_size 16 --per_device_eval_batch_size 4 --gradient_accumulation_steps 1 --evaluation_strategy no --save_strategy steps --save_steps 2400 --save_total_limit 1 --learning_rate 1e-3 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type cosine --logging_steps 1 --tf32 True --model_max_length 2048 --gradient_checkpointing True --dataloader_num_workers 8 --lazy_preprocess True --report_to tensorboard --cache_dir ./cache_dir
Python v3.10.12 (/usr/bin/python3.10)

Thread 3261644 (active): "MainThread"
    <listcomp> (deepspeed/runtime/zero/partition_parameters.py:1138)
    _all_gather_dtype (deepspeed/runtime/zero/partition_parameters.py:1138)
    all_gather_coalesced (deepspeed/runtime/zero/partition_parameters.py:1252)
    wrapped_fn (deepspeed/utils/nvtx.py:15)
    __all_gather_params_ (deepspeed/runtime/zero/partitioned_param_coordinator.py:458)
    __all_gather_params (deepspeed/runtime/zero/partitioned_param_coordinator.py:429)
    wrapped_fn (deepspeed/utils/nvtx.py:15)
    fetch_sub_module (deepspeed/runtime/zero/partitioned_param_coordinator.py:380)
    decorate_context (torch/utils/_contextlib.py:115)
    wrapped_fn (deepspeed/utils/nvtx.py:15)
    pre_sub_module_forward_function (deepspeed/runtime/zero/parameter_offload.py:452)
    decorate_context (torch/utils/_contextlib.py:115)
    _pre_forward_module_hook (deepspeed/runtime/zero/parameter_offload.py:340)
    wrapped_fn (deepspeed/utils/nvtx.py:15)
    _call_impl (torch/nn/modules/module.py:1557)
    _wrapped_call_impl (torch/nn/modules/module.py:1518)
    forward (transformers/models/clip/modeling_clip.py:263)
    _call_impl (torch/nn/modules/module.py:1568)
    _wrapped_call_impl (torch/nn/modules/module.py:1518)
    forward (transformers/models/clip/modeling_clip.py:372)
    _call_impl (torch/nn/modules/module.py:1568)
    _wrapped_call_impl (torch/nn/modules/module.py:1518)
    forward (torch/utils/checkpoint.py:230)
    apply (torch/autograd/function.py:539)
    checkpoint (torch/utils/checkpoint.py:450)
    inner (torch/_dynamo/external_utils.py:17)
    _fn (torch/_dynamo/eval_frame.py:333)
    inner (torch/_compile.py:24)
    forward (transformers/models/clip/modeling_clip.py:622)
    _call_impl (torch/nn/modules/module.py:1568)
    _wrapped_call_impl (torch/nn/modules/module.py:1518)
    forward (transformers/models/clip/modeling_clip.py:844)
    _call_impl (torch/nn/modules/module.py:1568)
    _wrapped_call_impl (torch/nn/modules/module.py:1518)
    forward (transformers/models/clip/modeling_clip.py:917)
    _call_impl (torch/nn/modules/module.py:1568)
    _wrapped_call_impl (torch/nn/modules/module.py:1518)
    forward (clip_encoder.py:50)
    decorate_context (torch/utils/_contextlib.py:115)
    _call_impl (torch/nn/modules/module.py:1568)
    _wrapped_call_impl (torch/nn/modules/module.py:1518)
    encode_images (moellava/model/llava_arch.py:152)
    prepare_inputs_labels_for_multimodal (moellava/model/llava_arch.py:198)
    forward (llava_mixtral.py:83)
    _call_impl (torch/nn/modules/module.py:1568)
    _wrapped_call_impl (torch/nn/modules/module.py:1518)
    forward (deepspeed/runtime/engine.py:1842)
    wrapped_fn (deepspeed/utils/nvtx.py:15)
    _call_impl (torch/nn/modules/module.py:1527)
    _wrapped_call_impl (torch/nn/modules/module.py:1518)
    compute_loss (transformers/trainer.py:2795)
    training_step (transformers/trainer.py:2772)
    _inner_training_loop (transformers/trainer.py:1868)
    train (transformers/trainer.py:1539)
    train (train.py:1475)
    <module> (train_mem.py:13)
Thread 3262753 (idle): "Thread-1"
    select (selectors.py:416)
    wait (multiprocessing/connection.py:931)
    wait_result_broken_or_wakeup (concurrent/futures/process.py:385)
    run (concurrent/futures/process.py:320)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 3264158 (idle): "Thread-2"
    wait (threading.py:324)
    wait (threading.py:607)
    run (tqdm/_monitor.py:60)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 3267395 (idle): "Thread-3 (_pin_memory_loop)"
    select (selectors.py:416)
    wait (multiprocessing/connection.py:931)
    _poll (multiprocessing/connection.py:424)
    poll (multiprocessing/connection.py:257)
    get (multiprocessing/queues.py:113)
    do_one_step (torch/utils/data/_utils/pin_memory.py:31)
    _pin_memory_loop (torch/utils/data/_utils/pin_memory.py:54)
    run (threading.py:953)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 3268088 (idle): "QueueFeederThread"
    wait (threading.py:320)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:953)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 3268152 (idle): "QueueFeederThread"
    wait (threading.py:320)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:953)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 3268153 (idle): "QueueFeederThread"
    wait (threading.py:320)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:953)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 3268154 (idle): "QueueFeederThread"
    wait (threading.py:320)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:953)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 3268155 (idle): "QueueFeederThread"
    wait (threading.py:320)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:953)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 3268156 (idle): "QueueFeederThread"
    wait (threading.py:320)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:953)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 3268157 (idle): "QueueFeederThread"
    wait (threading.py:320)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:953)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 3268158 (idle): "QueueFeederThread"
    wait (threading.py:320)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:953)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 3303923 (idle)
Thread 3303931 (idle)
Thread 3303916 (idle)
Thread 3303934 (idle)
Thread 3303942 (idle)
Thread 3303945 (idle)
Thread 3303952 (idle)
Thread 3303949 (idle)

MoE finetune error

The following error occurred while running the script finetune_moe.sh:
The model has moe layers, but None of the param groups are marked as MoE. Create a param group with 'moe' key set to True before creating optimizer.

OpenChat, quantisation, multi-image

  1. Please release the MoE-LLaVA-OpenChat model on Hugging Face (you have an example of training it in your training.sh).

  2. Is it possible to use quantisation with image-to-text pipelines in transformers?
    The reason for this is that OpenChat performs well quantized with GPTQ: https://huggingface.co/TheBloke/openchat-3.5-0106-GPTQ

  3. Your Video-LLaVA model offers image-to-video comparison. Does this model support image-to-image comparison? If so, please show an example in Python with transformers, with and without a pipeline.

Image not processed

When running deepspeed --include localhost:0 predict.py, I got the following error; it seems the image path is being passed to code that expects a loaded image:

Traceback (most recent call last):
  File "models_path/models/PKU_MoE/MoE-LLaVA/predict.py", line 49, in <module>
    main()
  File "models_path/models/PKU_MoE/MoE-LLaVA/predict.py", line 22, in main
    image_tensor = image_processor.preprocess(image, return_tensors='pt')['pixel_values'].to(model.device, dtype=torch.float16)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "models_path/.venv/lib/python3.11/site-packages/transformers/models/siglip/image_processing_siglip.py", line 174, in preprocess
    images = make_list_of_images(images)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "models_path/.venv/lib/python3.11/site-packages/transformers/image_utils.py", line 162, in make_list_of_images
    raise ValueError(
ValueError: Invalid image type. Expected either PIL.Image.Image, numpy.ndarray, torch.Tensor, tf.Tensor or jax.ndarray, but got <class 'str'>.
[2024-02-01 13:29:06,242] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 119765

[Question] Scale down further to support IoT use cases?

Question

I'm trying to see what can run on an 8 GB Raspberry Pi 5, and it occurs to me that your approach might scale down really well. Any tips for replicating what you did with something like TinyLlama, or trying an 8-bit quantization of LLaVA-Phi? I'd love to try training some sort of student model from the more successful models you've trained, as an experiment.

[Question] Image patch representation in this work

Question

Hello.
First, thank you for your assistance in debugging the Qwen1.5 problem; I have achieved remarkable performance with Qwen1.5.
I am now working on integrating your codebase with LLaVA-NeXT (aiming to add its high-resolution support), and I have a question about the image patch representation in your code.

As shown in the official LLaVA repo, the image feature map is flattened explicitly. But in your implementation I did not find any operation that flattens the image features. I am curious about how the image features are organized in your work.
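A hedged illustration of why no explicit flatten may be needed: a CLIP ViT-L/14-336 tower already returns its patches as a [B, 1 + 24x24, C] token sequence, so selecting everything after the CLS token is effectively the flattened feature map, and a reshape is only needed when the 2D grid has to be recovered (as LLaVA-NeXT does for high resolution). The tensor below is a stand-in, not the repo's actual code path:

import torch

vision_out = torch.randn(2, 577, 1024)            # stand-in for CLIP ViT-L/14-336 hidden states
patches = vision_out[:, 1:, :]                    # drop CLS -> [B, 576, 1024], already "flat"
grid = patches.reshape(2, 24, 24, 1024)           # recover the 24x24 spatial grid if needed
assert torch.equal(grid.flatten(1, 2), patches)   # flattening the grid gives back the sequence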

Error in predict.py

I ran predict.py and encountered the following error.

  File "/apdcephfs/private_kuofenggao/Agent/MoE-LLaVA/moellava/model/language_model/llava_phi_moe.py", line 385, in forward
    moe_loss = self.router_aux_loss_coef * sum(moe_losses)
                                           ^^^^^^^^^^^^^^^
TypeError: unsupported operand type(s) for +: 'int' and 'list'

I found the code where the error occurs.

moe_loss, moe_losses = None, []
if len(outputs[-1]) > 0:
    moe_loss_list = outputs[-1]
    import pdb
    pdb.set_trace()
    for moe_loss in moe_loss_list:
        if moe_loss is not None:
            moe_losses.append(moe_loss)
    moe_loss = self.router_aux_loss_coef * sum(moe_losses)

The observation indicates that moe_loss_list is equal to moe_losses. Has an error occurred? Have you encountered a similar issue?

(Pdb) moe_loss_list
[[tensor(1.0414, device='cuda:0'), tensor(0., device='cuda:0')], [tensor(1.0441, device='cuda:0'), tensor(0., device='cuda:0')], [tensor(1.0763, device='cuda:0'), tensor(0., device='cuda:0')], [tensor(1.1166, device='cuda:0'), tensor(0., device='cuda:0')], [tensor(1.0224, device='cuda:0'), tensor(0., device='cuda:0')], [tensor(1.1178, device='cuda:0'), tensor(0., device='cuda:0')], [tensor(1.5922, device='cuda:0'), tensor(0., device='cuda:0')], [tensor(1.6191, device='cuda:0'), tensor(0., device='cuda:0')], [tensor(1.1845, device='cuda:0'), tensor(0., device='cuda:0')], [tensor(1.1891, device='cuda:0'), tensor(0., device='cuda:0')], [tensor(1.2862, device='cuda:0'), tensor(0., device='cuda:0')], [tensor(1.3038, device='cuda:0'), tensor(0., device='cuda:0')], [tensor(1.3900, device='cuda:0'), tensor(0., device='cuda:0')], [tensor(1.9966, device='cuda:0'), tensor(0., device='cuda:0')], [tensor(1.4283, device='cuda:0'), tensor(0., device='cuda:0')], [tensor(1.4523, device='cuda:0'), tensor(0., device='cuda:0')]]
(Pdb) moe_losses
[[tensor(1.0414, device='cuda:0'), tensor(0., device='cuda:0')], [tensor(1.0447, device='cuda:0'), tensor(0., device='cuda:0')], [tensor(1.0767, device='cuda:0'), tensor(0., device='cuda:0')], [tensor(1.1124, device='cuda:0'), tensor(0., device='cuda:0')], [tensor(1.0219, device='cuda:0'), tensor(0., device='cuda:0')], [tensor(1.1151, device='cuda:0'), tensor(0., device='cuda:0')], [tensor(1.5845, device='cuda:0'), tensor(0., device='cuda:0')], [tensor(1.6215, device='cuda:0'), tensor(0., device='cuda:0')], [tensor(1.1875, device='cuda:0'), tensor(0., device='cuda:0')], [tensor(1.1940, device='cuda:0'), tensor(0., device='cuda:0')], [tensor(1.2809, device='cuda:0'), tensor(0., device='cuda:0')], [tensor(1.3052, device='cuda:0'), tensor(0., device='cuda:0')], [tensor(1.3910, device='cuda:0'), tensor(0., device='cuda:0')], [tensor(2.0025, device='cuda:0'), tensor(0., device='cuda:0')], [tensor(1.4418, device='cuda:0'), tensor(0., device='cuda:0')], [tensor(1.4324, device='cuda:0'), tensor(0., device='cuda:0')]]
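A hedged sketch of one way to avoid the TypeError, assuming outputs[-1] holds one list of per-layer auxiliary losses per forward pass at generation time (i.e. a list of lists); this is an assumption about the cause, not necessarily the intended fix:

import torch

def aggregate_moe_aux_loss(moe_loss_list, router_aux_loss_coef):
    """Flatten a possibly nested list of per-layer auxiliary losses and scale the sum.

    Returns None when there is nothing to aggregate (e.g. pure inference).
    """
    flat = []
    for item in moe_loss_list:
        if isinstance(item, (list, tuple)):
            flat.extend(x for x in item if x is not None)
        elif item is not None:
            flat.append(item)
    if not flat:
        return None
    return router_aux_loss_coef * torch.stack([torch.as_tensor(x) for x in flat]).sum()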

Question about inference-efficiency comparison

Describe the issue

One advantage of MoE is inference efficiency, but the paper does not include a comparison with VLMs of the same size. What do the actual numbers look like?

Panic on finetune

When I run the second (finetuning) stage, I get the error below. How can I resolve it?

    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1568, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/export/App/training_platform/PinoModel/moe-llava/moellava/model/language_model/llava_mixtral.py", line 83, in forward
    ) = self.prepare_inputs_labels_for_multimodal(
  File "/export/App/training_platform/PinoModel/moe-llava/moellava/model/llava_arch.py", line 302, in prepare_inputs_labels_for_multimodal
    cur_image_features = image_features[cur_image_idx].to(self.device)
IndexError: list index out of range

When I integrated Mixtral 8x7B into the MoE-LLaVA architecture, strange things happened

1. The architecture can only be trained with zero3_offload.json.
2. When my training data comes from the Video-LLaVA sources and contains both image and video samples,
training gets stuck after only 270 steps and then waits until the NCCL timeout fires.
The error messages are as follows:

'loss': 6.8931, 'learning_rate': 0.0009831912069017822, 'epoch': 0.03}
{'loss': 6.8843, 'learning_rate': 0.0009838432886246189, 'epoch': 0.03}
  3%|| 270/9847 [06:18<3:39:12,  1.37s/it]Invalidate trace cache @ step 1: expected module 15, but got module 313
[E ProcessGroupNCCL.cpp:467] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=202558, OpType=_ALLGATHER_BASE, NumelIn=2359296, NumelOut=18874368, Timeout(ms)=1800000) ran for 1800376 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:467] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=202558, OpType=_ALLGATHER_BASE, NumelIn=2359296, NumelOut=18874368, Timeout(ms)=1800000) ran for 1800392 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:467] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=202558, OpType=_ALLGATHER_BASE, NumelIn=2359296, NumelOut=18874368, Timeout(ms)=1800000) ran for 1800436 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:467] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=202558, OpType=_ALLGATHER_BASE, NumelIn=75264, NumelOut=602112, Timeout(ms)=1800000) ran for 1800709 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:467] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=202558, OpType=_ALLGATHER_BASE, NumelIn=2359296, NumelOut=18874368, Timeout(ms)=1800000) ran for 1800860 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:467] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=202558, OpType=_ALLGATHER_BASE, NumelIn=2359296, NumelOut=18874368, Timeout(ms)=1800000) ran for 1800869 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:467] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=202558, OpType=_ALLGATHER_BASE, NumelIn=2359296, NumelOut=18874368, Timeout(ms)=1800000) ran for 1800882 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:467] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=202558, OpType=_ALLGATHER_BASE, NumelIn=2359296, NumelOut=18874368, Timeout(ms)=1800000) ran for 1800988 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:481] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:487] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:852] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=202558, OpType=_ALLGATHER_BASE, NumelIn=75264, NumelOut=602112, Timeout(ms)=1800000) ran for 1800709 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=202558, OpType=_ALLGATHER_BASE, NumelIn=75264, NumelOut=602112, Timeout(ms)=1800000) ran for 1800709 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:481] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:487] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:852] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=202558, OpType=_ALLGATHER_BASE, NumelIn=2359296, NumelOut=18874368, Timeout(ms)=1800000) ran for 1800392 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=202558, OpType=_ALLGATHER_BASE, NumelIn=2359296, NumelOut=18874368, Timeout(ms)=1800000) ran for 1800392 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:481] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:487] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:852] [Rank 7] NCCL watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=202558, OpType=_ALLGATHER_BASE, NumelIn=2359296, NumelOut=18874368, Timeout(ms)=1800000) ran for 1800860 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 7] NCCL watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=202558, OpType=_ALLGATHER_BASE, NumelIn=2359296, NumelOut=18874368, Timeout(ms)=1800000) ran for 1800860 milliseconds before timing out.
[2024-02-03 11:48:17,107] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 2694488
[2024-02-03 11:48:17,108] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 2694489
[E ProcessGroupNCCL.cpp:481] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:481] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:487] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:487] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:852] [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=202558, OpType=_ALLGATHER_BASE, NumelIn=2359296, NumelOut=18874368, Timeout(ms)=1800000) ran for 1800436 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:852] [Rank 6] NCCL watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=202558, OpType=_ALLGATHER_BASE, NumelIn=2359296, NumelOut=18874368, Timeout(ms)=1800000) ran for 1800376 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=202558, OpType=_ALLGATHER_BASE, NumelIn=2359296, NumelOut=18874368, Timeout(ms)=1800000) ran for 1800436 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 6] NCCL watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=202558, OpType=_ALLGATHER_BASE, NumelIn=2359296, NumelOut=18874368, Timeout(ms)=1800000) ran for 1800376 milliseconds before timing out.
[2024-02-03 11:48:20,395] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 2694490
[E ProcessGroupNCCL.cpp:481] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:487] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:481] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:852] [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=202558, OpType=_ALLGATHER_BASE, NumelIn=2359296, NumelOut=18874368, Timeout(ms)=1800000) ran for 1800869 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:487] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=202558, OpType=_ALLGATHER_BASE, NumelIn=2359296, NumelOut=18874368, Timeout(ms)=1800000) ran for 1800869 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:852] [Rank 4] NCCL watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=202558, OpType=_ALLGATHER_BASE, NumelIn=2359296, NumelOut=18874368, Timeout(ms)=1800000) ran for 1800988 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 4] NCCL watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=202558, OpType=_ALLGATHER_BASE, NumelIn=2359296, NumelOut=18874368, Timeout(ms)=1800000) ran for 1800988 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:481] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:487] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:852] [Rank 5] NCCL watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=202558, OpType=_ALLGATHER_BASE, NumelIn=2359296, NumelOut=18874368, Timeout(ms)=1800000) ran for 1800882 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 5] NCCL watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=202558, OpType=_ALLGATHER_BASE, NumelIn=2359296, NumelOut=18874368, Timeout(ms)=1800000) ran for 1800882 milliseconds before timing out.
[2024-02-03 11:48:25,872] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 2694491
[2024-02-03 11:48:27,258] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 2694492
[2024-02-03 11:48:27,261] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 2694493
[2024-02-03 11:48:27,263] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 2694494
[2024-02-03 11:48:27,265] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 2694495
[2024-02-03 11:48:27,267] [ERROR] [launch.py:322:sigkill_handler] ['/usr/bin/python', '-u', 'moellava/train/train_mem.py', '--local_rank=7', '--deepspeed', './scripts/zero3_offload.json', '--model_name_or_path', '/export/App/training_platform/PinoModel/mixtral/Mixtral-8x7B-Instruct-v0.1', '--version', 'mixtral', '--data_path', '/mnt/moe/moe/dataset/data_root/train_json/pretrain/valley_llavaimage.json', '--image_folder', '/mnt/moe/moe/dataset/data_root', '--image_tower', '/export/App/training_platform/PinoModel/openai/clip-vit-large-patch14-336', '--image_projector_type', 'mlp2x_gelu', '--video_tower', '/export/App/training_platform/PinoModel/LanguageBind/LanguageBind_Video_merge', '--video_folder', '/mnt/moe/moe/dataset/data_root', '--tune_mm_mlp_adapter', 'True', '--mm_vision_select_layer', '-2', '--mm_use_im_start_end', 'False', '--mm_use_im_patch_token', 'False', '--bf16', 'True', '--output_dir', './checkpoints/llavamixtral-7b-pretrain', '--num_train_epochs', '1', '--per_device_train_batch_size', '16', '--per_device_eval_batch_size', '4', '--gradient_accumulation_steps', '1', '--evaluation_strategy', 'no', '--save_strategy', 'steps', '--save_steps', '2400', '--save_total_limit', '1', '--learning_rate', '1e-3', '--weight_decay', '0.', '--warmup_ratio', '0.03', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--tf32', 'True', '--model_max_length', '2048', '--gradient_checkpointing', 'True', '--dataloader_num_workers', '8', '--lazy_preprocess', 'True', '--report_to', 'tensorboard', '--cache_dir', './cache_dir'] exits with return code = -6

3. When the training data from the Video-LLaVA sources contains only image data, there are no errors at all, and pretraining runs to completion.

Why is this happening?

Support CUDA 12

Describe the issue

When I run deepspeed --include localhost:0 moellava/serve/gradio_web_server.py --model-path "LanguageBind/MoE-LLaVA-Phi2-2.7B-4e", it shows the error below:
OSError: libcufft.so.10: cannot open shared object file: No such file or directory

> Hi, everyone. Sorry for that, we updated the new running command to fix it. Check it out [here](https://github.com/PKU-YuanGroup/MoE-LLaVA/blob/main/scripts/v1/qwen/finetune_moe.sh)

I still get the following error:

AssertionError: The model has moe layers, but None of the param groups are marked as MoE. Create a param group with 'moe' key set to True before creating optimizer

Here is my command:

torchrun $DISTRIBUTED_ARGS moellava/train/train_mem.py \
    --moe_enable True --num_experts ${num_experts} --top_k_experts ${top_k_experts} --capacity_factor 1.5 \
    --moe_mode ${moe_mode} --use_residual ${use_residual} --router_aux_loss_coef ${router_aux_loss_coef} \
    --train_modules mlp.w1 mlp.w2 mlp.c_proj wg \
    --deepspeed ./scripts/zero2.json \
    --model_name_or_path ./checkpoints/llavaqwen1.5-1.8b-finetune \
    --version qwen \
    --data_path ${JSON_FOLDER}/llava_image_tune_.json ${JSON_FOLDER}/nlp_tune.json \
    --image_folder ${IMAGE_FOLDER} \
    --image_tower openai/clip-vit-large-patch14-336 \
    --image_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --bf16 True \
    --output_dir ./checkpoints/llavaqwen-1.8b-finetune-moe \
    --num_train_epochs 1 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 2 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 24000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to tensorboard \
    --cache_dir "./cache_dir"

Here are my package versions:

accelerate                0.21.0
deepspeed                 0.9.5
torch                     2.0.1
torchvision               0.15.2
transformers              4.37.0

Originally posted by @hxhcreate in #17 (comment)
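For context on what the assertion checks: DeepSpeed wants the expert parameters in an optimizer param group carrying a 'moe': True key, and recent DeepSpeed releases ship a helper for that split. A minimal sketch of the helper in isolation; whether the repo's trainer applies it when launched via torchrun instead of the deepspeed launcher is exactly what this issue calls into question, so treat the wiring below as an assumption:

from deepspeed.moe.utils import split_params_into_different_moe_groups_for_optimizer

def build_param_groups(model):
    """Split a model's parameters so expert params land in groups marked with 'moe': True."""
    base_group = {
        "params": [p for p in model.parameters() if p.requires_grad],
        "name": "all_trainable",
    }
    # Returns a list of groups; groups holding MoE expert params carry {"moe": True}.
    return split_params_into_different_moe_groups_for_optimizer(base_group)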

Memory usage is too high in the finetuning stage

During the finetuning stage, my machine has 1.9 TB of RAM, and while training runs the memory usage reaches the full 1.9 TB with 354 processes running.

However, checkpointing needs additional memory, so the checkpoint step gets OOM-killed and the run exits.

How can this problem be solved?

License Questions

Is this available for commercial use? I see the Apache 2.0 license is used, but on Hugging Face it says it is a research preview only.

Problem reproducing the stage 1 and stage 2 models on L40S

Thank you for your excellent work.
I followed your work and downloaded the released dataset from your link.
Since you kindly provided an end-to-end script and processed dataset files, I thought we could quickly reproduce your excellent results. After two days of training we obtained our LLaVA-Phi2 model, and it runs inference with your code.

However, it cannot reproduce the accuracy reported in your paper. Would you mind sharing any training logs or more detailed information with us, so we can debug the training process and find out what happened?
