jshilong / gpt4roi Goto Github PK
View Code? Open in Web Editor NEWGPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest
License: Other
GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest
License: Other
Professor, it took me a few days to figure out my previous mistakes. But this error cannot be solved. Can you do me a favor? I have resolved all the environment versions, but they still don't work.
When I input promote, the model will report this error, and the following error will appear in the code.
Happy New Year's Day, Professor. I am indeed quite clumsy, could you please give me some guidance.
@jshilong Hello, in my training process of stage 2, the loss is always zero, like the below figure. Is it normal?
In the paper, this code does have a demo, but did you develop evaluation script on dev set or some existing datasets?
Hi Authors,
I was working with the flickr30k dataset and noticed that it returns the original bounding boxes (ori_bboxes) directly, whereas other referring expression datasets utilize selected bounding boxes (select_bboxes) to replace ori_boxes.
ori_bboxes = torch.cat([ori_bboxes], dim=0)
Wouldn't this result in a mismatch between the region questions and the bounding boxes? Could you shed some light on this? Am I missing something?
Thank you!
How can I design WORKDIR and STAGE1WORKDIR if I want to continue fine-tuning on your GPT4RoI weight node,
In the video of #41 I demonstrate an error when running the demo.
It seems to be missing required property in one of the events sent through gradio
Given the error occurs inside gradio-dev
runtime I am unsure if is due to the app.py
sending the wrong data, or if there is some issue inside the actual gradio-dev
package
Running on local URL: http://0.0.0.0:20012
To create a public link, set `share=True` in `launch()`.
Task exception was never retrieved
future: <Task finished name='6976h8jtnyr_7' coro=<Queue.process_events() done, defined at /workspaces/GPT4RoI/gradio-dev/gradio/queueing.py:342> exception=1 validation error for PredictBody
event_id
Field required [type=missing, input_value={'data': [], 'event_data'...on_hash': '6976h8jtnyr'}, input_type=dict]
For further information visit https://errors.pydantic.dev/2.5/v/missing>
Traceback (most recent call last):
File "/workspaces/GPT4RoI/gradio-dev/gradio/queueing.py", line 346, in process_events
client_awake = await self.gather_event_data(event)
File "/workspaces/GPT4RoI/gradio-dev/gradio/queueing.py", line 219, in gather_event_data
data, client_awake = await self.get_message(event, timeout=receive_timeout)
File "/workspaces/GPT4RoI/gradio-dev/gradio/queueing.py", line 448, in get_message
return PredictBody(**data), True
File "/home/vscode/miniconda3/envs/gpt4roi/lib/python3.9/site-packages/pydantic/main.py", line 164, in __init__
__pydantic_self__.__pydantic_validator__.validate_python(data, self_instance=__pydantic_self__)
pydantic_core._pydantic_core.ValidationError: 1 validation error for PredictBody
event_id
Field required [type=missing, input_value={'data': [], 'event_data'...on_hash': '6976h8jtnyr'}, input_type=dict]
For further information visit https://errors.pydantic.dev/2.5/v/missing
The chatbot works well according to the given region when I ask the first question. But for the second ask it keeps processing and never generates a sentence. It may be an issue to be fixed.
Hi @jshilong great work considering ROI for language models.
I am getting this error "ValueError: The following model_kwargs
are not used by the model: ['images']" while trying the inference code. Probably because 'images' is not considered as a parameter to the model.generate function.
with torch.amp.autocast(device_type='cuda'):
output_ids = self.model.generate(
input_ids,
images=image.unsqueeze(0).half().cuda(),
do_sample=True,
temperature=0.2,
max_new_tokens=1024,
stopping_criteria=[stopping_criteria])
Could you please confirm if you are using any specific version of the 'torch', 'llava', or 'transformers' library?
Thank you!
What an amazing job, and thanks for your contributions to the open source community, I'd like to try out some new ideas by using model weights, so do you have any plans to release weights anytime soon?
Hi, Thanks for your excellent work.
Now I ran into an issue when I tried to load GPT4ROI weights to perform stage2 training and there was an error
”Error(s) in loading state_dict for SPILlavaMPTForCausalLM:
size mismatch for lm_head.weight: copying a param with shape torch.Size([32006, 4096]) from checkpoint, the shape in current model is torch.Size([32005, 4096]).“
How to solve this problem?
Looking forward to your reply!
Hi @jshilong, thanks again for releasing the code and the models!
I am trying to finetune the model from stage 2. Could you please share a stage 2 checkpoint.
I am getting a 'ValueError: Can't find a valid checkpoint at ./exp/stage2/checkpoint-0' when trying to start from the current weight directory as the starting point.
Appreciate your help!
Hi, Currently the two bash scripts look similar. Can you please confirm the commands for the 1st and 2nd stage of training? I noticed that the data loading is being controlled from config. How exactly the model is properly frozen in two separate stages?
Hi,
I appreciate the effort you put into your framework, but I encountered some confusion while attempting to retrain it. The guidance suggests using the original LLaMA weights for training, but I noticed in your script that the model name input is set as vicuna-7b
: /mnt/petrelfs/share_data/zhangshilong/vicuna-7b/
.
I attempted to use both the original LLaMA and LLaVA huggingface format (haven't applied your delta since it haven't been released yet), but it always resulted in this error:
File "/gpt4roi/gpt4roi/train/train_mem.py", line 16, in <module>
train()
File "/gpt4roi/gpt4roi/train/train.py", line 641, in train
model.initialize_vision_tokenizer(mm_use_im_start_end=model_args.mm_use_im_start_end,
File "/gpt4roi/gpt4roi/models/spi_llava.py", line 295, in initialize_vision_tokenizer
raise ValueError(
ValueError: Unexpected embed_tokens_weight shape. Pretrained: torch.Size([2, 4096]). Current: torch.Size([32006, 4096]). Numer of new tokens: 0.
I would appreciate your guidance in resolving the error and making the code runnable. Could you please provide me with the necessary steps or adjustments to address the issue?
How much GPU memory is required for inference?
Hi, although the Hallucination_questions
and answer' is added into the
sources' as shown in line 442, the sources
was overwritten in line 449. Therefore, the coversations for solving hallucination are not actually added into the variable sources
.
(Pdb) list
450 # print(copy_source)
451 sources = preprocess_multimodal(
452 copy.deepcopy([e['conversations'] for e in copy_source]),
453 self.multimodal_cfg, cur_token_len)
454
455 -> data_dict = preprocess(
456 sources,
457 self.tokenizer)
458 if isinstance(i, int):
459 data_dict = dict(input_ids=data_dict['input_ids'][0],
460 labels=data_dict['labels'][0])
(Pdb) sources
[[{'from': 'human', 'value': 'Can you describe the main features of this image for me?\nThe <im_start><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_patch><im_end> provides an overview of the picture. \n'}, {'from': 'gpt', 'value': "The image shows a baseball field with a group of people gathered, possibly during a small-town baseball game. There is a batter on deck, most likely a female softball player, stretching and practicing her batting stance while holding a baseball bat. She's positioned closer to the left side of the field. \n\nSeveral people stand around the field, some closer to the batter and others farther away, possibly teammates, coaches, or spectators. There are a few sports balls scattered around the field, with two located near the center and another one toward the right side.\n\nOverall, the scene depicts an active sports event, with people of various ages and roles involved in the game."}]]
(Pdb)
Hello@jshilong, have you encountered this problem?
I have trained the model of both two stages. Then I merge the trained model with llama as you described.
When I load the merged model to do test, the errors in below occured.
Traceback (most recent call last):
File "/hy/code/gpt4roi/train_net.py", line 326, in <module>
launch(
File "/hy/code/gpt4roi/detectron2/detectron2/engine/launch.py", line 84, in launch
main_func(*args)
File "/hy/code/gpt4roi/train_net.py", line 311, in main
res = Trainer.test(cfg, model)
File "/hy/code/gpt4roi/detectron2/detectron2/engine/defaults.py", line 617, in test
results_i = inference_on_dataset(model, data_loader, evaluator)
File "/hy/code/gpt4roi/detectron2/detectron2/evaluation/evaluator.py", line 158, in inference_on_dataset
outputs = model(inputs)
File "/workspace/conda_env/gpt4roi/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/hy/code/gpt4roi/gpt4roi.py", line 153, in get_output
output_ids = self.model.generate(
File "/workspace/conda_env/gpt4roi/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/workspace/conda_env/gpt4roi/lib/python3.10/site-packages/transformers/generation/utils.py", line 1485, in generate
return self.sample(
File "/workspace/conda_env/gpt4roi/lib/python3.10/site-packages/transformers/generation/utils.py", line 2562, in sample
next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
Nice work! Is there any CLI interface for inference ? Thanks!
Hello, the download link of the pre-trained weights of huggingface you opened is not available, can you update it? Or other download channels. Thank you very much!
how can solve it?
@jshilong
Hi Authors,
Thank you for your great work.
While running the demo, I encountered an issue where, after loading an image and subsequently drawing a bounding box, there is no response upon entering text in the chatbox. This appears similar to the problem described in closed issue #9. I have ensured that the gradio_box is correctly set up and followed all provided instructions. The same error is experienced when executing app_box.py in gradio_box. I would really appreciate some help.
Thank you.
I want to use the demo provided by the author to verify the results, but the demo provided on the author's webpage cannot be used. How can I solve this problem?
Some errors occur as I run training code, which might be corresponding to transformer. Also, can you provide the version of other modules in requirements.txt?
Hello, thank you very much for your excellent work. However, I have some doubts about the dataset, and I would appreciate it if you could clarify them for me. Where can I download the train.json file for visual_genome? Do I need to run EVA-02-DET myself to obtain the llava_150k_bbox_pred_results.pkl file?
Hi,
Great work! I have a question w.r.t the vision backbone used in paper.
In the paper, it says ViT-H, while in both code and checkpoint, it shows ViT-L.
Thanks!
Hello. Thank you for your excellent work.
I encountered some issues while using the "gradio-box". I install the gradio-box following the instruction successfully. The first uploaded image works well with the gradio-box.
But when I upload the second image after clicking the clear button, it can not show image correctly.
The browser console has provided the following error message
Could you please answer it.
Hello, Shilong,
Would you like share your pretained weights of Stage 1?
Thanks a lot.
Hi,
Thanks for open-sourcing this great work! We are developing some region captioning models and would like to perform a fair comparison with GPT4ROI. Is it possible to release the VG validation data you used for calculating the scores in Table 4? Thanks in advance!
Hi,
I've observed that this code comes with built-in MMCV1.4.7 and MMDet. However, the native mmdet and mmcv may more convenient to use. So, could you delineate the main modifications in comparison with the native MMCV1.4.7 and MMDet?
Hi @jshilong, in the documentation, it's mentioned that GPT4RoI was trained on 8 A100 GPUs. Could you please provide insights into how much time it took for both stage-1 and stage-2? Having this information would be extremely helpful.
Thank you in advance.
Hi @jshilong, thanks for your great project!
I would like to reproduce your experimental results. Do you have a plan to release your evaluation scripts (e.g., Visual7W and VCR)? Thank you.
Hi, @jshilong @PeizeSun @ShoufaChen
I would like to ask some questions about "Table 4: Compariation of region caption ability on the validation dataset on Visual Genome".
Do you divide the validation dataset for VG region caption task by yourselves?
In the original VG dataset, it seems that there is no validation split.
Could you please provide a link or a README to the validation dataset with me?
Do you reproduce the result of GRiT?
In GRiT's paper, it also seems that there is no related experimental result (e.g., CIDEr for the validation dataset for VG region captioning).
Could you provide more details about this experiment?
Thank you in advance.
Hello, thank you very much for your excellent contribution. But I encountered some issues while using the "app. py" code. My Graph Box is all correct. However, after entering text and pressing Enter, the run function cannot be triggered, which means that the demo has no response. After debugging, we still haven't found the problem. Could you please answer it.
I faced the following error when I launched the 2nd stage of pre-training.
ValueError: loaded state dict contains a parameter group that doesn’t match the size of optimizer’s group
This error is likely due to number of trainable parameters are different in the 2nd stage than the 1st stage. How did you resolved this?
Hello, thank you for your contribution. I meet a question on line 66 of the file models/spi_llava.py,
image_forward_outs = vision_tower(images,output_hidden_states=True)
What is the structure of this vision_tower?
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.