penghao-wu / vstar

490.0 490.0 32.0 19.3 MB

PyTorch Implementation of "V* : Guided Visual Search as a Core Mechanism in Multimodal LLMs"

Home Page: https://vstar-seal.github.io/

License: MIT License

Python 99.58% Shell 0.42%

vstar's Issues

What are the supported instances?

I tried several instances but could not get a corresponding result; the demo only returned the original result. I tried questions like "What number is the pointer pointing to?" and "What color is the person on the left wearing?"

Great but Very Slow with 24GB

Hi,

First, let me congratulate you: I think your approach is spot on. To process visual input efficiently, it is very likely that one should simulate how the fovea works, that is, focus on a reduced area and move around depending on what has been found.

I tried your demo on an RTX 4090 with 24GB, and there is evidently not enough memory: the models appear to be offloaded to the CPU, since it takes 5 minutes to run one example.

With the flag 'load in 8 bits' set to True, the models fit in GPU memory. However, the code is evidently not compatible with bitsandbytes, since a few blocking exceptions are raised.

I would be grateful if you could make the required changes or simply reduce the GPU memory requirements. This would allow more people to test your great work.

Expected output of the final answer?

Hi, I asked "What's the color of the content in the cup?" with the following image:

Here is the output:

What's the meaning of the values in the final answer?

Error: Missing preprocessor_config.json in craigwu/seal_vsm_7b Repository

Hello!

I am trying to run the code from the repository on Google Colab. I managed to clone the repository and install the requirements, but when I try to instantiate the models with:

vsm_model_url = "craigwu/seal_vsm_7b"
vqa_model_url = "craigwu/seal_vqa_7b"

I get the following error:

OSError: craigwu/seal_vsm_7b does not appear to have a file named preprocessor_config.json. Check 'https://huggingface.co/craigwu/seal_vsm_7b/tree/main' for available files.

Indeed, the repository does not have the specified file, which prevents me from progressing. My goal is to run the code that, given an image and a prompt, displays bounding boxes and their coordinates for the object indicated in the prompt. So far, I have not been successful. I would appreciate any guidance on how to proceed or if it is possible to achieve this with the current version of the repository. Thanks!

How much memory do I need to run the demo?

Hi, I'd like to run the demo on a 4090 with 24GB of VRAM. Is that enough?
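A rough back-of-the-envelope sketch for this question, counting model weights only. It assumes that the "7b" in craigwu/seal_vsm_7b and craigwu/seal_vqa_7b means roughly 7 billion parameters each, and that both models are resident at once; activations, vision encoders, and framework overhead are ignored, so real usage would be higher. The function and constant names are illustrative, not from the repository.

```python
GIB = 1024 ** 3  # bytes per GiB


def weight_memory_gib(num_params: float, bytes_per_param: float) -> float:
    """Memory needed to hold the weights alone, in GiB."""
    return num_params * bytes_per_param / GIB


# Assumption: "7b" in the model names means ~7e9 parameters per model.
PARAMS_PER_MODEL = 7e9
NUM_MODELS = 2  # the search model (VSM) and the VQA model

# 2 bytes per parameter in fp16, 1 byte per parameter in 8-bit.
fp16_total = NUM_MODELS * weight_memory_gib(PARAMS_PER_MODEL, 2)
int8_total = NUM_MODELS * weight_memory_gib(PARAMS_PER_MODEL, 1)

print(f"fp16 weights: {fp16_total:.1f} GiB")
print(f"8-bit weights: {int8_total:.1f} GiB")
```

Under these assumptions the fp16 weights alone exceed 24GB, while 8-bit weights would fit with room to spare, which is consistent with the slow CPU offloading reported in the issue above.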

Some recommendations about your paper

I guess your paper is being reviewed, and there might be more changes. Therefore, some of my recommendations might be irrelevant.

Version: https://arxiv.org/pdf/2312.14135v2.pdf

Figure 1

This figure should be referenced somewhere first (e.g. in the paragraph where you first mention the V* mechanism), because to me it is kinda out of context. I don't really know which part of the text is closely related to this figure.

Algorithm 1

You should explain what the symbols are.

In V* Benchmark evaluation, are the options randomly shuffled for each question?

Hi, congrats on the great work!

I noticed that in the released V* benchmark, the correct answer for each question is always the first option. I wonder whether the authors shuffled the options when evaluating on the benchmark (Table 1 in the paper)? I found empirically that for models such as LLaVA1.5, accuracy is much lower when the options are shuffled than when they are not.

Thanks!
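A minimal sketch of the kind of shuffling this issue describes, assuming (as the poster reports) that the first option in each benchmark entry is the correct one. The function name `shuffle_options` and the sample options are hypothetical; this is not the authors' evaluation code.

```python
import random


def shuffle_options(options, seed=None):
    """Shuffle answer options and return (shuffled, index_of_correct).

    Assumes options[0] is the correct answer in the input list.
    """
    rng = random.Random(seed)
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    # The correct answer is wherever original index 0 ended up.
    correct_index = order.index(0)
    return shuffled, correct_index


# Hypothetical example entry; not taken from the benchmark files.
options = ["red", "blue", "green", "yellow"]
shuffled, correct_index = shuffle_options(options, seed=0)
print(shuffled, "-> correct letter:", "ABCD"[correct_index])
```

Fixing the seed per question keeps the evaluation reproducible while removing any advantage from always answering with the first option.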

V*Bench

Hi,

First of all, congratulations on this great work.

I have a question related to the benchmark. In the .json files, there is a 'bbox' parameter that holds the coordinates of the objects to be located, in the format [x, y, width, height] (I assume). Two questions about it:

1. How did you determine the coordinates of the object? Did a human "draw" the bounding box and then write the coordinates into the .json file? Or did you ask the LLM to locate the object, have it print the bounding box coordinates, and then write those into the .json file?
2. After reviewing the code, I don't see the point of this 'bbox' parameter in the .json files. You don't use it anywhere in the 'vstar_bench_eval' script, right? I only see the 'question' and 'options' parameters of the 'annotation' variable being used in the main function. Am I missing something?

Thanks in advance.
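For readers who want to inspect the 'bbox' values, here is a small sketch that converts a box to corner coordinates, e.g. for cropping or drawing. It relies on the poster's assumption that the format is [x, y, width, height] with (x, y) as the top-left corner; the source does not confirm this, and the helper name is hypothetical.

```python
def xywh_to_xyxy(bbox):
    """Convert [x, y, width, height] to [x_min, y_min, x_max, y_max].

    Assumes (x, y) is the top-left corner, which the issue only guesses at.
    """
    x, y, w, h = bbox
    return [x, y, x + w, y + h]


# Hypothetical box values, not taken from the benchmark files.
print(xywh_to_xyxy([10, 20, 30, 40]))
```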

Improvement for counting?

Hi, thanks for the great work! I've played with it for a while and it's very impressive!
But the model failed on most of my counting questions. Is there room for improvement?

For example, when I ask: "How many glasses of wine are here?"

Training cost

May I ask how many GPUs you use when training the model? Thank you!

python app.py

I encountered the following problems when running python app.py. Can you help me find out why?

app.py

Hello, thanks for your great work. However, I ran app.py and encountered an error. How can I resolve it?

Some questions about the valid check while inference

I noticed that there are several lines of code in the inference function:

## input valid check
if not re.match(r"^[A-Za-z ,.!?\'\"]+$", input_str) or len(input_str) < 1:
	output_str = "[Error] Invalid input: ", input_str
	return output_str, None

Is this check there because the search model's vocabulary does not contain any characters other than these?
I would really like to figure this out, because some of my inputs contain common characters like ":" and ";", and these are rejected, which is very frustrating.
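A short sketch of what the quoted pattern accepts and rejects. `ORIGINAL_PATTERN` is copied from the check quoted in this issue; `RELAXED_PATTERN` and `is_valid` are hypothetical illustrations of how the character class could be widened, not a change the authors have endorsed. Whether the models behave well on the extra characters is not established here.

```python
import re

# Pattern copied from the validity check quoted in the issue.
ORIGINAL_PATTERN = r"^[A-Za-z ,.!?\'\"]+$"
# Hypothetical relaxation: also allow digits, ":" and ";".
RELAXED_PATTERN = r"^[A-Za-z0-9 ,.!?:;\'\"]+$"


def is_valid(input_str, pattern=ORIGINAL_PATTERN):
    """Return True if every character of input_str is in the allowed set."""
    return bool(re.match(pattern, input_str))


for text in ["What color is the cup?", "Question: what color?", "Find cup 2"]:
    print(repr(text), is_valid(text), is_valid(text, RELAXED_PATTERN))
```

Note that the original pattern also rejects digits, and that the empty string already fails the `+` quantifier, so the separate length check in the quoted code never triggers on its own.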

Prompt for GPT4V

Hi authors, congrats on this great work!

May I know what prompt you used for evaluating GPT4V? We ran the evaluation ourselves but got quite different results, especially on the spatial relationship subset (where the accuracy is far below 50%, even though the questions are two-option MCQs).

How to finetune it with my own dataset

As the title says: I have some data with many YOLO-style rectangle labels.

Is there some important info I have ignored?

best wishes.
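As a starting point for preparing such data, the sketch below converts one YOLO-style label line (class id followed by normalized center x, center y, width, height) into an absolute-pixel [x, y, width, height] box, the layout that the V*Bench issue above assumes for 'bbox'. The data format the repository's training code actually expects is not stated in these issues, so the target layout and the helper name are assumptions.

```python
def yolo_to_xywh(label_line, image_width, image_height):
    """Convert 'class cx cy w h' (normalized) to (class_id, [x, y, w, h]) in pixels.

    (x, y) is the top-left corner of the box.
    """
    parts = label_line.split()
    class_id = int(parts[0])
    cx, cy, w, h = (float(v) for v in parts[1:5])
    box_w = w * image_width
    box_h = h * image_height
    x = cx * image_width - box_w / 2
    y = cy * image_height - box_h / 2
    return class_id, [x, y, box_w, box_h]


# Hypothetical label line for a 640x480 image.
print(yolo_to_xywh("0 0.5 0.5 0.5 0.5", 640, 480))
```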
