vstar's People

Contributors

eltociear, penghao-wu

vstar's Issues

In V* Benchmark evaluation, are the options randomly shuffled for each question?

Hi, congrats on the great work!

I noticed that in the released V* benchmark, the correct answer for each question is always the first option. Did the authors shuffle the options when evaluating on the benchmark (Table 1 in the paper)? I found empirically that for models such as LLaVA1.5, accuracy is much lower when the options are shuffled than when they are not.

Thanks!
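A minimal sketch of the shuffling described in this question, assuming each annotation has a 'question' string and an 'options' list whose first entry is the correct answer (as observed in the released files). The helper names `shuffle_options` and `format_question`, the sample annotation, and the lettered prompt layout are hypothetical and are not taken from the repository's evaluation script.

```python
import random


def shuffle_options(options, rng):
    """Return (shuffled_options, index_of_correct_answer).

    Assumption: the correct answer is options[0] in the input list.
    The input list is not modified.
    """
    order = list(range(len(options)))
    rng.shuffle(order)                      # random permutation of positions
    shuffled = [options[i] for i in order]
    correct_index = order.index(0)          # where the original first option landed
    return shuffled, correct_index


def format_question(question, options):
    """Build a lettered multiple-choice prompt (hypothetical layout)."""
    letters = "ABCDEFGH"
    lines = [question]
    lines += [f"({letters[i]}) {opt}" for i, opt in enumerate(options)]
    return "\n".join(lines)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so an evaluation run is reproducible
    # Hypothetical annotation, for illustration only.
    annotation = {
        "question": "What is the color of the cup?",
        "options": ["red", "blue", "green", "white"],
    }
    shuffled, correct = shuffle_options(annotation["options"], rng)
    print(format_question(annotation["question"], shuffled))
    print("correct letter:", "ABCDEFGH"[correct])
```

Comparing accuracy with and without this shuffling, over the same questions, is one way to measure how much a model relies on answer position.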

Training cost

May I ask how many GPUs you use when training the model? Thank you!

V*Bench

Hi,

First of all, congratulations on this great work.

I have a question related to the benchmark. The .json files contain a 'bbox' parameter that gives the coordinates of the objects to be located, in the format [x, y, width, height] (I assume). Two questions about it:

  1. Did you determine the coordinates of the object manually? That is, did a human "draw" the bounding box and then write the coordinates into the .json file? Or did you ask the LLM to locate the object, print the bounding box coordinates, and then write them into the .json file?
  2. After reviewing the code, I don't see the purpose of this 'bbox' parameter in the .json files. You don't use it anywhere in the 'vstar_bench_eval' script, right? I only see the 'question' and 'options' parameters of the 'annotation' variable being used in the main function. Am I missing something?

Thanks in advance.
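For reference, a small sketch of how a box in the [x, y, width, height] layout assumed in this question could be converted to corner coordinates, for example to crop or visualize the target object. The layout itself is the poster's assumption and is not confirmed here; the function names and the example numbers are hypothetical.

```python
def xywh_to_xyxy(bbox):
    """Convert [x, y, width, height] to (left, top, right, bottom).

    Assumption: (x, y) is the top-left corner in pixel coordinates.
    """
    x, y, w, h = bbox
    return (x, y, x + w, y + h)


def relative_area(bbox, image_width, image_height):
    """Fraction of the image covered by the box, under the same assumption."""
    _, _, w, h = bbox
    return (w * h) / (image_width * image_height)


if __name__ == "__main__":
    # Hypothetical box and image size, for illustration only.
    box = [10, 20, 30, 40]
    print(xywh_to_xyxy(box))
    print(relative_area(box, 1000, 800))
```

The relative area gives a quick check of how small the annotated targets are compared with the full image.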

python app.py

I encountered the following problems when running python app.py. Can you help me find out why?

Expected output of the final answer?

Hi, I asked "What's the color of the content in the cup?" with the following image:
image

Here is the output:
image
image

What's the meaning of the values in the final answer?

Great but Very Slow with 24GB

Hi,

First, let me congratulate you, as I think your approach is spot on. Efficient visual processing very likely requires simulating how the fovea works: focus on a reduced area, then move around depending on what has been found.

I tried your demo on an RTX 4090 with 24GB, and there obviously isn't enough memory: the model appears to be offloaded to the CPU, since it takes 5 minutes to run one example.

Setting the 'load in 8 bits' flag to True lets the models fit in GPU memory, but the code is evidently not compatible with bitsandbytes, since a few blocking exceptions are raised.

I would be grateful if you could make the required changes or simply reduce the GPU memory requirements. This would allow more people to test your great work.
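A rough back-of-the-envelope sketch of why 8-bit loading changes whether the models fit in 24GB. It counts weight memory only (parameters times bytes per parameter) and ignores activations, caches, and framework overhead. The parameter counts below are hypothetical placeholders, not figures from the repository.

```python
GIB = 2 ** 30  # bytes per GiB


def weight_memory_gib(n_params, bits_per_param):
    """Memory needed for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / GIB


def total_weight_memory_gib(param_counts, bits_per_param):
    """Sum of weight memory over several models loaded at the same precision."""
    return sum(weight_memory_gib(n, bits_per_param) for n in param_counts)


if __name__ == "__main__":
    # Hypothetical: two models of 7 billion parameters each.
    models = [7e9, 7e9]
    gpu_gib = 24
    for bits in (16, 8):
        need = total_weight_memory_gib(models, bits)
        verdict = "fits" if need < gpu_gib else "does not fit"
        print(f"{bits}-bit weights: {need:.1f} GiB, {verdict} in {gpu_gib} GiB")
```

Under these assumed sizes, 16-bit weights alone exceed 24 GiB while 8-bit weights leave headroom, which is consistent with the behavior described above.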

app.py

image

Hello, thanks for your great work. However, I ran app.py and encountered an error. How can I resolve it?

Prompt for GPT4V

Hi authors, congrats on this great work!

May I know what prompt you used for evaluating GPT4V? We ran the evaluation ourselves and found that the results were quite different, especially on the spatial relationship subset (where the accuracy is far below 50%, even for two-option MCQs).

Improvement for counting?

Hi, thanks for the great work! I've played with it for a while and it's very, very impressive!
But the model failed on most of my counting questions. Is there room for improvement?

For example, when I ask "How many glasses of wine are here?":
image
image

Some recommendations about your paper

I guess your paper is being reviewed, and there might be more changes. Therefore, some of my recommendations might be irrelevant.

Version: https://arxiv.org/pdf/2312.14135v2.pdf

Figure 1

image

This figure should be referenced somewhere first (e.g. the paragraph that you mention V* mechanism for the first time) because to me, this figure is kinda out of context. I don't really know which part is closely related to this figure.

Algorithm 1

image

You should explain what the symbols are.

What are the supported instances?

I tried several instances, but couldn't get a corresponding result; the demo only returned the original result. I tried questions like "What number is the pointer pointing to?" and "What color is the person on the left wearing?"
