penghao-wu / vstar
PyTorch Implementation of "V* : Guided Visual Search as a Core Mechanism in Multimodal LLMs"
Home Page: https://vstar-seal.github.io/
License: MIT License
Hi, congrats on the great work!
I noticed that in the released V* benchmark, the correct answer for each question is always the first option. Did the authors shuffle the options when evaluating on the benchmark (Table 1 in the paper)? Empirically, I found that for models such as LLaVA1.5, accuracy is much lower when the options are shuffled than when they are not.
Thanks!
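For readers who want to check the option-order effect described above, here is a minimal sketch of shuffling the choices while tracking where the correct answer ends up. It assumes each benchmark entry provides a list of options with the correct one at index 0, as the post reports; the function names `shuffle_options` and `format_prompt` and the lettered prompt layout are hypothetical and are not taken from the repository.

```python
import random


def shuffle_options(options, correct_index=0, seed=None):
    """Shuffle answer options and return (shuffled_options, new_correct_index).

    Assumes the correct answer sits at `correct_index` in the input list
    (index 0 in the released benchmark, according to the post above).
    """
    rng = random.Random(seed)  # local RNG so the shuffle is reproducible
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    # The correct answer now lives wherever its old index landed.
    return shuffled, order.index(correct_index)


def format_prompt(question, options):
    """Build a lettered multiple-choice prompt (hypothetical layout)."""
    letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    lines = [question]
    lines += [f"({letters[i]}) {opt}" for i, opt in enumerate(options)]
    return "\n".join(lines)


if __name__ == "__main__":
    opts = ["red", "blue", "green", "yellow"]
    shuffled, answer_idx = shuffle_options(opts, correct_index=0, seed=0)
    print(format_prompt("What color is the cup?", shuffled))
    print("correct option:", "ABCD"[answer_idx])
```

Fixing the seed per question keeps a shuffled evaluation reproducible, so accuracy with and without shuffling can be compared on the same prompts.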
May I ask how many GPUs you use when training the model? Thank you!
Hi,
First of all, congratulations on this great work.
I have a question related to the benchmark. The .json files contain a 'bbox' parameter that holds the coordinates of the objects to be located, in the format [x, y, width, height] (I assume). Two questions about it:
Thanks in advance.
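For reference, if the 'bbox' values do follow the [x, y, width, height] layout assumed in the post above (this is the poster's guess, not something confirmed here), converting a box to corner coordinates or to its center could look like the sketch below. The helper names are hypothetical.

```python
def xywh_to_xyxy(bbox):
    """Convert [x, y, width, height] to [x_min, y_min, x_max, y_max].

    Assumes (x, y) is the top-left corner, which is an assumption about
    the benchmark files rather than a documented fact.
    """
    x, y, w, h = bbox
    return [x, y, x + w, y + h]


def bbox_center(bbox):
    """Return the (cx, cy) center of an [x, y, width, height] box."""
    x, y, w, h = bbox
    return (x + w / 2, y + h / 2)


if __name__ == "__main__":
    box = [10, 20, 30, 40]
    print(xywh_to_xyxy(box))  # corner form of the same box
    print(bbox_center(box))   # center point of the box
```

Drawing the converted corners on a few benchmark images is a quick way to verify whether the assumed layout is right.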
I encountered the following problems when running python app.py. Can you help me find out why?
Hi, I'd like to run the demo on a 4090 with 24G VRAM, is that enough?
As the title shows, some data comes with many YOLO-style rectangle labels.
Is there some important info I have ignored?
best wishes.
Hi,
First, let me congratulate you; I think your approach is spot on. Efficient visual processing very likely requires simulating how the fovea works: focusing on a reduced area and moving around depending on what has been found.
I tried your demo on an RTX 4090 with 24 GB, and there is evidently not enough memory: the model appears to be offloaded to the CPU, since one example takes 5 minutes to run.
Setting the 'load in 8 bits' flag to True lets the models fit in GPU memory; however, the code is evidently not compatible with bitsandbytes, since a few blocking exceptions are raised.
I would be grateful if you could make the required changes or simply reduce the GPU memory requirements. This would allow more people to test your great work.
Hi authors, congrats on this great work!
May I know what prompt you used for evaluating GPT4V? We tested it ourselves but found that the results were quite different, especially on the spatial relationship subset (where the accuracy is far below 50%, even for two-option MCQs).
I guess your paper is being reviewed, and there might be more changes. Therefore, some of my recommendations might be irrelevant.
Version: https://arxiv.org/pdf/2312.14135v2.pdf
This figure should be referenced somewhere first (e.g. in the paragraph where you mention the V* mechanism for the first time), because to me it is kinda out of context. I don't really know which part of the text is closely related to this figure.
You should explain what the symbols are.
I tried several instances, but the demo couldn't find a corresponding result and only returned the original result. I tried questions like "What number is the pointer pointing to?" and "What color is the person on the left wearing?"