
monkey's People

Contributors

echo840, melosy, shuozhang2003, yuliang-liu


monkey's Issues

A Slight Issue on Image Captioning

Issue

As shown in the screenshots below, Monkey seems to get distracted when analysing this logo.
[screenshots of the demo output]

Where is "湖北"?

While Monkey produces the wrong result for the school name at first sight, it seems to have captured the information, and answers correctly when asked again in text afterwards:
[screenshot]

In my case, 7 tests were carried out, yielding 6 wrong answers and 1 response that did not mention the school's name.
The issue itself is not critical, though it still seems worth fixing :)

Minimal Reproduction Steps

  1. Download the Logo
  2. Visit the Demo or Demo_chat site
  3. Upload the logo, then click on "Generate"
  4. Wait for the result

Questions about using LoRA

Hello,
I want to make use of LoRA, and I have added the contents of model_qwen_nvdia3090.py. However, I have a few questions:

  1. Should I add '--use_lora' in finetune/finetune_ds_debug.sh, just like Qwen-VL?
  2. What should I do to freeze all modules except the LoRA and Resampler modules in finetune_multitask.py? (See the sketch after this list.)
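
Not an official recipe, just a minimal sketch of how one might freeze everything except the LoRA and Resampler parameters; the name substrings "lora" and "resampler" are assumptions about Monkey's parameter naming and should be checked against named_parameters():

def freeze_except_lora_and_resampler(model):
    # Keep gradients only for parameters whose names look like LoRA or
    # resampler weights (the substrings are assumptions; verify for your model).
    for name, param in model.named_parameters():
        param.requires_grad = any(k in name.lower() for k in ("lora", "resampler"))
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"trainable params: {trainable}")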

demo.py caption result is not the same as the online demo

@Yuliang-Liu Using the demo.py script, the caption result is: "333 Smooth lighting, perfect shading. Intricate and mesmerizing, surrounding finely shattered self-luminous rainbow."
What are your online demo parameter settings?

>>> kwargs = dict()
>>> kwargs['fp16'] = True
>>> kwargs['bf16'] = False
>>> model = MonkeyLMHeadModel.from_pretrained(checkpoint, device_map='cuda', **kwargs).eval()
>>> tokenizer = QWenTokenizer.from_pretrained(checkpoint)
>>> tokenizer.padding_side = 'left'
>>> tokenizer.pad_token_id = tokenizer.eod_id

>>> print(query)
<img>7c844f8f477e79c8dad934a907337f31_3</img> Write a comprehensive and concise caption and style of the image using the original caption:: "anime style.The latest flat anime character design artwork has hyper-exceptional amount of finely beautiful details, which is delicately generated by the most technically skilled illustrator. The best framing and the best composition from Hatsune Miku's hip to her frontal face. Being in highly fashionable feminine clothing. All the features and proportions and shapes of her face and eyes and hair and her perfect feminine body are delicately super precisely reproduced original Hatsune Miku of the THE VOCALOID official artworks true to life, the bishoujo's luscious loving pose. Pale color.::333 Smooth lighting, perfect shading. Intricate and mesmerizing, surrounding finely shattered self-luminous rainbow.::77 Letter.::-0.1 "

>>> input_ids = tokenizer(query, return_tensors='pt', padding='longest')
>>> attention_mask = input_ids.attention_mask
>>> input_ids = input_ids.input_ids
>>> pred = model.generate(
...             input_ids=input_ids.cuda(),
...             attention_mask=attention_mask.cuda(),
...             do_sample=True,
...             temperature=0.7,
...             max_new_tokens=250,
...             min_new_tokens=1,
...             length_penalty=3,
...             num_return_sequences=1,
...             output_hidden_states=True,
...             use_cache=True,
...             pad_token_id=tokenizer.eod_id,
...             eos_token_id=tokenizer.eod_id,
...             )

>>> response = tokenizer.decode(pred[0][input_ids.size(1):].cpu(), skip_special_tokens=True).strip()
>>> print(response)
333 Smooth lighting, perfect shading. Intricate and mesmerizing, surrounding finely shattered self-luminous rainbow.

But in the online demo the caption is different:
[screenshot: 企业微信20240103-221040@2x]

The caption image is:
[image: 7c844f8f477e79c8dad934a907337f31_3]
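
For what it's worth, one likely source of the gap is sampling: with do_sample=True and temperature=0.7, each call draws different tokens, so repeated runs give different captions even with identical settings. A hedged sketch of deterministic settings for a like-for-like comparison, replacing the generate call above (these values are assumptions, not the online demo's confirmed configuration):

pred = model.generate(
    input_ids=input_ids.cuda(),
    attention_mask=attention_mask.cuda(),
    do_sample=False,     # greedy decoding: no run-to-run randomness
    num_beams=1,
    max_new_tokens=250,
    min_new_tokens=1,
    pad_token_id=tokenizer.eod_id,
    eos_token_id=tokenizer.eod_id,
)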

The details of the 1.45M data

Hello! Thanks for your great work! I am curious about the details of the 1.45M data you use for instruction tuning. I guess it includes 400k (from CC3M), COCO datasets (around 100k-200k?), and the downstream-task data (like DocVQA, TextVQA, ...). Is it possible and convenient to reveal the details? Thank you in advance for reading and replying.

Something about Table 1

Nice work! Thanks for your contribution.
I wonder if there is a mistake: I found that InfoVQA only has 23,946 questions, while you write 47k. Also, how did you use TabFact as images?

The evaluate setting of Qwen-VL

Hello, thanks for your great work! I read your paper in detail and found that you evaluated Qwen-VL on DUE-Benchmark datasets not reported in its official paper, like DeepForm, KLC, WTQ, TableFact, and VisualMRC. If possible and convenient, could you share the generation config of Qwen-VL needed to reproduce your results (do_sample, max_new_tokens, top_p, top_k, length_penalty, and so forth)? Sincerely, thanks!
Additionally, I guess you may have used DUE_evaluator as your evaluation script, didn't you?

Degree of Chinese language support

Hello, I see that your training data appears to be all in English. Does that mean the model's support for Chinese is not very good?

Output of local demo is very different from online demo.

Appreciate your great work!

This is the caption output from your online demo site; the result looks good:
[screenshot]

I set up the model environment, changed the checkpoint path, and ran demo.py on my server. I uploaded the same image, but the result is quite different. I tried other images and hit the same issue:
[screenshot]

Did I miss something, or do I need to change something? Could you please help with that? Thanks a lot!

ModuleNotFoundError: No module named 'transformers_modules.monkey.qwen_generation_utils'

When I run the code,
from transformers import AutoModelForCausalLM, AutoTokenizer
checkpoint = "echo840/Monkey"
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map='cuda', trust_remote_code=True).eval()
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
tokenizer.padding_side = 'left'
tokenizer.pad_token_id = tokenizer.eod_id
img_path = ""
question = ""
query = f'{img_path} {question} Answer: ' #VQA

query = f'{img_path} Generate the detailed caption in English: ' #detailed caption

input_ids = tokenizer(query, return_tensors='pt', padding='longest')
attention_mask = input_ids.attention_mask
input_ids = input_ids.input_ids

pred = model.generate(
    input_ids=input_ids.cuda(),
    attention_mask=attention_mask.cuda(),
    do_sample=False,
    num_beams=1,
    max_new_tokens=10,
    min_new_tokens=1,
    length_penalty=1,
    num_return_sequences=1,
    output_hidden_states=True,
    use_cache=True,
    pad_token_id=tokenizer.eod_id,
    eos_token_id=tokenizer.eod_id,
)
response = tokenizer.decode(pred[0][input_ids.size(1):].cpu(), skip_special_tokens=True).strip()
print(response)
The weights were downloaded to a local path, but loading them fails with the following error:

from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("/root/autodl-tmp/monkey", device_map='cuda', trust_remote_code=True).eval()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/miniconda3/envs/monkey/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 553, in from_pretrained
    model_class = get_class_from_dynamic_module(
  File "/root/miniconda3/envs/monkey/lib/python3.9/site-packages/transformers/dynamic_module_utils.py", line 500, in get_class_from_dynamic_module
    return get_class_in_module(class_name, final_module.replace(".py", ""))
  File "/root/miniconda3/envs/monkey/lib/python3.9/site-packages/transformers/dynamic_module_utils.py", line 200, in get_class_in_module
    module = importlib.import_module(module_path)
  File "/root/miniconda3/envs/monkey/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 850, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "/root/.cache/huggingface/modules/transformers_modules/monkey/modeling_monkey.py", line 29, in <module>
    from .modeling_qwen import QWenModel, QWenPreTrainedModel, QWenLMHeadModel
  File "/root/.cache/huggingface/modules/transformers_modules/monkey/modeling_qwen.py", line 40, in <module>
    from .qwen_generation_utils import (
ModuleNotFoundError: No module named 'transformers_modules.monkey.qwen_generation_utils'
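
A possible workaround (an assumption, not an official fix): the dynamic-module cache is missing qwen_generation_utils.py, which ships alongside modeling_qwen.py in the checkpoint directory. Copying it into the cache next to the other cached modules usually clears this kind of import error:

import shutil

# Paths are taken from the traceback above.
shutil.copy(
    "/root/autodl-tmp/monkey/qwen_generation_utils.py",
    "/root/.cache/huggingface/modules/transformers_modules/monkey/",
)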

Train BUG

Hi! When I execute ./finetune/finetune_ds_debug.sh, the following error occurs. How can I resolve this?

Traceback (most recent call last):
  File "/mnt2/jiaxingchen/project/Monkey/finetune_multitask.py", line 397, in <module>
    train()
  File "/mnt2/jiaxingchen/project/Monkey/finetune_multitask.py", line 327, in train
    tokenizer = QWenTokenizer.from_pretrained(
  File "/root/miniconda3/envs/monkey/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2024, in from_pretrained
    return cls._from_pretrained(
  File "/root/miniconda3/envs/monkey/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2256, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/mnt2/jiaxingchen/project/Monkey/monkey_model/tokenization_qwen.py", line 114, in __init__
    super().__init__(**kwargs)
  File "/root/miniconda3/envs/monkey/lib/python3.9/site-packages/transformers/tokenization_utils.py", line 367, in __init__
    self._add_tokens(
  File "/mnt2/jiaxingchen/project/Monkey/monkey_model/tokenization_qwen.py", line 217, in _add_tokens
    if surface_form not in SPECIAL_TOKENS + self.IMAGE_ST:
AttributeError: 'QWenTokenizer' object has no attribute 'IMAGE_ST'
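
The error pattern is a classic initialization-order pitfall: the parent constructor invokes a hook (_add_tokens) that reads an attribute (IMAGE_ST) the subclass has not assigned yet. A minimal, self-contained illustration of the failure mode (a sketch, not Monkey's actual code):

class Parent:
    def __init__(self):
        self._add_tokens()              # parent ctor calls the hook

    def _add_tokens(self):
        pass

class Broken(Parent):
    def __init__(self):
        super().__init__()              # hook runs before IMAGE_ST exists
        self.IMAGE_ST = ("<img>", "</img>")

    def _add_tokens(self):
        print(self.IMAGE_ST)            # AttributeError, as in the traceback

class Fixed(Parent):
    def __init__(self):
        self.IMAGE_ST = ("<img>", "</img>")   # assign before super().__init__()
        super().__init__()

    def _add_tokens(self):
        print(self.IMAGE_ST)            # works

Fixed()   # prints the tuple; Broken() would raise AttributeError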

ChartQA evaluation

Hello! I noticed that the ChartQA test set is split into human-annotated and machine-generated questions. Which part is the accuracy reported in the paper based on?

modeling_qwen_nvdia3090.py

"In "Add LoRA: You need to replace the contents of model_qwen.py with the contents of model_qwen_nvdia3090.py," it seems there is a typo in model_qwen_nvdia3090.py. To reduce confusion, please change it to modeling_qwen_nvdia3090.py."

Where can the file /Qwen/Qwen-VL/resolve/main/tf_model.h5 be downloaded?

HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /Qwen/Qwen-VL/resolve/main/tf_model.h5 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fe6b9d0a520>: Failed to establish a new connection: [Errno 101] Network is unreachable'))

Compare with LLaVA-1.5

Great job, thank you for sharing.
I would like to know which version of the MMBench evaluation you are using. To my knowledge, there are currently two versions available: 0712 and 1003.
Thanks.

AssertionError: Unable to pre-compile ops without torch installed. Please install torch before attempting to pre-compile ops.

Hello, I get an error when running pip to install the dependencies. How can I resolve it?
I am on Python 3.11 + Windows 10.
torch is already installed successfully, but the following error still appears:
Getting requirements to build wheel ... error
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> [23 lines of output]
[WARNING] Unable to import torch, pre-compiling ops will be disabled. Please visit https://pytorch.org/ to see how to properly install torch on your system.
[WARNING] unable to import torch, please install it if you want to pre-compile any deepspeed ops.
DS_BUILD_OPS=1
Traceback (most recent call last):
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\site-packages\pip\_vendor\pyproject_hooks\_in_process\_in_process.py", line 353, in <module>
    main()
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\site-packages\pip\_vendor\pyproject_hooks\_in_process\_in_process.py", line 335, in main
    json_out['return_val'] = hook(**hook_input['kwargs'])
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\site-packages\pip\_vendor\pyproject_hooks\_in_process\_in_process.py", line 118, in get_requires_for_build_wheel
    return hook(config_settings)
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Administrator\AppData\Local\Temp\pip-build-env-ck45v4fo\overlay\Lib\site-packages\setuptools\build_meta.py", line 325, in get_requires_for_build_wheel
    return self._get_build_requires(config_settings, requirements=['wheel'])
  File "C:\Users\Administrator\AppData\Local\Temp\pip-build-env-ck45v4fo\overlay\Lib\site-packages\setuptools\build_meta.py", line 295, in _get_build_requires
    self.run_setup()
  File "C:\Users\Administrator\AppData\Local\Temp\pip-build-env-ck45v4fo\overlay\Lib\site-packages\setuptools\build_meta.py", line 480, in run_setup
    super(_BuildMetaLegacyBackend, self).run_setup(setup_script=setup_script)
  File "C:\Users\Administrator\AppData\Local\Temp\pip-build-env-ck45v4fo\overlay\Lib\site-packages\setuptools\build_meta.py", line 311, in run_setup
    exec(code, locals())
  File "<string>", line 147, in <module>
AssertionError: Unable to pre-compile ops without torch installed. Please install torch before attempting to pre-compile ops.
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.

Error when loading the tokenizer

I specified the local path to the tokenizer, but it still errors:
ValueError: Tokenizer class QWenTokenizer does not exist or is not currently imported.

ValueError: We were not able to get the tokenizer using AutoTokenizer.from_pretrained
with the string that you have passed XXX/monkey-model. If you have a custom tokenizer, you can pass it as input.
For now, we only support quantization for text model. Support for vision, speech and multimodel will come later.
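
A hedged guess at the cause: custom tokenizer classes such as QWenTokenizer can only be resolved when trust_remote_code=True is passed, e.g.:

from transformers import AutoTokenizer

# "XXX/monkey-model" is the placeholder path from the error message above.
tokenizer = AutoTokenizer.from_pretrained(
    "XXX/monkey-model",
    trust_remote_code=True,   # required to import the custom QWenTokenizer
)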

About training

Hello. If I fine-tune directly with finetune_ds_debug.sh without changing the code, is that equivalent to full-parameter fine-tuning? Can the fine-tuned model be used directly, or does it need additional processing?

Memory requirement?

@echo840 I am using demo.py, but the model goes OOM even with 96 GB of GPU memory. It looks like it is only using a single GPU and not distributing the model across multiple GPUs.
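
Not a confirmed fix, but demo.py's device_map='cuda' pins the whole model to a single device; device_map='auto' lets Accelerate shard it across all visible GPUs. A minimal sketch (the checkpoint name is taken from snippets elsewhere in this issue list):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "echo840/Monkey",
    device_map="auto",        # shard layers across all available GPUs
    trust_remote_code=True,
).eval()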

The demo runs inference fine, but training errors out

After downloading the model locally, I can run demo.py successfully, but running finetune_ds_debug.sh fails with:
RuntimeError: Error building extension 'fused_adam'

ImportError: XXX/.cache/torch_extensions/py39_cu117/fused_adam/fused_adam.so: cannot open shared object file: No such file or directory

Could you help? What could be causing this?

Poor MME scores

I used the model weights you posted to evaluate MME, but I got relatively poor MME scores that do not match your scores on the MME leaderboard: Perception 1484 (yours 1522), Cognition 375 (yours 401). The scores I measured are almost the same as Qwen-VL-Chat's. Could you post your MME evaluation script?

This is the script that generates answers:

import os
from tqdm import tqdm
import sys
from monkey_model.modeling_monkey import MonkeyLMHeadModel
from monkey_model.tokenization_qwen import QWenTokenizer
from transformers.generation import GenerationConfig

checkpoint = 'echo840/Monkey'  # not defined in the original snippet; path assumed
tokenizer = QWenTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
tokenizer.padding_side = 'left'
tokenizer.pad_token_id = tokenizer.eod_id

model = MonkeyLMHeadModel.from_pretrained(
        checkpoint, device_map='cuda', trust_remote_code=True).eval()

model.generation_config = GenerationConfig.from_pretrained(checkpoint, trust_remote_code=True)
model.generation_config.top_p = 0.01
root = 'Your_Results'
output = 'Monkey_Results'  # not defined in the original snippet; name assumed
os.makedirs(output, exist_ok=True)
for filename in os.listdir(root):
    with open(os.path.join(root, filename), 'r') as fin, open(os.path.join(output, filename), 'w') as fout:
        lines = fin.read().splitlines()
        filename = filename.replace('.txt', '')
        for line in tqdm(lines):
            img, question, gt = line.strip().split('\t')
            img_path = os.path.join('images', filename, img)
            assert os.path.exists(img_path), img_path
            query = f'<img>{img_path}</img>{question} Answer:'
            input_ids = tokenizer([query], return_tensors='pt', padding='longest')
            pred = model.generate(
                input_ids=input_ids.input_ids.cuda(),
                attention_mask=input_ids.attention_mask.cuda(),
                do_sample=False,
                num_beams=1,
                max_new_tokens=5,
                min_new_tokens=1,
                length_penalty=1,
                num_return_sequences=1,
                output_hidden_states=True,
                use_cache=True,
                pad_token_id=tokenizer.eod_id,
                eos_token_id=tokenizer.eod_id,
            )

            response = [
                tokenizer.decode(_[input_ids.input_ids.size(1):].cpu(),
                                skip_special_tokens=True).strip() for _ in pred
            ][0]

            print(img, question, gt, response, sep='\t', file=fout)

Some questions about Table 6

  1. Is the MMBench score in the table from the test set or the val set?
  2. For the Vicuna-7B model in the third row, the input size is 448*448. How was this model trained? Did you simply linearly interpolate the positional embeddings? (See the sketch after this list.)
  3. In the pre-training stage, which works better, CC3M or CCSBU?
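
For reference, question 2 alludes to a standard trick: when the input resolution grows, the ViT's patch position embeddings are resized by interpolation. A generic sketch (shapes and patch size are illustrative assumptions, not Monkey's actual configuration):

import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed: torch.Tensor, new_grid: int) -> torch.Tensor:
    # pos_embed: (1, old_grid*old_grid, dim) patch position embeddings.
    _, n, dim = pos_embed.shape
    old_grid = int(n ** 0.5)
    pe = pos_embed.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    pe = F.interpolate(pe, size=(new_grid, new_grid), mode="bicubic", align_corners=False)
    return pe.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)

# e.g. 224/14 = 16 patches per side -> 448/14 = 32 patches per side:
new_pe = interpolate_pos_embed(torch.randn(1, 16 * 16, 1024), new_grid=32)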

Performance compared to llava?

Hi authors, I ran images through your demo, but I got much worse results compared to LLaVA. The captions are very short even when I follow the same prompt for detailed description from the paper, and I used the same image for both LLaVA and Monkey. The higher resolution also doesn't capture the small text in the correct location. Is there something wrong with the demo? I cannot get results close to anything shown in the paper.

Can you also explain how you added the perceiver resampler? Since the perceiver resampler is used for videos, is the temporal dimension used for the number of images? (A rough sketch of the general idea is below.)
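
For context, a minimal sketch of the general perceiver-resampler idea (an illustration only, not Monkey's implementation): a fixed set of learned queries cross-attends to the visual tokens, compressing them to a fixed length; for a single image the token axis plays the role the temporal axis plays for video.

import torch
import torch.nn as nn

class Resampler(nn.Module):
    def __init__(self, dim: int = 1024, num_queries: int = 256, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        # image_feats: (batch, seq_len, dim) tokens from the vision encoder
        q = self.queries.unsqueeze(0).expand(image_feats.size(0), -1, -1)
        out, _ = self.attn(q, image_feats, image_feats)
        return out   # (batch, num_queries, dim): fixed-length visual prefix

compressed = Resampler()(torch.randn(2, 1024, 1024))   # -> (2, 256, 1024)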

Thanks in advance.

GPU memory usage for full-parameter fine-tuning

Currently on 4 x A100-SXM4-40GB with 1 TB of RAM, using DeepSpeed ZeRO-2 with offload_optimizer=cpu and per_device_train_batch_size=1 for full-parameter fine-tuning, I get a CUDA out-of-memory error. What is the minimum amount of resources needed for training?

Estimated memory needed for params, optim states and gradients for a:
HW: Setup with 1 node, 4 GPUs per node.
SW: Model with 9708M total params.
  per CPU  |  per GPU |   Options
  216.99GB |  18.08GB | offload_optimizer=cpu 
  216.99GB |  72.33GB | offload_optimizer=none
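
For reference, the table above matches the format printed by DeepSpeed's built-in memory estimator; a sketch of reproducing the estimate yourself (the checkpoint name is an assumption taken from snippets elsewhere in this issue list):

from transformers import AutoModelForCausalLM
from deepspeed.runtime.zero.stage_1_and_2 import (
    estimate_zero2_model_states_mem_needs_all_live,
)

model = AutoModelForCausalLM.from_pretrained("echo840/Monkey", trust_remote_code=True)
# Prints per-CPU / per-GPU estimates for the offload options shown above.
estimate_zero2_model_states_mem_needs_all_live(model, num_gpus_per_node=4, num_nodes=1)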

model inference

I want to feed the model a picture and a question each time and have it return the corresponding answer. Is there a simple example implementation of this? (A sketch follows.)
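
A minimal sketch assembled from the snippets elsewhere in this issue list (the checkpoint name and <img>...</img> prompt format are taken from other issues above, so treat them as assumptions):

from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "echo840/Monkey"
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="cuda", trust_remote_code=True).eval()
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
tokenizer.padding_side = "left"
tokenizer.pad_token_id = tokenizer.eod_id

def answer(img_path: str, question: str) -> str:
    # Build the VQA-style query, tokenize, and decode only the new tokens.
    query = f"<img>{img_path}</img> {question} Answer: "
    inputs = tokenizer(query, return_tensors="pt", padding="longest")
    pred = model.generate(
        input_ids=inputs.input_ids.cuda(),
        attention_mask=inputs.attention_mask.cuda(),
        do_sample=False,
        max_new_tokens=32,
        pad_token_id=tokenizer.eod_id,
        eos_token_id=tokenizer.eod_id,
    )
    return tokenizer.decode(pred[0][inputs.input_ids.size(1):].cpu(), skip_special_tokens=True).strip()

print(answer("./demo.jpg", "What is in the image?"))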

Demo is down?

Can't access Demo or Demo-chat since this morning.

Online demo inference speed?

Great work! I tried the online demo and found the inference speed very fast (about 2 s/image). Do you have some acceleration tricks, or is it running on an A100?

A question regarding dataset

Hi, I'm pleased to take an interest in your work. I noticed that the paper mentions a 1.45 million image dataset. Could you clarify the relationship between this dataset and CC3M-400K? Did you filter 1 million examples from COCO in addition to the 400K data from CC3M?
[screenshot]
