
monkey's People

Contributors

echo840, melosy, shuozhang2003, yuliang-liu


monkey's Issues

A Slight Issue on Image Captioning

Issue

As shown in the screenshots below, Monkey seems to get distracted when analysing this logo.
[screenshots of the demo output]

Where is "湖北"?

While Monkey produces the wrong result for the school name at first sight, it seems to have captured the information, and answers correctly when asked again in text afterwards:
[screenshot]

In my case, 7 tests were carried out, yielding 6 wrong answers and 1 response that did not mention the school's name.
The issue itself is not critical, though it still seems worth fixing :)

Minimal Reproduction Steps

  1. Download the Logo
  2. Visit the Demo or Demo_chat site
  3. Upload the logo, then click on "Generate"
  4. Wait for the result

Questions about using LoRA

Hello,
I want to make use of LoRA, and I have added the contents of model_qwen_nvdia3090.py. However, I have a few questions:

  1. Should I add '--use_lora' in finetune/finetune_ds_debug.sh, just like Qwen-VL?
  2. What should I do to freeze all modules except the LoRA and Resampler modules in finetune_multitask.py? (See the sketch after this list.)
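
Not an official recipe, just a minimal sketch of how one might freeze everything except the LoRA and Resampler parameters; the name substrings "lora" and "resampler" are assumptions about Monkey's parameter naming and should be checked against named_parameters():

def freeze_except_lora_and_resampler(model):
    # Keep gradients only for parameters whose names look like LoRA or
    # resampler weights (the substrings are assumptions; verify for your model).
    for name, param in model.named_parameters():
        param.requires_grad = any(k in name.lower() for k in ("lora", "resampler"))
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"trainable params: {trainable}")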

demo.py caption result is not the same as the online demo

@Yuliang-Liu Using the demo.py script, the caption result is: "333 Smooth lighting, perfect shading. Intricate and mesmerizing, surrounding finely shattered self-luminous rainbow."
What are your online demo parameter settings?

>>> kwargs = dict()
>>> kwargs['fp16'] = True
>>> kwargs['bf16'] = False
>>> model = MonkeyLMHeadModel.from_pretrained(checkpoint, device_map='cuda', **kwargs).eval()
>>> tokenizer = QWenTokenizer.from_pretrained(checkpoint)
>>> tokenizer.padding_side = 'left'
>>> tokenizer.pad_token_id = tokenizer.eod_id

>>> print(query)
<img>7c844f8f477e79c8dad934a907337f31_3</img> Write a comprehensive and concise caption and style of the image using the original caption:: "anime style.The latest flat anime character design artwork has hyper-exceptional amount of finely beautiful details, which is delicately generated by the most technically skilled illustrator. The best framing and the best composition from Hatsune Miku's hip to her frontal face. Being in highly fashionable feminine clothing. All the features and proportions and shapes of her face and eyes and hair and her perfect feminine body are delicately super precisely reproduced original Hatsune Miku of the THE VOCALOID official artworks true to life, the bishoujo's luscious loving pose. Pale color.::333 Smooth lighting, perfect shading. Intricate and mesmerizing, surrounding finely shattered self-luminous rainbow.::77 Letter.::-0.1 "

>>> input_ids = tokenizer(query, return_tensors='pt', padding='longest')
>>> attention_mask = input_ids.attention_mask
>>> input_ids = input_ids.input_ids
>>> pred = model.generate(
...             input_ids=input_ids.cuda(),
...             attention_mask=attention_mask.cuda(),
...             do_sample=True,
...             temperature=0.7,
...             max_new_tokens=250,
...             min_new_tokens=1,
...             length_penalty=3,
...             num_return_sequences=1,
...             output_hidden_states=True,
...             use_cache=True,
...             pad_token_id=tokenizer.eod_id,
...             eos_token_id=tokenizer.eod_id,
...             )

>>> response = tokenizer.decode(pred[0][input_ids.size(1):].cpu(), skip_special_tokens=True).strip()
>>> print(response)
333 Smooth lighting, perfect shading. Intricate and mesmerizing, surrounding finely shattered self-luminous rainbow.

But in the online demo the caption is different:
[screenshot: 企业微信20240103-221040@2x]

The caption image is:
[image: 7c844f8f477e79c8dad934a907337f31_3]
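
For what it's worth, one likely source of the gap is sampling: with do_sample=True and temperature=0.7, each call draws different tokens, so repeated runs give different captions even with identical settings. A hedged sketch of deterministic settings for a like-for-like comparison, replacing the generate call above (these values are assumptions, not the online demo's confirmed configuration):

pred = model.generate(
    input_ids=input_ids.cuda(),
    attention_mask=attention_mask.cuda(),
    do_sample=False,     # greedy decoding: no run-to-run randomness
    num_beams=1,
    max_new_tokens=250,
    min_new_tokens=1,
    pad_token_id=tokenizer.eod_id,
    eos_token_id=tokenizer.eod_id,
)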

The details of the 1.45M data

Hello! Thanks for your great work! I am curious about the details of the 1.45M data you use for instruction tuning. I guess it includes 400k (from CC3M), COCO datasets (around 100k-200k?), and the downstream-task data (like DocVQA, TextVQA, ...). Is it possible and convenient to reveal the details? Thank you in advance for reading and replying.

Something about Table 1

Nice work! Thanks for your contribution.
I wonder if there is a mistake: I found that InfoVQA only has 23,946 questions, while you write 47k. Also, how did you use TabFact as images?

The evaluate setting of Qwen-VL

Hello, thanks for your great work! I read your paper in detail and found that you evaluated Qwen-VL on DUE-Benchmark datasets not reported in its official paper, like DeepForm, KLC, WTQ, TableFact, and VisualMRC. If possible and convenient, could you share the generation config of Qwen-VL needed to reproduce your results (do_sample, max_new_tokens, top_p, top_k, length_penalty, and so forth)? Sincerely, thanks!
Additionally, I guess you may have used DUE_evaluator as your evaluation script, didn't you?

Degree of Chinese language support

Hello, I see that your training data appears to be all in English. Does that mean the model's support for Chinese is not very good?

Output of local demo is very different from online demo.

Appreciate your great work!

This is the caption output from your online demo site; the result looks good:
[screenshot]

I set up the model environment, changed the checkpoint path, and ran demo.py on my server. I uploaded the same image, but the result is quite different. I tried other images and hit the same issue:
[screenshot]

Did I miss something, or do I need to change something? Could you please help with that? Thanks a lot!

ModuleNotFoundError: No module named 'transformers_modules.monkey.qwen_generation_utils'

When I run the code,
from transformers import AutoModelForCausalLM, AutoTokenizer
checkpoint = "echo840/Monkey"
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map='cuda', trust_remote_code=True).eval()
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
tokenizer.padding_side = 'left'
tokenizer.pad_token_id = tokenizer.eod_id
img_path = ""
question = ""
query = f'{img_path} {question} Answer: ' #VQA

query = f'{img_path} Generate the detailed caption in English: ' #detailed caption

input_ids = tokenizer(query, return_tensors='pt', padding='longest')
attention_mask = input_ids.attention_mask
input_ids = input_ids.input_ids

pred = model.generate(
    input_ids=input_ids.cuda(),
    attention_mask=attention_mask.cuda(),
    do_sample=False,
    num_beams=1,
    max_new_tokens=10,
    min_new_tokens=1,
    length_penalty=1,
    num_return_sequences=1,
    output_hidden_states=True,
    use_cache=True,
    pad_token_id=tokenizer.eod_id,
    eos_token_id=tokenizer.eod_id,
)
response = tokenizer.decode(pred[0][input_ids.size(1):].cpu(), skip_special_tokens=True).strip()
print(response)
The weights were downloaded to a local path, but loading them fails with the following error:

from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("/root/autodl-tmp/monkey", device_map='cuda', trust_remote_code=True).eval()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/miniconda3/envs/monkey/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 553, in from_pretrained
    model_class = get_class_from_dynamic_module(
  File "/root/miniconda3/envs/monkey/lib/python3.9/site-packages/transformers/dynamic_module_utils.py", line 500, in get_class_from_dynamic_module
    return get_class_in_module(class_name, final_module.replace(".py", ""))
  File "/root/miniconda3/envs/monkey/lib/python3.9/site-packages/transformers/dynamic_module_utils.py", line 200, in get_class_in_module
    module = importlib.import_module(module_path)
  File "/root/miniconda3/envs/monkey/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 850, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "/root/.cache/huggingface/modules/transformers_modules/monkey/modeling_monkey.py", line 29, in <module>
    from .modeling_qwen import QWenModel, QWenPreTrainedModel, QWenLMHeadModel
  File "/root/.cache/huggingface/modules/transformers_modules/monkey/modeling_qwen.py", line 40, in <module>
    from .qwen_generation_utils import (
ModuleNotFoundError: No module named 'transformers_modules.monkey.qwen_generation_utils'
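
A possible workaround (an assumption, not an official fix): the dynamic-module cache is missing qwen_generation_utils.py, which ships alongside modeling_qwen.py in the checkpoint directory. Copying it into the cache next to the other cached modules usually clears this kind of import error:

import shutil

# Paths are taken from the traceback above.
shutil.copy(
    "/root/autodl-tmp/monkey/qwen_generation_utils.py",
    "/root/.cache/huggingface/modules/transformers_modules/monkey/",
)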

Train BUG

Hi! When I execute ./finetune/finetune_ds_debug.sh, the following error occurs. How can I resolve this?

Traceback (most recent call last):
  File "/mnt2/jiaxingchen/project/Monkey/finetune_multitask.py", line 397, in <module>
    train()
  File "/mnt2/jiaxingchen/project/Monkey/finetune_multitask.py", line 327, in train
    tokenizer = QWenTokenizer.from_pretrained(
  File "/root/miniconda3/envs/monkey/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2024, in from_pretrained
    return cls._from_pretrained(
  File "/root/miniconda3/envs/monkey/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2256, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/mnt2/jiaxingchen/project/Monkey/monkey_model/tokenization_qwen.py", line 114, in __init__
    super().__init__(**kwargs)
  File "/root/miniconda3/envs/monkey/lib/python3.9/site-packages/transformers/tokenization_utils.py", line 367, in __init__
    self._add_tokens(
  File "/mnt2/jiaxingchen/project/Monkey/monkey_model/tokenization_qwen.py", line 217, in _add_tokens
    if surface_form not in SPECIAL_TOKENS + self.IMAGE_ST:
AttributeError: 'QWenTokenizer' object has no attribute 'IMAGE_ST'
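
The error pattern is a classic initialization-order pitfall: the parent constructor invokes a hook (_add_tokens) that reads an attribute (IMAGE_ST) the subclass has not assigned yet. A minimal, self-contained illustration of the failure mode (a sketch, not Monkey's actual code):

class Parent:
    def __init__(self):
        self._add_tokens()              # parent ctor calls the hook

    def _add_tokens(self):
        pass

class Broken(Parent):
    def __init__(self):
        super().__init__()              # hook runs before IMAGE_ST exists
        self.IMAGE_ST = ("<img>", "</img>")

    def _add_tokens(self):
        print(self.IMAGE_ST)            # AttributeError, as in the traceback

class Fixed(Parent):
    def __init__(self):
        self.IMAGE_ST = ("<img>", "</img>")   # assign before super().__init__()
        super().__init__()

    def _add_tokens(self):
        print(self.IMAGE_ST)            # works

Fixed()   # prints the tuple; Broken() would raise AttributeError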

ChartQA evaluation

Hello! I noticed that the ChartQA test set is split into human-annotated and machine-generated questions. Which part is the accuracy reported in the paper based on?

modeling_qwen_nvdia3090.py

"In "Add LoRA: You need to replace the contents of model_qwen.py with the contents of model_qwen_nvdia3090.py," it seems there is a typo in model_qwen_nvdia3090.py. To reduce confusion, please change it to modeling_qwen_nvdia3090.py."

Where can the file /Qwen/Qwen-VL/resolve/main/tf_model.h5 be downloaded?

HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /Qwen/Qwen-VL/resolve/main/tf_model.h5 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fe6b9d0a520>: Failed to establish a new connection: [Errno 101] Network is unreachable'))

Compare with LLaVA-1.5

Great job, thank you for sharing.
I would like to know which version of the MMBench evaluation you are using. To my knowledge, there are currently two versions available: 0712 and 1003.
Thanks.

AssertionError: Unable to pre-compile ops without torch installed. Please install torch before attempting to pre-compile ops.

Hello, I get an error when running pip to install the dependencies. How can I resolve it?
I am on Python 3.11 + Windows 10.
torch is already installed successfully, but the following error still appears:
Getting requirements to build wheel ... error
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> [23 lines of output]
[WARNING] Unable to import torch, pre-compiling ops will be disabled. Please visit https://pytorch.org/ to see how to properly install torch on your system.
[WARNING] unable to import torch, please install it if you want to pre-compile any deepspeed ops.
DS_BUILD_OPS=1
Traceback (most recent call last):
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\site-packages\pip\_vendor\pyproject_hooks\_in_process\_in_process.py", line 353, in <module>
    main()
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\site-packages\pip\_vendor\pyproject_hooks\_in_process\_in_process.py", line 335, in main
    json_out['return_val'] = hook(**hook_input['kwargs'])
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\site-packages\pip\_vendor\pyproject_hooks\_in_process\_in_process.py", line 118, in get_requires_for_build_wheel
    return hook(config_settings)
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Administrator\AppData\Local\Temp\pip-build-env-ck45v4fo\overlay\Lib\site-packages\setuptools\build_meta.py", line 325, in get_requires_for_build_wheel
    return self._get_build_requires(config_settings, requirements=['wheel'])
  File "C:\Users\Administrator\AppData\Local\Temp\pip-build-env-ck45v4fo\overlay\Lib\site-packages\setuptools\build_meta.py", line 295, in _get_build_requires
    self.run_setup()
  File "C:\Users\Administrator\AppData\Local\Temp\pip-build-env-ck45v4fo\overlay\Lib\site-packages\setuptools\build_meta.py", line 480, in run_setup
    super(_BuildMetaLegacyBackend, self).run_setup(setup_script=setup_script)
  File "C:\Users\Administrator\AppData\Local\Temp\pip-build-env-ck45v4fo\overlay\Lib\site-packages\setuptools\build_meta.py", line 311, in run_setup
    exec(code, locals())
  File "<string>", line 147, in <module>
AssertionError: Unable to pre-compile ops without torch installed. Please install torch before attempting to pre-compile ops.
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.

Error when loading the tokenizer

I specified the local path to the tokenizer, but it still errors:
ValueError: Tokenizer class QWenTokenizer does not exist or is not currently imported.

ValueError: We were not able to get the tokenizer using AutoTokenizer.from_pretrained
with the string that you have passed XXX/monkey-model. If you have a custom tokenizer, you can pass it as input.
For now, we only support quantization for text model. Support for vision, speech and multimodel will come later.
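
A hedged guess at the cause: custom tokenizer classes such as QWenTokenizer can only be resolved when trust_remote_code=True is passed, e.g.:

from transformers import AutoTokenizer

# "XXX/monkey-model" is the placeholder path from the error message above.
tokenizer = AutoTokenizer.from_pretrained(
    "XXX/monkey-model",
    trust_remote_code=True,   # required to import the custom QWenTokenizer
)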

About training

Hello. If I fine-tune directly with finetune_ds_debug.sh without changing the code, is that equivalent to full-parameter fine-tuning? Can the fine-tuned model be used directly, or does it need additional processing?

Memory requirement?

@echo840 I am using demo.py, but the model goes OOM even with 96 GB of GPU memory. It looks like it is only using a single GPU and not distributing the model across multiple GPUs.
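
Not a confirmed fix, but demo.py's device_map='cuda' pins the whole model to a single device; device_map='auto' lets Accelerate shard it across all visible GPUs. A minimal sketch (the checkpoint name is taken from snippets elsewhere in this issue list):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "echo840/Monkey",
    device_map="auto",        # shard layers across all available GPUs
    trust_remote_code=True,
).eval()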

The demo runs inference fine, but training errors out

After downloading the model locally, I can run demo.py successfully, but running finetune_ds_debug.sh fails with:
RuntimeError: Error building extension 'fused_adam'

ImportError: XXX/.cache/torch_extensions/py39_cu117/fused_adam/fused_adam.so: cannot open shared object file: No such file or directory

Could you help? What could be causing this?

Poor MME scores

I used the model weights you posted to evaluate MME, but I got relatively poor MME scores that do not match your scores on the MME leaderboard: Perception 1484 (yours 1522), Cognition 375 (yours 401). The scores I measured are almost the same as Qwen-VL-Chat's. Could you post your MME evaluation script?

This is the script that generates answers:

import os
from tqdm import tqdm
import sys
from monkey_model.modeling_monkey import MonkeyLMHeadModel
from monkey_model.tokenization_qwen import QWenTokenizer
from transformers.generation import GenerationConfig

checkpoint = 'echo840/Monkey'  # not defined in the original snippet; path assumed
tokenizer = QWenTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
tokenizer.padding_side = 'left'
tokenizer.pad_token_id = tokenizer.eod_id

model = MonkeyLMHeadModel.from_pretrained(
        checkpoint, device_map='cuda', trust_remote_code=True).eval()

model.generation_config = GenerationConfig.from_pretrained(checkpoint, trust_remote_code=True)
model.generation_config.top_p = 0.01
root = 'Your_Results'
output = 'Monkey_Results'  # not defined in the original snippet; name assumed
os.makedirs(output, exist_ok=True)
for filename in os.listdir(root):
    with open(os.path.join(root, filename), 'r') as fin, open(os.path.join(output, filename), 'w') as fout:
        lines = fin.read().splitlines()
        filename = filename.replace('.txt', '')
        for line in tqdm(lines):
            img, question, gt = line.strip().split('\t')
            img_path = os.path.join('images', filename, img)
            assert os.path.exists(img_path), img_path
            query = f'<img>{img_path}</img>{question} Answer:'
            input_ids = tokenizer([query], return_tensors='pt', padding='longest')
            pred = model.generate(
                input_ids=input_ids.input_ids.cuda(),
                attention_mask=input_ids.attention_mask.cuda(),
                do_sample=False,
                num_beams=1,
                max_new_tokens=5,
                min_new_tokens=1,
                length_penalty=1,
                num_return_sequences=1,
                output_hidden_states=True,
                use_cache=True,
                pad_token_id=tokenizer.eod_id,
                eos_token_id=tokenizer.eod_id,
            )

            response = [
                tokenizer.decode(_[input_ids.input_ids.size(1):].cpu(),
                                skip_special_tokens=True).strip() for _ in pred
            ][0]

            print(img, question, gt, response, sep='\t', file=fout)

Some questions about Table 6

  1. Is the MMBench score in the table from the test set or the val set?
  2. For the Vicuna-7B model in the third row, the input size is 448*448. How was this model trained? Did you simply linearly interpolate the positional embeddings? (See the sketch after this list.)
  3. In the pre-training stage, which works better, CC3M or CCSBU?
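
For reference, question 2 alludes to a standard trick: when the input resolution grows, the ViT's patch position embeddings are resized by interpolation. A generic sketch (shapes and patch size are illustrative assumptions, not Monkey's actual configuration):

import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed: torch.Tensor, new_grid: int) -> torch.Tensor:
    # pos_embed: (1, old_grid*old_grid, dim) patch position embeddings.
    _, n, dim = pos_embed.shape
    old_grid = int(n ** 0.5)
    pe = pos_embed.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    pe = F.interpolate(pe, size=(new_grid, new_grid), mode="bicubic", align_corners=False)
    return pe.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)

# e.g. 224/14 = 16 patches per side -> 448/14 = 32 patches per side:
new_pe = interpolate_pos_embed(torch.randn(1, 16 * 16, 1024), new_grid=32)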

Performance compared to llava?

Hi authors, I ran images through your demo, but I got much worse results compared to LLaVA. The captions are very short even when I follow the same prompt for detailed description from the paper, and I used the same image for both LLaVA and Monkey. The higher resolution also doesn't capture the small text in the correct location. Is there something wrong with the demo? I cannot get results close to anything shown in the paper.

Can you also explain how you added the perceiver resampler? Since the perceiver resampler is used for videos, is the temporal dimension used for the number of images? (A rough sketch of the general idea is below.)
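
For context, a minimal sketch of the general perceiver-resampler idea (an illustration only, not Monkey's implementation): a fixed set of learned queries cross-attends to the visual tokens, compressing them to a fixed length; for a single image the token axis plays the role the temporal axis plays for video.

import torch
import torch.nn as nn

class Resampler(nn.Module):
    def __init__(self, dim: int = 1024, num_queries: int = 256, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        # image_feats: (batch, seq_len, dim) tokens from the vision encoder
        q = self.queries.unsqueeze(0).expand(image_feats.size(0), -1, -1)
        out, _ = self.attn(q, image_feats, image_feats)
        return out   # (batch, num_queries, dim): fixed-length visual prefix

compressed = Resampler()(torch.randn(2, 1024, 1024))   # -> (2, 256, 1024)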

Thanks in advance.

GPU memory usage for full-parameter fine-tuning

Currently on 4 x A100-SXM4-40GB with 1 TB of RAM, using DeepSpeed ZeRO-2 with offload_optimizer=cpu and per_device_train_batch_size=1 for full-parameter fine-tuning, I get a CUDA out-of-memory error. What is the minimum amount of resources needed for training?

Estimated memory needed for params, optim states and gradients for a:
HW: Setup with 1 node, 4 GPUs per node.
SW: Model with 9708M total params.
  per CPU  |  per GPU |   Options
  216.99GB |  18.08GB | offload_optimizer=cpu 
  216.99GB |  72.33GB | offload_optimizer=none
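
For reference, the table above matches the format printed by DeepSpeed's built-in memory estimator; a sketch of reproducing the estimate yourself (the checkpoint name is an assumption taken from snippets elsewhere in this issue list):

from transformers import AutoModelForCausalLM
from deepspeed.runtime.zero.stage_1_and_2 import (
    estimate_zero2_model_states_mem_needs_all_live,
)

model = AutoModelForCausalLM.from_pretrained("echo840/Monkey", trust_remote_code=True)
# Prints per-CPU / per-GPU estimates for the offload options shown above.
estimate_zero2_model_states_mem_needs_all_live(model, num_gpus_per_node=4, num_nodes=1)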

model inference

I want to feed the model a picture and a question each time and have it return the corresponding answer. Is there a simple example implementation of this? (A sketch follows.)
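
A minimal sketch assembled from the snippets elsewhere in this issue list (the checkpoint name and <img>...</img> prompt format are taken from other issues above, so treat them as assumptions):

from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "echo840/Monkey"
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="cuda", trust_remote_code=True).eval()
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
tokenizer.padding_side = "left"
tokenizer.pad_token_id = tokenizer.eod_id

def answer(img_path: str, question: str) -> str:
    # Build the VQA-style query, tokenize, and decode only the new tokens.
    query = f"<img>{img_path}</img> {question} Answer: "
    inputs = tokenizer(query, return_tensors="pt", padding="longest")
    pred = model.generate(
        input_ids=inputs.input_ids.cuda(),
        attention_mask=inputs.attention_mask.cuda(),
        do_sample=False,
        max_new_tokens=32,
        pad_token_id=tokenizer.eod_id,
        eos_token_id=tokenizer.eod_id,
    )
    return tokenizer.decode(pred[0][inputs.input_ids.size(1):].cpu(), skip_special_tokens=True).strip()

print(answer("./demo.jpg", "What is in the image?"))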

Demo is down?

Can't access Demo or Demo-chat since this morning.

Online demo inference speed?

Great work! I tried the online demo and found the inference speed very fast (about 2 s/image). Do you have some acceleration tricks, or is it running on an A100?

A question regarding dataset

Hi, I'm pleased to take an interest in your work. I noticed that the paper mentions a 1.45 million image dataset. Could you clarify the relationship between this dataset and CC3M-400K? Did you filter 1 million examples from COCO in addition to the 400K data from CC3M?
[screenshot]
