x-plug / mplug-docowl
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding
License: Apache License 2.0
Thanks for your great work!
The paper mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding mentions the DocStruct4M and DocReason25K datasets, but they are not open-source.
May I ask if there are any plans to open-source these two datasets?
:)
Another thing: will this model support Chinese OCR soon?
Thank you for your work!
When will you make this available on Hugging Face, with instructions, please?
Thanks.
Hello,
I pulled your repo and so far the inference with the stage 1 model works fine. However, the results I get for localized text recognition are often in the wrong order. For example, I use this code (basically the demo code from the README.md):
from docowl_infer import DocOwlInfer
model_path = "./models/models--mPLUG--DocOwl1.5-stage1/.../"
docowl = DocOwlInfer(ckpt_path=model_path, anchors="grid_9", add_global_img=False)
image = "image.jpg"
query = "Identify the text within the bounding box <bbox>92, 444, 880, 480</bbox>"
answer = docowl.inference(image, query)
print(answer)
on this image (only the relevant part is left visible)
which gives the result: 8 Spl. Fz.z.Pers.bef.b. 5
Here, the two parts "8 Spl." and "Fz.z.Pers.bef.b." are in the wrong order (the "5" at the end is hallucinated, but that only happens in the anonymized image, not in the original one, so no concern there). Something like that happens quite often. I have the feeling that I missed something. Am I using the model correctly?
There is indeed a warning the code throws during inference:
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
And also one during model loading:
Some weights of MPLUGDocOwlLlamaForCausalLM were not initialized from the model checkpoint at ... and are newly initialized: ['model.layers.4.self_attn.rotary_emb.inv_freq', ..., 'model.layers.2.self_attn.rotary_emb.inv_freq']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
How can I properly visualize a bounding box on an image? It seems that conventional operations don't display it correctly. Do I need to perform any special transformations?
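One hedged way to draw such a box, assuming the <bbox> values are normalized to [0, 999] relative to the full image (an assumption inferred from the int(x)/999 coordinate handling discussed later in this thread, not a documented contract):

from PIL import Image, ImageDraw

image = Image.open("image.jpg")
w, h = image.size
x1, y1, x2, y2 = 92, 444, 880, 480  # the <bbox> values from the query above
# Scale the normalized coordinates to pixel space before drawing.
box = [x1 / 999 * w, y1 / 999 * h, x2 / 999 * w, y2 / 999 * h]
draw = ImageDraw.Draw(image)
draw.rectangle(box, outline="red", width=3)
image.save("image_with_bbox.jpg")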
May I ask why the Hugging Face links for the mPLUG-DocOwl 1.5 datasets are broken?
Hi, I downloaded the repo and tried initializing the model with:
model_path = "mPLUG/DocOwl1.5-Chat"
docowl = DocOwlInfer(ckpt_path=model_path, anchors='grid_9', add_global_img=True)
print('load model from ', model_path)
However, I get the following:
----> 5 docowl = DocOwlInfer(ckpt_path=model_path, anchors='grid_9', add_global_img=True)
6 print('load model from ', model_path)
7 # exit(0)
Cell In[2], line 5, in DocOwlInfer.__init__(self, ckpt_path, anchors, add_global_img, load_8bit, load_4bit)
3 model_name = get_model_name_from_path(ckpt_path)
4 ic(model_name)
----> 5 self.tokenizer, self.model, _, _ = load_pretrained_model(ckpt_path, None, model_name, load_8bit=load_8bit, load_4bit=load_4bit, device="cuda")
50 else:
51 tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
---> 52 model = MPLUGDocOwlLlamaForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, **kwargs)
53 else:
--> 209 self.model = MPLUGDocOwlLlamaModel(config)
211 self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
213 # Initialize weights and apply final processing
File ~/SageMaker/mPLUG-DocOwl/DocOwl1.5/mplug_docowl/model/modeling_mplug_docowl.py:201, in MPLUGDocOwlLlamaModel.__init__(self, config)
200 def __init__(self, config: MPLUGDocOwlConfig):
--> 201 super(MPLUGDocOwlLlamaModel, self).__init__(config)
924 self.layers = nn.ModuleList(
--> 925 [LlamaDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
926 )
927 self.norm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
928 self.gradient_checkpointing = False
TypeError: LlamaDecoderLayer.__init__() takes 2 positional arguments but 3 were given
As you can see, I'm using a SageMaker instance. Could you please provide some guidance? Thanks
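A hedged observation on this error: transformers 4.36+ passes (config, layer_idx) when constructing LlamaDecoderLayer, while the decoder layer this repo patches in accepts only config, so the installed transformers is likely newer than the version the code targets. Pinning an older release, e.g. pip install "transformers<4.36", may resolve it; the exact bound is an assumption, not a tested fix.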
I have tried to execute the steps listed here for extracting the PaperOwl dataset. Can you please confirm whether these images are really missing, or is there something wrong with my extraction?
imgs/2106.08905v2/figures/out_28170.png
imgs/2106.08905v2/figures/28170.png
imgs/2303.16501v1/tables/table_7.png
imgs/2305.16835v1/figures/fig_result_2.png
imgs/2102.12037v3/figures/table-AUROC-boed.png
imgs/1908.09231v1/tables/table_1.png
.... more images are missing
Hello, I would like to ask how to test on the M-Paper dataset. For example, for the Multimodal Diagram Analysis task, the input needs to be Context + Diagrams + Outline, plus the question instructions, so how do you organize the input format for the model? Are there any associated evaluation scripts for M-Paper?
We created some large-scale multimodal datasets that contain OCR annotations; for some of them we ran PaddleOCR over LAION images.
Do you think those might be useful for tuning your method?
Best,
Chris
I tried to run it and got:
ModuleNotFoundError: No module named 'icecream'
(textgen) [root@pve0 DocOwl1.5]# pip install icecream
Collecting icecream
Using cached icecream-2.1.3-py2.py3-none-any.whl.metadata (1.4 kB)
Requirement already satisfied: colorama>=0.3.9 in /data/miniconda3/envs/textgen/lib/python3.10/site-packages (from icecream) (0.4.6)
Requirement already satisfied: pygments>=2.2.0 in /data/miniconda3/envs/textgen/lib/python3.10/site-packages (from icecream) (2.17.2)
Requirement already satisfied: executing>=0.3.1 in /data/miniconda3/envs/textgen/lib/python3.10/site-packages (from icecream) (2.0.1)
Requirement already satisfied: asttokens>=2.0.1 in /data/miniconda3/envs/textgen/lib/python3.10/site-packages (from icecream) (2.4.1)
Requirement already satisfied: six>=1.12.0 in /data/miniconda3/envs/textgen/lib/python3.10/site-packages (from asttokens>=2.0.1->icecream) (1.16.0)
Using cached icecream-2.1.3-py2.py3-none-any.whl (8.4 kB)
Installing collected packages: icecream
Successfully installed icecream-2.1.3
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
(textgen) [root@pve0 DocOwl1.5]# python app.py
2024-04-13 16:08:59 | ERROR | stderr | Traceback (most recent call last):
2024-04-13 16:08:59 | ERROR | stderr | File "/data/mplug-docowl/DocOwl1.5/app.py", line 23, in <module>
2024-04-13 16:08:59 | ERROR | stderr | no_change_btn = gr.Button.update()
2024-04-13 16:08:59 | ERROR | stderr | AttributeError: type object 'Button' has no attribute 'update'
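The failing call matches Gradio 4.x, which removed the per-component .update() classmethods. A hedged sketch of the two usual workarounds, untested against this repo's app.py:

import gradio as gr

# Option 1: port the call to the generic helper that current Gradio still provides.
no_change_btn = gr.update()  # instead of gr.Button.update()

# Option 2 (alternative): pin an older Gradio that still has Button.update,
# e.g. pip install "gradio<4" (the exact version bound is an assumption).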
As the title says.
@LukeForeverYoung Hey! Thanks for sharing this amazing work!
Are the model weights and inference code available ?
I would be happy to test them locally.
When I run the inference code:
from docowl_infer import DocOwlInfer
model_path='./mPLUG/DocOwl1.5-chat'
docowl=DocOwlInfer(ckpt_path=model_path, anchors='grid_9', add_global_img=True)
print('load model from ', model_path)
TypeError Traceback (most recent call last)
Cell In[3], line 1
----> 1 docowl=DocOwlInfer(ckpt_path=model_path, anchors='grid_9', add_global_img=True)
File c:\Users\internanirudh\Desktop\DocOwl\mPLUG-DocOwl-main\DocOwl1.5\docowl_infer.py:21, in DocOwlInfer.__init__(self, ckpt_path, anchors, add_global_img, load_8bit, load_4bit)
19 ic(model_name)
20 print("DocOwl Infer ")
---> 21 self.tokenizer, self.model, _, _ = load_pretrained_model(ckpt_path, None, model_name, load_8bit=load_8bit, load_4bit=load_4bit, device="cuda")
22 self.doc_image_processor = DocProcessor(image_size=448, anchors=anchors, add_global_img=add_global_img, add_textual_crop_indicator=True)
23 self.streamer = TextStreamer(self.tokenizer, skip_prompt=True, skip_special_tokens=True)
File c:\Users\internanirudh\Desktop\DocOwl\mPLUG-DocOwl-main\DocOwl1.5\mplug_docowl\model\builder.py:54, in load_pretrained_model(model_path, model_base, model_name, load_8bit, load_4bit, device_map, device)
52 tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
53 print("MPLUGDocOwlLlamaForCausalLM")
---> 54 model = MPLUGDocOwlLlamaForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, **kwargs)
55 else:
56 # Load language model
57 if model_base is not None:
58 # PEFT model
File c:\Users\internanirudh\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\modeling_utils.py:3405, in PreTrainedModel.from_pretrained(cls, pretrained_model_name_or_path, config, cache_dir, ignore_mismatched_sizes, force_download, local_files_only, token, revision, use_safetensors, *model_args, **kwargs)
3402 with ContextManagers(init_contexts):
3403 # Let's make sure we don't run the init function of buffer modules
3404 print("ContexManager")
-> 3405 model = cls(config, *model_args, **model_kwargs)
3407 # make sure we use the model's config since the init call might have copied it
3408 config = model.config
File c:\Users\internanirudh\Desktop\DocOwl\mPLUG-DocOwl-main\DocOwl1.5\mplug_docowl\model\modeling_mplug_docowl.py:209, in MPLUGDocOwlLlamaForCausalLM.__init__(self, config)
207 def __init__(self, config):
208 super(LlamaForCausalLM, self).__init__(config)
--> 209 self.model = MPLUGDocOwlLlamaModel(config)
211 self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
213 # Initialize weights and apply final processing
File c:\Users\internanirudh\Desktop\DocOwl\mPLUG-DocOwl-main\DocOwl1.5\mplug_docowl\model\modeling_mplug_docowl.py:201, in MPLUGDocOwlLlamaModel.__init__(self, config)
200 def __init__(self, config: MPLUGDocOwlConfig):
--> 201 super(MPLUGDocOwlLlamaModel, self).__init__(config)
File c:\Users\internanirudh\Desktop\DocOwl\mPLUG-DocOwl-main\DocOwl1.5\mplug_docowl\model\modeling_mplug_docowl.py:33, in MPLUGDocOwlMetaModel.__init__(self, config)
32 def __init__(self, config):
---> 33 super(MPLUGDocOwlMetaModel, self).__init__(config)
34 self.vision_model = MplugOwlVisionModel(
35 MplugOwlVisionConfig(**config.visual_config["visual_model"])
36 )
38 self.vision2text = MplugDocOwlHReducerModel(
39 MplugDocOwlHReducerConfig(**config.visual_config["visual_hreducer"]), config.hidden_size
40 )
File c:\Users\internanirudh\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\models\llama\modeling_llama.py:926, in LlamaModel.__init__(self, config)
923 self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
924 print("LlamaDecoderLayer Start")
925 self.layers = nn.ModuleList(
--> 926 [LlamaDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
927 )
928 print("LlamaDecoderLayer Ran")
929 self.norm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
File c:\Users\internanirudh\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\models\llama\modeling_llama.py:926, in <listcomp>(.0)
923 self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
924 print("LlamaDecoderLayer Start")
925 self.layers = nn.ModuleList(
--> 926 [LlamaDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
927 )
928 print("LlamaDecoderLayer Ran")
929 self.norm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
TypeError: LlamaDecoderLayer.__init__() takes 2 positional arguments but 3 were given
Can y'all give me a solution to this problem?
I found links to HF model cards in DocOwl 1.5 README, but none of them works.
Also, I found no models here https://huggingface.co/mPLUG
Can you please provide working links?
Hi! Thank you for the excellent work!
The unified instruction tuning dataset is a great contribution to the community and can be very useful. I wonder if there is a timetable for its release? Thanks!
This looks really good, and nothing like this has been developed before.
Excited for the source code. Also, all other models fail with documents because their image processors downgrade the resolution to 224. I believe this model handles the high resolution needed for document understanding.
Does it need OCR to extract the text in the document, or is it an OCR-free model?
The meaning of the values in <bbox> is confusing. It doesn't look like the x1,y1,x2,y2 format, since it fails to give the correct bbox for most images.
DocStruct4M
DocDownstream-1.0
DocReason25K
DocLocal4K
Are the image files included in these four datasets different from each other?
When will the training code be released? Thanks.
The DUE-benchmark provides the OCR results of PDF-type documents, and other models use the OCR results as input to evaluate their models. What is the input of your model when you use this benchmark? Do you use PNG/JPG images?
I'd like to get the same results with the Omni model as demonstrated in the Hugging Face demo, using the inference code in this repo.
Could you share what parameters (anchors/grid, input resolution, etc.) you use under the hood? Is there any other pre- or post-processing of the query or the input image that is absent from the inference code?
For example, with an image that says:
MAKE TEXT
STAND OUT FROM
BACKGROUNDS
I've got the following results:
With inference code:
from docowl_infer import DocOwlInfer
model_path = 'mPLUG/DocOwl1.5-Omni'
docowl=DocOwlInfer(ckpt_path=model_path, anchors='grid_9', add_global_img=True)
query = "Parse texts in the image."
answer = docowl.inference(image_path, query)
Output:
<doc> MAKE TEXT FROM IEX
STAOKOROUNDLICKGRIUINI </doc>
While the demo gives outputs:
[doc] TEXT MAKE
STAND OUT FROM
BACKGROUNDS [/doc]
EDIT: Added example
Thanks for your great job! I'm really curious about the pre-processing of the dataset, so I'd like to ask you about the following:
Thanks for your reply!
Thanks for your great work! What is the difference between DocDownstream-1.0 and the fine-tuning data used in UReader?
All the question prompts extracted from DocStruct4M's 'multi_grained_text_localization.jsonl' are as below:
[
"Give the bounding box of the text",
"Predict the bounding box of the text",
"Detect the text in the bounding box",
"Identify the text within the bounding box",
"Recognize the text in the bounding box",
"Locate the postion of the text"
]
In the last entry, 'postion' should be replaced with 'position'.
I wonder whether it matters for training the MLLM, because the number of affected samples is significantly high.
Running Windows 10 venv Python 3.10.6:
from docowl_infer import DocOwlInfer
model_path='mPLUG/DocOwl1.5-stage1'
docowl=DocOwlInfer(ckpt_path=model_path, anchors='grid_9', add_global_img=False)
ic| model_name: 'DocOwl1.5-stage1'
tokenizer_config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████| 749/749 [00:00<?, ?B/s]
E:\DocOwl\venv\lib\site-packages\huggingface_hub\file_download.py:148: UserWarning: `huggingface_hub` cache-system uses symlinks by default to efficiently store duplicated files but your machine does not support them in C:\Users\User\.cache\huggingface\hub\models--mPLUG--DocOwl1.5-stage1. Caching files will still work but in a degraded version that might require more space on your disk. This warning can be disabled by setting the `HF_HUB_DISABLE_SYMLINKS_WARNING` environment variable. For more details, see https://huggingface.co/docs/huggingface_hub/how-to-cache#limitations.
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
warnings.warn(message)
tokenizer.model: 100%|███████████████████████████████████████████████████████████████████████████████████| 500k/500k [00:00<00:00, 11.5MB/s]
special_tokens_map.json: 100%|█████████████████████████████████████████████████████████████████████████████████████| 438/438 [00:00<?, ?B/s]
config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 4.84k/4.84k [00:00<?, ?B/s]
model.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████| 16.3G/16.3G [06:55<00:00, 39.2MB/s]
Some weights of MPLUGDocOwlLlamaForCausalLM were not initialized from the model checkpoint at mPLUG/DocOwl1.5-stage1 and are newly initialized: ['model.layers.22.self_attn.rotary_emb.inv_freq', 'model.layers.13.self_attn.rotary_emb.inv_freq', 'model.layers.9.self_attn.rotary_emb.inv_freq', 'model.layers.1.self_attn.rotary_emb.inv_freq', 'model.layers.16.self_attn.rotary_emb.inv_freq', 'model.layers.5.self_attn.rotary_emb.inv_freq', 'model.layers.28.self_attn.rotary_emb.inv_freq', 'model.layers.20.self_attn.rotary_emb.inv_freq', 'model.layers.15.self_attn.rotary_emb.inv_freq', 'model.layers.12.self_attn.rotary_emb.inv_freq', 'model.layers.23.self_attn.rotary_emb.inv_freq', 'model.layers.21.self_attn.rotary_emb.inv_freq', 'model.layers.30.self_attn.rotary_emb.inv_freq', 'model.layers.3.self_attn.rotary_emb.inv_freq', 'model.layers.25.self_attn.rotary_emb.inv_freq', 'model.layers.2.self_attn.rotary_emb.inv_freq', 'model.layers.31.self_attn.rotary_emb.inv_freq', 'model.layers.24.self_attn.rotary_emb.inv_freq', 'model.layers.19.self_attn.rotary_emb.inv_freq', 'model.layers.26.self_attn.rotary_emb.inv_freq', 'model.layers.7.self_attn.rotary_emb.inv_freq', 'model.layers.14.self_attn.rotary_emb.inv_freq', 'model.layers.18.self_attn.rotary_emb.inv_freq', 'model.layers.27.self_attn.rotary_emb.inv_freq', 'model.layers.10.self_attn.rotary_emb.inv_freq', 'model.layers.4.self_attn.rotary_emb.inv_freq', 'model.layers.29.self_attn.rotary_emb.inv_freq', 'model.layers.6.self_attn.rotary_emb.inv_freq', 'model.layers.8.self_attn.rotary_emb.inv_freq', 'model.layers.11.self_attn.rotary_emb.inv_freq', 'model.layers.17.self_attn.rotary_emb.inv_freq']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
generation_config.json: 100%|███████████████████████████████████████| 162/162 [00:00<?, ?B/s]
Traceback (most recent call last):
File "", line 1, in
File "E:\DocOwl\mPLUG-DocOwl\DocOwl1.5\docowl_infer.py", line 19, in init
self.tokenizer, self.model, _, _ = load_pretrained_model(ckpt_path, None, model_name, load_8bit=load_8bit, load_4bit=load_4bit, device="cuda")
File "E:\DocOwl\mPLUG-DocOwl\DocOwl1.5\mplug_docowl\model\builder.py", line 52, in load_pretrained_model
model = MPLUGDocOwlLlamaForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, **kwargs)
File "E:\DocOwl\venv\lib\site-packages\transformers\modeling_utils.py", line 2959, in from_pretrained
dispatch_model(model, **kwargs)
File "E:\DocOwl\venv\lib\site-packages\accelerate\big_modeling.py", line 364, in dispatch_model
weights_map = OffloadedWeightsLoader(
File "E:\DocOwl\venv\lib\site-packages\accelerate\utils\offload.py", line 150, in init
raise ValueError("Need either astate_dict
or asave_folder
containing offloaded weights.")
ValueError: Need either astate_dict
or asave_folder
containing offloaded weights.
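This ValueError comes from accelerate's offloading path: from_pretrained decided to offload some weights (likely because GPU memory ran short) but was given no folder to put them in. A hedged sketch of a possible fix; device_map and offload_folder are standard transformers/accelerate kwargs, though whether the repo's builder forwards them this way is an assumption:

from mplug_docowl.model.modeling_mplug_docowl import MPLUGDocOwlLlamaForCausalLM

model = MPLUGDocOwlLlamaForCausalLM.from_pretrained(
    "mPLUG/DocOwl1.5-stage1",
    low_cpu_mem_usage=True,
    device_map="auto",           # lets accelerate place weights across devices
    offload_folder="./offload",  # supplies the missing save_folder for offloaded weights
)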
Downloading from HF is too slow; please upload a copy to ModelScope.
ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (400) from primary with message "{
"code": 400,
"type": "InternalServerException",
"message": "The checkpoint you are trying to load has model type `mplug_docowl` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date."
}
2024-03-29 12:11:24.059622: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-03-29 12:11:24.059732: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-03-29 12:11:24.156838: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-03-29 12:11:26.578613: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
ic| model_name: '0735ba4067b5ab76192ce6e7bc5694701ab4d779'
Traceback (most recent call last):
File "/content/drive/MyDrive/Document_Extraction/mPLUG-DocOwl/DocOwl1.5/docowl_infer.py", line 70, in <module>
docowl = DocOwlInfer(ckpt_path=model_path, anchors='grid_9', add_global_img=True)
File "/content/drive/MyDrive/Document_Extraction/mPLUG-DocOwl/DocOwl1.5/docowl_infer.py", line 19, in __init__
self.tokenizer, self.model, _, _ = load_pretrained_model(ckpt_path, None, model_name, load_8bit=load_8bit, load_4bit=load_4bit, device="cuda")
File "/content/drive/MyDrive/Document_Extraction/mPLUG-DocOwl/DocOwl1.5/mplug_docowl/model/builder.py", line 52, in load_pretrained_model
model = MPLUGDocOwlLlamaForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 3375, in from_pretrained
model = cls(config, *model_args, **model_kwargs)
File "/content/drive/MyDrive/Document_Extraction/mPLUG-DocOwl/DocOwl1.5/mplug_docowl/model/modeling_mplug_docowl.py", line 209, in __init__
self.model = MPLUGDocOwlLlamaModel(config)
File "/content/drive/MyDrive/Document_Extraction/mPLUG-DocOwl/DocOwl1.5/mplug_docowl/model/modeling_mplug_docowl.py", line 201, in __init__
super(MPLUGDocOwlLlamaModel, self).__init__(config)
File "/content/drive/MyDrive/Document_Extraction/mPLUG-DocOwl/DocOwl1.5/mplug_docowl/model/modeling_mplug_docowl.py", line 33, in __init__
super(MPLUGDocOwlMetaModel, self).__init__(config)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 924, in __init__
[LlamaDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 924, in <listcomp>
[LlamaDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
TypeError: LlamaDecoderLayer.__init__() takes 2 positional arguments but 3 were given
Hi, I am confused about how to convert the PDF datasets (DeepForm, KLC) into multiple images with the true key-value GT pairs for each converted PNG image, because the datasets downloaded from the DUE-benchmark have no page-ID information.
Among the released datasets, I cannot find which one contains image -> markdown text information.
Also, where does the Chinese OCR ability come from? The whole dataset contains no Chinese.
MplugDocOwlHReducerModel --> forward --> line 487:
sequence_output = self.reducer(hidden_states) # B,C,H,W -> B,C,H/conv_shape[1],W/(conv_shape[1])
After the self.reducer operation, the shape of hidden_states should be (B,C,H/conv_shape[0],W/conv_shape[1]), so the comment's H/conv_shape[1] looks like a typo.
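A minimal shape check supporting this report; the channel count, spatial size, and conv_shape below are illustrative assumptions, not the repo's actual configuration:

import torch
import torch.nn as nn

conv_shape = (1, 4)  # assumed reducer kernel/stride, for illustration only
reducer = nn.Conv2d(in_channels=1024, out_channels=1024,
                    kernel_size=conv_shape, stride=conv_shape)
x = torch.randn(2, 1024, 32, 32)  # B, C, H, W
out = reducer(x)
print(out.shape)  # torch.Size([2, 1024, 32, 8]): H/conv_shape[0], W/conv_shape[1]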
What input image size does the image encoder use, and what is the patch size?
If it follows the mPLUG-Owl configuration, can ViT-L/14 at 224x224 resolution resolve the dense text in the examples, such as documents and webpages?
Do mPLUG/DocStruct4M and mPLUG/DocDownstream-1.0 contain the datasets' image files? This cannot be verified on Hugging Face.
Thanks for your great work.
I have a small question: in the Table Parsing section, the text converts all table representations from HTML to Markdown format.
But table syntax in Markdown does not support merging rows or columns; the paper says that tags like <ROWSPAN=x> or <COLSPAN=y> are added.
Why not just use LaTeX code to represent the tables? The MMD format is compatible with LaTeX tables.
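For concreteness, a hypothetical sketch of what such an extended Markdown table might look like; the exact tag placement is my assumption, not the paper's verbatim syntax:

| Header A | Header B <COLSPAN=2> |
| cell 1 | cell 2 | cell 3 |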
At the same time, there is another question: in the inference phase, the model outputs the transformed table, which cannot be directly rendered, because the output format is neither LaTeX nor Markdown.
Is this due to the training data or the model design?
Is patch_positions not used anywhere?
Both the Multi-grained Text Grounding and Multi-grained Text Recognition tasks need bounding boxes to get the correspondence between specific texts and local positions.
The bboxes in the DocLocal4K and DocStruct4M datasets do not seem to be the real bounding boxes of the images.
My question is: the expression
[max(min(int(x)/999, 1.0), 0.0) for x in gt_answer.split(',')]
will truncate relatively large coordinates to 1.0. Isn't that a problem?
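A quick illustration of the clamping, with a made-up gt_answer containing an out-of-range coordinate:

gt_answer = "92, 444, 1200, 480"  # hypothetical example; 1200 exceeds the 0-999 range
coords = [max(min(int(x) / 999, 1.0), 0.0) for x in gt_answer.split(',')]
print(coords)  # the 1200 is truncated to exactly 1.0, losing its true position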
https://modelscope.cn/studios/damo/mPLUG-DocOwl/summary
This site is down at the moment; it says "当前空间运行错误,暂未发布" ("The current Space has a runtime error and has not been published yet").
Hi team,
Is there any instruction on how to use the stage1 model? I am interested in the document/webpage parsing capabilities.
If not, can you provide an example script?
Thanks!!
Thanks for your work.
Please, can you split the model into 4GB chunks rather than a single 16GB file? I have already converted it to safetensors through HF in the repo as well.
That would just make it much more usable.
Thanks.
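In the meantime, a minimal re-sharding sketch; it assumes the checkpoint loads with the repo's model class, and max_shard_size is a standard save_pretrained argument:

from mplug_docowl.model.modeling_mplug_docowl import MPLUGDocOwlLlamaForCausalLM

# Load once from the single 16GB checkpoint, then rewrite it as ~4GB safetensors shards.
model = MPLUGDocOwlLlamaForCausalLM.from_pretrained("mPLUG/DocOwl1.5-Chat")
model.save_pretrained("./DocOwl1.5-Chat-resharded", max_shard_size="4GB",
                      safe_serialization=True)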
When will Chinese document Q&A be supported?