
mplug-docowl's People

Contributors

hawlyq, lukeforeveryoung, zhangliang-04


mplug-docowl's Issues

About the data preprocessing

Thanks for your great work! I'm really curious about the preprocessing of the dataset, so I'd like to ask about the following:

  1. In datasets such as DocVQA, one question may have several answers. How did you handle this? By sampling, or by re-organizing the dataset so there is one sample per answer?
  2. What is the resolution of the input image?
  3. How did you process datasets with multi-page PDFs (such as KLC/DeepForm)?

Thanks for your reply!

Question about how to evaluate mPLUG-PaperOwl or other VLMs on M-Paper

Hello, I would like to ask how to test on the M-Paper dataset. For example, for the Multimodal Diagram Analysis task, the input needs to be Context + Diagrams + Outline, plus the question instructions. How do you organize the input format for the model? Are there any associated evaluation scripts for M-Paper?

Dataset Questions

Do mPLUG/DocStruct4M and mPLUG/DocDownstream-1.0 contain image files in the dataset? This cannot be verified on Hugging Face.

There is an error in the HReducer module's code comments

MplugDocOwlHReducerModel --> forward --> line 487
sequence_output = self.reducer(hidden_states) # B,C,H,W -> B,C,H/conv_shape[1],W/(conv_shape[1])
After the self.reducer operation, the shape of hidden_states should be (B, C, H/conv_shape[0], W/conv_shape[1]).
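A minimal sketch (not the repository's actual HReducer code) of why the corrected comment is (B, C, H/conv_shape[0], W/conv_shape[1]): a convolution whose kernel size and stride both equal conv_shape divides the height by conv_shape[0] and the width by conv_shape[1]. The conv_shape value and channel count below are assumed for illustration only.

import torch
import torch.nn as nn

conv_shape = (1, 4)  # assumed example: keep height, reduce width by 4
reducer = nn.Conv2d(in_channels=1024, out_channels=1024,
                    kernel_size=conv_shape, stride=conv_shape)

hidden_states = torch.randn(2, 1024, 32, 32)  # B, C, H, W
sequence_output = reducer(hidden_states)
print(sequence_output.shape)  # torch.Size([2, 1024, 32, 8]) = B, C, H/conv_shape[0], W/conv_shape[1]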

How to convert multi-page PDFs to images (PNG)?

Hi, I am confused about how to convert the PDF datasets (DeepForm, KLC) into multiple images with the correct key-value ground-truth pairs for each converted PNG image, because the datasets downloaded from the DUE-Benchmark have no page ID information.
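For the page-to-image conversion itself (matching the ground-truth pairs to pages is a separate question), a minimal sketch using the pdf2image package might look like the following; the file names and output directory are placeholders, not part of the DUE-Benchmark release.

# Render each page of a PDF to a PNG with pdf2image (requires the poppler backend).
from pathlib import Path
from pdf2image import convert_from_path

pdf_path = "document.pdf"
out_dir = Path("pages")
out_dir.mkdir(exist_ok=True)

pages = convert_from_path(pdf_path, dpi=200)
for page_id, page in enumerate(pages):
    page.save(out_dir / f"{Path(pdf_path).stem}_page_{page_id}.png", "PNG")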

About DocStruct4M and DocReason25K

Thanks for your great work!

In the paper mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding, the DocStruct4M and DocReason25K datasets are mentioned, but they are not open-source.

May I ask if there are any plans to open-source these two datasets?

ValueError: Need either a `state_dict` or a `save_folder` containing offloaded weights.

Running Windows 10 venv Python 3.10.6:

from docowl_infer import DocOwlInfer
model_path='mPLUG/DocOwl1.5-stage1'
docowl=DocOwlInfer(ckpt_path=model_path, anchors='grid_9', add_global_img=False)
ic| model_name: 'DocOwl1.5-stage1'
tokenizer_config.json: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 749/749 [00:00<?, ?B/s]
E:\DocOwl\venv\lib\site-packages\huggingface_hub\file_download.py:148: UserWarning: huggingface_hub cache-system uses symlinks by default to efficiently store duplicated files but your machine does not support them in C:\Users\User.cache\huggingface\hub\models--mPLUG--DocOwl1.5-stage1. Caching files will still work but in a degraded version that might require more space on your disk. This warning can be disabled by setting the HF_HUB_DISABLE_SYMLINKS_WARNING environment variable. For more details, see https://huggingface.co/docs/huggingface_hub/how-to-cache#limitations.
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
warnings.warn(message)
tokenizer.model: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 500k/500k [00:00<00:00, 11.5MB/s]
special_tokens_map.json: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 438/438 [00:00<?, ?B/s]
config.json: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 4.84k/4.84k [00:00<?, ?B/s]
model.safetensors: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 16.3G/16.3G [06:55<00:00, 39.2MB/s]
Some weights of MPLUGDocOwlLlamaForCausalLM were not initialized from the model checkpoint at mPLUG/DocOwl1.5-stage1 and are newly initialized: ['model.layers.22.self_attn.rotary_emb.inv_freq', 'model.layers.13.self_attn.rotary_emb.inv_freq', 'model.layers.9.self_attn.rotary_emb.inv_freq', 'model.layers.1.self_attn.rotary_emb.inv_freq', 'model.layers.16.self_attn.rotary_emb.inv_freq', 'model.layers.5.self_attn.rotary_emb.inv_freq', 'model.layers.28.self_attn.rotary_emb.inv_freq', 'model.layers.20.self_attn.rotary_emb.inv_freq', 'model.layers.15.self_attn.rotary_emb.inv_freq', 'model.layers.12.self_attn.rotary_emb.inv_freq', 'model.layers.23.self_attn.rotary_emb.inv_freq', 'model.layers.21.self_attn.rotary_emb.inv_freq', 'model.layers.30.self_attn.rotary_emb.inv_freq', 'model.layers.3.self_attn.rotary_emb.inv_freq', 'model.layers.25.self_attn.rotary_emb.inv_freq', 'model.layers.2.self_attn.rotary_emb.inv_freq', 'model.layers.31.self_attn.rotary_emb.inv_freq', 'model.layers.24.self_attn.rotary_emb.inv_freq', 'model.layers.19.self_attn.rotary_emb.inv_freq', 'model.layers.26.self_attn.rotary_emb.inv_freq', 'model.layers.7.self_attn.rotary_emb.inv_freq', 'model.layers.14.self_attn.rotary_emb.inv_freq', 'model.layers.18.self_attn.rotary_emb.inv_freq', 'model.layers.27.self_attn.rotary_emb.inv_freq', 'model.layers.10.self_attn.rotary_emb.inv_freq', 'model.layers.4.self_attn.rotary_emb.inv_freq', 'model.layers.29.self_attn.rotary_emb.inv_freq', 'model.layers.6.self_attn.rotary_emb.inv_freq', 'model.layers.8.self_attn.rotary_emb.inv_freq', 'model.layers.11.self_attn.rotary_emb.inv_freq', 'model.layers.17.self_attn.rotary_emb.inv_freq']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
generation_config.json: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 162/162 [00:00<?, ?B/s]
Traceback (most recent call last):
File "", line 1, in
File "E:\DocOwl\mPLUG-DocOwl\DocOwl1.5\docowl_infer.py", line 19, in init
self.tokenizer, self.model, _, _ = load_pretrained_model(ckpt_path, None, model_name, load_8bit=load_8bit, load_4bit=load_4bit, device="cuda")
File "E:\DocOwl\mPLUG-DocOwl\DocOwl1.5\mplug_docowl\model\builder.py", line 52, in load_pretrained_model
model = MPLUGDocOwlLlamaForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, **kwargs)
File "E:\DocOwl\venv\lib\site-packages\transformers\modeling_utils.py", line 2959, in from_pretrained
dispatch_model(model, **kwargs)
File "E:\DocOwl\venv\lib\site-packages\accelerate\big_modeling.py", line 364, in dispatch_model
weights_map = OffloadedWeightsLoader(
File "E:\DocOwl\venv\lib\site-packages\accelerate\utils\offload.py", line 150, in init
raise ValueError("Need either a state_dict or a save_folder containing offloaded weights.")
ValueError: Need either a state_dict or a save_folder containing offloaded weights.

complex figures

Great work. What is the best way to extract complex figures from PDFs? Is there a way to parse them as images and then apply OCR (worst case), or something else?

I noticed that complex figures are not translated; see the example figure from your paper (screenshots omitted).
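As a starting point for pulling figures out of a PDF before applying a model or OCR, a minimal sketch with PyMuPDF might look like the following; the paths are placeholders, and complex vector figures that are not stored as embedded images would instead need the page region rasterized (e.g. with page.get_pixmap()).

# Extract embedded images from a PDF with PyMuPDF (imported as fitz).
import fitz

doc = fitz.open("paper.pdf")
for page_index, page in enumerate(doc):
    for img_index, img in enumerate(page.get_images(full=True)):
        xref = img[0]
        info = doc.extract_image(xref)
        with open(f"page{page_index}_img{img_index}.{info['ext']}", "wb") as f:
            f.write(info["image"])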

How to test the model in DUE-benchmark?

The DUE-Benchmark provides the OCR results of PDF-type documents, and other models use these OCR results as input to evaluate their models. What is the input to your model when you use this benchmark? Do you use PNG/JPG images?

DocOwl1.5: Inference results often in wrong order

Hello,
I pulled your repo, and so far the inference with the stage 1 model works fine. However, the results I get for localized text recognition are often in the wrong order. For example, I use this code (basically the demo code from the README.md):

from docowl_infer import DocOwlInfer
model_path = "./models/models--mPLUG--DocOwl1.5-stage1/.../"
docowl = DocOwlInfer(ckpt_path=model_path, anchors="grid_9", add_global_img=False)

image = "image.jpg"
query = "Identify the text within the bounding box <bbox>92, 444, 880, 480</bbox>"
answer = docowl.inference(image, query)

print(answer)

on this image (only the relevant part is left visible)

52_82_combined_0dfL_0_mittlere_seite_original

Which gives the result: 8 Spl. Fz.z.Pers.bef.b. 5

Here, the two parts "8 Spl." and "Fz.z.Pers.bef.b." are in the wrong order (the "5" at the end is hallucinated, but that only happens in the anonymized image, not in the original one, so no concern there). Something like that happens quite often. I have the feeling that I missed something. Am I using the model correctly?

There is indeed a warning the code throws during inference:

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.

And also one during model loading:

Some weights of MPLUGDocOwlLlamaForCausalLM were not initialized from the model checkpoint at ... and are newly initialized: ['model.layers.4.self_attn.rotary_emb.inv_freq', ..., 'model.layers.2.self_attn.rotary_emb.inv_freq']

You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Instruction following data

Hi! Thank you for the excellent work!

The unified instruction tuning dataset is a great contribution to the community and can be very useful. I wonder if there is a timetable for its release? Thanks!

Models Weights

@LukeForeverYoung Hey! Thanks for sharing this amazing work!

Are the model weights and inference code available?
I would be happy to test them locally.

Image encoder configuration

What image input size does the image encoder use, and what is the patch size?
If the configuration follows mPLUG-Owl, can ViT-L/14 at 224x224 resolution resolve the dense text in the samples, such as documents and webpages?

About Table Parsing in mPLUG-DocOwl1.5 work

Thanks for your great work.

I have a small question: in the Table Parsing section, the paper converts all table representations from HTML to Markdown format.

But Markdown table syntax does not support merging rows or columns; the paper says that tags like <ROWSPAN=x> or <COLSPAN=y> are added to handle this.

Why not just use LaTeX code to represent the table? The MMD format is compatible with LaTeX tables.

At the same time, there is another question: in the inference phase, the model outputs the converted table, which cannot be rendered directly, because the output format is neither valid LaTeX nor valid Markdown.

How to get the real bbox and questions about the normalization function

Both the Multi-grained Text Grounding and Multi-grained Text Recognition tasks need bounding boxes to establish the correspondence between specific texts and local positions.
The bboxes in the DocLocal4K and DocStruct4M datasets do not seem to be the real bounding boxes of the images.

My questions are:

  1. How can I get the real image-space bounding box for the image?
  2. The normalization function, [max(min(int(x)/999, 1.0), 0.0) for x in gt_answer.split(',')], truncates relatively large coordinates to 1. Isn't this a problem? (See the sketch below.)
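To make the clamping behavior concrete, here is a small sketch of the quoted normalization (the function name is an assumption, not the repository's exact code): coordinates are divided by 999 and clamped to [0, 1], so any raw value above 999 is truncated to 1.0.

def normalize_bbox(gt_answer: str) -> list[float]:
    # divide by 999 and clamp to [0, 1], as in the quoted expression
    return [max(min(int(x) / 999, 1.0), 0.0) for x in gt_answer.split(',')]

print(normalize_bbox("92, 444, 880, 480"))  # [0.092..., 0.444..., 0.880..., 0.480...]
print(normalize_bbox("1200, 50, 999, 0"))   # [1.0, 0.050..., 1.0, 0.0] -- 1200 is clamped to 1.0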

Break down the 16 GB file into chunks

Could you please split the model into 4 GB chunks rather than a single 16 GB file? I have also already converted it to safetensors through HF in the repo.

It would just make it much more usable.

Thanks.
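For anyone who wants to do this locally, here is a hedged sketch of re-saving an already-downloaded checkpoint in smaller shards with the standard Transformers save_pretrained option; the import path, model class usage, and output directory are assumptions based on the repository's builder code, not an official recipe.

# Re-save a checkpoint in 4 GB shards with save_pretrained(max_shard_size=...).
from mplug_docowl.model.modeling_mplug_docowl import MPLUGDocOwlLlamaForCausalLM

model = MPLUGDocOwlLlamaForCausalLM.from_pretrained(
    "mPLUG/DocOwl1.5-stage1", low_cpu_mem_usage=True
)
model.save_pretrained(
    "DocOwl1.5-stage1-sharded", max_shard_size="4GB", safe_serialization=True
)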

Online demo parameters

I'd like to get the same results with the Omni model as demonstrated in the Hugging Face demo, using the inference code in this repo.

Could you share what parameters (anchors/grid, input resolution, etc.) you use under the hood? Is there any other pre- or post-processing of the query or the input image that is absent from the inference code?

For example, with an image that says:

MAKE TEXT
STAND OUT FROM 
BACKGROUNDS

I've got the following results:

With inference code:

from docowl_infer import DocOwlInfer

model_path = 'mPLUG/DocOwl1.5-Omni'
docowl=DocOwlInfer(ckpt_path=model_path, anchors='grid_9', add_global_img=True)
query = "Parse texts in the image."
answer = docowl.inference(image_path, query)

Output:

<doc>     MAKE TEXT FROM IEX 
    STAOKOROUNDLICKGRIUINI </doc>

While the demo gives outputs:

[doc] TEXT MAKE
STAND OUT FROM
BACKGROUNDS [/doc]

EDIT: Added example

Inference is not working with either SageMaker or the inference file provided on GitHub

  1. I created the inference endpoint on SageMaker; when I try to invoke it, I get the following error.
ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (400) from primary with message "{
  "code": 400,
  "type": "InternalServerException",
  "message": "The checkpoint you are trying to load has model type `mplug_docowl` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date."
}
  2. I downloaded the model checkpoints to my machine from Hugging Face and tried to run the inference file.
    This one also fails with the following error.
2024-03-29 12:11:24.059622: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-03-29 12:11:24.059732: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-03-29 12:11:24.156838: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-03-29 12:11:26.578613: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
ic| model_name: '0735ba4067b5ab76192ce6e7bc5694701ab4d779'
Traceback (most recent call last):
  File "/content/drive/MyDrive/Document_Extraction/mPLUG-DocOwl/DocOwl1.5/docowl_infer.py", line 70, in <module>
    docowl = DocOwlInfer(ckpt_path=model_path, anchors='grid_9', add_global_img=True)
  File "/content/drive/MyDrive/Document_Extraction/mPLUG-DocOwl/DocOwl1.5/docowl_infer.py", line 19, in __init__
    self.tokenizer, self.model, _, _ = load_pretrained_model(ckpt_path, None, model_name, load_8bit=load_8bit, load_4bit=load_4bit, device="cuda")
  File "/content/drive/MyDrive/Document_Extraction/mPLUG-DocOwl/DocOwl1.5/mplug_docowl/model/builder.py", line 52, in load_pretrained_model
    model = MPLUGDocOwlLlamaForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 3375, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
  File "/content/drive/MyDrive/Document_Extraction/mPLUG-DocOwl/DocOwl1.5/mplug_docowl/model/modeling_mplug_docowl.py", line 209, in __init__
    self.model = MPLUGDocOwlLlamaModel(config)
  File "/content/drive/MyDrive/Document_Extraction/mPLUG-DocOwl/DocOwl1.5/mplug_docowl/model/modeling_mplug_docowl.py", line 201, in __init__
    super(MPLUGDocOwlLlamaModel, self).__init__(config)
  File "/content/drive/MyDrive/Document_Extraction/mPLUG-DocOwl/DocOwl1.5/mplug_docowl/model/modeling_mplug_docowl.py", line 33, in __init__
    super(MPLUGDocOwlMetaModel, self).__init__(config)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 924, in __init__
    [LlamaDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 924, in <listcomp>
    [LlamaDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
TypeError: LlamaDecoderLayer.__init__() takes 2 positional arguments but 3 were given

Issue when loading the model with Hugging Face

Hi, I downloaded the repo and tried initializing the model with:

model_path = "mPLUG/DocOwl1.5-Chat"
docowl = DocOwlInfer(ckpt_path=model_path, anchors='grid_9', add_global_img=True)
print('load model from ', model_path)

However, I get the following:

----> 5 docowl = DocOwlInfer(ckpt_path=model_path, anchors='grid_9', add_global_img=True)
      6 print('load model from ', model_path)
      7 # exit(0)

Cell In[2], line 5, in DocOwlInfer.__init__(self, ckpt_path, anchors, add_global_img, load_8bit, load_4bit)
      3 model_name = get_model_name_from_path(ckpt_path)
      4 ic(model_name)
----> 5 self.tokenizer, self.model, _, _ = load_pretrained_model(ckpt_path, None, model_name, load_8bit=load_8bit, load_4bit=load_4bit, device="cuda")

     50     else:
     51         tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
---> 52         model = MPLUGDocOwlLlamaForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, **kwargs)
     53 else:

--> 209     self.model = MPLUGDocOwlLlamaModel(config)
    211     self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
    213     # Initialize weights and apply final processing

File ~/SageMaker/mPLUG-DocOwl/DocOwl1.5/mplug_docowl/model/modeling_mplug_docowl.py:201, in MPLUGDocOwlLlamaModel.__init__(self, config)
    200 def __init__(self, config: MPLUGDocOwlConfig):
--> 201     super(MPLUGDocOwlLlamaModel, self).__init__(config)

    924 self.layers = nn.ModuleList(
--> 925     [LlamaDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
    926 )
    927 self.norm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
    928 self.gradient_checkpointing = False

TypeError: LlamaDecoderLayer.__init__() takes 2 positional arguments but 3 were given

As you can see, I'm using a SageMaker instance. Could you please provide some guidance? Thanks!

DocOwl1.5: running the example code, the inference result only repeats part of the text

Running DocOwl1.5 with the example code to recognize the text in an image, the inference result only repeats part of the text, and after running the code the message '...Setting pad_token_id to eos_token_id:2 for open-end generation....' appears. (Screenshot omitted.)

Amazing work

This looks really good. And nothing like this has been developed before.

Excited for the source code. Also, all other models fail with documents because the image processor downscales the resolution to 224. I believe this model handles the high resolution needed for document understanding.

Does it need OCR to extract the text in the document, or is it an OCR-free model?

Spelling errors in DocStruct4M, 'multi_grained_text_localization.jsonl'

All the question prompts extracted from DocStruct4M's 'multi_grained_text_localization.jsonl' are as below:

[
  "Give the bounding box of the text",
  "Predict the bounding box of the text",
  "Detect the text in the bounding box",
  "Identify the text within the bounding box",
  "Recognize the text in the bounding box",
  "Locate the postion of the text"
]

In the last item, 'postion' should be replaced with 'position'.
I wonder whether this matters for training the MLLM, because the number of affected prompts is significantly high.

Missing Images for Paperowl

I have tried to execute the steps listed here for extracting the PaperOwl dataset. Can you please confirm whether these images are really missing, or whether something went wrong in my extraction?

imgs/2106.08905v2/figures/out_28170.png
imgs/2106.08905v2/figures/28170.png
imgs/2303.16501v1/tables/table_7.png
imgs/2305.16835v1/figures/fig_result_2.png
imgs/2102.12037v3/figures/table-AUROC-boed.png
imgs/1908.09231v1/tables/table_1.png
.... more images are missing

Can this be used for information extraction from PI/CI images?

Hello, and thank you for your work. I tested some related PI/CI images with the online mPLUG-DocOwl demo; my goal is to get structured data for the relevant fields from the model, to speed up review.
However, the current demo's results are unsatisfactory: asking for the values of the relevant fields easily produces hallucinations and incorrect answers, and the numbers in the answers are all wrong. Could it be made more accurate through fine-tuning, or by improving its OCR capability? Looking forward to your reply.

Hugging Face integration

Thank you for your work!
When will you make this available on Hugging Face, with instructions?

Thanks.

bounding box visualization

How can I properly visualize a bounding box on an image? It seems that conventional operations don't display it correctly. Do I need to perform any special transformations?
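A hedged sketch of one plausible interpretation, not the official recipe: other issues in this list suggest the <bbox>x1, y1, x2, y2</bbox> values are on a 0-999 scale relative to the image fed to the model, so they can be mapped back to pixel coordinates of that same image and drawn with PIL. If the model internally crops or resizes the image (e.g. with grid anchors), the box may refer to a crop rather than the full image, which could explain why conventional operations look wrong.

from PIL import Image, ImageDraw

image = Image.open("image.jpg")
w, h = image.size
x1, y1, x2, y2 = 92, 444, 880, 480  # values taken from a <bbox>...</bbox> answer

draw = ImageDraw.Draw(image)
draw.rectangle(
    [x1 / 999 * w, y1 / 999 * h, x2 / 999 * w, y2 / 999 * h],
    outline="red", width=3,
)
image.save("image_with_bbox.jpg")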

Instructions on how to use the stage1 model

Hi team,

Is there any instruction on how to use the stage1 model? I'm interested in the document/webpage parsing capabilities.
If not, can you provide an example script?

Thanks!!

Getting TypeError: LlamaDecoderLayer.__init__() takes 2 positional arguments but 3 were given

When I run the inference code:
from docowl_infer import DocOwlInfer
model_path='./mPLUG/DocOwl1.5-chat'
docowl=DocOwlInfer(ckpt_path=model_path, anchors='grid_9', add_global_img=True)
print('load model from ', model_path)

I am getting

TypeError Traceback (most recent call last)
Cell In[3], line 1
----> 1 docowl=DocOwlInfer(ckpt_path=model_path, anchors='grid_9', add_global_img=True)

File c:\Users\internanirudh\Desktop\DocOwl\mPLUG-DocOwl-main\DocOwl1.5\docowl_infer.py:21, in DocOwlInfer.init(self, ckpt_path, anchors, add_global_img, load_8bit, load_4bit)
19 ic(model_name)
20 print("DocOwl Infer ")
---> 21 self.tokenizer, self.model, _, _ = load_pretrained_model(ckpt_path, None, model_name, load_8bit=load_8bit, load_4bit=load_4bit, device="cuda")
22 self.doc_image_processor = DocProcessor(image_size=448, anchors=anchors, add_global_img=add_global_img, add_textual_crop_indicator=True)
23 self.streamer = TextStreamer(self.tokenizer, skip_prompt=True, skip_special_tokens=True)

File c:\Users\internanirudh\Desktop\DocOwl\mPLUG-DocOwl-main\DocOwl1.5\mplug_docowl\model\builder.py:54, in load_pretrained_model(model_path, model_base, model_name, load_8bit, load_4bit, device_map, device)
52 tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
53 print("MPLUGDocOwlLlamaForCausalLM")
---> 54 model = MPLUGDocOwlLlamaForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, **kwargs)
55 else:
56 # Load language model
57 if model_base is not None:
58 # PEFT model

File c:\Users\internanirudh\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\modeling_utils.py:3405, in PreTrainedModel.from_pretrained(cls, pretrained_model_name_or_path, config, cache_dir, ignore_mismatched_sizes, force_download, local_files_only, token, revision, use_safetensors, *model_args, **kwargs)
3402 with ContextManagers(init_contexts):
3403 # Let's make sure we don't run the init function of buffer modules
3404 print("ContexManager")
-> 3405 model = cls(config, *model_args, **model_kwargs)
3407 # make sure we use the model's config since the init call might have copied it
3408 config = model.config

File c:\Users\internanirudh\Desktop\DocOwl\mPLUG-DocOwl-main\DocOwl1.5\mplug_docowl\model\modeling_mplug_docowl.py:209, in MPLUGDocOwlLlamaForCausalLM.init(self, config)
207 def init(self, config):
208 super(LlamaForCausalLM, self).init(config)
--> 209 self.model = MPLUGDocOwlLlamaModel(config)
211 self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
213 # Initialize weights and apply final processing

File c:\Users\internanirudh\Desktop\DocOwl\mPLUG-DocOwl-main\DocOwl1.5\mplug_docowl\model\modeling_mplug_docowl.py:201, in MPLUGDocOwlLlamaModel.init(self, config)
200 def init(self, config: MPLUGDocOwlConfig):
--> 201 super(MPLUGDocOwlLlamaModel, self).init(config)

File c:\Users\internanirudh\Desktop\DocOwl\mPLUG-DocOwl-main\DocOwl1.5\mplug_docowl\model\modeling_mplug_docowl.py:33, in MPLUGDocOwlMetaModel.init(self, config)
32 def init(self, config):
---> 33 super(MPLUGDocOwlMetaModel, self).init(config)
34 self.vision_model = MplugOwlVisionModel(
35 MplugOwlVisionConfig(**config.visual_config["visual_model"])
36 )
38 self.vision2text = MplugDocOwlHReducerModel(
39 MplugDocOwlHReducerConfig(**config.visual_config["visual_hreducer"]), config.hidden_size
40 )

File c:\Users\internanirudh\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\models\llama\modeling_llama.py:926, in LlamaModel.init(self, config)
923 self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
924 print("LlamaDecoderLayer Start")
925 self.layers = nn.ModuleList(
--> 926 [LlamaDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
927 )
928 print("LlamaDecoderLayer Ran")
929 self.norm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)

File c:\Users\internanirudh\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\models\llama\modeling_llama.py:926, in (.0)
923 self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
924 print("LlamaDecoderLayer Start")
925 self.layers = nn.ModuleList(
--> 926 [LlamaDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
927 )
928 print("LlamaDecoderLayer Ran")
929 self.norm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)

TypeError: LlamaDecoderLayer.__init__() takes 2 positional arguments but 3 were given

Can you give me a solution to this problem?

Basic instructions for deploying locally?

I tried:

ModuleNotFoundError: No module named 'icecream'
(textgen) [root@pve0 DocOwl1.5]# pip install icecream
Collecting icecream
  Using cached icecream-2.1.3-py2.py3-none-any.whl.metadata (1.4 kB)
Requirement already satisfied: colorama>=0.3.9 in /data/miniconda3/envs/textgen/lib/python3.10/site-packages (from icecream) (0.4.6)
Requirement already satisfied: pygments>=2.2.0 in /data/miniconda3/envs/textgen/lib/python3.10/site-packages (from icecream) (2.17.2)
Requirement already satisfied: executing>=0.3.1 in /data/miniconda3/envs/textgen/lib/python3.10/site-packages (from icecream) (2.0.1)
Requirement already satisfied: asttokens>=2.0.1 in /data/miniconda3/envs/textgen/lib/python3.10/site-packages (from icecream) (2.4.1)
Requirement already satisfied: six>=1.12.0 in /data/miniconda3/envs/textgen/lib/python3.10/site-packages (from asttokens>=2.0.1->icecream) (1.16.0)
Using cached icecream-2.1.3-py2.py3-none-any.whl (8.4 kB)
Installing collected packages: icecream
Successfully installed icecream-2.1.3
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
(textgen) [root@pve0 DocOwl1.5]# python app.py
2024-04-13 16:08:59 | ERROR | stderr | Traceback (most recent call last):
2024-04-13 16:08:59 | ERROR | stderr |   File "/data/mplug-docowl/DocOwl1.5/app.py", line 23, in <module>
2024-04-13 16:08:59 | ERROR | stderr |     no_change_btn = gr.Button.update()
2024-04-13 16:08:59 | ERROR | stderr | AttributeError: type object 'Button' has no attribute 'update'

Check out our datasets, I think they might be useful for training models like this.

We created some large-scale multimodal datasets that contain OCR annotations. For some, we ran PaddleOCR over LAION images:

  1. https://huggingface.co/datasets/wendlerc/LAION5B-en-PaddleOCR-parquet
  2. https://huggingface.co/datasets/wendlerc/LAION5B-hr-en-PaddleOCR-parquet

We also rendered text images with Blender:

  3. https://huggingface.co/datasets/wendlerc/RenderedText

And here we captioned SynthText with BLIP-2:

  4. https://huggingface.co/datasets/wendlerc/CaptionedSynthText

Do you think these might be useful for tuning your method?

Best,
Chris

Why is the inference output garbled after loading the pretrained model with the parameters on Hugging Face?

I loaded the pretrained model from Hugging Face in the UReader way and input a simple image:
[input image: 424777-PE9BDR-101]

But the output is something unintelligible:
iwEdAqNwbmcDAQTRA7oF0QJvBrD8kNF7ARjUZAWLq7Xt_WIAB9MAAAAA8ugD3QgACaJpbQoAC9IAAecr png_720x720q90

I don't know where the problem is. Have any parameters changed?
