x-plug / mplug-docowl
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding
License: Apache License 2.0
Thanks for your great work!
The paper mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding mentions the DocStruct4M and DocReason25K datasets, but they are not open-source.
May I ask if there are any plans to open-source these two datasets?
:)
Another thing: will this model support Chinese OCR soon?
Thank you for your work!
When will you make this available on Hugging Face, with instructions, please?
Thanks.
Hello,
I pulled your repo and so far the inference with the stage 1 model works fine. However, the results I get for localized text recognition are often in the wrong order. For example, I use this code (basically the demo code from the README.md):
from docowl_infer import DocOwlInfer
model_path = "./models/models--mPLUG--DocOwl1.5-stage1/.../"
docowl = DocOwlInfer(ckpt_path=model_path, anchors="grid_9", add_global_img=False)
image = "image.jpg"
query = "Identify the text within the bounding box <bbox>92, 444, 880, 480</bbox>"
answer = docowl.inference(image, query)
print(answer)
on this image (only the relevant part is left visible)
which gives the result: 8 Spl. Fz.z.Pers.bef.b. 5
Here, the two parts "8 Spl." and "Fz.z.Pers.bef.b." are in the wrong order (the "5" at the end is hallucinated, but that only happens in the anonymized image, not in the original one, so no concern there). Something like that happens quite often. I have the feeling that I missed something. Am I using the model correctly?
There is indeed a warning the code throws during inference:
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
And also one during model loading:
Some weights of MPLUGDocOwlLlamaForCausalLM were not initialized from the model checkpoint at ... and are newly initialized: ['model.layers.4.self_attn.rotary_emb.inv_freq', ..., 'model.layers.2.self_attn.rotary_emb.inv_freq']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
How can I properly visualize a bounding box on an image? It seems that conventional operations don't display it correctly. Do I need to perform any special transformations?
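One hedged way to draw such a box, assuming the <bbox> values are normalized to [0, 999] relative to the full image (an assumption inferred from the int(x)/999 coordinate handling discussed later in this thread, not a documented contract):

from PIL import Image, ImageDraw

image = Image.open("image.jpg")
w, h = image.size
x1, y1, x2, y2 = 92, 444, 880, 480  # the <bbox> values from the query above
# Scale the normalized coordinates to pixel space before drawing.
box = [x1 / 999 * w, y1 / 999 * h, x2 / 999 * w, y2 / 999 * h]
draw = ImageDraw.Draw(image)
draw.rectangle(box, outline="red", width=3)
image.save("image_with_bbox.jpg")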
May I ask why the Hugging Face links for the mPLUG-DocOwl 1.5 datasets are broken?
Hi, I downloaded the repo and tried initializing the model with:
model_path = "mPLUG/DocOwl1.5-Chat"
docowl = DocOwlInfer(ckpt_path=model_path, anchors='grid_9', add_global_img=True)
print('load model from ', model_path)
However, I get the following:
----> 5 docowl = DocOwlInfer(ckpt_path=model_path, anchors='grid_9', add_global_img=True)
6 print('load model from ', model_path)
7 # exit(0)
Cell In[2], line 5, in DocOwlInfer.__init__(self, ckpt_path, anchors, add_global_img, load_8bit, load_4bit)
3 model_name = get_model_name_from_path(ckpt_path)
4 ic(model_name)
----> 5 self.tokenizer, self.model, _, _ = load_pretrained_model(ckpt_path, None, model_name, load_8bit=load_8bit, load_4bit=load_4bit, device="cuda")
50 else:
51 tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
---> 52 model = MPLUGDocOwlLlamaForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, **kwargs)
53 else:
--> 209 self.model = MPLUGDocOwlLlamaModel(config)
211 self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
213 # Initialize weights and apply final processing
File ~/SageMaker/mPLUG-DocOwl/DocOwl1.5/mplug_docowl/model/modeling_mplug_docowl.py:201, in MPLUGDocOwlLlamaModel.__init__(self, config)
200 def __init__(self, config: MPLUGDocOwlConfig):
--> 201 super(MPLUGDocOwlLlamaModel, self).__init__(config)
924 self.layers = nn.ModuleList(
--> 925 [LlamaDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
926 )
927 self.norm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
928 self.gradient_checkpointing = False
TypeError: LlamaDecoderLayer.__init__() takes 2 positional arguments but 3 were given
As you can see, I'm using a SageMaker instance. Could you please provide some guidance? Thanks
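A hedged observation on this error: transformers 4.36+ passes (config, layer_idx) when constructing LlamaDecoderLayer, while the decoder layer this repo patches in accepts only config, so the installed transformers is likely newer than the version the code targets. Pinning an older release, e.g. pip install "transformers<4.36", may resolve it; the exact bound is an assumption, not a tested fix.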
I have tried to execute the steps listed here for extracting the PaperOwl dataset. Can you please confirm whether these images are really missing, or is there something wrong with my extraction?
imgs/2106.08905v2/figures/out_28170.png
imgs/2106.08905v2/figures/28170.png
imgs/2303.16501v1/tables/table_7.png
imgs/2305.16835v1/figures/fig_result_2.png
imgs/2102.12037v3/figures/table-AUROC-boed.png
imgs/1908.09231v1/tables/table_1.png
.... more images are missing
Hello, I would like to ask how to test on the M-Paper dataset. For example, for the Multimodal Diagram Analysis task, the input needs to be Context + Diagrams + Outline, plus the question instructions, so how do you organize the input format for the model? Are there any associated evaluation scripts for M-Paper?
We created some large-scale multimodal datasets that contain OCR annotations; for some of them we ran PaddleOCR over LAION images.
Do you think those might be useful for tuning your method?
Best,
Chris
I tried to run it and got:
ModuleNotFoundError: No module named 'icecream'
(textgen) [root@pve0 DocOwl1.5]# pip install icecream
Collecting icecream
Using cached icecream-2.1.3-py2.py3-none-any.whl.metadata (1.4 kB)
Requirement already satisfied: colorama>=0.3.9 in /data/miniconda3/envs/textgen/lib/python3.10/site-packages (from icecream) (0.4.6)
Requirement already satisfied: pygments>=2.2.0 in /data/miniconda3/envs/textgen/lib/python3.10/site-packages (from icecream) (2.17.2)
Requirement already satisfied: executing>=0.3.1 in /data/miniconda3/envs/textgen/lib/python3.10/site-packages (from icecream) (2.0.1)
Requirement already satisfied: asttokens>=2.0.1 in /data/miniconda3/envs/textgen/lib/python3.10/site-packages (from icecream) (2.4.1)
Requirement already satisfied: six>=1.12.0 in /data/miniconda3/envs/textgen/lib/python3.10/site-packages (from asttokens>=2.0.1->icecream) (1.16.0)
Using cached icecream-2.1.3-py2.py3-none-any.whl (8.4 kB)
Installing collected packages: icecream
Successfully installed icecream-2.1.3
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
(textgen) [root@pve0 DocOwl1.5]# python app.py
2024-04-13 16:08:59 | ERROR | stderr | Traceback (most recent call last):
2024-04-13 16:08:59 | ERROR | stderr | File "/data/mplug-docowl/DocOwl1.5/app.py", line 23, in <module>
2024-04-13 16:08:59 | ERROR | stderr | no_change_btn = gr.Button.update()
2024-04-13 16:08:59 | ERROR | stderr | AttributeError: type object 'Button' has no attribute 'update'
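The failing call matches Gradio 4.x, which removed the per-component .update() classmethods. A hedged sketch of the two usual workarounds, untested against this repo's app.py:

import gradio as gr

# Option 1: port the call to the generic helper that current Gradio still provides.
no_change_btn = gr.update()  # instead of gr.Button.update()

# Option 2 (alternative): pin an older Gradio that still has Button.update,
# e.g. pip install "gradio<4" (the exact version bound is an assumption).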
As the title says.
@LukeForeverYoung Hey! Thanks for sharing this amazing work!
Are the model weights and inference code available ?
I would be happy to test them locally.
When I run the inference code:
from docowl_infer import DocOwlInfer
model_path='./mPLUG/DocOwl1.5-chat'
docowl=DocOwlInfer(ckpt_path=model_path, anchors='grid_9', add_global_img=True)
print('load model from ', model_path)
TypeError Traceback (most recent call last)
Cell In[3], line 1
----> 1 docowl=DocOwlInfer(ckpt_path=model_path, anchors='grid_9', add_global_img=True)
File c:\Users\internanirudh\Desktop\DocOwl\mPLUG-DocOwl-main\DocOwl1.5\docowl_infer.py:21, in DocOwlInfer.__init__(self, ckpt_path, anchors, add_global_img, load_8bit, load_4bit)
19 ic(model_name)
20 print("DocOwl Infer ")
---> 21 self.tokenizer, self.model, _, _ = load_pretrained_model(ckpt_path, None, model_name, load_8bit=load_8bit, load_4bit=load_4bit, device="cuda")
22 self.doc_image_processor = DocProcessor(image_size=448, anchors=anchors, add_global_img=add_global_img, add_textual_crop_indicator=True)
23 self.streamer = TextStreamer(self.tokenizer, skip_prompt=True, skip_special_tokens=True)
File c:\Users\internanirudh\Desktop\DocOwl\mPLUG-DocOwl-main\DocOwl1.5\mplug_docowl\model\builder.py:54, in load_pretrained_model(model_path, model_base, model_name, load_8bit, load_4bit, device_map, device)
52 tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
53 print("MPLUGDocOwlLlamaForCausalLM")
---> 54 model = MPLUGDocOwlLlamaForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, **kwargs)
55 else:
56 # Load language model
57 if model_base is not None:
58 # PEFT model
File c:\Users\internanirudh\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\modeling_utils.py:3405, in PreTrainedModel.from_pretrained(cls, pretrained_model_name_or_path, config, cache_dir, ignore_mismatched_sizes, force_download, local_files_only, token, revision, use_safetensors, *model_args, **kwargs)
3402 with ContextManagers(init_contexts):
3403 # Let's make sure we don't run the init function of buffer modules
3404 print("ContexManager")
-> 3405 model = cls(config, *model_args, **model_kwargs)
3407 # make sure we use the model's config since the init call might have copied it
3408 config = model.config
File c:\Users\internanirudh\Desktop\DocOwl\mPLUG-DocOwl-main\DocOwl1.5\mplug_docowl\model\modeling_mplug_docowl.py:209, in MPLUGDocOwlLlamaForCausalLM.__init__(self, config)
207 def __init__(self, config):
208 super(LlamaForCausalLM, self).__init__(config)
--> 209 self.model = MPLUGDocOwlLlamaModel(config)
211 self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
213 # Initialize weights and apply final processing
File c:\Users\internanirudh\Desktop\DocOwl\mPLUG-DocOwl-main\DocOwl1.5\mplug_docowl\model\modeling_mplug_docowl.py:201, in MPLUGDocOwlLlamaModel.__init__(self, config)
200 def __init__(self, config: MPLUGDocOwlConfig):
--> 201 super(MPLUGDocOwlLlamaModel, self).__init__(config)
File c:\Users\internanirudh\Desktop\DocOwl\mPLUG-DocOwl-main\DocOwl1.5\mplug_docowl\model\modeling_mplug_docowl.py:33, in MPLUGDocOwlMetaModel.__init__(self, config)
32 def __init__(self, config):
---> 33 super(MPLUGDocOwlMetaModel, self).__init__(config)
34 self.vision_model = MplugOwlVisionModel(
35 MplugOwlVisionConfig(**config.visual_config["visual_model"])
36 )
38 self.vision2text = MplugDocOwlHReducerModel(
39 MplugDocOwlHReducerConfig(**config.visual_config["visual_hreducer"]), config.hidden_size
40 )
File c:\Users\internanirudh\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\models\llama\modeling_llama.py:926, in LlamaModel.__init__(self, config)
923 self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
924 print("LlamaDecoderLayer Start")
925 self.layers = nn.ModuleList(
--> 926 [LlamaDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
927 )
928 print("LlamaDecoderLayer Ran")
929 self.norm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
File c:\Users\internanirudh\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\models\llama\modeling_llama.py:926, in <listcomp>(.0)
923 self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
924 print("LlamaDecoderLayer Start")
925 self.layers = nn.ModuleList(
--> 926 [LlamaDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
927 )
928 print("LlamaDecoderLayer Ran")
929 self.norm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
TypeError: LlamaDecoderLayer.__init__() takes 2 positional arguments but 3 were given
Can y'all give me a solution to this problem?
I found links to HF model cards in DocOwl 1.5 README, but none of them works.
Also, I found no models here https://huggingface.co/mPLUG
Can you please provide working links?
Hi! Thank you for the excellent work!
The unified instruction tuning dataset is a great contribution to the community and can be very useful. I wonder if there is a timetable for its release? Thanks!
This looks really good, and nothing like this has been developed before.
Excited for the source code. Also, all other models fail with documents because their image processors downgrade the resolution to 224. I believe this model handles the high resolution needed for document understanding.
Does it need OCR to extract the text in the document, or is it an OCR-free model?
The meaning of the values in <bbox> is confusing. It doesn't look like the x1,y1,x2,y2 format, since it fails to give the correct bbox for most images.
DocStruct4M
DocDownstream-1.0
DocReason25K
DocLocal4K
Are the image files included in these four datasets different from each other?
When will the training code be released? Thanks.
The DUE-benchmark provides the OCR results of PDF-type documents, and other models use the OCR results as input to evaluate their models. What is the input of your model when you use this benchmark? Do you use PNG/JPG images?
I'd like to get the same results with the Omni model as demonstrated in the Hugging Face demo, using the inference code in this repo.
Could you share what parameters (anchors/grid, input resolution, etc.) you use under the hood? Is there any other pre- or post-processing of the query or the input image that is absent from the inference code?
For example, with an image that says:
MAKE TEXT
STAND OUT FROM
BACKGROUNDS
I've got the following results:
With inference code:
from docowl_infer import DocOwlInfer
model_path = 'mPLUG/DocOwl1.5-Omni'
docowl=DocOwlInfer(ckpt_path=model_path, anchors='grid_9', add_global_img=True)
query = "Parse texts in the image."
answer = docowl.inference(image_path, query)
Output:
<doc> MAKE TEXT FROM IEX
STAOKOROUNDLICKGRIUINI </doc>
While the demo gives outputs:
[doc] TEXT MAKE
STAND OUT FROM
BACKGROUNDS [/doc]
EDIT: Added example
Thanks for your great job! I'm really curious about the pre-processing of the dataset, so I'd like to ask you about the following:
Thanks for your reply!
Thanks for your great work! What is the difference between DocDownstream-1.0 and the fine-tuning data used in UReader?
All the question prompts extracted from DocStruct4M's 'multi_grained_text_localization.jsonl' are as below:
[
"Give the bounding box of the text",
"Predict the bounding box of the text",
"Detect the text in the bounding box",
"Identify the text within the bounding box",
"Recognize the text in the bounding box",
"Locate the postion of the text"
]
In the last entry, 'postion' should be replaced with 'position'.
I wonder whether it matters for training the MLLM, because the number of affected samples is significantly high.
Running Windows 10 venv Python 3.10.6:
from docowl_infer import DocOwlInfer
model_path='mPLUG/DocOwl1.5-stage1'
docowl=DocOwlInfer(ckpt_path=model_path, anchors='grid_9', add_global_img=False)
ic| model_name: 'DocOwl1.5-stage1'
tokenizer_config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████| 749/749 [00:00<?, ?B/s]
E:\DocOwl\venv\lib\site-packages\huggingface_hub\file_download.py:148: UserWarning: `huggingface_hub` cache-system uses symlinks by default to efficiently store duplicated files but your machine does not support them in C:\Users\User\.cache\huggingface\hub\models--mPLUG--DocOwl1.5-stage1. Caching files will still work but in a degraded version that might require more space on your disk. This warning can be disabled by setting the `HF_HUB_DISABLE_SYMLINKS_WARNING` environment variable. For more details, see https://huggingface.co/docs/huggingface_hub/how-to-cache#limitations.
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
warnings.warn(message)
tokenizer.model: 100%|███████████████████████████████████████████████████████████████████████████████████| 500k/500k [00:00<00:00, 11.5MB/s]
special_tokens_map.json: 100%|█████████████████████████████████████████████████████████████████████████████████████| 438/438 [00:00<?, ?B/s]
config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 4.84k/4.84k [00:00<?, ?B/s]
model.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████| 16.3G/16.3G [06:55<00:00, 39.2MB/s]
Some weights of MPLUGDocOwlLlamaForCausalLM were not initialized from the model checkpoint at mPLUG/DocOwl1.5-stage1 and are newly initialized: ['model.layers.22.self_attn.rotary_emb.inv_freq', 'model.layers.13.self_attn.rotary_emb.inv_freq', 'model.layers.9.self_attn.rotary_emb.inv_freq', 'model.layers.1.self_attn.rotary_emb.inv_freq', 'model.layers.16.self_attn.rotary_emb.inv_freq', 'model.layers.5.self_attn.rotary_emb.inv_freq', 'model.layers.28.self_attn.rotary_emb.inv_freq', 'model.layers.20.self_attn.rotary_emb.inv_freq', 'model.layers.15.self_attn.rotary_emb.inv_freq', 'model.layers.12.self_attn.rotary_emb.inv_freq', 'model.layers.23.self_attn.rotary_emb.inv_freq', 'model.layers.21.self_attn.rotary_emb.inv_freq', 'model.layers.30.self_attn.rotary_emb.inv_freq', 'model.layers.3.self_attn.rotary_emb.inv_freq', 'model.layers.25.self_attn.rotary_emb.inv_freq', 'model.layers.2.self_attn.rotary_emb.inv_freq', 'model.layers.31.self_attn.rotary_emb.inv_freq', 'model.layers.24.self_attn.rotary_emb.inv_freq', 'model.layers.19.self_attn.rotary_emb.inv_freq', 'model.layers.26.self_attn.rotary_emb.inv_freq', 'model.layers.7.self_attn.rotary_emb.inv_freq', 'model.layers.14.self_attn.rotary_emb.inv_freq', 'model.layers.18.self_attn.rotary_emb.inv_freq', 'model.layers.27.self_attn.rotary_emb.inv_freq', 'model.layers.10.self_attn.rotary_emb.inv_freq', 'model.layers.4.self_attn.rotary_emb.inv_freq', 'model.layers.29.self_attn.rotary_emb.inv_freq', 'model.layers.6.self_attn.rotary_emb.inv_freq', 'model.layers.8.self_attn.rotary_emb.inv_freq', 'model.layers.11.self_attn.rotary_emb.inv_freq', 'model.layers.17.self_attn.rotary_emb.inv_freq']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
generation_config.json: 100%|███████████████████████████████████████| 162/162 [00:00<?, ?B/s]
Traceback (most recent call last):
File "", line 1, in
File "E:\DocOwl\mPLUG-DocOwl\DocOwl1.5\docowl_infer.py", line 19, in init
self.tokenizer, self.model, _, _ = load_pretrained_model(ckpt_path, None, model_name, load_8bit=load_8bit, load_4bit=load_4bit, device="cuda")
File "E:\DocOwl\mPLUG-DocOwl\DocOwl1.5\mplug_docowl\model\builder.py", line 52, in load_pretrained_model
model = MPLUGDocOwlLlamaForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, **kwargs)
File "E:\DocOwl\venv\lib\site-packages\transformers\modeling_utils.py", line 2959, in from_pretrained
dispatch_model(model, **kwargs)
File "E:\DocOwl\venv\lib\site-packages\accelerate\big_modeling.py", line 364, in dispatch_model
weights_map = OffloadedWeightsLoader(
File "E:\DocOwl\venv\lib\site-packages\accelerate\utils\offload.py", line 150, in init
raise ValueError("Need either astate_dict
or asave_folder
containing offloaded weights.")
ValueError: Need either astate_dict
or asave_folder
containing offloaded weights.
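This ValueError comes from accelerate's offloading path: from_pretrained decided to offload some weights (likely because GPU memory ran short) but was given no folder to put them in. A hedged sketch of a possible fix; device_map and offload_folder are standard transformers/accelerate kwargs, though whether the repo's builder forwards them this way is an assumption:

from mplug_docowl.model.modeling_mplug_docowl import MPLUGDocOwlLlamaForCausalLM

model = MPLUGDocOwlLlamaForCausalLM.from_pretrained(
    "mPLUG/DocOwl1.5-stage1",
    low_cpu_mem_usage=True,
    device_map="auto",           # lets accelerate place weights across devices
    offload_folder="./offload",  # supplies the missing save_folder for offloaded weights
)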
Downloading from HF is too slow; please upload a copy to ModelScope.
ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (400) from primary with message "{
"code": 400,
"type": "InternalServerException",
"message": "The checkpoint you are trying to load has model type `mplug_docowl` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date."
}
2024-03-29 12:11:24.059622: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-03-29 12:11:24.059732: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-03-29 12:11:24.156838: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-03-29 12:11:26.578613: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
ic| model_name: '0735ba4067b5ab76192ce6e7bc5694701ab4d779'
Traceback (most recent call last):
File "/content/drive/MyDrive/Document_Extraction/mPLUG-DocOwl/DocOwl1.5/docowl_infer.py", line 70, in <module>
docowl = DocOwlInfer(ckpt_path=model_path, anchors='grid_9', add_global_img=True)
File "/content/drive/MyDrive/Document_Extraction/mPLUG-DocOwl/DocOwl1.5/docowl_infer.py", line 19, in __init__
self.tokenizer, self.model, _, _ = load_pretrained_model(ckpt_path, None, model_name, load_8bit=load_8bit, load_4bit=load_4bit, device="cuda")
File "/content/drive/MyDrive/Document_Extraction/mPLUG-DocOwl/DocOwl1.5/mplug_docowl/model/builder.py", line 52, in load_pretrained_model
model = MPLUGDocOwlLlamaForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 3375, in from_pretrained
model = cls(config, *model_args, **model_kwargs)
File "/content/drive/MyDrive/Document_Extraction/mPLUG-DocOwl/DocOwl1.5/mplug_docowl/model/modeling_mplug_docowl.py", line 209, in __init__
self.model = MPLUGDocOwlLlamaModel(config)
File "/content/drive/MyDrive/Document_Extraction/mPLUG-DocOwl/DocOwl1.5/mplug_docowl/model/modeling_mplug_docowl.py", line 201, in __init__
super(MPLUGDocOwlLlamaModel, self).__init__(config)
File "/content/drive/MyDrive/Document_Extraction/mPLUG-DocOwl/DocOwl1.5/mplug_docowl/model/modeling_mplug_docowl.py", line 33, in __init__
super(MPLUGDocOwlMetaModel, self).__init__(config)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 924, in __init__
[LlamaDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 924, in <listcomp>
[LlamaDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
TypeError: LlamaDecoderLayer.__init__() takes 2 positional arguments but 3 were given
Hi, I am confused about how to convert the PDF datasets (DeepForm, KLC) into multiple images with the true key-value GT pairs for each converted PNG image, because the datasets downloaded from the DUE-benchmark have no page-ID information.
Among the released datasets, I cannot find which one contains image -> markdown text information.
Also, where does the Chinese OCR ability come from? The whole dataset contains no Chinese.
MplugDocOwlHReducerModel --> forward --> line 487:
sequence_output = self.reducer(hidden_states) # B,C,H,W -> B,C,H/conv_shape[1],W/(conv_shape[1])
After the self.reducer operation, the shape of hidden_states should be (B,C,H/conv_shape[0],W/conv_shape[1]), so the comment's H/conv_shape[1] looks like a typo.
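A minimal shape check supporting this report; the channel count, spatial size, and conv_shape below are illustrative assumptions, not the repo's actual configuration:

import torch
import torch.nn as nn

conv_shape = (1, 4)  # assumed reducer kernel/stride, for illustration only
reducer = nn.Conv2d(in_channels=1024, out_channels=1024,
                    kernel_size=conv_shape, stride=conv_shape)
x = torch.randn(2, 1024, 32, 32)  # B, C, H, W
out = reducer(x)
print(out.shape)  # torch.Size([2, 1024, 32, 8]): H/conv_shape[0], W/conv_shape[1]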
What input image size does the image encoder use, and what is the patch size?
If it follows the mPLUG-Owl configuration, can ViT-L/14 at 224x224 resolution resolve the dense text in the examples, such as documents and webpages?
Do mPLUG/DocStruct4M and mPLUG/DocDownstream-1.0 contain the datasets' image files? This cannot be verified on Hugging Face.
Thanks for your great work.
I have a small question: in the Table Parsing section, the text converts all table representations from HTML to Markdown format.
But table syntax in Markdown does not support merging rows or columns; the paper says that tags like <ROWSPAN=x> or <COLSPAN=y> are added.
Why not just use LaTeX code to represent the tables? The MMD format is compatible with LaTeX tables.
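For concreteness, a hypothetical sketch of what such an extended Markdown table might look like; the exact tag placement is my assumption, not the paper's verbatim syntax:

| Header A | Header B <COLSPAN=2> |
| cell 1 | cell 2 | cell 3 |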
At the same time, there is another question: in the inference phase, the model outputs the transformed table, which cannot be directly rendered, because the output format is neither LaTeX nor Markdown.
Is this due to the training data or the model design?
Is patch_positions not used anywhere?
Both the Multi-grained Text Grounding and Multi-grained Text Recognition tasks need bounding boxes to get the correspondence between specific texts and local positions.
The bboxes in the DocLocal4K and DocStruct4M datasets do not seem to be the real bounding boxes of the images.
My question is: the expression
[max(min(int(x)/999, 1.0), 0.0) for x in gt_answer.split(',')]
will truncate relatively large coordinates to 1.0. Isn't that a problem?
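A quick illustration of the clamping, with a made-up gt_answer containing an out-of-range coordinate:

gt_answer = "92, 444, 1200, 480"  # hypothetical example; 1200 exceeds the 0-999 range
coords = [max(min(int(x) / 999, 1.0), 0.0) for x in gt_answer.split(',')]
print(coords)  # the 1200 is truncated to exactly 1.0, losing its true position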
https://modelscope.cn/studios/damo/mPLUG-DocOwl/summary
This site is down at the moment; it says "当前空间运行错误,暂未发布" ("The current Space has a runtime error and has not been published yet").
Hi team,
Is there any instruction on how to use the stage1 model? I am interested in the document/webpage parsing capabilities.
If not, can you provide an example script?
Thanks!!
Thanks for your work.
Please, can you split the model into 4GB chunks rather than a single 16GB file? I have already converted it to safetensors through HF in the repo as well.
That would just make it much more usable.
Thanks.
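In the meantime, a minimal re-sharding sketch; it assumes the checkpoint loads with the repo's model class, and max_shard_size is a standard save_pretrained argument:

from mplug_docowl.model.modeling_mplug_docowl import MPLUGDocOwlLlamaForCausalLM

# Load once from the single 16GB checkpoint, then rewrite it as ~4GB safetensors shards.
model = MPLUGDocOwlLlamaForCausalLM.from_pretrained("mPLUG/DocOwl1.5-Chat")
model.save_pretrained("./DocOwl1.5-Chat-resharded", max_shard_size="4GB",
                      safe_serialization=True)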
When will Chinese document Q&A be supported?