microsoft / i-code Goto Github PK

License: MIT License

Python 20.38% Shell 0.02% Jupyter Notebook 79.32% Procfile 0.01% HTML 0.05% JavaScript 0.17% CSS 0.06%

i-code's Introduction

Project i-Code

The ambition of the i-Code project is to build integrative and composable multimodal Artificial Intelligence. The "i" stands for integrative multimodal learning.

Multimodal Foundation Models

i-Code V1: i-Code: An Integrative and Composable Multimodal Learning Framework. AAAI 2023, paper link.
i-Code V2: i-Code V2: An Autoregressive Generation Framework over Vision, Language, and Speech Data. Paper link.
i-Code V3 (CoDi): Any-to-Any Generation via Composable Diffusion, paper link.
i-Code Studio: A Configurable and Composable Framework for Integrative AI, paper link.

Multimodal Document Intelligence

i-Code Doc (UDOP): Unifying Vision, Text, and Layout for Universal Document Processing. CVPR 2023 Highlight, paper link.

Knowledge-Based Visual Question Answering

[MM-Reasoner] MM-Reasoner: A Multi-Modal Knowledge-Aware Framework for Knowledge-Based Visual Question Answering. EMNLP 2023 Findings.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.

i-code's People

Contributors

Stargazers

Watchers

Forkers

ziyi-yang zinengtang wangkk1996 davidcorreas gpalrepo etrigger codender techthiyanes yangyoungyang natyren chengshuang18 eyeshot mediaeater paperwave moerehman jmwdpk maxmax2016 chandra-devs guduhanyan apollohuang1 eltociear wdshin statsgary vrrs muki-skywalker commerceless mrcodechef wilsonodpn stevegyutyan jfsantos soxunlocks bingtian88 renliuran thanhpham1987 contropist pilsuu fellow-soft yuwfan lucyellu neuralens ukaserge leong-08 ccaiccie codwest tuananhnguyenkim nagaramcheedella jinlmsft feixuedudiao dhanrajkatkar msocko420 david20080125 mbyase henry-sun1974 zhaoqj2016 valeriawong alistairwalsh woody0105 andrewyang885 waddiwasibiu aceanan keyman9848 ningshuang-yao gaiann hzzhang-nlp remarkmediagroup dl-diffusion petercao sandy4321 positivewon aha-0810 dorbodwolf iamleon121 aicodehunt ziichuan desi-ivanova thinh-huynh-re stanleyjacob raymond310 shubham-attri ltyanghuang ruofeidu phoenixfury007 davidoster rotemfogel ikj1992 codi-gen tjdharani ionut-mihai-vladasel lakshyapandey know-it-all-marketing-llc tata19884 2132660698 fastflair drehere m4rtymcfly vidina-solutions anythingwithawire chutianshu1981 jacek-max930 quantumbuddha

i-code's Issues

FileNotFoundError: [Errno 2] No such file or directory: '../CoDi_encoders.pth'

Running the demo.ipynb notebook in the V3 folder

DocVQA

Hi,
I see that the fine_rvlcdip.sh is used to train on the classification dataset from scratch. What's the procedure to train the model on DocVQA task from scratch?
Also what does the finetune_duebenchmark.sh exactly do? I couldn't run it because it required the config file for the pretrained model which is not available yet.

Thanks!!!

loss does not have a grad fn

Hi, I have been trying to finetune the UDOP model on DocVQA. However, I am getting the following error after the first training iteration.

Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

After going through each of the intermediate output, it is found that even the inputs_patches variable used here do not have requires_grad as True. Any help/pointers regarding this is much appreciated.

Method to use ?

Can any one explain a sequential way to use this and generate output

where is the rvl-cdip dataset

where is the rvl-cdip dataset?

--im_dir 参数是什么含义

您好，我看--im_dir默认参数是'DocVQA/png'，这个是什么含义呢，我在处理due数据的时候也提示“Cannot locate directory xx”，png目录下存放的是所有docvqa图片吗

Document Understanding

For the DocVQA, does UDOP follow the same procedure as layoutLM (single image input) or can we can do it in a document level. For example given an input of document with 15 pages and performing DocVQA on that.

Finetuning on Due-Benchmark

Hi,

I have been trying to finetune the model on due-benchmark using the provided script. However, the performance is quite low compared to the reported numbers. For example, DocVQA results in an ANLS score of 75 instead of the reported 84. I have two main queries.

The provided checkpoint is missing one parameter: special_vis_token. For now this parameter is initialized randomly. I am not sure if this has a significant impact on the final score.
As per the paper, the input is prepended with a task specific prompt. However, it seems this is not done for the due-benchmark tasks. Could this be the reason for the low performance?

How to use single machine multi card reasoning

i-Code studio access?

I will appreciate it if you explain how can I access i-code-studio and how can I connect the custom diffusion models to i-code-v3.

CoDi Video with num_frames != 8

In web demo https://codi-gen.github.io/ there're videos with at least 16 frames

demo notebook fails to generate videos longer than 8 frames (it doesn't crash, just generates complete garbage)

Is it suppose to be able to generate them, or did you use some tricks like processing video during diffusion process in patches of 8 frames to generate examples?

Language Support

Hi a few questions, does this model work only on English?

If so, what would it take to train it on another language or script type?

Would it need to be pretrained again using self-supervision and how expensive is the pre-training process computationally?

Thank you!

Can the provided models perform multi-label classification?

The provided code examples provide a good example of single label classification (e.g. rvlcdip). But can this be easily extended to multi-label classification? I'm a bit new to models where the predicted label is the actual textual representation of the label versus an integer. In the latter case multi-label can be provided through one-hot encoding an array of possible labels; does UDOP have a similar extension from single label to multi-label? How would one do it?

from pytorch_lightning.trainer.supporters import CombinedLoader

error reporting：ModuleNotFoundError: No module named 'pytorch_lightning.trainer.supporters'

Layout Analysis

Hi,

I found that you mentioned this awesome model performs well on several tasks. But there are few performance results of the layout analysis part. I'd like to run it, but I'm not sure which part should I modify. Could you please tell me how can I make it run on this task?

Best regards,

Poor performance on various tasks using the provided example

Hi, thanks for publishing the model and its weights - it looks very promising. Sadly I can't get good results out of it using the provided notebook. For example if I ask more complex questions (which it should support due to pretraining as far as I understand), it fails to produce correct answers.
Let's say I modify the prompt to task_prefix = "Layout Modeling. <layout_0> Manuscript </layout_0> review" - the model gives me 'form', which is incorrect, and I get in most cases either "form" or some other incorrect answer. For task_prefix = 'information extraction. What is the completion date?' it gives me '3/16/68' which is close but still incorrect. Can you provide more elaborate examples with different task prompts so it will be possible to check if it's something on my side or if there is a problem with the model?

Pretraining weights for the non-vision task?

Hi, based on the paper, it sounded like there would be a release of the pretraining weights for the text and layout pretraining tasks for UDOP. When will that be released? Thanks!

Pre-training code of UDOP

Thanks for your awesome work in UDOP, I was trying out Finetune on VQA. Do you plan to release the code to pre-train such a model? Thank you!

UDOP pretraining with MAE decoder

I am attempting to reproduce the original UDOP pretraining code as in paper. I have question - is the image reconstruction loss optimized together with text generation loss within the same batch or are they optimized alternatingly? This is because in UDOP model forward() pass, either image loss or text loss is returned depending on ids_keep is None (text) or not (image), but not both at the same time, so I wonder which approach was used in the original code.

TIA

Computation resources and traning time

Could you provide some information about computation resources and traning time, if I want to train the model from scrath.

In 'Finetuninng on RVLCDIP', which one is the dataset ?

Finetuninng on RVLCDIP

Download RVLCDIP first and change the path
For OCR, you might need to customize your code

bash scripts/finetune_rvlcdip.sh   # Finetuning on RVLCDIP

Q1. which Dataset?

        ocr_dir = os.path.join(data_args.data_dir, data_args.mpdfs_dir, 'cdip-images-full-clean-ocr021121')
        image_dir = os.path.join(data_args.data_dir, data_args.mpdfs_dir, 'cdip-images')
        label_dir = os.path.join(data_args.data_dir, data_args.rvlcdip_dir, 'labels')

and in run_rvlcdip.py the dir 'cdip-images-full-clean-ocr021121' is not found in the datasets below.

https://paperswithcode.com/dataset/rvl-cdip

Q2. Which OCR?
I have downloaded the raw rvl_cdip dataset, in order to get a cdip-images-full-clean-ocr021121 to get a performance matching the one paper listed, which OCR should I use? Is it https://learn.microsoft.com/en-us/rest/api/computervision/3.1/get-read-result/get-read-result?tabs=HTTP ?

Q3. Is it OK for rvl_cdip being used for both pretrain and finetune?

Thank you!

Question on Production Use-Cases

Hello!

Firstly, thanks for your fantastic efforts and research. The UDOP model is definitely one of the more interesting releases from the past few months.

Looking at the model code and reading the paper, I noticed you have decided to use a Seq2Seq type of format, very similar to T5. While this is a flexible type of model (as your paper proves!), I feel the largest downside is a lack of confidences for answers given.

For example, using LayoutLM1/2/3 for QA, I can easily filter out low-confidence answers, which makes production use-cases very feasible. With conditional generation, this is not as simple. In the past, I've tried to look at the confidences for each word generated, but that hasn't proved very useful.

Is there a particular method you suggest for filtering responses from UDOP or similar conditional generational models (i.e Donut, T5, etc.)? Or is this an area that might need more research in the future?

‘UdopUnimodelForConditionalGeneration' object has no attribute 'special_vis_token'

I ran your finetune_duebenchmark.sh script to finetune DocVQA dataset，but got an error as below:

how to handle it? Thanks for your help.

Number of Epochs for Fine-tuning Tasks

Hi Authors,
Appreciate your interesting paper UDOP. Wanted to know the specifics on fine-tuning. Appendix C.6 does provide some details but the crucial number of epochs is missing. Specifically:

Number of epochs fine-tuned for each downstream task?
Are all downstream tasks fine-tuned for the same number of epochs?

`true` is not defined when running demo.ipynb

After following the installation instructions and trying to run the demo notebook, I get a fatal error indicating true was used instead of True in runpy.py

←[1;31mNameError←[0m: name 'true' is not definedTraceback (most recent call last):
File "C:\ProgramData\Anaconda3\envs\CoDi\lib\runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\ProgramData\Anaconda3\envs\CoDi\lib\runpy.py", line 87, in run_code
exec(code, run_globals)
File "C:\ProgramData\Anaconda3\envs\CoDi\Scripts\jupyter-run.EXE_main.py", line 7, in
File "C:\ProgramData\Anaconda3\envs\CoDi\lib\site-packages\jupyter_core\application.py", line 285, in launch_instance
return super().launch_instance(argv=argv, **kwargs)
File "C:\ProgramData\Anaconda3\envs\CoDi\lib\site-packages\traitlets\config\application.py", line 1043, in launch_instance
app.start()
File "C:\ProgramData\Anaconda3\envs\CoDi\lib\site-packages\jupyter_client\runapp.py", line 112, in start
raise Exception("jupyter-run error running '%s'" % filename)
Exception: jupyter-run error running 'demo.ipynb'
[IPKernelApp] WARNING | Parent appears to have exited, shutting down.

Example code results in input_id's of varying lengths

I followed #17 (comment) in order to load the UdopTokenizer. I then followed the code examples for tokenizing text provided in rvlcdip.py

This amounts to calling tokenizer.tokenize(text) on a word text, appending the resulting sub_tokens to a text_list and then calling tokenizer.convert_tokens_to_ids on that text_list to get input_ids. However this always results in lengths that are longer or shorter than 512. This is despite the fact that tokenizer_config.json has a "model_max_length": 512, param.

Is this provided example code the expected way to encode text?

(it makes sense that the provided code doesn't pad/truncate correctly, but its odd to me that rvlcdip can correctly fine tune without a step in this tokenization piece that ensures the text_list is 512 tokens long)

EDIT I just noticed this pad_tokens function but it doesn't appear to be used anywhere. Is it used automatically once RvlCdipDataset() is created? Also, it doesn't appear to do any truncation

CoDi : CUDA ran out of memory while trying to do inference tasks

I was trying to run the demo notebook on Nvidia A100 80 GB. While trying to load the model from checkpoint, I am facing this issue:
#######################
Running in eps mode
#######################

making attention of type 'vanilla' with 512 in_channels
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla' with 512 in_channels
Load pretrained weight from ['CoDi_encoders.pth', 'CoDi_text_diffuser.pth', 'CoDi_audio_diffuser_m.pth', 'CoDi_video_diffuser_8frames.pth']

RuntimeError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 23.70 GiB total capacity; 17.10 GiB already allocated; 3.56 MiB free; 17.49 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Can you let me know how to solve this issue ?

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+

Generating bounding boxes with UDOP

Hi,

by reading the UDOP paper, my understanding is that during pre-training the model is taught to predict the layout of a target (textual) sequence using special layout tokens.
I was wondering if it is possible to exploit such capability also during finetuning e.g. to finetune the model using target sequences such as: <key> Name <loc_100> <loc_200> <loc_150> <loc_250> </key> <value> Jane Doe <loc_110> <loc_210> <loc_160> <loc_260> </key>

Ideally, could this approach allow to have a correspondence between the generated text (e.g. the name) and its position within the page document?

Inference Training pipeline

@ziyi-yang @zinengtang Thanks for sharing the wonderful work but can u please let us knw when shall be the inference and training pipeline along with pretrained weights be release ??

Thanks in advance

Few shot learning for Document AI

Hello,

I am working on a practical use-case of Document understanding and wondering if I could leverage UDOP. The goal is to extract key informations from the document (in fields or tables). The trick is that I only have a few training samples (<50) and I don't think VQA would apply as these informations are very specific and not always associated with a clear question.

Here are the 2 options I have in mind :

finetuning model. But would 50 sample be enough for UDOP ? How should I deal with tables ? (which don't really look like tables but rather a list without printed rows and columns, as on many receipts)
leverage a foundation model to perform few shot learning (as in GPT3). Are there text + layout foundation models out there that would work for this ? Or should I do prompt engineering with GPT3, Flan-T5, OPT or equivalent models ?

I am interested to get your insights for both english-data... and non english (but latin) data,

Many thanks for your inputs, but also for your work, it is very useful to get open sourced models as yours :)
Simon

Has anyone successfully attempted a torch.jit.trace or torch.onnx.export of a UDOP model?

I am currently loading the checkpoint on huggingface udop-unimodel-large-224 and immediately trying to torch.jit.trace or torch.onnx.export it with input data provided for the input_ids, attention_mask, labels, seg_data, visual_seg_data, decoder_attention_mask and image fields (for the forward function -- note these are the same fields used in the rvlcdip example). I'm running into a variety of issues, most commonly:

RuntimeError: 0 INTERNAL ASSERT FAILED at "/Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/jit/ir/alias_analysis.cpp":614, please report a bug to PyTorch. We don't have an op for aten::full_like but it isn't a special case.  Argument types: Tensor, bool, int, int, Device, bool, NoneType,

Curious if anyone has successfully traced the UDOP model and has a code example. I cannot find one in the i-Code repository.

Can load Udop-Dual-Large-224, but not Udop-Unimodel-Large-224

I was able to load the dual-large tokenizer/config/model using the same method described in this comment

However, when I attempt to use the exact same code, but for udop-unimodel-large-224 I get:

Traceback (most recent call last):
    tok = UdopTokenizer.from_pretrained(
    return cls._from_pretrained(
    tokenizer = cls(*init_inputs, **init_kwargs)
    self.sp_model.Load(vocab_file)
    return self.LoadFromFile(model_file)
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
RuntimeError: Internal: unk is not defined.

What is the pipline and installation guideline?

Audio decoding glitch

audio.zip

All generation works fine except the audio one—no matter from which data type.
It looks like there are some problems with audio decoding. Here is the output sample.

Problems with "i-Code-V3"

Hi guys! Great article with CoDi!
I tried to run demo.ipynb. The models are successfully loaded into the memory. But not a single example is working. Here are examples of problems:

Text To Image

File ~/i-Code/i-Code-V3/core/models/model_module_infer.py:143, in model_module.inference(self, xtype, condition, condition_types, n_samples, mix_weight, image_size, ddim_steps, scale, num_frames)
    139         raise
    140     shapes.append(shape)
--> 143 z, _ = sampler.sample(
    144     steps=ddim_steps,
    145     shape=shapes,
    146     condition=conditioning,
    147     unconditional_guidance_scale=scale,
    148     xtype=xtype, 
    149     condition_types=condition_types,
    150     eta=ddim_eta,
    151     verbose=False,
    152     mix_weight=mix_weight)
    154 out_all = []
    155 for i, xtype_i in enumerate(xtype):

File ~/anaconda/envs/CoDi/lib/python3.8/site-packages/torch/autograd/grad_mode.py:27, in _DecoratorContextManager.__call__.<locals>.decorate_context(*args, **kwargs)
     24 @functools.wraps(func)
     25 def decorate_context(*args, **kwargs):
     26     with self.clone():
---> 27         return func(*args, **kwargs)

File ~/i-Code/i-Code-V3/core/models/ddim_vd.py:34, in DDIMSampler_VD.sample(self, steps, shape, xt, condition, unconditional_guidance_scale, xtype, condition_types, eta, temperature, mix_weight, noise_dropout, verbose, log_every_t)
     32 self.make_schedule(ddim_num_steps=steps, ddim_eta=eta, verbose=verbose)
     33 print(f'Data shape for DDIM sampling is {shape}, eta {eta}')
---> 34 samples, intermediates = self.ddim_sampling(
     35     shape,
     36     xt=xt,
     37     condition=condition,
     38     unconditional_guidance_scale=unconditional_guidance_scale,
     39     xtype=xtype,
     40     condition_types=condition_types,
     41     ddim_use_original_steps=False,
     42     noise_dropout=noise_dropout,
     43     temperature=temperature,
     44     log_every_t=log_every_t,
     45     mix_weight=mix_weight,)
     46 return samples, intermediates

File ~/anaconda/envs/CoDi/lib/python3.8/site-packages/torch/autograd/grad_mode.py:27, in _DecoratorContextManager.__call__.<locals>.decorate_context(*args, **kwargs)
     24 @functools.wraps(func)
     25 def decorate_context(*args, **kwargs):
     26     with self.clone():
---> 27         return func(*args, **kwargs)

File ~/i-Code/i-Code-V3/core/models/ddim_vd.py:93, in DDIMSampler_VD.ddim_sampling(self, shape, xt, condition, unconditional_guidance_scale, xtype, condition_types, ddim_use_original_steps, timesteps, noise_dropout, temperature, mix_weight, log_every_t)
     90 index = total_steps - i - 1
     91 ts = torch.full((bs,), step, device=device, dtype=torch.long)
---> 93 outs = self.p_sample_ddim(
     94     pred_xt, 
     95     condition, 
     96     ts, index, 
     97     unconditional_guidance_scale=unconditional_guidance_scale,
     98     xtype=xtype,
     99     condition_types=condition_types,
    100     use_original_steps=ddim_use_original_steps,
    101     noise_dropout=noise_dropout,
    102     temperature=temperature,
    103     mix_weight=mix_weight,)
    104 pred_xt, pred_x0 = outs
    106 if index % log_every_t == 0 or index == total_steps - 1:

File ~/anaconda/envs/CoDi/lib/python3.8/site-packages/torch/autograd/grad_mode.py:27, in _DecoratorContextManager.__call__.<locals>.decorate_context(*args, **kwargs)
     24 @functools.wraps(func)
     25 def decorate_context(*args, **kwargs):
     26     with self.clone():
---> 27         return func(*args, **kwargs)

File ~/i-Code/i-Code-V3/core/models/ddim_vd.py:132, in DDIMSampler_VD.p_sample_ddim(self, x, condition, t, index, unconditional_guidance_scale, xtype, condition_types, repeat_noise, use_original_steps, noise_dropout, temperature, mix_weight)
    129     x_in.append(torch.cat([x_i] * 2))
    130 t_in = torch.cat([t] * 2)
--> 132 out = self.model.model.diffusion_model(
    133     x_in, t_in, condition, xtype=xtype, condition_types=condition_types, mix_weight=mix_weight)
    134 e_t = []
    135 for out_i in out:

File ~/anaconda/envs/CoDi/lib/python3.8/site-packages/torch/nn/modules/module.py:1130, in Module._call_impl(self, *input, **kwargs)
   1126 # If we don't have any hooks, we want to skip the rest of the logic in
   1127 # this function, and just call forward.
   1128 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1129         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1130     return forward_call(*input, **kwargs)
   1131 # Do not call functions when jit is used
   1132 full_backward_hooks, non_full_backward_hooks = [], []

File ~/i-Code/i-Code-V3/core/models/openaimodel.py:1109, in UNetModelVD.forward(self, x, timesteps, condition, xtype, condition_types, mix_weight)
   1107 emb_image = self.unet_image.time_embed(t_emb)
   1108 emb_text = self.unet_text.time_embed(t_emb)
-> 1109 emb_audio = self.unet_audio.time_embed(t_emb)
   1111 for i in range(len(xtype)):
   1112     if xtype[i] == 'text':

File ~/anaconda/envs/CoDi/lib/python3.8/site-packages/torch/nn/modules/module.py:1130, in Module._call_impl(self, *input, **kwargs)
   1126 # If we don't have any hooks, we want to skip the rest of the logic in
   1127 # this function, and just call forward.
   1128 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1129         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1130     return forward_call(*input, **kwargs)
   1131 # Do not call functions when jit is used
   1132 full_backward_hooks, non_full_backward_hooks = [], []

File ~/anaconda/envs/CoDi/lib/python3.8/site-packages/torch/nn/modules/container.py:139, in Sequential.forward(self, input)
    137 def forward(self, input):
    138     for module in self:
--> 139         input = module(input)
    140     return input

File ~/anaconda/envs/CoDi/lib/python3.8/site-packages/torch/nn/modules/module.py:1130, in Module._call_impl(self, *input, **kwargs)
   1126 # If we don't have any hooks, we want to skip the rest of the logic in
   1127 # this function, and just call forward.
   1128 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1129         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1130     return forward_call(*input, **kwargs)
   1131 # Do not call functions when jit is used
   1132 full_backward_hooks, non_full_backward_hooks = [], []

File ~/anaconda/envs/CoDi/lib/python3.8/site-packages/torch/nn/modules/linear.py:114, in Linear.forward(self, input)
    113 def forward(self, input: Tensor) -> Tensor:
--> 114     return F.linear(input, self.weight, self.bias)

RuntimeError: mat1 and mat2 shapes cannot be multiplied (2x320 and 192x768)

Image To Text

File ~/anaconda/envs/CoDi/lib/python3.8/site-packages/torch/nn/modules/module.py:1130, in Module._call_impl(self, *input, **kwargs)
   1126 # If we don't have any hooks, we want to skip the rest of the logic in
   1127 # this function, and just call forward.
   1128 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1129         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1130     return forward_call(*input, **kwargs)
   1131 # Do not call functions when jit is used
   1132 full_backward_hooks, non_full_backward_hooks = [], []

File ~/anaconda/envs/CoDi/lib/python3.8/site-packages/torch/nn/modules/linear.py:114, in Linear.forward(self, input)
    113 def forward(self, input: Tensor) -> Tensor:
--> 114     return F.linear(input, self.weight, self.bias)

RuntimeError: mat1 and mat2 shapes cannot be multiplied (8x320 and 192x768)

Text To Audio

File ~/anaconda/envs/CoDi/lib/python3.8/site-packages/torch/nn/modules/linear.py:114, in Linear.forward(self, input)
    113 def forward(self, input: Tensor) -> Tensor:
--> 114     return F.linear(input, self.weight, self.bias)

RuntimeError: mat1 and mat2 shapes cannot be multiplied (2x320 and 192x768)

And so on.

I use clean installation on A100 + 64GB RAM. No memory limitation was reached.

Missing import for DropPath in mae.py

Hello!

Started going through the code and found that mae.py file is missing the import line for DropPath class.
As I understood, it was taken from the timm library, that is something like this is missed:

from timm.models.layers.drop import DropPath

Demos on actual documents

Hello! Thank you for making this available - I was wondering if there are any demos that you could share where text/information is extracted from a PDF? I have seen that the current scripts uploaded are for fine-tuning, but it would be nice to have a script for just inference on new docs! Thank you

About i-code V1 project

A great job. Also, i would like to ask if you will open source the code of i-Code V1 project? It would be exciting if it could

Environment encoder V in CoDi

Hello, thanks for sharing this work!

Need to figure it out something in CoDi. Is the environment encoder V in paper as clap_encode_audio like this ?

Question about tokenizer and config.json

Hi,

I noticed that the files contained in https://huggingface.co/ZinengTang/Udop/tree/main contain the model weights (pytorch_model.bin for UdopDual-Large-224 and UdopUnimodel-Large-224), but not config.json or tokenizer related files.
Are those missing files identical to t5-large?

Thanks!

how much computation resource is needed to reproduce the pre-training of Udop

Request some details about i-Code V1

Dear @ziyi-yang and @yuwfan ,
I'm trying to use a similar idea on peptide representation learning. But several details are missing from your context. So I sent this email, hoping to get a response from you.

What representation did you use in the single modality task? The embedding after the fusion layer or the embedding after the text encoder? This is not clearly explained in the context, but I think it should influence the results a lot.
I think you should have done some finetuning using the pretrained i-Code model. How did you do that? What classification head did you add? A linear layer after the embedding or some other dropout tricks as well?
In your calculation of vl, vs, and ls loss, what feature did you use as inputs? The embedding after the fusion layer? If so, for the merged attention case, did you separate each of the text, vision, and speech features from fused embedding? This seems weird to me and not clear if these features still persist the information it originally contains.

Unresolved reference

Hi,

Firstly, thanks a lot for your awesome work!

There seems to be some errors in run_rvlcdip.py file. In the line "from datasets import ClassLabel, load_dataset, load_metric" it shows "Unresolved reference 'datasets'". And I can't find a module named datasets in the project folder.

Best regards,

Document-parsing example

Good morning,

First off all, thank you very much for open sourcing this model.

I have been looking at this model as an alternative to Donut for Document Parsing, I think we will get better performance as OCR data is included.

However, after checking your repository I just saw scripts for document classification and understanding. An example for Document Parsing or token classification will be helpfull. For document parsing I mean an example similar to this one for CORD dataset.

Thanks in advance! @zinengtang @ziyi-yang

Best regards

CLEARML ISSUE

I am getting a clearml issue regarding its path while executing clearml-init?

Run i-Code-v3 on CPU, Solve GPU VRAM problem!

I will appreciate it to provide the support or apply the following changes to be able run the i-Code-v3 on CPU.

the following changes need to apply:
/core/common/utils.py --> change np.int to np.int32

in the following files:
/core/models/model_module_infer.py
/core/models/ddim/ddim.py
/core/models/latent_diffusion/diffusion_unet.py

should add the following code:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

and change ".cuda( )" to ".to(device)"

by applying the above changes I could run the i-Code-V3 on the CPU, but fp16 was not supported yet (on CPU mode). I appreciate it to support Running FP32 and FP16 on CPU.