

Screen AI

Implementation of the ScreenAI model from the paper "ScreenAI: A Vision-Language Model for UI and Infographics Understanding" (https://arxiv.org/abs/2402.04615). The data flow is: image + text -> patches -> ViT -> embed + concat -> attention + feed-forward -> cross-attention + feed-forward + self-attention -> output head.
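The flow above can be sketched in plain PyTorch. This is an illustrative stand-in only (a single linear projection plays the role of the ViT), not the package's actual implementation:

```python
import torch
from torch import nn

patch_size, dim = 16, 512
image = torch.rand(1, 3, 224, 224)

# image -> patches: unfold into (batch, num_patches, patch_dim)
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.contiguous().view(1, 3, -1, patch_size, patch_size)
patches = patches.permute(0, 2, 1, 3, 4).reshape(1, -1, 3 * patch_size * patch_size)

# patches -> embeddings: a single linear layer stands in for the ViT encoder
to_embed = nn.Linear(3 * patch_size * patch_size, dim)
image_tokens = to_embed(patches)                       # (1, 196, 512)

# embed + concat: join image tokens with the text tokens before attention
text_tokens = torch.randn(1, 1, dim)
fused = torch.cat([image_tokens, text_tokens], dim=1)  # (1, 197, 512)
print(fused.shape)
```

The fused sequence is what the attention / feed-forward and cross-attention stages would then consume.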

Install

pip3 install screenai

Usage

import torch
from screenai.main import ScreenAI

# Create a tensor for the image
image = torch.rand(1, 3, 224, 224)

# Create a tensor for the text
text = torch.randn(1, 1, 512)

# Create an instance of the ScreenAI model with specified parameters
model = ScreenAI(
    patch_size=16,
    image_size=224,
    dim=512,
    depth=6,
    heads=8,
    vit_depth=4,
    multi_modal_encoder_depth=4,
    llm_decoder_depth=4,
    mm_encoder_ff_mult=4,
)

# Perform forward pass of the model with the given text and image tensors
out = model(text, image)

# Print the output tensor
print(out)

License

MIT

Citation

@misc{baechler2024screenai,
    title={ScreenAI: A Vision-Language Model for UI and Infographics Understanding}, 
    author={Gilles Baechler and Srinivas Sunkara and Maria Wang and Fedir Zubach and Hassan Mansoor and Vincent Etter and Victor Cărbune and Jason Lin and Jindong Chen and Abhanshu Sharma},
    year={2024},
    eprint={2402.04615},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}

Todo

  • Implement nn.ModuleList-based layer stacking in the encoder and decoder
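For reference, layer stacking with nn.ModuleList usually follows this generic pattern (an illustrative sketch, not this repository's code):

```python
import torch
from torch import nn

class Encoder(nn.Module):
    """Generic stacked-layer pattern via nn.ModuleList (illustrative only)."""

    def __init__(self, dim: int, depth: int):
        super().__init__()
        # One entry per layer; nn.ModuleList registers each as a submodule
        # so its parameters are tracked and moved with the model.
        self.layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(depth)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:  # apply the layers sequentially
            x = torch.relu(layer(x))
        return x

enc = Encoder(dim=512, depth=4)
out = enc(torch.randn(1, 1, 512))
print(out.shape)  # torch.Size([1, 1, 512])
```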

screenai's People

Contributors

kyegomez


screenai's Issues

runtime error when executing the default example

Describe the bug
After pip install screenai, a runtime error is raised at the from screenai.main import ScreenAI line of the default example:
RuntimeError: mat1 and mat2 shapes cannot be multiplied (512x4 and 512x512)

To Reproduce
Steps to reproduce the behavior:

  1. run pip install screenai
  2. run the default example

Expected behavior
run without error

Screenshots
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_20976\3292023021.py in <cell line: 2>()
1 import torch
----> 2 from screenai.main import ScreenAI
3
4 # Create a tensor for the image
5 image = torch.rand(1, 3, 224, 224)

~\AppData\Local\Programs\Python\Python39\lib\site-packages\screenai\__init__.py in
----> 1 from screenai.main import (
2 CrossAttention,
3 MultiModalEncoder,
4 MultiModalDecoder,
5 ScreenAI,

~\AppData\Local\Programs\Python\Python39\lib\site-packages\screenai\main.py in
5 from torch import Tensor, einsum, nn
6 from torch.autograd import Function
----> 7 from zeta.nn import (
8 SwiGLU,
9 FeedForward,

~\AppData\Local\Programs\Python\Python39\lib\site-packages\zeta\__init__.py in
26 logger.addFilter(f)
27
---> 28 from zeta.nn import *
29 from zeta.models import *
30 from zeta.utils import *

~\AppData\Local\Programs\Python\Python39\lib\site-packages\zeta\nn\__init__.py in
1 from zeta.nn.attention import *
2 from zeta.nn.embeddings import *
----> 3 from zeta.nn.modules import *
4 from zeta.nn.biases import *

~\AppData\Local\Programs\Python\Python39\lib\site-packages\zeta\nn\modules\__init__.py in
45 from zeta.nn.modules.s4 import s4d_kernel
46 from zeta.nn.modules.h3 import H3Layer
---> 47 from zeta.nn.modules.mlp_mixer import MLPMixer
48 from zeta.nn.modules.leaky_relu import LeakyRELU
49 from zeta.nn.modules.adaptive_layernorm import AdaptiveLayerNorm

~\AppData\Local\Programs\Python\Python39\lib\site-packages\zeta\nn\modules\mlp_mixer.py in
143 1, 512, 32, 32
144 ) # Batch size of 1, 512 channels, 32x32 image
--> 145 output = mlp_mixer(example_input)
146 print(
147 output.shape

~\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\nn\modules\module.py in _wrapped_call_impl(self, *args, **kwargs)
1516 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1517 else:
-> 1518 return self._call_impl(*args, **kwargs)
1519
1520 def _call_impl(self, *args, **kwargs):

~\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\nn\modules\module.py in _call_impl(self, *args, **kwargs)
1525 or _global_backward_pre_hooks or _global_backward_hooks
1526 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527 return forward_call(*args, **kwargs)
1528
1529 try:

~\AppData\Local\Programs\Python\Python39\lib\site-packages\zeta\nn\modules\mlp_mixer.py in forward(self, x)
123 x = rearrange(x, "n c h w -> n (h w) c")
124 for mixer_block in self.mixer_blocks:
--> 125 x = mixer_block(x)
126 x = self.pred_head_layernorm(x)
127 x = x.mean(dim=1)

~\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\nn\modules\module.py in _wrapped_call_impl(self, *args, **kwargs)
1516 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1517 else:
-> 1518 return self._call_impl(*args, **kwargs)
1519
1520 def _call_impl(self, *args, **kwargs):

~\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\nn\modules\module.py in _call_impl(self, *args, **kwargs)
1525 or _global_backward_pre_hooks or _global_backward_hooks
1526 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527 return forward_call(*args, **kwargs)
1528
1529 try:

~\AppData\Local\Programs\Python\Python39\lib\site-packages\zeta\nn\modules\mlp_mixer.py in forward(self, x)
61 y = self.norm1(x)
62 y = rearrange(y, "n c t -> n t c")
---> 63 y = self.tokens_mlp(y)
64 y = rearrange(y, "n t c -> n c t")
65 x = x + y

~\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\nn\modules\module.py in _wrapped_call_impl(self, *args, **kwargs)
1516 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1517 else:
-> 1518 return self._call_impl(*args, **kwargs)
1519
1520 def _call_impl(self, *args, **kwargs):

~\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\nn\modules\module.py in _call_impl(self, *args, **kwargs)
1525 or _global_backward_pre_hooks or _global_backward_hooks
1526 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527 return forward_call(*args, **kwargs)
1528
1529 try:

~\AppData\Local\Programs\Python\Python39\lib\site-packages\zeta\nn\modules\mlp_mixer.py in forward(self, x)
28 torch.Tensor: description
29 """
---> 30 y = self.dense1(x)
31 y = F.gelu(y)
32 return self.dense2(y)

~\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\nn\modules\module.py in _wrapped_call_impl(self, *args, **kwargs)
1516 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1517 else:
-> 1518 return self._call_impl(*args, **kwargs)
1519
1520 def _call_impl(self, *args, **kwargs):

~\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\nn\modules\module.py in _call_impl(self, *args, **kwargs)
1525 or _global_backward_pre_hooks or _global_backward_hooks
1526 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527 return forward_call(*args, **kwargs)
1528
1529 try:

~\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\nn\modules\linear.py in forward(self, input)
112
113 def forward(self, input: Tensor) -> Tensor:
--> 114 return F.linear(input, self.weight, self.bias)
115
116 def extra_repr(self) -> str:

RuntimeError: mat1 and mat2 shapes cannot be multiplied (512x4 and 512x512)
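For context, this error means an nn.Linear layer received an input whose last dimension does not match the layer's in_features. A standalone reproduction with plain torch, unrelated to the package internals:

```python
import torch
from torch import nn

layer = nn.Linear(512, 512)  # expects inputs whose last dimension is 512
bad = torch.randn(512, 4)    # last dimension is 4, not 512

try:
    layer(bad)
except RuntimeError as e:
    # Raises a "mat1 and mat2 shapes cannot be multiplied" error,
    # matching the traceback above: (512x4) cannot multiply (512x512).
    print(e)
```

The fix is to ensure the tensor reaching the linear layer has the expected feature dimension, e.g. an input of shape (*, 512) here.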

Real image input example

Hello @kyegomez ,

Thank you so much for this awesome repo. I'm very excited to test this project. I tried the example code, but it gives me the error below:

SyntaxError: Non-UTF-8 code starting with '\xff' in file C:\Users\alamj\Downloads\screenai.py on line 1, but no encoding declared; see https://peps.python.org/pep-0263/ for details

Could you provide a real example that takes an input image and text, converts them to tensors, and feeds them to the model? I'd really like to try it out.

Thank you!
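A sketch of what such an example could look like, assuming a 224x224 model input and treating the text as a pre-embedded tensor (the package's tokenizer interface is not documented here):

```python
import torch
from PIL import Image

# A real pipeline would load an actual screenshot, e.g.:
#   img = Image.open("screenshot.png").convert("RGB")
# Here a blank image stands in so the snippet is self-contained.
img = Image.new("RGB", (1280, 800), "white")

# PIL image -> float tensor of shape (1, 3, H, W), scaled to [0, 1]
raw = torch.tensor(list(img.getdata()), dtype=torch.float32) / 255.0
raw = raw.view(img.height, img.width, 3).permute(2, 0, 1).unsqueeze(0)

# Resize to the 224x224 input the Usage section assumes
image = torch.nn.functional.interpolate(
    raw, size=(224, 224), mode="bilinear", align_corners=False
)

# The README feeds a pre-embedded text tensor; a real pipeline would
# tokenize and embed the prompt here instead of using random values.
text = torch.randn(1, 1, 512)

# model = ScreenAI(...)  # construct as in the Usage section
# out = model(text, image)
print(image.shape)  # torch.Size([1, 3, 224, 224])
```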
