invictus717 / metatransformer
Meta-Transformer for Unified Multimodal Learning
Home Page: https://arxiv.org/abs/2307.10802
License: Apache License 2.0
Line 15 of run_sc.sh in the Audio module reports an error that the venvast file does not exist. How can I solve this?
Hello! I've been trying to figure out Meta-Transformer for two weeks now and I can't get the embeddings I need. Please share the code for the following example: how to get text and image embeddings from the words "dog", "car", "bird" and their pictures. Thanks!
Thank you very much for this excellent work; it has given me a great deal of inspiration!
When using Meta-Transformer to process the text modality, I have a few questions that puzzle me:
Hi,
Thanks for your great work.
I'm curious about the difference between ViT-Adapter and Meta-Transformer on detection and segmentation tasks.
Best,
James
[Like] A very impressive open-source project! But we in mainland China cannot access Google outside the firewall; are there plans to also upload the pretrained models to a host accessible within China?
Hi, thanks for your great contributions.
I am curious about your "pretrain-finetune" pipeline.
According to the paper and your code, it seems that the pipeline is:
Do I understand this correctly?
Here are my concerns:
Thank you for sharing this most exciting work!
I would like to know: is the code for tokenizing different modalities not released yet, or am I failing to see where in the code the tokenization happens?
I would like to use Meta Transformer on a custom Data Set, with image and text inputs.
As far as I understood, the workflow would be:
token_text, token_image = tokenize(text), tokenize(image)
embedding_text = pretrained_encoder(token_text) # as described in demo
embedding_image = pretrained_encoder(token_image) # as described in demo
downstream_task(embedding_text, embedding_image)
Is this correct on a very high level? (See the sketch below.)
Thanks in advance!
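For concreteness, here is a minimal runnable sketch of the flow described above. The stand-in tokenizers and the commented-out checkpoint name (taken from the repo's demo section) are assumptions, not the repo's actual Data2Seq API:

import torch
import torch.nn as nn
from timm.models.vision_transformer import Block

# Frozen modality-shared encoder, built as in the repo's demo (base: 12 blocks, width 768).
encoder = nn.Sequential(*[
    Block(dim=768, num_heads=12, mlp_ratio=4., qkv_bias=True,
          norm_layer=nn.LayerNorm, act_layer=nn.GELU)
    for _ in range(12)])
# ckpt = torch.load("Meta-Transformer_base_patch16_encoder.pth")  # demo checkpoint
# encoder.load_state_dict(ckpt, strict=True)
encoder.eval()

# Stand-in tokenizers (assumption: the repo's Data2Seq module plays this role);
# both map raw inputs to a (B, N, 768) token sequence.
text_tokenizer = nn.Embedding(30522, 768)                        # token-id lookup
image_tokenizer = nn.Conv2d(3, 768, kernel_size=16, stride=16)   # 16x16 patch embedding

text_ids = torch.randint(0, 30522, (1, 32))      # dummy token ids
image = torch.randn(1, 3, 224, 224)              # dummy image

token_text = text_tokenizer(text_ids)                             # (1, 32, 768)
token_image = image_tokenizer(image).flatten(2).transpose(1, 2)   # (1, 196, 768)

with torch.no_grad():                            # the shared encoder stays frozen
    embedding_text = encoder(token_text)
    embedding_image = encoder(token_image)

# Only a task-specific head is trained on top of the frozen representations.
head = nn.Linear(768, 2)
logits = head(embedding_image.mean(dim=1))       # mean-pool tokens, then classify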
Can you tell me if BBOX data (either 2D or 3D) is currently supported? If not, can you give some guidance on using richer input data types?
Hello
Thank you for your amazing work, which is quite promising. However, I'm encountering difficulties due to the lack of clear environment setup instructions. Could you please provide more detailed environment setup and guidance in the documentation? This would greatly help users like me set up the project smoothly and contribute effectively. Additionally, I would like to inquire about the current status of the project's code upload.
Thank you for your consideration.
Best regards,
I'm using the htc++ model from Image, and I downloaded the COCO dataset as required by readme.md, but during training I keep getting the following error: FileNotFoundError: [Errno 2] No such file or directory: 'data/coco/stuffthingmaps/train2017/000000248242.png'. All the files in my dataset are jpg, but after renaming them to png, some data still requires png files. How can I solve this problem?
First of all, great work by your team; this will be a new breakthrough in AI.
How can we fine-tune the Meta-Transformer model with our task-specific data?
Hi, thank you for your great work!
Could you share the training code for the Unified Multimodal Model?
Many thanks!
How can we use multiple modalities at the same time for one task, such as text+image, text+audio, or text+point cloud?
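The repo does not document this, but one plausible (unconfirmed) approach, given that every modality is tokenized to a (B, N, 768) sequence, is to concatenate the token sequences of the two modalities before the shared encoder; a sketch under that assumption:

import torch

# Dummy per-modality token sequences, both already in the shared width (768).
token_text = torch.randn(1, 32, 768)
token_image = torch.randn(1, 196, 768)

# Plausible joint input: concatenate along the sequence axis, so self-attention
# in the shared encoder can mix the two modalities.
joint_tokens = torch.cat([token_text, token_image], dim=1)   # (1, 228, 768)
# joint_embedding = encoder(joint_tokens)   # frozen encoder from the demo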
After obtaining the token sequence, we employ a modality-shared encoder to extract representation across different modalities. With task-specific heads, Meta-Transformer can hanle various tasks on the different modalities, such as: classification, detection, and segmentation.
The "hanle" should be "handle".
LOL, nice work, folks, and thanks for sharing.
Looking forward to the code.
Hi, thanks for your great contributions.
I have some questions after reading your paper:
The paper reports the Size of the pre-training data but does not give the corresponding unit. I know that LAION-2B contains 2B image-text pairs, but what do 0.8B, 3.3B, 4,5000B mean for language models? The number of tokens, or the disk space of the text files?
The frozen model (Meta-Transformer-B16_F) lags behind the SOTA performance by a large margin (similarly in Table 9, video understanding, and Table 12, graph data understanding). Regarding this, I mentioned in issue#49 that the released Meta-Transformer weights may give inferior representations, especially for modalities like text, video, and graph (because the backbone is pre-trained on LAION-2B, not trained jointly on data across 12 modalities). Do I misunderstand or overlook something? It seems that one would have to fine-tune the backbone to get more discriminative representations (e.g., Table 3, Meta-Transformer-B16_F vs. Meta-Transformer-B16_T).
Also, please consider adding detailed docstrings throughout the code.
From
cls_tokens = repeat(self.cls_tokens, '() n d -> b n d', b=b)
to
cls_tokens = repeat(self.cls_token, '() n d -> b n d', b=b)
Which would be equivalent to … ?
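For reference, the elided comparison above is presumably the plain-PyTorch expand form; assuming self.cls_token has shape (1, n, d), the einops repeat and expand produce identical results:

import torch
from einops import repeat

cls_token = torch.randn(1, 1, 768)   # (1, n, d) learnable class token, n = 1
b = 4

a = repeat(cls_token, '() n d -> b n d', b=b)   # einops broadcast over the batch
c = cls_token.expand(b, -1, -1)                 # plain-PyTorch equivalent (a view)
assert torch.equal(a, c)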
Same as title.
Great work, thanks.
I have a question: will each task have its own corresponding downstream MLP head?
This is a very useful project; it would be great if it could be used in production with ONNX support.
How can we replicate the training process? Will you release a detailed replication guide? What training rig did you use?
Hi, thanks for your great work. When will the point cloud model be released?
I installed mmcv == 2.0.0, but running the code raises ModuleNotFoundError: No module named 'mmcv.fileio'.
It seems that the fileio module was removed in mmcv 2.0.
Hello, thank you very much for your excellent work!
I want to use Meta-Transformer for some classification tasks. I first used the base version of the model together with part of your code, and it ran successfully.
But when I tried to switch to the large version, some problems appeared: compared with the base version, it seems to have 24 blocks, and the required embedding dimension becomes 1024.
I tried modifying the model definition to fit the large version, but I could not get it to work. Could you provide the model definition for the large version and explain how to use it?
Thanks!
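Not an official definition, but a sketch of how the demo's base encoder construction would change for the Large model, assuming it follows the standard ViT-Large configuration (24 blocks, width 1024, 16 heads); the checkpoint filename here is hypothetical:

import torch
import torch.nn as nn
from timm.models.vision_transformer import Block

# Assumed ViT-Large-style configuration: 24 blocks, width 1024, 16 heads.
encoder_large = nn.Sequential(*[
    Block(dim=1024, num_heads=16, mlp_ratio=4., qkv_bias=True,
          norm_layer=nn.LayerNorm, act_layer=nn.GELU)
    for _ in range(24)])
# ckpt = torch.load("Meta-Transformer_large_patch14_encoder.pth")  # hypothetical filename
# encoder_large.load_state_dict(ckpt, strict=True)

tokens = torch.randn(1, 196, 1024)    # any (B, N, 1024) token sequence
features = encoder_large(tokens)      # (1, 196, 1024)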
Running the sh script in the Audio directory gives an error: where is Meta-Transformer_base_patch16_encoder.pth? I don't see a torch.save for it anywhere, so torch.load fails.
Hello! The project looks very promising, but I'm having some issues getting started with it. Given that there is a common embedding space for all these modalities, all I would like to do is encode and compare different data types.
Very rudimentary example would be that I have some pictures of animals and some point clouds of animals and I'd like to calculate the similarity matrix to find out which picture matches which point cloud.
As far as I understand it,
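For the picture/point-cloud matching described above, once both modalities are embedded in the shared space, the comparison itself is just a cosine-similarity matrix. A minimal sketch, assuming pooled (N, D) embeddings are already in hand (how well the released frozen backbone actually aligns these modalities is a separate question raised elsewhere in this thread):

import torch
import torch.nn.functional as F

# Dummy pooled embeddings in the shared space: 5 images, 7 point clouds.
image_emb = torch.randn(5, 768)
pc_emb = torch.randn(7, 768)

# L2-normalize, then a matrix product gives the pairwise cosine similarities.
sim = F.normalize(image_emb, dim=-1) @ F.normalize(pc_emb, dim=-1).T   # (5, 7)
best_match = sim.argmax(dim=1)   # for each image, the index of the closest point cloud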
The second paragraph of the paper's Introduction is missing a ".".
Hello, how is the input modality determined during inference? Is a classification network used before the unimodal expert transformer?
In the audio part, what does the CUDA_VISIBLE_DEVICES parameter on line 56 of run_sc.sh do?
Can the audio training be resumed after being interrupted?
Nice work! Could you please provide the code for the audio preprocessing (Data-to-Sequence tokenization) described in Section 3.2 (Audio Spectrogram) of your paper?
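Not the authors' exact pipeline, but a sketch of the AST-style front end that the paper's description suggests: log-mel filter banks, then flattened patches as tokens. All parameter values here are assumptions:

import torchaudio

waveform, sr = torchaudio.load("example.wav")   # (channels, frames)

# Kaldi-compatible log-mel filter banks, as in AST-style audio pipelines.
fbank = torchaudio.compliance.kaldi.fbank(
    waveform, htk_compat=True, sample_frequency=sr, use_energy=False,
    window_type='hanning', num_mel_bins=128, dither=0.0, frame_shift=10)
# fbank: (num_frames, 128), a time-frequency "image"

# Cut the spectrogram into non-overlapping 16x16 patches and flatten them.
patches = fbank.unfold(0, 16, 16).unfold(1, 16, 16)   # (T/16, 128/16, 16, 16)
tokens = patches.reshape(-1, 16 * 16)                 # (N, 256) token sequence
# A learned linear projection (256 -> 768) would then map tokens to the encoder width.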
After reading your paper, I still don't know how to use your model. Could you please provide a complete example? Please include a full 'Demo of Use for Pretrained Encoder' here. Thank you very much.
The paper mentions that Meta-Transformer uses the tokenizer of Autoformer for the time-series forecasting task. However, Autoformer does not have a "tokenizer"; its encoder directly takes the raw time-series data as input. I wonder if you mistook it for PatchTST or something else?
How can I do object detection on images with Meta-Transformer without using the ViT-Adapter model? I don't see a tutorial for this in your project; could you point me to one?
Hello! Your transformer is amazing! But I'm a beginner in data science. I have to do research for my university task: we want to predict how negotiations will end. We have various modalities including video, audio, and time-series EEG. Do you have a demo showing how to use the transformer for such tasks? If so, please share it.
Thanks!
I would like to express my appreciation for your exceptional work. I attended your live presentation yesterday and gained valuable insights. I am interested in exploring the Unified Multimodal Model that you proposed within my research domain. As my multimodal data is of fine granularity, I am considering fine-tuning or retraining your model to suit my needs.
I kindly request if it would be possible for you to open-source some of the pretraining procedures for the Unified Multimodal Model. This would greatly assist me in adapting the model to my specific requirements.
Thank you very much for your outstanding contributions.
How can we get the inference demo? The patch embedding part seems not to be available.
In the Data2Seq code for getting embeddings, the Image and Video embedders have a Conv2d and Conv3d, respectively. Do you plan to release the pre-trained weights for these layers?
Hi, this is great work. When will the "Data2Seq" module's dataset and code be available? Do you have a timetable now?
Thanks so much!
Thanks for sharing the code for embedding modalities!
I'd like to use Meta Transformer in my research (I use images and text) and have multiple short questions:
2 a) When passing a text, data2seq produces a dict with input_ids (tokens) and attention_masks
2 b) When using the get_text_embeddings() to embed text, I get an embedding of (batch_size x 768). The encoder as loaded in the demo section does not accept this shape (I need to add unsqueeze() to add another dimension).
What's the correct way to embed text, and what input shape does the encoder expect? (See the sketch after these questions.)
Input shapes after embedding should be the same across all modalities, correct?
Are weights for the embedding layers available to download, or would I need to learn them separately?
Thanks in advance for your time!
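On question 2 b): the demo encoder is a stack of ViT blocks, which expect (batch, seq_len, dim) inputs, so a pooled (batch, 768) text embedding must be unsqueezed into a length-1 sequence. A sketch of the shapes, assuming the base width of 768:

import torch
from timm.models.vision_transformer import Block

block = Block(dim=768, num_heads=12)     # one block of the demo encoder stack

pooled_text = torch.randn(4, 768)        # (batch, dim): what get_text_embeddings returns
tokens_text = pooled_text.unsqueeze(1)   # (batch, 1, dim): a length-1 token sequence
out = block(tokens_text)                 # ViT blocks expect (B, N, D)

# After embedding, all modalities share the width D = 768, but the sequence
# length N may differ per modality (e.g. 196 image patches vs. 1 text token).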
Can you give a demo of how to use the data2seq code?
I ran into a problem while rewriting the dataloader.
The earlier work I build on uses TAU audio stored as two-channel wav files, whereas the dataset used in your work contains single-channel wav files.
Your code has a section that converts audio files into filter-bank features, and the error occurs there, because the two kinds of files have different shapes after loading. I don't know much about audio; my guess is that the function that processes a single channel could be changed to process both channels of the stereo file. Would that solve the problem?
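A common fix, offered here as an assumption about your setup rather than anything from this repo, is to downmix the stereo waveform to mono before the filter-bank computation, rather than changing the feature code:

import torchaudio

waveform, sr = torchaudio.load("stereo_example.wav")   # (2, frames) for stereo wavs

# Downmix to mono by averaging the channels, so the filter-bank code sees the
# same (1, frames) shape it gets from single-channel files.
if waveform.size(0) > 1:
    waveform = waveform.mean(dim=0, keepdim=True)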
I wonder: can modality A interact with modality B during training?
My guess is that each tokenizer processes its modality separately, each modality is passed through the frozen encoder (concatenating all the data and setting the attention mask so that modality A cannot attend to modality B, or just running the forward pass 12 times?), and each modality is sent to its own head to compute the loss.
Am I right?
I also wonder whether the power comes from the CLIP backbone that is frozen in your experiments.
Hi,
Thanks for your great work!
I'm a beginner in LLMs; could you please tell me the difference between the patch 14 and patch 16 models?
Besides, how do I use the pre-trained models? For example, if I want to do a text generation task, should I take a model like LLaMA or Vicuna and replace its encoder with this pre-trained encoder?
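On the patch-14 vs. patch-16 naming: these presumably follow the usual ViT convention, i.e. the side length of the square patches used to tokenize a 224x224 image, which changes the token count and the patch-embedding weights. A quick check (the widths 768 and 1024 are assumptions tied to the base and large models):

import torch
import torch.nn as nn

image = torch.randn(1, 3, 224, 224)

# A ViT patch embedding is a Conv2d whose kernel and stride equal the patch size.
embed16 = nn.Conv2d(3, 768, kernel_size=16, stride=16)    # patch 16 (base width, assumed)
embed14 = nn.Conv2d(3, 1024, kernel_size=14, stride=14)   # patch 14 (large width, assumed)

print(embed16(image).flatten(2).shape)   # (1, 768, 196): 14 x 14 = 196 tokens
print(embed14(image).flatten(2).shape)   # (1, 1024, 256): 16 x 16 = 256 tokens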