invictus717 / metatransformer
Meta-Transformer for Unified Multimodal Learning
Home Page: https://arxiv.org/abs/2307.10802
License: Apache License 2.0
Line 15 of run_sc.sh in the Audio module reports an error that the venvast file does not exist. How can I solve this?
Hello! I've been trying to figure out Meta-Transformer for two weeks now and I can't get the embeddings I need. Please share the code for the following example: how to get text and image embeddings from the words "dog", "car", "bird" and their pictures. Thanks!
Thank you very much for this excellent work; it has given me a great deal of inspiration!
When using Meta-Transformer to process the text modality, I have a few questions that puzzle me:
Hi,
Thanks for your great work.
I'm curious about the difference between ViT-Adapter and Meta-Transformer on detection and segmentation tasks.
Best,
James
[Like] A very impressive open-source project! But we in mainland China cannot access Google outside the firewall; are there plans to also upload the pretrained models to a host accessible within China?
Hi, thanks for your great contributions.
I am curious about your "pretrain-finetune" pipeline.
According to the paper and your code, it seems that the pipeline is:
Do I understand this correctly?
Here are my concerns:
Thank you for sharing this most exciting work!
I would like to know: is the code for tokenizing different modalities not released yet, or am I failing to see where in the code the tokenization happens?
I would like to use Meta Transformer on a custom Data Set, with image and text inputs.
As far as I understood, the workflow would be:
token_text, token_image = tokenize(text), tokenize(image)
embedding_text = pretrained_encoder(token_text) # as described in demo
embedding_image = pretrained_encoder(token_image) # as described in demo
downstream_task(embedding_text, embedding_image)
Is this correct on a very high level? (See the sketch below.)
Thanks in advance!
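For concreteness, here is a minimal runnable sketch of the flow described above. The stand-in tokenizers and the commented-out checkpoint name (taken from the repo's demo section) are assumptions, not the repo's actual Data2Seq API:

import torch
import torch.nn as nn
from timm.models.vision_transformer import Block

# Frozen modality-shared encoder, built as in the repo's demo (base: 12 blocks, width 768).
encoder = nn.Sequential(*[
    Block(dim=768, num_heads=12, mlp_ratio=4., qkv_bias=True,
          norm_layer=nn.LayerNorm, act_layer=nn.GELU)
    for _ in range(12)])
# ckpt = torch.load("Meta-Transformer_base_patch16_encoder.pth")  # demo checkpoint
# encoder.load_state_dict(ckpt, strict=True)
encoder.eval()

# Stand-in tokenizers (assumption: the repo's Data2Seq module plays this role);
# both map raw inputs to a (B, N, 768) token sequence.
text_tokenizer = nn.Embedding(30522, 768)                        # token-id lookup
image_tokenizer = nn.Conv2d(3, 768, kernel_size=16, stride=16)   # 16x16 patch embedding

text_ids = torch.randint(0, 30522, (1, 32))      # dummy token ids
image = torch.randn(1, 3, 224, 224)              # dummy image

token_text = text_tokenizer(text_ids)                             # (1, 32, 768)
token_image = image_tokenizer(image).flatten(2).transpose(1, 2)   # (1, 196, 768)

with torch.no_grad():                            # the shared encoder stays frozen
    embedding_text = encoder(token_text)
    embedding_image = encoder(token_image)

# Only a task-specific head is trained on top of the frozen representations.
head = nn.Linear(768, 2)
logits = head(embedding_image.mean(dim=1))       # mean-pool tokens, then classify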
Can you tell me if BBOX data (either 2D or 3D) is currently supported? If not, can you give some guidance on using richer input data types?
Hello
Thank you for your amazing work, which is quite promising. However, I'm encountering difficulties due to the lack of clear environment setup instructions. Could you please provide more detailed environment setup and guidance in the documentation? This would greatly help users like me set up the project smoothly and contribute effectively. Additionally, I would like to inquire about the current status of the project's code upload.
Thank you for your consideration.
Best regards,
I'm using the htc++ model from Image, and I downloaded the COCO dataset as required by readme.md, but during training I keep getting the following error: FileNotFoundError: [Errno 2] No such file or directory: 'data/coco/stuffthingmaps/train2017/000000248242.png'. All the files in my dataset are jpg, but after renaming them to png, some data still requires png files. How can I solve this problem?
First of all, great work by your team; this will be a new breakthrough in AI.
How can we fine-tune the Meta-Transformer model with our task-specific data?
Hi, thank you for your great work!
Could you share the training code for the Unified Multimodal Model?
Many thanks!
How can we use multiple modalities at the same time for one task, such as text+image, text+audio, or text+point cloud?
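The repo does not document this, but one plausible (unconfirmed) approach, given that every modality is tokenized to a (B, N, 768) sequence, is to concatenate the token sequences of the two modalities before the shared encoder; a sketch under that assumption:

import torch

# Dummy per-modality token sequences, both already in the shared width (768).
token_text = torch.randn(1, 32, 768)
token_image = torch.randn(1, 196, 768)

# Plausible joint input: concatenate along the sequence axis, so self-attention
# in the shared encoder can mix the two modalities.
joint_tokens = torch.cat([token_text, token_image], dim=1)   # (1, 228, 768)
# joint_embedding = encoder(joint_tokens)   # frozen encoder from the demo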
After obtaining the token sequence, we employ a modality-shared encoder to extract representation across different modalities. With task-specific heads, Meta-Transformer can hanle various tasks on the different modalities, such as: classification, detection, and segmentation.
The "hanle" should be "handle".
LOL, nice work, folks, and thanks for sharing.
Looking forward to the code.
Hi, thanks for your great contributions.
I have some questions after reading your paper:
The paper reports the Size of the pre-training data but does not give the corresponding unit. I know that LAION-2B contains 2B image-text pairs, but what do 0.8B, 3.3B, 4,5000B mean for language models? The number of tokens, or the disk space of the text files?
The frozen model (Meta-Transformer-B16_F) lags behind the SOTA performance by a large margin (similarly in Table 9, video understanding, and Table 12, graph data understanding). Regarding this, I mentioned in issue#49 that the released Meta-Transformer weights may give inferior representations, especially for modalities like text, video, and graph (because the backbone is pre-trained on LAION-2B, not trained jointly on data across 12 modalities). Do I misunderstand or overlook something? It seems that one would have to fine-tune the backbone to get more discriminative representations (e.g., Table 3, Meta-Transformer-B16_F vs. Meta-Transformer-B16_T).
Also, please consider adding detailed docstrings throughout the code.
From
cls_tokens = repeat(self.cls_tokens, '() n d -> b n d', b=b)
to
cls_tokens = repeat(self.cls_token, '() n d -> b n d', b=b)
Which would be equivalent to … ?
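For reference, the elided comparison above is presumably the plain-PyTorch expand form; assuming self.cls_token has shape (1, n, d), the einops repeat and expand produce identical results:

import torch
from einops import repeat

cls_token = torch.randn(1, 1, 768)   # (1, n, d) learnable class token, n = 1
b = 4

a = repeat(cls_token, '() n d -> b n d', b=b)   # einops broadcast over the batch
c = cls_token.expand(b, -1, -1)                 # plain-PyTorch equivalent (a view)
assert torch.equal(a, c)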
Same as title.
Great work, thanks.
I have a question: will each task have its own corresponding downstream MLP head?
This is a very useful project; it would be great if it could be used in production with ONNX support.
How can we replicate the training process? Will you release a detailed replication guide? What training rig did you use?
Hi, thanks for your great work. When will the point cloud model be released?
I installed mmcv == 2.0.0, but running the code raises ModuleNotFoundError: No module named 'mmcv.fileio'.
It seems that the fileio module was removed in mmcv 2.0.
Hello, thank you very much for your excellent work!
I want to use Meta-Transformer for some classification tasks. I first used the base version of the model together with part of your code, and it ran successfully.
But when I tried to switch to the large version, some problems appeared: compared with the base version, it seems to have 24 blocks, and the required embedding dimension becomes 1024.
I tried modifying the model definition to fit the large version, but I could not get it to work. Could you provide the model definition for the large version and explain how to use it?
Thanks!
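Not an official definition, but a sketch of how the demo's base encoder construction would change for the Large model, assuming it follows the standard ViT-Large configuration (24 blocks, width 1024, 16 heads); the checkpoint filename here is hypothetical:

import torch
import torch.nn as nn
from timm.models.vision_transformer import Block

# Assumed ViT-Large-style configuration: 24 blocks, width 1024, 16 heads.
encoder_large = nn.Sequential(*[
    Block(dim=1024, num_heads=16, mlp_ratio=4., qkv_bias=True,
          norm_layer=nn.LayerNorm, act_layer=nn.GELU)
    for _ in range(24)])
# ckpt = torch.load("Meta-Transformer_large_patch14_encoder.pth")  # hypothetical filename
# encoder_large.load_state_dict(ckpt, strict=True)

tokens = torch.randn(1, 196, 1024)    # any (B, N, 1024) token sequence
features = encoder_large(tokens)      # (1, 196, 1024)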
Running the sh script in the Audio directory gives an error: where is Meta-Transformer_base_patch16_encoder.pth? I don't see a torch.save for it anywhere, so torch.load fails.
Hello! The project looks very promising, but I'm having some issues getting started with it. Given that there is a common embedding space for all these modalities, all I would like to do is encode and compare different data types.
Very rudimentary example would be that I have some pictures of animals and some point clouds of animals and I'd like to calculate the similarity matrix to find out which picture matches which point cloud.
As far as I understand it,
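For the picture/point-cloud matching described above, once both modalities are embedded in the shared space, the comparison itself is just a cosine-similarity matrix. A minimal sketch, assuming pooled (N, D) embeddings are already in hand (how well the released frozen backbone actually aligns these modalities is a separate question raised elsewhere in this thread):

import torch
import torch.nn.functional as F

# Dummy pooled embeddings in the shared space: 5 images, 7 point clouds.
image_emb = torch.randn(5, 768)
pc_emb = torch.randn(7, 768)

# L2-normalize, then a matrix product gives the pairwise cosine similarities.
sim = F.normalize(image_emb, dim=-1) @ F.normalize(pc_emb, dim=-1).T   # (5, 7)
best_match = sim.argmax(dim=1)   # for each image, the index of the closest point cloud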
The second paragraph of the paper's Introduction is missing a ".".
Hello, how is the input modality determined during inference? Is a classification network used before the unimodal expert transformer?
In the audio part, what does the CUDA_VISIBLE_DEVICES parameter on line 56 of run_sc.sh do?
Can the audio training be resumed after being interrupted?
Nice work! Could you please provide the code for the audio preprocessing (Data-to-Sequence tokenization) described in Section 3.2 (Audio Spectrogram) of your paper?
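Not the authors' exact pipeline, but a sketch of the AST-style front end that the paper's description suggests: log-mel filter banks, then flattened patches as tokens. All parameter values here are assumptions:

import torchaudio

waveform, sr = torchaudio.load("example.wav")   # (channels, frames)

# Kaldi-compatible log-mel filter banks, as in AST-style audio pipelines.
fbank = torchaudio.compliance.kaldi.fbank(
    waveform, htk_compat=True, sample_frequency=sr, use_energy=False,
    window_type='hanning', num_mel_bins=128, dither=0.0, frame_shift=10)
# fbank: (num_frames, 128), a time-frequency "image"

# Cut the spectrogram into non-overlapping 16x16 patches and flatten them.
patches = fbank.unfold(0, 16, 16).unfold(1, 16, 16)   # (T/16, 128/16, 16, 16)
tokens = patches.reshape(-1, 16 * 16)                 # (N, 256) token sequence
# A learned linear projection (256 -> 768) would then map tokens to the encoder width.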
After reading your paper, I still don't know how to use your model. Could you please provide a complete example? Please include a full 'Demo of Use for Pretrained Encoder' here. Thank you very much.
The paper mentions that Meta-Transformer uses the tokenizer of Autoformer for the time-series forecasting task. However, Autoformer does not have a "tokenizer"; its encoder directly takes the raw time-series data as input. I wonder if you mistook it for PatchTST or something else?
How can I do object detection on images with Meta-Transformer without using the ViT-Adapter model? I don't see a tutorial for this in your project; could you point me to one?
Hello! Your transformer is amazing! But I'm a beginner in data science. I have to do research for my university task: we want to predict how negotiations will end. We have various modalities including video, audio, and time-series EEG. Do you have a demo showing how to use the transformer for such tasks? If so, please share it.
Thanks!
I would like to express my appreciation for your exceptional work. I attended your live presentation yesterday and gained valuable insights. I am interested in exploring the Unified Multimodal Model that you proposed within my research domain. As my multimodal data is of fine granularity, I am considering fine-tuning or retraining your model to suit my needs.
I kindly request if it would be possible for you to open-source some of the pretraining procedures for the Unified Multimodal Model. This would greatly assist me in adapting the model to my specific requirements.
Thank you very much for your outstanding contributions.
How can we get the inference demo? The patch embedding part seems not to be available.
In the Data2Seq code for getting embeddings, the Image and Video embedders have a Conv2d and Conv3d, respectively. Do you plan to release the pre-trained weights for these layers?
Hi, this is great work. When will the "Data2Seq" module's dataset and code be available? Do you have a timetable now?
Thanks so much!
Thanks for sharing the code for embedding modalities!
I'd like to use Meta Transformer in my research (I use images and text) and have multiple short questions:
2 a) When passing a text, data2seq produces a dict with input_ids (tokens) and attention_masks
2 b) When using the get_text_embeddings() to embed text, I get an embedding of (batch_size x 768). The encoder as loaded in the demo section does not accept this shape (I need to add unsqueeze() to add another dimension).
What's the correct way to embed text, and what input shape does the encoder expect? (See the sketch after these questions.)
Input shapes after embedding should be the same across all modalities, correct?
Are weights for the embedding layers available to download, or would I need to learn them separately?
Thanks in advance for your time!
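On question 2 b): the demo encoder is a stack of ViT blocks, which expect (batch, seq_len, dim) inputs, so a pooled (batch, 768) text embedding must be unsqueezed into a length-1 sequence. A sketch of the shapes, assuming the base width of 768:

import torch
from timm.models.vision_transformer import Block

block = Block(dim=768, num_heads=12)     # one block of the demo encoder stack

pooled_text = torch.randn(4, 768)        # (batch, dim): what get_text_embeddings returns
tokens_text = pooled_text.unsqueeze(1)   # (batch, 1, dim): a length-1 token sequence
out = block(tokens_text)                 # ViT blocks expect (B, N, D)

# After embedding, all modalities share the width D = 768, but the sequence
# length N may differ per modality (e.g. 196 image patches vs. 1 text token).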
Can you give a demo of how to use the data2seq code?
I ran into a problem while rewriting the dataloader.
The earlier work I build on uses TAU audio stored as two-channel wav files, whereas the dataset used in your work contains single-channel wav files.
Your code has a section that converts audio files into filter-bank features, and the error occurs there, because the two kinds of files have different shapes after loading. I don't know much about audio; my guess is that the function that processes a single channel could be changed to process both channels of the stereo file. Would that solve the problem?
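A common fix, offered here as an assumption about your setup rather than anything from this repo, is to downmix the stereo waveform to mono before the filter-bank computation, rather than changing the feature code:

import torchaudio

waveform, sr = torchaudio.load("stereo_example.wav")   # (2, frames) for stereo wavs

# Downmix to mono by averaging the channels, so the filter-bank code sees the
# same (1, frames) shape it gets from single-channel files.
if waveform.size(0) > 1:
    waveform = waveform.mean(dim=0, keepdim=True)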
I wonder: can modality A interact with modality B during training?
My guess is that each tokenizer processes its modality separately, each modality is passed through the frozen encoder (concatenating all the data and setting the attention mask so that modality A cannot attend to modality B, or just running the forward pass 12 times?), and each modality is sent to its own head to compute the loss.
Am I right?
I also wonder whether the power comes from the CLIP backbone that is frozen in your experiments.
Hi,
Thanks for your great work!
I'm a beginner in LLMs; could you please tell me the difference between the patch 14 and patch 16 models?
Besides, how do I use the pre-trained models? For example, if I want to do a text generation task, should I take a model like LLaMA or Vicuna and replace its encoder with this pre-trained encoder?
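On the patch-14 vs. patch-16 naming: these presumably follow the usual ViT convention, i.e. the side length of the square patches used to tokenize a 224x224 image, which changes the token count and the patch-embedding weights. A quick check (the widths 768 and 1024 are assumptions tied to the base and large models):

import torch
import torch.nn as nn

image = torch.randn(1, 3, 224, 224)

# A ViT patch embedding is a Conv2d whose kernel and stride equal the patch size.
embed16 = nn.Conv2d(3, 768, kernel_size=16, stride=16)    # patch 16 (base width, assumed)
embed14 = nn.Conv2d(3, 1024, kernel_size=14, stride=14)   # patch 14 (large width, assumed)

print(embed16(image).flatten(2).shape)   # (1, 768, 196): 14 x 14 = 196 tokens
print(embed14(image).flatten(2).shape)   # (1, 1024, 256): 16 x 16 = 256 tokens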