
vima's Introduction

VIMA: General Robot Manipulation with Multimodal Prompts

ICML 2023

Prompt-based learning has emerged as a successful paradigm in natural language processing, where a single general-purpose language model can be instructed to perform any task specified by input prompts. However, different robotics tasks are still tackled by specialized models. This work shows that we can express a wide spectrum of robot manipulation tasks with multimodal prompts, interleaving textual and visual tokens. We introduce VIMA (VisuoMotor Attention agent), a novel scalable multi-task robot learner with a uniform sequence IO interface achieved through multimodal prompts. The architecture follows the encoder-decoder transformer design proven to be effective and scalable in NLP. VIMA encodes an input sequence of interleaving textual and visual prompt tokens with a pretrained language model, and decodes robot control actions autoregressively for each environment interaction step. The transformer decoder is conditioned on the prompt via cross-attention layers that alternate with the usual causal self-attention. Instead of operating on raw pixels, VIMA adopts an object-centric approach. We parse all images in the prompt or observation into objects by off-the-shelf detectors, and flatten them into sequences of object tokens. All these design choices combined deliver a conceptually simple architecture with strong model and data scaling properties.

In this repo, we provide VIMA model code, pre-trained checkpoints covering a spectrum of model sizes, and demo and eval scripts. This codebase is under MIT License.

Installation

VIMA requires Python ≥ 3.9. We have tested on Ubuntu 20.04. Installing the VIMA codebase is as simple as:

pip install git+https://github.com/vimalabs/VIMA

Pretrained Models

We host pretrained models covering a spectrum of model capacities on Hugging Face. Download links are listed below. The Mask R-CNN model can be found here.

200M 92M 43M 20M 9M 4M 2M

Baselines Implementation

Because there is no prior method that works out of the box with our multimodal prompting setup, we make our best effort to select a number of representative transformer-based agent architectures as baselines, and re-interpret them to be compatible with VIMA-Bench. They include VIMA-Gato, VIMA-Flamingo, and VIMA-GPT. Their implementation can be found in the policy folder.

Demo

To run the live demonstration, first follow the instructions to install VIMA-Bench. Then we can run a live demo through

python3 scripts/example.py --ckpt={ckpt_path} --device={device} --partition={eval_level} --task={task}

Here eval_level is one of the four evaluation levels and can be chosen from placement_generalization, combinatorial_generalization, novel_object_generalization, and novel_task_generalization. task is a specific task template. Please refer to the task suite and benchmark for more details. For example:

python3 scripts/example.py --ckpt=200M.ckpt --partition=placement_generalization --task=follow_order

After running the above command, we should see a PyBullet GUI pop up, alongside a small window showing the multimodal prompt. Then a robot arm should move to complete the corresponding task. Note that this demo may not work on headless machines since the PyBullet GUI requires a display.
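For programmatic use outside the demo script, here is a minimal sketch of loading a released checkpoint; it relies on create_policy_from_ckpt, the helper that scripts/example.py itself calls, and the checkpoint path and device string are placeholders.

import torch
from vima import create_policy_from_ckpt  # helper used by scripts/example.py

ckpt_path = "200M.ckpt"  # any of the released checkpoints (placeholder path)
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Rebuild the policy from the config stored in the checkpoint and load its weights.
policy = create_policy_from_ckpt(ckpt_path, device).to(device)
policy.eval()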

Paper and Citation

Our paper is posted on arXiv. If you find our work useful, please consider citing us!

@inproceedings{jiang2023vima,
  title     = {VIMA: General Robot Manipulation with Multimodal Prompts},
  author    = {Yunfan Jiang and Agrim Gupta and Zichen Zhang and Guanzhi Wang and Yongqiang Dou and Yanjun Chen and Li Fei-Fei and Anima Anandkumar and Yuke Zhu and Linxi Fan},
  booktitle = {Fortieth International Conference on Machine Learning},
  year      = {2023}
}

vima's People

Contributors

eltociear · erwincoumans · yunfanjiang


vima's Issues

An error is reported when running the demo, and other models also fail to load

When I run the demo command, my command is as follows: python3 scripts/example.py --ckpt=./2M.ckpt --device=cuda --partition=placement_generalization --task=visual_manipulation
Error after running: pybullet build time: May 20 2022 19:45:31
[INFO] 17 tasks loaded
Traceback (most recent call last):
File "/data/code/VIMA-code/scripts/example.py", line 506, in
main(arg)
File "/data/anaconda3/envs/VIMA/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/data/code/VIMA-code/scripts/example.py", line 84, in main
policy = create_policy_from_ckpt(cfg.ckpt, cfg.device)
File "/data/code/VIMA-code/vima/init.py", line 11, in create_policy_from_ckpt
policy_instance.load_state_dict(
File "/data/anaconda3/envs/VIMA/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1604, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for VIMAPolicy:
Unexpected key(s) in state_dict: "xattn_gpt.h.0.attn.bias"

I tried other ckpt files and the same error was reported; none of them loaded.

I solved it by following someone else's workaround: #20 (comment)
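For reference, a minimal workaround sketch along the lines of that fix: filter out the unexpected attention-bias buffers and load the remaining weights non-strictly. The import path and checkpoint layout (the "cfg" / "state_dict" keys and an optional "policy." prefix) are assumptions taken from the tracebacks in this thread, and pinning the transformers version the authors used may be the cleaner fix.

import torch
from vima.policy import VIMAPolicy  # assumed import path

ckpt = torch.load("2M.ckpt", map_location="cpu")
policy = VIMAPolicy(**ckpt["cfg"])

state_dict = {}
for k, v in ckpt["state_dict"].items():
    k = k[len("policy."):] if k.startswith("policy.") else k  # strip a possible prefix
    if k.endswith("attn.bias"):
        continue  # drop mask buffers the installed transformers version no longer registers
    state_dict[k] = v

missing, unexpected = policy.load_state_dict(state_dict, strict=False)
print("missing:", missing, "unexpected:", unexpected)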

Implementation details of the loss scaling algorithm

Thank you for sharing such great works!

I encountered the following problem during the reproduction of your paper and was wondering if you might be willing to offer some guidance or clarification.

I trained the 2M VIMAPolicy (with all weights initialized by their default initial distributions except T5) on a small subset of the VIMA-Bench dataset (32 samples per task and 13 tasks in total) and tried to make it overfit.

I found that the imitation losses (calculated by cross_entropy_loss(dist_dict._logits, discrete_target_action)) of different action attributes (such as pose0_rotation, pose1_position) can evolve very differently during training, as the plot below shows. In this experiment, the final loss is the equal-weight sum over all action attributes, normalized by the number of time steps.


The plot shows how the different per-step loss attributes converge.
For example, `pose0_rotation_0` means the loss associated with the first dimension of `pose0_rotation` at a single time step.

Zooming in on the first and last 100 epochs of the experiment, all dimensions of pose0_rotation and the first two dimensions of pose1_rotation converge very quickly to zero, while the other losses converge relatively slowly. The scaling between them changes dynamically.


First 100 epochs


Last 100 epochs

In the same experiment, I also measured the ratio of the average loss between different tasks and got the following table. For example, 16.745474 means the average loss over rearrange_then_restore samples is about 16× that of novel_noun samples.

novel_noun                     1.000000
sweep_without_exceeding        1.602642
rotate                         1.857377
visual_manipulation            1.998764
twist                          3.802508
manipulate_old_neighbor        4.956325
scene_understanding            5.033336
follow_order                   5.132609
rearrange                      5.827855
pick_in_order_then_restore    11.248917
rearrange_then_restore        16.745474

I would like to know how those losses (per action attribute and per task) are balanced during training. Thank you
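For concreteness, here is a condensed sketch of the loss computation described above, reconstructed from this description rather than taken from the authors' training code; the assumption that each entry of dist_dict exposes one Categorical per action dimension via _dists follows the usage elsewhere in this tracker.

import torch.nn.functional as F

def summed_imitation_loss(dist_dict, discrete_target_action, n_steps):
    """Equal-weight sum of per-dimension cross-entropies, normalized by the number of time steps."""
    total = 0.0
    for key, dist in dist_dict.items():          # e.g. "pose0_rotation", "pose1_position"
        for dim, cat in enumerate(dist._dists):  # assumed: one Categorical per action dimension
            logits = cat.logits                  # (T, B, n_bins)
            target = discrete_target_action[key][..., dim].long()  # (T, B)
            total = total + F.cross_entropy(logits.flatten(0, 1), target.flatten(0, 1))
    return total / n_steps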

t5-large instead of t5-base

I want to use t5-large instead of t5-base. What parts of the code should I change?
I have changed the following lines:

tokenizer = Tokenizer.from_pretrained("t5-large")
self.t5 = T5EncoderModel.from_pretrained("t5-large")
model = AutoModel.from_pretrained("t5-large")

But I get the following error:

size mismatch for prompt_embedding._embed_layer.weight: copying a param with shape torch.Size([32128, 768]) from checkpoint, the shape in current model is torch.Size([32128, 1024]).
size mismatch for t5_prompt_encoder.t5.shared.weight: copying a param with shape torch.Size([32128, 768]) from checkpoint, the shape in current model is torch.Size([32128, 1024]).
size mismatch for t5_prompt_encoder.t5.encoder.embed_tokens.weight: copying a param with shape torch.Size([32128, 768]) from checkpoint, the shape in current model is torch.Size([32128, 1024]).
(... the same 768-vs-1024 size mismatch is then reported for every SelfAttention (q/k/v/o), relative_attention_bias, DenseReluDense (wi/wo), and layer_norm weight in encoder blocks 0–11, and for t5_prompt_encoder.t5.encoder.final_layer_norm.weight.)

thanks for your attention.

[Test] The test result is not consistent with that reported in the paper

Hi, I tried to reproduce the evaluation via the command you provide, python3 scripts/example.py --ckpt={ckpt_path} --device={device} --partition={eval_level} --task={task}. For more detail: in my experiment, I used 200M.ckpt.

Specifically,

  1. I execute the command mentioned above, using 100 instances per task as the test sample.
  2. The success of each episode is obtained via obs, _, done, info = env.step(...).
  3. I compute the success ratio by averaging the results for each of L1–L4.

However, I found that my results are far from those in your paper. The following table shows my experimental results, and the success ratio is much lower than yours.
By the way, the results for L1 and L2 are very similar. Is there any bug in my test procedure?

Task                                                         L1 (succ/fail)   L2 (succ/fail)   L3 (succ/fail)   L4 (succ/fail)
Simple Object Manipulation: visual_manipulation              99 / 1           94 / 6           100 / 0          -
Simple Object Manipulation: scene_understanding              100 / 0          98 / 2           96 / 4           -
Simple Object Manipulation: rotate                           100 / 0          100 / 0          100 / 0          -
Visual Goal Reaching: rearrange                              49 / 51          49 / 51          49 / 51          -
Visual Goal Reaching: rearrange_then_restore                 10 / 90          12 / 88          11 / 89          -
Novel Concept Grounding: novel_adj                           99 / 1           100 / 0          99 / 1           -
Visual Reasoning: novel_noun                                 97 / 3           97 / 3           99 / 1           -
Novel Concept Grounding: novel_adj_and_noun                  -                -                -                98 / 2
Novel Concept Grounding: twist                               1 / 99           4 / 96           0 / 100          -
One-shot Video Imitation: follow_motion                      -                -                -                0 / 100
One-shot Video Imitation: follow_order                       44 / 56          45 / 55          47 / 53          -
Visual Constraint Satisfaction: sweep_without_exceeding      67 / 33          67 / 33          -                -
Visual Constraint Satisfaction: sweep_without_touching       -                -                -                0 / 100
Visual Reasoning: same_texture                               -                -                -                50 / 50
Visual Reasoning: same_shape                                 50 / 50          50 / 50          50 / 50          -
Visual Reasoning: manipulate_old_neighbor                    47 / 53          47 / 53          37 / 63          -
Visual Reasoning: pick_in_order_then_restore                 11 / 89          10 / 90          13 / 87          -
num                                                          774 / 526        773 / 527        701 / 499        148 / 252
success ratio                                                59.54            59.46            58.4             0.37
  • Empty cells denote partitions that example.py does not support for that task.

At the same time, I cannot find where Mask R-CNN is used. The bounding boxes are not predicted by any model but are given by the env (if I'm not missing anything). Could you provide more details about this?

ModuleNotFoundError: No module named 'vima_bench'

Hi, thanks for the stunning work! I am trying to run the demo, however I get the error "ModuleNotFoundError: No module named 'vima_bench'". I'm guessing there may be some files you forgot to upload, or could you tell me how to fix it? Thanks in advance!

All tensors not on the same device

Hi there,
I am trying to run the provided example with the following command: python3 scripts/example.py --ckpt=checkpoints/200M.ckpt --device=cuda:0 --partition=placement_generalization --task=visual_manipulation
I get an error that not all tensors are on the same device.

Traceback (most recent call last):
  File "/home/oier/code/VIMA/scripts/example.py", line 506, in <module>
    main(arg)
  File "/home/oier/miniconda3/envs/vima/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/oier/code/VIMA/scripts/example.py", line 118, in main
    prompt_tokens, prompt_masks = policy.forward_prompt_assembly(
  File "/home/oier/code/VIMA/vima/policy/policy.py", line 163, in forward_prompt_assembly
    batch_word_emb = self.prompt_embedding(word_batch)
  File "/home/oier/miniconda3/envs/vima/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/oier/code/VIMA/vima/nn/prompt_encoder/word_embd.py", line 22, in forward
    x = self._embed_layer(x)
  File "/home/oier/miniconda3/envs/vima/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/oier/miniconda3/envs/vima/lib/python3.9/site-packages/torch/nn/modules/sparse.py", line 158, in forward
    return F.embedding(
  File "/home/oier/miniconda3/envs/vima/lib/python3.9/site-packages/torch/nn/functional.py", line 2199, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper__index_select)
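One possible workaround, sketched below under the assumption that the word batch assembled in scripts/example.py is still on the CPU while the policy weights live on cuda:0, is to move the prompt inputs onto cfg.device right before forward_prompt_assembly. Variable names follow the traceback above; the exact insertion point and the image batch's .to() support are assumptions.

# Patch sketch for scripts/example.py, just before the failing call:
word_batch = word_batch.to(cfg.device)
if hasattr(image_batch, "to"):        # move the image batch too, if it supports .to()
    image_batch = image_batch.to(cfg.device)

prompt_tokens, prompt_masks = policy.forward_prompt_assembly(
    (prompt_token_type, word_batch, image_batch)
)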

Model "t5-base" on the Hub doesn't have a tokenizer

Hi, thanks for sharing this work.
I followed the instructions and built VIMA and VIMA-Bench successfully, but when I ran a command like this:

python3 scripts/example.py --ckpt=2M.ckpt --device=cuda --partition=novel_object_generalization --task=pick_in_order_then_restore

I got the following errors:

pybullet build time: May 20 2022 19:45:31
[INFO] 17 tasks loaded
[2023-07-26T12:45:49Z ERROR cached_path::cache] ETAG fetch for https://huggingface.co/t5-base/resolve/main/tokenizer.json failed with fatal error
Traceback (most recent call last):
File "/home/lq/ws_vima/VIMA/scripts/example.py", line 74, in
tokenizer = Tokenizer.from_pretrained("t5-base")
Exception: Model "t5-base" on the Hub doesn't have a tokenizer

any ideas? thanks in advance.

Exception: Model "t5-base" on the Hub doesn't have a tokenizer

I tried to run the VIMA demo, but the following issue appears:

python scripts/example.py --ckpt=2M.ckpt --partition=placement_generalization --task=follow_order

pybullet build time: May 20 2022 19:45:31
[INFO] 17 tasks loaded
[2023-10-24T11:20:29Z ERROR cached_path::cache] Max retries exceeded for https://huggingface.co/t5-base/resolve/main/tokenizer.json
Traceback (most recent call last):
  File "/home/VIMA/scripts/example.py", line 74, in <module>
    tokenizer = Tokenizer.from_pretrained("t5-base")
Exception: Model "t5-base" on the Hub doesn't have a tokenizer
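Both reports above are network failures while fetching t5-base's tokenizer.json from the Hub. A possible workaround (not an official fix) is to download that file once on a machine with access, e.g. from https://huggingface.co/t5-base/resolve/main/tokenizer.json, and then load the local copy with the tokenizers library instead of calling from_pretrained:

from tokenizers import Tokenizer

# Load a locally downloaded copy of t5-base's tokenizer.json (placeholder path).
tokenizer = Tokenizer.from_file("/path/to/t5-base/tokenizer.json")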

Action dim not match

In the training dataset you provide, the action dims are:

--------------------------------------------------
Action
pose0_position : (2, 3)
pose0_rotation : (2, 4)
pose1_position : (2, 3)
pose1_rotation : (2, 4)
--------------------------------------------------

But in the trajectories I generate with the oracle, the action dims are:

--------------------------------------------------
Action
pose0_position : (2, 2)
pose0_rotation : (2, 4)
pose1_position : (2, 2)
pose1_rotation : (2, 4)
--------------------------------------------------

and so are the dims of the environment's action space.

In VIMA-Bench, the action dims of pose0_position and pose1_position are 2, but in your training dataset the corresponding dims are 3.
So the actions in the training dataset cannot be used directly in the VIMA-Bench environment and do not match the corresponding dims in the models you provide.

However, I notice that when action['pose0_position'] and action['pose1_position'] have 3 dims, the action can still be used in the environment, so I'd like to ask whether this means the third dim in the dataset actions is unused and I can just ignore it. If not, how should I handle this so that the dataset can be utilized?

Thanks!
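For what it's worth, a minimal sketch of the workaround proposed above: drop the third position component before stepping the environment. Whether the z component can really be ignored is exactly the open question here, so treat this as an assumption rather than a confirmed answer; traj_action is a placeholder name for one action read from the dataset.

# Slice dataset actions down to the 2-D positions VIMA-Bench expects (unconfirmed workaround).
action = {
    "pose0_position": traj_action["pose0_position"][..., :2],
    "pose0_rotation": traj_action["pose0_rotation"],
    "pose1_position": traj_action["pose1_position"][..., :2],
    "pose1_rotation": traj_action["pose1_rotation"],
}
obs, reward, done, info = env.step(action)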

Any suggestions to reproduce your results?

Hi,

Thanks for your sharing!

It seems that only an inference demo script is released, with not much information on the training procedure. It would be great if you could offer some insights about it, such as hardware (e.g., 8× A100) and time cost.

Bests,

[Evaluation] How to get the result reported in the paper

Hello, I am interested in VIMA-Bench, and I am curious how to get the results reported in your paper. You said you test each task with 100 instances in issue #16. So,

  1. Does each task have four partitions?
  2. Is the overall success ratio of L1/L2/L3/L4 computed by averaging the results of all 17 tasks?

Question: how to specify or restore the state of the environment?

Hi

Setting random seeds doesn't seem to work:

seed = 42
np.random.seed(seed)
env_1 = TimeLimitWrapper(
    ResetFaultToleranceWrapper(
        make(
            cfg.task,
            modalities=["segm", "rgb"],
            task_kwargs=PARTITION_TO_SPECS["test"][cfg.partition][cfg.task],
            seed=seed,
            # render_prompt=True,
            display_debug_window=True,
            hide_arm_rgb=True,
            # record_gui=True,
        )
    ),
    bonus_steps=2,
)
.....
Can I use env.state to save the current env state and load it next time?

problem about "post_init"

Hi,

I tried to follow example.py but ran into a problem I can't solve. Could you tell me what to do?

pybullet build time: May 20 2022 19:45:31
[INFO] 17 tasks loaded
Traceback (most recent call last):
  File "/home/vipuser/VIMA/VIMA-main/scripts/example.py", line 510, in <module>
    main(arg)
  File "/home/vipuser/anaconda3/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/vipuser/VIMA/VIMA-main/scripts/example.py", line 88, in main
    policy = create_policy_from_ckpt(cfg.ckpt, cfg.device).to(cfg.device)
  File "/home/vipuser/anaconda3/lib/python3.9/site-packages/vima/__init__.py", line 10, in create_policy_from_ckpt
    policy = Policy(**ckpt["cfg"])
  File "/home/vipuser/anaconda3/lib/python3.9/site-packages/vima/policy/policy.py", line 23, in __init__
    self.xattn_gpt = vnn.XAttnGPT(
  File "/home/vipuser/anaconda3/lib/python3.9/site-packages/vima/nn/seq_modeling/xattn_gpt/xattn_gpt.py", line 69, in __init__
    self.post_init()
  File "/home/vipuser/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1185, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'XAttnGPT' object has no attribute 'post_init'

ImportError: cannot import name 'checkpoint' from 'transformers.models.t5.modeling_t5'

Thanks for your great work. I successfully installed vima and vima_bench. However, when I try to run the example via '''python3 scripts/example.py --ckpt=../VimaBench/ckpts/200M.ckpt --partition=placement_generalization --task=follow_order''', I get a failure to import checkpoint when modeling_t5 is imported in vima/nn/prompt_encoder/prompt_encoder.py. Did I miss some requirements, and how can I fix this?

Problems about using --device=cuda

Hi,
I am now trying to run the demo on GPU (it all works when using --device=cpu). Below is the error info from my terminal; could you please take a look? My setup: RTX 3060, Ubuntu 20.04, Python 3.9, PyTorch 1.12.1, CUDA 11.4. I have also tried to alter the code around the cuda/cpu conflict, with no results so far.

error messages:

python scripts/example.py --ckpt=../c --device=cuda
pybullet build time: May 20 2022 19:45:31
[INFO] 17 tasks loaded
/home/murphy/anaconda3/envs/vima/lib/python3.9/site-packages/gym/spaces/box.py:73: UserWarning: WARN: Box bound precision lowered by casting to float32
logger.warn(
startThreads creating 1 threads.
starting thread 0
started thread 0
argc=2
argv[0] = --unused
argv[1] = --start_demo_name=Physics Server
ExampleBrowserThreadFunc started
X11 functions dynamically loaded using dlopen/dlsym OK!
X11 functions dynamically loaded using dlopen/dlsym OK!
Creating context
Created GL 3.3 context
Direct GLX rendering context obtained
Making context current
GL_VENDOR=NVIDIA Corporation
GL_RENDERER=NVIDIA GeForce RTX 3060 Laptop GPU/PCIe/SSE2
GL_VERSION=3.3.0 NVIDIA 470.199.02
GL_SHADING_LANGUAGE_VERSION=3.30 NVIDIA via Cg compiler
pthread_getconcurrency()=0
Version = 3.3.0 NVIDIA 470.199.02
Vendor = NVIDIA Corporation
Renderer = NVIDIA GeForce RTX 3060 Laptop GPU/PCIe/SSE2
b3Printf: Selected demo: Physics Server
startThreads creating 1 threads.
starting thread 0
started thread 0
MotionThreadFunc thread started
text argument:/home/murphy/workspace/VIMABench/vima_bench/tasks/assets
int args: [ven = NVIDIA Corporation
ven = NVIDIA Corporation
Traceback (most recent call last):
File "/home/murphy/workspace/VIMA/scripts/example.py", line 506, in
main(arg)
File "/home/murphy/anaconda3/envs/vima/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/murphy/workspace/VIMA/scripts/example.py", line 118, in main
prompt_tokens, prompt_masks = policy.forward_prompt_assembly(
File "/home/murphy/workspace/VIMA/vima/policy/vima_policy.py", line 163, in forward_prompt_assembly
batch_word_emb = self.prompt_embedding(word_batch)
File "/home/murphy/anaconda3/envs/vima/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/murphy/workspace/VIMA/vima/nn/prompt_encoder/word_embd.py", line 22, in forward
x = self._embed_layer(x)
File "/home/murphy/anaconda3/envs/vima/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/murphy/anaconda3/envs/vima/lib/python3.9/site-packages/torch/nn/modules/sparse.py", line 158, in forward
return F.embedding(
File "/home/murphy/anaconda3/envs/vima/lib/python3.9/site-packages/torch/nn/functional.py", line 2199, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper__index_select)
numActiveThreads = 0
stopping threads
Thread with taskId 0 exiting
Thread TERMINATED
destroy semaphore
semaphore destroyed
destroy main semaphore
main semaphore destroyed
finished
numActiveThreads = 0
btShutDownExampleBrowser stopping threads
Thread with taskId 0 exiting
Thread TERMINATED
destroy semaphore
semaphore destroyed
destroy main semaphore
main semaphore destroyed

Request for Training Code and Clarification on Parallelization & Batch Size

Hello, I have a couple of questions about the project:

Training Code: Is it possible for the training code to be released? It would greatly help in understanding the implementation details and for reproducing the results.

Parallelization & Batch Size: When training, does parallelizing episodes across multiple GPUs equate to setting the batch size? I would appreciate some clarification on how parallelization and batch size are related in the context of this project.

Thank you for your time and consideration. Looking forward to your response.

Question of the object encoding and training details

Hi, I have some questions regarding the code

  1. Is there a specific reason to fuse the end-effector features with the object features (this line)? Is there any intuition behind this fusion?
  2. Can I know how many training epochs / how much training time were used? I see you used 8 V100 GPUs.

Thank you so much in advance.
(2023.09.21) I resolved some questions and updated ones. (You can ignore the mail that I send to you directly)

Questions about the training and evaluation pipelines

Hi Yunfan,

Thank you so much for the great work! Since I'm trying to reproduce the results, I would like to ask some questions regarding the training and evaluation details.

  1. Can you provide the number of training epochs? (#9)
  2. Let's look at Table 7. Denote the number of gradient steps as $N_{gs}$. Since you are using learning-rate warm-up and cosine annealing, I assume the learning rate first increases linearly from 0 to 1e-4 when $N_{gs} \in [0, 7K]$. When $N_{gs} \in [7K + (2i * 17K), 7K + ((2i + 1) * 17K)]$, the learning rate decreases from 1e-4 to 0, and when $N_{gs} \in [7K + ((2i + 1) * 17K), 7K + ((2i + 2) * 17K)]$, it increases from 0 to 1e-4. Am I right?
  3. I notice that you fine-tune the last two layers and freeze all other layers of T5. Does it correspond to the following codes?
        for n, p in self.policy.t5_prompt_encoder.named_parameters():
            p.requires_grad = False
            if "t5.encoder.block.11.layer.1." in n or "final_layer_norm" in n:
                p.requires_grad = True
  4. When calculating the success rate (SR) for each task distribution and level, how many task instances did you sample? I assume the equation you used is $$SR = \frac{\text{number of successes}}{\text{number of total task instances}}$$
  5. Can you share your vectorized implementation of the policy evaluation?
  6. When evaluating the performance of your method and the other baselines, how did you set the parameter hide_arm_rgb when making the env? Should we always set it to True?

Thanks and regards,
Jiachen

How do you tokenize the pose actions?

I can see that you have the discretize_actions method in every policy class, but you haven't actually used it anywhere, nor is it used in the example.py snippet.

Can you please share a very specific code snippet that shows how you convert a single example from the raw dataset into an instance that is provided to the model during training? If you cannot use functions from the internal codebase, could you use pseudocode with strong typing so that we can see what each function/aspect is doing?
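In the meantime, a hedged sketch of per-dimension discretization, assuming the scheme implied by discretize_actions and the _n_discrete_*_bins attributes mentioned elsewhere in this tracker: clip each continuous coordinate to the workspace bounds and map it to an integer bin index. The bounds and bin counts below are illustrative placeholders, not the benchmark's actual values.

import torch

def discretize(value, low, high, n_bins):
    """Map a continuous value in [low, high] to an integer bin index in [0, n_bins - 1]."""
    value = torch.clamp(value, low, high)
    normalized = (value - low) / (high - low)
    return torch.clamp((normalized * n_bins).long(), max=n_bins - 1)

# Example: an x position in metres mapped to one of 50 bins (numbers are illustrative only).
x_bin = discretize(torch.tensor([0.32]), low=0.25, high=0.75, n_bins=50)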

Question about Dataset creation

Thank you so much for such an amazing job.

Could you explain how you collected the dataset? I read on Hugging Face that "All demonstrations are generated by scripted oracles". But I don't really understand this process.

In addition to that I would like to ask, did every trajectory in the dataset successfully complete the task? I mean, they all end up with a reward of 1 instead of 0, right?

Script/command for training?

Hi there,

I've managed to successfully run the demo, and am interested in learning more about the training. Is there any scripts/commands etc. that I can use to run the training?

Many thanks for any help, and for this amazing work! :)

The choice of the action decoder

Hi, I noticed that you use torch.distributions.Distribution after the MLP to get the final output. Could you share some insights about this choice? What's the advantage compared with directly using an MLP and softmax?

Also, for the training procedure, should we ignore that head and directly apply an NLL loss to the MLP outputs, or should we apply the NLL to the probabilities under that distribution? If so, could you give some simple code snippets to demonstrate the training usage?
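As a hedged illustration rather than the authors' training code: wrapping the MLP logits in a torch.distributions.Categorical mainly provides a clean log_prob / sampling interface, and its negative log-likelihood coincides with cross-entropy computed directly on the same logits.

import torch
import torch.nn.functional as F
from torch.distributions import Categorical

logits = torch.randn(8, 50)           # (batch, n_bins) produced by the action-decoder MLP
target = torch.randint(0, 50, (8,))   # discretized ground-truth bins

dist = Categorical(logits=logits)
nll = -dist.log_prob(target).mean()   # NLL through the distribution head
ce = F.cross_entropy(logits, target)  # cross-entropy directly on the logits
assert torch.allclose(nll, ce)        # identical up to numerical precision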

BTW, congrats on the acceptance of ICML, well done!

Bests,

the physical meaning of the parameter 'actions'

In VIMA, the robot's final executed position is fed into env.step(actions) through the actions variable. The robot then calculates positions like prepick, postpick, preplace, and postplace. It appears that pose0_position and pose1_position in the actions dictionary represent the positions to which the robot will move in Cartesian coordinates, while pose0_rotation and pose1_rotation seem to be quaternions.

My question is how these variables are transformed into their final form. I noticed in the code above that you've defined some variables to scale these positional variables, such as _n_discrete_x_bins, _n_discrete_y_bins, etc.

Suppose I try the VIMA algorithm on a physical robot using my camera. How would my actions be transformed from the form below to the final grasping position? Also, the values below don't appear to be in pixel coordinates.

            """
            actions:
              'pose0_position': Tensor:(1,1,2) tensor([[[16, 35]]])
              'pose0_rotation': Tensor:(1,1,4) tensor([[[25, 25, 25, 49]]])
              'pose1_position': Tensor:(1,1,2) tensor([[[13, 85]]])
              'pose1_rotation': Tensor:(1,1,4) tensor([[[25, 25, 49, 19]]])
            """
            actions = {k: v.mode() for k, v in dist_dict.items()}
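A hedged sketch of the inverse mapping, de-discretization: each integer bin index is mapped back to a continuous coordinate inside the workspace bounds, which is what ends up in Cartesian space for env.step. The bounds and bin counts below are placeholders rather than the benchmark's actual values; on a physical robot they would be your own workspace limits.

def undiscretize(bin_idx, low, high, n_bins):
    """Map a bin index in [0, n_bins - 1] back to the centre of its interval in [low, high]."""
    return low + (bin_idx + 0.5) / n_bins * (high - low)

# e.g. pose0_position = [16, 35] with illustrative (x, y) bin counts of (50, 100):
x = undiscretize(16, low=0.25, high=0.75, n_bins=50)   # metres along x (placeholder bounds)
y = undiscretize(35, low=-0.5, high=0.5, n_bins=100)   # metres along y (placeholder bounds)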

Request for Guidance on Implementing imitation_loss

Thank you for the snippet you provided #28 ; it has been immensely helpful. I am truly grateful for your remarkable contributions and research.

Using your snippet as a reference, I have crafted my training code. However, it seems that the imitation_loss you previously mentioned hasn't been implemented.

Could you provide guidance on implementing the imitation_loss or suggest another way to compute it? Additionally, if you notice any ambiguities or potential issues in my training code, I would greatly appreciate your insights.

Here's the error I encountered:

Exception has occurred: AttributeError       (note: full exception trace is shown but execution is paused at: _run_module_as_main)
'dict' object has no attribute 'imitation_loss'
  File "/home/initial/workspace/VIMA2/VIMA/scripts/train.py", line 610, in train
    imitation_loss = dist_dict.imitation_loss(actions=tar_action)
  File "/home/initial/workspace/VIMA2/VIMA/scripts/train.py", line 633, in main
    train(policy, dataloader, optimizer, epochs=cfg.epochs)
  File "/home/initial/workspace/VIMA2/VIMA/scripts/train.py", line 642, in <module>
    main(args)
  File "/home/initial/.pyenv/versions/3.9.16/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/initial/.pyenv/versions/3.9.16/lib/python3.9/runpy.py", line 197, in _run_module_as_main (Current frame)
    return _run_code(code, main_globals, None,
AttributeError: 'dict' object has no attribute 'imitation_loss'

And here is the relevant training code snippet:

def iteration(policy, batch):
    obs = batch['obs']
    action = batch['action']
    prompt_token_type = batch['prompt_token_type']
    word_batch = batch['word_batch']
    image_batch = batch['image_batch']

    prompt_tokens, prompt_masks = policy.forward_prompt_assembly(
        (prompt_token_type, word_batch, image_batch)
    )
    obs_tokens, obs_masks = policy.forward_obs_token(obs)

    action = policy.discretize_action(action)
    # cache target action
    tar_action = {k: v.clone() for k, v in action.items()}

    # slice action sequence up to the last one
    action_tokens = policy.forward_action_token(action)

    action_tokens = action_tokens.transpose(1,0)
    obs_tokens = obs_tokens.transpose(1,0)
    obs_masks = obs_masks.transpose(1,0)

    pred_action_tokens = policy.forward(
        obs_token=obs_tokens,
        action_token=action_tokens,
        prompt_token=prompt_tokens,
        prompt_token_mask=prompt_masks,
        obs_mask=obs_masks,
    )# (L, B, E)
    # pred_action_tokens = pred_action_tokens[-2:].contiguous()# (2, B, E)
    dist_dict = policy.forward_action_decoder(pred_action_tokens)
    tar_action = policy._de_discretize_actions(tar_action)
    return dist_dict, tar_action
    
def train(policy, dataloader, optimizer, validation_dataloader=None,epochs=10):
    wandb.init(project="VIMA", name=f"VIMA_{NOW}")
    wandb.watch(policy, log_freq=100)  # log model parameters and gradients

    policy.train()  # set the model to training mode
    best_val_loss = float('inf')
    no_improve_count = 0

    for epoch in range(epochs):
        total_epoch_loss = 0
        for batch in tqdm(dataloader,desc=f"Epoch {epoch + 1}/{epochs}"):
            dist_dict, tar_action = iteration(policy, batch)

            total_loss = 0
            # pred_actions = {k: v.mode().detach().clone().requires_grad_() for k, v in dist_dict.items()}

            total_loss = compute_cross_entropy_loss(dist_dict, tar_action)
            # imitation_loss = dist_dict.imitation_loss(actions=tar_action)

            # imitation_loss.backward()
            optimizer.zero_grad()  # clear gradients accumulated from the previous step
            total_loss.backward()

            optimizer.step()
            total_epoch_loss += total_loss.item()
        
def get_pred(pred_actions, key, time, index):
    return pred_actions[key]._dists[index].probs[time]

def get_true(tar_action, key, time, index):
    return tar_action[key][:, time, index]

def compute_cross_entropy_loss(pred_actions, tar_action):
    criterion = torch.nn.CrossEntropyLoss()  # per-dimension bin classification loss
    keys = ['pose0_position', 'pose1_position', 'pose0_rotation', 'pose1_rotation']
    indices = {
        'pose0_position': [0, 1],
        'pose1_position': [0, 1],
        'pose0_rotation': [0, 1, 2, 3],
        'pose1_rotation': [0, 1, 2, 3]
    }
    times = [0, 1]  # 0 for t2, 1 for t1

    total_loss = 0
    for key in keys:
        for time in times:
            for index in indices[key]:
                pred = get_pred(pred_actions, key, time, index)
                true = get_true(tar_action, key, time, index).long()
                total_loss += criterion(pred, true)

    return total_loss
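Since dist_dict in the snippet above is a plain dict, an imitation loss has to be assembled per entry. Below is a hedged sketch of such a substitute, assuming each value is a multi-categorical distribution exposing one Categorical per action dimension via _dists (as used in get_pred above) and that the targets are the discretized bins cached before _de_discretize_actions; this is an assumption about the interface, not the authors' implementation.

def imitation_loss_from_dict(dist_dict, discrete_target_action):
    """Sum of per-dimension negative log-likelihoods over all action attributes."""
    total = 0.0
    for key, dist in dist_dict.items():
        target = discrete_target_action[key]          # (..., n_dims), integer bin indices
        for dim, cat in enumerate(dist._dists):       # assumed multi-categorical layout
            total = total - cat.log_prob(target[..., dim].long()).mean()
    return total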

Invalid GLX version: major 1, minor 2

I've tried to upgrade the GLX version but have been unsuccessful. How do I go about upgrading to the appropriate version? Is it possible to upgrade within a docker container? I'm using Xming to forward the GLX program from the server to my local computer, is this related to the configuration of my local computer? My Xming is configured successfully. Thanks for the help!
Here is my server configuration info:
server glx version string: 1.2
client glx version string: 1.4
GLX version: 1.2
Max core profile version: 4.5
Max compat profile version: 3.1
Max GLES1 profile version: 1.1
Max GLES[23] profile version: 3.2
OpenGL version string: 3.1 Mesa 21.2.6
OpenGL shading language version string: 1.40

How to construct dataset for training?

Hi, thanks for this amazing work.

Could you provide the interface for loading the training dataset? Is it possible to reveal part of the code for reference? My understanding is that for each task we need to create an env as in scripts/example.py, set the partition to 'placement_generalization' and the task to each of the 13 tasks, and load the corresponding data.

Settings for Mask RCNN version

Thank you for sharing your work! I really enjoy playing with VIMA.

Your code in example.py assumes the ground-truth segmentation masks of objects are given.

Can you share code for the Mask R-CNN version, where the segmentation mask is not given by the simulator?
Or can you tell me what kind of modification is needed for the Mask R-CNN version?

Questions about prompt and Mask-RCNN

Thank you for your previous response. I have a few more questions:

Regarding the prompt: For instance, given the example 'This is a dax {dragged_obj_1}. This is a zup {base_obj}. Put a dax into a zup.', I'm unclear on how the object names are associated with their respective segm or center positions. Could you explain this further?

Regarding Mask-RCNN: In another issue #13, you provided the model's checkpoints and a link to Detectron2. Could you elaborate on how to load the RCNN model using those checkpoints? Once the pretrained model is loaded, can it directly generate segmentation from an input RGB image? Are there any specific parameters that I should be aware of or set when doing so?

Looking forward to your reply.

Originally posted by @oxFFFF-Q in #26 (comment)
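For the Detectron2 part of the question, here is a hedged loading sketch using Detectron2's standard predictor API. The config name, score threshold, and checkpoint path are placeholders; whether the released VIMA detector uses exactly this config, and how many classes its head has, is not confirmed here and must be made to match the checkpoint.

import cv2
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
cfg.merge_from_file(
    model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
)
cfg.MODEL.WEIGHTS = "/path/to/mask_rcnn.pth"        # the released checkpoint (placeholder path)
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5
# cfg.MODEL.ROI_HEADS.NUM_CLASSES must match the checkpoint's head (value not confirmed here).
predictor = DefaultPredictor(cfg)

outputs = predictor(cv2.imread("frame.png"))         # placeholder image path
instances = outputs["instances"]                     # pred_boxes, pred_masks, pred_classes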

Asking for better explanation

Thank you for open-sourcing the excellent VIMA work. I think there is still a lot of room for improvement; otherwise it will be really difficult to extend:

① There is almost no explanatory documentation for VIMA-Bench, which makes it inconvenient for users to work with.

② The workflow is not closed-loop: there is no code for the training stage of the model, which makes it inconvenient to extend the work and impossible to customize prompts to perform tasks, even for specific tasks.

How to eval the baseline?

Thanks for your great work, but I wonder if you could release the baselines or tell me how to reproduce them (e.g., VIMA-Gato, VIMA-GPT, VIMA-Flamingo). Thank you!

How can we calculate the loss for vima?

Hi there,

I am trying to train the VIMA model on a custom dataset to generate custom actions. I am a bit confused, though, about how to determine the loss during training.
My first attempt was to generate action tokens for the actions from my dataset via action_tokens = policy.forward_action_token(actions) and predicted action tokens via predicted_action_tokens = policy.forward(...), and then just feed these into the cross-entropy loss function. But I realized that you should not pass the tokens but rather some logits, and I can't figure out how to access these.
My guess would be to somehow extract them from dist_dict = policy.forward_action_decoder(predicted_action_tokens).

Also, the actions I want to generate are partly categorical and partly multi-categorical, and I took from this issue that for multi-categorical actions you didn't even use cross-entropy loss but rather some type of regression or GMM (Gaussian mixture modeling?). If that's correct, I hope you could give me some insight into what exactly you applied.

Like the person in this issue already said, a pseudo-code of the training algorithm would also be really helpful to reproduce and build upon your results. :)

Best regards,
Mano

About Success Rate calculating details

Hi there,
I have already run the demo and gone through the paper carefully. I'm wondering whether there is an automatic way to calculate the L1–L4 success rates reported in your appendix for the different methods (e.g., Flamingo). Say I run a one-time inference on a specific task (visual_manipulation) with the prompt "put obj1 into obj2": how do we know it succeeded in a quantitative way, and how can this be calculated automatically over a large test set?
Thank you for your attention.

Problem with rendering

Thanks for amazing project.
When I run the demo, the output is a bit strange and the execution speed is abnormal. A video of this is shown at the link below:
link

Am I doing something wrong?

Training hyperparameters besides Table 6

Hi,

Thanks for your sharing!

Could you share more training hyperparameters besides Table 6 in your appendix?

  1. Batch size and epochs (along with the corresponding number of steps)?
  2. Do you use gradient accumulation to increase the effective batch size?
  3. A training question for IL. Take a 2-step sequence as an example: a2 should be predicted with a1 in the history. My question is, during training is a1 the model's own prediction, or is it filled with the ground-truth action? If it is the predicted one, the model is forced to predict the right outputs from a drifted input (the last pose).

Questions about VIMA Data Loading, RCNN Models, and Real-World Applications

Hi,
Firstly, I would like to express my gratitude for open-sourcing the VIMA work. I am very intrigued by it. However, I encountered several issues during my implementation:

  1. Training with PyTorch:
    While attempting to implement the training part of VIMA using PyTorch, I've had issues with the trajectory.pkl file. Specifically, some data types within ['obj_id_to_info'] are either None or functools.partial, which leads to errors when fed into PyTorch data loading. May I know how you handled such data?
    I observed in other issues that you utilize PyTorch Lightning. If I were to use plain PyTorch for the training phase, applying the same methods for inference to compute predicted actions, followed by loss computation and backpropagation, would this approach be feasible?

  2. RCNN:
    Can the RCNN model you used only recognize objects that appear in the simulation or the dataset? For instance, is it possible for VIMA to pick up and place an apple? If not, would I need to retrain the model or switch the object recognition model?

  3. VIMA in the Real World:
    Have you tried deploying VIMA on a physical robotic arm? I'm venturing into this, and any advice or insights you could offer would be greatly appreciated.
    Thank you for your time and looking forward to your response.

Best regards,
Qiao

How to render environment like in paper?

When I run the current evaluation the environment rendering looks a lot more basic than in the paper. How does one render the simulation environment with the proper lighting and textures?

Training Algorithm

Hello.
First of all, thanks for sharing, and congratulations for your work.
I want to use your model in my experiments related to object-centric Imitation Learning policy, however, I have some differences in the robot platform and simulation environment, so I need to train your model from scratch. Therefore, I have some high-level questions about your training loop.
In particular, starting from your example.py script, how you use history information is clear, but how did you build the batch during training? Specifically, how did you fill the slots for past action tokens?
I appreciate your consideration.

Francesco Rosa.

Franka in Isaac-sim

Hi,
Thank you for releasing the model and code, it's great work!
Can the model control a Franka robot in Isaac Sim?
