
openvla's Introduction

OpenVLA: An Open-Source Vision-Language-Action Model

arXiv HF Models PyTorch Python License

Getting Started | Pretrained VLAs | Installation | Fine-Tuning OpenVLA via LoRA | Fully Fine-Tuning OpenVLA | Training VLAs from Scratch | Project Website


Latest Updates


A simple and scalable codebase for training and fine-tuning vision-language-action models (VLAs) for generalist robotic manipulation:

  • Different Dataset Mixtures: We natively support arbitrary datasets in RLDS format, including arbitrary mixtures of data from the Open X-Embodiment Dataset.
  • Easy Scaling: Powered by PyTorch FSDP and Flash-Attention, we can quickly and efficiently train models from 1B - 34B parameters, with easily adaptable model architectures.
  • Native Fine-Tuning Support: Built-in support (with examples) for various forms of fine-tuning (full, partial, LoRA).

Built on top of Prismatic VLMs.

Getting Started

To get started with loading and running OpenVLA models for inference, we provide a lightweight interface that leverages HuggingFace transformers AutoClasses, with minimal dependencies.

For example, to load openvla-7b for zero-shot instruction following in the BridgeData V2 environments with a WidowX robot:

# Install minimal dependencies (`torch`, `transformers`, `timm`, `tokenizers`, ...)
# > pip install -r https://raw.githubusercontent.com/openvla/openvla/main/requirements-min.txt
from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image

import torch

# Load Processor & VLA
processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b", 
    attn_implementation="flash_attention_2",  # [Optional] Requires `flash_attn`
    torch_dtype=torch.bfloat16, 
    low_cpu_mem_usage=True, 
    trust_remote_code=True
).to("cuda:0")

# Grab image input & format prompt
image: Image.Image = get_from_camera(...)
prompt = "In: What action should the robot take to {<INSTRUCTION>}?\nOut:"

# Predict Action (7-DoF; un-normalize for BridgeData V2)
inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)

# Execute...
robot.act(action, ...)

We also provide an example script for fine-tuning OpenVLA models for new tasks and embodiments; this script supports different fine-tuning modes -- including (quantized) low-rank adaptation (LoRA) supported by HuggingFace's PEFT library.

For deployment, we provide a lightweight script for serving OpenVLA models over a REST API, providing an easy way to integrate OpenVLA models into existing robot control stacks, removing any requirement for powerful on-device compute.
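For reference, here is a minimal client-side sketch. It assumes the deployment server is reachable at <SERVER_IP> and exposes an /act endpoint accepting a JSON payload with an image and an instruction; the route, port, and payload keys below are illustrative assumptions, so check vla-scripts/deploy.py for the actual interface.

# Minimal client sketch -- endpoint path, port, and payload keys are assumptions; see vla-scripts/deploy.py
import numpy as np
import requests
from PIL import Image

# Capture an observation on the robot-side machine (no GPU needed here)
image = np.asarray(Image.open("observation.jpg").resize((224, 224)))

# Send the observation and instruction to the remote OpenVLA server
payload = {"image": image.tolist(), "instruction": "put the carrot on the plate"}
response = requests.post("http://<SERVER_IP>:8000/act", json=payload, timeout=30.0)

# The server is assumed to return a 7-DoF action vector
action = np.array(response.json())
print(action)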

Pretrained VLAs

We release two OpenVLA models trained as part of our work, with checkpoints, configs, and model cards available on our HuggingFace page:

  • openvla-7b: The flagship model from our paper, trained from the Prismatic prism-dinosiglip-224px VLM (based on a fused DINOv2 and SigLIP vision backbone, and Llama-2 LLM). Trained on a large mixture of datasets from Open X-Embodiment spanning 970K trajectories (mixture details - see "Open-X Magic Soup++").
  • openvla-v01-7b: An early model used during development, trained from the Prismatic siglip-224px VLM (singular SigLIP vision backbone, and a Vicuña v1.5 LLM). Trained on the same mixture of datasets as Octo, but for significantly fewer GPU hours than our final model (mixture details - see "Open-X Magic Soup").

Explicit Notes on Model Licensing & Commercial Use: While all code in this repository is released under an MIT License, our pretrained models may inherit restrictions from the underlying base models we use. Specifically, both the above models are derived from Llama-2, and as such are subject to the Llama Community License.


Installation

Note: These installation instructions are for full-scale pretraining (and distributed fine-tuning); if you just want to run inference with OpenVLA models (or perform lightweight fine-tuning), see the instructions above!

This repository was built using Python 3.10, but should be backwards compatible with any Python >= 3.8. We require PyTorch 2.2.* -- installation instructions can be found here. The latest version of this repository was developed and thoroughly tested with:

  • PyTorch 2.2.0, torchvision 0.17.0, transformers 4.40.1, tokenizers 0.19.1, timm 0.9.10, and flash-attn 2.5.5

[5/21/24] Note: Following reported regressions and breaking changes in later versions of transformers, timm, and tokenizers, we explicitly pin the above versions of these dependencies. We are working on implementing thorough tests and plan to relax these constraints as soon as we can.

Once PyTorch has been properly installed, you can install this package locally via an editable installation (or via pip install git+https://github.com/openvla/openvla):

cd openvla
pip install -e .

# Training additionally requires Flash-Attention 2 (https://github.com/Dao-AILab/flash-attention)
pip install packaging ninja

# Verify Ninja --> should return exit code "0"
ninja --version; echo $?

# Install Flash Attention 2
#   =>> If you run into difficulty, try `pip cache remove flash_attn` first
pip install "flash-attn==2.5.5" --no-build-isolation
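
# [Optional] Sanity-check that the compiled extension imports cleanly (should print 2.5.5)
python -c "import flash_attn; print(flash_attn.__version__)"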

If you run into any problems during the installation process, please file a GitHub Issue.

Note: See vla-scripts/ for full training and verification scripts for OpenVLA models. Note that scripts/ is mostly a holdover from the original (base) prismatic-vlms repository, with support for training and evaluating visually-conditioned language models; while you can use this repo to train both VLMs and VLAs, trying to generate language (via scripts/generate.py) with existing OpenVLA models will not work, as we train current OpenVLA models to generate actions, and actions alone.

Fine-Tuning OpenVLA via LoRA

In this section, we discuss fine-tuning OpenVLA using Low-Rank Adaptation (LoRA) via the Hugging Face transformers library, which is recommended if you do not have sufficient compute to fully fine-tune a 7B-parameter model. The main script for LoRA fine-tuning is vla-scripts/finetune.py. (If you instead wish to do full fine-tuning, please see the Fully Fine-Tuning OpenVLA section.)

Below we show an example of how you can fine-tune the main OpenVLA checkpoint (openvla-7b) via LoRA. Here we fine-tune on BridgeData V2 using a single A100 GPU with 80 GB VRAM. (You can also fine-tune with a smaller GPU, as long as it has at least ~27 GB of memory, by modifying the batch size.)

First, download the BridgeData V2 dataset:

# Change directory to your base datasets folder
cd <PATH TO BASE DATASETS DIR>

# Download the full dataset (124 GB)
wget -r -nH --cut-dirs=4 --reject="index.html*" https://rail.eecs.berkeley.edu/datasets/bridge_release/data/tfds/bridge_dataset/

# Rename the dataset to `bridge_orig` (NOTE: Omitting this step may lead to runtime errors later)
mv bridge_dataset bridge_orig

Now, launch the LoRA fine-tuning script, as shown below. Note that --batch_size==16 with --grad_accumulation_steps==1 requires ~72 GB GPU memory. If you have a smaller GPU, you should reduce --batch_size and increase --grad_accumulation_steps to maintain an effective batch size that is large enough for stable training. If you have multiple GPUs and wish to train via PyTorch Distributed Data Parallel (DDP), simply set --nproc-per-node in the torchrun command below to the number of available GPUs.

torchrun --standalone --nnodes 1 --nproc-per-node 1 vla-scripts/finetune.py \
  --vla_path "openvla/openvla-7b" \
  --data_root_dir <PATH TO BASE DATASETS DIR> \
  --dataset_name bridge_orig \
  --run_root_dir <PATH TO LOG/CHECKPOINT DIR> \
  --adapter_tmp_dir <PATH TO TEMPORARY DIR TO SAVE ADAPTER WEIGHTS> \
  --lora_rank 32 \
  --batch_size 16 \
  --grad_accumulation_steps 1 \
  --learning_rate 5e-4 \
  --image_aug <True or False> \
  --wandb_project <PROJECT> \
  --wandb_entity <ENTITY> \
  --save_steps <NUMBER OF GRADIENT STEPS PER CHECKPOINT SAVE>

Note: If you set --image_aug==False in the command above, you will observe nearly 100% action_accuracy in the training logs, since the openvla-7b model is already pretrained (without augmentations) on a superset of datasets that includes BridgeData V2.

To LoRA fine-tune on a different dataset, you can download the dataset from the Open X-Embodiment (OXE) mixture (see this custom script for an example of how to download datasets from OXE). Alternatively, if you have a custom dataset that is not part of OXE, you can either (a) convert the dataset to the RLDS format which is compatible with our fine-tuning script (see this repo for instructions on this), or (b) use your own custom PyTorch Dataset wrapper (see comments in vla-scripts/finetune.py for instructions). We recommend option (a) for most users; the RLDS dataset and dataloader are tested more extensively since we used these for all of our pretraining and fine-tuning experiments.

For option (a), after you have converted your dataset to RLDS, you need to register it with our data loader by registering a dataset config here and a dataset transform function here.
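As a rough illustration, the registration might look like the sketch below. The function body and config contents are placeholders (mirror an existing entry such as bridge_orig in those files for the exact schema); the OXE_STANDARDIZATION_TRANSFORMS and OXE_DATASET_CONFIGS registries are the ones referenced in the Fully Fine-Tuning section below.

# prismatic/vla/datasets/rlds/oxe/transforms.py -- illustrative sketch only; mirror an existing transform
from typing import Any, Dict

def my_dataset_transform(trajectory: Dict[str, Any]) -> Dict[str, Any]:
    # Map your raw RLDS keys onto the standardized keys the data loader expects
    # (e.g., a 7-DoF "action" and the language instruction); see existing transforms for the details.
    return trajectory

# Register the transform at the bottom of the file:
#   OXE_STANDARDIZATION_TRANSFORMS["my_dataset"] = my_dataset_transform

# prismatic/vla/datasets/rlds/oxe/configs.py -- add an entry describing observation/action spaces:
#   OXE_DATASET_CONFIGS["my_dataset"] = {...}  # copy the structure of an existing entry (e.g., "bridge_orig")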

Once you have integrated your new dataset, you can launch LoRA fine-tuning with the same vla-scripts/finetune.py script above. If you run into any issues, please visit the VLA Troubleshooting section or search for a similar issue in the OpenVLA GitHub Issues page (including "Closed" issues). If you cannot find a similar issue there, feel free to create a new issue.

Fully Fine-Tuning OpenVLA

In this section, we discuss fully fine-tuning OpenVLA (all 7.5 billion parameters) via native PyTorch Fully Sharded Data Parallel (FSDP) using the Prismatic VLMs training script. Full fine-tuning is more advanced/involved and is only recommended if you have sufficient compute (e.g., a full node of 8 A100 GPUs) and if LoRA fine-tuning is insufficient for your use case (e.g., if the fine-tuning distribution varies drastically from the pretraining distribution). Otherwise, we recommend that you try parameter-efficient fine-tuning via LoRA, which is described in the Fine-Tuning OpenVLA via LoRA section.

For full fine-tuning, you will need to download a different version of the OpenVLA model checkpoint that is compatible with the Prismatic VLMs codebase, which we built on top of to develop the OpenVLA model. You can download this Prismatic-compatible OpenVLA checkpoint using the git commands below (alternatively, you can download via the Hugging Face CLI):

# Change directory to your base model checkpoints folder
cd <PATH TO BASE MODEL CHECKPOINTS DIR>

# Download checkpoint (30 GB) -- may take a few minutes
git clone git@hf.co:openvla/openvla-7b-prismatic

# If the command above did not download the full checkpoint,
# manually fetch it via git Large File Storage (LFS)
# Note: You may have to configure an SSH key for this to work
cd openvla-7b-prismatic
git lfs fetch --all

We show how you can fully fine-tune OpenVLA on BridgeData V2 using a single node with 8 GPUs. If you wish to use a different number of GPUs (or nodes), you can modify the VLA training configuration in prismatic/conf/vla.py.

Download the BridgeData V2 dataset:

# Change directory to your base datasets folder
cd <PATH TO BASE DATASETS DIR>

# Download the full dataset (124 GB)
wget -r -nH --cut-dirs=4 --reject="index.html*" https://rail.eecs.berkeley.edu/datasets/bridge_release/data/tfds/bridge_dataset/

# Rename the dataset to `bridge_orig` (NOTE: Omitting this step may lead to runtime errors later)
mv bridge_dataset bridge_orig

Next, create a Hugging Face user access token and copy the token value (a string that starts with hf_...) into a file named .hf_token at the root directory of this repo (openvla/.hf_token).

# Go to openvla root directory
cd openvla

# Copy HF token value into token file. Replace "hf_..." with your own token value!
# See: https://huggingface.co/docs/hub/en/security-tokens
echo hf_... >> .hf_token

Now, launch the training script. If you wish to use a different number of nodes or GPUs, modify the VLA training configuration in prismatic/conf/vla.py and then change the --nnodes and --nproc-per-node arguments below accordingly.

torchrun --standalone --nnodes 1 --nproc-per-node 8 vla-scripts/train.py \
  --pretrained_checkpoint <PATH TO openvla/openvla-7b-prismatic CHECKPOINT FILE: step-295000-epoch-40-loss=0.2200.pt> \
  --vla.type prism-dinosiglip-224px+mx-bridge \
  --data_root_dir <PATH TO BASE DATASETS DIR> \
  --run_root_dir <PATH TO LOG/CHECKPOINT DIR> \
  --run_id <OPTIONAL RUN ID FOR WANDB LOGGING> \
  --image_aug <True or False> \
  --wandb_project <PROJECT> \
  --wandb_entity <ENTITY> \
  --save_interval <NUMBER OF GRADIENT STEPS PER CHECKPOINT SAVE> \
  --is_resume False

Note that the --is_resume argument is set to False above since we are fine-tuning a pretrained checkpoint rather than resuming a paused training run.

If your training run gets paused and you wish to resume from the latest checkpoint, change --pretrained_checkpoint to the latest checkpoint path, and then set --is_resume==True and specify --resume_step and --resume_epoch as the step and epoch number, respectively. For example, if you wish to resume training from a checkpoint named step-010000-epoch-20-loss=0.0160.pt, you would set is_resume==True, resume_step==10000, and resume_epoch==20.
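For example, resuming from step-010000-epoch-20-loss=0.0160.pt might look like the following (the other arguments stay the same as in the command above):

torchrun --standalone --nnodes 1 --nproc-per-node 8 vla-scripts/train.py \
  --pretrained_checkpoint <PATH TO RUN DIR>/checkpoints/step-010000-epoch-20-loss=0.0160.pt \
  --vla.type prism-dinosiglip-224px+mx-bridge \
  --data_root_dir <PATH TO BASE DATASETS DIR> \
  --run_root_dir <PATH TO LOG/CHECKPOINT DIR> \
  --is_resume True \
  --resume_step 10000 \
  --resume_epoch 20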

Note: If you run the BridgeData V2 fine-tuning command above, you should observe nearly 100% Action Token Accuracy in the training logs, since the openvla-7b model is already pretrained on a superset of datasets that includes BridgeData V2.

To fully fine-tune OpenVLA on a different dataset, you can download the dataset from the Open X-Embodiment (OXE) mixture (see this custom script for an example of how to download datasets from OXE). Alternatively, if you have a custom dataset that is not part of OXE, you can convert the dataset to the RLDS format, which is compatible with our fine-tuning script (see this repo for instructions on this). After downloading/converting the dataset, you will need to modify the following files:

  • prismatic/conf/vla.py: Add a new training configuration by creating an experiment class, and then register it in the VLARegistry at the bottom of the file.
    • Make sure to create a new unique vla_id for your fine-tuning run, and adjust some configuration variables as needed – e.g., expected_world_size (number of GPUs), per_device_batch_size (batch size per GPU), global_batch_size (total batch size), shuffle_buffer_size (number of samples in shuffle buffer per GPU), etc. See comments under the VLAConfig class at the top of the file to understand the purpose of each variable.
  • prismatic/vla/datasets/rlds/oxe/mixtures.py: Define a new mixture for your fine-tuning mixture in the OXE_NAMED_MIXTURES dictionary.
  • prismatic/vla/datasets/rlds/oxe/transforms.py: Define a new dataset transform function for your fine-tuning dataset, and add it to the OXE_STANDARDIZATION_TRANSFORMS registry at the bottom of the file.
  • prismatic/vla/datasets/rlds/oxe/configs.py: Add a new configuration specifying your fine-tuning dataset's observation and action spaces to the OXE_DATASET_CONFIGS dictionary.

After completing the steps above, you can start full fine-tuning using the vla-scripts/train.py script. Make sure to set the --vla.type argument to the new vla_id that you added in prismatic/conf/vla.py.
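For instance, the mixture registration might look like the sketch below; the dataset name, weight, and vla_id are hypothetical, and the exact tuple format should be checked against the existing entries in mixtures.py.

# prismatic/vla/datasets/rlds/oxe/mixtures.py -- hypothetical entry; follow the format of existing entries
OXE_NAMED_MIXTURES["my_finetune_mixture"] = [
    ("my_dataset", 1.0),  # (dataset_name, sampling_weight)
]

The corresponding experiment class in prismatic/conf/vla.py would then reference this mixture and define the new vla_id that you pass via --vla.type.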

When you are finished with fine-tuning, you will need to convert the final model checkpoint to a version that is compatible with the Hugging Face transformers library. See the Converting Prismatic Models to Hugging Face section for instructions.

If you run into any issues, please visit the VLA Troubleshooting section or search for a similar issue in the OpenVLA GitHub Issues page (including "Closed" issues). If you cannot find a similar issue there, feel free to create a new issue.

Converting Prismatic Models to Hugging Face

If you have used the Prismatic VLMs codebase to train your model (e.g., if you did full fine-tuning of OpenVLA on a new dataset), you will need to convert the final checkpoint to a version that is compatible with Hugging Face transformers AutoClasses. We discuss how to do so in this section.

Let's say your training run directory is PRISMATIC_RUN_DIR (e.g., prism-dinosiglip-224px+mx-oxe-magic-soup-plus+n8+b32+x7). Inside this directory, there should be a directory called checkpoints which contains saved model checkpoints (e.g., step-295000-epoch-40-loss=0.2200.pt). The Prismatic-to-Hugging-Face conversion script (convert_openvla_weights_to_hf.py) expects a checkpoint file named latest-checkpoint.pt. Therefore, you should first create a symbolic link called latest-checkpoint.pt that points to the checkpoint file that you wish to convert:

# Go to your Prismatic training run's `checkpoints` directory
cd PRISMATIC_RUN_DIR/checkpoints

# Create symbolic link pointing to your checkpoint file
ln -s <YOUR CHECKPOINT FILENAME> latest-checkpoint.pt

Then, launch the conversion script to convert the checkpoint from the Prismatic VLMs format to the Hugging Face format:

python vla-scripts/extern/convert_openvla_weights_to_hf.py \
    --openvla_model_path_or_id <PRISMATIC_RUN_DIR> \
    --output_hf_model_local_path <OUTPUT DIR FOR CONVERTED CHECKPOINT>

The command above will save the HF-compatible checkpoint in output_hf_model_local_path. Now you can load the checkpoint with HF AutoClasses as normal, as shown below. Note that there is an additional necessary step to register the OpenVLA model to HF AutoClasses before loading it because you are loading a locally saved checkpoint rather than one that is pushed to the HF Hub (see here for details).

import torch
from transformers import AutoConfig, AutoImageProcessor, AutoModelForVision2Seq, AutoProcessor

from prismatic.extern.hf.configuration_prismatic import OpenVLAConfig
from prismatic.extern.hf.modeling_prismatic import OpenVLAForActionPrediction
from prismatic.extern.hf.processing_prismatic import PrismaticImageProcessor, PrismaticProcessor

# Register OpenVLA model to HF AutoClasses (not needed if you pushed model to HF Hub)
AutoConfig.register("openvla", OpenVLAConfig)
AutoImageProcessor.register(OpenVLAConfig, PrismaticImageProcessor)
AutoProcessor.register(OpenVLAConfig, PrismaticProcessor)
AutoModelForVision2Seq.register(OpenVLAConfig, OpenVLAForActionPrediction)

# Load Processor & VLA
processor = AutoProcessor.from_pretrained("<PATH TO CONVERTED CHECKPOINT DIR>", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "<PATH TO CONVERTED CHECKPOINT DIR>",
    attn_implementation="flash_attention_2",  # [Optional] Requires `flash_attn`
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).to("cuda:0")

...

Training VLAs from Scratch

We provide full instructions and configurations for training VLA models on (arbitrary subsets of) the Open X-Embodiment (OXE) Dataset. If you run into any issues with the following, see VLA Troubleshooting below (or file a GitHub Issue).

VLA Pretraining Datasets

We download and preprocess individual datasets from Open X-Embodiment in RLDS format following this custom script. See mixtures.py for the full list of component datasets (and mixture weights) we use to train openvla-7b.

  • Important: For the BridgeData V2 component, the version in OXE is out of date (as of 12/20/2023). Instead, you should download the dataset from the official website and place it under the subdirectory bridge_orig/. Replace any reference to bridge in the OXE code with bridge_orig.
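As an optional sanity check (not part of the official instructions), you can confirm that a downloaded RLDS dataset directory is readable by TFDS before launching training; the version subdirectory below is a placeholder.

# Optional: verify that an RLDS dataset directory loads; the version subdirectory is a placeholder
import tensorflow_datasets as tfds

builder = tfds.builder_from_directory("<PATH TO BASE DATASETS DIR>/bridge_orig/<VERSION DIR>")
print(builder.info.splits)  # should list the available splits and episode counts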

VLA Configuration & Training Script

The entry point for VLA training is vla-scripts/train.py. We use draccus to provide a modular, dataclass-based interface for specifying VLA training configurations; existing VLA configurations are in prismatic/conf/vla.py. You can add your own training configuration and refer to it using the --vla.type command line argument.
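For orientation, draccus maps a dataclass onto command-line arguments roughly as in the simplified sketch below; this is not the actual training config (see prismatic/conf/vla.py for the real dataclasses and the nested --vla.type syntax).

# Simplified illustration of the draccus pattern used by the training entry points
from dataclasses import dataclass
from pathlib import Path

import draccus


@dataclass
class ExampleTrainConfig:
    vla_type: str = "prism-dinosiglip-224px+mx-bridge"  # selects a registered VLA config
    data_root_dir: Path = Path("datasets/open-x-embodiment")
    run_root_dir: Path = Path("runs")


@draccus.wrap()
def main(cfg: ExampleTrainConfig) -> None:
    # Every field can be overridden from the CLI, e.g. `--vla_type ... --run_root_dir ...`
    print(cfg)


if __name__ == "__main__":
    main()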

We use PyTorch Fully Sharded Data Parallel (FSDP) to distribute training across GPUs. Launch training via torchrun:

# Train VLA on BridgeData V2 with the Prismatic DINO-SigLIP 224px Backbone on a Single Node (w/ 8 GPUs)
torchrun --standalone --nnodes 1 --nproc-per-node 8 vla-scripts/train.py \
  --vla.type "prism-dinosiglip-224px+mx-bridge" \
  --data_root_dir <PATH TO OXE DATA ROOT> \
  --run_root_dir <PATH TO LOG/CHECKPOINT ROOT> \
  --wandb_project "<PROJECT>" \
  --wandb_entity "<ENTITY>"

VLA Troubleshooting

The following is a list of known problems and their corresponding fixes:

FileNotFoundError: Failed to construct dataset "fractal20220817_data", builder_kwargs "{'data_dir': '/path/to/processed/datasets/'}": Could not load dataset info from fractal20220817_data/0.1.0/dataset_info.json
  • Fix: Downgrade tensorflow-datasets via pip install tensorflow-datasets==4.9.3.
AttributeError: 'DLataset' object has no attribute 'traj_map'. Did you mean: 'flat_map'?
  • Fix: Upgrade dlimp to the newest version. You may have to --force-reinstall like so: pip install --no-deps --force-reinstall git+https://github.com/moojink/dlimp_openvla

Repository Structure

High-level overview of repository/project file-tree:

  • prismatic - Package source; provides core utilities for model loading, training, data preprocessing, etc.
  • vla-scripts/ - Core scripts for training, fine-tuning, and deploying VLAs.
  • LICENSE - All code is made available under the MIT License; happy hacking!
  • Makefile - Top-level Makefile (by default, supports linting - checking & auto-fix); extend as needed.
  • pyproject.toml - Full project configuration details (including dependencies), as well as tool configurations.
  • README.md - You are here!

Citation

If you find our code or models useful in your work, please cite our paper:

@article{kim24openvla,
    title={OpenVLA: An Open-Source Vision-Language-Action Model},
    author={{Moo Jin} Kim and Karl Pertsch and Siddharth Karamcheti and Ted Xiao and Ashwin Balakrishna and Suraj Nair and Rafael Rafailov and Ethan Foster and Grace Lam and Pannag Sanketi and Quan Vuong and Thomas Kollar and Benjamin Burchfiel and Russ Tedrake and Dorsa Sadigh and Sergey Levine and Percy Liang and Chelsea Finn},
    journal = {arXiv preprint arXiv:2406.09246},
    year={2024}
} 

openvla's People

Contributors

kpertsch, moojink, siddk, siddk-tri


openvla's Issues

Dataset setup for finetuning

Hi, I'm trying to finetune the openvla-7b model on the bridge v2 dataset.
I know the model is already trained on bridge v2.
I just want to have hands-on experience on finetuning the model.

I downloaded the bridge v2 dataset under openvla/datasets/open-x-embodiment/bridge_orig

and I ran the following command:

torchrun --standalone --nnodes 1 --nproc-per-node 8 vla-scripts/finetune.py \
                                 --data_root_dir "./datasets/open-x-embodiment/" \
                                 --dataset_name bridge \

Then I get the following error

Did you mean: bridge_orig -> bridge ?
The builder directory datasets/open-x-embodiment/bridge_orig doesn't contain any versions.
No builder could be found in the directory: datasets/open-x-embodiment for the builder: bridge_orig.
No registered data_dirs were found in:
     - datasets/open-x-embodiment

I would greatly appreciate it if you could help me on this issue.

Results on RLBench

Thanks for your wonderful work, and open-sourcing the code, model and data of OpenVLA!

I tried to evaluate OpenVLA on RLBench, but the results do not seem very good (0 accuracy on most of the tasks).
OpenVLA seems to have a hard time generalizing across different environments and robots.

Do you have any results on simulation platforms such as RLBench, which would allow a fair comparison?

memory leak

Hi, guys. Thanks for your great open-source work. When I use your RLDS dataset built on tf.data.Dataset, I found there is a memory leak when train=True. Specifically, when it is true and shuffle is called without cache, memory gradually increases as iteration proceeds.

dataset = dataset.shuffle(shuffle_buffer_size)

However, I don't understand why this happens. In my understanding, shuffle preloads a fixed amount of data before iteration and then replaces used data with new data on the fly during iteration, so memory should stay constant after preloading.
Could you provide a solution to the memory leak? If the leak is hard to solve, could you provide an approximate amount of memory needed to run a whole iteration over the dataset during training?

Simulation for validation

Thank you for open-sourcing the VLA. I am new to robotic models and I do not have a physical robot. Is there any simulation platform to validate a VLA model? May I know how to evaluate a VLA model in the digital domain? Thank you again.

Dataset

How can I download the dataset used for training? I can't figure it out following the explanation in your GitHub.

About Hardware

How can I purchase the hardware shown in the demonstration?


Or do you have any recommendations for other hardware to purchase?

About memory growth

There are several issues where people ask why the memory keeps growing when using the Dataloader:
#4
octo-models/octo#16

I kind of want to reopen the question with one assumption that maybe someone can verify:
The restructuring of trajectories is done with TF's symbolic tensors. Since we randomly access samples of a trajectory (random sharding of tfds before accessing it), the data of a trajectory is not loaded sequentially. When loading data with a history or future_window_size, the previous/next samples are also loaded because the trajectory transform executes while accessing the sample.

Does TF cache these previous/next samples and reuse them once the corresponding sample is loaded?
Could that be why the memory is growing?

I've noticed that the memory keeps growing until a specific point, then goes slightly up and down, which is kind of annoying when loading many different cameras.
If that is the case, is there a way to disable this caching?

Maybe this is also the wrong repository for this; I will probably cross-post it to Octo, but I think they are currently at a conference.

How does it calculate the trajectory from its initial state to its goal action?

Hi, thanks for open-sourcing the code and providing detailed README to run the code.

I am trying the Fine-Tuning OpenVLA via LoRA in README and got into some questions.

It seems that the OpenVLA model is given the image and the language instruction as input
and is trained to directly output the final action, i.e., the final position of the robot (7 action dims).

  1. How does the model directly predict the 7 action dims only with the language instruction and the image of the initial environment (state)?
    Doesn't it require any additional images during the robot's execution to go to that final position from its initial position?
    Are there no additional input images during its execution of the action?
    What if something happens and intervenes during the robot's execution?
    How do you feed back changes in the environment?

  2. Given "only" the final action dims by the model such as [-0.04435408 0.06181145 -0.2151171 0.1507535 0.19060254 0.03201425 0.],
    how do you make the robot go to that position?
    How do you produce the robot's trajectory to reach that final position from its initial state, given only the final goal action?

Thanks

Validation

Hello, Author,

I have a question that I would like to consult with you. We tried fine-tuning on OpenVLA using our own dataset. We added validation following the training method, but we found that the performance on the training set is excellent, with an action accuracy of 0.85. However, the performance on the validation set is very poor, only around 0.1. I would like to ask if you have encountered a similar situation during your training. Do you have any plans to add validation in the future?

Thank you again for your work.

Variable._execution_engine.run_backward error during fine-tuning

Issue

I got this error when running the fine-tuning script, in particular with quantization set to true.

 torchrun --standalone --nnodes 1 --nproc-per-node 2 vla-scripts/finetune.py --batch_size 4 --shuffle_buffer_size 1000 --lora_rank --use_quantization true ... # custom dataset etc....
  • Setup: running a single node with 2 DGX V100 GPUs

However, this throws the following error message:

Traceback (most recent call last):                                                                                                                                                                                                   
  File "/home/youliang/openvla/vla-scripts/finetune.py", line 326, in <module>
    finetune()
  File "/home/youliang/anaconda3/envs/vla/lib/python3.10/site-packages/draccus/argparsing.py", line 203, in wrapper_inner
    response = fn(cfg, *args, **kwargs)
  File "/home/youliang/openvla/vla-scripts/finetune.py", line 247, in finetune
    normalized_loss.backward()
  File "/home/youliang/anaconda3/envs/vla/lib/python3.10/site-packages/torch/_tensor.py", line 522, in backward
    torch.autograd.backward(
  File "/home/youliang/anaconda3/envs/vla/lib/python3.10/site-packages/torch/autograd/__init__.py", line 266, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/youliang/anaconda3/envs/vla/lib/python3.10/site-packages/torch/autograd/function.py", line 289, in apply
    return user_fn(self, *args)
  File "/home/youliang/anaconda3/envs/vla/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 319, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/home/youliang/anaconda3/envs/vla/lib/python3.10/site-packages/torch/autograd/__init__.py", line 266, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
Parameter at index 875 with name base_model.model.language_model.model.layers.31.mlp.down_proj.lora_B.default.weight has been marked as ready twice. This means that multiple autograd engine  hooks have fired for this particular parameter during this iteration.

Potential solution

To resolve this issue, I added vla._set_static_graph().

    # Wrap VLA in PyTorch DDP Wrapper for Multi-GPU Training
    vla = DDP(vla, device_ids=[device_id], find_unused_parameters=True, gradient_as_bucket_view=True)
    vla._set_static_graph()   # <---- ADD THIS LINE

Not sure if this is the right way to resolve this.

Publish Image Encoders as independent Models?

Hi,
thank you for sharing this cool project and pretrained models!
I'm particularly interested in the finetuned Siglip and Dino image encoders as standalone models.
I believe there could be a general interest in these finetuned image encoder models in the community. To my knowledge, there are no publicly available image encoders that have been finetuned on such a large dataset of 970,000 robot trajectories in PyTorch, especially with language supervision. Making these encoders easily accessible would be invaluable for researchers working on various robotic projects that may not require the full VLA model.
Would it be possible to release these finetuned image encoders as separate, standalone models?
Thanks in advance!

About deployment

Thank you for open sourcing your code!
May I ask what platform your code is deployed on? What frame rate was achieved?

ValueError: unrecognized configuration class

Traceback (most recent call last):
File "/home/h666/下载/get_data/examples/depoly.py", line 19, in
processor = AutoProcessor.from_pretrained("/home/h666/下载/openvla-7b", trust_remote_code=True)
File "/home/h666/anaconda3/envs/openvla/lib/python3.9/site-packages/transformers/models/auto/processing_auto.py", line 310, in from_pretrained
return processor_class.from_pretrained(
File "/home/h666/anaconda3/envs/openvla/lib/python3.9/site-packages/transformers/processing_utils.py", line 465, in from_pretrained
args = cls._get_arguments_from_pretrained(pretrained_model_name_or_path, **kwargs)
File "/home/h666/anaconda3/envs/openvla/lib/python3.9/site-packages/transformers/processing_utils.py", line 511, in _get_arguments_from_pretrained
args.append(attribute_class.from_pretrained(pretrained_model_name_or_path, **kwargs))
File "/home/h666/anaconda3/envs/openvla/lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py", line 890, in from_pretrained
raise ValueError(
ValueError: Unrecognized configuration class <class 'transformers_modules.openvla-7b.configuration_prismatic.OpenVLAConfig'> to build an AutoTokenizer.
Model type should be one of AlbertConfig, AlignConfig, BarkConfig, Bart

More details about finetuning on new robots & tasks

Thank you for the exciting work!

Can you provide more details about finetuning on new robots & tasks?
I tried to finetune the pretrained openvla model on only one new task of a new robot but I did not obtain a promising result. Maybe more details can help such as:

  • learning_rate,
  • max_epochs,
  • batch size,
  • the size of training dataset,
  • image augmentation,
  • train more epochs to overfit or not?

Un-normalizing statistics for Google robot tasks

Hi, I'm playing with the OpenVLA model and I want to evaluate this model on SimplerEnv's Google robot tasks. Since there is an unnorm_key argument in the predict_action method, I assume that it's related to some dataset-specific statistics and has to be changed if I use other datasets. However, I saw that for the openvla-7b model, there are no Google-related keys in the norm_stats dictionary. Does that mean I have to retrain/fine-tune the model on Google tasks? Also, I saw you did some Google robot evaluations in the paper; have you released the related code? Thanks in advance for your help!

An issue about full fine-tuning

Thanks for your great work.
If I want to use openvla/openvla-7b as the initialization weights for full fine-tuning, what should I do?
How should I set pretrained_checkpoint? I tried setting pretrained_checkpoint = "openvla/openvla-7b", but it raises the following error:
raise ValueError(f"Couldn't find valid HF Hub Path {hf_path = }")
ValueError: Couldn't find valid HF Hub Path hf_path = 'openvla/openvla-dev/pretrained/openvla/openvla-7b'

yaw pitch roll or roll pitch yaw

Hello,

I have a problem regarding the orientation order of the output action.

I see some datasets in Open-X-Embodiment (OXE) have an action vector defined as [yaw, pitch, roll] (for example TACO-Play https://www.tensorflow.org/datasets/catalog/taco_play), while other datasets have [roll, pitch, yaw].

I see the configuration in OXE is [roll, pitch, yaw], so I guess the OpenVLA model trained on OXE is also [roll, pitch, yaw], am I correct?

Any plan to release a finetuned checkpoint?

Dear authors,

Thank you for your excellent work. May I ask if you can release a fine-tuned checkpoint, such as the Droid-Wipe model or some of the Franka-Tabletop model? Besides, may I ask where I can download the fine-tuning datasets?

Best,
Runpei

How to identify when the instruction is done

I understand that the predicted action is a delta, and that it is necessary to predict a certain number of times to finish the instruction, so each time I get a new predicted action. But how do I know if the task is complete? When it is time for OpenVLA to stop, what will the model return, and how can I tell that the task has been done? Thanks!

quantization

Are the int8 and int4 quantization modes mentioned in the paper open source and supported in this repo?

RuntimeError: CUDA error: device kernel image is invalid

Hi, thanks for your great work! Sorry to disturb you.

When I launch finetune.py, an error happens at this line:

RuntimeError: CUDA error: device kernel image is invalid
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

This error may be caused by a mismatch between the CUDA version and the Torch version. However, here is my environment:

torch          2.2.0
torchaudio     2.2.0+cu118
torchtriton    2.2.0
torchvision    0.17.0

and I installed cuda-nvcc in my virtual environment, so nvcc -V shows:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

and finally I ran torch.cuda.is_available() and got True.

I have no idea what to do and am looking forward to your reply. Sorry again for disturbing you, and thank you!

Here is the information of my device:
Ubuntu 20.04.4 LTS
GPU: A100

How does it work on a novel image ?

Hello, first of all, I want to thank the authors for this great work and for providing all of it as open source.
It truly contributes to the robot learning research and boosts the development of general robot intelligence.

I have some questions about how the trained model, "openvla-7b" outputs the action (7 action dimensions) for a novel input image.

With the code you provided in Getting Started in the README, I could feed a novel image of my own to OpenVLA along with an instruction prompt.

And it successfully outputs the 7 action dimensions. (I have not validated the accuracy of the action output yet, though.)

The given image shows a picture of a table with multiple objects on it and a robotic arm.
It is an image I personally took so the OpenVLA could not have seen it before during its training.
Therefore, it does not know the camera position, the robot position and anything else in the image.
Only data given to OpenVLA is the image and the instruction prompt.

Then, without this information, how does OpenVLA accurately calculate the 7 action dimensions?
Does it estimate the camera viewpoint and the positions of the robot and objects from the given image alone?

I read your answers on other issues (#3, #6) saying that

"we do not do any explicit calibration of actions or coordinate frames across environments (other than aligning the open/close gripper action), and we do not condition action generation on anything other than a single input image and language instruction.
The OpenVLA model is likely automatically learning the necessary transformations to map action representations to the correct coordinate frame, based on the image it sees during generation. "

and

"If you want to use the model for a certain target configuration, you can just finetune it with a bit of data from your target domain and it will adopt the action space you need and learn to output such actions. If you want to use the model out of the box (ie without finetuning), you'd need to match one of the setups from the training datasets pretty closely (eg set up a Bridge-like WidowX with the BridgeDataV2 control stack) and then the model would output actions for the action space definition used in that dataset (implicitly, by recognizing the robot in the images)."

However, I still do not quite understand how it outputs actions for a novel image input.
Does it use the coordinate systems it learned from its training data, such as BridgeData V2?
Then, could the action output for my input image be quite inaccurate?
Also, does the parameter unnorm_key in the predict_action function refer to the dataset it uses to un-normalize the position and action spaces?

Also, I am wondering whether the 7-dimensional action output is the final position of the robot or just one piece of the trajectory toward the final position.
Does it output one step of the trajectory at each time step on the way to the final goal position?
If it is the latter, how frequently does it output the 7 action dimensions for a single instruction?

Thanks,

Storage requirement

Hi,

Can you give an estimate of the storage (in terabytes) required to do full training from scratch?

Problem with fine-tuning on a newly collected dataset

Thank you for your wonderful open-sourced VLA models. For the fine-tuning stage, I still have some questions.
I used RLBench data I collected myself to prepare RLDS data following your repository rlds_dataset_builder. I find there is something missing in the guidelines for training on a new dataset.

For training stage:
I also need to register my dataset in prismatic/vla/datasets/rlds/oxe/transforms.py and prismatic/vla/datasets/rlds/oxe/configs.py.

After training:
The unnorm_key for my dataset seems not to be added to config.json automatically, and I need to copy it from dataset_statistics.json.

For evaluation:
After I did the things above, the action output seems to fit in the required action space. But even though I got action accuracy higher than 95% in the training stage, the visualized actions cannot even complete tasks from the training set. Sometimes the robot moves to the target, but it will then move back.

I want to find the reason why I got a bad result in the evaluation stage:

  1. Are the fixes I made for the training stage and after training correct? Have you already provided an interface for convenient use of a new dataset?
  2. I use 128 x 128 RGB images as observation input, which is recommended in rlds_dataset_builder, but I see your vision encoder resizes to 224 x 224. Do I also need higher-resolution images as input?
  3. I set image_aug to False; how will this influence accuracy?
  4. I used wrist images to train and test. What viewpoint do you recommend for training and evaluation?
  5. I used a simulation environment; is there a big real-to-sim gap? Will it succeed in a simulation environment?

Thank you!

Fine-tuning your own dataset

Hello,

I am a student from Japan working on development using Denso industrial robots and OpenVLA. I would like to express my sincere gratitude for your remarkable research and development. I am eager to spread this wonderful technology in Japan. I believe that the integration of advanced AI technologies like OpenVLA with industrial robots will lead to significant growth and innovation in the robotics industry, opening up new possibilities for automation and intelligent manufacturing.

Currently, I am working on fine-tuning OpenVLA using a custom dataset. The specifications of the equipment I am using are as follows:

  1. Industrial Robot:

    • Vertical multi-joint industrial robot: VP-6242
    • Pendant: TP-RC7/8
  2. End Effector:

    • IAI electric gripper
    • Model: RCP2-GRST-I-20P-1-80-P3-S-A1
    • 20mm square pulse motor, reduction ratio 1/1 high-speed type, 80mm stroke
  3. Camera:

    • Realsense D415
  4. Inference Computer:

    • GPU: TUF-RTX4090-24G-GAMING
    • Memory: CMN64GX4M2Z3200C16 128GB
    • OS: Ubuntu Desktop 22.04 LTS
  5. Development Environment:

    • GPU: NVIDIA RTX 3090
    • RAM: 64GB
    • CPU: Intel Core i9-11900K
    • OS: Ubuntu 20.04
    • Others: CUDA 11.2, PyTorch 1.10

I would appreciate it if you could provide guidance on the following points:

  1. The procedure for creating a custom dataset and the specifications for its format.
  2. Detailed steps for fine-tuning the model using the dataset.

I apologize for the inconvenience, but your assistance would be greatly appreciated. Thank you very much for your time and support.

Best regards

Finetune for bimanual robot

Thanks for open sourcing your great work!

I want to fine tune openvla using bimanual dataset, and generated dataset for it.
But I guess prismatic/vla/datasets/rlds/oxe/materialize.py says that only EEF_POS or EEF_R6 actions are supported.

Is it possible to fine-tune with a dataset whose action dimension is 14? If so, which parts do I have to modify?

Best
Hokyun

During LoRA fine-tuning, the training loss is NaN

dear authors,

Thanks so much for fully releasing OpenVLA. I am trying to fine-tune OpenVLA via LoRA on the berkeley_autolab_ur5 / bridge_orig / ucsd_pick_and_place_dataset_converted_externally_to_rlds datasets, but I find the action accuracy is always 0 for all three datasets. I then found that the train_loss during my fine-tuning is always NaN, which could be the reason.

I have checked the data input, and I think it has no problem.
For the model prediction, I find that

action_preds = action_logits.argmax(dim=2)

always predicts the 0-index token;

I obtained this info by inserting the following prints in finetune.py:

            if distributed_state.is_main_process:
                print("action_logits: ", action_logits.size())
                print("action_preds: ", action_preds.size(), action_preds[0])
                print("action_gt: ", action_gt.size(), action_gt[0])
                print("mask: ", mask.size(), mask[0])
                print("correct_preds: ", correct_preds)
                print("action_accuracy: ", action_accuracy)
                print("train_loss: ", smoothened_loss)
                print("l1_loss: ", smoothened_l1_loss)

and the output is:

action_logits:  torch.Size([1, 40, 32064])
action_preds:  torch.Size([1, 40]) tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], device='cuda:0')
action_gt:  torch.Size([1, 40]) tensor([ -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
         -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
         -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
         -100,  -100, 31872, 31872, 31872, 31884, 31872, 31869, 31744,     2],
       device='cuda:0')
mask:  torch.Size([1, 40]) tensor([False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False,  True,  True,  True,  True,  True,  True,  True, False],
       device='cuda:0')
correct_preds:  tensor([[False, False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False, False]],
       device='cuda:0')
action_accuracy:  tensor(0., device='cuda:0')
train_loss:  nan

Below is the log during fine-tuning:

root@ai-precog-machine9:/home/jiamingz/projects/openvla# torchrun --standalone --nnodes 1 --nproc_per_node 2 vla-scripts/finetune.py
[2024-07-26 09:00:48,903] torch.distributed.run: [WARNING] 
[2024-07-26 09:00:48,903] torch.distributed.run: [WARNING] *****************************************
[2024-07-26 09:00:48,903] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
[2024-07-26 09:00:48,903] torch.distributed.run: [WARNING] *****************************************
2024-07-26 09:00:50.724369: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-07-26 09:00:50.724369: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-07-26 09:00:50.755711: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-07-26 09:00:50.755711: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-07-26 09:00:50.755738: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-07-26 09:00:50.755742: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-07-26 09:00:50.756664: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-07-26 09:00:50.756665: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-07-26 09:00:50.761815: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-07-26 09:00:50.761817: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-07-26 09:00:51.434507: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-07-26 09:00:51.435073: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Fine-tuning OpenVLA Model `openvla/openvla-7b` on `berkeley_autolab_ur5`
Fine-tuning OpenVLA Model `openvla/openvla-7b` on `berkeley_autolab_ur5`
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00,  5.55it/s]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00,  5.43it/s]
trainable params: 110,828,288 || all params: 7,652,065,472 || trainable%: 1.4483
trainable params: 110,828,288 || all params: 7,652,065,472 || trainable%: 1.4483
2024-07-26 09:01:11.941738: I tensorflow/core/grappler/optimizers/data/replicate_on_split.cc:32] Running replicate on split optimization
2024-07-26 09:01:12.073699: I tensorflow/core/grappler/optimizers/data/replicate_on_split.cc:32] Running replicate on split optimization
07/26 [09:01:12] INFO     | >> [*] Computing dataset statistics. This may take a bit, but should only need to happen once.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [02:22<00:00,  7.00it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [02:22<00:00,  7.00it/s]
2024-07-26 09:03:35.973908: I tensorflow/core/grappler/optimizers/data/replicate_on_split.cc:32] Running replicate on split optimization
2024-07-26 09:03:35.980789: I tensorflow/core/grappler/optimizers/data/replicate_on_split.cc:32] Running replicate on split optimization

######################################################################################
# Loading the following 1 datasets (incl. sampling weight):                         #
# berkeley_autolab_ur5: ====================================================1.000000 #
######################################################################################

07/26 [09:03:36] INFO     | >> [*] Threads per Dataset: [1]                                                                                           dataset.py:531
                 INFO     | >> [*] Reads per Dataset: [1]                                                                                             dataset.py:532
                 INFO     | >> [*] Constructing datasets...                                                                                           dataset.py:535

######################################################################################
# Loading the following 1 datasets (incl. sampling weight):                         #
# berkeley_autolab_ur5: ====================================================1.000000 #
######################################################################################

2024-07-26 09:03:36.700706: I tensorflow/core/grappler/optimizers/data/replicate_on_split.cc:32] Running replicate on split optimization
2024-07-26 09:03:36.707621: I tensorflow/core/grappler/optimizers/data/replicate_on_split.cc:32] Running replicate on split optimization
07/26 [09:03:37] INFO     | >> [*] Applying frame transforms on dataset...                                                                            dataset.py:575
07/26 [09:03:38] INFO     | >> [*] Saved dataset statistics file at path                                                                           data_utils.py:291
                          runs/openvla-7b+berkeley_autolab_ur5+b4+lr-0.0005+lora-r32+dropout-0.0_test/dataset_statistics.json                                       
  0%|                                                                                                                                    | 0/200000 [00:00<?, ?it/s]WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
W0000 00:00:1721984618.588573  598457 op_level_cost_estimator.cc:699] Error in PredictCost() for the op: op: "CropAndResize" attr { key: "T" value { type: DT_FLOAT } } attr { key: "extrapolation_value" value { f: 0 } } attr { key: "method" value { s: "bilinear" } } inputs { dtype: DT_FLOAT shape { dim { size: 1 } dim { size: 224 } dim { size: 224 } dim { size: -7 } } } inputs { dtype: DT_FLOAT shape { dim { size: -2 } dim { size: 4 } } } inputs { dtype: DT_INT32 shape { dim { size: -2 } } } inputs { dtype: DT_INT32 shape { dim { size: 2 } } } device { type: "CPU" vendor: "GenuineIntel" model: "106" frequency: 2400 num_cores: 32 environment { key: "cpu_instruction_set" value: "AVX SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2" } environment { key: "eigen" value: "3.4.90" } l1_cache_size: 49152 l2_cache_size: 1310720 l3_cache_size: 25165824 memory_size: 268435456 } outputs { dtype: DT_FLOAT shape { dim { size: -2 } dim { size: -8 } dim { size: -9 } dim { size: -7 } } }
W0000 00:00:1721984618.588991  598457 op_level_cost_estimator.cc:699] Error in PredictCost() for the op: op: "CropAndResize" attr { key: "T" value { type: DT_FLOAT } } attr { key: "extrapolation_value" value { f: 0 } } attr { key: "method" value { s: "bilinear" } } inputs { dtype: DT_FLOAT shape { dim { size: 1 } dim { size: 224 } dim { size: 224 } dim { size: -6 } } } inputs { dtype: DT_FLOAT shape { dim { size: -3 } dim { size: 4 } } } inputs { dtype: DT_INT32 shape { dim { size: -3 } } } inputs { dtype: DT_INT32 shape { dim { size: 2 } } } device { type: "CPU" vendor: "GenuineIntel" model: "106" frequency: 2400 num_cores: 32 environment { key: "cpu_instruction_set" value: "AVX SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2" } environment { key: "eigen" value: "3.4.90" } l1_cache_size: 49152 l2_cache_size: 1310720 l3_cache_size: 25165824 memory_size: 268435456 } outputs { dtype: DT_FLOAT shape { dim { size: -3 } dim { size: -10 } dim { size: -11 } dim { size: -6 } } }
wandb: Currently logged in as: jiamingzhou2472. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.17.5
wandb: Run data is saved locally in /home/jiamingz/projects/openvla/wandb/run-20240726_090339-90ybz7bj
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run ft+openvla-7b+berkeley_autolab_ur5+b4+lr-0.0005+lora-r32+dropout-0.0_test
wandb: ⭐️ View project at https://wandb.ai/jiamingzhou2472/openvla
wandb: 🚀 View run at https://wandb.ai/jiamingzhou2472/openvla/runs/90ybz7bj
  0%|                                                                                                                                    | 0/200000 [00:00<?, ?it/s]WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
W0000 00:00:1721984628.016364  598456 op_level_cost_estimator.cc:699] Error in PredictCost() for the op: op: "CropAndResize" attr { key: "T" value { type: DT_FLOAT } } attr { key: "extrapolation_value" value { f: 0 } } attr { key: "method" value { s: "bilinear" } } inputs { dtype: DT_FLOAT shape { dim { size: 1 } dim { size: 224 } dim { size: 224 } dim { size: -7 } } } inputs { dtype: DT_FLOAT shape { dim { size: -2 } dim { size: 4 } } } inputs { dtype: DT_INT32 shape { dim { size: -2 } } } inputs { dtype: DT_INT32 shape { dim { size: 2 } } } device { type: "CPU" vendor: "GenuineIntel" model: "106" frequency: 2400 num_cores: 32 environment { key: "cpu_instruction_set" value: "AVX SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2" } environment { key: "eigen" value: "3.4.90" } l1_cache_size: 49152 l2_cache_size: 1310720 l3_cache_size: 25165824 memory_size: 268435456 } outputs { dtype: DT_FLOAT shape { dim { size: -2 } dim { size: -8 } dim { size: -9 } dim { size: -7 } } }
W0000 00:00:1721984628.016811  598456 op_level_cost_estimator.cc:699] Error in PredictCost() for the op: op: "CropAndResize" attr { key: "T" value { type: DT_FLOAT } } attr { key: "extrapolation_value" value { f: 0 } } attr { key: "method" value { s: "bilinear" } } inputs { dtype: DT_FLOAT shape { dim { size: 1 } dim { size: 224 } dim { size: 224 } dim { size: -6 } } } inputs { dtype: DT_FLOAT shape { dim { size: -3 } dim { size: 4 } } } inputs { dtype: DT_INT32 shape { dim { size: -3 } } } inputs { dtype: DT_INT32 shape { dim { size: 2 } } } device { type: "CPU" vendor: "GenuineIntel" model: "106" frequency: 2400 num_cores: 32 environment { key: "cpu_instruction_set" value: "AVX SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2" } environment { key: "eigen" value: "3.4.90" } l1_cache_size: 49152 l2_cache_size: 1310720 l3_cache_size: 25165824 memory_size: 268435456 } outputs { dtype: DT_FLOAT shape { dim { size: -3 } dim { size: -10 } dim { size: -11 } dim { size: -6 } } }
action_logits:  torch.Size([1, 40, 32064])
action_preds:  torch.Size([1, 40]) tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], device='cuda:0')
action_gt:  torch.Size([1, 40]) tensor([ -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
         -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
         -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
         -100,  -100, 31872, 31872, 31872, 31884, 31872, 31869, 31744,     2],
       device='cuda:0')
mask:  torch.Size([1, 40]) tensor([False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False,  True,  True,  True,  True,  True,  True,  True, False],
       device='cuda:0')
correct_preds:  tensor([[False, False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False, False]],
       device='cuda:0')
action_accuracy:  tensor(0., device='cuda:0')
train_loss:  nan

l1_loss:  0.8638655462184873
action_logits:  torch.Size([1, 40, 32064])
action_preds:  torch.Size([1, 40]) tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], device='cuda:0')
action_gt:  torch.Size([1, 40]) tensor([ -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
         -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
         -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
         -100,  -100, 31872, 31872, 31872, 31884, 31872, 31869, 31744,     2],
       device='cuda:0')
mask:  torch.Size([1, 40]) tensor([False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False,  True,  True,  True,  True,  True,  True,  True, False],
       device='cuda:0')
correct_preds:  tensor([[False, False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False, False]],
       device='cuda:0')
action_accuracy:  tensor(0., device='cuda:0')
train_loss:  nan
...
...

Can you find the cause of this problem? Many thanks!
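
Not an answer from the authors, but a hedged debugging sketch that may help narrow this down: check whether the NaNs already appear in the logits, since argmax over a row of NaNs returns index 0, which would explain the all-zero action_preds above. The helper below is my own illustration (not code from finetune.py); the argument names mirror the tensors printed in the log.

import torch

# Hedged debugging helper: pass in the tensors printed above to see where NaNs first show up.
def report_nans(pixel_values: torch.Tensor, logits: torch.Tensor, loss: torch.Tensor) -> None:
    print("NaN in pixel_values:", torch.isnan(pixel_values).any().item())
    print("NaN in logits      :", torch.isnan(logits).any().item())
    print("NaN in loss        :", torch.isnan(loss).any().item())
    # argmax over an all-NaN row returns 0, so all-zero `action_preds` usually means the
    # logits themselves are already NaN rather than the label masking being wrong.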

About Delta Output

Hello, Author,

I have a question regarding the output of OpenVLA. Is the output always in relative terms? I noticed that the Open X Embodiment dataset contains both absolute and relative positions. I am seeking advice from the authors on how to handle the absolute and relative values within the dataset effectively.

Could you please provide some guidance or suggestions on this matter?
Thank you for your time and assistance.
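
Not an authoritative answer, but a minimal sketch of the distinction being asked about may help frame it: a relative (delta) action is added to the current end-effector pose, while an absolute action replaces it. The 7-DoF layout and the apply_action helper below are assumptions for illustration only, not documented behaviour.

import numpy as np

# Illustration only: assumes a 7-DoF action laid out as
# [dx, dy, dz, droll, dpitch, dyaw, gripper] and a 6-DoF pose in the same convention.
def apply_action(current_pose: np.ndarray, action: np.ndarray, relative: bool = True) -> np.ndarray:
    pose_part, gripper = action[:6], action[6:]
    next_pose = current_pose + pose_part if relative else pose_part
    return np.concatenate([next_pose, gripper])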

Client can't connect to the server?

It seems that my server is deployed:
[screenshot]
but something went wrong when I ran the client:
[screenshot]
could you please check it for me?

Memory Leak?

Hello, thanks for the great work!

I was wondering if there are any memory leaks, maybe somewhere in the dataset/dataloader code? When I run the training code, I get the following plot of available process memory:
[screenshot: process memory plot]

Your help would be much appreciated!
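
For anyone trying to narrow this down, here is a hedged sketch (assuming psutil is installed in the training environment) that logs resident memory during training. A footprint that keeps growing across steps points to a leak, whereas a large but stable footprint is more likely the RLDS/tf.data prefetch buffers.

import os
import psutil  # assumption: psutil is available in the training environment

process = psutil.Process(os.getpid())

def log_memory(step: int, every: int = 100) -> None:
    """Print resident set size (RSS) every `every` steps to spot unbounded growth."""
    if step % every == 0:
        print(f"step {step}: RSS = {process.memory_info().rss / 1e9:.2f} GB")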

Missing data needed to reproduce on other robots

In order to test this on a real robot one would need details on:

  • The units of the resultant action (I'd assume meters? radians in angle-axis?)
  • The coordinates used (Z = camera forward?)
  • Grippers supported: I'd assume part of the image will always include gripper fingers; is the model somehow agnostic to these? How are grippers calibrated so that the action output can be applied to other grippers?
  • Transformations from action coordinates to robot coordinates

Cheers!
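
On the last bullet specifically, a hedged sketch of what such a transformation usually looks like; the rotation matrix R_robot_from_action and the assumption that the first three dimensions are a translational delta are illustrative, not documented behaviour.

import numpy as np

def to_robot_frame(action: np.ndarray, R_robot_from_action: np.ndarray) -> np.ndarray:
    """Rotate the translational delta of a 7-DoF action into the robot base frame.

    R_robot_from_action is a per-setup calibrated 3x3 rotation; the rotational and
    gripper components are passed through unchanged here for simplicity.
    """
    out = action.copy()
    out[:3] = R_robot_from_action @ action[:3]
    return out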

finetuning with flexible observations and flexible action spaces

Thank you for open-sourcing the VLA model. This initiative will significantly foster the advancement of foundational models for robotics.

I'm curious whether you have plans to support fine-tuning with customizable/flexible observations (e.g. adding proprio info, left/right wrist images, or depth images) and action spaces (e.g. a diffusion action head, 14-dim actions, etc.), similar to the capabilities of the Octo model.

Any guidance on this matter would be greatly appreciated. Thanks!

get_openvla_prompt ablations and deploy details.

Hello,

Thank you for the great works.

  1. Out of curiosity, have you done any ablation studies on the system prompt in openvla/vla-scripts/deploy.py, lines 58 - 62? From my understanding, only the prompts f"{SYSTEM_PROMPT} USER: What action should the robot take to {instruction.lower()}? ASSISTANT:" and f"In: What action should the robot take to {instruction.lower()}?\nOut:", where

SYSTEM_PROMPT = ("A chat between a curious user and an artificial intelligence assistant. " "The assistant gives helpful, detailed, and polite answers to the user's questions.")

were tested.

If you could expand on the sensitivity of the model's performance with respect to the system prompt, that would be great!
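
For reference, a hedged reconstruction of the prompt-selection logic the question refers to (my own sketch, not the verbatim code in vla-scripts/deploy.py): the chat-style prompt with SYSTEM_PROMPT corresponds to the Vicuña-based openvla-v01-7b, and the terse "In:/Out:" format to the Llama-2-based openvla-7b.

SYSTEM_PROMPT = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)

def get_openvla_prompt(instruction: str, model_path: str) -> str:
    # Assumption for illustration: the model path is used to pick the prompt format.
    if "v01" in model_path:
        return f"{SYSTEM_PROMPT} USER: What action should the robot take to {instruction.lower()}? ASSISTANT:"
    return f"In: What action should the robot take to {instruction.lower()}?\nOut:"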

  2. Additionally, in the Getting Started code snippet on the README page:
# Install minimal dependencies (`torch`, `transformers`, `timm`, `tokenizers`, ...)
# > pip install -r https://raw.githubusercontent.com/openvla/openvla/main/requirements-min.txt
from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image

import torch

# Load Processor & VLA
processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b", 
    attn_implementation="flash_attention_2",  # [Optional] Requires `flash_attn`
    torch_dtype=torch.bfloat16, 
    low_cpu_mem_usage=True, 
    trust_remote_code=True
).to("cuda:0")

# Grab image input & format prompt
image: Image.Image = get_from_camera(...)
prompt = "In: What action should the robot take to {<INSTRUCTION>}?\nOut:"

# Predict Action (7-DoF; un-normalize for BridgeData V2)
inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)

# Execute...
robot.act(action, ...)

I was wondering why this prompt does not include the SYSTEM_PROMPT.

  3. Lastly, could you please clarify whether the following logic is correct for deployment?
instruction = 'pick up cup'
prompt = f"{SYSTEM_PROMPT} USER: What action should the robot take to {instruction.lower()}? ASSISTANT:"
while rollout:
    image: Image.Image = get_from_camera(...)
    inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
    action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
    if success:
        break

Thank you.

About finetuning dataset setup

Hi:

Thanks a lot for open-sourcing the VLA, and I'm trying to use OpenVLA for fine-tuning on my customized dataset.

Before that, I'm running a verification to check whether it can work on part of the DROID dataset. Here is what I did:

  1. I downloaded rlds_dataset_mod locally.
  2. I modified prepare_open_x.sh in rlds_dataset_mod so that it downloads only the first top 100 episodes + 4963 trajectories.
  3. Then I used the following script to start fine-tuning:
export HF_TOKEN="/home/xiandao_airs/.cache/huggingface/token"
torchrun --standalone --nnodes 1 --nproc-per-node 2 vla-scripts/finetune.py \
  --data_root_dir ./data/datasets \
  --dataset_name droid \
  --batch_size 16 \
  --run_root_dir ./run/20240705_12_49 

Then I found that the loss and accuracy fluctuated quite a lot, as shown in the following picture:

[screenshot: loss/accuracy curves]

So I debugged a little further:

[screenshot]

Then I found that the action-decoding result is weird:

self.action_tokenizer(action) = '么近巴計客ව食'

Given code implementation in https://github.com/openvla/openvla/blob/main/prismatic/vla/datasets/datasets.py

        prompt_builder = self.prompt_builder_fn("openvla")
        conversation = [
            {"from": "human", "value": f"What action should the robot take to {lang}?"},
            {"from": "gpt", "value": self.action_tokenizer(action)},
        ]
        for turn in conversation:
            prompt_builder.add_turn(turn["from"], turn["value"])

        # Tokenize (w/ `base_tokenizer`)
        input_ids = self.base_tokenizer(prompt_builder.get_prompt(), add_special_tokens=True).input_ids
        labels = list(input_ids)

I further checked the loss: we rely on the labels generated by self.action_tokenizer(action) to optimize it, and that is why it fluctuates so much.

My question is how to make self.action_tokenizer convert action vectors into natural language to avoid this problem. Please correct me if I'm wrong.
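
For context, a hedged sketch of what an action tokenizer of this kind typically does (the bin count and mapping below are my assumptions for illustration; see prismatic/vla/action_tokenizer.py for the real implementation): continuous actions are clipped, discretized into bins, and mapped onto rarely-used token ids at the tail of the LLM vocabulary, so decoding them yields "unreadable" characters by design rather than natural language.

import numpy as np

def tokenize_action(action: np.ndarray, vocab_size: int = 32000, n_bins: int = 256) -> np.ndarray:
    """Illustrative sketch: map each action dimension to a token id near the end of the vocab."""
    action = np.clip(action, -1.0, 1.0)
    bins = np.linspace(-1.0, 1.0, n_bins)
    ids = np.digitize(action, bins)   # bin index per dimension, in [1, n_bins]
    return vocab_size - ids           # token ids in the rarely-used tail of the vocabulary

Under these assumptions, a decoded string like '么近巴計客ව食' is expected rather than a sign of data corruption.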

Best Regards
Orlando

The issue about bridge dataset

Thanks for your great work!
I want to know which version of the bridge dataset was used to train OpenVLA.
If it is BridgeData V2, the website you provide points to version 1.

Suggestion to add extern script imports to separate requirements file

Hi all,

Thank you for this fantastic project and repository! I'm currently working with the REPL example to create an OpenVLA demo/walkthrough and wanted to suggest a couple of improvements:

Separate Requirements File: It would be helpful to add some of the imports, such as draccus, dlimp, and rich, to a separate requirements-extern.txt file within the extern/REPL example folder. This could streamline the setup process for users.

Higher Resolution Images: Could you please provide higher resolution versions of the images (without annotations) from Figure 3 and Figure 4 of the paper? Perhaps these could be included in the extern folder if possible. (nm, can get these from the rollout videos)

Thanks again for all your hard work on this project!

Bus error (core dumped)

I tried to run inference with the fine-tuned checkpoint (using LoRA fine-tuning), and it crashes with a bus error. How can I fix this? Thanks!

Can OpenVLA chat?

First of all, thank you for your great work!
I wanted to ask the model questions (such as "describe what you see"), because I saw nothing similar in the paper. Or are the 7-D action values the only output we can receive from the model?
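
Not an official chat mode, but as a hedged sketch (reusing vla, processor, and inputs from the Getting Started snippet above, and assuming the processor exposes its underlying tokenizer): you can call the generic generate method and decode the raw tokens, though the checkpoint has been fine-tuned to emit action tokens, so free-form answers such as "describe what you see" are unlikely to be meaningful.

# Sketch only: generate and decode raw tokens instead of calling predict_action.
generated_ids = vla.generate(**inputs, max_new_tokens=32, do_sample=False)
print(processor.tokenizer.decode(generated_ids[0], skip_special_tokens=True))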

Evaluation on public benchmark, e.g., SimplerEnv

Thank you for the open-source project 🤗. This will be a significant advancement for general robotic systems! Additionally, I would like to ask whether the authors or anyone else have evaluated OpenVLA on public benchmarks, e.g., SimplerEnv? Could you share the evaluation results? Thanks! 😊

Pre-processing the BridgeData V2 in RLDS format

Hi, me again!

I'm trying to train the OpenVLA model on BridgeData V2 with the script you provided in the README.

# Train VLA on BridgeData V2 with the Prismatic DINO-SigLIP 224px Backbone on a Single Node (w/ 8 GPUs)
torchrun --standalone --nnodes 1 --nproc-per-node 8 vla-scripts/train.py \
  --vla.type "prism-dinosiglip-224px+mx-bridge" \
  --data_root_dir <PATH TO OXE DATA ROOT> \
  --run_root_dir <PATH TO LOG/CHECKPOINT ROOT> \
  --wandb_project "<PROJECT>" \
  --wandb_entity "<ENTITY>"

I downloaded the BridgeData V2 from its official website and placed it under openvla/datasets/open-x-embodiment/.

I tried to pre-process the downloaded dataset with the custom script mentioned in the README.
This is the command I ran.

DOWNLOAD_DIR="../datasets/open-x-embodiment"
CONVERSION_DIR="../datasets/temp"
N_WORKERS=20                  # number of workers used for parallel conversion --> adjust based on available RAM
MAX_EPISODES_IN_MEMORY=200    # number of episodes converted in parallel --> adjust based on available RAM

# increase limit on number of files opened in parallel to 20k --> conversion opens up to 1k temporary files
# in /tmp to store dataset during conversion
ulimit -n 20000

DATASET="bridge"
TRANSFORM="resize_and_jpeg_encode"

mkdir ${DOWNLOAD_DIR}/${DATASET}
python3 modify_rlds_dataset.py  --dataset=$DATASET --data_dir=$DOWNLOAD_DIR --target_dir=$CONVERSION_DIR --mods=$TRANSFORM --n_workers=$N_WORKERS --max_episodes_in_memory=$MAX_EPISODES_IN_MEMORY
rm -rf ${DOWNLOAD_DIR}/${DATASET} 
mv ${CONVERSION_DIR}/${DATASET} ${DOWNLOAD_DIR}

modify_rlds_dataset.py is located at openvla/rlds_dataset_mod/.

When I run the above command, I get the following error message.

File "~/openvla/rlds_dataset_mod/rlds_dataset_mod/multithreaded_adhoc_tfds_builder.py", line 149, in __init__
   super().__init__(*args, **kwargs)
TypeError: SplitBuilder.__init__() got an unexpected keyword argument 'file_format'.

Do you know any way to resolve this error?
Also, I'm wondering whether it is necessary to pre-process the downloaded BridgeData V2 at all,
because when I run modify_rlds_dataset.py it seems to try to download the dataset from the server again.

Thanks!

Train_loss becomes NaN

Thank you very much for your outstanding work. I am conducting fine-tuning experiments using the cmu_stretch dataset. My data processing should be fine. I would like to ask if using float16 for computation could cause the train_loss to become NaN. Your code uses bfloat16, but my device, a Titan GPU, does not support this precision.
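
As a hedged sketch of the workaround being asked about (not an official recommendation): fp16 has a much narrower exponent range than bf16, so NaN losses are plausible under float16. On a GPU without bf16 support, falling back to fp32 is safer, at the cost of memory, possibly combined with a lower learning rate or gradient clipping.

import torch
from transformers import AutoModelForVision2Seq

# Pick a compute dtype based on hardware support (illustration only).
dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float32
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=dtype,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).to("cuda:0")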
