
calvin-scripts's Introduction


  • script from calvin developers. Try pip install opencv-python to make import cv2 work.

  •,jl: extracts the numeric fields of calvin data files in current directory and prints them to stdout in tab separated format. The fields are given below. Only file-id(1), actions(7), rel_actions(7), robot_obs(15), scene_obs(24) are printed (54 columns). has more details about the splits and the fields. We also provide description and statistics in the next section.

  • debug-training.tsv, debug-validation.tsv, etc.: The release page has the output of calvin_extract_numbers scripts.

  • Read and print info in the scene_info.npy, ep_lens.npy, ep_start_end_ids.npy files in each calvin data directory.

  • Print all the unique task_ids and language annotations in the dataset. There are 34 unique task ids and 389 unique annotations.

  • zcat D-validation.tsv.gz | python prints out the differences with the previous line for all but the first line.

  • zcat D-validation.tsv.gz | python Tries to guess the episode boundaries if xyz of successive frames differ by more than 8.5 std.

  • zcat D-validation.tsv.gz | python Prints out intervals based on idnum discontinuities.

  • zcat D-validation.tsv.gz | python Print statistics for each column.

  • python /datasets/calvin/D/validation/episode_0000000.npz: Visualize a single frame from a given npz file.

  • python Visualize a series of frames in the current directory, a more detailed version of the original

  •,,, Scripts with which I tried to discover, output, and check the controller (button, etc.) coordinates.

  • Extract the mean pixels of depth_tactile (2) and rgb_tactile images. Saved as e.g. D-validation-tactile.tsv.gz.

  • The initial version of extract_tactile did not normalize the values: depth pixels were too small and rgb pixels were too large. I fixed these using this script and named the normalized files e.g. D-validation-tactile2.tsv.gz. The extract script has since been fixed and now normalizes its output.

  • python auto_lang_ann.npy: extract start, end, task, instruction triples in tab separated format

  • python D/validation: extract and normalize scene coordinate differences, saved in sdiff files.

  • zcat data/ABC(D)-training.tsv.gz | perl calvin_scene_fix_abc(d).pl | gzip > data/ABC(D)fixed-training.tsv.gz: fix the red-pink and red-blue swaps in scenes A and C respectively.

  • python ABCD training: create an npz file merging all features to one array and attached meta information.

  • CalvinDataset class, reads from npz file and subclasses Dataset. Deprecating
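The interval and episode-boundary bullets above can be sketched roughly as follows. This is an illustration only, not the repository's scripts: the function names `idnum_intervals` and `guess_boundaries` are hypothetical, the input is assumed to be already parsed from the tsv files, and the 8.5 std threshold is assumed to apply per axis to the standard deviation of frame-to-frame differences, which may differ from what the actual script does.

```python
import statistics


def idnum_intervals(idnums):
    """Group sorted frame ids into (start, end) runs of consecutive values."""
    intervals = []
    start = prev = None
    for n in idnums:
        if start is None:
            start = prev = n
        elif n == prev + 1:
            prev = n
        else:
            # A gap in idnum closes the current interval.
            intervals.append((start, prev))
            start = prev = n
    if start is not None:
        intervals.append((start, prev))
    return intervals


def guess_boundaries(xyz, threshold=8.5):
    """Return indices i where frame i jumps away from frame i-1.

    Assumption: a jump is flagged when any axis changes by more than
    `threshold` standard deviations of that axis's frame-to-frame differences.
    """
    diffs = [[b[k] - a[k] for k in range(3)] for a, b in zip(xyz, xyz[1:])]
    if len(diffs) < 2:
        return []
    stds = [statistics.pstdev([d[k] for d in diffs]) for k in range(3)]
    return [
        i + 1
        for i, d in enumerate(diffs)
        if any(stds[k] > 0 and abs(d[k]) > threshold * stds[k] for k in range(3))
    ]
```

For example, `idnum_intervals([0, 1, 2, 5, 6, 9])` groups the ids into three runs. Note that a single jump can only exceed 8.5 std when the sequence is long enough, since the jump itself inflates the standard deviation.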


  • episode_XXXXXXX.npz: Each frame is represented in a file named episode_idnum.npz, consecutive idnums indicate consecutive frames (with the exception of episode transitions I guess). Other files indicating the contents:
  • scene_info.npy indicates the first and last frame numbers for a particular directory as well as which scene (A,B,C,D) they come from (although there seems to be some confusion about this). It only exists in */training but describes the union of training and validation.
  • ep_start_end_ids.npy indicates the start and end idnums of segments in that particular directory.
  • ep_lens.npy indicates the lengths of segments given by ep_start_end_ids.npy.
  • statistics.yaml gives basic stats for numeric variables.
| directory | frames | scene_info.npy |
| --- | --- | --- |
| debug/training | 2771 | {'calvin_scene_D': [358482, 361252]} |
| debug/validation | 1675 | {'calvin_scene_D': [553567, 555241]} |
| D/training | 512077 | {'calvin_scene_D': [0, 611098]} |
| D/validation | 99022 | . |
| ABC/training | 1795045 | {'calvin_scene_B': [0, 598909], 'calvin_scene_C': [598910, 1191338], 'calvin_scene_A': [1191339, 1795044]} |
| ABC/validation | 99022 | . |
| ABCD/training | 2307126 | {'calvin_scene_D': [0, 611098], 'calvin_scene_B': [611099, 1210008], 'calvin_scene_C': [1210009, 1802437], 'calvin_scene_A': [1802438, 2406143]} |
| ABCD/validation | 99022 | . |

The validation directories for ABCD, ABC, and D have the same content: calvin_scene_D: 0:53818, 219635:244284, 399946:420498 (99022 frames). Note, however, that the language annotations in the D split are different from the ones in the ABC/ABCD splits.

The training directory for D has the rest of the scene D data: calvin_scene_D: 53819:219634, 244285:399945, 420499:611098. (512077 frames). (scene_info.npy says scene_A but should be scene_D).

The training directory for ABCD starts with the same content as D/training (except for several off-by-one errors) followed by B (598910), C (592429), and A (603706) sections which are identical to ABC/training but renumbered.
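The frame counts quoted above can be checked with a little arithmetic. A minimal sketch, assuming every range is inclusive at both ends; the helper names `total_frames` and `is_partition` are made up for illustration.

```python
def total_frames(ranges):
    """Number of idnums covered by inclusive (first, last) ranges."""
    return sum(last - first + 1 for first, last in ranges)


def is_partition(ranges, first, last):
    """True if the ranges cover first..last exactly once, with no gaps or overlaps."""
    expected = first
    for lo, hi in sorted(ranges):
        if lo != expected or hi < lo:
            return False
        expected = hi + 1
    return expected == last + 1


# Scene D ranges as listed in the text above.
validation = [(0, 53818), (219635, 244284), (399946, 420498)]
training = [(53819, 219634), (244285, 399945), (420499, 611098)]

print(total_frames(validation))  # 99022
print(total_frames(training))    # 512077
print(is_partition(validation + training, 0, 611098))  # True
print(598910 + 592429 + 603706)  # 1795045, the ABC/training frame count
```

The partition check confirms that D/training and the shared validation set together cover idnums 0 to 611098 of scene D without overlap.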

The scenes differ by desk color and drawer positioning. The objects look the same.


A summary of all data (including images) in episode_XXXXXXX.npz files (each represents a single 1/30 sec frame):

# julia> for f in data[:files]; println(f, "\t", summary(get(data, f))); end
# actions	7-element Vector{Float64}
# rel_actions	7-element Vector{Float64}
# robot_obs	15-element Vector{Float64}
# scene_obs	24-element Vector{Float64}
# rgb_static	200×200×3 Array{UInt8, 3}
# rgb_gripper	84×84×3 Array{UInt8, 3}
# rgb_tactile	160×120×6 Array{UInt8, 3}
# depth_static	200×200 Matrix{Float32}
# depth_gripper	84×84 Matrix{Float32}
# depth_tactile	160×120×2 Array{Float32, 3}
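For readers working in Python rather than Julia, a similar summary can be produced with NumPy. This is a sketch under assumptions: `summarize_npz` is a hypothetical helper name, and the path in the usage comment is the example path from the script list above.

```python
import numpy as np


def summarize_npz(source):
    """Return {field: (shape, dtype name)} for every array in an npz file.

    `source` may be a file path or a file-like object.
    """
    with np.load(source) as data:
        return {name: (data[name].shape, str(data[name].dtype)) for name in data.files}


# Usage, assuming the dataset lives at the path used elsewhere in this README:
# for name, (shape, dtype) in summarize_npz(
#         "/datasets/calvin/D/validation/episode_0000000.npz").items():
#     print(name, shape, dtype, sep="\t")
```

Array shapes and element types reported by NumPy may be presented differently from the Julia output above.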


The fields in the output of (files like D-training.tsv.gz) are as follows. (The npz files keep idnum separate, so column 0 is actx and in general subtract 1 from the following to get the column index.) (The ranges given below are from D-validation.)

  1. idnum
  2. actions/x [-0.42,0.37] (tcp (tool center point) position (3): x,y,z in absolute world coordinates in meters; this is the position we want in the next frame, i.e. act[t] = tcp[t+1])
  3. actions/y [-0.46,0.10]
  4. actions/z [0.30,0.69]
  5. actions/a [-pi,pi] (tcp orientation (3): euler angles a,b,c in absolute world coordinates)
  6. actions/b [-0.46,0.32]
  7. actions/c [-pi,pi]
  8. actions/g [-1 or 1] (gripper_action (1): binary close=-1, open=1)
  9. rel_actions/x [-1,1] (tcp position (3): x,y,z in relative world coordinates normalized and clipped to (-1, 1) with scaling factor 50: rel[t]=normalize(act[t]-tcp[t]))
  10. rel_actions/y [-1,1] (normalize_dist=clip(act-obs,-0.02,0.02)/0.02, normalized_angle=clip(((act-obs) + pi) % (2*pi) - pi, -0.05, 0.05)/0.05)
  11. rel_actions/z [-1,1]
  12. rel_actions/a [-1,1] (tcp orientation (3): euler angles a,b,c in relative world coordinates normalized and clipped to (-1, 1) with scaling factor 20)
  13. rel_actions/b [-1,1]
  14. rel_actions/c [-1,1]
  15. rel_actions/g [-1 or 1] (gripper_action (1): binary close=-1, open=1)
  16. robot_obs/x [-0.42,0.37] (tcp position (3): x,y,z in world coordinates: current position of tcp, i.e. tcp[t]=act[t-1])
  17. robot_obs/y [-0.46,0.10]
  18. robot_obs/z [0.30,0.69]
  19. robot_obs/a [-pi,pi] (tcp orientation (3): euler angles a,b,c in world coordinates)
  20. robot_obs/b [-0.46,0.32]
  21. robot_obs/c [-pi,pi]
  22. robot_obs/w [-0.02,0.08] (gripper opening width (1): in meters)
  23. robot_obs/j1 [-1.67,0.41] (arm_joint_states (7): in rad)
  24. robot_obs/j2 [-0.11,1.44]
  25. robot_obs/j3 [1.21,2.55]
  26. robot_obs/j4 [-2.80,-0.50]
  27. robot_obs/j5 [-1.65,0.17]
  28. robot_obs/j6 [1.27,2.56]
  29. robot_obs/j7 [-1.20,2.46]
  30. robot_obs/g [-1 or 1] (gripper_action (1): binary close = -1, open = 1)
  31. scene_obs/sliding_door [-0.002,0.28] (relative coordinates in meters)
  32. scene_obs/drawer [0.0,0.22] (relative coordinates in meters)
  33. scene_obs/button [0.0,0.034] (relative coordinates, scaled)
  34. scene_obs/switch [0.04,0.09] (relative coordinates, scaled)
  35. scene_obs/lightbulb [0 or 1] (1): on=1, off=0
  36. scene_obs/green light [0 or 1] (1): on=1, off=0
  37. scene_obs/redx (red block (6): (x, y, z, euler_x, euler_y, euler_z))
  38. scene_obs/redy
  39. scene_obs/redz
  40. scene_obs/reda
  41. scene_obs/redb
  42. scene_obs/redc
  43. scene_obs/bluex (blue block (6): (x, y, z, euler_x, euler_y, euler_z))
  44. scene_obs/bluey
  45. scene_obs/bluez
  46. scene_obs/bluea
  47. scene_obs/blueb
  48. scene_obs/bluec
  49. scene_obs/pinkx (pink block (6): (x, y, z, euler_x, euler_y, euler_z))
  50. scene_obs/pinky
  51. scene_obs/pinkz
  52. scene_obs/pinka
  53. scene_obs/pinkb
  54. scene_obs/pinkc
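The rel_actions formulas quoted in the list above (normalize_dist and normalized_angle) can be written out as a small sketch. The clip bounds 0.02 and 0.05 come from the list; the function names and the helper `rel_action` are illustrative, and this is not claimed to be the code that produced the dataset.

```python
import math


def clip(value, lo, hi):
    """Clamp value to the closed interval [lo, hi]."""
    return max(lo, min(hi, value))


def normalize_dist(act, obs, max_step=0.02):
    """Position difference clipped to +-max_step meters and scaled to [-1, 1]."""
    return clip(act - obs, -max_step, max_step) / max_step


def normalize_angle(act, obs, max_step=0.05):
    """Angle difference wrapped to [-pi, pi), clipped to +-max_step rad, scaled to [-1, 1]."""
    wrapped = ((act - obs) + math.pi) % (2 * math.pi) - math.pi
    return clip(wrapped, -max_step, max_step) / max_step


def rel_action(action, tcp_pose):
    """Relative action from an absolute action (x, y, z, a, b, c, g) and the
    current tcp pose (x, y, z, a, b, c); the gripper value is passed through."""
    pos = [normalize_dist(action[i], tcp_pose[i]) for i in range(3)]
    ang = [normalize_angle(action[i], tcp_pose[i]) for i in range(3, 6)]
    return pos + ang + [action[6]]
```

Dividing by 0.02 and 0.05 corresponds to the scaling factors 50 and 20 mentioned in the list. The modulo step matters for angles near ±pi: a target of pi - 0.01 seen from -pi + 0.01 is a small negative rotation of 0.02 rad, not a rotation of almost 2*pi.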


The fields in the output of (files like D-training-controllers.tsv.gz) are as follows: The coordinates give the location tcp should be at to move the controller at each point in time. For details of the coordinate calculation see and

00.00 idnum
01.54 slider.x
02.55 slider.y
03.56 slider.z
04.57 drawer.x
05.58 drawer.y
06.59 drawer.z
07.60 button.x
08.61 button.y
09.62 button.z
10.63 switch.x
11.64 switch.y
12.65 switch.z


The fields in the output of the script (files like D-validation-tactile2.tsv.gz) are as follows. These give the average pixel value of the tactile images. The depth pixels were normalized by multiplying by 100, the rgb pixels by dividing by 255.0.

00.00 idnum
01.66 depth_tactile1
02.67 depth_tactile2
03.68 rgb_tactile1_r
04.69 rgb_tactile1_g
05.70 rgb_tactile1_b
06.71 rgb_tactile2_r
07.72 rgb_tactile2_g
08.73 rgb_tactile2_b


These are the normalized differences in scene coordinates: normalize(x[next]-x[curr]).

Normalization similar to rel_actions based on

00.00 idnum
01.74 scene_obs/sliding_door
02.75 scene_obs/drawer
03.76 scene_obs/button
04.77 scene_obs/switch
05.78 scene_obs/lightbulb
06.79 scene_obs/green_light
07.80 scene_obs/redx
08.81 scene_obs/redy
09.82 scene_obs/redz
10.83 scene_obs/reda
11.84 scene_obs/redb
12.85 scene_obs/redc
13.86 scene_obs/bluex
14.87 scene_obs/bluey
15.88 scene_obs/bluez
16.89 scene_obs/bluea
17.90 scene_obs/blueb
18.91 scene_obs/bluec
19.92 scene_obs/pinkx
20.93 scene_obs/pinky
21.94 scene_obs/pinkz
22.95 scene_obs/pinka
23.96 scene_obs/pinkb
24.97 scene_obs/pinkc


Each language annotation file is class (task) balanced, with a roughly equal number of annotations per class in each dataset. ABC-validation and ABCD-validation are buggy and should not be used: they padded the annotation counts by repeating the same annotations multiple times. Use D-validation instead.

| dataset | fraction of frames annotated | frames | annots | contig tasks | frames/annot | fraction of frames multiply annotated |
| --- | --- | --- | --- | --- | --- | --- |
| ABCD-training | .3734 | 2307126 | 22966 | 14176 | 100.46 | .3650 |
| ABC-training | .3760 | 1795045 | 17870 | 11102 | 100.45 | .3608 |
| D-training | .3813 | 512077 | 5124 | 3251 | 99.94 | .3528 |
| debug-training | .1364 | 2771 | 9 | 7 | 307.89 | .3306 |
| ABCD-validation | .0634 | 99022 | 1087 | 90 | 91.10 | .9154 |
| ABC-validation | .0634 | 99022 | 1087 | 90 | 91.10 | .9154 |
| D-validation | .3660 | 99022 | 1011 | 605 | 97.94 | .3755 |
| debug-validation | .2036 | 1675 | 8 | 6 | 209.38 | .1994 |

calvin-scripts's People


denizyuret





calvin-scripts's Issues

pl forward vs step functions

Regarding your log here, Hocam: you don't have to use both of them in your case; the step function is enough to iterate over each batch.

In pl or pytorch, forward() should define your predictions/inference; it encapsulates the way your model would be used. You can choose to have it in your step() function or not.

However, if you are using self(x) rather than self.mlp(x) (see here), you must define a forward() method. self(x) automatically calls the forward() function for you.
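The mechanism can be illustrated without PyTorch. The classes below are a simplified stand-in for how calling a module dispatches to forward(); they are not torch.nn.Module or a pl LightningModule, and all names are made up for the illustration.

```python
class TinyModule:
    """Simplified stand-in for a module base class: calling the object runs forward()."""

    def __call__(self, *args, **kwargs):
        return self.forward(*args, **kwargs)

    def forward(self, *args, **kwargs):
        raise NotImplementedError("define forward() if you want to use self(x)")


class WithForward(TinyModule):
    def mlp(self, x):
        return 2 * x

    def forward(self, x):
        return self.mlp(x)

    def training_step(self, batch):
        return self(batch)  # self(x) goes through forward()


class WithoutForward(TinyModule):
    def mlp(self, x):
        return 2 * x

    def training_step(self, batch):
        return self.mlp(batch)  # calls the submodule directly, no forward() needed
```

Both `training_step` variants produce the same result; only the first class supports being called as `model(x)`.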

Reference: you can also check the link, Hocam; there is a FORWARD vs TRAINING_STEP section provided by pl.

code confusion in calvindataset2

Hi, I am a bit confused about the aim of that line. As I remember, that line was generated by ChatGPT; I was wondering if there is a simpler and more readable way, e.g. reshape(-1), flatten, etc. Could you briefly go over that line, Hocam?

gripper width - gripper open/close confusion

The last index of robot_obs is gripper open or close [-1/1] and the 7th index is gripper width in the original Calvin, see here. But it is the other way around in your case, in this line:

('armg', int), # 29. (gripper_action (1): binary close = -1, open = 1)

Did you change it on purpose, Hocam, or is it a small mistake?

I am going to drop the gripper open/close variable in robot_obs while training the action decoder, to make it 96-dimensional and divisible by the number of heads.
