
google-research-datasets / objectron


Objectron is a dataset of short, object-centric video clips. In addition, the videos contain AR session metadata including camera poses, sparse point clouds, and planes. In each video, the camera moves around and above the object and captures it from different views. Each object is annotated with a 3D bounding box. The 3D bounding box describes the object’s position, orientation, and dimensions. The dataset contains about 15K annotated video clips and 4M annotated images in the following categories: bikes, books, bottles, cameras, cereal boxes, chairs, cups, laptops, and shoes.

License: Other

Jupyter Notebook 99.54% Python 0.46%
deep-learning computer-vision machine-learning python tensorflow pytorch 3d-vision 3d-reconstruction ai 3d

objectron's Introduction

Objectron Dataset

Objectron is a dataset of short, object-centric video clips with pose annotations.


Website · Dataset Format · Tutorials · License

The Objectron dataset is a collection of short, object-centric video clips, which are accompanied by AR session metadata that includes camera poses, sparse point-clouds and characterization of the planar surfaces in the surrounding environment. In each video, the camera moves around the object, capturing it from different angles. The data also contain manually annotated 3D bounding boxes for each object, which describe the object’s position, orientation, and dimensions. The dataset consists of 15K annotated video clips supplemented with over 4M annotated images in the following categories: bikes, books, bottles, cameras, cereal boxes, chairs, cups, laptops, and shoes. In addition, to ensure geo-diversity, our dataset is collected from 10 countries across five continents. Along with the dataset, we are also sharing a 3D object detection solution for four categories of objects — shoes, chairs, mugs, and cameras. These models are trained using this dataset, and are released in MediaPipe, Google's open source framework for cross-platform customizable ML solutions for live and streaming media.

Key Features

  • 15000 annotated videos and 4M annotated images
  • All samples include high-res images, object pose, camera pose, point-cloud, and surface planes.
  • Ready-to-use examples in various tf.record formats, which can be used in TensorFlow/PyTorch.
  • Object-centric multi-views, observing the same object from different angles.
  • Accurate evaluation metrics, like 3D IoU for oriented 3D bounding boxes.

Dataset Format

The data is stored in the objectron bucket on Google Cloud storage. Check out the Download Data notebook for a quick review of how to download/access the dataset. The following assets are available:

  • The video sequences (located in /videos/class/batch-i/j/video.MOV files)
  • The annotation labels containing the 3D bounding boxes for objects. The annotation protobufs are located in /annotations/class/batch-i/j.pbdata files. They are formatted using object.proto. See the example for how to parse the annotation files.
  • AR metadata (such as camera poses, point clouds, and planar surfaces), located in /videos/class/batch-i/j/geometry.pbdata files. They are based on a_r_capture_metadata.proto. See the example for how to parse these files.
  • Processed dataset: sharded and shuffled tf.records of the annotated frames, in tf.example format and videos in tf.SequenceExample format. These are used for creating the input data pipeline to your models. These files are located in /v1/records_shuffled/class/ and /v1/sequences/class/.
  • Supporting scripts to run evaluation based on the 3D IoU metric.
  • Supporting scripts to load the data into TensorFlow, JAX, and PyTorch and to visualize the dataset, including “Hello World” examples.
  • Supporting Apache Beam jobs to process the datasets on Google Cloud infrastructure.
  • The index of all available samples, as well as train/test splits for easy access and download.

Raw dataset size is 1.9TB (including videos and their annotations). Total dataset size is 4.4TB (including videos, records, sequences, etc.). This repository provides the required schemas and tools to parse the dataset.
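As a quick illustration of the layout above, the sketch below pulls one video and its annotation over the bucket's public HTTPS endpoint. The batch/sequence ids are placeholders; any entry listed in the per-class index files should work.

import urllib.request

# Public HTTPS mirror of the gs://objectron bucket.
BASE = "https://storage.googleapis.com/objectron"

# Placeholder batch/sequence ids -- pick real ones from the index files.
video_url = f"{BASE}/videos/chair/batch-1/0/video.MOV"
annotation_url = f"{BASE}/annotations/chair/batch-1/0.pbdata"

urllib.request.urlretrieve(video_url, "video.MOV")
urllib.request.urlretrieve(annotation_url, "annotation.pbdata")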

class      bike   book   bottle  camera  cereal_box  chair  cup    laptop  shoe
#videos    476    2024   1928    815     1609        1943   2204   1473    2116
#frames    150k   576k   476k    233k    396k        488k   546k   485k    557k

Tutorials

License

Objectron is released under Computational Use of Data Agreement 1.0 (C-UDA-1.0). A copy of the license is available in this repository.

BibTeX

If you found this dataset useful, please cite our paper.

@article{objectron2021,
  title={Objectron: A Large Scale Dataset of Object-Centric Videos in the Wild with Pose Annotations},
  author={Adel Ahmadyan and Liangkai Zhang and Artsiom Ablavatski and Jianing Wei and Matthias Grundmann},
  journal={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  year={2021}
}

This is not an officially supported Google product. If you have any questions, you can email us at [email protected] or join our mailing list at [email protected].

objectron's People

Contributors

ahmadyan, chadcyb, daeyun, jianingwei, jinlinyi, soskek


objectron's Issues

import torch_xla error

ImportError Traceback (most recent call last)

in ()
11
12 # imports the torch_xla package
---> 13 import torch_xla
14 import torch_xla.core.xla_model as xm
15

/usr/local/lib/python3.6/dist-packages/torch_xla/__init__.py in ()
103 import torch
104 from ._patched_functions import _apply_patches
--> 105 import _XLAC
106
107

ImportError: /usr/local/lib/python3.6/dist-packages/_XLAC.cpython-36m-x86_64-linux-gnu.so: undefined symbol: ZN2at19slow_conv_dilated3dERKNS_6TensorES2_N3c108ArrayRefIlEES2_S5_S5_S5

Questions about the preprocessed dataset and the data used in the paper

Hi there,

I was wondering about the information included in your preprocessed dataset. You have already provided an API to extract the info, which is really convenient, but I am not sure exactly which frames are included in your dataset. I guess you extracted every frame from those videos and then shuffled them?

Then my question is how to have a fair comparison with the result presented in your paper.

MobilePose (https://arxiv.org/pdf/2003.03522.pdf) mentioned that
We only accepted one clip for one or a pair of shoes, and hence, the objects are completely different from clip to clip. Among the clips, 1500 were randomly selected for training, and the rest 300 were reserved for evaluation. Finally, considering adjacent frames from the same clip are very similar, we randomly selected 100K images for training, and 1K images for evaluation.

I am not sure how to randomly pick frames just as you did. Would it be fair to take the first 1K from your preprocessed dataset?

"Objection: A Large Scale Dataset of Object-Centric Videos in the Wild with Pose Annotations" presented more results on all the categories but it did not state what training/testing data were involved.

Hope you could give me more ideas. Thanks a lot.

Projecting detected planes into image coordinates

Hi, thanks for the cool dataset!

I have been tinkering with objectron-geometry-tutorial.ipynb, exploring the available metadata. I haven't been able to successfully transform the extracted planes into image space for visualization. I tried using the same procedure by which the bounding box coordinates are projected into image pixels, but that doesn't seem to work, since I get many unreasonable values, e.g. values that are negative or much larger than the image bounds.

Here's the code that I used:

plane_points = np.array([[v.x,v.y,v.z,1] for v in plane.geometry.vertices])
plane_points_3d_world = transform @ plane_points.T
plane_points_3d_cam = frame_view_matrix @ plane_points_3d_world
plane_points_2d_proj = frame_projection_matrix @ plane_points_3d_cam

plane_points2d_ndc = plane_points_2d_proj[:-1, :] / plane_points_2d_proj[-1, :]
plane_points2d_ndc = plane_points2d_ndc.T

x = plane_points2d_ndc[:, 1]
y = plane_points2d_ndc[:, 0]
plane_points2d = np.copy(plane_points2d_ndc)
plane_points2d[:, 0] = ((1 + x) * 0.5) * width
plane_points2d[:, 1] = ((1 + y) * 0.5) * height

plane_points2d = np.round(plane_points2d).astype(np.int32)
for point_id in range(plane_points2d.shape[0]):
    cv2.circle(image, (plane_points2d[point_id, 0], plane_points2d[point_id, 1]), 25, (0, 255, 255), -1)

Also, there's a small bug in the notebook in the definition of grab_frame. The line

current_frame = np.frombuffer(
        pipe.stdout.read(frame_size), dtype='uint8').reshape(width, height, 3)

has width and height transposed.
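In other words, the reshape presumably needs to be:

current_frame = np.frombuffer(
        pipe.stdout.read(frame_size), dtype='uint8').reshape(height, width, 3)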

Thanks for any help you can provide!

2D bounding boxes

Is there a way to extract accurate 2D bounding box data from the 3D bounding boxes?
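One naive approach, sketched below, is to take the axis-aligned extent of the nine projected 2D keypoints (center + 8 corners); as a later issue notes, this can be quite loose for some views.

import numpy as np

def bbox_2d_from_keypoints(keypoints_2d):
    """keypoints_2d: (9, 2) array of projected box keypoints in pixels.
    Returns (xmin, ymin, xmax, ymax) of the enclosing 2D box."""
    xmin, ymin = keypoints_2d.min(axis=0)
    xmax, ymax = keypoints_2d.max(axis=0)
    return xmin, ymin, xmax, ymax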

Inconsistency between records and records_shuffled

It seems that some tf.records are present in the records_shuffled directory but not in records. I believe this is an unintended discrepancy. Essentially, out of 10532 tf.record files in records_shuffled only 10448 remain in records. You can investigate the 84 missing records with the following excerpt:

import tensorflow as tf

def fetchFileNames(dir_names):
    filepaths = []
    for name in dir_names:
        filepaths += tf.io.gfile.glob(f"{name}/*")
    return filepaths


record_dirs = tf.io.gfile.glob("gs://objectron/v1/records/*")
record_filepaths = fetchFileNames(record_dirs)
shuffled_dirs = tf.io.gfile.glob("gs://objectron/v1/records_shuffled/*")
shuffled_filepaths = fetchFileNames(shuffled_dirs)

assert len(record_filepaths) < len(shuffled_filepaths)

shuffled_filepaths = [fp.replace("_shuffled", "") for fp in shuffled_filepaths]
record_filepaths = set(record_filepaths)
shuffled_filepaths = set(shuffled_filepaths)
missing = shuffled_filepaths - record_filepaths

These are the missing filepaths:

gs://objectron/v1/records/camera/camera_test-00137-of-00163
gs://objectron/v1/records/laptop/summary.txt
gs://objectron/v1/records/cereal_box/cereal_box_train-00169-of-00819
gs://objectron/v1/records/chair/chair_train-00953-of-01106
gs://objectron/v1/records/chair/chair_train-00526-of-01106
gs://objectron/v1/records/cereal_box/cereal_box_train-00192-of-00819
gs://objectron/v1/records/camera/camera_train-00434-of-00552
gs://objectron/v1/records/cereal_box/cereal_box_train-00080-of-00819
gs://objectron/v1/records/cereal_box/cereal_box_train-00174-of-00819
gs://objectron/v1/records/camera/camera_test-00101-of-00163
gs://objectron/v1/records/cereal_box/cereal_box_train-00152-of-00819
gs://objectron/v1/records/chair/chair_train-00833-of-01106
gs://objectron/v1/records/cereal_box/cereal_box_train-00284-of-00819
gs://objectron/v1/records/cereal_box/cereal_box_test-00228-of-00322
gs://objectron/v1/records/cereal_box/cereal_box_test-00063-of-00322
gs://objectron/v1/records/bottle/bottle_train-00215-of-00920
gs://objectron/v1/records/shoe/summary.txt
gs://objectron/v1/records/cereal_box/cereal_box_test-00213-of-00322
gs://objectron/v1/records/camera/camera_train-00539-of-00552
gs://objectron/v1/records/bottle/bottle_train-00273-of-00920
gs://objectron/v1/records/camera/camera_train-00140-of-00552
gs://objectron/v1/records/camera/camera_train-00463-of-00552
gs://objectron/v1/records/cereal_box/cereal_box_train-00036-of-00819
gs://objectron/v1/records/cereal_box/cereal_box_train-00025-of-00819
gs://objectron/v1/records/cup/summary.txt
gs://objectron/v1/records/cereal_box/cereal_box_test-00247-of-00322
gs://objectron/v1/records/camera/camera_train-00013-of-00552
gs://objectron/v1/records/camera/camera_train-00252-of-00552
gs://objectron/v1/records/camera/camera_train-00408-of-00552
gs://objectron/v1/records/camera/camera_train-00440-of-00552
gs://objectron/v1/records/camera/camera_train-00148-of-00552
gs://objectron/v1/records/cereal_box/cereal_box_train-00010-of-00819
gs://objectron/v1/records/camera/camera_test-00152-of-00163
gs://objectron/v1/records/chair/chair_train-00947-of-01106
gs://objectron/v1/records/cereal_box/cereal_box_train-00249-of-00819
gs://objectron/v1/records/chair/chair_train-00207-of-01106
gs://objectron/v1/records/chair/chair_train-00647-of-01106
gs://objectron/v1/records/camera/summary.txt
gs://objectron/v1/records/camera/camera_train-00040-of-00552
gs://objectron/v1/records/chair/chair_train-01068-of-01106
gs://objectron/v1/records/chair/chair_train-01087-of-01106
gs://objectron/v1/records/chair/chair_train-01048-of-01106
gs://objectron/v1/records/camera/camera_test-00018-of-00163
gs://objectron/v1/records/chair/summary.txt
gs://objectron/v1/records/cereal_box/cereal_box_train-00095-of-00819
gs://objectron/v1/records/chair/chair_train-00361-of-01106
gs://objectron/v1/records/camera/camera_train-00474-of-00552
gs://objectron/v1/records/camera/camera_train-00452-of-00552
gs://objectron/v1/records/camera/camera_train-00282-of-00552
gs://objectron/v1/records/cereal_box/cereal_box_train-00237-of-00819
gs://objectron/v1/records/chair/chair_train-01097-of-01106
gs://objectron/v1/records/bottle/bottle_train-00746-of-00920
gs://objectron/v1/records/camera/camera_train-00256-of-00552
gs://objectron/v1/records/cereal_box/cereal_box_test-00103-of-00322
gs://objectron/v1/records/chair/chair_train-00444-of-01106
gs://objectron/v1/records/chair/chair_train-00904-of-01106
gs://objectron/v1/records/cereal_box/cereal_box_test-00160-of-00322
gs://objectron/v1/records/chair/chair_train-01090-of-01106
gs://objectron/v1/records/camera/camera_train-00073-of-00552
gs://objectron/v1/records/cereal_box/cereal_box_train-00050-of-00819
gs://objectron/v1/records/camera/camera_train-00111-of-00552
gs://objectron/v1/records/cereal_box/cereal_box_test-00049-of-00322
gs://objectron/v1/records/chair/chair_train-00480-of-01106
gs://objectron/v1/records/chair/chair_train-01023-of-01106
gs://objectron/v1/records/cereal_box/summary.txt
gs://objectron/v1/records/camera/camera_train-00509-of-00552
gs://objectron/v1/records/book/summary.txt
gs://objectron/v1/records/cereal_box/cereal_box_train-00115-of-00819
gs://objectron/v1/records/bottle/summary.txt
gs://objectron/v1/records/cereal_box/cereal_box_train-00068-of-00819
gs://objectron/v1/records/cereal_box/cereal_box_test-00134-of-00322
gs://objectron/v1/records/chair/chair_train-00873-of-01106
gs://objectron/v1/records/chair/chair_train-00197-of-01106
gs://objectron/v1/records/chair/chair_train-00350-of-01106
gs://objectron/v1/records/camera/camera_test-00038-of-00163
gs://objectron/v1/records/cereal_box/cereal_box_test-00073-of-00322
gs://objectron/v1/records/camera/camera_train-00304-of-00552
gs://objectron/v1/records/bike/summary.txt
gs://objectron/v1/records/camera/camera_test-00046-of-00163
gs://objectron/v1/records/camera/camera_train-00396-of-00552
gs://objectron/v1/records/camera/camera_train-00062-of-00552
gs://objectron/v1/records/cereal_box/cereal_box_train-00255-of-00819
gs://objectron/v1/records/camera/camera_test-00048-of-00163

Ideally, the assertion statement in the gist above would fail and the number of records in these two directories in the bucket would be equal.

Possible bug in from_transformation in box.py

Hi, I noticed that in the class method from_transformation in box.py, the transformation itself is thrown away and only the points are kept when calling cls(vertices=vertices). That cannot be right, can it? For example, in IoU computation, the boxes are transformed by these transforms, so when they are not set it leads to different results.

Best,
Matous

Best practices for transitioning to PyTorch

Hi,

First of all, thanks for the dataset!

I have gone through your Jupyter notebooks and am now wondering what the most efficient pipeline (best practice) is for training models with PyTorch. In your PyTorch notebook you 'just' sample and visualize individual examples but do not create a PyTorch dataset object or anything else that would allow sampling data batches. Is there an efficient way to transition to the usual PyTorch workflows for training models?

Thanks! :)
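Not an official answer, but one common pattern is to stream the shuffled tf.records through tf.data and wrap that in a torch IterableDataset. A minimal sketch, assuming only the 'image/encoded' feature that is mentioned elsewhere in this repo:

import tensorflow as tf
import torch
from torch.utils.data import IterableDataset, DataLoader

class ObjectronTFRecordDataset(IterableDataset):
    """Streams frames out of Objectron tf.record shards into PyTorch."""

    def __init__(self, file_pattern):
        self.files = tf.io.gfile.glob(file_pattern)

    def __iter__(self):
        dataset = tf.data.TFRecordDataset(self.files)
        for raw in dataset:
            example = tf.io.parse_single_example(
                raw, {"image/encoded": tf.io.FixedLenFeature([], tf.string)})
            image = tf.io.decode_image(example["image/encoded"]).numpy()
            # HWC uint8 -> CHW float tensor in [0, 1]
            yield torch.from_numpy(image).permute(2, 0, 1).float() / 255.0

# Usage sketch:
# loader = DataLoader(
#     ObjectronTFRecordDataset("gs://objectron/v1/records_shuffled/chair/chair_train*"),
#     batch_size=8)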

Some sequence shards have missing fields

When going over the sequence shards, some of them have missing fields. With the loading done as described in the example script, the code crashes with the following error message:

InvalidArgumentError: Name: <unknown>, Feature list 'image/encoded' is required but could not be found.  Did you mean to include it in feature_list_dense_missing_assumed_empty or feature_list_dense_defaults?
	 [[{{node ParseSingleSequenceExample/ParseSequenceExample/ParseSequenceExampleV2}}]]

And the faulty shards I found by iterating over the entire dataset are (the missing fields differ between the affected shards):

objectron/sequences/book/book_train-00020-of-01324
objectron/sequences/book/book_train-00476-of-01324
objectron/sequences/book/book_train-01231-of-01324
objectron/sequences/book/book_train-01256-of-01324
objectron/sequences/bottle/bottle_train-00559-of-01320
objectron/sequences/chair/chair_train-00365-of-01274

Error parsing annotation

I get google.protobuf.message.DecodeError: Error parsing message or RuntimeWarning: Unexpected end-group tag: Not all data was converted (and end up with an empty list of annotations) for all the annotations I tested in the bike data.

Code for reproduction

# The schema import below follows the parsing tutorial's naming (treat the exact
# module path as an assumption); grab_frame and get_frame_annotation are helper
# functions defined in that tutorial and are assumed to be in scope.
from objectron.schema import annotation_data_pb2 as annotation_protocol

if __name__ == "__main__":
    batch_id = 0
    seq_id = 10
    video_filename = f'/data/objectron/train/videos/bike/batch-{batch_id}/{seq_id}/video.MOV'
    annotation_file = f'/data/objectron/train/annotations/bike/batch-{batch_id}/{seq_id}.pbdata'
    # Along with the video.MOV file, there is a geometry.pbdata file that contains
    # the geometry information of the scene (such as camera poses, point-clouds, and surfaces).
    # There is a copy of this container within each annotation protobuf too.
    geometry_filename = f'/data/objectron/train/videos/bike/batch-{batch_id}/{seq_id}/geometry.pbdata'  # a.k.a. AR metadata
    frame_id = 100
    with open(annotation_file, 'rb') as pb:
        data = pb.read()
        sequence = annotation_protocol.Sequence()
        sequence.ParseFromString(data)
        frame = grab_frame(video_filename, [frame_id])
        annotation, cat, num_keypoints, types = get_frame_annotation(sequence, frame_id)

What could be wrong? Are there any specific version requirements?

Annotation download error

Hi,

The following command throws an HTTP 404 error when downloading annotations for video laptop/batch-39/38:
wget https://storage.googleapis.com/objectron/annotations/laptop/batch-39/38.pbdata

I am able to download other annotations in the same batch with no issues. For example, the following command is fine:
wget https://storage.googleapis.com/objectron/annotations/laptop/batch-39/9.pbdata

Do you know what's going on here?

Thanks!

Bounding box point id consistency

Hi,

Thank you for releasing such a comprehensive set of tools to work with the data!

Are the point ids in the 3D bounding box annotations consistent within a video sequence?

Thanks!

Is some of the data supposed to be size 0?

I noticed that some of the data in the objectron bucket is empty (size 0). Is this done on purpose, or was it a bad upload?

gsutil ls -lh gs://objectron/videos/shoe/batch-93/9
0 B  2020-10-29T05:00:04Z  gs://objectron/videos/shoe/batch-93/9/geometry.pbdata
0 B  2020-10-29T05:00:04Z  gs://objectron/videos/shoe/batch-93/9/id.txt
0 B  2020-10-29T05:00:04Z  gs://objectron/videos/shoe/batch-93/9/video.MOV
TOTAL: 3 objects, 0 bytes (0 B)

2D bounding boxes request

As mentioned in #46, directly fitting a 2D bbox to the projected vertices of the 3D bbox can be very inaccurate. For example, in the image below the actual 2D bbox is shown in green and the fitted one in red.

Is there a better way to get the 2D bbox?

image

Question about the metric in the evaluation code

Hi there,

I think there is a problem in your evaluation code: you do not count the case where there is no prediction for a given input. So the final number may not fully reflect the truth.

for boxes, label, plane in zip(results, labels, planes):
  instances = label['2d_instance']
  instances_3d = label['3d_instance']
  visibilities = label['visibility']
  num_instances = 0
  for instance, instance_3d, visibility in zip(
      instances, instances_3d, visibilities):
    if (visibility > self._vis_thresh and
        self._is_visible(instance[0]) and instance_3d[0, 2] < 0):
      num_instances += 1
  # We don't have negative examples in evaluation.
  if num_instances == 0:
    continue
  iou_hit_miss = metrics.HitMiss(self._iou_thresholds)
  azimuth_hit_miss = metrics.HitMiss(self._azimuth_thresholds)
  polar_hit_miss = metrics.HitMiss(self._polar_thresholds)
  pixel_hit_miss = metrics.HitMiss(self._pixel_thresholds)
  num_matched = 0
  for box in boxes:
    box_point_2d, box_point_3d = box
    index = self.match_box(box_point_2d, instances, visibilities)
    if index >= 0:
      num_matched += 1
      pixel_error = self.evaluate_2d(box_point_2d, instances[index])
      # If you only compute the 3D bounding boxes from RGB images,
      # your 3D keypoints may be upto scale. However the ground truth
      # is at metric scale. There is a hack to re-scale your box using
      # the ground planes (assuming your box is sitting on the ground).
      # However many models learn to predict depths and scale correctly.
      # scale = self.compute_scale(box_point_3d, plane)
      # box_point_3d = box_point_3d * scale
      azimuth_error, polar_error, iou = self.evaluate_3d(box_point_3d, instances_3d[index])
      iou_hit_miss.record_hit_miss(iou)
      pixel_hit_miss.record_hit_miss(pixel_error, greater=False)
      azimuth_hit_miss.record_hit_miss(azimuth_error, greater=False)
      polar_hit_miss.record_hit_miss(polar_error, greater=False)
  if num_matched > 0:
    self._iou_ap.append(iou_hit_miss, num_instances)
    self._pixel_ap.append(pixel_hit_miss, num_instances)
    self._azimuth_ap.append(azimuth_hit_miss, num_instances)
    self._polar_ap.append(polar_hit_miss, num_instances)
    self._matched += num_matched

In your code snippet, the instance represents ground truth while the box represents prediction. You try to match each prediction with one ground truth. But if there is no prediction (which means no match), you just skip this case. I think you should instead record that case as missing targets (you should still add the num_instances but do not update tp & fp).
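If I read the suggestion correctly, the proposed change is roughly the following sketch (untested; it assumes the AP accumulators tolerate an empty hit/miss record):

# Always report the ground-truth instances, even when nothing was matched,
# so that unmatched images still count against recall.
self._iou_ap.append(iou_hit_miss, num_instances)
self._pixel_ap.append(pixel_hit_miss, num_instances)
self._azimuth_ap.append(azimuth_hit_miss, num_instances)
self._polar_ap.append(polar_hit_miss, num_instances)
self._matched += num_matched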

Coordinate System Conventions

Can someone please clarify the conventions for the world-to-camera and camera projection transforms? In particular:

  • What is the world coordinate system for this dataset? In a previous issue it was mentioned that +X is down, +Y is to the right and +Z is outwards from the screen. Is this correct? Or is it +Y up and a left handed system (i.e. +X right, +Z into the screen) as mentioned in the paper?
  • Are the view_matrix and projection_matrix given assuming this world convention?
  • Is the projection_matrix given in terms of NDC or screen? i.e. do we need to convert fx, fy using the image width/height to get NDC values?

Loss function details

According to Table 3 in the Objectron paper, the loss function used in the two-stage pipeline is "Per vertex MSE normalized on diagonal edge length". I am trying to understand the different parts in this sentence. Could you share the equation or pseudo-code corresponding to this? It will help make the computation explicit. Thanks a lot!
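For what it is worth, one plausible reading of that phrase, offered purely as a guess until the authors confirm, is sketched below; using the ground-truth box diagonal as the normalizer is an assumption.

import numpy as np

def per_vertex_mse_normalized(pred_vertices, gt_vertices):
    """pred_vertices, gt_vertices: (9, 3) box keypoints of one object.
    Hypothetical reading: mean squared vertex error divided by the
    length of the ground-truth box diagonal."""
    diagonal = np.linalg.norm(gt_vertices.max(axis=0) - gt_vertices.min(axis=0))
    return np.mean(np.sum((pred_vertices - gt_vertices) ** 2, axis=-1)) / diagonal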

Question about OBJECT_ORIENTATION

Can someone explain why the feature "OBJECT_ORIENTATION" has 9 values?

Separately,
the class method box.fit returns orientation (3 values), translation, and scale.
I guess this method essentially returns the parts of a transformation matrix which transforms a scaled, axis-aligned cube to the bounding box. If this is the case, what does orientation represent?
My guess: the angles (or cosines of the angles) by which the box has to be rotated relative to the three axes, i.e. the first value is the cosine of the angle between the x-axis and the line joining the centre of the box to the origin, and so forth?

Can you please confirm?

Question about the scale retrieval process

Hi there,

Thanks for your great work. It is really inspiring. I am curious about the scale retrieval process and I found something in your code.

  def compute_scale(self, box, plane):
    """Computes scale of the given box sitting on the plane."""
    center, normal = plane
    vertex_dots = [np.dot(vertex, normal) for vertex in box[1:]]
    vertex_dots = np.sort(vertex_dots)
    center_dot = np.dot(center, normal)
    scales = center_dot / vertex_dots[:4]
    return np.mean(scales)


I am a little bit confused about the meaning of those steps. Could you explain them?

Thank you so much.

Question about evaluation results

Hi!
I get predicted 3D bounding boxes following https://google.github.io/mediapipe/solutions/objectron.html, and I can also read the ground-truth 3D bounding box (via features.FEATURE_NAMES["POINT_3D"]) from the tf.record files following https://github.com/google-research-datasets/Objectron/blob/master/notebooks/Hello%20World.ipynb. However, I get a small 3D IoU value when evaluating the 3D IoU metric.

I am not sure if there is anything wrong in my evaluation procedure. Could you provide the evaluation results for each sequence?

I am looking forward to your reply. Thank you so much.

Develop 3D Object Detection for (delivery) boxes

Hello,

I need to develop a solution to detect and track certain kinds of boxes, such as delivery boxes.

How should I start here? I read that there is already relevant data in the dataset, but how do I build on top of that to detect boxes in 3D?

Example:
image

I really appreciate your help.

Thank you.

Queries about Objectron Features

Greetings! I am new to Objectron and am just experimenting with it. I went through your notebooks on how to load the data from different sources; specifically, using TFRecords (either through TensorFlow or PyTorch).
I have a few queries, which may sound silly; please bear with me if they are!

  • One, I noticed that there is a feature in the data when the TensorFlow pipeline is implemented, "image/alpha". The schema mentions that this is the segmentation mask of the image. However, this feature is not present in the PyTorch pipeline. I'm reading in the data using the PyTorch XLA library as mentioned in the PyTorch notebook, and there are no errors; however, I can't seem to find this feature in the PyTorch pipeline. I understand that the same TFRecords are being used in both pipelines, so I can't figure out what the matter is here.
  • Two, I was wondering what the feature point_num is used for. The schema mentions that it's "a list of point numbers for each instance". Is this any different from the variable NUM_KEYPOINTS defined in both notebooks?
  • Three, I ran across one record ("gs://objectron/v1/records/bike/bike_train-00073-of-00378"), where the above feature, point_num, was actually a tensor containing 2 elements, and not 1 as was the case in the TFRecords preceding it. What is the meaning of this?

Found masks appearing in the data video?

I downloaded videos/bike/batch-10/3/video.MOV to see what the data would be like.
However, seemingly random masks keep appearing in the video. I cannot quite work out what they are for; is it for augmentation or something?

image

The questions about bike data

Hi, Thank you for this dataset.
The number of bike videos differs between the README and the index/bike_annotations file.
The README says the bike dataset has 476 videos.
But the index/bike_annotations file has 472 lines.
May I get the missing data?

How to project 3D point annotations to 2D with the CAMERA INTRINSIC MATRIX (instead of the PROJECTION MATRIX)?

According to the code provided in the repo,
Each annotation frame contains the keypoints in 3D in the camera coordinates, as well as their 2D-projection + depth in image coordinate. You can also get the box's 9-DoF parameters (rotation, translation, and scale) too.

However, the projection implementation does not use the intrinsic matrix at all; instead it uses a view_matrix as well as the image size, which seems quite counter-intuitive to me.

Simply put, what should I do if I want to use the camera intrinsic matrix to project the provided 3D point annotations (3D box corners + center) into the 2D image? It matters because you cannot solve for the object pose with a 4x4 view_matrix using PnP.
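In case it helps frame the question, a generic pinhole-projection sketch (not specific to Objectron's conventions; it assumes the points are already in camera coordinates and that the intrinsics match the image resolution):

import numpy as np

def project_with_intrinsics(points_cam, K):
    """points_cam: (N, 3) points in camera coordinates; K: 3x3 intrinsic matrix.
    Returns (N, 2) pixel coordinates."""
    # Depending on the convention, a sign flip may be needed first
    # (e.g. the evaluation code above treats visible objects as having z < 0).
    p = (K @ points_cam.T).T
    return p[:, :2] / p[:, 2:3]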

Intrinsic camera matrices are not compatible with the 640x480 crops

I use the processed data in tf.SequenceExample format, and all the intrinsic camera matrices I load look a bit like this:
image

I assume the matrix is compatible with the Hartley and Zisserman definition:
image

Here, the values seem to be pixel-scale (which makes sense), but the principal point offsets are way beyond the image bounds (which makes no sense). I assume this is because the intrinsic matrix corresponds to the pre-cropped image (i.e. before the dataset is normalized to 640x480).

I tried to decompose the projection matrix to get the intrinsic matrix back, but did not get a sensible result. Is this a bug, or am I misinterpreting the content of that matrix?

I'm currently trying to find a way to get the new intrinsics from the data, but the only way I can now think of is to recalibrate using the provided 2D/3D correspondences. Would there be a simpler way?
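If the matrices indeed correspond to the original capture resolution, and if the 640x480 frames are a pure resize rather than a crop (both assumptions on my part), rescaling the intrinsics would look roughly like the sketch below; if the frames were also rotated from portrait to landscape, the axes would additionally need swapping.

import numpy as np

def rescale_intrinsics(K, orig_w, orig_h, new_w=640, new_h=480):
    """Scale fx/cx by the width ratio and fy/cy by the height ratio."""
    K = np.array(K, dtype=np.float64, copy=True)
    K[0, 0] *= new_w / orig_w   # fx
    K[0, 2] *= new_w / orig_w   # cx
    K[1, 1] *= new_h / orig_h   # fy
    K[1, 2] *= new_h / orig_h   # cy
    return K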

How to get the trained model for 'book' and 'cereal_box' category through mediapipe python API?

Hi, I'm a researcher working on a paper related to 6D object pose estimation. The method proposed in Objectron is an important baseline for us, so we hope to compare our method against it on our own dataset.

However, the models for 'book' and 'cereal_box' are not available through the MediaPipe Python API. Is there any way for us to obtain the models for these two categories?

Question about the category annotation

First, sincere thanks for your great effort on this large-scale dataset.

I follow the official tutorial to parse the annotations.pbdata files, but it seems that some videos' category annotations are mismatched.

Take shoe/batch-9/10, shoe/batch-30/9, and shoe/batch-27/5 as examples: the objects in these video sequences are shoes, and their category annotations should be shoe. But when I use sequence.objects.category to parse the category annotation, I get chair. Am I making a mistake, or do the category annotations mismatch the corresponding objects?

This problem keeps coming up when I randomly verify video sequences.

Thanks for your reply.

The questions about the shoe data

Hi, the shoe data does not distinguish left from right, is that correct? Is there anywhere I can find a label that distinguishes left and right shoes? Thanks!

Questions about evaluation (reproducing the results)

Hi,

Thank you for all the great work!

I tried to reproduce the evaluation results in the paper.
I used the released evaluation code
(https://github.com/google-research-datasets/Objectron/blob/master/objectron/dataset/eval.py)

I used the Python solution API
(mediapipe.solutions.objectron)
with static_image_mode=True.

I have set the correct image size, focal length, and principal point,
and I have re-scaled the 3D bounding boxes. I'm pretty sure I've done this correctly (I used the numbers from the dataset).

And I evaluated using the preprocessed image dataset here:
(gs://objectron/v1/records_shuffled/chair/chair_test*)

According to the paper, the average precision at 0.5 3D IoU for the chair category should be 0.8505;
however, I only got 0.5095.

The possible reasons I can think of:

  1. I haven't evaluated the video_mode
  2. The delta 0.8505 <-> 0.5095 is due to hdf5 vs TFlite

Are there any reasons I haven't thought of? Could you suggest some directions for closing the gap?
Thank you very much!
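For reference, this is my understanding of the setup being described, written as a sketch rather than a verified reproduction script; the parameter names follow the MediaPipe Python API as I understand it, and the intrinsics and frame below are placeholders:

import cv2
import mediapipe as mp

# Placeholder intrinsics -- in practice these come from the per-frame camera metadata.
fx, fy, px, py = 1500.0, 1500.0, 360.0, 480.0
width, height = 720, 960

objectron = mp.solutions.objectron.Objectron(
    static_image_mode=True,
    max_num_objects=5,
    min_detection_confidence=0.5,
    model_name='Chair',
    focal_length=(fx, fy),
    principal_point=(px, py),
    image_size=(width, height))

image = cv2.imread('frame.png')  # placeholder frame
results = objectron.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
for obj in (results.detected_objects or []):
    print(obj.rotation, obj.translation, obj.scale)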

how to evaluate symmetric objects?

Hi! I have trained a model on the bottle category. However, there seems to be no evaluation code for computing IoU for such rotationally symmetric objects, as described in the paper. Could you please share how to evaluate symmetric objects in detail?
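A common way to handle rotationally symmetric categories (and, if I read the paper correctly, roughly what it describes) is to rotate the predicted box about its symmetry axis in small increments and keep the best 3D IoU. A sketch, where rotate_about_axis and iou_3d are hypothetical helpers standing in for the box/IoU utilities in this repo:

import numpy as np

def symmetric_iou(pred_box, gt_box, n_rotations=100):
    """Best 3D IoU over rotations of the prediction about its (assumed
    vertical) symmetry axis. rotate_about_axis and iou_3d are hypothetical."""
    best = 0.0
    for angle in np.linspace(0.0, 2.0 * np.pi, n_rotations, endpoint=False):
        rotated = rotate_about_axis(pred_box, axis='y', angle=angle)
        best = max(best, iou_3d(rotated, gt_box))
    return best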

Error in annotation for cereal box

Hi, I use the code provided in the repo to generate images with the point cloud and annotation overlaid.

The results look quite convincing for the bike category, but they seem to fail on a cereal_box sequence: batch-11-17 (actually it fails for at least the first two sequences of 'cereal_box').

image

image

coefficients in post processing

Hello,

Thanks for the good work. I have a question about the post-processing part of the paper MobilePose: Real-Time Pose Estimation for Unseen Objects with Weak Shape Supervision, specifically the coefficients alpha(i,j) that are preserved under rigid transformation. Are these coefficients known? Also, where can we find the supplementary materials?

From the paper, I think we can get the 3D vertices from the 2D vertices and the camera pose via EPnP, but in the code in box.py the calculation needs 3D vertices?

def fit(cls, vertices):

Also, in the eval.py file,

def predict(self, images, batch_size):
do we also need to predict the 3D vertices?

Thanks so much

euler angles and rotation matrix

Which rotation matrix does the euler_angles field of ARCamera correspond to?

I calculated the rotation matrix from the Euler angles, but it does not equal the "transform" matrix or the "view_matrix".
Could you provide the calculation steps? Thank you.
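One way to narrow this down is to try several Euler-angle orders and compare against the rotation block of the transform or view_matrix; a sketch using scipy (the values below are placeholders, and the correct order/handedness is exactly what is in question here):

import numpy as np
from scipy.spatial.transform import Rotation as R

rx, ry, rz = 0.1, 0.2, 0.3   # placeholder euler_angles from ARCamera (radians)
view_matrix = np.eye(4)       # placeholder 4x4 matrix from the AR metadata

target = view_matrix[:3, :3]
for order in ['xyz', 'zyx', 'XYZ', 'ZYX', 'yxz', 'YXZ']:
    candidate = R.from_euler(order, [rx, ry, rz]).as_matrix()
    print(order, np.linalg.norm(candidate - target))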

how to get the bounding box in the world-coordinate system

I hope to transform the bounding box into the world coordinate system using the following formula you give:

rotation * scale * keypoints + translation

But I can't get the same result as the POINT_2D you have provided. I hope you can tell me how to get the correct result.
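Taking the quoted formula at face value, a sketch of applying it; whether the result lands in camera or world coordinates is an assumption here, so the last step (undoing the view matrix) may or may not be needed:

import numpy as np

# Placeholders -- in practice these come from the annotation / AR metadata.
rotation = np.eye(3)                 # 3x3 object rotation
scale = np.array([0.2, 0.3, 0.1])    # per-axis scale
translation = np.zeros(3)
keypoints = np.zeros((9, 3))         # unit-box keypoints (center + 8 corners)
view_matrix = np.eye(4)              # world -> camera

box = (rotation @ (keypoints * scale).T).T + translation

# If `box` comes out in camera coordinates, map it to world coordinates:
box_h = np.concatenate([box, np.ones((box.shape[0], 1))], axis=1)
box_world = (np.linalg.inv(view_matrix) @ box_h.T).T[:, :3]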

Near and far plane?

Hi,

thanks a lot for the great dataset! I was wondering if it is possible to get the approximate near and far plane for the frames? I couldn't find this in the geometry tutorial.

Thanks!

3D object pose estimation model capabilities

Hi,

I had a question about the 3D object detection model you proposed in your paper. In Section 5.2 of your paper you mention:

For each category, we trained the network separately without any pre-training or hyperparameter optimization.

Does this mean that one model can only predict the poses of objects from a certain category? If so, does this mean that when I want to estimate the pose of let's say a camera, I have to manually choose the model that works solely for the camera class?

In other words, a single model is not able to estimate poses of objects from different categories?

Using 2D keypoints for drawing 3D bounding box in flutter

I am currently running a TFLite MediaPipe model which outputs the 8 vertices and the center of the 3D bounding box
that describes the detected object. I would like to know how to draw a 3D bounding box in my Flutter app using the output below.

This is the current output tensor I am receiving from my TFLite model, which I want to use to draw the 3D bounding box:

flutter: [120.87956237792969, 115.73860168457031, 138.97604370117188, 187.78915405273438, 137.4835205078125, 84.2886734008789, 151.32508850097656, 165.74168395996094, 146.28330993652344, 51.38185501098633, 95.4119873046875, 181.07603454589844, 97.72759246826172, 77.48689270019531, 100.90084838867188, 155.7063751220703, 101.90467071533203, 40.651939392089844]

Testing and training frames

  1. Are the results presented in Table 2 in the Objectron paper for randomly selected 1K images in the test split or for the whole test split? I could not find this info in the Objectron paper but it is mentioned in the MobilePose paper, so wondering if it is still the same.

  2. Was the training done with randomly selected 100K images in the training split or the whole training split?

  3. If using randomly selected frames, do you have any suggestions for starting from the raw data with PyTorch, as my system does not support reading (shuffled) TFRecords? Or is any random selection fine?

  4. Did you use any data augmentation during training?

Thanks much!

Question about PLANE_CENTER and PLANE_NORMAL in objectron.schema.features

Thanks a lot for this great dataset.

I am now implementing my own Dataset class wrapper for Objectron, and I followed the tutorials Parse Annotations.ipynb and objectron-geometry-tutorial.ipynb to parse the raw annotations.pbdata and geometry.pbdata. When I use the evaluation code in objectron.dataset.eval, I notice that there is a feature named plane which is used to re-scale the predicted box. However, I couldn't parse the PLANE_NORMAL information from annotations.pbdata or geometry.pbdata.

I wonder how to get the PLANE_CENTER and PLANE_NORMAL information. Or does PLANE_CENTER mean the PlaneVector center defined in ARPlaneAnchor in a_r_capture_metadata.proto, while PLANE_NORMAL needs to be calculated from other information in the AR metadata?

Sincerely look forward to your reply.

Obtaining pose information (intrinsics and extrinsics related to frame of video)

Hello,

My team and I are currently pursuing a project on 3D object reconstruction, and this dataset comes as a boon at just the right time. For a successful reconstruction, the required components are:

  1. Camera intrinsics
    intrinsic_matrix

  2. Rotation and translation matrices (extrinsics) for each frame of the video.
    extrinsics_matrix

I understand that this information is encoded in the AR metadata protobuf file. I've tried to use the a_r_capture_metadata_pb2 module to decode this information as follows:

geometry_filename = '.../videos/class/batch-i/j/geometry.pbdata'# a.k.a. AR metadata

from objectron.schema import a_r_capture_metadata_pb2 as ar_metadata_protocol

# This does not work! ParseFromString can't read binary string as it does for annotation file in the example given.
with open(geometry_filename, 'rb') as pb:
    obj = ar_metadata_protocol.ARCamera()
    obj.ParseFromString(pb.read())
    print(obj.intrinsics)

I am stuck on making sense of how to extract the required data above from these protobuf geometry files. Could someone please point me in the right direction? Much appreciated!

Regards,
Niraj Pandkar
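Not an official answer, but the geometry tutorial in this repo reads geometry.pbdata as a stream of length-prefixed ARFrame messages rather than a single proto; a sketch of that pattern (the exact framing, a little-endian uint32 length prefix, and the field names are assumptions to be checked against a_r_capture_metadata.proto):

import struct
from objectron.schema import a_r_capture_metadata_pb2 as ar_metadata_protocol

frames = []
with open(geometry_filename, 'rb') as pb:
    data = pb.read()
    i = 0
    while i < len(data):
        # Each ARFrame message is assumed to be prefixed with its byte length.
        msg_len = struct.unpack('<I', data[i:i + 4])[0]
        i += 4
        frame = ar_metadata_protocol.ARFrame()
        frame.ParseFromString(data[i:i + msg_len])
        i += msg_len
        frames.append(frame)

print(frames[0].camera.intrinsics)    # per-frame intrinsics
print(frames[0].camera.view_matrix)   # extrinsics for that frame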

Data download from gsutil

Hello,

First of all I would like to thank the entire team involved in this project. This dataset seems absolutely fantastic.
However, I have a small problem related to downloading it: I use gsutil for this and end up with a set of split (sharded) files: bike_test-00000-of-00094 ... bike_test-00094-of-00094
What should I use to merge them?

I am sorry if this question is naive or if the answer already appears somewhere in the repo, but I could not find a clear solution anywhere.

Thank you very much.

Camera Tracking

Is the code for doing the camera tracking available anywhere?

Potential issue in the evaluation code

Thanks a lot for this great dataset.

My colleague @swtyree and I have taken a close look at your evaluation code and found some potential issues in it.

  1. The first issue is about the visibility label.

Although the dataset provides the visibility values, the obtained index is not used to filter them. As a result, the lengths may not match, because label[VISIBILITY] includes the entries for objects that are below the visibility threshold while label[LABEL_INSTANCE] does not.

Before:

label[VISIBILITY] = visibilities
index = visibilities > self._vis_thresh

I think it should be:

index = visibilities > self._vis_thresh
label[VISIBILITY] = visibilities[index]

Here is the corresponding part from the evaluation code:
https://github.com/google-research-datasets/Objectron/blob/master/objectron/dataset/parser.py#L50-L53

  2. The second issue is with the calculation of average precision.

I found that the testing order of the images would affect the final result.

The original process in the classification/segmentation works has an important step which sorts the results by the predicted confidence. See https://github.com/ShawnNew/Detectron2-CenterNet/blob/master/detectron2/evaluation/pascal_voc_evaluation.py#L243-L246. However, I did not find it in your code. I am not sure if you assumed the tested methods would do that somewhere else, or you just fixed the order of testing images.

Here is the corresponding part from the evaluation code: https://github.com/google-research-datasets/Objectron/blob/master/objectron/dataset/metrics.py#L86-L98

It is similar to the process used in pascal_voc_evaluation: https://github.com/ShawnNew/Detectron2-CenterNet/blob/master/detectron2/evaluation/pascal_voc_evaluation.py#L290-L299

I am looking forward to your reply. Thank you so much.
