Comments (10)
Hi everyone,
Thanks for your question and sorry for the late response. The IMU signal corresponds to 10-second clips; this is a typo in the appendix that will be fixed in the coming revision of the paper. For the aligned video, we sample 2 frames at the center of the window.
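For concreteness, a minimal sketch of that windowing (the helper name, frame rate, and indexing are illustrative and assume the 200 Hz Ego4D sample rate discussed below; this is not from the released code):

```python
IMU_HZ = 200                  # Ego4D IMU: one sample every 5 ms
CLIP_SECONDS = 10             # per the clarification above
T = IMU_HZ * CLIP_SECONDS     # 2000 samples

def extract_window(imu, frames, center_sample, fps=30):
    # imu: (6, N) accel+gyro stream; frames: list of decoded video frames.
    half = T // 2
    imu_clip = imu[:, center_sample - half : center_sample + half]  # (6, 2000)
    # Take 2 frames at the center of the 10 s window.
    center_frame = int(center_sample / IMU_HZ * fps)
    return imu_clip, [frames[center_frame], frames[center_frame + 1]]
```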
from imagebind.
Oh yes, of course - I did not mean for this to be a final answer, just trying to help out/start a discussion since it has been a while without a response 🥲.
Yes, they do provide source code, but once again, the input length is 1000, corresponding to 5-second clips.
For my use case I tried the following to account for the 2x factor: padding with zeros, grabbing 10-second clips, and the "repeat" method; the repeat method seemed to work best. I hope this helps get your application moving.
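In code, the two padding workarounds look roughly like this - a sketch of what I tried, not FAIR's documented pipeline (the third option is simply sampling a real 10-second clip):

```python
import torch

T_TARGET = 2000  # the IMU encoder only accepts (6, 2000): 10 s at 200 Hz

def stretch_5s_clip(imu_5s, method="repeat"):
    # imu_5s: (6, 1000), i.e. a 5 s clip at 200 Hz.
    if method == "zero":
        # Pad the missing half with zeros.
        pad = imu_5s.new_zeros(imu_5s.shape[0], T_TARGET - imu_5s.shape[1])
        return torch.cat([imu_5s, pad], dim=1)
    if method == "repeat":
        # Tile the 5 s signal twice along time (worked best for me).
        return imu_5s.repeat(1, 2)
    raise ValueError(f"unknown method: {method}")
```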
from imagebind.
Based on that sample from the Ego4D dataset (https://ego4d-data.org/docs/data/imu/), the sample rate is 200 Hz (5 ms per time step). If only T=2000 works, does this mean they expect the clips to correspond to a 10-second video segment?
However, they mention this in the paper:
> For each video, we select all time-stamps that contain a synchronized IMU signal as well as aligned narrations. We sample 5 second clips around each time-stamp.
So there seems to be a 2x ratio lost somewhere?
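The arithmetic behind the discrepancy, for anyone skimming:

```python
SAMPLE_RATE_HZ = 200   # Ego4D IMU: one sample every 5 ms
EXPECTED_T = 2000      # the only input length the encoder accepts
print(EXPECTED_T / SAMPLE_RATE_HZ)  # 10.0 s per clip, not the 5 s quoted from the paper
```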
from imagebind.
I agree - I am just making the conjecture that, since we want image-IMU alignments for training, if this is the procedure for image padding, it could work for IMU padding to maintain the alignment - even though it is nowhere to be found in the code/paper. It is worth a try. Another option would be to sample 10s clips - but that seems to directly contradict the paper.
Grabbing a 10s video clip and aligning it with the 5s IMU could make sense, given that there may be a small 1-2s misalignment between IMU and video due to various factors (e.g. latency).
Now... this is all a guess! I tried this method for action recognition (see the IMU2CLIP paper) and it seemed to work decently. However, I cannot say for sure whether it is the right way to go.
from imagebind.
It seems that we are supposed to use repeated padding?
`PadIm2Video(pad_type="repeat", ntimes=2)`
from imagebind.
> It seems that we are supposed to use repeated padding?
But that's for the image-to-video transformation (the forward() method). It converts a single image into an n-time-step video, either by copying the same image to every time step (pad_type="repeat") or by using zeros/black frames (pad_type="zero") to create the video sequence.
So it's not really related to the IMU processing.
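To make that concrete, here is a rough illustration of what the repeat mode does - my own reconstruction, not the library's actual implementation:

```python
import torch

def pad_im2video_repeat(image, ntimes=2):
    # (C, H, W) -> (C, T, H, W): stack copies along a new time dimension.
    return image.unsqueeze(1).repeat(1, ntimes, 1, 1)

frame = torch.randn(3, 224, 224)
print(pad_im2video_repeat(frame).shape)  # torch.Size([3, 2, 224, 224])
```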
from imagebind.
> I agree - I am just making the conjecture that, since we want image-IMU alignments for training, if this is the procedure for image padding, it could work for IMU padding to maintain the alignment - even though it is nowhere to be found in the code/paper. It is worth a try. Another option would be to sample 10s clips - but that seems to directly contradict the paper.
> Grabbing a 10s video clip and aligning it with the 5s IMU could make sense, given that there may be a small 1-2s misalignment between IMU and video due to various factors (e.g. latency).
> Now... this is all a guess! I tried this method for action recognition (see the IMU2CLIP paper) and it seemed to work decently. However, I cannot say for sure whether it is the right way to go.
Yeah sure, this is all hypothesis waiting for the FAIR guys to validate...
Thanks for sharing that paper; it looks interesting. Do they also provide source code?
from imagebind.
Hi, I was wondering what normalization method is used on the IMU data in ImageBind. The data from Ego4D seems to be raw IMU data; however, in Figure 7, the IMU data appears to be clipped to [-1, 1].
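For reference, the kind of normalization that would produce what Figure 7 shows is something like the following - purely a guess, since neither the paper nor the code confirms it, and the scale factor is hypothetical:

```python
import torch

def normalize_imu_guess(imu, scale=1.0):
    # Scale the raw signal, then clip to [-1, 1] as seen in Figure 7.
    # `scale` would depend on the sensor's dynamic range; it is a guess.
    return torch.clamp(imu / scale, min=-1.0, max=1.0)
```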
from imagebind.
@beitong95 Good point. Another issue with the preprocessing is that it doesn't work for any input longer or shorter than 2000 points - in my current implementation I've just padded up to 2k, or cut down and taken only the first 2k data points, to generate embeddings. It would be good to know the details of how the model was trained so that the embeddings are more reliable!
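Here is roughly what my workaround looks like - my own stopgap, not FAIR's documented procedure:

```python
import torch

def fit_to_2000(imu):
    # imu: (6, N) -> (6, 2000); truncate long clips, zero-pad short ones.
    T = 2000
    if imu.shape[1] >= T:
        return imu[:, :T]
    pad = imu.new_zeros(imu.shape[0], T - imu.shape[1])
    return torch.cat([imu, pad], dim=1)
```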
from imagebind.
Hi, I had a question similar to that of @beitong95: how is the IMU input preprocessed and/or normalized before being fed to the model? Is there a load_and_transform function provided for IMU? Thanks.
from imagebind.
Related Issues (20)
- Multimodal data pairs
- `load_and_transform_text` method exec failed HOT 1
- Something wrong with EncodedVideo in load_and_transform_video_data HOT 2
- Question about the pre-trained model's outputs
- Custom sensor as one of the multimodality? HOT 1
- Question regarding SelectElement(index=0) in the modality heads HOT 1
- Using Depth Embeddings in NyuV2 Zero-Shot Classification HOT 4
- Directly using images from S3 bucket using URL.
- Can Inference Time Be Improved by Using ONNX Model?
- IMU inference
- Inconsistent Statement Regarding Experiments on NYU-Depth-v2 HOT 2
- Checkpoints for small/medium model
- Imagebind for commercial purposes
- Simply replacing Detic's CLIP-based 'class' embedding with imagebind audio embedding
- How to use ImageBind to locate sound sources in video?
- issue building wheel for cartopy (Windows 11) HOT 3
- 3 and more modalities in one model HOT 1
- What is your perspective on LanguageBind surpassing ImageBind? HOT 1
- Questions for demo sites audio and image data usage.
- Initialization of Thermal backbone