Comments (10)
Hi everyone,
Thanks for your question and sorry for the late response. The IMU signal corresponds to 10-second clips; this is a typo in the appendix that will be fixed in the coming revision of the paper. For the aligned video, we sample 2 frames at the center of the window.
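For concreteness, a minimal sketch of that windowing (the helper name, frame rate, and indexing are illustrative and assume the 200 Hz Ego4D sample rate discussed below; this is not from the released code):

```python
IMU_HZ = 200                  # Ego4D IMU: one sample every 5 ms
CLIP_SECONDS = 10             # per the clarification above
T = IMU_HZ * CLIP_SECONDS     # 2000 samples

def extract_window(imu, frames, center_sample, fps=30):
    # imu: (6, N) accel+gyro stream; frames: list of decoded video frames.
    half = T // 2
    imu_clip = imu[:, center_sample - half : center_sample + half]  # (6, 2000)
    # Take 2 frames at the center of the 10 s window.
    center_frame = int(center_sample / IMU_HZ * fps)
    return imu_clip, [frames[center_frame], frames[center_frame + 1]]
```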
from imagebind.
Oh yes, of course - I did not mean for this to be a final answer, just trying to help out/start a discussion since it has been a while without a response 🥲.
Yes, they do provide source code, but once again, the input length is 1000, corresponding to 5-second clips.
For my use case I tried the following to account for the 2x factor: padding with zeros, grabbing 10-second clips, and the "repeat" method; the repeat method seemed to work best. I hope this helps get your application moving.
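In code, the two padding workarounds look roughly like this - a sketch of what I tried, not FAIR's documented pipeline (the third option is simply sampling a real 10-second clip):

```python
import torch

T_TARGET = 2000  # the IMU encoder only accepts (6, 2000): 10 s at 200 Hz

def stretch_5s_clip(imu_5s, method="repeat"):
    # imu_5s: (6, 1000), i.e. a 5 s clip at 200 Hz.
    if method == "zero":
        # Pad the missing half with zeros.
        pad = imu_5s.new_zeros(imu_5s.shape[0], T_TARGET - imu_5s.shape[1])
        return torch.cat([imu_5s, pad], dim=1)
    if method == "repeat":
        # Tile the 5 s signal twice along time (worked best for me).
        return imu_5s.repeat(1, 2)
    raise ValueError(f"unknown method: {method}")
```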
from imagebind.
Based on that sample from the Ego4D dataset (https://ego4d-data.org/docs/data/imu/), the sample rate is 200 Hz (5 ms per time step). If only T=2000 works, does this mean they expect the clips to correspond to a 10-second video segment?
However, they mention this in the paper:
> For each video, we select all time-stamps that contain a synchronized IMU signal as well as aligned narrations. We sample 5 second clips around each time-stamp.
So there seems to be a 2x ratio lost somewhere?
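The arithmetic behind the discrepancy, for anyone skimming:

```python
SAMPLE_RATE_HZ = 200   # Ego4D IMU: one sample every 5 ms
EXPECTED_T = 2000      # the only input length the encoder accepts
print(EXPECTED_T / SAMPLE_RATE_HZ)  # 10.0 s per clip, not the 5 s quoted from the paper
```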
from imagebind.
I agree - I am just making the conjecture that, since we want image-IMU alignments for training, if this is the procedure for image padding, it could work for IMU padding to maintain the alignment - even though it is nowhere to be found in the code/paper. It is worth a try. Another option would be to sample 10s clips - but that seems to directly contradict the paper.
Grabbing a 10s video clip and aligning it with the 5s IMU could make sense, given that there may be a small 1-2s misalignment between IMU and video due to various factors (e.g. latency).
Now... this is all a guess! I tried this method for action recognition (see the IMU2CLIP paper) and it seemed to work decently. However, I cannot say for sure whether it is the right way to go.
from imagebind.
It seems that we are supposed to use repeated padding?
`PadIm2Video(pad_type="repeat", ntimes=2)`
from imagebind.
> It seems that we are supposed to use repeated padding?
But that's for the image-to-video transformation (the forward() method). It converts a single image into an n-time-step video, either by copying the same image to every time step (pad_type="repeat") or by using zeros/black frames (pad_type="zero") to create the video sequence.
So it's not really related to the IMU processing.
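To make that concrete, here is a rough illustration of what the repeat mode does - my own reconstruction, not the library's actual implementation:

```python
import torch

def pad_im2video_repeat(image, ntimes=2):
    # (C, H, W) -> (C, T, H, W): stack copies along a new time dimension.
    return image.unsqueeze(1).repeat(1, ntimes, 1, 1)

frame = torch.randn(3, 224, 224)
print(pad_im2video_repeat(frame).shape)  # torch.Size([3, 2, 224, 224])
```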
from imagebind.
> I agree - I am just making the conjecture that, since we want image-IMU alignments for training, if this is the procedure for image padding, it could work for IMU padding to maintain the alignment - even though it is nowhere to be found in the code/paper. It is worth a try. Another option would be to sample 10s clips - but that seems to directly contradict the paper.
> Grabbing a 10s video clip and aligning it with the 5s IMU could make sense, given that there may be a small 1-2s misalignment between IMU and video due to various factors (e.g. latency).
> Now... this is all a guess! I tried this method for action recognition (see the IMU2CLIP paper) and it seemed to work decently. However, I cannot say for sure whether it is the right way to go.
Yeah sure, this is all hypothesis waiting for the FAIR guys to validate...
Thanks for sharing that paper; it looks interesting. Do they also provide source code?
from imagebind.
Hi, I was wondering what normalization method is used on the IMU data in ImageBind. The data from Ego4D seems to be raw IMU data; however, in Figure 7, the IMU data appears to be clipped to [-1, 1].
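For reference, the kind of normalization that would produce what Figure 7 shows is something like the following - purely a guess, since neither the paper nor the code confirms it, and the scale factor is hypothetical:

```python
import torch

def normalize_imu_guess(imu, scale=1.0):
    # Scale the raw signal, then clip to [-1, 1] as seen in Figure 7.
    # `scale` would depend on the sensor's dynamic range; it is a guess.
    return torch.clamp(imu / scale, min=-1.0, max=1.0)
```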
from imagebind.
@beitong95 Good point. Another issue with the preprocessing is that it doesn't work for any input longer or shorter than 2000 points - in my current implementation I've just padded up to 2k, or cut down and taken only the first 2k data points, to generate embeddings. It would be good to know the details of how the model was trained so that the embeddings are more reliable!
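Here is roughly what my workaround looks like - my own stopgap, not FAIR's documented procedure:

```python
import torch

def fit_to_2000(imu):
    # imu: (6, N) -> (6, 2000); truncate long clips, zero-pad short ones.
    T = 2000
    if imu.shape[1] >= T:
        return imu[:, :T]
    pad = imu.new_zeros(imu.shape[0], T - imu.shape[1])
    return torch.cat([imu, pad], dim=1)
```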
from imagebind.
Hi, I had a question similar to that of @beitong95: how is the IMU input preprocessed and/or normalized before being fed to the model? Is there a load_and_transform function provided for IMU? Thanks.
from imagebind.
Related Issues (20)
- Multimodal data pairs
- `load_and_transform_text` method exec failed HOT 1
- Something wrong with EncodedVideo in load_and_transform_video_data HOT 2
- Question about the pre-trained model's outputs
- Custom sensor as one of the multimodality? HOT 1
- Question regarding SelectElement(index=0) in the modality heads HOT 1
- Using Depth Embeddings in NyuV2 Zero-Shot Classification HOT 4
- Directly using images from S3 bucket using URL.
- Can Inference Time Be Improved by Using ONNX Model?
- IMU inference
- Inconsistent Statement Regarding Experiments on NYU-Depth-v2 HOT 2
- Checkpoints for small/medium model
- Imagebind for commercial purposes
- Simply replacing Detic's CLIP-based 'class' embedding with imagebind audio embedding
- How to use ImageBind to locate sound sources in video?
- issue building wheel for cartopy (Windows 11) HOT 3
- 3 and more modalities in one model HOT 1
- What is your perspective on LanguageBind surpassing ImageBind? HOT 1
- Questions for demo sites audio and image data usage.
- Initialization of Thermal backbone