
Comments (6)

v-iashin commented on July 28, 2024

Hi there,

"The range of the temporal crop is limited such that the resulting clip could fit the audio that was offset as well as the visual clip."

This note is about how the temporal crop is done. You seem to ask a question about the spatial crop.

Anyway, the "limiting" is related to time, not to space, and is done as follows.

First, we pick the offset from the grid. Second, we pick a start time for the visual crop such that the starting time of the offset audio falls inside the video clip.

For instance, if my offset is -1.0 seconds and the visual stream starts at 0.5 seconds, then the audio should start at -0.5 seconds, which is out of bounds. We want to avoid this. Therefore, we pick the start time of the visual crop from 1.0 seconds onwards, up to e.g. 5 seconds, because if the visual crop starts at 6 seconds, it will end at 11 seconds, which is again out of bounds.
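
A minimal sketch of that constraint (my own illustration, not the repository code), assuming the audio crop starts at visual_start + offset and both crops are crop_len seconds long:

def valid_visual_start_range(total_len, crop_len, offset):
    # the audio crop starts at visual_start + offset; both crops must fit in [0, total_len]
    lo = max(0.0, -offset)                       # keep the audio start >= 0
    hi = min(total_len - crop_len,               # keep the visual end <= total_len
             total_len - crop_len - offset)      # keep the audio end <= total_len
    if lo > hi:
        raise ValueError('clip too short for this offset and crop length')
    return lo, hi

print(valid_visual_start_range(10.0, 5.0, -1.0))  # -> (1.0, 5.0), as in the example above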

How would a window of size 512x512 be different than 224x224 in terms of offsetting audio?

As you can see, it does not matter. Sorry if you got confused here. First, we do the spatial crop (e.g. 224 out of 256). Then, temporal cropping and temporal offsetting are done in one of the next transformations.
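
To make the order concrete, here is a minimal sketch (the shapes and the 256 -> 224 crop are only illustrative; this is not the repository pipeline):

import random
import torch

frames = torch.rand(125, 3, 256, 256)                    # (T, C, H, W): a 5-second clip at 25 fps
top = random.randint(0, 256 - 224)
left = random.randint(0, 256 - 224)
frames = frames[:, :, top:top + 224, left:left + 224]    # spatial crop: 224 out of 256
# the temporal crop and the audio offset are applied by later transforms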

I currently receive memory errors

Are these RAM or GPU memory errors? Perhaps you are trying to load long videos into your RAM. These can be quite brutal on RAM. What are the resolution, fps and length of the videos you are trying to feed in? Do you scale your input videos? For instance, notice that all of the videos that are provided as examples are scaled to max(Height, Width) = 256. You are asking about 512, but I am not sure where you got that from.
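
If re-scaling is the issue, here is a hedged sketch of doing it with ffmpeg via subprocess (the repository ships its own scaling script; the filter below is only one way to cap the longer side at 256):

import subprocess

def rescale_to_max_side(src, dst, max_side=256):
    # -2 lets ffmpeg choose the other dimension while keeping it even
    vf = f"scale='if(gt(iw,ih),{max_side},-2)':'if(gt(iw,ih),-2,{max_side})'"
    subprocess.run(['ffmpeg', '-i', src, '-vf', vf, '-c:a', 'copy', dst], check=True)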

Sorry for the long description, but maybe I missed something?!

No problem. I think you got a few things confused but I hope this was helpful.


v-iashin commented on July 28, 2024

Not sure I understand how a visual stream can start at 0.5 seconds

I think there is confusion somewhere. Imagine you have a 10-second video clip. Out of it, you crop a 5-second sub-clip starting at any random point within the 10-second clip. The 5-second visual stream may start at 0.5 seconds with respect to the 10-second clip and span the temporal region from 0.5 to 5.5 seconds. If the audio track starts at 1.5 seconds and ends at 6.5 seconds, it means that the out-of-sync offset is +1 second, i.e. the audio is late by 1 second.
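
The arithmetic of that example (just the sign convention, not repository code):

v_start, a_start = 0.5, 1.5        # crop start times within the 10-second clip
offset = a_start - v_start         # +1.0 s: the audio is late by 1 second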

Would this mean that the model is only expecting a certain frame size to infer with? (e.g. not more than 240x240?)

The model 22-07-13T22-25-49 is trained with 224x224 inputs. Please re-scale your videos to match that, meaning you may want to resize your 240x240 video clips to 224x224.
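
A minimal sketch of such a resize, assuming the frames are already loaded as a (T, C, H, W) tensor and a reasonably recent torchvision is available (the repository has its own transform pipeline):

import torch
import torchvision.transforms.functional as F

frames = torch.randint(0, 256, (125, 3, 240, 240), dtype=torch.uint8)   # 5 s at 25 fps, 240x240
frames_224 = F.resize(frames, [224, 224], antialias=True)               # match the 224x224 training resolution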

I am having trouble understanding how you'd download the videos from YouTube. Was it done manually?

There are scripts online. Let me leave it for you to figure out the rest. I hope you understand. Sorry.

How did you ensure the RGB (25 fps, H.264) and audio (16 kHz, AAC) constraints?

Once you download them using one of the scripts you can find online, it will be clear.
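
For reference, a hedged sketch of enforcing that format with ffmpeg after downloading (the actual scripts may use different flags):

import subprocess

def reencode(src, dst):
    subprocess.run([
        'ffmpeg', '-i', src,
        '-r', '25', '-c:v', 'libx264',    # 25 fps, H.264 video
        '-ar', '16000', '-c:a', 'aac',    # 16 kHz, AAC audio
        dst,
    ], check=True)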

Does it mean that no video ids originally contain 'S'?

The LRS3 dataset renamed the video clips with something like x.replace('_', 'S').replace('-', 'S') (perhaps to avoid non-alphanumeric characters in the file paths). Keep this in mind when downloading/processing the videos.
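
For illustration (the video id below is made up):

vid = 'ab-cd_ef123'
safe = vid.replace('_', 'S').replace('-', 'S')   # -> 'abScdSef123'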

it is always within 0.2-second intervals (either 0.0, 0.2, 0.4, etc.), correct?

The prediction might not be correct because there is no perfect model. What we mean by "tolerance ±0.2" applies only to evaluation. However, I am not entirely sure what you mean.
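
What "tolerance ±0.2" means at evaluation time, as a small illustration (not the repository's evaluation code):

def accuracy_with_tolerance(preds, targets, tol=0.2):
    # a prediction counts as correct if it is within `tol` seconds of the true offset
    correct = sum(abs(p - t) <= tol for p, t in zip(preds, targets))
    return correct / len(targets)

print(accuracy_with_tolerance([0.0, 0.4, 1.0], [0.2, 0.4, 0.6]))   # 2 out of 3 within tolerance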


v-iashin commented on July 28, 2024

So it only makes sense if the predictions follow a similar pattern of tolerance ±0.2 but never less (e.g. ±0.1 or ±0.05). Do you think training with a smaller offset class may yield more detailed predictions, e.g. a ±0.05 sec delay?

Actually, we use jitter around the grid. See these lines:

if self.max_a_jitter_sec is not None and self.max_a_jitter_sec > 0:
    max_a_start_i = a_len_frames - a_crop_len_frames
    a_start_i, a_jitter_i = self.apply_a_jitter(a_start_i, max_a_start_i, a_fps)
    item['meta']['a_jitter_i'] = a_jitter_i
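
For intuition, a hedged illustration of what jitter around the grid does (this is not the repository's apply_a_jitter; the 0.05 s bound is an assumption):

import random

def jitter_start(a_start_i, max_a_start_i, a_fps, max_a_jitter_sec=0.05):    # bound is an assumption
    max_jitter_i = int(max_a_jitter_sec * a_fps)
    jitter_i = random.randint(-max_jitter_i, max_jitter_i)          # small shift around the grid point
    a_start_i = min(max(a_start_i + jitter_i, 0), max_a_start_i)    # keep the crop in bounds
    return a_start_i, jitter_i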

with the amount of random factors in there, I wasn't getting anywhere ... Do you have a clue what might be causing either issue?

Sorry, that piece is not particularly nice, I agree.

Also, mind the duration of the videos you are using. We filtered out the videos that are shorter than 10 seconds. h264_uncropped_25fps_256side_16000hz_aac/pretrain/0akiEFwtkyA/00015.mp4 is 2 seconds long, hence the error.
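
A hedged sketch of checking clip duration before training (the >= 10 s threshold is from the comment above; the ffprobe flags are standard but not taken from the repository scripts):

import subprocess

def duration_sec(path):
    out = subprocess.run(
        ['ffprobe', '-v', 'error', '-show_entries', 'format=duration',
         '-of', 'default=noprint_wrappers=1:nokey=1', path],
        capture_output=True, text=True, check=True)
    return float(out.stdout.strip())

# keep a clip only if duration_sec(path) >= 10.0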


v-iashin commented on July 28, 2024

Did you notice good curves early on? First 10 epochs or so? Or would you recommend waiting longer?

Yep, I think you need to wait longer:
[screenshot: training curves, 2023-01-31]


v-iashin commented on July 28, 2024

Anything you could add?


Nadedic commented on July 28, 2024

Sadly, I could not replicate your results with your architecture.
I was also looking at another use case with a speech scenario, i.e. a sparse-in-space and dense-in-time setup, and I used your pretrained model as my baseline, along with a lot of your implementation. Great thanks and lots of appreciation for the quality of the code.
I cited your work. :)

