
Comments (6)

v-iashin commented on July 28, 2024

Hi there,

"The range of the temporal crop is limited such that the resulting clip could fit the audio that was offset as well as the visual clip."

This note is about how the temporal crop is done. You seem to ask a question about the spatial crop.

Anyway, the "limiting" is related to time, not to space, and is done as follows.

First, we pick the offset from the grid. Second, we pick a start time for the visual crop such that the starting time of the offset audio falls inside the video clip.

For instance, if my offset is -1.0 seconds and the visual stream starts at 0.5 seconds, then the audio should start at -0.5 seconds, which is out of bounds. We want to avoid this. Therefore, we pick the start time of the visual crop from 1.0 seconds onwards, up to e.g. 5 seconds, because if the visual crop starts at 6 seconds, it will end at 11 seconds, which is again out of bounds.
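
A minimal sketch of that constraint (my own illustration, not the repository code), assuming the audio crop starts at visual_start + offset and both crops are crop_len seconds long:

def valid_visual_start_range(total_len, crop_len, offset):
    # the audio crop starts at visual_start + offset; both crops must fit in [0, total_len]
    lo = max(0.0, -offset)                       # keep the audio start >= 0
    hi = min(total_len - crop_len,               # keep the visual end <= total_len
             total_len - crop_len - offset)      # keep the audio end <= total_len
    if lo > hi:
        raise ValueError('clip too short for this offset and crop length')
    return lo, hi

print(valid_visual_start_range(10.0, 5.0, -1.0))  # -> (1.0, 5.0), as in the example above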

How would a window of size 512x512 be different than 224x224 in terms of offsetting audio?

As you can see, it does not matter. Sorry if you got confused here. First, we do the spatial crop (e.g. 224 out of 256). Then, temporal cropping and temporal offsetting are done in one of the next transformations.
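
To make the order concrete, here is a minimal sketch (the shapes and the 256 -> 224 crop are only illustrative; this is not the repository pipeline):

import random
import torch

frames = torch.rand(125, 3, 256, 256)                    # (T, C, H, W): a 5-second clip at 25 fps
top = random.randint(0, 256 - 224)
left = random.randint(0, 256 - 224)
frames = frames[:, :, top:top + 224, left:left + 224]    # spatial crop: 224 out of 256
# the temporal crop and the audio offset are applied by later transforms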

I currently receive memory errors

Are these RAM or GPU memory errors? Perhaps you are trying to load long videos into your RAM. These can be quite brutal on RAM. What are the resolution, fps and length of the videos you are trying to feed in? Do you scale your input videos? For instance, notice that all of the videos that are provided as examples are scaled to max(Height, Width) = 256. You are asking about 512, but I am not sure where you got that from.
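
If re-scaling is the issue, here is a hedged sketch of doing it with ffmpeg via subprocess (the repository ships its own scaling script; the filter below is only one way to cap the longer side at 256):

import subprocess

def rescale_to_max_side(src, dst, max_side=256):
    # -2 lets ffmpeg choose the other dimension while keeping it even
    vf = f"scale='if(gt(iw,ih),{max_side},-2)':'if(gt(iw,ih),-2,{max_side})'"
    subprocess.run(['ffmpeg', '-i', src, '-vf', vf, '-c:a', 'copy', dst], check=True)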

Sorry for the long description, but maybe I missed something?!

No problem. I think you got a few things confused but I hope this was helpful.


v-iashin commented on July 28, 2024

Not sure I understand how a visual stream can start at 0.5 seconds

I think there is confusion somewhere. Imagine you have a 10-second video clip. Out of it, you crop a 5-second sub-clip starting at any random point within the 10-second clip. The 5-second visual stream may start at 0.5 seconds with respect to the 10-second clip and span the temporal region from 0.5 to 5.5 seconds. If the audio track starts at 1.5 seconds and ends at 6.5 seconds, it means that the out-of-sync offset is +1 second, i.e. the audio is late by 1 second.
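
The arithmetic of that example (just the sign convention, not repository code):

v_start, a_start = 0.5, 1.5        # crop start times within the 10-second clip
offset = a_start - v_start         # +1.0 s: the audio is late by 1 second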

Would this mean that the model is only expecting a certain frame size to infer with? (e.g. not more than 240x240?)

The model 22-07-13T22-25-49 is trained with 224x224 inputs. Please re-scale your videos to match that, meaning you may want to resize your 240x240 video clips to 224x224.
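
A minimal sketch of such a resize, assuming the frames are already loaded as a (T, C, H, W) tensor and a reasonably recent torchvision is available (the repository has its own transform pipeline):

import torch
import torchvision.transforms.functional as F

frames = torch.randint(0, 256, (125, 3, 240, 240), dtype=torch.uint8)   # 5 s at 25 fps, 240x240
frames_224 = F.resize(frames, [224, 224], antialias=True)               # match the 224x224 training resolution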

I am having trouble understanding how you'd download the videos from YouTube. Was it done manually?

There are scripts online. Let me leave it for you to figure out the rest. I hope you understand. Sorry.

How did you ensure the RGB (25 fps, H.264) and audio (16 kHz, AAC) constraints?

Once you download them using one of the scripts you can find online, it will be clear.
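
For reference, a hedged sketch of enforcing that format with ffmpeg after downloading (the actual scripts may use different flags):

import subprocess

def reencode(src, dst):
    subprocess.run([
        'ffmpeg', '-i', src,
        '-r', '25', '-c:v', 'libx264',    # 25 fps, H.264 video
        '-ar', '16000', '-c:a', 'aac',    # 16 kHz, AAC audio
        dst,
    ], check=True)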

Does it mean that no video ids originally contain 'S'?

The LRS3 dataset renamed the video clips with something like x.replace('_', 'S').replace('-', 'S') (perhaps to avoid non-alphanumeric characters in the file paths). Keep this in mind when downloading/processing the videos.
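
For illustration (the video id below is made up):

vid = 'ab-cd_ef123'
safe = vid.replace('_', 'S').replace('-', 'S')   # -> 'abScdSef123'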

it is always within 0.2-second intervals (either 0.0, 0.2, 0.4, etc.), correct?

The prediction might not be correct because there is no perfect model. What we mean by "tolerance ±0.2" applies only to evaluation. However, I am not entirely sure what you mean.
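
What "tolerance ±0.2" means at evaluation time, as a small illustration (not the repository's evaluation code):

def accuracy_with_tolerance(preds, targets, tol=0.2):
    # a prediction counts as correct if it is within `tol` seconds of the true offset
    correct = sum(abs(p - t) <= tol for p, t in zip(preds, targets))
    return correct / len(targets)

print(accuracy_with_tolerance([0.0, 0.4, 1.0], [0.2, 0.4, 0.6]))   # 2 out of 3 within tolerance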


v-iashin commented on July 28, 2024

So it only makes sense if the predictions follow a similar pattern of tolerance ±0.2 but never less (e.g. ±0.1 or ±0.05). Do you think training with a smaller offset class may yield more detailed predictions, e.g. a ±0.05 sec delay?

Actually, we use jitter around the grid. See these lines:

if self.max_a_jitter_sec is not None and self.max_a_jitter_sec > 0:
    max_a_start_i = a_len_frames - a_crop_len_frames
    a_start_i, a_jitter_i = self.apply_a_jitter(a_start_i, max_a_start_i, a_fps)
    item['meta']['a_jitter_i'] = a_jitter_i
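
For intuition, a hedged illustration of what jitter around the grid does (this is not the repository's apply_a_jitter; the 0.05 s bound is an assumption):

import random

def jitter_start(a_start_i, max_a_start_i, a_fps, max_a_jitter_sec=0.05):    # bound is an assumption
    max_jitter_i = int(max_a_jitter_sec * a_fps)
    jitter_i = random.randint(-max_jitter_i, max_jitter_i)          # small shift around the grid point
    a_start_i = min(max(a_start_i + jitter_i, 0), max_a_start_i)    # keep the crop in bounds
    return a_start_i, jitter_i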

with the amount of random factors in there, I wasn't getting anywhere ... Do you have a clue what might be causing either issue?

Sorry, that piece is not particularly nice, I agree.

Also, mind the duration of the videos you are using. We filtered out the videos that are shorter than 10 seconds. h264_uncropped_25fps_256side_16000hz_aac/pretrain/0akiEFwtkyA/00015.mp4 is 2 seconds long, hence the error.
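
A hedged sketch of checking clip duration before training (the >= 10 s threshold is from the comment above; the ffprobe flags are standard but not taken from the repository scripts):

import subprocess

def duration_sec(path):
    out = subprocess.run(
        ['ffprobe', '-v', 'error', '-show_entries', 'format=duration',
         '-of', 'default=noprint_wrappers=1:nokey=1', path],
        capture_output=True, text=True, check=True)
    return float(out.stdout.strip())

# keep a clip only if duration_sec(path) >= 10.0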


v-iashin commented on July 28, 2024

Did you notice good curves early on? First 10 epochs or so? Or would you recommend waiting longer?

Yep, I think you need to wait longer:
[screenshot: training curves, 2023-01-31]


v-iashin commented on July 28, 2024

Anything you could add?


Nadedic commented on July 28, 2024

Sadly, I could not replicate your results with your architecture.
I was also looking at another use case with a speech scenario, i.e. a sparse-in-space and dense-in-time setup, and I used your pretrained model as my baseline, along with a lot of your implementation. Great thanks and lots of appreciation for the quality of the code.
I cited your work. :)

