iejMac / video2dataset
Easily create large video datasets from video URLs
License: MIT License
explore datasets of https://arxiv.org/abs/2212.03191
DoD: a .md file is created in the repo with some examples (same info as in the gdoc)
let's try and put this into https://github.com/iejMac/video2dataset/blob/main/video2dataset/data_reader.py
do you actually need it?
it's used to synchronize videos from frame blocks that come back out of order (FrameReader job 10 finished before job 0 because job 0 got a 10-minute video and job 10 got a 10-second video)
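The reordering idea can be sketched as a small buffer keyed by job id; this is an illustrative sketch, not the actual data_reader code (the `reorder` name and the (job_id, payload) shape are assumptions):

```python
import heapq

def reorder(results):
    """Yield (job_id, payload) pairs in job-id order, buffering results
    that arrive early (e.g. a short video finishing before a long one).
    `results` is any iterable of (job_id, payload) pairs."""
    heap = []      # min-heap of results that arrived too early
    next_id = 0    # next job id we are allowed to emit
    for job_id, payload in results:
        heapq.heappush(heap, (job_id, payload))
        while heap and heap[0][0] == next_id:
            yield heapq.heappop(heap)
            next_id += 1

# jobs finish out of order: a short video beats a long one
out = list(reorder([(1, "b"), (0, "a"), (2, "c")]))
# → [(0, "a"), (1, "b"), (2, "c")]
```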
right now a unique error is logged for each video, which spams WandB; aggregate them into a single one
DoD: audio subsampler implemented and tested
https://github.com/facebookresearch/demucs
Could be interesting to extract voice or music data from mixed audio sources
downloader should fit into data reader
Implement a feature to filter videos by the similarity score between the CLIP embeddings of the text and the video frames. Video filter class + Colab demo.
needs to be compatible with clipping
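A minimal sketch of what such a filter could look like, assuming precomputed embeddings (the `keep_video` name and the mean-over-frames scoring are assumptions, not the real API); scoring each clip's frames separately would keep it compatible with clipping:

```python
import numpy as np

def keep_video(text_emb, frame_embs, threshold=0.25):
    """Keep a video if the mean cosine similarity between its text
    embedding and its frame embeddings (e.g. from CLIP) clears a
    threshold. Embeddings are assumed precomputed."""
    text = text_emb / np.linalg.norm(text_emb)
    frames = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    sims = frames @ text  # per-frame cosine similarity
    return float(sims.mean()) >= threshold

# identical text and frame embeddings → similarity 1.0 → kept
emb = np.ones(4)
ok = keep_video(emb, np.stack([emb, emb]))
```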
Let's try and integrate things as best as possible.
The ideal end state is several common packages we extract from all this, put them on pypi and use them here and in these projects. So that all projects benefit from each other.
Potential bottlenecks worth keeping an eye on:
DoD: measure speed of each component and report in gdoc / md
video url table -> input sharder -> data reader -> subsampler -> data writer
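The stage composition above can be sketched as plain callables chained together (stage names and signatures here are illustrative stand-ins, not the real component APIs):

```python
def run_pipeline(url_table, sharder, reader, subsampler, writer):
    """Minimal sketch of the pipeline: shard the url table, read each
    shard into samples, subsample, and hand results to the writer."""
    for shard in sharder(url_table):
        for sample in reader(shard):
            writer(subsampler(sample))

# toy stages standing in for the real components
shards = lambda table: [table[i:i + 2] for i in range(0, len(table), 2)]
read = lambda shard: ({"url": u, "video": u.upper()} for u in shard)
subsample = lambda s: {**s, "video": s["video"][:3]}
out = []
run_pipeline(["abcd", "efgh"], shards, read, subsample, out.append)
# out now holds one subsampled sample per input url
```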
DoD: should run and have simple tests
Let's make it as fast as possible to get and package videos.
DoD: basic API doc in readme, one example available
Should clarify the plan and help collaborate on details
DoD: main contributors are happy with design
currently everything is done in memory, full video
if doing many in parallel (which may be needed due to slow connections of sources) then it takes a lot of memory
solutions:
1 is easier but still limited by disk size
2 works in general but is more difficult to implement
DoD: one dataset has been produced, one training codebase has used it
Goals:
1000 hours of video/audio processed
3 sources supported
3 formats supported
Used by 2 training codebases
Features:
Input: files and streams
Output: webdataset
Transform: simple subsampling
Basic documentation
Proper pip packaging
Speed:
Basic subsampling: 1h of video processed in 1h of CPU-core time
https://github.com/iejMac/video2dataset/issues?q=is%3Aissue+is%3Aopen+label%3Av0
Let's maybe wait a little bit to make sure it indeed makes sense, but it would be smart to reduce code duplication and expose the functionality of these components to more people
DoD: input sharder implemented and tested
this function needs updating
video2dataset/video2dataset/downloader.py
Line 10 in 47747b2
Let's talk about how to best benchmark the code in the most general way possible. For video2numpy the way I did it is I had a constant set of videos, but I only did local mp4 since the main bottleneck was video decoding. Here we might have other bottlenecks (#12), so our benchmarks should be robust to all settings someone might use video2dataset in.
Maybe we should have multiple benchmarks for all settings?:
DoD: one end2end benchmark + one benchmark for each component is written and has run, performance is reported, and issues are created to improve any big slowness
Goals:
1M hours processed
All important sources and formats supported
10 training codebases
Features:
Transform: pytorch model support, complex subsampling
Distribution supported
A few examples documented
Speed:
Optimal and complex processing: 20min of CPU-core time per 1h of video and/or optimal GPU usage
https://github.com/iejMac/video2dataset/issues?q=is%3Aissue+is%3Aopen+label%3Av1
data reader -> extractor
subsampler -> processor
any better ideas?
What's the best way to do this?
Input: audio from an audio or video file
Output: mp3 or similar
Options: quality, segment lengths, encoding kind, ...
DoD: basic audio subsampler implemented and merged
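One way to realize the spec above is to delegate to ffmpeg. This sketch only builds the argument list (the flag choices — `-vn` to drop video, `libmp3lame` for mp3, the segment muxer for fixed-length pieces — are assumptions, not the merged implementation):

```python
def audio_ffmpeg_args(src, dst, bitrate="192k", segment_secs=None):
    """Build an ffmpeg command that extracts an mp3 track from an
    audio or video file, optionally split into fixed-length segments."""
    args = ["ffmpeg", "-i", src,
            "-vn",                      # drop the video stream
            "-acodec", "libmp3lame",    # encode audio as mp3
            "-b:a", bitrate]            # audio quality
    if segment_secs is not None:
        # use the segment muxer to split the output into pieces
        args += ["-f", "segment", "-segment_time", str(segment_secs)]
    return args + [dst]

cmd = audio_ffmpeg_args("clip.mp4", "out_%03d.mp3", segment_secs=10)
```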
In readme say what dataset example is available and link to them
Make a .md file for each dataset explaining what it is and how to get it. Potentially also have a .sh file (but not sure if really needed)
DoD: fps subsampler was implemented and benchmarked
fps_subsampler = FpsSubsampler()
subsampled = fps_subsampler(video)
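A minimal sketch of how such a subsampler might look. The numpy-array interface and the `target_fps`/`source_fps` parameters are assumptions; a real implementation would likely drive ffmpeg instead of slicing frames:

```python
import numpy as np

class FpsSubsampler:
    """Drop frames to hit a target fps by keeping every n-th frame."""
    def __init__(self, target_fps=5):
        self.target_fps = target_fps

    def __call__(self, frames, source_fps=30):
        # keep one frame out of every `step`
        step = max(1, round(source_fps / self.target_fps))
        return frames[::step]

video = np.zeros((30, 4, 4, 3))  # 1 second of 30 fps dummy frames
fps_subsampler = FpsSubsampler(target_fps=5)
subsampled = fps_subsampler(video, source_fps=30)
# subsampled holds 5 frames (one per 6 input frames)
```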
DoD: basic distributor is done (sequential)
https://joinpeertube.org/ is a p2p YouTube alternative
Downloading from it is encouraged by the platform
We could download everything from it (100k videos)
DoD: the data writer can emit jpg, numpy, mp3 or mp4 as files or webdataset
Should work like this
writer = Writer("shard_001.tar")
writer({"mp3": b"..."})
take a look at https://github.com/rom1504/img2dataset/blob/main/img2dataset/writer.py for reference
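The interface above could be sketched with the stdlib tarfile module, which already yields webdataset-compatible shards (this is an illustrative sketch under that assumption; the real writer, like img2dataset's, handles more output formats and sharding logic):

```python
import io
import tarfile

class Writer:
    """Write samples into a tar shard; each dict key becomes the
    file extension inside the shard, webdataset-style."""
    def __init__(self, path):
        self.tar = tarfile.open(path, "w")
        self.count = 0

    def __call__(self, sample):
        for ext, data in sample.items():
            info = tarfile.TarInfo(name=f"{self.count:05d}.{ext}")
            info.size = len(data)
            self.tar.addfile(info, io.BytesIO(data))
        self.count += 1

    def close(self):
        self.tar.close()

writer = Writer("shard_001.tar")
writer({"mp3": b"..."})
writer.close()
```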
The data reader, the data writer, and the subsampler are meant to be independent components. They should be implemented independently to make it easy to test and benchmark them.
However, for performance reasons we may want to let underlying libraries (e.g. ffmpeg or yt-dl) handle multiple responsibilities, and hence we can introduce fused components. For example we may fuse multiple subsamplers into a single one, or we may even want to fuse a reader and a subsampler.
These fused components will behave the same as a composition of 2 independent components but may be faster.
DoD: one fused component has been implemented, and performance comparison was done
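The equivalence requirement can be checked mechanically: a fused component must produce the same output as the composition of the independent ones. A toy sketch with hypothetical subsamplers (`Compose`, `halve`, and `crop` are illustrative names, not the real API):

```python
class Compose:
    """Run independent subsamplers in sequence."""
    def __init__(self, *subsamplers):
        self.subsamplers = subsamplers

    def __call__(self, sample):
        for s in self.subsamplers:
            sample = s(sample)
        return sample

# two independent toy subsamplers, and a fused one-pass equivalent
halve = lambda frames: frames[::2]            # temporal subsampling
crop = lambda frames: [f[:2] for f in frames]  # spatial subsampling
fused = lambda frames: [f[:2] for f in frames[::2]]

frames = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
composed_out = Compose(halve, crop)(frames)
fused_out = fused(frames)
# both yield the same result; the fused version does a single pass
```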
DoD: a no-op subsampler is there
This is important so that subsequent subsamplers can be implemented right
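A no-op subsampler is tiny, but it pins down the call convention later subsamplers will follow. A sketch (the `(output, error)` return shape mirrors img2dataset's style and is an assumption here):

```python
class NoOpSubsampler:
    """Return the sample unchanged; exists to fix the subsampler
    interface that real subsamplers will implement."""
    def __call__(self, sample):
        return sample, None  # (output, error)

sample = {"mp4": b"..."}
out, err = NoOpSubsampler()(sample)
# out is the untouched sample, err is None
```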
Goals:
100M hours processed
50 training codebases
Features:
Efficient distribution, robustness, incremental
Provides examples for many use cases
Very extensive documentation
Speed:
10% distribution overhead
https://github.com/iejMac/video2dataset/issues?q=is%3Aissue+is%3Aopen+label%3Av2
If you try an example, e.g. this one: http://www.robots.ox.ac.uk/~maxbain/webvid/results_2M_train.csv
DoD: resolution subsampler implemented and tested