iejMac / video2dataset
Easily create large video datasets from video URLs
License: MIT License
explore datasets of https://arxiv.org/abs/2212.03191
DoD: a .md file is created in the repo with some examples (same info as in the gdoc)
let's try and put this into https://github.com/iejMac/video2dataset/blob/main/video2dataset/data_reader.py
do you actually need it?
it's used to synchronize videos from frame blocks that come back out of order (FrameReader job 10 finished before job 0 because job 0 got a 10-minute video and job 10 got a 10-second video)
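The reordering idea can be sketched as a small buffer keyed by job id; this is an illustrative sketch, not the actual data_reader code (the `reorder` name and the (job_id, payload) shape are assumptions):

```python
import heapq

def reorder(results):
    """Yield (job_id, payload) pairs in job-id order, buffering results
    that arrive early (e.g. a short video finishing before a long one).
    `results` is any iterable of (job_id, payload) pairs."""
    heap = []      # min-heap of results that arrived too early
    next_id = 0    # next job id we are allowed to emit
    for job_id, payload in results:
        heapq.heappush(heap, (job_id, payload))
        while heap and heap[0][0] == next_id:
            yield heapq.heappop(heap)
            next_id += 1

# jobs finish out of order: a short video beats a long one
out = list(reorder([(1, "b"), (0, "a"), (2, "c")]))
# → [(0, "a"), (1, "b"), (2, "c")]
```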
right now a unique error is logged for each video, which spams WandB; aggregate them into a single one
DoD: audio subsampler implemented and tested
https://github.com/facebookresearch/demucs
Could be interesting to extract voice or music data from mixed audio sources
downloader should fit into data reader
Implement a feature to filter videos by the similarity score between the CLIP embeddings of the text and the video frames. Video filter class + Colab demo.
needs to be compatible with clipping
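A minimal sketch of what such a filter could look like, assuming precomputed embeddings (the `keep_video` name and the mean-over-frames scoring are assumptions, not the real API); scoring each clip's frames separately would keep it compatible with clipping:

```python
import numpy as np

def keep_video(text_emb, frame_embs, threshold=0.25):
    """Keep a video if the mean cosine similarity between its text
    embedding and its frame embeddings (e.g. from CLIP) clears a
    threshold. Embeddings are assumed precomputed."""
    text = text_emb / np.linalg.norm(text_emb)
    frames = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    sims = frames @ text  # per-frame cosine similarity
    return float(sims.mean()) >= threshold

# identical text and frame embeddings → similarity 1.0 → kept
emb = np.ones(4)
ok = keep_video(emb, np.stack([emb, emb]))
```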
Let's try and integrate things as best as possible.
The ideal end state is several common packages we extract from all this, put them on pypi and use them here and in these projects. So that all projects benefit from each other.
Potential bottlenecks worth keeping an eye on:
DoD: measure speed of each component and report in gdoc / md
video url table -> input sharder -> data reader -> subsampler -> data writer
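The stage composition above can be sketched as plain callables chained together (stage names and signatures here are illustrative stand-ins, not the real component APIs):

```python
def run_pipeline(url_table, sharder, reader, subsampler, writer):
    """Minimal sketch of the pipeline: shard the url table, read each
    shard into samples, subsample, and hand results to the writer."""
    for shard in sharder(url_table):
        for sample in reader(shard):
            writer(subsampler(sample))

# toy stages standing in for the real components
shards = lambda table: [table[i:i + 2] for i in range(0, len(table), 2)]
read = lambda shard: ({"url": u, "video": u.upper()} for u in shard)
subsample = lambda s: {**s, "video": s["video"][:3]}
out = []
run_pipeline(["abcd", "efgh"], shards, read, subsample, out.append)
# out now holds one subsampled sample per input url
```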
DoD: should run and have simple tests
Let's make it as fast as possible to get and package videos.
DoD: basic API doc in readme, one example available
Should clarify the plan and help collaborate on details
DoD: main contributors are happy with design
currently everything is done in memory, full video
if doing many in parallel (which may be needed due to slow connections of sources) then it takes a lot of memory
solutions:
1 is easier but still limited by disk size
2 works in general but is more difficult to implement
DoD: one dataset has been produced, one training codebase has used it
Goals:
1000 hours of video/audio processed
3 sources supported
3 formats supported
Used by 2 training codebases
Features:
Input: files and streams
Output: webdataset
Transform: simple subsampling
Basic documentation
Proper pip packaging
Speed:
Basic subsampling: 1h of video processed in 1h of CPU-core time
https://github.com/iejMac/video2dataset/issues?q=is%3Aissue+is%3Aopen+label%3Av0
Let's maybe wait a little bit to make sure it indeed makes sense, but it would be smart to reduce code duplication and expose the functionality of these components to more people
DoD: input sharder implemented and tested
this function needs updating
video2dataset/video2dataset/downloader.py
Line 10 in 47747b2
Let's talk about how to best benchmark the code in the most general way possible. For video2numpy the way I did it is I had a constant set of videos, but I only did local mp4 since the main bottleneck was video decoding. Here we might have other bottlenecks (#12), so our benchmarks should be robust to all settings someone might use video2dataset in.
Maybe we should have multiple benchmarks for all settings?:
DoD: one end2end benchmark + one benchmark for each component is written and has run, performance is reported, and issues are created to improve any big slowness
Goals:
1M hours processed
All important sources and formats supported
10 training codebases
Features:
Transform: pytorch model support, complex subsampling
Distribution supported
A few examples documented
Speed:
Optimal and complex processing: 20min of CPU-core time per 1h of video and/or optimal GPU usage
https://github.com/iejMac/video2dataset/issues?q=is%3Aissue+is%3Aopen+label%3Av1
data reader -> extractor
subsampler -> processor
any better ideas?
What's the best way to do this?
Input: audio from an audio or video file
Output: mp3 or similar
Options: quality, segment lengths, encoding kind, ...
DoD: basic audio subsampler implemented and merged
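One way to realize the spec above is to delegate to ffmpeg. This sketch only builds the argument list (the flag choices — `-vn` to drop video, `libmp3lame` for mp3, the segment muxer for fixed-length pieces — are assumptions, not the merged implementation):

```python
def audio_ffmpeg_args(src, dst, bitrate="192k", segment_secs=None):
    """Build an ffmpeg command that extracts an mp3 track from an
    audio or video file, optionally split into fixed-length segments."""
    args = ["ffmpeg", "-i", src,
            "-vn",                      # drop the video stream
            "-acodec", "libmp3lame",    # encode audio as mp3
            "-b:a", bitrate]            # audio quality
    if segment_secs is not None:
        # use the segment muxer to split the output into pieces
        args += ["-f", "segment", "-segment_time", str(segment_secs)]
    return args + [dst]

cmd = audio_ffmpeg_args("clip.mp4", "out_%03d.mp3", segment_secs=10)
```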
In readme say what dataset example is available and link to them
Make a .md file for each dataset explaining what it is and how to get it. Potentially also have a .sh file (but not sure if really needed)
DoD: fps subsampler was implemented and benchmarked
fps_subsampler = FpsSubsampler()
subsampled = fps_subsampler(video)
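A minimal sketch of how such a subsampler might look. The numpy-array interface and the `target_fps`/`source_fps` parameters are assumptions; a real implementation would likely drive ffmpeg instead of slicing frames:

```python
import numpy as np

class FpsSubsampler:
    """Drop frames to hit a target fps by keeping every n-th frame."""
    def __init__(self, target_fps=5):
        self.target_fps = target_fps

    def __call__(self, frames, source_fps=30):
        # keep one frame out of every `step`
        step = max(1, round(source_fps / self.target_fps))
        return frames[::step]

video = np.zeros((30, 4, 4, 3))  # 1 second of 30 fps dummy frames
fps_subsampler = FpsSubsampler(target_fps=5)
subsampled = fps_subsampler(video, source_fps=30)
# subsampled holds 5 frames (one per 6 input frames)
```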
DoD: basic distributor is done (sequential)
https://joinpeertube.org/ is a p2p YouTube alternative
Downloading from it is encouraged by the platform
We could download everything from it (100k videos)
DoD: the data writer can emit jpg, numpy, mp3 or mp4 as files or webdataset
Should work like this
writer = Writer("shard_001.tar")
writer({"mp3": b"..."})
take a look at https://github.com/rom1504/img2dataset/blob/main/img2dataset/writer.py for reference
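The interface above could be sketched with the stdlib tarfile module, which already yields webdataset-compatible shards (this is an illustrative sketch under that assumption; the real writer, like img2dataset's, handles more output formats and sharding logic):

```python
import io
import tarfile

class Writer:
    """Write samples into a tar shard; each dict key becomes the
    file extension inside the shard, webdataset-style."""
    def __init__(self, path):
        self.tar = tarfile.open(path, "w")
        self.count = 0

    def __call__(self, sample):
        for ext, data in sample.items():
            info = tarfile.TarInfo(name=f"{self.count:05d}.{ext}")
            info.size = len(data)
            self.tar.addfile(info, io.BytesIO(data))
        self.count += 1

    def close(self):
        self.tar.close()

writer = Writer("shard_001.tar")
writer({"mp3": b"..."})
writer.close()
```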
The data reader, the data writer, and the subsampler are meant to be independent components. They should be implemented independently to make it easy to test and benchmark them.
However, for performance reasons we may want to let underlying libraries (e.g. ffmpeg or yt-dl) handle multiple responsibilities, and hence we can introduce fused components. For example we may fuse multiple subsamplers into a single one, or we may even want to fuse a reader and a subsampler.
These fused components will behave the same as a composition of 2 independent components but may be faster.
DoD: one fused component has been implemented, and performance comparison was done
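The equivalence requirement can be checked mechanically: a fused component must produce the same output as the composition of the independent ones. A toy sketch with hypothetical subsamplers (`Compose`, `halve`, and `crop` are illustrative names, not the real API):

```python
class Compose:
    """Run independent subsamplers in sequence."""
    def __init__(self, *subsamplers):
        self.subsamplers = subsamplers

    def __call__(self, sample):
        for s in self.subsamplers:
            sample = s(sample)
        return sample

# two independent toy subsamplers, and a fused one-pass equivalent
halve = lambda frames: frames[::2]            # temporal subsampling
crop = lambda frames: [f[:2] for f in frames]  # spatial subsampling
fused = lambda frames: [f[:2] for f in frames[::2]]

frames = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
composed_out = Compose(halve, crop)(frames)
fused_out = fused(frames)
# both yield the same result; the fused version does a single pass
```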
DoD: a no-op subsampler is there
This is important so that subsequent subsamplers can be implemented right
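A no-op subsampler is tiny, but it pins down the call convention later subsamplers will follow. A sketch (the `(output, error)` return shape mirrors img2dataset's style and is an assumption here):

```python
class NoOpSubsampler:
    """Return the sample unchanged; exists to fix the subsampler
    interface that real subsamplers will implement."""
    def __call__(self, sample):
        return sample, None  # (output, error)

sample = {"mp4": b"..."}
out, err = NoOpSubsampler()(sample)
# out is the untouched sample, err is None
```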
Goals:
100M hours processed
50 training codebases
Features:
Efficient distribution, robustness, incremental
Provides examples for many use cases
Very extensive documentation
Speed:
10% distribution overhead
https://github.com/iejMac/video2dataset/issues?q=is%3Aissue+is%3Aopen+label%3Av2
If you try an example, e.g. this one: http://www.robots.ox.ac.uk/~maxbain/webvid/results_2M_train.csv
DoD: resolution subsampler implemented and tested