
video2dataset's Issues

Integrate features from other projects

Let's try to integrate things as well as possible.

The ideal end state is a set of common packages that we extract from all this, publish on PyPI, and use both here and in these projects, so that all projects benefit from each other.

Bottleneck Analysis

Potential bottlenecks worth keeping an eye on:

  • Video downloading (most likely in the YouTube setting). The source of the bottleneck will likely be the availability of formats. How do we detect this and get around it?
  • Video decoding

DoD: measure speed of each component and report in gdoc / md
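
A minimal sketch of how per-component timing could be collected and dumped as a markdown table for the report. `StageTimer` and the stage names are hypothetical, not part of video2dataset; the idea is only to wrap each component call in a timed context.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock time per named pipeline stage (hypothetical helper)."""

    def __init__(self):
        self.totals = defaultdict(float)  # stage name -> total seconds
        self.counts = defaultdict(int)    # stage name -> number of calls

    @contextmanager
    def measure(self, stage):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the elapsed time even if the stage raised.
            self.totals[stage] += time.perf_counter() - start
            self.counts[stage] += 1

    def report(self):
        """Return a markdown table, slowest stage first."""
        lines = ["| stage | calls | total (s) | mean (s) |", "|---|---|---|---|"]
        for stage, total in sorted(self.totals.items(), key=lambda kv: -kv[1]):
            n = self.counts[stage]
            lines.append(f"| {stage} | {n} | {total:.4f} | {total / n:.4f} |")
        return "\n".join(lines)


if __name__ == "__main__":
    timer = StageTimer()
    for _ in range(3):
        with timer.measure("download"):
            time.sleep(0.01)   # placeholder for the real download call
        with timer.measure("decode"):
            time.sleep(0.002)  # placeholder for the real decode call
    print(timer.report())
```

Sorting by total time puts the likely bottleneck on the first row, which is the number the DoD asks for.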

v0.1 steps

  1. take a video-caption table and output files (mp4, txt pairs) at the destination [DONE]
  2. output as webdataset [DONE]
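
For reference, a rough sketch of what step 2 amounts to, using only the standard library. It assumes the usual WebDataset convention that a shard is a tar archive in which files sharing a basename form one sample; `write_shard` and the keys are made up for illustration and this is not the project's actual writer.

```python
import io
import tarfile


def write_shard(fileobj, samples):
    """Write (key, mp4_bytes, caption) samples into a WebDataset-style tar.

    Files that share a basename (the key) are read back as one sample,
    e.g. 000000.mp4 + 000000.txt.
    """
    with tarfile.open(fileobj=fileobj, mode="w") as tar:
        for key, video_bytes, caption in samples:
            members = (("mp4", video_bytes), ("txt", caption.encode("utf-8")))
            for ext, payload in members:
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))


if __name__ == "__main__":
    # Dummy bytes stand in for real mp4 content.
    samples = [("000000", b"\x00\x01", "a cat"), ("000001", b"\x02", "a dog")]
    with open("00000.tar", "wb") as f:
        write_shard(f, samples)
```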

Make it fast

Let's make it as fast as possible to get and package videos.

Design Document

  • gdoc
  • MD

Should clarify the plan and help collaborate on details

DoD: main contributors are happy with design

optimize video handling

Currently everything is done in memory, on the full video.
If many videos are processed in parallel (which may be needed because sources can have slow connections), this takes a lot of memory.

solutions

  1. store on disk instead
  2. process in streaming / in small clips

Option 1 is easier but is still limited by disk size.
Option 2 works in general but is more difficult to implement.
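
A small sketch of option 1, assuming the download is exposed as a readable binary stream. `spool_to_disk` is a hypothetical helper name; the point is that copying in fixed-size chunks keeps peak memory near one chunk per video instead of the whole file.

```python
import io
import os
import shutil
import tempfile


def spool_to_disk(source, chunk_size=1 << 20):
    """Copy a readable binary stream to a temp file in fixed-size chunks.

    Returns the path of the temp file; the caller is responsible for
    deleting it once the video has been processed.
    """
    with tempfile.NamedTemporaryFile(suffix=".mp4", delete=False) as tmp:
        shutil.copyfileobj(source, tmp, chunk_size)
        return tmp.name


if __name__ == "__main__":
    fake_video = io.BytesIO(b"\x00" * (3 << 20))  # stand-in for a response body
    path = spool_to_disk(fake_video)
    print(os.path.getsize(path))
    os.remove(path)
```

Any file-like object works as `source`, for example the response returned by `urllib.request.urlopen`. Option 2 would need a decoder that can read clips incrementally and is not covered here.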

Benchmark Performance

Let's talk about how best to benchmark the code in the most general way possible. For video2numpy I used a constant set of videos, but only local mp4s, since the main bottleneck was video decoding. Here we might have other bottlenecks (#12), so our benchmarks should be robust to all settings in which someone might use video2dataset.

Maybe we should have multiple benchmarks for all settings?

  • Local mp4s (would test the video decoding bottleneck)
  • mp4 links (downloading bottleneck)
  • YouTube links (potential format bottleneck)
  • maybe some GPU benchmarks for frame embedding generation (when we add it)

DoD: one end2end benchmark plus one benchmark for each component is written and has run, performance is reported, and issues are created to address any major slowness

Dataset example documentation

In the README, say which dataset examples are available and link to them.
Make a .md file for each dataset explaining what it is and how to get it. Potentially also add a .sh file (not sure it is really needed).

Investigate fusing

https://docs.google.com/document/d/1_TD2KQLkEegszq4Eip568fc6cWnh9h0Jqj4Lc88t9Y0/edit#bookmark=id.4dhg93pb66bc

The data reader, the data writer and the subsampler are meant to be independent components. They should be implemented independently to make it easy to test and benchmark them.
However, for performance reasons we may want to let underlying libraries (e.g. ffmpeg or yt-dl) handle multiple responsibilities, and hence we can introduce fused components. For example we may fuse multiple subsamplers into a single one, or we may even want to fuse a reader and a subsampler.
These fused components will behave the same as using a composition of 2 independent components but may be faster.

DoD: one fused component has been implemented, and performance comparison was done
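
To make the idea concrete, here is a sketch of fusing two subsamplers by chaining ffmpeg video filters in one `-vf` argument. The helper names and file names are hypothetical; the code only builds the command lines and does not run ffmpeg. The composed version needs two decode/encode passes and an intermediate file, while the fused one needs a single pass.

```python
def fps_filter(fps):
    """ffmpeg filter string for a frame-rate subsampler."""
    return f"fps={fps}"


def scale_filter(width, height):
    """ffmpeg filter string for a resolution subsampler."""
    return f"scale={width}:{height}"


def ffmpeg_command(src, dst, filters):
    """Build an ffmpeg argv that applies all `filters` in one pass."""
    return ["ffmpeg", "-i", src, "-vf", ",".join(filters), dst]


# Composed: each subsampler is its own ffmpeg call (two passes).
composed = [
    ffmpeg_command("in.mp4", "tmp.mp4", [fps_filter(10)]),
    ffmpeg_command("tmp.mp4", "out.mp4", [scale_filter(224, 224)]),
]

# Fused: both subsamplers share a single ffmpeg call (one pass).
fused = [ffmpeg_command("in.mp4", "out.mp4", [fps_filter(10), scale_filter(224, 224)])]

if __name__ == "__main__":
    for cmd in composed + fused:
        print(" ".join(cmd))
```

The two variants apply the same filters in the same order, but the composed one re-encodes twice, so outputs may not be bit-identical. Timing both on the same input would give the performance comparison the DoD asks for.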
