Priority queue:
from video2dataset.
@rom1504
Making a summary/list of what needs to be done pre-release. I'm going over the code in execution order and noting my thoughts:
main.py
- encode_formats takes a dict as its input type. This works when calling video2dataset from Python, but I'm not sure how it behaves with fire/CLI. Any thoughts?
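I believe fire parses dict literals from the CLI on its own, but quoting differences can leave the value as a string. A defensive stdlib-only sketch (hypothetical helper, not in the repo) that accepts either form:

```python
import ast

def normalize_encode_formats(encode_formats):
    """Accept encode_formats as a dict (python API) or as the string
    form a CLI invocation might hand over, e.g. '{"video": "mp4"}'."""
    if isinstance(encode_formats, str):
        # literal_eval safely parses dict/list/str/number literals only
        encode_formats = ast.literal_eval(encode_formats)
    if not isinstance(encode_formats, dict):
        raise ValueError(f"expected a dict, got {type(encode_formats).__name__}")
    return encode_formats
```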
worker.py
- make this cleaner; the ifs are ugly and don't get the point across. The point is that the video already contains the audio, so you want something like
```python
bytes_downloaded += max(streams.get("video", 0), streams.get("audio", 0))
```
- the clipping subsampler should take encode_formats as an init param
- we can make this nicer by doing something like
```python
broadcast_subsampler = clipping_sub if "clips" in whatever else noop_sub
```
and then just calling that with streams and meta
- idea: get rid of listing out all the subsamplers; as we add more, we shouldn't have to add another if statement to the worker loop. Instead, initialize the worker with a list of non-None f"{modality}_subsamplers" attributes and just iterate over all of those. The reason the attribute would be called f"{modality}_subsamplers" is that, instead of checking whether "video" is in streams, we can iterate over all the modalities in streams and retrieve `eval(f"self.{modality}_subsamplers")`
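A minimal sketch of that modality-generic loop, assuming a hypothetical subsampler interface of `__call__(stream, meta) -> (streams_list, meta)`; `getattr` is a safer spelling of the `eval` trick:

```python
class NoOpSubsampler:
    """Placeholder subsampler: passes the stream through unchanged."""
    def __call__(self, stream, meta):
        return [stream], meta

class Worker:
    """Hold one subsampler list per modality and iterate generically,
    so adding a modality doesn't add another if statement."""
    def __init__(self, video_subsamplers=None, audio_subsamplers=None):
        self.video_subsamplers = video_subsamplers or []
        self.audio_subsamplers = audio_subsamplers or []

    def subsample(self, streams, meta):
        for modality, stream in streams.items():
            # equivalent to eval(f"self.{modality}_subsamplers"), without eval
            for subsampler in getattr(self, f"{modality}_subsamplers", []):
                [stream], meta = subsampler(stream, meta)
            streams[modality] = stream
        return streams, meta
```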
- format_type isn't an argument to the writer
data_reader.py
- rename and rethink Mp4Downloader. Maybe something like WebFileDownloader, or whatever. The process should be very similar for any .mp4, .mp3, .webm, .wav, or other extension links. Maybe we can add an is_web_link helper function that you pass the url into; if it returns true, you go to WebFileDownloader, which determines what to do with the bytes based on encode_formats, i.e. if you have audio then the bytes are audio; if you have video+audio then the url is video but you need to extract the audio, etc.
- let's commit to output dicts, kind of like with streams and encode_formats; we have this everywhere and it makes the code more readable imo
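A sketch of that hypothetical is_web_link helper (name and extension list are assumptions, not repo code):

```python
from urllib.parse import urlparse

# extensions the direct downloader would handle; extend as needed
_DIRECT_EXTENSIONS = (".mp4", ".mp3", ".webm", ".wav", ".m4a", ".ogg")

def is_web_file_link(url):
    """Return True if url points directly at a media file we can fetch,
    as opposed to e.g. a YouTube page that needs yt-dlp."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    return parsed.path.lower().endswith(_DIRECT_EXTENSIONS)
```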
- sample_rate shouldn't be a key in encode_formats; it should be a parameter to main that gets used in some audio subsampler or whatever
- video2audio is kind of already doing fusing without an explicit audio subsampler. I'll move this into an audio subsampler and we can come back to it when we do fusing properly (although this isn't great fusing, tbh). Also add a test for it in test_subsamplers.py; that should help with the FFmpeg errors we're not catching in data_reader.py's video2audio
- video2audio also needs better param names
- remove this, and rename things to "audio_file", "video_file". Let's keep naming conventions involving modalities consistent to stay open to tricks like `eval(f"{modality}_something")` and such. We can also reduce some lines in the VideoDataReader class with those tricks
- the audio tempfile isn't deleted
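For the leaked tempfile, one pattern that guarantees cleanup even when processing raises (a sketch; the processing step is a placeholder):

```python
import os
import tempfile

def with_audio_tempfile(audio_bytes, suffix=".m4a"):
    """Write audio bytes to a named tempfile, hand the path to a
    processing step, and always delete the file afterwards."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(audio_bytes)
        # ... hand `path` to ffmpeg / the reader here ...
        return os.path.getsize(path)
    finally:
        # runs on success and on exception, so the tempfile never leaks
        os.remove(path)
```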
- even though for now it is yt_meta_dict, let's just call it meta_dict
- YtDlpDownloader needs editing: better format string — why m4a? why not the extension passed in via encode_formats? It also does subsampling in the downloader, which we don't want yet. You can probably do something similar to what we do for resolution, i.e. get the closest possible higher sample rate by passing the correct format string. I'm also not sure we can just replace like this. Finally, I need to make sure video is downloaded separately from audio and audio separately from video; then we can download them separately and edit that max() line above to just sum the bytes, which will be exactly right. Some useful links for me later: https://write.corbpie.com/downloading-youtube-videos-as-audio-with-yt-dlp/, https://superuser.com/questions/1266162/youtube-dl-set-sample-rate-on-mp3
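If I remember yt-dlp's format selection correctly, `asr` is the audio-sampling-rate filter field, so the resolution trick would translate to something like this (hypothetical helper; the exact selector syntax should be double-checked against the yt-dlp docs):

```python
def build_audio_format_string(sample_rate=None, ext=None):
    """Build a yt-dlp format selector that prefers an audio stream at or
    above the requested sample rate, falling back to plain bestaudio."""
    selector = "ba"  # "ba" is yt-dlp shorthand for bestaudio
    if ext:
        selector += f"[ext={ext}]"
    if sample_rate:
        # try the filtered selector first, then fall back unfiltered
        selector = f"{selector}[asr>={sample_rate}]/{selector}"
    return selector
```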
- in general, the data reader needs to generalize to video formats other than mp4 (right now it's overfit to that)
data_writer.py
- think about this: do we need to be iterating over encode_formats? Maybe it's enough to just iterate over streams and write nothing when a modality isn't present in streams. When does that happen? Do we want to write an empty meta then?
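A sketch of the streams-first direction (hypothetical helper; returns filename→bytes rather than writing, to keep the idea separable from the tar writer):

```python
def collect_writes(streams, meta, encode_formats):
    """Build the shard entries by iterating over streams, not
    encode_formats, so an absent modality is simply skipped."""
    out = {}
    for modality, data in streams.items():
        ext = encode_formats.get(modality)
        if ext is not None:
            out[f"{meta['key']}.{ext}"] = data
    return out
```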
- writers need testing
subsamplers/clipping_subsampler.py
- same as with noop_subsampler: encode_formats should be an init param, not a call param
- need to check that audio clipping works as intended, i.e. that it lines up with the video and the correct number of clips are correctly ID'd, etc.
tests/test_downloaders.py
tests/test_audio.py
- should be renamed to test_reader.py and we should actually test the reader
- actually pretty good, just needs to be adapted a bit; add more parameters to test whether the video is read properly, etc. We should also test that the correct error_messages are being returned (or exceptions thrown, if we decide to merge that one PR)
README.md
- the examples are getting too long and unintuitive; we should just add more things to the examples directory
- specifically, let's show how to use encode_formats and other params like that
- also let's add a tutorial on how to run this with distributed=spark; I think that's not obvious but very useful
- maybe add a citation?
besides the above cleanup, there are a few more PRs to handle:
- #91 - improves subtitle support and fixes a few things, 100% needs to get merged
- #92 - I think this is worth trying and considering
- #80 - check if it's done
v1 ideas
While going through the code I had some ideas for v1:
- if encode_formats has both video and audio, perhaps we should do most of the pipeline with video and audio in one mp4 byte stream instead of separate video and audio streams, so subsampling such as clipping can be done on both together; we can then split it up at the end instead of the other way around
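The final split step could be plain ffmpeg stream copies (`-an` drops audio, `-vn` drops video, `-c copy` avoids re-encoding). A sketch that only builds the commands (a hypothetical helper; output extensions depend on the source codecs):

```python
def split_commands(muxed_path, video_out, audio_out):
    """After clipping the muxed mp4, split it into a video-only and an
    audio-only file without re-encoding either stream."""
    return [
        ["ffmpeg", "-i", muxed_path, "-an", "-c:v", "copy", video_out],
        ["ffmpeg", "-i", muxed_path, "-vn", "-c:a", "copy", audio_out],
    ]
```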
Delaying, we need to get some successful use case of data from this repo. Either SVD or VideoCLIP or whatever.
I would call this done; codebases are using this
yeah sure, maybe it would be good to update pip package if this is the case