dmarx / videolinkbot Goto Github PK

Reddit bot that posts a comment with all of the video links in a submission. Currently only supports YouTube.

License: MIT License

Python 100.00%

videolinkbot's Introduction

Hi there! I'm David (He/Him)

I'm passionate about making quality tools that empower AI practitioners and generative artists.

I'm currently working as an ML engineer at CoreWeave, and am also affiliated with eleuther.ai. Previously, I helped launch Stability AI in the role of Distinguished Engineer, and worked in more conventional Data Scientist roles at Microsoft, Amazon, Fannie Mae, USDOL-OIG, Elder Research, and SoundExchange. A common theme in my work for the last few years has been building tools which making bleeding edge AI research accessible to non-technical creatives.

Creator/maintainer/curator of https://github.com/pytti-tools ->

Some projects I'm currently or recently involved with:

Developing tools to facilitate parameterizing complex, multi-scene animation sequences
Invented a technique for generating a multi-scene music video from audio
inventing and building cutting edge, state of the art, AI animation tools and techniques
working with electronic musicians and VJs to advance audio-reactive animation research
maintaining and extending pytti-tools
building a library to facilitate working with pre-trained CLIP-like models
building a library to facilitate working with and management of messy research code
working with researchers to build data collection tools that will be used to turn hackathon activities into training data for code generative language modeling
Implementing notebooks to facilitate AI artist use of pre-trained research models, including FiLM and blended defusion
model design for a forthcoming AI art model
Shepherding the launch of the Stable Diffusion API and accompanying infra and tooling (public SDK, CI/CD, etc.) as engineering lead through the launch of the DreamStudio product

Broad Research Interests

Text guided image synthesis
AI-assisted animation
Representation learning
- contrastive
- semi-supervised
- adversarial
- composable
- multi-modal
Application of topolgical and geometric methods to machine learning
Generative models
Inductive priors
Learning theory

Current Areas of Research Focus

Modeling with implicit representations and operators
Composable representations
Latent-space manipulation
Scale-agnostic learning
Artistic applications of multi-modal generative models

Old Blog

http://dmarx.github.io/posts/

# Greatest hits

https://github.com/dmarx/bench-warmers/
https://github.com/dmarx/sd-lazy-wildcards
https://github.com/dmarx/notebooks
https://github.com/dmarx/workbench
https://github.com/dmarx/anthology-of-modern-ml
https://github.com/dmarx/video-killed-the-radio-star
https://github.com/dmarx/keyframed
https://github.com/dmarx/fast-and-simple-dense-retrieval
https://github.com/dmarx/fasdr-action
https://github.com/dmarx/zero-shot-intent-classifier
https://github.com/dmarx/keyframed_chatgpt
https://github.com/dmarx/not-a-package-manager
https://github.com/dmarx/Multi-Modal-Comparators
https://github.com/dmarx/anthology-of-ml-for-ai-art
https://github.com/dmarx/pytti-core
https://github.com/dmarx/cka_pytorch
https://github.com/dmarx/checkin
https://github.com/dmarx/Topological-Anomaly-Detection
https://github.com/dmarx/Reddit_response_to_Trump

https://github.com/dmarx/make_for_datascience
https://github.com/dmarx/dispatchr
https://github.com/dmarx/twitterMonitor
https://github.com/dmarx/TextSummarization
https://github.com/dmarx/reddit-map
https://github.com/dmarx/dmarx.github.io
https://github.com/dmarx/Target-Shuffling
https://github.com/dmarx/statisticalArgumentForSettling
https://github.com/dmarx/SubredditMentionsGraph
https://github.com/dmarx/BoostTheVote

https://github.com/dmarx/Topological-Anomaly-Detection-r
https://github.com/dmarx/TravelingSalesmanMCMC
https://github.com/dmarx/GameOfLife
https://github.com/dmarx/Dreidel
https://github.com/dmarx/Random-Contingency-Table-Generator
https://github.com/dmarx/d3-mines
https://github.com/dmarx/mines

https://github.com/dmarx/VideoLinkBot

https://github.com/dmarx/awesome-llm-utilities
https://github.com/dmarx/auto-tagger
https://github.com/dmarx/the-rest-of-the-fucking-owl
https://github.com/dmarx/owl-keyframed
https://github.com/dmarx/psaw
https://github.com/dmarx/debugger
https://github.com/dmarx/nfpa
https://github.com/dmarx/enl-supply
https://github.com/dmarx/supply-chain
https://github.com/dmarx/data_generation_demo
https://github.com/dmarx/tpot
https://github.com/dmarx/DataKind-SmokeAlarms
https://github.com/dmarx/congressional-rollcalls

videolinkbot's People

Contributors

Stargazers

Watchers

Forkers

listentous pbhj rs19hack ruddfawcett adolfoeliazat

videolinkbot's Issues

"by request" bot

Create a new bot username (partially to skirt around subreddit bans, also to limit messaging to the main bot) and a subreddit of the same name.
Build a script that monitors this subreddit for links to reddit submissions or comments (or reddit links in a self post body) and directs the bot to scrape the associated submissions.
The bot then responds to the bot-specific-subreddit submission and the source submission (if it's able) with the list of videos.
As with the main bot, this bot should update (every hour) on submissions made within 24 hours.

Creation of this bot could probably be simplified by generalizing some of the simplebot code. This might not even be necessary, would be great if I could just use everything as is. I think at least post_aggregate_links could/should be generalized, or maybe moved into a separate script. simplemonitor could also potentially be generalized.

Add support for domain: thedailyshow

Add link to radd.it playlist

Basically have this already, just need to test.

Add exception handling to recognize rate-limiting from the YouTube API

YouTube doesn't have a clear-cut rate limit, but they have blocked the bot before. This manifests as HTTP errors in get_title(), which then just posts the video link with the title "...?..." (by design). get_title should be modified to slow the bot down (more than it already does) when it recognizes youtube may be getting annoyed with the bot.

Alternatively, could potentially identify if certain specific actions resulted in the bot previously being blocked and determine if there's any action that can be taken to mitigate future API blocking.

Refactor build_comment() so we don't waste time getting titles for videos that won't be listed due to the character limit

Instead of getting all video titles then formatting the links and appending to the comment body, format links and append them immediately after getting the video title. This way we can cut out as soon as the comment hits the character limit. This should also eliminate the need for the trim_comment() function, since we'll stop building comments before they hit the limit.

Update get_title() to extract video title from problematic youtube domains

Examples:

m.youtube.com
youtube.googleapis.com

Script enters an infinite loop when trying to post to a deleted submission

This wasn't really an issue before, but now it's a big problem because of adding playlist's. Simple workaround would be to add support for a subreddit blacklist, but really we need better error handling. Should be cognizant of similar issue when attempting to comment on a deleted post: use specialized exceptions from praw.

Add support for domain: vimeo

Add more rigorous sorting

No reason to just sort by score. Should sort by score > author > title. Should be trivial to implement since we're already working with pandas dataframes.

Update existing posts

People have been complaining that video scores are stagnant. To remedy this: every hour or so, visit bot's comment history, sort by "hot", and re-scrape those posts.

Add (optional) data persistence

Since the bot is scraping /r/all anyway, would be nice to build a dataset to play with later. The bot should store select information from all the comments it scrapes in a SQLite database. Also, the bot should store information about itself: in particular, what video links it's collecting from each subreddit. Would be interesting to see which videos are popular in which subreddits. Not sure whether or not the deduplication is something I care about or not here.

Data persistence should be turned on via command line argument: default bot operation should be as lightweight as possible.

Spotify playlist

http://www.reddit.com/r/Music/comments/184f91/93_til_infinity_souls_of_mischief/c8bxnzu

"yes it actually worked great as a video playlist - thanks again! I don't know anything about spotify playlists except that they exist. here is a website that will convert a text file into a playlist that should be able to at least show you what the format is like. this is spotify's API documentation FWIW."

comandline support: username/password/credentials filename

Keeping the credentials in a file is convenient, but really it would be better to pass something like that in on the commandline. Other interesting commandline options to be explored:

subreddit to monior (default: all)
blacklist filename
database name (for persistence)

Deal with escaped HTML characters in video titles

Just registering that this was an issue. Resolved today.

youtube channels appearing as [None](None)

http://www.reddit.com/r/letsplay/comments/1d4zkt/feedback_friday_3_04262013/c9myimd

Domain support priorities

via https://gist.github.com/dmarx/4732673 (in descending order):

Implemented:
YouTube (from start)
LiveLeak (done: 4/26/13)
Vimeo (done: 5/11/13)
youtubedoubler (done, 5/11/13)
nicovideo (done, 5/12/13)

High Priority:
Vine
TED
DailyMotion
TheDailyShow
colbertnation
FunnyOrDie
CollegeHumor
TheOnion

Low Priority:
ComedyCentral
WorldStarHipHop
DeadSpin
TheStar
nymag
nytimes.com
guardian.co.uk
twitvid
flickr.com

Clear old data from memos

Right now, memos will grow and grow and grow. After several day sof operation, they only take up a few MB, so maybe this shouldn't even be a concern. But I feel like data that's over a day or two old should get flushed from the memos.

Deduplicate youtube videos on yt video id instead of raw URL

Sort video links by score

Two possible options here:

Comment score of earliest comment where this videolink was posted.
Comment score of highest scoring comment containing video link

Option (1) is more in accordance with the current state of the bot, but probably not what people would really want to see. Option (2) is definitely more how people would want the videos sorted, but then the "source comment" permalink is a little deceptive.

I should probably go with option (2) and modify the bot to identify each video not with its earliest mention, but with the comment that has achieved the highest score. Alternatively, I could add a second column with a link to the highest scoring comment, but this will be the source comment for most videos. Maybe only populate this column if the source video is not the same comment as the one where the video achieved its highest score? Yuck. Would save space though.

I guess I should probably just go with option (2) and change the "source comment" to link to the highest scoring comment, but this will probably result in a feedback loop where high scoring comments will receive more upvotes. I'd prefer if these upvotes were directed to the first user to post the video, but whatever. C'est la vie.

NB: Timestamp comment with last updated time. UTC?

Add support for domain: dailymotion

Add logging support

Right now it just pushes messages to stdout. We should have some real logging going on of some kind. In particular, I'd like to log the amount of time it takes to update hot comments to determine if perhaps the bot shouldn't have a different rubric for when to resume normal scraping.

Recognize videos from arbitrary domain

Not sure if this is possible, and even if it is it's probably not a good idea since it would require a GET request on every single link the bot encounters. I can dream (and open up an issue) though.

unk issue: triple posted playlist

http://www.reddit.com/r/anime/comments/18clbn/whats_your_favorite_scene_in_a_ed_op/c8dnyae

Bot scraping the same posts repeatedly

Bot would identify a link comment, scrape the post, then repeat the same "identified link-comment by on submission ". This is causing the bot to waste a lot of time repeating work it has already completed.

[CRITICAL] Bot should not completley ignore previously scraped comments

The bot was originally designed to attribute video links to the earliest comment that had posted it. In this scenario it was OK to ignore these comments on a second pass (although we would miss new videos if that user had posted any).

The problem now is that we're collecting comment score. As we're ignoring comments we've already seen, we're necessarily not updating these scores properly. We can still skip over parsing the comments for links, but we need to at least check the score on these comments.

Maybe we need to completely reevaluate how we're using the memo objects.

Add support for domain: funnyordie

Refactor to work with praw 2.x

Bot requires a version of praw.Submission.all_comments_flat, which was removed from praw. To stay up to date, replace all_comments flat like this:

    subm.replace_more_comments()
    all_comments_flat = flatten_tree(subm.comments)

Pertinent discussion: http://www.reddit.com/r/redditdev/comments/17tn7r/replacement_for_all_comments_flat/

skip to 2nd item in radd.it playlist

radd.it now allows playlists to start at an arbitrary element. 1st element is the bot comment, so playlists should always start at 2nd element, like so: http://radd.it/comments/199cyw/_/c8m1io3?start=2

Add a opensource license to this project

This would let other people use your code, or even submit pull requests to improve it.

Subreddits blacklist

simplemonitor should ignore comments from select subreddits (in particular, those the bot has been banned from). Add a file called "blacklisted_subreddits.txt" that simplemonitor references. If a comment is from a blacklisted subreddit, memoize the id and move along.

What would be really nice would be if the bot can recognize in it its messages that it has been banned from a subreddit. These messages come in a standard format, so it should be fairly easy. When banned, the bot would recognize the message and update the blacklist file.

Youtube thumbnails are not videos

videolin false-positive at ytimg donain:

Image is thumbnail for youtube video: http://www.youtube.com/all_comments?v=uVOBEge7YTg

Potential future feature: add video thumbnails to VLB posts via format:

    http://i2.ytimg.com/vi/***video_id***/mqdefault.jpg

I don't think this really adds anything. Just a thought.

Don't ignore submissions over max_num_comments threshhold that the bot has already visited

Instead, sort comments by new and update the bot comment as appropriate

acting extra wonky when title = None

http://www.reddit.com/r/worldnews/comments/18v3n8/north_korean_propaganda_video_shows_president/c8ifjvc?context=3

Don't post links to removed videos

Title comes up as "None". No point posting links to removed content.

Sample:

http://www.youtube.com/watch?feature=player_detailpage&amp;v=rsBOiNAYWtM#t=239s

http://www.reddit.com/r/videos/comments/1c0rdu/the_most_insane_live_crowd_ever_wwe_raw_4813/c9c1kmu

Don't memoize bad video titles

Right now, the bot is memoizing the result from get_title(), including "...?..."

Instead, get_title() should return None if no video title was found. Then, build comment can replace None with "...?..." as needed. If the bot returns to the same post, it will see that it still needs to get a title for any links it missed. Hopefully, this will help get around the YouTube API rate-limiting (or at least, help repair the problems it causes the bot).

Check submission age before scraping

Make sure we aren't scraping month-old posts.