hashtags's People

Contributors

3to1null, eggpi, jain-aditya, jsnshrmn, jumakiwaka, ksubbu199, n3rsti, parths007, samwalton9, sanyam-git, soumyaa1804, suecarmol

hashtags's Issues

Delayed results

Since around last summer, the hashtag tool has not been displaying results for recent time slots when we use it.
This seems to be getting worse.
For example, I am currently running a #shesaid campaign on Wikiquote. I have results up to the end of September, but nothing after the 30th, and we are now at the end of November.
Last summer, during the #WPWP campaign (which closed August 30), most of the final results only showed up in the tool between mid and late September.
Is there a practical reason for this?

Identify edits with hashtags that introduce media into the page

For #WPWP in July / August, it would be great to be able to identify whether an edit with a given hashtag introduces new media into a page.

The first part of this feature is to identify, for incoming real-time change events, whether they add media.

Given the old and new revids for an edit, the way to identify media additions, broken down by media type, seems to be a combination of action=parse&prop=images and action=query&prop=imageinfo API calls. See this Phabricator ticket for details.
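
For reference, here is a minimal sketch of that approach, assuming the requests library; the function names and target wiki are hypothetical, and the real collect_hashtags.py logic may differ:

import requests

API = "https://en.wikipedia.org/w/api.php"  # hypothetical target wiki

def rendered_files(revid):
    # action=parse&prop=images lists the files rendered in a revision.
    r = requests.get(API, params={
        "action": "parse", "oldid": revid,
        "prop": "images", "format": "json",
    }).json()
    return set(r["parse"]["images"])

def added_files(old_revid, new_revid):
    # Files present in the new revision but not in the old one.
    return rendered_files(new_revid) - rendered_files(old_revid)

def media_types(titles):
    # action=query&prop=imageinfo reports each file's media type
    # (e.g. BITMAP, DRAWING, VIDEO, AUDIO).
    r = requests.get(API, params={
        "action": "query", "prop": "imageinfo",
        "titles": "|".join("File:" + t for t in titles),
        "iiprop": "mediatype", "format": "json",
    }).json()
    return {
        p["title"]: p["imageinfo"][0]["mediatype"]
        for p in r["query"]["pages"].values()
        if "imageinfo" in p
    }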

I've tried this in a hacked-together proof-of-concept version of collect_hashtags.py, and it does not appear to introduce significant latency to the processing of incoming edits.

The second part is the UI: we can probably add a couple of checkboxes ("Adds image" / "Adds video") to the tool's UI to allow that kind of filtering.
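
As a rough illustration (the form class and field names here are hypothetical, not the tool's actual code), the checkboxes could be plain Django form fields:

from django import forms

class HashtagSearchForm(forms.Form):
    # Hypothetical search form; the tool's real form class may differ.
    query = forms.CharField()
    adds_image = forms.BooleanField(required=False, label="Adds image")
    adds_video = forms.BooleanField(required=False, label="Adds video")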

Tool not tracking

Hi! It seems like the tool has stopped tracking new edits since August 4.

I already reported it at Phabricator, but I'm opening this issue to increase the chances of reaching someone who can fix it.

The tool has been invaluable to me (and surely to many others) since I discovered it not so long ago. It lets me track, easily and across wikis, edits made with the various tools I develop, which in turn helps me identify bugs and opportunities. Losing it is a huge problem for me (and surely for others too). Can anything be done?

Thanks! And thanks also for developing the tool in the first place!!

Add new favicon to repo

We got a new favicon via Google Code-In, and I uploaded it to the live tool, but I forgot to ever add it to the repo.

Track edits that add audio

In #60, we added a feature where the Hashtags tool can identify edits that introduce images or video to a page, and use those properties in a search.

We should do the same for audio. The implementation would follow very similar steps:

  1. Add a new column has_audio to the database, like in b315c17 (a rough sketch of such a migration follows this list).
  2. Update scripts/collect_hashtags.py so we also identify changes of media type 'AUDIO'.
  3. Change the UI so the new database field can be used in queries, as in 5674520.
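
A hedged sketch of step 1 as a Django migration; the app label, dependency, and model name here are assumptions and would need to match the real migration graph:

from django.db import migrations, models

class Migration(migrations.Migration):
    dependencies = [("hashtags", "0001_initial")]  # hypothetical dependency

    operations = [
        migrations.AddField(
            model_name="hashtag",
            name="has_audio",
            field=models.BooleanField(default=False),
        ),
    ]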

Consider moving back to SQL queries on replicas rather than consuming Event Stream

We've been having reliability issues that are somewhat difficult to troubleshoot, where the tool stops processing updates. We thought 77d68bd solved this, but it looks like it's back after a few months.

We're not sure Event Stream is really the cause (it could just as well be an issue where the container that consumes the stream is not running, or something else), but I wonder if moving back from the stream to SQL queries, which is where the tool started, wouldn't result in a simpler and more resilient design.

A SQL-based design could also have lower latency, as a database query should be much faster than making multiple HTTP requests to fetch the same data from Event Stream. For a quick comparison, we can fetch roughly the data we need for the past month with:

SELECT rc.rc_title, rc.rc_this_oldid,
       rc.rc_last_oldid, rc.rc_timestamp,
       c.comment_text
FROM recentchanges AS rc JOIN comment AS c
ON rc.rc_comment_id = c.comment_id
WHERE c.comment_text regexp '[[:space:]]+#[^#]{3,}' AND rc.rc_timestamp > '20210704000000'
AND NOT rc.rc_bot AND rc.rc_source IN ('mw.edit', 'mw.new');

This takes 1 min on Toolforge, while my local tool takes many hours to catch up on just a couple of days of backlog. This is not a fair comparison (the SQL query is not doing API calls, and I have a higher latency to the API from home than the tool does), but it's interesting evidence that we should explore further.

A sketch of the design:

  1. Query the meta db to find all projects to track (~400, as I write this):
SELECT dbname
FROM wiki
WHERE family IN ('wikisource', 'wikipedia', 'wiktionary', 'wikinews')
AND is_closed = 0;
  2. Spawn a pool of N worker processes, each polling 1/N of the databases. Maybe use the size field of meta_p.wiki when partitioning, so that the large projects don't all end up with the same worker (a rough sketch follows this list).
  3. Within each worker, run the big query above for each database, process the results as we currently do, then sleep for X min plus jitter.
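
Here is a rough sketch of steps 2 and 3, with hypothetical names throughout; run_big_query stands in for executing the SELECT above against one wiki and processing the rows as collect_hashtags.py currently does:

import random
import time
from multiprocessing import Process

def run_big_query(dbname):
    # Placeholder: run the recentchanges/comment SELECT against dbname
    # and process the resulting rows.
    pass

def partition(wikis, n):
    # wikis: (dbname, size) rows from meta_p.wiki. Sorting by size and
    # dealing round-robin keeps the largest projects on different workers.
    buckets = [[] for _ in range(n)]
    for i, (dbname, _size) in enumerate(sorted(wikis, key=lambda w: -w[1])):
        buckets[i % n].append(dbname)
    return buckets

def worker(dbnames, poll_minutes=5):
    while True:
        for dbname in dbnames:
            run_big_query(dbname)
        time.sleep(60 * poll_minutes + random.uniform(0, 60))  # jitter

def main(wikis, n_workers=8):
    for bucket in partition(wikis, n_workers):
        Process(target=worker, args=(bucket,)).start()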

Searching for a hashtag is slow

As far as I can tell, searching for any hashtag takes multiple seconds and sometimes even times out. We should look into optimizing this.

As a simple first step, we could log the output of explain on the queryset used in searches to see if we're doing something funny like a full-table scan. For my own future reference, here's the documentation for interpreting that output.
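
With Django 2.1 or later this is a one-liner on the queryset; assuming the model and field names look something like this (they are illustrative, not the tool's exact code):

from hashtags.models import Hashtag  # hypothetical import path

qs = Hashtag.objects.filter(hashtag="shesaid")
print(qs.explain())  # prints the database's plan, e.g. a full-table scan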

It would also be good to experiment with a profiler to get a breakdown of the wall-clock time of a query. I think it's likely that the bottleneck is the database query (because the rest of the code is pretty standard use of Django), but we shouldn't take that for granted and this would allow us to identify other bad spots.

@Samwalton9 would you be able to try the above in the production environment please?

Don't attempt to save hashtags longer than the column size

The hashtags tool is stuck on another edit. This time, it's because the data is too long for the hashtag column: hashtag has a max length of 128, but nothing in collect_hashtags.py checks that we're not about to save something longer than that. We could either extend the field size or add a check to collect_hashtags.py so that we don't try to save any hashtag longer than 128 characters.

I don't recall the rationale for choosing 128, but I can't imagine any use case that needs a hashtag that long. I suspect these are edge cases with URLs or some such nonsense in the edit summary that we don't actually care about.
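
The check could be as simple as the following sketch (the constant and variable names are hypothetical, not collect_hashtags.py's actual code):

MAX_HASHTAG_LENGTH = 128  # mirrors the column's max_length

def fits_column(tag):
    # Skip, rather than crash on, hashtags that would overflow the column.
    return len(tag) <= MAX_HASHTAG_LENGTH

extracted_hashtags = ["shesaid", "x" * 300]  # example input
valid = [t for t in extracted_hashtags if fits_column(t)]  # drops the long one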

collect_hashtags.py is stuck on an azbwiki edit

The edit that created this page also introduces a file (File:Flag_of_Turkey.svg) into the page.

This is an interesting file: when we query its imageinfo, the API always sends back a continue object, indicating that collect_hashtags.py should query it again, passing a different iistart parameter each time, to get more data. This seems to be because the URL of the file keeps changing. See these example queries:

https://azb.wikipedia.org/w/api.php?action=query&titles=File:Flag_of_Turkey.svg&format=json&prop=imageinfo&iiprop=mediatype|url&iistart=2012-05-06T18:27:10Z
https://azb.wikipedia.org/w/api.php?action=query&titles=File:Flag_of_Turkey.svg&format=json&prop=imageinfo&iiprop=mediatype|url&iistart=2012-05-04T09:24:38Z
https://azb.wikipedia.org/w/api.php?action=query&titles=File:Flag_of_Turkey.svg&format=json&prop=imageinfo&iiprop=mediatype|url&iistart=2011-02-28T15:51:42Z

Something even more interesting happens on that last query: the returned iistart is the same one we passed in! When we use it to build a new request URL, we end up making the same request again, and getting the same response back, over and over.

I don't really know why this would happen in the API or whether that result is even valid, and the API documentation doesn't really go into a lot of detail on how iistart is supposed to work, so I'm not sure whether we're using it correctly.

One quick, probably robust fix that we can do here is to add a limit to how many times we follow the continue response.
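
A hedged sketch of that fix, assuming a requests-based fetch loop (the function and constant names are hypothetical):

import requests

MAX_CONTINUES = 10  # arbitrary cap; plenty for legitimate pagination

def fetch_imageinfo(api_url, title):
    params = {
        "action": "query", "titles": title, "format": "json",
        "prop": "imageinfo", "iiprop": "mediatype|url",
    }
    pages = []
    for _ in range(MAX_CONTINUES):
        resp = requests.get(api_url, params=params).json()
        pages.append(resp)
        if "continue" not in resp:
            break
        # Without the cap above, a self-referential continue loops forever.
        params.update(resp["continue"])
    return pages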
