wikipedialibrary / hashtags

Hashtags tool - tracking hashtags in Wikimedia project edit summaries
Home Page: https://hashtags.wmcloud.org/
License: MIT License
Since around last summer, when we use the hashtag tool, it does not display results for recent time slots.
This seems to be getting worse.
For example, I am currently running a #shesaid campaign on Wikiquote. I got results up to the end of September, but nothing after the 30th, and it is now the end of November.
Last summer, we were running the #WPWP campaign, which closed on August 30, and most of the final results only showed up in the tool mid-to-late September.
Is there a practical reason for that?
For #WPWP in July / August, it would be great to be able to identify whether an edit with a given hashtag introduces new media into a page.
The first part of this feature is to identify, for incoming realtime change events, whether they add media.
Given the old and new revids for an edit, the way to identify media additions, by type of media, seems to be a combination of action=parse&prop=images and action=query&prop=imageinfo API calls. See this Phabricator ticket for details.
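A minimal sketch of that flow, assuming the API combination described above. The function names and the injected fetch hook are illustrative, not taken from collect_hashtags.py; a real implementation would use the tool's HTTP client and per-project endpoints.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://en.wikipedia.org/w/api.php"  # would be per-project in the tool

def fetch_json(params):
    """GET the action API with the given params and decode the JSON reply."""
    with urlopen(f"{API}?{urlencode({**params, 'format': 'json'})}") as resp:
        return json.load(resp)

def images_in_revision(revid, fetch=fetch_json):
    """Files used by a revision, via action=parse&prop=images."""
    data = fetch({"action": "parse", "oldid": revid, "prop": "images"})
    return set(data["parse"]["images"])

def added_media(old_revid, new_revid, fetch=fetch_json):
    """Files present in the new revision but not in the old one."""
    return images_in_revision(new_revid, fetch) - images_in_revision(old_revid, fetch)

def media_type(title, fetch=fetch_json):
    """A file's media type (BITMAP, VIDEO, ...), via action=query&prop=imageinfo."""
    data = fetch({"action": "query", "titles": title,
                  "prop": "imageinfo", "iiprop": "mediatype"})
    page = next(iter(data["query"]["pages"].values()))
    return page["imageinfo"][0]["mediatype"]
```

Each edit costs two parse calls plus one imageinfo call per added file, which matches the observation that the proof of concept added no significant latency.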
I've tried this in a proof-of-concept, hacked version of collect_hashtags.py and it looks like it does not introduce significant latency to the processing of incoming edits.
The second part is the UI: we can probably add a couple of checkboxes ("Adds image" / "Adds video") to the tool's UI to allow that kind of filtering.
Hi! It seems like the tool has stopped tracking new edits since August 4.
I already reported it on Phabricator, but I'm opening this issue to increase the chances of reaching someone who can fix it.
The tool has been invaluable to me (and surely to many others) since I discovered it not long ago. It helps me track edits made with the various tools I develop, cross-wiki and easily, which in turn helps me identify bugs and opportunities. Its loss is a huge problem for me (and surely for others too). Can anything be done?
Thanks! And thanks also for developing the tool in the first place!!
We got a new favicon via Google Code-In, and I uploaded it to the live tool, but I forgot to ever add it to the repo.
In #60, we added a feature where the Hashtags tool can identify edits that introduce images or video to a page, and use those properties in a search.
We should do the same for audio. The implementation would follow very similar steps:
- add a has_audio field to the database, like in b315c17;
- check for the 'AUDIO' media type when processing incoming edits.

#default is a magic word we should exclude from collection.
https://github.com/Samwalton9/hashtags/blob/master/scripts/common.py#L4
We've been having reliability issues that are somewhat difficult to troubleshoot, where the tool stops processing updates. We thought 77d68bd solved this, but it looks like it's back after a few months.
We're not sure EventStreams is really the cause -- it could just as well be some issue where the container that consumes the stream is not running, or something else -- but I wonder if moving back from the stream to SQL queries, which is where the tool started, wouldn't result in a simpler and more resilient design.
A SQL-based design could also have lower latency, as a database query should be much faster than doing multiple HTTP queries to fetch the same data from Event Stream. For a quick comparison, we can fetch roughly the data we need for the past month with:
SELECT rc.rc_title, rc.rc_this_oldid,
       rc.rc_last_oldid, rc.rc_timestamp,
       c.comment_text
FROM recentchanges AS rc
JOIN comment AS c ON rc.rc_comment_id = c.comment_id
WHERE c.comment_text REGEXP '[[:space:]]+#[^#]{3,}'
  AND rc.rc_timestamp > '20210704000000'
  AND NOT rc.rc_bot
  AND rc.rc_source IN ('mw.edit', 'mw.new');
This takes 1 min on Toolforge, while my local tool takes many hours to catch up on just a couple of days of backlog. This is not a fair comparison (the SQL query is not doing API calls, and I have a higher latency to the API from home than the tool does), but it's interesting evidence that we should explore further.
A sketch of the design:
- Query the meta database to find all projects to track (~400, as I write this):

  SELECT dbname
  FROM wiki
  WHERE family IN ('wikisource', 'wikipedia', 'wiktionary', 'wikinews')
    AND is_closed = 0;

- Use the size field of meta_p.wiki in partitioning, to ensure the large projects don't all end up with the same worker.

As far as I can tell, searching for any hashtag seems to take multiple seconds and even times out sometimes. We should look into optimizing this.
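The size-based partitioning mentioned above could be a simple greedy assignment: hand out projects biggest-first, each to the currently least-loaded worker. A hypothetical sketch (the function name and the example sizes are made up for illustration):

```python
import heapq

def partition_by_size(projects, n_workers):
    """Greedily assign (dbname, size) pairs to workers so the large
    projects spread out: biggest first, each to the least-loaded worker."""
    # Heap entries are (total assigned size, worker id, assigned dbnames);
    # the unique worker id breaks ties so lists are never compared.
    heap = [(0, i, []) for i in range(n_workers)]
    heapq.heapify(heap)
    for dbname, size in sorted(projects, key=lambda p: -p[1]):
        load, i, names = heapq.heappop(heap)
        names.append(dbname)
        heapq.heappush(heap, (load + size, i, names))
    return [names for _, _, names in sorted(heap, key=lambda w: w[1])]
```

With this scheme a single huge project (think enwiki) naturally ends up alone on a worker while the small wikis share the rest.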
As a simple first step, we could log the output of explain on the queryset used in searches to see if we're doing something funny like a full-table scan. For my own future reference, here's the documentation for interpreting that output.
It would also be good to experiment with a profiler to get a breakdown of the wall-clock time of a query. I think it's likely that the bottleneck is the database query (because the rest of the code is pretty standard use of Django), but we shouldn't take that for granted and this would allow us to identify other bad spots.
@Samwalton9 would you be able to try the above in the production environment please?
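Production runs on MariaDB, where Django's QuerySet.explain() would be the natural way to log the plan, but the full-table-scan check is easy to illustrate with the stdlib's sqlite3: run the same query with and without an index and compare the reported plans. The table shape here is a simplified stand-in for the tool's schema.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE hashtag (id INTEGER PRIMARY KEY, hashtag TEXT)")

def plan(sql):
    """Return the textual query plan for a statement as one string."""
    rows = con.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(row[-1] for row in rows)  # detail text is the last column

query = "SELECT * FROM hashtag WHERE hashtag = 'shesaid'"
before = plan(query)  # no index yet: the plan reports a scan of the table
con.execute("CREATE INDEX hashtag_idx ON hashtag (hashtag)")
after = plan(query)   # now the plan should mention hashtag_idx
print(before, "->", after)
```

If the production plan for a hashtag search looks like the "before" case, an index (or a better use of an existing one) is the obvious first fix.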
The hashtags tool is stuck on another edit. This time, it's because the data is too long for the hashtag column. hashtag has a max length of 128, but nothing in collect_hashtags.py checks that we're not about to save something longer than that. We could either extend the field size or add a check in collect_hashtags.py so that we don't try to save any hashtag longer than 128 characters.
I don't recall the rationale for choosing 128, but I can't imagine any use case for a hashtag that long - I suspect these are edge cases with URLs or some such other nonsense in the edit summary that we don't actually care about.
The edit that created this page also introduces a file (File:Flag_of_Turkey.svg) into the page.
This is an interesting file: when we query its imageinfo, the API always sends back a continue object, indicating that collect_hashtags.py should query it again, passing a different iistart parameter each time, to get more data. This seems to be because the URL of the file keeps changing. See these example queries:
https://azb.wikipedia.org/w/api.php?action=query&titles=File:Flag_of_Turkey.svg&format=json&prop=imageinfo&iiprop=mediatype|url&iistart=2012-05-06T18:27:10Z
https://azb.wikipedia.org/w/api.php?action=query&titles=File:Flag_of_Turkey.svg&format=json&prop=imageinfo&iiprop=mediatype|url&iistart=2012-05-04T09:24:38Z
https://azb.wikipedia.org/w/api.php?action=query&titles=File:Flag_of_Turkey.svg&format=json&prop=imageinfo&iiprop=mediatype|url&iistart=2011-02-28T15:51:42Z
Something even more interesting happens on that last query: the returned iistart is the same one we passed in! When we use it to build a new request URL, we end up making the same request again, and getting the same response back, over and over.
I don't really know why this happens in the API or whether that result is even valid, and the API documentation doesn't go into much detail on how iistart is supposed to work, so I'm not sure whether we're using it correctly.
One quick, probably robust fix here is to limit how many times we follow the continue response.
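A sketch of that bounded-continuation loop. The cap of 10 is an arbitrary assumption, the function name is illustrative, and the injected fetch callable stands in for whatever HTTP client collect_hashtags.py uses:

```python
MAX_CONTINUES = 10  # arbitrary cap; legitimate files need far fewer rounds

def imageinfo_pages(title, fetch, max_continues=MAX_CONTINUES):
    """Follow `continue` responses from prop=imageinfo, giving up after a
    fixed number of rounds so a self-referencing iistart can't loop forever."""
    params = {"action": "query", "titles": title,
              "prop": "imageinfo", "iiprop": "mediatype|url"}
    pages = []
    for _ in range(max_continues):
        data = fetch(params)
        pages.append(data)
        cont = data.get("continue")
        if cont is None:
            break
        params = {**params, **cont}  # merges in the new iistart value
    return pages
```

Even if the API keeps echoing the same iistart back, the loop terminates after max_continues rounds instead of hanging the collector.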
There is no clarity on the copyright applicable to screenshots of the tool, as there is, for example, for the Pageviews tool.
@Samwalton9
I ran the command 'docker-compose up --build' inside the project directory. It gives me this error -
Please help.
@Samwalton9 @Jain-Aditya I am getting this error whenever I run the server.
Can you please help me with this? Am I missing something?