hashtags's People

Contributors

3to1null, eggpi, jain-aditya, jsnshrmn, jumakiwaka, ksubbu199, n3rsti, parths007, samwalton9, sanyam-git, soumyaa1804, suecarmol

hashtags's Issues

Delayed results

Since around last summer, the hashtag tool has not been displaying results for recent time slots when we use it.
This seems to be getting worse.
For example, I am currently running a #shesaid campaign on Wikiquote. I have results up to the end of September, but nothing after the 30th, and we are now at the end of November.
Last summer, during the #WPWP campaign (which closed August 30), most of the final results only showed up in the tool between mid and late September.
Is there a practical reason for this?

Identify edits with hashtags that introduce media into the page

For #WPWP in July / August, it would be great to be able to identify whether an edit with a given hashtag introduces new media into a page.

The first part of this feature is to identify, for incoming real-time change events, whether they add media.

Given the old and new revids for an edit, the way to identify media additions, broken down by media type, seems to be a combination of action=parse&prop=images and action=query&prop=imageinfo API calls. See this Phabricator ticket for details.
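
For reference, here is a minimal sketch of that approach, assuming the requests library; the function names and target wiki are hypothetical, and the real collect_hashtags.py logic may differ:

import requests

API = "https://en.wikipedia.org/w/api.php"  # hypothetical target wiki

def rendered_files(revid):
    # action=parse&prop=images lists the files rendered in a revision.
    r = requests.get(API, params={
        "action": "parse", "oldid": revid,
        "prop": "images", "format": "json",
    }).json()
    return set(r["parse"]["images"])

def added_files(old_revid, new_revid):
    # Files present in the new revision but not in the old one.
    return rendered_files(new_revid) - rendered_files(old_revid)

def media_types(titles):
    # action=query&prop=imageinfo reports each file's media type
    # (e.g. BITMAP, DRAWING, VIDEO, AUDIO).
    r = requests.get(API, params={
        "action": "query", "prop": "imageinfo",
        "titles": "|".join("File:" + t for t in titles),
        "iiprop": "mediatype", "format": "json",
    }).json()
    return {
        p["title"]: p["imageinfo"][0]["mediatype"]
        for p in r["query"]["pages"].values()
        if "imageinfo" in p
    }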

I've tried this in a hacked-together proof-of-concept version of collect_hashtags.py, and it does not appear to introduce significant latency to the processing of incoming edits.

The second part is the UI: we can probably add a couple of checkboxes ("Adds image" / "Adds video") to the tool's UI to allow that kind of filtering.
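
As a rough illustration (the form class and field names here are hypothetical, not the tool's actual code), the checkboxes could be plain Django form fields:

from django import forms

class HashtagSearchForm(forms.Form):
    # Hypothetical search form; the tool's real form class may differ.
    query = forms.CharField()
    adds_image = forms.BooleanField(required=False, label="Adds image")
    adds_video = forms.BooleanField(required=False, label="Adds video")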

Tool not tracking

Hi! It seems like the tool has stopped tracking new edits since August 4.

I already reported it at Phabricator, but I'm opening this issue to increase the chances of reaching someone who can fix it.

The tool has been invaluable to me (and surely to many others) since I discovered it not so long ago. It lets me track, easily and across wikis, edits made with the various tools I develop, which in turn helps me identify bugs and opportunities. Losing it is a huge problem for me (and surely for others too). Can anything be done?

Thanks! And thanks also for developing the tool in the first place!!

Add new favicon to repo

We got a new favicon via Google Code-In, and I uploaded it to the live tool, but I forgot to ever add it to the repo.

Track edits that add audio

In #60, we added a feature where the Hashtags tool can identify edits that introduce images or video to a page, and use those properties in a search.

We should do the same for audio. The implementation would follow very similar steps:

  1. Add a new column has_audio to the database, like in b315c17 (a rough sketch of such a migration follows this list).
  2. Update scripts/collect_hashtags.py so we also identify changes of media type 'AUDIO'.
  3. Change the UI so the new database field can be used in queries, as in 5674520.
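
A hedged sketch of step 1 as a Django migration; the app label, dependency, and model name here are assumptions and would need to match the real migration graph:

from django.db import migrations, models

class Migration(migrations.Migration):
    dependencies = [("hashtags", "0001_initial")]  # hypothetical dependency

    operations = [
        migrations.AddField(
            model_name="hashtag",
            name="has_audio",
            field=models.BooleanField(default=False),
        ),
    ]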

Consider moving back to SQL queries on replicas rather than consuming Event Stream

We've been having reliability issues that are somewhat difficult to troubleshoot, where the tool stops processing updates. We thought 77d68bd solved this, but it looks like it's back after a few months.

We're not sure Event Stream is really the cause (it could just as well be an issue where the container that consumes the stream is not running, or something else), but I wonder if moving back from the stream to SQL queries, which is where the tool started, wouldn't result in a simpler and more resilient design.

A SQL-based design could also have lower latency, as a database query should be much faster than making multiple HTTP requests to fetch the same data from Event Stream. For a quick comparison, we can fetch roughly the data we need for the past month with:

SELECT rc.rc_title, rc.rc_this_oldid,
       rc.rc_last_oldid, rc.rc_timestamp,
       c.comment_text
FROM recentchanges AS rc JOIN comment AS c
ON rc.rc_comment_id = c.comment_id
WHERE c.comment_text regexp '[[:space:]]+#[^#]{3,}' AND rc.rc_timestamp > '20210704000000'
AND NOT rc.rc_bot AND rc.rc_source IN ('mw.edit', 'mw.new');

This takes 1 min on Toolforge, while my local tool takes many hours to catch up on just a couple of days of backlog. This is not a fair comparison (the SQL query is not doing API calls, and I have a higher latency to the API from home than the tool does), but it's interesting evidence that we should explore further.

A sketch of the design:

  1. Query the meta db to find all projects to track (~400, as I write this):
SELECT dbname
FROM wiki
WHERE family IN ('wikisource', 'wikipedia', 'wiktionary', 'wikinews')
AND is_closed = 0;
  2. Spawn a pool of N worker processes, each polling 1/N of the databases. Maybe use the size field of meta_p.wiki when partitioning, so that the large projects don't all end up with the same worker (a rough sketch follows this list).
  3. Within each worker, run the big query above for each database, process the results as we currently do, then sleep for X min plus jitter.
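
Here is a rough sketch of steps 2 and 3, with hypothetical names throughout; run_big_query stands in for executing the SELECT above against one wiki and processing the rows as collect_hashtags.py currently does:

import random
import time
from multiprocessing import Process

def run_big_query(dbname):
    # Placeholder: run the recentchanges/comment SELECT against dbname
    # and process the resulting rows.
    pass

def partition(wikis, n):
    # wikis: (dbname, size) rows from meta_p.wiki. Sorting by size and
    # dealing round-robin keeps the largest projects on different workers.
    buckets = [[] for _ in range(n)]
    for i, (dbname, _size) in enumerate(sorted(wikis, key=lambda w: -w[1])):
        buckets[i % n].append(dbname)
    return buckets

def worker(dbnames, poll_minutes=5):
    while True:
        for dbname in dbnames:
            run_big_query(dbname)
        time.sleep(60 * poll_minutes + random.uniform(0, 60))  # jitter

def main(wikis, n_workers=8):
    for bucket in partition(wikis, n_workers):
        Process(target=worker, args=(bucket,)).start()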

Searching for a hashtag is slow

As far as I can tell, searching for any hashtag takes multiple seconds and sometimes even times out. We should look into optimizing this.

As a simple first step, we could log the output of explain on the queryset used in searches to see if we're doing something funny like a full-table scan. For my own future reference, here's the documentation for interpreting that output.
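
With Django 2.1 or later this is a one-liner on the queryset; assuming the model and field names look something like this (they are illustrative, not the tool's exact code):

from hashtags.models import Hashtag  # hypothetical import path

qs = Hashtag.objects.filter(hashtag="shesaid")
print(qs.explain())  # prints the database's plan, e.g. a full-table scan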

It would also be good to experiment with a profiler to get a breakdown of the wall-clock time of a query. I think it's likely that the bottleneck is the database query (because the rest of the code is pretty standard use of Django), but we shouldn't take that for granted and this would allow us to identify other bad spots.

@Samwalton9 would you be able to try the above in the production environment please?

Don't attempt to save hashtags longer than the column size

The hashtags tool is stuck on another edit. This time, it's because the data is too long for the hashtag column: hashtag has a max length of 128, but nothing in collect_hashtags.py checks that we're not about to save something longer than that. We could either extend the field size or add a check to collect_hashtags.py so that we don't try to save any hashtag longer than 128 characters.

I don't recall the rationale for choosing 128, but I can't imagine any use case that needs a hashtag that long. I suspect these are edge cases with URLs or some such nonsense in the edit summary that we don't actually care about.
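
The check could be as simple as the following sketch (the constant and variable names are hypothetical, not collect_hashtags.py's actual code):

MAX_HASHTAG_LENGTH = 128  # mirrors the column's max_length

def fits_column(tag):
    # Skip, rather than crash on, hashtags that would overflow the column.
    return len(tag) <= MAX_HASHTAG_LENGTH

extracted_hashtags = ["shesaid", "x" * 300]  # example input
valid = [t for t in extracted_hashtags if fits_column(t)]  # drops the long one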

collect_hashtags.py is stuck on an azbwiki edit

The edit that created this page also introduces a file (File:Flag_of_Turkey.svg) into the page.

This is an interesting file: when we query its imageinfo, the API always sends back a continue object, indicating that collect_hashtags.py should query it again, passing a different iistart parameter each time, to get more data. This seems to be because the URL of the file keeps changing. See these example queries:

https://azb.wikipedia.org/w/api.php?action=query&titles=File:Flag_of_Turkey.svg&format=json&prop=imageinfo&iiprop=mediatype|url&iistart=2012-05-06T18:27:10Z
https://azb.wikipedia.org/w/api.php?action=query&titles=File:Flag_of_Turkey.svg&format=json&prop=imageinfo&iiprop=mediatype|url&iistart=2012-05-04T09:24:38Z
https://azb.wikipedia.org/w/api.php?action=query&titles=File:Flag_of_Turkey.svg&format=json&prop=imageinfo&iiprop=mediatype|url&iistart=2011-02-28T15:51:42Z

Something even more interesting happens on that last query: the returned iistart is the same one we passed in! When we use it to build a new request URL, we end up making the same request again, and getting the same response back, over and over.

I don't really know why this would happen in the API or whether that result is even valid, and the API documentation doesn't really go into a lot of detail on how iistart is supposed to work, so I'm not sure whether we're using it correctly.

One quick, probably robust fix that we can do here is to add a limit to how many times we follow the continue response.
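
A hedged sketch of that fix, assuming a requests-based fetch loop (the function and constant names are hypothetical):

import requests

MAX_CONTINUES = 10  # arbitrary cap; plenty for legitimate pagination

def fetch_imageinfo(api_url, title):
    params = {
        "action": "query", "titles": title, "format": "json",
        "prop": "imageinfo", "iiprop": "mediatype|url",
    }
    pages = []
    for _ in range(MAX_CONTINUES):
        resp = requests.get(api_url, params=params).json()
        pages.append(resp)
        if "continue" not in resp:
            break
        # Without the cap above, a self-referential continue loops forever.
        params.update(resp["continue"])
    return pages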
