mediawiki-utilities / python-mwviews

Tools for parsing and querying Wikimedia Foundation pageview data from both static dumps and the online API.

License: MIT License
Hi, I was trying to wrap my head around the Wikimedia REST API when I found this repo. It looks like there is a pagination problem, at least with the method project_views. Steps to reproduce:
import datetime
from mwviews.api import PageviewsClient

p = PageviewsClient(user_agent="<[email protected]> Foo Bar")
a = p.project_views(['it.wikipedia'], granularity='hourly',
                    start='2016010100', end='2022060700')
len(a)  # 56,377 items

# However, most of those entries are empty: the library inserts every date in
# the requested range, but the API returns no data for much of that span. E.g.:
a[datetime.datetime(2021, 2, 11, 15, 0)]  # {'it.wikipedia': None}
Data for those dates is definitely present in the API. The problem is that this endpoint returns at most 5k items per call, along with an undocumented pagination token (see here for more). There should probably be a way to pass that token back to the API to get the next page (still trying to figure out how).
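Until the token is understood, one workaround is to keep each request under the 5k-item cap by splitting the date range client-side. A minimal sketch, assuming the 5,000-item cap observed above; `split_range` is a hypothetical helper, not part of mwviews:

```python
from datetime import datetime, timedelta

def split_range(start, end, max_hours=5000):
    """Yield (start, end) pairs covering [start, end] in chunks of
    at most max_hours hourly data points."""
    step = timedelta(hours=max_hours - 1)
    cur = start
    while cur <= end:
        sub_end = min(cur + step, end)
        yield cur, sub_end
        cur = sub_end + timedelta(hours=1)
```

Each chunk can then be passed to project_views in turn and the resulting dicts merged, so that no single call can exceed the cap.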
The API documentation requires that high-volume users send a User-Agent header "that allows us to contact you quickly. Email addresses or URLs of contact pages work well." I would like not to have to worry about following this policy manually when I use this tool.
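For scripts that hit the REST API directly (outside mwviews), the header can be set explicitly. A sketch using only the standard library; the agent string and contact address are placeholders to be replaced:

```python
from urllib.request import Request

def with_contact_header(url, contact="[email protected]"):
    # Replace `contact` with a real email address or contact-page URL,
    # as the policy asks of high-volume users.
    return Request(url, headers={"User-Agent": f"mwviews-example/0.1 ({contact})"})
```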
I'm not sure where the most appropriate place to open this issue is, but I suppose trying both places doesn't hurt.
The report is on Phabricator: https://phabricator.wikimedia.org/T123200
In particular, based on https://wikimedia.org/api/rest_v1/#/Pageviews%20data, I think we'd need to add top-by-country (monthly top countries for readers of a project) and top-per-country (daily most-viewed articles for a given country). I don't know when I could get to this, but I'm happy to take it on if someone doesn't beat me to it.
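For reference, the shape of one of those endpoints can be sketched as a URL builder. The path layout is taken from the REST v1 docs linked above; the helper itself is hypothetical:

```python
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews"

def top_per_country_url(country, year, month, day, access="all-access"):
    # Daily most-viewed articles for one country; date parts are zero-padded.
    return f"{BASE}/top-per-country/{country}/{access}/{year:04d}/{month:02d}/{day:02d}"
```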
Use the standard logging module instead.
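A sketch of what that would look like: route messages through a module-level logger with a NullHandler default, so applications choose the verbosity. The function name here is illustrative, not part of mwviews's API:

```python
import logging

logger = logging.getLogger("mwviews")
logger.addHandler(logging.NullHandler())  # silent by default, as befits a library

def report_missing(project, timestamp):
    # Callers see this only if they configure the "mwviews" logger.
    logger.warning("no data for %s at %s", project, timestamp)
```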
I have a CSV with a list of Wikidata QIDs but, frustratingly, no more direct information about their Wikipedia pages. Does mwviews have a setting to search by QID?
Thank you!
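mwviews itself does not accept QIDs, but the Wikidata action API can map a QID to its Wikipedia title via sitelinks (action=wbgetentities). A sketch that builds the request URL and parses the response's "entities" payload; actually fetching the URL is left to whatever HTTP client you use:

```python
from urllib.parse import urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def wbgetentities_url(qids, site="enwiki"):
    # Ask only for the sitelink on one wiki; the API accepts up to 50 IDs per call.
    params = {
        "action": "wbgetentities",
        "ids": "|".join(qids),
        "props": "sitelinks",
        "sitefilter": site,
        "format": "json",
    }
    return WIKIDATA_API + "?" + urlencode(params)

def sitelink_titles(entities, site="enwiki"):
    # `entities` is the "entities" mapping of the JSON response;
    # QIDs with no article on that wiki map to None.
    return {qid: ent.get("sitelinks", {}).get(site, {}).get("title")
            for qid, ent in entities.items()}
```

The titles this returns can then be fed to article_views as usual.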
Example:
>>> p.article_views('en.wikipedia', ['Bug report'], granularity='monthly', start='20220101', end='20220228')
defaultdict(<class 'dict'>, {datetime.datetime(2022, 1, 1, 0, 0): {'Bug_report': 262}, datetime.datetime(2022, 2, 1, 0, 0): {'Bug_report': 266}})
It's somewhat confusing that I can't use the titles I passed to the function ('Bug report') to index the data I get back. Instead, I need to reverse-engineer the munging applied by the mwviews client.
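The transformation is at least easy to replicate: the pageview API stores titles with underscores in place of spaces. A one-line helper (hypothetical, not exposed by mwviews) lets you index results by the title you originally passed:

```python
def normalize_title(title):
    # Mirror the client's munging: the pageview API uses '_' for ' '.
    return title.replace(" ", "_")
```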
For a single article_views call I get a dict with items like this:
datetime.datetime(2016, 10, 23, 0, 0): {
'D': 1066,
'N': 942,
'_': None,
'a': 6,
'e': 2,
'i': 7,
'l': 8,
'm': 24,
'n': 8,
'o': 1,
'p': 9,
't': 8,
'y': 4
}
What is the meaning of these keys?
Instantiating PageviewsClient(parallelism=1) gives the same result.
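Those single-character keys look like the letters of an article title. That is the classic symptom of passing a bare string where a list of titles is expected: iterating a string yields its characters, so each letter is queried as its own "article". This is a guess at the cause; a defensive wrapper (hypothetical) would be:

```python
def as_title_list(articles):
    # A bare string would otherwise be iterated character by character,
    # producing one-letter "articles" such as 'D' and 'N'.
    if isinstance(articles, str):
        return [articles]
    return list(articles)
```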
(Apologies if this is just a quirk of my local setup, but I noticed there was a recent change to the way the project is packaged, and thought that might be related to the issue I'm having.)
I ran pip install mwviews and it appeared to install successfully. But if I try to run something like python -m mwviews, I get No module named mwviews.
pip show mwviews shows the package as installed, with Location: /home/colin/.local/lib/python3.6/site-packages (as expected).
However, my site-packages does not have a mwviews subdirectory. And if I try to use pip to uninstall mwviews, the list of files to be removed looks like...
/home/colin/.local/bin/mwviews
/home/colin/.local/lib/python3.6/site-packages/api/__init__.py
/home/colin/.local/lib/python3.6/site-packages/api/__pycache__/__init__.cpython-36.pyc
/home/colin/.local/lib/python3.6/site-packages/api/__pycache__/pageviews.cpython-36.pyc
/home/colin/.local/lib/python3.6/site-packages/api/pageviews.py
/home/colin/.local/lib/python3.6/site-packages/mwviews-0.2.0.dist-info/INSTALLER
/home/colin/.local/lib/python3.6/site-packages/mwviews-0.2.0.dist-info/LICENSE.txt
/home/colin/.local/lib/python3.6/site-packages/mwviews-0.2.0.dist-info/METADATA
/home/colin/.local/lib/python3.6/site-packages/mwviews-0.2.0.dist-info/RECORD
/home/colin/.local/lib/python3.6/site-packages/mwviews-0.2.0.dist-info/WHEEL
/home/colin/.local/lib/python3.6/site-packages/mwviews-0.2.0.dist-info/entry_points.txt
/home/colin/.local/lib/python3.6/site-packages/mwviews-0.2.0.dist-info/top_level.txt
/home/colin/.local/lib/python3.6/site-packages/utilities/__init__.py
/home/colin/.local/lib/python3.6/site-packages/utilities/__pycache__/__init__.cpython-36.pyc
/home/colin/.local/lib/python3.6/site-packages/utilities/__pycache__/aggregate.cpython-36.pyc
/home/colin/.local/lib/python3.6/site-packages/utilities/__pycache__/fetch_global_namespaces.cpython-36.pyc
/home/colin/.local/lib/python3.6/site-packages/utilities/__pycache__/util.cpython-36.pyc
/home/colin/.local/lib/python3.6/site-packages/utilities/aggregate.py
/home/colin/.local/lib/python3.6/site-packages/utilities/fetch_global_namespaces.py
/home/colin/.local/lib/python3.6/site-packages/utilities/util.py
In other words, it looks to me like the contents of the package are getting exploded into the top-level site-packages directory, rather than into a proper subdirectory of their own. (In fact, it turns out that as a workaround I can do from api import PageviewsClient rather than from mwviews.api import PageviewsClient.)
When I query daily statistics for a page whose title contains a question mark, like p.article_views(u'nl.wikipedia', ['Wie is de mol?'], start=startday, end=endday), the result is zeroes as the number of page views.
Speciaal:Zoeken does work, but 'Speciaal:MyPage/zeusmodepreferences.js' does not. That page shows up in todaystoplist = p.top_articles('nl.wikipedia', limit=100), but p.article_views('nl.wikipedia', ['Speciaal:MyPage/zeusmodepreferences.js'], start=startday, end=endday) fails on line 99 in article_views with KeyError: u'\xe9'.
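A plausible cause is URL construction: characters like '?', '/', ':' and non-ASCII letters must be percent-encoded before the title is interpolated into the REST path. A sketch of the encoding step, using only the standard library; the helper name is illustrative:

```python
from urllib.parse import quote

def encode_title(title):
    # safe="" so that '?', '/' and ':' are all percent-encoded instead of
    # being read as URL syntax; spaces become underscores first.
    return quote(title.replace(" ", "_"), safe="")
```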
The API back-end now supports monthly granularity, as described in the API documentation. The library should therefore use that endpoint instead of doing client-side aggregation.
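The path layout for that endpoint, per the REST v1 docs, can be sketched as a URL builder (the helper itself is illustrative, not part of mwviews):

```python
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"

def per_article_url(project, article, start, end,
                    access="all-access", agent="all-agents",
                    granularity="monthly"):
    # With granularity=monthly the API returns one pre-aggregated row per
    # month, so no client-side summing of daily rows is needed.
    return f"{BASE}/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end}"
```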
Aggregates page view counts from hourly page view files into a single pageview file
Usage:
aggregate (-h|--help)
aggregate <hour-file>...
[--namespaces=<path>]
[--projects=<prefixes>]
Options:
-h --help Prints this documentation
<hour-file> Path to an hourly pageviews file to process
--namespaces=<path> Path of a file produced by
`fetch_global_namespaces` for processing
namespace prefixes (e.g. "Talk:...")
--projects=<prefixes> A "|" separated list of project prefixes that
should be included in the output
(e.g. "en|en.mw")
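For context, each line of an hourly dump is whitespace-separated `project title views bytes`, so the per-line parse that `aggregate` performs can be sketched as follows. The field layout is as commonly documented for the pageviews dumps; the helper is illustrative:

```python
def parse_hour_line(line):
    # <project> <page_title> <view_count> <bytes_transferred>
    project, title, views, size = line.split(" ")
    return project, title, int(views), int(size)
```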
Fetches a JSON file containing information about all namespace
names and aliases across all wikis using action=sitematrix and
action=query&meta=siteinfo. This file is used later by
`aggregate` to parse page names into (namespace, title)
pairs.
Usage:
fetch_global_namespaces (-h|--help)
fetch_global_namespaces <api-host>
Options:
-h --help Prints this documentation
<api-host> URL for the MediaWiki host to query for
action=sitematrix
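The siteinfo request made for each wiki can be sketched as a URL builder. The parameters follow the MediaWiki action API; the helper name is illustrative, and the host is assumed to expose the API at the standard /w/api.php path:

```python
from urllib.parse import urlencode

def siteinfo_url(api_host):
    # Namespace names and aliases for one wiki; fetch_global_namespaces
    # would issue this for every wiki listed by action=sitematrix.
    params = {
        "action": "query",
        "meta": "siteinfo",
        "siprop": "namespaces|namespacealiases",
        "format": "json",
    }
    return api_host.rstrip("/") + "/w/api.php?" + urlencode(params)
```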