mediawiki-utilities / python-mwviews

Tools for parsing and querying Wikimedia Foundation pageview data from both static dumps and the online API.

License: MIT License
Hi, I was trying to wrap my head around the Wikimedia REST API when I found this repo. It looks like there is a pagination problem, at least with the method project_views. Steps to reproduce:
import datetime
from mwviews.api import PageviewsClient

p = PageviewsClient(user_agent="<[email protected]> Foo Bar")
a = p.project_views(['it.wikipedia'], granularity='hourly',
                    start='2016010100', end='2022060700')
len(a)  # 56,377 items

# However, most of those entries are empty: the library inserts every date in
# the requested range, but the API returns no data for much of that span. E.g.:
a[datetime.datetime(2021, 2, 11, 15, 0)]  # {'it.wikipedia': None}
Data for those dates is definitely present in the API. The problem is that this endpoint returns at most 5k items per call, along with an undocumented pagination token (see here for more). There should probably be a way to pass that token back to the API to get the next page (still trying to figure out how).
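Until the token is understood, one workaround is to keep each request under the 5k-item cap by splitting the date range client-side. A minimal sketch, assuming the 5,000-item cap observed above; `split_range` is a hypothetical helper, not part of mwviews:

```python
from datetime import datetime, timedelta

def split_range(start, end, max_hours=5000):
    """Yield (start, end) pairs covering [start, end] in chunks of
    at most max_hours hourly data points."""
    step = timedelta(hours=max_hours - 1)
    cur = start
    while cur <= end:
        sub_end = min(cur + step, end)
        yield cur, sub_end
        cur = sub_end + timedelta(hours=1)
```

Each chunk can then be passed to project_views in turn and the resulting dicts merged, so that no single call can exceed the cap.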
The API documentation requires that high-volume users send a User-Agent header "that allows us to contact you quickly. Email addresses or URLs of contact pages work well." I would like not to have to worry about following this policy manually when I use this tool.
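For scripts that hit the REST API directly (outside mwviews), the header can be set explicitly. A sketch using only the standard library; the agent string and contact address are placeholders to be replaced:

```python
from urllib.request import Request

def with_contact_header(url, contact="[email protected]"):
    # Replace `contact` with a real email address or contact-page URL,
    # as the policy asks of high-volume users.
    return Request(url, headers={"User-Agent": f"mwviews-example/0.1 ({contact})"})
```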
I'm not sure where the most appropriate place to open this issue is, but I suppose trying both places doesn't hurt.
The report is on Phabricator: https://phabricator.wikimedia.org/T123200
In particular, based on https://wikimedia.org/api/rest_v1/#/Pageviews%20data, I think we'd need to add top-by-country (monthly top countries for readers of a project) and top-per-country (daily most-viewed articles for a given country). I don't know when I could get to this, but I'm happy to take it on if someone doesn't beat me to it.
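For reference, the shape of one of those endpoints can be sketched as a URL builder. The path layout is taken from the REST v1 docs linked above; the helper itself is hypothetical:

```python
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews"

def top_per_country_url(country, year, month, day, access="all-access"):
    # Daily most-viewed articles for one country; date parts are zero-padded.
    return f"{BASE}/top-per-country/{country}/{access}/{year:04d}/{month:02d}/{day:02d}"
```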
Use the standard logging module instead.
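A sketch of what that would look like: route messages through a module-level logger with a NullHandler default, so applications choose the verbosity. The function name here is illustrative, not part of mwviews's API:

```python
import logging

logger = logging.getLogger("mwviews")
logger.addHandler(logging.NullHandler())  # silent by default, as befits a library

def report_missing(project, timestamp):
    # Callers see this only if they configure the "mwviews" logger.
    logger.warning("no data for %s at %s", project, timestamp)
```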
I have a CSV with a list of Wikidata QIDs but, frustratingly, no more direct information about their Wikipedia pages. Does mwviews have a setting to search by QID?
Thank you!
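mwviews itself does not accept QIDs, but the Wikidata action API can map a QID to its Wikipedia title via sitelinks (action=wbgetentities). A sketch that builds the request URL and parses the response's "entities" payload; actually fetching the URL is left to whatever HTTP client you use:

```python
from urllib.parse import urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def wbgetentities_url(qids, site="enwiki"):
    # Ask only for the sitelink on one wiki; the API accepts up to 50 IDs per call.
    params = {
        "action": "wbgetentities",
        "ids": "|".join(qids),
        "props": "sitelinks",
        "sitefilter": site,
        "format": "json",
    }
    return WIKIDATA_API + "?" + urlencode(params)

def sitelink_titles(entities, site="enwiki"):
    # `entities` is the "entities" mapping of the JSON response;
    # QIDs with no article on that wiki map to None.
    return {qid: ent.get("sitelinks", {}).get(site, {}).get("title")
            for qid, ent in entities.items()}
```

The titles this returns can then be fed to article_views as usual.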
Example:
>>> p.article_views('en.wikipedia', ['Bug report'], granularity='monthly', start='20220101', end='20220228')
defaultdict(<class 'dict'>, {datetime.datetime(2022, 1, 1, 0, 0): {'Bug_report': 262}, datetime.datetime(2022, 2, 1, 0, 0): {'Bug_report': 266}})
It's somewhat confusing that I can't use the titles I passed to the function ('Bug report') to index the data I get back. Instead, I need to reverse-engineer the munging applied by the mwviews client.
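The transformation is at least easy to replicate: the pageview API stores titles with underscores in place of spaces. A one-line helper (hypothetical, not exposed by mwviews) lets you index results by the title you originally passed:

```python
def normalize_title(title):
    # Mirror the client's munging: the pageview API uses '_' for ' '.
    return title.replace(" ", "_")
```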
For a single article_views call I get a dict with items like this:
datetime.datetime(2016, 10, 23, 0, 0): {
'D': 1066,
'N': 942,
'_': None,
'a': 6,
'e': 2,
'i': 7,
'l': 8,
'm': 24,
'n': 8,
'o': 1,
'p': 9,
't': 8,
'y': 4
}
What is the meaning of these keys?
Instantiating PageviewsClient(parallelism=1) gives the same result.
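Those single-character keys look like the letters of an article title. That is the classic symptom of passing a bare string where a list of titles is expected: iterating a string yields its characters, so each letter is queried as its own "article". This is a guess at the cause; a defensive wrapper (hypothetical) would be:

```python
def as_title_list(articles):
    # A bare string would otherwise be iterated character by character,
    # producing one-letter "articles" such as 'D' and 'N'.
    if isinstance(articles, str):
        return [articles]
    return list(articles)
```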
(Apologies if this is just a quirk of my local setup, but I noticed there was a recent change to the way the project is packaged, and thought that might be related to the issue I'm having.)
I ran pip install mwviews and it appeared to install successfully. But if I try to run something like python -m mwviews, I get No module named mwviews.
pip show mwviews shows the package as installed, with Location: /home/colin/.local/lib/python3.6/site-packages (as expected).
However, my site-packages does not have a mwviews subdirectory. And if I try to use pip to uninstall mwviews, the list of files to be removed looks like...
/home/colin/.local/bin/mwviews
/home/colin/.local/lib/python3.6/site-packages/api/__init__.py
/home/colin/.local/lib/python3.6/site-packages/api/__pycache__/__init__.cpython-36.pyc
/home/colin/.local/lib/python3.6/site-packages/api/__pycache__/pageviews.cpython-36.pyc
/home/colin/.local/lib/python3.6/site-packages/api/pageviews.py
/home/colin/.local/lib/python3.6/site-packages/mwviews-0.2.0.dist-info/INSTALLER
/home/colin/.local/lib/python3.6/site-packages/mwviews-0.2.0.dist-info/LICENSE.txt
/home/colin/.local/lib/python3.6/site-packages/mwviews-0.2.0.dist-info/METADATA
/home/colin/.local/lib/python3.6/site-packages/mwviews-0.2.0.dist-info/RECORD
/home/colin/.local/lib/python3.6/site-packages/mwviews-0.2.0.dist-info/WHEEL
/home/colin/.local/lib/python3.6/site-packages/mwviews-0.2.0.dist-info/entry_points.txt
/home/colin/.local/lib/python3.6/site-packages/mwviews-0.2.0.dist-info/top_level.txt
/home/colin/.local/lib/python3.6/site-packages/utilities/__init__.py
/home/colin/.local/lib/python3.6/site-packages/utilities/__pycache__/__init__.cpython-36.pyc
/home/colin/.local/lib/python3.6/site-packages/utilities/__pycache__/aggregate.cpython-36.pyc
/home/colin/.local/lib/python3.6/site-packages/utilities/__pycache__/fetch_global_namespaces.cpython-36.pyc
/home/colin/.local/lib/python3.6/site-packages/utilities/__pycache__/util.cpython-36.pyc
/home/colin/.local/lib/python3.6/site-packages/utilities/aggregate.py
/home/colin/.local/lib/python3.6/site-packages/utilities/fetch_global_namespaces.py
/home/colin/.local/lib/python3.6/site-packages/utilities/util.py
In other words, it looks to me like the contents of the package are getting exploded into the top-level site-packages directory, rather than into a proper subdirectory of their own. (In fact, it turns out that as a workaround I can do from api import PageviewsClient rather than from mwviews.api import PageviewsClient.)
When I query daily statistics for a page whose title contains a question mark, like p.article_views(u'nl.wikipedia', ['Wie is de mol?'], start=startday, end=endday), the result is zeroes as the number of page views.
Speciaal:Zoeken does work, but 'Speciaal:MyPage/zeusmodepreferences.js' does not. That page shows up in todaystoplist = p.top_articles('nl.wikipedia', limit=100), but p.article_views('nl.wikipedia', ['Speciaal:MyPage/zeusmodepreferences.js'], start=startday, end=endday) fails on line 99 in article_views with KeyError: u'\xe9'.
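A plausible cause is URL construction: characters like '?', '/', ':' and non-ASCII letters must be percent-encoded before the title is interpolated into the REST path. A sketch of the encoding step, using only the standard library; the helper name is illustrative:

```python
from urllib.parse import quote

def encode_title(title):
    # safe="" so that '?', '/' and ':' are all percent-encoded instead of
    # being read as URL syntax; spaces become underscores first.
    return quote(title.replace(" ", "_"), safe="")
```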
The API back-end now supports monthly granularity, as described in the API documentation. The library should therefore use that endpoint instead of doing client-side aggregation.
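The path layout for that endpoint, per the REST v1 docs, can be sketched as a URL builder (the helper itself is illustrative, not part of mwviews):

```python
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"

def per_article_url(project, article, start, end,
                    access="all-access", agent="all-agents",
                    granularity="monthly"):
    # With granularity=monthly the API returns one pre-aggregated row per
    # month, so no client-side summing of daily rows is needed.
    return f"{BASE}/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end}"
```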
Aggregates page view counts from hourly page view files into a single pageview file
Usage:
aggregate (-h|--help)
aggregate <hour-file>...
[--namespaces=<path>]
[--projects=<prefixes>]
Options:
-h --help Prints this documentation
<hour-file> Path to an hourly pageviews file to process
--namespaces=<path> Path of a file produced by
`fetch_global_namespaces` for processing
namespace prefixes (e.g. "Talk:...")
--projects=<prefixes> A "|" separated list of project prefixes that
should be included in the output
(e.g. "en|en.mw")
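For context, each line of an hourly dump is whitespace-separated `project title views bytes`, so the per-line parse that `aggregate` performs can be sketched as follows. The field layout is as commonly documented for the pageviews dumps; the helper is illustrative:

```python
def parse_hour_line(line):
    # <project> <page_title> <view_count> <bytes_transferred>
    project, title, views, size = line.split(" ")
    return project, title, int(views), int(size)
```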
Fetches a JSON file containing information about all namespace
names and aliases across all wikis using action=sitematrix and
action=query&meta=siteinfo. This file is used later by
`aggregate` to parse page names into (namespace, title)
pairs.
Usage:
fetch_global_namespaces (-h|--help)
fetch_global_namespaces <api-host>
Options:
-h --help Prints this documentation
<api-host> URL for the MediaWiki host to query for
action=sitematrix
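The siteinfo request made for each wiki can be sketched as a URL builder. The parameters follow the MediaWiki action API; the helper name is illustrative, and the host is assumed to expose the API at the standard /w/api.php path:

```python
from urllib.parse import urlencode

def siteinfo_url(api_host):
    # Namespace names and aliases for one wiki; fetch_global_namespaces
    # would issue this for every wiki listed by action=sitematrix.
    params = {
        "action": "query",
        "meta": "siteinfo",
        "siprop": "namespaces|namespacealiases",
        "format": "json",
    }
    return api_host.rstrip("/") + "/w/api.php?" + urlencode(params)
```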