By: Rohit Barha
-
This project uses big data tools to answer questions about Wikipedia datasets.
-
A series of analysis questions is answered using Hive and MapReduce. The tool used for each question is chosen based on the context of that question.
-
The output of the analysis includes MapReduce jar files and .hql files, so that the analysis is repeatable and can be run on a larger dataset.
1. Which English Wikipedia article got the most traffic on January 20, 2021?
2. Which English Wikipedia article has the most views from readers who go on to follow an internal link to another Wikipedia article?
3. Which series of Wikipedia articles, starting with Hotel California, keeps the most readers clicking on internal links?
4. Find an example of an English Wikipedia article that is relatively more popular in the Americas than elsewhere. There is no location data associated with the Wikipedia pageviews data, but there are timestamps.
5. Which device type (PC or mobile) generates the most traffic to English Wikipedia articles?
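
As a rough illustration of questions 1 and 5, here is a minimal Python sketch of the aggregation. It is not the Hive or MapReduce code from this project. It assumes the space-separated pageviews line layout (domain code, page title, view count, response size), with `en` marking desktop English Wikipedia and `en.m` marking the mobile site. The function names and the sample rows are hypothetical.

```python
from collections import Counter


def parse_pageview_line(line):
    """Split one pageviews line into (domain_code, page_title, views).

    Returns None for lines that do not match the assumed layout.
    """
    parts = line.split(" ")
    if len(parts) < 3 or not parts[2].isdigit():
        return None
    return parts[0], parts[1], int(parts[2])


def summarize_english_traffic(lines):
    """Sum views per article and per device type for English Wikipedia only."""
    views_by_article = Counter()
    views_by_device = Counter()
    for line in lines:
        parsed = parse_pageview_line(line.strip())
        if parsed is None:
            continue
        domain, title, views = parsed
        # Assumption: "en" is desktop traffic and "en.m" is mobile traffic.
        if domain == "en":
            device = "desktop"
        elif domain == "en.m":
            device = "mobile"
        else:
            continue  # skip other languages and projects
        views_by_article[title] += views
        views_by_device[device] += views
    return views_by_article, views_by_device


# Made-up sample rows, for illustration only.
sample = [
    "en Main_Page 120 0",
    "en.m Main_Page 300 0",
    "en Hotel_California 40 0",
    "en.m Hotel_California 10 0",
    "fr Accueil 500 0",
    "broken line",
]

articles, devices = summarize_english_traffic(sample)
top_article, top_views = articles.most_common(1)[0]
top_device, device_views = devices.most_common(1)[0]
print(top_article, top_views)
print(top_device, device_views)
```

On real data, pages such as `Main_Page` or special pages would probably need to be filtered out before picking the top article. That filtering is left out of this sketch.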
- Hadoop
- HDFS
- Python
- Hive
- MapReduce
- Yarn
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageviews
https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream
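
For the clickstream data linked above, the chain in question 3 could be approximated along the lines of the Python sketch below. It assumes tab-separated rows of the form `prev`, `curr`, `type`, `n`, where internal links have type `link`. The greedy "follow the most-clicked link" rule, the function names, and the sample counts are illustrative assumptions, not the method used in this project.

```python
from collections import defaultdict


def load_link_counts(rows):
    """Build {source: {target: clicks}} from clickstream rows of type 'link'."""
    links = defaultdict(dict)
    for row in rows:
        parts = row.rstrip("\n").split("\t")
        if len(parts) != 4:
            continue  # skip rows that do not match the assumed layout
        prev, curr, link_type, n = parts
        if link_type != "link" or not n.isdigit():
            continue  # keep internal article-to-article clicks only
        links[prev][curr] = links[prev].get(curr, 0) + int(n)
    return links


def follow_most_clicked(links, start, max_steps=5):
    """Greedily follow the most-clicked internal link, never revisiting a page."""
    path = [start]
    visited = {start}
    current = start
    for _ in range(max_steps):
        candidates = {
            target: clicks
            for target, clicks in links.get(current, {}).items()
            if target not in visited
        }
        if not candidates:
            break
        current = max(candidates, key=candidates.get)
        path.append(current)
        visited.add(current)
    return path


# Made-up sample rows, for illustration only.
sample_rows = [
    "Hotel_California\tEagles_(band)\tlink\t500",
    "Hotel_California\tDon_Henley\tlink\t200",
    "Eagles_(band)\tDon_Henley\tlink\t300",
    "Eagles_(band)\tHotel_California\tlink\t900",
    "other-search\tHotel_California\texternal\t10000",
    "Don_Henley\tGlenn_Frey\tlink\t50",
]

link_counts = load_link_counts(sample_rows)
chain = follow_most_clicked(link_counts, "Hotel_California")
print(" -> ".join(chain))
```

This sketch ranks links by raw click counts. Ranking by the share of an article's readers who click through would also need that article's total pageviews.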