BigData_Python_Project-1

By: Rohit Barha

WIKIPEDIA BIG DATA ANALYSIS

  • This project uses big data tools to answer questions about Wikipedia datasets.

  • Each analysis question is answered with Hive or MapReduce, with the tool chosen to suit the question.

  • The output includes MapReduce JAR files and .hql files, so the analysis is repeatable and can be run on a larger dataset.
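
As an illustration of the kind of aggregation such a MapReduce job performs, here is a minimal Python sketch that totals views per article. It is not the project's actual job code, which ships as JAR files. It assumes the hourly pageviews rows are space-separated as `domain_code page_title count_views total_response_size`, and that the domain codes `en` and `en.m` identify desktop and mobile English Wikipedia. The function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

# Assumption: "en" is desktop and "en.m" is mobile English Wikipedia.
EN_DOMAINS = {"en", "en.m"}


def map_pageviews(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit (page_title, views) for English Wikipedia rows."""
    parts = line.strip().split(" ")
    # Assumed layout: domain_code page_title count_views total_response_size
    if len(parts) != 4:
        return  # skip malformed rows
    domain, title, views, _size = parts
    if domain in EN_DOMAINS and views.isdigit():
        yield title, int(views)


def reduce_views(pairs: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Reduce step: sum the view counts emitted for each page title."""
    totals: Dict[str, int] = defaultdict(int)
    for title, views in pairs:
        totals[title] += views
    return dict(totals)


def top_article(lines: Iterable[str]) -> Tuple[str, int]:
    """Return the (page_title, total_views) pair with the most views."""
    pairs = (pair for line in lines for pair in map_pageviews(line))
    totals = reduce_views(pairs)
    return max(totals.items(), key=lambda item: item[1])
```

Feeding every hourly file for a single day through `top_article` gives a daily total per article. Grouping on `domain` instead of `title` in the map step would give a desktop versus mobile split.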

PROBLEM STATEMENT

1. Which English Wikipedia article got the most traffic on January 20, 2021?

2. Which English Wikipedia article has the highest share of its readers following an internal link to another Wikipedia article?

3. Which series of Wikipedia articles, starting with Hotel California, keeps the highest share of its readers clicking on internal links?

4. Find an example of an English Wikipedia article that is relatively more popular in the Americas than elsewhere. There is no location data associated with the Wikipedia pageviews data, but there are timestamps.

5. Which device type (PC or mobile) generates the most traffic to English Wikipedia articles?
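
Question 4 relies on timestamps as a stand-in for location. One possible heuristic, sketched below in Python, compares the share of an article's views that fall in a chosen UTC window against the same share for all English Wikipedia traffic. The window used here (00:00 to 05:59 UTC, roughly evening in much of the Americas and night in Europe) is an assumption for illustration, not a value taken from the project, and the function names are hypothetical.

```python
from typing import Dict, FrozenSet

# Assumption: UTC hours treated as "Americas-heavy". Adjust to taste.
AMERICAS_PEAK_HOURS_UTC: FrozenSet[int] = frozenset(range(0, 6))


def peak_share(
    hourly_views: Dict[int, int],
    peak_hours: FrozenSet[int] = AMERICAS_PEAK_HOURS_UTC,
) -> float:
    """Fraction of views that fall inside the chosen UTC hours."""
    total = sum(hourly_views.values())
    if total == 0:
        return 0.0
    peak = sum(v for hour, v in hourly_views.items() if hour in peak_hours)
    return peak / total


def americas_skew(
    article_hourly: Dict[int, int],
    baseline_hourly: Dict[int, int],
) -> float:
    """Ratio of the article's peak share to the baseline's peak share.

    A value well above 1 suggests the article draws relatively more of
    its traffic during the assumed Americas window than the site overall.
    """
    baseline = peak_share(baseline_hourly)
    if baseline == 0:
        return 0.0
    return peak_share(article_hourly) / baseline
```

Both inputs map a UTC hour (0 to 23) to a view count. A high ratio is only indirect evidence, since readers elsewhere may also be active in the same hours.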

TECHNOLOGIES USED

  • Hadoop
  • HDFS
  • Python
  • Hive
  • MapReduce
  • YARN

DATASETS USED

https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageviews

https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream
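
To show how the clickstream data can feed the internal-link questions, here is a small Python sketch. It assumes each clickstream row is tab-separated as `prev`, `curr`, `type`, `n`, where `type` is `link` for clicks on internal links, and that per-article view totals for the same period are already available from the pageviews data. The function names are hypothetical and this is not the project's Hive or MapReduce code.

```python
from collections import defaultdict
from typing import Dict, Iterable


def link_clicks_by_source(clickstream_lines: Iterable[str]) -> Dict[str, int]:
    """Sum internal-link clicks leaving each source article."""
    clicks: Dict[str, int] = defaultdict(int)
    for line in clickstream_lines:
        parts = line.rstrip("\n").split("\t")
        # Assumed layout: prev, curr, type, n
        if len(parts) != 4:
            continue  # skip malformed rows
        prev, _curr, link_type, n = parts
        if link_type == "link" and n.isdigit():
            clicks[prev] += int(n)
    return dict(clicks)


def click_through_fraction(
    clicks: Dict[str, int],
    views: Dict[str, int],
) -> Dict[str, float]:
    """Internal-link clicks divided by views, per article with known views."""
    return {
        title: clicks[title] / views[title]
        for title in clicks
        if views.get(title, 0) > 0
    }
```

The fraction can exceed 1 when readers click more than one link per view, so it is better read as clicks per view than as a strict share of readers.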
