
spark's Introduction

Spark - Wikipedia page popularity

The goal of this code is to process 2 GB of Wikipedia link data in parallel and, by analyzing the relationships between nodes in the Wikipedia network, to identify the most popular webpages. To accomplish this, I evaluated the data with the PageRank algorithm and ran the job on a Google Cloud Platform (GCP) cluster. When the algorithm finishes, the most prominent pages in the network should have the highest probability allocated to them.

The dataset is stored on HDFS as a series of txt files in the following format:

2 {'3': 3}

3 {'2': 1}

4 {'1': 1, '2': 1}

5 {'4': 1, '2': 1, '6': 1}

where each row starts with the id of a webpage, and the dictionary maps the id of each webpage it links to onto the outbound traffic to that page.
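As a minimal sketch of how one such row could be read, the snippet below parses a line into a (node_id, edges) pair. The function name `parse_line` is hypothetical, and it assumes the id and the dictionary are separated by whitespace; the repository's own parsing code may differ.

```python
import ast


def parse_line(line):
    """Parse one row such as "4 {'1': 1, '2': 1}" into (node_id, edges)."""
    # Assumption: the id and the dictionary are separated by whitespace (tab or space).
    node_id, raw_edges = line.strip().split(None, 1)
    # literal_eval reads the Python-literal dictionary without executing arbitrary code.
    edges = ast.literal_eval(raw_edges)
    # Normalize ids to strings and weights to integers.
    return node_id, {str(target): int(weight) for target, weight in edges.items()}


sample = [
    "2 {'3': 3}",
    "3 {'2': 1}",
    "4 {'1': 1, '2': 1}",
    "5 {'4': 1, '2': 1, '6': 1}",
]
records = [parse_line(line) for line in sample]
```

With Spark, a function of this shape could be passed to `map` on the RDD of text lines, so that each worker parses its own partition.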

The diagram below gives an overview of the math needed to iterate over the network graph. On each iteration, every page distributes its probability among its children (the pages it links to). The algorithm also accounts for dangling nodes (dead-end webpages with no outbound links): from these, the user teleports to a random page in the network, so every node receives an equal share of that mass on the next iteration. Because the dataset is large, I work with Spark RDDs and dictionaries in the following key format: (node_id, (score, edges)).

PR-illustrated.png
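The iteration described above can be sketched on a single machine in plain Python, using the same (node_id, (score, edges)) layout. This is an illustration rather than the repository's Spark implementation: the names `init_graph` and `pagerank_step`, the damping factor of 0.85, and the choice to split a page's score in proportion to its edge weights are all assumptions.

```python
def init_graph(records):
    """Build {node_id: (score, edges)} with a uniform starting score."""
    graph = {node: dict(edges) for node, edges in records}
    # Pages that appear only as link targets have no row of their own;
    # add them with no outbound edges so they are treated as dangling nodes.
    for edges in list(graph.values()):
        for target in edges:
            graph.setdefault(target, {})
    n = len(graph)
    return {node: (1.0 / n, edges) for node, edges in graph.items()}


def pagerank_step(graph, damping=0.85):
    """Run one PageRank iteration and return the updated graph."""
    n = len(graph)
    incoming = {node: 0.0 for node in graph}
    dangling_mass = 0.0
    for node, (score, edges) in graph.items():
        total_weight = sum(edges.values())
        if total_weight == 0:
            # Dangling node: its score is shared evenly by all nodes below.
            dangling_mass += score
            continue
        for target, weight in edges.items():
            # Assumption: children receive mass in proportion to edge weight.
            incoming[target] += score * weight / total_weight
    return {
        node: ((1 - damping) / n + damping * (incoming[node] + dangling_mass / n), edges)
        for node, (_, edges) in graph.items()
    }


# The four sample rows from the dataset description, already parsed.
records = [
    ("2", {"3": 3}),
    ("3", {"2": 1}),
    ("4", {"1": 1, "2": 1}),
    ("5", {"4": 1, "2": 1, "6": 1}),
]
graph = init_graph(records)
for _ in range(10):
    graph = pagerank_step(graph)

# Rank pages from most to least probable.
ranking = sorted(graph, key=lambda node: graph[node][0], reverse=True)
```

Because the dangling mass and the teleport term are both spread evenly, the scores still sum to 1 after every step, which is the property the accumulator check in the deployed job relies on.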

I deployed the program on Google Cloud Platform so that the job runs across several worker nodes. Translating a complex mathematical equation accurately across several machines is challenging, so to verify correctness I use an accumulator to track the total sum of the probabilities. As you can see below, it returns a total probability of 1.00 on each iteration, as desired. This is a slight deviation from an ideally stateless architecture, but it is a vital testing step.

output of code
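The accumulator check described above might look roughly like the following sketch. The helper names `total_probability` and `check_total_probability` and the tolerance value are hypothetical; `sc` is assumed to be a PySpark SparkContext and `ranks` an RDD of (node_id, (score, edges)) records. Only `sc.accumulator`, `add`, `value`, and `foreach` are used from Spark.

```python
def total_probability(sc, ranks):
    """Sum the scores of (node_id, (score, edges)) records with an accumulator."""
    # Assumption: sc is a SparkContext and ranks is an RDD of such records.
    total = sc.accumulator(0.0)
    # foreach is an action: workers add each record's score to the accumulator.
    ranks.foreach(lambda record: total.add(record[1][0]))
    # The accumulated value is read back on the driver.
    return total.value


def check_total_probability(sc, ranks, tolerance=1e-6):
    """Raise if the probability mass has drifted away from 1.0."""
    total = total_probability(sc, ranks)
    if abs(total - 1.0) > tolerance:
        raise ValueError(f"total probability is {total}, expected 1.0")
    return total


# Hypothetical usage inside the iteration loop of a Spark job:
#   from pyspark import SparkContext
#   sc = SparkContext(appName="pagerank")
#   check_total_probability(sc, ranks)
```

Calling a check like this once per iteration adds an extra pass over the data, so it is best treated as a testing aid rather than part of the core computation.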

The output CSV lists which nodes have the highest probability within the link network.

If you are interested in learning more, here is the link to Larry Page and Sergey Brin's original paper from 1998:

http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf

spark's People

Contributors

jphilippou27
