Git Product home page Git Product logo

wiki_pagerank's Introduction

#wiki_pagerank

This is a set of python scripts to run PageRank on the wikipedia link graph.

##Background by Luke Stanley

Realising that often bandwidth is bad, I got excited about offline Wikipedia collections. I decided to give the gift of knowledge for Christmas (as well as some bootable Ubuntu for some relatives with virus ridden XP boxes).

But I was disappointed that a bunch of articles I thought were really important were missing from existing offline Wikipedia article collections. They were out of date, and the manual selections of articles people chose was too limited in my opinion. It wasn't clear if this was just distasteful choices or willful censorship.

I thought the PageRank algorithm could be a good way to filter less significant articles, and make my own article collection instead of a presumably messy human bureaucracy for choosing what articles make the cut. I found Sean Harnett had calculated PageRank before on Wikipedia dataset here: http://www.columbia.edu/~srh2144/neatthings/ I asked him to upload his Python scripts for doing so, and he kindly obliged putting them here on Github: https://github.com/sharnett/wiki_pagerank

They didn't require much work to get going with a smaller Wikipedia dataset, I wrote a few setup scripts for use on a single computer.

##Some results

##Example Usage

sudo apt-get install python-scipy
./download_and_extract.sh
python main.py

This should take a minute or so and get you something like this:

making title <--> ID dictionaries...
building the graph...
compactifying...
reading graph file and matrixifying...
loading data...
doing pagerank
i = 1 err = 0.0721205984186
i = 2 err = 0.0309899557399
i = 3 err = 0.0180857631125
i = 4 err = 0.00638996682224
i = 5 err = 0.00302626819411
i = 6 err = 0.00118320972815
i = 7 err = 0.000542479457445
sorting
 1 Wikipedia
 2 Malayang_software
 3 Wikang_Ingles
 4 Malayang_sopwer
 5 Hapon
 6 Estados_Unidos
 7 Pilipinas
 8 Ensiklopedya
 9 GNU/Linux
10 World_Wide_Web

##Tips

You may want to use MapReduce to speed the creation of graph.txt for larger Wikipedia language versions (the original motivation was an excuse to play with MapReduce). Included are the mapper and reducer scripts if you want to go that route. In any case, you should definitely test with a small Wikipedia first to ensure that things work properly.

Luke ended up needing to use a big EC2 instance to run it on the full English wikipedia as his laptop didn't have enough memory.

The Tagalog Wikipedia is a good small test Wikipedia. As of December 2013, tlwiki-latest-pagelinks.sql decompresses to 95 MB, and tlwiki-latest-page.sql -- the list of articles -- expands to 16 MB. Included is a file to download and extract these.

wiki_pagerank's People

Contributors

lukestanley avatar sharnett avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.