Six Degrees of Wikipedia (SDOW)

Data Source

Wikipedia dumps raw database tables in a gzipped SQL format for the English language Wikipedia (enwiki) approximately once a month (e.g. dump from February 1, 2018). The entire database layout is not required, and the database creation script only downloads, trims, and parses three tables:

  1. page - Contains the ID and name (among other things) for all pages.
  2. pagelinks - Contains the source and target pages for all links.
  3. redirect - Contains the source and target pages for all redirects.

For performance reasons, the files are downloaded from the dumps.wikimedia.your.org mirror. By default, the script grabs the latest dump (available at https://dumps.wikimedia.your.org/enwiki/latest/), but you can also call the database creation script with a download date in the format YYYYMMDD as the first argument.

SDOW only concerns itself with actual Wikipedia articles, which belong to namespace 0 in the Wikipedia data.
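The namespace filter can be sketched as follows. The `(id, namespace, title)` tuple layout here is an assumption for illustration, not the exact column order of the trimmed dump:

```python
# Keep only main-namespace (namespace 0) rows, i.e. actual articles.
# The (page_id, page_namespace, page_title) tuple layout is assumed
# for this sketch; the real script filters the parsed `page` table.
def keep_articles(rows):
    return [(pid, title) for (pid, ns, title) in rows if ns == 0]

rows = [(1, 0, "Albert_Einstein"), (2, 1, "Talk:Albert_Einstein")]
print(keep_articles(rows))  # [(1, 'Albert_Einstein')]
```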

Database Creation Process

The result of running the database creation script is a single sdow.sqlite file which contains four tables:

  1. pages - Page information for all pages, including redirects.
    1. id - Page ID.
    2. title - Sanitized page title.
    3. is_redirect - Whether or not the page is a redirect (1 means it is a redirect; 0 means it is not).
  2. links - Outgoing and incoming links for each non-redirect page.
    1. id - The page ID of the source page, the page that contains the link.
    2. outgoing_links_count - The number of pages to which this page links.
    3. incoming_links_count - The number of pages which link to this page.
    4. outgoing_links - A |-separated list of the page IDs to which this page links.
    5. incoming_links - A |-separated list of the page IDs which link to this page.
  3. redirects - Source and target page IDs for all redirects.
    1. source_id - The page ID of the source page, the page that redirects to another page.
    2. target_id - The page ID of the target page, to which the redirect page redirects.
  4. searches - Results of all past searches.
    1. source_id - The page ID of the source page at which to start the search.
    2. target_id - The page ID of the target page at which to end the search.
    3. duration - How long the search took, in seconds.
    4. degrees_count - The number of degrees between the source and target pages.
    5. paths_count - The number of paths found between the source and target pages.
    6. paths - Stringified JSON representation of the paths of page IDs between the source and target pages.
    7. t - Timestamp when the search finished.
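The schema above can be queried directly with Python's built-in sqlite3 module. This sketch rebuilds a toy two-table version in memory so it is self-contained (column names come from the list above; the sample rows are invented), then resolves a page's outgoing links:

```python
import sqlite3

# Toy reconstruction of the sdow.sqlite `pages` and `links` tables;
# the real file holds millions of rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pages (id INTEGER PRIMARY KEY, title TEXT, is_redirect INTEGER);
    CREATE TABLE links (id INTEGER PRIMARY KEY, outgoing_links_count INTEGER,
                        incoming_links_count INTEGER, outgoing_links TEXT,
                        incoming_links TEXT);
    INSERT INTO pages VALUES (1, 'Albert_Einstein', 0);
    INSERT INTO links VALUES (1, 2, 1, '2|3', '4');
""")

# Look up a page by title and split its |-separated outgoing links.
row = conn.execute(
    "SELECT l.outgoing_links FROM pages p JOIN links l ON l.id = p.id "
    "WHERE p.title = ?", ("Albert_Einstein",)
).fetchone()
outgoing_ids = [int(page_id) for page_id in row[0].split("|")]
print(outgoing_ids)  # [2, 3]
```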

Generating the SDOW database from a dump of Wikipedia takes approximately one hour given the following instructions:

  1. Create a new Google Compute Engine instance from the sdow-db-builder instance template, which is configured with the following specs:
    1. Name: sdow-db-builder-1
    2. Zone: us-central1-c
    3. Machine Type: n1-highmem-8 (8 vCPUs, 52 GB RAM)
    4. Boot disk: 256 GB SSD, Debian GNU/Linux 8 (jessie)
    5. Notes: Allow full access to all Cloud APIs. Do not use Debian GNU/Linux 9 (stretch) due to degraded performance.
  2. SSH into the machine:
    $ gcloud compute ssh sdow-db-builder-1
  3. Install required operating system dependencies:
    $ sudo apt-get -q update
    $ sudo apt-get -yq install git pigz sqlite3
  4. Clone this repository via HTTPS:
    $ git clone https://github.com/jwngr/sdow.git
  5. Move to the proper directory and create a new screen in case the VM connection is lost:
    $ cd sdow/database/
    $ screen  # And then press <ENTER> on the screen that pops up
  6. Run the database creation script, providing an optional date for the backup:
    $ (time ./buildDatabase.sh [<YYYYMMDD>]) &> output.txt
  7. Detach from the current screen session by pressing <CTRL> + <a> and then <d>. To reattach to the screen, run screen -r. Make sure to always detach from the screen cleanly so it can be resumed!
  8. Copy the script output and the resulting SQLite file to the sdow-prod GCS bucket:
    $ gsutil cp output.txt gs://sdow-prod/dumps/<YYYYMMDD>/
    $ gsutil cp dump/sdow.sqlite gs://sdow-prod/dumps/<YYYYMMDD>/
    
  9. Delete the VM to prevent incurring large fees.

Web Server

Initial Setup

  1. Create a new Google Compute Engine instance from the sdow-web-server instance template, which is configured with the following specs:
    1. Name: sdow-web-server-1
    2. Zone: us-central1-c
    3. Machine Type: f1-micro (1 vCPU, 0.6 GB RAM)
    4. Boot disk: 16 GB SSD, Debian GNU/Linux 8 (jessie)
    5. Notes: Allow default access to Cloud APIs. Do not use Debian GNU/Linux 9 (stretch) due to degraded performance.
  2. SSH into the machine:
    $ gcloud compute ssh sdow-web-server-1
  3. Install required operating system dependencies to run the Flask app:
    $ sudo apt-get -q update
    $ sudo apt-get -yq install git pigz sqlite3 python-pip
    $ sudo pip install --upgrade pip setuptools virtualenv
    # OR for Python 3
    #$ sudo apt-get -q update
    #$ sudo apt-get -yq install git pigz sqlite3 python3-pip
    #$ sudo pip3 install --upgrade pip setuptools virtualenv
  4. Clone this repository via HTTPS and navigate into the repo:
    $ git clone https://github.com/jwngr/sdow.git
    $ cd sdow/
  5. Create and activate a new virtualenv environment:
    $ virtualenv -p python2 env  # OR virtualenv -p python3 env
    $ source env/bin/activate
  6. Install the required Python libraries:
    $ pip install -r requirements.txt
  7. Copy the latest SQLite file from the sdow-prod GCS bucket:
    $ gsutil cp gs://sdow-prod/dumps/<YYYYMMDD>/sdow.sqlite ./sdow/sdow.sqlite
  8. Install required operating system dependencies to generate an SSL certificate (this and the following instructions are based on these blog posts):
    $ echo 'deb http://ftp.debian.org/debian jessie-backports main' | sudo tee /etc/apt/sources.list.d/backports.list
    $ sudo apt-get -q update
    $ sudo apt-get -yq install nginx
    $ sudo apt-get -yq install certbot -t jessie-backports
  9. Add this location block inside the server block in /etc/nginx/sites-available/default:
    location ~ /.well-known {
        allow all;
    }
    
  10. Start NGINX:
    $ sudo systemctl restart nginx
  11. Ensure the server has the proper static IP address (sdow-web-server-static-ip) by editing it on the GCP console if necessary.
  12. Create an SSL certificate using Let's Encrypt's certbot:
    $ sudo certbot certonly -a webroot --webroot-path=/var/www/html -d api.sixdegreesofwikipedia.com --email [email protected]
  13. Ensure auto-renewal of the SSL certificate is configured properly:
    $ certbot renew --dry-run
  14. Run crontab -e and add the following cron job to that file to auto-renew the SSL certificate:
    0 0,12 * * * python -c 'import random; import time; time.sleep(random.random() * 3600)' && /usr/bin/certbot renew
    
  15. Generate a strong Diffie-Hellman group to further increase security (note that this can take a couple minutes):
    $ sudo openssl dhparam -out /etc/ssl/certs/dhparam.pem 2048
  16. Copy over the NGINX configuration, making sure to back up the original configuration:
    $ sudo cp /etc/nginx/nginx.conf /etc/nginx/nginx.conf.backup
    $ sudo cp ./config/nginx.conf /etc/nginx/nginx.conf
  17. Restart NGINX:
    $ sudo systemctl restart nginx

Recurring Setup

  1. Activate the virtualenv environment:
    $ cd sdow/
    $ source env/bin/activate
  2. Set the SDOW_ENV environment variable to prod:
    $ export SDOW_ENV=prod
  3. Start the Flask app via Supervisor, which runs Gunicorn:
    $ cd sdow/
    $ supervisord -c ../config/supervisord.conf
  4. Ensure the app was started successfully by running supervisorctl -c ../config/supervisord.conf.

Resources

Edge Case Pages

ID | Title | Sanitized Title
50899560 | 🦎 | 🦎
725006 | " | \"
438953 | 4′33″ | 4′33″
32055176 | Λ-ring | Λ-ring
11760 | F-110 Spectre | F-110_Spectre
8695 | Dr. Strangelove | Dr._Strangelove
337903 | Farmers' market | Farmers\'_market
24781873 | Lindström (company) | Lindström_(company)
54201777 | Disinformation (book) | Disinformation_(book)
1514 | Albert, Duke of Prussia | Albert,_Duke_of_Prussia
35703467 | "A," My Name is Alex - Parts I & II | \"A,\"\_My_Name_is_Alex_-_Parts_I_&_II
54680944 | N,N,N′,N′-tetramethylethylenediamine | N,N,N′,N′-tetramethylethylenediamine
24781871 | Jack in the Green: Live in Germany 1970–1993 | Jack_in_the_Green:_Live_in_Germany_1970–1993
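The sanitized titles above suggest a transformation along these lines; this is an inference from the examples, not the project's actual sanitization code:

```python
# Inferred title sanitization: spaces become underscores and quote
# characters are backslash-escaped. This is a sketch reverse-engineered
# from the edge-case table, not the real SDOW implementation.
def sanitize(title):
    return title.replace(" ", "_").replace("'", "\\'").replace('"', '\\"')

print(sanitize("Farmers' market"))   # Farmers\'_market
print(sanitize("Dr. Strangelove"))   # Dr._Strangelove
```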

Interesting searches

Source Page Title | Target Page Title | Notes
Hargrave Military Academy | Illiosentidae | Cool graph
Arthropod | Haberdashers' Aske's Boys' School | Cool graph
AC power plugs and sockets | Gymnobela abyssorum | 1,311 paths of 6 degrees
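Searches like these boil down to a shortest-path search over the links table. A minimal breadth-first-search sketch over a toy in-memory adjacency map (not the project's actual, likely bidirectional, implementation) illustrates the degrees-of-separation idea:

```python
from collections import deque

# Breadth-first search on a toy link graph mapping page ID -> linked IDs.
# Returns the number of degrees (link hops) from source to target,
# or None if no path exists.
def degrees(graph, source, target):
    if source == target:
        return 0
    seen, frontier = {source}, deque([(source, 0)])
    while frontier:
        page, depth = frontier.popleft()
        for nxt in graph.get(page, []):
            if nxt == target:
                return depth + 1
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return None

graph = {1: [2, 3], 2: [4], 3: [4], 4: [5]}
print(degrees(graph, 1, 5))  # 3
```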

Contributing

See the contribution page for details.
