Six Degrees of Wikipedia (SDOW)

Data Source

Wikipedia dumps raw database tables in a gzipped SQL format for the English language Wikipedia (enwiki) approximately once a month (e.g. dump from February 1, 2018). The entire database layout is not required, and the database creation script only downloads, trims, and parses three tables:

  1. page - Contains the ID and name (among other things) for all pages.
  2. pagelinks - Contains the source and target pages for all links.
  3. redirect - Contains the source and target pages for all redirects.

For performance reasons, the files are downloaded from the dumps.wikimedia.your.org mirror. By default, the script grabs the latest dump (available at https://dumps.wikimedia.your.org/enwiki/latest/), but you can also call the database creation script with a download date in the format YYYYMMDD as the first argument.

SDOW only concerns itself with actual Wikipedia articles, which belong to namespace 0 in the Wikipedia data.
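The namespace filter can be sketched as follows. The `(id, namespace, title)` tuple layout here is an assumption for illustration, not the exact column order of the trimmed dump:

```python
# Keep only main-namespace (namespace 0) rows, i.e. actual articles.
# The (page_id, page_namespace, page_title) tuple layout is assumed
# for this sketch; the real script filters the parsed `page` table.
def keep_articles(rows):
    return [(pid, title) for (pid, ns, title) in rows if ns == 0]

rows = [(1, 0, "Albert_Einstein"), (2, 1, "Talk:Albert_Einstein")]
print(keep_articles(rows))  # [(1, 'Albert_Einstein')]
```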

Database Creation Process

The result of running the database creation script is a single sdow.sqlite file which contains four tables:

  1. pages - Page information for all pages, including redirects.
    1. id - Page ID.
    2. title - Sanitized page title.
    3. is_redirect - Whether or not the page is a redirect (1 means it is a redirect; 0 means it is not).
  2. links - Outgoing and incoming links for each non-redirect page.
    1. id - The page ID of the source page, the page that contains the link.
    2. outgoing_links_count - The number of pages to which this page links.
    3. incoming_links_count - The number of pages which link to this page.
    4. outgoing_links - A |-separated list of the page IDs to which this page links.
    5. incoming_links - A |-separated list of the page IDs which link to this page.
  3. redirects - Source and target page IDs for all redirects.
    1. source_id - The page ID of the source page, the page that redirects to another page.
    2. target_id - The page ID of the target page, to which the redirect page redirects.
  4. searches - Results of all past searches.
    1. source_id - The page ID of the source page at which to start the search.
    2. target_id - The page ID of the target page at which to end the search.
    3. duration - How long the search took, in seconds.
    4. degrees_count - The number of degrees between the source and target pages.
    5. paths_count - The number of paths found between the source and target pages.
    6. paths - Stringified JSON representation of the paths of page IDs between the source and target pages.
    7. t - Timestamp when the search finished.
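The schema above can be queried directly with Python's built-in sqlite3 module. This sketch rebuilds a toy two-table version in memory so it is self-contained (column names come from the list above; the sample rows are invented), then resolves a page's outgoing links:

```python
import sqlite3

# Toy reconstruction of the sdow.sqlite `pages` and `links` tables;
# the real file holds millions of rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pages (id INTEGER PRIMARY KEY, title TEXT, is_redirect INTEGER);
    CREATE TABLE links (id INTEGER PRIMARY KEY, outgoing_links_count INTEGER,
                        incoming_links_count INTEGER, outgoing_links TEXT,
                        incoming_links TEXT);
    INSERT INTO pages VALUES (1, 'Albert_Einstein', 0);
    INSERT INTO links VALUES (1, 2, 1, '2|3', '4');
""")

# Look up a page by title and split its |-separated outgoing links.
row = conn.execute(
    "SELECT l.outgoing_links FROM pages p JOIN links l ON l.id = p.id "
    "WHERE p.title = ?", ("Albert_Einstein",)
).fetchone()
outgoing_ids = [int(page_id) for page_id in row[0].split("|")]
print(outgoing_ids)  # [2, 3]
```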

Generating the SDOW database from a dump of Wikipedia takes approximately one hour given the following instructions:

  1. Create a new Google Compute Engine instance from the sdow-db-builder instance template, which is configured with the following specs:
    1. Name: sdow-db-builder-1
    2. Zone: us-central1-c
    3. Machine Type: n1-highmem-8 (8 vCPUs, 52 GB RAM)
    4. Boot disk: 256 GB SSD, Debian GNU/Linux 8 (jessie)
    5. Notes: Allow full access to all Cloud APIs. Do not use Debian GNU/Linux 9 (stretch) due to degraded performance.
  2. SSH into the machine:
    $ gcloud compute ssh sdow-db-builder-1
  3. Install required operating system dependencies:
    $ sudo apt-get -q update
    $ sudo apt-get -yq install git pigz sqlite3
  4. Clone this repository via HTTPS:
    $ git clone https://github.com/jwngr/sdow.git
  5. Move to the proper directory and create a new screen in case the VM connection is lost:
    $ cd sdow/database/
    $ screen  # And then press <ENTER> on the screen that pops up
  6. Run the database creation script, providing an optional date for the backup:
    $ (time ./buildDatabase.sh [<YYYYMMDD>]) &> output.txt
  7. Detach from the current screen session by pressing <CTRL> + <a> and then <d>. To reattach to the screen, run screen -r. Make sure to always detach from the screen cleanly so it can be resumed!
  8. Copy the script output and the resulting SQLite file to the sdow-prod GCS bucket:
    $ gsutil cp output.txt gs://sdow-prod/dumps/<YYYYMMDD>/
    $ gsutil cp dump/sdow.sqlite gs://sdow-prod/dumps/<YYYYMMDD>/
    
  9. Delete the VM to prevent incurring large fees.

Web Server

Initial Setup

  1. Create a new Google Compute Engine instance from the sdow-web-server instance template, which is configured with the following specs:
    1. Name: sdow-web-server-1
    2. Zone: us-central1-c
    3. Machine Type: f1-micro (1 vCPU, 0.6 GB RAM)
    4. Boot disk: 16 GB SSD, Debian GNU/Linux 8 (jessie)
    5. Notes: Allow default access to Cloud APIs. Do not use Debian GNU/Linux 9 (stretch) due to degraded performance.
  2. SSH into the machine:
    $ gcloud compute ssh sdow-web-server-1
  3. Install required operating system dependencies to run the Flask app:
    $ sudo apt-get -q update
    $ sudo apt-get -yq install git pigz sqlite3 python-pip
    $ sudo pip install --upgrade pip setuptools virtualenv
    # OR for Python 3
    #$ sudo apt-get -q update
    #$ sudo apt-get -yq install git pigz sqlite3 python3-pip
    #$ sudo pip3 install --upgrade pip setuptools virtualenv
  4. Clone this repository via HTTPS and navigate into the repo:
    $ git clone https://github.com/jwngr/sdow.git
    $ cd sdow/
  5. Create and activate a new virtualenv environment:
    $ virtualenv -p python2 env  # OR virtualenv -p python3 env
    $ source env/bin/activate
  6. Install the required Python libraries:
    $ pip install -r requirements.txt
  7. Copy the latest SQLite file from the sdow-prod GCS bucket:
    $ gsutil cp gs://sdow-prod/dumps/<YYYYMMDD>/sdow.sqlite ./sdow/sdow.sqlite
  8. Install required operating system dependencies to generate an SSL certificate (this and the following instructions are based on these blog posts):
    $ echo 'deb http://ftp.debian.org/debian jessie-backports main' | sudo tee /etc/apt/sources.list.d/backports.list
    $ sudo apt-get -q update
    $ sudo apt-get -yq install nginx
    $ sudo apt-get -yq install certbot -t jessie-backports
  9. Add this location block inside the server block in /etc/nginx/sites-available/default:
    location ~ /.well-known {
        allow all;
    }
    
  10. Start NGINX:
    $ sudo systemctl restart nginx
  11. Ensure the server has the proper static IP address (sdow-web-server-static-ip) by editing it on the GCP console if necessary.
  12. Create an SSL certificate using Let's Encrypt's certbot:
    $ sudo certbot certonly -a webroot --webroot-path=/var/www/html -d api.sixdegreesofwikipedia.com --email [email protected]
  13. Ensure auto-renewal of the SSL certificate is configured properly:
    $ certbot renew --dry-run
  14. Run crontab -e and add the following cron job to that file to auto-renew the SSL certificate:
    0 0,12 * * * python -c 'import random; import time; time.sleep(random.random() * 3600)' && /usr/bin/certbot renew
    
  15. Generate a strong Diffie-Hellman group to further increase security (note that this can take a couple minutes):
    $ sudo openssl dhparam -out /etc/ssl/certs/dhparam.pem 2048
  16. Copy over the NGINX configuration, making sure to back up the original configuration:
    $ sudo cp /etc/nginx/nginx.conf /etc/nginx/nginx.conf.backup
    $ sudo cp ./config/nginx.conf /etc/nginx/nginx.conf
  17. Restart NGINX:
    $ sudo systemctl restart nginx

Recurring Setup

  1. Activate the virtualenv environment:
    $ cd sdow/
    $ source env/bin/activate
  2. Set the SDOW_ENV environment variable to prod:
    $ export SDOW_ENV=prod
  3. Start the Flask app via Supervisor, which runs Gunicorn:
    $ cd sdow/
    $ supervisord -c ../config/supervisord.conf
  4. Ensure the app was started successfully by running supervisorctl -c ../config/supervisord.conf.

Resources

Edge Case Pages

ID | Title | Sanitized Title
50899560 | 🦎 | 🦎
725006 | " | \"
438953 | 4′33″ | 4′33″
32055176 | Λ-ring | Λ-ring
11760 | F-110 Spectre | F-110_Spectre
8695 | Dr. Strangelove | Dr._Strangelove
337903 | Farmers' market | Farmers\'_market
24781873 | Lindström (company) | Lindström_(company)
54201777 | Disinformation (book) | Disinformation_(book)
1514 | Albert, Duke of Prussia | Albert,_Duke_of_Prussia
35703467 | "A," My Name is Alex - Parts I & II | \"A,\"\_My_Name_is_Alex_-_Parts_I_&_II
54680944 | N,N,N′,N′-tetramethylethylenediamine | N,N,N′,N′-tetramethylethylenediamine
24781871 | Jack in the Green: Live in Germany 1970–1993 | Jack_in_the_Green:_Live_in_Germany_1970–1993
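The sanitized titles above suggest a transformation along these lines; this is an inference from the examples, not the project's actual sanitization code:

```python
# Inferred title sanitization: spaces become underscores and quote
# characters are backslash-escaped. This is a sketch reverse-engineered
# from the edge-case table, not the real SDOW implementation.
def sanitize(title):
    return title.replace(" ", "_").replace("'", "\\'").replace('"', '\\"')

print(sanitize("Farmers' market"))   # Farmers\'_market
print(sanitize("Dr. Strangelove"))   # Dr._Strangelove
```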

Interesting searches

Source Page Title | Target Page Title | Notes
Hargrave Military Academy | Illiosentidae | Cool graph
Arthropod | Haberdashers' Aske's Boys' School | Cool graph
AC power plugs and sockets | Gymnobela abyssorum | 1,311 paths of 6 degrees
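Searches like these boil down to a shortest-path search over the links table. A minimal breadth-first-search sketch over a toy in-memory adjacency map (not the project's actual, likely bidirectional, implementation) illustrates the degrees-of-separation idea:

```python
from collections import deque

# Breadth-first search on a toy link graph mapping page ID -> linked IDs.
# Returns the number of degrees (link hops) from source to target,
# or None if no path exists.
def degrees(graph, source, target):
    if source == target:
        return 0
    seen, frontier = {source}, deque([(source, 0)])
    while frontier:
        page, depth = frontier.popleft()
        for nxt in graph.get(page, []):
            if nxt == target:
                return depth + 1
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return None

graph = {1: [2, 3], 2: [4], 3: [4], 4: [5]}
print(degrees(graph, 1, 5))  # 3
```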

Contributing

See the contribution page for details.
