Graphipedia

A set of tools for creating a graph database of Wikipedia pages and the links between them.

Importing Data

The graphipedia-dataimport module lets you create a Neo4j database from a Wikipedia database dump.

See Wikipedia:Database_download for instructions on getting a Wikipedia database dump.

Assuming you downloaded pages-articles.xml.bz2, follow these steps:

  1. Run ExtractLinks to create a smaller intermediate XML file containing only page titles and links. The best way to do this is to decompress the bzip2 file and pipe the output directly to ExtractLinks:

    bzip2 -dc pages-articles.xml.bz2 | java -classpath graphipedia-dataimport.jar org.graphipedia.dataimport.ExtractLinks - enwiki-links.xml

  2. Run ImportGraph to create a Neo4j database, with nodes and relationships, in a graphdb directory:

    java -Xmx3G -classpath graphipedia-dataimport.jar org.graphipedia.dataimport.neo4j.ImportGraph enwiki-links.xml graphdb
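To make the first step more concrete, here is a minimal sketch of the kind of work a link extractor does: pulling link targets out of wikitext such as `[[Target]]` or `[[Target|label]]`. This is not the project's actual code. The class name `LinkSketch`, the regular expression, and the choice to skip namespaced links (those containing a colon, such as `File:` or `Category:`) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the ExtractLinks implementation.
public class LinkSketch {
    // Matches [[Target]] and [[Target|label]]; group 1 is the target title.
    private static final Pattern LINK =
            Pattern.compile("\\[\\[([^\\]|]+)(?:\\|[^\\]]*)?\\]\\]");

    // Returns the link targets found in a piece of wikitext, in order.
    public static List<String> extractLinks(String wikitext) {
        List<String> targets = new ArrayList<>();
        Matcher m = LINK.matcher(wikitext);
        while (m.find()) {
            String target = m.group(1).trim();
            // Assumption: skip namespaced links such as File: or Category:.
            if (!target.contains(":")) {
                targets.add(target);
            }
        }
        return targets;
    }

    public static void main(String[] args) {
        String text = "[[Graph database]]s such as [[Neo4j|Neo4j]] store nodes. [[File:Logo.png]]";
        System.out.println(extractLinks(text)); // prints [Graph database, Neo4j]
    }
}
```

Each page title together with its list of targets is roughly the information the intermediate XML file needs to carry into the second step, where titles become nodes and links become relationships.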

To give an idea of scale, enwiki-20130204-pages-articles.xml.bz2 is 9.1G and contains almost 10M pages, from which over 92M links are extracted.

On a T420 laptop running Linux with an SSD, the import took 31m 42s for the decompress and ExtractLinks step (about the same as decompression alone) and 10m 42s for ImportGraph.

(Note that disk I/O is the critical factor here: the same import can easily take several hours on an old 5400 RPM drive.)
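As a quick sanity check on the figures above, the two timings can be added up and turned into a rough import rate. The sketch below assumes exactly 92M links, although the text only says "over 92M", so the rate is a ballpark figure rather than a measurement. The class name `ImportTiming` is made up for this example.

```java
import java.time.Duration;

// Hypothetical helper for back-of-the-envelope arithmetic on the quoted timings.
public class ImportTiming {
    // Timings quoted above for the two steps.
    static final Duration EXTRACT = Duration.ofMinutes(31).plusSeconds(42);
    static final Duration IMPORT = Duration.ofMinutes(10).plusSeconds(42);

    // Total wall-clock time for both steps.
    public static Duration total() {
        return EXTRACT.plus(IMPORT);
    }

    // Whole links processed per second over the given duration.
    public static long linksPerSecond(long links, Duration d) {
        return links / d.getSeconds();
    }

    public static void main(String[] args) {
        Duration t = total();
        System.out.printf("total: %dm %ds%n", t.toMinutes(), t.getSeconds() % 60); // total: 42m 24s
        // Assumption: 92,000,000 links, rounded from "over 92M".
        System.out.println("links/s during ImportGraph: " + linksPerSecond(92_000_000L, IMPORT));
    }
}
```

Under these assumptions the whole import takes a little over 42 minutes, and ImportGraph handles on the order of 140,000 links per second on the hardware described.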
