Sparkwiki - processing tools for wikipedia data

Basics

Sparkwiki is a set of tools written in Scala (2.11) for processing Wikipedia data, namely SQL table dumps and pagecount records.

Usage

Pre-requisites

You need:

Important: for performance reasons, you should convert .sql.gz table dumps to .sql.bz2 (so that Spark can process them in parallel and take advantage of all processors in the system), e.g. using a command such as
zcat dump.sql.gz | bzip2 > dump.sql.bz2
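The per-file command generalizes to a whole dump directory. A minimal sketch (the `recompress_dumps` helper and the directory path are illustrative, not part of sparkwiki):

```shell
# recompress_dumps DIR: convert every .sql.gz dump in DIR to .sql.bz2,
# so Spark can split the bzip2 files and read them in parallel.
recompress_dumps() {
  for f in "$1"/*.sql.gz; do
    [ -e "$f" ] || continue              # glob matched nothing: skip
    zcat "$f" | bzip2 > "${f%.gz}.bz2"   # dump.sql.gz -> dump.sql.bz2
  done
}

recompress_dumps /path/to/dumps          # hypothetical dump directory
```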

Every tool can be run via spark-submit, e.g.

./spark-submit --class ch.epfl.lts2.wikipedia.[ToolNameHere]  --master 'local[*]' --executor-memory 4g --driver-memory 4g --packages org.rogach:scallop_2.11:3.1.5,com.datastax.spark:spark-cassandra-connector_2.11:2.4.0 ./sparkwiki/target/scala-2.11/sparkwiki_2.11-0.6.8.jar [ToolArgsHere]

Build

From the project directory, run sbt package to build the jar file. If you want to edit the code, run sbt eclipse to generate an Eclipse project definition.

Dump processor

This tool can be run using the class ch.epfl.lts2.wikipedia.DumpProcessor as entry point. It will read the SQL table dumps whose names start with the value supplied by the namePrefix argument, in the directory specified by the dumpPath argument, and write (compressed) csv files in the directory specified by outputPath that can later be imported into the Neo4j graph database.

This tool requires the following table dumps to be present:

  • page
  • pagelinks
  • redirect
  • categorylinks

Arguments:

  • --dumpPath directory containing sql.bz2 files (gz not supported)
  • --outputPath output directory
  • --namePrefix leading part of each SQL dump, e.g. enwiki-20180801
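Combining these arguments with the generic spark-submit template from the Usage section, a DumpProcessor run might look like this (the dump and output paths are placeholders for illustration):

```shell
# Hypothetical local paths; adjust to your own layout.
./spark-submit --class ch.epfl.lts2.wikipedia.DumpProcessor \
  --master 'local[*]' --executor-memory 4g --driver-memory 4g \
  --packages org.rogach:scallop_2.11:3.1.5,com.datastax.spark:spark-cassandra-connector_2.11:2.4.0 \
  ./sparkwiki/target/scala-2.11/sparkwiki_2.11-0.6.8.jar \
  --dumpPath /data/wikipedia/dumps \
  --outputPath /data/wikipedia/neo4j-csv \
  --namePrefix enwiki-20180801
```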

Dump parser

This tool can be run using the class ch.epfl.lts2.wikipedia.DumpParser as entry point. It reads a single SQL dump and converts it to either a csv or a parquet file.

Arguments:

  • --dumpFilePath path to the .sql.gz or .sql.bz2 SQL dump to read
  • --dumpType should be page, redirect, pagelinks, category or categorylinks
  • --outputPath output directory
  • --outputFormat (default=csv), should be csv or parquet
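For instance, converting a page dump to parquet with the generic spark-submit template might look like this (the input and output paths are placeholders for illustration):

```shell
# Hypothetical local paths; adjust to your own layout.
./spark-submit --class ch.epfl.lts2.wikipedia.DumpParser \
  --master 'local[*]' --executor-memory 4g --driver-memory 4g \
  --packages org.rogach:scallop_2.11:3.1.5,com.datastax.spark:spark-cassandra-connector_2.11:2.4.0 \
  ./sparkwiki/target/scala-2.11/sparkwiki_2.11-0.6.8.jar \
  --dumpFilePath /data/wikipedia/dumps/enwiki-20180801-page.sql.bz2 \
  --dumpType page \
  --outputPath /data/wikipedia/parquet/page \
  --outputFormat parquet
```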

Pagecount processor

This tool can be run using the class ch.epfl.lts2.wikipedia.PagecountProcessor as entry point. It will read a collection of pagecount files covering the period between the startDate and endDate arguments (inclusive bounds), keep the counts belonging to en.z (the English Wikipedia project) that have more daily visits than the threshold given by the minDailyVisits argument, and save the result to a Cassandra DB after resolving page ids. Either a SQL page dump or a processed SQL page dump (as parquet) must be supplied via the pageDump argument.

Arguments:

  • --basePath directory containing pagecounts files
  • --startDate first day to process, formatted as yyyy-MM-dd, e.g. 2018-08-03
  • --endDate last day to process, formatted as yyyy-MM-dd
  • --pageDump path to a page SQL dump or a version processed by DumpParser and saved as parquet
  • --minDailyVisits minimum number of daily visits for a page to be considered (default=100)
  • --minDailyVisitsHourSplit minimum number of daily visits to parse hourly visits record (default=10000)
  • --keepRedirects if supplied, will also process visits of pages marked as redirects
  • --dbHost name or IP address of the Cassandra server
  • --dbPort port number on which Cassandra server runs (default = 9042)
  • --keySpace keyspace for saving the output
  • --table destination table for output
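A full PagecountProcessor invocation, again following the generic spark-submit template, might look like this (the paths, Cassandra host, keyspace, and table names are placeholders for illustration):

```shell
# Hypothetical paths and Cassandra settings; adjust to your own setup.
./spark-submit --class ch.epfl.lts2.wikipedia.PagecountProcessor \
  --master 'local[*]' --executor-memory 4g --driver-memory 4g \
  --packages org.rogach:scallop_2.11:3.1.5,com.datastax.spark:spark-cassandra-connector_2.11:2.4.0 \
  ./sparkwiki/target/scala-2.11/sparkwiki_2.11-0.6.8.jar \
  --basePath /data/wikipedia/pagecounts \
  --startDate 2018-08-01 --endDate 2018-08-31 \
  --pageDump /data/wikipedia/parquet/page \
  --dbHost 127.0.0.1 --dbPort 9042 \
  --keySpace wikipedia --table pagecounts
```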
