Git Product home page Git Product logo

commoncrawl's Introduction

#CommonCrawl Support Library

##Overview

The commoncrawl source code repository is used as a distribution vehicle for our custom Hadoop InputFormat (ARCInputFormat located in org.commoncrawl.hadoop.io). Please refer to the CommonCrawl website at http://www.commoncrawl.org/ for more details on how to access our crawl corpus.

The sample class BasicArcFileReaderSample.java (located in org.commoncrawl.samples) for an example of how to configure the InputFormat. A more detailed example of how to use it in the context of a Hadoop Job will be forthcoming.

##Build Notes:

  1. You need to define JAVA_HOME, and make sure you have Ant & Maven installed.
  2. Set hadoop.path (in build.properties) to point to your Hadoop folder.
  3. Make sure you have the thrift compiler (version 0.7.0) installed on your system.
  4. If you want to use the Google URL Canoncilization library in Hadoop job, copy the shared libraries under lib/native/{Platform} to /usr/local/lib or equivalent.

#Sample Usage:

Once commoncrawl.jar has been built, you can execute a job/sample via the bin/launcher.sh script. The sample class BasicArcFileReaderSample.java (located in org.commoncrawl.samples) demonstrates
how you can go about configuring our InputFormat. To run the BasicArcFileReaderSample against an ARC file in the corpus (2010/01/07/18/1262876244253_18.arc.gz for example), you would run the following command line:

bin/launcher.sh org.commoncrawl.samples.BasicArcFileReaderSample {AWS ACCESS KEY} {AWS SECRET KEY} commoncrawl-crawl-002 2010/01/07/18/1262876244253_18.arc.gz

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.