commoncrawl's Introduction

Common Crawl Support Library

Overview

This library provides support code for consuming the Common Crawl corpus raw crawl data (ARC files) stored on S3. More information about how to access the corpus can be found at https://commoncrawl.atlassian.net/wiki/display/CRWL/About+the+Data+Set.

You can take two primary routes to consuming the ARC File content:

(1) You can run a Hadoop cluster on EC2, or use EMR, to run a Hadoop job. In this case, you can use the ARCFileInputFormat to drive data to your mappers/reducers. There are two versions of the InputFormat: one written against the deprecated mapred API, located in org.commoncrawl.hadoop.io.mapred, and one written for the mapreduce API, located in org.commoncrawl.hadoop.io.mapreduce.
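For example, here is a minimal sketch of wiring the mapreduce-package ARCFileInputFormat into a job. The input-format class and package names come from the description above; the use of the default identity mapper, the SequenceFile output, and the s3n:// input path are illustrative assumptions rather than the library's prescribed setup.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.BytesWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
  import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
  import org.commoncrawl.hadoop.io.mapreduce.ARCFileInputFormat;

  public class ArcJobDriver {
    public static void main(String[] args) throws Exception {
      Job job = new Job(new Configuration(), "arc-file-example");
      job.setJarByClass(ArcJobDriver.class);
      // ARCFileInputFormat emits one record per crawled document:
      // key = UTF-8 URL (Text), value = raw content incl. HTTP headers (BytesWritable)
      job.setInputFormatClass(ARCFileInputFormat.class);
      // No custom mapper set here, so the default identity mapper passes records through;
      // plug in your own Mapper<Text, BytesWritable, ...> for real work.
      job.setNumReduceTasks(0);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(BytesWritable.class);
      job.setOutputFormatClass(SequenceFileOutputFormat.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an s3n:// segment path
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }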

(2) You can decode data directly by feeding an InputStream to the ARCFileReader class located in the org.commoncrawl.util.shared package.
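As a rough sketch of this direct route (the InputStream constructor matches the one quoted in the issue further down this page; the reader's record-iteration methods are not shown here and should be checked against the class itself):

  import java.io.FileInputStream;
  import java.io.InputStream;
  import org.commoncrawl.util.shared.ARCFileReader;

  public class DirectDecodeExample {
    public static void main(String[] args) throws Exception {
      // args[0] is a local copy of a .arc.gz file; the reader handles decompression itself
      try (InputStream in = new FileInputStream(args[0])) {
        ARCFileReader reader = new ARCFileReader(in);
        // ... iterate over (url, content) records using the reader's API ...
      }
    }
  }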

Both routes (the InputFormat and the direct ARCFileReader route) produce a tuple consisting of a UTF-8 encoded URL (Text) and the raw content downloaded by the crawler (BytesWritable), including the HTTP headers. The HTTP headers are UTF-8 encoded, and the headers and content are delimited by a consecutive pair of CRLF tokens. The content itself, when it has a text MIME type, is encoded using the source document's text encoding.
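For illustration, a mapper along these lines could separate the headers from the body (a minimal sketch assuming the Text/BytesWritable pairs described above; the output types and what you do with the body are your own choice):

  import java.io.IOException;
  import java.nio.charset.StandardCharsets;
  import org.apache.hadoop.io.BytesWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  public class ArcRecordMapper extends Mapper<Text, BytesWritable, Text, Text> {
    @Override
    protected void map(Text url, BytesWritable value, Context context)
        throws IOException, InterruptedException {
      byte[] bytes = value.getBytes();   // backing array; only the first getLength() bytes are valid
      int length = value.getLength();
      int headerEnd = indexOfDoubleCRLF(bytes, length);
      if (headerEnd < 0) {
        return;                          // no CRLFCRLF delimiter found; skip the record
      }
      String headers = new String(bytes, 0, headerEnd, StandardCharsets.UTF_8);
      int bodyBytes = length - (headerEnd + 4);
      context.write(url, new Text("headerLines=" + headers.split("\r\n").length
          + " bodyBytes=" + bodyBytes));
    }

    // offset of the first CRLFCRLF sequence within the first 'length' bytes, or -1
    private static int indexOfDoubleCRLF(byte[] b, int length) {
      for (int i = 0; i + 3 < length; i++) {
        if (b[i] == '\r' && b[i + 1] == '\n' && b[i + 2] == '\r' && b[i + 3] == '\n') {
          return i;
        }
      }
      return -1;
    }
  }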

Build Notes:

  1. You need to define JAVA_HOME, and make sure you have Ant & Maven installed.
  2. Set hadoop.path (in build.properties) to point to your Hadoop distribution.
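
For reference, a minimal build.properties might look like the following. The property name comes from the note above; the path value is a placeholder, and the assumption is that the default Ant target then produces commoncrawl.jar.

  # build.properties
  # Point hadoop.path at the root of your local Hadoop distribution (example path only).
  hadoop.path=/opt/hadoop-1.0.3

With JAVA_HOME set and Ant and Maven on the PATH, running ant from the source root should then build the jar used in the sample below.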

Sample Usage:

Once commoncrawl.jar has been built, you can validate that the ARCFileReader works for you by executing the following sample command line from the root of the commoncrawl source directory:

./bin/launcher.sh org.commoncrawl.util.shared.ARCFileReader --awsAccessKey <ACCESS KEY> --awsSecret <SECRET> --file s3n://aws-publicdatasets/common-crawl/parse-output/segment/1341690164240/1341819847375_4319.arc.gz

commoncrawl's People

Contributors

ahadrana, bryant1410, matpalm, namin, sebastian-nagel

commoncrawl's Issues

Index for WET files?

Hi, hope I am posting this question in the right place...

I found the .WARC format domain index at http://index.commoncrawl.org/CC-MAIN-2016-18//
I wonder if there is any indexing for .WET format files?

If not, is there any way I could convert a WARC object address to the corresponding WET object address?
For example, if I have:
s3://commoncrawl/crawl-data/CC-MAIN-2016-18/segments/1461860125175.9/warc/CC-MAIN-20160428161525-00221-ip-10-239-7-51.ec2.internal.warc.gz
What would be the corresponding .WET file?

Thx...
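
For what it's worth, the commonly documented convention (an assumption here, not confirmed in this thread) is that the WET file sits beside the WARC file, with /warc/ replaced by /wet/ and the .warc.gz suffix replaced by .warc.wet.gz. A hypothetical helper illustrating that mapping:

  // Illustrative only; verify against the actual crawl segment listing before relying on it.
  public class WarcToWet {
    static String toWetPath(String warcPath) {
      return warcPath.replace("/warc/", "/wet/").replace(".warc.gz", ".warc.wet.gz");
    }

    public static void main(String[] args) {
      System.out.println(toWetPath(
          "s3://commoncrawl/crawl-data/CC-MAIN-2016-18/segments/1461860125175.9/warc/"
          + "CC-MAIN-20160428161525-00221-ip-10-239-7-51.ec2.internal.warc.gz"));
    }
  }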

Build errors and class ArcFileItem

Hi All!

When trying to build the JAR using Ant, I receive the following errors:

compile-core-classes:
[javac] Compiling 167 source files to /root/commoncrawl/commoncrawl-commoncrawl-24052ae/build/classes
[javac] warning: [options] bootstrap class path not set in conjunction with -source 1.5
[javac] /root/commoncrawl/commoncrawl-commoncrawl-24052ae/src/org/commoncrawl/hadoop/io/ARCSplitReader.java:34: error: package org.commoncrawl.protocol.shared does not exist
[javac] import org.commoncrawl.protocol.shared.ArcFileItem;
[javac] ^
[javac] /root/commoncrawl/commoncrawl-commoncrawl-24052ae/src/org/commoncrawl/hadoop/io/ARCSplitReader.java:42: error: cannot find symbol
[javac] public class ARCSplitReader implements RecordReader<Text, ArcFileItem> {
.....
.....
lots of errors related to ArcFileItem.
...

Maybe some files with the ArcFileItem class definition are missing? Using grep, I didn't find any definition of class ArcFileItem in src/*.

Eugene

comment on public ARCFileReader constructor is confusing

On line 102 of commoncrawl/src/main/java/org/commoncrawl/util/shared/ARCFileReader.java, there's a comment that says the constructor is private (it's actually public), and refers to the "factory method above" even though it's the first method in the file.

  /** 
   * constructor is now private. use the factory method above to construct a reader 
   * @param source
   * @throws IOException
   */
  public ARCFileReader(final InputStream source)throws IOException {
    super(new CustomPushbackInputStream(new CountingInputStream(source),
        _blockSize), new Inflater(true), _blockSize);
    readARCHeader();
  }
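
One possible corrected Javadoc (a suggestion only, not the repository's actual fix) would simply describe what the constructor does:

  /**
   * Constructs a reader over a raw ARC file InputStream; decompression is
   * handled internally. Note that this constructor is public.
   *
   * @param source stream positioned at the start of the ARC file
   * @throws IOException if the ARC file header cannot be read
   */
  public ARCFileReader(final InputStream source) throws IOException {
    // body unchanged
  }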

VerifyError

I'm trying Common Crawl w/ Hadoop 0.20.205 and I'm getting the following:

Exception in thread "main" java.lang.VerifyError: (class: org/commoncrawl/hadoop/io/JetS3tARCSource, method: configureImpl signature: (Lorg/apache/hadoop/mapred/JobConf;)V) Incompatible argument to function
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:264)
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:819)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:864)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:890)
at org.commoncrawl.hadoop.io.ARCInputFormat.configure(ARCInputFormat.java:159)
at com.digitalpebble.behemoth.io.commoncrawl.CommonCrawlConverterJob.run(CommonCrawlConverterJob.java:54)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at com.digitalpebble.behemoth.io.commoncrawl.CommonCrawlConverterJob.main(CommonCrawlConverterJob.java:75)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

I've tried clean builds, etc. Anyone have any ideas? AIUI, this error usually comes from version mismatches. I deleted the jets3t 0.6 version in my Hadoop lib directory, but I suspect there is an older version somewhere else in the classpath. Anyone else seen this?
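
As a generic diagnostic for this kind of mismatch (not from this thread; the JetS3t class name below is the usual entry point but worth double-checking), you can print which jar actually supplied the class at runtime:

  import java.security.CodeSource;

  public class WhichJar {
    public static void main(String[] args) throws Exception {
      // Prints the jar the named class was loaded from, to spot duplicate or
      // stale copies on the classpath that commonly cause VerifyError.
      Class<?> c = Class.forName("org.jets3t.service.S3Service");
      CodeSource src = c.getProtectionDomain().getCodeSource();
      System.out.println(src == null ? "(no code source)" : src.getLocation());
    }
  }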
