
warc-hadoop's Introduction

WARC Input and Output Formats for Hadoop

warc-hadoop is a Java library for working with WARC (Web Archive) files in Hadoop. It provides InputFormats for reading and OutputFormats for writing WARC files in MapReduce jobs (supporting both the 'old' org.apache.hadoop.mapred and the 'new' org.apache.hadoop.mapreduce API).

WARC files are used to record the activity of a web crawler. They include both the HTTP requests that were sent to servers and the HTTP responses received (including headers). WARC is an ISO standard, and is used (amongst others) by the Internet Archive and CommonCrawl.

This warc-hadoop library was written in order to explore the CommonCrawl data, a publicly available dump of billions of web pages. The data is made available for free as a public dataset on AWS. If you want to process it, you only need to pay for the compute capacity to process it on AWS, or for the network bandwidth to download it.

Using warc-hadoop

Add the following Maven dependency to your project:

<dependency>
    <groupId>com.martinkl.warc</groupId>
    <artifactId>warc-hadoop</artifactId>
    <version>0.1.0</version>
</dependency>

Now you can import either com.martinkl.warc.mapred.WARCInputFormat or com.martinkl.warc.mapreduce.WARCInputFormat into your Hadoop job, depending on which version of the API you are using. Example usage:

JobConf job = new JobConf(conf, CommonCrawlTest.class);

FileInputFormat.addInputPath(job, new Path("/path/to/my/input"));
FileOutputFormat.setOutputPath(job, new Path("/path/for/my/output"));
FileOutputFormat.setCompressOutput(job, true);

job.setInputFormat(WARCInputFormat.class);
job.setOutputFormat(WARCOutputFormat.class);
job.setOutputKeyClass(NullWritable.class);
job.setOutputValueClass(WARCWritable.class);
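The snippet above uses the 'old' org.apache.hadoop.mapred API. For the 'new' org.apache.hadoop.mapreduce API, the equivalent setup might look like the following sketch (the job name and CommonCrawlTest driver class are illustrative, carried over from the example above):

```java
// Sketch: equivalent job setup using the 'new' org.apache.hadoop.mapreduce API.
// Uses com.martinkl.warc.mapreduce.WARCInputFormat / WARCOutputFormat, and the
// mapreduce.lib variants of FileInputFormat / FileOutputFormat.
Job job = Job.getInstance(conf, "common-crawl-test");
job.setJarByClass(CommonCrawlTest.class);

FileInputFormat.addInputPath(job, new Path("/path/to/my/input"));
FileOutputFormat.setOutputPath(job, new Path("/path/for/my/output"));
FileOutputFormat.setCompressOutput(job, true);

job.setInputFormatClass(WARCInputFormat.class);
job.setOutputFormatClass(WARCOutputFormat.class);
job.setOutputKeyClass(NullWritable.class);
job.setOutputValueClass(WARCWritable.class);
```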

Example of a mapper that emits server responses, using the URL as the key:

public static class MyMapper extends MapReduceBase
        implements Mapper<LongWritable, WARCWritable, Text, WARCWritable> {

    public void map(LongWritable key, WARCWritable value, OutputCollector<Text, WARCWritable> collector,
                    Reporter reporter) throws IOException {
        String recordType = value.getRecord().getHeader().getRecordType();
        String targetURL  = value.getRecord().getHeader().getTargetURI();

        if (recordType.equals("response") && targetURL != null) {
            collector.collect(new Text(targetURL), value);
        }
    }
}

File format parsing

A WARC file consists of a flat sequence of records. Each record may be an HTTP request (recordType = "request"), a response (recordType = "response"), or one of various other types, including metadata. When reading from a WARC file, the records are given to the mapper one at a time, so a request and its corresponding response appear in two separate calls of the map method.

This library currently doesn't perform any parsing of the data inside records, such as the HTTP headers or the HTML body. You can simply read the server's response as an array of bytes. Additional parsing functionality may be added in future versions.
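For instance, to split HTTP headers from the body in a response record, you can scan the raw bytes for the CR LF CR LF header terminator yourself. This is a minimal sketch under the assumption that the record's raw content is available as a byte[]; HttpPayloadSplitter is a hypothetical helper, not part of the library:

```java
import java.util.Arrays;

// Sketch: locate the end of the HTTP header block (CR LF CR LF) in a
// response record's raw bytes, so the body can be sliced out.
// This is illustrative helper code, not part of warc-hadoop.
public class HttpPayloadSplitter {

    /** Returns the index just past the first CR LF CR LF, or -1 if not found. */
    public static int findHeaderEnd(byte[] content) {
        for (int i = 0; i + 3 < content.length; i++) {
            if (content[i] == '\r' && content[i + 1] == '\n'
                    && content[i + 2] == '\r' && content[i + 3] == '\n') {
                return i + 4;
            }
        }
        return -1;
    }

    /** Returns the HTTP body bytes, or an empty array if no terminator exists. */
    public static byte[] extractBody(byte[] content) {
        int bodyStart = findHeaderEnd(content);
        if (bodyStart < 0) {
            return new byte[0]; // no header terminator found
        }
        return Arrays.copyOfRange(content, bodyStart, content.length);
    }
}
```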

WARC files are typically gzip-compressed. Gzip files are not splittable by Hadoop (i.e. an entire file must be processed sequentially, it's not possible to start reading in the middle of a file) so projects like CommonCrawl typically aim for a maximum file size of 1GB (compressed). If you're only doing basic parsing, a file of that size takes less than a minute to process.

When writing WARC files, this library automatically splits output files into gzipped segments of approximately 1GB. You can customize the segment size using the configuration key warc.output.segment.size (the value is the target segment size in bytes).
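For example, to target roughly 512 MB segments (a sketch, using the configuration key named above):

```java
// Sketch: set the target (compressed) output segment size to ~512 MB.
// conf is the job's Hadoop Configuration (or JobConf) instance.
conf.setLong("warc.output.segment.size", 512L * 1024 * 1024);
```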

Documentation

Meta

(c) 2014 Martin Kleppmann. MIT License.

Please submit pull requests to the GitHub project.

warc-hadoop's People

Contributors

ept

warc-hadoop's Issues

Can this be installed using sbt?

Hi, fairly new to Java, Scala, Hadoop, and Spark. I've been following some Scala Spark tutorials, and I'm wondering if this can be installed using sbt.
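Since warc-hadoop is published as a plain Maven artifact, it should be resolvable from sbt with a standard dependency declaration (a sketch, assuming the artifact is resolvable from your configured Maven repositories):

```scala
// build.sbt: warc-hadoop is a Java artifact, so use % rather than %%
libraryDependencies += "com.martinkl.warc" % "warc-hadoop" % "0.1.0"
```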

Thanks in advance.

WARCRecord Exceeded maximum line length

I am getting the following exception

Error: java.lang.IllegalStateException: Exceeded maximum line length
	at com.martinkl.warc.WARCRecord.readLine(WARCRecord.java:82)
	at com.martinkl.warc.WARCRecord.readHeader(WARCRecord.java:63)
	at com.martinkl.warc.WARCRecord.<init>(WARCRecord.java:47)
	at com.martinkl.warc.WARCFileReader.read(WARCFileReader.java:54)

when processing the WAT files from CC

[s3://commoncrawl/crawl-data/CC-MAIN-2017-04/segments/1484560279410.32/wat/]

I can see that the maximum line length is set in

https://github.com/ept/warc-hadoop/blob/master/src/main/java/com/martinkl/warc/WARCRecord.java#L32

Is that limit taken from the WARC specification?

WARC Info record

It would be great to be able to generate a WARC info record at the beginning of each file. The content could simply be passed as a byte array, with the user code being responsible for ensuring that the byte array is a valid warcinfo record.

Any thoughts on this?

Expected final separator CR LF CR LF, but got: 13 10 87 65

Got this error while the content of a revisit entry was being read. The problem is probably a missing empty line that the library expected: 13 10 87 65 corresponds to <CR><LF>WA. Usually there are 3 empty lines, i.e. 4 \r\n sequences. Maybe the problem is on my side, because I was using a Heritrix version compiled from source (master).

Here is the entry from warc file:

WARC/1.0
WARC-Type: revisit
WARC-Target-URI: http://parstipru.lv/forums/rss/forum/testa-ieraksts/topics
WARC-Date: 2014-03-10T02:17:25Z
WARC-Payload-Digest: sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
WARC-IP-Address: 159.148.127.206
WARC-Profile: http://netpreserve.org/warc/1.0/revisit/server-not-modified
WARC-Etag: "d41d8cd98f00b204e9800998ecf8427e"
WARC-Truncated: length
WARC-Record-ID: <urn:uuid:a3d332c0-30ef-4f9d-a622-e6d3af511ec9>
Content-Length: 0


WARC/1.0
WARC-Type: request
...

Here is the stack trace:

java.lang.Exception: java.lang.IllegalStateException: Expected final separator CR LF CR LF, but got: 13 10 87 65
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:400)
Caused by: java.lang.IllegalStateException: Expected final separator CR LF CR LF, but got: 13 10 87 65
    at com.martinkl.warc.WARCRecord.readSeparator(WARCRecord.java:101)
    at com.martinkl.warc.WARCRecord.<init>(WARCRecord.java:50)
    at com.martinkl.warc.WARCFileReader.read(WARCFileReader.java:54)
    at com.martinkl.warc.mapred.WARCInputFormat$WARCReader.next(WARCInputFormat.java:72)
    at com.martinkl.warc.mapred.WARCInputFormat$WARCReader.next(WARCInputFormat.java:52)

Thanks for the great warc-hadoop library!
