
warc-hadoop's Introduction

WARC Input and Output Formats for Hadoop

warc-hadoop is a Java library for working with WARC (Web Archive) files in Hadoop. It provides InputFormats for reading and OutputFormats for writing WARC files in MapReduce jobs (supporting both the 'old' org.apache.hadoop.mapred and the 'new' org.apache.hadoop.mapreduce API).

WARC files are used to record the activity of a web crawler. They include both the HTTP requests that were sent to servers and the HTTP responses received (including headers). WARC is an ISO standard, and is used (amongst others) by the Internet Archive and CommonCrawl.

This warc-hadoop library was written in order to explore the CommonCrawl data, a publicly available dump of billions of web pages. The data is made available for free as a public dataset on AWS. If you want to process it, you only need to pay for the compute capacity to process it on AWS, or for the network bandwidth to download it.

Using warc-hadoop

Add the following Maven dependency to your project:

<dependency>
    <groupId>com.martinkl.warc</groupId>
    <artifactId>warc-hadoop</artifactId>
    <version>0.1.0</version>
</dependency>

Now you can import either com.martinkl.warc.mapred.WARCInputFormat or com.martinkl.warc.mapreduce.WARCInputFormat into your Hadoop job, depending on which version of the API you are using. Example usage:

JobConf job = new JobConf(conf, CommonCrawlTest.class);

FileInputFormat.addInputPath(job, new Path("/path/to/my/input"));
FileOutputFormat.setOutputPath(job, new Path("/path/for/my/output"));
FileOutputFormat.setCompressOutput(job, true);

job.setInputFormat(WARCInputFormat.class);
job.setOutputFormat(WARCOutputFormat.class);
job.setOutputKeyClass(NullWritable.class);
job.setOutputValueClass(WARCWritable.class);
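The snippet above uses the 'old' org.apache.hadoop.mapred API. For the 'new' org.apache.hadoop.mapreduce API, the equivalent setup might look like the following sketch (the job name and CommonCrawlTest driver class are illustrative, carried over from the example above):

```java
// Sketch: equivalent job setup using the 'new' org.apache.hadoop.mapreduce API.
// Uses com.martinkl.warc.mapreduce.WARCInputFormat / WARCOutputFormat, and the
// mapreduce.lib variants of FileInputFormat / FileOutputFormat.
Job job = Job.getInstance(conf, "common-crawl-test");
job.setJarByClass(CommonCrawlTest.class);

FileInputFormat.addInputPath(job, new Path("/path/to/my/input"));
FileOutputFormat.setOutputPath(job, new Path("/path/for/my/output"));
FileOutputFormat.setCompressOutput(job, true);

job.setInputFormatClass(WARCInputFormat.class);
job.setOutputFormatClass(WARCOutputFormat.class);
job.setOutputKeyClass(NullWritable.class);
job.setOutputValueClass(WARCWritable.class);
```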

Example of a mapper that emits server responses, using the URL as the key:

public static class MyMapper extends MapReduceBase
        implements Mapper<LongWritable, WARCWritable, Text, WARCWritable> {

    public void map(LongWritable key, WARCWritable value, OutputCollector<Text, WARCWritable> collector,
                    Reporter reporter) throws IOException {
        String recordType = value.getRecord().getHeader().getRecordType();
        String targetURL  = value.getRecord().getHeader().getTargetURI();

        if (recordType.equals("response") && targetURL != null) {
            collector.collect(new Text(targetURL), value);
        }
    }
}

File format parsing

A WARC file consists of a flat sequence of records. Each record may be an HTTP request (recordType = "request"), a response (recordType = "response"), or one of various other types, including metadata. When reading from a WARC file, the records are given to the mapper one at a time, so a request and its corresponding response appear in two separate calls of the map method.

This library currently doesn't perform any parsing of the data inside records, such as the HTTP headers or the HTML body. You can simply read the server's response as an array of bytes. Additional parsing functionality may be added in future versions.
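For instance, to split HTTP headers from the body in a response record, you can scan the raw bytes for the CR LF CR LF header terminator yourself. This is a minimal sketch under the assumption that the record's raw content is available as a byte[]; HttpPayloadSplitter is a hypothetical helper, not part of the library:

```java
import java.util.Arrays;

// Sketch: locate the end of the HTTP header block (CR LF CR LF) in a
// response record's raw bytes, so the body can be sliced out.
// This is illustrative helper code, not part of warc-hadoop.
public class HttpPayloadSplitter {

    /** Returns the index just past the first CR LF CR LF, or -1 if not found. */
    public static int findHeaderEnd(byte[] content) {
        for (int i = 0; i + 3 < content.length; i++) {
            if (content[i] == '\r' && content[i + 1] == '\n'
                    && content[i + 2] == '\r' && content[i + 3] == '\n') {
                return i + 4;
            }
        }
        return -1;
    }

    /** Returns the HTTP body bytes, or an empty array if no terminator exists. */
    public static byte[] extractBody(byte[] content) {
        int bodyStart = findHeaderEnd(content);
        if (bodyStart < 0) {
            return new byte[0]; // no header terminator found
        }
        return Arrays.copyOfRange(content, bodyStart, content.length);
    }
}
```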

WARC files are typically gzip-compressed. Gzip files are not splittable by Hadoop (i.e. an entire file must be processed sequentially, it's not possible to start reading in the middle of a file) so projects like CommonCrawl typically aim for a maximum file size of 1GB (compressed). If you're only doing basic parsing, a file of that size takes less than a minute to process.

When writing WARC files, this library automatically splits output files into gzipped segments of approximately 1GB. You can customize the segment size using the configuration key warc.output.segment.size (the value is the target segment size in bytes).
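For example, to target roughly 512 MB segments (a sketch, using the configuration key named above):

```java
// Sketch: set the target (compressed) output segment size to ~512 MB.
// conf is the job's Hadoop Configuration (or JobConf) instance.
conf.setLong("warc.output.segment.size", 512L * 1024 * 1024);
```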

Documentation

Meta

(c) 2014 Martin Kleppmann. MIT License.

Please submit pull requests to the GitHub project.

warc-hadoop's People

Contributors

ept

warc-hadoop's Issues

Can this be installed using sbt?

Hi, fairly new to Java, Scala, Hadoop, and Spark. I've been following some Scala Spark tutorials, and I'm wondering if this can be installed using sbt.
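Since warc-hadoop is published as a plain Maven artifact, it should be resolvable from sbt with a standard dependency declaration (a sketch, assuming the artifact is resolvable from your configured Maven repositories):

```scala
// build.sbt: warc-hadoop is a Java artifact, so use % rather than %%
libraryDependencies += "com.martinkl.warc" % "warc-hadoop" % "0.1.0"
```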

Thanks in advance.

WARCRecord Exceeded maximum line length

I am getting the following exception

Error: java.lang.IllegalStateException: Exceeded maximum line length
	at com.martinkl.warc.WARCRecord.readLine(WARCRecord.java:82)
	at com.martinkl.warc.WARCRecord.readHeader(WARCRecord.java:63)
	at com.martinkl.warc.WARCRecord.<init>(WARCRecord.java:47)
	at com.martinkl.warc.WARCFileReader.read(WARCFileReader.java:54)

when processing the WAT files from CC

[s3://commoncrawl/crawl-data/CC-MAIN-2017-04/segments/1484560279410.32/wat/]

I can see that the maximum line length is set in

https://github.com/ept/warc-hadoop/blob/master/src/main/java/com/martinkl/warc/WARCRecord.java#L32

Is that limit taken from the WARC specification?

WARC Info record

It would be great to be able to generate a WARC info record at the beginning of each file. The content could simply be passed as a byte array, with the user code being responsible for ensuring that the byte array is a valid warcinfo record.

Any thoughts on this?

Expected final separator CR LF CR LF, but got: 13 10 87 65

Got this error while the content of a revisit entry was being read. The problem is probably a missing empty line that the library expected: 13 10 87 65 corresponds to <CR><LF>WA. Usually there are 3 empty lines, i.e. 4 \r\n sequences. Maybe the problem is on my side, because I was using a Heritrix version compiled from source (master).

Here is the entry from warc file:

WARC/1.0
WARC-Type: revisit
WARC-Target-URI: http://parstipru.lv/forums/rss/forum/testa-ieraksts/topics
WARC-Date: 2014-03-10T02:17:25Z
WARC-Payload-Digest: sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
WARC-IP-Address: 159.148.127.206
WARC-Profile: http://netpreserve.org/warc/1.0/revisit/server-not-modified
WARC-Etag: "d41d8cd98f00b204e9800998ecf8427e"
WARC-Truncated: length
WARC-Record-ID: <urn:uuid:a3d332c0-30ef-4f9d-a622-e6d3af511ec9>
Content-Length: 0


WARC/1.0
WARC-Type: request
...

Here is the stack trace:

java.lang.Exception: java.lang.IllegalStateException: Expected final separator CR LF CR LF, but got: 13 10 87 65
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:400)
Caused by: java.lang.IllegalStateException: Expected final separator CR LF CR LF, but got: 13 10 87 65
    at com.martinkl.warc.WARCRecord.readSeparator(WARCRecord.java:101)
    at com.martinkl.warc.WARCRecord.<init>(WARCRecord.java:50)
    at com.martinkl.warc.WARCFileReader.read(WARCFileReader.java:54)
    at com.martinkl.warc.mapred.WARCInputFormat$WARCReader.next(WARCInputFormat.java:72)
    at com.martinkl.warc.mapred.WARCInputFormat$WARCReader.next(WARCInputFormat.java:52)

Thanks for the great warc-hadoop library!
