
jwarc

A Java library for reading and writing WARC files. This library includes a high-level API modeling the standard record types as individual classes with typed accessors. The API is extensible and you can register extension record types and accessors for extension header fields.

try (WarcReader reader = new WarcReader(FileChannel.open(Paths.get("example.warc")))) {
    for (WarcRecord record : reader) {
        if (record instanceof WarcResponse && record.contentType().base().equals(MediaType.HTTP)) {
            WarcResponse response = (WarcResponse) record;
            System.out.println(response.http().status() + " " + response.target());
        }
    }
}

It uses a finite state machine parser generated from a strict grammar using Ragel. There is an optional lenient mode which can handle some forms of non-compliant WARC records. ARC and HTTP parsing is lenient by default.

Gzipped records are automatically decompressed. The parser interprets ARC/1.1 records as if they were a WARC dialect and populates the appropriate WARC headers.

All I/O is performed using NIO and an effort is made to minimize data copies and share buffers whenever feasible. Direct buffers and even memory-mapped files can be used, but only with uncompressed WARCs until Inflater supports them (coming in JDK 11).

Getting it

To use as a library, add jwarc as a dependency from Maven Central.
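For example, with Maven (the version element is left as a placeholder; use the latest release from Maven Central):

```xml
<dependency>
    <groupId>org.netpreserve</groupId>
    <artifactId>jwarc</artifactId>
    <version><!-- latest version --></version>
</dependency>
```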

To use as a command-line tool, install Java 8 or later, download the latest release jar and run it using:

java -jar jwarc-{version}.jar

If you would prefer to build it from source, install JDK 8+ and Maven and then run:

mvn package

Examples

Saving a remote resource

try (WarcWriter writer = new WarcWriter(System.out)) {
    writer.fetch(URI.create("http://example.org/"));
}

Writing records

// write a warcinfo record
// date and record id will be populated automatically if unset
writer.write(new Warcinfo.Builder()
    .fields("software", "my-cool-crawler/1.0",
            "robots", "obey")
    .build());

// we can also supply a specific date
Instant captureDate = Instant.now();

// write a request but keep a copy of it to reference later
WarcRequest request = new WarcRequest.Builder()
    .date(captureDate)
    .target(uri)
    .contentType("application/http")
    .body(bodyStream, bodyLength)
    .build();
writer.write(request);

// write a response referencing the request
WarcResponse response = new WarcResponse.Builder()
    .date(captureDate)
    .target(uri)
    .contentType("application/http")
    .body("HTTP/1.0 200 OK\r\n...".getBytes())
    .concurrentTo(request.id())
    .build();
writer.write(response);

Filter expressions

The WarcFilter class provides a simple filter expression language for matching WARC records. For example here's a moderately complex filter which matches all records that are not image resources or image responses:

 !((warc-type == "resource" && content-type =~ "image/.*") || 
   (warc-type == "response" && http:content-type =~ "image/.*")) 

WarcFilter implements Predicate<WarcRecord> and can be used conveniently with streams of records:

long errorCount = warcReader.records().filter(WarcFilter.compile(":status >= 400")).count();

Their real power though is as a building block for user-supplied options.
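Because a compiled filter is an ordinary Predicate, it also composes with the usual default methods. A minimal sketch (the file name example.warc is hypothetical):

```java
WarcFilter images = WarcFilter.compile("http:content-type =~ \"image/.*\"");
try (WarcReader reader = new WarcReader(Paths.get("example.warc"))) {
    // Predicate.negate() gives the complement for free
    long nonImages = reader.records().filter(images.negate()).count();
    System.out.println(nonImages + " non-image records");
}
```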

Command-line tools

jwarc also includes a set of command-line tools which serve as examples. Note that many of the tools are lightweight demonstrations and may lack important options and features.

Capture a URL (without subresources):

java -jar jwarc.jar fetch http://example.org/ > example.warc

Create a CDX file:

java -jar jwarc.jar cdx example.warc > records.cdx

Run a replay proxy and web server:

export PORT=8080
java -jar jwarc.jar serve example.warc

Replay each page in a WARC, use headless Chrome to render a screenshot, and save it as a resource record:

export BROWSER=/opt/google/chrome/chrome
java -jar jwarc.jar screenshot example.warc > screenshots.warc

Run a proxy server which records requests and responses. This will generate self-signed SSL certificates, so you will need to turn off TLS verification in the client. For Chrome/Chromium use the --ignore-certificate-errors command-line option.

export PORT=8080
java -jar jwarc.jar recorder > example.warc

chromium --proxy-server=http://localhost:8080 --ignore-certificate-errors

Record a command that obeys the http(s)_proxy and CURL_CA_BUNDLE environment variables:

java -jar jwarc.jar recorder -o example.warc curl http://example.org/

Capture a page by recording headless Chrome:

export BROWSER=/opt/google/chrome/chrome
java -jar jwarc.jar record > example.warc

Create a new file containing only html responses with status 200:

java -jar jwarc.jar filter ':status == 200 && http:content-type =~ "text/html(;.*)?"' example.warc > pages.warc 

API Quick Reference

See the javadoc for more details.

              new WarcReader(stream|path|channel);                // opens a WARC file for reading
                  reader.close();                                 // closes the underlying channel
(WarcCompression) reader.compression();                           // type of compression: NONE or GZIP
       (Iterator) reader.iterator();                              // an iterator over the records
     (WarcRecord) reader.next();                                  // reads the next record
                  reader.registerType("myrecord", MyRecord::new); // registers a new record type
                  reader.setLenient(true);                        // enables lenient parsing mode
                new WarcWriter(channel, NONE|GZIP);    // opens a WARC file for writing
                    writer.fetch(uri);                 // downloads a resource recording the request and response
             (long) writer.position();                 // byte position the next record will be written to
                    writer.write(record);              // adds a record to the WARC file

Record types

Message
  HttpMessage
    HttpRequest
    HttpResponse
  WarcRecord
    Warcinfo            (warcinfo)
    WarcTargetRecord
      WarcContinuation  (continuation)
      WarcConversion    (conversion)
      WarcCaptureRecord
        WarcMetadata    (metadata)
        WarcRequest     (request)
        WarcResource    (resource)
        WarcResponse    (response)
        WarcRevisit     (revisit)

The basic building block of both the HTTP protocol and the WARC file format is a message consisting of a set of named header fields and a body. Header field names are case-insensitive and may have multiple values.

             (BodyChannel) message.body();                     // the message body as a ReadableByteChannel
                    (long) message.body().position();          // the next byte position to read from
                     (int) message.body().read(byteBuffer);    // reads a sequence of bytes from the body
                    (long) message.body().size();              // the length in bytes of the body
             (InputStream) message.body().stream();            // views the body as an InputStream
                  (String) message.contentType();              // the media type of the body
                 (Headers) message.headers();                  // the header fields
            (List<String>) message.headers().all("Cookie");    // all values of a header
                 (boolean) message.headers().contains("TE", "deflate"); // tests if a value is present
        (Optional<String>) message.headers().first("Cookie");  // the first value of a header
(Map<String,List<String>>) message.headers().map();            // views the header fields as a map
        (Optional<String>) message.headers().sole("Location"); // throws if header has multiple values
         (ProtocolVersion) message.version();                  // the protocol version (e.g. HTTP/1.0 or WARC/1.1)
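As a quick sketch of the header accessors above (the file name example.warc is hypothetical):

```java
try (WarcReader reader = new WarcReader(Paths.get("example.warc"))) {
    for (WarcRecord record : reader) {
        // field names are case-insensitive, so "warc-type" matches WARC-Type
        String type = record.headers().first("warc-type").orElse("?");
        System.out.println(type + ": " + record.headers().map().size() + " header fields");
    }
}
```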

Methods available on all WARC records:

  (Optional<Digest>) record.blockDigest();   // value of hash function applied to bytes of body
           (Instant) record.date();          // instant that data capture began
               (URI) record.id();            // globally unique record identifier
    (Optional<Long>) record.segmentNumber(); // position of this record in a segmented series
  (TruncationReason) record.truncated();     // reason record was truncated; or else NOT_TRUNCATED
            (String) record.type();          // "warcinfo", "request", "response" etc

Warcinfo (warcinfo)

             (Headers) warcinfo.fields();   // parses the body as application/warc-fields
    (Optional<String>) warcinfo.filename(); // filename of the containing WARC

WarcTargetRecord (abstract)

Methods available on all WARC records except Warcinfo:

     (Optional<String>) record.identifiedPayloadType(); // media type of payload identified by an independent check
               (String) record.target();                // captured URI as an unparsed string
                  (URI) record.targetURI();             // captured URI
(Optional<WarcPayload>) record.payload();               // payload
     (Optional<Digest>) record.payloadDigest();         // value of hash function applied to bytes of the payload
        (Optional<URI>) record.warcinfoID();            // ID of warcinfo record when stored separately

WarcContinuation (continuation)

              (String) continuation.segmentOriginId();    // record ID of first segment
    (Optional<String>) continuation.segmentTotalLength(); // (last only) total length of all segments

WarcConversion (conversion)

       (Optional<URI>) conversion.refersTo();    // ID of record this one was converted from

WarcCaptureRecord (abstract)

Methods available on metadata, request, resource and response records:

          (List<URI>) capture.concurrentTo();   // other record IDs from the same capture event
 (Optional<InetAddr>) capture.ipAddress();      // IP address of the server

WarcMetadata (metadata)

            (Headers) metadata.fields();        // parses the body as application/warc-fields

WarcRequest (request)

        (HttpRequest) request.http();           // parses the body as an HTTP request
        (BodyChannel) request.http().body();    // HTTP request body
            (Headers) request.http().headers(); // HTTP request headers

WarcResource (resource)

No methods are specific to resource records; see WarcRecord, WarcTargetRecord and WarcCaptureRecord above.

WarcResponse (response)

       (HttpResponse) response.http();           // parses the body as an HTTP response
        (BodyChannel) response.http().body();    // HTTP response body
            (Headers) response.http().headers(); // HTTP response headers

WarcRevisit (revisit)

       (HttpResponse) revisit.http();              // parses the body as an HTTP response
            (Headers) revisit.http().headers();    // HTTP response headers (note: revisits never have a payload!)
                (URI) revisit.profile();           // revisit profile (not-modified or identical-payload)
                (URI) revisit.refersTo();          // ID of the record this is a duplicate of
                (URI) revisit.refersToTargetURI(); // targetURI of the referred-to record
            (Instant) revisit.refersToDate();      // date of the referred-to record


Comparison

| Criteria            | jwarc      | JWAT           | webarchive-commons |
|---------------------|------------|----------------|--------------------|
| License             | Apache 2   | Apache 2       | Apache 2           |
| Parser based on     | Ragel FSM  | Hand-rolled FSM | Apache HTTP       |
| Push parsing        | Low level  | ✘              | ✘                  |
| Folded headers †    | ✔          | ✔              | ✔                  |
| Encoded words †     | ✘          | ✘ (disabled)   | ✘                  |
| Validation          | The basics | ✔              | ✘                  |
| Strict parsing ‡    | ✔          | ✘              | ✘                  |
| Lenient parsing     | HTTP only  | ✔              | ✔                  |
| I/O framework       | NIO        | IO             | IO                 |
| Multi-value headers | ✔          | ✔              | ✘                  |
| Record type classes | ✔          | ✘              | ✘                  |
| Typed accessors     | ✔          | ✔              | Some               |
| GZIP detection      | ✔          | ✔              | Filename only      |
| WARC writer         | Barebones  | ✔              | ✔                  |
| ARC reader          | Auto       | Separate API   | Factory            |
| ARC writer          | ✘          | ✔              | ✔                  |
| Speed * (.warc)     | 1x         | ~5x slower     | ~13x slower        |
| Speed * (.warc.gz)  | 1x         | ~1.4x slower   | ~2.8x slower       |

(†) WARC features copied from HTTP that have since been deprecated in HTTP. I'm not aware of any software that writes WARCs using these features and usage of them should probably be avoided. JWAT behaves differently from jwarc and webarchive-commons as it does not trim whitespace on folded lines.

(‡) JWAT and webarchive-commons both accept arbitrary UTF-8 characters in field names. jwarc strictly enforces the grammar rules from the WARC specification, although it does not currently enforce the rules for the values of specific individual fields.

(*) Relative time to scan records after JIT steady state. Only indicative; this needs to be redone with a better benchmark. JWAT was configured with an 8192-byte buffer as with default options it is 27x slower. For comparison, merely decompressing the .warc.gz file with GZIPInputStream is about 0.95x.

See also: Unaffiliated benchmark against other languages

More recent benchmarks against Java libraries

Other WARC libraries

jwarc's People

Contributors

ato, dependabot[bot], robertvanloenhout, sebastian-nagel, thomasegense


jwarc's Issues

ClueWeb09 WARC files fail to parse

The ClueWeb09 dataset WARC files (see sample files) use a single line feed \n as separator between WARC headers. The WarcParser expects \r\n (which would conform to the standard) and fails:

Exception in thread "main" org.netpreserve.jwarc.ParsingException: invalid WARC record at position 9: WARC/0.18<-- HERE -->\nWARC-Type: warcinfo\nWARC-Date: 2009-03-...

See also #25 for a similar issue regarding HttpParser.

Multithreading issue on GzipChannel write header

When multiple WarcWriter instances are used concurrently, operations on
private static final ByteBuffer GZIP_HEADER = ByteBuffer.wrap(GZIP_HEADER_); cause problems.
Some threads write the gzip header and some might not.

I think the issue could be fixed by removing the static modifier from GZIP_HEADER.
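The underlying hazard is that a ByteBuffer carries mutable position/limit state, so one shared static buffer is racy. A hedged stdlib sketch of the suggested direction (not the actual jwarc code; the class name is hypothetical): keep the header bytes shared but give each writer an independent view via duplicate():

```java
import java.nio.ByteBuffer;

public class GzipHeaderView {
    // Fixed gzip header bytes (RFC 1952: magic, deflate method, no flags)
    private static final byte[] GZIP_HEADER_ = {
        0x1f, (byte) 0x8b, 8, 0, 0, 0, 0, 0, 0, 0
    };
    private static final ByteBuffer GZIP_HEADER = ByteBuffer.wrap(GZIP_HEADER_);

    // duplicate() shares the underlying bytes but gives each caller its own
    // position and limit, so concurrent writers no longer race on shared state
    public static ByteBuffer header() {
        return GZIP_HEADER.duplicate();
    }
}
```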

Chunked body parser may read over end of chunk if destination buffer has higher capacity

The optimization to bypass the internal buffer when the destination buffer has a higher capacity than the internal buffer may cause a read over the end of the current chunk.

Reproducible with http_chunked_4.warc.gz and a buffer of 16 kB, e.g.,

ByteBuffer buffer = ByteBuffer.allocate(16384);
while (payload.get().body().read(buffer) > -1);

The chunk has size 16122: the first read stays within the chunk, but the second, bypassed read consumes all input until EOF (end of the WARC record). It must be ensured that nothing more than the content of a single chunk is forwarded to the destination buffer.

Note: if the internal buffer has been bypassed, the error message in line 52 while refilling the internal buffer is wrong/misleading because it uses the outdated internal buffer to show the context. This should also be fixed.
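One way the fix could look (a hedged sketch, not jwarc's actual code; the class name is hypothetical): before a bypass read, temporarily clamp the destination buffer's limit to the bytes remaining in the current chunk:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;

public class ClampedRead {
    // Reads from src into dst, but never more than `remaining` bytes,
    // by temporarily shrinking dst's limit for the duration of the read.
    public static int read(ReadableByteChannel src, ByteBuffer dst, long remaining)
            throws IOException {
        int savedLimit = dst.limit();
        if (dst.remaining() > remaining) {
            dst.limit(dst.position() + (int) remaining);
        }
        try {
            return src.read(dst);
        } finally {
            dst.limit(savedLimit); // restore so the caller sees the full buffer
        }
    }
}
```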

replay proxy doesn't start because the sw.js file is not found

Thanks for the great project.
When I try to serve a WARC I get the following error:

java -jar jwarc.jar serve mywarc.warc
Exception in thread "main" java.nio.file.NoSuchFileException: sw.js
	at org.netpreserve.jwarc.net.WarcServer.resource(WarcServer.java:202)
	at org.netpreserve.jwarc.net.WarcServer.<init>(WarcServer.java:52)
	at org.netpreserve.jwarc.net.WarcServer.<init>(WarcServer.java:45)
	at org.netpreserve.jwarc.tools.ServeTool.main(ServeTool.java:21)
	at org.netpreserve.jwarc.tools.WarcTool.main(WarcTool.java:47)

The files sw.js and inject.js are in resources/org/netpreserve/net/.
Because the resources are resolved relative to WarcServer, they are expected in resources/org/netpreserve/jwarc/net/; either move them there or use the absolute path /org/netpreserve/net/ in WarcServer to resolve the resources.

jwarc version 0.13.1
java 11 and 15

CDX indexer: CDXJ output support

It would be nice to have an option to output in CDXJ format.

Pywb's cdx-indexer uses the command-line option "-j, --cdxj" for that, so it would be nice if we supported the same option names.

CDX indexer fails to parse (webrecorder) WARC file and terminates.

The WARC file has been made with Webrecorder. cdxj-indexer (Python) and warc-indexer (BL) can parse the WARC file.
The error terminates the cdx tool's indexing and writes a partial last line to the CDX file, also producing an invalid CDX file.

Exception from the CDX indexer:
Exception in thread "main" java.lang.IllegalArgumentException: URLDecoder: Incomplete trailing escape (%) pattern
at java.base/java.net.URLDecoder.decode(URLDecoder.java:225)

I think this is the record that breaks it:
`
WARC/1.1
WARC-Record-ID: urn:uuid:eb2250e7-173b-5052-aa6e-1d0a2e78f996
WARC-Page-ID: 6l9q6b7tynm0ww8tezecpkc
WARC-Concurrent-To: urn:uuid:48aa011b-3ea1-505d-85af-49abbc668e66
WARC-Target-URI: https://graph.instagram.com/logging_client_events
WARC-Date: 2023-04-26T09:51:58.263Z
WARC-Type: request
Content-Type: application/http; msgtype=request
WARC-Payload-Digest: sha256:7368b96670b0fdd9525fcc645fa716b687795224d678a242af020113a1f7c97d
WARC-Block-Digest: sha256:db3b291b024653c5404401c321169ef68ff25c01aa6b25e20672a405796c6650
Content-Length: 2908

POST /logging_client_events HTTP/1.1
accept: */*
accept-encoding: gzip, deflate, br
accept-language: da-DK,da;q=0.9,en-US;q=0.8,en;q=0.7
cache-control: no-cache
content-length: 2228
content-type: application/x-www-form-urlencoded
origin: https://www.instagram.com
pragma: no-cache
referer: https://www.instagram.com/
sec-ch-ua: "Chromium";v="112", "Google Chrome";v="112", "Not:A-Brand";v="99"
sec-ch-ua-mobile: ?0
sec-ch-ua-platform: "macOS"
sec-fetch-dest: empty
sec-fetch-mode: cors
sec-fetch-site: same-site
user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36
x-asbd-id: 198387

access_token=936619743392459%7C3cdb3f896252a1db29679cb4554db266&message=%7B%22app_uid%22%3A%2256974635280%22%2C%22app_id%22%3A%22936619743392459%22%2C%22app_ver%22%3A%221.0.0%22%2C%22data%22%3A%5B%7B%22time%22%3A1682502712.891%2C%22name%22%3A%22instagram_web_media_impressions%22%2C%22extra%22%3A%7B%22ig_userid%22%3A56974635280%2C%22pk%22%3A56974635280%2C%22rollout_hash%22%3A%221007379527%22%2C%22frontend_env%22%3A%22C3%22%2C%22app_id%22%3A%22936619743392459%22%2C%22original_referrer%22%3Anull%2C%22original_referrer_domain%22%3A%22%22%2C%22referrer%22%3Anull%2C%22referrer_domain%22%3A%22www.instagram.com%22%2C%22url%22%3A%22%2Fregeringdk%2F%22%2C%22nav_chain%22%3A%22PolarisProfileRoot%3AprofilePage%3A1%3Avia_cold_start%2CPolarisPostModal%3ApostPage%3A6%3AmodalLink%22%2C%22media_id%22%3A%222415285276315581060%22%2C%22media_type%22%3A%22sidecar%22%2C%22owner_id%22%3A%229272148702%22%2C%22surface%22%3A%22profile%22%7D%2C%22obj_type%22%3A%22url%22%2C%22obj_id%22%3A%22%2Fp%2FCGE0Sl-BXqE%2F%22%7D%2C%7B%22time%22%3A1682502712.894%2C%22name%22%3A%22comment_impression%22%2C%22extra%22%3A%7B%22ig_userid%22%3A56974635280%2C%22pk%22%3A56974635280%2C%22rollout_hash%22%3A%221007379527%22%2C%22frontend_env%22%3A%22C3%22%2C%22app_id%22%3A%22936619743392459%22%2C%22ca_pk%22%3A%221549276785%22%2C%22c_pk%22%3A%2218121740952145782%22%2C%22container_module%22%3A%22profilePageModal%22%2C%22deviceid%22%3A%229DCBE01D-1E88-4D43-BBE1-706410D26031%22%2C%22device_model%22%3A%22Chrome+112.0.0.0%22%2C%22device_os%22%3A%22Web%22%2C%22a_pk%22%3A%229272148702%22%2C%22m_pk%22%3A%222415285276315581060%22%2C%22primary_locale%22%3A%22da_DK%22%2C%22isCovered%22%3Afalse%2C%22original_referrer%22%3Anull%2C%22original_referrer_domain%22%3A%22%22%2C%22referrer%22%3Anull%2C%22referrer_domain%22%3A%22www.instagram.com%22%2C%22url%22%3A%22%2Fregeringdk%2F%22%2C%22nav_chain%22%3A%22PolarisProfileRoot%3AprofilePage%3A1%3Avia_cold_start%2CPolarisPostModal%3ApostPage%3A6%3AmodalLink%22%7D%7D%5D%2C%22log_type%22%3A%
22client_event%22%2C%22seq%22%3A7%2C%22session_id%22%3A%22187bcf98552-d8e06e%22%2C%22device_id%22%3A%229DCBE01D-1E88-4D43-BBE1-706410D26031%22%2C%22claims%22%3A%5B%22hmac.AR1ZI1sWeUop6a1CkmjJA6UcWBgb97djMyjg0Sh8zmyVpaLE%22%5D%7D
`

This is the CDX line produced by cdxj-indexer (pywb):
com,instagram,graph)/logging_client_events?__wb_method=post&access_token=936619743392459|3cdb3f896252a1db29679cb4554db266&message={"app_uid":"56974635280","app_id":"936619743392459","app_ver":"1.0.0","data":[{"time":1682502712.891,"name":"instagram_web_media_impressions","extra":{"ig_userid":56974635280,"pk":56974635280,"rollout_hash":"1007379527","frontend_env":"c3","app_id":"936619743392459","original_referrer":null,"original_referrer_domain":"","referrer":null,"referrer_domain":"www.instagram.com","url":"/regeringdk/","nav_chain":"polarisprofileroot:profilepage:1:via_cold_start,polarispostmodal:postpage:6:modallink","media_id":"2415285276315581060","media_type":"sidecar","owner_id":"9272148702","surface":"profile"},"obj_type":"url","obj_id":"/p/cge0sl-bxqe/"},{"time":1682502712.894,"name":"comment_impression","extra":{"ig_userid":56974635280,"pk":56974635280,"rollout_hash":"1007379527","frontend_env":"c3","app_id":"936619743392459","ca_pk":"1549276785","c_pk":"18121740952145782","container_module":"profilepagemodal","deviceid":"9dcbe01d-1e88-4d43-bbe1-706410d26031","device_model":"chrome%20112.0.0.0","device_os":"web","a_pk":"9272148702","m_pk":"2415285276315581060","primary_locale":"da_dk","iscovered":false,"original_referrer":null,"original_referrer_domain":"","referrer":null,"referrer_domain":"www.instagram.com","url":"/regeringdk/","nav_chain":"polarisprofileroot:profilepage:1:via_cold_start,polarispostmodal:postpage:6:modallink"}}],"log_type":"client_event","seq":7,"session_id":"187bcf98552-d8e06e","device_id":"9dcbe01d-1e88-4d43-bbe1-706410d26031","claims":["hmac.ar1zi1sweuop6a1ckmjja6ucwbgb97djmyjg0sh8zmyvpale"]} 20230426095158 https://graph.instagram.com/logging_client_events application/json 200 9b7c9bb91016a0d17171d9a9307591530d2211c64f33104a1b87299a6b386f95 - - 831 6992293 webrec_regering_20230426.warc.gz

And the partial line produced by the cdx tool:
com,instagram,graph)/logging_client_events?__wb_method=post&access_token=936619743392459|3cdb3f896252a1db29679cb4554db266&message={"app_uid":"56974635280","app_id":"936619743392459","app_ver":"1.0.0","data":[{"time":1682502720.792,"name":"instagram_web_media_impressions","extra":{"ig_userid":56974635280,"pk":56974635280,"rollout_hash":"1007379527","frontend_env":"c3","app_id":"936619743392459","original_referrer":null,"original_referrer_domain":"","referrer":null,"referrer_domain":"www.instagram.com","url":"/regeringdk/","nav_chain":"polarisprofileroot:profilepage:1:via_cold_start,polarispostmodal:postpage:7:modallink","media_id":"2408315599488034008","media_type":"video","owner_id":"9272148702","surface":"profile"},"obj_type":"url","obj_id":"/p/cfsdkcmhcty/"},{"time":1682502720.815,"name":"video_should_start","extra":{

Build WarcRevisit with refersTo String targetURI

I'd like to be able to create a WarcRevisit with the builder. However, the target URI can only be set using an object of class URI.
It would be very convenient if I could pass a String.
Converting between URI and String gives some unnecessary headaches. For example, I sometimes get a double URL-encoded value from WarcTargetRecord.targetURI.

I'll add a pull request for this.
#66

Add build instructions

I have been using the .jar releases for testing but would like to manipulate the code and test JARs that I build locally.

It would be useful to me (and likely others) for instructions to be provided in the README (or within the repo) for building the jwarc library/JAR from source. The current README appears to be limited to usage instructions.

Avoid unchecked exceptions caused by malformed HTTP captures

The WARC parser often throws unchecked exceptions (IllegalArgumentException) when the input cannot be parsed or if it violates certain constraints (examples below). These exceptions make it nearly impossible to use jwarc to parse real-world HTTP captures because unchecked exceptions are not declared and in general considered to be unrecoverable. At least, the lenient parser (#25) should ignore malformed input and try to continue. Alternatively, checked exceptions could be used to force the user to handle the errors.

So far, I've run into these two issues:

  1. duplicate parameters in the "Content-Type" HTTP header: text/html;Charset=utf-8;charset=UTF-8. This is a frequent error, see examples in content_type_dupl_param-CC-MAIN-20200525032636-20200525062636-00118.warc.gz. In CdxTool the IllegalArgumentException is caught, but if this is the intended usage, it'd be better to throw a checked exception.
  2. duplicated HTTP header field Transfer-Encoding. So far, I've only seen a duplicated Transfer-Encoding: chunked which could be safely read as one single header, see examples in
    transfer_encoding_duplicated.warc.gz. In theory, the transfer encoding can be multi-valued (Transfer-Encoding: chunked, gzip) and RFC 7230, 3.2.2 states that two single-value header fields (chunked and gzip) are equivalent. But I have not yet seen an example for this.

GunzipChannel fails on payload with uncompressed size exceeding int_max

A gzip-compressed payload with an uncompressed size exceeding 2^31-1 (the maximum value of a 32-bit signed integer) causes the GunzipChannel to fail with the following exception:

$> java -cp target/jwarc-0.13.1-SNAPSHOT.jar org.netpreserve.jwarc.tools.WarcTool extract --payload test-size-int-max-overflow-content-encoding-gzip.warc.gz 975
Exception in thread "main" java.util.zip.ZipException: gzip uncompressed size mismatch
        at org.netpreserve.jwarc.GunzipChannel.readTrailer(GunzipChannel.java:92)
        at org.netpreserve.jwarc.GunzipChannel.read(GunzipChannel.java:70)
        at org.netpreserve.jwarc.tools.ExtractTool.writeBody(ExtractTool.java:81)
        at org.netpreserve.jwarc.tools.ExtractTool.writePayload(ExtractTool.java:70)
        at org.netpreserve.jwarc.tools.ExtractTool.main(ExtractTool.java:156)
        at org.netpreserve.jwarc.tools.WarcTool.main(WarcTool.java:21)

The WARC file test-size-int-max-overflow-content-encoding-gzip.warc.gz (21 kB) contains one record with a payload size of 2^31.
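For context on this class of bug (a hedged sketch, not the actual jwarc code; the class name is hypothetical): RFC 1952 defines the gzip trailer's ISIZE field as the uncompressed length modulo 2^32, so a correct size check masks the running total rather than comparing a 32-bit value against the full long count:

```java
public class GzipIsize {
    // ISIZE in the gzip trailer is the uncompressed size mod 2^32 (RFC 1952),
    // so compare only the low 32 bits of the running byte count.
    public static boolean isizeMatches(long totalUncompressed, int isizeFromTrailer) {
        return (totalUncompressed & 0xffffffffL) == (isizeFromTrailer & 0xffffffffL);
    }
}
```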

Chunked transfer-encoding causes exceptions at end of WARC record

Reading a payload with chunked Transfer-Encoding may result in an exception thrown after the entire chunked body has been consumed.

Exception in thread "main" java.io.EOFException: EOF reached before end of chunked encoding: ...ews:news>\n</url>\n</urlset>\r\n00000000\r\n\r\n<-- HERE -->
Exception in thread "main" org.netpreserve.jwarc.ParsingException: chunked encoding at position 1392: ...\n Total Excuted Time : 0.543\n-->\n\r\n0\r\n\r\n<-- HERE -->\r

WARC files have been recorded using Wget. See #23 for the logging of the current context (position in buffer/stream).

How to parse a non-standard HTTP header and avoid throwing an exception?

Hi, I use this tool to parse Common Crawl data, but it fails.

It hits this exception:

org.netpreserve.jwarc.ParsingException: invalid HTTP message at byte position 374: ...T; path=/\r\nX-UA-Compatible: IE=7\r\nPower <-- HERE -->by: Auto Capri\r\nDate: Sun, 24 May 2020 2...

the data:

HTTP/1.1 200 OK
Cache-Control: private
Pragma: private
Content-Type: text/html; charset=UTF-8
X-Crawler-Content-Encoding: gzip
Server: Microsoft-IIS/8.5
X-Powered-By: PHP/5.3.28
Set-Cookie: bblastvisit=1590360012; expires=Mon, 24-May-2021 22:40:12 GMT; path=/
Set-Cookie: bblastactivity=0; expires=Mon, 24-May-2021 22:40:12 GMT; path=/
X-UA-Compatible: IE=7
Power by: Auto Capri
Date: Sun, 24 May 2020 22:40:12 GMT
X-Crawler-Content-Length: 4855
Content-Length: 13868

related file:
crawl-data/CC-MAIN-2020-24/segments/1590347385193.5/warc/CC-MAIN-20200524210325-20200525000325-00000.warc.gz

CDX indexer: support revisit records

It looks like the Pywb indexer indicates these by setting the mime type field to "warc/revisit". Presumably we should follow that. Currently the indexer just ignores revisit records entirely.

Native OSX / Linux binaries do not work

When using the native OSX binary from the 0.16.1 release, the binary does load, but the command-line parsing is broken.
I get an "invalid command" error for every command, e.g.:

jwarc: 'validate' is not a jwarc command. See 'jwarc help'.
jwarc: 'help' is not a jwarc command. See 'jwarc help'.

I am using it on an arm64 Mac, but that should be transparent (it should automatically use emulation).

Unpack as files

Lots of tricky details:

  • How do we map URLs to file paths?
  • What if a WARC contains several versions of the same URL?
  • How do we handle the file/directory name clashes?
  • Do we make files for metadata, headers and request payloads too?

ARC parser infinite loop reading body

On certain ARC files the parser may run into an infinite loop. So far, I've found the following ARC files which reproducibly cause the hang-up when running the "validate" tool:

  • IAH-20080430204825-00000-blackbook-truncated.arc - part of ukwa/webarchive-test-suite and also used by JWAT as a test resource. Note: when parsing the gzipped variant (also part of the test suite), the parser complains about an "invalid ARC trailer". The stack during the hang-up:

    "main" #1 prio=5 os_prio=0 tid=0x00007f0e8400b800 nid=0x38450 runnable [0x00007f0e8908a000]
     java.lang.Thread.State: RUNNABLE
          at sun.nio.ch.NativeThread.current(Native Method)
          at sun.nio.ch.NativeThreadSet.add(NativeThreadSet.java:46)
          at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:155)
          - locked <0x000000071ab05d30> (a java.lang.Object)
          at org.netpreserve.jwarc.LengthedBody.read(LengthedBody.java:76)
          - locked <0x000000071ac21f30> (a org.netpreserve.jwarc.LengthedBody$Seekable)
          at org.netpreserve.jwarc.LengthedBody$Seekable$1.read(LengthedBody.java:236)
          at org.netpreserve.jwarc.LengthedBody.read(LengthedBody.java:76)
          - locked <0x000000071ac28738> (a org.netpreserve.jwarc.LengthedBody$Seekable)
          at org.netpreserve.jwarc.tools.ValidateTool.readBody(ValidateTool.java:83)
          at org.netpreserve.jwarc.tools.ValidateTool.validateCapture(ValidateTool.java:159)
          at org.netpreserve.jwarc.tools.ValidateTool.validate(ValidateTool.java:182)
          at org.netpreserve.jwarc.tools.ValidateTool.main(ValidateTool.java:283)
          at org.netpreserve.jwarc.tools.WarcTool.main(WarcTool.java:49)
    
  • the gzipped ARC 1266352769711_14.arc.gz (Common Crawl 2010):

    "main" #1 prio=5 os_prio=0 tid=0x00007f414400b800 nid=0x38c12 runnable [0x00007f414a6b8000]
     java.lang.Thread.State: RUNNABLE
          at java.util.zip.Inflater.inflateBytes(Native Method)
          at java.util.zip.Inflater.inflate(Inflater.java:259)
          - locked <0x000000071ab5fe10> (a java.util.zip.ZStreamRef)
          at org.netpreserve.jwarc.GunzipChannel.read(GunzipChannel.java:59)
          at org.netpreserve.jwarc.LengthedBody.read(LengthedBody.java:76)
          - locked <0x000000071abb4330> (a org.netpreserve.jwarc.LengthedBody)
          at org.netpreserve.jwarc.LengthedBody$1.read(LengthedBody.java:138)
          at org.netpreserve.jwarc.LengthedBody.read(LengthedBody.java:76)
          - locked <0x000000071abf55d8> (a org.netpreserve.jwarc.LengthedBody)
          at org.netpreserve.jwarc.tools.ValidateTool.readBody(ValidateTool.java:83)
          at org.netpreserve.jwarc.tools.ValidateTool.validateCapture(ValidateTool.java:159)
          at org.netpreserve.jwarc.tools.ValidateTool.validate(ValidateTool.java:182)
          at org.netpreserve.jwarc.tools.ValidateTool.main(ValidateTool.java:283)
          at org.netpreserve.jwarc.tools.WarcTool.main(WarcTool.java:49)
    

Gzip compression

Do gzip-compressed WARCs support appending?
I tried writing a new WARC (with gzip compression), then appending to it, then reading it back. It says:
Caused by: java.util.zip.ZipException: not in gzip format (magic=4157)
at org.netpreserve.jwarc.GunzipChannel.readHeader(GunzipChannel.java:109)
at org.netpreserve.jwarc.GunzipChannel.read(GunzipChannel.java:45)
at org.netpreserve.jwarc.WarcParser.parse(WarcParser.java:306)
at org.netpreserve.jwarc.WarcReader.next(WarcReader.java:151)
at org.netpreserve.jwarc.WarcReader$1.hasNext(WarcReader.java:241)

Shouldn't each record be gzipped separately?
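
Yes: the WARC spec calls for each record to be compressed as its own gzip member, and members concatenated back-to-back form a valid multi-member stream. The exception suggests the bytes at the append point were not a gzip header (e.g. uncompressed record text was appended). A minimal stdlib-only sketch of the per-record-member pattern (not jwarc's actual implementation):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class MultiMemberGzip {
    // Compress each record as its own gzip member; members are simply concatenated.
    static byte[] writeMembers(String... records) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String record : records) {
            // Closing the GZIPOutputStream finishes this member; ByteArrayOutputStream
            // ignores close(), so we can keep appending further members afterwards.
            try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
                gz.write(record.getBytes(StandardCharsets.UTF_8));
            }
        }
        return out.toByteArray();
    }

    // GZIPInputStream transparently reads across concatenated members.
    static String readAll(byte[] compressed) throws IOException {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            for (int n; (n = gz.read(buf)) != -1; ) out.write(buf, 0, n);
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }
}
```

If an append reopens the file and writes uncompressed record bytes (or recompresses the whole file as one stream), readers will fail exactly as in the exception above.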

Request/Response Builder with String targetURI

Please add a constructor in WarcRequest.Builder and WarcResponse.Builder that accepts a String instead of a URI.
The internet is full of URIs that are difficult or impossible to parse into a URI instance. jwarc only calls targetURI.toString(), so this would be a small change.

Raw header access

For use cases like the ExtractTool (#41), copying records, and display/debugging, it would be useful if the message parser kept the raw header bytes.

WarcRevisit Builder with String targetURI

Similar to #63, I have added an additional constructor to accept a String instead of a URI when building a WarcRevisit.
I'll add a pull request for this.
#68

I would be grateful if this can become part of the jwarc library.

disable serviceworker in replay proxy mode

Hi,

when running jwarc as a replay proxy, is there a way to disable the service worker script injection?
Looking at the source code of the WarcServer class, would it be possible to add a parameter to the GET request for "replay" that changes the value of the "proxy" argument? Currently the replay method is always called with "proxy" set to false (line 112).

Thanks

ByteBuffer inflate and deflate support

It'd be nice to make use of the Java 11 versions of inflate() and deflate() so that buffers that aren't array-backed can be used.

One option would be to produce a multi-release jar with a different class for 8 and 11. That might make jwarc difficult to compile correctly on 8, though.

LambdaMetafactory might be a reasonable runtime method-selection mechanism with minimal performance overhead. (example)
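
For illustration, the Java 11 ByteBuffer overloads in question, exercised with direct (non-array-backed) buffers. This is a stdlib-only sketch compiled against 11; jwarc would still need the multi-release or runtime-selection machinery described above to keep running on 8. The single-shot deflate/inflate calls are for brevity and assume the output buffers are large enough:

```java
import java.nio.ByteBuffer;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class DirectBufferZip {
    // Round-trips data through Deflater/Inflater using direct buffers,
    // which only works with the ByteBuffer overloads added in Java 11.
    static byte[] roundTrip(byte[] data) throws DataFormatException {
        ByteBuffer in = ByteBuffer.allocateDirect(data.length);
        in.put(data).flip();

        Deflater deflater = new Deflater();
        deflater.setInput(in);                // ByteBuffer overload, Java 11+
        deflater.finish();
        ByteBuffer compressed = ByteBuffer.allocateDirect(data.length + 64);
        deflater.deflate(compressed);         // ByteBuffer overload, Java 11+
        deflater.end();
        compressed.flip();

        Inflater inflater = new Inflater();
        inflater.setInput(compressed);        // ByteBuffer overload, Java 11+
        ByteBuffer out = ByteBuffer.allocateDirect(data.length);
        inflater.inflate(out);                // ByteBuffer overload, Java 11+
        inflater.end();
        out.flip();

        byte[] result = new byte[out.remaining()];
        out.get(result);
        return result;
    }
}
```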

Add lenient HttpParser

HttpParser strictly follows RFC 2616 / RFC 7230. It is definitely good to have a validating parser available to check and verify WARC writing software. However, web servers may not follow the RFC, and the WARC 1.1 spec does not require that the content of a response record strictly follows the HTTP spec.

While testing several WARC files of different origin, I've seen so far the following types of errors which make the strict HttpParser fail (see #23 regarding logging of errors):

Exception in thread "main" org.netpreserve.jwarc.ParsingException: invalid HTTP message at byte position 109: ...ft-IIS/6.0\r\nX-Powered-By: PHP/4
.4.8\r\nP3P<-- HERE --> : CP="ALL CURa ADMa DEVa TAIa OUR BUS I...
Exception in thread "main" org.netpreserve.jwarc.ParsingException: invalid HTTP message at byte position 2047: ...Age=900\r\nSet-Cookie: ___utmvawMukYNX=lOV<-- HERE -->\x01WGsI; path=/; Max-Age=900\r\nSet-Cookie: ...
Exception in thread "main" org.netpreserve.jwarc.ParsingException: invalid HTTP message at byte position 12: HTTP/1.1 200<-- HERE -->\r\nSet-Cookie: JSESSIONID=0A6DC20EFB6D178...

To allow jwarc to be used on WARC files with invalid HTTP headers, whether caused by bugs in the WARC writer or by the responding web server, a lenient HttpParser would be good to have. In addition, the WARC reader could simply continue reading until the \r\n\r\n that marks the end of the header.
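
The fallback could be as simple as scanning for the first blank line and splitting each header line on the first colon, without validating token characters. A hypothetical sketch, not jwarc's parser, which tolerates all three failure cases shown above:

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class LenientHeaders {
    // Scans up to the first \r\n\r\n and splits each line on the first ':',
    // accepting input (space before the colon as in "P3P :", control bytes
    // in values, a status line without reason phrase) that a strict
    // RFC 7230 parser would reject.
    static Map<String, String> parse(byte[] message) {
        String text = new String(message, StandardCharsets.ISO_8859_1);
        int end = text.indexOf("\r\n\r\n");
        if (end < 0) end = text.length();          // tolerate a missing blank line
        Map<String, String> headers = new LinkedHashMap<>();
        String[] lines = text.substring(0, end).split("\r\n");
        for (int i = 1; i < lines.length; i++) {   // lines[0] is the status line
            int colon = lines[i].indexOf(':');
            if (colon < 0) continue;               // skip unparseable lines
            headers.put(lines[i].substring(0, colon).trim(),
                        lines[i].substring(colon + 1).trim());
        }
        return headers;
    }
}
```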

Rudimentary Memento support on replay

I noticed that replaying WARCs provides a 14-digit datetime placeholder. While I anticipate this will eventually be made semantic, it need not necessarily be. However, providing Memento (RFC7089) HTTP response headers would give some temporal context to the capture.

As a start, initially providing the Memento-Datetime HTTP response header (in RFC1123 format, e.g., Memento-Datetime: Fri, 09 Jan 2009 01:00:00 GMT) when viewing a capture from the WARC would be useful for further integration into other systems.
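
Formatting a capture instant into that fixed-format RFC 1123 date takes only the stdlib. One caveat worth a sketch: DateTimeFormatter.RFC_1123_DATE_TIME emits a one-digit day ("9", not "09"), so an explicit pattern is needed for the HTTP-date form:

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

public class MementoDatetime {
    // HTTP-date requires a two-digit day, so use an explicit pattern rather
    // than DateTimeFormatter.RFC_1123_DATE_TIME.
    private static final DateTimeFormatter HTTP_DATE =
            DateTimeFormatter.ofPattern("EEE, dd MMM uuuu HH:mm:ss 'GMT'", Locale.US)
                             .withZone(ZoneOffset.UTC);

    static String header(Instant capture) {
        return "Memento-Datetime: " + HTTP_DATE.format(capture);
    }
}
```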

Recording proxy

We've already got all the pieces (warc writing, http parsing, certificate generation).

We could include an example browser-based capture command similar to screenshot.

WARC 1.0 quirk: angle brackets around WARC-Target-URI

In WARC 1.0 the grammar specified the value of the WARC-Target-URI field as being wrapped in < and >. This was likely an editing mistake as it was not present in earlier drafts of the standard and is inconsistent with the examples in the standard itself and most implementations. It was corrected in WARC 1.1.

However, some software, such as wget 1.20.3, generates WARCs with angle brackets in this field, and since that really is what the standard said, we should strip them.
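
The fix can be a tiny normalization applied when reading the header value (a sketch, not the actual jwarc code):

```java
public class TargetUri {
    // WARC 1.0 erroneously specified WARC-Target-URI as "<" uri ">"; strip
    // the brackets so both quirky writers (wget 1.20.3) and normal ones
    // produce the same value.
    static String normalize(String value) {
        if (value.length() >= 2 && value.startsWith("<") && value.endsWith(">")) {
            return value.substring(1, value.length() - 1);
        }
        return value;
    }
}
```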

Should we include a Dockerfile?

I am wondering whether it would be helpful to add a Dockerfile to the repo that includes Chromium/Google Chrome and other run-time requirements, so that all the tools function as expected.

CDX server support

  • cdx command should be able to post records to a cdx server
  • replay server should be able to use a cdx server as a record index

Filter expressions

Parse a very simple filter expression language with Ragel?

use-cases:

  • jwarc filter image ex.warc > images.warc
  • jwarc recorder | jwarc filter 'http.method != HEAD' > record.warc
  • jwarc filter !error ex.warc | jwarc unpack
  • parameterizing your own tools or analysis jobs with a filter

operators:

! not
==, != string equality
~= regex match
<, <=, >=, > numeric comparison
&&, || boolean logic

shorthand predicates:

resource: WARC-Type == resource || WARC-Type == response
page: resource && payload.type == text/html
image: resource && payload.type ~= ^image/
error: WARC-Type == response && http.status > 400
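
Even before a Ragel grammar exists, the shorthand predicates compose naturally from a few primitives. A hypothetical sketch modelling a record as a simple field map (the field names `WARC-Type`, `payload.type`, `http.status` are the ones proposed above):

```java
import java.util.Map;
import java.util.function.Predicate;
import java.util.regex.Pattern;

public class Filters {
    static Predicate<Map<String, String>> eq(String field, String value) {
        return record -> value.equals(record.get(field));
    }

    static Predicate<Map<String, String>> matches(String field, String regex) {
        Pattern p = Pattern.compile(regex);
        return record -> record.get(field) != null && p.matcher(record.get(field)).find();
    }

    // the shorthand predicates, composed with and/or/negate (&&, ||, !)
    static final Predicate<Map<String, String>> RESOURCE =
            eq("WARC-Type", "resource").or(eq("WARC-Type", "response"));
    static final Predicate<Map<String, String>> PAGE =
            RESOURCE.and(eq("payload.type", "text/html"));
    static final Predicate<Map<String, String>> IMAGE =
            RESOURCE.and(matches("payload.type", "^image/"));
    static final Predicate<Map<String, String>> ERROR =
            eq("WARC-Type", "response")
                    .and(r -> Integer.parseInt(r.getOrDefault("http.status", "0")) > 400);
}
```

A parsed filter expression would just compile down to such a predicate tree.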

Utility methods to read payload body

Most consumers of the content payload require the payload to be

  1. decoded using the provided HTTP Content-Encoding
  2. available as byte[] (eg. Tika) or even String (eg. Jsoup)

I've found myself writing similar code when consuming the payload body of WarcResponse records: jwarc's extract tool #41, a sitemap tester and StormCrawler. In order to make jwarc more usable, I'd propose bundling the following functionality in a few utility methods:

  • return the decoded payload body as channel using the HTTP Content-Encoding
    • with configurable behavior (fail, or return the payload without decoding) when the Content-Encoding isn't understood or is unreliable (gzip without the gzip magic/header)
    • possibly allow passing decoders for encodings not supported by jwarc, e.g. brotli (I assume that jwarc is designed to have zero dependencies)
    • or should the decoding functionality be provided in a class HttpPayload extending WarcPayload?
  • read the (decoded) payload into byte[] (or ByteBuffer)
    • optionally limit the max. size of the byte[] array to ensure that oversized captures do not cause any issues
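
The size-limited byte[] read might look like this. A stdlib sketch with a hypothetical `maxBytes` parameter, not a committed jwarc API:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;

public class Payloads {
    // Reads the whole channel into a byte[], refusing to buffer more than
    // maxBytes so oversized captures can't exhaust memory.
    static byte[] readFully(ReadableByteChannel channel, int maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        ByteBuffer buf = ByteBuffer.allocate(8192);
        while (channel.read(buf) != -1) {
            buf.flip();
            if (out.size() + buf.remaining() > maxBytes) {
                throw new IOException("payload exceeds " + maxBytes + " bytes");
            }
            out.write(buf.array(), buf.position(), buf.remaining());
            buf.clear();
        }
        return out.toByteArray();
    }
}
```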

IoException reading gzip extra

jwarc fails with an exception when reading a .warc.gz containing a gzip extra field:

$> java org.netpreserve.jwarc.tools.WarcTool ls gzip_extra_sl.warc.gz
Exception in thread "main" java.io.UncheckedIOException: java.io.EOFException: reading gzip extra
        at org.netpreserve.jwarc.WarcReader$1.hasNext(WarcReader.java:226)
        at org.netpreserve.jwarc.tools.ListTool.main(ListTool.java:13)
        at org.netpreserve.jwarc.tools.WarcTool.main(WarcTool.java:32)
Caused by: java.io.EOFException: reading gzip extra
        at org.netpreserve.jwarc.GunzipChannel.readHeader(GunzipChannel.java:121)
        at org.netpreserve.jwarc.GunzipChannel.read(GunzipChannel.java:42)
        at org.netpreserve.jwarc.WarcParser.parse(WarcParser.java:305)
        at org.netpreserve.jwarc.WarcReader.next(WarcReader.java:134)
        at org.netpreserve.jwarc.WarcReader$1.hasNext(WarcReader.java:224)
        ... 2 more

The file gzip_extra_sl.warc.gz has been written by wget. Ironically, wget adds an extra field indicating the length of the compressed WARC record following the WARC 0.9 recommendation.

Optional space after chunk-size in chunked transfer-encoding

Some servers put optional space after the chunk-size which causes the following exception:

org.netpreserve.jwarc.ParsingException: chunked encoding at position 6944: ..."></span></a><ul class=dropdown-men\r\nD61<-- HERE --> \r\nu><li><a href="/mena/en/marketing/cor...
        at org.netpreserve.jwarc.ChunkedBody.parse(ChunkedBody.java:203)
        at org.netpreserve.jwarc.ChunkedBody.read(ChunkedBody.java:70)

Captured using wget: http_chunked_3c.warc.gz

Looks like the chunk-size is padded with blanks when it's shorter than 4 hex digits. Optional white space is not allowed by RFC 7230.
However, assuming the Server header correctly indicates "Apache-Coyote/1.1", I tried to figure out whether this is a systematic problem: the issue is discussed in https://bz.apache.org/bugzilla/show_bug.cgi?id=41364 and it turns out that RFC 2616 allows optional "linear white space" after the chunk-size, possibly also in other positions where it is not yet handled:

implied *LWS
The grammar described by this specification is word-based. Except
where noted otherwise, linear white space (LWS) can be included
between any two adjacent words (token or quoted-string), and
between adjacent words and separators, without changing the
interpretation of a field.
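
A lenient chunk-size parser only needs to drop the chunk extension and trim the optional LWS before reading the hex digits. A sketch, not jwarc's ChunkedBody:

```java
public class ChunkSize {
    // Parses a chunk-size line such as "D61 " or "1a;ext=1", tolerating the
    // optional linear white space that RFC 2616 allowed around the size.
    static long parse(String line) {
        int semi = line.indexOf(';');               // drop any chunk extension
        String size = (semi >= 0 ? line.substring(0, semi) : line).trim();
        return Long.parseLong(size, 16);
    }
}
```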

CDX indexer: Keep calculated digest from WARC header

The CDX indexer does base64 encoding of the digest.

This WARC header:
WARC-Payload-Digest: sha256:b04af472c47a8b1b5059b3404caac0e1bfb5a3c07b329be66f65cfab5ee8d3f3
Will result in the digest from the cdx-indexer:
WBFPI4WEPKFRWUCZWNAEZKWA4G73LI6APMZJXZTPMXH2WXXI2PZQ====

This is also inconsistent with what the PyWb cdx-indexer does.

Fix: add an option to keep the digest as-is when building the CDX index.

WarcReader/GunzipChannel to check ByteBuffer in constructor

The classes WarcReader and GunzipChannel both have a public constructor taking a ByteBuffer as argument. It's silently assumed that the ByteBuffer

  • is backed by an array
  • is already in read mode (.flip() called)
  • (WarcReader only) the buffer's byte order is big-endian - otherwise the detection of the gzip magic fails

The constructors should put the buffers into the correct state (if possible) or immediately throw an IllegalArgumentException. Alternatively, the constructors taking "external" buffers could be removed or made non-public.
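
A hypothetical sketch of the fail-fast check (the fix-up alternative is noted in a comment):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class Buffers {
    // Validates the assumptions the parser silently relies on, failing fast
    // instead of producing confusing errors later.
    static ByteBuffer check(ByteBuffer buffer) {
        if (!buffer.hasArray()) {
            throw new IllegalArgumentException("buffer must be array-backed");
        }
        if (buffer.order() != ByteOrder.BIG_ENDIAN) {
            // alternatively: buffer.order(ByteOrder.BIG_ENDIAN) to fix it up
            throw new IllegalArgumentException("buffer must be big-endian");
        }
        return buffer;
    }
}
```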

Write embedded resources to WARC

Great to see work on this @ato!

I am using the jwarc 0.3.0 .jar release and noticed only the root page is included. Perhaps this is by design. If not:

For example, java -jar jwa-0.3.0.jar fetch https://www.cs.odu.edu/~mkelly/ > example.warc does not capture any embedded images, CSS, etc.

The WARC is replayable in a few replay systems (e.g., OpenWayback, Webrecorder Player) but does not appear to be replayable in the embedded one.

I tried to replay this WARC using the included java -jar jwa-0.3.0.jar serve example.warc but received a Service Unavailable in the browser when accessing http://localhost:8080

wget quirk: Content-Length off by one

Some versions of wget generated WARC headers with an off by one Content-Length. This causes us to throw:

org.netpreserve.jwarc.ParsingException: invalid WARC trailer: a0d0a57

Examples:

Other implementations appear to ignore this error. Perhaps by simply skipping arbitrary numbers of CR and LF characters before reading the next record?

I don't want to silently ignore this but perhaps we could log a warning and attempt to continue.

GzipChannel write() returns compressed length rather than buffer consumption

GzipChannel.write() returns the number of compressed bytes written to the underlying channel rather than the number of uncompressed bytes consumed from the buffer. While this is useful information to know it unfortunately is not what the WritableByteChannel interface intends when it refers to "bytes written" and confuses standard methods that operate on channels such as FileChannel.transferTo().
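
A sketch of the intended contract: write() reports consumption of src, while the compressed output size is tracked separately (a hypothetical wrapper for illustration, not jwarc's GzipChannel):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;

// Per the WritableByteChannel contract, write() must return the number of
// bytes consumed from src; the compressed size is a different number and
// belongs in a separate accessor.
public class ContractChannel implements WritableByteChannel {
    private final WritableByteChannel out;
    private long outputBytes;                       // compressed size, exposed separately

    public ContractChannel(WritableByteChannel out) { this.out = out; }

    @Override
    public int write(ByteBuffer src) throws IOException {
        int consumed = src.remaining();
        while (src.hasRemaining()) {
            outputBytes += out.write(src);          // stand-in for the compressor
        }
        return consumed;                            // what FileChannel.transferTo() expects
    }

    public long outputBytes() { return outputBytes; }

    @Override public boolean isOpen() { return out.isOpen(); }
    @Override public void close() throws IOException { out.close(); }
}
```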

Recording proxy with browser javax.net.ssl.SSLHandshakeException

Hi,

I'm trying to record a WARC with jwarc in proxy mode and every browser I use fails.
To run jwarc in proxy mode I used these commands:

export PORT=8080
java -jar jwarc-0.13.1.jar recorder > test.warc

This is the log:

javax.net.ssl.SSLHandshakeException: Received fatal alert: certificate_unknown
	at java.base/sun.security.ssl.Alert.createSSLException(Alert.java:131)
	at java.base/sun.security.ssl.Alert.createSSLException(Alert.java:117)
	at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:356)
	at java.base/sun.security.ssl.Alert$AlertConsumer.consume(Alert.java:293)
	at java.base/sun.security.ssl.TransportContext.dispatch(TransportContext.java:202)
	at java.base/sun.security.ssl.SSLTransport.decode(SSLTransport.java:171)
	at java.base/sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1488)
	at java.base/sun.security.ssl.SSLSocketImpl.readHandshakeRecord(SSLSocketImpl.java:1394)
	at java.base/sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:441)
	at java.base/sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:412)
	at org.netpreserve.jwarc.net.HttpServer.upgradeToTls(HttpServer.java:137)
	at org.netpreserve.jwarc.net.HttpServer.interact(HttpServer.java:87)
	at org.netpreserve.jwarc.net.HttpServer.lambda$listen$1(HttpServer.java:58)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
	at java.base/java.lang.Thread.run(Thread.java:832)

How can I resolve this problem? Is there a way to run jwarc in proxy mode with a new certificate?

Thanks
