
jwarc

A Java library for reading and writing WARC files. This library includes a high-level API modeling the standard record types as individual classes with typed accessors. The API is extensible and you can register extension record types and accessors for extension header fields.

try (WarcReader reader = new WarcReader(FileChannel.open(Paths.get("example.warc")))) {
    for (WarcRecord record : reader) {
        if (record instanceof WarcResponse && record.contentType().base().equals(MediaType.HTTP)) {
            WarcResponse response = (WarcResponse) record;
            System.out.println(response.http().status() + " " + response.target());
        }
    }
}

It uses a finite state machine parser generated from a strict grammar using Ragel. There is an optional lenient mode which can handle some forms of non-compliant WARC records. ARC and HTTP parsing is lenient by default.

Gzipped records are automatically decompressed. The parser interprets ARC/1.1 records as if they were a WARC dialect and populates the appropriate WARC headers.

All I/O is performed using NIO and an effort is made to minimize data copies and share buffers whenever feasible. Direct buffers and even memory-mapped files can be used, but only with uncompressed WARCs until Inflater supports them (coming in JDK 11).

Getting it

To use as a library, add jwarc as a dependency from Maven Central.
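For example, with Maven (the version element is left as a placeholder; use the latest release from Maven Central):

```xml
<dependency>
    <groupId>org.netpreserve</groupId>
    <artifactId>jwarc</artifactId>
    <version><!-- latest version --></version>
</dependency>
```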

To use as a command-line tool, install Java 8 or later, download the latest release jar and run it using:

java -jar jwarc-{version}.jar

If you would prefer to build it from source, install JDK 8+ and Maven and then run:

mvn package

Examples

Saving a remote resource

try (WarcWriter writer = new WarcWriter(System.out)) {
    writer.fetch(URI.create("http://example.org/"));
}

Writing records

// write a warcinfo record
// date and record id will be populated automatically if unset
writer.write(new Warcinfo.Builder()
    .fields("software", "my-cool-crawler/1.0",
            "robots", "obey")
    .build());

// we can also supply a specific date
Instant captureDate = Instant.now();

// write a request but keep a copy of it to reference later
WarcRequest request = new WarcRequest.Builder()
    .date(captureDate)
    .target(uri)
    .contentType("application/http")
    .body(bodyStream, bodyLength)
    .build();
writer.write(request);

// write a response referencing the request
WarcResponse response = new WarcResponse.Builder()
    .date(captureDate)
    .target(uri)
    .contentType("application/http")
    .body("HTTP/1.0 200 OK\r\n...".getBytes())
    .concurrentTo(request.id())
    .build();
writer.write(response);

Filter expressions

The WarcFilter class provides a simple filter expression language for matching WARC records. For example here's a moderately complex filter which matches all records that are not image resources or image responses:

 !((warc-type == "resource" && content-type =~ "image/.*") || 
   (warc-type == "response" && http:content-type =~ "image/.*")) 

WarcFilter implements Predicate<WarcRecord> and can be used conveniently with streams of records:

long errorCount = warcReader.records().filter(WarcFilter.compile(":status >= 400")).count();

Their real power though is as a building block for user-supplied options.
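Because a compiled filter is an ordinary Predicate, it also composes with the usual default methods. A minimal sketch (the file name example.warc is hypothetical):

```java
WarcFilter images = WarcFilter.compile("http:content-type =~ \"image/.*\"");
try (WarcReader reader = new WarcReader(Paths.get("example.warc"))) {
    // Predicate.negate() gives the complement for free
    long nonImages = reader.records().filter(images.negate()).count();
    System.out.println(nonImages + " non-image records");
}
```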

Command-line tools

jwarc also includes a set of command-line tools which serve as examples. Note that many of the tools are lightweight demonstrations and may lack important options and features.

Capture a URL (without subresources):

java -jar jwarc.jar fetch http://example.org/ > example.warc

Create a CDX file:

java -jar jwarc.jar cdx example.warc > records.cdx

Run a replay proxy and web server:

export PORT=8080
java -jar jwarc.jar serve example.warc

Replay each page in a WARC, use headless Chrome to render a screenshot, and save it as a resource record:

export BROWSER=/opt/google/chrome/chrome
java -jar jwarc.jar screenshot example.warc > screenshots.warc

Run a proxy server which records requests and responses. This will generate self-signed SSL certificates, so you will need to turn off TLS verification in the client. For Chrome/Chromium use the --ignore-certificate-errors command-line option.

export PORT=8080
java -jar jwarc.jar recorder > example.warc

chromium --proxy-server=http://localhost:8080 --ignore-certificate-errors

Record a command that obeys the http(s)_proxy and CURL_CA_BUNDLE environment variables:

java -jar jwarc.jar recorder -o example.warc curl http://example.org/

Capture a page by recording headless Chrome:

export BROWSER=/opt/google/chrome/chrome
java -jar jwarc.jar record > example.warc

Create a new file containing only html responses with status 200:

java -jar jwarc.jar filter ':status == 200 && http:content-type =~ "text/html(;.*)?"' example.warc > pages.warc 

API Quick Reference

See the javadoc for more details.

              new WarcReader(stream|path|channel);                // opens a WARC file for reading
                  reader.close();                                 // closes the underlying channel
(WarcCompression) reader.compression();                           // type of compression: NONE or GZIP
       (Iterator) reader.iterator();                              // an iterator over the records
     (WarcRecord) reader.next();                                  // reads the next record
                  reader.registerType("myrecord", MyRecord::new); // registers a new record type
                  reader.setLenient(true);                        // enables lenient parsing mode
                new WarcWriter(channel, NONE|GZIP);    // opens a WARC file for writing
                    writer.fetch(uri);                 // downloads a resource recording the request and response
             (long) writer.position();                 // byte position the next record will be written to
                    writer.write(record);              // adds a record to the WARC file

Record types

Message
  HttpMessage
    HttpRequest
    HttpResponse
  WarcRecord
    Warcinfo            (warcinfo)
    WarcTargetRecord
      WarcContinuation  (continuation)
      WarcConversion    (conversion)
      WarcCaptureRecord
        WarcMetadata    (metadata)
        WarcRequest     (request)
        WarcResource    (resource)
        WarcResponse    (response)
        WarcRevisit     (revisit)

The basic building block of both the HTTP protocol and the WARC file format is a message consisting of a set of named header fields and a body. Header field names are case-insensitive and may have multiple values.

             (BodyChannel) message.body();                     // the message body as a ReadableByteChannel
                    (long) message.body().position();          // the next byte position to read from
                     (int) message.body().read(byteBuffer);    // reads a sequence of bytes from the body
                    (long) message.body().size();              // the length in bytes of the body
             (InputStream) message.body().stream();            // views the body as an InputStream
                  (String) message.contentType();              // the media type of the body
                 (Headers) message.headers();                  // the header fields
            (List<String>) message.headers().all("Cookie");    // all values of a header
                 (boolean) message.headers().contains("TE", "deflate"); // tests if a value is present
        (Optional<String>) message.headers().first("Cookie");  // the first value of a header
(Map<String,List<String>>) message.headers().map();            // views the header fields as a map
        (Optional<String>) message.headers().sole("Location"); // throws if header has multiple values
         (ProtocolVersion) message.version();                  // the protocol version (e.g. HTTP/1.0 or WARC/1.1)
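As a quick sketch of the header accessors above (the file name example.warc is hypothetical):

```java
try (WarcReader reader = new WarcReader(Paths.get("example.warc"))) {
    for (WarcRecord record : reader) {
        // field names are case-insensitive, so "warc-type" matches WARC-Type
        String type = record.headers().first("warc-type").orElse("?");
        System.out.println(type + ": " + record.headers().map().size() + " header fields");
    }
}
```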

Methods available on all WARC records:

  (Optional<Digest>) record.blockDigest();   // value of hash function applied to bytes of body
           (Instant) record.date();          // instant that data capture began
               (URI) record.id();            // globally unique record identifier
    (Optional<Long>) record.segmentNumber(); // position of this record in a segmented series
  (TruncationReason) record.truncated();     // reason record was truncated; or else NOT_TRUNCATED
            (String) record.type();          // "warcinfo", "request", "response" etc

Warcinfo (warcinfo)

             (Headers) warcinfo.fields();   // parses the body as application/warc-fields
    (Optional<String>) warcinfo.filename(); // filename of the containing WARC

WarcTargetRecord (abstract)

Methods available on all WARC records except Warcinfo:

     (Optional<String>) record.identifiedPayloadType(); // media type of payload identified by an independent check
               (String) record.target();                // captured URI as an unparsed string
                  (URI) record.targetURI();             // captured URI
(Optional<WarcPayload>) record.payload();               // payload
     (Optional<Digest>) record.payloadDigest();         // value of hash function applied to bytes of the payload
        (Optional<URI>) record.warcinfoID();            // ID of warcinfo record when stored separately

WarcContinuation (continuation)

              (String) continuation.segmentOriginId();    // record ID of first segment
    (Optional<String>) continuation.segmentTotalLength(); // (last only) total length of all segments

WarcConversion (conversion)

       (Optional<URI>) conversion.refersTo();    // ID of record this one was converted from

WarcCaptureRecord (abstract)

Methods available on metadata, request, resource and response records:

          (List<URI>) capture.concurrentTo();   // other record IDs from the same capture event
 (Optional<InetAddr>) capture.ipAddress();      // IP address of the server

WarcMetadata (metadata)

            (Headers) metadata.fields();        // parses the body as application/warc-fields

WarcRequest (request)

        (HttpRequest) request.http();           // parses the body as an HTTP request
        (BodyChannel) request.http().body();    // HTTP request body
            (Headers) request.http().headers(); // HTTP request headers

WarcResource (resource)

No methods are specific to resource records; see WarcRecord, WarcTargetRecord and WarcCaptureRecord above.

WarcResponse (response)

       (HttpResponse) response.http();           // parses the body as an HTTP response
        (BodyChannel) response.http().body();    // HTTP response body
            (Headers) response.http().headers(); // HTTP response headers

WarcRevisit (revisit)

       (HttpResponse) revisit.http();              // parses the body as an HTTP response
            (Headers) revisit.http().headers();    // HTTP response headers (note: revisits never have a payload!)
                (URI) revisit.profile();           // revisit profile (not-modified or identical-payload)
                (URI) revisit.refersTo();          // ID of the record this is a duplicate of
                (URI) revisit.refersToTargetURI(); // targetURI of the referred-to record
            (Instant) revisit.refersToDate();      // date of the referred-to record


Comparison

| Criteria            | jwarc      | JWAT           | webarchive-commons |
|---------------------|------------|----------------|--------------------|
| License             | Apache 2   | Apache 2       | Apache 2           |
| Parser based on     | Ragel FSM  | Hand-rolled FSM | Apache HTTP       |
| Push parsing        | Low level  | ✘              | ✘                  |
| Folded headers †    | ✔          | ✔              | ✔                  |
| Encoded words †     | ✘          | ✘ (disabled)   | ✘                  |
| Validation          | The basics | ✔              | ✘                  |
| Strict parsing ‡    | ✔          | ✘              | ✘                  |
| Lenient parsing     | HTTP only  | ✔              | ✔                  |
| I/O framework       | NIO        | IO             | IO                 |
| Multi-value headers | ✔          | ✔              | ✘                  |
| Record type classes | ✔          | ✘              | ✘                  |
| Typed accessors     | ✔          | ✔              | Some               |
| GZIP detection      | ✔          | ✔              | Filename only      |
| WARC writer         | Barebones  | ✔              | ✔                  |
| ARC reader          | Auto       | Separate API   | Factory            |
| ARC writer          | ✘          | ✔              | ✔                  |
| Speed * (.warc)     | 1x         | ~5x slower     | ~13x slower        |
| Speed * (.warc.gz)  | 1x         | ~1.4x slower   | ~2.8x slower       |

(†) WARC features copied from HTTP that have since been deprecated in HTTP. I'm not aware of any software that writes WARCs using these features and usage of them should probably be avoided. JWAT behaves differently from jwarc and webarchive-commons as it does not trim whitespace on folded lines.

(‡) JWAT and webarchive-commons both accept arbitrary UTF-8 characters in field names. jwarc strictly enforces the grammar rules from the WARC specification, although it does not currently enforce the rules for the values of specific individual fields.

(*) Relative time to scan records after JIT steady state. Only indicative; this needs to be redone with a better benchmark. JWAT was configured with an 8192-byte buffer as with default options it is 27x slower. For comparison, merely decompressing the .warc.gz file with GZIPInputStream is about 0.95x.

See also: Unaffiliated benchmark against other languages

More recent benchmarks against Java libraries

Other WARC libraries

jwarc's People

Contributors

ato, dependabot[bot], robertvanloenhout, sebastian-nagel, thomasegense


jwarc's Issues

ClueWeb09 WARC files fail to parse

The ClueWeb09 dataset WARC files (see sample files) use a single line feed \n as separator between WARC headers. The WarcParser expects \r\n (which would conform to the standard) and fails:

Exception in thread "main" org.netpreserve.jwarc.ParsingException: invalid WARC record at position 9: WARC/0.18<-- HERE -->\nWARC-Type: warcinfo\nWARC-Date: 2009-03-...

See also #25 for a similar issue regarding HttpParser.

Multithreading issue on GzipChannel write header

When multiple WarcWriter instances are used concurrently, operations on
private static final ByteBuffer GZIP_HEADER = ByteBuffer.wrap(GZIP_HEADER_); cause problems.
Some threads write the gzip header and some might not.

I think the issue could be fixed by removing the static modifier from GZIP_HEADER.
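The underlying hazard is that a ByteBuffer carries mutable position/limit state, so one shared static buffer is racy. A hedged stdlib sketch of the suggested direction (not the actual jwarc code; the class name is hypothetical): keep the header bytes shared but give each writer an independent view via duplicate():

```java
import java.nio.ByteBuffer;

public class GzipHeaderView {
    // Fixed gzip header bytes (RFC 1952: magic, deflate method, no flags)
    private static final byte[] GZIP_HEADER_ = {
        0x1f, (byte) 0x8b, 8, 0, 0, 0, 0, 0, 0, 0
    };
    private static final ByteBuffer GZIP_HEADER = ByteBuffer.wrap(GZIP_HEADER_);

    // duplicate() shares the underlying bytes but gives each caller its own
    // position and limit, so concurrent writers no longer race on shared state
    public static ByteBuffer header() {
        return GZIP_HEADER.duplicate();
    }
}
```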

Chunked body parser may read over end of chunk if destination buffer has higher capacity

The optimization to bypass the internal buffer when the destination buffer has a higher capacity than the internal buffer may cause a read over the end of the current chunk.

Reproducible with http_chunked_4.warc.gz and a buffer of 16 kB, e.g.,

ByteBuffer buffer = ByteBuffer.allocate(16384);
while (payload.get().body().read(buffer) > -1);

The chunk has size 16122: the first read stays within the chunk, but the second, bypassed read consumes all input until EOF (end of the WARC record). It must be ensured that nothing more than the content of a single chunk is forwarded to the destination buffer.

Note: if the internal buffer has been bypassed, the error message in line 52 while refilling the internal buffer is wrong/misleading because it uses the outdated internal buffer to show the context. This should also be fixed.
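One way the fix could look (a hedged sketch, not jwarc's actual code; the class name is hypothetical): before a bypass read, temporarily clamp the destination buffer's limit to the bytes remaining in the current chunk:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;

public class ClampedRead {
    // Reads from src into dst, but never more than `remaining` bytes,
    // by temporarily shrinking dst's limit for the duration of the read.
    public static int read(ReadableByteChannel src, ByteBuffer dst, long remaining)
            throws IOException {
        int savedLimit = dst.limit();
        if (dst.remaining() > remaining) {
            dst.limit(dst.position() + (int) remaining);
        }
        try {
            return src.read(dst);
        } finally {
            dst.limit(savedLimit); // restore so the caller sees the full buffer
        }
    }
}
```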

replay proxy doesn't start because the sw.js file is not found

Thanks for the great project.
When I try to serve a WARC I get the following error:

java -jar jwarc.jar serve mywarc.warc
Exception in thread "main" java.nio.file.NoSuchFileException: sw.js
	at org.netpreserve.jwarc.net.WarcServer.resource(WarcServer.java:202)
	at org.netpreserve.jwarc.net.WarcServer.<init>(WarcServer.java:52)
	at org.netpreserve.jwarc.net.WarcServer.<init>(WarcServer.java:45)
	at org.netpreserve.jwarc.tools.ServeTool.main(ServeTool.java:21)
	at org.netpreserve.jwarc.tools.WarcTool.main(WarcTool.java:47)

The files sw.js and inject.js are in resources/org/netpreserve/net/.
Because the resources are resolved relative to WarcServer, they are expected in resources/org/netpreserve/jwarc/net/; either move them there or use the absolute path /org/netpreserve/net/ in WarcServer to resolve the resources.

jwarc version 0.13.1
java 11 and 15

CDX indexer: CDXJ output support

It would be nice to have an option to output in CDXJ format.

Pywb's cdx-indexer uses the command-line option "-j, --cdxj" for that, so it would be nice if we supported the same option names.

CDX indexer fails to parse (webrecorder) WARC file and terminates.

The WARC file has been made with Webrecorder. cdxj-indexer (Python) and warc-indexer (BL) can parse the WARC file.
The error terminates the cdx tool's indexing and writes a partial last line to the CDX file, also producing an invalid CDX file.

Exception from the CDX indexer:
Exception in thread "main" java.lang.IllegalArgumentException: URLDecoder: Incomplete trailing escape (%) pattern
at java.base/java.net.URLDecoder.decode(URLDecoder.java:225)

I think this is the record that breaks it:
`
WARC/1.1
WARC-Record-ID: urn:uuid:eb2250e7-173b-5052-aa6e-1d0a2e78f996
WARC-Page-ID: 6l9q6b7tynm0ww8tezecpkc
WARC-Concurrent-To: urn:uuid:48aa011b-3ea1-505d-85af-49abbc668e66
WARC-Target-URI: https://graph.instagram.com/logging_client_events
WARC-Date: 2023-04-26T09:51:58.263Z
WARC-Type: request
Content-Type: application/http; msgtype=request
WARC-Payload-Digest: sha256:7368b96670b0fdd9525fcc645fa716b687795224d678a242af020113a1f7c97d
WARC-Block-Digest: sha256:db3b291b024653c5404401c321169ef68ff25c01aa6b25e20672a405796c6650
Content-Length: 2908

POST /logging_client_events HTTP/1.1
accept: */*
accept-encoding: gzip, deflate, br
accept-language: da-DK,da;q=0.9,en-US;q=0.8,en;q=0.7
cache-control: no-cache
content-length: 2228
content-type: application/x-www-form-urlencoded
origin: https://www.instagram.com
pragma: no-cache
referer: https://www.instagram.com/
sec-ch-ua: "Chromium";v="112", "Google Chrome";v="112", "Not:A-Brand";v="99"
sec-ch-ua-mobile: ?0
sec-ch-ua-platform: "macOS"
sec-fetch-dest: empty
sec-fetch-mode: cors
sec-fetch-site: same-site
user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36
x-asbd-id: 198387

access_token=936619743392459%7C3cdb3f896252a1db29679cb4554db266&message=%7B%22app_uid%22%3A%2256974635280%22%2C%22app_id%22%3A%22936619743392459%22%2C%22app_ver%22%3A%221.0.0%22%2C%22data%22%3A%5B%7B%22time%22%3A1682502712.891%2C%22name%22%3A%22instagram_web_media_impressions%22%2C%22extra%22%3A%7B%22ig_userid%22%3A56974635280%2C%22pk%22%3A56974635280%2C%22rollout_hash%22%3A%221007379527%22%2C%22frontend_env%22%3A%22C3%22%2C%22app_id%22%3A%22936619743392459%22%2C%22original_referrer%22%3Anull%2C%22original_referrer_domain%22%3A%22%22%2C%22referrer%22%3Anull%2C%22referrer_domain%22%3A%22www.instagram.com%22%2C%22url%22%3A%22%2Fregeringdk%2F%22%2C%22nav_chain%22%3A%22PolarisProfileRoot%3AprofilePage%3A1%3Avia_cold_start%2CPolarisPostModal%3ApostPage%3A6%3AmodalLink%22%2C%22media_id%22%3A%222415285276315581060%22%2C%22media_type%22%3A%22sidecar%22%2C%22owner_id%22%3A%229272148702%22%2C%22surface%22%3A%22profile%22%7D%2C%22obj_type%22%3A%22url%22%2C%22obj_id%22%3A%22%2Fp%2FCGE0Sl-BXqE%2F%22%7D%2C%7B%22time%22%3A1682502712.894%2C%22name%22%3A%22comment_impression%22%2C%22extra%22%3A%7B%22ig_userid%22%3A56974635280%2C%22pk%22%3A56974635280%2C%22rollout_hash%22%3A%221007379527%22%2C%22frontend_env%22%3A%22C3%22%2C%22app_id%22%3A%22936619743392459%22%2C%22ca_pk%22%3A%221549276785%22%2C%22c_pk%22%3A%2218121740952145782%22%2C%22container_module%22%3A%22profilePageModal%22%2C%22deviceid%22%3A%229DCBE01D-1E88-4D43-BBE1-706410D26031%22%2C%22device_model%22%3A%22Chrome+112.0.0.0%22%2C%22device_os%22%3A%22Web%22%2C%22a_pk%22%3A%229272148702%22%2C%22m_pk%22%3A%222415285276315581060%22%2C%22primary_locale%22%3A%22da_DK%22%2C%22isCovered%22%3Afalse%2C%22original_referrer%22%3Anull%2C%22original_referrer_domain%22%3A%22%22%2C%22referrer%22%3Anull%2C%22referrer_domain%22%3A%22www.instagram.com%22%2C%22url%22%3A%22%2Fregeringdk%2F%22%2C%22nav_chain%22%3A%22PolarisProfileRoot%3AprofilePage%3A1%3Avia_cold_start%2CPolarisPostModal%3ApostPage%3A6%3AmodalLink%22%7D%7D%5D%2C%22log_type%22%3A%
22client_event%22%2C%22seq%22%3A7%2C%22session_id%22%3A%22187bcf98552-d8e06e%22%2C%22device_id%22%3A%229DCBE01D-1E88-4D43-BBE1-706410D26031%22%2C%22claims%22%3A%5B%22hmac.AR1ZI1sWeUop6a1CkmjJA6UcWBgb97djMyjg0Sh8zmyVpaLE%22%5D%7D
`

This is the CDX line produced by cdxj-indexer (pywb):
com,instagram,graph)/logging_client_events?__wb_method=post&access_token=936619743392459|3cdb3f896252a1db29679cb4554db266&message={"app_uid":"56974635280","app_id":"936619743392459","app_ver":"1.0.0","data":[{"time":1682502712.891,"name":"instagram_web_media_impressions","extra":{"ig_userid":56974635280,"pk":56974635280,"rollout_hash":"1007379527","frontend_env":"c3","app_id":"936619743392459","original_referrer":null,"original_referrer_domain":"","referrer":null,"referrer_domain":"www.instagram.com","url":"/regeringdk/","nav_chain":"polarisprofileroot:profilepage:1:via_cold_start,polarispostmodal:postpage:6:modallink","media_id":"2415285276315581060","media_type":"sidecar","owner_id":"9272148702","surface":"profile"},"obj_type":"url","obj_id":"/p/cge0sl-bxqe/"},{"time":1682502712.894,"name":"comment_impression","extra":{"ig_userid":56974635280,"pk":56974635280,"rollout_hash":"1007379527","frontend_env":"c3","app_id":"936619743392459","ca_pk":"1549276785","c_pk":"18121740952145782","container_module":"profilepagemodal","deviceid":"9dcbe01d-1e88-4d43-bbe1-706410d26031","device_model":"chrome%20112.0.0.0","device_os":"web","a_pk":"9272148702","m_pk":"2415285276315581060","primary_locale":"da_dk","iscovered":false,"original_referrer":null,"original_referrer_domain":"","referrer":null,"referrer_domain":"www.instagram.com","url":"/regeringdk/","nav_chain":"polarisprofileroot:profilepage:1:via_cold_start,polarispostmodal:postpage:6:modallink"}}],"log_type":"client_event","seq":7,"session_id":"187bcf98552-d8e06e","device_id":"9dcbe01d-1e88-4d43-bbe1-706410d26031","claims":["hmac.ar1zi1sweuop6a1ckmjja6ucwbgb97djmyjg0sh8zmyvpale"]} 20230426095158 https://graph.instagram.com/logging_client_events application/json 200 9b7c9bb91016a0d17171d9a9307591530d2211c64f33104a1b87299a6b386f95 - - 831 6992293 webrec_regering_20230426.warc.gz

And the partial line produced by the cdx tool:
com,instagram,graph)/logging_client_events?__wb_method=post&access_token=936619743392459|3cdb3f896252a1db29679cb4554db266&message={"app_uid":"56974635280","app_id":"936619743392459","app_ver":"1.0.0","data":[{"time":1682502720.792,"name":"instagram_web_media_impressions","extra":{"ig_userid":56974635280,"pk":56974635280,"rollout_hash":"1007379527","frontend_env":"c3","app_id":"936619743392459","original_referrer":null,"original_referrer_domain":"","referrer":null,"referrer_domain":"www.instagram.com","url":"/regeringdk/","nav_chain":"polarisprofileroot:profilepage:1:via_cold_start,polarispostmodal:postpage:7:modallink","media_id":"2408315599488034008","media_type":"video","owner_id":"9272148702","surface":"profile"},"obj_type":"url","obj_id":"/p/cfsdkcmhcty/"},{"time":1682502720.815,"name":"video_should_start","extra":{

Build WarcRevisit with refersTo String targetURI

I'd like to be able to create a WarcRevisit with the builder. However, the target URI can only be set using an object of class URI.
It would be very convenient if I could pass a String.
Converting between URI and String gives some unnecessary headaches. For example, I sometimes get a double URL-encoded value from WarcTargetRecord.targetURI.

I'll add a pull request for this.
#66

Add build instructions

I have been using the .jar releases for testing but would like to manipulate the code and test JARs that I build locally.

It would be useful to me (and likely others) for instructions to be provided in the README (or within the repo) for building the jwarc library/JAR from source. The current README appears to be limited to usage instructions.

Avoid unchecked exceptions caused by malformed HTTP captures

The WARC parser often throws unchecked exceptions (IllegalArgumentException) when the input cannot be parsed or if it violates certain constraints (examples below). These exceptions make it nearly impossible to use jwarc to parse real-world HTTP captures because unchecked exceptions are not declared and in general considered to be unrecoverable. At least, the lenient parser (#25) should ignore malformed input and try to continue. Alternatively, checked exceptions could be used to force the user to handle the errors.

So far, I've run into these two issues:

  1. duplicate parameters in the "Content-Type" HTTP header: text/html;Charset=utf-8;charset=UTF-8. This is a frequent error, see examples in content_type_dupl_param-CC-MAIN-20200525032636-20200525062636-00118.warc.gz. In CdxTool the IllegalArgumentException is caught, but if this is the intended usage, it'd be better to throw a checked exception.
  2. duplicated HTTP header field Transfer-Encoding. So far, I've only seen a duplicated Transfer-Encoding: chunked which could be safely read as one single header, see examples in
    transfer_encoding_duplicated.warc.gz. In theory, the transfer encoding can be multi-valued (Transfer-Encoding: chunked, gzip) and RFC 7230, 3.2.2 states that two single-value header fields (chunked and gzip) are equivalent. But I have not yet seen an example for this.

GunzipChannel fails on payload with uncompressed size exceeding int_max

A gzip-compressed payload with an uncompressed size exceeding 2^31-1 (the maximum value of a 32-bit signed integer) causes the GunzipChannel to fail with the following exception:

$> java -cp target/jwarc-0.13.1-SNAPSHOT.jar org.netpreserve.jwarc.tools.WarcTool extract --payload test-size-int-max-overflow-content-encoding-gzip.warc.gz 975
Exception in thread "main" java.util.zip.ZipException: gzip uncompressed size mismatch
        at org.netpreserve.jwarc.GunzipChannel.readTrailer(GunzipChannel.java:92)
        at org.netpreserve.jwarc.GunzipChannel.read(GunzipChannel.java:70)
        at org.netpreserve.jwarc.tools.ExtractTool.writeBody(ExtractTool.java:81)
        at org.netpreserve.jwarc.tools.ExtractTool.writePayload(ExtractTool.java:70)
        at org.netpreserve.jwarc.tools.ExtractTool.main(ExtractTool.java:156)
        at org.netpreserve.jwarc.tools.WarcTool.main(WarcTool.java:21)

The WARC file test-size-int-max-overflow-content-encoding-gzip.warc.gz (21 kB) contains one record with a payload size of 2^31.
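For context on this class of bug (a hedged sketch, not the actual jwarc code; the class name is hypothetical): RFC 1952 defines the gzip trailer's ISIZE field as the uncompressed length modulo 2^32, so a correct size check masks the running total rather than comparing a 32-bit value against the full long count:

```java
public class GzipIsize {
    // ISIZE in the gzip trailer is the uncompressed size mod 2^32 (RFC 1952),
    // so compare only the low 32 bits of the running byte count.
    public static boolean isizeMatches(long totalUncompressed, int isizeFromTrailer) {
        return (totalUncompressed & 0xffffffffL) == (isizeFromTrailer & 0xffffffffL);
    }
}
```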

Chunked transfer-encoding causes exceptions at end of WARC record

Reading a payload with chunked Transfer-Encoding may result in an exception thrown after the entire chunked body has been consumed.

Exception in thread "main" java.io.EOFException: EOF reached before end of chunked encoding: ...ews:news>\n</url>\n</urlset>\r\n00000000\r\n\r\n<-- HERE -->
Exception in thread "main" org.netpreserve.jwarc.ParsingException: chunked encoding at position 1392: ...\n Total Excuted Time : 0.543\n-->\n\r\n0\r\n\r\n<-- HERE -->\r

WARC files have been recorded using Wget. See #23 for the logging of the current context (position in buffer/stream).

How to parse a non-standard HTTP header and avoid throwing an exception?

Hi, I use this tool to parse Common Crawl data, but it fails.

It hits this exception:

org.netpreserve.jwarc.ParsingException: invalid HTTP message at byte position 374: ...T; path=/\r\nX-UA-Compatible: IE=7\r\nPower <-- HERE -->by: Auto Capri\r\nDate: Sun, 24 May 2020 2...

the data:

HTTP/1.1 200 OK
Cache-Control: private
Pragma: private
Content-Type: text/html; charset=UTF-8
X-Crawler-Content-Encoding: gzip
Server: Microsoft-IIS/8.5
X-Powered-By: PHP/5.3.28
Set-Cookie: bblastvisit=1590360012; expires=Mon, 24-May-2021 22:40:12 GMT; path=/
Set-Cookie: bblastactivity=0; expires=Mon, 24-May-2021 22:40:12 GMT; path=/
X-UA-Compatible: IE=7
Power by: Auto Capri
Date: Sun, 24 May 2020 22:40:12 GMT
X-Crawler-Content-Length: 4855
Content-Length: 13868

related file:
crawl-data/CC-MAIN-2020-24/segments/1590347385193.5/warc/CC-MAIN-20200524210325-20200525000325-00000.warc.gz

CDX indexer: support revisit records

It looks like the Pywb indexer indicates these by setting the mime type field to "warc/revisit". Presumably we should follow that. Currently the indexer just ignores revisit records entirely.

Native OSX / Linux binaries do not work

When using the native OSX binary from the 0.16.1 release, the binary does load, but the command-line parsing is broken.
I get an "invalid command" error for every command, e.g.:

jwarc: 'validate' is not a jwarc command. See 'jwarc help'.
jwarc: 'help' is not a jwarc command. See 'jwarc help'.

I am using it on an arm64 Mac, but that should be transparent (it should automatically use emulation).

Unpack as files

Lots of tricky details:

  • How do we map URLs to file paths?
  • What if a WARC contains several versions of the same URL?
  • How do we handle the file/directory name clashes?
  • Do we make files for metadata, headers and request payloads too?

ARC parser infinite loop reading body

On certain ARC files the parser may run into an infinite loop. So far, I've found the following ARC files which reproducibly cause the hang-up when running the "validate" tool:

  • IAH-20080430204825-00000-blackbook-truncated.arc - part of ukwa/webarchive-test-suite and also used by JWAT as a test resource. Note: when parsing the gzipped variant (also part of the test suite), the parser complains about an "invalid ARC trailer". The stack during the hang-up:

    "main" #1 prio=5 os_prio=0 tid=0x00007f0e8400b800 nid=0x38450 runnable [0x00007f0e8908a000]
     java.lang.Thread.State: RUNNABLE
          at sun.nio.ch.NativeThread.current(Native Method)
          at sun.nio.ch.NativeThreadSet.add(NativeThreadSet.java:46)
          at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:155)
          - locked <0x000000071ab05d30> (a java.lang.Object)
          at org.netpreserve.jwarc.LengthedBody.read(LengthedBody.java:76)
          - locked <0x000000071ac21f30> (a org.netpreserve.jwarc.LengthedBody$Seekable)
          at org.netpreserve.jwarc.LengthedBody$Seekable$1.read(LengthedBody.java:236)
          at org.netpreserve.jwarc.LengthedBody.read(LengthedBody.java:76)
          - locked <0x000000071ac28738> (a org.netpreserve.jwarc.LengthedBody$Seekable)
          at org.netpreserve.jwarc.tools.ValidateTool.readBody(ValidateTool.java:83)
          at org.netpreserve.jwarc.tools.ValidateTool.validateCapture(ValidateTool.java:159)
          at org.netpreserve.jwarc.tools.ValidateTool.validate(ValidateTool.java:182)
          at org.netpreserve.jwarc.tools.ValidateTool.main(ValidateTool.java:283)
          at org.netpreserve.jwarc.tools.WarcTool.main(WarcTool.java:49)
    
  • the gzipped ARC 1266352769711_14.arc.gz (Common Crawl 2010):

    "main" #1 prio=5 os_prio=0 tid=0x00007f414400b800 nid=0x38c12 runnable [0x00007f414a6b8000]
     java.lang.Thread.State: RUNNABLE
          at java.util.zip.Inflater.inflateBytes(Native Method)
          at java.util.zip.Inflater.inflate(Inflater.java:259)
          - locked <0x000000071ab5fe10> (a java.util.zip.ZStreamRef)
          at org.netpreserve.jwarc.GunzipChannel.read(GunzipChannel.java:59)
          at org.netpreserve.jwarc.LengthedBody.read(LengthedBody.java:76)
          - locked <0x000000071abb4330> (a org.netpreserve.jwarc.LengthedBody)
          at org.netpreserve.jwarc.LengthedBody$1.read(LengthedBody.java:138)
          at org.netpreserve.jwarc.LengthedBody.read(LengthedBody.java:76)
          - locked <0x000000071abf55d8> (a org.netpreserve.jwarc.LengthedBody)
          at org.netpreserve.jwarc.tools.ValidateTool.readBody(ValidateTool.java:83)
          at org.netpreserve.jwarc.tools.ValidateTool.validateCapture(ValidateTool.java:159)
          at org.netpreserve.jwarc.tools.ValidateTool.validate(ValidateTool.java:182)
          at org.netpreserve.jwarc.tools.ValidateTool.main(ValidateTool.java:283)
          at org.netpreserve.jwarc.tools.WarcTool.main(WarcTool.java:49)
    

Gzip compression

Do gzip-compressed WARCs support appending?
I tried writing a new WARC (with gzip compression), then appending to it, then reading it back. It says:
Caused by: java.util.zip.ZipException: not in gzip format (magic=4157)
at org.netpreserve.jwarc.GunzipChannel.readHeader(GunzipChannel.java:109)
at org.netpreserve.jwarc.GunzipChannel.read(GunzipChannel.java:45)
at org.netpreserve.jwarc.WarcParser.parse(WarcParser.java:306)
at org.netpreserve.jwarc.WarcReader.next(WarcReader.java:151)
at org.netpreserve.jwarc.WarcReader$1.hasNext(WarcReader.java:241)

Shouldn't each record be gzipped separately?
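
Yes: the WARC spec calls for each record to be compressed as its own gzip member, and members concatenated back-to-back form a valid multi-member stream. The exception suggests the bytes at the append point were not a gzip header (e.g. uncompressed record text was appended). A minimal stdlib-only sketch of the per-record-member pattern (not jwarc's actual implementation):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class MultiMemberGzip {
    // Compress each record as its own gzip member; members are simply concatenated.
    static byte[] writeMembers(String... records) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String record : records) {
            // Closing the GZIPOutputStream finishes this member; ByteArrayOutputStream
            // ignores close(), so we can keep appending further members afterwards.
            try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
                gz.write(record.getBytes(StandardCharsets.UTF_8));
            }
        }
        return out.toByteArray();
    }

    // GZIPInputStream transparently reads across concatenated members.
    static String readAll(byte[] compressed) throws IOException {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            for (int n; (n = gz.read(buf)) != -1; ) out.write(buf, 0, n);
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }
}
```

If an append reopens the file and writes uncompressed record bytes (or recompresses the whole file as one stream), readers will fail exactly as in the exception above.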

Request/Response Builder with String targetURI

Please add a constructor in WarcRequest.Builder and WarcResponse.Builder that accepts a String instead of a URI.
The internet is full of URIs that are difficult or impossible to parse into a URI instance. jwarc only calls targetURI.toString(), so this would be a small change.

Raw header access

For use cases like the ExtractTool (#41), copying records, and display/debugging, it would be useful if the message parser kept the raw header bytes.

WarcRevisit Builder with String targetURI

Similar to #63, I have added an additional constructor to accept a String instead of a URI when building a WarcRevisit.
I'll add a pull request for this.
#68

I would be grateful if this can become part of the jwarc library.

disable serviceworker in replay proxy mode

Hi,

when running jwarc as a replay proxy, is there a way to disable the service worker script injection?
Looking at the source code of the WarcServer class, would it be possible to add a parameter to the GET request for "replay" that changes the value of the "proxy" argument? Currently the replay method is always called with "proxy" set to false (line 112).

Thanks

ByteBuffer inflate and deflate support

It'd be nice to make use of the Java 11 versions of inflate() and deflate() so that buffers that aren't array-backed can be used.

One option would be to produce a multi-release jar with a different class for 8 and 11. That might make jwarc difficult to compile correctly on 8, though.

LambdaMetafactory might be a reasonable runtime method-selection mechanism with minimal performance overhead. (example)
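
For illustration, the Java 11 ByteBuffer overloads in question, exercised with direct (non-array-backed) buffers. This is a stdlib-only sketch compiled against 11; jwarc would still need the multi-release or runtime-selection machinery described above to keep running on 8. The single-shot deflate/inflate calls are for brevity and assume the output buffers are large enough:

```java
import java.nio.ByteBuffer;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class DirectBufferZip {
    // Round-trips data through Deflater/Inflater using direct buffers,
    // which only works with the ByteBuffer overloads added in Java 11.
    static byte[] roundTrip(byte[] data) throws DataFormatException {
        ByteBuffer in = ByteBuffer.allocateDirect(data.length);
        in.put(data).flip();

        Deflater deflater = new Deflater();
        deflater.setInput(in);                // ByteBuffer overload, Java 11+
        deflater.finish();
        ByteBuffer compressed = ByteBuffer.allocateDirect(data.length + 64);
        deflater.deflate(compressed);         // ByteBuffer overload, Java 11+
        deflater.end();
        compressed.flip();

        Inflater inflater = new Inflater();
        inflater.setInput(compressed);        // ByteBuffer overload, Java 11+
        ByteBuffer out = ByteBuffer.allocateDirect(data.length);
        inflater.inflate(out);                // ByteBuffer overload, Java 11+
        inflater.end();
        out.flip();

        byte[] result = new byte[out.remaining()];
        out.get(result);
        return result;
    }
}
```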

Add lenient HttpParser

HttpParser strictly follows RFC 2616 / RFC 7230. It is definitely good to have a validating parser available to check and verify WARC writing software. However, web servers may not follow the RFC, and the WARC 1.1 spec does not require that the content of a response record strictly follows the HTTP spec.

While testing several WARC files of different origin, I've seen so far the following types of errors which make the strict HttpParser fail (see #23 regarding logging of errors):

Exception in thread "main" org.netpreserve.jwarc.ParsingException: invalid HTTP message at byte position 109: ...ft-IIS/6.0\r\nX-Powered-By: PHP/4
.4.8\r\nP3P<-- HERE --> : CP="ALL CURa ADMa DEVa TAIa OUR BUS I...
Exception in thread "main" org.netpreserve.jwarc.ParsingException: invalid HTTP message at byte position 2047: ...Age=900\r\nSet-Cookie: ___utmvawMukYNX=lOV<-- HERE -->\x01WGsI; path=/; Max-Age=900\r\nSet-Cookie: ...
Exception in thread "main" org.netpreserve.jwarc.ParsingException: invalid HTTP message at byte position 12: HTTP/1.1 200<-- HERE -->\r\nSet-Cookie: JSESSIONID=0A6DC20EFB6D178...

To allow jwarc to be used on WARC files with invalid HTTP headers, whether caused by bugs in the WARC writer or by the responding web server, a lenient HttpParser would be good to have. In addition, the WARC reader could simply continue reading until the \r\n\r\n that marks the end of the header.
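
The fallback could be as simple as scanning for the first blank line and splitting each header line on the first colon, without validating token characters. A hypothetical sketch, not jwarc's parser, which tolerates all three failure cases shown above:

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class LenientHeaders {
    // Scans up to the first \r\n\r\n and splits each line on the first ':',
    // accepting input (space before the colon as in "P3P :", control bytes
    // in values, a status line without reason phrase) that a strict
    // RFC 7230 parser would reject.
    static Map<String, String> parse(byte[] message) {
        String text = new String(message, StandardCharsets.ISO_8859_1);
        int end = text.indexOf("\r\n\r\n");
        if (end < 0) end = text.length();          // tolerate a missing blank line
        Map<String, String> headers = new LinkedHashMap<>();
        String[] lines = text.substring(0, end).split("\r\n");
        for (int i = 1; i < lines.length; i++) {   // lines[0] is the status line
            int colon = lines[i].indexOf(':');
            if (colon < 0) continue;               // skip unparseable lines
            headers.put(lines[i].substring(0, colon).trim(),
                        lines[i].substring(colon + 1).trim());
        }
        return headers;
    }
}
```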

Rudimentary Memento support on replay

I noticed that replaying WARCs provides a 14-digit datetime placeholder. While I anticipate this will eventually be made semantic, it need not necessarily be. However, providing Memento (RFC7089) HTTP response headers would give some temporal context to the capture.

As a start, initially providing the Memento-Datetime HTTP response header (in RFC1123 format, e.g., Memento-Datetime: Fri, 09 Jan 2009 01:00:00 GMT) when viewing a capture from the WARC would be useful for further integration into other systems.
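
Formatting a capture instant into that fixed-format RFC 1123 date takes only the stdlib. One caveat worth a sketch: DateTimeFormatter.RFC_1123_DATE_TIME emits a one-digit day ("9", not "09"), so an explicit pattern is needed for the HTTP-date form:

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

public class MementoDatetime {
    // HTTP-date requires a two-digit day, so use an explicit pattern rather
    // than DateTimeFormatter.RFC_1123_DATE_TIME.
    private static final DateTimeFormatter HTTP_DATE =
            DateTimeFormatter.ofPattern("EEE, dd MMM uuuu HH:mm:ss 'GMT'", Locale.US)
                             .withZone(ZoneOffset.UTC);

    static String header(Instant capture) {
        return "Memento-Datetime: " + HTTP_DATE.format(capture);
    }
}
```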

Recording proxy

We've already got all the pieces (warc writing, http parsing, certificate generation).

We could include an example browser-based capture command similar to screenshot.

WARC 1.0 quirk: angle brackets around WARC-Target-URI

In WARC 1.0 the grammar specified the value of the WARC-Target-URI field as being wrapped in < and >. This was likely an editing mistake as it was not present in earlier drafts of the standard and is inconsistent with the examples in the standard itself and most implementations. It was corrected in WARC 1.1.

However, some software, such as wget 1.20.3, generates WARCs with angle brackets in this field, and since that really is what the standard said, we should strip them.
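
The fix can be a tiny normalization applied when reading the header value (a sketch, not the actual jwarc code):

```java
public class TargetUri {
    // WARC 1.0 erroneously specified WARC-Target-URI as "<" uri ">"; strip
    // the brackets so both quirky writers (wget 1.20.3) and normal ones
    // produce the same value.
    static String normalize(String value) {
        if (value.length() >= 2 && value.startsWith("<") && value.endsWith(">")) {
            return value.substring(1, value.length() - 1);
        }
        return value;
    }
}
```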

Should we include a Dockerfile?

I am wondering whether it would be helpful to add a Dockerfile to the repo that includes Chromium/Google Chrome and other run-time requirements, so that all the tools function as expected.

CDX server support

  • cdx command should be able to post records to a cdx server
  • replay server should be able to use a cdx server as a record index

Filter expressions

Parse a very simple filter expression language with Ragel?

use-cases:

  • jwarc filter image ex.warc > images.warc
  • jwarc recorder | jwarc filter 'http.method != HEAD' > record.warc
  • jwarc filter !error ex.warc | jwarc unpack
  • parameterizing your own tools or analysis jobs with a filter

operators:

! not
==, != string equality
~= regex match
<, <=, >=, > numeric comparison
&&, || boolean logic

shorthand predicates:

resource: WARC-Type == resource || WARC-Type == response
page: resource && payload.type == text/html
image: resource && payload.type ~= ^image/
error: WARC-Type == response && http.status > 400
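
Even before a Ragel grammar exists, the shorthand predicates compose naturally from a few primitives. A hypothetical sketch modelling a record as a simple field map (the field names `WARC-Type`, `payload.type`, `http.status` are the ones proposed above):

```java
import java.util.Map;
import java.util.function.Predicate;
import java.util.regex.Pattern;

public class Filters {
    static Predicate<Map<String, String>> eq(String field, String value) {
        return record -> value.equals(record.get(field));
    }

    static Predicate<Map<String, String>> matches(String field, String regex) {
        Pattern p = Pattern.compile(regex);
        return record -> record.get(field) != null && p.matcher(record.get(field)).find();
    }

    // the shorthand predicates, composed with and/or/negate (&&, ||, !)
    static final Predicate<Map<String, String>> RESOURCE =
            eq("WARC-Type", "resource").or(eq("WARC-Type", "response"));
    static final Predicate<Map<String, String>> PAGE =
            RESOURCE.and(eq("payload.type", "text/html"));
    static final Predicate<Map<String, String>> IMAGE =
            RESOURCE.and(matches("payload.type", "^image/"));
    static final Predicate<Map<String, String>> ERROR =
            eq("WARC-Type", "response")
                    .and(r -> Integer.parseInt(r.getOrDefault("http.status", "0")) > 400);
}
```

A parsed filter expression would just compile down to such a predicate tree.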

Utility methods to read payload body

Most consumers of the content payload require the payload to be

  1. decoded using the provided HTTP Content-Encoding
  2. available as byte[] (eg. Tika) or even String (eg. Jsoup)

I've found myself writing similar code when consuming the payload body of WarcResponse records: jwarc's extract tool #41, a sitemap tester and StormCrawler. In order to make jwarc more usable, I'd propose bundling the following functionality in a few utility methods:

  • return the decoded payload body as channel using the HTTP Content-Encoding
    • with configurable behavior (fail, or return the payload without decoding) when the Content-Encoding isn't understood or is unreliable (gzip without the gzip magic/header)
    • possibly allow passing decoders for encodings not supported by jwarc, e.g. brotli (I assume that jwarc is designed to have zero dependencies)
    • or should the decoding functionality be provided in a class HttpPayload extending WarcPayload?
  • read the (decoded) payload into byte[] (or ByteBuffer)
    • optionally limit the max. size of the byte[] array to ensure that oversized captures do not cause any issues
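
The size-limited byte[] read might look like this. A stdlib sketch with a hypothetical `maxBytes` parameter, not a committed jwarc API:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;

public class Payloads {
    // Reads the whole channel into a byte[], refusing to buffer more than
    // maxBytes so oversized captures can't exhaust memory.
    static byte[] readFully(ReadableByteChannel channel, int maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        ByteBuffer buf = ByteBuffer.allocate(8192);
        while (channel.read(buf) != -1) {
            buf.flip();
            if (out.size() + buf.remaining() > maxBytes) {
                throw new IOException("payload exceeds " + maxBytes + " bytes");
            }
            out.write(buf.array(), buf.position(), buf.remaining());
            buf.clear();
        }
        return out.toByteArray();
    }
}
```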

IoException reading gzip extra

jwarc fails with an exception when reading a .warc.gz containing a gzip extra field:

$> java org.netpreserve.jwarc.tools.WarcTool ls gzip_extra_sl.warc.gz
Exception in thread "main" java.io.UncheckedIOException: java.io.EOFException: reading gzip extra
        at org.netpreserve.jwarc.WarcReader$1.hasNext(WarcReader.java:226)
        at org.netpreserve.jwarc.tools.ListTool.main(ListTool.java:13)
        at org.netpreserve.jwarc.tools.WarcTool.main(WarcTool.java:32)
Caused by: java.io.EOFException: reading gzip extra
        at org.netpreserve.jwarc.GunzipChannel.readHeader(GunzipChannel.java:121)
        at org.netpreserve.jwarc.GunzipChannel.read(GunzipChannel.java:42)
        at org.netpreserve.jwarc.WarcParser.parse(WarcParser.java:305)
        at org.netpreserve.jwarc.WarcReader.next(WarcReader.java:134)
        at org.netpreserve.jwarc.WarcReader$1.hasNext(WarcReader.java:224)
        ... 2 more

The file gzip_extra_sl.warc.gz has been written by wget. Ironically, wget adds an extra field indicating the length of the compressed WARC record following the WARC 0.9 recommendation.

Optional space after chunk-size in chunked transfer-encoding

Some servers put optional space after the chunk-size which causes the following exception:

org.netpreserve.jwarc.ParsingException: chunked encoding at position 6944: ..."></span></a><ul class=dropdown-men\r\nD61<-- HERE --> \r\nu><li><a href="/mena/en/marketing/cor...
        at org.netpreserve.jwarc.ChunkedBody.parse(ChunkedBody.java:203)
        at org.netpreserve.jwarc.ChunkedBody.read(ChunkedBody.java:70)

Captured using wget: http_chunked_3c.warc.gz

Looks like the chunk-size is padded with blanks when it's shorter than 4 hex digits. Optional white space is not allowed by RFC 7230.
However, assuming the Server header correctly indicates "Apache-Coyote/1.1", I tried to figure out whether this is a systematic problem: the issue is discussed in https://bz.apache.org/bugzilla/show_bug.cgi?id=41364 and it turns out that RFC 2616 allows optional "linear white space" after the chunk-size, possibly also in other positions where it is not yet handled:

implied *LWS
The grammar described by this specification is word-based. Except
where noted otherwise, linear white space (LWS) can be included
between any two adjacent words (token or quoted-string), and
between adjacent words and separators, without changing the
interpretation of a field.
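
A lenient chunk-size parser only needs to drop the chunk extension and trim the optional LWS before reading the hex digits. A sketch, not jwarc's ChunkedBody:

```java
public class ChunkSize {
    // Parses a chunk-size line such as "D61 " or "1a;ext=1", tolerating the
    // optional linear white space that RFC 2616 allowed around the size.
    static long parse(String line) {
        int semi = line.indexOf(';');               // drop any chunk extension
        String size = (semi >= 0 ? line.substring(0, semi) : line).trim();
        return Long.parseLong(size, 16);
    }
}
```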

CDX indexer: Keep calculated digest from WARC header

The CDX indexer does base64 encoding of the digest.

This WARC header:
WARC-Payload-Digest: sha256:b04af472c47a8b1b5059b3404caac0e1bfb5a3c07b329be66f65cfab5ee8d3f3
Will result in the digest from the cdx-indexer:
WBFPI4WEPKFRWUCZWNAEZKWA4G73LI6APMZJXZTPMXH2WXXI2PZQ====

This is also inconsistent with what the PyWb cdx-indexer does.

Fix: add an option to keep the digest as-is when building the CDX index.

WarcReader/GunzipChannel to check ByteBuffer in constructor

The classes WarcReader and GunzipChannel both have a public constructor taking a ByteBuffer as argument. It's silently assumed that the ByteBuffer

  • is backed by an array
  • is already in read mode (.flip() called)
  • (WarcReader only) the buffer's byte order is big-endian - otherwise the detection of the gzip magic fails

The constructors should put the buffers into the correct state (if possible) or immediately throw an IllegalArgumentException. Alternatively, the constructors taking "external" buffers could be removed or made non-public.
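
A hypothetical sketch of the fail-fast check (the fix-up alternative is noted in a comment):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class Buffers {
    // Validates the assumptions the parser silently relies on, failing fast
    // instead of producing confusing errors later.
    static ByteBuffer check(ByteBuffer buffer) {
        if (!buffer.hasArray()) {
            throw new IllegalArgumentException("buffer must be array-backed");
        }
        if (buffer.order() != ByteOrder.BIG_ENDIAN) {
            // alternatively: buffer.order(ByteOrder.BIG_ENDIAN) to fix it up
            throw new IllegalArgumentException("buffer must be big-endian");
        }
        return buffer;
    }
}
```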

Write embedded resources to WARC

Great to see work on this @ato!

I am using the jwarc 0.3.0 .jar release and noticed only the root page is included. Perhaps this is by design. If not:

For example, java -jar jwa-0.3.0.jar fetch https://www.cs.odu.edu/~mkelly/ > example.warc does not capture any embedded images, CSS, etc.

The WARC is replayable in a few replay systems (e.g., OpenWayback, Webrecorder Player) but does not appear to be replayable in the embedded one.

I tried to replay this WARC using the included java -jar jwa-0.3.0.jar serve example.warc but received a Service Unavailable in the browser when accessing http://localhost:8080

wget quirk: Content-Length off by one

Some versions of wget generated WARC headers with an off by one Content-Length. This causes us to throw:

org.netpreserve.jwarc.ParsingException: invalid WARC trailer: a0d0a57

Examples:

Other implementations appear to ignore this error. Perhaps by simply skipping arbitrary numbers of CR and LF characters before reading the next record?

I don't want to silently ignore this but perhaps we could log a warning and attempt to continue.

GzipChannel write() returns compressed length rather than buffer consumption

GzipChannel.write() returns the number of compressed bytes written to the underlying channel rather than the number of uncompressed bytes consumed from the buffer. While this is useful information to know it unfortunately is not what the WritableByteChannel interface intends when it refers to "bytes written" and confuses standard methods that operate on channels such as FileChannel.transferTo().
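
A sketch of the intended contract: write() reports consumption of src, while the compressed output size is tracked separately (a hypothetical wrapper for illustration, not jwarc's GzipChannel):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;

// Per the WritableByteChannel contract, write() must return the number of
// bytes consumed from src; the compressed size is a different number and
// belongs in a separate accessor.
public class ContractChannel implements WritableByteChannel {
    private final WritableByteChannel out;
    private long outputBytes;                       // compressed size, exposed separately

    public ContractChannel(WritableByteChannel out) { this.out = out; }

    @Override
    public int write(ByteBuffer src) throws IOException {
        int consumed = src.remaining();
        while (src.hasRemaining()) {
            outputBytes += out.write(src);          // stand-in for the compressor
        }
        return consumed;                            // what FileChannel.transferTo() expects
    }

    public long outputBytes() { return outputBytes; }

    @Override public boolean isOpen() { return out.isOpen(); }
    @Override public void close() throws IOException { out.close(); }
}
```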

Recording proxy with browser javax.net.ssl.SSLHandshakeException

Hi,

I'm trying to record a WARC with jwarc in proxy mode and every browser I use fails.
To run jwarc in proxy mode I used these commands:

export PORT=8080
java -jar jwarc-0.13.1.jar recorder > test.warc

This is the log:

javax.net.ssl.SSLHandshakeException: Received fatal alert: certificate_unknown
	at java.base/sun.security.ssl.Alert.createSSLException(Alert.java:131)
	at java.base/sun.security.ssl.Alert.createSSLException(Alert.java:117)
	at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:356)
	at java.base/sun.security.ssl.Alert$AlertConsumer.consume(Alert.java:293)
	at java.base/sun.security.ssl.TransportContext.dispatch(TransportContext.java:202)
	at java.base/sun.security.ssl.SSLTransport.decode(SSLTransport.java:171)
	at java.base/sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1488)
	at java.base/sun.security.ssl.SSLSocketImpl.readHandshakeRecord(SSLSocketImpl.java:1394)
	at java.base/sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:441)
	at java.base/sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:412)
	at org.netpreserve.jwarc.net.HttpServer.upgradeToTls(HttpServer.java:137)
	at org.netpreserve.jwarc.net.HttpServer.interact(HttpServer.java:87)
	at org.netpreserve.jwarc.net.HttpServer.lambda$listen$1(HttpServer.java:58)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
	at java.base/java.lang.Thread.run(Thread.java:832)

How can I resolve this problem? Is there a way to run jwarc in proxy mode with a new certificate?

Thanks
