internetarchive / heritrix3

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.

Home Page: https://heritrix.readthedocs.io/

License: Other


heritrix3's Introduction

Heritrix


Introduction

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. Heritrix (sometimes spelled heretrix, or misspelled or mis-said as heratrix/heritix/heretix/heratix) is an archaic word for heiress (a woman who inherits). Since our crawler seeks to collect and preserve the digital artifacts of our culture for the benefit of future researchers and generations, this name seemed apt.

Crawl Operators!

Heritrix is designed to respect the robots.txt exclusion directives† and META nofollow tags. Please consider the load your crawl will place on seed sites and set politeness policies accordingly. Also, always identify your crawl with contact information in the User-Agent so sites that may be adversely affected by your crawl can contact you or adapt their server behavior accordingly.

† The newer wildcard extension to robots.txt is not yet supported.

Documentation

Developer Documentation

Latest Releases

Information about releases can be found here.

License

Heritrix is free software; you can redistribute it and/or modify it under the terms of the Apache License, Version 2.0

Some individual source code files are subject to or offered under other licenses. See the included LICENSE.txt file for more information.

Heritrix is distributed with the libraries it depends upon. The libraries can be found under the lib directory in the release distribution, and are used under the terms of their respective licenses, which are included alongside the libraries in the lib directory.

heritrix3's People

Contributors

adam-miller, anjackson, ato, bitbaron, bnfleb, caofangkun, claire9910, clemensrobbenhaar, cmiles74, csrster, dengliming, dependabot[bot], eldondev, galgeek, gojomo, hennekey, ikreymer, jkafader, kngenie, krakan, kris-sigur, ldko, martinsbalodis, morokosi, nlevitt, querela, ruebot, shriphani, vonrosen, webdev4422


heritrix3's Issues

Do not treat all URLs from link/@href tags as embeds.

Currently all URLs extracted from href attributes on link tags are treated as embeds (hop type E).

This makes perfect sense when the link is to a stylesheet (rel='stylesheet'), but is just wrong in many other scenarios.

The ExtractorHTML should be modified to take the rel attribute into account in determining the type of link, E or L.
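
A sketch of that classification (the helper below is hypothetical, not the actual ExtractorHTML code): treat only rel values naming resources the page needs in order to render as embeds (E), and everything else as navigational links (L).

// Hypothetical helper illustrating the proposed rel-aware classification.
final class LinkRelClassifier {
    static char hopTypeFor(String rel) {
        if (rel == null) {
            return 'L'; // no rel: assume an ordinary navigational link
        }
        switch (rel.trim().toLowerCase()) {
            case "stylesheet":
            case "icon":
            case "shortcut icon":
            case "preload":
                return 'E'; // fetched in order to render the page: an embed
            default:
                return 'L'; // alternate, canonical, next, prev, ...: a link
        }
    }
}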

heritrix doesn't scrape rewrite srcset urls correctly

Reposted from internetarchive/wayback#137 as I learn more about the architecture here.

I ran into an issue while trying to archive the website http://www.goodbyetohalos.com. Like many webcomics built on WordPress nowadays, Goodbye to Halos uses HTML5 srcset attributes to serve different image sizes to different devices:

<img
    width="800" height="1200" 
    src="http://www.goodbyetohalos.com/wp-content/uploads/2017/01/WEB_ch1_108.jpg"
    class="attachment-full size-full" alt=""
    srcset="http://www.goodbyetohalos.com/wp-content/uploads/2017/01/WEB_ch1_108.jpg 800w,
            http://www.goodbyetohalos.com/wp-content/uploads/2017/01/WEB_ch1_108-480x720.jpg 480w,
            http://www.goodbyetohalos.com/wp-content/uploads/2017/01/WEB_ch1_108-96x144.jpg 96w"
    sizes="(max-width: 800px) 100vw, 800px"
    data-webcomic-parent="837"
>

So far, so good. However, after crawling and replaying these through Wayback, only the src URL is captured and rewritten, leaving the images on the archived page still served from the original server:

<img
    width="800" height="1200"
    src="/web/20170127042412im_/http://www.goodbyetohalos.com/wp-content/uploads/2017/01/WEB_ch1_108.jpg"
    class="attachment-full size-full" alt=""
    srcset="http://www.goodbyetohalos.com/wp-content/uploads/2017/01/WEB_ch1_108.jpg 800w,
            http://www.goodbyetohalos.com/wp-content/uploads/2017/01/WEB_ch1_108-480x720.jpg 480w,
            http://www.goodbyetohalos.com/wp-content/uploads/2017/01/WEB_ch1_108-96x144.jpg 96w"
    sizes="(max-width: 800px) 100vw, 800px"
    data-webcomic-parent="837"
>

This is very obvious because the original site doesn't use HTTPS, leading to a broken image in the Wayback Machine view.


Obviously, the correct behavior here is that all of the images should be captured (in this case they're just resizings, but in theory they could be completely different images; nothing prevents that) and rewritten.

Thanks! Let me know if you need more information, or want me to whip up a more minimal test case.
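
For reference, splitting a srcset value into its candidate URLs is simple enough; here is a minimal Java sketch (illustrative only, not ExtractorHTML code, and URLs that themselves contain commas would need the full srcset grammar):

import java.util.ArrayList;
import java.util.List;

final class SrcsetParser {
    // Each comma-separated candidate is "URL [descriptor]", e.g. "img-480.jpg 480w";
    // the first whitespace-delimited token is the URL, the descriptor is dropped.
    static List<String> candidateUrls(String srcset) {
        List<String> urls = new ArrayList<>();
        for (String candidate : srcset.split(",")) {
            String[] parts = candidate.trim().split("\\s+");
            if (parts.length > 0 && !parts[0].isEmpty()) {
                urls.add(parts[0]);
            }
        }
        return urls;
    }
}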

JDK11 support: jaxb

JAXB has been removed from the JDK and is now available as a separate Maven dependency. Compiling with OpenJDK 11 fails with:

[ERROR] heritrix3/engine/src/main/java/org/archive/crawler/restlet/XmlMarshaller.java:[34,33] package javax.xml.bind.annotation does not exist
[ERROR] heritrix3/engine/src/main/java/org/archive/crawler/restlet/XmlMarshaller.java:[35,33] package javax.xml.bind.annotation does not exist
[ERROR] heritrix3/engine/src/main/java/org/archive/crawler/restlet/XmlMarshaller.java:[36,33] package javax.xml.bind.annotation does not exist
[ERROR] heritrix3/engine/src/main/java/org/archive/crawler/restlet/models/ScriptModel.java:[9,33] package javax.xml.bind.annotation does not exist
[ERROR] heritrix3/engine/src/main/java/org/archive/crawler/restlet/models/ScriptModel.java:[10,33] package javax.xml.bind.annotation does not exist
[ERROR] heritrix3/engine/src/main/java/org/archive/crawler/restlet/models/ScriptModel.java:[14,2] cannot find symbol
  symbol: class XmlRootElement
[ERROR] heritrix3/engine/src/main/java/org/archive/crawler/restlet/models/ScriptModel.java:[15,2] cannot find symbol
  symbol: class XmlType
[ERROR] heritrix3/engine/src/main/java/org/archive/crawler/restlet/XmlMarshaller.java:[141,9] cannot find symbol
  symbol:   class XmlRootElement
  location: class org.archive.crawler.restlet.XmlMarshaller
[ERROR] heritrix3/engine/src/main/java/org/archive/crawler/restlet/XmlMarshaller.java:[141,59] cannot find symbol
  symbol:   class XmlRootElement
  location: class org.archive.crawler.restlet.XmlMarshaller
[ERROR] heritrix3/engine/src/main/java/org/archive/crawler/restlet/XmlMarshaller.java:[160,13] cannot find symbol
  symbol:   class XmlType
  location: class org.archive.crawler.restlet.XmlMarshaller
[ERROR] heritrix3/engine/src/main/java/org/archive/crawler/restlet/XmlMarshaller.java:[160,60] cannot find symbol
  symbol:   class XmlType
  location: class org.archive.crawler.restlet.XmlMarshaller
[ERROR] heritrix3/engine/src/main/java/org/archive/crawler/restlet/XmlMarshaller.java:[170,50] cannot find symbol
  symbol:   class XmlTransient
  location: class org.archive.crawler.restlet.XmlMarshaller

Heritrix3 (including pre-built binaries) Fails to Bootstrap with Java 8 due to Changes in the Java stdlib

An in-depth description of the issue can be found in a 2014 blog post by Kristinn Sigurðsson titled "Heritrix, Java 8 and sun.security.tools.Keytool" (archival copy).

An excerpt from my heritrix_out.log:

T jaan  31 15:30:07 EET 2017 Starting heritrix
Linux mvahi 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u2 (2016-10-19) x86_64 GNU/Linux
java version "1.8.0_121"
Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
JAVA_OPTS= -Xmx256m
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 31957
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 65536
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 31957
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
Oracle Corporation Java(TM) SE Runtime Environment 1.8.0_121-b13
Exception in thread "main" java.lang.NoClassDefFoundError: sun/security/tools/KeyTool
        at org.archive.crawler.Heritrix.useAdhocKeystore(Heritrix.java:438)
        at org.archive.crawler.Heritrix.instanceMain(Heritrix.java:319)
        at org.archive.crawler.Heritrix.main(Heritrix.java:189)
Caused by: java.lang.ClassNotFoundException: sun.security.tools.KeyTool
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        ... 3 more

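For background, the class was renamed in Java 8 (sun.security.tools.KeyTool became sun.security.tools.keytool.Main), so one possible workaround is to try both names reflectively. This is a sketch, not Heritrix's actual fix, and both classes are internal JDK API that disappears entirely from later JDKs:

// Hypothetical sketch: locate the keytool entry point on both Java 7 and Java 8.
public final class KeyToolRunner {
    public static void run(String[] args) throws Exception {
        Class<?> keyTool;
        try {
            keyTool = Class.forName("sun.security.tools.keytool.Main"); // Java 8
        } catch (ClassNotFoundException e) {
            keyTool = Class.forName("sun.security.tools.KeyTool");      // Java 7 and earlier
        }
        keyTool.getMethod("main", String[].class).invoke(null, (Object) args);
    }
}
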
Bug in non-fatal-error log

While resolving #158 I noticed that the resulting entry in the non-fatal-error log had redundant stack traces, i.e. one exception triggered the following:

2016-05-02T12:58:56.239Z   401       4758 http://aktravel.is/en/fundir-og-radstefnur/framkvaemd-radstefnu - - text/html #001 20160502125856064+163 sha1:KKGWJBFE2H4XPVTWXRNIMTCEQTJ4N76N - -
 java.lang.IllegalStateException: Missing auth challenge headers for uri with response status 401: http://aktravel.is/en/fundir-og-radstefnur/framkvaemd-radstefnu
    at org.archive.modules.fetcher.FetchHTTP.extractChallenges(FetchHTTP.java:884)
    at org.archive.modules.fetcher.FetchHTTP.handle401(FetchHTTP.java:802)
    at org.archive.modules.fetcher.FetchHTTP.innerProcess(FetchHTTP.java:743)
    at org.archive.modules.Processor.innerProcessResult(Processor.java:175)
    at org.archive.modules.Processor.process(Processor.java:142)
    at org.archive.modules.ProcessorChain.process(ProcessorChain.java:131)
    at org.archive.crawler.framework.ToeThread.run(ToeThread.java:148)
 java.lang.IllegalStateException: Missing auth challenge headers for uri with response status 401: http://aktravel.is/en/fundir-og-radstefnur/framkvaemd-radstefnu
    at org.archive.modules.fetcher.FetchHTTP.extractChallenges(FetchHTTP.java:884)
    at org.archive.modules.fetcher.FetchHTTP.handle401(FetchHTTP.java:802)
    at org.archive.modules.fetcher.FetchHTTP.innerProcess(FetchHTTP.java:743)
    at org.archive.modules.Processor.innerProcessResult(Processor.java:175)
    at org.archive.modules.Processor.process(Processor.java:142)
    at org.archive.modules.ProcessorChain.process(ProcessorChain.java:131)
    at org.archive.crawler.framework.ToeThread.run(ToeThread.java:148)

Probably a bug somewhere in

org.archive.crawler.io.NonFatalErrorFormatter.format()
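
One plausible shape for the bug is the formatter appending the same Throwable twice, once as the LogRecord's own thrown exception and once as extra information attached to the record. A hypothetical guard, sketched without knowledge of NonFatalErrorFormatter's real structure:

import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.logging.LogRecord;

final class StackTraceDedup {
    // Append each distinct throwable once; skip the second reference when both
    // point at the same exception object.
    static void appendStackTraces(StringBuilder sb, LogRecord record, Throwable extra) {
        Throwable own = record.getThrown();
        append(sb, own);
        if (extra != null && extra != own) {
            append(sb, extra);
        }
    }

    private static void append(StringBuilder sb, Throwable t) {
        if (t == null) return;
        StringWriter sw = new StringWriter();
        t.printStackTrace(new PrintWriter(sw, true));
        sb.append(' ').append(sw);
    }
}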

Requesting inaccurate paths from js causes routing errors

Not sure if this is the place for this issue, but we're seeing elevated routing errors coming from a Heritrix user agent that is requesting random require() paths from our application JavaScript. One example is from some minified reactjs code:

c=t("./ReactComponent"),d=t("./ReactElement"),h=(t("./ReactPropTypeLocations"),t("./ReactPropTypeLocationNames"),t("./ReactNoopUpdateQueue"))

Heritrix tries to get these non-existent assets from our servers:

ActionController::RoutingError: No route matches [GET] "/ReactNoopUpdateQueue"

Not sure what to do about this aside from just blocking Heritrix up front.

fetchDNS lacks timeout option

Running a recent 3.3.0 snapshot, a harvest remains in the RUNNING state with 4 requests queued. These 4 requests relate to 2 domains, and for each domain the DNS request hangs. Indeed, an nslookup request for the given domains replies "*** Can't find firebaseio.com: No answer". So it appears that Heritrix needs to be able to time out DNS requests.
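
FetchDNS is built on dnsjava, whose resolvers do support timeouts, so the missing piece is wiring one in. A minimal sketch against the dnsjava 2.x API (how FetchDNS should expose this as a setting is the open question):

import org.xbill.DNS.Lookup;
import org.xbill.DNS.Record;
import org.xbill.DNS.SimpleResolver;
import org.xbill.DNS.Type;

public class TimedDnsLookup {
    public static Record[] lookup(String host, int timeoutSeconds) throws Exception {
        SimpleResolver resolver = new SimpleResolver(); // system default nameserver
        resolver.setTimeout(timeoutSeconds);            // bound each query attempt
        Lookup lookup = new Lookup(host, Type.A);
        lookup.setResolver(resolver);
        return lookup.run();                            // null on failure or timeout
    }
}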

ToeThread death when using HighestUriPrecedenceProvider

We're using HighestUriPrecedenceProvider and have noticed this occasional (very rare) failure leading to ToeThreads dying...

SEVERE: org.archive.crawler.framework.ToeThread run Fatal exception in ToeThread #53:  [Sat Nov 03 14:36:05 GMT 2018]
java.util.NoSuchElementException
        at java.util.TreeMap.key(TreeMap.java:1327)
        at java.util.TreeMap.firstKey(TreeMap.java:290)
        at org.archive.crawler.frontier.precedence.HighestUriQueuePrecedencePolicy$HighestUriPrecedenceProvider.getPrecedence(HighestUriQueuePrecedencePolicy.java:89)
        at org.archive.crawler.frontier.WorkQueue.getPrecedence(WorkQueue.java:627)
        at org.archive.crawler.frontier.WorkQueueFrontier.activateInactiveQueue(WorkQueueFrontier.java:781)
        at org.archive.crawler.frontier.WorkQueueFrontier.findEligibleURI(WorkQueueFrontier.java:597)
        at org.archive.crawler.frontier.AbstractFrontier.next(AbstractFrontier.java:450)
        at org.archive.crawler.framework.ToeThread.run(ToeThread.java:134)

Presumably this is a subtle race condition, i.e. the TreeMap being modified while it is being read.

See

Integer delta = (enqueuedCounts.size() > 0) ? enqueuedCounts.firstKey() : 0;
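
The size() check and the firstKey() call are two separate operations on a shared TreeMap, so another thread can drain the map in between. A defensive rewrite (a sketch, not the project's actual fix; synchronizing on the map would be the alternative):

import java.util.NoSuchElementException;
import java.util.SortedMap;

final class PrecedenceHelper {
    // Avoid the check-then-act race: take firstKey() directly and fall back
    // to 0 if the map was emptied concurrently.
    static int lowestEnqueued(SortedMap<Integer, Integer> enqueuedCounts) {
        try {
            return enqueuedCounts.firstKey();
        } catch (NoSuchElementException e) {
            return 0;
        }
    }
}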

Upgrade HTTP Client to 4.5.x

We're seeing some minor issues (ukwa/ukwa-heritrix#31) that appear to be bugs in the HTTP Client, so I'd like to see about upgrading. I've attempted to build with 4.5.7 and it compiles fine, but the cookie-related tests fail with errors complaining that iterator() is not implemented, and indeed it is not:

@Override public Iterator<T> iterator() { throw new RuntimeException("not implemented"); }

As the comment on that class explains:

* <p>
* This class is "restricted" in the sense that it is immutable, and also
* because some methods throw {@link RuntimeException} for other reasons.
* For example, {@link #iterator()} is not implemented, because we use this
* class to wrap a bdb {@link StoredCollection}, and iterators from that
* class need to be explicitly closed. Since this class hides the fact that
* a StoredCollection underlies it, we simply prevent {@link #iterator()}
* from being used.

But the HTTP Client code now expects an iterator, so we need to figure out a way of providing an iterator interface but ensuring the underlying iterator gets closed.
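
One option is an iterator wrapper that closes the underlying cursor once iteration is exhausted. A sketch assuming the wrapped iterator is Closeable (as bdb's StoredIterator is); note that an iterator abandoned before exhaustion would still leak, so this is only a partial answer:

import java.io.Closeable;
import java.io.IOException;
import java.util.Iterator;

final class SelfClosingIterator<T> implements Iterator<T> {
    private final Iterator<T> inner;

    SelfClosingIterator(Iterator<T> inner) { this.inner = inner; }

    @Override public boolean hasNext() {
        boolean more = inner.hasNext();
        if (!more && inner instanceof Closeable) {
            // Iteration finished: release the underlying bdb cursor.
            try { ((Closeable) inner).close(); } catch (IOException ignored) {}
        }
        return more;
    }

    @Override public T next() { return inner.next(); }
}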

Cookies being sent to wrong site

We've now seen a few cases where some cookies are being sent over and over in requests to lots of different sites. For an example WARC from our domain crawl, we see:

WARC/1.0^M
WARC-Type: request^M
WARC-Target-URI: http://gamstop.co.uk/^M
WARC-Date: 2019-06-26T21:57:15Z^M
WARC-Concurrent-To: <urn:uuid:c8c3061e-74aa-4836-8e95-11270677cac7>^M
WARC-Record-ID: <urn:uuid:d51c9542-a658-4901-9445-e5a8e06d4cf3>^M
Content-Type: application/http; msgtype=request^M
Content-Length: 1750^M
^M
GET / HTTP/1.0^M
Connection: Close^M
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8^M
Host: gamstop.co.uk^M
User-Agent: bl.uk_ldfc_bot/3.4.0-20190418 (+http://www.bl.uk/aboutus/legaldeposit/websites/websites/faqswebmaster/index.html)^M
Cookie: Country=GB; EPRAT=1966168438-1561106086442; ESTN=2; PHPSESSID=p4bsl9dkml874kbnrs28isqrl2; PM_unread_Onlineshop="6669,9502,6679,5575,6055,6936,9117,5574,6670,6673,6627,70170,6814,6595,6464,6003,5582,5338,6938,9488,6811,9360,6597,9366,5586,9823,9454,9377,9453,7368,6573,6121,9800,9847,5558,9801,6267,6592,9057,5583,9843,6463,9542,9811,9814,70038,5584,9812,6672,9420,9362,9363,6596,9424,5560,6590,9816,9844,9464,6389,9275,5559,6598,9806,9805,9842,70219,6939,9815,6410,9472,6215,9807,9808,6594,6591,9830,6476,9421,9423,70013,9404,9361,5131,6584,6593,6599,6426,6354,6279,6578,9822,9833,5398,9438,70107,9435,9491,6586,6581,9827,9852,5364,9832,9846,9277,9276,70175,9819,6158,70041,70124,9448,70043,9279,5653,9442,6588,5399,9834,70121,70120,9375,6589,70106,70025,9818,6495,6214,6576,9425,70040,9456,9474,70026,70044,9813,9447,6574,6922,9851,9379,5657,6577,9356,9804,6575,70042,9383,9378,9492,6506,7700,9321,9426,6587,70123,9826,7892,9825,9439,9440,9355,9352,70045,9353,9350,70029,70032,70030,9446,9380,9381,6585,6583,9449,9441,9437,9359,70031,70027,9357,9365,9809,9418,70119,9810,70037,9457,9817,9824,70122,9376,70028,9358,9487,9322,70118,5626,9462,9455,9278,9417,9465,9463,9468,9419,9354,9415,9820,9490,9390,9406,9428,9416,9351,70039,5650,9466,9850,9055,70012,9467,9401,9803,9427,9831,9373,9405,9845,70172,9436,9489,9853,9059,9403,9471,9384,9543,70173,6270,9469,9340,9345,9364,5130,9821,9343,9342,9374,9473,9341,5656,9422,70171,9382,9144,9389,70174,6427,9344,9402,9802,9470"; PSACountry=GB^M
^M
^M
^M

This appears in large numbers of requests, and because the Cookie header is so large it sometimes causes problems (we were blocked with a 403 by http://gamstop.co.uk/ in this example).

We've visited the site before in this crawl, but these cookies didn't turn up there.

BdbCookieStore iterator() not implemented, triggered from RetryExec

java.lang.RuntimeException: not implemented
    at org.archive.modules.fetcher.BdbCookieStore$RestrictedCollectionWrappedList.iterator(BdbCookieStore.java:81)
    at org.apache.http.client.protocol.RequestAddCookies.process(RequestAddCookies.java:164)
    at org.apache.http.protocol.ImmutableHttpProcessor.process(ImmutableHttpProcessor.java:133)
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:183)
    at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:72)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
    at org.archive.modules.fetcher.FetchHTTPRequest.execute(FetchHTTPRequest.java:745)
    at org.archive.modules.fetcher.FetchHTTP.innerProcess(FetchHTTP.java:658)
    at org.archive.modules.Processor.innerProcessResult(Processor.java:175)
    at org.archive.modules.Processor.process(Processor.java:142)
    at org.archive.modules.ProcessorChain.process(ProcessorChain.java:131)
    at org.archive.crawler.framework.ToeThread.run(ToeThread.java:148)

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder"

Hello,

I'm running the latest release of Heritrix, version 3.4.0-20190207, and get the following errors or warnings in my log. I am unable to connect with a browser as a result. I'm currently running it on RH7 with Java 8, and I applied the same fix as for the earlier 3.2 release to make it work with Java 8 on RH7. Is there something I need to do differently with this release in order for it to work with Java 8?

Here are my logs:

Verify in browser before accepting exception.
2019-02-15 16:39:28.122 WARNING thread-1 org.archive.crawler.framework.Engine.findJobConfigs() invalid job directory: ./jobs/..gitignore where job expected from: ./jobs/..gitignore
2019-02-15 16:39:28.145 WARNING thread-1 org.archive.crawler.framework.Engine.findJobConfigs() invalid job directory: ./jobs/.gitignore where job expected from: ./jobs/.gitignore
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
engine listening at port 8443

-50 NOTCRAWLED https://podcrto.si/

Hi,

I am trying to harvest this site: https://podcrto.si
As a national library we harvest several domains to preserve the information.
I tried with Heritrix 1.14.4 and 3.4, but without success.

I'm getting this:

[code] [status] [seed] [redirect]
-50 NOTCRAWLED https://podcrto.si/
200 CRAWLED https://e-uprava.gov.si/

and

LONGEST#2:
Queue si,podcrto, (p3)
2 items
wakes in: 13m45s77ms
last enqueued: https://podcrto.si/robots.txt
last peeked: https://podcrto.si/robots.txt
total expended: 15 (total budget: -1)
active balance: 2985
last(avg) cost: 1(1)
totalScheduled fetchSuccesses fetchFailures fetchDisregards fetchResponses robotsDenials successBytes totalBytes fetchNonResponses lastSuccessTime
3 1 0 0 1 0 54 54 16 2019-05-23T07:14:29.825Z
SimplePrecedenceProvider
3

Can anyone help or explain what could be the reason for this?
Thank you in advance.

Best,
Matjaž

SEVERE Cannot Find Class: java.lang.ClassNotFoundException: org.archive.modules.fetcher.BdbCookieStorage; ; Can't create bean 'metadata'

Hello,

When I attempt to run jobs I get Java-related errors.

2016-11-17T14:46:14.449Z INFO Job instantiated
2016-11-17T14:46:16.296Z INFO Job launched
2016-11-17T14:46:18.220Z INFO PREPARING 20161117144617
2016-11-17T14:46:18.715Z INFO PAUSED 20161117144617
2016-11-17T14:46:22.428Z INFO RUNNING 20161117144617
2016-11-17T22:01:53.986Z INFO STOPPING 20161117144617
2016-11-17T22:01:53.987Z INFO EMPTY 20161117144617
2016-11-17T22:01:55.767Z INFO FINISHED 20161117144617
2019-02-26T15:36:50.797Z SEVERE Cannot find class [org.archive.modules.fetcher.BdbCookieStorage] for bean with name 'cookieStorage' defined in URL [file:/srv/heritrix/./jobs/ks-3Dcenter-20161117/crawler-beans.cxml]; nested exception is java.lang.ClassNotFoundException: org.archive.modules.fetcher.BdbCookieStorage; ; Can't create bean 'metadata'
2019-02-26T15:36:50.797Z SEVERE Cannot find class [org.archive.modules.fetcher.BdbCookieStorage] for bean with name 'cookieStorage' defined in URL [file:/srv/heritrix/./jobs/ks-3Dcenter-20161117/crawler-beans.cxml]; nested exception is java.lang.ClassNotFoundException: org.archive.modules.fetcher.BdbCookieStorage; ; Can't create bean 'metadata'
2019-02-26T15:36:54.270Z SEVERE Cannot find class [org.archive.modules.fetcher.BdbCookieStorage] for bean with name 'cookieStorage' defined in URL [file:/srv/heritrix/./jobs/ks-3Dcenter-20161117/crawler-beans.cxml]; nested exception is java.lang.ClassNotFoundException: org.archive.modules.fetcher.BdbCookieStorage; ; Can't create bean 'metadata'
2019-02-26T15:36:54.270Z SEVERE Cannot find class [org.archive.modules.fetcher.BdbCookieStorage] for bean with name 'cookieStorage' defined in URL [file:/srv/heritrix/./jobs/ks-3Dcenter-20161117/crawler-beans.cxml]; nested exception is java.lang.ClassNotFoundException: org.archive.modules.fetcher.BdbCookieStorage; ; Can't create bean 'metadata'
2019-02-26T15:36:54.272Z SEVERE Can't launch problem configuration
2019-02-26T15:36:54.272Z SEVERE Can't launch problem configuration

Many Thanks,

Tpegues2

Heritrix treats inline images as relative URLs

For example, on this page http://haggmark.dk/solgt/oversigt there is an element:

<div class="ejendom rammebaggrund"><a href="http://haggmark.dk/sag/13009"><div class="solgtlabel"><img src="http://haggmark.dk/foto/SolgtLabel&#xA; " style="width:70px;height:70px; border: none;"></div><img class="foto" src="data:image/png;base64,/9j/4AAQSkZJRgABAQEAYABg ...

Heritrix constructs an enormous URL by concatenating the base64-encoded data as if it were a relative path.
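
The missing guard is small: a src value that already carries a scheme, data: in particular, must never be resolved against the base URL. A hypothetical sketch, not the actual extractor code:

import java.net.URI;

final class SrcResolver {
    // data: URIs embed the image inline; resolving them as relative paths
    // produces the enormous bogus URLs described above.
    static String resolve(URI base, String src) {
        String s = src.trim();
        if (s.regionMatches(true, 0, "data:", 0, 5)) {
            return s; // use as-is, or skip scheduling it entirely
        }
        return base.resolve(s).toString();
    }
}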

Heritrix Fails to Build from Source on 32bit Raspberry Pi 1 (missing libjnidispatch.so)

heritrix_runner@computenode1softf1com ~/t66versioon/heritrix3 $ ls
README.md  commons  contrib  dist  engine  modules  pom.xml
heritrix_runner@computenode1softf1com ~/t66versioon/heritrix3 $ time nice -n20 mvn clean
[INFO] Scanning for projects...
[INFO] Reactor build order: 
[INFO]   Heritrix 3
[INFO]   Heritrix 3: 'commons' subproject (utility classes)
[INFO]   Heritrix 3: 'modules' subproject (reusable components)
[INFO]   Heritrix 3: 'engine' subproject
[INFO]   Heritrix 3 (distribution bundles)
[INFO] ------------------------------------------------------------------------
[INFO] Building Heritrix 3
[INFO]    task-segment: [clean]
[INFO] ------------------------------------------------------------------------
[INFO] [clean:clean {execution: default-clean}]
[INFO] ------------------------------------------------------------------------
[INFO] Building Heritrix 3: 'commons' subproject (utility classes)
[INFO]    task-segment: [clean]
[INFO] ------------------------------------------------------------------------
[INFO] [clean:clean {execution: default-clean}]
[INFO] Deleting /home/heritrix_runner/t66versioon/heritrix3/commons/target
[INFO] ------------------------------------------------------------------------
[INFO] Building Heritrix 3: 'modules' subproject (reusable components)
[INFO]    task-segment: [clean]
[INFO] ------------------------------------------------------------------------
[INFO] [clean:clean {execution: default-clean}]
[INFO] ------------------------------------------------------------------------
[INFO] Building Heritrix 3: 'engine' subproject
[INFO]    task-segment: [clean]
[INFO] ------------------------------------------------------------------------
[INFO] [clean:clean {execution: default-clean}]
[INFO] ------------------------------------------------------------------------
[INFO] Building Heritrix 3 (distribution bundles)
[INFO]    task-segment: [clean]
[INFO] ------------------------------------------------------------------------
[INFO] [clean:clean {execution: default-clean}]
[INFO] 
[INFO] 
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO] ------------------------------------------------------------------------
[INFO] Heritrix 3 ............................................ SUCCESS [1:11.151s]
[INFO] Heritrix 3: 'commons' subproject (utility classes) .... SUCCESS [36.048s]
[INFO] Heritrix 3: 'modules' subproject (reusable components)  SUCCESS [0.733s]
[INFO] Heritrix 3: 'engine' subproject ....................... SUCCESS [0.759s]
[INFO] Heritrix 3 (distribution bundles) ..................... SUCCESS [29.488s]
[INFO] ------------------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESSFUL
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 2 minutes 32 seconds
[INFO] Finished at: Sat Oct 08 06:39:37 UTC 2016
[INFO] Final Memory: 6M/16M
[INFO] ------------------------------------------------------------------------

real    2m50.035s
user    2m43.060s
sys 0m0.950s
heritrix_runner@computenode1softf1com ~/t66versioon/heritrix3 $ sync
heritrix_runner@computenode1softf1com ~/t66versioon/heritrix3 $ time nice -n20 mvn package
[INFO] Scanning for projects...
[INFO] Reactor build order: 
[INFO]   Heritrix 3
[INFO]   Heritrix 3: 'commons' subproject (utility classes)
[INFO]   Heritrix 3: 'modules' subproject (reusable components)
[INFO]   Heritrix 3: 'engine' subproject
[INFO]   Heritrix 3 (distribution bundles)
[INFO] ------------------------------------------------------------------------
[INFO] Building Heritrix 3
[INFO]    task-segment: [package]
[INFO] ------------------------------------------------------------------------
[INFO] [site:attach-descriptor {execution: default-attach-descriptor}]
[INFO] ------------------------------------------------------------------------
[INFO] Building Heritrix 3: 'commons' subproject (utility classes)
[INFO]    task-segment: [package]
[INFO] ------------------------------------------------------------------------
[debug] execute contextualize
[INFO] [resources:resources {execution: default-resources}]
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 6 resources
[WARNING] While downloading poi:poi:2.5.1
  This artifact has been relocated to poi:poi:2.5.1-final-20040804.


[WARNING] While downloading itext:itext:1.3
  This artifact has been relocated to com.lowagie:itext:1.3.


[INFO] [compiler:compile {execution: default-compile}]
[INFO] Changes detected - recompiling the module!
[INFO] Compiling 77 source files to /home/heritrix_runner/t66versioon/heritrix3/commons/target/classes
[WARNING] /home/heritrix_runner/t66versioon/heritrix3/commons/src/main/java/org/archive/bdb/KryoBinding.java:[7,63] sun.reflect.ReflectionFactory is internal proprietary API and may be removed in a future release
[WARNING] /home/heritrix_runner/t66versioon/heritrix3/commons/src/main/java/org/archive/bdb/AutoKryo.java:[53,28] sun.reflect.ReflectionFactory is internal proprietary API and may be removed in a future release
[WARNING] /home/heritrix_runner/t66versioon/heritrix3/commons/src/main/java/org/archive/bdb/AutoKryo.java:[53,67] sun.reflect.ReflectionFactory is internal proprietary API and may be removed in a future release
[INFO] /home/heritrix_runner/t66versioon/heritrix3/commons/src/main/java/org/archive/io/Warc2Arc.java: Some input files use or override a deprecated API.
[INFO] /home/heritrix_runner/t66versioon/heritrix3/commons/src/main/java/org/archive/io/Warc2Arc.java: Recompile with -Xlint:deprecation for details.
[INFO] /home/heritrix_runner/t66versioon/heritrix3/commons/src/main/java/org/archive/util/TestUtils.java: /home/heritrix_runner/t66versioon/heritrix3/commons/src/main/java/org/archive/util/TestUtils.java uses unchecked or unsafe operations.
[INFO] /home/heritrix_runner/t66versioon/heritrix3/commons/src/main/java/org/archive/util/TestUtils.java: Recompile with -Xlint:unchecked for details.
[debug] execute contextualize
[INFO] [resources:testResources {execution: default-testResources}]
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 6 resources
Downloading: http://download.oracle.com/maven/com/esotericsoftware/reflectasm/0.8/reflectasm-0.8.jar
[INFO] Unable to find resource 'com.esotericsoftware:reflectasm:jar:0.8' in repository download.oracle.com,maven (http://download.oracle.com/maven)
Downloading: http://builds.archive.org/maven2/com/esotericsoftware/reflectasm/0.8/reflectasm-0.8.jar
7K downloaded  (reflectasm-0.8.jar)
Downloading: http://download.oracle.com/maven/com/esotericsoftware/minlog/1.2/minlog-1.2.jar
[INFO] Unable to find resource 'com.esotericsoftware:minlog:jar:1.2' in repository download.oracle.com,maven (http://download.oracle.com/maven)
Downloading: http://builds.archive.org/maven2/com/esotericsoftware/minlog/1.2/minlog-1.2.jar
3K downloaded  (minlog-1.2.jar)
[INFO] [compiler:testCompile {execution: default-testCompile}]
[INFO] Changes detected - recompiling the module!
[INFO] Compiling 37 source files to /home/heritrix_runner/t66versioon/heritrix3/commons/target/test-classes
[INFO] /home/heritrix_runner/t66versioon/heritrix3/commons/src/test/java/org/archive/io/arc/ARCWriterTest.java: Some input files use or override a deprecated API.
[INFO] /home/heritrix_runner/t66versioon/heritrix3/commons/src/test/java/org/archive/io/arc/ARCWriterTest.java: Recompile with -Xlint:deprecation for details.
Downloading: http://builds.archive.org/maven2/org/apache/maven/maven-plugin-api/2.0.9/maven-plugin-api-2.0.9.pom
1K downloaded  (maven-plugin-api-2.0.9.pom)
Downloading: http://builds.archive.org/maven2/org/apache/maven/maven/2.0.9/maven-2.0.9.pom
18K downloaded  (maven-2.0.9.pom)
Downloading: http://builds.archive.org/maven2/org/apache/maven/surefire/surefire-booter/2.9/surefire-booter-2.9.pom
2K downloaded  (surefire-booter-2.9.pom)
Downloading: http://builds.archive.org/maven2/org/apache/maven/surefire/surefire-api/2.9/surefire-api-2.9.pom
2K downloaded  (surefire-api-2.9.pom)
Downloading: http://builds.archive.org/maven2/org/apache/maven/surefire/maven-surefire-common/2.9/maven-surefire-common-2.9.pom
3K downloaded  (maven-surefire-common-2.9.pom)
Downloading: http://builds.archive.org/maven2/org/codehaus/plexus/plexus-utils/2.1/plexus-utils-2.1.pom
3K downloaded  (plexus-utils-2.1.pom)
Downloading: http://builds.archive.org/maven2/org/apache/maven/maven-artifact/2.0.9/maven-artifact-2.0.9.pom
1K downloaded  (maven-artifact-2.0.9.pom)
Downloading: http://builds.archive.org/maven2/org/apache/maven/maven-project/2.0.9/maven-project-2.0.9.pom
2K downloaded  (maven-project-2.0.9.pom)
Downloading: http://builds.archive.org/maven2/org/apache/maven/maven-settings/2.0.9/maven-settings-2.0.9.pom
2K downloaded  (maven-settings-2.0.9.pom)
Downloading: http://builds.archive.org/maven2/org/apache/maven/maven-model/2.0.9/maven-model-2.0.9.pom
3K downloaded  (maven-model-2.0.9.pom)
Downloading: http://builds.archive.org/maven2/org/apache/maven/maven-profile/2.0.9/maven-profile-2.0.9.pom
2K downloaded  (maven-profile-2.0.9.pom)
Downloading: http://builds.archive.org/maven2/org/apache/maven/maven-artifact-manager/2.0.9/maven-artifact-manager-2.0.9.pom
2K downloaded  (maven-artifact-manager-2.0.9.pom)
Downloading: http://builds.archive.org/maven2/org/apache/maven/maven-repository-metadata/2.0.9/maven-repository-metadata-2.0.9.pom
1K downloaded  (maven-repository-metadata-2.0.9.pom)
Downloading: http://builds.archive.org/maven2/org/apache/maven/maven-plugin-registry/2.0.9/maven-plugin-registry-2.0.9.pom
1K downloaded  (maven-plugin-registry-2.0.9.pom)
Downloading: http://builds.archive.org/maven2/org/apache/maven/maven-core/2.0.9/maven-core-2.0.9.pom
7K downloaded  (maven-core-2.0.9.pom)
Downloading: http://builds.archive.org/maven2/org/apache/maven/maven-plugin-parameter-documenter/2.0.9/maven-plugin-parameter-documenter-2.0.9.pom
1K downloaded  (maven-plugin-parameter-documenter-2.0.9.pom)
Downloading: http://builds.archive.org/maven2/org/apache/maven/reporting/maven-reporting-api/2.0.9/maven-reporting-api-2.0.9.pom
1K downloaded  (maven-reporting-api-2.0.9.pom)
Downloading: http://builds.archive.org/maven2/org/apache/maven/reporting/maven-reporting/2.0.9/maven-reporting-2.0.9.pom
1K downloaded  (maven-reporting-2.0.9.pom)
Downloading: http://builds.archive.org/maven2/org/apache/maven/maven-error-diagnostics/2.0.9/maven-error-diagnostics-2.0.9.pom
1K downloaded  (maven-error-diagnostics-2.0.9.pom)
Downloading: http://builds.archive.org/maven2/org/apache/maven/maven-plugin-descriptor/2.0.9/maven-plugin-descriptor-2.0.9.pom
2K downloaded  (maven-plugin-descriptor-2.0.9.pom)
Downloading: http://builds.archive.org/maven2/org/apache/maven/maven-monitor/2.0.9/maven-monitor-2.0.9.pom
1K downloaded  (maven-monitor-2.0.9.pom)
Downloading: http://builds.archive.org/maven2/org/apache/maven/maven-toolchain/2.0.9/maven-toolchain-2.0.9.pom
3K downloaded  (maven-toolchain-2.0.9.pom)
Downloading: http://builds.archive.org/maven2/org/apache/maven/shared/maven-common-artifact-filters/1.3/maven-common-artifact-filters-1.3.pom
3K downloaded  (maven-common-artifact-filters-1.3.pom)
Downloading: http://builds.archive.org/maven2/org/apache/maven/shared/maven-shared-components/12/maven-shared-components-12.pom
9K downloaded  (maven-shared-components-12.pom)
Downloading: http://builds.archive.org/maven2/org/apache/maven/maven-parent/13/maven-parent-13.pom
22K downloaded  (maven-parent-13.pom)
Downloading: http://builds.archive.org/maven2/org/apache/maven/surefire/surefire-booter/2.9/surefire-booter-2.9.jar
Downloading: http://builds.archive.org/maven2/org/codehaus/plexus/plexus-utils/2.1/plexus-utils-2.1.jar
Downloading: http://builds.archive.org/maven2/org/apache/maven/shared/maven-common-artifact-filters/1.3/maven-common-artifact-filters-1.3.jar
32K downloaded  (surefire-booter-2.9.jar)
Downloading: http://builds.archive.org/maven2/org/apache/maven/surefire/surefire-api/2.9/surefire-api-2.9.jar
30K downloaded  (maven-common-artifact-filters-1.3.jar)
155K downloaded  (surefire-api-2.9.jar)
Downloading: http://builds.archive.org/maven2/org/apache/maven/surefire/maven-surefire-common/2.9/maven-surefire-common-2.9.jar
219K downloaded  (plexus-utils-2.1.jar)
59K downloaded  (maven-surefire-common-2.9.jar)
[INFO] [surefire:test {execution: default-test}]
[INFO] Surefire report directory: /home/heritrix_runner/t66versioon/heritrix3/commons/target/surefire-reports
Downloading: http://builds.archive.org/maven2/org/apache/maven/surefire/surefire-junit3/2.9/surefire-junit3-2.9.pom
1K downloaded  (surefire-junit3-2.9.pom)
Downloading: http://builds.archive.org/maven2/org/apache/maven/surefire/surefire-providers/2.9/surefire-providers-2.9.pom
2K downloaded  (surefire-providers-2.9.pom)
Downloading: http://builds.archive.org/maven2/org/apache/maven/surefire/surefire-junit3/2.9/surefire-junit3-2.9.jar
25K downloaded  (surefire-junit3-2.9.jar)

-------------------------------------------------------
 T E S T S
-------------------------------------------------------
Running org.archive.io.warc.WARCWriterTest
log4j:ERROR Could not create the Layout. Reported error follows.
java.lang.ClassNotFoundException: org.archive.util.OneLineSimpleLayout
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    at org.apache.maven.surefire.booter.IsolatedClassLoader.loadClass(IsolatedClassLoader.java:93)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:195)
    at org.apache.log4j.helpers.Loader.loadClass(Loader.java:198)
    at org.apache.log4j.xml.DOMConfigurator.parseLayout(DOMConfigurator.java:555)
    at org.apache.log4j.xml.DOMConfigurator.parseAppender(DOMConfigurator.java:269)
    at org.apache.log4j.xml.DOMConfigurator.findAppenderByName(DOMConfigurator.java:176)
    at org.apache.log4j.xml.DOMConfigurator.findAppenderByReference(DOMConfigurator.java:191)
    at org.apache.log4j.xml.DOMConfigurator.parseChildrenOfLoggerElement(DOMConfigurator.java:523)
    at org.apache.log4j.xml.DOMConfigurator.parseRoot(DOMConfigurator.java:492)
    at org.apache.log4j.xml.DOMConfigurator.parse(DOMConfigurator.java:1006)
    at org.apache.log4j.xml.DOMConfigurator.doConfigure(DOMConfigurator.java:872)
    at org.apache.log4j.xml.DOMConfigurator.doConfigure(DOMConfigurator.java:778)
    at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:526)
    at org.apache.log4j.LogManager.<clinit>(LogManager.java:127)
    at org.apache.log4j.Logger.getLogger(Logger.java:104)
    at org.apache.commons.logging.impl.Log4JLogger.getLogger(Log4JLogger.java:229)
    at org.apache.commons.logging.impl.Log4JLogger.<init>(Log4JLogger.java:65)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at org.apache.commons.logging.impl.LogFactoryImpl.newInstance(LogFactoryImpl.java:529)
    at org.apache.commons.logging.impl.LogFactoryImpl.getInstance(LogFactoryImpl.java:235)
    at org.apache.commons.logging.impl.LogFactoryImpl.getInstance(LogFactoryImpl.java:209)
    at org.apache.commons.logging.LogFactory.getLog(LogFactory.java:351)
    at org.archive.util.LaxHttpParser.<clinit>(LaxHttpParser.java:60)
    at org.archive.io.warc.WARCRecord.parseHeaders(WARCRecord.java:113)
    at org.archive.io.warc.WARCRecord.<init>(WARCRecord.java:90)
    at org.archive.io.warc.WARCReader.createArchiveRecord(WARCReader.java:94)
    at org.archive.io.warc.WARCReaderFactory$CompressedWARCReader$1.innerNext(WARCReaderFactory.java:290)
    at org.archive.io.ArchiveReader$ArchiveRecordIterator.exceptionNext(ArchiveReader.java:537)
    at org.archive.io.ArchiveReader$ArchiveRecordIterator.next(ArchiveReader.java:495)
    at org.archive.io.ArchiveReader$ArchiveRecordIterator.next(ArchiveReader.java:436)
    at org.archive.io.ArchiveReader.validate(ArchiveReader.java:249)
    at org.archive.io.warc.WARCWriterTest.validate(WARCWriterTest.java:278)
    at org.archive.io.warc.WARCWriterTest.testWriteRecordCompressed(WARCWriterTest.java:356)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at junit.framework.TestCase.runTest(TestCase.java:164)
    at junit.framework.TestCase.runBare(TestCase.java:130)
    at junit.framework.TestResult$1.protect(TestResult.java:106)
    at junit.framework.TestResult.runProtected(TestResult.java:124)
    at junit.framework.TestResult.run(TestResult.java:109)
    at junit.framework.TestCase.run(TestCase.java:120)
    at junit.framework.TestSuite.runTest(TestSuite.java:230)
    at junit.framework.TestSuite.run(TestSuite.java:225)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.maven.surefire.junit.JUnitTestSet.execute(JUnitTestSet.java:98)
    at org.apache.maven.surefire.junit.JUnit3Provider.executeTestSet(JUnit3Provider.java:117)
    at org.apache.maven.surefire.junit.JUnit3Provider.invoke(JUnit3Provider.java:94)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray(ReflectionUtils.java:164)
    at org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:110)
    at org.apache.maven.surefire.booter.SurefireStarter.invokeProvider(SurefireStarter.java:172)
    at org.apache.maven.surefire.booter.SurefireStarter.runSuitesInProcessWhenForked(SurefireStarter.java:104)
    at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:70)
Oct 08, 2016 7:07:10 AM org.archive.io.warc.WARCWriter writeRecord
SEVERE: could not write record type: resourcefor URL: http://www.archive.org/test/ /index.html
java.lang.IllegalArgumentException: Contains disallowed white space 0x20: http://www.archive.org/test/ /index.html
    at org.archive.io.warc.WARCWriter.checkHeaderValue(WARCWriter.java:148)
    at org.archive.io.warc.WARCWriter.createRecordHeader(WARCWriter.java:193)
    at org.archive.io.warc.WARCWriter.writeRecord(WARCWriter.java:227)
    at org.archive.io.warc.WARCWriterTest.writeRecord(WARCWriterTest.java:394)
    at org.archive.io.warc.WARCWriterTest.holeyUrl(WARCWriterTest.java:439)
    at org.archive.io.warc.WARCWriterTest.testSpaceInURL(WARCWriterTest.java:423)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at junit.framework.TestCase.runTest(TestCase.java:164)
    at junit.framework.TestCase.runBare(TestCase.java:130)
    at junit.framework.TestResult$1.protect(TestResult.java:106)
    at junit.framework.TestResult.runProtected(TestResult.java:124)
    at junit.framework.TestResult.run(TestResult.java:109)
    at junit.framework.TestCase.run(TestCase.java:120)
    at junit.framework.TestSuite.runTest(TestSuite.java:230)
    at junit.framework.TestSuite.run(TestSuite.java:225)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.maven.surefire.junit.JUnitTestSet.execute(JUnitTestSet.java:98)
    at org.apache.maven.surefire.junit.JUnit3Provider.executeTestSet(JUnit3Provider.java:117)
    at org.apache.maven.surefire.junit.JUnit3Provider.invoke(JUnit3Provider.java:94)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray(ReflectionUtils.java:164)
    at org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:110)
    at org.apache.maven.surefire.booter.SurefireStarter.invokeProvider(SurefireStarter.java:172)
    at org.apache.maven.surefire.booter.SurefireStarter.runSuitesInProcessWhenForked(SurefireStarter.java:104)
    at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:70)

Oct 08, 2016 7:07:10 AM org.archive.io.warc.WARCWriter writeRecord
SEVERE: could not write record type: resourcefor URL: http://www.archive.org/test/  /index.html
java.lang.IllegalArgumentException: Contains illegal character 0x9: http://www.archive.org/test/    /index.html
    at org.archive.io.warc.WARCWriter.baseCharacterCheck(WARCWriter.java:137)
    at org.archive.io.warc.WARCWriter.checkHeaderValue(WARCWriter.java:146)
    at org.archive.io.warc.WARCWriter.createRecordHeader(WARCWriter.java:193)
    at org.archive.io.warc.WARCWriter.writeRecord(WARCWriter.java:227)
    at org.archive.io.warc.WARCWriterTest.writeRecord(WARCWriterTest.java:394)
    at org.archive.io.warc.WARCWriterTest.holeyUrl(WARCWriterTest.java:439)
    at org.archive.io.warc.WARCWriterTest.testTabInURL(WARCWriterTest.java:428)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at junit.framework.TestCase.runTest(TestCase.java:164)
    at junit.framework.TestCase.runBare(TestCase.java:130)
    at junit.framework.TestResult$1.protect(TestResult.java:106)
    at junit.framework.TestResult.runProtected(TestResult.java:124)
    at junit.framework.TestResult.run(TestResult.java:109)
    at junit.framework.TestCase.run(TestCase.java:120)
    at junit.framework.TestSuite.runTest(TestSuite.java:230)
    at junit.framework.TestSuite.run(TestSuite.java:225)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.maven.surefire.junit.JUnitTestSet.execute(JUnitTestSet.java:98)
    at org.apache.maven.surefire.junit.JUnit3Provider.executeTestSet(JUnit3Provider.java:117)
    at org.apache.maven.surefire.junit.JUnit3Provider.invoke(JUnit3Provider.java:94)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray(ReflectionUtils.java:164)
    at org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:110)
    at org.apache.maven.surefire.booter.SurefireStarter.invokeProvider(SurefireStarter.java:172)
    at org.apache.maven.surefire.booter.SurefireStarter.runSuitesInProcessWhenForked(SurefireStarter.java:104)
    at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:70)

Tests run: 9, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 15.091 sec
Running org.archive.io.BufferedSeekInputStreamTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.027 sec
Running org.archive.io.RepositionableInputStreamTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.042 sec
Running org.archive.io.ArchiveTest
Tests run: 0, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.02 sec
Running org.archive.io.RecordingInputStreamTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 9.868 sec
Running org.archive.io.arc.ARCWriterTest
Oct 08, 2016 7:07:25 AM org.archive.util.DevUtils warnHandle
WARNING: java.lang.Throwable: Gap between expected and actual: -1
 writing arc /tmp/heritrix-junit-tests/testGapError-JUNIT.arc.gz.open
    at org.archive.io.arc.ARCWriter.write(ARCWriter.java:398)
    at org.archive.io.arc.ARCWriter.write(ARCWriter.java:357)
    at org.archive.io.arc.ARCWriterTest.testGapError(ARCWriterTest.java:522)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at junit.framework.TestCase.runTest(TestCase.java:164)
    at junit.framework.TestCase.runBare(TestCase.java:130)
    at junit.framework.TestResult$1.protect(TestResult.java:106)
    at junit.framework.TestResult.runProtected(TestResult.java:124)
    at junit.framework.TestResult.run(TestResult.java:109)
    at junit.framework.TestCase.run(TestCase.java:120)
    at junit.framework.TestSuite.runTest(TestSuite.java:230)
    at junit.framework.TestSuite.run(TestSuite.java:225)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.maven.surefire.junit.JUnitTestSet.execute(JUnitTestSet.java:98)
    at org.apache.maven.surefire.junit.JUnit3Provider.executeTestSet(JUnit3Provider.java:117)
    at org.apache.maven.surefire.junit.JUnit3Provider.invoke(JUnit3Provider.java:94)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray(ReflectionUtils.java:164)
    at org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:110)
    at org.apache.maven.surefire.booter.SurefireStarter.invokeProvider(SurefireStarter.java:172)
    at org.apache.maven.surefire.booter.SurefireStarter.runSuitesInProcessWhenForked(SurefireStarter.java:104)
    at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:70)

Tests run: 19, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 27.06 sec
Running org.archive.io.arc.ARCWriterPoolTest
Oct 08, 2016 7:07:50 AM org.archive.io.WriterPool <init>
INFO: Initial configuration: prefix=TEST, template=${prefix}-${timestamp17}-${serialno}-${heritrix.hostname}, compress=true, maxSize=100000000, maxActive=3, maxWait=100
Oct 08, 2016 7:07:51 AM org.archive.io.WriterPool <init>
INFO: Initial configuration: prefix=TEST, template=${prefix}-${timestamp17}-${serialno}-${heritrix.hostname}, compress=true, maxSize=100000000, maxActive=3, maxWait=100
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.979 sec
Running org.archive.io.arc.ARCReaderFactoryTest
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.282 sec
Running org.archive.io.ArchiveReaderFactoryTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.537 sec
Running org.archive.io.HeaderedArchiveRecordTest
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.143 sec
Running org.archive.io.ReplayCharSequenceTest
Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 13.474 sec
Running org.archive.io.SinkHandlerTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.032 sec
Running org.archive.uid.UUIDGeneratorTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.02 sec
Running org.archive.settings.file.BdbModuleTest
Oct 08, 2016 7:08:34 AM org.archive.util.FilesystemLinkMaker makeHardLink
WARNING: hard links not supported on this platform - java.lang.UnsatisfiedLinkError: jnidispatch (/com/sun/jna/linux-arm/libjnidispatch.so) not found in resource path
Oct 08, 2016 7:08:34 AM org.archive.bdb.BdbModule doCheckpoint
SEVERE: unable to create required checkpoint link /tmp/heritrix-junit-tests/bdb/cp00998-20161008070830/00000000.jdb,51813
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 40.176 sec <<< FAILURE!
Running org.archive.settings.file.PrefixFinderTest
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.116 sec
Running org.archive.bdb.StoredQueueTest
Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 6.522 sec
Running org.archive.surt.SURTTokenizerTest
Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.984 sec
Running org.archive.util.fingerprint.LongFPSetCacheTest
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 90.55 sec
Running org.archive.util.fingerprint.MemLongFPSetTest
Oct 08, 2016 7:10:32 AM org.archive.util.fingerprint.MemLongFPSet grow
INFO: Doubling fingerprinting slots to 1024
Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 10.946 sec
Running org.archive.util.fingerprint.ArrayLongFPCacheTest
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.061 sec
Running org.archive.util.LongToIntConsistentHashTest
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1,243.576 sec
Exception in thread "ThreadedStreamConsumer" org.apache.maven.surefire.util.NestedRuntimeException: null; nested exception is org.apache.maven.surefire.report.ReporterException: Unable to create file: /home/heritrix_runner/t66versioon/heritrix3/commons/target/surefire-reports/TEST-org.archive.util.LongToIntConsistentHashTest.xml (No such file or directory); nested exception is java.io.FileNotFoundException: /home/heritrix_runner/t66versioon/heritrix3/commons/target/surefire-reports/TEST-org.archive.util.LongToIntConsistentHashTest.xml (No such file or directory)
org.apache.maven.surefire.report.ReporterException: Unable to create file: /home/heritrix_runner/t66versioon/heritrix3/commons/target/surefire-reports/TEST-org.archive.util.LongToIntConsistentHashTest.xml (No such file or directory); nested exception is java.io.FileNotFoundException: /home/heritrix_runner/t66versioon/heritrix3/commons/target/surefire-reports/TEST-org.archive.util.LongToIntConsistentHashTest.xml (No such file or directory)
java.io.FileNotFoundException: /home/heritrix_runner/t66versioon/heritrix3/commons/target/surefire-reports/TEST-org.archive.util.LongToIntConsistentHashTest.xml (No such file or directory)
    at java.io.FileOutputStream.open(Native Method)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:171)
    at org.apache.maven.surefire.report.XMLReporter.testSetCompleted(XMLReporter.java:123)
    at org.apache.maven.surefire.report.MulticastingReporter.testSetCompleted(MulticastingReporter.java:51)
    at org.apache.maven.surefire.report.TestSetRunListener.testSetCompleted(TestSetRunListener.java:115)
    at org.apache.maven.plugin.surefire.booterclient.output.ForkClient.consumeLine(ForkClient.java:97)
    at org.apache.maven.plugin.surefire.booterclient.output.ThreadedStreamConsumer$Pumper.run(ThreadedStreamConsumer.java:67)
    at java.lang.Thread.run(Thread.java:745)

Results :

Tests in error: 
  testDoCheckpoint(org.archive.settings.file.BdbModuleTest): Could not initialize class org.archive.util.CLibrary

Tests run: 91, Failures: 0, Errors: 1, Skipped: 0

[INFO] ------------------------------------------------------------------------
[ERROR] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] There are test failures.

Please refer to /home/heritrix_runner/t66versioon/heritrix3/commons/target/surefire-reports for the individual test results.
[INFO] ------------------------------------------------------------------------
[INFO] For more information, run Maven with the -e switch
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 51 minutes 24 seconds
[INFO] Finished at: Sat Oct 08 07:31:27 UTC 2016
[INFO] Final Memory: 36M/88M
[INFO] ------------------------------------------------------------------------

real    51m43.995s
user    49m36.060s
sys 0m6.720s
heritrix_runner@computenode1softf1com ~/t66versioon/heritrix3 

The system:

pi@computenode1softf1com ~ $ uname -a
Linux computenode1softf1com 4.1.19+ #858 Tue Mar 15 15:52:03 GMT 2016 armv6l GNU/Linux
pi@computenode1softf1com ~ $ date
Sat Oct  8 18:30:19 UTC 2016
pi@computenode1softf1com ~ $ 

Intermittent problems with Kryo serialisation for crawls resumed from checkpoints

I'm hitting problems when re-using crawl state (checkpoints). I get a lot of errors like:

WARNING: com.google.common.cache.LocalCache processPendingNotifications Exception thrown by removal listener [Tue Mar 19 12:07:00 GMT 2019]
java.lang.IllegalArgumentException: Can not set org.archive.modules.fetcher.FetchStats field org.archive.crawler.frontier.WorkQueue.substats to java.lang.Byte
        at sun.reflect.UnsafeFieldAccessorImpl.throwSetIllegalArgumentException(UnsafeFieldAccessorImpl.java:167)
        at sun.reflect.UnsafeFieldAccessorImpl.throwSetIllegalArgumentException(UnsafeFieldAccessorImpl.java:171)
        at sun.reflect.UnsafeObjectFieldAccessorImpl.set(UnsafeObjectFieldAccessorImpl.java:81)
        at java.lang.reflect.Field.set(Field.java:764)
        at com.esotericsoftware.kryo.serialize.FieldSerializer$CachedField.set(FieldSerializer.java:290)
        at com.esotericsoftware.kryo.serialize.FieldSerializer.readObjectData(FieldSerializer.java:209)
        at com.esotericsoftware.kryo.serialize.FieldSerializer.readObjectData(FieldSerializer.java:178)
        at com.esotericsoftware.kryo.Kryo.readObjectData(Kryo.java:512)
        at com.esotericsoftware.kryo.ObjectBuffer.readObjectData(ObjectBuffer.java:212)
        at org.archive.bdb.KryoBinding.entryToObject(KryoBinding.java:84)
        at com.sleepycat.collections.DataView.makeValue(DataView.java:595)
        at com.sleepycat.collections.DataCursor.getCurrentValue(DataCursor.java:349)
        at com.sleepycat.collections.DataCursor.initForPut(DataCursor.java:813)
        at com.sleepycat.collections.DataCursor.put(DataCursor.java:751)
        at com.sleepycat.collections.StoredContainer.putKeyValue(StoredContainer.java:321)
        at com.sleepycat.collections.StoredMap.put(StoredMap.java:279)
        at org.archive.util.ObjectIdentityBdbManualCache$1.onRemoval(ObjectIdentityBdbManualCache.java:119)
        at com.google.common.cache.LocalCache.processPendingNotifications(LocalCache.java:1954)
        at com.google.common.cache.LocalCache$Segment.runUnlockedCleanup(LocalCache.java:3457)
        at com.google.common.cache.LocalCache$Segment.postWriteCleanup(LocalCache.java:3433)
        at com.google.common.cache.LocalCache$Segment.put(LocalCache.java:2888)
        at com.google.common.cache.LocalCache.put(LocalCache.java:4146)
        at org.archive.util.ObjectIdentityBdbManualCache.dirtyKey(ObjectIdentityBdbManualCache.java:374)
        at org.archive.crawler.frontier.WorkQueue.makeDirty(WorkQueue.java:688)
        at org.archive.crawler.frontier.WorkQueueFrontier.processFinish(WorkQueueFrontier.java:1016)
        at org.archive.crawler.frontier.AbstractFrontier.finished(AbstractFrontier.java:569)
        at org.archive.crawler.framework.ToeThread.run(ToeThread.java:187)

One possible cause is that the Kryo serialisers are not getting set up right.

As I understand it, the reflection-based auto-registration magic attempts to register the classes needed; per the documentation, this saves storage space but relies on classes being registered in a consistent order (so the same classes get the same IDs).

However, this registration appears to happen on the Spring Lifecycle.start() event, e.g. org.archive.modules.net.BdbServerCache.start() or org.archive.crawler.frontier.WorkQueueFrontier.start() and AFAICT nothing is explicitly enforcing the order of these events.

It looks like the latter leads to

public static void autoregisterTo(AutoKryo kryo) {
    // kryo.register(CrawlURI.class, new DeflateCompressor(kryo.newSerializer(CrawlURI.class)));
    kryo.register(CrawlURI.class);
    kryo.autoregister(byte[].class);

(i.e. there we see Byte getting registered) and the former leads to

public static void autoregisterTo(AutoKryo kryo) {
    kryo.register(CrawlServer.class);
    kryo.autoregister(FetchStats.class);

(i.e. there's FetchStats) which seems suspicious. However, in both cases, the autoregistered class is the second class to get registered, not the first, so it's not clear why this would be the case.

I'm having trouble understanding exactly what goes on with Kryo 1 and thread context, and therefore whether the reference IDs are global, ThreadLocal, or AutoKryo-instance scoped.

I'm left to assume I must have missed something, otherwise this would never have worked reliably at all!
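
For what it's worth, one way to rule out ordering effects would be to perform all registrations in a single, fixed sequence at startup rather than from whichever Lifecycle.start() fires first. A minimal sketch, assuming AutoKryo is directly constructible and using only the autoregisterTo methods quoted above:

import org.archive.bdb.AutoKryo;
import org.archive.modules.CrawlURI;
import org.archive.modules.net.CrawlServer;

public class DeterministicKryoSetup {
    // Register everything in one explicit order so class IDs stay stable
    // across checkpoint and restore, regardless of Spring event ordering.
    public static AutoKryo build() {
        AutoKryo kryo = new AutoKryo();
        CrawlURI.autoregisterTo(kryo);    // registers CrawlURI, byte[], ... always first
        CrawlServer.autoregisterTo(kryo); // registers CrawlServer, FetchStats, ... always second
        return kryo;
    }
}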

Refetching of failed URLs based on HTTP status codes/content

Heritrix should support refetching failed URLs, automatically retrying them later when the HTTP status code is one of a user-configured set (for example, Cloudflare uses special codes like 529 when content is fetched from a site too quickly) or when the returned content matches configured regular expression(s).
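
A minimal sketch of the decision logic being requested (the class and its wiring are hypothetical, not an existing Heritrix API; 429/529 are example status codes):

import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

public class RefetchPolicy {
    private final Set<Integer> retryStatusCodes; // user-configured, e.g. 429, 529
    private final Pattern retryContentPattern;   // optional content trigger; may be null

    public RefetchPolicy(Set<Integer> retryStatusCodes, Pattern retryContentPattern) {
        this.retryStatusCodes = new HashSet<>(retryStatusCodes);
        this.retryContentPattern = retryContentPattern;
    }

    // True when the fetch should be re-queued for a later attempt.
    public boolean shouldRetry(int fetchStatus, String body) {
        return retryStatusCodes.contains(fetchStatus)
                || (retryContentPattern != null
                    && body != null
                    && retryContentPattern.matcher(body).find());
    }
}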

HTML extractor does not handle the base href correctly when it's relative

From archive-crawler:

On Tue, Jul 3, 2018 at 3:51 AM [email protected] [archive-crawler] [email protected] wrote:

At the moment Heritrix (3.3.0 snapshot) seems to ignore base href tags.

Example:
https://www.schmid-gartenpflanzen.de/forum/index.php/mv/msg/7627/216142/0/

IMG_7651kl.jpg

Web browsers interpret the link this way:
https://www.schmid-gartenpflanzen.de/forum/index.php/fa/89652/0/

However, Heritrix calls this url:
https://www.schmid-gartenpflanzen.de/forum/index.php/mv/msg/7627/216142/0/index.php/fa/89652/0/

This behaviour leads to thousands of false URLs being crawled. From my point of view, this Heritrix behaviour is a bug, isn't it?

If I construct a test case with a relative base href, I get:

INFO: https://www.schmid-gartenpflanzen.de/forum/index.php/mv/msg/7627/216142/0/
org.apache.commons.httpclient.URIException: Relative URI but no base: /forum/
	at org.archive.url.UsableURIFactory.fixup(UsableURIFactory.java:416)
	at org.archive.url.UsableURIFactory.create(UsableURIFactory.java:275)
	at org.archive.url.UsableURIFactory.create(UsableURIFactory.java:265)
	at org.archive.net.UURIFactory.getInstance(UURIFactory.java:44)
	at org.archive.modules.extractor.ExtractorHTML.processGeneralTag(ExtractorHTML.java:426)
	at org.archive.modules.extractor.ExtractorHTML.extract(ExtractorHTML.java:850)
	at org.archive.modules.extractor.ExtractorHTMLTest.testRelativeBaseHrefRelativeLinks(ExtractorHTMLTest.java:284)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

So I think this is a bug.
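
For comparison, java.net.URI reproduces the browser behaviour: a relative <base href> is itself resolved against the document URI first, and links are then resolved against the result. A small sketch using the URLs from the report:

import java.net.URI;

public class BaseHrefDemo {
    public static void main(String[] args) {
        URI page = URI.create(
            "https://www.schmid-gartenpflanzen.de/forum/index.php/mv/msg/7627/216142/0/");
        // Step 1: resolve the (relative) base href against the document URI.
        URI base = page.resolve("/forum/");
        // Step 2: resolve the page's relative link against that base.
        URI link = base.resolve("index.php/fa/89652/0/");
        // Prints https://www.schmid-gartenpflanzen.de/forum/index.php/fa/89652/0/
        System.out.println(link);
    }
}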

Need way to securely verify Internet Archive crawler

I could be mistaken, but I haven't been able to find the documentation on how to verify the IA crawler. The closest thing I've found is that you can check the User-Agent string of the crawler, but that's easily faked. My issue is that I want to invite the IA crawler to crawl my content, but I want to detect things like spammers and block them.

Google and Bing both handle this by using a reverse DNS request of the IP address of the crawler, followed by a regular DNS request checking the host returned by the reverse DNS.

Put another way:

host 66.249.66.1
1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.

host crawl-66-249-66-1.googlebot.com
crawl-66-249-66-1.googlebot.com has address 66.249.66.1

So, since the second host command returns the same IP that we started with, and since the domain ends with googlebot.com, we're in business.

Here's google's docs: https://support.google.com/webmasters/answer/80553?hl=en
And Bings: https://www.bing.com/webmaster/help/how-to-verify-bingbot-3905dc26

Could IA add this feature too? I think it would only require that you do some work with your DNS whenever you have a new IP address.
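
A minimal sketch of that reverse-then-forward check in Java (using the googlebot.com example above; IA would need to publish its own stable hostname suffix for this to work):

import java.net.InetAddress;

public class CrawlerVerifier {
    // Reverse-resolve the IP, check the hostname suffix, then forward-resolve
    // the hostname and confirm it maps back to the original IP.
    public static boolean isVerifiedCrawler(String ip, String domainSuffix)
            throws Exception {
        InetAddress addr = InetAddress.getByName(ip);
        String host = addr.getCanonicalHostName(); // reverse DNS (PTR lookup)
        if (!host.endsWith(domainSuffix)) {
            return false; // no PTR record, or wrong domain
        }
        for (InetAddress fwd : InetAddress.getAllByName(host)) { // forward DNS
            if (fwd.getHostAddress().equals(ip)) {
                return true; // round trip matches
            }
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(isVerifiedCrawler("66.249.66.1", ".googlebot.com"));
    }
}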

Bad requests with GTM

I have noticed a lot of bad requests from archive.org's crawler on our sites using Google Tag Manager. For instance:

/in.tag/11/10/2024/gtm.start/
/mouseup.dismiss/11/10/2024/gtm.start
/mousedown.dismiss/
/gtm.load/gtm.start/11/10/2024
/json/11/10/2024/gtm.start
/11/10/2024/gtm.js/

These are starting to add noticeable load on the server (which serves many sites).

I understand Heritrix is speculatively trying URLs based on the Javascript code, which is known to sometimes result in 404s. But GTM is used on many websites, so these issues are bad for everybody. Could this speculation be improved to take Google's code into account? Alternatively, is there a way to disable that speculation with robots.txt?

Failed DNS requests remain enqueued

It seems that failed DNS requests remain enqueued. This can keep a crawl job in the RUNNING state, long (i.e. hours) after it has fetched all its fetchable URIs.

This behaviour can be reproduced by creating a job with just one seed, containing a hostname that does not exist. Heritrix will put this job in the RUNNING state, while effectively doing nothing for (IIRC) about 7 hours.

This is somewhat of a pain for the Web Curator Tool, because we use the number of currently running Heritrix jobs to determine whether enough resources are available to start a new crawl job.

Is this easy to fix? Or is there perhaps an easy way to mitigate this issue, e.g. via the scripting API?

This issue seems to be related to #234 and #198.

Do not require DNS when using a web proxy

Hi,

I am trying to crawl some WWW domain from behind a web proxy in a corporate network using Heritrix build 3.3.0-20180727.011238-114 without success. Heritrix just hangs right after the crawl is started. I suppose this is caused by Heritrix taking a rather unusual approach to DNS queries when using a web proxy. Let me explain:

The DNS servers in our corporate network only resolve host names from our local network. (And I cannot use external DNS servers because of firewall rules.) That's OK because all external requests are routed through a web proxy anyway. A client tells the web proxy the (external) URL it wishes to access, and the web proxy takes care of everything, including DNS resolution (which is done by forwarding the request to a parent proxy that does the actual DNS resolution using other, "more knowledgeable" DNS servers).

Long story short, when using a web proxy, it's not necessary to query the DNS. Most tools do it this way, and "just work". (Example: curl makes DNS requests only when not using a web proxy.)

However, Heritrix always makes DNS requests, regardless of its proxy configuration. If a DNS request fails, it does not even go on asking the proxy for the URL it tries to crawl, although the proxy could easily fetch the content. Instead, it just hangs.

So may I suggest that Heritrix should not query the DNS when it is configured to use a web proxy? Or if it does, that at least it should continue asking the web proxy even if the DNS request fails?

Probably related: #198

Thanks,
Martin

JDK11 support: tools.jar

heritrix-contrib depends on hbase-client which eventually depends on tools.jar. tools.jar is no longer included in the JDK.

[ERROR] Failed to execute goal on project heritrix-contrib: Could not resolve dependencies for project org.archive.heritrix:heritrix-contrib:jar:3.4.0-SNAPSHOT: Could not find artifact jdk.tools:jdk.tools:jar:1.6 at specified path /usr/lib/jvm/java-11-openjdk-11.0.3.7-1.fc29.x86_64/../lib/tools.jar -> [Help 1]

Upgrade dependencies to spring 4.x.x

Heritrix-commons depends on Spring 3.0.5.RELEASE, which is an outdated version.

The use of heritrix-commons as a library may lead to dependency issues.

Missing OneLineSimpleLayout class file

Hi,

I was trying to build the master branch with mvn package but the tests of the subproject commons failed due to a NoClassDefFoundError. The missing class was org.archive.util.OneLineSimpleLayout.

This class is listed in the commons/src/test/resources/log4j.xml configuration file, but it does not exist in the commons source; it is present only in the contrib folder.

HTTP response only results in garbage bytes

I'm trying to run the latest Heritrix build (build heritrix-3.3.0-20180529.100446-105-dist.tar.gz which I downloaded here) for some tests.

I try to start Heritrix with the command below:

~/heritrix-3.3.0-SNAPSHOT/bin/heritrix -a foo

This works, but when I open http://localhost:8443/ in my browser (Firefox), it only shows 6 garbled characters (Chromium returns an ERR_INVALID_HTTP_RESPONSE error). Saving the page and opening it in a Hex editor shows these 7 bytes:

15 03 03 00 02 02 0A

Some info on Java on my system:

openjdk version "1.8.0_171"
OpenJDK Runtime Environment (build 1.8.0_171-8u171-b11-0ubuntu0.16.04.1-b11)
OpenJDK 64-Bit Server VM (build 25.171-b11, mixed mode)

Here's the Heritrix log file:

Thu Jun 21 13:15:54 CEST 2018 Starting heritrix
Linux johan-HP-ProBook-640-G1 4.10.0-38-generic #42~16.04.1-Ubuntu SMP Tue Oct 10 16:32:20 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
openjdk version "1.8.0_171"
OpenJDK Runtime Environment (build 1.8.0_171-8u171-b11-0ubuntu0.16.04.1-b11)
OpenJDK 64-Bit Server VM (build 25.171-b11, mixed mode)
JAVA_OPTS= -Xmx256m
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 31394
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 31394
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
Oracle Corporation OpenJDK Runtime Environment 1.8.0_171-8u171-b11-0ubuntu0.16.04.1-b11

Warning:
The JKS keystore uses a proprietary format. It is recommended to migrate to PKCS12 which is an industry standard format using "keytool -importkeystore -srckeystore adhoc.keystore -destkeystore adhoc.keystore -deststoretype pkcs12".
Using ad-hoc HTTPS certificate with fingerprint...
SHA1:55:BA:62:92:98:5A:DB:26:1B:08:70:D8:90:5D:9C:F3:A4:E7:BF:81
Verify in browser before accepting exception.
2018-06-21 11:15:55.239 WARNING thread-1 org.archive.crawler.framework.Engine.findJobConfigs() invalid job directory: ./jobs/.gitignore where job expected from: ./jobs/.gitignore
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
engine listening at port 8443
operator login set per command-line
NOTE: We recommend a longer, stronger password, especially if your web 
interface will be internet-accessible.
Heritrix version: 3.3.0-SNAPSHOT-2018-05-29T09:43:19Z

The log contains a number of warnings, but I have no idea if they are related to this.

Perhaps I'm doing something wrong myself (this is my first attempt at installing and running Heritrix). Anyway, if anyone could give me a hint on how to make this work, that would be really helpful. (Side note: I initially tried the "stable" 3.2 release, but gave up on that because of the dependency on Java 7.)

Depth first search issue

I set preferenceDepthHops to 0. Based on the documentation this should make the crawl proceed depth-first, but it does not crawl the seeds at all and just prints a -50 fetch status in the crawl.log file, which could be a bug given what the documentation says.

I'm using default configuration, only preferenceDepthHops is set to 0.

Long-lived cookies might have unintended consequences on a crawling session

In a recent investigation we found that about half of the Twitter pages are archived in non-English languages. Also, about half of those non-English captures are in Kannada language alone. The root cause of this unintended consequence was found to be a sticky cookie.

Here is why it happens. Twitter pages have 47 alternate language links like this:

<link rel="alternate" hreflang="fr" href="https://twitter.com/?lang=fr">
<link rel="alternate" hreflang="en" href="https://twitter.com/?lang=en">
<link rel="alternate" hreflang="ar" href="https://twitter.com/?lang=ar">
...
<link rel="alternate" hreflang="kn" href="https://twitter.com/?lang=kn">

These links are added to the frontier queue every once in a while. Once any of these links is fetched, Twitter sets an explicit long-lasting lang cookie with the corresponding language. After that, every Twitter page is served in that sticky language unless the URL includes an explicit ?lang=<language-code> query parameter. In this list of alternate links, Kannada (kn) happens to be the last one, so it overwrites any previous language cookies and affects all subsequent Twitter links in that crawling session.

This is not a Twitter-specific issue. Similar consequences might exist in other places. A potential solution is to never let any cookie live for too long: instead, explicitly expire or remove cookies after a short period of time.

Read more details about our investigation of the matter in our blog post http://ws-dl.blogspot.com/2018/03/2018-03-21-cookies-are-why-your.html.
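
A sketch of that mitigation, assuming the HttpClient 4.x cookie objects Heritrix uses (the helper class itself is hypothetical): cap the expiry of any overly long-lived cookie when it is stored.

import java.util.Date;
import org.apache.http.impl.cookie.BasicClientCookie;

public class CookieTtlCap {
    // Shorten any cookie whose expiry lies further out than maxTtlMillis.
    public static void capTtl(BasicClientCookie cookie, long maxTtlMillis) {
        Date cap = new Date(System.currentTimeMillis() + maxTtlMillis);
        if (cookie.getExpiryDate() != null && cookie.getExpiryDate().after(cap)) {
            cookie.setExpiryDate(cap); // e.g. expire sticky lang cookies within hours
        }
    }
}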

Noisy alerts about 401s without auth challenge

A 401 response is supposed to include an auth challenge but in practice a lot of sites erroneously use 401 without it (they should really be using 403s).

When Heritrix encounters such a situation it logs the error in such a manner that it is added to the alerts log. As this isn't an issue with the crawler, this isn't very useful, and the spamming of such errors may hide other, more serious and actionable errors.

Example entry from the alerts log:

Apr 27, 2016 1:47:31 PM org.archive.modules.fetcher.FetchHTTP extractChallenges
WARNING: Failed to extract auth challenge headers for uri with response status 401: http://aktravel.is/en/fundir-og-radstefnur/framkvaemd-radstefnu (in thread 'ToeThread #7: http://aktravel.is/en/fundir-og-radstefnur/framkvaemd-radstefnu'; in processor 'fetchHttp')

Suggest we modify how these errors are handled and log them in the nonfatal-errors.log only.

Make FetchHistoryProcessor 304 handler more robust

The FetchHistoryProcessor has special logic to handle HTTP 304 responses...

if (curi.getFetchStatus() == 304) {
    // Copy forward the content digest as the current digest is simply of an empty response
    latestFetch.put(A_CONTENT_DIGEST, history[1].get(A_CONTENT_DIGEST));
    // Create revisit profile
    curi.getAnnotations().add("duplicate:server-not-modified");
    ServerNotModifiedRevisit revisit = new ServerNotModifiedRevisit();
    revisit.setETag((String) latestFetch.get(A_ETAG_HEADER));
    revisit.setLastModified((String) latestFetch.get(A_LAST_MODIFIED_HEADER));
    revisit.setPayloadDigest((String) latestFetch.get(A_CONTENT_DIGEST));
    curi.setRevisitProfile(revisit);
} else if (hasIdenticalDigest(curi)) {
    curi.getAnnotations().add("duplicate:digest");
    IdenticalPayloadDigestRevisit revisit =
            new IdenticalPayloadDigestRevisit((String) history[1].get(A_CONTENT_DIGEST));
    revisit.setRefersToTargetURI(curi.getURI()); // Matches are always on the same URI
    revisit.setRefersToDate((Long) history[1].get(A_FETCH_BEGAN_TIME));
    curi.setRevisitProfile(revisit);
}

...but we're having problems because a server is returning this even though we never sent an If-Modified-Since or If-None-Match header in the request (pretty sure the crawler never does this).

I'm not sure this is an appropriate use of de-duplication logic: de-duping a 304 because the server tells us it's the same as an earlier response is really not the same as de-duping a 200 we've downloaded before, where we only de-dupe after verifying the hashes match.

So, the whole section should be removed, IMO. Failing that, we'll need to re-write it so it does not assume the history is present.

(see ukwa/ukwa-heritrix#27 for context)
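
If the section is kept, a guard along these lines would avoid assuming the history is present (a sketch; the helper is hypothetical, with history[1] holding the previous fetch as in the code above):

import java.util.Map;

public class FetchHistoryGuard {
    // True only when a previous fetch exists and recorded a content digest.
    public static boolean hasPreviousDigest(Map<String, Object>[] history,
            String digestKey) {
        return history != null && history.length > 1 && history[1] != null
                && history[1].get(digestKey) != null;
    }
}

The 304 branch would then only run when hasPreviousDigest(history, A_CONTENT_DIGEST) holds.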

Domain name lookup failures get cached forever

When a job first looks up a URL, the DNS record is fetched first. If this fails, this code kicks in:

if (ch == null || ch.hasBeenLookedUp() && ch.getIP() == null) {

Otherwise, this code is used:

if (isIpExpired(curi) && !curi.getUURI().getScheme().equals("dns")) {

The latter uses isIpExpired, which implements the ipValidityDurationSeconds check. However, the former does not, and thus while successful IP lookups get refreshed every six hours (by default), failed IP lookups are never re-tried.
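
One hypothetical direction for a fix, purely as a sketch of the idea (it reuses the identifiers from the snippets above and is not working code against the real class): only treat a past lookup failure as current while it is still within ipValidityDurationSeconds, and afterwards fall through to the same re-lookup path that refreshes successful lookups.

// Hypothetical tweak to the first condition: a cached failure "expires"
// just like a cached success, so a fresh dns: lookup gets scheduled.
boolean failedLookupStillFresh = ch != null && ch.hasBeenLookedUp()
        && ch.getIP() == null && !isIpExpired(curi);
if (ch == null || failedLookupStillFresh) {
    // unresolved: handle as before (schedule lookup / report failure)
}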

Avoid speculative links extraction for meta fields known not to contain links

Following this report of a URL being constructed from <meta> elements:

I'm using Heritrix 3.3.0-SNAPSHOT and see some strange behavior in the link extraction. This is one example in crawl.log:

2018-12-21T04:07:03.874Z   404       7161 https://stitch-maps.com/news/2018/10/twofer/Stitch-Maps.com RLX https://stitch-maps.com/news/2018/10/twofer/ text/html #116 20181219040702090+1782 sha1:K7HLTQ7SFI4KAQN3NVAO4OJ4UBYT3FGE - -

There isn't any link to the crawled URL on the given source page, so it seems like the Facebook tags on the source page have something to do with it:

<meta property="og:url" content="http://stitch-maps.com/news/2018/10/twofer/"/>
<meta property="og:site_name" content="Stitch-Maps.com"/>

Isn't it a bug that Heritrix combined these two URLs into https://stitch-maps.com/news/2018/10/twofer/Stitch-Maps.com?

Heritrix3 Fails to Build from Source

jaan 31, 2017 1:20:23 PM org.archive.modules.fetcher.FetchHTTPRequest$ServerCacheResolver resolve
INFO: host "localhost" is not in serverCache, allowing java to resolve it
127.0.0.1 -  -  [31/jaan/2017:11:20:23 +0000] "GET /unsupported-charset HTTP/1.0" 200 37 "-" "org.archive.modules.fetcher.FetchHTTPTests"
jaan 31, 2017 1:20:23 PM org.archive.modules.fetcher.FetchHTTPRequest$ServerCacheResolver resolve
INFO: host "localhost" is not in serverCache, allowing java to resolve it
127.0.0.1 -  -  [31/jaan/2017:11:20:23 +0000] "GET /invalid-charset HTTP/1.0" 200 37 "-" "org.archive.modules.fetcher.FetchHTTPTests"
jaan 31, 2017 1:20:23 PM org.archive.modules.fetcher.FetchHTTPRequest$ServerCacheResolver resolve
INFO: host "example.com" is not in serverCache, allowing java to resolve it
jaan 31, 2017 1:20:23 PM org.archive.modules.fetcher.FetchHTTPRequest$ServerCacheResolver resolve
INFO: host "example.com" is not in serverCache, allowing java to resolve it
jaan 31, 2017 1:20:24 PM org.archive.modules.fetcher.FetchHTTPRequest$ServerCacheResolver resolve
INFO: host "localhost" is not in serverCache, allowing java to resolve it
127.0.0.1 -  -  [31/jaan/2017:11:20:24 +0000] "POST / HTTP/1.0" 200 37 "-" "org.archive.modules.fetcher.FetchHTTPTests"
Tests run: 30, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 12.187 sec
Running org.archive.modules.fetcher.CookieStoreTest
jaan 31, 2017 1:20:24 PM org.archive.modules.fetcher.CookieStoreTest bdb
INFO: created org.archive.bdb.BdbModule@4a0df195
jaan 31, 2017 1:20:24 PM org.archive.modules.fetcher.CookieStoreTest bdb
INFO: created org.archive.bdb.BdbModule@ec1b2e4
jaan 31, 2017 1:20:25 PM org.archive.modules.fetcher.CookieStoreTest bdb
INFO: created org.archive.bdb.BdbModule@7bbbb6a8
jaan 31, 2017 1:20:25 PM org.archive.modules.fetcher.CookieStoreTest bdb
INFO: created org.archive.bdb.BdbModule@1dc2de84
jaan 31, 2017 1:20:25 PM org.archive.modules.fetcher.CookieStoreTest bdb
INFO: created org.archive.bdb.BdbModule@33f98231
jaan 31, 2017 1:20:25 PM org.archive.modules.fetcher.CookieStoreTest bdb
INFO: created org.archive.bdb.BdbModule@f2ce6b
jaan 31, 2017 1:20:25 PM org.archive.modules.fetcher.CookieStoreTest testSaveLoadCookies
INFO: before: [[version: 0][name: name1][value: value1][domain: example.com][path: null][expiry: null], [version: 0][name: name2][value: value2][domain: example.com][path: null][expiry: null], [version: 0][name: name3][value: value3][domain: example.com][path: null][expiry: null], [version: 0][name: name5][value: value5][domain: example.com][path: null][expiry: Sat May 27 08:07:05 EEST 2017], [version: 0][name: name6][value: value6][domain: example.com][path: /path1][expiry: null], [version: 0][name: name4][value: value4][domain: example.org][path: null][expiry: null]]
jaan 31, 2017 1:20:25 PM org.archive.modules.fetcher.CookieStoreTest testSaveLoadCookies
INFO:  after: [[version: 0][name: name1][value: value1][domain: example.com][path: /][expiry: null], [version: 0][name: name2][value: value2][domain: example.com][path: /][expiry: null], [version: 0][name: name3][value: value3][domain: example.com][path: /][expiry: null], [version: 0][name: name5][value: value5][domain: example.com][path: /][expiry: Sat May 27 08:07:05 EEST 2017], [version: 0][name: name6][value: value6][domain: example.com][path: /path1][expiry: null], [version: 0][name: name4][value: value4][domain: example.org][path: /][expiry: null]]
jaan 31, 2017 1:20:25 PM org.archive.modules.fetcher.CookieStoreTest bdb
INFO: created org.archive.bdb.BdbModule@7a22a3c2
jaan 31, 2017 1:20:35 PM org.archive.modules.fetcher.CookieStoreTest bdb
INFO: created org.archive.bdb.BdbModule@c6634d
Tests run: 8, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 11.79 sec <<< FAILURE!
Running org.archive.modules.fetcher.FetchDNSTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0 sec

Results :

Failed tests:   testConcurrentLoad(org.archive.modules.fetcher.CookieStoreTest)

Tests run: 188, Failures: 1, Errors: 0, Skipped: 0

[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO] 
[INFO] Heritrix 3 ......................................... SUCCESS [  0.002 s]
[INFO] Heritrix 3: 'commons' subproject (utility classes) . SUCCESS [11:51 min]
[INFO] Heritrix 3: 'modules' subproject (reusable components) FAILURE [01:22 min]
[INFO] Heritrix 3: 'engine' subproject .................... SKIPPED
[INFO] Heritrix 3 (distribution bundles) .................. SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 13:14 min
[INFO] Finished at: 2017-01-31T13:20:36+02:00
[INFO] Final Memory: 32M/223M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.9:test (default-test) on project heritrix-modules: There are test failures.
[ERROR] 
[ERROR] Please refer to /opt/hdd_01_for_large_files/large_files/se_dynamicsearchengine_01/se_heritrix_01/modules/target/surefire-reports for the individual test results.
[ERROR] -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
[ERROR] 
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <goals> -rf :heritrix-modules
se_dynamicsearchengine_01@mvahi:~/large_files/se_heritrix_01$ date
T jaan  31 13:33:55 EET 2017
se_dynamicsearchengine_01@mvahi:~/large_files/se_heritrix_01$ uname -a
Linux mvahi 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u2 (2016-10-19) x86_64 GNU/Linux
se_dynamicsearchengine_01@mvahi:~/large_files/se_heritrix_01$ java -version
java version "1.8.0_121"
Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
se_dynamicsearchengine_01@mvahi:~/large_files/se_heritrix_01$

Documentation update for 3.4.0-YYYYMMDD releases

We should clean up some of the more problematic documentation issues.

  • The release notes are a bit of a mess (only found via the wiki index). Should we move them into a CHANGES.md file? Can we also pull in @kris-sigur's blogs 1, 2 and get all the way back to the Cookie Monster?
  • The major 'front pages' should be reviewed to check they make basic sense and link to the right places.

Possible race-condition when first using the WARC writers?

Most of the time when I start a crawl using a recent Heritrix build, an exception gets thrown when the system first tries to write to a WARC file, e.g.:

2016-08-12T09:06:35.096Z    -5         54 dns:madebybridge.com RLP http://madebybridge.com/ text/dns #030 20160812090634493+63 sha1:QT62S24FK6C32Z7PBBYEDY3OBTNBZFMI - err=java.lang.NullPointerException
 java.lang.NullPointerException
        at org.archive.io.warc.WARCWriter.getFilenameWithoutOccupiedSuffix(WARCWriter.java:275)
        at org.archive.modules.writer.WARCWriterProcessor.updateMetadataAfterWrite(WARCWriterProcessor.java:289)
        at org.archive.modules.writer.WARCWriterProcessor.write(WARCWriterProcessor.java:259)
        at org.archive.modules.writer.WARCWriterProcessor.innerProcessResult(WARCWriterProcessor.java:195)
        at org.archive.modules.Processor.process(Processor.java:142)
        at org.archive.modules.ProcessorChain.process(ProcessorChain.java:131)
        at org.archive.crawler.framework.ToeThread.run(ToeThread.java:152)
 java.lang.NullPointerException
        at org.archive.io.warc.WARCWriter.getFilenameWithoutOccupiedSuffix(WARCWriter.java:275)
        at org.archive.modules.writer.WARCWriterProcessor.updateMetadataAfterWrite(WARCWriterProcessor.java:289)
        at org.archive.modules.writer.WARCWriterProcessor.write(WARCWriterProcessor.java:259)
        at org.archive.modules.writer.WARCWriterProcessor.innerProcessResult(WARCWriterProcessor.java:195)
        at org.archive.modules.Processor.process(Processor.java:142)
        at org.archive.modules.ProcessorChain.process(ProcessorChain.java:131)
        at org.archive.crawler.framework.ToeThread.run(ToeThread.java:152)
2016-08-12T09:06:35.098Z    -5         52 dns:wiki.bl.uk P https://wiki.bl.uk:8443/ text/dns #002 20160812090633536+66 sha1:KHNVONKMAARJYO2OZHRC2TPPZZM6A7AG - err=java.lang.NullPointerException
 java.lang.NullPointerException
        at org.archive.io.warc.WARCWriter.getFilenameWithoutOccupiedSuffix(WARCWriter.java:275)
        at org.archive.modules.writer.WARCWriterProcessor.updateMetadataAfterWrite(WARCWriterProcessor.java:289)
        at org.archive.modules.writer.WARCWriterProcessor.write(WARCWriterProcessor.java:259)
        at org.archive.modules.writer.WARCWriterProcessor.innerProcessResult(WARCWriterProcessor.java:195)
        at org.archive.modules.Processor.process(Processor.java:142)
        at org.archive.modules.ProcessorChain.process(ProcessorChain.java:131)
        at org.archive.crawler.framework.ToeThread.run(ToeThread.java:152)
 java.lang.NullPointerException
        at org.archive.io.warc.WARCWriter.getFilenameWithoutOccupiedSuffix(WARCWriter.java:275)
        at org.archive.modules.writer.WARCWriterProcessor.updateMetadataAfterWrite(WARCWriterProcessor.java:289)
        at org.archive.modules.writer.WARCWriterProcessor.write(WARCWriterProcessor.java:259)
        at org.archive.modules.writer.WARCWriterProcessor.innerProcessResult(WARCWriterProcessor.java:195)
        at org.archive.modules.Processor.process(Processor.java:142)
        at org.archive.modules.ProcessorChain.process(ProcessorChain.java:131)
        at org.archive.crawler.framework.ToeThread.run(ToeThread.java:152)

Any ideas what's going awry here? Seems like a race-condition when setting up the WARC writers?

Can't see all beans in scripts

I was trying to execute the following code from https://webarchive.jira.com/wiki/display/Heritrix/Heritrix3+Useful+Scripts#Heritrix3UsefulScripts-addasheetforcingmanyqueuesinto'retired'state

mgr = appCtx.getBean("sheetOverlaysManager");
newSheetName = "urbanOrgAndTaxpolicycenterOrgSingleQueue"
mgr.putSheetOverlay(newSheetName, "queueAssignmentPolicy.forceQueueAssignment", "urbanorg_and_taxpolicycenterorg");
mgr.addSurtAssociation("http://(org,urban,", newSheetName);
mgr.addSurtAssociation("http://(org,taxpolicycenter,", newSheetName);

//check your results
mgr.sheetNamesBySurt.each{ rawOut.println(it) }
rawOut.println(mgr.sheetNamesBySurt.size())

This gives the following error: javax.script.ScriptException: org.springframework.beans.factory.NoSuchBeanDefinitionException: No bean named 'queueAssignmentPolicy' is defined.

It appears to me that any bean that is commented out in the default configuration file (they are default values and can be kept untouched for common use cases, I think) cannot be accessed through the Groovy script. Is this intentional behavior? Is there any way to get and change these values?

Invalid format exception in scanJobLog

@ruebot encountered the following exception after checkpointing and restarting Heritrix:

2019-03-11 14:08:53.599 SEVERE thread-1 org.archive.crawler.framework.Engine.addJobDirectory() bad cxml: /data/heritrix-jobs/academic-calendars/crawler-beans.cxml
java.lang.IllegalArgumentException: Invalid format: "539Z" is malformed at "Z"
    at org.joda.time.format.DateTimeFormatter.parseMillis(DateTimeFormatter.java:634)
    at org.joda.time.convert.StringConverter.getInstantMillis(StringConverter.java:65)
    at org.joda.time.base.BaseDateTime.<init>(BaseDateTime.java:171)
    at org.joda.time.DateTime.<init>(DateTime.java:168)
    at org.archive.crawler.framework.CrawlJob.scanJobLog(CrawlJob.java:179)
    at org.archive.crawler.framework.CrawlJob.<init>(CrawlJob.java:101)
    at org.archive.crawler.framework.Engine.addJobDirectory(Engine.java:153)
    at org.archive.crawler.framework.Engine.findJobConfigs(Engine.java:109)
    at org.archive.crawler.framework.Engine.<init>(Engine.java:72)
    at org.archive.crawler.Heritrix.instanceMain(Heritrix.java:335)
    at org.archive.crawler.Heritrix.main(Heritrix.java:188)

Looks like a bug here:

startPosition = jobLog.length()-(FileUtils.ONE_KB * 100);

If the job log is larger than 100KB, startPosition is set to 100KB from the end, which might be in the middle of a line. If that partial line still happens to match Pattern.compile("(\\S+) (\\S+) Job launched"), then an incomplete timestamp may be parsed, causing the exception.

@anjackson suggests the following fix:

it should read a line and discard it if the startPosition is not zero.
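
A sketch of that fix (jobLog here is a stand-in for the real job log file handle):

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import org.apache.commons.io.FileUtils;

public class JobLogScanner {
    // Seek near the end of the log, then discard the first (likely partial)
    // line so only complete lines are matched against the launch pattern.
    public static void scanTail(File jobLog) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(jobLog, "r")) {
            long startPosition = Math.max(0, raf.length() - FileUtils.ONE_KB * 100);
            raf.seek(startPosition);
            if (startPosition > 0) {
                raf.readLine(); // skip the line we landed in the middle of
            }
            String line;
            while ((line = raf.readLine()) != null) {
                // apply the "(\\S+) (\\S+) Job launched" pattern here
            }
        }
    }
}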

BdbFrontier thread safety

We're attempting to use Heritrix3 with an external module that populates the BdbFrontier via Kafka, and we're hitting problems interacting with the frontier safely. There's some more detail in ukwa/ukwa-heritrix#16, but to summarise, ToeThreads are dying because peekItem is null when it should not be.

I believe this is because peekItem is marked as transient. Occasionally, between setting peekItem (this statement) and using it (this one), the WorkQueue gets updated by a separate thread in a way that forces it to get written out to disk and then read back in again. As peekItem is transient, flushing it out to the disk and back drops the value and we're left with a null.

NetArchive Suite have also seen this issue when using a RabbitMQ-based URL receiver, and patched it by ignoring the null.

The simplest way to avoid this would be to remove the transient modifier from peekItem, but that makes me worry, because someone deliberately chose to make it transient and I don't understand why.

Secondly, I don't understand why we are seeing this, when IA also use similar methods and are (presumably?) not seeing this. Moreover, this model appears not to be fundamentally different to the traditional ActionDirectory, so I don't understand why this wasn't seen a long time ago.

Finally, this issue also made it clear that I don't actually understand how best to interact with the BdbFrontier in a thread-safe manner. If I am right in assuming that every modification to a WorkQueue needs to be followed by a .makeDirty() that serialises the queue out to disk and reads it back in again, then surely every modification needs to edit-then-write within a synchronized(WorkQueue) block? But it's pretty easy to find examples where this appears to be deliberately not the case:

synchronized (wq) {
    int originalPrecedence = wq.getPrecedence();
    wq.enqueue(this, curi);
    // always take budgeting values from current curi
    // (whose overlay settings should be active here)
    wq.setSessionBudget(getBalanceReplenishAmount());
    wq.setTotalBudget(getQueueTotalBudget());
    if (!wq.isRetired()) {
        incrementQueuedUriCount();
        int currentPrecedence = wq.getPrecedence();
        if (!wq.isManaged() || currentPrecedence < originalPrecedence) {
            // queue newly filled or bumped up in precedence; ensure enqueuing
            // at precedence level (perhaps duplicate; if so that's handled elsewhere)
            deactivateQueue(wq);
        }
    }
}
// Update recovery log.
doJournalAdded(curi);
wq.makeDirty();

I'd appreciate any information anyone has on how best to inject URLs into Heritrix3, and on whether or not I've understood how the BdbFrontier works.
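
For reference, the locking discipline the question assumes would look like this (a sketch reusing the identifiers from the snippet above, not a drop-in patch): the mutation and the makeDirty() flush happen under one monitor, so no other thread can force a serialise-out in between.

synchronized (wq) {
    wq.enqueue(this, curi);
    wq.setSessionBudget(getBalanceReplenishAmount());
    wq.setTotalBudget(getQueueTotalBudget());
    wq.makeDirty(); // flush while still holding the queue's monitor
}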

"RIS already open for ToeThread..." exception during https pages crawl over proxy

When I try to crawl https pages over a proxy with Heritrix 3, I get the following exceptions:

java.io.IOException: RIS already open for ToeThread #5: https://www.XXX/robots.txt
    at org.archive.io.RecordingInputStream.open(RecordingInputStream.java:84)
    at org.archive.util.Recorder.inputWrap(Recorder.java:185)
    at org.archive.modules.fetcher.FetchHTTPRequest$RecordingHttpClientConnection.getSocketInputStream(FetchHTTPRequest.java:648)
    at org.apache.http.impl.BHttpConnectionBase.ensureOpen(BHttpConnectionBase.java:131)
    at org.apache.http.impl.DefaultBHttpClientConnection.sendRequestHeader(DefaultBHttpClientConnection.java:140)
    at org.apache.http.protocol.HttpRequestExecutor.doSendRequest(HttpRequestExecutor.java:203)
    at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:121)
    at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:254)
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:195)
    at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:86)
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:72)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
    at org.archive.modules.fetcher.FetchHTTPRequest.execute(FetchHTTPRequest.java:751)
    at org.archive.modules.fetcher.FetchHTTP.innerProcess(FetchHTTP.java:658)
    at org.archive.modules.Processor.innerProcessResult(Processor.java:175)
    at org.archive.modules.Processor.process(Processor.java:142)
    at org.archive.modules.ProcessorChain.process(ProcessorChain.java:138)
    at org.archive.crawler.framework.ToeThread.run(ToeThread.java:148)

Heritrix sometimes writes empty WARC records for redirects

Just noticed an oddity in our crawls. We have a WARC response with no response in it (see below). This seems to be due to the crawler getting a HTTP 204 response.

However, I only think that because the @ikreymer's pywb cdx-indexer creates this CDX line:

com,facebook)/plugins/like.php?action=like&colorscheme=light&height=21&href=http://newspig.co.uk/8-reasons-to-hold-cash-markets-are-rational-until-theyre-not/&layout=button_count&show_faces=false&width=105 20180422171119 http://www.facebook.com/plugins/like.php?href=http://newspig.co.uk/8-reasons-to-hold-cash-markets-are-rational-until-theyre-not/&layout=button_count&show_faces=false&width=105&action=like&colorscheme=light&height=21 unk 204 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 383 10514026 BL-20180422170134461-00018-63~ukwa-h3-pulse-daily~8443.warc.gz

But frankly I don't understand where it's getting the 204 from!

Assuming it is really a 204 (I'll check the crawl log), the question is: What should Heritrix3 be writing to the WARC file?

WARC/1.0
WARC-Type: response
WARC-Target-URI: http://www.facebook.com/plugins/like.php?href=http://newspig.co.uk/8-reasons-to-hold-cash-markets-are-rational-until-theyre-not/&layout=button_count&show_faces=false&width=105&action=like&colorscheme=light&height=21
WARC-Date: 2018-04-22T17:11:19Z
WARC-IP-Address: 157.240.1.35
WARC-Payload-Digest: sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
WARC-Record-ID: <urn:uuid:bf7d95c5-0844-4778-9490-1af393b53204>
Content-Length: 0



WARC/1.0
WARC-Type: request
WARC-Target-URI: http://www.facebook.com/plugins/like.php?href=http://newspig.co.uk/8-reasons-to-hold-cash-markets-are-rational-until-theyre-not/&layout=button_count&show_faces=false&width=105&action=like&colorscheme=light&height=21
WARC-Date: 2018-04-22T17:11:19Z
WARC-Concurrent-To: <urn:uuid:bf7d95c5-0844-4778-9490-1af393b53204>
WARC-Record-ID: <urn:uuid:1fa4ddfb-2285-48b3-a835-61378b29a1d4>
Content-Length: 0



WARC/1.0
WARC-Type: metadata
WARC-Target-URI: http://www.facebook.com/plugins/like.php?href=http://newspig.co.uk/8-reasons-to-hold-cash-markets-are-rational-until-theyre-not/&layout=button_count&show_faces=false&width=105&action=like&colorscheme=light&height=21
WARC-Date: 2018-04-22T17:11:19Z
WARC-Concurrent-To: <urn:uuid:bf7d95c5-0844-4778-9490-1af393b53204>
WARC-Record-ID: <urn:uuid:dc148cb3-2c39-42c5-b1c6-02654fe428b7>
Content-Type: application/warc-fields
Content-Length: 564

via: http://newspig.co.uk/8-reasons-to-hold-cash-markets-are-rational-until-theyre-not/
hopsFromSeed: LLLE
sourceTag: http://newspig.co.uk/
fetchTimeMs: 12
charsetForLinkExtraction: ISO-8859-1
outlink: https://www.facebook.com/plugins/like.php?href=http%3A%2F%2Fnewspig.co.uk%2F8-reasons-to-hold-cash-markets-are-rational-until-theyre-not%2F&layout=button_count&show_faces=false&width=105&action=like&colorscheme=light&height=21 R Location:
outlink: http://www.facebook.com/favicon.ico I =INFERRED_MISC
outlink: http://www.facebook.com/ I =INFERRED_MISC



`-j` option cannot handle spaces in directory names?

This is my log on the latest snapshot

./bin/heritrix -a admin:admin -j "/media/sagnik/OS_Install/Documents and Settings"
Sun Jul 23 14:12:54 EDT 2017 Heritrix starting (pid 9377)
ERROR: JVM terminated without running Heritrix.
This could be due to invalid JAVA_OPTS or JMX_PORT, etc.
See heritrix_out.log for more details.
Here are its last three lines: 

                                 password, and key password for HTTPS use. Separate with commas, no
                                 whitespace.
Your arguments were: -a admin:admin -j /media/sagnik/OS_Install/Documents and Settings

I am pretty sure that is because of the space in the -j argument, because when I do ./bin/heritrix -a admin:admin -j "/media/sagnik/OS_Install/", it runs fine. Has this been pointed out before?

Shutdown hangs when ExtractorHTML is stuck on big gnarly HTML

We've found we can't shut down H3 cleanly because it's getting stuck on a very large and poorly-formed HTML file.

Anything we can do to help this at least shut down cleanly?

[ToeThread #93: http://s152224197.websitehome.co.uk/other_secured_loans.php
 CrawlURI http://s152224197.websitehome.co.uk/other_secured_loans.php ILLLL http://s152224197.websitehome.co.uk/mortgage_guides.php    0 attempts
    in processor: extractorHtml
    ACTIVE for 2h24m30s338ms
    step: ABOUT_TO_BEGIN_PROCESSOR for 2h24m10s571ms
Java Thread State: RUNNABLE
Blocked/Waiting On: NONE
    org.archive.util.InterruptibleCharSequence.charAt(InterruptibleCharSequence.java:41)
    java.util.regex.Pattern$SliceI.match(Pattern.java:3890)
    java.util.regex.Pattern$Curly.match1(Pattern.java:4185)
    java.util.regex.Pattern$Curly.match(Pattern.java:4134)
    java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3715)
    java.util.regex.Pattern$GroupTail.match(Pattern.java:4615)
    java.util.regex.Pattern$Curly.match2(Pattern.java:4209)
    java.util.regex.Pattern$Curly.match(Pattern.java:4136)
    java.util.regex.Pattern$SliceI.match(Pattern.java:3895)
    java.util.regex.Pattern$GroupHead.match(Pattern.java:4556)
    java.util.regex.Pattern$GroupHead.match(Pattern.java:4556)
    java.util.regex.Pattern$Branch.match(Pattern.java:4502)
    java.util.regex.Pattern$GroupHead.match(Pattern.java:4556)
    java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3715)
    java.util.regex.Pattern$Start.match(Pattern.java:3408)
    java.util.regex.Matcher.search(Matcher.java:1199)
    java.util.regex.Matcher.find(Matcher.java:592)
    org.archive.modules.extractor.ExtractorHTML.extract(ExtractorHTML.java:810)
    org.archive.modules.extractor.ExtractorHTML.innerExtract(ExtractorHTML.java:743)
    org.archive.modules.extractor.ContentExtractor.extract(ContentExtractor.java:37)
    org.archive.modules.extractor.Extractor.innerProcess(Extractor.java:102)
    org.archive.modules.Processor.innerProcessResult(Processor.java:175)
    org.archive.modules.Processor.process(Processor.java:142)
    org.archive.modules.ProcessorChain.process(ProcessorChain.java:131)
    org.archive.crawler.framework.ToeThread.run(ToeThread.java:148)
]

JDK11 support: ClassNotFoundException: javax.transaction.xa.Xid

Under OpenJDK 11.0.3, several BDB-related unit tests fail with errors:

  testStoredSortedMap(org.archive.settings.file.PrefixFinderTest): javax/transaction/xa/Xid
  testDoCheckpoint(org.archive.settings.file.BdbModuleTest): javax/transaction/xa/Xid
  testAdd(org.archive.bdb.StoredQueueTest): javax/transaction/xa/Xid
  testClear(org.archive.bdb.StoredQueueTest): javax/transaction/xa/Xid
  testRemove(org.archive.bdb.StoredQueueTest): javax/transaction/xa/Xid
  testOrdering(org.archive.bdb.StoredQueueTest): javax/transaction/xa/Xid
  testElement(org.archive.bdb.StoredQueueTest): javax/transaction/xa/Xid
  testIdentity(org.archive.bdb.StoredQueueTest): javax/transaction/xa/Xid
  testReadConsistencyUnderLoad(org.archive.util.ObjectIdentityBdbCacheTest): javax/transaction/xa/Xid
  testBackingDbGetsUpdated(org.archive.util.ObjectIdentityBdbCacheTest): javax/transaction/xa/Xid
  testMemMapCleared(org.archive.util.ObjectIdentityBdbCacheTest): javax/transaction/xa/Xid
  testReadConsistencyUnderLoad(org.archive.util.ObjectIdentityBdbManualCacheTest): javax/transaction/xa/Xid
  testBackingDbGetsUpdated(org.archive.util.ObjectIdentityBdbManualCacheTest): javax/transaction/xa/Xid

with this stacktrace:

java.lang.NoClassDefFoundError: javax/transaction/xa/Xid
        at com.sleepycat.je.dbi.DatabaseImpl.<init>(DatabaseImpl.java:173)
        at com.sleepycat.je.dbi.DbTree.<init>(DbTree.java:254)
        at com.sleepycat.je.dbi.EnvironmentImpl.<init>(EnvironmentImpl.java:476)
        at com.sleepycat.je.dbi.EnvironmentImpl.<init>(EnvironmentImpl.java:340)
        at com.sleepycat.je.dbi.DbEnvPool.getEnvironment(DbEnvPool.java:195)
        at com.sleepycat.je.Environment.makeEnvironmentImpl(Environment.java:229)
        at com.sleepycat.je.Environment.<init>(Environment.java:211)
        at com.sleepycat.je.Environment.<init>(Environment.java:165)
        at org.archive.settings.file.PrefixFinderTest.testStoredSortedMap(PrefixFinderTest.java:85)
Caused by: java.lang.ClassNotFoundException: javax.transaction.xa.Xid
        at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:471)
        at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:588)
        at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521)
        at org.apache.maven.surefire.booter.IsolatedClassLoader.loadClass(IsolatedClassLoader.java:93)
        ... 37 more

This class does still exist in jdk11:

$ jshell
|  Welcome to JShell -- Version 11.0.3
|  For an introduction type: /help intro

jshell> javax.transaction.xa.Xid.MAXBQUALSIZE
$1 ==> 64

I don't understand the root cause, but on a hunch found that commenting out this line in pom.xml fixes the error:

<useSystemClassLoader>false</useSystemClassLoader>

The pom has a comment explaining:

However, using the systemClassLoader means that we inherit
maven's CLASSPATH while running our test code. This is a
problem since maven uses an earlier version of
commons-lang than we do.

I think it'd be very unlikely that Maven would today still be using an older version of commons-lang than Heritrix, but I haven't checked that.

How to configure warcWriter with MirrorWriter?

I am saving warc files from a heritrix crawl, my warcwriter settings as follows:

<bean id="warcWriter" class="org.archive.modules.writer.WARCWriterProcessor">
       <property name="shouldProcessRule">
        <bean class="org.archive.modules.deciderules.DecideRuleSequence">
         <property name="rules">
          <list>
           <!-- Begin by REJECTing all -->
           <bean class="org.archive.modules.deciderules.RejectDecideRule" />
           <bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
            <property name="decision" value="ACCEPT" />
            <property name="regex" value="^text/html.*" />
           </bean>
          </list>
         </property>
        </bean>
       </property>
       <!-- other properties -->
  <!-- <property name="compress" value="true" /> -->
  <!-- <property name="prefix" value="IAH" /> -->
  <!-- <property name="suffix" value="${HOSTNAME}" /> -->
  <!-- <property name="maxFileSizeBytes" value="1000000000" /> -->
  <!-- <property name="poolMaxActive" value="1" /> -->
  <!-- <property name="MaxWaitForIdleMs" value="500" /> -->
  <!-- <property name="skipIdenticalDigests" value="false" /> -->
  <!-- <property name="maxTotalBytesToWrite" value="0" /> -->
  <!-- <property name="directory" value="${launchId}" /> -->
  <!-- <property name="storePaths">
        <list>
         <value>warcs</value>
        </list>
       </property> -->
  <!-- <property name="template" value="${prefix}-${timestamp17}-${serialno}-${heritrix.pid}~${heritrix.hostname}~${heritrix.port}" /> -->
  <!-- <property name="writeRequests" value="true" /> -->
  <!-- <property name="writeMetadata" value="true" /> -->
  <!-- <property name="writeRevisitForIdenticalDigests" value="true" /> -->
  <!-- <property name="writeRevisitForNotModified" value="true" /> -->
  <!-- <property name="startNewFilesOnCheckpoint" value="true" /> -->
 </bean>

My intended use case is similar to https://webarchive.jira.com/wiki/display/Heritrix/Mirroring+HTML+Files+Only . But I want to store just HTML files in the mirror instead of all files, as the shouldProcessRule in the warcWriter bean is doing. Is that possible to do with MirrorWriter? i.e., is the following a proper configuration for my use case?

<bean id="warcWriter" class="org.archive.modules.writer.MirrorWriterProcessor">
       <property name="shouldProcessRule">
        <bean class="org.archive.modules.deciderules.DecideRuleSequence">
         <property name="rules">
          <list>
           <!-- Begin by REJECTing all -->
           <bean class="org.archive.modules.deciderules.RejectDecideRule" />
           <bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
            <property name="decision" value="ACCEPT" />
            <property name="regex" value="^text/html.*" />
           </bean>
          </list>
         </property>
        </bean>
       </property>
</bean>

Possibly stalled crawl

I have a job that looks like it has stalled out, and I'm not sure why.

I am running Heritrix 3.1.1 on Debian 9.

Here are some stats from the job web page:

Job is Active: RUNNING
Totals
4403 downloaded + 16788 queued = 21191 total
241 MiB crawled (241 MiB novel, 0 B dupByHash, 0 B notModified)
Alerts
31 tail alert log...
Rates
0 URIs/sec (0.16 avg); 0 KB/sec (8 avg)
Load
0 active of 2 threads; 1 congestion ratio; 16785 deepest queue; 4197 average depth
Elapsed
7h50m16s124ms
Threads
2 threads: 2 ABOUT_TO_GET_URI; 2 noActiveProcessor 
Frontier
RUN - 33 URI queues: 4 active (0 in-process; 0 ready; 4 snoozed); 0 inactive; 0 ineligible; 0 retired; 29 exhausted 
Memory
82250 KiB used; 131320 KiB current heap; 253440 KiB max heap

Pastebin of the job log: https://pastebin.com/raw/nYj0mVDy

Support full wildcard syntax in robots.txt directives

We only support trailing * wildcards at present. Ideally we should support wildcards as defined in https://developers.google.com/search/reference/robots_txt

The code to modify would be:

public boolean allows(String path) {
    return !(longestPrefixLength(disallows, path) > longestPrefixLength(allows, path));
}

The actual wildcards are not that difficult, but getting the precedence right is harder. Perhaps we can use a standard library, e.g. the crawler-commons code?
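
The wildcard matching itself is indeed straightforward; a minimal sketch translating a robots.txt pattern (with * and a trailing $, per the Google spec) into a regex, leaving the precedence question aside:

import java.util.regex.Pattern;

public class RobotsWildcard {
    // Translate a robots.txt path pattern into an anchored regex:
    // '*' matches any run of characters, a trailing '$' anchors the end,
    // everything else is matched literally.
    public static Pattern toRegex(String robotsPattern) {
        StringBuilder re = new StringBuilder("^");
        for (int i = 0; i < robotsPattern.length(); i++) {
            char c = robotsPattern.charAt(i);
            if (c == '*') {
                re.append(".*");
            } else if (c == '$' && i == robotsPattern.length() - 1) {
                re.append('$');
            } else {
                re.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(re.toString());
    }

    public static void main(String[] args) {
        // e.g. Disallow: /?hl=*& matches /?hl=en&foo but not /?hl=en
        System.out.println(toRegex("/?hl=*&").matcher("/?hl=en&foo").lookingAt()); // true
        System.out.println(toRegex("/?hl=*&").matcher("/?hl=en").lookingAt());     // false
    }
}

Per Google's spec, precedence then goes to the most specific (longest) matching rule rather than the longest common prefix, which is where longestPrefixLength would need to change.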

Google Drive robots.txt broken

robots.txt parsing is broken for Google Drive. Here are the directives:

User-agent: *
Crawl-delay: 1
Allow: /$
Allow: /?hl=
Disallow: /?hl=*&
Allow: /support/
Allow: /a/
Allow: /Doc
Allow: /View
Allow: /ViewDoc
Allow: /present
Allow: /Present
Allow: /TeamPresent
Allow: /EmbedSlideshow
Allow: /presentation
Allow: /templates
Allow: /previewtemplate
Allow: /fileview
Allow: /gview
Allow: /viewer
Allow: /leaf
Allow: /file
Allow: /open
Allow: /document
Allow: /drawings
Allow: /demo
Allow: /folder
Allow: /start
Allow: /spreadsheet
Allow: /forms
Allow: /macros
Allow: /keep
Allow: /static
Allow: /drive/
Disallow: /templateabuse
Disallow: /

showing a clear intent to allow "/file".

and here's the error I get when I try to save this page: https://drive.google.com/file/d/1SDlPNIl02BH-q-itDAR1nMNJxGOB_iCz/view

[screenshot of the error message omitted]
