Comments (6)
I am not sure if it is a "decent" one, but here is a PR to add DoH support: #476
from heritrix3.
Hi Martin,
Heritrix does its own DNS lookups because it writes the DNS records and IP addresses to the WARC files. Other features, such as geolocation and IP-address-based decide rules, also depend on knowing the IP addresses. Ignoring hostnames that do not resolve, while perhaps not essential, likely also helps keep a certain amount of garbage URLs out of the queues early.
If you don't need those features and would like to try to modify Heritrix to work without DNS, a quick-and-dirty workaround might be a new implementation of org.archive.modules.net.ServerCache which returns a dummy IP address for every lookup. A proper solution would need an option to disable the DNS precondition check in PreconditionEnforcer, and the WARC and ARC writers would have to be modified to work without IP addresses. Right now they assume IP addresses are always available.
Cheers,
Alex
from heritrix3.
Hi Alex,
thanks for the quick and thorough explanation, greatly appreciated! Should I read your second paragraph more like "we're looking at it" or "we definitely won't do it ourselves, but wouldn't reject a decent pull request either"?
Thanks,
Martin
from heritrix3.
"we definitely won't do it ourselves, but wouldn't reject a decent pull request either"
I can't speak for all Heritrix contributors but I suspect the answer for most would be this one. ;-)
from heritrix3.
OK, good to know. Thanks again!
from heritrix3.
I tried to make some modifications but eventually gave up, because I cannot judge the global implications of DNS queries and IP addresses in Heritrix without digging really deep into the codebase.
Here's a possible workaround, however, for anyone with a similar problem: use DNS over HTTPS to query an external DNS server via your web (HTTPS) proxy. I achieved good results with dnss, which works well with a web proxy provided you use a recent version. The configuration details are specific to your environment, but the general idea is this: a tool like dnss can run as a service on your local machine, listening on localhost:53. If you configure your network settings to use 127.0.0.1 as your DNS server, all DNS queries, and particularly those made by Heritrix, will go to localhost:53, from where dnss forwards them via HTTPS (and thus via your web proxy, whose IP address it is able to pick up from environment variables) to an external DNS server such as 1.1.1.1 (Cloudflare) or 8.8.8.8 (Google). That way it is possible to use an external DNS server regardless of firewall constraints blocking port 53.
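For anyone curious what such a forwarder puts on the wire, here is a minimal Python sketch of a DNS-over-HTTPS GET request in the style of RFC 8484. It is not how dnss is implemented, just an illustration of the idea. The endpoint URL and all function names are my own assumptions. As far as I know, `urllib` picks up proxy settings such as `HTTPS_PROXY` from the environment by default, which is the same mechanism described above.

```python
import base64
import struct
import urllib.request

# Assumption: Cloudflare's public DoH endpoint; substitute your own resolver.
DOH_ENDPOINT = "https://cloudflare-dns.com/dns-query"


def build_dns_query(hostname: str, qtype: int = 1) -> bytes:
    """Build a DNS query in wire format (qtype 1 = A record)."""
    # Header: ID=0 (suggested for DoH), flags=0x0100 (recursion desired),
    # one question, no answer/authority/additional records.
    header = struct.pack("!HHHHHH", 0, 0x0100, 1, 0, 0, 0)
    labels = hostname.rstrip(".").encode("idna").split(b".")
    # Each label is prefixed with its length; a zero byte ends the name.
    qname = b"".join(bytes([len(label)]) + label for label in labels) + b"\x00"
    # Question trailer: query type and class IN (1).
    return header + qname + struct.pack("!HH", qtype, 1)


def doh_get_url(hostname: str, endpoint: str = DOH_ENDPOINT) -> str:
    """Return the GET URL carrying the base64url-encoded query (no padding)."""
    encoded = base64.urlsafe_b64encode(build_dns_query(hostname))
    return f"{endpoint}?dns={encoded.rstrip(b'=').decode('ascii')}"


def resolve_raw(hostname: str) -> bytes:
    """Send the query over HTTPS and return the raw DNS response message."""
    request = urllib.request.Request(
        doh_get_url(hostname),
        headers={"Accept": "application/dns-message"},
    )
    # urlopen uses proxies configured via environment variables, if any.
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read()
```

Calling `resolve_raw("www.example.com")` should return the binary DNS answer, which still needs to be parsed. A local forwarder does roughly this for every query it receives on port 53.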
from heritrix3.