Comments (6)

ClemensRobbenhaar commented on June 9, 2024

I am not sure if it is a "decent" one, but here is a PR to add DoH support: #476

from heritrix3.

ato commented on June 9, 2024

Hi Martin,

Heritrix does its own DNS lookups because it writes the DNS records and IP addresses to the WARC files. Other features, like geolocation and IP address decide rules, also depend on knowing the IP addresses. Ignoring hostnames that do not resolve, while perhaps not essential, likely also helps keep a certain amount of garbage URLs out of the queues early.

If you don't need those features and would like to try to modify Heritrix to work without DNS, a quick and dirty workaround might be a new implementation of org.archive.modules.net.ServerCache which returns a dummy IP address for every lookup. A proper solution would need an option to disable the DNS precondition check in PreconditionEnforcer, and the WARC and ARC writers would have to be modified to work without IP addresses. Right now they assume IP addresses are always available.

Cheers,

Alex
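
As a rough illustration of the "dummy IP address" idea above, the following sketch shows only how a placeholder address can be created in Java without triggering a DNS lookup. It does not use or reproduce the Heritrix ServerCache API; the class name PlaceholderAddressSketch, the method placeholderFor, and the choice of 192.0.2.1 (a documentation-only address from RFC 5737) are assumptions made for this example.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class PlaceholderAddressSketch {
    // Assumption: 192.0.2.1 is from TEST-NET-1 (RFC 5737), reserved for documentation,
    // so it cannot collide with a real host.
    private static final byte[] PLACEHOLDER_IP = {(byte) 192, 0, 2, 1};

    /**
     * Hypothetical helper: returns the same placeholder address for any hostname.
     * InetAddress.getByAddress(String, byte[]) does not consult a name service.
     */
    static InetAddress placeholderFor(String hostname) throws UnknownHostException {
        return InetAddress.getByAddress(hostname, PLACEHOLDER_IP);
    }

    public static void main(String[] args) throws UnknownHostException {
        InetAddress address = placeholderFor("example.org");
        System.out.println(address.getHostName() + " -> " + address.getHostAddress());
    }
}
```

A ServerCache replacement along the lines Alex describes would presumably hand an address like this to whatever Heritrix stores per host; how exactly that is wired up is not covered in this thread.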

marhop commented on June 9, 2024

Hi Alex,

thanks for the quick and thorough explanation, greatly appreciated! Should I read your second paragraph more like "we're looking at it" or "we definitely won't do it ourselves, but wouldn't reject a decent pull request either"?

Thanks,
Martin

ato commented on June 9, 2024

"we definitely won't do it ourselves, but wouldn't reject a decent pull request either"

I can't speak for all Heritrix contributors but I suspect the answer for most would be this one. ;-)

marhop commented on June 9, 2024

OK, good to know. Thanks again!

marhop commented on June 9, 2024

I tried to make some modifications but finally gave up because I cannot judge the global implications of DNS queries and IP addresses in Heritrix without digging really deep into the codebase.

Here's a possible workaround, however, for anyone with a similar problem: use DNS over HTTPS (DoH) to query an external DNS server via your web (HTTPS) proxy. I achieved good results with dnss, which works well with a web proxy provided you use a recent version. The configuration details are specific to your environment, but the general idea is this: a tool like dnss can run as a service on your local machine, listening on localhost:53. If you configure your network settings to use 127.0.0.1 as your DNS server, all DNS queries, including those made by Heritrix, will go to localhost:53. From there, dnss forwards them via HTTPS (and thus via your web proxy, whose IP address it is able to pick up from environment variables) to an external DNS server such as 1.1.1.1 (Cloudflare) or 8.8.8.8 (Google). That way it is possible to use an external DNS server even when a firewall blocks port 53.
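
To make the "forwards them via HTTPS" step more concrete, here is a minimal sketch of what a DoH request can look like, assuming the GET form described in RFC 8484: the DNS query is built in the usual wire format and passed base64url-encoded in a dns parameter. This is not taken from dnss and may differ from what it actually sends; the class name DohQuerySketch, its method names, and the endpoint URL https://cloudflare-dns.com/dns-query are assumptions for illustration only.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class DohQuerySketch {

    /** Builds a minimal DNS query (RFC 1035 wire format) asking for the A record of hostname. */
    static byte[] buildAQuery(String hostname) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(0); out.write(0);       // ID: 0
        out.write(0x01); out.write(0x00); // flags: recursion desired
        out.write(0); out.write(1);       // QDCOUNT: one question
        out.write(0); out.write(0);       // ANCOUNT
        out.write(0); out.write(0);       // NSCOUNT
        out.write(0); out.write(0);       // ARCOUNT
        // QNAME: each label is prefixed by its length, terminated by a zero byte.
        for (String label : hostname.split("\\.")) {
            byte[] bytes = label.getBytes(StandardCharsets.US_ASCII);
            out.write(bytes.length);
            out.write(bytes, 0, bytes.length);
        }
        out.write(0);                     // root label
        out.write(0); out.write(1);       // QTYPE: A
        out.write(0); out.write(1);       // QCLASS: IN
        return out.toByteArray();
    }

    /** Returns the URL for a DoH GET request: base64url without padding in the dns parameter. */
    static String dohGetUrl(String endpoint, String hostname) {
        String encoded = Base64.getUrlEncoder().withoutPadding()
                .encodeToString(buildAQuery(hostname));
        return endpoint + "?dns=" + encoded;
    }

    public static void main(String[] args) {
        // Assumed endpoint, used here only as an example of a public DoH resolver.
        System.out.println(dohGetUrl("https://cloudflare-dns.com/dns-query", "www.example.com"));
    }
}
```

Because the request is ordinary HTTPS on port 443, it can pass through a web proxy like any other web traffic, which is why this approach works where port 53 is blocked.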