
Comments (5)

anjackson commented on June 10, 2024

Apparently this happens a lot with og:facebook-tags attributes.

Perhaps given the change in usage of these fields in recent years, it's time to change the default behaviour to avoid this speculative link extraction?

from heritrix3.

anjackson commented on June 10, 2024

However, looking at the code in question, it appears that ExtractorHTML speculatively extracts anything that looks like a URL from the content="..." attribute of any <meta> tag, except those with property="robots" or property="refresh":

    else if (content != null) {
        //look for likely urls in 'content' attribute
        try {
            if (UriUtils.isVeryLikelyUri(content)) {
                int max = getExtractorParameters().getMaxOutlinks();
                addRelativeToBase(curi, max, content,
                        HTMLLinkContext.META, Hop.SPECULATIVE);

I think this generally won't happen with textual content attributes, but in this case the bare domain-name form appears to make isVeryLikelyUri(...) return true.

    protected static final String QNV = "[a-zA-Z_]+=(?:[\\w-/.]|%[0-9a-fA-F]{2})*"; // name=value for query strings

    // group(1) filename
    // group(2) filename extension with leading '.'
    protected static final String LIKELY_RELATIVE_URI_PATTERN =
            "(?:\\.?/)?"                                                   // may start with "/" or "./"
            + "(?:(?:[\\s\\w-]+|\\.\\.)(?:/))*"                            // may have path/segments/segment2
            + "([\\s\\w-]+(?:\\.[\\w-]+)??(\\.[a-zA-Z0-9]{2,5})?)?"        // may have a filename with or without an extension
            + "(?:\\?(?:"+ QNV + ")(?:&(?:" + QNV + "))*)?"                // may have a ?query=string
            + "(?:#[\\w-]+)?";                                             // may have a #fragment

    public static boolean isVeryLikelyUri(CharSequence candidate) {
        // must have a . or /
        if (!TextUtils.matches(NAIVE_LIKELY_URI_PATTERN, candidate)) {
            return false;
        }

        // absolute uri
        if (TextUtils.matches("^(?i)https?://[^<>\\s/]+\\.[^<>\\s/]+(?:/[^<>\\s]*)?", candidate)) {
            return true;
        }

        // "protocol-relative" uri
        if (TextUtils.matches("^//[^<>\\s/]+\\.[^<>\\s/]+(?:/[^<>\\s]*)?", candidate)) {
            return true;
        }

        // relative or server-relative uri
        Matcher matcher = TextUtils.getMatcher(LIKELY_RELATIVE_URI_PATTERN, candidate);
        if (!matcher.matches()) {
            return false;
        }

        /*
         * Remaining tests discard stuff that the
         * LIKELY_RELATIVE_URI_PATTERN can't catch
         */

        // if filename contains two dots, it must end with a known good extension
        String filename = matcher.group(1);
        String extension = matcher.group(2);
        if (filename != null && extension != null
                && filename.indexOf('.') != filename.lastIndexOf('.')
                && !KNOWN_GOOD_FILE_EXTENSIONS.contains(extension)) {
            return false;
        }

        if (TextUtils.matches(".*\\s+.*", candidate)
                && (extension == null
                        || !KNOWN_GOOD_FILE_EXTENSIONS.contains(extension))) {
            return false;
        }

        // text or application mimetype
        if (TextUtils.matches("(?:text|application)/[^/]+", candidate)) {
            return false;
        }

        // audio, video or image mimetype
        if (AUDIO_VIDEO_IMAGE_MIMETYPE_SET.contains(candidate)) {
            return false;
        }

        // decimal number
        if (TextUtils.matches("\\d+(?:\\.\\d+)*", candidate)) {
            return false;
        }

        // likely css class, e.g. "div.menu", "a.help", etc
        Matcher m = TextUtils.getMatcher("([^./]+)\\.([^./]+)", candidate);
        if (m.matches() && HTML_TAGS.contains(m.group(1).toLowerCase())) {
            return false;
        }

        return true;
    }
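
To make the domain-name case concrete, here is a minimal standalone sketch that applies only the LIKELY_RELATIVE_URI_PATTERN quoted above, using plain java.util.regex in place of Heritrix's TextUtils. The class name RelativeUriPatternDemo and its helper methods are made up for illustration, and the later filters in isVeryLikelyUri(...) are not reproduced because their constants are not shown here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RelativeUriPatternDemo {
    // Both constants are copied from the snippet quoted in this comment.
    static final String QNV = "[a-zA-Z_]+=(?:[\\w-/.]|%[0-9a-fA-F]{2})*";
    static final String LIKELY_RELATIVE_URI_PATTERN =
            "(?:\\.?/)?"
            + "(?:(?:[\\s\\w-]+|\\.\\.)(?:/))*"
            + "([\\s\\w-]+(?:\\.[\\w-]+)??(\\.[a-zA-Z0-9]{2,5})?)?"
            + "(?:\\?(?:" + QNV + ")(?:&(?:" + QNV + "))*)?"
            + "(?:#[\\w-]+)?";

    static final Pattern PATTERN = Pattern.compile(LIKELY_RELATIVE_URI_PATTERN);

    /** True if the whole candidate matches the relative-URI pattern. */
    static boolean matches(String candidate) {
        return PATTERN.matcher(candidate).matches();
    }

    /** The captured "extension" group, or null if there is no match or no extension. */
    static String extensionOf(String candidate) {
        Matcher m = PATTERN.matcher(candidate);
        return m.matches() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String[] candidates = {"iNetWorker.at", "Drivingthenation.com", "Hello, world"};
        for (String c : candidates) {
            System.out.println(c + " -> matches=" + matches(c)
                    + ", extension=" + extensionOf(c));
        }
    }
}
```

A bare domain name looks like "filename.ext" to this pattern: the top-level domain is captured as the file extension, and none of the later checks (two dots, whitespace, MIME type, decimal number, CSS class) apply to it.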

Given these filters, I'm not sure how often this problem will really turn up; it may not be worth worrying about.

However, for common properties that are known not to be used for absolute or relative URLs of any sort, the ExtractorHTML class could be modified to skip this speculative link extraction.
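
As a rough sketch of what such a skip could look like (not the actual ExtractorHTML code), the check below consults a deny-list of meta names before any speculative extraction is attempted. The class MetaSpeculationFilter, the method shouldSpeculate and the contents of the set are all hypothetical; the two names listed are simply the ones reported elsewhere in this thread.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class MetaSpeculationFilter {
    // Hypothetical deny-list: meta names whose content is assumed never to be a URL.
    static final Set<String> NON_URL_META_NAMES = Collections.unmodifiableSet(
            new HashSet<>(Arrays.asList("publisher", "twitter:domain")));

    /**
     * Returns false when the meta tag's name is on the deny-list, in which
     * case the caller would skip the isVeryLikelyUri(...) check entirely.
     */
    static boolean shouldSpeculate(String metaName) {
        if (metaName == null) {
            return true; // no name to go on, keep the existing behaviour
        }
        String normalized = metaName.trim().toLowerCase(Locale.ROOT);
        return !NON_URL_META_NAMES.contains(normalized);
    }

    public static void main(String[] args) {
        System.out.println("publisher      -> " + shouldSpeculate("publisher"));
        System.out.println("twitter:domain -> " + shouldSpeculate("twitter:domain"));
        System.out.println("description    -> " + shouldSpeculate("description"));
    }
}
```

In the extractor, a check like this would sit in front of the else if (content != null) branch quoted above. Whether the list should be hard-coded or configurable is left open here.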


ToRu82 commented on June 10, 2024

This really happens very often, and a fix would save a lot of bandwidth and trouble. For example, when crawling www.klausenstein.at, the host generates an automatic abuse report because of this line in the page source:

<meta name="publisher" content="iNetWorker.at"/>

This causes Heritrix to request http://www.klausenstein.at/iNetWorker.at, which is interpreted as a crawler trap and results in an abuse report. We have faced many similar situations with tags like
<meta name="publisher" content="domain.com"/> ...


ToRu82 commented on June 10, 2024

Unfortunately, the problems keep increasing. This tag also causes trouble:

<meta name="twitter:domain" content="Drivingthenation.com" />

It appears on every page of the domain and generates an additional invalid request of the form "current URL + Drivingthenation.com" for every page fetched, which leads to thousands of extra requests that return 404. For instance, www.drivingthenation.com/category/automobilesandenergy/ "links" to www.drivingthenation.com/category/automobilesandenergy/Drivingthenation.com, and so on, but none of these "linked" pages exist.
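
The "current URL + domain" form follows from ordinary relative-reference resolution. The sketch below reproduces it with java.net.URI from the standard library, as a stand-in for whatever URI class Heritrix uses internally; the class name RelativeResolutionDemo and the http:// scheme on the second base URL are assumptions for illustration.

```java
import java.net.URI;

public class RelativeResolutionDemo {
    /** Resolves a candidate "link" against the URL of the page it was found on. */
    static String resolve(String pageUrl, String candidate) {
        return URI.create(pageUrl).resolve(candidate).toString();
    }

    public static void main(String[] args) {
        // The meta content is treated as a relative path, so it is appended
        // to the directory part of the current page URL.
        System.out.println(resolve(
                "http://www.klausenstein.at/", "iNetWorker.at"));
        System.out.println(resolve(
                "http://www.drivingthenation.com/category/automobilesandenergy/",
                "Drivingthenation.com"));
    }
}
```

Because the result depends on the page being crawled, every page carrying the tag yields a different bogus URL, which is why the 404 count grows with the size of the site.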

It would be very helpful if a solution could be found for this problem in the near future. These incorrectly extracted URLs cause great frustration for webmasters. It is always a content="domain.com" attribute, which is most likely never a link.


mvaitkus commented on June 10, 2024

In my opinion, this approach of guessing URLs by parsing JavaScript content must die completely. It easily causes hundreds of not-found errors per minute, which often triggers alerts. Whoever thought this was a good approach has probably never hosted or monitored anything.

