Following <a href="https://groups.yahoo.com/neo/groups/archive-crawler/conversations/t


Avoid speculative links extraction for meta fields known not to contain links (heritrix3, open)

internetarchive commented on June 10, 2024

Avoid speculative links extraction for meta fields known not to contain links

from heritrix3.

Comments (5)

anjackson commented on June 10, 2024

Apparently this happens a lot with og:facebook-tags attributes.

Perhaps given the change in usage of these fields in recent years, it's time to change the default behaviour to avoid this speculative link extraction?


anjackson commented on June 10, 2024

However, looking at the code in question, it appears that ExtractorHTML speculatively extracts a link from the content="..." attribute of any <meta> tag whose value looks like a URL, except for property="robots" or property="refresh":

heritrix3/modules/src/main/java/org/archive/modules/extractor/ExtractorHTML.java

Lines 990 to 996 in a831676

    else if (content != null) {
        // look for likely urls in 'content' attribute
        try {
            if (UriUtils.isVeryLikelyUri(content)) {
                int max = getExtractorParameters().getMaxOutlinks();
                addRelativeToBase(curi, max, content,
                        HTMLLinkContext.META, Hop.SPECULATIVE);

I think this generally won't happen with textual content attributes, but in this case the domain-name form of the value appears to make isVeryLikelyUri(...) return true.

heritrix3/commons/src/main/java/org/archive/util/UriUtils.java

Lines 394 to 469 in 0581170

    protected static final String QNV = "[a-zA-Z_]+=(?:[\\w-/.]|%[0-9a-fA-F]{2})*"; // name=value for query strings

    // group(1) filename
    // group(2) filename extension with leading '.'
    protected static final String LIKELY_RELATIVE_URI_PATTERN =
            "(?:\\.?/)?" // may start with "/" or "./"
            + "(?:(?:[\\s\\w-]+|\\.\\.)(?:/))*" // may have path/segments/segment2
            + "([\\s\\w-]+(?:\\.[\\w-]+)??(\\.[a-zA-Z0-9]{2,5})?)?" // may have a filename with or without an extension
            + "(?:\\?(?:"+ QNV + ")(?:&(?:" + QNV + "))*)?" // may have a ?query=string
            + "(?:#[\\w-]+)?"; // may have a #fragment

    public static boolean isVeryLikelyUri(CharSequence candidate) {
        // must have a . or /
        if (!TextUtils.matches(NAIVE_LIKELY_URI_PATTERN, candidate)) {
            return false;
        }

        // absolute uri
        if (TextUtils.matches("^(?i)https?://[^<>\\s/]+\\.[^<>\\s/]+(?:/[^<>\\s]*)?", candidate)) {
            return true;
        }

        // "protocol-relative" uri
        if (TextUtils.matches("^//[^<>\\s/]+\\.[^<>\\s/]+(?:/[^<>\\s]*)?", candidate)) {
            return true;
        }

        // relative or server-relative uri
        Matcher matcher = TextUtils.getMatcher(LIKELY_RELATIVE_URI_PATTERN, candidate);
        if (!matcher.matches()) {
            return false;
        }

        /*
         * Remaining tests discard stuff that the
         * LIKELY_RELATIVE_URI_PATTERN can't catch
         */

        // if filename contains two dots, it must end with a known good extension
        String filename = matcher.group(1);
        String extension = matcher.group(2);
        if (filename != null && extension != null
                && filename.indexOf('.') != filename.lastIndexOf('.')
                && !KNOWN_GOOD_FILE_EXTENSIONS.contains(extension)) {
            return false;
        }

        if (TextUtils.matches(".*\\s+.*", candidate)
                && (extension == null
                        || !KNOWN_GOOD_FILE_EXTENSIONS.contains(extension))) {
            return false;
        }

        // text or application mimetype
        if (TextUtils.matches("(?:text|application)/[^/]+", candidate)) {
            return false;
        }

        // audio, video or image mimetype
        if (AUDIO_VIDEO_IMAGE_MIMETYPE_SET.contains(candidate)) {
            return false;
        }

        // decimal number
        if (TextUtils.matches("\\d+(?:\\.\\d+)*", candidate)) {
            return false;
        }

        // likely css class, e.g. "div.menu", "a.help", etc
        Matcher m = TextUtils.getMatcher("([^./]+)\\.([^./]+)", candidate);
        if (m.matches() && HTML_TAGS.contains(m.group(1).toLowerCase())) {
            return false;
        }

        return true;
    }

Hence, I'm not sure how often this problem will really turn up - it may not be worth worrying about.
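To make the domain-name case concrete, here is a minimal sketch that copies QNV and LIKELY_RELATIVE_URI_PATTERN from the UriUtils excerpt above and runs only the relative-URI match on a few sample values. The class name RelativePatternDemo and the helper matchesRelativePattern are made up for illustration, and plain java.util.regex stands in for Heritrix's TextUtils. The later filters in isVeryLikelyUri (mimetypes, decimals, CSS classes) are not reproduced here.

```java
import java.util.regex.Pattern;

// Illustrative only: not part of Heritrix.
class RelativePatternDemo {
    // Copied from the UriUtils excerpt quoted in this thread.
    static final String QNV = "[a-zA-Z_]+=(?:[\\w-/.]|%[0-9a-fA-F]{2})*";
    static final String LIKELY_RELATIVE_URI_PATTERN =
            "(?:\\.?/)?"
            + "(?:(?:[\\s\\w-]+|\\.\\.)(?:/))*"
            + "([\\s\\w-]+(?:\\.[\\w-]+)??(\\.[a-zA-Z0-9]{2,5})?)?"
            + "(?:\\?(?:" + QNV + ")(?:&(?:" + QNV + "))*)?"
            + "(?:#[\\w-]+)?";

    static final Pattern RELATIVE = Pattern.compile(LIKELY_RELATIVE_URI_PATTERN);

    // Hypothetical helper: true if the whole candidate matches the relative-URI pattern.
    static boolean matchesRelativePattern(String candidate) {
        return RELATIVE.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"iNetWorker.at", "Drivingthenation.com", "index.html", "Jane Doe, Editor"};
        for (String s : samples) {
            System.out.println(s + " -> " + matchesRelativePattern(s));
        }
    }
}
```

In this sketch the three dotted names pass the pattern while the free-text value does not, because the comma is outside every character class. That is consistent with a bare domain being treated like a file name with an extension.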

However, for common properties that are known not to be used for absolute or relative URLs of any sort, the ExtractorHTML class could be modified to skip this speculative link extraction.
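One way such a skip could look is a small deny-list consulted before the speculative check. This is only a sketch: the class MetaSpeculationFilter, the method shouldSpeculate, and the contents of the list are assumptions for illustration, not existing Heritrix code or configuration. Only "publisher" and "twitter:domain" come from reports in this thread; the other entries are guesses.

```java
import java.util.Locale;
import java.util.Set;

// Illustrative only: not part of Heritrix.
class MetaSpeculationFilter {
    // Hypothetical deny-list of meta names whose content is not expected to be a link.
    // "author" and "generator" are illustrative guesses, not taken from the thread.
    static final Set<String> NON_LINK_META_NAMES =
            Set.of("publisher", "twitter:domain", "author", "generator");

    // Returns false when the meta name is on the deny-list, true otherwise.
    // A missing name keeps the current behaviour (speculate).
    static boolean shouldSpeculate(String metaName) {
        if (metaName == null) {
            return true;
        }
        String normalized = metaName.trim().toLowerCase(Locale.ROOT);
        return !NON_LINK_META_NAMES.contains(normalized);
    }

    public static void main(String[] args) {
        System.out.println(shouldSpeculate("publisher"));      // on the deny-list
        System.out.println(shouldSpeculate("Twitter:Domain")); // case-insensitive match
        System.out.println(shouldSpeculate("og:image"));       // not on the list
    }
}
```

In the extractor, a guard of this kind would presumably sit in front of the isVeryLikelyUri(content) call; where the meta name is available at that point in ExtractorHTML is not shown in the excerpt, so that wiring is left open.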


ToRu82 commented on June 10, 2024

This really happens very often, and a fix would save a lot of bandwidth and trouble. For example, when crawling www.klausenstein.at, the host generates an automatic abuse report because of this line in the page source:

<meta name="publisher" content="iNetWorker.at"/>

This causes Heritrix to request http://www.klausenstein.at/iNetWorker.at, which is interpreted as a crawler trap and results in an abuse report. We have faced many similar situations with tags such as
<meta name="publisher" content="domain.com"/> ...
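The bogus request follows from ordinary relative-reference resolution: the content value is treated as a relative path and resolved against the page URL. The sketch below illustrates this with java.net.URI. Heritrix uses its own URI classes, so this is an approximation of the effect rather than the crawler's actual code path; the class and method names are made up, and the /some/page.html path is a hypothetical example.

```java
import java.net.URI;

// Illustrative only: approximates the effect of resolving a meta content value
// against the page URL, using the JDK's URI class instead of Heritrix's own.
class SpeculativeResolveDemo {
    static String resolve(String pageUrl, String content) {
        return URI.create(pageUrl).resolve(content).toString();
    }

    public static void main(String[] args) {
        // Case reported in this thread.
        System.out.println(resolve("http://www.klausenstein.at/", "iNetWorker.at"));
        // prints http://www.klausenstein.at/iNetWorker.at

        // Hypothetical deeper page: the last path segment is replaced.
        System.out.println(resolve("http://www.klausenstein.at/some/page.html", "iNetWorker.at"));
        // prints http://www.klausenstein.at/some/iNetWorker.at
    }
}
```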


ToRu82 commented on June 10, 2024

Unfortunately, these problems keep increasing. This tag also causes trouble:

<meta name="twitter:domain" content="Drivingthenation.com" />

It appears on every page of the domain and generates an additional invalid request of the form "current URL + Drivingthenation.com" for every single page fetched, which leads to thousands of extra requests that return 404. For instance, www.drivingthenation.com/category/automobilesandenergy/ "links" to www.drivingthenation.com/category/automobilesandenergy/Drivingthenation.com, and so on, but none of these "linked" pages exist.

It would be very helpful if a solution could be found for this problem in the near future. These incorrectly extracted URLs cause great frustration for webmasters. The culprit is always a content="domain.com" attribute, which is most likely never a link.
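If the content="domain.com" form is the main offender, a value-based check is one possible complement to filtering by meta name. The sketch below is a rough heuristic under stated assumptions, not Heritrix code: BareDomainHeuristic, looksLikeBareDomain, and the small FILE_EXTENSIONS set are invented for illustration, and the set is only a stand-in for whatever list of known file extensions a real implementation would consult.

```java
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

// Illustrative only: a guess at how a bare domain name could be told apart
// from a relative file name.
class BareDomainHeuristic {
    // Dot-separated labels ending in an alphabetic label; no '/', '?', '#', or spaces.
    static final Pattern HOST_LIKE =
            Pattern.compile("[A-Za-z0-9-]+(?:\\.[A-Za-z0-9-]+)*\\.[A-Za-z]{2,}");

    // Hypothetical stand-in for a list of file extensions that should still be
    // treated as relative links.
    static final Set<String> FILE_EXTENSIONS =
            Set.of("html", "htm", "php", "css", "js", "jpg", "png", "gif", "pdf", "xml");

    static boolean looksLikeBareDomain(String content) {
        if (content == null) {
            return false;
        }
        String value = content.trim();
        if (!HOST_LIKE.matcher(value).matches()) {
            return false;
        }
        String lastLabel = value.substring(value.lastIndexOf('.') + 1).toLowerCase(Locale.ROOT);
        return !FILE_EXTENSIONS.contains(lastLabel);
    }

    public static void main(String[] args) {
        for (String s : new String[] {"iNetWorker.at", "Drivingthenation.com", "index.html", "images/logo.png"}) {
            System.out.println(s + " -> " + looksLikeBareDomain(s));
        }
    }
}
```

A heuristic like this will misjudge some values, since file extensions and top-level domains overlap, so it would probably need to be configurable rather than hard-wired.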


mvaitkus commented on June 10, 2024

In my opinion, this approach of guessing URLs by parsing JavaScript content must die completely. It easily causes hundreds of not-found errors per minute, which often triggers alerts. Whoever thought this was a good approach has probably never hosted or monitored anything.


