Comments (3)
Hi Mojmir, and thanks for opening this issue. Custom lines such as `Crawl-delay` and `Sitemap` should indeed not affect the parsing of other lines; in fact, they should be ignored, as also stipulated in the REP internet draft. For example, from Googlebot's perspective these two robots.txt snippets are equivalent:
```
User-agent: *
Crawl-delay: 10
User-agent: badbot
Disallow: /
```

vs.

```
User-agent: *
Some other unsupported line in plain text
User-agent: badbot
Disallow: /
```
This means that `User-agent: *` is merged together with `User-agent: badbot`, essentially disallowing everything for the global (`*`) user-agent.

Not ignoring custom lines such as `Crawl-delay` was a bug, and it has been fixed in the following commit: c8ac4b1

Unfortunately the testing tool in Google Search Console, unlike Googlebot, does not use this library, so we haven't gotten around to fixing this obscure bug there, too.
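The merging behavior described above can be sketched with a toy group parser (a simplified illustration, not this library's actual API): an ignored line does not terminate a run of `User-agent` lines, so the agents on both sides of it end up in one group sharing the rules that follow.

```python
def disallowed_paths(robots_txt: str, agent: str) -> set[str]:
    """Toy robots.txt group parser that, like Googlebot, skips
    unrecognized lines entirely."""
    rules: dict[str, set[str]] = {}
    agents: list[str] = []       # user-agents of the group being built
    in_header = True             # still collecting the group's User-agent lines?
    for raw in robots_txt.splitlines():
        key, _, value = raw.partition(":")
        key, value = key.strip().lower(), value.strip()
        if key == "user-agent":
            if not in_header:    # a rule was already seen: start a new group
                agents = []
                in_header = True
            agents.append(value.lower())
            rules.setdefault(value.lower(), set())
        elif key in ("allow", "disallow"):
            in_header = False    # the group's rule section has begun
            if key == "disallow" and value:
                for a in agents:
                    rules[a].add(value)
        # Any other line (Crawl-delay, Sitemap, plain text, ...) is ignored
        # and, crucially, does NOT end the run of User-agent lines.
    return rules.get(agent.lower(), set())

snippet = """User-agent: *
Crawl-delay: 10
User-agent: badbot
Disallow: /"""

print(disallowed_paths(snippet, "*"))       # {'/'} -- '*' merged with badbot
print(disallowed_paths(snippet, "badbot"))  # {'/'}
```

Because the `Crawl-delay` line is skipped, the header run `User-agent: *` / `User-agent: badbot` is never interrupted, and `Disallow: /` applies to both agents.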
from robotstxt.
Hi Gary, thank you for your answer.

I didn't know that; the syntax of robots.txt really is a bit tricky. Unofficial rules (e.g. `Crawl-delay`) can lead to different evaluations by different bots, because if a bot ignores an unofficial rule, the two adjacent groups are merged into one and the meaning of the robots.txt can change dramatically.

With this in mind, it is better to put unofficial rules at the end of the file, especially if they are used with `User-agent: *`. So Googlebot (which ignores `Crawl-delay`) will be blocked by a robots.txt like this:
```
User-agent: *
Crawl-delay: 10
User-agent: badbot
Disallow: /
```
...but not by a robots.txt like this:

```
User-agent: badbot
Disallow: /
User-agent: *
Crawl-delay: 10
```

...even though both seem to do the same thing.
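The "different bots, different evaluations" point can be observed with Python's standard-library `urllib.robotparser` (behavior as of current CPython; this is an illustration, not a claim about any particular crawler). It treats `Crawl-delay` as a group rule, so the `Crawl-delay` line closes the first group, the two `User-agent` groups stay separate, and an agent matching `*` is not blocked by the first snippet:

```python
from urllib.robotparser import RobotFileParser

# Same record as the first snippet above: an unofficial Crawl-delay line
# sits between the two User-agent lines.
lines = [
    "User-agent: *",
    "Crawl-delay: 10",
    "User-agent: badbot",
    "Disallow: /",
]

rp = RobotFileParser()
rp.parse(lines)

# urllib.robotparser honors Crawl-delay, so that line ends the first group's
# header; '*' and 'badbot' are NOT merged here, unlike in Googlebot's parser.
print(rp.can_fetch("Googlebot", "/page"))  # True  -- falls back to the '*' group
print(rp.can_fetch("badbot", "/page"))     # False -- badbot's own group disallows /
print(rp.crawl_delay("*"))                 # 10
```

So the very same file blocks Googlebot (which merges the groups) but not a hypothetical crawler built on `urllib.robotparser` (which keeps them apart) — which is exactly why putting unofficial rules at the end of the file is the safer layout.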
You're correct: lines that are not supported by Googlebot but that sit in an otherwise supported group, like `Crawl-delay` in your examples, should ideally be placed at the end of the file.
Related Issues (20)
- [redacted]
- [redacted]
- [redacted]
- [redacted]
- [redacted]
- [redacted] HOT 1
- Consider a WASM build HOT 2
- googletest.git has tag main HOT 2
- Google's robots.txt parser and matcher
- Allow wider range of chars for valid user-agent identifier / 'product' token HOT 13
- bazel test failed with `bazelisk`: Repository '@bazel_skylib' is not defined
- CMAKE_CXX_STANDARD 14 HOT 1
- Update README build requirements HOT 1
- User-agent names in test ID_UserAgentValueCaseInsensitive to follow the standard HOT 5
- Special characters * and $ not matched in URI
- Issues with Bazel build HOT 1
- gpyrobotstxt, a Python Native port of this repo HOT 1
- SEO site
- An encoding test does not appear to match the RFC?