google / robotstxt
The repository contains Google's robots.txt parser and matcher as a C++ library (compliant to C++11).
License: Apache License 2.0
I'm on Ubuntu 16.04.7 LTS and I tried building this project with CMake. But I got this problematic output when running the cmake command:
deploy@yamada:~/robotstxt/c-build⟫ cmake .. -DROBOTS_BUILD_TESTS=ON
-- The C compiler identification is GNU 5.4.0
-- The CXX compiler identification is GNU 5.4.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Configuring done
-- Generating done
-- Build files have been written to: /home/deploy/robotstxt/c-build/libs
Scanning dependencies of target googletest
[ 5%] Creating directories for 'googletest'
[ 11%] Performing download step (git clone) for 'googletest'
-- Avoiding repeated git clone, stamp file is up to date: '/home/deploy/robotstxt/c-build/libs'
[ 16%] No patch step for 'googletest'
[ 22%] Performing update step for 'googletest'
fatal: Needed a single revision
invalid upstream GIT_PROGRESS/master
No rebase in progress?
CMake Error at /home/deploy/robotstxt/c-build/libs/googletest-prefix/tmp/googletest-gitupdate.cmake:105 (message):
Failed to rebase in: '/'.
You will have to resolve the conflicts manually
CMakeFiles/googletest.dir/build.make:95: recipe for target 'googletest-prefix/src/googletest-stamp/googletest-update' failed
CMakeFiles/Makefile2:67: recipe for target 'CMakeFiles/googletest.dir/all' failed
make[2]: *** [googletest-prefix/src/googletest-stamp/googletest-update] Error 1
make[1]: *** [CMakeFiles/googletest.dir/all] Error 2
Makefile:83: recipe for target 'all' failed
make: *** [all] Error 2
CMake Error at CMakeLists.txt:73 (MESSAGE):
Failed to download dependencies: 2
-- Configuring incomplete, errors occurred!
See also "/home/deploy/robotstxt/c-build/CMakeFiles/CMakeOutput.log".
Note: My version of git is 2.7.4.
Do you have an idea why this is not working?
Hello,
Could you please change the branch referenced here for https://github.com/google/googletest.git from master to main? It looks like the default branch was renamed.
Thank you
Line 114 in 750aec7
It says it returns true iff any user agent in the vector is allowed to crawl. In fact, what it appears to do is effectively collapse all rules that apply to any of the user agents in the vector into a single ruleset and then evaluate against that. That isn't always the same as any agent in the list being allowed.
e.g.
robots.txt:
User-agent: googlebot
Disallow: /foo/
If we call this method against the URL /foo/ with a vector containing both googlebot and otherbot, it will return FALSE even though otherbot is clearly allowed to crawl /foo/, because (as I understand it) it's doing the equivalent of finding all rules that apply to either user agent and collapsing them into a single ruleset like:
User-agent: googlebot
User-agent: otherbot
Disallow: /foo/
So I think the comment is misleading, but would appreciate more eyes on the question!
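For reference, here is a minimal sketch of the scenario described, written against the public robots.h API; the AllowedByRobots name and signature are my reading of the header, so treat the exact details as an assumption rather than a documented example:

#include <iostream>
#include <string>
#include <vector>

#include "robots.h"

int main() {
  const std::string robotstxt =
      "User-agent: googlebot\n"
      "Disallow: /foo/\n";
  const std::vector<std::string> user_agents = {"googlebot", "otherbot"};

  googlebot::RobotsMatcher matcher;
  // Per the report above, this returns false even though "otherbot" on its
  // own would be allowed to crawl /foo/, because the rules applying to either
  // agent are collapsed into one ruleset before matching.
  const bool allowed = matcher.AllowedByRobots(
      robotstxt, &user_agents, "http://example.com/foo/");
  std::cout << "allowed: " << std::boolalpha << allowed << std::endl;
  return 0;
}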
Thanks for releasing this reference code!
We have ported this library to Python, including the executable and test suite.
The Python version keeps the same behaviour as the original library and passes 100% of the C++ test cases.
Here is the repository: https://github.com/Cocon-Se/gpyrobotstxt
Let's say a rogue nation state uses this as "entropy" for human cloning.
I do a text stream, and "fuzz" the output to your input.
:(
???
Hi,
I've just fixed an issue reported against my Go port of this library. According to the spec used by the Google library here, valid chars for a user-agent string are "a-zA-Z_-", but RFC 7231 defines the 'product' part of user-agent as being a 'token', defined as:
token = 1*tchar
tchar = "!" / "#" / "$" / "%" / "&" / "'" / "*"
/ "+" / "-" / "." / "^" / "_" / "`" / "|" / "~"
/ DIGIT / ALPHA
; any VCHAR, except delimiters
I think it's important to fix this because, according to Wikipedia's robots.txt, there are bots in the wild using user-agent strings with characters outside those permitted by the current RobotsMatcher::ExtractUserAgent implementation - which means that 'disallow' directives that would otherwise match are in fact failing to match. (Examples include MJ12bot and k2spider.)
See jimsmart/grobotstxt#4 for further details.
HTH
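For comparison, here is a minimal sketch of what a token-aware extraction could look like under RFC 7231's tchar definition. This is an illustrative variant only, not the library's current RobotsMatcher::ExtractUserAgent, and the helper names are hypothetical:

#include "absl/strings/ascii.h"
#include "absl/strings/string_view.h"

// Illustrative only: accept every RFC 7231 'tchar', not just [a-zA-Z_-].
static bool IsTchar(char c) {
  return absl::ascii_isalnum(c) ||
         absl::string_view("!#$%&'*+-.^_`|~").find(c) != absl::string_view::npos;
}

// Returns the leading run of tchar characters,
// e.g. "MJ12bot/v1.4.8" -> "MJ12bot" (the '/' is a delimiter, not a tchar).
static absl::string_view ExtractUserAgentToken(absl::string_view user_agent) {
  size_t i = 0;
  while (i < user_agent.size() && IsTchar(user_agent[i])) ++i;
  return user_agent.substr(0, i);
}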
Hi,
Just to let folk know that I have ported this library from its original C++ into Go.
https://github.com/jimsmart/grobotstxt
My conversion includes 100% of the original library's functionality, and 100% of the tests.
Because I approached this conversion function-by-function, some of the resulting code is not necessarily idiomatic Go — but, in places, I have made some cleanups, including renaming a few things.
But otherwise my package is a faithful reproduction of the code in this repo.
I have licensed my code with Apache 2.0, as per this repo, and I have preserved existing copyright notices and licensing within the appropriate source files.
Regarding this last matter: as my code is technically a derivative of this repo's code, would someone here please check my project against the above-mentioned licensing requirements, to ensure that what I have done is correct? Many thanks.
/Jim
Noticing that this is getting ported to Go, Rust, etc. - would it be worth integrating a WebAssembly (WASM) build into the process?
The robots.txt in test "ID_UserAgentValueCaseInsensitive" (robots_test.cc, line 200) uses user-agent names that violate the standard, as they include a white space (FOO BAR, foo bar, FoO bAr). User-agent names in the test are matched up to the first white space - a Google-specific feature following the comments in test GoogleOnly_AcceptUserAgentUpToFirstSpace. What about using standard-conforming names in the user-agent directives, e.g. FOO or FOOBAR?
I installed Bazelisk for macOS with brew (https://bazel.build/install/bazelisk) and ran bazel test :robots_test. I received the following error:
ERROR: Analysis of target '//:robots_test' failed; build aborted: error loading package '@com_google_absl//absl': Unable to find package for @bazel_skylib//lib:selects.bzl: The repository '@bazel_skylib' could not be resolved: Repository '@bazel_skylib' is not defined.
INFO: Elapsed time: 0.742s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (0 packages loaded, 0 targets configured)
currently loading: @com_google_absl//absl
Fetching @local_config_xcode; fetching
Today abseil/abseil-cpp@a766987 was merged and the robotstxt build started to fail.
The build is run in a debian:buster-slim Docker container.
Can you please upgrade CMAKE_CXX_STANDARD to 14? It is now required by abseil-cpp.
Thank you
I get the following errors, which seem to relate to changes in Bazel (in particular, this issue):
ERROR: /private/var/tmp/_bazel_willcritchlow/04efdbe0626597a3c8cf0aa15fc82ba3/external/bazel_tools/platforms/BUILD:89:6: in alias rule @bazel_tools//platforms:windows: Constraints from @bazel_tools//platforms have been removed. Please use constraints from @platforms repository embedded in Bazel, or preferably declare dependency on https://github.com/bazelbuild/platforms. See https://github.com/bazelbuild/bazel/issues/8622 for details.
ERROR: /private/var/tmp/_bazel_willcritchlow/04efdbe0626597a3c8cf0aa15fc82ba3/external/bazel_tools/platforms/BUILD:89:6: Analysis of target '@bazel_tools//platforms:windows' failed
ERROR: /private/var/tmp/_bazel_willcritchlow/04efdbe0626597a3c8cf0aa15fc82ba3/external/com_google_googletest/BUILD.bazel:57:11: errors encountered resolving select() keys for @com_google_googletest//:gtest
ERROR: Analysis of target '//:robots_test' failed; build aborted:
INFO: Elapsed time: 0.378s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (0 packages loaded, 1 target configured)
currently loading: @com_google_absl//absl
ERROR: Couldn't start the build. Unable to run tests
There's a commonly supported optional field, Crawl-Delay, which indicates the requested minimum time between bot requests on a site. It would be really nice if this library could parse that and provide a function to query the crawl delay for a specified user-agent.
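In the meantime, a caller can already observe Crawl-delay lines through the parse callback. Below is a minimal sketch assuming the RobotsParseHandler interface and ParseRobotsTxt entry point roughly as declared in robots.h; the method set and grouping rules here are my reading of the header and are simplified compared to the matcher, so treat them as assumptions:

#include <map>
#include <string>
#include <vector>

#include "absl/strings/match.h"
#include "absl/strings/numbers.h"
#include "robots.h"

// Illustrative handler that records the Crawl-delay value seen for each
// user-agent line that opens the enclosing group.
class CrawlDelayCollector : public googlebot::RobotsParseHandler {
 public:
  void HandleRobotsStart() override { delays_.clear(); }
  void HandleRobotsEnd() override {}

  void HandleUserAgent(int /*line*/, absl::string_view value) override {
    if (!in_group_) current_agents_.clear();
    current_agents_.push_back(std::string(value));
    in_group_ = true;
  }
  void HandleAllow(int /*line*/, absl::string_view /*value*/) override {
    in_group_ = false;
  }
  void HandleDisallow(int /*line*/, absl::string_view /*value*/) override {
    in_group_ = false;
  }
  void HandleSitemap(int /*line*/, absl::string_view /*value*/) override {}

  void HandleUnknownAction(int /*line*/, absl::string_view action,
                           absl::string_view value) override {
    double seconds = 0;
    if (absl::EqualsIgnoreCase(action, "crawl-delay") &&
        absl::SimpleAtod(value, &seconds)) {
      for (const std::string& agent : current_agents_) delays_[agent] = seconds;
    }
  }

  // Returns the recorded delay in seconds for `agent`, or -1 if none was seen.
  double DelayFor(const std::string& agent) const {
    auto it = delays_.find(agent);
    return it == delays_.end() ? -1.0 : it->second;
  }

 private:
  bool in_group_ = false;
  std::vector<std::string> current_agents_;
  std::map<std::string, double> delays_;
};

// Usage sketch:
//   CrawlDelayCollector collector;
//   googlebot::ParseRobotsTxt(robots_body, &collector);
//   double delay = collector.DelayFor("badbot");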
The first ID_Encoding test caught me by surprise, since it does not appear to match the RFC:
// /foo/bar?baz=http://foo.bar stays unencoded.
{
const absl::string_view robotstxt =
"User-agent: FooBot\n"
"Disallow: /\n"
"Allow: /foo/bar?qux=taz&baz=http://foo.bar?tar&par\n";
EXPECT_TRUE(IsUserAgentAllowed(
robotstxt, "FooBot",
"http://foo.bar/foo/bar?qux=taz&baz=http://foo.bar?tar&par"));
}
However, section 2.2.2 of the REP RFC seems to indicate that /foo/bar?baz=http://foo.bar should be encoded as /foo/bar?baz=http%3A%2F%2Ffoo.bar.
I can't decide if I'm mis-reading the RFC or if the test intentionally deviates from the RFC in this case.
Thanks!
Section 2.2.3 Special Characters contains two examples of path matching for paths containing the special characters * and $. The two characters are percent-encoded in the allow/disallow rule but not encoded in the URL/URI to be matched. It looks like the robots.txt parser and matcher does not follow the examples in the RFC here, and fails to match the percent-encoded characters in the rule with the unencoded ones in the URI. See the unit test below.
* and $ are among the reserved characters in URIs (RFC 3986, section 2.2) and therefore cannot be percent-encoded without potentially changing the semantics of the URI.
diff --git a/robots_test.cc b/robots_test.cc
index 35853de..3a37813 100644
--- a/robots_test.cc
+++ b/robots_test.cc
@@ -492,6 +492,19 @@ TEST(RobotsUnittest, ID_SpecialCharacters) {
EXPECT_FALSE(
IsUserAgentAllowed(robotstxt, "FooBot", "http://foo.bar/foo/quz"));
}
+ {
+ const absl::string_view robotstxt =
+ "User-agent: FooBot\n"
+ "Disallow: /path/file-with-a-%2A.html\n"
+ "Disallow: /path/foo-%24\n"
+ "Allow: /\n";
+ EXPECT_FALSE(
+ IsUserAgentAllowed(robotstxt, "FooBot",
+ "https://www.example.com/path/file-with-a-*.html"));
+ EXPECT_FALSE(
+ IsUserAgentAllowed(robotstxt, "FooBot",
+ "https://www.example.com/path/foo-$"));
+ }
}
// Google-specific: "index.html" (and only that) at the end of a pattern is
I have ported this library to Rust recently. The Rust version keeps the same behavior as the original library, provides a consistent API, and passes 100% of the C++ test cases via Rust FFI.
The Rust version is also licensed under Apache 2.0, and I have preserved existing copyright notices and licensing within the appropriate source files.
Here is my repository: https://github.com/Folyd/robotstxt
The README says that we should have installed a compatible C++ compiler supporting at least C++11, but when I ran the tests, I got an error saying that C++ versions less than C++14 are not supported.
I see no mention of X-Robots-Tag in the draft; should the spec perhaps define it?
For example, Googlebot gets blocked by the following robots.txt (check it in the Google testing tool):
# Slow down bots
User-agent: *
Crawl-delay: 10
# Disallow: Badbot
User-agent: badbot
Disallow: /
# allow explicitly all other bots
User-agent: *
Disallow:
If you remove the Crawl-delay directive, Googlebot will pass. This works:
# Disallow: Badbot
User-agent: badbot
Disallow: /
# allow explicitly all other bots
User-agent: *
Disallow:
And this too:
# Disallow: Badbot
User-agent: badbot
Disallow: /
If you would like to use the Crawl-delay directive without blocking Googlebot, you must add an Allow directive:
# Slow down bots
User-agent: *
Crawl-delay: 10
# Disallow: Badbot
User-agent: badbot
Disallow: /
# allow explicitly all other bots
User-agent: *
Disallow:
# allow explicitly all other bots (supported only by google and bing)
User-agent: *
Allow: /
Both Crawl-delay and Allow are unofficial directives. Crawl-delay is widely supported (except by Googlebot). Allow is supported only by Googlebot and Bingbot (AFAIK). Normally Googlebot should be allowed by all of the robots.txt files above. E.g. if you choose Adsbot-Google in the mentioned Google tool, it passes for all of them; all other Google bots fail in the same way. We first noticed this unexpected behaviour at the end of 2021.
Is this a mistake in Googlebot's parsing of robots.txt, or am I just missing something?
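For anyone who wants to check whether this library (as opposed to the live Google tooling) shows the same behaviour, here is a minimal reproduction sketch against robots.h. The RobotsMatcher::OneAgentAllowedByRobots call is my reading of the public header, so treat the exact name as an assumption; the printed result is simply whatever the library returns, not a claim about the outcome:

#include <iostream>
#include <string>

#include "robots.h"

int main() {
  // The robots.txt from the report: the first "User-agent: *" group contains
  // only an unsupported Crawl-delay directive, the last group allows everything.
  const std::string robotstxt =
      "# Slow down bots\n"
      "User-agent: *\n"
      "Crawl-delay: 10\n"
      "\n"
      "# Disallow: Badbot\n"
      "User-agent: badbot\n"
      "Disallow: /\n"
      "\n"
      "# allow explicitly all other bots\n"
      "User-agent: *\n"
      "Disallow:\n";

  googlebot::RobotsMatcher matcher;
  const bool allowed = matcher.OneAgentAllowedByRobots(
      robotstxt, "Googlebot", "https://example.com/some/page");
  std::cout << "Googlebot allowed: " << std::boolalpha << allowed << std::endl;
  return 0;
}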