google / robotstxt
The repository contains Google's robots.txt parser and matcher as a C++ library (compliant to C++11).
License: Apache License 2.0
I'm on Ubuntu 16.04.7 LTS and I tried building this project with CMake. But I got this problematic output when running the cmake command:
deploy@yamada:~/robotstxt/c-build⟫ cmake .. -DROBOTS_BUILD_TESTS=ON
-- The C compiler identification is GNU 5.4.0
-- The CXX compiler identification is GNU 5.4.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Configuring done
-- Generating done
-- Build files have been written to: /home/deploy/robotstxt/c-build/libs
Scanning dependencies of target googletest
[ 5%] Creating directories for 'googletest'
[ 11%] Performing download step (git clone) for 'googletest'
-- Avoiding repeated git clone, stamp file is up to date: '/home/deploy/robotstxt/c-build/libs'
[ 16%] No patch step for 'googletest'
[ 22%] Performing update step for 'googletest'
fatal: Needed a single revision
invalid upstream GIT_PROGRESS/master
No rebase in progress?
CMake Error at /home/deploy/robotstxt/c-build/libs/googletest-prefix/tmp/googletest-gitupdate.cmake:105 (message):
Failed to rebase in: '/'.
You will have to resolve the conflicts manually
CMakeFiles/googletest.dir/build.make:95: recipe for target 'googletest-prefix/src/googletest-stamp/googletest-update' failed
CMakeFiles/Makefile2:67: recipe for target 'CMakeFiles/googletest.dir/all' failed
make[2]: *** [googletest-prefix/src/googletest-stamp/googletest-update] Error 1
make[1]: *** [CMakeFiles/googletest.dir/all] Error 2
Makefile:83: recipe for target 'all' failed
make: *** [all] Error 2
CMake Error at CMakeLists.txt:73 (MESSAGE):
Failed to download dependencies: 2
-- Configuring incomplete, errors occurred!
See also "/home/deploy/robotstxt/c-build/CMakeFiles/CMakeOutput.log".
Note: My version of git is 2.7.4.
Do you have an idea why this is not working?
Hello,
Could you please change the branch referenced here for https://github.com/google/googletest.git from master to main? It looks like the default branch was renamed.
Thank you
Line 114 in 750aec7
It says it returns true iff any user agent in the vector is allowed to crawl. In fact, what it appears to do is effectively collapse all rules that apply to any of the user agents in the vector into a single ruleset and then evaluate against that. That isn't always the same as any agent in the list being allowed.
e.g.
robots.txt:
User-agent: googlebot
Disallow: /foo/
If we call this method against the URL /foo/ with a vector containing both googlebot and otherbot, it will return FALSE even though otherbot is clearly allowed to crawl /foo/, because (as I understand it) it's doing the equivalent of finding all rules that apply to either user agent and collapsing them into a single ruleset like:
User-agent: googlebot
User-agent: otherbot
Disallow: /foo/
So I think the comment is misleading, but would appreciate more eyes on the question!
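For reference, here is a minimal sketch of the scenario described, written against the public robots.h API; the AllowedByRobots name and signature are my reading of the header, so treat the exact details as an assumption rather than a documented example:

#include <iostream>
#include <string>
#include <vector>

#include "robots.h"

int main() {
  const std::string robotstxt =
      "User-agent: googlebot\n"
      "Disallow: /foo/\n";
  const std::vector<std::string> user_agents = {"googlebot", "otherbot"};

  googlebot::RobotsMatcher matcher;
  // Per the report above, this returns false even though "otherbot" on its
  // own would be allowed to crawl /foo/, because the rules applying to either
  // agent are collapsed into one ruleset before matching.
  const bool allowed = matcher.AllowedByRobots(
      robotstxt, &user_agents, "http://example.com/foo/");
  std::cout << "allowed: " << std::boolalpha << allowed << std::endl;
  return 0;
}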
Thanks for releasing this reference code!
We have ported this library to Python, including the executable and test suite.
The Python version keeps the same behaviour as the original library and passes 100% of the C++ test cases.
Here is the repository: https://github.com/Cocon-Se/gpyrobotstxt
Let's say a rogue nation state uses this as "entropy" for human cloning.
I do a text stream, and "fuzz" the output to your input.
:(
???
Hi,
I've just fixed an issue reported against my Go port of this library. According to the spec used by the Google library here, valid chars for a user-agent string are "a-zA-Z_-", but RFC 7231 defines the 'product' part of user-agent as being a 'token', defined as:
token = 1*tchar
tchar = "!" / "#" / "$" / "%" / "&" / "'" / "*"
/ "+" / "-" / "." / "^" / "_" / "`" / "|" / "~"
/ DIGIT / ALPHA
; any VCHAR, except delimiters
I think it's important to fix this because, according to Wikipedia's robots.txt, there are bots in the wild using user-agent strings with characters outside those permitted by the current RobotsMatcher::ExtractUserAgent implementation - which means that 'disallow' directives that would otherwise match are in fact failing to match. (Examples include MJ12bot and k2spider.)
See jimsmart/grobotstxt#4 for further details.
HTH
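For comparison, here is a minimal sketch of what a token-aware extraction could look like under RFC 7231's tchar definition. This is an illustrative variant only, not the library's current RobotsMatcher::ExtractUserAgent, and the helper names are hypothetical:

#include "absl/strings/ascii.h"
#include "absl/strings/string_view.h"

// Illustrative only: accept every RFC 7231 'tchar', not just [a-zA-Z_-].
static bool IsTchar(char c) {
  return absl::ascii_isalnum(c) ||
         absl::string_view("!#$%&'*+-.^_`|~").find(c) != absl::string_view::npos;
}

// Returns the leading run of tchar characters,
// e.g. "MJ12bot/v1.4.8" -> "MJ12bot" (the '/' is a delimiter, not a tchar).
static absl::string_view ExtractUserAgentToken(absl::string_view user_agent) {
  size_t i = 0;
  while (i < user_agent.size() && IsTchar(user_agent[i])) ++i;
  return user_agent.substr(0, i);
}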
Hi,
Just to let folk know that I have ported this library from its original C++ into Go.
https://github.com/jimsmart/grobotstxt
My conversion includes 100% of the original library's functionality, and 100% of the tests.
Because I approached this conversion function-by-function, some of the resulting code is not necessarily idiomatic Go — but, in places, I have made some cleanups, including renaming a few things.
But otherwise my package is a faithful reproduction of the code in this repo.
I have licensed my code with Apache 2.0, as per this repo, and I have preserved existing copyright notices and licensing within the appropriate source files.
Regarding this last matter: as my code is technically a derivative of this repo's code, would someone here please check my project against the above-mentioned licensing requirements, to ensure that what I have done is correct? Many thanks.
/Jim
Noticing that this is getting ported to Go, Rust, etc. - would it be worth integrating a WebAssembly (WASM) build into the process?
The robots.txt in test "ID_UserAgentValueCaseInsensitive" (robots_test.cc, line 200) uses user-agent names that violate the standard, as they include a white space (FOO BAR, foo bar, FoO bAr). User-agent names in the test are matched up to the first white space - a Google-specific feature following the comments in test GoogleOnly_AcceptUserAgentUpToFirstSpace. What about using standard-conforming names in the user-agent directives, e.g. FOO or FOOBAR?
I installed Bazelisk for macOS with brew (https://bazel.build/install/bazelisk) and ran bazel test :robots_test. I received the following error:
ERROR: Analysis of target '//:robots_test' failed; build aborted: error loading package '@com_google_absl//absl': Unable to find package for @bazel_skylib//lib:selects.bzl: The repository '@bazel_skylib' could not be resolved: Repository '@bazel_skylib' is not defined.
INFO: Elapsed time: 0.742s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (0 packages loaded, 0 targets configured)
currently loading: @com_google_absl//absl
Fetching @local_config_xcode; fetching
Today abseil/abseil-cpp@a766987 was merged and the robotstxt build started to fail.
The build is run in a debian:buster-slim Docker container.
Can you please upgrade CMAKE_CXX_STANDARD to 14? It is now required by abseil-cpp.
Thank you
I get the following errors, which seem to relate to changes in Bazel (in particular, this issue):
ERROR: /private/var/tmp/_bazel_willcritchlow/04efdbe0626597a3c8cf0aa15fc82ba3/external/bazel_tools/platforms/BUILD:89:6: in alias rule @bazel_tools//platforms:windows: Constraints from @bazel_tools//platforms have been removed. Please use constraints from @platforms repository embedded in Bazel, or preferably declare dependency on https://github.com/bazelbuild/platforms. See https://github.com/bazelbuild/bazel/issues/8622 for details.
ERROR: /private/var/tmp/_bazel_willcritchlow/04efdbe0626597a3c8cf0aa15fc82ba3/external/bazel_tools/platforms/BUILD:89:6: Analysis of target '@bazel_tools//platforms:windows' failed
ERROR: /private/var/tmp/_bazel_willcritchlow/04efdbe0626597a3c8cf0aa15fc82ba3/external/com_google_googletest/BUILD.bazel:57:11: errors encountered resolving select() keys for @com_google_googletest//:gtest
ERROR: Analysis of target '//:robots_test' failed; build aborted:
INFO: Elapsed time: 0.378s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (0 packages loaded, 1 target configured)
currently loading: @com_google_absl//absl
ERROR: Couldn't start the build. Unable to run tests
There's a commonly supported optional field, Crawl-Delay, which indicates the requested minimum time between bot requests on a site. It would be really nice if this library could parse that and provide a function to query the crawl delay for a specified user-agent.
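In the meantime, a caller can already observe Crawl-delay lines through the parse callback. Below is a minimal sketch assuming the RobotsParseHandler interface and ParseRobotsTxt entry point roughly as declared in robots.h; the method set and grouping rules here are my reading of the header and are simplified compared to the matcher, so treat them as assumptions:

#include <map>
#include <string>
#include <vector>

#include "absl/strings/match.h"
#include "absl/strings/numbers.h"
#include "robots.h"

// Illustrative handler that records the Crawl-delay value seen for each
// user-agent line that opens the enclosing group.
class CrawlDelayCollector : public googlebot::RobotsParseHandler {
 public:
  void HandleRobotsStart() override { delays_.clear(); }
  void HandleRobotsEnd() override {}

  void HandleUserAgent(int /*line*/, absl::string_view value) override {
    if (!in_group_) current_agents_.clear();
    current_agents_.push_back(std::string(value));
    in_group_ = true;
  }
  void HandleAllow(int /*line*/, absl::string_view /*value*/) override {
    in_group_ = false;
  }
  void HandleDisallow(int /*line*/, absl::string_view /*value*/) override {
    in_group_ = false;
  }
  void HandleSitemap(int /*line*/, absl::string_view /*value*/) override {}

  void HandleUnknownAction(int /*line*/, absl::string_view action,
                           absl::string_view value) override {
    double seconds = 0;
    if (absl::EqualsIgnoreCase(action, "crawl-delay") &&
        absl::SimpleAtod(value, &seconds)) {
      for (const std::string& agent : current_agents_) delays_[agent] = seconds;
    }
  }

  // Returns the recorded delay in seconds for `agent`, or -1 if none was seen.
  double DelayFor(const std::string& agent) const {
    auto it = delays_.find(agent);
    return it == delays_.end() ? -1.0 : it->second;
  }

 private:
  bool in_group_ = false;
  std::vector<std::string> current_agents_;
  std::map<std::string, double> delays_;
};

// Usage sketch:
//   CrawlDelayCollector collector;
//   googlebot::ParseRobotsTxt(robots_body, &collector);
//   double delay = collector.DelayFor("badbot");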
The first ID_Encoding test caught me by surprise, since it does not appear to match the RFC:
// /foo/bar?baz=http://foo.bar stays unencoded.
{
const absl::string_view robotstxt =
"User-agent: FooBot\n"
"Disallow: /\n"
"Allow: /foo/bar?qux=taz&baz=http://foo.bar?tar&par\n";
EXPECT_TRUE(IsUserAgentAllowed(
robotstxt, "FooBot",
"http://foo.bar/foo/bar?qux=taz&baz=http://foo.bar?tar&par"));
}
However, section 2.2.2 of the REP RFC seems to indicate that /foo/bar?baz=http://foo.bar should be encoded as /foo/bar?baz=http%3A%2F%2Ffoo.bar.
I can't decide if I'm mis-reading the RFC or if the test intentionally deviates from the RFC in this case.
Thanks!
Section 2.2.3 Special Characters contains two examples of path matching for paths containing the special characters * and $. The two characters are percent-encoded in the allow/disallow rule but not encoded in the URL/URI to be matched. It looks like the robots.txt parser and matcher does not follow the examples in the RFC here, and fails to match the percent-encoded characters in the rule with the unencoded ones in the URI. See the unit test below.
* and $ are among the reserved characters in URIs (RFC 3986, section 2.2) and therefore cannot be percent-encoded without potentially changing the semantics of the URI.
diff --git a/robots_test.cc b/robots_test.cc
index 35853de..3a37813 100644
--- a/robots_test.cc
+++ b/robots_test.cc
@@ -492,6 +492,19 @@ TEST(RobotsUnittest, ID_SpecialCharacters) {
EXPECT_FALSE(
IsUserAgentAllowed(robotstxt, "FooBot", "http://foo.bar/foo/quz"));
}
+ {
+ const absl::string_view robotstxt =
+ "User-agent: FooBot\n"
+ "Disallow: /path/file-with-a-%2A.html\n"
+ "Disallow: /path/foo-%24\n"
+ "Allow: /\n";
+ EXPECT_FALSE(
+ IsUserAgentAllowed(robotstxt, "FooBot",
+ "https://www.example.com/path/file-with-a-*.html"));
+ EXPECT_FALSE(
+ IsUserAgentAllowed(robotstxt, "FooBot",
+ "https://www.example.com/path/foo-$"));
+ }
}
// Google-specific: "index.html" (and only that) at the end of a pattern is
I have ported this library to Rust recently. The Rust version keeps the same behavior as the original library, provides a consistent API, and passes 100% of the C++ test cases via Rust FFI.
The Rust version is also licensed under Apache 2.0, and I have preserved existing copyright notices and licensing within the appropriate source files.
Here is my repository: https://github.com/Folyd/robotstxt
The README says that we should have installed a compatible C++ compiler supporting at least C++11, but when I ran the tests, I got an error saying that C++ versions less than C++14 are not supported.
I see no mention of X-Robots-Tag in the draft; should the spec perhaps define it?
For example, Googlebot gets blocked by the following robots.txt (check it in the Google testing tool):
# Slow down bots
User-agent: *
Crawl-delay: 10
# Disallow: Badbot
User-agent: badbot
Disallow: /
# allow explicitly all other bots
User-agent: *
Disallow:
If you remove the Crawl-delay directive, Googlebot will pass. This works:
# Disallow: Badbot
User-agent: badbot
Disallow: /
# allow explicitly all other bots
User-agent: *
Disallow:
And this too:
# Disallow: Badbot
User-agent: badbot
Disallow: /
If you would like to use the Crawl-delay directive without blocking Googlebot, you must add an Allow directive:
# Slow down bots
User-agent: *
Crawl-delay: 10
# Disallow: Badbot
User-agent: badbot
Disallow: /
# allow explicitly all other bots
User-agent: *
Disallow:
# allow explicitly all other bots (supported only by google and bing)
User-agent: *
Allow: /
Both Crawl-delay and Allow are unofficial directives. Crawl-delay is widely supported (except by Googlebot). Allow is supported only by Googlebot and Bingbot (AFAIK). Normally Googlebot should be allowed by all of the robots.txt files above. E.g. if you choose Adsbot-Google in the mentioned Google tool, it passes for all of them; all other Google bots fail in the same way. We first noticed this unexpected behaviour at the end of 2021.
Is this a mistake in Googlebot's parsing of robots.txt, or am I just missing something?
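For anyone who wants to check whether this library (as opposed to the live Google tooling) shows the same behaviour, here is a minimal reproduction sketch against robots.h. The RobotsMatcher::OneAgentAllowedByRobots call is my reading of the public header, so treat the exact name as an assumption; the printed result is simply whatever the library returns, not a claim about the outcome:

#include <iostream>
#include <string>

#include "robots.h"

int main() {
  // The robots.txt from the report: the first "User-agent: *" group contains
  // only an unsupported Crawl-delay directive, the last group allows everything.
  const std::string robotstxt =
      "# Slow down bots\n"
      "User-agent: *\n"
      "Crawl-delay: 10\n"
      "\n"
      "# Disallow: Badbot\n"
      "User-agent: badbot\n"
      "Disallow: /\n"
      "\n"
      "# allow explicitly all other bots\n"
      "User-agent: *\n"
      "Disallow:\n";

  googlebot::RobotsMatcher matcher;
  const bool allowed = matcher.OneAgentAllowedByRobots(
      robotstxt, "Googlebot", "https://example.com/some/page");
  std::cout << "Googlebot allowed: " << std::boolalpha << allowed << std::endl;
  return 0;
}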