
robotstxt's People

Contributors

alanyee, anubhavp28, devsdmf, dwsmart, edwardbetts, epere4, fridex, garyillyes, happyxgang, korilakkuma, lucasassisrosa, luchaninov, lvandeve, naveenarun, tomanthony

robotstxt's Issues

CMake compilation not working on Ubuntu 16.04

I'm on Ubuntu 16.04.7 LTS and I tried to build this project with CMake.

But I got the following error output when running the cmake command:

deploy@yamada:~/robotstxt/c-build⟫ cmake .. -DROBOTS_BUILD_TESTS=ON
-- The C compiler identification is GNU 5.4.0
-- The CXX compiler identification is GNU 5.4.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Configuring done
-- Generating done
-- Build files have been written to: /home/deploy/robotstxt/c-build/libs
Scanning dependencies of target googletest
[  5%] Creating directories for 'googletest'
[ 11%] Performing download step (git clone) for 'googletest'
-- Avoiding repeated git clone, stamp file is up to date: '/home/deploy/robotstxt/c-build/libs'
[ 16%] No patch step for 'googletest'
[ 22%] Performing update step for 'googletest'
fatal: Needed a single revision
invalid upstream GIT_PROGRESS/master
No rebase in progress?
CMake Error at /home/deploy/robotstxt/c-build/libs/googletest-prefix/tmp/googletest-gitupdate.cmake:105 (message):


  Failed to rebase in: '/'.

  You will have to resolve the conflicts manually


CMakeFiles/googletest.dir/build.make:95: recipe for target 'googletest-prefix/src/googletest-stamp/googletest-update' failed
CMakeFiles/Makefile2:67: recipe for target 'CMakeFiles/googletest.dir/all' failed
make[2]: *** [googletest-prefix/src/googletest-stamp/googletest-update] Error 1
make[1]: *** [CMakeFiles/googletest.dir/all] Error 2
Makefile:83: recipe for target 'all' failed
make: *** [all] Error 2
CMake Error at CMakeLists.txt:73 (MESSAGE):
  Failed to download dependencies: 2


-- Configuring incomplete, errors occurred!
See also "/home/deploy/robotstxt/c-build/CMakeFiles/CMakeOutput.log".

Note: My version of git is 2.7.4.

Do you have an idea why this is not working?

I think this comment is misleading

// Returns true iff 'url' is allowed to be fetched by any member of the

It says it returns true iff any user agent in the vector is allowed to crawl. In fact, what it appears to do is effectively collapse all rules that apply to any of the user agents in the vector into a single ruleset and then evaluate against that. That isn't always the same as any agent in the list being allowed.

e.g.

robots.txt:

User-agent: googlebot
Disallow: /foo/

If we call this method against the URL /foo/ with a vector containing both googlebot and otherbot, it returns FALSE, even though otherbot is clearly allowed to crawl /foo/, because (as I understand it) it does the equivalent of finding all rules that apply to either user agent and collapsing them into a single ruleset like:

User-agent: googlebot
User-agent: otherbot
Disallow: /foo/

So I think the comment is misleading, but would appreciate more eyes on the question!
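
For concreteness, here is a minimal sketch of the call described above, assuming the AllowedByRobots overload in robots.h that takes a vector of user agents; the function name and URL are made up for illustration.

#include <string>
#include <vector>

#include "robots.h"

bool DemoCollapsedRuleset() {
  const std::string robotstxt =
      "User-agent: googlebot\n"
      "Disallow: /foo/\n";
  const std::vector<std::string> agents = {"googlebot", "otherbot"};
  googlebot::RobotsMatcher matcher;
  // Per the behaviour described above, this returns false, even though
  // otherbot on its own would be allowed to fetch /foo/.
  return matcher.AllowedByRobots(robotstxt, &agents, "http://example.com/foo/");
}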

Genetic Cloning

Let's say a rogue nation state uses this as "entropy" for human cloning.

I do a text stream, and "fuzz" the output to your input.

:(

???

Allow wider range of chars for valid user-agent identifier / 'product' token

Hi,

I've just fixed an issue reported against my Go port of this library. According to the character set used by the Google library here, the valid characters for a user-agent string are "a-zA-Z_-", but RFC 7231 defines the 'product' part of User-Agent as a 'token', defined as:

  token          = 1*tchar

  tchar          = "!" / "#" / "$" / "%" / "&" / "'" / "*"
                 / "+" / "-" / "." / "^" / "_" / "`" / "|" / "~" 
                 / DIGIT / ALPHA
                 ; any VCHAR, except delimiters

I think it's important to fix this because, judging by Wikipedia's robots.txt, there are bots in the wild using user-agent strings with characters outside of those permitted by the current RobotsMatcher::ExtractUserAgent implementation, which means that 'disallow' directives that would otherwise match are in fact failing to match. (Examples include MJ12bot and k2spider.) A sketch of one possible fix follows below.
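
For illustration, here is a minimal sketch of a tchar-based check. This is not the library's current code; the names and the exact integration point are assumptions.

#include <cstring>

#include "absl/strings/ascii.h"
#include "absl/strings/string_view.h"

// Hypothetical predicate: accept the full RFC 7231 'tchar' set instead of
// only [a-zA-Z_-].
static bool IsProductTokenChar(char c) {
  return absl::ascii_isalnum(c) ||
         (c != '\0' && std::strchr("!#$%&'*+-.^_`|~", c) != nullptr);
}

// Hypothetical ExtractUserAgent variant built on that predicate: returns the
// leading product token of 'user_agent'.
static absl::string_view ExtractUserAgentToken(absl::string_view user_agent) {
  size_t i = 0;
  while (i < user_agent.size() && IsProductTokenChar(user_agent[i])) ++i;
  return user_agent.substr(0, i);
}

With a check like this, agents such as MJ12bot and k2spider (which contain digits) would keep their full token instead of being truncated.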

See jimsmart/grobotstxt#4 for further details.

HTH

Library ported to Go

Hi,

Just to let folks know that I have ported this library from its original C++ to Go.

https://github.com/jimsmart/grobotstxt

My conversion includes 100% of the original library's functionality, and 100% of the tests.

Because I approached this conversion function-by-function, some of the resulting code is not necessarily idiomatic Go — but, in places, I have made some cleanups, including renaming a few things.

Otherwise, my package is a faithful reproduction of the code in this repo.

I have licensed my code with Apache 2.0, as per this repo, and I have preserved existing copyright notices and licensing within the appropriate source files.

Regarding this last matter: as my code is technically a derivative of this repo's code, would someone here please check my project with regard to the above-mentioned licensing requirements, to ensure that what I have done is correct? Many thanks.

/Jim

Consider a WASM build

Noticing that this is getting ported to Go, Rust, etc., would it be worth integrating a WASM (WebAssembly) build into the process?

User-agent names in test ID_UserAgentValueCaseInsensitive to follow the standard

The robots.txt in the test "ID_UserAgentValueCaseInsensitive" (robots_test.cc, line 200) uses user-agent names that violate the standard, as they include whitespace (FOO BAR, foo bar, FoO bAr). User-agent names in the test are matched up to the first whitespace, a Google-specific feature according to the comments in the test GoogleOnly_AcceptUserAgentUpToFirstSpace. What about using standard-conforming names in the user-agent directives, e.g. FOO or FOOBAR? A sketch of such a change follows below.
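
For illustration, a hedged sketch of what a standard-conforming variant could look like; the test body below is hypothetical, not copied from robots_test.cc, and reuses the IsUserAgentAllowed helper from that file.

// Hypothetical variant using product tokens without embedded whitespace.
TEST(RobotsUnittest, ID_UserAgentValueCaseInsensitiveConforming) {
  const absl::string_view robotstxt =
      "User-Agent: FOOBAR\n"
      "Allow: /x/page.html\n"
      "Disallow: /\n";
  EXPECT_TRUE(IsUserAgentAllowed(robotstxt, "FOOBAR",
                                 "http://foo.bar/x/page.html"));
  EXPECT_TRUE(IsUserAgentAllowed(robotstxt, "foobar",
                                 "http://foo.bar/x/page.html"));
  EXPECT_FALSE(IsUserAgentAllowed(robotstxt, "FoObAr",
                                  "http://foo.bar/y/page.html"));
}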

bazel test failed with `bazelisk`: Repository '@bazel_skylib' is not defined

I installed Bazelisk for macOS with brew (https://bazel.build/install/bazelisk) and ran bazel test :robots_test. I received the following error:

ERROR: Analysis of target '//:robots_test' failed; build aborted: error loading package '@com_google_absl//absl': Unable to find package for @bazel_skylib//lib:selects.bzl: The repository '@bazel_skylib' could not be resolved: Repository '@bazel_skylib' is not defined.
INFO: Elapsed time: 0.742s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (0 packages loaded, 0 targets configured)
FAILED: Build did NOT complete successfully (0 packages loaded, 0 targets configured)
    currently loading: @com_google_absl//absl
    Fetching @local_config_xcode; fetching

Issues with Bazel build

I get the following errors, which seem to relate to changes in Bazel (in particular, referring to this issue):

ERROR: /private/var/tmp/_bazel_willcritchlow/04efdbe0626597a3c8cf0aa15fc82ba3/external/bazel_tools/platforms/BUILD:89:6: in alias rule @bazel_tools//platforms:windows: Constraints from @bazel_tools//platforms have been removed. Please use constraints from @platforms repository embedded in Bazel, or preferably declare dependency on https://github.com/bazelbuild/platforms. See https://github.com/bazelbuild/bazel/issues/8622 for details.
ERROR: /private/var/tmp/_bazel_willcritchlow/04efdbe0626597a3c8cf0aa15fc82ba3/external/bazel_tools/platforms/BUILD:89:6: Analysis of target '@bazel_tools//platforms:windows' failed
ERROR: /private/var/tmp/_bazel_willcritchlow/04efdbe0626597a3c8cf0aa15fc82ba3/external/com_google_googletest/BUILD.bazel:57:11: errors encountered resolving select() keys for @com_google_googletest//:gtest
ERROR: Analysis of target '//:robots_test' failed; build aborted: 
INFO: Elapsed time: 0.378s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (0 packages loaded, 1 target configured)
    currently loading: @com_google_absl//absl
ERROR: Couldn't start the build. Unable to run tests

Crawl-Delay support?

There's a commonly-supported optional field, Crawl-Delay, which indicates the requested minimum time between bot requests on a site. It would be really nice if this library could parse that and provide a function to query the crawl delay for a specified user-agent.
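
As a starting point, here is a rough sketch of how crawl-delay values could be collected with the existing parser, assuming the RobotsParseHandler interface declared in robots.h (HandleUserAgent, HandleUnknownAction, etc.); the class and field names here are made up.

#include <map>
#include <string>
#include <vector>

#include "absl/strings/ascii.h"
#include "absl/strings/numbers.h"
#include "robots.h"

// Hypothetical handler: records Crawl-delay per user-agent group by hooking
// HandleUnknownAction, which the parser calls for unrecognised keys.
class CrawlDelayCollector : public googlebot::RobotsParseHandler {
 public:
  void HandleRobotsStart() override {}
  void HandleRobotsEnd() override {}

  void HandleUserAgent(int line_num, absl::string_view value) override {
    // Consecutive user-agent lines form one group; any other line ends the run.
    if (!in_user_agent_run_) current_agents_.clear();
    current_agents_.push_back(std::string(value));
    in_user_agent_run_ = true;
  }
  void HandleAllow(int line_num, absl::string_view value) override {
    in_user_agent_run_ = false;
  }
  void HandleDisallow(int line_num, absl::string_view value) override {
    in_user_agent_run_ = false;
  }
  void HandleSitemap(int line_num, absl::string_view value) override {
    in_user_agent_run_ = false;
  }
  void HandleUnknownAction(int line_num, absl::string_view action,
                           absl::string_view value) override {
    in_user_agent_run_ = false;
    double delay = 0;
    if (absl::AsciiStrToLower(action) == "crawl-delay" &&
        absl::SimpleAtod(value, &delay)) {
      for (const std::string& agent : current_agents_)
        crawl_delay[agent] = delay;
    }
  }

  // Requested delay in seconds, keyed by user-agent token.
  std::map<std::string, double> crawl_delay;

 private:
  std::vector<std::string> current_agents_;
  bool in_user_agent_run_ = false;
};

// Usage sketch:
//   CrawlDelayCollector collector;
//   googlebot::ParseRobotsTxt(robots_body, &collector);
//   auto it = collector.crawl_delay.find("somebot");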

An encoding test does not appear to match the RFC?

The first ID_Encoding test caught me by surprise, since it does not appear to match the RFC:

  // /foo/bar?baz=http://foo.bar stays unencoded.
  {
    const absl::string_view robotstxt =
        "User-agent: FooBot\n"
        "Disallow: /\n"
        "Allow: /foo/bar?qux=taz&baz=http://foo.bar?tar&par\n";
    EXPECT_TRUE(IsUserAgentAllowed(
        robotstxt, "FooBot",
        "http://foo.bar/foo/bar?qux=taz&baz=http://foo.bar?tar&par"));
  }

However, section 2.2.2 of the REP RFC seems to indicate that /foo/bar?baz=http://foo.bar should be encoded as /foo/bar?baz=http%3A%2F%2Ffoo.bar.

I can't decide if I'm mis-reading the RFC or if the test intentionally deviates from the RFC in this case.

Thanks!

Special characters * and $ not matched in URI

Section 2.2.3 Special Characters contains two examples of path matching for paths containing the special characters * and $. The two characters are percent-encoded in the allow/disallow rule but not encoded in the URL/URI to be matched. It looks like the robots.txt parser and matcher do not follow the examples in the RFC here and fail to match the percent-encoded characters in the rule against the unencoded ones in the URI. See the unit test below.

* and $ are among the reserved characters in URIs (RFC 3986, section 2.2) and therefore cannot be percent-encoded without potentially changing the semantics of the URI.

diff --git a/robots_test.cc b/robots_test.cc
index 35853de..3a37813 100644
--- a/robots_test.cc
+++ b/robots_test.cc
@@ -492,6 +492,19 @@ TEST(RobotsUnittest, ID_SpecialCharacters) {
     EXPECT_FALSE(
         IsUserAgentAllowed(robotstxt, "FooBot", "http://foo.bar/foo/quz"));
   }
+  {
+    const absl::string_view robotstxt =
+        "User-agent: FooBot\n"
+        "Disallow: /path/file-with-a-%2A.html\n"
+        "Disallow: /path/foo-%24\n"
+        "Allow: /\n";
+    EXPECT_FALSE(
+        IsUserAgentAllowed(robotstxt, "FooBot",
+                           "https://www.example.com/path/file-with-a-*.html"));
+    EXPECT_FALSE(
+        IsUserAgentAllowed(robotstxt, "FooBot",
+                           "https://www.example.com/path/foo-$"));
+  }
 }
 
 // Google-specific: "index.html" (and only that) at the end of a pattern is

A Rust port of robotstxt

I have recently ported this library to Rust. The Rust version keeps the same behavior as the original library, provides a consistent API, and passes 100% of the C++ test cases via Rust FFI.

The Rust version is also licensed under Apache 2.0, and I have preserved the existing copyright notices and licensing within the appropriate source files.

Here is my repository: https://github.com/Folyd/robotstxt

Update README build requirements

The README says that we should have a compatible C++ compiler supporting at least C++11 installed, but when I ran the tests, I got an error saying that C++ versions below C++14 are not supported.

Combination of Crawl-delay and badbot Disallow results in blocking of Googlebot

For example, Googlebot gets blocked by the following robots.txt (you can check it in the Google testing tool):

# Slow down bots
User-agent: *
Crawl-delay: 10

# Disallow: Badbot
User-agent: badbot
Disallow: /

# allow explicitly all other bots
User-agent: *
Disallow:

If you remove the Crawl-delay directive, Googlebot passes. This works:

# Disallow: Badbot
User-agent: badbot
Disallow: /

# allow explicitly all other bots
User-agent: *
Disallow:

And this too:

# Disallow: Badbot
User-agent: badbot
Disallow: /

If you would like to use the Crawl-delay directive without blocking Googlebot, you must add an Allow directive:

# Slow down bots
User-agent: *
Crawl-delay: 10

# Disallow: Badbot
User-agent: badbot
Disallow: /

# allow explicitly all other bots
User-agent: *
Disallow:

# allow explicitly all other bots (supported only by google and bing)
User-agent: *
Allow: /

Both Crawl-delay and Allow are unofficial directives. Crawl-delay is widely supported (except by Googlebot). Allow is supported only by Googlebot and Bingbot (AFAIK). Normally, Googlebot should be allowed by all of the robots.txt files above. For example, if you choose Adsbot-Google in the aforementioned Google tool, it passes for all of them, while all other Google bots fail in the same way. We first noticed this unexpected behaviour at the end of 2021.

Is this a mistake in Googlebot's parsing of robots.txt, or am I just missing something?
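
For what it's worth, this library can be used to check how the open-source parser (as opposed to the hosted testing tool) treats the first robots.txt above. A minimal sketch, with the example URL made up:

#include <iostream>
#include <string>

#include "robots.h"

int main() {
  const std::string robotstxt =
      "# Slow down bots\n"
      "User-agent: *\n"
      "Crawl-delay: 10\n"
      "\n"
      "# Disallow: Badbot\n"
      "User-agent: badbot\n"
      "Disallow: /\n"
      "\n"
      "# allow explicitly all other bots\n"
      "User-agent: *\n"
      "Disallow:\n";
  googlebot::RobotsMatcher matcher;
  // Prints 1 if Googlebot is allowed to fetch the URL, 0 if it is blocked.
  std::cout << matcher.OneAgentAllowedByRobots(robotstxt, "Googlebot",
                                               "https://example.com/page")
            << "\n";
  return 0;
}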
