Git Product home page Git Product logo

robotspy's Introduction

Robots Exclusion Standard Parser for Python

The robotspy Python module implements a parser for robots.txt files. The recommended class to use is robots.RobotsParser.

A thin facade robots.RobotFileParser can also be used as a substitute for urllib.robotparser.RobotFileParser, available in the Python standard library. The class robots.RobotFileParser exposes an API that is mostly compatible with urllib.robotparser.RobotFileParser.

The main reasons for this rewrite are the following:

  1. It was initially intended to experiment with parsing robots.txt files for a link checker project (not implemented yet).
  2. It is attempting to follow the latest internet draft Robots Exclusion Protocol.
  3. It does not try to be compliant with commonly accepted directives that are not in the current specs such as request-rate and crawl-delay, but it currently supports sitemaps.
  4. It satisfies the same tests as the Google Robots.txt Parser, except for some custom behaviors specific to Google Robots.

To use the robots command line tool (CLI) in a Docker container, read the following section Docker Image.

To install robotspy globally as a tool on your system with pipx skip to the Global Installation section.

If you are interested in using robotspy in a local Python environment or as a library, skip to section Module Installation.

Docker Image

The Robotspy CLI, robots, is available as a Docker automated built image at https://hub.docker.com/r/andreburgaud/robotspy.

If you already have Docker installed on your machine, first pull the image from Docker Hub:

$ docker pull andreburgaud/robotspy

Then, you can exercise the tool against the following remote Python robots.txt test file located at http://www.pythontest.net/elsewhere/robots.txt:

# Used by NetworkTestCase in Lib/test/test_robotparser.py

User-agent: Nutch
Disallow: /
Allow: /brian/

User-agent: *
Disallow: /webstats/

The following examples demonstrate how to use the robots command line with the Docker container:

$ # Example 1: User agent "Johnny" is allowed to access path "/"
$ docker run --rm andreburgaud/robotspy http://www.pythontest.net/elsewhere/robots.txt Johnny /
user-agent 'Johnny' with path '/': ALLOWED
$ # Example 2:  User agent "Nutch" is not allowed to access path "/brian"
$ docker run --rm andreburgaud/robotspy http://www.pythontest.net/elsewhere/robots.txt Nutch /brian
user-agent 'Nutch' with path '/brian': DISALLOWED
$ # Example 3: User agent "Johnny" is not allowed to access path "/webstats/"
docker run --rm andreburgaud/robotspy http://www.pythontest.net/elsewhere/robots.txt Johnny /webstats/
user-agent 'Johnny' with path '/webstats/': DISALLOWED

The arguments are the following:

  1. Location of the robots.txt file (http://www.pythontest.net/elsewhere/robots.txt)
  2. User agent name (Johnny)
  3. Path or URL (/)

Without any argument, robots displays the help:

docker run --rm andreburgaud/robotspy
usage: robots <robotstxt> <useragent> <path>

Shows whether the given user agent and path combination are allowed or disallowed by the given robots.txt file.

positional arguments:
  robotstxt      robots.txt file path or URL
  useragent      User agent name
  path           Path or URI

optional arguments:
  -h, --help     show this help message and exit
  -v, --version  show program's version number and exit

To use the CLI robots as a global tools, continue to the following section. If you want to use robotspy as a Python module, skip to Module Installation.

Global Installation with pipx

If you only want to use the command line tool robots, you may want to use pipx to install it as a global tool on your system.

To install robotspy using pipx execute the following command:

$  pipx install robotspy

When robotspy is installed globally on your system, you can invoke it from any folder locations. For example, you can execute:

$  robots --version
robots 0.6.0

You can see more detailed usages in section Usage.

Module Installation

Note: Python 3.8.x or 3.9.x required

You preferably want to install the robotspy package after creating a Python virtual environment, in a newly created directory, as follows:

$ mkdir project && cd project
$ python -m venv .venv
$ . .venv/bin/activate
(.venv) $ python -m pip install --upgrade pip
(.venv) $ python -m pip install --upgrade setuptools
(.venv) $ python -m pip install robotspy
(.venv) $ python -m robots --help
...

On Windows:

C:/> mkdir project && cd project
C:/> python -m venv .venv
C:/> .venv\scripts\activate
(.venv) c:\> python -m pip install --upgrade pip
(.venv) c:\> python -m pip install --upgrade setuptools
(.venv) c:\> python -m pip install robotspy
(.venv) c:\> python -m robots --help
...

Usage

The robotspy package can be imported as a module and also exposes an executable, robots, invocable with python -m. If installed globally with pipx, the command robots can be invoked from any folders. The usage examples in the following section use the command robots, but you can also substitute it with python -m robots in a virtual environment.

Execute the Tool

After installing robotspy, you can validate the installation by running the following command:

$ robots --help
usage: robots <robotstxt> <useragent> <path>

Shows whether the given user agent and path combination are allowed or disallowed by the given robots.txt file.

positional arguments:
  robotstxt      robots.txt file path or URL
  useragent      User agent name
  path           Path or URI

optional arguments:
  -h, --help     show this help message and exit
  -v, --version  show program's version number and exit

Examples

The content of http://www.pythontest.net/elsewhere/robots.txt is the following:

# Used by NetworkTestCase in Lib/test/test_robotparser.py

User-agent: Nutch
Disallow: /
Allow: /brian/

User-agent: *
Disallow: /webstats/

To check if the user agent Nutch can fetch the path /brian/ you can execute:

$ robots http://www.pythontest.net/elsewhere/robots.txt Nutch /brian/
user-agent 'Nutch' with path '/brian/': ALLOWED

Or, you can also pass the full URL, http://www.pythontest.net/brian/:

$ robots http://www.pythontest.net/elsewhere/robots.txt Nutch /brian/
user-agent 'Nutch' with url 'http://www.pythontest.net/brian/': ALLOWED

Can user agent Nutch fetch the path /brian?

$ robots http://www.pythontest.net/elsewhere/robots.txt Nutch /brian
user-agent 'Nutch' with path '/brian': DISALLOWED

Or, /?

$ robots http://www.pythontest.net/elsewhere/robots.txt Nutch /
user-agent 'Nutch' with path '/': DISALLOWED

How about user agent Johnny?

$ robots http://www.pythontest.net/elsewhere/robots.txt Johnny /
user-agent 'Johnny' with path '/': ALLOWED

Use the Module in a Project

If you have a virtual environment with the robotspy package installed, you can use the robots module from the Python shell:

(.venv) $ python
>>> import robots
>>> parser = robots.RobotsParser.from_uri('http://www.pythontest.net/elsewhere/robots.txt')
>>> useragent = 'Nutch'
>>> path = '/brian/'
>>> result = parser.can_fetch(useragent, path)
>>> print(f'Can {useragent} fetch {path}? {result}')
Can Nutch fetch /brian/? True
>>>

Bug in the Python standard library

There is a bug in urllib.robotparser from the Python standard library that causes the following test to differ from the example above with robotspy.

The example with urllib.robotparser is the following:

$ python
>>> import urllib.robotparser
>>> rp = urllib.robotparser.RobotFileParser()
>>> rp.set_url('http://www.pythontest.net/elsewhere/robots.txt')
>>> rp.read()
>>> rp.can_fetch('Nutch', '/brian/')
False

Notice that the result is False whereas robotspy returns True.

Bug bpo-39187 was open to raise awareness on this issue and PR python/cpython#17794 was submitted as a possible fix. robotspy does not exhibit this problem.

Development

The main development dependency is pytest for executing the tests. It is automatically installed if you perform the following steps:

$ git clone https://github.com/andreburgaud/robotspy
$ cd robotspy
$ python -m venv .venv --prompt robots
$ . .venv/bin/activate
(robots) $ python -m pip install -r requirements.txt
(robots) $ python -m pip install -e .
(robots) $ make test
(robots) $ deactivate
$

On Windows:

C:/> git clone https://github.com/andreburgaud/robotspy
C:/> cd robotspy
C:/> python -m venv .venv --prompt robotspy
C:/> .venv\scripts\activate
(robots) c:\> python -m pip install -r requirements.txt
(robots) c:\> python -m pip install -e .
(robots) c:\> make test
(robots) c:\> deactivate

Global Tools

The following tools were used during the development of robotspy:

See the build file, Makefile or make.bat on Windows, for the commands and parameters.

Release History

  • 0.7.0:
    • Fixed bug with the argument path when using the CLI
    • Print 'url' when the argument is a URL, 'path' otherwise
  • 0.6.0:
    • Simplified dependencies by keeping only pytest in requirements.txt
  • 0.5.0:
    • Updated all libraries. Tested with Python 3.9.
  • 0.4.0:
    • Fixed issue with robots text pointed by relative paths
    • Integration of Mypy, Black and Pylint as depencencies to ease cross-platform development
    • Limited make.bat build file for Windows
    • Git ignore vscode files, tmp directory, multiple virtual env (.venv*)
    • Fixed case insensitive issues on Windows
    • Tests successful on Windows
    • Added an ATRIBUTIONS files and build task to generate it
    • Upgraded pyparsing and certifi
  • 0.3.3:
    • Upgraded tqdm, and cryptography packages
    • 0.3.2:
    • Upgraded bleach, tqdm, and setuptools packages
  • 0.3.1:
    • Updated idna and wcwidth packages
    • Added pipdeptree package to provide visibility on dependencies
    • Fixed mypy errors
    • Explicitly ignored pylint errors related to commonly used names like f, m, or T
  • 0.3.0: Updated bleach package to address CVE-2020-6802
  • 0.2.0: Updated the documentation
  • 0.1.0: Initial release

License

MIT License

robotspy's People

Contributors

andreburgaud avatar dependabot-preview[bot] avatar dependabot[bot] avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

Forkers

oddjobz

robotspy's Issues

Issue with get_uri when the encoding isn't 'utf-8'

The current implementation of gen_uri would fail with an entry in the robots.txt line which isn't correctly encoded in 'utf-8' additionally it is assuming that all robots.txt files are encoded in 'utf-8' which isn't a safe option. The following version solves the problem:

def gen_uri(self, uri: str):
        """Instantiate a generator from a URI (either http:// or https:// or file:///"""
        try:
            with urllib.request.urlopen(uri) as f:
                self._time = time.time()
                encoding = f.headers.get_content_charset() or "utf-8"
                for line in f:
                    try:
                        yield line.decode(encoding)
                    except UnicodeDecodeError:
                        yield line.decode("utf-8", "replace")
        except urllib.error.HTTPError as err:
            if err.code in (401, 403):
                self.disallow_all = True
                self._errors.append((str(err.code), str(err)))
            elif 400 <= err.code < 500:
                self.allow_all = True
                self._warnings.append((str(err.code), str(err)))
            self._time = 0
        except urllib.error.URLError as err:
            self._errors.append(("", str(err)))

I have tried to submit a pull request but I have no rights 😉

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.