robots_txt's Introduction

A complete, dependency-less and fully documented robots.txt ruleset parser.

Usage

You can obtain the robot exclusion rulesets for a particular website as follows:

// Get the contents of the `robots.txt` file.
final contents = /* Your method of obtaining the contents of a `robots.txt` file. */;
// Parse the contents.
final robots = Robots.parse(contents);
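
For example, here is a minimal sketch of obtaining the contents over HTTP, assuming package:http is used as the client (the URL and the import path of this package are illustrative):

import 'package:http/http.dart' as http;
import 'package:robots_txt/robots_txt.dart';

Future<void> main() async {
  // Fetch the `robots.txt` file; `example.com` stands in for the target host.
  final response = await http.get(Uri.parse('https://example.com/robots.txt'));
  // Parse the contents.
  final robots = Robots.parse(response.body);
  // `robots` can now be queried as shown below.
}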

Now that you have parsed the robots.txt file, you can check whether a given user-agent is allowed to access a particular path:

final userAgent = /* Your user-agent. */;
print(robots.verifyCanAccess('/gist/', userAgent: userAgent)); // False
print(robots.verifyCanAccess('/government/robots_txt/', userAgent: userAgent)); // True

If you are only concerned about directives pertaining to your own user-agent, you may instruct the parser to ignore other user-agents as follows:

// Parse the contents, disregarding user-agents other than 'government'.
final robots = Robots.parse(contents, onlyApplicableTo: const {'government'});

The Robots.parse() function does not have any built-in structure validation. It will not throw exceptions and fails silently wherever appropriate. If the contents passed into it are not a valid robots.txt file, there is no guarantee that it will produce useful data, and it will err on the side of disallowing a bot wherever possible.
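
For instance, the following does not throw; it simply yields a ruleset that is unlikely to be useful:

// No exception is thrown, despite the contents being invalid.
final robots = Robots.parse('This is an obviously invalid robots.txt file.');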

If you wish to ensure before parsing that a particular file is valid, use the Robots.validate() function. Unlike Robots.parse(), this one will throw a FormatException if the file is not valid:

// Validating an invalid file will throw a `FormatException`.
try {
  Robots.validate('This is an obviously invalid robots.txt file.');
} on FormatException {
  print('As expected, this file is flagged as invalid.');
}

// Validating an already valid file will not throw anything.
try {
  Robots.validate('''
User-agent: *
Crawl-delay: 10
Disallow: /
Allow: /file.txt

Host: https://hosting.example.com/
Sitemap: https://example.com/sitemap.xml
''');
  print('Again as expected, this file is not flagged as invalid.');
} on FormatException {
  // Code to handle an invalid file.
}

By default, the validator will only accept the following fields:

  • User-agent
  • Allow
  • Disallow
  • Sitemap
  • Crawl-delay
  • Host

If you want to accept files that feature any other fields, you will have to specify them as follows:

try {
  Robots.validate(
    '''
User-agent: *
Custom-field: value
''',
    allowedFieldNames: {'Custom-field'},
  );
} on FormatException {
  // Code to handle an invalid file.
}

By default, the parser treats the Allow field as taking precedence. This is the standard approach to both writing and reading robots.txt files; however, you can instruct the parser to follow a different approach:

robots.verifyCanAccess(
  '/path', 
  userAgent: userAgent, 
  typePrecedence: RuleTypePrecedence.disallow,
);

Similarly, fields defined later in the file take precedence over those defined earlier; this, too, is the standard approach. You can instruct the parser to rule otherwise:

robots.verifyCanAccess(
  '/path',
  userAgent: userAgent,
  comparisonMethod: PrecedenceStrategy.lowerTakesPrecedence,
);
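
Both overrides can also be combined in a single call; a brief sketch (assuming both named parameters may be passed together, since each is shown above on the same method):

robots.verifyCanAccess(
  '/path',
  userAgent: userAgent,
  // Override both defaults at once.
  typePrecedence: RuleTypePrecedence.disallow,
  comparisonMethod: PrecedenceStrategy.lowerTakesPrecedence,
);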


robots_txt's Issues

Remove non-developer dependencies.

For the claim that the package is 'lightweight' to be legitimate, it should have as few dependencies as possible.

  • Remove sprint.
    This will be a breaking change, as the library will have to throw exceptions instead of logging messages.
  • Remove web_scraper.
    Instead of scraping web pages itself to obtain the contents of the robots.txt file, the package could accept the contents as a parameter supplied by the developer using the library, making it both more flexible and legitimately 'lightweight'. This will also be a breaking change; the resulting usage pattern is sketched after this list.
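
A minimal sketch of the proposed caller-supplied-contents pattern, here assuming the contents are read from a local file with dart:io (the file path is illustrative):

import 'dart:io';
import 'package:robots_txt/robots_txt.dart';

void main() {
  // The library does not fetch anything itself; the caller supplies the contents.
  final contents = File('robots.txt').readAsStringSync();
  final robots = Robots.parse(contents);
  // `robots` can now be queried exactly as in the Usage section above.
}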
