
spidersuite

This project uses Node.js to implement a spider test suite: it crawls a site and outputs a list of warnings, errors, and 404s.

Install spidersuite

$ npm install spidersuite --save-dev

Usage

  1. Start the app that you want to crawl.

  2. Open another terminal window in your project directory and run spidersuite against your local server:

    ./node_modules/.bin/spidersuite https://localhost:8443/ [--config <PATH_TO_CONFIG_FILE>]
    

    The spidersuite results appear in this Terminal window.
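If you run the crawl regularly, you can also wire the command into an npm script; the script name, port, and config file name below are illustrative:

```json
{
  "scripts": {
    "spider": "spidersuite https://localhost:8443/ --config spider.config.json"
  }
}
```

With this in package.json, `npm run spider` invokes the locally installed binary without the `./node_modules/.bin/` prefix.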

Reporting results

After a run, spidersuite outputs a report on broken links, broken hashes, and other checks over the crawled site. To limit output, it reports only the first five links and redirects found for a specific URL.

Broken links

If spidersuite finds broken links, the not found count value in the report is greater than 0, and the report lists details for each missing URL:

not found count: 5
not found: [
  {
    "url": "https://localhost:8443/brokenurl/",
    "linkedFrom": [
      "https://localhost:8443/"
    ]
  },
  ...
]

The linkedFrom list shows the pages on which spidersuite found links to the missing URL.

Broken redirects

If spidersuite finds a link that redirects to a missing page, the report entry contains the redirectFrom property, which lists the URLs in the chain of redirects. Any URL in that chain might appear in the content and contribute to the overall error. For example:

not found count: 1
not found: [
  {
    "url": "https://localhost:8443/brokenurlredirect2",
    "redirectFrom": [
      "https://localhost:8443/brokenurlredirect1",
      "https://localhost:8443/brokenurlredirect0"
    ]
  }
]

In this example, the original URL in the content is https://localhost:8443/brokenurlredirect0. This URL redirected several times and ended at https://localhost:8443/brokenurlredirect2, which returns the HTTP 404 Not Found status code.

Important: The redirectFrom list can contain more than one head of the redirect chain. For example, the list might show URLs A, B, C, and D, where A redirects to B, C redirects to D, and both B and D redirect to the offending URL. Because more than one chain can lead to the same faulty page, check the content for every URL in every chain in the redirectFrom list.
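When you fix a broken link, you generally need to check every page that might contain it: the direct linkers plus every hop of every redirect chain. A minimal sketch of that bookkeeping, using the report entry shape shown above (the pagesToCheck helper is our illustration, not part of spidersuite):

```javascript
// Collect every URL whose content may reference the broken page:
// pages that link to it directly (linkedFrom) plus every hop in any
// redirect chain (redirectFrom), since any hop may appear in content.
function pagesToCheck(entry) {
  const urls = new Set([
    ...(entry.linkedFrom || []),
    ...(entry.redirectFrom || []),
  ]);
  return [...urls];
}

const entry = {
  url: 'https://localhost:8443/brokenurlredirect2',
  redirectFrom: [
    'https://localhost:8443/brokenurlredirect1',
    'https://localhost:8443/brokenurlredirect0',
  ],
};
console.log(pagesToCheck(entry).length); // 2
```

A Set is used because the same page can both link to a URL and sit in one of its redirect chains.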

Configuration

To meet your crawling and reporting needs, set one or more options in the spidersuite configuration file.

Configuration file format

The spidersuite configuration file resembles an eslint configuration file.

Use the extends property to inherit from another configuration file, or from the default configuration by specifying spidersuite:default.

Only .json and .js configuration files are supported.
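For example, a minimal .json configuration that starts from the defaults (the file name you choose is up to you, since the path is passed via --config):

```json
{
  "extends": "spidersuite:default"
}
```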

Configuration file options

The following table describes the full set of configuration options.

For any pattern, spidersuite replaces #{ROOT_URL} with the extracted root URL, such as https://domain.example.com:5522, which lets you treat URLs that intentionally, or unintentionally, link to other hosts differently.

| Option | Description |
| --- | --- |
| `additionalPaths` | A list of paths to crawl in addition to the initial URL. Use this option for hidden pages, that is, pages you cannot reach by navigating from the main URL. |
| `excludePatterns` | A list of patterns for URLs to skip. By default, spidersuite attempts every URL it finds. |
| `includePatterns` | A list of patterns for URLs to attempt. If specified, spidersuite fetches only pages that match at least one of the patterns. By default, it attempts every URL it finds. |
| `titlePattern` | A regular expression pattern that the HTML title of crawled pages on the same domain should match. |
| `<ERROR>WarnOnlyPatterns` | Reports the specified `<ERROR>` as a warning rather than a failure for matching URLs. `<ERROR>` is either `hashNotFound` or `http<XXX>`, where `<XXX>` is an HTTP status code from 400 to 510. For example, `http404WarnOnlyPatterns` reports matching 404 errors as warnings. |
| `reportSpoolInterval` | A number greater than zero: the interval at which to report the current spool, that is, the pages currently being fetched. Useful for debugging. |
| `strictCiphers` | If true, a stricter cipher list is used over TLS; if false, the cipher list is relaxed. |
| `simplecrawlerConfig` | spidersuite is based on simplecrawler, which has many configuration options of its own. Use this option to set them. |

Note: For more details about these options, see the configuration file examples in the examples directory.
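Putting several options together, here is a sketch of a configuration file. The option names come from the table above; the specific patterns and values are illustrative only, so check them against the examples directory:

```json
{
  "extends": "spidersuite:default",
  "additionalPaths": ["/hidden-page"],
  "includePatterns": ["^#{ROOT_URL}/"],
  "excludePatterns": ["/logout"],
  "titlePattern": "My Site",
  "http404WarnOnlyPatterns": ["/legacy/"],
  "simplecrawlerConfig": {
    "interval": 250
  }
}
```

Note the `#{ROOT_URL}` placeholder in `includePatterns`, which spidersuite replaces with the extracted root URL before matching.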

License

See License.

Contributing

See Contributing.

Acknowledgements

This project was inspired by, and heavily influenced by, eslint. Its configuration-parsing code was adapted for use in this project.

Contributors

braebot, drfleming0227, jackellenberger, vaughan99
