badlinkfinder's Introduction

badlinkfinder

This is a Python 3 program that recursively crawls your website to find issues. It started as an afternoon project out of my desire to explore web crawling and indexing methodologies, and aims to be an easy tool for scanning your website and locating broken assets, links, and similar problems.

You must supply a seed URL, which is where the web crawler begins its search. To prevent it from crawling the entire internet, the crawler stays within the domain of the seed URL. It also tries to guess the MIME type of each URL before downloading it. If the URL appears to be an asset (e.g. an image or audio file), the crawler performs only a HEAD request to save bandwidth and processing time.
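
The crawl policy described above can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the names `same_domain`, `choose_method`, and `ASSET_PREFIXES` are hypothetical, and the sketch assumes the MIME guess is made from the URL's file extension with the standard-library `mimetypes` module.

```python
import mimetypes
from urllib.parse import urlparse

# Hypothetical: MIME type prefixes that this sketch treats as assets.
ASSET_PREFIXES = ("image/", "audio/", "video/")


def same_domain(seed_url, url):
    """Return True if url is on the same host as the seed URL."""
    return urlparse(seed_url).netloc.lower() == urlparse(url).netloc.lower()


def choose_method(url):
    """Pick HEAD for likely assets and GET for everything else."""
    # guess_type looks only at the file extension; no request is made.
    guessed_type, _ = mimetypes.guess_type(urlparse(url).path)
    if guessed_type and guessed_type.startswith(ASSET_PREFIXES):
        return "HEAD"
    return "GET"
```

A crawler built this way would skip any URL for which `same_domain` is false, and would download the body only when `choose_method` returns `"GET"`, since only those pages need to be parsed for further links.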

Installation

This can be deployed on any OS that has Python 3 available. Simply clone this repository into a directory of your choice.

git clone https://github.com/Jonchun/badlinkfinder.git

Install Python requirements

pip3 install -r requirements.txt

Usage

Execute as a module.

python3 -m badlinkfinder URL

or

blf URL

Current --help output.

$ blf --help
usage: blf [--workers WORKERS] [--timeout TIMEOUT] [--include-inbound]
           [--output-file OUTPUT_FILE] [--ignore-schemes IGNORE_SCHEMES]
           [--ignore-domains IGNORE_DOMAINS] [--help] [--version]
           [--log_level LOG_LEVEL]
           URL

BadLinkFinder - a recursive bot that scrapes a domain and finds all bad assets/links.

Positional Arguments:
  These arguments come after any flags and in the order they are listed here.
      Only URL is required.

  URL
      The starting seed URL to begin crawling your website. This is required to begin searching for bad links.


Crawler Settings:
  --workers WORKERS
      By default, 5 workers are used for the crawler.

  --timeout TIMEOUT
      By default, requests time out after 10 seconds.

  --include-inbound
      Whether to include inbound URLs when reporting Site Errors (show where they were referenced from)

  --output-file OUTPUT_FILE
      File name for storing the errors found.


Parser Settings:
  --ignore-schemes IGNORE_SCHEMES
      Ignore scheme when parsing URLs so that it does not detect as invalid.
          --ignore-schemes custom
      will ignore any URL that looks like "custom:nonstandardurlhere.com"
      (You can declare this option multiple times)

  --ignore-domains IGNORE_DOMAINS
      Ignore external domain when crawling URLs.
          --ignore-domains example.com
      will not crawl any URL that is on "example.com".
      (You can declare this option multiple times)


Troubleshooting:
  --help
      Show this help message and exit.

  --version
      Show version and exit.

  --log_level LOG_LEVEL, --log-level LOG_LEVEL
      [CRITICAL | ERROR | WARNING | INFO | DEBUG | NOTSET]
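
The help text says that `--ignore-schemes` and `--ignore-domains` can each be declared multiple times. One common way to get that behavior in Python is `argparse` with `action="append"`. The sketch below assumes that approach and is not necessarily how badlinkfinder builds its own parser; `build_parser` is a hypothetical name, and only a subset of the flags listed above is included.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring a few of the flags listed above."""
    parser = argparse.ArgumentParser(prog="blf")
    parser.add_argument("url", metavar="URL")
    # Defaults taken from the help text: 5 workers, 10 second timeout.
    parser.add_argument("--workers", type=int, default=5)
    parser.add_argument("--timeout", type=float, default=10)
    # action="append" collects every occurrence of the flag into a list.
    parser.add_argument("--ignore-schemes", action="append", default=[])
    parser.add_argument("--ignore-domains", action="append", default=[])
    return parser


args = build_parser().parse_args([
    "https://example.com",
    "--ignore-schemes", "custom",
    "--ignore-schemes", "tel",
    "--ignore-domains", "example.org",
])
print(args.ignore_schemes, args.ignore_domains, args.workers)
```

Each repeated flag adds one entry to its list, so the crawler can check a URL's scheme or domain against the whole list in one membership test.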

Support

Please open an issue for support. This is very much a work in progress, so expect rapid development.

Contributing

Open a pull request.

badlinkfinder's People

Contributors

jonchun, sbrinkerhoff

badlinkfinder's Issues

Usage of `_variable`

There seem to be quite a few places where underscore-prefixed variables or attributes are used:

    def add_neighbor(self, _from, _to):
        _from = normalize_url(_from)
        _to = normalize_url(_to)
        self._graph.inbound[_to].add(_from)
        self._graph.outbound[_from].add(_to)

And for methods:

    def _smart_crawl(self, url):
        parsed_url = urlparse(url)

Is this intentional?
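
For context on the question above, here is a short sketch of the general Python naming conventions involved. It is PEP 8 background only and says nothing about the author's intent; the class `LinkGraph` and its members are hypothetical and not taken from the badlinkfinder code.

```python
from collections import defaultdict


class LinkGraph:
    """Hypothetical graph, used only to illustrate naming conventions."""

    def __init__(self):
        # A leading underscore marks an attribute as internal by convention;
        # Python does not enforce it.
        self._inbound = defaultdict(set)
        self._outbound = defaultdict(set)

    # `from` is a reserved keyword, so it cannot be a parameter name.
    # PEP 8 suggests a trailing underscore (from_) to avoid such clashes.
    def add_neighbor(self, from_, to):
        self._inbound[to].add(from_)
        self._outbound[from_].add(to)

    def inbound(self, url):
        """Public accessor returning a copy of the URLs that link to url."""
        return set(self._inbound[url])
```

Under these conventions, a leading underscore on a method such as `_smart_crawl` reads as "internal helper", while a leading underscore on a parameter such as `_from` is usually just a way around the keyword clash.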
