Broken-Link-Crawler

A Python project that recursively collects every link from a website's sitemap.xml file and checks whether each link returns 200 OK or is broken. It also sends an email notification when the checks complete.

linkChecker

A simple website URL and link checker. It crawls a given sitemap for every accessible URL and checks that each response is 200 (OK). It then proceeds to check every link in the HTML of that page.

This program does not create a sitemap.xml for you; for that you can use a website such as https://www.xml-sitemaps.com/.
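The two steps described above (reading URLs from the sitemap, then collecting the links on each page) can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: the names sitemap_urls, LinkCollector and page_links are hypothetical, and it assumes a sitemap that follows the sitemaps.org schema.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from urllib.parse import urljoin

# XML namespace used by sitemaps that follow the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(xml_text):
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against the page URL."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Turn relative links into absolute URLs so they can be checked.
                self.links.append(urljoin(self.page_url, href))


def page_links(page_url, html_text):
    """Return the absolute URL of every link found in html_text."""
    collector = LinkCollector(page_url)
    collector.feed(html_text)
    return collector.links
```

Each URL returned by sitemap_urls would be fetched, and the HTML of the response passed to page_links to find the links to check next.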

Setup

Note: Make sure to have Python pre-installed on your system before moving forward.

There are a few packages and libraries that you'll need to install. For your benefit, they are listed as line items in the requirements.txt file.

To install the required packages use pip: pip install -r requirements.txt

Usage

Command Line & Server side

Copy the contents of the conf/conf.ini.example file into conf/conf.ini in the same directory.

Modify your config file with the details for your site:

GENERAL CONFIG

Config option Description
SiteName The name of your site. This is for display only and is used in the output for easier identification
UseLocalFile yes (default) / no
LocalSitemapFile File path + name relative to this directory
DownloadSitemap yes / no (default)
RemoteSitemapUrl The URL of the sitemap hosted on your website
OutputToFile yes (default) / no
OutputFileName Name of the file where the results will be stored. It can be placed elsewhere using a relative path
LogfileDirectory The directory where logs will be saved. Ensure you have the correct permissions for the directory. The script will create a directory per site, i.e.: <LogFileDirectory>/Broken-Link-Crawler/<SiteName>/<date-of-scan>
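As an illustration of how these options might be read, here is a small sketch using Python's configparser. The [GENERAL] section name, the sample values, and the ISO date format used for <date-of-scan> are all assumptions; check conf/conf.ini.example for the real layout.

```python
import configparser
from datetime import date
from pathlib import PurePosixPath

# Hypothetical conf.ini contents: the section name and values are assumptions.
SAMPLE = """
[GENERAL]
SiteName = My Site
UseLocalFile = yes
LocalSitemapFile = sitemap.xml
DownloadSitemap = no
RemoteSitemapUrl = https://example.com/sitemap.xml
OutputToFile = yes
OutputFileName = results.txt
LogfileDirectory = logs
"""


def load_general(text):
    """Parse ini-formatted text and return its GENERAL section."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser["GENERAL"]


def log_directory(general, scan_date):
    """Build <LogFileDirectory>/Broken-Link-Crawler/<SiteName>/<date-of-scan>.

    The ISO date format (YYYY-MM-DD) is an assumption.
    """
    return str(
        PurePosixPath(
            general["LogfileDirectory"],
            "Broken-Link-Crawler",
            general["SiteName"],
            scan_date.isoformat(),
        )
    )


general = load_general(SAMPLE)
# getboolean understands yes/no, which matches the option values listed above.
use_local_file = general.getboolean("UseLocalFile")
```

configparser treats option names case-insensitively by default, so a lookup for LogfileDirectory or LogFileDirectory would find the same entry.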

EMAIL CONFIG

Config option Description
EmailOutput yes (default) / no
SMTPDomain The domain that will be used to send these emails from (eg. smtp.gmail.com or smtp.mailtrap.io)
AdminEmailAddress The address that emails will be sent from
AdminEmailPassword The password of the admin's email account -> PLAIN TEXT!
RecipientEmailAddresses The recipient email address(es) that the output is sent to. Separate multiple addresses with ','
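To show how these options could fit together, the sketch below splits the comma-separated recipients and builds a report email with the standard library. The function names, the subject line, and the use of port 587 with STARTTLS are assumptions made for illustration; the project may send mail differently.

```python
import smtplib
from email.message import EmailMessage


def parse_recipients(raw):
    """Split a ','-separated RecipientEmailAddresses value into a clean list."""
    return [address.strip() for address in raw.split(",") if address.strip()]


def build_report_email(sender, recipients, site_name, report_text):
    """Build the notification email. The subject wording is an assumption."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = ", ".join(recipients)
    message["Subject"] = f"Broken link report for {site_name}"
    message.set_content(report_text)
    return message


def send_report(message, smtp_domain, username, password, port=587):
    """Send the email through SMTPDomain. Port 587 with STARTTLS is an assumption."""
    with smtplib.SMTP(smtp_domain, port) as server:
        server.starttls()
        server.login(username, password)
        server.send_message(message)
```

Keeping message construction separate from sending makes it possible to inspect the email without contacting a mail server.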

AUTH CONFIG

For sites that are protected behind a username and password, you can authenticate by providing the username and password in the config.

WARNING These are stored in plain text, so the right privileges should be granted on the config file to keep them as secure as possible.

Config option Description
SiteUsername The username for the protected site
SitePassword The password for the protected site -> Plain Text
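If the protected site uses HTTP Basic authentication (an assumption; the README does not name the scheme), the configured username and password could be attached to each request roughly as follows. The helper names are hypothetical.

```python
import base64
import urllib.request


def basic_auth_header(username, password):
    """Return the value of an HTTP Basic Authorization header."""
    credentials = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")


def authenticated_request(url, username, password):
    """Build a request for url that carries the SiteUsername/SitePassword pair."""
    request = urllib.request.Request(url)
    request.add_header("Authorization", basic_auth_header(username, password))
    return request
```

Basic authentication only encodes the credentials and does not encrypt them, so it should be used over HTTPS.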

To run the script, ensure you have Python installed (minimum 2.7; the commands here use python3) and run: python3 linkChecker.py

MULTI CONFIGS

You can pass the config file name as a command-line argument. This is useful when you have multiple sites with a config for each site, e.g.:

python3 linkChecker.py mysite.ini

python3 linkChecker.py mysecondsite.ini
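A minimal sketch of how a script might choose its config file from the command line is shown below. The function name and the fallback to conf/conf.ini are assumptions, as is passing the argument through unchanged; the real script may resolve the name against the conf directory.

```python
import os
import sys

# Fallback used when no argument is given (assumed from the setup steps above).
DEFAULT_CONFIG = os.path.join("conf", "conf.ini")


def config_path(argv):
    """Return the config file named on the command line, or the default."""
    # argv[0] is the script name, so the first real argument is argv[1].
    if len(argv) > 1:
        return argv[1]
    return DEFAULT_CONFIG


if __name__ == "__main__":
    print(config_path(sys.argv))
```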

Graphical interface

Enter the Broken-Link-Crawler directory, and run: python3 main.py

From here you can enter or browse for the filename of the XML sitemap, and click enter.

HTTP Auth

If your site has HTTP authentication, you will be asked to enter the username and password for the site. These details are not stored.

The script will carry out the test on every URL and then output a report of all the broken links found.
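The check-and-report step can be sketched as follows. This is an illustration under stated assumptions rather than the project's implementation: the function names and the report layout are hypothetical, and any status other than 200 is treated as broken, following the description at the top of this README.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # The server answered, but with an error status such as 404 or 500.
        return error.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar problems.
        return None


def broken_links(results):
    """Keep only the URLs whose status is not 200. results maps URL -> status."""
    return {url: status for url, status in results.items() if status != 200}


def format_report(results):
    """Render a plain-text report. The layout is an assumption."""
    broken = broken_links(results)
    if not broken:
        return "No broken links found."
    lines = [f"{len(broken)} broken link(s) found:"]
    for url, status in sorted(broken.items()):
        label = status if status is not None else "no response"
        lines.append(f"  {label}  {url}")
    return "\n".join(lines)
```

Note that urlopen follows redirects, so a redirected link is reported with the status of its final destination.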

MailTrap

Mailtrap (https://mailtrap.io/) is a fake SMTP server that lets development teams test, view, and share emails sent from development and staging environments without spamming.

WARNING These are stored in plain text, so the right privileges should be granted on the config file to keep them as secure as possible.

Config option Description
EmailOutput yes (default)
SMTPDomain smtp.mailtrap.io
AdminEmailAddress e343dbe4d45b50
AdminEmailPassword 65bc8ce20b3425
RecipientEmailAddresses [email protected] or one that you have set up for yourself

If everything has been set correctly, you can log in to the mailtrap.io website and see the emails coming through.

Contributors

kaushalshah1307