
robotstxt-change-monitor's Introduction

Robots.txt Monitor

Never miss a robots.txt change again.

An accidental "Disallow: /" can happen to anyone, but it doesn't need to linger unnoticed. Whether you're a webmaster, developer, or SEO, this tool can help you quickly discover unwelcome robots.txt changes.

Key features

  • Easily check results. Robots.txt check results are printed, logged, and optionally emailed (see below).
  • Snapshots. The robots.txt content is saved following the first check and whenever the file changes.
  • Diffs. A diff file is created after every change to help you view the difference at a glance.
  • Email alerts (optional). Automatically notify site watchers about changes and send a summary email to the tool admin after every run. No need to run and check everything manually.
  • Designed for reliability. Errors such as a mistyped URL or connection issue are handled gracefully and won't break other robots.txt checks. Unexpected issues are caught and logged.
  • Comprehensive logging. Check results and any errors are recorded in a main log and website log (where relevant), so you can refer back to historic data and investigate if anything goes wrong.

How it works

  1. A /data directory is created/accessed to store robots.txt file data and logs.
  2. All monitored robots.txt files are downloaded and compared against the previous version.
  3. The check result is logged/printed and categorised as "first run", "no change", "change", or "error".
  4. Timestamped snapshots ("first run" and "change") and diffs ("change") are saved in the relevant site directory.
  5. If enabled, site-specific email alerts are sent out ("first run", "change", and "error").
  6. If enabled, an administrative email alert is sent out detailing overall check results and any unexpected errors.
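The core of steps 2–4 can be sketched in a few lines of Python (a simplified illustration; the function name, snapshot/diff file names, and the omitted error handling are assumptions, not the tool's actual code):

```python
import difflib
from pathlib import Path

def categorise_check(new_content, site_dir):
    # Compare freshly downloaded robots.txt content against the saved
    # snapshot; returns "first run", "no change", or "change" (the
    # "error" category is omitted for brevity).
    snapshot = Path(site_dir) / "robots_snapshot.txt"
    if not snapshot.exists():
        snapshot.write_text(new_content, encoding="utf-8")
        return "first run"
    old_content = snapshot.read_text(encoding="utf-8")
    if new_content == old_content:
        return "no change"
    # On a change, save a unified diff alongside the updated snapshot.
    diff = "\n".join(difflib.unified_diff(
        old_content.splitlines(), new_content.splitlines(),
        fromfile="old robots.txt", tofile="new robots.txt", lineterm=""))
    (Path(site_dir) / "diff.txt").write_text(diff, encoding="utf-8")
    snapshot.write_text(new_content, encoding="utf-8")
    return "change"
```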

Setup

Environment

  1. Save the project to a new directory (locally or on a server).
  2. Locate the example_config.py file in the /app/ directory and create a new config.py file in the same directory. Copy in the contents of example_config.py (this will be filled in properly later).
  3. Install all requirements using the Pipfile. For more information, please refer to the Pipenv documentation.

Emails disabled, local

The quickest setup, suitable if you plan to run the tool on your local machine for sites that you're personally monitoring.

  1. Open config.py and set EMAILS_ENABLED = False.
  2. Create a CSV file named "monitored_sites.csv" in the same directory as the .py files.
  3. Add a header row (with column labels) and details of sites you want to monitor, as defined below:
  • URL (column 1): the absolute URL of the website homepage, with a trailing slash.
  • Name (column 2): the website's name identifier (letters/numbers only).
  • Email (column 3): the email address of the site admin, who will receive alerts.
URL                  Name    Email
https://github.com/  Github  [email protected]

  4. Run main.py. The results will be printed and data/logs saved in a newly created /data subdirectory.

  5. Run main.py again whenever you want to re-check the robots.txt files. It's recommended that you check the print output or main log after every run, or at least after new sites are added, in case of unexpected errors.
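For reference, rows in this format can be read with Python's csv module (a sketch only; the tool's own sites_from_file() parsing may differ):

```python
import csv

def sites_from_csv(path="monitored_sites.csv"):
    # Read (url, name, email) tuples, skipping the header row.
    # The function name is illustrative, not the tool's actual API.
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        return [(row[0], row[1], row[2]) for row in reader if row]
```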

Emails enabled, local

Slightly more setup, suitable if you plan to run the tool on your local machine for yourself and others.

  1. Add the required details to config.py:
  • Set ADMIN_EMAIL to equal an email address which will receive the summary report.
  • Set SENDER_EMAIL to equal a Gmail address which will send all emails. Less secure app access must be enabled. It's strongly recommended that you set up a new Google account for this.
  • Ensure EMAILS_ENABLED = True.
  2. Create a CSV file named "monitored_sites.csv" in the same directory as the .py files.
  3. Add a header row (with column labels) and details of sites you want to monitor, as defined below:
  • URL (column 1): the absolute URL of the website homepage, with a trailing slash.
  • Name (column 2): the website's name identifier (letters/numbers only).
  • Email (column 3): the email address of the site admin, who will receive alerts.

URL                  Name    Email
https://github.com/  Github  [email protected]

  4. Open main.py, uncomment emails.set_email_login(), and save the file:
if __name__ == "__main__":
    # Use set_email_login() to save login details on first run or if email/password changes:
    emails.set_email_login()
    main()
  5. Run main.py. The results will be printed and data/logs saved in a newly created /data subdirectory. You will be prompted to enter your SENDER_EMAIL password, which will be saved for future use via Keyring.

  6. Re-comment emails.set_email_login() and save the file:

if __name__ == "__main__":
    # Use set_email_login() to save login details on first run or if email/password changes:
    # emails.set_email_login()
    main()
  7. Run main.py again whenever you want to re-check the robots.txt files. It's recommended that you check the main log after every run, or at least after new sites are added, in case of unexpected errors.

Emails enabled, server cron job (recommended)

More setup for a fully automated experience.

Refer to "Emails enabled, local", with the following considerations:

  • If Keyring isn't compatible with your server:

    • Open emails.py and locate the following line within send_emails(): with yagmail.SMTP(config.SENDER_EMAIL) as server:
    • Edit this line to include the sender email password as the second argument of yagmail.SMTP: with yagmail.SMTP(config.SENDER_EMAIL, SENDER_EMAIL_PASSWORD) as server:
    • Open main.py and ensure emails.set_email_login() is commented out.
  • You may need to edit the shebang line at the top of main.py.

  • You may need to edit the PATH variable in config.py.

  • It's strongly recommended that you test that the cron job implementation is working correctly. To test that changes are being correctly detected/reported, you can edit new_file.txt within the program_files subdirectory of a monitored site directory. On the following run, a change (versus the edited, "old" file) should be reported.
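As an alternative to editing new_file.txt by hand, the same test edit can be scripted (the file and directory names follow the description above; treat them as assumptions if the tool's layout changes):

```python
from pathlib import Path

def simulate_robots_change(program_files_dir):
    # Append a test line to the stored robots.txt copy so the next run
    # should report a "change" for this site. program_files_dir is the
    # program_files subdirectory of a monitored site directory under /data.
    snapshot = Path(program_files_dir) / "new_file.txt"
    content = snapshot.read_text(encoding="utf-8")
    snapshot.write_text(content + "\n# test edit", encoding="utf-8")
```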

FAQs

How can I ask questions, report bugs, or provide feedback?

Feel free to create an issue or open a new discussion.

What should I do if there's a connection error, non-200 status, or inaccurate content?

If the tool repeatedly reports an error or inaccurate robots.txt content for a site, but you're able to view the file via your browser, this is likely due to an invalid URL format or the tool being blocked in some way.

Try the following:

  1. Check that the monitored URL is in the correct format.
  2. Request that your IP address (or your server's IP address) be whitelisted by the sites you're monitoring.
  3. Try adjusting the USER_AGENT string (e.g. to spoof a common browser) in config.py.
  4. Troubleshoot with your IT/development teams.
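For example, a request with a spoofed User-Agent can be built with the standard library (a sketch; the tool itself may fetch files differently, and the header value should mirror USER_AGENT in config.py):

```python
import urllib.request

def build_robots_request(homepage_url, user_agent):
    # Build a robots.txt request that sends a custom User-Agent header.
    # homepage_url should be the absolute homepage URL with a trailing
    # slash, matching the monitored_sites.csv format.
    return urllib.request.Request(homepage_url + "robots.txt",
                                  headers={"User-Agent": user_agent})

def fetch_robots_txt(homepage_url, user_agent):
    # Download the robots.txt content using the custom header.
    with urllib.request.urlopen(build_robots_request(homepage_url, user_agent),
                                timeout=10) as response:
        return response.read().decode("utf-8")
```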

Is this project in active development?

There are no further updates/features planned, and I'm not looking for contributions, but I'll be happy to fix any (significant) bugs.

robotstxt-change-monitor's People

Contributors

cmastris


robotstxt-change-monitor's Issues

Getting this UnicodeEncodeError for a few sites:

Hi,

I tried to run a few websites and I'm getting the below error for 2 of them. Can anyone please help me with how to resolve this?

16-02-21, 11:09: Error: https://www.example.com/. Unexpected error during https://www.example.com/ check.
TYPE: <class 'UnicodeEncodeError'>
DETAILS: 'charmap' codec can't encode character '\u0432' in position 3529: character maps to <undefined>
TRACEBACK:

File "D:\Python\robots_txt\main.py", line 222, in run_check
    self.update_records(extraction)
File "D:\Python\robots_txt\main.py", line 313, in update_records
    new.write(new_extraction)
File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]

Thank you,
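The cp1252.py frame in the traceback above suggests the snapshot file is being written with the Windows default charmap codec. A likely workaround (an educated guess, not an official fix) is to pass an explicit UTF-8 encoding wherever robots.txt content is read or written, e.g.:

```python
def write_snapshot(path, content):
    # Write robots.txt content as UTF-8 instead of the platform default
    # (cp1252 on Windows), which can't encode characters like '\u0432'.
    with open(path, "w", encoding="utf-8") as f:
        f.write(content)
```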

Automate the process by cron job implementation through task scheduler (OS - Windows 10)

Hi @Cmastris

I'm trying to automate the process (cron job implementation) through Task Scheduler (OS: Windows 10).
I created a batch file instead of adding the Python script directly to the scheduler. The batch file contains the path to python.exe (the application) and the path to my "main.py" script.
When I run the batch file manually it works correctly, but when I add it as a program in Task Scheduler it fails with the below error.

C:\windows\system32>C:\Users\suri02\AppData\Local\Programs\Python\Python39\python.exe "D:\Python Tasks\robots txt stuff\main.py"
Fatal error.
TYPE: <class 'FileNotFoundError'>
DETAILS: [Errno 2] No such file or directory: 'monitored_sites.csv'
TRACEBACK:

File "D:\Python Tasks\robots txt stuff\main.py", line 468, in main
sites_data = sites_from_file(config.MONITORED_SITES)
File "D:\Python Tasks\robots txt stuff\main.py", line 26, in sites_from_file
with open(file, 'r') as sites_file:

Error when updating the main log.
TYPE: <class 'FileNotFoundError'>
DETAILS: [Errno 2] No such file or directory: 'data/main_log.txt'
TRACEBACK:

File "D:\Python Tasks\robots txt stuff\logs.py", line 86, in update_main_log
with open(config.MAIN_LOG, 'a') as f:

Note: I don't have admin privileges on the system which I'm working on.

Thanks
Suresh

Originally posted by @suri02 in #2 (comment)
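A common cause here (a hedged suggestion, not a confirmed diagnosis): Task Scheduler starts the process in C:\windows\system32, so relative paths such as 'monitored_sites.csv' and 'data/main_log.txt' don't resolve. Changing into the script's own directory at startup avoids this:

```python
import os

def chdir_to_script_dir(script_path):
    # Make relative paths like 'monitored_sites.csv' resolve against the
    # directory containing the script, not the scheduler's working directory.
    os.chdir(os.path.dirname(os.path.abspath(script_path)))

# Hypothetical usage at the top of main() in main.py:
# chdir_to_script_dir(__file__)
```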
