RobotsDisallowed

The RobotsDisallowed project is a harvest of the Disallowed directories from the robots.txt files of the world's top websites, specifically the Alexa Top 100,000.

This list of Disallowed directories is a great way to supplement content discovery during a web security assessment, since the website owner is basically saying "Don't go here; there's sensitive stuff in there!".

It's basically a list of potential high-value targets.

The project

We took the Alexa Top 100,000 websites, downloaded their robots.txt files, extracted all Disallowed directories, and then performed a bunch of cleanup on them (they are a mess) to make the lists as useful as possible during web assessments.
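As a rough illustration of the extraction step, here is a minimal Python sketch that pulls the Disallow paths out of the text of a single robots.txt file. It is not the project's actual code (the real scripts live in the 'Code' directory); the function name `extract_disallowed` and the sample input are made up for this example, and the only cleanup it does is dropping comments and empty rules.

```python
def extract_disallowed(robots_txt: str) -> list[str]:
    """Return the path from each non-empty Disallow rule, in file order."""
    paths = []
    for raw_line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        # Skip lines without a colon and any field other than Disallow.
        if not sep or field.strip().lower() != "disallow":
            continue
        value = value.strip()
        # An empty Disallow rule blocks nothing, so there is nothing to harvest.
        if value:
            paths.append(value)
    return paths


# Hypothetical robots.txt content, for illustration only.
sample = """\
User-agent: *
Disallow: /admin/   # back office
disallow: /tmp/
Disallow:
Allow: /public/
"""

print(extract_disallowed(sample))
```

Field names are matched case-insensitively here because robots.txt files in the wild are inconsistent about capitalization.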

How to use the project

To use the project, go to the repository root and download the DisallowedDirectories files there. You can then plug them into your favorite web assessment tool/function, e.g., Burp Intruder.
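If your tool expects full URLs rather than a path wordlist, a small helper can join each entry to a target base URL. The sketch below is only an assumption about how you might do that: `candidate_urls` is a hypothetical name, the base URL is a placeholder, and the code builds strings without sending any requests. Only use the result against sites you are authorized to assess.

```python
def candidate_urls(base_url: str, wordlist_lines) -> list[str]:
    """Join each non-blank wordlist entry to base_url with exactly one slash."""
    base = base_url.rstrip("/")
    urls = []
    for line in wordlist_lines:
        entry = line.strip()
        if not entry:
            continue  # skip blank lines in the wordlist
        urls.append(base + "/" + entry.lstrip("/"))
    return urls


# Placeholder target and entries; in practice the lines would come from
# one of the downloaded DisallowedDirectories files.
for url in candidate_urls("https://example.com/", ["/admin/\n", "\n", "tmp/"]):
    print(url)
```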

The files are broken down into Top-n lists, which are sorted lists based on the most common directories found. But if you are pressed for time or are looking for the highest-value targets, check out the InterestingDirectories.txt file, which I blogged about here: https://danielmiessler.com/blog/the-most-interesting-disallowed-directories/.
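To make the Top-n idea concrete, here is a minimal Python sketch of ranking directories by how many sites list them. It assumes each directory is counted once per site; the README does not say whether the real lists count per site or per occurrence, so treat that as an illustrative choice. The function name `top_n` and the sample data are hypothetical.

```python
from collections import Counter


def top_n(per_site_paths, n: int) -> list[str]:
    """Return the n directories that appear on the most sites."""
    counts = Counter()
    for paths in per_site_paths:
        # Assumption: count a directory once per site, even if repeated.
        counts.update(set(paths))
    # Most common first; ties broken alphabetically for stable output.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [path for path, _ in ranked[:n]]


# Made-up per-site harvests, for illustration only.
per_site = [
    ["/admin/", "/tmp/"],
    ["/admin/", "/cgi-bin/"],
    ["/admin/", "/tmp/", "/tmp/"],
]

print(top_n(per_site, 2))
```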

If you want to see how the output is created, enter the 'Code' directory. There you can get the raw Alexa site list, the scripts that are used to download and manipulate the robots.txt files, etc.

Credit

This concept is not new. The RAFT project was the first to do this, but the project is now dead and gone. And since the concept works best when it's kept up-to-date, we decided to give it a refresh in the form of RobotsDisallowed.

Next steps

There are lots of things we want to do with this:

  1. Write a cleanup script that prunes the least likely hits
  2. Complete all one million sites
  3. Create individual lists for the top 10, top 100, top 1000 directories, etc., so that if you're pushed for time you can use one of the condensed versions.

More ideas welcome!

Leaders

It's harder than it looks to make the list both comprehensive and usable. People tend to have some pretty silly stuff in their robots.txt files, and many of the entries are only useful for one site.

So we curate.

If you'd like to help out, feel free to submit issues to the repo or send pull requests.

Thanks!

Credits

It's important to us to thank people when they help out with the project.

  • Brad Wolfe for adding the epic Bash multithreading (yeah, you read that correctly; go check the code)
