pol-infer

Inferring password composition policies from breached user credential databases.

Overview

Security researchers sometimes need to work out the password composition policy under which a publicly available breached user credential database was created. This tool assists with that, even when the data is "contaminated" with passwords that do not comply with the policy.

Prerequisites

This library requires you to have the following software installed:

  • Python 3.7.2 or later
  • Pandas for loading large CSV files
  • Matplotlib for plotting figures

Both Pandas and Matplotlib can be installed using pip:

pip install pandas
pip install matplotlib

Usage

Using the utility is a two-step process. The first thing you'll need is a plaintext password database (try SecLists for these), which you'll need to format as a CSV file like so:

password, frequency
"123456", 290729
"12345", 79076
"123456789", 76789
"password", 59462
"iloveyou", 49952
"princess", 33291
"1234567", 21725
"rockyou", 20901
"12345678", 20553
...
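If your raw data is a plain list with one password per line (duplicates included), the counts need to be aggregated before you have a file in this shape. The following is a minimal sketch of that step and is not part of pol-infer. The function name passwords_to_csv is hypothetical, and the sketch assumes that doubling embedded quote characters is acceptable to whatever reads the CSV.

```python
import io
from collections import Counter


def passwords_to_csv(lines):
    """Return CSV text with one row per distinct password, most frequent first."""
    # Strip only line terminators so passwords keep any leading/trailing spaces.
    counts = Counter(line.rstrip("\r\n") for line in lines)
    counts.pop("", None)  # ignore blank lines

    out = io.StringIO()
    out.write("password, frequency\n")
    for password, frequency in counts.most_common():
        escaped = password.replace('"', '""')  # conventional CSV quote escaping
        out.write(f'"{escaped}", {frequency}\n')
    return out.getvalue()


# Toy input standing in for a real password list read from disk.
sample = ["123456\n", "password\n", "123456\n", "iloveyou\n", "123456\n"]
csv_text = passwords_to_csv(sample)
print(csv_text)
```

For a real list you would pass an open file object instead of the toy sample and write the returned text to a .csv file.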

You can now pass this file to /src/extractfeatures.py to generate a JSON file containing features of the database. For convenience, I've included some of these files under /features to save you from doing this part yourself:

  • 000webhost.json is from the 000webhost breach. This service apparently had a password composition policy in place mandating that passwords be at least length 6 with at least one letter and at least one number.
  • linkedin.json is from the LinkedIn breach. Reported password composition policy is length 6 with no other constraints.
  • rockyou.json is from the RockYou breach. Reported password composition policy is length 5 with no other constraints.
  • xato.json is from the data dump compiled by Mark Burnett sampled randomly from several breaches. Because this is a compound dataset, passwords here are likely to have been created under multiple different policies (or no policy at all).
  • yahoo.json is from the Yahoo Voice breach. Reported password composition policy is length 6 with no other constraints.

Some feature files created from synthetic datasets are also included. These are:

  • linkedin-2class8-errors.json is the LinkedIn dataset (see linkedin.json) filtered according to a 2class8 policy (two character classes from lowercase, uppercase, digits and symbols, length at least 8), then run through introduceerrors.py, which simulates common data formatting errors by splitting passwords along potentially problematic tokens ( and ,).
  • linkedin-2word12-padded.json as above, but filtered according to a 2word12 policy (at least two letter sequences separated by non-letter sequences, length at least 12) and padded with the singles.org, elitehacker, hak5 and faithwriters datasets using combine.py. This is designed to simulate intentional padding of a dataset with smaller ones in order to increase its resale value.
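The two synthetic policies described above can be expressed as simple predicates. This is a rough sketch only: the function names are hypothetical, "two character classes" is read as "at least two", letters are taken to be ASCII letters, and a symbol is taken to be any character that is not an ASCII letter or digit. The filtering actually used to build the feature files may differ on these points.

```python
import re
import string


def count_classes(password):
    """Count how many of the four character classes appear in the password."""
    alnum = string.ascii_letters + string.digits
    present = [
        any(c in string.ascii_lowercase for c in password),
        any(c in string.ascii_uppercase for c in password),
        any(c in string.digits for c in password),
        any(c not in alnum for c in password),  # assumption: everything else is a symbol
    ]
    return sum(present)


def complies_2class8(password):
    """Assumed 2class8: length at least 8 and at least two character classes."""
    return len(password) >= 8 and count_classes(password) >= 2


def complies_2word12(password):
    """Assumed 2word12: length at least 12 and at least two letter sequences."""
    words = re.findall(r"[A-Za-z]+", password)  # maximal runs of ASCII letters
    return len(password) >= 12 and len(words) >= 2


for candidate in ["password", "password1", "correct-horse", "correcthorse1"]:
    print(candidate, complies_2class8(candidate), complies_2word12(candidate))
```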

Here's what these files look like:

{
  "lengths": {
    "1": 314,
    "2": 1042,
    "3": 6725,
    // ...
  },
  "lowerCounts": {
    "0": 6329765,
    "1": 333254,
    "2": 449242,
    "3": 852241,
    // ...
  },
  "upperCounts": {
    "0": 30653712,
    "1": 668835,
    "2": 162895,
    "3": 89374,
    // ...
  },
  // ...
}
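To make the structure concrete, here is a rough sketch of how histograms like these could be computed from (password, frequency) rows. It is not the code in extractfeatures.py: the function name extract_features is hypothetical, and weighting each bucket by the password's frequency is an assumption. The key names lengths, lowerCounts, upperCounts and digitCounts are the ones that appear in this README.

```python
import json
from collections import Counter


def extract_features(rows):
    """Build per-feature histograms from an iterable of (password, frequency) pairs."""
    names = ("lengths", "lowerCounts", "upperCounts", "digitCounts")
    features = {name: Counter() for name in names}
    for password, frequency in rows:
        # Assumption: each bucket is weighted by how often the password occurs.
        features["lengths"][len(password)] += frequency
        features["lowerCounts"][sum(1 for c in password if c.islower())] += frequency
        features["upperCounts"][sum(1 for c in password if c.isupper())] += frequency
        features["digitCounts"][sum(1 for c in password if c.isdigit())] += frequency
    # JSON object keys are strings, so convert the numeric buckets in sorted order.
    return {
        name: {str(bucket): total for bucket, total in sorted(hist.items())}
        for name, hist in features.items()
    }


# Toy rows, not taken from any real dataset.
features = extract_features([("123456", 3), ("Password1", 2)])
print(json.dumps(features, indent=2))
```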

For example, here's how to generate one from rockyou.csv (the CSV file is far too big to include here; check out SecLists for the raw data):

python ./src/extractfeatures.py rockyou.csv > rockyou.json

Now for the interesting bit: using src/polinfer.py to infer password composition policy rules. First, let's determine that most of the passwords in the set described by rockyou.json were created under a policy enforcing a minimum length constraint of 5:

python ./src/polinfer.py -k lengths ./features/rockyou.json
# > Lower constraint on lengths inferred as 5.

Nice, this is backed up by existing literature (for example, see the work by Golla and Dürmuth).

Now, let's check for a minimum number of digits:

python ./src/polinfer.py -k digitCounts ./features/rockyou.json
# > Lower constraint on digitCounts unlikely to be present in policy.

This gives us the correct answer, that RockYou did not mandate a minimum number of digits in passwords.
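For intuition about why a lower constraint is visible in a feature histogram at all, consider the toy heuristic below. It is an illustration only and is not necessarily the method src/polinfer.py uses: the function name, the min_ratio threshold and the sample numbers are all made up for this example. The idea is that a policy boundary tends to appear as a value where the cumulative count suddenly grows by a large factor, while the few passwords below it are the "contamination".

```python
def infer_lower_constraint(histogram, min_ratio=10.0):
    """Return the value with the largest cumulative jump, or None if no jump is large enough."""
    best_value, best_ratio = None, 0.0
    below = 0  # total count of passwords with a smaller feature value
    for value, count in sorted((int(k), v) for k, v in histogram.items()):
        if below > 0:
            # Factor by which the running total grows when this bucket is added.
            ratio = (below + count) / below
            if ratio > best_ratio:
                best_value, best_ratio = value, ratio
        below += count
    # Hypothetical cut-off: small jumps are treated as "no constraint".
    return best_value if best_ratio >= min_ratio else None


# Illustrative numbers only, not taken from any breach.
lengths = {"1": 3, "2": 10, "3": 67, "4": 180, "5": 26000, "6": 195000, "7": 250000}
digit_counts = {"0": 500000, "1": 90000, "2": 120000}

print(infer_lower_constraint(lengths))
print(infer_lower_constraint(digit_counts))
```

With these toy numbers the first call picks out length 5, where the running total grows roughly a hundredfold, and the second finds no jump large enough to suggest a constraint.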

We are similarly able to infer the policy in place for 000webhost (minimum length 6, at least 1 number):

python ./src/polinfer.py -k lengths ./features/000webhost.json
# > Lower constraint on lengths inferred as 6.
python ./src/polinfer.py -k digitCounts -l 0 ./features/000webhost.json
# > Lower constraint on digitCounts inferred as 1.

You can get a better idea about command-line arguments you can pass to each utility using the -h help flag:

python ./src/extractfeatures.py -h
# > Help information...
python ./src/polinfer.py -h
# > Help information...

Generating Figures

It's possible to use the utility to generate some interesting figures (included under /docs/figures). Matplotlib is used for this purpose. Here's an example:

Figure

The above figure was generated like this:

python ./src/polinfer.py -t '000webhost: $mult(l)$ for $l=1$ to $l=20$' -x 'Length ($l$)' -y '$mult(l)$' -o ./docs/figures/000webhost_lengthsAccum.svg -s ./features/000webhost.json

Acknowledgements

I wish to thank the following parties for their contribution to this project:

  • The font used in the logo is Monofur by Tobias Benjamin Köhler.
  • The Tango Icon Library (used in the logo) is an excellent free icon pack that I recommend checking out.

Contributors

  • lambdacasserole