
probable-wordlists's Introduction

Probable Wordlists - Version 2.0

Do you know what the world's most common passwords are?
Do you know what they look like?
You'll want to avoid them to be secure!

Thinking of Cloning?

This repository does not contain code, but links to a group of lists.
A clone may not be necessary to get the files you need.
Visit the downloads page for more information.


Check out the Password Trend Analysis - and learn!

I visualized the trends of passwords that appeared 10 times or more in the Version 1 files. The charts contain immediately actionable advice on how to make your passwords more unique.

Methodology: Why and How

The Why

Password wordlists are not hard to find. It seems like every few weeks we hear about a massive, record-breaking data breach that has scattered millions of credentials across the internet for everyone to see. If our data is leaked, we'll change our passwords, the hard-working security teams will address the vulnerabilities and everyone will wait until they hear about the next breach.

While leaks may be published with malicious intent, I see an opportunity here for us to make ourselves a bit more secure online.

Passwords, by definition, are meant to be secret. If it weren't for these leaks, we might not have any idea what a password looks like. Sure, we might know the password to a friend's home Wifi network, or for a company expense account, but passwords are usually only intended to be known by the user and an authentication system.

But, consider this:
If you are never supposed to tell me yours, and I am never going to tell you mine...
How do we know that we aren't using the same passwords?

How do we know we aren't using the same passwords as millions of other people?

If crooks are the only ones who understand what common passwords look like, then the rest of us may never change our passwords! Without this knowledge, we may just continue believing that our password is one of a kind. The data shows that, frequently, passwords are anything but one of a kind.

This is confirmed year after year, when "password" is found to be among the top 3 passwords for the umpteenth time in a row. Until we know what common passwords look like, we will keep coming up with passwords that appear in dozens of leaks.

If any of your passwords has been published on the internet for everyone to see, then can you really claim it as your password?

The How

While studying password wordlists, I noticed most were either sorted alphabetically or not sorted at all. This might be fine for computerized analysis, but I wanted to learn something about the way people think.

I determined that for the most practical analysis, lists had to be sorted in a manner that reflected actual human behavior, not an arbitrary alphabet system or random chronology.

For the better part of a year, I went to sites like SecLists, Weakpass, and Hashes.org to download nearly every wordlist containing real passwords I could find. After attempting to remove non-pertinent information, this harvest yielded 1600 files spanning more than 350GB worth of leaked passwords.

For each file, I removed internal duplicates and ensured that they all used the same style of newline character. Some of these lists were composed of smaller lists, and some lists were exact copies, but I took care that the source material was as "pure" as possible. Then, all files were combined into a single amalgamation that represented all of the source files.

Each time a password appeared in this file represented a time it was found in the source materials. I considered the number of source files a password appeared in to be an approximation of its overall popularity. If an entry was found in fewer than 5 files, it isn't commonly used; but if an entry could be found in more than 350 files, it is incredibly popular. The passwords found in the highest number of source files are considered the most popular and are placed at the top of the list. Passwords that didn't appear frequently were placed at the bottom.

The giant source file represented nearly 13 billion passwords! However, since this project aims to find the most popular passwords, and not just list as many passwords as I could find, a password needed to be found at least 5 times in analysis to be included on these lists.

The end result is a list of approximately 2 billion real passwords, sorted in order of their popularity, not by the alphabet.
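The counting-and-ranking step described above can be sketched with standard Unix tools. This is a minimal illustration of the method, not the actual scripts used; the two tiny sample files stand in for the ~1600 real sources.

```shell
# Sketch: dedupe each source list internally, then rank passwords by
# how many source files they appear in, most widespread first.
# list1.txt / list2.txt are hypothetical stand-ins for the real sources.
printf 'password\n123456\npassword\n' > list1.txt
printf 'password\nqwerty\n' > list2.txt

for f in list1.txt list2.txt; do
  sort -u "$f"                 # internal duplicates removed per file
done | sort | uniq -c | sort -rn | awk '{print $2}' > probable.txt
# probable.txt now starts with "password" (it appeared in both files)
```

On real data the threshold step (drop anything seen in fewer than 5 files) would be one more awk filter on the `uniq -c` counts.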


Directories In This Repository

Files sorted by popularity will include probable-v2 in the filename

These are REAL passwords.

The files in this folder come from sites like https://github.com/danielmiessler/SecLists, https://weakpass.com/ and https://hashes.org/

Some files contain entries between 8-40 characters. These can be found in the Real-Passwords/WPA-Length directory.
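A length filter like the one behind the WPA-Length directory can be sketched with awk. The 8-40 character bounds come from the description above; the sample file is hypothetical.

```shell
# Keep only WPA-length candidates (8-40 characters, per the directory
# description); 'all.txt' is a hypothetical sample file.
printf 'abc\npassword1\nsupersecretpass\n' > all.txt
awk 'length($0) >= 8 && length($0) <= 40' all.txt > wpa.txt
# wpa.txt now holds password1 and supersecretpass; 'abc' was dropped
```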

Files include dictionaries, encyclopedic lists, and miscellaneous wordlists. Wordlists in this folder were not necessarily associated with the "password" label.

Some technically useful lists, such as common usernames, tlds, directories, etc. are included.

Files useful for password recovery and analysis. Includes HashCat Rules and Character Masks.

These files were generated using the PACK project.

Attributions

People Are Talking About Probable-Wordlists?!

Note that the author is not affiliated with or officially endorsing the visiting of any of the links below.

I found most (if not all) of these mentions by simply searching for the project in various engines.

Thanks for the shout-outs!


Disclaimer and License

  • These lists are for LAWFUL, ETHICAL AND EDUCATIONAL PURPOSES ONLY.
  • The files contained in this repository are released "as is" without warranty, support, or guarantee of effectiveness.
  • However, I am open to hearing about any issues found within these files and will be actively maintaining this repository for the foreseeable future. If you find anything noteworthy, let me know and I'll see what I can do about it.

The author did not steal, phish, deceive or hack in any way to get hold of these passwords. All lines in these files were obtained through freely available means.

The author's intent for this project is to provide information on insecure passwords in order to increase overall password security. The lists will show you what passwords are the most common, what patterns are the most common, and what you should avoid when creating your own passwords.

License: CC BY-SA 4.0

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.


Enjoy!

probable-wordlists's People

Contributors

berzerk0, borekon, jimbergman, spmedia


probable-wordlists's Issues

Finicky Torrents

As of now, the torrents are finicky.
I can get some people seeding, but not others. Sometimes it stalls out.
I haven't spotted a pattern to how and why, but I suspect it has to do with trackers.

If you have found your torrent has stalled, first try pausing and resuming, or using the "update tracker" option in your client.

Personally, I can get them to leech onto one of my computers using Deluge, but not qBittorrent.
However, I have seen some downloaders that are downloading successfully with qBittorrent, so that seems inconclusive.

Anyone have any ideas?

These Wordlists Don't Target Specific Individuals

While these lists are representative of the WORLD, they may not be representative of a particular PERSON.

People are more likely to use passwords that include some aspects of their personal lives, things that are important to them.

Is there some kind of tool that can create wordlists that are laser-guided to a specific individual?

Suggestion: Human passwords only

Hello again :)
I think there is a way to generate human-generated wordlists only, but I am sure it will be tricky :).

What I mean is: we already have plenty of human words + names + city names, etc.
We already know how people put 4 instead of A, 1 instead of i, etc.
If you search (case-insensitive + number-replacement option) for all human words in the current biggest file and extract all matches, you will find (still ordered by probability) all passwords that for sure were NOT generated by a random password generator.

I am sure there are people who can provide a good analysis of what a word is in general - there are specific patterns that can be found only in human words, no matter the language. This way all kinds of slang, jargon, and offensive street words can be included, and for some funny reason they are a HUGE % of all passwords :)

I believe this new list will be far more probable, especially for WPA.

Further de-duplication for rules cracking

Great project, thanks for taking the time.

Food for thought: typically when using hashcat I like to run through and pull out the straight matches, then switch to rules like Korelogic or the built-in set. To that end, having various permutations in the file reduces efficiency because the rules will catch them anyway. For example, having "password" in the list would suffice, since "Password0", "p455w0rd" and "Pa55word" would all be generated by the most common mungers. Sure, rules on top of a munged version might produce more words, but there are better ways of layering rules on top of each other in a more deliberate way.

Anyway, as long as you are on the path of creating derivative password lists, one that is normalized for munging rules would be something to think about. For my purposes I just strip out the easy stuff -- tolower it all, strip off leading and trailing single digits, replace mid-stream digits with corresponding letters, etc.

cheers

$ egrep '^[Pp][aA4][sS5]{2}w[oO0]rd[0-9]{0,2}$' Top125Thousand-probable.txt | head
password
password1
passw0rd
Password
Password1
pa55word
password2
pa55w0rd
password12
password01
$ egrep '^[Pp][aA4][sS5]{2}w[oO0]rd[0-9]{0,2}$' Top125Thousand-probable.txt | wc -l
106
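The normalization the commenter describes ("tolower it all, strip off leading and trailing single digits, replace mid-stream digits with corresponding letters") can be sketched as a pipeline. This is an illustrative approximation, not their exact rules; the substitution map is limited to a few common digit-for-letter swaps.

```shell
# Illustrative munging normalizer (an approximation, not the
# commenter's exact implementation):
#   1. lowercase everything
#   2. strip one leading and one trailing digit
#   3. map common digit-for-letter substitutions back
#      (0->o, 1->i, 3->e, 4->a, 5->s)
printf 'Password1\npa55w0rd\nPa55word\n' |
  tr 'A-Z' 'a-z' |
  sed -E 's/^[0-9]//; s/[0-9]$//' |
  tr '01345' 'oieas' |
  sort -u
# all three variants collapse to the single base word "password"
```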

SecLists Integration

Great work here!

We'd like to include the content in SecLists. Is that ok with you?

Duplicated entries found on WPA-Length wordlists

There are duplicated entries for some words in the Top 31 Million, Top 102 Million and Top 1.8 Billion files. As an example, the word 'password' can be found on both line 1 and line 11,853,466 of the files.

I am not good with Unix commands, but the files can be easily fixed using SQL / MySQL. I already fixed them with the code I am sharing below, including also removing words with length of 7 characters. For the 31 Million file, 302,363 entries were removed after cleaning. This code is an example for the 31 Million wordlist, but the same code can be used for the other wordlist just changing the name of the txt file:

/* Creates a Database named 'WPA' */
CREATE DATABASE WPA;
USE WPA;

/* Creates a table named 'Top31MillionWPA'
with two columns: a unique auto_incremental 'id' to keep the
popularity order and 'word' containing the text.
Uses utf8_bin to compare strings case-sensitively */

CREATE TABLE Top31MillionWPA(
id BIGINT NOT NULL AUTO_INCREMENT, Word varchar(255)
, PRIMARY KEY (id), INDEX IX_word (word)
) AUTO_INCREMENT=1 COLLATE utf8_bin;

/*Temporary settings for speed up load of text file*/
set unique_checks = 0;
set foreign_key_checks = 0;
set sql_log_bin=0;

/*Loads the text file into the table, into the 'word' column.
 The id column will get automatically populated
////// Change directory and filename accordingly //////
 */ 
LOAD DATA INFILE '/tmp/Top31Million-probable-WPA.txt' INTO TABLE Top31MillionWPA(word);

/* Back to default settings*/
set unique_checks = 1;
set foreign_key_checks = 1;
set sql_log_bin=1;

 /*  This will keep the first entry of the duplicates only in a new table
  this is faster than deleting the duplicates (at the cost of storage space) */
CREATE TABLE Top31MillionWPAclean SELECT Top31MillionWPA.* FROM Top31MillionWPA
LEFT OUTER JOIN(
	SELECT MIN(id) AS FirstID, word
	FROM Top31MillionWPA
	GROUP BY word
	) AS KeepFirst ON
	Top31MillionWPA.id = KeepFirst.FirstID
	WHERE KeepFirst.FirstID IS NOT NULL;

/* Delete original MySQL table  */
DROP TABLE Top31MillionWPA;	
	
/* CREATE Primary Key on new table to speed up the query */
ALTER TABLE Top31MillionWPAclean
ADD PRIMARY KEY (id);
 
/* Create clean text file keeping the popularity order. 
Also, the output is only words of length >= 8 characters
////// Change directory and filename accordingly //////
 */
SELECT word INTO OUTFILE '/tmp/Top31Million-probable-WPA-clean.txt'
FROM Top31MillionWPAclean WHERE LENGTH(word)>=8 ORDER BY id ASC;

/* Delete the MySQL table  */
DROP TABLE Top31MillionWPAclean;
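For anyone who prefers staying on the command line, the same first-occurrence dedup (plus the >= 8 length filter) can be done as a single awk pass. This is a sketch: the sample file is hypothetical, and on the real multi-gigabyte lists the `seen[]` array needs enough RAM to hold every unique entry.

```shell
# First-occurrence dedup plus the >=8 length filter in one awk pass,
# preserving popularity order; the input file here is a tiny
# hypothetical sample.
printf 'password\nletmein1\npassword\nshort\n' > Top31Million-sample.txt
awk 'length($0) >= 8 && !seen[$0]++' Top31Million-sample.txt > clean.txt
# clean.txt: password, letmein1 (duplicate and 5-char entry dropped)
```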

Easier Readme Guide

Add a link to an easier-to-follow readme guide, perhaps with a "what not to do" disclaimer.
Also add some scripts to make sure we have all the prerequisites we need, and maybe handle downloading.

Provide torrents

Instead of Mega links maybe providing torrents would also be nice. E.g. Mega requires an add-on for downloading files >1GB.

Suggestion: Statistics about popularity.

Hello,
Maybe I am wrong, but I have a feeling that a big number of all passwords are "seen" only once across the different sources (if those are not just copies/upgrades of each other). It would be useful to have some general guidance like:

first million - words seen between 200 and 20 times
from 1000k to 10000k - words seen between 19 and 4 times
from 10000k to 100000k - words seen between 3 and 2 times
from 100000k to the end - words seen 1 time only

This will give better understanding - where the probability stops, and random/alphabetically order starts.

For example - even in the 120M wordlist I saw many passwords that are obviously from a random generator, and the chance that they are used by many people or in many places is close to zero.

Mix of line endings

It seems to me everything under Dictionary-Style has CRLF line endings. IMHO every file should have LF endings, so people don't end up with a mix of line endings after concatenating files.
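Normalizing a file's line endings to LF is a one-liner; `wordlist.txt` below is a placeholder name. Note that `tr -d '\r'` removes every carriage return, including any embedded mid-line, which is normally what you want for these lists.

```shell
# Convert CRLF line endings to LF; 'wordlist.txt' is a placeholder.
printf 'password\r\n123456\r\n' > wordlist.txt
tr -d '\r' < wordlist.txt > wordlist-lf.txt
# wordlist-lf.txt now contains the same two entries with LF endings
```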

Rev 2 isn't released yet

TLDR;
I said Rev 2 would be out mid-July 2017. That's now.
New estimate is Mid-August

I am not close to release.

I had a major setback due to a minor script typo that required me to do about 30% of the total Rev 2 work all over again. C'est la vie, or perhaps shikata ga nai is better here.

I'm pretty much back to where I was before the typo, if not farther along.
There's a lot of manual work that needs to be done that isn't script friendly.
I see bash in my sleep, but luckily I am nearing the end of that portion.

Next up is the stage where I just set it up to run and go about my business.
New estimate is Mid-August.

Note/warn about size

You should state the size of the whole repo in the Readme, so people are not surprised when cloning it… πŸ˜„

Some questions for v. 1.2

Hi,

I have a few questions. You have a lot of funny things in the list:

  1. Passwords from 1 to 4 characters - brute-force covers these quickly, and dropping them saves approx. 370 MB.

  2. Passwords consisting only of numbers. Up to 9 characters, brute-force is faster; above 10 characters such passwords are rather rare. You could also save 2 GB (put them in a separate file).

  3. E-mails are rarely used as passwords.

  4. Passwords consisting only of special characters are also rather rare.

  5. Very long lines with code fragments and MD5 hashes (32 characters and longer). Most are definitely not passwords but garbage that the hackers were too lazy to eliminate from their password lists.

  6. Special characters à la "&036;" in passwords. Most of them were created by wrong conversions between UTF-8, Windows encodings, and UTF-16. You have to convert such things.

Regards,

John

Seedbox File Switchover

After the release of Version 2 in the next few days, the seedbox will go down briefly as I switch over from the old to the new files.

If you want to get the Rev 1 files, do so ASAP

Provide password occurrences

Could you please provide how often the passwords occur?

This way one could build adequately weighted probable password masks for hashcat.

Some duplicates may appear due to newlines - a judgement call.

In some of the Release 2.0 files, a blankspace character appeared at the end of every line. In those cases, I removed the final blankspace character from all lines. However, some files were not consistent about beginning or ending with blankspace characters. In those instances, I left the blankspaces in place, since I had reason to believe they were part of the data.

This may cause the appearance of duplicates that differ only with the inclusion of a blankspace character.

I am labeling this as "won't fix" since it doesn't appear to be feasible to do so.
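The judgement call above could be sketched in shell: strip the single trailing blankspace only when every line in the file ends with one, and otherwise leave the data alone. The file names are hypothetical.

```shell
# Strip one trailing space per line, but only if every line has one,
# mirroring the judgement call described above; 'dirty.txt' is a
# hypothetical sample file.
printf 'password \n123456 \n' > dirty.txt
if ! grep -qv ' $' dirty.txt; then   # no line lacks a trailing space
  sed 's/ $//' dirty.txt > stripped.txt
else
  cp dirty.txt stripped.txt          # inconsistent - leave data as-is
fi
```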

Compress the files

Please compress the files. .tar.gz, .tar.xz and .zip versions of single files or entire folders (+ #4) would be great! Top35Million-probable.txt uncompressed is 369 MB; compressed with xz it's just 85 MB. One could check their contents with zcat or zgrep -a without first uncompressing them.
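A minimal sketch of the suggested workflow using gzip; the same idea works with xz via `xz -k` and `xzgrep`. The sample file is hypothetical.

```shell
# Compress a list (keeping the original) and search it without
# decompressing to disk; the sample file is hypothetical.
printf 'password\n123456\n' > Top35Million-sample.txt
gzip -c Top35Million-sample.txt > Top35Million-sample.txt.gz
zgrep -a '^password$' Top35Million-sample.txt.gz   # prints "password"
```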

Are passwords for the same mail address deduplicated?

Looking through recent leaks, I found mail:password combos that are contained particularly often.
This, however, does not indicate it would be more commonly used. It should still count as a single occurrence.
Is this taken into account?

1.1 repo size

Following the comments about the reduced repo size I just tried to clone it but it still appears to be absolutely massive:

Cloning into 'Probable-Wordlists'...
remote: Counting objects: 1649, done.
remote: Compressing objects: 100% (7/7), done.
receiving objects:  31% (522/1649), 1.83 GiB | 6.67 MiB/s

I am assuming this is because the old versions of the files are still in the commit history, could they be removed using BFG?

Full database size

I'm doing some analyses based on the appearances data now added, but two specific numbers would be helpful in characterizing the full dataset that these top X appearances are then extracted from.

(1) How many unique passwords (i.e., >=1 appearance) were present in the full database? I.e., the "nearly 13 billion" value, but I would appreciate the specific number.

(2) What is the total number of password appearances in the full database, i.e., the sum of the appearances column across all nearly 13 billion passwords.

Wordlists don't contain Non-ASCII Characters

Americans aren't the only ones with passwords - why not have special wordlists that include non-ASCII Characters?

I'm glad you asked.

As my knowledge level increases so does my ability to sort out lines. I have two methodologies that I will put to use for Rev 2.0

1. Grep out passwords containing characters from different alphabets

If there is an alphabet published in unicode on Wikipedia, I plan to grep for it

  • The Ukrainian alphabet is different from the Russian, which is different from the Belarusian, which is different from Common Cyrillic, which is different from the Serbian, which is different from...
  • This means we could have NATIONALLY targeted lists based on predominant languages
  • This isn't only true for Cyrillic-based alphabets. Dano-Norwegian is a different alphabet than Swedish, English... etc.
  • At the very least by language family
  • My sources still bias towards English, so the ASCII-only lists may simply dwarf the others, but they should still be available.

2. Make Sub-set lists based on source name.

  • I have many sources with "Rus", "ru", and "Russian" in the title. These lists are presumably from Russian sources - so perhaps they should be amalgamated themselves.
  • Some sources are obviously geared towards WPA, etc.
  • Caveat: Since my methodology is based on approximating accuracy using the number of files a given line appears in, these groups made of sub-set sources are likely to be precise, but inaccurate. An analogy would be me throwing darts. I might be landing them within a circle of less than 1", but the target is about 4ft over to the left.

In actuality, I'm awful at darts.

I welcome any suggestions - except on my darts game. I mean suggestions about the wordlists.

license of password data

"This is released without license, but also without intent for commercial use."

This means that no commercial distribution can ship this password list as part of the default password cracking dictionary. Can you relicense this work under a more acceptable license such as the APL?

Please update

Please update these awesome lists with all the new breaches and etc. Awesome list.

Passwords without spaces

Hi I'm new to all of this and I'm using Kali Linux.
I just downloaded these wordlists and opened some of them, and I saw that there aren't any spaces between passwords. How can I fix this without manually adding spaces?
Or is there no need for spaces when doing a dictionary attack?

I do this for research purposes only of course.

So is this all the passwords, or only those that showed up in the analysis twice?

Hello,

Is this all the individual passwords you found, or only those that showed up across the files at least twice?

If so, what about other passwords that were unique to only one list (only 1 person had that password), or words from books, Wikipedia, Gutenberg etc...

Perhaps I'm just misunderstanding but would like this clarified....

Thanks for your work on this project!

Why so many trackers?

There are a lot of trackers in the included torrents. I don't have a good way to count them all, but it looks like well over 100, with many being just random IP addresses, not even domains. Is there some reason for that? Could that number come down to something more reasonable (like, say, 3 or 4)?
