The duplicut tool finds and removes duplicate entries from a wordlist without changing the order, and without running out of memory on huge wordlists whose size exceeds available RAM.
make release
./duplicut <WORDLIST_WITH_DUPLICATES> -o <NEW_CLEAN_WORDLIST>
Building statistically optimized wordlists for password cracking often requires finding and removing duplicate entries without changing the order.
Unfortunately, existing duplicate-removal tools cannot handle very large wordlists without crashing due to insufficient memory.
Duplicut is written in C, and optimized to be as fast and memory-frugal as possible.
For example, the duplicut hashmap saves up to 50% space by packing line size information into the line pointer's unused extra bits.
If the whole file doesn't fit in memory, it is split into chunks, and each chunk is tested against the following chunks.
So the complexity equals the triangular number of the chunk count.
Usage: duplicut [OPTION]... [INFILE] -o [OUTFILE]
Remove duplicate lines from INFILE without sorting.
Options:
-o, --outfile <FILE>       Write result to <FILE>
-t, --threads <NUM>        Max threads to use (default max)
-m, --memlimit <VALUE>     Limit max used memory (default max)
-l, --line-max-size <NUM>  Max line size (default 14)
-p, --printable            Filter ASCII printable lines
-h, --help                 Display this help and exit
-v, --version              Output version information and exit
Example: duplicut wordlist.txt -o new-wordlist.txt
Features:
- Handle huge wordlists, even those whose size exceeds available RAM.
- Filtering by maximum line length (-l option).
- Filtering to ASCII printable characters (-p option).
- Press any key to get program status.
Implementation:
- Written in pure C code, designed to be fast.
- Compressed hashmap items on 64-bit platforms.
- [TODO]: Multi threaded application.
- [TODO]: Uses huge memory pages to increase performance.
Limitations:
- Any line longer than 255 chars is ignored.
- Heavily tested on Linux x64, mostly untested on other platforms.