The duplicut tool finds and removes duplicate entries from a wordlist without changing the order, and without running out of memory on huge wordlists whose size exceeds available RAM.
make release
./duplicut <WORDLIST_WITH_DUPLICATES> -o <NEW_CLEAN_WORDLIST>
Building statistically optimized wordlists for password cracking often requires finding and removing duplicate entries without changing the order.
Unfortunately, existing duplicate-removal tools cannot handle very large wordlists without crashing due to insufficient memory.
Duplicut is written in C, and optimized to be as fast and memory-frugal as possible.
For example, the duplicut hashmap saves up to 50% space by packing line size information into the line pointer's unused extra bits.
If the whole file doesn't fit in memory, it is split into chunks, and each chunk is tested against the following chunks.
So the complexity equals the triangular number of the chunk count.
Usage: duplicut [OPTION]... [INFILE] -o [OUTFILE]
Remove duplicate lines from INFILE without sorting.
Options:
-o, --outfile <FILE>       Write result to <FILE>
-t, --threads <NUM>        Max threads to use (default max)
-m, --memlimit <VALUE>     Limit max used memory (default max)
-l, --line-max-size <NUM>  Max line size (default 14)
-p, --printable            Filter ASCII printable lines
-h, --help                 Display this help and exit
-v, --version              Output version information and exit
Example: duplicut wordlist.txt -o new-wordlist.txt
Features:
- Handle huge wordlists, even those whose size exceeds available RAM.
- Filtering by maximum line length (-l option).
- Filtering to ASCII printable characters (-p option).
- Press any key to get program status.
Implementation:
- Written in pure C code, designed to be fast.
- Compressed hashmap items on 64-bit platforms.
- [TODO]: Multi threaded application.
- [TODO]: Uses huge memory pages to increase performance.
Limitations:
- Any line longer than 255 chars is ignored.
- Heavily tested on Linux x64, mostly untested on other platforms.