Duplicut

The duplicut tool finds and removes duplicate entries from a wordlist, without changing the order, and without getting OOM on huge wordlists whose size exceeds available memory.

Quick start:

make release
./duplicut <WORDLIST_WITH_DUPLICATES> -o <NEW_CLEAN_WORDLIST>

Overview

Building statistically optimized wordlists for password cracking often requires finding and removing duplicate entries without changing the order.

Unfortunately, existing duplicate removal tools cannot handle very large wordlists without crashing due to insufficient memory.

Duplicut is written in C, and optimized to be as fast and memory frugal as possible.

For example, duplicut's hashmap saves up to 50% of space by packing the size information within the extra bits of each line pointer.
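The pointer-packing idea can be illustrated with a minimal sketch. The layout below is an assumption for illustration and not necessarily duplicut's actual one: it assumes a 64-bit platform where user-space addresses fit in the low 48 bits, and stores the line length (0 to 255) in the 8 bits above them. The names `packed_line_t`, `pack_line`, `line_ptr` and `line_size` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout (an assumption, not duplicut's real one):
 * bits 0..47  hold the address of the line,
 * bits 48..55 hold the line length (0..255). */
#define ADDR_BITS 48
#define ADDR_MASK ((UINT64_C(1) << ADDR_BITS) - 1)
#define SIZE_MASK UINT64_C(0xFF)

typedef uint64_t packed_line_t;

/* Pack a line pointer and its length into a single 64-bit word. */
static packed_line_t pack_line(const char *ptr, size_t len)
{
    uint64_t addr = (uint64_t)(uintptr_t)ptr;

    assert(len <= SIZE_MASK);          /* length must fit in 8 bits */
    assert((addr & ~ADDR_MASK) == 0);  /* address must fit in 48 bits */
    return addr | ((uint64_t)len << ADDR_BITS);
}

/* Recover the original pointer by masking off the length bits. */
static const char *line_ptr(packed_line_t p)
{
    return (const char *)(uintptr_t)(p & ADDR_MASK);
}

/* Recover the line length from the upper bits. */
static size_t line_size(packed_line_t p)
{
    return (size_t)((p >> ADDR_BITS) & SIZE_MASK);
}
```

Under these assumptions each hashmap slot takes 8 bytes, where a separate pointer and `size_t` pair would typically take 16 bytes on a 64-bit platform.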

If the whole file doesn't fit in memory, it is split into chunks, and each chunk is tested against the following chunks.

So the complexity is equal to the Nth triangle number, N(N+1)/2, where N is the number of chunks.
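To see where the triangle number comes from, here is a small sketch that only counts passes. It assumes each chunk is first deduplicated against itself and then tested against every following chunk; the function names are hypothetical, and no real file handling is shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count the passes needed for nchunks chunks, assuming chunk i is
 * deduplicated against itself, then tested against each chunk j > i. */
static size_t count_chunk_passes(size_t nchunks)
{
    size_t passes = 0;

    for (size_t i = 0; i < nchunks; i++) {
        passes++;                           /* chunk i against itself */
        for (size_t j = i + 1; j < nchunks; j++)
            passes++;                       /* chunk i against chunk j */
    }
    return passes;
}

/* Closed form: the n-th triangle number, n * (n + 1) / 2. */
static size_t triangle_number(size_t n)
{
    assert(n == 0 || n <= SIZE_MAX / (n + 1));  /* guard against overflow */
    return n * (n + 1) / 2;
}
```

With 4 chunks, for instance, the loop performs 4 + 3 + 2 + 1 = 10 passes, which matches `triangle_number(4)`.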


Usage: duplicut [OPTION]... [INFILE] -o [OUTFILE]
Remove duplicate lines from INFILE without sorting.

Options:
-o, --outfile <FILE>       Write result to <FILE>
-t, --threads <NUM>        Max threads to use (default max)
-m, --memlimit <VALUE>     Limit max used memory (default max)
-l, --line-max-size <NUM>  Max line size (default 14)
-p, --printable            Filter ascii printable lines
-h, --help                 Display this help and exit
-v, --version              Output version information and exit

Example: duplicut wordlist.txt -o new-wordlist.txt
  • Features:

    • Handle huge wordlists, even those whose size exceeds available RAM.
    • Filtering based on maximum line length (-l option).
    • Filtering based on ASCII printable characters (-p option).
    • Press any key to get program status.
  • Implementation:

    • Written in pure C code, designed to be fast.
    • Compressed hash map items on 64-bit platforms.
    • [TODO]: Multithreaded application.
    • [TODO]: Uses huge memory pages to increase performance.
  • Limitations:

    • Any line longer than 255 chars is ignored.
    • Heavily tested on Linux x64, mostly untested on other platforms.
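The two filtering features listed above (-l and -p) can be sketched as a single predicate. This is an illustrative assumption about the behavior, not duplicut's actual code: a line is dropped when it is longer than the configured maximum, or, in printable mode, when it contains any byte outside the ASCII printable range 0x20 to 0x7E. The helper name `keep_line` and its parameters are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: return true when a line should be kept.
 * line is not required to be NUL-terminated; len is its length in bytes. */
static bool keep_line(const char *line, size_t len,
                      size_t max_len, bool printable_only)
{
    assert(line != NULL || len == 0);

    if (len > max_len)
        return false;                   /* longer than the -l limit */

    if (printable_only) {
        for (size_t i = 0; i < len; i++) {
            unsigned char c = (unsigned char)line[i];
            if (c < 0x20 || c > 0x7E)
                return false;           /* not an ASCII printable byte */
        }
    }
    return true;
}
```

With a maximum of 14, `keep_line("hunter2", 7, 14, true)` keeps the line, while a 25-character line or a line containing a tab (in printable mode) is rejected.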

Contributors

nil0x42
