Deduplicator

Find, Sort, Filter & Delete duplicate files

Usage

Usage: deduplicator [OPTIONS] [scan_dir_path]

Arguments:
  [scan_dir_path]  Run Deduplicator on dir different from pwd (e.g., ~/Pictures )

Options:
  -t, --types <TYPES>          Filetypes to deduplicate [default = all]
  -i, --interactive            Delete files interactively
  -s, --min-size <MIN_SIZE>    Minimum filesize of duplicates to scan (e.g., 100B/1K/2M/3G/4T) [default: 1b]
  -d, --max-depth <MAX_DEPTH>  Max Depth to scan while looking for duplicates
      --min-depth <MIN_DEPTH>  Min Depth to scan while looking for duplicates
  -f, --follow-links           Follow links while scanning directories
  -h, --help                   Print help information
  -V, --version                Print version information
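
The size strings accepted by --min-size (100B/1K/2M/3G/4T) could be interpreted along the following lines. This is only a sketch, not Deduplicator's own parser: the name `parse_size` is hypothetical, and the use of binary multiples (1K = 1024 bytes) is an assumption, since the help text does not say whether decimal or binary units are meant. An optional trailing "b" is accepted so that spellings such as 100mb also parse.

```python
import re

# Assumption: binary multiples. The help text does not state whether
# 1K means 1000 or 1024 bytes.
UNIT_FACTORS = {"B": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}

# A whole number, a unit letter, and an optional trailing "b" (e.g. "100mb").
SIZE_PATTERN = re.compile(r"^\s*(\d+)\s*([BKMGT])B?\s*$", re.IGNORECASE)


def parse_size(text):
    """Convert a size string such as '100B', '1K' or '100mb' to bytes."""
    match = SIZE_PATTERN.match(text)
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    number, unit = match.groups()
    return int(number) * UNIT_FACTORS[unit.upper()]
```

A file would then be considered only if its size in bytes is at least `parse_size(value)`.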

Examples

# Scan for duplicates recursively from the current dir, only look for png, jpg & pdf file types & interactively delete files
deduplicator -t pdf,jpg,png -i

# Scan for duplicates recursively from the ~/Pictures dir, only look for png, jpeg, jpg & pdf file types & interactively delete files
deduplicator ~/Pictures/ -t png,jpeg,jpg,pdf -i

# Scan for duplicates in ~/Pictures without recursing into subdirectories
deduplicator ~/Pictures --max-depth 0

# Look for duplicates in the ~/.config directory while also following symbolic links
deduplicator ~/.config --follow-links

# Scan for duplicates that are greater than 100mb in the ~/Media directory
deduplicator ~/Media --min-size 100mb

Installation

Cargo Install

Stable

$ cargo install deduplicator

Nightly

If you'd like to install with nightly features, you can use:

$ cargo install --git https://github.com/sreedevk/deduplicator

Please note that if you installed Rust through a version manager (like asdf), you need to reshim afterwards (asdf reshim rust).

Linux (Pre-built Binary)

You can download the pre-built binary from the Releases page. Download deduplicator-x86_64-unknown-linux-gnu.tar.gz for Linux. Once you have the tarball with the executable, follow these steps to install:

$ tar -zxvf deduplicator-x86_64-unknown-linux-gnu.tar.gz
$ sudo mv deduplicator /usr/bin/

macOS (Pre-built Binary)

You can download the pre-built binary from the Releases page. Download the deduplicator-x86_64-apple-darwin.tar.gz tarball for macOS. Once you have the tarball with the executable, follow these steps to install:

$ tar -zxvf deduplicator-x86_64-apple-darwin.tar.gz
$ sudo mv deduplicator /usr/local/bin/

Windows (Pre-built Binary)

You can download the pre-built binary from the Releases page. Download the deduplicator-x86_64-pc-windows-msvc.zip file for Windows. Unzip it and move deduplicator.exe to a location listed in the PATH system environment variable.

Note: If you run into an MSVC error, please install MSVC from here

Performance

Deduplicator uses size comparison and fxhash (a non-cryptographic hashing algorithm) to quickly scan through a large number of files to find duplicates. It is also highly parallel (it uses rayon and dashmap). I was able to scan through 120GB of files (Videos, PDFs, Images) in ~300ms. Check out the benchmarks.
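
The size-then-hash strategy described above can be sketched as follows. This is a minimal Python illustration, not Deduplicator's actual Rust code: the function names `file_digest`, `group_by_size` and `find_duplicates` are hypothetical, `hashlib.blake2b` stands in for fxhash, and the sketch runs sequentially rather than in parallel with rayon and dashmap.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Hash a file's contents in chunks (blake2b stands in for fxhash here)."""
    hasher = hashlib.blake2b(digest_size=16)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def group_by_size(root):
    """Walk `root` and bucket regular, non-symlink files by size in bytes."""
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)
    return by_size


def find_duplicates(root):
    """Return sorted lists of paths whose size and content hash both match."""
    duplicates = []
    for paths in group_by_size(root).values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate, so skip hashing
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[file_digest(path)].append(path)
        duplicates.extend(
            sorted(group) for group in by_hash.values() if len(group) > 1
        )
    return duplicates
```

In this sketch, files with a unique size are never opened or read, so hashing is limited to candidates that already share a size.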

Benchmarks

Command                                               Dirsize   Mean [ms]       Min [ms]   Max [ms]   Relative
deduplicator --dir ~/Data/tmp                         (~120G)   27.5 ± 1.0      26.0       32.1       1.70 ± 0.09
deduplicator --dir ~/Data/books                       (~8.6G)   21.8 ± 0.7      20.5       24.4       1.35 ± 0.07
deduplicator --dir ~/Data/books --min-size 10M        (~8.6G)   16.1 ± 0.6      14.9       18.8       1.00
deduplicator --dir ~/Data/ --types pdf,jpg,png,jpeg   (~290G)   1857.4 ± 24.5   1817.0     1895.5     115.07 ± 4.64
  • The last entry is slower because of the number of files deduplicator had to go through (~660895 files). The average size of the files rarely affects the performance of deduplicator.
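
For readers unfamiliar with the Relative column, it expresses each mean run time as a multiple of the fastest mean. The short sketch below recomputes it from the rounded means in the table; the dictionary keys are just labels chosen here, and because the reported figures were presumably computed from unrounded timings, the recomputed ratios can differ slightly in the last digits.

```python
# Mean run times in milliseconds, copied (already rounded) from the benchmark table.
mean_ms = {
    "~/Data/tmp": 27.5,
    "~/Data/books": 21.8,
    "~/Data/books --min-size 10M": 16.1,
    "~/Data/ --types pdf,jpg,png,jpeg": 1857.4,
}

# The fastest run is the reference and gets a relative value of 1.00.
fastest = min(mean_ms.values())
relative = {command: mean / fastest for command, mean in mean_ms.items()}

for command, ratio in relative.items():
    print(f"{command:<36} {ratio:8.2f}")
```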

These benchmarks were run using hyperfine. Here are the specs of the machine used to benchmark deduplicator:

OS: Arch Linux x86_64 
Host: Precision 5540
Kernel: 5.15.89-1-lts 
Uptime: 4 hours, 44 mins 
Shell: zsh 5.9                        
Terminal: kitty 
CPU: Intel i9-9880H (16) @ 4.800GHz 
GPU: NVIDIA Quadro T2000 Mobile / Max-Q 
GPU: Intel CoffeeLake-H GT2 [UHD Graphics 630] 
Memory: 31731MiB (~32GiB)
