deduper

Analyses two paths on the same file system to find identical files and hard links them to save space.

How it works

  • Indexing: both paths will be analyzed and their directory trees (files & directories), along with the corresponding inodes, will be mapped in memory
  • Then the tree of path A will be walked and, for each regular file, the in-memory map of path B will be searched for potential candidates
    • First, a list of all files in B having exactly the same size as the A file being analyzed will be compiled (empty files will be ignored)
    • Then this list will be pruned based on several criteria
      • Candidates in B that are already hardlinks of the reference A file will be removed from the list
      • Files that do not have the same inode metadata (ownership [uid, gid] and file mode) will be removed from the candidates list to avoid breaking existing access to these files (hard links share the same metadata by design)
        • Unless the -force flag is set: in that case candidates are kept (but their metadata will change once hard linking is done)
      • For candidates that are still on the list, a SHA256 checksum will be computed to ensure they do indeed have the same content as the reference file in A currently being processed
  • For candidates that have passed all the tests and are still on the candidates list:
    • if the -apply flag has been set
      • They will be removed (in order to free their path)
      • The reference file in A will be hard linked to the path the B candidate had, making that path available once again but now deduplicated with A
    • if the -apply flag has not been set
      • A report will be printed of what would have been done (and how much space would have been saved) with the flag set
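
The checks above can be sketched in Go for a single pair of files. This is a minimal illustration, not deduper's actual code: the names `sameMeta`, `fileSum`, `dedupe` and `runDemo` are hypothetical, the in-memory index and the worker pool are left out, and the metadata comparison assumes a Unix system (it relies on `syscall.Stat_t`).

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
	"io"
	"os"
	"path/filepath"
	"syscall"
)

// sameMeta compares ownership (uid, gid) and mode. Unix-only (syscall.Stat_t).
func sameMeta(a, b os.FileInfo) bool {
	sa, okA := a.Sys().(*syscall.Stat_t)
	sb, okB := b.Sys().(*syscall.Stat_t)
	return okA && okB && sa.Uid == sb.Uid && sa.Gid == sb.Gid && a.Mode() == b.Mode()
}

// fileSum streams a file through SHA-256 and returns the digest.
func fileSum(path string) ([]byte, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return nil, err
	}
	return h.Sum(nil), nil
}

// dedupe reports whether cand duplicates ref; with apply set it also relinks cand.
func dedupe(ref, cand string, force, apply bool) (bool, error) {
	ri, err := os.Lstat(ref)
	if err != nil {
		return false, err
	}
	ci, err := os.Lstat(cand)
	if err != nil {
		return false, err
	}
	// Cheap checks first: regular, non-empty, same size, not already
	// hard linked, and same owner/mode unless force is set.
	if !ri.Mode().IsRegular() || !ci.Mode().IsRegular() ||
		ri.Size() == 0 || ri.Size() != ci.Size() ||
		os.SameFile(ri, ci) || (!force && !sameMeta(ri, ci)) {
		return false, nil
	}
	rs, err := fileSum(ref)
	if err != nil {
		return false, err
	}
	cs, err := fileSum(cand)
	if err != nil {
		return false, err
	}
	same := bytes.Equal(rs, cs)
	if !same || !apply {
		return same, nil // different content, or dry run
	}
	if err := os.Remove(cand); err != nil { // free the candidate path
		return false, err
	}
	return true, os.Link(ref, cand) // re-create it as a hard link to ref
}

// runDemo builds three throwaway files and returns how many were relinked.
func runDemo() (int, error) {
	dir, err := os.MkdirTemp("", "dedup-sketch")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)
	files := map[string]string{"a.bin": "hello", "b.bin": "hello", "c.bin": "world"}
	for name, data := range files {
		if err := os.WriteFile(filepath.Join(dir, name), []byte(data), 0o644); err != nil {
			return 0, err
		}
	}
	linked := 0
	for _, name := range []string{"b.bin", "c.bin"} {
		dup, err := dedupe(filepath.Join(dir, "a.bin"), filepath.Join(dir, name), false, true)
		if err != nil {
			return linked, err
		}
		if dup {
			linked++
		}
	}
	return linked, nil
}

func main() {
	fmt.Println(runDemo())
}
```

In the demo only `b.bin` is relinked: `c.bin` has the same size as the reference but fails the checksum comparison. Note that removing the candidate and then linking is not atomic, so a failure between the two calls would leave the candidate path missing.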

Usage

Usage of ./deduper:
  -apply
        By default deduper run in dry run mode: set this flag to actually apply changes
  -debug
        Show debug logs during the analysis phase
  -dirA string
        Referential directory
  -dirB string
        Second directory to compare dirA against
  -force
        Dedup files that have the same content even if their inode metadata (ownership and mode) are not the same
  -minSize string
        Set the minimum size a file must have to be kept for analysis (ex: 100MiB)
  -workers int
        Set the maximum numbers of workers that will perform IO tasks (default 6)
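
As an illustration of what a `-minSize` value such as `100MiB` stands for, here is a small Go sketch that converts such a string into a byte count. It is an assumption about the format, not deduper's own parser: the function `parseSize` and the accepted suffix list are hypothetical, and the real tool may accept other units or spellings.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// units lists the suffixes this sketch accepts (an assumed set), with the
// bare "B" last so that "MiB" is matched before it.
var units = []struct {
	suffix string
	mult   int64
}{
	{"KiB", 1 << 10}, {"MiB", 1 << 20}, {"GiB", 1 << 30}, {"TiB", 1 << 40}, {"B", 1},
}

// parseSize converts strings such as "100MiB" into a byte count.
// Overflow for very large values is not checked in this sketch.
func parseSize(s string) (int64, error) {
	s = strings.TrimSpace(s)
	for _, u := range units {
		if !strings.HasSuffix(s, u.suffix) {
			continue
		}
		num := strings.TrimSpace(strings.TrimSuffix(s, u.suffix))
		n, err := strconv.ParseInt(num, 10, 64)
		if err != nil || n < 0 {
			return 0, fmt.Errorf("invalid size %q", s)
		}
		return n * u.mult, nil
	}
	return 0, fmt.Errorf("invalid size %q: unknown unit", s)
}

func main() {
	n, err := parseSize("100MiB")
	fmt.Println(n, err) // 104857600 <nil>
}
```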

Example

./deduper -minSize 10MiB -workers 8 -dirA "$(pwd)/example/dirA" -dirB "$(pwd)/example/dirB" -apply
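
After a run with -apply, one way to confirm that a file in dirB is now a hard link of its counterpart in dirA is to compare the two paths with `os.SameFile`. The helper below is a sketch that is not part of deduper; `hardLinked` and the `checklink` program name are hypothetical, and the two paths are passed as arguments.

```go
package main

import (
	"fmt"
	"os"
)

// hardLinked reports whether two paths point at the same underlying file.
// Lstat is used so that a symlink is not mistaken for a hard link.
func hardLinked(a, b string) (bool, error) {
	ai, err := os.Lstat(a)
	if err != nil {
		return false, err
	}
	bi, err := os.Lstat(b)
	if err != nil {
		return false, err
	}
	return os.SameFile(ai, bi), nil
}

func main() {
	if len(os.Args) != 3 {
		fmt.Fprintln(os.Stderr, "usage: checklink <pathA> <pathB>")
		os.Exit(2)
	}
	same, err := hardLinked(os.Args[1], os.Args[2])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("hard linked:", same)
}
```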
