go-find-duplicates's Introduction

Go Find Duplicates

Introduction

A blazingly fast, simple-to-use tool to find duplicate files (photos, videos, music, documents, etc.) on your computer, portable hard drives, etc.

Note:

  • This tool just reads your files and creates a 'duplicates report' file
  • It does not delete or otherwise modify your files in any way 🙂
  • So, it's very safe to use 👍

How to install?

  1. Install Go (version 1.19 or later)
  2. Run command:
    go install github.com/m-manu/go-find-duplicates@latest
  3. Add the following line to your .bashrc/.zshrc file:
    export PATH="$PATH:$HOME/go/bin"
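
After installation (and with $HOME/go/bin on your PATH), you can check that the binary is reachable by printing its version, using the --version flag documented in the help output below:

go-find-duplicates --version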

How to use?

go-find-duplicates {dir-1} {dir-2} ... {dir-n}
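
For example, to scan a photos folder and an external hard drive (the paths here are only illustrative):

go-find-duplicates ~/Pictures /Volumes/PortableHD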

Command line options

Running go-find-duplicates --help displays the following:

go-find-duplicates is a tool to find duplicate files and directories

Usage:
  go-find-duplicates [flags] <dir-1> <dir-2> ... <dir-n>

where,
  arguments are readable directories that need to be scanned for duplicates

Flags (all optional):
  -x, --exclusions string   path to file containing newline-separated list of file/directory names to be excluded
                            (if this is not set, by default these will be ignored:
                            .DS_Store, System Volume Information, $RECYCLE.BIN etc.)
  -h, --help                display help
  -m, --minsize uint        minimum size of file in KiB to consider (default 4)
  -o, --output string       following modes are accepted:
                             text = creates a text file in current directory with basic information
                              csv = creates a csv file in current directory with detailed information
                            print = just prints the report without creating any file
                             json = creates a JSON file in the current directory with basic information
                             (default "text")
  -p, --parallelism uint8   extent of parallelism (defaults to number of cores minus 1)
  -t, --thorough            apply thorough check of uniqueness of files
                            (caution: this makes the scan very slow!)
      --version             Display version (1.6.0) and exit (useful for incorporating this in scripts)

For more details: https://github.com/m-manu/go-find-duplicates
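
For illustration, a scan that skips files smaller than 1 MiB (1024 KiB) and writes a detailed CSV report could be invoked like this (the directory path is illustrative):

go-find-duplicates --minsize 1024 --output csv /Volumes/PortableHD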

Running this through a Docker container

docker run --rm -v /Volumes/PortableHD:/mnt/PortableHD manumk/go-find-duplicates:latest go-find-duplicates -o print /mnt/PortableHD

In the above command:

  • option --rm removes the container when it exits
  • option -v mounts the host directory /Volumes/PortableHD as /mnt/PortableHD inside the container

How does this identify duplicates?

By default, this tool identifies files as duplicates if all of the following conditions match:

  1. the file extensions are the same
  2. the file sizes are the same
  3. the CRC32 hashes of the files' "crucial bytes" are the same

If the above default isn't enough for your requirements, you can use the command-line option --thorough to switch to a SHA-256 hash of the entire file contents. But remember: with this, the scan becomes much slower!

When tested on my portable hard drive containing >172k files (videos, audio files, images and documents), the results with and without the --thorough option were the same!
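
To make the "crucial bytes" approach concrete, here is a minimal Go sketch of the idea. It is not the tool's actual implementation: the chunk size and chunk positions below are assumptions (an issue further down suggests roughly 8 KiB sampled from the beginning, middle and end of larger files), but it shows how a file can be summarized by its size plus a CRC32 of a few sampled chunks.

// Sketch only: hashes a few small chunks of a file with CRC32 instead of the
// whole file. Chunk size and positions are assumptions, not the tool's exact
// parameters.
package main

import (
    "fmt"
    "hash/crc32"
    "io"
    "os"
)

const chunkSize = 4 * 1024 // assumed chunk size

// crucialBytesDigest returns the file's size and a CRC32 over small chunks
// taken from the beginning, middle and end of the file (or over the whole
// file, if it is small).
func crucialBytesDigest(path string) (uint32, int64, error) {
    f, err := os.Open(path)
    if err != nil {
        return 0, 0, err
    }
    defer f.Close()

    info, err := f.Stat()
    if err != nil {
        return 0, 0, err
    }
    size := info.Size()

    h := crc32.NewIEEE()
    if size <= 3*chunkSize {
        // Small file: hash everything.
        if _, err := io.Copy(h, f); err != nil {
            return 0, 0, err
        }
        return h.Sum32(), size, nil
    }
    // Larger file: hash chunks from the start, middle and end.
    buf := make([]byte, chunkSize)
    for _, offset := range []int64{0, size/2 - chunkSize/2, size - chunkSize} {
        if _, err := f.ReadAt(buf, offset); err != nil && err != io.EOF {
            return 0, 0, err
        }
        h.Write(buf)
    }
    return h.Sum32(), size, nil
}

func main() {
    if len(os.Args) < 2 {
        fmt.Fprintln(os.Stderr, "usage: crucial-digest <file>")
        os.Exit(2)
    }
    digest, size, err := crucialBytesDigest(os.Args[1])
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    fmt.Printf("size=%d crc32=%08x\n", size, digest)
}

Two files whose extension, size and digest all match would then be reported as duplicates, which is why the --thorough mode exists for cases where this heuristic is not strict enough.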

go-find-duplicates's People

Contributors

adrian, alingse, m-manu


go-find-duplicates's Issues

Removing duplicate files

This tool is definitely better and faster than rdfind, FSlint, etc., but I can't find a way to actually remove the duplicate files. I see the list, but it's too hard to remove them by hand.

What would be the way to do this automatically?

Thanks

Writing dynamic fields

Hi there!

I have an incoming FlowFile that looks like this:
{ "TagEpoch": "1630346400", "InfluxMeasurement": "my_measurement", "field": "my_field", "my_field": 178.39694 }

It might be quite obvious from this, but I've got a measurement called my_measurement that has multiple fields on it. I'm trying to write each field dynamically, so in this case my_field could be replaced by any field name.

Now, I want to set up my PutInfluxDatabaseRecord so that it reads from this FlowFile which field it should write, and I currently have the following setup:

[screenshot of the processor setup]

For my incoming RecordReader, I have an Avro schema indicating it should be looking for "TagEpoch", "InfluxMeasurement", "field" and every possible field name that I'm expecting, but this does not work!

I get the following error:
[screenshot of the error]

Can you help me to figure out how to dynamically write field names here?

Sorting output results by file size

I love this tool! The thing I'm trying to figure out is how I could write a script to sort the resulting output file to put the results in order from largest to smallest files - so I know which ones are the biggest problems. Have you considered adding that to your code?

GNU command line option conventions

First of all, I love the tool, thank you for creating it. Very useful and helped me discover a whole slew of duplicates in my library.

However, my only small gripe is that the tool does not use the typical POSIX convention when it comes to command line options (a single - for short options, -- for long options). This makes it stand out from most other UNIX tools.

Please document how files are compared for uniqueness

The readme says nothing about how files are checked for uniqueness. I've looked through the source code and could identify neither the use of a cryptographic hash nor a byte-for-byte comparison of actual contents (although I might have missed something).

The only thing I found was the usage of CRC32 and the following comment in file_hash.go:

// GetDigest generates entity.FileDigest of the file provided, in an extremely fast manner
// without compromising the quality of file's uniqueness.
//
// When this function was called on approximately 172k files (mix of photos, videos, audio files, PDFs etc.), the
// uniqueness identified by this matched uniqueness identified by SHA-256 for *all* files

To me it seems that for any file the uniqueness is determined by a CRC32 based on 8 KiB of the file contents (for larger files taken from the beginning, middle and end).


If this is the case, I personally find this very concerning... It might work for high-entropy data formats (like the audio, video and other file formats you've tested against, which employ some form of compression), but imagine using it for text files, say copies of the same source code in multiple folders; then I think it is trivial to find "duplicates" which aren't actually duplicates.

I would dare to say that the statement in the readme, "blazingly-fast simple-to-use tool to find duplicate files", borders on false advertising... Yes, it is blazingly fast because you don't actually read the whole file, nor do you compute any sort of cryptographic hash, and as a consequence you don't actually test for uniqueness...


However, to be constructive, I do think that using CRC32 is a good start for finding duplicate candidates (since two files with different CRC32s are certainly different), so after you cluster files with the same CRC32 (and, I might add, the same size) you could then compute a proper cryptographic hash (I recommend testing a few, like SHA-1, SHA-512/256 and BLAKE3, to see which is faster on your architecture; I went with BLAKE3 in my own tests).
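
As a minimal sketch of the two-stage scheme suggested above (not the tool's actual code, and using standard-library SHA-256 rather than the hashes named above): files that already share the same size and fast digest are confirmed as duplicates only if a full cryptographic hash of their contents also matches.

// Sketch of the proposed second pass: given one candidate group (paths that
// share the same size and fast CRC32 digest), split it into sub-groups whose
// full contents hash identically under SHA-256.
package main

import (
    "crypto/sha256"
    "encoding/hex"
    "fmt"
    "io"
    "os"
)

func confirmDuplicates(candidatePaths []string) (map[string][]string, error) {
    groups := make(map[string][]string)
    for _, path := range candidatePaths {
        f, err := os.Open(path)
        if err != nil {
            return nil, err
        }
        h := sha256.New()
        _, err = io.Copy(h, f)
        f.Close()
        if err != nil {
            return nil, err
        }
        key := hex.EncodeToString(h.Sum(nil))
        groups[key] = append(groups[key], path)
    }
    return groups, nil
}

func main() {
    // The arguments stand in for one candidate group produced by the fast
    // (size + CRC32) pass.
    groups, err := confirmDuplicates(os.Args[1:])
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    for digest, paths := range groups {
        if len(paths) > 1 {
            fmt.Printf("confirmed duplicates (sha256 %s...): %v\n", digest[:12], paths)
        }
    }
}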

Feature request: suppress the running log and print only the duplicate result to stdout

I want to build some tooling to automatically remove duplicates.

Here is what I do now:

go-find-duplicates -o json -m 0 ./
cat duplicates_230506_2340187.json|jq '.[]|.paths[1:]|.[]'|xargs -I {} rm {}

I would like a -q (or similar) flag to quiet the running log and just print the duplicate result to stdout.

There are two pros:

  1. we can use a pipe to combine commands
  2. there is no extra output file if I run this in the current dir ./
go-find-duplicates -o json -m 0 ./ -q|jq '.[]|.paths[1:]|.[]'|xargs -I {} rm {}

go get doesn't work

When following the directions:

$ go get github.com/m-manu/go-find-duplicates
unrecognized import path "embed": import path does not begin with hostname
unrecognized import path "io/fs": import path does not begin with hostname
