
fclones's Introduction

fclones

Efficient duplicate file finder and remover


This is the repo for command line fclones and its core libraries. For the desktop frontend, see fclones-gui.


fclones is a command line utility that identifies groups of identical files and gets rid of the file copies you no longer need. It comes with plenty of configuration options for controlling the search scope and offers many ways of removing duplicates. For maximum flexibility, it integrates well with other Unix utilities like find and it speaks JSON, so you have a lot of control over the search and cleanup process.

fclones treats your data seriously. You can inspect and modify the list of duplicate files before removing them. There is also a --dry-run option that can tell you exactly what changes on the file system would be made.

fclones has been implemented in Rust with a strong focus on high performance on modern hardware. It employs several optimization techniques not present in many other programs. It adapts to the type of the hard drive, orders file operations by physical data placement on HDDs, scans the directory tree in parallel and uses prefix compression of paths to reduce memory consumption when working with millions of files. It is also page-cache-friendly and does not push your data out of the cache. As a result, fclones easily outperforms many other popular duplicate finders by a wide margin on either SSD or HDD storage.

fclones is available on a wide variety of operating systems, but it works best on Linux.

Features

  • Identifying groups of identical files
    • finding duplicate files
    • finding files with more than N replicas
    • finding unique files
    • finding files with fewer than N replicas
  • Advanced file selection for reducing the amount of data to process
    • scanning multiple directory roots
    • can work with a list of files piped directly from standard input
    • recursive/non-recursive file selection
    • recursion depth limit
    • filtering names and paths by extended UNIX globs
    • filtering names and paths by regular expressions
    • filtering by min/max file size
    • proper handling of symlinks and hardlinks
  • Removing redundant data
    • removing, moving or replacing files with soft or hard links
    • removing redundant file data using native copy-on-write (reflink) support on some file systems
    • selecting files for removal by path or name patterns
    • prioritizing files to remove by creation, modification, last access time or nesting level
  • High performance
    • parallel processing capability in all I/O and CPU heavy stages
    • automatic tuning of parallelism and access strategy based on device type (SSD vs HDD)
    • low memory footprint thanks to heavily optimized path representation
    • variety of fast non-cryptographic and cryptographic hash functions up to 512 bits wide
    • doesn't push data out of the page-cache (Linux-only)
    • optional persistent caching of file hashes
    • accurate progress reporting
  • Variety of output formats for easy further processing of results
    • standard text format
      • groups separated by group headers with file size and hash
      • one path per line in a group
    • optional fdupes compatibility (no headers, no indent, groups separated by blank lines)
    • machine-readable formats: CSV, JSON

Limitations

Copy-on-write file data deduplication (reflink) is not supported on Windows.

Some optimizations are not available on platforms other than Linux:

  • ordering of file accesses by physical placement
  • page-cache drop-behind

Demo

Let's first create some files:

$ mkdir test
$ cd test
$ echo foo >foo1.txt
$ echo foo >foo2.txt
$ echo foo >foo3.txt
$ echo bar >bar1.txt
$ echo bar >bar2.txt

Now let's identify the duplicates:

$ fclones group . >dupes.txt
[2021-06-05 18:21:33.358] fclones:  info: Started grouping
[2021-06-05 18:21:33.738] fclones:  info: Scanned 7 file entries
[2021-06-05 18:21:33.738] fclones:  info: Found 5 (20 B) files matching selection criteria
[2021-06-05 18:21:33.738] fclones:  info: Found 4 (16 B) candidates after grouping by size
[2021-06-05 18:21:33.738] fclones:  info: Found 4 (16 B) candidates after grouping by paths and file identifiers
[2021-06-05 18:21:33.739] fclones:  info: Found 3 (12 B) candidates after grouping by prefix
[2021-06-05 18:21:33.740] fclones:  info: Found 3 (12 B) candidates after grouping by suffix
[2021-06-05 18:21:33.741] fclones:  info: Found 3 (12 B) redundant files

$ cat dupes.txt
# Report by fclones 0.12.0
# Timestamp: 2021-06-05 18:21:33.741 +0200
# Command: fclones group .
# Found 2 file groups
# 12 B (12 B) in 3 redundant files can be removed
7d6ebf613bf94dfd976d169ff6ae02c3, 4 B (4 B) * 2:
    /tmp/test/bar1.txt
    /tmp/test/bar2.txt
6109f093b3fd5eb1060989c990d1226f, 4 B (4 B) * 3:
    /tmp/test/foo1.txt
    /tmp/test/foo2.txt
    /tmp/test/foo3.txt

Finally, we can replace the duplicates with soft links:

$ fclones link --soft <dupes.txt 
[2021-06-05 18:25:42.488] fclones:  info: Started deduplicating
[2021-06-05 18:25:42.493] fclones:  info: Processed 3 files and reclaimed 12 B space

$ ls -l
total 12
-rw-rw-r-- 1 pkolaczk pkolaczk   4 cze  5 18:19 bar1.txt
lrwxrwxrwx 1 pkolaczk pkolaczk  18 cze  5 18:25 bar2.txt -> /tmp/test/bar1.txt
-rw-rw-r-- 1 pkolaczk pkolaczk 382 cze  5 18:21 dupes.txt
-rw-rw-r-- 1 pkolaczk pkolaczk   4 cze  5 18:19 foo1.txt
lrwxrwxrwx 1 pkolaczk pkolaczk  18 cze  5 18:25 foo2.txt -> /tmp/test/foo1.txt
lrwxrwxrwx 1 pkolaczk pkolaczk  18 cze  5 18:25 foo3.txt -> /tmp/test/foo1.txt

Installation

The code has been thoroughly tested on Ubuntu Linux 21.10. Other systems like Windows or Mac OS X and other architectures may work. Help with testing and/or porting to other platforms is welcome. Please report successes as well as failures.

Official Packages

Snap store (Linux):

snap install fclones

Homebrew (macOS and Linux)

brew install fclones

Installation packages and binaries for some platforms are also attached directly to Releases.

Third-party Packages

Building from Source

Install Rust Toolchain and then run:

cargo install fclones

The build will write the binary to $HOME/.cargo/bin/fclones.
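
If $HOME/.cargo/bin is not already on your PATH, a typical way to add it for the current shell session is (a generic shell snippet, not specific to fclones):

export PATH="$HOME/.cargo/bin:$PATH"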

Usage

fclones offers separate commands for finding and removing files. This way, you can inspect the list of found files before applying any modifications to the file system.

  • group – identifies groups of identical files and prints them to the standard output
  • remove – removes redundant files earlier identified by group
  • link – replaces redundant files with links (default: hard links)
  • dedupe – does not remove any files, but deduplicates file data by using native copy-on-write capabilities of the file system (reflink)
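
For example, copy-on-write deduplication can be applied directly to a grouping report in the same way as the other commands (a sketch, assuming dedupe reads the report from standard input just like remove and link; it requires a file system with reflink support, such as Btrfs or XFS):

fclones group . | fclones dedupe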

Finding Files

Find duplicate, unique, under-replicated or over-replicated files in the current directory, including subdirectories:

fclones group .
fclones group . --unique 
fclones group . --rf-under 3
fclones group . --rf-over 3

You can search in multiple directories:

fclones group dir1 dir2 dir3

By default, hidden files and files matching patterns listed in .gitignore and .fdignore are ignored. To search all files, use:

fclones group --no-ignore --hidden dir

Limit the recursion depth:

fclones group . --depth 1   # scan only files in the current dir, skip subdirs
fclones group * --depth 0   # similar to the above, in shells that expand `*`

Caution: Versions up to 0.10 did not descend into directories by default. In those old versions, add the -R flag to enable recursive directory walking.

Finding files that match across two directory trees, without matching identical files within each tree:

fclones group --isolate dir1 dir2

Finding duplicate files of size at least 100 MB:

fclones group . -s 100M

Filter by file name or path pattern:

fclones group . --name '*.jpg' '*.png' 

Run fclones on files selected by find (note: this is likely slower than built-in filtering):

find . -name '*.c' | fclones group --stdin --depth 0

Follow symbolic links, but don't escape out of the home folder:

fclones group . -L --path '/home/**'

Exclude a part of the directory tree from the scan:

fclones group / --exclude '/dev/**' '/proc/**'

Removing Files

To remove duplicate files, move them to a different place, or replace them with links, you need to send the report produced by fclones group to the standard input of the fclones remove, fclones move or fclones link command. The report format is detected automatically. Currently, the default and json report formats are supported.
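
For example, a JSON report can be produced and fed back like this (a sketch; it assumes the --format option, shown elsewhere on this page, accepts the json value):

fclones group . --format json >dupes.json
fclones remove <dupes.json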

Assuming the list of duplicates has been saved in file dupes.txt, the following commands would remove the redundant files:

fclones link <dupes.txt             # replace with hard links
fclones link -s <dupes.txt          # replace with symbolic links
fclones move target_dir <dupes.txt  # move to target_dir  
fclones remove <dupes.txt           # remove totally

If you prefer to do everything at once without storing the list of groups in a file, you can pipe:

fclones group . | fclones link

To select the number of files to preserve, use the -n/--rf-over option. By default, it is set to the value used when running group (which is 1 if it wasn't set explicitly). To leave 2 replicas in each group, run:

fclones remove -n 2 <dupes.txt

By default, fclones follows the order of files specified in the input file. It keeps the files given at the beginning of each list, and removes/replaces the files given at the end of each list. It is possible to change that order with the --priority option, for example:

fclones remove --priority newest <dupes.txt        # remove the newest replicas
fclones remove --priority oldest <dupes.txt        # remove the oldest replicas

For more priority options, see fclones remove --help.

It is also possible to restrict removal to files with names or paths matching a pattern:

fclones remove --name '*.jpg' <dupes.txt       # remove only jpg files
fclones remove --path '/trash/**' <dupes.txt   # remove only files in the /trash folder

If it is easier to specify a pattern for the files which you do not want to remove, then use one of the keep options:

fclones remove --keep-name '*.mov' <dupes.txt           # never remove mov files
fclones remove --keep-path '/important/**' <dupes.txt   # never remove files in the /important folder

To make sure you're not going to accidentally remove the wrong files, use the --dry-run option. This option prints all the commands that would be executed, but it doesn't actually execute them:

fclones link --soft <dupes.txt --dry-run 2>/dev/null

mv /tmp/test/bar2.txt /tmp/test/bar2.txt.jkXswbsDxhqItPeOfCXsWN4d
ln -s /tmp/test/bar1.txt /tmp/test/bar2.txt
rm /tmp/test/bar2.txt.jkXswbsDxhqItPeOfCXsWN4d
mv /tmp/test/foo2.txt /tmp/test/foo2.txt.ze1hvhNjfre618TkRGUxJNzx
ln -s /tmp/test/foo1.txt /tmp/test/foo2.txt
rm /tmp/test/foo2.txt.ze1hvhNjfre618TkRGUxJNzx
mv /tmp/test/foo3.txt /tmp/test/foo3.txt.ttLAWO6YckczL1LXEsHfcEau
ln -s /tmp/test/foo1.txt /tmp/test/foo3.txt
rm /tmp/test/foo3.txt.ttLAWO6YckczL1LXEsHfcEau

Handling links

Files linked by symbolic links or hard links are not treated as duplicates. You can change this behavior by setting the following flags:

  • When --isolate is set:
    • links residing in different directory trees are treated as duplicates,
    • links residing in the same directory tree are counted as a single replica.
  • When --match-links is set, fclones treats all linked files as duplicates.

Consider the following directory structure, where all files are hard links sharing the same content:

dir1:
  - file1
  - file2
dir2:
  - file3
  - file4

Because all files are essentially the same data, they will end up in the same file group, but the actual number of replicas present in that file group will differ depending on the flags given:

Command                                  Number of replicas   Group reported   Files to remove
fclones group dir1 dir2                  1                    No
fclones group dir1 dir2 --isolate        2                    Yes              file3, file4
fclones group dir1 dir2 --match-links    4                    Yes              file2, file3, file4
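
For reference, a structure like the one above can be recreated with hard links as follows (a minimal sketch):

mkdir -p dir1 dir2
echo content >dir1/file1
ln dir1/file1 dir1/file2   # file2 is a hard link to file1
ln dir1/file1 dir2/file3   # file3 shares the same inode
ln dir1/file1 dir2/file4   # file4 shares the same inode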

Symbolic links

The group command ignores symbolic links to files unless at least one of the --follow-links or --symbolic-links flags is set. If only --follow-links is set, symbolic links to files are followed and resolved to their targets. If --symbolic-links is set, symbolic links to files are not followed, but treated like hard links and potentially reported in the output report. When both --symbolic-links and --follow-links are set, symbolic links to directories are followed, but symbolic links to files are treated like hard links.
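
For illustration, the combinations described above correspond to invocations like these (only the flag names from the paragraph above are assumed):

fclones group . -L                    # --follow-links: resolve symbolic links to files to their targets
fclones group . --symbolic-links     # do not follow links to files, report them like hard links
fclones group . -L --symbolic-links  # follow links to directories, treat links to files like hard links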

Caution: Using --match-links together with --symbolic-links is very dangerous. It is easy to end up deleting the only regular file you have, and to be left with a bunch of orphan symbolic links.

Preprocessing Files

Use the --transform option to safely transform files with an external command. By default, the transformation happens on a copy of the file data, to avoid accidental data loss. Note that this option may significantly slow down processing of a huge number of files, because it invokes the external program for each file.

The following command will strip EXIF metadata before matching duplicate jpg images:

fclones group . --name '*.jpg' -i --transform 'exiv2 -d a $IN' --in-place     

Other

List more options:

fclones [command] -h      # short help
fclones [command] --help  # detailed help

Path Globbing

fclones understands a subset of Bash Extended Globbing. The following wildcards can be used (example patterns follow the list):

  • ? matches any character except the directory separator
  • [a-z] matches one of the characters or character ranges given in the square brackets
  • [!a-z] matches any character that is not given in the square brackets
  • * matches any sequence of characters except the directory separator
  • ** matches any sequence of characters including the directory separator
  • {a,b} matches exactly one pattern from the comma-separated patterns given inside the curly brackets
  • @(a|b) same as {a,b}
  • ?(a|b) matches at most one occurrence of the pattern inside the brackets
  • +(a|b) matches at least one occurrence of the patterns given inside the brackets
  • *(a|b) matches any number of occurrences of the patterns given inside the brackets
  • \ escapes wildcards on Unix-like systems, e.g. \? would match ? literally
  • ^ escapes wildcards on Windows, e.g. ^? would match ? literally
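
A few example patterns combining the wildcards above (illustrative only):

fclones group . --name '*.{jpg,jpeg,png}'              # one of several extensions
fclones group . --name 'IMG_[0-9][0-9][0-9][0-9].*'    # a fixed-width numeric part
fclones group / --path '/home/**' --exclude '/home/*/.cache/**'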

Caution

  • On Unix-like systems, when using globs, one must be very careful to avoid accidental expansion of globs by the shell. In many cases having globs expanded by the shell instead of by fclones is not what you want. In such cases, you need to quote the globs:

    fclones group . --name '*.jpg'       
    
  • On Windows, the default shell doesn't remove quotes before passing the arguments to the program, therefore you need to pass globs unquoted:

    fclones group . --name *.jpg
    
  • On Windows, the default shell doesn't support path globbing, therefore wildcard characters such as * and ? used in paths will be passed literally, and they are likely to create invalid paths. For example, the following command, which searches for duplicate files in the current directory in Bash, will likely fail in the default Windows shell:

    fclones group *
    

    If you need path globbing, and your shell does not support it, use the builtin path globbing provided by --name or --path.

The Algorithm

Files are processed in several stages. Each stage except the last one is parallel, but the previous stage must complete fully before the next one is started.

  1. Scan input files and filter files matching the selection criteria. Walk directories recursively if requested. Follow symbolic links if requested. For files that match the selection criteria, read their size.
  2. Group collected files by size by storing them in a hash-map. Remove groups smaller than the desired lower-bound (default 2).
  3. In each group, remove duplicate files with the same inode id. The same file could be reached through different paths when hardlinks are present. This step can be optionally skipped.
  4. For each remaining file, compute a hash of a tiny block of initial data. Put files with different hashes into separate groups. Prune result groups if needed.
  5. For each remaining file, compute a hash of a tiny block of data at the end of the file. Put files with different hashes into separate groups. Prune small groups if needed.
  6. For each remaining file, compute a hash of the whole contents of the file. Note that for small files we might have already computed a full contents hash in step 4, therefore these files can be safely omitted. Same as in steps 4 and 5, split groups and remove the ones that are too small.
  7. Write report to the stdout.

Note that there is no byte-by-byte comparison of files anywhere. All available hash functions are at least 128-bit wide, and you don't need to worry about hash collisions. At 10^15 files, the probability of collision is 0.000000001 when using a 128-bit hash, without taking into account the requirement for the files to also match by size.
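
As a rough sanity check, the figure above follows from the standard birthday-bound approximation (a back-of-the-envelope estimate with n = 10^15 files and a b = 128-bit hash, not an exact guarantee from fclones):

p ≈ n^2 / 2^(b+1) = (10^15)^2 / 2^129 ≈ 10^30 / (6.8 × 10^38) ≈ 1.5 × 10^-9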

Hashes

You can select the hash function with --hash-fn (default: metro). Non-cryptographic hashes are much more efficient than cryptographic ones; however, you probably won't see much difference unless you're reading from a fast SSD or the file data is cached.

Hash function   Hash width   Cryptographic
metro           128-bit      No
xxhash3         128-bit      No
blake3          256-bit      Yes
sha256          256-bit      Yes
sha512          512-bit      Yes
sha3-256        256-bit      Yes
sha3-512        512-bit      Yes
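
For example, to group using a cryptographic hash instead of the default (the --hash-fn flag and the blake3 value are taken from the text and table above):

fclones group . --hash-fn blake3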

Tuning

This section provides hints on getting the best performance from fclones.

Incremental Mode

If you expect to run fclones group more than once on the same set of files, you might benefit from turning on the hash cache by adding the --cache flag:

fclones group --cache <dir>

Caching can dramatically improve grouping speed on subsequent runs of fclones at the expense of some additional storage space needed for the cache. Caching also allows for resuming work quickly after interruption, so it is recommended if you plan to run fclones on huge data sets.

The cache works as follows:

  • Each newly computed file hash is persisted in the cache together with some metadata of the file such as its modification timestamp and length.
  • Whenever a file hash needs to be computed, it is first looked up in the cache. The cached hash is used if the current metadata of the file strictly matches the metadata stored in the cache.

Cached hashes are not invalidated by file moves because files are identified by their internal identifiers (inode identifiers on Unix), not by path names, and moves/renames typically preserve those.

Beware that caching relies on file metadata to detect changes in file contents. This might introduce some inaccuracies into the grouping process if a file's modification timestamp and length are not updated immediately whenever the file gets modified. Most file systems update the timestamps automatically on closing the file. Therefore, changed files that are held open for a long time (e.g. by database systems) might not be noticed by fclones group and might use stale cached values.

The cache database is located in the standard cache directory of the user account. Typically, those are:

  • Linux: $HOME/.cache/fclones
  • macOS: $HOME/Library/Caches/fclones
  • Windows: $HOME/AppData/Local/fclones

Configuring Parallelism

The --threads parameter controls the sizes of the internal thread pools. This can be used to reduce the parallelism level when you don't want fclones to impact the performance of your system too much, e.g. when you need to do some other work at the same time. We also recommend reducing the parallelism level if you need to reduce memory usage.

When using fclones up to version 0.6.x to deduplicate files of at least a few MB each on spinning drives (HDD), it is recommended to set --threads 1, because accessing big files from multiple threads on an HDD can be much slower than single-threaded access (YMMV, this is heavily OS-dependent; 2x-10x performance differences have been reported).

Since version 0.7.0, fclones uses separate per-device thread-pools for final hashing and it will automatically tune the level of parallelism, memory buffer sizes and partial hashing sizes based on the device type. These automatic settings can be overridden with --threads as well.

The following options can be passed to --threads. The more specific options override the less specific ones.

  • main:<n> – sets the size of the main thread-pool used for random I/O: directory tree scanning, file metadata fetching and in-memory sorting/hashing. These operations typically benefit from high parallelism level, even on spinning drives. Unset by default, which means the pool will be configured to use all available CPU cores.
  • dev:<device>:<r>,<s> – sets the sizes of the thread-pools used on the block device with the given name: r for random I/O and s for sequential I/O. The name of the device is OS-dependent. Note this is not the same as the partition name or mount point.
  • ssd:<r>,<s> – sets the sizes of the thread-pools used for I/O on solid-state drives. Unset by default.
  • hdd:<r>,<s> – sets the sizes of the thread-pools used for I/O on spinning drives. Defaults to 8,1
  • removable:<r>,<s> – sets the size of the thread-pools used for I/O on removable devices (e.g. USB sticks). Defaults to 4,1
  • unknown:<r>,<s> – sets the size of the thread-pools used for I/O on devices of unknown type. Sometimes the device type can't be determined, e.g. if the volume is a network share (NAS). Defaults to 4,1
  • default:<r>,<s> – sets the pool sizes to be used by all unset options
  • <r>,<s> - same as default:<r>,<s>
  • <n> - same as default:<n>,<n>

Examples

To limit the parallelism level for the main thread pool to 1:

fclones group <paths> --threads main:1  

To limit the parallelism level for all I/O access for all SSD devices:

fclones group <paths> --threads ssd:1 

To set the parallelism level to the number of cores for random I/O access and to 2 for sequential I/O access for the /dev/sda block device:

fclones group <paths> --threads dev:/dev/sda:0,2 

Multiple --threads options can be given, separated by spaces:

fclones group <paths> --threads main:16 ssd:4 hdd:1,1     

Benchmarks

Different duplicate finders were given a task to find duplicates in a large set of files. Before each run, the system page cache was evicted with echo 3 > /proc/sys/vm/drop_caches.

SSD Benchmark

  • Model: Dell Precision 5520
  • CPU: Intel(R) Xeon(R) CPU E3-1505M v6 @ 3.00GHz
  • RAM: 32 GB
  • Storage: local NVMe SSD 512 GB
  • System: Ubuntu Linux 20.10, kernel 5.8.0-53-generic
  • Task: 1,460,720 paths, 316 GB of data
Program       Version   Language    Time           Peak Memory
fclones       0.12.1    Rust        0:34.59        266 MB
yadf          0.15.2    Rust        0:59.32        329 MB
czkawka       3.1.0     Rust        2:09.00        1.4 GB
rmlint        2.9.0     C, Python   2:28.43        942 MB
jdupes        1.18.2    C           5:01.91        332 MB
dupe-krill    1.4.5     Rust        5:09.52        706 MB
fdupes        2.1.1     C           5:46.19        342 MB
rdfind        1.4.1     C++         5:53.07        496 MB
dupeguru      4.1.1     Python      7:49.89        1.4 GB
fdupes-java   1.3.1     Java        > 20 minutes   4.2 GB

fdupes-java did not finish the test. I interrupted it after 20 minutes while it was still computing MD5 in stage 2/3. Unfortunately fdupes-java doesn't display a useful progress bar, so it is not possible to estimate how long it would take.

HDD Benchmark

  • Model: Dell Precision M4600
  • CPU: Intel(R) Core(TM) i7-2760QM CPU @ 2.40GHz
  • RAM: 24 GB
  • System: Mint Linux 19.3, kernel 5.4.0-70-generic
  • Storage: Seagate Momentus 7200 RPM SATA drive, EXT4 filesystem
  • Task: 51370 paths, 2 GB data, 6811 (471 MB) duplicate files

Commands used:

  /usr/bin/time -v fclones -R <file set root> 
  /usr/bin/time -v jdupes -R -Q <file set root>
  /usr/bin/time -v fdupes -R <file set root>
  /usr/bin/time -v rdfind <file set root>

In this benchmark, the page cache was dropped before each run.

Program   Version   Language   Threads   Time      Peak Memory
fclones   0.9.1     Rust       1         0:19.45   18.1 MB
rdfind    1.3.5     C++        1         0:33.70   18.5 MB
yadf      0.14.1    Rust                 1:11.69   22.9 MB
jdupes    1.9       C          1         1:18.47   15.7 MB
fdupes    1.6.1     C          1         1:33.71   15.9 MB

fclones's People

Contributors

dotysan, gelma, ivan, johnpyp, jrimbault, kapitainsky, koutheir, landfillbaby, msfjarvis, peddamat, pkolaczk, th1000s


fclones's Issues

Write report to a file

Add an -o <file> option. This would allow for more flexibility when building pipelines, e.g. wrapping fclones with time.

Does not build on OSX

Hi,

Your tool looks very promising, so I wanted to give it a go on a Mac. Unfortunately, I get build errors, mainly about an unresolved PosixFadviseAdvice.

short trace

error[E0433]: failed to resolve: use of undeclared type or module `PosixFadviseAdvice`

I have zero Rust skills, but if you have some advice I would gladly help you out.

No binary releases

Hey!

I'm maintaining the fclones and fclones-bin packages on the AUR, and I saw the 0.9.0 release doesn't have binary artifacts. Unfortunately, I couldn't bump the -bin package because of that.

Are you considering bringing back binary artifacts (e.g. fclones-$pkgver.tgz) with the upcoming releases?

Add directory clone detection

Sometimes I keep two copies of the content of a CF card full of videos or photos.

It would be great to have detection of directories whose entire content is already present somewhere else.

Stream output to file

I'm running into an issue: I'm scanning 20 GB over the network, and when the last stage (grouping by content) is running, it would be great to get duplicates written out as they are found. For the sake of better UX, maybe only stream them when there is an argument to write the output to a file. This way, if I or something else interrupts the content hashing, I can still get a partial result.

Feedback after initial testing

Hello @pkolaczk, again thanks a lot for this GREAT tool. I love the idea of parallel processing and using the power of Rust in this tool! I have a couple of questions and maybe feature requests that I want to discuss with you.

First of all, I tested this tool on a low-power DS215j NAS device.

CPU: MARVELL Armada 375 88F6720 - Dual Core - 800 MHz (ARMv7)
RAM: 512 MB
HDD: 6TB WD NAS Drive

and these are my questions after testing:

1. Is there a way to export the report to a file instead of printing to the console?

This is useful when you have too many duplicates for the terminal window to handle. In my case I want to run the command in screen and come back later to get the results.

I tried the command below, but it doesn't give me any progress/status from the tool:
sudo /usr/bin/time --verbose **./fclones ~ -R --format JSON** |& tee -a /volume2/duplicatesdata.json

2. Can the tool add a timestamp to the status updates?

This could be helpful to see how much time each phase took, something like:

2020-06-20T21:57:06 [INFO] - fclones:  info: Scanned 4687831 file entries
2020-06-21T05:57:06 [INFO] - fclones:  fclones:  info: Found 3857155 (5.4 TB) files matching selection criteria
2020-06-21T08:57:06 [INFO] - fclones:  fclones:  info: Found 3447623 (1.4 TB) candidates after grouping by size

3. Can the tool have an option to persist the analysis data to a file/data-store instead of RAM?

In my case the device is slow and the HDD is big; sometimes in other tools I need to run duplicate analysis for 6 days.
When I was trying this tool, I lost power after leaving it running for 2 days, so this would let me run the tool again and continue the analysis.

4. Can the tool have an option to choose the hashing algorithm?

From what I can see, MetroHash is a great hashing algorithm, but it is optimized for machine-specific (x64 SSE4.2) x86-64 architectures.
So adding another algorithm that is not machine-specific would be a great addition.

Build fails: "non-exhaustive patterns `Removable` not covered"

Hello, I am unable to compile fclones due to an error. This happens with both the AUR package and manually running cargo build --release.
OS: Manjaro Linux x86_64
Rust/Cargo version 1.49.0

error[E0004]: non-exhaustive patterns: `Removable` not covered
--> src/device.rs:22:24
|
22  |             0 => match disk_type {
|                        ^^^^^^^^^ pattern `Removable` not covered
| 
::: /home/timothy/.cargo/registry/src/github.com-1ecc6299db9ec823/sysinfo-0.15.4/src/common.rs:257:5
|
257 |     Removable,
|     --------- not covered
|
= help: ensure that all possible cases are being handled, possibly by adding wildcards or more match arms
= note: the matched value is of type `DiskType`

error[E0004]: non-exhaustive patterns: `Removable` not covered
--> src/device.rs:41:15
|
41  |         match self.disk_type {
|               ^^^^^^^^^^^^^^ pattern `Removable` not covered
| 
::: /home/timothy/.cargo/registry/src/github.com-1ecc6299db9ec823/sysinfo-0.15.4/src/common.rs:257:5
|
257 |     Removable,
|     --------- not covered
|
= help: ensure that all possible cases are being handled, possibly by adding wildcards or more match arms
= note: the matched value is of type `DiskType`

error[E0004]: non-exhaustive patterns: `Removable` not covered
--> src/device.rs:49:23
|
49  |         FileLen(match self.disk_type {
|                       ^^^^^^^^^^^^^^ pattern `Removable` not covered
| 
::: /home/timothy/.cargo/registry/src/github.com-1ecc6299db9ec823/sysinfo-0.15.4/src/common.rs:257:5
|
257 |     Removable,
|     --------- not covered
|
= help: ensure that all possible cases are being handled, possibly by adding wildcards or more match arms
= note: the matched value is of type `DiskType`

error[E0004]: non-exhaustive patterns: `Removable` not covered
--> src/device.rs:57:23
|
57  |         FileLen(match self.disk_type {
|                       ^^^^^^^^^^^^^^ pattern `Removable` not covered
| 
::: /home/timothy/.cargo/registry/src/github.com-1ecc6299db9ec823/sysinfo-0.15.4/src/common.rs:257:5
|
257 |     Removable,
|     --------- not covered
|
= help: ensure that all possible cases are being handled, possibly by adding wildcards or more match arms
= note: the matched value is of type `DiskType`

error[E0004]: non-exhaustive patterns: `Removable` not covered
--> src/device.rs:69:23
|
69  |         FileLen(match self.disk_type {
|                       ^^^^^^^^^^^^^^ pattern `Removable` not covered
| 
::: /home/timothy/.cargo/registry/src/github.com-1ecc6299db9ec823/sysinfo-0.15.4/src/common.rs:257:5
|
257 |     Removable,
|     --------- not covered
|
= help: ensure that all possible cases are being handled, possibly by adding wildcards or more match arms
= note: the matched value is of type `DiskType`

error[E0004]: non-exhaustive patterns: `Removable` not covered
--> src/device.rs:100:31
|
100 |                 let p = match disk_type {
|                               ^^^^^^^^^ pattern `Removable` not covered
| 
::: /home/timothy/.cargo/registry/src/github.com-1ecc6299db9ec823/sysinfo-0.15.4/src/common.rs:257:5
|
257 |     Removable,
|     --------- not covered
|
= help: ensure that all possible cases are being handled, possibly by adding wildcards or more match arms
= note: the matched value is of type `DiskType`

error: aborting due to 6 previous errors

For more information about this error, try `rustc --explain E0004`.
error: could not compile `fclones`

To learn more, run the command again with --verbose.
warning: build failed, waiting for other jobs to finish...
error: build failed

`fclones dedupe` results in updates to mtimes on directories

Running fclones dedupe on some directory tree results in mtimes being updated for directories containing files that were deduplicated.

I don't know whether this should be addressed, because while mtimes may be desirable to preserve, the directories really were updated through file creation. This effect does make it less likely that I would want to use fclones dedupe on old directory trees with potentially informative mtimes, though.

Identical files under the same root are returned despite `--isolate` option

When running fclones group -I, it seems to be finding duplicate files underneath the same root (path argument on command line). For example, if I construct a tree like:

echo hi > source.txt
mkdir -p {a,b}/{1,2}
for i in {a,b}/{1,2}/test; do cp source.txt "${i}"; done

and then run fclones group -I a b, I get:

[2021-11-17 16:08:01.482] fclones:  info: Started grouping
[2021-11-17 16:08:02.019] fclones:  info: Scanned 10 file entries
[2021-11-17 16:08:02.019] fclones:  info: Found 4 (12 B) files matching selection criteria
[2021-11-17 16:08:02.019] fclones:  info: Found 3 (9 B) candidates after grouping by size
[2021-11-17 16:08:02.019] fclones:  info: Found 3 (9 B) candidates after grouping by paths and file identifiers
[2021-11-17 16:08:02.033] fclones:  info: Found 3 (9 B) candidates after grouping by prefix
[2021-11-17 16:08:02.033] fclones:  info: Found 3 (9 B) candidates after grouping by suffix
[2021-11-17 16:08:02.034] fclones:  info: Found 3 (9 B) redundant files
# Report by fclones 0.17.1
# Timestamp: 2021-11-17 16:08:02.036 -0500
# Command: fclones group -I a b
# Found 1 file groups
# 9 B (9 B) in 3 redundant files can be removed
e872d4a1bdc12e1262820a95eebb530a, 3 B (3 B) * 4:
    /tmp/tree/a/1/test
    /tmp/tree/a/2/test
    /tmp/tree/b/1/test
    /tmp/tree/b/2/test

Compiling to linux ARM

Great work. I already had a play with the utility on macOS and it works great.

Just feedback: I also tried to compile for Manjaro on the RPi 4 and the compile failed on the hashing crate. I might ask the author of the fasthash-sys crate what would be required to allow it to compile on the ARM architecture.

fclones git:(master) cargo build --release
   Compiling fasthash-sys v0.3.2
   Compiling getrandom v0.1.14
   Compiling num_cpus v1.13.0
   Compiling atty v0.2.14
error: failed to run custom build command for `fasthash-sys v0.3.2`

Caused by:
  process didn't exit successfully: `/home/stuart/rust_projects/fclones/target/release/build/fasthash-sys-3bfb9e86593b1584/build-script-build` (exit code: 101)
--- stdout
TARGET = Some("aarch64-unknown-linux-gnu")
OPT_LEVEL = Some("3")
TARGET = Some("aarch64-unknown-linux-gnu")
HOST = Some("aarch64-unknown-linux-gnu")
TARGET = Some("aarch64-unknown-linux-gnu")
TARGET = Some("aarch64-unknown-linux-gnu")
HOST = Some("aarch64-unknown-linux-gnu")
CC_aarch64-unknown-linux-gnu = None
CC_aarch64_unknown_linux_gnu = None
HOST_CC = None
CC = None
HOST = Some("aarch64-unknown-linux-gnu")
TARGET = Some("aarch64-unknown-linux-gnu")
HOST = Some("aarch64-unknown-linux-gnu")
CFLAGS_aarch64-unknown-linux-gnu = None
CFLAGS_aarch64_unknown_linux_gnu = None
HOST_CFLAGS = None
CFLAGS = None
DEBUG = Some("false")
running: "cc" "-O3" "-ffunction-sections" "-fdata-sections" "-fPIC" "-Wno-implicit-fallthrough" "-Wno-unknown-attributes" "-msse4.2" "-maes" "-mavx" "-mavx2" "-DT1HA0_RUNTIME_SELECT=1" "-DT1HA0_AESNI_AVAILABLE=1" "-Wall" "-Wextra" "-o" "/home/stuart/rust_projects/fclones/target/release/build/fasthash-sys-fc57bf495c3381b2/out/src/fasthash.o" "-c" "src/fasthash.cpp"
cargo:warning=cc: error: unrecognized command line option ‘-msse4.2’
cargo:warning=cc: error: unrecognized command line option ‘-maes’
cargo:warning=cc: error: unrecognized command line option ‘-mavx’
cargo:warning=cc: error: unrecognized command line option ‘-mavx2’
exit code: 1

--- stderr
thread 'main' panicked at '

Internal error occurred: Command "cc" "-O3" "-ffunction-sections" "-fdata-sections" "-fPIC" "-Wno-implicit-fallthrough" "-Wno-unknown-attributes" "-msse4.2" "-maes" "-mavx" "-mavx2" "-DT1HA0_RUNTIME_SELECT=1" "-DT1HA0_AESNI_AVAILABLE=1" "-Wall" "-Wextra" "-o" "/home/stuart/rust_projects/fclones/target/release/build/fasthash-sys-fc57bf495c3381b2/out/src/fasthash.o" "-c" "src/fasthash.cpp" with args "cc" did not execute successfully (status code exit code: 1).

', /home/stuart/.cargo/registry/src/github.com-1ecc6299db9ec823/gcc-0.3.55/src/lib.rs:1672:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

warning: build failed, waiting for other jobs to finish...
error: build failed

-- feel free to close this issue - I just wanted to give feedback.

Incremental mode

Persist hashes to a file in order to speed up subsequent runs or to avoid recomputing hashes when the previous run was interrupted.

Perform partial hashing in order of physical block placement

This paper reports that ordering accesses by inode id or by physical block location retrieved with the ioctl FIEMAP API can give substantial performance improvements.

These techniques could be applied for the partial hashing phase of fclones, where seek time and rotational latency are the major bottleneck.

ignore size and report duplicates

Is the implementation of finding duplicates based on size?
Can it find duplicates even if there's a size mismatch and report the file names?

Warning message confusing when physical file data location not available

I get the following messages running fclones group yyy:

[2021-10-24 14:33:02.212] fclones: warn: Failed to fetch extents for file : Operation not supported (os error 95)

Maybe it's harmless; my problem is that I don't know what it means.
I am using version: fclones 0.17.0
system is xubuntu 20.04 (everything upgraded)
I run fclones on about 2 TiB of data - 200,000 files of all sizes
filesystem is zfs (no mirror or raid but encrypted)
disk is spinning disk 4TiB

The warning occurs on 10 of the 200,000 files.
Any ideas? Could the warning message be somewhat more verbose?
Can I just ignore it?

By the way, I have run a brief speed comparison on the data above; here are my results
(Intel quad core, 16 GB RAM, disk cache fully loaded from previous operations):
fclones group xxx 34 min
rdfind xxx 81 min
jdupes -S -M -Q -r xxx 90 min
rmlint -T df xxx 134 min
Pretty impressive!!!

Support JSON input for remove/link

$ fclones remove ... < dupes.json
fclones: error: Input error: Not a default fclones report. Formats other than the default one are not supported yet.

I found it very useful to process the fclones group JSON output with jq and would like to continue the workflow with fclones remove.

Improve initialization speed - avoid traversing sysfs

Running fclones on a relatively small directory, I noticed its performance is surprisingly bad:

$ time fclones group ~/Downloads/ 
[2021-06-06 18:57:22.658] fclones:  info: Started grouping
[2021-06-06 18:57:23.091] fclones:  info: Scanned 967 file entries
[2021-06-06 18:57:23.091] fclones:  info: Found 873 (2.7 GB) files matching selection criteria
[2021-06-06 18:57:23.091] fclones:  info: Found 47 (9.0 MB) candidates after grouping by size
[2021-06-06 18:57:23.092] fclones:  info: Found 47 (9.0 MB) candidates after grouping by paths and file identifiers
[2021-06-06 18:57:23.097] fclones:  info: Found 45 (8.5 MB) candidates after grouping by prefix
[2021-06-06 18:57:23.105] fclones:  info: Found 45 (8.5 MB) candidates after grouping by suffix
[2021-06-06 18:57:23.125] fclones:  info: Found 45 (8.5 MB) redundant files
<...>
real	0m0.481s
user	0m0.271s
sys	0m0.195s

So it takes about 0.5 sec to process a directory with fewer than 1000 files. I noticed that most of the time is spent in the "Initializing" phase, so I ran strace:

$ strace -c fclones group ~/Downloads/
<...>
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ------------------
 34.07    0.349066           5     61719      1563 openat
 25.31    0.259249           4     60156           close
 23.58    0.241530           4     53660           newfstatat
  6.21    0.063639           6     10372           read
  4.34    0.044439           6      6691           readlinkat
  3.39    0.034740           7      4502         5 access
  1.92    0.019707         221        89        17 futex
  0.60    0.006150           5      1096           getdents64
<...>

So, it appears fclones makes 60k openat, 60k close, and 54k newfstatat calls. This is very surprising.

Inspecting the openat syscalls, it seems that most of them are traversing the /sys/ filesystem. Here is a fragment of the strace output (filtered by the openat syscall):

openat(AT_FDCWD, "/sys/devices/pci0000:00/0000:00:14.0/usb2/2-4/2-4:1.0/uevent", O_RDONLY|O_CLOEXEC) = 5
openat(AT_FDCWD, "/run/udev/data/+usb:2-4:1.0", O_RDONLY|O_CLOEXEC) = 5
openat(AT_FDCWD, "/", O_RDONLY|O_CLOEXEC|O_PATH|O_DIRECTORY) = 5
openat(5, "sys", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 6
openat(6, "bus", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 5
openat(5, "usb", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 6
openat(6, "devices", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 5
openat(5, "1-4.4:1.0", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 6
openat(5, "..", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 6
openat(6, "..", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 5
openat(5, "..", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 6
openat(6, "devices", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 5
openat(5, "pci0000:00", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 6
openat(6, "0000:00:14.0", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 5
openat(5, "usb1", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 6
openat(6, "1-4", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 5
openat(5, "1-4.4", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 6
openat(6, "1-4.4:1.0", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 5

os error 123 in Windows 10

Running fclones * in Windows 10 results in an error (sorry for the error message in Polish ;-)

[2020-07-19 16:53:35.267] fclones.exe: error: Failed to stat C:\*: Nazwa pliku, nazwa katalogu lub składnia etykiety woluminu jest niepoprawna. (os error 123) [English: The filename, directory name, or volume label syntax is incorrect.]
[2020-07-19 16:53:35.269] fclones.exe:  info: Scanned 0 file entries
[2020-07-19 16:53:35.270] fclones.exe:  info: Found 0 (0 B) files matching selection criteria                           
[2020-07-19 16:53:35.272] fclones.exe:  info: Found 0 (0 B) candidates after grouping by size                           
[2020-07-19 16:53:35.274] fclones.exe:  info: Found 0 (0 B) candidates after pruning hard-links                         
[2020-07-19 16:53:35.277] fclones.exe:  info: Found 0 (0 B) candidates after grouping by prefix                         
[2020-07-19 16:53:35.278] fclones.exe:  info: Found 0 (0 B) candidates after grouping by suffix                         
[2020-07-19 16:53:35.280] fclones.exe:  info: Found 0 (0 B) duplicate files

Running fclones . -R works properly, and running under WSL also works properly.

Tests failing on file systems which don't support querying file creation (birth) time like zfs or f2fs

fclones fails to build on my Arch Linux f2fs partition.

failures:

---- dedupe::test::test_partition_respects_creation_time_priority stdout ----
[2021-07-29 18:44:45.674] fclones-6cdeb7b3f6a11fd5:  warn: Failed to read creation time of file /home/et/yay/fclones/src/fclones-0.12.3/target/test/dedupe/partition/ctime_priority/file_3: creation time is not available for the filesystem
[2021-07-29 18:44:45.674] fclones-6cdeb7b3f6a11fd5:  warn: Failed to read creation time of file /home/et/yay/fclones/src/fclones-0.12.3/target/test/dedupe/partition/ctime_priority/file_2: creation time is not available for the filesystem
[2021-07-29 18:44:45.674] fclones-6cdeb7b3f6a11fd5:  warn: Failed to read creation time of file /home/et/yay/fclones/src/fclones-0.12.3/target/test/dedupe/partition/ctime_priority/file_2: creation time is not available for the filesystem
[2021-07-29 18:44:45.674] fclones-6cdeb7b3f6a11fd5:  warn: Failed to read creation time of file /home/et/yay/fclones/src/fclones-0.12.3/target/test/dedupe/partition/ctime_priority/file_1: creation time is not available for the filesystem
thread 'dedupe::test::test_partition_respects_creation_time_priority' panicked at 'called `Result::unwrap()` on an `Err` value: Error { message: "Could not determine files to drop in group with hash 00000000000000000000000000000000 and len 0: Metadata of some files could not be read." }', src/dedupe.rs:856:80
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

---- dedupe::test::test_partition_respects_drop_patterns stdout ----
[2021-07-29 18:44:45.675] fclones-6cdeb7b3f6a11fd5:  warn: Failed to read creation time of file /home/et/yay/fclones/src/fclones-0.12.3/target/test/dedupe/partition/drop/file_3: creation time is not available for the filesystem
[2021-07-29 18:44:45.675] fclones-6cdeb7b3f6a11fd5:  warn: Failed to read creation time of file /home/et/yay/fclones/src/fclones-0.12.3/target/test/dedupe/partition/drop/file_2: creation time is not available for the filesystem
[2021-07-29 18:44:45.675] fclones-6cdeb7b3f6a11fd5:  warn: Failed to read creation time of file /home/et/yay/fclones/src/fclones-0.12.3/target/test/dedupe/partition/drop/file_2: creation time is not available for the filesystem
[2021-07-29 18:44:45.675] fclones-6cdeb7b3f6a11fd5:  warn: Failed to read creation time of file /home/et/yay/fclones/src/fclones-0.12.3/target/test/dedupe/partition/drop/file_1: creation time is not available for the filesystem
thread 'dedupe::test::test_partition_respects_drop_patterns' panicked at 'called `Result::unwrap()` on an `Err` value: Error { message: "Could not determine files to drop in group with hash 00000000000000000000000000000000 and len 0: Metadata of some files could not be read." }', src/dedupe.rs:923:68

---- dedupe::test::test_partition_respects_keep_patterns stdout ----
[2021-07-29 18:44:45.675] fclones-6cdeb7b3f6a11fd5:  warn: Failed to read creation time of file /home/et/yay/fclones/src/fclones-0.12.3/target/test/dedupe/partition/keep/file_3: creation time is not available for the filesystem
[2021-07-29 18:44:45.675] fclones-6cdeb7b3f6a11fd5:  warn: Failed to read creation time of file /home/et/yay/fclones/src/fclones-0.12.3/target/test/dedupe/partition/keep/file_2: creation time is not available for the filesystem
[2021-07-29 18:44:45.675] fclones-6cdeb7b3f6a11fd5:  warn: Failed to read creation time of file /home/et/yay/fclones/src/fclones-0.12.3/target/test/dedupe/partition/keep/file_2: creation time is not available for the filesystem
[2021-07-29 18:44:45.675] fclones-6cdeb7b3f6a11fd5:  warn: Failed to read creation time of file /home/et/yay/fclones/src/fclones-0.12.3/target/test/dedupe/partition/keep/file_1: creation time is not available for the filesystem
thread 'dedupe::test::test_partition_respects_keep_patterns' panicked at 'called `Result::unwrap()` on an `Err` value: Error { message: "Could not determine files to drop in group with hash 00000000000000000000000000000000 and len 0: Metadata of some files could not be read." }', src/dedupe.rs:904:68

---- dedupe::test::test_run_dedupe_script stdout ----
[2021-07-29 18:44:45.676] fclones-6cdeb7b3f6a11fd5:  warn: Failed to read creation time of file /home/et/yay/fclones/src/fclones-0.12.3/target/test/dedupe/partition/dedupe_script/file_3: creation time is not available for the filesystem
[2021-07-29 18:44:45.676] fclones-6cdeb7b3f6a11fd5:  warn: Failed to read creation time of file /home/et/yay/fclones/src/fclones-0.12.3/target/test/dedupe/partition/dedupe_script/file_2: creation time is not available for the filesystem
[2021-07-29 18:44:45.676] fclones-6cdeb7b3f6a11fd5:  warn: Failed to read creation time of file /home/et/yay/fclones/src/fclones-0.12.3/target/test/dedupe/partition/dedupe_script/file_2: creation time is not available for the filesystem
[2021-07-29 18:44:45.676] fclones-6cdeb7b3f6a11fd5:  warn: Failed to read creation time of file /home/et/yay/fclones/src/fclones-0.12.3/target/test/dedupe/partition/dedupe_script/file_1: creation time is not available for the filesystem
[2021-07-29 18:44:45.676] fclones-6cdeb7b3f6a11fd5:  warn: Could not determine files to drop in group with hash 00000000000000000000000000000000 and len 0: Metadata of some files could not be read.
thread 'dedupe::test::test_run_dedupe_script' panicked at 'assertion failed: `(left == right)`
  left: `0`,
 right: `2`', src/dedupe.rs:944:13


failures:
    dedupe::test::test_partition_respects_creation_time_priority
    dedupe::test::test_partition_respects_drop_patterns
    dedupe::test::test_partition_respects_keep_patterns
    dedupe::test::test_run_dedupe_script

test result: FAILED. 94 passed; 4 failed; 0 ignored; 0 measured; 0 filtered out; finished in 2.49s

error: test failed, to rerun pass '--lib'

Failure to read creation time on ZFS

It looks like stat does not report a creation time from ZFS properly, listing no "Birth" time. I assume whatever fclones is using is doing something similar and not getting a creation time reported. I'm still digging around for details; Does ZFS store "Birth Time" or "Creation Time"? is what I've uncovered so far.

failures:

---- dedupe::test::test_partition_respects_keep_patterns stdout ----
[2021-06-05 20:02:23.458] fclones-fe24705dd771f261:  warn: Failed to read creation time of file /home/fryfrog/.cache/paru/clone/fclones/src/fclones-0.12.0/target/test/dedupe/partition/keep/file_3: creation time is not available for the filesystem
[2021-06-05 20:02:23.459] fclones-fe24705dd771f261:  warn: Failed to read creation time of file /home/fryfrog/.cache/paru/clone/fclones/src/fclones-0.12.0/target/test/dedupe/partition/keep/file_2: creation time is not available for the filesystem
[2021-06-05 20:02:23.459] fclones-fe24705dd771f261:  warn: Failed to read creation time of file /home/fryfrog/.cache/paru/clone/fclones/src/fclones-0.12.0/target/test/dedupe/partition/keep/file_2: creation time is not available for the filesystem
[2021-06-05 20:02:23.459] fclones-fe24705dd771f261:  warn: Failed to read creation time of file /home/fryfrog/.cache/paru/clone/fclones/src/fclones-0.12.0/target/test/dedupe/partition/keep/file_1: creation time is not available for the filesystem
thread 'dedupe::test::test_partition_respects_keep_patterns' panicked at 'called `Result::unwrap()` on an `Err` value: Error { message: "Could not determine files to drop in group with hash 00000000000000000000000000000000 and len 0: Metadata of some files could not be read." }', src/dedupe.rs:904:68
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

---- dedupe::test::test_partition_respects_drop_patterns stdout ----
[2021-06-05 20:02:23.458] fclones-fe24705dd771f261:  warn: Failed to read creation time of file /home/fryfrog/.cache/paru/clone/fclones/src/fclones-0.12.0/target/test/dedupe/partition/drop/file_3: creation time is not available for the filesystem
[2021-06-05 20:02:23.459] fclones-fe24705dd771f261:  warn: Failed to read creation time of file /home/fryfrog/.cache/paru/clone/fclones/src/fclones-0.12.0/target/test/dedupe/partition/drop/file_2: creation time is not available for the filesystem
[2021-06-05 20:02:23.459] fclones-fe24705dd771f261:  warn: Failed to read creation time of file /home/fryfrog/.cache/paru/clone/fclones/src/fclones-0.12.0/target/test/dedupe/partition/drop/file_2: creation time is not available for the filesystem
[2021-06-05 20:02:23.459] fclones-fe24705dd771f261:  warn: Failed to read creation time of file /home/fryfrog/.cache/paru/clone/fclones/src/fclones-0.12.0/target/test/dedupe/partition/drop/file_1: creation time is not available for the filesystem
thread 'dedupe::test::test_partition_respects_drop_patterns' panicked at 'called `Result::unwrap()` on an `Err` value: Error { message: "Could not determine files to drop in group with hash 00000000000000000000000000000000 and len 0: Metadata of some files could not be read." }', src/dedupe.rs:923:68

---- dedupe::test::test_partition_respects_creation_time_priority stdout ----
[2021-06-05 20:02:23.458] fclones-fe24705dd771f261:  warn: Failed to read creation time of file /home/fryfrog/.cache/paru/clone/fclones/src/fclones-0.12.0/target/test/dedupe/partition/ctime_priority/file_3: creation time is not available for the filesystem
[2021-06-05 20:02:23.459] fclones-fe24705dd771f261:  warn: Failed to read creation time of file /home/fryfrog/.cache/paru/clone/fclones/src/fclones-0.12.0/target/test/dedupe/partition/ctime_priority/file_2: creation time is not available for the filesystem
[2021-06-05 20:02:23.459] fclones-fe24705dd771f261:  warn: Failed to read creation time of file /home/fryfrog/.cache/paru/clone/fclones/src/fclones-0.12.0/target/test/dedupe/partition/ctime_priority/file_2: creation time is not available for the filesystem
[2021-06-05 20:02:23.459] fclones-fe24705dd771f261:  warn: Failed to read creation time of file /home/fryfrog/.cache/paru/clone/fclones/src/fclones-0.12.0/target/test/dedupe/partition/ctime_priority/file_1: creation time is not available for the filesystem
thread 'dedupe::test::test_partition_respects_creation_time_priority' panicked at 'called `Result::unwrap()` on an `Err` value: Error { message: "Could not determine files to drop in group with hash 00000000000000000000000000000000 and len 0: Metadata of some files could not be read." }', src/dedupe.rs:856:80

---- dedupe::test::test_run_dedupe_script stdout ----
[2021-06-05 20:02:23.466] fclones-fe24705dd771f261:  warn: Failed to read creation time of file /home/fryfrog/.cache/paru/clone/fclones/src/fclones-0.12.0/target/test/dedupe/partition/dedupe_script/file_3: creation time is not available for the filesystem
[2021-06-05 20:02:23.466] fclones-fe24705dd771f261:  warn: Failed to read creation time of file /home/fryfrog/.cache/paru/clone/fclones/src/fclones-0.12.0/target/test/dedupe/partition/dedupe_script/file_2: creation time is not available for the filesystem
[2021-06-05 20:02:23.466] fclones-fe24705dd771f261:  warn: Failed to read creation time of file /home/fryfrog/.cache/paru/clone/fclones/src/fclones-0.12.0/target/test/dedupe/partition/dedupe_script/file_2: creation time is not available for the filesystem
[2021-06-05 20:02:23.466] fclones-fe24705dd771f261:  warn: Failed to read creation time of file /home/fryfrog/.cache/paru/clone/fclones/src/fclones-0.12.0/target/test/dedupe/partition/dedupe_script/file_1: creation time is not available for the filesystem
[2021-06-05 20:02:23.466] fclones-fe24705dd771f261:  warn: Could not determine files to drop in group with hash 00000000000000000000000000000000 and len 0: Metadata of some files could not be read.
thread 'dedupe::test::test_run_dedupe_script' panicked at 'assertion failed: `(left == right)`
  left: `0`,
 right: `2`', src/dedupe.rs:944:13


failures:
    dedupe::test::test_partition_respects_creation_time_priority
    dedupe::test::test_partition_respects_drop_patterns
    dedupe::test::test_partition_respects_keep_patterns
    dedupe::test::test_run_dedupe_script

test result: FAILED. 92 passed; 4 failed; 0 ignored; 0 measured; 0 filtered out; finished in 65.13s
0 ✓ fryfrog@apollo ~ $ ls -alh /home/fryfrog/.cache/paru/clone/fclones/src/fclones-0.12.0/target/test/dedupe/partition/drop/file_1
-rw-r--r-- 1 fryfrog fryfrog 0 Jun  5 20:02 /home/fryfrog/.cache/paru/clone/fclones/src/fclones-0.12.0/target/test/dedupe/partition/drop/file_1
0 ✓ fryfrog@apollo ~ $ stat /home/fryfrog/.cache/paru/clone/fclones/src/fclones-0.12.0/target/test/dedupe/partition/drop/file_1
  File: /home/fryfrog/.cache/paru/clone/fclones/src/fclones-0.12.0/target/test/dedupe/partition/drop/file_1
  Size: 0               Blocks: 1          IO Block: 131072 regular empty file
Device: 19h/25d Inode: 2938048     Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1000/ fryfrog)   Gid: ( 1000/ fryfrog)
Access: 2021-06-05 20:02:23.449795401 -0700
Modify: 2021-06-05 20:02:23.449795401 -0700
Change: 2021-06-05 20:02:23.449795401 -0700
 Birth: -
0 ✓ fryfrog@apollo ~ $ sudo zdb -O rpool/ROOT/arch home/fryfrog/.cache/paru/clone/fclones/src/fclones-0.12.0/target/test/dedupe/partition/drop/file_1

   Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
  2938048    1   128K    512      0     512    512    0.00  ZFS plain file

0 ✓ fryfrog@apollo ~ $ sudo zdb -ddddd rpool/ROOT/arch  2938048
Dataset rpool/ROOT/arch [ZPL], ID 394, cr_txg 20, 81.7G, 1547348 objects, rootbp DVA[0]=<0:2287a77000:1000> DVA[1]=<0:2875361000:1000> [L0 DMU objset] fletcher4 uncompressed unencrypted LE contiguous unique double size=1000L/1000P birth=140448231L/140448231P fill=1547348 cksum=11e787a4a1:3022d1624a43:45bff9d1ab72ff:480dc35cda0302c4

   Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
  2938048    1   128K    512      0     512    512    0.00  ZFS plain file
                                              176   bonus  System attributes
       dnode flags: USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED
       dnode maxblkid: 0
       path    /home/fryfrog/.cache/paru/clone/fclones/src/fclones-0.12.0/target/test/dedupe/partition/drop/file_1
       uid     1000
       gid     1000
       atime   Sat Jun  5 20:02:23 2021
       mtime   Sat Jun  5 20:02:23 2021
       ctime   Sat Jun  5 20:02:23 2021
       crtime  Sat Jun  5 20:02:23 2021
       gen     140447745
       mode    100644
       size    0
       parent  2938047
       links   1
       pflags  840800000004

Update benchmarks and add czkawka and rmlint to comparison

Hi,
I see that your app does some crazy optimizations for SSDs and HDDs, and I'm curious how fast it is in comparison to my app Czkawka (it uses mostly simple optimizations and rather primitive algorithms, since I focus more on the GUI).

I'm almost sure that with a big number of duplicated files fclones will be faster, but I'm curious whether Czkawka will be faster on a second scan thanks to its caching of hash results.

--stdin Parameter Not Working

Greetings!

I've been playing a bit with fclones this morning (super cool tool, BTW) and wanted to use the --stdin parameter to read the list of files to analyze from the output of find. Based on the documentation it seems like passing the input to fclones group --stdin should work, but whenever I try this I always get an error: fclones: error: No input files

Here's a simple, trivial example of what I mean:

localhost~ % fclones --version
fclones 0.12.2
localhost~ % mkdir blah
localhost~ % cd blah
localhost~/blah % touch {1,2,3}.c
localhost~/blah % find . -name '*.c'
./1.c
./2.c
./3.c
localhost~/blah % find . -name '*.c' | fclones group --stdin
[2021-06-19 15:01:12.126] fclones: error: No input files

I'm not sure if I'm doing something incorrectly - any ideas? Thanks in advance!

How should I interpret progress bar sizes? Affected by hard links?

Here's my output as it's running currently:

[2021-01-25 11:58:30.332] fclones:  info: Started
[2021-01-25 11:58:41.402] fclones:  info: Scanned 40512 file entries
[2021-01-25 11:58:41.402] fclones:  info: Found 38084 (10.7 TB) files matching selection criteria
[2021-01-25 11:58:41.408] fclones:  info: Found 15854 (4.2 TB) candidates after grouping by size
[2021-01-25 11:58:41.414] fclones:  info: Found 15694 (3.3 TB) candidates after grouping by paths and file identifiers
[2021-01-25 12:00:32.283] fclones:  info: Found 2159 (3.3 TB) candidates after grouping by prefix
[2021-01-25 12:00:53.996] fclones:  info: Found 2159 (3.3 TB) candidates after grouping by suffix
Grouping by contents        [=>                                                ]   139.20GB/5.97TB

The size reporting in the log messages seems accurate given the data I'm running this tool on, but what confuses me is the 5.97TB total grouping progress. If we have 3.3TB of candidates, I would expect to see matching numbers.

I expect this has something to do with the fact that a lot of the existing data consists of large files which are hard-linked and exist in two places, so depending on how the size count handles those files, that could be the source of the discrepancy. I'm not sure if this is just a reporting-clarity issue, or whether it means there is actually room for speeding up the hashing process. I'm not an expert, obviously, but I assume that if these hard-linked files are hashed once, it would be unnecessary to hash any other files/paths that point to the same data.
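To illustrate the point about hard links, here is a hedged sketch (not fclones' actual logic), assuming Unix metadata is available: paths that share a (device, inode) pair refer to the same data, so content only needs to be hashed the first time that pair is seen.

use std::collections::HashSet;
use std::os::unix::fs::MetadataExt;
use std::path::Path;

// Returns true only for the first path seen with a given (device, inode) pair;
// later hard links to the same data could skip hashing entirely.
fn needs_hashing(path: &Path, seen: &mut HashSet<(u64, u64)>) -> std::io::Result<bool> {
    let meta = std::fs::metadata(path)?;
    Ok(seen.insert((meta.dev(), meta.ino())))
}

fn main() -> std::io::Result<()> {
    let mut seen = HashSet::new();
    for p in ["a.bin", "b.bin"] {           // placeholder paths
        println!("{p}: needs hashing = {}", needs_hashing(Path::new(p), &mut seen)?);
    }
    Ok(())
}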

Add an option to process files by programs which modify files in-place

Some programs, like exiv2, modify files in place instead of writing the result to a new file and leaving the original untouched.
In such cases, using --transform is not possible without additional scripting to make a copy before modification.

A new option --transform-copy would first make a copy of each file into a temporary directory, and then invoke the external program on that copy.
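For the record, a rough sketch (purely illustrative, with hypothetical names) of what such a --transform-copy option could do internally: copy the file into a temporary location, run the in-place program on the copy, and hand the copy over for hashing, leaving the original untouched.

use std::fs;
use std::path::{Path, PathBuf};
use std::process::Command;

// Copies `original` into the system temp directory, runs `program` with `args`
// on the copy and returns the copy's path; the original file is never modified.
fn transform_copy(original: &Path, program: &str, args: &[&str]) -> std::io::Result<PathBuf> {
    let file_name = original.file_name().expect("path has no file name");
    let tmp = std::env::temp_dir().join(file_name);
    fs::copy(original, &tmp)?;
    let status = Command::new(program).args(args).arg(&tmp).status()?;
    if !status.success() {
        return Err(std::io::Error::new(std::io::ErrorKind::Other, "transform failed"));
    }
    Ok(tmp)
}

fn main() -> std::io::Result<()> {
    // `exiv2 rm` strips metadata in place, which is exactly the kind of program
    // this option is meant to accommodate.
    let copy = transform_copy(Path::new("photo.jpg"), "exiv2", &["rm"])?;
    println!("transformed copy at {}", copy.display());
    Ok(())
}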

Can't compile on FreeBSD

Tried to compile from source in a FreeBSD jail and got these errors; tried again with the verbose flag to get more information.

Compiling fclones v0.17.0

error[E0063]: missing field l_sysid in initializer of flock
--> /root/.cargo/registry/src/github.com-1ecc6299db9ec823/fclones-0.17.0/src/lock.rs:31:17
|
31 | let f = libc::flock {
| ^^^^^^^^^^^ missing l_sysid

error[E0063]: missing field l_sysid in initializer of flock
--> /root/.cargo/registry/src/github.com-1ecc6299db9ec823/fclones-0.17.0/src/lock.rs:47:17
|
47 | let f = libc::flock {
| ^^^^^^^^^^^ missing l_sysid

error: aborting due to 2 previous errors

For more information about this error, try rustc --explain E0063.
error: failed to compile fclones v0.17.0, intermediate artifacts can be found at /tmp/cargo-installPdO1Nf

Caused by:
could not compile fclones

See attached file for full message.
FreeBSD fclones Errors.docx
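The errors come from the flock struct having an extra l_sysid field on FreeBSD. A hedged sketch of one portable way to build it, assuming the libc crate (this is an assumption about a possible fix, not the actual upstream patch): zero-initialize the struct and set only the fields common to all platforms, so platform-specific fields never need to be named.

// Build an exclusive whole-file lock request without listing platform-specific
// fields such as l_sysid explicitly.
fn whole_file_write_lock() -> libc::flock {
    let mut f: libc::flock = unsafe { std::mem::zeroed() };
    f.l_type = libc::F_WRLCK as _;    // exclusive (write) lock
    f.l_whence = libc::SEEK_SET as _; // offsets relative to the start of the file
    f.l_start = 0;
    f.l_len = 0;                      // 0 length means "to the end of the file"
    f
}

fn main() {
    let _lock = whole_file_write_lock();
}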

`fclones dedupe` does not preserve mtimes on Linux

fclones dedupe doesn't seem to preserve mtimes on Linux. Preserving mtimes seems like something that should be both possible and desirable, but please let me know if I missed something.

I tested this with btrfs on NixOS 21.11-pre.

# uname -a
Linux ra 5.10.76-hardened1 #1-NixOS SMP Wed Oct 27 07:56:57 UTC 2021 x86_64 GNU/Linux

# fclones --version
fclones 0.17.0

# cp -a /etc/passwd ./

# touch --date 2009-01-01 passwd

# l
total 4,096
-rw-r--r-- 1 at at 3,891 2009-01-01 00:00 passwd

# cp -a passwd passwd.2

# l
total 8,192
-rw-r--r-- 1 at at 3,891 2009-01-01 00:00 passwd.2
-rw-r--r-- 1 at at 3,891 2009-01-01 00:00 passwd

# fclones group . | fclones dedupe
[2021-10-29 04:25:29.532] fclones:  info: Started grouping
[2021-10-29 04:25:29.540] fclones:  info: Scanned 3 file entries
[2021-10-29 04:25:29.540] fclones:  info: Found 2 (7.8 KB) files matching selection criteria
[2021-10-29 04:25:29.540] fclones:  info: Found 1 (3.9 KB) candidates after grouping by size
[2021-10-29 04:25:29.540] fclones:  info: Found 1 (3.9 KB) candidates after grouping by paths and file identifiers
[2021-10-29 04:25:29.552] fclones:  info: Found 1 (3.9 KB) candidates after grouping by prefix
[2021-10-29 04:25:29.552] fclones:  info: Found 1 (3.9 KB) candidates after grouping by suffix
[2021-10-29 04:25:29.552] fclones:  info: Found 1 (3.9 KB) redundant files
[2021-10-29 04:25:29.553] fclones:  info: Started deduplicating
[2021-10-29 04:25:29.561] fclones:  info: Processed 1 files and reclaimed up to 3.9 KB space

# l
total 8,192
-rw-r--r-- 1 at at 3,891 2009-01-01 00:00 passwd
-rw-r--r-- 1 at at 3,891 2021-10-29 04:25 passwd.2

I also tested a user xattr and it did seem to be preserved.
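As a workaround sketch (assuming the filetime crate; this is not fclones' own code), the modification time can be captured before a file is rewritten and restored afterwards:

use std::path::Path;

use filetime::{set_file_mtime, FileTime};

// Runs `op` on `path` (e.g. replacing the file's data with a reflink copy)
// and then restores the modification time recorded beforehand.
fn preserve_mtime<F>(path: &Path, op: F) -> std::io::Result<()>
where
    F: FnOnce(&Path) -> std::io::Result<()>,
{
    let mtime = FileTime::from_last_modification_time(&std::fs::metadata(path)?);
    op(path)?;
    set_file_mtime(path, mtime)
}

fn main() -> std::io::Result<()> {
    preserve_mtime(Path::new("passwd.2"), |_p| {
        // the deduplicating operation would go here
        Ok(())
    })
}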

unable to compile on aarch64 musl

hello,
I'm unable to (cross-)compile a musl binary for aarch64 ("cross" as in from Ubuntu, still aarch64).
[I also tried compiling natively on Alpine aarch64, with the same result.]

$ cargo install --target aarch64-unknown-linux-musl fclones
[cut]
   Compiling reflink v0.1.3
error[E0308]: mismatched types
  --> .cargo/registry/src/github.com-1ecc6299db9ec823/reflink-0.1.3/src/sys/unix.rs:21:39
   |
21 |         libc::ioctl(dest.as_raw_fd(), IOCTL_FICLONE, src.as_raw_fd())
   |                                       ^^^^^^^^^^^^^ expected `i32`, found `u64`
   |
help: you can convert a `u64` to an `i32` and panic if the converted value doesn't fit
   |
21 |         libc::ioctl(dest.as_raw_fd(), IOCTL_FICLONE.try_into().unwrap(), src.as_raw_fd())
   |                                                    ++++++++++++++++++++

For more information about this error, try `rustc --explain E0308`.
error: could not compile `reflink` due to previous error
warning: build failed, waiting for other jobs to finish...
error: failed to compile `fclones v0.17.1`

compiling with glibc works correctly

(rust 1.57.0)

regards,
m
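For reference, a hedged sketch of the kind of change that makes this compile on musl, where libc::ioctl takes the request as a c_int rather than glibc's c_ulong; casting the request constant to the expected type sidesteps the mismatch (illustrative only, not the reflink crate's actual fix):

use std::os::unix::io::AsRawFd;

// FICLONE ioctl number on Linux (_IOW(0x94, 9, int)).
const FICLONE: u64 = 0x4004_9409;

// Asks the kernel to share `src`'s data blocks with `dest` (reflink).
fn reflink(src: &std::fs::File, dest: &std::fs::File) -> std::io::Result<()> {
    // `as _` casts the request to whatever integer type this libc expects
    // (i32 on musl, u64 on glibc).
    let ret = unsafe { libc::ioctl(dest.as_raw_fd(), FICLONE as _, src.as_raw_fd()) };
    if ret == -1 {
        return Err(std::io::Error::last_os_error());
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    let src = std::fs::File::open("a.bin")?;                          // placeholder paths
    let dest = std::fs::OpenOptions::new().write(true).open("b.bin")?;
    reflink(&src, &dest)
}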

Feature request: Provide macOS builds

fclones looks great, but installing a whole Rust build stack to test it out is a bit of a barrier. At some point it would be great if macOS-compatible builds were generated and provided as part of the regular release process.

error: Failed to read file list: Malformed group header:

Problem: when I run the command "fclones group . | fclones remove" on Windows 10 (Build 19042.985), I get the error "Failed to read file list: Malformed group header: F:\Photos\Sorted Photos\2005\07._DSC00284.jpg."

I had ran "fclones group ." and it worked flawlessly, so I naturally wanted to remove the duplicate files. After including the "| flcones remove", it gave me this error. Is it something to do with the file starting with a '.'?

Running in Windows terminal using Powershell.
(screenshot attached)

The `--dry-run` output looks suspiciously like a usable shell script

The output of --dry-run for link and dedupe moves the original files out of the way via mv (just like fclones itself) but then completely ignores possible failure of the next command and removes the backup in the following step.

If the calling shell does not have errexit set this can lead to data loss (actually just filename loss) if the ln/cp fails.

I would suggest just printing the action for each line instead, and optionally emitting a shell header that enables proper error handling when the output is executed as a script.

Use ZFS checksums for faster comparison

There was a proposal for rdfind to use the existing ZFS checksums, which are created when a file is written to ZFS. This may result in much faster comparison, especially for big files on ZFS.
I think this would be a great enhancement for fclones.

Here is the original rdfind post.

Thank you for this nice tool!

Duplicate search between, but not within, two distinct directories?

Is it possible to use fclones to find duplicates between, but not within, two directory trees? Here's an example:

destination/
  2021/
    January/
      A.jpg

source/
  A1.jpg <-- copy of destination/2021/January/A.jpg (also same as A2.jpg)
  A2.jpg <-- copy of destination/2021/January/A.jpg (also same as A1.jpg)  
  B1.jpg <-- same as B2.jpg
  B2.jpg <-- same as B1.jpg

I want to identify A1.jpg and A2.jpg under source as duplicates of A.jpg in destination.

B1.jpg and B2.jpg are also duplicates, but only within source. They should be excluded from the match list because they don't match anything in destination.

FWIW, the use case is a source folder of images that have previously been processed by scripts to rename them and sort them into a destination directory structure (e.g. by year and month, or by other EXIF metadata). Then we come across a new folder of images, some of which may have been processed previously, and we want to know if we can safely delete them because we already have copies in the destination directory.
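One way to get this today is to post-process the report. Below is a hedged sketch that assumes a CSV report laid out as size, hash and count followed by the group's paths (an assumption), and paths containing no commas: for every group that has at least one file under destination/, it prints only the copies under source/.

use std::fs;

fn main() -> std::io::Result<()> {
    let report = fs::read_to_string("fclones_out.csv")?;
    for line in report.lines().skip(1) {                  // skip the header row
        let paths: Vec<&str> = line.split(',').skip(3).collect();
        if paths.iter().any(|p| p.starts_with("destination/")) {
            for p in paths.iter().filter(|p| p.starts_with("source/")) {
                println!("{p}");                          // a duplicate that is safe to remove
            }
        }
    }
    Ok(())
}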

Offer a way of deleting / hardlinking / softlinking duplicated files automatically

fclones should offer a way of deleting / hardlinking / softlinking duplicated files automatically.

In #25:

@pkolaczk wrote:

That's right, fclones doesn't offer any way of deleting files automatically yet. I believe this is a task for a different program (or a subcommand) that would take output of fclones.

and @piranna replied:

From a UNIX perspective, yes, it makes sense for that task to be done by another command, but it would be so tightly attached to the fclones output format... :-/ Maybe a shell script wrapper that offers an interface compatible with fdupes? :-) That would be easy to implement, but I'm not sure whether it should be hosted here in the fclones repo or be totally independent...

IMHO, a postprocessing script parsing the fclones output might require more complexity than adding a CLI switch. For instance, here's an (untested) python implementation that leverages the CSV output (expected in fclones_out.csv) to replace duplicates with hard links:

#!/usr/bin/env python

import csv
import logging
from os import link, unlink
from os.path import isfile


def main() -> None:
    # The report is expected to be the CSV output of fclones, where the first
    # three columns are size, hash and count, and the remaining columns are
    # the paths of one duplicate group.
    with open("fclones_out.csv", newline="") as f_handler:
        reader = csv.reader(f_handler)
        next(reader, None)  # skip the header row

        for row in reader:
            duplicates = row[3:]
            src = duplicates[0]

            for dst in duplicates[1:]:
                logging.debug("%s -> %s", src, dst)

                if isfile(dst):
                    unlink(dst)  # remove the duplicate...
                link(src, dst)   # ...and replace it with a hard link to the original


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    main()

PS: I think this deserves a ticket on its own, feel free to delete it if you don't agree. :-)

Problem with finding files on synology NAS share mounted as CIFS volume under Linux

Hello Piotr,

Thanks for your work on fclones,

I planned to use it for deduplication of files on my NAS; however, I encountered a strange problem.

Here are the directory contents on my NAS:

'IMG_20210416_204824 (1).jpg'*  'IMG_20210416_204830 (2).jpg'*   IMG_20210416_204845.jpg*       'IMG_20210416_223757 (1).jpg'*  'IMG_20210416_224002 (2).jpg'*   IMG_20210416_224021.jpg*        IMG_20210416_232550.jpg*
'IMG_20210416_204824 (2).jpg'*   IMG_20210416_204830.jpg*       'IMG_20210416_204847 (1).jpg'*  'IMG_20210416_223757 (2).jpg'*   IMG_20210416_224002.jpg*       'IMG_20210416_224022 (1).jpg'*   Thumbs.db*
 IMG_20210416_204824.jpg*       'IMG_20210416_204832 (1).jpg'*  'IMG_20210416_204847 (2).jpg'*   IMG_20210416_223757.jpg*       'IMG_20210416_224015 (1).jpg'*  'IMG_20210416_224022 (2).jpg'*   VID_20210416_224914.mp4*
'IMG_20210416_204827 (1).jpg'*  'IMG_20210416_204832 (2).jpg'*   IMG_20210416_204847.jpg*       'IMG_20210416_223758 (1).jpg'*  'IMG_20210416_224015 (2).jpg'*   IMG_20210416_224022.jpg*        VID_20210416_232553.mp4*
'IMG_20210416_204827 (2).jpg'*   IMG_20210416_204832.jpg*       'IMG_20210416_223755 (1).jpg'*  'IMG_20210416_223758 (2).jpg'*   IMG_20210416_224015.jpg*       'IMG_20210416_224107 (1).jpg'*
 IMG_20210416_204827.jpg*       'IMG_20210416_204845 (1).jpg'*  'IMG_20210416_223755 (2).jpg'*   IMG_20210416_223758.jpg*       'IMG_20210416_224021 (1).jpg'*  'IMG_20210416_224107 (2).jpg'*
'IMG_20210416_204830 (1).jpg'*  'IMG_20210416_204845 (2).jpg'*   IMG_20210416_223755.jpg*       'IMG_20210416_224002 (1).jpg'*  'IMG_20210416_224021 (2).jpg'*   IMG_20210416_224107.jpg*

Note that the files with (1) or (2) in their names are definitely duplicates; I confirmed this with the md5sum command (and they also have the same size).

Directory is mounted as type cifs (rw,relatime,vers=3.1.1,cache=strict,username=agnieszka,uid=1000,forceuid,gid=1000,forcegid,addr=10.0.0.10,file_mode=0755,dir_mode=0755,soft,nounix,serverino,mapposix,rsize=4194304,wsize=4194304,bsize=1048576,echo_interval=60,actimeo=1)

fclones --version 0.10.2
linux version 5.4.102-rt53-MANJARO

While in this directory with the duplicates, I run the following fclones command: fclones .

And I got this report:

[2021-04-17 20:51:17.333] fclones:  info: Started
[2021-04-17 20:51:18.125] fclones:  info: Scanned 1 file entries
[2021-04-17 20:51:18.125] fclones:  info: Found 0 (0 B) files matching selection criteria
[2021-04-17 20:51:18.125] fclones:  info: Found 0 (0 B) candidates after grouping by size
[2021-04-17 20:51:18.126] fclones:  info: Found 0 (0 B) candidates after grouping by paths and file identifiers
[2021-04-17 20:51:18.126] fclones:  info: Found 0 (0 B) candidates after grouping by prefix
[2021-04-17 20:51:18.126] fclones:  info: Found 0 (0 B) candidates after grouping by suffix
[2021-04-17 20:51:18.126] fclones:  info: Found 0 (0 B) duplicate files

And the same story (same report) for fclones . --names '*.jpg'

It looks like fclones does not see these files correctly. I thought this was because of their long names with whitespace (sorry, these are names generated by my phone). I renamed two duplicates to simple names like a.jpg and b.jpg, but I got the same result - no duplicates found.

Interestingly, I traced fclones with strace and there is not a single strace entry showing fclones reading any files in this location.

Finally, I copied all these files to a local directory on my disk and... same result - no duplicates found.

Please let me know if you need any additional data to diagnose this problem.

Thanks in advance

Pluggable logging and progress reporting

Currently logging and progress bar implementations are tightly coupled to the fclones engine. These should be replaced by traits + adapters so they can be swapped to different implementations e.g. in a GUI app.
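A minimal sketch of the proposed decoupling (names are hypothetical): the engine talks to a trait, and each front-end supplies its own adapter.

// The engine would depend only on this trait.
pub trait ProgressReporter: Send + Sync {
    fn log(&self, message: &str);
    fn progress(&self, done: u64, total: u64);
}

// Adapter a CLI front-end might provide; a GUI app would update widgets instead.
pub struct ConsoleReporter;

impl ProgressReporter for ConsoleReporter {
    fn log(&self, message: &str) {
        eprintln!("fclones: {message}");
    }
    fn progress(&self, done: u64, total: u64) {
        eprintln!("progress: {done}/{total}");
    }
}

fn main() {
    let reporter: Box<dyn ProgressReporter> = Box::new(ConsoleReporter);
    reporter.log("Started grouping");
    reporter.progress(139, 5970);
}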

Apply ordering to `fclones action --dry-run` results

I'm running this in order to test that my --priority and --keep-path values are right:

watch 'fclones remove --dry-run < results-file.out'

Unfortunately, it's not working because the results are not ordered. So the command is rerun, and the list items dance around every time watch updates.

I thought maybe concurrency was affecting it, so I tried --threads main:1 but that only seems to apply to the group action and not the others.

My workaround right now is to run the results through | sort but it's not ideal because I need to look for each result in the alphabetical list.

So my feature request is to add some ordering to the result list, preferably in the same order as the input list. Even if the ordering is not exposed in the CLI options in any way, it would be a help.

Processing files before reporting

Hi, Piotr

I found something obvious - but nevertheless interesting.

I ran my regex against the fclones report to get 2 text files - a set of unique files and a set of duplicate files (minus one copy to use as an original).

I found that the set of unique files still had some duplicate images, which differed in hash and file size only because of EXIF data.

It seems that the camera-sourced EXIF metadata was the superset, and a number of fields (maybe half of them) were dropped when the photos were imported into Apple iPhotos.

So that got a friend and me wondering how easy or hard it would be to pipe images (on the fly), stripped of EXIF data via exiftool, into fclones, which could then create the report ignoring the EXIF data (since it would no longer be there) - and then finally perhaps parse the report again to sort each group with the largest on-disk size first.

The largest file (where the EXIF data differs) would likely hold the richest metadata and be the best copy to keep as the original - and it would be easier to select with a regex if it were sorted to the top of each duplicate group.

Happy to have a play with this idea, but if you have any thoughts about it - specifically about ingesting exiftool output into fclones - I would be keen to hear them.

Cheers,
Stu

The first answer here suggests a similar approach to the same kind of problem:
ref: https://softwarerecs.stackexchange.com/questions/51032/compare-two-image-files-for-identical-data-excluding-metadata

Publish on crates.io

It would be nice to cargo install fclones, have cargo track versions, and so on. This could be part of the CI pipeline, and/or there's cargo-release which handles tagging and so on.

Detect changes after `fclones group` to avoid copying the wrong data

If files change after an fclones group run without updating the timestamps and remain the same size, then the fclones link command (and others) can lead to data loss:

$ mkdir z; cd z; echo same > 1; echo same > 2; echo abcd > Z
$ cat ?
same
same
abcd
$ fclones group . -o log 
$ cp -a Z 1   # timestamp is kept
$ fclones link < log
$ cat ?
abcd
abcd
abcd

This could be avoided by also checking that the ctime of a file is older than the start of the group run, and if not re-checking or aborting.
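A hedged sketch of that check on Unix (the report timestamp would have to come from the group run; names and values here are illustrative):

use std::os::unix::fs::MetadataExt;
use std::path::Path;

// Returns true when the file's ctime predates the group report, i.e. the file
// has not been changed (content or metadata) since the report was produced.
fn unchanged_since(path: &Path, report_time_secs: i64) -> std::io::Result<bool> {
    let meta = std::fs::metadata(path)?;
    Ok(meta.ctime() < report_time_secs)
}

fn main() -> std::io::Result<()> {
    // placeholder: the timestamp of the `fclones group` run
    let safe = unchanged_since(Path::new("1"), 1_622_950_000)?;
    println!("safe to link: {safe}");
    Ok(())
}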

Maybe even add a --paranoid option to check the content byte-by-byte before acting on it. But even in this case I am not aware of any (Unix) way to guarantee exclusive write access to a file, so maybe mention that the checked data is expected not to change.

Autodetect drive type and tune properly for HDD

Jody Bruchon found that the default performance on a single spinning drive was bad.

This doesn't surprise me, because all the settings, like parallelism level and buffer sizes, are tuned for SSDs, and they are really bad for spinning drives.
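On Linux, the drive type can at least be detected cheaply; a hedged sketch (mapping a file path to its block device is omitted here):

use std::fs;

// /sys/block/<device>/queue/rotational contains "1" for spinning drives
// and "0" for SSDs.
fn is_rotational(device: &str) -> bool {
    let path = format!("/sys/block/{device}/queue/rotational");
    fs::read_to_string(path)
        .map(|s| s.trim() == "1")
        .unwrap_or(false) // assume SSD-like defaults when unknown
}

fn main() {
    println!("sda is rotational: {}", is_rotational("sda"));
}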
