I really dig fclones - thank you for writing it! This is purely a th

Performance gains from even more small hash tests? about fclones HOT 5 OPEN

clunkclunk commented on August 15, 2024

from fclones.

Comments (5)

kapitainsky commented on August 15, 2024

> (side note - how big are the 'tiny blocks' referenced in the algorithm?)

They are also configurable:

      --max-prefix-size <BYTES>
          Maximum prefix size to check in bytes
          Units like KB, KiB, MB, MiB, GB, GiB are supported.
          Default: 16KiB for hard drives, 4KiB for SSDs

      --max-suffix-size <BYTES>
          Maximum suffix size to check in bytes
          Units like KB, KiB, MB, MiB, GB, GiB are supported.
          Default: 16KiB for hard drives, 4KiB for SSDs
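
To make the two limits concrete, here is a minimal Python sketch of a size + prefix + suffix fingerprint. It is only an illustration under assumptions: `prefix_suffix_key` is a hypothetical helper, SHA-256 is used just because it ships with Python, and none of this is taken from the fclones source.

```python
import hashlib
import os


def prefix_suffix_key(path, max_prefix=4096, max_suffix=4096):
    """Return (size, prefix hash, suffix hash) for a file.

    Hypothetical helper: max_prefix and max_suffix play the role of
    --max-prefix-size and --max-suffix-size, nothing more.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        # Read at most max_prefix bytes from the start of the file.
        prefix = f.read(min(size, max_prefix))
        # Jump to the last max_suffix bytes (or the start for short files).
        f.seek(max(0, size - max_suffix))
        suffix = f.read(max_suffix)
    return (
        size,
        hashlib.sha256(prefix).hexdigest(),
        hashlib.sha256(suffix).hexdigest(),
    )
```

Two files with equal keys are only candidates: bytes between the prefix and the suffix are never read, so a full content hash is still needed to confirm a match.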

gcflymoto commented on August 15, 2024

@clunkclunk the algorithm you describe is implemented by https://github.com/kornelski/dupe-krill

pkolaczk commented on August 15, 2024

> Why not have some more 'tiny block hashes' at various points in the file aside from the first and last tiny blocks to determine uniqueness before doing the time consuming entire file hash?

Can you give me an example of a situation where files of the same size would match with the beginning and end, but not the middle? Even matching the ends turns out to filter out very few files. I believe such situations would be extremely rare and that wouldn't justify the added complexity and cost.

johnpyp commented on August 15, 2024

For reference, here is how I typically speed up dedupes:

fclones group --cache /my/media/folder --min 500MiB --max-prefix-size 128MiB --max-suffix-size 64MiB --skip-content-hash

I set a large prefix size and suffix size, and skip the content hash altogether. I also skip small files, where there's a higher false negative likelihood and less benefit to this approach. I've processed several thousand media files and seen zero false negatives in cursory manual checks.

For what it's worth, I do see value in random / "statistical" checks, though, as mentioned, it would add a lot of complexity. What I like about them is that they give you a "reliable" statistical confidence instead of covering just the beginning and end. You get to set a threshold for how comfortable you are, mostly independent of the kind of file, rather than relying on the middle chunks not having been tampered with.
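
As a rough illustration of what such a threshold could look like, here is a small Python sketch. It assumes chunks are sampled uniformly and independently and that a fraction `diff_fraction` of the chunks differ between two files; both function names are hypothetical and nothing here exists in fclones.

```python
import math


def miss_probability(diff_fraction, samples):
    """Chance that `samples` random chunks all land on identical data.

    Assumes independent, uniform sampling; diff_fraction is the share
    of chunks that differ between the two files (0 < diff_fraction < 1).
    """
    if not 0.0 < diff_fraction < 1.0:
        raise ValueError("diff_fraction must be strictly between 0 and 1")
    return (1.0 - diff_fraction) ** samples


def samples_needed(diff_fraction, max_miss):
    """Smallest sample count that keeps the miss chance <= max_miss."""
    if not 0.0 < diff_fraction < 1.0 or not 0.0 < max_miss < 1.0:
        raise ValueError("arguments must be strictly between 0 and 1")
    return math.ceil(math.log(max_miss) / math.log(1.0 - diff_fraction))
```

The sample count grows quickly as the differing share shrinks, so this approach gives little confidence when two large files differ in only a handful of chunks.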

bmfrosty commented on August 15, 2024

I can't say what files they are (I can't seem to find a verbose option), but I see this when scanning a very large directory:

[2023-10-27 13:54:46.198] fclones:  info: Found 2303 (276.5 GB) candidates after grouping by prefix
[2023-10-27 13:54:53.994] fclones:  info: Found 2298 (274.9 GB) candidates after grouping by suffix
[2023-10-27 14:50:36.020] fclones:  info: Found 2186 (271.0 GB) redundant files

These should all be zip and rar files.

I should open up a ticket for a verbose option for what's unmatched between steps 4 and 5, but I definitely have files with the same size, prefix, and suffix that don't have the same content - or maybe just update to the latest version.

Another option would be to do random chunk matching of a configurable percentage of the file size using a seed based on the file size to pick which chunks.
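
A minimal Python sketch of that idea, assuming fixed-size chunks and a BLAKE2b digest. The helper names `sample_offsets` and `sampled_hash` are made up for illustration, and this is not something fclones implements.

```python
import hashlib
import os
import random


def sample_offsets(file_size, chunk_size=4096, fraction=0.01):
    """Pick chunk offsets deterministically from the file size alone."""
    total_chunks = max(1, -(-file_size // chunk_size))  # ceiling division
    count = min(total_chunks, max(1, int(total_chunks * fraction)))
    # Seeding with the size means equal-sized files get equal offsets.
    rng = random.Random(file_size)
    picked = rng.sample(range(total_chunks), count)
    return sorted(index * chunk_size for index in picked)


def sampled_hash(path, chunk_size=4096, fraction=0.01):
    """Hash only the sampled chunks of the file at `path`."""
    size = os.path.getsize(path)
    digest = hashlib.blake2b(digest_size=16)
    with open(path, "rb") as f:
        for offset in sample_offsets(size, chunk_size, fraction):
            f.seek(offset)
            digest.update(f.read(chunk_size))
    return digest.hexdigest()
```

Because the offsets depend only on the file size, any two files of the same size are sampled at the same positions, so their sampled hashes can be compared directly without storing the seed. Differing hashes prove the files differ; equal hashes only mean no difference was found in the sampled chunks.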
