Comments (5)

kapitainsky commented on August 15, 2024

(side note - how big are the 'tiny blocks' referenced in the algorithm?)

They are also configurable:

      --max-prefix-size <BYTES>
          Maximum prefix size to check in bytes
          Units like KB, KiB, MB, MiB, GB, GiB are supported.
          Default: 16KiB for hard drives, 4KiB for SSDs

      --max-suffix-size <BYTES>
          Maximum suffix size to check in bytes
          Units like KB, KiB, MB, MiB, GB, GiB are supported.
          Default: 16KiB for hard drives, 4KiB for SSDs
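
As a rough illustration of what these two limits bound, here is a minimal Python sketch of a prefix/suffix fingerprint. It is not fclones' actual code: the function name `edge_fingerprint`, the use of SHA-256, and the fixed 16 KiB defaults are all assumptions made for the example.

```python
import hashlib
import os


def edge_fingerprint(path, prefix_size=16 * 1024, suffix_size=16 * 1024):
    """Return (size, prefix_hash, suffix_hash) for the file at `path`.

    Hypothetical helper: only the first `prefix_size` and last
    `suffix_size` bytes are read, so the middle of the file is never
    inspected.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        prefix = f.read(min(prefix_size, size))
        # For files shorter than suffix_size, the suffix is the whole file.
        f.seek(max(0, size - suffix_size))
        suffix = f.read(suffix_size)
    return (
        size,
        hashlib.sha256(prefix).hexdigest(),
        hashlib.sha256(suffix).hexdigest(),
    )
```

Two files with equal fingerprints are only candidates; a difference that lies outside both windows goes unnoticed until the full content is hashed.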

from fclones.

gcflymoto commented on August 15, 2024

@clunkclunk the algorithm you describe is implemented by https://github.com/kornelski/dupe-krill

pkolaczk commented on August 15, 2024

Why not have some more 'tiny block hashes' at various points in the file aside from the first and last tiny blocks to determine uniqueness before doing the time consuming entire file hash?

Can you give me an example of a situation where files of the same size would match with the beginning and end, but not the middle? Even matching the ends turns out to filter out very few files. I believe such situations would be extremely rare and that wouldn't justify the added complexity and cost.

johnpyp commented on August 15, 2024

For reference, here is how I typically speed up dedupes:

fclones group --cache /my/media/folder --min 500MiB --max-prefix-size 128MiB --max-suffix-size 64MiB --skip-content-hash 

I set a large prefix size and suffix size, and skip the content hash altogether. I also skip small files, where there's a higher false negative likelihood and less benefit to this approach. I've processed several thousand media files and seen zero false negatives in cursory manual checks.


For what it's worth, I do see value in random / "statistical" checks, though, as mentioned, they would add a lot of complexity. What I like about them is that they give you a "reliable" statistical confidence instead of checking just the beginning and end. You get to set a threshold for how comfortable you are that is mostly independent of the kind of file, rather than relying on the middle chunks not having been tampered with.
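
To make the "statistical confidence" idea concrete, here is a small Python sketch of the arithmetic. It assumes chunks are sampled independently and uniformly (with replacement) and that a fraction `differing_fraction` of the chunks differs between two files; both function names are made up for illustration, and nothing here is an fclones feature.

```python
import math


def miss_probability(differing_fraction, samples):
    """Chance that `samples` independent random chunks all land on
    identical regions, so the difference goes undetected."""
    if not 0.0 < differing_fraction < 1.0:
        raise ValueError("differing_fraction must be strictly between 0 and 1")
    return (1.0 - differing_fraction) ** samples


def samples_needed(differing_fraction, max_miss_probability):
    """Smallest number of samples that keeps the miss probability at or
    below `max_miss_probability` under the same assumptions."""
    if not 0.0 < differing_fraction < 1.0:
        raise ValueError("differing_fraction must be strictly between 0 and 1")
    if not 0.0 < max_miss_probability < 1.0:
        raise ValueError("max_miss_probability must be strictly between 0 and 1")
    return math.ceil(
        math.log(max_miss_probability) / math.log(1.0 - differing_fraction)
    )
```

The threshold only helps when a meaningful share of the file differs: if two files differ in a single chunk out of millions, `differing_fraction` is tiny and the required sample count approaches a full read.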

bmfrosty commented on August 15, 2024

I can't say what files they are (I can't seem to find a verbose option), but I see this when scanning a very large directory:

[2023-10-27 13:54:46.198] fclones:  info: Found 2303 (276.5 GB) candidates after grouping by prefix
[2023-10-27 13:54:53.994] fclones:  info: Found 2298 (274.9 GB) candidates after grouping by suffix
[2023-10-27 14:50:36.020] fclones:  info: Found 2186 (271.0 GB) redundant files

These should all be zip and rar files.

I should open a ticket for a verbose option that shows what's unmatched between steps 4 and 5 (or maybe just update to the latest version), but I definitely have files with the same size, prefix, and suffix that don't have the same content.

Another option would be to do random chunk matching of a configurable percentage of the file size using a seed based on the file size to pick which chunks.
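
A minimal Python sketch of that idea, under stated assumptions: the name `sampled_fingerprint`, the 4 KiB chunk size, the 1% default, and SHA-256 are all illustrative choices, not anything fclones provides. Seeding the generator with the file size means every file of a given size is sampled at the same offsets, so the resulting hashes are comparable.

```python
import hashlib
import os
import random


def sampled_fingerprint(path, chunk_size=4096, fraction=0.01):
    """Hash a pseudo-random subset of chunks chosen from the file size.

    Hypothetical helper: files of equal size always get the same chunk
    indices (within one Python version), because the RNG seed is the size.
    """
    size = os.path.getsize(path)
    n_chunks = max(1, -(-size // chunk_size))  # ceiling division
    n_samples = min(n_chunks, max(1, int(n_chunks * fraction)))
    rng = random.Random(size)  # seed derived from the file size
    indices = sorted(rng.sample(range(n_chunks), n_samples))
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for index in indices:
            f.seek(index * chunk_size)
            digest.update(f.read(chunk_size))
    return size, digest.hexdigest()
```

Sorting the indices keeps reads moving forward through the file, which tends to matter on spinning disks. As with prefix and suffix matching, equal fingerprints mark candidates rather than proven duplicates unless `fraction` is 1.0.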
