
Comments (3)

kornelski commented on July 28, 2024

It doesn't have to keep the files open. It's fine to close and reopen them when needed.

The key insight is that if you split a file into multiple hashes (an array of hashes, one per chunk) and put these multi-hashes in a tree (btree or binary tree), then you don't need to know all of the hashes at once. You only need to compare them as bigger/smaller, so you can stop reading files as soon as you find a difference. And when all of the files are in the same tree, each file is compared against only a minimal number of other files as you go down the tree, and each comparison needs only a minimal number of hashes.
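To make the idea concrete, here is a minimal sketch under some assumptions, not the actual implementation of any tool. `LazyFile`, `compare`, `insert` and the 64 KiB `CHUNK` size are hypothetical names and values chosen for illustration, `DefaultHasher` stands in for a real content hash, and a sorted `Vec` searched by bisection stands in for the tree. Each chunk read opens the file, reads one chunk, and closes it again.

```rust
use std::cmp::Ordering;
use std::collections::hash_map::DefaultHasher;
use std::fs::File;
use std::hash::Hasher;
use std::io::{self, Read, Seek, SeekFrom};
use std::path::PathBuf;

const CHUNK: usize = 64 * 1024; // hypothetical chunk size

/// A file whose per-chunk hashes are computed only when a comparison asks for them.
struct LazyFile {
    path: PathBuf,
    hashes: Vec<u64>, // hashes of the chunks read so far
    eof: bool,        // true once the last chunk has been hashed
}

impl LazyFile {
    fn new(path: impl Into<PathBuf>) -> Self {
        LazyFile { path: path.into(), hashes: Vec::new(), eof: false }
    }

    /// Hash of chunk `i`, or `None` past the end of the file.
    /// The file is opened for each chunk read and closed again afterwards.
    fn chunk_hash(&mut self, i: usize) -> io::Result<Option<u64>> {
        while self.hashes.len() <= i && !self.eof {
            let mut f = File::open(&self.path)?;
            f.seek(SeekFrom::Start((self.hashes.len() * CHUNK) as u64))?;
            let mut buf = Vec::with_capacity(CHUNK);
            f.take(CHUNK as u64).read_to_end(&mut buf)?;
            if buf.is_empty() {
                self.eof = true;
            } else {
                let mut h = DefaultHasher::new();
                h.write(&buf);
                self.hashes.push(h.finish());
                if buf.len() < CHUNK {
                    self.eof = true;
                }
            }
        }
        Ok(self.hashes.get(i).copied())
    }
}

/// Orders two files by their chunk hashes, stopping at the first difference.
fn compare(a: &mut LazyFile, b: &mut LazyFile) -> io::Result<Ordering> {
    let mut i = 0;
    loop {
        let (ha, hb) = (a.chunk_hash(i)?, b.chunk_hash(i)?);
        match ha.cmp(&hb) {
            Ordering::Equal if ha.is_none() => return Ok(Ordering::Equal),
            Ordering::Equal => i += 1,
            other => return Ok(other),
        }
    }
}

/// Inserts `file` into `groups`, which is kept sorted by each group's first file.
/// Files whose chunk hashes all match end up in the same group.
fn insert(groups: &mut Vec<Vec<LazyFile>>, mut file: LazyFile) -> io::Result<()> {
    let (mut lo, mut hi) = (0, groups.len());
    while lo < hi {
        let mid = (lo + hi) / 2;
        match compare(&mut file, &mut groups[mid][0])? {
            Ordering::Equal => {
                groups[mid].push(file);
                return Ok(());
            }
            Ordering::Less => hi = mid,
            Ordering::Greater => lo = mid + 1,
        }
    }
    groups.insert(lo, vec![file]);
    Ok(())
}
```

A file that differs from its neighbours in the first chunk only ever gets that one chunk hashed, while true duplicates are read to the end. Matching hashes are still only strong evidence of equal content, so a byte-for-byte check may be wanted before deleting anything.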

from czkawka.

qarmin commented on July 28, 2024

I tried to read and understand what is going on with lazy hashing, but I failed, because it seems that for now I can only read my own code.

But if I understand it correctly, it opens the n files that are in a group with the same size, reads a part of each file, hashes it, and compares it with the other partial hashes. Then it throws out the unique hashes and repeats everything until the data ends.

It looks like this should be a very fast solution, but it isn't suitable for the current Czkawka version:

  • First, Windows appears to have a limit of 512 files open at one time. The current implementation opens at most one file per available virtual processor, but lazy hashing could probably exceed this limit when checking e.g. 1000 identical files.
  • Most importantly, it would really complicate caching data. The recently added caching feature is based on saving/loading the full file hash.
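The round-based reading described above could be sketched roughly like this. It is only an illustration of that description, not code from any existing tool: `duplicates_by_rounds` and its parameters are hypothetical, `DefaultHasher` stands in for a real content hash, `chunk` is assumed to be greater than zero, and all inputs are assumed to have the same size. Note that the caller hands in already opened readers, so every candidate stays open for the whole run.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::Hasher;
use std::io::{self, Read};

/// Returns groups of names whose readers produced identical chunk hashes
/// all the way to the end of the data.
fn duplicates_by_rounds<R: Read>(
    files: Vec<(String, R)>,
    chunk: u64,
) -> io::Result<Vec<Vec<String>>> {
    let mut pending = vec![files]; // groups that still need narrowing down
    let mut done = Vec::new();
    while let Some(group) = pending.pop() {
        // One round: read the next chunk of every file in the group
        // and bucket the files by the hash of that chunk.
        let mut buckets: HashMap<(u64, bool), Vec<(String, R)>> = HashMap::new();
        for (name, mut reader) in group {
            let mut buf = Vec::new();
            reader.by_ref().take(chunk).read_to_end(&mut buf)?;
            let at_end = buf.is_empty();
            let mut h = DefaultHasher::new();
            h.write(&buf);
            buckets
                .entry((h.finish(), at_end))
                .or_default()
                .push((name, reader));
        }
        for ((_, at_end), bucket) in buckets {
            if bucket.len() < 2 {
                continue; // unique partial hash: throw this file out
            }
            if at_end {
                // Data ended with all partial hashes equal: report as duplicates.
                done.push(bucket.into_iter().map(|(name, _)| name).collect());
            } else {
                pending.push(bucket); // still identical so far: next round
            }
        }
    }
    Ok(done)
}
```

The order of the returned groups is not specified, since it follows `HashMap` iteration order.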


x2es commented on July 28, 2024

I'd like to link this thread with the idea in #640.

