Comments (3)
It doesn't have to keep the files open. It's fine to close and reopen them when needed.
The key insight is that if you represent each file as an array of hashes (one hash per chunk) and put these multi-hashes in a tree (btree or binary tree), then you don't need to know all of the hashes at once. You only need to compare them as bigger/smaller, which means you can stop reading files as soon as you find a difference. And when all of the files are in the same tree, each file is compared against only a minimal number of other files as you go down the tree, and each comparison needs only a minimal number of hashes.
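To make the idea concrete, here is a minimal sketch in Python. It is not czkawka code: the names `LazyFile`, `compare`, `insert`, `find_duplicates`, the 4096-byte `CHUNK_SIZE`, the choice of SHA-256, and the plain unbalanced binary tree are all assumptions made for illustration. Each chunk read opens the file, seeks, reads, and closes it again, so no file stays open between comparisons.

```python
import hashlib
import os

CHUNK_SIZE = 4096  # assumed chunk size, chosen only for illustration


class LazyFile:
    """A file whose per-chunk hashes are computed only when a comparison asks."""

    def __init__(self, path):
        self.path = path
        self.size = os.path.getsize(path)
        self.hashes = []  # chunk hashes computed so far, in order

    def chunk_hash(self, index):
        """Return the hash of chunk `index`, or None past the end of the file."""
        while len(self.hashes) <= index:
            offset = len(self.hashes) * CHUNK_SIZE
            if offset >= self.size:
                return None
            with open(self.path, "rb") as f:  # open, read one chunk, close
                f.seek(offset)
                data = f.read(CHUNK_SIZE)
            self.hashes.append(hashlib.sha256(data).digest())
        return self.hashes[index]


def compare(a, b):
    """Order two files by size, then chunk by chunk; stop at the first difference."""
    if a.size != b.size:
        return -1 if a.size < b.size else 1
    index = 0
    while True:
        ha, hb = a.chunk_hash(index), b.chunk_hash(index)
        if ha is None and hb is None:
            return 0  # every chunk matched
        if ha != hb:
            return -1 if ha < hb else 1
        index += 1


class Node:
    def __init__(self, file):
        self.files = [file]  # all files that compared equal to the first one
        self.left = None
        self.right = None


def insert(root, file):
    """Insert into an unbalanced binary tree; return the (possibly new) root."""
    if root is None:
        return Node(file)
    node = root
    while True:
        order = compare(file, node.files[0])
        if order == 0:
            node.files.append(file)
            return root
        side = "left" if order < 0 else "right"
        child = getattr(node, side)
        if child is None:
            setattr(node, side, Node(file))
            return root
        node = child


def find_duplicates(paths):
    """Return lists of paths whose files compared equal on every chunk hash."""
    root = None
    for path in paths:
        root = insert(root, LazyFile(path))
    groups, stack = [], [root]
    while stack:
        node = stack.pop()
        if node is None:
            continue
        if len(node.files) > 1:
            groups.append([f.path for f in node.files])
        stack.extend((node.left, node.right))
    return groups
```

Calling `find_duplicates(list_of_paths)` returns one list per group of matching files. A file that differs from its neighbours in the first chunk gets only one chunk hashed per comparison path, while true duplicates are necessarily hashed in full.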
from czkawka.
I tried to read and understand what is going on with lazy hashing, but I failed, because for now it seems that I can only read my own code.
But if I understand correctly, it opens the n files that are in a group with the same size, reads a part of each file, hashes it, and compares it with the other partial hashes. Then it throws out the unique hashes and repeats everything until the data ends.
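My understanding of that round-based variant, as a rough Python sketch. This is only an illustration, not the code from the linked project or from czkawka; `read_chunk`, `duplicates_by_rounds`, the 4096-byte `CHUNK_SIZE`, and SHA-256 are assumed names and choices. To stay away from any open-file limit, this version reopens each file for every chunk instead of keeping n files open.

```python
import hashlib
import os
from collections import defaultdict

CHUNK_SIZE = 4096  # assumed chunk size, chosen only for illustration


def read_chunk(path, offset):
    """Open the file, read one chunk at `offset`, and close it again."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(CHUNK_SIZE)


def duplicates_by_rounds(paths):
    """Group files by size, then split groups one chunk hash at a time."""
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    # Work items: (file size, offset of the next chunk to hash, candidate group).
    work = [(size, 0, group) for size, group in by_size.items() if len(group) > 1]
    duplicates = []
    while work:
        size, offset, group = work.pop()
        if offset >= size:
            duplicates.append(group)  # data ended and the group never split
            continue
        by_hash = defaultdict(list)
        for path in group:
            digest = hashlib.sha256(read_chunk(path, offset)).digest()
            by_hash[digest].append(path)
        for subgroup in by_hash.values():
            if len(subgroup) > 1:  # unique partial hashes are thrown out
                work.append((size, offset + CHUNK_SIZE, subgroup))
    return duplicates
```

A file leaves the process as soon as its partial hash is unique within its group, so only files that are still candidates get read any further.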
It looks like this should be a very fast solution, but it isn't suitable for the current Czkawka version:
- First, Windows appears to have a limit of 512 files open at one time. The current implementation opens at most one file per available virtual processor, but lazy hashing could probably exceed this limit when checking e.g. 1000 identical files.
- More importantly, it would really complicate caching data. The recently added caching feature is based on saving/loading the full file hash.
I'd like to link this thread with this idea #640