bitrot

Detects bit rotten files on the hard drive to save your precious photo and music collection from slow decay.

Usage

Go to the desired directory and simply invoke:

$ bitrot

This will start digging through your directory structure recursively indexing all files found. The index is stored in a .bitrot.db file which is a SQLite 3 database.

Next time you run bitrot it will add new files and update the index for files with a changed modification date. Most importantly however, it will report all errors, e.g. files that changed on the hard drive but still have the same modification date.
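The rule described above can be sketched in a few lines of Python. This is an illustration only, not bitrot's actual code: the `classify` helper, the `(mtime, sha1)` tuple standing in for a row of `.bitrot.db`, and the status strings are all hypothetical.

```python
def classify(stored, mtime, sha1):
    """Compare a freshly scanned file against its index entry.

    stored is None for an unknown path, otherwise a (mtime, sha1) tuple
    (a hypothetical stand-in for a row in the index).
    """
    if stored is None:
        return "new"
    old_mtime, old_sha1 = stored
    if old_mtime != mtime:
        return "updated"   # modification date changed: accept the new hash
    if old_sha1 != sha1:
        return "error"     # same date, different content: possible bit rot
    return "ok"
```

Only the "error" case points at silent corruption; a changed modification date is treated as a deliberate edit.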

All paths stored in .bitrot.db are relative so it's safe to rescan a folder after moving it to another drive. Just remember to move it in a way that doesn't touch modification dates. Otherwise the checksum database is useless.
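If the move is scripted in Python, `shutil.copy2` (the default copy function of `shutil.copytree`) carries modification times over where the platform allows it. A minimal sketch; the helper names are made up for illustration:

```python
import os
import shutil


def copy_tree_keeping_mtimes(src, dst):
    """Copy a directory tree; copy2 preserves file modification times."""
    return shutil.copytree(src, dst, copy_function=shutil.copy2)


def same_mtime(a, b):
    """Compare modification times at whole-second resolution."""
    return int(os.stat(a).st_mtime) == int(os.stat(b).st_mtime)
```

On the command line, `cp -p` and `rsync -t` preserve modification times in a similar way.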

Performance

Performance obviously depends on how fast the underlying drive is. Historically the script was single-threaded because, back in 2013, checksum calculations on a single core still outran typical drives, including the mobile SSDs of the day. As of 2020 this is no longer the case, so the script now uses a process pool to calculate SHA1 hashes and perform stat() calls.
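The general shape of hashing with a process pool can be sketched as follows. This is not the script's actual implementation: `stat_and_hash`, `hash_all`, and the chunk size are assumptions chosen for illustration.

```python
import hashlib
import os
from concurrent.futures import ProcessPoolExecutor, as_completed


def stat_and_hash(path, chunk_size=16384):
    """Worker: stat one file and compute its SHA1 in a separate process."""
    st = os.stat(path)
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return path, int(st.st_mtime), digest.hexdigest()


def hash_all(paths, workers=None):
    """Hash many files in parallel; workers=None lets the pool pick a size."""
    results = {}
    with ProcessPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(stat_and_hash, p) for p in paths]
        for future in as_completed(futures):
            path, mtime, sha1 = future.result()
            results[path] = (mtime, sha1)
    return results
```

The worker is a module-level function so it can be pickled and sent to the pool. On platforms that spawn worker processes, `hash_all` should be called from under an `if __name__ == "__main__":` guard.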

No rigorous performance tests have been done. Scanning a ~1000 file directory totalling ~5 GB takes 2.2s on a 2018 MacBook Pro 15" with an AP0512M SSD. Back in 2013, that same feat on a 2015 MacBook Air with an SM0256G SSD took over 20 seconds.

On that same 2018 MacBook Pro 15", scanning a 60+ GB music library takes 24 seconds. Back in 2013, with a typical 5400 RPM laptop hard drive it took around 15 minutes. How times have changed!

Tests

There's a simple but comprehensive test scenario using pytest and pytest-order.

Install:

$ python3 -m venv .venv
$ . .venv/bin/activate
(.venv)$ pip install -e .[test]

Run:

(.venv)$ pytest -x
==================== test session starts ====================
platform darwin -- Python 3.10.12, pytest-7.4.0, pluggy-1.2.0
rootdir: /Users/ambv/Documents/Python/bitrot
plugins: order-1.1.0
collected 12 items

tests/test_bitrot.py ............                      [100%]

==================== 12 passed in 15.05s ====================

Change Log

1.0.1

  • officially remove Python 2 support, which had been broken since 1.0.0 anyway; the package now works with Python 3.8+ because it relies on a few newer features

1.0.0

  • significantly sped up execution on solid state drives by using a process pool executor to calculate SHA1 hashes and perform stat() calls; use -w1 if your runs on slow magnetic drives were negatively affected by this change
  • sped up execution by pre-loading all SQLite-stored hashes to memory and doing comparisons using Python sets
  • all UTF-8 filenames are now normalized to NFKD in the database to enable cross-operating system checks
  • the SQLite database is now vacuumed to minimize its size
  • bugfix: additional Python 3 fixes when Unicode names were encountered
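
The NFKD normalization mentioned in the 1.0.0 notes matters because the same visible file name can be stored as different code point sequences, depending on where it was created. A small sketch using the standard library; `normalize_path` is a hypothetical helper, not a function from the script:

```python
import unicodedata


def normalize_path(path):
    """Normalize a path to NFKD so equivalent names compare equal."""
    return unicodedata.normalize("NFKD", path)


composed = "caf\u00e9.jpg"      # 'é' as a single code point
decomposed = "cafe\u0301.jpg"   # 'e' followed by a combining acute accent
assert composed != decomposed
assert normalize_path(composed) == normalize_path(decomposed)
```

Without this step, a tree copied between such systems could look like a set of missing files plus a set of new ones.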

0.9.2

  • bugfix: one place in the code incorrectly hardcoded UTF-8 as the filesystem encoding

0.9.1

  • bugfix: print the path that failed to decode with FSENCODING
  • bugfix: when using -q, don't hide warnings about files that can't be statted or read
  • bugfix: -s is no longer broken on Python 3

0.9.0

  • bugfix: bitrot.db checksum checking messages now obey --quiet
  • Python 3 compatibility

0.8.0

  • bitrot now keeps track of its own database's bitrot by storing a checksum of .bitrot.db in .bitrot.sha512
  • bugfix: now properly uses the filesystem encoding to decode file names for use with the .bitrot.db database. Report and original patch by pallinger.
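
The idea behind the .bitrot.sha512 file from 0.8.0 can be sketched like this. The helper names and the assumption that the side file holds a bare hex digest are illustrative; the script's real file format may differ.

```python
import hashlib


def file_sha512(path, chunk_size=16384):
    """SHA512 hex digest of a file, read in binary chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checksum(db_path, sum_path):
    """Store the database digest after a run (assumed format: hex text)."""
    with open(sum_path, "w") as f:
        f.write(file_sha512(db_path))


def verify_checksum(db_path, sum_path):
    """Return True if the database still matches the stored digest."""
    with open(sum_path) as f:
        expected = f.read().strip()
    return file_sha512(db_path) == expected
```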

0.7.1

  • bugfix: SHA1 computation now works correctly on Windows; previously files were opened in text mode. This fix will change hashes of files containing some specific bytes like 0x1A.

0.7.0

  • when a file changes or is renamed, the timestamp of the last check is updated, too
  • bugfix: files that disappeared during the run are now properly ignored
  • bugfix: files that are locked or with otherwise denied access are skipped. If they were read before, they will be considered "missing" in the report.
  • bugfix: if there are multiple files with the same content in the scanned directory tree, renames are now handled properly for them
  • refactored some horrible code to be a little less horrible

0.6.0

  • more control over performance with --commit-interval and --chunk-size command-line arguments
  • bugfix: symbolic links are now properly skipped (or can be followed if --follow-links is passed)
  • bugfix: files that cannot be opened are now gracefully skipped
  • bugfix: fixed a rare division by zero when run in an empty directory

0.5.1

  • bugfix: warn about test mode only in test mode

0.5.0

  • --test command-line argument for testing the state without updating the database on disk (works for testing databases you don't have write access to)
  • size of the data read is reported upon finish
  • minor performance updates

0.4.0

  • renames are now reported as such
  • all non-regular files (e.g. symbolic links, pipes, sockets) are now skipped
  • progress presented in percentage

0.3.0

  • --sum command-line argument for easy comparison of multiple databases

0.2.1

  • fixed regression from 0.2.0 where new files caused a KeyError exception

0.2.0

  • --verbose and --quiet command-line arguments
  • if a file is no longer there, its entry is removed from the database

0.1.0

  • First published version.

Authors

Glued together by Łukasz Langa. Multiple improvements by Ben Shepherd, Jean-Louis Fuchs, Marcus Linderoth, p1r473, Peter Hofmann, Phil Lundrigan, Reid Williams, Stan Senotrusov, Yang Zhang, and Zhuoyun Wei.

bitrot's Issues

Tests fail

Tests fail in Ubuntu 22.04

$ python --version
Python 3.8.16
$ ./test-bitrot.bats 
 ✓ bitrot command exists 
 ✓ bitrot detects new files in a tree dir 
 ✓ bitrot detects modified files in a tree dir 
 ✓ bitrot detects renamed files in a tree dir 
 ✓ bitrot detects delete files in a tree dir 
 ✓ bitrot detects new files and modified in a tree dir  
 ✗ bitrot detects new files, modified, deleted and moved in a tree dir 
   (in test file test-bitrot.bats, line 115)
     `[[ ${lines[13]}  = " from ./more-files-a.txt to ./more-files-a.txt2" ]]' failed
 ✓ bitrot detects new files, modified, deleted and moved in a tree dir 2 
 ✓ bitrot can operate with 3278 files easily in a dir (1) 
 ✓ bitrot can operate with 3278 files easily in a dir (2) 
 ✗ bitrot can detect rotten bits in a dir (1)
   (in test file test-bitrot.bats, line 191)
     `[[ ${lines[2]}   = "3301 entries in the database, 2 entries new:" ]]' failed
 ✓ bitrot can detect rotten bits in a dir (2) 
 ✓ Clean everything 

13 tests, 2 failures

Python 3.12 deprecation warning for date function

This is a great program, thanks for writing and maintaining it! I was about to write something myself, but was pleased to find an existing tested tool that does pretty much exactly what I want :)

When I run on Windows with python 3.12.0 I get the following deprecation warning. It's probably a fairly easy fix, and it's not at all urgent.

bitrot.py:73: DeprecationWarning: datetime.datetime.utcnow() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.now(datetime.UTC).
  return datetime.datetime.utcnow().strftime('%Y-%m-%d %H:%M:%S%z')
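
One possible fix, sketched here under the assumption that the stored timestamp text should stay unchanged. The function name `utc_timestamp` is made up; only the format string comes from the warning above. `datetime.timezone.utc` is used instead of `datetime.UTC` because the latter only exists on newer Python versions.

```python
import datetime


def utc_timestamp():
    """Current UTC time as text, without the deprecated utcnow()."""
    now = datetime.datetime.now(datetime.timezone.utc)
    # Dropping tzinfo keeps the output identical to the old naive value,
    # for which %z expands to an empty string.
    return now.replace(tzinfo=None).strftime("%Y-%m-%d %H:%M:%S%z")
```

Keeping the datetime aware would instead append "+0000" to every new timestamp, which may or may not be wanted for values already in the database.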

Last good hash date

Thanks for the software. I had an issue with an external drive when connecting it to a Windows computer. I ran bitrot on the drive, and it did report some hash errors. The date of the 'last good hash checked' on those files was in 2022. But I've run bitrot on that drive since 2022. Shouldn't the 'last good hash checked' be the last time I ran bitrot before the error occurred?

High Memory Usage

Hi there,

Love this library, just found it and it seems to work exactly as I want, except for one issue. It just ran my box out of memory and caused it to crash.

I'm trying to check a fairly large batch of files (about 3.7TB worth or 1,206,600 files) and bitrot really chews through the RAM (causing my box to crash). All up it seems to need 4.3GB of RAM to run, which does seem like a lot.

I'd prefer not to split my checks into multiple smaller sets if at all possible, but obviously I can't have my system crashing.

Any ideas on what I can do to fix this issue?

My system is:
AMD64
Debian Stretch
Python3.8

bitrot for xmp files

Several free software photography tools, such as darktable, store all metadata and the edits they make to an image in an XMP file with the same name. The original file is never touched.
Would anybody be interested in giving bitrot a feature where the bitrot information is stored in these same XMP files instead of the SQLite database file?
The main benefit is that other apps, such as darktable or digiKam, would also be able to access and use the bitrot information from the XMP file. bitrot could even be integrated into and launched from these apps directly.
Most photographers don't store checksums of their photos but, just like everybody else, they suffer the consequences when bit rot happens. This way, bit rot can be detected as soon as it occurs, and the photographer can delete the corrupt file and restore a backup of just that file.

Hanging with parallel multi-processor futures

Hi
Ever since upgrading to 1.0, the program has been hanging when I try to hash around 4 TB on magnetic disks.
I am using -w 1 for only one worker.

I believe the current implementation of futures or the pool executor may be causing a deadlock or some sort of sleep condition.

However, multi-CPU processing is beyond my area of expertise.
Is anyone else seeing hangs when trying to hash many terabytes of data?

If I kill it while it's hung, I get this:

Traceback (most recent call last):
  File "c:\python3\lib\multiprocessing\process.py", line 315, in _bootstrap
    self.run()
  File "c:\python3\lib\multiprocessing\process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "c:\python3\lib\concurrent\futures\process.py", line 233, in _process_worker
    call_item = call_queue.get(block=True)
  File "c:\python3\lib\multiprocessing\queues.py", line 97, in get
    res = self._recv_bytes()
  File "c:\python3\lib\multiprocessing\connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "c:\python3\lib\multiprocessing\connection.py", line 305, in _recv_bytes
    waitres = _winapi.WaitForMultipleObjects(
KeyboardInterrupt
Traceback (most recent call last):
  File "C:\Startup\scripts\Bitrot\src\bitrot.py", line 1731, in <module>
  File "C:\Startup\scripts\Bitrot\src\bitrot.py", line 1725, in run_from_command_line

  File "C:\Startup\scripts\Bitrot\src\bitrot.py", line 810, in run
    for future in as_completed(futures):
  File "C:\Startup\scripts\Bitrot\src\bitrot.py", line 810, in <listcomp>
    for future in as_completed(futures):
  File "c:\python3\lib\concurrent\futures\process.py", line 643, in submit
    self._queue_management_thread_wakeup.wakeup()
  File "c:\python3\lib\concurrent\futures\process.py", line 90, in wakeup
    self._writer.send_bytes(b"")
  File "c:\python3\lib\multiprocessing\connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "c:\python3\lib\multiprocessing\connection.py", line 280, in _send_bytes
    ov, err = _winapi.WriteFile(self._handle, buf, overlapped=True)
KeyboardInterrupt

No such file or directory

Might be some kind of race condition when run over folders where files are actively and rapidly changing?

Traceback (most recent call last):
  File "/Users/tailee/.virtualenvs/bitrot/bin/bitrot", line 10, in <module>
    execfile(__file__)
  File "/Users/tailee/Projects/bitrot/bin/bitrot", line 30, in <module>
    run_from_command_line()
  File "/Users/tailee/Projects/bitrot/src/bitrot.py", line 323, in run_from_command_line
    chunk_size=args.chunk_size,
  File "/Users/tailee/Projects/bitrot/src/bitrot.py", line 138, in run
    st = os.stat(p)
OSError: [Errno 2] No such file or directory: './.dropbox/instance1/filecache.dbx-journal'
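
A file that vanishes between directory listing and the stat() call can be tolerated by catching the error and moving on. A minimal sketch, not the project's actual fix; `stat_or_skip` is a hypothetical helper:

```python
import os
import sys


def stat_or_skip(path):
    """Return os.stat(path), or None if the file cannot be statted."""
    try:
        return os.stat(path)
    except OSError as exc:  # includes FileNotFoundError and PermissionError
        print(f"warning: cannot stat {path}: {exc}", file=sys.stderr)
        return None
```

The caller would then simply skip any path for which this returns None.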

“python_requires” should be set with “>=3”, as bitrot 1.0.0 is not compatible with all Python versions.

Currently, the keyword argument python_requires of setup() is not set, and thus it is assumed that this distribution is compatible with all Python versions.
However, I found it is not compatible with Python2. My local Python version is 2.7, and I encounter the following error when executing “pip install bitrot”

Collecting bitrot
  Downloading bitrot-1.0.0.tar.gz (11 kB)
    ERROR: Command errored out with exit status 1:
     command: /usr/local/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-v9p1bP/bitrot/setup.py'"'"'; __file__='"'"'/tmp/pip-install-v9p1bP/bitrot/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-install-v9p1bP/bitrot/pip-egg-info
         cwd: /tmp/pip-install-v9p1bP/bitrot/
    Complete output (9 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-v9p1bP/bitrot/setup.py", line 39, in <module>
        from bitrot import VERSION
      File "/tmp/pip-install-v9p1bP/bitrot/src/bitrot.py", line 190, in <module>
        class Bitrot(object):
      File "/tmp/pip-install-v9p1bP/bitrot/src/bitrot.py", line 193, in Bitrot
        chunk_size=DEFAULT_CHUNK_SIZE, workers=os.cpu_count(),
    AttributeError: 'module' object has no attribute 'cpu_count'
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

I noticed that bitrot.py used the function os.cpu_count. os.cpu_count only exists in Python 3, resulting in installation failure of bitrot in Python2.

Way to fix:
modify setup() in setup.py, add python_requires keyword argument:

setup(…
     python_requires='>=3',
     …)

Thanks for your attention.
Best regards,
PyVCEchecker

Rename handling logic broken

To be honest, I don't know why the rename handling logic does what it does, so you're probably the best person to remedy this, since you at least know what it's supposed to be doing. In any case, I regularly run into "column path is not unique" errors, which I've reproduced here just from reading the code:

$ mkdir /tmp/bitrot/

$ cd /tmp/bitrot/

$ echo a > a

$ echo a > b

$ bitrot
Finished. 0.00 MiB of data read. 0 errors found.
2 entries in the database, 2 new, 0 updated, 0 renamed, 0 missing.

$ mv a c

$ mv b d

$ bitrot
 50.0%Traceback (most recent call last):
  File "/home/yang/.virtualenvs/bitrot/bin/bitrot", line 8, in <module>
    execfile(__file__)
  File "/home/yang/bitrot/bin/bitrot", line 30, in <module>
    run_from_command_line()
  File "/home/yang/bitrot/src/bitrot.py", line 265, in run_from_command_line
    chunk_size=args.chunk_size)
  File "/home/yang/bitrot/src/bitrot.py", line 151, in run
    (new_mtime, p_uni, update_ts, new_sha1))
sqlite3.IntegrityError: column path is not unique
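
One way to avoid this kind of collision is to pair each vanished path with at most one new path of the same hash. The sketch below is hypothetical and does not reflect the script's real rename logic; `match_renames` and the plain dicts standing in for the database and the scan are assumptions.

```python
from collections import defaultdict


def match_renames(stored, found):
    """Pair vanished paths with new paths that have the same hash.

    stored and found map path -> hash. Each old path is used at most once,
    so two identical files are never renamed onto the same entry.
    """
    missing_by_hash = defaultdict(list)
    for path in sorted(stored.keys() - found.keys()):
        missing_by_hash[stored[path]].append(path)

    renames, new = [], []
    for path in sorted(found.keys() - stored.keys()):
        candidates = missing_by_hash.get(found[path])
        if candidates:
            renames.append((candidates.pop(0), path))
        else:
            new.append(path)
    return renames, new


# The reproduction above: a and b have equal content, then become c and d.
print(match_renames({"a": "h1", "b": "h1"}, {"c": "h1", "d": "h1"}))
# prints ([('a', 'c'), ('b', 'd')], [])
```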

Bitrot is not reading full file contents

After trying to figure out why this script was so much faster than expected, I stepped through the code and determined that all files are being opened and read in text mode. This means that the data which is being hashed for any given binary file is randomly truncated after a 0x1A byte, making the entire exercise moot.

This may be fixed by changing the open command to use 'rb' mode.
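
A minimal sketch of hashing in binary mode; the function name and chunk size are illustrative rather than taken from the script:

```python
import hashlib


def sha1_binary(path, chunk_size=16384):
    """SHA1 of the full file contents, read in binary mode."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:  # "rb": no newline or end-of-file translation
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

With "rb", a 0x1A byte is hashed like any other byte instead of ending the read early.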

Unfortunately fixing this means that lots and lots of checksums will become invalid.

Example with added debug output:

dir
39 blah - Copy (2).dat
3 blah - Copy (3) - Copy.dat
3 blah - Copy (3).dat
112 blah - Copy.dat
11 blah.dat

python.exe -m bitrot --verbose
.\blah - Copy (2).dat; Length: 0; Checksum: da39a3ee5e6b4b0d3255bfef95601890afd80709
.\blah - Copy (3) - Copy.dat; Length: 1; Checksum: 5ba93c9db0cff93f52b521d7420e43f6eda2784f
.\blah - Copy (3).dat; Length: 1; Checksum: 5ba93c9db0cff93f52b521d7420e43f6eda2784f
.\blah - Copy.dat; Length: 0; Checksum: da39a3ee5e6b4b0d3255bfef95601890afd80709
.\blah.dat; Length: 11; Checksum: 3a4d8abb5811f6b58b9755ca65ffc01d38f9153f

Note multiple duplicate checksums.

With binary read mode:

python.exe -m bitrot --verbose
File: .\blah - Copy (2).dat; Length: 39; Checksum: ddfb5399fc8f39f26e43f7e3807ae919ee88fe59
File: .\blah - Copy (3) - Copy.dat; Length: 3; Checksum: 4684f40f78d7474c93464241cf4a1ccaa012d7d3
File: .\blah - Copy (3).dat; Length: 3; Checksum: edf5298c70ff205a98c17fd199ddd610e9e2c7c6
File: .\blah - Copy.dat; Length: 112; Checksum: 733fdb8b5cc69814ff448b87af8b02681a749907
File: .\blah.dat; Length: 11; Checksum: 3a4d8abb5811f6b58b9755ca65ffc01d38f9153f

Improve performance using threads

Hi,

I've noticed that the run time grows with the number of files. I'm trying to use it on a large number of files and it takes very long.

cd /tmp; mkdir more ; cd more/
#create a 320KB file                                                              
dd if=/dev/zero of=masterfile bs=1 count=327680                                 
#split it in 32768 files (instantly) + masterfile = 32769                       
split -b 10 -a 10 masterfile    
#waiiiiiiiiiiiiiiiiiiiiiiiit
bitrot -v 

I suppose it could be made to use threads, which should improve things a lot, because computing the SHA1 and inserting into SQLite for x files on a single thread is too old school nowadays. ;)

I'm using an Intel i7, so I have plenty of spare threads to burn. I reckon something like a central buffer/MQ/DB/x, into which the files to be hashed are inserted, with n threads that calculate the hashes and insert/update them (or a separate thread just for SQLite), could work; the threads would collect files from the central buffer, n at a time. Sounds like a cool project. ;)
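
A rough sketch of that layout, with worker threads hashing and a single thread writing to SQLite. Everything here is hypothetical (the `hashes` table, the helper names, the worker count) and it is not how the script is structured. Version 1.0.0, per the change log, went with a process pool instead.

```python
import hashlib
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def sha1_of(path, chunk_size=16384):
    """Worker: return (path, SHA1 hex digest) for one file."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return path, digest.hexdigest()


def index_files(paths, conn, workers=4):
    """Hash in worker threads; only the calling thread touches SQLite."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hashes (path TEXT PRIMARY KEY, sha1 TEXT)"
    )
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for path, sha1 in pool.map(sha1_of, paths):
            conn.execute(
                "INSERT OR REPLACE INTO hashes VALUES (?, ?)", (path, sha1)
            )
    conn.commit()
```

Keeping all database writes in one thread sidesteps SQLite's restrictions on sharing a connection between threads. For a tree of tiny files such as the one above, per-file overhead probably dominates, so the gain from threads may be modest.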

I'm using it for this tool. ;)

https://github.com/liloman/heal-bitrots

Cheers!

Random stalls when running on large directories

bitrot normally shows progress as a running percentage shortly after checking bitrot.db integrity.
This running percentage always appears quickly for relatively small directories.

For large directories like my home directory on macOS 12.6.1, bitrot may or may not show this running percentage. When it does show it, all is well and bitrot executes as expected. When it does not show it (most of the time), bitrot stalls right after integrity checking and may never complete its execution.

I am new to Python, so I cannot readily investigate, though I could help pinpoint the issue given instructions.

IOError when lacking permissions

bitrot fails totally if it encounters a file it can't read.

File "/usr/local/bin/bitrot", line 30, in <module>
    run_from_command_line()
  File "/usr/local/lib/python2.7/site-packages/bitrot.py", line 247, in run_from_command_line
    run(verbosity=verbosity, test=args.test)
  File "/usr/local/lib/python2.7/site-packages/bitrot.py", line 120, in run
    new_sha1 = sha1(p)
  File "/usr/local/lib/python2.7/site-packages/bitrot.py", line 48, in sha1
    with open(path) as f:
IOError: [Errno 13] Permission denied: './file'

I think it should be more graceful. Either skip with/without logging or abort. Any opinion on this?
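
The three options could look roughly like this. The `on_error` switch and the function name are hypothetical, written in Python 3, where the IOError from the traceback above is an alias of OSError:

```python
import hashlib
import sys


def sha1_or_none(path, on_error="warn", chunk_size=16384):
    """Hash a file; on a read failure skip silently, warn, or abort.

    on_error is a made-up switch taking "skip", "warn", or "abort".
    """
    digest = hashlib.sha1()
    try:
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
    except OSError as exc:  # PermissionError is a subclass of OSError
        if on_error == "abort":
            raise
        if on_error == "warn":
            print(f"warning: cannot read {path}: {exc}", file=sys.stderr)
        return None
    return digest.hexdigest()
```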

Permission denied.

Not sure why it wouldn't have permission, but bitrot should probably handle permissions errors instead of crashing with a traceback :)

Traceback (most recent call last):
  File "/Users/tailee/.virtualenvs/bitrot/bin/bitrot", line 10, in <module>
    execfile(__file__)
  File "/Users/tailee/Projects/bitrot/bin/bitrot", line 30, in <module>
    run_from_command_line()
  File "/Users/tailee/Projects/bitrot/src/bitrot.py", line 323, in run_from_command_line
    chunk_size=args.chunk_size,
  File "/Users/tailee/Projects/bitrot/src/bitrot.py", line 125, in run
    st = os.lstat(p)
OSError: [Errno 13] Permission denied: './BitTorrent Sync/Projects/prezto/.git/logs/HEAD'

Ignoring files or directories, or specifying files to be scanned

I know this is quite a big feature, but it could be nice to have: I have a lot of small backup files in a directory, and bitrot is progressing really slowly (0.1% in ~20 hours).
One simple implementation would be for bitrot to accept a predefined file list, which could be generated by find or a similar tool that already has exclude options. I may start to work on it, if I have the time. :)
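
Reading such a list could be as simple as the sketch below. How bitrot would actually accept the list (a flag, standard input, a file) is left open; `paths_from_list` is a hypothetical helper.

```python
import io


def paths_from_list(stream):
    """Yield one path per non-empty line of a text stream."""
    for line in stream:
        path = line.rstrip("\n")
        if path:
            yield path


listing = io.StringIO("./a.txt\n\n./dir/b.txt\n")
print(list(paths_from_list(listing)))
# prints ['./a.txt', './dir/b.txt']
```

A list like this could come from something like `find . -type f ! -name '*.bak'`. File names that contain newlines would need a different separator.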
