lakshmipathi / dduper

Fast block-level out-of-band BTRFS deduplication tool.

License: GNU General Public License v2.0

Python 58.27% Dockerfile 4.60% Shell 37.13%

btrfs btrfs-progs btrfs-tools deduplication dedupe

dduper's Issues

Add information about `dump-csum` to README

Just tried it out and got lots of messages like:

btrfs inspect-internal: unknown token 'dump-csum'.

The README should explain how to apply the patch, or where to get a version of btrfs-progs that supports the dump-csum command.

dduper --analyze saves magic disk space

I have 7GB of data on disk, yet --analyze reports that 22GB can be reclaimed by dedupe. Please explain how this magic happens :D

                :                      0m                       :               
      8192      : /mnt/fn_abcd_50m_200m:/mnt/fn_cdcdcd_50m_300m   :    278528
      8192      : /mnt/fn_abcd_50m_200m:/mnt/fn_pqsrt_50m_250m    :       0
      8192      : /mnt/fn_abcd_50m_200m:/mnt/fn_pqsrt_100m_500m   :       0
      8192      : /mnt/fn_abac_50m_200m:/mnt/fn_cdcdcd_50m_300m   :    139264
      8192      : /mnt/fn_abac_50m_200m:/mnt/fn_pqsrt_50m_250m    :       0
      8192      : /mnt/fn_abac_50m_200m:/mnt/fn_pqsrt_100m_500m   :       0
      8192      : /mnt/fn_cdcdcd_50m_300m:/mnt/fn_pqsrt_50m_250m  :       0
      8192      : /mnt/fn_cdcdcd_50m_300m:/mnt/fn_pqsrt_100m_500m :       0
      8192      : /mnt/fn_pqsrt_50m_250m:/mnt/fn_pqsrt_100m_500m  :    491520
================================================================================
dduper:23117824KB of duplicate data found with chunk size:8192KB
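
One possible explanation, offered as an assumption rather than a confirmed description of dduper's internals: if the summary total adds up the duplicate size reported for every file pair, the same on-disk chunks are counted once per pair, so the total can exceed the size of the data itself. The sketch below uses hypothetical file names and sizes to show the effect.

```python
from itertools import combinations


def pairwise_duplicate_total(file_sizes_kb):
    """Sum duplicate KB over every file pair, assuming all files are identical copies."""
    total = 0
    for (_, size_a), (_, size_b) in combinations(file_sizes_kb.items(), 2):
        # Two identical copies share as much data as the smaller file holds.
        total += min(size_a, size_b)
    return total


# Hypothetical data set: four identical 1024KB files.
files = {"copy1": 1024, "copy2": 1024, "copy3": 1024, "copy4": 1024}
on_disk_kb = sum(files.values())
reported_kb = pairwise_duplicate_total(files)
print(on_disk_kb, reported_kb)
```

With four copies there are six pairs, so 4096KB on disk turns into 6144KB of pairwise duplicates. The gap grows quadratically with the number of copies.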

Failed to build pysqlite3

Hi!
I have tested dduper on a Raspberry Pi 3B+ with Raspberry Pi OS.
Unfortunately, the installation does not succeed, because the following error occurs:

Building wheels for collected packages: pysqlite3
  Building wheel for pysqlite3 (setup.py) ... done
  WARNING: Legacy build of wheel for 'pysqlite3' created no files.
  Command arguments: /usr/bin/python3 -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-5pjwwdhd/pysqlite3_1d66649f2505475fa84dcb126b8dd6d9/setup.py'"'"'; __file__='"'"'/tmp/pip-install-5pjwwdhd/pysqlite3_1d66649f2505475fa84dcb126b8dd6d9/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-g4vittxi
  Command output: [use --verbose to show]
  Running setup.py clean for pysqlite3
Failed to build pysqlite3

The rest could be installed without problems according to the log.
I have tested the same installation method on a normal x64 system with Debian 10. There, everything worked great and the tool runs wonderfully.

port to python3

This code seems to use Python 2.

Distros are removing it. I had to create a python2 virtualenv to be able to install beautifultable, but the version installed via pip does not work:

(venv) root@box:~/dduper# ./dduper --device /dev/mapper/something --dir /mnt/something
Traceback (most recent call last):
  File "./dduper", line 28, in <module>
    from beautifultable import BeautifulTable
  File "/root/dduper/venv/lib/python2.7/site-packages/beautifultable/__init__.py", line 5, in <module>
    from .beautifultable import (  # noqa F401
  File "/root/dduper/venv/lib/python2.7/site-packages/beautifultable/beautifultable.py", line 36, in <module>
    from .utils import (
  File "/root/dduper/venv/lib/python2.7/site-packages/beautifultable/utils.py", line 39
    def ensure_type(value, *types, varname="value"):
                                         ^
SyntaxError: invalid syntax

Sadly, this makes this tool unusable on my end.

Publish dduper to pypi and make available via pip

Provide a one-liner installation via PyPI: pip3 install dduper

Reduce dduper docker image size

The docker image laks/dduper created via the Dockerfile is 287MB. Reduce it, preferably to around 100MB. Hint: docker history laks/dduper

Output extremely verbose

I tried running sudo dduper -Dm on my backups and it outputs an endless list of skipped files. Maybe it could be a bit less verbose, e.g. just output a counter of skipped files, or put them in a log file?

Also: these kinds of messages should probably be printed to stderr rather than stdout.
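
A minimal sketch of that suggestion, assuming the skipped files are the ones below the 4KB size limit that dduper's own messages mention. The function name filter_files and the logger name are hypothetical and not taken from dduper's code. Per-file messages go to the debug level, a single counter is reported as a warning, and everything is written to stderr.

```python
import logging
import sys

# Send diagnostics to stderr so stdout stays clean for actual results.
logging.basicConfig(stream=sys.stderr, level=logging.WARNING, format="%(message)s")
log = logging.getLogger("dduper-sketch")  # hypothetical logger name

MIN_SIZE = 4096  # bytes; the 4KB limit is an assumption based on dduper's messages


def filter_files(sizes_by_path, min_size=MIN_SIZE):
    """Return (kept_paths, skipped_count) and report skips as one summary line."""
    kept, skipped = [], 0
    for path, size in sizes_by_path.items():
        if size < min_size:
            skipped += 1
            # Only visible when the log level is lowered to DEBUG.
            log.debug("skipped %s (%d bytes)", path, size)
        else:
            kept.append(path)
    if skipped:
        log.warning("skipped %d file(s) smaller than %d bytes", skipped, min_size)
    return kept, skipped


kept, skipped = filter_files({"/mnt/a": 10, "/mnt/b": 8192, "/mnt/c": 100})
```

Passing a file handler to the logger instead of the stderr stream would cover the "put them in a log file" variant.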

what about read only subvolumes?

Is it possible to dedupe read-only subvolumes? I couldn't find an option for it.
I have a bunch of read-only snapshots, but I can't use dduper on them.

Btrfs-progs patch issue - Hunk #1 FAILED at 158.

I just discovered dduper today and was trying to set it up, but ran into the following issue while cut/pasting the INSTALL.md steps to apply the btrfs-progs patch.

08:23:58 evil@H510 ~/src/dduper/btrfs-progs» patch -p1 < ../patch/btrfs-progs-v5.6.1/0001-Print-csum-for-a-given-file-on-stdout.patch
patching file Makefile
Hunk #1 FAILED at 158.
1 out of 1 hunk FAILED -- saving rejects to file Makefile.rej
patching file cmds/commands.h
patching file cmds/inspect-dump-csum.c
patching file cmds/inspect.c
Hunk #1 succeeded at 667 (offset -3 lines).

Makefile.rej:

08:28:14 evil@H510 ~/src/dduper/btrfs-progs» cat Makefile.rej 
--- Makefile
+++ Makefile
@@ -158,7 +158,8 @@ cmds_objects = cmds/subvolume.o cmds/filesystem.o cmds/device.o cmds/scrub.o \
               cmds/rescue-super-recover.o \
               cmds/property.o cmds/filesystem-usage.o cmds/inspect-dump-tree.o \
               cmds/inspect-dump-super.o cmds/inspect-tree-stats.o cmds/filesystem-du.o \
-              mkfs/common.o check/mode-common.o check/mode-lowmem.o
+              mkfs/common.o check/mode-common.o check/mode-lowmem.o \
+              cmds/inspect-dump-csum.o
 libbtrfs_objects = send-stream.o send-utils.o kernel-lib/rbtree.o btrfs-list.o \
                   kernel-lib/radix-tree.o extent-cache.o extent_io.o \
                   crypto/crc32c.o common/messages.o \

Looking at the Makefile, it seems more cmds_objects have been added since the original patch. I manually added cmds/inspect-dump-csum.o to my Makefile to work around this, but thought you'd want to know so that you can update the patch.

Add support for --csum xxhash

Right now dduper works only with crc32. Add support for other checksum types such as xxhash, blake2, and sha256.

ci failure for specific dataset

Examine and fix https://gitlab.com/giis/dduper/-/jobs/720569752 failure.

Can't use chunk sizes less than 128k

I'm trying to dedupe a folder of highly redundant data. A 128k chunk size barely makes a difference, but using smaller chunks with duperemove made a significant difference.

Can't find or match chunks on subvolume which uses blake2 csum

Running dduper on a subvolume doesn't seem to work. Both directories have the same two files. Both files are canceled dd copies of my boot drive.

Output from subvolume:

[bluemond@BlueQ dduper]$ sudo python2 ./dduper --device /dev/sda1 --dir /btrfs/subvol/ddtest/ --dry-run
Prefect match :  /btrfs/subvol/ddtest/sbd.img /btrfs/subvol/ddtest/sbd.img2
Summary
blk_size : 4KB  chunksize : 8192KB
/btrfs/subvol/ddtest/sbd.img has 0 chunks
/btrfs/subvol/ddtest/sbd.img2 has 0 chunks
Matched chunks: 0
Unmatched chunks: 0
Total size(KB) available for dedupe: 0
dduper took 32.3749928474 seconds
[bluemond@BlueQ dduper]$ sudo python2 ./dduper --device /dev/sda1 --dir /btrfs/subvol/ddtest/
Prefect match :  /btrfs/subvol/ddtest/sbd.img /btrfs/subvol/ddtest/sbd.img2
************************
Dedupe completed for /btrfs/subvol/ddtest/sbd.img:/btrfs/subvol/ddtest/sbd.img2
Summary
blk_size : 4KB  chunksize : 8192KB
/btrfs/subvol/ddtest/sbd.img has 0 chunks
/btrfs/subvol/ddtest/sbd.img2 has 0 chunks
Matched chunks: 0
Unmatched chunks: 0
Total size(KB) deduped: 0
dduper took 32.7617127895 seconds

Output from rootvolume:

[bluemond@BlueQ dduper]$ sudo python2 ./dduper --device /dev/sda1 --dir /btrfs/ddtest/ --dry-run
Summary
blk_size : 4KB  chunksize : 32KB
/btrfs/ddtest/sbd.img has 184064 chunks
/btrfs/ddtest/sbd.img2 has 84480 chunks
Matched chunks: 32066
Unmatched chunks: 52414
Total size(KB) available for dedupe: 1026112
dduper took 36.9195628166 seconds
[bluemond@BlueQ dduper]$ sudo python2 ./dduper --device /dev/sda1 --dir /btrfs/ddtest/
************************
Dedupe completed for /btrfs/ddtest/sbd.img:/btrfs/ddtest/sbd.img2
Summary
blk_size : 4KB  chunksize : 32KB
/btrfs/ddtest/sbd.img has 184064 chunks
/btrfs/ddtest/sbd.img2 has 84480 chunks
Matched chunks: 32066
Unmatched chunks: 52414
Total size(KB) deduped: 0
dduper took 204.889986038 seconds

Also I'm not sure why the total size deduped is 0 on the actual dedupe...

I am using blake2 as the csum on a 6-drive array with raid5 data and raid1 metadata.

Validate behaviour with sparse files.

Need to verify how dduper behaves for sparse files.

Autoupdate `laks/dduper` docker image from ci job

Instead of building docker image manually and pushing it to docker hub, include it as part of CI.

Segmentation fault in btrfs.static causes AssertionError

Running sudo dduper --device /dev/nvme0n1p5 --dir / --recurse fails with:

Dedupe completed for /core:/app/docker/kafka1/data/__confluent.support.metrics-0/00000000000000000000.log
Summary
blk_size : 4KB  chunksize : 128KB
/core has 28 chunks
/app/docker/kafka1/data/__confluent.support.metrics-0/00000000000000000000.log has 1 chunks
Matched chunks: 0
Unmatched chunks: 1
Total size(KB) deduped: 0
Traceback (most recent call last):
  File "/usr/sbin/dduper", line 594, in <module>
    main(results)
  File "/usr/sbin/dduper", line 465, in main
    dedupe_dir(results.dir_path, results.dry_run, results.recurse)
  File "/usr/sbin/dduper", line 456, in dedupe_dir
    dedupe_files(file_list, dry_run)
  File "/usr/sbin/dduper", line 410, in dedupe_files
    ret = do_dedupe(src_file, dst_file, dry_run)
  File "/usr/sbin/dduper", line 225, in do_dedupe
    assert len(out2) != 0
AssertionError

Adding --verbose provides no additional information. dmesg contains

[ 1361.934038] btrfs.static[9668]: segfault at ffffffffb7bd9228 ip 00000000005228d4 sp 00007fff7e417848 error 5 in btrfs.static[401000+189000]
[ 1361.934043] Code: 0e 88 0f c3 c5 fa 6f 06 c5 fa 6f 4c 16 f0 c5 fa 7f 07 c5 fa 7f 4c 17 f0 c3 48 8b 4c 16 f8 48 8b 36 48 89 4c 17 f8 48 89 37 c3 <8b> 4c 16 fc 8b 36 89 4c 17 fc 89 37 c3 0f b7 4c 16 fe 0f b7 36 66

Experienced with v0.04-9-g78155b6 on Ubuntu 20.04 with Linux 5.8.0-43-generic.
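
The bare AssertionError hides the real cause, which is a crashed helper process. As a sketch only (run_checked and DumpCsumError are hypothetical names, not part of dduper), a caller could check the exit status and output of the helper and raise a descriptive error instead of asserting on an empty result:

```python
import subprocess
import sys


class DumpCsumError(RuntimeError):
    """Raised when the checksum helper fails or returns nothing."""


def run_checked(cmd):
    """Run cmd and return its stdout, raising a descriptive error on failure."""
    proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    if proc.returncode < 0:
        # On POSIX a negative return code means the child died from a signal,
        # e.g. -11 for SIGSEGV.
        raise DumpCsumError(f"{cmd[0]} was killed by signal {-proc.returncode}")
    if proc.returncode != 0:
        raise DumpCsumError(f"{cmd[0]} exited with status {proc.returncode}")
    if not proc.stdout:
        raise DumpCsumError(f"{cmd[0]} produced no output")
    return proc.stdout


# Demonstration with the Python interpreter standing in for the real helper.
output = run_checked([sys.executable, "-c", "print('ok')"])
```

A caller can then skip the offending file and continue with the rest of the directory rather than aborting the whole run.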

Fix pycodestyle reported issues

Resolve the following:

$ pycodestyle dduper 
dduper:67:5: E265 block comment should start with '# '
dduper:67:80: E501 line too long (88 > 79 characters)
dduper:147:80: E501 line too long (89 > 79 characters)
dduper:154:1: E302 expected 2 blank lines, found 1
dduper:155:5: E266 too many leading '#' for block comment
dduper:159:22: E262 inline comment should start with '# '
dduper:162:23: E262 inline comment should start with '# '
dduper:240:80: E501 line too long (132 > 79 characters)
dduper:258:80: E501 line too long (96 > 79 characters)
dduper:292:5: E303 too many blank lines (2)
dduper:293:1: E101 indentation contains mixed spaces and tabs
dduper:293:1: W191 indentation contains tabs
dduper:294:1: W191 indentation contains tabs
dduper:294:2: E101 indentation contains mixed spaces and tabs
dduper:295:1: W191 indentation contains tabs
dduper:295:2: E101 indentation contains mixed spaces and tabs
dduper:296:1: W191 indentation contains tabs
dduper:297:1: W191 indentation contains tabs
dduper:297:2: E101 indentation contains mixed spaces and tabs
dduper:298:1: W191 indentation contains tabs
dduper:299:1: W191 indentation contains tabs
dduper:299:2: E101 indentation contains mixed spaces and tabs
dduper:300:1: W191 indentation contains tabs
dduper:301:1: W191 indentation contains tabs
dduper:301:2: E101 indentation contains mixed spaces and tabs
dduper:302:1: W191 indentation contains tabs
dduper:303:1: W191 indentation contains tabs
dduper:303:2: E101 indentation contains mixed spaces and tabs
dduper:304:1: W191 indentation contains tabs
dduper:304:2: E101 indentation contains mixed spaces and tabs
dduper:306:1: E101 indentation contains mixed spaces and tabs
dduper:314:80: E501 line too long (88 > 79 characters)
dduper:319:5: E303 too many blank lines (2)
dduper:321:12: E111 indentation is not a multiple of four
dduper:322:14: W291 trailing whitespace
dduper:323:12: E111 indentation is not a multiple of four
dduper:324:40: E231 missing whitespace after ','
dduper:326:66: E228 missing whitespace around modulo operator
dduper:326:76: E231 missing whitespace after ','
dduper:326:80: E501 line too long (86 > 79 characters)
dduper:369:1: E101 indentation contains mixed spaces and tabs
dduper:369:1: W191 indentation contains tabs
dduper:370:1: E101 indentation contains mixed spaces and tabs
dduper:373:80: E501 line too long (112 > 79 characters)
dduper:376:13: E265 block comment should start with '# '
dduper:377:1: E101 indentation contains mixed spaces and tabs
dduper:377:1: W191 indentation contains tabs
dduper:377:3: E265 block comment should start with '# '
dduper:377:80: E501 line too long (82 > 79 characters)
dduper:378:1: E101 indentation contains mixed spaces and tabs
dduper:394:30: W291 trailing whitespace
dduper:506:80: E501 line too long (88 > 79 characters)
dduper:509:5: E303 too many blank lines (2)
dduper:520:20: W291 trailing whitespace
dduper:527:22: E231 missing whitespace after ','
dduper:527:26: E231 missing whitespace after ','
dduper:527:31: E231 missing whitespace after ','
dduper:527:36: E231 missing whitespace after ','
dduper:527:41: E231 missing whitespace after ','
dduper:531:35: E261 at least two spaces before inline comment
dduper:531:36: E262 inline comment should start with '# '
dduper:532:14: E231 missing whitespace after ','
dduper:539:1: E101 indentation contains mixed spaces and tabs
dduper:539:1: W191 indentation contains tabs
dduper:540:1: E101 indentation contains mixed spaces and tabs
dduper:542:16: E111 indentation is not a multiple of four
dduper:543:16: E111 indentation is not a multiple of four
dduper:544:5: E101 indentation contains mixed spaces and tabs
dduper:544:5: W191 indentation contains tabs
dduper:544:13: E111 indentation is not a multiple of four
dduper:544:14: E231 missing whitespace after ','
dduper:545:16: E111 indentation is not a multiple of four
dduper:546:16: E111 indentation is not a multiple of four
dduper:547:16: E111 indentation is not a multiple of four
dduper:548:16: E111 indentation is not a multiple of four
dduper:549:2: E101 indentation contains mixed spaces and tabs
dduper:549:2: W191 indentation contains tabs
dduper:550:80: E501 line too long (98 > 79 characters)
dduper:553:27: E225 missing whitespace around operator
dduper:553:49: E225 missing whitespace around operator
dduper:553:60: E703 statement ends with a semicolon

btrfs-progs 5.16.2 will not build with patch

cmds/inspect-dump-csum.c: In function ‘btrfs_lookup_extent’:
cmds/inspect-dump-csum.c:166:53: error: ‘struct btrfs_fs_info’ has no member named ‘csum_root’; did you mean ‘fs_root’?
166 | u16 csum_size = btrfs_super_csum_size(info->csum_root->fs_info->super_copy);
...

Question on --device parameter

My BTRFS setup is RAID10, so I don't have a single device tied to my BTRFS array. I wasn't sure what I am supposed to pass in this case. Do I just pass in any device that is part of the array for the file/folder I am trying to dedupe?

For example, if I want to dedupe everything in '/mnt/ddimages', and that is part of a BTRFS pool that is comprised of 4 disks:

/dev/sda
/dev/sdb
/dev/sdc
/dev/sdd

Would I use the command:

dduper --device /dev/sda --dir /mnt/ddimages

Will that work as expected? Or will it somehow only dedupe the files that BTRFS has stored on /dev/sda? Do I then need to run the command 4 times, once for each device?

I guess my question is why do we need to specify --device at all? Isn't that something that can be determined based on the mount that holds the file/folders specified?
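
As a sketch of how the device could be looked up from the path, assuming a Linux mount table in the usual "device mountpoint fstype options ..." layout: the helper find_mount below is hypothetical and is not how dduper currently works. It picks the longest mount point that contains the given path. Escaped characters in mount points (such as \040 for a space) are not handled here.

```python
import os


def find_mount(path, mounts_text):
    """Return (device, mount_point) for the longest mount point containing path."""
    path = os.path.abspath(path)
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        device, mount_point = fields[0], fields[1]
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            if best is None or len(mount_point) > len(best[1]):
                best = (device, mount_point)
    return best


# Sample text in /proc/self/mounts format; on a real system read that file instead.
SAMPLE = """\
/dev/nvme0n1p2 / ext4 rw 0 0
/dev/sda /mnt/ddimages btrfs rw,relatime 0 0
"""
result = find_mount("/mnt/ddimages/a.img", SAMPLE)
print(result)
```

For a multi-device BTRFS file system the mount table typically names only one member device, so this yields a single entry point for the whole pool rather than one device per disk.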

finding duplicate by providing a single file (not by directory)

I run a site.

When a user uploads a file, the site needs to tell instantly whether it is a duplicate of an existing file, since the user won't stay on the upload page to wait for a full HDD scan.
When the site admin deletes a file (say it turned out to have illegal content), the administrator wants to delete every copy of that file. However, if only one path is deleted, there might be other duplicate paths pointing to the same blocks.

Basically what I ask for is something like

$ dduper --file new_upload.mp4 --device /dev/sda3

which returns immediately whether or not a duplicate of new_upload.mp4 exists in /dev/sda3. Bonus if new_upload.mp4 doesn't have to be inside /dev/sda3.

Thanks!
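
Until something like --file exists, the lookup could be approximated at the application level. The sketch below is not a dduper feature and does not use BTRFS checksums. It is a plain whole-file SHA-256 index, with UploadIndex as a hypothetical class name, that answers both questions (is this upload a duplicate, and which paths hold copies) with a dictionary lookup.

```python
import hashlib


class UploadIndex:
    """In-memory map from content digest to the paths holding that content."""

    def __init__(self):
        self._paths_by_digest = {}

    @staticmethod
    def digest(data):
        return hashlib.sha256(data).hexdigest()

    def add(self, path, data):
        """Record path and return the paths that already held identical content."""
        key = self.digest(data)
        existing = list(self._paths_by_digest.get(key, []))
        self._paths_by_digest.setdefault(key, []).append(path)
        return existing

    def copies_of(self, data):
        """Return every recorded path whose content matches data."""
        return list(self._paths_by_digest.get(self.digest(data), []))


index = UploadIndex()
first = index.add("/uploads/a.mp4", b"video bytes")
second = index.add("/uploads/b.mp4", b"video bytes")
```

A real site would persist the index and hash large files in chunks. This catches whole-file duplicates only, not shared blocks inside otherwise different files.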

Validate dduper results with different checksum types

Linux ≥ 5.5 and btrfs-progs ≥ 5.4 finally bring support for checksum algorithms that are stronger than CRC32C. xxHash, SHA256, and BLAKE2 are supported with kernel+btrfs-progs newer than these.

Add test scripts for RAID config.

Right now dduper has a minimal test script to check basic functionality; see ci/gitlab/*.sh. Enhance it to add RAID tests.

dedupe fedora silverblue results in AssertionError

[lizelive@fedora ~]$ sudo podman run --rm -it --device /dev/sda2 --privileged -v /var/home/lizelive/.local/share/containers/storage/:/mnt docker.io/laks/dduper dduper --device /dev/sda2 --dir /mnt --analyze --recurse
Traceback (most recent call last):
  File "/usr/sbin/dduper", line 575, in <module>
    main(results)
  File "/usr/sbin/dduper", line 465, in main
    dedupe_dir(results.dir_path, results.dry_run, results.recurse)
  File "/usr/sbin/dduper", line 456, in dedupe_dir
    dedupe_files(file_list, dry_run)
  File "/usr/sbin/dduper", line 410, in dedupe_files
    ret = do_dedupe(src_file, dst_file, dry_run)
  File "/usr/sbin/dduper", line 224, in do_dedupe
    assert len(out1) != 0
AssertionError

what is the correct way to dedupe?

related to #48

Provide visual representation of deduplication analysis.

This is just a fun task :) Provide analysis results in a visual format.

Two files with duplicate data detected by smaller chunk size 256KB.

Same two files with no duplicate data since it uses 1024KB as chunk size.

Whole file with duplicate data.

Changes are available under visualize branch.

Permission denied error?

I installed dduper on my Ubuntu 21.04 system with a BTRFS file system on /dev/sda1 mounted at /data. I'm trying to play around with it on a single directory, but I keep getting permission denied errors.

dduper -p /dev/sda1 --dir /data/G --recurse --dry-run

Tells me ..

ERROR: cannot open '/dev/sda1': Permission denied
unable to open /dev/sda1

Adding sudo in front of dduper doesn't work either.

Any ideas?

What a coincidence

Hello,

thanks for dduper!
I have run over a directory recursively:

dduper --device /dev/sda1 --dir /srv/dev-disk-by-label-DataPool1/Video/  -r --dry-run
Perfect match :  /srv/dev-disk-by-label-DataPool1/Video/plugin.video.vdr.recordings_0.2.4.zip /srv/dev-disk-by-label-DataPool1/Video/VDR/unsortiert/Topspione_der_Geschichte/2016-11-04.20.13.23-0.rec/00055.ts
Summary
blk_size : 4KB  chunksize : 128KB
/srv/dev-disk-by-label-DataPool1/Video/plugin.video.vdr.recordings_0.2.4.zip has 0 chunks
/srv/dev-disk-by-label-DataPool1/Video/VDR/unsortiert/Topspione_der_Geschichte/2016-11-04.20.13.23-0.rec/00055.ts has 0 chunks
Matched chunks: 0
Unmatched chunks: 0
Total size(KB) available for dedupe: 0
Perfect match :  /srv/dev-disk-by-label-DataPool1/Video/plugin.video.vdr.recordings_0.2.4.zip /srv/dev-disk-by-label-DataPool1/Video/VDR/unsortiert/Topspione_der_Geschichte/2016-11-04.20.13.23-0.rec/00039.ts
Summary
blk_size : 4KB  chunksize : 128KB
/srv/dev-disk-by-label-DataPool1/Video/plugin.video.vdr.recordings_0.2.4.zip has 0 chunks
/srv/dev-disk-by-label-DataPool1/Video/VDR/unsortiert/Topspione_der_Geschichte/2016-11-04.20.13.23-0.rec/00039.ts has 0 chunks

What I find odd is that plugin.video.vdr.recordings_0.2.4.zip seems to match every single ts file (https://fileinfo.com/extension/ts).
I can imagine that every ts file contains a certain bit pattern, but would that pattern be in a zip file as well?

Greetings,
Hendrik

insane fast is not very fast at all

When recursing into a 6.2GB directory with 767 files in it, I thought the insane fast mode would:

compute a summary for each file from the csums of the blocks it contains;
do a sort / uniq over these summaries.

Since csum is already computed, this shouldn't take more than a minute in a modern computer. Instead, the process has been running 30 minutes now and the result already showing 2020 non-matching results.
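
The approach described above can be sketched as follows, assuming each file's per-block checksums are already available as a list of 32-bit integers. The names fingerprint and group_identical are hypothetical, and this is not how dduper is implemented.

```python
import hashlib
from collections import defaultdict


def fingerprint(block_csums):
    """Condense a file's ordered block checksums into one digest."""
    h = hashlib.sha256()
    for csum in block_csums:
        h.update(csum.to_bytes(4, "little"))  # assumes 32-bit checksums
    return h.hexdigest()


def group_identical(csums_by_file):
    """Group paths whose block checksum sequences are identical."""
    groups = defaultdict(list)
    for path, csums in csums_by_file.items():
        groups[fingerprint(csums)].append(path)
    return [sorted(paths) for paths in groups.values() if len(paths) > 1]


duplicates = group_identical({
    "a": [1, 2, 3],
    "b": [1, 2, 3],
    "c": [3, 2, 1],
})
print(duplicates)
```

Each file is touched once, so the cost grows linearly with the number of files instead of with the number of file pairs. The grouping only finds files that match in full; partial matches between files still need chunk-level comparison.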

Should this be using indexes?

Looks like the code spends a lot of time on sqlite lookups for various things, perhaps sqlite indexes could speed things up a bit even in normal mode?
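
A small sketch of what such an index could look like. The filehash table and its filename column appear in one of the dduper tracebacks on this page; the csum column, the sample rows, and the index name are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Table and filename column as seen in a dduper traceback; csum is a placeholder.
conn.execute("CREATE TABLE filehash (filename TEXT, csum TEXT)")
conn.executemany(
    "INSERT INTO filehash VALUES (?, ?)",
    [(f"/mnt/f{i}", str(i)) for i in range(1000)],
)

# The index lets lookups by filename avoid scanning the whole table.
conn.execute(
    "CREATE INDEX IF NOT EXISTS idx_filehash_filename ON filehash (filename)"
)

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM filehash WHERE filename = ?",
    ("/mnt/f42",),
).fetchall()
plan_text = " ".join(row[-1] for row in plan)
print(plan_text)
```

The query plan names the index when SQLite uses it, which is a quick way to confirm that a given lookup benefits before measuring run time.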

Assertion Error -- AUR dduper-git

Using dduper-git or dduper-bin on Arch, I'm running into the following error. I'm not sure what other information to give, but if you need more, let me know. I'm sure I just missed a setup step, or something similar.

Traceback (most recent call last):
  File "/usr/bin/dduper", line 576, in <module>
    main(results)
  File "/usr/bin/dduper", line 466, in main
    dedupe_dir(results.dir_path, results.dry_run, results.recurse)
  File "/usr/bin/dduper", line 457, in dedupe_dir
    dedupe_files(file_list, dry_run)
  File "/usr/bin/dduper", line 411, in dedupe_files
    ret = do_dedupe(src_file, dst_file, dry_run)
  File "/usr/bin/dduper", line 225, in do_dedupe
    assert len(out1) != 0
AssertionError

Question: What do you mean by "offline"?

It seems like the file system has to be mounted in order for dduper to work. Is it safe to use the file system while dduper is running? What do you mean by "offline"?

Docker - unable to open /dev/sdX

I'm trying to run dduper from Docker with

 sudo docker run -it --device /dev/sdf -v /media/data:/media/data laks/dduper dduper --device /dev/sdf --dir /media/data/media/Ixus --recurse --analyze

/media/data is the mountpoint for /dev/sdf with btrfs filesystem.

The output for all files is more or less like:

[Analyzing] /media/data/media/Ixus/103___12/IMG_0431.JPG:/media/data/media/Ixus/144___05/IMG_3098.JPG bad tree block 21676032, bytenr mismatch, want=21676032, have=0
ERROR: cannot read chunk root
unable to open /dev/sdf
bad tree block 21676032, bytenr mismatch, want=21676032, have=0
ERROR: cannot read chunk root
unable to open /dev/sdf
Perfect match :  /media/data/media/Ixus/103___12/IMG_0431.JPG /media/data/media/Ixus/144___05/IMG_3099.JPG

The volume is fine and healthy. All files can be read.
I assume there's something going wrong with accessing /dev/sdf from within the container.
Any ideas?

Add travisci.yml for basic sanity test

Add a simple .travis.yml to run pycodestyle and other basic sanity tests.

Operation not permitted on synology

Hi, I'm trying to run your amazing tool on a Synology NAS with BTRFS, using your Docker image as described in INSTALL.md,

but I'm seeing errors.
/dev/mapper/cachedev_0 is the device that Synology mounts the BTRFS file system from (checked with the mount command).
Maybe this is because Synology has NVMe caching.

parent transid verify failed on 7939913383936 wanted 7154693 found 7154700
parent transid verify failed on 7939913383936 wanted 7154693 found 7154700
parent transid verify failed on 7939913383936 wanted 7154693 found 7154700
Ignoring transid failure
leaf parent key incorrect 7939913383936
ERROR: failed to read block groups: Operation not permitted
unable to open /dev/mapper/cachedev_0

enum34 not in requirements.txt

(venv) root@box:~/dduper# ./dduper --device /dev/mapper/something --dir /mnt/something
Traceback (most recent call last):
  File "./dduper", line 28, in <module>
    from beautifultable import BeautifulTable
  File "/root/dev/dduper/venv/lib/python2.7/site-packages/beautifultable/__init__.py", line 5, in <module>
    from .beautifultable import (  # noqa F401
  File "/root/dev/dduper/venv/lib/python2.7/site-packages/beautifultable/beautifultable.py", line 34, in <module>
    from . import enums
  File "/root/dev/dduper/venv/lib/python2.7/site-packages/beautifultable/enums.py", line 2, in <module>
    import enum
ImportError: No module named enum

pip install enum34 fixed the issue on my end

Create an upstream PR for the patch to bring the patch to trunk.

It's a nasty hurdle to have to build btrfs-progs yourself.

Support multiple directories

Multiple files are supported with --files, but only one directory using --dir. This is quite limiting when deduplicating multiple directories.

Documentation of the --perfect_match_only

Thank you for creating and maintaining dduper.

I noticed that the --perfect_match_only option was merged in #54. I think it would be beneficial to explain this option somewhere, for example in the https://github.com/Lakshmipathi/dduper/blob/master/README.md file.

Couldn't read tree root

When starting dduper on docker from openmediavault I get the following error when accessing any file:

user@host:~$ sudo docker run -it --device /dev/sdf -v /media/data:/media/data -u root laks/dduper bash
root@0dad41b09a40:/dduper# dduper --device /dev/sdf --dir /media/data/media/Ixus/ -r -a
bad tree block 912588800, bytenr mismatch, want=912588800, have=0
Couldn't read tree root
unable to open /dev/sdf
...
root@0dad41b09a40:/dduper# ls -la /media/data/media/Ixus/
total 0
d---r-x--- 1 root root 1112 Dec  1 08:10 .
drwxr-xr-x 1 root root  428 Dec  4 14:53 ..
d---r-x--- 1 root root 3330 Dec  1 07:48 103___12
...

I assume it's related to user rights or some wrong parameters. Any idea?

Use builtin btrfs-progs 5.13 dump commands.

In case you missed it, btrfs-progs 5.13 added commands to dump csums.
kdave/btrfs-progs@9f6c055

Make a fallback method to calculate the checksum

do_btrfs_dump_csum fails if the btrfs inspect-internal dump-csum command is not implemented (and it is still not in the main btrfs-progs implementation)...
If it fails, it would be good to have a fallback method to calculate the checksums.
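
One way such a fallback could look, as a sketch under stated assumptions: the function block_checksums is hypothetical, the 4KB block size follows the "blk_size : 4KB" line in dduper's summaries, and zlib.crc32 is a stand-in. It is plain CRC-32, not the CRC32C that BTRFS stores, so the values will not equal the on-disk csums, but they can still be compared between files.

```python
import zlib

BLOCK_SIZE = 4096  # bytes, matching the 4KB block size in dduper's summaries


def block_checksums(path, block_size=BLOCK_SIZE):
    """Read path block by block and return a CRC-32 per block."""
    csums = []
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            # Stand-in checksum: CRC-32, not BTRFS's CRC32C.
            csums.append(zlib.crc32(block))
    return csums
```

The trade-off is speed: this reads every byte of every file, which is exactly the work that reusing the file system's stored checksums avoids.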

Throws 'UnicodeEncodeError' on strange filename

I backed up an old Windows disk onto a BTRFS-backed network share. Now dduper throws an exception on one of the filenames.

ls gives the filename as:
'Finland.J'$'\344''rvenp'$'\344\344''-Elisa.xml'

Traceback (most recent call last):
  File "/usr/sbin/dduper", line 535, in <module>
    main(results)
  File "/usr/sbin/dduper", line 426, in main
    dedupe_dir(results.dir_path, results.dry_run, results.recurse)
  File "/usr/sbin/dduper", line 409, in dedupe_dir
    if validate_file(fn) is True:
  File "/usr/sbin/dduper", line 399, in validate_file
    file size < 4kb ")
UnicodeEncodeError: 'utf-8' codec can't encode character '\udce4' in position 146: surrogates not allowed

Using the docker image.
sudo docker run -it --device /dev/sdc -v /media/backup/:/mnt laks/dduper dduper --device /dev/sda1 --dir /mnt --analyze --recurse

UnicodeEncodeError

The UnicodeEncodeError reported by @plattrap in issue #15 has been marked "resolved" long ago, but I've just started using dduper yesterday (pre-built binary 0.04) and already get that:

# dduper --device /dev/sdd2 --recurse --dir /btrfstestdir
  File "/usr/sbin/dduper", line 734, in <module>
    main(results)
  File "/usr/sbin/dduper", line 595, in main
    dedupe_dir(results.dir_path, results.dry_run, results.recurse)
  File "/usr/sbin/dduper", line 571, in dedupe_dir
    populate_records(file_list)
  File "/usr/sbin/dduper", line 549, in populate_records
    btrfs_dump_csum(fn)
  File "/usr/sbin/dduper", line 269, in btrfs_dump_csum
    out, ret = check_btrfs_file_exists(filename)
  File "/usr/sbin/dduper", line 252, in check_btrfs_file_exists
    cursor.execute("SELECT * FROM filehash WHERE filename = ?",(filename,))
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 58-59: surrogates not allowed

(I'm not sure if it's the same core issue as in #15 , so I file this one separately. Please feel free to move it there and/or mark as duplicate.)
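
Both tracebacks involve filenames whose bytes are not valid UTF-8. Python represents such bytes as lone surrogates (for example '\udce4'), which cannot be encoded back to UTF-8 with the default strict handler. The sketch below shows two hedged workarounds. The helper names to_db_key and to_display are hypothetical and not part of dduper.

```python
def to_db_key(filename):
    """Recover the original on-disk bytes of a surrogate-escaped name."""
    return filename.encode("utf-8", "surrogateescape")


def to_display(filename):
    """Return a printable form, escaping bytes that are not valid UTF-8."""
    return filename.encode("utf-8", "backslashreplace").decode("utf-8")


# Latin-1 encoded name as stored on disk, taken from the ls output above.
raw = b"Finland.J\xe4rvenp\xe4\xe4-Elisa.xml"
# This is the str form Python produces for such a name on a UTF-8 system.
name = raw.decode("utf-8", "surrogateescape")

db_key = to_db_key(name)
shown = to_display(name)
print(shown)
```

The bytes from to_db_key can be passed as a sqlite3 query parameter, where they are stored as a BLOB, and to_display is safe for messages. The original str can always be rebuilt by decoding the bytes with surrogateescape again.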

Add License

What License are you publishing this under?

dump-csum patch doesn't work with btrfs-progs 5.12

Or, rather, it applies, but compilation fails with an error about too many arguments to open_ctree_fs_info.

Document simple test results for RAID setups

Update TESTS.md with results from different RAID setups such as raid0, raid1, raid5, and raid10.

TypeError

$ dduper --device /dev/mapper/vg-root --dir /nix/store/*llvm*-lib --recurse
[...]
Dedupe completed for /nix/store/1xcwdxx002a70ml4h1k0byciidbsnx2n-llvm-8.0.1-lib/lib/libLLVM-8.so:/nix/store/hpa2wxp7cjxgb5bn44wnhb4aig65s1kg-llvm-8.0.1-lib/lib/libLLVM.so
Summary
blk_size : 4KB  chunksize : 128KB
/nix/store/1xcwdxx002a70ml4h1k0byciidbsnx2n-llvm-8.0.1-lib/lib/libLLVM-8.so has 646 chunks
/nix/store/hpa2wxp7cjxgb5bn44wnhb4aig65s1kg-llvm-8.0.1-lib/lib/libLLVM.so has 643 chunks
Matched chunks: 1
Unmatched chunks: 642
Total size(KB) deduped: 128
************************
error([Errno 22] Invalid argument)
Traceback (most recent call last):
  File "/nix/store/ldmj09d7pfyircf1j34m8rhpy0qxlj2l-dduper-v0.04/bin//dduper", line 594, in <module>
    main(results)
  File "/nix/store/ldmj09d7pfyircf1j34m8rhpy0qxlj2l-dduper-v0.04/bin//dduper", line 465, in main
    dedupe_dir(results.dir_path, results.dry_run, results.recurse)
  File "/nix/store/ldmj09d7pfyircf1j34m8rhpy0qxlj2l-dduper-v0.04/bin//dduper", line 456, in dedupe_dir
    dedupe_files(file_list, dry_run)
  File "/nix/store/ldmj09d7pfyircf1j34m8rhpy0qxlj2l-dduper-v0.04/bin//dduper", line 410, in dedupe_files
    ret = do_dedupe(src_file, dst_file, dry_run)
  File "/nix/store/ldmj09d7pfyircf1j34m8rhpy0qxlj2l-dduper-v0.04/bin//dduper", line 281, in do_dedupe
    bytes_deduped,status = ioctl_fideduperange(src_fd, s)
TypeError: cannot unpack non-iterable NoneType object
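
The traceback suggests that the dedupe call printed the EINVAL error and then returned None, which the caller tried to unpack into two values. A defensive wrapper could look like the sketch below. safe_dedupe_call is a hypothetical name, the fake callables stand in for the real ioctl wrapper (whose code is not shown here), and -1 is an arbitrary marker for an unknown failure.

```python
import errno


def safe_dedupe_call(dedupe_fn, *args):
    """Call dedupe_fn and always return a (bytes_deduped, status) tuple."""
    try:
        result = dedupe_fn(*args)
    except OSError as exc:
        # Report the errno as the status instead of propagating or returning None.
        return 0, exc.errno
    if result is None:
        # The callee swallowed an error; -1 is a hypothetical "unknown" status.
        return 0, -1
    return result


def fake_ok():
    return 131072, 0


def fake_swallowed_error():
    print("error([Errno 22] Invalid argument)")
    # Falls through without a return statement, so the result is None.


def fake_raises():
    raise OSError(errno.EINVAL, "Invalid argument")


ok_result = safe_dedupe_call(fake_ok)
swallowed_result = safe_dedupe_call(fake_swallowed_error)
raised_result = safe_dedupe_call(fake_raises)
```

With a tuple guaranteed on every path, the caller's two-value unpacking cannot fail, and a non-zero status can be logged per file pair.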

AssertionError on very small dir

strace -o /mnt/logs.txt dduper --device /dev/mapper/cachedev_0 --dir /mnt/Exchange --analyze --recurse
Traceback (most recent call last):
  File "/usr/sbin/dduper", line 575, in <module>
    main(results)
  File "/usr/sbin/dduper", line 465, in main
    dedupe_dir(results.dir_path, results.dry_run, results.recurse)
  File "/usr/sbin/dduper", line 456, in dedupe_dir
    dedupe_files(file_list, dry_run)
  File "/usr/sbin/dduper", line 410, in dedupe_files
    ret = do_dedupe(src_file, dst_file, dry_run)
  File "/usr/sbin/dduper", line 224, in do_dedupe
    assert len(out1) != 0
AssertionError

Strace log attached.

logs.txt

Assertion error

Running code from today's repo, Jan-15-2021 (dduper 0.04)

btrfs_lookup_csums search failed.icrosoft.MicrosoftOfficeHub_8wekyb3d8bbwe/AC/Microsoft/CLR_v4.0/ngen.log
 Error: btrfs_lookup_csumextent buffer leak: start 87928078336 len 16384
extent buffer leak: start 55155425280 len 16384
extent buffer leak: start 305545216 len 16384
Traceback (most recent call last):
  File "/usr/local/bin/dduper", line 575, in <module>
    main(results)
  File "/usr/local/bin/dduper", line 465, in main
    dedupe_dir(results.dir_path, results.dry_run, results.recurse)
  File "/usr/local/bin/dduper", line 456, in dedupe_dir
    dedupe_files(file_list, dry_run)
  File "/usr/local/bin/dduper", line 410, in dedupe_files
    ret = do_dedupe(src_file, dst_file, dry_run)
  File "/usr/local/bin/dduper", line 225, in do_dedupe
    assert len(out2) != 0
AssertionError

It ran half a day analyzing a path then crashed. Command line is
dduper --device /dev/sdq --dir /mypath/backup/ --analyze --recurse

dump-csum output is empty, so dduper prints "has 0 chunks" for every file

Not sure which repo to report/ask this in, sorry.

I've tried the prebuilt btrfs.static and kdave/btrfs-progs.git#v5.6.1 with 0001-Print-csum-for-a-given-file-on-stdout.patch built from source. I'm pretty sure I have CRC32 csums (mount says Btrfs loaded, crc32c=crc32c-intel), but btrfs inspect-internal dump-csum just pauses and exits (code 0) without printing anything. No kernel/syslog messages occur while dump-csum is running. I've tried several files and all three devices in the set.

Any ideas as to how I diagnose this?

deduped always 0 when deduped??

When I run it using:

dduper --device /dev/sdb1 --dir /Databases/_/BNE/

files appear to be deduped (used space goes down) ... but ...
the output is always "deduped: 0" (see below)