kilobyte / compsize
btrfs: find compression type/ratio on a file or set of files
License: Other
Static analysis through scan-build and Clang/LLVM 12.0.0-rc1:
scan-build: Using '/usr/lib/llvm/12/bin/clang-12' for static analysis
/usr/lib/llvm/12/bin/../libexec/ccc-analyzer -Wall -std=gnu90 -c -o compsize.o compsize.c
/usr/lib/llvm/12/bin/../libexec/ccc-analyzer -Wall -std=gnu90 -c -o radix-tree.o radix-tree.c
compsize.c:124:10: warning: Casting a non-structure type to a structure type and accessing a field can lead to memory access errors or data corruption [alpha.core.CastToStruct]
ei = (struct btrfs_file_extent_item *) bp;
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
compsize.c:206:16: warning: Casting a non-structure type to a structure type and accessing a field can lead to memory access errors or data corruption [alpha.core.CastToStruct]
head = (struct btrfs_ioctl_search_header*)bp;
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
compsize.c:281:26: warning: Null pointer passed to 1st parameter expecting 'nonnull' [core.NonNullParamChecker]
de = readdir(dir);
^~~~~~~~~~~~
compsize.c:455:9: warning: Potential leak of memory pointed to by 'ws' [unix.Malloc]
print_help();
^~~~~~~~~~
4 warnings generated.
/usr/lib/llvm/12/bin/../libexec/ccc-analyzer -Wall -std=gnu90 -o compsize compsize.o radix-tree.o
scan-build: Analysis run complete.
scan-build: 4 bugs found.
See the attached file/html pages for an interactive view of the function paths taken (or just run scan-build -V make with alpha checkers yourself). I'm not sure whether any of these are actual issues, but I thought you might be interested.
I have a freshly created BTRFS file system with only one file. It is ca. 68GB in size when compressed:
[root@nas ~]# df -h /dev/sdc1
Filesystem Size Used Avail Use% Mounted on
/dev/sdc1 932G 68G 863G 8% /mnt/m3
The uncompressed file is 257GB:
[root@nas ~]# du -h /mnt/m3/rav.img
257G /mnt/m3/rav.img
[root@nas ~]# ls -lh /mnt/m3
total 257G
-rw-r--r-- 1 root root 257G 12-25 17:02 rav.img
But when I use compsize, I get the following result:
[root@nas compsize]$ ./compsize /mnt/m3/rav.img
Type Perc Disk Usage Uncompressed Referenced
TOTAL 40% 10G 26G 26G
none 100% 3.0G 3.0G 3.0G
zstd 32% 7.6G 23G 23G
Environment:
Currently, the Makefile ignores DESTDIR, making it impossible to use make to install this package in a distro setting without extra configuration. Instead, we have to run install -Dm755 compsize $DESTDIR$PREFIX/bin/compsize or make install PREFIX=$DESTDIR$PREFIX, which is unwieldy either way. Some distro build systems have the standard make commands baked in, and having to override the defaults is cumbersome.
I was running caed4fd for a long time. I updated to the latest version, but the values have changed massively:
caed4fd:
Processed 62 files.
Type Perc Disk Usage Uncompressed Referenced
Data 69% 72G 104G 141G
none 100% 37G 37G 46G
zstd 52% 34G 66G 95G
Current HEAD:
Processed 62 files, 1347823 regular extents (1622359 refs), 33 inline.
Type Perc Disk Usage Uncompressed Referenced
TOTAL 76% 179G 235G 273G
none 100% 106G 106G 116G
zstd 56% 73G 128G 157G
What has happened?
Latest version doesn't compile here (debian buster)...
cc -Wall -c -o compsize.o compsize.c
compsize.c:30:18: error: ‘SZ_16M’ undeclared here (not in a function)
uint8_t buf[SZ_16M]; // hardcoded kernel's limit
I'm building compsize on my distro's build system, and I got this error:
/bin/sh: line 1: /buildstream-install/usr/share/man/man8/compsize.8.gz: No such file or directory
The Makefile doesn't ensure that the directory exists before trying to create a file in it. Ideally it should.
Workaround: I just ran mkdir before make install.
An option to get raw values, or even a JSON structure, out of it would be great. This would make the output machine-readable.
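Until such an option exists, a wrapper script could convert the table into JSON. The following is only a sketch: it assumes the output layout shown in the reports on this page (a five-column table with a percentage in the second column), and the function name `parse_compsize` is made up for illustration. The sizes stay as the rounded strings compsize prints, so no precision is recovered.

```python
import json

# Sample output copied from one of the reports on this page.
SAMPLE = """\
Processed 62 files, 1347823 regular extents (1622359 refs), 33 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL       76%      179G         235G         273G
none       100%      106G         106G         116G
zstd        56%       73G         128G         157G
"""

def parse_compsize(text):
    """Turn compsize's human-readable table into a list of dicts."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have exactly five fields and a percentage in column 2;
        # the "Processed ..." line and the header row do not match.
        if len(fields) == 5 and fields[1].endswith("%"):
            rows.append({
                "type": fields[0],
                "percent": int(fields[1].rstrip("%")),
                "disk_usage": fields[2],
                "uncompressed": fields[3],
                "referenced": fields[4],
            })
    return rows

rows = parse_compsize(SAMPLE)
print(json.dumps(rows, indent=2))
```

Scraping a human-oriented table is fragile, which is exactly why a native raw or JSON output mode would be preferable.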
I'd like an option to use compsize to find files with extents matching some criteria and list them.
compsize --find could have the following matches:
It should be possible to combine several matches.
The output can be a table with the path/filename + the chosen matches.
An option to sort the output would be great, or an output format that can be piped to sort.
My initial use case is to find highly fragmented files so that I can defragment them manually, but also to analyse my files and determine whether I should take some action on them.
I found that compsize was slower than btrfs fi du, which shouldn't happen because btrfs fi du uses FIEMAP while compsize uses TREE_SEARCH_V2.
# time btrfs fi du -s .
Total Exclusive Set shared Filename
6.92GiB 351.20MiB 4.67GiB .
real 0m2.414s
user 0m0.177s
sys 0m2.189s
# time compsize .
Processed 96083 files, 170199 regular extents (202723 refs), 39927 inline.
Type Perc Disk Usage Uncompressed Referenced
TOTAL 52% 3.6G 6.9G 7.0G
none 100% 2.2G 2.2G 2.0G
zlib 32% 1.3G 4.1G 4.5G
zstd 14% 75M 505M 478M
real 0m7.719s
user 0m0.105s
sys 0m7.532s
Some stracing found this:
ioctl(6, BTRFS_IOC_TREE_SEARCH_V2, {key={tree_id=0, min_objectid=17466447, max_objectid=17466447, ..., buf_size=16777216} => {key={nr_items=170},
This is zeroing out a 16 MB memory buffer to read one inode, and it slows compsize down to the point where even glacial FIEMAP calls can catch up.
Changing the buffer size to 64K makes it run much faster:
# time compsize-64k .
Processed 96083 files, 170199 regular extents (202723 refs), 39927 inline.
Type Perc Disk Usage Uncompressed Referenced
TOTAL 52% 3.6G 6.9G 7.0G
none 100% 2.2G 2.2G 2.0G
zlib 32% 1.3G 4.1G 4.5G
zstd 14% 75M 505M 478M
real 0m1.089s
user 0m0.102s
sys 0m0.950s
but the loop retry condition in do_file has to be fixed. I would suggest setting max_objectid = min_objectid + 1, then bailing out of the loop as soon as you see an objectid greater than the inode you're looking at, or zero nr_items on ioctl return. Then it runs even faster:
# time compsize-64k-plus-one .
Processed 96083 files, 170199 regular extents (202723 refs), 39927 inline.
Type Perc Disk Usage Uncompressed Referenced
TOTAL 52% 3.6G 6.9G 7.0G
none 100% 2.2G 2.2G 2.0G
zlib 32% 1.3G 4.1G 4.5G
zstd 14% 75M 505M 478M
real 0m0.933s
user 0m0.086s
sys 0m0.800s
64K is the largest possible metadata page size. TREE_SEARCH_V2 will never need space for an item that large, so you don't have to worry about an item being too large for the buffer.
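The suggested loop can be illustrated with a toy model. This is not compsize's code and makes no ioctl calls: `tree_search` is a hypothetical stand-in for the search, keys are simplified to (objectid, offset) pairs, and `max_items` plays the role of a small buffer that forces several round trips.

```python
def tree_search(items, min_key, max_objectid, max_items):
    """Toy stand-in for a tree search: return up to max_items keys, in order,
    that are >= min_key and whose objectid is <= max_objectid."""
    hits = [k for k in sorted(items) if k >= min_key and k[0] <= max_objectid]
    return hits[:max_items]

def items_for_inode(items, inode, max_items=4):
    """Collect all (objectid, offset) keys for one inode using small batches."""
    found = []
    min_key = (inode, 0)
    while True:
        # Search one objectid past the inode, as suggested above.
        batch = tree_search(items, min_key, inode + 1, max_items)
        if not batch:               # zero items returned: nothing left to read
            break
        for key in batch:
            if key[0] > inode:      # walked past our inode: stop immediately
                return found
            found.append(key)
        # Resume just after the last key returned.
        last = batch[-1]
        min_key = (last[0], last[1] + 1)
    return found

tree = [(10, 0), (10, 4096), (10, 8192), (10, 12288), (10, 16384), (11, 0), (12, 0)]
print(items_for_inode(tree, 10))
```

With a batch size of four, inode 10 needs two searches; the second one ends as soon as the key for objectid 11 shows up, so no extra empty search is issued.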
After upgrading from 5.8.5 to 5.8.8, compsize almost always aborts with:
$ compsize -x /usr/
/usr/portage/distfiles/ndata-0.6.1.zip: Regular extent's header not 53 bytes (0) long?!?
on various directories and also on different subvolumes. Is there an upstream kernel change this tool needs to be adapted to?
Or something similar.
This will enable the user to find the command when using btrfs<TAB completion> and it will be among the other useful btrfs tools.
I've moved a folder from another partition to the current one, which is mounted with compress-force=zstd. To my surprise, with compsize I get:
rbenedetti at rbenedettid > $sudo compsize /mnt/HDDs/
Processed 8890 files, 47691 regular extents (47691 refs), 3540 inline.
Type Perc Disk Usage Uncompressed Referenced
TOTAL 75% 5.6G 7.4G 7.4G
none 100% 2.8G 2.8G 2.8G
zlib 60% 2.8G 4.6G 4.6G
rbenedetti at rbenedettid > $sudo cat /proc/mounts | grep HDDs
systemd-1 /mnt/HDDs autofs rw,relatime,fd=50,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=20037 0 0
/dev/mapper/cryptHDDC /mnt/HDDs btrfs rw,seclabel,noatime,compress-force=zstd,flushoncommit,space_cache,autodefrag,subvolid=5,subvol=/ 0 0
The partition was empty before. I expected the data to be forcibly compressed with zstd, as it came from another file system.
It would be nice if compsize would show two additional columns to display the dedupe rate and total rate:
comp% = usage vs. uncompressed
dedupe% = uncompressed vs. referenced
total% = usage vs. referenced
See-also: Zygo/bees#102
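As a rough illustration of the three proposed columns, the ratios can be computed from the figures compsize already prints. The function name `ratios` is hypothetical, and the inputs are the rounded TOTAL values (179G, 235G, 273G) from another report on this page, so the results are only approximate.

```python
def ratios(disk_usage, uncompressed, referenced):
    """Return (comp%, dedupe%, total%) as rounded integer percentages.

    comp%   = disk usage vs. uncompressed
    dedupe% = uncompressed vs. referenced
    total%  = disk usage vs. referenced
    """
    comp = round(100 * disk_usage / uncompressed)
    dedupe = round(100 * uncompressed / referenced)
    total = round(100 * disk_usage / referenced)
    return comp, dedupe, total

# All three arguments must use the same unit (GiB here).
print(ratios(179, 235, 273))
```

A real implementation would work on exact byte counts and would have to handle empty inputs, where the divisors are zero.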
I started to use btrfs compression to store disk images. The values seemed incorrect to me, so I ran a test.
I created a file (approx. 100G) full of zeroes.
The values seemed OK.
I then wrote a little data to this file: in this case I created a loopback device and made an ext4-formatted partition inside the file.
I unmounted it and detached the loopback device.
This is what I got:
ant@voyager:/storage/4tb-s76/work$ ls -la zeroes.img
-rw-r--r-- 1 root root 103131643904 okt 12 19:49 zeroes.img
ant@voyager:/storage/4tb-s76/work$ sudo compsize zeroes.img
Type Perc Disk Usage Uncompressed Referenced
TOTAL 99% 516M 518M 518M
none 100% 516M 516M 516M
zstd 3% 76K 2.1M 2.0M
It would be great if you could support signals to show the status of the files investigated so far, like SIGUSR1 on dd: https://askubuntu.com/a/215521.
I'm currently running it on a medium-sized directory (~700GB), and it has been going for a couple of hours.
Hello, thanks for your great program!
I have a little issue - I can't run compsize on / directory:
yard@arch ~ sudo compsize /
Not btrfs (or SEARCH_V2 unsupported).
But btrfs command can detect that it's btrfs:
yard@arch ~ sudo btrfs fi us /
Overall:
Device size: 340.00GiB
Device allocated: 28.06GiB
Device unallocated: 311.94GiB
Device missing: 0.00B
Used: 18.18GiB
Free (estimated): 320.55GiB (min: 164.58GiB)
Data ratio: 1.00
Metadata ratio: 2.00
Global reserve: 43.91MiB (used: 0.00B)
Data,single: Size:26.00GiB, Used:17.38GiB
/dev/sda2 26.00GiB
Metadata,DUP: Size:1.00GiB, Used:406.47MiB
/dev/sda2 2.00GiB
System,DUP: Size:32.00MiB, Used:16.00KiB
/dev/sda2 64.00MiB
Unallocated:
/dev/sda2 311.94GiB
Is it a limitation?
It's been proposed that for Fedora 34, transparent compression be enabled by default on btrfs volumes, and compsize be offered as the standard tool for examining file compression on zstd-compressed volumes.
Given that compsize would then be an end-user tool, including potentially for lay users, I felt that it might benefit from a few user-friendliness enhancements. Specifically:
-h / --help arguments
$ compsize --help
compsize: unrecognized option '--help'
$ compsize
Usage: compsize file-or-dir1 [file-or-dir2 ...]
Could compsize be made safe to be setuid root, so that unprivileged users can check the compression of files they can read?
Trying to run 'make':
# make
cc -Wall -c -o compsize.o compsize.c
compsize.c: In function ‘main’:
compsize.c:301:5: error: ‘for’ loop initial declarations are only allowed in C99 or C11 mode
for (int t=0; t<MAX_ENTRIES; t++)
^
compsize.c:301:5: note: use option -std=c99, -std=gnu99, -std=c11 or -std=gnu11 to compile your code
compsize.c:328:14: error: redefinition of ‘t’
for (int t=0; t<MAX_ENTRIES; t++)
^
compsize.c:301:14: note: previous definition of ‘t’ was here
for (int t=0; t<MAX_ENTRIES; t++)
^
compsize.c:328:5: error: ‘for’ loop initial declarations are only allowed in C99 or C11 mode
for (int t=0; t<MAX_ENTRIES; t++)
^
Makefile:11: recipe for target 'compsize.o' failed
make: *** [compsize.o] Error 1
# cc --version
cc (Debian 4.9.2-10) 4.9.2
Copyright (C) 2014 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Could you please describe the prerequisites or install instructions in the README file.
Thanks
Being curious: how can compsize show 1.6G of "none" (uncompressed) data when btrfs is mounted with compress-force=lzo?
"There is a simple decision logic: if the first portion of data being compressed is not smaller than the original, the compression of the file is disabled -- unless the filesystem is mounted with -o compress-force. In that case it'll be compressed always regardless of the compressibility of the file. This is not optimal and subject to optimizations and further development. "
# mount|grep btrfs
/dev/sdb1 on /btrfspool type btrfs (rw,relatime,compress-force=lzo,nossd,space_cache,subvolid=5,subvol=/,_netdev)
# compsize /btrfspool/backup/adminstation
Processed 164408 files, 167122 regular extents (167122 refs), 90204 inline.
Type Perc Disk Usage Uncompressed Referenced
TOTAL 66% 4.1G 6.2G 6.2G
none 100% 1.6G 1.6G 1.6G
lzo 54% 2.5G 4.5G 4.5G
Hi,
your tool is very useful. I have just one request: can you avoid calling die() when errno==EACCES?
I suppose that we can just ignore the failing open() and continue.
(I will provide a PR if you need)
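The requested behaviour, sketched in Python rather than in compsize's C: a permission error on one file is reported and skipped, while any other error remains fatal. The names `open_all` and `opener` are invented for this example.

```python
import errno
import sys

def open_all(paths, opener=open):
    """Open each path; skip unreadable ones instead of aborting the whole run."""
    opened, skipped = [], []
    for path in paths:
        try:
            handle = opener(path, "rb")
        except OSError as exc:
            if exc.errno == errno.EACCES:
                # Warn and carry on with the remaining files.
                print(f"{path}: permission denied, skipping", file=sys.stderr)
                skipped.append(path)
                continue
            raise  # any other error is still fatal
        opened.append(path)
        handle.close()
    return opened, skipped
```

Returning the skipped paths lets the caller mention at the end that the totals are incomplete.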
ciao
luigi
I'm not sure if this is the right place to post (possible) bugs. Anyway when I try to run compsize on root or my home folder, I get this error message:
akiss@buster:~$ su
Password:
root@buster:/home/akiss# ./compsize/compsize .
SEARCH_V2: Inappropriate ioctl for device
root@buster:/home/akiss#
When I run it on a subfolder, it works as expected:
root@buster:/home/akiss# ./compsize/compsize .steam
Processed 34050 files, 206016 regular extents (206274 refs), 10770 inline.
Type Perc Disk Usage Uncompressed Referenced
TOTAL 65% 23G 35G 35G
none 100% 12G 12G 12G
zstd 45% 10G 22G 22G
root@buster:/home/akiss#
I have an ext4 boot partition mounted at /boot, and a samba share mounted in my home folder.
root@buster:/home/akiss# uname -a
Linux buster 4.14.0-2-amd64 #1 SMP Debian 4.14.7-1 (2017-12-22) x86_64 GNU/Linux
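If the failure comes from the walk descending into the non-btrfs mounts (an assumption, the report does not confirm it), one common way to avoid that is to compare device numbers while walking. The sketch below shows the general technique in Python; `files_on_same_fs` is a hypothetical helper, not part of compsize.

```python
import os

def files_on_same_fs(root):
    """Yield file paths under root, without descending into other devices."""
    root_dev = os.lstat(root).st_dev
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune subdirectories whose device number differs from the root's,
        # which is the case for mount points of other filesystems.
        dirnames[:] = [
            d for d in dirnames
            if os.lstat(os.path.join(dirpath, d)).st_dev == root_dev
        ]
        for name in filenames:
            yield os.path.join(dirpath, name)
```

One caveat: btrfs subvolumes usually report their own device number as well, so a check like this would also stop at subvolume boundaries.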
The title says it all. Also, it would be nice if it was compressed with gzip, like other manuals.
compsize version affected: 33c07cb
It would be really useful for me if there were an option to print each file and its details (Perc, Disk Usage, Uncompressed, Referenced, Comp Algorithm) in addition to the summary.
Thanks.
Sometimes it is very useful to see all size values in the same unit, for example in megabytes.
For parsing the output in scripts, it would also be good to have raw byte output for all sizes.
For example, in btrfs filesystem df we can choose the format via CLI arguments:
# btrfs filesystem df --help
usage: btrfs filesystem df [options] <path>
Show space usage information for a mount point
-b|--raw raw numbers in bytes
-h|--human-readable
human friendly numbers, base 1024 (default)
-H human friendly numbers, base 1000
--iec use 1024 as a base (KiB, MiB, GiB, TiB)
--si use 1000 as a base (kB, MB, GB, TB)
-k|--kbytes show sizes in KiB, or kB with --si
-m|--mbytes show sizes in MiB, or MB with --si
-g|--gbytes show sizes in GiB, or GB with --si
-t|--tbytes show sizes in TiB, or TB with --si
Can you please add this feature? Thanks!
There is an entry in the FAQ list of Fedora's Btrfs compression initiative which says that when using filefrag on a compressed file, some of the reported extents may in reality be contiguous on disk, which means that filefrag is not a reliable way to tell how fragmented a file is.
I've noticed that compsize also reports far fewer extents for uncompressed files than for compressed files, which means that compsize is likely affected by the same issue, but I just wanted to confirm.
It would be good if sparse holes were shown as their own row instead of being lumped in with "none".
# touch file
# fallocate -l 100M file
# compsize file
Type Perc Disk Usage Uncompressed Referenced
TOTAL 100% 100M 100M 100M
none 100% 100M 100M 100M
$ ls -la 000021.log ; du -h 000021.log ; sudo compsize 000021.log
-rw-r--r-- 1 user user 1036 26.09.2019 21:52 000021.log
141M 000021.log
[sudo] password for user:
Type Perc Disk Usage Uncompressed Referenced
TOTAL 100% 140M 140M 140M
none 100% 140M 140M 140M
Well, even du reports it as 141M, so it can't be compsize's fault... hmm
$ stat 000021.log
File: 000021.log
Size: 1036 Blocks: 288360 IO Block: 4096 regular file
Device: 18h/24d Inode: 23998778 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 1000/ user) Gid: ( 1000/ user)
Access: 2019-09-26 15:22:00.635307037 +0200
Modify: 2019-09-26 21:52:11.643990800 +0200
Change: 2019-09-26 21:52:11.643990800 +0200
Birth: 2019-09-26 15:22:00.635307037 +0200
Hmm... 288360 blocks (i.e. times 512) = 147,640,320 bytes.
But why, how!?! :)
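The arithmetic above can be checked with a few lines of Python. The sketch assumes Linux, where st_blocks is counted in 512-byte units regardless of the reported I/O block size; `allocated_bytes` is a name made up for this example, and the temporary file only demonstrates how to read both figures.

```python
import os
import tempfile

def allocated_bytes(blocks):
    """Convert an st_blocks count (512-byte units on Linux) to bytes."""
    return blocks * 512

print(allocated_bytes(288360))  # 147640320

# Compare apparent size and allocation for a real (temporary) file.
with tempfile.NamedTemporaryFile() as tmp:
    tmp.write(b"x" * 1036)
    tmp.flush()
    st = os.stat(tmp.name)
    print(st.st_size, allocated_bytes(st.st_blocks))
```

This only confirms that du and stat agree with each other; it does not explain why a 1036-byte file holds that much allocated space.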
I would like an option to show the count of extents even on single files. Currently we only have that when choosing multiple input files.
Thanks for a great tool! 😊
i'm getting
root@pve-test1:/btrfs# compsize /btrfs/zstd/
All empty or still-delalloced files.
what does "still-delalloced" mean ?
looks like a typo to me !?
Errors:
G-REPO/geolcation for tcploggerv3/IP2LOCATION-LITE-DB5.CSV: Regular extent's header not 53 bytes (0) long?!?
.././testthings: Regular extent's header not 53 bytes (0) long?!?
testthings is a file created via dd /dev/zeros
Versions:
5.4.0-52-generic #57~18.04.1-Ubuntu SMP Thu Oct 15 14:04:49 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
btrfs-progs v4.15.1
compsize compiled from repo ---cksum:
2314157740 30840 compsize
When looking at the size of subvolumes and snapshots, I usually only want to look at the actual disk usage for the chosen set.
Example:
Instead of
# compsize foo
Processed 2778793 files, 79568 regular extents (1461600 refs), 1612739 inline.
Type Perc Disk Usage Uncompressed Referenced
TOTAL 45% 3.5G 7.6G 96G
none 100% 1.9G 1.9G 29G
zstd 27% 1.5G 5.7G 67G
# compsize --disk-usage foo
3.5G
Ultimately it would be nice to print a table for the chosen set, similar to what btrfs fi du -s does.
# compsize --disk-usage foo bar
Size Path
---------------
3.5G foo
4.1G bar
# compsize --summary --disk-usage foo bar
Size Path
---------------
3.5G foo
4.1G bar
---------------
7.6G TOTAL
By default the program picks the most sensible unit for each value, e.g. 12GB or 37MB, but this doesn't allow for much precision. When comparing sizes, for example, 11GB leaves almost 1GB of variance: it could be anywhere between 10.5GB and 11.4GB.
Can you please add an option to specify the output unit so that the values can be shown in more detail? For example,
Data 83% 1.4G 1.7G 1.7G
... with such an option would allow you to specify Megabytes as the unit, and in turn show...
Data 83% 1432M 1722M 1722M
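A minimal sketch of what such an option might do internally, assuming binary units (1K = 1024 bytes) and an exact byte count as input. `format_size` and the `UNITS` table are illustrative names, not compsize code, and the rounding rules are a guess.

```python
# Exponent of 1024 for each unit suffix.
UNITS = {"B": 0, "K": 1, "M": 2, "G": 3, "T": 4}

def format_size(num_bytes, unit):
    """Format a byte count in a fixed binary unit (K = 1024 bytes, and so on)."""
    value = num_bytes / 1024 ** UNITS[unit]
    # One decimal for G and T, whole numbers for the smaller units.
    if UNITS[unit] >= 3:
        return f"{value:.1f}{unit}"
    return f"{value:.0f}{unit}"

size = 1432 * 1024 ** 2
print(format_size(size, "G"))  # 1.4G
print(format_size(size, "M"))  # 1432M
```

The same byte count reads as 1.4G or 1432M depending on the chosen unit, which is the extra precision the request is after.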
A single program is better than nothing, but it can't replace a whole ecosystem of tools.
It would be nice to have a library that other programs can call, preferably just providing a compression-aware stat/lstat.
Even better would be a library that can be LD_PRELOADED and replaces lstat/stat (or for most Linux systems rather __lxstat seems to be what is called) overriding only the block count. This would then show compression naturally in programs like ncdu without needing any code changes in them.
While not entirely risk-free, most of the concerns of replacing st_blocks with the actual allocation shouldn't apply for this usage.
# compsize /mnt/sda2/Backups.backupdb/d2/2017-12-17-211627
Processed 24766 files, 20583 regular extents (22801 refs), 12848 inline.
Type Perc Disk Usage Uncompressed Referenced
Data 31% 471M 1.4G 1.7G
none 100% 126M 126M 126M
zstd 25% 344M 1.3G 1.5G
Now I am wondering: what is "Data", "none" and "zstd"?
Or what are the differences between them.
After reading https://github.com/kilobyte/compsize/blob/master/README.md the clue is still missing.
I had no luck compiling compsize after installing btrfs-progs (or btrfsprogs)
I had to also install libbtrfs-devel described as include files and libraries for developing with Btrfs.
Would be nice to mention this in the Readme.md please.
Otherwise great tool, thank you very much!
Would it be possible to add a bit of code to show exclusive usage for a subvolume? The only way to do that that I've found on the Internet seems to be to enable quota and then use btrfs qgroup show. However, when I enable quota, btrfs becomes unusable for a long time (at least 15 minutes) after snapshot creation, so I can't afford to do that.
Right now I have something like this:
# btrfs subvolume list --sort=rootid -t /data/pg_data
ID gen top level path
-- --- --------- ----
258 14791 5 mirrors/prod-db-c01_5432
328 14791 5 snapshots/bckash_6432
# btrfs-compsize /data/pg_data/mirrors/prod-db-c01_5432
Processed 14759 files, 18988196 regular extents (22223973 refs), 27 inline.
Type Perc Disk Usage Uncompressed Referenced
TOTAL 12% 203G 1.5T 1.5T
none 100% 11G 11G 11G
zstd 12% 192G 1.5T 1.5T
# btrfs-compsize /data/pg_data/snapshots/bckash_6432
Processed 6022 files, 10202853 regular extents (12257581 refs), 21 inline.
Type Perc Disk Usage Uncompressed Referenced
TOTAL 10% 95G 905G 879G
none 100% 55M 55M 55M
zstd 10% 95G 905G 879G
snapshots/bckash_6432 was created as a snapshot of mirrors/prod-db-c01_5432. Then a lot of files in the snapshot were deleted and a few were changed. So I'd like to know how much its exclusive usage is. If I call btrfs-compsize on both directories I get this:
# btrfs-compsize /data/pg_data/mirrors/prod-db-c01_5432 /data/pg_data/snapshots/bckash_6432
Processed 20781 files, 19001055 regular extents (34492529 refs), 48 inline.
Type Perc Disk Usage Uncompressed Referenced
TOTAL 12% 203G 1.5T 2.3T
none 100% 11G 11G 11G
zstd 12% 192G 1.5T 2.3T
I don't see whether it's possible to calculate the exclusive usage for snapshots/bckash_6432 from these numbers.
So would it be possible to add a command line flag which would calculate usage only for files which are not reflinked? Then I could call btrfs-compsize without that flag on the mirror subvolume and after that call it with the flag on the snapshot subvolume.
Any other method would be fine as well; this just looks like the simplest. But I don't quite understand how the code works, so I could be wrong.
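One possible estimate from the existing output, offered only as a sketch: if Disk Usage counts every extent once across all paths given on the command line (an assumption about compsize's accounting, not something this report confirms), then the combined figure minus the figure for the mirror alone approximates what only the snapshot holds. `exclusive_estimate` is a hypothetical helper.

```python
def exclusive_estimate(usage_both, usage_other):
    """Disk usage attributable only to the second path, given the combined
    figure and the figure for the other path alone (same unit for both)."""
    return max(usage_both - usage_other, 0)

# Disk Usage in GiB from the TOTAL rows above: both paths vs. the mirror alone.
print(exclusive_estimate(203, 203))  # 0

# The extent counts from the same two runs differ only slightly.
print(19001055 - 18988196)  # 12859
```

With values rounded to whole gigabytes, anything below roughly 1G disappears in the subtraction, so exact byte output would be needed for a usable number. The result is also relative to the mirror only, not to any other snapshots that may share extents.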