
dedupsqlfs's People

Contributors

disconnect3d, sergey-dryabzhinsky


dedupsqlfs's Issues

Snapshot cleanup by plan removes first (oldest) yearly snapshot

The goal of the plan is to keep the first (oldest) snapshot in each interval.
So if the plan is 2y,6m,8w,14d, the retained set must contain:

  • at least 2 yearly snapshots, plus the oldest one in the 2-year interval
  • at least 6 monthly snapshots, plus the oldest one in the 6-month interval
  • at least 8 weekly snapshots, plus the oldest one in the 8-week interval
  • at least 14 daily snapshots, plus the oldest one in the 14-day interval

All of these sets can intersect with each other.

The timeline needs to be recalculated to keep the right number of snapshots.
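A minimal sketch of that recalculation, assuming snapshots are identified by creation timestamp (the function name and signature are illustrative, not dedupsqlfs code):

```python
# For one plan entry (e.g. "2y"): keep the newest `count` snapshots in the
# window, plus the oldest snapshot inside the window, so the first yearly
# snapshot is never removed.
from datetime import datetime, timedelta

def select_keep(snapshots, window, count, now=None):
    now = now or datetime.now()
    in_window = [s for s in snapshots if now - s <= window]
    keep = set(sorted(in_window, reverse=True)[:count])  # newest `count`
    if in_window:
        keep.add(min(in_window))  # always keep the oldest in the interval
    return keep

# Monthly snapshots through 2017; a "keep 2 within 1 year" entry keeps the
# two newest plus the oldest one still inside the year.
snaps = [datetime(2017, m, 1) for m in range(1, 13)]
kept = select_keep(snaps, timedelta(days=365), 2, now=datetime(2017, 12, 15))
```

The full plan would union the keep-sets of all entries (2y, 6m, 8w, 14d), which is why the sets may intersect.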

Fix truncated down files data

When a file is truncated down, its blocks/index entries are not deleted and the block data is not zeroed.
Indexes are flushed only during defragmentation.
Because the block data is not truncated and block sizes are not adjusted, subvolume stats are incorrect (sparse sizes).

We need to detect inode size changes. If the file may have been truncated, recalculate the block count and the last block's size, and truncate the data to the new size.
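The check can be sketched as a pure helper (names assumed, not the project's real API):

```python
# Given the old and new inode sizes, decide which blocks to drop and how
# long the surviving last block must be after a truncate-down.
def truncate_plan(old_size, new_size, block_size):
    """Return (first_block_to_drop, last_block_index, last_block_length),
    or None when the file did not shrink."""
    if new_size >= old_size:
        return None                      # grow or no-op: nothing to delete
    if new_size == 0:
        return (0, None, 0)              # drop every block
    last_block = (new_size - 1) // block_size
    last_len = new_size - last_block * block_size
    return (last_block + 1, last_block, last_len)

# 300 KiB file truncated to 100 KiB with 128 KiB blocks: drop blocks from
# index 1 on, and trim block 0 to 102400 bytes.
plan = truncate_plan(300 * 1024, 100 * 1024, 128 * 1024)
```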

Block size per inode

Add the possibility to configure the block size per inode, except for directories and special files.

Add command interface via socket

To invoke snapshotting, vacuum, and other commands while the FS is mounted.
The trick is to avoid conflicts between threads, FUSE operations, and SQLite.

Try to:

  • start an RPC socket server in a separate thread
  • use a pipe/queue in the main thread to receive commands from the RPC server
  • trigger queue processing from a FUSE operation such as file creation or a stat change
  • use the socket in the do command
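The steps above can be sketched as follows (socket path and command names are made up; the point is that only the main thread ever touches SQLite):

```python
import queue
import socket
import threading

commands = queue.Queue()

def rpc_server(path="/tmp/dedupsqlfs-cmd.sock"):
    # Background thread: accept a command over a Unix socket and only
    # enqueue it; never call into FUSE or SQLite from here.
    srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    srv.bind(path)
    srv.listen(1)
    while True:
        conn, _ = srv.accept()
        with conn:
            cmd = conn.recv(4096).decode().strip()
            commands.put(cmd)
            conn.sendall(b"queued\n")

def drain_commands():
    # Called from a FUSE callback (e.g. setattr) in the main thread.
    handled = []
    while not commands.empty():
        cmd = commands.get_nowait()
        handled.append(cmd)  # dispatch to snapshot/vacuum handlers here
    return handled
```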

Update Zstd to 1.0.0

Use latest stable version.

This version can be built without legacy format support.

Probably others too: 0.6.1, 0.4.7.
v0.3.6 must remain with legacy support.

Gzip snapshots

Compress the tree, inode, and block-index tables with gzip after snapshot creation.
This can reduce the size of snapshotted data by up to 75%.

This needs to be supported by the defragment and table open/close code.
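A sketch of the pack/unpack side, assuming each table lives in its own file (the .gz naming convention is an assumption):

```python
import gzip
import os
import shutil

def pack_table(path):
    # Run after snapshot creation: gzip the table file, drop the original.
    with open(path, "rb") as src, gzip.open(path + ".gz", "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(path)

def unpack_table(path):
    # Needed by the defragment and table open/close code before access.
    with gzip.open(path + ".gz", "rb") as src, open(path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(path + ".gz")
```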

Extract snapshot

Extract all snapshot data into another copy of dedupsqlfs, as a new filesystem.

Rehash action

Change the hashing algorithm for the FS and recalculate all hashes with it.

Support sparse files

There are two ways to handle sparse areas in files:

  1. If a block is full of zeroes, don't hash, compress, or write it.
    Still write the block size to the DB.
  2. If a block has many zeroes at its end, at least 1024 bytes or 10% of the block, then hash, compress, and write only the valuable bytes.
    Write the "real" block size, including the zeroes, to the DB.
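The two strategies can be combined into one classifier (the thresholds come from point 2 above; the names are illustrative):

```python
def classify_block(block, block_size):
    # Strategy 1: an all-zero block stores nothing, only its size.
    if block.count(0) == len(block):
        return ("all-zero", b"")
    # Strategy 2: a long zero tail (>= 1024 bytes or >= 10% of the block)
    # stores only the valuable bytes; the "real" size goes to the DB.
    stripped = block.rstrip(b"\x00")
    tail = len(block) - len(stripped)
    if tail >= 1024 or tail >= block_size // 10:
        return ("tail-zero", stripped)
    return ("plain", block)
```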

Recompress on the fly

  • add an "isDeprecated" flag to compressors
  • add a "--recompress-on-fly" option to the mount action
  • check the compression type of every block read and written
    • if it is deprecated, mark the block as "to write / update", compress it again, and save it
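A hedged sketch of the read/write hook (zlib stands in for the real compressors; the method names and row fields are invented):

```python
import zlib

DEPRECATED = {"deprecated-zlib1"}  # made-up method names for illustration

def maybe_recompress(row):
    # On block read or write: if its method is deprecated, re-encode the
    # data and mark the block "to write / update".
    if row["method"] in DEPRECATED:
        raw = zlib.decompress(row["data"])
        row.update(data=zlib.compress(raw, 9), method="zlib9", dirty=True)
    return row
```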

Wrong table names in files per subvolume

It was designed to change only the file name used for table storage, not the table name inside that file, so the file can be copied back.

But sometimes things get worse: indexes may be doubled after making a snapshot.

Try to use multi-threaded compression for fast methods

For example, lz4, zstd, and lzo can be really fast,
while multi-process compression wastes more time on inter-process communication.
We need to test whether a multi-threaded version would be faster.

Add:

  • a multi-threaded compression tool class
  • a new option to switch between the multi-threaded and multi-process versions
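A quick sketch of the thread-pool variant (zlib stands in for lz4/zstd here; what matters is that these C extensions release the GIL while compressing, so threads can actually run in parallel):

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

def compress_blocks_mt(blocks, workers=4):
    # Compress blocks in parallel threads; no inter-process copies needed.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda b: zlib.compress(b, 6), blocks))
```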

Add support for cython

Some parts/modules of DdSF can be compiled with Cython:

  • dedupsqlfs/lib/cache
  • dedupsqlfs/fuse

This can be used during packaging.

Recompression action

Create a "recompression" action in the "do" app to "remove" an unneeded compression algorithm from the FS. Don't forget to NULL all subvolume stats after that.

Make three modules for quicklz

The QuickLZ compression level is chosen at compilation time.
What needs to be done:

  1. Keep the "old" module version for compatibility
  2. Build three versions with different compilation options: quicklzf(ast), quicklzm(edium) & quicklzb(est)

Cleanup snapshots by some plan

What this means:

  • remove old snapshots
  • keep only those selected by certain periods

Like --cleanup-old-snapshots-by-plan=14d:8w=1:18m=1:3y, so it will keep:

  • the 14 latest daily snapshots (distance <= 1 day between them)
  • 8 weekly: Monday, or any day of the week if no other exists (distance <= 1 week between them)
  • 18 monthly: the 1st day, or any day if no other exists (distance <= 1 month between them)
  • 3 yearly: one per year (distance <= 1 year between them)
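The plan string could be parsed roughly like this (the grammar is assumed from the example above: colon-separated entries, units d/w/m/y, and an optional "=1" flag):

```python
import re

def parse_plan(plan):
    # "14d:8w=1:18m=1:3y" -> [(14,'d',0), (8,'w',1), (18,'m',1), (3,'y',0)]
    entries = []
    for part in plan.split(":"):
        m = re.fullmatch(r"(\d+)([dwmy])(?:=(\d+))?", part)
        if not m:
            raise ValueError("bad plan entry: %r" % part)
        count, unit, flag = m.groups()
        entries.append((int(count), unit, int(flag) if flag else 0))
    return entries
```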

Caught exception in read(): 'NoneType' object is not subscriptable

2017-05-21 05:05:17,751 - DedupFS - ERROR - Traceback (most recent call last):
  File "/opt/rusoft/dedupsqlfs/dedupsqlfs/fuse/operations.py", line 765, in read
    data = self.__get_block_data_by_offset(fh, offset, size)
  File "/opt/rusoft/dedupsqlfs/dedupsqlfs/fuse/operations.py", line 1480, in __get_block_data_by_offset
    block = self.__get_block_from_cache(inode, n + first_block_number)
  File "/opt/rusoft/dedupsqlfs/dedupsqlfs/fuse/operations.py", line 1399, in __get_block_from_cache
    self.getLogger().debug("-- db size: %s" % len(item["data"]))
TypeError: 'NoneType' object is not subscriptable
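The likely fix, sketched outside the real class (method and cache names are simplified from the traceback): check for None before subscripting the cache entry.

```python
def get_block_from_cache(cache, inode, block_number, read_block):
    # cache: dict keyed by (inode, block_number); read_block re-reads the
    # block from the database when the cache entry is missing or empty.
    item = cache.get((inode, block_number))
    if item is None or item.get("data") is None:
        item = {"data": read_block(inode, block_number)}
        cache[(inode, block_number)] = item
    return item
```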

Start and stop mysqld server separately from main commands

Make mysqld server startup and stop separate commands or options for do/mkfs commands.
Goal:

  • start server and make fs
  • mount fs and connect to started server
  • operate with files, sync backups, etc.
  • umount fs
  • make snapshot
  • print statistics
  • stop server

It can save about 10-30 seconds on each operation by not starting and stopping the server every time.

Add timer-thread to push fs events and cache flush-expire

During some very long rsync runs the filesystem consumes memory and eventually stalls.

To gradually drop caches we need to touch the root of the filesystem to generate an event (setattr).
That will trigger the cache cleanup procedures.
Just create a thread that calls os.utime(mountPoint, None) every second,
until FUSE destroy.
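That thread is small enough to sketch directly (only start_toucher and the stop event are invented names):

```python
import os
import threading

def start_toucher(mount_point, stop_event, interval=1.0):
    # Touch the mount point every `interval` seconds so FUSE delivers a
    # setattr event that drives the cache flush/expire procedures.
    def loop():
        while not stop_event.wait(interval):  # True once stop_event is set
            os.utime(mount_point, None)
    t = threading.Thread(target=loop, daemon=True)
    t.start()
    return t

# On FUSE destroy: stop_event.set(); thread.join()
```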

Setuptools module deprecated

We should use the distutils module distributed with Python.

The LZ4 & ZSTD modules are affected.

The problem is that the setuptools module does not exist in old distros like Debian Wheezy and Ubuntu Lucid.

Store snapshot statistics in subvol table

And update the statistics only if the subvolume FS was modified.

  1. Store statistics only for snapshots (readonly=True)
  2. Recalculate on umount only if modified
  3. Store in the table only if the subvolume was modified, is a snapshot, and has no stats saved yet; do this when stats are requested and on snapshot creation

The current root subvolume can gather stats on umount, behind an option, and only if modified.

This speeds up snapshot statistics output and subvolume list stats.

Stats:

  1. apparent size
  2. unique size
  3. sparse size
  4. dedup size
  5. compressed size
  6. comp. uniq. size
  7. compression type stats
    As JSON, in one blob field.
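The blob could look like this (the field names are assumptions, mirroring the list above):

```python
import json

stats = {
    "apparent_size": 0,
    "unique_size": 0,
    "sparse_size": 0,
    "dedup_size": 0,
    "compressed_size": 0,
    "compressed_unique_size": 0,
    "compression_types": {"zstd": 0, "lz4": 0},  # per-method counters
}
# One blob field in the subvol table, refreshed only when modified.
blob = json.dumps(stats).encode()
```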

Add special option for decompress tryout variants

Add --decompress-tryout=method1:method2,method3:method4,....
Enable it instead of --decompress-try-all to speed up decompression on errors.

Behaviour:

  • if method1 fails, try method2
  • if the whole group fails, move on to the next group (method3:method4)
  • if every group fails, raise an error
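A sketch of that behaviour (groups are the comma-separated parts of the option, alternatives the colon-separated ones; bz2/zlib stand in for real methods):

```python
import bz2
import zlib

METHODS = {"zlib": zlib.decompress, "bz2": bz2.decompress}

def decompress_tryout(data, groups):
    # groups example: [["method1", "method2"], ["method3", "method4"]]
    for group in groups:
        for name in group:
            try:
                return METHODS[name](data)
            except Exception:
                continue  # this method failed, try the next one
    raise ValueError("no tryout method could decompress the block")
```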
