The file_archive from remram44

file_archive's Issues

Command line interface

API is taking shape, but we should be able to add, remove and query files from the command line.

Change storage schema

The current storage scheme has the following problems:

(C) Collisions: different entries (= sets of metadata) with the same content won't be accepted (issue #10)
(F) File name selection: Matthias wants filenames to be selectable (or changeable)

Other options that come to mind are:

(M) Changing metadata for an entry
(D) Deduplication (store files once if two entries have the same content)

The schemas I have in mind are:

1

hash(content) <-> filename
               -> metadata

can rename file
can add/change metadata
two files with the same hash will share metadata

2

hash(content||metadata) <-> filename
                         -> metadata

can rename file
can't change metadata
no collisions
no deduplication

3

uuid <-> filename
      -> metadata

can rename file
can change metadata
no collisions
no deduplication

4

uuid -> hash(content) <-> filename
     -> metadata

can rename file (affects multiple entries)
can change metadata
no collisions
deduplication

5

hash(content||metadata) -> hash(content) <-> filename
                        -> metadata

can rename file (affects multiple entries)
can't change metadata
no collisions
deduplication

6

hash(content||metadata) <-> filename
                         -> metadata

can rename file
can't change metadata
no collisions
no deduplication

Right now:

Legacy

hash(content) = filename
              -> metadata

can't rename file
can change metadata (not implemented)
collisions possible
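
To make schema 4 more concrete, here is a minimal sketch using an in-memory sqlite3 database. The table and column names (objects, entries, metadata, mkey) and the add_entry helper are hypothetical and not taken from file_archive, and SHA-1 is only assumed as the content hash.

```python
import hashlib
import sqlite3
import uuid


def open_db():
    """Create an in-memory database laid out like schema 4 (hypothetical names)."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE objects(hash TEXT PRIMARY KEY, filename TEXT UNIQUE);
        CREATE TABLE entries(id TEXT PRIMARY KEY,
                             hash TEXT REFERENCES objects(hash));
        CREATE TABLE metadata(entry TEXT REFERENCES entries(id),
                              mkey TEXT, value TEXT,
                              PRIMARY KEY(entry, mkey));
    """)
    return db


def add_entry(db, content, metadata):
    """Add one entry; the content is recorded once per distinct hash."""
    content_hash = hashlib.sha1(content).hexdigest()
    # Deduplication: a second entry with the same content reuses the object row.
    db.execute("INSERT OR IGNORE INTO objects(hash, filename) VALUES (?, ?)",
               (content_hash, content_hash))
    # No collisions: every entry gets its own uuid, whatever its content.
    entry_id = uuid.uuid4().hex
    db.execute("INSERT INTO entries(id, hash) VALUES (?, ?)",
               (entry_id, content_hash))
    db.executemany(
        "INSERT INTO metadata(entry, mkey, value) VALUES (?, ?, ?)",
        [(entry_id, key, value) for key, value in metadata.items()])
    return entry_id
```

In this layout, renaming a file means updating one row in objects, which is why the rename affects every entry that points at that hash.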

Waiting on VisTrails

VisTrails/VisTrails#755

Querying with a type raises a traceback

file_archive path/to/store query age=str:two

tdparser not installed automatically by pip

setup.py

Multiple query/print formats

Options should be available to control what print and query output: hashes, metadata, contents...

File/directory collision

It is possible to craft a file that will have the same hash as a directory, by writing the text representation of the directory that is used for hashing internally.

Example:

TEMPROOT=/tmp/store.$$.$RANDOM
mkdir $TEMPROOT
cleanup(){
    rm -Rf $TEMPROOT
}
trap cleanup 0
STORE=$TEMPROOT/store
BIN="python -m file_archive"

# Adds directory
mkdir $TEMPROOT/testdir
echo foo > $TEMPROOT/testdir/foo
echo bar > $TEMPROOT/testdir/bar
$BIN $STORE create
$BIN $STORE add $TEMPROOT/testdir

# Adds carefully-crafted file: this will fail with:
# KeyError: 'This file already exists in the store'
cat > $TEMPROOT/testfile<<END
file foo f1d2d2f924e986ac86fdf7b36c94bcdf32beec15
file bar e242ed3bffccdf271b7fbaf34ed72d089537b42f
END
$BIN $STORE add $TEMPROOT/testfile
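
The collision can be reproduced outside the store with a short Python sketch. It assumes, based only on the example above, that a directory is hashed as the SHA-1 of lines of the form "file <name> <sha1>"; the function names and the "blob"/"tree" type tags are hypothetical. Prefixing the hashed data with a type tag is one possible way to keep files and directories apart, not necessarily the fix the project chose.

```python
import hashlib


def sha1_hex(data):
    return hashlib.sha1(data).hexdigest()


def hash_file(data, tagged=False):
    # With tagged=True, a type prefix is hashed before the content.
    return sha1_hex((b"blob\0" if tagged else b"") + data)


def listing(entries, tagged=False):
    """Assumed text representation of a directory, one line per file."""
    lines = ["file %s %s\n" % (name, hash_file(data, tagged))
             for name, data in entries]
    return "".join(lines).encode("ascii")


def hash_dir(entries, tagged=False):
    return sha1_hex((b"tree\0" if tagged else b"") + listing(entries, tagged))


# Same directory as in the shell example: foo and bar, one line each.
entries = [("foo", b"foo\n"), ("bar", b"bar\n")]

# Untagged: a file whose content is the listing hashes like the directory.
crafted = listing(entries)
collides = hash_file(crafted) == hash_dir(entries)

# Tagged: the same trick no longer produces the directory's hash.
crafted_tagged = listing(entries, tagged=True)
still_collides = (hash_file(crafted_tagged, tagged=True)
                  == hash_dir(entries, tagged=True))
```

Note that adding a tag changes the hash of every existing file, so a store that already contains objects would need to be migrated.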

Compression?

Don't read the input file twice

If a file-like object is given to add_file, it will be read once, rewound using seek(0, SEEK_SET), and then read a second time. This is because it is hashed first and then copied to its destination.

I don't know if this is optimal. Maybe it should be read only once while both hashing it and writing it to a temporary file, and then moved to its destination.
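
The single-pass idea could look roughly like the sketch below. The function name store_once, the objects_dir argument and the chunk size are assumptions for illustration, and SHA-1 is assumed as the hash; os.replace requires Python 3.3 or later.

```python
import hashlib
import os
import tempfile


def store_once(fileobj, objects_dir, chunk_size=64 * 1024):
    """Hash fileobj while copying it, then move the copy to objects_dir/<hash>."""
    hasher = hashlib.sha1()
    # Create the temporary file in the destination directory so that the
    # final move stays on one filesystem and is a simple rename.
    fd, tmp_path = tempfile.mkstemp(dir=objects_dir)
    try:
        with os.fdopen(fd, "wb") as out:
            # Each chunk is read once and fed to both the hash and the copy.
            for chunk in iter(lambda: fileobj.read(chunk_size), b""):
                hasher.update(chunk)
                out.write(chunk)
        digest = hasher.hexdigest()
        os.replace(tmp_path, os.path.join(objects_dir, digest))
    except BaseException:
        # Do not leave a partial temporary file behind on failure.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return digest
```

This also works for file-like objects that cannot seek, such as pipes, which the read-rewind-read approach cannot handle.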

Test coverage

Coverage as of 4cfe3e6:

Module	Missing	Partial	Coverage
init	0	1	99%
database	11	6	92%
main	106	9	54%
parser	0	0	100%

Missing:

Coverage for directories
Tests for database module
Tests for command line interface

Syncing stores

Sending some of the files to another store needs:

A way to select what to send (i.e. a query)
Transfer the files (feeding the filenames to rsync, scp, ftp...)
Merge the databases

file_archive <store> remove deletes everything

This works like query, so it is technically the correct behavior, but it is dangerous. A --force flag should probably be required when no condition is passed (similar to rm's --preserve-root).
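
A minimal sketch of such a guard, using argparse. It only models the arguments that follow "remove"; the parser layout, the -f short option and the check_remove helper are assumptions and do not reflect the actual command-line code.

```python
import argparse


def build_parser():
    """Hypothetical parser for the arguments that follow 'remove'."""
    parser = argparse.ArgumentParser(prog="remove-sketch")
    parser.add_argument("conditions", nargs="*",
                        help="conditions selecting the entries to remove")
    parser.add_argument("-f", "--force", action="store_true",
                        help="allow removing everything when no condition is given")
    return parser


def check_remove(args):
    """Return the conditions, refusing to match every entry unless forced."""
    if not args.conditions and not args.force:
        raise SystemExit("refusing to remove every entry; "
                         "pass --force to confirm")
    return args.conditions
```

With this check, a bare remove exits with an error, while remove --force keeps the current delete-everything behavior.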

Add typed mkey existence condition

Add a way to select entries for which a certain metadata key, with a certain type, exists, regardless of its value.

API support (31f2aec)
test coverage (6672d8f)
command-line support
viewer support (28fa27f)

PyQt5 support

viewer.py can probably be made to support both PyQt4 and PyQt5.

Use qtpy for Qt abstraction layer

https://github.com/spyder-ide/qtpy

Directory handling

Right now the system only stores files.

There are two options for storing directories:

Do the same thing we do for files: hash the directory recursively and copy it to the filename objects/filehash
  - Pros:
    - Easy
    - Keeps the file structure (the user needs the system to obtain the path, but can then access it in place)
  - Cons:
    - File duplication is possible
Do something similar to git: make a tree object that references other tree or blob objects
  - Pros:
    - No more duplication!
  - Cons:
    - Need for a GC (this can be made efficient by putting the SQL database to use)
    - Need to process the files to turn them back into a usable format
    - Harder
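
The git-like option can be sketched in a few lines. Here the object store is a plain dict, and the put and add_path helpers, the "tree"/"blob" tags and the listing format are all hypothetical; a real implementation would write objects to disk and track them in the database. Symbolic links and file modes are ignored.

```python
import hashlib
import os


def put(objects, kind, payload):
    """Store payload under the hash of its type tag plus content."""
    data = kind + b"\0" + payload
    key = hashlib.sha1(data).hexdigest()
    objects[key] = data  # identical content always maps to the same key
    return key


def add_path(objects, path):
    """Add a file as a blob, or a directory as a tree of hashes; return its key."""
    if os.path.isdir(path):
        lines = []
        # Sort the names so the tree hash does not depend on listing order.
        for name in sorted(os.listdir(path)):
            child = add_path(objects, os.path.join(path, name))
            lines.append("%s %s\n" % (child, name))
        return put(objects, b"tree", "".join(lines).encode("utf-8"))
    with open(path, "rb") as f:
        return put(objects, b"blob", f.read())
```

Two directories with identical contents end up as a single tree object, and getting a usable directory back requires walking the tree and writing the blobs out again, which is the processing cost listed under the cons.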

Do non-equality queries from the command line

Same file, different metadata

Currently it is assumed that we will never bump into the same file again, or that if we do, it will be the same file with the exact same set of metadata.

Something should probably be provided to handle the case when we have different metadata.

The possible operations are:

Joining different sets of metadata (would occur when syncing?). I don't think this is what we want to do.
Considering the file as different if it has different metadata (i.e. hash metadata+contents)
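
The second option, hashing metadata together with contents, might be sketched as follows. The entry_hash name, the JSON serialization of the metadata and the use of SHA-1 are assumptions for illustration; the important point is that the metadata needs a canonical form so that the same set of metadata always produces the same hash.

```python
import hashlib
import json


def entry_hash(content, metadata):
    """Hash the content together with a canonical form of its metadata."""
    content_hash = hashlib.sha1(content).hexdigest()
    # Sorted keys and fixed separators make the serialization independent
    # of the order in which the metadata was given.
    canonical = json.dumps(metadata, sort_keys=True, separators=(",", ":"))
    hasher = hashlib.sha1()
    hasher.update(content_hash.encode("ascii"))
    hasher.update(b"\0")  # separator between the two parts
    hasher.update(canonical.encode("utf-8"))
    return hasher.hexdigest()
```

Hashing the content hash rather than the raw content keeps the content hash available on its own, which would still allow deduplicating the stored data.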
