remram44 / file_archive
A file store with searchable metadata
License: Other
API is taking shape, but we should be able to add, remove and query files from the command line.
The current storage scheme has the following problems:
Other options that come to mind are:
The schemas I have in mind are:
hash(content) <-> filename
-> metadata
hash(content||metadata) <-> filename
-> metadata
uuid <-> filename
-> metadata
uuid -> hash(content) <-> filename
-> metadata
hash(content||metadata) -> hash(content) <-> filename
-> metadata
hash(content||metadata) <-> filename
-> metadata
Right now:
hash(content) = filename
-> metadata
file_archive path/to/store query age=str:two
Options should be available to control what print/query output: hashes, metadata, contents...
It is possible to craft a file that will have the same hash as a directory, by writing the text representation of the directory that is used for hashing internally.
Example:
TEMPROOT=/tmp/store.$$.$RANDOM
mkdir $TEMPROOT
cleanup(){
rm -Rf $TEMPROOT
}
trap cleanup 0
STORE=$TEMPROOT/store
BIN="python -m file_archive"
# Adds directory
mkdir $TEMPROOT/testdir
echo foo > $TEMPROOT/testdir/foo
echo bar > $TEMPROOT/testdir/bar
$BIN $STORE create
$BIN $STORE add $TEMPROOT/testdir
# Adds a carefully crafted file; this will fail with:
# KeyError: 'This file already exists in the store'
cat > $TEMPROOT/testfile<<END
file foo f1d2d2f924e986ac86fdf7b36c94bcdf32beec15
file bar e242ed3bffccdf271b7fbaf34ed72d089537b42f
END
$BIN $STORE add $TEMPROOT/testfile
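One possible way to avoid this collision is to prefix the hashed bytes with a type tag, so that a file and a directory can never feed the same input to the hash function. The sketch below is not the project's code: it assumes SHA-1 (the digests in the example above are the SHA-1 of `foo\n` and `bar\n`) and assumes the directory representation is one `file <name> <hash>` line per entry, as in the crafted file. All function names are hypothetical.

```python
import hashlib


def dir_listing(entries):
    """Assumed text form of a directory: one 'file <name> <sha1>' line per entry."""
    lines = ["file %s %s\n" % (name, digest) for name, digest in entries]
    return "".join(lines).encode("utf-8")


def naive_hash_file(content):
    # Hashes the raw bytes, so a file containing a listing looks like a directory.
    return hashlib.sha1(content).hexdigest()


def naive_hash_dir(entries):
    return hashlib.sha1(dir_listing(entries)).hexdigest()


def tagged_hash_file(content):
    # Hypothetical fix: a type tag makes file and directory inputs disjoint.
    return hashlib.sha1(b"file\0" + content).hexdigest()


def tagged_hash_dir(entries):
    return hashlib.sha1(b"dir\0" + dir_listing(entries)).hexdigest()


# Same layout as the shell example: a directory holding 'foo' and 'bar'.
entries = [
    ("foo", hashlib.sha1(b"foo\n").hexdigest()),
    ("bar", hashlib.sha1(b"bar\n").hexdigest()),
]
crafted_file = dir_listing(entries)  # bytes of the carefully crafted file

collides_naive = naive_hash_file(crafted_file) == naive_hash_dir(entries)
collides_tagged = tagged_hash_file(crafted_file) == tagged_hash_dir(entries)
```

With the naive functions the crafted file and the directory hash to the same value; with the tagged ones they do not, because every file input starts with `file\0` and every directory input starts with `dir\0`. Changing the hashing scheme would change the identifiers of objects already in a store, so this would need a migration.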
If a file-like object is given to add_file, it will be read once, rewound using seek(0, SEEK_SET), and then read a second time. This is because it is hashed first and then copied to its destination.
I don't know if this is optimal. Maybe it should be read only once while both hashing it and writing it to a temporary file, and then moved to its destination.
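The single-pass alternative could look roughly like the following. This is a sketch under assumptions, not the current `add_file` implementation: it assumes Python 3, SHA-1 as the content hash, and a flat store directory where the destination file is named after the digest. The function name `store_single_pass` is hypothetical.

```python
import hashlib
import io
import os
import tempfile


def store_single_pass(fileobj, store_dir, chunk_size=64 * 1024):
    """Hash a binary file-like object while copying it, reading it only once.

    Returns (digest, destination_path). The source does not need to be seekable.
    """
    hasher = hashlib.sha1()
    # Create the temporary file inside the store so the final rename stays on
    # one filesystem and no second copy is needed.
    fd, tmp_path = tempfile.mkstemp(dir=store_dir)
    try:
        with os.fdopen(fd, "wb") as tmp:
            for chunk in iter(lambda: fileobj.read(chunk_size), b""):
                hasher.update(chunk)
                tmp.write(chunk)
        digest = hasher.hexdigest()
        dest = os.path.join(store_dir, digest)
        os.replace(tmp_path, dest)  # overwrites an existing file of that name
    except BaseException:
        # Do not leave a partial temporary file behind on failure.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return digest, dest
```

For example, `store_single_pass(io.BytesIO(b"foo\n"), "/path/to/objects")` writes the four bytes once and returns their digest. The sketch does not check whether the object already exists before replacing it; a real implementation would have to decide where that check goes.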
Coverage as of 4cfe3e6:
| Module | Missing | Partial | Coverage |
|---|---|---|---|
| `__init__` | 0 | 1 | 99% |
| database | 11 | 6 | 92% |
| `__main__` | 106 | 9 | 54% |
| parser | 0 | 0 | 100% |
Missing:
Sending some of the files to another store needs:
This works the same way as query, so it is the correct behavior; however, it is dangerous. A --force flag should probably be required when no condition is passed (similar to rm's --preserve-root).
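A guard of this kind could be sketched with `argparse` as follows. Only the `--force` name comes from the text above; the parser layout, the program name, and the helper `check_conditions_or_force` are assumptions made for illustration and do not reflect the real command-line interface.

```python
import argparse


def build_parser():
    # Hypothetical parser for a command that takes query-style conditions.
    parser = argparse.ArgumentParser(prog="file_archive-sketch")
    parser.add_argument("conditions", nargs="*",
                        help="conditions such as age=str:two")
    parser.add_argument("--force", action="store_true",
                        help="allow running with no condition at all")
    return parser


def check_conditions_or_force(conditions, force):
    """Refuse to match every file unless the caller explicitly asked for it."""
    if not conditions and not force:
        raise SystemExit("no condition given: this would match every file; "
                         "pass --force to proceed")


def parse_and_check(argv):
    args = build_parser().parse_args(argv)
    check_conditions_or_force(args.conditions, args.force)
    return args
```

Here `parse_and_check(["age=str:two"])` and `parse_and_check(["--force"])` both succeed, while `parse_and_check([])` exits with an error message.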
viewer.py can probably be made to support both PyQt4 and PyQt5.
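One common approach is to try each binding in turn and use the first one that imports. The helper below is a generic sketch, not code from `viewer.py`; the name `import_first_available` is hypothetical.

```python
import importlib


def import_first_available(module_names):
    """Return the first module in module_names that can be imported."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise ImportError("none of these modules could be imported: %s"
                      % ", ".join(module_names))
```

A viewer could then call `import_first_available(["PyQt5", "PyQt4"])`. Note that picking the package is only part of the work: in PyQt5 the widget classes live in `QtWidgets`, whereas in PyQt4 they are in `QtGui`, so the viewer would also need to import its widget classes from the matching submodule.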
Right now the system only stores files.
There are two options for storing directories:
Currently it is assumed that we will never bump into the same file again, or that if we do, it will be the same file with the exact same set of metadata.
Something should probably be provided to handle the case when we have different metadata.
The possible operations are:
Maybe some special logic should be implemented for symlinks; right now the target would be copied. Make it an option? Show a warning?
Implementing the backend with MongoDB shouldn't be hard. This should probably be done and compared to the SQLite3 backend (and other SQL servers?)