
file_archive's Introduction

file_archive: A file store with searchable metadata

This is a file archiving system. You submit files along with a set of metadata, as key-value pairs, and can later retrieve the files that match conditions on this metadata.

It uses a flat file store where files are stored under their 40-character SHA-1 hash, and a SQLite3 database for the metadata.
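As an illustration of that layout, here is a minimal sketch, not the project's actual code. The function names (create_store, sketch_add, sketch_query), the database file name, and the table and column names are all assumptions made for this example; only the objects/ directory name and the use of SHA-1 and SQLite3 come from this page.

```python
import hashlib
import os
import shutil
import sqlite3


def create_store(store_dir):
    """Create the objects directory and the metadata database (hypothetical layout)."""
    os.makedirs(os.path.join(store_dir, 'objects'))
    conn = sqlite3.connect(os.path.join(store_dir, 'database'))
    conn.execute(
        "CREATE TABLE metadata(hash TEXT, mkey TEXT, mvalue TEXT, "
        "PRIMARY KEY (hash, mkey))")
    conn.commit()
    return conn


def sketch_add(store_dir, conn, path, metadata):
    """Copy `path` into the store under its SHA-1 and record its metadata."""
    sha1 = hashlib.sha1()
    with open(path, 'rb') as fp:
        for chunk in iter(lambda: fp.read(4096), b''):
            sha1.update(chunk)
    filehash = sha1.hexdigest()  # 40 hexadecimal characters
    shutil.copyfile(path, os.path.join(store_dir, 'objects', filehash))
    conn.executemany(
        "INSERT INTO metadata(hash, mkey, mvalue) VALUES (?, ?, ?)",
        [(filehash, key, value) for key, value in metadata.items()])
    conn.commit()
    return filehash


def sketch_query(conn, key, value):
    """Return the hashes of the files whose metadata has key == value."""
    rows = conn.execute(
        "SELECT hash FROM metadata WHERE mkey = ? AND mvalue = ?",
        (key, value))
    return [row[0] for row in rows]
```

A query only touches the database; the returned hash is then enough to locate the file under objects/.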

Its purpose is to be used as a persistent file store for the VisTrails workflow and provenance management system: http://www.vistrails.org/

file_archive's People

Contributors

remram44

Forkers

pombredanne

file_archive's Issues

Compare performance

Implementing the backend with MongoDB shouldn't be hard. This should probably be done and compared to the SQLite3 backend (and other SQL servers?)

File/directory collision

It is possible to craft a file that will have the same hash as a directory, by writing the text representation of the directory that is used for hashing internally.

Example:

TEMPROOT=/tmp/store.$$.$RANDOM
mkdir $TEMPROOT
cleanup(){
    rm -Rf $TEMPROOT
}
trap cleanup 0
STORE=$TEMPROOT/store
BIN="python -m file_archive"

# Adds directory
mkdir $TEMPROOT/testdir
echo foo > $TEMPROOT/testdir/foo
echo bar > $TEMPROOT/testdir/bar
$BIN $STORE create
$BIN $STORE add $TEMPROOT/testdir

# Adds a carefully crafted file: this will fail with:
# KeyError: 'This file already exists in the store'
cat > $TEMPROOT/testfile<<END
file foo f1d2d2f924e986ac86fdf7b36c94bcdf32beec15
file bar e242ed3bffccdf271b7fbaf34ed72d089537b42f
END
$BIN $STORE add $TEMPROOT/testfile
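The same collision can be reproduced in a few lines of Python. This is a sketch under assumptions: the "file <name> <sha1>" listing format is inferred from the example above, and the tagged_hash function is only one possible mitigation (prefixing the object kind and length before hashing, as git does for its objects), not something the project implements.

```python
import hashlib


def sha1_hex(data):
    return hashlib.sha1(data).hexdigest()


def dir_listing(files):
    """Text representation of a directory (format assumed from the example)."""
    return ''.join('file %s %s\n' % (name, sha1_hex(content))
                   for name, content in files).encode('ascii')


def tagged_hash(kind, data):
    """Hash with a '<kind> <length>' header followed by a NUL byte."""
    header = ('%s %d' % (kind, len(data))).encode('ascii') + b'\0'
    return sha1_hex(header + data)


directory = [('foo', b'foo\n'), ('bar', b'bar\n')]
crafted_file = dir_listing(directory)  # a file holding the listing itself

# Untagged hashing: the directory and the crafted file get the same hash
naive_collision = sha1_hex(dir_listing(directory)) == sha1_hex(crafted_file)

# Tagged hashing: the kind is part of the hashed data, so the hashes differ
tagged_collision = (tagged_hash('tree', dir_listing(directory)) ==
                    tagged_hash('blob', crafted_file))

print(naive_collision, tagged_collision)
```

With the kind in the header, a file can no longer impersonate a directory just by reproducing its listing.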

Same file, different metadata

Currently it is assumed that we will never bump into the same file again, or that if we do, it will be the same file with the exact same set of metadata.

Something should probably be provided to handle the case when we have different metadata.

The possible operations are:

  • Joining different sets of metadata (would occur when syncing?). I don't think this is what we want to do.
  • Considering the file as different if it has different metadata (i.e. hash metadata+contents)
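The second option could be sketched as follows. The entry_hash function, the JSON serialization of the metadata and the length prefix are assumptions chosen for this example; the page only says "hash metadata+contents" without fixing a format.

```python
import hashlib
import json


def entry_hash(content, metadata):
    """SHA-1 over the metadata (canonical JSON) followed by the content."""
    # Sorted keys and fixed separators make the serialization deterministic
    meta = json.dumps(metadata, sort_keys=True,
                      separators=(',', ':')).encode('utf-8')
    sha1 = hashlib.sha1()
    # The length prefix keeps the metadata/content boundary unambiguous
    sha1.update(str(len(meta)).encode('ascii') + b'\n')
    sha1.update(meta)
    sha1.update(content)
    return sha1.hexdigest()


same_a = entry_hash(b'data', {'run': '1', 'user': 'alice'})
same_b = entry_hash(b'data', {'user': 'alice', 'run': '1'})
other = entry_hash(b'data', {'run': '2', 'user': 'alice'})
print(same_a == same_b, same_a == other)
```

The trade-off is that the metadata of an entry can no longer change without changing its identifier.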

Directory handling

Right now the system only stores files.

There are two options for storing directories:

  • Do the same thing we do for files, hash it recursively and copy it to filename objects/filehash
    • Pros:
      • Easy
      • Keeps file structure (the user needs the system to obtain the path, but then he can access it in place)
    • Cons:
      • File duplication is possible
  • Do something similar to git, making a tree object that references other tree or blob objects
    • Pros:
      • No more duplication!
    • Cons:
      • Need for a GC (this can be made efficient by putting the SQL database to use)
      • Need to process the objects to turn them back into a usable format
      • Harder
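To make the second option more concrete, here is a rough sketch of git-like tree and blob objects, using an in-memory dict in place of the real store. The put and store_tree names and the "kind hash name" line format are assumptions for illustration; this is not git's actual tree encoding.

```python
import hashlib
import os


def put(objects, kind, data):
    """Store an object under the hash of its kind, length and data."""
    header = ('%s %d' % (kind, len(data))).encode('ascii') + b'\0'
    key = hashlib.sha1(header + data).hexdigest()
    objects[key] = (kind, data)  # identical objects are stored only once
    return key


def store_tree(objects, path):
    """Recursively store the directory `path`; return its tree hash."""
    lines = []
    for name in sorted(os.listdir(path)):
        full = os.path.join(path, name)
        if os.path.isdir(full):
            lines.append('tree %s %s' % (store_tree(objects, full), name))
        else:
            with open(full, 'rb') as fp:
                lines.append('blob %s %s' % (put(objects, 'blob', fp.read()),
                                             name))
    return put(objects, 'tree', '\n'.join(lines).encode('utf-8'))
```

Two subdirectories with identical contents end up sharing both their blob and their tree object, which is where the deduplication comes from; getting files back requires walking the tree objects again.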

Don't read the input file twice

If a file-like object is given to add_file, it will be read once, rewound using seek(0, SEEK_SET), and then read a second time. This is because it is hashed first and then copied to its destination.

I don't know if this is optimal. Maybe it should be read only once while both hashing it and writing it to a temporary file, and then moved to its destination.
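A single-pass version might look like the sketch below. The store_once name and the objects_dir parameter are hypothetical; the temporary file is created inside the destination directory so that the final move stays on one filesystem.

```python
import hashlib
import os
import tempfile


def store_once(fileobj, objects_dir, chunk_size=4096):
    """Read `fileobj` once, hashing it while writing it to a temporary file."""
    sha1 = hashlib.sha1()
    fd, tmp_path = tempfile.mkstemp(dir=objects_dir)
    try:
        with os.fdopen(fd, 'wb') as out:
            for chunk in iter(lambda: fileobj.read(chunk_size), b''):
                sha1.update(chunk)
                out.write(chunk)
        dest = os.path.join(objects_dir, sha1.hexdigest())
        # The name is only known once everything was read; move into place
        os.replace(tmp_path, dest)
    except BaseException:
        # Do not leave a partial temporary file behind
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return dest
```

This also works for streams that cannot be rewound, such as pipes, since seek() is never called.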

Command line interface

The API is taking shape, but we should be able to add, remove and query files from the command line.

Link handling

Maybe some special logic should be implemented for symlinks; right now the target would be copied. Make it an option? Show a warning?
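Both ideas (an option and a warning) could be combined roughly as follows. The check_symlink function and its follow parameter are hypothetical and only illustrate one possible policy.

```python
import os
import warnings


def check_symlink(path, follow=True):
    """Return the path to store; warn or refuse when `path` is a symlink."""
    if os.path.islink(path):
        if not follow:
            raise ValueError("%s is a symbolic link" % path)
        warnings.warn("%s is a symbolic link; storing its target" % path)
        return os.path.realpath(path)
    return path
```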

Test coverage

Coverage as of 4cfe3e6:

Module     Missing   Partial   Coverage
init       0         1         99%
database   11        6         92%
main       106       9         54%
parser     0         0         100%

Missing:

  • Coverage for directories
  • Tests for database module
  • Tests for command line interface

Change storage schema

The current storage scheme has the following problems:

  • (C) Collisions: different entries (= sets of metadata) with the same content won't be accepted (issue #10)
  • (F) File name selection: Matthias wants filenames to be selectable (or changeable)

Other options that come to mind are:

  • (M) Changing metadata for an entry
  • (D) Deduplication (store files once if two entries have the same content)

The schemas I have in mind are:

1

hash(content) <-> filename
               -> metadata
  • can rename file
  • can add/change metadata
  • two files with the same hash will share metadata

2

hash(content||metadata) <-> filename
                         -> metadata
  • can rename file
  • can't change metadata
  • no collisions
  • no deduplication

3

uuid <-> filename
      -> metadata
  • can rename file
  • can change metadata
  • no collisions
  • no deduplication

4

uuid -> hash(content) <-> filename
     -> metadata
  • can rename file (affects multiple entries)
  • can change metadata
  • no collisions
  • deduplication

5

hash(content||metadata) -> hash(content) <-> filename
                        -> metadata
  • can rename file (affects multiple entries)
  • can't change metadata
  • no collisions
  • deduplication

6

hash(content||metadata) <-> filename
                         -> metadata
  • can rename file
  • can't change metadata
  • no collisions
  • no deduplication

Right now:

Legacy

hash(content) = filename
              -> metadata
  • can't rename file
  • can change metadata (not implemented)
  • collisions possible
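As one way to picture schema 4 (uuid -> hash(content) <-> filename, uuid -> metadata), here is a sketch of SQLite tables. All table, column and function names are assumptions for this example, not a proposed final design.

```python
import sqlite3
import uuid

SCHEMA = """
CREATE TABLE objects(
    hash TEXT PRIMARY KEY,          -- hash(content)
    filename TEXT UNIQUE            -- hash(content) <-> filename
);
CREATE TABLE entries(
    uuid TEXT PRIMARY KEY,
    hash TEXT NOT NULL REFERENCES objects(hash)
);
CREATE TABLE metadata(
    uuid TEXT NOT NULL REFERENCES entries(uuid),
    mkey TEXT NOT NULL,
    mvalue TEXT,
    PRIMARY KEY (uuid, mkey)
);
"""


def add_entry(conn, content_hash, filename, metadata):
    """Add an entry; the content row is shared if the hash is already known."""
    conn.execute("INSERT OR IGNORE INTO objects(hash, filename) VALUES (?, ?)",
                 (content_hash, filename))
    entry_id = str(uuid.uuid4())
    conn.execute("INSERT INTO entries(uuid, hash) VALUES (?, ?)",
                 (entry_id, content_hash))
    conn.executemany(
        "INSERT INTO metadata(uuid, mkey, mvalue) VALUES (?, ?, ?)",
        [(entry_id, key, value) for key, value in metadata.items()])
    conn.commit()
    return entry_id
```

Two entries with the same content get distinct UUIDs and independent metadata while pointing at a single objects row, which matches the "no collisions" and "deduplication" properties listed for schema 4.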

Syncing stores

Sending some of the files to another store needs:

  • A way to select what to send (i.e. a query)
  • A way to transfer the files (feeding the filenames to rsync, scp, ftp...)
  • Merge the databases
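The database merge step could be sketched with SQLite's ATTACH DATABASE. The metadata table layout below is an assumption made for this example (the real schema may differ), and the file transfer itself is left to an external tool.

```python
import sqlite3

# Hypothetical layout: one row per (file hash, metadata key)
CREATE_TABLE = ("CREATE TABLE IF NOT EXISTS metadata("
                "hash TEXT, mkey TEXT, mvalue TEXT, PRIMARY KEY (hash, mkey))")


def merge_metadata(dest_db, source_db, hashes):
    """Copy the metadata rows for `hashes` from source_db into dest_db."""
    conn = sqlite3.connect(dest_db)
    try:
        conn.execute(CREATE_TABLE)
        conn.execute("ATTACH DATABASE ? AS src", (source_db,))
        for filehash in hashes:
            # Rows already present in the destination are left untouched
            conn.execute(
                "INSERT OR IGNORE INTO main.metadata(hash, mkey, mvalue) "
                "SELECT hash, mkey, mvalue FROM src.metadata WHERE hash = ?",
                (filehash,))
        conn.commit()
        conn.execute("DETACH DATABASE src")
    finally:
        conn.close()
```

The list of hashes would come from the selecting query, and the same list can be used to build the filenames handed to rsync or scp.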

file_archive <store> remove deletes everything

This works like query, so deleting everything when no condition is given is technically the correct behavior; it is dangerous, however. A --force flag should probably be required when no condition is passed (similar to rm's --preserve-root).
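A guard of that kind could be sketched with argparse as below. The parser layout, the -f/--force flag and the check_remove function are assumptions for illustration and do not describe the project's current command line.

```python
import argparse


def build_parser():
    """Hypothetical parser for: file_archive <store> remove [conditions...]"""
    parser = argparse.ArgumentParser(prog='file_archive')
    parser.add_argument('store')
    commands = parser.add_subparsers(dest='command')
    remove = commands.add_parser('remove')
    remove.add_argument('-f', '--force', action='store_true',
                        help="allow removing every file in the store")
    remove.add_argument('conditions', nargs='*')
    return parser


def check_remove(args):
    """Refuse an unconditional remove unless --force was given."""
    if not args.conditions and not args.force:
        raise SystemExit("refusing to remove every file; pass --force")


args = build_parser().parse_args(['mystore', 'remove', '--force'])
check_remove(args)  # accepted: the user explicitly asked for it
```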
