datafs's Introduction

DataFS Data Management System

DataFS is a package manager for data. It manages file versions, dependencies, and metadata for individual use or large organizations.

Configure and connect to a metadata Manager and multiple data Services using a specification file and you'll be sharing, tracking, and using your data in seconds.

Features

  • Explicit version and metadata management for teams
  • Unified read/write interface across file systems
  • Easily create out-of-the-box configuration files for users
  • Track data dependencies and usage logs
  • Use datafs from python or from the command line
  • Permissions handled by managers & services, giving you control over user access

Usage

First, configure an API. Don't worry. It's not too bad. Check out the quickstart to follow along.

We'll assume we already have an API object created and attached to a service called "local". Once you have this, you can start using DataFS to create and use archives.

$ datafs create my_new_data_archive --description "a test archive"
created versioned archive <DataArchive local://my_new_data_archive>

$ echo "initial file contents" > my_file.txt

$ datafs update my_new_data_archive my_file.txt

$ datafs cat my_new_data_archive
initial file contents

Versions are tracked explicitly. Bump versions on write, and read old versions if desired.

$ echo "updated contents" > my_file.txt

$ datafs update my_new_data_archive my_file.txt --bumpversion minor
uploaded data to <DataArchive local://my_new_data_archive>. version bumped 0.0.1 --> 0.1.0

$ datafs cat my_new_data_archive
updated contents

$ datafs cat my_new_data_archive --version 0.0.1
initial file contents

Pin versions using a requirements file to set the default version:

$ echo "my_new_data_archive==0.0.1" > requirements_data.txt

$ datafs cat my_new_data_archive
initial file contents

All of these features are available from (and faster in) python:

>>> import datafs
>>> api = datafs.get_api()
>>> archive = api.get_archive('my_new_data_archive')
>>> with archive.open('r', version='latest') as f:
...     print(f.read())
...
updated contents

If you have permission to delete archives, it's easy to do. See administrative tools for tips on setting permissions.

$ datafs delete my_new_data_archive
deleted archive <DataArchive local://my_new_data_archive>

See examples for more extensive use cases.

Installation

pip install datafs

Additionally, you'll need a manager and services:

Managers:

  • MongoDB: pip install pymongo
  • DynamoDB: pip install boto3

Services:

  • Ready out-of-the-box:
    • local
    • shared
    • mounted
    • zip
    • ftp
    • http/https
    • in-memory
  • Requiring additional packages:
    • AWS/S3: pip install boto
    • SFTP: pip install paramiko
    • XMLRPC: pip install xmlrpclib

Requirements

For now, DataFS requires Python 2.7. We're working on Python 3 support.

Todo

See the issues to view and add to our todos.

Credits

This package was created by Justin Simcock and Michael Delgado of the Climate Impact Lab. Check us out on github.

Major kudos to the folks at PyFilesystem. Thanks also to audreyr for the wonderful cookiecutter package, and to Pyup, a constant source of inspiration and our silent third contributor.

datafs's Issues

Data requirements file

Support a path to a requirements_data.txt file, which specifies default archive versions

Purpose

Allow easy, transparent archive dependency pinning in a data project

UI suggestion

A requirements_data.txt file might look like this

var1==1.0.4
var2==9.1.14a2

This would mean the following:

# normal behavior
datafs create_archive my_arch --versioned True --requirements "requirements_data.txt"

# downloads var1 --version 1.0.4 (NOT LATEST!)
datafs download var1 --requirements "requirements_data.txt"

# downloads var2 version 8.0 (named version overrides requirements_data.txt and latest)
datafs download var2 --version 8.0 --requirements "requirements_data.txt"

# downloads var3's latest release (2.0)
datafs download var3 --requirements "requirements_data.txt"

# creates a new version of my_arch with dependencies=[(var1, 1.0.4), (var2, 8.0), (var3, 2.0)]
datafs update my_arch my_arch.txt --requirements "requirements_data.txt" --dependency var1 --dependency "var2==8.0" --dependency var3

These behaviors should be replicated (and therefore implemented) in the python API as well -- not just the CLI.

Implementation

Relies on versioned dependency management in #63 and CLI implementation in #69. Additionally, specifying the path to a requirements_data.txt in a local config file as suggested in #67 would be great.

Setting default download versions on DataArchive

This could be done by adding a default_version attribute to DataArchive. This would be specific to an archive instance - not an entry in a manager - so that the default download would be specific to each user and instance. Right now the default version is None, which downloads the latest version.
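A minimal sketch of what this might look like (the attribute name follows the issue text; _open_version is a hypothetical internal helper):

class DataArchive(object):

    def __init__(self, api, archive_name, default_version=None):
        self.api = api
        self.archive_name = archive_name
        self.default_version = default_version  # None means "latest"

    def open(self, mode='r', version=None, *args, **kwargs):
        # fall back to the per-instance default when no version is requested
        if version is None:
            version = self.default_version
        return self._open_version(mode, version, *args, **kwargs)  # hypothetical helper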

remove DataArchive properties that rely on manager calls

The fact that these are properties is potentially confusing and could lead to unintended calls to the manager.

Propose changing the following properties into functions:

  • latest_version --> get_latest_version
  • versions --> get_versions
  • latest_hash --> get_latest_hash
  • history --> get_history
  • metadata --> get_metadata

These changes would break tests and docs/examples which explicitly rely on these properties.
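A minimal sketch of the change for one of these, assuming the archive holds a reference to the api and the manager exposes a get_latest_version call:

class DataArchive(object):

    # before: a property that silently triggers a manager round-trip
    # @property
    # def latest_version(self):
    #     return self.api.manager.get_latest_version(self.archive_name)

    # after: an explicit method, so the manager call is visible at the call site
    def get_latest_version(self):
        return self.api.manager.get_latest_version(self.archive_name)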

Set Metadata Spec

  1. Enable admin to update metadata spec for given project via the manager
  2. Save the schema somewhere in the manager so it can easily be accessed any time an item is updated
  3. Create a hook to reference this setting when an item is updated via the manager
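A hedged sketch of the hook in step 3, assuming the manager persists the spec in its own table (the _get_spec_config and _update_spec_config helpers are hypothetical):

class Manager(object):

    def set_required_metadata(self, required_fields):
        # step 2: save the schema in the manager so it is available on every update
        self._update_spec_config('required_metadata', list(required_fields))

    def _validate_metadata(self, metadata):
        # step 3: called before any archive metadata write goes through the manager
        required = self._get_spec_config('required_metadata')
        missing = [field for field in required if field not in metadata]
        if missing:
            raise KeyError('missing required metadata fields: {}'.format(missing))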

Version overwrite on simultaneous write of same version

If two users are trying to write to the same archive at the same time, two versions with the same version number will be uploaded to the manager. These two versions will point to the same path in the fs and the last closed file will be the object associated with that version.

{'_id': 'big climate data',
 'versions': [
     {'version': '1.0.0', 'dependencies': [('arch1', '1.0.1')]},
     {'version': '1.1.0', 'dependencies': [('arch1', '1.1.0'), ('arch2', '1.0.0')]},
     {'version': '1.1.0', 'dependencies': [('arch1', '1.0.1')]}]}

In this case, the manager will reflect both version 1.1.0 entries but datafs will pull the last entry, not reflecting the earlier dependency update.

Add download & retainment preferences differentiated by service

Presumably, if you have a remote and a local service, you don't want to keep every version of your data on the local system.

If we implement #10 it would be great if we could specify a config file something like this:

services:
    local:
        service: fs.osfs.OSFS
        service_path: '~/my_data_files/'
        retain_versions: 1
        size_warning: 500MB
        size_limit: 5GB

    osdc:
        service: fs.s3fs.S3FS
        service_config: '~/.aws/config'
        bucket: my-data-file
        retain_versions: -1
        size_warning: -1
        size_limit: -1

Extend CLI functionality

I think core commands should be:

configure [**kwargs]
configure add authority [**kwargs]
configure add manager [**kwargs]
list [--prefix prefix]
create_archive <archive_name> [--authority auth_name] [**kwargs]
metadata <archive_name>
versions <archive_name>
update <archive_name> [filepath] [**kwargs]
download <archive_name> [filepath]
get-version <archive_name> filepath   # hashes the file and gives the version ID from the archive
delete <archive_name>

Besides adding get-version, this would mean extending configure to enable the creation of managers and authorities, making the CLI a fully featured version of datafs. This could be done by allowing the user to specify arbitrary keyword arguments to create an api object, then writing the object back to the config file.

standardize core API functions

Propose renaming key functions

DataAPI

create_archive --> create

CLI

create_archive --> create
upload --> update

Propose adding the following functions:

DataAPI

Add the following methods:

  • list()
  • update(archive_name, filepath, **kwargs)
  • metadata(archive_name)
  • versions(archive_name)
  • download(archive_name, filepath)

provide a means of adding files directly from an authority

Currently there is no way to add to an archive from an authority. This prevents us from indexing external files, such as NASA data, and also prevents us from adding data that is currently on OSDC.

In order to do this, we would need to develop a way of handling "version checking" in a world where the authority's hash value is unknown. Perhaps in that situation we should download it on first read and store the hash of the downloaded file? But what do we do if we download it again from the same authority and the hash is different?

Default version on archives should be last stable release, not last release

archive.latest_version should not return an alpha or beta release.

This could be done by changing DataArchive.latest_version:

return max(versions)

could be changed to

# filter out versions with a non-None prerelease attribute
releases = filter(lambda v: v.prerelease is None, versions)

if len(releases) > 0:
    return max(releases)
else:
    return max(versions)

Frequent BackUp of metadata table to S3 or other data store

There is currently no backup of the metadata table to a separate data store. If the table gets deleted or goes down, all your work is lost. A possible solution would give admin users the rights to back up tables at defined intervals as well as run ad-hoc backups:

# recurring
api.manager.backup_metadata_table(table_name='my_metadata_table', backup_location='s3', interval='hourly')

# ad-hoc
api.manager.backup_metadata_table(table_name='my_metadata_table', backup_location='s3')

Add DataFile methods to write file to a local path

For both developer use (specifically, using DataFS+netCDF) and non-developer use (calls by other languages) we need to be able to write the file to a local path.

For developer use, we could avoid copying the file if it is already on the local system: a method could simply return the local filesystem path if it exists, or create a temporary file and return its path if the data is on a remote service.

For CLI calls, the user would likely have to specify the path. Therefore, we would probably have to write the file to the specified path.
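A minimal sketch of such a method, assuming hypothetical is_local and get_system_path accessors on the archive; remote data is copied to a temporary file whose path is returned:

import shutil
import tempfile

def get_local_path(archive, version=None):
    if archive.is_local():                       # hypothetical check
        return archive.get_system_path(version)  # hypothetical accessor

    # remote service: copy the data into a temporary file and hand back its path
    tmp = tempfile.NamedTemporaryFile(delete=False)
    with archive.open('rb', version=version) as f:
        shutil.copyfileobj(f, tmp)
    tmp.close()
    return tmp.name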

Consider metadata & archive list caching

Right now we don't cache anything, so this would require very careful consideration. However, it would be great to speed up the CLI and archive search functions.

Generalize API initialization requirements

This would allow subclassers to easily change the list of required metadata using a class attribute.

I'm thinking something like:

class DataAPI(object):

    DatabaseName = 'MyDatabase'
    DataTableName = 'DataFiles'

    TimestampFormat = '%Y%m%d-%H%M%S'

    RequiredUserData = ['username', 'contact']

    def __init__(self, download_priority=None, upload_services=None, **userdata):
        for d in self.RequiredUserData:
            if d not in userdata:
                raise KeyError('{} argument "{}" required'.format(self.__class__.__name__, d))

        self.userdata = dict(userdata)

        self.manager = None
        self.services = {}

        self._download_priority = download_priority
        self._upload_services = upload_services

This would also require changing all references to api.username and api.contact throughout the package.

Write remote tutorial

Preferred specification for remote tutorial is a DynamoDB manager (See #6) and an AWS+SFTP service. Using multiple services requires passing #7

The tutorial walk-through should go in docs/examples.remote.rst and the code in examples/remote.py. A test that runs remote.py should be added in tests/test_remote.py

Allow setting dependency versions on update from CLI

CLI implementation of versioned dependencies in #63.

Purpose

Imagine an API with archives:

{'_id': 'my_arch', ..., 'versions': []},
{'_id': 'var1', ..., 'versions': [{'version': '1.0', ...}, {'version': '1.1', ...}]},
{'_id': 'var2', ..., 'versions': [{'version': '0.1.9', ...}, {'version': '0.2.0', ...}]}

We should be able to create a new version of my_arch with dependencies [(var1, 1.1), (var2, 0.1.9)] from the command line.

Suggested UI

A possible UI implementation of this would be:

datafs update my_arch my_arch.txt --dependency var1 --dependency "var2==0.1.9"

Note that var1's dependency was set to the latest version (default) and var2 was set to the explicitly named 0.1.9.

Suggested Implementation

Dependencies could be parsed with click's multiple options handler. I'm sure there's a way to handle version numbers like "var2==0.1.9" in some package somewhere, but if not, that's not the end of the world (split on "==").

This can get passed to the DataArchive.update() dependencies argument that will have to be implemented in #63.
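A minimal sketch of the CLI side, assuming a click-based command; unpinned names map to None ("use latest"), and pinned names are split on "==":

import click

def parse_dependencies(dependency_strings):
    # ("var1", "var2==0.1.9") --> {"var1": None, "var2": "0.1.9"}
    deps = {}
    for dep in dependency_strings:
        if '==' in dep:
            name, version = dep.split('==', 1)
            deps[name] = version
        else:
            deps[dep] = None
    return deps

@click.command()
@click.argument('archive_name')
@click.argument('filepath')
@click.option('--dependency', multiple=True, help='archive name, optionally pinned with ==')
def update(archive_name, filepath, dependency):
    dependencies = parse_dependencies(dependency)
    # ... pass dependencies through to DataArchive.update(filepath, dependencies=dependencies)
    click.echo(str(dependencies))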

Allow more customization on upload service list

In addition to being able to set filters on local file system stores (see #16) it would be great if we could specify a service by name on upload, e.g.:

archive.update('local_path.txt', upload_to=['s3', 'ftp'])

This would pull service configurations from those in the api.services dictionary and would override the base api._upload_services preferences.

test examples in documentation

Since we can include small snippets from a large file in documents, I think we should use doctests to test all code snippets in the docs.

We could have all code from within an article draw from the same docstring, allowing us to run the entire article as one environment:

.. include:: snippets/pythonapi.metadata.py
    :start-after: ## EXAMPLE-BLOCK-1-START
    :end-before: ## EXAMPLE-BLOCK-1-END

This would allow us to do setup and teardown.

in snippets/pythonapi.metadata.py:

'''
## SETUP
>>> api = DataAPI(**test_configuration)
>>> manager = DynamoDBManager(**kwargs)
>>> api.attach_manager(manager)
>>> m = moto.mock_s3()
etc...
...

## EXAMPLE-BLOCK-1-START

.. code-block:: python

    >>> sample_archive = api.create('sample_archive', 
    ...     metadata=dict(
    ...         oneline_description='tas by admin region', 
    ...         long_description='daily average temperature (kelvin) '
    ...             'by admin region2 as defined by the united nations', 
    ...         source='NASA BCSD', 
    ...         notes='important note'))
    >>>

## EXAMPLE-BLOCK-1-END

## TEARDOWN
>>> archive.delete()
>>> m.stop()
etc...

'''

Support local .datafs.yml config file

A local datafs config that overrides the global config in a given working directory

Purpose

A local datafs config would be useful for a couple reasons:

  1. It could make examples & testing much easier, as you would only need a local config file (overriding a non-existent global config file)
  2. It could handle additional use-specific parameters, such as --profile
  3. It would enable adding additional use-specific features down the line, such as requirement pinning

UI suggestion

I'm imagining something like this:

datafs --profile my_api create_archive my_archive 
    --description "my new archive" --project "myproj" --versioned False

could be replaced by

datafs create_archive my_archive --description "my new archive"

in the presence of this .datafs.yml file

default_profile: my_api
archive_kwargs:
    metadata:
        project: "myproj"
    versioned: False

Implementation

Local version of global config options

Overriding existing configuration options is easy... just read a second config file and update (rather than replace) the elements in the config dictionary. For some nested items, such as profiles, it may be worth doing a nested update. Something like this would do:

import collections

def update(d, u):
    for k, v in u.items():
        if isinstance(v, collections.Mapping):
            r = update(d.get(k, {}), v)
            d[k] = r
        else:
            d[k] = u[k]
    return d

This would be done in ./datafs/config/config_file.py:ConfigFile.parse_configfile_contents, which merges config dictionaries in with the defaults. If this were modified to handle updates more generally (as above) we could pass it multiple config files in series.

New archive config options

Adding the new "archive kwargs" and other options is a larger and longer-term project that is not essential for the Beta milestone.

Simulate bad users

We fail on a wide variety of uses. We probably shouldn't.

Here's one. Bad archive paths only fail on write, not on create. Should we deal with this?

$ datafs create 'my new ar\\//chi\//\\ve' --description "weird error"
created versioned archive <DataArchive osdc://my new ar\\//chi\//\\ve>

$ datafs update 'my new ar\\//chi\//\\ve' test.txt
Traceback (most recent call last):
...
fs.errors.InvalidCharsInPathError: Path contains invalid characters: my new ar\\/chi\/\\ve/0.0.1
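One possible mitigation (a hedged sketch, with an illustrative character whitelist): validate archive names at create time instead of letting bad paths surface later as filesystem errors on write:

import re

VALID_ARCHIVE_NAME = re.compile(r'^[A-Za-z0-9_\-\./]+$')

def check_archive_name(archive_name):
    if not VALID_ARCHIVE_NAME.match(archive_name):
        raise ValueError(
            'archive name "{}" contains invalid characters'.format(archive_name))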

Allow extended archive search features

Currently we have

    def list(self, substr=None):

        archives = self.manager.get_archive_names()
        if substr:
            return [a for a in archives if substr in a]
        else:
            return archives

I propose we move to something along the lines of

import fnmatch, re
...

    def list(self, pattern=None, engine='fn'):
    
        archives = self.manager.get_archive_names()

        if not pattern:
            return archives

        if engine == 'str':
            return [x for x in archives if pattern in x]

        elif engine == 'fn':
            return fnmatch.filter(archives, pattern)

        elif engine == 'regex':
            return [arch for arch in archives if re.search(pattern, arch)]

        else:
            raise ValueError(
                'search engine "{}" not recognized. choose "str", "fn", or "regex"'.format(
                    engine))

And then implement the call with --pattern and --engine on the CLI side.

Authority config handling through manager

This feature would move all authority configuration to the manager.

Following methods would be added to manager

_create_authority_table: this would create a table in the manager that handles the service configuration settings. Called on create_archive_table

_configure_authorities: specifies the config settings for authorities

The following methods would be modified in the constructor:

_generate_manager

_generate_service

CLI "help" strings

I think they can be written using a conventional docstring. Anyhow, there is barely anything there and we need to give the user some help from the command line.

Also, the CLI is not currently documented in the API. click doesn't have a sphinx extension as far as I can tell, so I think we'll have to create this documentation manually.

archive version dependencies should be a first-class citizen

We currently don't support version-specific metadata for archives... just archive-specific metadata.

We should probably think about supporting version-specific metadata. But at the very least we need version-specific dependencies. One way to do this is to allow dependencies directly as version metadata.
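For example, a version entry in the manager might carry its own dependencies field (field names are illustrative, following the structures shown in other issues above):

version_entry = {
    'version': '1.1.0',
    'checksum': 'abc123...',
    'dependencies': [('arch1', '1.1.0'), ('arch2', '1.0.0')],  # pinned per version
}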

cache config

Determine whether config file loading is slowing us down.

If so, we could hash the file and load from a cached pickled copy if unchanged.
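A minimal sketch of that approach, assuming PyYAML and a pickle cache stored next to the config file, keyed by the file's sha256 digest:

import hashlib
import os
import pickle

import yaml

def load_config_cached(config_path):
    cache_path = config_path + '.cache'

    with open(config_path, 'rb') as f:
        digest = hashlib.sha256(f.read()).hexdigest()

    if os.path.exists(cache_path):
        with open(cache_path, 'rb') as f:
            cached_digest, config = pickle.load(f)
        if cached_digest == digest:
            return config  # config file unchanged -- skip re-parsing

    with open(config_path, 'r') as f:
        config = yaml.safe_load(f)

    with open(cache_path, 'wb') as f:
        pickle.dump((digest, config), f)

    return config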

Pass API object as argument?

Should we pass the API object as an argument rather than requiring that the API be assigned to the Manager and Services?

This would allow all Manager methods to be class methods rather than instance methods. Database implementation (MongoDBManager/DynamoDBManager) objects would be api-agnostic. Their __init__ methods would simply initialize the connection to the database. The api object would get passed in as the first argument to all other methods.

A couple examples:

DataAPI.create_archive

in datafs/core/data_api.py - DataAPI.create_archive:

def create_archive(self, archive_name, raise_if_exists=True, **metadata):
    self.manager.create_archive(self, archive_name, raise_if_exists=raise_if_exists, **metadata)

in datafs/managers/manager.py - Manager.create_archive:

@classmethod
def create_archive(cls, api, archive_name, raise_if_exists=True, **metadata):
    '''
    Create a new data archive

    Parameters
    -------------
    api : object
    archive_name : str
    raise_if_exists : bool

    **metadata added to archive metadata

    Returns
    -------
    archive : object
        new :py:class:`~datafs.core.data_archive.DataArchive` object

    '''

    metadata['creator'] = metadata.get('creator', api.username)
    metadata['contact'] = metadata.get('contact', api.contact)
    metadata['creation_date'] = metadata.get('creation_date', api.create_timestamp())

    if raise_if_exists:
        cls._create_archive(archive_name, **metadata)
    else:
        cls._create_if_not_exists(archive_name, **metadata)

Prevent multiple uploads of files with the same hash

Currently archive.update() will upload without checking file contents.

The hashlib.sha256 hexdigest is in the file metadata currently, but this isn't used.

The mongodb query coll.find_one({"_id": archive_name}, {"versions.checksum": 1}) will return a list of checksums in the database for the given archive_name.

If implemented, it would make sense to make sure the file actually exists among the available services. See #3
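A hedged sketch of the check, assuming each version document stores the sha256 hexdigest under a checksum key (as described above) and a pymongo collection is available:

import hashlib

def file_already_uploaded(collection, archive_name, filepath):
    # compute the sha256 hexdigest of the local file
    with open(filepath, 'rb') as f:
        digest = hashlib.sha256(f.read()).hexdigest()

    # pull only the per-version checksums for this archive
    doc = collection.find_one({'_id': archive_name}, {'versions.checksum': 1})
    checksums = [v.get('checksum') for v in doc.get('versions', [])]

    return digest in checksums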
