datafs's Introduction

DataFS Data Management System

DataFS is a package manager for data. It manages file versions, dependencies, and metadata for individual use or large organizations.

Configure and connect to a metadata Manager and multiple data Services using a specification file and you'll be sharing, tracking, and using your data in seconds.

Features

  • Explicit version and metadata management for teams
  • Unified read/write interface across file systems
  • Easily create out-of-the-box configuration files for users
  • Track data dependencies and usage logs
  • Use datafs from python or from the command line
  • Permissions handled by managers & services, giving you control over user access

Usage

First, configure an API. Don't worry. It's not too bad. Check out the quickstart to follow along.

We'll assume we already have an API object created and attached to a service called "local". Once you have this, you can start using DataFS to create and use archives.

$ datafs create my_new_data_archive --description "a test archive"
created versioned archive <DataArchive local://my_new_data_archive>

$ echo "initial file contents" > my_file.txt

$ datafs update my_new_data_archive my_file.txt

$ datafs cat my_new_data_archive
initial file contents

Versions are tracked explicitly. Bump versions on write, and read old versions if desired.

$ echo "updated contents" > my_file.txt

$ datafs update my_new_data_archive my_file.txt --bumpversion minor
uploaded data to <DataArchive local://my_new_data_archive>. version bumped 0.0.1 --> 0.1.0

$ datafs cat my_new_data_archive
updated contents

$ datafs cat my_new_data_archive --version 0.0.1
initial file contents

Pin versions using a requirements file to set the default version:

$ echo "my_new_data_archive==0.0.1" > requirements_data.txt

$ datafs cat my_new_data_archive
initial file contents

All of these features are available from (and faster in) python:

>>> import datafs
>>> api = datafs.get_api()
>>> archive = api.get_archive('my_new_data_archive')
>>> with archive.open('r', version='latest') as f:
...     print(f.read())
...
updated contents

If you have permission to delete archives, it's easy to do. See administrative tools for tips on setting permissions.

$ datafs delete my_new_data_archive
deleted archive <DataArchive local://my_new_data_archive>

See examples for more extensive use cases.

Installation

pip install datafs

Additionally, you'll need a manager and services:

Managers:

  • MongoDB: pip install pymongo
  • DynamoDB: pip install boto3

Services:

  • Ready out-of-the-box:
    • local
    • shared
    • mounted
    • zip
    • ftp
    • http/https
    • in-memory
  • Requiring additional packages:
    • AWS/S3: pip install boto
    • SFTP: pip install paramiko
    • XMLRPC: pip install xmlrpclib

Requirements

For now, DataFS requires Python 2.7. We're working on Python 3 support.

Todo

See the issues to view and add to our todos.

Credits

This package was created by Justin Simcock and Michael Delgado of the Climate Impact Lab. Check us out on github.

Major kudos to the folks at PyFilesystem. Thanks also to audreyr for the wonderful cookiecutter package, and to Pyup, a constant source of inspiration and our silent third contributor.

datafs's Issues

Data requirements file

Support a path to a requirements_data.txt file, which specifies default archive versions

Purpose

Allow easy, transparent archive dependency pinning in a data project

UI suggestion

A requirements_data.txt file might look like this

var1==1.0.4
var2==9.1.14a2

This would mean the following:

# normal behavior
datafs create_archive my_arch --versioned True --requirements "requirements_data.txt"

# downloads var1 --version 1.0.4 (NOT LATEST!)
datafs download var1 --requirements "requirements_data.txt"

# downloads var2 version 8.0 (named version overrides requirements_data.txt and latest)
datafs download var2 --version 8.0 --requirements "requirements_data.txt"

# downloads var3's latest release (2.0)
datafs download var3 --requirements "requirements_data.txt"

# creates a new version of my_arch with dependencies=[(var1, 1.0.4), (var2, 8.0), (var3, 2.0)]
datafs update my_arch my_arch.txt --requirements "requirements_data.txt" --dependency var1 --dependency "var2==8.0" --dependency var3

These behaviors should be replicated (and therefore implemented) in the python API as well -- not just the CLI.

Implementation

Relies on versioned dependency management in #63 and CLI implementation in #69. Additionally, specifying the path to a requirements_data.txt in a local config file as suggested in #67 would be great.

Setting default download versions on DataArchive

This could be done by adding a default_version attribute to DataArchive. This would be specific to an archive instance - not an entry in a manager - so that the default download would be specific to each user and instance. Right now the default version is None, which downloads the latest version.
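A minimal sketch of what this might look like (the attribute name follows the issue text; _open_version is a hypothetical internal helper):

class DataArchive(object):

    def __init__(self, api, archive_name, default_version=None):
        self.api = api
        self.archive_name = archive_name
        self.default_version = default_version  # None means "latest"

    def open(self, mode='r', version=None, *args, **kwargs):
        # fall back to the per-instance default when no version is requested
        if version is None:
            version = self.default_version
        return self._open_version(mode, version, *args, **kwargs)  # hypothetical helper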

remove DataArchive properties that rely on manager calls

The fact that these are properties is potentially confusing and could lead to unintended calls to the manager.

Propose changing the following properties into functions:

  • latest_version --> get_latest_version
  • versions --> get_versions
  • latest_hash --> get_latest_hash
  • history --> get_history
  • metadata --> get_metadata

These changes would break tests and docs/examples which explicitly rely on these properties.
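A minimal sketch of the change for one of these, assuming the archive holds a reference to the api and the manager exposes a get_latest_version call:

class DataArchive(object):

    # before: a property that silently triggers a manager round-trip
    # @property
    # def latest_version(self):
    #     return self.api.manager.get_latest_version(self.archive_name)

    # after: an explicit method, so the manager call is visible at the call site
    def get_latest_version(self):
        return self.api.manager.get_latest_version(self.archive_name)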

Set Metadata Spec

  1. Enable admin to update metadata spec for given project via the manager
  2. Save the schema somewhere in the manager so it can easily be accessed any time an item is updated
  3. Create a hook to reference this setting when an item is updated via the manager
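A hedged sketch of the hook in step 3, assuming the manager persists the spec in its own table (the _get_spec_config and _update_spec_config helpers are hypothetical):

class Manager(object):

    def set_required_metadata(self, required_fields):
        # step 2: save the schema in the manager so it is available on every update
        self._update_spec_config('required_metadata', list(required_fields))

    def _validate_metadata(self, metadata):
        # step 3: called before any archive metadata write goes through the manager
        required = self._get_spec_config('required_metadata')
        missing = [field for field in required if field not in metadata]
        if missing:
            raise KeyError('missing required metadata fields: {}'.format(missing))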

Version overwrite on simultaneous write of same version

If two users are trying to write to the same archive at the same time, two versions with the same version number will be uploaded to the manager. These two versions will point to the same path in the fs and the last closed file will be the object associated with that version.

{'_id': 'big climate data',
 'versions': [
     {'version': '1.0.0', 'dependencies': [('arch1', '1.0.1')]},
     {'version': '1.1.0', 'dependencies': [('arch1', '1.1.0'), ('arch2', '1.0.0')]},
     {'version': '1.1.0', 'dependencies': [('arch1', '1.0.1')]}]}

In this case, the manager will reflect both version 1.1.0 entries but datafs will pull the last entry, not reflecting the earlier dependency update.

Add download & retainment preferences differentiated by service

Presumably, if you have a remote and a local service, you don't want to keep every version of your data on the local system.

If we implement #10 it would be great if we could specify a config file something like this:

services:
    local:
        service: fs.osfs.OSFS
        service_path: '~/my_data_files/'
        retain_versions: 1
        size_warning: 500MB
        size_limit: 5GB

    osdc:
        service: fs.s3fs.S3FS
        service_config: '~/.aws/config'
        bucket: my-data-file
        retain_versions: -1
        size_warning: -1
        size_limit: -1

Extend CLI functionality

I think core commands should be:

configure [**kwargs]
configure add authority [**kwargs]
configure add manager [**kwargs]
list [--prefix prefix]
create_archive <archive_name> [--authority auth_name] [**kwargs]
metadata <archive_name>
versions <archive_name>
update <archive_name> [filepath] [**kwargs]
download <archive_name> [filepath]
get-version <archive_name> filepath   # hashes the file and gives the version ID from the archive
delete <archive_name>

Besides adding get-version, this would mean extending configure to enable the creation of managers and authorities, making the CLI a fully featured version of datafs. This could be done by allowing the user to specify arbitrary keyword arguments to create an api object, then writing the object back to the config file.

standardize core API functions

Propose renaming key functions

DataAPI

create_archive --> create

CLI

create_archive --> create
upload --> update

Propose adding the following functions:

DataAPI

Add the following methods:

  • list()
  • update(archive_name, filepath, **kwargs)
  • metadata(archive_name)
  • versions(archive_name)
  • download(archive_name, filepath)

provide a means of adding files directly from an authority

Currently there is no way to add to an archive from an authority. This prevents us from indexing external files, such as NASA data, and also prevents us from adding data that is currently on OSDC.

In order to do this, we would need to develop a way of handling "version checking" in a world where the authority's hash value is unknown. Perhaps in that situation we should download it on first read and store the hash of the downloaded file? But what do we do if we download it again from the same authority and the hash is different?

Default version on archives should be last stable release, not last release

archive.latest_version should not return an alpha or beta release.

This could be done by changing DataArchive.latest_version:

return max(versions)

could be changed to

# filter out versions with a non-None prerelease attribute
releases = filter(lambda v: v.prerelease is None, versions)

if len(releases) > 0:
    return max(releases)
else:
    return max(versions)

Frequent BackUp of metadata table to S3 or other data store

There is currently no backup of the metadata table to a separate data store. If the table gets deleted or goes down, all your work is lost. A possible solution would give admin users the rights to back up tables at defined intervals as well as run ad-hoc backups:

# recurring
api.manager.backup_metadata_table(table_name='my_metadata_table', backup_location='s3', interval='hourly')

# ad-hoc
api.manager.backup_metadata_table(table_name='my_metadata_table', backup_location='s3')

Add DataFile methods to write file to a local path

For both developer use (specifically, using DataFS+netCDF) and non-developer use (calls by other languages) we need to be able to write the file to a local path.

For developer use, we could avoid copying the file if it is already on the local system: a method could simply return the local filesystem path if it exists, or create a temporary file and return its path if the data is on a remote service.

For CLI calls, the user would likely have to specify the path. Therefore, we would probably have to write the file to the specified path.
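A minimal sketch of such a method, assuming hypothetical is_local and get_system_path accessors on the archive; remote data is copied to a temporary file whose path is returned:

import shutil
import tempfile

def get_local_path(archive, version=None):
    if archive.is_local():                       # hypothetical check
        return archive.get_system_path(version)  # hypothetical accessor

    # remote service: copy the data into a temporary file and hand back its path
    tmp = tempfile.NamedTemporaryFile(delete=False)
    with archive.open('rb', version=version) as f:
        shutil.copyfileobj(f, tmp)
    tmp.close()
    return tmp.name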

Consider metadata & archive list caching

Right now we don't cache anything, so this would require very careful consideration. However, it would be great to speed up the CLI and archive search functions.

Generalize API initialization requirements

This would allow subclassers to easily change the list of required metadata using a class attribute.

I'm thinking something like:

class DataAPI(object):

    DatabaseName = 'MyDatabase'
    DataTableName = 'DataFiles'

    TimestampFormat = '%Y%m%d-%H%M%S'

    RequiredUserData = ['username', 'contact']

    def __init__(self, download_priority=None, upload_services=None, **userdata):
        for d in self.RequiredUserData:
            if d not in userdata:
                raise KeyError('{} argument "{}" required'.format(self.__class__.__name__, d))

        self.userdata = dict(userdata)

        self.manager = None
        self.services = {}

        self._download_priority = download_priority
        self._upload_services = upload_services

This would also require changing all references to api.username and api.contact throughout the package.

Write remote tutorial

Preferred specification for remote tutorial is a DynamoDB manager (See #6) and an AWS+SFTP service. Using multiple services requires passing #7

The tutorial walk-through should go in docs/examples.remote.rst and the code in examples/remote.py. A test that runs remote.py should be added in tests/test_remote.py

Allow setting dependency versions on update from CLI

CLI implementation of versioned dependencies in #63.

Purpose

Imagine an API with archives:

{'_id': 'my_arch', ..., 'versions': []},
{'_id': 'var1', ..., 'versions': [{'version': '1.0', ...}, {'version': '1.1', ...}]},
{'_id': 'var2', ..., 'versions': [{'version': '0.1.9', ...}, {'version': '0.2.0', ...}]}

We should be able to create a new version of my_arch with dependencies [(var1, 1.1), (var2, 0.1.9)] from the command line.

Suggested UI

A possible UI implementation of this would be:

datafs update my_arch my_arch.txt --dependency var1 --dependency "var2==0.1.9"

Note that var1's dependency was set to the latest version (default) and var2 was set to the explicitly named 0.1.9.

Suggested Implementation

Dependencies could be parsed with click's multiple options handler. I'm sure there's a way to handle version numbers like "var2==0.1.9" in some package somewhere, but if not, that's not the end of the world (split on "==").

This can get passed to the DataArchive.update() dependencies argument that will have to be implemented in #63.
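A minimal sketch of the CLI side, assuming a click-based command; unpinned names map to None ("use latest"), and pinned names are split on "==":

import click

def parse_dependencies(dependency_strings):
    # ("var1", "var2==0.1.9") --> {"var1": None, "var2": "0.1.9"}
    deps = {}
    for dep in dependency_strings:
        if '==' in dep:
            name, version = dep.split('==', 1)
            deps[name] = version
        else:
            deps[dep] = None
    return deps

@click.command()
@click.argument('archive_name')
@click.argument('filepath')
@click.option('--dependency', multiple=True, help='archive name, optionally pinned with ==')
def update(archive_name, filepath, dependency):
    dependencies = parse_dependencies(dependency)
    # ... pass dependencies through to DataArchive.update(filepath, dependencies=dependencies)
    click.echo(str(dependencies))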

Allow more customization on upload service list

In addition to being able to set filters on local file system stores (see #16) it would be great if we could specify a service by name on upload, e.g.:

archive.update('local_path.txt', upload_to=['s3', 'ftp'])

This would pull service configurations from those in the api.services dictionary and would override the base api._upload_services preferences.

test examples in documentation

Since we can include small snippets from a large file in documents, I think we should use doctests to test all code snippets in the docs.

We could have all code from within an article draw from the same docstring, allowing us to run the entire article as one environment:

.. include:: snippets/pythonapi.metadata.py
    :start-after: ## EXAMPLE-BLOCK-1-START
    :end-before: ## EXAMPLE-BLOCK-1-END

This would allow us to do setup and teardown.

in snippets/pythonapi.metadata.py:

'''
## SETUP
>>> api = DataAPI(**test_configuration)
>>> manager = DynamoDBManager(**kwargs)
>>> api.attach_manager(manager)
>>> m = moto.mock_s3()
etc...
...

## EXAMPLE-BLOCK-1-START

.. code-block:: python

    >>> sample_archive = api.create('sample_archive', 
    ...     metadata=dict(
    ...         oneline_description='tas by admin region', 
    ...         long_description='daily average temperature (kelvin) '
    ...             'by admin region2 as defined by the united nations', 
    ...         source='NASA BCSD', 
    ...         notes='important note'))
    >>>

## EXAMPLE-BLOCK-1-END

## TEARDOWN
>>> archive.delete()
>>> m.stop()
etc...

'''

Support local .datafs.yml config file

A local datafs config that overrides the global config in a given working directory

Purpose

A local datafs config would be useful for a couple reasons:

  1. It could make examples & testing much easier, as you would only need a local config file (overriding a non-existent global config file)
  2. It could handle additional use-specific parameters, such as --profile
  3. It would enable adding additional use-specific features down the line, such as requirement pinning

UI suggestion

I'm imagining something like this:

datafs --profile my_api create_archive my_archive 
    --description "my new archive" --project "myproj" --versioned False

could be replaced by

datafs create_archive my_archive --description "my new archive"

in the presence of this .datafs.yml file

default_profile: my_api
archive_kwargs:
    metadata:
        project: "myproj"
    versioned: False

Implementation

Local version of global config options

Overriding existing configuration options is easy... just read a second config file and update (rather than replace) the elements in the config dictionary. For some nested items, such as profiles, it may be worth doing a nested update. Something like this would do:

import collections

def update(d, u):
    for k, v in u.items():
        if isinstance(v, collections.Mapping):
            r = update(d.get(k, {}), v)
            d[k] = r
        else:
            d[k] = u[k]
    return d

This would be done in ./datafs/config/config_file.py:ConfigFile.parse_configfile_contents, which merges config dictionaries in with the defaults. If this were modified to handle updates more generally (as above) we could pass it multiple config files in series.

New archive config options

Adding the new "archive kwargs" and other options is a larger and longer-term project that is not essential for the Beta milestone.

Simulate bad users

We fail on a wide variety of uses. We probably shouldn't.

Here's one. Bad archive paths only fail on write, not on create. Should we deal with this?

$ datafs create 'my new ar\\//chi\//\\ve' --description "weird error"
created versioned archive <DataArchive osdc://my new ar\\//chi\//\\ve>

$ datafs update 'my new ar\\//chi\//\\ve' test.txt
Traceback (most recent call last):
...
fs.errors.InvalidCharsInPathError: Path contains invalid characters: my new ar\\/chi\/\\ve/0.0.1
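One possible mitigation (a hedged sketch, with an illustrative character whitelist): validate archive names at create time instead of letting bad paths surface later as filesystem errors on write:

import re

VALID_ARCHIVE_NAME = re.compile(r'^[A-Za-z0-9_\-\./]+$')

def check_archive_name(archive_name):
    if not VALID_ARCHIVE_NAME.match(archive_name):
        raise ValueError(
            'archive name "{}" contains invalid characters'.format(archive_name))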

Allow extended archive search features

Currently we have

    def list(self, substr=None):

        archives = self.manager.get_archive_names()
        if substr:
            return [a for a in archives if substr in a]
        else:
            return archives

I propose we move to something along the lines of

import fnmatch, re
...

    def list(self, pattern=None, engine='fn'):
    
        archives = self.manager.get_archive_names()

        if not pattern:
            return archives

        if engine == 'str':
            return [x for x in archives if pattern in x]

        elif engine == 'fn':
            return fnmatch.filter(archives, pattern)

        elif engine == 'regex':
            return [arch for arch in archives if re.search(pattern, arch)]

        else:
            raise ValueError(
                'search engine "{}" not recognized. choose "str", "fn", or "regex"'.format(
                    engine))

And then implement the call with --pattern and --engine on the CLI side.

Authority config handling through manager

This feature would move all authority configuration to the manager.

Following methods would be added to manager

_create_authority_table: this would create a table in the manager that handles the service configuration settings. Called on create_archive_table

_configure_authorities: specifies the config settings for authorities

The following methods would be modified in the constructor:

_generate_manager

_generate_service

CLI "help" strings

I think they can be written using a conventional docstring. Anyhow, there is barely anything there and we need to give the user some help from the command line.

Also, the CLI is not currently documented in the API. click doesn't have a sphinx extension as far as I can tell, so I think we'll have to create this documentation manually.

archive version dependencies should be a first-class citizen

We currently don't support version-specific metadata for archives... just archive-specific metadata.

We should probably think about supporting version-specific metadata. But at the very least we need version-specific dependencies. One way to do this is to allow dependencies directly as version metadata.
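For example, a version entry in the manager might carry its own dependencies field (field names are illustrative, following the structures shown in other issues above):

version_entry = {
    'version': '1.1.0',
    'checksum': 'abc123...',
    'dependencies': [('arch1', '1.1.0'), ('arch2', '1.0.0')],  # pinned per version
}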

cache config

Determine whether config file loading is slowing us down.

If so, we could hash the file and load from a cached pickled copy if unchanged.
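A minimal sketch of that approach, assuming PyYAML and a pickle cache stored next to the config file, keyed by the file's sha256 digest:

import hashlib
import os
import pickle

import yaml

def load_config_cached(config_path):
    cache_path = config_path + '.cache'

    with open(config_path, 'rb') as f:
        digest = hashlib.sha256(f.read()).hexdigest()

    if os.path.exists(cache_path):
        with open(cache_path, 'rb') as f:
            cached_digest, config = pickle.load(f)
        if cached_digest == digest:
            return config  # config file unchanged -- skip re-parsing

    with open(config_path, 'r') as f:
        config = yaml.safe_load(f)

    with open(cache_path, 'wb') as f:
        pickle.dump((digest, config), f)

    return config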

Pass API object as argument?

Should we pass the API object as an argument rather than requiring that the API be assigned to the Manager and Services?

This would allow all Manager methods to be class methods rather than instance methods. Database implementation (MongoDBManager/DynamoDBManager) objects would be api-agnostic. Their __init__ methods would simply initialize the connection to the database. The api object would get passed in as the first argument to all other methods.

A couple examples:

DataAPI.create_archive

in datafs/core/data_api.py - DataAPI.create_archive:

def create_archive(self, archive_name, raise_if_exists=True, **metadata):
    self.manager.create_archive(self, archive_name, raise_if_exists=raise_if_exists, **metadata)

in datafs/managers/manager.py - Manager.create_archive:

@classmethod
def create_archive(cls, api, archive_name, raise_if_exists=True, **metadata):
    '''
    Create a new data archive

    Parameters
    -------------
    api : object
    archive_name : str
    raise_if_exists : bool

    **metadata added to archive metadata

    Returns
    -------
    archive : object
        new :py:class:`~datafs.core.data_archive.DataArchive` object

    '''

    metadata['creator'] = metadata.get('creator', api.username)
    metadata['contact'] = metadata.get('contact', api.contact)
    metadata['creation_date'] = metadata.get('creation_date', api.create_timestamp())

    if raise_if_exists:
        cls._create_archive(archive_name, **metadata)
    else:
        cls._create_if_not_exists(archive_name, **metadata)

Prevent multiple uploads of files with the same hash

Currently archive.update() will upload without checking file contents.

The hashlib.sha256 hexdigest is in the file metadata currently, but this isn't used.

The mongodb query coll.find_one({"_id": archive_name}, {"versions.checksum": 1}) will return a list of checksums in the database for the given archive_name.

If implemented, it would make sense to make sure the file actually exists among the available services. See #3
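A hedged sketch of the check, assuming each version document stores the sha256 hexdigest under a checksum key (as described above) and a pymongo collection is available:

import hashlib

def file_already_uploaded(collection, archive_name, filepath):
    # compute the sha256 hexdigest of the local file
    with open(filepath, 'rb') as f:
        digest = hashlib.sha256(f.read()).hexdigest()

    # pull only the per-version checksums for this archive
    doc = collection.find_one({'_id': archive_name}, {'versions.checksum': 1})
    checksums = [v.get('checksum') for v in doc.get('versions', [])]

    return digest in checksums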
