climateimpactlab / datafs
An abstraction layer for data storage systems
License: MIT License
We currently don't support version-specific metadata for archives... just archive-specific metadata.
We should probably think about supporting version-specific metadata. But at the very least we need version-specific dependencies. One way to do this is to allow dependencies directly as version metadata.
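A minimal sketch of what a version record with its own dependencies might look like in the manager; the field names here are illustrative, not the current schema:

# Hypothetical version record: each version carries its own metadata,
# including a 'dependencies' mapping of archive name -> pinned version.
version_record = {
    'version': '1.1.0',
    'checksum': 'abc123',
    'metadata': {'notes': 'rerun with corrected input data'},
    'dependencies': {'var1': '1.1', 'var2': '0.1.9'},
}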
Maybe we could adapt an existing backup package to our needs? The idea of saving file diffs rather than full versions is very attractive for minimizing storage space, but we'd need to make sure we don't kill performance (see the sketch after the list of options below).
Potential options:
http://bakthat.readthedocs.io/en/latest/
https://attic-backup.org/
https://docs.python.org/2/library/difflib.html
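As a rough illustration of the diff idea above, the standard library's difflib can already produce a unified diff between two text versions; whether this scales to large or binary archives is exactly the performance question. File names here are illustrative.

import difflib

# Minimal sketch: store only the diff between consecutive text versions.
with open('archive_v1.txt') as f1, open('archive_v2.txt') as f2:
    old_lines = f1.readlines()
    new_lines = f2.readlines()

diff = list(difflib.unified_diff(old_lines, new_lines, fromfile='v1', tofile='v2'))

# Persisting ''.join(diff) instead of the full v2 file saves space,
# but reconstructing v2 later requires applying the diff to v1.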
I think this is pretty self-explanatory
There is currently no backup of the metadata table to a backup data store. If it gets deleted or goes down, all your work is lost. A possible solution would give admin users the right to back up tables at defined intervals as well as ad hoc, e.g.:
Recurring:
    api.manager.backup_metadata_table(table_name='my_metadata_table', backup_location='s3', interval='hourly')

Ad hoc:
    api.manager.backup_metadata_table(table_name='my_metadata_table', backup_location='s3')
I think they can be written using a conventional docstring. In any case, there is barely anything there now, and we need to give the user some help from the command line.
Also, the CLI is not currently documented in the API. click doesn't have a sphinx extension as far as I can tell, so I think we'll have to create this documentation manually.
Preferred specification for remote tutorial is a DynamoDB manager (See #6) and an AWS+SFTP service. Using multiple services requires passing #7
The tutorial walk-through should go in docs/examples.remote.rst and the code in examples/remote.py. A test that runs remote.py should be added in tests/test_remote.py.
xarray uses the file path to determine the read engine. It needs paths to end in '.nc'
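A couple of hedged workarounds, assuming we can stage the data to a local copy first (the local file name here is illustrative):

import shutil
import tempfile
import xarray as xr

# Option 1: give the staged copy a '.nc' suffix so xarray can infer the engine.
tmp = tempfile.NamedTemporaryFile(suffix='.nc', delete=False)
tmp.close()
shutil.copyfile('staged_local_copy', tmp.name)  # illustrative local copy
ds = xr.open_dataset(tmp.name)

# Option 2: skip the suffix requirement by naming the engine explicitly.
ds = xr.open_dataset('staged_local_copy', engine='netcdf4')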
Should we pass the API object as an argument rather than requiring that the API be assigned to Manager and Services?
This would allow all Manager methods to be class methods rather than instance methods. Database implementation (MongoDBManager/DynamoDBManager) objects would be api-agnostic. Their __init__ methods would simply initialize the connection to the database. The api object would get passed in as the first argument to all other methods.
A couple of examples:
DataAPI.create_archive in datafs/core/data_api.py:
def create_archive(self, archive_name, raise_if_exists=True, **metadata):
    self.manager.create_archive(self, archive_name, raise_if_exists=raise_if_exists, **metadata)
Manager.create_archive in datafs/managers/manager.py:
@classmethod
def create_archive(cls, api, archive_name, raise_if_exists=True, **metadata):
    '''
    Create a new data archive

    Parameters
    ----------
    api : object
    archive_name : str
    raise_if_exists : bool
    **metadata added to archive metadata

    Returns
    -------
    archive : object
        new :py:class:`~datafs.core.data_archive.DataArchive` object
    '''

    metadata['creator'] = metadata.get('creator', api.username)
    metadata['contact'] = metadata.get('contact', api.contact)
    metadata['creation_date'] = metadata.get('creation_date', api.create_timestamp())

    if raise_if_exists:
        cls._create_archive(archive_name, **metadata)
    else:
        cls._create_if_not_exists(archive_name, **metadata)
Determine whether config file loading is slowing us down.
If so, we could hash the file and load from a cached pickled copy when the file is unchanged.
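A minimal sketch of that idea, assuming a YAML config at a known path (the file names and cache location are illustrative):

import hashlib
import os
import pickle

import yaml

CONFIG_PATH = os.path.expanduser('~/.datafs/config.yml')   # illustrative
CACHE_PATH = CONFIG_PATH + '.cache'

def load_config():
    with open(CONFIG_PATH, 'rb') as f:
        raw = f.read()
    digest = hashlib.sha256(raw).hexdigest()

    # Reuse the parsed config if the file hasn't changed since we cached it.
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH, 'rb') as f:
            cached_digest, config = pickle.load(f)
        if cached_digest == digest:
            return config

    config = yaml.safe_load(raw)
    with open(CACHE_PATH, 'wb') as f:
        pickle.dump((digest, config), f)

    return config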
A local datafs config that overrides the global config in a given working directory
A local datafs config would be useful for a couple of reasons, e.g. setting a default --profile and default archive metadata for a project directory.
I'm imagining something like this:

datafs --profile my_api create_archive my_archive --description "my new archive" --project "myproj" --versioned False

could be replaced by

datafs create_archive my_archive --description "my new archive"

in the presence of this .datafs.yml file:
default_profile: my_api
archive_kwargs:
  metadata:
    project: "myproj"
  versioned: False
Overriding existing configuration options is easy... just read a second config file and update (rather than replace) the elements in the config dictionary. For some nested items, such as profiles, it may be worth doing a nested update. Something like this would do:
import collections

def update(d, u):
    for k, v in u.items():
        if isinstance(v, collections.Mapping):
            r = update(d.get(k, {}), v)
            d[k] = r
        else:
            d[k] = u[k]
    return d
This would be done in ./datafs/config/config_file.py:ConfigFile.parse_configfile_contents, which merges config dictionaries in with the defaults. If this were modified to handle updates more generally (as above), we could pass it multiple config files in series.
Adding the new "archive kwargs" and other options is a larger and longer-term project that is not essential for the Beta milestone.
Should it?
CLI implementation of versioned dependencies in #63.
Imagine an API with archives:
{'_id': 'my_arch', ..., 'versions': []},
{'_id': 'var1', ..., 'versions': [{'version': '1.0', ...}, {'version': '1.1', ...}]},
{'_id': 'var2', ..., 'versions': [{'version': '0.1.9', ...}, {'version': '0.2.0', ...}]}
We should be able to create a new version of my_arch with dependencies [(var1, 1.1), (var2, 0.1.9)] from the command line.
A possible UI implementation of this would be:
datafs update my_arch my_arch.txt --dependency var1 --dependency "var2==0.1.9"
Note that var1's dependency was set to the latest version (the default) and var2 was set to the explicitly named 0.1.9.
Dependencies could be parsed with click's multiple options handler (see the sketch below). I'm sure there's a package somewhere that handles version specifiers like "var2==0.1.9", but if not, that's not the end of the world (split on "==").
This can get passed to the dependencies argument of DataArchive.update(), which will have to be implemented in #63.
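A minimal sketch of the click side; the command body and api object are illustrative, only the --dependency parsing is the point here:

import click

def parse_dependency(spec):
    # "var2==0.1.9" -> ('var2', '0.1.9'); bare "var1" -> ('var1', None), meaning "latest"
    name, _, version = spec.partition('==')
    return name, version or None

@click.command()
@click.argument('archive_name')
@click.argument('filepath')
@click.option('--dependency', multiple=True, help='archive name, optionally pinned with ==version')
def update(archive_name, filepath, dependency):
    dependencies = dict(parse_dependency(d) for d in dependency)
    click.echo('would update {} from {} with dependencies {}'.format(
        archive_name, filepath, dependencies))

if __name__ == '__main__':
    update()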
Currently we have
def list(self, substr=None):
    archives = self.manager.get_archive_names()
    if substr:
        return [a for a in archives if substr in a]
    else:
        return archives
I propose we move to something along the lines of
import fnmatch
import re

...

def list(self, pattern=None, engine='fn'):
    archives = self.manager.get_archive_names()

    if not pattern:
        return archives

    if engine == 'str':
        return [a for a in archives if pattern in a]
    elif engine == 'fn':
        return fnmatch.filter(archives, pattern)
    elif engine == 'regex':
        return [a for a in archives if re.search(pattern, a)]
    else:
        raise ValueError(
            'search engine "{}" not recognized. '
            'choose "str", "fn", or "regex"'.format(engine))
And then implement the call with --pattern and --engine on the CLI side.
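A minimal sketch of the CLI side, assuming a click command with access to the api object (the get_api helper is illustrative):

import click

@click.command(name='list')
@click.option('--pattern', default=None, help='match string, glob, or regex')
@click.option('--engine', default='fn', type=click.Choice(['str', 'fn', 'regex']))
def list_archives(pattern, engine):
    api = get_api()  # illustrative: however the CLI currently builds its api object
    for archive_name in api.list(pattern=pattern, engine=engine):
        click.echo(archive_name)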
We currently have no idea how we're doing relative to boto/fs or regular read/write/stream operations. Let's figure that out and start optimizing.
why? let's fix it.
Luigi has everything we need and is set up for pipelines
Other options to evaluate:
archive.latest_version should not return an alpha or beta release.
This could be done by changing DataArchive.latest_version:
return max(versions)

could be changed to

# filter out versions with a non-None prerelease attribute
releases = [v for v in versions if v.prerelease is None]

if releases:
    return max(releases)
else:
    return max(versions)
Currently any file opened with fs can be written to. Ideally, we'd have a way of preventing this.
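One minimal sketch (not the current implementation): wrap the file object that fs hands back and refuse write calls.

class ReadOnlyFile(object):
    """Illustrative wrapper that blocks writes on a file opened for reading."""

    def __init__(self, fileobj):
        self._fileobj = fileobj

    def write(self, *args, **kwargs):
        raise IOError('archive files are read-only; create a new version instead')

    def writelines(self, *args, **kwargs):
        raise IOError('archive files are read-only; create a new version instead')

    def __getattr__(self, attr):
        # delegate everything else (read, seek, close, ...) to the wrapped object
        return getattr(self._fileobj, attr)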
Something like

datafs update my_archive new.txt -m "this version includes more stuff"

seems useful.
Since we can include small snippets from a large file in documents, I think we should use doctests to test all code snippets in the docs.
We could have all code from within an article draw from the same docstring, allowing us to run the entire article as one environment:
.. include:: snippets/pythonapi.metadata.py
    :start-after: ## EXAMPLE-BLOCK-1-START
    :end-before: ## EXAMPLE-BLOCK-1-END
This would allow us to do setup and teardown. In snippets/pythonapi.metadata.py:
'''
## SETUP

>>> api = DataAPI(**test_configuration)
>>> manager = DynamoDBManager(**kwargs)
>>> api.attach_manager(manager)
>>> m = moto.mock_s3()

etc...

...

## EXAMPLE-BLOCK-1-START

.. code-block:: python

    >>> sample_archive = api.create('sample_archive',
    ...     metadata=dict(
    ...         oneline_description='tas by admin region',
    ...         long_description='daily average temperature (kelvin) '
    ...             'by admin region2 as defined by the united nations',
    ...         source='NASA BCSD',
    ...         notes='important note'))
    >>>

## EXAMPLE-BLOCK-1-END

## TEARDOWN

>>> sample_archive.delete()
>>> m.stop()

etc...
'''
For both developer use (specifically, using DataFS+netCDF) and non-developer use (calls by other languages) we need to be able to write the file to a local path.
For developer use, we could avoid copying the file if it is already on the local system: a method could simply return the path on the local filesystem if it exists, or create a temporary file and return that path if the data is on a remote service.
For CLI calls, the user would likely have to specify the path. Therefore, we would probably have to write the file to the specified path.
This is a necessary feature, and should be part of examples/tests
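A minimal sketch of that behavior; the helper and the archive methods it calls (download, is_local, get_system_path) are illustrative, not the existing API:

import os
import tempfile

def get_local_path(archive, path=None):
    """Illustrative helper: hand back a real filesystem path for the archive's data."""
    if path is not None:
        # CLI-style call: write the data to the path the user asked for.
        archive.download(path)              # illustrative download-to-path call
        return path

    if archive.is_local():                  # illustrative check for a local authority
        return archive.get_system_path()    # illustrative: data is already on disk

    # Remote data: stage it in a temporary file and hand back that path.
    handle, tmp_path = tempfile.mkstemp()
    os.close(handle)
    archive.download(tmp_path)
    return tmp_path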
If two users are trying to write to the same archive at the same time, two versions with the same version number will be uploaded to the manager. These two versions will point to the same path in the fs, and the last file closed will be the object associated with that version.
{'_id': 'big climate data',
 'versions': [
     {'version': '1.0.0', 'dependencies': [('arch1', '1.0.1')]},
     {'version': '1.1.0', 'dependencies': [('arch1', '1.1.0'), ('arch2', '1.0.0')]},
     {'version': '1.1.0', 'dependencies': [('arch1', '1.0.1')]}]}
In this case, the manager will reflect both version 1.1.0 entries, but datafs will pull the last entry, which does not reflect the earlier dependency update.
Right now we don't cache anything, so this would require very careful consideration. However, it would be great to speed up the CLI and archive search functions.
In addition to being able to set filters on local file system stores (see #16) it would be great if we could specify a service by name on upload, e.g.:
archive.update('local_path.txt', upload_to=['s3', 'ftp'])
This would pull service configurations from those in the api.services dictionary and would override the base api._upload_services preferences.
This must be done before the MVP release to ensure archive backward compatibility
This should be changed on BaseDataManager and also on DataArchive.history.
Maybe these should accept api.services keys (str) and service objects (api.services values)?
At the very least, we need to be able to delete files if we are running out of space.
that's all.
Help users understand how long their uploads and downloads will take using the tqdm module.
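A minimal sketch of how an upload loop could report progress with tqdm; the chunked copy here is illustrative, and the real upload path would wrap whatever read/write loop the services use:

import os
from tqdm import tqdm

def copy_with_progress(src_path, dst_file, chunk_size=1024 * 1024):
    """Illustrative chunked copy that shows a progress bar while uploading."""
    total = os.path.getsize(src_path)
    with open(src_path, 'rb') as src, tqdm(
            total=total, unit='B', unit_scale=True,
            desc=os.path.basename(src_path)) as bar:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst_file.write(chunk)
            bar.update(len(chunk))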
Convert archive.open and archive.get_local_path into openers with __enter__ and __exit__ methods.
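A minimal sketch using contextlib, assuming a staging step around the local copy (the download call is illustrative, not the existing API):

import os
import tempfile
from contextlib import contextmanager

@contextmanager
def get_local_path(archive):
    """Illustrative opener: stage the data locally, yield the path, clean up on exit."""
    handle, tmp_path = tempfile.mkstemp()
    os.close(handle)
    try:
        archive.download(tmp_path)  # illustrative download-to-path call
        yield tmp_path
    finally:
        os.remove(tmp_path)

# usage sketch:
# with get_local_path(archive) as path:
#     do_something_with(path)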
This one is going to suck, but it is required in order to allow data exploration.
This would allow subclassers to easily change the list of required metadata using a class attribute.
I'm thinking something like:
class DataAPI(object):

    DatabaseName = 'MyDatabase'
    DataTableName = 'DataFiles'

    TimestampFormat = '%Y%m%d-%H%M%S'

    RequiredUserData = ['username', 'contact']

    def __init__(self, download_priority=None, upload_services=None, **userdata):

        for d in self.RequiredUserData:
            if d not in userdata:
                raise KeyError(
                    '{} argument "{}" required'.format(self.__class__.__name__, d))

        self.userdata = dict(userdata)

        self.manager = None
        self.services = {}

        self._download_priority = download_priority
        self._upload_services = upload_services
This would also require changing all references to api.username and api.contact throughout the package.
Gotta get the tests working... then we'll be set up to have Travis publish for us. See #19
Currently only MongoDB managers are supported.
Not sure what the best way to do this is. A local SQL index of archive names?
There isn't too much to configure. Tough part will be automatically configuring services.
Right now we have
click==6.0
PyYAML==3.0
fs1==0.6
This breaks when bundled with any packages pinning other dependencies. Seems problematic.
Make sure that users can read from the .spec table. Ours currently can't.
Currently archive.update() will upload without checking file contents.
The hashlib.sha256 hexdigest is already in the file metadata, but it isn't used.
The mongodb query coll.find_one({"_id": archive_name}, {"versions.checksum": 1}) will return a list of the checksums in the database for the given archive_name.
If implemented, it would make sense to make sure the file actually exists among the available services. See #3
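A minimal sketch of the check, assuming the latest version's checksum is reachable from the archive (here via the latest_hash property mentioned in another issue; the helper names are illustrative):

import hashlib

def sha256_hexdigest(filepath, chunk_size=1024 * 1024):
    """Hash the file in chunks so large files don't need to fit in memory."""
    hasher = hashlib.sha256()
    with open(filepath, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            hasher.update(chunk)
    return hasher.hexdigest()

def update_if_changed(archive, filepath, **kwargs):
    # Illustrative: skip the upload when the contents match the latest version.
    if sha256_hexdigest(filepath) == archive.latest_hash:
        return
    archive.update(filepath, **kwargs)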
Support a path to a requirements_data.txt file, which specifies default archive versions. This would allow easy, transparent archive dependency pinning in a data project.

A requirements_data.txt file might look like this:
var1==1.0.4
var2==9.1.14a2
This would mean the following:
# normal behavior
datafs create_archive my_arch --versioned True --requirements "requirement_data.txt"
# downloads var1 --version 1.0.4 (NOT LATEST!)
datafs download var1 --requirements "requirement_data.txt"
# downloads var2 version 8.0 (named version overrides requirements.txt and latest)
datafs download var2 --version 8.0 --requirements "requirement_data.txt"
# downloads var3's latest release (2.0)
datafs download var3 --requirements "requirement_data.txt"
# creates a new version of my_arch with dependencies=[(var1, 1.0.4), (var2, 8.0), (var3, 2.0)]
datafs update my_arch my_arch.txt --requirements "requirement_data.txt" --dependency var1 --dependency "var2==8.0" --dependency var3
These behaviors should be replicated (and therefore implemented) in the python API as well -- not just the CLI.
Relies on versioned dependency management in #63 and CLI implementation in #69. Additionally, specifying the path to a requirements_data.txt in a local config file as suggested in #67 would be great.
This could be done by adding a default_version attribute to DataArchive. This would be specific to an archive instance (not an entry in a manager), so that the default download would be specific to each user and instance. Right now the default version is None, which downloads the latest version.
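A minimal sketch of the requirements-file parsing; the format is just name or name==version per line, as above:

def parse_requirements_file(filepath):
    """Illustrative: read requirements_data.txt into {archive_name: version_or_None}."""
    requirements = {}
    with open(filepath) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith('#'):
                continue
            name, _, version = line.partition('==')
            requirements[name.strip()] = version.strip() or None
    return requirements

# e.g. parse_requirements_file('requirements_data.txt')
# -> {'var1': '1.0.4', 'var2': '9.1.14a2'}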
I haven't figured out how to remove Python 2.6 from pypy support. We could remove pypy support, but there's got to be a better way.
create_archive --> create
upload --> update
Add the following methods:
list()
update(archive_name, filepath, **kwargs)
metadata(archive_name)
versions(archive_name)
download(archive_name, filepath)
We fail on a wide variety of uses. We probably shouldn't.
Here's one. Bad archive paths only fail on write, not on create. Should we deal with this?
$ datafs create 'my new ar\\//chi\//\\ve' --description "weird error"
created versioned archive <DataArchive osdc://my new ar\\//chi\//\\ve>
$ datafs update 'my new ar\\//chi\//\\ve' test.txt
Traceback (most recent call last):
...
fs.errors.InvalidCharsInPathError: Path contains invalid characters: my new ar\\/chi\/\\ve/0.0.1
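A minimal sketch of validating the archive name at create time instead of waiting for the write to fail; the character whitelist here is illustrative, not whatever pyfilesystem actually enforces:

import re

# illustrative whitelist: letters, digits, and a few separators
_VALID_ARCHIVE_NAME = re.compile(r'^[A-Za-z0-9_\-\./]+$')

def check_archive_name(archive_name):
    """Raise at create_archive time rather than at the first update."""
    if not _VALID_ARCHIVE_NAME.match(archive_name):
        raise ValueError(
            'archive name contains invalid characters: {!r}'.format(archive_name))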
Uploading & downloading from multiple services is currently untested. This can be tested using local services only.
I think core commands should be:
configure [**kwargs]
configure add authority [**kwargs]
configure add manager [**kwargs]
list [--prefix prefix]
create_archive <archive_name> [--authority auth_name] [**kwargs]
metadata <archive_name>
versions <archive_name>
update <archive_name> [filepath] [**kwargs]
download <archive_name> [filepath]
get-version <archive_name> filepath # hashes the file and gives the version ID from the archive
delete <archive_name>
In addition to adding get-version, this would mean extending configure to enable the creation of managers and authorities, making the CLI a fully featured version of datafs. This could be done by allowing the user to specify arbitrary keyword arguments to create an api object, then writing the object back to the config file.
Presumably, if you have a remote and a local service, you don't want to keep every version of your data on the local system.
If we implement #10 it would be great if we could specify a config file something like this:
services:
  local:
    service: fs.osfs.OSFS
    service_path: '~/my_data_files/'
    retain_versions: 1
    size_warning: 500MB
    size_limit: 5GB
  osdc:
    service: fs.s3fs.S3FS
    service_config: '~/.aws/config'
    bucket: my-data-file
    retain_versions: -1
    size_warning: -1
    size_limit: -1
The fact that these are properties is potentially confusing and could lead to unintended calls to the manager.
Propose changing the following properties into functions:
latest_version --> get_latest_version
versions --> get_versions
latest_hash --> get_latest_hash
history --> get_history
metadata --> get_metadata
These changes would break tests and docs/examples which explicitly rely on these properties.
Currently there is no way to add to an archive from an authority. This prevents us from indexing external files, such as NASA data, and also prevents us from adding data that is currently on OSDC.
In order to do this, we would need to develop a way of handling "version checking" in a world where the authority's hash value is unknown. Perhaps in that situation we should download it on first read and store the hash of the downloaded file? But what do we do if we download it again from the same authority and the hash is different?
This feature would move all authority configuration to the manager.
The following methods would be added to the manager:

_create_authority_table: this would create a table in the manager that handles the service configuration settings. Called on create_archive_table.
_configure_authorities: specifies the config settings for authorities.

The following methods would be modified in the constructor:

_generate_manager
_generate_service