ipazc / dhub
CLI for mldatahub, for Python 3
License: GNU General Public License v3.0
When an option is passed to the filtered iteration, it may yield fewer elements because fewer elements match the given options. When this is the case, the iterator keeps requesting pages until the page count is reached, even when no remaining elements match the options set.
The iterator should stop as soon as the backend returns no elements at all.
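The stop condition described above can be sketched as follows. This is a minimal illustration, not the actual dhub implementation; `fetch_page` is a hypothetical callable standing in for one backend page request:

```python
def filtered_iter(fetch_page, page_count):
    """Yield filtered elements page by page, stopping early on an empty page.

    fetch_page is a hypothetical callable: page index -> list of elements.
    """
    for page in range(page_count):
        elements = fetch_page(page)
        if not elements:
            # The backend returned nothing: stop here instead of polling
            # the remaining pages for elements that will never arrive.
            break
        yield from elements
```

With this in place, the iterator performs at most one wasted request (the first empty page) regardless of how large the nominal page count is.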
Saving a dataset to a folder does not take advantage of the iterator's smart cache.
Given, for example, a dataset with the URL prefix "foo/bar", the dataset should also be retrievable by
providing only the dataset prefix "bar":
>>> dataset = datasets["bar"]
By default, the prefix should be completed automatically.
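One possible lookup strategy is sketched below. `DatasetIndex` is a hypothetical class, not part of the dhub API; it resolves a bare prefix like "bar" to the full key "foo/bar" when the match is unambiguous:

```python
class DatasetIndex:
    """Hypothetical lookup resolving "bar" to "foo/bar" when unambiguous."""

    def __init__(self, datasets):
        self._datasets = datasets  # maps full url prefix -> dataset object

    def __getitem__(self, key):
        if key in self._datasets:
            return self._datasets[key]
        # Fall back to matching the last path segment of each full prefix.
        matches = [k for k in self._datasets if k.split("/")[-1] == key]
        if len(matches) == 1:
            return self._datasets[matches[0]]
        raise KeyError(key)
```

Raising `KeyError` on zero or multiple matches keeps the shorthand safe: an ambiguous prefix never silently picks the wrong dataset.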
The exception requests.exceptions.ChunkedEncodingError (or similar ones raised while requesting) should be caught, so that the request can be transparently retried.
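A transparent retry can be implemented as a decorator, sketched below with plain stdlib code. The exception types are passed in as a parameter, so in practice one would pass `(requests.exceptions.ChunkedEncodingError,)`; the names here are illustrative, not the actual dhub internals:

```python
import time


def retry_on(exceptions, attempts=3, delay=0.0):
    """Hypothetical retry wrapper: rerun the call transparently on
    transient errors, re-raising only after the last attempt fails."""
    def decorator(func):
        def wrapper(*args, **kwargs):
            for attempt in range(attempts):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == attempts - 1:
                        raise
                    time.sleep(delay)
        return wrapper
    return decorator
```

The caller never sees the transient failures; only a request that fails on every attempt propagates the exception.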
The smart updater does not merge multiple updates of the same element. Under certain circumstances, this might lead to a race condition.
Modifications to an element are not reflected in the dataset immediately.
Example:
>>> element = dataset[0]
>>> print(element.get_title())
"title1"
>>> element.set_title("title2")
>>> print(element.get_title())
"title2"
>>> print(dataset[0].get_title())
"title1"
The metadata should be optional; in its absence, it should be taken from the forked dataset.
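The fallback can be as simple as defaulting the parameter to `None` and copying from the source. `fork_dataset` and the dict layout below are hypothetical stand-ins for the real fork operation:

```python
def fork_dataset(source, metadata=None):
    """Sketch: metadata is optional; inherit the forked dataset's
    metadata when it is not provided."""
    if metadata is None:
        metadata = dict(source["metadata"])  # copy, don't alias
    return {"elements": list(source["elements"]), "metadata": metadata}
```

Copying rather than aliasing keeps later edits to the fork's metadata from mutating the original dataset.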
When iterating a dataset with filter_iter(cache_content=True), there is an unexpected delay every page_size requests. It seems that once the whole page is processed and the limit is reached, the next page is requested, blocking the application until that page is retrieved. The next page should be requested before the limit is reached, to speed up the computation.
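The prefetch described above can be sketched with a single background worker: while the consumer processes page n, page n+1 is already being fetched. `fetch_page` and `paged_iter` are hypothetical names, not the dhub API:

```python
from concurrent.futures import ThreadPoolExecutor


def paged_iter(fetch_page, num_pages):
    """Sketch: fetch page n+1 in the background while page n is consumed,
    so the consumer never blocks on a cold request between pages."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fetch_page, 0)
        for page in range(num_pages):
            elements = future.result()  # ready (or nearly so) by now
            if page + 1 < num_pages:
                # Kick off the next request before yielding this page.
                future = pool.submit(fetch_page, page + 1)
            yield from elements
```

If processing a page takes longer than fetching one, the inter-page delay disappears entirely; otherwise it shrinks to the difference between the two.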
Even though it points to our backend by default, the backend URL should be a configurable parameter in a config file.
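A minimal version using stdlib `configparser` is sketched below. The section name `dhub`, the option `backend_url`, and the default URL are all assumptions for illustration:

```python
import configparser

# Hypothetical default; the real backend address is project-specific.
DEFAULT_BACKEND = "http://localhost:8000"


def load_backend_url(path):
    """Read the backend URL from an INI-style config file, falling back
    to the default when the file or the option is missing."""
    parser = configparser.ConfigParser()
    parser.read(path)  # silently ignores a missing file
    return parser.get("dhub", "backend_url", fallback=DEFAULT_BACKEND)
```

A config file would then just contain a `[dhub]` section with `backend_url = http://myhost:9000`.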
The sync() method is not syncing correctly. After a huge append of dataset elements, the following code:
>>> print("Syncing...")
>>> dataset.sync()
>>> print("Synced.")
gives this result almost instantly:
Syncing...
Synced.
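One way to make sync() actually block is to push appends through a work queue and have sync() join it. This is a minimal sketch with a hypothetical `SyncQueue` class, where appending to a list stands in for the real upload:

```python
import queue
import threading


class SyncQueue:
    """Sketch: appends are processed by a background worker, and sync()
    blocks until every pending element has actually been flushed."""

    def __init__(self):
        self._pending = queue.Queue()
        self.flushed = []
        worker = threading.Thread(target=self._flush_loop, daemon=True)
        worker.start()

    def append(self, element):
        self._pending.put(element)

    def _flush_loop(self):
        while True:
            element = self._pending.get()
            self.flushed.append(element)  # stand-in for the real upload
            self._pending.task_done()

    def sync(self):
        # Returns only once every queued element has been processed,
        # instead of returning immediately like the buggy behaviour above.
        self._pending.join()
```

With this structure, "Synced." can only be printed after the backlog has drained.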
For example, the following code will break during execution:
>>> dataset[200:]
requests.exceptions.ConnectionError: HTTPConnectionPool(host='', port=): Max retries exceeded with url: /datasets/ipazc-adience/fold1/elements?page=60&_tok=
(Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f2bd3875c18>: Failed to establish a new connection: [Errno 110] Connection timed out',))
After this error, the client should automatically retry.
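For timeouts like the one above, the retry is usually paired with exponential backoff so a struggling backend is not hammered. A sketch follows; `fetch` is a hypothetical zero-argument callable performing one page request, and the built-in `ConnectionError` stands in for `requests.exceptions.ConnectionError`:

```python
import time


def fetch_with_backoff(fetch, retries=5, base_delay=1.0):
    """Sketch: retry a failed request, doubling the wait each attempt."""
    for attempt in range(retries):
        try:
            return fetch()
        except ConnectionError:
            if attempt == retries - 1:
                raise  # give up only after the final attempt
            time.sleep(base_delay * 2 ** attempt)
```

The delays grow as 1s, 2s, 4s, ... with the defaults, and the original exception still surfaces if the backend stays unreachable.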
This retrieval will cause a page size exceeded error:
>>> dataset[300:1000]
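The client could split a large slice into backend-sized chunks transparently. A sketch, assuming a hypothetical `fetch_range(start, stop)` callable and a backend limit of 100 elements per request:

```python
def fetch_slice(fetch_range, start, stop, max_page_size=100):
    """Sketch: serve dataset[start:stop] by issuing several requests,
    each no larger than the backend's page-size limit."""
    elements = []
    for chunk_start in range(start, stop, max_page_size):
        chunk_stop = min(chunk_start + max_page_size, stop)
        elements.extend(fetch_range(chunk_start, chunk_stop))
    return elements
```

`dataset[300:1000]` then becomes seven 100-element requests instead of one oversized request that the backend rejects.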
When iterating over the elements' headers of a dataset rather than the data, the data is downloaded too, because of the caching system running in the background. There should be an option to disable this behaviour.
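The opt-out could be a single flag on the iterator. This sketch uses hypothetical `fetch_headers`/`fetch_data` callables and a `cache_data` flag, none of which are the real dhub names:

```python
def iter_elements(fetch_headers, fetch_data, cache_data=True):
    """Sketch: yield element headers; download content eagerly only
    when cache_data is True (the current always-on behaviour)."""
    for header in fetch_headers():
        if cache_data:
            # Eager background download, as the caching system does today.
            header["content"] = fetch_data(header["id"])
        yield header
```

With `cache_data=False`, a metadata-only scan of a large dataset no longer pays the bandwidth cost of fetching every element's binary content.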