
radiant-mlhub's Introduction


Radiant MLHub Python Client


A Python client for the Radiant MLHub API.

Documentation

API reference and Getting Started guides are available on Read the Docs.

Python Version Support

The radiant_mlhub Python client requires Python >= 3.8.

This library follows PySTAC in adopting the NEP-29 recommendations for deprecating Python version support. This means users can expect support for Python 3.8 to be removed from the main branch after April 14, 2023, and therefore from the next release after that date.

Design Decisions

Major architectural and design decisions are documented using Architecture Decision Records stored in the docs/adr directory.

Contributing

See the Contributing docs for details on contributing to the project.

radiant-mlhub's People

Contributors

adamjstewart, dependabot[bot], duckontheweb, guidorice, kbgg, powerchell, rowenarono95


radiant-mlhub's Issues

Add keywords and long descriptions to Datasets/Collections

Problem Description

The description fields in the STAC Collections returned by the API are pretty minimal, and we do not include any keywords. This makes it harder for users to discover and filter collections using the Python client.

Feature Description

  • Add the long-form descriptions and tags from the MLHub Registry to the /datasets API responses (see the sketch below).
  • Add tags from the MLHub Registry to the /collections API responses, where appropriate.
  • Add documentation to make it clear how to access these fields.
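
For illustration, a /datasets response entry might gain fields along these lines (the description and tags field names and values are assumptions, not a confirmed schema):

# Hypothetical /datasets entry after the change:
dataset_entry = {
    'id': 'ref_african_crops_kenya_02',
    'title': 'CV4A Kenya Crop Type Competition',
    'description': 'Long-form description pulled from the MLHub Registry ...',
    'tags': ['agriculture', 'crop-type', 'sentinel-2'],
}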

Alternative Solutions

We considered adding longer descriptions to the Collection objects as well, but the descriptions on the Registry are really more appropriate at the dataset level.

Related Issues

N/A

No module named `typing_extensions`

From the support email:

I re-installed the library and I am facing the following error whenever I try to import:

Traceback (most recent call last):
  File "/home/userredacted/Documents/test_radiant_earth_api/mypython/lib/python3.6/site-packages/radiant_mlhub/models.py", line 8, in <module>
    from typing import Literal  # type: ignore [attr-defined]
ImportError: cannot import name 'Literal'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/userredacted/Documents/test_radiant_earth_api/mypython/lib/python3.6/site-packages/radiant_mlhub/__init__.py", line 3, in <module>
    from .models import Collection, Dataset
  File "/home/userredacted/Documents/test_radiant_earth_api/mypython/lib/python3.6/site-packages/radiant_mlhub/models.py", line 10, in <module>
    from typing_extensions import Literal  # type: ignore [misc]
ModuleNotFoundError: No module named 'typing_extensions'
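
The root cause appears to be the guarded import in models.py: on Python < 3.8, from typing import Literal fails and the client falls back to the typing_extensions backport, which was not installed in this environment. A minimal sketch of the pattern, simplified from the traceback above:

try:
    from typing import Literal  # available in the standard library on Python >= 3.8
except ImportError:
    # Backport package; must be declared as a dependency for Python < 3.8
    from typing_extensions import Literal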

Bug caused by PySTAC upgrade: bad version pinning in our setup.py

In #149 we noticed the unit tests were failing. We tracked it down to pystac version 1.5.0 and higher. This happened because we are using incorrect ~= version pinning in our setup.py.

This style of pin, 'pystac~=1.4', actually means compatible with pystac 1.* (see https://stackoverflow.com/a/67374648/6199836).
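
For reference, a sketch of the PEP 440 compatible-release semantics at play (the install_requires contents are illustrative, not the full setup.py):

# 'pystac~=1.4'   is equivalent to  pystac >=1.4, ==1.*     (allows 1.5.0 -- too loose)
# 'pystac~=1.4.0' is equivalent to  pystac >=1.4.0, ==1.4.* (the intended constraint)
install_requires = [
    'pystac~=1.4.0',
]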

The fix should be two things in separate PRs:

  1. High priority: Change setup.py to use full semver strings like 'pystac~=1.4.0', which means compatible with pystac 1.4.*. Do this for all the packages mentioned in setup.py, then hotfix-release to PyPI and conda-forge.
  2. Lower priority: Upgrade to the latest pystac and fix whatever errors were introduced:
FAILED test/models/test_ml_model.py::TestMLModel::test_list_ml_models - TypeError: __init__() got an unexpected keyword argument 'assets'
FAILED test/models/test_ml_model.py::TestMLModel::test_fetch_ml_model_by_id - TypeError: __init__() got an unexpected keyword argument 'assets'
FAILED test/models/test_ml_model.py::TestMLModel::test_ml_model_resolves_links - TypeError: __init__() got an unexpected keyword argument 'assets'
FAILED test/models/test_ml_model.py::TestMLModel::test_dunder_str_method - TypeError: __init__() got an unexpected keyword argument 'assets'

Using `api_key` in `Dataset.download(...)` raises an exception

Environment

What operating system are you using?

Ubuntu 18.04

What Python version are you using?

3.8.6

What version of the radiant_mlhub library are you using?

0.2.0

Problem Description

I get an exception APIKeyNotFound: Could not resolve an API key from arguments, the environment, or a config file. when using api_key as a kwarg in the Dataset.download method. The same exception doesn't happen when authenticating with an environment variable.

Steps To Reproduce

dataset = Dataset.fetch('ref_african_crops_kenya_02', api_key=API_KEY)
dataset.download(output_dir="data/", api_key=API_KEY)

Expected behavior

No errors :)

Continuous 104 exception trying to download the data for "dlr_fusion_competition_germany"

Environment

What operating system are you using?

Ubuntu 21.04

What Python version are you using?

3.8.5

What version of the radiant_mlhub library are you using?

0.3.1

Problem Description

ConnectionResetError Traceback (most recent call last)
~/miniconda3/envs/ai4food/lib/python3.8/site-packages/urllib3/response.py in _error_catcher(self)
437 try:
--> 438 yield
439

~/miniconda3/envs/ai4food/lib/python3.8/site-packages/urllib3/response.py in read(self, amt, decode_content, cache_content)
518 cache_content = False
--> 519 data = self._fp.read(amt) if not fp_closed else b""
520 if (

~/miniconda3/envs/ai4food/lib/python3.8/http/client.py in read(self, amt)
457 b = bytearray(amt)
--> 458 n = self.readinto(b)
459 return memoryview(b)[:n].tobytes()

~/miniconda3/envs/ai4food/lib/python3.8/http/client.py in readinto(self, b)
501 # (for example, reading in 1k chunks)
--> 502 n = self.fp.readinto(b)
503 if not n and b:

~/miniconda3/envs/ai4food/lib/python3.8/socket.py in readinto(self, b)
668 try:
--> 669 return self._sock.recv_into(b)
670 except timeout:

~/miniconda3/envs/ai4food/lib/python3.8/ssl.py in recv_into(self, buffer, nbytes, flags)
1240 self.__class__)
-> 1241 return self.read(nbytes, buffer)
1242 else:

~/miniconda3/envs/ai4food/lib/python3.8/ssl.py in read(self, len, buffer)
1098 if buffer is not None:
-> 1099 return self._sslobj.read(len, buffer)
1100 else:

ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

ProtocolError Traceback (most recent call last)
~/miniconda3/envs/ai4food/lib/python3.8/site-packages/requests/models.py in generate()
757 try:
--> 758 for chunk in self.raw.stream(chunk_size, decode_content=True):
759 yield chunk

~/miniconda3/envs/ai4food/lib/python3.8/site-packages/urllib3/response.py in stream(self, amt, decode_content)
575 while not is_fp_closed(self._fp):
--> 576 data = self.read(amt=amt, decode_content=decode_content)
577

~/miniconda3/envs/ai4food/lib/python3.8/site-packages/urllib3/response.py in read(self, amt, decode_content, cache_content)
540 # Content-Length are caught.
--> 541 raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
542

~/miniconda3/envs/ai4food/lib/python3.8/contextlib.py in __exit__(self, type, value, traceback)
130 try:
--> 131 self.gen.throw(type, value, traceback)
132 except StopIteration as exc:

~/miniconda3/envs/ai4food/lib/python3.8/site-packages/urllib3/response.py in _error_catcher(self)
454 # This includes IncompleteRead.
--> 455 raise ProtocolError("Connection broken: %r" % e, e)
456

ProtocolError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))

During handling of the above exception, another exception occurred:

ChunkedEncodingError Traceback (most recent call last)
in <module>
----> 1 arch_path = download_archive(dslist[1])

~/data/ai4food/radiant_mlhub/client.py in download_archive(archive_id, output_dir, if_exists, api_key, profile)
502
503 try:
--> 504 return _download(
505 f'archive/{archive_id}',
506 output_dir=output_dir,

~/data/ai4food/radiant_mlhub/client.py in _download(url, output_dir, if_exists, chunk_size, api_key, profile)
129 with tqdm(total=round(content_length / 1000000., 1), unit='M') as pbar:
130 pbar.update(round(start / 1000000., 1))
--> 131 for chunk in executor.map(
132 partial(_fetch_range, download_url),
133 _get_ranges(content_length, chunk_size, start=start)

~/miniconda3/envs/ai4food/lib/python3.8/concurrent/futures/_base.py in result_iterator()
609 # Careful not to keep a reference to the popped future
610 if timeout is None:
--> 611 yield fs.pop().result()
612 else:
613 yield fs.pop().result(end_time - time.monotonic())

~/miniconda3/envs/ai4food/lib/python3.8/concurrent/futures/_base.py in result(self, timeout)
430 raise CancelledError()
431 elif self._state == FINISHED:
--> 432 return self.__get_result()
433
434 self._condition.wait(timeout)

~/miniconda3/envs/ai4food/lib/python3.8/concurrent/futures/_base.py in __get_result(self)
386 def __get_result(self):
387 if self._exception:
--> 388 raise self._exception
389 else:
390 return self._result

~/miniconda3/envs/ai4food/lib/python3.8/concurrent/futures/thread.py in run(self)
55
56 try:
---> 57 result = self.fn(*self.args, **self.kwargs)
58 except BaseException as exc:
59 self.future.set_exception(exc)

~/data/ai4food/radiant_mlhub/client.py in _fetch_range(url_, range_)
84 def _fetch_range(url_: str, range_: str) -> bytes:
85 """Internal function for fetching a byte range from the url."""
---> 86 return session.get(url_, headers={'Range': f'bytes={range_}'}).content
87
88 # Resolve user directory shortcuts and relative paths

~/miniconda3/envs/ai4food/lib/python3.8/site-packages/requests/sessions.py in get(self, url, **kwargs)
553
554 kwargs.setdefault('allow_redirects', True)
--> 555 return self.request('GET', url, **kwargs)
556
557 def options(self, url, **kwargs):

~/data/ai4food/radiant_mlhub/session.py in request(self, method, url, **kwargs)
104 ).geturl()
105
--> 106 response = super().request(method, url, **kwargs)
107
108 # Handle authentication errors

~/miniconda3/envs/ai4food/lib/python3.8/site-packages/requests/sessions.py in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
540 }
541 send_kwargs.update(settings)
--> 542 resp = self.send(prep, **send_kwargs)
543
544 return resp

~/miniconda3/envs/ai4food/lib/python3.8/site-packages/requests/sessions.py in send(self, request, **kwargs)
695
696 if not stream:
--> 697 r.content
698
699 return r

~/miniconda3/envs/ai4food/lib/python3.8/site-packages/requests/models.py in content(self)
834 self._content = None
835 else:
--> 836 self._content = b''.join(self.iter_content(CONTENT_CHUNK_SIZE)) or b''
837
838 self._content_consumed = True

~/miniconda3/envs/ai4food/lib/python3.8/site-packages/requests/models.py in generate()
759 yield chunk
760 except ProtocolError as e:
--> 761 raise ChunkedEncodingError(e)
762 except DecodeError as e:
763 raise ContentDecodingError(e)

ChunkedEncodingError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))

Steps To Reproduce

from radiant_mlhub import Dataset
import time
ds = Dataset.fetch('dlr_fusion_competition_germany')
ds.download('data/')


Downloads fail if connection is broken

Some users have been seeing errors like the one below when trying to download collection archives. These errors appear to be caused by the connection breaking during the download and seem more likely when downloading larger datasets over an unreliable connection. The only option is then to restart the download from scratch, which often just leads to the same outcome.

Traceback (most recent call last):
  File "/path/to/environment/lib/python3.6/site-packages/urllib3/connectionpool.py", line 706, in urlopen
    chunked=chunked,
  File "/path/to/environment/lib/python3.6/site-packages/urllib3/connectionpool.py", line 445, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/path/to/environment/lib/python3.6/site-packages/urllib3/connectionpool.py", line 440, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.6/http/client.py", line 1373, in getresponse
    response.begin()
  File "/usr/lib/python3.6/http/client.py", line 311, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.6/http/client.py", line 280, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/path/to/environment/lib/python3.6/site-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/path/to/environment/lib/python3.6/site-packages/urllib3/connectionpool.py", line 756, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/path/to/environment/lib/python3.6/site-packages/urllib3/util/retry.py", line 531, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/path/to/environment/lib/python3.6/site-packages/urllib3/packages/six.py", line 734, in reraise
    raise value.with_traceback(tb)
  File "/path/to/environment/lib/python3.6/site-packages/urllib3/connectionpool.py", line 706, in urlopen
    chunked=chunked,
  File "/path/to/environment/lib/python3.6/site-packages/urllib3/connectionpool.py", line 445, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/path/to/environment/lib/python3.6/site-packages/urllib3/connectionpool.py", line 440, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.6/http/client.py", line 1373, in getresponse
    response.begin()
  File "/usr/lib/python3.6/http/client.py", line 311, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.6/http/client.py", line 280, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/path/to/environment/lib/python3.6/site-packages/radiant_mlhub/models.py", line 155, in download
    client.download_archive(self.id, output_dir=output_dir, overwrite=overwrite, **session_kwargs)
  File "/path/to/environment/lib/python3.6/site-packages/radiant_mlhub/client.py", line 302, in download_archive
    _download(f'archive/{archive_id}', output_dir=output_dir, overwrite=overwrite, **session_kwargs)
  File "/path/to/environment/lib/python3.6/site-packages/radiant_mlhub/client.py", line 91, in _download
    for chunk in executor.map(partial(_fetch_range, download_url), _get_ranges(content_length, chunk_size)):
  File "/usr/lib/python3.6/concurrent/futures/_base.py", line 586, in result_iterator
    yield fs.pop().result()
  File "/usr/lib/python3.6/concurrent/futures/_base.py", line 425, in result
    return self.__get_result()
  File "/usr/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/usr/lib/python3.6/concurrent/futures/thread.py", line 56, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/path/to/environment/lib/python3.6/site-packages/radiant_mlhub/client.py", line 58, in _fetch_range
    return session.get(url_, headers={'Range': f'bytes={range_}'}).content
  File "/path/to/environment/lib/python3.6/site-packages/requests/sessions.py", line 555, in get
    return self.request('GET', url, **kwargs)
  File "/path/to/environment/lib/python3.6/site-packages/radiant_mlhub/session.py", line 88, in request
    response = super().request(method, url, **kwargs)
  File "/path/to/environment/lib/python3.6/site-packages/requests/sessions.py", line 542, in request
    resp = self.send(prep, **send_kwargs)
  File "/path/to/environment/lib/python3.6/site-packages/requests/sessions.py", line 655, in send
    r = adapter.send(request, **kwargs)
  File "/path/to/environment/lib/python3.6/site-packages/requests/adapters.py", line 498, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))
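
One possible mitigation would be to retry individual range requests with backoff instead of failing the entire archive download. A minimal sketch, assuming the same Range-header approach the client already uses internally (fetch_range_with_retry is a hypothetical helper, not part of the library):

import time

import requests

def fetch_range_with_retry(session: requests.Session, url: str,
                           range_: str, attempts: int = 3) -> bytes:
    """Fetch a single byte range, retrying on broken connections."""
    for attempt in range(attempts):
        try:
            response = session.get(url, headers={'Range': f'bytes={range_}'})
            response.raise_for_status()
            return response.content
        except (requests.exceptions.ConnectionError,
                requests.exceptions.ChunkedEncodingError):
            if attempt == attempts - 1:
                raise
            time.sleep(2 ** attempt)  # back off before retrying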

Dataset organisation of SpaceNet 2: Vegas collection

Environment

What operating system are you using?

Ubuntu 18.04

What Python version are you using?

Python 3.9.7

What version of the radiant_mlhub library are you using?

mlhub, version 0.2.2

Problem Description

Please provide a clear and concise description of the bug/problem, including error messages.

I recently downloaded the sn2_AOI_2_Vegas collection and noticed something odd with the dataset organisation.

.
├── collection.json
├── _common
│   └── labels.geojson
├── sn2_SN2_buildings_train_AOI_2_Vegas_PS-RGB_img1
│   ├── MS.tif
│   ├── PAN.tif
│   ├── PS-MS.tif
│   ├── PS-RGB.tif
│   └── stac.json
├── sn2_SN2_buildings_train_AOI_2_Vegas_PS-RGB_img1-labels
│   └── stac.json
│
├── sn2_SN2_buildings_train_AOI_2_Vegas_PS-RGB_img3
│   ├── MS.tif
│   ├── PAN.tif
│   ├── PS-MS.tif
│   ├── PS-RGB.tif
│   └── stac.json
├── sn2_SN2_buildings_train_AOI_2_Vegas_PS-RGB_img3-labels
│   ├── label.geojson
│   └── stac.json
......................

For img1, the labels are located in the _common folder and named labels.geojson whereas the labels for the rest of the images are in their respective label folders and are named label.geojson. Only img1 of the Vegas collection seems to be like this. Is there a reason for this?

The stac.json file in sn2_SN2_buildings_train_AOI_2_Vegas_PS-RGB_img1-labels reflects this as well.

Steps To Reproduce

from radiant_mlhub import Collection

vegas = Collection.fetch("sn2_AOI_2_Vegas")
vegas.download(output_dir=".")

Expected behavior

Consistent data organisation

Spacenet 3: Cannot download AOI2, AOI3, AOI4

Environment

What operating system are you using?

CentOS Linux v7

What Python version are you using?

3.7.10

What version of the radiant_mlhub library are you using?

0.1.2

Problem Description

I cannot seem to download any Spacenet 3 AOIs except AOI_5_Khartoum.

Steps To Reproduce

>>> from radiant_mlhub import Dataset, Collection
>>> sn3 = Dataset.fetch('spacenet3')
>>> sn3_collections = [i for i in Collection.list() if i.id[:3] == "sn3"]
>>> sn3_collections
[<Collection id=sn3_AOI_2_Vegas>, <Collection id=sn3_AOI_5_Khartoum>, <Collection id=sn3_AOI_3_Paris>, <Collection id=sn3_AOI_4_Shanghai>]
>>> sn3.collections
[<Collection id=sn3_AOI_3_Paris>, <Collection id=sn3_AOI_2_Vegas>, <Collection id=sn3_AOI_3_Paris>, <Collection id=sn3_AOI_3_Paris>, <Collection id=sn3_AOI_3_Paris>, <Collection id=sn3_AOI_2_Vegas>, <Collection id=sn3_AOI_4_Shanghai>, <Collection id=sn3_AOI_5_Khartoum>]


>>> sn3.collections[0].download('./Spacenet3')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/an1/miniconda3/envs/rdet2/lib/python3.7/site-packages/radiant_mlhub/models.py", line 168, in download
    return client.download_archive(self.id, output_dir=output_dir, if_exists=if_exists, **session_kwargs)
  File "/home/an1/miniconda3/envs/rdet2/lib/python3.7/site-packages/radiant_mlhub/client.py", line 355, in download_archive
    f'Archive "{archive_id}" does not exist and may still be generating. Please try again later.') from None
radiant_mlhub.exceptions.EntityDoesNotExist: Archive "sn3_AOI_3_Paris" does not exist and may still be generating. Please try again later.

>>> sn3.collections[5].download('./Spacenet3')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/an1/miniconda3/envs/rdet2/lib/python3.7/site-packages/radiant_mlhub/models.py", line 168, in download
    return client.download_archive(self.id, output_dir=output_dir, if_exists=if_exists, **session_kwargs)
  File "/home/an1/miniconda3/envs/rdet2/lib/python3.7/site-packages/radiant_mlhub/client.py", line 355, in download_archive
    f'Archive "{archive_id}" does not exist and may still be generating. Please try again later.') from None
radiant_mlhub.exceptions.EntityDoesNotExist: Archive "sn3_AOI_2_Vegas" does not exist and may still be generating. Please try again later.

>>> sn3_collections[0].download('./Spacenet3')
5265.0M [03:57, 22.15M/s]                                                                                                                         
PosixPath('/home/an1/Datasets/Spacenet3/sn3_AOI_5_Khartoum.tar.gz')

Expected behavior

I was expecting to successfully download a specific Spacenet 3 collection.

Enable downloading assets within a pytest environment

Problem Description

Calls to Dataset.download will not download assets when running under pytest because of this short circuit: https://github.com/radiantearth/radiant-mlhub/blob/main/radiant_mlhub/client/catalog_downloader.py#L637-L640

This makes it harder to test code that uses mlhub download functionality.

Feature Description

Enable downloading assets in a pytest environment by removing the short circuit and addressing compatibility with vcr.py somehow. @guidorice indicated in the ml-model Slack channel that the short circuit is there to keep the vcr.py tests working; a sketch of the behavior is below.
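
For context, the short circuit amounts to something like the following (an illustrative sketch, not the actual implementation; MLHUB_ALLOW_TEST_DOWNLOADS is a hypothetical opt-out variable, and the real check in catalog_downloader.py may differ):

import os

def _skip_asset_downloads() -> bool:
    # pytest sets PYTEST_CURRENT_TEST for the duration of each test
    running_under_pytest = 'PYTEST_CURRENT_TEST' in os.environ
    # hypothetical opt-out so test suites could still exercise real downloads
    opt_out = os.environ.get('MLHUB_ALLOW_TEST_DOWNLOADS') == 'true'
    return running_under_pytest and not opt_out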

Alternative Solutions

A workaround would be to download data up front, set up fixtures, and assume downloading assets works. But I think it'd be preferable to be able to test it directly.

Additional context

We are working on a python package that can convert Radiant Earth MLHub STAC records to TFRecords, and potentially other ML-ready formats.

I'll be traveling to Lisbon, Portugal for a work team week next week, so my availability to test/work on this may be a bit limited. However, I'll try forking this library to see what happens when I remove the short circuit, and will update this ticket.

Provide new release to allow more recent `shapely`

Problem Description

The following change was merged 3 months ago, but there has not been a new release (0.5.6?) since:

- Add support for shapely 2

Any project that tries to install the latest MLHub package still faces the shapely>=1.8.0,<1.9.0 restriction.
It is not currently possible to run poetry add radiant_mlhub in a project that uses shapely>=2.

Feature Description

Release the new version.

Alternative Solutions

Overriding the dependency constraint with a direct pip install.

Spacenet3: dataset.collections different from collections list

Environment

What operating system are you using?

CentOS Linux v7

What Python version are you using?

3.7.10

What version of the radiant_mlhub library are you using?

0.1.2

Problem Description

I don't know if this is a bug or a misunderstanding on my part, but for Spacenet 3, dataset.collections and collections from Collection.list() are different.

Steps To Reproduce

>>> from radiant_mlhub import Dataset, Collection
>>> sn3 = Dataset.fetch('spacenet3')
>>> sn3_collections = [i for i in Collection.list() if i.id[:3] == "sn3"]
>>> sn3_collections
[<Collection id=sn3_AOI_2_Vegas>, <Collection id=sn3_AOI_5_Khartoum>, <Collection id=sn3_AOI_3_Paris>, <Collection id=sn3_AOI_4_Shanghai>]
>>> sn3.collections
[<Collection id=sn3_AOI_3_Paris>, <Collection id=sn3_AOI_2_Vegas>, <Collection id=sn3_AOI_3_Paris>, <Collection id=sn3_AOI_3_Paris>, <Collection id=sn3_AOI_3_Paris>, <Collection id=sn3_AOI_2_Vegas>, <Collection id=sn3_AOI_4_Shanghai>, <Collection id=sn3_AOI_5_Khartoum>]

Expected behavior

I was expecting sn3.collections and sn3_collections to have the same output.

Questions:

  1. Why are sn3.collections and sn3_collections different?
  2. Why are there 3 collections for AOI_3_Paris and 2 for AOI_2_Vegas in sn3.collections?

Method for getting archive size

Problem Description

Users don't currently have a way of determining the size of a collection archive prior to actually downloading it. This can be a problem in environments with limited storage, and it would be useful to get a ballpark estimate of how long the download will take before starting.

Feature Description

  • Add an archive_size property to Collection instances that returns the size (in MB) of the archive for that Collection, or None if the archive does not exist. The property will be cached to avoid unnecessary network traffic.
  • Add an archive_sizes property to Dataset instances that returns a dictionary mapping each Collection ID to its associated archive size (or None if the archive does not exist for that Collection). See the usage sketch below.
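
A sketch of how the proposed properties might be used (IDs and return values are illustrative; neither property exists yet):

from radiant_mlhub import Collection, Dataset

collection = Collection.fetch('ref_african_crops_kenya_02_labels')
print(collection.archive_size)  # e.g. 42.5 (MB), or None if no archive exists

dataset = Dataset.fetch('ref_african_crops_kenya_02')
print(dataset.archive_sizes)    # e.g. {'ref_african_crops_kenya_02_labels': 42.5, ...}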

Alternative Solutions

We could also introduce a dry_run argument to the existing download methods, but this seems less transparent.

Related Issues

N/A

Additional context

This issue comes out of user feedback during our first round of user testing.

SpaceNet missing collections

I wanted to follow up on some missing SpaceNet collections:

  • SpaceNet 6 is currently not available. Is it being re-archived?
  • SpaceNet 5 has AOI 9 San Juan, which is missing in Radiant MLHub. Any plans to add it?
  • All SpaceNet datasets contain train (images + labels) and test images. However, on MLHub, all SpaceNet datasets (except SpaceNet 7) contain only the train data. Are there any plans to include the test data for SpaceNet 1 to 6 as a separate collection?

cc: @kbgg

When using api_key parameter, some class properties cannot be accessed

Environment

macOS 11.4
Python 3.9.4
mlhub, version 0.2.0

Problem Description

When using the api_key function parameter, some class properties cannot be accessed, for example Dataset.collections.

Steps To Reproduce

  1. Remove any profile or environment variables containing the API key
  2. Iterate over datasets and collections using the following example:
from radiant_mlhub import Dataset
k = '*****'
datasets = Dataset.list(api_key=k)
for d in datasets:
    print(d.id)
    for c in d.collections:  # APIKeyNotFound is raised here
        print(c.id)

Expected behavior

There should be a way to resolve the collections using the Dataset class, even when using the api_key='xxx' method of authentication.
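
Until the session kwargs propagate, a workaround consistent with the download issue reported above is to authenticate via the environment instead of the api_key argument:

import os

from radiant_mlhub import Dataset

os.environ['MLHUB_API_KEY'] = '*****'  # resolved from the environment for every request

for d in Dataset.list():
    print(d.id)
    for c in d.collections:  # no APIKeyNotFound with env-var auth
        print(c.id)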


Additional dataset attributes break Dataset.collections

Environment

What operating system are you using?

macOS Catalina 10.15.7 (also reported on Windows)

What Python version are you using?

Python 3.8.6

What version of the radiant_mlhub library are you using?

mlhub, version 0.1.2

Problem Description

Calling the Dataset.collections property raises the following exception:

TypeError: get_session() got an unexpected keyword argument 'citation'

Steps To Reproduce

>>> from radiant_mlhub import Dataset
>>> Dataset.list()
<generator object Dataset.list at 0x10bbacc10>
>>> next(Dataset.list())
<radiant_mlhub.models.Dataset object at 0x10aa5c490>
>>> next(Dataset.list()).collections
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/jduckworth/Code/ml-hub/radiant-mlhub/radiant_mlhub/models.py", line 290, in collections
    collections = [_fetch_collection(only_description)]
  File "/Users/jduckworth/Code/ml-hub/radiant-mlhub/radiant_mlhub/models.py", line 282, in _fetch_collection
    Collection.fetch(_collection_description['id'], **self.session_kwargs),
  File "/Users/jduckworth/Code/ml-hub/radiant-mlhub/radiant_mlhub/models.py", line 106, in fetch
    response = client.get_collection(collection_id, **session_kwargs)
  File "/Users/jduckworth/Code/ml-hub/radiant-mlhub/radiant_mlhub/client.py", line 216, in get_collection
    session = get_session(**session_kwargs)
TypeError: get_session() got an unexpected keyword argument 'citation'

Expected behavior

Dataset.collections should return a list of collections.

Additional Context

This was recently working, but was reported as broken starting on 2021-05-07.

"ref_landcovernet_v1_source" file does not exist ?

Hello. I was modifying the Datasets & collections.ipynb notebook to access and download the LandCoverNet collection.
While I was able to modify the BigEarthNet cell example to download the landcovernet_v1_label.tar.gz file, I was not able to download the corresponding source (imagery) file.
First:

landcovernet_dataset = Dataset.fetch('landcovernet_v1')
print(f'Total Collections: {landcovernet_dataset.collections}')

Total Collections: [<Collection id=ref_landcovernet_v1_source>, <Collection id=ref_landcovernet_v1_labels>]

Then:

landcovernet_source= Collection.fetch('ref_landcovernet_v1_source')
landcovernet_source.download('./landcovernet_source.tar.gz')

EntityDoesNotExist Traceback (most recent call last)
in <module>()
----> 1 landcovernet_source.download('./landcovernet_source.tar.gz')

1 frames
/usr/local/lib/python3.6/dist-packages/radiant_mlhub/client.py in download_archive(archive_id, output_dir, overwrite, **session_kwargs)
304 if e.response.status_code != 404:
305 raise
--> 306 raise EntityDoesNotExist(f'Archive "{archive_id}" does not exist and may still be generating. Please try again later.') from None

EntityDoesNotExist: Archive "ref_landcovernet_v1_source" does not exist and may still be generating. Please try again later.

Drop support for Python 3.6

PySTAC may be dropping support for Python 3.6 as part of its first major release (see stac-utils/pystac#500), based on the recommendations in NEP-29. While we do not need to upgrade to PySTAC 1.0.0 immediately, we may want to follow suit to avoid being stuck supporting older versions.

If anyone has a strong need to keep support for Python 3.6, feel free to comment here. The change itself would be a small setup.py update, sketched below.
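
A sketch of the corresponding setup.py change (other arguments omitted):

from setuptools import setup

setup(
    name='radiant_mlhub',
    python_requires='>=3.7',  # was '>=3.6'; drops Python 3.6 support
)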

Improve projection metadata for sen12floods and other STACs so they can be used more easily with stackstac

Problem Description

I filtered a set of STAC items from sen12floods_s2_source and tried to pass the item collection to stackstac, a library for loading PySTAC item collections as xarrays, but I got this error:

stackstac.stack(source_items, epsg = 4326)

ValueError: Cannot automatically compute the resolution, since asset 'B01' on item 0 'sen12floods_s2_source_62_2019_03_09' doesn't provide enough metadata to determine its native resolution.
We'd need at least one of (in order of preference):
- The `proj:transform` and `proj:epsg` fields set on the asset, or on the item
- The `proj:shape` and one of `proj:bbox` or `bbox` fields set on the asset, or on the item

Please specify the `resolution=` argument to set the output resolution manually. (Remember that resolution must be in the units of your CRS (http://epsg.io/4326)---not necessarily meters.)

Feature Description

Provide the following metadata (sketched below):

The `proj:transform` and `proj:epsg` fields set on the asset, or on the item;

or, since bbox is already provided, proj:shape needs to be provided as well.
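
For illustration, the item (or asset) properties would need to carry fields along these lines (values are placeholders, not taken from the actual catalog):

# Illustrative STAC projection fields on an item:
item_properties = {
    'proj:epsg': 4326,
    'proj:shape': [1024, 1024],  # rows, cols; together with bbox this fixes the resolution
    # or, preferably, the full affine transform plus the EPSG code:
    'proj:transform': [0.0001, 0.0, 8.1, 0.0, -0.0001, 47.9, 0.0, 0.0, 1.0],
}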

Alternative Solutions

Specifying the resolution argument when calling stackstac.stack. This adds some overhead for the user and means we need to use projected coordinate systems when calling stackstac.stack, since decimal-degree resolution is not consistent. A sketch of this workaround is below.
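
A sketch of that workaround (the resolution value is illustrative and must match the assets' actual resolution in CRS units; source_items is the item collection from the snippet above):

import stackstac

# For a geographic CRS like EPSG:4326 the resolution is in degrees,
# not meters (roughly 0.0001 degrees =~ 10 m at the equator).
stack = stackstac.stack(source_items, epsg=4326, resolution=0.0001)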

Context

Guidelines on making STAC catalogs with complete projection metadata: gjoseph92/stackstac#110
gjoseph92/stackstac#108 (reply in thread)

Data download error with collection_filter option

Environment

Google Colab (Linux)
Python 3.7.13
mlhub, version 0.4.1

CODE:

import getpass
import os

from radiant_mlhub import Dataset

os.environ['MLHUB_API_KEY'] = getpass.getpass(prompt="MLHub API Key: ")

# `main`, `assets`, and `selected_bands` are defined earlier in the starter notebook
dataset = Dataset.fetch(main)

my_filter = dict(
    ref_agrifieldnet_competition_v1_labels_train=assets,
    ref_agrifieldnet_competition_v1_labels_test=[assets[0]],
    ref_agrifieldnet_competition_v1_source=selected_bands,
)

dataset.download(collection_filter=my_filter)

ERROR: TypeError: download() got an unexpected keyword argument 'collection_filter'

To reproduce the error, import and run the agrifieldnet_india_competition starter notebook on Colab:
https://github.com/radiantearth/agrifieldnet_india_competition/blob/main/Starter%20notebook.ipynb

Without the filter, the data download process works fine.

Move to One Flow branching strategy instead of Git Flow

I adopted a Git Flow branching strategy for this project mostly out of familiarity and because it has worked in other projects. However, it seems to add a lot of complexity to the process of adding features and hotfixes without adding much value. It can also cause headaches if we start a branch thinking it's a hotfix or feature and then decide partway through that it should be the other.

The goal of this issue is to discuss moving to a One Flow branching strategy instead, to simplify the development process; I'm looking for feedback on this approach. Some of the significant changes would be:

  • All feature, hotfix, and release branches would come from and merge back into main
  • Release tags would happen on a branch
  • Hotfixes would include a patch version upgrade in their PR
  • Publishing a new patch version would also include any feature work since the last release, so we would need to make sure features are bombproof by the time they are merged.

@guidorice @kbgg I'm curious to get your thoughts on this.

Able to update pystac pin?

I see from #150 that you identified a bug in PySTAC 1.5.x that caused a problem and updated the pin on pystac to ~=1.4.0. However, PySTAC is several versions beyond that now. Have any of these newer versions been tested to see if they have the same bug? If not, could the pin on pystac be updated?

Nothing is happening when I start a download

Environment

Description: Ubuntu 18.04.5 LTS
Python 3.7.8
radiant-mlhub==0.3.0

Problem Description

I got this code from a challenge, and when I try to download the dataset nothing happens. It stays there for hours; I have tried several times:

# Create the folder for the data to be downloaded and set download credentials
import os

from radiant_mlhub import Dataset

os.environ['MLHUB_API_KEY'] = 'My api key'

if not os.path.exists('data/'):
    os.makedirs('data/')

# Download the Brandenburg, Germany dataset
dataset = Dataset.fetch('dlr_fusion_competition_germany')
print(f'{dataset.id}: {dataset.title}')
dataset.download('data/')

This outputs:
dlr_fusion_competition_germany: A Fusion Dataset for Crop Type Classification in Germany

My API key was obtained from https://mlhub.earth/ and it is active.

Human-readable Collection info

Problem Description

We currently have a lot of example and tutorial code that does manual string formatting to print Collection information in a human-readable way.

for collection in collections:
    print(f'{collection.id}: {collection.description}')

This results in a lot of manual string-formatting work by the user to get information in a way that's easy to read. It might be better to have a built-in solution for this.

Feature Description

Add an info property to the Collection class that returns a human-readable summary of the Collection as a str. This could include a standard set of properties by default, but allow the user to pass in a custom list of fields to include if they want (invalid fields would be ignored).
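
A sketch of how the proposed property might be used (the output format is illustrative; nothing here exists yet):

from radiant_mlhub import Collection

collection = Collection.fetch('ref_african_crops_kenya_02_labels')
print(collection.info)
# ref_african_crops_kenya_02_labels
# Description: ...
# License: ...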

Alternative Solutions

We could also accomplish this through more atomic properties (e.g. the archive_size_readable property discussed here), but having them all in a single property seems like a better user experience.

Related Issues

N/A

Additional context

See Alternative Solutions above for reference to similar conversation about human-readable archive sizes.

Permission error using configure CLI

Problem Summary

During user testing, one user got the following error when trying to run mlhub configure on Windows:

PermissionError: [Errno 13] Permission denied: 'C:\\Users\\Dell\\.mlhub\\profiles'

The client can currently only read and write profile configurations (e.g. API keys) from a .mlhub/profiles file in the user's "home" directory. The path to the "home" directory is resolved using pathlib.Path.home() / '.mlhub' here, which resolves to something like /Users/myusername/.mlhub on macOS and C:\Users\MyUsername\.mlhub on Windows.

Proposed Solution

It seems like it is not safe to assume that a user has access to that directory, so we should have a way for the user to define an alternate location for their profile. The proposed solution is to allow the user to set an MLHUB_HOME environment variable that points to the directory containing all radiant_mlhub config files (currently only the profiles file). If this environment variable is not present, the client would default to using pathlib.Path.home() / '.mlhub', as in the sketch below.
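
A minimal sketch of the proposed resolution logic (MLHUB_HOME is the proposed variable name; none of this is implemented yet):

import os
from pathlib import Path

def _get_config_dir() -> Path:
    """Return the directory holding radiant_mlhub config files."""
    mlhub_home = os.environ.get('MLHUB_HOME')
    if mlhub_home:
        return Path(mlhub_home)
    return Path.home() / '.mlhub'  # current default behavior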

Workaround

Until this solution has been implemented, the workaround is to not use mlhub configure to create a profile and instead set an API key using the MLHUB_API_KEY environment variable, as described in the Authentication - Using API Keys docs.

Failure to download catalogs or data for datasets dlr_fusion_competition_germany and ref_fusion_competition_south_africa

Environment: Windows 10 Enterprise, Build 19042.1706, using VS Code

Python version: 3.8.13

Radiant_mlhub version: 0.5.1

Problem Description

When using the download function for a Radiant MLHub dataset, two specific dataset IDs fail: dlr_fusion_competition_germany and ref_fusion_competition_south_africa. A directory with the name of the dataset is created, containing an empty err_report.csv and a directory dlr_fusion_competition_germany_test_source_planet_5day with the subdirectory dlr_fusion_competition_germany_test_source_planet_5day_33N_18E_242N_2019_08_06, which is empty.

The full failure error message for the first dataset, with prior status output, is:

dlr_fusion_competition_germany: fetch stac catalog: 96KB [00:00, 418.75KB/s]
unarchive dlr_fusion_competition_germany.tar.gz: 0%| | 3/1790 [00:00<00:05, 299.49it/s]

FileNotFoundError Traceback (most recent call last)
c:\Users\613287\Desktop\wids_cohort5\agriculture\practice_mlhub.ipynb Cell 2' in <cell line: 5>()
3 mltrainingdata = Dataset.fetch_by_id('dlr_fusion_competition_germany')
4 print (mltrainingdata)
----> 5 mltrainingdata.download(catalog_only=True)

File c:\Users\613287\Anaconda3\envs\agriculture\lib\site-packages\radiant_mlhub\models\dataset.py:361, in Dataset.download(self, output_dir, catalog_only, if_exists, api_key, profile, bbox, intersects, datetime, collection_filter)
347 config = CatalogDownloaderConfig(
348 catalog_only=catalog_only,
349 api_key=api_key,
(...)
358 temporal_query=datetime,
359 )
360 dl = CatalogDownloader(config=config)
--> 361 dl()

File c:\Users\613287\Anaconda3\envs\agriculture\lib\site-packages\radiant_mlhub\client\catalog_downloader.py:740, in CatalogDownloader.__call__(self)
738 # call each step
739 for step in steps:
--> 740 step()
742 # inspect the error report
743 self.err_report.flush()

File c:\Users\613287\Anaconda3\envs\agriculture\lib\site-packages\radiant_mlhub\client\catalog_downloader.py:160, in CatalogDownloader._unarchive_catalog_step(self)
158 continue
159 else:
--> 160 archive.extract(tar_info, path=c.output_dir)
161 assert (self.work_dir / 'catalog.json').exists()

File c:\Users\613287\Anaconda3\envs\agriculture\lib\tarfile.py:2069, in TarFile.extract(self, member, path, set_attrs, numeric_owner)
2066 tarinfo._link_target = os.path.join(path, tarinfo.linkname)
2068 try:
-> 2069 self._extract_member(tarinfo, os.path.join(path, tarinfo.name),
2070 set_attrs=set_attrs,
2071 numeric_owner=numeric_owner)
2072 except OSError as e:
2073 if self.errorlevel > 0:

File c:\Users\613287\Anaconda3\envs\agriculture\lib\tarfile.py:2141, in TarFile._extract_member(self, tarinfo, targetpath, set_attrs, numeric_owner)
2138 self._dbg(1, tarinfo.name)
2140 if tarinfo.isreg():
-> 2141 self.makefile(tarinfo, targetpath)
2142 elif tarinfo.isdir():
2143 self.makedir(tarinfo, targetpath)

File c:\Users\613287\Anaconda3\envs\agriculture\lib\tarfile.py:2182, in TarFile.makefile(self, tarinfo, targetpath)
2180 source.seek(tarinfo.offset_data)
2181 bufsize = self.copybufsize
-> 2182 with bltn_open(targetpath, "wb") as target:
2183 if tarinfo.sparse is not None:
2184 for offset, size in tarinfo.sparse:

FileNotFoundError: [Errno 2] No such file or directory: 'c:\Users\613287\Desktop\wids_cohort5\agriculture\dlr_fusion_competition_germany\dlr_fusion_competition_germany_test_source_planet_5day\dlr_fusion_competition_germany_test_source_planet_5day_33N_18E_242N_2019_08_06\dlr_fusion_competition_germany_test_source_planet_5day_33N_18E_242N_2019_08_06.json'

Steps To Reproduce

from radiant_mlhub import Dataset

mltrainingdata = Dataset.fetch_by_id('dlr_fusion_competition_germany')
mltrainingdata.download(catalog_only=True)

Expected behavior

At the completion of the dataset download, a number of subdirectories should exist, each containing a JSON file.

LandCoverNet download includes unnecessary metadata when using collection_filter

Environment

AWS SageMaker (ml.geospatial.interactive)
Debian GNU/Linux 11 (bullseye)
Python 3.10.4
radiant_mlhub==0.5.2

Problem Description

I am trying to download LandCoverNet Africa, specifically the Sentinel-2 bands (B02, B03, B04, and B08) and labels. I am using a collection_filter to request only these assets. However, the downloaded archive includes Landsat 8 and Sentinel-1 metadata, so unarchiving takes a long time. I am not sure if this is a bug, a feature, or something that will be fixed in the future.

Steps To Reproduce

from radiant_mlhub import Collection, Dataset 
dataset = Dataset.fetch('ref_landcovernet_af_v1')
collection_filter = dict(
    ref_landcovernet_af_v1_source_sentinel_2=['B02', 'B03', 'B04', 'B08'],
    ref_landcovernet_af_v1_labels=['labels']
)
dataset.download(collection_filter=collection_filter, output_dir='./data/re/landcover')

Expected behavior

I expect the archive to contain only information about the requested collections (Sentinel-2 and labels).

Publish conda package

Problem Description

Conda users currently need to install the package using pip, which works but isn't ideal.

Feature Description

Create a conda recipe for the radiant-mlhub package and publish this as a package on conda-forge.

Alternative Solutions

We could alternately publish this on our own channel, but it would be less discoverable and seems like it would create unnecessary complexity.

Related Issues

N/A

Additional context

See the conda-forge Contributing Packages docs for details on submitting a new package.
