darwin-py's Introduction

V7 Darwin Python SDK

⚡️ Official library to annotate, manage datasets, and models on V7's Darwin Training Data Platform. ⚡️

Need to label data? Start using V7 free today

Darwin-py can be used both from the command line and as a Python library.


Main features include (but are not limited to):

  • Client authentication
  • Listing local and remote datasets
  • Creating/removing datasets
  • Uploading/downloading data to/from remote datasets
  • Direct integration with PyTorch dataloaders

Support is tested for Python 3.8 - 3.10.

🏁 Installation

pip install darwin-py

You can now type darwin in your terminal and access the command line interface.

If you wish to use the PyTorch bindings, use the ml flag to install all the additional requirements:

pip install darwin-py[ml]

If you wish to use video frame extraction, use the ocv flag to install all the additional requirements:

pip install darwin-py[ocv]

To run tests, first install the test extra package:

pip install darwin-py[test]

Development

See our development and QA environment installation recommendations here.


Usage as a Command Line Interface (CLI)

Once installed, darwin is accessible as a command line tool. A useful way to explore the CLI is through the -h/--help flag, which provides additional information for each available command.

Client Authentication

To perform remote operations on Darwin you first need to authenticate. This requires a team-specific API key. If you do not already have a Darwin account, you can contact us and we can set one up for you.

To start the authentication process:

$ darwin authenticate
API key:
Make example-team the default team? [y/N] y
Datasets directory [~/.darwin/datasets]:
Authentication succeeded.

You will then be prompted to enter your API key, choose whether to set the corresponding team as the default, and finally pick the desired location on the local file system for that team's datasets. This process creates a configuration file at ~/.darwin/config.yaml, which will be updated with future authentications for different teams.
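If you prefer to script authentication instead of using the interactive prompt, the library exposes the same capability; a minimal sketch using Client.from_api_key (the same call used in the issue reports further down):

from darwin.client import Client

API_KEY = "your-team-api-key"  # placeholder: substitute your team-specific API key
client = Client.from_api_key(API_KEY)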

Listing local and remote datasets

Lists a summary of existing local datasets:

$ darwin dataset local
NAME            IMAGES     SYNC_DATE         SIZE
mydataset       112025     yesterday     159.2 GB

Lists a summary of remote datasets accessible to the current user:

$ darwin dataset remote
NAME                       IMAGES     PROGRESS
example-team/mydataset     112025        73.0%
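The same listing is available from Python; a minimal sketch, assuming the configuration created by darwin authenticate is in place:

from darwin.client import Client

client = Client.local()  # reads ~/.darwin/config.yaml
for dataset in client.list_remote_datasets():
    print(dataset.name)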

Create/remove a dataset

To create an empty dataset remotely:

$ darwin dataset create test
Dataset 'test' (example-team/test) has been created.
Access at https://darwin.v7labs.com/datasets/579

The dataset will be created in the team you're authenticated for.

To delete the project on the server:

$ darwin dataset remove test
About to delete example-team/test on darwin.
Do you want to continue? [y/N] y
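Dataset creation can also be scripted; a minimal sketch, assuming Client.create_dataset mirrors the CLI command above:

from darwin.client import Client

client = Client.local()
dataset = client.create_dataset("test")  # creates example-team/test remotely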

Upload/download data to/from a remote dataset

Uploads data to an existing remote dataset. It takes as parameters the dataset name and a single file (or a directory) of images/videos to upload.

The -e/--exclude argument lets you indicate file extensions to be ignored from the data_dir, e.g. -e .jpg.

For videos, the frame extraction rate can be specified by adding --fps <frame_rate>.

Supported extensions:

  • Video files: .mp4, .bpm, .mov formats.
  • Image files: .jpg, .jpeg, .png formats.
$ darwin dataset push test /path/to/folder/with/images
100%|████████████████████████| 2/2 [00:01<00:00,  1.27it/s]
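Uploads can be scripted as well; a sketch based on the push() signature quoted in the frame-rate issue further down (fps mirrors the CLI's --fps):

from darwin.client import Client

client = Client.local()
dataset = client.get_remote_dataset("example-team/test")
# fps sets the frame extraction rate for videos, as --fps does on the CLI
dataset.push(["/path/to/image1.jpg", "/path/to/video1.mp4"], fps=1)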

Before a dataset can be downloaded, a release needs to be generated:

$ darwin dataset export test 0.1
Dataset test successfully exported to example-team/test:0.1

This version is immutable: if new images or annotations have been added, you will have to create a new release to include them.

To list all available releases:

$ darwin dataset releases test
NAME                           IMAGES     CLASSES                   EXPORT_DATE
example-team/test:0.1               4           0     2019-12-07 11:37:35+00:00

And finally, to download a release:

$ darwin dataset pull test:0.1
Dataset example-team/test:0.1 downloaded at /directory/chosen/at/authentication/time.

Usage as a Python library

The framework is designed to be usable as a standalone Python library. Usage can be inferred from the operations performed in darwin/cli_functions.py. A minimal example to download a dataset is provided below; a more extensive one can be found in ./darwin_demo.py.

from darwin.client import Client

client = Client.local() # use the configuration in ~/.darwin/config.yaml
dataset = client.get_remote_dataset("example-team/test")
dataset.pull() # downloads annotations and images for the latest exported version

Follow this guide for how to integrate darwin datasets directly in PyTorch.
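A minimal sketch of that integration, using the darwin.torch.get_dataset call that appears throughout the issues below:

from darwin.torch import get_dataset

# dataset_type also accepts e.g. "instance-segmentation" or "semantic-segmentation"
dataset = get_dataset("example-team/test", dataset_type="classification")
print(dataset)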

darwin-py's Issues

Parsing the time

My personal and empirical evidence shows that line 298 in remote_dataset.py requires Python 3.7.4 to work. My test was with Python 3.6, which failed with ValueError: time data '2020-02-10T11:02:30Z' does not match format '%Y-%m-%dT%H:%M:%S%z'.
Switching to Python 3.7.4 fixed it, so I would suggest changing the required version in setup.py.

Logging errors

When downloading a dataset, the CLI fails silently. We should show the user that the operation failed and why.

KeyError: 'status_code' when importing annotations

Hi, when running the following command

darwin dataset import <dataset-name> coco <coco-file>

I got the following error:

if res["status_code"] != 200:
KeyError: 'status_code'

This means that the key "status_code" doesn't exist in res.

I have currently implemented a workaround for this in darwin-py/darwin/importer/importer.py at line 143:

if "status_code" in res and res["status_code"] != 200:

to check that the key exists before comparing. Please make sure the key is guaranteed to exist in importer.py.

`detectron2_register_dataset` function abnormal behaviour when the release_name parameter is defined

I detected an error when the release_name parameter is defined (it is optional and takes the value "latest" by default).

The DatasetCatalog will register the wrong catalog if the "latest" version differs from the given release_name.

To solve this issue, you need to replace line 103 in darwin/torch/utils.py's detectron2_register_dataset():

classes = get_classes(dataset_path, annotation_type="polygon")

by

classes = get_classes(dataset_path=dataset_path, release_name=release_name, annotation_type="polygon")

upload_file_to_s3()

The upload_file_to_s3 function at line 147 in dataset.py:

  • return value not consistent with declared return type in method signature
  • unresolved reference Client
  • unresolved reference process_response()

Errors in dataset.split()

There are some errors in the dataset.split() function which I want to highlight.

The code I am using is:

client = Client.local()
dataset = client.get_remote_dataset("DATASET_NAME")
dataset.pull()
dataset.split()

First of all, the default value of split()'s val_percentage parameter is 0.1. This results in an error, because the validation of these numbers is percentage-based, so the default should be 10.

Additionally, when the test_percentage parameter is 0 this results in an error, because somewhere in the code the indices of these sets are concatenated, and 0 produces a None instead of a list.

Error resulted from test_percentage=0:

~/miniconda3/envs/birds/lib/python3.9/site-packages/darwin/dataset/split_manager.py in stratified_split(annotation_path, splits, annotation_files, val_percentage, test_percentage, stratified_types, split_seed)
    184         )
    185 
--> 186         stratified_indices = train_indices + val_indices + test_indices
    187         for idx in range(dataset_size):
    188             if idx in stratified_indices:

TypeError: can only concatenate list (not "NoneType") to list
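Until this is fixed, a sketch of a call that sidesteps both problems, passing explicit percentages on the 0-100 scale the validator appears to expect and keeping the test split non-zero:

from darwin.client import Client

client = Client.local()
dataset = client.get_remote_dataset("DATASET_NAME")
dataset.pull()
# percentages rather than fractions, and test_percentage kept above 0
dataset.split(val_percentage=10, test_percentage=10)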

Uploading a COCO format JSON with "attributes" defined for each annotation: the attributes don't appear in the platform

Hi everyone, I'm submitting this issue because I downloaded a coco.json from the platform which had a series of attributes defined for each annotation. Under the "extra" key of each annotation, I have

"extra": {"attributes": ["some-attribute", "some-other-attribute"]}

but after re-uploading the same COCO formatted JSON to another dataset I created with the same images, the previously defined attributes no longer appear in the platform for this new dataset.
All the classes and segmentation masks seem to be uploaded correctly.

When importing annotations, the parser is not created correctly

When I try to import my annotation file (coco format) according to the recipe in the documentation, I get an error.
It happens when creating the parser variable.
Perhaps there is a problem with the formats.supported_formats object.

[two screenshots of the error, dated 2022-01-05]

Object-detection dataset

Hello,

I'm trying to import an object detection dataset per #291 but running into an error:

from darwin.torch import get_dataset
dataset_id = "myorg/dataset"
dataset = get_dataset(dataset_id, dataset_type="object-detection")
print(dataset)

Result:

Error: dataset_type needs to be one of 'classification, instance-segmentation, 
semantic-segmentation'

Based on https://github.com/v7labs/darwin-py/blob/master/darwin/torch/dataset.py#L36-L37 it looks like object-detection needs to be added to the docstring, unless there is some other way it's being validated.

The documentation at https://docs.v7labs.com/docs/loading-a-dataset-in-python could be updated as well to note the object-detection support.

Thanks,
Chris

Also, it would be good to add a test for get_dataset() on each possible type, so that this minor issue is detected automatically.

Validation issues with dataset splitting

Hello,

I'm trying to follow the dataset splitting shown at https://docs.v7labs.com/docs/loading-a-dataset-in-python but running into errors.

First, it looks like the function is expecting a floating point number rather than an integer:

darwin dataset split {dataset_id} --val-percentage 0 --test-percentage 20
> Error: Invalid test percentage (20.0). Must be >= 0 and < 100

So it would be good to update the validation error message to say that it must be between 0 and 1, and to update the documentation page to be consistent with that.

This error message is also misleading for the validation set:

darwin dataset split {dataset_id} --val-percentage 20 --test-percentage 20
> Error: Invalid validation percentage (20.0). Must be >= 0 and < 100

However, if I change 20 to 0.2 I get a different, unclear error message:

darwin dataset split {dataset_id} --val-percentage 0 --test-percentage .20
> Error: test_size=1.0 should be either positive and smaller than the number of 
samples 41 or a float in the (0, 1) range

If I change val-percentage to be 0.2 it does succeed though:

darwin dataset split {dataset_id} --val-percentage 0.2 --test-percentage .2

So it would be helpful to support a validation percentage of 0 or improve the checking to note that it can't be 0.

Thanks,
Chris

Handling of erroneous json annotation

Thank you for putting together such a good tool. I have recently been uploading my own annotations via a custom YOLO .txt to Darwin .json converter. I got this error periodically while trying to upload my dataset.

requests.exceptions.HTTPError: 422 Client Error: Unprocessable Entity for url: https://darwin.v7labs.com/api/dataset_items/267804819/import

I discovered that it was because some of my original annotations had a 0 height or width, which was being (correctly) converted to .json but fell down at the upload stage. Is this something that is worth checking for? If I wanted to try to make a pull request to fix this, where would such an error be handled?
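A pre-upload check along these lines would catch the offending annotations before the API rejects them; a sketch, assuming Darwin JSON bounding boxes carry x/y/w/h keys:

import json
from pathlib import Path

def has_degenerate_boxes(annotation_file: Path) -> bool:
    """Return True if any bounding box in a Darwin JSON file has zero width or height."""
    data = json.loads(annotation_file.read_text())
    for annotation in data.get("annotations", []):
        box = annotation.get("bounding_box")
        if box is not None and (box["w"] <= 0 or box["h"] <= 0):
            return True
    return False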

darwin-py 0.7.11 causing `MissingConfig` issues

When I try using 0.7.11, I am getting an error when I run dataset.pull

It appears to be something in the release.py download_zip code that reads config.yaml. I checked my .darwin folder and I don't see a config.yaml, so I'm sure that's what's causing the issue, but I'm not sure where that file is supposed to come from.

Here's a reproducible version you should be able to use

import darwin; print(darwin.__version__)
from darwin.client import Client

API_KEY = "your_key_here"  # placeholder: your team API key
client = Client.from_api_key(API_KEY)

datasets={d.name:d for d in client.list_remote_datasets()}
dataset=datasets["kevin-tmp"]

release = dataset.get_release()
dataset.pull(release=release)

stack trace in 0.7.11:

---------------------------------------------------------------------------
MissingConfig                             Traceback (most recent call last)
/tmp/ipykernel_277/279924910.py in <module>
     10 
     11 release = dataset.get_release()
---> 12 dataset.pull(release=release)

/opt/conda/lib/python3.7/site-packages/darwin/dataset/remote_dataset.py in pull(self, release, blocking, multi_threaded, only_annotations, force_replace, remove_extra, subset_filter_annotations_function, subset_folder_name, use_folders, video_frames)
    309             tmp_dir = Path(tmp_dir_str)
    310             # Download the release from Darwin
--> 311             zip_file_path = release.download_zip(tmp_dir / "dataset.zip")
    312             with zipfile.ZipFile(zip_file_path) as z:
    313                 # Extract annotations

/opt/conda/lib/python3.7/site-packages/darwin/dataset/release.py in download_zip(self, path)
    195 
    196         config_path: Path = Path.home() / ".darwin" / "config.yaml"
--> 197         client: Client = Client.from_config(config_path=config_path, team_slug=self.team_slug)
    198 
    199         data: Response = client.fetch_binary(self.url)

/opt/conda/lib/python3.7/site-packages/darwin/client.py in from_config(cls, config_path, team_slug)
    836         """
    837         if not config_path.exists():
--> 838             raise MissingConfig()
    839         config = Config(config_path)
    840 

MissingConfig:

Dataset splitting fails due to "least populated class"

Hello,

I've been analyzing a dataset, and after pulling an updated version I can no longer split the dataset using the darwin CLI. I receive this error:

Error: The least populated class in y has only 1 member, which is too few. The 
minimum number of groups for any class cannot be less than 2.

However, none of the classes has only a single member - the lowest is 29. This dataset combines bounding boxes with classification tags, though - is it looking at all combinations of labels? If so, could the error message display which combination is causing the error? Or simply coarsen the stratification?

darwin dataset delete-files returns success when deleting non-existent files

Current behavior

When running darwin dataset delete-files on a dataset for a file that does NOT exist in the dataset, the current behavior is as follows:

darwin dataset delete-files cars mycar.jpg -y
Files successfully deleted!

Expected behavior

I maintain the expected behavior should be:

  • A warning telling the user they referenced a file that does not exist
  • Skip trying to delete the non-existing file
  • If no files were deleted, print a message saying that (probably a warning) instead of a success message.

@andreaazzini and @simedw , do you agree with my expectations?

Uploading Polygons and Tags using Python Library

Context

Hello, I'm working on programmatically uploading a dataset to Darwin that already has both "Classification" and "Polygon" labels, using the Darwin Python Library. I'm using the coco format to upload polygon labels:

parser=dict(formats.supported_formats)['coco']
importer.import_annotations(dataset, parser, annotation_paths)

Issue

The problem here is that my dataset also has tags. Some images are just tagged, some are tagged and have polygons, and some just have polygons. I'm able to upload tags using the csv_tags parser format:

tags_parser=dict(formats.supported_formats)['csv_tags']
importer.import_annotations(dataset, tags_parser, [data_dir/'test_tags.csv'])

The problem is that when I upload tags, my polygon annotations get wiped out. It's like I have a POST but really need a PATCH method here. Is there a workaround? I'm wondering if I can maybe upload tags in the coco format in "one go"?

The dataclasses peer dependency compromises Python 3.7 and Python 3.8 support (AttributeError: module 'typing' has no attribute '_ClassVar')

Overview
There is a known issue that occurs with many packages using dataclasses: installing the dataclasses backport package disrupts the native dataclasses module shipped with Python 3.7+.

Example:
When using Pandas v. 1.2.3 together with darwin-py v. 0.5.15 you will encounter the error AttributeError: module 'typing' has no attribute '_ClassVar' due to the usage of @dataclasses.dataclass in pandas/io/common.

Suggested solution

  • Deploy a breaking change and lose official support for Python 3.6.
  • Add instructions for installing dataclasses 0.6 manually when running on Python 3.6 or below

Workaround
Install the package without deps (pip3 install --no-deps) and then manually install all of darwin-py's missing dependencies.

Downside: A lot of work

How to replicate
I've done this in Docker, as this allows me to run in a clean environment.

  1. Spin up a Buster docker image with Python 3.8
  2. Install darwin-py==0.5.15
  3. Install pandas==1.2.3
  4. Import pandas (as pd)
  5. Run the docker container

OSError: symbolic link privilege not held

When downloading datasets on Windows machines, the user seems to need admin access to the environment in order to create symlinks. I don't think this should really be necessary.

If I comment out lines 252-255 in darwin/dataset/remote_dataset.py, the issue is resolved.

Could you include an option to not use symlinks for Windows users who don't have admin access to the OS?

Response is not JSON encoded

Getting this error when importing annotations to a video dataset in "New" status.

warning, failed to upload annotation to 209824341 {'error': 'Response is not JSON encoded', 'status_code': 500, 'text': 'Something went wrong'}

What could be the issue?

uploading image tags overwrites annotations

I am trying to upload image tags and COCO annotations, so that e.g. multiple images can be grouped by a single tag.
In my use case I have thousands of images which form groups of 6 (XYZ.IJK.jpg, where there are hundreds of XYZ, and 6 IJK).
I want to tag all images which belong to XYZ as "XYZ", and I also need to upload labels to each image.

Currently, if you upload either COCO segmentation annotations (darwin dataset import dataset_name coco coco.json) or csv tags (darwin dataset import dataset_name csv_tags tags.csv), the oldest upload is deleted.
I have modified darwin-py/import/formats/coco.py to simultaneously upload both (see here) but it still overwrites the second lot of annotations.

Is this a bug?
If it is unsupported, I would like to make a feature request to be able to upload both.

Raise exception on failure of upload

It would be nice to get an exception when status_code != 200 in _import_annotations. That way we could catch the exception and retry the upload with backoff. I'm happy to make a PR.

Line 282 in darwin/importer/importer.py

if res.get("status_code") != 200:
    print(f"warning, failed to upload annotation to {id}", res)

download_all_images_from_annotations()

download_all_images_from_annotations() at line 182 in dataset.py

  • unresolved attribute reference on annotations_path.glob(): since annotations_path is declared as type str, it cannot have a glob attribute

Similar to

download_image_from_json_annotation() at line 215 in dataset.py

  • unresolved attribute reference on annotations_path.stem: since annotations_path is declared as type str, it cannot have a stem attribute

darwin dataset comment -h gives incorrect help information

Based on this message:

usage: darwin dataset comment [-h] [--text TEXT] [--x X] [--y Y] [--w W] [--h H] dataset file

positional arguments:
  dataset            [Remote] Dataset name: to list all the existing dataset, run 'darwin dataset remote'.
  file               File to comment

optional arguments:
  -h, --help         show this help message and exit
  --text TEXT        Comment: list of words
  --x X              X coordinate for comment box
  --y Y              Y coordinate for comment box
  --w W, --width W   Comment box width in pixels
  --h H, --height H  Comment box height in pixels

One would assume that the minimal invocation would be darwin dataset comment dataset-demo. However, this is not true, and it will fail:

There was an error posting your comment!

Traceback (most recent call last):
  File "/home/pedro/Workplace/v7/darwin-py/darwin/cli_functions.py", line 1004, in post_comment
    client.post_workflow_comment(workflow_id, text, x, y, w, h)
  File "/home/pedro/Workplace/v7/darwin-py/darwin/client.py", line 745, in post_workflow_comment
    self._post(
  File "/home/pedro/Workplace/v7/darwin-py/darwin/client.py", line 1043, in _post
    response: Response = self._post_raw(endpoint, payload, team_slug, retry)
  File "/home/pedro/Workplace/v7/darwin-py/darwin/client.py", line 1026, in _post_raw
    self._raise_if_known_error(response, urljoin(self.url, endpoint))
  File "/home/pedro/Workplace/v7/darwin-py/darwin/client.py", line 1094, in _raise_if_known_error
    raise ValidationError(body)
darwin.exceptions.ValidationError: {'errors': {'workflow_comments': [{'body': ["can't be blank"]}]}}

This is because --text is effectively mandatory, yet the flag is currently marked as optional.
I maintain this needs to be fixed.

No module named numpy

On a fresh install of darwin-py, one gets the error "No module named numpy"; indeed, numpy is missing from setup.py.

Missed team -> team_slug change in torch get_dataset()

It looks like #287 changed the API for client.list_local_datasets() to expect team_slug rather than team.

darwin/torch/dataset.py defines get_dataset(), and it has a call to list_local_datasets() that still needs to be updated:
https://github.com/v7labs/darwin-py/blob/master/darwin/torch/dataset.py#L61

If possible it would be good to add a test for get_dataset() to detect this minor bug automatically.

Also it would be good to update the docstring for dataset_type in that same file (adding "object-detection"): https://github.com/v7labs/darwin-py/blob/master/darwin/torch/dataset.py#L37

Upload video with non-integer native frame rate

I would like to upload a video with a non-integer (e.g. 29.97) frame rate. In the web interface, there is an option to use the native frame rate, which allows you to annotate every frame in the video. Using the Python API, you can specify the frame rate with the fps argument; however, fps is supposed to be an integer according to the function definition.

def push(
    self,
    files_to_upload: Optional[List[Union[PathLike, LocalFile]]],
    *,
    blocking: bool = True,
    multi_threaded: bool = True,
    fps: int = 0,
    as_frames: bool = False,
    files_to_exclude: Optional[List[PathLike]] = None,
    path: Optional[str] = None,
    preserve_folders: bool = False,
    progress_callback: Optional[ProgressCallback] = None,
    file_upload_callback: Optional[FileUploadCallback] = None,
)

I would like there to be an option to use the native frame rate from the API.
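For illustration, the call the reporter would like to make (hypothetical until fps accepts floats):

# desired: pass the native, non-integer frame rate; currently rejected by the int type hint
dataset.push(["video.mov"], fps=29.97)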

TFRecord or H5 export formats

Hi, is there a way to export the annotated datasets in TFRecord or H5 formats?

Please comment on importing such formats as well.

Thanks

Unused dependencies

Hi

There are some unused dependencies in your setup.py, for example scikit-learn.

This makes darwin-py harder to install, since some of these libraries have binary dependencies, which affects macOS installs and local debugging.

If there is no plan to use these libraries, could you please remove them?

Checking for dependencies

Actually, I don't think we should add torch/torchvision here.
My reasoning is that you can use our CLI without ever needing torch installed, so why force it.
I also want us to add support for TensorFlow (darwin.tf) in the future, but we shouldn't force the user to install both TF and PyTorch.

It's better if we add a check in darwin/torch/__init__.py that complains if torch is not available.
That is also a reason not to include torch in darwin/__init__.py.

Originally posted by @simedw in https://github.com/_render_node/MDIzOlB1bGxSZXF1ZXN0UmV2aWV3VGhyZWFkMjA0OTc0ODU0OnYy/pull_request_review_threads/discussion

__mul__ overloading

If I understand this correctly, the __mul__ method will allow writing something like dataset * 0.2 to create a validation split. While this is attractive from an elegance point of view, it exposes us to a pitfall: reinsertion.
In fact, the way indices are selected here does not prevent the next split from selecting the same samples as a previous call to this method, enabling what is known as cross-contamination between train/val/test.

Here, in the method split_darwin_dataset(), I implemented the splitting in a different way that avoids this problem. Maybe we can bring it over?

Here, again in split_darwin_dataset() (it's being overridden), I implemented the same thing but with stratified splitting, i.e. taking into account the current distribution of classes and ensuring that the final distribution in all splits is the same. Clearly this is more situational and maybe should be implemented for classification only, since in raw datasets we don't have self.classes (it's None).

Originally posted by @Renthal in #10

fetch_remote_files not working for spaces (or maybe commas) in filenames

I'm pretty sure there's an issue with the fetch_remote_files() method on the RemoteDataset class (or, more likely, in the actual API call) with either spaces or commas inside filenames.

To reproduce: upload two files to a dataset with filenames Video Sep 09, 10 02 59 AM.mov and 5897.mp4 respectively.

dataset.fetch_remote_files({"filenames": ["5897.mp4"]}) returns results, whereas dataset.fetch_remote_files({"filenames": ["Video Sep 09, 10 02 59 AM.mov"]}) does not.

ValueError: Multiple tags defined for this image

I'm trying to load a dataset on torch using the following code (darwin.__version__ = '0.5.20'). I've tried installing the most up-to-date version in the conda environment I'm using but was not able to do so.

from darwin.torch import get_dataset
dataset_id = dataset_name
dataset = get_dataset(dataset_id, dataset_type="classification")

I'm getting the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/suporte/anaconda3/envs/dino-paws/lib/python3.6/site-packages/darwin/torch/dataset.py", line 99, in __getitem__
    target = self.get_target(index)
  File "/home/suporte/anaconda3/envs/dino-paws/lib/python3.6/site-packages/darwin/torch/dataset.py", line 112, in get_target
    raise ValueError(f"Multiple tags defined for this image ({tags}). This is not supported at the moment.")

But apparently there is support for multi-label classification (def check_if_multi_label(self) -> None:) in the current version. Any suggestions on the easiest way to solve this? I'm trying to install darwin==0.7.0 using pip in the anaconda env, but I am also getting some errors:

  File "<stdin>", line 1, in <module>
  File "/home/suporte/anaconda3/envs/dino-paws/lib/python3.6/site-packages/darwin/__init__.py", line 4, in <module>
    from .client import Client  # noqa
  File "/home/suporte/anaconda3/envs/dino-paws/lib/python3.6/site-packages/darwin/client.py", line 23, in <module>
    class Client:
  File "/home/suporte/anaconda3/envs/dino-paws/lib/python3.6/site-packages/darwin/client.py", line 34, in Client
    ) -> Union[Any, requests.Response]:
AttributeError: module 'requests' has no attribute 'Response'

No function named get_dataset

Looking at your torch integration API, I don't see a function called get_dataset(), which the PyTorch integration README suggests we use. It was supposed to be in datasets.py, but it isn't in that file. Is that function missing?

"does not have a corresponding image" error

Hi! When running the following code:

get_dataset(dataset_id, dataset_type="classification", partition="test", split='split_v0_t20', transform=val_transform)

I keep getting this error message:

ValueError: Annotation (/home/suporte/.darwin/datasets/--/--/releases/--/annotations/.json) does not have a corresponding image

All the images from the dataset are located in the local "images" folder, and I can successfully run the previous code for partitions = "train" and "val". I only get the error message for "test". What could be causing this issue?

Thanks!

I could not import COCO format JSON; it seems that coco.py has a bug

I found a bug in coco.py, which is used for importing JSON files in COCO format.
Around line 90, when the make_polygon function is called, there is no bounding_box instance.
So I added some code to build a dict with the bounding box information ('x', 'y', 'w', 'h'), and it seems to work.
I'd be glad for you to refer to this; please check it.

[screenshot of the proposed patch]
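Since the patch itself is only visible in the screenshot, here is a hypothetical reconstruction of the idea it describes, deriving the missing bounding box from the polygon's path:

def bbox_from_path(path):
    """Hypothetical sketch: build the 'x'/'y'/'w'/'h' dict from a polygon path."""
    xs = [point["x"] for point in path]
    ys = [point["y"] for point in path]
    return {"x": min(xs), "y": min(ys), "w": max(xs) - min(xs), "h": max(ys) - min(ys)}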

darwin.importer.formats.supported_formats cannot be converted to a dict anymore

Hi,

I updated darwin-py from 0.7.1 to 0.7.4.

from darwin.importer import formats
format_name = 'darwin'
parser = dict(formats.supported_formats)[format_name]

In 0.7.1 this code worked fine, but in 0.7.4 it gives the following error:

Exception has occurred: ValueError
dictionary update sequence element #0 has length 4; 2 is required

formats.supported_formats changed from a list of tuples to a plain list; for me this was a breaking change.

Would you suggest that I change my code, or are you planning to revert this change?

remote_dataset.export returns exception when sending annotation_class_ids

Current behavior

When running dataset.export(name=release_name, annotation_class_ids=annotation_classes, include_url_token=True), where annotation_classes is a list of strings with the name of the class ["ClassName"]

I get:
*** requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: https://darwin.v7labs.com/api/datasets/{dataset_id}/exports

I can see on darwin.v7labs.com that the release was created and is "generating", but it never finishes (I've been running one since yesterday).

I also just tried generating a release manually through the website, selecting a class, and the same behavior happens. So presumably this is an issue with Darwin's API and not with darwin-py (I didn't see a repository in which to report it).

Expected behavior

It should create a release with only the selected classes.


I also tried passing the numerical string I get from dataset.fetch_remote_classes()[0]['id'] both in integer format and string format.

In both cases, it threw no exception, but the release was created with all the classes associated with the dataset (rather than only the one provided).

Importing annotations on DICOMs requires VideoAnnotation format

I pushed some DICOM images into Darwin and was trying to import externally produced polygon annotations using the recipe: Import Annotations to V7

Since these are single frame images (mammograms), I first tried to import polygon annotations using ImageAnnotation in Darwin JSON Format. The result I got back was:

warning, failed to upload annotation to 234038087 {'error': 'Response is not JSON encoded', 'status_code': 500, 'text': 'Something went wrong'}

With a little hackery, I was able to collect the following:

Client get request response (Something went wrong) with unexpected status (500).
Client: (Client(default_team=volpara-science))
Request: (endpoint=/dataset_items/234038081/import, payload={'annotations': [{'annotation_class_id': 105926, 'data': {'polygon': {'path': [{'x': 621.4909206, 'y': 0.0}, {'x': 517.0527928500001, 'y': 114.2340069336}, {'x': 510.2191908, 'y': 140.87745338}, {'x': 517.0527928500001, 'y': 144.67379398}, {'x': 531.617465808, 'y': 141.2915814004}, {'x': 615.4855541999999, 'y': 92.56033072}, {'x': 656.2115906939999, 'y': 56.8056493992}, {'x': 661.7337276000001, 'y': 37.4097750222}, {'x': 654.0717942, 'y': 6.55585699614}]}}}]})
[ERROR 500] Something went wrong

I was able to import the polygon annotations by switching to VideoAnnotation format, which I suppose is fair enough since DICOM supports multiframe images.

Issues:

  • document that DICOMs must be annotated with VideoAnnotation.
  • fix server-side error handling to return better error messages rather than 500 "Something went wrong"
  • fix client-side error handling to show response.text if not response.json()

I hope this feedback helps and thanks for a very handy product! 😀

darwin dataset set-file-status will not complain if the given file does not exist.

The command:

darwin dataset set-file-status dataset-demo new 134.tifffdfsdfsfsfd

Will give the same output as:

darwin dataset set-file-status dataset-demo clear 134.tif

The first one is clearly wrong and tries to act upon a file that does not exist. The second command works.
We need to fix this bug so a message is printed when a file that does not exist is detected.

get_dataset -- Add client configuration directory arg to parameter list

Problem:

The get_dataset function is hardwired, via the _load_client() function, to load its config.yml file from a central location (e.g. $HOME/.darwin/config.yml), which forces all consumers/users/projects for a specific user account to share the same configuration. However, there are use cases where a dedicated Darwin configuration file (one per project) would be better suited, allowing each project to control its own configuration needs (i.e. default team, API key, location of the dataset directory).

Example:

projects
├── project-a
│   └── darwin-config.yml
├── project-b
│   └── darwin-config.yml
└── project-c
    └── darwin-config.yml

Solution:

Add an optional config_dir parameter to the get_dataset function and plumb it into the _load_client function (also as optional), allowing _load_client either to use the passed-in config_dir to locate its Darwin config.yml configuration file, or to seek out this file in its default location of $HOME/.darwin/config.yml.

sample-implementation.zip
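A sketch of the proposed plumbing (config_dir is hypothetical; Client.from_config is the existing constructor visible in the MissingConfig traceback above):

from pathlib import Path
from typing import Optional

from darwin.client import Client

def load_client(config_dir: Optional[Path] = None) -> Client:
    """Hypothetical: load a Client from a per-project config directory,
    falling back to the default $HOME/.darwin location."""
    config_path = (config_dir or Path.home() / ".darwin") / "config.yaml"
    return Client.from_config(config_path=config_path)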

Dataset pull raises OSError: [Errno 39] Directory not empty in some cases

We are scheduling lots of training runs on different Linux machines. Each training run starts with a dataset pull via darwin-py.
In most cases, this works fine.

However, there are some rare cases where darwin fails to pull the latest version:

File "/root/.clearml/venvs-builds/3.8/task_repository/swarm-dnn.git/dataloader/preloader/darwin_loader.py", line 39, in preload_dataset
ds.pull(release=release)
File "/root/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/darwin/dataset/remote_dataset.py", line 235, in pull
shutil.rmtree(annotations_dir)
File "/root/.clearml/venvs-builds/3.8/lib/python3.8/shutil.py", line 719, in rmtree
onerror(os.rmdir, path, sys.exc_info())
File "/root/.clearml/venvs-builds/3.8/lib/python3.8/shutil.py", line 717, in rmtree
os.rmdir(path)
OSError: [Errno 39] Directory not empty: '/home/swarm/.darwin/swarm-analytics/licenseplate_ocr/releases/2021-04-08/annotations'

From my point of view, darwin should sync the dataset version from the server in this step. This is why cleaning this folder manually is not helpful (it would need to download everything again).

Please let me know if you need further information.

ValueError: invalid literal for int() with base 10: '350.354212'

Hi Guys,

Thanks for a great tool!

When trying to upload bounding boxes from the Pascal VOC file format, I'm getting this error. It happens because you expect the min/max values to parse directly as integers, while the file may contain floating point values.

Based on this answer, my suggestion is to change lines 23-26 in darwin/importer/formats/pascalvoc.py

xmin = int(bndbox.find("xmin").text)
xmax = int(bndbox.find("xmax").text)
ymin = int(bndbox.find("ymin").text)
ymax = int(bndbox.find("ymax").text)

to

xmin = int(float(bndbox.find("xmin").text))
xmax = int(float(bndbox.find("xmax").text))
ymin = int(float(bndbox.find("ymin").text))
ymax = int(float(bndbox.find("ymax").text))

So float-valued strings are handled as well.

Have a great day!
