tensorflow / datasets

TFDS is a collection of datasets ready to use with TensorFlow, Jax, ...

Home Page: https://www.tensorflow.org/datasets

License: Apache License 2.0

Shell 0.06% Python 98.34% Smalltalk 0.34% Perl 0.01% Gherkin 0.01% Roff 0.36% NewLisp 0.23% Ruby 0.42% JavaScript 0.22% TeX 0.01%

tensorflow machine-learning data datasets numpy jax dataset

datasets's Introduction

TensorFlow is an end-to-end open source platform for machine learning. It has a comprehensive, flexible ecosystem of tools, libraries, and community resources that lets researchers push the state-of-the-art in ML and developers easily build and deploy ML-powered applications.

TensorFlow was originally developed by researchers and engineers working within the Machine Intelligence team at Google Brain to conduct research in machine learning and neural networks. However, the framework is versatile enough to be used in other areas as well.

TensorFlow provides stable Python and C++ APIs, as well as a non-guaranteed backward compatible API for other languages.

Keep up-to-date with release announcements and security updates by subscribing to the TensorFlow announcements mailing list. See all the mailing lists.

Install

See the TensorFlow install guide for the pip package, to enable GPU support, use a Docker container, and build from source.

To install the current release, which includes support for CUDA-enabled GPU cards (Ubuntu and Windows):

$ pip install tensorflow

Other devices (DirectX and MacOS-metal) are supported using Device plugins.

A smaller CPU-only package is also available:

$ pip install tensorflow-cpu

To update TensorFlow to the latest version, add --upgrade flag to the above commands.

Nightly binaries are available for testing using the tf-nightly and tf-nightly-cpu packages on PyPi.

Try your first TensorFlow program

$ python

>>> import tensorflow as tf
>>> tf.add(1, 2).numpy()
3
>>> hello = tf.constant('Hello, TensorFlow!')
>>> hello.numpy()
b'Hello, TensorFlow!'

For more examples, see the TensorFlow tutorials.

Contribution guidelines

If you want to contribute to TensorFlow, be sure to review the contribution guidelines. This project adheres to TensorFlow's code of conduct. By participating, you are expected to uphold this code.

We use GitHub issues for tracking requests and bugs. Please see the TensorFlow Forum for general questions and discussion, and direct specific questions to Stack Overflow.

The TensorFlow project strives to abide by generally accepted best practices in open-source software development.

Patching guidelines

Follow these steps to patch a specific version of TensorFlow, for example, to apply fixes to bugs or security vulnerabilities:

1. Clone the TensorFlow repo and switch to the corresponding branch for your desired TensorFlow version, for example, branch r2.8 for version 2.8.
2. Apply (that is, cherry-pick) the desired changes and resolve any code conflicts.
3. Run TensorFlow tests and ensure they pass.
4. Build the TensorFlow pip package from source.

Continuous build status

You can find more community-supported platforms and configurations in the TensorFlow SIG Build community builds table.

Official Builds

Build Type	Status	Artifacts
Linux CPU		PyPI
Linux GPU		PyPI
Linux XLA		TBA
macOS		PyPI
Windows CPU		PyPI
Windows GPU		PyPI
Android		Download
Raspberry Pi 0 and 1		Py3
Raspberry Pi 2 and 3		Py3
Libtensorflow MacOS CPU	Status Temporarily Unavailable	Nightly Binary Official GCS
Libtensorflow Linux CPU	Status Temporarily Unavailable	Nightly Binary Official GCS
Libtensorflow Linux GPU	Status Temporarily Unavailable	Nightly Binary Official GCS
Libtensorflow Windows CPU	Status Temporarily Unavailable	Nightly Binary Official GCS
Libtensorflow Windows GPU	Status Temporarily Unavailable	Nightly Binary Official GCS

Resources

Learn more about the TensorFlow community and how to contribute.

License

Apache License 2.0

datasets's Issues

[data request] WikiText-103

Name of dataset: WikiText-103
URL of dataset: https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/
License of dataset: CC BY-SA 3.0 Unported
Short description of dataset and use case(s): The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia.

Folks who would also like to see this dataset in tensorflow/datasets, please +1/thumbs-up so the developers can know which requests to prioritize.

[data request] Omniglot

Name of dataset: Omniglot
URL of dataset: https://github.com/brendenlake/omniglot
License of dataset: MIT
Short description of dataset and use case(s): Image dataset for few-shot image classification and meta-learning

Folks who would also like to see this dataset in tensorflow/datasets, please +1/thumbs-up so the developers can know which requests to prioritize.

[data request] Horses and Zebras

Name of dataset: Horses and Zebras
URL of dataset: https://github.com/junyanz/CycleGAN/blob/master/datasets/download_dataset.sh
License of dataset:
Short description of dataset and use case(s): Image-to-image translation examples used in CycleGAN.

Folks who would also like to see this dataset in tensorflow/datasets, please +1/thumbs-up so the developers can know which requests to prioritize.

XL dataset generation with Apache Beam

This is a tracking bug for extra-large dataset generation using Apache Beam (i.e. for datasets that cannot feasibly be generated within a day on a single machine). Follow it to be notified of updates on this support.

two files of same name but different capitalization

You have a file called Text.md and one called text.md in the same directory datasets/docs/api_docs/python/tfds/features/.

This is causing an issue in filesystems that ignore capitalization in filenames. (My filesystem overwrote one of the files with the other, and now git shows unstaged changes/modification.)

Could you resolve the conflict between the two files?
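
To spot this kind of clash before it reaches a case-insensitive filesystem, a check along the following lines could be run over the repository's file names. This is a minimal sketch: the helper names `group_case_collisions` and `find_case_collisions` are made up for illustration and are not part of TFDS.

```python
import os
from collections import defaultdict


def group_case_collisions(names):
    """Group names that become identical once letter case is ignored."""
    by_folded = defaultdict(list)
    for name in names:
        by_folded[name.casefold()].append(name)
    # Keep only the groups that really contain more than one spelling.
    return [sorted(group) for group in by_folded.values() if len(group) > 1]


def find_case_collisions(root):
    """Walk `root` and report colliding entries, directory by directory."""
    collisions = []
    for dirpath, dirnames, filenames in os.walk(root):
        for group in group_case_collisions(dirnames + filenames):
            collisions.append([os.path.join(dirpath, name) for name in group])
    return collisions


print(group_case_collisions(["Text.md", "text.md", "image.md"]))
```

On a filesystem that has already merged the two files, the directory walk cannot see both names, so the name list would have to come from the repository index instead (for example the output of `git ls-files`) and be passed to `group_case_collisions` directly.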

Import error on Windows

ModuleNotFoundError: No module named 'cPickle'
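
`ModuleNotFoundError` only exists on Python 3, where the `cPickle` module was folded into `pickle`. Assuming the failing code imports `cPickle` directly, a common workaround is a guarded import like this sketch (it is not taken from the TFDS codebase):

```python
try:
    import cPickle as pickle  # Python 2: C implementation of pickle
except ImportError:  # ModuleNotFoundError on Python 3 is a subclass of ImportError
    import pickle  # Python 3: the C accelerator is used automatically

# Round-trip a small object to show that both branches expose the same API.
payload = pickle.dumps({"split": "train"})
restored = pickle.loads(payload)
print(restored)
```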

Document `lazy_imports` in add a dataset doc

Make adding a dataset doc more prominent on README

[data request] Downsampled ImageNet

Name of dataset: Downsampled ImageNet
URL of dataset: http://image-net.org/small/download.php
License of dataset: Same as ImageNet
Short description of dataset and use case(s): great for prototyping and quick experimentation beyond MNIST.

Folks who would also like to see this dataset in tensorflow/datasets, please +1/thumbs-up so the developers can know which requests to prioritize.

GCS access doesn’t work from Colab

GCS access doesn’t work when using colab.research.google.com (tfds.load("mnist") hangs). This seems to be due to tf.io.gfile rather than TFDS: TFDS is trying to access the dataset_info.json file from GCS.

TensorFlow or Colab should fix this.

For now, a possible alternative is to use requests to access the GCS files through the http API: http://storage.googleapis.com/tfds-data/dataset_info/mnist/1.0.0/dataset_info.json.
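
A rough sketch of that workaround is shown below. It uses the standard library's `urllib` in place of `requests` so that it has no extra dependencies, and it assumes that other datasets follow the same `dataset_info/<name>/<version>/` layout as the MNIST URL above, which this issue does not confirm. The function names are illustrative only.

```python
import json
import urllib.request

# Root taken from the MNIST URL quoted above; the layout below it is an assumption.
GCS_HTTP_ROOT = "http://storage.googleapis.com/tfds-data/dataset_info"


def dataset_info_url(name, version):
    """Build the plain-HTTP URL of a dataset's dataset_info.json file."""
    return "{}/{}/{}/dataset_info.json".format(GCS_HTTP_ROOT, name, version)


def fetch_dataset_info(name, version, timeout=10.0):
    """Download and parse dataset_info.json over HTTP, bypassing tf.io.gfile."""
    url = dataset_info_url(name, version)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


print(dataset_info_url("mnist", "1.0.0"))
```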

Simple MNIST model training example

Is there a simple example showing how to import MNIST and train a simple neural network to make inferences on the data? It would help to see how to use this library in an end-to-end manner. Right now, I know how to import a dataset but not how to actually train a model with it.

I'm a lot more used to working with numpy datasets that I can feed directly into a TensorFlow feed dict.
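
For illustration, an end-to-end run could look roughly like the sketch below. It assumes a TensorFlow 2.x installation with `tf.keras` and a `tensorflow_datasets` version whose `tfds.load` accepts `as_supervised=True` and a list of splits; the model architecture and hyperparameters are arbitrary choices, not a recommendation from the TFDS authors.

```python
def train_mnist(epochs=1, batch_size=128):
    """Load MNIST with TFDS, train a small dense network, and evaluate it."""
    # Imported lazily so this file can be imported without TensorFlow installed.
    import tensorflow as tf
    import tensorflow_datasets as tfds

    def normalize(image, label):
        # Images arrive as uint8 in [0, 255]; scale them to float32 in [0, 1].
        return tf.cast(image, tf.float32) / 255.0, label

    # as_supervised=True yields (image, label) tuples instead of dictionaries.
    train_ds, test_ds = tfds.load(
        "mnist", split=["train", "test"], as_supervised=True)
    train_ds = train_ds.map(normalize).shuffle(10_000).batch(batch_size)
    test_ds = test_ds.map(normalize).batch(batch_size)

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(28, 28, 1)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),  # one logit per digit class
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )
    model.fit(train_ds, epochs=epochs)
    return model, model.evaluate(test_ds)


if __name__ == "__main__":
    train_mnist()
```

For NumPy-style workflows, `tfds.as_numpy` converts a dataset into an iterable of NumPy arrays.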

[data request] UCF101

Name of dataset: UCF101
URL of dataset: http://crcv.ucf.edu/data/UCF101.php
License of dataset: unknown
Short description of dataset and use case(s):
UCF101 is an action recognition data set of realistic action videos, collected from YouTube, having 101 action categories. This data set is an extension of UCF50 data set which has 50 action categories.

Folks who would also like to see this dataset in tensorflow/datasets, please +1/thumbs-up so the developers can know which requests to prioritize.

[Question] - Would it be okay to generate tfrecords from downloaded/pre-downloaded dataset ?

Hi,

Thanks for this package; it looks really good.

I have a question about the recommended workflow for the following scenario:

Let's say that there is a very large database (20+ GB) of images, and access is password protected. In the TensorFlow ecosystem it is recommended to use tfrecords to consolidate the labels and data and to speed up training, so this data needs to be converted to tfrecords.

Since it is 20+ GB and password protected, I assume that automatic download would not be recommended, which is why this library provides a way to specify the folder.

However, since the original dataset needs to be converted to tfrecords first:

Would you suggest that the conversion (to tfrecords) be done as part of this library (in subclasses), or should it be a separate step (equivalent to manually downloading the images)?

Regards
Kapil

Split dataset between workers

For large datasets being processed on many workers, it is useful to be able to read separate shards of the dataset on each worker. Is this possible with the tfds API?

I experimented a bit with the Split.subsplit functionality, but it looks like it works by reading every dataset element and masking out selected elements (see here). This means that every worker ends up reading the whole dataset, which can be costly. In particular, this makes it impossible to use tf.data.Dataset.cache.
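
One way to avoid that cost is to split at the file level, so that each worker only opens its own subset of the shard files. The sketch below shows the idea in plain Python; `files_for_worker` and the shard file names are hypothetical and do not reflect how TFDS names or exposes its files.

```python
def files_for_worker(shard_files, num_workers, worker_index):
    """Return the shard files that a single worker should read."""
    if not 0 <= worker_index < num_workers:
        raise ValueError("worker_index must be in [0, num_workers)")
    # Sorting makes the assignment deterministic on every worker; striding
    # then gives each worker a disjoint, roughly equal share of the files.
    return sorted(shard_files)[worker_index::num_workers]


# Hypothetical shard names, for illustration only.
shards = ["train-%05d-of-00004" % i for i in range(4)]
print(files_for_worker(shards, num_workers=2, worker_index=0))
print(files_for_worker(shards, num_workers=2, worker_index=1))
```

With `tf.data`, the same effect can be obtained by calling `Dataset.shard(num_shards, index)` on the dataset of file names before the files are read, rather than on the decoded examples.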

Re-enable notebook tests

Jupyter does not seem to be respecting the virtualenv. Update the test script to make the notebook respect the virtualenv.

Results here seem relevant: https://www.google.com/search?q=jupyter+running+in+virtualenv
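
As a first debugging step, a cell like the following could be run inside the notebook to see which interpreter the kernel actually uses. This is only a diagnostic sketch; it does not fix the test script itself.

```python
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a virtual environment."""
    # venv and recent virtualenv releases make sys.prefix differ from
    # sys.base_prefix; older virtualenv releases set sys.real_prefix instead.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return hasattr(sys, "real_prefix") or sys.prefix != base_prefix


print("interpreter:", sys.executable)
print("inside a virtualenv:", in_virtualenv())
```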

error on pip install

I'm receiving the following error after running pip install tensorflow-datasets:

Could not find a version that satisfies the requirement tensorflow-datasets (from versions: )
No matching distribution found for tensorflow-datasets

[data request] ADE20k

Name of dataset: ADE20k
URL of dataset: http://groups.csail.mit.edu/vision/datasets/ADE20K/
License of dataset: ?? (not specified on website)
Short description of dataset and use case(s): dense stuff, things, and parts annotations over 20k images.

Folks who would also like to see this dataset in tensorflow/datasets, please +1/thumbs-up so the developers can know which requests to prioritize.

[data request] ModaNet

Name of dataset: ModaNet
URL of dataset: https://github.com/eBay/modanet
License of dataset: Creative Commons Attribution-NonCommercial license 4.0.
Short description of dataset and use case(s):
ModaNet is a street fashion images dataset consisting of annotations related to RGB images. ModaNet provides multiple polygon annotations for each image. This dataset is described in a technical paper with the title ModaNet: A Large-Scale Street Fashion Dataset with Polygon Annotations. Each polygon is associated with a label from 13 meta fashion categories. The annotations are based on images in the PaperDoll image set, which has only a few hundred images annotated by the superpixel-based tool. The contribution of ModaNet is to provide new and extra polygon annotations for the images.

Folks who would also like to see this dataset in tensorflow/datasets, please +1/thumbs-up so the developers can know which requests to prioritize.

[data request] MultiNLI

Name of dataset: MultiNLI
URL of dataset: https://www.nyu.edu/projects/bowman/multinli/
License of dataset:
License
See details in the data description paper: https://www.nyu.edu/projects/bowman/multinli/paper.pdf
Short description of dataset and use case(s):

The Multi-Genre Natural Language Inference (MultiNLI) corpus is a crowd-sourced collection of 433k sentence pairs annotated with textual entailment information. The corpus is modeled on the SNLI corpus, but differs in that it covers a range of genres of spoken and written text, and supports a distinctive cross-genre generalization evaluation. The corpus served as the basis for the shared task of the RepEval 2017 Workshop at EMNLP in Copenhagen.

Folks who would also like to see this dataset in tensorflow/datasets, please +1/thumbs-up so the developers can know which requests to prioritize.

[data request] COCO 2018

Name of dataset: COCO 2018
URL of dataset: http://cocodataset.org/#panoptic-2018
License of dataset: CC 4.0 see http://cocodataset.org/#termsofuse
Short description of dataset and use case(s): Thing and stuff annotations over Coco images

Folks who would also like to see this dataset in tensorflow/datasets, please +1/thumbs-up so the developers can know which requests to prioritize.

Example in README not working

The example given in the README is not implemented yet:
AttributeError: module 'tensorflow_datasets' has no attribute 'load'

[data request] GLUE Datasets

Name of dataset: GLUE Datasets
URL of dataset: https://gluebenchmark.com/tasks
License of dataset:
Short description of dataset and use case(s): The General Language Understanding Evaluation (GLUE) benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems.

Folks who would also like to see this dataset in tensorflow/datasets, please +1/thumbs-up so the developers can know which requests to prioritize.

[data request] YouTube 8M

Name of dataset: YouTube 8M
URL of dataset: https://research.google.com/youtube8m/
License of dataset: Creative Commons Attribution 4.0 International (CC BY 4.0) license.
Short description of dataset and use case(s): YouTube-8M is a large-scale labeled video dataset that consists of millions of YouTube video IDs, with high-quality machine-generated annotations from a diverse vocabulary of 3,800+ visual entities. Can be used for large-scale video understanding, representation learning, noisy data modeling, transfer learning, and domain adaptation approaches for video, and multi-task learning.

Folks who would also like to see this dataset in tensorflow/datasets, please +1/thumbs-up so the developers can know which requests to prioritize.

[data request] Amazon product data

Name of dataset: Amazon product data
URL of dataset: http://jmcauley.ucsd.edu/data/amazon/
Short description of dataset and use case(s): Dataset contains product reviews and metadata from Amazon.

Folks who would also like to see this dataset in tensorflow/datasets, please +1/thumbs-up so the developers can know which requests to prioritize.

[data request] Flickr-Faces-HQ

Name of dataset: Flickr-Faces-HQ
URL of dataset: https://github.com/NVlabs/ffhq-dataset
License of dataset: Dataset itself Creative Commons BY-NC-SA 4.0; the individual images were published in Flickr by their respective authors under either Creative Commons BY 2.0, Creative Commons BY-NC 2.0, Public Domain Mark 1.0, Public Domain CC0 1.0, or U.S. Government Works license.
Short description of dataset and use case(s):
The dataset consists of 70,000 high-quality PNG images at 1024×1024 resolution and contains considerable variation in terms of age, ethnicity and image background. It also has good coverage of accessories such as eyeglasses, sunglasses, hats, etc.
The dataset is suitable for GAN research, face attributes, etc.

Folks who would also like to see this dataset in tensorflow/datasets, please +1/thumbs-up so the developers can know which requests to prioritize.

Support TF 2.0

Are there any plans to support TensorFlow 2? Since Datasets has not been released yet and TF2 will hopefully be released in the coming months, supporting it would make sense.

What do you think?

Currently I'm running into issues with the contrib part:

# Flatten
--> 127   flat_ds = tf.contrib.framework.nest.flatten(nested_ds)
    128   flat_np = []

Spurious logging from GCS access

Short description
Spurious logging from GCS access when using tfds.load. From metadata file access internally.

Environment information

Operating System: Ubuntu 16
Python version: Python 3
tensorflow-datasets/tfds-nightly version: tfds-nightly
tensorflow/tensorflow-gpu/tf-nightly/tf-nightly-gpu version: tf-nightly

Reproduction instructions

import tensorflow as tf
tf.io.gfile.exists("gs://tfds-data")

Link to logs

2019-02-03 02:03:50.095060: I tensorflow/core/platform/cloud/retrying_utils.cc:73] The operation failed and will be automatically retried in 1.92628 seconds (attempt 10 out of 10), caused by: Unavailable: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Couldn't resolve host 'metadata'
2019-02-03 02:03:52.022467: W tensorflow/core/platform/cloud/google_auth_provider.cc:157] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "Not found: Could not locate the credentials file.". Retrieving token from GCE failed with "Aborted: All 10 retry attempts failed. The last failure: Unavailable: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Couldn't resolve host 'metadata'".

Additional context

Clearly a problem with TensorFlow. But would be nice to not have these logs dumped (10 retries). They go away on subsequent access to GCS. Seems to be just on first access. And nothing crashes or breaks, just wait for the 10 retries to be done and move on. Annoying.

One alternative for now may be to use requests to use the HTTP API for GCS access to the TFDS bucket (similar to #36).
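
Another stopgap might be to raise TensorFlow's C++ log threshold before the first import. The log lines above carry the `I` and `W` severity markers, so a threshold of `2` should hide them, although that is an assumption and has not been checked against this exact code path. It only hides the messages; the wait for the 10 retries remains.

```python
import os

# Must be set before TensorFlow is imported for the first time.
# 0 = show everything, 1 = hide INFO, 2 = hide INFO and WARNING, 3 = also hide ERROR.
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"


def gcs_bucket_exists(path="gs://tfds-data"):
    """Run the reproduction call from above with the quieter log threshold."""
    import tensorflow as tf  # imported after the environment variable is set
    return tf.io.gfile.exists(path)
```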

[data request] HMDB51

Name of dataset: HMDB51
URL of dataset: http://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database/
License of dataset: Creative Commons Attribution 4.0 International License.
Short description of dataset and use case(s): Action Recognition Dataset containing Videos. It is a widely used Benchmark with UCF101 #21

Folks who would also like to see this dataset in tensorflow/datasets, please +1/thumbs-up so the developers can know which requests to prioritize.

Fix `sequence_feature_test` and `open_images_test` in TF 2.0

Short description

With TF 2.0, there's a new Defun implementation that breaks sequence_feature_test.py and open_images_test.py. Currently these are disabled.

Environment information

Operating System: Ubuntu 14.04
Python version: Python 2.7 and Python 3.6
tensorflow-datasets/tfds-nightly version: HEAD
tensorflow/tensorflow-gpu/tf-nightly/tf-nightly-gpu version: tf-nightly-2.0-preview

Reproduction instructions

See Travis failure: https://travis-ci.org/tensorflow/datasets/jobs/491913842

Link to logs

E   tensorflow.python.framework.errors_impl.InvalidArgumentError: Tried to stack elements of an empty list with non-fully-defined element_shape: <unknown>
E   	 [[{{node sequence_decode/TensorArrayV2Stack/TensorListStack}}]] [Op:IteratorGetNextSync]

[data request] SNLI

Name of dataset: SNLI
URL of dataset: https://nlp.stanford.edu/projects/snli/
License of dataset: Creative Commons Attribution-ShareAlike 4.0 International License
Short description of dataset and use case(s):
The SNLI corpus (version 1.0) is a collection of 570k human-written English sentence pairs manually labeled for balanced classification with the labels entailment, contradiction, and neutral, supporting the task of natural language inference (NLI), also known as recognizing textual entailment (RTE)

Folks who would also like to see this dataset in tensorflow/datasets, please +1/thumbs-up so the developers can know which requests to prioritize.

[data request] bAbI Question Answering

Name of dataset: bAbI Question Answering
URL of dataset: https://research.fb.com/downloads/babi/
License of dataset: CC BY 3.0
Short description of dataset and use case(s): question answering dataset for automatic text understanding and reasoning

Folks who would also like to see this dataset in tensorflow/datasets, please +1/thumbs-up so the developers can know which requests to prioritize.

Would the team accept a JS version, published in NPM (instead of PyPi)?

Is your feature request related to a problem? Please describe.
Ideally, the datasets API would be available cross language, like Keras or TensorFlow. Many TF learners are coming to TensorFlow from JavaScript, and would benefit from the access to known datasets.

Describe the solution you'd like

in package.json

npm add @tensorflow/tensorflow-datasets

in index.js

import * as tfds from '@tensorflow/tensorflow-datasets' 
const ds = tfds.load('mnist');

Additional context
js.tensorflow.org

[data request] CNN DailyMail

Name of dataset: CNN DailyMail summarization
URL of dataset: https://cs.nyu.edu/~kcho/DMQA/
License of dataset: unknown
Short description of dataset and use case(s): news articles for question answering and summarization

Related URL that uses the data: https://github.com/abisee/cnn-dailymail

Folks who would also like to see this dataset in tensorflow/datasets, please +1/thumbs-up so the developers can know which requests to prioritize.

Follow this issue to be notified of release

Hi all, thanks for your interest in tensorflow/datasets. We're actively working on the project and hope to release soon with a starter set of datasets. Please follow this issue to be notified of when we have an initial version on PyPI.

We have an alpha nightly release that you can try out: pip install tfds-nightly. Please leave feedback through Issues if you try it out.

If you're interested in contributing a dataset implementation, please feel free to start looking through the new dataset documentation and familiarize yourself with the codebase. MNIST might be a good starting point.

[data request] Pascal VOC

Name of dataset: Pascal VOC12 with SBD extension
URL of dataset: http://host.robots.ox.ac.uk/pascal/VOC/ and http://home.bharathh.info/pubs/codes/SBD/download.html
License of dataset: ?? (to be investigated)
Short description of dataset and use case(s): the classic detection and semantic segmentation dataset. Used in many works related to weakly supervised learning (since a bit easier than Coco).

Folks who would also like to see this dataset in tensorflow/datasets, please +1/thumbs-up so the developers can know which requests to prioritize.

Change default download directory to hidden folder

Currently, the default download directory for dataset caching appears to be ~/tensorflow_datasets. However, since it's not a folder that is meant to be accessed through a file manager, I'd suggest to make it hidden by default, e.g. ~/.tensorflow_datasets.
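
Until the default changes, the location can be overridden per call through the `data_dir` argument of `tfds.load`. A minimal sketch, where `hidden_data_dir` and `load_mnist_from_hidden_dir` are illustrative helpers rather than TFDS functions:

```python
import os


def hidden_data_dir(home=None):
    """Return the suggested hidden cache directory under the user's home."""
    home = home if home is not None else os.path.expanduser("~")
    return os.path.join(home, ".tensorflow_datasets")


def load_mnist_from_hidden_dir():
    """Load MNIST while caching it in the hidden directory."""
    import tensorflow_datasets as tfds  # imported lazily; requires TFDS
    return tfds.load("mnist", data_dir=hidden_data_dir())


print(hidden_data_dir())
```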

[data request] OpenImages v4

Name of dataset: OpenImages v4
URL of dataset: https://storage.googleapis.com/openimages/web/index.html
License of dataset: licensed by Google Inc. under CC BY 4.0 license. The images are listed as having a CC BY 2.0 license.
Short description of dataset and use case(s): bigger ImageNet with image level labels and bounding boxes.

Folks who would also like to see this dataset in tensorflow/datasets, please +1/thumbs-up so the developers can know which requests to prioritize.

Consider hosting preconverted Datasets to minimize time-to-first-example

Is your feature request related to a problem? Please describe.
I just tried the MNIST dataset in a colab. It took a few minutes to download & convert the data to tfrecords, before I could try anything at all.

The keras built-in MNIST dataset loaded in a few seconds.

This is concerning because MNIST is pretty small. If I tried to use a bigger dataset, seems like I might be waiting for an hour.

Describe the solution you'd like
When not otherwise prohibited by dataset licensing, it would be great if the TFDS team could convert the datasets AOT to their TFRecord format and host the converted data as a public dataset in the cloud. So when users try to use the dataset, there is not an extensive pause.

Describe alternatives you've considered
Using the dataset apis of other frameworks.

import tensorflow_datasets always enables eager execution

It seems importing tensorflow_datasets always enables eager execution, which I don't want. Is there a way to disable it? Thank you very much!

If I execute the following piece, the eager mode will be enabled by importing tensorflow_datasets

In [1]: import tensorflow as tf

In [2]: import tensorflow_datasets as tfds

In [3]: tf.executing_eagerly()
Out[3]: True

However, if I execute the following, an error will be raised!

In [1]: import tensorflow as tf

In [2]: tf.executing_eagerly()
Out[2]: False

In [3]: import tensorflow_datasets as tfds
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-46a8a2031c9c> in <module>
----> 1 import tensorflow_datasets as tfds

C:\Users\Liyu\Anaconda3\lib\site-packages\tensorflow_datasets\__init__.py in <module>
     49 # Imports for registration
     50 # pylint: disable=g-import-not-at-top
---> 51 from tensorflow_datasets import audio
     52 from tensorflow_datasets import image
     53 from tensorflow_datasets import text

C:\Users\Liyu\Anaconda3\lib\site-packages\tensorflow_datasets\audio\__init__.py in <module>
     16 """Audio datasets."""
     17
---> 18 from tensorflow_datasets.audio.librispeech import Librispeech
     19 from tensorflow_datasets.audio.librispeech import LibrispeechConfig
     20 from tensorflow_datasets.audio.nsynth import Nsynth

C:\Users\Liyu\Anaconda3\lib\site-packages\tensorflow_datasets\audio\librispeech.py in <module>
     26
     27 from tensorflow_datasets.core import api_utils
---> 28 import tensorflow_datasets.public_api as tfds
     29
     30 _CITATION = """\

C:\Users\Liyu\Anaconda3\lib\site-packages\tensorflow_datasets\public_api.py in <module>
     61
     62
---> 63 testing = _import_testing()

C:\Users\Liyu\Anaconda3\lib\site-packages\tensorflow_datasets\public_api.py in _import_testing()
     55 def _import_testing():
     56   try:
---> 57     from tensorflow_datasets import testing  # pylint: disable=redefined-outer-name
     58     return testing
     59   except:

C:\Users\Liyu\Anaconda3\lib\site-packages\tensorflow_datasets\testing\__init__.py in <module>
     16 """Testing utilities."""
     17
---> 18 from tensorflow_datasets.testing.dataset_builder_testing import DatasetBuilderTestCase
     19 from tensorflow_datasets.testing.test_case import TestCase
     20 from tensorflow_datasets.testing.test_utils import DummyDatasetSharedGenerator

C:\Users\Liyu\Anaconda3\lib\site-packages\tensorflow_datasets\testing\dataset_builder_testing.py in <module>
     37 from tensorflow_datasets.testing import test_utils
     38
---> 39 tf.compat.v1.enable_eager_execution()
     40
     41 # `os` module Functions for which tf.io.gfile equivalent should be preferred.

C:\Users\Liyu\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py in enable_eager_execution(config, device_policy, execution_mode)
   5421         device_policy=device_policy,
   5422         execution_mode=execution_mode,
-> 5423         server_def=None)
   5424
   5425

C:\Users\Liyu\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py in enable_eager_execution_internal(config, device_policy, execution_mode, server_def)
   5489   else:
   5490     raise ValueError(
-> 5491         "tf.enable_eager_execution must be called at program startup.")
   5492
   5493   # Monkey patch to get rid of an unnecessary conditional since the context is

ValueError: tf.enable_eager_execution must be called at program startup.

[data request] Moving MNIST

Name of dataset: Moving MNIST
URL of dataset: http://www.cs.toronto.edu/~nitish/unsupervised_video/
License of dataset: unknown (same as MNIST probably)
Short description of dataset and use case(s): A test set for evaluating sequence prediction/reconstruction. 10,000 sequences each of length 20 showing 2 digits moving in a 64 x 64 frame.

Folks who would also like to see this dataset in tensorflow/datasets, please +1/thumbs-up so the developers can know which requests to prioritize.

Rm dep on tf.contrib.data.LMDBDataset for TF 2.0

tf.contrib.data.LMDBDataset will not make it to TF2. It may be moving to a new repo/package. If/when that happens, we should switch to it.

Part of #31 (TF 2.0 support)

Leverage tf.io.gfile as a downloader

Is your feature request related to a problem? Please describe.
Using tf-datasets with private datasets stored on Google Cloud Storage returns a 403 error code, since the current download method using requests does not handle authentication. Also, paths specified in the form gs://... are not currently supported.

Describe the solution you'd like
Using tf.io.gfile.GFile whenever possible (in place of the current requests-based approach) would solve both of these issues.

Additional context
A possible implementation could add a check (or something like that) during the call of _sync_download() and then, if the given URI matches gs:// (or any other supported scheme), use tf.io.gfile.GFile to download it.
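
The check could be as small as a dispatch on the URL scheme. The sketch below only decides which backend to use and performs no download; `choose_backend` and the two scheme sets are hypothetical names and are not part of the existing downloader.

```python
from urllib.parse import urlparse

# Schemes assumed to be readable through tf.io.gfile versus plain HTTP(S).
GFILE_SCHEMES = {"gs"}
HTTP_SCHEMES = {"http", "https"}


def choose_backend(url):
    """Return "gfile" or "requests" depending on the scheme of `url`."""
    scheme = urlparse(url).scheme.lower()
    if scheme in GFILE_SCHEMES:
        return "gfile"
    if scheme in HTTP_SCHEMES:
        return "requests"
    raise ValueError("Unsupported URL scheme: {!r}".format(url))


print(choose_backend("gs://my-bucket/data.zip"))
print(choose_backend("https://example.com/data.zip"))
```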

[data request] EMNIST

Name of dataset: EMNIST
URL of dataset: https://www.westernsydney.edu.au/bens/home/reproducible_research/emnist
License of dataset: Unknown, but from the same NIST sources originally as MNIST.
Short description of dataset and use case(s): EMNIST is an extended MNIST dataset using the same processing and normalization techniques as MNIST itself on a superset of the original NIST images. This includes six splits, including not only a larger 280k MNIST digit equivalent, but also an 814k handwritten digit and letter dataset. For more information and details see: Cohen, G., Afshar, S., Tapson, J., & van Schaik, A. (2017). EMNIST: an extension of MNIST to handwritten letters. Retrieved from http://arxiv.org/abs/1702.05373

Folks who would also like to see this dataset in tensorflow/datasets, please +1/thumbs-up so the developers can know which requests to prioritize.

[data request] KTH human action video data set

Name of dataset: KTH human action video data set
URL of dataset: http://www.nada.kth.se/cvap/actions/
License of dataset: unknown
Short description of dataset and use case(s):
The current video database contains six types of human actions (walking, jogging, running, boxing, hand waving and hand clapping) performed several times by 25 subjects in four different scenarios: outdoors (s1), outdoors with scale variation (s2), outdoors with different clothes (s3) and indoors (s4). Currently the database contains 2391 sequences. All sequences were taken over homogeneous backgrounds with a static camera at a 25 fps frame rate. The sequences were downsampled to a spatial resolution of 160x120 pixels and have a length of four seconds on average.

Folks who would also like to see this dataset in tensorflow/datasets, please +1/thumbs-up so the developers can know which requests to prioritize.

Index datasets in Google Dataset Search

It would be great to have all TFDS datasets indexed in Dataset Search.

Example Kaggle dataset.

We need a library function that goes from a DatasetBuilder to the schema.org markup needed. See the Google Dataset type docs and the schema.org docs. The markup should include usage instructions similar to the Kaggle example above (i.e. show off using the TFDS APIs for that dataset).

Then we need a script to generate an HTML page for each dataset and write them to a directory.
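As a rough illustration of the two pieces described above, the following sketch builds schema.org `Dataset` markup as JSON-LD and wraps it in a minimal HTML page. It takes plain strings rather than a DatasetBuilder, because how builder metadata maps onto schema.org properties is left open here; `dataset_jsonld` and `render_page` are hypothetical names, and the example values are placeholders.

```python
import html
import json


def dataset_jsonld(name, description, url, license_name=None):
    """Build a schema.org Dataset description as a JSON-LD dictionary."""
    markup = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
    }
    if license_name:
        markup["license"] = license_name
    return markup


def render_page(markup):
    """Embed the JSON-LD markup in a minimal HTML page."""
    payload = json.dumps(markup, indent=2)
    # Escape "</" so the payload cannot close the <script> element early.
    payload = payload.replace("</", "<\\/")
    title = html.escape(markup["name"])
    return (
        "<!DOCTYPE html>\n<html>\n<head>\n"
        f"<title>{title}</title>\n"
        '<script type="application/ld+json">\n'
        f"{payload}\n"
        "</script>\n</head>\n"
        f"<body><h1>{title}</h1></body>\n</html>\n"
    )


if __name__ == "__main__":
    # Placeholder values; a real script would loop over all datasets
    # and write one file per dataset into an output directory.
    page = render_page(
        dataset_jsonld(
            name="my_dataset",
            description="Placeholder description with usage instructions.",
            url="https://example.com/my_dataset",
        )
    )
    print(page)
```

The usage instructions mentioned above could go into the `description` field, since that is the free-text property of the `Dataset` type.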

If Google indexes GitHub, then we'd be done. If not, we can copy those files over to the TF site, which is definitely indexed.

If anybody has experience with schema.org or is interested in having TFDS have wider exposure, this would be a great issue to pick up.

[data request] LSUN

Name of dataset: LSUN
URL of dataset: http://lsun.cs.princeton.edu/2017/
License of dataset: unknown
Short description of dataset and use case(s): scene and object recognition.

Folks who would also like to see this dataset in tensorflow/datasets, please +1/thumbs-up so the developers can know which requests to prioritize.

[data request] Natural Questions

Name of dataset: Natural Questions
URL of dataset: https://ai.google.com/research/NaturalQuestions/dataset
License of dataset: Creative Commons Share-Alike 3.0
Short description of dataset and use case(s):
To help spur development in open-domain question answering, we have created the Natural Questions (NQ) corpus, along with a challenge website based on this data. The NQ corpus contains questions from real users, and it requires QA systems to read and comprehend an entire Wikipedia article that may or may not contain the answer to the question. The inclusion of real user questions, and the requirement that solutions should read an entire page to find the answer, cause NQ to be a more realistic and challenging task than prior QA datasets.

Folks who would also like to see this dataset in tensorflow/datasets, please +1/thumbs-up so the developers can know which requests to prioritize.

Unused pytz dependency?

pytz is in setup.py but doesn’t seem to be used anywhere. Remove it?

Add option to return images as float instead of uint8

Is your feature request related to a problem? Please describe.
Dataset features do not immediately compose with tf.hub modules. For example, I want to fine-tune a model for the cats_vs_dogs dataset by reusing a tfhub image feature module. cats_vs_dogs provides images as uint8 with undefined size, but tfhub image modules expect float32 at specific sizes.

This can be solved with Dataset.map etc., but that is not obvious for beginners and is friction for everyone else, which dilutes the value of this project. The expectation is that the datasets in this project are ready to plug into downstream computations, not that more massaging/transformation needs to happen first.
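For reference, the Dataset.map workaround might look roughly like the sketch below. It assumes the dataset yields `(image, label)` pairs (for example, as loaded with `as_supervised=True`) and that the downstream module expects 224x224 float32 inputs in the range [0, 1]; the target size and the function names are assumptions, so check the documentation of the specific module.

```python
import tensorflow as tf

# Assumed input size; the actual size depends on the module being reused.
IMAGE_SIZE = (224, 224)


def to_float_and_resize(image, label):
    """Convert a uint8 image of any size to float32 in [0, 1] at IMAGE_SIZE."""
    image = tf.image.convert_image_dtype(image, tf.float32)  # scales to [0, 1]
    image = tf.image.resize(image, IMAGE_SIZE)
    return image, label


def prepare(dataset, batch_size=32):
    """Apply the conversion to every example, then batch."""
    dataset = dataset.map(to_float_and_resize, num_parallel_calls=tf.data.AUTOTUNE)
    return dataset.batch(batch_size)
```

Because the images are resized to a common shape before batching, the variable input sizes no longer prevent `batch` from stacking them.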

Describe the solution you'd like
I want to trivially compose tfds datasets with tfhub modules, without having to manually check and align details of tensor shapes and tensor types, or figure out where in the pipeline to insert a conversion function.

One solution could be to provide explicit features that are targeted for compatibility with tf.hub. Another solution could be to have the Builder parameterize/generate a set of FeatureColumns in correspondence to the FeaturesDict.

Describe alternatives you've considered
Dataset.map, which is ugly and goes against the spirit of this project, which seems to be "make it easy to plug existing datasets into TF".

Additional context
FastAI has very slick examples for fine-tuning a model on this dataset. A TF solution could be competitive or better, but the pieces need to fit together.

[data request] Colorectal Histology (Collection of textures in colorectal cancer histology)

Name of dataset: Colorectal Histology (Collection of textures in colorectal cancer histology)
URL of dataset: https://zenodo.org/record/53169#.XGL2CNFCfOR
License of dataset: Open Access
Short description of dataset and use case(s):
The dataset serves as a much more interesting MNIST or CIFAR10 problem for biologists by focusing on histology tiles from patients with colorectal cancer. In particular, the data has 8 different classes of tissue (but Cancer/Not Cancer can also be an interesting problem).
https://www.kaggle.com/kmader/colorectal-histology-mnist/home
https://zenodo.org/record/53169#.XGL2CNFCfOR

A good example for people to see a use case of AI for Good.

I can try to help create the TFRecords files if needed (I am still learning how).

Folks who would also like to see this dataset in tensorflow/datasets, please +1/thumbs-up so the developers can know which requests to prioritize.
