tensorflow / datasets

TFDS is a collection of datasets ready to use with TensorFlow, Jax, ...

Home Page: https://www.tensorflow.org/datasets

License: Apache License 2.0

Shell 0.06% Python 98.34% Smalltalk 0.34% Perl 0.01% Gherkin 0.01% Roff 0.36% NewLisp 0.23% Ruby 0.42% JavaScript 0.22% TeX 0.01%
tensorflow machine-learning data datasets numpy jax dataset

datasets's Introduction

TensorFlow is an end-to-end open source platform for machine learning. It has a comprehensive, flexible ecosystem of tools, libraries, and community resources that lets researchers push the state-of-the-art in ML and developers easily build and deploy ML-powered applications.

TensorFlow was originally developed by researchers and engineers working within the Machine Intelligence team at Google Brain to conduct research in machine learning and neural networks. However, the framework is versatile enough to be used in other areas as well.

TensorFlow provides stable Python and C++ APIs, as well as a non-guaranteed backward compatible API for other languages.

Keep up-to-date with release announcements and security updates by subscribing to [email protected]. See all the mailing lists.

Install

See the TensorFlow install guide for the pip package, to enable GPU support, use a Docker container, and build from source.

To install the current release, which includes support for CUDA-enabled GPU cards (Ubuntu and Windows):

$ pip install tensorflow

Other devices (DirectX and MacOS-metal) are supported using Device plugins.

A smaller CPU-only package is also available:

$ pip install tensorflow-cpu

To update TensorFlow to the latest version, add the --upgrade flag to the above commands.

Nightly binaries are available for testing using the tf-nightly and tf-nightly-cpu packages on PyPI.

Try your first TensorFlow program

$ python
>>> import tensorflow as tf
>>> tf.add(1, 2).numpy()
3
>>> hello = tf.constant('Hello, TensorFlow!')
>>> hello.numpy()
b'Hello, TensorFlow!'

For more examples, see the TensorFlow tutorials.

Contribution guidelines

If you want to contribute to TensorFlow, be sure to review the contribution guidelines. This project adheres to TensorFlow's code of conduct. By participating, you are expected to uphold this code.

We use GitHub issues for tracking requests and bugs. Please see the TensorFlow Forum for general questions and discussion, and direct specific questions to Stack Overflow.

The TensorFlow project strives to abide by generally accepted best practices in open-source software development.

Patching guidelines

Follow these steps to patch a specific version of TensorFlow, for example, to apply fixes to bugs or security vulnerabilities:

  • Clone the TensorFlow repo and switch to the corresponding branch for your desired TensorFlow version, for example, branch r2.8 for version 2.8.
  • Apply (that is, cherry-pick) the desired changes and resolve any code conflicts.
  • Run TensorFlow tests and ensure they pass.
  • Build the TensorFlow pip package from source.

Continuous build status

You can find more community-supported platforms and configurations in the TensorFlow SIG Build community builds table.

Official Builds

Build Type | Status | Artifacts
Linux CPU | Status | PyPI
Linux GPU | Status | PyPI
Linux XLA | Status | TBA
macOS | Status | PyPI
Windows CPU | Status | PyPI
Windows GPU | Status | PyPI
Android | Status | Download
Raspberry Pi 0 and 1 | Status | Py3
Raspberry Pi 2 and 3 | Status | Py3
Libtensorflow MacOS CPU | Status Temporarily Unavailable | Nightly Binary, Official GCS
Libtensorflow Linux CPU | Status Temporarily Unavailable | Nightly Binary, Official GCS
Libtensorflow Linux GPU | Status Temporarily Unavailable | Nightly Binary, Official GCS
Libtensorflow Windows CPU | Status Temporarily Unavailable | Nightly Binary, Official GCS
Libtensorflow Windows GPU | Status Temporarily Unavailable | Nightly Binary, Official GCS

Resources

Learn more about the TensorFlow community and how to contribute.

Courses

License

Apache License 2.0

datasets's People

Contributors

acharles7, adarob, afrozenator, annxingyuan, captain-pool, chanchalkumarmaji, conchylicultor, cyfra, eshan-agarwal, fineguy, harsh020, iswariyam, jpuigcerver, marcenacp, markdaoust, nikhilbartwal, pierrot0, rchen152, rickwierenga, sabelaraga, sharanramjee, soumyadeepjana, tomvdw, tonywz, us, vijayphoenix, vvkio, williamhyzhang, yaozhaogoogle, yukimasano

datasets's Issues

[data request] WikiText-103

Folks who would also like to see this dataset in tensorflow/datasets, please +1/thumbs-up so the developers can know which requests to prioritize.

[data request] Omniglot

  • Name of dataset: Omniglot
  • URL of dataset: https://github.com/brendenlake/omniglot
  • License of dataset: MIT
  • Short description of dataset and use case(s): Image dataset for few-shot image classification and meta-learning

Folks who would also like to see this dataset in tensorflow/datasets, please +1/thumbs-up so the developers can know which requests to prioritize.

XL dataset generation with Apache Beam

This is a tracking bug for extra-large dataset generation using Apache Beam (i.e. for datasets that cannot feasibly be generated within a day on a single machine). Follow it to be notified of updates on this support.

two files of same name but different capitalization

You have a file called Text.md and one called text.md in the same directory datasets/docs/api_docs/python/tfds/features/.

This is causing an issue in filesystems that ignore capitalization in filenames. (My filesystem overwrote one of the files with the other, and now git shows unstaged changes/modification.)

Could you resolve the conflict between the two files?
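A collision like this one can be caught before it reaches a case-insensitive filesystem. The following is a minimal sketch, not part of the repository: the helper name `find_case_collisions` is hypothetical, and it assumes the paths have already been collected (for example from a file listing).

```python
from collections import defaultdict


def find_case_collisions(paths):
    """Return groups of paths that differ only by letter case."""
    groups = defaultdict(set)
    for path in paths:
        # Paths that compare equal after lowercasing would overwrite each
        # other on a filesystem that ignores capitalization.
        groups[path.lower()].add(path)
    return sorted(sorted(group) for group in groups.values() if len(group) > 1)


if __name__ == "__main__":
    example_paths = [
        "docs/api_docs/python/tfds/features/Text.md",
        "docs/api_docs/python/tfds/features/text.md",
        "docs/api_docs/python/tfds/features/Image.md",
    ]
    for group in find_case_collisions(example_paths):
        print(group)
```

Running such a check in continuous integration would flag the pair before a checkout on macOS or Windows silently drops one of the files.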

[data request] Downsampled ImageNet

  • Name of dataset: Downsampled ImageNet
  • URL of dataset: http://image-net.org/small/download.php
  • License of dataset: Same as ImageNet
  • Short description of dataset and use case(s): great for prototyping and quick experimentation beyond MNIST.

Folks who would also like to see this dataset in tensorflow/datasets, please +1/thumbs-up so the developers can know which requests to prioritize.

GCS access doesn’t work from Colab

From colab.research.google.com, GCS access doesn't work (tfds.load("mnist") hangs). This seems to be caused by tf.io.gfile rather than TFDS. TFDS is trying to access the dataset_info.json file from GCS.

TensorFlow or Colab should fix this.

For now, a possible alternative is to use requests to access the GCS files through the HTTP API: http://storage.googleapis.com/tfds-data/dataset_info/mnist/1.0.0/dataset_info.json.
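The HTTP workaround could look roughly like the sketch below. It uses the standard library's urllib instead of the requests package named above, only to stay dependency-free. The helper names are hypothetical, and the URL layout for datasets other than mnist 1.0.0 is an assumption extrapolated from the single URL in this issue.

```python
import json
import urllib.request

# Base URL copied from the issue text. The <name>/<version> layout below
# mirrors the one example given there and is assumed for other datasets.
GCS_HTTP_BASE = "http://storage.googleapis.com/tfds-data/dataset_info"


def dataset_info_url(name, version):
    """Build the plain-HTTP URL of a dataset's dataset_info.json file."""
    return f"{GCS_HTTP_BASE}/{name}/{version}/dataset_info.json"


def fetch_dataset_info(name, version, timeout=10.0):
    """Download and parse dataset_info.json over HTTP (requires network)."""
    url = dataset_info_url(name, version)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    print(dataset_info_url("mnist", "1.0.0"))
```

Because this path never touches tf.io.gfile, it should avoid the hang described above, at the cost of working only for publicly readable files.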

Simple MNIST model training example

Is there a simple example showing how to import MNIST and train a simple neural network to make inferences on the data? It would help to see how to use this library in an end-to-end manner. Right now, I know how to import a dataset but not how to actually train a model with it.

I'm a lot more used to working with numpy datasets that I can feed directly into a TensorFlow feed dict.

[data request] UCF101

  • Name of dataset: UCF101
  • URL of dataset: http://crcv.ucf.edu/data/UCF101.php
  • License of dataset: unknown
  • Short description of dataset and use case(s):
    UCF101 is an action recognition data set of realistic action videos, collected from YouTube, having 101 action categories. This data set is an extension of UCF50 data set which has 50 action categories.

Folks who would also like to see this dataset in tensorflow/datasets, please +1/thumbs-up so the developers can know which requests to prioritize.

[Question] - Would it be okay to generate tfrecords from a downloaded/pre-downloaded dataset?

Hi,

Thanks for this package; it looks really good.

I have a question about the optimal workflow and usage of this package in the following scenario:

Let's say there is a very large (20+ GB), password-protected database of images. In the TensorFlow ecosystem, it is recommended to use TFRecords to consolidate the labels and data and to speed up training, so this data needs to be converted to TFRecords.

Since the data is 20+ GB and password protected, I assume automatic download would not be recommended, which is why this library provides a way to specify the folder.

However, since the original dataset needs to be converted to TFRecords first:

Would you suggest that the conversion (to TFRecords) be done as part of this library (in subclasses), or should it be a separate step (equivalent to manually downloading the images)?

Regards
Kapil

Split dataset between workers

For large datasets being processed on many workers, it is useful to be able to read separate shards of the dataset on each worker. Is this possible with the tfds API?

I experimented a bit with the Split.subsplit functionality, but it looks like it works by reading every dataset element and masking out selected elements (see here). This means that every worker ends up reading the whole dataset, which can be costly. In particular, this makes it impossible to use tf.data.Dataset.cache.
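One way to avoid every worker reading the whole dataset is to partition at the file level rather than the element level. The sketch below is an illustration only, not the tfds API: it assumes the dataset is stored as several shard files, and both the helper name `files_for_worker` and the file names are hypothetical.

```python
def files_for_worker(filenames, num_workers, worker_index):
    """Return the subset of shard files that one worker should read."""
    if num_workers <= 0:
        raise ValueError("num_workers must be positive")
    if not 0 <= worker_index < num_workers:
        raise ValueError("worker_index must be in [0, num_workers)")
    # Sort first so that every worker derives the same global ordering,
    # then take every num_workers-th file starting at worker_index.
    return sorted(filenames)[worker_index::num_workers]


if __name__ == "__main__":
    shard_files = [f"train-{i:05d}-of-00005" for i in range(5)]
    for index in range(2):
        print(index, files_for_worker(shard_files, 2, index))
```

Each worker then opens only its own files, so caching per worker becomes practical. When the file list is itself a tf.data.Dataset, Dataset.shard(num_shards, index) performs the equivalent selection.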

error on pip install

I'm receiving the following error after running pip install tensorflow-datasets:

Could not find a version that satisfies the requirement tensorflow-datasets (from versions: )
No matching distribution found for tensorflow-datasets

[data request] ADE20k

Folks who would also like to see this dataset in tensorflow/datasets, please +1/thumbs-up so the developers can know which requests to prioritize.

[data request] ModaNet

  • Name of dataset: ModaNet
  • URL of dataset: https://github.com/eBay/modanet
  • License of dataset: Creative Commons Attribution-NonCommercial license 4.0.
  • Short description of dataset and use case(s):
    ModaNet is a street fashion images dataset consisting of annotations related to RGB images. ModaNet provides multiple polygon annotations for each image. This dataset is described in a technical paper with the title ModaNet: A Large-Scale Street Fashion Dataset with Polygon Annotations. Each polygon is associated with a label from 13 meta fashion categories. The annotations are based on images in the PaperDoll image set, which has only a few hundred images annotated by the superpixel-based tool. The contribution of ModaNet is to provide new and extra polygon annotations for the images.

Folks who would also like to see this dataset in tensorflow/datasets, please +1/thumbs-up so the developers can know which requests to prioritize.

[data request] MultiNLI

The Multi-Genre Natural Language Inference (MultiNLI) corpus is a crowd-sourced collection of 433k sentence pairs annotated with textual entailment information. The corpus is modeled on the SNLI corpus, but differs in that it covers a range of genres of spoken and written text, and supports a distinctive cross-genre generalization evaluation. The corpus served as the basis for the shared task of the RepEval 2017 Workshop at EMNLP in Copenhagen.

Folks who would also like to see this dataset in tensorflow/datasets, please +1/thumbs-up so the developers can know which requests to prioritize.

Example in README not working

The example given in the README is not implemented yet:
AttributeError: module 'tensorflow_datasets' has no attribute 'load'

[data request] GLUE Datasets

  • Name of dataset: GLUE Datasets
  • URL of dataset: https://gluebenchmark.com/tasks
  • License of dataset:
  • Short description of dataset and use case(s): The General Language Understanding Evaluation (GLUE) benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems.

Folks who would also like to see this dataset in tensorflow/datasets, please +1/thumbs-up so the developers can know which requests to prioritize.

[data request] YouTube 8M

  • Name of dataset: YouTube 8M
  • URL of dataset: https://research.google.com/youtube8m/
  • License of dataset: Creative Commons Attribution 4.0 International (CC BY 4.0) license.
  • Short description of dataset and use case(s): YouTube-8M is a large-scale labeled video dataset that consists of millions of YouTube video IDs, with high-quality machine-generated annotations from a diverse vocabulary of 3,800+ visual entities. Can be used for large-scale video understanding, representation learning, noisy data modeling, transfer learning, and domain adaptation approaches for video, and multi-task learning.

Folks who would also like to see this dataset in tensorflow/datasets, please +1/thumbs-up so the developers can know which requests to prioritize.

[data request] Amazon product data

  • Name of dataset: Amazon product data
  • URL of dataset: http://jmcauley.ucsd.edu/data/amazon/
  • Short description of dataset and use case(s): Dataset contains product reviews and metadata from Amazon.

Folks who would also like to see this dataset in tensorflow/datasets, please +1/thumbs-up so the developers can know which requests to prioritize.

[data request] Flickr-Faces-HQ

  • Name of dataset: Flickr-Faces-HQ
  • URL of dataset: https://github.com/NVlabs/ffhq-dataset
  • License of dataset: Dataset itself Creative Commons BY-NC-SA 4.0; the individual images were published in Flickr by their respective authors under either Creative Commons BY 2.0, Creative Commons BY-NC 2.0, Public Domain Mark 1.0, Public Domain CC0 1.0, or U.S. Government Works license.
  • Short description of dataset and use case(s):
    The dataset consists of 70,000 high-quality PNG images at 1024×1024 resolution and contains considerable variation in terms of age, ethnicity and image background. It also has good coverage of accessories such as eyeglasses, sunglasses, hats, etc.
    The dataset is suitable for GAN research, face attributes, etc.

Folks who would also like to see this dataset in tensorflow/datasets, please +1/thumbs-up so the developers can know which requests to prioritize.

Support TF 2.0

Are there any plans to support TensorFlow 2? Since Datasets has not been released yet and TF2 will hopefully be released in the coming months, it would make a lot of sense.

What do you think?

Currently I'm running into issues with the contrib part:

# Flatten
--> 127   flat_ds = tf.contrib.framework.nest.flatten(nested_ds)
    128   flat_np = []

Spurious logging from GCS access

Short description
Spurious logging from GCS access when using tfds.load. From metadata file access internally.

Environment information

  • Operating System: Ubuntu 16
  • Python version: Python 3
  • tensorflow-datasets/tfds-nightly version: tfds-nightly
  • tensorflow/tensorflow-gpu/tf-nightly/tf-nightly-gpu version: tf-nightly

Reproduction instructions

import tensorflow as tf
tf.io.gfile.exists("gs://tfds-data")

Link to logs

2019-02-03 02:03:50.095060: I tensorflow/core/platform/cloud/retrying_utils.cc:73] The operation failed and will be automatically retried in 1.92628 seconds (attempt 10 out of 10), caused by: Unavailable: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Couldn't resolve host 'metadata'
2019-02-03 02:03:52.022467: W tensorflow/core/platform/cloud/google_auth_provider.cc:157] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "Not found: Could not locate the credentials file.". Retrieving token from GCE failed with "Aborted: All 10 retry attempts failed. The last failure: Unavailable: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Couldn't resolve host 'metadata'".

Additional context

Clearly a problem with TensorFlow. But would be nice to not have these logs dumped (10 retries). They go away on subsequent access to GCS. Seems to be just on first access. And nothing crashes or breaks, just wait for the 10 retries to be done and move on. Annoying.

One alternative for now may be to use requests to use the HTTP API for GCS access to the TFDS bucket (similar to #36).

[data request] HMDB51

Folks who would also like to see this dataset in tensorflow/datasets, please +1/thumbs-up so the developers can know which requests to prioritize.

Fix `sequence_feature_test` and `open_images_test` in TF 2.0

Short description

With TF 2.0, there's a new Defun implementation that breaks sequence_feature_test.py and open_images_test.py. Currently these are disabled.

Environment information

  • Operating System: Ubuntu 14.04
  • Python version: Python 2.7 and Python 3.6
  • tensorflow-datasets/tfds-nightly version: HEAD
  • tensorflow/tensorflow-gpu/tf-nightly/tf-nightly-gpu version: tf-nightly-2.0-preview

Reproduction instructions

See Travis failure: https://travis-ci.org/tensorflow/datasets/jobs/491913842

Link to logs

E   tensorflow.python.framework.errors_impl.InvalidArgumentError: Tried to stack elements of an empty list with non-fully-defined element_shape: <unknown>
E   	 [[{{node sequence_decode/TensorArrayV2Stack/TensorListStack}}]] [Op:IteratorGetNextSync]

[data request] SNLI

  • Name of dataset: SNLI
  • URL of dataset: https://nlp.stanford.edu/projects/snli/
  • License of dataset: Creative Commons Attribution-ShareAlike 4.0 International License
  • Short description of dataset and use case(s):
    The SNLI corpus (version 1.0) is a collection of 570k human-written English sentence pairs manually labeled for balanced classification with the labels entailment, contradiction, and neutral, supporting the task of natural language inference (NLI), also known as recognizing textual entailment (RTE)

Folks who would also like to see this dataset in tensorflow/datasets, please +1/thumbs-up so the developers can know which requests to prioritize.

[data request] bAbI Question Answering

  • Name of dataset: bAbI Question Answering
  • URL of dataset: https://research.fb.com/downloads/babi/
  • License of dataset: CC BY 3.0
  • Short description of dataset and use case(s): question answering dataset for automatic text understanding and reasoning

Folks who would also like to see this dataset in tensorflow/datasets, please +1/thumbs-up so the developers can know which requests to prioritize.

Would the team accept a JS version, published in NPM (instead of PyPI)?

Is your feature request related to a problem? Please describe.
Ideally, the datasets API would be available cross language, like Keras or TensorFlow. Many TF learners are coming to TensorFlow from JavaScript, and would benefit from the access to known datasets.

Describe the solution you'd like

On the command line (the package below is the proposed one and does not exist yet):

npm add @tensorflow/tensorflow-datasets

in index.js

import * as tfds from '@tensorflow/tensorflow-datasets';
const ds = tfds.load('mnist');

Additional context
js.tensorflow.org

Follow this issue to be notified of release

Hi all, thanks for your interest in tensorflow/datasets. We're actively working on the project and hope to release soon with a starter set of datasets. Please follow this issue to be notified of when we have an initial version on PyPI.

We have an alpha nightly release that you can try out: pip install tfds-nightly. Please leave feedback through Issues if you try it out.

If you're interested in contributing a dataset implementation, please feel free to start looking through the new dataset documentation and familiarize yourself with the codebase. MNIST might be a good starting point.

[data request] Pascal VOC

Folks who would also like to see this dataset in tensorflow/datasets, please +1/thumbs-up so the developers can know which requests to prioritize.

Change default download directory to hidden folder

Currently, the default download directory for dataset caching appears to be ~/tensorflow_datasets. However, since it's not a folder that is meant to be accessed through a file manager, I'd suggest making it hidden by default, e.g. ~/.tensorflow_datasets.

[data request] OpenImages v4

  • Name of dataset: OpenImages v4
  • URL of dataset: https://storage.googleapis.com/openimages/web/index.html
  • License of dataset: licensed by Google Inc. under CC BY 4.0 license. The images are listed as having a CC BY 2.0 license.
  • Short description of dataset and use case(s): bigger ImageNet with image level labels and bounding boxes.

Folks who would also like to see this dataset in tensorflow/datasets, please +1/thumbs-up so the developers can know which requests to prioritize.

Consider hosting preconverted Datasets to minimize time-to-first-example

Is your feature request related to a problem? Please describe.
I just tried the MNIST dataset in a Colab. It took a few minutes to download and convert the data to TFRecords before I could try anything at all.

The keras built-in MNIST dataset loaded in a few seconds.

This is concerning because MNIST is pretty small. If I tried to use a bigger dataset, seems like I might be waiting for an hour.

Describe the solution you'd like
When not otherwise prohibited by dataset licensing, it would be great if the TFDS team could convert the datasets ahead of time (AOT) to their TFRecord format and host the converted data as a public dataset in the cloud, so that users don't face a long pause when they first try a dataset.

Describe alternatives you've considered
Using the dataset apis of other frameworks.

import tensorflow_datasets always enables eager execution

It seems importing tensorflow_datasets always enables eager execution, which I don't want. Is there a way to disable it? Thank you very much!

If I execute the following piece, the eager mode will be enabled by importing tensorflow_datasets

In [1]: import tensorflow as tf

In [2]: import tensorflow_datasets as tfds

In [3]: tf.executing_eagerly()
Out[3]: True

However, if I execute the following, an error will be raised!

In [1]: import tensorflow as tf

In [2]: tf.executing_eagerly()
Out[2]: False

In [3]: import tensorflow_datasets as tfds
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-46a8a2031c9c> in <module>
----> 1 import tensorflow_datasets as tfds

C:\Users\Liyu\Anaconda3\lib\site-packages\tensorflow_datasets\__init__.py in <module>
     49 # Imports for registration
     50 # pylint: disable=g-import-not-at-top
---> 51 from tensorflow_datasets import audio
     52 from tensorflow_datasets import image
     53 from tensorflow_datasets import text

C:\Users\Liyu\Anaconda3\lib\site-packages\tensorflow_datasets\audio\__init__.py in <module>
     16 """Audio datasets."""
     17
---> 18 from tensorflow_datasets.audio.librispeech import Librispeech
     19 from tensorflow_datasets.audio.librispeech import LibrispeechConfig
     20 from tensorflow_datasets.audio.nsynth import Nsynth

C:\Users\Liyu\Anaconda3\lib\site-packages\tensorflow_datasets\audio\librispeech.py in <module>
     26
     27 from tensorflow_datasets.core import api_utils
---> 28 import tensorflow_datasets.public_api as tfds
     29
     30 _CITATION = """\

C:\Users\Liyu\Anaconda3\lib\site-packages\tensorflow_datasets\public_api.py in <module>
     61
     62
---> 63 testing = _import_testing()

C:\Users\Liyu\Anaconda3\lib\site-packages\tensorflow_datasets\public_api.py in _import_testing()
     55 def _import_testing():
     56   try:
---> 57     from tensorflow_datasets import testing  # pylint: disable=redefined-outer-name
     58     return testing
     59   except:

C:\Users\Liyu\Anaconda3\lib\site-packages\tensorflow_datasets\testing\__init__.py in <module>
     16 """Testing utilities."""
     17
---> 18 from tensorflow_datasets.testing.dataset_builder_testing import DatasetBuilderTestCase
     19 from tensorflow_datasets.testing.test_case import TestCase
     20 from tensorflow_datasets.testing.test_utils import DummyDatasetSharedGenerator

C:\Users\Liyu\Anaconda3\lib\site-packages\tensorflow_datasets\testing\dataset_builder_testing.py in <module>
     37 from tensorflow_datasets.testing import test_utils
     38
---> 39 tf.compat.v1.enable_eager_execution()
     40
     41 # `os` module Functions for which tf.io.gfile equivalent should be preferred.

C:\Users\Liyu\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py in enable_eager_execution(config, device_policy, execution_mode)
   5421         device_policy=device_policy,
   5422         execution_mode=execution_mode,
-> 5423         server_def=None)
   5424
   5425

C:\Users\Liyu\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py in enable_eager_execution_internal(config, device_policy, execution_mode, server_def)
   5489   else:
   5490     raise ValueError(
-> 5491         "tf.enable_eager_execution must be called at program startup.")
   5492
   5493   # Monkey patch to get rid of an unnecessary conditional since the context is

ValueError: tf.enable_eager_execution must be called at program startup.

[data request] Moving MNIST

  • Name of dataset: Moving MNIST
  • URL of dataset: http://www.cs.toronto.edu/~nitish/unsupervised_video/
  • License of dataset: unknown (same as MNIST probably)
  • Short description of dataset and use case(s): A test set for evaluating sequence prediction/reconstruction. 10,000 sequences each of length 20 showing 2 digits moving in a 64 x 64 frame.

Folks who would also like to see this dataset in tensorflow/datasets, please +1/thumbs-up so the developers can know which requests to prioritize.

Leverage tf.io.gfile as a downloader

Is your feature request related to a problem? Please describe.
Using tf-datasets with private datasets stored on Google Cloud Storage returns a 403 error code, since the current download method based on requests does not handle authentication. Also, paths of the form gs://... are not currently supported.

Describe the solution you'd like
Using tf.io.gfile.GFile whenever possible (in place of the current requests-based approach) would solve both of these issues.

Additional context
A possible implementation could add a check during the call to _sync_download(): if the given URI matches gs:// (or any other supported scheme), use tf.io.gfile.GFile to download it.
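The scheme check proposed here can be sketched independently of TFDS internals. The function name `choose_downloader` and the returned labels are hypothetical. The sketch only decides which backend to use and does not perform any download, and treating gs as the only filesystem scheme is an assumption.

```python
from urllib.parse import urlparse

# Schemes assumed to be readable through a filesystem layer such as
# tf.io.gfile. Extend this set for other supported filesystems.
GFILE_SCHEMES = frozenset({"gs"})
# Schemes that the existing requests-based downloader already handles.
HTTP_SCHEMES = frozenset({"http", "https"})


def choose_downloader(url):
    """Return "gfile" or "requests" depending on the URL scheme."""
    scheme = urlparse(url).scheme.lower()
    if scheme in GFILE_SCHEMES:
        return "gfile"
    if scheme in HTTP_SCHEMES:
        return "requests"
    raise ValueError(f"Unsupported URL scheme: {url!r}")


if __name__ == "__main__":
    for example in ("gs://my-bucket/data.zip", "https://example.com/data.zip"):
        print(example, "->", choose_downloader(example))
```

Keeping the decision in one small function makes it easy to unit test without network access or credentials.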

[data request] EMNIST

  • Name of dataset: EMNIST
  • URL of dataset: https://www.westernsydney.edu.au/bens/home/reproducible_research/emnist
  • License of dataset: Unknown, but from the same NIST sources originally as MNIST.
  • Short description of dataset and use case(s): EMNIST is an extended MNIST dataset using the same processing and normalization techniques as MNIST itself on a superset of the original NIST images. This includes six splits, including not only a larger 280k MNIST digit equivalent, but also a 814k handwritten digit and letter dataset. For more information and details see: Cohen, G., Afshar, S., Tapson, J., & van Schaik, A. (2017). EMNIST: an extension of MNIST to handwritten letters. Retrieved from http://arxiv.org/abs/1702.05373

Folks who would also like to see this dataset in tensorflow/datasets, please +1/thumbs-up so the developers can know which requests to prioritize.

[data request] KTH human action video data set

  • Name of dataset: KTH human action video data set
  • URL of dataset: http://www.nada.kth.se/cvap/actions/
  • License of dataset: unknown
  • Short description of dataset and use case(s):
    The current video database contains six types of human actions (walking, jogging, running, boxing, hand waving and hand clapping) performed several times by 25 subjects in four different scenarios: outdoors (s1), outdoors with scale variation (s2), outdoors with different clothes (s3) and indoors (s4). Currently the database contains 2391 sequences. All sequences were taken over homogeneous backgrounds with a static camera at a 25 fps frame rate. The sequences were downsampled to a spatial resolution of 160x120 pixels and have a length of four seconds on average.

Folks who would also like to see this dataset in tensorflow/datasets, please +1/thumbs-up so the developers can know which requests to prioritize.

Index datasets in Google Dataset Search

It would be great to have all TFDS datasets indexed in Dataset Search.

Example Kaggle dataset.

We need a library function that goes from a DatasetBuilder to the schema.org markup needed. See the Google Dataset type docs and the schema.org docs. The markup should include usage instructions similar to the Kaggle example above (i.e. show off using the TFDS APIs for that dataset).

Then we need a script to generate an HTML page for each dataset and write them to a directory.

If Google indexes GitHub, then we'd be done. If not, we can copy those files over to the TF site, which is definitely indexed.

If anybody has experience with schema.org or is interested in having TFDS have wider exposure, this would be a great issue to pick up.
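As a starting point, the markup step might look like the sketch below. It takes plain strings instead of a DatasetBuilder, so the function names and the example values are hypothetical. The keys used (`@context`, `@type`, `name`, `description`, `url`, `license`) are standard properties of the schema.org Dataset type, but which of them Dataset Search requires is not checked here.

```python
import json


def dataset_jsonld(name, description, url, license_name=None):
    """Build a schema.org Dataset description as a JSON-LD dictionary."""
    markup = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
    }
    if license_name:
        markup["license"] = license_name
    return markup


def to_script_tag(markup):
    """Wrap JSON-LD markup in the script tag used to embed it in HTML."""
    body = json.dumps(markup, indent=2, sort_keys=True)
    return f'<script type="application/ld+json">\n{body}\n</script>'


if __name__ == "__main__":
    example = dataset_jsonld(
        name="mnist",
        description="Placeholder description; usage instructions would go here.",
        url="https://www.tensorflow.org/datasets",
    )
    print(to_script_tag(example))
```

A page generator could call these two functions once per dataset and write the result into each HTML file. A real version would read the fields from the builder's metadata and escape any description text that contains HTML.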

[data request] LSUN

  • Name of dataset: LSUN
  • URL of dataset: http://lsun.cs.princeton.edu/2017/
  • License of dataset: unknown
  • Short description of dataset and use case(s): scene and object recognition.

Folks who would also like to see this dataset in tensorflow/datasets, please +1/thumbs-up so the developers can know which requests to prioritize.

[data request] Natural Questions

  • Name of dataset: Natural Questions
  • URL of dataset: https://ai.google.com/research/NaturalQuestions/dataset
  • License of dataset: Creative Commons Share-Alike 3.0
  • Short description of dataset and use case(s):
    To help spur development in open-domain question answering, we have created the Natural Questions (NQ) corpus, along with a challenge website based on this data. The NQ corpus contains questions from real users, and it requires QA systems to read and comprehend an entire Wikipedia article that may or may not contain the answer to the question. The inclusion of real user questions, and the requirement that solutions should read an entire page to find the answer, cause NQ to be a more realistic and challenging task than prior QA datasets.

Folks who would also like to see this dataset in tensorflow/datasets, please +1/thumbs-up so the developers can know which requests to prioritize.

Add option to return images as float instead of uint8

Is your feature request related to a problem? Please describe.
Dataset features do not immediately compose with tf.hub modules. For example I want to fine tune a model for the cats_vs_dogs dataset, by reusing a tfhub image feature module. cats_vs_dogs provides the image as uint8 and undefined image size, but tfhub image modules expect float32 at specific sizes.

This can be solved with Dataset.map etc., but that is not obvious for beginners, and for everyone else it is friction that dilutes the value of this project. The expectation is that the datasets in this project are ready to plug into downstream computations, not that more massaging or transformation is needed first.

Describe the solution you'd like
I want to trivially compose tfds datasets with tfhub modules, without having to manually check and align details of tensor shapes and tensor types, or figure out where in the pipeline to insert a conversion function.

One solution could be to provide explicit features that are targeted for compatibility with tf.hub. Another solution could be to have the Builder parameterize/generate a set of FeatureColumns in correspondence to the FeaturesDict.

Describe alternatives you've considered
Dataset.map, ugly and against the spirit of this project which seems to be "make it easy to plug existing datasets into TF"

Additional context
FastAI has very slick examples for fine tuning a model for this dataset, TF solution could be competitive or better but the pieces need to fit together.

[data request] Colorectal Histology (Collection of textures in colorectal cancer histology)

This is a good example for people to see a use case of AI for Good.

I can try to help create the TFRecord files if needed (I am still learning how).

Folks who would also like to see this dataset in tensorflow/datasets, please +1/thumbs-up so the developers can know which requests to prioritize.
