tensorflow / io
Dataset, streaming, and file system extensions maintained by TensorFlow SIG-IO
License: Apache License 2.0
At the moment, the tensorflow-io package is built only on Linux (inside the tensorflow:custom-op image).
As users may use the package on different platforms and with different Python versions, packages built for those platforms will be needed. At a minimum we should provide the same support as tensorflow itself.
We already have support for filesystems with igfs://, gs://, and s3:// (either through tensorflow-io or through the tensorflow repo). One protocol that is missing is https:// or http://. Lots of data refer to an image (or video/audio) through a URL, so it really would be great if we could provide support for https:// and http://.
While the HTTP protocol supports the Accept-Ranges header, not all web servers provide this feature, so that has to be taken into consideration. Also, there is no need to re-download a file that has already been saved to local disk; it would make sense to have a local cache similar to pip or bazel.
The implementation likely will be in stages. We could have an early version first then expand features as needed.
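As a rough illustration of the two concerns above, here is a minimal sketch of a range-capability probe plus a pip/bazel-style local cache. The function names (`supports_ranges`, `cache_path`, `fetch`) are hypothetical and not part of tensorflow-io:

```python
import hashlib
import os
import urllib.request

def supports_ranges(headers):
    """True if the server advertises byte-range requests via Accept-Ranges."""
    return headers.get("Accept-Ranges", "none").lower() == "bytes"

def cache_path(url, cache_dir):
    """Deterministic local path for a URL, similar to pip/bazel content caches."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return os.path.join(cache_dir, digest)

def fetch(url, cache_dir):
    """Download `url` once; later calls reuse the cached local copy."""
    path = cache_path(url, cache_dir)
    if not os.path.exists(path):
        os.makedirs(cache_dir, exist_ok=True)
        with urllib.request.urlopen(url) as resp, open(path, "wb") as f:
            f.write(resp.read())
    return path
```

A real C++ filesystem plugin would do the same checks at a lower level; this only sketches the control flow.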
Python 3.7 is a supported version in TF 1.13.0 (rc2), so we should support 3.7 as well (there is already a lot of interest in 3.7 in the TensorFlow repo).
The tensorflow-io package can already build whl files, though the package has not been released to PyPI yet.
It makes sense to publish the pip package so that users who want the I/O packages can simply run:
pip install tensorflow-io
Action Items:
We haven't had any lint checking yet, and it would be good to set up lint checking with CI.
For Python, pylint or flake8 could be a good choice.
For C++, clang-format was used, but that was far from ideal: clang-format output differs between versions, which makes it really hard to maintain a consistent format. For now, let's not use clang-format. Instead we could check simple things like tab vs. space, end-of-line spaces, etc.
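As an illustration only (not an agreed tool), the "simple things" check could be as small as this:

```python
def lint_text(text):
    """Return (line_number, message) pairs for tabs and trailing whitespace."""
    problems = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if "\t" in line:
            problems.append((lineno, "tab character"))
        if line != line.rstrip():
            problems.append((lineno, "trailing whitespace"))
    return problems
```

A CI job could run this over the tracked files and fail when the list is non-empty.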
It would be nice to have a Dockerfile to build an image with all the configuration and packages for a developer to start building and testing right away. This would include:
TensorFlow v1.13.1 has been released:
https://github.com/tensorflow/tensorflow/releases/tag/v1.13.1
We will need to make a release of 0.4.0 for tensorflow-io. There are several additional items we would really like to have in before the dev summit: macOS support, MNIST, PubSub.
But we could also make a 0.4.0 release now and have a 0.5.0 release immediately after (before the dev summit), to hopefully get better PR.
Here is the list of items for 0.4.0
tensorflow>=1.13.0,<1.14.0
A new external dependency requires Bazel 0.20.0
Google Cloud #2090
BigQuery is Google's serverless cloud data analytics platform. During the last SIG I/O call, support for BigQuery (BigQueryDataset) was mentioned.
It would be good to add BigQuery (BigQueryDataset) to tensorflow-io so that users could use tensorflow to do more big data analytics on the cloud.
Note that Google's BigQuery seems to have no C++ client library. It does have Python support and a RESTful API. The implementation will likely need to use BigQuery's Python client library, or HTTP.
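Given the lack of a C++ client, one early option is to page rows through the Python client and feed them to tf.data via Dataset.from_generator. A hedged sketch follows; `client` is assumed to behave like google-cloud-bigquery's Client (a `query(sql).result()` row iterator), and `bigquery_row_generator` is an illustrative name, not an existing API:

```python
def bigquery_row_generator(client, sql, columns):
    """Yield tuples of the selected columns, one per BigQuery result row."""
    for row in client.query(sql).result():
        yield tuple(row[c] for c in columns)

# With TensorFlow available, the generator plugs straight in, e.g.:
#   dataset = tf.data.Dataset.from_generator(
#       lambda: bigquery_row_generator(client, sql, ["x", "y"]),
#       output_types=(tf.float32, tf.int64))
```

A native BigQueryDataset kernel could later replace this without changing the user-facing shape of the data.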
While running tests on Travis CI, it looks like the Ignite tests //tensorflow_io/ignite:ignite_py_test fail with Python 3.4:
https://travis-ci.org/tensorflow/io/jobs/467306695
Python 3.6 tests also fail with //tensorflow_io/ignite:ignite_py_test:
https://travis-ci.org/tensorflow/io/jobs/467306697
It may be worth taking a look.
Note: this issue is from tensorflow/tensorflow#23001 (@BryanCutler)
Apache Arrow is a standard format for in-memory columnar data. It provides a cross-language platform for systems to communicate and operate on data efficiently.
Adding Arrow support in TensorFlow Dataset will allow systems to interface with TensorFlow in a well defined way, without the need to develop custom converters, serialize data, or write to specialized files.
It would be straightforward to add a base layer of Arrow support that works on Arrow record batches (a common struct for Arrow IPC) and extend that layer to support different kinds of Arrow Ops:
A slightly more involved Op could use Arrow Flight, Arrow-based messaging over gRPC. Additionally, it would be possible to define Ops to connect directly to other systems that can export Arrow data.
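To make the proposed base layer concrete, here is a python-level illustration of walking a stream of record batches and yielding example tuples that tf.data.Dataset.from_generator could consume. The real base layer would be a C++ kernel; `rows_from_record_batches` is an illustrative name, and `batch` is only assumed to behave like a pyarrow.RecordBatch (`num_rows`, `to_pydict()`):

```python
def rows_from_record_batches(batches, columns):
    """Yield one tuple per row across a stream of Arrow record batches."""
    for batch in batches:
        data = batch.to_pydict()  # {column_name: list_of_values}
        for i in range(batch.num_rows):
            yield tuple(data[c][i] for c in columns)
```

The same iteration shape would apply whether the batches come from files, IPC, or an Arrow Flight stream.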
In TensorFlow, video formats are supported through tf.contrib.ffmpeg, which calls the command line ffmpeg to decode video into tensors and feed them into tensorflow.
tf.contrib.ffmpeg is pretty much unmaintained, and the command line ffmpeg invocation is really unreliable due to changes in the output text across versions.
tf.contrib.ffmpeg will also be deprecated soon, so tensorflow users will very soon have no direct access to video formats. This is a big loss for many users.
I think it makes sense to support video formats in tensorflow-io by dynamically linking FFmpeg's libraries (not command line invocation) and generating output to tf.data.
We have to be very careful with licenses for external libraries, though. As far as I know (correct me if I am wrong), FFmpeg is LGPL 2.1+, so it would be OK to only dynamically link FFmpeg from tensorflow-io (Apache 2.0 license). Also, we should not distribute the FFmpeg library directly; we should merely call the API through the .so/.dll if it is already installed on the system.
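The "call through the .so/.dll only if installed" idea can be sketched with a runtime probe: look the library up on the system instead of bundling it, keeping the Apache-2.0 wheel free of LGPL binaries. `ffmpeg_available` is an illustrative name:

```python
import ctypes.util

def ffmpeg_available():
    """True if a system libavformat can be located for dynamic loading."""
    return ctypes.util.find_library("avformat") is not None
```

The real op would dlopen the library and resolve the decode entry points in C++, but the same presence check applies.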
In TensorFlow's guide on "Importing Data":
https://www.tensorflow.org/guide/datasets
it is possible to read input data directly from TFRecord (TFRecordDataset), text (TextLineDataset), and csv (CsvDataset), but not from NumPy. Reading input from NumPy still has to use the not-so-elegant approach shown in the example code of the TensorFlow Guide:
with np.load("/var/data/training_data.npz") as data:
    features = data["features"]
    labels = data["labels"]
    # Assume that each row of `features` corresponds to the same row as `labels`.
    assert features.shape[0] == labels.shape[0]
    dataset = tf.data.Dataset.from_tensor_slices((features, labels))
It should be possible to implement NumPy support so that reading input from numpy could be done in a similar fashion to other input formats. This could potentially also improve performance, as it may not be necessary to read everything into memory immediately (remotely related: tensorflow/tensorflow#16933).
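A sketch of the lazier behaviour, assuming a plain .npy file: memory-map it so slices are paged in on demand, then hand windows to tf.data.Dataset.from_generator. The function name and batching scheme are illustrative only:

```python
import numpy as np

def numpy_slices(path, batch_size):
    """Yield batches from a .npy file without loading it all at once."""
    array = np.load(path, mmap_mode="r")  # pages data in lazily
    for start in range(0, array.shape[0], batch_size):
        yield np.asarray(array[start:start + batch_size])

# With TensorFlow:
#   dataset = tf.data.Dataset.from_generator(
#       lambda: numpy_slices("/var/data/training_data.npy", 32),
#       output_types=tf.float32)
```

A native NumPyDataset op could read the npy header and stream records in the kernel instead.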
While tensorflow-io has tests covering different modules (Ignite/Kafka/etc.) through bazel test, those tests always run inside the repo directory and not through pip install.
It would be good to have tests to make sure pip install correctly exposes the python modules. Some simple tests like pip install tensorflow-io-*.whl && python -c "import tensorflow_io.Kafka" would be good enough to serve the needs here.
I think we have enough new features to release 0.3.0. We could have another release 0.4.0 to match tensorflow 1.13.0, likely in a couple of weeks.
With the WebP format (#42/#43) it is possible to import a collection of images into tf.data natively without any additional steps. Due to historical reasons, importing png/jpg/bmp files in TensorFlow still has to go through tensor conversions. It would be good to import a collection of image files (potentially with a mixture of different formats) into tf.data without any redundant operations.
The Arrow Datasets are missing proper AsGraphDefInternal implementations to define the input and output Nodes for serialization. This seems to have been allowed to work in v1.12, but it causes v1.13 to fail with a cryptic error of a thread being killed during MakeDataset.
We could set up a nightly build through Travis CI so that users can install a preview version. This could be useful as TensorFlow 1.13 is not released yet (only 1.13.0rc0) and we need to start testing early, so that we could release at roughly the same time as the 1.13.0 release.
Ideally we could set up another project on PyPI.org named tensorflow-io-nightly.
This feature request is from tensorflow/tensorflow#24556 (/cc @oss-developer):
AlibabaCloud Object Storage Service (OSS) is one of the most widely used cloud storage services in the world. Could I ask for OSS support in tensorflow?
On macOS the Kafka dataset may sometimes exit early:
https://travis-ci.org/tensorflow/io/jobs/499999592
This bug is likely caused by the usage of librdkafka. The resources (e.g., messages) may have to be closed before closing the Kafka consumer.
When installing the current version of tensorflow-io (pip install tensorflow-io-nightly), it tries to install its dependency package tensorflow, not tensorflow-gpu, even if the GPU one is installed. As a result tensorflow could be installed mistakenly instead of tensorflow-gpu.
I have no idea how to specify conditional dependencies. How about we simply do not require tensorflow in REQUIRED_PACKAGE in setup.py?
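One possible middle ground (purely a sketch, not the project's actual setup.py) is to compute the requirement at build time: only require plain tensorflow when neither variant is already installed, so an existing tensorflow-gpu install is left alone. The helper name is hypothetical:

```python
def tensorflow_requirement(installed):
    """Pick the tensorflow dependency given the already-installed packages."""
    if "tensorflow" in installed or "tensorflow-gpu" in installed:
        return []  # something usable is present; do not force a swap
    return ["tensorflow>=1.13.0,<1.14.0"]
```

At install time, `installed` could be populated from `pkg_resources.working_set`; note that this only helps for sdist installs, since wheels bake in their metadata.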
At the moment, the tensorflow-io package can be built within the docker image tensorflow:custom-op. By default tensorflow:custom-op uses Python 2.7.6. The package has not been built and tested with Python 3; we should support Python 3.
Also, there is a lot of interest in the tensorflow repo for Python 3.7 support, so 3.7 should be considered as well.
As the number of file-based Datasets grows, code duplication starts to happen. The biggest area of duplication is compression support. There are two types of compression:
We should rework Dataset to have a CompressedFileDataset-like abstraction.
There are some tf.data API changes between TensorFlow 1.12 and 1.13 which break the build of tensorflow-io. This issue is created to track the effort needed to provide TensorFlow 1.13 support for tensorflow-io.
One interesting usage of tensorflow from the community is processing live audio streams. I think it would be great to support live audio streams in tensorflow-io so that more applications could benefit from it directly.
Note this issue might be different from #49 (audio file format, not live audio).
With issue #11 and PR #30, tensorflow-io will have Video format support through FFmpeg. Since FFmpeg supports Audio as well, it is natural to add Audio format support through FFmpeg in tensorflow-io.
This issue captures the necessary effort to support the Audio format in FFmpeg. Note that unlike video, many audio formats need additional information fed from outside the container. That means the API arguments might be different from the video support.
Arrow datasets are missing support for boolean data types
Should tensorflow-io export any symbols in the tensorflow namespace?
In most cases directly using the tensorflow namespace is not an issue.
Many tensorflow-io kernels are fully contained in an anonymous namespace, and REGISTER_OP just defines a static variable.
However, any functionality requiring more than a single file currently exports symbols in the tensorflow namespace. If we ever get a symbol collision with tensorflow, symbol resolution may cause interesting issues.
Should tensorflow-io use a tensorflow_io namespace?
(And is there an automated way to test/protect against exporting symbols into the tensorflow namespace?)
tensorflow-io supports a basic Kinesis integration, which is a cloud-vendor-specific (AWS) streaming platform.
A similar streaming platform provided by Google Cloud is Cloud PubSub. It makes sense to provide support for Cloud PubSub in tensorflow-io as well.
One good thing about PubSub is that it has a gRPC endpoint, which should make it a lot easier to implement in C++ than BigQuery (which only has a RESTful API endpoint).
Do I understand correctly that the Docker image proposed for the build:
docker run -it -v ${PWD}:/working_dir -w /working_dir tensorflow/tensorflow:custom-op
doesn't contain bazel and we have to set it up manually? Can we prepare a docker image that already contains bazel?
While working on the CIFAR format, I noticed we may need access to a Device (like an Eigen thread pool or GPU) for preprocessing data in a Dataset. For example, some formats are channel_first while by default we output data as channel_last, which requires a Transpose op. We could do the transformation in python, but ideally we could do it within the kernel.
A related issue: at the moment Datasets are attached to the CPU device. Would it help to send data directly to the GPU if the user wants to?
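For reference, the channel_first to channel_last fix-up mentioned above amounts to a single axis permutation; numpy stands in here for the in-kernel Transpose op, and the helper name is illustrative:

```python
import numpy as np

def channels_last(batch):
    """Convert (N, C, H, W) data to (N, H, W, C)."""
    return np.transpose(batch, (0, 2, 3, 1))
```

In the python fallback this would be `dataset.map(lambda x: tf.transpose(x, [0, 2, 3, 1]))`; doing it in the kernel would avoid the extra graph op.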
I am collecting data during the training process and using Dataset.from_tensor_slices with placeholders and an initializable iterator to refresh the dataset. The dataset uses the tensor slices to then do further preprocessing.
As new data is collected, I reinitialize the iterator's placeholders with the new numpy array data.
Since initializable iterators are now deprecated, how do you recommend I seed the dataset with the dynamic numpy arrays? Should I switch to using a generator?
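One generator-based replacement for the initializable-iterator pattern is a callable that closes over a mutable buffer: refresh the buffer between epochs and the next pass over the dataset sees the new arrays. This is a sketch; the class name is illustrative:

```python
class RefreshableSource:
    """Holds the latest collected arrays; calling it streams them."""
    def __init__(self):
        self.data = []

    def refresh(self, arrays):
        """Swap in newly collected data for the next epoch."""
        self.data = list(arrays)

    def __call__(self):
        yield from self.data

# With TensorFlow:
#   source = RefreshableSource()
#   dataset = tf.data.Dataset.from_generator(source, output_types=tf.float32)
#   source.refresh(new_numpy_arrays)  # the next iteration sees the new data
```

Preprocessing then moves into `dataset.map(...)` exactly as with the tensor-slices version.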
There is quite some interest in supporting parquet for tensorflow, and a PR (tensorflow/tensorflow#19461) has already been approved.
With the expected deprecation of tf.contrib
, it makes sense to move PR tensorflow/tensorflow#19461 to tensorflow-io.
At the moment the blocking issue with the related PR is the bazel version. The PR requires bazel 1.7.1+ to incorporate the boost library, while tensorflow CI still uses bazel 1.5.0. We will need to find a way to work around the bazel version issue. See tensorflow/tensorflow#22449 and tensorflow/tensorflow#22964 for details.
The past guide for this was here https://www.tensorflow.org/guide/extend/filesystem and I'm looking at porting the azure blob storage file system to this repo.
I was hoping for some guidance on what might have changed. I found the igfs registration here https://github.com/tensorflow/io/blob/master/tensorflow_io/ignite/ops/igfs_ops.cc and what seems to be the build target for this file system here https://github.com/tensorflow/io/blob/master/tensorflow_io/ignite/BUILD#L5-L43. Are these two components likely all that is needed?
The Travis CI config needs to be improved to support additional platforms and improve the maintainability of .travis.yml:
| Platform | Build | Python Test | R 3.5 Test |
|---|---|---|---|
| Linux Python 2.7 | Ubuntu 14.04 | Ubuntu 16.04+18.04 | Ubuntu 16.04+18.04 |
| Linux Python 3.4 | Ubuntu 14.04 | | |
| Linux Python 3.5 | Ubuntu 14.04 | Ubuntu 16.04 | |
| Linux Python 3.6 | Ubuntu 14.04 | Ubuntu 18.04 | |
| MacOS Python 2.x | TBD | TBD | TBD |
| MacOS Python 3.x | TBD | TBD | TBD |
| Windows Python 3.x | TBD | TBD | TBD |
When making an ArrowDataset from a pandas.DataFrame, if the preserve_index flag is set to True, the Dataset iterator will fail with the error:
InternalError: Missing 2-th output from node IteratorGetNext (defined at <ipython-input-42-624c3010a782>:1) = IteratorGetNext[output_shapes=[[], [], []], output_types=[DT_DOUBLE, DT_INT64, DT_INT64], _device="/job:localhost/replica:0/task:0/device:CPU:0"](OneShotIterator)
This is because the preserve_index flag adds an additional column to the record batch, and there is no corresponding column index sent to the op.
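The mismatch in a nutshell: the pandas-to-Arrow conversion with preserve_index=True appends the index as an extra column, so the op should be given one more column index than the DataFrame has columns. An illustrative helper (not the ArrowDataset API):

```python
def expected_column_indices(num_df_columns, preserve_index):
    """Column indices the op should receive for a converted DataFrame."""
    total = num_df_columns + (1 if preserve_index else 0)
    return list(range(total))
```

The fix is either to pass the extra index or to drop the appended index column before building the dataset.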
CRAN is where most R packages are published. We can release the initial version of the package once #6 is closed as well as the following work:
This issue tracks the efforts to release tensorflow-io 0.2.0.
Since TensorFlow 1.13 has not been released, tensorflow-io 0.2.0 will be based on TensorFlow 1.12.
- README.md: add additional supported ops since 0.1.0 (@BryanCutler)
- README.md: add link to individual ops' README.md if possible (@BryanCutler)
- tensorflow==1.12.0, as 1.13 is not compatible (#65/@yongtang)
- tests, so that the build of *.whl files works correctly (#64/@yongtang)
- doc/tutorials, so that end users could see the use cases (defer to tensorflow 1.13)
- RELEASE.md: capture 0.2.0 changes (#66/@yongtang)
- *.whl files: build and push to PyPI.org (@BryanCutler)

If you want to help, you can add a comment to indicate the items you are working on.
Would it be possible to support the TIFF file format?
In TensorFlow, basic image formats of bmp, png, and jpeg are supported. An issue (tensorflow/tensorflow#18250) was opened in TensorFlow repo looking for WebP support.
This issue captures the effort to support WebP in tensorflow-io.
@yongtang I seem unable to download the artifact from PyPI using pip:
$ pip install tensorflow-io
Collecting tensorflow-io
Could not find a version that satisfies the requirement tensorflow-io (from versions: )
No matching distribution found for tensorflow-io
Is there a step I'm missing that we could add to the README?
Referencing: #7
LMDBDataset is part of tf.contrib.data and it naturally fits the tensorflow-io context. It makes sense to migrate LMDBDataset to tensorflow-io.
This issue tracks the effort and progress to set up CI for TensorFlow I/O. We plan to use Google CI, so coordination with TensorFlow's CI might be needed.
In TensorFlow, the libsvm format is supported through tf.contrib.libsvm, which converts libsvm into a sparse tensor format. With the expected deprecation of tf.contrib, libsvm will be deprecated as well.
It would be great to provide continued support for libsvm through the tensorflow-io package. While tf.contrib.libsvm converts libsvm into a sparse tensor, I think in tensorflow-io we could convert libsvm to tf.data directly.
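For reference, each libsvm record is "label index:value index:value ...", which maps naturally onto the (indices, values) pair a SparseTensor is built from. A minimal parsing sketch (the real op would do this in C++; the function name is illustrative):

```python
def parse_libsvm_line(line):
    """Return (label, indices, values) for one libsvm record."""
    parts = line.strip().split()
    label = float(parts[0])
    indices, values = [], []
    for feature in parts[1:]:
        idx, val = feature.split(":")
        indices.append(int(idx))
        values.append(float(val))
    return label, indices, values
```

A tf.data pipeline could then build `tf.sparse.SparseTensor(indices, values, dense_shape)` per record, or batch the triples first.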
When looking through the current image operations and how to add functionality, I quickly came to imagine lots and lots of per-file-format special operations, and I did not really like that picture.
Currently all operations on image and video files (on the file system or as strings) parse the files, extract one (or more) pictures as 2-3 dimensional data tensors, and then throw away all parsing information.
This severely restricts usability: all possible operations would have to be defined per file format, and each operation starts from scratch from an unopened file.
If ImageSets (Single Pictures, Multiple Pictures (Tiff), Videos (Single or multi channel)) would be wrapped in a DT_VARIANT they could become first class objects in TF.
ImageSet Operations could then extract information from the ImageSet like cardinality, specific images, image sizes, ...
Example usage pattern in pseudo TF code:
# Extract Random Image from TIF file
image_set = TiffImageSet('example.tif')
n = ImageSetCardinality(image_set)
index = tf.random.uniform([], minval=0, maxval=n, dtype=tf.dtypes.int64)
image = ImageSetGetImage(image_set, index)
A similar argument could be made to represent single potential images as an ImageSource Object.
This way meta data (focal length, resolution, ...) information could be retrieved from the ImageSource Object using special ImageSource Operations, used in calculations and then used as parameters to retrieve a cropped and scaled subset with another ImageSource Operation.
Example usage:
image_source = Picture('car.jpg')
# Extract Low RES picture from image_source and find license plate
license_plate_area = find_license_plate(image_source)
# Extract high Resolution patch from image_source and read license
plate_image = ImageSourceExtractPatch(image_source, license_plate_area)
tag = read_tag(plate_image)
I believe ImageSource operations could also be made batch friendly to avoid unnecessary copy operations on batching.
This may not be a realistic idea - but I wanted to bring it up as this may be the ideal time to specify a new generic interface. (There are enough existing formats in the repository to verify interface, but not too many to make the task unmanageable.)
Thoughts?
This issue is created to track TensorFlow 2.0 support for tensorflow-io. The most visible change in TensorFlow 2.0 is eager execution by default. We will need to test thoroughly and see if there is any impact on tensorflow-io.
We need to scaffold R package documentation website. pkgdown is a popular choice in R community to generate this automatically through simple configuration. We should at least scaffold the following pages:
Not sure if it is feasible, but it would be interesting if tensorflow could be used for live video processing directly. For that we would need live video stream support in tensorflow-io. That would benefit many applications, I think.
Note video format support (not live video) has already been captured in #11 and #30.
/cc @juwangvsu