tensorflow / io
Dataset, streaming, and file system extensions maintained by TensorFlow SIG-IO
License: Apache License 2.0
At the moment, the tensorflow-io package is built only on Linux (inside the tensorflow:custom-op image).
As users may use the package on different platforms and with different Python versions, packages built for those platforms will be needed. At a minimum we should provide the same support as tensorflow itself.
We already have support for filesystems with igfs://, gs://, and s3:// (either through tensorflow-io or through the tensorflow repo). One protocol that is missing is https:// or http://. Lots of data refer to an image (or video/audio) through a URL, so it really would be great if we could provide support for https:// and http://.
While the HTTP protocol supports the Accept-Ranges header, not all web servers provide this feature, so that has to be taken into consideration. Also, there is no need to re-download a file that has already been saved to local disk; it would make sense to have a local cache similar to pip or bazel.
The implementation likely will be in stages. We could have an early version first then expand features as needed.
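As a rough illustration of the two concerns above, here is a minimal sketch of a range-capability probe plus a pip/bazel-style local cache. The function names (`supports_ranges`, `cache_path`, `fetch`) are hypothetical and not part of tensorflow-io:

```python
import hashlib
import os
import urllib.request

def supports_ranges(headers):
    """True if the server advertises byte-range requests via Accept-Ranges."""
    return headers.get("Accept-Ranges", "none").lower() == "bytes"

def cache_path(url, cache_dir):
    """Deterministic local path for a URL, similar to pip/bazel content caches."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return os.path.join(cache_dir, digest)

def fetch(url, cache_dir):
    """Download `url` once; later calls reuse the cached local copy."""
    path = cache_path(url, cache_dir)
    if not os.path.exists(path):
        os.makedirs(cache_dir, exist_ok=True)
        with urllib.request.urlopen(url) as resp, open(path, "wb") as f:
            f.write(resp.read())
    return path
```

A real C++ filesystem plugin would do the same checks at a lower level; this only sketches the control flow.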
Python 3.7 is a supported version in TF 1.13.0 (rc2), so we should support 3.7 as well (there is already a lot of interest in 3.7 in the TensorFlow repo).
The tensorflow-io package can already build whl files, though the package has not been released to PyPI yet.
It makes sense to publish the pip package so that users who want the I/O packages can simply run:
pip install tensorflow-io
Action Items:
We haven't had any lint checking yet, and it would be good to set up lint checking with CI.
For Python, pylint or flake8 could be a good choice.
For C++, clang-format was used, but that was far from ideal: clang-format output differs between versions, which makes it really hard to maintain a consistent format. For now, let's not use clang-format. Instead we could check simple things like tab vs. space, end-of-line spaces, etc.
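As an illustration only (not an agreed tool), the "simple things" check could be as small as this:

```python
def lint_text(text):
    """Return (line_number, message) pairs for tabs and trailing whitespace."""
    problems = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if "\t" in line:
            problems.append((lineno, "tab character"))
        if line != line.rstrip():
            problems.append((lineno, "trailing whitespace"))
    return problems
```

A CI job could run this over the tracked files and fail when the list is non-empty.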
It would be nice to have a Dockerfile to build an image with all the configuration and packages for a developer to start building and testing right away. This would include:
TensorFlow v1.13.1 has been released:
https://github.com/tensorflow/tensorflow/releases/tag/v1.13.1
We will need to make a release of 0.4.0 for tensorflow-io. There are several additional items we would really like to have in before the dev summit: macOS support, MNIST, PubSub.
But we could also make a 0.4.0 release now and have a 0.5.0 release immediately after (before the dev summit), to hopefully get better PR.
Here is the list of items for 0.4.0
tensorflow>=1.13.0,<1.14.0
A new external dependency requires Bazel 0.20.0
Google Cloud #2090
BigQuery is Google's serverless cloud data analytics platform. During the last SIG I/O call, support for BigQuery (BigQueryDataset) was mentioned.
It would be good to add BigQuery (BigQueryDataset) to tensorflow-io so that users could use tensorflow to do more big data analytics on the cloud.
Note that Google's BigQuery seems to have no C++ client library. It does have Python support and a RESTful API. The implementation will likely need to use BigQuery's Python client library, or HTTP.
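Given the lack of a C++ client, one early option is to page rows through the Python client and feed them to tf.data via Dataset.from_generator. A hedged sketch follows; `client` is assumed to behave like google-cloud-bigquery's Client (a `query(sql).result()` row iterator), and `bigquery_row_generator` is an illustrative name, not an existing API:

```python
def bigquery_row_generator(client, sql, columns):
    """Yield tuples of the selected columns, one per BigQuery result row."""
    for row in client.query(sql).result():
        yield tuple(row[c] for c in columns)

# With TensorFlow available, the generator plugs straight in, e.g.:
#   dataset = tf.data.Dataset.from_generator(
#       lambda: bigquery_row_generator(client, sql, ["x", "y"]),
#       output_types=(tf.float32, tf.int64))
```

A native BigQueryDataset kernel could later replace this without changing the user-facing shape of the data.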
While running tests on Travis CI, it looks like the Ignite tests //tensorflow_io/ignite:ignite_py_test fail with Python 3.4:
https://travis-ci.org/tensorflow/io/jobs/467306695
Python 3.6 tests also fail with //tensorflow_io/ignite:ignite_py_test:
https://travis-ci.org/tensorflow/io/jobs/467306697
It may be worth taking a look.
Note: this issue is from tensorflow/tensorflow#23001 (@BryanCutler)
Apache Arrow is a standard format for in-memory columnar data. It provides a cross-language platform for systems to communicate and operate on data efficiently.
Adding Arrow support in TensorFlow Dataset will allow systems to interface with TensorFlow in a well defined way, without the need to develop custom converters, serialize data, or write to specialized files.
It would be straightforward to add a base layer of Arrow support that works on Arrow record batches (a common struct for Arrow IPC) and extend that layer to support different kinds of Arrow Ops:
A slightly more involved Op could use Arrow Flight, Arrow-based messaging over gRPC. Additionally, it would be possible to define Ops to connect directly to other systems that can export Arrow data.
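To make the proposed base layer concrete, here is a python-level illustration of walking a stream of record batches and yielding example tuples that tf.data.Dataset.from_generator could consume. The real base layer would be a C++ kernel; `rows_from_record_batches` is an illustrative name, and `batch` is only assumed to behave like a pyarrow.RecordBatch (`num_rows`, `to_pydict()`):

```python
def rows_from_record_batches(batches, columns):
    """Yield one tuple per row across a stream of Arrow record batches."""
    for batch in batches:
        data = batch.to_pydict()  # {column_name: list_of_values}
        for i in range(batch.num_rows):
            yield tuple(data[c][i] for c in columns)
```

The same iteration shape would apply whether the batches come from files, IPC, or an Arrow Flight stream.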
In TensorFlow, video formats are supported through tf.contrib.ffmpeg, which calls the command line ffmpeg to decode video into tensors and feed them into tensorflow.
tf.contrib.ffmpeg is pretty much unmaintained, and the command line ffmpeg invocation is really unreliable due to changes in the output text across versions.
tf.contrib.ffmpeg will also be deprecated soon, so tensorflow users will very soon have no direct access to video formats. This is a big loss for many users.
I think it makes sense to support video formats in tensorflow-io by dynamically linking FFmpeg's libraries (not command line invocation) and generating output to tf.data.
We have to be very careful with licenses for external libraries, though. As far as I know (correct me if I am wrong), FFmpeg is LGPL 2.1+, so it would be OK to only dynamically link FFmpeg from tensorflow-io (Apache 2.0 license). Also, we should not distribute the FFmpeg library directly; we should merely call the API through the .so/.dll if it is already installed on the system.
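The "call through the .so/.dll only if installed" idea can be sketched with a runtime probe: look the library up on the system instead of bundling it, keeping the Apache-2.0 wheel free of LGPL binaries. `ffmpeg_available` is an illustrative name:

```python
import ctypes.util

def ffmpeg_available():
    """True if a system libavformat can be located for dynamic loading."""
    return ctypes.util.find_library("avformat") is not None
```

The real op would dlopen the library and resolve the decode entry points in C++, but the same presence check applies.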
In TensorFlow's guide on "Importing Data":
https://www.tensorflow.org/guide/datasets
it is possible to read input data directly from TFRecord (TFRecordDataset), text (TextLineDataset), and csv (CsvDataset), but not from NumPy. Reading input from NumPy still has to use the not-so-elegant approach shown in the example code of the TensorFlow Guide:
with np.load("/var/data/training_data.npz") as data:
    features = data["features"]
    labels = data["labels"]
    # Assume that each row of `features` corresponds to the same row as `labels`.
    assert features.shape[0] == labels.shape[0]
    dataset = tf.data.Dataset.from_tensor_slices((features, labels))
It should be possible to implement NumPy support so that reading input from numpy could be done in a similar fashion to other input formats. This could potentially also improve performance, as it may not be necessary to read everything into memory immediately (remotely related: tensorflow/tensorflow#16933).
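A sketch of the lazier behaviour, assuming a plain .npy file: memory-map it so slices are paged in on demand, then hand windows to tf.data.Dataset.from_generator. The function name and batching scheme are illustrative only:

```python
import numpy as np

def numpy_slices(path, batch_size):
    """Yield batches from a .npy file without loading it all at once."""
    array = np.load(path, mmap_mode="r")  # pages data in lazily
    for start in range(0, array.shape[0], batch_size):
        yield np.asarray(array[start:start + batch_size])

# With TensorFlow:
#   dataset = tf.data.Dataset.from_generator(
#       lambda: numpy_slices("/var/data/training_data.npy", 32),
#       output_types=tf.float32)
```

A native NumPyDataset op could read the npy header and stream records in the kernel instead.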
While tensorflow-io has tests covering different modules (Ignite/Kafka/etc.) through bazel test, those tests always run inside the repo directory and not through pip install.
It would be good to have tests to make sure pip install correctly exposes the python modules. Some simple tests like pip install tensorflow-io-*.whl && python -c "import tensorflow_io.Kafka" would be good enough to serve the needs here.
I think we have enough new features to release 0.3.0. We could have another release 0.4.0 to match tensorflow 1.13.0, likely in a couple of weeks.
With the WebP format (#42/#43) it is possible to import a collection of images into tf.data natively without any additional steps. Due to historical reasons, importing png/jpg/bmp files in TensorFlow still has to go through tensor conversions. It would be good to import a collection of image files (potentially with a mixture of different formats) into tf.data without any redundant operations.
The Arrow Datasets are missing proper AsGraphDefInternal implementations to define the input and output Nodes for serialization. This seems to have been allowed to work in v1.12, but it causes v1.13 to fail with a cryptic error of a thread being killed during MakeDataset.
We could set up a nightly build through Travis CI so that users can install a preview version. This could be useful as TensorFlow 1.13 is not released yet (only 1.13.0rc0) and we need to start testing early, so that we could release at roughly the same time as the 1.13.0 release.
Ideally we could set up another project on PyPI.org named tensorflow-io-nightly.
This feature request is from tensorflow/tensorflow#24556 (/cc @oss-developer):
AlibabaCloud Object Storage Service (OSS) is one of the most widely used cloud storage services in the world. Could I ask for OSS support in tensorflow?
On macOS the Kafka dataset may sometimes exit early:
https://travis-ci.org/tensorflow/io/jobs/499999592
This bug is likely caused by the usage of librdkafka. The resources (e.g., messages) may have to be closed before closing the Kafka consumer.
When installing the current version of tensorflow-io (pip install tensorflow-io-nightly), it tries to install its dependency package tensorflow, not tensorflow-gpu, even if the GPU one is installed. As a result tensorflow could be installed mistakenly instead of tensorflow-gpu.
I have no idea how to specify conditional dependencies. How about we simply do not require tensorflow in REQUIRED_PACKAGE in setup.py?
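One possible middle ground (purely a sketch, not the project's actual setup.py) is to compute the requirement at build time: only require plain tensorflow when neither variant is already installed, so an existing tensorflow-gpu install is left alone. The helper name is hypothetical:

```python
def tensorflow_requirement(installed):
    """Pick the tensorflow dependency given the already-installed packages."""
    if "tensorflow" in installed or "tensorflow-gpu" in installed:
        return []  # something usable is present; do not force a swap
    return ["tensorflow>=1.13.0,<1.14.0"]
```

At install time, `installed` could be populated from `pkg_resources.working_set`; note that this only helps for sdist installs, since wheels bake in their metadata.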
At the moment, the tensorflow-io package can be built within the docker image tensorflow:custom-op. By default tensorflow:custom-op uses Python 2.7.6. The package has not been built and tested with Python 3; we should support Python 3.
Also, there is a lot of interest in the tensorflow repo for Python 3.7 support, so 3.7 should be considered as well.
As the number of file-based Datasets grows, code duplication starts to happen. The biggest area of duplication is compression support. There are two types of compression:
We should rework Dataset to have a CompressedFileDataset-like abstraction.
There are some tf.data API changes between TensorFlow 1.12 and 1.13 which break the build of tensorflow-io. This issue is created to track the effort needed to provide TensorFlow 1.13 support for tensorflow-io.
One interesting usage of tensorflow from the community is processing live audio streams. I think it would be great to support live audio streams in tensorflow-io so that more applications could benefit from it directly.
Note this issue might be different from #49 (audio file format, not live audio).
With issue #11 and PR #30, tensorflow-io will have Video format support through FFmpeg. Since FFmpeg supports Audio as well, it is natural to add Audio format support through FFmpeg in tensorflow-io.
This issue captures the necessary effort to support the Audio format in FFmpeg. Note that unlike video, many audio formats need additional information fed from outside the container. That means the API arguments might be different from the video support.
Arrow datasets are missing support for boolean data types
Should tensorflow-io export any symbols in the tensorflow namespace?
In most cases directly using the tensorflow namespace is not an issue.
Many tensorflow-io kernels are fully contained in an anonymous namespace, and REGISTER_OP just defines a static variable.
However, any functionality requiring more than a single file currently exports symbols in the tensorflow namespace. If we ever get a symbol collision with tensorflow, symbol resolution may cause interesting issues.
Should tensorflow-io use a tensorflow_io namespace?
(And is there an automated way to test/protect against exporting symbols into the tensorflow namespace?)
tensorflow-io supports a basic Kinesis integration, which is a cloud-vendor-specific (AWS) streaming platform.
A similar streaming platform provided by Google Cloud is Cloud PubSub. It makes sense to provide support for Cloud PubSub in tensorflow-io as well.
One good thing about PubSub is that it has a gRPC endpoint, which should make it a lot easier to implement in C++ than BigQuery (which only has a RESTful API endpoint).
Do I understand correctly that the Docker image proposed for the build:
docker run -it -v ${PWD}:/working_dir -w /working_dir tensorflow/tensorflow:custom-op
doesn't contain bazel and we have to set it up manually? Can we prepare a docker image that already contains bazel?
While working on the CIFAR format, I noticed we may need access to a Device (like an Eigen thread pool or GPU) for preprocessing data in a Dataset. For example, some formats are channel_first while by default we output data as channel_last, which requires a Transpose op. We could do the transformation in python, but ideally we could do it within the kernel.
A related issue: at the moment Datasets are attached to the CPU device. Would it help to send data directly to the GPU if the user wants to?
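For reference, the channel_first to channel_last fix-up mentioned above amounts to a single axis permutation; numpy stands in here for the in-kernel Transpose op, and the helper name is illustrative:

```python
import numpy as np

def channels_last(batch):
    """Convert (N, C, H, W) data to (N, H, W, C)."""
    return np.transpose(batch, (0, 2, 3, 1))
```

In the python fallback this would be `dataset.map(lambda x: tf.transpose(x, [0, 2, 3, 1]))`; doing it in the kernel would avoid the extra graph op.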
I am collecting data during the training process and using Dataset.from_tensor_slices with placeholders and an initializable iterator to refresh the dataset. The dataset uses the tensor slices to then do further preprocessing.
As new data is collected, I reinitialize the iterator's placeholders with the new numpy array data.
Since initializable iterators are now deprecated, how do you recommend I seed the dataset with the dynamic numpy arrays? Should I switch to using a generator?
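One generator-based replacement for the initializable-iterator pattern is a callable that closes over a mutable buffer: refresh the buffer between epochs and the next pass over the dataset sees the new arrays. This is a sketch; the class name is illustrative:

```python
class RefreshableSource:
    """Holds the latest collected arrays; calling it streams them."""
    def __init__(self):
        self.data = []

    def refresh(self, arrays):
        """Swap in newly collected data for the next epoch."""
        self.data = list(arrays)

    def __call__(self):
        yield from self.data

# With TensorFlow:
#   source = RefreshableSource()
#   dataset = tf.data.Dataset.from_generator(source, output_types=tf.float32)
#   source.refresh(new_numpy_arrays)  # the next iteration sees the new data
```

Preprocessing then moves into `dataset.map(...)` exactly as with the tensor-slices version.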
There is quite some interest in supporting parquet for tensorflow, and a PR (tensorflow/tensorflow#19461) has already been approved.
With the expected deprecation of tf.contrib
, it makes sense to move PR tensorflow/tensorflow#19461 to tensorflow-io.
At the moment the blocking issue with the related PR is the bazel version. The PR requires bazel 1.7.1+ to incorporate the boost library, while tensorflow CI still uses bazel 1.5.0. We will need to find a way to work around the bazel version issue. See tensorflow/tensorflow#22449 and tensorflow/tensorflow#22964 for details.
The past guide for this was here https://www.tensorflow.org/guide/extend/filesystem and I'm looking at porting the azure blob storage file system to this repo.
I was hoping for some guidance on what might have changed. I found the igfs registration here https://github.com/tensorflow/io/blob/master/tensorflow_io/ignite/ops/igfs_ops.cc and what seems to be the build target for this file system here https://github.com/tensorflow/io/blob/master/tensorflow_io/ignite/BUILD#L5-L43. Are these two components likely all that is needed?
The Travis CI config needs to be improved to support additional platforms and improve the maintainability of .travis.yml:
| Platform | Build | Python Test | R 3.5 Test |
|---|---|---|---|
| Linux Python 2.7 | Ubuntu 14.04 | Ubuntu 16.04+18.04 | Ubuntu 16.04+18.04 |
| Linux Python 3.4 | Ubuntu 14.04 | | |
| Linux Python 3.5 | Ubuntu 14.04 | Ubuntu 16.04 | |
| Linux Python 3.6 | Ubuntu 14.04 | Ubuntu 18.04 | |
| MacOS Python 2.x | TBD | TBD | TBD |
| MacOS Python 3.x | TBD | TBD | TBD |
| Windows Python 3.x | TBD | TBD | TBD |
When making an ArrowDataset from a pandas.DataFrame, if the preserve_index flag is set to True, the Dataset iterator will fail with the error:
InternalError: Missing 2-th output from node IteratorGetNext (defined at <ipython-input-42-624c3010a782>:1) = IteratorGetNext[output_shapes=[[], [], []], output_types=[DT_DOUBLE, DT_INT64, DT_INT64], _device="/job:localhost/replica:0/task:0/device:CPU:0"](OneShotIterator)
This is because the preserve_index flag adds an additional column to the record batch, and there is no corresponding column index sent to the op.
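The mismatch in a nutshell: the pandas-to-Arrow conversion with preserve_index=True appends the index as an extra column, so the op should be given one more column index than the DataFrame has columns. An illustrative helper (not the ArrowDataset API):

```python
def expected_column_indices(num_df_columns, preserve_index):
    """Column indices the op should receive for a converted DataFrame."""
    total = num_df_columns + (1 if preserve_index else 0)
    return list(range(total))
```

The fix is either to pass the extra index or to drop the appended index column before building the dataset.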
CRAN is where most R packages are published. We can release the initial version of the package once #6 is closed as well as the following work:
This issue tracks the efforts to release tensorflow-io 0.2.0.
Since TensorFlow 1.13 has not been released, tensorflow-io 0.2.0 will be based on TensorFlow 1.12.
- README.md: add additional supported ops since 0.1.0 (@BryanCutler)
- README.md: add link to individual ops' README.md if possible (@BryanCutler)
- tensorflow==1.12.0, as 1.13 is not compatible (#65/@yongtang)
- tests, so that the build of *.whl files works correctly (#64/@yongtang)
- doc/tutorials, so that end users could see the use cases (defer to tensorflow 1.13)
- RELEASE.md: capture 0.2.0 changes (#66/@yongtang)
- *.whl files: build and push to PyPI.org (@BryanCutler)

If you want to help, you can add a comment to indicate the items you are working on.
Would it be possible to support the TIFF file format?
In TensorFlow, basic image formats of bmp, png, and jpeg are supported. An issue (tensorflow/tensorflow#18250) was opened in TensorFlow repo looking for WebP support.
This issue captures the effort to support WebP in tensorflow-io.
@yongtang I seem unable to download the artifact from PyPI using pip:
$ pip install tensorflow-io
Collecting tensorflow-io
Could not find a version that satisfies the requirement tensorflow-io (from versions: )
No matching distribution found for tensorflow-io
Is there a step I'm missing that we could add to the README?
Referencing: #7
LMDBDataset is part of tf.contrib.data and it naturally fits the tensorflow-io context. It makes sense to migrate LMDBDataset to tensorflow-io.
This issue tracks the effort and progress to set up CI for TensorFlow I/O. We plan to use Google CI, so coordination with TensorFlow's CI might be needed.
In TensorFlow, the libsvm format is supported through tf.contrib.libsvm, which converts libsvm into a sparse tensor format. With the expected deprecation of tf.contrib, libsvm will be deprecated as well.
It would be great to provide continued support for libsvm through the tensorflow-io package. While tf.contrib.libsvm converts libsvm into a sparse tensor, I think in tensorflow-io we could convert libsvm to tf.data directly.
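For reference, each libsvm record is "label index:value index:value ...", which maps naturally onto the (indices, values) pair a SparseTensor is built from. A minimal parsing sketch (the real op would do this in C++; the function name is illustrative):

```python
def parse_libsvm_line(line):
    """Return (label, indices, values) for one libsvm record."""
    parts = line.strip().split()
    label = float(parts[0])
    indices, values = [], []
    for feature in parts[1:]:
        idx, val = feature.split(":")
        indices.append(int(idx))
        values.append(float(val))
    return label, indices, values
```

A tf.data pipeline could then build `tf.sparse.SparseTensor(indices, values, dense_shape)` per record, or batch the triples first.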
When looking through the current image operations and how to add functionality, I quickly came to imagine lots and lots of per-file-format special operations, and I did not really like that picture.
Currently all operations on image and video files (on the file system or as strings) parse the files, extract one (or more) pictures as 2-3 dimensional data tensors, and then throw away all parsing information.
This severely restricts usability: all possible operations would have to be defined per file format, and each operation starts from scratch from an unopened file.
If ImageSets (Single Pictures, Multiple Pictures (Tiff), Videos (Single or multi channel)) would be wrapped in a DT_VARIANT they could become first class objects in TF.
ImageSet Operations could then extract information from the ImageSet like cardinality, specific images, image sizes, ...
Example usage pattern in pseudo TF code:
# Extract Random Image from TIF file
image_set = TiffImageSet('example.tif')
n = ImageSetCardinality(image_set)
index = tf.random.uniform([], minval=0, maxval=n, dtype=tf.dtypes.int64)
image = ImageSetGetImage(image_set, index)
A similar argument could be made to represent single potential images as an ImageSource Object.
This way meta data (focal length, resolution, ...) information could be retrieved from the ImageSource Object using special ImageSource Operations, used in calculations and then used as parameters to retrieve a cropped and scaled subset with another ImageSource Operation.
Example usage:
image_source = Picture('car.jpg')
# Extract Low RES picture from image_source and find license plate
license_plate_area = find_license_plate(image_source)
# Extract high Resolution patch from image_source and read license
plate_image = ImageSourceExtractPatch(image_source, license_plate_area)
tag = read_tag(plate_image)
I believe ImageSource operations could also be made batch friendly to avoid unnecessary copy operations on batching.
This may not be a realistic idea - but I wanted to bring it up as this may be the ideal time to specify a new generic interface. (There are enough existing formats in the repository to verify interface, but not too many to make the task unmanageable.)
Thoughts?
This issue is created to track TensorFlow 2.0 support for tensorflow-io. The most visible change in TensorFlow 2.0 is eager execution by default. We will need to test thoroughly and see if there is any impact on tensorflow-io.
We need to scaffold R package documentation website. pkgdown is a popular choice in R community to generate this automatically through simple configuration. We should at least scaffold the following pages:
Not sure if it is feasible, but it would be interesting if tensorflow could be used for live video processing directly. For that we would need live video stream support in tensorflow-io. That would benefit many applications, I think.
Note video format support (not live video) has already been captured in #11 and #30.
/cc @juwangvsu