tensorflow-recorder's Introduction

TFRecorder

TFRecorder makes it easy to create TFRecords from Pandas DataFrames or CSV files. TFRecorder reads data, transforms it using TensorFlow Transform, and stores it in the TFRecord format using Apache Beam and, optionally, Google Cloud Dataflow. Most importantly, TFRecorder does this without requiring the user to write an Apache Beam pipeline or TensorFlow Transform code.

TFRecorder can convert any Pandas DataFrame or CSV file into TFRecords. If your data includes images, TFRecorder can also serialize those into TFRecords. By default, TFRecorder expects your DataFrame or CSV file to be in the same 'Image CSV' format that Google Cloud Platform's AutoML Vision product uses. However, you can also specify an input data schema using TFRecorder's flexible schema system.

Release Notes

Why TFRecorder?

Using the TFRecord storage format is important for optimal machine learning pipelines and getting the most from your hardware (in cloud or on prem). The TFRecorder project started inside Google Cloud AI Services when we realized we were writing TFRecord conversion code over and over again.

When to use TFRecords:

  • Your model is input bound (reading data is impacting training time).
  • You want to use tf.data.Dataset.
  • Your dataset can't fit into memory.

Installation

Install from GitHub

  1. Clone this repo.
git clone https://github.com/google/tensorflow-recorder.git

For "bleeding edge" changes, check out the dev branch.

  2. From the top directory of the repo, run the following command:
python setup.py install

Install from PyPI

pip install tfrecorder

Usage

Generating TFRecords

You can generate TFRecords from a Pandas DataFrame, CSV file or a directory containing images.

From Pandas DataFrame

TFRecorder has an accessor which enables creation of TFRecord files through the Pandas DataFrame object.

Make sure the DataFrame contains a header identifying each of the columns. In particular, the split column must be specified so that TFRecorder knows how to split the data into train, test and validation sets.

Running on a local machine

import pandas as pd
import tfrecorder

csv_file = '/path/to/images.csv'
df = pd.read_csv(csv_file, names=['split', 'image_uri', 'label'])
df.tensorflow.to_tfr(output_dir='/my/output/path')

Running on Cloud Dataflow

Google Cloud Platform Dataflow workers need to be supplied with the tfrecorder package that you would like to run remotely. To do so, first download or build the package (a Python wheel file), then specify the path to the file when tfrecorder is called.

Step 1: Download or create the wheel file.

To download the wheel from pip: pip download tfrecorder --no-deps

To build from source/git: python setup.py sdist

Step 2: Specify the project, region, and path to the tfrecorder wheel for remote execution.

Cloud Dataflow Requirements

  • The output_dir must be a Google Cloud Storage location.
  • The image files specified in an image_uri column must be located in Google Cloud Storage.
  • If you run TFRecorder from your local machine, you must be authenticated to use Google Cloud.

import pandas as pd
import tfrecorder

df = pd.read_csv(...)
df.tensorflow.to_tfr(
    output_dir='gs://my/bucket',
    runner='DataflowRunner',
    project='my-project',
    region='us-central1',
    tfrecorder_wheel='/path/to/my/tfrecorder.whl')
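
The Cloud Dataflow requirements listed above can be checked before a job is launched. The following is a minimal sketch and not part of TFRecorder: the helper name check_dataflow_paths is hypothetical, and it only looks at the gs:// prefix. It does not verify that the bucket exists or that you are authenticated.

```python
from typing import Iterable, List

GCS_PREFIX = 'gs://'


def check_dataflow_paths(output_dir: str, image_uris: Iterable[str]) -> List[str]:
    """Returns a list of problems found; an empty list means the paths look OK."""
    problems = []
    if not output_dir.startswith(GCS_PREFIX):
        problems.append(f'output_dir is not a GCS location: {output_dir}')
    for uri in image_uris:
        if not uri.startswith(GCS_PREFIX):
            problems.append(f'image_uri is not in GCS: {uri}')
    return problems
```

With a DataFrame in the default format, you could pass df['image_uri'] as the second argument and only call to_tfr() when the returned list is empty.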

From CSV

Using Python interpreter:

import tfrecorder

tfrecorder.convert(
    source='/path/to/data.csv',
    output_dir='gs://my/bucket')

Using the command line:

tfrecorder create-tfrecords \
    --input_data=/path/to/data.csv \
    --output_dir=gs://my/bucket

From an image directory

import tfrecorder

tfrecorder.convert(
    source='/path/to/image_dir',
    output_dir='gs://my/bucket')

The image directory should have the following general structure:

image_dir/
  <dataset split>/
    <label>/
      <image file>

Example:

images/
  TRAIN/
    cat/
      cat001.jpg
    dog/
      dog001.jpg
  VALIDATION/
    cat/
      cat002.jpg
    dog/
      dog002.jpg
  ...
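
To make the expected layout concrete, the sketch below walks a directory of this shape and derives one (split, image_uri, label) row per image file. It illustrates the directory convention only and is not TFRecorder's own implementation. The function names iter_image_rows and write_image_csv are hypothetical.

```python
import csv
import pathlib
from typing import Iterator, Tuple


def iter_image_rows(image_dir: str) -> Iterator[Tuple[str, str, str]]:
    """Yields (split, image_uri, label) for each file under image_dir.

    Assumes the layout <image_dir>/<dataset split>/<label>/<image file>.
    """
    root = pathlib.Path(image_dir)
    for path in sorted(root.glob('*/*/*')):
        if path.is_file():
            # The two directories above the file are the split and the label.
            split, label = path.parts[-3], path.parts[-2]
            yield split, str(path), label


def write_image_csv(image_dir: str, csv_path: str) -> int:
    """Writes the rows to a CSV without a header row; returns the row count."""
    count = 0
    with open(csv_path, 'w', newline='') as f:
        writer = csv.writer(f)
        for row in iter_image_rows(image_dir):
            writer.writerow(row)
            count += 1
    return count
```

For the example tree above, the first row would be TRAIN, the path to cat001.jpg, and cat.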

Loading a TF Dataset from TFRecord files

You can load a TensorFlow dataset from TFRecord files generated by TFRecorder on your local machine.

import tfrecorder

dataset_dict = tfrecorder.load('/path/to/tfrecord_dir')
train = dataset_dict['TRAIN']

Verifying data in TFRecords generated by TFRecorder

Using Python interpreter:

import tfrecorder

tfrecorder.inspect(
    tfrecord_dir='/path/to/tfrecords/',
    split='TRAIN',
    num_records=5,
    output_dir='/tmp/output')

This will generate a CSV file containing structured data and image files representing the images encoded into TFRecords.

Using the command line:

tfrecorder inspect \
    --tfrecord-dir=/path/to/tfrecords/ \
    --split='TRAIN' \
    --num_records=5 \
    --output_dir=/tmp/output

Default Schema

If you don't specify an input schema, TFRecorder expects data to be in the same format as AutoML Vision input. This format looks like a Pandas DataFrame or CSV formatted as:

split  image_uri                  label
TRAIN  gs://my/bucket/image1.jpg  cat

where:

  • split can take on the values TRAIN, VALIDATION, and TEST
  • image_uri specifies a local or Google Cloud Storage location for the image file.
  • label can be either a text-based label, which will be integerized, or an integer.
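
A quick pre-flight check of a CSV against the three rules above might look like the sketch below. It is an illustration rather than TFRecorder code: find_invalid_rows is a hypothetical helper, and it assumes a CSV without a header row, with columns in the order split, image_uri, label.

```python
import csv
import io
from typing import List

ALLOWED_SPLITS = {'TRAIN', 'VALIDATION', 'TEST'}
COLUMNS = ['split', 'image_uri', 'label']


def find_invalid_rows(csv_text: str) -> List[str]:
    """Returns a message for every row that does not fit the image CSV format."""
    errors = []
    reader = csv.reader(io.StringIO(csv_text))
    for line_no, row in enumerate(reader, start=1):
        if len(row) != len(COLUMNS):
            errors.append(
                f'line {line_no}: expected {len(COLUMNS)} columns, got {len(row)}')
            continue
        split, image_uri, label = row
        if split not in ALLOWED_SPLITS:
            errors.append(f'line {line_no}: unknown split {split!r}')
        if not image_uri or not label:
            errors.append(f'line {line_no}: empty image_uri or label')
    return errors
```

The check is case-sensitive, so a value such as train is reported as an unknown split.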

Flexible Schema

TFRecorder's flexible schema system allows you to use any schema you want for your input data.

For example, the default image CSV schema input can be defined like this:

import pandas as pd
import tfrecorder
from tfrecorder import input_schema
from tfrecorder import types

image_csv_schema = input_schema.Schema({
    'split': types.SplitKey,
    'image_uri': types.ImageUri,
    'label': types.StringLabel
})

# You can then pass the schema to `df.tensorflow.to_tfr` (or `tfrecorder.convert`).

df = pd.read_csv(...)
df.tensorflow.to_tfr(
    output_dir='gs://my/bucket',
    schema=image_csv_schema,
    runner='DataflowRunner',
    project='my-project',
    region='us-central1')

Flexible Schema Example

Imagine that you have a dataset that you would like to convert to TFRecords that looks like this:

split  x     y   label
TRAIN  0.32  42  1

You can use TFRecorder as shown below:

import pandas as pd
import tfrecorder
from tfrecorder import input_schema
from tfrecorder import types

# First create a schema map
schema = input_schema.Schema({
    'split': types.SplitKey,
    'x': types.FloatInput,
    'y': types.IntegerInput,
    'label': types.IntegerLabel,
})

# Now call TFRecorder with the specified schema_map

df = pd.read_csv(...)
df.tensorflow.to_tfr(
    output_dir='gs://my/bucket',
    schema=schema,
    runner='DataflowRunner',
    project='my-project',
    region='us-central1')

After you call TFRecorder's to_tfr() function, TFRecorder creates an Apache Beam pipeline, either locally or, in this case, using Google Cloud's Dataflow runner. This Beam pipeline uses the schema map to identify the types you've associated with each data column, then processes your data using TensorFlow Transform and TFRecorder's image processing functions to convert the data into TFRecords.

Supported types

TFRecorder's schema system supports several types. You can use these types by referencing them in the schema map. Each type informs TFRecorder how to treat your DataFrame columns.

types.SplitKey

  • A split key is required for TFRecorder at this time.
  • Only one split key is allowed.
  • Specifies a split key that TFRecorder will use to partition the input dataset on.
  • Allowed values are 'TRAIN', 'VALIDATION', and 'TEST'.

Note: If you do not want your data to be partitioned, include a column with types.SplitKey and set all the elements to TRAIN.
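
Conceptually, partitioning on a split key groups rows by the value in that column. The sketch below shows the idea on plain dictionaries. It is not how TFRecorder implements partitioning (that happens inside its Beam pipeline), the name partition_by_split is hypothetical, and defaulting a missing split to TRAIN is an assumption made here to mirror the note above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

ALLOWED_SPLITS = ('TRAIN', 'VALIDATION', 'TEST')


def partition_by_split(
    rows: Iterable[dict], split_column: str = 'split') -> Dict[str, List[dict]]:
    """Groups rows by split value; rows without a split are treated as TRAIN."""
    partitions = defaultdict(list)
    for row in rows:
        split = row.get(split_column) or 'TRAIN'
        if split not in ALLOWED_SPLITS:
            raise ValueError(f'Unsupported split value: {split!r}')
        partitions[split].append(row)
    return dict(partitions)
```

Only splits that actually occur in the data appear as keys in the result.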

types.ImageUri

  • Specifies the path to an image. When specified, TFRecorder will load the specified image and store the image as a base64 encoded tf.string in the key 'image' along with the height, width, and image channels as integers using the keys 'image_height', 'image_width', and 'image_channels'.
  • A schema can contain only one ImageUri column.
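
The record layout described above can be pictured with a small standard-library sketch. Several things here are assumptions: the function name png_to_record is hypothetical, the sketch handles PNG files only (it reads the dimensions from the PNG IHDR header), and it base64-encodes the raw file bytes. How TFRecorder itself decodes and encodes images is not stated in this document.

```python
import base64
import struct

PNG_SIGNATURE = b'\x89PNG\r\n\x1a\n'
# PNG color type -> number of samples stored per pixel.
# Palette images (type 3) store one index per pixel.
CHANNELS_BY_COLOR_TYPE = {0: 1, 2: 3, 3: 1, 4: 2, 6: 4}


def png_to_record(png_bytes: bytes) -> dict:
    """Builds a dict with the keys named above from raw PNG bytes."""
    if png_bytes[:8] != PNG_SIGNATURE or png_bytes[12:16] != b'IHDR':
        raise ValueError('Not a PNG file')
    # IHDR data starts at byte 16: width, height (4 bytes each, big-endian),
    # then bit depth and color type (1 byte each).
    width, height, _bit_depth, color_type = struct.unpack(
        '>IIBB', png_bytes[16:26])
    return {
        'image': base64.b64encode(png_bytes),
        'image_height': height,
        'image_width': width,
        'image_channels': CHANNELS_BY_COLOR_TYPE[color_type],
    }
```

Decoding the 'image' value with base64.b64decode returns the original bytes.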

types.IntegerInput

  • Specifies an int input.
  • Will be scaled to mean 0, variance 1.

types.FloatInput

  • Specifies a float input.
  • Will be scaled to mean 0, variance 1.
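
Scaling to mean 0 and variance 1 is commonly called z-score scaling: subtract the column mean, then divide by the standard deviation. The sketch below shows the arithmetic in plain Python. It is illustrative only: TFRecorder delegates the transformation to TensorFlow Transform, the name z_score is hypothetical, and the use of the population (not sample) standard deviation is an assumption.

```python
import statistics
from typing import List, Sequence


def z_score(values: Sequence[float]) -> List[float]:
    """Scales values to mean 0 and (population) variance 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant column has no spread; map every value to 0.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]
```

Both statistics need a full pass over the column before any value can be scaled.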

types.CategoricalInput

  • Specifies a string input.
  • Vocabulary computed and output integerized.

types.IntegerLabel

  • Specifies an integer target.
  • Not transformed.

types.StringLabel

  • Specifies a string target.
  • Vocabulary computed and output integerized.
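
"Vocabulary computed and output integerized" means that each distinct string is assigned an integer index and the column is replaced by those indices. A minimal plain-Python sketch follows. The function name integerize is hypothetical, and the index order used here (most frequent first, ties broken alphabetically) is an assumption. The order TFRecorder produces may differ.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def integerize(labels: Sequence[str]) -> Tuple[List[int], Dict[str, int]]:
    """Maps each distinct string to an integer index.

    Indices are assigned by descending frequency, ties broken alphabetically.
    This ordering is an assumption made for the sketch.
    """
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    vocabulary = {label: index for index, label in enumerate(ordered)}
    return [vocabulary[label] for label in labels], vocabulary
```

Keep the vocabulary alongside the data so that predicted integers can be mapped back to the original strings.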

Contributing

Pull requests are welcome. Please see our code of conduct and contributing guide.

Need help with using AI in the cloud? Visit Google Cloud AI Services.

tensorflow-recorder's People

Contributors

cfezequiel, dependabot[bot], jmarrietar, klmilam, lc0, mbernico

tensorflow-recorder's Issues

Automated testing for Jupyter notebooks

Is your feature request related to a problem? Please describe.
There's currently no way to automate testing for Jupyter notebooks.
They would need to be run manually to make sure that all the cells can run successfully.
Code changes can sometimes break notebooks.

Describe the solution you'd like
Add a way to test notebooks through a simple command (e.g. make testnb) and integrate this test into the CI pipeline.

Describe alternatives you've considered
Manual testing: This takes time and can also be prone to error. The notebook will be marked as modified in git whenever it is opened.

No testing: The notebooks may become stale over time and wouldn't work with newer changes to TFRecorder.
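
One possible shape for such a test command is sketched below. It is a suggestion, not something that exists in the repository: the function names are hypothetical, and it assumes the jupyter command with nbconvert is installed. Writing the executed copies to a temporary directory avoids the notebooks being marked as modified in git.

```python
import pathlib
import subprocess
import tempfile
from typing import List


def find_notebooks(root: str) -> List[pathlib.Path]:
    """Returns all notebooks under root, skipping Jupyter checkpoint copies."""
    return sorted(
        p for p in pathlib.Path(root).rglob('*.ipynb')
        if '.ipynb_checkpoints' not in p.parts)


def build_command(notebook: pathlib.Path, output_dir: str) -> List[str]:
    """Builds an nbconvert command that executes a notebook.

    The executed copy goes to output_dir, so the original file is untouched.
    """
    return [
        'jupyter', 'nbconvert', '--to', 'notebook', '--execute',
        '--output-dir', output_dir, str(notebook),
    ]


def run_all(root: str) -> List[pathlib.Path]:
    """Executes every notebook under root and returns the ones that failed."""
    failed = []
    with tempfile.TemporaryDirectory() as output_dir:
        for notebook in find_notebooks(root):
            result = subprocess.run(build_command(notebook, output_dir))
            if result.returncode != 0:
                failed.append(notebook)
    return failed
```

A make testnb target could call run_all() and exit with a non-zero status when the returned list is not empty, which is enough for a CI pipeline to flag a broken notebook.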

Make Jupyter Notebooks code example executables in Colab

Is your feature request related to a problem? Please describe.
As a user, a nice-to-have feature would be the ability to open the notebook examples in a self-contained Colab environment and play with the library there.

Describe alternatives you've considered
Maybe add a badge like the "Open In Colab" badge, which opens the notebook in Colab when clicked.

Train, Test and Val TFRecords files partitions always created

Describe the bug
Train, Test and Val TFRecords files are always created, even if I only specify TRAIN in the CSV file.

To Reproduce
Change Split to only TRAIN

split,image_uri,label
TRAIN,../tfrecorder/test_data/images/cat/cat-640x853-1.jpg,cat
TRAIN,../tfrecorder/test_data/images/cat/cat-800x600-2.jpg,cat
TRAIN,../tfrecorder/test_data/images/cat/cat-800x600-3.jpg,cat
TRAIN,../tfrecorder/test_data/images/goat/goat-640x640-1.jpg,goat
TRAIN,../tfrecorder/test_data/images/goat/goat-320x320-2.jpg,goat
TRAIN,../tfrecorder/test_data/images/goat/goat-640x427-3.jpg,goat

Expected behavior
Only create a partition for TRAIN TFRecords.

Screenshots
[Screenshot: Screen Shot 2020-08-13 at 10 31 50 PM]

System (please complete the following information):

  • OS: [ iOS]
  • Python Version: [3.7.4]
  • TensorFlow Version: [2.2.0]

Additional context
Not sure if this is a bug or expected behavior. But as a user, if my CSV file only specifies TRAIN, it is strange that the other files (test and val) are also created, but without images.

Code examples for using TFRecorder TFRecords in a tf.data.Dataset

Is your feature request related to a problem? Please describe.
No

Describe the solution you'd like
As a user I would like to see an example of how to read the output of TFRecorder in as a tf.data.Dataset.

Pull request template not showing

Describe the bug
When creating a pull request, the text fields should be pre-filled with text from the pull request template, but this is not happening.

To Reproduce
Steps to reproduce the behavior:

  1. Go to Pull requests tab
  2. Click on New pull request
  3. Choose two branches to compare (e.g. master vs. dev)
  4. Click Create pull request

Expected behavior
See Describe the bug

System (please complete the following information):
NA, on GitHub repo page

Additional context
Possible fix is to move the template to .github/PULL_REQUEST_TEMPLATE.md
relevant SO question

Decode Error when trying to save structured df to tfrecord

Describe the bug
I have been trying to follow the sample notebook on converting structured data and cannot seem to get around an error.

To Reproduce
I copied the whole session, so see the code block at the end. Basically:

  • install tfrecorder, avro-python3==1.8.2 and boto3
  • open ipython
  • make df and call tfrecorder.convert
  • get NotImplementedError: TFXIO should be used to decode CSV. [while running 'DecodeCSV']
  • the error points to tensorflow_transform/coders/csv_coder.py

Expected behavior
I expect this to run as in the notebook.

System:

  • OS: macOS 11.4
  • Python Version: 3.7
  • TensorFlow Version: 2.3.1 (installed by tfrecorder)
  • TensorFlow Transform Version: 0.26.0 (installed by tfrecorder)

Additional context
I get the same error if I do
df.tensorflow.to_tfr(output_dir='./tfrecord', schema=tfrecorder.input_schema.Schema(schema))
I am saving locally here but would like to save to S3.

Steps to reproduce the behavior:

conda create --name venv python=3.7
conda activate venv
(venv) pip install tfrecorder
(venv) pip install avro-python3==1.8.2  # to get around the initial avro error
(venv) pip install boto3 
(venv) ipython
Python 3.7.10 (default, Feb 26 2021, 10:16:00) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.25.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]:  import pandas as pd; import numpy as np; import tfrecorder

In [2]: df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),columns=['a', 'b', 'c'])

In [3]: schema = {'a': tfrecorder.types.IntegerInput, 'b': tfrecorder.types.IntegerInput, 'c': tfrecorder.types.IntegerInput}

In [4]: df['split'] = 'TRAIN';schema['split'] = tfrecorder.types.SplitKey

In [5]: results = tfrecorder.convert(df, output_dir='./tfrecord', schema=tfrecorder.input_schema.Schema(schema))

2021-07-23 12:07:32.736923: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-07-23 12:07:32.759337: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fbc2a70b6c0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-07-23 12:07:32.759373: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
~/opt/anaconda3/envs/venv/lib/python3.7/site-packages/apache_beam/runners/common.cpython-37m-darwin.so in apache_beam.runners.common.DoFnRunner.process()

~/opt/anaconda3/envs/venv/lib/python3.7/site-packages/apache_beam/runners/common.cpython-37m-darwin.so in apache_beam.runners.common.SimpleInvoker.invoke_process()

~/opt/anaconda3/envs/venv/lib/python3.7/site-packages/apache_beam/transforms/core.py in <lambda>(x)
   1569   else:
-> 1570     wrapper = lambda x: [fn(x)]
   1571 

~/opt/anaconda3/envs/venv/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py in new_func(*args, **kwargs)
    323               instructions)
--> 324       return func(*args, **kwargs)
    325     return tf_decorator.make_decorator(

~/opt/anaconda3/envs/venv/lib/python3.7/site-packages/tensorflow_transform/coders/csv_coder.py in decode(self, csv_string)
    317   def decode(self, csv_string):
--> 318     raise NotImplementedError(_DECODE_DEPRECATION_MESSAGE)

NotImplementedError: TFXIO should be used to decode CSV. 

During handling of the above exception, another exception occurred:

NotImplementedError                       Traceback (most recent call last)
<ipython-input-6-b9dfc043634e> in <module>
----> 1 results = tfrecorder.convert(df, output_dir='./tfrecord', schema=tfrecorder.input_schema.Schema(schema))

~/opt/anaconda3/envs/venv/lib/python3.7/site-packages/tfrecorder/converter.py in convert(source, output_dir, schema, header, names, runner, project, region, tfrecorder_wheel, dataflow_options, job_label, compression, num_shards)
    322   )
    323 
--> 324   result = p.run()
    325 
    326   if runner == 'DirectRunner':

~/opt/anaconda3/envs/venv/lib/python3.7/site-packages/apache_beam/pipeline.py in run(self, test_runner_api)
    562         finally:
    563           shutil.rmtree(tmpdir)
--> 564       return self.runner.run_pipeline(self, self._options)
    565     finally:
    566       shutil.rmtree(self.local_tempdir, ignore_errors=True)

~/opt/anaconda3/envs/venv/lib/python3.7/site-packages/apache_beam/runners/direct/direct_runner.py in run_pipeline(self, pipeline, options)
    129       runner = BundleBasedDirectRunner()
    130 
--> 131     return runner.run_pipeline(pipeline, options)
    132 
    133 

~/opt/anaconda3/envs/venv/lib/python3.7/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py in run_pipeline(self, pipeline, options)
    190 
    191     self._latest_run_result = self.run_via_runner_api(
--> 192         pipeline.to_runner_api(default_environment=self._default_environment))
    193     return self._latest_run_result
    194 

~/opt/anaconda3/envs/venv/lib/python3.7/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py in run_via_runner_api(self, pipeline_proto)
    200     # TODO(pabloem, BEAM-7514): Create a watermark manager (that has access to
    201     #   the teststream (if any), and all the stages).
--> 202     return self.run_stages(stage_context, stages)
    203 
    204   @contextlib.contextmanager

~/opt/anaconda3/envs/venv/lib/python3.7/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py in run_stages(self, stage_context, stages)
    366           stage_results = self._run_stage(
    367               runner_execution_context,
--> 368               bundle_context_manager,
    369           )
    370           monitoring_infos_by_stage[stage.name] = (

~/opt/anaconda3/envs/venv/lib/python3.7/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py in _run_stage(self, runner_execution_context, bundle_context_manager)
    562               input_timers,
    563               expected_timer_output,
--> 564               bundle_manager)
    565 
    566       final_result = merge_results(last_result)

~/opt/anaconda3/envs/venv/lib/python3.7/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py in _run_bundle(self, runner_execution_context, bundle_context_manager, data_input, data_output, input_timers, expected_timer_output, bundle_manager)
    602 
    603     result, splits = bundle_manager.process_bundle(
--> 604         data_input, data_output, input_timers, expected_timer_output)
    605     # Now we collect all the deferred inputs remaining from bundle execution.
    606     # Deferred inputs can be:

~/opt/anaconda3/envs/venv/lib/python3.7/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py in process_bundle(self, inputs, expected_outputs, fired_timers, expected_output_timers, dry_run)
    903             process_bundle_descriptor.id,
    904             cache_tokens=[next(self._cache_token_generator)]))
--> 905     result_future = self._worker_handler.control_conn.push(process_bundle_req)
    906 
    907     split_results = []  # type: List[beam_fn_api_pb2.ProcessBundleSplitResponse]

~/opt/anaconda3/envs/venv/lib/python3.7/site-packages/apache_beam/runners/portability/fn_api_runner/worker_handlers.py in push(self, request)
    376       self._uid_counter += 1
    377       request.instruction_id = 'control_%s' % self._uid_counter
--> 378     response = self.worker.do_instruction(request)
    379     return ControlFuture(request.instruction_id, response)
    380 

~/opt/anaconda3/envs/venv/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py in do_instruction(self, request)
    600       # E.g. if register is set, this will call self.register(request.register))
    601       return getattr(self, request_type)(
--> 602           getattr(request, request_type), request.instruction_id)
    603     else:
    604       raise NotImplementedError

~/opt/anaconda3/envs/venv/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py in process_bundle(self, request, instruction_id)
    637         with self.maybe_profile(instruction_id):
    638           delayed_applications, requests_finalization = (
--> 639               bundle_processor.process_bundle(instruction_id))
    640           monitoring_infos = bundle_processor.monitoring_infos()
    641           monitoring_infos.extend(self.state_cache_metrics_fn())

~/opt/anaconda3/envs/venv/lib/python3.7/site-packages/apache_beam/runners/worker/bundle_processor.py in process_bundle(self, instruction_id)
    992           elif isinstance(element, beam_fn_api_pb2.Elements.Data):
    993             input_op_by_transform_id[element.transform_id].process_encoded(
--> 994                 element.data)
    995 
    996       # Finish all operations.

~/opt/anaconda3/envs/venv/lib/python3.7/site-packages/apache_beam/runners/worker/bundle_processor.py in process_encoded(self, encoded_windowed_values)
    220       decoded_value = self.windowed_coder_impl.decode_from_stream(
    221           input_stream, True)
--> 222       self.output(decoded_value)
    223 
    224   def monitoring_infos(self, transform_id, tag_to_pcollection_id):

~/opt/anaconda3/envs/venv/lib/python3.7/site-packages/apache_beam/runners/worker/operations.cpython-37m-darwin.so in apache_beam.runners.worker.operations.Operation.output()

~/opt/anaconda3/envs/venv/lib/python3.7/site-packages/apache_beam/runners/worker/operations.cpython-37m-darwin.so in apache_beam.runners.worker.operations.Operation.output()

~/opt/anaconda3/envs/venv/lib/python3.7/site-packages/apache_beam/runners/worker/operations.cpython-37m-darwin.so in apache_beam.runners.worker.operations.SingletonConsumerSet.receive()

~/opt/anaconda3/envs/venv/lib/python3.7/site-packages/apache_beam/runners/worker/operations.cpython-37m-darwin.so in apache_beam.runners.worker.operations.DoOperation.process()

~/opt/anaconda3/envs/venv/lib/python3.7/site-packages/apache_beam/runners/worker/operations.cpython-37m-darwin.so in apache_beam.runners.worker.operations.DoOperation.process()

~/opt/anaconda3/envs/venv/lib/python3.7/site-packages/apache_beam/runners/common.cpython-37m-darwin.so in apache_beam.runners.common.DoFnRunner.process()

~/opt/anaconda3/envs/venv/lib/python3.7/site-packages/apache_beam/runners/common.cpython-37m-darwin.so in apache_beam.runners.common.DoFnRunner._reraise_augmented()

~/opt/anaconda3/envs/venv/lib/python3.7/site-packages/apache_beam/runners/common.cpython-37m-darwin.so in apache_beam.runners.common.DoFnRunner.process()

~/opt/anaconda3/envs/venv/lib/python3.7/site-packages/apache_beam/runners/common.cpython-37m-darwin.so in apache_beam.runners.common.SimpleInvoker.invoke_process()

~/opt/anaconda3/envs/venv/lib/python3.7/site-packages/apache_beam/runners/common.cpython-37m-darwin.so in apache_beam.runners.common._OutputProcessor.process_outputs()

~/opt/anaconda3/envs/venv/lib/python3.7/site-packages/apache_beam/runners/worker/operations.cpython-37m-darwin.so in apache_beam.runners.worker.operations.SingletonConsumerSet.receive()

~/opt/anaconda3/envs/venv/lib/python3.7/site-packages/apache_beam/runners/worker/operations.cpython-37m-darwin.so in apache_beam.runners.worker.operations.DoOperation.process()

~/opt/anaconda3/envs/venv/lib/python3.7/site-packages/apache_beam/runners/worker/operations.cpython-37m-darwin.so in apache_beam.runners.worker.operations.DoOperation.process()

~/opt/anaconda3/envs/venv/lib/python3.7/site-packages/apache_beam/runners/common.cpython-37m-darwin.so in apache_beam.runners.common.DoFnRunner.process()

~/opt/anaconda3/envs/venv/lib/python3.7/site-packages/apache_beam/runners/common.cpython-37m-darwin.so in apache_beam.runners.common.DoFnRunner._reraise_augmented()

~/opt/anaconda3/envs/venv/lib/python3.7/site-packages/apache_beam/runners/common.cpython-37m-darwin.so in apache_beam.runners.common.DoFnRunner.process()

~/opt/anaconda3/envs/venv/lib/python3.7/site-packages/apache_beam/runners/common.cpython-37m-darwin.so in apache_beam.runners.common.SimpleInvoker.invoke_process()

~/opt/anaconda3/envs/venv/lib/python3.7/site-packages/apache_beam/runners/common.cpython-37m-darwin.so in apache_beam.runners.common._OutputProcessor.process_outputs()

~/opt/anaconda3/envs/venv/lib/python3.7/site-packages/apache_beam/runners/worker/operations.cpython-37m-darwin.so in apache_beam.runners.worker.operations.SingletonConsumerSet.receive()

~/opt/anaconda3/envs/venv/lib/python3.7/site-packages/apache_beam/runners/worker/operations.cpython-37m-darwin.so in apache_beam.runners.worker.operations.DoOperation.process()

~/opt/anaconda3/envs/venv/lib/python3.7/site-packages/apache_beam/runners/worker/operations.cpython-37m-darwin.so in apache_beam.runners.worker.operations.DoOperation.process()

~/opt/anaconda3/envs/venv/lib/python3.7/site-packages/apache_beam/runners/common.cpython-37m-darwin.so in apache_beam.runners.common.DoFnRunner.process()

~/opt/anaconda3/envs/venv/lib/python3.7/site-packages/apache_beam/runners/common.cpython-37m-darwin.so in apache_beam.runners.common.DoFnRunner._reraise_augmented()

~/opt/anaconda3/envs/venv/lib/python3.7/site-packages/apache_beam/runners/common.cpython-37m-darwin.so in apache_beam.runners.common.DoFnRunner.process()

~/opt/anaconda3/envs/venv/lib/python3.7/site-packages/apache_beam/runners/common.cpython-37m-darwin.so in apache_beam.runners.common.SimpleInvoker.invoke_process()

~/opt/anaconda3/envs/venv/lib/python3.7/site-packages/apache_beam/runners/common.cpython-37m-darwin.so in apache_beam.runners.common._OutputProcessor.process_outputs()

~/opt/anaconda3/envs/venv/lib/python3.7/site-packages/apache_beam/runners/worker/operations.cpython-37m-darwin.so in apache_beam.runners.worker.operations.SingletonConsumerSet.receive()

~/opt/anaconda3/envs/venv/lib/python3.7/site-packages/apache_beam/runners/worker/operations.cpython-37m-darwin.so in apache_beam.runners.worker.operations.DoOperation.process()

~/opt/anaconda3/envs/venv/lib/python3.7/site-packages/apache_beam/runners/worker/operations.cpython-37m-darwin.so in apache_beam.runners.worker.operations.DoOperation.process()

~/opt/anaconda3/envs/venv/lib/python3.7/site-packages/apache_beam/runners/common.cpython-37m-darwin.so in apache_beam.runners.common.DoFnRunner.process()

~/opt/anaconda3/envs/venv/lib/python3.7/site-packages/apache_beam/runners/common.cpython-37m-darwin.so in apache_beam.runners.common.DoFnRunner._reraise_augmented()

~/opt/anaconda3/envs/venv/lib/python3.7/site-packages/apache_beam/runners/common.cpython-37m-darwin.so in apache_beam.runners.common.DoFnRunner.process()

~/opt/anaconda3/envs/venv/lib/python3.7/site-packages/apache_beam/runners/common.cpython-37m-darwin.so in apache_beam.runners.common.SimpleInvoker.invoke_process()

~/opt/anaconda3/envs/venv/lib/python3.7/site-packages/apache_beam/runners/common.cpython-37m-darwin.so in apache_beam.runners.common._OutputProcessor.process_outputs()

~/opt/anaconda3/envs/venv/lib/python3.7/site-packages/apache_beam/runners/worker/operations.cpython-37m-darwin.so in apache_beam.runners.worker.operations.SingletonConsumerSet.receive()

~/opt/anaconda3/envs/venv/lib/python3.7/site-packages/apache_beam/runners/worker/operations.cpython-37m-darwin.so in apache_beam.runners.worker.operations.DoOperation.process()

~/opt/anaconda3/envs/venv/lib/python3.7/site-packages/apache_beam/runners/worker/operations.cpython-37m-darwin.so in apache_beam.runners.worker.operations.DoOperation.process()

~/opt/anaconda3/envs/venv/lib/python3.7/site-packages/apache_beam/runners/common.cpython-37m-darwin.so in apache_beam.runners.common.DoFnRunner.process()

~/opt/anaconda3/envs/venv/lib/python3.7/site-packages/apache_beam/runners/common.cpython-37m-darwin.so in apache_beam.runners.common.DoFnRunner._reraise_augmented()

~/opt/anaconda3/envs/venv/lib/python3.7/site-packages/apache_beam/runners/common.cpython-37m-darwin.so in apache_beam.runners.common.DoFnRunner.process()

~/opt/anaconda3/envs/venv/lib/python3.7/site-packages/apache_beam/runners/common.cpython-37m-darwin.so in apache_beam.runners.common.SimpleInvoker.invoke_process()

~/opt/anaconda3/envs/venv/lib/python3.7/site-packages/apache_beam/transforms/core.py in <lambda>(x)
   1568     wrapper = lambda x, *args, **kwargs: [fn(x, *args, **kwargs)]
   1569   else:
-> 1570     wrapper = lambda x: [fn(x)]
   1571 
   1572   label = 'Map(%s)' % ptransform.label_from_callable(fn)

~/opt/anaconda3/envs/venv/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py in new_func(*args, **kwargs)
    322               'in a future version' if date is None else ('after %s' % date),
    323               instructions)
--> 324       return func(*args, **kwargs)
    325     return tf_decorator.make_decorator(
    326         func, new_func, 'deprecated',

~/opt/anaconda3/envs/venv/lib/python3.7/site-packages/tensorflow_transform/coders/csv_coder.py in decode(self, csv_string)
    316   @deprecation.deprecated(None, _DECODE_DEPRECATION_MESSAGE)
    317   def decode(self, csv_string):
--> 318     raise NotImplementedError(_DECODE_DEPRECATION_MESSAGE)

NotImplementedError: TFXIO should be used to decode CSV.  [while running 'DecodeCSV']

In [6]: 


ImportError: cannot import name 'Parse' from 'avro.schema'

I installed tensorflow-recorder using the command: pip install tfrecorder. The package installs without any error, however, I am getting the following error when trying to import the package.

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-1-63673124a3e9> in <module>
      4 import pandas as pd
      5 
----> 6 import tfrecorder
      7 import tensorflow as tf

~\Anaconda3\envs\tfrecords\lib\site-packages\tfrecorder\__init__.py in <module>
     17 """Imports."""
     18 
---> 19 from tfrecorder import accessor
     20 from tfrecorder.converter import convert
     21 from tfrecorder.dataset_loader import load

~\Anaconda3\envs\tfrecords\lib\site-packages\tfrecorder\accessor.py in <module>
     26 from IPython.core import display
     27 
---> 28 from tfrecorder import converter
     29 from tfrecorder import constants
     30 from tfrecorder import input_schema

~\Anaconda3\envs\tfrecords\lib\site-packages\tfrecorder\converter.py in <module>
     25 from typing import Any, Dict, Optional, Sequence, Tuple, Union
     26 
---> 27 import apache_beam as beam
     28 import pandas as pd
     29 import tensorflow as tf

~\Anaconda3\envs\tfrecords\lib\site-packages\apache_beam\__init__.py in <module>
     94 
     95 from apache_beam import coders
---> 96 from apache_beam import io
     97 from apache_beam import typehints
     98 from apache_beam import version

~\Anaconda3\envs\tfrecords\lib\site-packages\apache_beam\io\__init__.py in <module>
     21 from __future__ import absolute_import
     22 
---> 23 from apache_beam.io.avroio import *
     24 from apache_beam.io.filebasedsink import *
     25 from apache_beam.io.iobase import Read

~\Anaconda3\envs\tfrecords\lib\site-packages\apache_beam\io\avroio.py in <module>
     56 from avro import io as avroio
     57 from avro import datafile
---> 58 from avro.schema import Parse
     59 from fastavro.read import block_reader
     60 from fastavro.write import Writer

ImportError: cannot import name 'Parse' from 'avro.schema' (C:\Users\xxxxx\Anaconda3\envs\tfrecords\lib\site-packages\avro\schema.py)

Solutions Tried:

  1. Tried installing tensorflow-recorder through pip3 after learning that it might be a Python 2 / Python 3 issue.
  2. Tried removing avro to force the package to use avro-python3, but it looks like it needs avro.

Can somebody please help me with this?
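
The traceback shows Apache Beam's avroio.py importing Parse from avro.schema, while the installed avro/schema.py does not provide that name. A minimal diagnostic sketch follows. It assumes (the report does not confirm this) that the mismatch comes from Avro distributions that share the avro import name but spell the parser differently (Parse versus parse). The helper names describe_avro_schema_api and check_installed_avro are hypothetical.

```python
import importlib
from types import SimpleNamespace


def describe_avro_schema_api(schema_module):
    """Return which schema-parser spellings a module object exposes."""
    found = [name for name in ("Parse", "parse") if hasattr(schema_module, name)]
    if not found:
        return "no schema parser found"
    return "exposes: " + ", ".join(found)


def check_installed_avro():
    """Inspect the avro.schema module that Python would actually import."""
    try:
        schema_module = importlib.import_module("avro.schema")
    except ImportError as err:
        return "avro.schema is not importable: {}".format(err)
    location = getattr(schema_module, "__file__", "unknown location")
    return "{} ({})".format(describe_avro_schema_api(schema_module), location)


# Stand-in objects show the helper's output without requiring Avro.
capital_style = SimpleNamespace(Parse=len)
lowercase_style = SimpleNamespace(parse=len)
```

Running print(check_installed_avro()) in the affected environment reports which spelling is present and which file it came from, which can then be compared against the import on line 58 of avroio.py in the traceback.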

Visual / IPython based TFRecord Checking

Is your feature request related to a problem? Please describe.
As a researcher, I want to be able to check/inspect my created TFRecords so that I can be certain my data was properly manipulated.
Describe the solution you'd like
I would like to check my tfrecord files in ipython and get output in jupyter notebooks
I would like to verify that my images were properly encoded.
I would like to know how my labels were transformed (TFT)

Describe alternatives you've considered
None

Additional context
Scientific users need to test and validate the TFRecorder transformation process as part of their requirements for reproducibility.

Check TFRecord Example / Instructions

Is your feature request related to a problem? Please describe.
No example or instructions exist for checking an existing TFRecord.

Describe the solution you'd like
Add instructions for checking a TFRecord's format.
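
As a starting point for such instructions, here is a minimal sketch that checks only the on-disk framing of a TFRecord file, using the Python standard library. Each record is stored as an 8-byte little-endian length, a 4-byte checksum of the length, the payload, and a 4-byte checksum of the payload. This sketch reads the checksums but does not verify them, and the function names are hypothetical.

```python
import struct


def iter_tfrecord_payloads(path):
    """Yield the raw payload bytes of each record in a TFRecord file.

    Checksums are read and skipped, not verified.
    """
    with open(path, "rb") as handle:
        while True:
            header = handle.read(12)  # 8-byte length + 4-byte length checksum
            if not header:
                return  # clean end of file
            if len(header) < 12:
                raise ValueError("truncated record header")
            (length,) = struct.unpack("<Q", header[:8])
            payload = handle.read(length)
            footer = handle.read(4)  # 4-byte payload checksum
            if len(payload) < length or len(footer) < 4:
                raise ValueError("truncated record body")
            yield payload


def summarize_tfrecord(path):
    """Return (record_count, total_payload_bytes) as a quick sanity check."""
    sizes = [len(payload) for payload in iter_tfrecord_payloads(path)]
    return len(sizes), sum(sizes)
```

Each payload is normally a serialized tf.train.Example, so with TensorFlow installed it can be decoded for inspection with tf.train.Example.FromString(payload).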

Misleading example in readme

Describe the bug
A clear and concise description of what the bug is.

To Reproduce
Steps to reproduce the behavior:

  1. pip install tfrecorder on a clean environment
  2. Follow example in README.md
  3. See error

Expected behavior
I do not know whether the expected behavior differs from the code snippet provided, which is just a copy from the README.md in the repo.

import pandas as pd
import tfrecorder

df = pd.read_csv(...)
df.tensorflow.to_tfr(output_dir='gs://my/bucket')

Screenshots

image

System (please complete the following information):

  • OS: macOS 10.15.6
  • Python Version: 3.6
  • TensorFlow Version: 2.2 (the one that gets installed with tfrecorder)
  • TensorFlow Transform Version: 0.23 (the one that gets installed with tfrecorder)

Additional context
I actually checked the code because it did not make sense to me that you would modify the DataFrame or pandas to make the library work. I would expect a single function that accepts a DataFrame, perhaps with some predefined column names. On the other hand, the example is probably just incomplete.

How can it be used on an Object Detection dataset?

Is your feature request related to a problem? Please describe.
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]

Describe the solution you'd like
A clear and concise description of what you want to happen.

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.

Improve Installation Process [setup.py file location]

Is your feature request related to a problem? Please describe.
The most likely installation process for most users may result in the Dataflow runner not working as the setup.py file for the Dataflow workers will not be in the right place.

Describe the solution you'd like
Assume users will only pip install and not clone the repo. Place the setup.py file into the main library folder and update setup.py location in beam_pipeline.py

Describe alternatives you've considered
Got it working after moving the setup.py file into the main python ..python3.7/sitepackages/ folder.

Additional context
Add any other context or screenshots about the feature request here.

'CsvCoder' object has no attribute 'decode'

Describe the bug
When using df.tensorflow.to_tfr() an AttributeError is raised: 'CsvCoder' object has no attribute 'decode'

To Reproduce
All I am doing is defining a schema and using df.tensorflow.to_tfr(). All columns in the schema are either types.SplitKey, types.IntegerLabel, or types.IntegerInput.

Expected behavior
Dataframe is successfully output to a TFRecord.

System (please complete the following information):

  • OS: Ubuntu-20.04 via WSL2
  • Python Version: 3.8.1
  • TensorFlow Version: 2.3.1
  • TensorFlow Transform Version: 1.2.0

Additional context

AttributeError: 'CsvCoder' object has no attribute 'decode'
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-8-95f4c12ae7c5> in <module>
----> 1 sample.tensorflow.to_tfr(output_dir=f'{PROCESSED_DATA_DIR}/df-trips+weather_split_1', schema=schema)

~/.local/lib/python3.8/site-packages/tfrecorder/accessor.py in to_tfr(self, output_dir, schema, runner, project, region, tfrecorder_wheel, dataflow_options, job_label, compression, num_shards)
     87             '<b>Logging output to /tmp/{} </b>'.format(constants.LOGFILE)))
     88 
---> 89     r = converter.convert(
     90         self._df,
     91         output_dir=output_dir,

~/.local/lib/python3.8/site-packages/tfrecorder/converter.py in convert(source, output_dir, schema, header, names, runner, project, region, tfrecorder_wheel, dataflow_options, job_label, compression, num_shards)
    309   job_dir = _get_job_dir(output_dir, job_name)
    310 
--> 311   p = beam_pipeline.build_pipeline(
    312       df,
    313       job_dir=job_dir,

~/.local/lib/python3.8/site-packages/tfrecorder/beam_pipeline.py in build_pipeline(df, job_dir, runner, project, region, compression, num_shards, schema, tfrecorder_wheel, dataflow_options)
    251         | 'ReadFromDataFrame' >> beam.Create(df.values.tolist())
    252         | 'ToCSVRows' >> beam.ParDo(flatten_rows)
--> 253         | 'DecodeCSV' >> beam.Map(converter.decode)
    254     )
    255 

AttributeError: 'CsvCoder' object has no attribute 'decode'

Change TFRUtil to TFRecorder in Notebooks

Describe the bug
Both notebooks inside the samples directory still use import tfrutil: https://github.com/google/tensorflow-recorder/blob/master/samples/Basic-TFRecorder-Usage.ipynb and https://github.com/google/tensorflow-recorder/blob/master/samples/Using-TFRecorder-with-Google-Cloud-Dataflow.ipynb

To Reproduce
Steps to reproduce the behavior:

  1. Go to samples directory and try to run the imports.

Expected behavior
import tfrecorder

Additional context
Maybe this is because these notebooks are still in an early phase. Cheers,

tensorflow version mismatch between requirements.txt and setup.py

In requirements.txt, the requirements are as follows:

apache-beam[gcp] >= 2.22.0
pandas >= 1.0.4
tensorflow_transform >= 0.22
Pillow >= 7.1.2
coverage >= 5.1
ipython >= 7.15.0
nose >= 1.3.7
numpy < 1.19.0
pylint >= 2.5.3
fire >= 0.3.1
jupyter >= 1.0.0
tensorflow >= 2.3.1
pyarrow <0.18,>=0.17
frozendict >= 1.2
dataclasses >= 0.5;python_version<"3.7"
nbval >= 0.9.6
pytest >= 6.1.1

However, in setup.py, the required packages are:

REQUIRED_PACKAGES = [
    "apache-beam[gcp] >= 2.22.0",
    "avro >= 1.10.0",
    "coverage >= 5.1",
    "ipython >= 7.15.0",
    "fire >= 0.3.1",
    "frozendict >= 1.2",
    "nose >= 1.3.7",
    "numpy < 1.19.0",
    "pandas >= 1.0.4",
    "Pillow >= 7.1.2",
    "pyarrow >= 0.17, < 0.18.0",
    "pylint >= 2.5.3",
    "pytz >= 2020.1",
    "python-dateutil",
    "tensorflow == 2.3.1",
    "tensorflow_transform >= 0.22",
]

It would probably be cleaner to read install_requires from requirements.txt so this doesn't happen in the future.
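
A minimal sketch of that idea, assuming a simple requirements file like the one quoted above (one specifier per line, optional # comments, no URLs containing #). The function name parse_requirements is hypothetical and is not part of the repo.

```python
def parse_requirements(text):
    """Return requirement specifiers from requirements.txt-style text."""
    requirements = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        # Skip blank lines and pip options such as "-r other.txt" or "-e .".
        if line and not line.startswith("-"):
            requirements.append(line)
    return requirements


# Hypothetical wiring inside setup.py:
# with open("requirements.txt") as handle:
#     REQUIRED_PACKAGES = parse_requirements(handle.read())
```

Note that the quoted requirements.txt also lists development tools (pylint, nbval, pytest), so sharing a single list would make them install-time dependencies unless the file is split.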

Convert to TFRecords and load a dataset

Is your feature request related to a problem? Please describe.
For convenience, it would be great to have a function that converts data into TFRecords and returns a TF Dataset.

Describe the solution you'd like
tfrecorder.convert_and_load(...)

Describe alternatives you've considered
No change: When working in a Python environment, user would have to invoke convert and load operations separately.
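
For illustration, the proposed helper could be a thin wrapper around the two existing operations. The sketch below injects the convert and load callables rather than importing them, so it does not rely on any particular TFRecorder signature. It assumes, without confirmation from the source, that the convert step returns a mapping with the generated directory under a "tfrecord_dir" key.

```python
def convert_and_load(source, output_dir, convert_fn, load_fn, **convert_kwargs):
    """Convert `source` to TFRecords, then load the result as a dataset.

    Assumption: `convert_fn` returns a mapping that contains the generated
    TFRecord directory under the key "tfrecord_dir".
    """
    job_result = convert_fn(source, output_dir=output_dir, **convert_kwargs)
    tfrecord_dir = job_result["tfrecord_dir"]
    return load_fn(tfrecord_dir)
```

In practice convert_fn and load_fn would be tfrecorder.convert and tfrecorder.load, provided their return value and argument match the assumptions stated above.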

Add note that header needs to be included

Is your feature request related to a problem? Please describe.
When using create_tfrecords, the user would need to specify the split key in order to split the data. This can be done either through specifying a schema, the CSV column header or incorporating the header in the Pandas DataFrame.

Describe the solution you'd like
Add a note in the README regarding aforementioned requirement, and also handle the error case in a better way (e.g. more informative error message).
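
A more informative error could look like the sketch below. The column name "split" and the function name require_columns are assumptions for illustration; the actual key depends on the schema in use.

```python
def require_columns(columns, required=("split",)):
    """Raise a descriptive error if any required column name is missing."""
    missing = [name for name in required if name not in columns]
    if missing:
        raise ValueError(
            "Input is missing required column(s): {}. Found columns: {}. "
            "Add a header row or pass a schema that names them.".format(
                ", ".join(missing), ", ".join(str(c) for c in columns)
            )
        )
```

Calling require_columns(list(df.columns)) before building the pipeline would surface the problem immediately, for example when a CSV was read without a header and the columns are just 0, 1, 2.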

Describe alternatives you've considered
Do nothing: Might cost users significant time trying to debug the problem.

Additional context
Based on user feedback.

Support 2D image segmentation use cases

Is your feature request related to a problem? Please describe.
No, I just want to use TFRecorder on a different problem type.

Describe the solution you'd like
As a user I want to be able to use TFRecorder for image segmentation use cases.

In these cases, my input data consists of an image/mask tuple, where the mask is a 2-D NumPy array with dimensions image_height x image_width. Each pixel in the mask contains an int label identifying the class of that pixel.

Add guard for non-image files in image directory input

It would be good to add some check in case there are non-image files in an image directory.

Describe the solution you'd like
A simple filter would suffice, e.g.

If not image file: 
    skip
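
The filter above could be sketched as follows, using the file extension as the test. The set of extensions and the function names are assumptions; a stricter check would try to open each file as an image.

```python
import os

# Assumed set of extensions to treat as images; adjust as needed.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".bmp"}


def is_image_file(path):
    """Return True if the path has a known image extension (case-insensitive)."""
    return os.path.splitext(path)[1].lower() in IMAGE_EXTENSIONS


def filter_image_files(paths):
    """Keep only paths that look like image files, preserving order."""
    return [path for path in paths if is_image_file(path)]
```

Hidden files such as .DS_Store have no extension according to os.path.splitext, so they are skipped as well.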

Describe alternatives you've considered
A: Do nothing - potential for tool to fail while processing data, which could waste user's time
B: Filter at the DataFrame level - best not to propagate errors downstream

Additional context
See client._read_image_directory.

Include TFRecord output directory in `create_tfrecords` return `dict`

Is your feature request related to a problem? Please describe.
It would be nice to return the path where the TFRecord files were generated after calling create_tfrecords.

Describe the solution you'd like
Add the TFRecord directory path to job_result returned by create_tfrecords.

Describe alternatives you've considered

  1. Searching for the directory manually based on given output_dir
  • This is a bit cumbersome as the TFRecords are stored in a subdirectory of output_dir, which may contain other TFRecord directories.

Use lower case names for split values

Is your feature request related to a problem? Please describe.
Change split values from all caps to lower case.
This makes file/directory naming more consistent with the split.

Describe the solution you'd like
TRAIN -> train
VALIDATION -> validation
TEST -> test

Describe alternatives you've considered

  1. No change
  • There's a bit of skew when it comes to mapping split value with file/dir name.

Load and inspect from GCS ?

Not sure if I've missed something, but it would be nice to be able to load() TFRecords directly from GCS using a path like gs://bucket/dir/file. The same goes for inspect().

No documentation? No support for output multiple TFR files?

Is your feature request related to a problem? Please describe.
The whole purpose of TFRecords is to make training on hundreds of gigabytes of data feasible by dividing the data into many small TFRecord files. If this is not possible, why bother? One could just load one giant training dataset into memory.

Describe the solution you'd like
I want such a feature implemented, as in this gist: https://gist.github.com/swyoon/8185b3dcf08ec728fb22b99016dd533f
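
To make the request concrete, here is a minimal sketch of how records could be spread across several output files. The file naming pattern and both function names are assumptions for illustration, not TFRecorder's actual behavior; writing the records themselves is left out.

```python
def shard_filenames(prefix, num_shards):
    """Return one output filename per shard (naming pattern is assumed)."""
    return [
        "{}-{:05d}-of-{:05d}.tfrecord".format(prefix, index, num_shards)
        for index in range(num_shards)
    ]


def assign_round_robin(records, num_shards):
    """Distribute records across shards in round-robin order."""
    shards = [[] for _ in range(num_shards)]
    for index, record in enumerate(records):
        shards[index % num_shards].append(record)
    return shards
```

Round-robin assignment keeps shard sizes within one record of each other, which helps when files are read in parallel.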

Describe alternatives you've considered
I have written my own version

Additional context
no

Add function to load a TF Dataset from TFRecords

Is your feature request related to a problem? Please describe.
The tool is able to generate TFRecord files, but it would also be good for users to have a convenient way to read those TFRecord files from a directory.

Describe the solution you'd like

import tfrecorder

dataset = tfrecorder.load('/path/to/tfrecord/dir')

The function will accept a TFRecord directory generated by tfrecorder using the create_tfrecords function, since the directory would have a specific structure.

Describe alternatives you've considered
Alternative 1: Create sample code on how to read TFRecords and add to samples directory.
Con: This will still have the user write some code in practice.

Alternative 2: Do nothing
Con: The user would have to figure out how to read the TFRecord files effectively, and it is not trivial as TFRecorder uses TF Transform to generate transform metadata along with the TFRecord files.

File sample/data.csv contains an incorrect path

Describe the bug
The supplied sample/data.csv contains an incorrect path including tfrutil
