batch-predict's Introduction

โš ๏ธ kubeflow/batch-predict is not maintained

This repository has been deprecated and archived on Nov 30th, 2021.

Batch-predict

Repository for batch predict

Overview

Batch predict is useful when users have a large number of instances to get predictions for, or when they do not need the prediction results in real time.

This Apache Beam-based implementation accepts several input file formats: JSON (text), TFRecord, and compressed TFRecord files. It supports JSON and CSV output formats. Batch predict works with models trained using TensorFlow (in SavedModel format), XGBoost, and scikit-learn.

Today, batch predict can run on a single node in a Kubernetes cluster using the Beam local runner. Alternatively, it can run on Google's Dataflow service using the Dataflow runner. We expect that, as other runners on Kubernetes mature, it will be able to run on multiple nodes in a Kubernetes cluster.
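
The choice of runner is an Apache Beam pipeline option rather than anything specific to this package. A minimal sketch, assuming only Beam's public options API (the project ID and GCS paths are placeholders):

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Single-node execution with the Beam local runner.
local_options = PipelineOptions(runner="DirectRunner")

# The same pipeline on Google's Dataflow service; the project and bucket
# values below are placeholders, not defaults from this repository.
dataflow_options = PipelineOptions(
    runner="DataflowRunner",
    project="my-gcp-project",
    temp_location="gs://my-bucket/temp/",
    staging_location="gs://my-bucket/staging/",
)
```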

Batch predict also supports running on GPUs (K80/P100) in Kubernetes if the cluster is configured with GPUs and the proper NVIDIA drivers are installed.

batch-predict's People

Contributors

activatedgeek, ankushagarwal, jlewi, yixinshi, zijianjoy


batch-predict's Issues

Identify issues needed for an initial release

We should come up with a list of issues that need to be addressed in order to have an initial release of batch predict as part of our 0.2 release.

All such issues should be P1. Anything not needed for our initial release should be lower priority.

Add unit tests

The batch predict package was added in #2, but there are no unit tests.

We should add unit tests.
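
As a starting point, Apache Beam ships test utilities that make DoFn-level tests straightforward. The sketch below deliberately uses a trivial stand-in DoFn, since the constructor arguments of the real PredictionDoFn are not spelled out here; it only illustrates the TestPipeline/assert_that pattern such tests could follow:

```python
import unittest

import apache_beam as beam
from apache_beam.testing.test_pipeline import TestPipeline
from apache_beam.testing.util import assert_that, equal_to


class IdentityDoFn(beam.DoFn):
    """Stand-in for PredictionDoFn; the real DoFn would load a model and predict."""

    def process(self, element):
        yield element


class BatchPredictionTest(unittest.TestCase):
    def test_dofn_passes_elements_through(self):
        inputs = [{"input": "example-1"}, {"input": "example-2"}]
        with TestPipeline() as p:
            results = p | beam.Create(inputs) | beam.ParDo(IdentityDoFn())
            assert_that(results, equal_to(inputs))


if __name__ == "__main__":
    unittest.main()
```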

A generic interface for PredictionDoFn

The module in question is kubeflow_batch_predict.dataflow.batch_prediction.py and the DoFn is PredictionDoFn.

This issue highlights a shortcoming of the current state and will serve as the primary discussion venue for the problem.

In its current state, this DoFn accepts serialized JSON in the following format for the examples:

{'instances': [ {'input': TFRecord}, ... ] }, where each item in the list is a dictionary containing the 'input' key and a TFRecord (which could be a base64 encoding of the TFRecord). This input does not allow any extraneous top-level keys, and the DoFn only yields back a list of inputs and outputs.
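
For concreteness, an accepted payload in this shape looks roughly like the following; the base64 strings are placeholders standing in for serialized TFRecord bytes, not real records:

```python
# Shape of the currently accepted input (values are illustrative placeholders).
payload = {
    "instances": [
        {"input": "Cgdtb3ZpZS0x"},  # base64-encoded TFRecord bytes (placeholder)
        {"input": "Cgdtb3ZpZS0y"},
    ]
}
```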

The need for extraneous keys exists because there may be extra metadata attached to each element that is needed to identify the input. For instance, consider a prediction task that embeds "movies" as high-dimensional vectors. If we want to write the final results to a CSV file, we would want each row to carry extra metadata such as "name", and we might want that metadata to travel with the element as a dict through the Dataflow pipeline (as PCollections).

This is not possible today because it would violate the input format of PredictionDoFn; instead, we would have to morph these values into something acceptable. That transformation step is expected; however, any downstream DoFns that derive PCollections from PredictionDoFn will not be able to access the pre-transformed data. All we will have is a list of high-dimensional vectors with no way to relate them back to human-readable information such as "name".

We need to converge on a design spec for this so as to accommodate the most generic use cases around batch prediction; one possible direction is sketched below.
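
Purely as a hypothetical sketch (the DoFn below and its predict_fn callable are assumptions, not the package's actual API): separate the model input from any extra top-level keys, and re-attach that metadata to the output so downstream DoFns can still see it.

```python
import apache_beam as beam


class GenericPredictionDoFn(beam.DoFn):
    """Hypothetical sketch: emit predictions alongside untouched metadata keys."""

    def __init__(self, predict_fn):
        # predict_fn is an assumed callable mapping a raw 'input' value to a prediction.
        self._predict_fn = predict_fn

    def process(self, element):
        # Keep the model input and set aside extra metadata such as 'name'.
        model_input = element["input"]
        metadata = {k: v for k, v in element.items() if k != "input"}
        prediction = self._predict_fn(model_input)
        # Downstream DoFns can now relate each vector back to its metadata.
        yield {**metadata, "input": model_input, "prediction": prediction}
```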

Is this repo deprecated?

I'm starting to use Dataflow for batch prediction jobs, and this repo has the best example code I have found! Really amazing, and it extends to sklearn models, which I love.

I'm curious whether there is still development planned or whether it has moved to another repo. It wasn't clear; all I can tell is that I haven't seen any activity from y'all this year.

ImportError: No module named kubeflow_batch_predict.dataflow.io.multifiles_source

I am getting a "No module named" error when trying to submit a job on Google Cloud. Any idea what I am missing here? It works fine when submitting from my local machine with the DirectRunner.

ImportError: No module named kubeflow_batch_predict.dataflow.io.multifiles_source

I am trying to run this example, https://github.com/kubeflow/examples/blob/master/object_detection/submit_batch_predict.md, with these parameters from the batch_predict folder:

python -m kubeflow_batch_predict.dataflow.batch_prediction_main --input_file_format tfrecord --output_file_format json --input_file_patterns gs://XYZ/batch_predict/object-detection-images.small.tfrecord --output_result_prefix gs://XYZ/batch_predict/batch_predict_out- --output_error_prefix gs://XYZ/batch_predict/batch_predict_error_out- --model_dir gs://XYZ/batch_predict/model/ --project --temp_location gs://XYZ/temp/ --staging_location gs://XYZ/staging/ --runner=DataflowRunner
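
One common cause of this error on Dataflow (but not the DirectRunner) is that the locally installed kubeflow_batch_predict package is never shipped to the remote workers. A minimal sketch of the usual Beam remedy, assuming a setup.py at the repository root (the path and project ID are assumptions):

```python
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-gcp-project",  # placeholder project ID
    temp_location="gs://XYZ/temp/",
    staging_location="gs://XYZ/staging/",
)
# Ask Beam to build and stage the local package so that workers can import
# kubeflow_batch_predict.dataflow.io.multifiles_source.
options.view_as(SetupOptions).setup_file = "./setup.py"
```

The equivalent command-line form is Beam's --setup_file flag added to the invocation above.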

Exit criteria for 1.0

We should come up with a set of exit criteria for declaring batch predict 1.0.

/area 0.4.0
/priority p1
/area inference

Publish pip package

Should we publish a pip package to make it easy to install the Kubeflow batch predict library?

/priority p1
/area 0.4.0
/area inference
