
cloudml's Introduction

R interface to Google CloudML


The cloudml package provides an R interface to Google Cloud Machine Learning Engine, a managed service that enables:

  • Scalable training of models built with the keras, tfestimators, and tensorflow R packages.

  • On-demand access to training on GPUs, including the new Tesla P100 GPUs from NVIDIA®.

  • Hyperparameter tuning to optimize key attributes of model architectures in order to maximize predictive accuracy.

  • Deployment of trained models to Google's global prediction platform, which can support thousands of users and terabytes of data.

CloudML is a managed service where you pay only for the hardware resources that you use. Prices vary depending on configuration (e.g. CPU vs. GPU vs. multiple GPUs). See https://cloud.google.com/ml-engine/pricing for additional details.

For documentation on using the R interface to CloudML, see the package website at https://tensorflow.rstudio.com/tools/cloudml/
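As a quick illustration, a minimal training submission looks roughly like this (a sketch that assumes the Google Cloud SDK is installed and configured, and that train.R is your training script):

library(cloudml)

# submit a training script to CloudML ("train.R" here is a placeholder script name)
cloudml_train("train.R")

# once the job completes, collect the run outputs locally
job_collect()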

cloudml's People

Contributors

andrie, dfalbel, fmannhardt, grahamrp, irudnyts, javierluraschi, jjallaire, kevinushey, medewitt


cloudml's Issues

consider renaming "cloudml" parameter to "config"

Now that we've eliminated the gcloud parameter as well as the previously used "config" parameter, I think it would be clearer to name the "cloudml" parameter "config" (since that matches what's used on the gcloud command line).

@javierluraschi What do you think? I'm happy to make this change if you agree.

R version bucket name

Are we forming the name of the R bucket correctly here? (It seems like the "4" shouldn't be first.)

[screenshot of the generated bucket name omitted]

clean out staging bucket?

I noticed that we leave a bunch of our staging packages in the staging bucket:

[screenshot of staging bucket contents omitted]

Should these be removed after the job completes?

Is this package still NSFW?

This looks like a very interesting package, but the README is foreboding.

I'm working with rstudio/keras and loving it. If this can easily let me send my models to Google for training, that would be great.

Looking over the docs I'm a bit confused, because I figured cloudml_train was going to be the function to work with, but the examples don't call that function.

If this is working I'm willing to invest in learning to use it.

separate 'cloudml.yml', 'flags.yml'

We need some way of automatically detecting + setting the active project + account for deployments.

For command-line deployments, this implies constructing a command line string of the form:

gcloud --project <project> --account <account> ml-engine <...>

I'm guessing this should be encoded as part of config.yml?

error: cloudml package not available

Just executed a training job to test the new cloudml::is_cloudml() function and got this error:

[screenshot of the error omitted]

Is the cloudml package available during training? (seems like it should be)

mechanism for caching packages?

During deployments, a very large amount of time is spent installing package dependencies from CRAN and GitHub. We should figure out a way to offset this cost.

Preloaded Images

We could get in touch with the Google folks, and ask them to pre-bake the TensorFlow images that get launched with ml-engine with a set of R packages (tidyverse + tensorflow + tfruns + cloudml).

Packrat Cache

When developers opt-in to using Packrat, we could enable the use of the Packrat cache. The packages would need to get cached on to a persistent filesystem. Some potential ways forward:

  • Use FUSE to mount the bucket as an actual filesystem path, and tell Packrat to use that: https://cloud.google.com/storage/docs/gcs-fuse

  • Manually use gsutil to copy the library path from the bucket to the local instance's filesystem, and use that.
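As a rough sketch of the second option, assuming a bucket path is reserved for the cache (the bucket path and library directory below are placeholders):

# pull a cached package library down before restoring, then push it back afterwards
cache_bucket <- "gs://my-project/r-cloudml/packrat-cache"   # placeholder
lib_dir <- file.path(getwd(), "packrat", "lib")             # placeholder library path

system2("gsutil", c("-m", "rsync", "-r", cache_bucket, lib_dir))
packrat::restore()
system2("gsutil", c("-m", "rsync", "-r", lib_dir, cache_bucket))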

remove gcloud config options from R functions

In testing the gcloud config options I realized that the gsutil command line utility works only with the currently active Google Cloud SDK account and project. Since we rely on calling gsutil to probe for and automatically create a project storage bucket, I think this implies that we can also only support the currently active account and project.

So I think we should remove all of the gcloud arguments and instead just rely on the currently active default account, project, and region.

consider `europe-west1` or `us-central1` as default regions

We changed the default region from us-central1 to us-east1, which currently requires whitelisting. Consider using one of the default regions to avoid users having to contact support or specify their own region using gcloud = list(region = "us-central1").
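For reference, the explicit override mentioned above would look something like this (the script name is a placeholder):

cloudml_train("train.R", gcloud = list(region = "us-central1"))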

Error: ERROR: gcloud invocation failed [exit status 1]

[command]
/Users/javierluraschi/google-cloud-sdk/bin/gcloud --account '[email protected]' --project rstudio-cloudml ml-engine models create hello_world --regions=us-east1

[output]


[errmsg]
ERROR: (gcloud.ml-engine.models.create) FAILED_PRECONDITION: Field: model.regions Error: Your project needs to be whitelisted to use region 'us-east1'. Please contact Cloud ML Engine team for whitelisting. Other available regions are [europe-west1, us-central1].
- '@type': type.googleapis.com/google.rpc.BadRequest
  fieldViolations:
  - description: Your project needs to be whitelisted to use region 'us-east1'. Please
      contact Cloud ML Engine team for whitelisting. Other available regions are [europe-west1,
      us-central1].
    field: model.regions 

use subdirectory for default bucket

Currently, if no storage is provided explicitly, we use (and create if necessary) a storage bucket with the same name as the project. It seems like we should namespace our use of this storage, for example by creating an "r-cloudml" subdirectory within the bucket root.

avoid duplication of data retrieved from gs:// URLs

We currently use the gs_data function to synchronize data from Google Storage URLs to local paths. By default, we download these to a local path named gs.

Because that gs directory ends up in the run directory, we end up synchronizing the data back to the bucket. This could be wasteful (we end up duplicating data in the bucket across multiple directories, one for each run).
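For context, current usage looks roughly like this (a sketch that assumes gs_data() returns the local destination path; the bucket URL is a placeholder), which is what leaves a gs directory inside the run directory:

# download data from a Google Storage URL to a local path (defaults to "gs/")
data_dir <- gs_data("gs://my-bucket/census-data")
train <- read.csv(file.path(data_dir, "train.csv"))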

ensure that `training_run()` outputs get copied to bucket

With the previous version of cloudml, we manually constructed a job directory to point to a Google bucket path, e.g. we'd pass this to gcloud:

--job-dir=gs://rstudio-cloudml-demo-ml/census/jobs/census_cloudml_2017_10_03_225436428

However, when using tfruns::training_run(), run output will instead be written to a filepath on the instance's local filesystem, rather than to an externally accessible bucket. It seems like we need one of the following:

  1. Support for FUSE (https://cloud.google.com/storage/docs/gcs-fuse) so that we can mount a Google Storage bucket directly on the filesystem, so that training_run() can just write to that directory and interact with a bucket as though it were a local filesystem;

  2. An extra step in the deploy script that manually copies the output of the tfruns::training_run() output to a requested bucket somewhere.

(1) would be nicer, but it seems unlikely that it will be supported anytime soon within ML Engine. (2) seems more doable.

This likely implies that the user will need to specify a bucket path where all runs output should be collected. (What should we call this? storage-bucket?)
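For option (2), a rough sketch of what the extra copy step might look like (assuming tfruns::latest_run() exposes the run directory; the bucket path is just the example from above):

# extra deploy step after tfruns::training_run() completes
run_dir <- tfruns::latest_run()$run_dir                     # assumes run_dir is exposed here
job_bucket <- "gs://rstudio-cloudml-demo-ml/census/jobs"    # placeholder bucket path
system2("gsutil", c("-m", "rsync", "-r", run_dir,
                    file.path(job_bucket, basename(run_dir))))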

revamped specification of configuration info

@javierluraschi and I discussed this:

  1. Leave flags as-is

  2. Remove config option (which is currently defaulted to "cloudml") since we think it's unlikely anyone will ever want a different named config (this is for named profiles in flags)

  3. The "config" option will now point to a standard CloudML config file (JSON or YAML): https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs which can be used to set scaleTier, region, etc. This means that our special config.yml file goes away in favor of the standard Google format.

  4. Specifying account and project parameters different from the user's global default will be done in an optional gcloud.yml file.

So we'd have:

flags.yml (optional flag overrides for cloudml training, etc.)
cloudml.yml (Google CloudML config)
gcloud.yml (override of account and/or project)
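As a sketch of how the optional gcloud.yml overrides might be consumed (the yaml package usage and the field names here are assumptions, not settled API):

# read optional account/project overrides, falling back to the SDK defaults
overrides <- if (file.exists("gcloud.yml")) yaml::yaml.load_file("gcloud.yml") else list()
args <- character()
if (!is.null(overrides$account)) args <- c(args, "--account", overrides$account)
if (!is.null(overrides$project)) args <- c(args, "--project", overrides$project)
args <- c(args, "ml-engine", "jobs", "list")   # example subcommand
system2("gcloud", args)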

consider using RStudio terminals for some commands

For example, when kicking off a training run, it might be useful to run that as part of an RStudio terminal (so the R session remains unblocked, but logs are still streamed to a window that can be easily viewed as they come in).

We might also want to use processx rather than system2 in launching these processes (especially for processes we want to run asynchronously)

consider new scheme for job directories

We currently have the job_dir option which defaults to jobs/local when running locally and defaults to !expr cloudml::unique_job_dir("gs://example/jobs") when running on Cloud ML. The default behavior uses a single job_dir for the local case (so that local testing just overwrites previous testing) and multiple job directories on Cloud ML.

We might consider a new scheme that is simpler to specify in the config file and brings some consistency with job names. While job_dir would always be available as an explicit override, in the new scheme we'd by default construct the job directory from the combination of a job_output key and the job's name. The config file for this might look as follows:

default:
   project_name: census-demo
   job_output: jobs/local

cloudml:
   job_output: gs://bucket-name/census-demo/jobs

We'd automatically generate a job_name based on the project_name and the current date/time stamp (we do this now but use the directory name rather than an explicit project name). The default job_dir would then be job_output/job_name. A side benefit of this scheme would be that the job_name and the directory containing the job output would be the same (it's different now which is a bit confusing).

By default local jobs wouldn't have a job_name and therefore would only save one job output at a time. We'd also replace the current unique_job_dir function with a unique_job_name function.
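A sketch of what the proposed unique_job_name() replacement might look like (hypothetical; it just mirrors the timestamp scheme described above):

# hypothetical helper: derive a unique job name from the project name and a timestamp
unique_job_name <- function(project_name) {
  stamp <- format(Sys.time(), "%Y_%m_%d_%H%M%S")
  gsub("[^A-Za-z0-9_]", "_", paste(project_name, stamp, sep = "_"))
}

# the default job_dir would then be <job_output>/<job_name>
job_name <- unique_job_name("census-demo")
job_dir  <- file.path("gs://bucket-name/census-demo/jobs", job_name)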

@kevinushey What do you think?

@terrytangyuan Let us know if there are perspectives we are missing in reasoning about this.

job data is only partially written in tfruns

As of the last few commits the data written within tfruns is now incomplete, so attempting to view the run fails with this error:

Error in basename(run$script) : a character vector argument expected
Calls: <Anonymous> ... job_download -> <Anonymous> -> run_view_data -> basename
In addition: Warning message:
In max(job_list_trials(job)) :
  no non-missing arguments to max; returning -Inf
Execution halted

You can repro this with any job that you submit fresh.

At first glance it looks like maybe the deploy.rds file doesn't have the entrypoint field?

Are there any changes you can think of that would have caused this? I guess worst case we can git bisect.

integration with tfruns

This is a catch-all thread for the work items necessary to handle integration with tfruns.

Deployment

IIUC, this is the path to deployment using tfruns:

  1. Bundle and deploy the application to Google Cloud (using existing infrastructure),

  2. Rather than directly calling source() on the entry-point for a particular training run, we want to call tfruns::training_run().

I think there isn't much that needs to change here per se beyond that?

Run Directory

Presumably, the underlying R packages (e.g. keras, tfestimators) will be using tfruns under the hood. However, we need to think about how this plays with the configuration file (config.yml).

Comparing Runs

Since our runs are now in the cloud, not on the local machine, do we need to pull the run outputs down to the local filesystem for these to work? Or do we just need to synchronize some subset of the data generated by tfruns during training with the local machine?

Configuration

Right now, we advertise an approach where users set their configuration within config.yml (or hyperparameters.yml for hypertuning) and then ask the user to call e.g.

config <- cloudml::project_config()

and use the entries in that configuration file for running the model. This code likely needs to understand / use the tfruns run directory as well? How exactly does this play together with the tfruns::flags() construct?
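For reference, the tfruns::flags() construct in question looks roughly like this inside a training script (the flag names and defaults are purely illustrative):

library(tfruns)

# declare tunable flags with defaults; values can be overridden per run
FLAGS <- flags(
  flag_numeric("dropout1", 0.4),
  flag_integer("epochs", 10)
)

# the resolved values are then used in the model code,
# e.g. layer_dropout(rate = FLAGS$dropout1)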

ensure complete capture of hyperparameter trial info

Here are the docs for HyperparameterOutput: https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs#HyperparameterOutput

It looks like all of the hyperparameters, as well as all of the metrics captured, are provided here. Does the data frame we return from job_trials() have all of this data in a tidy format (e.g. have we broken out hyperparameters into e.g. flag_dropout1, flag_learning_rate, and metrics into metric_acc, metric_loss, etc.)? Ideally these would have the same column names as what tfruns returns in ls_runs().

where should auxiliary configuration be specified?

There are a few configuration items that need a home:

  • exclude / include: Controlling what files do, or don't, enter the bundle for deployment;
  • packrat: Whether packrat should be used to handle deployment.

I'm sure other things will pop up as we consider new deployment scenarios.

@jjallaire, any thoughts on where these should live?

  • flags.yml, as part of some separate deployment config key?
  • config.yml, as a totally separate configuration?
  • Something else?

refine hyperparameter tuning

  • job_collect / job_describe need to contain/print the data frame with the results.

  • Add a run_id parameter to job_collect

How to install additional packages

Just gave it a go. Google munched on the job for 20 minutes just installing the packages, and then finally gave my code a run, only to find it had not installed data.table.

Questions:

  1. Will the startup process be faster in the future? Are package installations cached or anything?
  2. How do I instruct the build to also install my packages?

Does my code need calls to install.packages()? I will do this, but I'm wondering if there is a better way.

gcloud_file function

This will do an rsync between arbitrary gs buckets and a gs subdirectory of the training directory.

There is a gsutil rsync method that we can use for this.
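A minimal sketch of what such a helper might do, assuming we simply shell out to gsutil (the function name and paths below are hypothetical):

# hypothetical helper: rsync an arbitrary gs:// location into a gs/ subdirectory
# of the training directory
gs_rsync_to_training <- function(source, training_dir = getwd()) {
  destination <- file.path(training_dir, "gs", basename(source))
  dir.create(destination, recursive = TRUE, showWarnings = FALSE)
  system2("gsutil", c("-m", "rsync", "-r", source, destination))
  destination
}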

job fails to submit when containing directory includes a dash

I had a directory with a dash in it (hello-cloudml) and got this error:

ERROR: (gcloud.ml-engine.jobs.submit.training) INVALID_ARGUMENT: Field: job_id Error: A name should start with a letter and contain only letters, numbers and underscores.
- '@type': type.googleapis.com/google.rpc.BadRequest
  fieldViolations:
  - description: A name should start with a letter and contain only letters, numbers
      and underscores.
    field: job_id

Renaming the directory to hello_cloudml resolved the error. Perhaps we need to substitute invalid characters somewhere in the submission pipeline?
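One possible fix, sketched here rather than taken from the package, would be to sanitize the generated job id before submission:

# replace any character that is not a letter, number, or underscore
sanitize_job_id <- function(id) {
  id <- gsub("[^A-Za-z0-9_]", "_", id)
  # job ids must start with a letter
  if (grepl("^[^A-Za-z]", id)) id <- paste0("job_", id)
  id
}

sanitize_job_id("hello-cloudml_2017_10_03_225436428")
# "hello_cloudml_2017_10_03_225436428"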

error during cloudml_predict

I'm attempting to test the hello world deploy/predict workflow from the vignette and I get this error:

[screenshot of the error omitted]

Note that I've made some minor changes (requiring the name parameter), but I don't think this has anything to do with the error.

Warning: TensorFlow library wasn't compiled to use SSE4.1

Model is giving me this warning:

 W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.

Just a detail at this point. No biggie.

job_list function returning an error

Here's what I get when I execute job_list:

> job_list()
ERROR: (gcloud.ml-engine.jobs.list) unrecognized arguments:
  
 Error in exec() : 
  Error 2 occurred running command /Users/jjallaire/google-cloud-sdk/bin/gcloud 

Possibly an artifact of the command-line argument handling rework?

verify credentials before a training run

While attempting to run the census example, I saw errors of the form:

2017-09-26 12:15:33.222490: I tensorflow/core/platform/cloud/retrying_utils.cc:77] The operation failed and will be automatically retried in 32.976 seconds (attempt 7 out of 10), caused by: Unavailable: Error executing an HTTP request (HTTP response code 0, error code 6, error message 'Couldn't resolve host 'metadata'')

The Stack Overflow post here suggests that this was due to my user account not being authenticated, and that was indeed the case. Running:

gcloud auth application-default login

and following the OAuth steps fixed the issue.

Error when calling job_collect for hyperparameter tuning job

I am seeing this error when calling job_collect():

> job_collect()
Error in order(sapply(status$trainingOutput$trials, function(e) e$finalMetric$objectiveValue),  : 
  unimplemented type 'list' in 'orderVector1'
In addition: Warning message:
In max(job_list_trials(job)) :
  no non-missing arguments to max; returning -Inf

I am using the script here: https://github.com/rstudio/cloudml/tree/master/inst/examples/keras

My training job is submitted with:

cloudml_train("mnist_mlp.R", cloudml = "tuning.yml")

Model Checkpoints - Missing h5py

Any suggestions how to save model checkpoints? I've got the following callback:

best_model_save <- callback_model_checkpoint(best_model_file_local, save_best_only = T)

But it gives me:

The h5py Python package is required to save model checkpoints

Is the h5py not installed with keras?

Another question: should the checkpoints be written to a local file and then copied to GCS, or is it possible to send them directly to cloud storage? I'm assuming this same process of writing to local disk and then copying to GCS is needed to fetch the final model?
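Assuming h5py is available, the write-locally-then-copy approach would look something like this (paths are placeholders; whether a callback can write straight to gs:// is the open question above):

best_model_file_local <- "best_model.h5"   # placeholder local path
best_model_save <- callback_model_checkpoint(best_model_file_local, save_best_only = TRUE)

# ... fit the model with callbacks = list(best_model_save) ...

# then copy the checkpoint up to Cloud Storage
system2("gsutil", c("cp", best_model_file_local, "gs://my-bucket/checkpoints/"))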

job_trials returns data frame with factors rather than numeric values

The data frame returned by job_trials() parses the data as strings (which then become factors):

'data.frame':	10 obs. of  5 variables:
 $ finalMetric.objectiveValue: num  0.974 0.973 0.973 0.973 0.973 ...
 $ finalMetric.trainingStep  : Factor w/ 1 level "19": 1 1 1 1 1 1 1 1 1 1
 $ hyperparameters.dropout1  : Factor w/ 10 levels "0.2011326172916916",..: 1 2 3 4 5 6 7 8 9 10
 $ hyperparameters.dropout2  : Factor w/ 10 levels "0.32774705750441724",..: 1 2 3 4 5 6 7 8 9 10
 $ trialId                   : Factor w/ 10 levels "10","3","6","7",..: 1 2 3 4 5 6 7 8 9 10

We need to convert the columns to their appropriate numeric types.
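A possible conversion, as a sketch (assuming each of the affected columns should become numeric):

trials <- job_trials()

# convert factor columns to their numeric values
numeric_cols <- c("finalMetric.trainingStep",
                  "hyperparameters.dropout1",
                  "hyperparameters.dropout2",
                  "trialId")
trials[numeric_cols] <- lapply(trials[numeric_cols],
                               function(x) as.numeric(as.character(x)))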

should we allow users to override gcloud / flags configuration within 'train_cloudml()'?

Currently, users specify arguments for their own application in flags.yml, and options associated with gcloud / cloudml configuration in gcloud.yml.

Do we want to make it possible for users to set these dynamically (ie, in R code) as well? If so, how should we support it? Or, should we require that any of these kinds of changes be reflected in flags.yml / gcloud.yml? Alternatively, should we allow users to select different configuration files to be used during deployment?

For context, I'm imagining someone who might want to kick off a training run with something like:

cloudml::train_cloudml(
    application = getwd(),
    config = "cloudml",
    flags = list(num_epochs = 1.0),
    gcloud = list(runtime.version = "1.3")
)

We currently allow something like this with the 'overlay' (we accept things passed through ... and use them as appropriate); it's possible that we might want to take this out if we feel it doesn't play nicely with flags.

storage location for model

By default we store the model in a path off of the root of the project bucket. I think that this should change to be e.g.

gs://myproject-1024/r-cloudml/models/

I also think that the user should be able to pass a custom bucket.

Old TODOs and enhancements

Moved TODO.Rmd from the repo to here:

JJ

Kevin

Training/Prediction

  • Pull out hyperparameters from config and write a new YAML file with:
    trainingInput:
      hyperparameters:
    Then pass that to gcloud with --config hyperparameters.yml

    Then in our app.R generated file, where we set the config package hook, we
    propagate the command line options/hyperparameters back into the config list

  • Satisfying R package dependencies of train.R

    • install.R or requirements.R
    • Use packrat
    • Hybrid where packrat auto-generates dependencies.R (but still allows
      the user to provide their own hand edited dependencies.R)

Jobs API

  • Can we consolidate job_status and job_describe (perhaps just call it job_describe)?

  • Could we rename job_stream to job_log?

API

  • Consider whether we should use the REST API for Google Cloud rather
    than the local SDK (#6).
    It is going to be very painful to depend on the Cloud ML Python
    libraries as that will force the use of Python 2 (which might not
    correspond to where the user has already installed tensorflow).

  • Sort out all of the gcloud sdk installation / authentication requirements
    for people other than us to use the package and document this well.
    (see: https://cloud.google.com/ml/docs/how-tos/getting-set-up)

  • install_gcloud_sdk function (see headless install in
    https://cloud.google.com/sdk/downloads)

  • Ensure that CLOUDSDK_PYTHON points to python 2 not python 3

  • predict_cloudml

  • Job status / enumeration functions return R lists with custom print methods

  • Functions for publishing and versioning models?

Later

Tooling

  • Package skeleton function w/ config.yml, train.R, etc.

  • Build pane integration: Custom project type exposing various commands

  • rstudioapi package making available a version of system that pumps events
    (only required with shell API)

  • TensorBoard integration

gs_copy not working on windows

A call to gs_data on Windows ends up calling gs_copy, and the shell_quote operation then messes up the string.

Error produced:

InvalidUrlError: Unrecognized scheme ""gs".

Are the shell_quote calls needed for Linux? I think it would work on Windows if they were removed.
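One possible workaround, sketched with base R here (shell_quote is the package's internal helper, so the real fix belongs there), is to skip quoting on Windows:

# hypothetical: skip quoting of gs:// URLs on Windows, where the extra quotes
# end up embedded in the argument passed to gsutil
quote_arg <- function(x) {
  if (.Platform$OS.type == "windows") x else shQuote(x)
}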

ability to download multiple / all trials

Now that we've got the code to download an individual trial from a hyper-parameter tuning run, it would be nice to provide the option to download all of the trials. I'm thinking we could change job_collect to work like this:

job_collect <- function(job, trials = NULL, ...) {
   # implementation
}

job_collect(job, trials = "best")
job_collect(job, trials = "all")
job_collect(job, trials = 42)
job_collect(job, trials = c(4,5,12,14))
