
Cloud Berg

Design goal

Berg is a minimal tool for running experiments on GPU instances on Google Cloud and storing the results in a bucket. It's 400 lines of Python.

Non-goals

  • Not designed to run on anything besides Google Cloud
  • Not designed to host a production web server
  • Not designed to do anything clever

Installation

Clone and install the berg tool on your local machine. After installing gcloud, run the following locally:

gcloud init
gcloud auth login # Use your @google account
gcloud source repos clone berg --project=cloud-berg
pip install -e berg
berg init # Set up your configuration correctly in ~/.berg/berg.json

Basic usage

Launch an experiment that uses all the GPUs on a single instance and shuts the instance down after the job completes:

cd <your_git_repo>
berg run "python train.py --bs=256 --logdir=/root/berg_results/sweep1/bs_256" 

The python command will be run from the root of your git repo. The worker will regularly rsync files from /root/berg_results/sweep1/, and results will show up in gs://<your_berg_bucket>/berg_results/sweep1
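That path mapping can be sketched as follows (the helper name and logic are illustrative, not berg's actual code):

```python
def results_gcs_path(bucket: str, local_dir: str) -> str:
    """Map a local results dir to its GCS destination.

    Hypothetical helper: the worker mirrors everything under
    /root/berg_results into the bucket, preserving the relative path.
    """
    rel = local_dir.strip("/").split("/", 1)[1]  # drop the leading "root/"
    return f"gs://{bucket}/{rel}"

print(results_gcs_path("my-berg-bucket", "/root/berg_results/sweep1"))
# -> gs://my-berg-bucket/berg_results/sweep1
```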

Launch a devbox with four P100 GPUs, CUDA, TensorFlow, and PyTorch:

berg devbox

After it starts, you can ssh into it with berg ssh <instance_name>.

How does berg work?

The best way to understand berg is to just read the code (it's very small). Here's an even shorter TL;DR:

Executing berg run does the following:

  • Copy the local git repo that you’re working on to GCS
  • Generate a job-specific <job_name>_metadata.json file with the cmd and the current git_commit and copy it to GCS
  • Start an instance to process that job (using the config info in ~/.berg/berg.json)
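The launch side amounts to a few gsutil/gcloud invocations. A minimal sketch, where the function names, metadata fields, and GCS paths are assumptions rather than berg's exact ones:

```python
import json
import subprocess

def make_job_metadata(job_name: str, cmd: str, git_commit: str) -> dict:
    # Hypothetical schema for <job_name>_metadata.json.
    return {"job_name": job_name, "cmd": cmd, "git_commit": git_commit}

def launch(job_name: str, cmd: str, git_commit: str, bucket: str) -> None:
    meta_path = f"/tmp/{job_name}_metadata.json"
    with open(meta_path, "w") as f:
        json.dump(make_job_metadata(job_name, cmd, git_commit), f)
    # Copy the metadata file to GCS, then boot a worker instance that
    # will pick the job up (illustrative paths and flags).
    subprocess.check_call(["gsutil", "cp", meta_path, f"gs://{bucket}/jobs/"])
    subprocess.check_call(["gcloud", "compute", "instances", "create", job_name])
```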

Once the instance starts, it executes berg-worker run which does the following:

  • Copy down <job_name>_metadata.json from GCS and parse the file
  • Copy down the git repo from GCS, check out your commit and install any requirements in setup.py
  • Start regularly uploading logs from /root/berg_results/<results_dir>
  • Run the cmd for the job
  • Shut down the instance after the job finishes
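The worker's first step, parsing the metadata file, might look like this (field names are assumptions, not berg's actual schema):

```python
import json

def parse_metadata(text):
    # berg-worker pulls <job_name>_metadata.json from GCS and reads out
    # the command to run and the commit to check out.
    meta = json.loads(text)
    return meta["cmd"], meta["git_commit"]

cmd, commit = parse_metadata('{"cmd": "python train.py", "git_commit": "abc123"}')
# After this, the worker would check out `commit`, start the periodic
# log upload, run `cmd`, and finally delete its own instance.
```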

Managing berg jobs

Helpful shortcuts (all of which are aliases for gcloud commands)

berg list # list all running instances 
berg tail <inst_name> # tail logs
berg log <inst_name> # get all logs
berg ssh <inst_name> # ssh into an instance (then run `tt` to get tmux panes)
berg kill <inst_name> # kill an experiment
berg sync <inst_name> # sync local code to a running instance (for development)

You can also use the GCE web console, which allows filtering and bulk management of all of your instances.

Running a large scale job with MPI

berg devbox -m <num_machines>

This will start <num_machines> devboxes with 4 GPUs each.

Then ssh into the first machine and run mpi <your program> to run your program across all machines using MPI. Here mpi is an alias that calls mpirun with all the necessary flags, starting one worker process per GPU in your cluster. The easiest way to run multi-machine TensorFlow/PyTorch experiments is Horovod, which now comes pre-installed on the golden image.
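The mpi alias presumably expands to something like the command built below (flag spelling follows Open MPI's mpirun; the exact alias on the golden image may differ):

```python
def mpi_cmd(program, hosts, gpus_per_host=4):
    # One worker process per GPU across the cluster; in Open MPI,
    # -H takes host:slots pairs and -np sets the total process count.
    n_procs = len(hosts) * gpus_per_host
    host_list = ",".join(f"{h}:{gpus_per_host}" for h in hosts)
    return f"mpirun -np {n_procs} -H {host_list} {program}"

print(mpi_cmd("python train.py", ["devbox-1", "devbox-2"]))
# -> mpirun -np 8 -H devbox-1:4,devbox-2:4 python train.py
```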

Creating a custom golden image

If you’d like to use a different image, you can edit the default_image value in ~/.berg/berg.json

We recommend starting with one of the official GCE deep learning images, and then modifying it slightly.

Berg will log into your image as the root user, and requires the following:

  • python installed with conda in /root/anaconda3/bin
  • A directory of code repositories in /root/code
  • Berg cloned to /root/code/berg and installed with pip install -e ~/code/berg
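A quick way to verify that a candidate image meets those requirements is a check like the following (hypothetical helper, not part of berg; parameterized by root so it can be tested against a scratch tree):

```python
import os

REQUIRED_DIRS = [
    "/root/anaconda3/bin",  # conda-installed python
    "/root/code",           # directory of code repositories
    "/root/code/berg",      # berg itself, installed with pip install -e
]

def missing_paths(root="/"):
    # Return the required directories that don't exist under `root`.
    return [p for p in REQUIRED_DIRS
            if not os.path.isdir(os.path.join(root, p.lstrip("/")))]
```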

Contributors

dpkingma, nottombrown

cloud-berg's Issues

Consider running as configurable user

We currently run as root on the box because I didn't want to have to think about permissions.
Some downsides of this:

  1. Some programs don't work under root because they see it as unsafe (e.g. linux-brew)
  2. Some programs give warnings when run under root (e.g. jupyter)
  3. Users who don't know about sudo su have trouble sshing into a box and running things.

Perhaps we could have people run as arbitrary users and default to a berg user?

We could also just continue to run as root; I'm not sure this is worth changing.

berg devbox example fails due to login authentication

I have ensured that gcloud auth login is completed before the run.

Message:

{
  "error": {
    "errors": [
      {
        "domain": "global",
        "reason": "required",
        "message": "Login Required",
        "locationType": "header",
        "location": "Authorization",
        "debugInfo": "com.google.api.server.core.Fault: ImmutableErrorDefinition{base=LOGIN_REQUIRED, category=USER_ERROR, ...} Login Required … (full Java stack trace omitted)"
      }
    ],
    "code": 401,
    "message": "Login Required"
  }
}

Add basic integration tests

Not super clear to me what the best interface would be here.

Perhaps we could mock out check_call and just start with a high-level test that ensures we call gcloud with reasonable arguments.
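A sketch of that test, with a stand-in launch() for berg's real entry point (berg's actual function name and arguments will differ):

```python
import subprocess
import unittest
from unittest import mock

def launch(instance_name):
    # Stand-in for the berg code path under test.
    subprocess.check_call(
        ["gcloud", "compute", "instances", "create", instance_name])

class TestLaunch(unittest.TestCase):
    @mock.patch("subprocess.check_call")
    def test_calls_gcloud_with_reasonable_args(self, check_call):
        launch("berg-job-1")
        args = check_call.call_args[0][0]
        self.assertEqual(args[0], "gcloud")
        self.assertIn("berg-job-1", args)
```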

Add programmatic interface

People often want to start a job without piping args through a CLI / scripting in bash.

Seems like we could get this from a very simple API on top of the existing system:

# berg_launcher.py
import berg
import numpy as np

for lr in np.linspace(0.0, 1.0, 10):
    berg.run("train.py", flags={'lr': lr}, num_gpus=4)

These arguments could then be serialized into CLI flags and fed into the training script.

This would result in ten instances being spun up, each running a command like the following:

train.py --lr 0.1

We also could potentially add a flags helper as in #8, but I think that serializing through CLI flags is likely to be simpler
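The serialization step could be as simple as this hypothetical helper:

```python
def flags_to_cli(program, flags):
    # Turn a flags dict into the command string berg.run would hand to
    # the instance (keys sorted so the output is deterministic).
    parts = [program] + [f"--{k} {v}" for k, v in sorted(flags.items())]
    return " ".join(parts)

print(flags_to_cli("train.py", {"lr": 0.1}))
# -> train.py --lr 0.1
```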

[unlikely to implement] Serialization helper

Follow up to #4

If serialization of the args becomes annoying (for example, if researchers frequently try to serialize strings that have characters bash misinterprets), we could also let the user set flags within their executable function. I think we likely don't need to do this, though.

# train.py
def main():
  ...

if __name__ == '__main__':
  FLAGS = argparser.parse()
  berg.setup_flags(FLAGS)
  main()

My current thinking is that this is more trouble than it is worth, and that serializing through CLI flags yields code that is easier to understand and more portable than this proposal.

Add berg to pip

It would be nice to be able to run

pip install berg

This could also be the default way for users to create new cloud images. They just set up the cloud box how they want it and then run

pip install berg

Add TPU support

It would be nice to support starting up a TPU accelerator, connecting to it, and shutting it down after the job finishes
