
Cloud Berg

Design goal

Berg is a minimal tool for running experiments on GPU instances on Google Cloud and storing the results in a bucket. It's 400 lines of Python.

Non-goals

  • Not designed to run on anything besides Google Cloud
  • Not designed to host a production web server
  • Not designed to do anything clever

Installation

Clone and install the berg tool on your local machine. After installing gcloud, run the following locally:

gcloud init
gcloud auth login # Use your @google account
gcloud source repos clone berg --project=cloud-berg
pip install -e berg
berg init # Set up your configuration correctly in ~/.berg/berg.json

Basic usage

Launch an experiment that uses all the GPUs on a single instance and shuts the instance down after the job completes:

cd <your_git_repo>
berg run "python train.py --bs=256 --logdir=/root/berg_results/sweep1/bs_256" 

The python command will be run from the root of your git repo. The worker will regularly rsync files from /root/berg_results/sweep1/, and results will show up in gs://<your_berg_bucket>/berg_results/sweep1
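That path mapping can be sketched as follows (the helper name and logic are illustrative, not berg's actual code):

```python
def results_gcs_path(bucket: str, local_dir: str) -> str:
    """Map a local results dir to its GCS destination.

    Hypothetical helper: the worker mirrors everything under
    /root/berg_results into the bucket, preserving the relative path.
    """
    rel = local_dir.strip("/").split("/", 1)[1]  # drop the leading "root/"
    return f"gs://{bucket}/{rel}"

print(results_gcs_path("my-berg-bucket", "/root/berg_results/sweep1"))
# -> gs://my-berg-bucket/berg_results/sweep1
```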

Launch a devbox with four P100 GPUs, CUDA, TensorFlow, and PyTorch:

berg devbox

After it starts, you can ssh into it with berg ssh <instance_name>.

How does berg work?

The best way to understand berg is to just read the code (it's very small). Here's an even shorter TL;DR:

Executing berg run does the following:

  • Copy the local git repo that you’re working on to GCS
  • Generate a job-specific <job_name>_metadata.json file with the cmd and the current git_commit and copy it to GCS
  • Start an instance to process that job (using the config info in ~/.berg/berg.json)
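The launch side amounts to a few gsutil/gcloud invocations. A minimal sketch, where the function names, metadata fields, and GCS paths are assumptions rather than berg's exact ones:

```python
import json
import subprocess

def make_job_metadata(job_name: str, cmd: str, git_commit: str) -> dict:
    # Hypothetical schema for <job_name>_metadata.json.
    return {"job_name": job_name, "cmd": cmd, "git_commit": git_commit}

def launch(job_name: str, cmd: str, git_commit: str, bucket: str) -> None:
    meta_path = f"/tmp/{job_name}_metadata.json"
    with open(meta_path, "w") as f:
        json.dump(make_job_metadata(job_name, cmd, git_commit), f)
    # Copy the metadata file to GCS, then boot a worker instance that
    # will pick the job up (illustrative paths and flags).
    subprocess.check_call(["gsutil", "cp", meta_path, f"gs://{bucket}/jobs/"])
    subprocess.check_call(["gcloud", "compute", "instances", "create", job_name])
```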

Once the instance starts, it executes berg-worker run which does the following:

  • Copy down <job_name>_metadata.json from GCS and parse the file
  • Copy down the git repo from GCS, check out your commit and install any requirements in setup.py
  • Start regularly uploading logs from /root/berg_results/<results_dir>
  • Run the cmd for the job
  • Shut down the instance after the job finishes
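The worker's first step, parsing the metadata file, might look like this (field names are assumptions, not berg's actual schema):

```python
import json

def parse_metadata(text):
    # berg-worker pulls <job_name>_metadata.json from GCS and reads out
    # the command to run and the commit to check out.
    meta = json.loads(text)
    return meta["cmd"], meta["git_commit"]

cmd, commit = parse_metadata('{"cmd": "python train.py", "git_commit": "abc123"}')
# After this, the worker would check out `commit`, start the periodic
# log upload, run `cmd`, and finally delete its own instance.
```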

Managing berg jobs

Helpful shortcuts (all of which are aliases for gcloud commands)

berg list # list all running instances 
berg tail <inst_name> # tail logs
berg log <inst_name> # get all logs
berg ssh <inst_name> # ssh into an instance (then run `tt` to get tmux panes)
berg kill <inst_name> # kill an experiment
berg sync <inst_name> # sync local code to a running instance (for development)

You can also use the GCE web console, which allows filtering and bulk management of all of your instances.

Running a large scale job with MPI

berg devbox -m <num_machines>

This will start <num_machines> devboxes with 4 GPUs each.

Then ssh into the first machine and run mpi <your program> to run your program across all machines using MPI. Here mpi is an alias that calls mpirun with all the necessary flags, starting one worker process per GPU in your cluster. The easiest way to run multi-machine TensorFlow/PyTorch experiments is Horovod, which now comes pre-installed on the golden image.
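The mpi alias presumably expands to something like the command built below (flag spelling follows Open MPI's mpirun; the exact alias on the golden image may differ):

```python
def mpi_cmd(program, hosts, gpus_per_host=4):
    # One worker process per GPU across the cluster; in Open MPI,
    # -H takes host:slots pairs and -np sets the total process count.
    n_procs = len(hosts) * gpus_per_host
    host_list = ",".join(f"{h}:{gpus_per_host}" for h in hosts)
    return f"mpirun -np {n_procs} -H {host_list} {program}"

print(mpi_cmd("python train.py", ["devbox-1", "devbox-2"]))
# -> mpirun -np 8 -H devbox-1:4,devbox-2:4 python train.py
```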

Creating a custom golden image

If you’d like to use a different image, you can edit the default_image value in ~/.berg/berg.json

We recommend starting with one of the official GCE deep learning images, and then modifying it slightly.

Berg will log into your image as the root user, and requires the following:

  • python installed with conda in /root/anaconda3/bin
  • A directory of code repositories in /root/code
  • Berg cloned to /root/code/berg and installed with pip install -e ~/code/berg
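A quick way to verify that a candidate image meets those requirements is a check like the following (hypothetical helper, not part of berg; parameterized by root so it can be tested against a scratch tree):

```python
import os

REQUIRED_DIRS = [
    "/root/anaconda3/bin",  # conda-installed python
    "/root/code",           # directory of code repositories
    "/root/code/berg",      # berg itself, installed with pip install -e
]

def missing_paths(root="/"):
    # Return the required directories that don't exist under `root`.
    return [p for p in REQUIRED_DIRS
            if not os.path.isdir(os.path.join(root, p.lstrip("/")))]
```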

Contributors

dpkingma, nottombrown

cloud-berg's Issues

Consider running as configurable user

We currently run as root on the box because I didn't want to have to think about permissions.
Some downsides of this:

  1. Some programs don't work under root because they see it as unsafe (e.g. linux-brew)
  2. Some programs give warnings when run under root (e.g. jupyter)
  3. Users who don't know about sudo su have trouble sshing into a box and running things.

Perhaps we could have people run as arbitrary users and default to a berg user?

We could also just continue to run as root; I'm not sure this is worth changing.

berg devbox example fails due to login authentication

I have ensured that gcloud auth login is completed before the run.

Message:

{
  "error": {
    "errors": [
      {
        "domain": "global",
        "reason": "required",
        "message": "Login Required",
        "locationType": "header",
        "location": "Authorization",
        "debugInfo": "com.google.api.server.core.Fault: ImmutableErrorDefinition{base=LOGIN_REQUIRED, category=USER_ERROR, ...} Login Required … (full Java stack trace omitted)"
      }
    ],
    "code": 401,
    "message": "Login Required"
  }
}

Add basic integration tests

Not super clear to me what the best interface would be here.

Perhaps we could mock out check_call and just start with a high-level test that ensures we call gcloud with reasonable arguments.
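A sketch of that test, with a stand-in launch() for berg's real entry point (berg's actual function name and arguments will differ):

```python
import subprocess
import unittest
from unittest import mock

def launch(instance_name):
    # Stand-in for the berg code path under test.
    subprocess.check_call(
        ["gcloud", "compute", "instances", "create", instance_name])

class TestLaunch(unittest.TestCase):
    @mock.patch("subprocess.check_call")
    def test_calls_gcloud_with_reasonable_args(self, check_call):
        launch("berg-job-1")
        args = check_call.call_args[0][0]
        self.assertEqual(args[0], "gcloud")
        self.assertIn("berg-job-1", args)
```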

Add programmatic interface

People often want to start a job without piping args through a CLI / scripting in bash.

Seems like we could get this from a very simple API on top of the existing system:

# berg_launcher.py
import berg
import numpy as np

for lr in np.linspace(0.0, 1.0, 10):
    berg.run("train.py", flags={'lr': lr}, num_gpus=4)

These arguments could then be serialized into CLI flags and fed into the training script.

This would result in ten instances being spun up, each running a command like the following:

train.py --lr 0.1

We also could potentially add a flags helper as in #8, but I think that serializing through CLI flags is likely to be simpler
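The serialization step could be as simple as this hypothetical helper:

```python
def flags_to_cli(program, flags):
    # Turn a flags dict into the command string berg.run would hand to
    # the instance (keys sorted so the output is deterministic).
    parts = [program] + [f"--{k} {v}" for k, v in sorted(flags.items())]
    return " ".join(parts)

print(flags_to_cli("train.py", {"lr": 0.1}))
# -> train.py --lr 0.1
```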

[unlikely to implement] Serialization helper

Follow up to #4

If serialization of the args becomes annoying (for example, if researchers frequently try to serialize strings that have characters bash misinterprets), we could also let the user set flags within their executable function. I think we likely don't need to do this, though.

# train.py
def main():
  ...

if __name__ == '__main__':
  FLAGS = argparser.parse()
  berg.setup_flags(FLAGS)
  main()

My current thinking is that this is more trouble than it is worth, and that serializing through CLI flags yields code that is easier to understand and more portable than this proposal.

Add berg to pip

It would be nice to be able to run

pip install berg

This could also be the default way for users to create new cloud images. They just set up the cloud box how they want it and then run

pip install berg

Add TPU support

It would be nice to support starting up a TPU accelerator, connecting to it, and shutting it down after the job finishes
