google / cloud-berg
Berg - Run GPU-backed experiments on gcloud
License: Apache License 2.0
I have ensured that gcloud auth login is completed before the run.
{
"error": {
"errors": [
{
"domain": "global",
"reason": "required",
"message": "Login Required",
"locationType": "header",
"location": "Authorization",
"debugInfo": "com.google.api.server.core.Fault: ImmutableErrorDefinition{base=LOGIN_REQUIRED, category=USER_ERROR, cause=com.google.api.server.core.Fault: LOGIN_REQUIRED Login Required, debugInfo=null, domain=global, extendedHelp=null, httpHeaders={WWW-Authenticate=[Bearer realm="https://accounts.google.com/"]}, httpStatus=unauthorized, internalReason=Reason{arguments={}, cause=null, code=null, createdByBackend=false, debugMessage=null, errorProtoCode=null, errorProtoDomain=null, filteredMessage=null, location=null, message=null, unnamedArguments=[]}, location=headers.Authorization, message=Login Required, reason=required, rpcCode=401} Login Required\n\tat com.google.api.server.auth.AuthenticatorInterceptor.addChallengeHeader(AuthenticatorInterceptor.java:264)\n\tat com.google.api.server.auth.AuthenticatorInterceptor.processErrorResponse(AuthenticatorInterceptor.java:231)\n\tat com.google.api.server.auth.GaiaMintInterceptor.processErrorResponse(GaiaMintInterceptor.java:764)\n\tat com.google.api.server.core.intercept.AroundInterceptorWrapper.processErrorResponse(AroundInterceptorWrapper.java:28)\n\tat com.google.api.server.stats.StatsBootstrap$InterceptorStatsRecorder.processErrorResponse(StatsBootstrap.java:312)\n\tat com.google.api.server.core.intercept.Interceptions$AroundInterception.handleErrorResponse(Interceptions.java:202)\n\tat com.google.api.server.core.intercept.Interceptions$AroundInterception.invoke(Interceptions.java:151)\n\tat com.google.api.server.core.protocol.http.rest.RestServlet.service(RestServlet.java:123)\n\tat javax.servlet.http.HttpServlet.service(HttpServlet.java:717)\n\tat com.google.api.server.core.protocol.http.ApiServlet.service(ApiServlet.java:51)\n\tat com.google.gse.FilteredServlet$ChainEnd.doFilter(FilteredServlet.java:212)\n\tat com.google.api.server.core.EventIdFilter.doFilter(EventIdFilter.java:49)\n\tat com.google.gse.FilteredServlet$Chain.doFilter(FilteredServlet.java:189)\n\tat 
com.google.loadbalancer.gslb.backend.ubb.UBBFilter.doFilter(UBBFilter.java:72)\n\tat com.google.gse.FilteredServlet$Chain.doFilter(FilteredServlet.java:189)\n\tat com.google.servlet.testing.ResponseInjectionFilter.doFilter(ResponseInjectionFilter.java:133)\n\tat com.google.gse.FilteredServlet$Chain.doFilter(FilteredServlet.java:189)\n\tat com.google.gse.FilteredServlet.service(FilteredServlet.java:158)\n\tat com.google.gse.internal.HttpConnectionImpl.runServletFromWithinSpan(HttpConnectionImpl.java:933)\n\tat com.google.gse.internal.HttpConnectionImpl.access$000(HttpConnectionImpl.java:74)\n\tat com.google.gse.internal.HttpConnectionImpl$1.runServletFromWithinSpan(HttpConnectionImpl.java:825)\n\tat com.google.gse.GSETraceHelper$TraceableServletRunnable$2.run(GSETraceHelper.java:468)\n\tat com.google.tracing.LocalTraceSpanRunnable.runInContext(LocalTraceSpanRunnable.java:55)\n\tat com.google.tracing.TraceContext$TraceContextRunnable$1.run(TraceContext.java:460)\n\tat com.google.tracing.TraceContext$AbstractTraceContextCallback.runInInheritedContextNoUnref(TraceContext.java:321)\n\tat com.google.tracing.TraceContext$AbstractTraceContextCallback.runInInheritedContext(TraceContext.java:311)\n\tat com.google.tracing.TraceContext$TraceContextRunnable.run(TraceContext.java:457)\n\tat com.google.tracing.LocalTraceSpanBuilder.internalContinueSpan(LocalTraceSpanBuilder.java:643)\n\tat com.google.gse.GSETraceHelper$TraceableServletRunnable.continueGfeTrace(GSETraceHelper.java:417)\n\tat com.google.gse.GSETraceHelper$TraceableServletRunnable.runWithTracingEnabled(GSETraceHelper.java:372)\n\tat com.google.gse.GSETraceHelper$TraceableServletRunnable.run(GSETraceHelper.java:338)\n\tat com.google.gse.internal.HttpConnectionImpl.runServlet(HttpConnectionImpl.java:827)\n\tat com.google.gse.internal.HttpConnectionImpl.run(HttpConnectionImpl.java:781)\n\tat com.google.gse.internal.DispatchQueueImpl$WorkerThread.run(DispatchQueueImpl.java:403)\nCaused by: 
com.google.api.server.core.Fault: LOGIN_REQUIRED Login Required\n\tat com.google.api.server.auth.NoAuthAuthenticationProcessor.process(NoAuthAuthenticationProcessor.java:20)\n\tat com.google.api.server.auth.GaiaMintApiAuthenticator.authenticate(GaiaMintApiAuthenticator.java:284)\n\tat com.google.api.server.auth.GaiaMintInterceptor.doAuthenticateSingleRequest(GaiaMintInterceptor.java:876)\n\tat com.google.api.server.auth.GaiaMintInterceptor.doAuthenticate(GaiaMintInterceptor.java:687)\n\tat com.google.api.server.auth.AuthenticatorInterceptor.authenticate(AuthenticatorInterceptor.java:361)\n\tat com.google.api.server.auth.GaiaMintInterceptor.authenticate(GaiaMintInterceptor.java:659)\n\tat com.google.api.server.auth.AuthenticatorInterceptor.processRequest(AuthenticatorInterceptor.java:191)\n\tat com.google.api.server.auth.GaiaMintInterceptor.processRequest(GaiaMintInterceptor.java:517)\n\tat com.google.api.server.core.intercept.AroundInterceptorWrapper.processRequest(AroundInterceptorWrapper.java:20)\n\tat com.google.api.server.stats.StatsBootstrap$InterceptorStatsRecorder.processRequest(StatsBootstrap.java:278)\n\tat com.google.api.server.core.intercept.Interceptions$AroundInterception.processRequest(Interceptions.java:159)\n\tat com.google.api.server.core.intercept.Interceptions$AroundInterception.invoke(Interceptions.java:135)\n\t... 27 more\n"
}
],
"code": 401,
"message": "Login Required"
}
}
After pip install -e berg
Not super clear to me what the best interface would be here. Perhaps we could mock out check_call and just start with a high-level test that ensures that we call gcloud with reasonable arguments.
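A minimal sketch of what such a high-level test might look like, assuming the launch code calls subprocess.check_call directly. The launch_instance function here is a hypothetical stand-in written only for illustration, not berg's actual code, and the instance name and zone are made-up values.

```python
import subprocess
from unittest import mock


def launch_instance(name, zone):
    """Hypothetical stand-in for berg's launch code (not the real implementation)."""
    subprocess.check_call(
        ["gcloud", "compute", "instances", "create", name, "--zone", zone]
    )


def test_launch_calls_gcloud():
    # Patch check_call so that no real gcloud process is started.
    with mock.patch("subprocess.check_call") as fake_check_call:
        launch_instance("berg-test", "us-west1-b")

    # The launcher should shell out exactly once, with a gcloud command line.
    fake_check_call.assert_called_once()
    args = fake_check_call.call_args[0][0]
    assert args[:4] == ["gcloud", "compute", "instances", "create"]
    assert "berg-test" in args
    assert "--zone" in args
    return args
```

Because the test only inspects the argument list, it runs without gcloud installed or any credentials configured.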
The default images are here:
https://cloud.google.com/deep-learning-vm/docs/cli
I propose that we start with those images and then just run pip install berg in our startup script.
Currently berg only supports Python 3. We could just replace all the template string literals with f-strings from ww:
from ww import f
f("interpolated {value}")
https://github.com/Tygs/ww/blob/master/src/ww/wrappers/strings.py
Follow up to #4
If serialization of the args becomes annoying (for example, if researchers frequently try to serialize strings that contain characters that bash misinterprets), we could also let the user set flags within their executable function. I think that we likely don't need to do this though.
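# train.py
import argparse
import berg

def main():
    ...

if __name__ == '__main__':
    # berg.setup_flags is the helper proposed here, not an existing API
    FLAGS = argparse.ArgumentParser().parse_args()
    berg.setup_flags(FLAGS)
    main()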
My current thinking is that this is more trouble than it is worth, and that serializing through CLI flags yields code that is easier to understand and more portable than this proposal.
It would be nice to be able to run
pip install berg
This could also be the default way for users to create new cloud images. They just set up the cloud box how they want it and then run
pip install berg
People often want to start a job without piping args through a CLI / scripting in bash.
Seems like we could get this from a very simple API on top of the existing system:
# berg_launcher.py
import berg
import numpy as np

for lr in np.linspace(0.0, 1.0, 10):
    berg.run(
        "train.py",
        flags={'lr': lr},
        num_gpus=4,
    )
These arguments could then be serialized into CLI flags and fed into the training script.
This would result in ten instances being spun up that each run a command like the following:
train.py --lr 0.1
We could also potentially add a flags helper as in #8, but I think that serializing through CLI flags is likely to be simpler.
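A rough sketch of the serialization step, assuming flags arrive as a plain dict of names to values. The helper name flags_to_command is hypothetical and not part of berg. It uses shlex.quote so that values containing spaces or shell metacharacters survive a trip through bash.

```python
import shlex


def flags_to_command(script, flags):
    """Build a shell command string such as 'train.py --lr 0.1' (hypothetical helper)."""
    parts = [script]
    for name, value in flags.items():
        parts.append("--" + name)
        parts.append(str(value))
    # Quote every part so bash does not reinterpret spaces or special characters.
    return " ".join(shlex.quote(part) for part in parts)


command = flags_to_command("train.py", {"lr": 0.1})
```

Quoting each part is one way to address the concern about strings that bash misinterprets without giving up the CLI-flag approach.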
When people make a new image, we would recommend that they run
pip install berg
berg-worker doctor
This could do the following for them:
Check that gsutil and gcloud are installed (for writing logs and shutting self down)

It would be nice to support starting up a TPU accelerator, connecting to it, and shutting it down after the job finishes.
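Returning to the berg-worker doctor idea above, the check for gsutil and gcloud could be sketched roughly as follows. The function name find_missing_tools is hypothetical, and this only tests whether the executables are on PATH, not whether they are authenticated or working.

```python
import shutil


def find_missing_tools(tools=("gsutil", "gcloud")):
    """Return the names of required tools that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing required tools: " + ", ".join(missing))
    else:
        print("All required tools found.")
```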
We currently run as root on the box because I didn't want to have to think about permissions.
Some downsides of this:
[...] (linux-brew)
[...] (jupyter)
[...] sudo su to have trouble sshing into a box and running things.
Perhaps we could have people run as arbitrary users and default to a berg user?
We could also just continue to run as root. Not sure if this is worth changing.