
gpt-2-cloud-run's Introduction

gpt-2-cloud-run

App for building a text-generation API for OpenAI's GPT-2 via gpt-2-simple, and running it in a scalable manner, effectively for free, via Google's Cloud Run. This app is intended to make it easy and cost-effective for others to play with a GPT-2 model finetuned on another dataset, and to allow programmatic access to the generated text.

The base app.py runs Starlette for async support and future-proofing, and is easily hackable if you want to modify GPT-2's input/output, force certain generation parameters, or add additional features/endpoints such as tweeting the generated result.
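As a concrete illustration of that hackability, here is a minimal sketch (not part of the repo) of adding an extra endpoint, assuming the app, sess, gpt2, and response_header objects that the base app.py already defines; the route name and the forced generation parameters are illustrative:

from starlette.responses import UJSONResponse

@app.route('/short', methods=['GET'])
async def short(request):
    # Force a small, fixed generation instead of honoring user parameters
    text = gpt2.generate(sess,
                         length=50,
                         temperature=0.7,
                         return_as_list=True)[0]
    return UJSONResponse({'text': text},
                         headers=response_header)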

Demo

You can play with a web-based demo of a Cloud Run API pointing at the default 117M "small" GPT-2 model here: https://minimaxir.com/apps/gpt2-small/

The demo web UI is based on the app_ui.html file in this repo (built on Bulma and jQuery) and is designed to be easily hackable, whether to add new features or adjust the design (e.g. you can change the URL in the JavaScript function to point to your own Cloud Run API).

How to Build the Container And Start Cloud Run

Since Cloud Run is stateless without access to local storage, you must bundle the model within the container. First, download/clone this repo and copy the model into the folder (the model should be in the form of the folder hierarchy /checkpoint/run1, which is the default for most finetuning scripts).

Then build the image:

docker build . -t gpt2

If you want to test the image locally with the same specs as Cloud Run, you can run:

docker run -p 8080:8080 --memory="2g" --cpus="1" gpt2

You can then visit/curl http://0.0.0.0:8080 to get generated text!
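For example, from another terminal (the parameters are the same ones documented in the API section below; the values here are illustrative):

curl "http://0.0.0.0:8080?length=50&temperature=0.7"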

Then, tag the image and upload it to the Google Container Registry (note: this will take a while due to the image size!):

docker tag gpt2 gcr.io/[PROJECT-ID]/gpt2
docker push gcr.io/[PROJECT-ID]/gpt2

Once done, deploy the uploaded image to Cloud Run via the console. Set Memory Allocated to 2 GB and Maximum Requests Per Container to 1!
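If you prefer the CLI over the console, a roughly equivalent deployment looks like the following sketch (region and authentication flags are omitted and will vary per project):

gcloud run deploy gpt2 \
  --image gcr.io/[PROJECT-ID]/gpt2 \
  --memory 2Gi \
  --concurrency 1 \
  --allow-unauthenticated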

The Cloud Run logs will tell you how the service runs, and the INFO log level contains Cloud Run diagnostic info, including the time it takes for a request to run.


Interacting with the API in Cloud Run

The API accepts both GET and POST requests, and returns a JSON object with a text attribute that contains the generated text. For example, let's say the Cloud Run URL is http://example.google.com:

A GET request to the API would be http://example.google.com?length=100&temperature=1.0 which can be accessed by almost any type of client. (NB: Don't visit the API in a web browser, as the browser prefetch may count as an additional request)

A POST request (passing the data as a JSON object) is preferable, as it is both more secure and allows non-ASCII inputs. Python example:

import requests

req = requests.post('http://example.google.com',
                    json={'length': 100, 'temperature': 1.0})
text = req.json()['text']
print(text)

The UI from app_ui.html utilizes AJAX POST requests via jQuery to retrieve the generated text and parse the data for display.

Helpful Notes

  • Due to Cloud Run's current 2 GB memory maximum, this app will only work with the 117M "small" GPT-2 model, and not the 345M "medium" model (even if Cloud Run offers a 4 GB option in the future, it would not be enough to support the 345M model).
  • Each prediction, at the default 1023-token length, will take about 2 minutes to generate (roughly 10 seconds per 100 tokens). You may want to reduce the length of the generated text if speed is a concern, and/or hard-cap the length at the app level (see the sketch at the end of this list).
  • If your API on Cloud Run is actively processing a request less than 7% of the time (at the 100 millisecond level) in a given month, you'll stay within the free tier of Cloud Run, and the price is $0.10 an hour if the service goes over the free tier. Only the time starting up an instance and processing a request counts as billable time (i.e. the durations in the logs); idle time does not count as billable, making it surprisingly easy to stay within the limits.
  • The concurrency is set to 1 to ensure maximum utilization for each user (if a single user is using it and accidentally causes another container to spawn, it doesn't matter cost-wise, as only request processing incurs charges, not the number of active containers).
  • Memory leaks in the container may cause you to go over the 2 GB limit and crash the container after enough text generations. Fortunately, Cloud Run recovers quickly (although the current request will fail), and having multiple containers operating due to the low concurrency can distribute the workload.
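For the length hard-cap mentioned above, a minimal sketch inside app.py's request handler (assuming the params object that app.py already builds from the request; the cap value is illustrative, not a recommendation from the repo):

MAX_LENGTH = 256  # illustrative app-level cap

length = min(int(params.get('length', 1023)), MAX_LENGTH)
text = gpt2.generate(sess,
                     length=length,
                     temperature=float(params.get('temperature', 0.7)),
                     return_as_list=True)[0]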

If You Want More Power

If you expect the API to be actively engaged 24/7, need faster response times, and/or want to use the 345M GPT-2 model, you may want to use Cloud Run on GKE instead (and attach a GPU to the nodes + use a tensorflow-gpu base for the Dockerfile) and increase concurrency to maximize cost efficiency.

Additionally, if you plan on making a lot of GPT-2 APIs, you may want to use Cloud Build to avoid the overhead of downloading/building/reuploading a model. I have written a short tutorial on how to get a model trained with Compute Engine built using Cloud Build using the included cloudbuild.yaml spec.
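With that spec in the repo folder, the build is typically kicked off with the command below (if the spec expects substitutions, you may also need to pass --substitutions):

gcloud builds submit --config cloudbuild.yaml .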

Future Improvements

  • Add/test a GPU image

See Also

A PyTorch Approach to deploying GPT-2 to Cloud Run

Maintainer/Creator

Max Woolf (@minimaxir)

Max's open-source projects are supported by his Patreon. If you found this project helpful, any monetary contributions to the Patreon are appreciated and will be put to good creative use.

License

MIT

Disclaimer

This repo has no affiliation or relationship with OpenAI.

gpt-2-cloud-run's People

Contributors

minimaxir, simonw


gpt-2-cloud-run's Issues

Cloud Build workflow could be less janky

Copying an entire GCS bucket / creating a new bucket just for copying it is not ideal. Uploading app.py and the Dockerfile to the same bucket is also somewhat janky.

Possible alternate workflow:

  • Copy only the specified checkpoint folder to the Cloud Build /workspace working directory
  • If the specified checkpoint folder is not named correctly, rename it. This allows multiple folders in a bucket.
  • Upload app.py, Dockerfile, and friends from the local machine with the build CLI command

It may be helpful to include sanity-checking steps in this workflow as well, to check for the presence of the files before the Docker build.
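A rough cloudbuild.yaml-style sketch of that proposed workflow (the builder images are standard Cloud Build builders; the _BUCKET and _MODEL_DIR substitutions are illustrative and not part of this repo):

steps:
# Copy only the specified checkpoint folder into the Cloud Build /workspace
- name: 'gcr.io/cloud-builders/gsutil'
  args: ['-m', 'cp', '-r', 'gs://${_BUCKET}/${_MODEL_DIR}', '.']
# Rename the folder to the checkpoint/run1 hierarchy the app expects, and
# sanity-check that the model files are present before building
- name: 'ubuntu'
  entrypoint: 'bash'
  args: ['-c', 'mkdir -p checkpoint && mv ${_MODEL_DIR} checkpoint/run1 && ls checkpoint/run1']
# Build the image; app.py, the Dockerfile, and friends come from the local
# machine because `gcloud builds submit` uploads the working directory as the build source
- name: 'gcr.io/cloud-builders/docker'
  args: ['build', '-t', 'gcr.io/$PROJECT_ID/gpt2', '.']
images: ['gcr.io/$PROJECT_ID/gpt2']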

Tensorflow.contrib module error when running docker file

Hi Max,
I am following the instructions in the readme file. Once I built the image, I tried to run it locally, but got the following error. It seems the tensorflow.contrib module was removed in TensorFlow 2.0. I noticed that the Dockerfile does not specify the TensorFlow version, so it might have auto-installed 2.0. The TF version in my Colab notebook was 1.15 when I trained the model, so I will try to force 1.15 in the Dockerfile.

docker run -p 8080:8080 --memory="2g" --cpus="1" gpt2

Traceback (most recent call last):
  File "app.py", line 3, in <module>
    import gpt_2_simple as gpt2
  File "/usr/local/lib/python3.7/site-packages/gpt_2_simple/__init__.py", line 1, in <module>
    from .gpt_2 import *
  File "/usr/local/lib/python3.7/site-packages/gpt_2_simple/gpt_2.py", line 23, in <module>
    from gpt_2_simple.src import model, sample, encoder, memory_saving_gradients
  File "/usr/local/lib/python3.7/site-packages/gpt_2_simple/src/memory_saving_gradients.py", line 5, in <module>
    import tensorflow.contrib.graph_editor as ge
ModuleNotFoundError: No module named 'tensorflow.contrib'
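If the Dockerfile installs its dependencies with pip (an assumption; the actual install step may differ), a pinned install along these lines avoids pulling TensorFlow 2.x:

# Pin TensorFlow to the 1.x line that gpt-2-simple expects
RUN pip3 install --no-cache-dir tensorflow==1.15 gpt-2-simple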

Memory limit exceeded

I'm trying to run this in Google Cloud Run; however, I don't seem to have enough memory.

Memory limit of 2048M exceeded with 2126M used.

Do you have any idea why this is the case? It should work, right?

I'm using the 124M model.

TypeError: not all arguments converted during string formatting

Hi, first of all thank you for this amazing tutorial and repo. Awesome work!

So I've been trying to generate from a custom model, similar to the code you have in /examples/hacker_news.py. My app.py is shown below. I've set return_as_list to true. I get the list, remove unnecessary prefixes from each string in the list, and return the list as JSON. This code throws the error shown below. However, when I generate text locally without calling the API (i.e. without the HTTPS requests involved), the code works perfectly without errors. I can't seem to figure out what I'm doing wrong. I'd highly appreciate any help.

app.py

from starlette.applications import Starlette
from starlette.responses import UJSONResponse
import gpt_2_simple as gpt2
import tensorflow as tf
import uvicorn
import os
import gc

app = Starlette(debug=False)

sess = gpt2.start_tf_sess(threads=1)
gpt2.load_gpt2(sess)


response_header = {
    'Access-Control-Allow-Origin': '*'
}

generate_count = 0


@app.route('/', methods=['GET', 'POST', 'HEAD'])
async def homepage(request):
    global generate_count
    global sess

    if request.method == 'GET':
        params = request.query_params
    elif request.method == 'POST':
        params = await request.json()
    elif request.method == 'HEAD':
        return UJSONResponse({'text': ''},
                             headers=response_header)

    
    text = gpt2.generate(sess,
                         length=55,
                         temperature=1.0,
                         top_k=int(params.get('top_k', 0)),
                         top_p=float(params.get('top_p', 0)),
                         prefix='<|startoftext|>' + params.get('prefix', ''),
                         truncate='<|endoftext|>',
                         include_prefix=True,
                         nsamples=params.get('nsamples', 1),
                         return_as_list=True
                         )

    for x in text:
        x = x.replace('<|startoftext|>', '')
        x = x.replace('<|endoftext|>', '')
        x = x.replace('  ', ' ')


    generate_count += 1
    if generate_count == 8:
        # Reload model to prevent Graph/Session from going OOM
        tf.reset_default_graph()
        sess.close()
        sess = gpt2.start_tf_sess(threads=1)
        gpt2.load_gpt2(sess)
        generate_count = 0

    gc.collect()
    return UJSONResponse({'text_list': text})

if __name__ == '__main__':
    uvicorn.run(app, host='0.0.0.0', port=int(os.environ.get('PORT', 8080)))

Error:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/uvicorn/protocols/http/httptools_impl.py", line 385, in run_asgi
    result = await app(self.scope, self.receive, self.send)
  File "/usr/local/lib/python3.7/site-packages/uvicorn/middleware/proxy_headers.py", line 45, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.7/site-packages/starlette/applications.py", line 102, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.7/site-packages/starlette/middleware/errors.py", line 181, in __call__
    raise exc from None
  File "/usr/local/lib/python3.7/site-packages/starlette/middleware/errors.py", line 159, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.7/site-packages/starlette/exceptions.py", line 82, in __call__
    raise exc from None
  File "/usr/local/lib/python3.7/site-packages/starlette/exceptions.py", line 71, in __call__
    await self.app(scope, receive, sender)
  File "/usr/local/lib/python3.7/site-packages/starlette/routing.py", line 550, in __call__
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.7/site-packages/starlette/routing.py", line 227, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.7/site-packages/starlette/routing.py", line 41, in app
    response = await func(request)
  File "app.py", line 44, in homepage
    nsamples=params.get('nsamples', 1),
  File "/usr/local/lib/python3.7/site-packages/gpt_2_simple/gpt_2.py", line 428, in generate
    assert nsamples % batch_size == 0
TypeError: not all arguments converted during string formatting
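For context on the error itself: query-string parameters arrive as strings, so params.get('nsamples', 1) can return '1' rather than 1, and the % in gpt-2-simple's assert nsamples % batch_size == 0 then performs string formatting and raises this TypeError. A likely (unconfirmed) fix is to cast the value before passing it:

nsamples=int(params.get('nsamples', 1)),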
 

AttributeError: 'NoneType' object has no attribute 'dumps'

It runs fine in Docker, but when I tried to run it on my desktop, the error below came out.
Traceback (most recent call last):
  File "/root/anaconda3/envs/py36/lib/python3.6/site-packages/uvicorn/protocols/http/httptools_impl.py", line 385, in run_asgi
    result = await app(self.scope, self.receive, self.send)
  File "/root/anaconda3/envs/py36/lib/python3.6/site-packages/uvicorn/middleware/proxy_headers.py", line 45, in __call__
    return await self.app(scope, receive, send)
  File "/root/anaconda3/envs/py36/lib/python3.6/site-packages/starlette/applications.py", line 102, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/root/anaconda3/envs/py36/lib/python3.6/site-packages/starlette/middleware/errors.py", line 178, in __call__
    raise exc from None
  File "/root/anaconda3/envs/py36/lib/python3.6/site-packages/starlette/middleware/errors.py", line 156, in __call__
    await self.app(scope, receive, _send)
  File "/root/anaconda3/envs/py36/lib/python3.6/site-packages/starlette/exceptions.py", line 82, in __call__
    raise exc from None
  File "/root/anaconda3/envs/py36/lib/python3.6/site-packages/starlette/exceptions.py", line 71, in __call__
    await self.app(scope, receive, sender)
  File "/root/anaconda3/envs/py36/lib/python3.6/site-packages/starlette/routing.py", line 550, in __call__
    await route.handle(scope, receive, send)
  File "/root/anaconda3/envs/py36/lib/python3.6/site-packages/starlette/routing.py", line 227, in handle
    await self.app(scope, receive, send)
  File "/root/anaconda3/envs/py36/lib/python3.6/site-packages/starlette/routing.py", line 41, in app
    response = await func(request)
  File "app.py", line 58, in homepage
    headers=response_header)
  File "/root/anaconda3/envs/py36/lib/python3.6/site-packages/starlette/responses.py", line 42, in __init__
    self.body = self.render(content)
  File "/root/anaconda3/envs/py36/lib/python3.6/site-packages/starlette/responses.py", line 159, in render
    return ujson.dumps(content, ensure_ascii=False).encode("utf-8")
AttributeError: 'NoneType' object has no attribute 'dumps'

Request for Contribution: Add webpage GUI for GPT-2-based APIs

Part of the reason I am building gpt-2-cloud-run is for easy integration with a web-based front end.

Unfortunately, I suck at front-ends and don't know best practices. (Ideally, I want something similar to OpenAI's UI which has parameter selection and input capabilities for inline autocompletion)


I'll need help with a simple web-based frontend that's flexible. Some feature specifications:

  • A single HTML file with the app (no external JS/CSS; including from a CDN is OK).
  • Supports all parameters that the default API supports. (e.g. length, temperature, top_k)
  • A button for submission which is disabled on click until a response/error is received (to prevent double-submissions since they are slow)

More consistent output for Save Image

The current Save Image will result in an output based on the viewport of the device: not necessarily wrong, but it would be good if it were more consistent.

Google Cloud Free Tier

Not really an issue per se, but does anyone know how to stay in the Google Cloud free tier?
What parameters do I have to use when I configure my image?

Add rate limiting

Since Cloud Run allows unauthenticated HTTP requests, it would be good to add a simple rate limit by IP.

Unfortunately there's no simple implementation, and the simple implementations that exist are for Flask only.
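As a rough per-IP sketch using Starlette's middleware support (not an existing implementation in this repo; the window and limit are illustrative, and an in-memory counter only limits per container, which matters less with concurrency set to 1):

import time
from collections import defaultdict, deque

from starlette.middleware.base import BaseHTTPMiddleware
from starlette.responses import PlainTextResponse

WINDOW_SECONDS = 60   # illustrative window
MAX_REQUESTS = 10     # illustrative per-IP limit

class RateLimitMiddleware(BaseHTTPMiddleware):
    hits = defaultdict(deque)  # client IP -> timestamps of recent requests

    async def dispatch(self, request, call_next):
        ip = request.client.host
        now = time.time()
        recent = self.hits[ip]
        # Drop timestamps that have aged out of the window
        while recent and now - recent[0] > WINDOW_SECONDS:
            recent.popleft()
        if len(recent) >= MAX_REQUESTS:
            return PlainTextResponse('Too Many Requests', status_code=429)
        recent.append(now)
        return await call_next(request)

app.add_middleware(RateLimitMiddleware)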

Dockerfile checkpoint is missing

Hi,

Hope you are all well !

I could not build the Docker image, as the checkpoint file is missing.
Can you re-upload it, or shall I remove it from the Dockerfile?

Thanks in advance for your insights and inputs on the topic.

Cheers,
X

Spec for a GKE Kubernetes GPU Cluster w/ Cloud Run

Create a k8s .yaml file spec which will create a cluster that can support GPT-2 APIs w/ GPU for faster serving

Goal

  • Each Node has as much GPU utilization as possible.
  • Able to scale down to zero (for real, GKE is picky about that)

Proposal

  • A single f1-micro Node so the GPU-pods can scale to 0 (a single f1-micro is free)
  • The other Node is a 16 vCPU / 14 GB RAM machine (n1-highcpu-16).
  • Each Pod uses 4 vCPU and 1 K80 GPU, and has a Cloud Run concurrency of 4.

Therefore, a single Node can accommodate up to 4 different GPT-2 APIs or the same API scaled up, which is neat.

In testing, a single K80 can generate about 20 texts at a time before going OOM, so setting a maximum of 16 should give enough of a buffer for storing the model. If not, using T4 GPUs should give a RAM boost.
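Not the requested .yaml, but as a hedged sketch, the node layout above roughly corresponds to gcloud commands like the following (cluster/pool names and zone are illustrative, and the flags needed to enable the Cloud Run add-on on the cluster are omitted):

# Tiny default pool, as proposed above, so the GPU pool can scale to zero
gcloud container clusters create gpt2-gke \
  --zone us-central1-a \
  --machine-type f1-micro \
  --num-nodes 1

# GPU pool: n1-highcpu-16 with 4 K80s (one per Pod), autoscaling down to zero
gcloud container node-pools create gpu-pool \
  --cluster gpt2-gke \
  --zone us-central1-a \
  --machine-type n1-highcpu-16 \
  --accelerator type=nvidia-tesla-k80,count=4 \
  --num-nodes 1 \
  --enable-autoscaling --min-nodes 0 --max-nodes 1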

Could not load dynamic library 'libcuda.so.1'

When running the Docker image, I receive the following CUDA error. I built the image using the normal tensorflow==1.15, but as far as I know CUDA is only required for TensorFlow on GPU?

2020-01-16 12:58:56.907133: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2020-01-16 12:58:56.907168: E tensorflow/stream_executor/cuda/cuda_driver.cc:318] failed call to cuInit: UNKNOWN ERROR (303)
2020-01-16 12:58:56.907187: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (a88f782966c0): /proc/driver/nvidia/version does not exist

Parameter Usage Question

I understand the 4 parameters in your jQuery form (prefix, length, temperature, top_k), but in the app.py file there are also these 4 lines:
top_p=float(params.get('top_p', 0)),
truncate=params.get('truncate', None),
include_prefix=str(params.get('include_prefix', True)).lower() == 'true',
return_as_list=True

I assume that if I don't want to allow the user to write the first n characters of the story, I would change it to 'include_prefix', False. But what do these three do: top_p, truncate, and return_as_list?

tensorflow version

TensorFlow 2.x is not compatible with gpt-2-simple, but the Docker build downloads the latest version.

Poor quality of text generation in Cloud Run compared to Colab

First up, thanks for all the work you've put into all of the GPT-2-simple stuff. It's amazing!

But I've set up a generation with Cloud Run using the same model and same settings as in Colab, and the text outputs are significantly less cohesive with lines being constantly repeated. Any particular reason why this would be happening? Is it a limitation of the Cloud Run hardware vs the Colab hardware?

The model is intended to be a video game idea generator trained on ~15,000 posts from /r/gameideas. Here's an example of the same prefix in each context:

Colab

A game where you have to fight children or some shit. The children are easy to kill. You can run for cover or you can try to fight back but you're much slower. You can't run as fast as the children. You can hide, crawl, crawl out the door. There's also a lot of zombies.

If you're fast enough, you can jump off the roof and climb inside. The children are easier to kill. You can jump it too. The children can get stuck in the wall. You can jump to them, kill them and then climb up. There's a lot of enemies.

You can use the power of the house as a platform to jump in the first places. You can then jump to the roof where there's a bigger enemy. You can then crawl out the door to the other side to sneak in. There's a lot of zombies.

There's also a lot of fire. You can run into them. You can throw a torch at them. They'll die if you're not careful. Once they die, you can jump to the roof but the fire won't burn you if you're not careful.

I'm not sure if the game is multiplayer or not.

Cloud Run

A game where you fight children, and you can make them do anything you want, and you have a gun and you fight crime.

You can make people sick with drugs, and you can make people homeless, and you can make people commit crimes.

You can make the police and the military and the FBI and the CIA and the NSA and the CIA and the NSA and the NSA and the NSA and the NSA and the NSA and the NSA and the NSA and the NSA and the NSA and the NSA and the NSA and you can make everyone in history a billionaire.

You can make President Trump a billionaire, and all the other billionaire games like they are a game, but you can only make a few people rich, and you can only make one type of person rich, and you can only make a certain amount of people rich, and you can only make a certain amount of people homeless, and you can only make them sick.

It's a consistent trait where the Cloud Run generation seems to almost ignore the context of the prefix and then gets stuck in a loop.

Error message on Cloud Run deployment

This issue has to do with deploying on Google Cloud Run. The app runs in a local Docker container, i.e., curl http://0.0.0.0:8080 returns the desired output. I then followed the rest of the instructions to deploy to Google Cloud Run -- set the memory to 2 GB and set max requests to 1. The deployment, however, wasn't successful.

Cloud Run error: Container failed to start. Failed to start and then listen on the port defined by the PORT environment variable. Logs for this revision might contain more information.

A cursory search says that GCP expects requests on 0.0.0.0:8080. This is what app.py listens on too, so I'm not sure where the deployment error is coming from. Any idea?
