
Comments (6)

minimaxir commented on August 23, 2024

Cloud Run may not work well here because it does not allow you to configure the number of vCPUs per service.

It may be better to use raw Knative for it until Google adds that feature.

from gpt-2-cloud-run.

minimaxir commented on August 23, 2024

Interesting issue when trying to put K80s on an n1-highcpu-16:

"The number of GPU dies is linked to the number of CPU cores and memory selected for this instance. For the current configuration, you can select no fewer than 2 GPU dies of this type."

So T4 it is.


minimaxir commented on August 23, 2024

Better solution: actually leverage Python's async support to minimize the dedicated resources needed, so we can actually use K80s.

With gpt-2-simple, the generation is done completely on the GPU, so that might work. We might be able to get away with a 4 vCPU n1-standard-4 system (1 vCPU per Pod) and use a K80 (still 4 concurrent users per Pod, so 16 users per Node). The total cost is less than half of what was proposed.

And since it would be 1 vCPU used, we could set up Cloud Run with it, which might be easier than working with Knative.
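The capacity math above can be sketched as follows (figures taken from this comment; the per-user numbers are the proposal, not measurements):

```python
# Back-of-envelope capacity for the proposed layout: one n1-standard-4
# node (4 vCPUs) with one K80, 1 vCPU per Pod, 4 concurrent users per Pod.
vcpus_per_node = 4
vcpus_per_pod = 1
users_per_pod = 4

pods_per_node = vcpus_per_node // vcpus_per_pod
users_per_node = pods_per_node * users_per_pod
print(pods_per_node, users_per_node)  # 4 16
```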


minimaxir commented on August 23, 2024

Unfortunately, this is not as easy as expected, since a tf.Session cannot be shared between threads and processes, which dramatically reduces the async possibilities.

For the initial release I might be OK without it, especially if the GPU has high enough throughput.


minimaxir commented on August 23, 2024

Update: you can share a tf.Session, but it's not easy and won't necessarily yield a performance gain. It does, however, save GPU VRAM, which is a necessary precondition (estimated ~2.5 GB ceiling when generating 4 predictions at a time, so 4 containers will fit in a 12 GB VRAM GPU).
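A quick check of that VRAM budget (the ~2.5 GB ceiling is the estimate above, not a measurement):

```python
# VRAM budget: ~2.5 GB per container generating 4 predictions at a time,
# against a K80's ~12 GB of VRAM.
vram_per_container_gb = 2.5
containers = 4
k80_vram_gb = 12

total_gb = vram_per_container_gb * containers
print(total_gb, total_gb <= k80_vram_gb)  # 10.0 True
```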

The best architecture is still 4 vCPUs + 1 GPU with 4 containers, but it may be better to see whether Cloud Run can assign all 4 vCPUs to a single container and share the session across threads (Flask's native server is threaded by default and can route accordingly), and then see if that causes any deadlocks.
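A minimal sketch of the shared-session-across-threads idea, with a plain Python stand-in for the tf.Session so it runs anywhere (`FakeSession` and `handle_request` are hypothetical; real code would call `sess.run(...)` or `gpt2.generate(sess, ...)` inside the lock):

```python
import threading

class FakeSession:
    """Stand-in for a tf.Session loaded once at startup."""
    def __init__(self):
        self.calls = 0

    def run(self, prompt):
        self.calls += 1
        return f"generated({prompt})"

sess = FakeSession()          # one shared session for all request threads
sess_lock = threading.Lock()  # serialize generation: one request on the
                              # GPU at a time caps VRAM use and avoids
                              # interleaved session access

results = []

def handle_request(prompt):
    # Each request thread (as Flask's threaded server would spawn) takes
    # the lock, runs generation on the shared session, and releases it.
    with sess_lock:
        out = sess.run(prompt)
    results.append(out)

threads = [threading.Thread(target=handle_request, args=(f"p{i}",))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sess.calls)  # 4 — every request served by the single shared session
```

The lock is the deadlock-risk point mentioned above: if generation ever blocked while holding it, all request threads would stall behind it.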


kshtzgupta1 commented on August 23, 2024

Hi Max! Thank you so much for creating gpt-2-cloud-run. It's been really useful and inspiring for my GPT-2 webapp. For this webapp I'm trying to deploy a finetuned 345M GPT-2 model (~1.4 GB) through Cloud Run on GKE, but I'm unsure about the spec of the GKE cluster as well as what concurrency I should set.

Can you please advise on the number of nodes, the machine type, and the concurrency I should use for maximum cost-effectiveness? Currently I have a concurrency of 1 along with just 1 node (n1-standard-2; 7.5 GB; 2 vCPUs) and a K80 attached to that node, but I'm not sure this is the most cost-effective GKE spec.

I would really appreciate any insights on this! If it helps, I intend to deploy only this model and don't plan on having any more GPT-2 webapps.

