
Comments (6)

minimaxir commented on August 23, 2024

Cloud Run may not work well here because it does not allow you to configure the number of vCPUs per service.

It may be better to use raw Knative for it until Google adds that feature.

from gpt-2-cloud-run.

minimaxir commented on August 23, 2024

Interesting issue when trying to put K80s on an n1-highcpu-16:

"The number of GPU dies is linked to the number of CPU cores and memory selected for this instance. For the current configuration, you can select no fewer than 2 GPU dies of this type."

So T4 it is.


minimaxir commented on August 23, 2024

Better solution: actually leverage Python's async support to minimize the dedicated resources needed, so we can actually use K80s.

With gpt-2-simple, the generation is done completely on the GPU, so that might work. We might be able to get away with a 4 vCPU n1-standard-4 system (1 vCPU per Pod) and use a K80 (still 4 concurrent users per Pod, so 16 users per Node). The total cost is less than half of what was proposed.

And since it would be 1 vCPU used, we could set up Cloud Run with it, which might be easier than working with Knative.
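The capacity math above can be sketched as follows (figures taken from this comment; the per-user numbers are the proposal, not measurements):

```python
# Back-of-envelope capacity for the proposed layout: one n1-standard-4
# node (4 vCPUs) with one K80, 1 vCPU per Pod, 4 concurrent users per Pod.
vcpus_per_node = 4
vcpus_per_pod = 1
users_per_pod = 4

pods_per_node = vcpus_per_node // vcpus_per_pod
users_per_node = pods_per_node * users_per_pod
print(pods_per_node, users_per_node)  # 4 16
```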


minimaxir commented on August 23, 2024

Unfortunately, this is not as easy as expected, since a tf.Session cannot be shared between threads and processes, which dramatically reduces the async possibilities.

For the initial release I might be OK without it, especially if the GPU has high enough throughput.


minimaxir commented on August 23, 2024

Update: you can share a tf.Session, but it's not easy and won't necessarily yield a performance gain. It does, however, save GPU VRAM, which is a necessary precondition (estimated ~2.5 GB ceiling when generating 4 predictions at a time, so 4 containers will fit in a 12 GB VRAM GPU).
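A quick check of that VRAM budget (the ~2.5 GB ceiling is the estimate above, not a measurement):

```python
# VRAM budget: ~2.5 GB per container generating 4 predictions at a time,
# against a K80's ~12 GB of VRAM.
vram_per_container_gb = 2.5
containers = 4
k80_vram_gb = 12

total_gb = vram_per_container_gb * containers
print(total_gb, total_gb <= k80_vram_gb)  # 10.0 True
```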

The best architecture is still 4 vCPUs + 1 GPU with 4 containers, but it may be better to see whether Cloud Run can assign all 4 vCPUs to a single container and share the session across threads (Flask's native server is threaded by default and can route accordingly), and then see if that causes any deadlocks.
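A minimal sketch of the shared-session-across-threads idea, with a plain Python stand-in for the tf.Session so it runs anywhere (`FakeSession` and `handle_request` are hypothetical; real code would call `sess.run(...)` or `gpt2.generate(sess, ...)` inside the lock):

```python
import threading

class FakeSession:
    """Stand-in for a tf.Session loaded once at startup."""
    def __init__(self):
        self.calls = 0

    def run(self, prompt):
        self.calls += 1
        return f"generated({prompt})"

sess = FakeSession()          # one shared session for all request threads
sess_lock = threading.Lock()  # serialize generation: one request on the
                              # GPU at a time caps VRAM use and avoids
                              # interleaved session access

results = []

def handle_request(prompt):
    # Each request thread (as Flask's threaded server would spawn) takes
    # the lock, runs generation on the shared session, and releases it.
    with sess_lock:
        out = sess.run(prompt)
    results.append(out)

threads = [threading.Thread(target=handle_request, args=(f"p{i}",))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sess.calls)  # 4 — every request served by the single shared session
```

The lock is the deadlock-risk point mentioned above: if generation ever blocked while holding it, all request threads would stall behind it.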


kshtzgupta1 commented on August 23, 2024

Hi Max! Thank you so much for creating gpt-2-cloud-run. It's been really useful and inspiring for my GPT-2 webapp. For this webapp I'm trying to deploy a finetuned 345M GPT-2 model (~1.4 GB) through Cloud Run on GKE, but I'm unsure about the spec of the GKE cluster as well as what concurrency I should set.

Can you please advise on the number of nodes, the machine type, and the concurrency I should use for maximum cost-effectiveness? Currently I have a concurrency of 1 along with just 1 node (n1-standard-2; 7.5 GB; 2 vCPUs) and a K80 attached to that node, but I'm not sure this is the most cost-effective GKE spec.

I would really appreciate any insights on this! If it helps, I intend to deploy only this model and don't plan on having any more GPT-2 webapps.

