Comments (16)
Thank you for opening your first issue here! 🛠
from flyte.
The issue you're experiencing with the PyTorchPlugin setting replicas to 0, causing an infinite loop, is likely related to the configuration or deployment of the PyTorch operator in your Flyte setup. According to this discussion, ensuring the PyTorch operator is correctly deployed in your Kubernetes cluster is crucial. Additionally, there might be a need to check for any recent changes or refactoring in the plugin, as mentioned in another discussion.
Would you like more details on how to verify and potentially resolve this issue?
Sources:
- Discussion on PyTorchJob in Flyte
- Discussion on dynamic task issues
- Discussion on multiple replicas issue
This is an AI-generated response and your feedback is appreciated! Please leave a 👍 if this is helpful and 👎 if it is not.
from flyte.
I've tried this example, and it works well.
from flytekit import Resources, task
from flytekitplugins.kfpytorch import PyTorch, Worker

# %%
cpu_request = "500m"
mem_request = "500Mi"
gpu_request = "0"
mem_limit = "500Mi"
gpu_limit = "0"

# %%
@task(
    task_config=PyTorch(worker=Worker(replicas=2)),
    retries=2,
    # cache=True,
    # cache_version="0.1",
    requests=Resources(cpu=cpu_request, mem=mem_request, gpu=gpu_request),
    limits=Resources(mem=mem_limit, gpu=gpu_limit),
)
def mnist_pytorch_job() -> str:
    return "Hi"

# Dockerfile used to build the task image
FROM python:3.9-slim-buster
USER root
WORKDIR /root
ENV PYTHONPATH /root
RUN apt-get update && apt-get install build-essential -y
RUN apt-get install git -y
RUN pip install flytekitplugins-kfpytorch
from flyte.
@Future-Outlier, can you confirm which version of Flyte and flytekit you were running?
from flyte.
> @Future-Outlier, can you confirm which version of Flyte and flytekit you were running?
Flyte: master branch with single binary
flytekit: 1.12.0
kubeflow training operator: kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.7.0"
from flyte.
@Future-Outlier, can you try this out with Flyte 1.10.7? You can start single-binary via flytectl and point to a release:
flytectl demo start --version 1.10.7
from flyte.
> @Future-Outlier, can you try this out with Flyte 1.10.7? You can start single-binary via flytectl and point to a release:
> flytectl demo start --version 1.10.7
I can help, but this version fails
from flyte.
@Future-Outlier, sorry for the late reply. We require the prefix v for the version. Can you try v1.10.7?
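The prefix requirement can be guarded against before calling flytectl. The following is a minimal sketch, not part of flytectl; `normalize_release_tag` is a hypothetical helper name introduced here for illustration.

```python
def normalize_release_tag(version: str) -> str:
    """Return the version string with a leading 'v' (hypothetical helper)."""
    cleaned = version.strip()
    if not cleaned:
        raise ValueError("version must not be empty")
    # Leave already-prefixed tags untouched; add the prefix otherwise.
    return cleaned if cleaned.startswith("v") else f"v{cleaned}"


print(normalize_release_tag("1.10.7"))   # v1.10.7
print(normalize_release_tag("v1.10.7"))  # v1.10.7
```

The result can then be passed to `flytectl demo start --version`.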
from flyte.
> @Future-Outlier, sorry for the late reply. We require the prefix v for the version. Can you try v1.10.7?
No problem, doing it, will tell you the result in 1 hour
from flyte.
Hi, @eapolinario, it works.
Setup Process
1. flytectl demo start --version v1.10.7 --disable-agent
2. Edit the config map flyte-sandbox-config:
   001-plugins.yaml: |
     tasks:
       task-plugins:
         default-for-task-types:
           container: container
           container_array: k8s-array
           sidecar: sidecar
           pytorch: pytorch
         enabled-plugins:
           - container
           - sidecar
           - k8s-array
           - agent-service
           - pytorch
3. Restart the flyte-sandbox deployment.
4. Build a Docker image for the PyTorch job (listed above).
5. Specify the image from step 4 and run it on the remote cluster.
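As a quick sanity check on the plugin configuration in the setup steps, the sketch below inspects an already-parsed copy of the 001-plugins.yaml content. The dictionary values are copied from the thread; `missing_pytorch_settings` is a hypothetical helper, not a Flyte API, and loading the YAML from the cluster is left out.

```python
from typing import Any, Dict, List

# Parsed form of the 001-plugins.yaml snippet (values copied from the thread).
plugins_config: Dict[str, Any] = {
    "tasks": {
        "task-plugins": {
            "default-for-task-types": {
                "container": "container",
                "container_array": "k8s-array",
                "sidecar": "sidecar",
                "pytorch": "pytorch",
            },
            "enabled-plugins": [
                "container",
                "sidecar",
                "k8s-array",
                "agent-service",
                "pytorch",
            ],
        }
    }
}


def missing_pytorch_settings(config: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means both entries exist."""
    task_plugins = config.get("tasks", {}).get("task-plugins", {})
    problems: List[str] = []
    if "pytorch" not in task_plugins.get("enabled-plugins", []):
        problems.append("pytorch is not listed under enabled-plugins")
    if task_plugins.get("default-for-task-types", {}).get("pytorch") != "pytorch":
        problems.append("task type pytorch is not mapped to the pytorch plugin")
    return problems


print(missing_pytorch_settings(plugins_config))  # []
```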
from flyte.
@Future-Outlier, which version of the kubeflow training operator are you running? Also, can you paste the pod definition here?
from flyte.
> @Future-Outlier, which version of the kubeflow training operator are you running? Also, can you paste the pod definition here?
Hi, @eapolinario
kubeflow training operator version: v1.10.7
pod definition:
(dev) future@outlier ~ % kubectl describe pod training-operator-984cfd546-2jn65 -n kubeflow
Name: training-operator-984cfd546-2jn65
Namespace: kubeflow
Priority: 0
Service Account: training-operator
Node: 481e7e029920/172.17.0.2
Start Time: Wed, 12 Jun 2024 09:39:45 +0800
Labels: control-plane=kubeflow-training-operator
pod-template-hash=984cfd546
Annotations: sidecar.istio.io/inject: false
Status: Running
IP: 10.42.0.10
IPs:
IP: 10.42.0.10
Controlled By: ReplicaSet/training-operator-984cfd546
Containers:
training-operator:
Container ID: containerd://1f46d23264f737a7f07c26d087b91ca890a5862fd9fcf2d0b8a95c479c5db343
Image: kubeflow/training-operator:v1-855e096
Image ID: docker.io/kubeflow/training-operator@sha256:725f0adb8910336625566b391bba35391d712c0ffff6a4be02863cebceaa7cf8
Port: 8080/TCP
Host Port: 0/TCP
Command:
/manager
State: Running
Started: Wed, 12 Jun 2024 09:39:56 +0800
Ready: True
Restart Count: 0
Liveness: http-get http://:8081/healthz delay=15s timeout=3s period=20s #success=1 #failure=3
Readiness: http-get http://:8081/readyz delay=10s timeout=3s period=15s #success=1 #failure=3
Environment:
MY_POD_NAMESPACE: kubeflow (v1:metadata.namespace)
MY_POD_NAME: training-operator-984cfd546-2jn65 (v1:metadata.name)
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-tjwpd (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
kube-api-access-tjwpd:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 43s default-scheduler Successfully assigned kubeflow/training-operator-984cfd546-2jn65 to 481e7e029920
Normal Pulling 43s kubelet Pulling image "kubeflow/training-operator:v1-855e096"
Normal Pulled 32s kubelet Successfully pulled image "kubeflow/training-operator:v1-855e096" in 10.99020538s
Normal Created 32s kubelet Created container training-operator
Normal Started 32s kubelet Started container training-operator
from flyte.
@Future-Outlier, thanks for being so thorough. As a final step, can you paste the pytorchjob CR object created as part of the pytorch job and also the task pod? I just want to make sure the values are reflected there.
from flyte.
pytorchjob CR object created
(dev) future@outlier ~ % kubectl get crd pytorchjobs.kubeflow.org
NAME CREATED AT
pytorchjobs.kubeflow.org 2024-06-13T01:59:15Z
Name: f591aa743583746998b7-fg3djuyi-0
Namespace: flytesnacks-development
Labels: domain=development
execution-id=f591aa743583746998b7
interruptible=false
node-id=pytorchexamplemnistpytorchjob
project=flytesnacks
shard-key=22
task-name=pytorch-example-mnist-pytorch-job
workflow-name=flytegen-pytorch-example-mnist-pytorch-job
Annotations: cluster-autoscaler.kubernetes.io/safe-to-evict: false
API Version: kubeflow.org/v1
Kind: PyTorchJob
Metadata:
Creation Timestamp: 2024-06-13T02:04:25Z
Generation: 1
Owner References:
API Version: flyte.lyft.com/v1alpha1
Block Owner Deletion: true
Controller: true
Kind: flyteworkflow
Name: f591aa743583746998b7
UID: 4d6c99c5-81c3-4ef3-9617-d3435fe06bc3
Resource Version: 1038
UID: ca0f76c8-b3d1-4f85-8f0f-8b9dff9d99d2
Spec:
Pytorch Replica Specs:
Master:
Replicas: 1
Restart Policy: Never
Template:
Metadata:
Annotations:
cluster-autoscaler.kubernetes.io/safe-to-evict: false
Labels:
Domain: development
Execution - Id: f591aa743583746998b7
Interruptible: false
Node - Id: pytorchexamplemnistpytorchjob
Project: flytesnacks
Shard - Key: 22
Task - Name: pytorch-example-mnist-pytorch-job
Workflow - Name: flytegen-pytorch-example-mnist-pytorch-job
Spec:
Affinity:
Containers:
Args:
pyflyte-fast-execute
--additional-distribution
s3://my-s3-bucket/flytesnacks/development/ITAMN37CAV3JQW7JGLGIN66WMI======/script_mode.tar.gz
--dest-dir
.
--
pyflyte-execute
--inputs
s3://my-s3-bucket/metadata/propeller/flytesnacks-development-f591aa743583746998b7/pytorchexamplemnistpytorchjob/data/inputs.pb
--output-prefix
s3://my-s3-bucket/metadata/propeller/flytesnacks-development-f591aa743583746998b7/pytorchexamplemnistpytorchjob/data/0
--raw-output-data-prefix
s3://my-s3-bucket/data/bd/f591aa743583746998b7-fg3djuyi-0
--checkpoint-path
s3://my-s3-bucket/data/bd/f591aa743583746998b7-fg3djuyi-0/_flytecheckpoints
--prev-checkpoint
""
--resolver
flytekit.core.python_auto_container.default_task_resolver
--
task-module
pytorch_example
task-name
mnist_pytorch_job
Env:
Name: FLYTE_INTERNAL_EXECUTION_WORKFLOW
Value: flytesnacks:development:.flytegen.pytorch_example.mnist_pytorch_job
Name: FLYTE_INTERNAL_EXECUTION_ID
Value: f591aa743583746998b7
Name: FLYTE_INTERNAL_EXECUTION_PROJECT
Value: flytesnacks
Name: FLYTE_INTERNAL_EXECUTION_DOMAIN
Value: development
Name: FLYTE_ATTEMPT_NUMBER
Value: 0
Name: FLYTE_INTERNAL_TASK_PROJECT
Value: flytesnacks
Name: FLYTE_INTERNAL_TASK_DOMAIN
Value: development
Name: FLYTE_INTERNAL_TASK_NAME
Value: pytorch_example.mnist_pytorch_job
Name: FLYTE_INTERNAL_TASK_VERSION
Value: TPtCnpd9zLfeKcUJ5IeFDw
Name: FLYTE_INTERNAL_PROJECT
Value: flytesnacks
Name: FLYTE_INTERNAL_DOMAIN
Value: development
Name: FLYTE_INTERNAL_NAME
Value: pytorch_example.mnist_pytorch_job
Name: FLYTE_INTERNAL_VERSION
Value: TPtCnpd9zLfeKcUJ5IeFDw
Name: FLYTE_AWS_SECRET_ACCESS_KEY
Value: miniostorage
Name: FLYTE_AWS_ENDPOINT
Value: http://flyte-sandbox-minio.flyte:9000
Name: FLYTE_AWS_ACCESS_KEY_ID
Value: minio
Image: localhost:30000/torch-0611:latest
Name: pytorch
Resources:
Limits:
Cpu: 500m
Memory: 500Mi
Requests:
Cpu: 500m
Memory: 500Mi
Termination Message Policy: FallbackToLogsOnError
Restart Policy: Never
Worker:
Replicas: 2
Restart Policy: Never
Template:
Metadata:
Annotations:
cluster-autoscaler.kubernetes.io/safe-to-evict: false
Labels:
Domain: development
Execution - Id: f591aa743583746998b7
Interruptible: false
Node - Id: pytorchexamplemnistpytorchjob
Project: flytesnacks
Shard - Key: 22
Task - Name: pytorch-example-mnist-pytorch-job
Workflow - Name: flytegen-pytorch-example-mnist-pytorch-job
Spec:
Affinity:
Containers:
Args:
pyflyte-fast-execute
--additional-distribution
s3://my-s3-bucket/flytesnacks/development/ITAMN37CAV3JQW7JGLGIN66WMI======/script_mode.tar.gz
--dest-dir
.
--
pyflyte-execute
--inputs
s3://my-s3-bucket/metadata/propeller/flytesnacks-development-f591aa743583746998b7/pytorchexamplemnistpytorchjob/data/inputs.pb
--output-prefix
s3://my-s3-bucket/metadata/propeller/flytesnacks-development-f591aa743583746998b7/pytorchexamplemnistpytorchjob/data/0
--raw-output-data-prefix
s3://my-s3-bucket/data/bd/f591aa743583746998b7-fg3djuyi-0
--checkpoint-path
s3://my-s3-bucket/data/bd/f591aa743583746998b7-fg3djuyi-0/_flytecheckpoints
--prev-checkpoint
""
--resolver
flytekit.core.python_auto_container.default_task_resolver
--
task-module
pytorch_example
task-name
mnist_pytorch_job
Env:
Name: FLYTE_INTERNAL_EXECUTION_WORKFLOW
Value: flytesnacks:development:.flytegen.pytorch_example.mnist_pytorch_job
Name: FLYTE_INTERNAL_EXECUTION_ID
Value: f591aa743583746998b7
Name: FLYTE_INTERNAL_EXECUTION_PROJECT
Value: flytesnacks
Name: FLYTE_INTERNAL_EXECUTION_DOMAIN
Value: development
Name: FLYTE_ATTEMPT_NUMBER
Value: 0
Name: FLYTE_INTERNAL_TASK_PROJECT
Value: flytesnacks
Name: FLYTE_INTERNAL_TASK_DOMAIN
Value: development
Name: FLYTE_INTERNAL_TASK_NAME
Value: pytorch_example.mnist_pytorch_job
Name: FLYTE_INTERNAL_TASK_VERSION
Value: TPtCnpd9zLfeKcUJ5IeFDw
Name: FLYTE_INTERNAL_PROJECT
Value: flytesnacks
Name: FLYTE_INTERNAL_DOMAIN
Value: development
Name: FLYTE_INTERNAL_NAME
Value: pytorch_example.mnist_pytorch_job
Name: FLYTE_INTERNAL_VERSION
Value: TPtCnpd9zLfeKcUJ5IeFDw
Name: FLYTE_AWS_ENDPOINT
Value: http://flyte-sandbox-minio.flyte:9000
Name: FLYTE_AWS_ACCESS_KEY_ID
Value: minio
Name: FLYTE_AWS_SECRET_ACCESS_KEY
Value: miniostorage
Image: localhost:30000/torch-0611:latest
Name: pytorch
Resources:
Limits:
Cpu: 500m
Memory: 500Mi
Requests:
Cpu: 500m
Memory: 500Mi
Termination Message Policy: FallbackToLogsOnError
Restart Policy: Never
Run Policy:
Suspend: false
Status:
Completion Time: 2024-06-13T02:04:41Z
Conditions:
Last Transition Time: 2024-06-13T02:04:25Z
Last Update Time: 2024-06-13T02:04:25Z
Message: PyTorchJob f591aa743583746998b7-fg3djuyi-0 is created.
Reason: PyTorchJobCreated
Status: True
Type: Created
Last Transition Time: 2024-06-13T02:04:27Z
Last Update Time: 2024-06-13T02:04:27Z
Message: PyTorchJob f591aa743583746998b7-fg3djuyi-0 is running.
Reason: PyTorchJobRunning
Status: False
Type: Running
Last Transition Time: 2024-06-13T02:04:41Z
Last Update Time: 2024-06-13T02:04:41Z
Message: PyTorchJob f591aa743583746998b7-fg3djuyi-0 is successfully completed.
Reason: PyTorchJobSucceeded
Status: True
Type: Succeeded
Replica Statuses:
Master:
Selector: training.kubeflow.org/job-name=f591aa743583746998b7-fg3djuyi-0,training.kubeflow.org/operator-name=pytorchjob-controller,training.kubeflow.org/replica-type=master
Succeeded: 1
Worker:
Succeeded: 2
Start Time: 2024-06-13T02:04:26Z
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulCreatePod 16m pytorchjob-controller Created pod: f591aa743583746998b7-fg3djuyi-0-master-0
Normal SuccessfulCreateService 16m pytorchjob-controller Created service: f591aa743583746998b7-fg3djuyi-0-master-0
Normal SuccessfulCreatePod 16m pytorchjob-controller Created pod: f591aa743583746998b7-fg3djuyi-0-worker-0
Warning SettedPodTemplateRestartPolicy 16m (x3 over 16m) pytorchjob-controller Restart policy in pod template will be overwritten by restart policy in replica spec
Normal SuccessfulCreatePod 16m pytorchjob-controller Created pod: f591aa743583746998b7-fg3djuyi-0-worker-1
Normal SuccessfulCreateService 16m pytorchjob-controller Created service: f591aa743583746998b7-fg3djuyi-0-worker-0
Normal SuccessfulCreateService 16m pytorchjob-controller Created service: f591aa743583746998b7-fg3djuyi-0-worker-1
Normal ExitedWithCode 16m (x3 over 16m) pytorchjob-controller Pod: flytesnacks-development.f591aa743583746998b7-fg3djuyi-0-master-0 exited with code 0
Normal ExitedWithCode 16m (x2 over 16m) pytorchjob-controller Pod: flytesnacks-development.f591aa743583746998b7-fg3djuyi-0-worker-0 exited with code 0
Normal PyTorchJobSucceeded 16m pytorchjob-controller PyTorchJob f591aa743583746998b7-fg3djuyi-0 is successfully completed.
pytorch job task pod
master pod
(dev) future@outlier ~ % kubectl describe pod f591aa743583746998b7-fg3djuyi-0-master-0 -n flytesnacks-development
Name: f591aa743583746998b7-fg3djuyi-0-master-0
Namespace: flytesnacks-development
Priority: 0
Service Account: default
Node: 1002541774d8/172.17.0.2
Start Time: Thu, 13 Jun 2024 10:04:25 +0800
Labels: domain=development
execution-id=f591aa743583746998b7
interruptible=false
node-id=pytorchexamplemnistpytorchjob
project=flytesnacks
shard-key=22
task-name=pytorch-example-mnist-pytorch-job
training.kubeflow.org/job-name=f591aa743583746998b7-fg3djuyi-0
training.kubeflow.org/job-role=master
training.kubeflow.org/operator-name=pytorchjob-controller
training.kubeflow.org/replica-index=0
training.kubeflow.org/replica-type=master
workflow-name=flytegen-pytorch-example-mnist-pytorch-job
Annotations: cluster-autoscaler.kubernetes.io/safe-to-evict: false
Status: Succeeded
IP: 10.42.0.14
IPs:
IP: 10.42.0.14
Controlled By: PyTorchJob/f591aa743583746998b7-fg3djuyi-0
Containers:
pytorch:
Container ID: containerd://53ff705a7daf13acdd4a29828d4eac82ff3ee64d298fa5d6807643dfd0768ffa
Image: localhost:30000/torch-0611:latest
Image ID: localhost:30000/torch-0611@sha256:f3d76504e47fa1950347721ec159083870494c1602fbfeb9ed86dacf5a6a4d83
Port: 23456/TCP
Host Port: 0/TCP
Args:
pyflyte-fast-execute
--additional-distribution
s3://my-s3-bucket/flytesnacks/development/ITAMN37CAV3JQW7JGLGIN66WMI======/script_mode.tar.gz
--dest-dir
.
--
pyflyte-execute
--inputs
s3://my-s3-bucket/metadata/propeller/flytesnacks-development-f591aa743583746998b7/pytorchexamplemnistpytorchjob/data/inputs.pb
--output-prefix
s3://my-s3-bucket/metadata/propeller/flytesnacks-development-f591aa743583746998b7/pytorchexamplemnistpytorchjob/data/0
--raw-output-data-prefix
s3://my-s3-bucket/data/bd/f591aa743583746998b7-fg3djuyi-0
--checkpoint-path
s3://my-s3-bucket/data/bd/f591aa743583746998b7-fg3djuyi-0/_flytecheckpoints
--prev-checkpoint
""
--resolver
flytekit.core.python_auto_container.default_task_resolver
--
task-module
pytorch_example
task-name
mnist_pytorch_job
State: Terminated
Reason: Completed
Exit Code: 0
Started: Thu, 13 Jun 2024 10:04:26 +0800
Finished: Thu, 13 Jun 2024 10:04:38 +0800
Ready: False
Restart Count: 0
Limits:
cpu: 500m
memory: 500Mi
Requests:
cpu: 500m
memory: 500Mi
Environment:
FLYTE_INTERNAL_EXECUTION_WORKFLOW: flytesnacks:development:.flytegen.pytorch_example.mnist_pytorch_job
FLYTE_INTERNAL_EXECUTION_ID: f591aa743583746998b7
FLYTE_INTERNAL_EXECUTION_PROJECT: flytesnacks
FLYTE_INTERNAL_EXECUTION_DOMAIN: development
FLYTE_ATTEMPT_NUMBER: 0
FLYTE_INTERNAL_TASK_PROJECT: flytesnacks
FLYTE_INTERNAL_TASK_DOMAIN: development
FLYTE_INTERNAL_TASK_NAME: pytorch_example.mnist_pytorch_job
FLYTE_INTERNAL_TASK_VERSION: TPtCnpd9zLfeKcUJ5IeFDw
FLYTE_INTERNAL_PROJECT: flytesnacks
FLYTE_INTERNAL_DOMAIN: development
FLYTE_INTERNAL_NAME: pytorch_example.mnist_pytorch_job
FLYTE_INTERNAL_VERSION: TPtCnpd9zLfeKcUJ5IeFDw
FLYTE_AWS_SECRET_ACCESS_KEY: miniostorage
FLYTE_AWS_ENDPOINT: http://flyte-sandbox-minio.flyte:9000
FLYTE_AWS_ACCESS_KEY_ID: minio
PYTHONUNBUFFERED: 1
MASTER_PORT: 23456
PET_MASTER_PORT: 23456
MASTER_ADDR: f591aa743583746998b7-fg3djuyi-0-master-0
PET_MASTER_ADDR: f591aa743583746998b7-fg3djuyi-0-master-0
WORLD_SIZE: 3
RANK: 0
PET_NPROC_PER_NODE: auto
PET_NODE_RANK: 0
PET_NNODES: 3
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-46gf5 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kube-api-access-46gf5:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Guaranteed
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 5m23s default-scheduler Successfully assigned flytesnacks-development/f591aa743583746998b7-fg3djuyi-0-master-0 to 1002541774d8
Normal Pulling 5m23s kubelet Pulling image "localhost:30000/torch-0611:latest"
Normal Pulled 5m23s kubelet Successfully pulled image "localhost:30000/torch-0611:latest" in 8.59375ms
Normal Created 5m23s kubelet Created container pytorch
Normal Started 5m23s kubelet Started container pytorch
worker pod
(dev) future@outlier ~ % kubectl describe pod f591aa743583746998b7-fg3djuyi-0-worker-0 -n flytesnacks-development
Name: f591aa743583746998b7-fg3djuyi-0-worker-0
Namespace: flytesnacks-development
Priority: 0
Service Account: default
Node: 1002541774d8/172.17.0.2
Start Time: Thu, 13 Jun 2024 10:04:25 +0800
Labels: domain=development
execution-id=f591aa743583746998b7
interruptible=false
node-id=pytorchexamplemnistpytorchjob
project=flytesnacks
shard-key=22
task-name=pytorch-example-mnist-pytorch-job
training.kubeflow.org/job-name=f591aa743583746998b7-fg3djuyi-0
training.kubeflow.org/operator-name=pytorchjob-controller
training.kubeflow.org/replica-index=0
training.kubeflow.org/replica-type=worker
workflow-name=flytegen-pytorch-example-mnist-pytorch-job
Annotations: cluster-autoscaler.kubernetes.io/safe-to-evict: false
Status: Succeeded
IP: 10.42.0.15
IPs:
IP: 10.42.0.15
Controlled By: PyTorchJob/f591aa743583746998b7-fg3djuyi-0
Init Containers:
init-pytorch:
Container ID: containerd://8f5cdc0d84ded4e3a22a62a85ef98981e970d2f0ef67feea4afb5e240eabb044
Image: alpine:3.10
Image ID: docker.io/library/alpine@sha256:451eee8bedcb2f029756dc3e9d73bab0e7943c1ac55cff3a4861c52a0fdd3e98
Port: <none>
Host Port: <none>
Command:
sh
-c
err=1;for i in $(seq 100); do if nslookup f591aa743583746998b7-fg3djuyi-0-master-0; then err=0 && break; fi;echo waiting for master; sleep 2; done; exit $err
State: Terminated
Reason: Completed
Exit Code: 0
Started: Thu, 13 Jun 2024 10:04:33 +0800
Finished: Thu, 13 Jun 2024 10:04:33 +0800
Ready: True
Restart Count: 0
Limits:
cpu: 100m
memory: 20Mi
Requests:
cpu: 50m
memory: 10Mi
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-lw7ww (ro)
Containers:
pytorch:
Container ID: containerd://182ef52d6b7b456586f9a2d37685bd202fe76ff7e094ec50242047f9604b741b
Image: localhost:30000/torch-0611:latest
Image ID: localhost:30000/torch-0611@sha256:f3d76504e47fa1950347721ec159083870494c1602fbfeb9ed86dacf5a6a4d83
Port: 23456/TCP
Host Port: 0/TCP
Args:
pyflyte-fast-execute
--additional-distribution
s3://my-s3-bucket/flytesnacks/development/ITAMN37CAV3JQW7JGLGIN66WMI======/script_mode.tar.gz
--dest-dir
.
--
pyflyte-execute
--inputs
s3://my-s3-bucket/metadata/propeller/flytesnacks-development-f591aa743583746998b7/pytorchexamplemnistpytorchjob/data/inputs.pb
--output-prefix
s3://my-s3-bucket/metadata/propeller/flytesnacks-development-f591aa743583746998b7/pytorchexamplemnistpytorchjob/data/0
--raw-output-data-prefix
s3://my-s3-bucket/data/bd/f591aa743583746998b7-fg3djuyi-0
--checkpoint-path
s3://my-s3-bucket/data/bd/f591aa743583746998b7-fg3djuyi-0/_flytecheckpoints
--prev-checkpoint
""
--resolver
flytekit.core.python_auto_container.default_task_resolver
--
task-module
pytorch_example
task-name
mnist_pytorch_job
State: Terminated
Reason: Completed
Exit Code: 0
Started: Thu, 13 Jun 2024 10:04:34 +0800
Finished: Thu, 13 Jun 2024 10:04:41 +0800
Ready: False
Restart Count: 0
Limits:
cpu: 500m
memory: 500Mi
Requests:
cpu: 500m
memory: 500Mi
Environment:
FLYTE_INTERNAL_EXECUTION_WORKFLOW: flytesnacks:development:.flytegen.pytorch_example.mnist_pytorch_job
FLYTE_INTERNAL_EXECUTION_ID: f591aa743583746998b7
FLYTE_INTERNAL_EXECUTION_PROJECT: flytesnacks
FLYTE_INTERNAL_EXECUTION_DOMAIN: development
FLYTE_ATTEMPT_NUMBER: 0
FLYTE_INTERNAL_TASK_PROJECT: flytesnacks
FLYTE_INTERNAL_TASK_DOMAIN: development
FLYTE_INTERNAL_TASK_NAME: pytorch_example.mnist_pytorch_job
FLYTE_INTERNAL_TASK_VERSION: TPtCnpd9zLfeKcUJ5IeFDw
FLYTE_INTERNAL_PROJECT: flytesnacks
FLYTE_INTERNAL_DOMAIN: development
FLYTE_INTERNAL_NAME: pytorch_example.mnist_pytorch_job
FLYTE_INTERNAL_VERSION: TPtCnpd9zLfeKcUJ5IeFDw
FLYTE_AWS_ENDPOINT: http://flyte-sandbox-minio.flyte:9000
FLYTE_AWS_ACCESS_KEY_ID: minio
FLYTE_AWS_SECRET_ACCESS_KEY: miniostorage
PYTHONUNBUFFERED: 1
MASTER_PORT: 23456
PET_MASTER_PORT: 23456
MASTER_ADDR: f591aa743583746998b7-fg3djuyi-0-master-0
PET_MASTER_ADDR: f591aa743583746998b7-fg3djuyi-0-master-0
WORLD_SIZE: 3
RANK: 1
PET_NPROC_PER_NODE: auto
PET_NODE_RANK: 1
PET_NNODES: 3
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-lw7ww (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kube-api-access-lw7ww:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 6m26s default-scheduler Successfully assigned flytesnacks-development/f591aa743583746998b7-fg3djuyi-0-worker-0 to 1002541774d8
Normal Pulling 6m26s kubelet Pulling image "alpine:3.10"
Normal Pulled 6m19s kubelet Successfully pulled image "alpine:3.10" in 6.911665252s
Normal Created 6m19s kubelet Created container init-pytorch
Normal Started 6m19s kubelet Started container init-pytorch
Normal Pulling 6m18s kubelet Pulling image "localhost:30000/torch-0611:latest"
Normal Pulled 6m18s kubelet Successfully pulled image "localhost:30000/torch-0611:latest" in 28.368833ms
Normal Created 6m18s kubelet Created container pytorch
Normal Started 6m18s kubelet Started container pytorch
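The output above is internally consistent: the CR lists 1 Master replica and 2 Worker replicas, and both pods report WORLD_SIZE: 3. A minimal sketch of that cross-check follows. It assumes one process per replica, which matches this output but is not stated in the thread as general operator behavior; `expected_world_size` is a hypothetical helper.

```python
from typing import Dict


def expected_world_size(replicas_by_role: Dict[str, int]) -> int:
    """Sum the replicas across roles (assumes one process per replica)."""
    return sum(replicas_by_role.values())


# Values read from the PyTorchJob CR and pod environment pasted above.
replicas = {"Master": 1, "Worker": 2}
reported_world_size = 3

if expected_world_size(replicas) == reported_world_size:
    print("replica counts match WORLD_SIZE")
else:
    print("mismatch between replica counts and WORLD_SIZE")
```

A Worker count of 0 in the CR would show up here as a smaller expected value than the one intended by the task config.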
from flyte.
Hi, thank you very much for looking into this. The version was the issue. I used Flytekit version 1.12.0 to deploy the test workflow, but my Flyte backend was old, at version v1.1.32. With Flytekit version 1.2.12, the workers were set to the right number.
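A version gap like this one can be flagged before deploying. The sketch below compares the two version strings; the helper names and the `max_minor_gap` threshold are assumptions chosen for illustration, not a documented Flyte compatibility rule.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Parse 'v1.1.32' or '1.12.0' into (major, minor, patch)."""
    major, minor, patch = version.strip().lstrip("v").split(".")
    return int(major), int(minor), int(patch)


def is_far_apart(client: str, backend: str, max_minor_gap: int = 1) -> bool:
    """True if majors differ or the minor versions differ by more than the gap.

    max_minor_gap is an arbitrary illustrative threshold.
    """
    client_major, client_minor, _ = parse_version(client)
    backend_major, backend_minor, _ = parse_version(backend)
    if client_major != backend_major:
        return True
    return abs(client_minor - backend_minor) > max_minor_gap


# Versions reported in this thread.
print(is_far_apart("1.12.0", "v1.1.32"))  # True
print(is_far_apart("1.2.12", "v1.1.32"))  # False
```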
from flyte.
Nice, I can close the issue now!
from flyte.