Comments (15)

goswamig avatar goswamig commented on May 29, 2024

@charlesa101 Thanks for trying it out. I am assuming you have replaced the input and output buckets and the role ARN.

Would you please run the following commands and provide the output?

kubectl get trainingjobs xgboost-mnist
kubectl describe trainingjob xgboost-mnist

from amazon-sagemaker-operator-for-k8s.

charlesa101 avatar charlesa101 commented on May 29, 2024

@gautamkmr, here you go, thank you! Yeah, I have my own bucket and SageMaker execution role.

NAME            STATUS   SECONDARY-STATUS   CREATION-TIME          SAGEMAKER-JOB-NAME
xgboost-mnist                               2020-03-09T16:51:08Z

```
kubectl describe TrainingJob
Name:         xgboost-mnist
Namespace:    default
Labels:       <none>
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"sagemaker.aws.amazon.com/v1","kind":"TrainingJob","metadata":{"annotations":{},"name":"xgboost-mnist","namespace":"default"...
API Version:  sagemaker.aws.amazon.com/v1
Kind:         TrainingJob
Metadata:
  Creation Timestamp:  2020-03-09T06:58:17Z
  Generation:          1
  Resource Version:    117181
  Self Link:           /apis/sagemaker.aws.amazon.com/v1/namespaces/default/trainingjobs/xgboost-mnist
  UID:                 5a907178-61d3-11ea-b461-02efd6507006
Spec:
  Algorithm Specification:
    Training Image:       825641698319.dkr.ecr.us-east-2.amazonaws.com/xgboost:latest
    Training Input Mode:  File
  Hyper Parameters:
    Name:   max_depth
    Value:  5
    Name:   eta
    Value:  0.2
    Name:   gamma
    Value:  4
    Name:   min_child_weight
    Value:  6
    Name:   silent
    Value:  0
    Name:   objective
    Value:  multi:softmax
    Name:   num_class
    Value:  10
    Name:   num_round
    Value:  10
  Input Data Config:
    Channel Name:      train
    Compression Type:  None
    Content Type:      text/csv
    Data Source:
      S 3 Data Source:
        S 3 Data Distribution Type:  FullyReplicated
        S 3 Data Type:               S3Prefix
        S 3 Uri:                     s3://<MY-BUCKET>/xgboost-mnist/train/
    Channel Name:                    validation
    Compression Type:                None
    Content Type:                    text/csv
    Data Source:
      S 3 Data Source:
        S 3 Data Distribution Type:  FullyReplicated
        S 3 Data Type:               S3Prefix
        S 3 Uri:                     s3://<MY-BUCKET>/xgboost-mnist/validation/
  Output Data Config:
    S 3 Output Path:  s3://<MY-BUCKET>/xgboost-mnist/models/
  Region:             us-east-2
  Resource Config:
    Instance Count:     1
    Instance Type:      ml.m4.xlarge
    Volume Size In GB:  5
  Role Arn:             arn:aws:iam::<ACCOUNT>:role/sagemaker_execution_role
  Stopping Condition:
    Max Runtime In Seconds:  86400
```

goswamig avatar goswamig commented on May 29, 2024

@charlesa101 Thanks for providing the output. It appears that the operator is not running successfully on your k8s cluster. You can verify that with:

kubectl get pods -A | grep -i sagemaker

You can follow the steps from here to install the operator; let us know if you face any issues.
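
A minimal sketch of that check as a shell helper, in case it is useful. It assumes the column layout printed by `kubectl get pods -A` (NAMESPACE, NAME, READY, STATUS, ...); the function name `operator_pods_not_running` is made up for illustration and is not part of the operator.

```shell
#!/bin/sh
# Hypothetical helper: list SageMaker operator pods that are not in Running state.
# Reads the table printed by `kubectl get pods -A` from stdin.
operator_pods_not_running() {
  # NR > 1 skips the header row; $2 is the pod NAME and $4 is the STATUS column.
  awk 'NR > 1 && tolower($2) ~ /sagemaker/ && $4 != "Running" { print $1 "/" $2, $4 }'
}
```

It would be used as `kubectl get pods -A | operator_pods_not_running`. No output means every matching pod reports Running; a line such as `<namespace>/<pod> Pending` points at the pod to describe next.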

charlesa101 avatar charlesa101 commented on May 29, 2024

yeah that's what i noticed as well now

NAME                                                         READY   STATUS    RESTARTS   AGE
sagemaker-k8s-operator-controller-manager-5858fd7b8d-h89s8   0/2     Pending   0          24h

charlesa101 avatar charlesa101 commented on May 29, 2024
Name:               sagemaker-k8s-operator-controller-manager-5858fd7b8d-h89s8
Namespace:          sagemaker-k8s-operator-system
Priority:           0
PriorityClassName:  <none>
Node:               <none>
Labels:             control-plane=controller-manager
                    pod-template-hash=5858fd7b8d
Annotations:        kubernetes.io/psp: eks.privileged
Status:             Pending
IP:                 
Controlled By:      ReplicaSet/sagemaker-k8s-operator-controller-manager-5858fd7b8d
Containers:
  kube-rbac-proxy:
    Image:      gcr.io/kubebuilder/kube-rbac-proxy:v0.4.0
    Port:       8443/TCP
    Host Port:  0/TCP
    Args:
      --secure-listen-address=0.0.0.0:8443
      --upstream=http://127.0.0.1:8080/
      --logtostderr=true
      --v=10
    Environment:
      AWS_ROLE_ARN:                 arn:aws:iam::123456789012:role/DELETE_ME
      AWS_WEB_IDENTITY_TOKEN_FILE:  /var/run/secrets/eks.amazonaws.com/serviceaccount/token
    Mounts:
      /var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from sagemaker-k8s-operator-default-token-rwdkn (ro)
  manager:
    Image:      957583890962.dkr.ecr.us-east-1.amazonaws.com/amazon-sagemaker-operator-for-k8s:v1
    Port:       <none>
    Host Port:  <none>
    Command:
      /manager
    Args:
      --metrics-addr=127.0.0.1:8080
    Limits:
      cpu:     100m
      memory:  30Mi
    Requests:
      cpu:     100m
      memory:  20Mi
    Environment:
      AWS_DEFAULT_SAGEMAKER_ENDPOINT:  
      AWS_ROLE_ARN:                    arn:aws:iam::123456789012:role/DELETE_ME
      AWS_WEB_IDENTITY_TOKEN_FILE:     /var/run/secrets/eks.amazonaws.com/serviceaccount/token
    Mounts:
      /var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from sagemaker-k8s-operator-default-token-rwdkn (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  aws-iam-token:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  86400
  sagemaker-k8s-operator-default-token-rwdkn:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  sagemaker-k8s-operator-default-token-rwdkn
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age                   From               Message
  ----     ------            ----                  ----               -------
  Warning  FailedScheduling  64s (x1378 over 34h)  default-scheduler  no nodes available to schedule pods

charlesa101 avatar charlesa101 commented on May 29, 2024

My EKS/ECR is in us-east-2, but it seems all the CRD artifacts are coming from us-east-1. Could that be the issue?

goswamig avatar goswamig commented on May 29, 2024

EKS can pull the image from another region too. I think in your case you don't have any worker nodes associated with the cluster. At least that's what the message below says.

  Warning  FailedScheduling  64s (x1378 over 34h)  default-scheduler  no nodes available to schedule pods

Can you run the following?

kubectl get node
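
As a rough sketch, the output of that command can be reduced to a single number of schedulable nodes. This assumes the usual `kubectl get nodes` table with STATUS in the second column; `count_ready_nodes` is a made-up name for illustration.

```shell
#!/bin/sh
# Hypothetical helper: count nodes whose STATUS is exactly "Ready".
# Reads the table printed by `kubectl get nodes` from stdin.
count_ready_nodes() {
  # NR > 1 skips the header row. A cordoned node shows "Ready,SchedulingDisabled"
  # and is deliberately not counted, since new pods cannot be scheduled on it.
  awk 'NR > 1 && $2 == "Ready" { n++ } END { print n + 0 }'
}
```

Running `kubectl get nodes | count_ready_nodes` and getting `0` would be consistent with the "no nodes available to schedule pods" event above.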

goswamig avatar goswamig commented on May 29, 2024

@charlesa101 Did you get a chance to review it again?

charlesa101 avatar charlesa101 commented on May 29, 2024
NAME                                           STATUS   ROLES    AGE     VERSION
ip-172-16-116-51.us-east-2.compute.internal    Ready    <none>   5h47m   v1.14.8-eks-b8860f
ip-172-16-121-255.us-east-2.compute.internal   Ready    <none>   5h47m   v1.14.8-eks-b8860f
ip-172-16-137-197.us-east-2.compute.internal   Ready    <none>   5h47m   v1.14.8-eks-b8860f

charlesa101 avatar charlesa101 commented on May 29, 2024

Yeah, I did. I recreated the cluster, but it is still the same issue.

goswamig avatar goswamig commented on May 29, 2024

@charlesa101 In the previous describe output of the pod, it appears that the cluster did not have any worker nodes available (no nodes available to schedule pods).

But based on the recent output, it appears that you have three worker nodes available.

NAME                                           STATUS   ROLES    AGE     VERSION
ip-172-16-116-51.us-east-2.compute.internal    Ready    <none>   5h47m   v1.14.8-eks-b8860f
ip-172-16-121-255.us-east-2.compute.internal   Ready    <none>   5h47m   v1.14.8-eks-b8860f
ip-172-16-137-197.us-east-2.compute.internal   Ready    <none>   5h47m   v1.14.8-eks-b8860f

Could you please describe each of these nodes and the operator pod?

# Describe the nodes, assuming the node names are the same as in your previous comment.
kubectl describe node ip-172-16-116-51.us-east-2.compute.internal
kubectl describe node ip-172-16-121-255.us-east-2.compute.internal
kubectl describe node ip-172-16-137-197.us-east-2.compute.internal
# Get the operator pod name
kubectl get pods -A | grep -i sagemaker
kubectl describe pod <put the pod name here> -n sagemaker-k8s-operator-system

If the operator has been deployed successfully and the trainingjob is still not running, please attach the output of describe trainingjob as well.

kubectl describe trainingjob xgboost-mnist

charlesa101 avatar charlesa101 commented on May 29, 2024

I checked the operator pod; here is the log, @gautamkmr.

kubectl logs -f sagemaker-k8s-operator-controller-manager-5858fd7b8d-2dk5c  -n sagemaker-k8s-operator-system manager
2020-03-15T18:09:13.864Z        INFO    controller-runtime.metrics      metrics server is starting to listen    {"addr": "127.0.0.1:8080"}
2020-03-15T18:09:13.865Z        INFO    controller-runtime.controller   Starting EventSource    {"controller": "trainingjob", "source": "kind source: /, Kind="}
2020-03-15T18:09:13.865Z        INFO    controller-runtime.controller   Starting EventSource    {"controller": "hyperparametertuningjob", "source": "kind source: /, Kind="}
2020-03-15T18:09:13.865Z        INFO    controller-runtime.controller   Starting EventSource    {"controller": "hostingdeployment", "source": "kind source: /, Kind="}
2020-03-15T18:09:13.866Z        INFO    controller-runtime.controller   Starting EventSource    {"controller": "model", "source": "kind source: /, Kind="}
2020-03-15T18:09:13.866Z        INFO    controller-runtime.controller   Starting EventSource    {"controller": "endpointconfig", "source": "kind source: /, Kind="}
2020-03-15T18:09:13.866Z        INFO    controller-runtime.controller   Starting EventSource    {"controller": "batchtransformjob", "source": "kind source: /, Kind="}
2020-03-15T18:09:13.866Z        INFO    setup   starting manager
2020-03-15T18:09:13.866Z        INFO    controller-runtime.manager      starting metrics server {"path": "/metrics"}
2020-03-15T18:09:14.066Z        INFO    controller-runtime.controller   Starting Controller     {"controller": "trainingjob"}
2020-03-15T18:09:14.066Z        INFO    controller-runtime.controller   Starting Controller     {"controller": "model"}
2020-03-15T18:09:14.067Z        INFO    controller-runtime.controller   Starting Controller     {"controller": "batchtransformjob"}
2020-03-15T18:09:14.067Z        INFO    controller-runtime.controller   Starting Controller     {"controller": "hostingdeployment"}
2020-03-15T18:09:14.066Z        INFO    controller-runtime.controller   Starting Controller     {"controller": "endpointconfig"}
2020-03-15T18:09:14.067Z        INFO    controller-runtime.controller   Starting Controller     {"controller": "hyperparametertuningjob"}
2020-03-15T18:09:14.167Z        INFO    controller-runtime.controller   Starting workers        {"controller": "trainingjob", "worker count": 1}
2020-03-15T18:09:14.167Z        INFO    controller-runtime.controller   Starting workers        {"controller": "model", "worker count": 1}
2020-03-15T18:09:14.167Z        INFO    controller-runtime.controller   Starting workers        {"controller": "endpointconfig", "worker count": 1}
2020-03-15T18:09:14.167Z        INFO    controller-runtime.controller   Starting workers        {"controller": "batchtransformjob", "worker count": 1}
2020-03-15T18:09:14.167Z        INFO    controller-runtime.controller   Starting workers        {"controller": "hostingdeployment", "worker count": 1}
2020-03-15T18:09:14.167Z        INFO    controller-runtime.controller   Starting workers        {"controller": "hyperparametertuningjob", "worker count": 1}
2020-03-15T19:09:19.962Z        INFO    controllers.TrainingJob Getting resource        {"trainingjob": "default/xgboost-mnist"}
2020-03-15T19:09:19.962Z        INFO    controllers.TrainingJob Job status is empty, setting to intermediate status     {"trainingjob": "default/xgboost-mnist", "status": "SynchronizingK8sJobWithSageMaker"}
2020-03-15T19:09:19.963Z        INFO    controllers.TrainingJob Updating job status     {"trainingjob": "default/xgboost-mnist", "new-status": {"trainingJobStatus":"SynchronizingK8sJobWithSageMaker","lastCheckTime":"2020-03-15T19:09:19Z"}}
2020-03-15T19:09:19.976Z        INFO    controllers.TrainingJob Getting resource        {"trainingjob": "default/xgboost-mnist"}
2020-03-15T19:09:19.976Z        INFO    controllers.TrainingJob Adding generated name to spec   {"trainingjob": "default/xgboost-mnist", "new-name": "xgboost-mnist-792eb47166f011ea88d202c3652bf444"}
2020-03-15T19:09:19.982Z        DEBUG   controller-runtime.controller   Successfully Reconciled {"controller": "trainingjob", "request": "default/xgboost-mnist"}
2020-03-15T19:09:19.983Z        INFO    controllers.TrainingJob Getting resource        {"trainingjob": "default/xgboost-mnist"}
2020-03-15T19:09:19.983Z        INFO    controllers.TrainingJob Loaded AWS config       {"trainingjob": "default/xgboost-mnist", "training-job-name": "xgboost-mnist-792eb47166f011ea88d202c3652bf444", "aws-region": "us-east-2"}
2020-03-15T19:09:19.983Z        INFO    controllers.TrainingJob Calling SM API DescribeTrainingJob      {"trainingjob": "default/xgboost-mnist", "training-job-name": "xgboost-mnist-792eb47166f011ea88d202c3652bf444", "aws-region": "us-east-2"}
2020-03-15T19:09:20.916Z        ERROR   controllers.TrainingJob.handleSageMakerApiError Handling unrecoverable sagemaker API error      {"trainingjob": "default/xgboost-mnist", "training-job-name": "xgboost-mnist-792eb47166f011ea88d202c3652bf444", "aws-region": "us-east-2", "error": "UnrecognizedClientException: The security token included in the request is invalid.\n\tstatus code: 400, request id: 01ea5be5-6bd5-4bae-b79e-2bc8d86338ee"}
github.com/go-logr/zapr.(*zapLogger).Error
        /go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128
go.amzn.com/sagemaker/sagemaker-k8s-operator/controllers/trainingjob.(*TrainingJobReconciler).handleSageMakerApiError
        /workspace/controllers/trainingjob/trainingjob_controller.go:396
go.amzn.com/sagemaker/sagemaker-k8s-operator/controllers/trainingjob.(*TrainingJobReconciler).Reconcile
        /workspace/controllers/trainingjob/trainingjob_controller.go:172
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:216
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:192
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:171
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:88
2020-03-15T19:09:20.916Z        INFO    controllers.TrainingJob.handleSageMakerApiError Updating job status     {"trainingjob": "default/xgboost-mnist", "training-job-name": "xgboost-mnist-792eb47166f011ea88d202c3652bf444", "aws-region": "us-east-2", "new-status": {"trainingJobStatus":"Failed","additional":"UnrecognizedClientException: The security token included in the request is invalid.\n\tstatus code: 400, request id: 01ea5be5-6bd5-4bae-b79e-2bc8d86338ee","lastCheckTime":"2020-03-15T19:09:20Z","cloudWatchLogUrl":"https://us-east-2.console.aws.amazon.com/cloudwatch/home?region=us-east-2#logStream:group=/aws/sagemaker/TrainingJobs;prefix=xgboost-mnist-792eb47166f011ea88d202c3652bf444;streamFilter=typeLogStreamPrefix","sageMakerTrainingJobName":"xgboost-mnist-792eb47166f011ea88d202c3652bf444"}}
2020-03-15T19:09:20.924Z        DEBUG   controller-runtime.controller   Successfully Reconciled {"controller": "trainingjob", "request": "default/xgboost-mnist"}
2020-03-15T19:11:41.623Z        INFO    controllers.TrainingJob Getting resource        {"trainingjob": "default/xgboost-mnist"}
2020-03-15T19:11:41.623Z        INFO    controllers.TrainingJob Loaded AWS config       {"trainingjob": "default/xgboost-mnist", "training-job-name": "xgboost-mnist-792eb47166f011ea88d202c3652bf444", "aws-region": "us-east-2"}
2020-03-15T19:11:41.623Z        INFO    controllers.TrainingJob Calling SM API DescribeTrainingJob      {"trainingjob": "default/xgboost-mnist", "training-job-name": "xgboost-mnist-792eb47166f011ea88d202c3652bf444", "aws-region": "us-east-2"}
2020-03-15T19:11:42.150Z        ERROR   controllers.TrainingJob.handleSageMakerApiError Handling unrecoverable sagemaker API error      {"trainingjob": "default/xgboost-mnist", "training-job-name": "xgboost-mnist-792eb47166f011ea88d202c3652bf444", "aws-region": "us-east-2", "error": "UnrecognizedClientException: The security token included in the request is invalid.\n\tstatus code: 400, request id: 7145c885-b685-4663-8dd3-6c212ce574b2"}
github.com/go-logr/zapr.(*zapLogger).Error
        /go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128
go.amzn.com/sagemaker/sagemaker-k8s-operator/controllers/trainingjob.(*TrainingJobReconciler).handleSageMakerApiError
        /workspace/controllers/trainingjob/trainingjob_controller.go:396
go.amzn.com/sagemaker/sagemaker-k8s-operator/controllers/trainingjob.(*TrainingJobReconciler).Reconcile
        /workspace/controllers/trainingjob/trainingjob_controller.go:172
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:216
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:192
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:171
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:88
2020-03-15T19:11:42.150Z        INFO    controllers.TrainingJob.handleSageMakerApiError Updating job status     {"trainingjob": "default/xgboost-mnist", "training-job-name": "xgboost-mnist-792eb47166f011ea88d202c3652bf444", "aws-region": "us-east-2", "new-status": {"trainingJobStatus":"Failed","additional":"UnrecognizedClientException: The security token included in the request is invalid.\n\tstatus code: 400, request id: 7145c885-b685-4663-8dd3-6c212ce574b2","lastCheckTime":"2020-03-15T19:11:42Z","cloudWatchLogUrl":"https://us-east-2.console.aws.amazon.com/cloudwatch/home?region=us-east-2#logStream:group=/aws/sagemaker/TrainingJobs;prefix=xgboost-mnist-792eb47166f011ea88d202c3652bf444;streamFilter=typeLogStreamPrefix","sageMakerTrainingJobName":"xgboost-mnist-792eb47166f011ea88d202c3652bf444"}}
2020-03-15T19:11:42.159Z        DEBUG   controller-runtime.controller   Successfully Reconciled {"controller": "trainingjob", "request": "default/xgboost-mnist"}

goswamig avatar goswamig commented on May 29, 2024

@charlesa101 Thanks for sharing the log. You are on the right track. I think the issue now is that the operator pod is unable to retrieve credentials from the IAM service to talk to SageMaker.

"error": "UnrecognizedClientException: The security token included in the request is invalid.\n

Could you please check your trust.json? The trust policy has three places where the cluster region and OIDC ID need to be updated, and one place where your AWS account number needs to be added.
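
For reference, here is a hedged sketch of what such a trust.json usually looks like, written as a small shell function so the placeholders are filled in consistently. The policy layout is the standard one for IAM roles for service accounts on EKS and is an assumption here, not copied from this thread. The namespace `sagemaker-k8s-operator-system` is taken from the pod description above, the service account name `sagemaker-k8s-operator-default` is inferred from the token secret name in that output, and `render_trust_policy` is a made-up name.

```shell
#!/bin/sh
# Hypothetical helper: print a trust policy for the operator's service account.
# Arguments: <aws-account-id> <eks-cluster-region> <oidc-id>
render_trust_policy() {
  account=$1
  region=$2
  oidc_id=$3
  # The region and OIDC ID appear together in the provider string, which is used three times below.
  provider="oidc.eks.${region}.amazonaws.com/id/${oidc_id}"
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::${account}:oidc-provider/${provider}"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "${provider}:aud": "sts.amazonaws.com",
          "${provider}:sub": "system:serviceaccount:sagemaker-k8s-operator-system:sagemaker-k8s-operator-default"
        }
      }
    }
  ]
}
EOF
}
```

Something like `render_trust_policy <ACCOUNT> us-east-2 <OIDC ID> > trust.json` would write the file. The OIDC ID is the last path segment of the issuer URL returned by `aws eks describe-cluster --name <cluster name> --query cluster.identity.oidc.issuer --output text`.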

surajkota avatar surajkota commented on May 29, 2024

Hi @charlesa101

Closing this issue since there has been no activity in 90 days. Please re-open it if you still need help.

Thanks

angadkalra avatar angadkalra commented on May 29, 2024

Hi, I'm having the exact same issue, except that my pod is running fine. I set up my k8s cluster using Terraform with 1 master node and 1 worker node. When I submit the trainingjob, there is no status, job name, or anything else. I tried all the commands above, and it looks like the scheduler was able to assign the pods to the worker node. Any help would be appreciated! Please see the command outputs below:

ubuntu@ip-172-31-35-229:/imvaria/repos/model-training$ kubectl get pods -A
NAMESPACE        NAME                                                         READY   STATUS    RESTARTS   AGE
kube-system      aws-node-67tgx                                               1/1     Running   0          2d18h
kube-system      aws-node-k2q7z                                               1/1     Running   0          2d18h
kube-system      coredns-85d5b4454c-cwfvj                                     1/1     Running   0          2d18h
kube-system      coredns-85d5b4454c-x5ld9                                     1/1     Running   0          2d18h
kube-system      kube-proxy-54vm5                                             1/1     Running   0          2d18h
kube-system      kube-proxy-r8j7j                                             1/1     Running   0          2d18h
kube-system      metrics-server-64cf6869bd-6nppx                              1/1     Running   0          2d18h
sagemaker-jobs   sagemaker-k8s-operator-controller-manager-855f498957-fhkvv   2/2     Running   0          2d18h
ubuntu@ip-172-31-35-229:/imvaria/repos/model-training$ kubectl describe pod sagemaker-k8s-operator-controller-manager-855f498957-fhkvv -n sagemaker-jobs
Name:         sagemaker-k8s-operator-controller-manager-855f498957-fhkvv
Namespace:    sagemaker-jobs
Priority:     0
Node:         ip-10-0-1-245.us-west-2.compute.internal/10.0.1.245
Start Time:   Fri, 24 Jun 2022 22:26:03 +0000
Labels:       control-plane=controller-manager
              pod-template-hash=855f498957
Annotations:  kubernetes.io/psp: eks.privileged
Status:       Running
IP:           10.0.1.144
IPs:
  IP:           10.0.1.144
Controlled By:  ReplicaSet/sagemaker-k8s-operator-controller-manager-855f498957
Containers:
  manager:
    Container ID:  docker://d8fc52b3e20a050999d3f24ab914f1d865a84a168a8b038f3fa81ce59cccbced
    Image:         957583890962.dkr.ecr.us-east-1.amazonaws.com/amazon-sagemaker-operator-for-k8s:v1
    Image ID:      docker-pullable://957583890962.dkr.ecr.us-east-1.amazonaws.com/amazon-sagemaker-operator-for-k8s@sha256:94ffbba68954249b1724fdb43f1e8ab13547114555b4a217849687d566191e23
    Port:          <none>
    Host Port:     <none>
    Command:
      /manager
    Args:
      --metrics-addr=127.0.0.1:8080
      --namespace=sagemaker-jobs
    State:          Running
      Started:      Fri, 24 Jun 2022 22:26:09 +0000
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     100m
      memory:  30Mi
    Requests:
      cpu:     100m
      memory:  20Mi
    Environment:
      AWS_DEFAULT_SAGEMAKER_ENDPOINT:
      AWS_DEFAULT_REGION:              us-west-2
      AWS_REGION:                      us-west-2
      AWS_ROLE_ARN:                    arn:aws:iam::438029713005:role/model-training-sagemaker-role20220624222338450100000009
      AWS_WEB_IDENTITY_TOKEN_FILE:     /var/run/secrets/eks.amazonaws.com/serviceaccount/token
    Mounts:
      /var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-6j8rt (ro)
  kube-rbac-proxy:
    Container ID:  docker://4ecdaa395fdc70d5cead609465dbf21f6e11771a80ad5db0a6125053ab08b9d3
    Image:         gcr.io/kubebuilder/kube-rbac-proxy:v0.4.0
    Image ID:      docker-pullable://gcr.io/kubebuilder/kube-rbac-proxy@sha256:297896d96b827bbcb1abd696da1b2d81cab88359ac34cce0e8281f266b4e08de
    Port:          8443/TCP
    Host Port:     0/TCP
    Args:
      --secure-listen-address=0.0.0.0:8443
      --upstream=http://127.0.0.1:8080/
      --logtostderr=true
      --v=10
    State:          Running
      Started:      Fri, 24 Jun 2022 22:26:11 +0000
    Ready:          True
    Restart Count:  0
    Environment:
      AWS_DEFAULT_REGION:           us-west-2
      AWS_REGION:                   us-west-2
      AWS_ROLE_ARN:                 arn:aws:iam::438029713005:role/model-training-sagemaker-role20220624222338450100000009
      AWS_WEB_IDENTITY_TOKEN_FILE:  /var/run/secrets/eks.amazonaws.com/serviceaccount/token
    Mounts:
      /var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-6j8rt (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  aws-iam-token:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  86400
  kube-api-access-6j8rt:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:                      <none>
ubuntu@ip-172-31-35-229:/imvaria/repos/model-training$ kubectl logs sagemaker-k8s-operator-controller-manager-855f498957-fhkvv manager -n sagemaker-jobs
I0624 22:26:11.339445       1 request.go:621] Throttling request took 1.046981399s, request: GET:https://172.20.0.1:443/apis/extensions/v1beta1?timeout=32s
2022-06-24T22:26:12.443Z        INFO    controller-runtime.metrics      metrics server is starting to listen    {"addr": "127.0.0.1:8080"}
2022-06-24T22:26:12.443Z        INFO    Starting manager in the namespace:      sagemaker-jobs
2022-06-24T22:26:12.443Z        INFO    setup   starting manager
2022-06-24T22:26:12.444Z        INFO    controller-runtime.manager      starting metrics server {"path": "/metrics"}
2022-06-24T22:26:12.444Z        INFO    controller      Starting EventSource    {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "EndpointConfig", "controller": "endpointconfig", "source": "kind source: /, Kind="}
2022-06-24T22:26:12.444Z        INFO    controller      Starting EventSource    {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "BatchTransformJob", "controller": "batchtransformjob", "source": "kind source: /, Kind="}
2022-06-24T22:26:12.445Z        INFO    controller      Starting EventSource    {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "HostingAutoscalingPolicy", "controller": "hostingautoscalingpolicy", "source": "kind source: /, Kind="}
2022-06-24T22:26:12.444Z        INFO    controller      Starting EventSource    {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "Model", "controller": "model", "source": "kind source: /, Kind="}
2022-06-24T22:26:12.444Z        INFO    controller      Starting EventSource    {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "TrainingJob", "controller": "trainingjob", "source": "kind source: /, Kind="}
2022-06-24T22:26:12.445Z        INFO    controller      Starting EventSource    {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "ProcessingJob", "controller": "processingjob", "source": "kind source: /, Kind="}
2022-06-24T22:26:12.446Z        INFO    controller      Starting EventSource    {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "HyperparameterTuningJob", "controller": "hyperparametertuningjob", "source": "kind source: /, Kind="}
2022-06-24T22:26:12.446Z        INFO    controller      Starting EventSource    {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "HostingDeployment", "controller": "hostingdeployment", "source": "kind source: /, Kind="}
2022-06-24T22:26:12.665Z        INFO    controller      Starting Controller     {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "Model", "controller": "model"}
2022-06-24T22:26:12.666Z        INFO    controller      Starting Controller     {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "HostingAutoscalingPolicy", "controller": "hostingautoscalingpolicy"}
2022-06-24T22:26:12.666Z        INFO    controller      Starting Controller     {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "EndpointConfig", "controller": "endpointconfig"}
2022-06-24T22:26:12.666Z        INFO    controller      Starting Controller     {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "BatchTransformJob", "controller": "batchtransformjob"}
2022-06-24T22:26:12.666Z        INFO    controller      Starting Controller     {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "ProcessingJob", "controller": "processingjob"}
2022-06-24T22:26:12.666Z        INFO    controller      Starting Controller     {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "HyperparameterTuningJob", "controller": "hyperparametertuningjob"}
2022-06-24T22:26:12.666Z        INFO    controller      Starting Controller     {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "TrainingJob", "controller": "trainingjob"}
2022-06-24T22:26:12.746Z        INFO    controller      Starting Controller     {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "HostingDeployment", "controller": "hostingdeployment"}
2022-06-24T22:26:12.747Z        INFO    controller      Starting workers        {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "HostingDeployment", "controller": "hostingdeployment", "worker count": 1}
2022-06-24T22:26:12.766Z        INFO    controller      Starting workers        {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "Model", "controller": "model", "worker count": 1}
2022-06-24T22:26:12.766Z        INFO    controller      Starting workers        {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "EndpointConfig", "controller": "endpointconfig", "worker count": 1}
2022-06-24T22:26:12.766Z        INFO    controller      Starting workers        {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "HostingAutoscalingPolicy", "controller": "hostingautoscalingpolicy", "worker count": 1}
2022-06-24T22:26:12.766Z        INFO    controller      Starting workers        {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "ProcessingJob", "controller": "processingjob", "worker count": 1}
2022-06-24T22:26:12.766Z        INFO    controller      Starting workers        {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "BatchTransformJob", "controller": "batchtransformjob", "worker count": 1}
2022-06-24T22:26:12.766Z        INFO    controller      Starting workers        {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "TrainingJob", "controller": "trainingjob", "worker count": 1}
2022-06-24T22:26:12.766Z        INFO    controller      Starting workers        {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "HyperparameterTuningJob", "controller": "hyperparametertuningjob", "worker count": 1}
ubuntu@ip-172-31-35-229:/imvaria/repos/model-training$ kubectl get trainingjobs
NAME            STATUS   SECONDARY-STATUS   CREATION-TIME          SAGEMAKER-JOB-NAME
osic-test-run                               2022-06-24T22:38:13Z  
ubuntu@ip-172-31-35-229:/imvaria/repos/model-training$ kubectl describe trainingjob osic-test-run                                                                                                                                                                                                                             
Name:         osic-test-run                                                                                                                                                                                                                                                                                                   
Namespace:    default                                                                                                                                                                                                                                                                                                         
Labels:       <none>                                                                                                                                                                                                                                                                                                          
Annotations:  <none>                                                                                                                                                                                                                                                                                                          
API Version:  sagemaker.aws.amazon.com/v1                                                                                                                                                                                                                                                                                     
Kind:         TrainingJob                                                                                                                                                                                                                                                                                                     
Metadata:                                                                                                                                                                                                                                                                                                                     
  Creation Timestamp:  2022-06-24T22:38:13Z                                                                                                                                                                                                                                                                                   
  Generation:          1                                                                                                                                                                                                                                                                                                      
  Managed Fields:
    API Version:  sagemaker.aws.amazon.com/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:kubectl.kubernetes.io/last-applied-configuration:
      f:spec:
        .:
        f:algorithmSpecification:
          .:
          f:trainingImage:
          f:trainingInputMode:
        f:inputDataConfig:
        f:outputDataConfig:
          .:
          f:s3OutputPath:
        f:region:
        f:resourceConfig:
          .:
          f:instanceCount:
          f:instanceType:
          f:volumeSizeInGB:
        f:roleArn:
        f:stoppingCondition:
          .:
          f:maxRuntimeInSeconds:
        f:trainingJobName:
    Manager:         kubectl-client-side-apply
    Operation:       Update
    Time:            2022-06-24T22:38:13Z
  Resource Version:  3182
  UID:               0a0880c0-baf9-4f1a-8aa3-37480520c3e2
Spec:
  Algorithm Specification:
    Training Image:       438029713005.dkr.ecr.us-west-2.amazonaws.com/model-training:latest
    Training Input Mode:  File
  Input Data Config:
    Channel Name:      train
    Compression Type:  None
    Data Source:
      s3DataSource:
        s3DataDistributionType:  FullyReplicated
        s3DataType:              S3Prefix
        s3Uri:                   s3://osic-full-including-override
  Output Data Config:
    s3OutputPath:  s3://osic-full-including-override/experiments
  Region:          us-west-2
  Resource Config:
    Instance Count:     1
    Instance Type:      ml.p3.2xlarge
    Volume Size In GB:  500
  Role Arn:             arn:aws:iam::438029713005:role/model-training-sagemaker-role20220624222338450100000009
  Stopping Condition:
    Max Runtime In Seconds:  900
  Training Job Name:         osic-test-run
Events:                      <none>

Please let me know if you need to see anything else!
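
For anyone who wants to check for this symptom in a script, here is a minimal sketch that flags jobs whose STATUS column is blank in the `kubectl get trainingjobs` table, as `osic-test-run` is above. The function name `jobs_without_status` is hypothetical, and the sketch assumes kubectl's default column-aligned text output with the header shown in this thread; it is not part of the operator.

```python
def jobs_without_status(table: str) -> list[str]:
    """Return names of jobs whose STATUS column is blank.

    Assumes `table` is the default text output of `kubectl get trainingjobs`,
    where columns are aligned by character offset under the header row.
    """
    lines = [ln for ln in table.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0]
    # The STATUS column spans from its header offset to the next header.
    start = header.index("STATUS")
    end = header.index("SECONDARY-STATUS")
    return [ln.split()[0] for ln in lines[1:] if not ln[start:end].strip()]


if __name__ == "__main__":
    # Output copied from the terminal session above.
    sample = (
        "NAME            STATUS   SECONDARY-STATUS   CREATION-TIME          SAGEMAKER-JOB-NAME\n"
        "osic-test-run                               2022-06-24T22:38:13Z\n"
    )
    print(jobs_without_status(sample))  # prints ['osic-test-run']
```

A blank STATUS together with `Events: <none>` suggests that the resource has not been picked up by the operator yet, so the operator pod logs are the next place to look.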

from amazon-sagemaker-operator-for-k8s.
