Comments (4)

surajkota commented on May 29, 2024

Hi,

The ValidationException is coming from SageMaker when using parameterRanges in the hyperParameterTuningJobConfig.

> semantic of MaxNumberOfJob enforces a lower bound on the number of combinations

What types of parameter ranges, and which scaling types, are you using when you see this error?

> doesn't seem like a status where the job should be held in Reconciling state

A job staying stuck in the Reconciling state after a validation error does look like an issue. I will try to reproduce it on our end. Please provide a minimal reproducible sample input.

Thanks

from amazon-sagemaker-operator-for-k8s.

surajkota commented on May 29, 2024

I was able to reproduce the job getting stuck in ReconcilingTuningJob status with the job definition below:

apiVersion: sagemaker.aws.amazon.com/v1
kind: HyperparameterTuningJob
metadata:
  name: kmeans-mnist-hpo-3
spec:
  region: us-east-1
  hyperParameterTuningJobConfig:
    strategy: Bayesian
    hyperParameterTuningJobObjective:
      type: Minimize
      metricName: test:msd
    resourceLimits:
      maxNumberOfTrainingJobs: 10
      maxParallelTrainingJobs: 5
    parameterRanges:
      categoricalParameterRanges:
      - name: init_method
        values:
        - 'random'
        - 'kmeans++'
  trainingJobDefinition:
    staticHyperParameters:
      - name: k
        value: '10'
      - name: feature_dim
        value: '784'
    algorithmSpecification:
      trainingImage: 382416733822.dkr.ecr.us-east-1.amazonaws.com/kmeans:1
      trainingInputMode: File
    roleArn: <REPLACE_ME>
    inputDataConfig:
    - channelName: train
      dataSource:
        s3DataSource:
          s3DataType: S3Prefix
          s3Uri: s3://<REPLACE_ME>/mnist_kmeans_example/train_data/
          s3DataDistributionType: FullyReplicated
      compressionType: None
      recordWrapperType: None
      inputMode: File
    outputDataConfig:
      s3OutputPath: s3://<REPLACE_ME>/mnist_kmeans_example/output
    resourceConfig:
      instanceType: ml.m4.xlarge
      instanceCount: 1
      volumeSizeInGB: 25
    stoppingCondition:
      maxRuntimeInSeconds: 3600

bnsblue commented on May 29, 2024

Hi @surajkota! Sorry for the late reply.

> What type of parameter ranges and their scale is being used when you see this error?

In my case, validation failed when I set maxNumberOfTrainingJobs to 10 and had only a single integer parameter range (e.g., num_round) from 1 to 3. The ScalingType was Linear. Since you

> I was able to replicate the job stuck in ReconcilingTuningJob status with the below job definition

In your example it seems that you are encountering ReconcilingTuningJob because there are only two possible training job configurations: init_method = ['random', 'kmeans++'] and you have maxNumberOfTrainingJobs=10. This aligns with my experience.
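
To make that count concrete, here is a rough sketch of how one might compare the number of distinct configurations against maxNumberOfTrainingJobs before submitting. It is only an illustration: `categorical_combinations` is a hypothetical helper (not part of the operator or any SageMaker SDK), the dictionary simply mirrors the field names from the manifest above, and it assumes every tuned parameter is categorical.

```python
from math import prod

# Hypothetical helper (an assumption for illustration, not an operator API):
# counts distinct configurations when all tuned parameters are categorical.
def categorical_combinations(parameter_ranges: dict) -> int:
    ranges = parameter_ranges.get("categoricalParameterRanges", [])
    # The search space size is the product of the number of values per parameter.
    return prod(len(r["values"]) for r in ranges)

# Mirrors the parameterRanges section of the manifest in this thread.
parameter_ranges = {
    "categoricalParameterRanges": [
        {"name": "init_method", "values": ["random", "kmeans++"]},
    ]
}
max_jobs = 10  # maxNumberOfTrainingJobs from the same manifest

combos = categorical_combinations(parameter_ranges)
exceeds = max_jobs > combos
print(combos, exceeds)  # 2 True
```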

What I am trying to get at in this issue is that, even if the number of possible configurations is smaller than maxNumberOfTrainingJobs, SageMaker should still let the job proceed. maxNumberOfTrainingJobs should enforce only an upper limit on the number of training jobs launched when the total number of possibilities is larger; its semantics should not require the HPO job to run at least maxNumberOfTrainingJobs training jobs.
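
A minimal sketch of the upper-limit semantics I have in mind, assuming only discrete (integer or categorical) ranges. This describes the behavior being requested, not what SageMaker currently does, and `integer_range_size` and `effective_max_jobs` are hypothetical names used purely for illustration.

```python
from math import prod
from typing import Sequence

# Hypothetical helper: number of values in an inclusive integer range.
def integer_range_size(min_value: int, max_value: int) -> int:
    return max_value - min_value + 1

# Hypothetical helper: treat the requested maximum as an upper limit only,
# so it never exceeds the number of distinct configurations available.
def effective_max_jobs(requested_max: int, range_sizes: Sequence[int]) -> int:
    return min(requested_max, prod(range_sizes))

# The num_round case from this comment: integer range 1 to 3.
num_round_size = integer_range_size(1, 3)

print(effective_max_jobs(10, [num_round_size]))  # 3
print(effective_max_jobs(2, [num_round_size]))   # 2
```

Until something like this is supported, the same calculation could be done on the client side to lower maxNumberOfTrainingJobs before applying the manifest.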

I hope that makes sense :)

surajkota commented on May 29, 2024

Please use the latest version of SageMaker Operator - https://github.com/aws/amazon-sagemaker-operator-for-k8s#migrate-resources-to-the-new-sagemaker-operators-for-kubernetes
