awslabs / syne-tune
Large scale and asynchronous Hyperparameter and Architecture Optimization at your fingertips.
Home Page: https://syne-tune.readthedocs.io
License: Apache License 2.0
When I run the example_syne_tune_for_hf.ipynb notebook, the first cell after the !pip install commands fails with ModuleNotFoundError: No module named 'syne_tune.config_space'.
Cell:
import matplotlib as mpl  # mpl.use('pgf')
import os
%matplotlib inline
import matplotlib.pyplot as plt
import logging
logging.basicConfig(level=logging.INFO)
from pathlib import Path
from syne_tune.backend.local_backend import LocalBackend
from syne_tune.tuner import Tuner
from syne_tune.search_space import uniform, loguniform, choice, randint
from syne_tune.stopping_criterion import StoppingCriterion
from syne_tune.optimizer.baselines import ASHA, MOBSTER, BayesianOptimization, RandomSearch, MOASHA
from syne_tune.constants import ST_WORKER_TIME
from syne_tune.backend.sagemaker_backend.instance_info import select_instance_type
from syne_tune.backend.sagemaker_backend.sagemaker_backend import SagemakerBackend
from syne_tune.backend.sagemaker_backend.sagemaker_utils import get_execution_role
TASK2METRICSMODE = {
"cola": {'metric': 'matthews_correlation', 'mode': 'max'},
"mnli": {'metric': 'accuracy', 'mode': 'max'},
"mrpc": {'metric': 'f1', 'mode': 'max'},
"qnli": {'metric': 'accuracy', 'mode': 'max'},
"qqp": {'metric': 'f1', 'mode': 'max'},
"rte": {'metric': 'accuracy', 'mode': 'max'},
"sst2": {'metric': 'accuracy', 'mode': 'max'},
"stsb": {'metric': 'spearmanr', 'mode': 'max'},
"wnli": {'metric': 'accuracy', 'mode': 'max'},
}
Full Logs:
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-5-ad89febf37d1> in <module>
10 from syne_tune.backend.local_backend import LocalBackend
11 from syne_tune.tuner import Tuner
---> 12 from syne_tune.config_space import uniform, loguniform, choice, randint
13 from syne_tune.stopping_criterion import StoppingCriterion
14 from syne_tune.optimizer.baselines import ASHA, MOBSTER, BayesianOptimization, RandomSearch, MOASHA
ModuleNotFoundError: No module named 'syne_tune.config_space'
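For context, this import path changed between releases: older versions ship syne_tune.search_space, newer ones syne_tune.config_space (the exact version of the rename is my recollection, worth verifying). A version-tolerant notebook cell could try both paths with a small generic helper, sketched here:

```python
import importlib


def import_first(*candidates):
    """Return the first importable module from a list of dotted paths.

    Useful when a module was renamed between releases, e.g. trying
    "syne_tune.config_space" first and falling back to
    "syne_tune.search_space" (or vice versa, depending on the
    installed version). Returns None if no candidate imports.
    """
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    return None
```

In the notebook one would then do `sp = import_first("syne_tune.config_space", "syne_tune.search_space")` and pull `uniform`, `loguniform`, `choice`, `randint` from `sp`.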
Hey folks,
would you be interested in grid search implemented in syne-tune? I had a few offline discussions with some of you already, and it seems you are not against grid search being added to syne-tune, but I want to keep a record of that here.
Additionally, would you have any pointers as to what would be the best way to add grid search to syne-tune?
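As a starting point for the discussion: grid search mostly needs a deterministic enumeration of a finite config space, which a searcher could then propose from in order instead of sampling. A minimal sketch of the enumeration, assuming every hyperparameter is given as a finite list of candidate values:

```python
from itertools import product


def grid_configs(config_space):
    """Enumerate all configs from a space of finite domains.

    Sketch only: every hyperparameter is assumed to be a list of
    candidate values (constants can be wrapped in one-element lists).
    Names are sorted so the enumeration order is deterministic.
    """
    names = sorted(config_space)
    for values in product(*(config_space[name] for name in names)):
        yield dict(zip(names, values))
```

A grid searcher would essentially wrap this iterator and return `None` (or restart) once the grid is exhausted.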
Right now, importing syne_tune.optimizer.baselines fails when only the core dependencies are installed, because it imports QuantileBasedSurrogateSearcher, which in turn requires additional dependencies such as XGBoost or scikit-learn. I would suggest making its import optional to avoid these exceptions.
The goal is to resume #301 and make the YAHPO benchmarks accessible in the blackbox repository, so that we can evaluate our methods on all those new benchmarks.
One example should be added to show how to run distributed ASHA on one YAHPO benchmark.
In my experiment, the results dataframe contains multiple rows for trial id 1 with the same content as the following row, the only difference being the config. This causes problems, since sometimes the best config is reported as trial id 1 showing a config which did not actually achieve the best performance.
See this example: True trial id 1 performance is 81% (Row 4) but trial id 1 also shows up in row 10 with highest accuracy.
I've added a simple example to reproduce this behavior.
from pathlib import Path
from sagemaker.pytorch import PyTorch
from syne_tune.backend import SageMakerBackend
from sagemaker import get_execution_role
from syne_tune.optimizer.baselines import RandomSearch
from syne_tune import Tuner
from syne_tune.config_space import randint
from syne_tune import StoppingCriterion
from syne_tune.optimizer.schedulers.fifo import FIFOScheduler
entry_point = Path('examples') / "training_scripts" / "height_example" / "train_height.py"
assert entry_point.is_file(), 'File unknown'
mode = "min"
metric = "mean_loss"
instance_type = 'ml.c5.4xlarge'
instance_count = 1
instance_max_time = 999
n_workers = 20
config_space = {
    "steps": 1,
    "width": randint(0, 20),
    "height": randint(-100, 100),
}
backend = SageMakerBackend(
    sm_estimator=PyTorch(
        entry_point=str(entry_point),
        instance_type=instance_type,
        instance_count=instance_count,
        role=get_execution_role(),
        max_run=instance_max_time,
        py_version='py3',
        framework_version='1.6',
    ),
    metrics_names=[metric],
)
# Random search without stopping
scheduler = FIFOScheduler(
    config_space=config_space,
    searcher='random',
    mode=mode,
    metric=metric,
)
tuner = Tuner(
    trial_backend=backend,
    scheduler=scheduler,
    stop_criterion=StoppingCriterion(max_wallclock_time=300),
    n_workers=n_workers,
)
tuner.run()
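Until the root cause is fixed, a quick sanity check over the results can flag trials whose config is not consistent across rows, so their "best" entries are not trusted. Sketched over plain dict rows here (with pandas one would group by trial_id similarly):

```python
def inconsistent_trials(rows):
    """Return the trial_ids whose recorded config differs across rows.

    `rows` is an iterable of dicts with at least "trial_id" and "config"
    keys, a stand-in for the rows of the tuner's results dataframe.
    """
    first_seen = {}
    bad = set()
    for row in rows:
        tid = row["trial_id"]
        cfg = tuple(sorted(row["config"].items()))
        # remember the first config seen for this trial; flag mismatches
        if first_seen.setdefault(tid, cfg) != cfg:
            bad.add(tid)
    return bad
```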
I have a long list of results available in my local Syne Tune folder (~/syne-tune/), but list_experiments doesn't add them to the list. The experiments are not stored in an S3 bucket of mine; they were copied from a different location into the local filesystem.
Hi,
I'm running Syne Tune on the conda_python3 Jupyter kernel of a SageMaker-managed EC2 instance (an ml.m5d.12xlarge notebook instance), which has no GPUs.
However, in the Syne Tune logs I see:
INFO:root:Detected 2 GPUs
and then, a few lines below:
DEBUG:root:Free GPUs: {0, 1}
DEBUG:root:Assigned GPU 0 to trial_id 0
But an m5d.12xlarge is not expected to have GPUs, right?
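To cross-check what the machine actually exposes, one can ask nvidia-smi directly. A sketch that returns 0 when the tool is absent or failing, as expected on a CPU-only instance:

```python
import shutil
import subprocess


def visible_gpus():
    """Count GPUs reported by nvidia-smi; 0 if the tool is missing or errors.

    A cross-check against library-level detection: on a CPU-only instance
    this should return 0 even if some heuristic elsewhere claims otherwise.
    """
    if shutil.which("nvidia-smi") is None:
        return 0
    proc = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    if proc.returncode != 0:
        return 0
    # "nvidia-smi -L" prints one "GPU <i>: <name> ..." line per device
    return sum(1 for line in proc.stdout.splitlines() if line.startswith("GPU "))
```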
Hi, would it be possible to express conditional structure in the config_space search, so that it can describe a multi-step pipeline? Something that would prevent invalid pipelines from being formed out of the algorithm + hyperparameter config variables.
A SageMaker training job failed for a seemingly random reason, which breaks the tuner:
File "/opt/conda/lib/python3.8/site-packages/syne_tune/tuner.py", line 152, in run
new_done_trial_statuses, new_results = self._process_new_results(
File "/opt/conda/lib/python3.8/site-packages/syne_tune/tuner.py", line 282, in _process_new_results
done_trials_statuses = self._update_running_trials(trial_status_dict, new_results, callbacks=self.callbacks)
File "/opt/conda/lib/python3.8/site-packages/syne_tune/tuner.py", line 437, in _update_running_trials
assert trial_id in self.last_seen_result_per_trial, \
AssertionError: trial 35 completed and no metrics got observed
It would be great to retry failed jobs, or at least ignore them and continue somehow.
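As a stopgap, the tuning loop can be wrapped in a generic retry helper. This is a hypothetical wrapper, not a Syne Tune API; `run` would be something like `lambda: tuner.run()`, ideally with resume semantics so already finished trials are not repeated:

```python
def run_with_retries(run, max_restarts=3, retriable=(AssertionError,)):
    """Call `run` and restart it after a retriable failure.

    `retriable` lists the exception types worth restarting on (the
    AssertionError default matches the failure mode in this report);
    anything else propagates immediately.
    """
    attempts = 0
    while True:
        try:
            return run()
        except retriable:
            attempts += 1
            if attempts > max_restarts:
                raise  # give up after max_restarts restarts
```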
After a tuner.run() execution, I'd like to be able to programmatically get the best config, either from the tuner or from its data folder, e.g.:
tuner.best_config()
or
tuning_experiment = load_experiment("experiment-xxxxxxxx")
tuning_experiment.best_config()
Is there an API for this?
If not, I suggest adding it to the roadmap.
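Until a built-in API exists, the lookup can be done over the results rows directly. A sketch with plain dict rows (with the actual dataframe one would use idxmax/idxmin similarly); the "config_" column prefix is an assumption about the results layout, worth checking against your version:

```python
def best_config(rows, metric, mode="max"):
    """Return the config of the best row in the tuning results.

    `rows` stands in for the results dataframe rows; config columns are
    assumed to carry a "config_" prefix, which is stripped on return.
    """
    pick = max if mode == "max" else min
    best = pick(rows, key=lambda row: row[metric])
    return {
        key[len("config_"):]: value
        for key, value in best.items()
        if key.startswith("config_")
    }
```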
Right now, releases are cut manually with ad-hoc scripts. Add automated tooling to cut PyPI releases, for instance using poetry instead of manual scripts.
Hi,
I launched a Syne Tune experiment a few hours ago (experiment-2022-01-11-10-57-17-491), then stopped it and launched another one.
While experiment-2022-01-11-10-57-17-491 was running I could see its chart using
from syne_tune.experiments import load_experiment
tuning_experiment = load_experiment("experiment-2022-01-11-10-57-17-491")
tuning_experiment.plot()
Now, when I do the same from the same machine, I get:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-5-0c0bfae5f6de> in <module>
1 # metric over time
----> 2 tuning_experiment.plot()
~/anaconda3/envs/python3/lib/python3.6/site-packages/syne_tune/experiments.py in plot(self, **plt_kwargs)
51 import matplotlib.pyplot as plt
52
---> 53 scheduler = self.tuner.scheduler
54 metric = self.metric_name()
55 df = self.results
AttributeError: 'NoneType' object has no attribute 'scheduler'
What is wrong? Can the graph be accessed only while the tuner is running?
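If only the results dataframe survives (the traceback shows the serialized tuner is absent, hence self.tuner is None), the same curve can be rebuilt by hand. The core is a cumulative-best transform, sketched below; plotting it against the wallclock-time column of the results reproduces what plot() shows, without needing the tuner object:

```python
def best_over_time(values, mode="min"):
    """Cumulative best of a metric sequence.

    `values` is the metric column in reporting order; the returned list
    has the same length, holding the best value seen so far at each step.
    """
    out, best = [], None
    for v in values:
        if best is None or (v < best if mode == "min" else v > best):
            best = v
        out.append(best)
    return out
```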
Please clarify in the README: can syne-tune wrap any Ray Tune scheduler/searcher? Which Ray Tune version is used?
Is PBT supported on the SageMaker backend?
Hello!
Serialization of syne_tune.search_space.finrange is failing: if we define an sp.finrange in the config_space, the blackbox serialization tests fail:
https://github.com/talesa/syne-tune/blob/blackbox-serialization-with-finrange-bug/tst/blackbox_repository/test_blackbox.py#L21-L24
Hi,
I'm using Syne Tune from a SageMaker-managed EC2 instance (notebook instance)
As indicated here, I'm using this code in my backend script:
from syne_tune.report import report
which raises an ImportError: cannot import name 'report'.
When I look in report.py, I can see a Reporter class but no report.
This blog post however proposes a different code:
from syne_tune.report import Reporter
report = Reporter()
Could the README be clarified?
(Note that I cannot check the version: import syne_tune; syne_tune.__version__ returns an AttributeError: module 'syne_tune' has no attribute '__version__'.)
thanks!
What. Longer step-by-step tutorial on how to run experiments with our async and sync multi-fidelity HPO methods, both using tabulated blackboxes and a real DNN tuning problem (Hugging Face?).
Why. The way in which variants of different algos are implemented and available in ST could be a real advantage, but is right now hidden and undocumented. A tutorial would be most accessible, and would clarify important concepts (sync/async)
Done. Tutorial tested with volunteer outside the team, feedback incorporated
A second part of the tutorial could be for developers: how to implement a new scheduler, or a variant of an existing one.
Hi, when running the container build script, it fails at the following:
Step 12/12 : RUN python -m pip install --no-cache-dir --upgrade -r /tmp/packages/requirements.txt
---> Running in 67f84184ab3e
ERROR: Extras after version '>=1.3ray[tune]'.
The command '/bin/sh -c python -m pip install --no-cache-dir --upgrade -r /tmp/packages/requirements.txt' returned a non-zero code: 1
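The pip message suggests two requirement specifiers got fused onto a single line, i.e. a version pin immediately followed by ray[tune] with no newline in between. Assuming that is the cause, the fix is just to split the line (the package name is a placeholder, since the log does not show it):

```text
# before -- pip parses "ray[tune]" as extras appended to the version:
#   <some-package>>=1.3ray[tune]
# after -- one requirement per line:
<some-package>>=1.3
ray[tune]
```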
Hi,
how can I set a custom tuner_path? I'm launching long-running experiments on remote SageMaker jobs, and I'd like to set the tuner metadata path to /opt/ml/checkpoints (a local path on those transient VMs), so the metadata gets synced to S3 upon updates.
I accidentally ran list_exp = list_experiments() using a different AWS account, and noticed that the function fails with EmptyDataError: No columns to parse from file when there are no experiments available.
Hi, I am using LocalBackend to train a couple of Hugging Face models on a sample dataset (still WIP).
However, I ran into the following errors:
INFO:syne_tune.optimizer.schedulers.hyperband:trial_id 1 starts (first milestone = 1)
INFO:root:running subprocess with command: /opt/conda/bin/python huggingface_on_excel.py --model_type google/electra-base-discriminator --learning_rate 8.018154654725304e-05 --weight_decay 1.3591419560772573e-07 --dataset_path /DATA/jin/ --CUDA_VISIBLE_DEVICES 2 --train_batch_size 8 --valid_batch_size 8 --epochs 1 --output_dir output/ --eval_steps 100 --st_checkpoint_dir /root/syne-tune/test-hugging/1/checkpoints
INFO:syne_tune.tuner:(trial 1) - scheduled config {'model_type': 'google/electra-base-discriminator', 'learning_rate': 8.018154654725304e-05, 'weight_decay': 1.3591419560772573e-07, 'dataset_path': '/DATA/jin/', 'CUDA_VISIBLE_DEVICES': '2', 'train_batch_size': 8, 'valid_batch_size': 8, 'epochs': 1, 'output_dir': 'output/', 'eval_steps': 100}
INFO:syne_tune.tuner:Trial trial_id 1 was stopped independently of the scheduler.
INFO:syne_tune.optimizer.schedulers.fifo:trial_id 1: Evaluation failed!
Some of the debugging methods I have tried:
Any advice will be appreciated. Thank you!
Running python docs/tutorials/basics/scripts/launch_sagemaker_backend.py produces SM training jobs named None-0, None-1, etc., which do not depend on tuner_name.
Rerunning the example leads to duplicate SM training job names and hence failure of the script.
This is because tuner_name inside the SagemakerBackend object is only ever set to None in the constructor.
What. Implement Hyper-Tune as extension of asynchronous Hyperband (ASHA)
Why. Very competitive method, according to the paper. We lack async HB methods that do a good job with bracket sampling
Done. Some unit tests, comparison with baselines
There is no equivalent of choice for numeric values. E.g., in the FCNet blackbox the learning rate is defined as 'hp_init_lr': choice([0.0005, 0.001, 0.005, 0.01, 0.05, 0.1]). This does not allow model-based approaches to encode this hyperparameter correctly. It would be great to identify such hyperparameters as numeric, and also to indicate whether a log transform is needed.
Hi, I receive this warning using the docker image:
PythonDeprecationWarning: Boto3 will no longer support Python 3.6 starting May 30, 2022. To continue receiving service updates, bug fixes, and security updates please upgrade to Python 3.7 or later. More information can be found here: https://aws.amazon.com/blogs/developer/python-support-policy-updates-for-aws-sdks-and-tools/
Since this is only a couple of weeks away it might be a good idea to update the Dockerfile to Python 3.7 now.
Hi, I have a limit of 8 ml.g5.12xlarge instances, and although I set Tuner.n_workers = 5, I still got a ResourceLimitExceeded error. Is there a way to make sure that jobs are fully stopped when using SageMakerBackend before launching new ones?
Also, when using RemoteLauncher, in situations where the management instance errors out (for example due to ResourceLimitExceeded), is there a way to make sure the management instance sends a stop signal to all tuning jobs before exiting? Maybe something like:
try:
# manage tuning jobs
except:
# raise error
finally:
# stop any trials still running
If an HPO experiment with the SageMaker backend and n_workers=16 is run, it can fail with a ThrottlingException.
The problem is that at the start, all 16 trials are launched very close to each other, as a batch.
This failure should be avoided by simply retrying job starts until they succeed.
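A sketch of per-call retry with exponential backoff and jitter. RuntimeError stands in for botocore's ClientError with code 'ThrottlingException' (in real code one would inspect the error code); the sleep function is injectable so the behavior can be tested:

```python
import random
import time


def start_with_backoff(start_job, max_attempts=6, base_delay=1.0, sleep=time.sleep):
    """Retry a throttled job-start call, doubling the delay each attempt."""
    for attempt in range(max_attempts):
        try:
            return start_job()
        except RuntimeError:  # stand-in for a ThrottlingException
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the throttling error
            # full exponential delay, jittered into [0.5, 1.0) of its value
            sleep(base_delay * (2 ** attempt) * (0.5 + random.random() / 2))
```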
Hi,
When using Syne Tune, I'd like to be able to extract the raw tuning data (parameters, metrics, timestamps) into a dataframe to analyze it. Is there something like a tuner.data() API? For example, SageMaker HPO has the great tuner.analytics().
If this API doesn't exist, I suggest adding it to the roadmap.
It would be useful to have an example showing how to retrieve the trained model providing the best performance.
Add a stopping criterion that stops the HPO process if it hasn't improved for N consecutive steps.
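The core logic could be a small stateful tracker, sketched below. Tuner's stop_criterion is a callable receiving a TuningStatus, so one would wrap this in a function that feeds it the latest value of the tracked metric (the exact TuningStatus fields are an assumption here, not shown):

```python
class NoImprovementCriterion:
    """Signal stop after `patience` reported results without improvement."""

    def __init__(self, patience, mode="min"):
        self.patience = patience
        self.mode = mode
        self.best = None
        self.stale = 0

    def record(self, value):
        """Feed the latest metric value; returns True when tuning should stop."""
        improved = (
            self.best is None
            or (value < self.best if self.mode == "min" else value > self.best)
        )
        if improved:
            self.best, self.stale = value, 0
        else:
            self.stale += 1
        return self.stale >= self.patience
```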
There seems to be a problem with the Hyperband promotion logic.
How to reproduce:
Add type="promotion" to https://github.com/awslabs/syne-tune/blob/main/benchmarking/nursery/benchmark_automl/baselines.py#L69
Run python benchmarking/nursery/benchmark_automl/benchmark_main.py --num_seeds 1 --method ASHA --benchmark lcbench-airlines
File "/syne-tune/benchmarking/nursery/benchmark_automl/benchmark_main.py", line 209, in <module>
tuner.run()
File "/syne-tune/syne_tune/tuner.py", line 240, in run
raise e
File "/syne-tune/syne_tune/tuner.py", line 175, in run
new_done_trial_statuses, new_results = self._process_new_results(
File "/syne-tune/syne_tune/tuner.py", line 345, in _process_new_results
done_trials_statuses = self._update_running_trials(
File "/syne-tune/syne_tune/tuner.py", line 465, in _update_running_trials
decision = self.scheduler.on_trial_result(trial=trial, result=result)
File "/syne-tune/syne_tune/optimizer/schedulers/hyperband.py", line 779, in on_trial_result
task_info = self.terminator.on_task_report(trial_id, result)
File "/syne-tune/syne_tune/optimizer/schedulers/hyperband.py", line 1124, in on_task_report
rung_sys.on_task_report(trial_id, result, skip_rungs=skip_rungs)
File "/syne-tune/syne_tune/optimizer/schedulers/hyperband_promotion.py", line 221, in on_task_report
assert resource == milestone, (
AssertionError: trial_id 1: resource = 4 > 3 = milestone. Make sure to report time attributes covering all milestones
No unit test covers wait_trial_completion_when_stopping=True. I need to get back to this at some point.
Hi, could you add a method to attach to the logs of the SageMakerBackend management estimator? For example, RemoteLauncher.logs, so we can simply do remote.logs()?
Some customers can't access the console to view CloudWatch logs, so this would be easier for them than fiddling with boto3.
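In the meantime, the logs can be pulled with a small helper over the CloudWatch Logs API: SageMaker training jobs log to the '/aws/sagemaker/TrainingJobs' group, one stream per job. The client is passed in (e.g. boto3.client("logs")), and pagination follows get_log_events' convention that the forward token repeats once the stream is exhausted:

```python
def training_job_logs(job_name, logs_client):
    """Yield CloudWatch log lines for a SageMaker training job.

    `logs_client` is a CloudWatch Logs client, e.g. boto3.client("logs").
    """
    group = "/aws/sagemaker/TrainingJobs"
    streams = logs_client.describe_log_streams(
        logGroupName=group, logStreamNamePrefix=job_name
    )["logStreams"]
    for stream in streams:
        token = None
        while True:
            kwargs = {
                "logGroupName": group,
                "logStreamName": stream["logStreamName"],
                "startFromHead": True,
            }
            if token is not None:
                kwargs["nextToken"] = token
            response = logs_client.get_log_events(**kwargs)
            for event in response["events"]:
                yield event["message"]
            if response["nextForwardToken"] == token:
                break  # repeated token means the stream is exhausted
            token = response["nextForwardToken"]
```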
Hi,
I see in the blog that one can query experiments by name, to access the metrics:
from syne_tune.experiments import load_experiment
tuning_experiment = load_experiment("train-cifar100-2021-11-05-15-22-27-531")
tuning_experiment.plot()
How do we know and set an experiment name?
This is a follow-up issue for #304.
Dear creators, thank you again for your great work, and perhaps sorry for being annoying with my suggestions/questions. Is it possible to change the home directory of different runs, such that it is not ~/syne-tune but a custom path? Thanks!
Dear creators, thank you for your great work. Is there a way to specify a packaging format for the input hyperparameters of our main script? E.g., in our project we do not pass hyperparameters directly, as in python3 train.py --width 1, but through python3 train.py --hyperparameters='{"width": 1}', to avoid adding a new argument to our parser and to avoid clutter each time we would like to change something. I have checked the FAQ but have not found anything related. Thank you for your input!
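On the training-script side this is a one-liner with argparse, sketched below. Note that Syne Tune itself passes each config entry as a separate CLI argument, so supporting this end-to-end would need the tuner side to serialize the sampled config into this single flag, which is exactly the feature being requested:

```python
import argparse
import json


def parse_args(argv=None):
    """Accept all hyperparameters as a single JSON blob.

    Script-side sketch only; json.loads as the `type` converts
    --hyperparameters='{"width": 1}' straight into a dict.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--hyperparameters", type=json.loads, default={})
    return parser.parse_args(argv)
```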
Methods that use the top-k candidates will therefore propose configs outside of the config space, which will raise an error. Restrict the feasible candidates for this function to everything within the config_space.
Integrating the HyperBO benchmark introduced in Pre-trained Gaussian processes for Bayesian optimization.
I need to tune the hyperparameters of different algorithms in parallel, such as RF and XGBoost. I tried ray.tune, but if I tune two algorithms' hyperparameters in parallel it mixes them up and errors out: it applies XGBoost parameters to RF. So I want to know whether this architecture can solve this problem.
Perhaps the following reference can be added to the readme:
Hi, I'm using SageMaker as a backend with the remote launcher. I noticed that if a job errors out during training, the latest performance logs are not captured.
For example, in my HPO experiment on the CIFAR-10 dataset, one trial (number 8) was reported in the Syne Tune results dataframe as achieving a validation accuracy of 0.8478 at epoch 22.
However, my CloudWatch logs show that the validation accuracy actually reached 0.926 at epoch 60 before crashing.
Interestingly, the job shows as Stopped rather than Failed in the SageMaker console. Does Syne Tune notice an exception and stop the job before it exits with a failure?
Test (3.8) fails with:
____________ ERROR collecting tst/schedulers/test_schedulers_api.py ____________
ImportError while importing test module '/home/runner/work/syne-tune/syne-tune/tst/schedulers/test_schedulers_api.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tst/schedulers/test_schedulers_api.py:47: in <module>
    from syne_tune.optimizer.schedulers.botorch.botorch_searcher import BotorchSearcher
syne_tune/optimizer/schedulers/botorch/botorch_searcher.py:18: in <module>
    from botorch.models import SingleTaskGP
/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/botorch/__init__.py:7: in <module>
    from botorch import (
/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/botorch/acquisition/__init__.py:7: in <module>
    from botorch.acquisition.acquisition import (
/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/botorch/acquisition/acquisition.py:16: in <module>
    from botorch.models.model import Model
/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/botorch/models/__init__.py:7: in <module>
    from botorch.models.approximate_gp import (
/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/botorch/models/approximate_gp.py:35: in <module>
    from botorch.models.gpytorch import GPyTorchModel
/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/botorch/models/gpytorch.py:23: in <module>
    from botorch.acquisition.objective import PosteriorTransform
/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/botorch/acquisition/objective.py:18: in <module>
    from botorch.models.model import Model
/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/botorch/models/model.py:24: in <module>
    from botorch.models.utils.assorted import fantasize as fantasize_flag
/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/botorch/models/utils/__init__.py:7: in <module>
    from botorch.models.utils.assorted import (
/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/botorch/models/utils/assorted.py:21: in <module>
    from gpytorch.utils.broadcasting import _mul_broadcast_shape
E   ImportError: cannot import name '_mul_broadcast_shape' from 'gpytorch.utils.broadcasting' (/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/gpytorch/utils/broadcasting.py)
Hello!
When running https://github.com/awslabs/syne-tune/blob/main/docs/tutorials/basics/scripts/launch_sagemaker_backend.py (python docs/tutorials/basics/scripts/launch_sagemaker_backend.py) on the main branch, I get an error within the spawned SageMaker training jobs:
Traceback (most recent call last):
File "traincode_report_withcheckpointing.py", line 29, in <module>
from benchmarks.checkpoint import resume_from_checkpointed_model, \
ModuleNotFoundError: No module named 'benchmarks'
I'm including the full log below.
I'm not certain whether it's due to my AWS environment setup (although I am generally able to run SageMaker training jobs) or an issue with the code; could you please have a look?
Best wishes,
Adam
Full log:
showing log of sagemaker job: traincode-report-withcheckpointing-2022-01-18-16-26-35-248-4
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2022-01-18 16:34:35,020 sagemaker-training-toolkit INFO Imported framework sagemaker_pytorch_container.training
2022-01-18 16:34:35,023 sagemaker-training-toolkit INFO No GPUs detected (normal if no gpus installed)
2022-01-18 16:34:35,035 sagemaker_pytorch_container.training INFO Block until all host DNS lookups succeed.
2022-01-18 16:34:36,465 sagemaker_pytorch_container.training INFO Invoking user training script.
2022-01-18 16:34:37,061 sagemaker-training-toolkit INFO No GPUs detected (normal if no gpus installed)
2022-01-18 16:34:37,076 sagemaker-training-toolkit INFO No GPUs detected (normal if no gpus installed)
2022-01-18 16:34:37,090 sagemaker-training-toolkit INFO No GPUs detected (normal if no gpus installed)
2022-01-18 16:34:37,103 sagemaker-training-toolkit INFO Invoking user script
Training Env:
{
"additional_framework_parameters": {},
"channel_input_dirs": {},
"current_host": "algo-1",
"framework_module": "sagemaker_pytorch_container.training:main",
"hosts": [
"algo-1"
],
"hyperparameters": {
"batch_size": 126,
"weight_decay": 0.7744002774231975,
"st_checkpoint_dir": "/opt/ml/checkpoints",
"st_instance_count": 1,
"n_units_2": 322,
"dataset_path": "./",
"n_units_1": 107,
"dropout_2": 0.20979101632756325,
"dropout_1": 0.4715702331554363,
"epochs": 81,
"learning_rate": 0.0029903699075321814,
"st_instance_type": "ml.m4.10xlarge"
},
"input_config_dir": "/opt/ml/input/config",
"input_data_config": {},
"input_dir": "/opt/ml/input",
"is_master": true,
"job_name": "traincode-report-withcheckpointing-2022-01-18-16-26-35-248-4",
"log_level": 20,
"master_hostname": "algo-1",
"model_dir": "/opt/ml/model",
"module_dir": "s3://sagemaker-us-west-2-640549960621/traincode-report-withcheckpointing-2022-01-18-16-26-35-248-4/source/sourcedir.tar.gz",
"module_name": "traincode_report_withcheckpointing",
"network_interface_name": "eth0",
"num_cpus": 40,
"num_gpus": 0,
"output_data_dir": "/opt/ml/output/data",
"output_dir": "/opt/ml/output",
"output_intermediate_dir": "/opt/ml/output/intermediate",
"resource_config": {
"current_host": "algo-1",
"hosts": [
"algo-1"
],
"network_interface_name": "eth0"
},
"user_entry_point": "traincode_report_withcheckpointing.py"
}
Environment variables:
SM_HOSTS=["algo-1"]
SM_NETWORK_INTERFACE_NAME=eth0
SM_HPS={"batch_size":126,"dataset_path":"./","dropout_1":0.4715702331554363,"dropout_2":0.20979101632756325,"epochs":81,"learning_rate":0.0029903699075321814,"n_units_1":107,"n_units_2":322,"st_checkpoint_dir":"/opt/ml/checkpoints","st_instance_count":1,"st_instance_type":"ml.m4.10xlarge","weight_decay":0.7744002774231975}
SM_USER_ENTRY_POINT=traincode_report_withcheckpointing.py
SM_FRAMEWORK_PARAMS={}
SM_RESOURCE_CONFIG={"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"}
SM_INPUT_DATA_CONFIG={}
SM_OUTPUT_DATA_DIR=/opt/ml/output/data
SM_CHANNELS=[]
SM_CURRENT_HOST=algo-1
SM_MODULE_NAME=traincode_report_withcheckpointing
SM_LOG_LEVEL=20
SM_FRAMEWORK_MODULE=sagemaker_pytorch_container.training:main
SM_INPUT_DIR=/opt/ml/input
SM_INPUT_CONFIG_DIR=/opt/ml/input/config
SM_OUTPUT_DIR=/opt/ml/output
SM_NUM_CPUS=40
SM_NUM_GPUS=0
SM_MODEL_DIR=/opt/ml/model
SM_MODULE_DIR=s3://sagemaker-us-west-2-640549960621/traincode-report-withcheckpointing-2022-01-18-16-26-35-248-4/source/sourcedir.tar.gz
SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{},"current_host":"algo-1","framework_module":"sagemaker_pytorch_container.training:main","hosts":["algo-1"],"hyperparameters":{"batch_size":126,"dataset_path":"./","dropout_1":0.4715702331554363,"dropout_2":0.20979101632756325,"epochs":81,"learning_rate":0.0029903699075321814,"n_units_1":107,"n_units_2":322,"st_checkpoint_dir":"/opt/ml/checkpoints","st_instance_count":1,"st_instance_type":"ml.m4.10xlarge","weight_decay":0.7744002774231975},"input_config_dir":"/opt/ml/input/config","input_data_config":{},"input_dir":"/opt/ml/input","is_master":true,"job_name":"traincode-report-withcheckpointing-2022-01-18-16-26-35-248-4","log_level":20,"master_hostname":"algo-1","model_dir":"/opt/ml/model","module_dir":"s3://sagemaker-us-west-2-640549960621/traincode-report-withcheckpointing-2022-01-18-16-26-35-248-4/source/sourcedir.tar.gz","module_name":"traincode_report_withcheckpointing","network_interface_name":"eth0","num_cpus":40,"num_gpus":0,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"},"user_entry_point":"traincode_report_withcheckpointing.py"}
SM_USER_ARGS=["--batch_size","126","--dataset_path","./","--dropout_1","0.4715702331554363","--dropout_2","0.20979101632756325","--epochs","81","--learning_rate","0.0029903699075321814","--n_units_1","107","--n_units_2","322","--st_checkpoint_dir","/opt/ml/checkpoints","--st_instance_count","1","--st_instance_type","ml.m4.10xlarge","--weight_decay","0.7744002774231975"]
SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
SM_HP_BATCH_SIZE=126
SM_HP_WEIGHT_DECAY=0.7744002774231975
SM_HP_ST_CHECKPOINT_DIR=/opt/ml/checkpoints
SM_HP_ST_INSTANCE_COUNT=1
SM_HP_N_UNITS_2=322
SM_HP_DATASET_PATH=./
SM_HP_N_UNITS_1=107
SM_HP_DROPOUT_2=0.20979101632756325
SM_HP_DROPOUT_1=0.4715702331554363
SM_HP_EPOCHS=81
SM_HP_LEARNING_RATE=0.0029903699075321814
SM_HP_ST_INSTANCE_TYPE=ml.m4.10xlarge
PYTHONPATH=/opt/ml/code:/opt/conda/bin:/opt/conda/lib/python36.zip:/opt/conda/lib/python3.6:/opt/conda/lib/python3.6/lib-dynload:/opt/conda/lib/python3.6/site-packages
Invoking script with the following command:
/opt/conda/bin/python3.6 traincode_report_withcheckpointing.py --batch_size 126 --dataset_path ./ --dropout_1 0.4715702331554363 --dropout_2 0.20979101632756325 --epochs 81 --learning_rate 0.0029903699075321814 --n_units_1 107 --n_units_2 322 --st_checkpoint_dir /opt/ml/checkpoints --st_instance_count 1 --st_instance_type ml.m4.10xlarge --weight_decay 0.7744002774231975
Traceback (most recent call last):
File "traincode_report_withcheckpointing.py", line 29, in <module>
from benchmarks.checkpoint import resume_from_checkpointed_model, \
ModuleNotFoundError: No module named 'benchmarks'
2022-01-18 16:34:38,444 sagemaker-training-toolkit ERROR ExecuteUserScriptError:
Command "/opt/conda/bin/python3.6 traincode_report_withcheckpointing.py --batch_size 126 --dataset_path ./ --dropout_1 0.4715702331554363 --dropout_2 0.20979101632756325 --epochs 81 --learning_rate 0.0029903699075321814 --n_units_1 107 --n_units_2 322 --st_checkpoint_dir /opt/ml/checkpoints --st_instance_count 1 --st_instance_type ml.m4.10xlarge --weight_decay 0.7744002774231975"
Traceback (most recent call last):
File "traincode_report_withcheckpointing.py", line 29, in <module>
from benchmarks.checkpoint import resume_from_checkpointed_model, \
ModuleNotFoundError: No module named 'benchmarks'
When running (on the main branch), e.g.,
python benchmarking/nursery/benchmark_automl/benchmark_main.py --num_seeds 1 --method ASHA --benchmark fcnet-protein
I get the following warnings. Is this expected? Does anyone know what's going on?
WARNING:syne_tune.backend.simulator_backend.simulator_backend:The following trials reported results, but are not covered by trial_ids. These results will be ignored:
trial_id 38: status = Stopped, num_results = 1
WARNING:syne_tune.backend.simulator_backend.simulator_backend:The following trials reported results, but are not covered by trial_ids. These results will be ignored:
trial_id 44: status = Stopped, num_results = 1
WARNING:syne_tune.backend.simulator_backend.simulator_backend:The following trials reported results, but are not covered by trial_ids. These results will be ignored:
trial_id 49: status = Stopped, num_results = 1
trial_id 77: status = Stopped, num_results = 1
trial_id 86: status = Stopped, num_results = 1
trial_id 113: status = Stopped, num_results = 1
trial_id 121: status = Stopped, num_results = 1
trial_id 142: status = Stopped, num_results = 1
trial_id 169: status = Stopped, num_results = 1
trial_id 186: status = Stopped, num_results = 1
trial_id 188: status = Stopped, num_results = 1
trial_id 229: status = Stopped, num_results = 1
trial_id 247: status = Stopped, num_results = 1
trial_id 252: status = Stopped, num_results = 1
trial_id 260: status = Stopped, num_results = 1
trial_id 255: status = Stopped, num_results = 1
trial_id 264: status = Stopped, num_results = 1
trial_id 297: status = Stopped, num_results = 1
trial_id 309: status = Stopped, num_results = 1
trial_id 314: status = Stopped, num_results = 1
Hello! Syne Tune already has `sp.choice` for categorical variables, and `sp.finrange` and `sp.logfinrange` for numerical values, but I feel that sometimes it is easier to manually specify the elements (as with `sp.choice`), yet have them treated as numerical values by the GP models and by the blackbox surrogate models. Hence I'm wondering about implementing something like `sp.number_choice`, mostly for convenience. What do you think?
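One way to get that behaviour without library support is a small standalone domain that samples from a user-given list of numbers but exposes them to a surrogate by their sorted position. This is only a sketch: `NumberChoice`, `sample`, and `to_internal` are hypothetical names for illustration, not part of the Syne Tune API:

```python
import random

class NumberChoice:
    """Hypothetical domain: a finite set of user-given numeric values,
    treated as ordered numbers rather than unordered categories."""

    def __init__(self, values):
        # Sort so a model can exploit the natural ordering of the values
        self.values = sorted(values)

    def sample(self, size=1, rng=random):
        # Always return a list, even for size=1
        return [rng.choice(self.values) for _ in range(size)]

    def to_internal(self, value):
        # Encode as position in the sorted list, normalized to [0, 1];
        # this is the kind of numeric encoding a GP surrogate could consume
        return self.values.index(value) / max(len(self.values) - 1, 1)

domain = NumberChoice([0.1, 0.5, 1.0, 2.0])
print(domain.to_internal(1.0))  # 0.6666666666666666 (index 2 of 3)
```

The point of `to_internal` is that 0.5 and 1.0 end up closer together than 0.1 and 2.0, which is exactly what a categorical one-hot encoding loses.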
Hi, is this library compatible with SageMaker Experiments tracking and/or the SageMaker estimator `experiment_config`?
What. There has been some feedback, and another workshop paper on DEHB was pointed out. The main point we missed is a special ordinal encoding of the values of the NB201 categorical hyperparameters.
Why. DEHB is a competitive method, and having it well implemented in Syne Tune increases our chances of adoption with the Freiburg group.
Done. Redo the comparison, especially the `_ORD` variants.
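For context, ordinal encoding here just means mapping each categorical value to its position in a fixed order, so a distance-based model sees nearby operations as nearby numbers instead of equidistant categories. A minimal sketch, assuming the standard NAS-Bench-201 operation names (the helper name is made up):

```python
# Standard NAS-Bench-201 cell operations, in an assumed fixed order
NB201_OPS = ["none", "skip_connect", "nor_conv_1x1", "nor_conv_3x3", "avg_pool_3x3"]

def ordinal_encode(op: str) -> float:
    """Map a categorical value to a scalar in [0, 1] by its index,
    so that a numeric surrogate model can measure distances between ops."""
    return NB201_OPS.index(op) / (len(NB201_OPS) - 1)

config = {"op_0": "skip_connect", "op_1": "nor_conv_3x3"}
encoded = {k: ordinal_encode(v) for k, v in config.items()}
print(encoded)  # {'op_0': 0.25, 'op_1': 0.75}
```

With a one-hot encoding all five operations are equally far apart; the ordinal variant imposes a (chosen, not intrinsic) ordering, which is presumably what the `_ORD` variants test.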
Caused by `self._uniform_int.sample(spec, size=1, random_state)` returning an `int` rather than an iterable. This seems to be caused by this piece of code:
syne-tune/syne_tune/search_space.py
Lines 192 to 196 in 3e1c396
import syne_tune.search_space as sp
fr = sp.finrange(1, 2, 2)
fr.sample(size=2)
> Out[4]: [1.0, 2.0]
fr.sample(size=1)
> Traceback (most recent call last):
> ...
> File "/Users/awgol/code/syne-tune/syne_tune/search_space.py", line 592, in sample
> for x in self._uniform_int.sample(spec, size, random_state)]
> TypeError: 'int' object is not iterable
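The failure mode is easy to reproduce outside the library: a sampler that can return a bare scalar breaks any caller that iterates over its result. A hedged sketch of the pattern and one possible guard; none of these function names are from Syne Tune, and `np.atleast_1d` is the NumPy helper that normalizes a scalar into a one-element array:

```python
import numpy as np

def sample_ints(low, high, size, random_state=None):
    """Sample integer indices, always returning an iterable."""
    rs = np.random.RandomState(random_state)
    out = rs.randint(low, high, size=size)
    # np.atleast_1d guards against any code path that yields a bare scalar,
    # which is what triggered the TypeError above
    return np.atleast_1d(out)

def sample_finrange(lower, upper, n, size, random_state=None):
    """Map sampled indices onto n evenly spaced values in [lower, upper],
    mirroring what a finrange-style domain does."""
    step = (upper - lower) / (n - 1)
    return [lower + i * step for i in sample_ints(0, n, size, random_state)]

print(sample_finrange(1.0, 2.0, 2, size=2))  # two values from {1.0, 2.0}
print(sample_finrange(1.0, 2.0, 2, size=1))  # one-element list, no TypeError
```

Normalizing the sampler's return type at the source keeps every downstream list comprehension working for `size=1` and `size>1` alike.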
(Apologies for creating multiple recent GitHub issues, this is the last one, I promise!)
I took the DataFrame from my experiment results and used Plotly's `plotly.express.parallel_categories` plot to visualize hyperparameter interactions, dropping any features that have only one unique value. The plot is interactive, and you can wrap it in a function that refreshes periodically when new data is available.
This has been super useful for me, so I thought it might be useful to others as well if it were added as a plotting capability to the library? Although I'd understand if it's not desirable to add another dependency. Just thought I'd share!
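For anyone who wants to try this, here is a minimal sketch of the data-preparation step plus the Plotly call. The column names and the `drop_constant_columns` helper are made up for illustration; only `plotly.express.parallel_categories` is the real API:

```python
import pandas as pd

def drop_constant_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Remove features with a single unique value; they add no
    information to a parallel-categories plot."""
    keep = [c for c in df.columns if df[c].nunique() > 1]
    return df[keep]

# Stand-in for a tuner's results DataFrame
results = pd.DataFrame({
    "lr": [1e-3, 1e-2, 1e-3, 1e-1],
    "batch_size": [32, 64, 32, 64],
    "optimizer": ["adam", "adam", "adam", "adam"],  # constant -> dropped
    "accuracy": [0.81, 0.85, 0.80, 0.78],
})
plot_df = drop_constant_columns(results)

try:
    import plotly.express as px
    # Coloring by the objective makes good/bad configurations stand out
    fig = px.parallel_categories(plot_df, color="accuracy")
    # fig.show() to display interactively
except ImportError:
    pass  # plotly not installed; plot_df is still usable with other backends
```

Wrapping the last few lines in a function that re-reads the results file on a timer gives the periodically refreshing view described above.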
Hi, with recent versions of matplotlib, `ExperimentResult.plot()` gives me the warning:
WARNING:matplotlib.legend:No artists with labels found to put in legend. Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
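The warning means that none of the plotted artists carried a `label`, so `legend()` finds nothing to show. A minimal sketch of the underlying matplotlib behaviour (this says nothing about `ExperimentResult.plot` internals, only how the warning arises and goes away):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for the example
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
# Plotting without label=... and then calling ax.legend() would emit the
# "No artists with labels found" warning; labelled lines avoid it
ax.plot([0, 1, 2], [0.70, 0.80, 0.85], label="trial 0")
ax.plot([0, 1, 2], [0.60, 0.82, 0.84], label="trial 1")
leg = ax.legend()
print(len(leg.get_texts()))  # 2 legend entries
```

Labels starting with an underscore (e.g. `label="_hidden"`) are deliberately skipped by `legend()`, which is the second half of the warning message.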