awslabs / syne-tune
Large scale and asynchronous Hyperparameter and Architecture Optimization at your fingertips.
Home Page: https://syne-tune.readthedocs.io
License: Apache License 2.0
When I run the example_syne_tune_for_hf.ipynb notebook, the first cell after the !pip install commands fails with ModuleNotFoundError: No module named 'syne_tune.config_space'.
Cell:
import matplotlib as mpl  # mpl.use('pgf')
import os
%matplotlib inline
import matplotlib.pyplot as plt
import logging
logging.basicConfig(level=logging.INFO)
from pathlib import Path
from syne_tune.backend.local_backend import LocalBackend
from syne_tune.tuner import Tuner
from syne_tune.search_space import uniform, loguniform, choice, randint
from syne_tune.stopping_criterion import StoppingCriterion
from syne_tune.optimizer.baselines import ASHA, MOBSTER, BayesianOptimization, RandomSearch, MOASHA
from syne_tune.constants import ST_WORKER_TIME
from syne_tune.backend.sagemaker_backend.instance_info import select_instance_type
from syne_tune.backend.sagemaker_backend.sagemaker_backend import SagemakerBackend
from syne_tune.backend.sagemaker_backend.sagemaker_utils import get_execution_role
TASK2METRICSMODE = {
"cola": {'metric': 'matthews_correlation', 'mode': 'max'},
"mnli": {'metric': 'accuracy', 'mode': 'max'},
"mrpc": {'metric': 'f1', 'mode': 'max'},
"qnli": {'metric': 'accuracy', 'mode': 'max'},
"qqp": {'metric': 'f1', 'mode': 'max'},
"rte": {'metric': 'accuracy', 'mode': 'max'},
"sst2": {'metric': 'accuracy', 'mode': 'max'},
"stsb": {'metric': 'spearmanr', 'mode': 'max'},
"wnli": {'metric': 'accuracy', 'mode': 'max'},
}
Full Logs:
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-5-ad89febf37d1> in <module>
10 from syne_tune.backend.local_backend import LocalBackend
11 from syne_tune.tuner import Tuner
---> 12 from syne_tune.config_space import uniform, loguniform, choice, randint
13 from syne_tune.stopping_criterion import StoppingCriterion
14 from syne_tune.optimizer.baselines import ASHA, MOBSTER, BayesianOptimization, RandomSearch, MOASHA
ModuleNotFoundError: No module named 'syne_tune.config_space'
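For context, this import path changed between releases: older versions ship syne_tune.search_space, newer ones syne_tune.config_space (the exact version of the rename is my recollection, worth verifying). A version-tolerant notebook cell could try both paths with a small generic helper, sketched here:

```python
import importlib


def import_first(*candidates):
    """Return the first importable module from a list of dotted paths.

    Useful when a module was renamed between releases, e.g. trying
    "syne_tune.config_space" first and falling back to
    "syne_tune.search_space" (or vice versa, depending on the
    installed version). Returns None if no candidate imports.
    """
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    return None
```

In the notebook one would then do `sp = import_first("syne_tune.config_space", "syne_tune.search_space")` and pull `uniform`, `loguniform`, `choice`, `randint` from `sp`.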
Hey folks,
would you be interested in grid search implemented in syne-tune? I had a few offline discussions with some of you already, and it seems you are not against grid search being added to syne-tune, but I want to keep a record of that here.
Additionally, would you have any pointers as to what would be the best way to add grid search to syne-tune?
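As a starting point for the discussion: grid search mostly needs a deterministic enumeration of a finite config space, which a searcher could then propose from in order instead of sampling. A minimal sketch of the enumeration, assuming every hyperparameter is given as a finite list of candidate values:

```python
from itertools import product


def grid_configs(config_space):
    """Enumerate all configs from a space of finite domains.

    Sketch only: every hyperparameter is assumed to be a list of
    candidate values (constants can be wrapped in one-element lists).
    Names are sorted so the enumeration order is deterministic.
    """
    names = sorted(config_space)
    for values in product(*(config_space[name] for name in names)):
        yield dict(zip(names, values))
```

A grid searcher would essentially wrap this iterator and return `None` (or restart) once the grid is exhausted.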
Right now, importing syne_tune.optimizer.baselines fails when only the core dependencies are installed, because it imports QuantileBasedSurrogateSearcher, which in turn requires additional dependencies such as XGBoost or scikit-learn. I would suggest making its import optional to avoid these exceptions.
The goal is to resume #301 and make the YAHPO benchmarks accessible in the blackbox repository, so that we can evaluate our methods on all those new benchmarks.
One example should be added to show how to run distributed ASHA on one YAHPO benchmark.
In my experiment, the results dataframe contains multiple rows for trial id 1 with the same content as the following row, the only difference being the config. This causes problems, since sometimes the best config is reported as trial id 1 showing a config which did not actually achieve the best performance.
See this example: True trial id 1 performance is 81% (Row 4) but trial id 1 also shows up in row 10 with highest accuracy.
I've added a simple example to reproduce this behavior.
from pathlib import Path
from sagemaker.pytorch import PyTorch
from syne_tune.backend import SageMakerBackend
from sagemaker import get_execution_role
from syne_tune.optimizer.baselines import RandomSearch
from syne_tune import Tuner
from syne_tune.config_space import randint
from syne_tune import StoppingCriterion
from syne_tune.optimizer.schedulers.fifo import FIFOScheduler
entry_point = Path('examples') / "training_scripts" / "height_example" / "train_height.py"
assert entry_point.is_file(), 'File unknown'
mode = "min"
metric = "mean_loss"
instance_type = 'ml.c5.4xlarge'
instance_count = 1
instance_max_time = 999
n_workers = 20
config_space = {
    "steps": 1,
    "width": randint(0, 20),
    "height": randint(-100, 100),
}
backend = SageMakerBackend(
    sm_estimator=PyTorch(
        entry_point=str(entry_point),
        instance_type=instance_type,
        instance_count=instance_count,
        role=get_execution_role(),
        max_run=instance_max_time,
        py_version='py3',
        framework_version='1.6',
    ),
    metrics_names=[metric],
)
# Random search without stopping
scheduler = FIFOScheduler(
    config_space=config_space,
    searcher='random',
    mode=mode,
    metric=metric,
)
tuner = Tuner(
    trial_backend=backend,
    scheduler=scheduler,
    stop_criterion=StoppingCriterion(max_wallclock_time=300),
    n_workers=n_workers,
)
tuner.run()
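Until the root cause is fixed, a quick sanity check over the results can flag trials whose config is not consistent across rows, so their "best" entries are not trusted. Sketched over plain dict rows here (with pandas one would group by trial_id similarly):

```python
def inconsistent_trials(rows):
    """Return the trial_ids whose recorded config differs across rows.

    `rows` is an iterable of dicts with at least "trial_id" and "config"
    keys, a stand-in for the rows of the tuner's results dataframe.
    """
    first_seen = {}
    bad = set()
    for row in rows:
        tid = row["trial_id"]
        cfg = tuple(sorted(row["config"].items()))
        # remember the first config seen for this trial; flag mismatches
        if first_seen.setdefault(tid, cfg) != cfg:
            bad.add(tid)
    return bad
```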
I have a long list of results available in my local Syne Tune folder (~/syne-tune/), but list_experiments doesn't add them to the list. The experiments are not stored in an S3 bucket of mine; they were copied from a different location into the local filesystem.
Hi,
I'm running Syne Tune on the conda_python3 Jupyter kernel of a SageMaker-managed EC2 instance (an ml.m5d.12xlarge notebook instance), which has no GPUs.
However, in the Syne Tune logs I see:
INFO:root:Detected 2 GPUs
and then, a few lines below:
DEBUG:root:Free GPUs: {0, 1}
DEBUG:root:Assigned GPU 0 to trial_id 0
But an m5d.12xlarge is not expected to have GPUs, right?
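To cross-check what the machine actually exposes, one can ask nvidia-smi directly. A sketch that returns 0 when the tool is absent or failing, as expected on a CPU-only instance:

```python
import shutil
import subprocess


def visible_gpus():
    """Count GPUs reported by nvidia-smi; 0 if the tool is missing or errors.

    A cross-check against library-level detection: on a CPU-only instance
    this should return 0 even if some heuristic elsewhere claims otherwise.
    """
    if shutil.which("nvidia-smi") is None:
        return 0
    proc = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    if proc.returncode != 0:
        return 0
    # "nvidia-smi -L" prints one "GPU <i>: <name> ..." line per device
    return sum(1 for line in proc.stdout.splitlines() if line.startswith("GPU "))
```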
Hi, would it be possible to express conditional structure in the config_space search, so that it can describe a multi-step pipeline? Something that would prevent invalid pipelines from being formed out of the algorithm + hyperparameter config variables.
A SageMaker training job failed for a seemingly random reason, which breaks the tuner:
File "/opt/conda/lib/python3.8/site-packages/syne_tune/tuner.py", line 152, in run
new_done_trial_statuses, new_results = self._process_new_results(
File "/opt/conda/lib/python3.8/site-packages/syne_tune/tuner.py", line 282, in _process_new_results
done_trials_statuses = self._update_running_trials(trial_status_dict, new_results, callbacks=self.callbacks)
File "/opt/conda/lib/python3.8/site-packages/syne_tune/tuner.py", line 437, in _update_running_trials
assert trial_id in self.last_seen_result_per_trial, \
AssertionError: trial 35 completed and no metrics got observed
It would be great to retry failed jobs, or at least ignore them and continue somehow.
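As a stopgap, the tuning loop can be wrapped in a generic retry helper. This is a hypothetical wrapper, not a Syne Tune API; `run` would be something like `lambda: tuner.run()`, ideally with resume semantics so already finished trials are not repeated:

```python
def run_with_retries(run, max_restarts=3, retriable=(AssertionError,)):
    """Call `run` and restart it after a retriable failure.

    `retriable` lists the exception types worth restarting on (the
    AssertionError default matches the failure mode in this report);
    anything else propagates immediately.
    """
    attempts = 0
    while True:
        try:
            return run()
        except retriable:
            attempts += 1
            if attempts > max_restarts:
                raise  # give up after max_restarts restarts
```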
After a tuner.run() execution, I'd like to be able to programmatically get the best config, either from the tuner or from its data folder, e.g.:
tuner.best_config()
or
tuning_experiment = load_experiment("experiment-xxxxxxxx")
tuning_experiment.best_config()
Is there an API for this?
If not, I suggest adding it to the roadmap.
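Until a built-in API exists, the lookup can be done over the results rows directly. A sketch with plain dict rows (with the actual dataframe one would use idxmax/idxmin similarly); the "config_" column prefix is an assumption about the results layout, worth checking against your version:

```python
def best_config(rows, metric, mode="max"):
    """Return the config of the best row in the tuning results.

    `rows` stands in for the results dataframe rows; config columns are
    assumed to carry a "config_" prefix, which is stripped on return.
    """
    pick = max if mode == "max" else min
    best = pick(rows, key=lambda row: row[metric])
    return {
        key[len("config_"):]: value
        for key, value in best.items()
        if key.startswith("config_")
    }
```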
Right now, releases are cut manually with ad-hoc scripts. Add automated tooling to cut PyPI releases, for instance using poetry instead of manual scripts.
Hi,
I launched a Syne Tune experiment a few hours ago (experiment-2022-01-11-10-57-17-491), then stopped it and launched another one.
While experiment-2022-01-11-10-57-17-491 was running I could see its chart using
from syne_tune.experiments import load_experiment
tuning_experiment = load_experiment("experiment-2022-01-11-10-57-17-491")
tuning_experiment.plot()
Now, when I do the same from the same machine, I get:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-5-0c0bfae5f6de> in <module>
1 # metric over time
----> 2 tuning_experiment.plot()
~/anaconda3/envs/python3/lib/python3.6/site-packages/syne_tune/experiments.py in plot(self, **plt_kwargs)
51 import matplotlib.pyplot as plt
52
---> 53 scheduler = self.tuner.scheduler
54 metric = self.metric_name()
55 df = self.results
AttributeError: 'NoneType' object has no attribute 'scheduler'
What is wrong? Can the graph be accessed only while the tuner is running?
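If only the results dataframe survives (the traceback shows the serialized tuner is absent, hence self.tuner is None), the same curve can be rebuilt by hand. The core is a cumulative-best transform, sketched below; plotting it against the wallclock-time column of the results reproduces what plot() shows, without needing the tuner object:

```python
def best_over_time(values, mode="min"):
    """Cumulative best of a metric sequence.

    `values` is the metric column in reporting order; the returned list
    has the same length, holding the best value seen so far at each step.
    """
    out, best = [], None
    for v in values:
        if best is None or (v < best if mode == "min" else v > best):
            best = v
        out.append(best)
    return out
```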
Please clarify in the README: can syne-tune wrap any Ray Tune scheduler/searcher? Which Ray Tune version is used?
Is PBT supported on the SageMaker backend?
Hello!
Serialization of syne_tune.search_space.finrange is failing: if we define an sp.finrange in the config_space, the blackbox serialization tests fail:
https://github.com/talesa/syne-tune/blob/blackbox-serialization-with-finrange-bug/tst/blackbox_repository/test_blackbox.py#L21-L24
Hi,
I'm using Syne Tune from a SageMaker-managed EC2 instance (notebook instance)
As indicated here, I'm using this code in my backend script:
from syne_tune.report import report
which raises an ImportError: cannot import name 'report'.
When I look in report.py, I can see a Reporter class but no report.
This blog post however proposes a different code:
from syne_tune.report import Reporter
report = Reporter()
Could the README be clarified?
(Note that I cannot check the version: import syne_tune; syne_tune.__version__ returns an AttributeError: module 'syne_tune' has no attribute '__version__'.)
thanks!
What. Longer step-by-step tutorial on how to run experiments with our async and sync multi-fidelity HPO methods, both using tabulated blackboxes and a real DNN tuning problem (Hugging Face?).
Why. The way in which variants of different algos are implemented and available in ST could be a real advantage, but is right now hidden and undocumented. A tutorial would be most accessible, and would clarify important concepts (sync/async)
Done. Tutorial tested with volunteer outside the team, feedback incorporated
A second part of the tutorial could be for developers: how to implement a new scheduler, or a variant of an existing one.
Hi, when running the container build script, it fails at the following:
Step 12/12 : RUN python -m pip install --no-cache-dir --upgrade -r /tmp/packages/requirements.txt
---> Running in 67f84184ab3e
ERROR: Extras after version '>=1.3ray[tune]'.
The command '/bin/sh -c python -m pip install --no-cache-dir --upgrade -r /tmp/packages/requirements.txt' returned a non-zero code: 1
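The pip message suggests two requirement specifiers got fused onto a single line, i.e. a version pin immediately followed by ray[tune] with no newline in between. Assuming that is the cause, the fix is just to split the line (the package name is a placeholder, since the log does not show it):

```text
# before -- pip parses "ray[tune]" as extras appended to the version:
#   <some-package>>=1.3ray[tune]
# after -- one requirement per line:
<some-package>>=1.3
ray[tune]
```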
Hi,
how can I set a custom tuner_path? I'm launching long-running experiments on remote SageMaker jobs, and I'd like to set the tuner metadata path to /opt/ml/checkpoints (a local path on those transient VMs), so the metadata gets synced to S3 upon updates.
I accidentally ran list_exp = list_experiments() using a different AWS account, and noticed that the function fails with EmptyDataError: No columns to parse from file when there are no experiments available.
Hi, I am using LocalBackend to train a couple of Hugging Face models on a sample dataset (still WIP).
However, I ran into the following errors:
INFO:syne_tune.optimizer.schedulers.hyperband:trial_id 1 starts (first milestone = 1)
INFO:root:running subprocess with command: /opt/conda/bin/python huggingface_on_excel.py --model_type google/electra-base-discriminator --learning_rate 8.018154654725304e-05 --weight_decay 1.3591419560772573e-07 --dataset_path /DATA/jin/ --CUDA_VISIBLE_DEVICES 2 --train_batch_size 8 --valid_batch_size 8 --epochs 1 --output_dir output/ --eval_steps 100 --st_checkpoint_dir /root/syne-tune/test-hugging/1/checkpoints
INFO:syne_tune.tuner:(trial 1) - scheduled config {'model_type': 'google/electra-base-discriminator', 'learning_rate': 8.018154654725304e-05, 'weight_decay': 1.3591419560772573e-07, 'dataset_path': '/DATA/jin/', 'CUDA_VISIBLE_DEVICES': '2', 'train_batch_size': 8, 'valid_batch_size': 8, 'epochs': 1, 'output_dir': 'output/', 'eval_steps': 100}
INFO:syne_tune.tuner:Trial trial_id 1 was stopped independently of the scheduler.
INFO:syne_tune.optimizer.schedulers.fifo:trial_id 1: Evaluation failed!
Some of the debugging methods I have tried:
Any advice will be appreciated. Thank you!
Running python docs/tutorials/basics/scripts/launch_sagemaker_backend.py produces SM training jobs named None-0, None-1, etc., which do not depend on tuner_name.
Rerunning the example leads to duplicate SM training job names and hence failure of the script.
This is because tuner_name inside the SagemakerBackend object is only ever set to None in the constructor.
What. Implement Hyper-Tune as extension of asynchronous Hyperband (ASHA)
Why. Very competitive method, according to the paper. We lack async HB methods that do a good job with bracket sampling
Done. Some unit tests, comparison with baselines
There is no equivalent of choice for numeric values. E.g., in the FCNet blackbox the learning rate is defined as 'hp_init_lr': choice([0.0005, 0.001, 0.005, 0.01, 0.05, 0.1]). This does not allow model-based approaches to encode this hyperparameter correctly. It would be great to identify such hyperparameters as numeric, and also to indicate whether a log transform is needed.
Hi, I receive this warning using the docker image:
PythonDeprecationWarning: Boto3 will no longer support Python 3.6 starting May 30, 2022. To continue receiving service updates, bug fixes, and security updates please upgrade to Python 3.7 or later. More information can be found here: https://aws.amazon.com/blogs/developer/python-support-policy-updates-for-aws-sdks-and-tools/
Since this is only a couple of weeks away it might be a good idea to update the Dockerfile to Python 3.7 now.
Hi, I have a limit of 8 ml.g5.12xlarge instances, and although I set Tuner.n_workers = 5, I still got a ResourceLimitExceeded error. Is there a way to make sure that jobs are fully stopped when using SageMakerBackend before launching new ones?
Also, when using RemoteLauncher, in situations where the management instance errors out (for example due to ResourceLimitExceeded), is there a way to make sure the management instance sends a stop signal to all tuning jobs before exiting? Maybe something like:
try:
# manage tuning jobs
except:
# raise error
finally:
# stop any trials still running
If an HPO experiment with the SageMaker backend and n_workers=16 is run, it can fail with a ThrottlingException.
The problem is that at the start, all 16 trials are launched very close to each other, as a batch.
This failure should be avoided by simply retrying job starts until they succeed.
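A sketch of per-call retry with exponential backoff and jitter. RuntimeError stands in for botocore's ClientError with code 'ThrottlingException' (in real code one would inspect the error code); the sleep function is injectable so the behavior can be tested:

```python
import random
import time


def start_with_backoff(start_job, max_attempts=6, base_delay=1.0, sleep=time.sleep):
    """Retry a throttled job-start call, doubling the delay each attempt."""
    for attempt in range(max_attempts):
        try:
            return start_job()
        except RuntimeError:  # stand-in for a ThrottlingException
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the throttling error
            # full exponential delay, jittered into [0.5, 1.0) of its value
            sleep(base_delay * (2 ** attempt) * (0.5 + random.random() / 2))
```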
Hi,
When using Syne Tune, I'd like to be able to extract the raw tuning data (parameters, metrics, timestamps) into a dataframe to analyze it. Is there something like a tuner.data() API? For example, SageMaker HPO has the great tuner.analytics().
If this API doesn't exist, I suggest adding it to the roadmap.
It would be useful to have an example showing how to retrieve the trained model providing the best performance.
Add a stopping criterion that stops the HPO process if it hasn't improved for N consecutive steps.
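The core logic could be a small stateful tracker, sketched below. Tuner's stop_criterion is a callable receiving a TuningStatus, so one would wrap this in a function that feeds it the latest value of the tracked metric (the exact TuningStatus fields are an assumption here, not shown):

```python
class NoImprovementCriterion:
    """Signal stop after `patience` reported results without improvement."""

    def __init__(self, patience, mode="min"):
        self.patience = patience
        self.mode = mode
        self.best = None
        self.stale = 0

    def record(self, value):
        """Feed the latest metric value; returns True when tuning should stop."""
        improved = (
            self.best is None
            or (value < self.best if self.mode == "min" else value > self.best)
        )
        if improved:
            self.best, self.stale = value, 0
        else:
            self.stale += 1
        return self.stale >= self.patience
```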
There seems to be a problem with the Hyperband promotion logic.
How to reproduce:
Add type="promotion" to https://github.com/awslabs/syne-tune/blob/main/benchmarking/nursery/benchmark_automl/baselines.py#L69
Run python benchmarking/nursery/benchmark_automl/benchmark_main.py --num_seeds 1 --method ASHA --benchmark lcbench-airlines
File "/syne-tune/benchmarking/nursery/benchmark_automl/benchmark_main.py", line 209, in <module>
tuner.run()
File "/syne-tune/syne_tune/tuner.py", line 240, in run
raise e
File "/syne-tune/syne_tune/tuner.py", line 175, in run
new_done_trial_statuses, new_results = self._process_new_results(
File "/syne-tune/syne_tune/tuner.py", line 345, in _process_new_results
done_trials_statuses = self._update_running_trials(
File "/syne-tune/syne_tune/tuner.py", line 465, in _update_running_trials
decision = self.scheduler.on_trial_result(trial=trial, result=result)
File "/syne-tune/syne_tune/optimizer/schedulers/hyperband.py", line 779, in on_trial_result
task_info = self.terminator.on_task_report(trial_id, result)
File "/syne-tune/syne_tune/optimizer/schedulers/hyperband.py", line 1124, in on_task_report
rung_sys.on_task_report(trial_id, result, skip_rungs=skip_rungs)
File "/syne-tune/syne_tune/optimizer/schedulers/hyperband_promotion.py", line 221, in on_task_report
assert resource == milestone, (
AssertionError: trial_id 1: resource = 4 > 3 = milestone. Make sure to report time attributes covering all milestones
No unit test covers wait_trial_completion_when_stopping=True. I need to get back to this at some point.
Hi, could you add a method to attach to the logs of the SageMakerBackend management estimator? For example, RemoteLauncher.logs, so we can simply do remote.logs()?
Some customers can't access the console to view CloudWatch logs, so this would be easier for them than fiddling with boto3.
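In the meantime, the logs can be pulled with a small helper over the CloudWatch Logs API: SageMaker training jobs log to the '/aws/sagemaker/TrainingJobs' group, one stream per job. The client is passed in (e.g. boto3.client("logs")), and pagination follows get_log_events' convention that the forward token repeats once the stream is exhausted:

```python
def training_job_logs(job_name, logs_client):
    """Yield CloudWatch log lines for a SageMaker training job.

    `logs_client` is a CloudWatch Logs client, e.g. boto3.client("logs").
    """
    group = "/aws/sagemaker/TrainingJobs"
    streams = logs_client.describe_log_streams(
        logGroupName=group, logStreamNamePrefix=job_name
    )["logStreams"]
    for stream in streams:
        token = None
        while True:
            kwargs = {
                "logGroupName": group,
                "logStreamName": stream["logStreamName"],
                "startFromHead": True,
            }
            if token is not None:
                kwargs["nextToken"] = token
            response = logs_client.get_log_events(**kwargs)
            for event in response["events"]:
                yield event["message"]
            if response["nextForwardToken"] == token:
                break  # repeated token means the stream is exhausted
            token = response["nextForwardToken"]
```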
Hi,
I see in the blog that one can query experiments by name, to access the metrics:
from syne_tune.experiments import load_experiment
tuning_experiment = load_experiment("train-cifar100-2021-11-05-15-22-27-531")
tuning_experiment.plot()
How do we know and set an experiment name?
This is a follow-up issue for #304.
Dear creators, thank you again for your great work, and perhaps sorry for being annoying with my suggestions/questions. Is it possible to change the home directory of different runs, such that it is not ~/syne-tune but a custom path? Thanks!
Dear creators, thank you for your great work. Is there a way to specify a packaging format for the input hyperparameters of our main script? E.g., in our project we do not pass hyperparameters directly, as in python3 train.py --width 1, but through python3 train.py --hyperparameters='{"width": 1}', to avoid adding a new argument to our parser and to avoid clutter each time we would like to change something. I have checked the FAQ but have not found anything related. Thank you for your input!
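On the training-script side this is a one-liner with argparse, sketched below. Note that Syne Tune itself passes each config entry as a separate CLI argument, so supporting this end-to-end would need the tuner side to serialize the sampled config into this single flag, which is exactly the feature being requested:

```python
import argparse
import json


def parse_args(argv=None):
    """Accept all hyperparameters as a single JSON blob.

    Script-side sketch only; json.loads as the `type` converts
    --hyperparameters='{"width": 1}' straight into a dict.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--hyperparameters", type=json.loads, default={})
    return parser.parse_args(argv)
```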
Methods that use the top-k candidates will therefore propose configs outside of the config space, which will raise an error. Restrict the feasible candidates for this function to everything within the config_space.
Integrating the HyperBO benchmark introduced in Pre-trained Gaussian processes for Bayesian optimization.
I need to tune the hyperparameters of different algorithms in parallel, such as RF and XGBoost. I tried ray.tune, but if I tune two algorithms' hyperparameters in parallel it mixes them up and errors out: it applies XGBoost parameters to RF. So I want to know whether this architecture can solve this problem.
Perhaps the following reference can be added to the readme:
Hi, I'm using SageMaker as a backend with the remote launcher. I noticed that if a job errors out during training, the latest performance logs are not captured.
For example, in my HPO experiment on the CIFAR-10 dataset, one trial (number 8) was reported in the Syne Tune results dataframe as achieving a validation accuracy of 0.8478 at epoch 22.
However, my CloudWatch logs show that the validation accuracy actually reached 0.926 at epoch 60 before crashing.
Interestingly, the job shows as Stopped rather than Failed in the SageMaker console. Does Syne Tune notice an exception and stop the job before it exits with a failure?
Test (3.8) fails with:
____________ ERROR collecting tst/schedulers/test_schedulers_api.py ____________
ImportError while importing test module '/home/runner/work/syne-tune/syne-tune/tst/schedulers/test_schedulers_api.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tst/schedulers/test_schedulers_api.py:47: in <module>
    from syne_tune.optimizer.schedulers.botorch.botorch_searcher import BotorchSearcher
syne_tune/optimizer/schedulers/botorch/botorch_searcher.py:18: in <module>
    from botorch.models import SingleTaskGP
/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/botorch/__init__.py:7: in <module>
    from botorch import (
/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/botorch/acquisition/__init__.py:7: in <module>
    from botorch.acquisition.acquisition import (
/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/botorch/acquisition/acquisition.py:16: in <module>
    from botorch.models.model import Model
/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/botorch/models/__init__.py:7: in <module>
    from botorch.models.approximate_gp import (
/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/botorch/models/approximate_gp.py:35: in <module>
    from botorch.models.gpytorch import GPyTorchModel
/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/botorch/models/gpytorch.py:23: in <module>
    from botorch.acquisition.objective import PosteriorTransform
/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/botorch/acquisition/objective.py:18: in <module>
    from botorch.models.model import Model
/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/botorch/models/model.py:24: in <module>
    from botorch.models.utils.assorted import fantasize as fantasize_flag
/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/botorch/models/utils/__init__.py:7: in <module>
    from botorch.models.utils.assorted import (
/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/botorch/models/utils/assorted.py:21: in <module>
    from gpytorch.utils.broadcasting import _mul_broadcast_shape
E   ImportError: cannot import name '_mul_broadcast_shape' from 'gpytorch.utils.broadcasting' (/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/gpytorch/utils/broadcasting.py)
Hello!
When running https://github.com/awslabs/syne-tune/blob/main/docs/tutorials/basics/scripts/launch_sagemaker_backend.py (python docs/tutorials/basics/scripts/launch_sagemaker_backend.py) on the main branch, I get an error within the spawned SageMaker training jobs:
Traceback (most recent call last):
File "traincode_report_withcheckpointing.py", line 29, in <module>
from benchmarks.checkpoint import resume_from_checkpointed_model, \
ModuleNotFoundError: No module named 'benchmarks'
I'm including the full log below.
I'm not certain whether it's due to my AWS environment setup (although I am generally able to run SageMaker training jobs) or an issue with the code; could you please have a look?
Best wishes,
Adam
Full log:
showing log of sagemaker job: traincode-report-withcheckpointing-2022-01-18-16-26-35-248-4
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2022-01-18 16:34:35,020 sagemaker-training-toolkit INFO Imported framework sagemaker_pytorch_container.training
2022-01-18 16:34:35,023 sagemaker-training-toolkit INFO No GPUs detected (normal if no gpus installed)
2022-01-18 16:34:35,035 sagemaker_pytorch_container.training INFO Block until all host DNS lookups succeed.
2022-01-18 16:34:36,465 sagemaker_pytorch_container.training INFO Invoking user training script.
2022-01-18 16:34:37,061 sagemaker-training-toolkit INFO No GPUs detected (normal if no gpus installed)
2022-01-18 16:34:37,076 sagemaker-training-toolkit INFO No GPUs detected (normal if no gpus installed)
2022-01-18 16:34:37,090 sagemaker-training-toolkit INFO No GPUs detected (normal if no gpus installed)
2022-01-18 16:34:37,103 sagemaker-training-toolkit INFO Invoking user script
Training Env:
{
"additional_framework_parameters": {},
"channel_input_dirs": {},
"current_host": "algo-1",
"framework_module": "sagemaker_pytorch_container.training:main",
"hosts": [
"algo-1"
],
"hyperparameters": {
"batch_size": 126,
"weight_decay": 0.7744002774231975,
"st_checkpoint_dir": "/opt/ml/checkpoints",
"st_instance_count": 1,
"n_units_2": 322,
"dataset_path": "./",
"n_units_1": 107,
"dropout_2": 0.20979101632756325,
"dropout_1": 0.4715702331554363,
"epochs": 81,
"learning_rate": 0.0029903699075321814,
"st_instance_type": "ml.m4.10xlarge"
},
"input_config_dir": "/opt/ml/input/config",
"input_data_config": {},
"input_dir": "/opt/ml/input",
"is_master": true,
"job_name": "traincode-report-withcheckpointing-2022-01-18-16-26-35-248-4",
"log_level": 20,
"master_hostname": "algo-1",
"model_dir": "/opt/ml/model",
"module_dir": "s3://sagemaker-us-west-2-640549960621/traincode-report-withcheckpointing-2022-01-18-16-26-35-248-4/source/sourcedir.tar.gz",
"module_name": "traincode_report_withcheckpointing",
"network_interface_name": "eth0",
"num_cpus": 40,
"num_gpus": 0,
"output_data_dir": "/opt/ml/output/data",
"output_dir": "/opt/ml/output",
"output_intermediate_dir": "/opt/ml/output/intermediate",
"resource_config": {
"current_host": "algo-1",
"hosts": [
"algo-1"
],
"network_interface_name": "eth0"
},
"user_entry_point": "traincode_report_withcheckpointing.py"
}
Environment variables:
SM_HOSTS=["algo-1"]
SM_NETWORK_INTERFACE_NAME=eth0
SM_HPS={"batch_size":126,"dataset_path":"./","dropout_1":0.4715702331554363,"dropout_2":0.20979101632756325,"epochs":81,"learning_rate":0.0029903699075321814,"n_units_1":107,"n_units_2":322,"st_checkpoint_dir":"/opt/ml/checkpoints","st_instance_count":1,"st_instance_type":"ml.m4.10xlarge","weight_decay":0.7744002774231975}
SM_USER_ENTRY_POINT=traincode_report_withcheckpointing.py
SM_FRAMEWORK_PARAMS={}
SM_RESOURCE_CONFIG={"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"}
SM_INPUT_DATA_CONFIG={}
SM_OUTPUT_DATA_DIR=/opt/ml/output/data
SM_CHANNELS=[]
SM_CURRENT_HOST=algo-1
SM_MODULE_NAME=traincode_report_withcheckpointing
SM_LOG_LEVEL=20
SM_FRAMEWORK_MODULE=sagemaker_pytorch_container.training:main
SM_INPUT_DIR=/opt/ml/input
SM_INPUT_CONFIG_DIR=/opt/ml/input/config
SM_OUTPUT_DIR=/opt/ml/output
SM_NUM_CPUS=40
SM_NUM_GPUS=0
SM_MODEL_DIR=/opt/ml/model
SM_MODULE_DIR=s3://sagemaker-us-west-2-640549960621/traincode-report-withcheckpointing-2022-01-18-16-26-35-248-4/source/sourcedir.tar.gz
SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{},"current_host":"algo-1","framework_module":"sagemaker_pytorch_container.training:main","hosts":["algo-1"],"hyperparameters":{"batch_size":126,"dataset_path":"./","dropout_1":0.4715702331554363,"dropout_2":0.20979101632756325,"epochs":81,"learning_rate":0.0029903699075321814,"n_units_1":107,"n_units_2":322,"st_checkpoint_dir":"/opt/ml/checkpoints","st_instance_count":1,"st_instance_type":"ml.m4.10xlarge","weight_decay":0.7744002774231975},"input_config_dir":"/opt/ml/input/config","input_data_config":{},"input_dir":"/opt/ml/input","is_master":true,"job_name":"traincode-report-withcheckpointing-2022-01-18-16-26-35-248-4","log_level":20,"master_hostname":"algo-1","model_dir":"/opt/ml/model","module_dir":"s3://sagemaker-us-west-2-640549960621/traincode-report-withcheckpointing-2022-01-18-16-26-35-248-4/source/sourcedir.tar.gz","module_name":"traincode_report_withcheckpointing","network_interface_name":"eth0","num_cpus":40,"num_gpus":0,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"},"user_entry_point":"traincode_report_withcheckpointing.py"}
SM_USER_ARGS=["--batch_size","126","--dataset_path","./","--dropout_1","0.4715702331554363","--dropout_2","0.20979101632756325","--epochs","81","--learning_rate","0.0029903699075321814","--n_units_1","107","--n_units_2","322","--st_checkpoint_dir","/opt/ml/checkpoints","--st_instance_count","1","--st_instance_type","ml.m4.10xlarge","--weight_decay","0.7744002774231975"]
SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
SM_HP_BATCH_SIZE=126
SM_HP_WEIGHT_DECAY=0.7744002774231975
SM_HP_ST_CHECKPOINT_DIR=/opt/ml/checkpoints
SM_HP_ST_INSTANCE_COUNT=1
SM_HP_N_UNITS_2=322
SM_HP_DATASET_PATH=./
SM_HP_N_UNITS_1=107
SM_HP_DROPOUT_2=0.20979101632756325
SM_HP_DROPOUT_1=0.4715702331554363
SM_HP_EPOCHS=81
SM_HP_LEARNING_RATE=0.0029903699075321814
SM_HP_ST_INSTANCE_TYPE=ml.m4.10xlarge
PYTHONPATH=/opt/ml/code:/opt/conda/bin:/opt/conda/lib/python36.zip:/opt/conda/lib/python3.6:/opt/conda/lib/python3.6/lib-dynload:/opt/conda/lib/python3.6/site-packages
Invoking script with the following command:
/opt/conda/bin/python3.6 traincode_report_withcheckpointing.py --batch_size 126 --dataset_path ./ --dropout_1 0.4715702331554363 --dropout_2 0.20979101632756325 --epochs 81 --learning_rate 0.0029903699075321814 --n_units_1 107 --n_units_2 322 --st_checkpoint_dir /opt/ml/checkpoints --st_instance_count 1 --st_instance_type ml.m4.10xlarge --weight_decay 0.7744002774231975
Traceback (most recent call last):
File "traincode_report_withcheckpointing.py", line 29, in <module>
from benchmarks.checkpoint import resume_from_checkpointed_model, \
ModuleNotFoundError: No module named 'benchmarks'
2022-01-18 16:34:38,444 sagemaker-training-toolkit ERROR ExecuteUserScriptError:
Command "/opt/conda/bin/python3.6 traincode_report_withcheckpointing.py --batch_size 126 --dataset_path ./ --dropout_1 0.4715702331554363 --dropout_2 0.20979101632756325 --epochs 81 --learning_rate 0.0029903699075321814 --n_units_1 107 --n_units_2 322 --st_checkpoint_dir /opt/ml/checkpoints --st_instance_count 1 --st_instance_type ml.m4.10xlarge --weight_decay 0.7744002774231975"
Traceback (most recent call last):
File "traincode_report_withcheckpointing.py", line 29, in <module>
from benchmarks.checkpoint import resume_from_checkpointed_model, \
ModuleNotFoundError: No module named 'benchmarks'
When running (on the main branch), e.g.,
python benchmarking/nursery/benchmark_automl/benchmark_main.py --num_seeds 1 --method ASHA --benchmark fcnet-protein
I get the following warnings. Is this expected? Does anyone know what's going on?
WARNING:syne_tune.backend.simulator_backend.simulator_backend:The following trials reported results, but are not covered by trial_ids. These results will be ignored:
trial_id 38: status = Stopped, num_results = 1
WARNING:syne_tune.backend.simulator_backend.simulator_backend:The following trials reported results, but are not covered by trial_ids. These results will be ignored:
trial_id 44: status = Stopped, num_results = 1
WARNING:syne_tune.backend.simulator_backend.simulator_backend:The following trials reported results, but are not covered by trial_ids. These results will be ignored:
trial_id 49: status = Stopped, num_results = 1
trial_id 77: status = Stopped, num_results = 1
trial_id 86: status = Stopped, num_results = 1
trial_id 113: status = Stopped, num_results = 1
trial_id 121: status = Stopped, num_results = 1
trial_id 142: status = Stopped, num_results = 1
trial_id 169: status = Stopped, num_results = 1
trial_id 186: status = Stopped, num_results = 1
trial_id 188: status = Stopped, num_results = 1
trial_id 229: status = Stopped, num_results = 1
trial_id 247: status = Stopped, num_results = 1
trial_id 252: status = Stopped, num_results = 1
trial_id 260: status = Stopped, num_results = 1
trial_id 255: status = Stopped, num_results = 1
trial_id 264: status = Stopped, num_results = 1
trial_id 297: status = Stopped, num_results = 1
trial_id 309: status = Stopped, num_results = 1
trial_id 314: status = Stopped, num_results = 1
Hello! Syne Tune already has `sp.choice` for categorical variables, and `sp.finrange` and `sp.logfinrange` for numerical values, but I feel that sometimes it is easier to manually specify the elements (as with `sp.choice`), yet have them treated as numerical values by the GP models and by the blackbox surrogate models. Hence I'm wondering about implementing something like `sp.number_choice`, mostly for convenience. What do you think?
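One way to get that behaviour without library support is a small standalone domain that samples from a user-given list of numbers but exposes them to a surrogate by their sorted position. This is only a sketch: `NumberChoice`, `sample`, and `to_internal` are hypothetical names for illustration, not part of the Syne Tune API:

```python
import random

class NumberChoice:
    """Hypothetical domain: a finite set of user-given numeric values,
    treated as ordered numbers rather than unordered categories."""

    def __init__(self, values):
        # Sort so a model can exploit the natural ordering of the values
        self.values = sorted(values)

    def sample(self, size=1, rng=random):
        # Always return a list, even for size=1
        return [rng.choice(self.values) for _ in range(size)]

    def to_internal(self, value):
        # Encode as position in the sorted list, normalized to [0, 1];
        # this is the kind of numeric encoding a GP surrogate could consume
        return self.values.index(value) / max(len(self.values) - 1, 1)

domain = NumberChoice([0.1, 0.5, 1.0, 2.0])
print(domain.to_internal(1.0))  # 0.6666666666666666 (index 2 of 3)
```

The point of `to_internal` is that 0.5 and 1.0 end up closer together than 0.1 and 2.0, which is exactly what a categorical one-hot encoding loses.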
Hi, is this library compatible with SageMaker Experiments tracking and/or the SageMaker estimator `experiment_config`?
What. There has been some feedback, and another workshop paper on DEHB was pointed out. The main point we missed is a special ordinal encoding of the values of the NB201 categorical hyperparameters.
Why. DEHB is a competitive method, and having it well implemented in Syne Tune increases our chances of adoption with the Freiburg group.
Done. Redo the comparison, especially the `_ORD` variants.
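For context, ordinal encoding here just means mapping each categorical value to its position in a fixed order, so a distance-based model sees nearby operations as nearby numbers instead of equidistant categories. A minimal sketch, assuming the standard NAS-Bench-201 operation names (the helper name is made up):

```python
# Standard NAS-Bench-201 cell operations, in an assumed fixed order
NB201_OPS = ["none", "skip_connect", "nor_conv_1x1", "nor_conv_3x3", "avg_pool_3x3"]

def ordinal_encode(op: str) -> float:
    """Map a categorical value to a scalar in [0, 1] by its index,
    so that a numeric surrogate model can measure distances between ops."""
    return NB201_OPS.index(op) / (len(NB201_OPS) - 1)

config = {"op_0": "skip_connect", "op_1": "nor_conv_3x3"}
encoded = {k: ordinal_encode(v) for k, v in config.items()}
print(encoded)  # {'op_0': 0.25, 'op_1': 0.75}
```

With a one-hot encoding all five operations are equally far apart; the ordinal variant imposes a (chosen, not intrinsic) ordering, which is presumably what the `_ORD` variants test.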
Caused by `self._uniform_int.sample(spec, size=1, random_state)` returning an `int` rather than an iterable. This seems to be caused by this piece of code:
syne-tune/syne_tune/search_space.py
Lines 192 to 196 in 3e1c396
import syne_tune.search_space as sp
fr = sp.finrange(1, 2, 2)
fr.sample(size=2)
> Out[4]: [1.0, 2.0]
fr.sample(size=1)
> Traceback (most recent call last):
> ...
> File "/Users/awgol/code/syne-tune/syne_tune/search_space.py", line 592, in sample
> for x in self._uniform_int.sample(spec, size, random_state)]
> TypeError: 'int' object is not iterable
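The failure mode is easy to reproduce outside the library: a sampler that can return a bare scalar breaks any caller that iterates over its result. A hedged sketch of the pattern and one possible guard; none of these function names are from Syne Tune, and `np.atleast_1d` is the NumPy helper that normalizes a scalar into a one-element array:

```python
import numpy as np

def sample_ints(low, high, size, random_state=None):
    """Sample integer indices, always returning an iterable."""
    rs = np.random.RandomState(random_state)
    out = rs.randint(low, high, size=size)
    # np.atleast_1d guards against any code path that yields a bare scalar,
    # which is what triggered the TypeError above
    return np.atleast_1d(out)

def sample_finrange(lower, upper, n, size, random_state=None):
    """Map sampled indices onto n evenly spaced values in [lower, upper],
    mirroring what a finrange-style domain does."""
    step = (upper - lower) / (n - 1)
    return [lower + i * step for i in sample_ints(0, n, size, random_state)]

print(sample_finrange(1.0, 2.0, 2, size=2))  # two values from {1.0, 2.0}
print(sample_finrange(1.0, 2.0, 2, size=1))  # one-element list, no TypeError
```

Normalizing the sampler's return type at the source keeps every downstream list comprehension working for `size=1` and `size>1` alike.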
(Apologies for creating multiple recent GitHub issues, this is the last one, I promise!)
I took the DataFrame from my experiment results and used Plotly's `plotly.express.parallel_categories` plot to visualize hyperparameter interactions, dropping any features that have only one unique value. The plot is interactive, and you can wrap it in a function that refreshes periodically when new data is available.
This has been super useful for me, so I thought it might be useful to others as well if it were added as a plotting capability to the library? Although I'd understand if it's not desirable to add another dependency. Just thought I'd share!
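For anyone who wants to try this, here is a minimal sketch of the data-preparation step plus the Plotly call. The column names and the `drop_constant_columns` helper are made up for illustration; only `plotly.express.parallel_categories` is the real API:

```python
import pandas as pd

def drop_constant_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Remove features with a single unique value; they add no
    information to a parallel-categories plot."""
    keep = [c for c in df.columns if df[c].nunique() > 1]
    return df[keep]

# Stand-in for a tuner's results DataFrame
results = pd.DataFrame({
    "lr": [1e-3, 1e-2, 1e-3, 1e-1],
    "batch_size": [32, 64, 32, 64],
    "optimizer": ["adam", "adam", "adam", "adam"],  # constant -> dropped
    "accuracy": [0.81, 0.85, 0.80, 0.78],
})
plot_df = drop_constant_columns(results)

try:
    import plotly.express as px
    # Coloring by the objective makes good/bad configurations stand out
    fig = px.parallel_categories(plot_df, color="accuracy")
    # fig.show() to display interactively
except ImportError:
    pass  # plotly not installed; plot_df is still usable with other backends
```

Wrapping the last few lines in a function that re-reads the results file on a timer gives the periodically refreshing view described above.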
Hi, with recent versions of matplotlib, `ExperimentResult.plot()` gives me the warning:
WARNING:matplotlib.legend:No artists with labels found to put in legend. Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
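The warning means that none of the plotted artists carried a `label`, so `legend()` finds nothing to show. A minimal sketch of the underlying matplotlib behaviour (this says nothing about `ExperimentResult.plot` internals, only how the warning arises and goes away):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for the example
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
# Plotting without label=... and then calling ax.legend() would emit the
# "No artists with labels found" warning; labelled lines avoid it
ax.plot([0, 1, 2], [0.70, 0.80, 0.85], label="trial 0")
ax.plot([0, 1, 2], [0.60, 0.82, 0.84], label="trial 1")
leg = ax.legend()
print(len(leg.get_texts()))  # 2 legend entries
```

Labels starting with an underscore (e.g. `label="_hidden"`) are deliberately skipped by `legend()`, which is the second half of the warning message.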