
amazon-sagemaker-script-mode's People

Contributors

ddworin-caa, dependabot[bot], frgud, jpeddicord, rabowskyb, swlkn, vdabravolski


amazon-sagemaker-script-mode's Issues

[Error] Notebook: sentiment-analysis.ipynb in SageMaker Studio with kernel Python3 (Tensorflow2 GPU Optimized)

Hello,
When I ran the code under the conditions below, I got an error.
I would appreciate any suggestions.

Thanks.

notebook:

Error:
./local_mode_setup.sh: line 7: rpm: command not found
./local_mode_setup.sh: line 10: yum: command not found
./local_mode_setup.sh: line 12: [: -eq: unary operator expected
./local_mode_setup.sh: line 20: sudo: command not found
./local_mode_setup.sh: line 21: sudo: command not found
./local_mode_setup.sh: line 23: sudo: command not found
./local_mode_setup.sh: line 25: sudo: command not found
(23) Failed writing body
./local_mode_setup.sh: line 26: sudo: command not found
./local_mode_setup.sh: line 27: sudo: command not found
./local_mode_setup.sh: line 28: sudo: command not found
installed nvidia-docker2
./local_mode_setup.sh: line 45: docker: command not found
./local_mode_setup.sh: line 47: docker: command not found
./local_mode_setup.sh: line 51: sudo: command not found
./local_mode_setup.sh: line 54: docker: command not found
./local_mode_setup.sh: line 55: ip: command not found
./local_mode_setup.sh: line 56: ip: command not found
./local_mode_setup.sh: line 59: sudo: command not found
./local_mode_setup.sh: line 60: sudo: command not found

[Error] Notebook: Sentiment Analysis with TensorFlow 2

Environment:
SageMaker notebook instance
Kernel: conda_tensorflow_p36

When I ran the following code cell, I got an error.

transformer = estimator.transformer(instance_count=1,
instance_type='ml.c5.xlarge')

transformer.transform(csvtest_s3, content_type='text/csv')
print('Waiting for transform job: ' + transformer.latest_transform_job.job_name)
transformer.wait()

What did I do wrong?


TypeError Traceback (most recent call last)
in ()
1 transformer = estimator.transformer(instance_count=1,
----> 2 instance_type='ml.c5.xlarge')
3
4 transformer.transform(csvtest_s3, content_type='text/csv')
5 print('Waiting for transform job: ' + transformer.latest_transform_job.job_name)

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/tensorflow/estimator.py in transformer(self, instance_count, instance_type, strategy, assemble_with, output_path, output_kms_key, accept, env, max_concurrent_transforms, max_payload, tags, role, model_server_workers, volume_kms_key, endpoint_type, entry_point, vpc_config_override, enable_network_isolation)
885 endpoint_type=endpoint_type,
886 entry_point=entry_point,
--> 887 enable_network_isolation=enable_network_isolation,
888 )
889

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/tensorflow/estimator.py in create_model(self, model_server_workers, role, vpc_config_override, endpoint_type, entry_point, source_dir, dependencies, **kwargs)
592 source_dir=source_dir,
593 dependencies=dependencies,
--> 594 **kwargs
595 )
596

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/tensorflow/estimator.py in _create_tfs_model(self, role, vpc_config_override, entry_point, source_dir, dependencies, **kwargs)
635 dependencies=dependencies,
636 enable_network_isolation=self.enable_network_isolation(),
--> 637 **kwargs
638 )
639

TypeError: type object got multiple values for keyword argument 'enable_network_isolation'
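The traceback suggests that in this version of the SDK, estimator.transformer() forwards enable_network_isolation into create_model() while create_model() also sets it internally, so the keyword arrives twice. As a hedged workaround (a sketch, not a confirmed fix), you can build the serving model yourself and ask it for the transformer, which avoids that code path; upgrading the sagemaker package is also worth trying, since this looks like an SDK-side bug:

```python
def transformer_via_model(estimator, instance_count=1, instance_type="ml.c5.xlarge"):
    """Workaround sketch: bypass estimator.transformer() by building the
    serving model first. create_model() and Model.transformer() are public
    SageMaker SDK methods; default model settings are assumed here."""
    model = estimator.create_model()
    return model.transformer(instance_count=instance_count,
                             instance_type=instance_type)
```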

How to change the value of SM_MODULE_DIR?

In the tf-eager-script-mode example, is there an argument that I can pass to some function to change the value of SM_MODULE_DIR?

The default value is s3://sagemaker-{aws-region}-{aws-id}/{training-job-name}/source/sourcedir.tar.gz which means it will always create a new bucket. I already have a bucket that I would like to save the compressed source code to. How can I modify SM_MODULE_DIR so that the source code will be saved to a specific bucket?
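SM_MODULE_DIR mirrors the module_dir that the SDK records after uploading sourcedir.tar.gz, so the way to change it is to tell the estimator where to upload the code. Framework estimators (including sagemaker.tensorflow.TensorFlow) accept a code_location argument for this. A hedged sketch of the relevant kwargs; "my-existing-bucket" and the "tf-eager" prefix are hypothetical names:

```python
# Hypothetical kwargs for sagemaker.tensorflow.TensorFlow(...):
# code_location is the S3 prefix under which the SDK uploads
# sourcedir.tar.gz, which in turn becomes SM_MODULE_DIR in the container.
estimator_kwargs = {
    "entry_point": "train.py",
    "code_location": "s3://my-existing-bucket/tf-eager",  # hypothetical bucket/prefix
}
```

With this, the compressed source should land under s3://my-existing-bucket/tf-eager/{training-job-name}/source/sourcedir.tar.gz instead of the default session bucket.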

Parameter server is not working: model not synced between different workers

I have a training script that follows the distributed-training-with-parameter-server approach,
but the models do not appear to be synced: every worker trains on its own data and never communicates with the others.
Using the ShardedByS3Key approach, I can verify that only the data on the master worker gets used.

So I see no benefit from using 10 or 20 workers; it seems each worker just trains on its own data with its own parameter server and saves its own model, and they never get synced.

Do I need to add anything to the training code to specify how gradients are synced?
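For reference, the estimator-side switch that launches the parameter servers looks like the snippet below. It only sets up the cluster (one parameter server per host, plus TF_CONFIG); the training script itself still has to use a distribution-aware API such as tf.estimator or a tf.distribute strategy, otherwise every worker trains independently, which matches the behaviour described above. This is a sketch, not a confirmed diagnosis of your script:

```python
# Passed to the sagemaker.tensorflow.TensorFlow estimator as
# distribution=...; it starts one parameter server per host but does NOT
# by itself make a plain Keras model.fit() loop synchronize gradients.
distribution = {"parameter_server": {"enabled": True}}
```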

"Error: argument "agent" is wrong: table id value is invalid" Error

Hi,

I am trying to run the tf-2-workflow/tf-2-workflow.ipynb notebook. I was able to run it until I reached the line !/bin/bash ./local_mode_setup.sh, at which point I get the following errors:

Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Get http://%2Fvar%2Frun%2Fdocker.sock/v1.40/networks: dial unix /var/run/docker.sock: connect: permission denied
Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Post http://%2Fvar%2Frun%2Fdocker.sock/v1.40/networks/create: dial unix /var/run/docker.sock: connect: permission denied
Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Get http://%2Fvar%2Frun%2Fdocker.sock/v1.40/networks: dial unix /var/run/docker.sock: connect: permission denied
Error: argument "agent" is wrong: table id value is invalid

I fixed my Docker permissions with this guide, so I am able to successfully run docker run hello-world. The issue seems to be with the following line in local_mode_setup.sh:

ROUTE_TABLE_PATCHED=`sudo ip route show table agent | grep -c $SAGEMAKER_INTERFACE`

Please help. Thanks!

aws-samples / amazon-sagemaker-script-mode / tf-2-workflow preprocessing.py bug

When running the preprocessing.py script as part of the sklearn_processor.run step of the notebook, the SageMaker Processing job fails with an "Algorithm" error. The CloudWatch logs show the following:

2021-07-09T17:12:14.418-04:00 | INPUT FILE List
2021-07-09T17:12:14.418-04:00 | ['/opt/ml/processing/input/x_test.npy', '/opt/ml/processing/input/y_test.npy']
2021-07-09T17:12:14.418-04:00 | SAVED TRANSFORMED TEST DATA FILE
2021-07-09T17:12:14.418-04:00
Traceback (most recent call last):
File "/opt/ml/processing/input/code/preprocessing.py", line 14, in
transformed = scaler.fit_transform(raw)
File "/miniconda3/lib/python3.7/site-packages/sklearn/base.py", line 462, in fit_transform
return self.fit(X, **fit_params).transform(X)
File "/miniconda3/lib/python3.7/site-packages/sklearn/preprocessing/data.py", line 617, in fit
return self.partial_fit(X, y)
File "/miniconda3/lib/python3.7/site-packages/sklearn/preprocessing/data.py", line 641, in partial_fit
force_all_finite='allow-nan')
File "/miniconda3/lib/python3.7/site-packages/sklearn/utils/validation.py", line 547, in check_array
"if it contains a single sample.".format(array))
2021-07-09T17:12:14.418-04:00
ValueError: Expected 2D array, got 1D array instead:
2021-07-09T17:12:14.418-04:00
array=[ 7.2 18.8 19. 27. 22.2 24.5 31.2 22.9 20.5 23.2 18.6 14.5 17.8 50.
20.8 24.3 24.2 19.8 19.1 22.7 12. 10.2 20. 18.5 20.9 23. 27.5 30.1
9.5 22. 21.2 14.1 33.1 23.4 20.1 7.4 15.4 23.8 20.1 24.5 33. 28.4
14.1 46.7 32.5 29.6 28.4 19.8 20.2 25. 35.4 20.3 9.7 14.5 34.9 26.6
7.2 50. 32.4 21.6 29.8 13.1 27.5 21.2 23.1 21.9 13. 23.2 8.1 5.6
21.7 29.6 19.6 7. 26.4 18.9 20.9 28.1 35.4 10.2 24.3 43.1 17.6 15.4
16.2 27.1 21.4 21.5 22.4 25. 16.6 18.6 22. 42.8 35.1 21.5 36. 21.9
24.1 50. 26.7 25. ].
2021-07-09T17:12:14.418-04:00
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

The array.reshape() recommendation in the error message is misleading and does not fix the underlying problem.
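The shape complaint itself is easy to reproduce outside SageMaker: StandardScaler wants a 2-D (n_samples, n_features) array, and the loop in the original script ends up handing it the 1-D y_test.npy target array. reshape(-1, 1) silences the error, but it would just scale the wrong array, which is why it does not really fix the notebook. A minimal repro, with made-up sample values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# 1-D target-style array like y_test.npy; values are illustrative only.
y = np.array([7.2, 18.8, 19.0, 27.0, 22.2])

scaler = StandardScaler()
# scaler.fit_transform(y) raises "Expected 2D array, got 1D array instead";
# reshaping to a single column satisfies the shape check.
scaled = scaler.fit_transform(y.reshape(-1, 1))
print(scaled.shape)  # (5, 1)
```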

I have found a working solution by changing the third cell (which generates the preprocessing.py script that throws the above error) to the following:

```python
%%writefile preprocessing.py

import glob
import numpy as np
import os
from sklearn.preprocessing import StandardScaler

if __name__ == '__main__':

    input_files = glob.glob('{}/*.npy'.format('/opt/ml/processing/input'))
    print('\nINPUT FILE LIST: \n{}\n'.format(input_files))
    scaler = StandardScaler()
    for file in input_files:
        if 'x_' in file:
            X = file
        elif 'y_' in file:
            y = file
    raw_x = np.load(X)
    raw_y = np.load(y)
    transformed = scaler.fit_transform(raw_x, raw_y)
    if 'train' in file:
        output_path = os.path.join('/opt/ml/processing/train', 'x_train.npy')
        np.save(output_path, transformed)
        print('SAVED TRANSFORMED TRAINING DATA FILE\n')
    else:
        output_path = os.path.join('/opt/ml/processing/test', 'x_test.npy')
        np.save(output_path, transformed)
        print('SAVED TRANSFORMED TEST DATA FILE\n')
```

This change allows fit_transform to build the transformed data set while being aware of the dimensionality of the y values. It removes the error in training and allows the model to be trained, deployed locally or on a hosted endpoint, and used to generate inferences.

I don't know if this is the best solution, but it's one that fixes the error with this notebook.

Error when processing sklearn_processor.run(...

In this repo:
https://github.com/aws-samples/amazon-sagemaker-script-mode/blob/master/tf-2-workflow/tf-2-workflow.ipynb

running sklearn_processor.run(...) leads to the following error:
miniconda3/lib/python3.7/site-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:47: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses

import imp
INPUT FILE LIST:
['/opt/ml/processing/input/x_test.npy']
Traceback (most recent call last):
File "/opt/ml/processing/input/code/preprocessing1.py", line 18, in
raw_y = np.load(y)
NameError: name 'y' is not defined
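The NameError happens because only x_test.npy was present in the input, so the loop never binds y. A defensive sketch (the helper name split_xy is hypothetical) that fails fast with a clear message instead:

```python
import os

def split_xy(input_files):
    """Hypothetical helper: partition *.npy inputs into x/y files and fail
    fast when either is missing, instead of hitting a NameError later."""
    x = [f for f in input_files if os.path.basename(f).startswith("x_")]
    y = [f for f in input_files if os.path.basename(f).startswith("y_")]
    if not x or not y:
        raise FileNotFoundError(
            "expected both x_*.npy and y_*.npy inputs, got: {}".format(input_files))
    return x[0], y[0]
```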

TF2 Workflow Example fails in training

Notebook: https://github.com/aws-samples/amazon-sagemaker-script-mode/blob/master/tf-2-workflow/tf-2-workflow.ipynb

The notebook for the TF2 end-to-end workflow fails during training with the error KeyError: 'callable_inputs'. Using an Amazon SageMaker t3.large instance, kernel conda_tensorflow2_p36.


Important section of the log
Traceback (most recent call last):
  File "train.py", line 75, in 
    model.save(args.model_dir + '/1')
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/engine/network.py", line 1059, in save
    signatures, options)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/save.py", line 138, in save_model
    signatures, options)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/save.py", line 78, in save
    save_lib.save(model, filepath, signatures, options)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/saved_model/save.py", line 951, in save
    obj, export_dir, signatures, options, meta_graph_def)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/saved_model/save.py", line 1008, in _build_meta_graph
    checkpoint_graph_view)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/saved_model/signature_serialization.py", line 75, in find_function_to_export
    functions = saveable_view.list_functions(saveable_view.root)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/saved_model/save.py", line 143, in list_functions
    self._serialization_cache)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 1701, in _list_functions_for_serialization
    Model, self)._list_functions_for_serialization(serialization_cache)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 2764, in _list_functions_for_serialization
    .list_functions_for_serialization(serialization_cache))
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/base_serialization.py", line 87, in list_functions_for_serialization
    fns = self.functions_to_serialize(serialization_cache)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/layer_serialization.py", line 77, in functions_to_serialize
    serialization_cache).functions_to_serialize)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/layer_serialization.py", line 92, in _get_serialized_attributes
    serialization_cache)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/model_serialization.py", line 53, in _get_serialized_attributes_internal
    serialization_cache))
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/layer_serialization.py", line 101, in _get_serialized_attributes_internal
    functions = save_impl.wrap_layer_functions(self.obj, serialization_cache)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/save_impl.py", line 163, in wrap_layer_functions
    '{}_layer_call_and_return_conditional_losses'.format(layer.name))
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/save_impl.py", line 503, in add_function
    self.add_trace(*self._input_signature)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/save_impl.py", line 418, in add_trace
    trace_with_training(True)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/save_impl.py", line 416, in trace_with_training
    fn.get_concrete_function(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/save_impl.py", line 547, in get_concrete_function
    return super(LayerCall, self).get_concrete_function(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 959, in get_concrete_function
    concrete = self._get_concrete_function_garbage_collected(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 865, in _get_concrete_function_garbage_collected
    self._initialize(args, kwargs, add_initializers_to=initializers)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 506, in _initialize
    *args, **kwds))
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 2446, in _get_concrete_function_internal_garbage_collected
    graph_function, _, _ = self._maybe_define_function(args, kwargs)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 2777, in _maybe_define_function
    graph_function = self._create_graph_function(args, kwargs)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 2667, in _create_graph_function
    capture_by_value=self._capture_by_value),
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/framework/func_graph.py", line 981, in func_graph_from_py_func
    func_outputs = python_func(*func_args, **func_kwargs)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 441, in wrapped_fn
    return weak_wrapped_fn().__wrapped__(*args, **kwds)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/save_impl.py", line 524, in wrapper
    ret = method(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/utils.py", line 170, in wrap_with_training_arg
    lambda: replace_training_and_call(False))
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/utils/tf_utils.py", line 65, in smart_cond
    pred, true_fn=true_fn, false_fn=false_fn, name=name)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/framework/smart_cond.py", line 54, in smart_cond
    return true_fn()
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/utils.py", line 169, in 
    lambda: replace_training_and_call(True),
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/utils.py", line 165, in replace_training_and_call
    return wrapped_call(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/save_impl.py", line 566, in call_and_return_conditional_losses
    return layer_call(inputs, *args, **kwargs), layer.get_losses_for(inputs)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/engine/network.py", line 721, in call
    convert_kwargs_to_constants=base_layer_utils.call_context().saving)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/engine/network.py", line 891, in _run_internal_graph
    output_tensors = layer(computed_tensors, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 929, in __call__
    outputs = call_fn(cast_inputs, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/utils.py", line 71, in return_outputs_and_add_losses
    outputs, losses = fn(inputs, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/save_impl.py", line 541, in __call__
    self.call_collection.add_trace(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/save_impl.py", line 421, in add_trace
    fn.get_concrete_function(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/save_impl.py", line 547, in get_concrete_function
    return super(LayerCall, self).get_concrete_function(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 959, in get_concrete_function
    concrete = self._get_concrete_function_garbage_collected(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 865, in _get_concrete_function_garbage_collected
    self._initialize(args, kwargs, add_initializers_to=initializers)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 506, in _initialize
    *args, **kwds))
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 2446, in _get_concrete_function_internal_garbage_collected
    graph_function, _, _ = self._maybe_define_function(args, kwargs)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 2777, in _maybe_define_function
    graph_function = self._create_graph_function(args, kwargs)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 2667, in _create_graph_function
    capture_by_value=self._capture_by_value),
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/framework/func_graph.py", line 981, in func_graph_from_py_func
    func_outputs = python_func(*func_args, **func_kwargs)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 441, in wrapped_fn
    return weak_wrapped_fn().__wrapped__(*args, **kwds)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/save_impl.py", line 513, in wrapper
    inputs = call_collection.get_input_arg_value(args, kwargs)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/save_impl.py", line 453, in get_input_arg_value
    self._input_arg_name, args, kwargs, inputs_in_args=True)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 2305, in _get_call_arg_value
    return args_dict[arg_name]
KeyError: 'callable_inputs'
2020-08-11 09:58:37,280 sagemaker-training-toolkit ERROR    ExecuteUserScriptError:
Command "/usr/local/bin/python3.7 train.py --batch_size 128 --epochs 30 --learning_rate 0.01 --model_dir /opt/ml/model"
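The log shows the smdebug hook being created right before the failure, and in some TF 2.x / smdebug combinations wrapping Keras layers makes model.save() fail with KeyError: 'callable_inputs'. One hedged thing to try is disabling the Debugger hook on the estimator; whether it resolves this particular failure is an assumption, not a confirmed fix:

```python
# Passed to the sagemaker.tensorflow.TensorFlow estimator; setting
# debugger_hook_config=False disables the SageMaker Debugger (smdebug)
# hook that the log shows being attached during training.
estimator_kwargs = {"debugger_hook_config": False}
```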
Full log:
2020-08-11 09:56:12 Starting - Starting the training job...
2020-08-11 09:56:14 Starting - Launching requested ML instances......
2020-08-11 09:57:20 Starting - Preparing the instances for training...
2020-08-11 09:57:57 Downloading - Downloading input data...
2020-08-11 09:58:39 Training - Training image download completed. Training in progress.
2020-08-11 09:58:39 Uploading - Uploading generated training model2020-08-11 09:58:32,569 sagemaker-training-toolkit INFO     Imported framework sagemaker_tensorflow_container.training
2020-08-11 09:58:32,576 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2020-08-11 09:58:32,833 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2020-08-11 09:58:32,850 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2020-08-11 09:58:32,865 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2020-08-11 09:58:32,875 sagemaker-training-toolkit INFO     Invoking user script

Training Env:

{
"additional_framework_parameters": {},
"channel_input_dirs": {
"test": "/opt/ml/input/data/test",
"train": "/opt/ml/input/data/train"
},
"current_host": "algo-1",
"framework_module": "sagemaker_tensorflow_container.training:main",
"hosts": [
"algo-1"
],
"hyperparameters": {
"batch_size": 128,
"model_dir": "/opt/ml/model",
"epochs": 30,
"learning_rate": 0.01
},
"input_config_dir": "/opt/ml/input/config",
"input_data_config": {
"test": {
"TrainingInputMode": "File",
"S3DistributionType": "FullyReplicated",
"RecordWrapperType": "None"
},
"train": {
"TrainingInputMode": "File",
"S3DistributionType": "FullyReplicated",
"RecordWrapperType": "None"
}
},
"input_dir": "/opt/ml/input",
"is_master": true,
"job_name": "tf-2-workflow-2020-08-11-09-56-09-519",
"log_level": 20,
"master_hostname": "algo-1",
"model_dir": "/opt/ml/model",
"module_dir": "s3://sagemaker-eu-west-1-859755744029/tf-2-workflow-2020-08-11-09-56-09-519/source/sourcedir.tar.gz",
"module_name": "train",
"network_interface_name": "eth0",
"num_cpus": 4,
"num_gpus": 0,
"output_data_dir": "/opt/ml/output/data",
"output_dir": "/opt/ml/output",
"output_intermediate_dir": "/opt/ml/output/intermediate",
"resource_config": {
"current_host": "algo-1",
"hosts": [
"algo-1"
],
"network_interface_name": "eth0"
},
"user_entry_point": "train.py"
}

Environment variables:

SM_HOSTS=["algo-1"]
SM_NETWORK_INTERFACE_NAME=eth0
SM_HPS={"batch_size":128,"epochs":30,"learning_rate":0.01,"model_dir":"/opt/ml/model"}
SM_USER_ENTRY_POINT=train.py
SM_FRAMEWORK_PARAMS={}
SM_RESOURCE_CONFIG={"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"}
SM_INPUT_DATA_CONFIG={"test":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"},"train":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}}
SM_OUTPUT_DATA_DIR=/opt/ml/output/data
SM_CHANNELS=["test","train"]
SM_CURRENT_HOST=algo-1
SM_MODULE_NAME=train
SM_LOG_LEVEL=20
SM_FRAMEWORK_MODULE=sagemaker_tensorflow_container.training:main
SM_INPUT_DIR=/opt/ml/input
SM_INPUT_CONFIG_DIR=/opt/ml/input/config
SM_OUTPUT_DIR=/opt/ml/output
SM_NUM_CPUS=4
SM_NUM_GPUS=0
SM_MODEL_DIR=/opt/ml/model
SM_MODULE_DIR=s3://sagemaker-eu-west-1-859755744029/tf-2-workflow-2020-08-11-09-56-09-519/source/sourcedir.tar.gz
SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{"test":"/opt/ml/input/data/test","train":"/opt/ml/input/data/train"},"current_host":"algo-1","framework_module":"sagemaker_tensorflow_container.training:main","hosts":["algo-1"],"hyperparameters":{"batch_size":128,"epochs":30,"learning_rate":0.01,"model_dir":"/opt/ml/model"},"input_config_dir":"/opt/ml/input/config","input_data_config":{"test":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"},"train":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}},"input_dir":"/opt/ml/input","is_master":true,"job_name":"tf-2-workflow-2020-08-11-09-56-09-519","log_level":20,"master_hostname":"algo-1","model_dir":"/opt/ml/model","module_dir":"s3://sagemaker-eu-west-1-859755744029/tf-2-workflow-2020-08-11-09-56-09-519/source/sourcedir.tar.gz","module_name":"train","network_interface_name":"eth0","num_cpus":4,"num_gpus":0,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"},"user_entry_point":"train.py"}
SM_USER_ARGS=["--batch_size","128","--epochs","30","--learning_rate","0.01","--model_dir","/opt/ml/model"]
SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
SM_CHANNEL_TEST=/opt/ml/input/data/test
SM_CHANNEL_TRAIN=/opt/ml/input/data/train
SM_HP_BATCH_SIZE=128
SM_HP_MODEL_DIR=/opt/ml/model
SM_HP_EPOCHS=30
SM_HP_LEARNING_RATE=0.01
PYTHONPATH=/opt/ml/code:/usr/local/bin:/usr/local/lib/python37.zip:/usr/local/lib/python3.7:/usr/local/lib/python3.7/lib-dynload:/usr/local/lib/python3.7/site-packages

Invoking script with the following command:

/usr/local/bin/python3.7 train.py --batch_size 128 --epochs 30 --learning_rate 0.01 --model_dir /opt/ml/model

x train (404, 13) y train (404,)
x test (102, 13) y test (102,)
/cpu:0
batch_size = 128, epochs = 30, learning rate = 0.01
[2020-08-11 09:58:34.837 ip-10-0-227-174.eu-west-1.compute.internal:22 INFO json_config.py:90] Creating hook from json_config at /opt/ml/input/config/debughookconfig.json.
[2020-08-11 09:58:34.838 ip-10-0-227-174.eu-west-1.compute.internal:22 INFO hook.py:192] tensorboard_dir has not been set for the hook. SMDebug will not be exporting tensorboard summaries.
[2020-08-11 09:58:34.838 ip-10-0-227-174.eu-west-1.compute.internal:22 INFO hook.py:237] Saving to /opt/ml/output/tensors
[2020-08-11 09:58:34.838 ip-10-0-227-174.eu-west-1.compute.internal:22 INFO state_store.py:67] The checkpoint config file /opt/ml/input/config/checkpointconfig.json does not exist.
Epoch 1/30
[2020-08-11 09:58:34.948 ip-10-0-227-174.eu-west-1.compute.internal:22 INFO hook.py:382] Monitoring the collections: sm_metrics, losses, metrics
4/4 [==============================] - 0s 56ms/step - loss: 522.4297 - val_loss: 395.8727 - batch: 0.0000e+00
Epoch 2/30
4/4 [==============================] - 0s 9ms/step - loss: 327.1166 - val_loss: 231.9524 - batch: 1.0000
Epoch 3/30
4/4 [==============================] - 0s 9ms/step - loss: 186.0386 - val_loss: 131.3616 - batch: 2.0000
Epoch 4/30
4/4 [==============================] - 0s 10ms/step - loss: 108.9488 - val_loss: 86.0883 - batch: 3.0000
Epoch 5/30
4/4 [==============================] - 0s 9ms/step - loss: 77.4079 - val_loss: 68.1158 - batch: 4.0000
Epoch 6/30
4/4 [==============================] - 0s 10ms/step - loss: 64.5651 - val_loss: 55.5730 - batch: 5.0000
Epoch 7/30
4/4 [==============================] - 0s 11ms/step - loss: 56.4356 - val_loss: 47.1502 - batch: 6.0000
Epoch 8/30
4/4 [==============================] - 0s 9ms/step - loss: 50.3346 - val_loss: 44.9798 - batch: 7.0000
Epoch 9/30
4/4 [==============================] - 0s 10ms/step - loss: 48.4684 - val_loss: 42.5844 - batch: 8.0000
Epoch 10/30
4/4 [==============================] - 0s 9ms/step - loss: 46.1197 - val_loss: 40.3272 - batch: 9.0000
Epoch 11/30
4/4 [==============================] - 0s 9ms/step - loss: 44.0385 - val_loss: 39.4977 - batch: 10.0000
Epoch 12/30
4/4 [==============================] - 0s 9ms/step - loss: 43.3905 - val_loss: 38.6036 - batch: 11.0000
Epoch 13/30
4/4 [==============================] - 0s 11ms/step - loss: 40.0422 - val_loss: 36.2050 - batch: 12.0000
Epoch 14/30
4/4 [==============================] - 0s 9ms/step - loss: 38.1127 - val_loss: 35.1686 - batch: 13.0000
Epoch 15/30
4/4 [==============================] - 0s 9ms/step - loss: 36.3987 - val_loss: 34.2136 - batch: 14.0000
Epoch 16/30
4/4 [==============================] - 0s 10ms/step - loss: 37.1170 - val_loss: 32.2987 - batch: 15.0000
Epoch 17/30
#0151/4 [======>.......................] - ETA: 0s - loss: 37.5749#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#0154/4 [==============================] - 0s 11ms/step - loss: 33.2491 - val_loss: 31.5382 - batch: 16.0000
Epoch 18/30
#0151/4 [======>.......................] - ETA: 0s - loss: 22.2220#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#0154/4 [==============================] - 0s 9ms/step - loss: 32.1300 - val_loss: 30.8262 - batch: 17.0000
Epoch 19/30
#0151/4 [======>.......................] - ETA: 0s - loss: 25.8472#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#0154/4 [==============================] - 0s 8ms/step - loss: 30.1447 - val_loss: 29.7609 - batch: 18.0000
Epoch 20/30
#0151/4 [======>.......................] - ETA: 0s - loss: 37.5497#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#0154/4 [==============================] - 0s 10ms/step - loss: 28.9837 - val_loss: 29.3724 - batch: 19.0000
Epoch 21/30
#0151/4 [======>.......................] - ETA: 0s - loss: 28.8834#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#0154/4 [==============================] - 0s 9ms/step - loss: 28.4000 - val_loss: 29.8117 - batch: 20.0000
Epoch 22/30
#0151/4 [======>.......................] - ETA: 0s - loss: 29.1055#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#0154/4 [==============================] - 0s 9ms/step - loss: 27.5027 - val_loss: 27.7952 - batch: 21.0000
Epoch 23/30
#0151/4 [======>.......................] - ETA: 0s - loss: 31.1734#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#0154/4 [==============================] - 0s 9ms/step - loss: 26.5844 - val_loss: 28.7902 - batch: 22.0000
Epoch 24/30
#0151/4 [======>.......................] - ETA: 0s - loss: 27.4608#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#0154/4 [==============================] - 0s 9ms/step - loss: 25.0937 - val_loss: 27.6794 - batch: 23.0000
Epoch 25/30
#0151/4 [======>.......................] - ETA: 0s - loss: 17.5128#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#0154/4 [==============================] - 0s 10ms/step - loss: 23.8630 - val_loss: 26.5982 - batch: 24.0000
Epoch 26/30
#0151/4 [======>.......................] - ETA: 0s - loss: 18.9573#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#0154/4 [==============================] - 0s 11ms/step - loss: 22.8002 - val_loss: 24.9170 - batch: 25.0000
Epoch 27/30
#0151/4 [======>.......................] - ETA: 0s - loss: 23.8510#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#0154/4 [==============================] - 0s 8ms/step - loss: 22.2273 - val_loss: 26.5363 - batch: 26.0000
Epoch 28/30
#0151/4 [======>.......................] - ETA: 0s - loss: 14.1942#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#0154/4 [==============================] - 0s 8ms/step - loss: 21.6918 - val_loss: 24.9937 - batch: 27.0000
Epoch 29/30
#0151/4 [======>.......................] - ETA: 0s - loss: 22.5802#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#0154/4 [==============================] - 0s 9ms/step - loss: 20.2896 - val_loss: 24.1509 - batch: 28.0000
Epoch 30/30
#0151/4 [======>.......................] - ETA: 0s - loss: 22.4525#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#0154/4 [==============================] - 0s 9ms/step - loss: 19.5049 - val_loss: 26.8152 - batch: 29.0000
1/1 - 0s - loss: 26.8152

Test MSE : 26.815200805664062
Traceback (most recent call last):
File "train.py", line 75, in <module>
model.save(args.model_dir + '/1')
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/engine/network.py", line 1059, in save
signatures, options)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/save.py", line 138, in save_model
signatures, options)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/save.py", line 78, in save
save_lib.save(model, filepath, signatures, options)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/saved_model/save.py", line 951, in save
obj, export_dir, signatures, options, meta_graph_def)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/saved_model/save.py", line 1008, in _build_meta_graph
checkpoint_graph_view)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/saved_model/signature_serialization.py", line 75, in find_function_to_export
functions = saveable_view.list_functions(saveable_view.root)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/saved_model/save.py", line 143, in list_functions
self._serialization_cache)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 1701, in _list_functions_for_serialization
Model, self)._list_functions_for_serialization(serialization_cache)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 2764, in _list_functions_for_serialization
.list_functions_for_serialization(serialization_cache))
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/base_serialization.py", line 87, in list_functions_for_serialization
fns = self.functions_to_serialize(serialization_cache)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/layer_serialization.py", line 77, in functions_to_serialize
serialization_cache).functions_to_serialize)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/layer_serialization.py", line 92, in _get_serialized_attributes
serialization_cache)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/model_serialization.py", line 53, in _get_serialized_attributes_internal
serialization_cache))
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/layer_serialization.py", line 101, in _get_serialized_attributes_internal
functions = save_impl.wrap_layer_functions(self.obj, serialization_cache)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/save_impl.py", line 163, in wrap_layer_functions
'{}_layer_call_and_return_conditional_losses'.format(layer.name))
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/save_impl.py", line 503, in add_function
self.add_trace(*self._input_signature)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/save_impl.py", line 418, in add_trace
trace_with_training(True)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/save_impl.py", line 416, in trace_with_training
fn.get_concrete_function(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/save_impl.py", line 547, in get_concrete_function
return super(LayerCall, self).get_concrete_function(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 959, in get_concrete_function
concrete = self._get_concrete_function_garbage_collected(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 865, in _get_concrete_function_garbage_collected
self._initialize(args, kwargs, add_initializers_to=initializers)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 506, in _initialize
*args, **kwds))
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 2446, in _get_concrete_function_internal_garbage_collected
graph_function, _, _ = self._maybe_define_function(args, kwargs)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 2777, in _maybe_define_function
graph_function = self._create_graph_function(args, kwargs)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 2667, in _create_graph_function
capture_by_value=self._capture_by_value),
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/framework/func_graph.py", line 981, in func_graph_from_py_func
func_outputs = python_func(*func_args, **func_kwargs)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 441, in wrapped_fn
return weak_wrapped_fn().wrapped(*args, **kwds)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/save_impl.py", line 524, in wrapper
ret = method(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/utils.py", line 170, in wrap_with_training_arg
lambda: replace_training_and_call(False))
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/utils/tf_utils.py", line 65, in smart_cond
pred, true_fn=true_fn, false_fn=false_fn, name=name)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/framework/smart_cond.py", line 54, in smart_cond
return true_fn()
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/utils.py", line 169, in
lambda: replace_training_and_call(True),
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/utils.py", line 165, in replace_training_and_call
return wrapped_call(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/save_impl.py", line 566, in call_and_return_conditional_losses
return layer_call(inputs, *args, **kwargs), layer.get_losses_for(inputs)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/engine/network.py", line 721, in call
convert_kwargs_to_constants=base_layer_utils.call_context().saving)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/engine/network.py", line 891, in _run_internal_graph
output_tensors = layer(computed_tensors, **kwargs)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 929, in call
outputs = call_fn(cast_inputs, *args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/utils.py", line 71, in return_outputs_and_add_losses
outputs, losses = fn(inputs, *args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/save_impl.py", line 541, in call
self.call_collection.add_trace(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/save_impl.py", line 421, in add_trace
fn.get_concrete_function(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/save_impl.py", line 547, in get_concrete_function
return super(LayerCall, self).get_concrete_function(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 959, in get_concrete_function
concrete = self._get_concrete_function_garbage_collected(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 865, in _get_concrete_function_garbage_collected
self._initialize(args, kwargs, add_initializers_to=initializers)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 506, in _initialize
*args, **kwds))
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 2446, in _get_concrete_function_internal_garbage_collected
graph_function, _, _ = self._maybe_define_function(args, kwargs)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 2777, in _maybe_define_function
graph_function = self._create_graph_function(args, kwargs)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 2667, in _create_graph_function
capture_by_value=self._capture_by_value),
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/framework/func_graph.py", line 981, in func_graph_from_py_func
func_outputs = python_func(*func_args, **func_kwargs)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 441, in wrapped_fn
return weak_wrapped_fn().wrapped(*args, **kwds)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/save_impl.py", line 513, in wrapper
inputs = call_collection.get_input_arg_value(args, kwargs)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/save_impl.py", line 453, in get_input_arg_value
self._input_arg_name, args, kwargs, inputs_in_args=True)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 2305, in _get_call_arg_value
return args_dict[arg_name]
KeyError: 'callable_inputs'
2020-08-11 09:58:37,280 sagemaker-training-toolkit ERROR ExecuteUserScriptError:
Command "/usr/local/bin/python3.7 train.py --batch_size 128 --epochs 30 --learning_rate 0.01 --model_dir /opt/ml/model"

What is the correct value of model_dir for the TensorFlow estimator in script mode?

In the tf-eager-sm-scriptmode.ipynb example, model_dir is defined as /opt/ml/model:

image

But the documentation for TensorFlow says that model_dir should be an S3 location:

image

Is the documentation outdated? It looks like model_dir in the documentation should be changed to output_path.

Also, based on reading other documentation, I get the impression that if my training script generates the usual SavedModel files plus other files like evaluation metrics, I must write the SavedModel files to /opt/ml/model and the evaluation metrics file to /opt/ml/output. If I were to write both sets of files to /opt/ml/model, then the output would not be serveable. Is this correct, and if yes, how do I get SageMaker to export the files written to /opt/ml/output to an S3 bucket?
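A minimal sketch of how the two output locations can be kept separate in a script-mode entry point, assuming the standard SM_MODEL_DIR and SM_OUTPUT_DATA_DIR environment variables that SageMaker sets inside the training container (for local runs the sketch falls back to ./model and ./output):

```python
import argparse
import json
import os

parser = argparse.ArgumentParser()
# Inside the container SM_MODEL_DIR is /opt/ml/model: its contents are
# uploaded as model.tar.gz and used for serving.
parser.add_argument("--model_dir", type=str,
                    default=os.environ.get("SM_MODEL_DIR", "model"))
# SM_OUTPUT_DATA_DIR is /opt/ml/output/data: its contents are uploaded
# as output.tar.gz but are NOT served.
parser.add_argument("--output_data_dir", type=str,
                    default=os.environ.get("SM_OUTPUT_DATA_DIR", "output"))
args, _ = parser.parse_known_args()

# model.save(args.model_dir + "/1")   # SavedModel for TF Serving goes here

# Auxiliary artifacts such as evaluation metrics go to the output data dir.
metrics = {"test_mse": 26.815}  # example value only
os.makedirs(args.output_data_dir, exist_ok=True)
with open(os.path.join(args.output_data_dir, "metrics.json"), "w") as f:
    json.dump(metrics, f)
```

With this split, the model artifact stays serveable while the metrics still land in S3 alongside the job's output.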

Error "No module named 'transformers'" for deploy-pretrained-model/BERT/Deploy_BERT.ipynb

Hi

I read your Deploy BERT notebook and followed the steps (with changes to use DistilBERT rather than BERT) and created an endpoint. I can see from the dashboard that it has InService status.

However, when I tried to invoke the endpoint using your example code,

import boto3

sm = boto3.client('sagemaker-runtime')

prompt = "The best part of Amazon SageMaker is that it makes machine learning easy."

response = sm.invoke_endpoint(EndpointName=endpoint_name,
                              Body=prompt.encode(encoding='UTF-8'),
                              ContentType='text/csv')

response['Body'].read()

I get an error message of "No module named 'transformers'".

ModelError Traceback (most recent call last)
in
6 response = sm.invoke_endpoint(EndpointName=endpoint_name,
7 Body=prompt.encode(encoding='UTF-8'),
----> 8 ContentType='text/csv')
9
10 response['Body'].read()

/opt/conda/lib/python3.7/site-packages/botocore/client.py in _api_call(self, *args, **kwargs)
355 "%s() only accepts keyword arguments." % py_operation_name)
356 # The "self" in this scope is referring to the BaseClient.
--> 357 return self._make_api_call(operation_name, kwargs)
358
359 _api_call.__name__ = str(py_operation_name)

/opt/conda/lib/python3.7/site-packages/botocore/client.py in _make_api_call(self, operation_name, api_params)
674 error_code = parsed_response.get("Error", {}).get("Code")
675 error_class = self.exceptions.from_code(error_code)
--> 676 raise error_class(parsed_response, operation_name)
677 else:
678 return parsed_response

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (500) from model with message "No module named 'transformers'
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/site-packages/sagemaker_inference/transformer.py", line 110, in transform
self.validate_and_initialize(model_dir=model_dir)
File "/opt/conda/lib/python3.6/site-packages/sagemaker_inference/transformer.py", line 157, in validate_and_initialize
self._validate_user_module_and_set_functions()
File "/opt/conda/lib/python3.6/site-packages/sagemaker_inference/transformer.py", line 170, in _validate_user_module_and_set_functions
user_module = importlib.import_module(user_module_name)
File "/opt/conda/lib/python3.6/importlib/__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 994, in _gcd_import
File "<frozen importlib._bootstrap>", line 971, in _find_and_load
File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 678, in exec_module
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "/opt/ml/model/code/deploy_distilbert.py", line 12, in <module>
from transformers import DistilBertTokenizer, DistilBertForTokenClassification
ModuleNotFoundError: No module named 'transformers'
". See https://eu-west-2.console.aws.amazon.com/cloudwatch/home?region=eu-west-2#logEventViewer:group=/aws/sagemaker/Endpoints/distilbert-base-cased in account 645532039181 for more information.

I am having terrible time of understanding how to use pre-trained model in sagemaker and would really appreciate if you could help.

In the log, it says:

Python executable: /opt/conda/bin/python

And I checked that Python, and it does have the transformers module installed:

$ /opt/conda/bin/python
Python 3.6.13 | packaged by conda-forge | (default, Feb 19 2021, 05:36:01)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import transformers
>>> quit()

My requirements.txt has:

transformers>=4.0.0
tensorflow==1.13.1
torch==1.5.0

When I run this cell (I added requirements.txt into the code directory):

import tarfile

zipped_model_path = os.path.join(model_path, "model.tar.gz")

with tarfile.open(zipped_model_path, "w:gz") as tar:
    tar.add(model_path)
    tar.add(code_path)

Would you please tell me what's going wrong?

Thanks
Teresa
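The exact cause here may differ, but a frequent pitfall with tar.add(path) in cells like the one above is that it stores the full path prefix inside the archive, so requirements.txt ends up at e.g. some/local/path/code/requirements.txt rather than at code/requirements.txt, where the inference container looks for it before installing dependencies. A hedged sketch of building the archive with explicit arcname values (the local directory names are hypothetical):

```python
import os
import tarfile

model_path = "model"  # hypothetical local dir with the model artifacts
code_path = "code"    # hypothetical local dir with inference script + requirements.txt

# Set up the directories so this sketch is runnable on its own.
os.makedirs(model_path, exist_ok=True)
os.makedirs(code_path, exist_ok=True)
with open(os.path.join(code_path, "requirements.txt"), "w") as f:
    f.write("transformers>=4.0.0\n")

# arcname controls the path INSIDE the archive, independent of the local path.
with tarfile.open("model.tar.gz", "w:gz") as tar:
    tar.add(model_path, arcname=".")     # model artifacts at the archive root
    tar.add(code_path, arcname="code")   # code/ (script + requirements.txt) at the root

# Inspect the resulting layout.
with tarfile.open("model.tar.gz") as tar:
    names = tar.getnames()
```

After building, `names` should contain "code/requirements.txt" at the archive root, which is the layout the serving toolkit expects.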

Replace global reference with a local variable in example script

There is a function in one of the SageMaker's examples that saves Keras model onto disk:

def save_model(model, output):
    tf.contrib.saved_model.save_keras_model(model, args.model_dir)
    logging.info("Model successfully saved at: {}".format(output))
    return

But it seems that the function doesn't use the output parameter and refers to the global args.model_dir instead.

Do you think this function should be replaced with something like this?

def save_model(model, output):
    tf.contrib.saved_model.save_keras_model(model, output)
    logging.info("Model successfully saved at: {}".format(output))
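Reduced to a TensorFlow-free sketch, the difference between the two versions becomes observable as soon as a caller passes a path other than the global one (the names below are stand-ins, not the real script's API):

```python
saved_to = []                       # records where each call "saves" the model

args_model_dir = "/opt/ml/model"    # stands in for the global args.model_dir

def save_model_buggy(model, output):
    # Bug: ignores the `output` parameter and reads the global instead.
    saved_to.append(args_model_dir)

def save_model_fixed(model, output):
    # Fix: uses the parameter, as the proposed change does.
    saved_to.append(output)

save_model_buggy(None, "/tmp/other")
save_model_fixed(None, "/tmp/other")
```

The buggy version silently saves to /opt/ml/model no matter what the caller asked for, while the fixed version honors the argument.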
