
amazon-sagemaker-script-mode's People

Contributors

ddworin-caa, dependabot[bot], frgud, jpeddicord, rabowskyb, swlkn, vdabravolski


amazon-sagemaker-script-mode's Issues

[Error] Notebook: sentiment-analysis.ipynb in SageMaker Studio with kernel Python3 (Tensorflow2 GPU Optimized)

Hello,
When I ran the code under the conditions below, I got an error.
I would appreciate any suggestions.

Thanks.

notebook:

Error:
./local_mode_setup.sh: line 7: rpm: command not found
./local_mode_setup.sh: line 10: yum: command not found
./local_mode_setup.sh: line 12: [: -eq: unary operator expected
./local_mode_setup.sh: line 20: sudo: command not found
./local_mode_setup.sh: line 21: sudo: command not found
./local_mode_setup.sh: line 23: sudo: command not found
./local_mode_setup.sh: line 25: sudo: command not found
(23) Failed writing body
./local_mode_setup.sh: line 26: sudo: command not found
./local_mode_setup.sh: line 27: sudo: command not found
./local_mode_setup.sh: line 28: sudo: command not found
installed nvidia-docker2
./local_mode_setup.sh: line 45: docker: command not found
./local_mode_setup.sh: line 47: docker: command not found
./local_mode_setup.sh: line 51: sudo: command not found
./local_mode_setup.sh: line 54: docker: command not found
./local_mode_setup.sh: line 55: ip: command not found
./local_mode_setup.sh: line 56: ip: command not found
./local_mode_setup.sh: line 59: sudo: command not found
./local_mode_setup.sh: line 60: sudo: command not found

[Error] Notebook: Sentiment Analysis with TensorFlow 2

Environment:
SageMaker notebook instance
Kernel: conda_tensorflow_p36

When I ran the following code cell, I got an error.

transformer = estimator.transformer(instance_count=1,
instance_type='ml.c5.xlarge')

transformer.transform(csvtest_s3, content_type='text/csv')
print('Waiting for transform job: ' + transformer.latest_transform_job.job_name)
transformer.wait()

What did I do wrong?


TypeError Traceback (most recent call last)
in ()
1 transformer = estimator.transformer(instance_count=1,
----> 2 instance_type='ml.c5.xlarge')
3
4 transformer.transform(csvtest_s3, content_type='text/csv')
5 print('Waiting for transform job: ' + transformer.latest_transform_job.job_name)

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/tensorflow/estimator.py in transformer(self, instance_count, instance_type, strategy, assemble_with, output_path, output_kms_key, accept, env, max_concurrent_transforms, max_payload, tags, role, model_server_workers, volume_kms_key, endpoint_type, entry_point, vpc_config_override, enable_network_isolation)
885 endpoint_type=endpoint_type,
886 entry_point=entry_point,
--> 887 enable_network_isolation=enable_network_isolation,
888 )
889

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/tensorflow/estimator.py in create_model(self, model_server_workers, role, vpc_config_override, endpoint_type, entry_point, source_dir, dependencies, **kwargs)
592 source_dir=source_dir,
593 dependencies=dependencies,
--> 594 **kwargs
595 )
596

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/tensorflow/estimator.py in _create_tfs_model(self, role, vpc_config_override, entry_point, source_dir, dependencies, **kwargs)
635 dependencies=dependencies,
636 enable_network_isolation=self.enable_network_isolation(),
--> 637 **kwargs
638 )
639

TypeError: type object got multiple values for keyword argument 'enable_network_isolation'
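The traceback suggests that in this version of the SDK, estimator.transformer() forwards enable_network_isolation into create_model() while create_model() also sets it internally, so the keyword arrives twice. As a hedged workaround (a sketch, not a confirmed fix), you can build the serving model yourself and ask it for the transformer, which avoids that code path; upgrading the sagemaker package is also worth trying, since this looks like an SDK-side bug:

```python
def transformer_via_model(estimator, instance_count=1, instance_type="ml.c5.xlarge"):
    """Workaround sketch: bypass estimator.transformer() by building the
    serving model first. create_model() and Model.transformer() are public
    SageMaker SDK methods; default model settings are assumed here."""
    model = estimator.create_model()
    return model.transformer(instance_count=instance_count,
                             instance_type=instance_type)
```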

How to change the value of SM_MODULE_DIR?

In the tf-eager-script-mode example, is there an argument that I can pass to some function to change the value of SM_MODULE_DIR?

The default value is s3://sagemaker-{aws-region}-{aws-id}/{training-job-name}/source/sourcedir.tar.gz which means it will always create a new bucket. I already have a bucket that I would like to save the compressed source code to. How can I modify SM_MODULE_DIR so that the source code will be saved to a specific bucket?
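SM_MODULE_DIR mirrors the module_dir that the SDK records after uploading sourcedir.tar.gz, so the way to change it is to tell the estimator where to upload the code. Framework estimators (including sagemaker.tensorflow.TensorFlow) accept a code_location argument for this. A hedged sketch of the relevant kwargs; "my-existing-bucket" and the "tf-eager" prefix are hypothetical names:

```python
# Hypothetical kwargs for sagemaker.tensorflow.TensorFlow(...):
# code_location is the S3 prefix under which the SDK uploads
# sourcedir.tar.gz, which in turn becomes SM_MODULE_DIR in the container.
estimator_kwargs = {
    "entry_point": "train.py",
    "code_location": "s3://my-existing-bucket/tf-eager",  # hypothetical bucket/prefix
}
```

With this, the compressed source should land under s3://my-existing-bucket/tf-eager/{training-job-name}/source/sourcedir.tar.gz instead of the default session bucket.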

Parameter server is not working: model not synced between different workers

I have a training script that follows the distributed-training-with-parameter-server approach,
but the models do not appear to be synced: every worker trains on its own data and never communicates with the others.
Using the ShardedByS3Key approach, I can verify that only the data on the master worker gets used.

So I see no benefit from using 10 or 20 workers; it seems each worker just trains on its own data with its own parameter server and saves its own model, and they never get synced.

Do I need to add anything to the training code to specify how gradients are synced?
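For reference, the estimator-side switch that launches the parameter servers looks like the snippet below. It only sets up the cluster (one parameter server per host, plus TF_CONFIG); the training script itself still has to use a distribution-aware API such as tf.estimator or a tf.distribute strategy, otherwise every worker trains independently, which matches the behaviour described above. This is a sketch, not a confirmed diagnosis of your script:

```python
# Passed to the sagemaker.tensorflow.TensorFlow estimator as
# distribution=...; it starts one parameter server per host but does NOT
# by itself make a plain Keras model.fit() loop synchronize gradients.
distribution = {"parameter_server": {"enabled": True}}
```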

"Error: argument "agent" is wrong: table id value is invalid" Error

Hi,

I am trying to run the tf-2-workflow/tf-2-workflow.ipynb notebook. I was able to run it until I reached the line !/bin/bash ./local_mode_setup.sh, at which point I get the following errors:

Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Get http://%2Fvar%2Frun%2Fdocker.sock/v1.40/networks: dial unix /var/run/docker.sock: connect: permission denied
Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Post http://%2Fvar%2Frun%2Fdocker.sock/v1.40/networks/create: dial unix /var/run/docker.sock: connect: permission denied
Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Get http://%2Fvar%2Frun%2Fdocker.sock/v1.40/networks: dial unix /var/run/docker.sock: connect: permission denied
Error: argument "agent" is wrong: table id value is invalid

I fixed my Docker permissions with this guide, so I am able to successfully run docker run hello-world. The issue seems to be with the following line in local_mode_setup.sh:

ROUTE_TABLE_PATCHED=`sudo ip route show table agent | grep -c $SAGEMAKER_INTERFACE`

Please help. Thanks!

aws-samples / amazon-sagemaker-script-mode / tf-2-workflow preprocessing.py bug

When running the preprocessing.py script as part of the sklearn_processor.run step of the notebook, the SageMaker Processing job fails with an "Algorithm" error. The CloudWatch logs show the following:

2021-07-09T17:12:14.418-04:00 | INPUT FILE List
2021-07-09T17:12:14.418-04:00 | ['/opt/ml/processing/input/x_test.npy', '/opt/ml/processing/input/y_test.npy']
2021-07-09T17:12:14.418-04:00 | SAVED TRANSFORMED TEST DATA FILE
2021-07-09T17:12:14.418-04:00
Traceback (most recent call last):
File "/opt/ml/processing/input/code/preprocessing.py", line 14, in
transformed = scaler.fit_transform(raw)
File "/miniconda3/lib/python3.7/site-packages/sklearn/base.py", line 462, in fit_transform
return self.fit(X, **fit_params).transform(X)
File "/miniconda3/lib/python3.7/site-packages/sklearn/preprocessing/data.py", line 617, in fit
return self.partial_fit(X, y)
File "/miniconda3/lib/python3.7/site-packages/sklearn/preprocessing/data.py", line 641, in partial_fit
force_all_finite='allow-nan')
File "/miniconda3/lib/python3.7/site-packages/sklearn/utils/validation.py", line 547, in check_array
"if it contains a single sample.".format(array))
2021-07-09T17:12:14.418-04:00
ValueError: Expected 2D array, got 1D array instead:
2021-07-09T17:12:14.418-04:00
array=[ 7.2 18.8 19. 27. 22.2 24.5 31.2 22.9 20.5 23.2 18.6 14.5 17.8 50.
20.8 24.3 24.2 19.8 19.1 22.7 12. 10.2 20. 18.5 20.9 23. 27.5 30.1
9.5 22. 21.2 14.1 33.1 23.4 20.1 7.4 15.4 23.8 20.1 24.5 33. 28.4
14.1 46.7 32.5 29.6 28.4 19.8 20.2 25. 35.4 20.3 9.7 14.5 34.9 26.6
7.2 50. 32.4 21.6 29.8 13.1 27.5 21.2 23.1 21.9 13. 23.2 8.1 5.6
21.7 29.6 19.6 7. 26.4 18.9 20.9 28.1 35.4 10.2 24.3 43.1 17.6 15.4
16.2 27.1 21.4 21.5 22.4 25. 16.6 18.6 22. 42.8 35.1 21.5 36. 21.9
24.1 50. 26.7 25. ].
2021-07-09T17:12:14.418-04:00
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

The array.reshape() recommendation in the error message is misleading and does not fix the underlying problem.
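The shape complaint itself is easy to reproduce outside SageMaker: StandardScaler wants a 2-D (n_samples, n_features) array, and the loop in the original script ends up handing it the 1-D y_test.npy target array. reshape(-1, 1) silences the error, but it would just scale the wrong array, which is why it does not really fix the notebook. A minimal repro, with made-up sample values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# 1-D target-style array like y_test.npy; values are illustrative only.
y = np.array([7.2, 18.8, 19.0, 27.0, 22.2])

scaler = StandardScaler()
# scaler.fit_transform(y) raises "Expected 2D array, got 1D array instead";
# reshaping to a single column satisfies the shape check.
scaled = scaler.fit_transform(y.reshape(-1, 1))
print(scaled.shape)  # (5, 1)
```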

I have found a working solution by changing the third cell (which generates the preprocessing.py script that throws the above error) to the following:

```python
%%writefile preprocessing.py

import glob
import numpy as np
import os
from sklearn.preprocessing import StandardScaler

if __name__ == '__main__':

    input_files = glob.glob('{}/*.npy'.format('/opt/ml/processing/input'))
    print('\nINPUT FILE LIST: \n{}\n'.format(input_files))
    scaler = StandardScaler()
    for file in input_files:
        if 'x_' in file:
            X = file
        elif 'y_' in file:
            y = file
    raw_x = np.load(X)
    raw_y = np.load(y)
    transformed = scaler.fit_transform(raw_x, raw_y)
    if 'train' in file:
        output_path = os.path.join('/opt/ml/processing/train', 'x_train.npy')
        np.save(output_path, transformed)
        print('SAVED TRANSFORMED TRAINING DATA FILE\n')
    else:
        output_path = os.path.join('/opt/ml/processing/test', 'x_test.npy')
        np.save(output_path, transformed)
        print('SAVED TRANSFORMED TEST DATA FILE\n')
```

This change allows fit_transform to build the transformed data set while being aware of the dimensionality of the y values. It removes the error in training and allows the model to be trained, deployed locally or on a hosted endpoint, and used to generate inferences.

I don't know if this is the best solution, but it's one that fixes the error with this notebook.

Error when processing sklearn_processor.run(...

In this repo:
https://github.com/aws-samples/amazon-sagemaker-script-mode/blob/master/tf-2-workflow/tf-2-workflow.ipynb

running sklearn_processor.run(...) leads to the following error:
miniconda3/lib/python3.7/site-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:47: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses

import imp
INPUT FILE LIST:
['/opt/ml/processing/input/x_test.npy']
Traceback (most recent call last):
File "/opt/ml/processing/input/code/preprocessing1.py", line 18, in
raw_y = np.load(y)
NameError: name 'y' is not defined
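The NameError happens because only x_test.npy was present in the input, so the loop never binds y. A defensive sketch (the helper name split_xy is hypothetical) that fails fast with a clear message instead:

```python
import os

def split_xy(input_files):
    """Hypothetical helper: partition *.npy inputs into x/y files and fail
    fast when either is missing, instead of hitting a NameError later."""
    x = [f for f in input_files if os.path.basename(f).startswith("x_")]
    y = [f for f in input_files if os.path.basename(f).startswith("y_")]
    if not x or not y:
        raise FileNotFoundError(
            "expected both x_*.npy and y_*.npy inputs, got: {}".format(input_files))
    return x[0], y[0]
```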

TF2 Workflow Example fails in training

Notebook: https://github.com/aws-samples/amazon-sagemaker-script-mode/blob/master/tf-2-workflow/tf-2-workflow.ipynb

The notebook for the TF2 end-to-end workflow fails during training with the error KeyError: 'callable_inputs'. Using an Amazon SageMaker t3.large instance, kernel conda_tensorflow2_p36.


Important section of the log
Traceback (most recent call last):
  File "train.py", line 75, in 
    model.save(args.model_dir + '/1')
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/engine/network.py", line 1059, in save
    signatures, options)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/save.py", line 138, in save_model
    signatures, options)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/save.py", line 78, in save
    save_lib.save(model, filepath, signatures, options)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/saved_model/save.py", line 951, in save
    obj, export_dir, signatures, options, meta_graph_def)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/saved_model/save.py", line 1008, in _build_meta_graph
    checkpoint_graph_view)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/saved_model/signature_serialization.py", line 75, in find_function_to_export
    functions = saveable_view.list_functions(saveable_view.root)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/saved_model/save.py", line 143, in list_functions
    self._serialization_cache)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 1701, in _list_functions_for_serialization
    Model, self)._list_functions_for_serialization(serialization_cache)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 2764, in _list_functions_for_serialization
    .list_functions_for_serialization(serialization_cache))
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/base_serialization.py", line 87, in list_functions_for_serialization
    fns = self.functions_to_serialize(serialization_cache)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/layer_serialization.py", line 77, in functions_to_serialize
    serialization_cache).functions_to_serialize)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/layer_serialization.py", line 92, in _get_serialized_attributes
    serialization_cache)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/model_serialization.py", line 53, in _get_serialized_attributes_internal
    serialization_cache))
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/layer_serialization.py", line 101, in _get_serialized_attributes_internal
    functions = save_impl.wrap_layer_functions(self.obj, serialization_cache)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/save_impl.py", line 163, in wrap_layer_functions
    '{}_layer_call_and_return_conditional_losses'.format(layer.name))
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/save_impl.py", line 503, in add_function
    self.add_trace(*self._input_signature)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/save_impl.py", line 418, in add_trace
    trace_with_training(True)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/save_impl.py", line 416, in trace_with_training
    fn.get_concrete_function(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/save_impl.py", line 547, in get_concrete_function
    return super(LayerCall, self).get_concrete_function(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 959, in get_concrete_function
    concrete = self._get_concrete_function_garbage_collected(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 865, in _get_concrete_function_garbage_collected
    self._initialize(args, kwargs, add_initializers_to=initializers)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 506, in _initialize
    *args, **kwds))
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 2446, in _get_concrete_function_internal_garbage_collected
    graph_function, _, _ = self._maybe_define_function(args, kwargs)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 2777, in _maybe_define_function
    graph_function = self._create_graph_function(args, kwargs)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 2667, in _create_graph_function
    capture_by_value=self._capture_by_value),
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/framework/func_graph.py", line 981, in func_graph_from_py_func
    func_outputs = python_func(*func_args, **func_kwargs)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 441, in wrapped_fn
    return weak_wrapped_fn().__wrapped__(*args, **kwds)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/save_impl.py", line 524, in wrapper
    ret = method(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/utils.py", line 170, in wrap_with_training_arg
    lambda: replace_training_and_call(False))
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/utils/tf_utils.py", line 65, in smart_cond
    pred, true_fn=true_fn, false_fn=false_fn, name=name)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/framework/smart_cond.py", line 54, in smart_cond
    return true_fn()
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/utils.py", line 169, in 
    lambda: replace_training_and_call(True),
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/utils.py", line 165, in replace_training_and_call
    return wrapped_call(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/save_impl.py", line 566, in call_and_return_conditional_losses
    return layer_call(inputs, *args, **kwargs), layer.get_losses_for(inputs)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/engine/network.py", line 721, in call
    convert_kwargs_to_constants=base_layer_utils.call_context().saving)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/engine/network.py", line 891, in _run_internal_graph
    output_tensors = layer(computed_tensors, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 929, in __call__
    outputs = call_fn(cast_inputs, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/utils.py", line 71, in return_outputs_and_add_losses
    outputs, losses = fn(inputs, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/save_impl.py", line 541, in __call__
    self.call_collection.add_trace(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/save_impl.py", line 421, in add_trace
    fn.get_concrete_function(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/save_impl.py", line 547, in get_concrete_function
    return super(LayerCall, self).get_concrete_function(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 959, in get_concrete_function
    concrete = self._get_concrete_function_garbage_collected(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 865, in _get_concrete_function_garbage_collected
    self._initialize(args, kwargs, add_initializers_to=initializers)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 506, in _initialize
    *args, **kwds))
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 2446, in _get_concrete_function_internal_garbage_collected
    graph_function, _, _ = self._maybe_define_function(args, kwargs)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 2777, in _maybe_define_function
    graph_function = self._create_graph_function(args, kwargs)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 2667, in _create_graph_function
    capture_by_value=self._capture_by_value),
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/framework/func_graph.py", line 981, in func_graph_from_py_func
    func_outputs = python_func(*func_args, **func_kwargs)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 441, in wrapped_fn
    return weak_wrapped_fn().__wrapped__(*args, **kwds)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/save_impl.py", line 513, in wrapper
    inputs = call_collection.get_input_arg_value(args, kwargs)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/save_impl.py", line 453, in get_input_arg_value
    self._input_arg_name, args, kwargs, inputs_in_args=True)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 2305, in _get_call_arg_value
    return args_dict[arg_name]
KeyError: 'callable_inputs'
2020-08-11 09:58:37,280 sagemaker-training-toolkit ERROR    ExecuteUserScriptError:
Command "/usr/local/bin/python3.7 train.py --batch_size 128 --epochs 30 --learning_rate 0.01 --model_dir /opt/ml/model"
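The log shows the smdebug hook being created right before the failure, and in some TF 2.x / smdebug combinations wrapping Keras layers makes model.save() fail with KeyError: 'callable_inputs'. One hedged thing to try is disabling the Debugger hook on the estimator; whether it resolves this particular failure is an assumption, not a confirmed fix:

```python
# Passed to the sagemaker.tensorflow.TensorFlow estimator; setting
# debugger_hook_config=False disables the SageMaker Debugger (smdebug)
# hook that the log shows being attached during training.
estimator_kwargs = {"debugger_hook_config": False}
```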
Full log:
2020-08-11 09:56:12 Starting - Starting the training job...
2020-08-11 09:56:14 Starting - Launching requested ML instances......
2020-08-11 09:57:20 Starting - Preparing the instances for training...
2020-08-11 09:57:57 Downloading - Downloading input data...
2020-08-11 09:58:39 Training - Training image download completed. Training in progress.
2020-08-11 09:58:39 Uploading - Uploading generated training model2020-08-11 09:58:32,569 sagemaker-training-toolkit INFO     Imported framework sagemaker_tensorflow_container.training
2020-08-11 09:58:32,576 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2020-08-11 09:58:32,833 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2020-08-11 09:58:32,850 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2020-08-11 09:58:32,865 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2020-08-11 09:58:32,875 sagemaker-training-toolkit INFO     Invoking user script

Training Env:

{
"additional_framework_parameters": {},
"channel_input_dirs": {
"test": "/opt/ml/input/data/test",
"train": "/opt/ml/input/data/train"
},
"current_host": "algo-1",
"framework_module": "sagemaker_tensorflow_container.training:main",
"hosts": [
"algo-1"
],
"hyperparameters": {
"batch_size": 128,
"model_dir": "/opt/ml/model",
"epochs": 30,
"learning_rate": 0.01
},
"input_config_dir": "/opt/ml/input/config",
"input_data_config": {
"test": {
"TrainingInputMode": "File",
"S3DistributionType": "FullyReplicated",
"RecordWrapperType": "None"
},
"train": {
"TrainingInputMode": "File",
"S3DistributionType": "FullyReplicated",
"RecordWrapperType": "None"
}
},
"input_dir": "/opt/ml/input",
"is_master": true,
"job_name": "tf-2-workflow-2020-08-11-09-56-09-519",
"log_level": 20,
"master_hostname": "algo-1",
"model_dir": "/opt/ml/model",
"module_dir": "s3://sagemaker-eu-west-1-859755744029/tf-2-workflow-2020-08-11-09-56-09-519/source/sourcedir.tar.gz",
"module_name": "train",
"network_interface_name": "eth0",
"num_cpus": 4,
"num_gpus": 0,
"output_data_dir": "/opt/ml/output/data",
"output_dir": "/opt/ml/output",
"output_intermediate_dir": "/opt/ml/output/intermediate",
"resource_config": {
"current_host": "algo-1",
"hosts": [
"algo-1"
],
"network_interface_name": "eth0"
},
"user_entry_point": "train.py"
}

Environment variables:

SM_HOSTS=["algo-1"]
SM_NETWORK_INTERFACE_NAME=eth0
SM_HPS={"batch_size":128,"epochs":30,"learning_rate":0.01,"model_dir":"/opt/ml/model"}
SM_USER_ENTRY_POINT=train.py
SM_FRAMEWORK_PARAMS={}
SM_RESOURCE_CONFIG={"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"}
SM_INPUT_DATA_CONFIG={"test":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"},"train":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}}
SM_OUTPUT_DATA_DIR=/opt/ml/output/data
SM_CHANNELS=["test","train"]
SM_CURRENT_HOST=algo-1
SM_MODULE_NAME=train
SM_LOG_LEVEL=20
SM_FRAMEWORK_MODULE=sagemaker_tensorflow_container.training:main
SM_INPUT_DIR=/opt/ml/input
SM_INPUT_CONFIG_DIR=/opt/ml/input/config
SM_OUTPUT_DIR=/opt/ml/output
SM_NUM_CPUS=4
SM_NUM_GPUS=0
SM_MODEL_DIR=/opt/ml/model
SM_MODULE_DIR=s3://sagemaker-eu-west-1-859755744029/tf-2-workflow-2020-08-11-09-56-09-519/source/sourcedir.tar.gz
SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{"test":"/opt/ml/input/data/test","train":"/opt/ml/input/data/train"},"current_host":"algo-1","framework_module":"sagemaker_tensorflow_container.training:main","hosts":["algo-1"],"hyperparameters":{"batch_size":128,"epochs":30,"learning_rate":0.01,"model_dir":"/opt/ml/model"},"input_config_dir":"/opt/ml/input/config","input_data_config":{"test":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"},"train":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}},"input_dir":"/opt/ml/input","is_master":true,"job_name":"tf-2-workflow-2020-08-11-09-56-09-519","log_level":20,"master_hostname":"algo-1","model_dir":"/opt/ml/model","module_dir":"s3://sagemaker-eu-west-1-859755744029/tf-2-workflow-2020-08-11-09-56-09-519/source/sourcedir.tar.gz","module_name":"train","network_interface_name":"eth0","num_cpus":4,"num_gpus":0,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"},"user_entry_point":"train.py"}
SM_USER_ARGS=["--batch_size","128","--epochs","30","--learning_rate","0.01","--model_dir","/opt/ml/model"]
SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
SM_CHANNEL_TEST=/opt/ml/input/data/test
SM_CHANNEL_TRAIN=/opt/ml/input/data/train
SM_HP_BATCH_SIZE=128
SM_HP_MODEL_DIR=/opt/ml/model
SM_HP_EPOCHS=30
SM_HP_LEARNING_RATE=0.01
PYTHONPATH=/opt/ml/code:/usr/local/bin:/usr/local/lib/python37.zip:/usr/local/lib/python3.7:/usr/local/lib/python3.7/lib-dynload:/usr/local/lib/python3.7/site-packages

Invoking script with the following command:

/usr/local/bin/python3.7 train.py --batch_size 128 --epochs 30 --learning_rate 0.01 --model_dir /opt/ml/model

x train (404, 13) y train (404,)
x test (102, 13) y test (102,)
/cpu:0
batch_size = 128, epochs = 30, learning rate = 0.01
[2020-08-11 09:58:34.837 ip-10-0-227-174.eu-west-1.compute.internal:22 INFO json_config.py:90] Creating hook from json_config at /opt/ml/input/config/debughookconfig.json.
[2020-08-11 09:58:34.838 ip-10-0-227-174.eu-west-1.compute.internal:22 INFO hook.py:192] tensorboard_dir has not been set for the hook. SMDebug will not be exporting tensorboard summaries.
[2020-08-11 09:58:34.838 ip-10-0-227-174.eu-west-1.compute.internal:22 INFO hook.py:237] Saving to /opt/ml/output/tensors
[2020-08-11 09:58:34.838 ip-10-0-227-174.eu-west-1.compute.internal:22 INFO state_store.py:67] The checkpoint config file /opt/ml/input/config/checkpointconfig.json does not exist.
Epoch 1/30
[2020-08-11 09:58:34.948 ip-10-0-227-174.eu-west-1.compute.internal:22 INFO hook.py:382] Monitoring the collections: sm_metrics, losses, metrics
4/4 [==============================] - 0s 56ms/step - loss: 522.4297 - val_loss: 395.8727 - batch: 0.0000e+00
Epoch 2/30
4/4 [==============================] - 0s 9ms/step - loss: 327.1166 - val_loss: 231.9524 - batch: 1.0000
Epoch 3/30
4/4 [==============================] - 0s 9ms/step - loss: 186.0386 - val_loss: 131.3616 - batch: 2.0000
Epoch 4/30
4/4 [==============================] - 0s 10ms/step - loss: 108.9488 - val_loss: 86.0883 - batch: 3.0000
Epoch 5/30
4/4 [==============================] - 0s 9ms/step - loss: 77.4079 - val_loss: 68.1158 - batch: 4.0000
Epoch 6/30
4/4 [==============================] - 0s 10ms/step - loss: 64.5651 - val_loss: 55.5730 - batch: 5.0000
Epoch 7/30
4/4 [==============================] - 0s 11ms/step - loss: 56.4356 - val_loss: 47.1502 - batch: 6.0000
Epoch 8/30
4/4 [==============================] - 0s 9ms/step - loss: 50.3346 - val_loss: 44.9798 - batch: 7.0000
Epoch 9/30
4/4 [==============================] - 0s 10ms/step - loss: 48.4684 - val_loss: 42.5844 - batch: 8.0000
Epoch 10/30
4/4 [==============================] - 0s 9ms/step - loss: 46.1197 - val_loss: 40.3272 - batch: 9.0000
Epoch 11/30
4/4 [==============================] - 0s 9ms/step - loss: 44.0385 - val_loss: 39.4977 - batch: 10.0000
Epoch 12/30
4/4 [==============================] - 0s 9ms/step - loss: 43.3905 - val_loss: 38.6036 - batch: 11.0000
Epoch 13/30
4/4 [==============================] - 0s 11ms/step - loss: 40.0422 - val_loss: 36.2050 - batch: 12.0000
Epoch 14/30
4/4 [==============================] - 0s 9ms/step - loss: 38.1127 - val_loss: 35.1686 - batch: 13.0000
Epoch 15/30
4/4 [==============================] - 0s 9ms/step - loss: 36.3987 - val_loss: 34.2136 - batch: 14.0000
Epoch 16/30
4/4 [==============================] - 0s 10ms/step - loss: 37.1170 - val_loss: 32.2987 - batch: 15.0000
Epoch 17/30
#0151/4 [======>.......................] - ETA: 0s - loss: 37.5749#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#0154/4 [==============================] - 0s 11ms/step - loss: 33.2491 - val_loss: 31.5382 - batch: 16.0000
Epoch 18/30
#0151/4 [======>.......................] - ETA: 0s - loss: 22.2220#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#0154/4 [==============================] - 0s 9ms/step - loss: 32.1300 - val_loss: 30.8262 - batch: 17.0000
Epoch 19/30
#0151/4 [======>.......................] - ETA: 0s - loss: 25.8472#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#0154/4 [==============================] - 0s 8ms/step - loss: 30.1447 - val_loss: 29.7609 - batch: 18.0000
Epoch 20/30
#0151/4 [======>.......................] - ETA: 0s - loss: 37.5497#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#0154/4 [==============================] - 0s 10ms/step - loss: 28.9837 - val_loss: 29.3724 - batch: 19.0000
Epoch 21/30
#0151/4 [======>.......................] - ETA: 0s - loss: 28.8834#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#0154/4 [==============================] - 0s 9ms/step - loss: 28.4000 - val_loss: 29.8117 - batch: 20.0000
Epoch 22/30
#0151/4 [======>.......................] - ETA: 0s - loss: 29.1055#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#0154/4 [==============================] - 0s 9ms/step - loss: 27.5027 - val_loss: 27.7952 - batch: 21.0000
Epoch 23/30
#0151/4 [======>.......................] - ETA: 0s - loss: 31.1734#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#0154/4 [==============================] - 0s 9ms/step - loss: 26.5844 - val_loss: 28.7902 - batch: 22.0000
Epoch 24/30
#0151/4 [======>.......................] - ETA: 0s - loss: 27.4608#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#0154/4 [==============================] - 0s 9ms/step - loss: 25.0937 - val_loss: 27.6794 - batch: 23.0000
Epoch 25/30
#0151/4 [======>.......................] - ETA: 0s - loss: 17.5128#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#0154/4 [==============================] - 0s 10ms/step - loss: 23.8630 - val_loss: 26.5982 - batch: 24.0000
Epoch 26/30
#0151/4 [======>.......................] - ETA: 0s - loss: 18.9573#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#0154/4 [==============================] - 0s 11ms/step - loss: 22.8002 - val_loss: 24.9170 - batch: 25.0000
Epoch 27/30
#0151/4 [======>.......................] - ETA: 0s - loss: 23.8510#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#0154/4 [==============================] - 0s 8ms/step - loss: 22.2273 - val_loss: 26.5363 - batch: 26.0000
Epoch 28/30
#0151/4 [======>.......................] - ETA: 0s - loss: 14.1942#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#0154/4 [==============================] - 0s 8ms/step - loss: 21.6918 - val_loss: 24.9937 - batch: 27.0000
Epoch 29/30
#0151/4 [======>.......................] - ETA: 0s - loss: 22.5802#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#0154/4 [==============================] - 0s 9ms/step - loss: 20.2896 - val_loss: 24.1509 - batch: 28.0000
Epoch 30/30
#0151/4 [======>.......................] - ETA: 0s - loss: 22.4525#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#010#0154/4 [==============================] - 0s 9ms/step - loss: 19.5049 - val_loss: 26.8152 - batch: 29.0000
1/1 - 0s - loss: 26.8152

Test MSE : 26.815200805664062
Traceback (most recent call last):
File "train.py", line 75, in <module>
model.save(args.model_dir + '/1')
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/engine/network.py", line 1059, in save
signatures, options)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/save.py", line 138, in save_model
signatures, options)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/save.py", line 78, in save
save_lib.save(model, filepath, signatures, options)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/saved_model/save.py", line 951, in save
obj, export_dir, signatures, options, meta_graph_def)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/saved_model/save.py", line 1008, in _build_meta_graph
checkpoint_graph_view)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/saved_model/signature_serialization.py", line 75, in find_function_to_export
functions = saveable_view.list_functions(saveable_view.root)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/saved_model/save.py", line 143, in list_functions
self._serialization_cache)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 1701, in _list_functions_for_serialization
Model, self)._list_functions_for_serialization(serialization_cache)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 2764, in _list_functions_for_serialization
.list_functions_for_serialization(serialization_cache))
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/base_serialization.py", line 87, in list_functions_for_serialization
fns = self.functions_to_serialize(serialization_cache)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/layer_serialization.py", line 77, in functions_to_serialize
serialization_cache).functions_to_serialize)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/layer_serialization.py", line 92, in _get_serialized_attributes
serialization_cache)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/model_serialization.py", line 53, in _get_serialized_attributes_internal
serialization_cache))
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/layer_serialization.py", line 101, in _get_serialized_attributes_internal
functions = save_impl.wrap_layer_functions(self.obj, serialization_cache)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/save_impl.py", line 163, in wrap_layer_functions
'{}_layer_call_and_return_conditional_losses'.format(layer.name))
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/save_impl.py", line 503, in add_function
self.add_trace(*self._input_signature)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/save_impl.py", line 418, in add_trace
trace_with_training(True)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/save_impl.py", line 416, in trace_with_training
fn.get_concrete_function(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/save_impl.py", line 547, in get_concrete_function
return super(LayerCall, self).get_concrete_function(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 959, in get_concrete_function
concrete = self._get_concrete_function_garbage_collected(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 865, in _get_concrete_function_garbage_collected
self._initialize(args, kwargs, add_initializers_to=initializers)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 506, in _initialize
*args, **kwds))
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 2446, in _get_concrete_function_internal_garbage_collected
graph_function, _, _ = self._maybe_define_function(args, kwargs)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 2777, in _maybe_define_function
graph_function = self._create_graph_function(args, kwargs)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 2667, in _create_graph_function
capture_by_value=self._capture_by_value),
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/framework/func_graph.py", line 981, in func_graph_from_py_func
func_outputs = python_func(*func_args, **func_kwargs)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 441, in wrapped_fn
return weak_wrapped_fn().wrapped(*args, **kwds)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/save_impl.py", line 524, in wrapper
ret = method(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/utils.py", line 170, in wrap_with_training_arg
lambda: replace_training_and_call(False))
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/utils/tf_utils.py", line 65, in smart_cond
pred, true_fn=true_fn, false_fn=false_fn, name=name)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/framework/smart_cond.py", line 54, in smart_cond
return true_fn()
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/utils.py", line 169, in
lambda: replace_training_and_call(True),
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/utils.py", line 165, in replace_training_and_call
return wrapped_call(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/save_impl.py", line 566, in call_and_return_conditional_losses
return layer_call(inputs, *args, **kwargs), layer.get_losses_for(inputs)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/engine/network.py", line 721, in call
convert_kwargs_to_constants=base_layer_utils.call_context().saving)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/engine/network.py", line 891, in _run_internal_graph
output_tensors = layer(computed_tensors, **kwargs)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 929, in call
outputs = call_fn(cast_inputs, *args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/utils.py", line 71, in return_outputs_and_add_losses
outputs, losses = fn(inputs, *args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/save_impl.py", line 541, in call
self.call_collection.add_trace(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/save_impl.py", line 421, in add_trace
fn.get_concrete_function(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/save_impl.py", line 547, in get_concrete_function
return super(LayerCall, self).get_concrete_function(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 959, in get_concrete_function
concrete = self._get_concrete_function_garbage_collected(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 865, in _get_concrete_function_garbage_collected
self._initialize(args, kwargs, add_initializers_to=initializers)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 506, in _initialize
*args, **kwds))
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 2446, in _get_concrete_function_internal_garbage_collected
graph_function, _, _ = self._maybe_define_function(args, kwargs)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 2777, in _maybe_define_function
graph_function = self._create_graph_function(args, kwargs)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 2667, in _create_graph_function
capture_by_value=self._capture_by_value),
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/framework/func_graph.py", line 981, in func_graph_from_py_func
func_outputs = python_func(*func_args, **func_kwargs)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 441, in wrapped_fn
return weak_wrapped_fn().wrapped(*args, **kwds)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/save_impl.py", line 513, in wrapper
inputs = call_collection.get_input_arg_value(args, kwargs)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/save_impl.py", line 453, in get_input_arg_value
self._input_arg_name, args, kwargs, inputs_in_args=True)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 2305, in _get_call_arg_value
return args_dict[arg_name]
KeyError: 'callable_inputs'
2020-08-11 09:58:37,280 sagemaker-training-toolkit ERROR ExecuteUserScriptError:
Command "/usr/local/bin/python3.7 train.py --batch_size 128 --epochs 30 --learning_rate 0.01 --model_dir /opt/ml/model"

What is the correct value of model_dir for the TensorFlow estimator in script mode?

In the tf-eager-sm-scriptmode.ipynb example, model_dir is defined as /opt/ml/model:

image

But the documentation for TensorFlow says that model_dir should be an S3 location:

image

Is the documentation outdated? It looks like model_dir in the documentation should be changed to output_path.

Also, based on reading other documentation, I get the impression that if my training script generates the usual SavedModel files plus other files like evaluation metrics, I must write the SavedModel files to /opt/ml/model and the evaluation metrics file to /opt/ml/output. If I were to write both sets of files to /opt/ml/model, then the output would not be serveable. Is this correct, and if yes, how do I get SageMaker to export the files written to /opt/ml/output to an S3 bucket?
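A minimal sketch of how the two output locations can be kept separate in a script-mode entry point, assuming the standard SM_MODEL_DIR and SM_OUTPUT_DATA_DIR environment variables that SageMaker sets inside the training container (for local runs the sketch falls back to ./model and ./output):

```python
import argparse
import json
import os

parser = argparse.ArgumentParser()
# Inside the container SM_MODEL_DIR is /opt/ml/model: its contents are
# uploaded as model.tar.gz and used for serving.
parser.add_argument("--model_dir", type=str,
                    default=os.environ.get("SM_MODEL_DIR", "model"))
# SM_OUTPUT_DATA_DIR is /opt/ml/output/data: its contents are uploaded
# as output.tar.gz but are NOT served.
parser.add_argument("--output_data_dir", type=str,
                    default=os.environ.get("SM_OUTPUT_DATA_DIR", "output"))
args, _ = parser.parse_known_args()

# model.save(args.model_dir + "/1")   # SavedModel for TF Serving goes here

# Auxiliary artifacts such as evaluation metrics go to the output data dir.
metrics = {"test_mse": 26.815}  # example value only
os.makedirs(args.output_data_dir, exist_ok=True)
with open(os.path.join(args.output_data_dir, "metrics.json"), "w") as f:
    json.dump(metrics, f)
```

With this split, the model artifact stays serveable while the metrics still land in S3 alongside the job's output.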

Error "No module named 'transformers'" for deploy-pretrained-model/BERT/Deploy_BERT.ipynb

Hi

I read your Deploy BERT notebook and followed the steps (with changes to use DistilBERT rather than BERT) and created an endpoint. I can see from the dashboard that it has InService status.

However, when I tried to invoke the endpoint using your example code,

import boto3

sm = boto3.client('sagemaker-runtime')

prompt = "The best part of Amazon SageMaker is that it makes machine learning easy."

response = sm.invoke_endpoint(EndpointName=endpoint_name,
                              Body=prompt.encode(encoding='UTF-8'),
                              ContentType='text/csv')

response['Body'].read()

I get an error message of "No module named 'transformers'".

ModelError Traceback (most recent call last)
in
6 response = sm.invoke_endpoint(EndpointName=endpoint_name,
7 Body=prompt.encode(encoding='UTF-8'),
----> 8 ContentType='text/csv')
9
10 response['Body'].read()

/opt/conda/lib/python3.7/site-packages/botocore/client.py in _api_call(self, *args, **kwargs)
355 "%s() only accepts keyword arguments." % py_operation_name)
356 # The "self" in this scope is referring to the BaseClient.
--> 357 return self._make_api_call(operation_name, kwargs)
358
359 _api_call.__name__ = str(py_operation_name)

/opt/conda/lib/python3.7/site-packages/botocore/client.py in _make_api_call(self, operation_name, api_params)
674 error_code = parsed_response.get("Error", {}).get("Code")
675 error_class = self.exceptions.from_code(error_code)
--> 676 raise error_class(parsed_response, operation_name)
677 else:
678 return parsed_response

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (500) from model with message "No module named 'transformers'
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/site-packages/sagemaker_inference/transformer.py", line 110, in transform
self.validate_and_initialize(model_dir=model_dir)
File "/opt/conda/lib/python3.6/site-packages/sagemaker_inference/transformer.py", line 157, in validate_and_initialize
self._validate_user_module_and_set_functions()
File "/opt/conda/lib/python3.6/site-packages/sagemaker_inference/transformer.py", line 170, in _validate_user_module_and_set_functions
user_module = importlib.import_module(user_module_name)
File "/opt/conda/lib/python3.6/importlib/__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 994, in _gcd_import
File "<frozen importlib._bootstrap>", line 971, in _find_and_load
File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 678, in exec_module
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "/opt/ml/model/code/deploy_distilbert.py", line 12, in <module>
from transformers import DistilBertTokenizer, DistilBertForTokenClassification
ModuleNotFoundError: No module named 'transformers'
". See https://eu-west-2.console.aws.amazon.com/cloudwatch/home?region=eu-west-2#logEventViewer:group=/aws/sagemaker/Endpoints/distilbert-base-cased in account 645532039181 for more information.

I am having terrible time of understanding how to use pre-trained model in sagemaker and would really appreciate if you could help.

In the log, it says:

Python executable: /opt/conda/bin/python

And I checked that Python, and it does have the transformers module installed:

$ /opt/conda/bin/python
Python 3.6.13 | packaged by conda-forge | (default, Feb 19 2021, 05:36:01)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import transformers
>>> quit()

My requirements.txt has:

transformers>=4.0.0
tensorflow==1.13.1
torch==1.5.0

When I run this cell (I added requirements.txt into the code directory):

import tarfile

zipped_model_path = os.path.join(model_path, "model.tar.gz")

with tarfile.open(zipped_model_path, "w:gz") as tar:
    tar.add(model_path)
    tar.add(code_path)

Would you please tell me what's going wrong?

Thanks
Teresa
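The exact cause here may differ, but a frequent pitfall with tar.add(path) in cells like the one above is that it stores the full path prefix inside the archive, so requirements.txt ends up at e.g. some/local/path/code/requirements.txt rather than at code/requirements.txt, where the inference container looks for it before installing dependencies. A hedged sketch of building the archive with explicit arcname values (the local directory names are hypothetical):

```python
import os
import tarfile

model_path = "model"  # hypothetical local dir with the model artifacts
code_path = "code"    # hypothetical local dir with inference script + requirements.txt

# Set up the directories so this sketch is runnable on its own.
os.makedirs(model_path, exist_ok=True)
os.makedirs(code_path, exist_ok=True)
with open(os.path.join(code_path, "requirements.txt"), "w") as f:
    f.write("transformers>=4.0.0\n")

# arcname controls the path INSIDE the archive, independent of the local path.
with tarfile.open("model.tar.gz", "w:gz") as tar:
    tar.add(model_path, arcname=".")     # model artifacts at the archive root
    tar.add(code_path, arcname="code")   # code/ (script + requirements.txt) at the root

# Inspect the resulting layout.
with tarfile.open("model.tar.gz") as tar:
    names = tar.getnames()
```

After building, `names` should contain "code/requirements.txt" at the archive root, which is the layout the serving toolkit expects.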

Replace global reference with a local variable in example script

There is a function in one of the SageMaker's examples that saves Keras model onto disk:

def save_model(model, output):
    tf.contrib.saved_model.save_keras_model(model, args.model_dir)
    logging.info("Model successfully saved at: {}".format(output))
    return

But it seems that the function doesn't use the output parameter and refers to the global args.model_dir instead.

Do you think this function should be replaced with something like this?

def save_model(model, output):
    tf.contrib.saved_model.save_keras_model(model, output)
    logging.info("Model successfully saved at: {}".format(output))
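Reduced to a TensorFlow-free sketch, the difference between the two versions becomes observable as soon as a caller passes a path other than the global one (the names below are stand-ins, not the real script's API):

```python
saved_to = []                       # records where each call "saves" the model

args_model_dir = "/opt/ml/model"    # stands in for the global args.model_dir

def save_model_buggy(model, output):
    # Bug: ignores the `output` parameter and reads the global instead.
    saved_to.append(args_model_dir)

def save_model_fixed(model, output):
    # Fix: uses the parameter, as the proposed change does.
    saved_to.append(output)

save_model_buggy(None, "/tmp/other")
save_model_fixed(None, "/tmp/other")
```

The buggy version silently saves to /opt/ml/model no matter what the caller asked for, while the fixed version honors the argument.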
