sagemaker-battlesnake-ai's Issues

Add tags to resources to assist in tracking costs.

I was wondering whether this project would be open to adding a tag (or tags) to the resources in the CloudFormation template to help users track costs.

If so I'd be up for submitting a PR. I am relatively new to CloudFormation templates but feel this would be a good starting point. Thank you.

BattlesnakeNotebook initialization fails on S3 bucket name inconsistency

After stack creation, the log for BattlesnakeNotebook/LifecycleConfigOnStart includes the following line:

boto3.exceptions.S3UploadFailedError: Failed to upload RLlibEnv/inference/model.tar.gz to bonhomme-snake/battlesnake-aws/pretrainedmodels/model.tar.gz: An error occurred (NoSuchBucket) when calling the PutObject operation: The specified bucket does not exist

where bonhomme-snake was the value I chose for the parameter SolutionS3BucketName.

Looking in S3, I see that the bucket was named sagemaker-solutions-bonhomme-snake.

Looking in the CloudFormation template yaml file, I see that the bucket is created with BucketName: !Sub "sagemaker-solutions-${SolutionS3BucketName}":
https://github.com/awslabs/sagemaker-battlesnake-ai/blob/master/CloudFormation/deploy-battlesnake-endpoint.yaml#L68

while the sed command doesn't have the same sagemaker-solutions- prefix and only uses the bucket name:
https://github.com/awslabs/sagemaker-battlesnake-ai/blob/master/CloudFormation/deploy-battlesnake-endpoint.yaml#L243-L245

I unblocked myself by manually creating an S3 bucket without the prefix, modifying the NotebookInstanceExecutionRole's IAM permissions to explicitly include the new bucket and the objects underneath it, and then restarting the notebook instance.

(I did not try updating the sed commands in the script.)
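
The mismatch described above can be sketched in a few lines. This is a minimal illustration, not code from the repository: BUCKET_PREFIX, template_bucket_name, and script_bucket_name are hypothetical names. Only the sagemaker-solutions- prefix and the example parameter value come from the report.

```python
# Hypothetical illustration of the naming mismatch; not code from the repository.
BUCKET_PREFIX = "sagemaker-solutions-"  # prefix applied by the CloudFormation template


def template_bucket_name(solution_s3_bucket_name: str) -> str:
    """Name the template gives the bucket: prefix + parameter value."""
    return BUCKET_PREFIX + solution_s3_bucket_name


def script_bucket_name(solution_s3_bucket_name: str, use_prefix: bool = False) -> str:
    """Name the notebook script uploads to.

    With use_prefix=False this mirrors the reported behaviour (parameter value
    only); with use_prefix=True it matches the bucket the template created.
    """
    if use_prefix:
        return template_bucket_name(solution_s3_bucket_name)
    return solution_s3_bucket_name


param = "bonhomme-snake"
print(template_bucket_name(param))  # sagemaker-solutions-bonhomme-snake
print(script_bucket_name(param))    # bonhomme-snake
```

If the script side were changed to apply the same prefix, both functions would agree and the upload would target the bucket that actually exists.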

Running Issues

Hi,
I recently started to get this new user warning in Heuristics Developer:

/home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/gluon/block.py:1454: UserWarning: Cannot decide type for the following arguments. Consider providing them as input:
	data0: None
  input_sym_arg_type = in_param.infer_type()[0]
/home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/gluon/block.py:1454: UserWarning: Cannot decide type for the following arguments. Consider providing them as input:
	data0: None
  input_sym_arg_type = in_param.infer_type()[0]

Everything else was running as expected, but recently, when I run the next cell ("Simulation loop"), it prints "completed" twice, whereas it used to print it only once.

When I tried running the SageMakerModelTraining notebook, I got the following error:

RuntimeError: Failed to run: ['docker-compose', '-f', '/tmp/tmpic2b7kee/docker-compose.yaml', 'up', '--build', '--abort-on-container-exit'], Process exited with code: 1

Sorry for the long message. I would really appreciate your guidance!

api function throwing errors

INIT_START Runtime Version: python:3.7.v22 Runtime Version ARN: arn:aws:lambda:us-west-2::runtime:c20bdb59d3d6e84cc4b436e730261b773590c43962e79e9b7a43e9715ac5276d
START RequestId: 3352c07d-f016-4128-9bfa-98d7891f8edf Version: $LATEST
[ERROR] Runtime.ImportModuleError: Unable to import module 'lambda':
IMPORTANT: PLEASE READ THIS FOR ADVICE ON HOW TO SOLVE THIS ISSUE!
Importing the numpy C-extensions failed. This error can happen for
many reasons, often due to issues with your setup or how NumPy was
installed.
We have compiled some common reasons and troubleshooting tips at:
https://numpy.org/devdocs/user/troubleshooting-importerror.html
Please note and check the following:

  • The Python version is: Python3.7 from "/var/lang/bin/python3.7"
  • The NumPy version is: "1.22.4"
    and make sure that they are the versions you expect.
    Please carefully study the documentation linked above for further help.
    Original error was: No module named 'numpy.core._multiarray_umath'

This appears to be an issue with the api lambda code at
S3Bucket: sagemaker-solutions-prod-us-west-2
S3Key: sagemaker-battlesnake-ai/1.2.1/build/api.zip

I'm going to attempt to rebuild and deploy from local to see if I can resolve this. Otherwise, this project is unusable.
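
One possible cause of this kind of import failure is that compiled packages were installed under a different Python minor version than the one that imports them. The following is a minimal sketch of a check to run before rebuilding the archive locally, assuming that is what happened here. TARGET_RUNTIME and check_build_interpreter are hypothetical names, and the 3.7 target is taken from the log above.

```python
import sys

# Assumption: the Lambda runtime is Python 3.7, as shown in the log above.
TARGET_RUNTIME = (3, 7)


def check_build_interpreter(target=TARGET_RUNTIME, current=None):
    """Return True when the interpreter building the zip matches the runtime.

    Compiled wheels such as NumPy are tied to a Python minor version, so
    packages installed under a different minor version may fail to import.
    """
    if current is None:
        current = tuple(sys.version_info[:2])
    return tuple(current) == tuple(target)


if not check_build_interpreter():
    print("Warning: building with Python {}.{} but the runtime is {}.{}".format(
        sys.version_info[0], sys.version_info[1], *TARGET_RUNTIME))
```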

Missing setup.sh

It looks like some code is no longer needed. In SageMakerModelTraining:

# only run from SageMaker notebook instance
if local_mode:
    !/bin/bash ./setup.sh

gives
/bin/bash: ./setup.sh: No such file or directory
when in local_mode.
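
A minimal sketch of one way to make that cell tolerant of the missing file, assuming the script is optional. run_setup_if_present is a hypothetical helper, not part of the notebook.

```python
import os
import subprocess


def run_setup_if_present(script_path="./setup.sh"):
    """Run the setup script only when it exists; return True if it was run."""
    if not os.path.isfile(script_path):
        print("Skipping setup: {} not found".format(script_path))
        return False
    # check=True raises CalledProcessError if the script itself fails.
    subprocess.run(["/bin/bash", script_path], check=True)
    return True
```

In the notebook this would replace the shell call, for example by calling run_setup_if_present() inside the existing local_mode branch.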

Issues with RLEstimator when training

Following the instructions in the markdown files, I am struggling with RLLibEnv/2_PolicyTraining.ipynb. In the cell that starts the training, the RLEstimator expects three further arguments: toolkit, toolkit_version, and framework. I fixed this by adding the following lines:

 toolkit=RLToolkit.COACH,
 toolkit_version='0.11.1',
 framework=RLFramework.TENSORFLOW,

After fixing that, the next problem occurred when the RLEstimator called train-mabs.py with its parameters. The created Docker container appears not to have installed the packages from requirements.txt: Ray is not installed, and that does not seem to be the only problem. Output:

Invoking script with the following command:

/usr/bin/python -m train-mabs --additional_configs clip_rewards=True,gamma=0.999,kl_coeff=0.2,lambda=0.9,lr=0.0005,num_sgd_iter=3,sample_batch_size=96,sgd_minibatch_size=256,train_batch_size=9216,vf_clip_param=175.0 --algorithm PPO --iterate_map_size False --map_size 11 --num_agents 4 --num_iters 10 --use_heuristics_action_masks False

Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/ml/code/train-mabs.py", line 5, in <module>
    import ray
ModuleNotFoundError: No module named 'ray'
2020-12-22 16:35:23,079 sagemaker-containers ERROR    ExecuteUserScriptError:
Command "/usr/bin/python -m train-mabs --additional_configs clip_rewards=True,gamma=0.999,kl_coeff=0.2,lambda=0.9,lr=0.0005,num_sgd_iter=3,sample_batch_size=96,sgd_minibatch_size=256,train_batch_size=9216,vf_clip_param=175.0 --algorithm PPO --iterate_map_size False --map_size 11 --num_agents 4 --num_iters 10 --use_heuristics_action_masks False"

2020-12-22 16:35:50 Uploading - Uploading generated training model
2020-12-22 16:35:50 Failed - Training job failed
ProfilerReport-1608654710: Stopping
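
A small pre-flight check at the top of the training script would surface this kind of failure with a clearer message. This is a sketch under assumptions: missing_modules and REQUIRED are hypothetical names, and "ray" is the only module the traceback confirms is needed.

```python
import importlib.util


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumption: "ray" is the only requirement confirmed by the traceback above;
# extend the list with whatever requirements.txt actually contains.
REQUIRED = ["ray"]

missing = missing_modules(REQUIRED)
if missing:
    print("Missing modules in this environment: " + ", ".join(missing))
```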

Missing SELECTED_RL_METHOD environment variable in BattlesnakeAPIFunction Lambda

Using the CloudFormation template and working around #28, I was getting internal server errors when I hit the snake/status endpoint. Digging into the logs, I found the following error:

[ERROR] KeyError: 'SELECTED_RL_METHOD'
Traceback (most recent call last):
  File "/var/lang/lib/python3.7/imp.py", line 234, in load_module
    return load_source(name, filename, file)
  File "/var/lang/lib/python3.7/imp.py", line 171, in load_source
    module = _load(spec)
  File "<frozen importlib._bootstrap>", line 696, in _load
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 728, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/var/task/lambda.py", line 29, in <module>
    if os.environ['SELECTED_RL_METHOD'] == "MXNet":
  File "/var/lang/lib/python3.7/os.py", line 681, in __getitem__
    raise KeyError(key) from None

After I manually added the environment variable SELECTED_RL_METHOD with the value "RLLib" to this Lambda, I was able to get a 200 status saying the snake is ready.
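
The KeyError comes from indexing os.environ directly. A minimal sketch of a more defensive read follows. selected_rl_method and DEFAULT_RL_METHOD are hypothetical names, and the fallback value is an assumption based only on the workaround above.

```python
import os

# Assumption: "RLLib" is used as the fallback only because it is the value
# that worked in the report above; the project may intend a different default.
DEFAULT_RL_METHOD = "RLLib"


def selected_rl_method(environ=None):
    """Read SELECTED_RL_METHOD without raising KeyError when it is unset."""
    if environ is None:
        environ = os.environ
    return environ.get("SELECTED_RL_METHOD", DEFAULT_RL_METHOD)


print(selected_rl_method({}))                               # RLLib
print(selected_rl_method({"SELECTED_RL_METHOD": "MXNet"}))  # MXNet
```

Setting the variable in the CloudFormation template would address the root cause; the fallback only keeps the function from failing at import time.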

Benchmarking method

Hi,
Where is the matrix saved so that I could benchmark the new model?

Thanks,

Shahzeb

SageMaker SDK argument incompatible in 2_PolicyTraining.ipynb

2_PolicyTraining.ipynb breaks in SageMaker Studio with the "SageMaker JumpStart Tensorflow 1.0" image due to a SageMaker SDK version incompatibility.

The notebook and solution were deployed through SageMaker JumpStart. The 7th cell breaks because of missing arguments.

estimator.fit()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<timed exec> in <module>

/opt/conda/envs/sagemaker-soln/lib/python3.7/site-packages/sagemaker/rl/estimator.py in __init__(self, entry_point, toolkit, toolkit_version, framework, source_dir, hyperparameters, image_uri, metric_definitions, **kwargs)
    145             :class:`~sagemaker.estimator.EstimatorBase`.
    146         """
--> 147         self._validate_images_args(toolkit, toolkit_version, framework, image_uri)
    148 
    149         if not image_uri:

/opt/conda/envs/sagemaker-soln/lib/python3.7/site-packages/sagemaker/rl/estimator.py in _validate_images_args(cls, toolkit, toolkit_version, framework, image_uri)
    389                 raise AttributeError(
    390                     "Please provide `{}` or `image_uri` parameter.".format(
--> 391                         "`, `".join(not_found_args)
    392                     )
    393                 )

AttributeError: Please provide `toolkit`, `toolkit_version`, `framework` or `image_uri` parameter.

This is because the input arguments passed to the estimator became obsolete after the SageMaker Python SDK v1 to v2 upgrade.

The SageMaker JumpStart Tensorflow 1.0 kernel has SDK version:

sagemaker.__version__
'2.45.0'

which takes image_uri instead of image_name.

image_name = '462105765813.dkr.ecr.{region}.amazonaws.com/sagemaker-rl-ray-container:ray-0.8.2-tf-{device}-py36'.format(region=region, device=device)
estimator = RLEstimator(entry_point="train-mabs.py",
                        source_dir='training/training_src',
                        dependencies=["training/common/sagemaker_rl", "inference/inference_src/", "../BattlesnakeGym/"],
                        image_name=image_name,
                        role=role,
                        training_instance_type=instance_type,
                        training_instance_count=1,
...
...
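
The rename can be sketched as a small keyword-argument translation. This is an illustration rather than the project's fix: migrate_estimator_kwargs and V1_TO_V2_KWARGS are hypothetical names, and the table only lists the image_name rename from this report plus two instance-argument renames from the SDK v2 migration notes. Other argument spellings in the snippet above are left untouched.

```python
# Renames from SageMaker Python SDK v1 to v2 (only a small, assumed-relevant subset).
V1_TO_V2_KWARGS = {
    "image_name": "image_uri",
    "train_instance_type": "instance_type",
    "train_instance_count": "instance_count",
}


def migrate_estimator_kwargs(kwargs):
    """Return a copy of kwargs with v1 argument names replaced by v2 names."""
    return {V1_TO_V2_KWARGS.get(key, key): value for key, value in kwargs.items()}


# Placeholder values for illustration only.
v1_kwargs = {"entry_point": "train-mabs.py", "image_name": "<image>", "train_instance_count": 1}
v2_kwargs = migrate_estimator_kwargs(v1_kwargs)
print(sorted(v2_kwargs))
```

The translated dictionary could then be unpacked into the estimator constructor, which keeps the notebook cell readable while the argument names are being updated.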

MXBoard Integration in the Example is Broken

Summary of Bug
When you try to use the Writer param in examples/train.py so you can view the training process with MXBoard/Tensorboard, the program crashes.

Steps to reproduce:

  1. Run: python3 train.py --should_render --writer --run_name test_1 --render_steps 100
  2. Notice ->

Traceback (most recent call last):
  File "train.py", line 193, in <module>
    run(seed, args)
  File "train.py", line 42, in run
    writer = SummaryWriter("logs/{}-seed{}".format(run_name, seed), verbose=False)
NameError: name 'run_name' is not defined
and/or
Traceback (most recent call last):
  File "train.py", line 193, in <module>
    run(seed, args)
  File "train.py", line 91, in run
    args.writer, args.print_progress)
  File "/Users/Kshahzada/Documents/GitHub/sagemaker-battlesnake-ai/TrainingEnvironment/examples/dqn_run.py", line 85, in trainer
    writer.add_scalar("rewards_{}".format(i), score[i], i_episode)
AttributeError: 'bool' object has no attribute 'add_scalar'

Expected Output
Should start creating logs for Tensorboard to use.

Actual Output
Crashes.
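
Both tracebacks point at the same area: the boolean --writer flag is passed where a writer object is expected, and run_name is used without being read from the parsed arguments. A minimal sketch of one way to separate the flag from the object follows. build_writer and NullWriter are hypothetical names, and the factory is injected so the sketch does not depend on MXBoard being installed.

```python
class NullWriter:
    """Stand-in that accepts add_scalar calls and discards them."""

    def add_scalar(self, tag, value, global_step=None):
        pass


def build_writer(enabled, run_name, seed, writer_factory):
    """Return a real writer when logging is enabled, else a NullWriter.

    writer_factory is any callable that takes a log directory, for example
    a SummaryWriter class; it is passed in rather than imported here.
    """
    if not enabled:
        return NullWriter()
    return writer_factory("logs/{}-seed{}".format(run_name, seed))
```

Assuming the parsed arguments expose the flag values as args.writer and args.run_name, train.py could call build_writer(args.writer, args.run_name, seed, SummaryWriter) and pass the result to the trainer instead of the boolean.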
