
deepracer-for-cloud's Introduction

DeepRacer-For-Cloud

Provides a quick and easy way to get up and running with a DeepRacer training environment using a cloud virtual machine or a local computer, such as AWS EC2 Accelerated Computing instances or Azure N-Series Virtual Machines.

DRfC runs on Ubuntu 20.04 or 22.04. GPU acceleration requires an NVIDIA GPU, preferably with more than 8GB of VRAM.

Introduction

DeepRacer-For-Cloud (DRfC) started as an extension of the work done by Alex (https://github.com/alexschultz/deepracer-for-dummies), which is in turn a wrapper around the amazing work done by Chris (https://github.com/crr0004/deepracer). With the introduction of the second-generation DeepRacer console, the repository has been split up: this repository contains the scripts needed to run the training, but depends on Docker Hub to provide pre-built docker images. All the under-the-hood building capabilities are in the Deepracer Simapp repository.

Main Features

DRfC supports a wide set of features to ensure that you can focus on creating the best model:

  • User-friendly
    • Based on the continuously updated community Robomaker container, supporting a wide range of CPU and GPU setups.
    • Wide set of scripts (dr-*) enables effortless training; a short workflow sketch follows this list.
    • Detection of your AWS DeepRacer Console models; allows upload of a locally trained model to any of them.
  • Modes
    • Time Trial
    • Object Avoidance
    • Head-to-Bot
  • Training
    • Multiple Robomaker instances per Sagemaker (N:1) to improve training progress.
    • Multiple training sessions in parallel - each being (N:1) if hardware supports it - to test out different things simultaneously.
    • Connect multiple nodes together (Swarm-mode only) to combine the power of multiple computers/instances.
  • Evaluation
    • Evaluate independently from training.
    • Save evaluation run to MP4 file in S3.
  • Logging
    • Training metrics and trace files are stored to S3.
    • Optional integration with AWS CloudWatch.
    • Optional exposure of Robomaker internal log-files.
  • Technology
    • Supports both Docker Swarm (used for connecting multiple nodes together) and Docker Compose
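
As mentioned in the features list above, day-to-day use goes through the dr-* scripts. A minimal sketch of a typical session, assuming the standard DRfC layout where bin/activate.sh loads the environment (the exact script set may vary by version):

# hedged sketch of a typical DRfC workflow
source bin/activate.sh     # load system.env/run.env into the shell
dr-upload-custom-files     # push reward function and hyperparameters to the bucket
dr-start-training          # launch the Sagemaker and Robomaker containers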

Documentation

Full documentation can be found on the Deepracer-for-Cloud GitHub Pages.

Support

  • For general support it is suggested to join the AWS DeepRacer Community. The Community Slack has a channel #dr-training-local where the community provides active support.
  • Create a GitHub issue if you find an actual code issue, or where updates to documentation would be required.

deepracer-for-cloud's People

Contributors

abdelrhman-m, alexlenk, alexschultz, anjrew, asdafers, breadcentric, cahya-wirawan, coulterstutz, dafrost22, daj, dartjason, jbklopfenstein, jezid001, jgamblin, jochem725, lacan82, larsll, markross-eviden, mattcamp, mayurmadnani, mkreder, noindyfikator, oberfrank-rezso, spatraso, therayg, vovikdrg, warp, yyao84


deepracer-for-cloud's Issues

minio GUI disabled by default in system.env

Did a fresh/clean install using all new pulls. The MinIO GUI on port 9000 will not authenticate. The profile is set and the login page loads, but it rejects all authentication attempts.

The setting of
DR_GUI_ENABLE=False
in the default system.env is the culprit. Setting it to
DR_GUI_ENABLE=True
enables MinIO to authenticate at login

Suggested fixes:

  • Additional documentation to call this setting out
  • Update creation of system.env to set it to True
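
A minimal illustration of the proposed fix; the variable name comes straight from this issue, while the reload step assumes the usual DRfC activation flow:

# in system.env
DR_GUI_ENABLE=True
# re-source the environment so the setting takes effect
source bin/activate.sh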

Local training logs access

Started to train my model locally on a Linux machine. I was wondering where I can see the output of the print statements that I wrote in the reward function.
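
The thread has no answer, but a hedged pointer: reward-function print output ends up in the Robomaker container's log stream, so a command mirroring the docker logs pattern used elsewhere on this page should show it:

docker logs -f $(docker ps | awk ' /robomaker/ { print $1 }')   # follow the robomaker log, including reward-function prints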

Adding the Plotly in to log-analysis docker image

Hi Alex,

Could you please add Plotly (https://plot.ly/python/getting-started/) to the aschu/log-analysis Docker image? Plotly is a really nice graphing library that can be used instead of matplotlib. I use it now for interactive track visualisation (aws-deepracer-community/aws-deepracer-workshops#11). Currently my notebook uses a Docker image I cloned from yours with Plotly added, but it would be nice if your Docker image already had this graphing library installed.

Thanks.
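
For reference, the request amounts to a two-line image extension; a hedged sketch (the derived tag name is illustrative):

cat > Dockerfile <<'EOF'
FROM aschu/log-analysis
RUN pip install plotly
EOF
docker build -t log-analysis-plotly .   # build a local variant with Plotly preinstalled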

Starting fresh, docs reference script names not in repo

Under "first run" the docs reference scripts named dr-upload-custom-files and dr-start-training, but they're not to be found in the repo anywhere. In the scripts/upload directory (after sourcing the system.env file) trying to run any of the scripts in there to upload anything, the scripts complain that the bucket name is missing. I have verified that I have minio running properly.

CUDA Key Rotation on April 28, 2022 Causes NVIDIA Docker Download Failure

There was a key rotation on NVIDIA's side on April 28th, as announced here:
https://forums.developer.nvidia.com/t/notice-cuda-linux-repository-key-rotation/212771

Due to the apt key rotation, bin/prepare.sh fails with this error:

GPG error: http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64 Release: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY F60F4B3D7FA2AF80

Proposed solution:
Add the command below after line 67 here:
https://github.com/aws-deepracer-community/deepracer-for-cloud/blob/master/bin/prepare.sh#L66

apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/7fa2af80.pub

Can't back-up model

I have been really liking deepracer-for-dummies so far, but I have a problem. Every time after I have trained a model I run back-up-training-run.sh, and it says it cannot move my model to '/media/aschu/storage/deepracer-training/backup' because it does not exist.

Should I change it to a folder that does exist, or is something going wrong?

Thanks!
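
A hedged workaround, assuming the script fails only because its target directory does not yet exist (path taken from the error message):

mkdir -p /media/aschu/storage/deepracer-training/backup   # create the expected backup target
./back-up-training-run.sh                                 # then retry the backup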

Running evaluations.

Hi,

Good day.

Been trying to figure out what is going on when I run an evaluation. My workflow is:

training/start.sh
...
training/stop.sh
...
evaluation/start.sh
...
evaluation/stop.sh

When I check the vncviewer, the car behaves completely differently to what I read in the log analysis. So I checked the eval logs for the evaluation dockers, and it seems that it is not using the latest snapshot. Not only that, it overwrites the checkpoint. Before eval my checkpoint file is:

model_checkpoint_path: "85_Step-215377.ckpt"
all_model_checkpoint_paths: "81_Step-197464.ckpt"
all_model_checkpoint_paths: "82_Step-201475.ckpt"
all_model_checkpoint_paths: "83_Step-205622.ckpt"
all_model_checkpoint_paths: "84_Step-210502.ckpt"
all_model_checkpoint_paths: "85_Step-215377.ckpt"

after:

model_checkpoint_path: "0_Step-0.ckpt"
all_model_checkpoint_paths: "0_Step-0.ckpt"

I think I am missing a step here. Could you please let me know what your workflow is like? Thanks, and kudos for the repo :).

Regards.
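
Until the overwrite behaviour is understood, a hedged safeguard is to snapshot the model directory before evaluating; the bucket path below is an assumption based on the MinIO volume layout mentioned elsewhere on this page:

cp -r docker/volumes/minio/bucket/rl-deepracer-pretrained/model model-backup-$(date +%s)   # keep a copy of the checkpoints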

'sh: 1: !!: not found' showing in log output terminal

Hi,
This is my first time doing this and I am not an advanced Linux user. I noticed that when I run ./start.sh the following error message is shown in the terminal where the log output is shown: 'sh: 1: !!: not found'. I also noticed that I don't have the sagemaker folder in docker/volumes/minio/bucket.

Unable to change track for local training

Whenever I change the value of DR_WORLD_NAME in the run.env file before training, the model still shows the reinvent_2018 track when I upload it to the AWS DeepRacer console after training, even though I changed the track to something else.

ERROR: pull access denied for aschu/rl_coach

I get this error message when running start.sh:
ERROR: The image for the service you're trying to recreate has been removed. If you continue, volume data could be lost. Consider backing up your data before continuing.

Continue with the new image? [yN]y
Pulling rl_coach (aschu/rl_coach:)...
ERROR: pull access denied for aschu/rl_coach, repository does not exist or may require 'docker login': denied: requested access to the resource is denied

Default hyperparameters

This isn't a bug or issue actually, but rather a question.

This is the DRfC default hyperparameters.json:

{
    "batch_size": 64,
    "beta_entropy": 0.01,
    "discount_factor": 0.995,
    "e_greedy_value": 0.05,
    "epsilon_steps": 10000,
    "exploration_type": "categorical",
    "loss_type": "huber",
    "lr": 0.0003,
    "num_episodes_between_training": 20,
    "num_epochs": 10,
    "stack_size": 1,
    "term_cond_avg_score": 350.0,
    "term_cond_max_episodes": 1000,
    "sac_alpha": 0.2
}

I was wondering how you came up with the values for e_greedy_value, epsilon_steps, exploration_type, stack_size, term_cond_avg_score, term_cond_max_episodes, and sac_alpha. Or where can I find the AWS DeepRacer official defaults for these values? I found defaults in the AWS DeepRacer documentation for the other values.

Thanks

Will this project work for Mac OSX?

Is this supported on a Mac? I didn't see Mac support mentioned in the README or any issues, and wanted to check if this will work before I go too far down the path of installing all the prerequisites.

The underlying project appears to be supported on Mac, though some custom steps are required, as listed in these guides:
aws-deepracer-community/deepracer-core#11 (comment)
https://github.com/kevinmarlis/deep-racer/blob/master/Mac-Local-Training-Installation.md

I'm using macOS Mojave 10.14.6 with Intel Iris Pro 1536MB.

Cannot load pre-trained model

Missing comma in deepracer-for-dummies/overrides/rl_deepracer_coach_robomaker.py at line 129. This causes an error without any log output; Sagemaker waits infinitely to connect to the Redis server.

the initialization stops when trying to create the agent

everything runs smoothly without errors but it freezes at that point of creating agent and I don't know what part of the project should I be looking into to solve this or what to google, please help ASAP
it worked before for like 20 episodes and gave an error then
Screenshot from 2019-09-26 16-12-59

Unable to find deepracer checkpoint json

Hello, while trying to start training the default model, after the whole configuration, when I run dr-start-training this is the place where it all gets stuck. No idea what to do; I would appreciate any help.
[Screenshot: error]

cannot start rl_coach , read only file system

minio is up-to-date
Starting rl_coach ... error

ERROR: for rl_coach Cannot start service rl_coach: error while creating mount source path '/robo/container': mkdir /robo: read-only file system

ERROR: for rl_coach Cannot start service rl_coach: error while creating mount source path '/robo/container': mkdir /robo: read-only file system
ERROR: Encountered errors while bringing up the project.
waiting for containers to start up.

ubuntu 18.04 lts

docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
1a11f6a7c1b8 minio/minio "/usr/bin/docker-ent…" 48 minutes ago Up About an hour 0.0.0.0:9000->9000/tcp minio

docker version

Client: Docker Engine - Community
Version: 19.03.4
API version: 1.38 (downgraded from 1.40)
Go version: go1.12.10
Git commit: 9013bf583a
Built: Fri Oct 18 15:54:09 2019
OS/Arch: linux/amd64
Experimental: false

Server:
Engine:
Version: 18.06.1-ce
API version: 1.38 (minimum version 1.12)
Go version: go1.10.4
Git commit: e68fc7a
Built: Tue May 7 17:57:34 2019
OS/Arch: linux/amd64

dpkg -l | grep nvidia-docker

ii nvidia-docker2 2.2.2-1 all nvidia-docker CLI wrapper

nvidia-docker version
NVIDIA Docker: 2.2.2
Client: Docker Engine - Community
Version: 19.03.4
API version: 1.38 (downgraded from 1.40)
Go version: go1.12.10
Git commit: 9013bf583a
Built: Fri Oct 18 15:54:09 2019
OS/Arch: linux/amd64
Experimental: false

Server:
Engine:
Version: 18.06.1-ce
API version: 1.38 (minimum version 1.12)
Go version: go1.10.4
Git commit: e68fc7a
Built: Tue May 7 17:57:34 2019
OS/Arch: linux/amd64
Experimental: false

Are there any parameters to set the evaluation to use a certain model?

Hi Alex
First of all, this does a really good job of saving money; thank you.

I have run the training and log-analysis smoothly, but when I run evaluation ./start.sh, I get poor results in Gazebo; the car cannot follow the track. I want to know whether there are any parameters to make the evaluation use a certain model, for example 'model_24.pb' or 'model_25.pb' in the folder '/bucket/rl-deepracer-pretrained/model'.

Right now my operations are: 1. after training, I execute 'set-last-run-to-pretrained.sh'; 2. I set the pretrained model in the file 'rl_deepracer_coach_robomaker.py', then start the evaluation. Are these operations right?

thanks
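
A later issue on this page notes that evaluation follows the checkpoint file on the MinIO server, so a hedged approach is to rewrite that file's first entry to name the snapshot you want; the file location, name, and checkpoint label below are all assumptions:

sed -i '1s/.*/model_checkpoint_path: "24_Step-XXXXX.ckpt"/' docker/volumes/minio/bucket/rl-deepracer-pretrained/model/.coach_checkpoint   # pin evaluation to one snapshot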

Failure in robomaker on new install on Ubuntu 18.04 LTS

I started with a clean Ubuntu 18.04 LTS install, and used the deepracer_for_dummies install, which ran smoothly.

But when I run the ./start.sh script I see some errors in the robomaker logs. It seems like I am pretty close to getting this working, as I do get a vncviewer window and a couple of other terminal windows.

I am running on a Dell T7500 with 24 GB memory and a GeForce GTX 1070 from Gigabyte with 6GB of memory, and all of the checks that were part of the install looked like they worked correctly to me. Other logs look clean as far as I can tell.

Any help getting local deepracer training running would be appreciated.

The docker log robomaker looks like this:

auto-starting new master
process[master]: started with pid [876]
ROS_MASTER_URI=http://localhost:11311
the rosdep view is empty: call 'sudo rosdep init' and 'rosdep update'
SpawnModel script started
[INFO] [1578153836.596603, 0.000000]: Loading model XML from ros parameter
[INFO] [1578153836.600068, 0.000000]: Waiting for service /gazebo/spawn_urdf_model
[INFO] [1578153836.601705, 0.000000]: Calling service /gazebo/spawn_urdf_model
[INFO] [1578153836.745759, 0.393000]: Spawn status: SpawnModel: Successfully spawned entity

export PYTHONUNBUFFERED=1
PYTHONUNBUFFERED=1
python3 -m markov.rollout_worker
/app/robomaker-deepracer/simulation_ws/install/deepracer_simulation/lib/deepracer_simulation/run_rollout_rl_agent.sh: line 8: 1569 Illegal instruction (core dumped) python3 -m markov.rollout_worker
================================================================================
REQUIRED process [agent-9] has died!
process has died [pid 1449, exit code 132, cmd /app/robomaker-deepracer/simulation_ws/install/deepracer_simulation/lib/deepracer_simulation/run_rollout_rl_agent.sh __name:=agent __log:=/root/.ros/log/cda39cbc-2f0b-11ea-a1a4-0242ac120004/agent-9.log].
log file: /root/.ros/log/cda39cbc-2f0b-11ea-a1a4-0242ac120004/agent-9*.log
Initiating shutdown!

[ INFO] [1578153835.827480371]: Finished loading Gazebo ROS API Plugin.
[ INFO] [1578153835.830352916]: waitForService: Service [/gazebo/set_physics_properties] has not been advertised, waiting...
[ INFO] [1578153836.270740797, 0.034000000]: waitForService: Service [/gazebo/set_physics_properties] is now available.
[ INFO] [1578153836.349773497, 0.111000000]: Physics dynamic reconfigure ready.
Traceback (most recent call last):
.........
Then further down in the file is this error:

The VNC desktop is: b44e0933fcf2:0

Have you tried the x11vnc '-ncache' VNC client-side pixel caching feature yet?

The scheme stores pixel data offscreen on the VNC viewer side for faster
retrieval. It should work with any VNC viewer. Try it by running:

x11vnc -ncache 10 ...
One can also add -ncache_cr for smooth 'copyrect' window motion.
More info: http://www.karlrunge.com/x11vnc/faq.html#faq-client-caching

PORT=5900
[racecar/controller_manager-5] escalating to SIGTERM
[gazebo-2] escalating to SIGTERM
[WARN] [1578153853.995580, 1.688000]: Controller Spawner error while taking down controllers: transport error completing service call: receive_once[/racecar/controller_manager/switch_controller]: unexpected error [Errno 4] Interrupted system call
[INFO] [1578153836.595853, 0.000000]: Controller Spawner: Waiting for service controller_manager/load_controller
[INFO] [1578153837.804389, 0.536000]: Controller Spawner: Waiting for service controller_manager/switch_controller
[INFO] [1578153837.806284, 0.538000]: Controller Spawner: Waiting for service controller_manager/unload_controller

VNC Viewer the connection close unexpectedly when run ./start.sh

When I try to run ./start.sh, VNC is not running and closes unexpectedly.

Here is the log:

vortana@vortana-System-Product-Name:~/Documents/awsdeepracer/deepracer-for-dummies/scripts/training$ ./start.sh 
minio is up-to-date
Recreating rl_coach ... done
Recreating robomaker ... done
waiting for containers to start up...
Attempting to pull up sagemaker logs...
# Option “-x” is deprecated and might be removed in a later version of gnome-terminal.
# Use “-- ” to terminate the options and put the command line to execute after it.
Attempting to open vnc viewer...
# Option “-x” is deprecated and might be removed in a later version of gnome-terminal.
# Use “-- ” to terminate the options and put the command line to execute after it.

[Screenshot from 2019-09-18 08-42-22]

What should I do?

Locally trained model not valid for import on DeepRacer console after the July 2020 update

Hi,

I tried to set up the environment and successfully trained several models.
But when I tried to import them to the AWS DeepRacer console, I got an Invalid model error status, with the description being that We can't validate your model because it's been edited.

I realized that in the middle of July 2020 there was a major update to the DeepRacer console. Now the model artifacts are no longer created on S3, but are hidden somewhere that we don't have access to, except for the logs.
I tried to follow the official document about the update:

https://docs.aws.amazon.com/deepracer/latest/developerguide/deepracer-troubleshooting-service-migration-errors.html#what-is-update

to create the necessary files from my local training environment and upload them to S3 manually.
But I just cannot get it imported to the DeepRacer console.
To be specific, I upload the following artifacts:

└── super
    ├── ip
    │   ├── done
    │   ├── hyperparameters.json
    │   └── ip.json
    ├── model
    │   ├── .coach_checkpoint
    │   ├── 16_Step-32917.ckpt.data-00000-of-00001
    │   ├── 16_Step-32917.ckpt.index
    │   ├── 16_Step-32917.ckpt.meta
    │   ├── 17_Step-37295.ckpt.data-00000-of-00001
    │   ├── 17_Step-37295.ckpt.index
    │   ├── 17_Step-37295.ckpt.meta
    │   ├── model_16.pb
    │   ├── model_17.pb
    │   └── model_metadata.json
    ├── model_metadata.json
    └── reward_function.py

But it's not working.
I have also tried several combinations, for example deleting the model_metadata.json under root, keeping only hyperparameters.json under ip, removing the *.pb files...

Nothing works. :(

Does anyone have the same issue?

aschu/rl_coach image is private

I'm getting the following error when I try to run the start.sh training script. I think this Docker image is private now, because I can't access it through Docker Hub.

Error:

latest: Pulling from minio/minio
e7c96db7181b: Already exists
3b53ed910b0e: Pulling fs layer
3b53ed910b0e: Pull complete
8a1bd2c467e1: Pull complete
b2f779e5db5f: Pull complete
Digest: sha256:8b4f4c0de3ab6d7b160fd2fc015665ace73fb80329c892a96f7f7a2f56426b3d
Status: Downloaded newer image for minio/minio:latest
Pulling rl_coach (aschu/rl_coach:)...
ERROR: The image for the service you're trying to recreate has been removed. If you continue, volume data could be lost. Consider backing up your data before continuing.

Continue with the new image? [yN]y
Pulling rl_coach (aschu/rl_coach:)...
ERROR: pull access denied for aschu/rl_coach, repository does not exist or may require 'docker login'
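
A hedged alternative, since a community-maintained replacement image is referenced with an explicit tag later on this page:

docker pull awsdeepracercommunity/deepracer-rlcoach:4.0.6   # community-built substitute for aschu/rl_coach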

Running Evaluation

I was able to run the evaluation.launch file present in the deepracer_simulation package.

It requires the environment variable NUMBER_OF_TRIALS to be set.

The evaluation can then be started with roslaunch deepracer_simulation evaluation.launch.

Which model it chooses for evaluation depends on the model referenced in checkpoint.txt on the minio server.

It will be helpful to have the evaluation function added.
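
A minimal sketch of the launch described above (the trial count is illustrative):

export NUMBER_OF_TRIALS=5                          # required by evaluation.launch
roslaunch deepracer_simulation evaluation.launch   # starts the evaluation run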

dr-download-model error when downloading from default S3 export path

The default export path from the DeepRacer console is:
s3://aws-deepracer-assets-<uid>/<model>/<timestamp>/,
where timestamp is in the format %a, %d %b %Y %H:%M:%S %Z e.g. Wed, 27 Oct 2021 19:42:10 GMT

dr-download-model doesn't preserve the -s source-url string all the way through the script, so the aws s3 sync command fails.
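
Until the script preserves the URL, a hedged manual equivalent is to quote the path so the spaces and commas in the timestamp survive word splitting:

aws s3 sync "s3://aws-deepracer-assets-<uid>/<model>/Wed, 27 Oct 2021 19:42:10 GMT/" ./model-export/   # quoting keeps the timestamp intact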

Sagemaker frozen after printing "Checkpoint> Saving in path=['./checkpoint/agent/0_Step-0.ckpt']"

Hi there. I'm running on Ubuntu 18.04 with an NVIDIA Tesla P4 GPU. I've managed to get the containers running for dr-start-training with default configs and local mode. However, the log froze after printing this message:

Checkpoint> Saving in path=['./checkpoint/agent/0_Step-0.ckpt']

I checked the checkpoint path and all the files were not changed at all (neither size nor modified time), while the CPU remained at 100% consumption by the python process.

I've located the source of this log message to line 61 of training_worker.py

# save initial checkpoint                                           
graph_manager.save_checkpoint()

but couldn't debug further from there.

Any idea what could be causing this problem?

Thanks

The rl_coach container is not stopping even though the limit has been reached

Hi,
I was trying to automate the training process and restart it once certain limits are reached.
I tried it with the following script:

#!/usr/bin/env bash
while true
do
        # stop any containers left over from the previous run
        docker kill $(docker ps -q)
        ./start.sh
        # promote the finished run to pretrained, then remove it
        sudo `pwd`/set-last-run-to-pretrained.sh
        sudo `pwd`/delete-last-run.sh
        ./stop.sh
        sleep 10
done

My hypothesis was that the training container would exit once the limit is reached ("term_cond_max_episodes": 400).
Unfortunately it doesn't.
Do you have any suggestion?
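
One hedged suggestion: rather than assuming start.sh blocks until training finishes, wait explicitly for the training container to exit before promoting the run:

docker wait $(docker ps -q --filter name=rl_coach)   # blocks until the rl_coach container stops on its own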

AgentsVideoEditor._mp4_queue['0'] is empty

when I run "dr-start-training"
the deepracer-robomaker keeps reporting
AgentsVideoEditor._mp4_queue['0'] is empty. Retrying...
and the deepracer-sagemaker stays at
Checkpoint> Saving in path=['./checkpoint/agent/0_Step-0.ckpt']

I want to know why.

minio and portainer both run on port 9000 - support running minio outside of swarm?

The local init.sh script creates a Docker swarm cluster. I already have a swarm cluster and I'd like to run the training on it, but I use Portainer to manage it (for other projects). Portainer and MinIO both want to use port 9000, so I'm running MinIO on a different host. Can you alter the script/environment to support running MinIO on a different server than the Docker swarm cluster IPs?
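
As an interim workaround, a hedged option is to publish MinIO on a different host port yourself (9001 here is arbitrary):

docker run -d -p 9001:9000 minio/minio server /data   # MinIO reachable on host port 9001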

"Sagemaker is not running."

I followed the video tutorial on this, and I keep getting the error that Sagemaker is not running. I have pulled the correct image and have checked the system.env file to make sure I have the right image. Can anyone please help me with this?

Log analysis

Hi, when I run log analysis it always shows me 799 episodes and a lower iteration number.
Is the log loaded in the notebook correct, or must I change the file name?
If I must change the file name, where are the correct logs located?
Thanks for your help.

Could not connect to the endpoint URL

I am getting:

fatal error: Could not connect to the endpoint URL: "http://localhost:9000/bucket?list-type=2&prefix=custom_files%2F&encoding-type=url" when running dr-upload-custom-files

AND

botocore.exceptions.EndpointConnectionError: Could not connect to the endpoint URL: "http://localhost:9000/bucket/custom_files/reward_function.py"
Creating Robomaker configuration in s3://bucket/rl-deepracer-sagemaker/training_params.yaml
Updating service deepracer-0_rl_coach (id: kjx3z9p2qxdzzddudkr6ul05q)
Updating service deepracer-0_robomaker (id: w3vkg5xmo4wgcd5idljlo1y9y)
Waiting up to 15 seconds for Sagemaker to start up...
Sagemaker is not running.

when running dr-start-training

Already tried changing the docker-compose-local.yml to minio/minio:RELEASE.2022-05-08T23-50-31Z

TIA.
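
Since both errors say localhost:9000 is unreachable, a hedged first check is MinIO's liveness endpoint:

curl -I http://localhost:9000/minio/health/live   # should return HTTP 200 if MinIO is actually up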

Running on local server, can't connect to sagemaker logs

Thanks for these tools. I'm a little new to all this, but I'm trying to get this to run on a local server. When I try to connect to the sagemaker logs using this command from start.sh:

docker logs -f $(docker ps | awk ' /sagemaker/ { print $1 }')

It doesn't find any containers matching /sagemaker/. It appears docker-compose is starting these 3 containers: minio, rl_coach, and robomaker. I'm confused about where sagemaker is running and how to connect to its logs. Any pointers on what I'm missing? I can VNC into robomaker without any issues.
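
A hedged note: in the sagemaker local-mode pattern this tooling builds on, the rl_coach container spawns the sagemaker container a short while after start, so it may simply not exist yet; listing all containers, including exited ones, can confirm:

docker ps -a --format '{{.Names}}\t{{.Status}}'   # look for a sagemaker container and its state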

CUDA_ERROR_OUT_OF_MEMORY:

## Creating agent - name: agent
Traceback (most recent call last):
  File "training_worker.py", line 252, in <module>
    main()
  File "training_worker.py", line 247, in main
    memory_backend_params=memory_backend_params
  File "training_worker.py", line 68, in training_worker
    graph_manager.create_graph(task_parameters)
  File "/usr/local/lib/python3.6/dist-packages/rl_coach/graph_managers/graph_manager.py", line 153, in create_graph
    self.create_session(task_parameters=task_parameters)
  File "/usr/local/lib/python3.6/dist-packages/rl_coach/graph_managers/graph_manager.py", line 254, in create_session
    self._create_session_tf(task_parameters)
  File "/usr/local/lib/python3.6/dist-packages/rl_coach/graph_managers/graph_manager.py", line 238, in _create_session_tf
    self.sess = tf.Session(config=config)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1511, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 634, in __init__
    self._session = tf_session.TF_NewSessionRef(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InternalError: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY: out of memory; total memory reported: 2097479680

I am getting this error. Does anyone know how to solve it?

Thank you!
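
One grounded observation: the reported total memory, 2097479680 bytes, is roughly 2 GB, well under the "preferably with more than 8GB of VRAM" guidance at the top of this page, so the training session may simply not fit alongside whatever else is on the card. A hedged first step is to check what is occupying it:

nvidia-smi --query-gpu=memory.total,memory.used,memory.free --format=csv   # see how much VRAM is actually available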

Alex,

What a fantastic job you've done! I am thoroughly impressed by how easy it is to set up deepracer locally, thanks to you and others who made it happen. It couldn't be any simpler!

It literally took under 2 minutes for me to run deepracer locally after setting up all the dependencies properly - dual boot, docker, docker-compose, nvidia drivers...etc. Of course there are many dependencies, and properly setting them up definitely takes longer, but you brought all of them together nicely packaged.

I know you mentioned that it worked for you with nvidia-driver-410 driver set. In my case, I went with version 440, and I didn't see any issues at all. I hope it helps others who are setting up the environment.

Once again, great job documenting your repo!

Sagemaker is not running.

Hi, I am hitting the same issue as others. #71

1. Some issues with the Docker images.

After running sudo ./init.sh, it shows the following:

(base) lmc@carla:~/deepracer-for-cloud/bin$ sudo ./init.sh
Detected cloud type to be local
Sending build context to Docker daemon  2.048kB
Step 1/4 : FROM nvidia/cuda:10.2-base
 ---> 55c80b56bbcd
Step 2/4 : RUN apt-get update && apt-get install -y --no-install-recommends wget python3
 ---> Using cache
 ---> f025fa6dbdf7
Step 3/4 : RUN wget https://gist.githubusercontent.com/f0k/63a664160d016a491b2cbea15913d549/raw/f25b6b38932cfa489150966ee899e5cc899bf4a6/cuda_check.py
 ---> Using cache
 ---> 4b7a643484bf
Step 4/4 : CMD ["python3","cuda_check.py"]
 ---> Using cache
 ---> 2fe7d00643dc
Successfully built 2fe7d00643dc
Successfully tagged local/gputest:latest
Please run 'aws configure --profile minio' to set the credentials
4.0.6: Pulling from awsdeepracercommunity/deepracer-rlcoach
Digest: sha256:99722292f7234f9dc57ef52904de44d960ff3d1d459ddbb3326014911f595112
Status: Image is up to date for awsdeepracercommunity/deepracer-rlcoach:4.0.6
docker.io/awsdeepracercommunity/deepracer-rlcoach:4.0.6
4.0.8-cpu-avx2: Pulling from awsdeepracercommunity/deepracer-robomaker
Digest: sha256:2eb94142bddebf4661bd7a8c570b3ba7095f2a27a918e3f97c56c5559a3ed4ea
Status: Image is up to date for awsdeepracercommunity/deepracer-robomaker:4.0.8-cpu-avx2
docker.io/awsdeepracercommunity/deepracer-robomaker:4.0.8-cpu-avx2
4.0.0-gpu: Pulling from awsdeepracercommunity/deepracer-sagemaker
Digest: sha256:867ecdd6375855e02ad1232ed8f341086109d12d806c7fa7bc930b93f4dd5297
Status: Image is up to date for awsdeepracercommunity/deepracer-sagemaker:4.0.0-gpu
docker.io/awsdeepracercommunity/deepracer-sagemaker:4.0.0-gpu
Error response from daemon: This node is already part of a swarm. Use "docker swarm leave" to leave this swarm and join another one.
ekuysq2js6awlqd0y80zas5g5
ekuysq2js6awlqd0y80zas5g5
Error response from daemon: rpc error: code = FailedPrecondition desc = network 6f1waq1pmy5xsvqovyvlwvtkx is in use by service 892woein99bc25wobyz4vs6b4
Error response from daemon: network with name sagemaker-local already exists

And I have checked the existing docker images, the images can be correctly built.

REPOSITORY                                  TAG              IMAGE ID       CREATED        SIZE
minio/minio                                 <none>           ecfbb387b46a   5 days ago     261MB
local/gputest                               latest           2fe7d00643dc   5 days ago     171MB
minio/minio                                 <none>           c40e60ad4853   7 days ago     259MB
hello-world                                 latest           feb5d9fea6a5   3 weeks ago    13.3kB
awsdeepracercommunity/deepracer-robomaker   4.0.8-cpu-avx2   d2ab8fd81e58   6 weeks ago    4.4GB
awsdeepracercommunity/deepracer-rlcoach     4.0.6            085ed8735130   3 months ago   757MB
nvidia/cuda                                 10.2-base        55c80b56bbcd   3 months ago   107MB
awsdeepracercommunity/deepracer-sagemaker   4.0.0-gpu        2fc6675edd10   5 months ago   3.95GB

Then I set up my aws configuration and started to train.

2. no module named boto3

The first time I ran dr-start-training, it reported a missing module, so I just pip installed boto3; then after waiting for about 15 seconds, it showed Sagemaker is not running. Then I tried to reboot my machine and run dr-start-training-w again, and it shows:

Wiping path s3://bucket/rl-deepracer-sagemaker.
delete: s3://bucket/rl-deepracer-sagemaker/reward_function.py
delete: s3://bucket/rl-deepracer-sagemaker/training_params.yaml
Traceback (most recent call last):
 File "/home/lmc/deepracer-for-cloud/scripts/training/prepare-config.py", line 105, in <module>
   s3_client.copy(copy_source, Bucket=s3_bucket, Key=reward_function_key)
 File "/home/lmc/anaconda3/lib/python3.8/site-packages/boto3/s3/inject.py", line 380, in copy
   return future.result()
 File "/home/lmc/anaconda3/lib/python3.8/site-packages/s3transfer/futures.py", line 106, in result
   return self._coordinator.result()
 File "/home/lmc/anaconda3/lib/python3.8/site-packages/s3transfer/futures.py", line 265, in result
   raise self._exception
 File "/home/lmc/anaconda3/lib/python3.8/site-packages/s3transfer/tasks.py", line 126, in __call__
   return self._execute_main(kwargs)
 File "/home/lmc/anaconda3/lib/python3.8/site-packages/s3transfer/tasks.py", line 150, in _execute_main
   return_value = self._main(**kwargs)
 File "/home/lmc/anaconda3/lib/python3.8/site-packages/s3transfer/copies.py", line 289, in _main
   client.copy_object(
 File "/home/lmc/anaconda3/lib/python3.8/site-packages/botocore/client.py", line 388, in _api_call
   return self._make_api_call(operation_name, kwargs)
 File "/home/lmc/anaconda3/lib/python3.8/site-packages/botocore/client.py", line 708, in _make_api_call
   raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling the CopyObject operation: Access Denied.
Creating Robomaker configuration in s3://bucket/rl-deepracer-sagemaker/training_params.yaml
Updating service deepracer-0_rl_coach (id: vx09d0wayr64t9hf0ladvx12i)
Updating service deepracer-0_robomaker (id: 892woein99bc25wobyz4vs6b4)
Waiting up to 15 seconds for Sagemaker to start up...
Sagemaker is not running.

I guess there must be some mistake with boto3. I have checked the existing containers and found that the sagemaker container is not running. After running dr-logs-sagemaker, it also shows Sagemaker is not running. Please tell me if I am wrong. Thanks!

Training doesn't occur / car won't move

Hello,

I'm running into an issue where everything loads and I'm not getting any errors in regard to my installation, but the car will not move (even with the default reward function).

I'm running this on Ubuntu 18.04 and using the installation instructions.
From what I can tell, it doesn't look like the GPU is being utilized even though CUDA is installed and the CUDNN is also installed. I tested tensorflow-gpu and it detects and can run off my CPU.

The PC has a 1080 Ti and dual Xeon processors. I'll try installing this on another PC with a 2080 Ti and an i5 processor. Unfortunately, I'm at a loss.

One other characteristic I noticed was that the robomaker environment seems to fail as well after a period of time, similar to issue 29.
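
A hedged check for the nvidia-docker2 era setup described here (driver 410, CUDA 10.0) is to confirm the GPU is visible from inside a container at all; the image tag assumes nvidia/cuda:10.0-base is still pullable:

docker run --rm --runtime=nvidia nvidia/cuda:10.0-base nvidia-smi   # should print the same table as on the host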

Here is the terminal output from init.sh:
Cloning into &apos;deepracer&apos;... remote: Enumerating objects: 195, done. remote: Counting objects: 100% (195/195), done. remote: Compressing objects: 100% (159/159), done. remote: Total 1697 (delta 60), reused 141 (delta 36), pack-reused 1502 Receiving objects: 100% (1697/1697), 125.01 MiB | 13.37 MiB/s, done. Resolving deltas: 100% (752/752), done. Checking out files: 100% (803/803), done. Submodule &apos;deepracer_worlds&apos; (https://github.com/crr0004/deepracer_worlds.git) registered for path &apos;deepracer_worlds&apos; Submodule &apos;intel_coach&apos; (https://github.com/NervanaSystems/coach.git) registered for path &apos;intel_coach&apos; Submodule &apos;sagemaker-containers&apos; (https://github.com/crr0004/sagemaker-containers.git) registered for path &apos;sagemaker-containers&apos; Submodule &apos;sagemaker-python-sdk&apos; (https://github.com/crr0004/sagemaker-python-sdk.git) registered for path &apos;sagemaker-python-sdk&apos; Submodule &apos;sagemaker-rl-container&apos; (https://github.com/crr0004/sagemaker-rl-container) registered for path &apos;sagemaker-rl-container&apos; Submodule &apos;sagemaker-tensorflow-container&apos; (https://github.com/crr0004/sagemaker-tensorflow-container) registered for path &apos;sagemaker-tensorflow-container&apos; Cloning into &apos;/home/christian/deepracer-for-dummies/deepracer/deepracer_worlds&apos;... remote: Enumerating objects: 104, done. remote: Counting objects: 100% (104/104), done. remote: Compressing objects: 100% (77/77), done. remote: Total 104 (delta 29), reused 93 (delta 22), pack-reused 0 Receiving objects: 100% (104/104), 8.96 MiB | 12.46 MiB/s, done. Resolving deltas: 100% (29/29), done. Cloning into &apos;/home/christian/deepracer-for-dummies/deepracer/intel_coach&apos;... remote: Enumerating objects: 366, done. remote: Counting objects: 100% (366/366), done. remote: Compressing objects: 100% (206/206), done. remote: Total 8982 (delta 269), reused 236 (delta 160), pack-reused 8616 Receiving objects: 100% (8982/8982), 72.72 MiB | 13.34 MiB/s, done. Resolving deltas: 100% (6148/6148), done. Cloning into &apos;/home/christian/deepracer-for-dummies/deepracer/sagemaker-containers&apos;... remote: Enumerating objects: 1711, done. remote: Total 1711 (delta 0), reused 0 (delta 0), pack-reused 1711 Receiving objects: 100% (1711/1711), 551.66 KiB | 5.16 MiB/s, done. Resolving deltas: 100% (1076/1076), done. Cloning into &apos;/home/christian/deepracer-for-dummies/deepracer/sagemaker-python-sdk&apos;... remote: Enumerating objects: 4690, done. remote: Total 4690 (delta 0), reused 0 (delta 0), pack-reused 4690 Receiving objects: 100% (4690/4690), 52.56 MiB | 13.26 MiB/s, done. Resolving deltas: 100% (3323/3323), done. Cloning into &apos;/home/christian/deepracer-for-dummies/deepracer/sagemaker-rl-container&apos;... remote: Enumerating objects: 9, done. remote: Counting objects: 100% (9/9), done. remote: Compressing objects: 100% (7/7), done. remote: Total 263 (delta 3), reused 7 (delta 2), pack-reused 254 Receiving objects: 100% (263/263), 79.80 KiB | 2.05 MiB/s, done. Resolving deltas: 100% (123/123), done. Cloning into &apos;/home/christian/deepracer-for-dummies/deepracer/sagemaker-tensorflow-container&apos;... remote: Enumerating objects: 1443, done. remote: Total 1443 (delta 0), reused 0 (delta 0), pack-reused 1443 Receiving objects: 100% (1443/1443), 13.29 MiB | 12.19 MiB/s, done. Resolving deltas: 100% (726/726), done. 
Submodule path &apos;deepracer_worlds&apos;: checked out &apos;bcf6fa58dd80e7c694323a0e77cfad5fe0e77adf&apos; Submodule path &apos;intel_coach&apos;: checked out &apos;533bb43720311e304ce57a46851c8a100e06e2bf&apos; Submodule path &apos;sagemaker-containers&apos;: checked out &apos;d11ec7fa61b613d1f95f33d865bf15f27dd96346&apos; Submodule path &apos;sagemaker-python-sdk&apos;: checked out &apos;68101b4f6ecf557d8963dfe59476f8982871d982&apos; Submodule path &apos;sagemaker-rl-container&apos;: checked out &apos;986990bb0425d5529b5af6ab11e34cb634da7c50&apos; Submodule path &apos;sagemaker-tensorflow-container&apos;: checked out &apos;5f5dbb62551d38a63fc3c929ad4ed575376384a2&apos; Cloning into &apos;aws-deepracer-workshops&apos;... remote: Enumerating objects: 2259, done. remote: Total 2259 (delta 0), reused 0 (delta 0), pack-reused 2259 Receiving objects: 100% (2259/2259), 37.95 MiB | 13.23 MiB/s, done. Resolving deltas: 100% (750/750), done. Branch &apos;enhance-log-analysis&apos; set up to track remote branch &apos;enhance-log-analysis&apos; from &apos;origin&apos;. Switched to a new branch &apos;enhance-log-analysis&apos; Sending build context to Docker daemon 954.6MB Step 1/19 : FROM python:3.7.3-stretch ---&gt; 34a518642c76 Step 2/19 : RUN apt-get update ---&gt; Using cache ---&gt; 161317e44174 Step 3/19 : RUN apt-get -y install apt-transport-https ca-certificates curl gnupg2 software-properties-common ---&gt; Using cache ---&gt; 9f0f56de75ab Step 4/19 : RUN curl -fsSL https://download.docker.com/linux/debian/gpg | apt-key add - ---&gt; Using cache ---&gt; 2ded7f139edf Step 5/19 : RUN add-apt-repository &quot;deb [arch=amd64] https://download.docker.com/linux/debian $(lsb_release -cs) stable&quot; ---&gt; Using cache ---&gt; 9e55cb8dc3a9 Step 6/19 : RUN apt-get update ---&gt; Using cache ---&gt; 69742361cd7e Step 7/19 : RUN apt-get -y install docker-ce ---&gt; Using cache ---&gt; 4dea68525c04 Step 8/19 : RUN mkdir /deepracer ---&gt; Using cache ---&gt; 72f152ed2329 Step 9/19 : RUN mkdir /deepracer/rl_coach ---&gt; Using cache ---&gt; fdf15b00730c Step 10/19 : RUN mkdir /deepracer/sagemaker-python-sdk ---&gt; Using cache ---&gt; 237081995311 Step 11/19 : WORKDIR /deepracer ---&gt; Using cache ---&gt; 48df21a69a9b Step 12/19 : ADD rl_coach rl_coach ---&gt; Using cache ---&gt; c6da8b1247e8 Step 13/19 : ADD sagemaker-python-sdk sagemaker-python-sdk ---&gt; Using cache ---&gt; cc2bfb23e631 Step 14/19 : RUN mkdir /root/.sagemaker ---&gt; Using cache ---&gt; 24887d149302 Step 15/19 : COPY config.yaml /root/.sagemaker/config.yaml ---&gt; Using cache ---&gt; 1aa25d961ae5 Step 16/19 : RUN mkdir /robo ---&gt; Using cache ---&gt; 113fdd14731e Step 17/19 : RUN mkdir /robo/container ---&gt; Using cache ---&gt; dd5df1231d03 Step 18/19 : RUN pip install -U sagemaker-python-sdk/ awscli ipython pandas &quot;urllib3==1.22&quot; &quot;pyyaml==3.13&quot; &quot;python-dateutil==2.8.0&quot; ---&gt; Using cache ---&gt; 6452cfaf974c Step 19/19 : CMD (cd rl_coach; ipython rl_deepracer_coach_robomaker.py) ---&gt; Using cache ---&gt; e6e5b9a9e3b8 Successfully built e6e5b9a9e3b8 Successfully tagged aschu/rl_coach:latest

Here is the output from ./start.sh for training:
./start.sh
Creating minio ... done
Creating rl_coach ... done
Creating robomaker ... done
waiting for containers to start up...
attempting to pull up sagemaker logs...

Option “-x” is deprecated and might be removed in a later version of gnome-terminal.

Use “-- ” to terminate the options and put the command line to execute after it.

attempting to open vnc viewer...

Option “-x” is deprecated and might be removed in a later version of gnome-terminal.

Use “-- ” to terminate the options and put the command line to execute after it.

Here is my nvidia-smi output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104 Driver Version: 410.104 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:03:00.0 On | N/A |
| 0% 49C P5 16W / 250W | 433MiB / 11177MiB | 1% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1808 G /usr/lib/xorg/Xorg 242MiB |
| 0 2000 G /usr/bin/gnome-shell 188MiB |
+-----------------------------------------------------------------------------+

Error in docker build - Package gnupg2 is not available

Hi, I'm getting "Package gnupg2 is not available, but is referred to by another package." after running init.sh:

~/opt/deep-racer/deepracer-for-dummies$  ./init.sh 
ln: failed to create symbolic link '/home/rmpestano/opt/deep-racer/deepracer-for-dummies/docker/volumes/.aws': File exists
fatal: destination path 'deepracer' already exists and is not an empty directory.
fatal: destination path 'aws-deepracer-workshops' already exists and is not an empty directory.
ln: failed to create symbolic link '/home/rmpestano/opt/deep-racer/deepracer-for-dummies/rl_deepracer_coach_robomaker.py': File exists
WARNING: Error loading config file: /home/rmpestano/.docker/config.json: open /home/rmpestano/.docker/config.json: permission denied
Sending build context to Docker daemon  873.5MB
Step 1/19 : FROM python:3.7.3-stretch
 ---> 34a518642c76
Step 2/19 : RUN apt-get update
 ---> Using cache
 ---> cb0353187a6a
Step 3/19 : RUN apt-get -y install apt-transport-https ca-certificates curl gnupg2 software-properties-common
 ---> Running in aa89a2a8921c
Reading package lists...
Building dependency tree...
Reading state information...
Package gnupg2 is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source
However the following packages replace it:
  dirmngr gnupg gpgv

E: Unable to locate package apt-transport-https
E: Package 'gnupg2' has no installation candidate
E: Unable to locate package software-properties-common
The command '/bin/sh -c apt-get -y install apt-transport-https ca-certificates curl gnupg2 software-properties-common' returned a non-zero code: 100
WARNING: Error loading config file: /home/rmpestano/.docker/config.json: open /home/rmpestano/.docker/config.json: permission denied

I am on Ubuntu 18.04.4 LTS using Docker version 19.03.6, build 369ce74a3c

Any help is appreciated!
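
A hedged guess at the cause: the failing install step follows a cached apt-get update layer, so the package lists inside the build may be stale; rebuilding without the cache forces a fresh update (the tag mirrors the image name in the output above):

docker build --no-cache -t aschu/rl_coach .   # refresh apt lists before the install step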
