arcc-race / deepracer-for-dummies

This project forked from aws-deepracer-community/deepracer-for-cloud

66.0 9.0 28.0 5.52 MB

A quick way to get up and running with a local DeepRacer training environment.

Dockerfile 0.85% Shell 7.05% Python 8.13% QMake 0.89% C++ 50.02% Makefile 20.91% C 12.15%

deepracer-for-dummies's Issues

Fix automated pushing of local models to the DeepRacer console

Segmentation fault (core dumped)

I receive the following error when loading up the GUI (note that this was working without the GUI):

redrum69@Linux:~/github/deepracer-for-dummies/build-gui-Desktop-Release$ ./gui
libpng warning: iCCP: known incorrect sRGB profile
libpng warning: iCCP: known incorrect sRGB profile
Segmentation fault (core dumped)

The GUI starts up, prompts for the password, and then crashes.

The Logs: Memory manager failed to start :(

Seems cannot train pretrained model

These are the reward values I expect to see:
https://i.imgur.com/MPAmU9M.png

This is what I get when I try to retrain a model:
https://i.imgur.com/GU1Tv7U.png

I've uncommented these two lines:
https://i.imgur.com/jyTutI7.png

Besides the reward, the actions during retraining also look random. I think I just started a new training run instead of retraining a pretrained model. How can I solve this? Thank you.

How can I modify docker-compose.yml?

I was trying to train my model, but then my computer reported that there was not enough storage.
It turns out that a robo folder is created during training. My root and home directories are on separate partitions, and the root partition has only 20 GB of capacity. I now want this robo folder to be created in my home directory, so I changed 'Dockerfile' and also need to change 'docker-compose.yml'. But when I try to save it, it says:
Failed to save 'docker-compose.yml': Insufficient permissions. Select 'Retry as Sudo' to retry as superuser.

Changes I have made in "Dockerfile":

RUN mkdir ~/home/mirali/Desktop/github_files/deepracer_for_dummies/robo
RUN mkdir ~/home/mirali/Desktop/github_files/deepracer_for_dummies/robo/container

Changes I have made in "docker-compose.yml":

  rl_coach:
    image: aschu/rl_coach
    env_file: .env
    container_name: rl_coach
    volumes:
    - '//var/run/docker.sock:/var/run/docker.sock'
    - '../deepracer/sagemaker-python-sdk:/deepracer/sagemaker-python-sdk'
    - '../deepracer/rl_coach:/deepracer/rl_coach'
    - '../robo/container:/robo/container'
    depends_on:
    - minio

Did I edit both files correctly, and which permission do I need to make changes to docker-compose.yml?
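Relative host paths in a compose file are resolved against the directory that contains the compose file, so it can help to check where an entry such as '../robo/container' actually lands before editing anything. The sketch below only does that path arithmetic. The function name `resolve_volume` and the example directory are hypothetical and not part of this repository.

```python
import posixpath


def resolve_volume(compose_dir, volume_spec):
    """Split a 'host:container' volume entry and resolve the host side.

    Relative host paths are joined onto compose_dir, the directory that
    holds docker-compose.yml. A trailing mode such as ':ro' is not handled
    here and would stay attached to the container path.
    """
    host, container = volume_spec.split(":", 1)
    if not posixpath.isabs(host):
        host = posixpath.join(compose_dir, host)
    return posixpath.normpath(host), container


# Hypothetical checkout location, for illustration only.
compose_dir = "/home/user/deepracer-for-dummies/docker"
print(resolve_volume(compose_dir, "../robo/container:/robo/container"))
```

If the resolved host path already sits on the partition with free space, the volume entry may not need to change at all.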

Integrate GUI with log analysis

Upload issue

Hi, when I start an upload, it always uploads only the pretrained step 0.
How can I upload my last trained model?

sagemaker is not running

I am unable to get the complete setup running properly. When I run the start.sh script, I have three containers running and two terminals pop up, one for vncviewer and another for the memory manager. Looking at the script, there should be another one for the sagemaker logs.
I checked the running Docker containers and the sagemaker one was not among them.

Also, even after I give the correct sudo password to the memory manager terminal, nothing comes up after that. After running for some time, it only prints empty lines.

I have double checked that the sagemaker-local network connection exists, the necessary docker images are present and I have nvidia drivers installed.

Below is the list of packages installed in my conda env

>> conda list
# packages in environment at /opt/miniconda3/envs/deepracer:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main  
ca-certificates           2019.5.15                     0  
certifi                   2019.6.16                py36_1  
cuda10.0                  1.0                           0    fragcolor
cudatoolkit               10.0.130                      0  
cudnn                     7.3.1                cuda10.0_0  
libedit                   3.1.20181209         hc058e9b_0  
libffi                    3.2.1                hd88cf55_4  
libgcc-ng                 9.1.0                hdf63c60_0  
libstdcxx-ng              9.1.0                hdf63c60_0  
ncurses                   6.1                  he6710b0_1  
openssl                   1.1.1c               h7b6447c_1  
pip                       19.1.1                   py36_0  
python                    3.6.9                h265db76_0  
readline                  7.0                  h7b6447c_5  
setuptools                41.0.1                   py36_0  
sqlite                    3.29.0               h7b6447c_0  
tk                        8.6.8                hbc83047_0  
wheel                     0.33.4                   py36_0  
xz                        5.2.4                h14c3975_4  
zlib                      1.2.11               h7b6447c_3  
>>

Evaluation not working

I am running ./start.sh from the evaluation folder (through the terminal) after finishing training. It looks like it starts all over again and tries to train the model. I tried starting both from a pretrained model and directly. Is there a special instruction for evaluation, or is there a bug in the code?

libQt5Xml.so.5: cannot open shared object file: No such file or directory

I got errors when starting the GUI on Ubuntu 18.04.
First:

./gui: error while loading shared libraries: libQt5WebKitWidgets.so.5: cannot open shared object file: No such file or directory

After fixing that by installing

sudo apt-get install libqt5webkit5

a second error appeared:

./gui: error while loading shared libraries: libQt5Xml.so.5: cannot open shared object file: No such file or directory

That was fixed by installing

sudo apt-get install libqt5xml5

I'm opening and closing this issue just as a reference for anyone who comes across the same problem.
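For reference, the two fixes above can be written down as a small lookup keyed on the loader error message. This is only a sketch: the table contains just the two library/package pairs reported in this issue, and the names `KNOWN_PACKAGES` and `suggest_package` are made up for the example.

```python
import re

# Library-to-package pairs taken from this issue (Ubuntu 18.04).
KNOWN_PACKAGES = {
    "libQt5WebKitWidgets.so.5": "libqt5webkit5",
    "libQt5Xml.so.5": "libqt5xml5",
}

_MISSING_LIB = re.compile(
    r"error while loading shared libraries: (\S+): cannot open shared object file"
)


def suggest_package(error_line):
    """Return (library, apt package or None), or None if the line is not a loader error."""
    match = _MISSING_LIB.search(error_line)
    if match is None:
        return None
    library = match.group(1)
    return library, KNOWN_PACKAGES.get(library)


line = ("./gui: error while loading shared libraries: libQt5Xml.so.5: "
        "cannot open shared object file: No such file or directory")
print(suggest_package(line))
```

A library that is not in the table comes back with `None` as the package, which is the cue to look it up manually.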

make paths in qt gui absolute

Current paths are relative, which requires the GUI to be run from the release build folder. This is not good.

Fix the reward graph not refreshing correctly

Checkpoint memory management

Training keeps filling up my computer's memory (20+ GB in checkpoint data alone).
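One possible approach is to keep only the newest few checkpoints and remove the rest. The sketch below only selects which files would be pruned; it does not delete anything. It assumes checkpoint files are named like 14_Step-10846.ckpt, as seen in a training log in another issue on this page. The default of three retained checkpoints and the function name `files_to_prune` are assumptions, and whether the trainer still needs older checkpoints has not been verified.

```python
import re

# Matches names such as "14_Step-10846.ckpt.index"; group 1 is the checkpoint number.
CHECKPOINT_NAME = re.compile(r"^(\d+)_Step-\d+\.ckpt")


def files_to_prune(filenames, keep=3):
    """Return the files belonging to all but the `keep` newest checkpoints."""
    by_index = {}
    for name in filenames:
        match = CHECKPOINT_NAME.match(name)
        if match:  # anything that is not a checkpoint file is left alone
            by_index.setdefault(int(match.group(1)), []).append(name)
    ordered = sorted(by_index)
    old_indices = ordered[:-keep] if keep > 0 else ordered
    return [name for index in old_indices for name in sorted(by_index[index])]


files = [
    "12_Step-9000.ckpt.index",
    "13_Step-9900.ckpt.index",
    "14_Step-10846.ckpt.data-00000-of-00001",
    "14_Step-10846.ckpt.index",
    "checkpoint",
]
print(files_to_prune(files, keep=2))
```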

Sagemaker is not starting so training is not happening.

I followed the post here: https://medium.com/@autonomousracecarclub/how-to-run-deepracer-locally-to-save-your-wallet-13ccc878687.

When I ran ./start.sh, this was the output:

Creating minio ... done
Creating rl_coach ... done
Creating robomaker ... done
waiting for containers to start up...
Attempting to pull up sagemaker logs...
# Option “-x” is deprecated and might be removed in a later version of gnome-terminal.
# Use “-- ” to terminate the options and put the command line to execute after it.
Attempting to open vnc viewer...
# Option “-x” is deprecated and might be removed in a later version of gnome-terminal.
# Use “-- ” to terminate the options and put the command line to execute after it.
Starting memory manager...
# Option “-x” is deprecated and might be removed in a later version of gnome-terminal.
# Use “-- ” to terminate the options and put the command line to execute after it.

VNC viewer and memory manager are starting but sagemaker logs are not working.

Since the command for the logs in ./start.sh is docker logs -f $(docker ps | awk ' /sagemaker/ { print $1 }'), I ran docker ps and this is the output:

CONTAINER ID        IMAGE                                 COMMAND                  CREATED             STATUS                             PORTS                    NAMES
7ee88b200450        crr0004/deepracer_robomaker:console   "/bin/bash -c './run…"   43 seconds ago      Up 41 seconds                      0.0.0.0:8080->5900/tcp   robomaker
e62f0bc2e845        aschu/rl_coach                        "/bin/sh -c '(cd rl_…"   44 seconds ago      Up 42 seconds                                               rl_coach
2748106431ff        minio/minio                           "/usr/bin/docker-ent…"   45 seconds ago      Up 43 seconds (health: starting)   0.0.0.0:9000->9000/tcp   minio

So it seems sagemaker is not running at all, and that is why no logs are shown. How can I fix this? Is something missing from start.sh or docker-compose.yml?
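To see why the log terminal stays empty, it helps to reproduce what the awk filter does with this docker ps output. The sketch below is a rough Python equivalent that works on captured text; the function name `matching_container_ids` is hypothetical, and it treats the keyword as a plain substring rather than a regular expression.

```python
def matching_container_ids(docker_ps_output, keyword):
    """Roughly mimic `docker ps | awk '/keyword/ { print $1 }'` on captured text."""
    ids = []
    for line in docker_ps_output.splitlines():
        columns = line.split()
        if columns and keyword in line:
            ids.append(columns[0])  # first column is the container ID
    return ids


# Shortened copy of the docker ps output from this issue.
captured = """CONTAINER ID   IMAGE                                 NAMES
7ee88b200450   crr0004/deepracer_robomaker:console   robomaker
e62f0bc2e845   aschu/rl_coach                        rl_coach
2748106431ff   minio/minio                           minio
"""
print(matching_container_ids(captured, "sagemaker"))
print(matching_container_ids(captured, "robomaker"))
```

With no sagemaker line in the output the filter yields nothing, so docker logs -f is called without a container ID, which would explain the missing log window.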

Can you please provide the environment name list?

If we would like to train on the re:Invent 2018 track, what would its name be?

Randomly throws error after N iteration?

Here is the error report:

/usr/local/lib/python3.6/dist-packages/numpy/core/_methods.py:26: RuntimeWarning: invalid value encountered in reduce
  return umr_maximum(a, axis, None, out, keepdims)
/usr/local/lib/python3.6/dist-packages/numpy/core/_methods.py:29: RuntimeWarning: invalid value encountered in reduce
  return umr_minimum(a, axis, None, out, keepdims)
{"simapp_exception": {"version": "1.0", "date": "2019-08-30 04:29:37.384401", "function": "training_worker", "message": "NaN detected in loss function, aborting training. Job failed!", "exceptionType": "training_worker.exceptions", "eventType": "system_error", "errorCode": "503"}}
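The last line of the report is a JSON object, so the failure reason can be pulled out of a long log programmatically. A minimal sketch, assuming each simapp_exception record sits on a single line as it does above; the function name `parse_simapp_exception` is made up for this example.

```python
import json


def parse_simapp_exception(log_line):
    """Return the inner simapp_exception dict, or None if the line is not one."""
    line = log_line.strip()
    if not line.startswith('{"simapp_exception"'):
        return None
    try:
        return json.loads(line)["simapp_exception"]
    except (ValueError, KeyError):
        # Not valid JSON (for example unescaped quotes in the message) or wrong shape.
        return None


record = parse_simapp_exception(
    '{"simapp_exception": {"version": "1.0", "date": "2019-08-30 04:29:37.384401", '
    '"function": "training_worker", "message": "NaN detected in loss function, '
    'aborting training. Job failed!", "exceptionType": "training_worker.exceptions", '
    '"eventType": "system_error", "errorCode": "503"}}'
)
print(record["function"], record["errorCode"], record["message"])
```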

Cannot train my pretrained model

https://i.imgur.com/v5SvEm6.png
I can load the pretrained model with ./start.sh. However, the reward seems to have been reset to its initial values, and the actions the car takes also look like those of a freshly initialized model.

Problem with init script creating the virtual env for log analysis

# setup venv for log analysis
cd ${SCRIPTPATH}/aws-deepracer-workshops/log-analysis
virtualenv -p python3 log-analysis.venv
source ${SCRIPTPATH}/aws-deepracer-workshops/log-analysis/log-analysis.venv/bin/activate
pip install -r requirements.txt
ipython kernel install --user --name=log-analysis.venv

Resuming training always results error

Whether through the GUI or the terminal, I have never succeeded in resuming training.
Although the GUI says that training is running successfully, the terminal does not show any progress.
The terminal displays the following (forever):

Found a lock file rl-deepracer-pretrained/model/.lock , waiting
Found a lock file rl-deepracer-pretrained/model/.lock , waiting

Also, the GUI throws a log file load error.

I have configured rl_deepracer_coach_robomaker.py and reward.py correctly.

Add local training error detection and correction

Expecting checkpoint >= 1. Waiting. | After one round in the simulator

Error log of crr0004/deepracer_robomaker:console:
`SIM_TRACE_LOG:19,102,2.9274,0.7746,-0.8749,0.52,0.40,8,0.0000,True,False,1.9954,292,69.25,1570613566.0205994

reward: 50.63380000999996
Training> Name=main_level/agent, Worker=0, Episode=20, Total reward=49.63, Steps=3038, Training iteration=0
Expecting checkpoint >= 1. Waiting.
Expecting checkpoint >= 1. Waiting.
Expecting checkpoint >= 1. Waiting.
Expecting checkpoint >= 1. Waiting.
Expecting checkpoint >= 1. Waiting.
Expecting checkpoint >= 1. Waiting.
`

Has anyone experienced this issue?
After one lap the car patiently waits at the former start position, resulting in the error log above.
I have read that it might be related to the TensorFlow setup. Is it possible to change that somehow?
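When comparing runs, it can be useful to turn the Training> lines from the log above into key/value pairs, for instance to confirm that "Training iteration" is still 0 while the simulator waits for a checkpoint. A minimal sketch; the function name `parse_training_line` is hypothetical and the format is inferred only from the single line shown in this issue.

```python
def parse_training_line(line):
    """Parse a 'Training> key=value, key=value' log line into a dict of strings."""
    prefix = "Training> "
    if not line.startswith(prefix):
        return None
    fields = {}
    for part in line[len(prefix):].split(","):
        key, separator, value = part.partition("=")
        if separator:  # skip fragments without '='
            fields[key.strip()] = value.strip()
    return fields


line = ("Training> Name=main_level/agent, Worker=0, Episode=20, "
        "Total reward=49.63, Steps=3038, Training iteration=0")
parsed = parse_training_line(line)
print(parsed["Episode"], parsed["Training iteration"])
```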

aschu/rl_coach bug!

aschu/rl_coach should be replaced in docker-compose.yml

docker search aschu does not return anything called rl_coach!

Add button on GUI for using pretrained

Training crashes after arbitrary iterations

Here is what the error message looks like:

/usr/local/lib/python3.6/dist-packages/numpy/core/_methods.py:26: RuntimeWarning: invalid value encountered in reduce
  return umr_maximum(a, axis, None, out, keepdims)
/usr/local/lib/python3.6/dist-packages/numpy/core/_methods.py:29: RuntimeWarning: invalid value encountered in reduce
  return umr_minimum(a, axis, None, out, keepdims)
{"simapp_exception": {"version": "1.0", "date": "2019-09-11 02:54:19.345983", "function": "training_worker", "message": "NaN detected in loss function, aborting training. Job failed!", "exceptionType": "training_worker.exceptions", "eventType": "system_error", "errorCode": "503"}}

Low GPU usage during training

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      8351      C   /usr/bin/python3.6                            35MiB |
+-----------------------------------------------------------------------------+

Is this normal? GPU memory usage usually stays around 50 MB. It should be more, right?

program crashes after first 20 episodes!

LOG:

/usr/local/lib/python3.6/dist-packages/numpy/core/fromnumeric.py:2957: RuntimeWarning: Mean of empty slice.
  out=out, **kwargs)
/usr/local/lib/python3.6/dist-packages/numpy/core/_methods.py:80: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
{"simapp_exception": {"version": "1.0", "date": "2019-08-12 00:45:43.968140", "function": "training_worker", "message": "An error occured while training: invalid index to scalar variable.. Job failed!.", "exceptionType": "training_worker.exceptions", "eventType": "system_error", "errorCode": "503"}}
Traceback (most recent call last):
  File "training_worker.py", line 91, in training_worker
    graph_manager.train()
  File "/usr/local/lib/python3.6/dist-packages/rl_coach/graph_managers/graph_manager.py", line 400, in train
    [manager.train() for manager in self.level_managers]
  File "/usr/local/lib/python3.6/dist-packages/rl_coach/graph_managers/graph_manager.py", line 400, in <listcomp>
    [manager.train() for manager in self.level_managers]
  File "/usr/local/lib/python3.6/dist-packages/rl_coach/level_manager.py", line 174, in train
    [agent.train() for agent in self.agents.values()]
  File "/usr/local/lib/python3.6/dist-packages/rl_coach/level_manager.py", line 174, in <listcomp>
    [agent.train() for agent in self.agents.values()]
  File "/usr/local/lib/python3.6/dist-packages/rl_coach/agents/clipped_ppo_agent.py", line 317, in train
    self.train_network(batch, self.ap.algorithm.optimization_epochs)
  File "/usr/local/lib/python3.6/dist-packages/rl_coach/agents/clipped_ppo_agent.py", line 266, in train_network
    self.value_loss.add_sample(batch_results['losses'][0])
IndexError: invalid index to scalar variable.

How to resume the training?

How do I resume training?

I tried running set-last-run-to-pretrained.sh followed by start.sh, but it still shows "module_dir": "s3://bucket/rl-deepracer-sagemaker/source/sourcedir.tar.gz". Is that correct?

How can I confirm from the terminal that I am resuming training?

CPU Training

Add support for CPU training or notify user if their GPU isn't powerful enough.

Jupyter Notebook not displaying any information

When I start with the GUI, I get this in the terminal:

(base) redrum69@matrix1:~/github/deepracer-for-dummies/build-gui-Desktop-Release$ ./gui
libpng warning: iCCP: known incorrect sRGB profile
libpng warning: iCCP: known incorrect sRGB profile
"/home/redrum69/github/deepracer-for-dummies/build-gui-Desktop-Release"
QProcess::start: Process is already running
"ln: failed to create symbolic link '/home/redrum69/github/deepracer-for-dummies/scripts/log-analysis/../../aws-deepracer-workshops/log-analysis/logs/log': Permission denied\nln: failed to create symbolic link '/home/redrum69/github/deepracer-for-dummies/scripts/log-
analysis/../../aws-deepracer-workshops/log-analysis/reward/reward.py': File exists\n[I 12:53:09.614 NotebookApp] The port 8888 is already in use, trying another port.\n[I 12:53:09.614 NotebookApp] The port 8889 is already in use, trying another port.\n[I 12:53:09.614 NotebookApp] The port 8890 is already in use, trying another port.\n[I 12:53:09.631 NotebookApp] JupyterLab extension loaded from /home/redrum69/anaconda3/lib/python3.7/site-packages/jupyterlab\n[I 12:53:09.631 NotebookApp] JupyterLab application directory is /home/redrum69/anaconda3/share/jupyter/lab\n[I 12:53:09.632 NotebookApp] Serving notebooks from local directory: /home/redrum69/github/deepracer-for-dummies/aws-deepracer-workshops/log-analysis\n[I 12:53:09.632 NotebookApp] The Jupyter Notebook is running at:\n[I 12:53:09.632 NotebookApp] http://localhost:8891/?token=73ae9d0603429f3eac7548b96365d89fb7673b4a71b211b9\n[I 12:53:09.632 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).\n[C 12:53:09.634 NotebookApp] \n    \n    To access the notebook, open this file in a browser:\n        file:///run/user/1000/jupyter/nbserver-2219-open.html\n    Or copy and paste one of these URLs:\n        http://localhost:8891/?token=73ae9d0603429f3eac7548b96365d89fb7673b4a71b211b9\n"
18
Log analysis URl:  "http://localhost:8891/?token=73ae9d0603429f3eac7548b96365d89fb7673b4a71b211b9"
"http://localhost:8891/notebooks/Training_analysis.ipynb"

And I get multiple errors in Jupyter:

ModuleNotFoundError: No module named 'shapely'
NameError: name 'la' is not defined
NameError: name 'complete_ones' is not defined
NameError: name 'complete_ones' is not defined
NameError: name 'simulation_agg' is not defined

how to run on windows

-Install Anaconda for Windows (google it). In it, install cudnn and cuda10.
-Install WSL 2 (if WSL 1 is installed, upgrade to WSL 2; how to do that can also be googled).
-Create aliases in WSL 2 for the Anaconda commands on Windows so that they pass through to the GPU on Windows (http://www.erogol.com/using-windows-wsl-for-deep-learning-development/).
-Install the NVIDIA drivers for Windows along with nvidia-cuda-toolkit for Windows (a very big file).
-Follow this guide (https://www.thomasmaurer.ch/2019/08/run-linux-containers-with-docker-desktop-and-wsl-2/) to install the Docker Desktop for Windows preview with WSL 2, so that Docker can be run from WSL and from PowerShell or cmd.
-Use either RealVNC or TightVNC for Windows.
-For the rest, just follow what this repository does, but on Windows.
-Enjoy on Windows!

Fix the reward graph

Add Training Timer dialog for GUI start button.

Training should stop automatically after a specific time (in minutes), according to the user's setting.

e.g. Training Timer (min): 5, 10, 30, 60, 90, 120
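The requested behaviour amounts to a one-shot timer that fires a stop action after the chosen number of minutes. The GUI itself is built with Qt, so the Python sketch below only illustrates the idea. `start_training_timer` and `stop_callback` are hypothetical names; in practice the callback would trigger whatever the GUI's stop button already does.

```python
import threading


def start_training_timer(minutes, stop_callback):
    """Call stop_callback once after `minutes` minutes; returns the running timer."""
    timer = threading.Timer(minutes * 60, stop_callback)
    timer.daemon = True  # do not keep the process alive just for the timer
    timer.start()
    return timer


def stop_training():
    # Placeholder for the real stop action.
    print("Training timer elapsed, stopping training")


timer = start_training_timer(30, stop_training)
timer.cancel()  # a manual stop before the deadline should cancel the timer
```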

Add local training model profiles

Once training has started, stopping it gives the error 'training stopped with status error', and the model is not saved in the local profile.

Add convenience tools for uploading local models

Add warning box for critical buttons

init
stop
restart

Support for MacOS

This looks like a great repo. The directions are clearly for Ubuntu, so I just wondered whether anyone has gotten this working on macOS and what changes need to be made. If it can be done without too much work, it might be worth updating the documentation. I'd be glad to make that update if anyone can give me some hints, as this looks very doable. Just thought I'd ask before I started.

Thanks.

ls: cannot access '/workspace/venv/logs/*.log': No such file or directory

Not sure why this is the case. Please help with this issue.

Update README and medium article to include sagemaker docker

Possibly python versioning as well?

Add tutorials and help buttons in the menu bar

how to do evaluation in the GUI?

There is no button for evaluation in the GUI.

Thanks.

log files are always empty - deepracer car does not move in gazebo

I did exactly as described in the Medium article, but unfortunately my log files are always blank. Also, sometimes the DeepRacer car does not move an inch in Gazebo and gets stuck on the green grass!

GUI output:

Starting training...
log analysis started correctly
Log analysis URL loaded: http://localhost:8984?token=2f95cf8079b3493391bdb7d35605768db36a515954786723
Local training started successfully

When I type docker ps it gives me

crr0004/sagemaker-rl-tensorflow:nvidia
crr0004/deepracer_robomaker:console
aschu/rl_coach
minio/minio

Connection closed before response received

Checkpoint> Saving in path=['./checkpoint/14_Step-10846.ckpt']
{"simapp_exception": {"version": "1.0", "date": "2019-08-04 04:38:10.380303", "function": "save_to_store", "message": "Exception [Connection was closed before we received a valid response from endpoint URL: "http://minio:9000/bucket/rl-deepracer-sagemaker/model/14_Step-10846.ckpt.data-00000-of-00001?uploadId=27a99f19-659f-4098-8917-02e4ba72e85f&partNumber=16".] occured while uploading files on S3 for checkpoint", "exceptionType": "s3_datastore.exceptions", "eventType": "system_error", "errorCode": "500"}}
{"simapp_exception": {"version": "1.0", "date": "2019-08-04 04:38:10.384121", "function": "training_worker", "message": "An error occured while training: Connection was closed before we received a valid response from endpoint URL: "http://minio:9000/bucket/rl-deepracer-sagemaker/model/14_Step-10846.ckpt.data-00000-of-00001?uploadId=27a99f19-659f-4098-8917-02e4ba72e85f&partNumber=16\".. Job failed!.", "exceptionType": "training_worker.exceptions", "eventType": "system_error", "errorCode": "503"}}
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 600, in urlopen
chunked=chunked)
File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 354, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/usr/lib/python3.6/http/client.py", line 1239, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/usr/local/lib/python3.6/dist-packages/botocore/awsrequest.py", line 125, in _send_request
method, url, body, headers, *args, **kwargs)
File "/usr/lib/python3.6/http/client.py", line 1285, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/usr/lib/python3.6/http/client.py", line 1234, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/usr/local/lib/python3.6/dist-packages/botocore/awsrequest.py", line 176, in _send_output
self.send(message_body)
File "/usr/local/lib/python3.6/dist-packages/botocore/awsrequest.py", line 236, in send
return super(AWSConnection, self).send(str)
File "/usr/lib/python3.6/http/client.py", line 983, in send
self.sock.sendall(datablock)
ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/botocore/httpsession.py", line 258, in send
decode_content=False,
File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 638, in urlopen
_stacktrace=sys.exc_info()[2])
File "/usr/local/lib/python3.6/dist-packages/urllib3/util/retry.py", line 344, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/usr/local/lib/python3.6/dist-packages/urllib3/packages/six.py", line 685, in reraise
raise value.with_traceback(tb)
File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 600, in urlopen
chunked=chunked)
File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 354, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/usr/lib/python3.6/http/client.py", line 1239, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/usr/local/lib/python3.6/dist-packages/botocore/awsrequest.py", line 125, in _send_request
method, url, body, headers, *args, **kwargs)
File "/usr/lib/python3.6/http/client.py", line 1285, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/usr/lib/python3.6/http/client.py", line 1234, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/usr/local/lib/python3.6/dist-packages/botocore/awsrequest.py", line 176, in _send_output
self.send(message_body)
File "/usr/local/lib/python3.6/dist-packages/botocore/awsrequest.py", line 236, in send
return super(AWSConnection, self).send(str)
File "/usr/lib/python3.6/http/client.py", line 983, in send
self.sock.sendall(datablock)
urllib3.exceptions.ProtocolError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "training_worker.py", line 108, in training_worker
graph_manager.save_checkpoint()
File "/usr/local/lib/python3.6/dist-packages/rl_coach/graph_managers/graph_manager.py", line 624, in save_checkpoint
data_store.save_to_store()
File "/opt/ml/code/markov/s3_boto_data_store.py", line 144, in save_to_store
raise e
File "/opt/ml/code/markov/s3_boto_data_store.py", line 100, in save_to_store
Key=self._get_s3_key(rel_name))
File "/usr/local/lib/python3.6/dist-packages/boto3/s3/inject.py", line 131, in upload_file
extra_args=ExtraArgs, callback=Callback)
File "/usr/local/lib/python3.6/dist-packages/boto3/s3/transfer.py", line 279, in upload_file
future.result()
File "/usr/local/lib/python3.6/dist-packages/s3transfer/futures.py", line 73, in result
return self._coordinator.result()
File "/usr/local/lib/python3.6/dist-packages/s3transfer/futures.py", line 233, in result
raise self._exception
File "/usr/local/lib/python3.6/dist-packages/s3transfer/tasks.py", line 126, in call
return self._execute_main(kwargs)
File "/usr/local/lib/python3.6/dist-packages/s3transfer/tasks.py", line 150, in _execute_main
return_value = self._main(**kwargs)
File "/usr/local/lib/python3.6/dist-packages/s3transfer/upload.py", line 722, in _main
Body=body, **extra_args)
File "/usr/local/lib/python3.6/dist-packages/botocore/client.py", line 357, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/usr/local/lib/python3.6/dist-packages/botocore/client.py", line 648, in _make_api_call
operation_model, request_dict, request_context)
File "/usr/local/lib/python3.6/dist-packages/botocore/client.py", line 667, in _make_request
return self._endpoint.make_request(operation_model, request_dict)
File "/usr/local/lib/python3.6/dist-packages/botocore/endpoint.py", line 102, in make_request
return self._send_request(request_dict, operation_model)
File "/usr/local/lib/python3.6/dist-packages/botocore/endpoint.py", line 137, in _send_request
success_response, exception):
File "/usr/local/lib/python3.6/dist-packages/botocore/endpoint.py", line 231, in _needs_retry
caught_exception=caught_exception, request_dict=request_dict)
File "/usr/local/lib/python3.6/dist-packages/botocore/hooks.py", line 356, in emit
return self._emitter.emit(aliased_event_name, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/botocore/hooks.py", line 228, in emit
return self._emit(event_name, kwargs)
File "/usr/local/lib/python3.6/dist-packages/botocore/hooks.py", line 211, in _emit
response = handler(**kwargs)
File "/usr/local/lib/python3.6/dist-packages/botocore/retryhandler.py", line 183, in call
if self._checker(attempts, response, caught_exception):
File "/usr/local/lib/python3.6/dist-packages/botocore/retryhandler.py", line 251, in call
caught_exception)
File "/usr/local/lib/python3.6/dist-packages/botocore/retryhandler.py", line 277, in _should_retry
return self._checker(attempt_number, response, caught_exception)
File "/usr/local/lib/python3.6/dist-packages/botocore/retryhandler.py", line 317, in call
caught_exception)
File "/usr/local/lib/python3.6/dist-packages/botocore/retryhandler.py", line 223, in call
attempt_number, caught_exception)
File "/usr/local/lib/python3.6/dist-packages/botocore/retryhandler.py", line 359, in _check_caught_exception
raise caught_exception
File "/usr/local/lib/python3.6/dist-packages/botocore/endpoint.py", line 200, in _do_get_response
http_response = self._send(request)
File "/usr/local/lib/python3.6/dist-packages/botocore/endpoint.py", line 244, in _send
return self.http_session.send(request)
File "/usr/local/lib/python3.6/dist-packages/botocore/httpsession.py", line 289, in send
endpoint_url=request.url
botocore.exceptions.ConnectionClosedError: Connection was closed before we received a valid response from endpoint URL: "http://minio:9000/bucket/rl-deepracer-sagemaker/model/14_Step-10846.ckpt.data-00000-of-00001?uploadId=27a99f19-659f-4098-8917-02e4ba72e85f&partNumber=16".

Sagemaker Log Terminal Missing

Here are the two terminal windows that show up.

I tried reinstalling Ubuntu and running it one more time, but the problem is still there. I am not sure what is going wrong.

Gazebo shutting down at start

Hey guys, first things first - great work!
Second - my Gazebo VNC screen crashes shortly after the start. It seems like the car starts at the corner of the track (off the street), drives for 1 second, and disappears, and after that Gazebo crashes. I tried reinstalling the NVIDIA CUDA drivers and everything, and it is a fresh setup of Ubuntu 18.04. Have you ever experienced a similar error?

Attached are the logs of the container:

python3 -m markov.rollout_worker
/app/robomaker-deepracer/simulation_ws/install/deepracer_simulation/lib/deepracer_simulation/run_rollout_rl_agent.sh: line 8: 1660 Illegal instruction (core dumped) python3 -m markov.rollout_worker
================================================================================
REQUIRED process [agent-9] has died!
process has died [pid 1493, exit code 132, cmd /app/robomaker-deepracer/simulation_ws/install/deepracer_simulation/lib/deepracer_simulation/run_rollout_rl_agent.sh __name:=agent __log:=/root/.ros/log/dd04facc-e6fc-11e9-98a8-0242ac120004/agent-9.log].
log file: /root/.ros/log/dd04facc-e6fc-11e9-98a8-0242ac120004/agent-9*.log
Initiating shutdown!
================================================================================
[ INFO] [1570230936.370759356]: Finished loading Gazebo ROS API Plugin.
[ INFO] [1570230936.384992129]: waitForService: Service [/gazebo/set_physics_properties] has not been advertised, waiting...
[ INFO] [1570230938.103051125, 0.033000000]: waitForService: Service [/gazebo/set_physics_properties] is now available.
[ INFO] [1570230938.834061301, 0.748000000]: Physics dynamic reconfigure ready.
[racecar/controller_manager-5] escalating to SIGTERM
[WARN] [1570230962.581206, 5.927000]: Controller Spawner error while taking down controllers: transport error completing service call: receive_once[/racecar/controller_manager/switch_controller]: unexpected error [Errno 4] Interrupted system call
[gazebo-2] escalating to SIGTERM

EDIT:
It seems that this is not a memory leak. As the number of training episodes increases, the agent is able to cover more of the track, so the training data gets bigger and the GPU can't handle it.
