arcc-race / deepracer-for-dummies
This project was forked from aws-deepracer-community/deepracer-for-cloud.
A quick way to get up and running with a local DeepRacer training environment.
Receiving the following error when loading up the GUI (note: this was working without the GUI).
redrum69@Linux:~/github/deepracer-for-dummies/build-gui-Desktop-Release$ ./gui
libpng warning: iCCP: known incorrect sRGB profile
libpng warning: iCCP: known incorrect sRGB profile
Segmentation fault (core dumped)
The GUI starts up, prompts for a password, and then crashes.
These are the values the reward is supposed to have:
https://i.imgur.com/MPAmU9M.png
This is what I get when I try to retrain a model:
https://i.imgur.com/GU1Tv7U.png
I've uncommented those two lines
https://i.imgur.com/jyTutI7.png
Besides the reward, the actions during retraining also look random. I think I just started a new training run instead of retraining the pretrained model? How can I solve this? Thank you.
I was trying to train my model when I saw messages from my computer saying there was not enough storage.
It turns out that a robo folder gets created during training. My root and home partitions are split, and the root partition has only 20 GB of capacity. I want to create this robo folder in my home directory instead, so I changed the Dockerfile, and I also have to change docker-compose.yml. But when I try to save it, it says:
Failed to save 'docker-compose.yml': Insufficient permissions. Select 'Retry as Sudo' to retry as superuser.
Changes I have made in "Dockerfile":
RUN mkdir ~/home/mirali/Desktop/github_files/deepracer_for_dummies/robo
RUN mkdir ~/home/mirali/Desktop/github_files/deepracer_for_dummies/robo/container
Changes I have made in "docker-compose.yml":
rl_coach:
image: aschu/rl_coach
env_file: .env
container_name: rl_coach
volumes:
- '//var/run/docker.sock:/var/run/docker.sock'
- '../deepracer/sagemaker-python-sdk:/deepracer/sagemaker-python-sdk'
- '../deepracer/rl_coach:/deepracer/rl_coach'
- '../robo/container:/robo/container'
depends_on:
- minio
Did I edit both files correctly, and which permissions do I need to make changes to docker-compose.yml?
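On the permissions question: the "Insufficient permissions" message just means the file is owned by root (or is not writable by your user). A minimal sketch, assuming the compose file sits in your checkout (the path below is a guess from the question; adjust it):

```shell
# Inspect ownership, then either take ownership once or edit as root.
cd ~/Desktop/github_files/deepracer_for_dummies
ls -l docker-compose.yml                      # shows the current owner and mode
sudo chown "$USER":"$USER" docker-compose.yml # take ownership so normal saves work
# alternatively, a one-off edit as root:
sudo nano docker-compose.yml
```

The editor's "Retry as Sudo" button does the same one-off root edit; chown is the lasting fix.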
Hi, when I start an upload, it always uploads only the pretrained step 0.
How can I upload my last trained model?
I am unable to have the complete setup running properly. When I run the start.sh script, I have three containers running and two terminals pop up, one for vncviewer and another for memory manager. Looking at the script, there should be another one for sagemaker logs.
I checked the docker containers running and I did not have sagemaker one there.
Also, even after I give the correct sudo password to the memory management terminal, nothing comes up after that. After running it for a while, I found it only prints an empty line.
I have double checked that the sagemaker-local network connection exists, the necessary docker images are present and I have nvidia drivers installed.
Below is the list of packages installed in my conda env
>> conda list
# packages in environment at /opt/miniconda3/envs/deepracer:
#
# Name Version Build Channel
_libgcc_mutex 0.1 main
ca-certificates 2019.5.15 0
certifi 2019.6.16 py36_1
cuda10.0 1.0 0 fragcolor
cudatoolkit 10.0.130 0
cudnn 7.3.1 cuda10.0_0
libedit 3.1.20181209 hc058e9b_0
libffi 3.2.1 hd88cf55_4
libgcc-ng 9.1.0 hdf63c60_0
libstdcxx-ng 9.1.0 hdf63c60_0
ncurses 6.1 he6710b0_1
openssl 1.1.1c h7b6447c_1
pip 19.1.1 py36_0
python 3.6.9 h265db76_0
readline 7.0 h7b6447c_5
setuptools 41.0.1 py36_0
sqlite 3.29.0 h7b6447c_0
tk 8.6.8 hbc83047_0
wheel 0.33.4 py36_0
xz 5.2.4 h14c3975_4
zlib 1.2.11 h7b6447c_3
>>
I am running ./start.sh from the evaluation folder (through the terminal) after finishing training. It looks like it starts all over and trains the model again. I tried starting both from the pre-trained model and directly. Is there a special instruction for evaluation, or is there a bug in the code?
I got errors when starting the GUI on Ubuntu 18.04.
First:
./gui: error while loading shared libraries: libQt5WebKitWidgets.so.5: cannot open shared object file: No such file or directory
and after fixing it by installing
sudo apt-get install libqt5webkit5
the second error appeared:
./gui: error while loading shared libraries: libQt5Xml.so.5: cannot open shared object file: No such file or directory
which was fixed by installing
sudo apt-get install libqt5xml5
I'm opening and closing this issue just for reference for anyone who comes across the same problem.
Current paths are relative, requiring that the GUI be run from the release build folder. This is not ideal.
Training keeps filling up my computer's disk (20+ GB in checkpoint data alone).
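A hedged cleanup sketch for the checkpoint bloat described above: list how big the model directory is, then keep only the newest few checkpoint files. The CKPT_DIR path here is an assumption, not the repo's documented layout; point it at wherever your minio bucket volume actually lives.

```shell
# Show current size, then keep only the 3 newest checkpoint files.
CKPT_DIR="${CKPT_DIR:-$HOME/deepracer-for-dummies/docker/volumes/minio/bucket/rl-deepracer-sagemaker/model}"
du -sh "$CKPT_DIR" 2>/dev/null                            # how big is it now?
ls -1t "$CKPT_DIR"/*.ckpt* 2>/dev/null | tail -n +4 | xargs -r rm -f
```

Deleting old checkpoints is only safe if you don't intend to roll back to them; keep the latest ones intact for resuming.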
I followed the post here: https://medium.com/@autonomousracecarclub/how-to-run-deepracer-locally-to-save-your-wallet-13ccc878687.
When I ran ./start.sh, this was the output:
Creating minio ... done
Creating rl_coach ... done
Creating robomaker ... done
waiting for containers to start up...
Attempting to pull up sagemaker logs...
# Option “-x” is deprecated and might be removed in a later version of gnome-terminal.
# Use “-- ” to terminate the options and put the command line to execute after it.
Attempting to open vnc viewer...
# Option “-x” is deprecated and might be removed in a later version of gnome-terminal.
# Use “-- ” to terminate the options and put the command line to execute after it.
Starting memory manager...
# Option “-x” is deprecated and might be removed in a later version of gnome-terminal.
# Use “-- ” to terminate the options and put the command line to execute after it.
VNC viewer and memory manager are starting but sagemaker logs are not working.
As this is the command in ./start.sh
for logs: docker logs -f $(docker ps | awk ' /sagemaker/ { print $1 }')
, I looked up docker ps
and this is the output:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
7ee88b200450 crr0004/deepracer_robomaker:console "/bin/bash -c './run…" 43 seconds ago Up 41 seconds 0.0.0.0:8080->5900/tcp robomaker
e62f0bc2e845 aschu/rl_coach "/bin/sh -c '(cd rl_…" 44 seconds ago Up 42 seconds rl_coach
2748106431ff minio/minio "/usr/bin/docker-ent…" 45 seconds ago Up 43 seconds (health: starting) 0.0.0.0:9000->9000/tcp minio
So it seems sagemaker is not running at all, which is why no logs are shown. How can I fix this? Is something missing from start.sh or docker-compose.yml?
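For what it's worth, the sagemaker container is launched by the rl_coach container rather than directly by docker-compose, so when it never appears the rl_coach logs usually say why (a missing sagemaker-local network, GPU/driver trouble, bad .env values). A quick diagnostic sketch:

```shell
docker ps --format '{{.Names}}'            # is a sagemaker container listed at all?
docker logs rl_coach 2>&1 | tail -n 50     # the launch error usually shows up here
docker network ls | grep sagemaker-local   # the network the trainer expects
```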
If we would like to train on the re:invent 2018 track, what would be the name for it?
Here is the error report:
/usr/local/lib/python3.6/dist-packages/numpy/core/_methods.py:26: RuntimeWarning: invalid value encountered in reduce
return umr_maximum(a, axis, None, out, keepdims)
/usr/local/lib/python3.6/dist-packages/numpy/core/_methods.py:29: RuntimeWarning: invalid value encountered in reduce
return umr_minimum(a, axis, None, out, keepdims)
{"simapp_exception": {"version": "1.0", "date": "2019-08-30 04:29:37.384401", "function": "training_worker", "message": "NaN detected in loss function, aborting training. Job failed!", "exceptionType": "training_worker.exceptions", "eventType": "system_error", "errorCode": "503"}}
https://i.imgur.com/v5SvEm6.png
I can load the pretrained model with ./start.sh. However, the reward seems to have been reset, and the actions it takes also look reinitialized.
# setup venv for log analysis
cd ${SCRIPTPATH}/aws-deepracer-workshops/log-analysis
virtualenv -p python3 log-analysis.venv
source ${SCRIPTPATH}/aws-deepracer-workshops/log-analysis/log-analysis.venv/bin/activate
pip install -r requirements.txt
ipython kernel install --user --name=log-analysis.venv
Either through gui or terminal, I have never had success resuming training.
Though gui says that training is running successfully, terminal does not show any progress.
Terminal displays (forever)
Found a lock file rl-deepracer-pretrained/model/.lock , waiting
Found a lock file rl-deepracer-pretrained/model/.lock , waiting
Also, gui throws log file load error.
I have configured rl_deepracer_coach_robomaker.py
and reward.py
correctly.
Error log of crr0004/deepracer_robomaker:console:
SIM_TRACE_LOG:19,102,2.9274,0.7746,-0.8749,0.52,0.40,8,0.0000,True,False,1.9954,292,69.25,1570613566.0205994
reward: 50.63380000999996
Training> Name=main_level/agent, Worker=0, Episode=20, Total reward=49.63, Steps=3038, Training iteration=0
Expecting checkpoint >= 1. Waiting.
Expecting checkpoint >= 1. Waiting.
Expecting checkpoint >= 1. Waiting.
Expecting checkpoint >= 1. Waiting.
Expecting checkpoint >= 1. Waiting.
Expecting checkpoint >= 1. Waiting.
Has anyone experienced this issue?
After one lap the car patiently waits at the former start position, resulting in the error log above.
I have read that it might be related to the TensorFlow setup. Is it possible to change that somehow?
aschu/rl_coach should be replaced in docker-compose.yml; docker search aschu does not return anything called rl_coach!
Here is what the error message looks like:
/usr/local/lib/python3.6/dist-packages/numpy/core/_methods.py:26: RuntimeWarning: invalid value encountered in reduce
return umr_maximum(a, axis, None, out, keepdims)
/usr/local/lib/python3.6/dist-packages/numpy/core/_methods.py:29: RuntimeWarning: invalid value encountered in reduce
return umr_minimum(a, axis, None, out, keepdims)
{"simapp_exception": {"version": "1.0", "date": "2019-09-11 02:54:19.345983", "function": "training_worker", "message": "NaN detected in loss function, aborting training. Job failed!", "exceptionType": "training_worker.exceptions", "eventType": "system_error", "errorCode": "503"}}
Wed Oct 9 09:39:44 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00 Driver Version: 418.87.00 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GRID K520 On | 00000000:00:03.0 Off | N/A |
| N/A 30C P8 17W / 125W | 46MiB / 4037MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 8351 C /usr/bin/python3.6 35MiB |
+-----------------------------------------------------------------------------+
Is this normal? The memory usage usually stays around ~50 MB. Shouldn't it be more?
LOG:
/usr/local/lib/python3.6/dist-packages/numpy/core/fromnumeric.py:2957: RuntimeWarning: Mean of empty slice.
out=out, **kwargs)
/usr/local/lib/python3.6/dist-packages/numpy/core/_methods.py:80: RuntimeWarning: invalid value encountered in double_scalars
ret = ret.dtype.type(ret / rcount)
{"simapp_exception": {"version": "1.0", "date": "2019-08-12 00:45:43.968140", "function": "training_worker", "message": "An error occured while training: invalid index to scalar variable.. Job failed!.", "exceptionType": "training_worker.exceptions", "eventType": "system_error", "errorCode": "503"}}
Traceback (most recent call last):
File "training_worker.py", line 91, in training_worker
graph_manager.train()
File "/usr/local/lib/python3.6/dist-packages/rl_coach/graph_managers/graph_manager.py", line 400, in train
[manager.train() for manager in self.level_managers]
File "/usr/local/lib/python3.6/dist-packages/rl_coach/graph_managers/graph_manager.py", line 400, in <listcomp>
[manager.train() for manager in self.level_managers]
File "/usr/local/lib/python3.6/dist-packages/rl_coach/level_manager.py", line 174, in train
[agent.train() for agent in self.agents.values()]
File "/usr/local/lib/python3.6/dist-packages/rl_coach/level_manager.py", line 174, in <listcomp>
[agent.train() for agent in self.agents.values()]
File "/usr/local/lib/python3.6/dist-packages/rl_coach/agents/clipped_ppo_agent.py", line 317, in train
self.train_network(batch, self.ap.algorithm.optimization_epochs)
File "/usr/local/lib/python3.6/dist-packages/rl_coach/agents/clipped_ppo_agent.py", line 266, in train_network
self.value_loss.add_sample(batch_results['losses'][0])
IndexError: invalid index to scalar variable.
How do I resume training?
I tried running set-last-run-to-pretrained.sh followed by start.sh, but it still shows "module_dir": "s3://bucket/rl-deepracer-sagemaker/source/sourcedir.tar.gz". Is that correct?
How can I confirm in the terminal that training is actually resuming?
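A heuristic check for the question above (the container-name pattern and log wording are assumptions, not guaranteed by the repo): if training really resumed, the sagemaker log should mention restoring a checkpoint with a step greater than 0 rather than starting from scratch.

```shell
# Find the sagemaker container and grep its log for checkpoint restores.
SM=$(docker ps --format '{{.Names}}' | grep -m1 sagemaker)
docker logs "$SM" 2>&1 | grep -iE 'restor|checkpoint' | tail -n 20
```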
Add support for CPU training or notify user if their GPU isn't powerful enough.
When I start with the Gui, I get this in the terminal:
(base) redrum69@matrix1:~/github/deepracer-for-dummies/build-gui-Desktop-Release$ ./gui
libpng warning: iCCP: known incorrect sRGB profile
libpng warning: iCCP: known incorrect sRGB profile
"/home/redrum69/github/deepracer-for-dummies/build-gui-Desktop-Release"
QProcess::start: Process is already running
"ln: failed to create symbolic link '/home/redrum69/github/deepracer-for-dummies/scripts/log-analysis/../../aws-deepracer-workshops/log-analysis/logs/log': Permission denied\nln: failed to create symbolic link '/home/redrum69/github/deepracer-for-dummies/scripts/log-
analysis/../../aws-deepracer-workshops/log-analysis/reward/reward.py': File exists\n[I 12:53:09.614 NotebookApp] The port 8888 is already in use, trying another port.\n[I 12:53:09.614 NotebookApp] The port 8889 is already in use, trying another port.\n[I 12:53:09.614 NotebookApp] The port 8890 is already in use, trying another port.\n[I 12:53:09.631 NotebookApp] JupyterLab extension loaded from /home/redrum69/anaconda3/lib/python3.7/site-packages/jupyterlab\n[I 12:53:09.631 NotebookApp] JupyterLab application directory is /home/redrum69/anaconda3/share/jupyter/lab\n[I 12:53:09.632 NotebookApp] Serving notebooks from local directory: /home/redrum69/github/deepracer-for-dummies/aws-deepracer-workshops/log-analysis\n[I 12:53:09.632 NotebookApp] The Jupyter Notebook is running at:\n[I 12:53:09.632 NotebookApp] http://localhost:8891/?token=73ae9d0603429f3eac7548b96365d89fb7673b4a71b211b9\n[I 12:53:09.632 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).\n[C 12:53:09.634 NotebookApp] \n \n To access the notebook, open this file in a browser:\n file:///run/user/1000/jupyter/nbserver-2219-open.html\n Or copy and paste one of these URLs:\n http://localhost:8891/?token=73ae9d0603429f3eac7548b96365d89fb7673b4a71b211b9\n"
18
Log analysis URl: "http://localhost:8891/?token=73ae9d0603429f3eac7548b96365d89fb7673b4a71b211b9"
"http://localhost:8891/notebooks/Training_analysis.ipynb"
And I get multiple errors in Jupyter:
ModuleNotFoundError: No module named 'shapely'
NameError: name 'la' is not defined
NameError: name 'complete_ones' is not defined
NameError: name 'complete_ones' is not defined
NameError: name 'simulation_agg' is not defined
-Install Anaconda for Windows (google it). In it, install cudnn and cuda10.
-Install WSL 2 (if WSL 1 is installed, upgrade to WSL 2; this can also be googled).
-Create aliases in WSL 2 for the Anaconda commands in Windows, passing through to the GPU in Windows (http://www.erogol.com/using-windows-wsl-for-deep-learning-development/)
-Install the NVIDIA drivers for Windows along with the nvidia-cuda-toolkit for Windows (a very big file).
-Follow this (https://www.thomasmaurer.ch/2019/08/run-linux-containers-with-docker-desktop-and-wsl-2/) to install Docker Desktop for Windows (preview) with WSL 2, so containers run in WSL as well as PowerShell or cmd.
Training should stop automatically after a specific time (in minutes) according to the user's setting.
eg: Training Timer(m): 5, 10, 30, 60, 90, 120
init
stop
restart
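Until such a timer exists in the GUI, coreutils `timeout` can approximate it from the shell. A sketch, assuming the repo's start.sh/stop.sh scripts and that killing start.sh mid-run is acceptable in your setup:

```shell
MINUTES=${MINUTES:-60}           # user-chosen duration, e.g. 5/10/30/60/90/120
timeout "${MINUTES}m" ./start.sh # SIGTERM after the chosen number of minutes
./stop.sh                        # bring the containers down cleanly afterwards
```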
This looks like a great repo -- the directions are clearly for Ubuntu -- I just wondered if anyone has gotten this working on macOS and what changes need to be made. If it can be done without too much work, it might be worth updating the documentation. I'd be glad to do that update if anyone can give me some hints, as this looks very doable. Just thought I'd ask before I started.
Thanks.
ls: cannot access '/workspace/venv/logs/*.log': No such file or directory
Not sure why this is the case; please help with this issue.
Possibly a Python versioning problem as well?
How do I run an evaluation in the GUI?
There is no button for evaluation in the GUI.
Thanks
I did exactly as described in the Medium article, but unfortunately my log files are always blank. Also, sometimes the DeepRacer does not move an inch in Gazebo; it gets stuck on the green grass!
GUI output:
Starting training...
log analysis started correctly
Log analysis URL loaded: http://localhost:8984?token=2f95cf8079b3493391bdb7d35605768db36a515954786723
Local training started successfully
When I type docker ps
it gives me
crr0004/sagemaker-rl-tensorflow:nvidia
crr0004/deepracer_robomaker:console
aschu/rl_coach
mini0/minio
Checkpoint> Saving in path=['./checkpoint/14_Step-10846.ckpt']
{"simapp_exception": {"version": "1.0", "date": "2019-08-04 04:38:10.380303", "function": "save_to_store", "message": "Exception [Connection was closed before we received a valid response from endpoint URL: "http://minio:9000/bucket/rl-deepracer-sagemaker/model/14_Step-10846.ckpt.data-00000-of-00001?uploadId=27a99f19-659f-4098-8917-02e4ba72e85f&partNumber=16".] occured while uploading files on S3 for checkpoint", "exceptionType": "s3_datastore.exceptions", "eventType": "system_error", "errorCode": "500"}}
{"simapp_exception": {"version": "1.0", "date": "2019-08-04 04:38:10.384121", "function": "training_worker", "message": "An error occured while training: Connection was closed before we received a valid response from endpoint URL: "http://minio:9000/bucket/rl-deepracer-sagemaker/model/14_Step-10846.ckpt.data-00000-of-00001?uploadId=27a99f19-659f-4098-8917-02e4ba72e85f&partNumber=16\".. Job failed!.", "exceptionType": "training_worker.exceptions", "eventType": "system_error", "errorCode": "503"}}
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 600, in urlopen
chunked=chunked)
File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 354, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/usr/lib/python3.6/http/client.py", line 1239, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/usr/local/lib/python3.6/dist-packages/botocore/awsrequest.py", line 125, in _send_request
method, url, body, headers, *args, **kwargs)
File "/usr/lib/python3.6/http/client.py", line 1285, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/usr/lib/python3.6/http/client.py", line 1234, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/usr/local/lib/python3.6/dist-packages/botocore/awsrequest.py", line 176, in _send_output
self.send(message_body)
File "/usr/local/lib/python3.6/dist-packages/botocore/awsrequest.py", line 236, in send
return super(AWSConnection, self).send(str)
File "/usr/lib/python3.6/http/client.py", line 983, in send
self.sock.sendall(datablock)
ConnectionResetError: [Errno 104] Connection reset by peer
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/botocore/httpsession.py", line 258, in send
decode_content=False,
File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 638, in urlopen
_stacktrace=sys.exc_info()[2])
File "/usr/local/lib/python3.6/dist-packages/urllib3/util/retry.py", line 344, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/usr/local/lib/python3.6/dist-packages/urllib3/packages/six.py", line 685, in reraise
raise value.with_traceback(tb)
File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 600, in urlopen
chunked=chunked)
File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 354, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/usr/lib/python3.6/http/client.py", line 1239, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/usr/local/lib/python3.6/dist-packages/botocore/awsrequest.py", line 125, in _send_request
method, url, body, headers, *args, **kwargs)
File "/usr/lib/python3.6/http/client.py", line 1285, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/usr/lib/python3.6/http/client.py", line 1234, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/usr/local/lib/python3.6/dist-packages/botocore/awsrequest.py", line 176, in _send_output
self.send(message_body)
File "/usr/local/lib/python3.6/dist-packages/botocore/awsrequest.py", line 236, in send
return super(AWSConnection, self).send(str)
File "/usr/lib/python3.6/http/client.py", line 983, in send
self.sock.sendall(datablock)
urllib3.exceptions.ProtocolError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "training_worker.py", line 108, in training_worker
graph_manager.save_checkpoint()
File "/usr/local/lib/python3.6/dist-packages/rl_coach/graph_managers/graph_manager.py", line 624, in save_checkpoint
data_store.save_to_store()
File "/opt/ml/code/markov/s3_boto_data_store.py", line 144, in save_to_store
raise e
File "/opt/ml/code/markov/s3_boto_data_store.py", line 100, in save_to_store
Key=self._get_s3_key(rel_name))
File "/usr/local/lib/python3.6/dist-packages/boto3/s3/inject.py", line 131, in upload_file
extra_args=ExtraArgs, callback=Callback)
File "/usr/local/lib/python3.6/dist-packages/boto3/s3/transfer.py", line 279, in upload_file
future.result()
File "/usr/local/lib/python3.6/dist-packages/s3transfer/futures.py", line 73, in result
return self._coordinator.result()
File "/usr/local/lib/python3.6/dist-packages/s3transfer/futures.py", line 233, in result
raise self._exception
File "/usr/local/lib/python3.6/dist-packages/s3transfer/tasks.py", line 126, in call
return self._execute_main(kwargs)
File "/usr/local/lib/python3.6/dist-packages/s3transfer/tasks.py", line 150, in _execute_main
return_value = self._main(**kwargs)
File "/usr/local/lib/python3.6/dist-packages/s3transfer/upload.py", line 722, in _main
Body=body, **extra_args)
File "/usr/local/lib/python3.6/dist-packages/botocore/client.py", line 357, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/usr/local/lib/python3.6/dist-packages/botocore/client.py", line 648, in _make_api_call
operation_model, request_dict, request_context)
File "/usr/local/lib/python3.6/dist-packages/botocore/client.py", line 667, in _make_request
return self._endpoint.make_request(operation_model, request_dict)
File "/usr/local/lib/python3.6/dist-packages/botocore/endpoint.py", line 102, in make_request
return self._send_request(request_dict, operation_model)
File "/usr/local/lib/python3.6/dist-packages/botocore/endpoint.py", line 137, in _send_request
success_response, exception):
File "/usr/local/lib/python3.6/dist-packages/botocore/endpoint.py", line 231, in _needs_retry
caught_exception=caught_exception, request_dict=request_dict)
File "/usr/local/lib/python3.6/dist-packages/botocore/hooks.py", line 356, in emit
return self._emitter.emit(aliased_event_name, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/botocore/hooks.py", line 228, in emit
return self._emit(event_name, kwargs)
File "/usr/local/lib/python3.6/dist-packages/botocore/hooks.py", line 211, in _emit
response = handler(**kwargs)
File "/usr/local/lib/python3.6/dist-packages/botocore/retryhandler.py", line 183, in call
if self._checker(attempts, response, caught_exception):
File "/usr/local/lib/python3.6/dist-packages/botocore/retryhandler.py", line 251, in call
caught_exception)
File "/usr/local/lib/python3.6/dist-packages/botocore/retryhandler.py", line 277, in _should_retry
return self._checker(attempt_number, response, caught_exception)
File "/usr/local/lib/python3.6/dist-packages/botocore/retryhandler.py", line 317, in call
caught_exception)
File "/usr/local/lib/python3.6/dist-packages/botocore/retryhandler.py", line 223, in call
attempt_number, caught_exception)
File "/usr/local/lib/python3.6/dist-packages/botocore/retryhandler.py", line 359, in _check_caught_exception
raise caught_exception
File "/usr/local/lib/python3.6/dist-packages/botocore/endpoint.py", line 200, in _do_get_response
http_response = self._send(request)
File "/usr/local/lib/python3.6/dist-packages/botocore/endpoint.py", line 244, in _send
return self.http_session.send(request)
File "/usr/local/lib/python3.6/dist-packages/botocore/httpsession.py", line 289, in send
endpoint_url=request.url
botocore.exceptions.ConnectionClosedError: Connection was closed before we received a valid response from endpoint URL: "http://minio:9000/bucket/rl-deepracer-sagemaker/model/14_Step-10846.ckpt.data-00000-of-00001?uploadId=27a99f19-659f-4098-8917-02e4ba72e85f&partNumber=16".
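When uploads die mid-checkpoint like this, it is worth confirming that minio itself is healthy before suspecting the trainer. `/minio/health/live` is MinIO's standard liveness endpoint, and port 9000 matches the compose file quoted earlier; whether that endpoint is reachable in your particular image version is an assumption.

```shell
docker ps --filter name=minio --format '{{.Status}}'     # shows the (health: ...) state
curl -fsS http://localhost:9000/minio/health/live && echo "minio is up"
docker logs minio 2>&1 | tail -n 20                      # recent server-side errors
```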
Hey guys, first things first: great work!
Second: my Gazebo VNC screen crashes shortly after starting. It seems like the car starts at the corner of the track (off the street), drives for about a second, disappears, and then Gazebo crashes. I tried reinstalling the NVIDIA CUDA drivers and everything else, and this is a fresh setup of Ubuntu 18.04. Have you ever experienced a similar error?
Attached are the logs of the container:
python3 -m markov.rollout_worker
/app/robomaker-deepracer/simulation_ws/install/deepracer_simulation/lib/deepracer_simulation/run_rollout_rl_agent.sh: line 8: 1660 Illegal instruction (core dumped) python3 -m markov.rollout_worker
================================================================================REQUIRED process [agent-9] has died!
process has died [pid 1493, exit code 132, cmd /app/robomaker-deepracer/simulation_ws/install/deepracer_simulation/lib/deepracer_simulation/run_rollout_rl_agent.sh __name:=agent __log:=/root/.ros/log/dd04facc-e6fc-11e9-98a8-0242ac120004/agent-9.log].
log file: /root/.ros/log/dd04facc-e6fc-11e9-98a8-0242ac120004/agent-9*.log
Initiating shutdown!
================================================================================
[ INFO] [1570230936.370759356]: Finished loading Gazebo ROS API Plugin.
[ INFO] [1570230936.384992129]: waitForService: Service [/gazebo/set_physics_properties] has not been advertised, waiting...
[ INFO] [1570230938.103051125, 0.033000000]: waitForService: Service [/gazebo/set_physics_properties] is now available.
[ INFO] [1570230938.834061301, 0.748000000]: Physics dynamic reconfigure ready.
[racecar/controller_manager-5] escalating to SIGTERM
[WARN] [1570230962.581206, 5.927000]: Controller Spawner error while taking down controllers: transport error completing service call: receive_once[/racecar/controller_manager/switch_controller]: unexpected error [Errno 4] Interrupted system call
[gazebo-2] escalating to SIGTERM
Follow the medium article https://medium.com/@autonomousracecarclub/how-to-run-deepracer-locally-to-save-your-wallet-13ccc878687
Part of the ARCC mission. Democratize AI!
Have places to fill out names for models and compare performance across multiple saves.
Hello, after a certain number of episodes (around 400 in my case), training stops because the GPU runs out of memory; the problem seems to be in the TensorFlow training.
I'm trying to find out how to replicate the problem, but at the moment it seems to happen randomly.
EDIT:
It seems this is not a memory leak. As the training episodes increase, the agent covers more of the track, so the training data grows and the GPU can't handle it.
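One hedged mitigation: TensorFlow grabs nearly all GPU memory up front by default, and TF_FORCE_GPU_ALLOW_GROWTH=true (an environment variable supported since TF 1.14) switches it to incremental allocation, which sometimes avoids late-run OOMs. Whether this repo's .env file is the right place to set it so it reaches the sagemaker container is an assumption about this setup.

```shell
# Append the variable to the repo's .env so docker-compose passes it through.
echo 'TF_FORCE_GPU_ALLOW_GROWTH=true' >> .env
grep TF_FORCE_GPU_ALLOW_GROWTH .env    # confirm it was added
```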