
Comments (21)

Kismuz avatar Kismuz commented on May 29, 2024 1

@knn940506, I forgot to remove an exception test case, sorry for that. It is corrected now; please update.

from btgym.

Kismuz avatar Kismuz commented on May 29, 2024 1

OK, the base exception occurred here:

File "/home/joowonkim/바탕화면/git/btgym/btgym/envs/backtrader.py", line 750, in _step
raise ConnectionError(msg)
ConnectionError: .step(): server unreachable with status: <receive_failed_due_to_connect_timeout>.

... for some reason the BTGym server did not respond to the API shell in time; everything else is a consequence of that. This is rather strange, but we can track it down:

  1. Run the basic notebook example to make sure a bare environment runs fine:
    https://github.com/Kismuz/btgym/blob/master/examples/very_basic_env_setup.ipynb
    If it runs without exceptions (it should just print a lot of info messages), then:
  2. Change the following in a3c_random_on_synth_or_real_data... :
env_config = dict(
    ...
    kwargs=dict(
        ....
        connect_timeout=180,
        verbose=2,
    )
)
....
cluster_config = dict(
    ...
    num_workers=1, 
    num_ps=1,
    num_envs=1,
   ....
)
.....
launcher = Launcher(
     ...
    verbose=2,
)

and paste the log output up to the error mentioned above.


knn940506 avatar knn940506 commented on May 29, 2024 1

I updated btgym, but aac.py now raises an error:

Traceback (most recent call last):
File "/home/joowonkim/anaconda3/lib/python3.5/site-packages/tensorflow/python/framework/tensor_util.py", line 468, in make_tensor_proto
str_values = [compat.as_bytes(x) for x in proto_values]
File "/home/joowonkim/anaconda3/lib/python3.5/site-packages/tensorflow/python/framework/tensor_util.py", line 468, in <listcomp>
str_values = [compat.as_bytes(x) for x in proto_values]
File "/home/joowonkim/anaconda3/lib/python3.5/site-packages/tensorflow/python/util/compat.py", line 65, in as_bytes
(bytes_or_text,))
TypeError: Expected binary or unicode string, got {'trial_num': <tf.Tensor 'local/on_policy_state_in_metadata_trial_num_pl:0' shape=(?,) dtype=float32>, 'type': <tf.Tensor 'local/on_policy_state_in_metadata_type_pl:0' shape=(?,) dtype=float32>, 'first_row': <tf.Tensor 'local/on_policy_state_in_metadata_first_row_pl:0' shape=(?,) dtype=float32>, 'sample_num': <tf.Tensor 'local/on_policy_state_in_metadata_sample_num_pl:0' shape=(?,) dtype=float32>}

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/aac.py", line 492, in __init__
self.inc_step = self.global_step.assign_add(tf.shape(pi.on_state_in[list(pi.on_state_in.keys())[0]])[0])
File "/home/joowonkim/anaconda3/lib/python3.5/site-packages/tensorflow/python/ops/array_ops.py", line 271, in shape
return shape_internal(input, name, optimize=True, out_type=out_type)
File "/home/joowonkim/anaconda3/lib/python3.5/site-packages/tensorflow/python/ops/array_ops.py", line 295, in shape_internal
input_tensor = ops.convert_to_tensor(input)
File "/home/joowonkim/anaconda3/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 836, in convert_to_tensor
as_ref=False)
File "/home/joowonkim/anaconda3/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 926, in internal_convert_to_tensor
ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
File "/home/joowonkim/anaconda3/lib/python3.5/site-packages/tensorflow/python/framework/constant_op.py", line 229, in _constant_tensor_conversion_function
return constant(v, dtype=dtype, name=name)
File "/home/joowonkim/anaconda3/lib/python3.5/site-packages/tensorflow/python/framework/constant_op.py", line 208, in constant
value, dtype=dtype, shape=shape, verify_shape=verify_shape))
File "/home/joowonkim/anaconda3/lib/python3.5/site-packages/tensorflow/python/framework/tensor_util.py", line 472, in make_tensor_proto
"supported type." % (type(values), values))
TypeError: Failed to convert object of type <class 'dict'> to Tensor. Contents: {'trial_num': <tf.Tensor 'local/on_policy_state_in_metadata_trial_num_pl:0' shape=(?,) dtype=float32>, 'type': <tf.Tensor 'local/on_policy_state_in_metadata_type_pl:0' shape=(?,) dtype=float32>, 'first_row': <tf.Tensor 'local/on_policy_state_in_metadata_first_row_pl:0' shape=(?,) dtype=float32>, 'sample_num': <tf.Tensor 'local/on_policy_state_in_metadata_sample_num_pl:0' shape=(?,) dtype=float32>}. Consider casting elements to a supported type.
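
The btgym frame in this traceback is `tf.shape(pi.on_state_in[list(pi.on_state_in.keys())[0]])`, and the error contents suggest that the first key of `pi.on_state_in` maps to a nested dict (the metadata placeholders) rather than to a single tensor. Below is a minimal sketch, in plain Python, of descending to the first non-dict leaf before using it. The helper `first_leaf` and the sample dict are hypothetical illustrations, not btgym code, and the printed value assumes insertion-ordered dicts (Python 3.7+).

```python
def first_leaf(nested):
    """Return the first non-dict value found by descending into nested dicts."""
    value = nested
    while isinstance(value, dict):
        if not value:
            raise ValueError('empty dict has no leaf')
        # Take the first key in iteration order and keep descending.
        value = value[next(iter(value))]
    return value


# Stand-in for a nested placeholder dict; strings play the role of tensors.
state_in = {
    'metadata': {'trial_num': 'trial_num_pl', 'type': 'type_pl'},
    'external': 'external_pl',
}
print(first_leaf(state_in))  # prints trial_num_pl
```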


Kismuz avatar Kismuz commented on May 29, 2024 1

@knn940506, I have recently implemented another type of runner that doesn't rely on a queue;
it can be found at btgym.algorithms.runner.synchro.BaseSynchroRunner.
Usage can be found in the MLDG implementation: https://github.com/Kismuz/btgym/tree/develop_meta_learning_gradient


Kismuz avatar Kismuz commented on May 29, 2024

@knn940506,
an empty queue usually means that the thread runner process either didn't start or quietly died.
As some updates have been made since your fork @8.01.18, I recommend updating the btgym package first. If the error persists, please provide some details:

  • your setup (mainly: OS, number of CPU cores, TF version)
  • any modifications made to the notebook (number of workers, data chosen, etc.)
  • whether the error appears at the beginning or in the course of training
  • whether it is occasional or persistent
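
To illustrate the first point, here is a minimal standard-library sketch of why a dead producer shows up as `queue.Empty` on the consumer side. The names `dying_runner` and `pull_rollout` are hypothetical stand-ins, not btgym functions, and the timeout is shortened for the demo.

```python
import queue
import threading


def dying_runner(q):
    """Stand-in for a runner thread that fails before producing a rollout."""
    try:
        raise ConnectionError('server unreachable')
    except ConnectionError:
        return  # the thread exits quietly; nothing is ever put on the queue


def pull_rollout(q, timeout):
    """Consumer side: block for up to `timeout` seconds, then give up."""
    try:
        return q.get(timeout=timeout)
    except queue.Empty:
        return None  # nothing arrived within the timeout


rollouts = queue.Queue(maxsize=5)
runner = threading.Thread(target=dying_runner, args=(rollouts,), daemon=True)
runner.start()
runner.join()

result = pull_rollout(rollouts, timeout=0.1)
print(result, runner.is_alive())  # prints None False
```

The consumer only learns that nothing arrived; the reason has to be found in the producer's own log or traceback.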


knn940506 avatar knn940506 commented on May 29, 2024

I updated btgym using the commands below and ran it again, but the error still occurs.


cd btgym
git pull
pip install --upgrade -e .


  • Setup: Ubuntu 16.04 LTS / 8 cores / tensorflow==1.4.1
  • Only the number of workers was changed
  • It appears in the course of training
  • Once it appears, training stops for good

The error pattern is: many reset warnings -> global_step info -> error

[2018-01-12 02:01:25.982257] WARNING: BTgymServer_0: _reset <episode_config> kwarg not found, using default values: {'b_beta': 1, 'sample_type': 0, 'b_alpha': 1, 'get_new': True}
<INFO:tensorflow:global/global_step/sec: 261.664>
[2018-01-12 02:02:26.494250] ERROR: BTgymAPIshell_0: .step(): server unreachable with status: <receive_failed_due_to_connect_timeout>.

Thanks so much !


knn940506 avatar knn940506 commented on May 29, 2024

I tested other examples; it looks like my workers lose the connection to the Backtrader server.

If the program runs longer, the message below always appears:


~/바탕화면/git/btgym/btgym/envs/backtrader.py in _step(self, action)
748 msg = '.step(): server unreachable with status: <{}>.'.format(env_response['status'])
749 self.log.error(msg)
--> 750 raise ConnectionError(msg)
751
752 self.env_response = env_response ['message']

ConnectionError: .step(): server unreachable with status: <receive_failed_due_to_connect_timeout>.



Kismuz avatar Kismuz commented on May 29, 2024

@knn940506, well, this is a different error.
Do the following:

  1. At line 47 of the notebook, set: connect_timeout=120,

  2. Pay attention to how you interrupt/restart the notebook kernel
    (taken from #17):
    Every BTGYM instance launches at least two separate processes, not counting the Jupyter kernel itself:

  • btgym_server as the backend for the environment API, default port 5000, incremented by 1 for every other env. instance: 5001, 5002, ... ;

  • data_server as the data-providing backend for one or more btgym_server(s), default port 4999, the same for all env. instances;

  • calling env.close() should stop both, and it usually does (at least on macOS and Linux);

  • interrupting the parent kernel should stop the children as well, as they are not daemonized, but:

  • there is a caveat in interrupting the Jupyter kernel: it cannot be done via Ctrl-C; the equivalent is the web interface [KERNEL]-->[INTERRUPT]. This combination correctly finishes everything, while hitting [KERNEL]-->[RESTART] or [RESTART AND CLEAR...] for some reason leaves the child processes orphaned.
    In this case, list the processes on the specified ports in a terminal window:

lsof -i:5000
lsof -i:4999

...and kill them manually.

Note that when running the A3C examples there are also ports 12230 and 12231 to watch for.

Orphaned processes usually cause errors like:

  • resource temporarily unavailable
  • could not start grpc server
  • server unreachable with status: ....
  • operation could not be accomplished in a current state

  3. Decrease the number of workers to 6. This still gives full CPU load and can eliminate slowdowns from inter-thread contention.

  4. If nothing helps, set the Launcher kwarg verbose=3 and paste the last ~50 lines of log output.
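
As a complement to `lsof`, the port check can be sketched from Python with the standard `socket` module. This is only a rough probe under the assumption that the servers listen on TCP at 127.0.0.1; `port_in_use` is a hypothetical helper, and the port numbers are the defaults named in this thread. It reports whether something accepts connections, not which process owns the port.

```python
import socket


def port_in_use(port, host='127.0.0.1', timeout=0.5):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


# Ports named in this thread: env server, data server, A3C cluster.
for port in (5000, 4999, 12230, 12231):
    state = 'in use' if port_in_use(port) else 'free'
    print('port {}: {}'.format(port, state))
```

If a port is reported as in use before any environment has been started, an orphaned process from an earlier run is a likely candidate.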


knn940506 avatar knn940506 commented on May 29, 2024

The error does not change... Here is some terminal log output:


2018-01-15 11:19:33.222086: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-01-15 11:19:33.224407: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
E0115 11:19:33.225525382 2448 ev_epoll1_linux.c:1051] grpc epoll fd: 52
E0115 11:19:33.225550836 2439 ev_epoll1_linux.c:1051] grpc epoll fd: 51
2018-01-15 11:19:33.230664: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> 127.0.0.1:12230}
2018-01-15 11:19:33.230663: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:12230}
2018-01-15 11:19:33.230714: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:12231, 1 -> 127.0.0.1:12232, 2 -> 127.0.0.1:12233, 3 -> 127.0.0.1:12234}
2018-01-15 11:19:33.230717: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> 127.0.0.1:12231, 1 -> 127.0.0.1:12232, 2 -> 127.0.0.1:12233, 3 -> 127.0.0.1:12234}
2018-01-15 11:19:33.231020: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:12231
2018-01-15 11:19:33.231497: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:12230
2018-01-15 11:19:37.685478: I tensorflow/core/distributed_runtime/master_session.cc:1004] Start master session b6839cbeeb119750 with config: intra_op_parallelism_threads: 1 device_filters: "/job:ps" device_filters: "/job:worker/task:0/cpu:0" inter_op_parallelism_threads: 2
2018-01-15 11:19:38.231957: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
E0115 11:19:38.232226783 2503 ev_epoll1_linux.c:1051] grpc epoll fd: 53
2018-01-15 11:19:38.236208: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-01-15 11:19:38.236407: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> 127.0.0.1:12230}
2018-01-15 11:19:38.236446: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> 127.0.0.1:12231, 1 -> localhost:12232, 2 -> 127.0.0.1:12233, 3 -> 127.0.0.1:12234}
E0115 11:19:38.236568040 2507 ev_epoll1_linux.c:1051] grpc epoll fd: 54
2018-01-15 11:19:38.236800: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:12232
2018-01-15 11:19:38.240948: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> 127.0.0.1:12230}
2018-01-15 11:19:38.240997: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> 127.0.0.1:12231, 1 -> 127.0.0.1:12232, 2 -> localhost:12233, 3 -> 127.0.0.1:12234}
2018-01-15 11:19:38.241403: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:12233
2018-01-15 11:19:38.242178: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
E0115 11:19:38.242392991 2516 ev_epoll1_linux.c:1051] grpc epoll fd: 55
2018-01-15 11:19:38.247020: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> 127.0.0.1:12230}
2018-01-15 11:19:38.247056: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> 127.0.0.1:12231, 1 -> 127.0.0.1:12232, 2 -> 127.0.0.1:12233, 3 -> localhost:12234}
2018-01-15 11:19:38.247372: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:12234
2018-01-15 11:19:41.789174: I tensorflow/core/distributed_runtime/master_session.cc:1004] Start master session 1e5bfb978931a13a with config: intra_op_parallelism_threads: 1 device_filters: "/job:ps" device_filters: "/job:worker/task:3/cpu:0" inter_op_parallelism_threads: 2
2018-01-15 11:19:41.807742: I tensorflow/core/distributed_runtime/master_session.cc:1004] Start master session 94c6cd7bd0b0fa12 with config: intra_op_parallelism_threads: 1 device_filters: "/job:ps" device_filters: "/job:worker/task:1/cpu:0" inter_op_parallelism_threads: 2
2018-01-15 11:19:42.002243: I tensorflow/core/distributed_runtime/master_session.cc:1004] Start master session 23214af6a52fc7cf with config: intra_op_parallelism_threads: 1 device_filters: "/job:ps" device_filters: "/job:worker/task:2/cpu:0" inter_op_parallelism_threads: 2



knn940506 avatar knn940506 commented on May 29, 2024

One thing is weird: I set num_workers=4, but it looks like Worker-5 is running.

INFO:tensorflow:Saving checkpoint to path /home/joowonkim/tmp/test_gym_a3c/train/model.ckpt
INFO:tensorflow:global/global_step/sec: 100.832
INFO:tensorflow:Error reported to Coordinator: <class 'queue.Empty'>,

Process Worker-5:
Traceback (most recent call last):
File "/home/joowonkim/anaconda3/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/worker.py", line 241, in run
trainer.process(sess)
File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/aac.py", line 747, in process
data = self.get_data()
File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/aac.py", line 594, in get_data
data_streams = [get_it() for get_it in self.data_getter]
File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/aac.py", line 594, in <listcomp>
data_streams = [get_it() for get_it in self.data_getter]
File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/rollout.py", line 33, in pull_rollout_from_queue
return queue.get(timeout=600.0)
File "/home/joowonkim/anaconda3/lib/python3.5/queue.py", line 172, in get
raise Empty
queue.Empty

Is this normal?


Kismuz avatar Kismuz commented on May 29, 2024

@knn940506,
the terminal log you provided is OK, there are no errors in it; refer to #23 for details.

No, this is not normal; I see that error reporting for sub-processes should be improved. I'll take some time to see how to fix it.


Kismuz avatar Kismuz commented on May 29, 2024

@knn940506,
I have updated the error reporting for child processes. It does not fix the error, but it can give a hint about what is going wrong. Please update the package, run the example, and post the traceback here.
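
For readers wondering what such reporting can look like in general, here is a minimal standard-library sketch of a thread that records the exception raised by its target so the parent can inspect it. `ReportingThread` and `broken_step` are hypothetical names; this is not the change made in btgym, only an illustration of the idea.

```python
import threading
import traceback


class ReportingThread(threading.Thread):
    """Thread that records an exception raised by its target instead of losing it."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.error = None        # the exception instance, if any
        self.error_text = None   # formatted traceback, for logging

    def run(self):
        try:
            super().run()
        except Exception as exc:
            self.error = exc
            self.error_text = traceback.format_exc()


def broken_step():
    """Stand-in for an environment step that cannot reach its server."""
    raise ConnectionError('.step(): server unreachable')


worker = ReportingThread(target=broken_step)
worker.start()
worker.join()
if worker.error is not None:
    print('runner failed with:', repr(worker.error))
```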


knn940506 avatar knn940506 commented on May 29, 2024

Traceback (most recent call last):
File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/runner.py", line 90, in run
self._run()
File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/runner.py", line 117, in _run
self.queue.put(next(rollout_provider), timeout=600.0)
File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/runner.py", line 222, in env_runner
state, reward, terminal, info = env.step(action.argmax())
File "/home/joowonkim/anaconda3/lib/python3.5/site-packages/gym/core.py", line 96, in step
return self._step(action)
File "/home/joowonkim/바탕화면/git/btgym/btgym/envs/backtrader.py", line 750, in _step
raise ConnectionError(msg)
ConnectionError: .step(): server unreachable with status: <receive_failed_due_to_connect_timeout>.

Exception in thread Thread-4:
Traceback (most recent call last):
File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/runner.py", line 90, in run
self._run()
File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/runner.py", line 117, in _run
self.queue.put(next(rollout_provider), timeout=600.0)
File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/runner.py", line 222, in env_runner
state, reward, terminal, info = env.step(action.argmax())
File "/home/joowonkim/anaconda3/lib/python3.5/site-packages/gym/core.py", line 96, in step
return self._step(action)
File "/home/joowonkim/바탕화면/git/btgym/btgym/envs/backtrader.py", line 750, in _step
raise ConnectionError(msg)
ConnectionError: .step(): server unreachable with status: <receive_failed_due_to_connect_timeout>.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/joowonkim/anaconda3/lib/python3.5/threading.py", line 914, in _bootstrap_inner
self.run()
File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/runner.py", line 95, in run
raise RuntimeError
RuntimeError

INFO:tensorflow:global/global_step/sec: 40.9994
INFO:tensorflow:global/global_step/sec: 0
INFO:tensorflow:Saving checkpoint to path /home/joowonkim/tmp/test_gym_a3c/train/model.ckpt
INFO:tensorflow:global/global_step/sec: 0
INFO:tensorflow:global/global_step/sec: 0
INFO:tensorflow:Saving checkpoint to path /home/joowonkim/tmp/test_gym_a3c/train/model.ckpt
INFO:tensorflow:global/global_step/sec: 0
[2018-01-17 04:18:22.827845] ERROR: A3C_1: process() exception occurred

Press Ctrl-C or jupyter:[Kernel]->[Interrupt] for clean exit.

Traceback (most recent call last):
File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/aac.py", line 1076, in process
data = self._get_data()
File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/aac.py", line 634, in _get_data
data_streams = [get_it() for get_it in self.data_getter]
File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/aac.py", line 634, in <listcomp>
data_streams = [get_it() for get_it in self.data_getter]
File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/rollout.py", line 33, in pull_rollout_from_queue
return queue.get(timeout=600.0)
File "/home/joowonkim/anaconda3/lib/python3.5/queue.py", line 172, in get
raise Empty
queue.Empty
INFO:tensorflow:Error reported to Coordinator: <class 'RuntimeError'>, process() exception occurred

Press Ctrl-C or jupyter:[Kernel]->[Interrupt] for clean exit.

Process Worker-17:
Traceback (most recent call last):
File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/aac.py", line 1076, in process
data = self._get_data()
File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/aac.py", line 634, in _get_data
data_streams = [get_it() for get_it in self.data_getter]
File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/aac.py", line 634, in <listcomp>
data_streams = [get_it() for get_it in self.data_getter]
File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/rollout.py", line 33, in pull_rollout_from_queue
return queue.get(timeout=600.0)
File "/home/joowonkim/anaconda3/lib/python3.5/queue.py", line 172, in get
raise Empty
queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/joowonkim/anaconda3/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/worker.py", line 257, in run
sv.stop()
File "/home/joowonkim/anaconda3/lib/python3.5/contextlib.py", line 77, in __exit__
self.gen.throw(type, value, traceback)
File "/home/joowonkim/anaconda3/lib/python3.5/site-packages/tensorflow/python/training/supervisor.py", line 964, in managed_session
self.stop(close_summary_writer=close_summary_writer)
File "/home/joowonkim/anaconda3/lib/python3.5/site-packages/tensorflow/python/training/supervisor.py", line 792, in stop
stop_grace_period_secs=self._stop_grace_secs)
File "/home/joowonkim/anaconda3/lib/python3.5/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
six.reraise(*self._exc_info_to_raise)
File "/home/joowonkim/anaconda3/lib/python3.5/site-packages/six.py", line 693, in reraise
raise value
File "/home/joowonkim/anaconda3/lib/python3.5/site-packages/tensorflow/python/training/supervisor.py", line 954, in managed_session
yield sess
File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/worker.py", line 257, in run
sv.stop()
File "/home/joowonkim/anaconda3/lib/python3.5/contextlib.py", line 77, in __exit__
self.gen.throw(type, value, traceback)
File "/home/joowonkim/anaconda3/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 4339, in get_controller
yield default
File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/worker.py", line 250, in run
trainer.process(sess)
File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/aac.py", line 1145, in process
raise RuntimeError(msg)
RuntimeError: process() exception occurred

Press Ctrl-C or jupyter:[Kernel]->[Interrupt] for clean exit.

[2018-01-17 04:18:32.567306] ERROR: A3C_2: process() exception occurred


knn940506 avatar knn940506 commented on May 29, 2024

[2018-01-17 04:18:53.225776] ERROR: A3C_0: process() exception occurred

Press Ctrl-C or jupyter:[Kernel]->[Interrupt] for clean exit.

Traceback (most recent call last):
File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/aac.py", line 1076, in process
data = self._get_data()
File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/aac.py", line 634, in _get_data
data_streams = [get_it() for get_it in self.data_getter]
File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/aac.py", line 634, in <listcomp>
data_streams = [get_it() for get_it in self.data_getter]
File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/rollout.py", line 33, in pull_rollout_from_queue
return queue.get(timeout=600.0)
File "/home/joowonkim/anaconda3/lib/python3.5/queue.py", line 172, in get
raise Empty
queue.Empty
INFO:tensorflow:Error reported to Coordinator: <class 'RuntimeError'>, process() exception occurred

Press Ctrl-C or jupyter:[Kernel]->[Interrupt] for clean exit.


knn940506 avatar knn940506 commented on May 29, 2024

Similar errors keep occurring. Do you need more logs?
I set the env verbose=1 and num_workers=4.
Thanks!


knn940506 avatar knn940506 commented on May 29, 2024

Step 1 works well.

In the Jupyter Notebook:


[2018-01-18 08:22:00.041878] DEBUG: BTgymServer_0: Episode countdown started at: 1393, END OF DATA, r:-0.2578244975861855
[2018-01-18 08:22:00.044134] DEBUG: BTgymServer_0: Episode countdown contd. at: 1394, CLOSE, END OF DATA, r:-0.2578244975861855
[2018-01-18 08:22:00.045461] DEBUG: BTgymServer_0: Episode countdown contd. at: 1395, CLOSE, END OF DATA, r:-0.2578244975861855
[2018-01-18 08:22:00.046319] DEBUG: BTgymServer_0: COMM recieved: {'action': 'hold'}
[2018-01-18 08:22:00.046877] DEBUG: BTgymServer_0: RunStop() invoked with CLOSE, END OF DATA
[2018-01-18 08:22:00.975725] DEBUG: BTgymServer_0: Episode elapsed time: 0:00:01.763553.
[2018-01-18 08:23:00.106587] ERROR: ThreadRunner_0: RunTime exception occurred.

Press Ctrl-C or jupyter:[Kernel]->[Interrupt] for clean exit.

Traceback (most recent call last):
File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/runner.py", line 90, in run
self._run()
File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/runner.py", line 117, in _run
self.queue.put(next(rollout_provider), timeout=600.0)
File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/runner.py", line 263, in env_runner
episode_stat = env.get_stat() # get episode statistic
File "/home/joowonkim/바탕화면/git/btgym/btgym/envs/backtrader.py", line 772, in get_stat
if self._force_control_mode():
File "/home/joowonkim/바탕화면/git/btgym/btgym/envs/backtrader.py", line 545, in _force_control_mode
self.server_response = self.socket.recv_pyobj()
File "/home/joowonkim/anaconda3/lib/python3.5/site-packages/zmq/sugar/socket.py", line 491, in recv_pyobj
msg = self.recv(flags)
File "zmq/backend/cython/socket.pyx", line 693, in zmq.backend.cython.socket.Socket.recv
File "zmq/backend/cython/socket.pyx", line 727, in zmq.backend.cython.socket.Socket.recv
File "zmq/backend/cython/socket.pyx", line 150, in zmq.backend.cython.socket._recv_copy
File "zmq/backend/cython/socket.pyx", line 145, in zmq.backend.cython.socket._recv_copy
File "zmq/backend/cython/checkrc.pxd", line 19, in zmq.backend.cython.checkrc._check_rc
zmq.error.Again: Resource temporarily unavailable

Exception in thread Thread-4:
Traceback (most recent call last):
File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/runner.py", line 90, in run
self._run()
File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/runner.py", line 117, in _run
self.queue.put(next(rollout_provider), timeout=600.0)
File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/runner.py", line 263, in env_runner
episode_stat = env.get_stat() # get episode statistic
File "/home/joowonkim/바탕화면/git/btgym/btgym/envs/backtrader.py", line 772, in get_stat
if self._force_control_mode():
File "/home/joowonkim/바탕화면/git/btgym/btgym/envs/backtrader.py", line 545, in _force_control_mode
self.server_response = self.socket.recv_pyobj()
File "/home/joowonkim/anaconda3/lib/python3.5/site-packages/zmq/sugar/socket.py", line 491, in recv_pyobj
msg = self.recv(flags)
File "zmq/backend/cython/socket.pyx", line 693, in zmq.backend.cython.socket.Socket.recv
File "zmq/backend/cython/socket.pyx", line 727, in zmq.backend.cython.socket.Socket.recv
File "zmq/backend/cython/socket.pyx", line 150, in zmq.backend.cython.socket._recv_copy
File "zmq/backend/cython/socket.pyx", line 145, in zmq.backend.cython.socket._recv_copy
File "zmq/backend/cython/checkrc.pxd", line 19, in zmq.backend.cython.checkrc._check_rc
zmq.error.Again: Resource temporarily unavailable

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/joowonkim/anaconda3/lib/python3.5/threading.py", line 914, in _bootstrap_inner
self.run()
File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/runner.py", line 95, in run
raise RuntimeError
RuntimeError

INFO:tensorflow:global/global_step/sec: 5.66658
INFO:tensorflow:global/global_step/sec: 0
INFO:tensorflow:Saving checkpoint to path /home/joowonkim/tmp/test_gym_a3c/train/model.ckpt
INFO:tensorflow:global/global_step/sec: 0
INFO:tensorflow:global/global_step/sec: 0
INFO:tensorflow:Saving checkpoint to path /home/joowonkim/tmp/test_gym_a3c/train/model.ckpt
INFO:tensorflow:global/global_step/sec: 0
[2018-01-18 08:31:59.980364] ERROR: A3C_0: process() exception occurred

Press Ctrl-C or jupyter:[Kernel]->[Interrupt] for clean exit.

Traceback (most recent call last):
File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/aac.py", line 1076, in process
data = self._get_data()
File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/aac.py", line 634, in _get_data
data_streams = [get_it() for get_it in self.data_getter]
File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/aac.py", line 634, in <listcomp>
data_streams = [get_it() for get_it in self.data_getter]
File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/rollout.py", line 33, in pull_rollout_from_queue
return queue.get(timeout=600.0)
File "/home/joowonkim/anaconda3/lib/python3.5/queue.py", line 172, in get
raise Empty
queue.Empty
INFO:tensorflow:Error reported to Coordinator: <class 'RuntimeError'>, process() exception occurred



knn940506 avatar knn940506 commented on May 29, 2024

In the terminal:


E0118 17:21:45.308309060 19328 ev_epoll1_linux.c:1051] grpc epoll fd: 52
2018-01-18 17:21:45.312629: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:12230}
2018-01-18 17:21:45.312629: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> 127.0.0.1:12230}
2018-01-18 17:21:45.312664: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:12231}
2018-01-18 17:21:45.312664: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> 127.0.0.1:12231}
2018-01-18 17:21:45.312991: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:12231
2018-01-18 17:21:45.313294: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:12230
2018-01-18 17:21:49.566307: I tensorflow/core/distributed_runtime/master_session.cc:1004] Start master session ad2b7177ea7201bf with config: intra_op_parallelism_threads: 1 device_filters: "/job:ps" device_filters: "/job:worker/task:0/cpu:0" inter_op_parallelism_threads: 2


Thanks for your work 👍 👍


Kismuz avatar Kismuz commented on May 29, 2024

@knn940506, I have corrected some unsafe code that could potentially lead to such an exception.
The problem is that I can't verify it locally, as this error does not appear on my setup (macOS).

Update btgym and run again.
If the error remains, create a notebook in the /examples directory and run the following code in it:

import os
import backtrader as bt
from btgym import BTgymEnv, BTgymDataset
from btgym.strategy.observers import Reward, Position, NormPnL
from btgym.research import DevStrat_4_6

MyCerebro = bt.Cerebro()
MyCerebro.addstrategy(
    DevStrat_4_6,
    drawdown_call=5, # max % to lose, in percent of initial cash
    target_call=10,  # max % to win, same
    skip_frame=10,
)
# Set leveraged account:
MyCerebro.broker.setcash(2000)
MyCerebro.broker.setcommission(commission=0.0001, leverage=10.0) # commission to imitate spread
MyCerebro.addsizer(bt.sizers.SizerFix, stake=5000,)  

# Visualisations for reward, position and PnL dynamics:
MyCerebro.addobserver(Reward)
MyCerebro.addobserver(Position)
MyCerebro.addobserver(NormPnL)

MyDataset = BTgymDataset(
    #filename='./data/DAT_ASCII_EURUSD_M1_201703.csv',
    #filename='./data/DAT_ASCII_EURUSD_M1_201704.csv',
    filename='./data/test_sine_1min_period256_delta0002.csv',
    start_weekdays={0, 1, 2, 3},
    episode_duration={'days': 0, 'hours': 23, 'minutes': 55},
    start_00=False,
    time_gap={'hours': 6},
)

env_config = dict(
    class_ref=BTgymEnv,
    kwargs=dict(
        dataset=MyDataset,
        engine=MyCerebro,
        render_modes=['episode', 'human','external'],
        render_state_as_image=True,
        render_ylabel='OHL_diff.',
        render_size_episode=(12,8),
        render_size_human=(9, 4),
        render_size_state=(11, 3),
        render_dpi=75,
        port=5000,
        data_port=4999,
        verbose=1,
    )
)

# Make environment:
env = env_config['class_ref'](**env_config['kwargs'])

# Run several episodes with statistic fetches:
for episode in range(4):
    o = env.reset()
    done = False
    while not done:
        obs, reward, done, info = env.step(env.action_space.sample())
    episode_stat = env.get_stat() 
    for k, v in episode_stat.items():
        print('{}: {}'.format(k, v))
        
env.close()

Is any exception raised? If so, please provide feedback.
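
Since an exception inside the episode loop above would skip env.close() and could leave the server processes running, the loop can be wrapped in try/finally. Below is a minimal sketch of that pattern using a hypothetical `DummyEnv` stand-in with the same reset/step/close shape; it does not use BTgymEnv itself, and `run_episode` is an illustrative helper.

```python
class DummyEnv:
    """Hypothetical stand-in with the reset/step/close shape used above."""

    def __init__(self, fail_at=3):
        self.fail_at = fail_at
        self.steps = 0
        self.closed = False

    def reset(self):
        self.steps = 0
        return 0

    def step(self, action):
        self.steps += 1
        if self.steps == self.fail_at:
            raise ConnectionError('server unreachable')
        return 0, 0.0, False, {}

    def close(self):
        self.closed = True


def run_episode(env, max_steps=10):
    """Step the env and close it whether or not a step raises."""
    try:
        env.reset()
        for _ in range(max_steps):
            _, _, done, _ = env.step(0)
            if done:
                break
    finally:
        env.close()


env = DummyEnv()
try:
    run_episode(env)
except ConnectionError as exc:
    print('episode aborted:', exc)
print('closed:', env.closed)  # prints closed: True
```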


knn940506 avatar knn940506 commented on May 29, 2024

No exception is raised by your new example code. Here's the result:

[2018-01-19 01:55:44.229338] INFO: BTgymAPIshell_0: ...done.
[2018-01-19 01:55:44.230378] INFO: BTgymAPIshell_0: Custom Cerebro class used.
[2018-01-19 01:55:44.318731] INFO: BTgymServer_0: PID: 28047
[2018-01-19 01:55:45.318373] INFO: BTgymAPIshell_0: Server started, pinging tcp://127.0.0.1:5000 ...
[2018-01-19 01:55:45.321071] INFO: BTgymAPIshell_0: Server seems ready with response: <{'ctrl': 'send control keys: <_reset>, <_getstat>, <_render>, <_stop>.'}>
[2018-01-19 01:55:45.322550] INFO: BTgymAPIshell_0: Environment is ready.
[2018-01-19 01:55:45.327601] INFO: BTgymAPIshell_0: Data domain reset() called prior to reset_data() with [possibly inconsistent] defaults.
[2018-01-19 01:55:45.332980] INFO: SimpleDataSet_0: New sample id: <train_trial_w_0_num_0_at_2017-01-03 12:47:00>.
[2018-01-19 01:55:45.337404] INFO: SimpleDataSet_0: New sample id: <train_trial_w_0_num_1_at_2017-01-05 02:48:00>.
[2018-01-19 01:55:45.357896] INFO: Trial_0: New sample id: <train_episode_w_0_num_0_at_2017-01-03 12:47:00>.
[2018-01-19 01:55:47.013175] INFO: SimpleDataSet_0: New sample id: <train_trial_w_0_num_2_at_2017-01-03 09:38:00>.
[2018-01-19 01:55:47.025657] INFO: Trial_0: New sample id: <train_episode_w_0_num_0_at_2017-01-05 02:48:00>.
episode: 0
length: 1380
runtime: 0:00:01.593744
[2018-01-19 01:55:48.638609] INFO: SimpleDataSet_0: New sample id: <train_trial_w_0_num_3_at_2017-01-03 21:30:00>.
[2018-01-19 01:55:48.653948] INFO: Trial_0: New sample id: <train_episode_w_0_num_0_at_2017-01-03 09:38:00>.
episode: 1
length: 1424
runtime: 0:00:01.553601
[2018-01-19 01:55:50.253536] INFO: SimpleDataSet_0: New sample id: <train_trial_w_0_num_4_at_2017-01-04 11:51:00>.
[2018-01-19 01:55:50.264350] INFO: Trial_0: New sample id: <train_episode_w_0_num_0_at_2017-01-03 21:30:00>.
episode: 2
length: 1424
runtime: 0:00:01.539417
[2018-01-19 01:55:51.793564] INFO: BTgymServer_0: Exiting.
episode: 3
length: 1424
runtime: 0:00:01.394918
[2018-01-19 01:55:51.795087] INFO: BTgymAPIshell_0: Exiting. Exit code: None
[2018-01-19 01:55:51.796303] INFO: BTgymDataServer_0: {'ctrl': 'Exiting.'}
[2018-01-19 01:55:51.797510] INFO: BTgymAPIshell_0: {'ctrl': 'Exiting.'} Exit code: None
[2018-01-19 01:55:51.798299] INFO: BTgymAPIshell_0: Environment closed.


Kismuz avatar Kismuz commented on May 29, 2024

That one was tricky, but it's good that it surfaced. Corrected; please update and try again.
I also installed Python 3.5 (the same as yours, in case the error is version-dependent) and ran the tests, but it still works on my machine.


knn940506 avatar knn940506 commented on May 29, 2024

Sadly, it doesn't work. Maybe the error comes from something else. I'll give you feedback soon.
Thanks a lot :)
