substantic / rain
Framework for large distributed pipelines
Home Page: https://substantic.github.io/rain/docs/
License: MIT License
I tried executing the example from the README using Windows Subsystem for Linux. The server starts up with no apparent errors, but executing the Python client script results in the client crashing with a SIGSEGV address boundary error. I tried running it with strace, but there was nothing obvious in the output. I can include the strace output if you want, but I'm not sure what other information I can gather to help diagnose.
I'm using Windows 10 Enterprise 10.0.16299 with Ubuntu 16.04 in WSL. Something to note is that I had to install the cython system package before running pip3 install rain-python. I used virtualenv to create a virtualized Python install into which I installed rain-python.
Testing the same setup, but without the virtualized Python, in an Ubuntu 16.04 Docker container works, so it's something to do with WSL.
Is it possible to get the server working on Windows directly? I tried compiling it but it makes use of crates that provide Unix-only functions like gethostname. Is there no cross-platform equivalent in Rust std or another crate?
The documentation at http://rain.readthedocs.io/en/latest/install.html states that only Rust and SQLite3 are required to install Rain from sources, but the Cap'n Proto compiler is needed too :)
Currently a crash of a subworker may crash a worker, and a crash of a worker may crash the server. We need to improve this. However, we are not aiming for infrastructure resiliency now. A subworker crash may still fail the task (and thus the session), and a worker crash may still lose all the objects and fail all involved sessions. The main goal is to keep the server running and deliver a graceful error.
Robust failure handling will open the road to retrying tasks (possibly on different workers) and later to worker crash resiliency.
Rain server panics here while a task becomes ready. The relevant part of the log seems to be the following:
...
DEBUG 2018-03-17T15:31:49Z: librain::server::scheduler: Scheduler: New ready task (1,23092)
... [many New ready task info lines, various IDs]
DEBUG 2018-03-17T15:31:49Z: librain::server::scheduler: Scheduler: New ready task (1,23092)
thread 'main' panicked at 'assertion failed: r', src/server/scheduler.rs:148:17
stack backtrace:
0: std::sys::unix::backtrace::tracing::imp::unwind_backtrace
DEBUG 2018-03-17T15:31:49Z: tokio_reactor: loop process - 1 events, 0.000s
at libstd/sys/unix/backtrace/tracing/gcc_s.rs:49
1: std::sys_common::backtrace::_print
at libstd/sys_common/backtrace.rs:71
2: std::panicking::default_hook::{{closure}}
at libstd/sys_common/backtrace.rs:59
at libstd/panicking.rs:207
3: std::panicking::default_hook
at libstd/panicking.rs:223
4: std::panicking::rust_panic_with_hook
at libstd/panicking.rs:402
5: std::panicking::begin_panic
6: librain::server::scheduler::ReactiveScheduler::schedule
7: librain::server::state::State::run_scheduler
8: librain::server::state::<impl librain::common::wrapped::WrappedRcRefCell<librain::server::state::State>>::turn
9: rain::main
10: std::rt::lang_start::{{closure}}
11: std::panicking::try::do_call
at libstd/rt.rs:59
at libstd/panicking.rs:306
12: __rust_maybe_catch_panic
at libpanic_unwind/lib.rs:102
13: std::rt::lang_start_internal
at libstd/panicking.rs:285
at libstd/panic.rs:361
at libstd/rt.rs:58
14: main
15: __libc_start_main
16: _start
DEBUG 2018-03-17T15:31:49Z: tokio_reactor: loop process - 1 events, 0.000s
DEBUG 2018-03-17T15:31:49Z: tokio_reactor::background: shutting background reactor down NOW
...
However, a small test with multiple identical inputs passes, even with subsequent submits. The benchmark only fails with >500 tasks per layer. See the benchmark attached. It was run as python3 scalebench.py net -l 256 -w 1024 -s 0; the error happens around layer 10.
The debug checks with RAIN_DEBUG_MODE=1 do not find any consistency problems.
The following code crashes rain on Ubuntu 17.10:
# crash.py
from rain.client import Client, tasks, blob

client = Client("localhost", 7210)

with client.new_session() as session:
    task = tasks.execute(["echo", "-n"], stdout=True)
    task.output.keep()
    session.submit()
    print(task.output.fetch().get_bytes())  # Should print b""
$ ./rain start --simple && python crash.py
INFO 2018-04-09T09:51:14Z Starting Rain 0.2.0
INFO 2018-04-09T09:51:14Z Log directory: /tmp/rain-logs/worker-cmp-12356
INFO 2018-04-09T09:51:14Z Starting local server (0.0.0.0:7210)
INFO 2018-04-09T09:51:14Z Dashboard: http://cmp:8080/
INFO 2018-04-09T09:51:14Z Server pid = 12357
INFO 2018-04-09T09:51:14Z Process 'server' is ready
INFO 2018-04-09T09:51:14Z Starting 1 local worker(s)
INFO 2018-04-09T09:51:14Z Process 'worker-0' is ready
INFO 2018-04-09T09:51:14Z Rain started.
Traceback (most recent call last):
File "crash.py", line 10, in <module>
print(task.output.fetch().get_bytes()) # Should print b""
File "/home/user/anaconda3/lib/python3.6/site-packages/rain/client/data.py", line 86, in fetch
return self.session.fetch(self)
File "/home/user/anaconda3/lib/python3.6/site-packages/rain/client/session.py", line 232, in fetch
return self.client._fetch(dataobject)
File "/home/user/anaconda3/lib/python3.6/site-packages/rain/client/client.py", line 125, in _fetch
result = req.send().wait()
File "capnp/lib/capnp.pyx", line 1932, in capnp.lib.capnp._RemotePromise.wait (capnp/lib/capnp.cpp:41109)
capnp.lib.capnp.KjException: capnp/rpc.c++:2092: disconnected: Peer disconnected.
stack: 0x7f6787fa9840 0x7f6787fa8140 0x7f6787f39db7 0x7f6787f23f03 0x7f6787f24027 0x7f6787f2440b 0x55751bc9501a 0x55751bd22bec 0x55751bd4719a 0x55751bd1c7db 0x55751bd22cc5 0x55751bd4719a 0x55751bd1c7db 0x55751bd22cc5 0x55751bd4719a 0x55751bd1c7db 0x55751bd22cc5 0x55751bd4719a 0x55751bd1d529 0x55751bd1e2cc 0x55751bd9aaf4 0x55751bd9aef1 0x55751bd9b0f4 0x55751bd9ec28 0x55751bc6671e 0x7f67893411c1 0x55751bd4dc98
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "crash.py", line 10, in <module>
print(task.output.fetch().get_bytes()) # Should print b""
File "/home/user/anaconda3/lib/python3.6/site-packages/rain/client/session.py", line 108, in __exit__
self.close()
File "/home/user/anaconda3/lib/python3.6/site-packages/rain/client/session.py", line 116, in close
self.client._close_session(self)
File "/home/user/anaconda3/lib/python3.6/site-packages/rain/client/client.py", line 163, in _close_session
self._service.closeSession(session.session_id).wait()
File "capnp/lib/capnp.pyx", line 1932, in capnp.lib.capnp._RemotePromise.wait (capnp/lib/capnp.cpp:41109)
capnp.lib.capnp.KjException: capnp/rpc.c++:2092: disconnected: Peer disconnected.
stack: 0x7f6787fa8140 0x7f6787f39db7 0x7f6787f23f03 0x7f6787f24027 0x7f6787f2440b 0x55751bc9501a 0x55751bd22bec 0x55751bd4719a 0x55751bd1c7db 0x55751bd22cc5 0x55751bd4719a 0x55751bd1c7db 0x55751bd22cc5 0x55751bd4719a 0x55751bd1ce4b 0x55751bc9539f 0x55751bc99ff3 0x55751bc951bb 0x55751bcb340d 0x55751bd48d22 0x55751bd1d529 0x55751bd1e2cc 0x55751bd9aaf4 0x55751bd9aef1 0x55751bd9b0f4 0x55751bd9ec28 0x55751bc6671e 0x7f67893411c1 0x55751bd4dc98
A document to track the directions from 0.3, replacing #26. Our mid- and long-term goals, their [priority], (assignee) and any sub-tasks.
Any help is welcome with mentoring available for most tasks!
Will be updated after prioritization discussion.
Replace the capnp RPC and the current monitoring dashboard HTTP API with a common protocol.
Part of #11 (more discussion there) but specific to the public API.
Multiple options, priorities may vary. (@spirali)
Pythonize the client API.
rain start and running on OpenStack, Exoscale, AWS. Does not have to be a part of CI (even for running locally). Depends on / is part of #37.
Lower priority, best based on real use-cases. Ideas: numpy subtasks, C++/Rust subworkers
utils/bench/simple_task_scaling.py. The results as of 0.2 are here.
README.md:
Installation via cargo
If you have installed Rust, you can install and start Rain as follows:
$ cargo install rust_server
$ pip3 install rain-python
$ rain start --simple
Shouldn't it be:
cargo install rain_server
?
This is a proposal for a directory manipulation API. A directory can be considered a tarball blob with content type "dir". The only special "magic" happens when such a data object is mapped to a file system: then it is unpacked as a directory. It also works in the opposite direction: when a data object of content type "dir" is collected from the file system, it has to be a directory (in contrast to a file in the case of a non-dir content type). Of course, the usual optimizations for moving data objects inside a worker apply, so there is no superfluous packing/unpacking to/from tar when it is not needed.
A wrapper over blob (like 'pickled', for example) that creates a constant data object:
directory(path="/abc") # Creates a data object from client's local directory
directory(dict={"fileX": b"xxxx", "fileY": b"yyyy" }) # Creates a directory dataobject from "dict" description.
Now let "d" be a directory dataobject; then the following tasks create a new dataobject from a file/directory inside "d":
tasks.get_path(d, "subdir/file1.txt", content_type="json")
tasks.get_path(d, "subdir", content_type="dir")
I propose to define a new term "fragment": a pointer inside a directory. The following code creates a fragment:
x = d.get_path("subdir/file1.txt", content_type="json")
A fragment can be used as a dataobject, e.g.:
t = tasks.concat((x, x))
But creating a fragment does not generate a task; the task consuming the fragment directly gets the subpart of the original object. (Internally this relies on the ability to name a subdirectory in the inputs of a task.)
Part of the proposed API is also a "write" method on DataInstance that writes an object to the local file system. If the data object is a normal blob, it creates a file; if it is a directory, it is extracted as a directory to the filesystem.
d.fetch().write("/path")
A task that creates a directory from given data objects:
tasks.make_directory({"subdir": {"file1": a }})
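To make the tarball semantics above concrete, here is a minimal self-contained sketch using Python's tarfile module. The function names mirror the proposal (directory from a dict, get_path extraction), but this is an illustration of the intended behaviour, not the real Rain API:

```python
import io
import tarfile

def directory_from_dict(files):
    """Pack a {name: bytes} mapping into a tarball, mimicking a data
    object with content type "dir" (sketch, not Rain's implementation)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()

def get_path(dir_blob, path):
    """Extract a single member, like the proposed tasks.get_path."""
    with tarfile.open(fileobj=io.BytesIO(dir_blob)) as tar:
        return tar.extractfile(path).read()

blob = directory_from_dict({"fileX": b"xxxx", "fileY": b"yyyy"})
assert get_path(blob, "fileX") == b"xxxx"
```

The important property this demonstrates is that a "dir" object can live as a single blob and only needs unpacking at the filesystem boundary.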
While it currently serves reasonably well, Capnp has several drawbacks for Rain:
Therefore it might be better to use different protocols for different interfaces:
This issue is a basis for discussion of requirements and options. The transition from Capnp is not critical.
The start command conveniently launches a cluster (server + workers), but it makes it a bit cumbersome to stop the cluster.
As a short-term, hacky solution, a simple killall on Linux could do it, or the start command could run the server in-process, so that it would block while the server lives.
But in general I think that the server should have a more general and graceful option to stop itself, defined in its API.
Proposal:
The stop event would stop all the connected workers and then shut down the server.
The command could look like this:
rain stop server:1234
It would probably make sense to stop individual workers too, possibly with the same command and a --worker flag.
The ability for clients to shut down the server is a non-issue from a security standpoint, since there is already no client authentication and the cluster is assumed to run in a trusted environment.
There could also be an option for a "soft stop" that would cause the server to stop scheduling tasks, first wait for the in-progress tasks to finish (and maybe checkpoint them to disk), and only then stop the cluster.
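The proposed soft-stop sequence can be sketched as a tiny state machine. All names here are made up for illustration; this is not Rain's server code, and task completion is simulated rather than awaited:

```python
class Server:
    """Illustrative sketch of the proposed "soft stop" sequence."""

    def __init__(self):
        self.accepting = True     # scheduler accepts new tasks
        self.in_progress = set()  # tasks currently running on workers

    def task_finished(self, task):
        self.in_progress.discard(task)

    def soft_stop(self):
        self.accepting = False            # 1. stop scheduling new tasks
        while self.in_progress:           # 2. wait for in-flight tasks
            # (simulated here by finishing them immediately)
            self.task_finished(next(iter(self.in_progress)))
        return "stopped"                  # 3. stop workers, then the server

s = Server()
s.in_progress = {"task-1", "task-2"}
assert s.soft_stop() == "stopped" and not s.accepting
```

The key design point is ordering: refuse new work first, drain, and only then tear the cluster down.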
The attributes are currently not well-specified and there seem to be ways to implement them better before we release the (much more stable) v0.3.
Goals:
Hi all,
when installing rain-python using pip3 install rain-python, I get the following error:
Installing collected packages: pycapnp, rain-python
Running setup.py install for pycapnp ... error
Complete output from command /Library/Frameworks/Python.framework/Versions/3.6/bin/python3.6 -u -c "import setuptools, tokenize;__file__='/private/var/folders/y3/xb_8fx894lq_89d7kv0k3gq00000gp/T/pip-install-pdz0rc5o/pycapnp/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /private/var/folders/y3/xb_8fx894lq_89d7kv0k3gq00000gp/T/pip-record-5zxre8wk/install-record.txt --single-version-externally-managed --compile:
Compiling capnp/lib/capnp.pyx because it changed.
[1/1] Cythonizing capnp/lib/capnp.pyx
running install
running build
running build_py
creating build/lib.macosx-10.6-intel-3.6
creating build/lib.macosx-10.6-intel-3.6/capnp
copying capnp/version.py -> build/lib.macosx-10.6-intel-3.6/capnp
copying capnp/__init__.py -> build/lib.macosx-10.6-intel-3.6/capnp
copying capnp/_gen.py -> build/lib.macosx-10.6-intel-3.6/capnp
copying capnp/__init__.pxd -> build/lib.macosx-10.6-intel-3.6/capnp
copying capnp/schema.capnp -> build/lib.macosx-10.6-intel-3.6/capnp
copying capnp/c++.capnp -> build/lib.macosx-10.6-intel-3.6/capnp
creating build/lib.macosx-10.6-intel-3.6/capnp/helpers
copying capnp/helpers/helpers.pxd -> build/lib.macosx-10.6-intel-3.6/capnp/helpers
copying capnp/helpers/non_circular.pxd -> build/lib.macosx-10.6-intel-3.6/capnp/helpers
copying capnp/helpers/__init__.pxd -> build/lib.macosx-10.6-intel-3.6/capnp/helpers
copying capnp/helpers/rpcHelper.h -> build/lib.macosx-10.6-intel-3.6/capnp/helpers
copying capnp/helpers/fixMaybe.h -> build/lib.macosx-10.6-intel-3.6/capnp/helpers
copying capnp/helpers/capabilityHelper.h -> build/lib.macosx-10.6-intel-3.6/capnp/helpers
copying capnp/helpers/asyncHelper.h -> build/lib.macosx-10.6-intel-3.6/capnp/helpers
copying capnp/helpers/checkCompiler.h -> build/lib.macosx-10.6-intel-3.6/capnp/helpers
copying capnp/helpers/serialize.h -> build/lib.macosx-10.6-intel-3.6/capnp/helpers
creating build/lib.macosx-10.6-intel-3.6/capnp/includes
copying capnp/includes/types.pxd -> build/lib.macosx-10.6-intel-3.6/capnp/includes
copying capnp/includes/__init__.pxd -> build/lib.macosx-10.6-intel-3.6/capnp/includes
copying capnp/includes/schema_cpp.pxd -> build/lib.macosx-10.6-intel-3.6/capnp/includes
copying capnp/includes/capnp_cpp.pxd -> build/lib.macosx-10.6-intel-3.6/capnp/includes
creating build/lib.macosx-10.6-intel-3.6/capnp/lib
copying capnp/lib/capnp.pxd -> build/lib.macosx-10.6-intel-3.6/capnp/lib
copying capnp/lib/__init__.pxd -> build/lib.macosx-10.6-intel-3.6/capnp/lib
copying capnp/lib/__init__.py -> build/lib.macosx-10.6-intel-3.6/capnp/lib
copying capnp/lib/pickle_helper.py -> build/lib.macosx-10.6-intel-3.6/capnp/lib
copying capnp/lib/capnp.pyx -> build/lib.macosx-10.6-intel-3.6/capnp/lib
creating build/lib.macosx-10.6-intel-3.6/capnp/templates
copying capnp/templates/setup.py.tmpl -> build/lib.macosx-10.6-intel-3.6/capnp/templates
copying capnp/templates/module.pyx -> build/lib.macosx-10.6-intel-3.6/capnp/templates
running build_ext
creating var/folders/y3/xb_8fx894lq_89d7kv0k3gq00000gp/T/tmpl5n886va
/usr/bin/clang -fno-strict-aliasing -Wsign-compare -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -arch i386 -arch x86_64 -g -c /var/folders/y3/xb_8fx894lq_89d7kv0k3gq00000gp/T/tmpl5n886va/vers.cpp -o var/folders/y3/xb_8fx894lq_89d7kv0k3gq00000gp/T/tmpl5n886va/vers.o --std=c++11
/var/folders/y3/xb_8fx894lq_89d7kv0k3gq00000gp/T/tmpl5n886va/vers.cpp:4:10: fatal error: 'capnp/common.h' file not found
#include "capnp/common.h"
^~~~~~~~~~~~~~~~
1 error generated.
*WARNING* no libcapnp detected or rebuild forced. Will download and build it from source now. If you have C++ Cap'n Proto installed, it may be out of date or is not being detected. Downloading and building libcapnp may take a while.
already have /private/var/folders/y3/xb_8fx894lq_89d7kv0k3gq00000gp/T/pip-install-pdz0rc5o/pycapnp/bundled/capnproto-c++
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/ranlib: file: /private/var/folders/y3/xb_8fx894lq_89d7kv0k3gq00000gp/T/pip-install-pdz0rc5o/pycapnp/build/lib/libkj-async.a(async-win32.o) has no symbols
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/ranlib: file: /private/var/folders/y3/xb_8fx894lq_89d7kv0k3gq00000gp/T/pip-install-pdz0rc5o/pycapnp/build/lib/libkj-async.a(async-io-win32.o) has no symbols
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/ranlib: file: /private/var/folders/y3/xb_8fx894lq_89d7kv0k3gq00000gp/T/pip-install-pdz0rc5o/pycapnp/build/lib/libcapnp.a(list.o) has no symbols
building 'capnp.lib.capnp' extension
creating build/temp.macosx-10.6-intel-3.6
creating build/temp.macosx-10.6-intel-3.6/capnp
creating build/temp.macosx-10.6-intel-3.6/capnp/lib
/usr/bin/clang -fno-strict-aliasing -Wsign-compare -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -arch i386 -arch x86_64 -g -I. -I/Library/Frameworks/Python.framework/Versions/3.6/include/python3.6m -I/private/var/folders/y3/xb_8fx894lq_89d7kv0k3gq00000gp/T/pip-install-pdz0rc5o/pycapnp/build/include -c capnp/lib/capnp.cpp -o build/temp.macosx-10.6-intel-3.6/capnp/lib/capnp.o --std=c++11
In file included from capnp/lib/capnp.cpp:520:
./capnp/helpers/checkCompiler.h:4:8: warning: "Your compiler supports C++11 but your C++ standard library does not. If your system has libc++ installed (as should be the case on e.g. Mac OSX), try adding -stdlib=libc++ to your CFLAGS (ignore the other warning that says to use CXXFLAGS)." [-W#warnings]
#warning "Your compiler supports C++11 but your C++ standard library does not. If your system has libc++ installed (as should be the case on e.g. Mac OSX), try adding -stdlib=libc++ to your CFLAGS (ignore the other warning that says to use CXXFLAGS)."
^
In file included from capnp/lib/capnp.cpp:520:
In file included from ./capnp/helpers/checkCompiler.h:9:
In file included from /private/var/folders/y3/xb_8fx894lq_89d7kv0k3gq00000gp/T/pip-install-pdz0rc5o/pycapnp/build/include/capnp/dynamic.h:40:
In file included from /private/var/folders/y3/xb_8fx894lq_89d7kv0k3gq00000gp/T/pip-install-pdz0rc5o/pycapnp/build/include/capnp/schema.h:33:
In file included from /private/var/folders/y3/xb_8fx894lq_89d7kv0k3gq00000gp/T/pip-install-pdz0rc5o/pycapnp/build/include/capnp/schema.capnp.h:7:
In file included from /private/var/folders/y3/xb_8fx894lq_89d7kv0k3gq00000gp/T/pip-install-pdz0rc5o/pycapnp/build/include/capnp/generated-header-support.h:31:
In file included from /private/var/folders/y3/xb_8fx894lq_89d7kv0k3gq00000gp/T/pip-install-pdz0rc5o/pycapnp/build/include/capnp/raw-schema.h:29:
In file included from /private/var/folders/y3/xb_8fx894lq_89d7kv0k3gq00000gp/T/pip-install-pdz0rc5o/pycapnp/build/include/capnp/common.h:34:
/private/var/folders/y3/xb_8fx894lq_89d7kv0k3gq00000gp/T/pip-install-pdz0rc5o/pycapnp/build/include/kj/string.h:29:10: fatal error: 'initializer_list' file not found
#include <initializer_list>
^~~~~~~~~~~~~~~~~~
1 warning and 1 error generated.
error: command '/usr/bin/clang' failed with exit status 1
----------------------------------------
Command "/Library/Frameworks/Python.framework/Versions/3.6/bin/python3.6 -u -c "import setuptools, tokenize;__file__='/private/var/folders/y3/xb_8fx894lq_89d7kv0k3gq00000gp/T/pip-install-pdz0rc5o/pycapnp/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /private/var/folders/y3/xb_8fx894lq_89d7kv0k3gq00000gp/T/pip-record-5zxre8wk/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /private/var/folders/y3/xb_8fx894lq_89d7kv0k3gq00000gp/T/pip-install-pdz0rc5o/pycapnp/
clang version
/usr/bin/clang --version
Apple LLVM version 9.1.0 (clang-902.0.39.2)
Target: x86_64-apple-darwin17.7.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
Can someone give me advice on how to fix this error?
Rain worker crashes with thread 'main' panicked at 'DataObject is not finished', src/worker/graph/dataobj.rs:81:18.
The deployment consisted of 1 local and 1 remote worker running on Exoscale infrastructure connecting from a remote client.
Worker log including stack backtrace:
ubuntu@test-0:~$ rain worker <server-ip>:7210 --listen=7211
INFO 2018-03-26T10:22:40Z Starting Rain 0.1.1 as worker
INFO 2018-03-26T10:22:40Z Resources: 1 cpus
INFO 2018-03-26T10:22:40Z Working directory: "/tmp/rain-work/worker-test-0-1261"
INFO 2018-03-26T10:22:40Z Server address <server-ip>:7210 was resolved as <server-ip>:7210
INFO 2018-03-26T10:22:40Z Start listening on port=7211
INFO 2018-03-26T10:22:40Z Connecting to server addr=<server-ip>:7210
INFO 2018-03-26T10:22:40Z Connected to server; registering as worker
INFO 2018-03-26T10:22:43Z New connection from <server-ip>:40118
INFO 2018-03-26T10:23:33Z New connection from 127.0.0.1:54122
thread 'main' panicked at 'DataObject is not finished', src/worker/graph/dataobj.rs:81:18
note: Run with `RUST_BACKTRACE=1` for a backtrace.
ubuntu@test-0:~$ RUST_BACKTRACE=1 rain worker 185.150.9.32:7210 --listen=7211
INFO 2018-03-26T10:24:19Z Starting Rain 0.1.1 as worker
INFO 2018-03-26T10:24:19Z Resources: 1 cpus
INFO 2018-03-26T10:24:19Z Working directory: "/tmp/rain-work/worker-test-0-1317"
INFO 2018-03-26T10:24:19Z Server address 185.150.9.32:7210 was resolved as 185.150.9.32:7210
INFO 2018-03-26T10:24:19Z Start listening on port=7211
INFO 2018-03-26T10:24:19Z Connecting to server addr=185.150.9.32:7210
INFO 2018-03-26T10:24:19Z Connected to server; registering as worker
INFO 2018-03-26T10:24:23Z New connection from 185.150.9.32:40124
INFO 2018-03-26T10:24:24Z New connection from 127.0.0.1:54126
thread 'main' panicked at 'DataObject is not finished', src/worker/graph/dataobj.rs:81:18
stack backtrace:
0: std::sys::unix::backtrace::tracing::imp::unwind_backtrace
at /checkout/src/libstd/sys/unix/backtrace/tracing/gcc_s.rs:49
1: std::sys_common::backtrace::print
at /checkout/src/libstd/sys_common/backtrace.rs:68
at /checkout/src/libstd/sys_common/backtrace.rs:57
2: std::panicking::default_hook::{{closure}}
at /checkout/src/libstd/panicking.rs:381
3: std::panicking::default_hook
at /checkout/src/libstd/panicking.rs:397
4: std::panicking::rust_panic_with_hook
at /checkout/src/libstd/panicking.rs:577
5: std::panicking::begin_panic
6: <librain::worker::rpc::datastore::DataStoreImpl as librain::datastore_capnp::data_store::Server>::create_reader
7: <librain::datastore_capnp::data_store::ServerDispatch<_T> as capnp::capability::Server>::dispatch_call
8: <futures::future::lazy::Lazy<F, R> as futures::future::Future>::poll
9: <capnp_rpc::attach::AttachFuture<F, T> as futures::future::Future>::poll
10: <futures::future::chain::Chain<A, B, C>>::poll
11: <futures::future::chain::Chain<A, B, C>>::poll
12: futures::task_impl::std::set
13: <capnp_rpc::forked_promise::ForkedPromise<F> as futures::future::Future>::poll
14: <futures::future::chain::Chain<A, B, C>>::poll
15: <futures::future::select::Select<A, B> as futures::future::Future>::poll
16: <futures::future::map::Map<A, F> as futures::future::Future>::poll
17: <futures::future::map_err::MapErr<A, F> as futures::future::Future>::poll
18: <futures::future::chain::Chain<A, B, C>>::poll
19: futures::task_impl::std::set
20: <futures::stream::futures_unordered::FuturesUnordered<T> as futures::stream::Stream>::poll
21: <capnp_rpc::task_set::TaskSet<T, E> as futures::future::Future>::poll
22: <futures::future::chain::Chain<A, B, C>>::poll
23: futures::task_impl::std::set
24: <futures::stream::futures_unordered::FuturesUnordered<T> as futures::stream::Stream>::poll
25: <capnp_rpc::task_set::TaskSet<T, E> as futures::future::Future>::poll
26: <futures::future::map_err::MapErr<A, F> as futures::future::Future>::poll
27: <futures::task_impl::Spawn<T>>::poll_future_notify
28: <std::thread::local::LocalKey<T>>::with
29: <tokio::executor::current_thread::scheduler::Scheduler<U>>::tick
30: <scoped_tls::ScopedKey<T>>::set
31: <std::thread::local::LocalKey<T>>::with
32: <std::thread::local::LocalKey<T>>::with
33: tokio_core::reactor::Core::poll
34: tokio_core::reactor::Core::turn
35: rain::main
36: std::rt::lang_start::{{closure}}
37: std::panicking::try::do_call
at /checkout/src/libstd/rt.rs:59
at /checkout/src/libstd/panicking.rs:480
38: __rust_maybe_catch_panic
at /checkout/src/libpanic_unwind/lib.rs:101
39: std::rt::lang_start_internal
at /checkout/src/libstd/panicking.rs:459
at /checkout/src/libstd/panic.rs:365
at /checkout/src/libstd/rt.rs:58
40: main
41: __libc_start_main
42: _start
Server log:
INFO 2018-03-26T10:24:05Z Starting Rain 0.1.1 server at port 0.0.0.0:7210
INFO 2018-03-26T10:24:05Z Start listening on address=0.0.0.0:7210
INFO 2018-03-26T10:24:05Z Dashboard is running at http://test-1:8080/
INFO 2018-03-26T10:24:05Z Lite dashboard is running at http://test-1:8080/lite/
INFO 2018-03-26T10:24:16Z New connection from 127.0.0.1:54004
INFO 2018-03-26T10:24:16Z Connection 127.0.0.1:54004 registered as worker 127.0.0.1:7211 with Resources { cpus: 1 }
INFO 2018-03-26T10:24:20Z New connection from 185.150.8.101:42994
INFO 2018-03-26T10:24:20Z Connection 185.150.8.101:42994 registered as worker 185.150.8.101:7211 with Resources { cpus: 1 }
INFO 2018-03-26T10:24:24Z New connection from 195.113.113.226:55798
INFO 2018-03-26T10:24:24Z Connection 195.113.113.226:55798 registered as client
INFO 2018-03-26T10:24:24Z New task submission (11 tasks, 11 data objects) from client 195.113.113.226:55798
INFO 2018-03-26T10:24:24Z New get_state request (0 tasks, 1 data objects) from client
INFO 2018-03-26T10:24:24Z Client 195.113.113.226:55798 disconnected
INFO 2018-03-26T10:24:25Z New connection from 195.113.113.226:55800
INFO 2018-03-26T10:24:25Z Connection 195.113.113.226:55800 registered as client
INFO 2018-03-26T10:24:25Z New task submission (11 tasks, 11 data objects) from client 195.113.113.226:55800
ERROR 2018-03-26T10:24:26Z Connection to worker 185.150.8.101:7211 lost
thread 'main' panicked at 'not yet implemented', src/server/state.rs:94:9
note: Run with `RUST_BACKTRACE=1` for a backtrace.
Implement a library to make Rust subworker implementation easier. One possible inspiration is Rocket (as in the example below).
use rain_task::{Context, DataInstance, Output, Result, Subworker};

// One possibility: macro magic
#[rain_task("sometask")]
pub fn sometask(ctx: &mut Context, in1: &DataInstance, in2: &DataInstance, out1: &mut Output) -> Result<()> {
    write!(out1, "Length of in1 is {}, type of in2 is {}", in1.len(), in2.content_type())?;
    out1.set_content_type("text")?;
    // Set user attributes to anything serializable
    ctx.set_attribute("my-attr", [42, 43])?;
    Ok(())
}

// Variable inputs / outputs
pub fn othertask(ctx: &mut Context, ins: &[DataInstance], outs: &mut [Output]) -> Result<()> {
    // ...
    Ok(())
}

pub fn main() {
    // annotated functions are added automatically here
    let mut s = Subworker::with_annotated();
    // macro to add a fixed-arity function
    add_task!(s, 2, 1, sometask).unwrap();
    // add a variadic task
    s.add_task("othertask", othertask).unwrap();
    s.run().unwrap();
}
Capnp limits builders to 8M words (64 MB), which bounds the number of submitted nodes to circa 200 000. This limit is fixed in message.h, since Builder does not accept any options and uses the default limit.
INFO 2018-03-17T14:53:11Z Starting local server (0.0.0.0:7210)
INFO 2018-03-17T14:53:11Z Dashboard: http://lomikamen:8080/
INFO 2018-03-17T14:53:11Z Server pid = 95230
ERROR 2018-03-17T14:53:11Z Process 'server' terminated with exit code exit code: 101; process outputs can be found in server.{out/err}
INFO 2018-03-17T14:53:11Z Error occurs; clean up started processes ...
Expect 16384 objects of size 32768, total 512.000 MB
Traceback (most recent call last):
File "scalebench.py", line 61, in <module>
session.submit()
File "/aux/gavento/rtest/rain/python/rain/client/session.py", line 164, in submit
self.client._submit(self._tasks, self._dataobjs)
File "/aux/gavento/rtest/rain/python/rain/client/client.py", line 82, in _submit
req.send().wait()
File "capnp/lib/capnp.pyx", line 1932, in capnp.lib.capnp._RemotePromise.wait (capnp/lib/capnp.cpp:41109)
capnp.lib.capnp.KjException: src/capnp/rpc.c++:1527: context: sending RPC call; callBuilder.getInterfaceId() = 14056942509132787212; callBuilder.getMethodId() = 3
capnp/rpc-twoparty.c++:92: failed: expected size < ReaderOptions().traversalLimitInWords; size = 12047798; Trying to send Cap'n Proto message larger than the single-message size limit. The other side probably won't accept it and would abort the connection, so I won't send it.
stack: 0x7ffff66aacd6 0x7ffff66a88c0 0x7ffff66a7580 0x7ffff663d0e9 0x7ffff661e9f3 0x7ffff661eb1d 0x7ffff661ef4b 0x5555556d4411 0x5555556d493f 0x5555556d493f 0x5555556d9286 0x5555556d9f9f 0x5555557a78f2 0x5555557a9e1d 0x5555557aa5be 0x5555557d84d7 0x555555668c01 0x7ffff6cee2b1 0x5555557721ba
Add a task attribute to allow only a subset of workers. An empty attribute
Usage: open and store tasks (but not limited to them).
Constant data objects already have a similar feature for placement, but this API is not planned to be published (and it is not really relevant for computed data objects).
Alternative: Every worker could be a resource, required by the task. However, this is not ergonomic and we do not have a clear picture of how to do (virtual) resources in the right way.
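A minimal sketch of how such an attribute could behave, assuming the empty attribute means "no restriction" (the helper name and that assumption are illustrative, not part of Rain):

```python
def eligible_workers(task_allowed, workers):
    """Return the workers a task may run on. An empty allowed-set is
    assumed to mean "any worker"; otherwise only listed workers apply."""
    if not task_allowed:
        return list(workers)
    allowed = set(task_allowed)
    return [w for w in workers if w in allowed]

workers = ["w0", "w1", "w2"]
assert eligible_workers([], workers) == workers      # empty: any worker
assert eligible_workers(["w1"], workers) == ["w1"]   # restricted subset
```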
$ ./rain start --worker-host-file=hosts.txt
INFO 2018-04-11T08:15:45Z Starting Rain 0.2.0
INFO 2018-04-11T08:15:45Z Log directory: /tmp/rain-logs/worker-p1-16085
INFO 2018-04-11T08:15:45Z Starting local server (0.0.0.0:7210)
INFO 2018-04-11T08:15:45Z Dashboard: http://p1:8080/
INFO 2018-04-11T08:15:45Z Server pid = 16086
INFO 2018-04-11T08:15:45Z Process 'server' is ready
INFO 2018-04-11T08:15:45Z Starting 1 remote worker(s)
INFO 2018-04-11T08:15:45Z Connecting to jupiter (remote log dir: "/tmp/rain-logs/worker-p1-16085")
ERROR 2018-04-11T08:15:45Z Remote process at jupiter, the following error occurs: Cannot create log file
INFO 2018-04-11T08:15:45Z Error occurs; clean up started processes ...
I believe the log directory should be created first, by calling mkdir -p {log_dir} before trying to create the log file at
Line 98 in f3ec16c
since touch does not create parent directories on my Debian server.
In the current version, a data instance has a data type with possible values: blob, directory. After some discussion it seems that it makes sense to put data types into data objects as well, i.e. the task graph already contains information about the data type.
This is a proposal for how to handle this in the Python API. It is relevant for Python tasks and for tasks.execute + the Program class.
Currently, the user indicates that an output is a directory by setting content_type to 'dir', e.g.:
Output("mydata", content_type="dir")
Here we propose to introduce an OutputDir class to indicate a directory output, and to reserve Output for the blob data type. Both can be used in remote Python tasks, tasks.execute, and Program.
Output("mydata") # blob
OutputDir("mydata") # directory
To make it symmetric, we can also introduce Input/InputDir for blob/directory inputs. Strictly speaking this is not necessary, as the data type may be obtained from the provided object (and in the case of Program, the decision on the data type may be postponed). However, the idea is to provide an additional level of "type" check:
Input("mydata", dataobj=d) # Fail if 'd' is data object of directory type
InputDir("mydata", dataobj=d) # Fail if 'd' is data object of blob type
In the case of an implicit input, the right data type is derived from the provided data object:
tasks.execute(["du", "-h", d]) # This will work for 'd' being directory or blob
An alternative is a data_type parameter for Input and Output:
Input("mydata", data_type=DataType.Blob)
Output("mydata", data_type=DataType.Directory)
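The early type check the proposal describes could look roughly like this. Everything here is a hypothetical sketch (the DataObj stand-in, the error message, and the constructor logic are not Rain's code):

```python
from collections import namedtuple
from enum import Enum

class DataType(Enum):
    Blob = "blob"
    Directory = "dir"

# Stand-in for a real data object carrying its data type
DataObj = namedtuple("DataObj", "data_type")

class Input:
    """Sketch of the proposed check: fail early when the declared data
    type disagrees with the provided data object's type."""
    def __init__(self, label, dataobj=None, data_type=DataType.Blob):
        if dataobj is not None and dataobj.data_type != data_type:
            raise TypeError("%s: expected %s, got %s"
                            % (label, data_type.value, dataobj.data_type.value))
        self.label = label
        self.data_type = data_type

d = DataObj(DataType.Directory)
try:
    Input("mydata", dataobj=d)   # fails: 'd' is a directory, default is blob
except TypeError as e:
    print(e)                     # mydata: expected blob, got dir
```

The benefit is that a mismatch surfaces at graph-construction time on the client, not as a task failure inside the cluster.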
The purpose of this proposal is to make more public and more explicit two functions already internally implemented in the worker (under different names). They serve to map data instances to the file system. The proposed concepts are:
(1) Note that linking may silently fall back to writing, for example when we only have an in-memory representation of a data object.
(2) Note that with the upcoming directory support, it is expected that linking/writing a "dir" data object creates a directory representation on the filesystem.
I propose to implement two new methods on data instance: link and write. They should behave as described above.
@remote()
def my_remote(ctx, data):
data.write("mydata") # creates stand-alone copy of data instance into file "mydata"
data.link("mydata2") # creates a read-only copy of data instance (may possibly create a symlink)
I propose to add a new parameter `write` for the class `Input`. When set to True, it forces the write operation when the data object is mapped to the file system.
tasks.execute(["/some/binary", Input("mydata", write=True, dataobj=x)])
When `write` is False, the link operation is used (this is the behaviour in the current version).
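The intended semantics can be sketched as follows (the class and attribute names are assumptions, not the worker's real internals): `write` always produces a stand-alone copy, while `link` prefers a symlink and silently falls back to writing when only an in-memory representation exists, as in note (1) above.

```python
import os
import shutil

class DataInstance:
    """Hypothetical sketch of a data instance with an on-disk
    and/or in-memory representation."""
    def __init__(self, path=None, data=None):
        self.path = path  # filesystem representation, if any
        self.data = data  # in-memory bytes, if any

    def write(self, target):
        # Always create a stand-alone, writable copy.
        if self.path is not None:
            shutil.copy(self.path, target)
        else:
            with open(target, "wb") as f:
                f.write(self.data)

    def link(self, target):
        # Prefer a cheap read-only symlink; silently fall back to
        # write() when the object exists only in memory.
        if self.path is not None:
            os.symlink(os.path.abspath(self.path), target)
        else:
            self.write(target)
```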
We have received a lot of feedback on our reddit post (https://www.reddit.com/r/rust/comments/89yppv/rain_rust_based_computational_framework/). I think that now is the time to recap our plans and maybe reconsider priorities to reflect the real needs of users. This is meant as a kind of brainstorming whiteboard; each individual task should get its own issue in the end. Also, I would like to focus on a relatively short-term roadmap (let us say things that could be finished within 3 months); maybe we can create a similar post for our long-term plans.
EDIT (gavento): Moved @spirali's TODO to a post below.
Introduce optional task/data object groups and names for better monitoring and possibly debugging. Currently, data objects have a `label` indicating the role they play in the producer task (e.g. `log`, `score`, ...).
Introduce:
task and data object name - an arbitrary optional string, to be set for any tasks the user wants to distinguish later by name. Not guaranteed to be unique within the session, but intended to be. Always set by the user.
task and data object groups - a textual label splitting the nodes into disjoint groups (including the `None` group).
An alternative is `tags`, aggregating by tag combinations. Drawbacks are added complexity (too many tag combinations) and less monitoring meaning (the stats no longer sum over all nodes).

When rain is built or installed with log 0.3.x, it fails.
error[E0432]: unresolved import `log::Level`
--> src/bin.rs:349:21
|
349 | use log::Level;
| ^^^^^^^^^^ no `Level` in the root
error: aborting due to previous error
It seems to depend on a change in rust-lang-nursery/log#162. If we use log 0.4.x, the build and install succeed.
So, we should change `"*"` in Cargo.toml to `">=0.4"`.
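In Cargo.toml, the proposed constraint would look something like this (a sketch; the actual dependency section in Rain's Cargo.toml may differ):

```toml
[dependencies]
# Was: log = "*"  -- which can resolve to the incompatible 0.3.x series.
log = ">=0.4"
```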
I'll make a PR.
Improve the online monitoring to:
This depends on a streaming API (#11) and may require a complete UI framework.
The recent version added "safer" read-only mapping of data objects to files via unix file permissions. However, it seems that our CI runs as root, which can modify files even when it does not have write permission. This leads to failing tests on the CI server. I propose to create a normal user in the container and switch to this user before running the tests.
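A sketch of that container change, assuming a Debian-based image and an arbitrary user name:

```dockerfile
# Create an unprivileged user; root ignores file permission bits, so
# the read-only mapping tests only fail correctly for a normal user.
RUN useradd --create-home tester
USER tester
```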
While Rain startup on PBS already has basic support, it would be great to have scripts to set up and control (at least tear down) deployment on some of: CloudStack API (Exoscale), AWS, Google Cloud, ...
@vojtechcima is already working on that - Vojta, would you please take over and fill in some plans and status?
In the Python API, all tasks are currently instances of the `Task` class, with different types indicated by `task_type`. It would be more pythonic to implement them as a hierarchy of classes, one for each `task_type` (or even one for every `Program` or python `Remote` func. - TBD). This would have the advantages of allowing `isinstance` and better introspection (`__repr__`), access to parameters specific to the task type, and subclassing.
Independently of this change, we can keep the task-creation operations as functions (and lower-cased), e.g. keep `tasks.concat` in addition to having `tasks.ConcatTask`.
This should be a mostly non-disruptive change and needs a more detailed spec of the Task interface.
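A minimal sketch of the proposed hierarchy (class and function names are illustrative, not the real API):

```python
class Task:
    """Base class; `task_type` becomes a class attribute of each subclass."""
    task_type = None

    def __repr__(self):
        # Better introspection: the class name itself identifies the task.
        return f"<{type(self).__name__} task_type={self.task_type!r}>"

class ConcatTask(Task):
    task_type = "concat"

    def __init__(self, inputs):
        self.inputs = list(inputs)

def concat(inputs):
    """Lower-cased factory kept alongside the class, as proposed."""
    return ConcatTask(inputs)

t = concat(["a", "b"])
assert isinstance(t, Task) and isinstance(t, ConcatTask)
```

This is what enables `isinstance`-based dispatch and subclassing while keeping the existing function-style construction API intact.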
Hi all,
I have read the docs at https://substantic.github.io/rain/docs/index.html.
The example given in the get-started section is as below:
from rain.client import Client, tasks, blob
# Connect to server
client = Client("localhost", 7210)
# Create a new session
with client.new_session() as session:
# Create task (and two data objects)
task = tasks.Concat((blob("Hello "), blob("world!"),))
# Mark that the output should be kept after submit
task.output.keep()
# Submit all created tasks to server
session.submit()
# Wait for completion of task and fetch results and get it as bytes
result = task.output.fetch().get_bytes()
# Prints 'Hello world!'
print(result)
This is really neat.
But Python 3.5+ supports the async/await syntax; does this project support it?
In the roadmap I found this task: `Update in the Python API (using aiohttp for async API) (@gavento) [medium]`. Is that in order to support the `async/await` syntax?
With async/await support, the example might look as below:
from rain.client import Client, tasks, blob
# Connect to server
client = Client("localhost", 7210)
# Create a new session
async with client.new_session() as session:
# Create task (and two data objects)
task = tasks.Concat((blob("Hello "), blob("world!"),))
# Mark that the output should be kept after submit
task.output.keep()
# Submit all created tasks to server
session.submit()
# Wait for completion of task and fetch results and get it as bytes
result = await task.output.fetch().get_bytes()
# Prints 'Hello world!'
print(result)
Process 'server' terminated with exit code 101; process outputs can be found in server.{out/err}
Hi all,
I tried to install rain_server with `cargo install -v rain_server`, but the installation does not succeed.
The error is shown below:
error: failed to run custom build command for `rain_core v0.3.0`
process didn't exit successfully: `/var/folders/y3/xb_8fx894lq_89d7kv0k3gq00000gp/T/cargo-install.LarbQVDOQLHm/release/build/rain_core-211500adc0948f2e/build-script-build` (exit code: 101)
--- stderr
thread 'main' panicked at 'schema compiler command: Error { kind: Failed, description: "Error while trying to execute `capnp compile`: Failed: No such file or directory (os error 2). Please verify that version 0.5.2 or higher of the capnp executable is installed on your system. See https://capnproto.org/install.html" }', src/libcore/result.rs:906:4
note: Run with `RUST_BACKTRACE=1` for a backtrace.
warning: build failed, waiting for other jobs to finish...
error: failed to run custom build command for `capnp-rpc v0.8.3`
process didn't exit successfully: `/var/folders/y3/xb_8fx894lq_89d7kv0k3gq00000gp/T/cargo-install.LarbQVDOQLHm/release/build/capnp-rpc-9631959463f9bb8f/build-script-build` (exit code: 101)
--- stderr
thread 'main' panicked at 'capnp compile: Error { kind: Failed, description: "Error while trying to execute `capnp compile`: Failed: No such file or directory (os error 2). Please verify that version 0.5.2 or higher of the capnp executable is installed on your system. See https://capnproto.org/install.html" }', src/libcore/result.rs:906:4
note: Run with `RUST_BACKTRACE=1` for a backtrace.
warning: build failed, waiting for other jobs to finish...
error: failed to compile `rain_server v0.3.0`, intermediate artifacts can be found at `/var/folders/y3/xb_8fx894lq_89d7kv0k3gq00000gp/T/cargo-install.LarbQVDOQLHm`
Caused by:
build failed
cargo version :
cargo -V
cargo 0.23.0 (61fa02415 2017-11-22)
Is there something wrong with using `cargo install rain_server`?
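For reference, the build scripts require the Cap'n Proto compiler on the PATH. On Debian/Ubuntu-like systems it can usually be installed from the distribution package (the package name is an assumption for your distro; see https://capnproto.org/install.html otherwise):

```shell
# Install the Cap'n Proto compiler (package name may differ per distro).
sudo apt-get install capnproto
# The build script requires version 0.5.2 or higher.
capnp --version
```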
Is this project still alive? Any plans for a new release?