sjdv1982 / seamless

Seamless is a framework to set up reproducible computations (and visualizations) that respond to changes in cells. Cells contain the input data as well as the source code of the computations, and all cells can be edited interactively.

Home Page: http://sjdv1982.github.io/seamless

License: Other

Python 91.03% HTML 2.08% Makefile 0.02% CSS 0.85% Jupyter Notebook 0.64% JavaScript 3.73% Shell 1.30% C++ 0.01% R 0.01% Roff 0.27% Dockerfile 0.04% Jinja 0.01%
data-science framework interactive interoperability protocol python reproducible-science scientific-computing web-services

seamless's People

Contributors

agoose77, sjdv1982

seamless's Issues

Allow RESULT to be a directory in docker/bash transformers

Bash/docker transformers should no longer need tar. Instead, RESULT can be a directory.
The bash/docker transformer then detects this, and builds and returns the result tarball by running tar itself.
Note that to ensure reproducibility, tar must be run with reproducible options:

tar --sort=name \
      --mtime='1970-01-01 00:00:00' \
      --owner=0 --group=0 --numeric-owner \
      -cf RESULT.tar RESULT
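
For illustration, a minimal Python sketch of what the transformer could run internally (assuming GNU tar is available, mirroring the command above):

import subprocess

def pack_result_dir(result_dir="RESULT", tarball="RESULT.tar"):
    # Pack a result directory into a byte-reproducible tarball.
    # --sort and --mtime require GNU tar.
    subprocess.run([
        "tar",
        "--sort=name",
        "--mtime=1970-01-01 00:00:00",
        "--owner=0", "--group=0", "--numeric-owner",
        "-cf", tarball, result_dir,
    ], check=True)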

GPU support

Various plans to use the GPU with Seamless.

General

NOTE: the GPU is, and will remain, a can of worms for reproducibility.
Part of this is the unordered computation itself (see Seamless Zen about add/multiply).
But another big part of this is the environment (see #38). Is Docker+GPU reproducible regardless of host hardware/driver?

GPU support requires either implicit GPU capabilities in Seamless (expose the GPU drivers using nvidia-docker; install pycuda/pyopencl) or explicit ones (use the capability system and job delegation).

1: Pure CUDA/OpenCL transformers

i.e. wrapping a single kernel. This is in the spirit of pycuda, and should be adaptable for OpenCL as well.

  • Calculate checksum on the CPU. Copying to the GPU will be a caching operation.

  • Don't use unified memory access, since it isn't compatible with malloc (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-system-allocator).

  • The tricky part is always the execution context, since CUDA contexts cannot survive fork() (!!!)

  • The CUDA low-level (driver) API is nice and simple. cuda2py gives a good example, albeit also of how not to do things.
    In short:

    1. Compile the CUDA code to PTX data using nvcc (see cuda2py._py.Module).
    2. Import the low-level API as a lib, e.g. using pycuda.driver, or directly using cffi (cuda2py._cffi).
      Start a CUDA context.
    3. Load the PTX data with cuModuleLoadData.
    4. Load the function from the module (cuda2py._py.Function).
    5. Prepare the call:
      5a. allocate buffers on the GPU;
      5b. copy data (async? stream?) to the GPU;
      5c. build the params array of GPU pointers (cuda2py does NOT do this correctly for npy arrays).
    6. Execute the function using the params array (cuLaunchKernel, stream?).
    7. Copy back the result data. As this runs on the order of gigabytes/sec, it won't be a big thing.
  • From the 7 steps above, only do step 1 inside the transformation. A modified version of gen_header
    generates a kernel declaration of the form 'extern "C" __global__ void transform(...)'. This declaration
    is added to the supplied CUDA code, which is compiled to PTX.

    Delegate steps 2-7 to a CUDA server that runs in a separate process
    (network communication latency is in the microsecond range).
    The CUDA server receives and sends back only checksums, i.e. it requires a shared database (or it must delegate checksum
    requests to Seamless instances connected via communion).
    The request format is very similar to transformation requests; however, the code buffer refers to PTX code, not Python.
    Input buffers can be kept GPU-allocated between requests, and freed after some seconds or minutes.
    Each Seamless instance can use only a single CUDA server; write a smarter server if you want to use multiple GPUs.
    Some authentication might also be in order?

2: CUDA in compiled transformers

Alternatively: it should already be possible to mix CUDA code objects into a regular compiled transformer.
This is because CUDA compiles to a standard .o that can be linked as usual (for functions declared as extern "C" __global__).
In this case, the transformer code is responsible for creating a CUDA context and invoking kernels. The CUDA
context will not persist between invocations of the transformer.
This is for transformations that use multiple kernels, or that use the same kernel multiple times.
This is closer to the normal usage of CUDA (as a C++ extension).

3: Pycuda in a Python transformer

Finally, there is always the option of a Python transformer that calls pycuda all by itself.
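
For reference, a minimal pycuda sketch of this single-kernel pattern (plain pycuda, nothing Seamless-specific; the kernel and sizes are arbitrary):

import numpy as np
import pycuda.autoinit              # creates and manages a CUDA context
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

# compile CUDA source (via nvcc) into a module we can load functions from
mod = SourceModule("""
__global__ void transform(float *a, float *b, float *result)
{
    int i = threadIdx.x;
    result[i] = a[i] + b[i];
}
""")
transform = mod.get_function("transform")

a = np.random.rand(256).astype(np.float32)
b = np.random.rand(256).astype(np.float32)
result = np.empty_like(a)
# cuda.In / cuda.Out handle GPU allocation and host<->device copies
transform(cuda.In(a), cuda.In(b), cuda.Out(result), block=(256, 1, 1))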

graph node API

._get_hcell and ._get_htf refer to the graph node. They should be replaced by a read-only .node property (returning a deep copy) that calls ._get_node() internally. Same for Context, Library, Macro; see the sketch below.

hcell and htf should be refactored to node in private code.

Also add support for getting the mount and the share.
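
A minimal sketch of the proposed property (class and method names are assumptions):

import copy

class NodeAccessMixin:
    def _get_node(self):
        # internal accessor; would replace ._get_hcell / ._get_htf
        raise NotImplementedError

    @property
    def node(self):
        # Read-only access to the graph node: return a deep copy,
        # so that mutations cannot leak back into the graph.
        return copy.deepcopy(self._get_node())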

Eliminate global state

With the exception of context manager state, we should avoid building in global state at this stage of development.

Cleanup

Have a look at all Python files. Move in-code TODOs etc. to the documentation.

Start to reorganize some code, rename some APIs, etc.

Perhaps add some more formal unit tests.

Have a look at what it would take to go to PEP8 compliance.

More fluid access to the values of Seamless cells in a Notebook

To break the "ctx barrier" and have a more Notebook-like experience ("values at your fingertips").

A function decorator @seamless and an IPython cell magic %%seamless.
Both of them take ctx (a seamless.highlevel.Context) as an argument, or extract it from the calling frame.

What it does: exec the code in a "ctx-mapped namespace",
so that "a[:10]" becomes "ctx.a.value.unsilk[:10]". For example:

ctx = Context()
ctx.a = 12
ctx.b = 3
await ctx.computation()

# 0. Without using it

print(10 * ctx.a.value.unsilk + ctx.b.value.unsilk) 
# => 123

# 1. Using the IPython magic

%%seamless
print(10 * a + b)  

# => 123

# 2. Using the decorator

@seamless
def func():
    print(10 * a + b)

func()

# => 123

Before executing the code, cache ctx.a.value.
I.e. eagerly evaluate ctx.a.value and insert the result into the namespace where the code executes.
Only do this for cells! Can be done recursively for subcontexts.

Refuse it for very large values (>1 GB or so).
Raise an error if the entire cache goes beyond 1 GB or so.
These limits can be arguments to the decorator/magic.

Note that the values are read-only: setting ctx values is not possible (ctx itself is not mapped).

Also make a version that doesn't unsilk (argument to the decorator/magic).
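
A rough sketch of the decorator (ctx._children is assumed to iterate over child cell names; the size limits and error handling are omitted):

import inspect

def seamless(func):
    # Sketch: run func with Seamless cell values mapped into its namespace.
    def wrapper():
        # extract ctx from the calling frame (or it could be passed explicitly)
        ctx = inspect.currentframe().f_back.f_globals["ctx"]
        namespace = dict(func.__globals__)
        for name in ctx._children:           # assumption: names of child cells
            child = getattr(ctx, name)
            value = getattr(child, "value", None)
            if value is not None:
                namespace[name] = value.unsilk   # eager, read-only snapshot
        exec(func.__code__, namespace)
    return wrapper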

high level API using wrappers

In particular, the various self wrappers and the transformer's pins wrapper.
Make them more functional and more introspective.

Mixed pins by default

Inside a transformer or a macro, input pins should be "mixed" by default, not silk.
As it is, the pins are already called "mixed", but a schema is attached if it can be retrieved from the input schema, else an empty schema (core/transformation.py).
Make a special pin celltype "silk" that has the current behavior.

Carefully re-run the high-level tests and the examples!

Non-reproducible computations give cache misses

This happens when a cell is "fingertipped" using a known computation. This means that Seamless knows that the checksum of the computation result is X, but not the buffer value of X; therefore, it re-performs the computation A + B => X. However, if the computation A + B now gives Y, then there is a cache miss for X.

At the very least, this should give an understandable error message, with a web link to the explanation above.
Better, it should be possible to mark (the transformer of) the computation as "non-deterministic", causing fingertipping to morph X into Y.

Re-implement the high level API as Silk

Everything is a Silk structure: cells, contexts, transformers, reactors, observers, macros.
But they will be heavily modified subclasses of Silk (maybe less modified in the future).
Lots of hooks in the vein "what happens when something assigns to me"
Normally, ctx.a = 2 will create a cell, but it could create a constant too.
"ctx.c = ctx.a + ctx.b" will normally create ctx.c as an operator_add object.
This object will be stored in the data dict.
A Silk context has a single "big high-level macro" to generate a mid-level graph.
(UPDATE: or simply manipulate the graph directly using high-level API...)
This is done by the top-level context (may invoke subcontexts recursively)
It is done again and again whenever a new cell/context/... is added or removed.
(Not when the value is changed, though, unless there is a high-level macro connected to it
UPDATE: not even then. A high-level macro is nothing but a macro that returns a mid-level graph structure.
The default language is the Python-Seamless high-level API, but it can be any language)

Remote peer IDs

The purpose is to pass around, in a remote (communion) request,
the communion peer ID of the peer that issued the original request.
This prevents requests from bouncing around in circles.

To be implemented, there is now only a stub!

Low priority. For now, communion is used in a master-slave fashion, not so much in a peer-to-peer fashion.

Fallback mode

For use during development. Graphs can be served with fallback=False.

Soft fallback mode, for cells and transformers.
A fallback value for a cell or input pin that is used upon an upstream None, i.e. if some error happens upstream.

Hard fallback mode (a kind of debug mode), for transformers.
There must now be a fallback value for the output pin. Regardless of what the transformer does, this value is what is propagated downstream.
Input pins (including code) can have a fallback mount, from which the value is read. For code pins, this is also the file name for debugging.
Output pins can have a fallback mount too, where the result is written.
To be integrated with shells (#47).

Static Seamless web page (for publishing)

It is important to deliver Seamless graph results as demonstration web pages that can
be inspected and visualized (in the browser)
without being backed by a Seamless instance.
Jupyter Notebook widgets have something like that (embed widget state)
but it does not work very well in most cases.

In any case, we can do much better than Jupyter widgets.
Some preliminary tests (see /docs/plans/notebook-html-test) show that under most circumstances,
raw HTML, raw Notebooks and HTML-converted Notebooks can access Seamless cells
that have been dumped to a file.
Such HTML/Notebook files can be served by github.io (not github.com) and/or nbviewer.jupyter.org.
What is needed is to adapt seamless-client.js so that it can run in "static mode":
instead of getting the pulse of cell value updates from the Seamless websocket server,
it will obtain each cell value once, from a single GET request (that will normally resolve
to a file). Changes of the cell from JS will trigger the onchange() callback,
but will obviously not call Seamless.
To make this work, in static mode, the Seamless client should request ./SEAMLESS-STATIC and read all
shares, then read each cell in the sharelist.
The Seamless client could be initialized in mode=static, mode=dynamic or mode=both, where
it first tries to read ./SEAMLESS-STATIC and, if that fails, makes a standard dynamic connection.

In the long run, one could also add a mode=graph. Here, the Seamless client tries to request graph.seamless.
If it succeeds, it is parsed as JSON, the shares are identified, as well as their checksums. These checksums
are then read from some kind of checksum-to-buffer server.

Note that seamless-client.js is and always will be about providing buffers. Parsing them to values has to
be done manually in JS.
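
To make the intended protocol concrete, here is a sketch in Python (the real client is JavaScript; the exact SEAMLESS-STATIC format is an assumption):

import json
import urllib.request

def read_static_shares(base_url):
    # Static mode: one GET per cell, no websocket.
    # Assumes ./SEAMLESS-STATIC is a JSON list of shared cell paths.
    with urllib.request.urlopen(base_url + "/SEAMLESS-STATIC") as response:
        sharelist = json.load(response)
    buffers = {}
    for path in sharelist:
        with urllib.request.urlopen("{}/{}".format(base_url, path)) as response:
            buffers[path] = response.read()  # raw buffer; parsing to values is up to the client
    return buffers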

Cells "written in ipython" are not fully supported

IPython cells aren't well supported at the moment. This means that other languages (such as R) are not easily supported via IPython magics either.

IPython support consists of two parts:
modules with language=ipython, and low-level cells with celltype "ipython".
Both are converted to Python (.core.protocol.conversion).
However, this conversion does not always work,
as converted line magics and cell magics rely on get_ipython(), which is missing from the namespace.

  • It DOES NOT work for transformations run under Python.
    Doing "from IPython import get_ipython" will not fix the problem under Python:
    this results in a dummy IPython that has no access to the namespace.
  • It DOES work if the transformations and all of Seamless are run under IPython,
    as IPython inserts get_ipython into the builtins module.
  • It DOES work for modules, as build_module executes the code for IPython modules under seamless.ipython.execute.
    This creates a separate IPython kernel, so it also works under Python.
  • It CAN BE FIXED if seamless.ipython.execute is used for transformations as well.
    Right now, transformations are executed using core.execute._execute
    (so are macros and libinstances, but not reactors, for some reason);
    the call stack is core.execute._execute => cached_compile.exec_code => exec.
    If exec is replaced with seamless.ipython.execute, then it works as well as modules do.
    BUT: we don't want this for every transformer.
    Perhaps mark a transformer as "ipython"?

In terms of tests, the following tests work under both python and ipython:

  • lowlevel/injection4.py
  • lowlevel/injection5.py

The following tests work ONLY under ipython:

  • lowlevel/simple-ipython.py
  • highlevel/simpler-ipython.py

Solution: marking a transformer as "ipython"?
This would have implications for communion, jobless and scripts/run-transformer as well.
Save it until "environment capabilities" (#38) are being implemented.

Seamless graphs as self-consistent descriptors of computation results (without re-doing work)

Seamless graphs as containers of computation results

UPDATE: the big data branch (elision + expression cache) has partially solved this.

Seamless graphs contain the checksums of the computation results. However,
currently these checksums are not exploited enough.
In particular, if you load a graph and do ctx.compute(), Seamless will pull in all inputs,
erase all results, and recompute them, unless the database provides:

  • all the transformation result checksums
    (which it already does, if it was previously used as a sink in the computation.
    Therefore, no computation is re-done if a database is present)
  • all the expression result checksums
    (which are not stored at all. Therefore, even if a database is present,
    all buffer input values will be pulled in!).

An important goal is to make the Seamless graph a self-consistent descriptor of
a computation result, i.e. if you load it and do ctx.compute(), this is a no-op, as
Seamless considers that no work nor value pull-in needs to be done.

There are three solutions to this:

  1. Straightforward cases.
    For expressions, this means inchannels/input pins without celltype morphing.
    For transformers, this means Python transformers.
    In these cases, the graph loading must be smarter: result checksums
    can be distilled from the graph and added to the cache, right before ctx.set_graph.
    Of course, this must only be done if the graph is trusted.
    One caching feature that must be added is structured cell join caching. The input of a join
    contains the checksum of the auth and the checksum dict of the inchannels. The output of the
    join is the checksum of the buffer cell (the data cell is either the same, or None,
    depending on the checksum of the schema; this must also be cached!).
  2. Intermediate cases. This means expressions that involve celltype morphing, and
    non-Python transformers. In this case, more checksums of the internal part must be
    monitored/recorded. In addition, the graph must contain an explicit expression
    cache section.
  3. Hard cases. This means macros that have a non-empty internal low-level context
    (as opposed to macros that just connect incoming/outgoing paths dynamically).
    An example of such a macro would be map-reduce, creating a transformer for every
    data item in the input.
    It will be very difficult to store this in a self-contained graph, since the observer
    monitoring of standard cells and transformers is absent. This monitoring must be added
    manually by the macro, and facilities must be added to pass this to the high level for
    storage!

Sphinx auto-generated documentation

This must be done for Context, Cell and Transformer for sure.
Context and Cell have their own descriptive documentation; Transformer is documented only in the guide.

Library instance roadmap

NOTE: As of Seamless 0.5, library instances are an undocumented stealth feature; documentation is required (#41). See /plans/big-data for some design docs and tests.

TODO

  • Constructors, finish testing
    Test library-containing-another-library
    Make sure constructors work correctly when copying parent contexts
    Test indirect library update system, also with constructors

UPDATE: some progress has been made with stdlib.switch/join .
library-containing-another-library has not yet been tested.

"indirect library update system"? Hasn't that been ripped? Test+document that an explicit translate is required

TODO: fix default values for lib constructors (function signature check fails)

"help" attribute

At the high level, allow descriptions/docstrings everywhere (Cell, Transformer, but also pin and subcell). For pins, and for the high-level Transformer, Macro and Library: a "help" field.

Improved stability

The stability of seamless.core tasks is not 100%.
This shows mostly in:

  • the traitlet value changed code (before the "timer handle" throttle).
  • the bind status graph observer code in v0.5.1 and earlier, when every status/exception
    update led to a modification using ctx2.status_.handle
  • Getting cell.value from within SeamlessTraitlet observe callbacks is unstable (at least within Jupyter)

The root cause is most likely the add_synctask hack in structured cell tasks,
which is necessary because manager.structured_cell_trigger
does not work well when called from tasks

For now, 0.5.2 seems to solve the acute stability problems, but in the long term,
the problem should be fixed for real.

Documentation

Status of the documentation web page

  • Need to rebalance between "Seamless explained" and feature documentation for several features. The former should be more conceptual, the latter more practical, but this is difficult to separate for these features.
    • Mounting to the file system
    • HTTP shareserver
    • Structured cells (especially authority and the joining process). Schemas must be discussed in "validation"
    • Deep cells
  • Need to rebalance between "Seamless explained" and feature documentation for several more features that are conceptual, but not as advanced (accessible for beginners too, and for sysadmins). To be balanced with seamless-tools docs too.
    • Visualization
    • Validation
    • Job management
    • Deployment
  • For the rest, the missing / incomplete sections have now been highlighted
  • Need to finalize "Seamless explained" and beginner's guide.

Missing sections

TODO fill in here

Specific todo

  • Basic example notebook / README.md. Add an "edit cell over HTTP" section.
    Simple index.html for cell ctx.a with two buttons (get value, set value) and two fields.
    Move sections 3 and 4 to the end (de-emphasize).

  • Write a big data example, e.g. an HHblits search where the database path is a DeepFolder
    checksum.

General todo

  • Make some example video, do some publicity online

  • Expand man pages for seamless-cli

  • Low-level API documentation (long term)

better support for compiled transformers that return an array

Needs #27 to be solved first.

The array needs to be pre-allocated anyway, but currently this is done using the result schema, which indicates the max (pre-allocated) size, the actual size, and the schema-enforced size all at once. These need to be separated: a max_shape property for allocation, result_shape as an extra function parameter, and the schema-enforced size as it is now, as sketched below.
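
A sketch of the proposed separation (all field and parameter names are hypothetical):

# Current: the result schema has to express three different sizes at once.
# Proposed split:
result_schema = {
    "type": "array",
    "max_shape": (1024, 1024),  # pre-allocation size (max_shape property)
    "shape": None,              # schema-enforced size, as it is now (optional)
}
# The actual size becomes an extra parameter of the transformer function, e.g.:
#   int transform(..., double *result, unsigned int *result_shape);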

Improved module features

Add support for:

  • Multi-file modules (a bit like structured cells, where every inchannel is a file)
  • Compiled modules (always multi-file)
    Related to #30

It should check that the mode, language and code are set.
See core/build_module.py:build_interpreted_module
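
A hypothetical sketch of what a multi-file module definition could look like at the high level (Module and the dict-of-files form are assumptions, not the current API):

ctx.mymodule = Module()                    # names are illustrative
ctx.mymodule.language = "python"
ctx.mymodule.code = {                      # every "inchannel" is a file
    "__init__.py": "from .utils import helper",
    "utils.py": "def helper():\n    return 42",
}
ctx.tf.mymodule = ctx.mymodule             # inject into a transformer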

R language support

In a development environment, this should already be possible.

In contrast, supporting this in a production environment requires #32 to be solved.

After that, implementing language="r" transformers and modules requires some mid-level translation machinery.

Improved SnakeMake integration

Seamless has SnakeMake support by letting SnakeMake build its DAG, and then
converting this DAG to a Seamless context.
This is currently done using the snakemake2seamless command-line tool, partly
because high-level macros (contexts with constructors) are not yet working.

SnakeMake always pulls, having as target a rule output or a set of files.
If it is a rule, it may not contain wildcards. Therefore, SnakeMake always has well-defined, statically known output files.
This is not always so for inputs and intermediate results. SnakeMake has two mechanisms
to dynamically determine the input files of a rule. The "dynamic" flag delays the
evaluation of a wildcard file pattern until runtime. It must be declared as the
output of one rule, and, identically, as the input of one or more other rules.
This mechanism is being deprecated in SnakeMake 6, in favor of checkpoint rules.
Checkpoint rules are to be used together with input functions. If an input function
tries to access a checkpoint rule, the input function is halted until the checkpoint
rule has been evaluated, and then re-triggered. (Note that in all other cases, input functions are evaluated while the DAG is being built, so no special Seamless-side support for input functions is necessary.)
Seamless will never, ever support either of these dynamic mechanisms.
If you need dynamic DAGs, you need to do the dynamic part in Seamless, letting it generate a (static-DAG) Snakefile if needed.
Example:

  • Snakefile 1 takes a static number of input files to create a single clustering file. Snakefile 1 can simply be wrapped in a Seamless macro that does the same as snakemake2seamless. It requires the target rule / file list, a Snakefile, and optionally an OUTPUTS list (see below).

  • Snakefile 2 splits the clustering file into a clusterX.list for each cluster X.
    It may be a single rule that generates all the outputs; in that case, it must depend on a list OUTPUTS, e.g. ["1", "2", "3"]. OUTPUTS must be generated dynamically by a custom Seamless transformer that reads the clustering file and counts the clusters.
    Snakefile 2 can then be generated by a general-purpose transformer that takes in a rump Snakefile and an outputs list, and adds 'OUTPUTS = ["1", "2", "3"]' on top of the Snakefile (this can be done by the same Seamless macro, which may take an OUTPUTS list as an optional input); see the sketch after this list.
    Alternatively, the rule may selectively extract specific clusters. In that case, Snakefile 2 itself is static, but must be invoked with a list of target files rather than a target rule. This list of target files is what must be generated by a custom Seamless transformer (again, the same macro can execute it).

  • Snakefile 3 generates clusterX.stat and clusterX.log for every cluster X. Snakefile 3 is static, but has a dynamic number of inputs and outputs. Again, you have the choice between generating OUTPUTS or generating the target files.

In all cases, the macro offers the option either to pass individual "files" in separate input pins, or to pass in a whole filesystem-like JSON, creating a binding for each input "file". The output is always a filesystem-like JSON.
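
A minimal sketch of the general-purpose OUTPUTS-injection transformer described above (names are illustrative):

def add_outputs(rump_snakefile, outputs):
    # Prepend a dynamically generated OUTPUTS list to a rump Snakefile,
    # e.g. outputs=["1", "2", "3"] for three clusters.
    header = "OUTPUTS = {}\n\n".format(repr([str(x) for x in outputs]))
    return header + rump_snakefile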
Long-term improvements:

  • Support SnakeMake run-functions (Python code using the SnakeMake API) within rules, inside a static DAG.
  • Support for SnakeMake inputs/outputs that are a file list, rather than a single file.

Smarter graph loading

See #34
See scripts/serve_graph.py
See tests/highlevel/load_graph

Things to do for graph management

(None of them are very urgent)

  • Adapt the graph format wrt structured cells. Do we need to store
    "buffer" and "value"? They are always the same!
    Better to make some kind of "validated" attribute.
    Also, when loading the graph, if it is trusted, not only update
    the transformer cache and expression cache with elements mined from
    the graph, but also prevent StructuredCell joins if validated=True.
    => Loading and doing compute on a colored graph should do nothing: no transformations and no data loading!

TODO: buffer provenance cache
TODO: buffer size cache
TODO: expression cache policy. Normally, cache only if the result buffer is much smaller than
the input buffer.

TODO: expression provenance cache: expression-result-to-expression.
TODO: buffer size cache (checksum-to-buffersize) in ValueCache, also in Redis and via the communion server.
Both provenance caches (transformation and expression) are kept as local caches for any buffer size,
but they are offloaded to Redis only for larger buffers (>= 100 bytes).
TODO: implement and document the text below.
TODO: graph storing/loading: checksums of accessors (connections) are stored/loaded, as well as the checksums of cells.
NOTE: .protocol conversion/validation caches are never offloaded to Redis.
However, validation is only invoked if:

  • The cell value (not the checksum!) is changed authoritatively from the command line (SetCellValueTask)
  • By an accessor, if the checksum of the accessor (or its target cell) changes

Therefore, graphs that contain results (including partial results; almost every graph contains some expression results)
can be loaded in trusted or non-trusted mode. In both cases, all checksums are set, but the differences are as follows.

In non-trusted mode:
No caches are filled at all. This means that all expressions get re-evaluated, starting from authoritative cells.
All transformations will be re-executed as well.
However, since all recomputed results will (should!) have the same checksums as the stored outputs, there will be no propagations.
Reactors and macros get re-launched as well.
However, checksums of reactor output pin accessors are set to "provisional" (TODO). When a reactor outputs a value, the
provisional status is cleared. Whenever a reactor gives an exception,
all provisional accessors are void-cancelled. After equilibration, all remaining provisional accessors are void-cancelled as well.

In trusted mode:
All caches get filled:

  • Transformation result cache
  • Expression result cache
  • Validation/conversion caches

This means that even though accessors/transformations will initially be re-evaluated, they all immediately get a cache hit.
In addition, since the cache hit is the same as their checksum, there will be no propagations.
Reactors and macros get re-launched anyway, which is as it should be (TODO: unless declared as pure).
Reactors should give the same outputs, but if they do not, they get overwritten.

Reactor state management

Reactor start and stop side effects

The reactor start and stop code may produce side effects. To work properly,
Seamless requires idempotency. For example, of the event chains below, not only
must A and B give the same results, but also B and C. In other words, stop followed
by start must cancel out in terms of side effects. Seamless has the choice of a
spectrum between 1. never doing a stop until the reactor terminates
(and with the way cache hits work for the reactor, this may be at the end of the program),
and 2. stopping after every update execution, and restarting whenever a new value arrives.

Option 1 may save CPU time, while option 2 may save memory.
Some evaluation policy will tell Seamless what to do (this will not affect the result).
The default is option 1, which is essential in the case of a GUI in a reactor. (While the
disappearance of the GUI technically has no effect on the computation, it will definitely
be an unpleasant surprise to the user.)

event chain A:
set value X to 10
set value Y to 20
start
update
stop

event chain B
set value X to 1
set value Y to 2
start
update
set value X to 10
update
set value Y to 20
update
stop

event chain C
set value X to 1
set value Y to 2
start
update
stop
set value X to 10
set value Y to 20
start
update
stop

TODO:
(Re-)introduce caching for reactors.
They give a cache hit not just if the values of all cells are the same, but also if:
- the connection topology stays the same, and
- the values of all three code cells stay the same.
In that case, the regeneration of the reactor essentially becomes an update() event.

General plans for reactors

(Run-time) Reactors are currently executed in-process, in the main thread.
(They can of course fire up threads of their own, if they desire.)
In the future, make some kind of "reactor server" that executes reactors
in a separate process (probably without fork(); reactors shouldn't handle terabytes of data?)
and eventually, in a remote manner (or maybe only remotely, i.e. the reactor server goes
over the network).
For this to work, one must think about start/stop policy and side effects,
and probably about some kind of "heartbeat" (reactors may die for any reason at all; in that
case, a new one should be fired up).

In the long term, think of reactors in languages other than Python.

RIP Proxy

At the high level, rip getattr(Context) => Proxy.
Have a look at CodeProxy and HeaderProxy, because those are useful.
Get rid of the special syntax << and >> (check the documentation as well).

This will give much nicer error messages when an unknown attribute is accessed.

Dependency version bumping

TODO, but to test carefully

  • Stability tests (related to Python / asyncio / ipykernel / Jupyter versions)
  • Compatibility verification, especially related to Python module injection (#30)

High-level copy/rename

  1. Rename a cell/transformer/context/macro

  2. Copy a cell/transformer/context/macro

This is particularly useful for transformers, as it could reuse topology and schema,
i.e. do something akin to CWL's CommandLineTool instantiation.

Connected cells outside the copied object are not cloned. Example:

def func(a,b):
    return a + b
ctx.tf = func()
ctx.tf.a = ctx.a
ctx.tf.b = 20
ctx.tf2 = ctx.tf.copy()

In this case, the value 20 gets copied, and Transformer.b can be re-assigned independently in ctx.tf and ctx.tf2.
In contrast, ctx.a is now connected to both ctx.tf.a and ctx.tf2.a.

Mounts and shares will never be copied (they will disappear in the copy).
However, they can survive in a renamed context/transformer/cell/macro.

Transformer multiprocessing errors (in particular on Windows)

Error running the demo file given in readme.md with seamless test_file.py

IPython 6.1.0 -- An enhanced Interactive Python. Type '?' for help.
*********** ERROR in transformer .transform: execution error **************
Traceback (most recent call last):
  File "d:\pycharmprojects\seamless\seamless\core\pythreadkernel\__init__.py", line 173, in run
    self.update(self.updated, self.semaphore)
  File "d:\pycharmprojects\seamless\seamless\core\pythreadkernel\transformer.py", line 89, in update
    executor.start()
  File "c:\users\angus\appdata\local\programs\python\python36\lib\multiprocessing\process.py", line 105, in start
    self._popen = self._Popen(self)
  File "c:\users\angus\appdata\local\programs\python\python36\lib\multiprocessing\context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "c:\users\angus\appdata\local\programs\python\python36\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "c:\users\angus\appdata\local\programs\python\python36\lib\multiprocessing\popen_spawn_win32.py", line 65, in __init__
    reduction.dump(process_obj, to_child)
  File "c:\users\angus\appdata\local\programs\python\python36\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
TypeError: can't pickle code objects
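
The root cause: under Windows, multiprocessing uses the "spawn" start method and pickles whatever the child process needs, and compiled code objects cannot be pickled. One common direction for a fix, sketched below (not the actual Seamless implementation), is to ship the source text and recompile it in the child:

import multiprocessing as mp

def _child(source, inputs, queue):
    # recompile in the child: source text pickles fine, code objects do not
    code = compile(source, "<transformer>", "exec")
    namespace = dict(inputs)
    exec(code, namespace)
    queue.put(namespace.get("result"))

def run_transformer(source, inputs):
    queue = mp.Queue()
    process = mp.Process(target=_child, args=(source, inputs, queue))
    process.start()            # under "spawn", only the arguments are pickled
    result = queue.get()
    process.join()
    return result

if __name__ == "__main__":     # required under the Windows "spawn" start method
    print(run_transformer("result = a + b", {"a": 1, "b": 2}))  # -> 3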

Transformer/macro shells

Create a new namespace upon request/upon update, either pre-execution or post-execution (post-execution may re-execute the transformer/macro).
The shell should be created and the connection info dumped into a JSON file, to be connected to.

Example:

from ipykernel.kernelapp import IPKernelApp
namespace = {"a": 1234}
app = IPKernelApp.instance(user_ns=namespace, connection_file="/tmp/seamless-1.json")
app.initialize()
app.start()

This works with ipdb in jupyter console as well. See #23 for more details about debugging.
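
With the connection file above, such a shell can then be reached with: jupyter console --existing /tmp/seamless-1.json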

Shells are a high-level feature.
The above will work for Python transformers and for macros. The command to connect (seamless-cli) will be printed.
A special case will be bash/docker transformers. Instead of a jupyter JSON file, this will generate a folder (a temp folder! unequal to any mount or fallback mount!) with the files and a .bashrc. The command to start a shell in that folder (seamless-cli), in the appropriate Docker image, will be printed. (Seamless can detect if it itself is not run in a Docker image, and adapt the printed command accordingly).

Shells will not work for compiled transformers, nor for Python transformers with capabilities (#38) that are not in the current Seamless environment (as defined by the Docker image).

Status/exception/logging

There should be an easy util to monitor stdout/stderr/exceptions,
dumping them into the terminal printing function.
This monitor should take into account the last printed value.
If there is a cache hit, it should print out the corresponding last printed value.
If that value cannot be found, a monitored transformer should then be fingertipped
(i.e. re-computed).
Monitoring should not be saved in the graph; in fact, it should end upon translation
(printing a message).

In fact, this sounds very similar to the SeamlessTraitlet and cell.output(); perhaps adapt it from there?

UPDATE 0.3
Status and exception are computed properties that can be interrogated using polling observers.
In the future, secondary outputs (stdout/stderr and cell.status/cell.exception, and get_graph?) will be "second-rate cells", in the sense that they can be connected only to (normal) cells in other contexts.

On the longer term, do the same for progress, and for execution time
as well.

All of this will replace the Logging and Report cells mentioned in earlier plans.
/UPDATE
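
A minimal polling sketch over these computed properties (the 0.5 s interval is arbitrary):

import asyncio

async def monitor(transformer, interval=0.5):
    # Poll .status / .exception and print them whenever the status changes.
    last = None
    while True:
        status = transformer.status
        if status != last:
            print("status:", status)
            if transformer.exception is not None:
                print(transformer.exception)
            last = status
        await asyncio.sleep(interval)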

TODO (long term)

Finalize exception logging. There are still the core aux systems
(mount, share, communion); link them to cells if possible. For the rest,
there are still translation/high-level API exceptions, but they should be printed.
Did we forget anything?

Pseudo-cells

Exceptions and state should never be real cells, as they are not deterministic.
But in the future, perhaps support connecting them to a cell from a different graph...

Galaxy/CWL support

Galaxy

See if it is possible to adapt Galaxy workflows to Seamless.
Possibly via CWL??

explicit None values

Dealing properly with explicit None values

Allow simple cells to have a support_explicit_none flag.
If so:

  • They can be set to explicit None with cell.set_explicit_none().
    This sets the value to Seamless.ExplicitNone.
    Setting the cell to a normal None means cancellation, as usual.
  • Connections between two explicit-None-supporting cells carry Seamless.ExplicitNone.
  • A connection between two simple cells of which only one supports explicit None leads to a normal None and cancellation.
  • A connection of an outchannel to an explicit-None-supporting cell sets the cell to ExplicitNone if the outchannel is None.
    In other words, structured cells are explicit-None-supporting as far as downstream connections are concerned.
  • All downstream connections of a void cell or worker remain void.
    Structured cells remain void if their auth is None and all of their inchannels are void.

Allow any worker inputpin to be annotated as support_explicit_none.
These inputpins must be connected from explicit-None-supporting cells.
If the upstream cell is Seamless.ExplicitNone, a normal None is fed into the worker namespace.

It remains illegal for a transformer to return None.
A None set by a reactor remains a normal None. To avoid confusion, editpins and reactor outputpins cannot be connected to explicit-None-supporting cells.
However, ExplicitNone can be used as a fallback value.
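
Hypothetical usage, once the flag exists (names as proposed above):

c = cell("int")
c.support_explicit_none = True
c.set_explicit_none()   # value becomes Seamless.ExplicitNone
c.set(None)             # still means cancellation, as usual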

Debugging transformers in Seamless

Part of this should go into documentation (#41), integrate with the "new project" tutorial.

Also make clear that this is for debugging inside a Docker container. For inside a Conda environment, a different launch.json would be needed.
Inside Singularity, probably never debug.
Integrate with shells (#47) and fallback mode (#45)

Debugging is a pain. This is because debuggers for Python and C:

a) claim the tty; and/or:

b) have a file-centric view of the world; and/or:

c) are opinionated about threads vs processes.

Seamless prefers to clean up its mess, i.e. delete files after compilation.
(Only the compiled Python module file is kept, for now; TODO: find a way to change this).
Jupyter (and IPython) claim the tty; if this interferes, you may convert your
notebook to a Python script and execute it with Python. If, at the end, you
do asyncio.get_event_loop().run_forever(), Seamless will keep listening for events
(worker, mount, share, communion, reactor edits), without claiming the tty.
All Python debuggers will work like this (but see below).

Transformers are executed as processes.

CONCRETE ISSUES:

  1. Python debuggers have trouble with processes;
    for this reason, Seamless includes a patched version of pdb that does
    work with processes. Still, none of them work inside Jupyter within processes.

  2. For compiled transformers, the .debug option will not only save debugging symbols;
    it will also print out the pid and pause it until it receives SIGUSR1.
    gdb attach + signal SIGUSR1 will work, except that:

  • Visual Studio Code does not work with gdb in this manner (communication goes wrong)
  • In Ubuntu, you need sudo to attach gdb.

UPDATE: VS Code now works for C/C++.

Best approach:

  • expose .vscode/launch.json to the Seamless Docker container: TODO, let Seamless look for it in /cwd/.vscode/launch.json

  • TODO: Let Seamless synthesize an entry to launch.json, as follows:
    container ID: c121bde3bd57 (is now exposed as /home/jovyan/DOCKER_CONTAINER)
    transformer process ID: 1114
    build dir: /tmp/seamless-extensions/seamless_module_01ec725d4bf65b54f6c807f8380e88f71713f9abd6d3579eb69593dc15ccde83
    name of main.cpp as mounted: /tmp/code.cpp (need mount mapping; supported are either /tmp or /cwd (translated using $HOSTCWD))

      {
          "name": "ctx.tf: debug Seamless C/C++ transformer",
          "type": "cppdbg",
          "request": "attach",
          "program": "/opt/conda/bin/python",
          "processId": "1114",
          "pipeTransport": {
              "debuggerPath": "/usr/bin/gdb",
              "pipeProgram": "docker",
              "pipeArgs": ["exec", "-u", "root", "--privileged", "-i", "c121bde3bd57", "sh", "-c"],
              "pipeCwd": ""
          },
          "sourceFileMap": {
              "/tmp/seamless-extensions/seamless_module_01ec725d4bf65b54f6c807f8380e88f71713f9abd6d3579eb69593dc15ccde83/main.cpp":"/tmp/code.cpp"
          },
          //NOT WORKING: "skipFiles": ["/build/glibc-S7xCS9/glibc-2.27/sysdeps/unix/sysv/linux/select.c"],
          "MIMode": "gdb",
          "setupCommands": [
              {
                  "description": "Enable pretty-printing for gdb",
                  "text": "-enable-pretty-printing",
                  "ignoreFailures": true
              },
          ]
      },
    
  • In VS Code, launch the ctx.tf debugger from the debugger icon menu.
    Then press F6, and press Esc to dismiss the "Cannot find select.c" error message.
    Then press Ctrl+Shift+Y, and type "-exec break".
    Make sure that you have some breakpoint set by now.
    Then type "-exec signal SIGUSR1" and press Esc to dismiss the "Cannot find select.c" error message.
    Finally, press F5.

Interactive editing of Silk schemas

This seems to be broken in many cases; there are multiple issues.

As of Seamless 0.5, direct schema manipulation (via example) is not working well.
The /tests/lowlevel/structured_cell/schema.py tests are not working

Deleting subschemas does not work properly (it becomes None instead)
See the commented-out section of tests/highlevel/subsubcell.py

Minor: schema.add_validator does not work, only example/value.add_validator

New Seamless project / web generator

  • Project loading with -BAK is misconceived (serious bugs)

  • Documentation (#41) is to be improved, in particular for the web generator. Update guide so it shows the correct status weblink. Include the status weblink by default in the generated form as well.

  • Some users may use the status graph but code the index.html/index.js by hand. In any case, have a soft fallback (#45) that is essentially empty.

  • Some users prefer to store context-generating code in a Python script, instead of making the .seamless file the authoritative representation. The vault should still be put under version control (it is either small OR it contains computation results).

  • Medium term: merge the webgen improvements in nademo.

  • Medium term: make webform.json / web components for the status as well (use seamless JS client to connect to webctx?)

Better error messages for compiled transformers regarding schemas

Relates to: #27

Right now, schema and result_schema are empty by default, leading to uninformative error messages such as:
'Status: upstream => result_schema undefined'
ctx.tf.schema.set({}) and ctx.tf.result.schema.set({}) are needed first.
The current error message in gen_header, "TypeError: Input schema needs to be defined",
isn't the best either.

Use relative imports

Currently, many imports that access other sub-packages access them via the full seamless.XXX package path. This should be changed to relative imports, as illustrated below.
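
For example (the module path is illustrative):

# inside a module of the seamless.highlevel sub-package:
from seamless.core.protocol import conversion   # current: absolute import
from ..core.protocol import conversion          # proposed: relative import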

Module injection

Injecting Python modules synthesized from source is still a bit problematic:

  • If you do "from .my_module import blah" and blah is not found within my_module,
    a ModuleNotFoundError is returned, instead of an ImportError.
  • "from .my_module import blah" does not work at all under system Python 3.7.6,
    but it does work in the current Docker image (Anaconda Python 3.7.3).
    my_module.blah always works.
    Perhaps bump the Python version all the way up to 3.9: this would require some solid testing,
    especially in Jupyter if Jupyter is bumped up as well (see #31).

In addition, if two modules are injected and module A depends on (i.e. imports) module B, this currently does not work without some kind of sys.modules hack, sketched below.
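
The hack in question, sketched for absolute imports (relative imports would additionally need package plumbing):

import sys
import types

def inject_module(name, source):
    # Register the module BEFORE executing it, so that a sibling
    # injected module can import it by name.
    mod = types.ModuleType(name)
    sys.modules[name] = mod
    exec(compile(source, "<{}>".format(name), "exec"), mod.__dict__)
    return mod

inject_module("module_b", "def blah(): return 42")
inject_module("module_a", "import module_b\nresult = module_b.blah()")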

Capabilities

At the high level, unify bash and docker transformers. Mid-level translations remain split (and jobless as well).
Rip docker_options.
Allow docker_image for any transformer, not just bash. If non-bash, docker_image is interpreted as a capability.
Each Seamless instance will have a list of capabilities, which means that it can execute transformers as if it were
running inside that Docker image (the same holds for bash).
If the instance is not in that Docker image, the transformer must be delegated via communion, else it is an error.
Adapt run-transformation + the jobless backend as well.

UPDATE: capabilities are distinct from images. Both can be part of an environment definition.

Running Seamless inside the browser

What would be needed to port the Seamless high level API to run inside the browser

node["TEMP"] refers to a literal (i.e. Python) construction value. Low-level cells are needed to
parse this into a serializable value. Right now, the mid-level does this on the first translation pass, then pops the TEMP.
This is very ugly. The mid-level should be configured to invoke remote parsing, and remote parsing should be invokable without
translation. This way, translate.py can be disconnected, and the remaining mid- and high-level routines could run in
the browser (under Pyodide; Transcrypt is probably less useful, since we would want a Python REPL). If all of the state
is in the graph (i.e. the Python Context, Cell etc. are just thin wrappers), then the graph could be manipulated in parallel
by JS, e.g. manipulating the position of a node on a visualization canvas.
Right now this is not exactly the case: there is context._graph but ALSO context._children.

Polyglot support

Support more foreign languages: Julia, R (#33), JavaScript.

IPython magics (pixiedust for Node).

Also have a look at python-bond.

For compiled languages, this may be as simple as editing compilers.cson and installing a compiler package into the image/environment.

Low-level mounting refactor

Low-level mounting of entire contexts is fragile
(e.g. tests/lowlevel/mount-direct.py => cannot remove directory /tmp/mount-test/sub).
Perhaps it is better to make it a high-level feature
(which would forbid the mounting of low-level-macro-generated cells)
and to simplify the mountmanager accordingly.

Probably also make it asyncio-based instead of thread-based.

Automatic connection for Transformer/Macro

Connect all pins to parent context cells of the same name.
Example:

def add(a, b):
    return a+b
ctx.sub.tf = add
ctx.sub.tf.connect()

=> try to connect ctx.sub.a and ctx.sub.b, if they exist
Print a report about the result (cells connected, cells not found).

Contexts can have an "autoconnect" attribute that does this for every new transformer. This must be stored in the graph node.
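
A sketch of what connect() could do (internal attribute names are assumptions):

def connect(self):
    # Connect each input pin to a same-named cell in the parent context.
    parent = self._parent()                # hypothetical accessor
    connected, not_found = [], []
    for pinname in self.pins:
        cell = getattr(parent, pinname, None)
        if cell is not None:
            setattr(self, pinname, cell)   # e.g. ctx.sub.tf.a = ctx.sub.a
            connected.append(pinname)
        else:
            not_found.append(pinname)
    print("Connected:", connected, "| not found:", not_found)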
