Firstly, this looks super interesting @mrocklin. There is a definite use case in finance/trading: we often build complex DAGs and ideally need some form of streaming service. In the past I have used Luigi, but it doesn't really fit the streaming model and its scheduler isn't as intelligent as dask's.
Some info about the process (skip if you don't care): generally we ingest data from multiple streaming sources, do some transformations or run the data through a model, and output multiple streams to different trading strategies. We also need to do some sort of model fitting/backtesting/validation. These models and strategies vary in complexity: some strategies can trade with only one or two inputs (and we want those to be fast), while others require more complex calculations or several upstream nodes to complete.
Ideally (and at a high level) we would like the ability to do the following:
1. Define a DAG in a simple and flexible way that doesn't require a bunch of boilerplate.
2. Run the DAG historically over some past data to validate our models, with parameterisation, e.g. run this for 2017-01-01, 2017-01-02... (see the sketch after this list).
3. Update nodes in the DAG, re-run, and have the executor recompute only the required data. (An extension of this is handling exceptions: if a node/function breaks, being able to fix some code and continue the pipeline where it left off is a massive plus.)
4. Run the DAG in production in some form of streaming fashion, so we can process and transform data live to feed our models.
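
To make points 1 and 2 concrete, here's a minimal sketch using dask.delayed. The step names (`load`, `transform`, `fit_model`) and the toy data are placeholders standing in for our real pipeline, not actual code from it:

```python
import dask
from dask import delayed

@delayed
def load(day):
    # stand-in for reading one day of market data
    return {"day": day, "prices": [100.0, 101.5, 99.8]}

@delayed
def transform(raw):
    # stand-in for cleaning / feature engineering: simple returns
    first = raw["prices"][0]
    return [p / first - 1 for p in raw["prices"]]

@delayed
def fit_model(returns):
    # stand-in for fitting or validating a model
    return sum(returns) / len(returns)

# the same DAG, parameterised over a range of dates (point 2)
results = [fit_model(transform(load(day)))
           for day in ["2017-01-01", "2017-01-02"]]
print(dask.compute(*results))
```

Each date is an independent branch of the same graph, so the scheduler can parallelise across days for free.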
Luigi offered most of this, except
- there was quite a bit of boilerplate to set up dependencies,
- it was targeted at slower-running pipelines, and
- scaling it out was quite a pain.
I also believe we have most of this with dask: 1 is obvious, 2 is easy, and I had a version of 3 working pretty well using dask plus joblib's `Memory.cache` (sketched below). The only missing piece is being able to move that dask graph into production. This could be done by creating a DAG in dask and calling it repeatedly with new data; however, when nodes complete at different times, the pipeline becomes only as fast as its slowest part, which isn't ideal when you want individual paths to complete ASAP.
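
For reference, this is roughly how the caching for point 3 looked; a rough sketch with placeholder functions and an arbitrary cache directory, not our actual code. joblib's `Memory` hashes the arguments and the function's source, so editing a node's code invalidates just that node's cache and a re-run recomputes only what changed:

```python
from dask import delayed
from joblib import Memory

# any local directory works as the cache location
memory = Memory("/tmp/pipeline_cache", verbose=0)

@memory.cache
def transform(prices):
    # cached on disk: re-running with the same inputs is a cheap lookup,
    # while editing this function's body invalidates only this node
    return [p / prices[0] - 1 for p in prices]

node = delayed(transform)([100.0, 101.5, 99.8])
print(node.compute())  # first run computes; later runs hit the cache
```

The streaming half is what's missing: calling compute() in a loop as new data arrives works, but every call waits on the whole graph, which is exactly the slowest-path problem above.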
I suppose my question is: given the information above, do you see this as a potential use case for streams, or should I be working harder to get dask to play how I want it to? I am very keen to contribute if this problem fits into the broader goals streams is trying to achieve.
If the motivation isn't clear, I can try to provide some simple examples of what I mean.