Firstly, this looks super interesting @mrocklin. There is a definite use case in finance/trading: we often build complex DAGs and ideally need some form of streaming service. In the past I have used Luigi, but it doesn't really fit the streaming model and its scheduler isn't as intelligent as dask's.
Some info about the process (skip if you don't care): generally we ingest data from multiple streaming sources, do some transformations or run the data through a model, and output multiple streams to different trading strategies. We also need to do some sort of model fitting/backtesting/validation. These models and strategies vary in complexity: some strategies can trade with only one or two inputs (and we want those to be fast), while others require more complex calculations or several upstream nodes to complete.
Ideally (and at a high level) we would like the ability to do the following:
1. Define a DAG in a simple and flexible way that doesn't require a bunch of boilerplate.
2. Run the DAG historically over some past data to validate our models, with parameterisation, e.g. run this for 2017-01-01, 2017-01-02... (see the sketch after this list).
3. Update nodes in the DAG, re-run, and have the executor recompute only the required data. (An extension of this is handling exceptions: if a node/function breaks, being able to fix some code and continue the pipeline where it left off is a massive plus.)
4. Run the DAG in production in some form of streaming fashion, so we can process and transform data live to feed our models.
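
To make points 1 and 2 concrete, here's a minimal sketch using dask.delayed. The step names (`load`, `transform`, `fit_model`) and the toy data are placeholders standing in for our real pipeline, not actual code from it:

```python
import dask
from dask import delayed

@delayed
def load(day):
    # stand-in for reading one day of market data
    return {"day": day, "prices": [100.0, 101.5, 99.8]}

@delayed
def transform(raw):
    # stand-in for cleaning / feature engineering: simple returns
    first = raw["prices"][0]
    return [p / first - 1 for p in raw["prices"]]

@delayed
def fit_model(returns):
    # stand-in for fitting or validating a model
    return sum(returns) / len(returns)

# the same DAG, parameterised over a range of dates (point 2)
results = [fit_model(transform(load(day)))
           for day in ["2017-01-01", "2017-01-02"]]
print(dask.compute(*results))
```

Each date is an independent branch of the same graph, so the scheduler can parallelise across days for free.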
Luigi offered most of this, except
- there was quite a bit of boilerplate to set up dependencies,
- it was targeted at slower-running pipelines, and
- scaling it out was quite a pain.
I also believe we have most of this with dask: 1 is obvious, 2 is easy, and I had a version of 3 working pretty well using dask plus joblib's `Memory.cache` (sketched below). The only missing piece is being able to move that dask graph into production. This could be done by creating a DAG in dask and calling it repeatedly with new data; however, when nodes complete at different times, the pipeline becomes only as fast as its slowest part, which isn't ideal when you want individual paths to complete ASAP.
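
For reference, this is roughly how the caching for point 3 looked; a rough sketch with placeholder functions and an arbitrary cache directory, not our actual code. joblib's `Memory` hashes the arguments and the function's source, so editing a node's code invalidates just that node's cache and a re-run recomputes only what changed:

```python
from dask import delayed
from joblib import Memory

# any local directory works as the cache location
memory = Memory("/tmp/pipeline_cache", verbose=0)

@memory.cache
def transform(prices):
    # cached on disk: re-running with the same inputs is a cheap lookup,
    # while editing this function's body invalidates only this node
    return [p / prices[0] - 1 for p in prices]

node = delayed(transform)([100.0, 101.5, 99.8])
print(node.compute())  # first run computes; later runs hit the cache
```

The streaming half is what's missing: calling compute() in a loop as new data arrives works, but every call waits on the whole graph, which is exactly the slowest-path problem above.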
I suppose my question is: given the information above, do you see this as a potential use case for streams, or should I be working harder to get dask to play how I want it to? I am very keen to contribute if this problem fits into the broader goals streams is trying to achieve.
If the motivation isn't clear, I can try to provide some simple examples of what I mean.