
streamtools's People

Contributors

agness, akamediasystem, bmabey, buth, durple, friej715, jasoncapehart, jprobinson, mikedewar, mreiferson, nickjones, nikhan, nytlabsbot


streamtools's Issues

generalized wrappers for http responses

Currently there is a bunch of duplicate header code in daemon.go that should be put in some kind of wrapper so that we don't have to add it every time we have a new type of HTTP response. blech.
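One way to collapse that duplicated header code is a single helper that every handler in daemon.go calls. A minimal sketch, assuming a hypothetical `writeJSON` name and header set:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// writeJSON is a hypothetical wrapper collecting the header boilerplate
// (content type, status code) currently duplicated across daemon.go handlers.
func writeJSON(w http.ResponseWriter, status int, v interface{}) {
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(status)
	json.NewEncoder(w).Encode(v)
}

func main() {
	rec := httptest.NewRecorder()
	writeJSON(rec, http.StatusOK, map[string]string{"daemon": "BLOCK_CREATED"})
	fmt.Println(rec.Code, rec.Header().Get("Content-Type"))
	fmt.Print(rec.Body.String())
}
```

New response types would then only touch the handler body, never the header code.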

toGet

make GET requests by converting specific message keys, or all of the message's key/values, into URL params
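The key-to-param conversion could look like the sketch below; `buildGetURL` and the flat-map input are assumptions about what the block's rule would supply:

```go
package main

import (
	"fmt"
	"net/url"
)

// buildGetURL converts flat message key/values into URL query parameters,
// as a toGet block might before issuing the request. Hypothetical sketch.
func buildGetURL(base string, msg map[string]interface{}) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	q := u.Query()
	for k, v := range msg {
		q.Set(k, fmt.Sprintf("%v", v))
	}
	u.RawQuery = q.Encode() // Encode sorts keys, giving stable URLs
	return u.String(), nil
}

func main() {
	got, _ := buildGetURL("http://example.com/api",
		map[string]interface{}{"id": 42, "name": "st"})
	fmt.Println(got)
}
```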

graph

a new message contributes a new node if we've not seen that node type before, and increments an edge weight.

store in a standard json graph format (d3 must have specified one) and keep in mind this will be backed by some graph database one day.
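d3's force-layout examples conventionally use a nodes/links layout; a sketch of that shape in Go, with the field names treated as an assumption about the eventual storage format:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Node and Link mirror the nodes/links layout d3's force examples use.
type Node struct {
	Name string `json:"name"`
}
type Link struct {
	Source int `json:"source"` // index into Nodes
	Target int `json:"target"`
	Weight int `json:"weight"` // incremented per observed message
}
type Graph struct {
	Nodes []Node `json:"nodes"`
	Links []Link `json:"links"`
}

func main() {
	g := Graph{
		Nodes: []Node{{"tweet"}, {"retweet"}},
		Links: []Link{{Source: 0, Target: 1, Weight: 3}},
	}
	b, _ := json.Marshal(g)
	fmt.Println(string(b))
}
```

Index-based links keep the JSON compact and map cleanly onto most graph databases later.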

fromFile

should read through a file, line by line, and emit into streamtools
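The core of such a block is a `bufio.Scanner` loop feeding a channel; this sketch reads from an `io.Reader` (a string here, an `*os.File` in the real block):

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// emitLines reads r line by line and sends each line downstream, which is
// roughly what a fromFile block routine would do with an *os.File.
func emitLines(r io.Reader, out chan<- string) {
	scanner := bufio.NewScanner(r)
	for scanner.Scan() {
		out <- scanner.Text()
	}
	close(out) // signal end-of-file to the downstream block
}

func main() {
	out := make(chan string)
	go emitLines(strings.NewReader("{\"a\":1}\n{\"a\":2}\n"), out)
	for line := range out {
		fmt.Println(line)
	}
}
```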

vegaBar

emits a bar chart vega object based on the incoming message

example scripts that explain simple functionality of ST

there need to be some updated example scripts that demonstrate simple data processing cases with ST, for example:

  • from file, through mask, count, then back to file
  • from nsq, filter, bunch, back to nsq
  • http streaming/polling/rule setting, etc

fromSNS

collect messages from Amazon SNS

make sure all channels are *simplejson.Json

We need to be passing pointers to JSON around, not the values themselves. This helps simplejson work, as well as massively reducing the copying of data that we are doing currently.
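The win is that a channel send copies one pointer instead of the whole value, and downstream mutation is visible everywhere. A stdlib-only sketch (with a plain struct standing in for `simplejson.Json`):

```go
package main

import "fmt"

// message stands in for the simplejson.Json payload; sending *message over a
// channel copies one pointer instead of the whole struct and its fields.
type message struct {
	Body  string
	Count int
}

func main() {
	ch := make(chan *message, 1)
	m := &message{Body: "hello", Count: 1}
	ch <- m       // one word crosses the channel
	got := <-ch
	got.Count = 2 // visible through every reference to the same message
	fmt.Println(m.Count)
}
```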

vegaForce

emits a force-directed network vis vega object based on the incoming message

proposed standards for blocks

Some ideas for the standardization of blocks, heavily inspired by #34 :

terms

Block Routine
A Block Routine is the func that is called as a go routine and contains only logic. It has a simplejson.Json in chan, a simplejson.Json out chan, or both. Block Routines are wrapped in Blocks. Block Routines reside as part of the /streamtools library.

Block Function
Block Functions are dependencies for Block Routines. They are not run as go routines and are located in the /streamtools library. It may be a good idea to divorce Block Routines from Block Functions in /streamtools.

Blocks
Blocks provide an NSQ wrapper for Block Routines. Blocks take care of wiring up go channels to NSQ and are responsible for initializing and running Block Routines. Typically, there will be 2-3 go routines per block: an NSQ reader, NSQ publisher, and a Block Routine.

A block should act as a standalone executable and be able to interface to a standard mode of execution and introspection.

principles

one Block Routine per Block
All logic for a block should be contained in a single function ( Block Routine ). All Block Routine state should be maintained within that single go routine. All messages in and out of that block should be managed by that single go routine. A Block Routine should not share state with any other go routines, unless it is through a channel.

Block Routines allow introspection
Block Routines should have a chan of some kind that allows reporting on what is currently happening in the routine. A health chan may also be nice, to divorce technical stats (in flight, backed-up queues, processing time, num msgs processed) from stats that come from the block's logic (distribution of X, etc.).

online setting of rules
We have yet to standardize how a Block Routine is initialized with rules that govern its logic. I propose that Block Routines should have a rules chan that initializes the logic and signals it to start processing. This would help us avoid any kind of flag soup, and potentially allow for run-time rule changes.
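The rules-chan proposal above can be sketched as a routine that blocks on its rules chan before touching any messages; all names here are illustrative, not the actual streamtools API:

```go
package main

import "fmt"

// filterRoutine is a block routine per the proposal: it receives its rule
// first, then processes messages with it.
func filterRoutine(rules <-chan string, in <-chan string, out chan<- string) {
	pattern := <-rules // wait for a rule before processing anything
	for msg := range in {
		if msg == pattern {
			out <- msg
		}
	}
	close(out)
}

func main() {
	rules := make(chan string, 1)
	in := make(chan string, 3)
	out := make(chan string, 3)

	rules <- "keep" // online rule setting: send, don't flag
	in <- "drop"
	in <- "keep"
	close(in)

	go filterRoutine(rules, in, out)
	for m := range out {
		fmt.Println(m)
	}
}
```

Run-time rule changes would fall out naturally by selecting on `rules` inside the message loop as well.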

ideas for interesting and helpful examples?

  • creating a stream via sampling endpoint/reading file
  • muxing of some kind
  • outputting/polling on condition
  • online rule changing/blocks setting rules for other blocks
  • block hotswap
  • st talking to st/network IO/NSQ (?)
  • how blocking affects ST
  • performance test
  • end-to-end stream to vis

ideas?

toMongo

write messages into mongo

Arch v3 proposal

So we have a ton of blocks now, and some experience of using this in prod. Let's do one more big architecture review before all this gets serious. Suggestions should go in the comments below!

vegaScatter

emits a scatter chart vega object based on the incoming message

Block creation should alert user if no default-but-required rules are set

  1. create a fromNSQ block
  2. create a count block
  3. connect blocks 1 and 2

...this results in the command line never becoming available again, because count and fromNSQ both are created but none of the required rules are set.

One nice solution would be for the return message (i.e., {"daemon":"BLOCK_CREATED"}) to include text describing which rules must be set.

Another approach is to create defaults for all blocks, but this encourages errors and might reinforce bad habits IMO.

vegaLine

emits a line chart vega object based on the incoming message

kernelDensityEstimate

a kernel density estimate that is updated with each new message.

Are there online methods for this? Gotta be...
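One naive "online" approximation, assuming nothing about what the real block would do, is to cap a sample buffer and evaluate a Gaussian KDE over it on demand; truly online schemes (e.g. merging nearby kernels) exist but this is the simplest sketch:

```go
package main

import (
	"fmt"
	"math"
)

// kde holds recent samples and evaluates a Gaussian kernel density estimate.
// Capping the buffer keeps per-message update cost bounded.
type kde struct {
	samples   []float64
	max       int
	bandwidth float64
}

func (k *kde) update(x float64) {
	k.samples = append(k.samples, x)
	if len(k.samples) > k.max {
		k.samples = k.samples[1:] // drop oldest
	}
}

func (k *kde) density(x float64) float64 {
	sum := 0.0
	for _, s := range k.samples {
		u := (x - s) / k.bandwidth
		sum += math.Exp(-0.5*u*u) / (k.bandwidth * math.Sqrt(2*math.Pi))
	}
	return sum / float64(len(k.samples))
}

func main() {
	k := &kde{max: 1000, bandwidth: 0.5}
	for _, v := range []float64{1, 1.2, 0.8, 1.1} {
		k.update(v)
	}
	fmt.Printf("%.3f\n", k.density(1.0))
}
```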

basic block documentation

Many of our blocks are missing documentation. Every block should have a basic description in the file that contains the block routine. This should include how the block works, what kind of data can be expected from the block, whether or not the block has outputs, and what kind of parameters can be sent to it.

architecture to allow for megablocks, agnostic logic

Consider an example flow:

  1. Import from NSQ
  2. Filter
  3. Synchronize
  4. Export to Web Sockets

The import from NSQ takes a stream from a non-local NSQ and puts it in the local NSQ. The Filter then reads from the local NSQ and publishes back to the local NSQ. The Synchronizer reads from the local NSQ and publishes to the local NSQ, and finally, the Export reads from the local NSQ.

This would work fine for smaller streams, but the load caused by putting things on and off a local NSQ at every hop creates a bunch of redundant work. The thing is that all of the filtering/synchronizing/export logic is still super useful; the only problem is that it is locked away in binaries that include NSQ readers/publishers.

I propose an architecture for the design of megablocks.

something like this:
/streamtools just contains the structs that we use to deal with Go chan messages.
/blocks contains basically everything that is in the root right now (w/ NSQ stuff)

a file in streamtools would look something like this:

package streamtools

type Filter struct {
    in      chan []byte
    out     chan []byte
    pattern string
}

func NewFilter(in chan []byte, out chan []byte, pattern string) *Filter {
    this := &Filter{
        in:      in,
        out:     out,
        pattern: pattern,
    }

    go this.run()
    return this
}

func (this *Filter) run() {
    for {
        select {
        case msg := <-this.in:
            // do filter stuff here, then forward matches
            this.out <- msg
        }
    }
}

This way, all of the logic in streamtools becomes agnostic as to how they are implemented. You could use them as part of the streamtools suite, or if you are just handling Go msgs in your own application you can import from /streamtools and use them without NSQ. Or you could chain them together to make megablocks.

/blocks would be full of NSQ-ready binaries, with really simple code that is basically an NSQ wrapper around the streamtools logic. A filter would look something like this:

import "github.com/nytlabs/stream_tools"

func main(){
    chanA := make(chan []byte)
    chanB := make(chan []byte)
    streamtools.NewNSQReader(params, chanA)
    streamtools.NewFilter(pattern, chanA, chanB)
    streamtools.NewNSQPublisher(params, chanB)
}

and this also means you could do something like

import "github.com/nytlabs/stream_tools"

func main(){
    chanA := make(chan []byte)
    chanB := make(chan []byte)
    chanC := make(chan []byte)
    streamtools.NewNSQReader(params, chanA)
    streamtools.NewFilter(pattern, chanA, chanB)
    streamtools.NewSynchronizer(chanB, chanC)
    streamtools.NewNSQPublisher(params, chanC)
}

basically, it allows the streamtools core to be a library that we use in making the NSQ-based block binaries. It's also good for when we want to start sharing util functions, like flatten, map, etc.

eh?

jqblock pukes with complex commands

... like

jqblock -command="'if .data.stories[0].url == \"http://www.nytimes.com/\" then .data.stories[1] else .data.stories[0] end'" -read-topic="top_stories_by_tweet" -write-topic="top_story_by_tweet" -name="extract_top_story"

toS3

write messages to S3

standardized key/value retrieval

All our blocks use different methods to grab values from a key specified in the rule, and many of them lack the ability to grab nested keys. We should have some standard way of grabbing keys in streamtools/utils that accepts '.'-delimited key paths.

The filter value by set membership block accepts a nested key in the format key_a.key_b.key_c; I think other blocks should probably follow suit.
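A shared accessor of that kind could be a small path walk over nested maps; `getKey` is a hypothetical name for what would live in streamtools/utils:

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// getKey walks a '.'-delimited path through nested maps, returning the value
// and whether the full path existed.
func getKey(msg map[string]interface{}, path string) (interface{}, bool) {
	var cur interface{} = msg
	for _, p := range strings.Split(path, ".") {
		m, ok := cur.(map[string]interface{})
		if !ok {
			return nil, false // path descends into a non-object
		}
		cur, ok = m[p]
		if !ok {
			return nil, false // key missing at this level
		}
	}
	return cur, true
}

func main() {
	var msg map[string]interface{}
	json.Unmarshal([]byte(`{"key_a":{"key_b":{"key_c":7}}}`), &msg)
	v, ok := getKey(msg, "key_a.key_b.key_c")
	fmt.Println(v, ok)
}
```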

handling of HTTP streaming/websockets through daemon

Our current architecture is limited in that we cannot use the http server created by daemon to handle websockets or http streaming capabilities on a per block basis. This means that each block that needs to http stream/use websockets has to stand up another http server on a different port.

Ideally we would write our own handler in the future that allows blocks more basic control over the kinds of HTTP traffic they can handle.
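For the HTTP-streaming half, a per-block handler mounted on the daemon's existing server could use `http.Flusher` to push messages as they arrive; this is a sketch of that idea, not the daemon's actual API:

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// streamHandler returns a handler the daemon could mount per block, flushing
// each message to the client instead of standing up a second server.
func streamHandler(msgs <-chan string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		f, ok := w.(http.Flusher)
		if !ok {
			http.Error(w, "streaming unsupported", http.StatusInternalServerError)
			return
		}
		for m := range msgs {
			fmt.Fprintln(w, m)
			f.Flush() // push the line to the client immediately
		}
	}
}

func main() {
	msgs := make(chan string, 2)
	msgs <- `{"a":1}`
	msgs <- `{"a":2}`
	close(msgs)

	rec := httptest.NewRecorder()
	streamHandler(msgs)(rec, httptest.NewRequest("GET", "/stream", nil))
	fmt.Print(rec.Body.String())
}
```

Websockets would need an upgrade step, but could hang off the same per-block routing.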
