
oklog's Introduction

OK Log is archived

I hoped to find the opportunity to continue developing OK Log after the spike of its creation. Unfortunately, despite effort, no such opportunity presented itself. Please look at OK Log for inspiration, and consider using the (maintained!) projects that came from it, ulid and run.


OK Log

OK Log is a distributed and coördination-free log management system for big ol' clusters. It's an on-prem solution that's designed to be a sort of building block: easy to understand, easy to operate, and easy to extend.

Is OK Log for me?

You may consider OK Log if...

  • You're tailing your logs manually, find it annoying, and want to aggregate them without a lot of fuss
  • You're using a hosted solution like Loggly, and want to move logs on-prem
  • You're using Elasticsearch, but find it unreliable, difficult to operate, or don't use many of its features
  • You're using a custom log pipeline with e.g. Fluentd or Logstash, and having performance problems
  • You just wanna, like, grep your logs — why is this all so complicated?

Getting OK Log

OK Log is distributed as a single, statically-linked binary for a variety of target architectures. Download the latest release from the releases page.

Quickstart

$ oklog ingeststore -store.segment-replication-factor 1
$ ./myservice | oklog forward localhost
$ oklog query -from 5m -q Hello
2017-01-01 12:34:56 Hello world!

Deploying

Small installations

If you have relatively small log volume, you can deploy a cluster of identical ingeststore nodes. By default, the replication factor is 2, so you need at least 2 nodes. Use the -cluster flag to specify a routable IP address or hostname for each node to advertise itself on. And let each node know about at least one other node with the -peer flag.

foo$ oklog ingeststore -cluster foo -peer foo -peer bar -peer baz
bar$ oklog ingeststore -cluster bar -peer foo -peer bar -peer baz
baz$ oklog ingeststore -cluster baz -peer foo -peer bar -peer baz

To grow the cluster, just add a new node, and tell it about at least one other node via the -peer flag. Optionally, you can run the rebalance tool (TODO) to redistribute the data over the new topology. To shrink the cluster, kill fewer nodes than the replication factor at a time, and run the repair tool (TODO) to re-replicate lost records.
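
For example, a fourth node could join the cluster above like so (a sketch; the hostname qux is hypothetical, and any existing node works as the peer):

qux$ oklog ingeststore -cluster qux -peer foo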

All configuration is done via command-line flags. You can change things like the log retention period (default 7d), the target segment file size (default 128MB), and the maximum age of various stages of the logging pipeline. Most defaults should be sane, but you should always audit them for your environment.
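
For example, to keep logs for two weeks and use smaller segment files, a node might be started like this. The flag names are taken from the startup scripts quoted later on this page and the values are illustrative; check oklog ingeststore -h for the authoritative list in your version:

$ oklog ingeststore -cluster foo -peer foo -peer bar \
    -store.segment-retain 336h \
    -store.segment-target-size 67108864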

Large installations

If you have relatively large log volume, you can split the ingest and store (query) responsibilities. Ingest nodes make lots of sequential writes, and benefit from fast disks and moderate CPU. Store nodes make lots of random reads and writes, and benefit from large disks and lots of memory. Both ingest and store nodes join the same cluster, so provide them with the same set of peers.

ingest1$ oklog ingest -cluster 10.1.0.1 -peer ...
ingest2$ oklog ingest -cluster 10.1.0.2 -peer ...

store1$ oklog store -cluster 10.1.9.1 -peer ...
store2$ oklog store -cluster 10.1.9.2 -peer ...
store3$ oklog store -cluster 10.1.9.3 -peer ...

To add more raw ingest capacity, add more ingest nodes to the cluster. To add more storage or query capacity, add more store nodes. Also, make sure you have enough store nodes to consume from the ingest nodes without backing up.
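
For example, a fourth store node could be added to the cluster above like so (the address is hypothetical; any existing node will do as the peer):

store4$ oklog store -cluster 10.1.9.4 -peer 10.1.0.1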

Forwarding

The forwarder is basically just netcat with some reconnect logic. Pipe the stdout/stderr of your service to the forwarder, configured to talk to your ingesters.

$ ./myservice | oklog forward ingest1 ingest2

OK Log integrates in a straightforward way with runtimes like Docker and Kubernetes. See the Integrations page for more details.
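
For example, a single container's output can be piped straight into the forwarder; this is just a sketch (mycontainer, ingest1, and ingest2 are placeholders), and the Integrations page covers more complete setups such as logspout:

$ docker logs -f mycontainer 2>&1 | oklog forward ingest1 ingest2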

Querying

Querying is an HTTP GET to /query on any of the store nodes. OK Log comes with a query tool to make it easier to play with. A good workflow is to first use the -stats flag to refine your query. When you're satisfied it's sufficiently constrained, drop -stats to get the results.

$ oklog query -from 2h -to 1h -q "myservice.*(WARN|ERROR)" -regex
2016-01-01 10:34:58 [myservice] request_id 187634 -- [WARN] Get /check: HTTP 419 (0B received)
2016-01-01 10:35:02 [myservice] request_id 288211 -- [ERROR] Post /ok: HTTP 500 (0B received)
2016-01-01 10:35:09 [myservice] request_id 291014 -- [WARN] Get /next: HTTP 401 (0B received)
 ...
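
Since querying is plain HTTP, you can also hit the store API directly with e.g. curl. This is only a sketch: the path and parameter names are taken from the verbose query output quoted elsewhere on this page, store1 is a placeholder, and the times are RFC 3339:

$ curl "http://store1:7650/store/query?from=2017-01-01T10:00:00Z&to=2017-01-01T12:00:00Z&q=myservice"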

To query structured logs, combine a basic grep filter expression with a tool like jq.

$ oklog query -from 1h -q /api/v1/login
{"remote_addr":"10.34.115.3:50032","path":"/api/v1/login","method":"POST","status_code":200}
{"remote_addr":"10.9.101.113:51442","path":"/api/v1/login","method":"POST","status_code":500}
{"remote_addr":"10.9.55.2:55210","path":"/api/v1/login","method":"POST","status_code":200}
{"remote_addr":"10.34.115.1:51610","path":"/api/v1/login","method":"POST","status_code":200}
...

$ oklog query -from 1h -q /api/v1/login | jq '. | select(.status_code == 500)'
{
	"remote_addr": "10.9.55.2:55210",
	"path": "/api/v1/login",
	"method": "POST",
	"status_code": 500
}
...

UI

OK Log ships with a basic UI for making queries. You can access it on any store or ingeststore node, on the public API port (default 7650), path /ui. So, e.g. http://localhost:7650/ui.

Further reading

Integrations

Unofficial Docker images

Translation


OK icon by Karthik Srinivas from the Noun Project. Development supported by DigitalOcean.

oklog's People

Contributors

1046102779, bkmit, drdaeman, dvrkps, fabxc, juanpabloaj, lizhaohui836, peterbourgon, tsenart, xla, zombor


oklog's Issues

store: limit maximum number of concurrent segment readers

Right now, during queries, we spawn a new segment reader (and filterer) per matching segment. That'll hit our fdlimit real real fast, and at some point we get diminishing returns from the concurrency. We should limit the maximum number of concurrent segment readers, and have some smarts in there to effectively distribute those readers among queries.

TLS support

Hi,

Sorry for opening a lame ticket instead of a pull request.

Is TLS (with client cert authentication) something that may eventually be implemented in oklog?

Empty response body

Please check this request
EDIT: removing potentially malicious link —PB

Tag or label feature

Hi,

Just wondering if there are any ideas on how to distinguish between very similar logs that are pulled/pushed from multiple log endpoints? At the moment I am unable to find a way to distinguish between very similar logs, unless the log records happen to include something like a host name/IP, but then I could be missing something obvious. Is there any metadata added by the log forward feature? If not, might tags/labels added by the forward feature be a possible solution, or would this be out of scope?

Thanks to Peter for the great idea and all the work so far. I am looking forward to seeing how oklog evolves in the future. 👍
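
In the meantime, the only workaround I can think of is to make the forwarder's input carry the metadata itself, e.g. by prefixing each record with the hostname before it reaches oklog forward. A rough sketch only, not an official feature (awk's fflush keeps records flowing line by line instead of being block-buffered):

$ ./myservice | awk -v host="$(hostname)" '{ print host, $0; fflush() }' | oklog forward ingest1 ingest2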

Documentation by example

Hey,

Your project looks very promising and very interesting, however I can't figure out how to start using this tool :/

From the documentation I only know how to plug Docker, via the logspout forwarder, into the oklog ingester, and ... that's all.

I'm looking to integrate oklog into a docker-compose stack; my questions are:

  • How do I register the oklog ingester? (command line, config)
  • How do I expose the UI? (command line to execute, exposed port?)
  • Forwarding is explained, so it's OK 👍

Could you provide or explain how to achieve this (so that I can contribute a docker-compose example as a PR), please?

Thank you

Syslog integration

Is there any pattern for integrating syslog streams into this? If not, would you be interested in a patch that accepted the syslog protocol? This seems to be a standard way to handle containers.
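
In the meantime, since the forwarder just reads stdin, a crude workaround might be to tail the syslog file into it. A sketch only; the log path varies by distribution:

$ tail -F /var/log/syslog | oklog forward ingest1 ingest2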

UI not parsing query response correctly

Hello,

I've been using the query tool for a few days, and just tried the UI, which looks promising. Although I got it working on Chrome, it doesn't work on Firefox, failing with the error below.
(screenshot attached)

Might not have been noticed yet!
Julien

Intermittent test failure on TestHandleConnectionsCleanup()

This part of the test seems to be failing under certain conditions: https://github.com/oklog/oklog/blob/master/pkg/ingest/conn_test.go#L68-L85

See Travis job https://travis-ci.org/oklog/oklog/jobs/218623755 for a failure. (It's on my branch, but I don't think my changes affect this test)

=== RUN   TestHandleConnectionsCleanup
--- FAIL: TestHandleConnectionsCleanup (1.06s)
	conn_test.go:75: HandleConnections successfully torn down (accept tcp [::]:43165: use of closed file or network connection)
	conn_test.go:84: timeout waiting for Close: initial Closes=0, current Closes=0
...
...

The command "go test -v -race ./{cmd,pkg}/..." exited with 1.

store: after panic LOCK file left in place

matthew@log-store-1:~$ oklog store -cluster $(hostname --ip-address) -peer log-store-1 -peer log-store-2 -peer log-store-3 -peer log-ingest-1 -peer log-ingest-2 -peer log-ingest-3 --store.path=/data/logs
ts=2017-04-06T12:21:04.761341821Z level=info cluster=10.240.0.24:7659
ts=2017-04-06T12:21:04.761668709Z level=info API=tcp://0.0.0.0:7650
ts=2017-04-06T12:21:04.778903484Z level=info StoreLog=/data/logs
ts=2017-04-06T12:21:04.859704498Z level=debug component=cluster Join=1
panic: ulid: bad data size when unmarshaling

goroutine 16 [running]:
github.com/oklog/oklog/vendor/github.com/oklog/ulid.MustParse(0xc420242a8b, 0x8, 0x0, 0x0)
/Users/pbourgon/src/github.com/oklog/oklog/vendor/github.com/oklog/ulid/ulid.go:98 +0x88
github.com/oklog/oklog/pkg/store.(*fileLog).Overlapping(0xc42016eec0, 0x40, 0x87ae4b, 0xc42003abc8, 0xc42003abc0, 0xc42004cd20)
/Users/pbourgon/src/github.com/oklog/oklog/pkg/store/file_log.go:157 +0x70b
github.com/oklog/oklog/pkg/store.(Log).Overlapping-fm(0xe00000040, 0xa07ee8, 0xc420059380, 0x9ebd47, 0xb)
/Users/pbourgon/src/github.com/oklog/oklog/pkg/store/compact.go:48 +0x2f
github.com/oklog/oklog/pkg/store.(*Compacter).compact(0xc420059380, 0x9ebd47, 0xb, 0xc42003adf8, 0x0, 0x0, 0x0)
/Users/pbourgon/src/github.com/oklog/oklog/pkg/store/compact.go:86 +0xf0
github.com/oklog/oklog/pkg/store.(*Compacter).Run.func1()
/Users/pbourgon/src/github.com/oklog/oklog/pkg/store/compact.go:48 +0x7a
github.com/oklog/oklog/pkg/store.(*Compacter).Run(0xc420059380)
/Users/pbourgon/src/github.com/oklog/oklog/pkg/store/compact.go:58 +0x2b1
main.runStore.func7(0x0, 0x0)
/Users/pbourgon/src/github.com/oklog/oklog/cmd/oklog/store.go:260 +0x2a
github.com/oklog/oklog/pkg/group.(*Group).Run.func1(0xc420059440, 0xc420013250, 0xc420013260)
/Users/pbourgon/src/github.com/oklog/oklog/pkg/group/group.go:55 +0x27
created by github.com/oklog/oklog/pkg/group.(*Group).Run
/Users/pbourgon/src/github.com/oklog/oklog/pkg/group/group.go:56 +0xa8
matthew@log-store-1:~$ oklog store -cluster $(hostname --ip-address) -peer log-store-1 -peer log-store-2 -peer log-store-3 -peer log-ingest-1 -peer log-ingest-2 -peer log-ingest-3 --store.path=/data/logs
ts=2017-04-06T12:21:08.870118774Z level=info cluster=10.240.0.24:7659
ts=2017-04-06T12:21:08.870255775Z level=info API=tcp://0.0.0.0:7650
/data/logs/LOCK already exists; another process is running, or the file is stale

testing: integration, system, and failure modeling

The failure modes are thought-through, but not empirically validated. We should build a basic testing harness for failure modeling, in the style of a simplified Jepsen. Also, it would be good to establish a means to verify overall system throughput (MBps) and latency (ingest-to-query).

Structured logging and querying a single column in a large data set

Hey @peterbourgon great project.

I was quickly reading the source and it looks like you're creating append-only segments of the raw logs and for queries you traverse the full set of logs from a small chunk of time.

My use case is to send structured logs to oklog and search along a column in a large data set over a long period of time, while keeping the queries fast (<100ms per search result set).

Have you considered using an embedded DB (e.g. github.com/dgraph-io/badger) that could potentially offer the building blocks to add support for structured logging schemas, adapters, and indexing?

It would be cool to hook up bleve to certain columns, too.

forwarder, others: remove the record size limit

When I try to ingest a large log file, the forwarder will fail with bufio.Scanner: token too long.

> cat 2016-11-04-06.tsv|head -n 100000|./oklog forward localhost
ts=2017-01-18T22:46:24.876892152Z level=debug raw_target=tcp://localhost:7651 resolved_target=tcp://localhost:7651
ts=2017-01-18T22:46:25.025694007Z level=info stdin=exhausted due_to="bufio.Scanner: token too long"

crash when testing "oklog stream"

This is with oklog 0.2.1. I had run oklog stream -store log-store-1 -q "JID-" in a separate window, and when I ^C'd it I got a crash.

matthew@log-store-1:~$ !okl
oklog store -cluster $(hostname --ip-address) -peer log-store-1 -peer log-store-2 -peer log-store-3 -peer log-ingest-1 -peer log-ingest-2 -peer log-ingest-3 --store.path=/data/logs
ts=2017-04-06T13:14:47.361018209Z level=info cluster=10.240.0.24:7659
ts=2017-04-06T13:14:47.361442714Z level=info API=tcp://0.0.0.0:7650
ts=2017-04-06T13:14:47.361859484Z level=info StoreLog=/data/logs
panic: send on closed channel

goroutine 3632 [running]:
github.com/oklog/oklog/pkg/stream.readOnce(0xc9c3c0, 0xc4252b5540, 0xc4252b76a0, 0xc422a77d20, 0x10, 0xc4252b1e60, 0x0, 0x0)
/Users/peter/src/github.com/oklog/oklog/pkg/stream/stream.go:132 +0x309
github.com/oklog/oklog/pkg/stream.readUntilCanceled(0xc9c3c0, 0xc4252b5540, 0xc4252b76a0, 0xc422a77d20, 0x10, 0xc4252b1e60, 0xa335b0)
/Users/peter/src/github.com/oklog/oklog/pkg/stream/stream.go:115 +0x61
created by github.com/oklog/oklog/pkg/stream.updateActive
/Users/peter/src/github.com/oklog/oklog/pkg/stream/stream.go:94 +0x23c

query: perform read repair

Queries can perform read repair, by counting each unique ULID (and thus log record) encountered during the deduplicate phase. Those which have lower representation than expected can be bundled together into a repair segment, which can be replicated through the normal replication mechanisms. Care should be taken to only do this work some percent of all queries; to perform the replication outside of the query lifecycle; etc.

ingeststore non functional

Following the directions in the README, I've set up two hosts to test ingeststore. After starting the hosts I get:

matthew@log-store-1:~$ oklog ingeststore -cluster log-store-1 -peer log-store-1 -peer log-store-2
ts=2017-03-30T17:36:20.410582124Z level=info cluster=log-store-1:7659
ts=2017-03-30T17:36:20.410667763Z level=info fast=tcp://0.0.0.0:7651
ts=2017-03-30T17:36:20.410689283Z level=info durable=tcp://0.0.0.0:7652
ts=2017-03-30T17:36:20.410707298Z level=info bulk=tcp://0.0.0.0:7653
ts=2017-03-30T17:36:20.410733243Z level=info API=tcp://0.0.0.0:7650
ts=2017-03-30T17:36:20.410890213Z level=info ingest_path=data/ingest
ts=2017-03-30T17:36:20.410950353Z level=info store_path=data/store
ts=2017-03-30T17:36:20.421332098Z level=debug component=cluster Join=1
ts=2017-03-30T17:36:20.521814528Z level=warn component=Consumer state=gather replication_factor=2 available_peers=1 err="replication currently impossible"
ts=2017-03-30T17:36:21.522022322Z level=warn component=Consumer state=gather replication_factor=2 available_peers=1 err="replication currently impossible"
ts=2017-03-30T17:36:22.522240947Z level=warn component=Consumer state=gather replication_factor=2 available_peers=1 err="replication currently impossible"
ts=2017-03-30T17:36:23.621790312Z level=warn component=Consumer state=gather replication_factor=2 available_peers=1 err="replication currently impossible"
ts=2017-03-30T17:36:24.621966192Z level=warn component=Consumer state=gather replication_factor=2 available_peers=1 err="replication currently impossible"
ts=2017-03-30T17:36:25.42175071Z level=warn component=cluster NumMembers=1
ts=2017-03-30T17:36:25.622150195Z level=warn component=Consumer state=gather replication_factor=2 available_peers=1 err="replication currently impossible"
ts=2017-03-30T17:36:26.721837292Z level=warn component=Consumer state=gather replication_factor=2 available_peers=1 err="replication currently impossible"
ts=2017-03-30T17:36:27.722031419Z level=warn component=Consumer state=gather replication_factor=2 available_peers=1 err="replication currently impossible"
ts=2017-03-30T17:36:28.722254832Z level=warn component=Consumer state=gather replication_factor=2 available_peers=1 err="replication currently impossible"
ts=2017-03-30T17:36:29.821805223Z level=warn component=Consumer state=gather replication_factor=2 available_peers=1 err="replication currently impossible"

....

matthew@log-store-2:~$ oklog ingeststore -cluster log-store-2 -peer log-store-1 -peer log-store-2
ts=2017-03-30T17:36:34.307731343Z level=info cluster=log-store-2:7659
ts=2017-03-30T17:36:34.307863365Z level=info fast=tcp://0.0.0.0:7651
ts=2017-03-30T17:36:34.307892432Z level=info durable=tcp://0.0.0.0:7652
ts=2017-03-30T17:36:34.307917061Z level=info bulk=tcp://0.0.0.0:7653
ts=2017-03-30T17:36:34.307940364Z level=info API=tcp://0.0.0.0:7650
ts=2017-03-30T17:36:34.308098722Z level=info ingest_path=data/ingest
ts=2017-03-30T17:36:34.308193966Z level=info store_path=data/store
ts=2017-03-30T17:36:34.319719011Z level=debug component=cluster Join=2
ts=2017-03-30T17:36:41.420291562Z level=warn component=Consumer state=gather replication_factor=2 available_peers=1 err="replication currently impossible"
ts=2017-03-30T17:36:42.420504608Z level=warn component=Consumer state=gather replication_factor=2 available_peers=1 err="replication currently impossible"
^Cts=2017-03-30T17:36:43.420713048Z level=debug component=Compacter shutdown_took=7.052µs
received signal interrupt

multiple processes to a single forwarder

What would be the best practice for multiple processes that want to forward the logs?

A VM can have multiple processes running. Having a single forwarder for each process seems like overkill.

Ideally, I feel like a single forwarder should be able to handle it.

I thought about using a single named pipe that things could write to and the forwarder reads from. There are limitations with a named pipe, though.

I like that the process can just write to the stdin of a forwarder; that way it does not have to worry about back pressure itself. It is always the kernel's and the forwarder's worry.

Any leads for a convention?
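
For concreteness, the named-pipe approach I was considering looks roughly like this. A sketch only; note that the forwarder will see EOF on its stdin and exit once every writer has closed the pipe, which is one of the limitations I alluded to:

$ mkfifo /tmp/oklog.fifo
$ oklog forward ingest1 ingest2 < /tmp/oklog.fifo &
$ ./service-a >> /tmp/oklog.fifo &
$ ./service-b >> /tmp/oklog.fifo &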

Documentation for fixing up data after an abnormal exit / disk full

I have recently caused a 'disk full' situation on an ingeststore node, and this left the data in a state where ingeststore would not start cleanly. (This could also be caused by e.g. an untimely kill -9).

I'd like to document (and clarify) the process for managing this. How would you want to document this kind of procedure? Some kind of operations manual?

So, this is the process I documented internally to remove LOCK files and invalid .pending files. I think it's roughly the right approach, but I'd like to clarify it. Note that my internal docs begin with a link to a 'resizing disks' doc.

  1. Try restarting the oklog service. If it fails with a 'LOCK file' message, run the following:
    cd /media/logs/oklog # wherever your oklog service is storing log data
    pgrep oklog # confirm that oklog is NOT still running
    rm data/store/LOCK data/ingest/LOCK
  2. Try restarting the oklog service again. If it fails with 'short record', it seems a record was only partly written. It seems to me that this is only likely in a '.pending' file.
    cd /media/logs/oklog # wherever your oklog service is storing log data
    mv data/ingest/*.pending /tmp/
  3. Try restarting oklog again. It should be good now.

forwarder: optionally drop messages instead of applying backpressure

At the moment, if a producer produces records faster than the forwarder can forward them, the producer will experience backpressure. I think this is the correct default, but it would be useful if the forwarder had a "drop" mode, where it would buffer records in e.g. a ring buffer, and drop records that can't be forwarded in time.

error when querying store

I have a 6-peer cluster set up: three ingesters and three store nodes. I started piling in a lot of logs for load-testing purposes. After 10 minutes I queried:

$ oklog query -store log-store-1 -from 1h -q "JID-b6179b2707845b363de309af"
results all good (about 5 lines of text)

After 15 minutes:

$ oklog query -store log-store-1 -from 1h -q "JID-b6179b2707845b363de309af"
results all good (same 5 lines of text)

After 20 minutes:

$ oklog query -store log-store-1 -from 1h -q "JID-b6179b2707845b363de309af"

Nothing.

In the console for the store node I see:

ts=2017-04-05T23:06:15.888364628Z level=error during=query_gather status_code=500 err="open /data/logs/01BD05E0P7H7B1WGY576CMHMSX-01BD05GNK917D75W1SNMXV1TMK.flushed: no such file or directory"

I can also query log-store-2 and -3 and I get the same result. In log-store-1 I also sometimes get:

ts=2017-04-05T23:06:01.113046951Z level=error during=query_gather err="Get http://10.240.0.32:7650/store/_query?from=2017-04-05T22%3A05%3A56Z&to=2017-04-05T23%3A05%3A56Z&q=JID-b6179b2707845b363de309af: net/http: timeout awaiting response headers"

where 32 is store-2.

ts=2017-04-05T23:06:01.113173666Z level=error during=query_gather err="Get http://10.240.0.40:7650/store/_query?from=2017-04-05T22%3A05%3A56Z&to=2017-04-05T23%3A05%3A56Z&q=JID-b6179b2707845b363de309af: net/http: timeout awaiting response headers"

And where 40 is store-3.

Use oklog/pkg/group with async Run methods.

After watching your amazing talk about run groups
https://youtu.be/LHe1Cb_Ud_M?t=1487

I decided to refactor the main.go in prometheus.
prometheus/prometheus#3246

Some of the Run methods are async, so when I put them in a group it triggers a teardown.
https://github.com/prometheus/prometheus/blob/master/rules/manager.go#L380
The package still has a .Stop method which should run on app exit.

I see 2 options:

  1. refactor all .Run methods to be blocking
  2. put the async Run-ers in the same group as a single blocking one, and group the destructors the same way.

I prefer option 1, whereas option 2 would be easier to implement, so it would be interesting to hear a second opinion.

Some packages wouldn't make sense with a blocking constructor, but still need a destructor.

panic

ts=2017-02-17T17:03:13.401949521Z level=info cluster=0.0.0.0:7659
ts=2017-02-17T17:03:13.402359566Z level=info fast=tcp://0.0.0.0:7651
ts=2017-02-17T17:03:13.402686896Z level=info durable=tcp://0.0.0.0:7652
ts=2017-02-17T17:03:13.403006829Z level=info bulk=tcp://0.0.0.0:7653
ts=2017-02-17T17:03:13.403323151Z level=info API=tcp://0.0.0.0:7650
ts=2017-02-17T17:03:13.403588199Z level=info ingest_path=data/ingest
ts=2017-02-17T17:03:13.403772636Z level=info store_path=data/store
ts=2017-02-17T17:03:13.405190477Z level=debug component=cluster Join=0
2017/02/18 01:08:14 http: panic serving 127.0.0.1:11723: runtime error: invalid memory address or nil pointer dereference
goroutine 1939 [running]:
net/http.(*conn).serve.func1(0xc42a811040)
	/usr/local/Cellar/go/1.8rc2_2/libexec/src/net/http/server.go:1721 +0xd0
panic(0x8eef80, 0xc0b410)
	/usr/local/Cellar/go/1.8rc2_2/libexec/src/runtime/panic.go:489 +0x2cf
github.com/oklog/oklog/pkg/store.makeConcurrentFilteringReadClosers.func1(0xc42bed3830, 0xc42bed3810)
	/Users/peter/src/github.com/oklog/oklog/pkg/store/query.go:232 +0x57
github.com/oklog/oklog/pkg/store.makeConcurrentFilteringReadClosers(0xbe93e0, 0xc4201b284c, 0xc482cf5580, 0x5, 0x8, 0xc43415df90, 0x100000, 0xc42eb8ea50, 0x5, 0x5, ...)
	/Users/peter/src/github.com/oklog/oklog/pkg/store/query.go:242 +0x2c9
github.com/oklog/oklog/pkg/store.newQueryReadCloser(0xbe93e0, 0xc4201b284c, 0xc424e2a200, 0x1f, 0x20, 0xc43415df90, 0x100000, 0x0, 0x0, 0x9ea4f887, ...)
	/Users/peter/src/github.com/oklog/oklog/pkg/store/query.go:155 +0x44e
github.com/oklog/oklog/pkg/store.(*fileLog).Query(0xc4201b8d40, 0xed03916ed, 0xc400000000, 0xc1a0a0, 0xed03924fd, 0x0, 0xc1a0a0, 0xc482cf5461, 0x10, 0x100, ...)
	/Users/peter/src/github.com/oklog/oklog/pkg/store/file_log.go:104 +0x27e
github.com/oklog/oklog/pkg/store.(*API).handleInternalQuery(0xc4201a4500, 0xbe3f80, 0xc482cf2a40, 0xc43fc7a000, 0xc482cf2a01)
	/Users/peter/src/github.com/oklog/oklog/pkg/store/api.go:195 +0x265
github.com/oklog/oklog/pkg/store.(*API).ServeHTTP(0xc4201a4500, 0xbe4580, 0xc420f7f960, 0xc43fc7a000)
	/Users/peter/src/github.com/oklog/oklog/pkg/store/api.go:68 +0x593
net/http.StripPrefix.func1(0xbe4580, 0xc420f7f960, 0xc43fc7a000)
	/usr/local/Cellar/go/1.8rc2_2/libexec/src/net/http/server.go:1977 +0xcf
net/http.HandlerFunc.ServeHTTP(0xc420268180, 0xbe4580, 0xc420f7f960, 0xc43fc7a000)
	/usr/local/Cellar/go/1.8rc2_2/libexec/src/net/http/server.go:1942 +0x44
net/http.(*ServeMux).ServeHTTP(0xc420268000, 0xbe4580, 0xc420f7f960, 0xc43fc7a000)
	/usr/local/Cellar/go/1.8rc2_2/libexec/src/net/http/server.go:2238 +0x130
net/http.serverHandler.ServeHTTP(0xc42013ee70, 0xbe4580, 0xc420f7f960, 0xc43fc7a000)
	/usr/local/Cellar/go/1.8rc2_2/libexec/src/net/http/server.go:2568 +0x92
net/http.(*conn).serve(0xc42a811040, 0xbe4fc0, 0xc42031c880)
	/usr/local/Cellar/go/1.8rc2_2/libexec/src/net/http/server.go:1825 +0x612
created by net/http.(*Server).Serve
	/usr/local/Cellar/go/1.8rc2_2/libexec/src/net/http/server.go:2668 +0x2ce

ingest: implement connection balancing based on write load

Ingesters can gossip some concept of load, e.g. connected clients, total ingest throughput, etc. With that knowledge shared, they can do some kind of connection balancing, in order of increasing severity:

  1. Extend the forwarder/ingester protocol to allow the ingesters to suggest other ingesters to newly connected forwarders. I think this can be as simple as having the heaviest-loaded ingesters write the address of the lightest-loaded ingesters to each new connection, and relying on the forwarder to read that data, drop the connection to the heavily-loaded ingester, and connect to the suggested one.
  2. Have highly-loaded ingester/s turn off their connection listener when certain conditions are met. Those conditions must be very conservative. There must be very few ingesters in this state in a given cluster, perhaps no more than one. They must not stay that way permanently. Basically, it's always better to accept new connections to an overloaded ingester than to accidentally DoS ourselves by shutting down too many listeners.
  3. Have highly-loaded ingester/s actively terminate forwarder connections. This must only occur after the connection listener is shut down, and must be even more conservative. They should probably bias to killing the most recent connections, or the ones that are doing the lowest throughput, either instantaneous or lifetime. It should not terminate connections down to zero. It should practice some form of hysteresis. And so on.

integration: Kubernetes

We should make it as easy as possible to hook up OK Log to an existing Kubernetes cluster. At first glance, this involves some configuration or manifest files to install forwarders at the appropriate place/s, and an optional set of manifest files to actually deploy an OK Log installation into the cluster. (It probably makes sense to host that off-cluster for most people, but an all-in solution will be nice to have.)

bosh release and deployment with oklog

Introduction

I'm with the Loggregator team on Cloud Foundry -- name dropping. 😛

Cloud Foundry uses a deployment tool (known as bosh), which manages VMs and deploys software onto them. I was trying to create an oklog bosh release.

I have managed to deploy the ingest and store onto two VMs each. I've tried using testsvc and forward to then query from them. It didn't work. It has been hard to track down the issue, so I am going to brain-dump everything I know so far about the deployment.

Deployment

VMs

$ bosh -e lite -d oklog vms
Using environment '192.168.50.4' as user 'admin'

Task 54. Done

Deployment 'oklog'

Instance                                     Process State  AZ  IPs           VM CID                                VM Type    
ingest/680e4538-39ba-4bc4-915a-c1945da3b3c8  running        z1  10.244.0.131  621a2e09-5457-4b39-7e4c-86629e6cb6bc  m3.medium  
ingest/91d147fc-d391-49b0-ab9b-65d5278a45c9  running        z1  10.244.0.128  a1ef02bc-e395-4a14-7793-38bf8f8a3a18  m3.medium  
store/a4d1e94e-4f2a-4849-bfe6-3298f6aa4740   running        z1  10.244.0.132  a952c452-55e5-4504-7189-0cc5a0a6c497  m3.medium  
store/cdccad96-13d1-4dc7-8870-f246276d75ef   running        z1  10.244.0.129  112b2dd1-8eb8-4f36-6cad-7372e50f7e51  m3.medium  



4 vms

Succeeded

Startup scripts

Store

#!/bin/bash -e


RUN_DIR=/var/vcap/sys/run/oklog-store
LOG_DIR=/var/vcap/sys/log/oklog-store
PIDFILE=$RUN_DIR/oklog.pid


source /var/vcap/packages/oklog/pid_utils.sh

case $1 in
  start)
    pid_guard $PIDFILE "oklog-store"

    mkdir -p $RUN_DIR
    chown -R vcap:vcap $RUN_DIR

    mkdir -p $LOG_DIR
    chown -R vcap:vcap $LOG_DIR

    store_path=$(dirname "/var/vcap/data/oklog-store/data")
    mkdir -p $store_path
    chown -R vcap:vcap $store_path

    echo $$ > $PIDFILE

    
    
    exec chpst -u vcap:vcap /var/vcap/packages/oklog/oklog store \
      -api tcp://0.0.0.0:7650 \
      -cluster tcp://0.0.0.0:7659 \
      -filesystem real \
      -store.path /var/vcap/data/oklog-store/data \
      -store.segment-buffer-size 1048576 \
      -store.segment-consumers 1 \
      -store.segment-purge 24h0m0s \
      -store.segment-replication-factor 2 \
      -store.segment-retain 168h0m0s \
      -store.segment-target-age 3s \
      -store.segment-target-size 134217728 \
       -peer 10.244.0.132  -peer 10.244.0.128  -peer 10.244.0.131  \
      1>>$LOG_DIR/oklog.stdout.log \
      2>>$LOG_DIR/oklog.stderr.log

    ;;

  stop)
    kill_and_wait $PIDFILE

    ;;

  *)
    echo "Usage $0 (start|stop)"

    ;;

esac

Ingest

#!/bin/bash -e


RUN_DIR=/var/vcap/sys/run/oklog-ingest
LOG_DIR=/var/vcap/sys/log/oklog-ingest
PIDFILE=$RUN_DIR/oklog.pid


source /var/vcap/packages/oklog/pid_utils.sh

case $1 in
  start)
    pid_guard $PIDFILE "oklog-ingest"

    mkdir -p $RUN_DIR
    chown -R vcap:vcap $RUN_DIR

    mkdir -p $LOG_DIR
    chown -R vcap:vcap $LOG_DIR

    ingest_path=$(dirname "/var/vcap/data/oklog-ingest/data")
    mkdir -p $ingest_path
    chown -R vcap:vcap $ingest_path

    echo $$ > $PIDFILE

    
    
    exec chpst -u vcap:vcap /var/vcap/packages/oklog/oklog ingest \
      -api tcp://0.0.0.0:7650 \
      -cluster tcp://0.0.0.0:7659 \
      -filesystem real \
      -ingest.bulk tcp://0.0.0.0:7653 \
      -ingest.durable tcp://0.0.0.0:7652 \
      -ingest.fast tcp://0.0.0.0:7651 \
      -ingest.path /var/vcap/data/oklog-ingest/data \
      -ingest.segment-flush-age 3s \
      -ingest.segment-flush-size 16777216 \
      -ingest.segment-pending-timeout 1m0s \
       -peer 10.244.0.131  -peer 10.244.0.129  -peer 10.244.0.132  \
      1>>$LOG_DIR/oklog.stdout.log \
      2>>$LOG_DIR/oklog.stderr.log

    ;;

  stop)
    kill_and_wait $PIDFILE

    ;;

  *)
    echo "Usage $0 (start|stop)"

    ;;

esac

Logs

Store

/:~$ head -30 /var/vcap/sys/log/oklog-store/oklog.stderr.log 
ts=2017-01-27T00:47:45.032791311Z level=info cluster=0.0.0.0:7659
ts=2017-01-27T00:47:45.032956542Z level=info API=tcp://0.0.0.0:7650
ts=2017-01-27T00:47:45.033133903Z level=info StoreLog=/var/vcap/data/oklog-store/data
ts=2017-01-27T00:47:45.033543196Z level=debug component=cluster Join=0
ts=2017-01-27T01:08:40.033822884Z level=warn component=Consumer state=gather replication_factor=2 available_peers=1 err="replication currently impossible"
ts=2017-01-27T01:08:41.035525186Z level=warn component=Consumer state=gather replication_factor=2 available_peers=1 err="replication currently impossible"
ts=2017-01-27T01:08:42.05396819Z level=warn component=Consumer state=gather replication_factor=2 available_peers=1 err="replication currently impossible"
ts=2017-01-27T01:08:43.134136691Z level=warn component=Consumer state=gather replication_factor=2 available_peers=1 err="replication currently impossible"
ts=2017-01-27T01:08:44.140632238Z level=warn component=Consumer state=gather replication_factor=2 available_peers=1 err="replication currently impossible"
ts=2017-01-27T01:08:45.142074193Z level=warn component=Consumer state=gather replication_factor=2 available_peers=1 err="replication currently impossible"
ts=2017-01-27T01:08:47.068612433Z level=debug component=Compacter shutdown_took=7.396µs
received signal terminated
ts=2017-01-27T01:08:58.347728509Z level=info cluster=0.0.0.0:7659
ts=2017-01-27T01:08:58.347892984Z level=info API=tcp://0.0.0.0:7650
ts=2017-01-27T01:08:58.348059072Z level=info StoreLog=/var/vcap/data/oklog-store/data
ts=2017-01-27T01:08:58.350535461Z level=debug component=cluster Join=1
ts=2017-01-27T01:08:58.453592207Z level=warn component=Consumer state=gather replication_factor=2 available_peers=1 err="replication currently impossible"
ts=2017-01-27T01:08:59.454136853Z level=warn component=Consumer state=gather replication_factor=2 available_peers=1 err="replication currently impossible"
ts=2017-01-27T01:09:00.45520491Z level=warn component=Consumer state=gather replication_factor=2 available_peers=1 err="replication currently impossible"
ts=2017-01-27T01:09:01.559431343Z level=warn component=Consumer state=gather replication_factor=2 available_peers=1 err="replication currently impossible"
ts=2017-01-27T01:09:02.559832078Z level=warn component=Consumer state=gather replication_factor=2 available_peers=1 err="replication currently impossible"
ts=2017-01-27T01:09:03.560331983Z level=warn component=Consumer state=gather replication_factor=2 available_peers=1 err="replication currently impossible"
ts=2017-01-27T01:09:04.660447896Z level=warn component=Consumer state=gather replication_factor=2 available_peers=1 err="replication currently impossible"
ts=2017-01-27T01:09:08.358662209Z level=warn component=cluster NumMembers=1
ts=2017-01-27T01:09:13.356201575Z level=warn component=cluster NumMembers=1
ts=2017-01-27T01:09:18.351343846Z level=warn component=cluster NumMembers=1
ts=2017-01-27T01:09:23.351298909Z level=warn component=cluster NumMembers=1
ts=2017-01-27T01:09:28.358792823Z level=warn component=cluster NumMembers=1
ts=2017-01-27T01:09:33.351236223Z level=warn component=cluster NumMembers=1
ts=2017-01-27T01:09:38.351236934Z level=warn component=cluster NumMembers=1
# this message repeats

Ingest

/:~$ head -n 20 /var/vcap/sys/log/oklog-ingest/oklog.stderr.log 
ts=2017-01-27T00:47:26.456368326Z level=info cluster=0.0.0.0:7659
ts=2017-01-27T00:47:26.456533118Z level=info fast=tcp://0.0.0.0:7651
ts=2017-01-27T00:47:26.456570619Z level=info durable=tcp://0.0.0.0:7652
ts=2017-01-27T00:47:26.456594985Z level=info bulk=tcp://0.0.0.0:7653
ts=2017-01-27T00:47:26.456618124Z level=info API=tcp://0.0.0.0:7650
ts=2017-01-27T00:47:26.456765551Z level=info ingest_path=/var/vcap/data/oklog-ingest/data
ts=2017-01-27T00:47:26.457175671Z level=debug component=cluster Join=0
received signal terminated
ts=2017-01-27T01:08:39.98690952Z level=info cluster=0.0.0.0:7659
ts=2017-01-27T01:08:39.987050301Z level=info fast=tcp://0.0.0.0:7651
ts=2017-01-27T01:08:39.98713915Z level=info durable=tcp://0.0.0.0:7652
ts=2017-01-27T01:08:39.987178126Z level=info bulk=tcp://0.0.0.0:7653
ts=2017-01-27T01:08:39.987203832Z level=info API=tcp://0.0.0.0:7650
ts=2017-01-27T01:08:39.989955609Z level=info ingest_path=/var/vcap/data/oklog-ingest/data
ts=2017-01-27T01:08:39.992765407Z level=debug component=cluster Join=1
ts=2017-01-27T01:08:49.993246362Z level=warn component=cluster NumMembers=1
ts=2017-01-27T01:08:54.993428255Z level=warn component=cluster NumMembers=1
ts=2017-01-27T01:09:10.013011473Z level=warn component=cluster NumMembers=1
ts=2017-01-27T01:09:14.995265826Z level=warn component=cluster NumMembers=1
ts=2017-01-27T01:09:20.002733333Z level=warn component=cluster NumMembers=1
# this message repeats

Testing

$ ./oklog testsvc | ./oklog forward 10.244.0.131 10.244.0.128
reticulating splines...
ts=2017-01-27T21:29:55.829250893Z level=debug raw_target=tcp://10.244.0.131:7651 resolved_target=tcp://10.244.0.131:7651
foo starting, 512 bytes/record, 5 records/sec; 1 records/cycle, 200ms/cycle
^C
$ ./oklog query -q 0000000 -store 10.244.0.132 -regex -v
-from 2017-01-27T13:30:40-07:00 -to 2017-01-27T14:30:40-07:00
Response in 3.217658ms
Queried from 2017-01-27T13:30:40-07:00
Queried to 2017-01-27T14:30:40-07:00
Queried regular expression "0000000"
1 node(s) queried
0 segment(s) queried
0B (0MiB) maximum data set size
0 error(s)
1.054923ms server-reported duration

testsvc generates sequential identifiers, which is why I was looking for the 0000000 identifier.

Any leads would be greatly appreciated. I think I have everything deployed correctly.

store: implement rebalance process

If we add new store node/s to a cluster, they will boot up empty. There should be some process by which they can receive some of the data set from other nodes. (Needs design thinking.)

Add helm chart for Kubernetes integration

Per discussion in #13, I think it makes a lot of sense to run oklog ingest alongside logspout as a DaemonSet, with oklog store running as a deployment. I'll put together a helm chart in a few weeks for easy k8s setup.

panic in github.com/djherbis/nio

Was doing a simple test as described in the quickstart. When querying, I got this panic from the ingeststore:

% ~/Downloads/oklog-0.1.0-darwin-amd64 ingeststore -store.segment-replication-factor 1
ts=2017-01-17T14:20:54Z level=info cluster=0.0.0.0:7659
ts=2017-01-17T14:20:54Z level=info fast=tcp://0.0.0.0:7651
ts=2017-01-17T14:20:54Z level=info durable=tcp://0.0.0.0:7652
ts=2017-01-17T14:20:54Z level=info bulk=tcp://0.0.0.0:7653
ts=2017-01-17T14:20:54Z level=info API=tcp://0.0.0.0:7650
ts=2017-01-17T14:20:54Z level=info ingest_path=data/ingest
ts=2017-01-17T14:20:54Z level=info store_path=data/store
ts=2017-01-17T14:20:54Z level=debug component=cluster Join=0
panic: runtime error: slice bounds out of range

goroutine 1858 [running]:
github.com/djherbis/nio.(*PipeWriter).Write(0xc42000e050, 0xc4205ec99c, 0x1e, 0x664, 0x0, 0x0, 0x0)
	/Users/peter/src/github.com/djherbis/nio/sync.go:135 +0x30a
github.com/oklog/oklog/pkg/store.newConcurrentFilteringReadCloser.func1(0x2693d80, 0xc421f83620, 0xc42000e050, 0xc4200148c0)
	/Users/peter/src/github.com/oklog/oklog/pkg/store/query.go:265 +0x198
created by github.com/oklog/oklog/pkg/store.newConcurrentFilteringReadCloser
	/Users/peter/src/github.com/oklog/oklog/pkg/store/query.go:276 +0x210

Was running ingeststore like so:

% ~/Downloads/oklog-0.1.0-darwin-amd64 ingeststore -store.segment-replication-factor 1

a producer like so:

% while true; do echo hi; done | ~/Downloads/oklog-0.1.0-darwin-amd64 forward localhost

and this query:

% ~/Downloads/oklog-0.1.0-darwin-amd64 query -from 1m -v -q hi
-from 2017-01-17T10:25:07-04:00 -to 2017-01-17T10:26:07-04:00
Get http://localhost:7650/store/query?from=2017-01-17T10%3A25%3A07-04%3A00&to=2017-01-17T10%3A26%3A07-04%3A00&q=hi: EOF

Will try and dig in but perhaps you'll spot the problem sooner.

Does the peer function work?

I've downloaded the 0.1.1 and 0.1.2 versions to test.
Using the Quickstart setup on localhost with a single host is OK.
But I cannot get peering to work with the small installation; it shows errors as if it cannot connect to the other hosts.
When I change the order in which I start the programs, the Join number changes, as if they can communicate at some level.
Is there any step I am missing when creating a small installation?

droi@foo:~$ ./oklog-0.1.2-linux-amd64 ingeststore -peer 10.128.81.101 -peer 10.128.81.201 -peer 10.128.81.55
ts=2017-01-23T03:27:02.725786724Z level=info cluster=0.0.0.0:7659
ts=2017-01-23T03:27:02.726006903Z level=info fast=tcp://0.0.0.0:7651
ts=2017-01-23T03:27:02.726118947Z level=info durable=tcp://0.0.0.0:7652
ts=2017-01-23T03:27:02.726235426Z level=info bulk=tcp://0.0.0.0:7653
ts=2017-01-23T03:27:02.72634867Z level=info API=tcp://0.0.0.0:7650
ts=2017-01-23T03:27:02.726719155Z level=info ingest_path=data/ingest
ts=2017-01-23T03:27:02.72689813Z level=info store_path=data/store
ts=2017-01-23T03:27:02.73028649Z level=debug component=cluster Join=1
ts=2017-01-23T03:27:02.83197139Z level=warn component=Consumer state=gather replication_factor=2 available_peers=1 err="replication currently impossible"
ts=2017-01-23T03:27:12.731418765Z level=warn component=cluster NumMembers=1
ts=2017-01-23T03:27:12.8318313Z level=warn component=Consumer state=gather replication_factor=2 available_peers=1 err="replication currently impossible"
ts=2017-01-23T03:27:13.832421946Z level=warn component=Consumer state=gather replication_factor=2 available_peers=1 err="replication currently impossible"

--

droi@bar:~$ ./oklog-0.1.2-linux-amd64 ingeststore -peer 10.128.81.201 -peer 10.128.81.101 -peer 10.128.81.55
ts=2017-01-23T03:27:03.039458148Z level=info cluster=0.0.0.0:7659
ts=2017-01-23T03:27:03.039743636Z level=info fast=tcp://0.0.0.0:7651
ts=2017-01-23T03:27:03.039908272Z level=info durable=tcp://0.0.0.0:7652
ts=2017-01-23T03:27:03.040079728Z level=info bulk=tcp://0.0.0.0:7653
ts=2017-01-23T03:27:03.040222073Z level=info API=tcp://0.0.0.0:7650
ts=2017-01-23T03:27:03.040747816Z level=info ingest_path=data/ingest
ts=2017-01-23T03:27:03.041030583Z level=info store_path=data/store
ts=2017-01-23T03:27:03.047122842Z level=debug component=cluster Join=2
ts=2017-01-23T03:27:10.1484933Z level=warn component=Consumer state=gather replication_factor=2 available_peers=1 err="replication currently impossible"
ts=2017-01-23T03:27:11.149158942Z level=warn component=Consumer state=gather replication_factor=2 available_peers=1 err="replication currently impossible"
ts=2017-01-23T03:27:12.149831836Z level=warn component=Consumer state=gather replication_factor=2 available_peers=1 err="replication currently impossible"
ts=2017-01-23T03:27:13.048114673Z level=warn component=cluster NumMembers=1
ts=2017-01-23T03:27:13.24837349Z level=warn component=Consumer state=gather replication_factor=2 available_peers=1 err="replication currently impossible"
ts=2017-01-23T03:27:14.248878347Z level=warn component=Consumer state=gather replication_factor=2 available_peers=1 err="replication currently impossible"

--

droi@baz:~$ ./oklog-0.1.2-linux-amd64 ingeststore -peer 10.128.81.55 -peer 10.128.81.101 peer 10.128.81.201
ts=2017-01-23T03:27:03.545319117Z level=info cluster=0.0.0.0:7659
ts=2017-01-23T03:27:03.54560008Z level=info fast=tcp://0.0.0.0:7651
ts=2017-01-23T03:27:03.545696295Z level=info durable=tcp://0.0.0.0:7652
ts=2017-01-23T03:27:03.545807221Z level=info bulk=tcp://0.0.0.0:7653
ts=2017-01-23T03:27:03.545925404Z level=info API=tcp://0.0.0.0:7650
ts=2017-01-23T03:27:03.546407185Z level=info ingest_path=data/ingest
ts=2017-01-23T03:27:03.546595281Z level=info store_path=data/store
ts=2017-01-23T03:27:03.55011206Z level=debug component=cluster Join=2
ts=2017-01-23T03:27:13.550546117Z level=warn component=cluster NumMembers=1
ts=2017-01-23T03:27:13.650812293Z level=warn component=Consumer state=gather replication_factor=2 available_peers=1 err="replication currently impossible"
ts=2017-01-23T03:27:14.651879566Z level=warn component=Consumer state=gather replication_factor=2 available_peers=1 err="replication currently impossible"
ts=2017-01-23T03:27:15.652473985Z level=warn component=Consumer state=gather replication_factor=2 available_peers=1 err="replication currently impossible"
ts=2017-01-23T03:27:16.750697194Z level=warn component=Consumer state=gather replication_factor=2 available_peers=1 err="replication currently impossible"
ts=2017-01-23T03:27:17.751224377Z level=warn component=Consumer state=gather replication_factor=2 available_peers=1 err="replication currently impossible"
ts=2017-01-23T03:27:18.550751229Z level=warn component=cluster NumMembers=1

store: implement segment compression

We did a spike trying out LZ4 compression of segment files, which didn't exactly work out. We should try that again: compressing segment files on disk. The main goal is to increase retention periods, but we may be able to speed up queries, too, if we compress to clients.

gRPC support

Would it make sense to use gRPC instead of POSTing data?

store: implement streaming queries

There are lots of use cases that are well-served by streaming queries. That is, set up a query that delivers results as they arrive at each store node. At first thought, this would look like:

  • User makes a stream-type query to any store node
  • The store node broadcasts that query to all store nodes
  • Each store node registers the query in some kind of table as a websocket
  • When new segments are replicated, records are matched against each registered streaming query
  • Matching records are written to the websocket
  • The originating store node uses some kind of ringbuffer to deduplicate records over a time window
  • Every effort should be made to keep the websocket connections alive: reconnect logic, etc.

ingest: implement bulk mode

The ingester can take an entire segment in a single write or series of writes, somehow. It can block until that segment has been successfully replicated to store nodes, and only then accept the next segment from the forwarder. This would provide the "most durable" mode of ingestion.

There are some tricky things here. One, plumbing through the necessary infrastructure to communicate to the forwarder that the segment has been successfully replicated. Two, deciding when and how to assign ULIDs. Three, thinking through the failure modes carefully.

Robust user input verification

Let's sit down and have a think about what protecting the system at its borders would look like.

  • Exhaustive input validation on user-facing store APIs, e.g. dates in range
  • Data validation at ingestion (though I'm not sure what this would mean)
  • Others?

Javascript error in UI on Firefox (TypeError on response.body at ports.js:68)

The actual error is "TypeError: response.body is undefined"

This is on FF 56.0 on MacOS 10.12.6. I'm not familiar enough with Elm to know what the type of response in this function is supposed to be and if the lack of a body field is a normal occurrence, a browser-compatibility issue, or a framework bug. Since it works on the same machine under Chrome, I'm leaning toward one of the latter two.

I'll try to dig deeper later.

PS - really like the project!

store: implement repair process

The repair process should walk the complete data set (the complete timespace) and essentially perform read repair. When it completes, we should have guaranteed that all records are at the desired replication factor. (There may be smarter ways to do this.)

Update README with build instructions

I am a newbie to golang. It would be great if you added build instructions to the README.
When I try to clone the repo and build, dep ensure gives me the following error:

split absolute project root: /home/user/projects/oklog/oklog not in any $GOPATH

My GOPATH is /home/user/projects/oklog/ and project root is in /home/user/projects/oklog/oklog
I am sure I am doing something wrong here.
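
For reference, the conventional GOPATH layout for a project like this would look roughly like the sketch below (assuming the standard Go toolchain of the time; the repo appears to vendor its dependencies, judging by the vendor/ paths in the stack traces elsewhere on this page, so dep ensure should only be needed when changing them):

$ export GOPATH=$HOME/go
$ mkdir -p $GOPATH/src/github.com/oklog
$ cd $GOPATH/src/github.com/oklog
$ git clone https://github.com/oklog/oklog.git
$ cd oklog
$ go install ./cmd/oklog   # binary ends up in $GOPATH/bin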

data/ingest/LOCK already exists; another process is running, or the file is stale

Hi,

Just testing oklog in the wild, or in a simple Alpine container to be specific, and I have come across this:

ts=2017-02-06T02:31:56.463440662Z level=info cluster=0.0.0.0:7659
ts=2017-02-06T02:31:56.463619672Z level=info fast=tcp://0.0.0.0:7651
ts=2017-02-06T02:31:56.463676751Z level=info durable=tcp://0.0.0.0:7652
ts=2017-02-06T02:31:56.463725543Z level=info bulk=tcp://0.0.0.0:7653
ts=2017-02-06T02:31:56.463822398Z level=info API=tcp://0.0.0.0:7650
data/ingest/LOCK already exists; another process is running, or the file is stale

I believe this was a result of having to restart Docker, but I can't be 100% sure (not helpful, I know). oklog is running as pid 1, the nobody user runs oklog, and oklog is in ingeststore mode (segment-replication-factor 1). Because the container was continually restarting, I was not able to enter the container to find out more information (I am debating whether to use something like s6, supervisord, or dumb-init to possibly get around this).

If I understand the error correctly, oklog is erroring because both an ingest file and LOCK exist within /data/ingest.

return nil, errors.Errorf("%s already exists; another process is running, or the file is stale", lock)

My question is how does oklog remove the LOCK when the interrupt is issued?

signal.Notify(c, syscall.SIGINT, syscall.SIGTERM)

return interrupt(cancel)

if err := run(os.Args[2:]); err != nil {

Is this done somehow in pkg/fs? If so, I can't see how this links through to the signal handling in main.

I have not been able to replicate this issue again so far, but I will keep trying.

Any thoughts welcome.

Tail-like behaviour

I think it may be useful if oklog had a way to get the last few lines from the logs without specifying a from/to datetime. Can I get this functionality with the current version, or is it impossible right now?
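
A rough workaround with the current tooling (a sketch only, not a built-in feature) might be to query a short recent window and pipe it through tail, or to use the oklog stream command mentioned in another issue on this page for follow-style output; myservice and store1 are placeholders:

$ oklog query -from 5m -q myservice | tail -n 20
$ oklog stream -store store1 -q myservice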

store: don't poll ingesters, rather subscribe

Right now the stores are polling every 100ms or so to consume segments from ingesters. This is of course inefficient. It would be better if they attached some kind of subscription to each ingester, and did some kind of select over all of them, biasing randomly or via whatever selection criteria, to consume the next segment. But this is pretty low-priority work, all considered: this inefficiency isn't anywhere close to being the bottleneck of the system.
