
heroic's Introduction

DEPRECATION NOTICE

This repo is no longer actively maintained. While it should continue to work and there are no major known bugs, we will not be improving Heroic or releasing new versions.

Heroic


A scalable time series database based on Bigtable, Cassandra, and Elasticsearch. Go to https://spotify.github.io/heroic/ for documentation.

This project adheres to the Open Code of Conduct. By participating, you are expected to honor this code.

Install

Docker

Docker images are available on Docker Hub.

$ docker run -p 8080:8080 -p 9091:9091 spotify/heroic

Heroic will now be reachable at http://localhost:8080/status.

In production it's advised to use a tagged version.

Configuration

For help on how to write a configuration file, see the Configuration Section of the official documentation.

Heroic has been tested with the following services:

Developing

Building from source

In order to compile Heroic, you'll need:

  • A Java 11 JDK
  • Maven 3
  • Gradle

The project is built using Gradle:

# full build, runs all tests and builds the shaded jar
./gradlew build

# only compile
./gradlew assemble

# build a single module
./gradlew heroic-metric-bigtable:build

The heroic-dist module can be used to produce a shaded jar that contains all required dependencies:

./gradlew heroic-dist:shadowJar

After building, the entry point of the service is com.spotify.heroic.HeroicService. The following is an example of how this can be run:

./gradlew heroic-dist:runShadow <config>

which is the equivalent of doing:

java -jar $PWD/heroic-dist/build/libs/heroic-dist-0.0.1-SNAPSHOT-shaded.jar <config>

Building with Docker

$ docker build -t heroic:latest .

This is a multi-stage build and will first build Heroic via a ./gradlew clean build and then copy the resulting shaded jar into the runtime container.

Heroic can be run via Docker:

$ docker run -d -p 8080:8080 -p 9091:9091 -v /path/to/config.yml:/heroic.yml spotify/heroic:latest

Logging

Logging is captured using SLF4J and forwarded to Log4j.

To configure logging, define the -Dlog4j.configurationFile=<path> parameter. You can use docs/log4j2-file.xml as a base.

Testing

We run tests with Gradle:

# run unit tests
./gradlew test

# run integration tests
./gradlew integrationTest

or to run a more comprehensive set of checks:

./gradlew check

This will run:

It is strongly recommended that you run the full test suite before submitting a pull request; otherwise it will be rejected by Travis.

Full Cluster Tests

Full cluster tests are defined in heroic-dist/src/test/java.

This way, they have access to all the modules and parts of Heroic.

The JVM RPC module is specifically designed to allow for rapid execution of integration tests. It allows multiple cores to be defined and communicate with each other in the same JVM instance.

Code Coverage


There's an ongoing project to improve test coverage. Visit the project's page on codecov.io to find areas to focus on.

Bypassing Validation

To bypass automatic formatting and checkstyle validation you can use the following stanza:

// @formatter:off
final List<String> list = ImmutableList.of(
   "Welcome to...",
   "... The Wild West"
);
// @formatter:on

To bypass a FindBugs error, use the @SuppressFBWarnings annotation:

@SuppressFBWarnings(value = "FINDBUGS_ERROR_CODE", justification = "I Know Better Than FindBugs")
public class IKnowBetterThanFindbugs {
    // ...
}

Module Orientation

The Heroic project is split into a couple of modules.

The most critical one is heroic-component. It contains interfaces, value objects, and the basic set of dependencies necessary to glue different components together.

Submodules include metric, suggest, metadata, and aggregation. The first three contain various implementations of the given backend type, while the latter provides aggregation methods.

heroic-core contains the com.spotify.heroic.HeroicCore class which is the central building block for setting up a Heroic instance.

heroic-elasticsearch-utils is a collection of utilities for interacting with Elasticsearch. This is a separate module since more than one backend needs to talk to Elasticsearch.

Finally there is heroic-dist, a small project that depends on all of the modules. This is where everything is bound together into a distribution: a shaded jar. It also provides the service entry point, com.spotify.heroic.HeroicService, and an interactive shell, com.spotify.heroic.HeroicShell. The shell can either be run standalone or connected to an existing Heroic instance for administration.

Contributing

Guidelines for contributing can be found here.

heroic's People

Contributors

0xflotus, adsail, ao2017, dependabot[bot], dimaslv, dmichel1, gabrielgerhardsson, hexedpackets, jcabmora, jo-ri, jsferrei, juruen, kant, lmuhlha, malish8632, mattnworb, mbrukman, memory, mykolasmith, nabam, parmus, rochesterinnyc, samfadrigalan, sjoeboo, sming, spamaps, tariq1890, udoprog, zalenski, zfrank


heroic's Issues

Clarification on Series

Hello,
I am trying to understand how to organize our metrics, and I was wondering if you can provide a little bit more info in the Docs about series. Specifically, what implications exist in terms of querying the data ( I am using the grafana heroic datasource plugin).
Also, when playing with the API I noticed an odd behavior:

First, add a series:

curl -s -X PUT -H "Content-Type: application/json" http://localhost:8080/metadata/series -d '{"key":"3175c2d7-3e93-4e30-9ca3-3bdf3b17baeb", "tags":{"meter":"cpu_util"}}' | python -m json.tool
{
    "errors": [],
    "times": [
        5920
    ]
}

Second, verify the series

curl -s -X POST -H "Content-Type: application/json" http://localhost:8080/metadata/series -d '{"key":"3175c2d7-3e93-4e30-9ca3-3bdf3b17baeb"}' | python -m json.tool
{
    "errors": [],
    "limited": false,
    "series": [
        {
            "key": "3175c2d7-3e93-4e30-9ca3-3bdf3b17baeb",
            "tags": {
                "meter": "cpu_util"
            }
        }
    ]
}

Third, delete the series, note deleted = 0

curl -s -X DELETE -H "Content-Type: application/json" http://localhost:8080/metadata/series -d '{"filter":["and", ["key","3175c2d7-3e93-4e30-9ca3-3bdf3b17baeb"],["=","meter","cpu_util"]]}' | python -m json.tool
{
    "deleted": 0,
    "errors": [],
    "failed": 0
}

Fourth, even though "deleted":0, it was actually deleted, which is evident if you query again

curl -s -X POST -H "Content-Type: application/json" http://localhost:8080/metadata/series -d '{"key":"3175c2d7-3e93-4e30-9ca3-3bdf3b17baeb"}' | python -m json.tool
{
    "errors": [],
    "limited": false,
    "series": []
}

However, you can't add the series back again:

curl -s -X PUT -H "Content-Type: application/json" http://localhost:8080/metadata/series -d '{"key":"3175c2d7-3e93-4e30-9ca3-3bdf3b17baeb", "tags":{"meter":"cpu_util"}}' | python -m json.tool
{
    "errors": [],
    "times": []
}
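One plausible explanation for the empty "times" list on re-add (an assumption on my part, not confirmed behavior) is a write-side cache that remembers recently written series and suppresses duplicate metadata writes; deleting the document would not evict the cache entry, so the immediate re-add is silently skipped. A toy sketch of that mechanism, with hypothetical names:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class WriteCacheSketch {
    // Hypothetical write cache: the first write of a series id reports a
    // timing, a repeated write is treated as a no-op and reports nothing,
    // which would produce an empty "times": [] in the API response.
    private final Set<String> recentlyWritten = new HashSet<>();

    List<Long> write(String seriesId) {
        if (!recentlyWritten.add(seriesId)) {
            return List.of();      // duplicate write suppressed
        }
        return List.of(5920L);     // first write: report elapsed time
    }
}
```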

How to select metrics?

Hi,
I'm sure this is a noob question but how do I select metrics from heroic?
I'm interested in doing this using grafana and also the heroicsh.

Any input is appreciated.

Issue with Expression parsing

Hi guys,

I am using the metadata-fetch task from HeroicShell. I ran into some inconsistencies in how the filters are parsed that I believe are caused by how the HeroicQuery grammar is defined.
As an example, suppose these two keys exist: 4cf41d4b-a5d2-420e-b008-644f0b3d7832 and c89bf1ee-32ad-44c9-9641-5ee7fe3cac17. If I run this command I can successfully get the series related to key c89bf1ee-32ad-44c9-9641-5ee7fe3cac17:
metadata-fetch $key = c89bf1ee-32ad-44c9-9641-5ee7fe3cac17
However, if I run the same for the other key, I get a com.spotify.heroic.grammar.ParseException:
metadata-fetch $key = 4cf41d4b-a5d2-420e-b008-644f0b3d7832
I found that you have to escape the whole expression to be able to make it work:
metadata-fetch '$key = "4cf41d4b-a5d2-420e-b008-644f0b3d7832"'

This problem only happens with expressions that begin with a numeric character. I believe the problem occurs when this Rule is applied: expr is matched first as an ExpressionInteger, and then, instead of finding EOF, the parser finds the rest of the string. When quotes are applied it correctly matches QuotedString. I know this can be worked around, but having a consistent syntax would probably be good.
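A toy tokenizer (not Heroic's actual grammar; the class and method names here are made up for illustration) shows the ambiguity described above: an unquoted token that starts with digits is greedily consumed as an integer, and the parser then chokes on the remaining characters, while quoting forces the string interpretation:

```java
public class LeadingDigitParse {
    // Toy classifier sketching the ambiguity: digits are matched first as
    // an integer literal; if non-digit characters follow, the "parser"
    // fails instead of backtracking to a bare-string rule.
    static String classify(String token) {
        int i = 0;
        while (i < token.length() && Character.isDigit(token.charAt(i))) i++;
        if (i == 0) return "string:" + token;                 // starts with a letter or quote
        if (i == token.length()) return "integer:" + token;   // all digits
        return "parse-error after integer '" + token.substring(0, i) + "'";
    }

    public static void main(String[] args) {
        System.out.println(classify("c89bf1ee"));     // treated as a string
        System.out.println(classify("4cf41d4b"));     // fails after consuming "4"
        System.out.println(classify("\"4cf41d4b\"")); // quoting avoids the integer rule
    }
}
```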

Regards,

Jorge

Tag Equals and StartsLike fails to return results

If you could point me to where I have gone wrong or where in the code a mistake may have been made, it would be appreciated.

I have done the following:

  • curl -s -H "Content-Type: application/json" http://localhost:8080/write -d '{"series": {"key": "foo", "tags": {"host": "www.london.com","site":"lon"}}, "data": {"type": "points", "data": [[1463605960000, 46.0], [1463605970000, 69.0]]}}'
  • curl -s -H "Content-Type: application/json" http://localhost:8080/write -d '{"series": {"key": "foo", "tags": {"host": "www.petaluma1.com","site":"pet"}}, "data": {"type": "points", "data": [[1463605960000, 46.0], [1463605970000, 69.0]]}}'
  • curl -s -H "Content-Type: application/json" http://localhost:8080/write -d '{"series": {"key": "foo", "tags": {"host": "www.petaluma2.com","site":"pet"}}, "data": {"type": "points", "data": [[1463605960000, 46.0], [1463605970000, 69.0]]}}'

Successfully retrieved results:

Failed to retrieve results (equals):

  • curl -H "Content-Type: application/json" http://localhost:8080/query/metrics -d '{"range": {"type": "relative", "unit": "HOURS", "value": 3}, "filter": ["and",["key", "foo"],["=","site", "pet"]]}'

Failed to retrieve results (starts like):

  • curl -H "Content-Type: application/json" http://localhost:8080/query/metrics -d '{"range": {"type": "relative", "unit": "HOURS", "value": 3}, "filter": ["and",["key", "foo"],["^","site", "p"]]}'

RFC: Flatten DSL

This RFC suggests to introduce the two following syntactic sugars in aggregation DSLs, the by and the | (pipe) keywords.

by expressions take the form <aggregation> by <tags>, and are equivalent to the current group(<tags>, <aggregation>) syntax.

| (pipe) expressions take the form <aggregation> | <aggregation> | .., and are equivalent to the current chain(<aggregation>, <aggregation>, ..).

It is also possible to group expressions using parentheses to override priority, like the following:

(average() by host | sum()) by role

This would read as something like: calculate the sum of all per-host averages, for each role.

The rationale behind the proposal is to flatten the need for overly nested expressions, and to allow for aggregations to read from left to right more naturally what their intent is.

Take the following example:

chain(group([host], average()), group([site], sum()))

This could be written as the following:

average() by host | sum() by site

This would read as: average by host, then sum by site.
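The proposed desugaring is mechanical, so it can be sketched with a toy rewriter (my own sketch, not part of the RFC branch) that turns the flat syntax back into the current group/chain form:

```java
import java.util.Arrays;
import java.util.stream.Collectors;

public class FlattenDsl {
    // Hypothetical desugaring: "<agg> by <tag>" -> "group([<tag>], <agg>)",
    // and "a | b | ..." -> "chain(a, b, ...)". This is a string-level toy,
    // not a real parser: no precedence, no parentheses.
    static String desugar(String expr) {
        String chained = Arrays.stream(expr.split("\\|"))
            .map(String::trim)
            .map(FlattenDsl::desugarBy)
            .collect(Collectors.joining(", "));
        return expr.contains("|") ? "chain(" + chained + ")" : chained;
    }

    static String desugarBy(String part) {
        String[] halves = part.split("\\s+by\\s+");
        return halves.length == 2
            ? "group([" + halves[1] + "], " + halves[0] + ")"
            : part;
    }

    public static void main(String[] args) {
        // -> chain(group([host], average()), group([site], sum()))
        System.out.println(desugar("average() by host | sum() by site"));
    }
}
```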

alternative to pipe

Pipe has very strong connotations, especially when considering shell pipes. It might be inadvisable to use such a well-established keyword. I'm open for proposals of alternative operators or syntax.

An alternative proposal is to simply introduce then, however this can get a bit verbose:

average() by host then sum() by site

Work in Progress

https://github.com/udoprog/heroic/tree/rfc-flatten-dsl

Metrics not written to Metadata backend

Hello,

We ran into a problem where several metrics are not written to the ElasticSearch metadata backend.

We have an OpenStack backend that collects metrics from several managed resources. This causes a few hundred messages to be written to Kafka almost at the same time (for our testing we are only sending one datapoint per series to make it easier to troubleshoot). The messages are consumed and written to Cassandra (the message and record counts match on both sides), however the number of docs created in the metadata index in Elasticsearch does not match (verified using the _cat/indices Elasticsearch API). This causes many metrics to not be available for retrieval using the query APIs. Interestingly, if we call the query/metrics API, the number of series returned matches the number of docs in the metadata index, so we know there is definitely a problem there.

There are no exceptions thrown by Heroic. I added a few log statements to the MetadataBackendKV.write() method (https://github.com/spotify/heroic/blob/master/metadata/elasticsearch/src/main/java/com/spotify/heroic/metadata/elasticsearch/MetadataBackendKV.java#L1710) and I can confirm that all the series are processed by this method. I increased the log level of the ElasticSearch client classes and no errors were found.

If we take those same messages and feed them to kafka by throttling the message producer (at lets say 10 per second), then all the series are written to the metadata index.

It all seems that ElasticSearch is not able to process the requests and just drops them. I saw in the docs that there are options flushInterval and bulkActions to configure the elasticsearch client, but by looking at the code I believe those are currently not implemented (https://github.com/spotify/heroic/blob/master/heroic-elasticsearch-utils/src/main/java/com/spotify/heroic/elasticsearch/ConnectionModule.java#L59) (please correct me if I am wrong)

After this long description, my questions are: Have you experienced a similar issue? If so, do you have any suggestions on how to avoid this problem? I am really not familiar with ElasticSearch, so if the answer to these questions requires ElasticSearch knowledge, I apologize in advance. The other question I have: is there a way to rebuild the ElasticSearch indices from Heroic?

Thanks!

Jorge

Elasticsearch error: "document already exists"

Hi,

I'm seeing this error quite often being returned from the Heroic HTTP API. Example:

{"message":"[heroic-1464825600000][1] [series][b5a903eb9726102bf3fb620392d26e00]: document already exists","reason":"Internal Server Error","status":500,"type":"internal-error"}

If I retry the request, it usually returns a 200 straight away. I am wondering if firing concurrent POST requests at the same time (for new metrics) is sending duplicate metadata creation requests to the Elasticsearch API?
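If the duplicate-create hypothesis is right, one common mitigation (a sketch of the general idea, not Heroic's actual code) is to treat a create conflict as success when the conflicting document body is identical, since metadata documents are derived deterministically from the series:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class IdempotentCreate {
    // Sketch: when two concurrent writers race to create the same metadata
    // document, the loser of putIfAbsent sees an existing document. If the
    // existing body equals what we tried to write, the write is effectively
    // done and should not surface as a 500.
    static boolean createOrIgnore(ConcurrentMap<String, String> index,
                                  String id, String doc) {
        String prev = index.putIfAbsent(id, doc);
        return prev == null || prev.equals(doc); // identical conflict is fine
    }

    public static void main(String[] args) {
        ConcurrentMap<String, String> index = new ConcurrentHashMap<>();
        boolean first = createOrIgnore(index, "b5a903eb", "{\"key\":\"foo\"}");
        boolean second = createOrIgnore(index, "b5a903eb", "{\"key\":\"foo\"}");
        System.out.println(first + " " + second); // both report success
    }
}
```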

Heroic throws NullPointerException if it cannot find ElasticSearch index

Here is an example of the stack trace printed by Heroic when trying to access a non-existing ES index in the suggest code path:

2016-03-03 09:12:48,387 ERROR c.s.h.r.n.NativeRpcServerSession [elasticsearch[SomeNode][transport_client_worker][T#8]{New I/O worker #17}] [id: 0x3cb82c63, /0.0.0.0:53840 => /0.0.0.0:1394]: request failed java.lang.Exception: 1 exception(s) caught: error in transform
at eu.toolchain.async.TinyThrowableUtils.buildCollectedException(TinyThrowableUtils.java:23)
at eu.toolchain.async.helper.CollectHelper.done(CollectHelper.java:142)
at eu.toolchain.async.helper.CollectHelper.add(CollectHelper.java:128)
at eu.toolchain.async.helper.CollectHelper.failed(CollectHelper.java:79)
at eu.toolchain.async.DirectAsyncCaller.fail(DirectAsyncCaller.java:19)
at eu.toolchain.async.concurrent.ConcurrentResolvableFuture$2.run(ConcurrentResolvableFuture.java:212)
at eu.toolchain.async.concurrent.ConcurrentResolvableFuture.run(ConcurrentResolvableFuture.java:439)
at eu.toolchain.async.concurrent.ConcurrentResolvableFuture.fail(ConcurrentResolvableFuture.java:106)
at eu.toolchain.async.helper.ResolvedTransformHelper.resolved(ResolvedTransformHelper.java:26)
at eu.toolchain.async.DirectAsyncCaller.resolve(DirectAsyncCaller.java:10)
at eu.toolchain.async.concurrent.ConcurrentResolvableFuture$2.run(ConcurrentResolvableFuture.java:221)
at eu.toolchain.async.concurrent.ConcurrentResolvableFuture.run(ConcurrentResolvableFuture.java:439)
at eu.toolchain.async.concurrent.ConcurrentResolvableFuture.resolve(ConcurrentResolvableFuture.java:97)
at com.spotify.heroic.elasticsearch.AbstractElasticsearchBackend$1.onResponse(AbstractElasticsearchBackend.java:43)
at org.elasticsearch.action.support.AbstractListenableActionFuture.executeListener(AbstractListenableActionFuture.java:120)
at org.elasticsearch.action.support.AbstractListenableActionFuture.done(AbstractListenableActionFuture.java:97)
at org.elasticsearch.common.util.concurrent.BaseFuture.set(BaseFuture.java:166)
at org.elasticsearch.action.support.AdapterActionFuture.onResponse(AdapterActionFuture.java:96)
at org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onResponse(TransportClientNodesService.java:234)
at org.elasticsearch.action.TransportActionNodeProxy$1.handleResponse(TransportActionNodeProxy.java:73)
at org.elasticsearch.action.TransportActionNodeProxy$1.handleResponse(TransportActionNodeProxy.java:57)
at org.elasticsearch.transport.netty.MessageChannelHandler.handleResponse(MessageChannelHandler.java:163)
at org.elasticsearch.transport.netty.MessageChannelHandler.messageReceived(MessageChannelHandler.java:132)
at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:296)
at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:462)
at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:443)
at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:303)
at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:268)
at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:255)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Suppressed: eu.toolchain.async.TransformException: error in transform
    ... 37 more
Caused by: java.lang.NullPointerException
    at com.spotify.heroic.suggest.elasticsearch.SuggestBackendKV.lambda$null$3(SuggestBackendKV.java:294)
    at eu.toolchain.async.helper.ResolvedTransformHelper.resolved(ResolvedTransformHelper.java:24)
    ... 36 more

Problems with query/metrics API

We have been testing the query/metrics API and we had some confusing results. The biggest problem we have found is that the end of a range changes based on the length of the range. We believe this inconsistency can affect usability, ease of consumption of the APIs by frontends, and even unit testing.

Here's the procedure that we have followed to put this in evidence:

  1. make a call to the query/metrics API with a relative range of 1 second and extract the latest time in milliseconds. We are aware this is an approximation and the response latency between this call and the next call make it inaccurate. At this point we just want to get a lower bound of the heroic server date since we are going to be using calls with "relative" ranges.

  2. immediately make a call to the /write API to push two data points, one with the time obtained in the previous step and another data point one second earlier

  3. make several calls to the query/metrics API using an absolute range of ± 1500 milliseconds, and relative ranges of 5 seconds, 1 day and 3 days. We have noticed in the responses to these calls:

  • for the absolute range, the returned range in the response does not correspond to the requested range
  • the 5 second range includes the two added datapoints, and the start and end are usually within reasonable values (considering the latencies of the requests/responses)
  • the 1 day range is shifted, usually to a timestamp earlier than the end of the previous response. Sometimes the rows added in step 2 are returned, sometimes they are not.
  • the 3 day range is shifted even more, so the recently added rows are not returned.

We believe the reason for this is the logic implemented in the buildShiftedRange() method. That logic makes sense in many situations; a perfect example is when you want to aggregate data in time slots, where the data only makes sense if all slots are the same size. But when you do not need to aggregate based on time and just want to return data points, or your aggregation is based on, say, device type, the range-shifting logic causes unexpected results to be returned. To test this, we used this bash script (you'll need jq installed on your machine to be able to test this. Also, all calls are made to localhost:8080, so you might need to update accordingly).
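The shifting described above is consistent with cadence alignment, where both endpoints of the range are snapped to multiples of the aggregation slot size. A minimal sketch of that arithmetic (an assumption about what buildShiftedRange() does, not the actual implementation):

```java
public class RangeAlign {
    // Sketch of cadence alignment: floor both endpoints of a range to
    // multiples of `size`. With a large slot size, a short recent range
    // collapses to an earlier boundary, so the newest points fall outside.
    static long[] align(long start, long end, long size) {
        long alignedStart = (start / size) * size; // floor to slot boundary
        long alignedEnd = (end / size) * size;     // floor to slot boundary
        return new long[] { alignedStart, alignedEnd };
    }

    public static void main(String[] args) {
        long day = 86_400_000L; // 1-day slot, in milliseconds
        long[] r = align(1_463_605_960_000L, 1_463_605_970_000L, day);
        // Both endpoints of a 10-second range snap back to the same day
        // boundary, matching the observed "shifted" responses.
        System.out.println(r[0] + " " + r[1]);
    }
}
```

With a 1-day slot, both endpoints of a 10-second range collapse to the same day boundary, which matches the observation that recently added points can fall outside the returned range.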

StackOverflowError when running write-performance

I am getting StackOverflowErrors when running the write-performance task.

Steps to reproduce:

  • Run Heroic with ShellServer enabled
port: 8080

shellServer:
    host: 192.168.42.2
    port: 9190

cluster:
  protocols:
    - type: nativerpc
      host: localhost
      port: 1394

  discovery:
    type: static
    nodes:
      - "nativerpc://127.0.0.1:1394"
...(other settings for suggest, metadata, metrics and consumer follow)

Running with Cassandra 2.1.13, ElasticSearch 2.2.1 (based on PR 62), ZooKeeper 3.4.6 and Kafka 0.9.0.1

  • On another host, connect to the ShellServer:
java -cp "../target/*" com.spotify.heroic.HeroicShell --connect 192.168.42.2:9190
  • set the timeout to something greater than default(10 seconds):
heroic> timeout 60
Timeout updated to 60 seconds
  • Run the write-performance task
heroic> write-performance --from=heroic --series=10 --target=100 --writes=200
Warmup step 1/4
..................................Command timed out (current timeout = 60s)

The following is the output captured in the heroic logs:

2016-08-04T22:52:05.0969: 22:52:05.092 [heroic-core-2] ERROR com.spotify.heroic.HeroicCore - Unhandled exception caught in core executor
2016-08-04T22:52:05.0971: java.lang.StackOverflowError
2016-08-04T22:52:05.0972: #011at java.util.Spliterators$IteratorSpliterator.<init>(Spliterators.java:1710)
2016-08-04T22:52:05.0973: #011at java.util.Spliterators.spliterator(Spliterators.java:420)
2016-08-04T22:52:05.0974: #011at java.util.Set.spliterator(Set.java:411)
2016-08-04T22:52:05.0976: #011at java.util.Collection.stream(Collection.java:581)
2016-08-04T22:52:05.0977: #011at com.spotify.heroic.common.SelectedGroup.stream(SelectedGroup.java:45)
2016-08-04T22:52:05.0978: #011at com.spotify.heroic.metric.LocalMetricManager$Group.map(LocalMetricManager.java:356)
2016-08-04T22:52:05.0979: #011at com.spotify.heroic.metric.LocalMetricManager$Group.write(LocalMetricManager.java:268)
2016-08-04T22:52:05.0980: #011at com.spotify.heroic.shell.task.WritePerformance.lambda$buildWrites$5(WritePerformance.java:249)
2016-08-04T22:52:05.0981: #011at eu.toolchain.async.DelayedCollectCoordinator.setupNext(DelayedCollectCoordinator.java:124)
2016-08-04T22:52:05.1024: #011at eu.toolchain.async.DelayedCollectCoordinator.checkNext(DelayedCollectCoordinator.java:115)
2016-08-04T22:52:05.1026: #011at eu.toolchain.async.DelayedCollectCoordinator.resolved(DelayedCollectCoordinator.java:60)
2016-08-04T22:52:05.1027: #011at eu.toolchain.async.DirectAsyncCaller.resolve(DirectAsyncCaller.java:10)
2016-08-04T22:52:05.1028: #011at eu.toolchain.async.immediate.ImmediateResolvedAsyncFuture.onDone(ImmediateResolvedAsyncFuture.java:59)
2016-08-04T22:52:05.1029: #011at eu.toolchain.async.DelayedCollectCoordinator.setupNext(DelayedCollectCoordinator.java:130)
2016-08-04T22:52:05.1031: #011at eu.toolchain.async.DelayedCollectCoordinator.checkNext(DelayedCollectCoordinator.java:115)
2016-08-04T22:52:05.1032: #011at eu.toolchain.async.DelayedCollectCoordinator.resolved(DelayedCollectCoordinator.java:60)
2016-08-04T22:52:05.1033: #011at eu.toolchain.async.DirectAsyncCaller.resolve(DirectAsyncCaller.java:10)
2016-08-04T22:52:05.1035: #011at eu.toolchain.async.immediate.ImmediateResolvedAsyncFuture.onDone(ImmediateResolvedAsyncFuture.java:59)
2016-08-04T22:52:05.1036: #011at eu.toolchain.async.DelayedCollectCoordinator.setupNext(DelayedCollectCoordinator.java:130)
2016-08-04T22:52:05.1037: #011at eu.toolchain.async.DelayedCollectCoordinator.checkNext(DelayedCollectCoordinator.java:115)
2016-08-04T22:52:05.1038: #011at eu.toolchain.async.DelayedCollectCoordinator.resolved(DelayedCollectCoordinator.java:60)
2016-08-04T22:52:05.1039: #011at eu.toolchain.async.DirectAsyncCaller.resolve(DirectAsyncCaller.java:10)

Heroic Shell status Code

When the Heroic Shell ends with an error, an incorrect exit code of zero is returned. This makes it inconvenient to use in shell scripts to automate tasks. In the following example you can see that the exit code was 0 even though the previous command did not succeed.

13:33:56 heroic $ java -cp "target/*" com.spotify.heroic.HeroicShell --connect 172.16.0.13:9190
13:34:32.145 [main] INFO  com.spotify.heroic.HeroicShell - Setting up interactive shell...
13:34:32.154 [main] ERROR com.spotify.heroic.HeroicShell - Error when running shell
java.net.ConnectException: Connection refused
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:589)
    at java.net.Socket.connect(Socket.java:538)
    at com.spotify.heroic.shell.RemoteCoreInterface.connect(RemoteCoreInterface.java:285)
    at com.spotify.heroic.shell.RemoteCoreInterface.commands(RemoteCoreInterface.java:254)
    at com.spotify.heroic.HeroicShell.runInteractiveShell(HeroicShell.java:197)
    at com.spotify.heroic.HeroicShell.interactive(HeroicShell.java:182)
    at com.spotify.heroic.HeroicShell.main(HeroicShell.java:116)
13:34:32.155 [main] INFO  com.spotify.heroic.HeroicShell - Closing core bridge...
13:34:32 heroic  $ echo $?
0
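The fix amounts to propagating the failure out of main. A minimal sketch (an assumption about how HeroicShell could behave, not the actual code) of returning a non-zero exit code when the shell fails:

```java
public class ShellExit {
    // Run the shell body and map its outcome to a process exit code:
    // 0 on success, 1 on any error, so that `echo $?` reflects failures.
    static int run(Runnable body) {
        try {
            body.run();
            return 0;
        } catch (Exception e) {
            System.err.println("Error when running shell: " + e.getMessage());
            return 1;
        }
    }

    public static void main(String[] args) {
        int code = run(() -> { throw new IllegalStateException("Connection refused"); });
        System.out.println("exit code: " + code);
        // a real shell entry point would then call System.exit(code)
    }
}
```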

Fix distributed cardinality aggregation when sent over gRPC

See:
https://github.com/spotify/heroic/blob/master/heroic-dist/src/test/java/com/spotify/heroic/GrpcClusterQueryIT.java#L15

For some reason, the calculated cardinality is incorrect when used over gRPC.

This could be related to gRPC having less predictable ordering than the JVM transport when recombining results. It could also be a combination of that ordering and the cardinality implementation being a statistical method (with some associated error), such that the condition only triggers over gRPC.

Multi-DC HA deployment

Hi,
I was wondering about ways to improve resiliency in case of DC failure, and how to marry cross-DC replication with having a federated cluster. While we could set up cross-datacenter replication for C*, this alone would not work: the data would be there, but the metadata would be missing. Also, since the data would be replicated into multiple datacenters, a federated query would read the same data twice, which would probably break some aggregations like count() / sum().

From our point of view having a hot-hot multi-DC deployment is an important requirement, both in case of ingestion as well as querying.

I was trying to devise a way how to work around this limitation, some options I was considering were:

  • Create a Cassandra-backed metadata store (this means that probably not all types of filters would be supported, especially like / regexp). This would basically mean doing something similar to what Kairos is doing (possibly with similar restrictions on rows with huge tag cardinality). Again, this defeats the whole concept of federation.
  • Use something like Elasticsearch tribe node to act as a federated cluster (this would not play nice with rolling index policy, again the federation would not work) - this is just a thought.
  • Use another way to replicate ES (using a Kafka topic?) - or just mirror the pre-ingestion data into different datacenter using Kafka's mirror-maker
  • Other ideas?

By the way, I was wondering what approach you took with regard to rebuilding Elasticsearch indexes. As far as I understand, there's no need to scan over all the data in Cassandra, only the row keys; is there an efficient way to do so, and what numbers are you seeing when rebuilding indexes? I was wondering if we could live with "normal" federation and only rebuild Elastic indexes when there's a failover; they're not that big, and we could replicate other datacenters into different C* keyspaces. If there was a failover, the process would have to regenerate/update only the indexes for "remote" data; similarly, the metric-collecting agent could in this case switch to a different DC.

Log spam from io.grpc.internal.TransportSet$TransportListener to console

Start heroic via java -cp heroic.jar -Dlog4j.configurationFile=config/log4j2-file.xml com.spotify.heroic.HeroicService heroic.yaml.

Then, on the console, we get a constant stream of the following messages:

Jun 22, 2016 7:29:05 PM io.grpc.internal.TransportSet$TransportListener transportTerminated
INFO: Transport io.grpc.netty.NettyClientTransport@d6179e1(localhost/127.0.0.1:0) for localhost/127.0.0.1:0 is terminated
Jun 22, 2016 7:30:05 PM io.grpc.internal.TransportSet$1 call
INFO: Created transport io.grpc.netty.NettyClientTransport@5405e612(localhost/127.0.0.1:0) for localhost/127.0.0.1:0
Jun 22, 2016 7:30:05 PM io.grpc.internal.TransportSet$TransportListener transportShutdown
INFO: Transport io.grpc.netty.NettyClientTransport@5405e612(localhost/127.0.0.1:0) for localhost/127.0.0.1:0 is being shutdown
Jun 22, 2016 7:30:05 PM io.grpc.internal.TransportSet$TransportListener transportTerminated
INFO: Transport io.grpc.netty.NettyClientTransport@5405e612(localhost/127.0.0.1:0) for localhost/127.0.0.1:0 is terminated

Even setting io.grpc to ERROR in the log4j2.xml file does not fix this issue.

"Sending GOAWAY failed"

Hello everyone,

I'm new to heroic and not sure how to approach this error.
I've configured the heroic API using the debian package built from the git repo.
After the service starts, I'm getting the following messages every minute:

10:11:15.922 [heroic-scheduler#2] INFO com.spotify.heroic.cluster.CoreClusterManager - [new] grpc://131.x.x.x:8100
10:11:15.929 [nioEventLoopGroup-2-6] INFO com.spotify.heroic.cluster.CoreClusterManager - [refresh] no nodes discovered, including local node
10:11:15.929 [nioEventLoopGroup-2-6] INFO com.spotify.heroic.cluster.CoreClusterManager - [update] [{}] shards: 1 result(s), 1 failure(s)
10:11:15.930 [nioEventLoopGroup-2-5] WARN io.netty.handler.codec.http2.Http2ConnectionHandler - [id: 0x7ffeead0] Sending GOAWAY failed: lastStreamId '0', errorCode '2', debugData 'Connection refused: /131.x.x.x:8100'. Forcing shutdown of the connection.
java.nio.channels.ClosedChannelException

Any ideas why?

Cassandra authentication and keyspace setting

Currently there seems to be no way to specify authentication credentials for the Cassandra DataStax backend in the settings, and you also can't choose which keyspace to use - or am I missing something?

Right now I've worked around this by hardcoding both things in datastax/ManagedSetupConnection.java.

Consider using gRPC instead of nativerpc

gRPC has many of the characteristics useful for an RPC protocol in a high-latency environment, meaning it could be a suitable replacement for nativerpc.

However, the following things are still concerns which must be addressed.

  • protobuf and Java can be a bit of a mess. The Maven protobuf generator plugin is shaky and depends on protoc.
  • Per-request heartbeats must be supported, especially when requests are multiplexed over the connection, to rapidly detect issues.

OSX Problems

Config

java version "1.8.0_45"
Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)

Still in the process of getting this to actually run, but here are some preliminary findings.

Homebrew Cassandra (v2.0 & v3.x) doesn't work with this.

I have gotten the config to work (I think) with Cassandra from the Apache repo (apache-cassandra-2.1.12).

Errors

java -cp heroic-dist/target/heroic-dist-0.0.1-SNAPSHOT-shaded.jar com.spotify.heroic.HeroicService heo.yaml

Still having problems with elasticsearch

From the heroic shell
screen shot 2016-01-09 at 5 42 10 pm

From the elasticsearch shell
screen shot 2016-01-09 at 5 42 39 pm

Build FakeModuleLoader that simplifies test bootstrapping for many modules

I want to turn this:

mapper = new ObjectMapper();
mapper.addMixIn(AggregationInstance.class, TypeNameMixin.class);
mapper.registerSubtypes(new NamedType(AboveKInstance.class, AboveK.NAME));
mapper.registerSubtypes(new NamedType(BelowKInstance.class, BelowK.NAME));
mapper.registerSubtypes(new NamedType(BottomKInstance.class, BottomK.NAME));
mapper.registerSubtypes(new NamedType(TopKInstance.class, TopK.NAME));

Into this:

final FakeModuleLoader loader = FakeModuleLoader.builder().module(Module.class).build();
mapper = loader.json();

Module already maps out the type names. The reason it can't be used in the tests is that it requires the loading phase of Heroic to be configured.

I propose to introduce FakeModuleLoader, which sets up a fake loading environment and loads the specified module in order to provide a correctly configured ObjectMapper.

Preconfigured keyspace support

In our installation we have preconfigured keyspaces, with access restricted to operations inside that keyspace.
Unfortunately it seems that the datastax backend doesn't support such a case. It would be great if it could detect that the keyspace is present and only create the tables.

Unfortunately "CREATE KEYSPACE IF NOT EXISTS" doesn't work either, as the CREATE KEYSPACE permission check seems to take precedence over the IF NOT EXISTS clause.

Exception:

16:01:29.844 [cluster1-nio-worker-5] INFO  com.spotify.heroic.metric.datastax.schema.ng.NextGenSchema - Creating keyspace heroic
16:01:29.863 [main] ERROR com.spotify.heroic.HeroicService - Failed to start Heroic instance
java.util.concurrent.ExecutionException: eu.toolchain.async.TransformException: error in transform
    at eu.toolchain.async.concurrent.ConcurrentResolvableFuture$Sync.get(ConcurrentResolvableFuture.java:527)
    at eu.toolchain.async.concurrent.ConcurrentResolvableFuture.get(ConcurrentResolvableFuture.java:304)
    at com.spotify.heroic.HeroicService.main(HeroicService.java:106)
    at com.spotify.heroic.HeroicService.main(HeroicService.java:59)
Caused by: eu.toolchain.async.TransformException: error in transform
    at eu.toolchain.async.helper.ResolvedTransformHelper.resolved(ResolvedTransformHelper.java:26)
    at eu.toolchain.async.DirectAsyncCaller.resolve(DirectAsyncCaller.java:10)
    at eu.toolchain.async.concurrent.ConcurrentResolvableFuture$2.run(ConcurrentResolvableFuture.java:221)
    at eu.toolchain.async.concurrent.ConcurrentResolvableFuture.run(ConcurrentResolvableFuture.java:439)
    at eu.toolchain.async.concurrent.ConcurrentResolvableFuture.resolve(ConcurrentResolvableFuture.java:97)
    at com.spotify.heroic.HeroicCore$Instance.start(HeroicCore.java:866)
    ... 2 more
Caused by: java.util.concurrent.ExecutionException: java.lang.Exception: 1 exception(s) caught: User heroic_adm has no CREATE permission on <all keyspaces> or any of its parents
    at eu.toolchain.async.concurrent.ConcurrentResolvableFuture$Sync.get(ConcurrentResolvableFuture.java:542)
    at eu.toolchain.async.concurrent.ConcurrentResolvableFuture.get(ConcurrentResolvableFuture.java:316)
    at com.spotify.heroic.HeroicCore.awaitLifeCycles(HeroicCore.java:512)
    at com.spotify.heroic.HeroicCore.startLifeCycles(HeroicCore.java:461)
    at com.spotify.heroic.HeroicCore$Instance.lambda$new$128(HeroicCore.java:817)
    at com.spotify.heroic.HeroicCore$Instance$$Lambda$112/1234586997.transform(Unknown Source)
    at eu.toolchain.async.helper.ResolvedTransformHelper.resolved(ResolvedTransformHelper.java:24)
    ... 7 more
Caused by: java.lang.Exception: 1 exception(s) caught: User heroic_adm has no CREATE permission on <all keyspaces> or any of its parents
    at eu.toolchain.async.TinyThrowableUtils.buildCollectedException(TinyThrowableUtils.java:23)
    at eu.toolchain.async.helper.CollectHelper.done(CollectHelper.java:142)
    at eu.toolchain.async.helper.CollectHelper.add(CollectHelper.java:128)
    at eu.toolchain.async.helper.CollectHelper.cancelled(CollectHelper.java:85)
    at eu.toolchain.async.DirectAsyncCaller.cancel(DirectAsyncCaller.java:28)
    at eu.toolchain.async.concurrent.ConcurrentResolvableFuture$2.run(ConcurrentResolvableFuture.java:217)
    at eu.toolchain.async.concurrent.ConcurrentResolvableFuture.run(ConcurrentResolvableFuture.java:439)
    at eu.toolchain.async.concurrent.ConcurrentResolvableFuture.cancel(ConcurrentResolvableFuture.java:115)
    at eu.toolchain.async.concurrent.ConcurrentResolvableFuture.cancel(ConcurrentResolvableFuture.java:121)
    at eu.toolchain.async.helper.CollectHelper.checkFailed(CollectHelper.java:94)
    at eu.toolchain.async.helper.CollectHelper.failed(CollectHelper.java:80)
    at eu.toolchain.async.DirectAsyncCaller.fail(DirectAsyncCaller.java:19)
    at eu.toolchain.async.concurrent.ConcurrentResolvableFuture$2.run(ConcurrentResolvableFuture.java:212)
    at eu.toolchain.async.concurrent.ConcurrentResolvableFuture.run(ConcurrentResolvableFuture.java:439)
    at eu.toolchain.async.concurrent.ConcurrentResolvableFuture.fail(ConcurrentResolvableFuture.java:106)
    at eu.toolchain.async.helper.ResolvedTransformHelper.failed(ResolvedTransformHelper.java:16)
    at eu.toolchain.async.DirectAsyncCaller.fail(DirectAsyncCaller.java:19)
    at eu.toolchain.async.concurrent.ConcurrentResolvableFuture$2.run(ConcurrentResolvableFuture.java:212)
    at eu.toolchain.async.concurrent.ConcurrentResolvableFuture.run(ConcurrentResolvableFuture.java:439)
    at eu.toolchain.async.concurrent.ConcurrentResolvableFuture.fail(ConcurrentResolvableFuture.java:106)
    at eu.toolchain.async.helper.ResolvedLazyTransformHelper.failed(ResolvedLazyTransformHelper.java:16)
    at eu.toolchain.async.DirectAsyncCaller.fail(DirectAsyncCaller.java:19)
    at eu.toolchain.async.concurrent.ConcurrentResolvableFuture$2.run(ConcurrentResolvableFuture.java:212)
    at eu.toolchain.async.concurrent.ConcurrentResolvableFuture.run(ConcurrentResolvableFuture.java:439)
    at eu.toolchain.async.concurrent.ConcurrentResolvableFuture.fail(ConcurrentResolvableFuture.java:106)
    at eu.toolchain.async.helper.ResolvedLazyTransformHelper$1.failed(ResolvedLazyTransformHelper.java:33)
    at eu.toolchain.async.DirectAsyncCaller.fail(DirectAsyncCaller.java:19)
    at eu.toolchain.async.concurrent.ConcurrentResolvableFuture$2.run(ConcurrentResolvableFuture.java:212)
    at eu.toolchain.async.concurrent.ConcurrentResolvableFuture.run(ConcurrentResolvableFuture.java:439)
    at eu.toolchain.async.concurrent.ConcurrentResolvableFuture.fail(ConcurrentResolvableFuture.java:106)
    at eu.toolchain.async.helper.ResolvedTransformHelper.failed(ResolvedTransformHelper.java:16)
    at eu.toolchain.async.DirectAsyncCaller.fail(DirectAsyncCaller.java:19)
    at eu.toolchain.async.concurrent.ConcurrentResolvableFuture$2.run(ConcurrentResolvableFuture.java:212)
    at eu.toolchain.async.concurrent.ConcurrentResolvableFuture.run(ConcurrentResolvableFuture.java:439)
    at eu.toolchain.async.concurrent.ConcurrentResolvableFuture.fail(ConcurrentResolvableFuture.java:106)
    at eu.toolchain.async.helper.ResolvedLazyTransformHelper$1.failed(ResolvedLazyTransformHelper.java:33)
    at eu.toolchain.async.DirectAsyncCaller.fail(DirectAsyncCaller.java:19)
    at eu.toolchain.async.concurrent.ConcurrentResolvableFuture$2.run(ConcurrentResolvableFuture.java:212)
    at eu.toolchain.async.concurrent.ConcurrentResolvableFuture.run(ConcurrentResolvableFuture.java:439)
    at eu.toolchain.async.concurrent.ConcurrentResolvableFuture.fail(ConcurrentResolvableFuture.java:106)
    at eu.toolchain.async.helper.ResolvedLazyTransformHelper.failed(ResolvedLazyTransformHelper.java:16)
    at eu.toolchain.async.DirectAsyncCaller.fail(DirectAsyncCaller.java:19)
    at eu.toolchain.async.concurrent.ConcurrentResolvableFuture$2.run(ConcurrentResolvableFuture.java:212)
    at eu.toolchain.async.concurrent.ConcurrentResolvableFuture.run(ConcurrentResolvableFuture.java:439)
    at eu.toolchain.async.concurrent.ConcurrentResolvableFuture.fail(ConcurrentResolvableFuture.java:106)
    at com.spotify.heroic.metric.datastax.Async$1.onFailure(Async.java:46)
    at com.google.common.util.concurrent.Futures$6.run(Futures.java:1764)
    at com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:456)
    at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:817)
    at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:753)
    at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:634)
    at com.datastax.driver.core.DefaultResultSetFuture.onSet(DefaultResultSetFuture.java:149)
    at com.datastax.driver.core.RequestHandler.setFinalResult(RequestHandler.java:183)
    at com.datastax.driver.core.RequestHandler.access$2300(RequestHandler.java:44)
    at com.datastax.driver.core.RequestHandler$SpeculativeExecution.setFinalResult(RequestHandler.java:751)
    at com.datastax.driver.core.RequestHandler$SpeculativeExecution.onSet(RequestHandler.java:573)
    at com.datastax.driver.core.Connection$Dispatcher.channelRead0(Connection.java:1009)
    at com.datastax.driver.core.Connection$Dispatcher.channelRead0(Connection.java:932)
    at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
    at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:254)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:242)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:847)
    at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:618)
    at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:329)
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:250)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
    at java.lang.Thread.run(Thread.java:745)
    Suppressed: com.datastax.driver.core.exceptions.UnauthorizedException: User heroic_adm has no CREATE permission on <all keyspaces> or any of its parents
        at com.datastax.driver.core.Responses$Error.asException(Responses.java:101)
    ... 25 more

Retry failing requests in QueryManager

In order to provide a more stable environment for Heroic clients, it would be beneficial if QueryManager supported retrying requests that fail towards other nodes.

This hardens the system against conditions where access to the other peer becomes compromised, either because it suddenly shuts down or because there are connectivity issues.

  1. The request should have a total request timeout within which it is OK to retry sending the current request, after which it should give up.
  2. (optional) Nodes which are failing too much could be forcibly removed from the local registry so that they won't be used for future requests.
  3. (optional) We maintain a small set of statistics about the peer in order to implement something like els or another load balancing algorithm that takes latency and pending requests into account.
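
As a rough illustration of point 1, a total-budget retry loop might look like the sketch below. All names here are hypothetical; this is not Heroic's actual QueryManager API.

```java
import java.util.concurrent.Callable;

class RetryBudget {
    // Hypothetical sketch: retry a failing call with a fixed backoff until
    // either it succeeds or the total time budget is exhausted, after which
    // the last failure is rethrown.
    public static <T> T retryWithBudget(Callable<T> call, long budgetMs, long backoffMs)
            throws Exception {
        final long deadline = System.currentTimeMillis() + budgetMs;
        Exception last = null;
        while (true) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e;
                // Give up once the next attempt could not start within the budget.
                if (System.currentTimeMillis() + backoffMs >= deadline) {
                    throw last;
                }
                Thread.sleep(backoffMs);
            }
        }
    }
}
```

In practice an exponential or jittered backoff would likely be preferable to the fixed pause used here.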

No way to configure cassandra schema and table in docs

The only configuration step for Cassandra according to the docs is running tools/heroic-shell -P cassandra -X cassandra.seeds= -X cassandra.configure, but it fails with "Keyspace heroic does not exist".
It only works after importing keyspace.cql and tables.cql manually.

Heroic in Docker

Do you have a Docker image running heroic to make setting up development a little easier? Or do you have a dockerfile which I can use to create my own with all the dependencies and such?

/metadata/tag-suggest

hello,

I receive an empty suggest response:

{
  "errors": [],
  "suggestions": []
}

Here is my config:

# heroic.yaml
port: 8080

cluster:
  tags:
    site: nrtb
  protocols:
    - type: grpc
  discovery:
    type: static
    nodes:
      - "grpc://localhost"

metrics:
  backends:
    - type: datastax
      seeds:
        - cassandra

metadata:
  backends:
    - type: elasticsearch
      connection:
        clusterName: elasticsearch
        seeds:
          - elasticsearch
        index:
          type: rotating
          pattern: metadata-%s
          interval: 1w

suggest:
  backends:
    - type: elasticsearch
      connection:
        clusterName: elasticsearch
        seeds:
          - elasticsearch
        index:
          type: rotating
          pattern: metadata-%s
          interval: 1w

consumers:
  - type: kafka
    schema: com.spotify.heroic.consumer.schemas.Spotify100
    topics:
      - "metrics"
    config:
      group.id: heroic-consumer
      zookeeper.connect: kafka
      auto.offset.reset: smallest
      auto.commit.enable: true

For Elasticsearch I did:

script.inline: on
script.indexed: on

but I have no template inside.

A kafka corrupt message makes the consumer get stuck

One of our consumers stopped consuming. While looking into the logs we found the exception below. A restart didn't help; as soon as the faulty message was read, the consumer would stop.

My initial research seemed to indicate that you can't recover from this programmatically. However, we should verify that this is the case and that it is not already addressed in a newer version.

I had to resort to removing the partition that contained the faulty message, as we could afford to do so. Although maybe increasing the offset in ZK is enough if this happens again before it is properly fixed.

2016-06-17 09:36:47,193 ERROR c.s.h.c.k.ConsumerThread [com.spotify.heroic.consumer.kafka.ConsumerThread:]:0: Error in thread kafka.message.InvalidMessageException: Message is corrupt (stored crc = 0, computed crc = 2251710752)
at kafka.message.Message.ensureValid(Message.scala:166)
at kafka.consumer.ConsumerIterator.makeNext(ConsumerIterator.scala:102)
at kafka.consumer.ConsumerIterator.makeNext(ConsumerIterator.scala:33)
at kafka.utils.IteratorTemplate.maybeComputeNext(IteratorTemplate.scala:66)
at kafka.utils.IteratorTemplate.hasNext(IteratorTemplate.scala:58)
at com.spotify.heroic.consumer.kafka.ConsumerThread.guardedRun(ConsumerThread.java:140)
at com.spotify.heroic.consumer.kafka.ConsumerThread.run(ConsumerThread.java:89)

How to push metrics?

Hi,

I'm making a JSON structure in logstash and pushing it into a Kafka topic (works as expected), but it seems that Heroic is not consuming the metrics for some reason.

My structure is the following:

{
     "time" => "1477401246000",
     "host" => "host1.cc",
     "tags" => {
        "loginResult" => "ok",
          "loginType" => "sshd",
          "loginUser" => "xuser"
    },
      "key" => "c2uo0wqqn",
    "value" => "1.0"
}

If I connect to the shell server and run keys, I only see the generated keys:

{"series":{"key":"key-11","tags":{"host":"11.example.com","role":"web","what":"disk-used"}},"base":1475321265489,"type":"points","token":6743956916690856207}
{"series":{"key":"key-81","tags":{"host":"81.example.com","role":"ingestor","what":"teleported-goats"}},"base":1475321265489,"type":"points","token":7120557367222539016}
{"series":{"key":"key-51","tags":{"host":"51.example.com","role":"ingestor","what":"disk-used"}},"base":1475321265489,"type":"points","token":7178361716493023162}

Any ideas what am I doing wrong?

Thank you!

Cassandra & Data retention questions

I know this is not the appropriate place for questions, but with the lack of a mailing list or gitter channel I think it's forgivable (feel free to delete if not).

I'm setting up a VM with heroic in order to monitor some of our homegrown metrics (ingesting metrics with the REST API into heroic) and have some questions for which I couldn't find any answers on the wiki and homepage.

  1. Data retention - Is this configurable somehow? Ideally, I'd like to store 7 days of raw data and then store pre-aggregated series for as long as we need them (a year maybe?)
  2. Cassandra fetching - Any reason heroic is doing multiple fetches based on range segments? I'm not too familiar with Cassandra, so it's probably a dumb question. What's wrong with doing one query that returns all rows vs. multiple queries?

Thanks!

[metric/bigtable] ResultScanner blocks main thread pool

The current implementation of the result scanner has a synchronous component:
https://github.com/spotify/heroic/blob/master/metric/bigtable/src/main/java/com/spotify/heroic/metric/bigtable/api/BigtableDataClientImpl.java#L162

This should be delegated to a Cached Thread Pool until upstream has fixed googleapis/java-bigtable-hbase#703.

This was discovered as a potential issue during stress tests: a single request effectively occupies all available threads for an extended period of time instead of time-sharing across all requests.
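
A minimal sketch of the proposed delegation, assuming the blocking scanner call can be wrapped in a Callable (the names below are illustrative, not the actual BigtableDataClientImpl API):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ScannerOffload {
    // A cached thread pool grows on demand and reuses idle threads, so a
    // long-running blocking scan ties up one of these threads instead of a
    // thread from the main async pool.
    private static final ExecutorService SCANNER_POOL = Executors.newCachedThreadPool();

    // Submit the blocking call and return a Future the async layer can
    // subscribe to without blocking its own threads.
    public static <T> Future<T> offload(Callable<T> blockingScan) {
        return SCANNER_POOL.submit(blockingScan);
    }
}
```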

What is the intent of a time series database?

I work at an education startup and content engagement is a great concern for us. I've been thinking about how I can know whether a given user has engaged with a video or an audio. Is it an arbitrary percentage of content consumed? How can I store the moments of content consumed, which moments have been consumed more than once, etc.?

Is this database or this category of database something I should be looking into to solve this kind of problem?

Implement slowlog

A slowlog is a log that prints information about queries which are slow.

It is typically used to figure out what is adding stress to the system, and is required in order to find such queries.
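
As a rough sketch (hypothetical names, not a design proposal), a slowlog boils down to recording any query whose elapsed time exceeds a configured threshold:

```java
import java.util.ArrayList;
import java.util.List;

class Slowlog {
    private final long thresholdMs;
    private final List<String> entries = new ArrayList<>();

    Slowlog(long thresholdMs) {
        this.thresholdMs = thresholdMs;
    }

    // Record a query only if it was slower than the threshold.
    public void record(String query, long elapsedMs) {
        if (elapsedMs > thresholdMs) {
            entries.add(elapsedMs + "ms " + query);
        }
    }

    public List<String> entries() {
        return entries;
    }
}
```

A real implementation would write to a log appender rather than an in-memory list, and would likely capture the query body, filters, and trace information.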

@juruen has done some experimentation regarding this, so I'm assigning this issue to him.

K filters behavior?

It doesn't seem like any of the K-filtering aggregations work. Regardless of the filter / k values that I use, the same sample gets returned every time. Am I using these wrong?

For example, in a set of values:

[
    [1, 10.0],
    [2, 20.0],
    [3, 30.0]
]

I would expect a query with "aggregation": {"type": "abovek", "k": 25} to only return [3, 30.0]. However, all points from the sample are being returned, with or without the aggregation (same goes for bottomk, topk, and belowk).
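
For reference, the per-point filtering I would expect from abovek can be sketched as follows (a hypothetical helper, not Heroic's actual AboveKInstance):

```java
import java.util.List;
import java.util.stream.Collectors;

class AboveK {
    // Expected behavior as described above: keep only the points whose value
    // is strictly above k. Each point is a [timestamp, value] pair.
    public static List<double[]> aboveK(List<double[]> points, double k) {
        return points.stream()
                .filter(p -> p[1] > k)
                .collect(Collectors.toList());
    }
}
```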

API for /write example incorrect

The API documentation differs from the implementation: the actual data should be more deeply nested (data has a type and a nested data array containing the points).

{
    "series": {
        "key": "foo",
        "tags": {
            "site": "lon",
            "host": "www.example.com"
        }},

    "data" : {
        "type" : "points",
        "data": [
        [
            1300000000000,
            42
        ],
        [
            1300001000000,
            84
        ]
    ]
    }
}

How to delete keys

Hi,

For some reason I'm not able to delete keys from heroic...
I've tried using the following curl example and it's just not doing anything:
curl -XDELETE -H "Content-Type: application/json" http://localhost:8080/metadata/series -d '{"filter": ["and", ["key", "key-17"], ["=", "role", "ingestor"]]}'

I ended up truncating the whole table in cqlsh. Any ideas why it's not working?

DSE and SOLR support

DSE now supports SOLR indexing pretty much out of the box. Does Heroic have plans to use this behavior natively? Or is ElasticSearch the only game in town for this?

Error when deserializing response from other nodes

During certain queries the intra-cluster transport is seemingly corrupting data.

Example query in my case:

{
    "range" : { "type": "relative", "unit": "DAYS",  "value" : "365"},
    "filter" : ["key", "my-metric-2-PT3M"],
    "aggregation" : {"type" : "average", "sampling" : {"unit": "HOURS",  "value" : "2"}}
}

If the query runs locally (i.e. within the current instance) it completes successfully:

{
  "range": {
    "start": 1416830400000,
    "end": 1448373600000
  },
  "result": [
    {
      "type": "points",
      "hash": "7fafe0fd",
      "shard": {},
      "cadence": 7200000,
      "values": [
        [
          1416844800000,
          2.768182805494971
        ],
        [
          1416852000000,
          2.734896426939305
        ],
       ],
....

  "statistics": {
    "counters": {}
  },
  "errors": [],
  "latencies": [],
  "trace": {
    "what": {
      "name": "com.spotify.heroic.CoreQueryManager#query"
    },
    "elapsed": 331830371,
    "children": [
      {
        "what": {
          "name": "com.spotify.heroic.cluster.LocalClusterNode#query"
        },
        "elapsed": 331439004,
        "children": [
          {
            "what": {
              "name": "com.spotify.heroic.metric.LocalMetricManager#query"
            },
            "elapsed": 328125094,
            "children": []
          }
        ]
      }
    ]
  }
}

Response from cluster when querying against remote node:

{
  "range": {
    "start": 1416657600000,
    "end": 1448373600000
  },
  "result": [
    {
      "type": "points",
      "hash": "ae1a6628",
      "shard": {},
      "cadence": 7200000,
      "values": [],
      "tags": {},
      "tagCounts": {}
    }
  ],
  "statistics": {
    "counters": {}
  },
  "errors": [
    {
      "type": "node",
      "nodeId": "80aa41b3-7c79-4fa3-a7f5-864600ff0b62",
      "tags": {
        "site": "bos"
      },
      "error": "Failed to handle response, caused by Illegal character ((CTRL-CHAR, code 0)): only regular white space (\r, \n, \t) is allowed between tokens\n at [Source: [B@4a0d15; line: 1, column: 1000] (through reference chain: com.spotify.heroic.metric.ResultGroups[\"groups\"]->java.util.ArrayList[0]->com.spotify.heroic.metric.ResultGroup[\"group\"]), caused by Illegal character ((CTRL-CHAR, code 0)): only regular white space (\r, \n, \t) is allowed between tokens\n at [Source: [B@4a0d15; line: 1, column: 1000]",
      "internal": true,
      "node": "nativerpc://198.18.157.86:1394"
    }
  ],
  "latencies": [],
  "trace": {
    "what": {
      "name": "com.spotify.heroic.CoreQueryManager#query"
    },
    "elapsed": 325662602,
    "children": [
      {
        "what": {
          "name": "com.spotify.heroic.CoreQueryManager#query_node"
        },
        "elapsed": 0,
        "children": []
      }
    ]
  }
}

Downsampling a range into a single value gives unexpected results

Seeing weird results when trying to do any kind of aggregation and downsample into a single value.

For example, we're trying to get the min/max/avg and stddev over a timerange of ~5 minutes for one metric. Looking at the raw data, the values range from [197, 199]. In order to only get one value back, we're adding a sampling aggregation that looks like this

{
  "range": {
    "type": "absolute",
    "start": 1479434197000,   
    "end": 1479434538000   // range = 341 seconds 
  },
  "aggregation": {
      "type": "spread",
      "sampling": {
        "size": "341s",
        "extent": "341s"
      }
  },
  "filter": [
    "and",
    [ "key", "foo.bar" ],
    [ "=", "tag1", "baz" ],
    [ "=", "tag2", "qux" ]
  ]
}

Now, the results I would expect to see, would be something like this:

min = 197
avg = (sum/count) = 198.123
max = 199

stddev = 0.123

But instead, it gives us seemingly random values back, like below.

min = 156
avg = (sum/count) = 187.123
max = 199

stddev = 17.123

Findings

  • Querying for the stddev also yields seemingly random values.
  • Bumping the timerange ±1 second sometimes gives completely different values - especially for the min aggregation.
  • The max usually is correct.
  • Doing some experiments with the size and extent, the result sets do not even include the correct min values until you reach a pretty small size/extent.
  • If we chain the spread aggregation, we actually get the correct results.
...
"aggregation": {
    "type": "chain",
    "chain": [
      {
        "type": "spread",
        "sampling": {
          "size": "341s",
          "extent": "341s"
        }
      },
      {
        "type": "spread",
        "sampling": {
          "size": "1s",
          "extent": "1s"
        }
      }
    ]
  }
...

Perhaps I've misunderstood how the downsampling behaves, but I can't really see what other options I have for getting the data.
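
For reference, the result I would expect from a single spread extent is just the plain min/max/mean/population stddev over the raw points in that extent, sketched here with hypothetical names:

```java
import java.util.stream.DoubleStream;

class SpreadSketch {
    // Compute { min, max, mean, population stddev } over the raw points of
    // one extent -- the values the query above is expected to return.
    public static double[] spread(double... values) {
        final double min = DoubleStream.of(values).min().getAsDouble();
        final double max = DoubleStream.of(values).max().getAsDouble();
        final double mean = DoubleStream.of(values).average().getAsDouble();
        final double variance = DoubleStream.of(values)
                .map(v -> (v - mean) * (v - mean))
                .average().getAsDouble();
        return new double[] { min, max, mean, Math.sqrt(variance) };
    }
}
```

With raw values in [197, 199] this yields min = 197, max = 199 and a stddev well below 1, which is what makes the min = 156 / stddev = 17 results above look wrong.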

POST: /query/metrics does not seem to work

I built heroic-db (commit 54f4443) and the service is up and running using Cassandra as the backend (but no Elasticsearch); both GET /status and POST /write return without errors.

However, the following two queries are returning errors:
curl -H "Content-Type: application/json" http://localhost:8080/query/metrics
-d '{"range": {"type": "relative"}, "filter": ["and", ["key", "foo"], ["=", "foo", "bar"], ["+", "role"]], "groupBy": ["site"], "aggregation": []}'
This example is from https://spotify.github.io/heroic/#!/docs/api and the error message is the following:
{"message":"Unexpected token (END_ARRAY), expected VALUE_STRING: need JSON String that contains type id (for subtype of com.spotify.heroic.aggregation.Aggregation)","reason":"Bad Request","status":400,"path":"aggregation","type":"json-error"}

I tried changing the query instead to the following:
curl -X -H "Content-Type: application/json" http://localhost:8080/query/metrics
-d '{"range": {"type": "relative", "string": "MONTHS", }, "filter": ["and", ["key", "foo"], ["=", "foo", "bar"], ["+", "role"]], "groupBy": ["site"], "aggregation": []}'

And I am now getting this error message:
curl: (6) Could not resolve host: Content-Type; Unknown error
{"message":"HTTP 405 Method Not Allowed","reason":"Method Not Allowed","status":405,"type":"error"}

I am looking at the logs and I am not seeing anything there which indicates why the queries are failing.

Poor write performance (BigTable)

Hi,

First off, I wanted to say thank you for open sourcing this excellent project. Our company migrated to Google Cloud Platform specifically for the benefits afforded by BigTable, and we're really happy to see Heroic being built with support for it from the outset. We are currently using OpenTSDB in production, but its BigTable adapter is pretty alpha and some important features, such as deleting data, are unsupported at the moment.

With OpenTSDB, we are seeing latencies ~30-40ms for writing 60 points to a series. With Heroic, however, we are seeing performance ~10x slower, ranging from 300-400ms for writing the exact same data. Both services are running on a single node in Google Compute Engine (4 vCPUs, 15 GB). I have included below the API calls to both Heroic and OpenTSDB, as well as a screengrab from our Grafana dashboard that shows the difference in latency.

It might be worth mentioning that we are seeing good read times. It only appears that writes are slower than OpenTSDB.

screen shot 2016-10-20 at 11 17 13 am 2

screen shot 2016-10-20 at 11 18 22 am

screen shot 2016-10-20 at 11 18 43 am

Multiple separate/non-chained aggregations in one query?

Hey guys,

We're trying out Heroic in addition to a few other databases. Both KairosDB and OpenTSDB have the functionality to make multiple aggregations in the same query, and return them as separate result sets.

More specifically: is it possible to fetch the min/avg/max for a given metric without sending 3 separate queries?
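To make the question concrete, what we are hoping for is a single request carrying named sub-queries, each with its own aggregation, along the lines of the /query/batch endpoint mentioned in the docs. The payload shape below is our guess, not verified behavior:

```shell
# Sketch of a batched min/avg/max request; the /query/batch endpoint and
# field names are assumptions from the docs. The response would ideally
# key the three result sets by sub-query name.
curl -X POST -H "Content-Type: application/json" http://localhost:8080/query/batch \
  -d '{"queries": {
        "min": {"filter": ["key", "system.cpu"], "range": {"type": "relative", "unit": "HOURS", "value": 1}, "aggregation": [{"type": "min"}]},
        "avg": {"filter": ["key", "system.cpu"], "range": {"type": "relative", "unit": "HOURS", "value": 1}, "aggregation": [{"type": "average"}]},
        "max": {"filter": ["key", "system.cpu"], "range": {"type": "relative", "unit": "HOURS", "value": 1}, "aggregation": [{"type": "max"}]}}}'
```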

Any input appreciated. Thanks!

Diff aggregations (deltas)

Hi,

Are there any plans for a rate of change aggregator (deltas)? KairosDB has a "diff" aggregator that performs no sampling. We usually use it as the first aggregator in a chain to feed into downsampling aggregators (min, max, avg, etc.)

It seems like Heroic uses a concurrent bucketing pattern for SamplingAggregation, so a FilterAggregation would be more appropriate here, since you would need to keep track of the state of the last point. Something like FilterKAreaStrategy, which operates on a MetricCollection, might work, perhaps holding two points at a time, e.g.:

private double computeDiff(MetricCollection metricCollection) {
    // Assumes the collection holds exactly two consecutive points.
    final List<Point> metrics = metricCollection.getDataAs(Point.class);
    return metrics.get(1).getValue() - metrics.get(0).getValue();
}
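As a general illustration of the state involved (just the previous point), a standalone delta pass over raw values, not tied to Heroic's actual MetricCollection/Point interfaces, could look like:

```java
import java.util.ArrayList;
import java.util.List;

// Standalone sketch only: Heroic's real aggregation types are not used
// here, this just shows the successive-difference computation itself.
public class DiffSketch {
    /** Returns successive differences: out[i] = in[i + 1] - in[i]. */
    public static List<Double> diffs(List<Double> values) {
        final List<Double> out = new ArrayList<>();
        for (int i = 1; i < values.size(); i++) {
            out.add(values.get(i) - values.get(i - 1));
        }
        return out;
    }
}
```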

Am I on the right track?

write and query metric api not working

Hey. I found that some of the APIs in the documentation don't work when using curl.
When I tested POST /write, the response was:

{"message":"Unexpected token (START_ARRAY), expected START_OBJECT","reason":"Bad Request","status":400,"path":"data","type":"json-error"}

Then POST /query/metrics returned:

{"message":"Instantiation of [simple type, class com.spotify.heroic.QueryDateRange$Relative] value failed: value","reason":"Bad Request","status":400,"path":"range","type":"json-error"}

However, I can write and query via the shell.
Also, I could not find any API to delete metrics from Cassandra. Can we do that through the API?
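For what it's worth, reading the two errors back: /write seems to want "data" as an object carrying a type discriminator rather than a bare array, and the relative range seems to need a "value" field. Guessed-at request shapes (field names assumed from the docs, not verified):

```shell
# Both payloads below are guesses inferred from the error messages;
# verify the field names against your Heroic version's documentation.

# /write: "data" as an object with a "type" discriminator, not an array.
curl -X POST -H "Content-Type: application/json" http://localhost:8080/write \
  -d '{"series": {"key": "foo", "tags": {"site": "lon"}}, "data": {"type": "points", "data": [[1476957600000, 42.0]]}}'

# /query/metrics: the relative range carries both "unit" and "value".
curl -X POST -H "Content-Type: application/json" http://localhost:8080/query/metrics \
  -d '{"filter": ["key", "foo"], "range": {"type": "relative", "unit": "HOURS", "value": 2}, "aggregation": []}'
```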

Cadence: 0 when using Whatever-K

It seems like the cadence value of any time series gets lost after being passed through any of the K filters. Is there a reason for this, or could we just pass along the cadence that each time series gets from the sampling aggregation prior to the K filter (if there is one)?
