jaegertracing / jaeger-analytics-java
Data analytics pipeline and models for tracing data
License: Apache License 2.0
Describe the bug
Trying to establish a gRPC connection to port 16686 fails with the current jaegertracing/all-in-one:1.24.0, as that port does not serve HTTP/2 and presumably not gRPC. As documented in https://www.jaegertracing.io/docs/1.24/deployment/#query-service--ui, the gRPC query functionality is on port 16685 instead of 16686.
I created a patch at hacst@7749594 (untested right now) but I'm unsure how involved this would be to get landed.
To Reproduce
Steps to reproduce the behavior:
Any operation on the QueryServiceBlockingStub which initiates a request will fail with a failure to establish the underlying HTTP2 channel:
JaegerAllInOne jaeger = new JaegerAllInOne("jaegertracing/all-in-one:1.24.0");
jaeger.start();
jaeger.createBlockingQueryService().findTraces(...);
Expected behavior
Query completes without any connectivity issues
Create a grafana dashboard to show other derived metrics per service.
To convert from the protobuf io.jaegertracing.api_v2.Model.Span to io.jaegertracing.analytics.model.Span, the class io.jaegertracing.analytics.model.Converter offers a utility method Span toSpan(Model.Span), which extracts and converts the data from the protobuf Model.Span and writes it to a custom Span that is later used to create a graph.
It seems, however, that toSpan() only copies the nanosecond fraction of the duration and completely ignores the seconds data field:
span.durationMicros = protoSpan.getDuration().getNanos() / 1000;
According to the documentation, a duration consists of a seconds part and a nanos part, the latter representing a fraction of a second at nanosecond resolution in the range -999,999,999 to +999,999,999 inclusive. nanos alone can therefore only store durations of less than a second.
The same applies to Timestamp and to any code that considers only the nanos part of a Timestamp.
Can you confirm that this is an issue?
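A conversion that accounts for both fields might look like the following. This is a minimal sketch: the real Converter operates on protobuf's Duration, which is represented here only by its two fields.

```java
public class DurationConversion {
    // Convert a protobuf-style Duration (seconds + nanos) to microseconds.
    // The code quoted above uses only getNanos(), silently dropping whole seconds.
    static long toMicros(long seconds, int nanos) {
        return seconds * 1_000_000L + nanos / 1_000;
    }

    public static void main(String[] args) {
        // A 2.5 s duration: the nanos-only version would report 500000 µs.
        System.out.println(toMicros(2, 500_000_000)); // prints 2500000
    }
}
```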
I am using Jaeger from ephemeral but long running (minutes to days) processes to trace execution of engineering workflows. The tracing part is working great.
I would like to additionally track metrics from these processes. Since Prometheus is notoriously bad at handling ephemeral processes, and since Jaeger already provides a high performance, reliable, and scalable data path for the trace data, I would like to collect the metrics on the server side much along the lines of https://medium.com/jaegertracing/data-analytics-with-jaeger-aka-traces-tell-us-more-973669e6f848 . However, I would prefer to not add the additional requirements of running and maintaining Kafka.
I have created a prototype gRPC storage plugin which accepts trace data, but does not handle read operations. Since Jaeger allows multiple storage plug-ins but only reads from the first, it can be installed behind Cassandra or Elasticsearch plug-ins.
This plug-in uses the Golang Prometheus client to provide metrics on the spans that it sees. Currently it is hardcoded to collect the metrics that I particularly need and is not generic.
The metrics I am currently collecting do not require assimilating multiple spans. The main ones we are looking to get are average duration, run count, and failure count for particular span types. For us, our durations are long so latency effects between trace spans aren't that interesting. I am converting some, but not all, of the span tags into labels so that I can issue the required queries out of Prometheus.
One difficulty with this solution is that Prometheus expects each collection target to be definitive. With Jaeger scalability, the collector can be replicated. Prometheus currently expects that a single scrape target contains all the values for a particular time series/label combination. It has no ability to sum or aggregate values from different scrape targets that match, even with the honor_labels option (if you try, it ends up flopping back and forth between the values each scrape target provides). Without honor_labels, you can easily write labels for the actual source instance/ip and write queries to sum the results however you want, but there is a significant implication for the Prometheus time-series storage. In my case, if I have n computers reporting traces and m replicas of the jaeger-collector, I'll end up with n*m time series in Prometheus' storage.
Calculate network latency between client and server spans. The latencies could be provided for service tuples (or IP address tuples).
How to calculate this?
server span ts - client span ts
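Assuming reasonably synchronized clocks, the subtraction above can be sketched as follows (field names are illustrative, not from the repo):

```java
public class NetworkLatency {
    // Network latency ~= server span start - client span start.
    // Assumes client and server clocks are synchronized; clock skew shows up
    // directly in the result and can even make it negative.
    static long latencyMicros(long clientStartMicros, long serverStartMicros) {
        return serverStartMicros - clientStartMicros;
    }

    public static void main(String[] args) {
        System.out.println(latencyMicros(1_000_000L, 1_000_250L)); // prints 250
    }
}
```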
Created from #31 (comment)
For each metric document why it is useful, and what problem it solves.
This might require using Testcontainers to run the Jaeger query service.
For instance, move print dfs to a separate class.
I am following this article: https://medium.com/jaegertracing/jaeger-data-analytics-with-jupyter-notebooks-b094fa7ab769
Running the third cell, i.e. the Trace DSL with Apache Gremlin, throws this error:
| TraceTraversal<Vertex, Vertex> traversal = graph.traversal(TraceTraversalSource.class)
cannot find symbol
symbol: class TraceTraversal
| TraceTraversal<Vertex, Vertex> traversal = graph.traversal(TraceTraversalSource.class)
cannot find symbol
symbol: class Vertex
| TraceTraversal<Vertex, Vertex> traversal = graph.traversal(TraceTraversalSource.class)
cannot find symbol
symbol: class Vertex
| TraceTraversal<Vertex, Vertex> traversal = graph.traversal(TraceTraversalSource.class)
cannot find symbol
symbol: class TraceTraversalSource
| .repeat(__.out())
cannot find symbol
symbol: variable __
| .until(__.hasName("SQL SELECT"));
cannot find symbol
symbol: variable __
cc @jpkrohling
Provide a Grafana dashboard for trace quality metrics. Preferably a dashboard per service.
When the kernel in the spark-runner notebook is restarted and run, it gives the following error:
| System.out.println(org.apache.spark.SparkConf.class);
cannot find symbol
symbol: class SparkConf
Migrate or integrate with https://github.com/jaegertracing/jaeger-analytics-flink/tree/master/tracequality-job/src/main/java/io/jaegertracing/tracequality/score
It currently defines:
- ClientVersion - a span contains a client version and the version is higher than or equal to the required one
- HasClientServerSpans - a trace has client and server spans
- UniqueSpanIds - all spans contain unique span ids.
The internal span model in this repository has only one reference to the parent. When the model is being created we should parse the reference array and appropriately update all metrics that use parent references.
Provide yaml manifest with deployment for spark streaming.
From #22 (comment)
At the moment the service name has to be specified in the text input. Pre-populating existing service names would simplify the user experience.
From https://github.com/jaegertracing/jaeger-analytics-java/pull/22/files/9e685d50c76bdf9c61a4917044ecf39a5b5de3cd#diff-f80ad592b359d09d6eef8e113412a26b use monitoring mixins for Grafana dashboards and Prometheus alerts. At the moment we host only dashboards; once the complexity rises we should consider using mixins.
A monitoring mixin is a package of configuration containing Prometheus alerts, Prometheus recording rules and Grafana dashboards.
https://docs.google.com/document/d/1A9xvzwqnFVSOZ5fD3blKODXfsat5fg6ZhnKu9LK3lB4/view#
https://docs.google.com/document/d/1oXfthGcAOMriy7PEqrq_E8ecz1U_Jyn3QYqEWoHN7S8/edit#heading=h.rfr677ib684e
Package models/extraction functions as a fatjar with spark and publish it as a docker image.
Create a production-grade implementation of System Architecture Feature that is able to build dependency items from a continuous stream of spans consumed from Kafka.
It is worth recognizing that there is an existing means of computing dependency items from spans through spark-dependencies. This solution is based on a single query on spans in a backing store that builds and bulk-loads the dependency items into the backing store.
However, to maintain an accurate count of edges (call counts) between services and an up-to-date topology, a streaming solution would be more suitable while also removing the need to manage cron jobs and date boundaries.
We currently have a streaming solution for the System Architecture feature running in our (Logz.io) production environment. It is based on the Kafka Streams library, which could be integrated into the existing Jaeger architecture with Kafka as an intermediate buffer.
The following illustrates how we have implemented the System Architecture feature currently, courtesy of the engineer of this solution @PropAnt:
The reason dependency items are written back to Kafka is to allow the Kafka Streams application to write processed data efficiently without being limited by back-pressure from the dependency items backing store.
High-level description of the Kafka streams topology:
We propose to adopt and open source our implementation that already works in production.
There are a few ways to design the solution:
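Whatever engine is ultimately chosen, the core of such a topology is counting caller-to-callee edges between services. A minimal in-memory sketch of that aggregation follows; the class and service names are illustrative and this is not the Logz.io implementation:

```java
import java.util.*;

public class DependencyLinkCounter {
    // Count calls between service pairs, as a streaming job would do per window.
    // Each edge is a (parentService, childService) pair extracted from spans.
    static Map<String, Long> countEdges(List<String[]> edges) {
        Map<String, Long> counts = new TreeMap<>(); // sorted keys for stable output
        for (String[] e : edges) {
            counts.merge(e[0] + "->" + e[1], 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String[]> edges = List.of(
                new String[]{"frontend", "cart"},
                new String[]{"frontend", "cart"},
                new String[]{"cart", "db"});
        System.out.println(countEdges(edges)); // {cart->db=1, frontend->cart=2}
    }
}
```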
This is helpful for using tracing in e2e tests. It could be moved to a separate module or test-jar.
Calculate metrics showing the number of services this service is calling.
For instance A calls B and C. Then the result should be 2.
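This fan-out metric can be computed by collecting the distinct callees per service. A small sketch, with service names taken from the example above:

```java
import java.util.*;

public class ServiceFanOut {
    // Number of distinct services a given service calls.
    // Duplicate calls to the same callee count once, hence the Set.
    static int fanOut(String service, List<String[]> calls) {
        Set<String> callees = new HashSet<>();
        for (String[] c : calls) {
            if (c[0].equals(service)) {
                callees.add(c[1]);
            }
        }
        return callees.size();
    }

    public static void main(String[] args) {
        // A calls B (twice) and C: fan-out of A is 2.
        List<String[]> calls = List.of(
                new String[]{"A", "B"},
                new String[]{"A", "B"},
                new String[]{"A", "C"});
        System.out.println(fanOut("A", calls)); // prints 2
    }
}
```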
Calculate service height metric for each service in a trace. Choose the highest value for the metric and label metric with service name
Service height: number of service hops from a service to leaf service
Service depth: number of service hops from a service to root service
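Given the call graph of a trace, the height can be computed by recursing from a service down to its leaves. A sketch, assuming the per-trace call graph is acyclic:

```java
import java.util.*;

public class ServiceHeight {
    // Height: number of service hops from a service down to its deepest leaf.
    // A leaf service (no outgoing calls) has height 0.
    static int height(String service, Map<String, List<String>> calls) {
        int h = 0;
        for (String child : calls.getOrDefault(service, List.of())) {
            h = Math.max(h, 1 + height(child, calls));
        }
        return h;
    }

    public static void main(String[] args) {
        // A -> B -> C: height(A) = 2.
        Map<String, List<String>> calls = Map.of(
                "A", List.of("B"),
                "B", List.of("C"));
        System.out.println(height("A", calls)); // prints 2
    }
}
```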
We could calculate a trace quality metric for messaging middleware in a similar way to how we calculate the has-client/server-tags metric.
For instance, we could fail the metric if a producer has no consumer, and pass it if there are one or more consumers.
Brainstorm a set of metrics that can lead to anomaly detection - anomaly detection in the sense of anything which can improve performance, security, etc.
For instance:
Currently, the only documentation we seem to have for the Jupyter directory is this blog post:
https://medium.com/jaegertracing/jaeger-data-analytics-with-jupyter-notebooks-b094fa7ab769
It would be great to have a readme file on how to get started with the current notebooks, and/or how to create new notebooks.
Configure releases to Maven Central.
Calculate maximum service depth in a trace e.g. maximum network hops in a trace.
TraceDepth calculates the maximum span depth.