
jaeger-analytics-java's People

Contributors

alefhar · dependabot[bot] · pavolloffay · yurishkuro


jaeger-analytics-java's Issues

Testcontainer query does not work with current jaeger

Describe the bug
Trying to establish a gRPC connection to port 16686 fails with the current jaegertracing/all-in-one:1.24.0, because that port serves the HTTP UI rather than HTTP/2/gRPC. As documented in https://www.jaegertracing.io/docs/1.24/deployment/#query-service--ui, the gRPC query service listens on port 16685, not 16686.

I created a patch at hacst@7749594 (untested right now), but I'm unsure how involved it would be to get it landed.

To Reproduce
Steps to reproduce the behavior:

Any operation on the QueryServiceBlockingStub that initiates a request fails because the underlying HTTP/2 channel cannot be established:

JaegerAllInOne jaeger = new JaegerAllInOne("jaegertracing/all-in-one:1.24.0");
jaeger.start();
jaeger.createBlockingQueryService().findTraces(...);
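A minimal sketch of the fix (untested, in the spirit of the patch linked above): point the query channel at gRPC port 16685 instead of 16686. `QUERY_GRPC_PORT` is a hypothetical constant, and the `getHost()`/`getMappedPort()` calls assume the wrapper exposes the usual Testcontainers container API.

```
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

// Sketch only: 16686 serves the HTTP UI; the gRPC query service is on 16685.
static final int QUERY_GRPC_PORT = 16685; // hypothetical constant, not in the repo

ManagedChannel channel = ManagedChannelBuilder
        .forAddress(jaeger.getHost(), jaeger.getMappedPort(QUERY_GRPC_PORT))
        .usePlaintext()
        .build();
```

The stub returned by createBlockingQueryService() would then be built on this channel rather than on one targeting 16686.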

Expected behavior
Query completes without any connectivity issues

Version (please complete the following information):

  • io.jaegertracing:jaeger-testcontainers:0.7.0
  • jaegertracing/all-in-one:1.24.0

io.jaegertracing.analytics.model.Converter limits span durations to durations shorter than 1 second

To convert from protobuf io.jaegertracing.api_v2.Model.Span to io.jaegertracing.analytics.model.Span, the class io.jaegertracing.analytics.model.Converter offers a utility method Span toSpan(Model.Span), which extracts the data from the protobuf Model.Span and writes it to a custom Span that is later used to build a graph.

It seems, however, that toSpan() copies only the nanosecond fraction of the duration and completely ignores the seconds field.

span.durationMicros = protoSpan.getDuration().getNanos() / 1000;

According to the protobuf documentation, a Duration consists of a seconds part and a nanos part, the latter representing a fraction of a second at nanosecond resolution, in the range -999,999,999 to +999,999,999 inclusive. The nanos field alone can only represent durations shorter than one second.
The same applies to Timestamp and to any code that considers only the nanos part of a Timestamp.
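A corrected conversion would combine both fields. This is a minimal sketch, not the repository's actual code; the class and method names are assumptions:

```java
// Hypothetical fix for Converter.toSpan(): include the seconds field.
// A protobuf Duration exposes getSeconds() and getNanos(); the conversion
// below takes the two raw values directly.
public class DurationFix {
    /** Converts a protobuf-style Duration (seconds + nanos) to microseconds. */
    public static long toMicros(long seconds, int nanos) {
        return seconds * 1_000_000L + nanos / 1_000L;
    }
}
```

With this, a duration of 2 s 250 ms yields 2,250,000 µs, whereas the nanos-only conversion yields just 250,000 µs.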

Can you confirm that this is an issue?

Generate span metrics at the collector level

I am using Jaeger from ephemeral but long-running (minutes to days) processes to trace the execution of engineering workflows. The tracing part is working great.

I would like to additionally track metrics from these processes. Since Prometheus is notoriously bad at handling ephemeral processes, and since Jaeger already provides a high performance, reliable, and scalable data path for the trace data, I would like to collect the metrics on the server side much along the lines of https://medium.com/jaegertracing/data-analytics-with-jaeger-aka-traces-tell-us-more-973669e6f848 . However, I would prefer to not add the additional requirements of running and maintaining Kafka.

I have created a prototype gRPC storage plugin which accepts trace data, but does not handle read operations. Since Jaeger allows multiple storage plug-ins but only reads from the first, it can be installed behind Cassandra or Elasticsearch plug-ins.

This plug-in uses the Golang Prometheus client to provide metrics on the spans that it sees. Currently it is hardcoded to collect the metrics that I particularly need and is not generic.

The metrics I am currently collecting do not require assimilating multiple spans. The main ones we are looking to get are average duration, run count, and failure count for particular span types. For us, our durations are long so latency effects between trace spans aren't that interesting. I am converting some, but not all, of the span tags into labels so that I can issue the required queries out of Prometheus.

One difficulty with this solution is that Prometheus expects each scrape target to be authoritative. Because Jaeger scales horizontally, the collector can be replicated, yet Prometheus assumes a single scrape target holds all values for a given time series/label combination. It has no ability to sum or aggregate matching values from different scrape targets, even with the honor_labels option (if you try, the series flip-flops between the values each target reports). Without honor_labels, you can label each series with the actual source instance/IP and write queries that sum the results however you want, but this carries a significant cost in Prometheus' time-series storage: with n computers reporting traces and m replicas of the jaeger-collector, I end up with n*m time series.

Calculate network latency

Calculate network latency between client and server spans. The latencies could be provided for service tuples (or IP address tuples).

How to calculate this?
server span ts - client span ts
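As a sketch of the formula above, assuming both timestamps are already in microseconds (the method and parameter names are illustrative, not part of the repository's API):

```java
public class NetworkLatency {
    /**
     * Approximate client->server network latency for an RPC: the server span's
     * start timestamp minus the client span's start timestamp. A negative
     * result indicates clock skew between the two hosts.
     */
    public static long latencyMicros(long clientStartMicros, long serverStartMicros) {
        return serverStartMicros - clientStartMicros;
    }
}
```

In practice, clock skew between hosts can dominate the measurement, so the raw difference may need clamping or skew correction before being aggregated per service (or IP address) tuple.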

missing some dependencies

I am following this article: https://medium.com/jaegertracing/jaeger-data-analytics-with-jupyter-notebooks-b094fa7ab769
Running the third cell, i.e. the Trace DSL with Apache Gremlin, throws the following errors.

|   TraceTraversal<Vertex, Vertex> traversal = graph.traversal(TraceTraversalSource.class)
cannot find symbol
  symbol:   class TraceTraversal

|   TraceTraversal<Vertex, Vertex> traversal = graph.traversal(TraceTraversalSource.class)
cannot find symbol
  symbol:   class Vertex

|   TraceTraversal<Vertex, Vertex> traversal = graph.traversal(TraceTraversalSource.class)
cannot find symbol
  symbol:   class Vertex

|   TraceTraversal<Vertex, Vertex> traversal = graph.traversal(TraceTraversalSource.class)
cannot find symbol
  symbol:   class TraceTraversalSource

|           .repeat(__.out())
cannot find symbol
  symbol:   variable __

|           .until(__.hasName("SQL SELECT"));
cannot find symbol
  symbol:   variable __
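The "cannot find symbol" errors suggest the notebook cell is missing imports, or that the jar providing those classes is not on the kernel's classpath. A sketch of the imports that would resolve the symbols — the exact package paths are assumptions based on the repository layout and TinkerPop conventions, so verify them against the jar:

```
import org.apache.tinkerpop.gremlin.structure.Vertex;
// Trace DSL classes; package path assumed from the repository layout:
import io.jaegertracing.analytics.gremlin.TraceTraversal;
import io.jaegertracing.analytics.gremlin.TraceTraversalSource;
// For a TinkerPop DSL with custom steps such as hasName(), the __ helper is
// typically the DSL's generated __ class, not the core gremlin one (assumed):
import io.jaegertracing.analytics.gremlin.__;
```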

cc: @jpkrohling

Grafana dashboard for trace quality metrics

Provide a Grafana dashboard for trace quality metrics. Preferably a dashboard per service.

  • visualize each individual metric
  • calculate a single KPI, e.g. 1.0 if all checks pass

Handle multiple parents/references

The internal span model in this repository holds only a single parent reference. When the model is created, we should parse the references array and update all metrics that use parent references accordingly.

Monitoring mixins

From https://github.com/jaegertracing/jaeger-analytics-java/pull/22/files/9e685d50c76bdf9c61a4917044ecf39a5b5de3cd#diff-f80ad592b359d09d6eef8e113412a26b: use monitoring mixins for Grafana dashboards and Prometheus alerts. At the moment we host only dashboards; once the complexity rises we should consider using mixins.

A monitoring mixin is a package of configuration containing Prometheus alerts, Prometheus recording rules and Grafana dashboards.

https://docs.google.com/document/d/1A9xvzwqnFVSOZ5fD3blKODXfsat5fg6ZhnKu9LK3lB4/view#
https://docs.google.com/document/d/1oXfthGcAOMriy7PEqrq_E8ecz1U_Jyn3QYqEWoHN7S8/edit#heading=h.rfr677ib684e

Streaming dependency items

Requirement - what kind of business use case are you trying to solve?

Create a production-grade implementation of the System Architecture feature that can build dependency items from a continuous stream of spans consumed from Kafka.

Problem - what in Jaeger blocks you from solving the requirement?

It is worth recognizing that there is an existing means of computing dependency items from spans: spark-dependencies. That solution runs a single query over the spans in a backing store, then builds and bulk-loads the dependency items back into the store.

However, to maintain an accurate count of edges (call counts) between services and an up-to-date topology, a streaming solution would be more suitable while also removing the need to manage cron jobs and date boundaries.

Proposal - what do you suggest to solve the problem or improve the existing situation?

We currently have a streaming solution for the System Architecture feature running in our (Logz.io) production environment. It is based on the Kafka Streams library, which could be integrated into the existing Jaeger architecture with Kafka as an intermediate buffer.

The following illustrates how we currently implement the System Architecture feature, courtesy of the engineer of this solution, @PropAnt:

Screen Shot 2021-06-15 at 9 53 15 pm

Dependency items are written back to Kafka so that the Kafka Streams application can write processed data efficiently, without being limited by back-pressure from the dependency items' backing store.

High-level description of the Kafka Streams topology:

Screen Shot 2021-06-15 at 9 53 36 pm

We propose to open-source and contribute our implementation, which already works in production.

Any open questions to address

There are a few ways to design the solution:

  • A single module with a Kafka Streams application can be added to calculate dependency items and stream them to a separate Kafka topic for further ingestion. As above, this approach is tried and tested in our production environment, and the code is available to open-source (more or less) as-is.
  • Another approach is to structure the code so that the business logic and data models live in separate modules. The advantage is that Jaeger would gain a streaming-framework-agnostic implementation of the System Architecture feature. The trade-off is the additional effort required to restructure the existing code to be agnostic to the streaming framework.

Calculate service height metric

Calculate the service height metric for each service in a trace. Choose the highest value for the metric and label the metric with the service name.

Service height: the number of service hops from a service to a leaf service
Service depth: the number of service hops from a service to the root service
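A minimal sketch of the two definitions on a toy parent/children representation; the node type and maps are illustrative, not the repository's span model:

```java
import java.util.List;
import java.util.Map;

public class ServiceHops {
    /** Height: hops from node to its farthest leaf (a leaf itself has height 0). */
    public static int height(Map<String, List<String>> children, String node) {
        int max = -1;
        for (String child : children.getOrDefault(node, List.of())) {
            max = Math.max(max, height(children, child));
        }
        return max + 1; // no children -> -1 + 1 = 0
    }

    /** Depth: hops from the root down to node (the root itself has depth 0). */
    public static int depth(Map<String, String> parent, String node) {
        int d = 0;
        for (String p = parent.get(node); p != null; p = parent.get(p)) {
            d++;
        }
        return d;
    }
}
```

For a chain A -> B -> C, A has height 2 and depth 0, while C has height 0 and depth 2.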

Trace quality metric for messaging middleware

We could calculate a trace quality metric for messaging middleware, similar to how we calculate the "has client/server tags" checks.

For instance, the metric could fail if a producer has no consumer, and pass if there is at least one consumer.
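A sketch of such a check, assuming producer and consumer spans have already been counted per some grouping key such as topic or destination (the class, method, and key names are illustrative):

```java
import java.util.Map;

public class MessagingQuality {
    /**
     * Passes if every key (e.g. topic) that has producer spans also has at
     * least one consumer span; fails as soon as one producer is unmatched.
     */
    public static boolean passes(Map<String, Integer> producerCounts,
                                 Map<String, Integer> consumerCounts) {
        return producerCounts.keySet().stream()
                .allMatch(key -> consumerCounts.getOrDefault(key, 0) > 0);
    }
}
```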

Anomaly metrics

Brainstorm a set of metrics that lend themselves to anomaly detection, in the broad sense of anything that can improve performance, security, and so on.

For instance:

  • RPC call within the same service
  • dangerous SQL queries
  • dangerous patterns in URL

Calculate network hops

Similar to service depth (#6), but this one focuses on network hops. The difference: for A->network->A->network->A, #6 produces 1, whereas here it should produce 2.
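A sketch of the distinction, assuming a trace path can be flattened into the ordered list of services visited, one entry per call target (the interpretation of #6 as a distinct-service count is itself an assumption):

```java
import java.util.LinkedHashSet;
import java.util.List;

public class NetworkHops {
    /** Network hops: every remote call counts, even between two instances of the same service. */
    public static int networkHops(List<String> callPath) {
        return Math.max(0, callPath.size() - 1);
    }

    /** Service-level view as in #6 (interpretation assumed): repeated visits to the same service collapse. */
    public static int distinctServices(List<String> callPath) {
        return new LinkedHashSet<>(callPath).size();
    }
}
```

For A->network->A->network->A the flattened path is [A, A, A]: networkHops returns 2 while the service-level count is 1, matching the example above.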
