jaegertracing / jaeger-analytics-java
Data analytics pipeline and models for tracing data
License: Apache License 2.0
Describe the bug
Trying to establish a gRPC connection to port 16686 fails with the current jaegertracing/all-in-one:1.24.0, as that port does not serve HTTP/2 and presumably not gRPC. As documented in https://www.jaegertracing.io/docs/1.24/deployment/#query-service--ui, the gRPC query functionality is on port 16685 instead of 16686.
I created a patch at hacst@7749594 (untested right now) but I'm unsure how involved this would be to get landed.
To Reproduce
Steps to reproduce the behavior:
Any operation on the QueryServiceBlockingStub which initiates a request will fail with a failure to establish the underlying HTTP2 channel:
JaegerAllInOne jaeger = new JaegerAllInOne("jaegertracing/all-in-one:1.24.0");
jaeger.start();
jaeger.createBlockingQueryService().findTraces(...);
Expected behavior
Query completes without any connectivity issues
Create a grafana dashboard to show other derived metrics per service.
To convert from the protobuf io.jaegertracing.api_v2.Model.Span to io.jaegertracing.analytics.model.Span, the class io.jaegertracing.analytics.model.Converter offers a utility method Span toSpan(Model.Span), which extracts and converts the data from the protobuf Model.Span and writes it to a custom Span that is later used to create a graph.
It seems, however, that toSpan() only copies the nanosecond fraction of the duration and completely ignores the seconds data field:
span.durationMicros = protoSpan.getDuration().getNanos() / 1000;
According to the documentation, a duration consists of a seconds part and a nanos part, the latter representing a fraction of a second at nanosecond resolution in the range -999,999,999 to +999,999,999 inclusive. nanos alone can therefore only store durations of less than a second.
The same applies to Timestamp and to any code that considers only the nanos part of a Timestamp.
Can you confirm that this is an issue?
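A conversion that accounts for both fields might look like the following. This is a minimal sketch: the real Converter operates on protobuf's Duration, which is represented here only by its two fields.

```java
public class DurationConversion {
    // Convert a protobuf-style Duration (seconds + nanos) to microseconds.
    // The code quoted above uses only getNanos(), silently dropping whole seconds.
    static long toMicros(long seconds, int nanos) {
        return seconds * 1_000_000L + nanos / 1_000;
    }

    public static void main(String[] args) {
        // A 2.5 s duration: the nanos-only version would report 500000 µs.
        System.out.println(toMicros(2, 500_000_000)); // prints 2500000
    }
}
```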
I am using Jaeger from ephemeral but long running (minutes to days) processes to trace execution of engineering workflows. The tracing part is working great.
I would like to additionally track metrics from these processes. Since Prometheus is notoriously bad at handling ephemeral processes, and since Jaeger already provides a high performance, reliable, and scalable data path for the trace data, I would like to collect the metrics on the server side much along the lines of https://medium.com/jaegertracing/data-analytics-with-jaeger-aka-traces-tell-us-more-973669e6f848 . However, I would prefer to not add the additional requirements of running and maintaining Kafka.
I have created a prototype gRPC storage plugin which accepts trace data, but does not handle read operations. Since Jaeger allows multiple storage plug-ins but only reads from the first, it can be installed behind Cassandra or Elasticsearch plug-ins.
This plug-in uses the Golang Prometheus client to provide metrics on the spans that it sees. Currently it is hardcoded to collect the metrics that I particularly need and is not generic.
The metrics I am currently collecting do not require assimilating multiple spans. The main ones we are looking to get are average duration, run count, and failure count for particular span types. For us, our durations are long so latency effects between trace spans aren't that interesting. I am converting some, but not all, of the span tags into labels so that I can issue the required queries out of Prometheus.
One difficulty with this solution is that Prometheus expects each collection target to be definitive. With Jaeger scalability, the collector can be replicated. Prometheus currently expects that a single scrape target contains all the values for a particular time series/label combination. It has no ability to sum or aggregate values from different scrape targets that match, even with the honor_labels option (if you try, it ends up flopping back and forth between the values each scrape target provides). Without honor_labels, you can easily write labels for the actual source instance/ip and write queries to sum the results however you want, but there is a significant implication for the Prometheus time-series storage. In my case, if I have n computers reporting traces and m replicas of the jaeger-collector, I'll end up with n*m time series in Prometheus' storage.
Calculate network latency between client and server spans. The latencies could be provided for service tuples (or IP address tuples).
How to calculate this?
server span ts - client span ts
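Assuming reasonably synchronized clocks, the subtraction above can be sketched as follows (field names are illustrative, not from the repo):

```java
public class NetworkLatency {
    // Network latency ~= server span start - client span start.
    // Assumes client and server clocks are synchronized; clock skew shows up
    // directly in the result and can even make it negative.
    static long latencyMicros(long clientStartMicros, long serverStartMicros) {
        return serverStartMicros - clientStartMicros;
    }

    public static void main(String[] args) {
        System.out.println(latencyMicros(1_000_000L, 1_000_250L)); // prints 250
    }
}
```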
Created from #31 (comment)
For each metric document why it is useful, and what problem it solves.
This might require using Testcontainers to run the Jaeger query service.
For instance, move print dfs to a separate class.
I am following this article: https://medium.com/jaegertracing/jaeger-data-analytics-with-jupyter-notebooks-b094fa7ab769
Running the third cell, i.e. the Trace DSL with Apache Gremlin, throws this error:
| TraceTraversal<Vertex, Vertex> traversal = graph.traversal(TraceTraversalSource.class)
cannot find symbol
symbol: class TraceTraversal
| TraceTraversal<Vertex, Vertex> traversal = graph.traversal(TraceTraversalSource.class)
cannot find symbol
symbol: class Vertex
| TraceTraversal<Vertex, Vertex> traversal = graph.traversal(TraceTraversalSource.class)
cannot find symbol
symbol: class Vertex
| TraceTraversal<Vertex, Vertex> traversal = graph.traversal(TraceTraversalSource.class)
cannot find symbol
symbol: class TraceTraversalSource
| .repeat(__.out())
cannot find symbol
symbol: variable __
| .until(__.hasName("SQL SELECT"));
cannot find symbol
symbol: variable __
cc @jpkrohling
Provide a Grafana dashboard for trace quality metrics. Preferably a dashboard per service.
When the kernel in the spark-runner notebook is restarted and run, it gives the following error:
| System.out.println(org.apache.spark.SparkConf.class);
cannot find symbol
symbol: class SparkConf
Migrate or integrate with https://github.com/jaegertracing/jaeger-analytics-flink/tree/master/tracequality-job/src/main/java/io/jaegertracing/tracequality/score
It currently defines:
- ClientVersion - a span contains a client version and the version is higher than or equal to the required one
- HasClientServerSpans - a trace has client and server spans
- UniqueSpanIds - all spans contain unique span ids.
The internal span model in this repository has only one reference to the parent. When the model is being created we should parse the reference array and appropriately update all metrics that use parent references.
Provide yaml manifest with deployment for spark streaming.
From #22 (comment)
At the moment the service name has to be specified in the text input. Pre-populating existing service names would simplify the user experience.
From https://github.com/jaegertracing/jaeger-analytics-java/pull/22/files/9e685d50c76bdf9c61a4917044ecf39a5b5de3cd#diff-f80ad592b359d09d6eef8e113412a26b use monitoring mixins for Grafana dashboards and Prometheus alerts. At the moment we host only dashboards; once the complexity rises we should consider using mixins.
A monitoring mixin is a package of configuration containing Prometheus alerts, Prometheus recording rules and Grafana dashboards.
https://docs.google.com/document/d/1A9xvzwqnFVSOZ5fD3blKODXfsat5fg6ZhnKu9LK3lB4/view#
https://docs.google.com/document/d/1oXfthGcAOMriy7PEqrq_E8ecz1U_Jyn3QYqEWoHN7S8/edit#heading=h.rfr677ib684e
Package models/extraction functions as a fatjar with spark and publish it as a docker image.
Create a production-grade implementation of System Architecture Feature that is able to build dependency items from a continuous stream of spans consumed from Kafka.
It is worth recognizing that there is an existing means of computing dependency items from spans through spark-dependencies. This solution is based on a single query on spans in a backing store that builds and bulk-loads the dependency items into the backing store.
However, to maintain an accurate count of edges (call counts) between services and an up-to-date topology, a streaming solution would be more suitable while also removing the need to manage cron jobs and date boundaries.
We currently have a streaming solution for the System Architecture feature running in our (Logz.io) production environment. It is based on the Kafka Streams library, which could be integrated into the existing Jaeger architecture with Kafka as an intermediate buffer.
The following illustrates how we have implemented the System Architecture feature currently, courtesy of the engineer of this solution @PropAnt:
The reason dependency items are written back to Kafka is to allow the Kafka Streams application to write processed data efficiently without being limited by back-pressure from the dependency items backing store.
High-level description of the Kafka streams topology:
We propose to adopt and open source our implementation that already works in production.
There are a few ways to design the solution:
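Whatever engine is ultimately chosen, the core of such a topology is counting caller-to-callee edges between services. A minimal in-memory sketch of that aggregation follows; the class and service names are illustrative and this is not the Logz.io implementation:

```java
import java.util.*;

public class DependencyLinkCounter {
    // Count calls between service pairs, as a streaming job would do per window.
    // Each edge is a (parentService, childService) pair extracted from spans.
    static Map<String, Long> countEdges(List<String[]> edges) {
        Map<String, Long> counts = new TreeMap<>(); // sorted keys for stable output
        for (String[] e : edges) {
            counts.merge(e[0] + "->" + e[1], 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String[]> edges = List.of(
                new String[]{"frontend", "cart"},
                new String[]{"frontend", "cart"},
                new String[]{"cart", "db"});
        System.out.println(countEdges(edges)); // {cart->db=1, frontend->cart=2}
    }
}
```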
This is helpful for using tracing in e2e tests. It could be moved to a separate module or test-jar.
Calculate metrics showing the number of services this service is calling.
For instance A calls B and C. Then the result should be 2.
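This fan-out metric can be computed by collecting the distinct callees per service. A small sketch, with service names taken from the example above:

```java
import java.util.*;

public class ServiceFanOut {
    // Number of distinct services a given service calls.
    // Duplicate calls to the same callee count once, hence the Set.
    static int fanOut(String service, List<String[]> calls) {
        Set<String> callees = new HashSet<>();
        for (String[] c : calls) {
            if (c[0].equals(service)) {
                callees.add(c[1]);
            }
        }
        return callees.size();
    }

    public static void main(String[] args) {
        // A calls B (twice) and C: fan-out of A is 2.
        List<String[]> calls = List.of(
                new String[]{"A", "B"},
                new String[]{"A", "B"},
                new String[]{"A", "C"});
        System.out.println(fanOut("A", calls)); // prints 2
    }
}
```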
Calculate service height metric for each service in a trace. Choose the highest value for the metric and label metric with service name
Service height: number of service hops from a service to leaf service
Service depth: number of service hops from a service to root service
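Given the call graph of a trace, the height can be computed by recursing from a service down to its leaves. A sketch, assuming the per-trace call graph is acyclic:

```java
import java.util.*;

public class ServiceHeight {
    // Height: number of service hops from a service down to its deepest leaf.
    // A leaf service (no outgoing calls) has height 0.
    static int height(String service, Map<String, List<String>> calls) {
        int h = 0;
        for (String child : calls.getOrDefault(service, List.of())) {
            h = Math.max(h, 1 + height(child, calls));
        }
        return h;
    }

    public static void main(String[] args) {
        // A -> B -> C: height(A) = 2.
        Map<String, List<String>> calls = Map.of(
                "A", List.of("B"),
                "B", List.of("C"));
        System.out.println(height("A", calls)); // prints 2
    }
}
```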
We could calculate a trace quality metric for messaging middleware in a similar way to how we calculate the has-client/server-tags metric.
For instance, we could fail the metric if a producer has no consumer, and pass it if there are one or more consumers.
Brainstorm a set of metrics that can lead to anomaly detection - anomaly detection in the sense of anything which can improve performance, security, etc.
For instance:
Currently, the only documentation we seem to have for the Jupyter directory is this blog post:
https://medium.com/jaegertracing/jaeger-data-analytics-with-jupyter-notebooks-b094fa7ab769
It would be great to have a readme file on how to get started with the current notebooks, and/or how to create new notebooks.
Configure releases to Maven Central.
Calculate maximum service depth in a trace e.g. maximum network hops in a trace.
TraceDepth calculates the maximum span depth.