Comments (14)
Is there some way to modify the timeout on the internal OTel gRPC calls?
You can increase the timeout period by setting TimeoutMilliseconds
Thanks! Even if OTel's export timeout is increased, will it get applied to the timeout used by the GrpcClient itself?
Yes, it is used to set the deadline of the gRPC call we make.
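As a rough illustration of the setting discussed above, the sketch below raises the OTLP exporter timeout when registering a metrics exporter. The meter name and the 30-second value are assumptions chosen for the example, not recommendations from this thread; the endpoint is the placeholder from the log excerpt.

```csharp
using System;
using OpenTelemetry;
using OpenTelemetry.Exporter;
using OpenTelemetry.Metrics;

// Assumes the OpenTelemetry and OpenTelemetry.Exporter.OpenTelemetryProtocol packages.
using var meterProvider = Sdk.CreateMeterProviderBuilder()
    .AddMeter("MyCompany.MyApp") // hypothetical meter name
    .AddOtlpExporter(options =>
    {
        options.Endpoint = new Uri("https://myendpoint.com:4317/"); // placeholder endpoint
        options.Protocol = OtlpExportProtocol.Grpc;
        // Arbitrary example value; per the reply above, this feeds the gRPC call deadline.
        options.TimeoutMilliseconds = 30000;
    })
    .Build();
```

A longer timeout only gives a slow export more time to finish; it does not explain why the export is slow in the first place.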
from opentelemetry-dotnet.
Please see if you can get internal logs (Warning and above) https://github.com/open-telemetry/opentelemetry-dotnet/blob/main/src/OpenTelemetry/README.md#self-diagnostics
Are you missing all metrics from the server, or just a subset of metrics? (There are metric cardinality caps implemented, which could explain the behavior, but if every metric stops at the same time, that is unlikely to be the cause.)
Are metrics missing in Console too?
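One way to answer the Console question is to register a Console exporter next to the OTLP exporter, so the same measurements are printed locally. This is a minimal sketch that assumes the OpenTelemetry.Exporter.Console package is referenced; the meter name is hypothetical.

```csharp
using OpenTelemetry;
using OpenTelemetry.Metrics;

using var meterProvider = Sdk.CreateMeterProviderBuilder()
    .AddMeter("MyCompany.MyApp") // hypothetical meter name
    .AddConsoleExporter()        // prints collected metrics to stdout
    .AddOtlpExporter()           // keeps the existing OTLP export path
    .Build();
```

If metrics keep appearing on the console while the collector stops receiving them, the problem is more likely in the export path than in the instruments themselves.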
from opentelemetry-dotnet.
Is there some way to modify the timeout on the internal OTel gRPC calls?
You can increase the timeout period by setting TimeoutMilliseconds
If a batch is lost due to a gRPC timeout, it's not saved for retry later; instead the next batch is tried.
This is correct, even if retries are enabled. See #5436.
from opentelemetry-dotnet.
are observables actually queried during the exporter's network call?
No. (If you were using Prometheus scraping, the observable callbacks would run in response to each scrape request, so they would contribute to the response time of the scrape itself.) In short, observable callbacks are not at play in your case, as you are using a push exporter. (Sorry I confused you with the mention of observables :( )
from opentelemetry-dotnet.
Okay, we found a newly failing server and the diagnostic log helped a great deal. It appears we are timing out:
Exporter failed send data to collector to {0} endpoint. Data will not be sent. Exception: {1}{https://myendpoint.com:4317/}{Grpc.Core.RpcException: Status(StatusCode="DeadlineExceeded", Detail="")
Here's the full entry, repeated every minute:
2024-07-04T12:43:18.7570236Z:Exporter failed send data to collector to {0} endpoint. Data will not be sent. Exception: {1}{https://myendpoint.com:4317/}{Grpc.Core.RpcException: Status(StatusCode="DeadlineExceeded", Detail="")
at Grpc.Net.Client.Internal.HttpClientCallInvoker.BlockingUnaryCall[TRequest,TResponse](Method`2 method, String host, CallOptions options, TRequest request)
at Grpc.Core.Interceptors.InterceptingCallInvoker.<BlockingUnaryCall>b__3_0[TRequest,TResponse](TRequest req, ClientInterceptorContext`2 ctx)
at Grpc.Core.ClientBase.ClientBaseConfiguration.ClientBaseConfigurationInterceptor.BlockingUnaryCall[TRequest,TResponse](TRequest request, ClientInterceptorContext`2 context, BlockingUnaryCallContinuation`2 continuation)
at Grpc.Core.Interceptors.InterceptingCallInvoker.BlockingUnaryCall[TRequest,TResponse](Method`2 method, String host, CallOptions options, TRequest request)
at OpenTelemetry.Proto.Collector.Metrics.V1.MetricsService.MetricsServiceClient.Export(ExportMetricsServiceRequest request, CallOptions options)
at OpenTelemetry.Exporter.OpenTelemetryProtocol.Implementation.ExportClient.OtlpGrpcMetricsExportClient.SendExportRequest(ExportMetricsServiceRequest request, DateTime deadlineUtc, CancellationToken cancellationToken)}
So now I have two questions:
- Is there some way to modify the timeout on the internal OTel gRPC calls?
- Why might this be happening after a week or so? Once we do time out, is this likely to get worse due to the queuing of new metrics? (i.e., is there any chance of recovering from this condition?)
Thank you again for your help!
from opentelemetry-dotnet.
Is there some way to modify the timeout on the internal OTel gRPC calls?
For gRPC, there are not many options to customize. There are open issues/PRs for related settings that would expose this, e.g. #2009.
Why might this be happening after a week or so? Once we do timeout, is this likely to get worse due to the queuing of new metrics? (Ie. is there any chance of recovering from this condition?)
I don't think any queuing occurs today. If a batch is lost due to a gRPC timeout, it's not saved for retry later; instead the next batch is tried.
(@vishweshbankwar to keep me honest here.)
from opentelemetry-dotnet.
Is there some way to modify the timeout on the internal OTel gRPC calls?
You can increase the timeout period by setting TimeoutMilliseconds
Thanks! Even if OTel's export timeout is increased, will it get applied to the timeout used by the GrpcClient itself?
from opentelemetry-dotnet.
Thanks! It looks like #1735 is still open, which states that we don't really enforce the timeouts, but I could be wrong (or it's only for traces!).
from opentelemetry-dotnet.
You can increase the timeout period by setting TimeoutMilliseconds
Great info! However, it looks like the default is 10s, so it worries me that we're exceeding that -- especially if each call only includes the latest metrics. I could increase it to 20s or 30s, but I wonder if I'm just doing something wrong. Do you have any suggestions for diagnosing why my sends are exceeding 10s, or just how much data I'm sending? Or is this more likely a network issue between the servers and the collector?
Thanks again, and feel free to close this issue if you feel you've provided all the info you can!
from opentelemetry-dotnet.
especially if each call only includes the latest metrics.
This is only true if using Delta. If using Cumulative, then everything from start will always be exported... Are you using Delta or Cumulative?
Also, are there many Observable instruments with callbacks potentially taking a lot of time?
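To check whether an observable callback is slow, one option is to time it inside the callback itself. The sketch below uses only System.Diagnostics types; the meter name, instrument name, dictionary, and 100 ms threshold are all hypothetical values chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Diagnostics.Metrics;

var meter = new Meter("MyCompany.MyApp");        // hypothetical meter name
var sessions = new Dictionary<string, int>();    // hypothetical state being observed

meter.CreateObservableGauge("sessions.count", () =>
{
    // Time the observation so that slow callbacks show up in the logs.
    var stopwatch = Stopwatch.StartNew();
    int value = sessions.Count;
    stopwatch.Stop();

    if (stopwatch.ElapsedMilliseconds > 100)     // arbitrary threshold
    {
        Console.Error.WriteLine(
            $"Slow observable callback: {stopwatch.ElapsedMilliseconds} ms");
    }

    return value;
});
```

The callback only runs when something collects from the meter, for example a configured MeterProvider, so nothing is logged until a reader is attached.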
from opentelemetry-dotnet.
We are indeed using Delta mode:
reader.TemporalityPreference = MetricReaderTemporalityPreference.Delta;
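For context, a line like the one above typically sits inside the two-argument AddOtlpExporter callback. The following is a sketch under that assumption; the meter name is hypothetical, and the 60-second interval is only inferred from the log entry repeating every minute.

```csharp
using OpenTelemetry;
using OpenTelemetry.Metrics;

using var meterProvider = Sdk.CreateMeterProviderBuilder()
    .AddMeter("MyCompany.MyApp") // hypothetical meter name
    .AddOtlpExporter((exporter, reader) =>
    {
        // Delta: each export carries only the changes since the previous export.
        reader.TemporalityPreference = MetricReaderTemporalityPreference.Delta;

        // Assumed export interval, matching the once-a-minute log entries.
        reader.PeriodicExportingMetricReaderOptions.ExportIntervalMilliseconds = 60000;
    })
    .Build();
```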
We have a handful of observables. You're suggesting that the time it takes to observe those metrics must be accounted for in the gRPC deadline? That's interesting. We've tried to make those calls quick, but it's certainly something we could take a closer look at -- that could also explain why our servers never recover from this condition.
Any other ideas are most welcome, and thank you again for all the help!
from opentelemetry-dotnet.
@ladenedge - Just to confirm, you don't have retries enabled, correct?
It's odd that once the server hits DeadlineExceeded, it cannot recover and keeps throwing that error until restarted.
from opentelemetry-dotnet.
I assume you're talking about retries via the HttpClient? If so, then no, I'm using the default factory.
from opentelemetry-dotnet.
Also, to follow up on the observables: are observables actually queried during the exporter's network call? Looking over our handful of observable counters, they appear quick (e.g. Dictionary.Count) -- but is it even possible that they contribute to the missed deadline?
from opentelemetry-dotnet.