Comments (14)

vishweshbankwar commented on August 16, 2024

Is there some way to modify the timeout on the internal OTel gRPC calls?
You can increase the timeout period by setting TimeoutMilliseconds

Thanks! Even if OTel's export timeout is increased, will it get applied to the timeout used by the GrpcClient itself?

Yes - it is used to set the deadline of the gRPC call we make here.
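
A minimal sketch of raising that timeout (the meter name and endpoint below are placeholders; TimeoutMilliseconds defaults to 10000, i.e. 10 s):

    using System;
    using OpenTelemetry;
    using OpenTelemetry.Metrics;

    using var meterProvider = Sdk.CreateMeterProviderBuilder()
        .AddMeter("MyService") // placeholder meter name
        .AddOtlpExporter(options =>
        {
            options.Endpoint = new Uri("https://myendpoint.com:4317");
            options.TimeoutMilliseconds = 30_000; // default is 10_000; also used as the gRPC deadline
        })
        .Build();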

cijothomas commented on August 16, 2024

Please see if you can get internal logs (Warning and above) https://github.com/open-telemetry/opentelemetry-dotnet/blob/main/src/OpenTelemetry/README.md#self-diagnostics
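
For reference, a minimal self-diagnostics configuration is an OTEL_DIAGNOSTICS.json file placed in the process's current directory, roughly along these lines (FileSize is in KiB; the linked README is authoritative):

    {
        "LogDirectory": ".",
        "FileSize": 32768,
        "LogLevel": "Warning"
    }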

Are you missing all metrics from the server, or just a subset of metrics? (There are metric cardinality caps in place that could explain the behavior, but if every metric stops at the same time, they are unlikely to be the cause.)
Are the metrics missing from the Console exporter too?

vishweshbankwar commented on August 16, 2024

Is there some way to modify the timeout on the internal OTel gRPC calls?

You can increase the timeout period by setting TimeoutMilliseconds

If a batch is lost due to a gRPC timeout, it's not saved for a later retry; instead the next batch is tried.

This is correct, even if retries are enabled. #5436

cijothomas commented on August 16, 2024

are observables actually queried during the exporter's network call?

No. (If you were using Prometheus scraping, then the observable callbacks would be invoked in response to the scrape request, so they would contribute to the response time of the scrape itself.) In short, observable callbacks are not at play in your case, as you are using a push exporter. (Sorry I confused you with the mention of observables :( )
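
To make the timing concrete, here is an illustrative observable instrument (names are placeholders) whose callback runs when the SDK's MetricReader collects, e.g. on each periodic export cycle or Prometheus scrape, rather than inside the exporter's outbound gRPC call:

    using System.Collections.Concurrent;
    using System.Diagnostics.Metrics;

    var meter = new Meter("MyService"); // placeholder meter name
    var cache = new ConcurrentDictionary<string, object>();

    // Invoked at collection time, before the batch is handed to the exporter.
    meter.CreateObservableGauge(
        "cache.item_count",
        () => cache.Count,
        description: "Number of items currently held in the in-process cache.");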

ladenedge commented on August 16, 2024

Okay, we found a newly failing server and the diagnostic log helped a great deal. It appears we are timing out:

Exporter failed send data to collector to {0} endpoint. Data will not be sent. Exception: {1}{https://myendpoint.com:4317/}{Grpc.Core.RpcException: Status(StatusCode="DeadlineExceeded", Detail="")

Here's the full entry, repeated every minute:

2024-07-04T12:43:18.7570236Z:Exporter failed send data to collector to {0} endpoint. Data will not be sent. Exception: {1}{https://myendpoint.com:4317/}{Grpc.Core.RpcException: Status(StatusCode="DeadlineExceeded", Detail="")
   at Grpc.Net.Client.Internal.HttpClientCallInvoker.BlockingUnaryCall[TRequest,TResponse](Method`2 method, String host, CallOptions options, TRequest request)
   at Grpc.Core.Interceptors.InterceptingCallInvoker.<BlockingUnaryCall>b__3_0[TRequest,TResponse](TRequest req, ClientInterceptorContext`2 ctx)
   at Grpc.Core.ClientBase.ClientBaseConfiguration.ClientBaseConfigurationInterceptor.BlockingUnaryCall[TRequest,TResponse](TRequest request, ClientInterceptorContext`2 context, BlockingUnaryCallContinuation`2 continuation)
   at Grpc.Core.Interceptors.InterceptingCallInvoker.BlockingUnaryCall[TRequest,TResponse](Method`2 method, String host, CallOptions options, TRequest request)
   at OpenTelemetry.Proto.Collector.Metrics.V1.MetricsService.MetricsServiceClient.Export(ExportMetricsServiceRequest request, CallOptions options)
   at OpenTelemetry.Exporter.OpenTelemetryProtocol.Implementation.ExportClient.OtlpGrpcMetricsExportClient.SendExportRequest(ExportMetricsServiceRequest request, DateTime deadlineUtc, CancellationToken cancellationToken)}

So now I have two questions:

  1. Is there some way to modify the timeout on the internal OTel gRPC calls?
  2. Why might this be happening after a week or so? Once we do timeout, is this likely to get worse due to the queuing of new metrics? (I.e., is there any chance of recovering from this condition?)

Thank you again for your help!

cijothomas commented on August 16, 2024

Is there some way to modify the timeout on the internal OTel gRPC calls?

For gRPC, there isn't much option to customize. There are open issues/PRs for related settings that would allow exposing this. E.g.: #2009

Why might this be happening after a week or so? Once we do timeout, is this likely to get worse due to the queuing of new metrics? (I.e., is there any chance of recovering from this condition?)

I don't think any queuing up occurs today. If a batch is lost due to a gRPC timeout, it's not saved for a later retry; instead the next batch is tried.
(@vishweshbankwar to keep me honest here.)

cijothomas commented on August 16, 2024

Is there some way to modify the timeout on the internal OTel gRPC calls?

You can increase the timeout period by setting TimeoutMilliseconds

Thanks! Even if OTel's export timeout is increased, will it get applied to the timeout used by the GrpcClient itself?

cijothomas commented on August 16, 2024

Thanks! Looks like #1735 is still open, which states that we don't really enforce the timeouts, but I could be wrong. (Or it's only for traces!)

ladenedge commented on August 16, 2024

You can increase the timeout period by setting TimeoutMilliseconds

Great info! However, it looks like the default is 10s, so it worries me that we're exceeding that -- especially if each call only includes the latest metrics. I could increase it to 20s or 30s, but I wonder if I'm just doing something wrong. Do you have any suggestions for diagnosing why my sends are exceeding 10s, or just how much data I'm sending? Or is this more likely a network issue between the servers and the collector?

Thanks again, and feel free to close this issue if you feel you've provided all the info you can!
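
One rough way to gauge how much data each export cycle carries (assuming the OpenTelemetry.Exporter.Console package is referenced; the meter name is a placeholder) is to register a console exporter alongside the OTLP one and compare what each interval emits:

    using OpenTelemetry;
    using OpenTelemetry.Metrics;

    using var meterProvider = Sdk.CreateMeterProviderBuilder()
        .AddMeter("MyService")     // placeholder meter name
        .AddOtlpExporter()         // existing OTLP pipeline
        .AddConsoleExporter()      // temporary, for sizing the payload
        .Build();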

cijothomas commented on August 16, 2024

especially if each call only includes the latest metrics.

This is only true if using Delta. If using Cumulative, then everything from the start will always be exported... Are you using Delta or Cumulative?
Also, are there many Observable instruments with callbacks potentially taking a lot of time?

ladenedge commented on August 16, 2024

We are indeed using Delta mode:

reader.TemporalityPreference = MetricReaderTemporalityPreference.Delta;

We have a handful of observables. You're suggesting that the time it takes to observe those metrics must be accounted for in the gRPC deadline? That's interesting. We've tried to make those calls quick, but it's certainly something we could take a closer look at -- that could also explain why our servers never recover from this condition.

Any other ideas are most welcome, and thank you again for all the help!

vishweshbankwar commented on August 16, 2024

@ladenedge - Just to confirm, you don't have retries enabled, correct?

It's odd that once the server hits DeadlineExceeded, it is not able to recover and continues to throw that error until restarted.

ladenedge commented on August 16, 2024

I assume you're talking about retries via the HttpClient? If so, then no, I'm using the default factory.

ladenedge commented on August 16, 2024

Also, to follow up on the observables: are observables actually queried during the exporter's network call? Looking over our handful of observable counters, they appear quick (e.g. Dictionary.Count) -- but is it even possible that they contribute to the missed deadline?
