Comments (14)

vishweshbankwar commented on August 16, 2024

Is there some way to modify the timeout on the internal OTel gRPC calls?
You can increase the timeout period by setting TimeoutMilliseconds

Thanks! Even if OTel's export timeout is increased, will it get applied to the timeout used by the GrpcClient itself?

Yes - it is used to set the deadline of the gRPC call we make here.
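
A minimal sketch of raising that timeout (the meter name and endpoint below are placeholders; TimeoutMilliseconds defaults to 10000, i.e. 10 s):

    using System;
    using OpenTelemetry;
    using OpenTelemetry.Metrics;

    using var meterProvider = Sdk.CreateMeterProviderBuilder()
        .AddMeter("MyService") // placeholder meter name
        .AddOtlpExporter(options =>
        {
            options.Endpoint = new Uri("https://myendpoint.com:4317");
            options.TimeoutMilliseconds = 30_000; // default is 10_000; also used as the gRPC deadline
        })
        .Build();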

cijothomas commented on August 16, 2024

Please see if you can get internal logs (Warning and above) https://github.com/open-telemetry/opentelemetry-dotnet/blob/main/src/OpenTelemetry/README.md#self-diagnostics
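
For reference, a minimal self-diagnostics configuration is an OTEL_DIAGNOSTICS.json file placed in the process's current directory, roughly along these lines (FileSize is in KiB; the linked README is authoritative):

    {
        "LogDirectory": ".",
        "FileSize": 32768,
        "LogLevel": "Warning"
    }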

Are you missing all metrics from the server, or just a subset of metrics? (There are metric cardinality caps in place that could explain the behavior, but if every metric stops at the same time, they are unlikely to be the cause.)
Are the metrics missing from the Console exporter too?

vishweshbankwar commented on August 16, 2024

Is there some way to modify the timeout on the internal OTel gRPC calls?

You can increase the timeout period by setting TimeoutMilliseconds

If a batch is lost due to a gRPC timeout, it's not saved for a later retry; instead the next batch is tried.

This is correct, even if retries are enabled. #5436

cijothomas commented on August 16, 2024

are observables actually queried during the exporter's network call?

No. (If you were using Prometheus scraping, then the observable callbacks would be invoked in response to the scrape request, so they would contribute to the response time of the scrape itself.) In short, observable callbacks are not at play in your case, as you are using a push exporter. (Sorry I confused you with the mention of observables :( )
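
To make the timing concrete, here is an illustrative observable instrument (names are placeholders) whose callback runs when the SDK's MetricReader collects, e.g. on each periodic export cycle or Prometheus scrape, rather than inside the exporter's outbound gRPC call:

    using System.Collections.Concurrent;
    using System.Diagnostics.Metrics;

    var meter = new Meter("MyService"); // placeholder meter name
    var cache = new ConcurrentDictionary<string, object>();

    // Invoked at collection time, before the batch is handed to the exporter.
    meter.CreateObservableGauge(
        "cache.item_count",
        () => cache.Count,
        description: "Number of items currently held in the in-process cache.");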

ladenedge commented on August 16, 2024

Okay, we found a newly failing server and the diagnostic log helped a great deal. It appears we are timing out:

Exporter failed send data to collector to {0} endpoint. Data will not be sent. Exception: {1}{https://myendpoint.com:4317/}{Grpc.Core.RpcException: Status(StatusCode="DeadlineExceeded", Detail="")

Here's the full entry, repeated every minute:

2024-07-04T12:43:18.7570236Z:Exporter failed send data to collector to {0} endpoint. Data will not be sent. Exception: {1}{https://myendpoint.com:4317/}{Grpc.Core.RpcException: Status(StatusCode="DeadlineExceeded", Detail="")
   at Grpc.Net.Client.Internal.HttpClientCallInvoker.BlockingUnaryCall[TRequest,TResponse](Method`2 method, String host, CallOptions options, TRequest request)
   at Grpc.Core.Interceptors.InterceptingCallInvoker.<BlockingUnaryCall>b__3_0[TRequest,TResponse](TRequest req, ClientInterceptorContext`2 ctx)
   at Grpc.Core.ClientBase.ClientBaseConfiguration.ClientBaseConfigurationInterceptor.BlockingUnaryCall[TRequest,TResponse](TRequest request, ClientInterceptorContext`2 context, BlockingUnaryCallContinuation`2 continuation)
   at Grpc.Core.Interceptors.InterceptingCallInvoker.BlockingUnaryCall[TRequest,TResponse](Method`2 method, String host, CallOptions options, TRequest request)
   at OpenTelemetry.Proto.Collector.Metrics.V1.MetricsService.MetricsServiceClient.Export(ExportMetricsServiceRequest request, CallOptions options)
   at OpenTelemetry.Exporter.OpenTelemetryProtocol.Implementation.ExportClient.OtlpGrpcMetricsExportClient.SendExportRequest(ExportMetricsServiceRequest request, DateTime deadlineUtc, CancellationToken cancellationToken)}

So now I have two questions:

  1. Is there some way to modify the timeout on the internal OTel gRPC calls?
  2. Why might this be happening after a week or so? Once we do timeout, is this likely to get worse due to the queuing of new metrics? (I.e., is there any chance of recovering from this condition?)

Thank you again for your help!

cijothomas commented on August 16, 2024

Is there some way to modify the timeout on the internal OTel gRPC calls?

For gRPC, there isn't much option to customize. There are open issues/PRs for related settings that would allow exposing this. E.g.: #2009

Why might this be happening after a week or so? Once we do timeout, is this likely to get worse due to the queuing of new metrics? (I.e., is there any chance of recovering from this condition?)

I don't think any queuing up occurs today. If a batch is lost due to a gRPC timeout, it's not saved for a later retry; instead the next batch is tried.
(@vishweshbankwar to keep me honest here.)

cijothomas commented on August 16, 2024

Is there some way to modify the timeout on the internal OTel gRPC calls?

You can increase the timeout period by setting TimeoutMilliseconds

Thanks! Even if OTel's export timeout is increased, will it get applied to the timeout used by the GrpcClient itself?

cijothomas commented on August 16, 2024

Thanks! Looks like #1735 is still open, which states that we don't really enforce the timeouts, but I could be wrong. (Or it's only for traces!)

ladenedge commented on August 16, 2024

You can increase the timeout period by setting TimeoutMilliseconds

Great info! However, it looks like the default is 10s, so it worries me that we're exceeding that -- especially if each call only includes the latest metrics. I could increase it to 20s or 30s, but I wonder if I'm just doing something wrong. Do you have any suggestions for diagnosing why my sends are exceeding 10s, or just how much data I'm sending? Or is this more likely a network issue between the servers and the collector?

Thanks again, and feel free to close this issue if you feel you've provided all the info you can!
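
One rough way to gauge how much data each export cycle carries (assuming the OpenTelemetry.Exporter.Console package is referenced; the meter name is a placeholder) is to register a console exporter alongside the OTLP one and compare what each interval emits:

    using OpenTelemetry;
    using OpenTelemetry.Metrics;

    using var meterProvider = Sdk.CreateMeterProviderBuilder()
        .AddMeter("MyService")     // placeholder meter name
        .AddOtlpExporter()         // existing OTLP pipeline
        .AddConsoleExporter()      // temporary, for sizing the payload
        .Build();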

cijothomas commented on August 16, 2024

especially if each call only includes the latest metrics.

This is only true if using Delta. If using Cumulative, then everything from the start will always be exported... Are you using Delta or Cumulative?
Also, are there many Observable instruments with callbacks potentially taking a lot of time?

ladenedge commented on August 16, 2024

We are indeed using Delta mode:

reader.TemporalityPreference = MetricReaderTemporalityPreference.Delta;

We have a handful of observables. You're suggesting that the time it takes to observe those metrics must be accounted for in the gRPC deadline? That's interesting. We've tried to make those calls quick, but it's certainly something we could take a closer look at -- that could also explain why our servers never recover from this condition.

Any other ideas are most welcome, and thank you again for all the help!

vishweshbankwar commented on August 16, 2024

@ladenedge - Just to confirm, you don't have retries enabled, correct?

It's odd that once the server hits DeadlineExceeded, it is not able to recover and continues to throw that error until restarted.

ladenedge commented on August 16, 2024

I assume you're talking about retries via the HttpClient? If so, then no, I'm using the default factory.

ladenedge commented on August 16, 2024

Also, to follow up on the observables: are observables actually queried during the exporter's network call? Looking over our handful of observable counters, they appear quick (e.g. Dictionary.Count) -- but is it even possible that they contribute to the missed deadline?
