
Comments (9)

joshfree commented on May 29, 2024

@anuchandy please take a look

from azure-sdk-for-java.

anuchandy commented on May 29, 2024

Hello @neumannm,

  1. There shouldn't be a need for enableCrossEntityTransactions(). Reading and forwarding from the DLQ can still be done without it, so you can remove that setting.

  2. maxAutoLockRenewDuration(..) has no impact on messages peeked using the "peek*" API. Lock renewal matters only if the client receives messages using the "receive*" API (messages that later get a disposition such as complete or abandon). The same goes for disableAutoComplete(). Keeping these two settings doesn't hurt, though; the peek calls won't be impacted by them.


The message -

{"az.sdk.message":"AMQP response did not contain OK status code.","statusCode":"NO_CONTENT"}

means that there is no message in the queue at the moment; in this case, the SDK returns an empty list. So this is not a problem, just the service telling the SDK that there is no message to peek.


The message "not opened within.." means that the SDK cannot establish its TCP connection from the host machine to the Service Bus endpoint. The multiple occurrences of the error message

Error occurred while refreshing token that is not retriable. Not scheduling refresh task

after the connectivity error message (i.e., "not opened within..") appear because the SDK was trying to establish a TCP connection while several peekMessages calls (or other API calls) were queued up. All of these calls need a shared authentication that also depends on the TCP connection. The SDK could not open the connection after waiting for a minute, so all the queued peekMessages calls that were waiting for auth to complete received a signal indicating the connectivity issue. This is what the sequence of log messages shows.

In this case, it indicates a temporary network connectivity problem between the host running the app and Service Bus.
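The fan-out described above, where one failed connection attempt surfaces as an error on every queued call, can be modeled with plain CompletableFuture. This is only an illustrative sketch and not the SDK's actual implementation; the class name, method name, and error text are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeoutException;

// Illustrative model only; this is not how the SDK is implemented internally.
class SharedConnectionSketch {
    // Queues `pendingCalls` operations behind one shared connection attempt,
    // then fails that attempt and returns the dependent operations.
    static List<CompletableFuture<String>> failAllPending(int pendingCalls) {
        CompletableFuture<String> connection = new CompletableFuture<>();
        List<CompletableFuture<String>> calls = new ArrayList<>();
        for (int i = 0; i < pendingCalls; i++) {
            int id = i;
            // Each call can only proceed once the shared connection is open.
            calls.add(connection.thenApply(conn -> "peek-" + id + " via " + conn));
        }
        // The connection never opens, so the single failure reaches every waiter.
        connection.completeExceptionally(new TimeoutException("connection not opened in time"));
        return calls;
    }
}
```

Every future returned by failAllPending completes exceptionally, which mirrors why a single connectivity failure can show up as many error log lines.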


Regarding the last question on recommendations:

  1. One suggestion in terms of coding patterns: use a separate builder instance to build the client for each queue (i.e., 1 builder : 1 queue), rather than the current approach of sharing one builder instance across queues (i.e., 1 builder : N queues). This will make your app more resilient.
  2. I would also suggest upgrading to 7.15.1. Please refer to this documentation on the upgrade and the v2 flags to enable: Troubleshoot Azure Service Bus - Azure SDK for Java | Microsoft Learn

One question - in the application, are you opening, peeking and closing the client for every http request?


neumannm commented on May 29, 2024

Hi @anuchandy, thank you for your detailed response and the useful hints! I will adapt my code accordingly.

The message "not opened within.." means that the SDK cannot establish its TCP connection from the host machine to the Service Bus endpoint. [...] In this case, it indicates that there is a temporary network connectivity problem from host running the app to service bus.

Good to know! Might also explain why I don't see this error when testing locally, only in production (where the application runs in a Kubernetes cluster and the requests go through Azure API management... maybe there's some issue specific to this environment causing connectivity problems 🤔).

Regarding your comment:

The maxAutoLockRenewDuration(..) has no impact on message peeked using "peek*" api. The lock renew matters only if the client is used to receive messages using "receive*" api (those messages which later gets disposition (complete|abandon)).

-> we also have methods to delete or resend, where we indeed use the receive* api. So what about those - is a maxAutoLockRenewDuration of 1 minute advisable (when lock duration is also 1 minute), or should it be higher? I think I read somewhere that these durations should not be the same.

One question - in the application, are you opening, peeking and closing the client for every http request?

Well yes, for every HTTP request the peekMessages method is called, and it contains the code snippet from above. Since I'm using try-with-resources, the (AutoCloseable) receiver client should be closed after the method returns. This is what it looks like after refactoring:

try (ServiceBusReceiverClient receiver = new ServiceBusClientBuilder()
        .connectionString(connectionString)
        .configuration(new ConfigurationBuilder()
                .putProperty("com.azure.messaging.servicebus.nonSession.syncReceive.v2", "true")
                .build())
        .receiver()
        .queueName(queueName)
        .buildClient()) {
    // ... peek messages using receiver ...
}

Why are you asking?


neumannm commented on May 29, 2024

Update: I implemented all the suggestions, including updating to 7.15.1 and setting the v2 flag on the receiver client, and also creating a separate builder instance to build the client for each queue (for each request).

Unfortunately, it did not help with the problem. The connection is still lost frequently.

WARN 1 --- [ctor-executor-3] c.a.c.a.i.handler.ConnectionHandler: {"az.sdk.message":"onTransportError","connectionId":"MF_5af603_1710239185932","errorCondition":"amqp:connection:framing-error","errorDescription":"org.apache.qpid.proton.engine.TransportException: connection aborted","hostName":"my-sbns.servicebus.windows.net"}

I cannot reproduce this when running the application locally (using the same service bus connection), it only happens with the instance running in the Kubernetes cluster. Any ideas how to debug that? I tried doing
watch nc -z -v my-sbns.servicebus.windows.net 443
from another container in the same namespace, but this continuously yields
Connection to my-sbns.servicebus.windows.net (23.102.0.186) 443 port [tcp/https] succeeded!
even during the time when the timeout occurs.
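One thing worth checking alongside the nc test: to my understanding, the Service Bus client uses AMQP over TCP on port 5671 by default, and port 443 only applies when AMQP over WebSockets is configured, so a probe against 443 may not exercise the path the application uses. Below is a minimal sketch of a TCP probe that could run inside the application's own container. The class and method names are hypothetical, and a successful TCP connect says nothing about the TLS or AMQP handshake that follows.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

class TcpProbe {
    // Returns true if a TCP connection to host:port opens within timeoutMillis.
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Covers refused connections, timeouts, and unresolved host names.
            return false;
        }
    }
}
```

For example, TcpProbe.canConnect("my-sbns.servicebus.windows.net", 5671, 5000) could be logged periodically and compared against the timestamps of the connection errors.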

Sidenote: Each request now gives me a warn message
WARN 1 --- [nio-8080-exec-1] c.a.m.s.ServiceBusClientBuilder: 'enableAutoComplete' is not supported in synchronous client except through callback receive.
now that I dropped disableAutoComplete().


anuchandy commented on May 29, 2024

Hi @neumannm, thanks for the additional details. I didn't realize that client creation-disposal happens for every HTTP request.

Creating and disposing of client instances for each HTTP request is not an efficient pattern: it incurs extra costs such as connection/link negotiation, AD authentication calls, and other overhead, all of which affect networking. In common messaging scenarios (Service Bus, Event Hubs), client instances are long-lived, so the application should cache them. One approach here would be a ConcurrentHashMap with the queue name as the entry key and the client instance as the entry value. This map is scoped to the application and discarded when the application shuts down. An entry can be populated using the computeIfAbsent method with a provider that is responsible for creating a new builder and building the client from it. Each time the application needs to call peek*, it consults this global map and either obtains an already cached client or lets computeIfAbsent create and cache one.

I don't have expertise in AKS infrastructure or its low-level network debugging, but I would suggest we start with the above approach, which makes the app better suited to resource-limited container environments, and then check whether it lowers the frequency of the connection-abort logs that you noticed.

Also, I know that the application has some code for receive/delete/forward as well. If it runs in the same pod as peek, you can simplify debugging by temporarily disabling (commenting out) the receive/delete/forward code and running only peek. This will make the logs less cluttered and help you see whether the network situation improves for that peek-only run. Also disable any other workloads in the app, or other apps in the same pod.
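A minimal sketch of that caching approach follows. To keep it self-contained it is generic over any AutoCloseable client rather than tied to the SDK types; the class name QueueClientCache and its methods are hypothetical, not part of the SDK.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Application-scoped cache: one long-lived client per queue name.
class QueueClientCache<C extends AutoCloseable> {
    private final Map<String, C> clients = new ConcurrentHashMap<>();
    private final Function<String, C> factory;

    QueueClientCache(Function<String, C> factory) {
        this.factory = factory;
    }

    // Returns the cached client, creating it on first use for this queue.
    C get(String queueName) {
        return clients.computeIfAbsent(queueName, factory);
    }

    // Call once at application shutdown to release all cached clients.
    void closeAll() {
        for (C client : clients.values()) {
            try {
                client.close();
            } catch (Exception e) {
                // Best effort: keep closing the remaining clients.
            }
        }
        clients.clear();
    }
}
```

In the application from this thread, the factory would presumably be something like queueName -> new ServiceBusClientBuilder().connectionString(connectionString).receiver().queueName(queueName).buildClient(), which also gives each queue its own builder instance. The peek handler would then call cache.get(queueName) instead of opening the client in a try-with-resources block.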


A few questions:

  1. How many queues does the application monitor using the message-peek method you showed earlier?
  2. What does the HTTP request rate look like (requests per minute or per second)?

A few things to check:

  1. Are the pod (hosting the application) and the Service Bus namespace deployed to the same region? Placing the client and the service in different regions also affects networking.
  2. What are the core and memory settings for the pod that runs the Java application? Too few resources (e.g., 0.2 or 0.5 cores per pod) is another reason for frequent timeouts or stalls in constrained environments. Based on case studies of many containerized Java production apps, the OpenJDK team at Microsoft suggests no less than 1 core per pod. Here is that team's documentation on cores and memory: https://learn.microsoft.com/en-us/azure/developer/java/containers/overview#determine-how-many-cpu-cores-are-needed
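As a quick way to see what the JVM itself detects inside the pod, the standard Runtime API can be logged at startup. This is a small sketch with hypothetical class and method names; container-aware JVMs should reflect the pod's CPU and memory limits here, but the exact values depend on the JDK version and JVM flags in use.

```java
class JvmResources {
    // Number of processors the JVM believes it can use.
    static int cpus() {
        return Runtime.getRuntime().availableProcessors();
    }

    // Maximum heap the JVM will attempt to use, in MiB.
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    public static void main(String[] args) {
        System.out.println("CPUs visible to JVM: " + cpus());
        System.out.println("Max heap (MiB): " + maxHeapMiB());
    }
}
```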

Also, we can ignore the WARN about auto-complete. It's a feature available only for ProcessorClient, so the SDK is simply saying that the chosen client does not support it. You may keep disableAutoComplete() (which has no impact on the receiver client anyway) if the WARN is noise. But with the caching approach discussed above, we should see this WARN only once, when a cache entry is populated.

(Regarding the question on max-renewal, I’ll follow up)


anuchandy commented on May 29, 2024

@neumannm, checking back, did the recommendations above help with your use case?


anuchandy commented on May 29, 2024

Closing this issue, assuming that the suggestions were useful or that this is not a priority at the moment. Feel free to reach out if any assistance is needed at a later point.


neumannm commented on May 29, 2024

@anuchandy So sorry for the late reply, project work got me distracted from this issue unfortunately...

Yes, your recommendations seem to have helped. I have changed the code for the peeking receiver as you suggested, using a Map to store receivers per queue and creating each receiver only on the first request. Since then, we have not had any notable issues with peeking into our queues.

I still need to refactor the code for receiving from the DLQ and re-sending to the corresponding "normal" queue for reprocessing. I tried to do the same as with the peeking receiver, but my first attempt was not successful, which might have to do with the code being suboptimal. You mentioned that cross-entity transactions are not needed for this functionality, but I was not able to solve it without them. I hope I find time soon to dig into that part again.

You said you wanted to follow up on my question regarding max-renewal, did you find anything?


anuchandy commented on May 29, 2024

Hello @neumannm, no worries. Glad to hear that the caching pattern worked.

Regarding lock renewal for the receiver: you need to set an auto-lock-renew duration in the client only if the client application is expected to hold on to a received message longer than the lock duration set at the entity (queue, topic) level (e.g., in the Azure portal). Client-side lock renewal means extra calls to the service, so it's a tradeoff between setting a higher value in the portal and reducing client-side calls to the service.

For example, if the entity-level lock duration is currently 60 seconds but the application almost always takes ~70 seconds to process a message and call complete, consider bumping the entity-level value to ~70-80 seconds.
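The rule of thumb above can be written down as a small check, shown here as a sketch with hypothetical class and method names. It only compares the two durations; the choice between raising the entity-level lock duration and relying on maxAutoLockRenewDuration(..) remains the tradeoff described above.

```java
import java.time.Duration;

class LockRenewalSketch {
    // True when processing is expected to outlast the entity-level lock, i.e. the lock
    // would expire before complete is called unless it is renewed or the lock duration is raised.
    static boolean lockWouldExpire(Duration entityLockDuration, Duration expectedProcessingTime) {
        return expectedProcessingTime.compareTo(entityLockDuration) > 0;
    }
}
```

With a 60-second lock and ~70 seconds of processing, lockWouldExpire returns true; after raising the lock duration to 80 seconds it returns false, and no client-side renewal would be needed for that workload.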

