
Comments (25)

antempus avatar antempus commented on June 15, 2024 3

@christopheranderson what kind of information/metrics would be valuable to get more movement behind direct mode?

We've already started to batch requests to the API in lieu of singletons due to the SNAT port limitations on Azure App Service; we'd really benefit from direct mode as we move forward due to the throughput needed.

from azure-sdk-for-js.

tony-gutierrez avatar tony-gutierrez commented on June 15, 2024 3

Approaching half a year later... Any updates on direct mode?


southpolesteve avatar southpolesteve commented on June 15, 2024 2

@tony-gutierrez @antempus As Chris mentioned in the v3 issue, implementing Direct Mode is a big undertaking for us. It took 3 devs 8 months when we did it for Java.

When I work with other customers on perf problems, we often find Direct Mode is not the solution. Direct mode is primarily an optimization for latency, not throughput.

In the case of port exhaustion, it's likely Direct Mode will make it worse. Consider an API exposing a cross partition query that hits 50 partitions with maxDegreeOfParallelism = -1. At 20 concurrent API requests, node will attempt to open 1000 concurrent connections. The node default for an agent with keepAlive: true is 256 so you'll be port constrained by the agent. The queries will get slow (surfaces as latency), but the real problem is throughput.
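To make the arithmetic in that example concrete, here is a rough sketch. The function and field names are illustrative only and do not come from the SDK; the 256 cap is the figure quoted above.

```javascript
// Back-of-the-envelope estimate of socket demand for fan-out queries.
// All names here are illustrative; none come from the SDK.
function estimateSocketDemand({ partitions, concurrentRequests, maxSockets }) {
  // With full parallelism, each API request fans out to every partition.
  const demanded = partitions * concurrentRequests;
  // Requests beyond the agent's socket cap wait in the agent's queue.
  const queued = Math.max(0, demanded - maxSockets);
  return { demanded, queued };
}

const estimate = estimateSocketDemand({
  partitions: 50,
  concurrentRequests: 20,
  maxSockets: 256,
});
console.log(estimate); // { demanded: 1000, queued: 744 }
```

Roughly three quarters of the requests would be waiting for a socket in this scenario, which is why it shows up as latency even though the constraint is throughput.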

In gateway mode, all connections go to the same host so they are much easier for the agent to reuse. Direct mode will tax the agent even more since each connection can only be reused for the exact same partition. You'll see a lot more stale connections pruned from the server-side and pay more costs for reconnection/TLS/etc.

Some ideas on how to get more perf out of the current SDK:

  1. Increase agent maxFreeSockets. I worked with a customer this week who did it and saw huge improvements. I understand Azure App Service doesn't have great limits here. You'll have to bring that up with their team.
  2. Turn off keepAlive. If your traffic is spiky it might not be benefiting you much. Node's default maxSockets without keepAlive is infinite. Likely this increases your p50 query latency but may help throughput.
  3. Tune maxItemCount. If omitted, the backend will pick its own which may not be the best for your scenario.
  4. Tune maxDegreeOfParallelism. "-1" is full parallelism and probably not the best option. Can easily exhaust ports or CPU.
  5. Page buffering. This is on our end. We are working on it: Azure/azure-cosmos-js#397
  6. Queue writes on your end. Maybe we could add SDK features to help here? I am open to ideas.
  7. Avoid cross partition queries altogether. Or at least avoid scenarios where they are executed concurrently on the same box. Caveat: If you are seeing high latency for a single partition query please share. I would like to investigate more.
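As a minimal sketch of ideas 1 and 2, the two agent setups could look like this with Node's built-in https module. The numbers are placeholders rather than recommendations, and the variable names are made up for illustration.

```javascript
const https = require("https");

// Idea 1: keep-alive agent that retains more idle sockets for reuse.
// The numbers are placeholders, not recommendations.
const keepAliveAgent = new https.Agent({
  keepAlive: true,
  maxSockets: 256,
  maxFreeSockets: 256,
});

// Idea 2: no keep-alive; every request opens a fresh connection.
// Node leaves maxSockets at Infinity when it is not set.
const noKeepAliveAgent = new https.Agent({ keepAlive: false });

console.log(keepAliveAgent.maxFreeSockets); // 256
console.log(noKeepAliveAgent.maxSockets); // Infinity
```

As far as I know the client constructor accepts a custom agent through an `agent` option, but please check that against the SDK version you are on before relying on it.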

I would be happy to discuss your specific workloads, help debug the perf issues, and chat about alternative partitioning strategies. Drop me an email [email protected]


tony-gutierrez avatar tony-gutierrez commented on June 15, 2024 2

Steve, while I appreciate the suggestions, I feel like most people who are clamoring for direct mode have probably already tuned to the point of latency being the limiting factor.

Node is not a limiting factor. The ports on app service are a limiting factor already, so why not have their usage be accomplished faster with one less network hop?

We have pretty constant high db traffic. We have been all over the map with keep alive, custom agents, and port limits. Our current mostly stable configuration is agentkeepalive (because MS products close all connections after 120 seconds, no matter what, and it's the only agent that has a socket TTL setting that can be set to 110 seconds) with the following config:

// attempt to keep global socket usage under 160 for app service.
agentConfig: {
  keepAlive: true,
  maxSockets: 25, // this is per host
  maxFreeSockets: 10, // per host
  timeout: 60000,
  freeSocketTimeout: 30000, // not used if using normal agent
  socketActiveTTL: 110000
}

With this setup we will usually have 3 Cosmos hosts, with the sockets maxed. I would much rather have 10 times the hosts (although Cosmos always seems to fit our data into 5 partitions) with fewer sockets allowed and faster execution of queries by id (95% of our db traffic). But even better would be TCP direct mode, eliminating the overhead of the HTTP agent altogether, and the overhead of all the headers and query string parameters that caused the recent overflow issue.
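For anyone checking the numbers, here is a rough sketch of the socket budget this config implies. It uses Node's built-in agent for the options it understands; the two agentkeepalive-only options are listed separately for reference. `worstCaseSockets` and `APP_SERVICE_BUDGET` are illustrative names.

```javascript
const https = require("https");

// Options understood by Node's built-in agent.
const coreOptions = {
  keepAlive: true,
  maxSockets: 25, // per host
  maxFreeSockets: 10, // per host
  timeout: 60000,
};

// Options that only agentkeepalive understands (kept here for reference).
const agentKeepAliveOnly = {
  freeSocketTimeout: 30000,
  socketActiveTTL: 110000,
};

const agent = new https.Agent(coreOptions);

// Worst-case active sockets if every host is maxed out.
function worstCaseSockets(hostCount, maxSocketsPerHost) {
  return hostCount * maxSocketsPerHost;
}

const APP_SERVICE_BUDGET = 160; // the target mentioned in the config comment
const used = worstCaseSockets(3, coreOptions.maxSockets);
console.log(used, used <= APP_SERVICE_BUDGET); // 75 true
```

With 3 hosts maxed out, that is 75 sockets for Cosmos, leaving the rest of the 160 for other outbound traffic.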

We stay fast by spreading the load over many more servers than we really need, just due to the bottleneck of cosmos connections and latency of the additional network hop.


christopheranderson avatar christopheranderson commented on June 15, 2024 1

Hey @janis91 - thanks for moving this here. This is something that @southpolesteve and I would really like to do. This likely will be done in stages (non-session writes are really easy, session consistency reads are pretty complex). We also won't work on this until we've gotten the new SDK released to GA since we cannot deprecate the old SDK until this one has GA'd which doubles the cost for us.

Short version: We really want to do this, but it likely won't start until after the new version GA's.


janis91 avatar janis91 commented on June 15, 2024 1

Hi @christopheranderson, you're welcome. Actually, I already expected that this wouldn't make it into the initial version. I like the new approach with the async/await style / promise-based methods, so I think this will already be a big step forward. After that, I am hoping for further improvement :-)


janis91 avatar janis91 commented on June 15, 2024 1

Unfortunately, we are not able to do this at the moment. But we are still looking forward to the release :-)


MuhamedSalihSeyedIbrahim avatar MuhamedSalihSeyedIbrahim commented on June 15, 2024 1

Hi Team - any update on the direct mode feature?


christopheranderson avatar christopheranderson commented on June 15, 2024

Absolutely. We're also moving all(*) our development onto GitHub, so you should be able to track the improvements as they are in the works. :)

FYI - we're doing some user studies on the new model. If you're interested, email me at chrande (at) microsoft (dot) com. We have a "final" round before our first preview release starting the 23rd.

(*) - There might be "surprise" features that get developed on a private feature branch, but they'll be merged into the main dev/master branches. We'll try to do this as little as possible.


tony-gutierrez avatar tony-gutierrez commented on June 15, 2024

Ah crap, the only reason I started implementing this SDK was because I thought it supported direct?!


christopheranderson avatar christopheranderson commented on June 15, 2024

This SDK will support Direct, but it wasn't a requirement for GA. It's the next feature on my queue, though, if that makes you feel better (but it's a heavy feature, so still a bit out).


tony-gutierrez avatar tony-gutierrez commented on June 15, 2024

Any progress?


hassellof avatar hassellof commented on June 15, 2024

Also waiting for this. Getting much better performance with MongoDB driver at the moment, but would prefer to use this to get support for multi master. However, current performance makes it faster to run a single master on a region on the other side of the planet and use MongoDB driver than this one in the same region.


tony-gutierrez avatar tony-gutierrez commented on June 15, 2024

@hassellof, can you elaborate on the stack? What node js mongo library do you use?


tony-gutierrez avatar tony-gutierrez commented on June 15, 2024

@christopheranderson ?


tony-gutierrez avatar tony-gutierrez commented on June 15, 2024

@southpolesteve ?


christopheranderson avatar christopheranderson commented on June 15, 2024

Sorry for the delay in response here.

No news on Direct Mode support for JS. We're in the final phases of Direct Mode for Java (HTTP is out and live in 2.4.0 and TCP is feature complete but not at the quality bar we need). We'll evaluate how we want to approach the next Direct Mode implementation once we've gotten Java to the place we need it to be.

RE: performance - would appreciate any numbers you're willing to share (what do you need/expect vs what you're currently observing). I've worked with a few customers to help make Gateway mode work fine performance wise.


tony-gutierrez avatar tony-gutierrez commented on June 15, 2024

Any update?


tony-gutierrez avatar tony-gutierrez commented on June 15, 2024

This issue isn't on the V3 feature list?


antempus avatar antempus commented on June 15, 2024

@southpolesteve

Great info and we will give this a shot to see if the tuning helps; I think it will help in our dev/cert env, but in prod, we're going to see orders of magnitude more spiky requests to the API.

What's interesting is that the majority of the challenges stem not from Cross Partition queries, but rather from the sheer number of requests coming into the App Service; we're already batching the POST to the API but still see errors on the API.
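One way to act on the "queue writes on your end" suggestion for this kind of request volume is a small concurrency limiter in front of the write calls. This is only a sketch: `createLimiter` and `fakeWrite` are made-up names, and `fakeWrite` stands in for whatever SDK call actually performs the write.

```javascript
// Minimal concurrency limiter: at most `maxConcurrent` tasks run at once,
// the rest wait in a FIFO queue. Names are illustrative.
function createLimiter(maxConcurrent) {
  let active = 0;
  const waiting = [];

  const next = () => {
    if (active >= maxConcurrent || waiting.length === 0) return;
    active++;
    const { task, resolve, reject } = waiting.shift();
    Promise.resolve()
      .then(task)
      .then(resolve, reject)
      .finally(() => {
        active--;
        next();
      });
  };

  return function limit(task) {
    return new Promise((resolve, reject) => {
      waiting.push({ task, resolve, reject });
      next();
    });
  };
}

// Usage: wrap each write; `fakeWrite` stands in for a real SDK call.
const limit = createLimiter(2);
const fakeWrite = (doc) => new Promise((r) => setTimeout(() => r(doc.id), 10));
const results = Promise.all([1, 2, 3].map((id) => limit(() => fakeWrite({ id }))));
```

The limit would need to be picked against the socket budget of the box; 2 here is just small enough to show the queueing.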

I'll try these suggestions and bump the App Service up to the now lower priced P1V2 and post the results.


southpolesteve avatar southpolesteve commented on June 15, 2024

@tony-gutierrez Can you share some of the latency #s you are seeing and at what kind of load? If point reads are slow, I would like to dig into that more. I will try to repro on app service.

It would also be helpful to know if requests are being queued. You can grab it off of agent. Some code I have used before to log this:

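// https is assumed here, since the Cosmos endpoint is HTTPS
const { Agent } = require("https")

const agent = new Agent({
  keepAlive: true
})

// Once a second, log how many requests are queued waiting for a free socket
setInterval(() => {
  console.log(Object.values(agent.requests).flat().length)
}, 1000)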
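If it helps, the same idea can be extended to report active and idle sockets next to the queue length. This is a sketch; `countAll` and `agentStats` are hypothetical helper names, while `requests`, `sockets` and `freeSockets` are the per-host maps Node's agent exposes.

```javascript
const { Agent } = require("https");

// Count entries across all hosts in one of the agent's per-host maps.
function countAll(perHostMap) {
  return Object.values(perHostMap).reduce((sum, list) => sum + list.length, 0);
}

// Snapshot of queued requests, active sockets and idle sockets.
function agentStats(agent) {
  return {
    queued: countAll(agent.requests),
    active: countAll(agent.sockets),
    free: countAll(agent.freeSockets),
  };
}

const agent = new Agent({ keepAlive: true });
console.log(agentStats(agent)); // { queued: 0, active: 0, free: 0 }
```

A queue length that stays above zero under load would suggest the agent's socket cap, not Cosmos itself, is where requests are waiting.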


jay-most avatar jay-most commented on June 15, 2024

@antempus What were your results?


antempus avatar antempus commented on June 15, 2024

@jay-most you can mark my comments as no longer valid. I've moved on to different work/company and I cannot recall why we needed, or thought we needed, this feature.


sabharwal-garv avatar sabharwal-garv commented on June 15, 2024

Hi @jay-most, can you confirm when direct mode will be available? With Gateway mode we are seeing 4 to 5 second latency in Azure metrics when we have a spike in load. Alternatively, can someone please suggest an optimal configuration to reduce latency?

Container configuration:

Max RUs: 25000 with autoscaling enabled.

Query configuration:

We are querying on the basis of partitionKey and an index field.

export const MAX_DEGREE_OF_PARALLELISM = -1;

export const MAX_ITEM_COUNT = 1000;

export const BUFFER_ITEMS = true;

export const FORCE_QUERY_PLAN = true;
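For reference, a sketch of how these settings might be passed as feed options on a query scoped to one partition key. Assumptions: `container` behaves like an @azure/cosmos Container, the field name `indexedField` is hypothetical, and the bounded parallelism value is an experiment along the lines suggested earlier in the thread, not a verified fix.

```javascript
// Query settings from the comment above, with bounded parallelism swapped in
// as an experiment (an assumption, not a verified fix).
const feedOptions = {
  maxDegreeOfParallelism: 4, // hypothetical value; the comment uses -1
  maxItemCount: 1000,
  bufferItems: true,
  forceQueryPlan: true,
};

// `container` is expected to look like an @azure/cosmos Container.
async function queryByPartition(container, partitionKey, indexedValue) {
  const querySpec = {
    query: "SELECT * FROM c WHERE c.indexedField = @value", // hypothetical field
    parameters: [{ name: "@value", value: indexedValue }],
  };
  const started = Date.now();
  const { resources } = await container.items
    .query(querySpec, { ...feedOptions, partitionKey })
    .fetchAll();
  // Return the timing too, so client-side latency can be compared with Azure metrics.
  return { resources, elapsedMs: Date.now() - started };
}
```

Passing the partition key in the options keeps the query on a single partition, which is the case where degree of parallelism should matter least.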


github-actions avatar github-actions commented on June 15, 2024

Hi @janis91, we deeply appreciate your input into this project. Regrettably, this issue has remained inactive for over 2 years, leading us to the decision to close it. We've implemented this policy to maintain the relevance of our issue queue and facilitate easier navigation for new contributors. If you still believe this topic requires attention, please feel free to create a new issue, referencing this one. Thank you for your understanding and ongoing support.
