Busy yield-loops hurt performance (#9)

Comments (12)

nikomatsakis commented on July 16, 2024

> The other yield loop is in steal_until, which is harder because that's looping for either latch completion or available work -- suggestions welcome.

Yeah, I figured I'd have to remove the busy loops sooner or later. It'd be worth benchmarking on other systems too (not sure what system you are using there). In the past, I've found that Linux in particular has a more efficient condvar impl than other systems (I think it is based on futexes, but I've not researched this stuff in detail in a while). Another option is to use a spin loop with an increasing backoff -- i.e., sleep for increasing amounts of time.
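
As a rough illustration of the increasing-backoff idea (a sketch only -- the function name, constants, and structure are illustrative, not rayon's code):

use std::thread;
use std::time::Duration;

/// Wait for `done` to become true: spin briefly, then yield the
/// time slice, then sleep for increasing amounts of time.
fn wait_with_backoff(mut done: impl FnMut() -> bool) {
    let mut rounds: u32 = 0;
    while !done() {
        if rounds < 64 {
            std::hint::spin_loop(); // cheap busy spin at first
        } else if rounds < 128 {
            thread::yield_now(); // then give up the time slice
        } else {
            // finally sleep, doubling the pause up to about a millisecond
            let micros = 1u64 << (rounds - 128).min(10);
            thread::sleep(Duration::from_micros(micros));
        }
        rounds += 1;
    }
}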

nikomatsakis commented on July 16, 2024

Heh, I was wondering why wait was having such an impact on performance: I thought that I only used it to block until the worker threads started, but I see I also used it to wait for the main threads to start. That was... silly. Good catch, a condvar definitely seems to make sense there.
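
For reference, a condvar-backed latch could look roughly like this (a minimal sketch assuming a simple boolean "set" state; the type name is illustrative):

use std::sync::{Condvar, Mutex};

struct BlockingLatch {
    done: Mutex<bool>,
    cond: Condvar,
}

impl BlockingLatch {
    fn new() -> Self {
        BlockingLatch { done: Mutex::new(false), cond: Condvar::new() }
    }

    /// Block the calling thread until `set` has been called.
    fn wait(&self) {
        let mut done = self.done.lock().unwrap();
        while !*done {
            done = self.cond.wait(done).unwrap();
        }
    }

    /// Mark the latch as set and wake every waiter.
    fn set(&self) {
        *self.done.lock().unwrap() = true;
        self.cond.notify_all();
    }
}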

mneumann commented on July 16, 2024

I tried rayon with my embarrassingly parallel evolutionary algorithm. It was about 10-15% slower than the simple_parallel version, IIRC. I will try again to see whether performance is on par.

AlisdairO commented on July 16, 2024

Bit of a hanger-on to this issue, but is SeqCst necessary for this AtomicBool? My (hurried) reading of it is that Latch essentially waits until b is true, and then allows other code to run - that being the case, it should be possible to use Ordering::Acquire for the probe method, and Ordering::Release for the set method. This would potentially have performance benefits on at least some platforms.
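
For concreteness, that suggestion would look roughly like this (an illustrative sketch of the probe/set pair described above, not the actual rayon source):

use std::sync::atomic::{AtomicBool, Ordering};

struct Latch {
    b: AtomicBool,
}

impl Latch {
    /// Has the latch been set yet?
    fn probe(&self) -> bool {
        // Acquire: if we observe `true`, we also see everything the
        // setting thread wrote before it called `set`.
        self.b.load(Ordering::Acquire)
    }

    /// Set the latch, releasing anyone polling `probe`.
    fn set(&self) {
        // Release: publishes this thread's prior writes to any thread
        // whose `probe` observes the `true` value.
        self.b.store(true, Ordering::Release);
    }
}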

cuviper commented on July 16, 2024

@nikomatsakis The numbers I gave were on Linux, Fedora 23 x86_64 on an i7-2600. I just tried Win10 x64 on an i7-3770K, and it goes from around 3-3.5x parallel speedup (quite variable!) to a more stable 4x speedup. (FWIW both of those are quad-core, giving 8 logical CPUs with hyperthreading.) I don't have OSX to try.

I can send a PR for the easy latch-wait condvar -- anything you want tweaked from the patch above?

@mneumann I'm curious to hear how much this helps you too!

@AlisdairO I think you're right that ordering could be relaxed. However, if we can get the latch fully converted to condvars, the atomic bool might go away in favor of just mutex-guarded state.

nikomatsakis commented on July 16, 2024

Hmm, I think I'd take the patch, but I'd prefer to separate the case of stolen jobs from other jobs, since only the latter benefit from the mutex/condvar arrangement; for the rest it is just overhead.

Niko

cuviper commented on July 16, 2024

Well, supposedly stealing is rare anyway, so overhead doesn't matter. Plus I hope eventually to figure out how to make stealing sleep more precisely too. But until then, sure, I can try tweaking it so only fully-awaited jobs will notify the condvar.

nikomatsakis commented on July 16, 2024

Stealing is rare, but creating a latch is very, very common -- it happens on every fork.

mneumann commented on July 16, 2024

@nikomatsakis @cuviper : I redid my benchmarks. My numbers are:

Simple parallel: ~50 sec
Rayon rev e15798d: ~55.8 sec
Rayon rev 97b04cd: ~56 sec
Rayon rev e15798d with threshold 1.0: ~54.7 sec
Rayon rev 97b04cd with threshold 1.0: ~49.8 sec.

Note that I am using par_iter, and in my case it's important to run every single iteration in parallel. Only changing the THRESHOLD constant did the trick. Good news: it's now on par with simple_parallel. So when setting the threshold I do see a 10% improvement with the new code! Now waiting for a way to set the threshold without patching the code :)

nikomatsakis commented on July 16, 2024

@mneumann Interesting. I had planned to include an option in the par_iter API to add weights, so that you don't need to manually adjust the threshold. I also wanted to have an "autotune" operation that would try to find an ideal weight experimentally. (But I'm not really sure if that's the best overall approach.)
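
A hypothetical sketch of how such a weight might feed into the split decision (none of these names are rayon's actual API; they just echo the THRESHOLD discussion above):

const THRESHOLD: f64 = 10_000.0;

struct ParallelLen {
    maximal_len: usize,
    cost: f64,
}

fn should_split(len: &ParallelLen, weight: f64) -> bool {
    // A weight of 1.0 leaves the cost estimate alone; expensive per-item
    // work can pass a larger weight so that even short ranges still split
    // instead of falling back to sequential execution.
    len.maximal_len > 1 && len.cost * weight > THRESHOLD
}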

mneumann commented on July 16, 2024

@nikomatsakis To clarify: the 10% improvement I saw was in CPU time, not in wall-clock time. simple_parallel was still a bit faster, but it also kept the CPU slightly busier.

I experimented a bit with making ParallelLen a trait and making the functions generic over it. The decision about when to fall back to sequential mode would then be encapsulated there. One would use it like:

// this is equivalent to THRESHOLD = 1.0
pop.into_par_iter::<FullyParallel>().map(...).collect_into(...)

struct WeightedParallel { cost: f64, /* ... */ }

impl ParallelLen for WeightedParallel {
    const THRESHOLD: f64 = 10_000.0;
    // ...
    fn is_parallel(&self) -> bool {
        self.cost > Self::THRESHOLD && self.max_length > 1
    }
}

pop.into_par_iter::<WeightedParallel>().map(...)

But I am not sure whether this is overkill. One could also use an associated constant to specify the THRESHOLD, of course.

I think it all depends on the type of algorithms being used. I can imagine algorithms where the left and right costs aren't always equally distributed (left and right subtrees, for instance), so we'd need specialized ParallelLen implementations anyway.

nikomatsakis commented on July 16, 2024

@mneumann this is more germane to #7, so I'm going to post my reply there.
