Busy yield-loops hurt performance (#9)

Comments (12)

nikomatsakis commented on July 16, 2024

> The other yield loop is in steal_until, which is harder because that's looping for either latch completion or available work -- suggestions welcome.

Yeah, I figured I'd have to remove the busy loops sooner or later. It'd be worth benchmarking on other systems too (not sure what system you are using there). In the past, I've found that Linux in particular has a more efficient condvar impl than other systems (I think it is based on futexes, but I've not researched this stuff in detail in a while). Another option is to use a spin loop with an increasing backoff -- i.e., sleep for increasing amounts of time.
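
As a rough illustration of the increasing-backoff idea (a sketch only -- the function name, constants, and structure are illustrative, not rayon's code):

use std::thread;
use std::time::Duration;

/// Wait for `done` to become true: spin briefly, then yield the
/// time slice, then sleep for increasing amounts of time.
fn wait_with_backoff(mut done: impl FnMut() -> bool) {
    let mut rounds: u32 = 0;
    while !done() {
        if rounds < 64 {
            std::hint::spin_loop(); // cheap busy spin at first
        } else if rounds < 128 {
            thread::yield_now(); // then give up the time slice
        } else {
            // finally sleep, doubling the pause up to about a millisecond
            let micros = 1u64 << (rounds - 128).min(10);
            thread::sleep(Duration::from_micros(micros));
        }
        rounds += 1;
    }
}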

nikomatsakis commented on July 16, 2024

Heh, I was wondering why wait was having such an impact on performance: I thought that I only used it to block until the worker threads started, but I see I also used it to wait for the main threads to start. That was... silly. Good catch, a condvar definitely seems to make sense there.
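
For reference, a condvar-backed latch could look roughly like this (a minimal sketch assuming a simple boolean "set" state; the type name is illustrative):

use std::sync::{Condvar, Mutex};

struct BlockingLatch {
    done: Mutex<bool>,
    cond: Condvar,
}

impl BlockingLatch {
    fn new() -> Self {
        BlockingLatch { done: Mutex::new(false), cond: Condvar::new() }
    }

    /// Block the calling thread until `set` has been called.
    fn wait(&self) {
        let mut done = self.done.lock().unwrap();
        while !*done {
            done = self.cond.wait(done).unwrap();
        }
    }

    /// Mark the latch as set and wake every waiter.
    fn set(&self) {
        *self.done.lock().unwrap() = true;
        self.cond.notify_all();
    }
}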

mneumann commented on July 16, 2024

I tried rayon with my embarrassingly parallel evolutionary algorithm. It was about 10-15% slower than the simple_parallel version, IIRC. I will try again to see whether performance is on par.

AlisdairO commented on July 16, 2024

Bit of a hanger-on to this issue, but is SeqCst necessary for this AtomicBool? My (hurried) reading of it is that Latch essentially waits until b is true, and then allows other code to run - that being the case, it should be possible to use Ordering::Acquire for the probe method, and Ordering::Release for the set method. This would potentially have performance benefits on at least some platforms.
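
For concreteness, that suggestion would look roughly like this (an illustrative sketch of the probe/set pair described above, not the actual rayon source):

use std::sync::atomic::{AtomicBool, Ordering};

struct Latch {
    b: AtomicBool,
}

impl Latch {
    /// Has the latch been set yet?
    fn probe(&self) -> bool {
        // Acquire: if we observe `true`, we also see everything the
        // setting thread wrote before it called `set`.
        self.b.load(Ordering::Acquire)
    }

    /// Set the latch, releasing anyone polling `probe`.
    fn set(&self) {
        // Release: publishes this thread's prior writes to any thread
        // whose `probe` observes the `true` value.
        self.b.store(true, Ordering::Release);
    }
}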

cuviper commented on July 16, 2024

@nikomatsakis The numbers I gave were on Linux, Fedora 23 x86_64 on an i7-2600. I just tried Win10 x64 on an i7-3770K, and it goes from around 3-3.5x parallel speedup (quite variable!) to a more stable 4x speedup. (FWIW both of those are quad-core, giving 8 logical CPUs with hyperthreading.) I don't have OSX to try.

I can send a PR for the easy latch-wait condvar -- anything you want tweaked from the patch above?

@mneumann I'm curious to hear how much this helps you too!

@AlisdairO I think you're right that ordering could be relaxed. However, if we can get the latch fully converted to condvars, the atomic bool might go away in favor of just mutex-guarded state.

nikomatsakis commented on July 16, 2024

Hmm, I think I'd take the patch, but I'd prefer to separate the case of stolen jobs from other jobs, since only the latter benefit from the mutex/condvar arrangement; for the rest it is just overhead.

Niko

cuviper commented on July 16, 2024

Well, supposedly stealing is rare anyway, so overhead doesn't matter. Plus I hope eventually to figure out how to make stealing sleep more precisely too. But until then, sure, I can try tweaking it so only fully-awaited jobs will notify the condvar.

nikomatsakis commented on July 16, 2024

Stealing is rare, but creating a latch is very, very common -- it happens on every fork.

mneumann commented on July 16, 2024

@nikomatsakis @cuviper : I redid my benchmarks. My numbers are:

Simple parallel: ~50 sec
Rayon rev e15798d: ~55.8 sec
Rayon rev 97b04cd: ~56 sec
Rayon rev e15798d with threshold 1.0: ~54.7 sec
Rayon rev 97b04cd with threshold 1.0: ~49.8 sec.

Note that I am using par_iter, and in my case it's important to run every single iteration in parallel. Only changing the THRESHOLD constant did the trick. Good news: it's now on par with simple_parallel. So when setting the threshold I do see a 10% improvement with the new code! Now waiting for a way to set the threshold without patching the code :)

nikomatsakis commented on July 16, 2024

@mneumann Interesting. I had planned to include an option in the par_iter API to add weights, so that you don't need to manually adjust the threshold. I also wanted to have an "autotune" operation that would try to find an ideal weight experimentally. (But I'm not really sure if that's the best overall approach.)
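
A hypothetical sketch of how such a weight might feed into the split decision (none of these names are rayon's actual API; they just echo the THRESHOLD discussion above):

const THRESHOLD: f64 = 10_000.0;

struct ParallelLen {
    maximal_len: usize,
    cost: f64,
}

fn should_split(len: &ParallelLen, weight: f64) -> bool {
    // A weight of 1.0 leaves the cost estimate alone; expensive per-item
    // work can pass a larger weight so that even short ranges still split
    // instead of falling back to sequential execution.
    len.maximal_len > 1 && len.cost * weight > THRESHOLD
}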

mneumann commented on July 16, 2024

@nikomatsakis To clarify: the 10% improvement I saw was in CPU time, not in wall-clock time. simple_parallel was still a bit faster, but it also kept the CPU slightly busier.

I experimented a bit with making ParallelLen a trait and making the functions generic over it. The decision about when to fall back to sequential mode would then be encapsulated there. One would use it like:

// this is equivalent to THRESHOLD = 1.0
pop.into_par_iter::<FullyParallel>().map(...).collect_into(...)

struct WeightedParallel { cost: f64, /* ... */ }

impl ParallelLen for WeightedParallel {
    const THRESHOLD: f64 = 10_000.0;
    // ...
    fn is_parallel(&self) -> bool {
        self.cost > Self::THRESHOLD && self.max_length > 1
    }
}

pop.into_par_iter::<WeightedParallel>().map(...)

But I am not sure whether this is overkill. One could also use an associated constant to specify the THRESHOLD, of course.

I think it all depends on the type of algorithms being used. I can imagine algorithms where the left and right costs aren't always equally distributed (left and right subtrees, for instance), so we'd need specialized ParallelLen implementations anyway.

nikomatsakis commented on July 16, 2024

@mneumann this is more germane to #7, so I'm going to post my reply there.
