rayon-rs / rayon Goto Github PK

Rayon: A data parallelism library for Rust

License: Apache License 2.0

Rust 99.89% Shell 0.11%

rayon's Introduction

Rayon

Rayon is a data-parallelism library for Rust. It is extremely lightweight and makes it easy to convert a sequential computation into a parallel one. It also guarantees data-race freedom. (You may also enjoy this blog post about Rayon, which gives more background and details about how it works, or this video, from the Rust Belt Rust conference.) Rayon is available on crates.io, and API documentation is available on docs.rs.

Parallel iterators and more

Rayon makes it drop-dead simple to convert sequential iterators into parallel ones: usually, you just change your foo.iter() call into foo.par_iter(), and Rayon does the rest:

use rayon::prelude::*;
fn sum_of_squares(input: &[i32]) -> i32 {
    input.par_iter() // <-- just change that!
         .map(|&i| i * i)
         .sum()
}

Parallel iterators take care of deciding how to divide your data into tasks; it will dynamically adapt for maximum performance. If you need more flexibility than that, Rayon also offers the join and scope functions, which let you create parallel tasks on your own. For even more control, you can create custom threadpools rather than using Rayon's default, global threadpool.

No data races

You may have heard that parallel execution can produce all kinds of crazy bugs. Well, rest easy. Rayon's APIs all guarantee data-race freedom, which generally rules out most parallel bugs (though not all). In other words, if your code compiles, it typically does the same thing it did before.

For the most, parallel iterators in particular are guaranteed to produce the same results as their sequential counterparts. One caveat: If your iterator has side effects (for example, sending methods to other threads through a Rust channel or writing to disk), those side effects may occur in a different order. Note also that, in some cases, parallel iterators offer alternative versions of the sequential iterator methods that can have higher performance.

Using Rayon

Rayon is available on crates.io. The recommended way to use it is to add a line into your Cargo.toml such as:

[dependencies]
rayon = "1.10"

To use the parallel iterator APIs, a number of traits have to be in scope. The easiest way to bring those things into scope is to use the Rayon prelude. In each module where you would like to use the parallel iterator APIs, just add:

use rayon::prelude::*;

Rayon currently requires rustc 1.63.0 or greater.

Usage with WebAssembly

By default, when building to WebAssembly, Rayon will treat it as any other platform without multithreading support and will fall back to sequential iteration. This allows existing code to compile and run successfully with no changes necessary, but it will run slower as it will only use a single CPU core.

You can build Rayon-based projects with proper multithreading support for the Web, but you'll need an adapter and some project configuration to account for differences between WebAssembly threads and threads on the other platforms.

Check out the wasm-bindgen-rayon docs for more details.

Contribution

Rayon is an open source project! If you'd like to contribute to Rayon, check out the list of "help wanted" issues. These are all (or should be) issues that are suitable for getting started, and they generally include a detailed set of instructions for what to do. Please ask questions if anything is unclear! Also, check out the Guide to Development page on the wiki. Note that all code submitted in PRs to Rayon is assumed to be licensed under Rayon's dual MIT/Apache 2.0 licensing.

Quick demo

To see Rayon in action, check out the rayon-demo directory, which includes a number of demos of code using Rayon. For example, run this command to get a visualization of an N-body simulation. To see the effect of using Rayon, press s to run sequentially and p to run in parallel.

> cd rayon-demo
> cargo run --release -- nbody visualize

For more information on demos, try:

> cd rayon-demo
> cargo run --release -- --help

License

Rayon is distributed under the terms of both the MIT license and the Apache License (Version 2.0). See LICENSE-APACHE and LICENSE-MIT for details. Opening a pull request is assumed to signal agreement with these licensing terms.

rayon's People

Contributors

Stargazers

Watchers

Forkers

yuriks kevinmehall faern huonw asafmah diwic yetone defuz hoops dirvine cuviper ngaut ekleog brendanzab kinghajj kerneltime willi-kappler amanieu pnkfelix bluss hatahet sharpler arielb1 delskayn frewsxcv jamwt emilio gitter-badger spacejam durango sophiajt creativcoder danielsanchezq utkarshkukreti pengowen123 martinhath bwo christopherdavenport ghotiphud schuster johnathan79717 jamougha bholley gsquire schets f001 pegomes torkleyy hardvain leshow petezhut mbrubeck manuel-woelker gaurikholkar-zz froydnj perlun scottmcm antoinewdg mschmo crazymykl superfluffy glandium nox hodgesds seeekr julian-seward1 michael-f-bryan alexxnica kryndex behnam amosbird raviqqe aaron1011 chrisvittal vishalsodani fitzgen dns2utf8 andygauge aceeri tmccombs smt923 scotto3394 zimurgh munael marmistrz majorbreakfast denialadams johnchildren 17ai mamuleanu zoxc dtibbs clippered ignatenkobrain nanne007 obv-mikhail enet4 amilajack julioolvr maxfurman

rayon's Issues

Warning on README.md points to closed bug

This warning probably ought to be updated now that #10 is closed:

WARNING: Rayon is still in experimental status. It is probably not very robust and not (yet) suitable for production applications. In particular, the handling of panics is known to be terrible. See nikomatsakis/rayon#10 for more details.

Negative tests abusing the consumer protocol

To test unsafe consumers like collect, we need some negative tests of evil producers that fail to uphold their end of the bargain in various ways; we would want to be sure that the collect consumer doesn't fail under those cases.

Here are some ideas:

fail to produce correct number of items
produce too many items
abuse consumer protocol by never calling complete
abuse consumer protocol by leaking calling
not sure what else

settle on one reduce, one fold

Currently we have a lot of method for doing "fold-like" things:

    fn reduce_with<OP>(self, op: OP) -> Option<Self::Item>
        where OP: Fn(Self::Item, Self::Item) -> Self::Item + Sync;
    fn reduce_with_identity<OP>(self, identity: Self::Item, op: OP) -> Self::Item
        where OP: Fn(Self::Item, Self::Item) -> Self::Item + Sync,
              Self::Item: Clone + Sync;
    fn reduce<REDUCE_OP>(self, reduce_op: &REDUCE_OP) -> Self::Item
        where REDUCE_OP: ReduceOp<Self::Item>;

And of course there is also the (experimental, but faster in some cases) fold operation:

    fn fold<I,FOLD_OP,REDUCE_OP>(self,
                                 identity: I,
                                 fold_op: FOLD_OP,
                                 reduce_op: REDUCE_OP)
                                 -> I
        where FOLD_OP: Fn(I, Self::Item) -> I + Sync,
              REDUCE_OP: Fn(I, I) -> I + Sync,
              I: Clone + Sync + Send;

Part of the challenge here is that parallel reduce is different from fold: you can't just start from the left and combine elements until the end. Instead, you wind up dividing the list into a kind of tree, folding each of those segments from left-to-right, and then reducing the results as you walk up the tree.

Each of the variations we currently have makes sense in its own way:

reduce_with is simplest, but it's the most inefficient. Because it requires two values from the iterator before we can invoke the closure, it has to introduce options under the hood to track things. Maybe there is a nicer way to implement it?
reduce_with_identity avoids the problem of introducing Option by requiring an initial element to use as the "accumulator" value. However, unlike in sequential fold, this value must be cloneable, which is unfortunate sometimes. (It has to be cloneable because we will need to use it for each of the parallel segments we wind up with.)
The original reduce was intended for letting people write custom operations that configure each detail. Its pretty silly for this to have the best name, but it might make sense to keep it around.
The fold operation exposes this "tree-and-then-walk-back-up" structure directly. I added it to try and make a benchmark faster. It worked, but it's again somewhat complicated. I think I've decided that it does make sense to keep it around, but I'm not 100% sure.

My feeling is that we should probably have a variation on reduce_with_identity as the only reduce operation. It would instead have a closure that produces the identity, so that sum looks like this:

iterator.reduce(|| 0, |a, b| a+b)

At that point, there is no need for the current reduce, since the Reduce trait just provides two methods that are basically equivalent to these two closures. Moreover, we can lift the Clone bounds.

If you need to artifically inject an identity value, as the current reduce_with does, it can be expressed like so:

iterator.reduce_with(|| None, |a, b| match (a, b) {
    (Some(a), Some(b)) => a+b,
    (s @ Some(_), None) | (None, s @ Some(_)) => s,
    (None, None) => None,
})

But I think nobody ever will.

So that just leaves the question of whether to keep fold around. My feeling now is that we should, but modify it to take a closure. I can easily imagine cases where it's useful to be able to retain state across the sequential folds, which fold makes possible. fold also basically corresponds to the map-reduce problem -- you can express this as .map().reduce() but sometimes it more efficient to convert N items into the new type than to convert each item individually, which is basically the point.

As an example, the nbody demo uses fold like this:

iterator.fold(
    || (0, 0), // represents the total "difference" in current position of this item
    |(diff1, diff2), item| ..., // adjust diff1, diff2 based on item
    |(diff1, diff2), (diff1b, diff2b)| (diff1 + diff1b, diff2 + diff2b));

Sound good?

cc @jorendorff, @cuviper -- with whom I talked about this

into_par_iter not defined for u64

I find that:

(0..n).into_par_iter…

fails if n is u64, but if I do:

(0..n as u32).into_par_iter…

everything works fine. Well mostly, but that is a different issue.

Fix Rayon for nightlies

master doesn't work anymore with 2016-05-12 nightly.

v0.4.0 is out but not in RELEASES

Hi:

https://crates.io/crates/rayon clames rayon = "0.4.0" but RELEASES.md clames Release 0.4 (not yet released)

What is new in 0.4.0?

Get an arbitrary `Err` from a parallel iterator over `Result<()>`

I just converted a for loop in cage to a parallel iteration. The original loop used try!, giving me .map(|pod| -> Result<()> {.

In a perfect world, I might like to collect all the errors in some order, but that would require messing around with reducing a "vector of errors" type, which is going to be allocation intensive. We can do almost as well by just choosing one of many errors, and showing that one to the user.

@cuviper helped me figure out the following solution:

            // If more than one parallel branch fails, just return one error.
            .reduce_with(|result1, result2| result1.and(result2).and(Ok(())))
            .unwrap_or(Ok(()))

Having this as a standard, built-in reduction would make it much easier to parallelize loops using try!.

"find" operation on parallel iterators?

As for normal iterators, but look for any matching item in parallel. (position wouldn't make much sense.)

Using `par_iter::*` is too broad

The docs recommend use rayon::par_iter::*; -- but this includes every pub mod, and those have very generic names. So something simple like this fails:

use rayon::par_iter::*;
use std::slice;

error: a module named slice has already been imported in this module

If those submodules really need to be public, perhaps there should be a prelude to glob use instead?

export splitter as a utility

We should have some easy way for "divide-and-conquer" algorithms like quicksort to take advantage of the new ThiefSplitter stuff added by @cuviper in #106. I envision a wrapper around join that is like splitter.join(x, y) which will run x and y sequentially after a certain point.

deprecate existing weight APIs

Per this comment, we should deprecate the existing weight APIs. I think we probably still want some room for manual tuning, but it should be simpler by default -- and if we are going to let people give fine-grained numbers, it should be tied to the length (like seq. cutoff) rather than some abstract notion of weight. Ideally though we'd add these APIs along with representative benchmarks where they are needed.

Add LICENSE file

There is no indication how or whether rayon is open source.

implement leak detection for scope

The scope API introduced in https://github.com/nikomatsakis/rayon/pull/71 heap-allocates jobs. The initial version leaked all of those heap-allocated boxes. While that has been fixed (I think), there is no automatic detection of that in the travis testing.

The ideal would be to use valgrind or some other such tool, but that's a bit tricky (perhaps with a custom allocation crate or alloc_system? haven't tried). I experimented with custom counters and some other approaches but they did not work:

The boxes will get dropped "asynchronously" with respect to the scope terminating (since it only blocks until all the jobs have finished executing). This is as it should be, since to do otherwise would be less efficient (Amdahl's Law and all that). But it makes my initial commit (which checks that the number of unfreed jobs was non-zero) racy.
Another approach (still around in my "scope-leak" branch) was to test "from the outside" by having objects in the scopes that indicate when they are freed. This failed because the user-defined closures were being freed, even though the boxes were not (at least when I intentionally broke the code), and hence we weren't detecting the leaks we wanted.

Change default weight of parallel iterators to assume expensive ops

Currently parallel iterators assume cheap operations. I am thinking that should be changed to assume expensive operations (i.e., fine-grained parallel splits), and have people opt-in to the current behavior by manually adjusting the weights or calling weight_min. My reasoning is:

The worst case behavior today is that you don't see any parallel speedup, which sucks. I see a lot of questions bout this.
The worst case behavior with the new default would be that you see less speedup than you expected. If we do a good job optimizing, this should be fairly low.
It seems to be what I want more often.

Thoughts?

ParallelIterator traits should be unsafe to implement

As https://github.com/nikomatsakis/rayon/pull/16 indicated, incorrectly implementing ParallelIterator can lead to crashes. We ought to just make the trait unsafe, I imagine (i.e., unsafe to implement, not to use).

Error when trying to install rayon 0.4.3

→ rustc -V
rustc 1.9.0 (e4e8b6668 2016-05-18)
→ cargo -V
cargo 0.10.0-nightly (10ddd7d 2016-04-08)

Compiling rayon v0.4.3
Running rustc /Users/daniel/.cargo/registry/src/github.com-88ac128001ac3a9a/rayon-0.4.3/src/lib.rs --crate-name rayon --crate-type lib -g -C metadata=5441ea551292b9c3 -C extra-filename=-5441ea551292b9c3 --out-dir /Users/daniel/game/target/debug/deps --emit=dep-info,link -L dependency=/Users/daniel/game/target/debug/deps -L dependency=/Users/daniel/game/target/debug/deps --extern rand=/Users/daniel/game/target/debug/deps/librand-c724acb3942597d1.rlib --extern num_cpus=/Users/daniel/game/target/debug/deps/libnum_cpus-0d0496d141db0602.rlib --extern libc=/Users/daniel/game/target/debug/deps/liblibc-56a988296c48c60c.rlib --extern deque=/Users/daniel/game/target/debug/deps/libdeque-9a8e24ef39a44da6.rlib --cap-lints allow
Fresh flate2 v0.2.14
Fresh core-foundation-sys v0.2.2
Fresh cgl v0.1.5
Fresh png v0.5.2
Fresh core-foundation v0.2.2
Fresh time v0.1.35
Fresh piston-float v0.2.0
Fresh core-graphics v0.3.2
Fresh vecmath v0.2.0
Fresh piston-viewport v0.2.0
Fresh cocoa v0.3.3
Fresh pistoncore-input v0.14.0
Fresh piston2d-graphics v0.18.0
Fresh glutin v0.6.2
Fresh pistoncore-window v0.23.0
Fresh pistoncore-event_loop v0.26.0
Fresh pistoncore-glutin_window v0.31.0
Fresh piston v0.26.0
/Users/daniel/.cargo/registry/src/github.com-88ac128001ac3a9a/rayon-0.4.3/src/par_iter/internal.rs:96:47: 96:48 error: no rules expected the token ;
/Users/daniel/.cargo/registry/src/github.com-88ac128001ac3a9a/rayon-0.4.3/src/par_iter/internal.rs:96 thread_local!{ static ID: bool = false; }
^
error: Could not compile rayon.

Caused by:
Process didn't exit successfully: rustc /Users/daniel/.cargo/registry/src/github.com-88ac128001ac3a9a/rayon-0.4.3/src/lib.rs --crate-name rayon --crate-type lib -g -C metadata=5441ea551292b9c3 -C extra-filename=-5441ea551292b9c3 --out-dir /Users/daniel/game/target/debug/deps --emit=dep-info,link -L dependency=/Users/daniel/game/target/debug/deps -L dependency=/Users/daniel/game/target/debug/deps --extern rand=/Users/daniel/game/target/debug/deps/librand-c724acb3942597d1.rlib --extern num_cpus=/Users/daniel/game/target/debug/deps/libnum_cpus-0d0496d141db0602.rlib --extern libc=/Users/daniel/game/target/debug/deps/liblibc-56a988296c48c60c.rlib --extern deque=/Users/daniel/game/target/debug/deps/libdeque-9a8e24ef39a44da6.rlib --cap-lints allow (exit code: 101)


https://github.com/nikomatsakis/rayon/blob/c91ea503ba0dee8dd1cf3575e8a4adf86e2d33d8/src/par_iter/internal.rs#L96

Any ideas? (Mac OSX)

Benchmark suite

We need to have a start at a standard Rayon benchmark suite. @cuviper got us started with https://github.com/nikomatsakis/rayon/pull/32. But we need a bit more.

More suggestions or pointers welcome.

We also need some kind of script to compile and run the demos and collect the data. Ideally, we'd run it regularly on some server so we can track performance automatically.

scope should shift to helper thread, and then "help" when blocking

Currently, when a scope is created, it does not shift into a worker thread. This complicates the logic mildly. In particular, then when the scope blocks waiting for its children, we don't know if that code is on a worker thread or not. Right now it just blocks -- but if it is on a worker thread, it should help.

My opinion is that we should (a) shift to a worker thread when a scope is created to execute the body of the scope and (b) help out when blocking.

Par_iter_mut does not seem to be utilizing available (idle) cores?

Full disclaimer, this is likely my fault in some way :)

I'm using Rayon master as the dependency, since I wanted some of the 0.3 features.

My code uses par_iter_mut to operate on a vector of "Pages" in parallel. There are 3969 Pages in total, and each page has variable amount of work. Total runtime is about 60 seconds on my Macbook air, and it appears to be using at least some of my cores (two cores are fairly heavily utilized, the two virtual hyperthreaded cores less so).

However, I tried to run this on my linux "heavy testing" server (hexacore Xeon E5-1650V2, 12 cores if you count hyperthreading) and find that only a single core is used. Worse, it appears only a single thread is being spawned with the default rayon settings (e.g. no custom Configuration):

$ top -H -p 11163

Threads:   2 total,   1 running,   1 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.3 us,  0.0 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  13200303+total,  4213284 used, 12778974+free,    95156 buffers                    
KiB Swap:  4190204 total,        0 used,  4190204 free.  2135916 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND                                                                                                         
11163 tongz      20   0 1442108 1.296g   2640 R 99.8  1.0   0:39.55 bench_grow                                                                                                      
11164 tongz      20   0 1442108 1.296g   2640 S  0.0  1.0   0:00.00 log4rs config r

If I manually specify how many threads should be used, Rayon spawns the full set of threads... but only one thread is actually used:

rayon::initialize(rayon::Configuration::new().set_num_threads(12));

$ top -H -p 10495

Threads:  14 total,   1 running,  13 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.3 us,  0.0 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  13200303+total,  4212232 used, 12779080+free,    95128 buffers                    
KiB Swap:  4190204 total,        0 used,  4190204 free.  2135916 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND                                                                                                         
10495 tongz      20   0 2253264 1.296g   2680 R 99.7  1.0   0:28.41 bench_grow                                                                                                      
10496 tongz      20   0 2253264 1.296g   2680 S  0.0  1.0   0:00.00 log4rs config r                                                                                                 
10497 tongz      20   0 2253264 1.296g   2680 S  0.0  1.0   0:00.00 bench_grow                                                                                                      
10498 tongz      20   0 2253264 1.296g   2680 S  0.0  1.0   0:00.00 bench_grow                                                                                                      
10499 tongz      20   0 2253264 1.296g   2680 S  0.0  1.0   0:00.00 bench_grow                                                                                                      
10500 tongz      20   0 2253264 1.296g   2680 S  0.0  1.0   0:00.00 bench_grow                                                                                                      
10501 tongz      20   0 2253264 1.296g   2680 S  0.0  1.0   0:00.00 bench_grow                                                                                                      
10502 tongz      20   0 2253264 1.296g   2680 S  0.0  1.0   0:00.00 bench_grow                                                                                                      
10503 tongz      20   0 2253264 1.296g   2680 S  0.0  1.0   0:00.00 bench_grow                                                                                                      
10504 tongz      20   0 2253264 1.296g   2680 S  0.0  1.0   0:00.00 bench_grow                                                                                                      
10505 tongz      20   0 2253264 1.296g   2680 S  0.0  1.0   0:00.00 bench_grow                                                                                                      
10506 tongz      20   0 2253264 1.296g   2680 S  0.0  1.0   0:00.00 bench_grow                                                                                                      
10507 tongz      20   0 2253264 1.296g   2680 S  0.0  1.0   0:00.00 bench_grow                                                                                                      
10508 tongz      20   0 2253264 1.296g   2680 S  0.0  1.0   0:00.00 bench_grow

Is it possible the compiler is optimizing away something? After the processing is done, I'm printing out some values from the computation to make sure the result is used. And processing still takes ~40s on my server, which makes me think it isn't just optimized away.

I thought the work chunks might be too small, so I tried variously sized work units and it seems to display the same behavior.

I'm not sure what would be helpful to debug. The code itself is here (and is mostly an unmitigated disaster atm, I haven't had a chance to clean it up). I'm using the bench_grow example as a simple test: cargo run --release --example bench_grow

The salient code looks like this:

fn main() {
    rayon::initialize(rayon::Configuration::new().set_num_threads(12));

    let num_pages = 63u32;
    let dimension = num_pages * PAGE_WIDTH;
    let mut cajal = Cajal::new(num_pages, 0.01, &[1,2,3,7]);

    let start = SteadyTime::now();
    cajal.grow();
    let elapsed =  SteadyTime::now() - start;
    let signal = (*cajal.get_cell(10, 10)).get_signal();
    println!("Elapsed time: {:?} (signal: {})", elapsed, signal);
}

fn grow() {
  self.pages.par_iter_mut()
    .for_each(|page| page.grow());;

  // some sequential processing

  self.pages.par_iter_mut()
    .for_each(|page| page.update());
}

Consider switching to the deque from crossbeam

We are currently using the deque from the crates.io deque package -- however, @aturon informs me that over long uses that deque can leak memory, since it lacks any form of GC to reclaim nodes. The crossbeam deque is a port to use crossbeam's memory reclamation infrastructure, which should keep memory use in check. We should probably switch; however, it would be interesting to first have some Rayon benchmarks in place.

Possible to implement IntoParallelIter for &mut [T] too?

So I was attempting to use into_par_iter() on something that needed mutability, and ran into what seems like a limitation.

My setup is a cellular automata grid. A single Grid holds a vector of Pages, and each Page holds a vector of Cells. The goal was to into_par_iter() over the pages, since they can be processed independently. However, the grow() method on each Page requires mutability because they modify their state internally...which seems to be a problem.

A rough outline:

pub struct Grid {
    pages: Vec<Page>
}

impl Grid {
  fn start(&mut self) {
      loop {
          let active_cells =  (&mut self.pages).into_par_iter()
              .map(|page| page.grow(GrowthPhase::Axon))
              .sum();

            //error: cannot borrow immutable borrowed content `*page` as mutable
            //  .map(|page| page.grow(GrowthPhase::Axon))
            //              ^~~~

          if active_cells == 0 {
              break;
          }
      }
  }
}

pub struct Page {
    cells: Vec<Cell>
}

impl Page {
    pub fn grow(&mut self, phase: GrowthPhase) -> u32 {
        // modifies state internally here
    }
}

Would it be possible for Rayon to support this, or is there something subtle I'm missing? Thanks!

Support moving into paralell iterators

At the moment we only have an impl on &[T]. This makes it expensive to do anything on large objects - you need to clone, especially if you want to reduce.

A possible way could be to pull apart a Vec by way of unsafe code. We could use split_off, but that would be expensive for large n-sizes.

Parallel processing of indexed for loops

Is there any way to add parallelism with rayons to loops that look like this:

for x in 1..g.width() - 1 {
        for y in 1..g.height() - 1 {
                ...processing....
        }
}

I do not even know if this would be possible in theory, but very helpful for some problems.

Use an user-provided "backend" (threadpool, event loop, ...)

Is it possible to tell rayon to use an user provided thread-pool (e.g. in my crate's main function)?

For example if I'd had a crate using tokio where I also want to use rayon, I would like to have a single thread-pool/event loop/task manager... that is used as the backend for both (and that does workstealing for both), instead of two competing ones.

scope needs tests for panic handling

The panic handling tests do not test the scope API!

We need tests to check that if a job in a scope panics:

the panic is propagated to the creating thread
this should work even if the job was not created directly by the scope closure, but indirectly, etc

Quicksort demo is broken

e98c137 broke quicksort demo.

Use of Crossbeam

Have you considered using Crossbeam ( https://aturon.github.io/blog/2015/08/27/epoch/ ) for managing the thread queues?

IntoParallelRefIterator and IntoParallelRefMutIterator are overly constrained

I was trying some implementations for #88 std collections (WIP), but I hit a stumbling block on IntoParallelRefIterator and IntoParallelRefMutIterator. These both require their Item to be the type behind the reference, so the ParallelIterator's Item is this &Item and &mut Item respectively. That's OK for most collections, but the map iterators use (&K, &V) and (&K, &mut V).

It would be nicer if the Item represented the same as the parallel iterator, so both the typical &T and map's (&K, &V) can work. Perhaps some future collection would even like something that only emulates a reference, like a cell Ref or a MutexGuard.

With this made more general, it can also have a blanket implementation, like:

pub trait IntoParallelRefIterator<'data> {
    type Iter: ParallelIterator<Item=Self::Item>;
    type Item: Send + 'data;

    fn par_iter(&'data self) -> Self::Iter;
}

impl<'data, T: 'data + ?Sized> IntoParallelRefIterator<'data> for T
    where &'data T: IntoParallelIterator
{
    type Iter = <&'data T as IntoParallelIterator>::Iter;
    type Item = <&'data T as IntoParallelIterator>::Item;

    fn par_iter(&'data self) -> Self::Iter {
        self.into_par_iter()
    }
}

Users of this trait would now lose the ability to assume that par_iter() yields simple references, but I'm not sure that's actually needed or valuable.

Thoughts?

tag releases

crates.io has 0.1.0 and 0.2.0.

i want to see 0.2.0 code on github, but can’t because the versions aren’t tagged.

change name of `threadpool::install`?

Currently, the question is, do we want a different name than threadpool::install()? Something that makes it clearer that we will execute the closure in the thread-pool?

Original post:

Currently, threadpool::install injects its closure argument in one of the worker threads for the threadpool. Since we are then executing in the worker threads, calls to join or scope::spawn naturally execute in that threadpool. I chose this design because it's zero overhead -- the logic for join and scope::spawn are the same ("if in worker thread, push, else inject into global pool").

However, I have been contemplating a change where we instead set a thread-local variable to point at the thread-pool, and then consult this thread-local variable when injecting threads. The reason is that I want people to be able to install at a very high point in their callstack without disturbing thread-local data they may need to use before they spawn tasks (once they spawn, there is not much I can do about that of course).

impl par_iter for other std collections.

Currently if you want parallelize iterating through anything that can't easily be turned into a Vec<T> or &[T] you have perform a collect, which can be quite slow on large collections.

Allow the user to specify the maximum number of threads that Rayon is allowed to use

Currently Rayon uses all the CPUs / cores available (determined by num_cpus::get()). It would be nice if the user can specify a maximum limit of how many threads Rayon is allowed to use in parallel.

Error type should implement `Error` trait

We recently introduced the InitError for "configuration gone wrong". This enum should implement the Error trait.

Collection method for associative, but not commutative, reductions

Not sure how possible this is, but I'd like to be able to use parallel iterators with non-commutative reduction functions (e.g., collecting error messages in a deterministic order).

Maybe best to use .enumerate() and .sort() a Vec populated by the parallel iterator?

publish docs

even autogenned APIdocs without additional effort would be nice.

https://nikomatsakis.github.io/rayon would be an obvious place

Periodic crashes with panic propagation

I have been seeing travis builds fail unreliably in the negative panic tests. I'll try to pin this down at some point.

cc @Amanieu

No way to influence par_iter's decision when to fall back to sequential

The threshold and/or the cost function is highly dependant on the workload.
For maximum performance on different workloads the user should be able to modify them.

we need a convenient `collect()`, `collect_into()` should support other collection types

In general, it'd be nice if collect_into() were more flexible and convenient. We may need multiple APIs in the end. Some things I would like:

a convenient collect() that allocates a new collection, as with sequential iterators;
maybe rename collect_into() into extend() and have it work like extend()?
- this is possible for vectors, at least
support for some more collection types:
- for exact length iterators, we could support slices (not vectors)
  - but we'd have to assert that their length is correct
- concurrent collections could be easily supported
- also persistent collections like https://github.com/michaelwoerister/hamt-rs
- maybe worth supporting something like RwLock<&mut HashMap>
  - could be practical in some scenarios...
  - or maybe collecting up pairs in a vec and then (transparently) storing them into the hashmap after, as a seq step?
support for more parallel iterator types:
- vecs can support bounded iterators, but they need a "compact" step
  - have to be careful about panics, though we can leak if needed

Well, ok, that's not very organized. But I feel like we can put a bit more effort into making this ergonomic and "bridging" between various things. I'm torn. =)

I suspect we'll wind up with more than just collect_into, in any case, though specialization might help here too.

`par_split_whitespace` for string

I was thinking that if you have an &str, you can split it into words/lines in parallel in a pretty simple way. Let's just think about making a par_split_whitespace that acts like split_whitespace:

First, pick a point in the middle of the string as measured in bytes; if this is in the middle of a utf-8 character, move to the start of the utf-8 character (you'd need to look at the raw as_bytes here, I don't think we have a helper for this -- but utf-8 is designed to make such "alignment" easy, you just have to check for bytes with the 0x80 bit set I think).
Next, check if this is a whitespace character. If so, move backwards until you find a non-whitespace character.
Split the string in half at this point.

This will give you two roughly strings of roughly even length that we can further subdivide.

The same principle can be applied to make a par_lines that acts like lines.

I don't think you can do this in the general case for split(), since it takes arbitrary functions, but you could probably do it for many cases (e.g., an arbitrary split-string).

This seems like a fun starter task for hacking on Rayon. If you're interested in taking it up, please reach out through the gitter channel.

UPDATE: par_lines() has been implemented, but not par_split_whitespace(); see this comment for details.

for_each_mut/sync adapter

hi, i’d like to put things coming out of a ParallelIterator into a HashMap.

currently i don’t see a way to do it, so i’d either like to see a for_each_mut that takes a FnMut or something else.

merge-sort may have some sort of bug

While preparing #104, I tried to convert the mergesort to create the array data using XorShiftRng, like quicksort -- but it started to fail its correctness assertions! Didn't take the time to look into it deeply though.

cc @edre

parallel map not parallel

I think I am just missing something in my haste to get something very small working…

From what I can tell:

thing.par_iter().map(…).sum()

does not do anything in parallel, it computes the maps sequentially.

Use OpenGL for nbody demo

#36 added an nbody demo. It would be nice to glium instead of SDL though.

join hangs if a worker thread panics

If a worker thread panics, I would expect rayon to either return an error result or simply propagate that panic. Instead, it just hangs waiting for a thread that will never finish. Example:

extern crate rayon;

fn main() {
    let s = std::time::Duration::new(1, 0);
    rayon::join(|| std::thread::sleep(s),
                || unimplemented!());
}

The unimplemented panic will be reported, but then the process just spins forever waiting for it.

In #9, I toyed with a Mutex and Condvar. If that mutex were held throughout the stolen job execution, then a panic would poison it. I think this will wake the condvar wait with the poisoned result, but I haven't tried it. We'd also need the steal_until loop to detect panics.

Variant on `join` that does not send closure `A`?

Since we are guaranteed to execute closure A on the main thread, it occurs to me that it does not really need to be Send. Lifting this restriction would add a bit of flexibility. For example, one could thread some non-thread-safe data through the first closure, spawning tasks with the right closure. (Though you'd still have to write tasks in a kind of recursive style.)

UPDATE: This doesn't quite work, but see this comment.

Usage with “Read” or “Iterator” as provider?

I see that only ranges and slices can be converted into parallel iterators.

i wonder how to best utilize rayon while reading from a file.

ATM I read files into buffers, then parse them using an iterator. i imagine there’s a better way to use rayon than:

let buffer = String::new();
File::open(some_path)?.read_to_string(buffer)?;  // iterate unindexed data once
let records: Vec<_> = ParseIter::new(buffer).collect();  // iterate unindexed data again
records.par_iter_mut().for_each(do_stuff);  // iterate indexed data in parallel

Busy yield-loops hurt performance

I noticed that the kernel scheduler was showing up a lot in my perf report, and I tracked this down to having so many sched_yield syscalls in waiting threads. Such busy work is wasted cpu time, not only for your own process but also if anything else is trying to run on the system.

I tried a crude patch below to use Condvar for the blocking latch wait, and it already shows a clear benefit. The other yield loop is in steal_until, which is harder because that's looping for either latch completion or available work -- suggestions welcome.

diff --git a/src/latch.rs b/src/latch.rs
index a81831d1cc7a..fbeff95acbd4 100644
--- a/src/latch.rs
+++ b/src/latch.rs
@@ -1,9 +1,11 @@
+use std::sync::{Mutex, Condvar};
 use std::sync::atomic::{AtomicBool, Ordering};
-use std::thread;

 /// A Latch starts as false and eventually becomes true. You can block
 /// until it becomes true.
 pub struct Latch {
+    m: Mutex<()>,
+    c: Condvar,
     b: AtomicBool
 }

@@ -11,19 +13,24 @@ impl Latch {
     #[inline]
     pub fn new() -> Latch {
         Latch {
+            m: Mutex::new(()),
+            c: Condvar::new(),
             b: AtomicBool::new(false)
         }
     }

     /// Set the latch to true, releasing all threads who are waiting.
     pub fn set(&self) {
+        let _guard = self.m.lock().unwrap();
         self.b.store(true, Ordering::SeqCst);
+        self.c.notify_all();
     }

     /// Spin until latch is set. Use with caution.
     pub fn wait(&self) {
+        let mut guard = self.m.lock().unwrap();
         while !self.probe() {
-            thread::yield_now();
+            guard = self.c.wait(guard).unwrap();
         }
     }

Before, running on Fedora 23 x86_64 with an i7-2600:

     Running `target/release/quicksort 1000`
Array length    : 1000K 32-bit integers
Repetitions     : 10
Parallel time   : 281422159
Sequential time : 848279007
Parallel speedup: 3.014258045685734

After, with Condvar wait:

     Running `target/release/quicksort 1000`
Array length    : 1000K 32-bit integers
Repetitions     : 10
Parallel time   : 225591277
Sequential time : 873530292
Parallel speedup: 3.872181157075502

Consider switching to jobsteal

jobsteal seems to be a nice library. It might make sense to switch the underlying parallel code here to use it, and just concentrate on the higher-level abstractions in Rayon. Got to do more work on those benchmarks. :)

`par_iter` for `Option`

It would be nice to have one to allow uses like:

paths
    .map(File::open)
    .flat_map(Result::ok)
    // ...

No way to force par_iter to run in parallel and not fall back to sequential execution

Currently there is no way to force par_iter to run in parallel and not fall back to sequential execution (except setting weight to an arbitrarily high value).

I think a simple method that guarantees the par_iter will run in parallel would suffice.