Comments (3)
from schwimmbad.
@enajx I ended up with some time tonight, so I did some tests.
First off, I can reproduce what you see on macOS. I can think of a few reasons why this might be happening, but I don't know exactly which applies. That said, I did some experiments, and have some recommendations.
Experiment: slow down the worker function
My first thought is that your worker function is very fast to execute, so it's possible that you are dominated by MPI overheads -- either in passing the large number of task messages, or in the pool waiting for workers to finish before sending new tasks. So, the first experiment I tried was to slow down your worker function by adding a time.sleep(1e-4) just under the function definition. That made the execution times much closer:
MPI
$ mpiexec -n 4 python demo.py --mpi
Elapsed time n=100: 0.0066378116607666016
Elapsed time n=1000: 0.054388999938964844
Elapsed time n=10000: 0.5381379127502441
Multiprocessing (MPI with -n 4 only uses 3 workers, so it's fairer to compare to 3 cores):
$ python demo.py --ncores=3
Elapsed time n=100: 0.48464012145996094
Elapsed time n=1000: 0.04478788375854492
Elapsed time n=10000: 0.4312679767608642
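For reference, the slowed-down worker might look like this (just a sketch -- I'm assuming your demo's worker takes an (a, b) task tuple; adjust to your actual signature):

```python
import time

def worker(task):
    # Artificial delay to simulate a worker with nontrivial per-task cost,
    # so that pool/MPI overheads no longer dominate the timings.
    time.sleep(1e-4)
    a, b = task
    return a**2 + b**2
```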
Batching tasks before map'ing
If your worker function is inherently very fast to execute and you just have a ton of tasks, I have gotten much better performance by first batching up the tasks and sending batches of tasks to the worker function.
Example script here (a modified version of your demo):
def worker(batch):
    # Each batch is (batch_index, list_of_tasks); process the whole chunk
    # in one call so the pool ships far fewer messages.
    _, tasks = batch
    results = []
    for a, b in tasks:
        results.append(a**2 + b**2)
    return results

def main(pool, n):
    from schwimmbad.utils import batch_tasks

    # Here we generate some fake data
    import random
    a = [random.random() for _ in range(n)]
    b = [random.random() for _ in range(n)]
    tasks = list(zip(a, b))

    # One batch per worker (note: max, not min -- min(1, ...) would put
    # everything in a single batch and kill the parallelism):
    batches = batch_tasks(max(1, pool.size - 1), arr=tasks)

    tic = time.time()
    results = pool.map(worker, batches)
    toc = time.time()

    # Flatten the per-batch result lists back into one flat list:
    results = [x for sublist in results for x in sublist]

    print(f'Elapsed time n={n}: {toc-tic}')
if __name__ == "__main__":
    import schwimmbad
    import time
    from argparse import ArgumentParser

    parser = ArgumentParser(description="Schwimmbad example.")
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--ncores", dest="n_cores", default=32,
                       type=int, help="Number of processes (uses multiprocessing).")
    group.add_argument("--mpi", dest="mpi", default=False,
                       action="store_true", help="Run with MPI.")
    args = parser.parse_args()

    pool = schwimmbad.choose_pool(mpi=args.mpi, processes=args.n_cores)

    for n in [10 ** x for x in range(2, 6+1)]:
        main(pool, n)
MPI:
$ mpiexec -n 4 python schw-test.py --mpi
Elapsed time n=100: 0.001007080078125
Elapsed time n=1000: 0.003452777862548828
Elapsed time n=10000: 0.03164792060852051
Elapsed time n=100000: 0.29506802558898926
Elapsed time n=1000000: 2.743013858795166
Multiprocessing:
$ python schw-test.py --ncores=3
Elapsed time n=100: 0.4871842861175537
Elapsed time n=1000: 0.001354217529296875
Elapsed time n=10000: 0.006591081619262695
Elapsed time n=100000: 0.07938289642333984
Elapsed time n=1000000: 0.699350118637085
So, MPI is still slower here, but closer than in your initial example (though you could probably get slightly better performance by tuning the number of batches).
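If you want to see what the batching does without pulling in schwimmbad, here is a simplified plain-Python stand-in (note: the real schwimmbad.utils.batch_tasks also attaches batch indices, which is why the worker above unpacks `_, tasks`; this sketch returns bare chunks):

```python
def make_batches(n_batches, tasks):
    # Split the task list into n_batches roughly equal chunks, so each
    # pool.map call ships one chunk (one message) instead of one task.
    n_batches = max(1, n_batches)
    size, rem = divmod(len(tasks), n_batches)
    batches, start = [], 0
    for i in range(n_batches):
        # Spread the remainder over the first `rem` chunks.
        stop = start + size + (1 if i < rem else 0)
        batches.append(tasks[start:stop])
        start = stop
    return batches

print(make_batches(3, list(range(10))))
# [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```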
That makes sense!
I ran the benchmark again with a more realistic worker function and now I get more consistent results, in this case with MPI taking the lead:
Computing pi, running on macOS with 8 cores:
n = 1000000
MultiPool : 0.54s
MPIPool : 0.12s
n = 10000000
MultiPool : 1.45s
MPIPool : 1.22s
import math
import random
import time

import schwimmbad

def sample(num_samples):
    # Count uniform points in [-1, 1]^2 that land inside the unit circle.
    num_inside = 0
    for _ in range(num_samples):
        x, y = random.uniform(-1, 1), random.uniform(-1, 1)
        if math.hypot(x, y) <= 1:
            num_inside += 1
    return num_inside

def approximate_pi_parallel(num_samples, cores, mpi):
    sample_batch_size = 1000
    with schwimmbad.choose_pool(mpi=mpi, processes=cores) as pool:
        print(pool)
        start = time.time()
        num_inside = 0
        batch_sizes = [sample_batch_size for _ in range(num_samples // sample_batch_size)]
        for result in pool.map(sample, batch_sizes):
            num_inside += result
        print(f"pi ~= {(4*num_inside)/num_samples}")
        print(f"Finished in: {time.time()-start}s")

if __name__ == "__main__":
    n = 1000000
    # n = 10000000
    cores = 8
    mpi = False  # True when run with mpiexec
    approximate_pi_parallel(n, cores, mpi)
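As a quick serial sanity check of the estimator itself (same sample function, no pool): the hit fraction estimates pi/4, and the Monte Carlo error shrinks like 1/sqrt(n).

```python
import math
import random

def sample(num_samples):
    # Count uniform points in [-1, 1]^2 that land inside the unit circle;
    # the hit fraction estimates pi/4.
    num_inside = 0
    for _ in range(num_samples):
        x, y = random.uniform(-1, 1), random.uniform(-1, 1)
        if math.hypot(x, y) <= 1:
            num_inside += 1
    return num_inside

random.seed(42)
n = 100_000
print(f"pi ~= {4 * sample(n) / n}")  # close to math.pi
```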
Issue solved, thank you!