Hi all, I have conducted a strong-scaling analysis for a small spinu

Poor performance of amr-wind about amr-wind HOT 14 CLOSED

Armin-Ha commented on August 22, 2024

Poor performance of amr-wind

from amr-wind.

Comments (14)

marchdf commented on August 22, 2024 1

Hi, thanks for reaching out! The short answer: strong scaling for a code that spends most of its time in linear solvers (as amr-wind does) can be very difficult in general.

However, there are certain things you could do to get the most performance out of your case:

compile with the profiler on (tiny-profile: AMR_WIND_ENABLE_TINY_PROFILE:ON) so you can see where it is spending its time
vary the input parameter amr.blocking_factor from 4 to 32 by powers of 2
vary the input parameter amr.max_grid_size from 4 to 256 by powers of 2
increase the amount of work per core with a bigger cell count
use an Intel compiler
try threading with OpenMP
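
A minimal sketch of how the two grid-parameter sweeps above could be scripted. It only prints launch commands and does not run anything. The executable name `amr_wind`, the `mpirun -np` launcher, and the input file name `abl.inp` are placeholders, and it assumes the usual AMReX behaviour that `name=value` pairs after the input file override its settings and that `amr.max_grid_size` has to be a multiple of `amr.blocking_factor`.

```python
from itertools import product


def sweep_commands(input_file, n_ranks, exe="amr_wind", launcher="mpirun"):
    """Yield one launch command per valid (blocking_factor, max_grid_size) pair.

    `exe` and `launcher` are placeholders; adjust them for your system.
    """
    blocking_factors = [4, 8, 16, 32]              # 4 to 32 by powers of 2
    max_grid_sizes = [2**k for k in range(2, 9)]   # 4 to 256 by powers of 2
    for bf, mgs in product(blocking_factors, max_grid_sizes):
        # Assumption: max_grid_size must be a multiple of blocking_factor,
        # so skip the combinations that violate this.
        if mgs % bf != 0:
            continue
        yield (
            f"{launcher} -np {n_ranks} {exe} {input_file} "
            f"amr.blocking_factor={bf} amr.max_grid_size={mgs}"
        )


commands = list(sweep_commands("abl.inp", n_ranks=8))
for cmd in commands:
    print(cmd)
```

Running each printed command with the profiler compiled in should give one timing table per combination, which makes it easier to see which setting helps on a given machine.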

We don't have good guidance for your case because we typically don't spend much time profiling at this scale. And these things vary quite a bit machine-to-machine. We do spend a lot of time thinking about code performance for GPUs and for O(10-100k) MPI ranks and have some better ideas for the types of numbers that will lead to better performance.

After all this, if the code is still not fast enough, then we need to start talking about linear solver input parameters.

marchdf commented on August 22, 2024 1

Thanks for the update. @lawrenceccheung do you have the input file for Armin to try?

I am running some local tests on my machine to see if there are better settings for your specific case. I will be out for the next week or so though.

lawrenceccheung commented on August 22, 2024 1

Hi @Armin-Ha,

Yes, you can try running the 512x512x512 that I used here: https://github.com/lawrenceccheung/ALCC_Frontier_WindFarm/blob/main/precursor/scaling/Baseline_level0/MedWS_LowTI_precursor1.inp. Just set time.max_step or time.stop_time to something small to get a few iterations for the purposes of timing.

Lawrence

marchdf commented on August 22, 2024 1

Minor update. I ran @Armin-Ha's case on a local machine (AMD EPYC-Rome Processor) and got the following strong-scaling results:

notes:

@Armin-Ha's data is in red.
I played with just one of the many parameters for tuning (amr.max_grid_size= 16, 32, 64)
Scaling is good until 5e5 cells per proc approximately
Increasing amr.max_grid_size mostly improves runtime, but it gets worse at low cell counts per proc because there are too few boxes to spread evenly across all the procs
@Armin-Ha's scaling is poor compared to these data. Bad MPI implementation? Bad procs? Not sure what is going on with that system.
I would imagine playing with other parameters could get the scaling at low cells/core to be better. Maybe even playing with OMP.

This, with the data @lawrenceccheung presented at a much higher proc count, seems to me to indicate that there is not much of a strong scaling problem on CPUs with amr-wind.
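
To illustrate the note about larger amr.max_grid_size hurting at high proc counts, here is a rough sketch that counts boxes for the 256x256x256 case. It assumes the domain is chopped evenly into the largest boxes allowed, which is a simplification of what the gridding algorithm does, and the rank counts are hypothetical, not the ones used in the runs above.

```python
def box_count(cells_per_dim, max_grid_size):
    """Boxes in a cubic domain chopped into max_grid_size^3 pieces (assumed even chop)."""
    per_dim = -(-cells_per_dim // max_grid_size)  # ceiling division
    return per_dim**3


def idle_ranks(n_boxes, n_ranks):
    """Ranks that cannot receive a box when there are fewer boxes than ranks."""
    return max(0, n_ranks - n_boxes)


for mgs in (16, 32, 64):
    boxes = box_count(256, mgs)
    for ranks in (8, 64, 128):  # hypothetical rank counts
        print(
            f"max_grid_size={mgs:3d} boxes={boxes:5d} ranks={ranks:4d} "
            f"boxes/rank={boxes / ranks:8.2f} idle={idle_ranks(boxes, ranks)}"
        )
```

With amr.max_grid_size=64 this mesh gives only 64 boxes, so under these assumptions any run with more than 64 ranks leaves some ranks without work.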

marchdf commented on August 22, 2024 1

For this issue, I don't want to discuss the weirdnesses of CPU runs on Kestrel. There are known issues with that machine and those are being worked on. For the case that opened up this issue, we've shown that it can scale on CPUs. And on a machine we trust (Frontier), we get good scaling on a large (similar) case.

From @Armin-Ha's reaction to my post about his case, we can close his issue. Please feel free to reopen @Armin-Ha if you need to discuss further.

asalmgren commented on August 22, 2024

@Armin-Ha -- just to follow up -- when you have a chance to re-run with the profiling on could you send us the output files (maybe just from 1, 4 and 8 cores). Also -- it looks like you do have the checkpointing and plotfiles on -- could you turn those off before re-running? And feel free to run fewer steps -- if I'm correctly reading your inputs file you are running over 14000 steps and writing plotfiles/checkpoints roughly 28 times? See what happens if you maybe run 100 steps for each case with all the I/O off? Thx

Armin-Ha commented on August 22, 2024

Hi Ann, Thanks for the reply. I conducted the simulations for around 50 steps, so no checkpointing or file plotting was involved except for the initial time. As you know, AMR-wind outputs the total time for every single time step, and I have taken the average of these times for 50 steps to exclude the writing time for the initial checkpoint and plot files. I will re-run the cases as you'd like and send you the output files. In addition, I will examine Marc's suggestions to improve the performance.
Best regards,
Armin
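
For reference, a small sketch of the averaging and speedup calculation described above. The `skip` argument (dropping the first step, which may include the initial checkpoint and plot file writes) is an assumption about how the averaging could be done, and the timings in the example are made-up placeholders, not measurements from this thread.

```python
def mean_step_time(step_times, skip=1):
    """Average per-step wall time, ignoring the first `skip` steps."""
    kept = step_times[skip:]
    return sum(kept) / len(kept)


def strong_scaling(time_by_ranks):
    """Return (ranks, speedup, parallel efficiency) rows relative to the smallest run."""
    base_ranks = min(time_by_ranks)
    base_time = time_by_ranks[base_ranks]
    rows = []
    for ranks in sorted(time_by_ranks):
        speedup = base_time / time_by_ranks[ranks]
        efficiency = speedup * base_ranks / ranks
        rows.append((ranks, speedup, efficiency))
    return rows


# Placeholder timings in seconds per step (NOT measured data).
example = {1: 8.0, 2: 4.0, 4: 2.5, 8: 2.0}
rows = strong_scaling(example)
for ranks, speedup, efficiency in rows:
    print(f"{ranks:3d} ranks: speedup {speedup:5.2f}, efficiency {efficiency:5.2f}")
```

Ideal strong scaling gives an efficiency of 1.0 at every rank count; values well below that at low core counts are what this issue is about.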

asalmgren commented on August 22, 2024

Sounds great, thanks! The most important thing for me to look at will be the profiling results that are printed at the end of the run. To clarify - did you run with 512 grids for each run? Or use fewer (larger) boxes at lower core counts?
Ann Almgren

Armin-Ha commented on August 22, 2024

I will provide you with the profiling results. Throughout the study, I maintained a fixed mesh of 256x256x256 cells with a fixed domain size of 2560x2560x1280 m³. The only variable I modified among different simulations was the number of cores.

Best regards,
Armin

lawrenceccheung commented on August 22, 2024

Hi @Armin-Ha,

For comparison, here are some strong scaling results of AMR-Wind that we've observed (the plots are time per timestep, which can be converted to the speedups calculated). This is on a 512 x 512 x 512 ABL case using CPUs and GPUs of the Frontier cluster.

The details of the hardware are here: https://docs.olcf.ornl.gov/systems/frontier_user_guide.html#system-overview, but the CPUs are AMD 3rd Gen EPYC processors. Let me know if you have any questions.

Cheers,

Lawrence

Armin-Ha commented on August 22, 2024

Hi @asalmgren and @lawrenceccheung,

Sorry for my late reply, and thanks for sharing the strong scaling results of AMR-Wind, which appear to be reasonably linear on AMD 3rd Gen EPYC processors. I would appreciate it if you could provide me with the input file used for this analysis.

I have replicated the analysis for the small spinup simulation with a 256x256x256 mesh (10 m x 10 m x 5 m cells) on an Intel Xeon W-2145. The corresponding log files, which include the profiling outcomes, are attached.

log_1cores.txt
log_2cores.txt
log_4cores.txt
log_8cores.txt

Best regards,
Armin

michaelasprague commented on August 22, 2024

Wanted to share an observation. The results from @lawrenceccheung above show excellent strong scaling down to about 34,000 cells per rank, whereas results from @marchdf show good strong scaling down to only about 500,000 cells per rank. These are different cases and different machines, but it seems that more performance could be gained in @Armin-Ha's case.
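
As a rough sketch of what these two thresholds mean in terms of rank counts, the arithmetic below converts a cells-per-rank limit into the largest rank count that stays above it. The function names are illustrative only, and the thresholds come from different cases and machines, so this is a back-of-the-envelope guide rather than a prediction.

```python
def cells_per_rank(nx, ny, nz, ranks):
    """Average number of cells each rank owns on a single-level mesh."""
    return nx * ny * nz / ranks


def max_ranks_above(nx, ny, nz, min_cells_per_rank):
    """Largest rank count that keeps at least `min_cells_per_rank` cells per rank."""
    return (nx * ny * nz) // min_cells_per_rank


# 256^3 case from this issue, evaluated against both thresholds quoted above.
print(cells_per_rank(256, 256, 256, 8))          # cells per rank on 8 cores
print(max_ranks_above(256, 256, 256, 500_000))   # limit at ~500,000 cells per rank
print(max_ranks_above(256, 256, 256, 34_000))    # limit at ~34,000 cells per rank
# 512^3 case, evaluated against the ~34,000 cells per rank threshold.
print(max_ranks_above(512, 512, 512, 34_000))
```

By this arithmetic the 256x256x256 case on 8 cores still has about 2.1 million cells per rank, which is above both thresholds quoted above.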

rthedin commented on August 22, 2024

Just to add to the discussion, I have a test case that I have been using to test compilations on Kestrel. My test case has about 50M cells and two refinements. I run it with and without two ALM turbines. Performance starts to drop at ~120k cells per rank, and lower than that is really not good. I understand my cases and numbers are different from the ones you are all discussing, and most importantly, I'm running on Kestrel, but just wanted to share what I found. I built my test case from a user point of view and it is supposed to mimic a real case I would run, hence the turbines and refinements. Note that I have also not changed amr.max_grid_size from its default.

Edit: this is all on CPUs.

asalmgren commented on August 22, 2024

As an FYI to all on this thread -- AMReX has actually just changed the default max_grid_size for 3D runs on GPUs from 32 to 64 -- some amount of performance testing seems to indicate that's a win. Your mileage may vary of course. We actually suggest the same for CPU runs but thought it would be less disruptive for users to do this for GPU-only first and see if there are any gotchas.
