Comments (14)

marchdf commented on August 22, 2024

Hi, thanks for reaching out! The short answer: strong scaling for a code that spends most of its time in linear solvers (as amr-wind does) can be very difficult in general.

However, there are certain things you could do to get the most performance out of your case:

  • compile with the profiler on (tiny-profile: AMR_WIND_ENABLE_TINY_PROFILE:ON) so you can see where it is spending its time
  • vary the input parameter amr.blocking_factor from 4 to 32 by powers of 2
  • vary the input parameter amr.max_grid_size from 4 to 256 by powers of 2
  • increase the amount of work per core with a bigger cell count
  • use an Intel compiler
  • try threading with OpenMP
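As a rough sketch of the two parameter sweeps above, the following Python snippet lists candidate (amr.blocking_factor, amr.max_grid_size) pairs. The helper names and the 256-cell domain size are illustrative assumptions, and the filter encodes the usual AMReX expectation that max_grid_size is a multiple of blocking_factor and that the domain is divisible by blocking_factor; check the constraints for your own build before relying on it.

```python
from itertools import product


def powers_of_two(lo, hi):
    """Return the powers of two in [lo, hi], inclusive."""
    values = []
    v = 1
    while v <= hi:
        if v >= lo:
            values.append(v)
        v *= 2
    return values


def sweep(domain_cells=256):
    """List (blocking_factor, max_grid_size) pairs to try.

    domain_cells is the number of cells per direction (assumed 256 here).
    """
    blocking_factors = powers_of_two(4, 32)
    max_grid_sizes = powers_of_two(4, 256)
    combos = []
    for bf, mgs in product(blocking_factors, max_grid_sizes):
        # Assumption: max_grid_size must be a multiple of blocking_factor,
        # and the domain must be divisible by blocking_factor.
        if mgs % bf == 0 and domain_cells % bf == 0:
            combos.append((bf, mgs))
    return combos


for bf, mgs in sweep():
    print(f"amr.blocking_factor = {bf}    amr.max_grid_size = {mgs}")
```

Each printed pair is one timing run; in practice you would probably start with a handful of them rather than the full list.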

We don't have good guidance for your case because we typically don't spend much time profiling at this scale. And these things vary quite a bit machine-to-machine. We do spend a lot of time thinking about code performance for GPUs and for O(10-100k) MPI ranks and have some better ideas for the types of numbers that will lead to better performance.

After all this, if the code is still not fast enough, then we need to start talking about linear solver input parameters.

from amr-wind.

marchdf commented on August 22, 2024

Thanks for the update. @lawrenceccheung do you have the input file for Armin to try?

I am running some local tests on my machine to see if there are better settings for your specific case. I will be out for the next week or so though.

lawrenceccheung commented on August 22, 2024

Hi @Armin-Ha,

Yes, you can try running the 512x512x512 case that I used here: https://github.com/lawrenceccheung/ALCC_Frontier_WindFarm/blob/main/precursor/scaling/Baseline_level0/MedWS_LowTI_precursor1.inp. Just set time.max_step or time.stop_time to something small to get a few iterations for the purposes of timing.

Lawrence

marchdf commented on August 22, 2024

Minor update. I ran @Armin-Ha's case on a local machine (AMD EPYC-Rome Processor). And I get the following for strong scaling:

[Image: strong scaling plot, screenshot from 2024-07-09]

notes:

  • @Armin-Ha's data is in red.
  • I played with just one of the many parameters for tuning (amr.max_grid_size = 16, 32, 64)
  • Scaling is good until approximately 5e5 cells per proc
  • Increasing amr.max_grid_size mostly improves runtime, but it gets worse at low cells/proc because the cells can no longer be distributed well across all the procs
  • @Armin-Ha's scaling is poor compared to these data. Bad MPI implementation? Bad procs? Not sure what is going on with that system.
  • I would imagine playing with other parameters could get the scaling at low cells/core to be better. Maybe even playing with OMP.
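To illustrate the point about amr.max_grid_size limiting how the cells can be spread over procs, here is a minimal sketch that counts the boxes a single-level cubic domain is chopped into. It assumes the 256x256x256 mesh from this issue and that every box is exactly max_grid_size cells on a side; the function name is made up for illustration, and AMReX may split boxes further in practice, so treat the numbers as a lower bound on the box count.

```python
def box_count(domain_cells, max_grid_size):
    """Boxes per level for a cubic domain, assuming boxes of exactly
    max_grid_size cells per side (a simplification of what AMReX does)."""
    per_dim = -(-domain_cells // max_grid_size)  # ceiling division
    return per_dim ** 3


for mgs in (16, 32, 64):
    n = box_count(256, mgs)
    print(f"amr.max_grid_size = {mgs}: {n} boxes of {mgs**3} cells each")
```

Under this simplification, max_grid_size = 64 gives only 64 boxes on a 256x256x256 mesh, so a run with more procs than boxes would leave some procs without work unless the grids are split further.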

This, with the data @lawrenceccheung presented at a much higher proc count, seems to me to indicate that there is not much of a strong scaling problem on CPUs with amr-wind.

marchdf commented on August 22, 2024

For this issue, I don't want to discuss the weirdnesses of CPU runs on Kestrel. There are known issues with that machine and those are being worked on. For the case that opened up this issue, we've shown that it can scale on CPUs. And on a machine we trust (Frontier), we get good scaling on a large (similar) case.

From @Armin-Ha's reaction to my post about his case, we can close his issue. Please feel free to reopen @Armin-Ha if you need to discuss further.

asalmgren commented on August 22, 2024

@Armin-Ha -- just to follow up -- when you have a chance to re-run with the profiling on could you send us the output files (maybe just from 1, 4 and 8 cores). Also -- it looks like you do have the checkpointing and plotfiles on -- could you turn those off before re-running? And feel free to run fewer steps -- if I'm correctly reading your inputs file you are running over 14000 steps and writing plotfiles/checkpoints roughly 28 times? See what happens if you maybe run 100 steps for each case with all the I/O off? Thx

Armin-Ha commented on August 22, 2024

Hi Ann, Thanks for the reply. I conducted the simulations for around 50 steps, so no checkpointing or file plotting was involved except for the initial time. As you know, AMR-wind outputs the total time for every single time step, and I have taken the average of these times for 50 steps to exclude the writing time for the initial checkpoint and plot files. I will re-run the cases as you'd like and send you the output files. In addition, I will examine Marc's suggestions to improve the performance.
Best regards,
Armin

Armin-Ha commented on August 22, 2024

I will provide you with the profiling results. Throughout the study, I maintained a fixed mesh of 256x256x256 cells with a fixed domain size of 2560x2560x1280 m^3. The only variable I modified among different simulations was the number of cores.

Best regards,
Armin

lawrenceccheung commented on August 22, 2024

Hi @Armin-Ha,

For comparison, here are some strong scaling results of AMR-Wind that we've observed (the plots are time per timestep, which can be converted to the speedups calculated). This is on a 512 x 512 x 512 ABL case using CPUs and GPUs of the Frontier cluster.
[Image: time per timestep for the 512 x 512 x 512 ABL case on Frontier CPUs and GPUs]
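For anyone converting time per timestep into speedup, a minimal sketch of that conversion follows. The function name is illustrative, and the timings in the example are placeholders invented only to exercise the function; they are not measurements from Frontier or from any run in this thread.

```python
def strong_scaling(times_per_step):
    """Convert {ranks: seconds per timestep} into (ranks, speedup, efficiency).

    Speedup is relative to the smallest rank count in the input, and
    efficiency is speedup divided by the increase in rank count.
    """
    base_ranks = min(times_per_step)
    base_time = times_per_step[base_ranks]
    rows = []
    for ranks in sorted(times_per_step):
        speedup = base_time / times_per_step[ranks]
        efficiency = speedup / (ranks / base_ranks)
        rows.append((ranks, speedup, efficiency))
    return rows


# Placeholder timings (made up, NOT measured) just to show the output format.
example = {1: 8.0, 2: 4.0, 4: 2.5, 8: 2.0}
for ranks, speedup, efficiency in strong_scaling(example):
    print(f"{ranks:4d} ranks: speedup {speedup:.2f}, efficiency {efficiency:.0%}")
```

An efficiency near 100% means the run is still scaling linearly; the rank count where it starts to fall off is the number to compare between machines.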

The details of the hardware are here: https://docs.olcf.ornl.gov/systems/frontier_user_guide.html#system-overview, but the CPUs are AMD 3rd Gen EPYC processors. Let me know if you have any questions.

Cheers,

Lawrence

Armin-Ha commented on August 22, 2024

Hi @asalmgren and @lawrenceccheung,

Sorry for my late reply, and thanks for sharing the strong scaling results of AMR-Wind, which appear to be reasonably linear on AMD 3rd Gen EPYC processors. I would appreciate it if you could provide me with the input file used for this analysis.

I have replicated the analysis for the small spinup simulation with a 256x256x256 mesh (10 m x 10 m x 5 m cells) on an Intel Xeon W-2145. The corresponding log files, which include the profiling outcomes, are attached.

log_1cores.txt
log_2cores.txt
log_4cores.txt
log_8cores.txt

Best regards,
Armin

michaelasprague commented on August 22, 2024

Wanted to share an observation. The results from @lawrenceccheung above show excellent strong scaling down to about 34,000 cells per rank, whereas results from @marchdf show good strong scaling down to only about 500,000 cells per rank. These are different cases and different machines, but it seems that more performance could be gained in @Armin-Ha's case.
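To put those two cells-per-rank figures in terms of rank counts, here is a small back-of-the-envelope sketch. It assumes single-level cubic meshes of 512x512x512 for the Frontier case and 256x256x256 for the local case (the mesh sizes mentioned in this thread); the function name is made up for illustration.

```python
def ranks_at(cells_per_dim, cells_per_rank):
    """Number of ranks at which a cubic mesh reaches a given cells-per-rank."""
    total_cells = cells_per_dim ** 3
    return total_cells / cells_per_rank


# 512^3 case scaling well down to ~34,000 cells per rank
print(f"512^3 at 34,000 cells/rank:  ~{ranks_at(512, 34_000):.0f} ranks")
# 256^3 case scaling well down to ~500,000 cells per rank
print(f"256^3 at 500,000 cells/rank: ~{ranks_at(256, 500_000):.0f} ranks")
```

Under these assumptions the two thresholds correspond to very different rank counts, which is worth keeping in mind when comparing the two sets of results.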

rthedin commented on August 22, 2024

Just to add to the discussion, I have a test case that I have been using to test compilations on Kestrel. My test case has about 50M cells and two refinements. I run it with and without two ALM turbines. Performance starts to drop at ~120k cells per rank, and lower than that is really not good. I understand my cases and numbers are different from the ones you are all discussing, and most importantly, I'm running on Kestrel, but I just wanted to share what I found. I built my test case from a user point of view, and it is supposed to mimic a real case I would run, hence the turbines and refinements. Note that I have also not changed amr.max_grid_size from its default.

[Image: scaling results for the Kestrel test case]

Edit: this is all on CPUs.
