In LB a lot of stuff is send aroudn twice in ghost comm: Push schem

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Concerning observabels: if they only use info from the lattice

Here are the benchmark results: <a target="_blank" rel="noopener nor

LB: avoid double communicaiton about espresso HOT 5 OPEN

RudolfWeeber commented on September 23, 2024

LB: avoid double communicaiton

from espresso.

Comments (5)

RudolfWeeber commented on September 23, 2024

@jngrad

I'd suggest the following:

work on the push scheme only
define a separate communcaotr which communicates only the velocity field
use that instead of the full communicaotr a thte end of integrate_push() and check improvement in performance and scaleability.
If the improvement gained form that is relevant, the get_interpolated_density() needs to be fixed or removed (it relies on the now missing ghost communication of the full pdf). The easiest fix is to trigger a full ghost communicaiot of the pdfs when someune queries an interpolated density. Its slow, but I'm not aware of a cuse case of that getter, anyway.

from espresso.

jngrad commented on September 23, 2024

For a 48x48x48 LB grid, removing the pdfs and forces from the communicator improves runtime significantly:

25% faster on CPU with 8 cores
40% faster on GPU with 1 core or 8 cores

Similar observation with particle coupling enabled, using the defaults: 30x30x30 box with 125 particles per core.

I introduced a std::bitset where individual bits are addressed using an enum value to keep track of which fields have outdated halos. Python setter functions automatically do a full LB ghost communication after changing fields, so that getter functions always access correct data. But in C++ unit tests, one has to manually call the LB ghost communication after any call to a setter function or the integration sweep.

Assertions were introduced to make sure getter functions never access outdated halo cells if the consider_ghosts flag is set to true. While the full testsuite passes with these assertions, more testing is needed, in particular to figure out if the LB observables need to manually call the LB ghost update before collecting data. I'll also run more thorough benchmarks.

from espresso.

RudolfWeeber commented on September 23, 2024

Concerning observabels:

if they only use info from the lattice sites (total momenutm, etc), probably not, as each MPI rank can read from its local LB cells and no info from ghost laers is needed
velocity profile at arbitrary posiitons; probably not, should use same interpolated velociyt getter as particle coupling, which relies on already ghost-communicated pre-calculated velocity field
density and momentum density profile at arbitrary positions: probably yes, due to the density which we calculate direclyt from PDFs, which we no longer ghost communicate.

from espresso.

RudolfWeeber commented on September 23, 2024

Great that it worked out. Thsi probably means, we should put more effort into reducing communication volume. The next steps would be:

In the m_pdf_streaming_communicator replace the Packinfo which packs the entire field by a generated pack info which only packs the data actually used on the other side by the streaming kernel (lbmpy_walberla.packinfo.generate_lb_pack_info does this, I think)

Notes:

according to doc string, passing the collisiom kernel to the function as well will add the other fields the collision kernel needs to the pack info. This is the last_applied_force in our case. It should then be possible to remove the separate PackInfo for the last_applied_force from the m_pdf_streaming_communicator as well.
I think we need streaming pattern "pull", in spite of theis affecitng integrate_push_scheme. Not sure.
If using teh generated pack info improves peroformance, the next step would be to have an AVX-vectorized version as well

from espresso.

jngrad commented on September 23, 2024

Here are the benchmark results:

We are now outperforming the ESPResSo 4.2 LB implementation up to and including 16 cores on the CPU. There is no AVX implementation of the PackInfo, since pystencils generates the exact same code with and without AVX. It's also unclear to me, whether AVX would be beneficial here, since the data in the bufffer doesn't have the same alignment as in the pdfs. Maybe for vector fields it would make sense, but that buffer is one order of magnitude smaller that the pdfs buffer, and we would need to increase the buffer size by 33% to get the proper alignment, not mentioning the changes to the field itself.

The linked PR raised new questions about and MPI buffers data alignment and UBB interactions with the streaming step communication. I invested two days on solving these issues, but could only come up with temporary workarounds. It is unclear if it's worth pursuing this investigation further, as we already agreed on switching to the pull scheme in the near future.

from espresso.

LB: avoid double communicaiton about espresso HOT 5 OPEN

Comments (5)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent