Comments (12)

jngrad commented on July 18, 2024

One can influence the destruction order by adding the following code snippets in every failing test (the #define before #include <boost/test/unit_test.hpp>, the rest after it) and linking the tests against Espresso::core and Boost::mpi via CMake:

#define BOOST_TEST_NO_MAIN
#include "communication.hpp"
int main(int argc, char **argv) {
  auto mpi_env = std::make_shared<boost::mpi::environment>(argc, argv);
  Communication::init(mpi_env);
  return boost::unit_test::unit_test_main(init_unit_test, argc, argv);
}

This is not a sustainable solution, and it will break as soon as the order of instantiation of static globals changes, for example when ESPResSo features that introduce global variables are disabled, or when a different Boost release is used.

jngrad commented on July 18, 2024

After fixing

export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}${BOOST_ROOT}"

to

export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}${BOOST_ROOT}/lib"

the bug can be properly investigated with GDB on Ubuntu 22.04.

It's not clear to me why ESPResSo needs to manage the lifetime of the boost::mpi::detail::mpi_datatype_map singleton. When we don't extend its lifetime, several tests experience segmentation faults or timeouts during normal program termination. Here is the trace when the static global that keeps a reference to the singleton is removed:

Thread 1 "ReactionAlgorit" received signal SIGSEGV, Segmentation fault.
0x00001555544c6504 in std::_Rb_tree_increment(std::_Rb_tree_node_base*) () from /lib/x86_64-linux-gnu/libstdc++.so.6
(gdb) bt
#0  0x00001555544c6504 in std::_Rb_tree_increment(std::_Rb_tree_node_base*) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#1  0x00001555549e1b45 in std::_Rb_tree_iterator<std::pair<std::type_info const* const, ompi_datatype_t*> >::operator++ (
    this=0x7fffffffd320) at /usr/include/c++/10/bits/stl_tree.h:287
#2  0x00001555549e155f in boost::mpi::detail::mpi_datatype_map::clear (
    this=0x1555549ff568 <boost::mpi::detail::mpi_datatype_cache()::cache>) at libs/mpi/src/mpi_datatype_cache.cpp:36
#3  0x00001555549dd6f8 in boost::mpi::environment::~environment (this=0x5555555ca930, __in_chrg=<optimized out>)
    at libs/mpi/src/environment.cpp:184
#4  0x0000155554fa621c in __gnu_cxx::new_allocator<boost::mpi::environment>::destroy<boost::mpi::environment> (
    this=0x5555555ca930, __p=0x5555555ca930) at /usr/include/c++/10/ext/new_allocator.h:162
#5  0x0000155554fa61e7 in std::allocator_traits<std::allocator<boost::mpi::environment> >::destroy<boost::mpi::environment> (
    __a=..., __p=0x5555555ca930) at /usr/include/c++/10/bits/alloc_traits.h:531
#6  0x0000155554fa60a1 in std::_Sp_counted_ptr_inplace<boost::mpi::environment, std::allocator<boost::mpi::environment>,
    (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=0x5555555ca920)
    at /usr/include/c++/10/bits/shared_ptr_base.h:560
#7  0x0000555555588bd7 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x5555555ca920)
    at /usr/include/c++/10/bits/shared_ptr_base.h:158
#8  0x0000555555584321 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (
    this=0x15555539cd38 <Communication::mpi_env+8>, __in_chrg=<optimized out>) at /usr/include/c++/10/bits/shared_ptr_base.h:736
#9  0x0000155554f9eb16 in std::__shared_ptr<boost::mpi::environment, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (
    this=0x15555539cd30 <Communication::mpi_env>, __in_chrg=<optimized out>) at /usr/include/c++/10/bits/shared_ptr_base.h:1188
#10 0x0000155554f9ec86 in std::shared_ptr<boost::mpi::environment>::~shared_ptr (this=0x15555539cd30 <Communication::mpi_env>,
    __in_chrg=<optimized out>) at /usr/include/c++/10/bits/shared_ptr.h:121
#11 0x0000155554045a56 in __cxa_finalize (d=0x15555539c560) at ./stdlib/cxa_finalize.c:83
#12 0x0000155554f71d67 in __do_global_dtors_aux () from /tmp/boost/espresso/build/src/core/espresso_core.so
#13 0x00007fffffffd8c0 in ?? ()
#14 0x000015555552024e in _dl_fini () at ./elf/dl-fini.c:142
Backtrace stopped: frame did not save the PC

Running the Python testsuite shows three outcomes: no error, a segmentation fault, or a timeout. The timeout most likely happens during atexit, because the test itself reports success. This behavior is quite similar to what EESSI maintainers observe on ARM architectures, although they didn't report segmentation faults (EESSI/software-layer#363) and, to my knowledge, didn't remove the static reference. Here is an excerpt of the Python testsuite log:

187/191 Test #100: dipolar_p3m .......................................................***Timeout 300.10 sec
.
----------------------------------------------------------------------
Ran 1 test in 10.268s

OK

188/191 Test  #55: observables .......................................................***Timeout 300.24 sec
.......s.........
----------------------------------------------------------------------
Ran 17 tests in 5.649s

OK (skipped=1)

189/191 Test  #10: test_checkpoint__therm_lb__p3m_gpu__lj__lb_walberla_cpu_binary ....***Timeout 300.24 sec
.......s.sss.s.sssss.ss....s..ss..ssssss....
----------------------------------------------------------------------
Ran 44 tests in 0.394s

OK (skipped=21)

jngrad commented on July 18, 2024

I took a different angle by creating a struct MpiContainer to encapsulate all 3 globals and manage their lifetime and destruction order through a static smart pointer:

namespace Communication {
//static auto const &mpi_datatype_cache = boost::mpi::detail::mpi_datatype_cache();
static std::shared_ptr<boost::mpi::environment> mpi_env;
static std::shared_ptr<MpiCallbacks> m_callbacks;
} // namespace Communication

struct MpiContainer {
    boost::mpi::detail::mpi_datatype_map const &mpi_datatype_cache = boost::mpi::detail::mpi_datatype_cache();
    std::shared_ptr<boost::mpi::environment> mpi_env;
    std::shared_ptr<Communication::MpiCallbacks> m_callbacks;
    ~MpiContainer() {
        m_callbacks.reset();
        Communication::m_callbacks.reset();
        mpi_env.reset();
        Communication::mpi_env.reset();
    }
    MpiContainer() {
        mpi_env = Communication::mpi_env;
        m_callbacks = Communication::m_callbacks;
    }
};

static std::unique_ptr<MpiContainer> mpi_container;

void atexit_handler() {
    if (Communication::m_callbacks) {
        mpi_container->m_callbacks.reset();
        Communication::m_callbacks.reset();
    }
}

The atexit event handler frees the MpiCallbacks handle, which in turn frees the boost::mpi::environment handle, but this introduces more issues, because MpiCallbacks must remain alive until all dependent Context objects from the ESPResSo ScriptInterface have gone through their destructors. In C++, the rule is that static variables are destroyed during atexit, in reverse order of static initialization and event registration, respectively. The atexit event was registered right after MpiCallbacks was initialized in Communication::init(), so that the handler would run before ~MpiCallbacks(). However, script interface objects live in a different translation unit, and thus we cannot control their order of destruction. That order actually changes from one run to the next, making this new bug difficult to reproduce.
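
To illustrate the ordering rule with a minimal standalone example (plain C++, not ESPResSo code): objects with static storage duration and atexit handlers are torn down in reverse order of construction and registration, interleaved:

#include <cstdio>
#include <cstdlib>

struct Tracer {
  const char *name;
  explicit Tracer(const char *n) : name(n) { std::printf("construct %s\n", name); }
  ~Tracer() { std::printf("destroy %s\n", name); }
};

static Tracer first("first"); // constructed before main(), destroyed last

int main() {
  std::atexit([]() { std::printf("atexit handler\n"); });
  static Tracer second("second"); // constructed after the atexit registration
  return 0;
  // output at termination: "destroy second", then "atexit handler",
  // then "destroy first"
}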

jngrad commented on July 18, 2024

I think the way to solve this issue is to:

  1. not tamper with the MPI datatype cache lifetime, i.e. never call the singleton manually
  2. not keep the boost::mpi::environment alive until atexit

Resolving 1. is easy: never call boost::mpi::detail::mpi_datatype_cache() anywhere; it's an implementation detail. Resolving 2. is a lot harder: the MpiCallbacks handle needs the MPI environment handle during Python atexit, so we can rewrite MpiCallbacks to keep the MPI environment handle alive (see the sketch after the list below). Here is how we need to adapt MPI initialization:

  1. Python interface: the MPI environment is kept alive by _init.pyx (inside a shared pointer) and the MpiCallbacks handle is kept alive by both script_interface.pyx (inside the global context shared pointer) and communication.cpp (inside the static global); a cleanup function is registered to delete all 3 shared pointers during Python atexit
  2. C++ unit tests: the main function contains a MpiContainer handle that keeps the MpiCallbacks handle alive, and it expires as soon as the testsuite ends
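
As a rough sketch of the ownership idea behind point 2 (the constructor signature and member names below are assumptions for illustration, not the actual ESPResSo API), MpiCallbacks would simply hold the environment shared pointer so that MPI cannot be finalized while callbacks are still alive:

#include <memory>
#include <utility>
#include <boost/mpi/communicator.hpp>
#include <boost/mpi/environment.hpp>

namespace Communication {
class MpiCallbacks {
public:
  // hypothetical constructor: takes shared ownership of the environment
  MpiCallbacks(boost::mpi::communicator comm,
               std::shared_ptr<boost::mpi::environment> mpi_env)
      : m_comm(std::move(comm)), m_mpi_env(std::move(mpi_env)) {}
  // ~MpiCallbacks() first stops the worker loop, then releases m_mpi_env,
  // so MPI_Finalize can only run after the callback machinery is gone

private:
  boost::mpi::communicator m_comm;
  std::shared_ptr<boost::mpi::environment> m_mpi_env; // keeps MPI alive
};
} // namespace Communication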

A proof-of-concept is available in jngrad/espresso@boost_mpi_bugfix. I get the desired behavior on the python branch compiled with the default config using Boost 1.74, 1.82 and 1.84, with the notable exception of ek_eof.py. The fix can be backported to ESPResSo 4.2, but all LB tests fail due to the LB actor calling the lb_lbfluid_set_lattice_switch() MPI callback after the MPI environment is destroyed.

Here I'm assuming all Python atexit functions run before the first C++ atexit function; please correct me if I'm wrong!

To help with debugging, source the following GDB script in your session to print out calls to relevant symbols in a way that doesn't interrupt the flow of the GDB session with user prompts:

set breakpoint pending on
set pagination off

define handler
break $arg0
commands
cont
end
end

handler MPI_Type_contiguous
handler boost::mpi::environment::environment
handler boost::mpi::environment::~environment
handler boost::mpi::detail::mpi_datatype_cache
handler boost::mpi::detail::mpi_datatype_map::~mpi_datatype_map
handler boost::mpi::detail::mpi_datatype_map::clear

run

jngrad commented on July 18, 2024

Bugfix backported to 4.2.1 and submitted to openSUSE Tumbleweed (request 1143707) and Factory (request 1143710).

The python branch bugfix is a bit more delicate due to FFTW and HDF5 dependencies, and might take a few more days.

junghans commented on July 18, 2024

On Fedora there was no issue?

jngrad commented on July 18, 2024

I'm still running the testsuite locally in a Docker image and will make a bugfix ASAP. Fedora is currently still using Boost 1.83.0 in f40 and rawhide (https://src.fedoraproject.org/rpms/boost), so I gave priority to openSUSE.

jngrad commented on July 18, 2024

The MpiCallbacks_test unit test fails in a Koji scratch build, but not on my workstation in a Docker image... I'll look into it next week.

jngrad commented on July 18, 2024

The Homebrew formula for boost-mpi (link) was recently bumped to Boost 1.84.0. The package does not provide pinned formulae for older Boost versions. All ESPResSo releases since 4.0.0, as well as the development version, are now unusable on macOS computers with up-to-date dependencies.

jngrad commented on July 18, 2024

Progress report:

  • the bugfix seems to work on the macOS GitHub Action with Boost 1.84
  • the remaining undefined behavior revealed by the bugfix still persists on Fedora

Fedora Rawhide still hasn't updated to Boost 1.84 (bugzilla 2178871). Fedora 40 will enter Beta on February 27 (timeline).

On f40, on all architectures, I get random failures of the MpiCallbacks_test and ParallelExceptionHandler_test unit tests on Koji with the error message Communicator (handle=44000000) being freed has 1 unmatched message(s) (see below), even though the boost::mpi::environment shared pointer lifetime is tied to the main function lifetime (plus 2 weak pointers with static linkage), and the MpiCallbacks shared pointer lifetime is tied to the lifetime of the individual test functions. This error appeared last week and is reproducible locally in a Docker image with MPICH 4.1.2 (and with docker run --shm-size 8G, without which OpenMPI triggers a signal 7 fatal error).

The MPI deadlocks in the Python tests also disappeared last week, although on the Power9 architecture, when building with C++ assertions, the Boost histogram library triggers an assertion in the observable_cylindricalLB.py test (see below). It happens every time in the CylindricalLBObservableCPU.test_cylindrical_lb_profile_interface test when attempting to set the value of the velocity vector at array[1, 0, 0, :].

MpiCallbacks MPICH error message
Entering test module "MpiCallbacks test"
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(43): Entering test case "invoke_test"
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(57): info: check f(i, j) == (invoke<decltype(f), int, unsigned>(f, ia)) has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(43): Leaving test case "invoke_test"; testing time: 369us
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(65): Entering test case "callback_model_t"
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(82): info: check 537 == i has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(83): info: check 3.4 == d has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(93): info: check called has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(100): info: check 19 == state has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(101): info: check 537 == i has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(102): info: check 3.4 == d has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(110): info: check called has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(65): Leaving test case "callback_model_t"; testing time: 420us
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(114): Entering test case "adding_function_ptr_cb"
Test case adding_function_ptr_cb did not check any assertions
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(114): Leaving test case "adding_function_ptr_cb"; testing time: 611us
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(137): Entering test case "RegisterCallback"
Test case RegisterCallback did not check any assertions
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(137): Leaving test case "RegisterCallback"; testing time: 393us
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(160): Entering test case "CallbackHandle"
Test case CallbackHandle did not check any assertions
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(160): Leaving test case "CallbackHandle"; testing time: 497us
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(181): Entering test case "reduce_callback"
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(194): info: check ret == (n * (n - 1)) / 2 has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(181): Leaving test case "reduce_callback"; testing time: 455us
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(200): Entering test case "ignore_callback"
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(217): info: check called has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(200): Leaving test case "ignore_callback"; testing time: 374us
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(220): Entering test case "one_rank_callback"
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(238): info: check cbs.call(Communication::Result::one_rank, fp) == world.size() - 1 has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(220): Leaving test case "one_rank_callback"; testing time: 378us
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(245): Entering test case "main_rank_callback"
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(263): info: check cbs.call(Communication::Result::main_rank, fp) == world.size() has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(245): Leaving test case "main_rank_callback"; testing time: 354us
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(270): Entering test case "call_all"
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(287): info: check called has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(270): Leaving test case "call_all"; testing time: 375us
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(290): Entering test case "check_exceptions"
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(304): info: check 'exception "std::out_of_range" raised as expected' has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(290): Leaving test case "check_exceptions"; testing time: 400us
Leaving test module "MpiCallbacks test"; testing time: 5020us

*** No errors detected
Running 11 test cases...
Entering test module "MpiCallbacks test"
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(43): Entering test case "invoke_test"
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(57): info: check f(i, j) == (invoke<decltype(f), int, unsigned>(f, ia)) has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(43): Leaving test case "invoke_test"; testing time: 369us
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(65): Entering test case "callback_model_t"
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(82): info: check 537 == i has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(83): info: check 3.4 == d has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(93): info: check called has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(100): info: check 19 == state has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(101): info: check 537 == i has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(102): info: check 3.4 == d has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(110): info: check called has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(65): Leaving test case "callback_model_t"; testing time: 420us
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(114): Entering test case "adding_function_ptr_cb"
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(119): info: check 537 == i has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(120): info: check "adding_function_ptr_cb" == s has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(133): info: check called has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(114): Leaving test case "adding_function_ptr_cb"; testing time: 614us
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(137): Entering test case "RegisterCallback"
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(139): info: check 537 == i has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(140): info: check "2nd" == s has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(156): info: check called has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(137): Leaving test case "RegisterCallback"; testing time: 398us
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(160): Entering test case "CallbackHandle"
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(168): info: check "CallbackHandle" == s has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(177): info: check called has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(160): Leaving test case "CallbackHandle"; testing time: 504us
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(181): Entering test case "reduce_callback"
Test case reduce_callback did not check any assertions
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(181): Leaving test case "reduce_callback"; testing time: 461us
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(200): Entering test case "ignore_callback"
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(217): info: check called has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(200): Leaving test case "ignore_callback"; testing time: 369us
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(220): Entering test case "one_rank_callback"
Test case one_rank_callback did not check any assertions
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(220): Leaving test case "one_rank_callback"; testing time: 377us
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(245): Entering test case "main_rank_callback"
Test case main_rank_callback did not check any assertions
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(245): Leaving test case "main_rank_callback"; testing time: 346us
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(270): Entering test case "call_all"
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(287): info: check called has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(270): Leaving test case "call_all"; testing time: 369us
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(290): Entering test case "check_exceptions"
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(307): info: check 'exception "std::logic_error" raised as expected' has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(290): Leaving test case "check_exceptions"; testing time: 400us
Leaving test module "MpiCallbacks test"; testing time: 5020us

*** No errors detected
terminate called after throwing an instance of 'boost::wrapexcept<boost::mpi::exception>'
  what():  MPI_Finalize: Other MPI error, error stack:
internal_Finalize(50)...........: MPI_Finalize failed
MPII_Finalize(394)..............: 
MPIR_finalize_builtin_comms(154): 
MPIR_Comm_release_always(1250)..: 
MPIR_Comm_delete_internal(1224).: Communicator (handle=44000000) being freed has 1 unmatched message(s)

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 19812 RUNNING AT 3ae1d6cd6b37
=   EXIT CODE: 134
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Aborted (signal 6)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

cylindrical LB velocity profile observable error message
111/201 Test #133: observable_cylindricalLB ......................................***Failed    1.58 sec
test_cylindrical_lb_flux_density_obs (__main__.CylindricalLBObservableCPU.test_cylindrical_lb_flux_density_obs)
Check that the result from the observable (in its own frame) ...
M = 3 N = 3 ; array[1, 1, 3, 0] = 0.024000; array[1, 1, 3, 1] = 0.048000; array[1, 1, 3, 2] = 0.036000
M = 3 N = 3 ; array[1, 1, 3, 0] = 0.024000; array[1, 1, 3, 1] = 0.048000; array[1, 1, 3, 2] = 0.036000
M = 3 N = 3 ; array[1, 1, 3, 0] = 0.024000; array[1, 1, 3, 1] = 0.048000; array[1, 1, 3, 2] = 0.036000
M = 3 N = 3 ; array[2, 1, 3, 0] = 0.024000; array[2, 1, 3, 1] = 0.048000; array[2, 1, 3, 2] = 0.036000
M = 3 N = 3 ; array[1, 1, 3, 0] = 0.024000; array[1, 1, 3, 1] = 0.048000; array[1, 1, 3, 2] = 0.036000
M = 3 N = 3 ; array[1, 1, 3, 0] = 0.024000; array[1, 1, 3, 1] = 0.048000; array[1, 1, 3, 2] = 0.036000
M = 3 N = 3 ; array[1, 1, 3, 0] = 0.024000; array[1, 1, 3, 1] = 0.048000; array[1, 1, 3, 2] = 0.036000
M = 3 N = 3 ; array[2, 1, 3, 0] = 0.024000; array[2, 1, 3, 1] = 0.048000; array[2, 1, 3, 2] = 0.036000
M = 3 N = 3 ; array[1, 1, 3, 0] = 0.024000; array[1, 1, 3, 1] = 0.048000; array[1, 1, 3, 2] = 0.036000
M = 3 N = 3 ; array[1, 1, 3, 0] = 0.024000; array[1, 1, 3, 1] = 0.048000; array[1, 1, 3, 2] = 0.036000
M = 3 N = 3 ; array[1, 1, 3, 0] = 0.024000; array[1, 1, 3, 1] = 0.048000; array[1, 1, 3, 2] = 0.036000
M = 3 N = 3 ; array[2, 1, 3, 0] = 0.024000; array[2, 1, 3, 1] = 0.048000; array[2, 1, 3, 2] = 0.036000
M = 3 N = 3 ; array[1, 1, 3, 0] = 0.024000; array[1, 1, 3, 1] = 0.048000; array[1, 1, 3, 2] = 0.036000
M = 3 N = 3 ; array[1, 1, 3, 0] = 0.024000; array[1, 1, 3, 1] = 0.048000; array[1, 1, 3, 2] = 0.036000
M = 3 N = 3 ; array[1, 1, 3, 0] = 0.024000; array[1, 1, 3, 1] = 0.048000; array[1, 1, 3, 2] = 0.036000
M = 3 N = 3 ; array[2, 1, 3, 0] = 0.024000; array[2, 1, 3, 1] = 0.048000; array[2, 1, 3, 2] = 0.036000
M = 3 N = 3 ; array[1, 1, 3, 0] = 0.024000; array[1, 1, 3, 1] = 0.048000; array[1, 1, 3, 2] = 0.036000
M = 3 N = 3 ; array[1, 1, 3, 0] = 0.024000; array[1, 1, 3, 1] = 0.048000; array[1, 1, 3, 2] = 0.036000
M = 3 N = 3 ; array[1, 1, 3, 0] = 0.024000; array[1, 1, 3, 1] = 0.048000; array[1, 1, 3, 2] = 0.036000
M = 3 N = 3 ; array[2, 1, 3, 0] = 0.024000; array[2, 1, 3, 1] = 0.048000; array[2, 1, 3, 2] = 0.036000
ok
test_cylindrical_lb_profile_interface (__main__.CylindricalLBObservableCPU.test_cylindrical_lb_profile_interface)
Test setters and getters of the script interface ...
M = 3 N = 3 ; array[0, 3, 0, 0] = -0.000000; array[0, 3, 0, 1] = 0.000000; array[0, 3, 0, 2] = 0.000000
M = 3 N = 3 ; array[0, 4, 0, 0] = -0.000000; array[0, 4, 0, 1] = 0.000000; array[0, 4, 0, 2] = 0.000000
M = 3 N = 3 ; array[0, 5, 0, 0] = 0.000000; array[0, 5, 0, 1] = 0.000000; array[0, 5, 0, 2] = 0.000000
M = 3 N = 3 ; array[0, 0, 0, 0] = 0.000000; array[0, 0, 0, 1] = 0.000000; array[0, 0, 0, 2] = 0.000000
M = 3 N = 3 ; array[0, 1, 0, 0] = 0.000000; array[0, 1, 0, 1] = 0.000000; array[0, 1, 0, 2] = 0.000000
M = 3 N = 3 ; array[0, 2, 0, 0] = 0.000000; array[0, 2, 0, 1] = -0.000000; array[0, 2, 0, 2] = 0.000000
M = 3 N = 3 ; array[0, 3, 1, 0] = -0.000000; array[0, 3, 1, 1] = 0.000000; array[0, 3, 1, 2] = 0.000000
M = 3 N = 3 ; array[0, 4, 1, 0] = -0.000000; array[0, 4, 1, 1] = 0.000000; array[0, 4, 1, 2] = 0.000000
M = 3 N = 3 ; array[0, 5, 1, 0] = 0.000000; array[0, 5, 1, 1] = 0.000000; array[0, 5, 1, 2] = 0.000000
M = 3 N = 3 ; array[0, 0, 1, 0] = 0.000000; array[0, 0, 1, 1] = 0.000000; array[0, 0, 1, 2] = 0.000000
M = 3 N = 3 ; array[0, 1, 1, 0] = 0.000000; array[0, 1, 1, 1] = 0.000000; array[0, 1, 1, 2] = 0.000000
M = 3 N = 3 ; array[0, 2, 1, 0] = 0.000000; array[0, 2, 1, 1] = -0.000000; array[0, 2, 1, 2] = 0.000000
M = 3 N = 3 ; array[0, 3, 2, 0] = -0.000000; array[0, 3, 2, 1] = 0.000000; array[0, 3, 2, 2] = 0.000000
M = 3 N = 3 ; array[0, 4, 2, 0] = -0.000000; array[0, 4, 2, 1] = 0.000000; array[0, 4, 2, 2] = 0.000000
M = 3 N = 3 ; array[0, 5, 2, 0] = 0.000000; array[0, 5, 2, 1] = 0.000000; array[0, 5, 2, 2] = 0.000000
M = 3 N = 3 ; array[0, 0, 2, 0] = 0.000000; array[0, 0, 2, 1] = 0.000000; array[0, 0, 2, 2] = 0.000000
M = 3 N = 3 ; array[0, 1, 2, 0] = 0.000000; array[0, 1, 2, 1] = 0.000000; array[0, 1, 2, 2] = 0.000000
M = 3 N = 3 ; array[0, 2, 2, 0] = 0.000000; array[0, 2, 2, 1] = -0.000000; array[0, 2, 2, 2] = 0.000000
M = 3 N = 3 ; array[0, 3, 3, 0] = -0.000000; array[0, 3, 3, 1] = 0.000000; array[0, 3, 3, 2] = 0.000000
M = 3 N = 3 ; array[0, 4, 3, 0] = -0.000000; array[0, 4, 3, 1] = 0.000000; array[0, 4, 3, 2] = 0.000000
M = 3 N = 3 ; array[0, 5, 3, 0] = 0.000000; array[0, 5, 3, 1] = 0.000000; array[0, 5, 3, 2] = 0.000000
M = 3 N = 3 ; array[0, 0, 3, 0] = 0.000000; array[0, 0, 3, 1] = 0.000000; array[0, 0, 3, 2] = 0.000000
M = 3 N = 3 ; array[0, 1, 3, 0] = 0.000000; array[0, 1, 3, 1] = 0.000000; array[0, 1, 3, 2] = 0.000000
M = 3 N = 3 ; array[0, 2, 3, 0] = 0.000000; array[0, 2, 3, 1] = -0.000000; array[0, 2, 3, 2] = 0.000000
M = 3 N = 3 ; array[0, 3, 4, 0] = -0.000000; array[0, 3, 4, 1] = 0.000000; array[0, 3, 4, 2] = 0.000000
M = 3 N = 3 ; array[0, 4, 4, 0] = -0.000000; array[0, 4, 4, 1] = 0.000000; array[0, 4, 4, 2] = 0.000000
M = 3 N = 3 ; array[0, 5, 4, 0] = 0.000000; array[0, 5, 4, 1] = 0.000000; array[0, 5, 4, 2] = 0.000000
M = 3 N = 3 ; array[0, 0, 4, 0] = 0.000000; array[0, 0, 4, 1] = 0.000000; array[0, 0, 4, 2] = 0.000000
M = 3 N = 3 ; array[0, 1, 4, 0] = 0.000000; array[0, 1, 4, 1] = 0.000000; array[0, 1, 4, 2] = 0.000000
M = 3 N = 3 ; array[0, 2, 4, 0] = 0.000000; array[0, 2, 4, 1] = -0.000000; array[0, 2, 4, 2] = 0.000000
M = 3 N = 3 ; array[0, 3, 5, 0] = -0.000000; array[0, 3, 5, 1] = 0.000000; array[0, 3, 5, 2] = 0.000000
M = 3 N = 3 ; array[0, 4, 5, 0] = -0.000000; array[0, 4, 5, 1] = 0.000000; array[0, 4, 5, 2] = 0.000000
M = 3 N = 3 ; array[0, 5, 5, 0] = 0.000000; array[0, 5, 5, 1] = 0.000000; array[0, 5, 5, 2] = 0.000000
M = 3 N = 3 ; array[0, 0, 5, 0] = 0.000000; array[0, 0, 5, 1] = 0.000000; array[0, 0, 5, 2] = 0.000000
M = 3 N = 3 ; array[0, 1, 5, 0] = 0.000000; array[0, 1, 5, 1] = 0.000000; array[0, 1, 5, 2] = 0.000000
M = 3 N = 3 ; array[0, 2, 5, 0] = 0.000000; array[0, 2, 5, 1] = -0.000000; array[0, 2, 5, 2] = 0.000000
M = 3 N = 3 ; array[0, 3, 6, 0] = -0.000000; array[0, 3, 6, 1] = 0.000000; array[0, 3, 6, 2] = 0.000000
M = 3 N = 3 ; array[0, 4, 6, 0] = -0.000000; array[0, 4, 6, 1] = 0.000000; array[0, 4, 6, 2] = 0.000000
M = 3 N = 3 ; array[0, 5, 6, 0] = 0.000000; array[0, 5, 6, 1] = 0.000000; array[0, 5, 6, 2] = 0.000000
M = 3 N = 3 ; array[0, 0, 6, 0] = 0.000000; array[0, 0, 6, 1] = 0.000000; array[0, 0, 6, 2] = 0.000000
M = 3 N = 3 ; array[0, 1, 6, 0] = 0.000000; array[0, 1, 6, 1] = 0.000000; array[0, 1, 6, 2] = 0.000000
M = 3 N = 3 ; array[0, 2, 6, 0] = 0.000000; array[0, 2, 6, 1] = -0.000000; array[0, 2, 6, 2] = 0.000000
M = 3 N = 3 ; array[0, 3, 7, 0] = -0.000000; array[0, 3, 7, 1] = 0.000000; array[0, 3, 7, 2] = 0.000000
M = 3 N = 3 ; array[0, 4, 7, 0] = -0.000000; array[0, 4, 7, 1] = 0.000000; array[0, 4, 7, 2] = 0.000000
M = 3 N = 3 ; array[0, 5, 7, 0] = 0.000000; array[0, 5, 7, 1] = 0.000000; array[0, 5, 7, 2] = 0.000000
M = 3 N = 3 ; array[0, 0, 7, 0] = 0.000000; array[0, 0, 7, 1] = 0.000000; array[0, 0, 7, 2] = 0.000000
M = 3 N = 3 ; array[0, 1, 7, 0] = 0.000000; array[0, 1, 7, 1] = 0.000000; array[0, 1, 7, 2] = 0.000000
M = 3 N = 3 ; array[0, 2, 7, 0] = 0.000000; array[0, 2, 7, 1] = -0.000000; array[0, 2, 7, 2] = 0.000000
M = 3 N = 3 ; array[1, 3, 0, 0] = 0.000000; array[1, 3, 0, 1] = -0.000000; array[1, 3, 0, 2] = 0.000000
M = 3 N = 3 ; array[1, 3, 0, 0] = -0.000000; array[1, 3, 0, 1] = 0.000000; array[1, 3, 0, 2] = 0.000000
M = 3 N = 3 ; array[1, 4, 0, 0] = -0.000000; array[1, 4, 0, 1] = 0.000000; array[1, 4, 0, 2] = 0.000000
M = 3 N = 3 ; array[1, 4, 0, 0] = 0.000000; array[1, 4, 0, 1] = 0.000000; array[1, 4, 0, 2] = 0.000000
M = 3 N = 3 ; array[1, 5, 0, 0] = 0.000000; array[1, 5, 0, 1] = 0.000000; array[1, 5, 0, 2] = 0.000000
M = 3 N = 3 ; array[1, 5, 0, 0] = 0.000000; array[1, 5, 0, 1] = 0.000000; array[1, 5, 0, 2] = 0.000000
python3: /usr/include/boost/multi_array/base.hpp:312: Reference boost::detail::multi_array::multi_array_impl_base<T, NumDims>::access_element(boost::type<Reference>, const IndexList&, TPtr, const size_type*, const index*, const index*) const [with Reference = double&; IndexList = boost::array<long int, 4>; TPtr = double*; T = double; long unsigned int NumDims = 4; size_type = long unsigned int; index = long int]: Assertion `size_type(indices[i] - index_bases[i]) < extents[i]' failed.
===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 11887 RUNNING AT 92f33c5094034595b63af25b7528b0b9
=   EXIT CODE: 134
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Aborted (signal 6)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

jngrad commented on July 18, 2024

Adapting the code that generates the error message (raffenet/[email protected]:src/mpi/comm/commutil.c#L1125-L1147) in the unit test helped me find that the issue was due to a missing cbs.loop(); on the worker nodes. The worker nodes receive a LOOP_ABORT from ~MpiCallbacks() but cannot process it without a blocking MpiCallbacks::loop() call, so the receive queue is not empty when the communicator is destroyed, which is a fatal error in MPICH 4.1+.

int main(int argc, char **argv) {
  auto const mpi_env = std::make_shared<boost::mpi::environment>(argc, argv);
  ::mpi_env = mpi_env;
  auto const retval = boost::unit_test::unit_test_main(init_unit_test, argc, argv);
  {
    // check for unmatched messages, adapted from MPIR_Comm_delete_internal()
    // in MPICH's commutil.c
    boost::mpi::communicator world;
    int flag;
    int unmatched_messages = 0;
    MPI_Comm comm = world;
    MPI_Status status;
    do {
      int mpi_errno = MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &flag, &status);
      printf("rank %d has mpi_errno=%d\n", world.rank(), mpi_errno);
      char buffer[10] = {0};
      if (flag) {
        // a pending message was found: receive it and report its payload
        int count = 0;
        MPI_Get_count(&status, MPI_CHAR, &count);
        MPI_Recv(buffer, count, MPI_CHAR, status.MPI_SOURCE, status.MPI_TAG, comm, MPI_STATUS_IGNORE);
        unmatched_messages++;
        printf("rank %d received values {%d,%d,%d,%d} from rank %d, with tag %d, size %d Bytes and error code %d.\n",
               world.rank(), (int)(buffer[0]), (int)(buffer[1]), (int)(buffer[2]), (int)(buffer[3]),
               status.MPI_SOURCE, status.MPI_TAG, count, status.MPI_ERROR);
      }
    } while (false);
    printf("rank %d has %d unmatched messages\n", world.rank(), unmatched_messages);
  }
  return retval;
}

Output:

Running 1 test case...
0: ~MpiCallbacks()
call(0)
Running 1 test case...
1: ~MpiCallbacks()


*** No errors detected
*** No errors detected
rank 0 has mpi_errno=0
rank 0 has 0 unmatched messages
rank 1 has mpi_errno=0
rank 1 received values {0,0,0,0} from rank 0, with tag 2147483647, size 4 Bytes and error code -33873408.
rank 1 has 1 unmatched messages
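
For completeness, here is a simplified sketch of the fixed pattern (using the hypothetical constructor from the sketch in an earlier comment; the real unit tests register callbacks and run the Boost.Test driver on every rank): the worker ranks block in cbs.loop() so that the LOOP_ABORT message emitted by ~MpiCallbacks() on the head rank is matched before MPI is finalized:

#include <memory>
#include <boost/mpi/communicator.hpp>
#include <boost/mpi/environment.hpp>
#include "communication.hpp" // provides Communication::MpiCallbacks

int main(int argc, char **argv) {
  auto const mpi_env = std::make_shared<boost::mpi::environment>(argc, argv);
  boost::mpi::communicator world;
  {
    // hypothetical constructor signature, as sketched in an earlier comment
    Communication::MpiCallbacks cbs{world, mpi_env};
    if (world.rank() == 0) {
      // head rank: drive the callbacks/tests; when cbs goes out of scope,
      // ~MpiCallbacks() notifies the workers with LOOP_ABORT
    } else {
      // worker ranks: block until LOOP_ABORT arrives, so no message is left
      // unmatched when the communicator is freed (fatal in MPICH >= 4.1)
      cbs.loop();
    }
  }
  return 0; // mpi_env expires here and MPI_Finalize runs with empty queues
}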

jngrad commented on July 18, 2024

The bugfix is now in Fedora 41 stable as release espresso-4.2.1-11.fc41 (rpms/espresso).
