Comments (12)
One can influence the destruction order by adding the following two code snippets before and after #include <boost/test/unit_test.hpp>, respectively, in every failing test, and linking the tests against Espresso::core and Boost::mpi via CMake:
#define BOOST_TEST_NO_MAIN
#include "communication.hpp"

int main(int argc, char **argv) {
  auto mpi_env = std::make_shared<boost::mpi::environment>(argc, argv);
  Communication::init(mpi_env);
  return boost::unit_test::unit_test_main(init_unit_test, argc, argv);
}
This is not a sustainable solution, and it will break as soon as the order of instantiation of static globals changes, for example when ESPResSo features that introduce global variables are disabled, or when a different Boost release is used.
from espresso.
After fixing
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}${BOOST_ROOT}"
to
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}${BOOST_ROOT}/lib"
the bug can be properly investigated with GDB on Ubuntu 22.04.
It's not clear to me why ESPResSo needs to manage the lifetime of the boost::mpi::detail::mpi_datatype_map singleton. When we don't extend its lifetime, several tests experience segmentation faults or timeouts during normal program termination. Here is the backtrace when the static global that keeps a reference to the singleton is removed:
Thread 1 "ReactionAlgorit" received signal SIGSEGV, Segmentation fault.
0x00001555544c6504 in std::_Rb_tree_increment(std::_Rb_tree_node_base*) () from /lib/x86_64-linux-gnu/libstdc++.so.6
(gdb) bt
#0 0x00001555544c6504 in std::_Rb_tree_increment(std::_Rb_tree_node_base*) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#1 0x00001555549e1b45 in std::_Rb_tree_iterator<std::pair<std::type_info const* const, ompi_datatype_t*> >::operator++ (
this=0x7fffffffd320) at /usr/include/c++/10/bits/stl_tree.h:287
#2 0x00001555549e155f in boost::mpi::detail::mpi_datatype_map::clear (
this=0x1555549ff568 <boost::mpi::detail::mpi_datatype_cache()::cache>) at libs/mpi/src/mpi_datatype_cache.cpp:36
#3 0x00001555549dd6f8 in boost::mpi::environment::~environment (this=0x5555555ca930, __in_chrg=<optimized out>)
at libs/mpi/src/environment.cpp:184
#4 0x0000155554fa621c in __gnu_cxx::new_allocator<boost::mpi::environment>::destroy<boost::mpi::environment> (
this=0x5555555ca930, __p=0x5555555ca930) at /usr/include/c++/10/ext/new_allocator.h:162
#5 0x0000155554fa61e7 in std::allocator_traits<std::allocator<boost::mpi::environment> >::destroy<boost::mpi::environment> (
__a=..., __p=0x5555555ca930) at /usr/include/c++/10/bits/alloc_traits.h:531
#6 0x0000155554fa60a1 in std::_Sp_counted_ptr_inplace<boost::mpi::environment, std::allocator<boost::mpi::environment>,
(__gnu_cxx::_Lock_policy)2>::_M_dispose (this=0x5555555ca920)
at /usr/include/c++/10/bits/shared_ptr_base.h:560
#7 0x0000555555588bd7 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x5555555ca920)
at /usr/include/c++/10/bits/shared_ptr_base.h:158
#8 0x0000555555584321 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (
this=0x15555539cd38 <Communication::mpi_env+8>, __in_chrg=<optimized out>) at /usr/include/c++/10/bits/shared_ptr_base.h:736
#9 0x0000155554f9eb16 in std::__shared_ptr<boost::mpi::environment, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (
this=0x15555539cd30 <Communication::mpi_env>, __in_chrg=<optimized out>) at /usr/include/c++/10/bits/shared_ptr_base.h:1188
#10 0x0000155554f9ec86 in std::shared_ptr<boost::mpi::environment>::~shared_ptr (this=0x15555539cd30 <Communication::mpi_env>,
__in_chrg=<optimized out>) at /usr/include/c++/10/bits/shared_ptr.h:121
#11 0x0000155554045a56 in __cxa_finalize (d=0x15555539c560) at ./stdlib/cxa_finalize.c:83
#12 0x0000155554f71d67 in __do_global_dtors_aux () from /tmp/boost/espresso/build/src/core/espresso_core.so
#13 0x00007fffffffd8c0 in ?? ()
#14 0x000015555552024e in _dl_fini () at ./elf/dl-fini.c:142
Backtrace stopped: frame did not save the PC
Running the Python testsuite shows three outcomes: no error, a segmentation fault, or a timeout. The timeout most likely happens during atexit, because the test itself reports success. This behavior is quite similar to what EESSI maintainers observe on ARM architectures, although they didn't report segmentation faults (EESSI/software-layer#363) and, to my knowledge, didn't remove the static reference. Here is an excerpt of the Python testsuite log:
187/191 Test #100: dipolar_p3m .......................................................***Timeout 300.10 sec
.
----------------------------------------------------------------------
Ran 1 test in 10.268s
OK
188/191 Test #55: observables .......................................................***Timeout 300.24 sec
.......s.........
----------------------------------------------------------------------
Ran 17 tests in 5.649s
OK (skipped=1)
189/191 Test #10: test_checkpoint__therm_lb__p3m_gpu__lj__lb_walberla_cpu_binary ....***Timeout 300.24 sec
.......s.sss.s.sssss.ss....s..ss..ssssss....
----------------------------------------------------------------------
Ran 44 tests in 0.394s
OK (skipped=21)
I took a different angle by creating a struct MpiContainer to encapsulate all three globals and manage their lifetime and destruction order through a static smart pointer:
namespace Communication {
// static auto const &mpi_datatype_cache = boost::mpi::detail::mpi_datatype_cache();
static std::shared_ptr<boost::mpi::environment> mpi_env;
static std::shared_ptr<MpiCallbacks> m_callbacks;
} // namespace Communication

struct MpiContainer {
  boost::mpi::detail::mpi_datatype_map const &mpi_datatype_cache = boost::mpi::detail::mpi_datatype_cache();
  std::shared_ptr<boost::mpi::environment> mpi_env;
  std::shared_ptr<Communication::MpiCallbacks> m_callbacks;
  MpiContainer() {
    mpi_env = Communication::mpi_env;
    m_callbacks = Communication::m_callbacks;
  }
  ~MpiContainer() {
    m_callbacks.reset();
    Communication::m_callbacks.reset();
    mpi_env.reset();
    Communication::mpi_env.reset();
  }
};

static std::unique_ptr<MpiContainer> mpi_container;

void atexit_handler() {
  if (Communication::m_callbacks) {
    mpi_container->m_callbacks.reset();
    Communication::m_callbacks.reset();
  }
}
The atexit event handler frees the MpiCallbacks handle, which in turn frees the boost::mpi::environment handle. But this introduces more issues, because the MpiCallbacks object must remain alive until all dependent Context objects from the ESPResSo ScriptInterface have gone through their destructors. In C++, static variables are destructed during atexit, in reverse order of static initialization resp. event registration. The atexit event was registered right after MpiCallbacks was initialized in Communication::init(), so that the handler runs before ~MpiCallbacks(). However, script interface objects live in a different translation unit, and thus we cannot control the order of destruction. That order actually changes between two simulations, making this new bug difficult to reproduce.
I think the way to solve this issue is to:
1. not tamper with the MPI datatype cache lifetime, i.e. don't manually call the singleton
2. not keep the boost::mpi::environment alive until atexit

Resolving 1. is easy: never call boost::mpi::detail::mpi_datatype_cache() anywhere; it's an implementation detail. Resolving 2. is a lot harder: the MpiCallbacks handle needs the MPI environment handle during Python atexit. We can rewrite MpiCallbacks to keep the MPI environment handle alive. Here is how we need to adapt MPI initialization:
- Python interface: the MPI environment is kept alive by _init.pyx (inside a shared pointer) and the MpiCallbacks handle is kept alive by both script_interface.pyx (inside the global context shared pointer) and communication.cpp (inside the static global); a cleanup function is registered to delete all 3 shared pointers during Python atexit
- C++ unit tests: the main function contains a MpiContainer handle that keeps the MpiCallbacks handle alive, and it expires as soon as the testsuite ends
A proof-of-concept is available in jngrad/espresso@boost_mpi_bugfix. I get the desired behavior on the python branch compiled with the default config using Boost 1.74, 1.82 and 1.84, with the notable exception of ek_eof.py. The fix can be backported to ESPResSo 4.2, but all LB tests fail due to the LB actor calling the lb_lbfluid_set_lattice_switch() MPI callback after the MPI environment is destroyed.
Here I'm assuming all Python atexit functions run before the first C++ atexit function, please correct me if I'm wrong!
To help with debugging, source the following GDB script in your session to print calls to relevant symbols without interrupting the flow of the GDB session with user prompts:
set breakpoint pending on
set pagination off
define handler
  break $arg0
  commands
    cont
  end
end
handler MPI_Type_contiguous
handler boost::mpi::environment::environment
handler boost::mpi::environment::~environment
handler boost::mpi::detail::mpi_datatype_cache
handler boost::mpi::detail::mpi_datatype_map::~mpi_datatype_map
handler boost::mpi::detail::mpi_datatype_map::clear
run
Bugfix backported to 4.2.1 and submitted to openSUSE Tumbleweed (request 1143707) and Factory (request 1143710).
The python branch bugfix is a bit more delicate due to FFTW and HDF5 dependencies, and might take a few more days.
On Fedora there was no issue?
I'm still running the testsuite locally in a Docker image and will make a bugfix ASAP. Fedora is currently still using Boost 1.83.0 in f40 and rawhide (https://src.fedoraproject.org/rpms/boost), so I gave priority to openSUSE.
The MpiCallbacks_test unit test fails in a Koji scratch build, but not on my workstation in a Docker image... I'll look into it next week.
The Homebrew formula for boost-mpi (link) was recently bumped to Boost 1.84.0. This package does not provide pinned formulae for older Boost versions. All ESPResSo releases since 4.0.0, as well as the development version, are now unusable on macOS computers with up-to-date dependencies.
Progress report:
- the bugfix seems to work on the macOS GitHub Action with Boost 1.84
- the remaining undefined behavior revealed by the bugfix still persists on Fedora
Fedora Rawhide still hasn't updated to Boost 1.84 (bugzilla 2178871). Fedora 40 will enter Beta on February 27 (timeline).
On f40 with all architectures, I get random failures of the MpiCallbacks_test and ParallelExceptionHandler_test unit tests on Koji with the error message "Communicator (handle=44000000) being freed has 1 unmatched message(s)" (see below), even though the boost::mpi::environment shared pointer lifetime is tied to the main function lifetime (+2 weak pointers with static linkage), and the MpiCallbacks shared pointer lifetime is tied to the lifetime of the individual test functions. This error appeared last week and is reproducible locally in a Docker image with MPICH 4.1.2 (and with docker run --shm-size 8G, otherwise a signal 7 fatal error is triggered by OpenMPI).
The MPI deadlocks in Python tests disappeared last week too, although on the Power9 architecture, when building with C++ assertions, the Boost histogram library triggers an assertion in the observable_cylindricalLB.py test (see below). It happens every time in the CylindricalLBObservableCPU.test_cylindrical_lb_profile_interface test when attempting to set the value of the velocity vector at array[1, 0, 0, :].
MpiCallbacks MPICH error message
Entering test module "MpiCallbacks test"
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(43): Entering test case "invoke_test"
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(57): info: check f(i, j) == (invoke<decltype(f), int, unsigned>(f, ia)) has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(43): Leaving test case "invoke_test"; testing time: 369us
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(65): Entering test case "callback_model_t"
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(82): info: check 537 == i has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(83): info: check 3.4 == d has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(93): info: check called has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(100): info: check 19 == state has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(101): info: check 537 == i has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(102): info: check 3.4 == d has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(110): info: check called has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(65): Leaving test case "callback_model_t"; testing time: 420us
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(114): Entering test case "adding_function_ptr_cb"
Test case adding_function_ptr_cb did not check any assertions
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(114): Leaving test case "adding_function_ptr_cb"; testing time: 611us
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(137): Entering test case "RegisterCallback"
Test case RegisterCallback did not check any assertions
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(137): Leaving test case "RegisterCallback"; testing time: 393us
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(160): Entering test case "CallbackHandle"
Test case CallbackHandle did not check any assertions
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(160): Leaving test case "CallbackHandle"; testing time: 497us
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(181): Entering test case "reduce_callback"
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(194): info: check ret == (n * (n - 1)) / 2 has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(181): Leaving test case "reduce_callback"; testing time: 455us
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(200): Entering test case "ignore_callback"
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(217): info: check called has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(200): Leaving test case "ignore_callback"; testing time: 374us
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(220): Entering test case "one_rank_callback"
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(238): info: check cbs.call(Communication::Result::one_rank, fp) == world.size() - 1 has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(220): Leaving test case "one_rank_callback"; testing time: 378us
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(245): Entering test case "main_rank_callback"
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(263): info: check cbs.call(Communication::Result::main_rank, fp) == world.size() has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(245): Leaving test case "main_rank_callback"; testing time: 354us
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(270): Entering test case "call_all"
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(287): info: check called has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(270): Leaving test case "call_all"; testing time: 375us
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(290): Entering test case "check_exceptions"
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(304): info: check 'exception "std::out_of_range" raised as expected' has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(290): Leaving test case "check_exceptions"; testing time: 400us
Leaving test module "MpiCallbacks test"; testing time: 5020us
*** No errors detected
Running 11 test cases...
Entering test module "MpiCallbacks test"
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(43): Entering test case "invoke_test"
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(57): info: check f(i, j) == (invoke<decltype(f), int, unsigned>(f, ia)) has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(43): Leaving test case "invoke_test"; testing time: 369us
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(65): Entering test case "callback_model_t"
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(82): info: check 537 == i has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(83): info: check 3.4 == d has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(93): info: check called has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(100): info: check 19 == state has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(101): info: check 537 == i has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(102): info: check 3.4 == d has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(110): info: check called has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(65): Leaving test case "callback_model_t"; testing time: 420us
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(114): Entering test case "adding_function_ptr_cb"
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(119): info: check 537 == i has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(120): info: check "adding_function_ptr_cb" == s has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(133): info: check called has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(114): Leaving test case "adding_function_ptr_cb"; testing time: 614us
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(137): Entering test case "RegisterCallback"
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(139): info: check 537 == i has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(140): info: check "2nd" == s has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(156): info: check called has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(137): Leaving test case "RegisterCallback"; testing time: 398us
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(160): Entering test case "CallbackHandle"
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(168): info: check "CallbackHandle" == s has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(177): info: check called has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(160): Leaving test case "CallbackHandle"; testing time: 504us
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(181): Entering test case "reduce_callback"
Test case reduce_callback did not check any assertions
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(181): Leaving test case "reduce_callback"; testing time: 461us
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(200): Entering test case "ignore_callback"
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(217): info: check called has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(200): Leaving test case "ignore_callback"; testing time: 369us
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(220): Entering test case "one_rank_callback"
Test case one_rank_callback did not check any assertions
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(220): Leaving test case "one_rank_callback"; testing time: 377us
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(245): Entering test case "main_rank_callback"
Test case main_rank_callback did not check any assertions
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(245): Leaving test case "main_rank_callback"; testing time: 346us
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(270): Entering test case "call_all"
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(287): info: check called has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(270): Leaving test case "call_all"; testing time: 369us
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(290): Entering test case "check_exceptions"
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(307): info: check 'exception "std::logic_error" raised as expected' has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(290): Leaving test case "check_exceptions"; testing time: 400us
Leaving test module "MpiCallbacks test"; testing time: 5020us
*** No errors detected
terminate called after throwing an instance of 'boost::wrapexcept<boost::mpi::exception>'
what(): MPI_Finalize: Other MPI error, error stack:
internal_Finalize(50)...........: MPI_Finalize failed
MPII_Finalize(394)..............:
MPIR_finalize_builtin_comms(154):
MPIR_Comm_release_always(1250)..:
MPIR_Comm_delete_internal(1224).: Communicator (handle=44000000) being freed has 1 unmatched message(s)
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 19812 RUNNING AT 3ae1d6cd6b37
= EXIT CODE: 134
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Aborted (signal 6)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
cylindrical LB velocity profile observable error message
111/201 Test #133: observable_cylindricalLB ......................................***Failed 1.58 sec
test_cylindrical_lb_flux_density_obs (__main__.CylindricalLBObservableCPU.test_cylindrical_lb_flux_density_obs)
Check that the result from the observable (in its own frame) ...
M = 3 N = 3 ; array[1, 1, 3, 0] = 0.024000; array[1, 1, 3, 1] = 0.048000; array[1, 1, 3, 2] = 0.036000
M = 3 N = 3 ; array[1, 1, 3, 0] = 0.024000; array[1, 1, 3, 1] = 0.048000; array[1, 1, 3, 2] = 0.036000
M = 3 N = 3 ; array[1, 1, 3, 0] = 0.024000; array[1, 1, 3, 1] = 0.048000; array[1, 1, 3, 2] = 0.036000
M = 3 N = 3 ; array[2, 1, 3, 0] = 0.024000; array[2, 1, 3, 1] = 0.048000; array[2, 1, 3, 2] = 0.036000
M = 3 N = 3 ; array[1, 1, 3, 0] = 0.024000; array[1, 1, 3, 1] = 0.048000; array[1, 1, 3, 2] = 0.036000
M = 3 N = 3 ; array[1, 1, 3, 0] = 0.024000; array[1, 1, 3, 1] = 0.048000; array[1, 1, 3, 2] = 0.036000
M = 3 N = 3 ; array[1, 1, 3, 0] = 0.024000; array[1, 1, 3, 1] = 0.048000; array[1, 1, 3, 2] = 0.036000
M = 3 N = 3 ; array[2, 1, 3, 0] = 0.024000; array[2, 1, 3, 1] = 0.048000; array[2, 1, 3, 2] = 0.036000
M = 3 N = 3 ; array[1, 1, 3, 0] = 0.024000; array[1, 1, 3, 1] = 0.048000; array[1, 1, 3, 2] = 0.036000
M = 3 N = 3 ; array[1, 1, 3, 0] = 0.024000; array[1, 1, 3, 1] = 0.048000; array[1, 1, 3, 2] = 0.036000
M = 3 N = 3 ; array[1, 1, 3, 0] = 0.024000; array[1, 1, 3, 1] = 0.048000; array[1, 1, 3, 2] = 0.036000
M = 3 N = 3 ; array[2, 1, 3, 0] = 0.024000; array[2, 1, 3, 1] = 0.048000; array[2, 1, 3, 2] = 0.036000
M = 3 N = 3 ; array[1, 1, 3, 0] = 0.024000; array[1, 1, 3, 1] = 0.048000; array[1, 1, 3, 2] = 0.036000
M = 3 N = 3 ; array[1, 1, 3, 0] = 0.024000; array[1, 1, 3, 1] = 0.048000; array[1, 1, 3, 2] = 0.036000
M = 3 N = 3 ; array[1, 1, 3, 0] = 0.024000; array[1, 1, 3, 1] = 0.048000; array[1, 1, 3, 2] = 0.036000
M = 3 N = 3 ; array[2, 1, 3, 0] = 0.024000; array[2, 1, 3, 1] = 0.048000; array[2, 1, 3, 2] = 0.036000
M = 3 N = 3 ; array[1, 1, 3, 0] = 0.024000; array[1, 1, 3, 1] = 0.048000; array[1, 1, 3, 2] = 0.036000
M = 3 N = 3 ; array[1, 1, 3, 0] = 0.024000; array[1, 1, 3, 1] = 0.048000; array[1, 1, 3, 2] = 0.036000
M = 3 N = 3 ; array[1, 1, 3, 0] = 0.024000; array[1, 1, 3, 1] = 0.048000; array[1, 1, 3, 2] = 0.036000
M = 3 N = 3 ; array[2, 1, 3, 0] = 0.024000; array[2, 1, 3, 1] = 0.048000; array[2, 1, 3, 2] = 0.036000
ok
test_cylindrical_lb_profile_interface (__main__.CylindricalLBObservableCPU.test_cylindrical_lb_profile_interface)
Test setters and getters of the script interface ...
M = 3 N = 3 ; array[0, 3, 0, 0] = -0.000000; array[0, 3, 0, 1] = 0.000000; array[0, 3, 0, 2] = 0.000000
M = 3 N = 3 ; array[0, 4, 0, 0] = -0.000000; array[0, 4, 0, 1] = 0.000000; array[0, 4, 0, 2] = 0.000000
M = 3 N = 3 ; array[0, 5, 0, 0] = 0.000000; array[0, 5, 0, 1] = 0.000000; array[0, 5, 0, 2] = 0.000000
M = 3 N = 3 ; array[0, 0, 0, 0] = 0.000000; array[0, 0, 0, 1] = 0.000000; array[0, 0, 0, 2] = 0.000000
M = 3 N = 3 ; array[0, 1, 0, 0] = 0.000000; array[0, 1, 0, 1] = 0.000000; array[0, 1, 0, 2] = 0.000000
M = 3 N = 3 ; array[0, 2, 0, 0] = 0.000000; array[0, 2, 0, 1] = -0.000000; array[0, 2, 0, 2] = 0.000000
M = 3 N = 3 ; array[0, 3, 1, 0] = -0.000000; array[0, 3, 1, 1] = 0.000000; array[0, 3, 1, 2] = 0.000000
M = 3 N = 3 ; array[0, 4, 1, 0] = -0.000000; array[0, 4, 1, 1] = 0.000000; array[0, 4, 1, 2] = 0.000000
M = 3 N = 3 ; array[0, 5, 1, 0] = 0.000000; array[0, 5, 1, 1] = 0.000000; array[0, 5, 1, 2] = 0.000000
M = 3 N = 3 ; array[0, 0, 1, 0] = 0.000000; array[0, 0, 1, 1] = 0.000000; array[0, 0, 1, 2] = 0.000000
M = 3 N = 3 ; array[0, 1, 1, 0] = 0.000000; array[0, 1, 1, 1] = 0.000000; array[0, 1, 1, 2] = 0.000000
M = 3 N = 3 ; array[0, 2, 1, 0] = 0.000000; array[0, 2, 1, 1] = -0.000000; array[0, 2, 1, 2] = 0.000000
M = 3 N = 3 ; array[0, 3, 2, 0] = -0.000000; array[0, 3, 2, 1] = 0.000000; array[0, 3, 2, 2] = 0.000000
M = 3 N = 3 ; array[0, 4, 2, 0] = -0.000000; array[0, 4, 2, 1] = 0.000000; array[0, 4, 2, 2] = 0.000000
M = 3 N = 3 ; array[0, 5, 2, 0] = 0.000000; array[0, 5, 2, 1] = 0.000000; array[0, 5, 2, 2] = 0.000000
M = 3 N = 3 ; array[0, 0, 2, 0] = 0.000000; array[0, 0, 2, 1] = 0.000000; array[0, 0, 2, 2] = 0.000000
M = 3 N = 3 ; array[0, 1, 2, 0] = 0.000000; array[0, 1, 2, 1] = 0.000000; array[0, 1, 2, 2] = 0.000000
M = 3 N = 3 ; array[0, 2, 2, 0] = 0.000000; array[0, 2, 2, 1] = -0.000000; array[0, 2, 2, 2] = 0.000000
M = 3 N = 3 ; array[0, 3, 3, 0] = -0.000000; array[0, 3, 3, 1] = 0.000000; array[0, 3, 3, 2] = 0.000000
M = 3 N = 3 ; array[0, 4, 3, 0] = -0.000000; array[0, 4, 3, 1] = 0.000000; array[0, 4, 3, 2] = 0.000000
M = 3 N = 3 ; array[0, 5, 3, 0] = 0.000000; array[0, 5, 3, 1] = 0.000000; array[0, 5, 3, 2] = 0.000000
M = 3 N = 3 ; array[0, 0, 3, 0] = 0.000000; array[0, 0, 3, 1] = 0.000000; array[0, 0, 3, 2] = 0.000000
M = 3 N = 3 ; array[0, 1, 3, 0] = 0.000000; array[0, 1, 3, 1] = 0.000000; array[0, 1, 3, 2] = 0.000000
M = 3 N = 3 ; array[0, 2, 3, 0] = 0.000000; array[0, 2, 3, 1] = -0.000000; array[0, 2, 3, 2] = 0.000000
M = 3 N = 3 ; array[0, 3, 4, 0] = -0.000000; array[0, 3, 4, 1] = 0.000000; array[0, 3, 4, 2] = 0.000000
M = 3 N = 3 ; array[0, 4, 4, 0] = -0.000000; array[0, 4, 4, 1] = 0.000000; array[0, 4, 4, 2] = 0.000000
M = 3 N = 3 ; array[0, 5, 4, 0] = 0.000000; array[0, 5, 4, 1] = 0.000000; array[0, 5, 4, 2] = 0.000000
M = 3 N = 3 ; array[0, 0, 4, 0] = 0.000000; array[0, 0, 4, 1] = 0.000000; array[0, 0, 4, 2] = 0.000000
M = 3 N = 3 ; array[0, 1, 4, 0] = 0.000000; array[0, 1, 4, 1] = 0.000000; array[0, 1, 4, 2] = 0.000000
M = 3 N = 3 ; array[0, 2, 4, 0] = 0.000000; array[0, 2, 4, 1] = -0.000000; array[0, 2, 4, 2] = 0.000000
M = 3 N = 3 ; array[0, 3, 5, 0] = -0.000000; array[0, 3, 5, 1] = 0.000000; array[0, 3, 5, 2] = 0.000000
M = 3 N = 3 ; array[0, 4, 5, 0] = -0.000000; array[0, 4, 5, 1] = 0.000000; array[0, 4, 5, 2] = 0.000000
M = 3 N = 3 ; array[0, 5, 5, 0] = 0.000000; array[0, 5, 5, 1] = 0.000000; array[0, 5, 5, 2] = 0.000000
M = 3 N = 3 ; array[0, 0, 5, 0] = 0.000000; array[0, 0, 5, 1] = 0.000000; array[0, 0, 5, 2] = 0.000000
M = 3 N = 3 ; array[0, 1, 5, 0] = 0.000000; array[0, 1, 5, 1] = 0.000000; array[0, 1, 5, 2] = 0.000000
M = 3 N = 3 ; array[0, 2, 5, 0] = 0.000000; array[0, 2, 5, 1] = -0.000000; array[0, 2, 5, 2] = 0.000000
M = 3 N = 3 ; array[0, 3, 6, 0] = -0.000000; array[0, 3, 6, 1] = 0.000000; array[0, 3, 6, 2] = 0.000000
M = 3 N = 3 ; array[0, 4, 6, 0] = -0.000000; array[0, 4, 6, 1] = 0.000000; array[0, 4, 6, 2] = 0.000000
M = 3 N = 3 ; array[0, 5, 6, 0] = 0.000000; array[0, 5, 6, 1] = 0.000000; array[0, 5, 6, 2] = 0.000000
M = 3 N = 3 ; array[0, 0, 6, 0] = 0.000000; array[0, 0, 6, 1] = 0.000000; array[0, 0, 6, 2] = 0.000000
M = 3 N = 3 ; array[0, 1, 6, 0] = 0.000000; array[0, 1, 6, 1] = 0.000000; array[0, 1, 6, 2] = 0.000000
M = 3 N = 3 ; array[0, 2, 6, 0] = 0.000000; array[0, 2, 6, 1] = -0.000000; array[0, 2, 6, 2] = 0.000000
M = 3 N = 3 ; array[0, 3, 7, 0] = -0.000000; array[0, 3, 7, 1] = 0.000000; array[0, 3, 7, 2] = 0.000000
M = 3 N = 3 ; array[0, 4, 7, 0] = -0.000000; array[0, 4, 7, 1] = 0.000000; array[0, 4, 7, 2] = 0.000000
M = 3 N = 3 ; array[0, 5, 7, 0] = 0.000000; array[0, 5, 7, 1] = 0.000000; array[0, 5, 7, 2] = 0.000000
M = 3 N = 3 ; array[0, 0, 7, 0] = 0.000000; array[0, 0, 7, 1] = 0.000000; array[0, 0, 7, 2] = 0.000000
M = 3 N = 3 ; array[0, 1, 7, 0] = 0.000000; array[0, 1, 7, 1] = 0.000000; array[0, 1, 7, 2] = 0.000000
M = 3 N = 3 ; array[0, 2, 7, 0] = 0.000000; array[0, 2, 7, 1] = -0.000000; array[0, 2, 7, 2] = 0.000000
M = 3 N = 3 ; array[1, 3, 0, 0] = 0.000000; array[1, 3, 0, 1] = -0.000000; array[1, 3, 0, 2] = 0.000000
M = 3 N = 3 ; array[1, 3, 0, 0] = -0.000000; array[1, 3, 0, 1] = 0.000000; array[1, 3, 0, 2] = 0.000000
M = 3 N = 3 ; array[1, 4, 0, 0] = -0.000000; array[1, 4, 0, 1] = 0.000000; array[1, 4, 0, 2] = 0.000000
M = 3 N = 3 ; array[1, 4, 0, 0] = 0.000000; array[1, 4, 0, 1] = 0.000000; array[1, 4, 0, 2] = 0.000000
M = 3 N = 3 ; array[1, 5, 0, 0] = 0.000000; array[1, 5, 0, 1] = 0.000000; array[1, 5, 0, 2] = 0.000000
M = 3 N = 3 ; array[1, 5, 0, 0] = 0.000000; array[1, 5, 0, 1] = 0.000000; array[1, 5, 0, 2] = 0.000000
python3: /usr/include/boost/multi_array/base.hpp:312: Reference boost::detail::multi_array::multi_array_impl_base<T, NumDims>::access_element(boost::type<Reference>, const IndexList&, TPtr, const size_type*, const index*, const index*) const [with Reference = double&; IndexList = boost::array<long int, 4>; TPtr = double*; T = double; long unsigned int NumDims = 4; size_type = long unsigned int; index = long int]: Assertion `size_type(indices[i] - index_bases[i]) < extents[i]' failed.
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 11887 RUNNING AT 92f33c5094034595b63af25b7528b0b9
= EXIT CODE: 134
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Aborted (signal 6)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
Adapting the code that generates the error message (raffenet/[email protected]:src/mpi/comm/commutil.c#L1125-L1147) in the unit test helped me find out that the issue was due to a missing cbs.loop(); on the worker nodes. The worker nodes receive a LOOP_ABORT from ~MpiCallbacks() but cannot process it without a blocking MpiCallbacks::loop(), hence the receive queue wasn't empty when the communicator was destroyed, which is a fatal error in MPICH version 4.1+.
#define BOOST_TEST_NO_MAIN
#include "communication.hpp"

#include <boost/mpi/communicator.hpp>
#include <boost/mpi/environment.hpp>
#include <boost/test/unit_test.hpp>

#include <mpi.h>

#include <cstdio>
#include <memory>

int main(int argc, char **argv) {
  auto const mpi_env = std::make_shared<boost::mpi::environment>(argc, argv);
  ::mpi_env = mpi_env; // global handle that keeps the environment alive
  auto const retval = boost::unit_test::unit_test_main(init_unit_test, argc, argv);
  {
    // After the tests have run, probe for messages that were sent but never
    // received; unmatched messages at communicator destruction are a fatal
    // error in MPICH 4.1+.
    boost::mpi::communicator world;
    int flag;
    int unmatched_messages = 0;
    MPI_Comm comm = world;
    MPI_Status status;
    do {
      int mpi_errno = MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &flag, &status);
      printf("rank %d has mpi_errno=%d\n", world.rank(), mpi_errno);
      char buffer[10] = {0}; // large enough for the 4-byte diagnostic payload
      if (flag) {
        int count = 0;
        MPI_Get_count(&status, MPI_CHAR, &count);
        MPI_Recv(buffer, count, MPI_CHAR, status.MPI_SOURCE, status.MPI_TAG, comm,
                 MPI_STATUS_IGNORE);
        unmatched_messages++;
        printf("rank %d received values {%d,%d,%d,%d} from rank %d, with tag %d, size %d Bytes and error code %d.\n",
               world.rank(), (int)(buffer[0]), (int)(buffer[1]), (int)(buffer[2]), (int)(buffer[3]),
               status.MPI_SOURCE, status.MPI_TAG, count, status.MPI_ERROR);
      }
    } while (false); // probe only once; the output below reflects a single pass
    printf("rank %d has %d unmatched messages\n", world.rank(), unmatched_messages);
  }
  return retval;
}
Output:
Running 1 test case...
0: ~MpiCallbacks()
call(0)
Running 1 test case...
1: ~MpiCallbacks()
*** No errors detected
*** No errors detected
rank 0 has mpi_errno=0
rank 0 has 0 unmatched messages
rank 1 has mpi_errno=0
rank 1 received values {0,0,0,0} from rank 0, with tag 2147483647, size 4 Bytes and error code -33873408.
rank 1 has 1 unmatched messages
The bugfix is now in Fedora 41 stable as release espresso-4.2.1-11.fc41 (rpms/espresso).