Comments (6)
Nope, it's not that: there is an MPI_Barrier on append, so all files should have been closed by the time the attribute is written.
This is happening with all the changes I've done for #88 (where I left the MPI_Barrier), so it's possible I introduced a problem there.
from velociraptor-stf.
Mmm, funny...
On cosma, I realised the error is caused by the file being corrupted. If I try to open it with h5dump after it's closed, but before the extra attributes are written (for which the file is re-opened in RW mode only on rank 0), h5dump crashes with an "internal error".
Also, on the latest master (so completely unrelated to any of the changes I've been testing for #88), when running locally with 4 ranks on localhost, OpenMPI 4.0.3, parallel HDF5, my 4 processes are all hanging at the same H5Fclose under the Fhdf.close() call I showed above (i.e., closing the XXXX.properties.0 output HDF5 file). So maybe there is a deeper underlying problem that shows up in different forms depending on the MPI/HDF5 libraries and versions.
In the case of my local hanging program, the first rank is stuck:
0x00007f8f5cb8a23b in sched_yield () at ../sysdeps/unix/syscall-template.S:120
120 ../sysdeps/unix/syscall-template.S: No such file or directory.
#0 0x00007f8f5cb8a23b in sched_yield () at ../sysdeps/unix/syscall-template.S:120
#1 0x00007f8f5c953315 in ompi_sync_wait_mt () from /usr/lib/x86_64-linux-gnu/libopen-pal.so.40
#2 0x00007f8f5d4ec9f8 in ompi_request_default_wait () from /usr/lib/x86_64-linux-gnu/libmpi.so.40
#3 0x00007f8f5d54f2f3 in ompi_coll_base_barrier_intra_recursivedoubling () from /usr/lib/x86_64-linux-gnu/libmpi.so.40
#4 0x00007f8f5950857d in mca_io_ompio_file_set_size () from /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_io_ompio.so
#5 0x00007f8f5d513011 in PMPI_File_set_size () from /usr/lib/x86_64-linux-gnu/libmpi.so.40
#6 0x00007f8f5d3c52b1 in ?? () from /usr/lib/x86_64-linux-gnu/libhdf5_openmpi.so.103
#7 0x00007f8f5d20e1b1 in H5FD_truncate () from /usr/lib/x86_64-linux-gnu/libhdf5_openmpi.so.103
#8 0x00007f8f5d1f6adc in ?? () from /usr/lib/x86_64-linux-gnu/libhdf5_openmpi.so.103
#9 0x00007f8f5d1f86f0 in H5F__dest () from /usr/lib/x86_64-linux-gnu/libhdf5_openmpi.so.103
#10 0x00007f8f5d1f93e3 in H5F_try_close () from /usr/lib/x86_64-linux-gnu/libhdf5_openmpi.so.103
#11 0x00007f8f5d1f971c in H5F__close_cb () from /usr/lib/x86_64-linux-gnu/libhdf5_openmpi.so.103
#12 0x00007f8f5d26b807 in H5I_dec_ref () from /usr/lib/x86_64-linux-gnu/libhdf5_openmpi.so.103
#13 0x00007f8f5d26b8db in H5I_dec_app_ref () from /usr/lib/x86_64-linux-gnu/libhdf5_openmpi.so.103
#14 0x00007f8f5d1f9152 in H5F__close () from /usr/lib/x86_64-linux-gnu/libhdf5_openmpi.so.103
#15 0x00007f8f5d1eea12 in H5Fclose () from /usr/lib/x86_64-linux-gnu/libhdf5_openmpi.so.103
#16 0x00000000007e5723 in H5OutputFile::close() ()
#17 0x000000000075fc0e in WriteProperties(Options&, long long, PropData*) ()
#18 0x00000000004f6a0b in main ()
While the other three are at:
0x00007f78d493923b in sched_yield () at ../sysdeps/unix/syscall-template.S:120
120 ../sysdeps/unix/syscall-template.S: No such file or directory.
#0 0x00007f78d493923b in sched_yield () at ../sysdeps/unix/syscall-template.S:120
#1 0x00007f78d4702315 in ompi_sync_wait_mt () from /usr/lib/x86_64-linux-gnu/libopen-pal.so.40
#2 0x00007f78d529b9f8 in ompi_request_default_wait () from /usr/lib/x86_64-linux-gnu/libmpi.so.40
#3 0x00007f78d52fe2f3 in ompi_coll_base_barrier_intra_recursivedoubling () from /usr/lib/x86_64-linux-gnu/libmpi.so.40
#4 0x00007f78d52b3730 in PMPI_Barrier () from /usr/lib/x86_64-linux-gnu/libmpi.so.40
#5 0x00007f78d5167a3d in H5AC__run_sync_point () from /usr/lib/x86_64-linux-gnu/libhdf5_openmpi.so.103
#6 0x00007f78d51689bf in H5AC__flush_entries () from /usr/lib/x86_64-linux-gnu/libhdf5_openmpi.so.103
#7 0x00007f78d4f126b8 in H5AC_dest () from /usr/lib/x86_64-linux-gnu/libhdf5_openmpi.so.103
#8 0x00007f78d4fa7788 in H5F__dest () from /usr/lib/x86_64-linux-gnu/libhdf5_openmpi.so.103
#9 0x00007f78d4fa83e3 in H5F_try_close () from /usr/lib/x86_64-linux-gnu/libhdf5_openmpi.so.103
#10 0x00007f78d4fa871c in H5F__close_cb () from /usr/lib/x86_64-linux-gnu/libhdf5_openmpi.so.103
#11 0x00007f78d501a807 in H5I_dec_ref () from /usr/lib/x86_64-linux-gnu/libhdf5_openmpi.so.103
#12 0x00007f78d501a8db in H5I_dec_app_ref () from /usr/lib/x86_64-linux-gnu/libhdf5_openmpi.so.103
#13 0x00007f78d4fa8152 in H5F__close () from /usr/lib/x86_64-linux-gnu/libhdf5_openmpi.so.103
#14 0x00007f78d4f9da12 in H5Fclose () from /usr/lib/x86_64-linux-gnu/libhdf5_openmpi.so.103
#15 0x00000000007e5723 in H5OutputFile::close() ()
#16 0x000000000075fc0e in WriteProperties(Options&, long long, PropData*) ()
#17 0x00000000004f6a0b in main ()
Locally I've also double-checked that all HDF5 objects (types, datasets, data spaces, etc.) are closed before we close the file. There seems to be nothing lingering, so it's becoming difficult to pinpoint what's wrong.
I just found out this problem doesn't happen with non-hydro setups, which is something I should have checked long before. Hopefully this will push me in the correct direction now.
Ahhhhhhhhh found it! This was driving me crazy.
The problem was the fix for #100 introduced in 4cdbd52. In this commit I missed the broadcast of the opt.bh_internalprop_output_names value from rank 0, which was present in the old (but buggy) version. Adding the missing broadcast solves the issue, and now finally I can successfully run VR to completion with the inputs and configuration from #87.
I'll push the fix now and double-check that it doesn't break anything prior to merging to master. After that we can more confidently proceed with merging the changes for #88.
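For anyone hitting something similar, here is a minimal sketch of how a list of output names could be broadcast from rank 0 so that every rank ends up with the same list. This is not the code from the actual fix: `pack_names`, `unpack_names`, `broadcast_names` and the `USE_MPI` macro are hypothetical names chosen for illustration, and the sketch assumes the names contain no embedded NUL characters. The likely reason this matters is that ranks holding different name lists would not issue the same sequence of collective HDF5 calls, which would be consistent with the mismatched barriers in the backtraces above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

#ifdef USE_MPI  // hypothetical build macro, only for this sketch
#include <mpi.h>
#endif

// Flatten a list of names into one buffer; each name is followed by '\0'.
std::vector<char> pack_names(const std::vector<std::string>& names) {
    std::vector<char> buf;
    for (const auto& name : names) {
        buf.insert(buf.end(), name.begin(), name.end());
        buf.push_back('\0');
    }
    return buf;
}

// Rebuild the list of names from a buffer produced by pack_names.
std::vector<std::string> unpack_names(const std::vector<char>& buf) {
    // A well-formed buffer is either empty or ends with a terminator.
    assert(buf.empty() || buf.back() == '\0');
    std::vector<std::string> names;
    std::size_t start = 0;
    for (std::size_t i = 0; i < buf.size(); ++i) {
        if (buf[i] == '\0') {
            names.emplace_back(buf.data() + start, i - start);
            start = i + 1;
        }
    }
    return names;
}

#ifdef USE_MPI
// Overwrite `names` on every rank with the list held by `root`.
// All ranks in `comm` must call this function.
void broadcast_names(std::vector<std::string>& names, int root, MPI_Comm comm) {
    int rank = 0;
    MPI_Comm_rank(comm, &rank);

    std::vector<char> buf;
    if (rank == root) buf = pack_names(names);

    // First agree on the buffer size, then send the payload itself.
    int size = static_cast<int>(buf.size());
    MPI_Bcast(&size, 1, MPI_INT, root, comm);
    buf.resize(static_cast<std::size_t>(size));
    if (size > 0) MPI_Bcast(buf.data(), size, MPI_CHAR, root, comm);

    if (rank != root) names = unpack_names(buf);
}
#endif
```

With MPI enabled, each rank would call something like `broadcast_names(names, 0, MPI_COMM_WORLD)` right after rank 0 has parsed the configuration. The pack and unpack helpers are kept separate from the MPI calls so they can be checked on their own without an MPI runtime.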
Fixed now in the master branch, closing.