
Comments (6)

rtobar commented on August 26, 2024

Nope, it's not that: there is an MPI_Barrier on append, so all files should have been closed by the time the attribute is written.

This is happening with all the changes I've done for #88 (where I left the MPI_Barrier), so it's possible I introduced a problem there.

from velociraptor-stf.

rtobar commented on August 26, 2024

Mmm, funny...

On cosma, I realised the error is caused by the file being corrupted. If I try to open it with h5dump after it's closed, but before the extra attributes are written (for which the file is re-opened in RW mode only on rank 0), h5dump crashes with an "internal error".

Also, on the latest master (so completely unrelated to any of the changes I've been testing for #88), when running locally with 4 ranks on localhost, OpenMPI 4.0.3, parallel HDF5, my 4 processes are all hanging at the same H5Fclose under the Fhdf.close() call I showed above (i.e., closing the XXXX.properties.0 output HDF5 file). So maybe there is a deeper underlying problem that shows up in different forms depending on the MPI/HDF5 libraries and versions.

In the case of my local hanging program, the first rank is stuck:

0x00007f8f5cb8a23b in sched_yield () at ../sysdeps/unix/syscall-template.S:120
120     ../sysdeps/unix/syscall-template.S: No such file or directory.
#0  0x00007f8f5cb8a23b in sched_yield () at ../sysdeps/unix/syscall-template.S:120
#1  0x00007f8f5c953315 in ompi_sync_wait_mt () from /usr/lib/x86_64-linux-gnu/libopen-pal.so.40
#2  0x00007f8f5d4ec9f8 in ompi_request_default_wait () from /usr/lib/x86_64-linux-gnu/libmpi.so.40
#3  0x00007f8f5d54f2f3 in ompi_coll_base_barrier_intra_recursivedoubling () from /usr/lib/x86_64-linux-gnu/libmpi.so.40
#4  0x00007f8f5950857d in mca_io_ompio_file_set_size () from /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_io_ompio.so
#5  0x00007f8f5d513011 in PMPI_File_set_size () from /usr/lib/x86_64-linux-gnu/libmpi.so.40
#6  0x00007f8f5d3c52b1 in ?? () from /usr/lib/x86_64-linux-gnu/libhdf5_openmpi.so.103
#7  0x00007f8f5d20e1b1 in H5FD_truncate () from /usr/lib/x86_64-linux-gnu/libhdf5_openmpi.so.103
#8  0x00007f8f5d1f6adc in ?? () from /usr/lib/x86_64-linux-gnu/libhdf5_openmpi.so.103
#9  0x00007f8f5d1f86f0 in H5F__dest () from /usr/lib/x86_64-linux-gnu/libhdf5_openmpi.so.103
#10 0x00007f8f5d1f93e3 in H5F_try_close () from /usr/lib/x86_64-linux-gnu/libhdf5_openmpi.so.103
#11 0x00007f8f5d1f971c in H5F__close_cb () from /usr/lib/x86_64-linux-gnu/libhdf5_openmpi.so.103
#12 0x00007f8f5d26b807 in H5I_dec_ref () from /usr/lib/x86_64-linux-gnu/libhdf5_openmpi.so.103
#13 0x00007f8f5d26b8db in H5I_dec_app_ref () from /usr/lib/x86_64-linux-gnu/libhdf5_openmpi.so.103
#14 0x00007f8f5d1f9152 in H5F__close () from /usr/lib/x86_64-linux-gnu/libhdf5_openmpi.so.103
#15 0x00007f8f5d1eea12 in H5Fclose () from /usr/lib/x86_64-linux-gnu/libhdf5_openmpi.so.103
#16 0x00000000007e5723 in H5OutputFile::close() ()
#17 0x000000000075fc0e in WriteProperties(Options&, long long, PropData*) ()
#18 0x00000000004f6a0b in main ()

While the other three are at:

0x00007f78d493923b in sched_yield () at ../sysdeps/unix/syscall-template.S:120
120     ../sysdeps/unix/syscall-template.S: No such file or directory.
#0  0x00007f78d493923b in sched_yield () at ../sysdeps/unix/syscall-template.S:120
#1  0x00007f78d4702315 in ompi_sync_wait_mt () from /usr/lib/x86_64-linux-gnu/libopen-pal.so.40
#2  0x00007f78d529b9f8 in ompi_request_default_wait () from /usr/lib/x86_64-linux-gnu/libmpi.so.40
#3  0x00007f78d52fe2f3 in ompi_coll_base_barrier_intra_recursivedoubling () from /usr/lib/x86_64-linux-gnu/libmpi.so.40
#4  0x00007f78d52b3730 in PMPI_Barrier () from /usr/lib/x86_64-linux-gnu/libmpi.so.40
#5  0x00007f78d5167a3d in H5AC__run_sync_point () from /usr/lib/x86_64-linux-gnu/libhdf5_openmpi.so.103
#6  0x00007f78d51689bf in H5AC__flush_entries () from /usr/lib/x86_64-linux-gnu/libhdf5_openmpi.so.103
#7  0x00007f78d4f126b8 in H5AC_dest () from /usr/lib/x86_64-linux-gnu/libhdf5_openmpi.so.103
#8  0x00007f78d4fa7788 in H5F__dest () from /usr/lib/x86_64-linux-gnu/libhdf5_openmpi.so.103
#9  0x00007f78d4fa83e3 in H5F_try_close () from /usr/lib/x86_64-linux-gnu/libhdf5_openmpi.so.103
#10 0x00007f78d4fa871c in H5F__close_cb () from /usr/lib/x86_64-linux-gnu/libhdf5_openmpi.so.103
#11 0x00007f78d501a807 in H5I_dec_ref () from /usr/lib/x86_64-linux-gnu/libhdf5_openmpi.so.103
#12 0x00007f78d501a8db in H5I_dec_app_ref () from /usr/lib/x86_64-linux-gnu/libhdf5_openmpi.so.103
#13 0x00007f78d4fa8152 in H5F__close () from /usr/lib/x86_64-linux-gnu/libhdf5_openmpi.so.103
#14 0x00007f78d4f9da12 in H5Fclose () from /usr/lib/x86_64-linux-gnu/libhdf5_openmpi.so.103
#15 0x00000000007e5723 in H5OutputFile::close() ()
#16 0x000000000075fc0e in WriteProperties(Options&, long long, PropData*) ()
#17 0x00000000004f6a0b in main ()
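
One way to read the two backtraces is to compare them from main inward and see where they part ways. The sketch below does that with the frame names copied from the traces above (trimmed just below the point of divergence); `common_prefix_length` is a hypothetical helper written for this illustration, not part of VELOCIraptor or HDF5.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Number of leading frames shared by two call stacks, both listed outermost first.
std::size_t common_prefix_length(const std::vector<std::string>& a,
                                 const std::vector<std::string>& b) {
    const std::size_t n = std::min(a.size(), b.size());
    std::size_t i = 0;
    while (i < n && a[i] == b[i]) {
        ++i;
    }
    return i;
}

// Rank 0, outermost frame first ("??" is the unresolved frame #8 in its trace).
const std::vector<std::string> rank0_frames = {
    "main", "WriteProperties", "H5OutputFile::close", "H5Fclose",
    "H5F__close", "H5I_dec_app_ref", "H5I_dec_ref", "H5F__close_cb",
    "H5F_try_close", "H5F__dest", "??", "H5FD_truncate"};

// The other three ranks, outermost frame first.
const std::vector<std::string> other_frames = {
    "main", "WriteProperties", "H5OutputFile::close", "H5Fclose",
    "H5F__close", "H5I_dec_app_ref", "H5I_dec_ref", "H5F__close_cb",
    "H5F_try_close", "H5F__dest", "H5AC_dest", "H5AC__flush_entries"};
```

Calling `common_prefix_length(rank0_frames, other_frames)` shows that all ranks agree down to H5F__dest and differ below it: rank 0 waits inside PMPI_File_set_size, while the others wait in PMPI_Barrier under H5AC__run_sync_point. So the ranks appear to be blocked in different collective operations within the same H5Fclose.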


rtobar commented on August 26, 2024

Locally I've also double-checked that all HDF5 objects (types, datasets, data spaces, etc.) are closed before we close the file. Nothing seems to be lingering, so it's becoming difficult to pinpoint what's wrong.


rtobar commented on August 26, 2024

I just found out this problem doesn't happen with non-hydro setups, which is something I should have checked long before. Hopefully this will push me in the correct direction now.


rtobar commented on August 26, 2024

Ahhhhhhhhh found it! This was driving me crazy.

The problem was the fix for #100 introduced in 4cdbd52. In this commit I missed the broadcast of the opt.bh_internalprop_output_names value from rank 0, which was present in the old (but buggy) version. Adding the missing broadcast solves the issue, and now finally I can successfully run VR to completion with the inputs and configuration from #87.

I'll push the fix now and double-check that it doesn't break anything before merging to master. After that we can more confidently proceed with merging the changes for #88.
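
For reference, here is a minimal sketch of how a list of names can be prepared for a broadcast from rank 0. It assumes the option is a `std::vector<std::string>` (the actual type of opt.bh_internalprop_output_names is not shown in this thread), and `pack_names` / `unpack_names` are hypothetical helpers, not the code from the actual fix. The MPI calls appear only in comments so that the snippet compiles without an MPI installation.

```cpp
#include <string>
#include <vector>

// Flatten a list of names into one buffer, each name followed by '\0'.
// Assumes the names contain no embedded NUL characters.
std::vector<char> pack_names(const std::vector<std::string>& names) {
    std::vector<char> buf;
    for (const auto& name : names) {
        buf.insert(buf.end(), name.begin(), name.end());
        buf.push_back('\0');
    }
    return buf;
}

// Inverse of pack_names: split the buffer at each '\0'.
std::vector<std::string> unpack_names(const std::vector<char>& buf) {
    std::vector<std::string> names;
    std::string current;
    for (char c : buf) {
        if (c == '\0') {
            names.push_back(current);
            current.clear();
        } else {
            current.push_back(c);
        }
    }
    return names;
}

// With MPI, every rank would then execute the same two collectives:
//   int n = static_cast<int>(buf.size());   // meaningful on rank 0 only
//   MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
//   buf.resize(n);                           // no-op on rank 0
//   MPI_Bcast(buf.data(), n, MPI_CHAR, 0, MPI_COMM_WORLD);
// and finally call unpack_names(buf) to rebuild the list locally.
```

The point of sending the list at all is that every rank should end up with the same option values as rank 0 before any output is written.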


rtobar commented on August 26, 2024

Fixed now in the master branch, closing.

