
Comments (9)

MatthieuSchaller commented on August 26, 2024

I can confirm that the current master works fine with and without MPI if I comment out this section:

# Compute the total luminosity in the 9 GAMA bands
Star_internal_property_names=Luminosities,Luminosities,Luminosities,Luminosities,Luminosities,Luminosities,Luminosities,Luminosities,Luminosities,
Star_internal_property_index_in_file=0,1,2,3,4,5,6,7,8,
Star_internal_property_input_output_unit_conversion_factors=1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,
Star_internal_property_calculation_type=aperture_total,aperture_total,aperture_total,aperture_total,aperture_total,aperture_total,aperture_total,aperture_total,aperture_total,
Star_internal_property_output_units=unitless,unitless,unitless,unitless,unitless,unitless,unitless,unitless,unitless,


rtobar commented on August 26, 2024

I've reproduced this problem with the latest master. Different runs give different errors, though (double free, invalid pointer, corrupted double-linked list, etc.), which points to some kind of memory corruption.

After a small debugging session I think I've spotted what the problem could be. In MPISendReceiveFOFStarInfoBetweenThreads the receive buffer proprecvbuff is resized to numrecv elements, but then receives numrecv * numextrafields values:

else {indicesrecv.resize(numrecv);proprecvbuff.resize(numrecv);}
MPI_Sendrecv(indicessend.data(),numsend, MPI_Int_t, recvTask,
tag*3, indicesrecv.data(),numrecv, MPI_Int_t, recvTask, tag*3, mpi_comm, &status);
MPI_Sendrecv(propsendbuff.data(),numsend*numextrafields, MPI_FLOAT, recvTask,
tag*4, proprecvbuff.data(),numrecv*numextrafields, MPI_FLOAT, recvTask, tag*4, mpi_comm, &status);

The same pattern is coded correctly in MPISendReceiveFOFHydroInfoBetweenThreads, though:

else {indicesrecv.resize(numrecv);proprecvbuff.resize(numrecv*numextrafields);}
MPI_Sendrecv(indicessend.data(),numsend, MPI_Int_t, recvTask,
tag*3, indicesrecv.data(),numrecv, MPI_Int_t, recvTask, tag*3, mpi_comm, &status);
MPI_Sendrecv(propsendbuff.data(),numsend*numextrafields, MPI_FLOAT, recvTask,
tag*4, proprecvbuff.data(),numrecv*numextrafields, MPI_FLOAT, recvTask, tag*4, mpi_comm, &status);

Then it's wrong again in MPISendReceiveFOFBHInfoBetweenThreads and MPISendReceiveFOFExtraDMInfoBetweenThreads:

else {indicesrecv.resize(numrecv);proprecvbuff.resize(numrecv);}
MPI_Sendrecv(indicessend.data(),numsend, MPI_Int_t, recvTask,
tag*3, indicesrecv.data(),numrecv, MPI_Int_t, recvTask, tag*3, mpi_comm, &status);
MPI_Sendrecv(propsendbuff.data(),numsend*numextrafields, MPI_FLOAT, recvTask,
tag*4, proprecvbuff.data(),numrecv*numextrafields, MPI_FLOAT, recvTask, tag*4, mpi_comm, &status);

else {indicesrecv.resize(numrecv);proprecvbuff.resize(numrecv);}
MPI_Sendrecv(indicessend.data(),numsend, MPI_Int_t, recvTask,
tag*3, indicesrecv.data(),numrecv, MPI_Int_t, recvTask, tag*3, mpi_comm, &status);
MPI_Sendrecv(propsendbuff.data(),numsend*numextrafields, MPI_FLOAT, recvTask,
tag*4, proprecvbuff.data(),numrecv*numextrafields, MPI_FLOAT, recvTask, tag*4, mpi_comm, &status);

From the git history it seems this problem has been present ever since these routines were introduced (April 2020). The routine that behaves correctly does so because I fixed it in an earlier commit, 082ff68, for a similar problem reported in #54. Back then I didn't realise this affected more than one routine, and now that it has become clear I'll obviously go and fix them all. I'll also try to unify the code a bit, but without being too disruptive.
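
As an illustration of what that unification could look like, here is a minimal sketch of a single shared helper built from the snippets quoted above. The function name and signature are hypothetical (this is not the actual patch); Int_t and MPI_Int_t are the typedefs already used in the quoted code. The key change is sizing proprecvbuff to numrecv * numextrafields, as the hydro routine already does.

static void MPIExchangeFOFExtraInfo(std::vector<Int_t> &indicessend,
                                    std::vector<float> &propsendbuff,
                                    std::vector<Int_t> &indicesrecv,
                                    std::vector<float> &proprecvbuff,
                                    Int_t numsend, Int_t numrecv, int numextrafields,
                                    int recvTask, int tag, MPI_Comm mpi_comm)
{
    MPI_Status status;
    indicesrecv.resize(numrecv);
    // one float per extra field per received particle, not one per particle
    proprecvbuff.resize(numrecv * numextrafields);
    MPI_Sendrecv(indicessend.data(), numsend, MPI_Int_t, recvTask, tag * 3,
                 indicesrecv.data(), numrecv, MPI_Int_t, recvTask, tag * 3,
                 mpi_comm, &status);
    MPI_Sendrecv(propsendbuff.data(), numsend * numextrafields, MPI_FLOAT, recvTask, tag * 4,
                 proprecvbuff.data(), numrecv * numextrafields, MPI_FLOAT, recvTask, tag * 4,
                 mpi_comm, &status);
}

The star, hydro, BH and extra-DM routines would then only need to fill their type-specific send buffers and call this helper, so the buffer sizing is written (and fixed) in a single place.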


MatthieuSchaller commented on August 26, 2024

That all makes sense. Thanks.

Indeed, it's not related to the latest master. Yesterday I tried to go back in time to find a version that works, but couldn't. That got me really confused, so I took a break.

But one thing that is different, and that your test reveals, is that we are now making use of the extra star properties in the config file. That is something we have not used much.
I added these config options to our EAGLE setup recently and tested them without MPI, so the problem went unnoticed.

Josh then tried that same setup but in MPI-only mode (as he still gets hit by the negative density bug when using OMP), and that's when it all went wrong.

Hopefully the solution you had worked out for the hydro case can be relatively easily transplanted here.


rtobar commented on August 26, 2024

@MatthieuSchaller I pushed the relevant changes to the issue-87 branch. I tested them against the dataset/config you provided and now the code gets past the original problem, so it seems like the issue is gone.

However, at dataset-writing time there is a new crash due to an invalid size being passed to the HDF5 routines. I'll look into that separately in a different issue, but it appears to be specific to parallel HDF5 writing, so deactivating that option might be a workaround. Given this additional problem we can't really confirm that the issue in this ticket is gone until we have a successful execution, so I'll refrain from merging until the second issue is fixed.


MatthieuSchaller commented on August 26, 2024

Great. Let me know whether there is anything I can help with or test. I suppose there isn't much point before the I/O-time crash is solved, but I can nevertheless try if needed.


rtobar commented on August 26, 2024

@MatthieuSchaller a test that you can try, if the resulting files are useful, is to run with the latest issue-87 branch but without parallel HDF5 writing, thus hopefully avoiding #88.


MatthieuSchaller commented on August 26, 2024

Yes, it all works smoothly.


MatthieuSchaller commented on August 26, 2024

The version with parallel HDF5 hangs while writing.


rtobar commented on August 26, 2024

Thanks @MatthieuSchaller for confirming the fix works. Since both you and I have seen the fix working separately, I've now merged it into the master branch, so I'm closing this issue. I'll try to focus now on #88.

