Comments (22)
Thanks, I'll update my copy of the code to use this extra output.
My job checking Chris' config, but with the approximate velocity density calculation, is still in the queue.
If the code performs better with a precise calculation then it's a double-win for us actually. :)
from velociraptor-stf.
(as a side note, reading in the data (4.1 TB) takes 4.6 hrs, which is less than 1% of the theoretical read speed of the system and makes things quite hard to debug)
Any advice on things to tweak in the config or setup is welcome. I have already tried more nodes, fewer ranks per node, and other similar things, but the memory always seems to blow up.
@stuartmcalpine and @bwvdnbro will be interested as well.
> (as a side note, reading in the data (4.1 TB) takes 4.6 hrs, which is less than 1% of the theoretical read speed of the system, and makes things quite hard to debug)
This has been mentioned a couple of times, so I decided to study a bit more what it could be. It seems the configuration value `Input_chunk_size` (set to 1e7 in your config) governs how much data is read at a time from the input file, with float/double datasets reading 3x that many values each time (so for floats that's 1e7 * 3 * 4 bytes = 120 MB chunks). I guess you should easily be able to take this up by at least 10x. Bringing it further up would cross the 2 GB limit, and I'm not sure if we'll run into issues there (as you might remember from #88, we had problems writing > 2 GB in parallel; I don't know what the situation is for reading...). Anyhow, worth giving it a try!
@MatthieuSchaller I tried opening the core files but I don't have enough permissions to do so; e.g.:
```console
$> ll /cosma8/data/dp004/jlvc76/FLAMINGO/ScienceRuns/DMO/L3200N5760/VR/catalogue_0008/core.222618
-rw------- 1 jlvc76 dphlss 32862208 Nov 9 05:03 /cosma8/data/dp004/jlvc76/FLAMINGO/ScienceRuns/DMO/L3200N5760/VR/catalogue_0008/core.222618
$> whoami
dc-toba1
$> groups
dp004 cosma6 clusterusers
```
Permissions fixed.
The read is not done using parallel HDF5, so I don't think the limit applies. I'll try raising that variable by 10x and see whether it helps.
Unfortunately the core files are truncated, so I couldn't even get a stacktrace out of them. This is not entirely surprising: the expected core sizes are on the order of ~200 GB, but SLURM would have SIGKILL'd the processes after waiting a bit while they were still writing their core files.
Based on the log messages, this memory blowup seems to happen roughly in the same place as our latest fix for #53 -- that is, when densities are computed for structures that span MPI ranks. To be clear: I don't think the error happens because of the fix -- and if the logs are to be trusted, the memory spike comes even before that, while particles are being exchanged between ranks in order to perform the calculation. Some of these particle exchanges are based on the `MPI*NNImport*` functions (and the associated `KDTree::SearchBallPos` functions) we looked at recently in #73 (comment). As that comment said, some of the `SearchBallPos` functions return lists which might contain duplicate particle IDs, so it is possible that these lists are blowing up the memory.
While other avenues of investigation make sense too, it could be worth trying this one out to see whether a different data structure actually solves the problem (or not).
Thanks for looking into it.
Based on this analysis, is there anything I should try? Note that I am already trying not to use the SO lists for the same reason of an exploding memory footprint (#112), even though we crash earlier here.
> Thanks for looking into it.
> Based on this analysis, is there anything I should try? Note that I am already trying not to use the SO lists for the same reason of an exploding memory footprint (#112), even though we crash earlier here.
Unfortunately I'm not sure there's much you can do without code modifications, if the problem is indeed where I think it is. When I fixed the code for #73 I thought it would be a good idea to modify `SearchBallPos` and friends to use memory more wisely, but I wanted to keep the change to a minimum. If this issue is related to the same underlying problem of repeated results accumulating during `SearchBallPos`, then I think it's definitely worth me trying to fix the more fundamental issue.
I'll see if I can come up with something you can test more or less quickly and will post any updates here.
Thanks.
FYI, I tried running an even more trimmed-down version where I request no over-density calculation, no aperture calculation, and no radial profiles, and it also ran out of memory.
Not surprising, since the crash happens before any of the property calculations have even started.
I'll see whether changing `MPI_particle_total_buf_size` could help, or `MPI_use_zcurve_mesh_decomposition`, just because it may give us a luckier domain decomposition.
Would it make sense to add a maximal distance `SearchBallPos` could go to? Might be a bit dirty, but there are clear limits we can use from the physics of the model.
Also, would it be possible to try the config you used on Gadi for your big run? If I am not mistaken it was not far from our run in terms of number of particles. It might be a baseline for us to try here.
@doctorcbpower could you point @MatthieuSchaller to the config he's asking about above? Thanks!
Hi @rtobar and @MatthieuSchaller, sorry, I only just saw this. I have been using this config for similar particle numbers in smaller boxes, so it should work.
vr_config.L210N5088.cfg.txt
Thanks! I'll give this a go. Do you remember how much memory was needed for VR to succeed and how many MPI ranks were used?
Good news: that last configuration worked out of the box on my run.
Time to fire up a `diff`.
One quick note related to the I/O comment above: this config took 6h30 to read in the data, then 1h30 for the rest.
> One quick note related to the I/O comment above: this config took 6h30 to read in the data, then 1h30 for the rest.

It doesn't have a value defined for `Input_chunk_size`, and the default is 1e6 -- that's 10x lower than the value you had, so it's reading in 12 MB chunks. This seems to show that the configuration value has a noticeable effect. Did you end up trying 1e8?
Having said that, I've never done exhaustive profiling of the reading code. Even with those read sizes it still sounds like things could be better. If the inputs are compressed there will also be some overhead associated with that, I guess, but I don't know if that's the case.
Hi @MatthieuSchaller, sorry for the delay. That's the config file I use when running VR inline. To be comfortable, I find you basically need to double the memory for the particular runs we do; it may not be as severe given the box sizes you are running. I also had no issue with reading in particle data, so defining `Input_chunk_size` wasn't such an issue.
Here are the bits of the two configs that differ and are plausibly related to the problem we're seeing:

| Parameter Name | Mine | Yours |
|---|---|---|
| `Cosmological_input` | N/A | 1 |
| `MPI_use_zcurve_mesh_decomposition` | 0 | 1 |
| `Particle_search_type` | 1 | 2 |
| `Baryon_searchflag` | 2 | 0 |
| `FoF_Field_search_type` | 5 | 3 |
| `Local_velocity_density_approximate_calculation` | 1 | 0 |
| `Bound_halos` | 0 | 1 |
| `Virial_density` | N/A | 500 |
| `Particle_type_for_reference_frames` | 1 | N/A |
| `Halo_core_phase_merge_dist` | 0.25 | N/A |
| `Structure_phase_merge_dist` | N/A | 0.25 |
| `Overdensity_output_maximum_radius_in_critical_density` | 100 | N/A |
| `Spherical_overdenisty_calculation_limited_to_structure_types` | 4 | N/A |
| `Extensive_gas_properties_output` | 1 | N/A |
| `Extensive_star_properties_output` | 1 | N/A |
| `MPI_particle_total_buf_size` | 10000000000 | 100000000 |

(mine is also used for baryon runs)
I am not quite sure what all of these do.
Looking at this list, I am getting suspicious about `Local_velocity_density_approximate_calculation`, as we seem to crash when computing this in the code. Maybe I should try my configuration but with the more accurate calculation instead.
Do any of the other parameter values look suspicious to you?
@rtobar the data is compressed using some HDF5 filters, so that will play a role. And indeed, the lower default `Input_chunk_size` will have played a role here. I'll try increasing it to 1e8 for the next attempt. SWIFT took 400s to write that compressed data set (using 1200 ranks writing one file each, however, so that's a factor), which is one more reason to believe that some I/O configuration choices here might make this phase a lot faster.
@MatthieuSchaller not a solution, but the latest master now contains some more logging to find out what's going on and how much data is expected to travel through MPI, which seems to be the problem.
Changing `Local_velocity_density_approximate_calculation` indeed makes the code take a different path, so you won't run into this exact issue (but it will presumably still exist, and you might run into something else?). I'm not familiar with all the configuration options, so I can't really tell whether any of the differences above are suspicious.
Eventually got my next test to run.
Taking Chris' config from above and applying the following changes:
- Added `Input_chunk_size=100000000`
- Removed `Cosmological_input=1`
- Removed `Virial_density=500`
- Changed `MPI_particle_total_buf_size` from `100000000` to `10000000000`
- Changed `Local_velocity_density_approximate_calculation` from `0` to `1`

the code crashed in the old way (note: this run was without the new memory-related outputs).
I would think it's the approximate calculation of the local velocity density that is the problem here.
Happy to use the more accurate calculation, since it works and is likely better if I am to believe the parameter name.
To be extra sure, I am now trying again with everything set back to Chris' values apart from that local velocity density parameter.
And now, changing just `Local_velocity_density_approximate_calculation` from `0` to `1` in Chris' configuration, I get the problem.