Comments (22)
Thanks, I'll update my copy of the code to use this extra output.
My job checking Chris' config, but with the approximate velocity density calculation, is still in the queue.
If the code performs better with a precise calculation then it's a double-win for us actually. :)
from velociraptor-stf.
(as a side note, reading in the data (4.1 TB) takes 4.6 hrs, which is less than 1% of the theoretical read speed of the system and makes things quite hard to debug)
Any advice on things to tweak in the config or setup is welcome. I have already tried more nodes, fewer ranks per node, and other similar things, but the memory always seems to blow up.
@stuartmcalpine and @bwvdnbro will be interested as well.
> (as a side note, reading in the data (4.1 TB) takes 4.6 hrs, which is less than 1% of the theoretical read speed of the system, and makes things quite hard to debug)
This has been mentioned a couple of times, so I decided to study a bit more what it could be. It seems the configuration value `Input_chunk_size` (set to 1e7 in your config) governs how much data is read at a time from the input file, with float/double datasets reading 3x that many values each time (so for floats that's 1e7 * 3 * 4 bytes = 120 MB chunks). I guess you should easily be able to take this up by at least 10x. Bringing it further up would cross the 2 GB limit, and I'm not sure if we'll run into issues there (as you might remember from #88, we had problems writing > 2 GB in parallel; I don't know what the situation is for reading...). Anyhow, worth giving it a try!
@MatthieuSchaller I tried opening the core files but I don't have enough permissions to do so; e.g.:
```console
$> ll /cosma8/data/dp004/jlvc76/FLAMINGO/ScienceRuns/DMO/L3200N5760/VR/catalogue_0008/core.222618
-rw------- 1 jlvc76 dphlss 32862208 Nov 9 05:03 /cosma8/data/dp004/jlvc76/FLAMINGO/ScienceRuns/DMO/L3200N5760/VR/catalogue_0008/core.222618
$> whoami
dc-toba1
$> groups
dp004 cosma6 clusterusers
```
Permissions fixed.
The read is not done using parallel HDF5, so I don't think the limit applies. I'll try raising that variable by 10x and see whether it helps.
Unfortunately the core files are truncated, so I couldn't even get a stacktrace out of them. This is not entirely surprising: the expected core sizes are on the order of ~200 GB, but SLURM would have SIGKILL'd the processes after waiting a bit while they were still writing their core files.
Based on the log messages, this memory blowup seems to happen roughly in the same place as our latest fix for #53 -- that is, when densities are computed for structures that span MPI ranks. To be clear: I don't think the error happens because of the fix -- and if the logs are to be trusted, the memory spike comes even before that, while particles are being exchanged between ranks in order to perform the calculation. Some of these particle exchanges are based on the `MPI*NNImport*` functions (and the associated `KDTree::SearchBallPos` functions) we looked at recently in #73 (comment). As that comment said, some of the `SearchBallPos` functions return lists which might contain duplicate particle IDs, so it is possible that these lists are blowing up the memory.
While other avenues of investigation make sense too, it could be worth trying this one out to see whether a different data structure actually solves the problem (or not).
Thanks for looking into it.
Based on this analysis, is there anything I should try? Note that I am already trying not to use the SO lists for the same reason of an exploding memory footprint (#112), even though we crash earlier here.
> Thanks for looking into it.
> Based on this analysis, is there anything I should try? Note that I am already trying not to use the SO lists for the same reason of an exploding memory footprint (#112), even though we crash earlier here.
Unfortunately I'm not sure there's much you can do without code modifications, if the problem is indeed where I think it is. When I fixed the code for #73 I thought it would be a good idea to modify `SearchBallPos` and friends to use memory more wisely, but I wanted to keep the change to a minimum. If this issue is related to the same underlying problem of repeated results accumulating during `SearchBallPos`, then I think it's definitely worth me trying to fix the more fundamental issue.
I'll see if I can come up with something you can test more or less quickly and will post any updates here.
Thanks.
FYI, I tried running an even more trimmed-down version where I request no over-density calculation, no aperture calculation, and no radial profiles, and it also ran out of memory.
Not surprising, since the crash happens before any of the property calculations have even started.
I'll see whether changing `MPI_particle_total_buf_size` could help, or `MPI_use_zcurve_mesh_decomposition`, just because it may give us a luckier domain decomposition.
Would it make sense to add a maximal distance `SearchBallPos` could go to? Might be a bit dirty, but there are clear limits we can use from the physics of the model.
Also, would it be possible to try the config you used on Gadi for your big run? If I am not mistaken it was not far from our run in terms of number of particles. It might be a baseline for us to try here.
@doctorcbpower could you point @MatthieuSchaller to the config he's asking about above? Thanks!
Hi @rtobar and @MatthieuSchaller, sorry, I only just saw this. I have been using this config for similar particle numbers in smaller boxes, so it should work.
vr_config.L210N5088.cfg.txt
Thanks! I'll give this a go. Do you remember how much memory was needed for VR to succeed and how many MPI ranks were used?
Good news: that last configuration worked out of the box on my run.
Time to fire up a `diff`.
One quick note related to the I/O comment above: this config took 6h30 to read in the data, then 1h30 for the rest.
> One quick note related to the I/O comment above: this config took 6h30 to read in the data, then 1h30 for the rest.

It doesn't have a value defined for `Input_chunk_size`, and the default is 1e6 -- that's 10x lower than the value you had, so it's reading in 12 MB chunks. This seems to show that the configuration value has a noticeable effect. Did you end up trying 1e8?
Having said that, I've never done exhaustive profiling of the reading code. Even with those read sizes it still sounds like things could be better. If the inputs are compressed there will also be some overhead associated with that, I guess, but I don't know if that's the case.
Hi @MatthieuSchaller, sorry for the delay. That's the config file I use when running VR inline. To be comfortable, I find you basically need to double the memory for the particular runs we do; it may not be as severe given the box sizes you are running. I also had no issue with reading in particle data, so defining `Input_chunk_size` wasn't such an issue.
Here are the bits of the two configs that differ and are plausibly related to the problem we're seeing:

| Parameter Name | Mine | Yours |
|---|---|---|
| `Cosmological_input` | N/A | 1 |
| `MPI_use_zcurve_mesh_decomposition` | 0 | 1 |
| `Particle_search_type` | 1 | 2 |
| `Baryon_searchflag` | 2 | 0 |
| `FoF_Field_search_type` | 5 | 3 |
| `Local_velocity_density_approximate_calculation` | 1 | 0 |
| `Bound_halos` | 0 | 1 |
| `Virial_density` | N/A | 500 |
| `Particle_type_for_reference_frames` | 1 | N/A |
| `Halo_core_phase_merge_dist` | 0.25 | N/A |
| `Structure_phase_merge_dist` | N/A | 0.25 |
| `Overdensity_output_maximum_radius_in_critical_density` | 100 | N/A |
| `Spherical_overdenisty_calculation_limited_to_structure_types` | 4 | N/A |
| `Extensive_gas_properties_output` | 1 | N/A |
| `Extensive_star_properties_output` | 1 | N/A |
| `MPI_particle_total_buf_size` | 10000000000 | 100000000 |

(mine is also used for baryon runs)
I am not quite sure what all of these do.
Looking at this list, I am getting suspicious about `Local_velocity_density_approximate_calculation`, as we seem to crash when computing this in the code. Maybe I should try my configuration but with the more accurate calculation instead.
Do any of the other parameter values look suspicious to you?
@rtobar the data is compressed using some HDF5 filters, so that will play a role. And indeed, the lower default `Input_chunk_size` will have played a role here. I'll try increasing it to 1e8 for the next attempt. SWIFT took 400s to write that compressed data set (using 1200 ranks writing one file each, however, so that's a factor), which is one more reason to believe that some I/O configuration choices here might make this phase a lot faster.
@MatthieuSchaller not a solution, but the latest master now contains some more logging to find out what's going on and how much data is expected to travel through MPI, which seems to be the problem.
Changing `Local_velocity_density_approximate_calculation` indeed makes the code take a different path, so you won't run into this exact issue (but it will presumably still exist, and you might run into something else?). I'm not familiar with all the configuration options, so I can't really tell whether any of the differences above are suspicious.
Eventually got my next test to run.
Taking Chris' config from above and applying the following changes:
- Added `Input_chunk_size=100000000`
- Removed `Cosmological_input=1`
- Removed `Virial_density=500`
- Changed `MPI_particle_total_buf_size` from `100000000` to `10000000000`
- Changed `Local_velocity_density_approximate_calculation` from `0` to `1`

the code crashed in the old way (note: this run was without the new memory-related outputs).
I would think it's the approximate calculation of the local velocity density that is the problem here.
Happy to use the more accurate calculation, since it works and is likely better if I am to believe the parameter name.
To be extra sure, I am now trying again with everything set back to Chris' values apart from that local velocity density parameter.
And now, changing just `Local_velocity_density_approximate_calculation` from `0` to `1` in Chris' configuration, I get the problem.