
Comments (35)

colinjcotter commented on July 29, 2024

I think that it is specific to `minimize()`, because if I just call `Jhat.derivative()` then the memory usage asymptotes to a constant value.
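A minimal sketch of the kind of loop under discussion, assuming the `firedrake.adjoint` taping API; the mesh, functional, and control here are invented for illustration:

```python
from firedrake import *
from firedrake.adjoint import *

continue_annotation()

# Hypothetical forward model: any taped computation would do.
mesh = UnitSquareMesh(32, 32)
V = FunctionSpace(mesh, "CG", 1)
u = Function(V, name="control")
J = assemble(u * u * dx)

Jhat = ReducedFunctional(J, Control(u))
pause_annotation()

# Repeatedly re-evaluate the functional and its derivative while
# watching memory usage (e.g. with mprof).
for i in range(1000):
    Jhat(u)
    Jhat.derivative()
```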


dham commented on July 29, 2024

> > > > @dham I'm using mprof, which just gives a time series from snapshots. I suppose I can just divide the final number by the number of for-loop iterations.
> > > >
> > > > @connorjward I am getting what seems to be an increase when using Jhat(). What made you think it wasn't happening?
> > >
> > > Just double checked. The memory usage stays constant if you have the lines
> > >
> > >     gc.collect()
> > >     PETSc.garbage_cleanup(mesh._comm)
> > >
> > > inside the loop. Otherwise it does increase over time. This is not going to be the solution here, since we're still leaking with these calls when we use `minimize`.
> >
> > Are both necessary, or is the second sufficient? This is a bit weird, because NLVS.solve should be called when re-evaluating the functional, and that should trigger the garbage cleanup.
>
> Actually only the first (gc.collect()) is required. This must be because we're otherwise only rarely cleaning up cyclically-referenced objects, so the memory appears to grow at first. Ultimately it would get cleared when Python decides to do its own cleanup.

OK, so it sounds like we need to add gc.collect() in front of the PETSc collection operation when we call it.
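A sketch of that ordering, continuing the hypothetical setup from the sketch above (the two cleanup calls are the ones discussed in this thread; the loop itself is illustrative):

```python
import gc
from firedrake.petsc import PETSc

for i in range(1000):
    Jhat(u)
    Jhat.derivative()
    # Break Python reference cycles first, so cyclically-referenced
    # wrappers release their PETSc objects ...
    gc.collect()
    # ... then let PETSc free the collectively-allocated objects whose
    # destruction was deferred onto the mesh communicator.
    PETSc.garbage_cleanup(mesh._comm)
```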


connorjward commented on July 29, 2024

I believe that this is a known issue. We only clean up the memory when solve is called.

I think that your issue will be resolved if you add the line `PETSc.garbage_cleanup(mesh._comm)` inside your loop.


connorjward commented on July 29, 2024

When we evaluate the adjoint we use `LinearSolver`s instead of `LinearVariationalSolver`s (since we can't have RHSs that are Cofunctions). I think it makes sense that we're not hitting the code path where we call `garbage_cleanup`.


maneeshkrsingh commented on July 29, 2024

When executing this script with the added line `PETSc.garbage_cleanup(mesh._comm)` in the loop, memory usage starts at approximately 200 MB and gradually climbs to 2 GB within the first 200 steps. By the 1000th step, total memory consumption peaks at around 7 GB. The results are the same without this line.

colinjcotter commented on July 29, 2024

Yes, I also observe that `garbage_cleanup` does not help.

colinjcotter commented on July 29, 2024

(attached: mprofile_20230821135302.dat.gz)

colinjcotter commented on July 29, 2024

... actually it is still creeping upwards; just waiting for a longer run.

connorjward commented on July 29, 2024

I think I've observed that Jhat(...) is also increasing memory usage. I'll continue to investigate.

I was wrong.

dham commented on July 29, 2024

Can we get an approximate quantification of the leakage in units of Functions per timestep? I think that might help us understand what's going on (e.g. are we leaking Functions or solvers).

dham commented on July 29, 2024

You can approximate a Function as number of DoFs × 8 bytes.
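As a worked example of that conversion (the mesh and DoF count here are hypothetical):

```python
# A scalar P1 Function on a 100 x 100 UnitSquareMesh has
# 101 * 101 = 10201 DoFs, i.e. one float64 per vertex.
ndofs = 101 * 101
bytes_per_function = ndofs * 8
print(f"{bytes_per_function / 1e6:.3f} MB per Function")  # ~0.082 MB
```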


dham commented on July 29, 2024

Also, can you try a different optimiser?

colinjcotter commented on July 29, 2024

@dham I'm using mprof, which just gives a time series from snapshots. I suppose I can just divide the final number by the number of for-loop iterations.

@connorjward I am getting what seems to be an increase when using Jhat(). What made you think it wasn't happening?

colinjcotter commented on July 29, 2024

@dham: I'm just calling `Jhat` and `Jhat.derivative`, and that's enough to see a memory increase in mprof.

colinjcotter commented on July 29, 2024

@dham do you mean some function other than minimize()? Or some different options in the call to minimize?

dham commented on July 29, 2024

I'm a bit confused, because you seem to have said something a bit different an hour ago. In any event, if it's in our code, that's easier to deal with. So what we need to know is what we're leaking, in units of Functions per step.

connorjward commented on July 29, 2024

> @dham I'm using mprof, which just gives a time series from snapshots. I suppose I can just divide the final number by the number of for-loop iterations.
>
> @connorjward I am getting what seems to be an increase when using Jhat(). What made you think it wasn't happening?

Just double checked. The memory usage stays constant if you have the lines

    gc.collect()
    PETSc.garbage_cleanup(mesh._comm)

inside the loop. Otherwise it does increase over time. This is not going to be the solution here, since we're still leaking with these calls when we use `minimize`.

dham commented on July 29, 2024

> > @dham I'm using mprof, which just gives a time series from snapshots. I suppose I can just divide the final number by the number of for-loop iterations.
> >
> > @connorjward I am getting what seems to be an increase when using Jhat(). What made you think it wasn't happening?
>
> Just double checked. The memory usage stays constant if you have the lines
>
>     gc.collect()
>     PETSc.garbage_cleanup(mesh._comm)
>
> inside the loop. Otherwise it does increase over time. This is not going to be the solution here, since we're still leaking with these calls when we use `minimize`.

Are both necessary, or is the second sufficient?

This is a bit weird, because NLVS.solve should be called when re-evaluating the functional, and that should trigger the garbage cleanup.

colinjcotter commented on July 29, 2024

Thanks for nailing it down a bit more, @connorjward!

colinjcotter commented on July 29, 2024

@dham the second is not sufficient.

Summary: if the loop just calls `Jhat` and/or `Jhat.derivative()`, then `gc.collect()` and `garbage_cleanup(...)` are necessary and sufficient. If the loop calls `minimize()`, we are leaking.

connorjward commented on July 29, 2024

> > > @dham I'm using mprof, which just gives a time series from snapshots. I suppose I can just divide the final number by the number of for-loop iterations.
> > >
> > > @connorjward I am getting what seems to be an increase when using Jhat(). What made you think it wasn't happening?
> >
> > Just double checked. The memory usage stays constant if you have the lines
> >
> >     gc.collect()
> >     PETSc.garbage_cleanup(mesh._comm)
> >
> > inside the loop. Otherwise it does increase over time. This is not going to be the solution here, since we're still leaking with these calls when we use `minimize`.
>
> Are both necessary, or is the second sufficient?
>
> This is a bit weird, because NLVS.solve should be called when re-evaluating the functional, and that should trigger the garbage cleanup.

Actually only the first (gc.collect()) is required. This must be because we're otherwise only rarely cleaning up cyclically-referenced objects, so the memory appears to grow at first. Ultimately it would get cleared when Python decides to do its own cleanup.

connorjward commented on July 29, 2024

To be clear though, this does not fix the memory leak being experienced. Adding gc.collect() and garbage_cleanup calls does not stop the memory from leaking in the example given above.

colinjcotter commented on July 29, 2024

Here's the output of the above from mprof (so the x-axis is sample times). It is 200 calls to minimize.

[Figure_1: mprof memory-usage trace over the 200 minimize calls]

colinjcotter commented on July 29, 2024

This is with gc.collect() and garbage_cleanup added after each minimize.

It looks like most of the memory is eventually collected, but this always happens at the end of the loop, no matter how long the loop is.

colinjcotter commented on July 29, 2024

That's about 5250 MB leaked in total, which is about 26.25 MB per minimize() call.
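Converted into dham's requested units, that works out roughly as follows (a back-of-envelope estimate; the DoF count is hypothetical):

```python
leak_per_call = 26.25e6                # bytes leaked per minimize() call
doubles_per_call = leak_per_call / 8   # ~3.3 million float64 values
ndofs = 10_000                         # hypothetical DoFs per Function
print(f"~{doubles_per_call / ndofs:.0f} Functions per minimize() call")
```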


colinjcotter commented on July 29, 2024

I am interested in trying other optimisers, but there is not even any API documentation for minimize(), so I don't know how.
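For what it's worth, a sketch of how a different optimiser might be requested, assuming pyadjoint's `minimize` (re-exported by `firedrake.adjoint`) is a thin wrapper around `scipy.optimize.minimize`; the method name and options below are standard SciPy arguments, not something confirmed in this thread:

```python
from firedrake.adjoint import minimize

# Default method is L-BFGS-B; other scipy.optimize methods can be
# selected by name, with solver options passed through.
u_opt = minimize(Jhat, method="CG", options={"maxiter": 50, "disp": True})
```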


connorjward commented on July 29, 2024

I've figured it out. We are leaking PETSc objects that are stored on COMM_SELF. Adding the line `PETSc.garbage_cleanup(PETSc.COMM_SELF)` inside the loop makes the leak go away. You can see the number of objects increasing if you call `PETSc.garbage_view(PETSc.COMM_SELF)`.

The proper fix for this is probably to add some code to PETSc/petsc4py so COMM_SELF is always cleared when garbage_cleanup is called. Or one could actually eagerly delete these objects, since there are no deadlock concerns.
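A sketch of the diagnostic and the workaround described here (the loop is schematic, with `Jhat` and `minimize` as in the earlier sketches; `garbage_view` and `garbage_cleanup` are the petsc4py calls named above):

```python
from firedrake.adjoint import minimize
from firedrake.petsc import PETSc

for i in range(200):
    minimize(Jhat)
    # Print the deferred PETSc objects accumulating on COMM_SELF ...
    PETSc.garbage_view(PETSc.COMM_SELF)
    # ... then destroy them, which makes the leak go away.
    PETSc.garbage_cleanup(PETSc.COMM_SELF)
```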


maneeshkrsingh commented on July 29, 2024

Yes, it works. It significantly reduces memory use.

wence- commented on July 29, 2024

> The proper fix for this is probably to add some code to PETSc/petsc4py so COMM_SELF is always cleared when garbage_cleanup is called. Or one could actually eagerly delete these objects, since there are no deadlock concerns.

Yeah, if comm.size == 1 in PETSc then you never need to defer collection.

connorjward commented on July 29, 2024

> > The proper fix for this is probably to add some code to PETSc/petsc4py so COMM_SELF is always cleared when garbage_cleanup is called. Or one could actually eagerly delete these objects, since there are no deadlock concerns.
>
> Yeah, if comm.size == 1 in PETSc then you never need to defer collection.

That's exactly what I've done here. I will close this issue when that gets merged and our PETSc fork is updated.

colinjcotter commented on July 29, 2024

Thanks for clearing this up!

Once the fix is in, we would still need to call `garbage_cleanup` in this case - is that right?

What kind of objects are on COMM_SELF?

connorjward commented on July 29, 2024

> Thanks for clearing this up!
>
> Once the fix is in, we would still need to call `garbage_cleanup` in this case - is that right?

No, you shouldn't need to call `garbage_cleanup`. It gets called automatically by the `LinearSolver` during the adjoint calculation. Also, if you're only running this in serial, then things should always be eagerly collected anyway.

> What kind of objects are on COMM_SELF?

I think there were a lot of `PetscSF`s and the like.

connorjward commented on July 29, 2024

> > > The proper fix for this is probably to add some code to PETSc/petsc4py so COMM_SELF is always cleared when garbage_cleanup is called. Or one could actually eagerly delete these objects, since there are no deadlock concerns.
> >
> > Yeah, if comm.size == 1 in PETSc then you never need to defer collection.
>
> That's exactly what I've done here. I will close this issue when that gets merged and our PETSc fork is updated.

Closing as this has happened.
