Comments (7)
> Interestingly, after 3 to 4 iterations, the memory usage stabilizes.

Yeah, this sounds like normal allocator behavior. You can look into https://www.gnu.org/software/libc/manual/html_node/Memory-Allocation-Tunables.html, but I think this is probably:
- Outside the scope of PyTorch
- Unrelated to the GPU-CPU transfers.
You can also try some other allocator like https://jemalloc.net/, but you'll probably see similar behavior.
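As a quick way to see that plateau for yourself, here is a minimal sketch (assuming a Linux box with glibc, a CUDA device, and the third-party `psutil` package, none of which come from the report above) that watches the process RSS across iterations:

```python
import os

import psutil
import torch

proc = psutil.Process(os.getpid())

def rss_mib() -> float:
    """Resident set size of the current process, in MiB."""
    return proc.memory_info().rss / 2**20

# Allocate, round-trip through the GPU, and free a large batch repeatedly.
# With glibc malloc, RSS typically climbs for the first few iterations and
# then plateaus once the allocator starts reusing its cached arenas.
for i in range(6):
    batch = torch.randn(300, 1000, 1000)   # ~1.1 GiB of float32 on the host
    batch = batch.to("cuda").to("cpu")
    del batch
    torch.cuda.empty_cache()
    print(f"iteration {i}: RSS = {rss_mib():.1f} MiB")
```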
I'm also having this problem. Does anyone have a solution or an idea of how to solve it?
cc @albanD
Quick observations:
- The memory is always stable between the `with no_grad` line and the last line of the function.
- I'm not sure how to interpret one line showing 452.1 MB and the next showing 333.8 MB when the increment is 0?
Also, on CPU, libc malloc is used for memory allocation. Depending on which memory measure you're looking at, there are quite a few cases where the allocator will keep memory around and not give it back to the OS. This would explain the first few iterations allocating more until malloc starts to properly re-use memory it has cached.
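One way to check that this is caching rather than a leak: on glibc-based Linux you can ask the allocator to hand its free cache back to the OS and see whether RSS drops. A sketch via `ctypes` (glibc-specific; `malloc_trim` does not exist on musl, macOS, or Windows):

```python
import ctypes

# glibc-only: malloc_trim(0) releases free memory at the top of the heap
# (and unused arena pages) back to the OS. If RSS drops after this call,
# the growth was allocator caching, not a leak in PyTorch.
libc = ctypes.CDLL("libc.so.6")
released = libc.malloc_trim(0)  # returns 1 if any memory was released
print("malloc_trim released memory:", bool(released))
```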
Thanks for the reply. I'll try to explain a bit better.
> The memory is always stable between the `with no_grad` line and the last line of the function.

Indeed, the memory remains stable between the `with torch.no_grad()` context manager and the final line of the function. However, that stable level is not the same as the memory in use when the function starts, and the starting baseline grows from one iteration to the next.
To illustrate the memory usage throughout the function, I've added print statements to highlight the differences:
```
PyTorch version: 2.2.2+cu121
******* Iteration num: 1 ***********
Start "with torch_no_grad()"
End "with torch_no_grad()"
End run_test()
Filename: test.py
Line # Mem usage Increment Occurrences Line Contents
=============================================================
8 332.1 MiB 332.1 MiB 1 @profile
9 def run_test():
10
11 332.1 MiB 0.0 MiB 1 print('Start "with torch_no_grad()"')
12
13 449.8 MiB 0.0 MiB 2 with torch.no_grad():
14 332.1 MiB 0.0 MiB 1 batch_size = 300
15 332.1 MiB 0.0 MiB 1 tensor_size = (1000, 1000)
16 # Create a batch tensor in one line
17 1479.1 MiB 1147.0 MiB 303 batch_tensors = torch.stack([torch.randn(tensor_size) for _ in range(batch_size)]).to('cpu')
18
19 449.7 MiB -1029.5 MiB 1 batch_tensors = batch_tensors.to('cuda')
20 1594.1 MiB 1144.4 MiB 1 batch_tensors = batch_tensors.to('cpu').detach()
21 # Print the size of the batch tensor
22 449.8 MiB -1144.3 MiB 1 del batch_tensors
23
24 449.8 MiB 0.0 MiB 1 print('End "with torch_no_grad()"')
25
26 449.8 MiB 0.0 MiB 1 gc.collect()
27 449.8 MiB 0.0 MiB 1 torch.cuda.empty_cache()
28
29 449.8 MiB 0.0 MiB 1 print('End run_test()')
******* Iteration num: 2 ***********
Start "with torch_no_grad()"
End "with torch_no_grad()"
End run_test()
Filename: test.py
Line # Mem usage Increment Occurrences Line Contents
=============================================================
8 449.8 MiB 449.8 MiB 1 @profile
9 def run_test():
10
11 449.8 MiB 0.0 MiB 1 print('Start "with torch_no_grad()"')
12
13 1594.5 MiB 0.0 MiB 2 with torch.no_grad():
14 449.8 MiB 0.0 MiB 1 batch_size = 300
15 449.8 MiB 0.0 MiB 1 tensor_size = (1000, 1000)
16 # Create a batch tensor in one line
17 2738.7 MiB 2288.9 MiB 303 batch_tensors = torch.stack([torch.randn(tensor_size) for _ in range(batch_size)]).to('cpu')
18
19 1594.5 MiB -1144.2 MiB 1 batch_tensors = batch_tensors.to('cuda')
20 2738.7 MiB 1144.2 MiB 1 batch_tensors = batch_tensors.to('cpu').detach()
21 # Print the size of the batch tensor
22 1594.5 MiB -1144.2 MiB 1 del batch_tensors
23
24 1594.5 MiB 0.0 MiB 1 print('End "with torch_no_grad()"')
25
26 1594.5 MiB 0.0 MiB 1 gc.collect()
27 1594.5 MiB 0.0 MiB 1 torch.cuda.empty_cache()
28
29 1594.5 MiB 0.0 MiB 1 print('End run_test()')
```
> I'm not sure how to interpret one line showing 452.1 MB and the next showing 333.8 MB when the increment is 0?

The line with 452.1 MB likely represents the memory allocated at the conclusion of the `with torch.no_grad()` block.
> Also, on CPU, libc malloc is used for memory allocation. Depending on which memory measure you're looking at, there are quite a few cases where the allocator will keep memory around and not give it back to the OS. This would explain the first few iterations allocating more until malloc starts to properly re-use memory it has cached.

Is there a way to prevent this behavior, especially considering that in the second iteration, memory usage increases excessively?
Would you recommend exploring alternative memory allocator implementations as a potential solution?
> Is there a way to prevent this behavior, especially considering that in the second iteration, memory usage increases excessively?

You can try another malloc implementation like jemalloc, but it will most likely show similar behavior.
In particular, as long as there is no memory pressure, it is usually faster for the allocator to keep memory around, since cached memory can be handed back out faster than memory freshly requested from the OS.
So unless you actually see OOMs, it might just be keeping memory around to speed things up.
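For reference, the glibc tunables linked earlier can also be set programmatically. A hedged sketch using `mallopt` through `ctypes` (glibc/Linux only; it must run early, before the large allocations happen, and the 128 KiB thresholds are illustrative values, not recommendations):

```python
import ctypes

# Constants from glibc's malloc.h.
M_TRIM_THRESHOLD = -1  # free space at the top of the heap above this goes back to the OS
M_MMAP_THRESHOLD = -3  # requests larger than this are served by mmap, unmapped on free()

# Lower both thresholds so large freed blocks are returned to the OS more
# eagerly, trading allocation speed for a smaller resident footprint.
libc = ctypes.CDLL("libc.so.6")
libc.mallopt(M_TRIM_THRESHOLD, 128 * 1024)
libc.mallopt(M_MMAP_THRESHOLD, 128 * 1024)
```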
> > Is there a way to prevent this behavior, especially considering that in the second iteration, memory usage increases excessively?
>
> You can try another malloc implementation like jemalloc, but it will most likely show similar behavior. In particular, as long as there is no memory pressure, it is usually faster for the allocator to keep memory around, since cached memory can be handed back out faster than memory freshly requested from the OS.
> So unless you actually see OOMs, it might just be keeping memory around to speed things up.

Thanks for the answer. I'll experiment with different malloc implementations to see if the behavior persists.
My main concern is that this issue also occurs when loading and transferring models from CPU to GPU. I'm encountering out-of-memory errors. It seems strange that the model loads successfully the first few times, but then requires significantly more memory on subsequent attempts.
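If the OOMs happen while staging checkpoints in host RAM before moving them to the GPU, one thing worth trying is mapping the checkpoint straight onto the device, so the large CPU-side copy never lands in malloc's cache. A sketch with hypothetical names (`checkpoint.pt` and `resnet50` are placeholders for the actual model and file):

```python
import gc

import torch
from torchvision.models import resnet50  # placeholder model

model = resnet50().to("cuda")

# map_location="cuda" deserializes the tensors directly onto the GPU,
# skipping the intermediate CPU copy that would otherwise stay cached
# in the host allocator.
state = torch.load("checkpoint.pt", map_location="cuda")
model.load_state_dict(state)

del state
gc.collect()               # drop the temporary Python-side references
torch.cuda.empty_cache()   # return unused blocks to the CUDA driver
```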