
pytorch_memlab


A simple and accurate CUDA memory management laboratory for PyTorch. It consists of several parts covering different aspects of GPU memory:

  • Features:

    • Memory Profiler: A line_profiler style CUDA memory profiler with simple API.
    • Memory Reporter: A reporter to inspect tensors occupying the CUDA memory.
    • Courtesy: An interesting feature that temporarily moves all CUDA tensors into CPU memory as a courtesy to other users, and of course transfers them back afterwards.
    • IPython support through %mlrun/%%mlrun line/cell magic commands.

Installation

  • Released version:
pip install pytorch_memlab
  • Newest version:
pip install git+https://github.com/stonesjtu/pytorch_memlab

What's for

Out-of-memory errors in PyTorch happen frequently, to newcomers and experienced programmers alike. A common reason is that most people don't really learn the underlying memory management philosophy of PyTorch and GPUs, so they write memory-inefficient code and then complain about PyTorch eating too much CUDA memory.

In this repo, I'm going to share some useful tools to help debug OOM errors, or to inspect the underlying mechanism for anyone who is interested.

User-Doc

Memory Profiler

The memory profiler is a modification of Python's line_profiler: it reports the memory usage of each line of code in the specified function/method.

Sample:

import torch
from pytorch_memlab import LineProfiler

def inner():
    torch.nn.Linear(100, 100).cuda()

def outer():
    linear = torch.nn.Linear(100, 100).cuda()
    linear2 = torch.nn.Linear(100, 100).cuda()
    inner()

with LineProfiler(outer, inner) as prof:
    outer()
prof.display()

After the script finishes, or is interrupted by the keyboard, it gives the following profiling info (rendered as a rich table in a Jupyter notebook, or as plain text in a terminal):

## outer

active_bytes reserved_bytes line  code
         all            all
        peak           peak
       0.00B          0.00B    7  def outer():
      40.00K          2.00M    8      linear = torch.nn.Linear(100, 100).cuda()
      80.00K          2.00M    9      linear2 = torch.nn.Linear(100, 100).cuda()
     120.00K          2.00M   10      inner()


## inner

active_bytes reserved_bytes line  code
         all            all
        peak           peak
      80.00K          2.00M    4  def inner():
     120.00K          2.00M    5      torch.nn.Linear(100, 100).cuda()

An explanation of what each column means can be found in the PyTorch documentation. The name of any field from memory_stats() can be passed to display() to view the corresponding statistic.
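
For example, reusing the outer/inner sample above (a minimal sketch: the text above says display() accepts memory_stats() field names, but the exact calling convention may differ between versions):

with LineProfiler(outer, inner) as prof:
    outer()
# 'allocated_bytes.all.peak' is a standard torch.cuda.memory_stats() key
prof.display('allocated_bytes.all.peak')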

If you use the profile decorator, the memory statistics are collected during multiple runs and only the maximum is displayed at the end. We also provide a more flexible API called profile_every, which prints the memory info every N executions of the function. You can simply replace @profile with @profile_every(1) to print the memory usage on every execution.

The @profile and @profile_every can also be mixed to gain more control of the debugging granularity.
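
For instance (a minimal sketch; train_step and debug_step are hypothetical functions, and profile_every is assumed to be importable from the top-level package alongside profile):

import torch
from pytorch_memlab import profile, profile_every

@profile                  # statistics aggregated over runs, maximum shown at the end
def train_step():
    return torch.nn.Linear(100, 100).cuda()

@profile_every(1)         # memory usage printed after every single call
def debug_step():
    return torch.nn.Linear(100, 100).cuda()

for _ in range(3):
    train_step()
    debug_step()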

  • You can also add the decorator to a method of a module class:
import torch
from pytorch_memlab import profile

class Net(torch.nn.Module):
    def __init__(self):
        super().__init__()

    @profile
    def forward(self, inp):
        # do something
        ...
  • The Line Profiler profiles the memory usage of CUDA device 0 by default; you can switch the device to profile with set_target_gpu. The GPU selection is global, which means you have to remember which GPU you are profiling on throughout the whole process:
import torch
from pytorch_memlab import profile, set_target_gpu
@profile
def func():
    net1 = torch.nn.Linear(1024, 1024).cuda(0)
    set_target_gpu(1)
    net2 = torch.nn.Linear(1024, 1024).cuda(1)
    set_target_gpu(0)
    net3 = torch.nn.Linear(1024, 1024).cuda(0)

func()

More samples can be found in test/test_line_profiler.py

IPython support

Make sure you have IPython installed, or have installed pytorch-memlab with pip install pytorch-memlab[ipython].

First, load the extension:

%load_ext pytorch_memlab

This makes the %mlrun and %%mlrun line/cell magics available for use. For example, run the following in a new cell to profile an entire cell:

%%mlrun -f func
import torch
from pytorch_memlab import profile, set_target_gpu
def func():
    net1 = torch.nn.Linear(1024, 1024).cuda(0)
    set_target_gpu(1)
    net2 = torch.nn.Linear(1024, 1024).cuda(1)
    set_target_gpu(0)
    net3 = torch.nn.Linear(1024, 1024).cuda(0)

Or you can invoke the profiler for a single statement via the %mlrun line magic.

import torch
from pytorch_memlab import profile, set_target_gpu
def func(input_size):
    net1 = torch.nn.Linear(input_size, 1024).cuda(0)
%mlrun -f func func(2048)

See %mlrun? for help on what arguments are supported. You can set the GPU device to profile, dump profiling results to a file, and return the LineProfiler object for post-profile inspection.

Find out more by checking out the demo Jupyter notebook.

Memory Reporter

The Memory Profiler only gives overall memory usage information by line; more fine-grained memory usage information can be obtained with the Memory Reporter.

The Memory Reporter iterates over all Tensor objects and inspects the underlying Storage objects to report the actual memory usage, rather than the surface-level Tensor.size.
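
Conceptually, the accounting looks roughly like the sketch below (illustrative only, not the library's actual code; Tensor.untyped_storage() requires a reasonably recent PyTorch):

import gc
import torch

def cuda_storage_bytes():
    """Count each underlying CUDA storage once, so views and slices that
    share memory are not double-counted the way summing tensor sizes would."""
    seen, total = set(), 0
    for obj in gc.get_objects():
        if isinstance(obj, torch.Tensor) and obj.is_cuda:
            storage = obj.untyped_storage()
            if storage.data_ptr() not in seen:
                seen.add(storage.data_ptr())
                total += storage.nbytes()
    return total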

Sample

  • A minimal one:
import torch
from pytorch_memlab import MemReporter
linear = torch.nn.Linear(1024, 1024).cuda()
reporter = MemReporter()
reporter.report()

outputs:

Element type                                            Size  Used MEM
-------------------------------------------------------------------------------
Storage on cuda:0
Parameter0                                      (1024, 1024)     4.00M
Parameter1                                           (1024,)     4.00K
-------------------------------------------------------------------------------
Total Tensors: 1049600  Used Memory: 4.00M
The allocated memory on cuda:0: 4.00M
-------------------------------------------------------------------------------
  • You can also pass in a model object for automatic name inference.
import torch
from pytorch_memlab import MemReporter

linear = torch.nn.Linear(1024, 1024).cuda()
inp = torch.Tensor(512, 1024).cuda()
# pass in a model to automatically infer the tensor names
reporter = MemReporter(linear)
out = linear(inp).mean()
print('========= before backward =========')
reporter.report()
out.backward()
print('========= after backward =========')
reporter.report()

outputs:

========= before backward =========
Element type                                            Size  Used MEM
-------------------------------------------------------------------------------
Storage on cuda:0
weight                                          (1024, 1024)     4.00M
bias                                                 (1024,)     4.00K
Tensor0                                          (512, 1024)     2.00M
Tensor1                                                 (1,)   512.00B
-------------------------------------------------------------------------------
Total Tensors: 1573889  Used Memory: 6.00M
The allocated memory on cuda:0: 6.00M
-------------------------------------------------------------------------------
========= after backward =========
Element type                                            Size  Used MEM
-------------------------------------------------------------------------------
Storage on cuda:0
weight                                          (1024, 1024)     4.00M
weight.grad                                     (1024, 1024)     4.00M
bias                                                 (1024,)     4.00K
bias.grad                                            (1024,)     4.00K
Tensor0                                          (512, 1024)     2.00M
Tensor1                                                 (1,)   512.00B
-------------------------------------------------------------------------------
Total Tensors: 2623489  Used Memory: 10.01M
The allocated memory on cuda:0: 10.01M
-------------------------------------------------------------------------------
  • The reporter automatically deals with shared weight parameters:
import torch
from pytorch_memlab import MemReporter

linear = torch.nn.Linear(1024, 1024).cuda()
linear2 = torch.nn.Linear(1024, 1024).cuda()
linear2.weight = linear.weight
container = torch.nn.Sequential(
    linear, linear2
)
inp = torch.Tensor(512, 1024).cuda()
# pass in a model to automatically infer the tensor names

out = container(inp).mean()
out.backward()

# verbose shows how storage is shared across multiple Tensors
reporter = MemReporter(container)
reporter.report(verbose=True)

outputs:

Element type                                            Size  Used MEM
-------------------------------------------------------------------------------
Storage on cuda:0
0.weight                                        (1024, 1024)     4.00M
0.weight.grad                                   (1024, 1024)     4.00M
0.bias                                               (1024,)     4.00K
0.bias.grad                                          (1024,)     4.00K
1.bias                                               (1024,)     4.00K
1.bias.grad                                          (1024,)     4.00K
Tensor0                                          (512, 1024)     2.00M
Tensor1                                                 (1,)   512.00B
-------------------------------------------------------------------------------
Total Tensors: 2625537  Used Memory: 10.02M
The allocated memory on cuda:0: 10.02M
-------------------------------------------------------------------------------
  • You can better understand the memory layout of a more complicated module:
import torch
from pytorch_memlab import MemReporter

lstm = torch.nn.LSTM(1024, 1024).cuda()
reporter = MemReporter(lstm)
reporter.report(verbose=True)
inp = torch.Tensor(10, 10, 1024).cuda()
out, _ = lstm(inp)
out.mean().backward()
reporter.report(verbose=True)

As shown in the output below, (->) indicates re-use of the same storage back-end:

Element type                                            Size  Used MEM
-------------------------------------------------------------------------------
Storage on cuda:0
weight_ih_l0                                    (4096, 1024)    32.03M
weight_hh_l0(->weight_ih_l0)                    (4096, 1024)     0.00B
bias_ih_l0(->weight_ih_l0)                           (4096,)     0.00B
bias_hh_l0(->weight_ih_l0)                           (4096,)     0.00B
Tensor0                                       (10, 10, 1024)   400.00K
-------------------------------------------------------------------------------
Total Tensors: 8499200  Used Memory: 32.42M
The allocated memory on cuda:0: 32.52M
Memory differs due to the matrix alignment
-------------------------------------------------------------------------------
Element type                                            Size  Used MEM
-------------------------------------------------------------------------------
Storage on cuda:0
weight_ih_l0                                    (4096, 1024)    32.03M
weight_ih_l0.grad                               (4096, 1024)    32.03M
weight_hh_l0(->weight_ih_l0)                    (4096, 1024)     0.00B
weight_hh_l0.grad(->weight_ih_l0.grad)          (4096, 1024)     0.00B
bias_ih_l0(->weight_ih_l0)                           (4096,)     0.00B
bias_ih_l0.grad(->weight_ih_l0.grad)                 (4096,)     0.00B
bias_hh_l0(->weight_ih_l0)                           (4096,)     0.00B
bias_hh_l0.grad(->weight_ih_l0.grad)                 (4096,)     0.00B
Tensor0                                       (10, 10, 1024)   400.00K
Tensor1                                       (10, 10, 1024)   400.00K
Tensor2                                        (1, 10, 1024)    40.00K
Tensor3                                        (1, 10, 1024)    40.00K
-------------------------------------------------------------------------------
Total Tensors: 17018880         Used Memory: 64.92M
The allocated memory on cuda:0: 65.11M
Memory differs due to the matrix alignment
-------------------------------------------------------------------------------

NOTICE:

When forwarding with grad_mode=True, PyTorch keeps tensor buffers for the future back-propagation at the C level, so these buffers are not visible to Python and are not collected by the reporter. They are only reported if you also store the intermediate results in Python variables.

  • You can also filter the device to report on by passing extra arguments: report(device=torch.device(0))

  • A failed example due to pytorch's C side tensor buffers

In the following example, a temporary buffer is created at inp * (inp + 2) to store both inp and inp + 2. Unfortunately, Python only knows about the existence of inp, so 2M of memory is lost, which is the same size as Tensor inp.

import torch
from pytorch_memlab import MemReporter

linear = torch.nn.Linear(1024, 1024).cuda()
inp = torch.Tensor(512, 1024).cuda()
# pass in a model to automatically infer the tensor names
reporter = MemReporter(linear)
out = linear(inp * (inp + 2)).mean()
reporter.report()

outputs:

Element type                                            Size  Used MEM
-------------------------------------------------------------------------------
Storage on cuda:0
weight                                          (1024, 1024)     4.00M
bias                                                 (1024,)     4.00K
Tensor0                                          (512, 1024)     2.00M
Tensor1                                                 (1,)   512.00B
-------------------------------------------------------------------------------
Total Tensors: 1573889  Used Memory: 6.00M
The allocated memory on cuda:0: 8.00M
Memory differs due to the matrix alignment or invisible gradient buffer tensors
-------------------------------------------------------------------------------
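
A hedged variant of the same example: binding the intermediate result to a Python variable (tmp is a name introduced here purely for illustration) makes that tensor visible to the garbage collector, so the reporter can account for it:

tmp = inp + 2                      # now a Python-level tensor, visible to gc
out = linear(inp * tmp).mean()
reporter.report()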

Courtesy

Sometimes people would like to preempt your running task, but you don't want to save a checkpoint and then reload it. In fact, all they need are the GPU resources (CPU resources and CPU memory are typically spare in GPU clusters), so you can move your whole workspace from GPU to CPU and then halt your task until a restart signal is triggered, instead of saving and loading checkpoints and bootstrapping from scratch.

Still under development, but you can have fun with:

from pytorch_memlab import Courtesy

iamcourtesy = Courtesy()
for i in range(num_iteration):
    if something_happens:
        iamcourtesy.yield_memory()
        wait_for_restart_signal()
        iamcourtesy.restore()

Known Issues

  • As stated above in the Memory Reporter section, intermediate tensors are not covered properly, so you may want to insert such courtesy logic after backward or before forward (see the sketch after this list).
  • Currently PyTorch's CUDA context requires about 1 GB of CUDA memory, which means that even when all tensors are on the CPU, 1 GB of CUDA memory is wasted. :-( It is still under investigation whether the context can be fully destroyed and then re-initialized.
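
A placement sketch for the first point above (a hypothetical training loop; loader, model, optimizer, preempt_requested and wait_for_restart_signal are placeholders for whatever your setup uses):

from pytorch_memlab import Courtesy

courtesy = Courtesy()
for batch in loader:
    loss = model(batch).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    # yield only at iteration boundaries, right after backward(), so no
    # C-level autograd buffers are alive when tensors are moved to the CPU
    if preempt_requested():
        courtesy.yield_memory()
        wait_for_restart_signal()
        courtesy.restore()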

ACK

I suffered a lot debugging weird memory usage during my 3 years of developing efficient deep learning models, and of course learned a lot from the great open source community.

CHANGES

0.2.4 (2021-10-28)
  • Fix colab error (#35)
  • Support python3.8 (#38)
  • Support sparse tensor (#30)
0.2.3 (2020-12-01)
  • Fix name mapping in MemReporter (#24)
  • Fix reporter without model input (#22 #25)
0.2.2 (2020-10-23)
  • Fix memory leak in MemReporter
0.2.1 (2020-06-18)
  • Fix line_profiler not found
0.2.0 (2020-06-15)
  • Add jupyter notebook figure and ipython support
0.1.0 (2020-04-17)
  • Add ipython magic support (#8)
0.0.4 (2019-10-08)
  • Add gpu switch for line-profiler(#2)
  • Add device filter for reporter
0.0.3 (2019-06-15)
  • Install dependency for pip installation
0.0.2 (2019-06-04)
  • Fix statistics shift in loop
0.0.1 (2019-05-28)
  • initial release


pytorch_memlab's People

Contributors

andyljones, c01o, catwell, hauntsaninja, kngwyu, stas00, stonesjtu, vinayakakv, willprice


pytorch_memlab's Issues

Error when running on Colab CPU instance

When I attempt to import anything from pytorch_memlab on a Google Colab CPU instance, I get the following error:

Could not reset CUDA stats and cache: 'NoneType' object has no attribute 'lower'

is PyTorch Profiler Used Internally?

Great repository! Thanks for creating this :)
Just have a quick question: does pytorch_memlab track memory usage via torch.autograd.profiler? Thanks a lot! Just want to clarify this to see if I need to use both :)

Question: discrepancy between MemReporter 'Used Memory' and 'nvtop'

I'm trying to understand why there is a large discrepancy between the 'Used Memory' (or 'allocated memory on cuda:0') from MemReporter and the memory usage reported by nvtop (or nvidia-smi). For example, while training a model (RetinaNet from detectron2, for context), I'm seeing ~285M from MemReporter and ~15G from nvtop/nvidia-smi.

Is this all due to the autograd graph? I've been trying to read more about this but haven't found good references.

Thanks for your work on this library, and any pointers you can share about this!

weakly-referenced object no longer exists

Hello,

Firstly, congratulations for memlab. I have been trying to use it in Google Colab, but sometimes this error happens:

ReferenceError                            Traceback (most recent call last)
in ()
     33 print('Reporter!!!!!!!')
     34 reporter = MemReporter()
---> 35 reporter.report()

2 frames
/usr/local/lib/python3.7/dist-packages/pytorch_memlab/mem_reporter.py in (.0)
     62     #FIXME: make the grad tensor collected by gc
     63     objects = gc.get_objects()
---> 64     tensors = [obj for obj in objects if isinstance(obj, torch.Tensor)]
     65     for t in tensors:
     66         self.device_mapping[t.device].append(t)

ReferenceError: weakly-referenced object no longer exists

In my code, I use MemReporter() right after the training phase, i.e.:

for epoch in range(num_epochs):
    net.train()
    ...
    # end of training

print('Reporter!!!!!!!')
reporter = MemReporter()
reporter.report()

Do you know what is the problem?

Thank you and regards.

Does not work with torch 1.7.1+

The demo code does not work with this version of PyTorch. No output is printed:

import torch
from pytorch_memlab import LineProfiler

def inner():
    torch.nn.Linear(100, 100).cuda()

def outer():
    linear = torch.nn.Linear(100, 100).cuda()
    linear2 = torch.nn.Linear(100, 100).cuda()
    inner()

with LineProfiler(outer, inner) as prof:
    outer()
    prof.display()

What's the difference between `active_bytes` and `reserved_bytes`?

I need to show that some technique called gradient checkpointing can really save GPU memory usage during backward propagation. When I see the result there are two columns on the left showing active_bytes and reserved_bytes. In my testing, while active bytes read 3.83G, the reserved bytes read 9.35G. So why does PyTorch still reserve that much GPU memory?

ImportError: cannot import name 'set_target_gpu'

I installed pytorch_memlab with pip3 install pytorch_memlab and I get this error when trying to set the target GPU:

    from pytorch_memlab import profile, set_target_gpu
ImportError: cannot import name 'set_target_gpu'

How do I record the maximum memory usage of a script?

I want to measure the max memory used during the execution of a script.

I think doing this naively will count max reserved memory, while I want the max memory actually used.

I do not need any further details.

Will this have an overhead? I want to do this on scripts taking hours to finish.
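
One possible approach with plain PyTorch rather than pytorch_memlab (a sketch: max_memory_allocated() tracks the peak of actually-allocated memory, not reserved memory, and querying it once at the end adds essentially no overhead):

import torch

torch.cuda.reset_peak_memory_stats()      # call once at the start of the script

# ... run the whole script ...

peak = torch.cuda.max_memory_allocated()  # peak allocated bytes on the default device
print(f'peak allocated: {peak / 2**20:.2f} MiB')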

Redirect report() to the file

Hi, thank you for a very useful Python library. I am just checking whether there is a way to redirect report() output directly to a txt file without redirecting stdout or anything similar. Or add a method that returns the report as a string, something like reporter.report() -> str.

Memory differs due to the matrix alignment or invisible gradient buffer tensors

Hi

I was just wondering what this message means in the MemReporter output

Total Tensors: 266979334        Used Memory: 924.71M
The allocated memory on cuda:0: 1.31G
Memory differs due to the matrix alignment or invisible gradient buffer tensors

Also what is the difference between Used Memory and allocated memory?

Many thanks

Not working

For me it does not work at all. I even tried your example code: without the profiler it does nothing, and with the profiler tag it does not compile (see the attached screenshot).

Clearing GPU/CUDA memory

This is a very interesting library and will hopefully help with constant OOM errors. Beyond profiling the memory usage, does memlab offer a way to clear GPU memory (if we do find the GPU holding onto tensors after execution)? Thanks!

Fail to install using pip

I tried to install using pip install or pip install git+https://github.com/stonesjtu/pytorch_memlab and in both cases got this error:

Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-build-sx0xf6jy/pandas/setup.py", line 333
        f"{extension}-source file '{sourcefile}' not found.\n"
                                                             ^
    SyntaxError: invalid syntax
    
    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-sx0xf6jy/pandas/

Naming Tensors within MemReporter

I'm developing a custom network layer and it subsequently has many unnamed Tensors within the MemReporter output. Below is a snippet example:

pcconv5_1.weight_net.layers.4.weight                 (1, 64)   512.00B
pcconv5_1.weight_net.layers.4.bias                      (1,)   512.00B
pcconv5_1.feature_layer.weight                       (1, 32)   512.00B
Tensor0                                   (1, 10, 10, 10, 8)    31.50K
Tensor1                                            (1, 8, 3)   512.00B
Tensor2                                  (1, 10, 10, 10, 12)    47.00K
Tensor3                                           (1, 12, 3)   512.00B
Tensor4                                            (1, 8, 8)   512.00B
Tensor5                                 (1, 8, 1, 1, 1, 8, 5)     1.50K
Tensor6                                 (1, 8, 1, 1, 1, 8, 32)     8.00K
Tensor7                                 (1, 8, 1, 1, 1, 8, 32)     8.00K
Tensor8                                 (1, 8, 1, 1, 1, 8, 64)    16.00K
Tensor9                                 (1, 8, 1, 1, 1, 8, 64)    16.00K
Tensor10(->Tensor4)                     (1, 8, 1, 1, 1, 8, 1)     0.00B

Is there any way to name or label these tensors (Tensors 0 through 10 for example) so that I can more easily determine which operation specifically creates them?

Jupyter Support Issues

Hi, thanks for this awesome package, it has indeed helped me a lot. However, when I try to run some basic code from the Jupyter demo you provide, I get the error [TypeError: can only concatenate tuple (not "list") to tuple] and do not know how to solve it.


My environment settings are:

  1. python==3.8.8
  2. torch==1.8.1+cu102
  3. pytorch-memlab==0.2.3
  4. ipykernel==5.5.3
  5. ipython==7.22.0

Do you have any idea how to solve this issue? Thanks for your attention, I hope to get your reply soon.

turn profile decorator on and off

Hello,

This is probably a naive question. I would like to turn on and off the decorator without commenting things out. What would be an elegant way to achieve this? I have something like the following in mind but I don't know how to achieve this. Would you like to share some thoughts on this? Many thanks!

profile_flag = True # False

@profile(profile_flag)
def func1():
    ...

@profile(profile_flag)
def func2():
    ...
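
One generic Python approach, not a pytorch_memlab feature (a sketch: choose between the real decorator and a no-op pass-through based on a flag):

from pytorch_memlab import profile

PROFILE_MEM = True   # flip to False to disable memory profiling

maybe_profile = profile if PROFILE_MEM else (lambda func: func)

@maybe_profile
def func1():
    ...

@maybe_profile
def func2():
    ...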

Support for gpu2,3,4

pytorch_memlab works excellently for gpu0; however, all tensor memory readings turn to 0 when I use gpu2, 3, or 4.

Thank you for the productive tools for the open source community!

Negative 'diff max' and 'diff peak' values?

What do the negative values mean? For example below is part of what I captured when running some code:

Line # Max usage   Peak usage diff max diff peak  Line Contents
===============================================================
    56                                           @profile
    58                                           def main():
    59     0.00B        0.00B   -4.14G   -4.64G      dtype, multidevice, backward_device = setup_gpu()

OSError: could not get source code when running LineProfiler

Hi, when I try to use the LineProfiler example, the following error message pops up:

import torch
from pytorch_memlab import LineProfiler

def inner():
    torch.nn.Linear(100, 100).cuda()

def outer():
    linear = torch.nn.Linear(100, 100).cuda()
    linear2 = torch.nn.Linear(100, 100).cuda()
    inner()

with LineProfiler(outer, inner) as prof:
    outer()
Traceback (most recent call last):
  File "", line 1, in
  File "/home/wyuancs/miniconda3/envs/response-selection/lib/python3.8/site-packages/pytorch_memlab/line_profiler/line_profiler.py", line 45, in __init__
    self.add_function(func)
  File "/home/wyuancs/miniconda3/envs/response-selection/lib/python3.8/site-packages/pytorch_memlab/line_profiler/line_profiler.py", line 59, in add_function
    first_line = inspect.getsourcelines(func)[1]
  File "/home/wyuancs/miniconda3/envs/response-selection/lib/python3.8/inspect.py", line 979, in getsourcelines
    lines, lnum = findsource(object)
  File "/home/wyuancs/miniconda3/envs/response-selection/lib/python3.8/inspect.py", line 798, in findsource
    raise OSError('could not get source code')
OSError: could not get source code

How can I solve this problem? I tried to search online but did not find any solution.

KeyError: "['XXX'] not in index"

I added the @profile decorator to a custom module method (after from pytorch_memlab import LineProfiler, profile), and the output is a truncated display with the error KeyError: "['XXX'] not in index" for my method XXX. What does this mean?

Where to specify the device?

From the readme:

You can also filter the device to report on by passing extra arguments: report(device=torch.device(0))

The device parameter seems to work neither with the report decorator nor with the MemReporter. If this feature is still available, the readme should be clarified.

Request: solving the lack of incremental reporting in loops / functions

This is a great tool for finding where the memory has gone - thank you!

I have a request:

  1. add support for partial loop unrolling - reporting first iterations separately
  2. same for functions

Problem:

Memory is being reported incorrectly for any loop or function: since those run multiple times, the profiler doesn't show how the peak/active memory counters progressed during, say, the first iteration; it shows the data for the whole loop/function after it has run more than once. That is correct for the final iteration, but not the first one, and it's crucial to see the first moment the memory usage went up and the peak.

This functionality is typically not needed for a normal memory profiler, since all we want to know there is frequency of calls and total usage for the given line, but here if it's an investigation tool we need to see the first few uses. I hope I was able to convey the issue clearly.

I tried to solve this manually by unrolling the loop in the code I was profiling and replicating the iteration code multiple times, which is not very sustainable.

It's also typical that there is a change in memory footprint from iteration 1 to 2 and then things stabilize at iteration 3 and onward (if there is no leak, that is). So there could probably be an option to record and print 3 different stats:

  • iteration 1
  • iteration 2
  • all iterations (like it's done now)

same applies to functions.

I'm thinking perhaps the low-hanging fruit is to give users an option to record a loop iteration or function only the first time it runs and report that. That would already be very useful and perhaps not too difficult to implement.

Thank you!

Jupyter notebook support

Hi,
Thanks for the super useful package. Currently, it seems it is not possible to leverage it in a Jupyter notebook. If I create a new notebook and add a cell with the contents:

import torch
from pytorch_memlab import profile
@profile
def work():
    linear = torch.nn.Linear(100, 100).cuda()
    linear2 = torch.nn.Linear(100, 100).cuda()
    linear3 = torch.nn.Linear(100, 100).cuda()

work()

I get no results printed.

However, when I use MemReporter I do get results:

import torch
from pytorch_memlab import MemReporter

linear = torch.nn.Linear(1024, 1024).cuda()
reporter = MemReporter()
reporter.report()
Element type                                            Size  Used MEM
-------------------------------------------------------------------------------
Storage on cuda:0
Parameter0                                      (1024, 1024)     4.00M
Parameter1                                           (1024,)     4.00K
Parameter2                                      (1024, 1024)     4.00M
Parameter3                                           (1024,)     4.00K
-------------------------------------------------------------------------------
Total Tensors: 2099200 	Used Memory: 8.01M
The allocated memory on cuda:0: 8.01M
-------------------------------------------------------------------------------

Additionally, I wonder if it is possible to add a line magic similar to %lprun for profiling cells.

Question about Used Memory and GPU memory

Hi,

Thanks a lot for providing this very helpful library.
I have a question about Used Memory and GPU memory.
I followed your code to get the Used Memory of my model for one batch (size: (16, 3, 224, 224)). It is 928.02M. But the same code for the same model could not run on a 2070 Super GPU (8 GiB capacity).
928.02 M vs 8 GiB. What is the difference between the Used Memory in your code and GPU memory? Thanks.

