Comments (9)
Thanks for the report. However, I had a quick look and see 2-3 things that may cause memory to blow up:
- In your func `getTensor`, the last operation should be `tensor = tensor.MustUnsqueeze(0, true)`. The `true` flag deletes the existing tensor before assigning the new one; otherwise there is a leak here.
- In the goroutine for loop, you run `net.Forward(tensor)`, which returns a tensor. That tensor should be deleted after being used by `log.Println()`, otherwise it leaks as well.
- When doing `forward` in inference mode, you should wrap it inside `ts.NoGrad()`; otherwise autograd state will build up (not really a memory leak, but hidden tensors).
Please try those things and see how it goes. Thanks.
from gotch.
Thank you for your response. I have adjusted my code according to your suggestions, but the memory usage still keeps increasing. Below is my latest code:
package main
import (
"encoding/json"
"os"
"time"
"github.com/sugarme/gotch"
"github.com/sugarme/gotch/nn"
"github.com/sugarme/gotch/pickle"
"github.com/sugarme/gotch/ts"
"github.com/sugarme/gotch/vision"
)
func getModel() (net nn.FuncT) {
modelName := "resnet18"
url, ok := gotch.ModelUrls[modelName]
if !ok {
panic("Unsupported model name")
}
modelFile, err := gotch.CachedPath(url)
if err != nil {
panic(err)
}
vs := nn.NewVarStore(gotch.CPU)
net = vision.ResNet18NoFinalLayer(vs.Root())
err = pickle.LoadAll(vs, modelFile)
if err != nil {
panic(err)
}
return
}
func getTensor() (tensor *ts.Tensor) {
b, err := os.ReadFile("test.data")
if err != nil {
panic(err)
}
var data []float32
err = json.Unmarshal(b, &data)
if err != nil {
panic(err)
}
tensor = ts.MustOfSlice(data).MustView([]int64{3, 224, 224}, true)
tensor = tensor.MustUnsqueeze(0, true)
return
}
func main() {
net := getModel()
tensor := getTensor()
defer tensor.MustDrop()
var goroutineNum = 10
for i := 0; i < goroutineNum; i++ {
go func(net nn.FuncT) {
for {
ts.NoGrad(func() {
result := net.ForwardT(tensor, false)
result.MustDrop()
})
}
}(net)
}
time.Sleep(5 * time.Minute)
}
from gotch.
When calling the model in multiple goroutines, a lot of warning messages appear, as follows:
2023/10/30 11:54:50 WARNING: Probably double free tensor "Conv2d_000235087". Called from "ts.Drop()". Just skipping...
2023/10/30 11:54:50 WARNING: Probably double free tensor "BatchNorm_000235091". Called from "ts.Drop()". Just skipping...
2023/10/30 11:54:50 WARNING: Probably double free tensor "Relu_000235100". Called from "ts.Drop()". Just skipping...
2023/10/30 11:54:50 WARNING: Probably double free tensor "Relu_000235098". Called from "ts.Drop()". Just skipping...
2023/10/30 11:54:50 WARNING: Probably double free tensor "BatchNorm_000235215". Called from "ts.Drop()". Just skipping...
2023/10/30 11:54:50 WARNING: Probably double free tensor "Relu_000235245". Called from "ts.Drop()". Just skipping...
2023/10/30 11:54:50 WARNING: Probably double free tensor "Relu_000235395". Called from "ts.Drop()". Just skipping...
2023/10/30 11:54:50 WARNING: Probably double free tensor "Conv2d_000235566". Called from "ts.Drop()". Just skipping...
2023/10/30 11:54:50 WARNING: Probably double free tensor "Relu_000235609". Called from "ts.Drop()". Just skipping...
from gotch.
Probably you should create a model for each goroutine then. Actually, I have never tried concurrency on a single model like that. I guess there will be a lot of data collisions, since all goroutines feed into one model.
from gotch.
I created a model for each goroutine and used the corresponding model when calling within that goroutine, but there are still issues.
package main
import (
"encoding/json"
"os"
"time"
"github.com/sugarme/gotch"
"github.com/sugarme/gotch/nn"
"github.com/sugarme/gotch/pickle"
"github.com/sugarme/gotch/ts"
"github.com/sugarme/gotch/vision"
)
func getModel() (net nn.FuncT) {
modelName := "resnet18"
url, ok := gotch.ModelUrls[modelName]
if !ok {
panic("Unsupported model name")
}
modelFile, err := gotch.CachedPath(url)
if err != nil {
panic(err)
}
vs := nn.NewVarStore(gotch.CPU)
net = vision.ResNet18NoFinalLayer(vs.Root())
err = pickle.LoadAll(vs, modelFile)
if err != nil {
panic(err)
}
return
}
func getTensor() (tensor *ts.Tensor) {
b, err := os.ReadFile("test.data")
if err != nil {
panic(err)
}
var data []float32
err = json.Unmarshal(b, &data)
if err != nil {
panic(err)
}
tensor = ts.MustOfSlice(data).MustView([]int64{3, 224, 224}, true)
tensor = tensor.MustUnsqueeze(0, true)
return
}
func main() {
var goroutineNum = 10
var nets []nn.FuncT
for i := 0; i < goroutineNum; i++ {
nets = append(nets, getModel())
}
tensor := getTensor()
defer tensor.MustDrop()
for i := 0; i < goroutineNum; i++ {
net := nets[i]
go func(net nn.FuncT) {
for {
ts.NoGrad(func() {
result := net.ForwardT(tensor, false)
result.MustDrop()
})
}
}(net)
}
time.Sleep(5 * time.Minute)
}
from gotch.
I will try to reproduce your problem when I have time this week. However, your latest `go func()` should not take an input parameter then.
What about something like this:
for i := 0; i < goroutineNum; i++ {
go func() {
net := getModel()
tensor := getTensor()
ts.NoGrad(func() {
result := net.ForwardT(tensor, false)
result.MustDrop()
})
tensor.MustDrop()
}()
}
from gotch.
The memory usage still keeps increasing. The key code is as follows:
for i := 0; i < goroutineNum; i++ {
go func() {
// goroutine model
net := getModel()
// test input tensor
tensor := getTensor()
defer tensor.MustDrop()
// stress test to observe memory increase
for {
ts.NoGrad(func() {
result := net.ForwardT(tensor, false)
// drop result tensor
result.MustDrop()
})
}
}()
}
from gotch.
I understand now: I seem to have found a bug in tensor.go that causes some tensors not to be released.
This is the old code:
atomic.AddInt64(&TensorCount, 1)
nbytes := x.nbytes()
atomic.AddInt64(&AllocatedMem, nbytes)
lock.Lock()
if _, ok := ExistingTensors[name]; ok {
name = fmt.Sprintf("%s_%09d", name, TensorCount)
}
ExistingTensors[name] = struct{}{}
lock.Unlock()
changed to:
tensorCount := atomic.AddInt64(&TensorCount, 1)
nbytes := x.nbytes()
atomic.AddInt64(&AllocatedMem, nbytes)
lock.Lock()
if _, ok := ExistingTensors[name]; ok {
name = fmt.Sprintf("%s_%09d", name, tensorCount)
}
ExistingTensors[name] = struct{}{}
lock.Unlock()
I just realized that you had made a fix for this issue last week, but I wasn't using your latest code. The problem is resolved now; this issue can be closed.
from gotch.
Thanks for reporting.
from gotch.