
Comments (11)

HamedHemati avatar HamedHemati commented on July 22, 2024

I didn't do any modification on the settings.
You can access it through
https://colab.research.google.com/drive/1BBB6J0JBj1_6nxOd2OxqARe3v226nI69?usp=sharing

Thanks for sharing the Colab notebook. I also tried it and got the same error. The interesting part is that the max GPU usage (including the GPU cache) was only around 1.2 GB just before the training process crashed:

[screenshot: Colab GPU usage before the crash]

I can't remember which GPU I got the last time I ran the code. This issue could be caused by the GPUtil package and its compatibility with Tesla-series GPUs, but I'm still not sure; we haven't seen this problem on local GPUs so far. For now, you can manually increase the max GPU allocation threshold and run your experiments until we find the cause. In most cases the true GPU allocation shouldn't come close to the competition threshold, since the threshold is fairly lenient. With batch size 64 there is no need to worry, unless you want to implement a strategy that requires frequent parameter replications or very large batch sizes.
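
Raising the threshold amounts to changing a single limit value in the memory-checker plugin. A minimal sketch of such a check, assuming a plugin-style helper (the names `check_gpu_allocation` and `max_allowed_mb` are illustrative, not the actual competition_plugins API):

```python
# Hypothetical sketch of a GPU-allocation check with an adjustable threshold.
# Names are illustrative; the real competition_plugins code may differ.

class GPUMemoryError(Exception):
    """Raised when the measured GPU allocation exceeds the configured limit."""

def check_gpu_allocation(used_mb: float, max_allowed_mb: float = 1000.0) -> float:
    """Compare a measured GPU allocation (in MB) against the limit.

    Returns the remaining headroom in MB, or raises GPUMemoryError.
    """
    if used_mb > max_allowed_mb:
        raise GPUMemoryError(
            f"MAX GPU MEMORY ALLOCATED: {used_mb:.0f} MB exceeds "
            f"the {max_allowed_mb:.0f} MB limit"
        )
    return max_allowed_mb - used_mb

# Increasing the threshold for local experiments lets the 1.2 GB run pass:
headroom = check_gpu_allocation(1200.0, max_allowed_mb=4000.0)
```
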

from clvision-challenge-2023.

HamedHemati avatar HamedHemati commented on July 22, 2024

Hi @Cklwanfifa

Did you increase the batch size or change any other settings?
I tried the same code on Colab before and it worked fine. At which epoch did this happen, and which GPU was assigned to you on Colab?


Cklwanfifa avatar Cklwanfifa commented on July 22, 2024

I didn't do any modification on the settings.
You can access it through
https://colab.research.google.com/drive/1BBB6J0JBj1_6nxOd2OxqARe3v226nI69?usp=sharing


Cklwanfifa avatar Cklwanfifa commented on July 22, 2024

I found a local A100 GPU and ran the example code. At the first checkpoint it showed that

MAX GPU MEMORY ALLOCATED: 119 MB
MAX RAM ALLOCATED: 5874 MB

I assume that:

  1. The RAMchecker (not the GPUMemoryChecker) reported incorrect RAM data.
  2. The error information provided by competition_plugins might be wrong.
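
One way to sanity-check which checker produced a given figure is to parse the report line itself and compare each value against its own limit. A small sketch under that assumption (the regex and function name are mine, not from the repo):

```python
import re

def parse_checker_report(line: str) -> dict:
    """Extract the MB figures from a checker report line.

    Handles lines of the form
    'MAX GPU MEMORY ALLOCATED: 119 MB MAX RAM ALLOCATED: 5874 MB'.
    """
    pattern = r"MAX (GPU MEMORY|RAM) ALLOCATED:\s*(\d+)\s*MB"
    return {key: int(value) for key, value in re.findall(pattern, line)}

report = parse_checker_report(
    "MAX GPU MEMORY ALLOCATED: 119 MB MAX RAM ALLOCATED: 5874 MB"
)
# report == {"GPU MEMORY": 119, "RAM": 5874}
```

Here the 119 MB GPU figure is well under the limit, which supports the reading that the RAM value (5874 MB) is the one tripping the check.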


HamedHemati avatar HamedHemati commented on July 22, 2024

I found a local A100 GPU and ran the example code. At the first checkpoint it showed that

MAX GPU MEMORY ALLOCATED: 119 MB
MAX RAM ALLOCATED: 5874 MB

I assume that:

  1. The RAMchecker (not the GPUMemoryChecker) reported incorrect RAM data.
  2. The error information provided by competition_plugins might be wrong.

Thanks for the update. For me, RAM usage is between 2 and 2.5 GB for the first few epochs on both macOS and Linux. As with the GPU limit, you can manually change the RAM limit for your experiments. We will try to find a solution for the hardware-usage inconsistency.


ShiWuxuan avatar ShiWuxuan commented on July 22, 2024

@HamedHemati Hi, I also have a question about the GPUMemoryChecker. The memory usage reported by the GPUMemoryChecker does not match what I see with the gpustat command, and I want to know which one is authoritative.
The first screenshot is the output of GPUMemoryChecker, and the second is the GPU usage reported by gpustat.
[screenshot: GPUMemoryChecker output]
[screenshot: gpustat output]


HamedHemati avatar HamedHemati commented on July 22, 2024

@ShiWuxuan Thanks for sharing the usage report. It seems the only consistent way to get the actual GPU memory usage is through nvidia-smi. RAM usage is also not consistent across different operating systems.

One solution is to query nvidia-smi directly, but that would also cause issues if someone uses a shared GPU. Therefore, we will most probably remove the RAM and GPU usage plugins and ask participants to check GPU memory usage manually (against the current limits). We will only keep the time-checker plugin, to enforce an approximate training-time limit for the strategies.
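
For reference, nvidia-smi can report per-GPU used memory in a machine-readable form via its standard `--query-gpu` and `--format` options. A sketch of querying and parsing it (the helper name is mine; the optional-input parameter is there so the parsing can be exercised without a GPU):

```python
import subprocess
from typing import Optional

def gpu_memory_used_mb(smi_output: Optional[str] = None) -> list:
    """Return used GPU memory in MB for each visible GPU.

    If smi_output is None, nvidia-smi is invoked; otherwise the given
    text is parsed directly (useful for testing on machines without a GPU).
    """
    if smi_output is None:
        smi_output = subprocess.check_output(
            ["nvidia-smi",
             "--query-gpu=memory.used",
             "--format=csv,noheader,nounits"],
            text=True,
        )
    # One integer per line, one line per GPU.
    return [int(line) for line in smi_output.split() if line.strip()]

# Parsing a hypothetical two-GPU output:
print(gpu_memory_used_mb("1335\n420\n"))  # [1335, 420]
```

On a shared GPU this reports device-wide usage including other users' processes, which is exactly why it is problematic as an automated limit check.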


ShiWuxuan avatar ShiWuxuan commented on July 22, 2024

Thank you for the detailed reply. The GPU memory usage reported by nvidia-smi is shown in the screenshot below. This is the result without changing the code, i.e., the Naive strategy + EWCPlugin + LwFPlugin with batch_size=64. Even with such a simple strategy, the GPU memory used already exceeds the given limit of 1000 MB. Perhaps the previous limit was set based on the output of the GPUMemoryChecker. Will the GPU memory limit be relaxed? This is important for method design.
[screenshot: nvidia-smi memory usage]


HamedHemati avatar HamedHemati commented on July 22, 2024

That's correct, sorry for the confusion. I mixed up the current RAM usage limit with the GPU memory limit. The new GPU usage limit will be 4000 MB, and the RAM usage limit will be removed. I hope this resolves the hardware-restriction issues across different platforms.


ShiWuxuan avatar ShiWuxuan commented on July 22, 2024

I get it, thanks for the explanation.


HamedHemati avatar HamedHemati commented on July 22, 2024

GPU and RAM usage plugins are removed in the latest version of the code. I'm closing this issue.

