Comments (6)
I've just managed to reproduce the prediction step.
I had to move every tensor to the GPU, because they were defaulting to the CPU. I don't know how it worked in the original research code...
However, the memory-usage problem comes from the fact that when the predict method is called, every predicted tensor is kept in memory, and each of them is very heavy.
To solve this, I modified the inference procedure to loop over small batches of data and decode each batch immediately (the decoded version is much smaller).
The problem is made worse by the fact that all of the data (including the train set, 3x the size of the eval set) is loaded, even when you only need the eval set.
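The batching fix described above can be sketched roughly like this. This is a minimal sketch, not the repo's actual code; `model`, `tokenizer`, and `eval_loader` are placeholder names:

```python
import torch

def predict_in_batches(model, tokenizer, eval_loader, device="cuda"):
    # Move the model to the GPU and switch to inference mode.
    model.to(device).eval()
    decoded = []
    with torch.no_grad():
        for batch in eval_loader:
            # Move every input tensor to the device (they default to CPU).
            batch = {k: v.to(device) for k, v in batch.items()
                     if torch.is_tensor(v)}
            out = model.generate(**batch)
            # Decode immediately and keep only the (small) strings,
            # instead of accumulating the heavy prediction tensors.
            decoded.extend(
                tokenizer.batch_decode(out, skip_special_tokens=True)
            )
            del out  # release the batch's tensors before the next step
    return decoded
```

Decoding inside the loop is the key point: only the strings are retained across iterations, so peak memory is bounded by a single batch.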
from mm-cot.
I have encountered the same problem; even 125.50 GB of RAM is not enough. I would like to know which data you are storing on the GPU. Could you please provide more detailed modifications? Thank you very much.
You can find them here and in the rest of the repo: https://github.com/gianfrancodemarco/mm-cot/blob/main/src/data/scienceQA/dataset_std.py
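On the split-loading point from the first comment, a minimal sketch of loading only the split you need, instead of materializing train + val + test together. The file names (`problems.json`, `pid_splits.json`) and key layout are assumptions based on the usual ScienceQA data format, not necessarily what this repo does:

```python
import json

def load_split(problems_path, pid_splits_path, split="test"):
    """Load only the problems belonging to one split."""
    with open(problems_path) as f:
        problems = json.load(f)
    with open(pid_splits_path) as f:
        pid_splits = json.load(f)
    # Keep only the question ids that belong to the requested split,
    # so the (much larger) train set never stays in memory.
    ids = pid_splits[split]
    return {qid: problems[qid] for qid in ids}
```

Since the train set is roughly 3x the size of the eval set, filtering before building the dataset object cuts the resident data to about a quarter.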
I don't know why it doesn't work for me. I replaced the ScienceQADatasetStd and ScienceQADatasetImg classes entirely with the ones you provided, but the same problem occurred.
I am studying the fork you provided. Could you provide the run configuration for the ScienceQA dataset used by https://github.com/gianfrancodemarco/mm-cot/blob/main/experiments/run_experiments.py?
Looking forward to your reply. Thank you very much.
@zhongfansun I don't think you need to use run_experiments.py. You'll find the relevant configurations here: https://github.com/gianfrancodemarco/mm-cot/blob/main/.vscode/launch.json
Related Issues (20)
- How are the vision features generated here? How to view detr.npy and clip.npy images HOT 1
- typo in utils.prompt line 104 and 106 HOT 1
- Implementation Mm-cot HOT 1
- Question: PC requirements
- How to train
- Question about two stages training? HOT 1
- I can't find main_central.py. HOT 1
- ImportError: cannot import name 'Conv2dSame' from 'timm.models.layers' (unknown location) HOT 5
- [17:28:39] [Model]: Loading declare-lab/flan-alpaca-large... HOT 3
- Where is Gold Rationale from? HOT 1
- "blip2_vicuna_instruct" can't find lead to nonetype HOT 1
- Request for Release of Multimodal-CoT Large 738M Model HOT 3
- While running `extract_caption.py`, it raises a lot of garbled text. So will you put the models in `https://huggingface.co/Salesforce/instructblip-vicuna-7b/tree/main` the `llm` folder? HOT 1
- ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`image_ids` in this case) have excessive nesting (inputs type `list` where type `int` is expected). HOT 1
- OverflowError: out of range integral type conversion attempted HOT 3
- Where is the main_central.py
- Can not train on GPU.
- Question on fine-tuning time HOT 1
- How to use the mm-cot frame as a utility library through local LLM? HOT 1
- OverflowError: can't convert negative int to unsigned HOT 1