
Comments (19)

darglein commented on July 20, 2024

So every epoch increases memory consumption by 1GB? How many epochs are you able to run? What is your starting memory consumption?

Can you please also paste the train config .ini file here?

ShinyLuxray commented on July 20, 2024

After Eval 0:  5742 MB, Virt 24.3 GB
After Train 1: 7400 MB, Virt 25.6 GB
After Train 2: 9070 MB, Virt 27.4 GB
After Train 3: 10.8 GB, Virt 29.6 GB
After Train 4: 12.0 GB, Virt 31.0 GB
After Train 5: 12.9 GB, Virt 29.4 GB
After Eval 5:  10.0 GB, Virt 31.2 GB
After Train 6: 9.6 GB, Virt 36.5 GB
After Train 7: 9.7 GB, Virt 38.4 GB
...and so on.

My total physical RAM usage doesn't exceed 90% or so, since the rest gets moved to swap, so in any case the extra stored memory appears to be going unused. Beyond a certain point, only Virt increases in htop.

With about 300GB of swap, I can get to around epoch 165 before it stops.

Train_config.ini file:
[TrainParams]
random_seed = 3746934646
do_train = true
do_eval = true
batch_size = 2
inner_batch_size = 8
inner_sample_size = 3
num_epochs = 195
save_checkpoints_its = 5
eval_only_on_checkpoint = true
name = test1
debug = false
output_file_type = .jpg
split_method =
max_images = -1
duplicate_train_factor = 1
shuffle_initial_indices = false
shuffle_train_indices = true
split_remove_neighbors = 0
split_index_file_train =
split_index_file_test =
train_on_eval = true
train_factor = 0.0000000000
num_workers_train = 4
num_workers_eval = 2
train_crop_size = 256
train_mask_border = 16
reduced_check_point = false
write_images_at_checkpoint = false
keep_all_scenes_in_memory = false
use_image_masks = false
write_test_images = true
texture_random_init = true
texture_color_init = false
train_use_crop = true
experiment_dir = experiments/
scene_base_dir = scenes/
scene_names = test1
checkpoint_directory = default_checkpoint/
loss_vgg = 1.0000000000
loss_l1 = 1.0000000000
loss_mse = 0.0000000000
min_zoom = 0.7500000000
max_zoom = 1.5000000000
crop_prefere_border = true
optimize_eval_camera = false
interpolate_eval_settings = false
noise_pose_r = 0.0000000000
noise_pose_t = 0.0000000000
noise_intr_k = 0.0000000000
noise_intr_d = 0.0000000000
noise_point = 0.0000000000
point_duplicate_factor = 1
lr_decay_factor = 0.7500000000
lr_decay_patience = 10
lock_camera_params_epochs = 50
lock_structure_params_epochs = 50

[RenderParams]
render_outliers = false
check_normal = true
ghost_gradients = true
drop_out_points_by_radius = true
outlier_count = 1000000
drop_out_radius_threshold = 0.6000000238
dropout = 0.2500000000
depth_accept = 0.0099999998
test_backward_mode = 0
distortion_gradient_factor = 0.0099999998
K_gradient_factor = 0.5000000000

[PipelineParams]
train = true
verbose_eval = false
log_render = false
log_texture = true
skip_neural_render_network = false
enable_environment_map = true
env_map_w = 1024
env_map_h = 512
env_map_channels = 4
num_texture_channels = 4
cat_env_to_color = false
cat_masks_to_color = false

[OptimizerParams]
texture_optimizer = adam
fix_render_network = false
fix_texture = false
fix_environment_map = false
fix_points = true
fix_poses = false
fix_intrinsics = true
fix_vignette = false
fix_response = false
fix_wb = true
fix_exposure = true
fix_motion_blur = true
fix_rolling_shutter = true
lr_render_network = 0.0004000000
lr_texture = 0.0100000000
lr_background_color = 0.0002000000
lr_environment_map = 0.0020000000
lr_points = 0.0001000000
lr_poses = 0.0050000000
lr_intrinsics = 0.0100000000
response_smoothness = 1.0000000000
lr_vignette = 0.0000050000
lr_response = 0.0010000000
lr_wb = 0.0010000000
lr_exposure = 0.0005000000
lr_motion_blur = 0.0050000000
lr_rolling_shutter = 0.0000020000

[NeuralCameraParams]
enable_vignette = true
enable_exposure = true
enable_response = true
enable_white_balance = false
enable_motion_blur = false
enable_rolling_shutter = false
response_params = 25
response_gamma = 0.4545454681
response_leak_factor = 0.0099999998

[MultiScaleUnet2dParams]
num_input_layers = 4
num_input_channels = 4
num_output_channels = 3
feature_factor = 4
num_layers = 4
add_input_to_filters = false
channels_last = false
half_float = false
upsample_mode = bilinear
norm_layer_down = id
norm_layer_up = id
last_act =
conv_block = gated
conv_block_up = gated
pooling = average

darglein commented on July 20, 2024

I tested your train config file on our dataset and there was no issue in RAM usage. What system are you running it on?

Also, does the viewer work on the pretrained models without a memory issue?

ShinyLuxray commented on July 20, 2024

Ubuntu Linux 20.04 with the MATE DE. Using the boat dataset takes up a lot of memory even after one run, compared to my own smaller one. I know how to tone down the VRAM usage, but for some reason RAM and swap go up like crazy. It's not really a leak in the sense that the memory is properly freed when the program terminates, but it's a strange occurrence.

Screenshot at 2021-11-24 10-08-11

darglein commented on July 20, 2024

So does the RAM increase during Eval as well, or does it only start increasing during training?

Also please post the cmake output of adop so I can see if there are any weird dependencies.


cd ADOP/build
conda activate adop
cmake ..

ShinyLuxray commented on July 20, 2024

I wasn't able to get any usable output from that.
I will try rebuilding ADOP from another folder. I'll get back to you within the next couple of days on whether the issue is still happening, along with the cmake output if it is.

ShinyLuxray commented on July 20, 2024

Alright, after building it again, it still exhibits the same issues. RAM/swap usage increases during both Train and Eval for me.
cmake-output.zip contains a handful of cmake files plus the terminal output after running the last of the three suggested commands.

My conda environment is at /home/rentler/Games/anaconda3/envs/adop, and the latest install folder is /mnt/Ext-drive/Program/ADOP2/ADOP/
A small note, just in case it's related at all: the other abnormal thing is that I need to run
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/rentler/Games/anaconda3/envs/adop/lib for adop_train to work.

See if you can find anything useful. If there are other files or folders you'd like to have a look at please let me know.

darglein commented on July 20, 2024

I couldn't find anything weird in the cmake output. All we can do is try to track down the bug. Could you please try the following:

  1. In the train config, set do_train = false
  2. Add a continue; before this line (see the sketch after this list):
    int scene_id_of_batch = batch.front()->scene_id;
    This way we can test whether the memory leak comes from the data loader or the rendering pipeline.
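A minimal sketch of what step 2 looks like inside the per-batch loop; the loop structure and variable names here are illustrative and may differ from the actual ADOP source:

// Early-exit at the top of the loop body, so that only the data loader
// does any work (illustrative names, not the exact ADOP code):
for (auto& batch : eval_batches)
{
    continue;  // skip everything below for this test

    // Unreachable while the continue is in place:
    int scene_id_of_batch = batch.front()->scene_id;
    // ... render the batch, compute losses, write images, etc. ...
}

// If memory still grows with this change, the leak is in the data loading path;
// if it stops growing, it is in the rendering/training code below the continue.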

ShinyLuxray commented on July 20, 2024

I made the changes, rebuilt adop_train, and ran my config; Mem/swap usage increased by about 1-2 GB at each Eval checkpoint. After 40 Eval runs (200 epochs on my config), I estimate total combined swap and Mem usage to be about 40-50 GB.

darglein commented on July 20, 2024

Hello, please pull the latest commit (+ submodule update), run adop_train again with the following config params set, and send me the console output.

debug = true
do_train = false
num_epochs = 20
save_checkpoint_its = 5

After that please also send me the output of

ldd build/bin/adop_train

ShinyLuxray commented on July 20, 2024

Sorry for the delay; I'll update the source, build it again later today, and then write back with my results.

ShinyLuxray commented on July 20, 2024

debug_output.txt
ldd_output.txt

Here are the outputs.

darglein commented on July 20, 2024

Thanks for the update. I tried recreating the problem but wasn't able to reproduce the memory leak :/

What filesystem are you using?

What you could also try is converting the images into .png. Maybe the jpg loader is somehow broken. Besides that I really have no clue what's causing this issue.

ShinyLuxray commented on July 20, 2024

Filesystem is BTRFS. I can try moving it to an ext4 drive and try again within the next couple days...

darglein commented on July 20, 2024

Hey, please check if this issue persists after the latest commit (+ submodule update).

ShinyLuxray commented on July 20, 2024

Promising news! The Eval stage seems to work fine now (without needing to change the installation directory; I've been rather busy).
shortened_output.txt

I'll undo the continue and the do_train = false change and see whether the train stage works as well.

ShinyLuxray commented on July 20, 2024

Virtual memory has stayed at a consistent roughly 20 GB (which is normal) over 40 training epochs, and RAM usage has not gone above 4450 MB. I believe the issue may be fixed? Close if you agree.

darglein commented on July 20, 2024

Alright, awesome!

witsomatic commented on July 20, 2024

Can I train using a 3060 Ti 8 GB GPU? Will it be impossible, or can it be done with some performance hit using system RAM?
