I am getting nan weights (and losses) during training.

nan weights, nan validation losses about multiplanarunet HOT 3 CLOSED

perslev commented on June 2, 2024

nan weights, nan validation losses

Comments (3)

dyollb commented on June 2, 2024

The full log looks more like this:

Audit for 16 images
-------------------
Total memory GiB:  1.776
Number of classes: 17

2D:
Real space span:   263.500
Sample dim:        304.000

3D:
Sample dim:        64
Real space span:   263.500
Box span:          54.400
--------------------------------------------------------------------------------
>>> Logged by: 'set_value' in 'hparams.py'
Setting value '263.50000739097595' (type <class 'numpy.float64'>) in subdir 'fit' with name 'real_space_span'
Setting value '304' (type <class 'numpy.int64'>) in subdir 'build' with name 'dim'
Setting value '1' (type <class 'int'>) in subdir 'build' with name 'n_channels'
Entry of name 'n_classes' already set in subdir 'build' with value '17'. Skipping (overwrite=False).

>>> Logged by: 'save_current' in 'hparams.py'
Saving current YAML configuration to file:
 /home/jovyan/work/results/drcmr_16/train_hparams.yaml
--------------------------------------------------------------------------------
>>> Logged by: '_base_loader_func' in 'data_preparation_funcs.py'
Preparing dataset ImagePairLoader(id=train, images=10, data_dir=/home/jovyan/work/data_dir/train)
X10679
--- loaded:     False
--- shape:      [310 310 310   1]
--- bg class    0
--- bg value    ['1pct']
--- scaler      RobustScaler
--- real shape: [263.5 263.5 263.5]
--- pixdim:     [0.85 0.85 0.85]
X14109
--- loaded:     False
--- shape:      [310 310 310   1]
--- bg class    0
--- bg value    ['1pct']
--- scaler      RobustScaler
--- real shape: [263.5 263.5 263.5]
--- pixdim:     [0.85 0.85 0.85]
...
>>> Logged by: '__init__' in 'eager_queue.py'
'Eager' queue created:
  Dataset:      ImagePairLoader(id=train, images=10, data_dir=/home/jovyan/work/data_dir/train)
Preloading all 10 images now... (eager)
'Eager' queue created:
  Dataset:      ImagePairLoader(id=val, images=6, data_dir=/home/jovyan/work/data_dir/val)
Preloading all 6 images now... (eager)
--------------------------------------------------------------------------------
>>> Logged by: 'sample_random_views_with_angle_restriction' in 'sample_grid.py'
Generating 6 random views...
[OBS] Weighting random views by median res: [0.85 0.85 0.85]
--------------------------------------------------------------------------------
>>> Logged by: 'load_or_create_views' in 'data_preparation_funcs.py'
View SD:     0.1
--------------------------------------------------------------------------------
>>> Logged by: 'prepare_for_multi_view_unet' in 'data_preparation_funcs.py'
Views:       N=6
             [ 0.89081436 -0.42408026  0.16311258]
             [ 0.43957442 -0.12374472  0.88964126]
             [-0.54803847  0.1605022   0.82090979]
             [ 0.03570166 -0.69000704  0.72292163]
             [-0.12735015  0.93440077  0.33268173]
             [-0.95791583 -0.14147684  0.24976303]

--------------------------------------------------------------------------------
>>> Logged by: 'get_sequencers' in 'data_preparation_funcs.py'
Preparing sequence objects...
--------------------------------------------------------------------------------
>>> Logged by: 'get_sequence' in 'utils.py'
Using on-the-fly augmenters:
Elastic2D(alpha=[0, 450], sigma=[20, 30], apply_prob=0.333)
--------------------------------------------------------------------------------
>>> Logged by: 'log' in 'isotrophic_live_view_sequence_2d.py'

Is validation:               False
Using real space span:       263.50000739097595
Using sample dim:            304
Using real space sample res: 0.8667763401018945
N fg slices:                 8
Batch size:                  16
Force all FG:                False
Noise SD:                    0.1
Augmenters:                  [Elastic2D(alpha=[0, 450], sigma=[20, 30], apply_prob=0.333)]

Is validation:               True
Using real space span:       263.50000739097595
Using sample dim:            304
Using real space sample res: 0.8667763401018945
N fg slices:                 8
Batch size:                  16
Force all FG:                False
Noise SD:                    0.0
Augmenters:                  None
Waiting for free GPU.
Found free GPU: 0
2021-08-11 14:20:16.064686: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2021-08-11 14:20:17.031107: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: 
pciBusID: 0000:17:00.0 name: GeForce GTX 1080 Ti computeCapability: 6.

...

>>> Logged by: 'init_model' in 'model_init.py'
Creating new model of type 'UNet'
--------------------------------------------------------------------------------
>>> Logged by: 'log' in 'unet.py'
UNet Model Summary
------------------
Image rows:        304
Image cols:        304
Image channels:    1
N classes:         17
CF factor:         2.000
Depth:             4
l2 reg:            False
Padding:           same
Conv activation:   relu
Out activation:    softmax
Receptive field:   [155 155]
N params:          62062642
Output:            Tensor("flatten_output/Reshape:0", shape=(None, 92416, 17), dtype=float32)
Crop:              None
--------------------------------------------------------------------------------
>>> Logged by: 'set_bias_weights' in 'utils.py'
OBS: Estimating class counts from 10 images
/opt/conda/lib/python3.7/site-packages/mpunet/utils/utils.py:237: RuntimeWarning: divide by zero encountered in log
  bias = np.log(freq * np.sum(np.exp(freq)))
/opt/conda/lib/python3.7/site-packages/mpunet/utils/utils.py:238: RuntimeWarning: invalid value encountered in true_divide
  bias /= np.linalg.norm(bias)
Setting bias weights on output layer to:
[ 0. -0. -0. -0. -0. -0. -0. -0. -0. -0.  0. -0. nan -0. -0. -0. -0.]
--------------------------------------------------------------------------------
>>> Logged by: 'compile_model' in 'trainer.py'
Optimizer:   <tensorflow.python.keras.optimizer_v2.adam.Adam object at 0x7f138c320590>
Loss funcs:  [<tensorflow.python.keras.losses.SparseCategoricalCrossentropy object at 0x7f138c326950>]
Metrics:     <function init_metrics at 0x7f138c172ef0>
--------------------------------------------------------------------------------
>>> Logged by: 'save_images' in 'plotting.py'
Saving 64 sample images in '<project_dir>/images' folder
--------------------------------------------------------------------------------
>>> Logged by: '_fit' in 'trainer.py'
Using 157 steps per train epoch (total batches=1000000000000)
Using 219 steps per val epoch (total batches=1000000000000)
--------------------------------------------------------------------------------
>>> Logged by: 'init_callback_objects' in 'funcs.py'
[1] Using callback: Validation(params=?)
[2] Using callback: MeanReduceLogArrays(params=?)
[3] Using callback: ReduceLROnPlateau(patience=2, factor=0.9, verbose=1, monitor=val_dice, mode=max)
[4] Using callback: TensorBoard(log_dir=./tensorboard, profile_batch=0)
[5] Using callback: ModelCheckPointClean(filepath=./model/@epoch_{epoch:02d}_val_dice_{val_dice:.5f}.h5, monitor=val_dice, save_best_only=True, save_weights_only=True, verbose=1, mode=max)
[6] Using callback: EarlyStopping(monitor=val_dice, min_delta=0, patience=15, verbose=1, mode=max)
[7] Using callback: TrainTimer(verbose=True, logger=Logger(base_path=/home/jovyan/work/results/drcmr_16, print_to_screen=True, overwrite_existing=False, append_existing=False))
[8] Using callback: CSVLogger(filename=logs/training.csv, separator=,, append=True)
[9] Using callback: FGBatchBalancer(params=?)
[10] Using callback: SavePredictionImages(params=?)
[11] Using callback: LearningCurve(params=?)
[12] Using callback: DividerLine(params=?)
Epoch 1/500
  1/157 [..............................] - ETA: 0s - loss: nan - sparse_categorical_accuracy: 0.7300
...

perslev commented on June 2, 2024

Hi,

Thanks for reporting this. Based on the log, the issue seems to be caused by a class that is absent from the sample of images used to estimate class frequencies, which leads to a NaN bias weight being set on the output layer during initialisation. Specifically, I am referring to the following section of the log:

--------------------------------------------------------------------------------
>>> Logged by: 'set_bias_weights' in 'utils.py'
OBS: Estimating class counts from 10 images
/opt/conda/lib/python3.7/site-packages/mpunet/utils/utils.py:237: RuntimeWarning: divide by zero encountered in log
  bias = np.log(freq * np.sum(np.exp(freq)))
/opt/conda/lib/python3.7/site-packages/mpunet/utils/utils.py:238: RuntimeWarning: invalid value encountered in true_divide
  bias /= np.linalg.norm(bias)
Setting bias weights on output layer to:
[ 0. -0. -0. -0. -0. -0. -0. -0. -0. -0.  0. -0. nan -0. -0. -0. -0.]
--------------------------------------------------------------------------------
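The two warnings in this log section can be reproduced in isolation. The sketch below is not the mpunet implementation: it only reuses the two expressions quoted in the log, and it assumes that `freq` holds relative class frequencies. The function name `bias_from_counts` and the voxel counts are made up for illustration.

```python
import numpy as np


def bias_from_counts(counts):
    """Hypothetical helper wrapping the two expressions quoted in the log."""
    counts = np.asarray(counts, dtype=np.float64)
    # Assumption: freq is the relative frequency of each class.
    freq = counts / counts.sum()
    # A class with freq == 0 gives log(0) = -inf ("divide by zero encountered in log").
    bias = np.log(freq * np.sum(np.exp(freq)))
    # The norm of a vector containing -inf is inf, so -inf / inf becomes nan
    # ("invalid value encountered") and every finite entry collapses to +/-0.
    bias /= np.linalg.norm(bias)
    return bias


# Made-up counts for 17 classes: a dominant background and no voxels of class 12.
counts = np.full(17, 1000)
counts[0] = 100000
counts[12] = 0

bias = bias_from_counts(counts)
print(bias)
```

Under these assumptions a single empty class is enough to turn one bias into nan and flatten all the others to zero, which is the pattern printed in the log.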

I will fix this issue in 1 or 2 weeks when I am back from other activities. Until then, could you please set the biased_output_layer variable to False in the train_hparams.yaml parameter file and re-run training to verify that this is indeed the issue? See:

https://github.com/perslev/MultiPlanarUNet/blob/master/mpunet/bin/defaults/MultiPlanar/train_hparams.yaml#L86

Cheers,
Mathias

dyollb commented on June 2, 2024

At some point I realized that my dataset has no label 12, i.e. label=12 is never used.
After fixing this in the data, the model is now learning...

So the reason for the nan (at position 12 of the weights) was this missing label.
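A quick way to catch this before training is to count how often each label occurs across the label volumes. This is only a sketch on synthetic arrays: `missing_labels` is a hypothetical helper, not part of mpunet, and it assumes the labels are already loaded as non-negative integer NumPy arrays.

```python
import numpy as np


def missing_labels(label_volumes, n_classes):
    """Return the class indices that never occur in any of the label volumes."""
    counts = np.zeros(n_classes, dtype=np.int64)
    for labels in label_volumes:
        flat = np.asarray(labels).ravel()
        # bincount needs non-negative integers; labels >= n_classes are ignored here.
        counts += np.bincount(flat, minlength=n_classes)[:n_classes]
    return np.flatnonzero(counts == 0)


# Synthetic stand-in for a dataset with 17 classes in which label 12 is never used.
present = np.array([c for c in range(17) if c != 12])
volumes = [np.resize(present, (4, 4, 4)) for _ in range(3)]

print(missing_labels(volumes, n_classes=17))
```

Any index returned here is a class with a zero count, which is the situation described above.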
