Comments (3)
The full log looks more like this:
Audit for 16 images
-------------------
Total memory GiB: 1.776
Number of classes: 17
2D:
Real space span: 263.500
Sample dim: 304.000
3D:
Sample dim: 64
Real space span: 263.500
Box span: 54.400
--------------------------------------------------------------------------------
>>> Logged by: 'set_value' in 'hparams.py'
Setting value '263.50000739097595' (type <class 'numpy.float64'>) in subdir 'fit' with name 'real_space_span'
Setting value '304' (type <class 'numpy.int64'>) in subdir 'build' with name 'dim'
Setting value '1' (type <class 'int'>) in subdir 'build' with name 'n_channels'
Entry of name 'n_classes' already set in subdir 'build' with value '17'. Skipping (overwrite=False).
>>> Logged by: 'save_current' in 'hparams.py'
Saving current YAML configuration to file:
/home/jovyan/work/results/drcmr_16/train_hparams.yaml
--------------------------------------------------------------------------------
>>> Logged by: '_base_loader_func' in 'data_preparation_funcs.py'
Preparing dataset ImagePairLoader(id=train, images=10, data_dir=/home/jovyan/work/data_dir/train)
X10679
--- loaded: False
--- shape: [310 310 310 1]
--- bg class 0
--- bg value ['1pct']
--- scaler RobustScaler
--- real shape: [263.5 263.5 263.5]
--- pixdim: [0.85 0.85 0.85]
X14109
--- loaded: False
--- shape: [310 310 310 1]
--- bg class 0
--- bg value ['1pct']
--- scaler RobustScaler
--- real shape: [263.5 263.5 263.5]
--- pixdim: [0.85 0.85 0.85]
...
>>> Logged by: '__init__' in 'eager_queue.py'
'Eager' queue created:
Dataset: ImagePairLoader(id=train, images=10, data_dir=/home/jovyan/work/data_dir/train)
Preloading all 10 images now... (eager)
'Eager' queue created:
Dataset: ImagePairLoader(id=val, images=6, data_dir=/home/jovyan/work/data_dir/val)
Preloading all 6 images now... (eager)
--------------------------------------------------------------------------------
>>> Logged by: 'sample_random_views_with_angle_restriction' in 'sample_grid.py'
Generating 6 random views...
[OBS] Weighting random views by median res: [0.85 0.85 0.85]
--------------------------------------------------------------------------------
>>> Logged by: 'load_or_create_views' in 'data_preparation_funcs.py'
View SD: 0.1
--------------------------------------------------------------------------------
>>> Logged by: 'prepare_for_multi_view_unet' in 'data_preparation_funcs.py'
Views: N=6
[ 0.89081436 -0.42408026 0.16311258]
[ 0.43957442 -0.12374472 0.88964126]
[-0.54803847 0.1605022 0.82090979]
[ 0.03570166 -0.69000704 0.72292163]
[-0.12735015 0.93440077 0.33268173]
[-0.95791583 -0.14147684 0.24976303]
--------------------------------------------------------------------------------
>>> Logged by: 'get_sequencers' in 'data_preparation_funcs.py'
Preparing sequence objects...
--------------------------------------------------------------------------------
>>> Logged by: 'get_sequence' in 'utils.py'
Using on-the-fly augmenters:
Elastic2D(alpha=[0, 450], sigma=[20, 30], apply_prob=0.333)
--------------------------------------------------------------------------------
>>> Logged by: 'log' in 'isotrophic_live_view_sequence_2d.py'
Is validation: False
Using real space span: 263.50000739097595
Using sample dim: 304
Using real space sample res: 0.8667763401018945
N fg slices: 8
Batch size: 16
Force all FG: False
Noise SD: 0.1
Augmenters: [Elastic2D(alpha=[0, 450], sigma=[20, 30], apply_prob=0.333)]
Is validation: True
Using real space span: 263.50000739097595
Using sample dim: 304
Using real space sample res: 0.8667763401018945
N fg slices: 8
Batch size: 16
Force all FG: False
Noise SD: 0.0
Augmenters: None
Waiting for free GPU.
Found free GPU: 0
2021-08-11 14:20:16.064686: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2021-08-11 14:20:17.031107: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:17:00.0 name: GeForce GTX 1080 Ti computeCapability: 6.
...
>>> Logged by: 'init_model' in 'model_init.py'
Creating new model of type 'UNet'
--------------------------------------------------------------------------------
>>> Logged by: 'log' in 'unet.py'
UNet Model Summary
------------------
Image rows: 304
Image cols: 304
Image channels: 1
N classes: 17
CF factor: 2.000
Depth: 4
l2 reg: False
Padding: same
Conv activation: relu
Out activation: softmax
Receptive field: [155 155]
N params: 62062642
Output: Tensor("flatten_output/Reshape:0", shape=(None, 92416, 17), dtype=float32)
Crop: None
--------------------------------------------------------------------------------
>>> Logged by: 'set_bias_weights' in 'utils.py'
OBS: Estimating class counts from 10 images
/opt/conda/lib/python3.7/site-packages/mpunet/utils/utils.py:237: RuntimeWarning: divide by zero encountered in log
bias = np.log(freq * np.sum(np.exp(freq)))
/opt/conda/lib/python3.7/site-packages/mpunet/utils/utils.py:238: RuntimeWarning: invalid value encountered in true_divide
bias /= np.linalg.norm(bias)
Setting bias weights on output layer to:
[ 0. -0. -0. -0. -0. -0. -0. -0. -0. -0. 0. -0. nan -0. -0. -0. -0.]
--------------------------------------------------------------------------------
>>> Logged by: 'compile_model' in 'trainer.py'
Optimizer: <tensorflow.python.keras.optimizer_v2.adam.Adam object at 0x7f138c320590>
Loss funcs: [<tensorflow.python.keras.losses.SparseCategoricalCrossentropy object at 0x7f138c326950>]
Metrics: <function init_metrics at 0x7f138c172ef0>
--------------------------------------------------------------------------------
>>> Logged by: 'save_images' in 'plotting.py'
Saving 64 sample images in '<project_dir>/images' folder
--------------------------------------------------------------------------------
>>> Logged by: '_fit' in 'trainer.py'
Using 157 steps per train epoch (total batches=1000000000000)
Using 219 steps per val epoch (total batches=1000000000000)
--------------------------------------------------------------------------------
>>> Logged by: 'init_callback_objects' in 'funcs.py'
[1] Using callback: Validation(params=?)
[2] Using callback: MeanReduceLogArrays(params=?)
[3] Using callback: ReduceLROnPlateau(patience=2, factor=0.9, verbose=1, monitor=val_dice, mode=max)
[4] Using callback: TensorBoard(log_dir=./tensorboard, profile_batch=0)
[5] Using callback: ModelCheckPointClean(filepath=./model/@epoch_{epoch:02d}_val_dice_{val_dice:.5f}.h5, monitor=val_dice, save_best_only=True, save_weights_only=True, verbose=1, mode=max)
[6] Using callback: EarlyStopping(monitor=val_dice, min_delta=0, patience=15, verbose=1, mode=max)
[7] Using callback: TrainTimer(verbose=True, logger=Logger(base_path=/home/jovyan/work/results/drcmr_16, print_to_screen=True, overwrite_existing=False, append_existing=False))
[8] Using callback: CSVLogger(filename=logs/training.csv, separator=,, append=True)
[9] Using callback: FGBatchBalancer(params=?)
[10] Using callback: SavePredictionImages(params=?)
[11] Using callback: LearningCurve(params=?)
[12] Using callback: DividerLine(params=?)
Epoch 1/500
1/157 [..............................] - ETA: 0s - loss: nan - sparse_categorical_accuracy: 0.7300
...
from multiplanarunet.
Hi,
Thanks for reporting this. Based on the log it seems that the issue is caused by a given class not being present across a sample of images used to estimate class frequencies, ultimately leading to the setting of a NaN model weight during initialisation. Specifically, I am referring to the following section of the log:
--------------------------------------------------------------------------------
>>> Logged by: 'set_bias_weights' in 'utils.py'
OBS: Estimating class counts from 10 images
/opt/conda/lib/python3.7/site-packages/mpunet/utils/utils.py:237: RuntimeWarning: divide by zero encountered in log
bias = np.log(freq * np.sum(np.exp(freq)))
/opt/conda/lib/python3.7/site-packages/mpunet/utils/utils.py:238: RuntimeWarning: invalid value encountered in true_divide
bias /= np.linalg.norm(bias)
Setting bias weights on output layer to:
[ 0. -0. -0. -0. -0. -0. -0. -0. -0. -0. 0. -0. nan -0. -0. -0. -0.]
--------------------------------------------------------------------------------
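The two statements quoted in the warnings above are enough to reproduce the NaN. The following is a minimal sketch, not the actual mpunet code path: it assumes `freq` is a vector of per-class relative frequencies (how mpunet computes it is not shown in the log), and the counts and the helper name `bias_from_freq` are made up for illustration. A class with zero frequency gives `log(0) = -inf`, the norm then becomes `inf`, and `-inf / inf` yields `nan`, while every finite entry collapses to zero.

```python
import numpy as np


def bias_from_freq(freq):
    """Apply the two statements quoted in the RuntimeWarnings to a frequency vector.

    Hypothetical helper for illustration only; not part of mpunet.
    """
    freq = np.asarray(freq, dtype=np.float64)
    # Silence the two warnings seen in the log so the result can be inspected.
    with np.errstate(divide="ignore", invalid="ignore"):
        bias = np.log(freq * np.sum(np.exp(freq)))  # log(0) -> -inf for absent classes
        bias /= np.linalg.norm(bias)                # norm is inf, and -inf / inf -> nan
    return bias


# Made-up counts for 17 classes in which class 12 never occurs.
counts = np.full(17, 100.0)
counts[12] = 0.0
freq = counts / counts.sum()

bias = bias_from_freq(freq)
print(np.isnan(bias[12]))                        # the absent class becomes NaN
print(np.all(np.delete(bias, 12) == 0.0))        # all other entries are divided by inf

# One possible guard (an assumption, not the maintainer's fix): keep freq above zero.
safe_bias = bias_from_freq(np.clip(freq, 1e-7, None))
print(np.all(np.isfinite(safe_bias)))
```

A single NaN bias on the output layer is enough to make the loss NaN from the first batch, which matches the `loss: nan` reported at step 1 of epoch 1.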
I will fix this issue in 1 or 2 weeks when I am back from other activities. Until then, could you please try and set the biased_output_layer variable to False in the train_hparams.yaml parameter file and re-run training to verify that this is indeed the issue? See:
Cheers,
Mathias
At some point I realized that my dataset has no label 12, i.e. label=12 is never used.
After fixing this in the data, the model is now learning.
So the reason for the NaN (at position 12 of the weights) was this missing label.
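A check along these lines can catch an unused label before training starts. This is a minimal sketch that assumes the label volumes are already available as integer NumPy arrays; the function name `missing_labels` and the toy arrays are hypothetical and not part of mpunet.

```python
import numpy as np


def missing_labels(label_volumes, n_classes):
    """Return the sorted class indices in [0, n_classes) found in none of the volumes."""
    seen = np.zeros(n_classes, dtype=bool)
    for labels in label_volumes:
        present = np.unique(np.asarray(labels))
        # Ignore values outside the expected class range.
        present = present[(present >= 0) & (present < n_classes)].astype(int)
        seen[present] = True
    return np.flatnonzero(~seen).tolist()


# Toy label maps: classes 0-4 occur across the two volumes, class 5 never does.
vol_a = np.array([[0, 1, 2], [3, 3, 0]])
vol_b = np.array([[0, 4, 4], [1, 1, 0]])

print(missing_labels([vol_a, vol_b], n_classes=6))  # prints [5]
```

Any index returned here is a class whose estimated frequency would be zero, so either the labels need to be remapped to a contiguous range or the number of classes adjusted.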