fvisin / main_loop_tf
A main loop based on dataset loaders
License: GNU General Public License v3.0
I know this is a basic question, but how can I pass the crop size as an argument?
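One common pattern is to accept the crop size as a string flag and parse it into a tuple. A minimal sketch using stdlib argparse (the flag name `--crop_size` and the helper `parse_crop_size` are assumptions for illustration, not the project's actual API):

```python
import argparse

def parse_crop_size(value):
    """Parse a 'height,width' string (e.g. '224,224') into an int tuple."""
    parts = [int(p) for p in value.split(',')]
    if len(parts) == 1:
        parts = parts * 2  # a single number means a square crop
    return tuple(parts)

parser = argparse.ArgumentParser()
parser.add_argument('--crop_size', type=parse_crop_size, default=(224, 224),
                    help="Crop size as 'height,width' or a single number")

args = parser.parse_args(['--crop_size', '320,240'])
print(args.crop_size)  # (320, 240)
```

The same parsed tuple can then be forwarded to the dataset loader wherever crops are taken.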
People tend to use self.sess in validation, which causes the hooks to be run. When the validation hook is among the hooks, this causes an infinite loop. It's probably better to rename self.sess to self.sess_with_hooks and make self.unhookedsess become self.sess. Will work on it as soon as I can, but happy to accept PRs.
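The recursion and the proposed fix can be sketched in plain Python (the class and hook names here are hypothetical stand-ins, not the project's actual code): after the renaming, the default `self.sess` runs no hooks, so a hook that calls it cannot re-trigger itself.

```python
class MainLoop(object):
    """Sketch of the proposed renaming: the hooked session gets an
    explicit name, while self.sess becomes the plain, hook-free one."""

    def __init__(self):
        self.hooks = []
        # self.sess runs NO hooks, so calling it from inside a hook
        # (e.g. the validation hook) cannot recurse.
        self.sess = self._raw_run
        self.sess_with_hooks = self._hooked_run

    def _raw_run(self, fetches):
        return 'ran %s' % fetches

    def _hooked_run(self, fetches):
        for hook in self.hooks:
            hook(self)  # hooks receive the loop and should use self.sess
        return self._raw_run(fetches)

def validation_hook(loop):
    # Uses the unhooked session: no hooks fire, no infinite loop.
    loop.validation_result = loop.sess('validation_op')

loop = MainLoop()
loop.hooks.append(validation_hook)
print(loop.sess_with_hooks('train_op'))  # 'ran train_op'
print(loop.validation_result)            # 'ran validation_op'
```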
Checkpoints shouldn't be saved at each minibatch but at the end of each training epoch.
Moreover, the best model and last model should be saved with different names.
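The end-of-epoch save logic could look like the following sketch (the file names `last_model.ckpt`/`best_model.ckpt` and the helper are assumptions; in the real loop the returned paths would be passed to a `tf.train.Saver`):

```python
import os

def checkpoint_paths(save_dir, val_metric, best_metric):
    """Decide which checkpoint files to write at the end of an epoch.

    Always (over)writes a 'last_model' checkpoint; additionally writes a
    'best_model' checkpoint when the validation metric improves.
    Returns (paths_to_write, new_best_metric).
    """
    paths = [os.path.join(save_dir, 'last_model.ckpt')]
    if best_metric is None or val_metric > best_metric:
        paths.append(os.path.join(save_dir, 'best_model.ckpt'))
        best_metric = val_metric
    return paths, best_metric

best = None
for epoch, iou in enumerate([0.40, 0.55, 0.52]):
    paths, best = checkpoint_paths('/tmp/exp1', iou, best)
    print(epoch, [os.path.basename(p) for p in paths], best)
```

Keeping the two names distinct means the best model is never clobbered by a later, worse epoch.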
Add a flag to optionally save runtime statistics every n batches, to be visualized in TensorBoard.
See more information here:
https://www.tensorflow.org/get_started/graph_viz#runtime_statistics
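A minimal sketch of the gating logic (the helper name is an assumption; the commented-out lines show where the TF 1.x tracing calls from the linked page would go):

```python
def should_save_stats(step, every_n):
    """True on the steps where runtime statistics should be traced."""
    return every_n > 0 and step % every_n == 0

# In the TF 1.x train loop this would gate the extra tracing arguments:
#   run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
#   run_metadata = tf.RunMetadata()
#   sess.run(train_op, options=run_options, run_metadata=run_metadata)
#   summary_writer.add_run_metadata(run_metadata, 'step%d' % step)
traced = [s for s in range(10) if should_save_stats(s, every_n=3)]
print(traced)  # [0, 3, 6, 9]
```

Defaulting `every_n` to 0 would keep the feature off, since full tracing slows each traced step down noticeably.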
I got this memory issue:
OOM when allocating tensor with shape[1,60,60,1024] [[Node: gpu0_train/my_model/resnet_v2_101/block3/unit_8/bottleneck_v2/preact/moments/SquaredDifference = SquaredDifference[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](gpu0_train/my_model/resnet_v2_101/block3/unit_7/bottleneck_v2/add, gpu0_train/my_model/resnet_v2_101/block3/unit_8/bottleneck_v2/preact/moments/StopGradient)]]
which is caused by training on the CPU. I changed the devices to GPU, but it still runs on the CPU. I also changed the with tf.device('/cpu:0') line to with tf.device('/gpu:0'), but nothing changed. How can I make training run on a single GPU device?
Thanks
As reported by @marcociccone, we should find a way to use our dataset_loader with TensorFlow Queues.
Here are some suggestions on how to do that:
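The core idea is a producer/consumer pattern: a background thread pulls minibatches from the loader and pushes them into a bounded queue, while the training loop dequeues them. This is what tf.FIFOQueue plus a QueueRunner provide in-graph; the sketch below illustrates the same pattern with a stdlib queue.Queue (`fake_dataset_loader` is a hypothetical stand-in for the project's dataset_loader):

```python
import queue
import threading

def fake_dataset_loader():
    """Stand-in for the project's dataset_loader: yields minibatches."""
    for i in range(5):
        yield {'batch_id': i, 'data': [i] * 4}

def enqueue_thread(loader, q):
    """Producer: push minibatches into the queue, then a sentinel."""
    for batch in loader:
        q.put(batch)
    q.put(None)  # signal end of epoch

q = queue.Queue(maxsize=2)  # bounded, like tf.FIFOQueue(capacity=2)
t = threading.Thread(target=enqueue_thread, args=(fake_dataset_loader(), q))
t.start()

# Consumer: the training loop dequeues batches as soon as they are ready.
seen = []
while True:
    batch = q.get()
    if batch is None:
        break
    seen.append(batch['batch_id'])
t.join()
print(seen)  # [0, 1, 2, 3, 4]
```

The bounded capacity is what lets data loading overlap with the GPU compute without growing memory unboundedly.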
Introduce a callbacks mechanism to allow users to write custom code in the model (e.g., weight decay, lr_schedule) and to allow custom code to be executed at the beginning/end of an epoch/iteration, etc.
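A minimal sketch of such a mechanism (all class and method names here are assumptions, not the project's API): a base class with no-op event methods, user subclasses overriding only what they need, and the main loop dispatching to every registered callback at each event.

```python
class Callback(object):
    """Base class: subclasses override only the events they care about."""
    def on_epoch_begin(self, epoch): pass
    def on_epoch_end(self, epoch): pass
    def on_batch_begin(self, batch): pass
    def on_batch_end(self, batch): pass

class LRSchedule(Callback):
    """Example user callback: halve the learning rate every 2 epochs."""
    def __init__(self, lr):
        self.lr = lr
    def on_epoch_begin(self, epoch):
        if epoch > 0 and epoch % 2 == 0:
            self.lr *= 0.5

def run_training(callbacks, num_epochs=4, batches_per_epoch=3):
    for epoch in range(num_epochs):
        for cb in callbacks:
            cb.on_epoch_begin(epoch)
        for batch in range(batches_per_epoch):
            for cb in callbacks:
                cb.on_batch_begin(batch)
            # ... run the train op here ...
            for cb in callbacks:
                cb.on_batch_end(batch)
        for cb in callbacks:
            cb.on_epoch_end(epoch)

sched = LRSchedule(lr=0.1)
run_training([sched])  # epochs 0..3: lr halved once, at epoch 2
print(sched.lr)  # 0.05
```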
Image summaries are saved only in validation; we should add another summary for training as well.
As reported in #13 (review), there seems to be a problem with the IoU graph when the training is restarted. It probably has to do with the way we compute the incremental counter for the x-axis of the graph; this should be verified.
The only thing I noticed is that when you reload the parameters and continue the training, the plot of the IoU metrics becomes a mess
We should use tf.train.Supervisor to have better control over the training and the summary ops.
The model checkpoints and the TensorBoard events are not saved at the same frequency. When the model is reloaded and training resumes, the main loop writes new events to the event file with the same global step (x-coordinate) as some previous events. As a result, the TensorBoard graph looks weird. This can be fixed by using SessionLog messages, as suggested in the documentation.
I don't have time to work on this right now, but I'd be happy to review a PR.
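For intuition, a SessionLog.START message written at the restart step tells TensorBoard to discard earlier events whose global step is at or past the restart point. A pure-Python sketch of that discarding effect (the event dicts and helper name are illustrative, not the actual event-file format):

```python
def discard_stale_events(events, restart_step):
    """Mimic what a SessionLog.START message at `restart_step` achieves:
    events written before a crash with global_step >= restart_step are
    dropped, so the resumed run cannot produce duplicate x-coordinates."""
    return [e for e in events if e['step'] < restart_step]

# Events written before the interruption: the checkpoint was saved at
# step 20, but summaries kept being written up to step 35.
events = [{'step': s, 'iou': 0.5 + s / 100.0} for s in (10, 20, 30, 35)]

# Resuming from the step-20 checkpoint restarts at step 21, so the
# step-30 and step-35 events are stale and get dropped.
clean = discard_stale_events(events, restart_step=21)
print([e['step'] for e in clean])  # [10, 20]
```

In TF 1.x the actual message would be written with `summary_writer.add_session_log(tf.SessionLog(status=tf.SessionLog.START), global_step)` right after restoring the checkpoint.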
It could be useful to have the random seed as a FLAG parameter, for experiment reproducibility. If not set, it could be either fixed or random.
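A sketch of the fixed-or-random fallback with stdlib argparse (the flag name and helper are assumptions; the repo's FLAGS mechanism would work the same way). Logging the drawn seed keeps even the "random" case reproducible after the fact:

```python
import argparse
import random

parser = argparse.ArgumentParser()
parser.add_argument('--seed', type=int, default=None,
                    help='Random seed; if omitted, a fresh one is drawn '
                         'and printed so the run can still be reproduced')

def resolve_seed(args):
    seed = args.seed if args.seed is not None else random.randrange(2**31)
    print('Using random seed: %d' % seed)  # log it for reproducibility
    random.seed(seed)
    # In the real main loop one would also call tf.set_random_seed(seed)
    # and np.random.seed(seed) here.
    return seed

args = parser.parse_args(['--seed', '1234'])
assert resolve_seed(args) == 1234
```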
Check why multi-GPU training is slow and goes OOM.
I will need all of the features of this optimizer.
We could either integrate it into the main loop, or just take the parts we need.
As suggested by @marcociccone, self.placeholders is not easily comprehensible from the user perspective.
This is required to access the targets that are actually used at validation time:
val_labels = [el['labels'] for el in self.placeholders[False]]
actual_val_labels = tf.concat(val_labels[:self.sym_num_devs], axis=0)
or, more easily, via recursive_truncate_dict.
I guess I should rename:
self.placeholders --> self._placeholders (or self._per_dev_placeholders)
and keep self.placeholders for the placeholders that can be of some use to the end user.
We should use a logger, so anybody could see the state of an experiment on tensorboard.