fvisin / main_loop_tf
A main loop based on dataset loaders
License: GNU General Public License v3.0
I know this is a basic question, but how can I pass the crop size as an argument?
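One common pattern is to accept the crop size as a string flag and parse it into a tuple. A minimal sketch using stdlib argparse (the flag name `--crop_size` and the helper `parse_crop_size` are assumptions for illustration, not the project's actual API):

```python
import argparse

def parse_crop_size(value):
    """Parse a 'height,width' string (e.g. '224,224') into an int tuple."""
    parts = [int(p) for p in value.split(',')]
    if len(parts) == 1:
        parts = parts * 2  # a single number means a square crop
    return tuple(parts)

parser = argparse.ArgumentParser()
parser.add_argument('--crop_size', type=parse_crop_size, default=(224, 224),
                    help="Crop size as 'height,width' or a single number")

args = parser.parse_args(['--crop_size', '320,240'])
print(args.crop_size)  # (320, 240)
```

The same parsed tuple can then be forwarded to the dataset loader wherever crops are taken.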
People tend to use self.sess in validation, which causes the hooks to be run. When the validation hook is among the hooks, this causes an infinite loop. It's probably better to rename self.sess to self.sess_with_hooks and make self.unhookedsess become self.sess. Will work on it as soon as I can, but happy to accept PRs.
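The recursion and the proposed fix can be sketched in plain Python (the class and hook names here are hypothetical stand-ins, not the project's actual code): after the renaming, the default `self.sess` runs no hooks, so a hook that calls it cannot re-trigger itself.

```python
class MainLoop(object):
    """Sketch of the proposed renaming: the hooked session gets an
    explicit name, while self.sess becomes the plain, hook-free one."""

    def __init__(self):
        self.hooks = []
        # self.sess runs NO hooks, so calling it from inside a hook
        # (e.g. the validation hook) cannot recurse.
        self.sess = self._raw_run
        self.sess_with_hooks = self._hooked_run

    def _raw_run(self, fetches):
        return 'ran %s' % fetches

    def _hooked_run(self, fetches):
        for hook in self.hooks:
            hook(self)  # hooks receive the loop and should use self.sess
        return self._raw_run(fetches)

def validation_hook(loop):
    # Uses the unhooked session: no hooks fire, no infinite loop.
    loop.validation_result = loop.sess('validation_op')

loop = MainLoop()
loop.hooks.append(validation_hook)
print(loop.sess_with_hooks('train_op'))  # 'ran train_op'
print(loop.validation_result)            # 'ran validation_op'
```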
Checkpoints shouldn't be saved at each minibatch but at the end of each training epoch.
Moreover, the best model and last model should be saved with different names.
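The end-of-epoch save logic could look like the following sketch (the file names `last_model.ckpt`/`best_model.ckpt` and the helper are assumptions; in the real loop the returned paths would be passed to a `tf.train.Saver`):

```python
import os

def checkpoint_paths(save_dir, val_metric, best_metric):
    """Decide which checkpoint files to write at the end of an epoch.

    Always (over)writes a 'last_model' checkpoint; additionally writes a
    'best_model' checkpoint when the validation metric improves.
    Returns (paths_to_write, new_best_metric).
    """
    paths = [os.path.join(save_dir, 'last_model.ckpt')]
    if best_metric is None or val_metric > best_metric:
        paths.append(os.path.join(save_dir, 'best_model.ckpt'))
        best_metric = val_metric
    return paths, best_metric

best = None
for epoch, iou in enumerate([0.40, 0.55, 0.52]):
    paths, best = checkpoint_paths('/tmp/exp1', iou, best)
    print(epoch, [os.path.basename(p) for p in paths], best)
```

Keeping the two names distinct means the best model is never clobbered by a later, worse epoch.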
Add a flag to optionally save runtime statistics every n batches, to be visualized in TensorBoard.
See more information here:
https://www.tensorflow.org/get_started/graph_viz#runtime_statistics
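A minimal sketch of the gating logic (the helper name is an assumption; the commented-out lines show where the TF 1.x tracing calls from the linked page would go):

```python
def should_save_stats(step, every_n):
    """True on the steps where runtime statistics should be traced."""
    return every_n > 0 and step % every_n == 0

# In the TF 1.x train loop this would gate the extra tracing arguments:
#   run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
#   run_metadata = tf.RunMetadata()
#   sess.run(train_op, options=run_options, run_metadata=run_metadata)
#   summary_writer.add_run_metadata(run_metadata, 'step%d' % step)
traced = [s for s in range(10) if should_save_stats(s, every_n=3)]
print(traced)  # [0, 3, 6, 9]
```

Defaulting `every_n` to 0 would keep the feature off, since full tracing slows each traced step down noticeably.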
I got this memory issue:
OOM when allocating tensor with shape[1,60,60,1024] [[Node: gpu0_train/my_model/resnet_v2_101/block3/unit_8/bottleneck_v2/preact/moments/SquaredDifference = SquaredDifference[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](gpu0_train/my_model/resnet_v2_101/block3/unit_7/bottleneck_v2/add, gpu0_train/my_model/resnet_v2_101/block3/unit_8/bottleneck_v2/preact/moments/StopGradient)]]
which is caused by training on the CPU. I changed the devices to GPU, but it still runs on the CPU. I also changed the with tf.device('/cpu:0') line to with tf.device('/gpu:0'), but nothing changed. How can I make training run on a single GPU device?
Thanks
As reported by @marcociccone, we should find a way to use our dataset_loader with TensorFlow Queues.
Here are some suggestions on how to do that:
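The core idea is a producer/consumer pattern: a background thread pulls minibatches from the loader and pushes them into a bounded queue, while the training loop dequeues them. This is what tf.FIFOQueue plus a QueueRunner provide in-graph; the sketch below illustrates the same pattern with a stdlib queue.Queue (`fake_dataset_loader` is a hypothetical stand-in for the project's dataset_loader):

```python
import queue
import threading

def fake_dataset_loader():
    """Stand-in for the project's dataset_loader: yields minibatches."""
    for i in range(5):
        yield {'batch_id': i, 'data': [i] * 4}

def enqueue_thread(loader, q):
    """Producer: push minibatches into the queue, then a sentinel."""
    for batch in loader:
        q.put(batch)
    q.put(None)  # signal end of epoch

q = queue.Queue(maxsize=2)  # bounded, like tf.FIFOQueue(capacity=2)
t = threading.Thread(target=enqueue_thread, args=(fake_dataset_loader(), q))
t.start()

# Consumer: the training loop dequeues batches as soon as they are ready.
seen = []
while True:
    batch = q.get()
    if batch is None:
        break
    seen.append(batch['batch_id'])
t.join()
print(seen)  # [0, 1, 2, 3, 4]
```

The bounded capacity is what lets data loading overlap with the GPU compute without growing memory unboundedly.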
Introduce a callbacks mechanism to allow users to write custom code in the model (e.g., weight decay, lr_schedule) and to allow custom code to be executed at the beginning/end of an epoch/iteration, etc.
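A minimal sketch of such a mechanism (all class and method names here are assumptions, not the project's API): a base class with no-op event methods, user subclasses overriding only what they need, and the main loop dispatching to every registered callback at each event.

```python
class Callback(object):
    """Base class: subclasses override only the events they care about."""
    def on_epoch_begin(self, epoch): pass
    def on_epoch_end(self, epoch): pass
    def on_batch_begin(self, batch): pass
    def on_batch_end(self, batch): pass

class LRSchedule(Callback):
    """Example user callback: halve the learning rate every 2 epochs."""
    def __init__(self, lr):
        self.lr = lr
    def on_epoch_begin(self, epoch):
        if epoch > 0 and epoch % 2 == 0:
            self.lr *= 0.5

def run_training(callbacks, num_epochs=4, batches_per_epoch=3):
    for epoch in range(num_epochs):
        for cb in callbacks:
            cb.on_epoch_begin(epoch)
        for batch in range(batches_per_epoch):
            for cb in callbacks:
                cb.on_batch_begin(batch)
            # ... run the train op here ...
            for cb in callbacks:
                cb.on_batch_end(batch)
        for cb in callbacks:
            cb.on_epoch_end(epoch)

sched = LRSchedule(lr=0.1)
run_training([sched])  # epochs 0..3: lr halved once, at epoch 2
print(sched.lr)  # 0.05
```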
Image summaries are saved only in validation; we should add another summary for training as well.
As reported in #13 (review), there seems to be a problem with the IoU graph when the training is restarted. It probably has to do with the way we compute the incremental counter for the x-axis of the graph; this should be verified.
The only thing I noticed is that when you reload the parameters and continue the training, the plot of the IoU metrics becomes a mess
We should use tf.train.Supervisor to have better control over the training and the summary ops.
The model checkpoints and the TensorBoard events are not saved at the same frequency. When the model is reloaded and training resumes, the main loop writes new events to the event file with the same global step (x-coordinate) as some previous events. As a result, the TensorBoard graph looks weird. This can be fixed by using SessionLog messages, as suggested in the documentation.
I don't have time to work on this right now, but I'd be happy to review a PR.
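For intuition, a SessionLog.START message written at the restart step tells TensorBoard to discard earlier events whose global step is at or past the restart point. A pure-Python sketch of that discarding effect (the event dicts and helper name are illustrative, not the actual event-file format):

```python
def discard_stale_events(events, restart_step):
    """Mimic what a SessionLog.START message at `restart_step` achieves:
    events written before a crash with global_step >= restart_step are
    dropped, so the resumed run cannot produce duplicate x-coordinates."""
    return [e for e in events if e['step'] < restart_step]

# Events written before the interruption: the checkpoint was saved at
# step 20, but summaries kept being written up to step 35.
events = [{'step': s, 'iou': 0.5 + s / 100.0} for s in (10, 20, 30, 35)]

# Resuming from the step-20 checkpoint restarts at step 21, so the
# step-30 and step-35 events are stale and get dropped.
clean = discard_stale_events(events, restart_step=21)
print([e['step'] for e in clean])  # [10, 20]
```

In TF 1.x the actual message would be written with `summary_writer.add_session_log(tf.SessionLog(status=tf.SessionLog.START), global_step)` right after restoring the checkpoint.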
It could be useful to have the random seed as a FLAG parameter, for experiment reproducibility. If not set, it could be either fixed or random.
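A sketch of the fixed-or-random fallback with stdlib argparse (the flag name and helper are assumptions; the repo's FLAGS mechanism would work the same way). Logging the drawn seed keeps even the "random" case reproducible after the fact:

```python
import argparse
import random

parser = argparse.ArgumentParser()
parser.add_argument('--seed', type=int, default=None,
                    help='Random seed; if omitted, a fresh one is drawn '
                         'and printed so the run can still be reproduced')

def resolve_seed(args):
    seed = args.seed if args.seed is not None else random.randrange(2**31)
    print('Using random seed: %d' % seed)  # log it for reproducibility
    random.seed(seed)
    # In the real main loop one would also call tf.set_random_seed(seed)
    # and np.random.seed(seed) here.
    return seed

args = parser.parse_args(['--seed', '1234'])
assert resolve_seed(args) == 1234
```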
Check why multi-GPU training is slow and goes OOM.
I will need all of the features of this optimizer.
We could either integrate it into the main loop, or just take the parts we need.
As suggested by @marcociccone, self.placeholders is not easily comprehensible from the user perspective.
This is required to access the targets that are actually used at validation time:
val_labels = [el['labels'] for el in self.placeholders[False]]
actual_val_labels = tf.concat(val_labels[:self.sym_num_devs], axis=0)
or, more easily, via recursive_truncate_dict.
I guess I should rename:
self.placeholders --> self._placeholders (or self._per_dev_placeholders)
and keep self.placeholders for the placeholders that can be of some use to the end user.
We should use a logger, so anybody could see the state of an experiment on tensorboard.