cybertronai / gradient-checkpointing

Make huge neural nets fit in memory

License: MIT License

Python 99.55% Shell 0.45%

gradient-checkpointing's Issues

Wrong timeline computation

peak_memory for my model, which runs entirely on a CPU, always returns 0. Could it be that the last line in the snippet below should read line = [node.all_start_micros, node.node_name, output_bytes, "unknown"]?
https://github.com/openai/gradient-checkpointing/blob/b2b7def2e15f2607c14c5b32fb8fa4f892489484/test/mem_util.py#L100-L106
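
For readers unfamiliar with how a peak-memory figure can be derived from a timeline, here is a minimal sketch. It assumes each record is a (timestamp_micros, node_name, bytes_delta, allocator_name) tuple, with positive deltas for allocations and negative deltas for frees. This record layout and the peak_memory_from_timeline helper are illustrative assumptions, not the format used in mem_util.py. It shows one way a result of 0 could arise: when no record survives filtering, the running total never moves.

```python
from typing import Iterable, Optional, Tuple

# Hypothetical record layout (an assumption for this sketch):
# (timestamp_micros, node_name, bytes_delta, allocator_name)
Record = Tuple[int, str, int, str]


def peak_memory_from_timeline(records: Iterable[Record],
                              allocator: Optional[str] = None) -> int:
    """Return the maximum running total of bytes_delta, in timestamp order.

    If `allocator` is given, only records from that allocator are counted.
    """
    current = 0
    peak = 0
    for _, _, bytes_delta, allocator_name in sorted(records, key=lambda r: r[0]):
        if allocator is not None and allocator_name != allocator:
            continue  # record belongs to a different allocator
        current += bytes_delta
        peak = max(peak, current)
    return peak


timeline = [
    (10, "a", 400, "cpu"),
    (20, "b", 600, "cpu"),
    (30, "a", -400, "cpu"),
    (40, "c", 300, "cpu"),
]
peak_all = peak_memory_from_timeline(timeline)         # counts every record
peak_gpu = peak_memory_from_timeline(timeline, "gpu")  # no matching records
print(peak_all, peak_gpu)
```

If every record in a CPU-only run is attributed to an allocator that the filter skips, the second call is the situation described above.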

AttributeError: 'NoneType' object has no attribute 'pred'

Hello, I tried using this project with Keras. My code is below:
`
import tqdm
import keras
import numpy as np
import tensorflow as tf
import keras.backend as k
import memory_saving_gradients
from keras.models import Model
from keras.layers import Input,Dense,Bidirectional,Activation,TimeDistributed,GRU,Dropout

k.__dict__["gradients"] = memory_saving_gradients.gradients_memory

inputs=Input((400,len(chars)))

gu1=Bidirectional(GRU(200,activation='relu',kernel_initializer='RandomUniform',
bias_initializer='RandomUniform',recurrent_dropout=0.2,return_sequences=True))(inputs)

gu2=GRU(400,activation='relu',kernel_initializer='RandomUniform',
bias_initializer='RandomUniform',recurrent_dropout=0.2,dropout=0.2,return_sequences=True)(gu1)

d=Dropout(0.3)(gu2)

logits_td=TimeDistributed(Dense(len(chars)))(d)

logits=Activation('softmax')(logits_td)

model=Model(inputs,logits)
model.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['accuracy','categorical_accuracy'])
model.train_on_batch(data_x,data_y)
Note that data_x and data_y both have shape (32, 400, 74), and that I cannot import tensorflow.python.*. The full traceback is:
--------------------------
AttributeError Traceback (most recent call last)
in ()
1 import time
2 t1=time.time()
----> 3 model.train_on_batch(data_1[:32],data_2[:32])
4 print('Batch Training time Approx. '+str(round(time.time()-t1,1)))

/usr/local/lib/lib/python3.4/site-packages/keras/engine/training.py in train_on_batch(self, x, y, sample_weight, class_weight)
1811 else:
1812 ins = x + y + sample_weights
-> 1813 self._make_train_function()
1814 outputs = self.train_function(ins)
1815 if len(outputs) == 1:

/usr/local/lib/lib/python3.4/site-packages/keras/engine/training.py in _make_train_function(self)
988 training_updates = self.optimizer.get_updates(
989 params=self._collected_trainable_weights,
--> 990 loss=self.total_loss)
991 updates = self.updates + training_updates
992 # Gets loss and metrics. Updates weights at each call.

/usr/local/lib/lib/python3.4/site-packages/keras/legacy/interfaces.py in wrapper(*args, **kwargs)
85 warnings.warn('Update your ' + object_name +
86 ' call to the Keras 2 API: ' + signature, stacklevel=2)
---> 87 return func(*args, **kwargs)
88 wrapper._original_function = func
89 return wrapper

/usr/local/lib/lib/python3.4/site-packages/keras/optimizers.py in get_updates(self, loss, params)
413 @interfaces.legacy_get_updates_support
414 def get_updates(self, loss, params):
--> 415 grads = self.get_gradients(loss, params)
416 self.updates = [K.update_add(self.iterations, 1)]
417

/usr/local/lib/lib/python3.4/site-packages/keras/optimizers.py in get_gradients(self, loss, params)
71
72 def get_gradients(self, loss, params):
---> 73 grads = K.gradients(loss, params)
74 if hasattr(self, 'clipnorm') and self.clipnorm > 0:
75 norm = K.sqrt(sum([K.sum(K.square(g)) for g in grads]))

/var/host/media/removable/UNTITLED/seq2seq/memory_saving_gradients.py in gradients_memory(ys, xs, grad_ys, **kwargs)
25
26 def gradients_memory(ys, xs, grad_ys=None, **kwargs):
---> 27 return gradients(ys, xs, grad_ys, checkpoints='memory', **kwargs)
28
29 def gradients_collection(ys, xs, grad_ys=None, **kwargs):

/var/host/media/removable/UNTITLED/seq2seq/memory_saving_gradients.py in gradients(ys, xs, grad_ys, checkpoints, **kwargs)
256 dv = tf_gradients(boundary,
257 checkpoints_disconnected_other+xs,
--> 258 grad_ys=substitute_backprops, **kwargs)
259 debug_print("Got gradients %s", dv)
260 debug_print("for %s", boundary)

/usr/local/lib/lib/python3.4/site-packages/tensorflow/python/ops/gradients_impl.py in gradients(ys, xs, grad_ys, name, colocate_gradients_with_ops, gate_gradients, aggregation_method)
547 # issue here because of zeros.
548 if loop_state:
--> 549 out_grads[i] = loop_state.ZerosLike(op, i)
550 else:
551 out_grads[i] = control_flow_ops.ZerosLikeOutsideLoop(op, i)

/usr/local/lib/lib/python3.4/site-packages/tensorflow/python/ops/control_flow_ops.py in ZerosLike(self, op, index)
1172 if grad_state is None:
1173 # op is not in a while loop that is part of gradients().
-> 1174 return ZerosLikeOutsideLoop(op, index)
1175 op_ctxt = op._get_control_flow_context()
1176 val = ops.convert_to_tensor(op.outputs[index], name="tensor")

/usr/local/lib/lib/python3.4/site-packages/tensorflow/python/ops/control_flow_ops.py in ZerosLikeOutsideLoop(op, index)
1303 else:
1304 op_ctxt = op._get_control_flow_context()
-> 1305 pred = op_ctxt.pred
1306 branch = op_ctxt.branch
1307 switch_val = switch(op.inputs[0], pred)[1 - branch]

AttributeError: 'NoneType' object has no attribute 'pred'`
Thank you.

how to find out which checkpoint nodes were selected with memory heuristic

Hey guys,
Great work, by the way! I was just wondering: is it possible to see which checkpoint nodes were automatically selected by the memory heuristic when using Keras?

Package Maintenance for TF-2.0 with Contrib Module Sunset

I am wondering whether to start using this package, which seems to be an excellent option for building models from scratch (I would also contribute to the package if I need to modify it to fit my model architectures). Do you plan to maintain it for TF 2.0?

graph_editor is going to be removed from the framework, since so far there is no proposal for maintaining it.

Speed mode is much slower than memory mode

Settings:

benchmark: BERT Base
one 32GB V100 GPU
Tensorflow 1.15
CUDA 10.0, cuDNN 7.6.5

The measured time is the average of 10 iterations.

method	iteration time (ms)	memory (GB)
w/o optimization	557.11	16.42
recomputation (speed mode)	1457.91	13.32
recomputation (memory mode)	704.9	7.43

code comes from google-research/bert, with a small modification to adopt gradient checkpointing.
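
To put the table in relative terms, the short sketch below computes the time overhead and memory saving of each mode against the unoptimized baseline. The numbers are copied from the table above; the dictionary layout and the relative_change helper are just illustrative choices.

```python
# Measurements copied from the table above.
baseline = {"time_ms": 557.11, "memory_gb": 16.42}
modes = {
    "speed": {"time_ms": 1457.91, "memory_gb": 13.32},
    "memory": {"time_ms": 704.9, "memory_gb": 7.43},
}


def relative_change(value, reference):
    """Percentage change of `value` relative to `reference`."""
    return (value - reference) / reference * 100.0


summary = {}
for name, measured in modes.items():
    summary[name] = {
        # Positive means slower than the baseline.
        "time_overhead_pct": relative_change(measured["time_ms"], baseline["time_ms"]),
        # Positive means less memory than the baseline.
        "memory_saving_pct": -relative_change(measured["memory_gb"], baseline["memory_gb"]),
    }
    print(name,
          round(summary[name]["time_overhead_pct"], 1),
          round(summary[name]["memory_saving_pct"], 1))
```

On these figures, speed mode costs more time and saves less memory than memory mode, which is the anomaly this issue reports.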

Cannot fit any extra batches into memory than normal

Hi,

I really like what you have developed. I think it will be very useful for models like DenseNet.

I tried it on a Keras model I have been working on. I just copied the "monkey patch" from the Keras test example you've made.

I was not able to see any improvement. I specifically wanted to train with a larger batch size, but I get an out-of-memory error at the same threshold as before applying the patch.

Have I misunderstood what gradient-checkpointing can do? If not, how do I verify that the patch is working?
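
As a way to reason about whether a monkey patch took effect, the sketch below reproduces the mechanism with a stand-in module rather than the real Keras backend. fake_backend, get_gradients, and this gradients_memory are hypothetical stand-ins, not the library's code. The point is that a caller which looks up gradients on the module at call time sees the replacement, and an identity check confirms it, while a reference captured before the patch still points at the old function.

```python
import types

# Stand-in for the Keras backend module (hypothetical, for illustration only).
fake_backend = types.ModuleType("fake_backend")
fake_backend.gradients = lambda loss, params: "original"


def gradients_memory(loss, params):
    """Stand-in for the memory-saving replacement function."""
    return "memory_saving"


def get_gradients(loss, params):
    """Mimics an optimizer that looks up `gradients` at call time."""
    return fake_backend.gradients(loss, params)


# A reference captured before patching keeps pointing at the old function.
captured_early = fake_backend.gradients

# Apply the patch through the module's __dict__.
fake_backend.__dict__["gradients"] = gradients_memory

patch_is_active = fake_backend.gradients is gradients_memory
late_result = get_gradients(None, None)
early_result = captured_early(None, None)
print(patch_is_active, late_result, early_result)
```

With a real backend, the equivalent check would be an identity comparison between the backend's gradients attribute and the replacement, done before the model is compiled. Whether that alone explains an unchanged memory threshold is not something this sketch can show.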

Optimal checkpointing of reverse differentiation has been already solved

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.455.4143&rep=rep1&type=pdf

TF 1.6rc1 Support

Is it included in 1.6rc1, or do we still need to install nightly?

Checkpointing with collections raising exception in Keras

Hi,

A TF newbie here. I am trying to use a pre-trained Keras ResNet-50 model on very large biological images (3000x4000 pixels) of larvae. Because the images are so large, I have to resort to gradient checkpointing.

I have done the requisite monkey patching:

import memory_saving_gradients
K.__dict__['gradients'] = memory_saving_gradients.gradients_collection

I have manually defined a collection of all the checkpoint nodes like so:

[tf.add_to_collection("checkpoints", base_model.get_layer(i).get_output_at(0)) for i in ["add_4", "add_8", "add_12", "add_16"]]

However, I get the following exception:

  File "/home/satish/anaconda3/envs/tensorflow/lib/python3.6/site-packages/keras/optimizers.py", line 244, in get_updates
    grads = self.get_gradients(loss, params)
  File "/home/satish/anaconda3/envs/tensorflow/lib/python3.6/site-packages/keras/optimizers.py", line 78, in get_gradients
    grads = K.gradients(loss, params)
  File "/home/satish/anaconda3/envs/tensorflow/lib/python3.6/site-packages/memory_saving_gradients.py", line 31, in gradients_collection
    return gradients(ys, xs, grad_ys, checkpoints='collection', **kwargs)
  File "/home/satish/anaconda3/envs/tensorflow/lib/python3.6/site-packages/memory_saving_gradients.py", line 185, in gradients
    raise Exception('no checkpoints nodes found or given as input! ')
Exception: no checkpoints nodes found or given as input!

Not sure why the collection is deemed empty.

Would greatly appreciate any feedback.

Thanks
Satish

gradients_memory require more memory than tf.Optimizer.minimize

I would like to use the memory saving gradients to train a U-Net model with bigger patches and/or an increased batch size. I implemented a toy example to assess the memory usage when switching from tf.Optimizer.minimize to the memory saving gradients: https://github.com/gchlebus/gchlebus.github.io/blob/ca55f92d816ebe4659721b61e1a1f4f3b5c3e4f1/code/profiling-tf-models/u_net.py

Surprisingly, I found that the memory saving gradients require more memory than tf.Optimizer.minimize, but less memory than tf.gradients. I queried the peak memory usage using mem_util.py.
Memory usage:

tf.train.AdamOptimizer().minimize(loss): 75 MB
tf.gradients(loss, tf.trainable_variables()) + optimizer.apply_gradients(): 107 MB
gradients_memory(loss, tf.trainable_variables()) + optimizer.apply_gradients(): 96 MB

I have two questions:

How come the memory saving gradients require more memory than tf.train.AdamOptimizer.minimize? Am I using the memory saving gradients incorrectly?
Why does the peak memory usage differ between the 1st and 2nd bullet points? I thought that the minimize function does tf.gradients + optimizer.apply_gradients().

I would greatly appreciate your feedback.

'NoneType' object has no attribute 'op'

I am trying to run the code for my model which uses 3d convolution and fully connected layers.

grads = gradient_memory(train_loss, self.model_variables)
grads = list(zip(grads, self.model_variables))

This should give me the same list as

optimizer.compute_gradients(train_loss, var_list=self.model_variables)

But instead, I get:

File "gradient_checkpointing.py", line 274, in
inputs_to_do_before = [d_checkpoints[r].op for r in ts]
'NoneType' object has no attribute 'op'

Can you help me with this, please?

I have set the checkpoints equal to ts_all.
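
When pairing gradients with variables by hand, it may help to check for missing entries first. The sketch below uses plain Python stand-ins (strings instead of tensors), and pair_gradients is a hypothetical helper. It assumes the gradient list may contain None for a variable the loss does not depend on, as tf.gradients returns, and reports those variables separately. It does not explain the error above, which is raised inside the library itself.

```python
def pair_gradients(grads, variables):
    """Zip gradients with variables, separating out variables with no gradient.

    Returns (pairs, missing): `pairs` holds (gradient, variable) tuples whose
    gradient is not None; `missing` holds variables whose gradient is None.
    """
    if len(grads) != len(variables):
        raise ValueError("grads and variables must have the same length")
    pairs = [(g, v) for g, v in zip(grads, variables) if g is not None]
    missing = [v for g, v in zip(grads, variables) if g is None]
    return pairs, missing


# Stand-in values; in a real model these would be tensors and variables.
grads = ["dW1", None, "db2"]
variables = ["W1", "unused", "b2"]
pairs, missing = pair_gradients(grads, variables)
print(pairs)
print(missing)
```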

A few more citations

This is a great package! Thanks for making it available.

FYI, your README should cite a few more works:

Zweig, Geoffrey and Padmanabhan, Mukund. Exact Alpha-Beta Computation in
Logarithmic Space with Application to MAP Word Graph Construction. Sixth
International Conference on Spoken Language Processing, 2000.

Lewis, Bil. Debugging Backwards in Time. arXiv preprint cs/0310016, 2003.

Does this work with `fit_generator`?

@yaroslavvb Does this work with Keras fit_generator? I saw you used fit, but do you know if it will work with fit_generator?

RNN while loop context error

Problem Description

When the gradient is stopped on a node inside a while loop, a ValueError is thrown. Stack trace:

2018-01-18 16:56:14.252388: I tensorflow/core/platform/s3/aws_logging.cc:53] Initializing Curl library
Traceback (most recent call last):
  File "test.py", line 256, in <module>
    data, data_lengths, target, target_lengths, training, hparams)
  File "test.py", line 52, in __init__
    self.optimize
  File "test.py", line 21, in decorator
    setattr(self, attribute, function(self))
  File "test.py", line 189, in optimize
    self.loss, params, checkpoints="speed")
  File "/u/smithmax/Projects/_/memory_saving_gradients.py", line 197, in gradients
    grad_node = tf.stop_gradient(x, name=x.op.name+"_sg")
  File "/u/smithmax/.conda/envs/tf_nightly3/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 5220, in stop_gradient
    "StopGradient", input=input, name=name)
  File "/u/smithmax/.conda/envs/tf_nightly3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/u/smithmax/.conda/envs/tf_nightly3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3172, in create_op
    op_def=op_def)
  File "/u/smithmax/.conda/envs/tf_nightly3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1659, in __init__
    self._control_flow_post_processing()
  File "/u/smithmax/.conda/envs/tf_nightly3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1668, in _control_flow_post_processing
    control_flow_util.CheckInputFromValidContext(self, input_tensor.op)
  File "/u/smithmax/.conda/envs/tf_nightly3/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_util.py", line 260, in CheckInputFromValidContext
    raise ValueError(error_msg + " See info log for more details.")
ValueError: Cannot use 'decoder/while/BasicDecoderStep/gru_cell/MatMul/Enter' as input to 'decoder/while/BasicDecoderStep/gru_cell/MatMul/Enter_sg' because 'decoder/while/BasicDecoderStep/gru_cell/MatMul/Enter' is in a while loop. See info log for more details.

I'm not familiar with graph editing, so I have no idea where to start. :(
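
One thing that could be tried, as an untested assumption rather than a confirmed fix, is to keep tensors that live inside a while loop out of the checkpoint candidates. The sketch below filters by name only, treating any name with a `while` path component as loop-internal. is_in_while_loop is a hypothetical helper, and the candidate names other than the one taken from the error message are made up for illustration.

```python
def is_in_while_loop(tensor_name):
    """Heuristic: a name with a 'while' (or 'while_N') path component is loop-internal."""
    parts = tensor_name.split("/")
    return any(p == "while" or p.startswith("while_") for p in parts)


candidates = [
    # Name taken from the error message above.
    "decoder/while/BasicDecoderStep/gru_cell/MatMul/Enter",
    # Hypothetical names of tensors outside any loop.
    "embedding_lookup",
    "concat",
]

outside_loop = [name for name in candidates if not is_in_while_loop(name)]
print(outside_loop)
```

A name-based filter is crude: it relies on default scope names and says nothing about whether the remaining tensors make good checkpoints.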

System Information

Python version: 3.6.0 (64-bit)
OS Platform: Debian 4.9.65 x86_x64
TF version: tf-nightly-gpu==1.6.0.dev20180117

Code to Reproduce

This is just a seq2seq graph class, sorry that it's a little messy.

There are various lines for different potential checkpoints commented out, and only one is currently uncommented. It doesn't appear to work for any of the checkpoint candidates.

import functools
import tensorflow as tf

import memory_saving_gradients as memory_saving_gradients


_CHECK = "checkpoints"


def lazy_loading_property(function):
    """ Lazy loading decorator.

    Source: https://danijar.com/structuring-your-tensorflow-models/
    """
    attribute = "_cache_" + function.__name__

    @property
    @functools.wraps(function)
    def decorator(self):
        if not hasattr(self, attribute):
            setattr(self, attribute, function(self))
        return getattr(self, attribute)

    return decorator


class RNNGraph:

    def __init__(
            self, data, data_lengths, target, target_lengths,
            training, hyperparams):
        self.data = data
        self.data_lengths = data_lengths
        self.target = target
        self.target_lengths = target_lengths
        self.training = training
        self._hyperparams = hyperparams

        # defined in time for `tf.initialize_variables()`.
        self.inference
        self.optimize
        self.loss

    @lazy_loading_property
    def encoder_cells(self):
        """ Encoding cells.

        :return: `RNNCell` object.
        """
        encoder_cells = tf.nn.rnn_cell.GRUCell(
            self._hyperparams["num_units"],
            kernel_initializer=self._hyperparams["initializer"],
            bias_initializer=self._hyperparams["initializer"])
        return encoder_cells

    @lazy_loading_property
    def decoder_cells(self):
        """ Decoding cells.

        :return: `RNNCell` object.
        """
        decoder_cells = tf.nn.rnn_cell.GRUCell(
            self._hyperparams["num_units"],
            kernel_initializer=self._hyperparams["initializer"],
            bias_initializer=self._hyperparams["initializer"])
        return decoder_cells

    @lazy_loading_property
    def inference(self, open_loop=False):
        """ Perform inference on the graph.

        :param open_loop:
        :return:
        """
        # Remove the end-of-sentence tag from targets.
        target = tf.slice(
            self.target, [0, 0, 0], [-1, tf.shape(self.target)[1]-1, -1])

        # Split the turn meta-info from the token ID.
        # [B, S, 8] --> [B, S, 7], [B, S, 1].
        data_meta, data_tokens = tf.split(self.data, [7, 1], axis=2)
        target_meta, target_tokens = tf.split(target, [7, 1], axis=2)

        # Embedding.
        self.embedding = tf.get_variable(
                "embedding",
                [self._hyperparams["vocab_size"],
                 self._hyperparams["embedding_size"]])

        # Look up embeddings.
        # Embeddings are shape: [B, S, 1, 300].
        encoder_inputs_embedded = tf.nn.embedding_lookup(
            self.embedding, tf.cast(data_tokens, tf.int32))
        decoder_inputs_embedded = tf.nn.embedding_lookup(
            self.embedding, tf.cast(target_tokens, tf.int32))

        # Remove '1' dimension from embeddings: [B, S, 1, 300] --> [B, S, 300].
        encoder_inputs_embedded = tf.squeeze(encoder_inputs_embedded, [2])
        decoder_inputs_embedded = tf.squeeze(decoder_inputs_embedded, [2])

        tf.add_to_collection(_CHECK, encoder_inputs_embedded)
        # tf.add_to_collection(_CHECK, decoder_inputs_embedded)

        # Merge the meta-info onto the token's embedding.
        # [B, S, 7] + [B, S, 300] --> [B, S, 307].
        encoder_inputs_embedded = tf.concat(
            [data_meta, encoder_inputs_embedded], axis=2)
        decoder_inputs_embedded = tf.concat(
            [target_meta, decoder_inputs_embedded], axis=2)

        # tf.add_to_collection(_CHECK, encoder_inputs_embedded)
        # tf.add_to_collection(_CHECK, decoder_inputs_embedded)

        # Run Dynamic RNN:
        #   encoder_outputs: [batch_size, max_time, num_units]
        #   encoder_states: [batch_size, num_units]
        with tf.variable_scope("gru_graph", reuse=tf.AUTO_REUSE):
            encoder_initial_state = self.encoder_cells.zero_state(
                tf.shape(self.data)[0], tf.float32)

        self.encoder_outputs, self.encoder_states = tf.nn.dynamic_rnn(
            self.encoder_cells, encoder_inputs_embedded, self.data_lengths,
            initial_state=encoder_initial_state)

        # tf.add_to_collection(_CHECK, self.encoder_outputs)
        # tf.add_to_collection(_CHECK, self.encoder_states)

        with tf.variable_scope("gru_graph", reuse=tf.AUTO_REUSE):
            # Vocabulary projection layer.
            projection_layer = tf.layers.Dense(
                self._hyperparams["vocab_size"], name="projection")

        # Decode.
        if open_loop:
            helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(
                self.embedding,
                tf.fill([self.data.get_shape()[0]], "<SOS>"),
                "<EOS>")
            # Decoder.
            decoder = tf.contrib.seq2seq.BasicDecoder(
                self.decoder_cells,
                helper,
                self.decoder_state_inputs,
                output_layer=projection_layer)
            decoder_outputs, decoder_states, decoder_output_lengths = \
                tf.contrib.seq2seq.dynamic_decode(decoder)
            logits = decoder_outputs.sample_id
        else:
            # Subtract one from target lengths because we've stripped EOS.
            helper = tf.contrib.seq2seq.TrainingHelper(
                decoder_inputs_embedded, self.target_lengths-1)
            # Decoder.
            decoder = tf.contrib.seq2seq.BasicDecoder(
                self.decoder_cells,
                helper,
                self.decoder_state_inputs,
                output_layer=projection_layer)
            decoder_outputs, decoder_states, decoder_output_lengths = \
                tf.contrib.seq2seq.dynamic_decode(decoder)
            logits = decoder_outputs.rnn_output

        # tf.add_to_collection(_CHECK, decoder_outputs)
        # tf.add_to_collection(_CHECK, decoder_states)

        return logits

    @lazy_loading_property
    def optimize(self):
        """ Create an operation to perform an update step on the network.

        :return: Update step operation.
        """
        optimizer = tf.train.AdamOptimizer(self._hyperparams["learning_rate"])
        params = tf.trainable_variables()

        if self._hyperparams["gradient_checkpointing"]:
            gradients = memory_saving_gradients.gradients(
                self.loss, params, checkpoints="speed")
        else:
            gradients = tf.gradients(self.loss, params)

        # Gradient clipping.
        clipped_gradients, _ = tf.clip_by_global_norm(
            gradients, self._hyperparams["gradient_clip"])
        update_operation = optimizer.apply_gradients(
            zip(clipped_gradients, params))

        return update_operation

    @lazy_loading_property
    def loss(self):
        """ Calculate the loss from the decoded sequence.

        :return: Cross entropy loss (scalar).
        """
        # Set all valid timesteps to `1`, and padded timesteps to `0`.
        weights = tf.cast(tf.sequence_mask(self.target_lengths-1), tf.float32)

        # Create decoder output, it is shifted left once.
        _, target_tokens = tf.split(self.target, [7, 1], axis=2)
        target_tokens = tf.cast(tf.squeeze(target_tokens, [2]), tf.int32)
        target_tokens = tf.slice(target_tokens, [0, 1], [-1, -1])

        return tf.contrib.seq2seq.sequence_loss(
            self.inference, target_tokens, weights=weights)

    @lazy_loading_property
    def decoder_state_inputs(self):
        return self.encoder_states

    @lazy_loading_property
    def total_input_size(self):
        total = self._hyperparams["num_units"] + 7
        return total


if __name__ == "__main__":
    with tf.device("/gpu:0"):
        data = tf.placeholder(tf.float32, [None, None, 8], name="input")
        data_lengths = tf.placeholder(tf.int32, [None], name="input_lengths")
        target = tf.placeholder(tf.float32, [None, None, 8], name="target")
        target_lengths = tf.placeholder(
            tf.int32, [None], name="target_lengths")
        training = tf.placeholder(tf.bool, [], name="training")

    hparams = {
        "num_units": 300,
        "embedding_size": 300,
        # Optimization
        "learning_rate": 0.003,
        "gradient_clip": 5,
        "gradient_checkpointing": True,
        # Hardcoded.
        "vocab_size": 27000,
        "batch_size": 2,
        "activation": tf.nn.relu,
        "initializer": tf.contrib.layers.xavier_initializer(),
    }

    model = RNNGraph(
        data, data_lengths, target, target_lengths, training, hparams)
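
To make the caching behaviour of lazy_loading_property concrete, here is a small sketch that runs without TensorFlow. The decorator is copied from the listing above; the Counter class is a hypothetical example. It shows that the wrapped function body runs once and that later accesses return the cached value, which is why each piece of the graph in RNNGraph is built a single time.

```python
import functools


def lazy_loading_property(function):
    """Lazy loading decorator, as in the listing above."""
    attribute = "_cache_" + function.__name__

    @property
    @functools.wraps(function)
    def decorator(self):
        # Compute and store the value on first access only.
        if not hasattr(self, attribute):
            setattr(self, attribute, function(self))
        return getattr(self, attribute)

    return decorator


class Counter:
    """Hypothetical class that counts how often the property body runs."""

    def __init__(self):
        self.calls = 0

    @lazy_loading_property
    def value(self):
        self.calls += 1
        return 42


counter = Counter()
first = counter.value   # runs the body
second = counter.value  # served from the cache
print(first, second, counter.calls)
```

Because the result is exposed as a property, it is accessed without parentheses and cannot take arguments at access time.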

Rerunning from Checkpoint Gives Error

When attempting to restart learning from a checkpoint, I get the following error (please let me know what other information I can provide):

Traceback (most recent call last):
  File "C:\Program Files\Python36\lib\site-packages\tensorflow\python\client\session.py", line 1356, in _do_call
    return fn(*args)
  File "C:\Program Files\Python36\lib\site-packages\tensorflow\python\client\session.py", line 1341, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "C:\Program Files\Python36\lib\site-packages\tensorflow\python\client\session.py", line 1429, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.FailedPreconditionError: 2 root error(s) found.
  (0) Failed precondition: Attempting to use uninitialized value conv01/kernel
         [[{{node conv01/kernel}}]]
         [[conv01/kernel/_3]]
  (1) Failed precondition: Attempting to use uninitialized value conv01/kernel
         [[{{node conv01/kernel}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main_run_0008.py", line 1599, in <module>
    main()
  File "main_run_0008.py", line 1519, in main
    'model_init_' + hparam_str[0] + '_run_{:04d}_'.format(run[0]) + ts + '.h5') )
  File "C:\Program Files\Python36\lib\site-packages\keras\engine\network.py", line 1090, in save
    save_model(self, filepath, overwrite, include_optimizer)
  File "C:\Program Files\Python36\lib\site-packages\keras\engine\saving.py", line 382, in save_model
    _serialize_model(model, f, include_optimizer)
  File "C:\Program Files\Python36\lib\site-packages\keras\engine\saving.py", line 97, in _serialize_model
    weight_values = K.batch_get_value(symbolic_weights)
  File "C:\Program Files\Python36\lib\site-packages\keras\backend\tensorflow_backend.py", line 2420, in batch_get_value
    return get_session().run(ops)
  File "C:\Program Files\Python36\lib\site-packages\tensorflow\python\client\session.py", line 950, in run
    run_metadata_ptr)
  File "C:\Program Files\Python36\lib\site-packages\tensorflow\python\client\session.py", line 1173, in _run
    feed_dict_tensor, options, run_metadata)
  File "C:\Program Files\Python36\lib\site-packages\tensorflow\python\client\session.py", line 1350, in _do_run
    run_metadata)
  File "C:\Program Files\Python36\lib\site-packages\tensorflow\python\client\session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.FailedPreconditionError: 2 root error(s) found.
  (0) Failed precondition: Attempting to use uninitialized value conv01/kernel
         [[node conv01/kernel (defined at C:\Program Files\Python36\lib\site-packages\keras\backend\tensorflow_backend.py:402) ]]
         [[conv01/kernel/_3]]
  (1) Failed precondition: Attempting to use uninitialized value conv01/kernel
         [[node conv01/kernel (defined at C:\Program Files\Python36\lib\site-packages\keras\backend\tensorflow_backend.py:402) ]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'conv01/kernel':
  File "main_run_0008.py", line 1599, in <module>
    main()
  File "main_run_0008.py", line 1390, in main
    train_model = load_model( args.checkpoint )
  File "C:\Program Files\Python36\lib\site-packages\keras\engine\saving.py", line 419, in load_model
    model = _deserialize_model(f, custom_objects, compile)
  File "C:\Program Files\Python36\lib\site-packages\keras\engine\saving.py", line 225, in _deserialize_model
    model = model_from_config(model_config, custom_objects=custom_objects)
  File "C:\Program Files\Python36\lib\site-packages\keras\engine\saving.py", line 458, in model_from_config
    return deserialize(config, custom_objects=custom_objects)
  File "C:\Program Files\Python36\lib\site-packages\keras\layers\__init__.py", line 55, in deserialize
    printable_module_name='layer')
  File "C:\Program Files\Python36\lib\site-packages\keras\utils\generic_utils.py", line 145, in deserialize_keras_object
    list(custom_objects.items())))
  File "C:\Program Files\Python36\lib\site-packages\keras\engine\network.py", line 1032, in from_config
    process_node(layer, node_data)
  File "C:\Program Files\Python36\lib\site-packages\keras\engine\network.py", line 991, in process_node
    layer(unpack_singleton(input_tensors), **kwargs)
  File "C:\Program Files\Python36\lib\site-packages\keras\engine\base_layer.py", line 431, in __call__
    self.build(unpack_singleton(input_shapes))
  File "C:\Program Files\Python36\lib\site-packages\keras\layers\convolutional.py", line 141, in build
    constraint=self.kernel_constraint)
  File "C:\Program Files\Python36\lib\site-packages\keras\legacy\interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "C:\Program Files\Python36\lib\site-packages\keras\engine\base_layer.py", line 252, in add_weight
    constraint=constraint)
  File "C:\Program Files\Python36\lib\site-packages\keras\backend\tensorflow_backend.py", line 402, in variable
    v = tf.Variable(value, dtype=tf.as_dtype(dtype), name=name)
  File "C:\Program Files\Python36\lib\site-packages\tensorflow\python\ops\variables.py", line 259, in __call__
    return cls._variable_v1_call(*args, **kwargs)
  File "C:\Program Files\Python36\lib\site-packages\tensorflow\python\ops\variables.py", line 220, in _variable_v1_call
    shape=shape)
  File "C:\Program Files\Python36\lib\site-packages\tensorflow\python\ops\variables.py", line 198, in <lambda>
    previous_getter = lambda **kwargs: default_variable_creator(None, **kwargs)
  File "C:\Program Files\Python36\lib\site-packages\tensorflow\python\ops\variable_scope.py", line 2511, in default_variable_creator
    shape=shape)
  File "C:\Program Files\Python36\lib\site-packages\tensorflow\python\ops\variables.py", line 263, in __call__
    return super(VariableMetaclass, cls).__call__(*args, **kwargs)
  File "C:\Program Files\Python36\lib\site-packages\tensorflow\python\ops\variables.py", line 1568, in __init__
    shape=shape)
  File "C:\Program Files\Python36\lib\site-packages\tensorflow\python\ops\variables.py", line 1728, in _init_from_args
    name=name)
  File "C:\Program Files\Python36\lib\site-packages\tensorflow\python\ops\state_ops.py", line 79, in variable_op_v2
    shared_name=shared_name)
  File "C:\Program Files\Python36\lib\site-packages\tensorflow\python\ops\gen_state_ops.py", line 2024, in variable_v2
    shared_name=shared_name, name=name)
  File "C:\Program Files\Python36\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "C:\Program Files\Python36\lib\site-packages\tensorflow\python\util\deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "C:\Program Files\Python36\lib\site-packages\tensorflow\python\framework\ops.py", line 3616, in create_op
    op_def=op_def)
  File "C:\Program Files\Python36\lib\site-packages\tensorflow\python\framework\ops.py", line 2005, in __init__
    self._traceback = tf_stack.extract_stack()

tf.GradientTape and higher order derivatives

Hi,
Should this work, perhaps with manual node selection, on trainable variables and with second- and higher-order derivatives?

Is it possible to implement in dlib?

Hi, is it possible to implement in dlib?

Limiting memory usage via GPUOptions conflicts with is_gpu_available

TF version tensorflow-gpu==1.14.0

By default, TF allocates almost all of the GPU memory.

I tried to limit the GPU memory allocated by TF via
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.5)

But it had no effect. This SO post explains that any function calling device_lib.list_local_devices() allocates all the GPU memory on all of the devices.

After commenting out all calls to tf.test.is_gpu_available(), the GPUOptions setting above works. This is not a problem with gradient-checkpointing but unexpected behavior in TF. I am leaving an issue here in case anyone runs into the same problem.

OOM when using gradients_memory's list of checkpointed tensors

First, thanks very much for this contribution!

I run my code using the gradients_memory() method, which indeed works.
I then try to get the bottleneck tensors it found, by taking the "Checkpoint nodes used" list. I add them manually to a "checkpoints" collection, and then use gradients_collection(). Then, I get OOM.

I don't understand why that is so - if I take the same tensors, shouldn't the behaviour be the same?

Extremely slow when running on distributed tensorflow with horovod

We are trying to use gradient-checkpointing with distributed TensorFlow (Horovod: https://github.com/uber/horovod).

configuration as follows:
model: resnet50
input: synthetic data with 1k class number
horovod: horovod-0.12.1-py3.6-linux-x86_64 with NCCL2
tensorflow: 1.8.0
gradient-checkpointing: memory
experiment:

1GPU on P40
8GPU on P40, each GPU occupied by one MPI process.

The result on a single GPU looks promising: memory usage drops from 7820.39 MB to 3580.35 MB, while training speed drops from 152.51 examples/sec to 115.68 examples/sec. The result on multiple (8) GPUs is not good, however. Memory usage drops from 7914.49 MB to 3811.62 MB, which is as expected,
but training speed drops from 143.63 examples/sec per GPU to 6.71 examples/sec.

Could anyone give us a hint on solving this issue?
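
To put the figures above in relative terms, here is a short calculation that uses only the numbers quoted in this report. The helper name `relative_drop` is introduced here for illustration.

```python
def relative_drop(before, after):
    """Fractional decrease from `before` to `after` (0.25 means 25% lower)."""
    return (before - after) / before


# Figures quoted in the report above (MB, and examples/sec per GPU).
single_gpu = {"memory": (7820.39, 3580.35), "speed": (152.51, 115.68)}
eight_gpu = {"memory": (7914.49, 3811.62), "speed": (143.63, 6.71)}

for label, run in (("1 GPU", single_gpu), ("8 GPUs", eight_gpu)):
    mem = relative_drop(*run["memory"])
    spd = relative_drop(*run["speed"])
    print(f"{label}: memory -{mem:.1%}, throughput -{spd:.1%}")

# Per-GPU slowdown factor in the 8-GPU run.
slowdown_8gpu = eight_gpu["speed"][0] / eight_gpu["speed"][1]
print(f"8-GPU slowdown factor: {slowdown_8gpu:.1f}x")
```

The memory saving is roughly half in both setups, while the throughput penalty grows from about a quarter on one GPU to a slowdown of more than twenty times on eight. That gap suggests the extra cost is tied to the multi-GPU setup rather than to recomputation alone, although the report does not contain enough information to say where it comes from.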

memory_test.py failed

Hey guys,

Your work is great!

I tried to run the tests but failed. The error message is (I modified the code to print peak_memory):

Traceback (most recent call last):
File "memory_test.py", line 677, in <module>
test_chain()
File "memory_test.py", line 119, in test_chain
assert peak_memory > 2e6, peak_memory
AssertionError: 0

Please help me if you can. :)

Using gradient checkpointing with an optimizer

I posted this as an issue before (#4); however, neither of the suggestions appears to work. I get the same error with both of the suggested methods:

File "last-lstm-batchnorm.py", line 543, in <module>
main()
File "last-lstm-batchnorm.py", line 486, in main
autoenc = Autoencoder(seq_len=SEQ_LEN, num_classes=NUM_CLASSES, embedding_dim=EMBEDDED_DIM)
File "last-lstm-batchnorm.py", line 93, in __init__
self.optimize
File "/usr/local/lib/python3.5/dist-packages/lazy_property/__init__.py", line 27, in __get__
result = self.method(instance)
File "last-lstm-batchnorm.py", line 214, in optimize
grads = tf.gradients(self.objective, tf.trainable_variables())
File "/data/oscar/rxu/featurelearning/memory_saving_gradients.py", line 27, in gradients_memory
return gradients(ys, xs, grad_ys, checkpoints='memory', **kwargs)
File "/data/oscar/rxu/featurelearning/memory_saving_gradients.py", line 258, in gradients
grad_ys=substitute_backprops, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gradients_impl.py", line 516, in gradients
colocate_gradients_with_ops)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gradients_impl.py", line 192, in _PendingCount
between_op_list, between_ops, colocate_gradients_with_ops)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 1348, in MaybeCreateControlFlowState
loop_state.AddWhileContext(op, between_op_list, between_ops)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 1157, in AddWhileContext
outer_forward_ctxt = forward_ctxt.outer_context
AttributeError: 'NoneType' object has no attribute 'outer_context'

where I am doing the following to update the weights (for the second suggestion):

@lazyprop.LazyProperty
def objective(self):
    reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
    logits, _ = self.inference
    xentropy = tf.losses.sparse_softmax_cross_entropy(self.labels, logits)
    return xentropy + sum(reg_losses)

@lazyprop.LazyProperty
def optimize(self):
    update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
    with tf.control_dependencies(update_ops):
        optimizer = tf.train.AdamOptimizer(self.learning_rate)
        # optimizer_op = optimizer.minimize(self.objective, name='optimizer')
        grads = tf.gradients(self.objective, tf.trainable_variables())
        grads_and_vars = list(zip(grads, tf.trainable_variables()))
        train_op = optimizer.apply_gradients(grads_and_vars)
    # Return the op built above (optimizer_op is only defined in the commented-out line).
    return train_op

and the following for the first suggestion:

@lazyprop.LazyProperty
def objective(self):
    reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
    logits, _ = self.inference
    xentropy = tf.losses.sparse_softmax_cross_entropy(self.labels, logits)
    return xentropy + sum(reg_losses)

@lazyprop.LazyProperty
def optimize(self):    
    update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
    with tf.control_dependencies(update_ops):
        optimizer = tf.train.AdamOptimizer(self.learning_rate)
        optimizer_op = optimizer.minimize(self.objective, name='optimizer')
    return optimizer_op

Some high level descriptions on the MEMORY policy will be very helpful

I am trying to understand the heuristic used in the memory policy, but I could not fully follow the logic, especially the if statement shown below.

gradient-checkpointing/memory_saving_gradients.py

Line 143 in 43444e0

if not set(b_inp).intersection(f_inp) and len(b_inp)+len(f_inp) >= len(ts_all):

Some explanations or guidance will be highly appreciated.

Thanks.
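
One way to read the quoted condition is as a test for a "bottleneck" tensor: no tensor may be read on both sides of the candidate (that would be a shortcut around it), and the two sides together must account for every candidate tensor. The sketch below is a toy model on plain Python dicts, not the library's code. The names `b_inp`, `f_inp` and `ts_all` come from the quoted line, but how they are built here is an assumption made for illustration: tensors read by ops at or upstream of the candidate, tensors read by ops downstream of it, and all tensors read by any op.

```python
def ancestors(node, inputs):
    """All nodes reachable by following input edges backward from `node`."""
    seen, stack = set(), list(inputs.get(node, []))
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(inputs.get(n, []))
    return seen


def descendants(node, inputs):
    """All nodes that (transitively) consume `node`."""
    return {n for n in inputs if node in ancestors(n, inputs)}


def is_bottleneck(t, inputs):
    # ts_all: every tensor that is consumed by some op (toy assumption).
    ts_all = {i for ins in inputs.values() for i in ins}
    upstream = ancestors(t, inputs) | {t}
    downstream = descendants(t, inputs)
    # Tensors read by ops at or before t, and by ops after t.
    b_inp = {i for n in upstream for i in inputs.get(n, [])} & ts_all
    f_inp = {i for n in downstream for i in inputs.get(n, [])} & ts_all
    return not (b_inp & f_inp) and len(b_inp) + len(f_inp) >= len(ts_all)


# inputs[n] lists the tensors consumed to produce n.
chain = {"a": ["x"], "b": ["a"], "c": ["b"], "loss": ["c"]}
skip = {"a": ["x"], "b": ["a"], "c": ["b", "a"], "loss": ["c"]}

print(is_bottleneck("b", chain))  # True: b separates the graph
print(is_bottleneck("b", skip))   # False: the a -> c edge bypasses b
```

In the `skip` graph, tensor `a` is read both upstream of `b` (to produce `b`) and downstream of it (to produce `c`), so the intersection is non-empty and `b` is rejected.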

TF while loop error

I'm trying to apply this awesome tool to a BERT model, but it doesn't seem to work with TF while loops. The model code is basically the same as https://github.com/CLUEbenchmark/CLUENER2020/blob/master/tf_version/modeling.py, except that I add every sqrt(num_hidden_layers)-th hidden layer output to the collection with tf.add_to_collection('checkpoints', layer_output). When I run training, I get this error message: "ValueError: Cannot use 'loss/rnn/while/TensorArrayReadV3/Enter' as input to 'loss/rnn/while/TensorArrayReadV3_1' because 'loss/rnn/while/TensorArrayReadV3/Enter' is in a whileloop. See info log for more details." Could you please help me solve this problem?

Grad CAM Map With Memory Saving Looks Odd

@yaroslavvb Have you tried to create grad cam or saliency maps after using memory_saving_gradients? Mine look very odd (columnar). I'm doing the Udacity program, and my keras-vis grad-cam plots look like this:

Does this package work for tensorflow 1.15?

The last commit was 2 years ago, so maybe this package does not support TensorFlow 1.15? Could anyone confirm? It does not work in my code with TF 1.15, and I need to know whether that is because of the TensorFlow version or my own code.

Gradient checkpointing seems to conflict with Keras batch norm

I tried this out but get an error when computing the gradients with the provided function using manually selected checkpoints. I get three different errors at the same time and am not sure which part of my graph is actually causing them, so I would appreciate some hints so that I can come up with a minimal non-working example. I currently use TF 1.13.1 and, in particular, tf.keras.layers.BatchNormalization (I mention this because it shows up in the error message). Is there any hope that this would be an easy fix?

Traceback (most recent call last):                                                                                                                             
  File "/lhome/davidj2/code/sync/space_and_deformable_time/.venv/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 415, in _MaybeCompile                    
    xla_compile = op.get_attr("_XlaCompile")                                                                                                                              
  File "/lhome/davidj2/code/sync/space_and_deformable_time/.venv/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2413, in get_attr
    raise ValueError(str(e))                                                                                                                          
ValueError: Operation 'optimizer/head/convolve_batch_activate_20/batch_normalization_v1_21/cond/ReadVariableOp_1/Switch' has no attr named '_XlaCompile'.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/lhome/davidj2/code/sync/space_and_deformable_time/.venv/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 455, in _apply_op_helper
    as_ref=input_arg.is_ref)
  File "/lhome/davidj2/code/sync/space_and_deformable_time/.venv/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1240, in internal_convert_n_to_tensor
    ctx=ctx))
  File "/lhome/davidj2/code/sync/space_and_deformable_time/.venv/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1175, in internal_convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "/lhome/davidj2/code/sync/space_and_deformable_time/.venv/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 977, in _TensorTensorConversionFunction
    (dtype.name, t.dtype.name, str(t)))
ValueError: Tensor conversion requested dtype float32 for Tensor with dtype resource: 'Tensor("optimizer/gradients/optimizer/head/convolve_batch_activate_20/batch_normalization_v1_21/cond/ReadVariableOp_1/Switch_grad/Switch_1:1", shape=(), dtype=resource)'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./src/sadt.py", line 544, in <module>
    with SpaceAndDeformableTimeNetwork(cfg, datasets) as exp:
  File "/lhome/davidj2/code/sync/space_and_deformable_time/src/xxsflow/experiments/base_experiment.py", line 42, in __enter__
    self.build_graph()
  File "./src/sadt.py", line 298, in build_graph
    self.optimizer_op = self.optimizer
  File "/lhome/davidj2/code/sync/space_and_deformable_time/src/xxsflow/utils.py", line 388, in wrapped_function
    setattr(self, attribute, function(self))
  File "./src/sadt.py", line 267, in optimizer
    grads = grads = tf.gradients(self.loss, tf.trainable_variables())
  File "/lhome/davidj2/code/sync/space_and_deformable_time/packages/gradient_checkpointing/memory_saving_gradients.py", line 40, in gradients_collection
    return gradients(ys, xs, grad_ys, checkpoints='collection', **kwargs)
  File "/lhome/davidj2/code/sync/space_and_deformable_time/packages/gradient_checkpointing/memory_saving_gradients.py", line 227, in gradients
    dv = tf_gradients(ys=copied_ys, xs=boundary+xs, grad_ys=grad_ys, **kwargs)
  File "/lhome/davidj2/code/sync/space_and_deformable_time/packages/gradient_checkpointing/memory_saving_gradients.py", line 27, in tf_gradients
    return tf_gradient_function(ys, *args, **kwargs)
  File "/lhome/davidj2/code/sync/space_and_deformable_time/.venv/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 664, in gradients
    unconnected_gradients)
  File "/lhome/davidj2/code/sync/space_and_deformable_time/.venv/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 965, in _GradientsHelper
    lambda: grad_fn(op, *out_grads))
  File "/lhome/davidj2/code/sync/space_and_deformable_time/.venv/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 420, in _MaybeCompile
    return grad_fn()  # Exit early
  File "/lhome/davidj2/code/sync/space_and_deformable_time/.venv/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 965, in <lambda>
    lambda: grad_fn(op, *out_grads))
  File "/lhome/davidj2/code/sync/space_and_deformable_time/.venv/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_grad.py", line 88, in _SwitchGrad
    return merge([false_grad, true_grad])[0], None
  File "/lhome/davidj2/code/sync/space_and_deformable_time/.venv/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 466, in merge
    return gen_control_flow_ops.merge(inputs, name)
  File "/lhome/davidj2/code/sync/space_and_deformable_time/.venv/lib/python3.6/site-packages/tensorflow/python/ops/gen_control_flow_ops.py", line 418, in merge
    "Merge", inputs=inputs, name=name)
  File "/lhome/davidj2/code/sync/space_and_deformable_time/.venv/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 483, in _apply_op_helper
    raise TypeError("%s that don't all match." % prefix)
TypeError: Tensors in list passed to 'inputs' of 'Merge' Op have types [float32, resource] that don't all match.

How to select nodes for collections

I add all tf.nn.relu outputs to 'checkpoints', and I also have a weights_regularizer. I don't save any memory, so is there something wrong?
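
A rough way to see why checkpointing every activation may save nothing: checkpointed tensors stay in memory, so only the layers between checkpoints are candidates for recomputation. The sketch below is a simplified cost model in plain Python, not the library's accounting. The helper names `pick_checkpoints` and `peak_activations` are hypothetical, and the model assumes a plain chain of equally sized layers.

```python
import math


def pick_checkpoints(num_layers, stride=None):
    """Indices of layers to checkpoint, spaced `stride` apart.

    Defaults to a stride of about sqrt(num_layers).
    """
    if stride is None:
        stride = max(1, round(math.sqrt(num_layers)))
    return list(range(stride - 1, num_layers - 1, stride))


def peak_activations(num_layers, checkpoints):
    """Toy estimate: stored checkpoints plus the longest recomputed segment."""
    bounds = [-1] + sorted(checkpoints) + [num_layers - 1]
    longest_segment = max(b - a for a, b in zip(bounds, bounds[1:]))
    return len(checkpoints) + longest_segment


n = 100
every_layer = list(range(n))
sqrt_spaced = pick_checkpoints(n)

print(peak_activations(n, []))           # 100: no checkpoints
print(peak_activations(n, every_layer))  # 101: checkpoint everything
print(peak_activations(n, sqrt_spaced))  # 19: checkpoints about sqrt(n) apart
```

Under this model, marking every layer is no better than marking none, while a sparse selection gives a clear reduction. The selected indices would then be the layers whose outputs are passed to tf.add_to_collection('checkpoints', ...).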

core dump

I updated tensorflow to 1.5, with cuda 9 and cudnn 7.
After installing tf-nightly-gpu, I got a core dump:

$python -c 'import tensorflow '
Segmentation fault (core dumped)

Program terminated with signal 11, Segmentation fault.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `python -c import tensorflow '.
Program terminated with signal 11, Segmentation fault.
#0 0x00007fa181219976 in _dl_relocate_object () from /lib64/ld-linux-x86-64.so.2
Missing separate debuginfos, use: debuginfo-install python-2.7.5-58.el7.x86_64
(gdb) bt
#0 0x00007fa181219976 in _dl_relocate_object () from /lib64/ld-linux-x86-64.so.2
#1 0x00007fa181221b3c in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
#2 0x00007fa18121d1b4 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#3 0x00007fa1812211ab in dl_open () from /lib64/ld-linux-x86-64.so.2
#4 0x00007fa180a2302b in dlopen_doit () from /lib64/libdl.so.2
#5 0x00007fa18121d1b4 in dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#6 0x00007fa180a2362d in dlerror_run () from /lib64/libdl.so.2
#7 0x00007fa180a230c1 in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2
#8 0x00007fa179a261f1 in py_dl_open () from /usr/lib64/python2.7/lib-dynload/ctypes.so
#9 0x00007fa180f26bb0 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#10 0x00007fa180f28efd in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#11 0x00007fa180eb2858 in function_call () from /lib64/libpython2.7.so.1.0
#12 0x00007fa180e8d9a3 in PyObject_Call () from /lib64/libpython2.7.so.1.0
#13 0x00007fa180e9c995 in instancemethod_call () from /lib64/libpython2.7.so.1.0
#14 0x00007fa180e8d9a3 in PyObject_Call () from /lib64/libpython2.7.so.1.0
#15 0x00007fa180ee4947 in slot_tp_init () from /lib64/libpython2.7.so.1.0
#16 0x00007fa180ee365f in type_call () from /lib64/libpython2.7.so.1.0
#17 0x00007fa180e8d9a3 in PyObject_Call () from /lib64/libpython2.7.so.1.0
#18 0x00007fa180f220f6 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#19 0x00007fa180f28efd in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#20 0x00007fa180f29002 in PyEval_EvalCode () from /lib64/libpython2.7.so.1.0
#21 0x00007fa180f38dec in PyImport_ExecCodeModuleEx () from /lib64/libpython2.7.so.1.0
#22 0x00007fa180f39068 in load_source_module () from /lib64/libpython2.7.so.1.0
#23 0x00007fa180f39d01 in import_submodule () from /lib64/libpython2.7.so.1.0
#24 0x00007fa180f39fe6 in load_next () from /lib64/libpython2.7.so.1.0
#25 0x00007fa180f3a92e in PyImport_ImportModuleLevel () from /lib64/libpython2.7.so.1.0
#26 0x00007fa180f1dbdf in builtin___import () from /lib64/libpython2.7.so.1.0
#27 0x00007fa180e8d9a3 in PyObject_Call () from /lib64/libpython2.7.so.1.0
#28 0x00007fa180f1f7b7 in PyEval_CallObjectWithKeywords () from /lib64/libpython2.7.so.1.0
#29 0x00007fa180f24475 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#30 0x00007fa180f28efd in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#31 0x00007fa180f29002 in PyEval_EvalCode () from /lib64/libpython2.7.so.1.0
#32 0x00007fa180f38dec in PyImport_ExecCodeModuleEx () from /lib64/libpython2.7.so.1.0
#33 0x00007fa180f39068 in load_source_module () from /lib64/libpython2.7.so.1.0
#34 0x00007fa180f39d01 in import_submodule () from /lib64/libpython2.7.so.1.0
#35 0x00007fa180f3a1ff in ensure_fromlist () from /lib64/libpython2.7.so.1.0
#36 0x00007fa180f3aa3a in PyImport_ImportModuleLevel () from /lib64/libpython2.7.so.1.0
#37 0x00007fa180f1dbdf in builtin___import () from /lib64/libpython2.7.so.1.0
#38 0x00007fa180e8d9a3 in PyObject_Call () from /lib64/libpython2.7.so.1.0
#39 0x00007fa180f1f7b7 in PyEval_CallObjectWithKeywords () from /lib64/libpython2.7.so.1.0
#40 0x00007fa180f24475 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#41 0x00007fa180f2657d in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#42 0x00007fa180f2657d in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#43 0x00007fa180f28efd in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#44 0x00007fa180eb2858 in function_call () from /lib64/libpython2.7.so.1.0
#45 0x00007fa180e8d9a3 in PyObject_Call () from /lib64/libpython2.7.so.1.0
#46 0x00007fa180f1f7b7 in PyEval_CallObjectWithKeywords () from /lib64/libpython2.7.so.1.0
#47 0x00007fa180f43c9c in PyErr_PrintEx () from /lib64/libpython2.7.so.1.0
#48 0x00007fa180f44c9c in PyRun_SimpleStringFlags () from /lib64/libpython2.7.so.1.0
#49 0x00007fa180f55520 in Py_Main () from /lib64/libpython2.7.so.1.0
#50 0x00007fa18017cb15 in __libc_start_main () from /lib64/libc.so.6
#51 0x000000000040071e in _start ()

Splitting model across 2 GPUs leads to OOM

I am running a UNet-based model on a single GPU, using gradients_speed.
When splitting the same model across 2 GPUs, training fails with OOM before it even starts.

Same model runs fine on 2 GPUs with regular gradients.

What would be a good place to start investigating this issue? What can be causing that?

`mem_util` only shows CPU, not GPU

The mem_util module only shows device /cpu:0, but not my gpu. I have tensorflow-gpu 1.5.

How do you use gradient checkpointing if you are updating gradients with an optimizer?

Doesn't seemingly work with latest TF versions.

Having successfully used this when it came out with TF 1.5 it seemingly doesn't work anymore in TF 1.9.

@yaroslavvb Do you have a working version still? Or do you have any insights as to what might have changed?

How does gradient checkpointing relate to reversible layers?

While studying ways to optimize GPU memory consumption, I found two approaches:

Gradient checkpointing
Reversible layers (from the Reformer paper)

Can you explain if there is a connection between them and which of the methods is more relevant now?
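
As a rough illustration of how the two ideas relate: both avoid storing every activation, but checkpointing re-runs the forward pass from a stored input, while a reversible layer rebuilds its inputs from its outputs. The sketch below uses the additive coupling form of reversible layers on plain floats. The functions `f` and `g` are arbitrary stand-ins, and all names are chosen here for illustration.

```python
import math


def f(x):
    return math.tanh(x)


def g(x):
    return 0.5 * x


def reversible_forward(x1, x2):
    """Additive coupling: the outputs determine the inputs."""
    y1 = x1 + f(x2)
    y2 = x2 + g(y1)
    return y1, y2


def reversible_inverse(y1, y2):
    """Rebuild the inputs from the outputs, so they need not be stored."""
    x2 = y2 - g(y1)
    x1 = y1 - f(x2)
    return x1, x2


def checkpointed_recompute(checkpoint, layers):
    """Checkpointing instead re-runs the forward pass from a stored input."""
    acts = [checkpoint]
    for layer in layers:
        acts.append(layer(acts[-1]))
    return acts


y1, y2 = reversible_forward(0.3, -1.2)
x1, x2 = reversible_inverse(y1, y2)
print(math.isclose(x1, 0.3), math.isclose(x2, -1.2))  # True True

print(checkpointed_recompute(2.0, [g, g]))  # [2.0, 1.0, 0.5]
```

Checkpointing works with arbitrary layers at the cost of storing some activations, whereas reversibility requires layers built in this invertible form.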

Publish this on PyPI

Hello,

First of all, thank you for making this work public!

I have noticed that this package is not available on PyPI, and this makes distributing this package harder than it needs to be.

Would you be interested in doing so? I have published a few packages on there so if you need a hand doing so I'd be happy to help out. It should be a matter of a few minutes from start to finish.

Thank you and have a nice day,
Luca

Checkpointing of VGG

Hi,
I've run a bunch of ImageNet networks with and without checkpointing. Everything seems to work pretty well everywhere except for the VGGs. I've tried different block sizes (VGG[11,13,16,19]), different batch sizes, and both automatic and manual checkpointing through 'collections'. It just doesn't work:

I wonder if this is somehow inherently related to the fact that in VGGs most of the memory is spent on the first few layers?
One thing I noticed when I tried to debug it is that the output of toposort() looks strange. All the max-pooling layers are at the end:

The marked-up tensors are what the automatic mode chooses to checkpoint.
Yet again, manual checkpointing doesn't help. Any ideas?
Thanks
Dmitri

Problems with custom gradient

We have run into a problem when using checkpoints and custom gradients together. We created a custom gradient for the operation tf.matrix_solve_ls in mode (fast=False), but if we include the MatrixSolveLs tensor in the list of checkpointed tensors, the gradients function in memory_saving_gradients.py tries to use the default gradient and ends with an error, because the default gradient is not defined for mode (fast=False). We are using TF 1.9. @yaroslavvb do you have any hints about how to make @tf.custom_gradient work with checkpointing?

Use with static (unrolled) RNN?

Hi guys, thanks for your contribution. I wanted to give some feedback and request that you add a static (unrolled) RNN to your test suite. If/when I get a chance to spend more time on this, I'm happy to contribute this myself.

I tried using your code with a 2-layer LSTM RNN using dynamic_rnn and hit the same issue as here: #9

I converted my model to use static_rnn. This removes the while loop by statically unrolling for a fixed sequence length. At this point, your code was unable to automatically find articulation points. So, I tried adding manual checkpoints in a few intuitive places (at output of each layer, or at every unrolled loop iteration, or at every k unrolled loop iterations). In all cases, the memory usage was still higher than the baseline. I investigated the modified backprop graph. It seemed to be doing a lot of redundant computation and not working as described in your writing. I suspect I wasn't checkpointing correctly. A working static RNN test case would be a helpful reference.
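
For reference, the behaviour one would hope for when checkpointing every k unrolled iterations can be sketched on a toy scalar recurrence in plain Python. This is not TensorFlow code and not the library's algorithm: the recurrence `h = tanh(w * h)` and all function names are assumptions chosen for illustration. The point is that each segment is recomputed once, the gradient matches the baseline, and fewer states are held at any one time.

```python
import math


def step(h, w):
    return math.tanh(w * h)


def step_grads(h, w):
    """Partial derivatives of step(h, w) with respect to h and w."""
    d = 1.0 - math.tanh(w * h) ** 2
    return d * w, d * h


def grad_full(h0, w, T):
    """Baseline: store all T step inputs, then backpropagate."""
    hs = [h0]
    for _ in range(T):
        hs.append(step(hs[-1], w))
    dh, dw = 1.0, 0.0
    for t in reversed(range(T)):
        dh_dh, dh_dw = step_grads(hs[t], w)
        dw += dh * dh_dw
        dh *= dh_dh
    return dw


def grad_checkpointed(h0, w, T, k):
    """Store every k-th state; recompute each segment during backprop."""
    ckpt = {0: h0}
    h = h0
    for t in range(T):
        h = step(h, w)
        if (t + 1) % k == 0 and t + 1 < T:
            ckpt[t + 1] = h
    peak = len(ckpt)
    dh, dw = 1.0, 0.0
    for start in sorted(ckpt, reverse=True):
        end = min(start + k, T)
        seg = [ckpt[start]]  # recompute the inputs of this segment only
        for _ in range(start, end - 1):
            seg.append(step(seg[-1], w))
        peak = max(peak, len(ckpt) + len(seg) - 1)
        for h_in in reversed(seg):
            dh_dh, dh_dw = step_grads(h_in, w)
            dw += dh * dh_dw
            dh *= dh_dh
    return dw, peak


T = 16
full = grad_full(0.5, 0.9, T)
ckpt_grad, peak = grad_checkpointed(0.5, 0.9, T, k=4)
print(math.isclose(full, ckpt_grad), peak)  # True 7
```

Here the baseline keeps 16 step inputs alive, while the checkpointed version holds at most 7 states (4 checkpoints plus 3 recomputed ones). If a modified backprop graph uses more memory than the baseline, comparing it against this pattern may help locate where the redundant work comes from.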

is this module adapted for tensorflow 2.0 and above?

Does Not Work with Keras

@yaroslavvb Would you please add keras model.fit_generator to your test cases? I notice the keras test case is a simple MNIST model that does not use convolutional layers either. As an example for me, on tensorflow 1.5-gpu with keras 2.1.6 and python 3.5 x64-bit on a Windows 10 machine, I cannot get the following to work (i.e. memory used and time per epoch is the same with or without memory_saving_gradients code):

# -*- coding: utf-8 -*-

##########
#LIBRARIES
##########

#Future
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy as np
import pandas as pd

pd.set_option('chained_assignment',None) #Sets `SettingWithCopyWarning` to None. If
                                         # making a chained assignment, the outcome may
                                         # vary depending on if the data is a view of
                                         # other data or a copy of other data.

import cv2

import os
import time
import datetime  # needed for datetime.datetime.fromtimestamp below
import argparse
import h5py
import gc

import multiprocessing as mp

import tensorflow as tf
from tensorflow.python.keras._impl.keras import backend as K

from tensorflow.contrib.data.python.ops.shuffle_ops import shuffle_and_repeat
from tensorflow.contrib.data.python.ops.batching import map_and_batch

import memory_saving_gradients

Dataset = tf.data.Dataset

from tensorflow.python.keras.preprocessing.image import ImageDataGenerator, load_img, img_to_array
from tensorflow.python.keras.models import Sequential, Model, load_model, model_from_yaml
from tensorflow.python.keras.callbacks import LearningRateScheduler, ModelCheckpoint, EarlyStopping, History, TensorBoard
from tensorflow.python.keras import regularizers, optimizers
from tensorflow.python.keras.layers import Conv2D, Dense, Flatten, Dropout, Input, Lambda, Activation

##################
#GLOBAL VARIABLES
##################

img_shape_raw = (3, 160, 320)

batch_size = 32

num_epochs = 1

crop_top = 70
crop_btm = 25

img_format = 'channels_first'
K.set_image_data_format(img_format)

img_shape_input = (img_shape_raw[0],
                   img_shape_raw[1] - crop_top - crop_btm,
                   img_shape_raw[2]) #(3, 65, 320)

max_procs = mp.cpu_count() - 1 or 1 # 4 physical cores, 8 logical cores
max_q_size = batch_size

root = r'.'

fldr_img_raw = os.path.join( root, r'dat\raw' )
fldr_csv_raw = os.path.join( root, r'dat\raw' )

fldr_img_mod = os.path.join( root, r'dat\mod' )
fldr_csv_mod = os.path.join( root, r'dat\mod' )

train_csv = os.path.join(fldr_csv_mod, 'training_data.csv')
val_csv = os.path.join(fldr_csv_mod, 'validation_data.csv')
test_csv = os.path.join(fldr_csv_mod, 'test_data.csv')

pth_bins_fl = os.path.join( fldr_csv_mod, 'bins.txt' )

fldr_fig = os.path.join( root, r'fig' )

lr = [1e-4, ]
run = [1, ]

hparam_str = ['1e-4', ]

fldr_log = os.path.join( root, r'log', hparam_str[0], 'run_{:04d}'.format(run[0]))

fldr_arch = os.path.join( root, r'arch' )
fldr_wt = os.path.join( root, r'wt' )
fldr_ckpt = os.path.join( root, r'ckpt' )
fldr_mdl = os.path.join( root, r'mdl' )

fldr_summary = os.path.join( root, r'summary' )

fl_fmt_wt_ckpt = os.path.join( fldr_ckpt,
                               r'wt_ckpt-run_{run:04d}'.format(run=run[0]) + '_epoch_{epoch:04d}_val_mse_{val_mean_squared_error:.7f}.h5' )

################
#DATA GENERATOR
################

def get_data( keep_ptl = 75 ):
    '''This just returns the train, validation, and test dataframes
       keeping a certain percentile of the original data. I'm not
       including it here for space and since it doesn't seem pertinent.
    '''

def generator_from_df( df, batch_size, shuffle = True ):
    
    def read( img_pth, angle ):
        
        im_fl = tf.read_file( img_pth )
        im = tf.image.decode_image(im_fl, channels=3)
        im = tf.transpose( im, [2, 0, 1] ) # Make image channels first

        return Dataset.from_tensors( (im, angle) )

    img_pths = tf.convert_to_tensor( df['Image_Path'].values )
    angs = tf.convert_to_tensor( df['Angle'].values )

    ds = Dataset.from_tensor_slices( (img_pths, angs) )

    ds = ds.apply( tf.contrib.data.parallel_interleave( read, cycle_length = batch_size, sloppy = True ) )

    if shuffle:
        ds = ds.apply( shuffle_and_repeat( buffer_size = 2*batch_size, count = num_epochs ) )
    else:
        ds = ds.repeat( num_epochs )

    ds = ds.apply( map_and_batch(
        lambda img_pth, ang: (img_pth,ang),
        batch_size,
        num_parallel_batches = max_procs ) )
    
    ds = ds.prefetch( max_procs )

    iterator = ds.make_one_shot_iterator()
    sess = K.get_session()

    next_element = iterator.get_next()

    while True:

        try:
          yield sess.run(next_element)
        except tf.errors.OutOfRangeError:
          break

###########
#GET MODEL
###########

def get_model( lr ):

    keep_prob = 0.5
    rate = keep_prob
    
    l2 = regularizers.l2(0.001)

    with tf.name_scope('Input'):
        inputs = Input( shape=img_shape_input, name='input' )

        x = Lambda(lambda x: x / 255. - 0.5,
                   input_shape=img_shape_input, name = 'norm_-0.5_to_0.5')(inputs)

    with tf.name_scope('Hidden_Layers'):

        with K.name_scope('ConvLayer_01'):
        
            x = Conv2D(4, (5,5),
                       kernel_regularizer=l2,
                       bias_regularizer=l2,
                       padding='same',
                       name='conv01')(x)

        with tf.name_scope('ConvLayer_02'):
        
            x = Conv2D(12, (5,5),
                       kernel_regularizer=l2,
                       bias_regularizer=l2,
                       padding='same',
                       name='conv02')(x)

        with tf.name_scope('ConvLayer_03'):
        
            x = Conv2D(24, (5,5),
                       kernel_regularizer=l2,
                       bias_regularizer=l2,
                       padding='same',
                       name='conv03')(x)

        with tf.name_scope('ConvLayer_04'):
        
            x = Conv2D(24, (3,3),
                       kernel_regularizer=l2,
                       bias_regularizer=l2,
                       padding='same',
                       name='conv04')(x)

        with tf.name_scope('ConvLayer_05'):
        
            x = Conv2D(32, (3,3),
                       kernel_regularizer=l2,
                       bias_regularizer=l2,
                       padding='same',
                       name='conv05')(x)

        with tf.name_scope('Flatten'):
        
            x = Flatten(name='flatten')(x)

        with tf.name_scope('FullyConnectedLayer_01'):
                
            x = Dense(100,
                      kernel_regularizer=l2,
                      bias_regularizer=l2,
                      name='fc01')(x)

        with tf.name_scope('FullyConnectedLayer_02'):
        
            x = Dense(50,
                      kernel_regularizer=l2,
                      bias_regularizer=l2,
                      name='fc02')(x)

        with tf.name_scope('FullyConnectedLayer_03'):

            x = Dense(25,
                      kernel_regularizer=l2,
                      bias_regularizer=l2,
                      name='fc03')(x)

        with tf.name_scope('FullyConnectedLayer_04'):
        
            x = Dense(10,
                      kernel_regularizer=l2,
                      bias_regularizer=l2,
                      name='fc04')(x)

    with tf.name_scope('Output'):
    
        outputs = Dense(1,
                        name='output')(x)

    # Create Model
        
    model = Model( inputs = inputs, outputs = outputs )

    adam = optimizers.Adam( lr = lr, decay = 0.001 ) # Learning rate and decay set in LearningRateScheduler

    # Memory Saving Gradients

    layer_names = [ 'conv02', 'conv04', 'fc01', 'fc03' ]

    for l in layer_names:
        tf.add_to_collection('checkpoints', model.get_layer(l).get_output_at(0))
    
    K.__dict__['gradients'] = memory_saving_gradients.gradients_collection

    # Compile Model

    model.compile(loss='mean_squared_error', optimizer=adam, metrics=['mse'])

    return model

class CumulativeHistory( History ):
    '''
    History does not allow resuming a previous history, but this does.
    '''
    def on_train_begin( self, logs=None ):
        if not hasattr(self, 'epoch'):
            super(CumulativeHistory, self).on_train_begin( logs )

def main(*args, **kargs):
    """ Behavioral Cloning Project
    """

    parser = argparse.ArgumentParser(description='Behavioral Cloning Project')

    parser.add_argument('-c', '--checkpoint', type=str, help='Checkpoint (`.h5` file)')
    parser.add_argument('-e', '--epoch', type=int, help='Initial epoch')
    
    args = parser.parse_args()

    model_type = 'new'
    train_model = None
    initial_epoch = 0

    if args.checkpoint is not None:

        train_model = load_model( args.checkpoint )

        initial_epoch = args.epoch

        model_type = 'loaded'

    # Set Configuration

    config = tf.ConfigProto( intra_op_parallelism_threads = max_procs,
                             inter_op_parallelism_threads = 0) # set automatically to number of logical cores

    config.gpu_options.allow_growth = True

    # Get Data

    df_train, df_val, df_test, bins = get_data( keep_ptl = 60 )
    
    ntrain, nval, ntest = df_train.shape[0], df_val.shape[0], df_test.shape[0]

    # Training

    train_graph = tf.Graph()

    train_generator = generator_from_df( df_train, batch_size )
    val_generator   = generator_from_df( df_val,   batch_size, shuffle=False )

    nbatches_train = ntrain // batch_size
    nbatches_val   = nval // batch_size
    
    history = CumulativeHistory()
    
    early_stop = EarlyStopping( monitor='val_mean_squared_error',
                                min_delta=1e-4,
                                patience=50,
                                verbose=0,
                                mode='min')
    
    model_ckpt = ModelCheckpoint( fl_fmt_wt_ckpt,
                                  monitor='val_mean_squared_error',
                                  verbose=0,
                                  save_best_only=True,
                                  save_weights_only=True,
                                  period=1)
    
    callbacks = [history, early_stop, model_ckpt]

    for i in range(len(lr)):

        train_sess = tf.Session( config = config, graph = train_graph )
        K.set_session( train_sess )

        if model_type == 'new':
            
            with train_graph.as_default():

                # Print model summary
                summary_fl_pth = os.path.join( fldr_summary, 'model_summary_run_{:04d}_'.format(run[0]) + r'.txt' )

                train_model = get_model( lr[i] )  # get_model() takes only the learning rate

                with open(summary_fl_pth, 'w') as summary_file:
                    train_model.summary( print_fn=lambda x: summary_file.write(x + '\n') )

        with train_graph.as_default():
            
            with train_sess.as_default():

                if K.backend() == 'tensorflow':
                    
                    board = TensorBoard( log_dir = fldr_log,
                                         histogram_freq = 0,
                                         write_graph = True,
                                         write_images = True )
                    callbacks.append( board )

                writer = tf.summary.FileWriter( fldr_log, train_graph )

                ts = time.time()
                ts = datetime.datetime.fromtimestamp(ts).strftime('%Y-%m-%d_%H-%M-%S')

                arch_yaml = train_model.to_yaml()
                arch_fl_pth = os.path.join( fldr_arch, 'arch_' + hparam_str[0] + '_run_{:04d}_'.format(run[0]) + ts + '.yaml' )

                with open(arch_fl_pth, 'w') as arch_file:
                    arch_file.write( arch_yaml )
                
                train_model.save( os.path.join( fldr_mdl,
                                                'model_init_' + hparam_str[0] + '_run_{:04d}_'.format(run[0]) + ts + '.h5') )

                train_model.save_weights( os.path.join( fldr_wt,
                                                        'weights_init_' + hparam_str[0] + '_run_{:04d}_'.format(run[0]) + ts  + '.h5' ) )

                train_model.fit_generator(
                    generator = train_generator,
                    steps_per_epoch = nbatches_train,
                    epochs = num_epochs,
                    max_queue_size = max_q_size,
                    validation_data = val_generator,
                    validation_steps = nbatches_val,
                    workers = 0,
                    callbacks = callbacks,
                    initial_epoch = initial_epoch)

                ts = time.time()
                ts = datetime.datetime.fromtimestamp(ts).strftime('%Y-%m-%d_%H-%M-%S')

                train_model.save( os.path.join( fldr_mdl,
                                                'model_final_' + hparam_str[0] + '_run_{:04d}_'.format(run[0]) + ts + '.h5') )

                train_model.save_weights( os.path.join( fldr_wt,
                                                        'weights_final_' + hparam_str[0] + '_run_{:04d}_'.format(run[0]) + ts  + '.h5' ) )
                
        if K.backend() == 'tensorflow':
            K.clear_session()

        del train_model
        gc.collect()

if __name__ == '__main__':
    # Entry point to the program

    main()

Does this lib work with layers that contain randomness?

Hi author

Thanks for sharing this useful lib. However, since this technique requires re-running the forward pass, will it change the behavior of layers such as batch norm (accumulated mean and variance) and dropout (random mask)?
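
To make the concern concrete, here is a toy sketch in plain Python. It does not use TensorFlow and says nothing about what memory_saving_gradients.py actually does; `dropout_forward`, `RunningMean`, and the `update_stats` flag are hypothetical names invented for illustration. It only shows the two effects in question: a recomputed dropout layer matches the original forward pass only if the random mask is reproduced, and a recomputed batch-norm-style layer updates its moving average a second time unless the update is suppressed during recomputation.

```python
import random


def dropout_forward(x, rate, seed):
    """Inverted dropout using a private RNG so the mask is reproducible."""
    rng = random.Random(seed)
    scale = 1.0 / (1.0 - rate)
    # Each element is dropped with probability `rate`, otherwise rescaled.
    return [0.0 if rng.random() < rate else xi * scale for xi in x]


class RunningMean:
    """Toy stand-in for the moving-average statistics kept by batch norm."""

    def __init__(self, momentum=0.9):
        self.momentum = momentum
        self.value = 0.0
        self.updates = 0

    def forward(self, batch, update_stats=True):
        mean = sum(batch) / len(batch)
        if update_stats:
            # Side effect: every call moves the accumulated statistic.
            self.value = self.momentum * self.value + (1.0 - self.momentum) * mean
            self.updates += 1
        return [b - mean for b in batch]


x = [float(i) for i in range(1, 65)]

# Dropout: the recomputed activations match only when the mask is replayed.
first = dropout_forward(x, rate=0.5, seed=123)
replay_same_seed = dropout_forward(x, rate=0.5, seed=123)
replay_new_seed = dropout_forward(x, rate=0.5, seed=456)

# Batch norm: a naive second forward pass updates the statistics twice.
naive = RunningMean()
out_first = naive.forward(x)
out_replay = naive.forward(x)  # recomputation with side effects left on

# Hypothetical guard: recompute the activations without touching the statistics.
guarded = RunningMean()
guarded.forward(x)
guarded.forward(x, update_stats=False)

print("same seed reproduces mask:", first == replay_same_seed)
print("new seed reproduces mask:", first == replay_new_seed)
print("naive updates:", naive.updates, "guarded updates:", guarded.updates)
```

In this sketch the recomputed batch-norm output itself is unchanged, because it depends only on the current batch; what differs is the accumulated statistic. Whether the library replays random masks or duplicates the moving-average updates depends on how it copies the forward ops, which this example does not attempt to model.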

Checkpointing for FP16

Hi guys,
I have added a couple of lines of code to memory_saving_gradients.py and to the benchmarks to enable checkpointing and benchmarking on FP16 networks. It seems to work pretty well. Should I open a pull request for those changes?
Thanks
Dmitri