
capsnet-keras's People

Contributors

iretiayo, kmader, shshemi, theredcomputer

capsnet-keras's Issues

Random performance on other datasets: why does it not learn?

I applied your implementation to two other datasets, but it performs like a random classifier. First I trained it on Dogs vs. Cats (the Kaggle problem): it cannot learn this binary classification and its accuracy is around 52% after 50 epochs, while a simple CNN learns it very well. I also applied it to my own dataset with 20 classes, but its accuracy was 6%. Could you please let me know the main reason for this? Thanks

Train and validation data

Hey!
Thank you for your work and for sharing it!

I have a question regarding the training data returned by "train_generator" and the validation data used by fit_generator.
For example, for the training data, why do you yield ([x_batch, y_batch], [y_batch, x_batch]) instead of just [x_batch, y_batch]?

Thank you
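
A note on the batch structure asked about above: the model summary posted in another issue on this page lists two inputs (input_1, input_2) and two outputs (capsnet, decoder), which is presumably why each batch is a pair of lists rather than a single pair. The following is a minimal pure-Python sketch of that structure, not the repository's generator; `capsnet_batches` and the toy string batches are hypothetical stand-ins.

```python
def capsnet_batches(batches):
    """Wrap (x_batch, y_batch) pairs for a two-input, two-output model.

    inputs  = [x_batch, y_batch]: the image, plus the label used to mask capsules
    targets = [y_batch, x_batch]: the label for the class output, and the image
                                  itself as the reconstruction target
    """
    for x_batch, y_batch in batches:
        yield [x_batch, y_batch], [y_batch, x_batch]


# Toy stand-ins for image batches and one-hot label batches (hypothetical data).
toy = [("images_0", "labels_0"), ("images_1", "labels_1")]
wrapped = list(capsnet_batches(toy))
print(wrapped[0])
```

Each yielded item is an (inputs, targets) tuple, so the labels appear once as an input and once as a target.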

Cifar-10 performance

Hi Xi Feng,

May I ask whether you have tried training the capsule network on the CIFAR-10 or CIFAR-100 datasets?

My attempt to replicate the paper on CIFAR-10 only reaches an accuracy of 0.54. I set both convolution kernel sizes to 24 and the primary capsule size to 64. However, the model stops improving after epoch 14 on CIFAR-10. CIFAR-100 performs even worse than its smaller counterpart.
I wonder whether there are any parameters I have missed?


How to apply the network to many classes

In MNIST there are 10 classes, so we need 10 capsules; but if num_capsule keeps going up, an OOM (out-of-memory) error will happen.
So how can CapsNet be applied to, for example, 50 classes or even more?
Thank you for answering!
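
To make the memory growth concrete, here is a rough sketch of how two tensors scale with the number of output capsules. It assumes the MNIST shapes reported elsewhere on this page (1152 input capsules of dimension 8, output capsules of dimension 16, and a prediction tensor of shape (?, 10, 1152, 16)). The function names are hypothetical, and the estimate ignores routing activations and gradients, so real usage will be higher.

```python
def digitcaps_weight_count(num_capsule, dim_capsule=16,
                           input_num_capsule=1152, input_dim_capsule=8):
    """Number of weights in the capsule layer's transformation tensor."""
    return num_capsule * input_num_capsule * dim_capsule * input_dim_capsule


def prediction_tensor_bytes(batch_size, num_capsule, dim_capsule=16,
                            input_num_capsule=1152, bytes_per_float=4):
    """Bytes for one float32 tensor of shape (batch, num_capsule, 1152, dim)."""
    return (batch_size * num_capsule * input_num_capsule
            * dim_capsule * bytes_per_float)


for n_class in (10, 50, 200):
    weights = digitcaps_weight_count(n_class)
    megabytes = prediction_tensor_bytes(100, n_class) / 1e6
    print(n_class, weights, round(megabytes, 1), "MB per batch of 100")
```

For 10 classes the weight count matches the 1474560 parameters shown for digitcaps in the model summary; both quantities grow linearly with the number of classes.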

GPU version can't work with keras.

Hi, Mr. Guo,
I've found a problem when using the GPU version. These are my steps:
python capsulenet-multi-gpu.py --gpus 2

Using TensorFlow backend.
Namespace(batch_size=300, debug=0, digit=5, epochs=50, gpus=2, lam_recon=0.392, lr=0.001, routings=3, save_dir='./result', shift_fraction=0.1, testing=False, weights=None)


Layer (type)                  Output Shape          Param #    Connected to
input_1 (InputLayer)          (None, 28, 28, 1)     0
conv1 (Conv2D)                (None, 20, 20, 256)   20992      input_1[0][0]
primarycap_conv2d (Conv2D)    (None, 6, 6, 256)     5308672    conv1[0][0]
primarycap_reshape (Reshape)  (None, 1152, 8)       0          primarycap_conv2d[0][0]
primarycap_squash (Lambda)    (None, 1152, 8)       0          primarycap_reshape[0][0]
digitcaps (CapsuleLayer)      (None, 10, 16)        1474560    primarycap_squash[0][0]
input_2 (InputLayer)          (None, 10)            0
mask_1 (Mask)                 (None, 160)           0          digitcaps[0][0]
                                                               input_2[0][0]
capsnet (Length)              (None, 10)            0          digitcaps[0][0]
decoder (Sequential)          (None, 28, 28, 1)     1411344    mask_1[0][0]

Total params: 8,215,568
Trainable params: 8,215,568
Non-trainable params: 0


Traceback (most recent call last):
File "capsulenet-multi-gpu.py", line 122, in <module>
plot_model(model, to_file=args.save_dir+'/model.png', show_shapes=True)
File "/usr/local/lib/python2.7/dist-packages/keras/utils/vis_utils.py", line 131, in plot_model
dot = model_to_dot(model, show_shapes, show_layer_names, rankdir)
File "/usr/local/lib/python2.7/dist-packages/keras/utils/vis_utils.py", line 52, in model_to_dot
_check_pydot()
File "/usr/local/lib/python2.7/dist-packages/keras/utils/vis_utils.py", line 27, in _check_pydot
raise ImportError('Failed to import pydot. You must install pydot'
ImportError: Failed to import pydot. You must install pydot and graphviz for pydotprint to work.

I've installed pydot and graphviz already, and my keras version is:

keras.__version__
'2.1.2'

Thanks!

Clock ZHONG

tf.map_fn instead of tf.scan

You should really use tf.map_fn instead of tf.scan. In your code, the first argument (the accumulator) of the function passed to tf.scan is not even used. Therefore tf.map_fn should give the same result, only faster, because it does not have to account for dependencies between the evaluations of the lambda function on the elements being processed.
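
The argument above can be illustrated with a plain-Python analogy. This is not TensorFlow code: `itertools.accumulate` plays the role of a scan and the built-in `map` plays the role of a map, and the `step` function is a hypothetical stand-in for a lambda that ignores its accumulator.

```python
from itertools import accumulate


def step(acc, x):
    """Scan-style step function that never reads the accumulator."""
    return x * 2


xs = [1, 2, 3]

# Scan: each step formally depends on the previous result, so the evaluation
# is sequential. The first element is the initial accumulator; drop it.
scanned = list(accumulate(xs, step, initial=0))[1:]

# Map: every element is processed independently of the others.
mapped = list(map(lambda x: x * 2, xs))

print(scanned == mapped)
```

Because `step` ignores `acc`, both give the same values; only the scan carries a dependency chain that blocks independent evaluation.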

Keras speed!

Hi Xifeng,
I was surprised by how smoothly this code runs. It took me about 5 minutes to get everything running and finish one epoch, whereas I spent a couple of hours trying to get "naturomics/CapsNet-Tensorflow" to work. This Keras code seems to be more than 5x faster than the TensorFlow code from Naturomics.

Could you explain whether the speed of your code is due to Keras or to something else? I have not tried to learn Keras; I think most people just learn TensorFlow without a second thought.

Regards,
Wen

Problem running on systems without a GPU. Please help.

C:\Users\jeet\Documents\GitHub\CapsNet-Keras>python capsulenet.py --num_routing 1
Using TensorFlow backend.
Traceback (most recent call last):
File "C:\Users\jeet\Anaconda3\lib\site-packages\tensorflow\python\platform\self_check.py", line 62, in preload_check
ctypes.WinDLL(build_info.nvcuda_dll_name)
File "C:\Users\jeet\Anaconda3\lib\ctypes\__init__.py", line 348, in __init__
self._handle = _dlopen(self._name, mode)
OSError: [WinError 126] The specified module could not be found

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "capsulenet.py", line 20, in <module>
from keras import layers, models, optimizers
File "C:\Users\jeet\Anaconda3\lib\site-packages\keras\__init__.py", line 3, in <module>
from . import utils
File "C:\Users\jeet\Anaconda3\lib\site-packages\keras\utils\__init__.py", line 6, in <module>
from . import conv_utils
File "C:\Users\jeet\Anaconda3\lib\site-packages\keras\utils\conv_utils.py", line 3, in <module>
from .. import backend as K
File "C:\Users\jeet\Anaconda3\lib\site-packages\keras\backend\__init__.py", line 83, in <module>
from .tensorflow_backend import *
File "C:\Users\jeet\Anaconda3\lib\site-packages\keras\backend\tensorflow_backend.py", line 1, in <module>
import tensorflow as tf
File "C:\Users\jeet\Anaconda3\lib\site-packages\tensorflow\__init__.py", line 24, in <module>
from tensorflow.python import *
File "C:\Users\jeet\Anaconda3\lib\site-packages\tensorflow\python\__init__.py", line 49, in <module>
from tensorflow.python import pywrap_tensorflow
File "C:\Users\jeet\Anaconda3\lib\site-packages\tensorflow\python\pywrap_tensorflow.py", line 30, in <module>
self_check.preload_check()
File "C:\Users\jeet\Anaconda3\lib\site-packages\tensorflow\python\platform\self_check.py", line 70, in preload_check
% build_info.nvcuda_dll_name)
ImportError: Could not find 'nvcuda.dll'. TensorFlow requires that this DLL be installed in a directory that is named in your %PATH% environment variable. Typically it is installed in 'C:\Windows\System32'. If it is not present, ensure that you have a CUDA-capable GPU with the correct driver installed.

if dim_capsule != 16 for CapsuleLayer an error appears when training

Any idea why this happens? I'm talking about changing the 16 in line 46 in capsulenet.py to something else like 17 or 32.

# Layer 3: Capsule layer. Routing algorithm works here.
digitcaps = CapsuleLayer(num_capsule=n_class, dim_capsule=16, num_routing=num_routing,
                         name='digitcaps')(primarycaps)

Isn't the weight matrix from capsule layer 1 to capsule layer 2 of dimension (dim_capsule_1, dim_capsule_2)? If so, it should work in theory, just like in Figure 2 of https://arxiv.org/pdf/1710.09829.pdf

EDIT: found the other hardcoded 16

Dimension Ordering

This one is a quick fix: the code assumes that you are using "channels-last", but I have my defaults set otherwise, which causes an error. You might want to make it explicit with:

K.set_image_data_format('channels_last')

pydot issue

Hi, I got the following error message (copied below) when I tried the code, although I do have both pydot and graphviz. I don't think this is an issue with your code. It seems to be a general long-standing issue with pydot. The solutions are discussed here:
1: Theano/Theano#1801
2: pydot/pydot#126

Traceback (most recent call last):
File "/Users/Qihong/anaconda/envs/brainiak/lib/python3.6/site-packages/keras/utils/vis_utils.py", line 23, in _check_pydot
pydot.Dot.create(pydot.Dot())
File "/Users/Qihong/anaconda/envs/brainiak/lib/python3.6/site-packages/pydot_ng/__init__.py", line 1890, in create
'GraphViz's executables not found')
pydot_ng.InvocationException: GraphViz's executables not found

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "capsulenet.py", line 195, in <module>
plot_model(model, to_file=args.save_dir+'/model.png', show_shapes=True)
File "/Users/Qihong/anaconda/envs/brainiak/lib/python3.6/site-packages/keras/utils/vis_utils.py", line 131, in plot_model
dot = model_to_dot(model, show_shapes, show_layer_names, rankdir)
File "/Users/Qihong/anaconda/envs/brainiak/lib/python3.6/site-packages/keras/utils/vis_utils.py", line 52, in model_to_dot
_check_pydot()
File "/Users/Qihong/anaconda/envs/brainiak/lib/python3.6/site-packages/keras/utils/vis_utils.py", line 27, in _check_pydot
raise ImportError('Failed to import pydot. You must install pydot'
ImportError: Failed to import pydot. You must install pydot and graphviz for pydotprint to work.

Learning rate decay in the paper

The paper uses learning rate decay as part of the Adam optimizer:

we use the Adam optimizer (Kingma and Ba [2014]) with its TensorFlow default parameters, including the exponentially decaying learning rate, to minimize the sum of the margin losses in Eq. 4.

RGB dataset with 224 x 224 dimensions

How can we use your code on another RGB dataset?
Suppose the dataset is structured like this: it contains some sub-folders, and each sub-folder represents one class.

Class A:
0001.jpg 1
0002.jpg 1
Class B:
0001.jpg 2
0002.jpg 2
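
One way to turn a folder-per-class layout like the one above into (path, label) pairs is sketched below using only the standard library. `list_labeled_images` is a hypothetical helper, not part of the repository. It assigns 0-based labels in sorted folder order (the listing above uses 1 and 2), and it does not load or resize any image; feeding 224 x 224 RGB images to the network would also need a matching input shape.

```python
import tempfile
from pathlib import Path


def list_labeled_images(root, extensions=(".jpg", ".jpeg", ".png")):
    """Return (class_names, samples) for a directory with one sub-folder per class.

    samples is a list of (path, label) pairs; label is the index of the
    sub-folder in sorted order.
    """
    class_dirs = sorted(p for p in Path(root).iterdir() if p.is_dir())
    class_names = [p.name for p in class_dirs]
    samples = []
    for label, class_dir in enumerate(class_dirs):
        for path in sorted(class_dir.iterdir()):
            if path.suffix.lower() in extensions:
                samples.append((path, label))
    return class_names, samples


# Demo on a throwaway directory that mimics the layout (empty placeholder files).
with tempfile.TemporaryDirectory() as tmp:
    for class_name in ("class_a", "class_b"):
        (Path(tmp) / class_name).mkdir()
        for file_name in ("0001.jpg", "0002.jpg"):
            (Path(tmp) / class_name / file_name).touch()
    demo_names, demo_samples = list_labeled_images(tmp)
    demo_labels = [label for _, label in demo_samples]

print(demo_names, demo_labels)
```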

Error about shapes not being equal in rank

When I try to run the code, I get an error saying that "Shapes must be equal rank, but are 1 and 2 for 'digitcaps/map/while/MatMul' (op: 'BatchMatMul') with input shapes: [10,1152,8], [10,1152,16,8]". I am not sure why this is happening since I have not changed any part of the code, does anyone know what I'm doing wrong? Here is the full output in the console log:

Traceback (most recent call last):
  File "C:\Users\dpere013\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\framework\common_shapes.py", line 686, in _call_cpp_shape_fn_impl
    input_tensors_as_shapes, status)
  File "C:\Users\dpere013\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Shapes must be equal rank, but are 1 and 2 for 'digitcaps/map/while/MatMul' (op: 'BatchMatMul') with input shapes: [10,1152,8], [10,1152,16,8].

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:/Users/dpere013/Desktop/CapsNet-Keras-master/capsulenet.py", line 244, in <module>
    num_routing=args.routings)
  File "C:/Users/dpere013/Desktop/CapsNet-Keras-master/capsulenet.py", line 50, in CapsNet
    name='digitcaps')(primarycaps)
  File "C:\Users\dpere013\AppData\Local\Programs\Python\Python35\lib\site-packages\keras\engine\topology.py", line 554, in __call__
    output = self.call(inputs, **kwargs)
  File "C:\Users\dpere013\Desktop\CapsNet-Keras-master\capsulelayers.py", line 127, in call
    inputs_hat = K.map_fn(lambda x: K.batch_dot(x, self.W, [2, 3]), elems=inputs_tiled)
  File "C:\Users\dpere013\AppData\Local\Programs\Python\Python35\lib\site-packages\keras\backend\tensorflow_backend.py", line 3328, in map_fn
    return tf.map_fn(fn, elems, name=name)
  File "C:\Users\dpere013\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\ops\functional_ops.py", line 389, in map_fn
    swap_memory=swap_memory)
  File "C:\Users\dpere013\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\ops\control_flow_ops.py", line 2816, in while_loop
    result = loop_context.BuildLoop(cond, body, loop_vars, shape_invariants)
  File "C:\Users\dpere013\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\ops\control_flow_ops.py", line 2640, in BuildLoop
    pred, body, original_loop_vars, loop_vars, shape_invariants)
  File "C:\Users\dpere013\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\ops\control_flow_ops.py", line 2590, in _BuildLoop
    body_result = body(*packed_vars_for_body)
  File "C:\Users\dpere013\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\ops\functional_ops.py", line 379, in compute
    packed_fn_values = fn(packed_values)
  File "C:\Users\dpere013\Desktop\CapsNet-Keras-master\capsulelayers.py", line 127, in <lambda>
    inputs_hat = K.map_fn(lambda x: K.batch_dot(x, self.W, [2, 3]), elems=inputs_tiled)
  File "C:\Users\dpere013\AppData\Local\Programs\Python\Python35\lib\site-packages\keras\backend\tensorflow_backend.py", line 915, in batch_dot
    out = tf.matmul(x, y, adjoint_a=adj_x, adjoint_b=adj_y)
  File "C:\Users\dpere013\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\ops\math_ops.py", line 1861, in matmul
    a, b, adj_x=adjoint_a, adj_y=adjoint_b, name=name)
  File "C:\Users\dpere013\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\ops\gen_math_ops.py", line 708, in _batch_mat_mul
    "BatchMatMul", x=x, y=y, adj_x=adj_x, adj_y=adj_y, name=name)
  File "C:\Users\dpere013\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "C:\Users\dpere013\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\framework\ops.py", line 2958, in create_op
    set_shapes_for_outputs(ret)
  File "C:\Users\dpere013\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\framework\ops.py", line 2209, in set_shapes_for_outputs
    shapes = shape_func(op)
  File "C:\Users\dpere013\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\framework\ops.py", line 2159, in call_with_requiring
    return call_cpp_shape_fn(op, require_shape_fn=True)
  File "C:\Users\dpere013\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\framework\common_shapes.py", line 627, in call_cpp_shape_fn
    require_shape_fn)
  File "C:\Users\dpere013\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\framework\common_shapes.py", line 691, in _call_cpp_shape_fn_impl
    raise ValueError(err.message)
ValueError: Shapes must be equal rank, but are 1 and 2 for 'digitcaps/map/while/MatMul' (op: 'BatchMatMul') with input shapes: [10,1152,8], [10,1152,16,8].

Utils and plots

Matplotlib causes a lot of issues when run on remote servers without X. If not addressed, the program crashes when it tries to save the plots. I managed to resolve this by adding the following line to the start of utils.py (after the imports):

plt.switch_backend('agg')

I tried some other solutions, but this one seems to do the trick (even though I still get a bunch of warnings). Now, I'm aware that this isn't something you would be willing to add to master for sure, but it may be worthwhile to at least mention it in the Readme, or something...

Appropriate credit for authors

The authors of the paper are:

Sara Sabour, Nicholas Frosst, Geoffrey Hinton

In Hinton's talks about this he is always careful to make clear that while he may have thought of the idea, it was his co-authors who made it work. So I think the README should say Sabour's Capsule Network.

Learning rate decay in addition to Adam?

Hi,

First of all, let me compliment you on the swift implementation of CapsNet in Keras. It looks very interesting! I haven't gotten around to testing it myself, but when I was skimming through the source code after reading the CapsNet paper, I noticed the following line, which schedules updates of the learning rate using a Keras callback:

lr_decay = callbacks.LearningRateScheduler(schedule=lambda epoch: args.lr * (0.9 ** epoch))

This made me wonder whether this is also part of their setup in the paper:

Our implementation is in TensorFlow (...) and we use the Adam optimizer with its TensorFlow default parameters, including the exponentially decaying learning rate, ...

As far as I understand Adam, the optimiser already uses exponentially decaying learning rates, but on a per-parameter basis. This makes me think no further learning rate decay is necessary.

Some time soon I plan to run some tests without the additional learning rate decay and see how it changes the results. In any case I'd like to hear your thoughts on this. I don't seem to find anything conclusive as to whether Adam can benefit from additional learning rate decay.

Kind regards,
Dieter
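
For reference, the schedule in the quoted callback multiplies the base learning rate by 0.9 once per epoch. The sketch below only evaluates that formula in plain Python. It assumes a base rate of 0.001 (the lr value shown in the argument dump of another issue on this page), and `decayed_lr` is a hypothetical name.

```python
def decayed_lr(initial_lr, epoch, decay=0.9):
    """Learning rate at a given epoch under lr * decay ** epoch."""
    return initial_lr * (decay ** epoch)


# First few epochs of the schedule, assuming a base rate of 0.001.
schedule = [decayed_lr(0.001, epoch) for epoch in range(5)]
for epoch, lr in enumerate(schedule):
    print(epoch, lr)
```

Under these assumptions the rate has dropped by roughly two orders of magnitude after 50 epochs, on top of whatever per-parameter scaling Adam applies.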

Valid padding in CapsNet

Hello sir,
I am following the Capsule Network paper and your implementation.
I have a quick question about the valid padding in the conv2 layer you use to produce the output for the Primary Caps. As I understand it, after the first conv layer the output size is (batchsize, 20, 20, 256). If conv2 has 256 kernels of size 9x9 with stride 2, then the output-size formula should give (20-9+2*p)/2+1 = 6. However, this equation has no integer solution for p, so I would like to ask how exactly valid padding works in this situation to produce an output of (batchsize, 6, 6, 256).
Thanks !
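
The usual convention for valid padding is p = 0 with the division rounded down, so positions where a full kernel window no longer fits are dropped. A small sketch of that arithmetic follows; `conv_output_size` is a hypothetical helper, not a Keras function.

```python
def conv_output_size(input_size, kernel_size, stride, padding=0):
    """Spatial output size of a convolution: floor((n - k + 2p) / s) + 1."""
    return (input_size - kernel_size + 2 * padding) // stride + 1


conv1_size = conv_output_size(28, 9, 1)            # first 9x9 conv, stride 1
primary_size = conv_output_size(conv1_size, 9, 2)  # second 9x9 conv, stride 2

# 256 channels regrouped into 8-dimensional capsules.
num_primary_capsules = primary_size * primary_size * 256 // 8

print(conv1_size, primary_size, num_primary_capsules)
```

With an input of 20, kernel 9, and stride 2, (20 - 9) / 2 = 5.5 is floored to 5, giving 5 + 1 = 6. This agrees with the (None, 6, 6, 256) and (None, 1152, 8) shapes in the model summary posted in another issue.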

Training on datasets other than MNIST

I have 26 classes contained in a dataset with the same pixel dimensions as MNIST and I have 1,000 samples for each class. I've successfully trained a CapsNet that can classify around 12-14 of my 26 classes with 99% accuracy, but any more classes than this and the loss will converge on a single high value and never improve. There seems to be a specific cut-off point where the network architecture fails above a certain number of classes.

I've experimented with increasing the values for the number of dimensions in the PrimaryCaps layer, setting different values for the learning rate, and changing the batch size, but I haven't been able to solve the issue.

Do you have any tips for other things I should be trying Xifeng? (I'm training on AWS p2.xlarge Tesla K80 GPU)

Thanks!

how to dump routings to a file

Hi,
Thanks for all the effort. I want to dump the routings to a file but am not sure how to do it (I usually use Theano and am not sure how to do it in Keras-TF). I also tried pickle, but somehow I can't get TF to save tensors that are not tf.Variables to a file. I can't make them tf.Variables because they depend on the batch size.

saver = tf.train.Saver({"inputs": inputs_hat, "c":c})
save(tf.get_default_session(), "capsule_dump/vars.ckpt")

If I try to create a variable from them, I again hit a similar error.

ValueError: initial_value must have a shape specified: Tensor("digitcaps_17/map/TensorArrayStack/TensorArrayGatherV3:0", shape=(?, 10, 1152, 16), dtype=float32)

If I try to initiate another session, I get an uninitialized variable error.

ValueError: Error when checking input: expected input_2 to have shape (221,) but got array with shape (245,)

Hi @XifengGuo, I got the following error when filling in my own dataset (in mnist ubyte format, but to 223 categories instead of 10) :

Traceback (most recent call last):
File "capsulenet.py", line 252, in <module>
train(model=model, data=((x_train, y_train), (x_test, y_test)), args=args)
File "capsulenet.py", line 138, in train
callbacks=[log, tb, checkpoint, lr_decay])
File "/usr/local/lib/python3.5/dist-packages/Keras-2.1.3-py3.5.egg/keras/legacy/interfaces.py", line 91, in wrapper
File "/usr/local/lib/python3.5/dist-packages/Keras-2.1.3-py3.5.egg/keras/engine/training.py", line 2138, in fit_generator
File "/usr/local/lib/python3.5/dist-packages/Keras-2.1.3-py3.5.egg/keras/engine/training.py", line 1428, in _standardize_user_data
File "/usr/local/lib/python3.5/dist-packages/Keras-2.1.3-py3.5.egg/keras/engine/training.py", line 120, in _standardize_input_data
ValueError: Error when checking input: expected input_2 to have shape (221,) but got array with shape (245,)

Do you have any idea how to solve it? Thanks!

Best Wishes,
Chi Kiu

Error in multigpu

There is a small bug in capsulenet-multi-gpu.py:129

The function returns 3 values, but is assigned to 2. The line should be rewritten, at least, to this:

model, eval_model, _ = CapsNet(input_shape=x_train.shape[1:],
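
The failure mode described in this report is ordinary tuple unpacking. The sketch below reproduces it with a hypothetical `build_models` stand-in that returns three values, as the report says the real function does; it does not call the repository's code.

```python
def build_models():
    """Hypothetical stand-in for a builder that returns three models."""
    return "train_model", "eval_model", "manipulate_model"


message = None
try:
    # Two targets for three returned values raises ValueError.
    model, eval_model = build_models()
except ValueError as err:
    message = str(err)

# Discarding the third value with an underscore makes the assignment valid.
model, eval_model, _ = build_models()

print(message)
```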

Keyword argument not understood

I'm working with Anaconda 4.3.2 and Keras 2.0.9.
When I try to run capsnet.py or its multi-GPU version on my own data, I am stuck with the following error:
TypeError: ('Keyword argument not understood:', 'routings')
When I searched online for the error, I found a post saying that this TypeError occurs when 'kwargs' doesn't contain the type (routings), and/or that serialization may have happened in a different Anaconda distribution, in which case using the same Anaconda distribution fixes the error.
I need your help with this.

expected input_3 to have shape (15, 16) but got array with shape (10, 16)

After I trained the model successfully, I ran the following command on Google Colab:

!python3 /content/drive/tut_competition/capsulenet_colab.py -t -w /content/result/trained_model.h5

keras version = 2.1.4
tensorflow version = 1.4.0

I got this error:

2018-02-25 22:18:08.222033: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:04.0, compute capability: 3.7)
------------------------------Begin: manipulate------------------------------
Traceback (most recent call last):
File "/content/drive/tut_competition/capsulenet_colab.py", line 278, in <module>
manipulate_latent(manipulate_model, (x_test, y_test), args)
File "/content/drive/tut_competition/capsulenet_colab.py", line 181, in manipulate_latent
x_recon = model.predict([x, y, tmp])
File "/usr/local/lib/python3.6/dist-packages/keras/engine/training.py", line 1824, in predict
check_batch_axis=False)
File "/usr/local/lib/python3.6/dist-packages/keras/engine/training.py", line 123, in _standardize_input_data
str(data_shape))
ValueError: Error when checking : expected input_3 to have shape (15, 16) but got array with shape (10, 16)

Results

Hi,

Great implementation, congratulations. Do you have any idea why you don't reach the performance reported in the paper?

Thanks

Manuel

ImportError: cannot import name 'combine_images'

Hi there,
While testing the script, I found that the utils module was not installed, so I installed it with pip install utils (version 0.9.0). After that, it still couldn't import the combine_images function. I have probably installed the wrong module.

Can you please help me with this issue?

I'm using Python : 3.6.3
latest keras and tensorflow

Thank you!

Script causes a crash

Running the examples on a GeForce 1080 under Linux crashes (restarts) the whole computer every time training starts. While I do believe that power issues are one possibility, strangely this is the only piece of code so far that is able to do it, and the computer has trained many networks before without any issues.

I've tried to do some testing, but it is really annoying since it means restarting everything all the time. Any ideas on how to fix it, log the real reason for the crash, or rule out power issues by slowing the code down somehow?

Question on Paper Model Setup

Having an EXTREMELY hard time following the below.

What does a convolutional "unit" mean? Where does the 32x6x6 dimension come from? I get the 32, but the 6? Where does the dimension 20x25 come from? The same goes for the 16x10 digit caps dimension...

Are these numbers arbitrarily chosen or derived from operations happening in the net? Please provide a detailed response where possible. I find reading this paper extremely frustrating. Hopefully reading this code will be better.

EDIT: I think I answered some of my own questions by looking at your model diagram but would still love your opinion to make sure I am tracking correctly.

image

Squash Function Validity

def squash(vectors, axis=-1):
    """
    The non-linear activation used in Capsule. It drives the length of a large vector to near 1 and small vector to 0
    :param vectors: some vectors to be squashed, N-dim tensor
    :param axis: the axis to squash
    :return: a Tensor with same shape as input vectors
    """
    s_squared_norm = K.sum(K.square(vectors), axis, keepdims=True)
    scale = s_squared_norm / (1 + s_squared_norm) / K.sqrt(s_squared_norm + K.epsilon())
    return scale * vectors

Shouldn't scale be equal to [s_squared_norm / (1 + s_squared_norm)] * [vectors / K.sqrt(s_squared_norm + K.epsilon())], per the paper?
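
The two forms differ only in where the division by the norm is applied, so they should agree up to the small epsilon term. The pure-Python check below works on a single vector rather than a Keras tensor. `squash_paper` and `squash_snippet` are hypothetical names, and the epsilon of 1e-7 is an assumed stand-in for K.epsilon().

```python
import math


def squash_paper(vector):
    """v = (|s|^2 / (1 + |s|^2)) * (s / |s|), the form written in the paper."""
    squared_norm = sum(x * x for x in vector)
    norm = math.sqrt(squared_norm)
    return [(squared_norm / (1.0 + squared_norm)) * (x / norm) for x in vector]


def squash_snippet(vector, eps=1e-7):
    """Form used in the snippet: one scalar scale, then scale * vector."""
    squared_norm = sum(x * x for x in vector)
    scale = squared_norm / (1.0 + squared_norm) / math.sqrt(squared_norm + eps)
    return [scale * x for x in vector]


s = [3.0, 4.0]
paper_result = squash_paper(s)
snippet_result = squash_snippet(s)
max_difference = max(abs(a - b) for a, b in zip(paper_result, snippet_result))
print(paper_result, snippet_result, max_difference)
```

Since multiplication by a scalar commutes, scale * vectors equals (|s|^2 / (1 + |s|^2)) * (vectors / |s|); the epsilon only guards against division by zero for a zero vector.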


MNIST learning rate

The paper says in Section 4:

Our implementation is in TensorFlow (Abadi et al. [2016]) and we use the Adam optimizer (Kingma and Ba [2014]) with its TensorFlow default parameters, including the exponentially decaying learning rate

The TensorFlow defaults for Adam are described here:
https://www.tensorflow.org/api_docs/python/tf/train/AdamOptimizer

The current capsulenet.py uses lr_decay as a callback to modify the learning rate, but there isn't any evidence that the paper follows this method. Should the lr_decay callback be removed since Adam already decays the learning rate?
(update: the TensorFlow and Keras defaults for Adam appear to be the same)

Routing

As I understand, you reset coupling coefficients after each training sample (batch). Don't you think it would be better to keep their previous state and update them depending on it?
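
For context, the routing procedure in the paper initializes the routing logits to zero at the start of each forward pass, so the coupling coefficients are recomputed per input rather than learned. Below is a minimal pure-Python sketch of that procedure for a single input, following the paper's description rather than the repository's code; `route`, `softmax`, and `squash` are hypothetical helper names and the toy `u_hat` values are made up.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def squash(vector, eps=1e-7):
    """Shrink a vector so that its length lies between 0 and 1."""
    squared_norm = sum(x * x for x in vector)
    scale = squared_norm / (1.0 + squared_norm) / math.sqrt(squared_norm + eps)
    return [scale * x for x in vector]


def route(u_hat, num_routing=3):
    """u_hat[i][j] is the prediction of input capsule i for output capsule j."""
    n_in, n_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    # Logits start at zero on every call: no state survives between inputs.
    b = [[0.0] * n_out for _ in range(n_in)]
    for _ in range(num_routing):
        c = [softmax(row) for row in b]  # coupling coefficients per input capsule
        v = []
        for j in range(n_out):
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_in))
                   for d in range(dim)]
            v.append(squash(s_j))
        # Agreement update: raise logits where prediction and output align.
        for i in range(n_in):
            for j in range(n_out):
                b[i][j] += sum(u_hat[i][j][d] * v[j][d] for d in range(dim))
    return v, c


toy_u_hat = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.0, 0.5]],
    [[0.9, 0.1], [0.1, 0.2]],
]
outputs, couplings = route(toy_u_hat)
print(outputs, couplings)
```

Keeping the logits across batches, as the question suggests, would make the couplings depend on earlier inputs; in this sketch they depend only on the current prediction vectors.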

Multi GPU Training

Looks like this repo does not support the latest multi-GPU model which is introduced in Keras 2.0.9. When I do this:


    if(num_gpu > 1):
        model = multi_gpu_model(model, gpus=num_gpu)
    # compile the model
    model.compile(optimizer=optimizers.Adam(lr=args.lr),
                  loss=[margin_loss, 'mse'],
                  loss_weights=[1., args.lam_recon],
                  metrics={'out_caps': 'accuracy'})

It will give me this error, so looks like the input layer does not handle the data well (not sure about this though).


2017-11-10 23:15:25.160851: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla K80, pci bus id: 451d:00:00.0, compute capability: 3.7)
2017-11-10 23:15:25.160892: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:1) -> (device: 1, name: Tesla K80, pci bus id: 7dcb:00:00.0, compute capability: 3.7)
Train on 60000 samples, validate on 10000 samples
2017-11-10 23:15:27.118862: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla K80, pci bus id: 451d:00:00.0, compute capability: 3.7)
2017-11-10 23:15:27.118901: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:1) -> (device: 1, name: Tesla K80, pci bus id: 7dcb:00:00.0, compute capability: 3.7)
Epoch 1/30
2017-11-10 23:15:31.162715: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: Incompatible shapes: [100,1152,10,1,1] vs. [50,1152,10,1,16]
         [[Node: replica_0/model_1/digitcaps/mul = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](replica_0/model_1/digitcaps/transpose_1, replica_0/model_1/digitcaps/scan/TensorArrayStack/TensorArrayGatherV3)]]
2017-11-10 23:15:31.162970: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: Incompatible shapes: [100,1152,10,1,1] vs. [50,1152,10,1,16]
         [[Node: replica_0/model_1/digitcaps/mul = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](replica_0/model_1/digitcaps/transpose_1, replica_0/model_1/digitcaps/scan/TensorArrayStack/TensorArrayGatherV3)]]
2017-11-10 23:15:31.167090: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: Incompatible shapes: [100,1152,10,1,1] vs. [50,1152,10,1,16]
         [[Node: replica_0/model_1/digitcaps/mul = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](replica_0/model_1/digitcaps/transpose_1, replica_0/model_1/digitcaps/scan/TensorArrayStack/TensorArrayGatherV3)]]
2017-11-10 23:15:31.170465: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: Incompatible shapes: [100,1152,10,1,1] vs. [50,1152,10,1,16]
         [[Node: replica_0/model_1/digitcaps/mul = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](replica_0/model_1/digitcaps/transpose_1, replica_0/model_1/digitcaps/scan/TensorArrayStack/TensorArrayGatherV3)]]
2017-11-10 23:15:31.170701: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: Incompatible shapes: [100,1152,10,1,1] vs. [50,1152,10,1,16]
         [[Node: replica_0/model_1/digitcaps/mul = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](replica_0/model_1/digitcaps/transpose_1, replica_0/model_1/digitcaps/scan/TensorArrayStack/TensorArrayGatherV3)]]
2017-11-10 23:15:31.175048: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: Incompatible shapes: [100,1152,10,1,1] vs. [50,1152,10,1,16]
         [[Node: replica_0/model_1/digitcaps/mul = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](replica_0/model_1/digitcaps/transpose_1, replica_0/model_1/digitcaps/scan/TensorArrayStack/TensorArrayGatherV3)]]
Traceback (most recent call last):
  File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1323, in _do_call
    return fn(*args)
  File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1302, in _run_fn
    status, run_metadata)
  File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes: [100,1152,10,1,1] vs. [50,1152,10,1,16]
         [[Node: replica_0/model_1/digitcaps/mul = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](replica_0/model_1/digitcaps/transpose_1, replica_0/model_1/digitcaps/scan/TensorArrayStack/TensorArrayGatherV3)]]
         [[Node: training/Adam/gradients/concatenate_2/concat_grad/Slice_1/_309 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:1", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_2229_training/Adam/gradients/concatenate_2/concat_grad/Slice_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:1"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "capsulenet.py", line 215, in <module>
    train(model=model, data=((x_train, y_train), (x_test, y_test)), args=args)
  File "capsulenet.py", line 113, in train
    validation_data=[[x_test, y_test], [y_test, x_test]], callbacks=[log, tb, checkpoint, lr_decay])
  File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/keras/engine/training.py", line 1631, in fit
    validation_steps=validation_steps)
  File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/keras/engine/training.py", line 1213, in _fit_loop
    outs = f(ins_batch)
  File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 2332, in __call__
    **self.session_kwargs)
  File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 889, in run
    run_metadata_ptr)
  File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1120, in _run
    feed_dict_tensor, options, run_metadata)
  File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1317, in _do_run
    options, run_metadata)
  File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1336, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes: [100,1152,10,1,1] vs. [50,1152,10,1,16]
         [[Node: replica_0/model_1/digitcaps/mul = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](replica_0/model_1/digitcaps/transpose_1, replica_0/model_1/digitcaps/scan/TensorArrayStack/TensorArrayGatherV3)]]
         [[Node: training/Adam/gradients/concatenate_2/concat_grad/Slice_1/_309 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:1", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_2229_training/Adam/gradients/concatenate_2/concat_grad/Slice_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:1"]()]]

Caused by op 'replica_0/model_1/digitcaps/mul', defined at:
  File "capsulenet.py", line 215, in <module>
    train(model=model, data=((x_train, y_train), (x_test, y_test)), args=args)
  File "capsulenet.py", line 103, in train
    model = multi_gpu_model(model, gpus=num_gpu)
  File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/keras/utils/training_utils.py", line 143, in multi_gpu_model
    outputs = model(inputs)
  File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/keras/engine/topology.py", line 603, in __call__
    output = self.call(inputs, **kwargs)
  File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/keras/engine/topology.py", line 2061, in call
    output_tensors, _, _ = self.run_internal_graph(inputs, masks)
  File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/keras/engine/topology.py", line 2212, in run_internal_graph
    output_tensors = _to_list(layer.call(computed_tensor, **kwargs))
  File "/datadrive/xiaoyzhu/RandomExercise/CapsNet-Keras/capsulelayers.py", line 157, in call
    outputs = squash(K.sum(c * inputs_hat, 1, keepdims=True))
  File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/tensorflow/python/ops/math_ops.py", line 894, in binary_op_wrapper
    return func(x, y, name=name)
  File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/tensorflow/python/ops/math_ops.py", line 1117, in _mul_dispatch
    return gen_math_ops._mul(x, y, name=name)
  File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/tensorflow/python/ops/gen_math_ops.py", line 2726, in _mul
    "Mul", x=x, y=y, name=name)
  File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
    op_def=op_def)
  File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): Incompatible shapes: [100,1152,10,1,1] vs. [50,1152,10,1,16]
         [[Node: replica_0/model_1/digitcaps/mul = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](replica_0/model_1/digitcaps/transpose_1, replica_0/model_1/digitcaps/scan/TensorArrayStack/TensorArrayGatherV3)]]
         [[Node: training/Adam/gradients/concatenate_2/concat_grad/Slice_1/_309 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:1", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_2229_training/Adam/gradients/concatenate_2/concat_grad/Slice_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:1"]()]]

Exception ignored in: <bound method BaseSession.__del__ of <tensorflow.python.client.session.Session object at 0x7f16e115a828>>
Traceback (most recent call last):
  File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 696, in __del__
  File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/tensorflow/python/framework/c_api_util.py", line 30, in __init__
TypeError: 'NoneType' object is not callable

Multiple digitcaps layers generate nan training loss

When I add a second digitcaps layer, training starts to produce NaN loss. What causes this problem? Thank you so much.
In capsulenet.py, I changed the following code from:

digitcaps = CapsuleLayer(num_capsule=n_class, dim_capsule=16, routings=routings,
                         name='digitcaps')(primarycaps)

to

digitcaps2 = CapsuleLayer(num_capsule=n_class, dim_capsule=16, routings=routings,
                          name='digitcaps2')(primarycaps)
digitcaps = CapsuleLayer(num_capsule=n_class, dim_capsule=16, routings=routings,
                         name='digitcaps')(digitcaps2)
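
The cause of the NaN loss is not established in this thread. One thing worth ruling out in any capsule network is a division by zero when a capsule vector has zero length, since the squash nonlinearity divides by the vector norm. The NumPy sketch below is an illustration only, not the repository's code: `squash_naive`, `squash_safe` and the `eps` value are hypothetical names and choices made for this example.

```python
import numpy as np

def squash_naive(s, axis=-1):
    # Squash without any guard: divides by the norm directly.
    sq_norm = np.sum(np.square(s), axis=axis, keepdims=True)
    return sq_norm / (1.0 + sq_norm) * s / np.sqrt(sq_norm)

def squash_safe(s, axis=-1, eps=1e-7):
    # Same formula, with a small eps (assumed value) under the square root.
    sq_norm = np.sum(np.square(s), axis=axis, keepdims=True)
    return sq_norm / (1.0 + sq_norm) * s / np.sqrt(sq_norm + eps)

zero_capsule = np.zeros((1, 16))  # a capsule whose output is exactly zero

# Silence the NumPy warning so the 0/0 result can be inspected.
with np.errstate(divide="ignore", invalid="ignore"):
    naive_out = squash_naive(zero_capsule)
safe_out = squash_safe(zero_capsule)

print("naive has NaN:", bool(np.isnan(naive_out).any()))
print("safe has NaN:", bool(np.isnan(safe_out).any()))
```

If the layers involved already guard the norm this way, the NaN has a different origin, for example the learning rate or the loss computation, and this check only narrows the search.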

Get the batch size

Hi,

I just saw that somewhere in your capsulelayers.py code you wrote this comment:
# b=K.zeros(shape=[batch_size, num_capsule, input_num_capsule]). I just can't get batch_size
In TensorFlow you can get it with tf.shape(inputs)[0]. There is a difference between inputs.shape (the static shape, where the batch dimension may be unknown) and tf.shape(inputs) (a tensor evaluated at run time). You can also use this to compute inputs_hat more efficiently by tiling W and then using a regular matmul instead of tf.map_fn.

Best,
Chris
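
The tiling suggestion above can be sketched in NumPy, where it is easy to check that one batched matmul gives the same result as mapping over the batch sample by sample. The tensor layouts and sizes here are assumptions chosen for the illustration and are not taken from capsulelayers.py; in a TensorFlow graph the batch size `b` would come from tf.shape(inputs)[0] instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small hypothetical sizes: batch, input capsules, input dim, output capsules, output dim.
batch, n_in, d_in, n_out, d_out = 4, 6, 8, 3, 5
inputs = rng.normal(size=(batch, n_in, d_in))
W = rng.normal(size=(n_out, n_in, d_out, d_in))

def predict_one(x):
    # One sample: x is (n_in, d_in); result is (n_out, n_in, d_out).
    return np.einsum("jiod,id->jio", W, x)

# Per-sample version, analogous to mapping a function over the batch.
hat_loop = np.stack([predict_one(x) for x in inputs])

# Batched version: read the batch size at run time, tile W, do one matmul.
b = inputs.shape[0]
W_tiled = np.tile(W[None], (b, 1, 1, 1, 1))      # (b, n_out, n_in, d_out, d_in)
x_col = inputs[:, None, :, :, None]              # (b, 1, n_in, d_in, 1)
hat_batched = np.matmul(W_tiled, x_col)[..., 0]  # (b, n_out, n_in, d_out)

print(hat_batched.shape, np.allclose(hat_loop, hat_batched))
```

Because matmul broadcasts its leading dimensions, the explicit tile could be replaced by `W[None]` without changing the result; it is kept here to mirror the wording of the suggestion.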

How to test the model?

I don't understand how to load weights into the evaluation model. In the code, you build three different models for training, evaluation and manipulation. Since the evaluation model is not involved in the training process, how can it be used for testing? We can't load the training weights into the evaluation model, since the two have different structures.

Can we run this on Theano?

Hi Xi,

I want to run the code on Theano. Can you suggest another implementation of CapsNet that runs on Theano?

Thanks.

What is the bottleneck of the computation?

I am noticing that even with all the parameters that control model size turned down, it still takes 20 seconds to complete an epoch. Any idea why that is?

So the summary looks like this:

_________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
input_1 (InputLayer)            (None, 28, 28, 1)    0
__________________________________________________________________________________________________
conv1 (Conv2D)                  (None, 20, 20, 1)    82          input_1[0][0]
__________________________________________________________________________________________________
primarycap_conv2d (Conv2D)      (None, 6, 6, 1)      82          conv1[0][0]
__________________________________________________________________________________________________
primarycap_reshape (Reshape)    (None, 36, 1)        0           primarycap_conv2d[0][0]
__________________________________________________________________________________________________
primarycap_squash (Lambda)      (None, 36, 1)        0           primarycap_reshape[0][0]
__________________________________________________________________________________________________
digitcaps (CapsuleLayer)        (None, 10, 1)        360         primarycap_squash[0][0]
__________________________________________________________________________________________________
input_2 (InputLayer)            (None, 10)           0
__________________________________________________________________________________________________
mask_1 (Mask)                   (None, 10)           0           digitcaps[0][0]
                                                                 input_2[0][0]
__________________________________________________________________________________________________
capsnet (Length)                (None, 10)           0           digitcaps[0][0]
__________________________________________________________________________________________________
decoder (Sequential)            (None, 28, 28, 1)    2367        mask_1[0][0]
==================================================================================================
Total params: 2,891
Trainable params: 2,891
Non-trainable params: 0
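
The figures in the summary above can be checked by hand. The sketch below assumes 9x9 kernels with a stride of 1 for conv1 and 2 for the primary capsule convolution; those values are not shown in the summary, but they are consistent with the 28 to 20 to 6 output sizes. The decoder count is copied from the summary rather than derived, and the helper names are made up for this example.

```python
def conv_out(size, kernel, stride):
    # Output width of a 'valid' convolution.
    return (size - kernel) // stride + 1

def conv2d_params(kernel, in_channels, filters):
    # Weights plus one bias per filter.
    return kernel * kernel * in_channels * filters + filters

conv1_size = conv_out(28, kernel=9, stride=1)                # 20
primary_size = conv_out(conv1_size, kernel=9, stride=2)      # 6
n_primary_caps = primary_size * primary_size                 # 36 capsules of dim 1

conv1_params = conv2d_params(9, in_channels=1, filters=1)    # 82
primary_params = conv2d_params(9, in_channels=1, filters=1)  # 82

# One (dim_out x dim_in) matrix per (input capsule, output capsule) pair.
digitcaps_params = n_primary_caps * 10 * 1 * 1               # 360

decoder_params = 2367  # taken from the summary as-is

total = conv1_params + primary_params + digitcaps_params + decoder_params
print(conv1_size, primary_size, n_primary_caps, total)
```

The total matches the 2,891 parameters reported. The arithmetic only confirms that the model is tiny; it does not by itself identify what the 20 seconds per epoch are spent on.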

Test accuracy 0

Hi, I wonder why I'm getting an accuracy of 0 on the test set. I did not change the code. I trained from scratch and then loaded the trained_model.h5 file as instructed.

performance after > 30 epochs

I tried running for 200 epochs a few times. The performance stabilizes somewhere in the 30-50 epoch range. The paper shows 1250 epochs, so something is different. I can't reproduce the result either: the median error with 3 routing iterations and reconstruction loss is 0.36, far outside the 0.25 ± 0.005 reported. Maybe this is because the learning rate decays too fast? I think the more interesting result in the paper is the robustness to affine transformations, but it's odd that the raw reported MNIST error can't be reproduced.
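
To get a feel for the "decays too fast" hypothesis, here is a small sketch of a per-epoch exponential schedule. The initial rate of 0.001 and the decay factor of 0.9 are assumed values for illustration, not settings confirmed in this thread, and `lr_at_epoch` is a hypothetical helper.

```python
def lr_at_epoch(epoch, initial_lr=0.001, decay=0.9):
    # Exponential decay: multiply the rate by `decay` once per epoch.
    return initial_lr * (decay ** epoch)

for epoch in (0, 10, 30, 50):
    ratio = lr_at_epoch(epoch) / lr_at_epoch(0)
    print(f"epoch {epoch:3d}: lr = {lr_at_epoch(epoch):.2e} ({ratio:.2%} of initial)")
```

Under these assumed values the rate falls below 5% of its starting point by epoch 30 and below 1% by epoch 50, which would be consistent with performance flattening in that range. A slower decay factor is one way to test the hypothesis.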

ImportError: No module named 'PIL'

When I run the script 'capsulenet-multi-gpu.py', an error occurs:
from PIL import Image
ImportError: No module named 'PIL'

How can I add the PIL module?
Thanks for your attention!
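
The PIL module is provided by the Pillow package, so installing Pillow (for example with pip install Pillow) into the same environment that runs the script normally resolves this error. A quick way to check whether the module is visible to the interpreter is sketched below; `has_module` is a helper name made up for this example.

```python
import importlib.util

def has_module(name):
    # True if the interpreter can locate a top-level module with this name.
    return importlib.util.find_spec(name) is not None

if has_module("PIL"):
    print("PIL is importable in this environment.")
else:
    print("PIL not found; install it with: pip install Pillow")
```

Running this with the same python executable that launches the training script matters, since a package installed into a different environment will not be found.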

dynamic routing

Thanks for the amazing work you've done. I adapted dynamic routing from your code and want to share some of my ideas about it. Here is my repo, written in TensorFlow.
bias updating
You mentioned that you fix the bias to 0, but during dynamic routing you are updating it, is that so? code: here and here.
In my opinion, the bias should not be updated, since it is just the initial value for dynamic routing. With your implementation, the bias is updated every time data is fed in, even when the Variable is set to trainable=False, and the same goes for the testing procedure. I think the easiest fix is to make a temporary variable with temp_bias = bias and use that for dynamic routing.
bias summing
code here. It seems that you are trying to keep the shape of the bias as [num_caps, 10], so you sum over all the training examples. I think that's problematic. The paper mentions that the bias is independent of the image, but during routing the capsule predictions from the layer below vary from image to image, so the updated bias should differ too. After the update, the shape of the bias should be [batch_size, caps, 10].

I tried 3 iterations of dynamic routing; after fewer than 4 epochs (2k iterations) the validation accuracy is 99.16, so it seems to be working, though still not as efficient as the paper says.
But I have a big problem: training is slow, almost 2 s per iteration with batch_size 100 on an Nvidia 1060, which is far slower than yours.

Just some of my ideas, glad to discuss them with you.
Best.
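
The two points above (route on a temporary copy of the bias, and keep one set of routing logits per sample) can be sketched in NumPy as follows. This is an illustration under assumed tensor layouts, not the code of either repository; `route`, `prior_logits` and the shapes are hypothetical.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-7):
    sq = np.sum(np.square(s), axis=axis, keepdims=True)
    return sq / (1.0 + sq) * s / np.sqrt(sq + eps)

def softmax(x, axis):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def route(inputs_hat, prior_logits, iterations=3):
    """inputs_hat: (batch, n_in, n_out, d_out); prior_logits: (n_in, n_out)."""
    batch = inputs_hat.shape[0]
    # Temporary per-sample copy; prior_logits itself is never written to.
    b = np.tile(prior_logits[None], (batch, 1, 1))         # (batch, n_in, n_out)
    for _ in range(iterations):
        c = softmax(b, axis=2)                             # couplings over output capsules
        s = np.sum(c[..., None] * inputs_hat, axis=1)      # (batch, n_out, d_out)
        v = squash(s)
        # Agreement between each prediction and the current output capsule.
        b = b + np.sum(inputs_hat * v[:, None], axis=-1)   # (batch, n_in, n_out)
    return v, b

rng = np.random.default_rng(0)
inputs_hat = rng.normal(size=(2, 6, 10, 16))  # small hypothetical sizes
prior = np.zeros((6, 10))
v, b = route(inputs_hat, prior)
print(v.shape, b.shape, bool(np.all(prior == 0)))
```

The returned logits have shape [batch_size, caps, 10] and differ between samples, while the shared prior is left unchanged, so nothing carries over from one batch to the next or from training to testing.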
