xifengguo / capsnet-keras
A Keras implementation of CapsNet in NIPS2017 paper "Dynamic Routing Between Capsules". Now test error = 0.34%.
License: MIT License
I applied your implementation to 2 other datasets. However, it behaves like a random classifier with very low performance. First, I trained it on Dogs vs. Cats (the Kaggle problem): it can't learn this binary classification and has an accuracy of around 52 after 50 epochs, while a simple CNN can learn it very well. I also applied it to my own dataset with 20 classes, but its accuracy was 6 percent! Could you please let me know the main reason for this issue? Thanks
Hey !
Thank you for your work and for sharing it!
I have a question regarding the training data returned by "train_generator" and the validation data used by fit_generator.
For example, for the training data, why do you yield ([x_batch, y_batch], [y_batch, x_batch]) instead of a plain pair (x_batch, y_batch)?
Thank you
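One plausible reading of the question above, sketched with NumPy only: the model summary quoted on this page shows two inputs (the image, and the one-hot label that feeds the mask) and two outputs (the capsule lengths and the reconstructed image), so each batch has to supply a pair of inputs and a pair of targets. The name `two_io_batches` and the toy array sizes are hypothetical; this is not the repository's `train_generator`.

```python
import numpy as np

def two_io_batches(x, y, batch_size):
    """Yield ([x_batch, y_batch], [y_batch, x_batch]) forever.

    Inputs : image + one-hot label (the label drives the mask).
    Targets: one-hot label (classification) + image (reconstruction).
    """
    n = len(x)
    while True:
        for start in range(0, n, batch_size):
            x_batch = x[start:start + batch_size]
            y_batch = y[start:start + batch_size]
            yield [x_batch, y_batch], [y_batch, x_batch]

# Toy data with MNIST-like shapes (values are placeholders).
x = np.zeros((8, 28, 28, 1), dtype="float32")
y = np.eye(10, dtype="float32")[np.arange(8) % 10]

inputs, targets = next(two_io_batches(x, y, batch_size=4))
print([a.shape for a in inputs], [a.shape for a in targets])
```

The first element of each yielded pair lines up with the model's inputs and the second with its outputs, which is why the same two arrays appear in swapped order.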
Why is this necessary?
CapsNet-Keras/capsulelayers.py
Line 152 in 0ca571c
Hi Xi Feng,
May I ask whether you have tried training the capsule network on the CIFAR-10 or CIFAR-100 datasets?
My attempt to replicate the paper on CIFAR-10 only achieves an accuracy of 0.54. I have set both convolution kernel sizes to 24 and the primary capsule size to 64. However, the model stops improving after epoch 14 on CIFAR-10. CIFAR-100 performs even worse than its smaller counterpart.
I wonder if there are any parameters I have missed?
In capsulenet.py line 43
The primarycap's dim_capsule is 8.
But the CapsuleLayer's dim_capsule is 16.
Though it doesn't seem to be a problem when training, I wonder why it is unequal...
I want to thank you for this wonderful work =),
How can I cite this repo ?
Thanks
In MNIST there are 10 classes, so we need 10 capsules; but if num_capsule keeps going up, OOM will happen...
So how can CapsNet be applied to, for example, 50 classes or even more?
Thank you for answering!
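A rough back-of-the-envelope sketch of why memory grows with the class count, assuming the shapes reported elsewhere on this page: a prediction tensor of shape (batch, n_class, 1152, 16) and 1,474,560 digitcaps weights for 10 classes. The function names are hypothetical, and the estimate ignores gradients, routing intermediates and framework overhead, so real usage will be higher.

```python
def prediction_tensor_bytes(batch_size, n_class, n_primary=1152,
                            dim_capsule=16, bytes_per_float=4):
    """Size of one float32 tensor of shape (batch, n_class, n_primary, dim_capsule)."""
    return batch_size * n_class * n_primary * dim_capsule * bytes_per_float

def digitcaps_weights(n_class, n_primary=1152, dim_in=8, dim_out=16):
    """One (dim_in x dim_out) matrix per (primary capsule, class) pair."""
    return n_primary * n_class * dim_in * dim_out

for n_class in (10, 50, 200):
    mb = prediction_tensor_bytes(100, n_class) / 1e6
    print(n_class, "classes:", digitcaps_weights(n_class), "weights,",
          round(mb, 1), "MB per prediction tensor at batch size 100")
```

Both quantities scale linearly with the number of classes, so reducing the batch size or the number of primary capsules are the obvious knobs to try first.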
Hi, Mr. Guo,
I've found a problem when using the GPU version. These are my steps:
python capsulenet-multi-gpu.py --gpus 2
Using TensorFlow backend.
Namespace(batch_size=300, debug=0, digit=5, epochs=50, gpus=2, lam_recon=0.392, lr=0.001, routings=3, save_dir='./result', shift_fraction=0.1, testing=False, weights=None)
Layer (type) Output Shape Param # Connected to
input_1 (InputLayer) (None, 28, 28, 1) 0
conv1 (Conv2D) (None, 20, 20, 256) 20992 input_1[0][0]
primarycap_conv2d (Conv2D) (None, 6, 6, 256) 5308672 conv1[0][0]
primarycap_reshape (Reshape) (None, 1152, 8) 0 primarycap_conv2d[0][0]
primarycap_squash (Lambda) (None, 1152, 8) 0 primarycap_reshape[0][0]
digitcaps (CapsuleLayer) (None, 10, 16) 1474560 primarycap_squash[0][0]
input_2 (InputLayer) (None, 10) 0
mask_1 (Mask) (None, 160) 0 digitcaps[0][0]
input_2[0][0]
capsnet (Length) (None, 10) 0 digitcaps[0][0]
decoder (Sequential) (None, 28, 28, 1) 1411344 mask_1[0][0]
Total params: 8,215,568
Trainable params: 8,215,568
Non-trainable params: 0
Traceback (most recent call last):
File "capsulenet-multi-gpu.py", line 122, in <module>
plot_model(model, to_file=args.save_dir+'/model.png', show_shapes=True)
File "/usr/local/lib/python2.7/dist-packages/keras/utils/vis_utils.py", line 131, in plot_model
dot = model_to_dot(model, show_shapes, show_layer_names, rankdir)
File "/usr/local/lib/python2.7/dist-packages/keras/utils/vis_utils.py", line 52, in model_to_dot
_check_pydot()
File "/usr/local/lib/python2.7/dist-packages/keras/utils/vis_utils.py", line 27, in _check_pydot
raise ImportError('Failed to import pydot. You must install pydot'
ImportError: Failed to import pydot. You must install pydot and graphviz for `pydotprint` to work.
I've installed pydot and graphviz already, and my keras version is:
keras.__version__
'2.1.2'
Thanks!
Clock ZHONG
You should really use tf.map_fn instead of tf.scan. In your code, the first argument (the accumulator) of the function passed to tf.scan is not even used. Therefore, tf.map_fn should do the same thing, just faster, because it does not have to account for dependencies between the evaluations of the lambda function on the elements being processed.
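The point can be illustrated without TensorFlow. The sketch below uses plain-Python stand-ins named `scan` and `map_fn` (hypothetical helpers, not the TensorFlow functions) to show that a scan whose step function ignores its accumulator produces the same values as a map, while still forcing the steps to run in sequence.

```python
def scan(fn, elems, initializer):
    """Sequential scan: each step receives the previous step's result."""
    acc = initializer
    out = []
    for x in elems:
        acc = fn(acc, x)  # step i cannot start before step i-1 finishes
        out.append(acc)
    return out

def map_fn(fn, elems):
    """Independent map: no step depends on another, so they could run in parallel."""
    return [fn(x) for x in elems]

def square(x):
    return x * x

elems = [1, 2, 3, 4]
scanned = scan(lambda acc, x: square(x), elems, initializer=0)  # acc is never used
mapped = map_fn(square, elems)
print(scanned, mapped)
```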
Hi Xifeng,
I was surprised by how smoothly this code ran. It took me about 5 minutes to get everything running and finish one epoch, while I spent a couple of hours trying to get "naturomics/CapsNet-Tensorflow" to work. I found that this Keras code is probably more than 5x faster than the TensorFlow code from Naturomics.
Could you explain whether the speed of your code is due to Keras or something else? I have not tried to learn Keras; I think most people just learn TensorFlow without a second thought.
Regards,
Wen
C:\Users\jeet\Documents\GitHub\CapsNet-Keras>python capsulenet.py --num_routing 1
Using TensorFlow backend.
Traceback (most recent call last):
File "C:\Users\jeet\Anaconda3\lib\site-packages\tensorflow\python\platform\self_check.py", line 62, in preload_check
ctypes.WinDLL(build_info.nvcuda_dll_name)
File "C:\Users\jeet\Anaconda3\lib\ctypes\__init__.py", line 348, in __init__
self._handle = _dlopen(self._name, mode)
OSError: [WinError 126] The specified module could not be found
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "capsulenet.py", line 20, in <module>
from keras import layers, models, optimizers
File "C:\Users\jeet\Anaconda3\lib\site-packages\keras\__init__.py", line 3, in <module>
from . import utils
File "C:\Users\jeet\Anaconda3\lib\site-packages\keras\utils\__init__.py", line 6, in <module>
from . import conv_utils
File "C:\Users\jeet\Anaconda3\lib\site-packages\keras\utils\conv_utils.py", line 3, in <module>
from .. import backend as K
File "C:\Users\jeet\Anaconda3\lib\site-packages\keras\backend\__init__.py", line 83, in <module>
from .tensorflow_backend import *
File "C:\Users\jeet\Anaconda3\lib\site-packages\keras\backend\tensorflow_backend.py", line 1, in <module>
import tensorflow as tf
File "C:\Users\jeet\Anaconda3\lib\site-packages\tensorflow\__init__.py", line 24, in <module>
from tensorflow.python import *
File "C:\Users\jeet\Anaconda3\lib\site-packages\tensorflow\python\__init__.py", line 49, in <module>
from tensorflow.python import pywrap_tensorflow
File "C:\Users\jeet\Anaconda3\lib\site-packages\tensorflow\python\pywrap_tensorflow.py", line 30, in <module>
self_check.preload_check()
File "C:\Users\jeet\Anaconda3\lib\site-packages\tensorflow\python\platform\self_check.py", line 70, in preload_check
% build_info.nvcuda_dll_name)
ImportError: Could not find 'nvcuda.dll'. TensorFlow requires that this DLL be installed in a directory that is named in your %PATH% environment variable. Typically it is installed in 'C:\Windows\System32'. If it is not present, ensure that you have a CUDA-capable GPU with the correct driver installed.
Any idea why this happens? I'm talking about changing the 16 in line 46 in capsulenet.py to something else like 17 or 32.
# Layer 3: Capsule layer. Routing algorithm works here.
digitcaps = CapsuleLayer(num_capsule=n_class, dim_capsule=16, num_routing=num_routing,
name='digitcaps')(primarycaps)
Isn't the weight matrix from capsule layer 1 to capsule layer 2 of dim (dim_capsule_1, dim_capsule_2)? So it should work in theory, right, just like in figure 2 of https://arxiv.org/pdf/1710.09829.pdf
EDIT: found the other hardcoded 16
What is this problem for ?
This one is a quick fix: the code assumes that you are using "channels-last", but I have my defaults set otherwise and it causes an error. You might want to make it explicit with:
K.set_image_data_format('channels_last')
Hi, I got the following error message (copied below) when I tried the code, although I do have both pydot and graphviz. I don't think this is an issue with your code. It seems to be a general long-standing issue with pydot. The solutions are discussed here:
1: Theano/Theano#1801
2: pydot/pydot#126
Traceback (most recent call last):
File "/Users/Qihong/anaconda/envs/brainiak/lib/python3.6/site-packages/keras/utils/vis_utils.py", line 23, in _check_pydot
pydot.Dot.create(pydot.Dot())
File "/Users/Qihong/anaconda/envs/brainiak/lib/python3.6/site-packages/pydot_ng/__init__.py", line 1890, in create
'GraphViz\'s executables not found')
pydot_ng.InvocationException: GraphViz's executables not found
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "capsulenet.py", line 195, in <module>
plot_model(model, to_file=args.save_dir+'/model.png', show_shapes=True)
File "/Users/Qihong/anaconda/envs/brainiak/lib/python3.6/site-packages/keras/utils/vis_utils.py", line 131, in plot_model
dot = model_to_dot(model, show_shapes, show_layer_names, rankdir)
File "/Users/Qihong/anaconda/envs/brainiak/lib/python3.6/site-packages/keras/utils/vis_utils.py", line 52, in model_to_dot
_check_pydot()
File "/Users/Qihong/anaconda/envs/brainiak/lib/python3.6/site-packages/keras/utils/vis_utils.py", line 27, in _check_pydot
raise ImportError('Failed to import pydot. You must install pydot'
ImportError: Failed to import pydot. You must install pydot and graphviz for `pydotprint` to work.
They use learning rate decay in the paper as it is part of the Adam optimizer:
we use the Adam optimizer (Kingma
and Ba [2014]) with its TensorFlow default parameters, including the exponentially decaying learning
rate, to minimize the sum of the margin losses in Eq. 4.
What modifications in the capsule implementation are required to support 3D image input?
Thanks!
How can we use your code on other RGB datasets?
Suppose the dataset is structured like this: it contains some sub-folders, and each sub-folder represents one class.
Class A:
0001.jpg 1
0002.jpg 1
Class B:
0001.jpg 2
0002.jpg 2
When I try to run the code, I get an error saying that "Shapes must be equal rank, but are 1 and 2 for 'digitcaps/map/while/MatMul' (op: 'BatchMatMul') with input shapes: [10,1152,8], [10,1152,16,8]". I am not sure why this is happening since I have not changed any part of the code, does anyone know what I'm doing wrong? Here is the full output in the console log:
Traceback (most recent call last):
File "C:\Users\dpere013\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\framework\common_shapes.py", line 686, in _call_cpp_shape_fn_impl
input_tensors_as_shapes, status)
File "C:\Users\dpere013\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 473, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Shapes must be equal rank, but are 1 and 2 for 'digitcaps/map/while/MatMul' (op: 'BatchMatMul') with input shapes: [10,1152,8], [10,1152,16,8].
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:/Users/dpere013/Desktop/CapsNet-Keras-master/capsulenet.py", line 244, in <module>
num_routing=args.routings)
File "C:/Users/dpere013/Desktop/CapsNet-Keras-master/capsulenet.py", line 50, in CapsNet
name='digitcaps')(primarycaps)
File "C:\Users\dpere013\AppData\Local\Programs\Python\Python35\lib\site-packages\keras\engine\topology.py", line 554, in __call__
output = self.call(inputs, **kwargs)
File "C:\Users\dpere013\Desktop\CapsNet-Keras-master\capsulelayers.py", line 127, in call
inputs_hat = K.map_fn(lambda x: K.batch_dot(x, self.W, [2, 3]), elems=inputs_tiled)
File "C:\Users\dpere013\AppData\Local\Programs\Python\Python35\lib\site-packages\keras\backend\tensorflow_backend.py", line 3328, in map_fn
return tf.map_fn(fn, elems, name=name)
File "C:\Users\dpere013\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\ops\functional_ops.py", line 389, in map_fn
swap_memory=swap_memory)
File "C:\Users\dpere013\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\ops\control_flow_ops.py", line 2816, in while_loop
result = loop_context.BuildLoop(cond, body, loop_vars, shape_invariants)
File "C:\Users\dpere013\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\ops\control_flow_ops.py", line 2640, in BuildLoop
pred, body, original_loop_vars, loop_vars, shape_invariants)
File "C:\Users\dpere013\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\ops\control_flow_ops.py", line 2590, in _BuildLoop
body_result = body(*packed_vars_for_body)
File "C:\Users\dpere013\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\ops\functional_ops.py", line 379, in compute
packed_fn_values = fn(packed_values)
File "C:\Users\dpere013\Desktop\CapsNet-Keras-master\capsulelayers.py", line 127, in <lambda>
inputs_hat = K.map_fn(lambda x: K.batch_dot(x, self.W, [2, 3]), elems=inputs_tiled)
File "C:\Users\dpere013\AppData\Local\Programs\Python\Python35\lib\site-packages\keras\backend\tensorflow_backend.py", line 915, in batch_dot
out = tf.matmul(x, y, adjoint_a=adj_x, adjoint_b=adj_y)
File "C:\Users\dpere013\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\ops\math_ops.py", line 1861, in matmul
a, b, adj_x=adjoint_a, adj_y=adjoint_b, name=name)
File "C:\Users\dpere013\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\ops\gen_math_ops.py", line 708, in _batch_mat_mul
"BatchMatMul", x=x, y=y, adj_x=adj_x, adj_y=adj_y, name=name)
File "C:\Users\dpere013\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "C:\Users\dpere013\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\framework\ops.py", line 2958, in create_op
set_shapes_for_outputs(ret)
File "C:\Users\dpere013\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\framework\ops.py", line 2209, in set_shapes_for_outputs
shapes = shape_func(op)
File "C:\Users\dpere013\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\framework\ops.py", line 2159, in call_with_requiring
return call_cpp_shape_fn(op, require_shape_fn=True)
File "C:\Users\dpere013\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\framework\common_shapes.py", line 627, in call_cpp_shape_fn
require_shape_fn)
File "C:\Users\dpere013\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\framework\common_shapes.py", line 691, in _call_cpp_shape_fn_impl
raise ValueError(err.message)
ValueError: Shapes must be equal rank, but are 1 and 2 for 'digitcaps/map/while/MatMul' (op: 'BatchMatMul') with input shapes: [10,1152,8], [10,1152,16,8].
Matplotlib causes a lot of issues when run on remote servers without X. If not addressed, the program crashes when it tries to save the plots. I managed to resolve this by adding the following line to the start of utils.py (after the imports):
plt.switch_backend('agg')
I tried some other solutions, but this one seems to do the trick (even though I still get a bunch of warnings). Now, I'm aware that this isn't something you would be willing to add to master for sure, but it may be worthwhile to at least mention it in the Readme, or something...
The authors of the paper are:
Sara Sabour, Nicholas Frosst, Geoffrey Hinton
In Hinton's talks about this he is always careful to make clear that while he may have thought of the idea, it was his co-authors who made it work. So I think the README should say Sabour's Capsule Network.
Hi,
First of all, let me compliment you on the swift implementation of CapsNet in Keras. It looks very interesting! I haven't gotten around to testing it myself, but when I was skimming through the source code after reading the CapsNet paper, I noticed the following line, which schedules updates of the learning rate using a Keras callback:
Line 90 in 12aaa59
This made me wonder whether this is also part of their setup in the paper:
Our implementation is in TensorFlow (...) and we use the Adam optimizer with its TensorFlow default parameters, including the exponentially decaying learning rate, ...
As far as I understand Adam, the optimiser already uses exponentially decaying learning rates but on a per-parameter basis. This makes me think no further learning decay is necessary.
Some time soon I plan to run some tests without the additional learning rate decay and see how it changes the results. In any case I'd like to hear your thoughts on this. I don't seem to find anything conclusive as to whether Adam can benefit from additional learning rate decay.
Kind regards,
Dieter
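To make the question above concrete, here is a minimal scalar sketch of the Adam update from Kingma and Ba, using the default hyperparameters (alpha = 0.001, beta1 = 0.9, beta2 = 0.999, eps = 1e-8). The function name `adam_steps` is hypothetical. The exponential decay inside Adam applies to the moving averages of the gradient and its square; the base step size alpha itself stays constant unless an external schedule changes it.

```python
import math

def adam_steps(grads, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the step Adam would take for each gradient in a scalar sequence."""
    m = 0.0  # exponentially decayed average of gradients
    v = 0.0  # exponentially decayed average of squared gradients
    steps = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias correction
        v_hat = v / (1 - beta2 ** t)
        steps.append(alpha * m_hat / (math.sqrt(v_hat) + eps))
    return steps

# With a constant gradient the step size stays at about alpha: nothing decays.
steps = adam_steps([1.0] * 1000)
print(steps[0], steps[-1])
```

Under this reading, an additional schedule on alpha is a separate design choice rather than something Adam already performs.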
Hello sir,
I am following the Capsule Network paper and your implementation.
I have a quick question about the valid padding in the conv2 layer you use to get the output for the Primary Caps. As I understand it, after the first conv layer the output size is (batchsize,20,20,256). If conv2 has 256 kernels of size 9x9 with stride 2, then the output size should satisfy (20-9+2*p)/2+1 = 6. However, this equation has no integer solution for p, so I would like to ask how exactly valid padding works in this situation to give an output of (batchsize,6,6,256).
Thanks !
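A short sketch of the usual convention, assuming "valid" padding means p = 0 and that the division is rounded down (floor). The helper name `conv_output_size` is hypothetical. With floor division the 20-pixel input gives 6 outputs, and the last input row and column are simply never covered by a kernel window.

```python
def conv_output_size(input_size, kernel_size, stride, padding=0):
    """Output length of a convolution along one axis, using floor division."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

conv1_size = conv_output_size(28, kernel_size=9, stride=1)             # 28 -> 20
primary_size = conv_output_size(conv1_size, kernel_size=9, stride=2)   # 20 -> 6

# Start positions of the stride-2 windows along one axis of the 20-wide input.
starts = [i * 2 for i in range(primary_size)]
last_covered = starts[-1] + 9 - 1  # index 18, so index 19 is dropped
print(conv1_size, primary_size, starts, last_covered)
```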
I have 26 classes contained in a dataset with the same pixel dimensions as MNIST and I have 1,000 samples for each class. I've successfully trained a CapsNet that can classify around 12-14 of my 26 classes with 99% accuracy, but any more classes than this and the loss will converge on a single high value and never improve. There seems to be a specific cut-off point where the network architecture fails above a certain number of classes.
I've experimented with increasing the values for the number of dimensions in the PrimaryCaps layer, setting different values for the learning rate, and changing the batch size, but I haven't been able to solve the issue.
Do you have any tips for other things I should try, Xifeng? (I'm training on an AWS p2.xlarge with a Tesla K80 GPU.)
Thanks!
Hi,
Thanks for all the effort. I want to dump the routings to a file, but I'm not sure how to do it (I usually use Theano and am not sure how to do it in Keras-TF). I also tried pickle, but somehow I can't get tf to save non-tf.Variables to a file. I can't make them tf.Variables because they depend on the batch size.
saver = tf.train.Saver({"inputs": inputs_hat, "c":c})
save(tf.get_default_session(), "capsule_dump/vars.ckpt")
If I try to create a variable from them, I hit a similar error again.
ValueError: initial_value must have a shape specified: Tensor("digitcaps_17/map/TensorArrayStack/TensorArrayGatherV3:0", shape=(?, 10, 1152, 16), dtype=float32)
If I try to initiate another session, I get an uninitialized variable error.
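One possible workaround, sketched under the assumption that the tensors have already been evaluated to NumPy arrays (for example with a TensorFlow 1.x `session.run` call on a fed batch). `tf.train.Saver` only handles variables, so intermediate tensors are easier to store as plain arrays. The array names and the shape used for `c_value` are placeholders, not taken from the repository.

```python
import os
import tempfile
import numpy as np

batch = 2
# Placeholders standing in for evaluated tensors.
inputs_hat_value = np.random.rand(batch, 10, 1152, 16).astype("float32")
c_value = np.random.rand(batch, 10, 1152).astype("float32")  # hypothetical shape

# Store both arrays in one .npz archive.
path = os.path.join(tempfile.mkdtemp(), "routing_dump.npz")
np.savez(path, inputs_hat=inputs_hat_value, c=c_value)

# Load them back by name.
with np.load(path) as dump:
    restored = {name: dump[name] for name in dump.files}
print({name: arr.shape for name, arr in restored.items()})
```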
Hi @XifengGuo, I got the following error when plugging in my own dataset (in MNIST ubyte format, but with 223 categories instead of 10):
Traceback (most recent call last):
File "capsulenet.py", line 252, in <module>
train(model=model, data=((x_train, y_train), (x_test, y_test)), args=args)
File "capsulenet.py", line 138, in train
callbacks=[log, tb, checkpoint, lr_decay])
File "/usr/local/lib/python3.5/dist-packages/Keras-2.1.3-py3.5.egg/keras/legacy/interfaces.py", line 91, in wrapper
File "/usr/local/lib/python3.5/dist-packages/Keras-2.1.3-py3.5.egg/keras/engine/training.py", line 2138, in fit_generator
File "/usr/local/lib/python3.5/dist-packages/Keras-2.1.3-py3.5.egg/keras/engine/training.py", line 1428, in _standardize_user_data
File "/usr/local/lib/python3.5/dist-packages/Keras-2.1.3-py3.5.egg/keras/engine/training.py", line 120, in _standardize_input_data
ValueError: Error when checking input: expected input_2 to have shape (221,) but got array with shape (245,)
Do you have any idea how to solve it? Thanks!
Best Wishes,
Chi Kiu
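The traceback above says the label arrays fed to the model do not have the width the label input expects. One common cause, offered here only as an assumption since the data loading code is not shown, is one-hot encoding each split separately so that the width is inferred from the largest label present. A small NumPy sketch with an explicit class count (the helper `one_hot` is hypothetical) avoids that:

```python
import numpy as np

def one_hot(labels, num_classes):
    """One-hot encode integer labels with a fixed, explicit width."""
    labels = np.asarray(labels)
    if labels.min() < 0 or labels.max() >= num_classes:
        raise ValueError("label outside the range [0, num_classes)")
    return np.eye(num_classes, dtype="float32")[labels]

num_classes = 223  # class count mentioned in the post above
y_train = one_hot([0, 5, 220], num_classes)
y_test = one_hot([1, 2], num_classes)
print(y_train.shape, y_test.shape)  # both splits share the same width
```

The model's class count should then be derived from the same number so that the label input and the arrays agree.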
There is a small bug in capsulenet-multi-gpu.py:129
The function returns 3 values, but is assigned to 2. The line should be rewritten, at least, to this:
model, eval_model, _ = CapsNet(input_shape=x_train.shape[1:],
I'm working with Anaconda 4.3.2 and Keras 2.0.9
When I try to run capsnet.py or its multi-GPU version on my own data, I am stuck with the following error:
TypeError: ('Keyword argument not understood:', 'routings')
When I searched online for the error, I found a post saying that this TypeError occurs when 'kwargs' doesn't contain the argument (routings), and/or that serialization may have happened under a different Anaconda distribution, in which case using the same Anaconda distribution fixes the error.
I need your help with this.
After I trained the model successfully, I ran the following command on Google Colab.
!python3 /content/drive/tut_competition/capsulenet_colab.py -t -w /content/result/trained_model.h5
keras version = 2.1.4
tensorflow version = 1.4.0
I got this error:
2018-02-25 22:18:08.222033: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:04.0, compute capability: 3.7)
------------------------------Begin: manipulate------------------------------
Traceback (most recent call last):
File "/content/drive/tut_competition/capsulenet_colab.py", line 278, in <module>
manipulate_latent(manipulate_model, (x_test, y_test), args)
File "/content/drive/tut_competition/capsulenet_colab.py", line 181, in manipulate_latent
x_recon = model.predict([x, y, tmp])
File "/usr/local/lib/python3.6/dist-packages/keras/engine/training.py", line 1824, in predict
check_batch_axis=False)
File "/usr/local/lib/python3.6/dist-packages/keras/engine/training.py", line 123, in _standardize_input_data
str(data_shape))
ValueError: Error when checking : expected input_3 to have shape (15, 16) but got array with shape (10, 16)
Hi,
Great implementation, congratulations. Do you have any idea why it doesn't reach the performance reported in the paper?
Thanks
Manuel
Hi there,
While testing the script, I found that the utils module was not installed, so I installed it with pip install utils (version 0.9.0). After that, it still couldn't import the combine_images function. I have probably installed the wrong module.
Can you please help me with this issue?
I'm using Python : 3.6.3
latest keras and tensorflow
Thank you!
Running the examples on a GeForce 1080 under Linux crashes (restarts) the whole computer every time training starts. While I do believe that power issues are one possibility, strangely this is the only piece of code so far that is able to do it, and the computer has trained many networks before without any issues.
I've tried to do some testing, but it is really annoying since it means restarting everything all the time. Any ideas on how to fix it, how to log the real reason for the crash, or how to rule out power issues by slowing down the code somehow?
Having an EXTREMELY hard time following the below.
What does a convolutional "unit" mean? Where does the 32x6x6 dimension come from? I get the 32, but the 6? Where does the dimension 20x25 come from? The same goes for the 16x10 digit caps dimension...
Are these numbers arbitrarily chosen or derived from operations happening in the net? Please provide a detailed response where possible. I find reading this paper extremely frustrating. Hopefully reading this code will be better.
EDIT: I think I answered some of my own questions by looking at your model diagram but would still love your opinion to make sure I am tracking correctly.
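Most of the dimensions asked about above can be cross-checked against the model summary quoted on this page. The sketch below recomputes the parameter counts from the layer shapes; the helper names are hypothetical, the 9x9 kernels and the 512/1024/784 decoder widths follow the paper, and the 32 comes from splitting the 256 conv channels into 32 groups of 8-D capsules.

```python
def conv2d_params(kernel, in_ch, out_ch):
    """Weights plus one bias per output channel."""
    return kernel * kernel * in_ch * out_ch + out_ch

def dense_params(n_in, n_out):
    """Weights plus one bias per output unit."""
    return n_in * n_out + n_out

conv1 = conv2d_params(9, 1, 256)           # (28, 28, 1) -> (20, 20, 256)
primary = conv2d_params(9, 256, 256)       # (20, 20, 256) -> (6, 6, 256)
n_primary_caps = 6 * 6 * 32                # 256 channels = 32 groups x 8-D
digitcaps = n_primary_caps * 10 * 8 * 16   # one 8x16 matrix per (capsule, class)
decoder = (dense_params(10 * 16, 512)
           + dense_params(512, 1024)
           + dense_params(1024, 28 * 28))
total = conv1 + primary + digitcaps + decoder
print(conv1, primary, n_primary_caps, digitcaps, decoder, total)
```

The results match the summary line by line (20992, 5308672, 1152 capsules, 1474560, 1411344, total 8215568), which suggests the sizes are derived from the convolution arithmetic rather than chosen independently.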
MIT license to match Keras?
from keras import backend as K

def squash(vectors, axis=-1):
    """
    The non-linear activation used in Capsule. It drives the length of a large vector to near 1 and small vector to 0
    :param vectors: some vectors to be squashed, N-dim tensor
    :param axis: the axis to squash
    :return: a Tensor with same shape as input vectors
    """
    # Squared L2 norm along the capsule axis, kept as a broadcastable dimension.
    s_squared_norm = K.sum(K.square(vectors), axis, keepdims=True)
    # |s|^2 / (1 + |s|^2) times 1 / |s|; epsilon avoids division by zero.
    scale = s_squared_norm / (1 + s_squared_norm) / K.sqrt(s_squared_norm + K.epsilon())
    return scale * vectors
Shouldn't the result be [s_squared_norm / (1 + s_squared_norm)] * [vectors / K.sqrt(s_squared_norm + K.epsilon())], per the paper?
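The two expressions are the same product written in a different order, which a NumPy re-implementation can confirm numerically. This is an illustrative sketch rather than the repository code; `eps` stands in for `K.epsilon()` and is assumed to be 1e-7.

```python
import numpy as np

def squash_as_coded(s, eps=1e-7):
    """Scalar factor first, then multiply the vector (the form in the code)."""
    sq = np.sum(np.square(s), axis=-1, keepdims=True)
    scale = sq / (1 + sq) / np.sqrt(sq + eps)
    return scale * s

def squash_as_in_paper(s, eps=1e-7):
    """(|s|^2 / (1 + |s|^2)) * (s / |s|), the form in the question."""
    sq = np.sum(np.square(s), axis=-1, keepdims=True)
    return (sq / (1 + sq)) * (s / np.sqrt(sq + eps))

s = np.array([[3.0, 4.0], [0.01, 0.0]])
a = squash_as_coded(s)
b = squash_as_in_paper(s)
print(np.allclose(a, b), np.linalg.norm(a, axis=-1))
```

The long vector [3, 4] ends up with length 25/26, close to 1, while the short vector shrinks toward 0.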
The paper says in Section 4:
Our implementation is in TensorFlow (Abadi et al. [2016]) and we use the Adam optimizer (Kingma and Ba [2014]) with its TensorFlow default parameters, including the exponentially decaying learning rate
The TensorFlow defaults for Adam are described here:
https://www.tensorflow.org/api_docs/python/tf/train/AdamOptimizer
The current capsulenet.py uses lr_decay as a callback to modify the learning rate, but there isn't any evidence that the paper follows this method. Should the lr_decay callback be removed since Adam already decays the learning rate?
(update: the TensorFlow and Keras defaults for Adam appear to be the same)
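For reference, this is what an epoch-based exponential schedule of the kind a learning-rate callback applies looks like in isolation. The initial rate 0.001 matches the `lr` shown in the argument dump on this page; the decay factor 0.9 and the function name are assumptions made for illustration, not values confirmed by the repository.

```python
def exponential_decay(initial_lr, decay_rate, epoch):
    """Learning rate after `epoch` epochs of multiplicative decay."""
    return initial_lr * decay_rate ** epoch

# 0.9 is a hypothetical decay factor; 0.001 is the lr from the argument dump.
schedule = [exponential_decay(0.001, 0.9, epoch) for epoch in range(5)]
print(schedule)
```

In Keras, a function of this shape (epoch in, learning rate out) is what the `LearningRateScheduler` callback expects, so removing the callback leaves the optimizer's base learning rate fixed for the whole run.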
When I run capsulenet.py, I have this error:
ImportError: Failed to import pydot. You must install pydot and graphviz for pydotprint to work.
But, I have installed pydot and graphviz:
pydot==1.2.3
graphviz==0.8.1
Any idea?
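A likely explanation for this recurring report: the `pydot` and `graphviz` packages installed with pip are Python wrappers, and drawing the model still needs the Graphviz executables (such as `dot`) to be installed on the system and reachable through PATH. The check below uses only the standard library; the function name is hypothetical.

```python
import shutil

def graphviz_dot_available():
    """Return True if the Graphviz `dot` executable can be found on PATH."""
    return shutil.which("dot") is not None

if graphviz_dot_available():
    print("Graphviz `dot` found; plotting the model should be possible.")
else:
    print("Graphviz `dot` not found; install Graphviz with the system "
          "package manager or skip the model plot.")
```

If the executable is missing, another option is to wrap the plotting call in a try/except for ImportError so that training can proceed without the diagram.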
My server has two 1080 Ti GPUs; however, it took almost 10 hours or more to run your code. What should I modify in your code?
As I understand, you reset coupling coefficients after each training sample (batch). Don't you think it would be better to keep their previous state and update them depending on it?
Looks like this repo does not support the latest multi-GPU model which is introduced in Keras 2.0.9. When I do this:
if(num_gpu > 1):
    model = multi_gpu_model(model, gpus=num_gpu)
# compile the model
model.compile(optimizer=optimizers.Adam(lr=args.lr),
              loss=[margin_loss, 'mse'],
              loss_weights=[1., args.lam_recon],
              metrics={'out_caps': 'accuracy'})
It will give me this error, so looks like the input layer does not handle the data well (not sure about this though).
2017-11-10 23:15:25.160851: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla K80, pci bus id: 451d:00:00.0, compute capability: 3.7)
2017-11-10 23:15:25.160892: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:1) -> (device: 1, name: Tesla K80, pci bus id: 7dcb:00:00.0, compute capability: 3.7)
Train on 60000 samples, validate on 10000 samples
2017-11-10 23:15:27.118862: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla K80, pci bus id: 451d:00:00.0, compute capability: 3.7)
2017-11-10 23:15:27.118901: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:1) -> (device: 1, name: Tesla K80, pci bus id: 7dcb:00:00.0, compute capability: 3.7)
Epoch 1/30
2017-11-10 23:15:31.162715: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: Incompatible shapes: [100,1152,10,1,1] vs. [50,1152,10,1,16]
[[Node: replica_0/model_1/digitcaps/mul = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](replica_0/model_1/digitcaps/transpose_1, replica_0/model_1/digitcaps/scan/TensorArrayStack/TensorArrayGatherV3)]]
2017-11-10 23:15:31.162970: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: Incompatible shapes: [100,1152,10,1,1] vs. [50,1152,10,1,16]
[[Node: replica_0/model_1/digitcaps/mul = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](replica_0/model_1/digitcaps/transpose_1, replica_0/model_1/digitcaps/scan/TensorArrayStack/TensorArrayGatherV3)]]
2017-11-10 23:15:31.167090: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: Incompatible shapes: [100,1152,10,1,1] vs. [50,1152,10,1,16]
[[Node: replica_0/model_1/digitcaps/mul = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](replica_0/model_1/digitcaps/transpose_1, replica_0/model_1/digitcaps/scan/TensorArrayStack/TensorArrayGatherV3)]]
2017-11-10 23:15:31.170465: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: Incompatible shapes: [100,1152,10,1,1] vs. [50,1152,10,1,16]
[[Node: replica_0/model_1/digitcaps/mul = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](replica_0/model_1/digitcaps/transpose_1, replica_0/model_1/digitcaps/scan/TensorArrayStack/TensorArrayGatherV3)]]
2017-11-10 23:15:31.170701: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: Incompatible shapes: [100,1152,10,1,1] vs. [50,1152,10,1,16]
[[Node: replica_0/model_1/digitcaps/mul = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](replica_0/model_1/digitcaps/transpose_1, replica_0/model_1/digitcaps/scan/TensorArrayStack/TensorArrayGatherV3)]]
2017-11-10 23:15:31.175048: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: Incompatible shapes: [100,1152,10,1,1] vs. [50,1152,10,1,16]
[[Node: replica_0/model_1/digitcaps/mul = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](replica_0/model_1/digitcaps/transpose_1, replica_0/model_1/digitcaps/scan/TensorArrayStack/TensorArrayGatherV3)]]
Traceback (most recent call last):
File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1323, in _do_call
return fn(*args)
File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1302, in _run_fn
status, run_metadata)
File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes: [100,1152,10,1,1] vs. [50,1152,10,1,16]
[[Node: replica_0/model_1/digitcaps/mul = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](replica_0/model_1/digitcaps/transpose_1, replica_0/model_1/digitcaps/scan/TensorArrayStack/TensorArrayGatherV3)]]
[[Node: training/Adam/gradients/concatenate_2/concat_grad/Slice_1/_309 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:1", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_2229_training/Adam/gradients/concatenate_2/concat_grad/Slice_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:1"]()]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "capsulenet.py", line 215, in <module>
train(model=model, data=((x_train, y_train), (x_test, y_test)), args=args)
File "capsulenet.py", line 113, in train
validation_data=[[x_test, y_test], [y_test, x_test]], callbacks=[log, tb, checkpoint, lr_decay])
File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/keras/engine/training.py", line 1631, in fit
validation_steps=validation_steps)
File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/keras/engine/training.py", line 1213, in _fit_loop
outs = f(ins_batch)
File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 2332, in __call__
**self.session_kwargs)
File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 889, in run
run_metadata_ptr)
File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1120, in _run
feed_dict_tensor, options, run_metadata)
File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1317, in _do_run
options, run_metadata)
File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1336, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes: [100,1152,10,1,1] vs. [50,1152,10,1,16]
[[Node: replica_0/model_1/digitcaps/mul = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](replica_0/model_1/digitcaps/transpose_1, replica_0/model_1/digitcaps/scan/TensorArrayStack/TensorArrayGatherV3)]]
[[Node: training/Adam/gradients/concatenate_2/concat_grad/Slice_1/_309 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:1", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_2229_training/Adam/gradients/concatenate_2/concat_grad/Slice_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:1"]()]]
Caused by op 'replica_0/model_1/digitcaps/mul', defined at:
File "capsulenet.py", line 215, in <module>
train(model=model, data=((x_train, y_train), (x_test, y_test)), args=args)
File "capsulenet.py", line 103, in train
model = multi_gpu_model(model, gpus=num_gpu)
File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/keras/utils/training_utils.py", line 143, in multi_gpu_model
outputs = model(inputs)
File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/keras/engine/topology.py", line 603, in __call__
output = self.call(inputs, **kwargs)
File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/keras/engine/topology.py", line 2061, in call
output_tensors, _, _ = self.run_internal_graph(inputs, masks)
File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/keras/engine/topology.py", line 2212, in run_internal_graph
output_tensors = _to_list(layer.call(computed_tensor, **kwargs))
File "/datadrive/xiaoyzhu/RandomExercise/CapsNet-Keras/capsulelayers.py", line 157, in call
outputs = squash(K.sum(c * inputs_hat, 1, keepdims=True))
File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/tensorflow/python/ops/math_ops.py", line 894, in binary_op_wrapper
return func(x, y, name=name)
File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/tensorflow/python/ops/math_ops.py", line 1117, in _mul_dispatch
return gen_math_ops._mul(x, y, name=name)
File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/tensorflow/python/ops/gen_math_ops.py", line 2726, in _mul
"Mul", x=x, y=y, name=name)
File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
op_def=op_def)
File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
InvalidArgumentError (see above for traceback): Incompatible shapes: [100,1152,10,1,1] vs. [50,1152,10,1,16]
[[Node: replica_0/model_1/digitcaps/mul = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](replica_0/model_1/digitcaps/transpose_1, replica_0/model_1/digitcaps/scan/TensorArrayStack/TensorArrayGatherV3)]]
[[Node: training/Adam/gradients/concatenate_2/concat_grad/Slice_1/_309 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:1", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_2229_training/Adam/gradients/concatenate_2/concat_grad/Slice_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:1"]()]]
Exception ignored in: <bound method BaseSession.__del__ of <tensorflow.python.client.session.Session object at 0x7f16e115a828>>
Traceback (most recent call last):
File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 696, in __del__
File "/datadrive/xiaoyzhu/python3env/lib/python3.5/site-packages/tensorflow/python/framework/c_api_util.py", line 30, in __init__
TypeError: 'NoneType' object is not callable
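One reading of the error above, offered as a sketch rather than a confirmed diagnosis: the two shapes differ only on the batch axis (100 vs. 50), which would be consistent with a batch of 100 being split across two replicas while `c` still carries the full batch size. The helper `broadcast_shape` and the replica count of 2 are assumptions made for illustration; only the shapes and the names `c` and `inputs_hat` come from the traceback.

```python
def broadcast_shape(a, b):
    """Return the broadcast shape of two equal-rank shapes, or raise ValueError."""
    if len(a) != len(b):
        raise ValueError("rank mismatch")
    out = []
    for axis, (x, y) in enumerate(zip(a, b)):
        if x == y or y == 1:
            out.append(x)
        elif x == 1:
            out.append(y)
        else:
            raise ValueError(f"axis {axis}: {x} vs {y}")
    return out


full_batch = 100
num_replicas = 2                      # assumption: two GPUs
per_replica = full_batch // num_replicas

c_shape = [full_batch, 1152, 10, 1, 1]            # first shape in the error
inputs_hat_shape = [per_replica, 1152, 10, 1, 16]  # second shape in the error

try:
    broadcast_shape(c_shape, inputs_hat_shape)
    mismatch = None
except ValueError as err:
    mismatch = str(err)

# If c were built from the per-replica batch instead, the product would broadcast.
fixed = broadcast_shape([per_replica, 1152, 10, 1, 1], inputs_hat_shape)
print(mismatch, fixed)
```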
CapsNet-Keras/capsulelayers.py
Line 171 in 843a765
When I add a second digitcaps layer, training starts to produce a NaN loss. What causes this problem? Thank you so much.
In capsulenet.py, I changed the following code from:
digitcaps = CapsuleLayer(num_capsule=n_class, dim_capsule=16, routings=routings,
name='digitcaps')(primarycaps)
to
digitcaps2 = CapsuleLayer(num_capsule=n_class, dim_capsule=16, routings=routings,
name='digitcaps2')(primarycaps)
digitcaps = CapsuleLayer(num_capsule=n_class, dim_capsule=16, routings=routings,
name='digitcaps')(digitcaps2)
Hi,
I just saw that somewhere in your capsulelayers.py code you wrote a comment:
# b=K.zeros(shape=[batch_size, num_capsule, input_num_capsule]). I just can't get batch_size
In TensorFlow you can get it with tf.shape(inputs)[0]. There is a difference between inputs.shape (the static shape, where the batch dimension may be None) and tf.shape(inputs) (a tensor evaluated at run time). You can also use this to compute inputs_hat more efficiently by tiling W and then using a regular matmul instead of tf.map_fn.
Best,
Chris
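A minimal NumPy sketch of the tiling idea, as a stand-in for the TensorFlow version: the batch size is read at run time, W is tiled along a new batch axis, and a single batched matmul replaces the per-sample map. The tensor layout and the small sizes below are assumptions chosen for illustration, not the repository's actual shapes.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, n_in, d_in, n_out, d_out = 4, 6, 8, 3, 16   # hypothetical sizes

inputs = rng.normal(size=(batch, n_in, d_in))
W = rng.normal(size=(n_out, n_in, d_out, d_in))


def predict_one(x):
    """Per-sample prediction vectors: (n_in, d_in) -> (n_out, n_in, d_out)."""
    return np.einsum("jiod,id->jio", W, x)


# Per-sample loop, playing the role of tf.map_fn.
hat_loop = np.stack([predict_one(x) for x in inputs])

# Tiled version: batch size taken at run time, then one batched matmul.
runtime_batch = inputs.shape[0]
W_tiled = np.tile(W[None], (runtime_batch, 1, 1, 1, 1))  # (batch, n_out, n_in, d_out, d_in)
x_col = inputs[:, None, :, :, None]                      # (batch, 1, n_in, d_in, 1)
hat_tiled = np.matmul(W_tiled, x_col)[..., 0]            # (batch, n_out, n_in, d_out)

print(hat_tiled.shape, np.allclose(hat_loop, hat_tiled))
```

Both paths give the same tensor; the tiled one trades extra memory for a single vectorized call.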
I don't understand how to load weights for the evaluation model. In the code, you build three different models for training, evaluation, and manipulation. Since the evaluation model is not involved in training, how can it be used for testing? We can't load the training weights into the evaluation model, since the two have different structures.
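A toy sketch of one possible explanation, assuming the three models are built from the same layer objects (as is common with the Keras functional API): weights loaded through one model would then be visible through the others. The `Layer` and `Model` classes below are hypothetical stand-ins written in plain Python, not Keras classes.

```python
class Layer:
    """Hypothetical layer holding a single scalar weight."""

    def __init__(self, weight):
        self.weight = weight

    def __call__(self, x):
        return self.weight * x


class Model:
    """Hypothetical model: an ordered list of references to layers."""

    def __init__(self, layers):
        self.layers = layers

    def predict(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

    def load_weights(self, weights):
        for layer, w in zip(self.layers, weights):
            layer.weight = w


encoder = Layer(1.0)
decoder = Layer(1.0)

train_model = Model([encoder, decoder])  # full structure
eval_model = Model([encoder])            # different structure, same encoder object

# Loading weights into the training model changes the shared encoder,
# so the evaluation model sees the new value without loading anything itself.
train_model.load_weights([2.0, 3.0])
print(eval_model.predict(5.0))
```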
Hi Xi,
I want to run the code on Theano. Can you suggest any other implementation of CapsNet that runs on Theano?
Thanks.
I am noticing that even with all the size-controlling parameters turned down, it still takes 20 seconds to complete an epoch. Any idea why that is?
So the summary looks like this:
__________________________________________________________________________________________________
Layer (type)                    Output Shape          Param #     Connected to
==================================================================================================
input_1 (InputLayer)            (None, 28, 28, 1)     0
__________________________________________________________________________________________________
conv1 (Conv2D)                  (None, 20, 20, 1)     82          input_1[0][0]
__________________________________________________________________________________________________
primarycap_conv2d (Conv2D)      (None, 6, 6, 1)       82          conv1[0][0]
__________________________________________________________________________________________________
primarycap_reshape (Reshape)    (None, 36, 1)         0           primarycap_conv2d[0][0]
__________________________________________________________________________________________________
primarycap_squash (Lambda)      (None, 36, 1)         0           primarycap_reshape[0][0]
__________________________________________________________________________________________________
digitcaps (CapsuleLayer)        (None, 10, 1)         360         primarycap_squash[0][0]
__________________________________________________________________________________________________
input_2 (InputLayer)            (None, 10)            0
__________________________________________________________________________________________________
mask_1 (Mask)                   (None, 10)            0           digitcaps[0][0]
                                                                  input_2[0][0]
__________________________________________________________________________________________________
capsnet (Length)                (None, 10)            0           digitcaps[0][0]
__________________________________________________________________________________________________
decoder (Sequential)            (None, 28, 28, 1)     2367        mask_1[0][0]
==================================================================================================
Total params: 2,891
Trainable params: 2,891
Non-trainable params: 0
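The numbers in the summary can be cross-checked with a little arithmetic. This sketch assumes 9x9 kernels with 'valid' padding, stride 1 for conv1 and stride 2 for the primary-capsule convolution, and a capsule layer whose only weights are one transform per (output capsule, input capsule) pair. Those settings are assumptions; they happen to reproduce the figures shown. The decoder count is taken from the summary as given.

```python
def conv2d_params(kernel, in_channels, filters):
    """Weights plus biases of a square-kernel Conv2D layer."""
    return kernel * kernel * in_channels * filters + filters


def conv_out(size, kernel, stride=1):
    """Spatial output size of a 'valid' convolution."""
    return (size - kernel) // stride + 1


conv1_size = conv_out(28, 9)                       # 28 -> 20
primary_size = conv_out(conv1_size, 9, stride=2)   # 20 -> 6
num_primary_caps = primary_size * primary_size     # 6 * 6 capsules of dimension 1

conv1_params = conv2d_params(9, 1, 1)
primary_params = conv2d_params(9, 1, 1)
digitcaps_params = 10 * num_primary_caps * 1 * 1   # num_capsule * input_num_capsule * dims
decoder_params = 2367                              # copied from the summary

total = conv1_params + primary_params + digitcaps_params + decoder_params
print(conv1_size, primary_size, num_primary_caps, total)
```

A parameter count this small does not by itself say how long an epoch takes, since the routing loop and data pipeline still run for every batch.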
Hi, I wonder why I'm getting an accuracy of 0 on the test set. I did not change the code. I trained from scratch and then loaded the trained_model.h5 file as instructed.
I tried running for 200 epochs a few times. The performance stabilizes somewhere in the 30-50 epoch range. The paper shows 1250 epochs, so something is different. I can't reproduce the result either: the median error with 3 routing iterations and reconstruction loss is 0.36%, far outside the 0.25 ± 0.005% reported. Maybe this is because the learning rate decays too fast? I think the more interesting result in the paper is the robustness to affine transformations, but it's odd that the raw reported MNIST error can't be reproduced.
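To get a feel for how quickly a per-epoch exponential schedule shrinks the learning rate, here is a small sketch. The initial rate of 0.001 and the decay factor of 0.9 are assumed values used for illustration; substitute whatever the training script was actually run with.

```python
initial_lr = 0.001   # assumed starting learning rate
decay = 0.9          # assumed per-epoch decay factor


def lr_at(epoch, lr0=initial_lr, rate=decay):
    """Learning rate after `epoch` epochs of exponential decay."""
    return lr0 * rate ** epoch


schedule = {epoch: lr_at(epoch) for epoch in (0, 10, 30, 50)}
for epoch, lr in schedule.items():
    print(f"epoch {epoch:3d}: lr = {lr:.2e}")
```

Under these assumed values the rate falls below 5% of its starting point by epoch 30 and below 1% by epoch 50, which would fit with progress stalling in that range.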
When I run the script 'capsulenet-multi-gpu.py', an error occurs:
from PIL import Image
ImportError: No module named 'PIL'
How can I add the PIL module?
Thanks for your attention!
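The PIL module is provided by the Pillow package, so installing it with pip install Pillow in the same environment that runs the script normally resolves this import. A small check, using only the standard library, for whether the module is visible to the current interpreter (the helper name `has_module` is made up for this sketch):

```python
import importlib.util


def has_module(name):
    """Return True if `name` can be imported by the current interpreter."""
    return importlib.util.find_spec(name) is not None


if has_module("PIL"):
    print("PIL is importable")
else:
    print("PIL not found; try: pip install Pillow")
```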
Thanks for the amazing work you've done. I adapted dynamic routing from your code, and I want to share some of my ideas about it. Here is my repo, written in TensorFlow.
bias updating
You mentioned that you fix the bias to 0, but you are updating it during dynamic routing, is that so? Code: here and here.
In my opinion, the bias should not be updated, since it is just the initial value for dynamic routing. With your implementation, the bias is updated every time data is fed in, even when the Variable is set to trainable=False, and the same goes for the testing procedure. I think the easiest fix is to make a temporary variable, temp_bias = bias, and use it for dynamic routing.
bias summing
Code here. It seems that you are trying to keep the shape of the bias as [num_caps, 10], and you sum over all the training examples. I think that's problematic. The paper mentions that the bias is independent of the image, but during routing the capsule predictions from the layer below vary from image to image, so the updated bias should differ too. After the update, the shape of the bias should be [batch_size, caps, 10].
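A minimal NumPy sketch of the two points above, under stated assumptions: the stored bias is only copied (one copy per sample) at the start of routing, and the routing logits are updated per sample, so they end up with shape [batch_size, caps, 10]. Function names, the tensor layout, and the sizes are hypothetical; this illustrates the idea and is not the code from either repository.

```python
import numpy as np


def squash(s, axis=-1, eps=1e-9):
    """Shrink vectors to length < 1 while keeping their direction."""
    sq = np.sum(s * s, axis=axis, keepdims=True)
    return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)


def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def route(u_hat, prior_logits, num_iters=3):
    """u_hat: (batch, n_in, n_out, d_out); prior_logits: (n_in, n_out)."""
    batch = u_hat.shape[0]
    # Temporary per-sample copy; the stored prior is never written to.
    b = np.tile(prior_logits[None], (batch, 1, 1))        # (batch, n_in, n_out)
    for it in range(num_iters):
        c = softmax(b, axis=2)                             # couple over output capsules
        s = np.sum(c[..., None] * u_hat, axis=1)           # (batch, n_out, d_out)
        v = squash(s)
        if it < num_iters - 1:
            # Agreement between predictions and outputs, kept per sample.
            b = b + np.sum(u_hat * v[:, None], axis=-1)
    return v, b


rng = np.random.default_rng(0)
prior = np.zeros((6, 3))                 # stored bias, fixed at 0
u_hat = rng.normal(size=(4, 6, 3, 16))   # hypothetical prediction vectors
v, b = route(u_hat, prior)
print(v.shape, b.shape, prior.any())
```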
I tried 3 iterations of dynamic routing; after fewer than 4 epochs (2k iterations) the validation accuracy is 99.16%, so it seems to be working, though still not as efficient as the paper says.
But I have a big problem: training is slow, almost 2 s per iteration with batch_size 100 on an Nvidia 1060, which is far slower than yours.
These are just some of my ideas; glad to discuss them with you.
Best.