I like deep neural nets.
karpathy / neuraltalk
NeuralTalk is a Python+numpy project for learning Multimodal Recurrent Neural Networks that describe images with sentences.
Hi Andrej,
I am interested in using your algorithm on some new data, where each image is associated with one sentence. Is there a convenient way to generate the JSON file as in your examples (Flickr8k, etc.)? What is the structure of the JSON, and is there any way to avoid using the JSON format?
Thanks!
Wei
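A minimal sketch of building such a file from (image, sentence) pairs. The field names (images, filename, imgid, split, sentences, tokens, raw, sentid, sentids) are assumptions modeled on the Flickr8k example file, and build_dataset is a hypothetical helper, so compare the output against the example JSON before training.

```python
import json

def build_dataset(pairs, split="train", name="mydata"):
    """Build a dataset dict from (filename, sentence) pairs.

    Field names are assumptions modeled on the Flickr8k example file.
    """
    images = []
    for imgid, (filename, sentence) in enumerate(pairs):
        # Very simple tokenization: lowercase, drop periods, split on spaces.
        tokens = sentence.lower().replace(".", " ").split()
        images.append({
            "filename": filename,
            "imgid": imgid,
            "split": split,
            "sentids": [imgid],  # one sentence per image in this sketch
            "sentences": [{
                "tokens": tokens,
                "raw": sentence,
                "imgid": imgid,
                "sentid": imgid,
            }],
        })
    return {"dataset": name, "images": images}

pairs = [("a.jpg", "A dog runs in a park."), ("b.jpg", "A kite in the sky.")]
dataset = build_dataset(pairs)
dataset_json = json.dumps(dataset, indent=2)
```

Writing dataset_json to a file gives one JSON document for the whole dataset. The split value would need to vary per image if you want validation and test sets.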
I extracted features using the C++ utility in Caffe, but I do not know how to use them in predict_on_images.py. Does the code accept only the vgg_feats.mat format?
Please help. In the LevelDB directory I have the following files:
MANIFEST-000002, 000003.log, LOG, CURRENT, LOCK.
Please reply as soon as possible.
Thanks for making the code public. This is great work!
This issue is not about the code; I am a little confused about equation 11 in the paper, since the relevant code is not available.
What does i indicate in that equation, and does t refer to the index of an image fragment or a sentence fragment? Also, would maximizing this term make more sense? I would really appreciate it if you could point out my misunderstandings here. Thanks!
For extra street cred, please add an R. Kelly "real talk" meme photo to the README.
Hi Andrej,
Thank you very much for open sourcing the code!
Your paper talks about MRFs for decoding text segment alignments to images, but I couldn't find any code related to that. Am I missing something?
Thanks
Pradeep.
Hi,
When I run the gradient check for Ws, it prints "VAL SMALL WARNING". I printed the numerical and analytical gradients for this case and found that the numerical gradients are exactly zero, while the analytical gradients are on the order of 1e-12.
This confuses me. Numerical gradients of zero suggest that some words are not in the batch, so changing their vectors does not affect the cost (in grad_check, we add delta to the word vectors). However, the analytical gradients are not zero, which suggests that these words do appear in the batch and that their vectors are updated.
Why will this happen?
Thanks.
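For reference, a minimal sketch of a central-difference gradient check that flags this case. It is not the repository's grad check: the function name, the thresholds, and the "VAL SMALL" label are assumptions chosen to mirror the warning quoted above. The toy cost ignores the third parameter, so its numerical gradient is exactly zero, while the analytic value is set to a tiny non-zero number to reproduce the pattern in the question.

```python
import numpy as np

def grad_check(f, x, analytic, delta=1e-5, tiny=1e-10):
    """Compare an analytic gradient with central differences, entry by entry."""
    report = []
    for i in range(x.size):
        old = x.flat[i]
        x.flat[i] = old + delta
        f_plus = f(x)
        x.flat[i] = old - delta
        f_minus = f(x)
        x.flat[i] = old  # restore the parameter
        num = (f_plus - f_minus) / (2 * delta)
        ana = analytic.flat[i]
        if abs(num) < tiny and abs(ana) < tiny:
            # Both values are too small for a relative error to be meaningful.
            report.append((i, "VAL SMALL"))
        else:
            rel = abs(num - ana) / (abs(num) + abs(ana))
            report.append((i, "OK" if rel < 1e-4 else "MISMATCH"))
    return report

x = np.array([1.0, -2.0, 0.0])
f = lambda v: float(np.sum(v[:2] ** 2))  # the third entry never affects the cost
analytic = np.array([2.0, -4.0, 1e-12])  # tiny non-zero value, chosen for illustration
report = grad_check(f, x, analytic)
```

A relative error between an exact zero and 1e-12 would be 1.0, which is why checks of this kind usually treat the pair as "too small to judge" rather than as a mismatch.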
Hi Andrej & Fei-fei,
I've been playing around with this and reading through the code -- many thanks for making it wonderful code to read! I was under the impression that it used pretrained word vector embeddings from Mikolov et al:
....but I don't see any evidence in the code where these vectors are loaded in. Are the word embeddings learned from scratch or are they in fact initialized in some way?
Many thanks!
chris moody
Hi,
Is this the same script as Moses's multi-bleu.perl? I see that there are some modifications to the original version. I have been investigating why the BLEU-2/3/4 scores of my baseline model (Google NIC with VGG-E) are really low, and I found that we are not using the same evaluation scripts. I know that this task is different from machine translation, though. So, my questions are,
Thanks in advance.
Hello Andrej,
Great work!
Is it possible to get the bounding box associated with words? Or is that part of the alignment/retrieval model?
Thanks!
This code line:
features = out[net.outputs[0]].squeeze(axis=(2,3))
Has to be this in order to work with the newest caffe:
features = out[net.outputs[0]]
Additionally remove this line:
caffe.set_phase_test()
Thank you for releasing this code!
It helps me a lot!
I am interested in your CVPR paper, Deep Visual-Semantic Alignments for Generating Image Descriptions, but I did not find anything about visual-semantic alignments in the released code. Have I missed something? Thanks!
~/tf/neuraltalk-master$ python eval_sentence_predictions.py
usage: eval_sentence_predictions.py [-h] [-b BEAM_SIZE]
[--result_struct_filename RESULT_STRUCT_FILENAME]
[-m MAX_IMAGES] [-d DUMP_FOLDER]
checkpoint_path
eval_sentence_predictions.py: error: too few arguments
When I run this script I get this error. Is it a problem with the checkpoint path, or something else? Thank you.
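The usage message lists checkpoint_path without brackets, which in argparse means a required positional argument, so the script exits when it is called with no arguments. A minimal sketch of that behavior follows. The long option name --beam_size and the path path/to/checkpoint.p are placeholders, not taken from the script.

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(prog="eval_sentence_predictions.py")
    # Positional argument: required, which is why calling with no arguments fails.
    parser.add_argument("checkpoint_path", help="path to a saved model checkpoint")
    # The long option name is an assumption; only -b appears in the usage text.
    parser.add_argument("-b", "--beam_size", type=int, default=1)
    return parser

parser = build_parser()
# "path/to/checkpoint.p" is a placeholder for a real checkpoint file.
args = parser.parse_args(["path/to/checkpoint.p", "-b", "5"])
```

On the command line this corresponds to passing the checkpoint file after the script name, for example python eval_sentence_predictions.py path/to/checkpoint.p.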
Other approaches such as [Show and Tell] use a We matrix for word embedding and optimize We, but in neuraltalk I found that it directly optimizes Ws, in which each row represents a word. Why do it this way, and which way performs better?
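One way to compare the two formulations: multiplying a one-hot word vector by an embedding matrix selects a single row, so it gives the same result as indexing that row directly. The sketch below only illustrates this identity with made-up sizes. It makes no claim about how either codebase is implemented.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 5, 3
Ws = rng.standard_normal((vocab_size, embed_dim))  # one row per word

word_ix = [2, 0, 4]
looked_up = Ws[word_ix]                # direct row lookup
one_hot = np.eye(vocab_size)[word_ix]  # (3, vocab_size) one-hot rows
projected = one_hot @ Ws               # embedding written as a matrix product
same = bool(np.allclose(looked_up, projected))
```

Under this view, learning the rows of the lookup table and learning an embedding matrix applied to one-hot inputs optimize the same parameters.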
How exactly would I go about getting a trained model's prediction on an image (in some raw format) that I have?
In this code, I can only find how to use images and their description sentences to train a multimodal RNN. I don't see any functions that use the regions and snippets to train the model, as in Figure 5 or Section 4.3 of the paper.
How can I train my own model? How can I get results like those in Figure 5?
usage: predict_on_images.py [-h] [-r ROOT_PATH] [-b BEAM_SIZE] checkpoint_path
predict_on_images.py: error: the following arguments are required: checkpoint_path
An exception has occurred, use %tb to see the full traceback.
This error occurred. What should I do?
Would it be possible to use this code to accept a sentence input, and output the most likely sentence, in order to sustain dialogue, instead of a picture input and sentence output? I believe there is a paper on this. Sorry this is not an issue, didn't know where to comment.
Hello, I recently read your paper, and I very much appreciate you sharing your code here.
Your paper says that you first extract the top regions detected by RCNN and then compute CNN features for them, but I do not see that object detection step in your implementation. Neither the training nor the test phase seems to use object detection. Is that because it still works fine using the whole image?
Thank you.
I am training it on a new dataset, and I get this error when it saves a checkpoint:
36/1850 batch done in 2.356s. at epoch 0.97. loss cost = 9.295156, reg cost = 0.000000, ppl2 = 4.59 (smooth 14.32)
evaluating val performance in batches of 100
Traceback (most recent call last):
File "driver.py", line 315, in <module>
main(params)
File "driver.py", line 232, in main
val_ppl2 = eval_split('val', dp, model, params, misc) # perform the evaluation on VAL set
File "/root/neuraltalk/imagernn/imagernn_utils.py", line 38, in eval_split
ppl2 = 2 ** (logppl / logppln)
ZeroDivisionError: integer division or modulo by zero
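The failing line divides by logppln, so the error means logppln was an integer 0 when the evaluation finished. One reading, inferred only from the traceback, is that no tokens were counted, for example because the new dataset has no images assigned to the val split. A minimal sketch of a guard that turns this into a clearer error (perplexity is a hypothetical helper, not a function in the repository):

```python
def perplexity(logppl, logppln):
    """Return 2 ** (average negative log2 probability per token)."""
    if logppln == 0:
        # Nothing was evaluated, so the average is undefined.
        raise ValueError("no tokens were evaluated; is the 'val' split empty?")
    return 2 ** (logppl / float(logppln))

ppl2 = perplexity(6.0, 3)  # 2 ** 2.0
```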
I think there is a problem with the bicubic implementation.
The output image contains obvious artifacts if you visualize it.
It is definitely not the same as Matlab's imresize or OpenCV's resize (INTER_CUBIC).
I guess the vgg_feats.mat inside example_images was produced by this function.
The results produced by py_caffe_feat_extract were also slightly different from the ones produced with OpenCV's cubic resize.
I hope someone can fix the bug in the bicubic implementation some day.
Thanks a lot.
In lstm_generator.py, line 71, Hin[t,1:1+d] = X[t], and line 72, Hin[t,1+d:] = prev, should be exchanged,
because the hidden size is d, which is the dimension of prev.
But I don't know why it doesn't raise an error. Can anyone explain this?
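One possible explanation, sketched below under the assumption that the input encoding size equals the hidden size d: both slices then have length d, so either assignment order fits and NumPy has nothing to complain about. With unequal sizes the same slicing does raise a broadcast error. The variable names follow the two lines quoted above, and the sizes are made up.

```python
import numpy as np

d = 4           # hidden size
input_size = 4  # assumed equal to d in this sketch
Hin = np.zeros((1, 1 + input_size + d))
x = np.ones(input_size)
prev = np.full(d, 2.0)

Hin[0, 0] = 1.0        # bias term
Hin[0, 1:1 + d] = x    # slice has length d; fits because input_size == d
Hin[0, 1 + d:] = prev  # remaining slice has length input_size == d

# With different sizes the same slicing raises a broadcast error.
input_size2 = 6
Hin2 = np.zeros((1, 1 + input_size2 + d))
try:
    Hin2[0, 1:1 + d] = np.ones(input_size2)  # 6 values into a slice of length 4
    raised = False
except ValueError:
    raised = True
```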
Hi,
When I try to train the RNN model, the performance is quite poor with default parameters since the default values are tuned for LSTM.
So, could you please share the tuned hyperparameters for RNN model?
Thanks.
I think the features are not being written to the VGG file. I tried running "py_caffe_feat_extract.py" with the code you provided in the matlab_features_reference and python_features folders separately, and both give me a file of the same size. Please help me with extracting CNN features from images.
it seems it doesn't use GPU by default
Hi Andrej,
I really love this implementation.
The most intriguing part to me is your monitorcv to visualize the cross-validation. It could help a lot during training.
In the code, I found that it can show up to 40 results with different host names, but my computer has only one hostname (using Python's gethostname).
I bet it's my lack of related knowledge.
I guess we could run on separate hosts (with different parameters or models) using the same computer, right?
Could you please give some instructions on how to do so?
Thank you so much.
Best,
-Ethan
I ran driver.py and received an error as above. Am I missing something?
I created a coco_sample
directory containing the following files.
COCO_val2014_000000463825.jpg
I ran the following command.
python predict_on_images.py coco_sample/model_checkpoint_coco_visionlab43.stanford.edu_lstm_11.14.p -r coco_sample
I got an error message as below.
parsed parameters:
{
"beam_size": 1,
"checkpoint_path": "coco_sample/model_checkpoint_coco_visionlab43.stanford.edu_lstm_11.14.p",
"root_path": "coco_sample"
}
loading checkpoint coco_sample/model_checkpoint_coco_visionlab43.stanford.edu_lstm_11.14.p
image 0/123287:
/home/ec2-user/neuraltalk/imagernn/lstm_generator.py:227: RuntimeWarning: overflow encountered in exp
IFOGf[t,:3*d] = 1.0/(1.0+np.exp(-IFOG[t,:3*d]))
PRED: (-14.587771) a man and a woman sitting on a bench in the middle of a park
image 1/123287:
Traceback (most recent call last):
File "predict_on_images.py", line 109, in <module>
main(params)
File "predict_on_images.py", line 66, in main
img['local_file_path'] =img_names[n]
IndexError: list index out of range
Isn't it possible to run predict_on_images.py
on a few images?
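The log prints "image 0/123287" even though the folder holds one image, so the loop length appears to come from somewhere other than the image list, possibly a feature file computed for a much larger image set. That reading is an inference from the traceback, not something confirmed by the script. A minimal sketch of a check that would fail early with a clearer message (check_image_count is a hypothetical helper):

```python
def check_image_count(img_names, num_features):
    """Fail early if the image list and the feature matrix disagree in length."""
    if len(img_names) != num_features:
        raise ValueError(
            "found %d image names but %d feature columns"
            % (len(img_names), num_features)
        )
    return len(img_names)

names = ["COCO_val2014_000000463825.jpg"]
try:
    check_image_count(names, 123287)
    message = ""
except ValueError as err:
    message = str(err)
```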
@karpathy Thanks for open sourcing your image-to-sentences work. I got the code up & running with the Flickr30K dataset but encountered a runtime warning
" RuntimeWarning: overflow encountered in exp"
I have fixed it locally by using the scipy.special.expit function. I have attached the patch below in case you want to "cherry-pick" my commit. Let me know if this patch is useful to you and whether you'd like me to make a PR with the fix:
From d3b8d3401a7ebeae1aff88538f1f5eff440b31cf Mon Sep 17 00:00:00 2001
From: Vimal Thilak
Date: Wed, 3 Dec 2014 15:16:28 -0800
Subject: [PATCH] [bugfix] Fix overflow runtime warning
imagernn/lstm_generator.py | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/imagernn/lstm_generator.py b/imagernn/lstm_generator.py
index 011e333..af6797f 100644
--- a/imagernn/lstm_generator.py
+++ b/imagernn/lstm_generator.py
@@ -1,5 +1,6 @@
 import numpy as np
 import code
+import scipy.special
 from imagernn.utils import initw
@@ -75,7 +76,7 @@ class LSTMGenerator:
 IFOG[t] = Hin[t].dot(WLSTM)
 # non-linearities
-IFOGf[t,:3*d] = 1.0/(1.0+np.exp(-IFOG[t,:3*d])) # sigmoids; these are the gates
+IFOGf[t,:3*d] = scipy.special.expit(IFOG[t, :3*d]) #1.0/(1.0+np.exp(-IFOG[t,:3*d])) # sigmoids; these are the gates
 IFOGf[t,3*d:] = np.tanh(IFOG[t, 3*d:]) # tanh
@@ -224,7 +225,7 @@ class LSTMGenerator:
 C = np.zeros((1, d))
 Hout = np.zeros((1, d))
 IFOG[t] = Hin[t].dot(WLSTM)
-IFOGf[t,:3*d] = 1.0/(1.0+np.exp(-IFOG[t,:3*d]))
+IFOGf[t,:3*d] = scipy.special.expit(IFOG[t,:3*d]) # 1.0/(1.0+np.exp(-IFOG[t,:3*d]))
2.0.1
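For readers who prefer not to add a SciPy dependency, a NumPy-only sketch of a sigmoid that avoids the overflow warning. It is an alternative to the patch above rather than part of it, and stable_sigmoid is a hypothetical name. The idea is to never call exp on a large positive number: for negative inputs the sigmoid is rewritten as exp(x) / (1 + exp(x)).

```python
import numpy as np

def stable_sigmoid(x):
    """Elementwise 1 / (1 + exp(-x)) without overflow for large |x|."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0
    # For x >= 0, exp(-x) is at most 1, so it cannot overflow.
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
    # For x < 0, exp(x) is below 1, so this form is safe as well.
    ex = np.exp(x[~pos])
    out[~pos] = ex / (1.0 + ex)
    return out

vals = stable_sigmoid(np.array([-1000.0, 0.0, 1000.0]))
```

Note that scipy.special.expit(x) computes the sigmoid of x itself, so it replaces 1.0/(1.0+np.exp(-x)) without negating the argument.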
Let's create a wiki and write how these things work. Only source code limits the potential of the open source potential of this project. Share the knowledge for it to grow.
When I tried to run the Python script python_features/extract_features.py
today, I ran into the following problem:
Traceback (most recent call last):
File "./extract_features.py", line 102, in <module>
net = caffe.Net(args.model_def, args.model)
Boost.Python.ArgumentError: Python argument types in
Net.__init__(Net, str, str)
did not match C++ signature:
__init__(boost::python::api::object, std::string, std::string, int)
__init__(boost::python::api::object, std::string, int)
I then searched for this error and found the same issue on Caffe's issue page: Caffe#1905. I think the error is caused by an update to Caffe's API.
So I changed the code at extract_features.py#101 to: net = caffe.Net(args.model_def, args.model, caffe.TEST)
It worked, but a new problem appeared:
Traceback (most recent call last):
File "./extract_features.py", line 102, in <module>
caffe.set_phase_test()
AttributeError: 'module' object has no attribute 'set_phase_test'
I think the reason is that some of the API calls in python_features/extract_features.py
are outdated.
Hi Andrej,
Is there a limit to the size of the descriptive sentences? Has it been tried with multiple sentences each describing different features of the image? For example, if an image had a descriptor "A dog in a park. A kite in the sky." could it generate two sentences if the training data was in a similar format? OR is it better to split the descriptive sentences into several single sentence examples and show the same image for each (ie. image A: dog in a park, image A: kite in the sky).
Also, is the matlab feature extractor GPU enabled?
Thanks!
When I am evaluating and predicting on the datasets called example_images given by you after training flickr8k images, I get all the wrong outputs. For each of the images, the prediction is incorrect. Why is this happening?
Hi Andrej,
I have been learning a ton about RNNs and their implementation from looking through your code. I have a (perhaps silly) question about your dropout implementation. You claim that your code creates a mask that drops a fraction, drop_prob, of the units and then scales the remaining units by 1/(1-drop_prob). This doesn't seem correct to me since you are sampling using np.random.randn, which seems to sample from a normal distribution of mean 0 and variance 1.
For example, if you set drop_prob=1 (and ignore the fact that this makes your scale factor infinite) then you should be dropping all the units, but in reality you will be testing the boolean condition np.random.randn(some_shape)<(1-drop_prob). Since np.random.randn gives you negative values half the time (on average) you will only drop half the units (on average).
It seems like you want to be sampling from a uniform distribution from 0 to 1 in order for this to work properly.
Best,
Sam
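A minimal sketch of the uniform-sampling version suggested above, next to the normal-sampling version for comparison. The function name and array sizes are made up for illustration. With drop_prob = 0.5, uniform sampling keeps about half of the units, while comparing standard normal samples with 0.5 keeps roughly 69% of them.

```python
import numpy as np

def dropout_mask(shape, drop_prob, rng):
    """Inverted dropout mask: 0 for dropped units, 1/(1-drop_prob) for kept ones."""
    keep_prob = 1.0 - drop_prob
    # Uniform samples in [0, 1) fall below keep_prob with probability keep_prob.
    return (rng.random(shape) < keep_prob) / keep_prob

rng = np.random.default_rng(0)
mask = dropout_mask((1000, 100), 0.5, rng)
kept_fraction = float((mask > 0).mean())

# Same comparison with standard normal samples keeps too many units.
randn_kept = float((rng.standard_normal((1000, 100)) < 0.5).mean())
```

Multiplying activations by mask leaves their expected value unchanged, which is the purpose of the 1/(1-drop_prob) scaling.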
It seems like, even when a checkpoint file is passed into --init_model_from
argument, it starts from epoch 0.00 and acts like the initial model was never even passed in.
Hello..
Thanks for the code and the very helpful README files.
I tried to run predict_on_images.py on the examples folder you provided, but got this error:
C:\neuraltalk-master>python predict_on_images.py
usage: predict_on_images.py [-h] [-r ROOT_PATH]
predict_on_images.py: error: too few arguments
I would appreciate any help ...
Regards
training with flickr8k aborts:
253/15000 batch done in 5.037s. at epoch 0.84. loss cost = 37.447347, reg cost = 0.000001, ppl2 = 26.10 (smooth 48.09)
254/15000 batch done in 5.082s. at epoch 0.85. loss cost = 39.408169, reg cost = 0.000001, ppl2 = 29.19 (smooth 47.91)
255/15000 batch done in 4.914s. at epoch 0.85. loss cost = 140.730310, reg cost = 0.000001, ppl2 = 237360.65 (smooth 2421.03)
Aboring, cost seems to be exploding. Run gradcheck? Lower the learning rate?
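Besides lowering the learning rate, a common remedy for a cost that suddenly explodes is to clip gradients before the parameter update. The sketch below shows elementwise clipping on a dictionary of gradients. It is a generic illustration under assumed names (clip_grads, the "W" key, the threshold 5.0), not a description of what driver.py does.

```python
import numpy as np

def clip_grads(grads, clip):
    """Clip every gradient entry to the range [-clip, clip]."""
    return {name: np.clip(g, -clip, clip) for name, g in grads.items()}

grads = {"W": np.array([0.5, -12.0, 30.0])}  # made-up gradient values
clipped = clip_grads(grads, 5.0)
```

Clipping bounds the size of any single update, so one bad batch is less likely to throw the parameters far from a good region.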
When I train on a batch (batch size 100) of Flickr8k, it takes around 0.8 seconds. However, when I train on a batch (still 100) of Flickr30k, it takes around 7 seconds.
Why?
Hi, I would like to work on image captioning. I have used a novel approach for image segmentation, and now I would like to use the segmented images as a preprocessing step for image captioning. Can you give me an idea for my next step? And, if possible, may I have Matlab code for captioning?