karpathy / neuraltalk

NeuralTalk is a Python+numpy project for learning Multimodal Recurrent Neural Networks that describe images with sentences.

Python 80.29% Perl 5.18% HTML 11.42% MATLAB 1.97% CSS 0.29% JavaScript 0.84%

neuraltalk's Introduction

I like deep neural nets.

neuraltalk's Issues

How to generate json for new data?

Hi Andrej,
I am interested in using your algorithm on some new data. Basically, each image is associated with one sentence. Is there a convenient way to generate the json file as in your examples (Flickr8k, etc.)? What is the structure of the json, and is there any way to avoid using the json format?

Thanks!
Wei
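
A minimal sketch of how such a file could be generated for one-caption-per-image data. The field names below ("images", "filename", "imgid", "split", "sentids", "sentences", "raw", "tokens") are an assumption about the layout of the example dataset files and should be checked against the Flickr8k json before use. The helper name `build_dataset` and the simple tokenizer are hypothetical.

```python
import json

def build_dataset(pairs, split="train", name="mydata"):
    """Build a dataset dict from (image_filename, caption) pairs.

    Assumed layout (verify against the example json): a top-level
    "images" list whose entries carry a "sentences" list of token lists.
    """
    images = []
    for imgid, (filename, caption) in enumerate(pairs):
        # Naive tokenizer: lowercase, strip periods and commas, split on spaces.
        tokens = caption.lower().replace(".", " ").replace(",", " ").split()
        sentence = {"raw": caption, "tokens": tokens,
                    "imgid": imgid, "sentid": imgid}
        images.append({
            "filename": filename,
            "imgid": imgid,
            "split": split,          # e.g. "train", "val" or "test"
            "sentids": [imgid],      # one sentence per image
            "sentences": [sentence],
        })
    return {"dataset": name, "images": images}

dataset = build_dataset([("dog.jpg", "A dog runs in a park."),
                         ("kite.jpg", "A kite in the sky.")])
dataset_json = json.dumps(dataset, indent=2)  # write this string to a .json file
```

With only one sentence per image, every image simply gets a one-element "sentences" list.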

Can we use levelDB features extracted using C++ utility at the place of vgg_feats.mat?

I extracted features using the C++ utility in Caffe, but I do not know how to use them in predict_on_images.py. Does the code accept only the vgg_feats.mat format?
Please help. In the LevelDB directory, I have the following files:
MANIFEST-000002, 000003.log, LOG, CURRENT, LOCK.
Kindly reply as soon as possible.

Confusion about an equation in the paper

Thanks for making the code public. This is a great work!
This issue is not about the code, but I am a little confused about the 11th equation in the paper, since the relevant code is not available.

What does i indicate in that equation, and does t refer to the index of an image fragment or a sentence fragment? Also, would maximizing this term make more sense? I would really appreciate it if you could point out my misunderstandings here. Thanks!

R Kelly

For extra street cred, please adopt an R Kelly "real talk" meme photo in the Readme.

MRFs for text segment alignments

Hi Andrej,
Thank you very much for open sourcing the code!
Your paper talks about MRFs for decoding text segment alignments to images, but I couldn't find any code related to that. Am I missing something?

Thanks
Pradeep.

Problems in gradient check

Hi,

When I run the gradient check for Ws, it prints "VAL SMALL WARNING". I printed the numerical and analytical gradients for this case and found that the numerical gradients are exactly zero, while the analytical gradients are on the order of 1e-12.

This confuses me. A numerical gradient of zero means the word is not in the batch, so changing its vector does not affect the cost (in grad_check, we add delta to the word vectors). However, a nonzero analytical gradient means the word does appear in the batch and its vector is updated.

Why will this happen?

Thanks.
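
For reference, here is a minimal sketch of a central-difference gradient check with a small-value guard. It is not the repository's grad_check code: the function names, the `tiny` threshold and the status strings other than "VAL SMALL WARNING" are hypothetical. It illustrates why a parameter that does not influence the cost gets a numerical gradient of exactly zero, and why a pair such as (1e-12, 0) is flagged rather than compared by relative error.

```python
def numerical_grad(f, x, i, delta=1e-5):
    """Central-difference estimate of d f / d x[i] for a list x."""
    old = x[i]
    x[i] = old + delta
    f_plus = f(x)
    x[i] = old - delta
    f_minus = f(x)
    x[i] = old  # restore the original value
    return (f_plus - f_minus) / (2.0 * delta)

def compare(analytic, numeric, tiny=1e-10):
    """Return (status, relative_error); `tiny` is a hypothetical threshold."""
    if abs(analytic) < tiny and abs(numeric) < tiny:
        # Both values are negligible, so a relative error would be meaningless.
        return "VAL SMALL WARNING", 0.0
    rel = abs(analytic - numeric) / (abs(analytic) + abs(numeric))
    return ("OK" if rel < 1e-2 else "PROBLEM"), rel

# Toy cost that only depends on x[0]; x[1] plays the role of an unused word vector entry.
cost = lambda x: x[0] ** 2
x = [3.0, 5.0]
g_used = numerical_grad(cost, x, 0)    # close to the analytic value 2 * 3.0
g_unused = numerical_grad(cost, x, 1)  # exactly 0.0: the cost never reads x[1]
status, rel_err = compare(1e-12, g_unused)
```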

Transfer Learning with word2vec?

Hi Andrej & Fei-fei,
I've been playing around with this and reading through the code -- many thanks for making it wonderful code to read! I was under the impression that it used pretrained word vector embeddings from Mikolov et al:

....but I don't see any evidence in the code where these vectors are loaded in. Are the word embeddings learned from scratch or are they in fact initialized in some way?

Many thanks!
chris moody

multi-bleu.perl

Hi,

Is this the same script as Moses's multi-bleu.perl? I see that there are some modifications to the original version. I have been investigating why the BLEU-2, BLEU-3, and BLEU-4 scores of my baseline model (Google NIC with VGG-E) are really low, and I found that we are not using the same evaluation scripts. I know that this task is different from machine translation, though. So my questions are:

What is the intention behind the modifications to the BLEU evaluation script?
Does everyone working on captioning evaluate their models with this approach?

Thanks in advance.

Bounding box

Hello Andrej,

Great work!

Is it possible to get the bounding box associated with words? Or is that part of the alignment/retrieval model?

Thanks!

change in extract_features needed due to caffe update

This code line:
features = out[net.outputs[0]].squeeze(axis=(2,3))
Has to be this in order to work with the newest caffe:
features = out[net.outputs[0]]

Additionally remove this line:
caffe.set_phase_test()

Have you implemented Visual-Semantic Alignments ?

Thanks for your kindness to release these codes!
It helps me a lot!
I am interested in your CVPR paper, Deep Visual-Semantic Alignments for Generating Image Descriptions, but I did not find anything about visual-semantic alignments in the released code. Have I missed something? Thanks!

eval_sentence_predictions.py: error: too few arguments

~/tf/neuraltalk-master$ python eval_sentence_predictions.py
usage: eval_sentence_predictions.py [-h] [-b BEAM_SIZE]
[--result_struct_filename RESULT_STRUCT_FILENAME]
[-m MAX_IMAGES] [-d DUMP_FOLDER]
checkpoint_path
eval_sentence_predictions.py: error: too few arguments

When I run this script I get this error. Is it a problem with the checkpoint path, or something else? Thank you.
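
The usage message lists checkpoint_path without brackets, which means it is a required positional argument, so running the script with no arguments fails. The sketch below reproduces that behaviour with a stand-alone argparse parser. It is not the script's actual parser: the default value and the checkpoint file name are hypothetical.

```python
import argparse

def build_parser():
    """Parser that mirrors part of the usage string shown above."""
    parser = argparse.ArgumentParser(prog="eval_sentence_predictions.py")
    # Positional argument: must always be supplied on the command line.
    parser.add_argument("checkpoint_path", type=str,
                        help="path to a saved model checkpoint")
    # Optional flag; the default of 1 is an assumption for illustration.
    parser.add_argument("-b", dest="beam_size", type=int, default=1)
    return parser

# Supplying the positional argument parses cleanly.
args = build_parser().parse_args(["cv/model_checkpoint.p", "-b", "5"])

# Omitting it makes argparse print the usage error and exit.
try:
    build_parser().parse_args([])
    missing_arg_exits = False
except SystemExit:
    missing_arg_exits = True
```

On the command line this corresponds to `python eval_sentence_predictions.py <path to your checkpoint>`.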

Why optimizing the Ws matrix directly?

Other approaches such as [Show and Tell] use a We matrix for word embeddings and optimize We. In neuraltalk, however, I found that Ws, in which each row represents a word, is optimized directly. Why do it this way, and which way performs better?

Running On Raw Images

How exactly would I go about getting a trained model's prediction on an image (in some raw format) that I have?

How can i use this code to train regions & snippets RNN model?

In this code, I can only find how to train a multimodal RNN on images and their description sentences. I don't see any functions for training the model on regions and snippets, as in Figure 5 or Section 4.3 of the paper.
How can I train my own model? How can I get results like those in Figure 5?

predict_on_images.py error

usage: predict_on_images.py [-h] [-r ROOT_PATH] [-b BEAM_SIZE] checkpoint_path
predict_on_images.py: error: the following arguments are required: checkpoint_path
An exception has occurred, use %tb to see the full traceback.

This error occurred. What should I do?

Use for sentence input to sentence output

Would it be possible to use this code to accept a sentence input, and output the most likely sentence, in order to sustain dialogue, instead of a picture input and sentence output? I believe there is a paper on this. Sorry this is not an issue, didn't know where to comment.

Question about usage of RCNN

Hello, I recently read your paper, and I very much appreciate you sharing your code here.

The paper indicates that you first extract the top regions detected by RCNN and then compute CNN features for them. However, I do not see that object detection step in your implementation. Neither the training nor the test phase seems to use object detection. Is that because it still works fine using the whole image?

Thank you.

training over new dataset

I am training it on a new dataset and get this error at the point where the checkpoint is saved:
36/1850 batch done in 2.356s. at epoch 0.97. loss cost = 9.295156, reg cost = 0.000000, ppl2 = 4.59 (smooth 14.32)
evaluating val performance in batches of 100
Traceback (most recent call last):
  File "driver.py", line 315, in <module>
    main(params)
  File "driver.py", line 232, in main
    val_ppl2 = eval_split('val', dp, model, params, misc) # perform the evaluation on VAL set
  File "/root/neuraltalk/imagernn/imagernn_utils.py", line 38, in eval_split
    ppl2 = 2 ** (logppl / logppln)
ZeroDivisionError: integer division or modulo by zero
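
The failing line divides by logppln, so the error means logppln was zero when the 'val' split was evaluated. One plausible cause, not confirmed by the traceback, is that the new dataset has no images assigned to the 'val' split. A minimal sketch of a guarded version of that computation follows. The function names and the per-image "split" field are assumptions for illustration.

```python
def perplexity_base2(logppl, logppln):
    """Compute 2 ** (logppl / logppln), failing loudly when nothing was scored."""
    if logppln == 0:
        raise ValueError("no tokens were scored; is the evaluated split empty?")
    return 2 ** (logppl / float(logppln))

def count_split(images, split):
    """Count dataset entries assigned to a split (assumes a 'split' field)."""
    return sum(1 for img in images if img.get("split") == split)

# Toy dataset where every image ended up in 'train'.
images = [{"filename": "a.jpg", "split": "train"},
          {"filename": "b.jpg", "split": "train"}]
num_val = count_split(images, "val")   # no validation images at all
ppl2 = perplexity_base2(10.0, 5)       # 2 ** 2.0
```

Checking the split counts before training makes the problem visible before the first validation pass.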

py_caffe_feat_extract

I think there is a problem with the bicubic implementation.

The output image contains obvious artifacts if you visualize it.
It is definitely not the same as MATLAB's imresize or OpenCV's resize (INTER_CUBIC).

I guess the vgg_feats.mat inside example_images was produced by this function.
The results produced by py_caffe_feat_extract are also slightly different from the ones produced with OpenCV's resize (cubic).
I hope someone can fix the bug in the bicubic implementation some day.

Thanks a lot.

Maybe a mistake in lstm_generator.py

In lstm_generator.py, line 71 (Hin[t,1:1+d] = X[t]) and line 72 (Hin[t,1+d:] = prev) should be exchanged,
because the hidden size is d, which is the dimension of prev.
But I don't know why it doesn't raise an error. Can anyone explain this?

Best hyperparameters for RNN model

Hi,

When I try to train the RNN model, the performance is quite poor with default parameters since the default values are tuned for LSTM.

So, could you please share the tuned hyperparameters for RNN model?

Thanks.

vgg_feats.mat is created, but it is only 194 bytes. Why?

I think the features are not being written to the vgg_feats.mat file. I tried running "py_caffe_feat_extract.py" using the code you provided in the matlab_features_reference and python_features folders separately. In both cases I get a file of the same size. Please help me with CNN feature extraction from images.

How to use the GPU when running e.g. 'python driver.py'?

It does not seem to use the GPU by default.

multiple hosts

Hi Andrej,

I really love this implementation.
The most intriguing part to me is your monitorcv to visualize the cross-validation. It could help a lot during training.

In the code, I found it could show up-to-40 results with different host names, but my computer has only one hostname (using python gethostname).
I bet it's my lack of related knowledge.
I guess we could run on separate hosts (with different parameters or models) using the same computer, right?

Could you please give some instructions on how to do so?

Thank you so much.
Best,
-Ethan

TypeError: amax() got an unexpected keyword argument 'keepdims'

I ran driver.py and received an error as above. Am I missing something?

list index out of range error

I created coco_sample directory containing the following files.

COCO_val2014_000000463825.jpg
model_checkpoint_coco_visionlab43.stanford.edu_lstm_11.14.p (from here)
tasks.txt (containing one line COCO_val2014_000000463825.jpg)
vgg_feats.mat (from here)

I ran the following command.

python predict_on_images.py coco_sample/model_checkpoint_coco_visionlab43.stanford.edu_lstm_11.14.p -r coco_sample

I got an error message as below.

parsed parameters:
{
"beam_size": 1,
"checkpoint_path": "coco_sample/model_checkpoint_coco_visionlab43.stanford.edu_lstm_11.14.p",
"root_path": "coco_sample"
}
loading checkpoint coco_sample/model_checkpoint_coco_visionlab43.stanford.edu_lstm_11.14.p
image 0/123287:
/home/ec2-user/neuraltalk/imagernn/lstm_generator.py:227: RuntimeWarning: overflow encountered in exp
  IFOGf[t,:3*d] = 1.0/(1.0+np.exp(-IFOG[t,:3*d]))
PRED: (-14.587771) a man and a woman sitting on a bench in the middle of a park
image 1/123287:
Traceback (most recent call last):
  File "predict_on_images.py", line 109, in <module>
    main(params)
  File "predict_on_images.py", line 66, in main
    img['local_file_path'] =img_names[n]
IndexError: list index out of range

Isn't it possible to run predict_on_images.py on a few images?
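
The log reports "image 0/123287" while tasks.txt contains a single line, which suggests that the downloaded vgg_feats.mat holds features for 123287 images and the script indexes img_names once per feature column. Under that assumption, the two inputs have to describe the same images in the same order. A minimal sketch of a sanity check follows. The function names are hypothetical and not part of predict_on_images.py.

```python
def read_tasks(text):
    """Return the non-empty lines of a tasks.txt-style listing."""
    return [line.strip() for line in text.splitlines() if line.strip()]

def check_counts(img_names, num_feature_columns):
    """Raise if the image list and the feature matrix disagree in length."""
    if len(img_names) != num_feature_columns:
        raise ValueError(
            "tasks.txt lists %d image(s) but the feature file has %d column(s); "
            "extract features for exactly the listed images"
            % (len(img_names), num_feature_columns))

img_names = read_tasks("COCO_val2014_000000463825.jpg\n")
check_counts(img_names, 1)  # one image, one feature column: consistent

try:
    check_counts(img_names, 123287)  # the situation in the log above
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

Running on a few images should work once the features are extracted for just those images rather than taken from a file covering a whole dataset.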

Encountered runtime warning while computing logistic function

@karpathy Thanks for open sourcing your image-to-sentences work. I got the code up & running with the Flickr30K dataset but encountered a runtime warning
" RuntimeWarning: overflow encountered in exp"

I have fixed it locally by using the scipy.special.expit function. I have attached the patch below in case you want to cherry-pick my commit. Let me know if this patch is useful to you and whether you'd like me to make a PR with the fix:

```
From d3b8d3401a7ebeae1aff88538f1f5eff440b31cf Mon Sep 17 00:00:00 2001
From: Vimal Thilak
Date: Wed, 3 Dec 2014 15:16:28 -0800
Subject: [PATCH] [bugfix] Fix overflow runtime warning

Warning encountered in logistic function computation

Signed-off-by: Vimal Thilak
---
 imagernn/lstm_generator.py | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/imagernn/lstm_generator.py b/imagernn/lstm_generator.py
index 011e333..af6797f 100644
--- a/imagernn/lstm_generator.py
+++ b/imagernn/lstm_generator.py
@@ -1,5 +1,6 @@
 import numpy as np
 import code
+import scipy.special

 from imagernn.utils import initw

@@ -75,7 +76,7 @@ class LSTMGenerator:
       IFOG[t] = Hin[t].dot(WLSTM)

       # non-linearities
-      IFOGf[t,:3*d] = 1.0/(1.0+np.exp(-IFOG[t,:3*d])) # sigmoids; these are the gates
+      IFOGf[t,:3*d] = scipy.special.expit(IFOG[t, :3*d])  #1.0/(1.0+np.exp(-IFOG[t,:3*d])) # sigmoids; these are the gates
       IFOGf[t,3*d:] = np.tanh(IFOG[t, 3*d:]) # tanh

       # compute the cell activation
@@ -224,7 +225,7 @@ class LSTMGenerator:
         C = np.zeros((1, d))
         Hout = np.zeros((1, d))
         IFOG[t] = Hin[t].dot(WLSTM)
-        IFOGf[t,:3*d] = 1.0/(1.0+np.exp(-IFOG[t,:3*d]))
+        IFOGf[t,:3*d] = scipy.special.expit(-IFOG[t,:3*d])  # 1.0/(1.0+np.exp(-IFOG[t,:3*d]))
         IFOGf[t,3*d:] = np.tanh(IFOG[t, 3*d:])
         C[t] = IFOGf[t,:d] * IFOGf[t, 3*d:] + IFOGf[t,d:2*d] * c_prev
         if tanhC_version:
--
2.0.1
```
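
A note on the sign convention: scipy.special.expit(x) computes 1/(1+exp(-x)), so a drop-in replacement for the original expression passes IFOG unchanged, as in the first hunk. Negating the argument, as in the second hunk, would give 1 - sigmoid(x) instead. For readers without SciPy, the following is a minimal standard-library sketch of an overflow-free logistic function. The name `stable_sigmoid` is hypothetical, and it works on scalars rather than numpy arrays.

```python
import math

def stable_sigmoid(x):
    """Logistic function 1 / (1 + exp(-x)) without overflow for large |x|."""
    if x >= 0:
        # exp(-x) is at most 1 here, so it cannot overflow.
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, rewrite as exp(x) / (1 + exp(x)); exp(x) is below 1.
    z = math.exp(x)
    return z / (1.0 + z)

large_pos = stable_sigmoid(1000.0)   # saturates at the upper end
large_neg = stable_sigmoid(-1000.0)  # the naive formula would overflow in exp(1000)
midpoint = stable_sigmoid(0.0)
```

The branch on the sign of x is what avoids the "overflow encountered in exp" warning: the exponential is only ever evaluated at a non-positive argument.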

Please someone explain the magic behind this

Let's create a wiki and write down how these things work. Having only the source code limits the open source potential of this project. Share the knowledge so it can grow.

CAFFE API error

When I tried to run the Python script python_features/extract_features.py today, I ran into the following problem:

Traceback (most recent call last):
  File "./extract_features.py", line 102, in <module>
    net = caffe.Net(args.model_def, args.model)
Boost.Python.ArgumentError: Python argument types in
    Net.__init__(Net, str, str)
did not match C++ signature:
    __init__(boost::python::api::object, std::string, std::string, int)
    __init__(boost::python::api::object, std::string, int)

I searched for this error and found the same issue on Caffe's issue page: Caffe#1905. I think it is caused by an update to Caffe's API.
So I changed the code at extract_features.py#101 to: net = caffe.Net(args.model_def, args.model, caffe.TEST). That worked, but a new problem came up:

Traceback (most recent call last):
  File "./extract_features.py", line 102, in <module>
    caffe.set_phase_test()
AttributeError: 'module' object has no attribute 'set_phase_test'

I think the reason is that some of the API calls in python_features/extract_features.py are outdated.

Size of Descriptive Sentences

Hi Andrej,

Is there a limit to the size of the descriptive sentences? Has it been tried with multiple sentences each describing different features of the image? For example, if an image had a descriptor "A dog in a park. A kite in the sky." could it generate two sentences if the training data was in a similar format? OR is it better to split the descriptive sentences into several single sentence examples and show the same image for each (ie. image A: dog in a park, image A: kite in the sky).

Also, is the matlab feature extractor GPU enabled?

Thanks!

Incorrect prediction while testing.

After training on the flickr8k images, I evaluate and predict on the example_images dataset you provide, and all of the outputs are wrong. The prediction is incorrect for every image. Why is this happening?

question about dropout implementation

Hi Andrej,

I have been learning a ton about RNNs and their implementation from looking through your code. I have a (perhaps silly) question about your dropout implementation. You claim that your code creates a mask that drops a fraction, drop_prob, of the units and then scales the remaining units by 1/(1-drop_prob). This doesn't seem correct to me since you are sampling using np.random.randn, which seems to sample from a normal distribution of mean 0 and variance 1.

For example, if you set drop_prob=1 (and ignore the fact that this makes your scale factor infinite) then you should be dropping all the units, but in reality you will be testing the boolean condition np.random.randn(some_shape)<(1-drop_prob). Since np.random.randn gives you negative values half the time (on average) you will only drop half the units (on average).

It seems like you want to be sampling from a uniform distribution from 0 to 1 in order for this to work properly.

Best,
Sam
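
To make the point concrete: for a standard normal sample z, P(z < 0.5) is about 0.69, so with drop_prob=0.5 a normal-based mask would drop only about 31% of the units instead of 50%. Below is a minimal sketch of inverted dropout with uniform sampling, written with the standard library rather than numpy. The function name `dropout_mask` is hypothetical and this is not the repository's code.

```python
import random

def dropout_mask(n, drop_prob, rng=random):
    """Return n mask values: 0.0 for dropped units, 1/(1-drop_prob) for kept ones."""
    assert 0.0 <= drop_prob < 1.0, "drop_prob must be in [0, 1)"
    keep_prob = 1.0 - drop_prob
    scale = 1.0 / keep_prob
    # rng.random() is uniform on [0, 1), so each unit is kept with
    # probability exactly keep_prob.
    return [scale if rng.random() < keep_prob else 0.0 for _ in range(n)]

rng = random.Random(0)                 # seeded for reproducibility
mask = dropout_mask(10000, 0.5, rng)
kept_fraction = sum(1 for m in mask if m > 0.0) / float(len(mask))
```

In numpy the equivalent change is to sample with np.random.rand (uniform) rather than np.random.randn (normal).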

init_model_from argument has no effect on where driver starts

It seems like, even when a checkpoint file is passed into --init_model_from argument, it starts from epoch 0.00 and acts like the initial model was never even passed in.

predict_on_images.py: error: too few arguments

Hello,
Thanks for the code and the very helpful readme files.
I tried to run predict_on_images.py on the examples folder you provided but got this error:
C:\neuraltalk-master>python predict_on_images.py
usage: predict_on_images.py [-h] [-r ROOT_PATH]
predict_on_images.py: error: too few arguments

I would appreciate any help ...

Regards

Aborting, cost seems to be exploding.

training with flickr8k aborts:

253/15000 batch done in 5.037s. at epoch 0.84. loss cost = 37.447347, reg cost = 0.000001, ppl2 = 26.10 (smooth 48.09)
254/15000 batch done in 5.082s. at epoch 0.85. loss cost = 39.408169, reg cost = 0.000001, ppl2 = 29.19 (smooth 47.91)
255/15000 batch done in 4.914s. at epoch 0.85. loss cost = 140.730310, reg cost = 0.000001, ppl2 = 237360.65 (smooth 2421.03)
Aboring, cost seems to be exploding. Run gradcheck? Lower the learning rate?
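
Besides lowering the learning rate, as the abort message suggests, one common mitigation for a cost that suddenly spikes is to clip gradients before the parameter update. The sketch below shows elementwise clipping and a plain SGD step on Python lists. It is an illustration under assumed names (`clip_gradients`, `sgd_step`, the clip value 5.0), not the training loop in driver.py.

```python
def clip_gradients(grads, clip):
    """Clamp every gradient entry to the interval [-clip, clip]."""
    return [max(-clip, min(clip, g)) for g in grads]

def sgd_step(params, grads, learning_rate, clip):
    """One vanilla SGD update using clipped gradients."""
    clipped = clip_gradients(grads, clip)
    return [p - learning_rate * g for p, g in zip(params, clipped)]

grads = [-10.0, 0.5, 7.0]                 # one entry is far larger than the rest
clipped = clip_gradients(grads, 5.0)
new_params = sgd_step([1.0, 1.0, 1.0], grads, learning_rate=0.1, clip=5.0)
```

Clipping bounds the size of any single update, so one bad batch is less likely to push the parameters into a region where the cost explodes.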

When I train a batch (batch size 100) of flickr8k, it takes around 0.8 seconds

When I train a batch (batch size 100) of flickr8k, it takes around 0.8 seconds. However, when I train a batch (still 100) of flickr30k, it takes around 7 seconds.
Why?

image captioning

Hi, I would like to work on image captioning. I have developed a novel approach for image segmentation, and I would now like to use the segmented images as a preprocessing step for image captioning. Can you give me an idea for my next step? And if possible, may I have MATLAB code for captioning?
