localminimum / qanet
A TensorFlow implementation of QANet for machine reading comprehension
License: MIT License
I am getting the following error while trying to fine-tune:
InvalidArgumentError (see above for traceback): Assign requires shapes of both tensors to match. lhs shape= [326,64] rhs shape= [1427,64]
[[Node: save/Assign_746 = Assign[T=DT_FLOAT, _class=["loc:@char_mat"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](char_mat, save/RestoreV2:746)]]
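In this message, lhs is the variable in the current graph and rhs is the value stored in the checkpoint, so char_mat was built with 326 rows while the checkpoint holds 1427. One plausible cause (an assumption, not confirmed here) is that preprocessing was re-run on the new dataset and produced a different character vocabulary. A minimal sketch for spotting such mismatches before calling restore, written in plain Python: the helper name find_shape_mismatches and the word_mat shapes are made up for illustration, and only the char_mat shapes come from the error above.

```python
def find_shape_mismatches(graph_shapes, checkpoint_shapes):
    """Return {name: (graph_shape, checkpoint_shape)} for variables whose shapes differ."""
    mismatches = {}
    for name, graph_shape in graph_shapes.items():
        ckpt_shape = checkpoint_shapes.get(name)
        # Variables missing from the checkpoint are ignored in this sketch.
        if ckpt_shape is not None and list(ckpt_shape) != list(graph_shape):
            mismatches[name] = (list(graph_shape), list(ckpt_shape))
    return mismatches


# char_mat shapes are taken from the error message; word_mat is illustrative only.
graph_shapes = {"char_mat": [326, 64], "word_mat": [91588, 300]}
checkpoint_shapes = {"char_mat": [1427, 64], "word_mat": [91588, 300]}

mismatches = find_shape_mismatches(graph_shapes, checkpoint_shapes)
print(mismatches)
```

With real data, the checkpoint side of the comparison could probably be filled from tf.train.list_variables, if your TensorFlow version provides it. Any variable reported here either has to be rebuilt with the checkpoint's vocabulary files or excluded from the restore.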
should be:
-flags.DEFINE_list("bucket_range", [40, 401, 40], "the range of bucket")
Thank you for your implementation, it is very helpful for me.
I ran this code and get similar results when the number of heads equals 1. However, I cannot reproduce the result of the original paper (73.6/82.7) when I use 8 heads, batch size 32, 150k training steps, and a char dimension of 200 (the same settings as the original paper). I only get around 71.27/80.58.
The same thing occurred when I ran the PyTorch repo (https://github.com/andy840314/QANet-pytorch-).
Any suggestions?
I am not able to free enough GPU memory for training. How can I change the batch size (batch_size)?
Thanks for this great implementation. I noticed that the README file says the original system can achieve EM: 72.5, F1: 81.4 after 150,000 training steps, and EM: 76.2, F1: 84.6 after 340,000 training steps. But I didn't find this information in the original paper. It seems that the original system takes much longer to train? Could you show me where to get this information? Or did you infer it from other statistics?
I think the line "return inputs + mask_value * (1 - mask)" should be "return inputs * mask + mask_value * (1 - mask)".
Hi all,
According to results from both the Google Brain team and AllenNLP, using ELMo can give a big boost in results. I noticed that AllenNLP provides some pretrained ELMo models. I would love to see some better results.
Thanks.
[1] QANet slide
[2] ELMO page
I'm checking this model on an M40 device, which has 24 GB of memory on the board.
What default batch size did you use on the 1080 card? TensorFlow seems to report OOM when I increase the batch size to 64.
I ran:
sudo pip install spacy==2.0.9
mldl@mldlUB1604:~/ub16_prj/Fast-Reading-Comprehension$ python config.py --mode prepro
Traceback (most recent call last):
File "config.py", line 9, in <module>
from prepro import prepro
File "/home/mldl/ub16_prj/Fast-Reading-Comprehension/prepro.py", line 15, in <module>
nlp = spacy.blank("en")
AttributeError: 'module' object has no attribute 'blank'
Any suggestions on how to train the network to answer "Not available" for questions that cannot be answered from the context?
I am trying to run the interactive server, but when I navigate to the server URL, the page throws up a 500 code error (Internal Server Error).
The trace for the error is:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/bottle.py", line 862, in _handle
return route.call(**args)
File "/usr/local/lib/python3.6/dist-packages/bottle.py", line 1740, in wrapper
rv = callback(*a, **ka)
File "/home/rudresh/Documents/machine_comprehension/Fast-Reading-Comprehension/demo.py", line 25, in home
with open('demo.html', 'r') as fl:
FileNotFoundError: [Errno 2] No such file or directory: 'demo.html'
127.0.0.1 - - [07/Apr/2018 10:07:55] "GET / HTTP/1.1" 500 739
127.0.0.1 - - [07/Apr/2018 10:07:56] "GET /favicon.ico HTTP/1.1" 404 740
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/bottle.py", line 862, in _handle
return route.call(**args)
File "/usr/local/lib/python3.6/dist-packages/bottle.py", line 1740, in wrapper
rv = callback(*a, **ka)
File "/home/rudresh/Documents/machine_comprehension/Fast-Reading-Comprehension/demo.py", line 25, in home
with open('demo.html', 'r') as fl:
FileNotFoundError: [Errno 2] No such file or directory: 'demo.html'
This is obviously caused by the missing demo.html file. Could you tell me where I can get this file?
I am running Python 3.6 on Ubuntu 16.04.
I just changed nlp = spacy.blank("en") to nlp = spacy.blank("zh").
Is that OK?
I'm trying to train/demo the code, and in both cases (python config.py --mode train and python config.py --mode demo) I end up hitting the same error.
The last few lines of the traceback are:
File "config.py", line 125, in main
train(config)
File "/home/arjoonn/Fast-Reading-Comprehension/main.py", line 19, in train
with open(config.word_emb_file, "r") as fh:
FileNotFoundError: [Errno 2] No such file or directory: 'data/word_emb.json'
I saw some commented-out lines in the download.sh file. Should I be uncommenting those?
With num_heads 1 and hidden size 96, this does not seem faster than HKUST rnet?
With batch size 64 I get 1.42 batch/s, while HKUST RNET reaches 2.4+ batch/s.
HKUST RNET uses a char dim of only 8 by default while here we use 64, but I still think QANet is not as fast as Google shows in the paper?
https://github.com/NLPLearn/QANet/blob/8107d223897775d0c3838cb97f93b089908781d4/layers.py#L52
Excuse me, the paper "Layer Normalization" (Lei Jimmy Ba, Ryan Kiros, and Geoffrey E. Hinton) says that the mean and variance are computed over all the hidden units in the same layer, and that different training cases have different normalization terms. So I think the mean should be computed like this:
axes = list(range(1, x.shape.ndims))
mean = tf.reduce_mean(x, axes)
So the shape of the mean is [batch], and so is the shape of the variance.
They are then used to compute the normalized x.
The TensorFlow layer normalization API contains the line below, which I think does the same as mine.
norm_axes = list(range(begin_norm_axis, inputs_rank))
https://github.com/tensorflow/tensorflow/blob/c19e29306ce1777456b2dbb3a14f511edf7883a8/tensorflow/contrib/layers/python/layers/layers.py#L2311
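To make the proposal concrete, here is a small plain-Python sketch of per-example normalization, where every example gets its own mean and variance over all of its hidden units. It is only an illustration of the idea described above, not the code in layers.py; the function name and the epsilon value are assumptions, and the learned gain and bias are left out.

```python
import math


def layer_norm_per_example(batch, epsilon=1e-6):
    """Normalize each example with its own mean and variance.

    batch is a list of examples; each example is a flat list holding all
    hidden units of that example (all non-batch axes flattened).
    """
    normalized = []
    for units in batch:
        mean = sum(units) / len(units)
        variance = sum((u - mean) ** 2 for u in units) / len(units)
        scale = math.sqrt(variance + epsilon)
        normalized.append([(u - mean) / scale for u in units])
    return normalized


batch = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]
normed = layer_norm_per_example(batch)
print(normed)
```

The two examples differ by a factor of 10 but come out almost identical after normalization, which is the "different training cases have different normalization terms" behaviour from the paper.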
I am using an AWS p2.xlarge, which has a Tesla K80.
Training still runs into a memory issue. Why?
The console shows that it has 11.17 GB of memory.
Logs - attached.
logs.txt
TIA
I don't understand the purpose of the "mask_logits" function, which is called before the "softmax" function in various places. Can someone please explain?
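A rough sketch of the usual purpose of such a function, in plain Python rather than TensorFlow: padded positions get a very large negative number added to their logits, so that softmax assigns them (numerically) zero probability and the real tokens share all the probability mass. The default mask_value of -1e30 and the list-based signature are assumptions made for this illustration.

```python
import math


def mask_logits(logits, mask, mask_value=-1e30):
    """Add a huge negative value wherever mask is 0 (padding)."""
    return [x + mask_value * (1 - m) for x, m in zip(logits, mask)]


def softmax(values):
    top = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5, 3.0]
mask = [1, 1, 1, 0]  # the last position is padding

probs = softmax(mask_logits(logits, mask))
print(probs)
```

Even though the padded position has the largest raw logit (3.0), it ends up with probability 0. Because -1e30 swamps any ordinary logit, the additive form and the "inputs * mask + ..." form give the same result here; the two would only differ if padded inputs could be extremely large or non-finite.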
After training the model to 46 percent there was a power outage. What command do I use to resume training? I'm on checkpoint 26.
Thanks in Advance
Snorkel can generate training data; maybe it is useful for data augmentation.
It uses dynamic programming instead of translating twice.
(.venv) ub16c9@ub16c9-gpu:~/ub16_prj/QANet$ python config.py --mode train
Building model...
WARNING:tensorflow:From /home/ub16c9/ub16_prj/QANet/layers.py:52: calling reduce_mean (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
WARNING:tensorflow:From /home/ub16c9/ub16_prj/QANet/model.py:134: calling softmax (from tensorflow.python.ops.nn_ops) with dim is deprecated and will be removed in a future version.
Instructions for updating:
dim is deprecated, use axis instead
WARNING:tensorflow:From /home/ub16c9/ub16_prj/QANet/model.py:174: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.
See tf.nn.softmax_cross_entropy_with_logits_v2.
Total number of trainable parameters: 788673
2018-12-29 11:14:48.345129: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-12-29 11:14:48.431530: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-12-29 11:14:48.431955: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.6575
pciBusID: 0000:01:00.0
totalMemory: 10.92GiB freeMemory: 10.43GiB
2018-12-29 11:14:48.431971: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2018-12-29 11:14:48.733045: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-12-29 11:14:48.733079: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2018-12-29 11:14:48.733085: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2018-12-29 11:14:48.733318: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10086 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
2018-12-29 11:14:50.042331: W tensorflow/core/framework/allocator.cc:122] Allocation of 109906800 exceeds 10% of system memory.
2018-12-29 11:14:50.174758: W tensorflow/core/framework/allocator.cc:122] Allocation of 109906800 exceeds 10% of system memory.
2018-12-29 11:14:50.507489: W tensorflow/core/framework/allocator.cc:122] Allocation of 109906800 exceeds 10% of system memory.
2018-12-29 11:14:50.691090: W tensorflow/core/framework/allocator.cc:122] Allocation of 109906800 exceeds 10% of system memory.
2018-12-29 11:14:50.825623: W tensorflow/core/framework/allocator.cc:122] Allocation of 109906800 exceeds 10% of system memory.
55%|█████▌    | 32935/60000 [3:15:35<2:19:53, 3.22it/s]
90%|█████████ | 53999/60000 [5:17:29<29:48, 3.36it/s
Exception RuntimeError: RuntimeError('cannot join current thread',) in <object repr() failed> ignored
██████████| 328/328 [00:36<00:00, 9.07it/s]
(.venv) ub16c9@ub16c9-gpu:~/ub16_prj/QANet$
This is an umbrella issue where we can collectively tackle some problems and improve general open source reading comprehension quality.
Goal
The network is already there. We just need to add more features on top of the current model.
Model
Data
Contributions to any of these issues are welcome; please comment on this issue and let us know if you want to work on these problems.
I have read the README.md file, but I still don't know how to run this project. Can anybody give more instructions?
I see that TensorFlow detected 2 GPUs, but training only runs on 1 GPU. Please advise.
What are the specifications of the system you used for training?
Can you share pre-trained model weights?
I am getting the following error while preprocessing:
Generating word embedding...
13%|#######################3 | 296814/2200000 [00:36<03:52, 8176.12it/s]Traceback (most recent call last):
File "config.py", line 144, in <module>
tf.app.run()
File "C:\Users\chchauha\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\platform\app.py", line 126, in run
_sys.exit(main(argv))
File "config.py", line 127, in main
prepro(config)
File "F:\Synapse\QANet-master\prepro.py", line 287, in prepro
word_counter, "word", emb_file=word_emb_file, size=config.glove_word_size, vec_size=config.glove_dim)
File "F:\Synapse\QANet-master\prepro.py", line 99, in get_embedding
vector = list(map(float, array[-vec_size:]))
ValueError: could not convert string to float: 'sania'
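The traceback shows that a word ('sania') ended up inside the slice that should contain only numbers, which suggests the line had fewer fields than vec_size + 1. Possible causes (guesses, not verified) are a truncated or corrupted download, or a glove_dim setting larger than the vectors in the file. A tolerant parser along these lines would skip such lines instead of crashing; parse_embedding_line is a hypothetical helper, not a function from prepro.py.

```python
def parse_embedding_line(line, vec_size):
    """Split one GloVe-style line into (word, vector), or None if malformed."""
    array = line.rstrip().split(" ")
    if len(array) <= vec_size:
        return None  # not enough fields for a word plus vec_size numbers
    # Some tokens contain spaces, so everything before the last vec_size
    # fields is treated as the word.
    word = "".join(array[:-vec_size])
    try:
        vector = [float(x) for x in array[-vec_size:]]
    except ValueError:
        return None  # a non-numeric field where a number was expected
    return word, vector


ok = parse_embedding_line("hello 0.1 0.2 0.3", 3)
spaced = parse_embedding_line(". . . 0.1 0.2 0.3", 3)
bad = parse_embedding_line("a sania 0.1 0.2", 3)
print(ok, spaced, bad)
```

Counting how many lines return None is a quick way to tell a single bad line from a file whose dimension does not match the config.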
Hello,
I have a question about your code: all OOV words are represented by id 1, which means all OOV words are treated as the same word, and its embedding is a zero vector. Also, this embedding is not updated during training. However, in the original paper, the authors mention that the word embeddings of OOV words are updated during training.
I think this may be a reason why the score is lower than the original paper.
Hello All,
I have many JSON files in the same format as the standard train or dev file. Can I feed them to this network and predict answers for different input questions and contexts?
Thanks,
Sachin B. Ichake
for token in context_tokens:
    word_counter[token] += len(para["qas"])
    for char in token:
        char_counter[char] += len(para["qas"])
Should it be += 1?
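To see what the two variants do, here is a toy comparison. One possible reading (an assumption on my part) is that every question in a paragraph becomes its own training example, so each context token is effectively seen len(para["qas"]) times. The paragraph dictionaries below are made up for the example.

```python
from collections import Counter


def count_tokens(paragraphs, weight_by_questions):
    """Count context tokens, optionally weighting by the number of questions."""
    counter = Counter()
    for para in paragraphs:
        weight = len(para["qas"]) if weight_by_questions else 1
        for token in para["context_tokens"]:
            counter[token] += weight
    return counter


# Hypothetical toy data: 3 questions on the first paragraph, 1 on the second.
paragraphs = [
    {"context_tokens": ["the", "cat", "sat"], "qas": ["q1", "q2", "q3"]},
    {"context_tokens": ["the", "dog"], "qas": ["q4"]},
]

weighted = count_tokens(paragraphs, weight_by_questions=True)
plain = count_tokens(paragraphs, weight_by_questions=False)
print(weighted)
print(plain)
```

The choice only matters if the counts are later compared against a frequency threshold; the set of distinct tokens is the same either way.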
Hello,
For some questions in the SQuAD dataset I get an exception:
InvalidArgumentError (see above for traceback): num_upper must be negative or less or equal to number of columns (10) got: 30
[[Node: Output_Layer/MatrixBandPart = MatrixBandPart[T=DT_FLOAT, Tindex=DT_INT64, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Output_Layer/MatMul, Output_Layer/MatrixBandPart/num_lower, Output_Layer/MatrixBandPart/num_upper)]]
Do you know the reason for this? How can I get rid of this problem?
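The message says the requested band width (30) is larger than the number of columns (10), which would fit a context of only 10 tokens combined with an answer limit of 30. A plain-Python sketch of the span search with the band clamped to the context length is shown below; best_span is a hypothetical helper, and the band semantics (keep entries with 0 <= end - start <= num_upper) are my reading of the MatrixBandPart op named in the error.

```python
def best_span(start_probs, end_probs, ans_limit):
    """Pick (start, end) maximizing start_probs[start] * end_probs[end],
    with start <= end and end - start <= ans_limit."""
    n = len(start_probs)
    num_upper = min(ans_limit, n)  # clamp so short contexts do not fail
    best, best_score = (0, 0), -1.0
    for i in range(n):
        for j in range(i, min(n, i + num_upper + 1)):
            score = start_probs[i] * end_probs[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best


start_probs = [0.1, 0.6, 0.3]
end_probs = [0.5, 0.1, 0.4]

span_long_limit = best_span(start_probs, end_probs, ans_limit=30)
span_zero_limit = best_span(start_probs, end_probs, ans_limit=0)
print(span_long_limit, span_zero_limit)
```

In the TensorFlow graph the same idea would probably amount to taking the minimum of the answer limit and the sequence length before building the band.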
Hello everyone,
I've been trying to train a model with different num_heads, hidden and num_steps parameters.
The default parameters in config.py work like a charm, but once I change the mentioned parameters, I get this:
Exception ignored in: <bound method tqdm.__del__ of 42%|██████████████████████▉ | 49999/120000 [15:34:24<18:06:29, 1.07it/s]>
Traceback (most recent call last):█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 328/328 [02:05<00:00, 2.53it/s]
File "/home/username/.virtualenvs/qanet/lib/python3.5/site-packages/tqdm/_tqdm.py", line 889, in __del__
self.close()
File "/home/username/.virtualenvs/qanet/lib/python3.5/site-packages/tqdm/_tqdm.py", line 1095, in close
self._decr_instances(self)
File "/home/username/.virtualenvs/qanet/lib/python3.5/site-packages/tqdm/_tqdm.py", line 454, in _decr_instances
cls.monitor.exit()
File "/home/username/.virtualenvs/qanet/lib/python3.5/site-packages/tqdm/_monitor.py", line 52, in exit
self.join()
File "/usr/lib/python3.5/threading.py", line 1051, in join
raise RuntimeError("cannot join current thread")
RuntimeError: cannot join current thread
This occurred when I set num_heads to 2, 4 and 8. I could train up to 50k and 54k steps when num_heads was set to 2 and 4, and it failed from the start when num_heads was set to 8.
I'm using Ubuntu 16.04, Python 3.5.2 and training the network on a GPU. Here's the nvidia-smi and nvcc --version output if someone needs it:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.44 Driver Version: 396.44 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 00000000:00:1E.0 Off | 0 |
| N/A 72C P0 63W / 149W | 0MiB / 11441MiB | 98% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176
So what could be the real cause of this error?
Thanks in advance!
In preprocessing mode, execution stops in def build_features()
at (example["y1s"][0] - example["y2s"][0]) > ans_limit
with a "list index out of bounds" error.
After I comment out that statement, it moves forward and gives another error at
start, end = example["y1s"][-1], example["y2s"][-1]
again with "list index out of bounds".
Please help. Is it because I am using SQuAD version 2.0?
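SQuAD 2.0 adds unanswerable questions, which carry "is_impossible": true and an empty "answers" list, so indexing the first or last answer position fails for them. If the goal is just to get preprocessing running, one option is to drop those questions first. The sketch below assumes the standard SQuAD JSON layout; answerable_questions is a hypothetical helper and the sample paragraph is made up.

```python
def answerable_questions(paragraph):
    """Return only the questions that have at least one answer span."""
    kept = []
    for qa in paragraph["qas"]:
        if qa.get("is_impossible", False) or not qa.get("answers"):
            continue  # unanswerable in SQuAD 2.0: no span to index
        kept.append(qa)
    return kept


# Made-up paragraph in SQuAD 2.0 style.
paragraph = {
    "context": "QANet was evaluated on SQuAD.",
    "qas": [
        {"id": "a", "question": "What was QANet evaluated on?",
         "answers": [{"text": "SQuAD", "answer_start": 23}],
         "is_impossible": False},
        {"id": "b", "question": "Who funded QANet?",
         "answers": [], "is_impossible": True},
    ],
}

kept = answerable_questions(paragraph)
print([qa["id"] for qa in kept])
```

This throws away the unanswerable questions rather than teaching the model to abstain, so it turns the 2.0 data into something closer to the 1.1 setting.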
Hi,
When I try to run your code, I get an error:
Reducing Glove Matrix
100%|█████████████████████████████████████████████████| 442/442 [01:32<00:00, 4.79it/s]
100%|███████████████████████████████████████████████████| 48/48 [00:10<00:00, 4.43it/s]
Processing 91600 vocabs
Total number of lines: 91604
Reduced vocab size: 91604
Reading GloVe from: ./glove.840B.300d.txt
Processing line 91600
Reading GloVe from: ./glove.840B.300d.char.txt
Tokenizing training data.
100%|█████████████████████████████████████████████████| 442/442 [01:25<00:00, 5.19it/s]
Tokenizing dev data.
100%|███████████████████████████████████████████████████| 48/48 [00:10<00:00, 4.77it/s]
Tokenizing complete
Processing 91600 vocabsTraceback (most recent call last):
File "process.py", line 377, in <module>
main()
File "process.py", line 371, in main
load_glove(Params.glove_dir,"glove",vocab_size = Params.vocab_size)
File "process.py", line 203, in load_glove
assert 0
AssertionError
Can you tell me why this happened?
Hi. I have a MacBook Air (Mid 2017) and I want to train the model. It doesn't have a GPU, so how can I train the model without one?
My test and dev sets are the same, but I get different results from the training checkpoint evaluation vs. running config.py in test mode.
Ideally it should give the same results, because we are loading the saved model and running it on the dev file again?
Model | Training Steps | Size | Attention Heads | Data Size (aug) | EM | F1 |
---|---|---|---|---|---|---|
My Model | 60,000 | 128 | 1 | 87k (no aug) | 70.7 | 79.8 |
The results are obtained on a K80 machine. I modify the trilinear function for memory efficiency, but the results are the same with the current version of this repository.
I'm not sure about the overfitting, the model is the last checkpoint after training 60,000 steps.
When I set "pretrained_char" to True, there is a tensor reshape size conflict.
glove_char_file = os.path.join('data/glove', "glove.840B.300d-char.txt")
flags.DEFINE_string("glove_char_file", glove_char_file, "Glove character embedding source file")
flags.DEFINE_boolean("pretrained_char", True, "Whether to use pretrained character embedding")
Error is from model.py line 76, below. How can the reshape dimensions be adjusted?
Error:
Traceback (most recent call last):
File "config.py", line 152, in <module>
tf.app.run()
File "/home/my/anaconda3/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "config.py", line 133, in main
train(config)
File "QANet/main.py", line 95, in train
handle: train_handle, model.dropout: config.dropout})
File "/home/my/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 905, in run
run_metadata_ptr)
File "/home/my/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1140, in _run
feed_dict_tensor, options, run_metadata)
File "/home/my/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1321, in _do_run
run_metadata)
File "/home/my/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1340, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Input to reshape is a tensor with 2445312 values, but the requested shape has 11462400
[[Node: Input_Embedding_Layer/Reshape = Reshape[T=DT_FLOAT, Tshape=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Input_Embedding_Layer/embedding_lookup, Input_Embedding_Layer/Reshape/shape)]]
[[Node: Identity/_4743 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_52979_Identity", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Caused by op 'Input_Embedding_Layer/Reshape', defined at:
File "config.py", line 152, in <module>
tf.app.run()
File "/home/my/anaconda3/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "config.py", line 133, in main
train(config)
File "QANet/main.py", line 72, in train
model = Model(config, iterator, word_mat, char_mat, graph = g)
File "QANet/model.py", line 60, in __init__
self.forward()
File "QANet/model.py", line 76, in forward
ch_emb = tf.reshape(tf.nn.embedding_lookup(self.char_mat, self.ch), [N * PL, CL, dc]) # 32*1000?, 16, 64 = 32768000. Input to reshape is a tensor with 34099200 values, but the requested shape has 7274496.
File "/home/my/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 5782, in reshape
"Reshape", tensor=tensor, shape=shape, name=name)
File "/home/my/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/my/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3290, in create_op
op_def=op_def)
File "/home/my/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1654, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
InvalidArgumentError (see above for traceback): Input to reshape is a tensor with 2445312 values, but the requested shape has 11462400
[[Node: Input_Embedding_Layer/Reshape = Reshape[T=DT_FLOAT, Tshape=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Input_Embedding_Layer/embedding_lookup, Input_Embedding_Layer/Reshape/shape)]]
[[Node: Identity/_4743 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_52979_Identity", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
In a highway network, H is a non-linear function. But in this repo, H is a linear function. Why is this? Thanks!
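For reference, a highway layer combines a transform H(x) and the input x through a gate T: y = T * H(x) + (1 - T) * x. The toy sketch below lets you swap a non-linear H (ReLU) for a linear one (identity) to see the difference. It is a simplification for discussion, not the repository's layer: the gate logits are passed in directly instead of being computed from learned weights, and all names are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def highway(x, transform, gate_logits):
    """y_i = T_i * H(x)_i + (1 - T_i) * x_i, with T = sigmoid(gate_logits)."""
    out = []
    for x_i, h_i, g_i in zip(x, transform(x), gate_logits):
        t = sigmoid(g_i)
        out.append(t * h_i + (1.0 - t) * x_i)
    return out


def relu(values):
    return [max(0.0, v) for v in values]


def identity(values):
    return list(values)


x = [-1.0, 2.0]
gate_logits = [0.0, 0.0]  # T = 0.5 everywhere

y_nonlinear = highway(x, relu, gate_logits)
y_linear = highway(x, identity, gate_logits)
print(y_nonlinear, y_linear)
```

With an identity H the layer reduces to passing x through unchanged, which shows why the choice of H matters for what the layer can learn.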
I've trained the QANet model on SQuAD. I want to apply this SQuAD-trained model to a new dataset by fine-tuning. I need to use the weights from the SQuAD-trained model as the initialization for training on the new dataset, so that the SQuAD model adapts to the new dataset.
From the train/FRC folder, I can see there are several checkpoint files. Which checkpoint files should I use for initialization of the new model for the new dataset?
Thanks,
First, great job!
The file https://nlp.stanford.edu/data/glove.840B.300d.zip could not be downloaded. Where can I download it? Thanks!
Thanks for the brilliant code.
I noticed a sentence in the paper:
"all the out-of-vocabulary words are mapped to a token, whose embedding is trainable with random initialization." This is not in your code (a pretrained matrix is used instead). That seems to make sense.
Does that work for the model?
Steps I have done:
1. Cloned the repo and downloaded weights
2. sh download.sh
3. Ran python config.py --mode prepro
4. Ran python config.py --mode demo
Error:
Exception in thread Thread-1:
Traceback (most recent call last):
File "/home/kamalraj/anaconda2/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/home/kamalraj/anaconda2/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "/mnt/d/ML exp/Fast-Reading-Comprehension/demo.py", line 74, in demo_backend
saver.restore(sess, tf.train.latest_checkpoint(config.save_dir))
File "/home/kamalraj/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1755, in restore
{self.saver_def.filename_tensor_name: save_path})
File "/home/kamalraj/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 905, in run
run_metadata_ptr)
File "/home/kamalraj/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1137, in _run
feed_dict_tensor, options, run_metadata)
File "/home/kamalraj/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1355, in _do_run
options, run_metadata)
File "/home/kamalraj/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1374, in _do_call
raise type(e)(node_def, op, message)
InvalidArgumentError: Assign requires shapes of both tensors to match. lhs shape= [91588,300] rhs shape= [91589,300]
[[Node: save/Assign_375 = Assign[T=DT_FLOAT, _class=["loc:@word_mat"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](word_mat, save/RestoreV2:375)]]
Caused by op u'save/Assign_375', defined at:
File "/home/kamalraj/anaconda2/lib/python2.7/threading.py", line 774, in __bootstrap
self.__bootstrap_inner()
File "/home/kamalraj/anaconda2/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/home/kamalraj/anaconda2/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "/mnt/d/ML exp/Fast-Reading-Comprehension/demo.py", line 73, in demo_backend
saver = tf.train.Saver()
File "/home/kamalraj/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1293, in __init__
self.build()
File "/home/kamalraj/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1302, in build
self._build(self._filename, build_save=True, build_restore=True)
File "/home/kamalraj/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1339, in _build
build_save=build_save, build_restore=build_restore)
File "/home/kamalraj/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 796, in _build_internal
restore_sequentially, reshape)
File "/home/kamalraj/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 471, in _AddRestoreOps
assign_ops.append(saveable.restore(saveable_tensors, shapes))
File "/home/kamalraj/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 161, in restore
self.op.get_shape().is_fully_defined())
File "/home/kamalraj/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/state_ops.py", line 280, in assign
validate_shape=validate_shape)
File "/home/kamalraj/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/gen_state_ops.py", line 58, in assign
use_locking=use_locking, name=name)
File "/home/kamalraj/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/kamalraj/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3271, in create_op
op_def=op_def)
File "/home/kamalraj/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1650, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
InvalidArgumentError (see above for traceback): Assign requires shapes of both tensors to match. lhs shape= [91588,300] rhs shape= [91589,300]
[[Node: save/Assign_375 = Assign[T=DT_FLOAT, _class=["loc:@word_mat"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](word_mat, save/RestoreV2:375)]]
Hi,
I hit a runtime error in sess.run([]):
2018-03-23 11:55:55.959752: E tensorflow/stream_executor/cuda/cuda_dnn.cc:385] could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
2018-03-23 11:55:55.959887: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:369] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module 384.69 Wed Aug 16 19:34:54 PDT 2017
GCC version: gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC)
"""
2018-03-23 11:55:55.959970: E tensorflow/stream_executor/cuda/cuda_dnn.cc:393] possibly insufficient driver version: 384.69.0
2018-03-23 11:55:55.959998: E tensorflow/stream_executor/cuda/cuda_dnn.cc:352] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
2018-03-23 11:55:55.960028: F tensorflow/core/kernels/conv_ops.cc:717] Check failed: stream->parent()->GetConvolveAlgorithms( conv_parameters.ShouldIncludeWinogradNonfusedAlgo<T>(), &algorithms)
2018-03-23 11:55:55.960040: E tensorflow/stream_executor/cuda/cuda_dnn.cc:385] could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
Aborted (core dumped)
tensorflow version: 1.5.0
CUDA: 9.0
Cudnn: 7.0
Driver Version: 384.69
May I ask which versions you use? Should I update my driver version, or can I just change some model code?
It works without a GPU.
Thanks
Hi, I trained the model on AWS (GPU instance) for 60K steps and got the model. I then tested it on several GPU/CPU instances and the results are consistent. When I deploy it locally on my Ubuntu desktop (CPU only), the inferences are totally off. I tested on an AWS GPU instance (p2.xlarge), an AWS CPU instance (c5d.4xlarge) and also on Colab. All three show consistent answers for a given context and questions. Only on my desktop are the answers way off. Any input as to why this could be happening would help. Thanks!
Hi, I have noticed that you put the input projection before the Highway Network. However, the paper mentions that the input of the Embedding Encoding Layer is a vector of dimension p1+p2=500 for each word, which means that the projection is placed after the Highway Network.
Have you already tried this?
We have trained QANet on our own question-and-answer data. But when we run it in demo mode for prediction, it gives different results for the same question.
Sometimes it picks the correct answer and sometimes it does not, but ideally it should pick the same answer every time, right? Any ideas what could be the reason for this behaviour of the trained model?
I have commented out the section below from the test/demo code:
"""
if config.decay < 1.0:
sess.run(model.assign_vars)
"""
The authors didn't mention using a conv layer in the paper. Thanks for any reply!
1. What is the meaning of config.hidden used in conv(), and why is the kernel size 5 in conv()? Is it a parameter that needs to be tuned?
2. Is the conv function provided by TensorFlow, or did you have to write it yourself?