
Comments (13)

dennybritz commented on April 29, 2024

I am not sure. I have seen errors like this happen when using the same GPU in multiple processes. Make sure that you don't have any other python/tensorflow processes running (including notebooks).
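
One way to check for other processes holding the GPU is to query `nvidia-smi`. The sketch below assumes an NVIDIA driver with `nvidia-smi` on the PATH; the helper names `parse_compute_apps` and `list_gpu_processes` are illustrative and not part of seq2seq.

```python
import subprocess


def parse_compute_apps(csv_text):
    """Parse `pid, process_name, used_memory` CSV rows into dicts."""
    apps = []
    for line in csv_text.strip().splitlines():
        try:
            # pid is the first field and memory the last; the name sits between.
            pid, rest = line.split(",", 1)
            name, mem = rest.rsplit(",", 1)
            apps.append({
                "pid": int(pid.strip()),
                "name": name.strip(),
                "used_memory": mem.strip(),
            })
        except ValueError:
            # Skip rows that do not have the expected three fields.
            continue
    return apps


def list_gpu_processes():
    """Ask nvidia-smi which compute processes currently use a GPU."""
    out = subprocess.check_output(
        [
            "nvidia-smi",
            "--query-compute-apps=pid,process_name,used_memory",
            "--format=csv,noheader",
        ],
        universal_newlines=True,
    )
    return parse_compute_apps(out)
```

If `list_gpu_processes()` returns entries other than the training job itself (a forgotten notebook kernel, for example), those are worth stopping before retrying.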

from seq2seq.

ZhangShiyue commented on April 29, 2024

Thanks for replying to my issue #98. Actually, I don't have any other processes running. I run on only one GPU and encounter this problem during evaluation. Is there any conflict between training and validation?

ZhangShiyue commented on April 29, 2024

I checked the log. It seems that when validation runs, a new process is started that tries to use the same GPU and then fails with this error. That's just my guess; I'm still not very familiar with the code.

dennybritz commented on April 29, 2024

It's normal that validation loads the model graph again, and it shouldn't be an issue. Otherwise there is no difference between validation and training, except for the data. It's difficult for me to debug this without a way to reproduce the error.

Does it work if you run on the CPU only?
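
A common way to force a CPU-only run is to hide the GPUs from CUDA through the `CUDA_VISIBLE_DEVICES` environment variable before TensorFlow starts. A minimal sketch, where `cpu_only_env` is a hypothetical helper and the training command is whatever you normally launch:

```python
import os


def cpu_only_env(base_env=None):
    """Return a copy of the environment with all CUDA devices hidden."""
    env = dict(os.environ if base_env is None else base_env)
    # An empty value means no GPU is visible to CUDA applications.
    env["CUDA_VISIBLE_DEVICES"] = ""
    return env


# Usage (the command is a placeholder for your own training invocation):
#   import subprocess
#   subprocess.call(["python", "your_training_script.py"], env=cpu_only_env())
```

Setting `os.environ["CUDA_VISIBLE_DEVICES"] = ""` at the very top of the script, before `import tensorflow`, should have the same effect within a single process.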

ZhangShiyue commented on April 29, 2024

OK, I see. Thank you! But I think this error must be related to the GPU; I didn't encounter the problem on the CPU. I cannot figure out what's wrong for now, so I have decided to train the model without validation and save more checkpoints to validate later.

dennybritz commented on April 29, 2024

What GPU do you have?

ZhangShiyue commented on April 29, 2024

01:00.0 VGA compatible controller: NVIDIA Corporation GP104 [GeForce GTX 1070] (rev a1)
This GPU worked well before, and if I drop validation and only train, it also works well. So maybe there are some configs that I overlooked or set wrong...

nmaac commented on April 29, 2024

I hit the same error...

rihardsk commented on April 29, 2024

I have also encountered the same error.

rihardsk commented on April 29, 2024

It seems that I might have managed to solve the issue by switching to the latest development version of TensorFlow (compiled from source from the master branch). At least now I'm getting a different error. See #102. I am not yet sure whether #102 just makes it crash earlier. I will report back if that's the case.

EDIT:
I am no longer sure that switching to the development version of TensorFlow is what helped. I switched back to an environment with the latest stable version (from pip install tensorflow-gpu) and could no longer reproduce the error. Instead I now get #101, which also happens while evaluating. It seems that something weird is going on with that process.

It would still be interesting to see if others have any luck by installing the development version of TensorFlow.

dennybritz commented on April 29, 2024

Thanks for reporting. I believe all of these issues are caused by the same problem, so I will close them and create a new issue to handle them: #103

akanyaani commented on April 29, 2024

Hi, I am also getting a similar error while training an autoencoder.

Caused by op u'decoder/decoder/while/BasicDecoderStep/TrainingHelperNextInputs/cond/TensorArrayReadV3', defined at:
File "train_autoencoder.py", line 137, in <module>
tf.app.run()
File "/home/abhay/anaconda2/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "train_autoencoder.py", line 133, in main
train_autoencoder()
File "train_autoencoder.py", line 36, in train_autoencoder
model = Autoencoder(FLAGS.lstm_units, embedding_matrix, 0, 1, num_layers=FLAGS.num_layers, train_embeddings=False)
File "/home/abhay/Search_And_Match/encoder.py", line 110, in __init__
impute_finished=False)
File "/home/abhay/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/seq2seq/python/ops/decoder.py", line 304, in dynamic_decode
swap_memory=swap_memory)
File "/home/abhay/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3224, in while_loop
result = loop_context.BuildLoop(cond, body, loop_vars, shape_invariants)
File "/home/abhay/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2956, in BuildLoop
pred, body, original_loop_vars, loop_vars, shape_invariants)
File "/home/abhay/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2893, in _BuildLoop
body_result = body(*packed_vars_for_body)
File "/home/abhay/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/seq2seq/python/ops/decoder.py", line 249, in body
decoder_finished) = decoder.step(time, inputs, state)
File "/home/abhay/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/seq2seq/python/ops/basic_decoder.py", line 146, in step
sample_ids=sample_ids)
File "/home/abhay/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/seq2seq/python/ops/helper.py", line 250, in next_inputs
lambda: nest.map_structure(read_from_ta, self._input_tas))
File "/home/abhay/anaconda2/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 432, in new_func
return func(*args, **kwargs)
File "/home/abhay/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2072, in cond
orig_res_f, res_f = context_f.BuildCondBranch(false_fn)
File "/home/abhay/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/control_flow_ops.py", line 1913, in BuildCondBranch
original_result = fn()
File "/home/abhay/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/seq2seq/python/ops/helper.py", line 250, in <lambda>
lambda: nest.map_structure(read_from_ta, self._input_tas))
File "/home/abhay/anaconda2/lib/python2.7/site-packages/tensorflow/python/util/nest.py", line 375, in map_structure
structure[0], [func(*x) for x in entries])
File "/home/abhay/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/seq2seq/python/ops/helper.py", line 247, in read_from_ta
return inp.read(next_time)
File "/home/abhay/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/tensor_array_ops.py", line 861, in read
return self._implementation.read(index, name=name)
File "/home/abhay/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/tensor_array_ops.py", line 260, in read
name=name)
File "/home/abhay/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/gen_data_flow_ops.py", line 6428, in tensor_array_read_v3
dtype=dtype, name=name)
File "/home/abhay/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/abhay/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3392, in create_op
op_def=op_def)
File "/home/abhay/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1718, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): Tried to read from index 50 but array size is: 50
[[Node: decoder/decoder/while/BasicDecoderStep/TrainingHelperNextInputs/cond/TensorArrayReadV3 = TensorArrayReadV3[dtype=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](decoder/decoder/while/BasicDecoderStep/TrainingHelperNextInputs/cond/TensorArrayReadV3/Switch, decoder/decoder/while/BasicDecoderStep/TrainingHelperNextInputs/cond/TensorArrayReadV3/Switch_1/_65, decoder/decoder/while/BasicDecoderStep/TrainingHelperNextInputs/cond/TensorArrayReadV3/Switch_2/_67)]]
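
A message of the form "Tried to read from index N but array size is: N" coming from `TrainingHelper` often suggests that some entry in `sequence_length` is larger than the time dimension of the decoder inputs, so the helper tries to read one step past the end. That is an assumption about the cause here, not something confirmed in this thread. A minimal sanity check on the batch before feeding it, with the hypothetical helper `find_overlong_sequences`, could look like this:

```python
def find_overlong_sequences(sequence_lengths, max_time):
    """Return indices of examples whose declared length exceeds max_time.

    sequence_lengths: iterable of ints, one per example in the batch.
    max_time: number of time steps actually present in the padded inputs.
    """
    return [
        i for i, length in enumerate(sequence_lengths)
        if length > max_time
    ]


# Example: inputs padded to 50 steps, but one example claims 51 steps.
bad = find_overlong_sequences([50, 51, 12], max_time=50)
if bad:
    print("sequence_length exceeds input time dimension at indices:", bad)
```

If the check flags any indices, clipping the lengths to the padded size or padding the inputs to the longest declared length are two options to try.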

TanyaChowdhury commented on April 29, 2024

Hey @akanyaani! I'm getting a very similar stack trace. Could you please tell me what you did to solve the error?

