Comments (12)
Hi,
unfortunately, our MDLSTM implementation is for Theano only, so you can't use it with the TensorFlow backend.
from returnn.
Hi,
Could you give me some directions if I want to "hack into" it myself?
Thanks
from returnn.
I removed "use_tensorflow:true" from config_real and ran "./go.sh".
Here is another exception:
......
File "/home/ubuntu/temp/returnn/Updater.py", line 4, in <module>
line: import theano
locals:
theano =
File "/usr/local/lib/python2.7/dist-packages/theano/__init__.py", line 116, in <module>
line: theano.sandbox.cuda.tests.test_driver.test_nvidia_driver1()
locals:
theano = <module 'theano' from '/usr/local/lib/python2.7/dist-packages/theano/__init__.pyc'>
theano.sandbox = <module 'theano.sandbox' from '/usr/local/lib/python2.7/dist-packages/theano/sandbox/__init__.pyc'>
theano.sandbox.cuda = <module 'theano.sandbox.cuda' from '/usr/local/lib/python2.7/dist-packages/theano/sandbox/cuda/__init__.pyc'>
theano.sandbox.cuda.tests = <module 'theano.sandbox.cuda.tests' from '/usr/local/lib/python2.7/dist-packages/theano/sandbox/cuda/tests/__init__.pyc'>
theano.sandbox.cuda.tests.test_driver = <module 'theano.sandbox.cuda.tests.test_driver' from '/usr/local/lib/python2.7/dist-packages/theano/sandbox/cuda/tests/test_driver.pyc'>
theano.sandbox.cuda.tests.test_driver = <module 'theano.sandbox.cuda.tests.test_driver' from '/usr/local/lib/python2.7/dist-packages/theano/sandbox/cuda/tests/test_driver.pyc'>
theano.sandbox.cuda.tests.test_driver.test_nvidia_driver1 = <function test_nvidia_driver1 at 0x7ffb30ab4cf8>
File "/usr/local/lib/python2.7/dist-packages/theano/sandbox/cuda/tests/test_driver.py", line 41, in test_nvidia_driver1
line: raise Exception("The nvidia driver version installed with this OS "
locals:
Exception = <type 'exceptions.Exception'>
Exception: The nvidia driver version installed with this OS does not give good results for reduction.Installing the nvidia driver available on the same download page as the cuda package will fix the problem: http://developer.nvidia.com/cuda-downloads
Device proc gpuX (gpuZ) died: ProcConnectionDied('recv_bytes EOFError: ',)
Theano flags: compiledir_format=compiledir_%(platform)s-%(processor)s-%(python_version)s-%(python_bitwidth)s--dev-gpuZ,device=gpu,force_device=True
EXCEPTION
Traceback (most recent call last):
File "/home/ubuntu/temp/returnn/Device.py", line 347, in startProc
line: self._startProc(*args, **kwargs)
locals:
self = <Device.Device object at 0x7ff35acd5cd0>
self._startProc = <bound method Device._startProc of <Device.Device object at 0x7ff35acd5cd0>>
args = ('gpuZ',)
kwargs = {}
File "/home/ubuntu/temp/returnn/Device.py", line 401, in _startProc
line: interrupt_main()
locals:
interrupt_main = <function interrupt_main at 0x7ff35b881668>
File "/home/ubuntu/temp/returnn/Util.py", line 665, in interrupt_main
line: sys.exit(1) # And exit the thread.
locals:
sys = <module 'sys' (built-in)>
sys.exit =
SystemExit: 1
KeyboardInterrupt
EXCEPTION
Traceback (most recent call last):
File "../../../rnn.py", line 539, in main
line: init(commandLineOptions=argv[1:])
locals:
init = <function init at 0x7ff35accd1b8>
commandLineOptions =
argv = ['../../../rnn.py', 'config_real'], _[0]: {len = 15}
File "../../../rnn.py", line 341, in init
line: devices = initDevices()
locals:
devices =
initDevices = <function initDevices at 0x7ff35acccd70>
File "../../../rnn.py", line 154, in initDevices
line: time.sleep(0.25)
locals:
time = <module 'time' (built-in)>
time.sleep =
KeyboardInterrupt
Quitting
I didn't hit any key though!
from returnn.
Hi,
the important part of the messages is this:
The nvidia driver version installed with this OS does not give good results for reduction.Installing the nvidia driver available on the same download page as the cuda package will fix the problem: http://developer.nvidia.com/cuda-downloads
Please try another nvidia driver.
For MDLSTM with TensorFlow, also have a look at #8.
The most important part would be to wrap our CUDA-based MDLSTM implementation as a TensorFlow kernel. If you want to look further into this, I can give you a few more details. However, I think it will take some effort (I'm not sure how much) and cannot just be hacked together in a few minutes.
from returnn.
Hi,
"If you want to look further into this, I can give you a few more details."
Yes, please send me the details. I am interested in making this work.
Thanks!
from returnn.
Hi,
With regard to the error message:
"
The nvidia driver version installed with this OS does not give good results for reduction.Installing the nvidia driver available on the same download page as the cuda package will fix the problem: http://developer.nvidia.com/cuda-downloads
"
I upgraded to nvidia-390, but I still get this error!
Here is the HW/SW configuration:
AWS p2.xlarge (Tesla K80)
nvidia 390
theano 0.9.0
pygpu 0.6.9
Tried
% python ../../../rnn.py config_demo
"Could not find cudnn library (looked for v5[.1])"
Tried
% THEANO_FLAGS=device=gpu python ../../../rnn.py config_demo
"... does not give good results ...."
Ran demo/demo-tf-lstm-benchmark.py
KeyError: 'lstmblockfused'
But I can see TFEngine works on the GPU.
Please advise!
Thanks
from returnn.
"Could not find cudnn library (looked for v5[.1])"
Have a look at http://deeplearning.net/software/theano/library/sandbox/cuda/dnn.html on how to set up cuDNN for Theano.
"The nvidia driver version installed with this OS does not give good results for reduction.Installing the nvidia driver available on the same download page as the cuda package will fix the problem"
I don't know much about that either. It might be related to the Theano version: are you using 0.9 or 1.0? It should be worth trying 0.9. A Google search for the error message brought up quite a few results, e.g. Theano/Theano#5530.
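When comparing against reports like the one linked above, it helps to post the exact driver and package versions. Here is a minimal sketch of how that could be collected, assuming Python 3.8+ and that nvidia-smi is on the PATH. The helper names are made up for illustration and are not part of RETURNN, and packages installed outside pip (e.g. pygpu via conda) may show up as not found.

```python
import subprocess
from importlib import metadata


def driver_version():
    """Return the NVIDIA driver version reported by nvidia-smi, or None."""
    cmd = ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"]
    try:
        out = subprocess.run(
            cmd, capture_output=True, text=True, check=True, timeout=10)
    except (OSError, subprocess.SubprocessError):
        return None  # nvidia-smi is missing, failed, or timed out
    lines = out.stdout.strip().splitlines()
    return lines[0].strip() if lines else None


def package_version(name):
    """Return the installed version of a Python package, or None."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def collect_env_info():
    """Gather the versions that matter for the driver/Theano error."""
    info = {"nvidia_driver": driver_version()}
    for pkg in ("Theano", "pygpu", "tensorflow"):
        info[pkg] = package_version(pkg)
    return info


if __name__ == "__main__":
    for key, value in collect_env_info().items():
        print("%s: %s" % (key, value or "not found"))
```

This only reports versions. It does not tell you whether a given driver passes Theano's reduction test.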
from returnn.
For a possible port of MDLSTM to TensorFlow:
The MDLSTM implementation is mainly here: https://github.com/rwth-i6/returnn/blob/master/cuda_implementation/MultiDirectionalTwoDLSTMOp.py
Here, we derive from theano.sandbox.cuda.GpuOp to define a Theano Op. You would need to create a different wrapper for TensorFlow. You can find some information about this here (pay special attention to the GPU kernel parts): https://www.tensorflow.org/extend/adding_an_op
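On the Python side, a compiled custom op is usually loaded with tf.load_op_library. A rough sketch of that last step, assuming the CUDA kernel has already been wrapped and compiled into a shared library as described in the guide above. The file name mdlstm_op.so and the helper function are hypothetical; nothing like this exists in RETURNN yet.

```python
import os


def load_custom_op_library(so_path):
    """Load a compiled TensorFlow op library and return the generated module.

    so_path is the shared library built from the C++/CUDA op sources.
    """
    if not os.path.isfile(so_path):
        raise FileNotFoundError("op library not found: %s" % so_path)
    # Imported lazily so that the path check works without TensorFlow.
    import tensorflow as tf
    return tf.load_op_library(so_path)


# Hypothetical usage, once such a library has been built:
#   mdlstm_module = load_custom_op_library("./mdlstm_op.so")
# The returned module exposes one Python function per op registered
# in the library; the op names depend on how the op is registered.
```

The real work is in the C++/CUDA wrapper and the gradient registration, which this does not cover.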
from returnn.
Thanks!
from returnn.
Is this bug now only about MDLSTM in TensorFlow? Then this is just a duplicate of #8.
from returnn.
@albertz
more like an unfinished feature than a bug!
from returnn.
Yea. Ok, then I'm closing this now. Anything related to MDLSTM in TF should be discussed in #8. If there is another separate issue, please open a separate issue.
from returnn.