Hi, are you planning to add support for using library with opencl through cltorch and

rnn with cltorch? (rnn issue, closed)

element-research commented on August 28, 2024

rnn with cltorch?

from rnn.

Comments (16)

nicholas-leonard commented on August 28, 2024

I guess that would be nice. But I have no plans to do so in the immediate future. The rnn repo has no C/CUDA code per se, unless you count the examples where we explicitly call module:cuda() and such. If clnn and cltorch have about the same interface and module support as cunn and cutorch, then it shouldn't be too hard to support cl. You looking to give it a try?


sidec commented on August 28, 2024

Thanks. Yes, I will definitely give it a try and share my experience.


hughperkins commented on August 28, 2024

Hi. I patched cltorch just now, so that at least the training of rnn works. I haven't tested yet that sampling works, but since training is forwards + backwards, and sampling is forwards only, it seems plausible that sampling will work too. Patched in hughperkins/cltorch@1eed093. Note that rnn is fairly aligned with stuff I'm working on at the moment, so please feel free to log any cltorch-related issues directly at https://github.com/hughperkins/cltorch/issues , and I will try to take a look.


hughperkins commented on August 28, 2024

(PS: this is the code I used for training. Some of the loading code, and the data file, are from Karpathy's char-rnn:

-- copyright Hugh Perkins 2015
-- license MPLv2
-- this work contains many parts derived technically and conceptually from Andrej Karpathy's https://github.com/karpathy/char-rnn
-- which is under MIT license

require 'os'
require 'paths'
require 'torch'
require 'sys'
require 'nn'
require 'rnn'

local backend = 'cuda'
backend = 'cl'

if backend == 'cuda' then
  require 'cutorch'
  require 'cunn'
elseif backend == 'cl' then
  require 'cltorch'
  require 'clnn'
end

local dataDir = 'data/tinyshakespeare'

local in_file = 'input.txt'
local in_t7 = 'input.t7'
local vocab_t7 = 'vocab.t7'

local dropoutProb = 0.0
local hiddenSizes = {128, 128}
local learningRate = 0.04
local seqLength = 50
local batchSize = 50

function getFileSize(filePath)
  local sizestr = sys.execute('stat --format=%s ' .. filePath)
  local size = tonumber(sizestr)
  return size
end

-- shamelessly adapted from Karpathy's char-rnn :-)
function text_to_t7(in_textfile, out_tensorfile, out_vocabfile)
  local cache_len = 10000
  local rawdata
  local tot_len = getFileSize(in_textfile)
  print('tot_len', tot_len)

  local input_coded = torch.ByteTensor(tot_len)
  local vocab = {}
  local ivocab = {}
  local state = {}
  state.vocab = vocab
  state.ivocab = ivocab

  local f = io.open(in_textfile, "r")
  local input_string = f:read(cache_len)
  local pos = 1
  repeat
    local len = input_string:len()
    print('len', len)
    for i=1,len do
      local char = input_string:byte(i)
      if vocab[char] == nil then
        ivocab[#ivocab + 1] = char
        vocab[char] = #ivocab
      end
      local v = vocab[char]
      input_coded[pos] = v
      pos = pos + 1
    end
    input_string = f:read(cache_len)
  until not input_string
  f:close()

  print('#ivocab', #ivocab)
  local vocabs = {}
  vocabs.vocab = vocab
  vocabs.ivocab = ivocab
  torch.save(out_vocabfile, vocabs)
  torch.save(out_tensorfile, input_coded)
end

if not paths.filep(dataDir .. '/' .. in_t7) or not paths.filep(dataDir .. '/' .. vocab_t7) then
  text_to_t7(dataDir .. '/' .. in_file, dataDir .. '/' .. in_t7, dataDir .. '/' .. vocab_t7)
end

local vocabs = torch.load(dataDir .. '/' .. vocab_t7)
local input = torch.load(dataDir .. '/' .. in_t7)
print('vocabs', vocabs)
for k,v in pairs(vocabs) do
  print('k', k)
end
print('loaded input')
local ivocab = vocabs.ivocab
local vocab = vocabs.vocab

local net = nn.Sequential()
local inputSize = #ivocab
for i, hiddenSize in ipairs(hiddenSizes) do
  net:add(nn.FastLSTM(inputSize, hiddenSize))
  inputSize = hiddenSize
end

net:add(nn.Linear(inputSize, #ivocab))
net:add(nn.LogSoftMax())
--lm:remember('both')

local crit = nn.ClassNLLCriterion()

print('#ivocab', #ivocab)

local it = 1
local offset = math.floor(input:size(1) / batchSize)
local batchInput = torch.Tensor(batchSize, #ivocab)
local batchTarget = torch.Tensor(batchSize)
if backend == 'cuda' then
  batchInput = batchInput:cuda()
  batchTarget = batchTarget:cuda()
  net:cuda()
  crit:cuda()
elseif backend == 'cl' then
  batchInput = batchInput:cl()
  batchTarget = batchTarget:cl()
  net:cl()
  crit:cl()
end
while true do
  sys.tic()
  local seqLoss = 0
  net:forget()
  net:zeroGradParameters()
  net:training()
  net:backwardOnline()
  -- per-sequence caches; local to avoid leaking globals
  local batchInputs = {}
  local batchOutputs = {}
  for s=1,seqLength do
    batchInput:zero()
    for b=1,batchSize do
      local thisOffset = (b * offset + it * seqLength + s - 1) % input:size(1) + 1  -- should handle wrap around...
      local thisChar = input[thisOffset]
      batchInput[b][thisChar] = 1
    end

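    local batchOutput = net:forward(batchInput)
    -- clone: batchInput and the net's output buffer are reused on every step,
    -- so storing the bare references would make all steps share the last values
    batchInputs[s] = batchInput:clone()
    batchOutputs[s] = batchOutput:clone()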
  end

  for s=seqLength,1,-1 do
    for b=1,batchSize do
      local thisOffset = (b * offset + it * seqLength + s - 1) % input:size(1) + 1  -- should handle wrap around...
      local thisTargetOffset = (thisOffset + 1 - 1) % input:size(1) + 1
      local thisTargetChar = input[thisTargetOffset]
      batchTarget[b] = thisTargetChar
    end

    local batchLoss = crit:forward(batchOutputs[s], batchTarget)
    seqLoss = seqLoss + batchLoss
    local batchGradOutput = crit:backward(batchOutputs[s], batchTarget)
    net:backward(batchInputs[s], batchGradOutput)
  end

  net:updateParameters(learningRate)
  print('it', it, 'seqLoss', seqLoss, 'time', sys.toc())
  it = it + 1
end

)
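
The offset arithmetic in the training loop above maps a running 0-based position onto 1-based tensor indices with wrap-around. A minimal standalone sketch in plain Lua (no Torch needed) shows the same arithmetic in isolation; `wrapIndex` and the sample sizes are hypothetical, chosen only for illustration:

```lua
-- Map an arbitrary non-negative position onto a 1-based index in [1, n],
-- wrapping around at the end, as the training loop does with
-- (b * offset + it * seqLength + s - 1) % input:size(1) + 1.
local function wrapIndex(pos, n)
  return pos % n + 1
end

-- Hypothetical sizes for illustration only.
local n = 10                               -- length of the encoded input
local batchSize = 2
local offset = math.floor(n / batchSize)   -- stride between batch rows
local seqLength = 3
local it = 1

for b = 1, batchSize do
  for s = 1, seqLength do
    local inputIdx = wrapIndex(b * offset + it * seqLength + s - 1, n)
    -- the target is the character right after the input, also wrapped
    local targetIdx = wrapIndex(inputIdx, n)
    print(b, s, inputIdx, targetIdx)
  end
end
```

Passing an already 1-based index to `wrapIndex` yields the following position, which is what the `(thisOffset + 1 - 1) % input:size(1) + 1` expression in the loop reduces to.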


sidec commented on August 28, 2024

Thanks for this code sample!
I'm now using Mesa OpenCL (Clover, with OpenCL device AMD ARUBA (DRM 2.43.0, LLVM 3.7.0)), and the code fails with an error thrown by Module.lua from the dpnn library:

torch/install/share/lua/5.1/dpnn/Module.lua:209: attempt to call method 'data' (a nil value)

However, when I run it on the CPU, it runs smoothly.


hughperkins commented on August 28, 2024

As far as I know, Clover won't actually run on any of your OpenCL devices as such. It's essentially an OpenCL emulator that runs as x86 code on your CPU (not using the CPU's GPU, or CPU OpenCL, just normal x86, as far as I know).


sidec commented on August 28, 2024

Good to know. I used clBLAS 2.6 and saw a difference in performance in the single-precision matrix subroutines; sadly, I have no strong evidence at hand to show. I think the main problem with mesa-opencl is that it supports only OpenCL 1.1 and doesn't implement the entire math API.


hughperkins commented on August 28, 2024

> I used clBLAS 2.6 and saw difference in terms of performance in single precision matrix subroutines

There could be many reasons for this. It could be that you are running single-threaded when 'using cpu', but running multithreaded, when running in Clover. One way to test this would be to create a calculation that lasts 10-60 seconds, and look at the output of htop when running 'in cpu', and when running on Clover.


sidec commented on August 28, 2024

Although it is obviously low end, I'm pretty sure that mesa-opencl on the Radeon HD 7640G uses the GPU. I followed your advice and ran an example from aparapi while closely watching the htop output. It supported the impression I got earlier when I used clBLAS: during the calculations only 1 core seemed to be busy. Here is a sample from the console:

gpu time: 9857
cpu time: 22939
valid? yes
boolean 2D
gpu time: 2682
cpu time: 6301
valid? yes
byte 2D
gpu time: 2636
cpu time: 6900
valid? yes
short 2D
gpu time: 1882
cpu time: 7334
valid? no
int 2D
gpu time: 775
cpu time: 7671
valid? yes
long 2D
gpu time: 1277
cpu time: 8724
valid? yes
float 2D
gpu time: 765
cpu time: 7314
valid? yes
double 2D
clBuildProgram failed
************************************************
unsupported call to function __muldf3 in run
************************************************
Jan 04, 2016 8:55:32 PM com.amd.aparapi.internal.kernel.KernelRunner warnFallBackAndExecute
WARNING: Reverting to Java Thread Pool (JTP) for class gov.pnnl.aparapi.sample.mdarray.DMatMul2D: OpenCL compile failed

As expected, the program failed when it attempted a double-precision calculation on the GPU, but aside from that it worked just fine and with a clear speedup.

This link is to a features matrix, and I hope that at some point in the future we will get a fully-fledged open-source implementation of OpenCL.


hughperkins commented on August 28, 2024

Ah, interesting :-) That's new information. Seems pretty convincing :-) I shall have to go back to all the threads where I suggested Clover was running in x86, and update those :-D Don't suppose... could you also run this twice using just floats, once only for 'cpu time' and once only for 'gpu time', and provide the output of the time command for each run, along with the script/program you are using?


sidec commented on August 28, 2024

Running this chunk:

require 'torch'
require 'cltorch'
torch.setdefaulttensortype("torch.FloatTensor")
n = 50
K = 2000
M = 1000
N = 1000
r1 = torch.randn(M,K)
r2 = torch.randn(K,N)
start = torch.tic()
result = torch.FloatTensor(M,N) 
for i=1,n do
   torch.mm(result,r1,r2)
end
elapsed = torch.toc(start)
print(string.format("FloatTensor torch.mm: %2.2f sec\n", elapsed/n));
rCl1 = r1:cl()
rCl2 = r2:cl()
result = torch.ClTensor(M,N)
print(torch.type(result))
start = torch.tic()
for i=1,n do
   torch.mm(result,rCl1,rCl2)
end
elapsed = torch.toc(start)
print(string.format("ClTensor torch.mm: %2.2f sec\n", elapsed/n));

I got:

FloatTensor torch.mm: 0.27 sec

Using Mesa , OpenCL platform: Clover
Using OpenCL device: AMD ARUBA (DRM 2.43.0, LLVM 3.7.0)
torch.ClTensor  
ClTensor torch.mm: 0.01 sec

I wasn't able to set, for example, n=100 to make the computation run longer and observe how the CPUs behave, because it crashed with a segmentation fault (core dumped). As I mentioned before, this is a low-end device, but in my opinion the open-source driver already uses the GPU.


hughperkins commented on August 28, 2024

Hmmm, does seem convincing. The measured time might be a bit short, because the loop may need a cltorch.synchronize() before the timer stops, but if it were not running on the GPU, such a sync point would not change anything. Seems fairly convincing.
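
To make the synchronization point concrete, here is a minimal sketch of a timing helper in plain Lua. `timeIt`, its parameters, and the dummy workload are hypothetical names introduced for illustration; the helper simply calls an optional `sync` function before reading the clock, so that asynchronously queued work is counted:

```lua
-- Time `n` calls of `fn`, optionally calling `sync` before reading the clock
-- so that asynchronously queued work is included in the measurement.
-- `clock` defaults to os.clock (CPU time); a wall clock is a better choice
-- when the work runs on another device.
local function timeIt(fn, n, sync, clock)
  clock = clock or os.clock
  local start = clock()
  for _ = 1, n do
    fn()
  end
  if sync then sync() end   -- wait for queued work before stopping the timer
  return (clock() - start) / n
end

-- CPU-only demonstration with a dummy workload and a no-op sync.
local function work()
  local acc = 0
  for i = 1, 100000 do acc = acc + i * i end
  return acc
end

local perCall = timeIt(work, 10, function() end)
print(string.format("per call: %.6f sec", perCall))
```

With cltorch loaded, one would presumably pass `cltorch.synchronize` as `sync` and a wall clock such as `sys.clock` as `clock`; that pairing is an assumption and has not been checked against Clover.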

Going back to your original issue, with :data(), can you update your cltorch and clnn to the latest version, and provide the output of:

luajit -l cltorch -e 'cltorch.about()'
luajit -l clnn -e 'clnn.about()'

... and then double-check whether you still have the issue?


sidec commented on August 28, 2024

cltorch.  OpenCL backend for Torch
Built from commit 8ac38ef
clnn.  OpenCL backend for Torch nn
Built from commit

The issue with :data() is gone. Now I get the well-known:

Something went wrong with clCreateKernel, OpenCL erorr code -45
Apply_2t_0s_0pt_-2_2_*out = tanh( *in1 ) build log: 
unsupported call to function _Z4tanhf in THClTensor_pointwiseApplyD

Strangely, mesa-opencl is missing tanh and other math functions. I can provide you with the output of the cltorch and clnn test suites if you wish.


hughperkins commented on August 28, 2024

Ah yeah, Clover doesn't handle tanh. But you can build tanh from exp, which is what I suggest you do: create a fork of cltorch, and modify THClTensor_pointwise.cpp to use exp instead of tanh. In the future you/me/someone might figure out a way of doing something like if(clover) { // use exp to write tanh; } else { // use tanh; }, but for now, creating a separate branch/fork is probably a good way forward.

I'm going to throw out links to all the threads where I claim Clover doesn't run on the GPU :-D which contain the formulae for tanh in terms of exp, and will also alert the other threads that it seems Clover really does run on the GPU :-) [after linking a bit] Ok, actually I did the opposite: I linked from them to here.

Here's a post which summarizes how to build tanh from exp:

karpathy/char-rnn#128 (comment)

It looks like in that post I'm suggesting modifying Tanh.lua, in clnn, by changing line 3. The formula for tanh in terms of exp is: tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))

Edit: and looks like you'd want to change line 3 to something like:

self.output:map(input, 'x = (exp(y) - exp(-y)) / (exp(y) + exp(-y))')
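
As a quick sanity check of the identity above, here is a plain-Lua sketch (no Torch or OpenCL involved). `tanhNaive` follows the formula as written; `tanhStable` is an algebraically equivalent rearrangement, 1 - 2 / (e^(2x) + 1), added here as a suggestion because the naive form evaluates to inf/inf once exp overflows. Both function names are hypothetical:

```lua
local exp = math.exp

-- tanh written directly from the formula in the post
local function tanhNaive(x)
  return (exp(x) - exp(-x)) / (exp(x) + exp(-x))
end

-- equivalent rearrangement: (e^(2x) - 1) / (e^(2x) + 1) = 1 - 2 / (e^(2x) + 1)
local function tanhStable(x)
  return 1 - 2 / (exp(2 * x) + 1)
end

for _, x in ipairs({-2, -0.5, 0, 0.5, 2}) do
  print(x, tanhNaive(x), tanhStable(x))
end

-- For large inputs exp overflows to inf: the naive form gives inf/inf (NaN),
-- while the rearranged form saturates at 1.
local big = 1000
print(tanhNaive(big) ~= tanhNaive(big))  -- NaN is the only value not equal to itself
print(tanhStable(big))
```

In the map expression this would presumably correspond to something like 'x = 1 - 2 / (exp(2 * y) + 1)', though that variant is untested on Clover.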


sidec commented on August 28, 2024

Thanks for the help. I will patch cltorch and give it a try. ~~However, I'm not sure mesa-opencl supports exponential either.~~
BTW, I checked this LSTM language model you adapted from Karpathy on Debian with the Catalyst driver installed, and I saw just a modest speedup (about 1.2x to 1.3x) compared with CPU only. After all, I shouldn't be surprised: the tensors are quite small for this toy example. The good news is that it works smoothly, and I'm happy to stick with OpenCL.


hughperkins commented on August 28, 2024

Ok. I guess this issue can be closed now?

