naibaf7 / libdnn
Greentea LibDNN - a universal convolution implementation supporting CUDA and OpenCL
License: Other
With #26 fixed on my side, I was able to run some benchmarks now. The libDNN-generated convolutions are about 3x-4x faster than my naive kernel described in #26, which is very nice! But they are still slower than my convolutions running on the CPU :-(
The CPU implementation is single-threaded, uses NHWC for the input, and uses the following filter layout:
filter = [depth/N, filter, filter, channel, N]
where N is 8. This is done to make access to the filter more cache-friendly. As far as I understand, the following TVM trick uses a similar approach: http://tvmlang.org/2018/01/16/opt-mali-gpu.html (see tiling and packing).
WDYT about this kind of layout optimization? Have you played with something like this? Do you think it may result in even faster convolution kernels?
BTW, I'm testing on a MacBook Pro 2017 using the AMD GPU.
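For reference, the packed [depth/N, filter, filter, channel, N] filter layout described above can be written as plain index arithmetic. This is a minimal sketch of that layout (the function name is made up for illustration; N = 8 as in the description, and depth is assumed to be a multiple of N):

```c
#include <stddef.h>

#define N 8 /* pack width, as described above */

/* Index into a filter stored as [depth/N, fh, fw, channel, N].
 * fsize = spatial filter size, chans = input channels,
 * d = output channel, (fh, fw) = filter position, c = input channel. */
static size_t packed_filter_index(size_t fsize, size_t chans,
                                  size_t d, size_t fh, size_t fw, size_t c) {
  size_t outer = d / N;   /* which pack of N output channels */
  size_t inner = d % N;   /* position within the pack */
  return (((outer * fsize + fh) * fsize + fw) * chans + c) * N + inner;
}
```

The point of the layout: N consecutive output channels end up adjacent in memory, so an inner loop over output channels within a pack walks a contiguous cache line instead of striding across the whole filter tensor.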
I use caffe-opencl with a Mali GPU, but it seems that libdnn does not support Mali.
Actually, I want to do some optimization in the OpenCL kernels for operations such as element-wise multiplication. You have done some memory optimization in the libdnn OpenCL kernels. But as far as I know, OpenCL memory on Mali just uses CL_MEM_ALLOC_HOST_PTR for CPU data.
Would you tell me the method libdnn uses for memory optimization, or point me to some resources about this?
Thank you.
Can you please provide CMake build instructions for Linux?
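In case it helps, a typical out-of-source CMake build on Linux would look something like this. This is only a sketch assuming the repository's top-level CMakeLists.txt; the actual configuration options (if any) may differ:

```shell
git clone https://github.com/naibaf7/libdnn.git
cd libdnn
mkdir build && cd build
cmake ..            # pass -D options here if the project defines any
make -j"$(nproc)"
```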
Just to collect some feedback from Matrix Multiplication Beyond Auto-Tuning: Rewrite-based GPU Code Generation. /cc @CNugteren since the benchmark on mobile involved CLBlast.
I'm opening this ticket to discuss ideas for integrating libdnn into tiny-cnn.
Currently, I implemented a small interface to get native OpenCL context from tiny-cnn:
https://github.com/edgarriba/tiny-cnn/blob/f4d9e1d4f45ad8ac46824b5c007f5507fe8925eb/tiny_cnn/core/session.h
Things I think are needed:
BTW, @naibaf7 @hughperkins note that we are planning to migrate tiny-cnn to an organization account and to rename the library itself, since it is now more a pure DNN library than just a CNN one. Maybe you are interested in getting more involved in the development. tiny-dnn/tiny-dnn#235
... this way it is easier to download and build
(albeit without the 'examples' or 'doc' directories...)
Just to be aware of https://arxiv.org/abs/1606.00094 /cc @moskewcz
I'm trying to use the LibDNN-generated kernels instead of my own naive kernel below:

typedef int ssize_t;

struct NCHW {
  size_t n;
  size_t c;
  size_t h;
  size_t w;
};

inline size_t getNCHWIndex(NCHW s, size_t n, size_t c, size_t h, size_t w) {
  return (n * s.c * s.h * s.w) + (c * s.h * s.w) + (h * s.w) + w;
}

/// Perform a convolution on the inputs in NCHW format.
__kernel void oclconvolutionK(__global float *dest,
                              __global float *src,
                              __global float *filter,
                              __global float *bias,
                              unsigned filterSize,
                              unsigned stride,
                              unsigned pad,
                              NCHW destDim, NCHW srcDim, NCHW filterDim) {
  // Destination x, y and channel.
  size_t ax = get_global_id(0);
  size_t ay = get_global_id(1);
  size_t d = get_global_id(2);

  // Top-left corner of the filter window in the (padded) input.
  ssize_t x = -(ssize_t)pad + (ssize_t)(ax * stride);
  ssize_t y = -(ssize_t)pad + (ssize_t)(ay * stride);

  // For each input in the batch:
  for (size_t n = 0; n < srcDim.n; n++) {
    float sum = 0;
    // For each element in the convolution filter:
    for (size_t fx = 0; fx < filterSize; fx++) {
      for (size_t fy = 0; fy < filterSize; fy++) {
        ssize_t ox = x + fx;
        ssize_t oy = y + fy;
        // Skip accesses outside the input; these positions are the
        // zero padding.
        if (ox < 0 || oy < 0 || ox >= (ssize_t)srcDim.h ||
            oy >= (ssize_t)srcDim.w) {
          continue;
        }
        // Accumulate over all input channels.
        for (size_t fd = 0; fd < srcDim.c; fd++) {
          sum += filter[getNCHWIndex(filterDim, d, fd, fx, fy)] *
                 src[getNCHWIndex(srcDim, n, fd, (size_t)ox, (size_t)oy)];
        }
      }
    }
    sum += bias[d];
    dest[getNCHWIndex(destDim, n, d, ax, ay)] = sum;
  } // n
}
I checked that my kernel generates correct results.
But I get rather different results from the libDNN-generated kernels when the filter (i.e. kernel) size is > 1.
I guess I'm doing something wrong.
So I'd like to check with you whether, at least at the logical level, the kernels generated by libDNN compute mathematically the same convolution, or whether you use a slightly different formula.
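For cross-checking results, here is a minimal host-side reference in plain C that computes the same NCHW convolution as the kernel above, restricted to batch size 1, stride 1, no padding, and no bias. It is a sketch for validating outputs, not libDNN code:

```c
#include <stddef.h>

/* Reference valid NCHW convolution:
 * dst[d][oy][ox] = sum over c, fy, fx of
 *   flt[d][c][fy][fx] * src[c][oy + fy][ox + fx]
 * src: C x H x W, flt: D x C x K x K, dst: D x (H-K+1) x (W-K+1). */
static void conv2d_nchw_ref(const float *src, size_t C, size_t H, size_t W,
                            const float *flt, size_t D, size_t K,
                            float *dst) {
  size_t OH = H - K + 1, OW = W - K + 1;
  for (size_t d = 0; d < D; d++)
    for (size_t oy = 0; oy < OH; oy++)
      for (size_t ox = 0; ox < OW; ox++) {
        float sum = 0.0f;
        for (size_t c = 0; c < C; c++)
          for (size_t fy = 0; fy < K; fy++)
            for (size_t fx = 0; fx < K; fx++)
              sum += flt[((d * C + c) * K + fy) * K + fx] *
                     src[(c * H + oy + fy) * W + ox + fx];
        dst[(d * OH + oy) * OW + ox] = sum;
      }
}
```

Running both this reference and the generated kernel on the same small random tensors makes it easy to see whether the mismatch is in the math or in the buffer layouts.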
Hi Fabian, I couldn't figure out which version of OpenCL libDNN supports. Is it 1.1 or 2.x?
Thanks!
PS: I hope AMD employees @fsword73 and @dagamayank can chip in some optimized code to make libDNN a fast replacement for cuDNN.
This repository started out targeting convolution and its immediate surroundings.
I'm wondering whether we could extend this goal into a multi-stakeholder extended API for DNNs.
This requires introducing other common accelerated kernels/ops into the API (probably starting by porting those in the OpenCL Caffe branch). I've opened this to collect feedback on a possible roadmap, because I think the OpenCL DNN scenario has historically been too fragmented, and the de facto standard acceleration API design has been set by cuDNN. /cc @edgarriba @hughperkins @gongzg (Intel). I don't /cc AMD because I think its OpenCL Caffe fork has no more resources allocated, and AMD seems more involved in the hcCaffe effort.
Quick question: Was libdnn tested with convolutions which use non-square paddings?
It works fine with square padding for me, but has some issues with a non-square one.
It could be, of course, that I'm doing something wrong. But before I dive into a deep debugging session, I thought I'd check whether it is known to work at all.
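For what it's worth, with separate pad_h and pad_w the usual formulas generalize per axis as below. This is a sketch of the standard convolution output-size arithmetic, not libdnn code:

```c
#include <stddef.h>

/* Standard output size for a padded, strided convolution along one axis:
 * out = (in + 2 * pad - filter) / stride + 1.
 * With non-square padding this is evaluated independently for the
 * height axis (pad_h) and the width axis (pad_w). */
static size_t conv_out_dim(size_t in, size_t pad, size_t filter,
                           size_t stride) {
  return (in + 2 * pad - filter) / stride + 1;
}
```

Inside a kernel, the window start offsets likewise become x = -pad_w + ax * stride and y = -pad_h + ay * stride; swapping pad_h and pad_w between axes is a classic bug that only shows up once the two paddings differ.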
Hello again. I ran into a problem when using libDNN instead of the caffe engine with https://github.com/fzliu/style-transfer.
With the caffe engine I got this.
(as planned)
With libDNN I got this.
(strange)
I also ran make runtest.
I have no idea what the reason for this glitch is.
How can I use the libdnn tuner? Is there some client?
Your library is pretty cool, but it looks like it has not been updated in a long time.
At the same time, the version of libdnn in your Caffe fork seems to be better maintained and has even gained some new features, like BLAS routine generators.
Could you provide some insight into your plans regarding the standalone libdnn, or libdnn in general?
Specifically, it would be nice if you could answer some of the following questions:
Do you plan to update the standalone libdnn, e.g. from the version in your Caffe fork?
What is the status of BLAS support in the Caffe version of libdnn? How does it compare to something like clBLAS, CLBlast, or their CUDA counterparts?
Could you provide a brief description of the algorithms you use to produce the optimized fused convolution (and other) kernels, and how/why they are better or faster than, e.g., im2col-based approaches or other well-known convolution implementations, in terms of performance or memory consumption? The documentation is pretty sparse at the moment. If the work is based on specific papers or well-known approaches, references would be appreciated.
How does libdnn compare, in terms of convolution performance, to current versions of cuDNN and other well-known implementations? In the past you reported it was very fast, often faster than competitors. Is that still the case, or have recent advances made other implementations faster?
Do you plan to add any new interesting features or improvements? If so, could you describe them?
Thanks!
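For background on the im2col-based approach mentioned above: it materializes every filter-sized input patch as a column of a scratch matrix so that the convolution becomes a single matrix multiply, at the cost of a (C*K*K) x (OH*OW) buffer. A minimal sketch of the classic technique (valid convolution, stride 1; not libdnn code):

```c
#include <stddef.h>

/* im2col for an NCHW image, valid convolution, stride 1.
 * src: C x H x W.  col: (C*K*K) rows by (OH*OW) columns, where each
 * column holds one K x K x C input patch.  Multiplying the
 * (D x C*K*K) filter matrix by col (a GEMM) then yields the
 * (D x OH*OW) convolution output. */
static void im2col(const float *src, size_t C, size_t H, size_t W, size_t K,
                   float *col) {
  size_t OH = H - K + 1, OW = W - K + 1;
  for (size_t c = 0; c < C; c++)
    for (size_t fy = 0; fy < K; fy++)
      for (size_t fx = 0; fx < K; fx++) {
        size_t row = (c * K + fy) * K + fx;
        for (size_t oy = 0; oy < OH; oy++)
          for (size_t ox = 0; ox < OW; ox++)
            col[row * OH * OW + oy * OW + ox] =
                src[(c * H + oy + fy) * W + ox + fx];
      }
}
```

Fused kernels avoid materializing col in global memory at all, which is where a memory-consumption advantage over im2col-plus-GEMM would come from.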
How do I run a convolution forward pass, input gradient, and weights gradient?
Assume I have a cl_mem for the input, weights, gradOutput, output, gradInput, and gradWeights, an OpenCL queue, probably an OpenCL context, and metadata describing the tensor dimensions. What do I do next?
Should we add some tests? Does anyone know whether Travis CI supports OpenCL, CUDA, ...?