
libdnn's People

Contributors

edgarriba, gfursin, naibaf7


libdnn's Issues

Question about performance

With #26 fixed on my side, I was able to run some benchmarks. The libDNN-generated convolutions are about 3x-4x faster than my naive kernel described in #26, which is very nice! But they are slower than my convolutions running on the CPU :-(

The CPU implementation is single-threaded, uses NHWC for the input, and uses the following filter layout:
filter = [depth/N, filter, filter, channel, N], where N is 8. This is done to make access to the filter more cache-friendly. As far as I understand, the following TVM trick uses a similar approach: http://tvmlang.org/2018/01/16/opt-mali-gpu.html (see tiling and packing).
WDYT about this kind of layout optimization? Have you played with something like this? Do you think it could result in even faster convolution kernels?
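
For concreteness, here is a minimal sketch of the index arithmetic behind that packed layout (the names are mine, purely to illustrate the packing; this is not code from either implementation):

#include <cstddef>

// Packed filter layout [depth/N][filterH][filterW][channels][N] with N = 8:
// the N output channels of one group are adjacent in memory, so iterating
// over them in the innermost loop touches consecutive elements.
constexpr std::size_t N = 8;

inline std::size_t packedFilterIndex(std::size_t outChannel,
                                     std::size_t fy, std::size_t fx,
                                     std::size_t inChannel,
                                     std::size_t filterH, std::size_t filterW,
                                     std::size_t inChannels) {
  std::size_t group = outChannel / N;  // which block of N output channels
  std::size_t lane  = outChannel % N;  // position inside the block
  return ((((group * filterH + fy) * filterW + fx) * inChannels) + inChannel) * N + lane;
}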

BTW, I'm testing on a MacBook Pro 2017 using the AMD GPU.

Does libdnn support Mali GPUs?

I use caffe-opencl with a Mali GPU, but it seems that libdnn does not support Mali.
Actually, I want to optimize some OpenCL kernels for operations such as element-wise multiplication. You have done some memory optimizations in the libdnn OpenCL kernels, but as far as I know, OpenCL memory on Mali should just use CL_MEM_ALLOC_HOST_PTR for CPU data.
Would you tell me which method libdnn uses for memory optimization, or point me to some resources about this?
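
For reference, this is the zero-copy allocation pattern I mean on Mali (standard OpenCL API calls; the helper function and its parameters are just an illustration, not libdnn code):

#include <CL/cl.h>

// Minimal zero-copy pattern for Mali: allocate the buffer with
// CL_MEM_ALLOC_HOST_PTR and map it on the host instead of copying
// with clEnqueueWriteBuffer.
cl_mem createZeroCopyBuffer(cl_context context, cl_command_queue queue,
                            const float *src, size_t count) {
  cl_int err = CL_SUCCESS;
  size_t nbytes = count * sizeof(float);
  cl_mem buf = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                              nbytes, NULL, &err);
  // Map the buffer into host address space and fill it directly.
  float *ptr = (float *)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                           0, nbytes, 0, NULL, NULL, &err);
  for (size_t i = 0; i < count; ++i) ptr[i] = src[i];
  clEnqueueUnmapMemObject(queue, buf, ptr, 0, NULL, NULL);
  return buf;  // usable as a kernel argument without an extra host->device copy
}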

Thank you!

Integration into tiny-cnn

@naibaf7 @bhack

I'm opening this ticket to discuss ideas for integrating libdnn into tiny-cnn.

Currently, I have implemented a small interface to get the native OpenCL context from tiny-cnn:
https://github.com/edgarriba/tiny-cnn/blob/f4d9e1d4f45ad8ac46824b5c007f5507fe8925eb/tiny_cnn/core/session.h
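
Conceptually, the interface just exposes the OpenCL handles that tiny-cnn owns so libdnn can reuse the same device. A rough sketch of the idea (hypothetical names, not the actual session.h code; see the link above for the real interface):

#include <CL/cl.h>

// Hypothetical session-style accessor: tiny-cnn owns the OpenCL objects and
// hands libdnn the raw handles so both libraries share one context and queue.
class Session {
 public:
  Session(cl_context ctx, cl_device_id dev, cl_command_queue q)
      : context_(ctx), device_(dev), queue_(q) {}

  cl_context context() const { return context_; }    // native OpenCL context
  cl_device_id device() const { return device_; }    // selected device
  cl_command_queue queue() const { return queue_; }  // shared command queue

 private:
  cl_context context_;
  cl_device_id device_;
  cl_command_queue queue_;
};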

Things I think are needed:

  • Implement a module for data transfer between devices
  • Discuss the shape of a simplified libdnn interface, if one is needed

BTW, @naibaf7 @hughperkins, note that we are planning to migrate tiny-cnn to an organization account and to rename the library itself, since it is now more a general-purpose DNN library than just a CNN one. Maybe you are interested in getting more involved in the development: tiny-dnn/tiny-dnn#235

Question about the generated OpenCL convolution kernels

I'm trying to use the libDNN-generated kernels instead of my own naive kernel shown below:

struct NCHW {
   size_t n;
   size_t c;
   size_t h;
   size_t w;
};

inline size_t getNCHWIndex(NCHW s, size_t n, size_t c, size_t h,
               size_t w) {
  return (n * s.c * s.w * s.h) + (c * s.h * s.w) + (h * s.w) + w;
}

/// Perform a convolution on the inputs in NCHW format.
__kernel void oclconvolutionK(__global float *dest,
                              __global float *src,
                              __global float *filter,
                              __global float *bias,
                              unsigned filterSize,
                              unsigned stride,
                              unsigned pad,
                              NCHW destDim, NCHW srcDim,
                              NCHW filterDim) {
  // dest x
  size_t ax = get_global_id(0);
  // dest y
  size_t ay = get_global_id(1);
  // dest channel
  size_t d = get_global_id(2);

  typedef int ssize_t;
  ssize_t x = -(ssize_t)pad + ax * stride;
  ssize_t y = -(ssize_t)pad + ay * stride;

  // For each input in the batch:
  for (size_t n = 0; n < srcDim.n; n++) {
    // For each element in the convolution-filter:
    float sum = 0;
    for (size_t fx = 0; fx < filterSize; fx++) {
      for (size_t fy = 0; fy < filterSize; fy++) {
        ssize_t ox = x + fx;
        ssize_t oy = y + fy;

        // Ignore index access below zero (this is due to padding).
        if (ox < 0 || oy < 0 || ox >= (ssize_t)srcDim.h ||
            oy >= (ssize_t)srcDim.w) {
          continue;
        }

        for (size_t fd = 0; fd < srcDim.c; fd++) {
          sum += filter[getNCHWIndex(filterDim, d, fd, fx, fy)] *
                 src[getNCHWIndex(srcDim, n, fd, (size_t)ox, (size_t)oy)];
        }
      }
    }

    sum += bias[d];
    dest[getNCHWIndex(destDim, n, d, ax, ay)] = sum;
  } // N
}

I checked that my kernel generates correct results.

But I get rather different results from the libDNN-generated kernels when the filter (i.e. kernel) size is > 1.
I guess I'm doing something wrong.

So I'd like to check with you whether, at least at the logical level, the kernels generated by libDNN compute mathematically the same convolution, or whether you perhaps use a slightly different formula?

OpenCL version?

Hi Fabian, I couldn't figure out which version of OpenCL libDNN supports. Is it 1.1 or 2.x?

Thanks!

PS: I hope the AMD employees @fsword73 and @dagamayank can chip in some optimized code to make libDNN a fast replacement for cuDNN.

Extended goals

This repository started out targeting convolution and its immediate context.
I'm wondering whether we could extend this goal to a multi-stakeholder extended API for DNNs.
This requires introducing other common accelerated kernels/ops into the API (probably starting by porting those in the OpenCL Caffe branch). I've opened this issue to collect feedback on the possible roadmap, because I think the OpenCL DNN scenario has historically been too fragmented and the de facto standard acceleration API design has been set by cuDNN. /cc @edgarriba @hughperkins @gongzg (Intel). I don't /cc AMD because I think its OpenCL Caffe fork has no more resources allocated and it seems more involved in the hccaffe effort.

Non-square padding issues

Quick question: has libdnn been tested with convolutions that use non-square padding?

It works fine with square padding for me, but has some issues with a non-square one.

It could be, of course, that I'm doing something wrong, but before I dive into a deep debugging session, I thought I'd check whether it is known to work at all.
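
For clarity, by non-square padding I mean independent pad values for height and width; the output size I expect follows the usual formula (a plain sketch with my own names, dilation omitted):

#include <cstddef>

// Output spatial size for a convolution with independent height/width
// padding, stride and kernel size.
struct Dims2 { std::size_t h, w; };

inline Dims2 convOutputDims(std::size_t inH, std::size_t inW,
                            std::size_t kernelH, std::size_t kernelW,
                            std::size_t padH, std::size_t padW,
                            std::size_t strideH, std::size_t strideW) {
  Dims2 out;
  out.h = (inH + 2 * padH - kernelH) / strideH + 1;
  out.w = (inW + 2 * padW - kernelW) / strideW + 1;
  return out;
}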

Tuner

How can I use the libdnn tuner? Is there some client?

Status of libdnn as of April 2018

Your library is pretty cool, but it looks like it has not been updated for a long time.

At the same time, the version of libdnn in your Caffe fork seems to be better maintained and has even gained some new features, like BLAS routine generators.

Could you provide some insight into your plans regarding the standalone libdnn, or libdnn in general?

Specifically, it would be nice if you could answer some of the following questions:

  • Do you plan to update the standalone libdnn, e.g. from the version in your Caffe fork?

  • What is the status of the BLAS support in the Caffe version of libdnn? How does it compare to something like clBLAS, CLBlast, or the CUDA counterparts of those?

  • Could you provide a brief description of the algorithms you use when producing optimized fused convolution (and other) kernels, and explain how/why they are better or faster than e.g. im2col-based approaches or other well-known convolution implementations, either in terms of performance or memory consumption? The documentation is pretty sparse at the moment. If they are based on any specific papers or well-known approaches, it would be nice if you could provide references. (For reference, a minimal im2col sketch is included after this list.)

  • How does libdnn compare, in terms of convolution performance, to the current versions of cuDNN and other well-known implementations? In the past you reported it was very fast, often faster than competitors. Is that still the case, or have recent advances made other implementations faster?

  • Do you plan to add any interesting new features or improvements? If so, could you describe them?
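
For context, this is the kind of im2col-based approach I mean above: input patches are unrolled into a matrix so the convolution becomes a single matrix multiplication. A minimal single-channel sketch (my own code, not libdnn's):

#include <cstddef>

// Minimal im2col for one input channel: each column of `col` holds one
// kernelH x kernelW patch, so the convolution reduces to a GEMM of
// [outChannels x (kernelH*kernelW)] * [(kernelH*kernelW) x (outH*outW)].
void im2col(const float *in, std::size_t inH, std::size_t inW,
            std::size_t kernelH, std::size_t kernelW,
            std::size_t pad, std::size_t stride, float *col) {
  std::size_t outH = (inH + 2 * pad - kernelH) / stride + 1;
  std::size_t outW = (inW + 2 * pad - kernelW) / stride + 1;
  for (std::size_t ky = 0; ky < kernelH; ++ky)
    for (std::size_t kx = 0; kx < kernelW; ++kx)
      for (std::size_t oy = 0; oy < outH; ++oy)
        for (std::size_t ox = 0; ox < outW; ++ox) {
          long iy = (long)(oy * stride + ky) - (long)pad;
          long ix = (long)(ox * stride + kx) - (long)pad;
          // Zero-fill positions that fall into the padding region.
          float v = (iy < 0 || ix < 0 || iy >= (long)inH || ix >= (long)inW)
                        ? 0.0f
                        : in[(std::size_t)iy * inW + (std::size_t)ix];
          col[((ky * kernelW + kx) * outH + oy) * outW + ox] = v;
        }
}

The extra col buffer is exactly the memory overhead I have in mind when asking about memory consumption compared to fused kernels.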

Thanks!

Add tests

Should we add some tests? Does anyone know whether Travis CI supports OpenCL, CUDA, ...?
