naibaf7 / libdnn
Greentea LibDNN - a universal convolution implementation supporting CUDA and OpenCL
License: Other
With #26 fixed on my side, I was able to run some benchmarks now. The libDNN-generated convolutions are about 3x-4x faster than my naive kernel described in #26, which is very nice! But they are still slower than my convolutions running on the CPU :-(
The CPU implementation is single-threaded, uses NHWC for the input, and uses the following filter layout:
filter = [depth/N, filter, filter, channel, N]
where N is 8. This is done to make access to the filter more cache-friendly. As far as I understand, the following TVM trick uses a similar approach: http://tvmlang.org/2018/01/16/opt-mali-gpu.html (see tiling and packing).
WDYT about this kind of layout optimization? Have you played with something like this? Do you think it may result in even faster convolution kernels?
BTW, I'm testing on a MacBook Pro 2017 using the AMD GPU.
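For reference, the packed [depth/N, filter, filter, channel, N] filter layout described above can be written as plain index arithmetic. This is a minimal sketch of that layout (the function name is made up for illustration; N = 8 as in the description, and depth is assumed to be a multiple of N):

```c
#include <stddef.h>

#define N 8 /* pack width, as described above */

/* Index into a filter stored as [depth/N, fh, fw, channel, N].
 * fsize = spatial filter size, chans = input channels,
 * d = output channel, (fh, fw) = filter position, c = input channel. */
static size_t packed_filter_index(size_t fsize, size_t chans,
                                  size_t d, size_t fh, size_t fw, size_t c) {
  size_t outer = d / N;   /* which pack of N output channels */
  size_t inner = d % N;   /* position within the pack */
  return (((outer * fsize + fh) * fsize + fw) * chans + c) * N + inner;
}
```

The point of the layout: N consecutive output channels end up adjacent in memory, so an inner loop over output channels within a pack walks a contiguous cache line instead of striding across the whole filter tensor.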
I use caffe-opencl with a Mali GPU, but it seems that libdnn does not support Mali.
Actually, I want to do some optimization in the OpenCL kernels for operations such as element-wise multiplication. You have done some memory optimization in the libdnn OpenCL kernels. But as far as I know, OpenCL memory on Mali just uses CL_MEM_ALLOC_HOST_PTR for CPU data.
Would you tell me the method libdnn uses for memory optimization, or point me to some resources about this?
Thank you.
Can you please provide CMake build instructions for Linux?
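In case it helps, a typical out-of-source CMake build on Linux would look something like this. This is only a sketch assuming the repository's top-level CMakeLists.txt; the actual configuration options (if any) may differ:

```shell
git clone https://github.com/naibaf7/libdnn.git
cd libdnn
mkdir build && cd build
cmake ..            # pass -D options here if the project defines any
make -j"$(nproc)"
```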
Just to collect some feedback from Matrix Multiplication Beyond Auto-Tuning: Rewrite-based GPU Code Generation. /cc @CNugteren since the benchmark on mobile involved CLBlast.
I'm opening this ticket to discuss ideas for integrating libdnn into tiny-cnn.
Currently, I implemented a small interface to get native OpenCL context from tiny-cnn:
https://github.com/edgarriba/tiny-cnn/blob/f4d9e1d4f45ad8ac46824b5c007f5507fe8925eb/tiny_cnn/core/session.h
Things I think are needed:
BTW, @naibaf7 @hughperkins note that we are planning to migrate tiny-cnn to an organization account and to rename the library itself, since it is now more a pure DNN library than just a CNN one. Maybe you are interested in getting more involved in the development. tiny-dnn/tiny-dnn#235
... this way it is easier to download and build
(albeit without the 'examples' or 'doc' directories...)
Just to be aware of https://arxiv.org/abs/1606.00094 /cc @moskewcz
I'm trying to use the LibDNN-generated kernels instead of my own naive kernel below:

typedef int ssize_t;

struct NCHW {
  size_t n;
  size_t c;
  size_t h;
  size_t w;
};

inline size_t getNCHWIndex(NCHW s, size_t n, size_t c, size_t h, size_t w) {
  return (n * s.c * s.h * s.w) + (c * s.h * s.w) + (h * s.w) + w;
}

/// Perform a convolution on the inputs in NCHW format.
__kernel void oclconvolutionK(__global float *dest,
                              __global float *src,
                              __global float *filter,
                              __global float *bias,
                              unsigned filterSize,
                              unsigned stride,
                              unsigned pad,
                              NCHW destDim, NCHW srcDim, NCHW filterDim) {
  // Destination x, y and channel.
  size_t ax = get_global_id(0);
  size_t ay = get_global_id(1);
  size_t d = get_global_id(2);

  // Top-left corner of the filter window in the (padded) input.
  ssize_t x = -(ssize_t)pad + (ssize_t)(ax * stride);
  ssize_t y = -(ssize_t)pad + (ssize_t)(ay * stride);

  // For each input in the batch:
  for (size_t n = 0; n < srcDim.n; n++) {
    float sum = 0;
    // For each element in the convolution filter:
    for (size_t fx = 0; fx < filterSize; fx++) {
      for (size_t fy = 0; fy < filterSize; fy++) {
        ssize_t ox = x + fx;
        ssize_t oy = y + fy;
        // Skip accesses outside the input; these positions are the
        // zero padding.
        if (ox < 0 || oy < 0 || ox >= (ssize_t)srcDim.h ||
            oy >= (ssize_t)srcDim.w) {
          continue;
        }
        // Accumulate over all input channels.
        for (size_t fd = 0; fd < srcDim.c; fd++) {
          sum += filter[getNCHWIndex(filterDim, d, fd, fx, fy)] *
                 src[getNCHWIndex(srcDim, n, fd, (size_t)ox, (size_t)oy)];
        }
      }
    }
    sum += bias[d];
    dest[getNCHWIndex(destDim, n, d, ax, ay)] = sum;
  } // n
}
I checked that my kernel generates correct results.
But I get rather different results from the libDNN-generated kernels when the filter (i.e. kernel) size is > 1.
I guess I'm doing something wrong.
So I'd like to check with you whether, at least at the logical level, the kernels generated by libDNN compute mathematically the same convolution, or whether you use a slightly different formula.
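For cross-checking results, here is a minimal host-side reference in plain C that computes the same NCHW convolution as the kernel above, restricted to batch size 1, stride 1, no padding, and no bias. It is a sketch for validating outputs, not libDNN code:

```c
#include <stddef.h>

/* Reference valid NCHW convolution:
 * dst[d][oy][ox] = sum over c, fy, fx of
 *   flt[d][c][fy][fx] * src[c][oy + fy][ox + fx]
 * src: C x H x W, flt: D x C x K x K, dst: D x (H-K+1) x (W-K+1). */
static void conv2d_nchw_ref(const float *src, size_t C, size_t H, size_t W,
                            const float *flt, size_t D, size_t K,
                            float *dst) {
  size_t OH = H - K + 1, OW = W - K + 1;
  for (size_t d = 0; d < D; d++)
    for (size_t oy = 0; oy < OH; oy++)
      for (size_t ox = 0; ox < OW; ox++) {
        float sum = 0.0f;
        for (size_t c = 0; c < C; c++)
          for (size_t fy = 0; fy < K; fy++)
            for (size_t fx = 0; fx < K; fx++)
              sum += flt[((d * C + c) * K + fy) * K + fx] *
                     src[(c * H + oy + fy) * W + ox + fx];
        dst[(d * OH + oy) * OW + ox] = sum;
      }
}
```

Running both this reference and the generated kernel on the same small random tensors makes it easy to see whether the mismatch is in the math or in the buffer layouts.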
Hi Fabian, I couldn't figure out which version of OpenCL libDNN supports. Is it 1.1 or 2.x?
Thanks!
PS: I hope AMD employees @fsword73 and @dagamayank can chip in some optimized code to make libDNN a fast replacement for cuDNN.
This repository started out targeting convolution and its immediate surroundings.
I'm wondering whether we could extend this goal into a multi-stakeholder extended API for DNNs.
This requires introducing other common accelerated kernels/ops into the API (probably starting by porting those in the OpenCL Caffe branch). I've opened this to collect feedback on a possible roadmap, because I think the OpenCL DNN scenario has historically been too fragmented, and the de facto standard acceleration API design has been set by cuDNN. /cc @edgarriba @hughperkins @gongzg (Intel). I don't /cc AMD because I think its OpenCL Caffe fork has no more resources allocated, and AMD seems more involved in the hcCaffe effort.
Quick question: Was libdnn tested with convolutions which use non-square paddings?
It works fine with square padding for me, but has some issues with a non-square one.
It could be, of course, that I'm doing something wrong. But before I dive into a deep debugging session, I thought I'd check whether it is known to work at all.
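For what it's worth, with separate pad_h and pad_w the usual formulas generalize per axis as below. This is a sketch of the standard convolution output-size arithmetic, not libdnn code:

```c
#include <stddef.h>

/* Standard output size for a padded, strided convolution along one axis:
 * out = (in + 2 * pad - filter) / stride + 1.
 * With non-square padding this is evaluated independently for the
 * height axis (pad_h) and the width axis (pad_w). */
static size_t conv_out_dim(size_t in, size_t pad, size_t filter,
                           size_t stride) {
  return (in + 2 * pad - filter) / stride + 1;
}
```

Inside a kernel, the window start offsets likewise become x = -pad_w + ax * stride and y = -pad_h + ay * stride; swapping pad_h and pad_w between axes is a classic bug that only shows up once the two paddings differ.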
Hello again. I ran into a problem when using libDNN instead of the caffe engine with https://github.com/fzliu/style-transfer.
With the caffe engine I got this.
(as planned)
With libDNN I got this.
(strange)
I also ran make runtest.
I have no idea what the reason for this glitch is.
How can I use the libdnn tuner? Is there some client?
Your library is pretty cool, but it looks like it has not been updated in a long time.
At the same time, the version of libdnn in your Caffe fork seems to be better maintained and has even gained some new features, like BLAS routine generators.
Could you provide some insight into your plans regarding the standalone libdnn, or libdnn in general?
Specifically, it would be nice if you could answer some of the following questions:
Do you plan to update the standalone libdnn, e.g. from the version in your Caffe fork?
What is the status of BLAS support in the Caffe version of libdnn? How does it compare to something like clBLAS, CLBlast, or their CUDA counterparts?
Could you provide a brief description of the algorithms you use to produce the optimized fused convolution (and other) kernels, and how/why they are better or faster than, e.g., im2col-based approaches or other well-known convolution implementations, in terms of performance or memory consumption? The documentation is pretty sparse at the moment. If the work is based on specific papers or well-known approaches, references would be appreciated.
How does libdnn compare, in terms of convolution performance, to current versions of cuDNN and other well-known implementations? In the past you reported it was very fast, often faster than competitors. Is that still the case, or have recent advances made other implementations faster?
Do you plan to add any new interesting features or improvements? If so, could you describe them?
Thanks!
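For background on the im2col-based approach mentioned above: it materializes every filter-sized input patch as a column of a scratch matrix so that the convolution becomes a single matrix multiply, at the cost of a (C*K*K) x (OH*OW) buffer. A minimal sketch of the classic technique (valid convolution, stride 1; not libdnn code):

```c
#include <stddef.h>

/* im2col for an NCHW image, valid convolution, stride 1.
 * src: C x H x W.  col: (C*K*K) rows by (OH*OW) columns, where each
 * column holds one K x K x C input patch.  Multiplying the
 * (D x C*K*K) filter matrix by col (a GEMM) then yields the
 * (D x OH*OW) convolution output. */
static void im2col(const float *src, size_t C, size_t H, size_t W, size_t K,
                   float *col) {
  size_t OH = H - K + 1, OW = W - K + 1;
  for (size_t c = 0; c < C; c++)
    for (size_t fy = 0; fy < K; fy++)
      for (size_t fx = 0; fx < K; fx++) {
        size_t row = (c * K + fy) * K + fx;
        for (size_t oy = 0; oy < OH; oy++)
          for (size_t ox = 0; ox < OW; ox++)
            col[row * OH * OW + oy * OW + ox] =
                src[(c * H + oy + fy) * W + ox + fx];
      }
}
```

Fused kernels avoid materializing col in global memory at all, which is where a memory-consumption advantage over im2col-plus-GEMM would come from.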
How do I run a convolution forward pass, input gradient, and weights gradient?
Assume I have a cl_mem for the input, weights, gradOutput, output, gradInput, and gradWeights, an OpenCL queue, probably an OpenCL context, and metadata describing the tensor dimensions. What do I do next?
Should we add some tests? Does anyone know whether Travis CI supports OpenCL, CUDA, ...?