cltune's Issues

clGetKernelWorkGroupInfo does not tell you the size of the output variable when passed nullptr and size=0

This bug affects clpp11.h in Kernel::LocalMemUsage. The first call to clGetKernelWorkGroupInfo is intended to query the number of bytes required for the result variable, even though the result is always a size_t anyway. This works on POCL and on NVIDIA's OpenCL platform on Ubuntu, but not on Intel's CPU runtime (I haven't tested the GPU runtime). On Intel's CPU runtime on Ubuntu, the returned size is 0, causing the second call to clGetKernelWorkGroupInfo to return CL_INVALID_VALUE. As far as I can see, the size should be hard-coded to sizeof(cl_ulong). Nowhere in the OpenCL 1.2 or 2.0 documentation for clGetKernelWorkGroupInfo does it say that the call returns the required size of the output variable; instead, the docs list the return types depending on the cl_kernel_work_group_info value passed as the param_name parameter.

Removing the first call to clGetKernelWorkGroupInfo and hard-coding bytes = sizeof(cl_ulong) works for me when compiling for 64-bit and testing on the POCL, NVIDIA, and Intel CPU runtimes on Ubuntu. I still need to test 32-bit and check the other places where clGetKernelWorkGroupInfo is used before I submit a pull request with this fix.
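
A minimal sketch of the proposed fix, written here as a free function for illustration (the real change would go inside Kernel::LocalMemUsage in clpp11.h):

#include <CL/cl.h>

// Queries CL_KERNEL_LOCAL_MEM_SIZE without the initial size query: the OpenCL
// documentation specifies the result type as cl_ulong, so the required size
// is known at compile time.
cl_ulong GetLocalMemUsage(cl_kernel kernel, cl_device_id device) {
  auto result = cl_ulong{0};
  const auto status = clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_LOCAL_MEM_SIZE,
                                               sizeof(cl_ulong), &result, nullptr);
  return (status == CL_SUCCESS) ? result : 0;  // error handling simplified
}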

Does not compile on embedded GPU

Hello,

Running "sample_gemm" with CLTune produces the following errors:


[----------] Testing kernel gemm_fast
device compiler error/warning: (277:0) : error : require any typed expression except arrays, structures containing arrays, sampler types, and structures containing sampler types
(293:0) : error : require any typed expression except arrays, structures containing arrays, sampler types, and structures containing sampler types
(321:0) : error : require a matching typed expression
(388:0) : error : require a struct/union or vector typed expression
(389:0) : error : require a struct/union or vector typed expression
(494:0) : error : function: 'LocalToPrivateA' hasn't the corresponding declaration
(502:0) : error : function: 'LocalToPrivateB' hasn't the corresponding declaration
(509:0) : error : function: 'MultiplyAccumulate' hasn't the corresponding declaration
(520:0) : error : function: 'StoreResults' hasn't the corresponding declaration

[   FAILED ] Kernel gemm_fast failed
[   FAILED ]   catched exception: device compiler error/warning occurred ^^

[   FAILED ] gemm_fast;      0.0 ms;   MWG 32;  NWG 128;   KWG 32; MDIMC 16; NDIMC 16; MDIMA 16; NDIMB 32;    KWI 2;    VWM 2;    VWN 2;   STRM 0;   STRN 0;     SA 1;     SB 1;PRECISION 32;

Can you indicate what might be the root cause of this?

Thank you.

crash on my Mac

./sample_conv_simple

[==========] Initializing on platform 0 device 0
[ INFO ] Device vendor: 'Apple'
[ INFO ] Device name: 'Intel(R) Core(TM) i9-8950HK CPU @ 2.90GHz'
[ INFO ] Device extra info: ''
[ INFO ] Platform version: 'OpenCL 1.2 (Oct 29 2018 21:43:16)'

[----------] Testing kernel conv
[ VERBOSE ] Computing the permutations of all parameters
[1] 78238 segmentation fault ./sample_conv_simple

GEMM on input sizes that are not a power of 2

Hello everyone,

Performing GEMM on input sizes that are not a power of 2 (e.g., matrices of size 2000x2000) leads to errors:

[==========] Initializing on platform 0 device 0
[==========] Device name: 'Intel(R) Xeon(R) CPU E5-1620 v2 @ 3.70GHz' (OpenCL 1.2 AMD-APP (1912.5))

[----------] Testing reference gemm_reference
[ RUN      ] Running gemm_reference
[       OK ] Completed gemm_reference (10915.9 ms) - 1 out of 1

[----------] Testing kernel gemm_fast
[ RUN      ] Running gemm_fast
[   FAILED ] Kernel gemm_fast failed
[   FAILED ]   catched exception: Internal OpenCL error: -54
[  WARNING ] Results differ: L2 norm is 1.90e+08
[   FAILED ] gemm_fast;      0.0 ms;  MWG 128;  NWG 128;   KWG 32; MDIMC 32;  NDIMC 8; MDIMA 32; NDIMB 16;    KWI 8;    VWM 1;    VWN 1;   STRM 1;   STRN 0;     SA 1;     SB 1;PRECISION 32;
[ RUN      ] Running gemm_fast
[   FAILED ] Kernel gemm_fast failed
[   FAILED ]   catched exception: Internal OpenCL error: -54
[  WARNING ] Results differ: L2 norm is 1.90e+08
[   FAILED ] gemm_fast;      0.0 ms;  MWG 128;   NWG 64;   KWG 32; MDIMC 32; NDIMC 16; MDIMA 32; NDIMB 16;    KWI 8;    VWM 1;    VWN 4;   STRM 1;   STRN 0;     SA 1;     SB 1;PRECISION 32;
[ RUN      ] Running gemm_fast
[   FAILED ] Kernel gemm_fast failed
[   FAILED ]   catched exception: Internal OpenCL error: -54
[  WARNING ] Results differ: L2 norm is 1.90e+08
[   FAILED ] gemm_fast;      0.0 ms;  MWG 128;   NWG 64;   KWG 32; MDIMC 32; NDIMC 16; MDIMA 32; NDIMB 32;    KWI 8;    VWM 1;    VWN 2;   STRM 1;   STRN 0;     SA 0;     SB 1;PRECISION 32;

Does anyone know how to fix this (without using approaches such as padding)?
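
For reference, OpenCL error -54 is CL_INVALID_WORK_GROUP_SIZE. A short worked example of why such configurations can fail, assuming gemm_fast requires the matrix dimensions to be multiples of the tile-size parameters (an assumption on my part, suggested by the divisibility constraints in the sample): with MWG 128 and a matrix dimension of 2000, we have 2000 = 15 * 128 + 80, so 2000 is not a multiple of 128 and the work decomposition breaks; a power-of-2 size such as 2048 = 16 * 128 divides evenly.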

setting global size

I am trying to use CLTune to tune the XgemmDirect kernel, which is part of the CLBlast library.
In CLBlast, the global size is set to ((1 + ((kSizeM - 1) / WGD)) * WGD * MDIMCD) / WGD in dimension 1 and to ((1 + ((kSizeN - 1) / WGD)) * WGD * NDIMCD) / WGD in dimension 2. Is it possible to set the same global size in CLTune?
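
Note that the CLBlast expression is Ceil(kSizeM, WGD) * MDIMCD / WGD: the matrix size rounded up to a multiple of WGD, then scaled by MDIMCD / WGD. A sketch of one way to express this with CLTune's size modifiers, assuming kSizeM and kSizeN are already multiples of WGD so that the rounding is a no-op (a fragment that assumes a tuner and the size variables are already set up; the file name is illustrative and the parameter names follow CLBlast):

// Base global size is the matrix size; CLTune scales it per permutation so
// that global = kSizeM * MDIMCD / WGD, matching the CLBlast expression above.
auto id = tuner.AddKernel({"xgemm_direct.opencl"}, "XgemmDirect",
                          {kSizeM, kSizeN}, {1, 1});
tuner.MulGlobalSize(id, {"MDIMCD", "NDIMCD"});
tuner.DivGlobalSize(id, {"WGD", "WGD"});
tuner.MulLocalSize(id, {"MDIMCD", "NDIMCD"});

As far as I can tell, these modifiers only multiply and divide by parameter values, so for sizes that are not multiples of WGD the rounding itself cannot be expressed and the base size would have to be padded up front.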

[ERROR] Entries for a single kernel with multiple argument values

Hi, I used CLTune to tune a GPU.
After running the tuners and downloading the JSON files from the device, I ran the script scripts/database/database. Then I saw the following error:

...
[database] Saving database to './scripts/database/database.json'
[database] Calculating the best results per device/kernel...
[database] Calculating the default values...
[ERROR] Entries for a single kernel with multiple argument values: ARM;GPU;32;xgemm;Xgemm

Any advice?

Mismatch between README and example programs

README.md states that ./sample_conv X Y will run the convolution sample on device X of platform Y. However, the sample actually hard-codes the platform (it always uses platform 0). I'm working on updated code in a separate fork and am happy to submit a PR when it's finished.
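
A minimal sketch of the kind of change the fork could make, assuming the argument order documented in the README (device X, then platform Y); the parsing details here are illustrative:

#include <cstdlib>
#include "cltune.h"

int main(int argc, char* argv[]) {
  // Read the device and platform IDs from the command line instead of
  // hard-coding platform 0; both default to 0 when omitted.
  const auto device_id   = (argc >= 2) ? std::strtoul(argv[1], nullptr, 10) : 0ul;
  const auto platform_id = (argc >= 3) ? std::strtoul(argv[2], nullptr, 10) : 0ul;
  auto tuner = cltune::Tuner(platform_id, device_id);
  // ... the rest of the convolution sample is unchanged ...
  return 0;
}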

Hard-coded relative path in test/tuner.cc

Hello,

I'm writing a Gentoo package for CLTune: sci-libs/cltune

The Gentoo package manager (Portage) can run unit tests automatically after the build and before installation.
A hard-coded relative path at test/tuner.cc#L46 breaks the tests, as it assumes that the binary is run from a subdirectory of the source root.

One option would be to pass the source or samples base path as an argument and fall back to ../samples/ if omitted.
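
A minimal sketch of that option (the function name and usage are hypothetical):

#include <string>

// Returns the samples base path: the first command-line argument when given,
// otherwise the current hard-coded relative default.
std::string SamplesPath(int argc, char* argv[]) {
  return (argc >= 2) ? std::string{argv[1]} : std::string{"../samples/"};
}

// e.g. in the test: const auto kernel_file = SamplesPath(argc, argv) + "...";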

What do you think?

half.h

Hi! Your half-float conversion is broken, because it maps larger numbers (>65536.0) to infinity. So you can't really do any HDR with them.
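
For reference, the largest finite IEEE 754 half-precision value is (2 - 2^-10) * 2^15 = 65504. A conversion that saturates would clamp larger inputs to 65504 rather than producing infinity; a conversion that rounds to nearest maps inputs of roughly 65520 and above to infinity.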

Using CLTune for optimizing saxpy

I am trying to use CLTune to tune an OpenCL saxpy implementation. The kernel is as follows:

// Note: size_t is not a valid kernel-argument type in OpenCL, so N is an int.
__kernel void saxpy( const          int      N,
                     const          float    a,
                     const __global float*   x,
                           __global float*   y
                   )
{
  for( int w = 0; w < WPT; ++w ) {
    const int id = w * get_global_size(0)
                     + get_global_id(0);

    y[ id ] += a * x[ id ];
  }
}

It is a simplified version of the CLBlast saxpy kernel, with the OpenCL vector data types removed and the input data type fixed to float.
For this kernel, we have the following tuning parameters: (i) the work per thread (WPT) and (ii) the local size (LS).

My suggestion for a corresponding CLTune program is as follows:

// CLTune program for tuning saxpy
#include <chrono>
#include <random>
#include <string>
#include <vector>

#include "cltune.h"

int main()
{
  // Set kernel path and input size
  const auto   saxpy = std::string{ /* path to kernel */ };
  const auto   N     = size_t{ /* input size */ };

  // Declare input data
  float a;
  auto  vec_x = std::vector<float>( N );
  auto  vec_y = std::vector<float>( N );

  // Initialize random number generator
  const auto random_seed = std::chrono::system_clock::now().time_since_epoch().count();
  std::default_random_engine generator(static_cast<unsigned int>(random_seed));
  std::uniform_real_distribution<float> distribution(-2.0f, 2.0f);

  // Fill input data
  a = distribution(generator);
  for (auto &item: vec_x) { item = distribution(generator); }
  for (auto &item: vec_y) { item = distribution(generator); }

  // Initializes the tuner (platform 1, device 0)
  cltune::Tuner tuner( 1, 0 );
  tuner.UseAnnealing( 1.0f / 2048.0f, 4.0 );
  auto id = tuner.AddKernel( { saxpy }, "saxpy", { N }, { 1 } );  // AddKernel takes a list of kernel files
  
  // Set tuning parameters (pseudo-code ranges; see question 1 below)
  tuner.AddParameter(id, "LS",  {1, ... , N} );
  tuner.AddParameter(id, "WPT", {1, ... , N} );

  // Sets constraints on the parameters (N must be captured for the lambdas)
  auto DividesN       = [N]( std::vector<size_t> v ){ return  N % v[0]         == 0; };
  auto DividesNDivWPT = [N]( std::vector<size_t> v ){ return (N / v[0]) % v[1] == 0; };

  tuner.AddConstraint(id, DividesN      , {"WPT"}       );
  tuner.AddConstraint(id, DividesNDivWPT, {"WPT", "LS"} );

  // Modifies the thread-sizes (both global and local) based on the parameters
  tuner.MulLocalSize(id, {"LS"} );
  tuner.DivGlobalSize(id, {"WPT" } );

  // Sets the function's arguments.
  tuner.AddArgumentScalar( static_cast<int>(N) );  // matches the int argument of the kernel
  tuner.AddArgumentScalar( a );
  tuner.AddArgumentInput(vec_x);
  tuner.AddArgumentOutput(vec_y);

  // Starts the tuner
  tuner.Tune();
  tuner.PrintToFile("output.csv");
}

Here, I have some difficulties:

  1. Does CLTune provide a convenient way to define large tuning-parameter ranges? In my suggested CLTune program above, I use a pseudo-code notation of the form 1, ..., N to represent the range of size_t values from 1 to N. (See the sketch after this list.)

  2. Can the best configuration that CLTune finally determines be accessed from within the C++ code, or do I have to read it from the file output.csv?

  3. Can someone confirm that the CLTune program above is correct and in a form that the CLTune developers would approve of for tuning the saxpy kernel?
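
Regarding question 1: if memory serves, AddParameter takes a literal list of values, so a common workaround (the CLBlast tuners do this, for example) is to enumerate a fixed candidate set such as powers of two and let the constraints prune the invalid combinations. A sketch, with candidate values chosen purely for illustration:

  // Powers of two up to a fixed bound instead of the full range 1, ..., N;
  // the DividesN / DividesNDivWPT constraints filter out invalid combinations.
  tuner.AddParameter(id, "LS",  {1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024} );
  tuner.AddParameter(id, "WPT", {1, 2, 4, 8, 16, 32, 64, 128} );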

Many thanks in advance.

Compiling sample_conv with Clover (Mesa OpenCL) gives errors

I am currently trying to build CLTune with Mesa OpenCL on an AMD Tonga platform.
The sample_conv_simple runs out of the box, but sample_conv gives errors:

Device compiler error/warning: <unknown>:0:0: in function conv void (i32, i32, float addrspace(1)*, float addrspace(2)*, float addrspace(1)*): unsupported call to function LoadLocalPlusHalo
[   FAILED ] Kernel conv failed

When the keyword inline is removed from the *.opencl files, the code compiles.
I also noticed that sample_conv only runs if I run sample_conv_simple first.

./sample_conv
[==========] Initializing on platform 0 device 0
[==========] Device name: 'AMD TONGA (DRM 3.8.0 / 4.9.11-1-ARCH, LLVM 3.9.1)' (OpenCL 1.1 Mesa 17.0.0)
[----------] Testing reference conv_reference
[ RUN      ] Running conv_reference
[       OK ] Completed conv_reference (65.7 ms) - 1 out of 1
[----------] Testing kernel conv
[ RUN      ] Running conv
[       OK ] Completed conv (86.1 ms) - 1 out of 22
[ RUN      ] Running conv
[       OK ] Completed conv (91.9 ms) - 2 out of 22
[ RUN      ] Running conv
[       OK ] Completed conv (67.3 ms) - 3 out of 22

machine learning pipeline

This is related to another discussion currently taking place here: jrprice/Oclgrind#109 (comment)

The idea is to emulate an OpenCL kernel using Oclgrind and use this to gather kernel-specific runtime information (think dataflow, variable lifetime), then feed this information into the ML pipeline to do more sophisticated transformations based on much more comprehensive, and better, information about the kernel's runtime behavior.

To pull this off, some kind of interface would need to be established between the kernel virtualization and the tuner components, even if that just means serializing kernel-specific data to a file on disk and using that for the ML pipeline.
