
Comments (9)

CNugteren commented on July 17, 2024

Interesting. I'll take a look at oclgrind to get a better understanding of what you want. I'll come back to you soon. Perhaps the Collective Knowledge framework might also be of some help: https://github.com/ctuning/ck


UniqueFool commented on July 17, 2024

wow, I wasn't even aware that something like ck existed ... gotta do some reading now.


CNugteren commented on July 17, 2024

I found some time and I think I understand your idea. By the way, in your first post you meant "emulate an OpenCL device", not "emulate an OpenCL kernel", right?

I am not sure if CLTune is what you are looking for though. What is your use-case exactly? I can interpret your goal in two ways:

  1. You are trying to machine-learn a kernel optimiser/tuner based on previous kernels it has seen. So you'll need a lot of kernels and some static and run-time information (that's where oclgrind comes into play). Then you can learn what optimisations are a good choice given static and run-time information of a previously unseen kernel.
  2. You are trying to optimise a single kernel but the optimisation-space is too vast. In that case you'll hope that some static and run-time information (oclgrind again here) can help you guide a machine-learned model faster towards a good (or the best) solution.

In the first case CLTune is really not what you are looking for: it can only perform 'optimisations' that are pre-programmed into a kernel using pre-processor variables (see the sketch below). CLTune is a tool to help you explore those options, optionally using machine learning to guide you faster through the decision space. Better to hook this up in the compiler itself, I would say.

For the second case it might be a better fit, but I am not so sure this extra information will be helpful for training a model. With the extra data we might also need to look at larger models that can capture this new information. Keep in mind that I am currently not even using the static data that is readily available (number of instructions of some sort, number of branches, vector width, architecture details); I am only using the current user-defined 'configuration'. So perhaps it is better to start there, instead of using run-time information from device emulation?
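
To make the pre-processor mechanism concrete, here is a rough sketch in the style of CLTune's samples. The kernel, file name, parameter names and sizes are all made up for illustration, and the exact API signatures may differ slightly from what is shown; see the samples shipped with CLTune for the authoritative version.

```cpp
// Sketch only: a hypothetical kernel file "vector_scale.opencl" whose tunable
// parameters (WPT, GROUP_SIZE) are plain pre-processor symbols. CLTune injects
// a value for each of them when it compiles a candidate configuration:
//
//   __kernel void vector_scale(__global const float* x, __global float* y,
//                              const float alpha, const int n) {
//     const int base = get_global_id(0) * WPT;       // WPT = work per thread
//     for (int w = 0; w < WPT; ++w) {
//       const int i = base + w;
//       if (i < n) { y[i] = alpha * x[i]; }
//     }
//   }

#include <vector>
#include <cltune.h>

int main() {
  const auto n = size_t{1024 * 1024};
  std::vector<float> x(n, 1.0f), y(n, 0.0f);

  cltune::Tuner tuner(0, 0);  // platform 0, device 0
  const auto id = tuner.AddKernel({"vector_scale.opencl"}, "vector_scale", {n}, {1});

  // The tunable parameters and the values the tuner is allowed to try:
  tuner.AddParameter(id, "WPT", {1, 2, 4, 8});
  tuner.AddParameter(id, "GROUP_SIZE", {64, 128, 256});

  // How each parameter affects the thread configuration:
  tuner.DivGlobalSize(id, {"WPT"});        // fewer threads when each does more work
  tuner.MulLocalSize(id, {"GROUP_SIZE"});  // work-group size follows GROUP_SIZE

  // Kernel arguments; outputs are verified against a reference configuration.
  tuner.AddArgumentInput(x);
  tuner.AddArgumentOutput(y);
  tuner.AddArgumentScalar(2.0f);
  tuner.AddArgumentScalar(static_cast<int>(n));

  tuner.Tune();           // compiles and benchmarks every valid configuration
  tuner.PrintToScreen();  // reports the best configuration found
  return 0;
}
```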


UniqueFool commented on July 17, 2024

Yes, I meant "device" like you said. Your 2) describes the idea pretty well, i.e. it has more to do with kernel-specific runtime information and using that to come up with / guide different transformations.

I will have to do some reading to see if this is really feasible, for all the reasons you mentioned. However, I did reference a few papers that basically describe doing this sort of thing.

So it really is more about narrowing-down and guiding the search space based on kernel-specific information that can be gathered via emulated execution.
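
As a purely hypothetical illustration of what such kernel-specific information could look like as model input, alongside the configuration that CLTune already uses: the feature names below are invented, and the counts would come from an emulated run rather than from any real API.

```cpp
// Sketch: the model input would grow from "configuration only" to
// "configuration + kernel-specific features". The static features are the
// kind mentioned above (instruction count, branch count, vector width); the
// run-time features are the kind an emulated execution could report.
#include <vector>

struct TuningFeatures {
  // What CLTune currently uses: the user-defined parameter values.
  std::vector<double> configuration;   // e.g. {WPT, GROUP_SIZE}

  // Static information, available without running anything.
  double instruction_count = 0;
  double branch_count = 0;
  double vector_width = 0;

  // Run-time information gathered by executing the kernel on an emulated device.
  double global_memory_loads = 0;
  double global_memory_stores = 0;
  double barrier_count = 0;
};

// Flatten into the input vector of whatever model is used
// (neural network, nearest neighbour, ...).
std::vector<double> ToModelInput(const TuningFeatures& f) {
  std::vector<double> input = f.configuration;
  input.insert(input.end(), {f.instruction_count, f.branch_count, f.vector_width,
                             f.global_memory_loads, f.global_memory_stores,
                             f.barrier_count});
  return input;
}
```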


CNugteren commented on July 17, 2024

OK! Which papers are those? I'm interested as well to see what's possible.


UniqueFool commented on July 17, 2024

I basically worked through the referenced paper and its references section: http://arxiv.org/pdf/1506.00842v1.pdf

We have developed and validated a machine learning based auto-tuning framework for OpenCL. The framework measures the performance of several candidate implementations from a parameter configuration space and uses this result to build a artificial neural network, which works as a performance model. This model is then used to find interesting parts of the configuration space, which are explored exhaustively to find good candidate implementations. Our neural network model achieves a mean relative error as low as 6.1% for three different benchmarks executed on three different devices, a Intel i7 3770 CPU, an Nvidia K40 GPU and a AMD Radeon HD 7970. The autotuner is able to find good configurations, at best only 1.3% slower than the best configuration.

Future work includes enhancing the performance of the model, in particular with regard to invalid configurations, evaluating the model on novel hardware architectures, beyond just CPUs and GPUs, and integrating problem parameters into the performance model. Incorporating advanced new features specific to a given architecture [39] will remain challenging. However, studying multi-GPU systems [40] and looking into multi-variate analysis [41] may also be interesting avenues of inquiry.
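
For what it is worth, the loop the abstract describes looks roughly like the sketch below. This is only an illustration: the paper trains an artificial neural network as the performance model, while here a trivial nearest-neighbour regressor and a stubbed benchmark stand in for it, and all names and numbers are invented.

```cpp
#include <algorithm>
#include <cstdio>
#include <limits>
#include <random>
#include <vector>

using Configuration = std::vector<double>;  // e.g. {WPT, GROUP_SIZE, ...}

// Stand-in performance model: predicts the runtime of the nearest
// already-measured configuration (the paper trains a neural network instead).
class PerformanceModel {
 public:
  void Fit(std::vector<Configuration> configs, std::vector<double> runtimes) {
    configs_ = std::move(configs);
    runtimes_ = std::move(runtimes);
  }
  double Predict(const Configuration& c) const {
    double best_dist = std::numeric_limits<double>::max();
    double best_runtime = 0.0;
    for (size_t i = 0; i < configs_.size(); ++i) {
      double dist = 0.0;
      for (size_t j = 0; j < c.size(); ++j) {
        dist += (c[j] - configs_[i][j]) * (c[j] - configs_[i][j]);
      }
      if (dist < best_dist) { best_dist = dist; best_runtime = runtimes_[i]; }
    }
    return best_runtime;
  }
 private:
  std::vector<Configuration> configs_;
  std::vector<double> runtimes_;
};

// Stub: in reality this compiles the kernel with the given parameter values
// and measures its runtime on the device (what a CLTune-style evaluation does).
double BenchmarkOnDevice(const Configuration& c) {
  return c[0] * 0.5 + c[1] * 0.01;  // fake runtime in ms
}

std::vector<Configuration> MachineLearnedSearch(
    std::vector<Configuration> full_space,
    size_t num_training_samples, size_t top_k) {
  // 1. Benchmark a small random sample of the configuration space.
  std::shuffle(full_space.begin(), full_space.end(), std::mt19937{42});
  const size_t n_train = std::min(num_training_samples, full_space.size());
  std::vector<Configuration> sampled(full_space.begin(), full_space.begin() + n_train);
  std::vector<double> measured;
  for (const auto& c : sampled) measured.push_back(BenchmarkOnDevice(c));

  // 2. Train the performance model on the measured samples.
  PerformanceModel model;
  model.Fit(sampled, measured);

  // 3. Rank the entire space by predicted runtime.
  std::sort(full_space.begin(), full_space.end(),
            [&](const Configuration& a, const Configuration& b) {
              return model.Predict(a) < model.Predict(b);
            });

  // 4. Return the top_k predicted-fastest configurations; these are the
  //    "interesting parts" that are then explored exhaustively on the device.
  full_space.resize(std::min(top_k, full_space.size()));
  return full_space;
}

int main() {
  // A tiny hypothetical space: WPT in {1,2,4,8}, GROUP_SIZE in {64,128,256}.
  std::vector<Configuration> space;
  for (double wpt : {1.0, 2.0, 4.0, 8.0})
    for (double gs : {64.0, 128.0, 256.0}) space.push_back({wpt, gs});

  for (const auto& c : MachineLearnedSearch(space, /*num_training_samples=*/4, /*top_k=*/3))
    std::printf("candidate: WPT=%.0f GROUP_SIZE=%.0f\n", c[0], c[1]);
  return 0;
}
```

The point of the scheme is that the model is cheap to query, so the whole configuration space can be ranked and only the predicted-fastest candidates are actually benchmarked on the device.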


CNugteren commented on July 17, 2024

This is exactly what CLTune is also doing. I mainly wrote CLTune because the other paper's authors did not make any tool available. However, I did not evaluate the machine learning part much; the paper actually doesn't include it at all (http://www.cedricnugteren.nl/downloads/Nugteren2015a.pdf). Perhaps someone should do some more experiments using CLTune and a small neural network?

Or are you perhaps referring to the future work part of the paper:

and integrating problem parameters into the performance model

I am not sure exactly what the authors mean by this, but it could be that this is what you are referring to? In that case I would also contact the authors and see if they haven't already done this.


bhack commented on July 17, 2024

A really interesting thread. /cc @hughperkins


bhack commented on July 17, 2024

See also http://chriscummins.cc/pub/2016-adapt.pdf

