Comments (10)

kpouget commented on September 6, 2024

Hello,

> My first results are that MATLAB is 20 times faster at executing the example addition operation in the README. That can't be right, can it?

You're comparing GPU and CPU code, so the C/OCL/Ruby vs MATLAB* comparison isn't fair!

You can't use such trivial code to compare CPU and GPU processors. What happens internally to run code on the GPU is far too complicated for that: kernel code compilation, kernel code and memory buffer transfers, remote (i.e., on the GPU) execution, ...

You need to compare OpenCL with MATLAB/GPU code, or write more complex code that can exploit the hundreds of cores of your GPU, to see the results you're looking for!

(*I hope I didn't read your MATLAB code wrong, but I don't see any instruction related to the GPU.)

----- Original Message -----

From: "KC Erb"
To: "Nanosim-LIG/opencl-ruby"
Sent: Friday, December 5, 2014 1:56:47 AM
Subject: [opencl-ruby] Performance Benchmarks (#3)

Hi again,

I wanted to do a quick comparison between MATLAB and openCL. My first results are that MATLAB is 20 times faster at executing the example addition operation in the README. That can't be right, can it?

The Details

1. My openCL code gist 

I have a late 2013 Macbook Pro, so it has 3 devices

* 2.3 GHz Intel Core i7 
* Intel Iris Pro 
* NVIDIA GeForce GT 750M 2048 MB 

Here are the benchmark results:

             user     system      total        real
i7      22.650000   2.110000  24.760000 ( 22.617686)
IrisPro 21.610000   0.560000  22.170000 ( 22.202700)
GeForce 21.830000   0.570000  22.400000 ( 22.734518)

2. My MATLAB code gist

The average running time on MATLAB was 1.26 seconds.

My Guesses

My first guess is that since I'm new to both C and openCL I'm failing to accurately translate that openCL code to MATLAB.

My second guess is that NArray is less efficient than MATLAB's implementation.

Suggestions?


from opencl-ruby.

Kerilk commented on September 6, 2024

Hello,

Just to complete Kevin's answer:

The only timing you want to look at (at first) is the timing of the
kernel run, in this case:

event = prog.addition(queue, [n_times], float, b_in, b_out, :local_work_size => [128])

which you can do after the queue finishes:
queue.finish
puts "#{(event.profiling_command_end - event.profiling_command_start)} ns"

To be fair, in MATLAB you should only time the computation (not the
random array initialization):

n_times = 2^26;
a_in = rand(1,n_times);
t1 = cputime;
a_in(:) = a_in(:) + 5.0;
cputime - t1
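The same separation of setup time from compute time can be sketched on the Ruby side. This is only an illustration using the standard library's Benchmark module on a plain Ruby array, not the OpenCL benchmark itself; `N_TIMES` and `time_phases` are hypothetical names, and the size is kept small so the sketch runs quickly.

```ruby
require 'benchmark'

# Hypothetical problem size, much smaller than the 2**26 used in the thread.
N_TIMES = 2**16

# Time the two phases separately.
# Returns [setup_seconds, compute_seconds, result].
def time_phases(n)
  data = nil
  setup = Benchmark.realtime { data = Array.new(n) { rand } }
  result = nil
  compute = Benchmark.realtime { result = data.map { |x| x + 5.0 } }
  [setup, compute, result]
end

setup_s, compute_s, result = time_phases(N_TIMES)
puts format('setup:   %.6f s', setup_s)
puts format('compute: %.6f s', compute_s)
```

Only the second figure is comparable to a kernel timing; the first one is the cost of building the input.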

Brice

PS:
Ruby gives you tools to avoid copy pasting. Find attached a slightly
reworked benchmark sample (though it is still giving you unfair results)


KCErb commented on September 6, 2024

Thanks for the feedback, all. I'll implement the things you suggested and get back with results. I can't do the GPU part until I get up to the school today, since I don't have a MATLAB GPU license but my institution does.

Before I get up there though, I have some follow up questions:

> Find attached a slightly reworked benchmark sample

Thanks for this Brice, but I don't see an attachment. I'm viewing this conversation on GitHub and I don't think it supports attachments like that. I checked the email version as well and still no attachments. Some alternatives would be to email me directly, or copy and paste your code into http://gist.github.com and share the link with me either via email or by responding to this comment.

> you're comparing GPU and CPU code, you're not fair with C/OCL/Ruby vs MATLAB*!

Thanks Kevin, I'll definitely get back to you with GPU results ASAP (along with Brice's suggestions of where to put my timing events) but I do have a question about comparing CPU to CPU.

I thought that with my code, my first run is on the CPU.

If it is on the CPU, and the CPU doesn't take the extra preparation work that the GPU does (as MATLAB seems to demonstrate), then why does my openCL CPU code take 22 seconds to run whereas the MATLAB from start to finish is just a couple of seconds?

I think I can understand what Brice means by saying I should compare the kernel time (system) to MATLAB, instead of total user time. It also makes sense to me that trivial calculations like this are not likely to really give a fair comparison. Can you help give me an idea of a sample that would? For example, when I do 2^16 as my vector length, it looks like the OCL code is comparable to (and maybe a little quicker than) matlab, but it's a tough call since they both just plain do it so fast!

I only ramped up the vector length to 2^26 so that I could actually start using the CPU a bit on the MATLAB side.

Would it perhaps be a closer comparison to do a vector of 2^20 length 1,000,000 times? I'm just trying to look for ways to demonstrate to someone who's never heard of openCL that it can put jobs on CPUs or GPUs and it will maximize use of the device more or less automatically (because of openCL's concept of work items).

Perhaps my question just exposes my ignorance :)

Perhaps a little context would be helpful :)

I'm a graduate student in Physics with an emphasis on Magnetic Resonance Imaging (MRI). I've been doing a lot of work recently on image reconstruction and have thought that I'd really like to stop using MATLAB for this kind of work since I've recently been introduced to Ruby and prefer it 100 times over MATLAB. My PI is open to the idea of my PhD thesis being centered around building Ruby tools for image reconstruction if I can demonstrate its utility.

So I'm kind of trying to strike a balance here. If I can put together a basic proof of concept and demonstrate that I can write code in Ruby that does a basic math function on my computer and our server GPUs faster or at least comparable to what MATLAB does, then I'll probably be given a green light and be working exclusively in this field for the remainder of my PhD work: 2-3 years.

So if I knew more I could build a better proof of concept, if I had a better proof of concept I could have more time to dedicate to learning more.

That's not to say I expect anything from you guys! You've already been really wonderful in helping someone completely new to C, openCL, HPC, everything. I just thought it might be useful or interesting for you to know where I'm coming from and where I'm trying to go.

Thanks,
KC

Kerilk commented on September 6, 2024

On 05/12/2014 15:20, KC Erb wrote:

> Thanks for this Brice, but I don't see an attachment.

Here is the code:

https://gist.github.com/Kerilk/2fa146ab7d1135416f12

> I thought that with my code, my first run is on the CPU.
>
> If it is on the CPU, and the CPU doesn't take the extra preparation
> work that the GPU does (as MATLAB seems to demonstrate), then why does
> my openCL CPU code take 22 seconds to run

Because your OpenCL implementation on the Mac is pretty slow; I don't know why.
On my laptop:

videau@nedni:/tmp$ ruby opencl_addition.rb
                                               user     system      total        real
Intel(R) Core(TM) i7-2760QM CPU @ 2.40GHz: 0.630000   0.020000   0.650000 (  0.598329)

This is running the modified script. Size is 2**20.

> whereas the MATLAB from start to finish is just a couple of seconds?
>
> For example, when I do 2^16 as my vector length, it looks like the OCL
> code is comparable to (and maybe a little quicker than) matlab, but
> it's a tough call since they both just plain do it so fast!

The timing method I showed gives results in the nanosecond range (though
precision will of course depend on your hardware counter accuracy). For
kernels on more than a few thousand elements it will be sufficient:
videau@nedni:~/dev/opencl-ruby/opencl_ruby_ffi/test$ ruby small_test.rb
96653 ns
Success!

This is for size 2**16

> I only ramped up the vector length to 2^26 so that I could actually
> start using the CPU a bit on the MATLAB side.

Same problem here: you need precise timings. Maybe there are modules in
MATLAB that give more accurate timings.
If there are not, maybe you can create one using clock_gettime and
CLOCK_REALTIME.
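On the Ruby side, the standard library already exposes clock_gettime through `Process.clock_gettime`. A minimal sketch follows; it swaps in CLOCK_MONOTONIC instead of the CLOCK_REALTIME mentioned above, on the assumption that a clock unaffected by wall-clock adjustments is preferable for measuring intervals. The helper name `elapsed_ns` is hypothetical.

```ruby
# Run the given block and return the elapsed time in integer nanoseconds.
# Assumption: CLOCK_MONOTONIC is used here rather than CLOCK_REALTIME.
def elapsed_ns
  t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC, :nanosecond)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC, :nanosecond) - t0
end

# Plain Ruby stand-in for the addition kernel on 2**16 elements.
data = Array.new(2**16) { |i| i.to_f }
ns = elapsed_ns { data.map! { |x| x + 5.0 } }
puts "#{ns} ns"
```

The reported unit is nanoseconds, but the real resolution still depends on the platform's clock.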

> Would it perhaps be a closer comparison to do a vector of 2^20 length
> 1,000,000 times? I'm just trying to look for ways to demonstrate to
> someone who's never heard of openCL that it can put jobs on CPUs or
> GPUs and it will maximize use of the device more or less automatically
> (because of openCL's concept of work items).

There is some research going on to try to do it on the more automatic side of things:
https://runtime.bordeaux.inria.fr/StarPU/doc/html/SOCLOpenclExtensions.html

Brice


KCErb commented on September 6, 2024

Wow, SOCL is a great project, exactly the sort of thing I'm looking for. I'll have to learn more!

I'm getting ready to put together some more appropriate benchmarks on CPU and GPU in the next couple of hours, so I'll get back to you on that, but I thought I'd quickly respond to one thing:

> Because your OpenCL implementation on the Mac is pretty slow don't know why.

I'm not sure it is the implementation. Here are my results (using your modified benchmark code, thanks!) on 2**20:

                                                  user     system      total        real
Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz:    0.350000   0.040000   0.390000 (  0.349098)
Iris Pro:                                     0.360000   0.010000   0.370000 (  0.366742)
GeForce GT 750M:                              0.350000   0.010000   0.360000 (  0.375415)

I just find that 2**21 is twice as slow

                                                  user     system      total        real
Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz:    0.730000   0.060000   0.790000 (  0.729077)
Iris Pro:                                     0.700000   0.020000   0.720000 (  0.729866)
GeForce GT 750M:                              0.710000   0.030000   0.740000 (  0.750479)

and 2**22 twice as slow again

                                                  user     system      total        real
Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz:    1.440000   0.140000   1.580000 (  1.441325)
Iris Pro:                                     1.380000   0.030000   1.410000 (  1.421135)
GeForce GT 750M:                              1.380000   0.050000   1.430000 (  1.455044)

so that's why my 2**26 is taking 22+ seconds.
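The scaling argument can be checked with a little arithmetic. The sketch below takes the wall-clock ("real") CPU figures from the three tables above, computes the ratio between consecutive sizes, and extrapolates from 2**22 to 2**26 under the assumption that the whole-program time keeps doubling with the size.

```ruby
# Wall-clock seconds for the CPU rows reported above, keyed by log2(size).
measured = { 20 => 0.349098, 21 => 0.729077, 22 => 1.441325 }

# Ratio between consecutive sizes; close to 2 means linear scaling.
ratios = measured.keys.each_cons(2).map { |a, b| measured[b] / measured[a] }

# Extrapolate from the 2**22 measurement to 2**26, assuming linear scaling.
predicted = measured[22] * 2**(26 - 22)

puts ratios.map { |r| r.round(2) }.inspect
puts format('predicted for 2**26: %.1f s', predicted)
```

The prediction comes out at roughly 23 s, in line with the 22.6 s measured for 2**26 in the original post.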

Of course, as you pointed out before I'm not really doing it right since I should be comparing the kernels to each other, not the whole program. My question here is more about why does the program take so long? On a vector of 2**26 is it spending 22 seconds setting things up? Is there any way to cut that time down, or is the example so bad that it's really not worth talking about / optimizing?

Mmmm, I find this stuff so exciting! I really hope I can get this project approved and dig in! Thanks again 😄

KCErb commented on September 6, 2024

Well, I've hit a little hiccup on using the GPU: my school's machine that has the GPU license doesn't have a Ruby interpreter and I don't have admin rights, so I'll need to work on getting Ruby on that machine.

But until that gets worked out, I'll at least report my findings pitting my first device (the CPU) against MATLAB (also CPU) in the way Brice suggested.

With this code: https://gist.github.com/KCErb/158e7d4e433b710dd52e
I get this output

Warning OpenCL 1.2 loader detected!
404745 ns

and that's about average, 0.4 s

With this MATLAB code: https://gist.github.com/KCErb/aa0377a61f6f4b4e0270

I get an average of ~400ns. So it seems that something is still amiss. Most likely in my understanding . . .

Kerilk commented on September 6, 2024

On 05/12/2014 22:04, KC Erb wrote:

> With this code: https://gist.github.com/KCErb/158e7d4e433b710dd52e
> I get this output
>
> Warning OpenCL 1.2 loader detected!
> 404745 ns
>
> and that's about average, 0.4 s

Didn't you mean 0.0004 s?

> With this MATLAB code: https://gist.github.com/KCErb/aa0377a61f6f4b4e0270
>
> I get an average of ~400ns. So it seems that something is still amiss.
> Most likely in my understanding . . .

KCErb commented on September 6, 2024

Oh wow Brice! I can't believe I got nano and micro confused 😊 That means my MATLAB code was running at about the same speed: 400 microseconds, not nanoseconds!
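For reference, the conversion that caused the mix-up is easy to write down. The helper below is a hypothetical illustration (the name `from_ns` is not part of opencl-ruby) that expresses the 404745 ns reading in the larger units; "us" stands for microseconds.

```ruby
# Express a duration given in nanoseconds in microseconds,
# milliseconds and seconds.
def from_ns(ns)
  { us: ns / 1_000.0, ms: ns / 1_000_000.0, s: ns / 1_000_000_000.0 }
end

t = from_ns(404_745)
puts format('%.3f us = %.6f ms = %.9f s', t[:us], t[:ms], t[:s])
```

So 404745 ns is about 405 microseconds, or 0.0004 s, which matches the roughly 400 microseconds seen in MATLAB.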

Whew, thanks! I'll probably be able to work on the GPU stuff tomorrow. After I've posted / discussed those benchmarks I'll close this out.

KCErb commented on September 6, 2024

OK I've finally got access to our GPU system and got it all running.

I was very pleased with how easy it was to run the addition kernel on our GPUs once I convinced our systems admin to let me put RVM in my home folder!

The results of the benchmarking are as follows:

On my laptop's CPU, MATLAB and openCL are clocking in about the same. Averaged over 35 runs I get:

MATLAB       openCL
473.06 µs    386.75 µs

On my laptop's GPU I get on average 42.64 µs.

On my institution's compute server's Tesla C2070 I get an average of 10.58 µs.
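To put these averages side by side, the sketch below computes each device's speedup relative to the openCL run on the laptop CPU. The numbers are the ones reported above; the hash keys are just labels chosen for this illustration.

```ruby
# Average kernel times reported above, in microseconds.
times_us = {
  'laptop CPU (openCL)' => 386.75,
  'laptop GPU'          => 42.64,
  'Tesla C2070'         => 10.58
}

# Speedup of each device relative to the laptop CPU.
baseline = times_us['laptop CPU (openCL)']
speedups = times_us.transform_values { |t| baseline / t }

speedups.each { |name, s| puts format('%-20s %6.1fx', name, s) }
```

That works out to roughly a 9x speedup on the laptop GPU and roughly 37x on the Tesla for this kernel.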

Getting timing out of MATLAB is hard, though. MATLAB using the Tesla GPU seems to come in around 100 µs, but it's hard to tell.

The trouble with MATLAB is that it doesn't offer high-resolution CPU timing.

Its only timing functions are cputime and tic/toc. cputime measures CPU time only, meaning it is unaffected by other CPU workload, but its resolution is no finer than a few milliseconds.

tic and toc use wall-clock time and are very high resolution.
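The difference between CPU time and wall-clock time can be illustrated in Ruby with the two corresponding clocks. This is only an analogy to cputime and tic/toc, not a description of how MATLAB implements them; the helper name `cpu_and_wall` is hypothetical.

```ruby
# Read both clocks around a block.
# Returns [cpu_seconds, wall_seconds].
def cpu_and_wall
  cpu0  = Process.clock_gettime(Process::CLOCK_PROCESS_CPUTIME_ID)
  wall0 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  [Process.clock_gettime(Process::CLOCK_PROCESS_CPUTIME_ID) - cpu0,
   Process.clock_gettime(Process::CLOCK_MONOTONIC) - wall0]
end

# Sleeping consumes wall-clock time but almost no CPU time.
cpu_s, wall_s = cpu_and_wall { sleep 0.2 }
puts format('cpu: %.4f s, wall: %.4f s', cpu_s, wall_s)
```

A wall-clock timer includes time spent waiting, while a CPU-time counter does not, which is why the two can disagree for the same piece of code.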

MATLAB introduced a gputimeit function in R2014, but my institution doesn't have any GPU R2014 licenses.

Summary: As expected, Ruby on openCL has no reason to not be super fast. It's hard to really gauge what's going on with MATLAB so for now I'll call it a tie and safely assume that openCL can be faster than MATLAB if used right.

Thanks for the input and help!

PS. I just got this project halfway through the approval process last Friday so things are on track for me to really dig in next year!

Kerilk commented on September 6, 2024

Great news!

Your results are sensible, so I think you got everything right.
Don't hesitate to ask if you encounter further problems.

Brice

