Comments (15)
I'm afraid I don't have such an OCL-based Windows platform currently. Can you explain more about your OCL environment? E.g., for NVIDIA/AMD GPUs, do they link to the same libocl.dll, using shared OCL header files?
from antares.
BTW, we recommend you try hlsl_win64, since it is DirectX-based and works for AMD GPUs, NVIDIA GPUs, and even Intel GPUs.
from antares.
BACKEND=c-hlsl_win64 antares
didn't work for me; it pops up a message (since I don't know how to include a png/jpg, I will copy it):
title: evaluator.c-hlsl_win64 - Bad Image
Error 1:
\\wsl.localhost\Ubuntu\home\mabd\.cache\antares\dxcompiler.dll is either not designed to run on Windows or
it contains an error. Try installing the program again using the original installation media or contact your system administrator
or the software vendor for support. Error status 0xc000007b.
Error 2:
[EvalAgent] Evaluating Modules .. (for backend = c-hlsl_win64)
[CheckFail] Failed to load .\antares_hlsl_v0.3.2_x64.dll, please download these libraries first!
[EvalAgent] Results = {}
[Antares] Incorrect compute kernel from evaluator.
To be honest, I know a little about DirectX; I used it a long time ago.
As for OpenCL, people are now using PyOpenCL or PyCUDA.
But I understand that is not what you need.
You need the "old" OpenCL SDK from NVIDIA (~2012) and/or the last AMD OpenCL SDK (~2013-2014), as far as I remember.
Here is the one from AMD; it should work for NVIDIA and AMD GPUs. You will find the headers and the libraries inside. I haven't touched this for years.
Here is the last/newest one from AMD (just some release info):
https://github.com/GPUOpen-LibrariesAndSDKs/OCL-SDK/releases
from antares.
Wow, it is definitely not expected. Firstly, you need a fully updated Windows 10 or Windows 11 (64-bit) system. Then you may have suffered from a broken download of antares_hlsl_v0.3.2_x64.dll (blocked by a firewall?), so you need to run antares clean in WSL to reset all settings, then try executing the command BACKEND=c-hlsl_win64 antares to see if it works now.
from antares.
BTW, if you installed the latest AMDGPU drivers, why doesn't your system have amdhip64.dll installed, which would enable c-rocm_win64?
from antares.
OK, we are being distracted by two different parallel problems; maybe the DirectX one can be handled in another thread.
For the AMD GPU, Windows (Win 11) is not the problem. I have a Ryzen 7 4800H; using clinfo from Windows I can see that my built-in AMD GPU (Renoir) is called gfx902.
Actually, the only way I can program it so far is using pyopencl with the old AMD OpenCL. It works fine, and it even works fine with my second GPU, an NVIDIA 1660 Ti (Turing, stripped-down, with 6 GB RAM).
That is amazing for Windows 11.
On WSL (Ubuntu 18 or the newer Ubuntu 20), I installed OpenCL but it doesn't work. I installed different ROCm versions, but none of them work, and the AMD site never explicitly says that they support Renoir/Ryzen 7 on Ubuntu, not to mention Win11/ROCm (for programming). But it works for Windows.
Now, I am a scientist/mathematician; I am calculating trillions of terms. I can offload to my CPU and NVIDIA GPU together over WSL (this time nvc++/HPC/NVIDIA is only available on Ubuntu/WSL, not Windows).
So I have the AMD GPU working in Win 11, and I am offloading to my CPU/GPU (NVIDIA) over WSL; I want to let them talk.
PS: if you convert to nvc++/NVIDIA, you can offload to the CPU and the NVIDIA cards simultaneously with the same C code; at compile time you just add -acc=gpu or -acc=multicore, then build a .so and fork it from the C code. But only over WSL.
Thank you really for antares; making it work for OpenCL will solve a lot of problems for many, many different people.
PS: it was a nice trick to use mingw/WSL/Ubuntu to call .dll libraries for amdhip64.dll. I like it; I didn't know about it.
from antares.
For HLSL: I applied your reg file antares_hlsl_tdr_v0.1.reg, then it ran
Download Microsoft DirectX Shader Compiler 6 ...
and now it works.
Interestingly, it is using the AMD/Renoir, not the NVIDIA/1660 Ti.
So I should now learn what HLSL is.
from antares.
I did it.
Now I can offload to my Renoir/Ryzen 7 from Win 11 using opencl/pyopencl, or from WSL using the cross-platform opencl/HLSL.
That's great.
By the way, you have assumed that there is one WSL distribution. There can be many, so for antares to work one should choose the correct one as the default.
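For reference, choosing the default WSL distribution can be done with the standard wsl.exe commands (the distro name below is just an example; use whatever the listing shows):

```shell
wsl -l -v                      # list installed distros; the default is marked with *
wsl --set-default Ubuntu-20.04 # make this distro the default one
```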
from antares.
Congratulations. Many of our previous investigations prove that HLSL can work as efficiently as OpenCL, and it has a standard interface defined by Windows that can cover all graphics GPUs, as long as you install the graphics drivers.
If your machine has both the AMD/Renoir and the NVIDIA/1660 Ti equipped, and HLSL works for the AMD one, a feasible way to make it use the NVIDIA resources instead is to disable the AMD graphics device in "Windows Device Manager", although this may not be what you want if you want to use them simultaneously.
from antares.
Can you have a try on this?
$ pip3 install --upgrade antares==0.3.20.12
$ antares clean
$ DEVICE_ID=0 STEP=100 antares # this should use one of the GPU, maybe AMD
$ DEVICE_ID=1 STEP=100 antares # this should use another GPU, maybe NVIDIA
from antares.
Very good.
OK, I am learning HLSL now. I upgraded.
DEVICE_ID=0 STEP=100 antares # it uses the NVIDIA 1660 Ti
$ DEVICE_ID=0 STEP=100 antares
Backend = c-hlsl_win64, Python PID = 1757, Task = lang.generic;
MAKE_PARA = 1/16, EXEC_PARA = 1, TUNER = OpEvo
COMPUTE_V1 = - einstein_v2("output0[N, M] = input0[N, M] + input1[N, M]", input_dict={"input0": {"dtype": "float32", "shape": [1024, 512]}, "input1": {"dtype": "float32", "shape": [1024, 512]}})
[ ] Param_entity on sid = 1: config = '{"Foutput0:D0": [-1, 1, 1, 1], "Foutput0:D1": [-1, 1, 1, 1], "Foutput0:O": [0, 1], "Foutput0:S": 0, "Foutput0:R": 0}', dev_id = 0, upper_bound_tpr = 3.000000e+01 s
[] Param_entity on sid = 1: config = '{"Foutput0:D0": [-1, 1, 1, 1], "Foutput0:D1": [-1, 1, 1, 1], "Foutput0:O": [0, 1], "Foutput0:S": 0, "Foutput0:R": 0}', tpr = 0.000931, digest = 1.504583e+09, mem_occupy = -1 %
....
[] Param_entity on sid = 100: config = '{"Foutput0:D0": [-1, 2, 8, 8], "Foutput0:D1": [-1, 1, 16, 1], "Foutput0:O": [0, 1], "Foutput0:S": 2, "Foutput0:R": 1}', tpr = 0.000035, digest = 1.504583e+09, mem_occupy = -1 %
========================================================================================================================
STEP[100 / 100] Current Best Config = {"Foutput0:D0": [-1, 2, 4, 4], "Foutput0:D1": [-1, 1, 16, 1], "Foutput0:O": [1, 0], "Foutput0:S": 2, "Foutput0:R": 0}, Perf = 3.01407e-05 sec / op (17.3947 Gflops), MemRatio = -1 %, Occur Step = 29;
========================================================================================================================
[Best Config] CONFIG='{"Foutput0:D0": [-1, 2, 4, 4], "Foutput0:D1": [-1, 1, 16, 1], "Foutput0:O": [1, 0], "Foutput0:S": 2, "Foutput0:R": 0}' ==> Performance is up to 17.394686 Gflops, occurred at step 29 / 100; time per run = 3.01407e-05 sec.
DEVICE_ID=1 STEP=100 antares # it uses the AMD
$ DEVICE_ID=1 STEP=100 antares
Backend = c-hlsl_win64, Python PID = 25143, Task = lang.generic;
MAKE_PARA = 1/16, EXEC_PARA = 1, TUNER = OpEvo
COMPUTE_V1 = - einstein_v2("output0[N, M] = input0[N, M] + input1[N, M]", input_dict={"input0": {"dtype": "float32", "shape": [1024, 512]}, "input1": {"dtype": "float32", "shape": [1024, 512]}})
[ ] Param_entity on sid = 1: config = '{"Foutput0:D0": [-1, 1, 1, 1], "Foutput0:D1": [-1, 1, 1, 1], "Foutput0:O": [0, 1], "Foutput0:S": 0, "Foutput0:R": 0}', dev_id = 0, upper_bound_tpr = 3.000000e+01 s
[] Param_entity on sid = 1: config = '{"Foutput0:D0": [-1, 1, 1, 1], "Foutput0:D1": [-1, 1, 1, 1], "Foutput0:O": [0, 1], "Foutput0:S": 0, "Foutput0:R": 0}', tpr = 0.004155, digest = 1.504583e+09, mem_occupy = -1 %
.....
[] Param_entity on sid = 100: config = '{"Foutput0:D0": [-1, 4, 2, 8], "Foutput0:D1": [-1, 2, 8, 4], "Foutput0:O": [1, 0], "Foutput0:S": 2, "Foutput0:R": 0}', tpr = 0.000224, digest = 1.504583e+09, mem_occupy = -1 %
========================================================================================================================
STEP[100 / 100] Current Best Config = {"Foutput0:D0": [-1, 2, 4, 32], "Foutput0:D1": [-1, 1, 16, 2], "Foutput0:O": [1, 0], "Foutput0:S": 4, "Foutput0:R": 1}, Perf = 0.000184122 sec / op (2.8475 Gflops), MemRatio = -1 %, Occur Step = 90;
========================================================================================================================
[Best Config] CONFIG='{"Foutput0:D0": [-1, 2, 4, 32], "Foutput0:D1": [-1, 1, 16, 2], "Foutput0:O": [1, 0], "Foutput0:S": 4, "Foutput0:R": 1}' ==> Performance is up to 2.847503 Gflops, occurred at step 90 / 100; time per run = 0.000184122 sec.
Just out of curiosity, I tried
$ DEVICE_ID=2 STEP=2 antares
Backend = c-hlsl_win64, Python PID = 32252, Task = lang.generic;
MAKE_PARA = 1/16, EXEC_PARA = 1, TUNER = OpEvo
COMPUTE_V1 = - einstein_v2("output0[N, M] = input0[N, M] + input1[N, M]", input_dict={"input0": {"dtype": "float32", "shape": [1024, 512]}, "input1": {"dtype": "float32", "shape": [1024, 512]}})
[ ] Param_entity on sid = 1: config = '{"Foutput0:D0": [-1, 1, 1, 1], "Foutput0:D1": [-1, 1, 1, 1], "Foutput0:O": [0, 1], "Foutput0:S": 0, "Foutput0:R": 0}', dev_id = 0, upper_bound_tpr = 3.000000e+01 s
[Antares] Incorrect compute kernel from evaluator.
[ ] Param_entity on sid = 2: config = '{"Foutput0:D0": [-1, 8, 16, 1], "Foutput0:D1": [-1, 4, 2, 4], "Foutput0:O": [0, 1], "Foutput0:S": 0, "Foutput0:R": 1}', dev_id = 0, upper_bound_tpr = 3.000000e+01 s
[Antares] Incorrect compute kernel from evaluator.
Maybe it would be better to report "Incorrect DEVICE_ID".
from antares.
But I am afraid that is very low GFlops; usually I can get some teraflops.
How can I try something more complicated (** = power, as in Python), using complex numbers (I is the imaginary unit):
sum_{n=1}^{n=1000} sum_{m=1}^{m=1000} ( 1 / ( (m + I*n) ** 4 ) )
or at least its real part:
( (m ** 4 - 6. * m ** 2 * n ** 2 + n ** 4) / ( (m ** 2 + n ** 2) ** 4 ) )
This is called the lemniscate zeta. Even for up to 1000000 it can be done in less than 1 min (C+OpenMP, Cython+OpenMP, Numba, ...).
Gauss calculated it by hand around 1795, but he used some tricks (the AGM = arithmetic-geometric mean).
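For reference, here is a minimal CPU-only NumPy sketch of that double sum (the real part), just to pin down the target value; the function name is my own:

```python
import numpy as np

def lemniscate_real_sum(N):
    """Real part of sum over 1 <= m, n <= N of 1/(m + i*n)**4, via the
    closed form Re(1/(m+i*n)**4) = (m**4 - 6*m**2*n**2 + n**4) / (m**2 + n**2)**4."""
    m = np.arange(1.0, N + 1)[:, None]   # column vector of m indices
    n = np.arange(1.0, N + 1)[None, :]   # row vector of n indices
    num = m**4 - 6.0 * m**2 * n**2 + n**4
    den = (m**2 + n**2) ** 4
    return float((num / den).sum())

print(lemniscate_real_sum(1000))   # converges quickly as N grows
```

The truncation error of the N x N box decays roughly like 1/N^2, so even modest N gives several stable digits.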
from antares.
Because the computation by default is elementwise, which is a memory-bound operation. If you want to test how high it can reach in TFlops, you'd better try a large GEMM.
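For example, a GEMM can be expressed in the same einstein_v2 form shown in the logs above, using the `+=!` reduction; the shapes here are illustrative, and this is a sketch of the command rather than a verified invocation:

```shell
COMPUTE_V1='- einstein_v2("output0[N, M] +=! input0[N, K] * input1[K, M]", input_dict={"input0": {"dtype": "float32", "shape": [4096, 4096]}, "input1": {"dtype": "float32", "shape": [4096, 4096]}})' BACKEND=c-hlsl_win64 STEP=100 antares
```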
from antares.
OK, the main power of the GPU is the parallel array. This is how I discovered it: something that can take hours in Mathematica (parallel) can be done in milliseconds on the GPU.
Here is a very simple example in PyCUDA. I initialize an empty array and copy it to the GPU, then generate and time my zeta on the GPU, then copy it back and sum it on the CPU, just to show that the sum is not the problematic part.
The main point is that generating the whole zeta at once on the GPU takes milliseconds (sometimes less than 0.0009 s), while trying to do this on the CPU is meaningless; it takes hours. Of course, the standard way is an accumulative sum, which is fast, and in parallel (16 threads) even faster than an accumulative sum on the GPU.
But generating the whole zeta can be useful, for example for investigating hidden symmetry actions, "canonical" distributions, and so on.
Here we go:
=============================
import pycuda.driver as cuda
import pycuda.autoinit  # noqa
from pycuda.compiler import SourceModule
from time import time as _time
import numpy

gd = 512
dd = 32 * gd                      # 16384: the array is dd x dd
a = numpy.empty((dd, dd), dtype=numpy.float64)
a_gpu = cuda.mem_alloc(a.size * a.dtype.itemsize)
cuda.memcpy_htod(a_gpu, a)

mod = SourceModule("""
__global__ void zetac(double *a)
{
    int id = blockIdx.x * 32 * 32 + threadIdx.y * 32 + threadIdx.x;
    int y = id / (32 * 512) + 1;   /* row index; 32 * 512 == dd (was 32 * 496, a bug) */
    int x = id % (32 * 512) + 1;   /* column index */
    double kk1 = double(x) * double(x);
    double kk2 = double(y) * double(y);
    /* (kk1 - kk2)^2 - 4*kk1*kk2 == x^4 - 6 x^2 y^2 + y^4 */
    a[id] = ((kk1 - kk2) * (kk1 - kk2) - 4. * kk1 * kk2)
          / (double(x * x + y * y) * (x * x + y * y) * (x * x + y * y) * (x * x + y * y));
}
""")

nn = 32
tt1 = _time()
func = mod.get_function("zetac")
func(a_gpu, block=(nn, nn, 1), grid=(gd * gd, 1, 1), shared=0)
cuda.Context.synchronize()        # kernel launches are asynchronous; wait before timing
tt2 = _time()
print("func gpu time is =", tt2 - tt1)
cuda.memcpy_dtoh(a, a_gpu)
tt3 = _time()
print("memcpy time is =", tt3 - tt2)
print("last element =", a[nn * gd - 1, nn * gd - 1])
print("dim =", a.shape)
tt4 = _time()
print("sum =", a.sum())
tt5 = _time()
print("sum time over the cpu = ", tt5 - tt4)
=================================
$ python3 ztc2.py
func gpu time is = 0.002030611038208008
memcpy time is = 0.4167020320892334
last element = -1.854725953404076e-18
dim = (16384, 16384)
sum = -0.29452023348172157
sum time over the cpu = 0.09859681129455566
Of course I can do better, if it is only about the sum, in many different ways (Numba, Cython, C, ...), but just to make the point: generating a parallel array computationally on the GPU is really fantastic.
Can you compare this with the different antares backends?
from antares.
I hope the editor didn't mess up my code; I corrected it 3 times.
from antares.