Comments (15)
I'm afraid I don't have such an OCL-based Windows platform currently. Can you explain more about your OCL environment? E.g., for NVIDIA/AMD GPUs, do they link to the same libocl.dll, using shared OCL header files?
from antares.
BTW, we recommend you try hlsl_win64, since it is DirectX-based and works for AMD GPUs, NVIDIA GPUs, and even Intel GPUs.
from antares.
BACKEND=c-hlsl_win64 antares
didn't work for me; it pops up a message (since I don't know how to include a png/jpg, I will copy it):
title: evaluator.c-hlsl_win64 - Bad Image
Error 1:
\\wsl.localhost\Ubuntu\home\mabd\.cache\antares\dxcompiler.dll is either not designed to run on Windows or
it contains an error. Try installing the program again using the original installation media or contact your system administrator
or the software vendor for support. Error status 0xc000007b.
Error 2:
[EvalAgent] Evaluating Modules .. (for backend = c-hlsl_win64)
[CheckFail] Failed to load .\antares_hlsl_v0.3.2_x64.dll, please download these libraries first!
[EvalAgent] Results = {}
[Antares] Incorrect compute kernel from evaluator.
To be honest, I know a little about DirectX; I used it a long time ago.
As for OpenCL, people are now using PyOpenCL or PyCUDA.
But I understand that is not what you need.
You need the "old" OpenCL SDK from NVIDIA (~2012) and/or the last AMD OpenCL SDK (~2013-2014), as far as I remember.
Here is the one from AMD; it should work for NVIDIA and AMD GPUs. You will find the headers and the libraries inside. I haven't touched this for years.
Here is the last/newest one from AMD (just some release info):
https://github.com/GPUOpen-LibrariesAndSDKs/OCL-SDK/releases
from antares.
Wow, it is definitely not expected. Firstly, you need a fully updated Windows 10 or Windows 11 (64-bit) system. Then you may have suffered from a broken download of antares_hlsl_v0.3.2_x64.dll (blocked by a firewall?), so you need to run antares clean in WSL to reset all settings, then try executing the command BACKEND=c-hlsl_win64 antares to see if it works now.
from antares.
BTW, if you installed the latest AMDGPU drivers, why doesn't your system have amdhip64.dll installed, which would enable c-rocm_win64?
from antares.
OK, we are being distracted by two different parallel problems; maybe the DirectX one can be handled in another thread.
For the AMD GPU, Windows (Win 11) is not the problem. I have a Ryzen 7 4800H; using clinfo from Windows I can see that my built-in AMD GPU (Renoir) is called gfx902.
Actually, the only way I can program it so far is using pyopencl with the old AMD OpenCL. It works fine, and it even works fine with my second GPU, an NVIDIA 1660 Ti (Turing, stripped-down, with 6 GB RAM).
That is amazing for Windows 11.
On WSL (Ubuntu 18 or the newer Ubuntu 20), I installed OpenCL but it doesn't work. I installed different ROCm versions, but none of them work, and the AMD site never explicitly says that they support Renoir/Ryzen 7 on Ubuntu, not to mention Win11/ROCm (for programming). But it works for Windows.
Now, I am a scientist/mathematician; I am calculating trillions of terms. I can offload to my CPU and NVIDIA GPU together over WSL (this time nvc++/HPC/NVIDIA is only available on Ubuntu/WSL, not Windows).
So I have the AMD GPU working in Win 11, and I am offloading to my CPU/GPU (NVIDIA) over WSL; I want to let them talk.
PS: if you convert to nvc++/NVIDIA, you can offload to the CPU and the NVIDIA cards simultaneously with the same C code; at compile time you just add -acc=gpu or -acc=multicore, then build a .so and fork it from the C code. But only over WSL.
Thank you really for antares; making it work for OpenCL will solve a lot of problems for many, many different people.
PS: it was a nice trick to use mingw/WSL/Ubuntu to call .dll libraries for amdhip64.dll. I like it; I didn't know about it.
from antares.
For HLSL: I applied your reg file antares_hlsl_tdr_v0.1.reg, then it ran
Download Microsoft DirectX Shader Compiler 6 ...
and now it works.
Interestingly, it is using the AMD/Renoir, not the NVIDIA/1660 Ti.
So I should now learn what HLSL is.
from antares.
I did it.
Now I can offload to my Renoir/Ryzen 7 from Win 11 using opencl/pyopencl, or from WSL using the cross-platform opencl/HLSL.
That's great.
By the way, you have assumed that there is one WSL distribution. There can be many, so for antares to work one should choose the correct one as the default.
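For reference, choosing the default WSL distribution can be done with the standard wsl.exe commands (the distro name below is just an example; use whatever the listing shows):

```shell
wsl -l -v                      # list installed distros; the default is marked with *
wsl --set-default Ubuntu-20.04 # make this distro the default one
```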
from antares.
Congratulations. Many of our previous investigations prove that HLSL can work as efficiently as OpenCL, and it has a standard interface defined by Windows that can cover all graphics GPUs, as long as you install the graphics drivers.
If your machine has both the AMD/Renoir and the NVIDIA/1660 Ti equipped, and HLSL works for the AMD one, a feasible way to make it use the NVIDIA resources instead is to disable the AMD graphics device in "Windows Device Manager", although this may not be what you want if you want to use them simultaneously.
from antares.
Can you have a try on this?
$ pip3 install --upgrade antares==0.3.20.12
$ antares clean
$ DEVICE_ID=0 STEP=100 antares # this should use one of the GPU, maybe AMD
$ DEVICE_ID=1 STEP=100 antares # this should use another GPU, maybe NVIDIA
from antares.
Very good.
OK, I am learning HLSL now. I upgraded.
DEVICE_ID=0 STEP=100 antares # it uses the NVIDIA 1660 Ti
$ DEVICE_ID=0 STEP=100 antares
Backend = c-hlsl_win64, Python PID = 1757, Task = lang.generic;
MAKE_PARA = 1/16, EXEC_PARA = 1, TUNER = OpEvo
COMPUTE_V1 = - einstein_v2("output0[N, M] = input0[N, M] + input1[N, M]", input_dict={"input0": {"dtype": "float32", "shape": [1024, 512]}, "input1": {"dtype": "float32", "shape": [1024, 512]}})
[ ] Param_entity on sid = 1: config = '{"Foutput0:D0": [-1, 1, 1, 1], "Foutput0:D1": [-1, 1, 1, 1], "Foutput0:O": [0, 1], "Foutput0:S": 0, "Foutput0:R": 0}', dev_id = 0, upper_bound_tpr = 3.000000e+01 s
[] Param_entity on sid = 1: config = '{"Foutput0:D0": [-1, 1, 1, 1], "Foutput0:D1": [-1, 1, 1, 1], "Foutput0:O": [0, 1], "Foutput0:S": 0, "Foutput0:R": 0}', tpr = 0.000931, digest = 1.504583e+09, mem_occupy = -1 %
....
[] Param_entity on sid = 100: config = '{"Foutput0:D0": [-1, 2, 8, 8], "Foutput0:D1": [-1, 1, 16, 1], "Foutput0:O": [0, 1], "Foutput0:S": 2, "Foutput0:R": 1}', tpr = 0.000035, digest = 1.504583e+09, mem_occupy = -1 %
========================================================================================================================
STEP[100 / 100] Current Best Config = {"Foutput0:D0": [-1, 2, 4, 4], "Foutput0:D1": [-1, 1, 16, 1], "Foutput0:O": [1, 0], "Foutput0:S": 2, "Foutput0:R": 0}, Perf = 3.01407e-05 sec / op (17.3947 Gflops), MemRatio = -1 %, Occur Step = 29;
========================================================================================================================
[Best Config] CONFIG='{"Foutput0:D0": [-1, 2, 4, 4], "Foutput0:D1": [-1, 1, 16, 1], "Foutput0:O": [1, 0], "Foutput0:S": 2, "Foutput0:R": 0}' ==> Performance is up to 17.394686 Gflops, occurred at step 29 / 100; time per run = 3.01407e-05 sec.
DEVICE_ID=1 STEP=100 antares # it uses the AMD
$ DEVICE_ID=1 STEP=100 antares
Backend = c-hlsl_win64, Python PID = 25143, Task = lang.generic;
MAKE_PARA = 1/16, EXEC_PARA = 1, TUNER = OpEvo
COMPUTE_V1 = - einstein_v2("output0[N, M] = input0[N, M] + input1[N, M]", input_dict={"input0": {"dtype": "float32", "shape": [1024, 512]}, "input1": {"dtype": "float32", "shape": [1024, 512]}})
[ ] Param_entity on sid = 1: config = '{"Foutput0:D0": [-1, 1, 1, 1], "Foutput0:D1": [-1, 1, 1, 1], "Foutput0:O": [0, 1], "Foutput0:S": 0, "Foutput0:R": 0}', dev_id = 0, upper_bound_tpr = 3.000000e+01 s
[] Param_entity on sid = 1: config = '{"Foutput0:D0": [-1, 1, 1, 1], "Foutput0:D1": [-1, 1, 1, 1], "Foutput0:O": [0, 1], "Foutput0:S": 0, "Foutput0:R": 0}', tpr = 0.004155, digest = 1.504583e+09, mem_occupy = -1 %
.....
[] Param_entity on sid = 100: config = '{"Foutput0:D0": [-1, 4, 2, 8], "Foutput0:D1": [-1, 2, 8, 4], "Foutput0:O": [1, 0], "Foutput0:S": 2, "Foutput0:R": 0}', tpr = 0.000224, digest = 1.504583e+09, mem_occupy = -1 %
========================================================================================================================
STEP[100 / 100] Current Best Config = {"Foutput0:D0": [-1, 2, 4, 32], "Foutput0:D1": [-1, 1, 16, 2], "Foutput0:O": [1, 0], "Foutput0:S": 4, "Foutput0:R": 1}, Perf = 0.000184122 sec / op (2.8475 Gflops), MemRatio = -1 %, Occur Step = 90;
========================================================================================================================
[Best Config] CONFIG='{"Foutput0:D0": [-1, 2, 4, 32], "Foutput0:D1": [-1, 1, 16, 2], "Foutput0:O": [1, 0], "Foutput0:S": 4, "Foutput0:R": 1}' ==> Performance is up to 2.847503 Gflops, occurred at step 90 / 100; time per run = 0.000184122 sec.
Just out of curiosity, I tried
$ DEVICE_ID=2 STEP=2 antares
Backend = c-hlsl_win64, Python PID = 32252, Task = lang.generic;
MAKE_PARA = 1/16, EXEC_PARA = 1, TUNER = OpEvo
COMPUTE_V1 = - einstein_v2("output0[N, M] = input0[N, M] + input1[N, M]", input_dict={"input0": {"dtype": "float32", "shape": [1024, 512]}, "input1": {"dtype": "float32", "shape": [1024, 512]}})
[ ] Param_entity on sid = 1: config = '{"Foutput0:D0": [-1, 1, 1, 1], "Foutput0:D1": [-1, 1, 1, 1], "Foutput0:O": [0, 1], "Foutput0:S": 0, "Foutput0:R": 0}', dev_id = 0, upper_bound_tpr = 3.000000e+01 s
[Antares] Incorrect compute kernel from evaluator.
[ ] Param_entity on sid = 2: config = '{"Foutput0:D0": [-1, 8, 16, 1], "Foutput0:D1": [-1, 4, 2, 4], "Foutput0:O": [0, 1], "Foutput0:S": 0, "Foutput0:R": 1}', dev_id = 0, upper_bound_tpr = 3.000000e+01 s
[Antares] Incorrect compute kernel from evaluator.
Maybe it would be better to report "Incorrect DEVICE_ID".
from antares.
But I am afraid that is very low GFlops; usually I can get some teraflops.
How can I try something more complicated (** = power, as in Python), using complex numbers (I is the imaginary unit):
sum_{n=1}^{n=1000} sum_{m=1}^{m=1000} ( 1 / ( (m + I*n) ** 4 ) )
or at least its real part:
( (m ** 4 - 6. * m ** 2 * n ** 2 + n ** 4) / ( (m ** 2 + n ** 2) ** 4 ) )
This is called the lemniscate zeta. Even for up to 1000000 it can be done in less than 1 min (C+OpenMP, Cython+OpenMP, Numba, ...).
Gauss calculated it by hand around 1795, but he used some tricks (the AGM = arithmetic-geometric mean).
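For reference, here is a minimal CPU-only NumPy sketch of that double sum (the real part), just to pin down the target value; the function name is my own:

```python
import numpy as np

def lemniscate_real_sum(N):
    """Real part of sum over 1 <= m, n <= N of 1/(m + i*n)**4, via the
    closed form Re(1/(m+i*n)**4) = (m**4 - 6*m**2*n**2 + n**4) / (m**2 + n**2)**4."""
    m = np.arange(1.0, N + 1)[:, None]   # column vector of m indices
    n = np.arange(1.0, N + 1)[None, :]   # row vector of n indices
    num = m**4 - 6.0 * m**2 * n**2 + n**4
    den = (m**2 + n**2) ** 4
    return float((num / den).sum())

print(lemniscate_real_sum(1000))   # converges quickly as N grows
```

The truncation error of the N x N box decays roughly like 1/N^2, so even modest N gives several stable digits.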
from antares.
Because the computation by default is elementwise, which is a memory-bound operation. If you want to test how high it can reach in TFlops, you'd better try a large GEMM.
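For example, a GEMM can be expressed in the same einstein_v2 form shown in the logs above, using the `+=!` reduction; the shapes here are illustrative, and this is a sketch of the command rather than a verified invocation:

```shell
COMPUTE_V1='- einstein_v2("output0[N, M] +=! input0[N, K] * input1[K, M]", input_dict={"input0": {"dtype": "float32", "shape": [4096, 4096]}, "input1": {"dtype": "float32", "shape": [4096, 4096]}})' BACKEND=c-hlsl_win64 STEP=100 antares
```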
from antares.
OK, the main power of the GPU is the parallel array. This is how I discovered it: something that can take hours in Mathematica (parallel) can be done in milliseconds on the GPU.
Here is a very simple example in PyCUDA. I initialize an empty array and copy it to the GPU, then generate and time my zeta on the GPU, then copy it back and sum it on the CPU, just to show that the sum is not the problematic part.
The main point is that generating the whole zeta at once on the GPU takes milliseconds (sometimes less than 0.0009 s), while trying to do this on the CPU is meaningless; it takes hours. Of course, the standard way is an accumulative sum, which is fast, and in parallel (16 threads) even faster than an accumulative sum on the GPU.
But generating the whole zeta can be useful, for example for investigating hidden symmetry actions, "canonical" distributions, and so on.
Here we go:
=============================
import pycuda.driver as cuda
import pycuda.autoinit  # noqa
from pycuda.compiler import SourceModule
from time import time as _time
import numpy

gd = 512
dd = 32 * gd                      # 16384: the array is dd x dd
a = numpy.empty((dd, dd), dtype=numpy.float64)
a_gpu = cuda.mem_alloc(a.size * a.dtype.itemsize)
cuda.memcpy_htod(a_gpu, a)

mod = SourceModule("""
__global__ void zetac(double *a)
{
    int id = blockIdx.x * 32 * 32 + threadIdx.y * 32 + threadIdx.x;
    int y = id / (32 * 512) + 1;   /* row index; 32 * 512 == dd (was 32 * 496, a bug) */
    int x = id % (32 * 512) + 1;   /* column index */
    double kk1 = double(x) * double(x);
    double kk2 = double(y) * double(y);
    /* (kk1 - kk2)^2 - 4*kk1*kk2 == x^4 - 6 x^2 y^2 + y^4 */
    a[id] = ((kk1 - kk2) * (kk1 - kk2) - 4. * kk1 * kk2)
          / (double(x * x + y * y) * (x * x + y * y) * (x * x + y * y) * (x * x + y * y));
}
""")

nn = 32
tt1 = _time()
func = mod.get_function("zetac")
func(a_gpu, block=(nn, nn, 1), grid=(gd * gd, 1, 1), shared=0)
cuda.Context.synchronize()        # kernel launches are asynchronous; wait before timing
tt2 = _time()
print("func gpu time is =", tt2 - tt1)
cuda.memcpy_dtoh(a, a_gpu)
tt3 = _time()
print("memcpy time is =", tt3 - tt2)
print("last element =", a[nn * gd - 1, nn * gd - 1])
print("dim =", a.shape)
tt4 = _time()
print("sum =", a.sum())
tt5 = _time()
print("sum time over the cpu = ", tt5 - tt4)
=================================
$ python3 ztc2.py
func gpu time is = 0.002030611038208008
memcpy time is = 0.4167020320892334
last element = -1.854725953404076e-18
dim = (16384, 16384)
sum = -0.29452023348172157
sum time over the cpu = 0.09859681129455566
Of course I can do better, if it is only about the sum, in many different ways (Numba, Cython, C, ...), but just to make the point: generating a parallel array computationally on the GPU is really fantastic.
Can you compare this with the different antares backends?
from antares.
I hope the editor didn't mess up my code; I corrected it 3 times.
from antares.