microsoft / antares

Antares: an automatic engine for multi-platform kernel generation and optimization. Supporting CPU, CUDA, ROCm, DirectX12, GraphCore, SYCL for CPU/GPU, OpenCL for AMD/NVIDIA, Android CPU/GPU backends.

License: Other

Makefile 0.56% Python 44.20% Shell 1.67% C++ 44.42% C 9.15%

antares's Introduction

AutoRT: the Next Generation of Antares.

AutoRT for Device Runtime:

AutoRT is a compiler solution that helps runtime users invent, benchmark, and optimize operators for Pytorch on their own accelerators:

AutoRT can be used as a benchmark utility for device performance testing and profiling.
AutoRT can also generate a Pytorch2 backend for your device (e.g. DirectX) to accelerate standard Pytorch applications.
Additionally, AutoRT further helps to construct custom-defined / fused operators that go beyond the built-in functions of Pytorch.
An experimental version of AutoRT has been released for Windows DirectX 12 / Linux CUDA.
Click here to suggest more platforms (e.g. Pytorch2 for Windows ROCm / OpenCL / SYCL / Apple Metal / ..) you would like AutoRT to support in follow-up releases.

Architecture of AutoRT as a Backend for Pytorch 2.0:

Workflow of Custom Operations from Antares IR to Different Backends:

- Quick Installation of AutoRT:

Installation

Platform	OS Requirement	Python Requirement	Download Link
DirectX 12	Windows >= 10 / Microsoft XBox	Python3.12 (Windows)	python3.12 -m pip install https://github.com/microsoft/antares/releases/download/v0.9.6/autort-0.9.6.1+directx.win-cp312-cp312-win_amd64.whl
Vulkan 1.3	Ubuntu >= 18.04	Python3.12 (Linux)	python3.12 -m pip install https://github.com/microsoft/antares/releases/download/v0.9.6/autort-0.9.6.1+vulkan.linux-cp312-cp312-manylinux1_x86_64.whl
CUDA >= 11.0	Windows >= 10 / Ubuntu >= 18.04	Python 3.8/3.9/3.10/3.11/3.12	python3 -m pip install https://github.com/microsoft/antares/releases/download/v0.9.6/autort-0.9.6.2+cuda.zip
..	..	..	.. (More coming soon) ..
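Before running one of the pip commands in the table, it can help to confirm that the interpreter matches the Python requirement column. The following is only a sketch: the `SUPPORTED` mapping is transcribed by hand from the table above, and the backend keys and the `python_ok` helper are hypothetical names, not part of AutoRT.

```python
import sys

# Python (major, minor) versions per platform, copied from the table above.
# The dictionary keys are illustrative labels, not AutoRT identifiers.
SUPPORTED = {
    "directx": {(3, 12)},
    "vulkan": {(3, 12)},
    "cuda": {(3, 8), (3, 9), (3, 10), (3, 11), (3, 12)},
}


def python_ok(backend, version=None):
    """Return True if the given (or running) interpreter is listed for the backend."""
    major, minor = tuple(version or sys.version_info)[:2]
    return (major, minor) in SUPPORTED[backend]


if __name__ == "__main__":
    for name in SUPPORTED:
        print(name, "ok" if python_ok(name) else "needs a different Python version")
```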

For CUDA, several equivalent Ubuntu >= 18.04 containers are listed below:

Docker Image: nvidia/cuda:12.0.1-cudnn8-devel-ubuntu18.04
Docker Image: nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04
Docker Image: nvidia/cuda:12.0.1-cudnn8-devel-ubuntu20.04
Docker Image: nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04
..

- Playground 1 - Benchmark your Windows Device:

Quick Test 1: Benchmark to evaluate device memory bandwidth over DirectX 12.

$ python.exe -m autort.utils.memtest
  ...
  [1000/1000] AutoRT Device Memory Bandwidth: (Actual ~= 468.12 GB/s) (Theoretical ~= 561.75 GB/s)

Quick Test 2: Benchmark to evaluate device FP32 performance over DirectX 12.

$ python.exe -m autort.utils.fp32test
  ...
  [5000/5000] AutoRT FP32 TFLOPS: (Actual ~= 9.84 TFLOPS) (Theoretical ~= 10.93 TFLOPS)
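Each benchmark reports an actual and a theoretical figure, so the achieved fraction of peak follows from a single division. The sketch below only redoes that arithmetic on the sample numbers printed above; the `efficiency` helper is a hypothetical name, not an AutoRT function.

```python
def efficiency(actual, theoretical):
    """Fraction of the theoretical peak that was actually measured."""
    return actual / theoretical


# Sample figures copied from the two benchmark outputs above.
mem = efficiency(468.12, 561.75)   # GB/s
fp32 = efficiency(9.84, 10.93)     # TFLOPS

print(f"memory bandwidth: {mem:.1%} of peak")   # memory bandwidth: 83.3% of peak
print(f"FP32 throughput:  {fp32:.1%} of peak")  # FP32 throughput:  90.0% of peak
```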

- Playground 2 - Running Pytorch2 over DirectX:

Quick Test 1: Create "custom operator" of your own in Pytorch 2.

Style-1: "AutoRT API Style" Custom Operator Generation:

>> import torch, autort
>> data = torch.arange(0, 10, dtype=torch.float32, device=autort.device())

>> f = autort.export(ir="sigmoid_f32[N] = 1 - 1 / (1 + data[N].call(strs.exp))", inputs=["data=float32[N:4096000]"], config="tune:5")
>> print(f(data))
tensor([0.5000, 0.7311, 0.8808, 0.9526, 0.9820, 0.9933, 0.9975, 0.9991, 0.9997, 0.9999])
>> print(autort.ops.sigmoid_f32(data))
tensor([0.5000, 0.7311, 0.8808, 0.9526, 0.9820, 0.9933, 0.9975, 0.9991, 0.9997, 0.9999])
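The IR string writes sigmoid as 1 - 1 / (1 + exp(x)), which is algebraically the same as the usual 1 / (1 + exp(-x)). The following plain-Python sketch checks that identity on the same inputs 0..9 without using AutoRT at all; `sigmoid_ir` and `sigmoid_ref` are illustrative names.

```python
import math


def sigmoid_ir(x):
    # Same algebra as the IR expression: 1 - 1 / (1 + exp(x))
    return 1.0 - 1.0 / (1.0 + math.exp(x))


def sigmoid_ref(x):
    # Textbook form of the logistic sigmoid
    return 1.0 / (1.0 + math.exp(-x))


values = [round(sigmoid_ir(float(n)), 4) for n in range(10)]
print(values)
# [0.5, 0.7311, 0.8808, 0.9526, 0.982, 0.9933, 0.9975, 0.9991, 0.9997, 0.9999]
```

The rounded values agree with the tensors printed above, apart from Python dropping trailing zeros.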

Style-2: "Command Line Style" Custom Operator Generation:

# First, create a custom sigmoid activation operator with 5 tuning steps:
$ autort --ir "sigmoid_f32[N] = 1 - 1 / (1 + data[N].call(strs.exp))" -i data=float32[N:4096000] -c "tune:5"

# Then, use it in Pytorch 2 session:
$ python.exe
>> import torch, autort
>>
>> data = torch.arange(0, 10, dtype=torch.float32, device=autort.device())
>> output = autort.ops.sigmoid_f32(data)
>> print(output)
tensor([0.5000, 0.7311, 0.8808, 0.9526, 0.9820, 0.9933, 0.9975, 0.9991, 0.9997,
        0.9999])
>> output = torch.nn.functional.sigmoid(data)
>> print(output)
tensor([0.5000, 0.7311, 0.8808, 0.9526, 0.9820, 0.9933, 0.9975, 0.9991, 0.9997,
        0.9999])

Quick Test 2: Demo of Sorting/MNIST/LLama over Pytorch2:

$ python.exe -m autort.examples.01_sort_even_first

Input : tensor([101, 102, 208,  99,   1, 127,  62,   8, 336, 336], dtype=torch.int32)
  (is_even) tensor([False,  True,  True, False, False, False,  True,  True,  True,  True])

Output: tensor([102, 208,  62,   8, 336, 336, 101,  99,   1, 127], dtype=torch.int32)
  (is_even) tensor([ True,  True,  True,  True,  True,  True, False, False, False, False])
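The input/output pair above is consistent with a stable partition: even values move to the front while relative order is kept within each group. A minimal CPU reference in plain Python is sketched below; it is not how the AutoRT example is implemented, and `even_first` is a hypothetical name.

```python
def even_first(values):
    # sorted() is stable, so elements with the same key keep their input order.
    # Key 0 (even) sorts before key 1 (odd).
    return sorted(values, key=lambda v: v % 2)


data = [101, 102, 208, 99, 1, 127, 62, 8, 336, 336]
result = even_first(data)
print(result)  # [102, 208, 62, 8, 336, 336, 101, 99, 1, 127]
```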

$ python.exe -m autort.examples.02_mnist
  ...
  step = 800, loss = 0.5159, accuracy = 87.50 %
  step = 900, loss = 0.5511, accuracy = 84.38 %
  step = 1000, loss = 0.2616, accuracy = 93.75 %
  ...

$ python.exe -m autort.examples.03_llama_tiny

What is that?"
"That is the sun," her mom said. "It gives us heat."
The little girl was amazed. She had never seen the heat before.
"Can we go outside and feel the sun?" she asked.
"Yes," her mother said.
...

$ python.exe -m autort.examples.05_llama2_7b_int4

How large is Atlantic Ocean?

The Atlantic Ocean is the second largest ocean on Earth, covering approximately 20% of the Earth's surface. ...

$ python.exe -m autort.examples.06_diffuser_no_opt

...
Image converted from `./samurai_nn.png` to `samurai_nn_diffused.png`..

If you like it, you are welcome to report issues or give stars, which encourages AutoRT to support more backends, more OS types, and more documentation. See More Information about Microsoft Contributing and Trademarks.

antares's Issues

Assertion error: SDK for `c-rocm_win64` is not configured correctly,

Hi!

I've installed the ROCm drivers on my laptop, which has a 6800M (gfx1031). I came across this repo while trying to see if I could get the various Tensile libraries compiled using WSL. I have both WSL1 (Ubuntu 20.04) and WSL2 (Ubuntu 22.04) with rocm-dev, rocm-core and rocm-hip-libraries installed.

I can see the dlls: amdhip64 and amdhik64 at System32 location.

When I try to run AMDGFX=gfx1031 BACKEND=c-rocm_win64 antares, I keep getting this error:
AssertionError: SDK for c-rocm_win64 is not configured correctly, please look into the error messages and reconfigure the corresponding environment.

How should I resolve it?

Thank you!

Incorrect compute kernel from evaluator on WSL2

Following the other issue about this (#269) I went to install Antares to hopefully get ROCm on WSL2 using Ubuntu 20.04, and it seems to not work.

When running sudo BACKEND=c-rocm_win64 make to install the ROCm backend on windows in WSL2, it tries to evaluate a custom kernel at the end and fails to do so.

The AMD HIP driver is present (C:\Windows\System32\amdhip64.dll) and when running sudo apt install rocm-dev shows it is already installed. Antares is visible in windows /mnt/c/Users/Colin/ubuntu_stuff/antares/

Here is the log during the evaluation: https://gist.github.com/3c77d7003a0a212d3f30abea8ee2b9d8

Should be noted that when running /opt/rocm/bin/rocminfo, it states: ROCk module is NOT loaded, possibly no GPU devices . AMD has closed all issues regarding WSL2 and this error message...

Kernel version is: Linux Colin-Desktop 5.10.16.3-microsoft-standard-WSL2 #1 SMP Fri Apr 2 22:23:49 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Is ROCm no longer supported by 0.9.x?

@ghostplant
I've tried to run ROCm on WSL and haven't been able to find a good way, but I finally found this project and saw a silver lining. I want to try version 0.9.x but can't find a whl that supports ROCm. I installed version 0.3.x instead: with BACKEND=c-rocm it reports that the GPU cannot be found, and with BACKEND=c-rocm_win64 it fails with "/home/root001/miniconda3/lib/python3.11/site-packages/antares_core/backends/c-rocm_win64/../../graph_evaluator/run_graph.cpp:14:29: error: ‘memalign’ was not declared in this scope
14 | void data_ptr = (void)memalign(256, length);". I don't know where to start fixing this error. Is there an official guidance document that describes the correct steps? 😒

Not an issue but a question due to lack of docs.

Hello there, is it possible to use this as a means to compile .h and .cpp HIP code on windows for creating GPU shader kernels? fatbins and cubins. I mean from pre-existing code to define the shaders.

The simplest way to explain it is Windows has no means to compile HIP c++ as far as I know, perhaps hiprtc workarounds, make some magic with oneAPI but so far this has looked the most promising.

If I could feed it amdhip64.dll in WSL and ask it to compile .cpp with provided headers and return fatbin to compile a renderer then I would be awestruck

[Error] error: ‘CHECK_EQ’ was not declared in this scope; did you mean ‘CHECK_OK’?

When I try to run this command "AMDGFX=gfx1031 BACKEND=c-rocm_win64 Antares torch-setup" to Setup Plugin for Pytorch, it returned an error: error: ‘CHECK_EQ’ was not declared in this scope; did you mean ‘CHECK_OK’?

There are also many other errors when compiling this.

My gcc version is 9.4.0 and Ubuntu is 20.04.1

These are some screenshots of the errors:

Running ROCm computations on Windows over AMD GPU

Please add proper documentation for Antares usage. If possible, some examples would be a great help for new people who want to test Antares.

[Help Request] How can Antares IR support stride size > 1 's Slice operation?

We ran into an issue while using Antares to tune a Swin Transformer model: it has a Slice operation whose steps can be larger than one.

From AntaresIR.md, it appears that the IR only supports a step size of 1. I made an ONNX model to test the Slice operation in Antares, following the ONNX operator documentation:

If the steps are [1, 1], the Antares IR is straightforward:

output0[N0, N1] = input0[N0 + 1, N1 + 0] where N0 in 1 , N1 in 3

But if the steps are [1, 2], it is hard to describe the compute logic:

output0[N0, N1] = input0[N0 + 1, N1 + 0] where N0 in 1 , N1 in {0, 2}

N1 should advance in steps of 2, but I can't find a way to express this.

Any suggestions for this case?
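For reference, the index arithmetic being asked about is output[n0, n1] = input[start0 + n0 * step0, start1 + n1 * step1], with each output index still running from 0. The plain-Python sketch below shows only that arithmetic on a hypothetical 2x3 input; whether Antares IR accepts a multiplied index of this form is not confirmed here, and `strided_slice` is an illustrative helper.

```python
def strided_slice(src, starts, steps, out_shape):
    """out[n0][n1] = src[starts[0] + n0 * steps[0]][starts[1] + n1 * steps[1]]"""
    return [
        [
            src[starts[0] + n0 * steps[0]][starts[1] + n1 * steps[1]]
            for n1 in range(out_shape[1])
        ]
        for n0 in range(out_shape[0])
    ]


src = [[0, 1, 2], [10, 11, 12]]  # hypothetical 2x3 input

# steps [1, 1]: N0 in 1, N1 in 3, as in the first IR line above
unit = strided_slice(src, starts=(1, 0), steps=(1, 1), out_shape=(1, 3))
# steps [1, 2]: N1 now covers only 2 outputs, reading source columns 0 and 2
skip = strided_slice(src, starts=(1, 0), steps=(1, 2), out_shape=(1, 2))

print(unit)  # [[10, 11, 12]]
print(skip)  # [[10, 12]]
```

The point of the sketch is that the output index stays dense (0, 1) while the step is folded into the source index.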

Benchmarks

Hi !
I was wondering if there were some benchmarks available to compare the performance of Antares with e.g. the scripts from https://github.com/microsoft/antares/tree/v0.3.x/frameworks/pytorch/examples across a set of backends that the user may not have at hand.

Particularly, I'd be interested to know how Antares convolution fares with respect to cuDNN.

Thanks !

This repo is missing important files

There are important files that Microsoft projects should all have that are not present in this repository. A pull request has been opened to add the missing file(s). When the pr is merged this issue will be closed automatically.

Microsoft teams can learn more about this effort and share feedback within the open source guidance available internally.

Merge this pull request

Can antares assign specified gpus for evaluation?

I have four GPUs in my environment. Each time I evaluate the CUDA backend, all four GPUs run the evaluator program, but I want Antares to use no more than two of them because I need the other GPUs to profile other programs.

So can Antares be restricted to specific GPUs for evaluation? I tried setting CUDA_VISIBLE_DEVICES, but it didn't work as expected...

will this project replace torch-directml?

Does ROCm for Windows ever exist?

Since the documentation says that Antares supports ROCm for Windows, I wonder whether such a thing exists at all. It has been officially declared that there are no plans for ROCm on Windows!

Fail to compile, when I use "AMDGFX=gfx1031 BACKEND=c-rocm_win64 antares"

Hey, there are my problems:

And when I input this:

It returns this:

And these are the installed packages:
rocm-clang-ocl/focal,now 0.5.0.50401-8420.04 amd64 [installed,automatic]
rocm-cmake/focal,now 0.8.0.50401-8420.04 amd64 [installed,automatic]
rocm-core/focal,now 5.4.1.50401-8420.04 amd64 [installed,automatic]
rocm-dbgapi/focal,now 0.68.0.50401-8420.04 amd64 [installed,automatic]
rocm-debug-agent/focal,now 2.0.3.50401-8420.04 amd64 [installed,automatic]
rocm-dev/focal,now 5.4.1.50401-8420.04 amd64 [installed]
rocm-device-libs/focal,now 1.0.0.50401-8420.04 amd64 [installed,automatic]
rocm-dkms/focal,now 5.4.1.50401-8420.04 amd64 [installed]
rocm-gdb/focal,now 12.1.50401-8420.04 amd64 [installed,automatic]
rocm-llvm/focal,now 15.0.0.22465.50401-8420.04 amd64 [installed,automatic]
rocm-ocl-icd/focal,now 2.0.0.50401-8420.04 amd64 [installed,automatic]
rocm-opencl-dev/focal,now 2.0.0.50401-8420.04 amd64 [installed]
rocm-opencl/focal,now 2.0.0.50401-8420.04 amd64 [installed,automatic]
rocm-smi-lib/focal,now 5.0.0.50401-8420.04 amd64 [installed,automatic]
rocm-utils/focal,now 5.4.1.50401-8420.04 amd64 [installed,automatic]
rocminfo/focal,now 1.0.0.50401-8420.04 amd64 [installed, automatic]

I succeed to use this:

Lack operator implementation for DirectX: torch.abs()

I was trying to run ComfyUI on antares and end up with the following message:

RuntimeError: 0 INTERNAL ASSERT FAILED at "C:\\Users\\weicu\\Desktop\\antares\\nextgen\\torch-setup\\autort-wheel\\autort_extensions.h":1593, please report a bug to PyTorch. __abs

My user is not 'weicu' so I believe that is a personal path.

I did not compile Antares myself; I installed it using 'Quick Installation of AutoRT', and it did work with the tests.

The full log is:

To see the GUI go to: http://127.0.0.1:8188
got prompt
model_type EPS
adm 2816
Using split attention in VAE
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
Using split attention in VAE
missing {'cond_stage_model.clip_l.text_projection', 'cond_stage_model.clip_l.logit_scale'}
left over keys: dict_keys(['cond_stage_model.clip_l.transformer.text_model.embeddings.position_ids'])
Requested to load SDXLClipModel
Loading 1 new model
Requested to load SDXL
Loading 1 new model
ERROR:root:!!! Exception during processing !!!
ERROR:root:Traceback (most recent call last):
File "C:\AI\ComfyUI\execution.py", line 155, in recursive_execute
output_data, output_ui = get_output_data(obj, input_data_all)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\AI\ComfyUI\execution.py", line 85, in get_output_data
return_values = map_node_over_list(obj, input_data_all, obj.FUNCTION, allow_interrupt=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\AI\ComfyUI\execution.py", line 78, in map_node_over_list
results.append(getattr(obj, func)(**slice_dict(input_data_all, i)))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\AI\ComfyUI\nodes.py", line 1300, in sample
return common_ksampler(model, seed, steps, cfg, sampler_name, scheduler, positive, negative, latent_image, denoise=denoise)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\AI\ComfyUI\nodes.py", line 1270, in common_ksampler
samples = comfy.sample.sample(model, noise, steps, cfg, sampler_name, scheduler, positive, negative, latent_image,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\AI\ComfyUI\comfy\sample.py", line 100, in sample
sampler = comfy.samplers.KSampler(real_model, steps=steps, device=model.load_device, sampler=sampler_name, scheduler=scheduler, denoise=denoise, model_options=model.model_options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\AI\ComfyUI\comfy\samplers.py", line 670, in __init__
self.set_steps(steps, denoise)
File "C:\AI\ComfyUI\comfy\samplers.py", line 691, in set_steps
self.sigmas = self.calculate_sigmas(steps).to(self.device)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\AI\ComfyUI\comfy\samplers.py", line 682, in calculate_sigmas
sigmas = calculate_sigmas_scheduler(self.model, self.scheduler, steps)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\AI\ComfyUI\comfy\samplers.py", line 635, in calculate_sigmas_scheduler
sigmas = normal_scheduler(model, steps)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\AI\ComfyUI\comfy\samplers.py", line 314, in normal_scheduler
start = s.timestep(s.sigma_max)
^^^^^^^^^^^^^^^^^^^^^^^
File "C:\AI\ComfyUI\comfy\model_sampling.py", line 76, in timestep
return dists.abs().argmin(dim=0).view(sigma.shape).to(sigma.device)
^^^^^^^^^^^
RuntimeError: 0 INTERNAL ASSERT FAILED at "C:\Users\weicu\Desktop\antares\nextgen\torch-setup\autort-wheel\autort_extensions.h":1593, please report a bug to PyTorch. __abs

Prompt executed in 9.44 seconds

Does one need access to the target HW?

In order to use Antares, does one need to have access to the target HW or is it enough to fill in one of those cfg files?
What is required by the targeted HW if one wants to optimize for it?

The residue of the last issue (#365)

Thank you!
Now I have succeeded in installing these and running "AMDGFX=gfx1031 BACKEND=c-rocm_win64 Antares".

Does this mean I have successfully configured related environment for PyTorch for ROCm?

Now, how can I test if I have done this correctly?
Command "rocminfo" still cannot work.

I also used "pip install" to install a ROCm version of PyTorch in an Anaconda virtual environment, and I got an error when calling "torch.cuda.is_available()". It returned:

/home/dragons/anaconda3/envs/PyTorch/lib/python3.9/site-packages/torch/cuda/__init__.py:88: UserWarning: HIP initialization: Unexpected error from hipGetDeviceCount(). Did you run some cuda functions before calling NumHipDevices() that might have already set an error? Error 101: hipErrorInvalidDevice (Triggered internally at ../c10/hip/HIPFunctions.cpp:110.)
return torch._C._cuda_getDeviceCount() > 0

Is this project based on AI? What is the goal of this project?

Hi,

I just came across this project and tried to run

# Search an efficient CUDA code for MatMul, using 2000 steps for trial:
BACKEND=c-cuda STEP=2000 COMPUTE_V1='- S = 512; einstein_v2(input_dict={"input0": {"dtype": "float32", "shape": [S, S]}, "input1": {"dtype": "float32", "shape": [S, S]}}, exprss="output0[N, M] +=! input0[N, K] * input1[K, M]")' antares

And it seems to generate and evaluate code on my GPU for 2000 runs.
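For orientation, the expression in the command above can be read as an ordinary matrix multiplication. The sketch below assumes that `+=!` denotes a sum reduction over K, the index that appears only on the right-hand side; it is a plain-Python reference for that reading, not generated Antares code, and `matmul_reference` is a hypothetical name.

```python
def matmul_reference(a, b):
    """output0[N, M] = sum over K of input0[N, K] * input1[K, M] (assumed reading)."""
    n, k, m = len(a), len(b), len(b[0])
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):          # N
        for j in range(m):      # M
            for r in range(k):  # K, the reduced index
                out[i][j] += a[i][r] * b[r][j]
    return out


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c = matmul_reference(a, b)
print(c)  # [[19.0, 22.0], [43.0, 50.0]]
```

Under this reading, the 2000 steps are trials of different kernel implementations of the same computation, each one evaluated on the GPU.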

I am not sure whether this project involves an LLM generating code or reinforcement learning for code optimization. Maybe a stupid question, but could someone briefly explain what the end goal of this project is?

How can Antares support a loop whose index doesn't start at 0?

I'm trying to use Antares to express a loop like this:

for(int64_t i=1; i < domain_size; ++i) {
      for(int64_t j=1; j < domain_size; ++j) {
        out(i,j) = flx(i-1,j) - flx(i,j) + fly(i,j-1) - fly(i,j);
      }
 }

The loop indices i and j don't start at 0.
flx[N-1,M].when([-1 + N >= 0], 0.0) is not a possible solution as it violates the semantics.
Any help to express this loop is appreciated !!!
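One generic way to reason about such a loop is to shift the indices: with n = i - 1 and m = j - 1, both start at 0 and every access stays in bounds without a guard. The plain-Python sketch below checks that the shifted form reproduces the original loop on the interior; it says nothing about what Antares IR accepts, and the function names and test data are hypothetical.

```python
def stencil_original(flx, fly, d):
    # Direct transcription of the C loop above; row 0 and column 0 stay untouched.
    out = [[0.0] * d for _ in range(d)]
    for i in range(1, d):
        for j in range(1, d):
            out[i][j] = flx[i - 1][j] - flx[i][j] + fly[i][j - 1] - fly[i][j]
    return out


def stencil_shifted(flx, fly, d):
    # n = i - 1 and m = j - 1, so both run over range(d - 1) starting at 0.
    return [
        [
            flx[n][m + 1] - flx[n + 1][m + 1] + fly[n + 1][m] - fly[n + 1][m + 1]
            for m in range(d - 1)
        ]
        for n in range(d - 1)
    ]


d = 4
flx = [[float(i * d + j) for j in range(d)] for i in range(d)]
fly = [[float((i + 1) * (j + 2)) for j in range(d)] for i in range(d)]

full = stencil_original(flx, fly, d)
inner = stencil_shifted(flx, fly, d)
# inner[n][m] corresponds to out(n + 1, m + 1) in the original loop.
print(all(inner[n][m] == full[n + 1][m + 1] for n in range(d - 1) for m in range(d - 1)))
```

The shifted result has shape (d - 1) x (d - 1), so placing it back at offset (1, 1) of the full output is a separate step.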

Usage with Rocm windows for hip code compilation and documentation

First, where is the documentation? After installation in WSL2, it told me the command antares didn't exist. Second, I have my kernel.hip.cpp and source .cpp files; how could I compile them? Do I need to install ROCm to compile for gfx1031, or can Antares compile that?

Is there any document for performance benchmark result vs pytorch2.1 compile mode?

Pytorch 2.1 compile mode supports fused kernels which updated the performance flag.
Is there any document comparing performance with Pytorch 2.1 compile mode, for example on 5 CNN models and 5 LLM models? A per-layer comparison would be even better.
That way, developers can decide whether or not to move to this software stack. Developers usually do not expect to run benchmarks themselves before baseline data is available.

[BUG] Tune a bert-base-fp16 failed

device: v100 16G
antares server startup command: BACKEND=c-cuda nohup antares rest-server > antares.log 2>&1 &

bert-base-fp16 microsoft nnfusion tuning progress:

 NOTE: the tuning progress (N/M) means that the current best kernel is searched at the N-th step of the total M steps. 

 |                   OP |                       NAME |     STATUS |   PROGRESS |     PERFORMANCE |
 | --------------------------------------------------------------------------------------------- |
 |                  Sum |                    Sum_355 |  completed |     81/1000  |   0.00260336 ms |
 |                  Dot |                        274 |  completed |    975/1000  |   0.00946126 ms |
 |                  Dot |                        359 |  completed |    630/1000  |    0.0138723 ms |
 |      Matched_Pattern |       Matched_Pattern_1913 |  completed |    657/1000  |   0.00821746 ms |
 |      Matched_Pattern |       Matched_Pattern_1914 |  completed |    905/1000  |   0.00294883 ms |
 |      Matched_Pattern |       Matched_Pattern_1915 |  completed |    689/1000  |   0.00302956 ms |
 |      Matched_Pattern |       Matched_Pattern_1916 |  completed |    890/1000  |   0.00960382 ms |
 |      Matched_Pattern |       Matched_Pattern_1917 |  completed |    946/1000  |   0.00693972 ms |
 |      Matched_Pattern |       Matched_Pattern_1918 |  completed |    748/1000  |   0.00690129 ms |
 |      Matched_Pattern |       Matched_Pattern_1920 |  completed |     56/1000  |   0.00569697 ms |
 |      Matched_Pattern |       Matched_Pattern_1984 |  completed |    825/1000  |   0.00819351 ms |
 |      Matched_Pattern |       Matched_Pattern_2059 |  completed |    730/1000  |   0.00573633 ms |
 |      Matched_Pattern |       Matched_Pattern_2060 |  completed |    680/1000  |   0.00690383 ms |
 |              Softmax |                        328 |  submitted |      0/1000  |           -1 ms |
 |      Matched_Pattern |       Matched_Pattern_1919 |  submitted |      0/1000  |           -1 ms |
 |      Matched_Pattern |       Matched_Pattern_1921 |  submitted |      0/1000  |           -1 ms |
 |      Matched_Pattern |       Matched_Pattern_2061 |  submitted |      0/1000  |           -1 ms |
 |      Matched_Pattern |       Matched_Pattern_2062 |  submitted |      0/1000  |           -1 ms |
 |      Matched_Pattern |       Matched_Pattern_2063 |  submitted |      0/1000  |           -1 ms |

antares log output:

/bin/bash: line 1: 21668 Aborted                 (core dumped) sh -c "cd /root/.cache/antares/cache/199 && BACKEND=c-cuda  /root/.cache/antares/evaluator.c-cuda my_kernel.cc --dev 2 --timeout 33.0"
/bin/bash: line 1: 21718 Aborted                 (core dumped) sh -c "cd /root/.cache/antares/cache/201 && BACKEND=c-cuda  /root/.cache/antares/evaluator.c-cuda my_kernel.cc --dev 0 --timeout 33.0"
/bin/bash: line 1: 21693 Aborted                 (core dumped) sh -c "cd /root/.cache/antares/cache/200 && BACKEND=c-cuda  /root/.cache/antares/evaluator.c-cuda my_kernel.cc --dev 3 --timeout 33.0"
/bin/bash: line 1: 21743 Aborted                 (core dumped) sh -c "cd /root/.cache/antares/cache/202 && BACKEND=c-cuda  /root/.cache/antares/evaluator.c-cuda my_kernel.cc --dev 1 --timeout 33.0"
.antares-module-tempfile.1.cu(77): error: more than one instance of overloaded function "tanh" matches the argument list:
            function "std::tanh(long double)"
            function "std::tanh(float)"
            argument types are: (__half)

.antares-module-tempfile.1.cu(78): error: more than one instance of overloaded function "tanh" matches the argument list:
            function "std::tanh(long double)"
            function "std::tanh(float)"
            argument types are: (__half)

.antares-module-tempfile.1.cu(79): error: more than one instance of overloaded function "tanh" matches the argument list:
            function "std::tanh(long double)"
            function "std::tanh(float)"
            argument types are: (__half)

.antares-module-tempfile.1.cu(80): error: more than one instance of overloaded function "tanh" matches the argument list:
            function "std::tanh(long double)"
            function "std::tanh(float)"
            argument types are: (__half)

.antares-module-tempfile.1.cu(81): error: more than one instance of overloaded function "tanh" matches the argument list:
            function "std::tanh(long double)"
            function "std::tanh(float)"
            argument types are: (__half)

.antares-module-tempfile.1.cu(82): error: more than one instance of overloaded function "tanh" matches the argument list:
            function "std::tanh(long double)"
            function "std::tanh(float)"
            argument types are: (__half)

.antares-module-tempfile.1.cu(83): error: more than one instance of overloaded function "tanh" matches the argument list:
            function "std::tanh(long double)"
            function "std::tanh(float)"
            argument types are: (__half)

.antares-module-tempfile.1.cu(84): error: more than one instance of overloaded function "tanh" matches the argument list:
            function "std::tanh(long double)"
            function "std::tanh(float)"
            argument types are: (__half)

.antares-module-tempfile.1.cu(85): error: more than one instance of overloaded function "tanh" matches the argument list:
            function "std::tanh(long double)"
            function "std::tanh(float)"
            argument types are: (__half)

.antares-module-tempfile.1.cu(86): error: more than one instance of overloaded function "tanh" matches the argument list:
            function "std::tanh(long double)"
            function "std::tanh(float)"
            argument types are: (__half)

.antares-module-tempfile.1.cu(87): error: more than one instance of overloaded function "tanh" matches the argument list:
            function "std::tanh(long double)"
            function "std::tanh(float)"
            argument types are: (__half)

.antares-module-tempfile.1.cu(88): error: more than one instance of overloaded function "tanh" matches the argument list:
            function "std::tanh(long double)"
            function "std::tanh(float)"
            argument types are: (__half)

12 errors detected in the compilation of ".antares-module-tempfile.1.cu".
terminate called after throwing an instance of 'std::runtime_error'
  what():  Failed to execute command: sh -c '/usr/local/cuda/bin/nvcc .antares-module-tempfile.1.cu --fatbin -O2 -gencode arch=compute_70,code=sm_70 -o .antares-module-tempfile.1.cu.out'

is it possible c-ocl_*_win64

Hi, thank you for your nice work.

for me and many others, opencl doesn't work on wsl2

microsoft/WSL#6372
microsoft/WSL#6951

so i am wondering if your c-rocm_win64 works for you but not for me like
#269
#284

i am wondering is it possible to extend antares opencl_*_win64
to use opencl from windows which definitely works fine for my amd/gpu and nvidia/gpu.

If possible, it will be great and will help many to go around opencl/wsl2 issues.

- einstein_v2(" output0[N, F, HO, WO] +=! input0[N, C, -0 + HO + KH, -0 + WO + KW] * input1[F, C, KH, KW] where HO in 1080, WO in 1920;  ", input_dict={ "input0" : { "dtype" : "float32", "shape" : [1, 48, 1080, 1920]} ,  "input1" : { "dtype" : "float32", "shape" : [16, 48, 1, 1]} }) ## @: plan/convfwd_nchw_v1

Antares terminated with the error info:

Traceback (most recent call last):
  File "./antares/antares_compiler.py", line 664, in <module>
    main_compute()
  File "./antares/antares_compiler.py", line 377, in main_compute
    task = autotvm.task.create("template_op", args=(), target=tvm_target)
  File "/opt/tvm/python/tvm/autotvm/task/task.py", line 457, in create
    sch, _ = ret.func(*args)
  File "/opt/tvm/python/tvm/autotvm/task/task.py", line 236, in __call__
    return self.fcustomized(*args, **kwargs)
  File "/antares/lang/generic.py", line 185, in get_template_op
    exec('import tvm; from tvm import topi; ' + program, globals())
  File "<string>", line 1, in <module>
  File "/antares/lang/generic.py", line 24, in einstein_v2
    ir = einstein_v2.emit_tvm_ir(exprss, input_dict, extra_outputs)
  File "/antares/lang/einstein_v2.py", line 473, in emit_tvm_ir
    return emit_tvm_ir_v2(exprss, input_dict, extra_outputs)
  File "/antares/lang/einstein_v2.py", line 389, in emit_tvm_ir_v2
    ast = parse_to_ast(s, inputs)
  File "/antares/lang/einstein_v2.py", line 262, in parse_to_ast
    _root = eval(rval)
  File "<string>", line 1, in <module>
  File "/antares/lang/einstein_v2.py", line 114, in __radd__
    return other.__radd__(self)
  File "/antares/lang/einstein_v2.py", line 114, in __radd__
    return other.__radd__(self)
  File "/antares/lang/einstein_v2.py", line 114, in __radd__
    return other.__radd__(self)
  [Previous line repeated 9979 more times]
  File "/antares/lang/einstein_v2.py", line 113, in __radd__
    other = OpTensor.parse(other)
RecursionError: maximum recursion depth exceeded