mlir-extensions's Introduction

Intel® Extension for MLIR

Intel® Extension for MLIR (IMEX) is a collection of MLIR dialects and passes from Intel for supporting MLIR lowering to Intel silicon (CPU, GPU, …). The goal of this project is to support the development of MLIR enhancements for upstream contribution and to provide a sandbox for validation independent of front-end frameworks. The current project scope includes:

  • Dialects and passes needed to lower and execute MLIR entry dialects (Linalg, CFG, etc.) on Intel GPU.
  • Wrapper libraries to interface with the Level Zero runtime and SYCL runtime supporting Intel GPU.
  • Other experimental dialects: NDArray, Dist

Requirements for building and development

For build

  • CMake >= 3.20.0
  • Ninja
  • doxygen (Optional for building docs)

Additional requirements for development

For building GPU runtime (Optional)

Installing Intel® software for general purpose GPU capability

Instructions here
https://dgpu-docs.intel.com/installation-guides/index.html

Getting the DPC++ compiler (for SYCL runtime)

Install the DPC++ compiler: instructions here
https://www.intel.com/content/www/us/en/developer/articles/tool/oneapi-standalone-components.html#dpcpp-cpp

Once DPC++ is installed, source the compiler environment variables:
source /PATH_TO/intel/oneapi/compiler/latest/env/vars.sh

Getting the Level Zero loader (for Level Zero runtime and SYCL runtime)

  • Build from source for a non-system-wide (local) install
git clone https://github.com/oneapi-src/level-zero.git
cd level-zero
cmake -G Ninja -B build -S . \
   -DCMAKE_BUILD_TYPE=Release \
   -DCMAKE_INSTALL_PREFIX=../level-zero-install
cmake --build build --target install

Example: Setting up requirements using Conda

conda create -n imex-dev -c intel -c defaults -c conda-forge pip">=21.2.4" pre-commit cmake clang-format lit doxygen

conda activate imex-dev

Setting up pre-commit

pre-commit install -f -c .pre-commit-config.yaml
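
Optionally, the configured hooks can be run over the whole tree once (standard pre-commit usage, not IMEX-specific):

pre-commit run --all-files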

Building IMEX

IMEX supports three different ways of building, depending on how LLVM is set up. Option 1 is in-tree (built as part of LLVM); options 2 and 3 are out-of-tree (built outside of LLVM).

Option 1: Build IMEX as an LLVM external project (in-tree)

IMEX can be treated like a sub-project of LLVM and built as part of LLVM by using an LLVM config option called LLVM_EXTERNAL_PROJECTS.

git clone https://github.com/intel/mlir-extensions.git
git clone https://github.com/llvm/llvm-project.git
cd llvm-project
git checkout `cat ../mlir-extensions/build_tools/llvm_version.txt`
git apply ../mlir-extensions/build_tools/patches/*
cmake -G Ninja -B build -S llvm \
   -DLLVM_ENABLE_PROJECTS=mlir \
   -DLLVM_BUILD_EXAMPLES=ON \
   -DLLVM_TARGETS_TO_BUILD="X86" \
   -DCMAKE_BUILD_TYPE=Release \
   -DLLVM_ENABLE_ASSERTIONS=ON \
   -DLLVM_EXTERNAL_PROJECTS="Imex" \
   -DLLVM_EXTERNAL_IMEX_SOURCE_DIR=../mlir-extensions

# For GPU support pass these CMake variables to enable the required runtime libraries
#  -DIMEX_ENABLE_L0_RUNTIME=1
#  -DIMEX_ENABLE_SYCL_RUNTIME=1
# Additionally, if using a non-system-wide Level Zero loader built from source
#  -DLEVEL_ZERO_DIR=/PATH_TO/level-zero-install

cmake --build build --target check-imex

Note: -DLLVM_INSTALL_UTILS=ON is not needed for this build, since all tests run using the FileCheck utility available in the build tree. An external lit is not needed either, since all tests run using llvm-lit from the build tree.

Option 2: Build IMEX with an installed LLVM (out-of-tree)

Note: Make sure to pass -DLLVM_INSTALL_UTILS=ON when building LLVM with CMake so that it installs FileCheck to the chosen installation prefix. Additionally, lit has to be installed separately, as it does not install with the rest of LLVM.

Make sure the installed LLVM is built from the git commit SHA stated in build_tools/llvm_version.txt and has all LLVM patches in build_tools/patches applied.
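
For reference, preparing such an LLVM install might look like the following sketch; the flags mirror the in-tree build above plus -DLLVM_INSTALL_UTILS=ON, and the install prefix ../llvm-install is only an illustrative choice:

git clone https://github.com/llvm/llvm-project.git
cd llvm-project
git checkout `cat ../mlir-extensions/build_tools/llvm_version.txt`
git apply ../mlir-extensions/build_tools/patches/*
cmake -G Ninja -B build -S llvm \
   -DLLVM_ENABLE_PROJECTS=mlir \
   -DLLVM_TARGETS_TO_BUILD="X86" \
   -DCMAKE_BUILD_TYPE=Release \
   -DLLVM_ENABLE_ASSERTIONS=ON \
   -DLLVM_INSTALL_UTILS=ON \
   -DCMAKE_INSTALL_PREFIX=../llvm-install
cmake --build build --target install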

cmake -G Ninja -B build -S . \
   -DMLIR_DIR=<PATH_TO_DIRECTORY_WITH_MLIRConfig.cmake> \
   -DLLVM_EXTERNAL_LIT=<PATH_TO_LIT> \
   -DCMAKE_BUILD_TYPE=Release

# For GPU support pass these CMake variables to enable the required runtime libraries
#  -DIMEX_ENABLE_L0_RUNTIME=1
#  -DIMEX_ENABLE_SYCL_RUNTIME=1
# Additionally, if using a non-system-wide Level Zero loader built from source
#  -DLEVEL_ZERO_DIR=/PATH_TO/level-zero-install

cmake --build build --target check-imex
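
As an illustration, an Option 2 configure line with both GPU runtimes enabled and a locally installed Level Zero loader might look like this (all paths are placeholders):

cmake -G Ninja -B build -S . \
   -DMLIR_DIR=<PATH_TO_DIRECTORY_WITH_MLIRConfig.cmake> \
   -DLLVM_EXTERNAL_LIT=<PATH_TO_LIT> \
   -DCMAKE_BUILD_TYPE=Release \
   -DIMEX_ENABLE_L0_RUNTIME=1 \
   -DIMEX_ENABLE_SYCL_RUNTIME=1 \
   -DLEVEL_ZERO_DIR=/PATH_TO/level-zero-install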

Option 3: Build IMEX with LLVM build tree (out-of-tree)

This is similar to option 2, but an LLVM build tree is used instead of an installed LLVM.

Before building LLVM, make sure to check out the git commit SHA stated in build_tools/llvm_version.txt and apply all LLVM patches in build_tools/patches.

cmake -G Ninja -B build -S . \
   -DMLIR_DIR=<PATH_TO_DIRECTORY_WITH_MLIRConfig.cmake> \
   -DCMAKE_BUILD_TYPE=Release

# For GPU support pass these CMake variables to enable the required runtime libraries
#  -DIMEX_ENABLE_L0_RUNTIME=1
#  -DIMEX_ENABLE_SYCL_RUNTIME=1
# Additionally, if using a non-system-wide Level Zero loader built from source
#  -DLEVEL_ZERO_DIR=/PATH_TO/level-zero-install

cmake --build build --target check-imex

Building docs

To build user documentation do

cmake --build build --target mlir-doc

It will render docs to the 'doc' directory.

To build code documentation, use '-DIMEX_INCLUDE_DOCS' when configuring with cmake and do

cd build
cmake --build . --target doc_doxygen

Adding a new dialect

# enter root directory of mlir-extensions
cd mlir-extensions
python scripts/add_dialect.py <name-of-new-dialect>

This will

  • generate directories IR and Transforms under include/imex/Dialect and lib/Dialect
  • Extend/Create cmake infrastructure with defaults
  • Create stub source files for IR and transforms
    • include/imex/Dialect/<name>/IR/<name>Ops.h
    • include/imex/Dialect/<name>/IR/<name>Ops.td
    • lib/Dialect/IR/<name>Ops.cpp
    • include/imex/Dialect/<name>/Transforms/Passes.h
    • include/imex/Dialect/<name>/Transforms/Passes.td
    • lib/Dialect/Transforms/PassDetail.h

Now, it's your turn to

  • Add your dialect and its transforms/passes to appropriate places in
    • include/imex/InitIMEXDialects.h
    • include/imex/InitIMEXPasses.h
    • lib/Conversion/IMEXPassDetail.h
  • Fill in what's marked with FIXME
  • The documentation of the dialect should go into the description fields in <name>Ops.td. At build time the description will be extracted and a file doc/<name>.md will be generated automatically. It will include descriptions of the dialect and operations in a standardized way.

Adding a new Conversion

# enter root directory of mlir-extensions
cd mlir-extensions
python scripts/add_conversion.py $name-of-source-dialect $name-of-target-dialect

This will

  • Let $conversion-name be "$name-of-source-dialectTo$name-of-target-dialect"
  • Add directories include/imex/Conversion/<conversion-name> and lib/Conversion/<conversion-name>
  • Extend/Create cmake infrastructure with defaults
  • Add declarations to header include/imex/Conversion/<conversion-name>/<conversion-name>.h
  • Put cpp definition stubs to lib/Conversion/<conversion-name>/<conversion-name>.cpp
  • Add conversion to include/imex/Conversion/IMEXPasses.td and include/imex/Conversion/IMEXPasses.h
  • Add a pass def stub to include/imex/Conversion/IMEXPasses.td and include/imex/Conversion/Passes.td

You will now have to

  • Fill in what's marked with FIXME in the above files
  • The documentation of the pass should go into the description field in Passes.td. At build time the description will be extracted and a file doc/Conversions.md will be generated automatically.
  • Write your Pattern rewriters

Run the lit tests

To run the FileCheck-based tests, run:

cmake --build build --target check-imex

Add '-v' to the above command-line to get verbose output.

Benchmarking

IMEX provides an initial set of benchmarks for studying its performance. To build these benchmarks, add the -DIMEX_ENABLE_BENCHMARK=ON option when configuring IMEX. The benchmark test cases and the script for running them will be generated under the build/benchmarks folder.
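
For example, for an out-of-tree build against an installed LLVM (option 2 above), the configure and build with benchmarks enabled might look like:

cmake -G Ninja -B build -S . \
   -DMLIR_DIR=<PATH_TO_DIRECTORY_WITH_MLIRConfig.cmake> \
   -DLLVM_EXTERNAL_LIT=<PATH_TO_LIT> \
   -DCMAKE_BUILD_TYPE=Release \
   -DIMEX_ENABLE_BENCHMARK=ON
cmake --build build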

Currently, IMEX provides benchmarks for the following 4 categories of operations:

Operation                               CPU   GPU
elementwise (relu and silu)             Yes   Yes
reduction (softmax)                     Yes   Yes
transpose (transpose)                   Yes   Yes
fusion (kInputFusion and kLoopFusion)   No    Yes

These test cases are mainly implemented using the linalg dialect; SPIR-V test cases for relu are also provided. Each test case is named following the pattern opname_shape_dtype.mlir.

How to run?

For simplicity, the bench_imex script is provided to run the benchmarks. It can take an MLIR file or a folder as input; in the latter case, it runs all test cases inside the folder. In addition, a runtime has to be selected via an option. It accepts one of the following three options:

  • -c for cpu runtime
  • -l for level-zero runtime (for INTEL GPU)
  • -s for sycl runtime (for INTEL GPU)

Example

# run a specific test case on CPU
 ./bench_imex -c relu/cpu/relu_1x160x160x120_f16.mlir

# run a set of test cases on GPU using sycl runtime
 ./bench_imex -s relu/gpu/

NOTE: if you are using -c, please use test cases under the cpu subfolder; similarly, if you are using -s or -l, please use test cases under the gpu subfolder. Otherwise, you may see unspecified errors or behavior.

How to customize the benchmark?

The IMEX benchmark suite is implemented using CMake templates and initially provides a limited set of shapes extracted from some production models, e.g., BERT and AlexNet.

  • ReLU: 1x160x160x120, 50x640x20x15, 512x640x20x15
  • SiLU: 1x1024x40x30, 50x20x3072, 512x640x20x15
  • Softmax: 1x2000, 16x2000, 64x2000, 256x2000, 1024x2000
  • Transpose: 128x136, 1024x1024, 16x96x96, 96x7x96
  • Reduce: 32x16x512x512

Users can extend the suite to evaluate more shapes by editing the shapes file in each subfolder (e.g., relu.shapes.in) and then rebuilding IMEX. Users can also add new data types, but support is currently limited to basic data types such as fp32, fp16, and int32.

Profiling kernel execution time

SYCL event

export IMEX_ENABLE_PROFILING=ON
Then run the test.
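
For example (the test case path is illustrative; use any benchmark generated under build/benchmarks):

export IMEX_ENABLE_PROFILING=ON
./bench_imex -s relu/gpu/relu_1x160x160x120_f16.mlir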

Trace tools

python {your_path}/imex_runner.py xxx -o test.mlir
mlir-translate test.mlir -mlir-to-llvmir -o test.ll
llc test.ll -filetype=obj -o test.o
clang++ test.o {path}/libmlir_runner_utils.so {path}/libmlir_c_runner_utils.so {path}/libsycl-runtime.so -no-pie -o test
ze_tracer ./test

Dist/NDArray Misc

  • Not using LoadOp. Instead, everything is a SubviewOp. Any size-1 dim must be annotated with static size 1.
    • right now we can only broadcast size-1 dims if their extent is statically known (to be 1)
  • Generally, rank reduction of SubviewOp needs overhaul.
    • Right now, no rank reduction is happening, and appropriate affine maps are generated accordingly
    • Without dist-coalesce, repartitioning of 0d arrays coming from a subview does not work correctly. Only the owning process will have the right data.
    • Even if SubviewOp resulted in rank-reduced arrays, we cannot view into our local data since the element might be remote.
    • To follow existing mechanisms (e.g. target parts) we'd basically need to duplicate the entire array.
    • We probably need some special feature to hold duplicates of slices with only one element on the distributed axis.
  • NDArray/dist tests can be run (without GPU tests etc.) using cmake --build . --target check-ndarray

License

This code is made available under the Apache License 2.0 with LLVM Exceptions. See the LICENSE.txt file for more details.


mlir-extensions's Issues

Add a forked version of SPIRV dialect to IMEX

Currently all development work for the IMEX SPIRV dialect is directly sent as patches to upstream. We should stage all our changes to SPIRV locally, and upstream patches when they are ready.

Doing so will ensure we can keep using our latest changes to SPIRV without having to pull them from upstream.

Vector Length Restriction in SPIR-V and subsequent 'spv' dialect

SPIR-V Specification has a vector length restriction of 2, 3, or 4 components (https://www.khronos.org/registry/SPIR-V/specs/1.0/SPIRV.html#_universal_validation_rules).

Vector types can only be parameterized as having 2, 3, or 4 components, plus any additional sizes enabled by capabilities.
Matrix types can only be parameterized with floating-point types.
Matrix types can only be parameterized as having only 2, 3, or 4 columns.

As per this validation rule, the MLIR 'spv' dialect also has a vector length restriction of 2, 3, 4, 8, or 16 (https://mlir.llvm.org/docs/Dialects/SPIR-V/#spvfunctioncall-mlirspirvfunctioncallop).
[Note: we think the 'spv' dialect allows vector lengths up to 16, compared to the SPIR-V specification's limit of 4, because the SPIR-V specification allows matrices of up to 4 columns (i.e., vectors), each with up to 4 elements; hence the 'spv' length restriction of 16.]

However, to utilize Intel vector-specific instructions (e.g., DPAS instructions), the vector size has to be increased, which is done via the capability extension 'VectorAnyINTEL' (OpCapability VectorAnyINTEL). The extension is part of the upstream LLVM trunk.

Unfortunately, in the MLIR 'spv' dialect the original vector length restriction (2, 3, 4, 8, 16) still exists in the validation, even when the necessary OpCapability is used, which makes it difficult to use the 'spv' dialect to represent code/instructions that require vectors longer than 16 elements. Since the SPIR-V specification allows us to extend the vector length using OpCapability, in the 'spv' dialect we can:
1. either remove the length restriction (or make it optional), or
2. allow a way to express the OpCapability-specific changes to the validation rule.

Remove broken Azure CI

The Azure CI has been broken for a while. There are no plans to revive it, and the team is moving to a better CI solution. Can the CI checks be removed?

InsertGpuAlloc pass cannot handle memrefs produced from a call op

In the InsertGpuAlloc pass, if a memref accessed inside the device kernel is produced by an mlir::func::CallOp, we emit an error and do not handle that case.

Something like this is not handled as of today: %8 below accesses %0, whose origin is a CallOp.

// declarations for the callees used below (added so the example is self-contained)
func.func private @addf(f32, f32) -> f32
func.func private @cast(memref<8xf32>) -> memref<?xf32>

func.func @alloc_buffer() -> memref<8xf32> {
  %buf = memref.alloc() : memref<8xf32>
  return %buf : memref<8xf32>
}

func.func @main() {
  %c8 = arith.constant 8 : index
  %c1 = arith.constant 1 : index
  %0 = func.call @alloc_buffer() : () -> memref<8xf32>
  %1 = memref.alloc() : memref<8xf32>
  %2 = memref.alloc() : memref<8xf32>
  gpu.launch blocks(%arg0, %arg1, %arg2) in (%arg6 = %c8, %arg7 = %c1, %arg8 = %c1) threads(%arg3, %arg4, %arg5) in (%arg9 = %c1, %arg10 = %c1, %arg11 = %c1) {
    %7 = gpu.block_id  x
    %8 = memref.load %0[%7] : memref<8xf32>
    %9 = memref.load %1[%7] : memref<8xf32>
    %10 = func.call @addf(%8, %9) : (f32, f32) -> f32
    memref.store %10, %2[%7] : memref<8xf32>
    %11 = func.call @cast(%2) : (memref<8xf32>) -> memref<?xf32>
    gpu.terminator
  }
  return
}

[L0 runner]: Dealloc issue

GPU dealloc causes a segfault and is currently disabled as a workaround.
This leads to a memory leak and needs to be fixed.

New execution engine for Python E2E pipeline

Historically we have relied on the Numba execution engine (MCJIT-based) for executing our MLIR code. The main issue is that Numba uses a different LLVM version (much older than ours); to handle this, we serialize the LLVM module to bitcode on our side using our LLVM and deserialize it on the Numba side using theirs. This approach has worked reasonably well up until now (although we already have a hack on our side to remove unsupported attributes), but it is going to break as upstream LLVM fully switches to opaque pointers, breaking bitcode compatibility.

A better approach would be to have a full execution engine on our side and use it to compile code ourselves. After that we can pass a function pointer to the compiled code to the Numba side (instead of passing the entire IR) and construct a simple wrapper there using llvmlite. This will allow us to keep most of the existing Numba integration as is.

Change the README to properly reflect the scope of the repo

The current README is tailored towards the prototype MLIR back end for the Numba compiler. The document should be updated to reflect the expanded scope of the project, i.e. to be a repo for all Intel-developed MLIR dialects and tools.

InsertGPUAlloc pass fails when it encounters memref.get_global

%39 = memref.get_global @__constant_500xf64 : memref<500xf64>
%40 = memref.alloc() : memref<500xf64>
memref.dealloc %40 : memref<500xf64>
%41 = memref.alloc() : memref<500xf64>
%c1_20 = arith.constant 1 : index
%c500_21 = arith.constant 500 : index
gpu.launch blocks(%arg2, %arg3, %arg4) in (%arg8 = %c500_21, %arg9 = %c1_20, %arg10 = %c1_20) threads(%arg5, %arg6, %arg7) in (%arg11 = %c1_20, %arg12 = %c1_20, %arg13 = %c1_20) {
  %60 = arith.muli %arg2, %c1 : index
  %61 = arith.addi %60, %c0 : index
  %62 = memref.load %38[%61] : memref<500xf64>
  %63 = memref.load %39[%61] : memref<500xf64>

The issue hits at %63 = memref.load %39[%61] : memref<500xf64>, when memref.load accesses the result of a memref.get_global. This needs to be handled in the InsertGPUAlloc pass, which currently assumes that all such aliases come from a memref::AllocOp.

IMEX needs a CLI tool similar to mlir-opt

The mlir-opt tool makes it easy to run any MLIR pass using a default global registry. The tool is useful to design FileCheck test cases.
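
For comparison, a typical mlir-opt invocation looks like the following (the --canonicalize pass is an upstream example used purely for illustration):

mlir-opt input.mlir --canonicalize -o output.mlir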

IMEX needs a similar tool. The following pieces need to be put in place before the tool can be put together:

  • Add a Passes.td and Passes.h to our Conversion folder. These two files are provided by the MLIR repo and are used to define the registration functions for all Conversion passes.
  • Add Passes.td and Passes.h to all IMEX Dialect that define passes.
  • Add an InitIMEXDialects.h header similar to the InitAllDialects.h in MLIR.
  • Add an InitIMEXPasses.h header similar to InitAllPasses.h in MLIR.
  • Add an imex-opt.cpp source similar to mlir-opt.cpp

Explore automation possibilities to auto-generate these files as most of the files will need mechanical maintenance.

PS: The automation scripts in our repo are meta^2 programming. Use Python to generate TableGen that then generates C++. Ergo, Python rules them all!!

Refactoring L0 runner and gpu dialect

Writing down requirements for refactoring the igpu dialect and L0 runner. The goal of this document is to keep track of tasks for refactoring the code base and to avoid duplicated effort.

  • Level Zero runner and igpu work will continue on the main branch until most of the following requirements/features are ready on the refactor branch.
  • Some of these may be applied to main branch during that period.
  • Any task with a checkmark against it indicates it is completed in the refactor branch.
  • The following list assumes "dpcomp" project-specific passes and dialects will be redone in the refactor branch.

Also ran some experiments with https://github.com/silee2/mlir-extensions/blob/cleanup/mlir/CMakeLists.txt to filter out files not needed for the Level Zero runner use case.

  • #236
  • #237
  • Remove dpcomp-opt and add imex-opt by using proper dialect and pass registration mechanism #233
  • #238
  • #239
  • #240
  • #241
  • #242
  • #243
  • Remove local FileCheck util and use -DLLVM_INSTALL_UTILS flag when building LLVM
  • Add IMEXConfig
  • #244
  • #245
  • #246
  • #247
  • #248
  • #249
  • Add support for various build situations: Ex) as an LLVM external project (LLVM_EXTERNAL_${project}_SOURCE_DIR), as part of an external build, standalone build with local LLVM/MLIR. See mlir-hlo project and #211 for reference.
  • #250
  • #251
  • #252
  • #253
  • #254

bf16 not supported in SPIR-V and 'spv' dialect

The SPIR-V specification, and consequently the 'spv' dialect, does not support the bf16 datatype.

Possible Solutions:

  • Solution 1 [Long-term goal]: Make bf16 part of the SPIR-V specification and the 'spv' dialect

  • Solution 2 [Current fix]: In the MLIR toolchain, use bf16 up until the 'spv' dialect, then convert 'bf16' to 'f16' (just change the type from 'bf16' to 'f16'). Since both datatypes use 16 bits to represent a number, if the underlying instruction accesses 'f16' data but treats it as 'bf16', it should still work.

Assumption/Caveat: This solution only works if data converted from 'bf16' to 'f16' is consumed only by instructions that expect 'bf16'; in other words, those instructions will reinterpret the passed 'f16' data as 'bf16'. We have to enforce this condition. Otherwise, we may incorrectly cast a 'bf16' to 'f16', which can have serious consequences ('bf16' has a much wider range of representable values, so casting 'bf16' to 'f16' is essentially a narrowing cast and may overflow).

Debugging support in Python E2E pipeline

Tasks:

  • Translate numba debug info to mlir location info
  • Make sure location info is correctly propagated through all passes (both ours and upstream)
  • Make sure the llvm dialect has proper support for llvm debug metadata
  • Translate location info into llvm debug metadata
  • Make sure the spirv dialect has proper support for debug metadata
  • Translate location info into spirv debug metadata
  • Make sure the execution engine properly installs debug hooks on win/lin

Build system improvements

Currently, IMEX is not installed as an SDK and the build system is integrated with numba-dpcomp. This issue tracks the changes that are needed to clean up the build system.

  • Move the installation of dpcomp_runtime and dpcomp_gpu_runtime into the top-level CMakeLists.txt. Currently, they are inside numba-dpcomp/numba-dpcomp/mlir-compiler/CMakeLists.txt.
  • Install all libraries and headers for IMEX. Only tools are installed right now.
  • Add an IMEXConfig.cmake module so that finding IMEX and including it in other projects becomes easier.
  • Convert numba-dpcomp into a separate CMake project.

[ptensor] Better IR for elementwise ops

Right now the actual operation of an elementwise op is represented as an integer value, which is not very readable.
Parsing/printing of elementwise ops should use a string representation instead.

cmake config needs fixing

Currently, on the refactor branch, the following things don't work correctly with cmake:

  • lib/cmake/imex/IMEXTargets.cmake is not being installed
    • Ideally we'd do this with LLVM's machinery, but that requires more than just set_property(GLOBAL PROPERTY IMEX_HAS_EXPORTS TRUE).
  • include headers are nested in a double 'include' dir (e.g. imex/include/include/imex/...)

Candidates for refactor

There are some passes in current main which can be moved to the refactor branch with low effort and/or upstreamed. Most of these passes are missing lit tests and docs and are only tested in the end-to-end Python pipeline.

  • PlierToScfPass - CFG to scf uplifting. It doesn't depend on our python dialect (there is currently an 'ArgOpLowering' python pattern here, but it should be moved to a separate pass). This is also a good candidate for upstreaming
  • PlierUtil dialect and llvm lowering. PlierUtil llvm lowering is currently done as part of the numba-specific llvm lowering. It should be split into a separate pass (this will also help with getting rid of the IMEX_ENABLE_NUMBA_HOTFIX workaround).
  • ParallelToTbbPass pass and related TBB runtime. Converts scf.parallel to util.parallel, which is not strictly tied to TBB specifically.
  • UntuplePass - expands tuples to individual values.
  • MakeSignlessPass - Converts all integer types to signless (we have signed/unsigned integers in the early pipeline)
  • MemorySSA analysis - store-to-load forwarding, dead store removal, and other optimizations on the memref dialect - requires a lot of preparation work and an RFC to upstream
  • TBD

[L0 runner]: Accept linalg dialect as input.

The L0 runner currently assumes the input to be a combination of the GPU, arith, and memref dialects. This setup requires a separate pass pipeline for handling inputs with the linalg dialect. The L0 runner should be extended to include a pass pipeline for lowering linalg to the GPU dialect.

Multiple GPU modules

The current code produces one gpu module for every function/kernel. We need one gpu module for all kernels under a given module, rather than creating a new gpu module for every new kernel.

Code refactoring

This Intel MLIR extensions (IMEX) repo was created to help reuse MLIR infrastructure and promote collaboration, so we would like to keep all dialects inside the "mlir" directory and follow a structure similar to the upstream llvm/mlir directory structure.

Please consider the following code refactoring.

  1. The Plier dialect should fully move to the "mlir" directory. The Plier conversion to Linalg/scf should move to the "mlir" directory. In general, mlir/lib/converter should be organized as upstream – each converter lives inside a subdirectory.
  2. The Plier conversion to linalg/std should not know about Numba constructs (PyFunc). The goal is that IMEX dialects should not know about upper-level SW constructs.
  3. Dpcomp_gpu_runtime should merge into the upcoming mlir/lib/dialect/intel_gpu.
  4. Dpcomp_runtime needs some more analysis.
  5. Tests and tools should move inside mlir; Tools/Dpcomp-opt should be renamed to imex-opt.

GPU Kernel Fusion

Currently too many GPU kernels are generated (e.g., one kernel per linalg op), and it would be good for performance to merge them together.

[ptensor] add missing elementwise binary operations

The following elementwise binary operations need an implementation in PTensorToLinalg(.cpp) (see array-API spec for the expected behavior of the operations):

  • ptensor::ATAN2
  • ptensor::LOGADDEXP
  • ptensor::LSHIFT
  • ptensor::MATMUL
  • ptensor::TRUE_DIVIDE
  • ptensor::BITWISE_AND
  • ptensor::BITWISE_LEFT_SHIFT
  • ptensor::BITWISE_OR
  • ptensor::BITWISE_RIGHT_SHIFT
  • ptensor::BITWISE_XOR
  • ptensor::EQUAL
  • ptensor::GREATER
  • ptensor::GREATER_EQUAL
  • ptensor::LESS
  • ptensor::LESS_EQUAL
  • ptensor::LOGICAL_AND
  • ptensor::LOGICAL_OR
  • ptensor::LOGICAL_XOR
  • ptensor::NOT_EQUAL

See also Operation Details->Elementwise Operations in the RFC.

Reference implementations can be found in the TOSA dialect and/or on main (numba_dpcomp/numba_dpcomp/mlir/numpy/funcs.py [Python]).

It is ok to let initial implementations operate on default PTensorTypes only, e.g. ignore device and distribution attributes of input tensors.

[ptensor] Add elementwise unary ops

  • Generally follow the implementation of elementwise binary ops
    • New struct EWUnaryOpId in PTensorOps.h (see below)
    • New PTensor Operations in PTensorOps.td as defined in Operation Details->Elementwise Operations in the RFC.
    • New conversion patterns in PTensorToLinalg.cpp (new classes and adding patterns to the RewritePatternSet in ConvertPTensorToLinalgPass)
  • See array-API spec for the expected behavior of the operations:
    • ABS
    • ACOS
    • ACOSH
    • ASIN
    • ASINH
    • ATAN
    • ATANH
    • BITWISE_INVERT
    • CEIL
    • COS
    • COSH
    • EXP
    • EXPM1
    • FLOOR
    • ISFINITE
    • ISINF
    • ISNAN
    • LOGICAL_NOT
    • LOG
    • LOG1P
    • LOG2
    • LOG10
    • NEGATIVE
    • POSITIVE
    • ROUND
    • SIGN
    • SIN
    • SINH
    • SQUARE
    • SQRT
    • TAN
    • TANH
    • TRUNC
    • ERF

Reference implementations can be found in the TOSA dialect and/or on main (numba_dpcomp/numba_dpcomp/mlir/numpy/funcs.py [Python]).

It is ok to let initial implementations operate on default PTensorTypes only, e.g. ignore device and distribution attributes of input tensors.

enum EWUnaryOpId : int {
    ABS,
    ACOS,
    ACOSH,
    ASIN,
    ASINH,
    ATAN,
    ATANH,
    BITWISE_INVERT,
    CEIL,
    COS,
    COSH,
    EXP,
    EXPM1,
    FLOOR,
    ISFINITE,
    ISINF,
    ISNAN,
    LOGICAL_NOT,
    LOG,
    LOG1P,
    LOG2,
    LOG10,
    NEGATIVE,
    POSITIVE,
    ROUND,
    SIGN,
    SIN,
    SINH,
    SQUARE,
    SQRT,
    TAN,
    TANH,
    TRUNC,
    ERF,
    EWUNARYOP_LAST
};

Jax test case failure on GPU pipeline

The test case under test/qoc/jit_matmul.338_linalg.mlir gives incorrect results on the GPU pipeline.

Output on GPU:
[[[-10,    0], 
  [-20,    0]]]
  
 Expected Output:
 [[[-10,    -10], 
  [-20,    -20]]]
