mlir-extensions's Introduction

Intel® Extension for MLIR

Intel® Extension for MLIR (IMEX) is a collection of MLIR dialects and passes from Intel for supporting MLIR lowering to Intel silicon (CPU, GPU, …). The goal of this project is to support the development of MLIR enhancements for upstream contribution and to provide a sandbox for validation independent of front-end frameworks. The current project scope includes:

  • Dialects and passes needed to lower and execute MLIR entry dialects (Linalg, CFG, etc.) on Intel GPU.
  • Wrapper libraries to interface with the Level Zero runtime and SYCL runtime supporting Intel GPU.
  • Other experimental dialects: NDArray, Dist

Requirements for building and development

For build

  • CMake >= 3.20.0
  • Ninja
  • doxygen (Optional for building docs)

Additional requirements for development

For building GPU runtime (Optional)

Installing Intel® software for general purpose GPU capability

Instructions here
https://dgpu-docs.intel.com/installation-guides/index.html

Getting the DPC++ compiler (for SYCL runtime)

Install the DPC++ compiler: instructions here
https://www.intel.com/content/www/us/en/developer/articles/tool/oneapi-standalone-components.html#dpcpp-cpp

Once DPC++ is installed, source the compiler environment variables:
source /PATH_TO/intel/oneapi/compiler/latest/env/vars.sh

Getting the Level Zero loader (for Level Zero runtime and SYCL runtime)

  • Build from source for a non-system-wide (local) install
git clone https://github.com/oneapi-src/level-zero.git
cd level-zero
cmake -G Ninja -B build -S . \
   -DCMAKE_BUILD_TYPE=Release \
   -DCMAKE_INSTALL_PREFIX=../level-zero-install
cmake --build build --target install

Example: Setting up requirements using Conda

conda create -n imex-dev -c intel -c defaults -c conda-forge pip">=21.2.4" pre-commit cmake clang-format lit doxygen

conda activate imex-dev

Setting up pre-commit

pre-commit install -f -c .pre-commit-config.yaml
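
Optionally, the configured hooks can be run over the whole tree once (standard pre-commit usage, not IMEX-specific):

pre-commit run --all-files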

Building IMEX

IMEX supports three different ways of building, depending on how LLVM is set up. Option 1 is in-tree (built as part of LLVM); options 2 and 3 are out-of-tree (built outside of LLVM).

Option 1: Build IMEX as an LLVM external project (in-tree)

IMEX can be treated like a sub-project of LLVM and built as part of LLVM by using an LLVM config option called LLVM_EXTERNAL_PROJECTS.

git clone https://github.com/intel/mlir-extensions.git
git clone https://github.com/llvm/llvm-project.git
cd llvm-project
git checkout `cat ../mlir-extensions/build_tools/llvm_version.txt`
git apply ../mlir-extensions/build_tools/patches/*
cmake -G Ninja -B build -S llvm \
   -DLLVM_ENABLE_PROJECTS=mlir \
   -DLLVM_BUILD_EXAMPLES=ON \
   -DLLVM_TARGETS_TO_BUILD="X86" \
   -DCMAKE_BUILD_TYPE=Release \
   -DLLVM_ENABLE_ASSERTIONS=ON \
   -DLLVM_EXTERNAL_PROJECTS="Imex" \
   -DLLVM_EXTERNAL_IMEX_SOURCE_DIR=../mlir-extensions

# For GPU support pass these CMake variables to enable the required runtime libraries
#  -DIMEX_ENABLE_L0_RUNTIME=1
#  -DIMEX_ENABLE_SYCL_RUNTIME=1
# Additionally, if using a non-system-wide Level Zero loader built from source
#  -DLEVEL_ZERO_DIR=/PATH_TO/level-zero-install

cmake --build build --target check-imex

Note: -DLLVM_INSTALL_UTILS=ON is not needed for this build, since all tests run using the FileCheck utility available in the build tree. An external lit is not needed either, since all tests run using llvm-lit from the build tree.

Option 2: Build IMEX with an installed LLVM (out-of-tree)

Note: Make sure to pass -DLLVM_INSTALL_UTILS=ON when building LLVM with CMake so that it installs FileCheck to the chosen installation prefix. Additionally, lit has to be installed separately, as it does not install with the rest of LLVM.

Make sure the installed LLVM is built from the git commit SHA stated in build_tools/llvm_version.txt and has all LLVM patches in build_tools/patches applied.
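
For reference, preparing such an LLVM install might look like the following sketch; the flags mirror the in-tree build above plus -DLLVM_INSTALL_UTILS=ON, and the install prefix ../llvm-install is only an illustrative choice:

git clone https://github.com/llvm/llvm-project.git
cd llvm-project
git checkout `cat ../mlir-extensions/build_tools/llvm_version.txt`
git apply ../mlir-extensions/build_tools/patches/*
cmake -G Ninja -B build -S llvm \
   -DLLVM_ENABLE_PROJECTS=mlir \
   -DLLVM_TARGETS_TO_BUILD="X86" \
   -DCMAKE_BUILD_TYPE=Release \
   -DLLVM_ENABLE_ASSERTIONS=ON \
   -DLLVM_INSTALL_UTILS=ON \
   -DCMAKE_INSTALL_PREFIX=../llvm-install
cmake --build build --target install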

cmake -G Ninja -B build -S . \
   -DMLIR_DIR=<PATH_TO_DIRECTORY_WITH_MLIRConfig.cmake> \
   -DLLVM_EXTERNAL_LIT=<PATH_TO_LIT> \
   -DCMAKE_BUILD_TYPE=Release

# For GPU support pass these CMake variables to enable the required runtime libraries
#  -DIMEX_ENABLE_L0_RUNTIME=1
#  -DIMEX_ENABLE_SYCL_RUNTIME=1
# Additionally, if using a non-system-wide Level Zero loader built from source
#  -DLEVEL_ZERO_DIR=/PATH_TO/level-zero-install

cmake --build build --target check-imex
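
As an illustration, an Option 2 configure line with both GPU runtimes enabled and a locally installed Level Zero loader might look like this (all paths are placeholders):

cmake -G Ninja -B build -S . \
   -DMLIR_DIR=<PATH_TO_DIRECTORY_WITH_MLIRConfig.cmake> \
   -DLLVM_EXTERNAL_LIT=<PATH_TO_LIT> \
   -DCMAKE_BUILD_TYPE=Release \
   -DIMEX_ENABLE_L0_RUNTIME=1 \
   -DIMEX_ENABLE_SYCL_RUNTIME=1 \
   -DLEVEL_ZERO_DIR=/PATH_TO/level-zero-install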

Option 3: Build IMEX with LLVM build tree (out-of-tree)

This is similar to option 2, but an LLVM build tree is used instead of an installed LLVM.

Before building LLVM, make sure to check out the git commit SHA stated in build_tools/llvm_version.txt and apply all LLVM patches in build_tools/patches.

cmake -G Ninja -B build -S . \
   -DMLIR_DIR=<PATH_TO_DIRECTORY_WITH_MLIRConfig.cmake> \
   -DCMAKE_BUILD_TYPE=Release

# For GPU support pass these CMake variables to enable the required runtime libraries
#  -DIMEX_ENABLE_L0_RUNTIME=1
#  -DIMEX_ENABLE_SYCL_RUNTIME=1
# Additionally, if using a non-system-wide Level Zero loader built from source
#  -DLEVEL_ZERO_DIR=/PATH_TO/level-zero-install

cmake --build build --target check-imex

Building docs

To build user documentation do

cmake --build build --target mlir-doc

It will render docs to the 'doc' directory.

To build code documentation, use '-DIMEX_INCLUDE_DOCS' when configuring with cmake and do

cd build
cmake --build . --target doc_doxygen

Adding a new dialect

# enter root directory of mlir-extensions
cd mlir-extensions
python scripts/add_dialect.py <name-of-new-dialect>

This will

  • generate directories IR and Transforms under include/imex/Dialect and lib/Dialect
  • Extend/Create cmake infrastructure with defaults
  • Create stub source files for IR and transforms
    • include/imex/Dialect/<name>/IR/<name>Ops.h
    • include/imex/Dialect/<name>/IR/<name>Ops.td
    • lib/Dialect/IR/<name>Ops.cpp
    • include/imex/Dialect/<name>/Transforms/Passes.h
    • include/imex/Dialect/<name>/Transforms/Passes.td
    • lib/Dialect/Transforms/PassDetail.h

Now, it's your turn to

  • Add your dialect and its transforms/passes to appropriate places in
    • include/imex/InitIMEXDialects.h
    • include/imex/InitIMEXPasses.h
    • lib/Conversion/IMEXPassDetail.h
  • Fill in what's marked with FIXME
  • The documentation of the dialect should go into the description fields in <name>Ops.td. At build time the description will be extracted and a file doc/<name>.md will be generated automatically. It will include descriptions of the dialect and operations in a standardized way.

Adding a new Conversion

# enter root directory of mlir-extensions
cd mlir-extensions
python scripts/add_conversion.py $name-of-source-dialect $name-of-target-dialect

This will

  • Let $conversion-name be "$name-of-source-dialectTo$name-of-target-dialect"
  • Add directories include/imex/Conversion/<conversion-name> and lib/Conversion/<conversion-name>
  • Extend/Create cmake infrastructure with defaults
  • Add declarations to header include/imex/Conversion/<conversion-name>/<conversion-name>.h
  • Put cpp definition stubs to lib/Conversion/<conversion-name>/<conversion-name>.cpp
  • Add conversion to include/imex/Conversion/IMEXPasses.td and include/imex/Conversion/IMEXPasses.h
  • Add a pass def stub to include/imex/Conversion/IMEXPasses.td and include/imex/Conversion/Passes.td

You will now have to

  • Fill in what's marked with FIXME in the above files
  • The documentation of the pass should go into the description field in Passes.td. At build time the description will be extracted and a file doc/Conversions.md will be generated automatically.
  • Write your Pattern rewriters

Run the lit tests

To run the FileCheck-based tests, run:

cmake --build build --target check-imex

Add '-v' to the above command-line to get verbose output.

Benchmarking

IMEX provides an initial set of benchmarks for studying its performance. To build these benchmarks, add the -DIMEX_ENABLE_BENCHMARK=ON option when configuring IMEX. The benchmark test cases and the script for running them will be generated under the build/benchmarks folder.
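
For example, for an out-of-tree build against an installed LLVM (option 2 above), the configure and build with benchmarks enabled might look like:

cmake -G Ninja -B build -S . \
   -DMLIR_DIR=<PATH_TO_DIRECTORY_WITH_MLIRConfig.cmake> \
   -DLLVM_EXTERNAL_LIT=<PATH_TO_LIT> \
   -DCMAKE_BUILD_TYPE=Release \
   -DIMEX_ENABLE_BENCHMARK=ON
cmake --build build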

Currently, IMEX provides benchmarks for the following 4 categories of operations:

Operation                               CPU   GPU
elementwise (relu and silu)             Yes   Yes
reduction (softmax)                     Yes   Yes
transpose (transpose)                   Yes   Yes
fusion (kInputFusion and kLoopFusion)   No    Yes

These test cases are mainly implemented using the linalg dialect; SPIR-V test cases for relu are also provided. Each test case is named following the pattern opname_shape_dtype.mlir.

How to run?

For simplicity, the bench_imex script is provided to run the benchmarks. It can take an MLIR file or a folder as input; in the latter case, it runs all test cases inside the folder. In addition, a runtime has to be selected via an option. It accepts one of the following three options:

  • -c for cpu runtime
  • -l for level-zero runtime (for INTEL GPU)
  • -s for sycl runtime (for INTEL GPU)

Example

# run a specific test case on CPU
 ./bench_imex -c relu/cpu/relu_1x160x160x120_f16.mlir

# run a set of test cases on GPU using sycl runtime
 ./bench_imex -s relu/gpu/

NOTE: if you are using -c, please use test cases under the cpu subfolder; similarly, if you are using -s or -l, please use test cases under the gpu subfolder. Otherwise, you may see unspecified errors or behavior.

How to customize the benchmark?

The IMEX benchmark suite is implemented using CMake templates and initially provides a limited set of shapes extracted from some production models, e.g., BERT and AlexNet.

  • ReLU: 1x160x160x120, 50x640x20x15, 512x640x20x15
  • SiLU: 1x1024x40x30, 50x20x3072, 512x640x20x15
  • Softmax: 1x2000, 16x2000, 64x2000, 256x2000, 1024x2000
  • Transpose: 128x136, 1024x1024, 16x96x96, 96x7x96
  • Reduce: 32x16x512x512

Users can extend the suite to evaluate more shapes by editing the shapes file in each subfolder (e.g., relu.shapes.in) and then rebuilding IMEX. Users can also add new data types, but support is currently limited to basic data types such as fp32, fp16, and int32.

Profiling kernel execution time

SYCL event

export IMEX_ENABLE_PROFILING=ON
Then run the test.
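
For example (the test case path is illustrative; use any benchmark generated under build/benchmarks):

export IMEX_ENABLE_PROFILING=ON
./bench_imex -s relu/gpu/relu_1x160x160x120_f16.mlir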

Trace tools

python {your_path}/imex_runner.py xxx -o test.mlir
mlir-translate test.mlir -mlir-to-llvmir -o test.ll
llc test.ll -filetype=obj -o test.o
clang++ test.o {path}/libmlir_runner_utils.so {path}/libmlir_c_runner_utils.so {path}/libsycl-runtime.so -no-pie -o test
ze_tracer ./test

Dist/NDArray Misc

  • Not using LoadOp. Instead, everything is a SubviewOp. Any size-1 dim must be annotated with static size 1.
    • right now we can only broadcast size-1 dims if their extent is statically known (to be 1)
  • Generally, rank reduction of SubviewOp needs overhaul.
    • Right now, no rank reduction is happening, and appropriate affine maps are generated accordingly
    • Without dist-coalesce, repartitioning of 0d arrays coming from a subview does not work correctly. Only the owning process will have the right data.
    • Even if SubviewOp resulted in rank-reduced arrays, we cannot view into our local data since the element might be remote.
    • To follow existing mechanisms (e.g. target parts) we'd basically need to duplicate the entire array.
    • We probably need some special feature to hold duplicates of slices with only one element on the distributed axis.
  • NDArray/dist tests can be run (without GPU tests etc.) using cmake --build . --target check-ndarray

License

This code is made available under the Apache License 2.0 with LLVM Exceptions. See the LICENSE.txt file for more details.


mlir-extensions's Issues

Add a forked version of SPIRV dialect to IMEX

Currently all development work for the IMEX SPIRV dialect is directly sent as patches to upstream. We should stage all our changes to SPIRV locally, and upstream patches when they are ready.

Doing so will ensure we can keep using our latest changes to SPIRV without having to pull them from upstream.

Vector Length Restriction in SPIR-V and subsequent 'spv' dialect

SPIR-V Specification has a vector length restriction of 2, 3, or 4 components (https://www.khronos.org/registry/SPIR-V/specs/1.0/SPIRV.html#_universal_validation_rules).

Vector types can only be parameterized as having 2, 3, or 4 components, plus any additional sizes enabled by capabilities.
Matrix types can only be parameterized with floating-point types.
Matrix types can only be parameterized as having only 2, 3, or 4 columns.

As per this validation rule, the MLIR 'spv' dialect also has a vector length restriction of 2, 3, 4, 8, or 16 (https://mlir.llvm.org/docs/Dialects/SPIR-V/#spvfunctioncall-mlirspirvfunctioncallop).
[Note: we think the 'spv' dialect allows vector lengths up to 16, compared to the SPIR-V specification's limit of 4, because the SPIR-V specification allows matrices of up to 4 columns (i.e., vectors), each with up to 4 elements; hence the 'spv' length restriction of 16.]

However, to utilize Intel vector-specific instructions (e.g., DPAS instructions), the vector size has to be increased, which is done via the capability extension 'VectorAnyINTEL' (OpCapability VectorAnyINTEL). The extension is part of the upstream LLVM trunk.

Unfortunately, in the MLIR 'spv' dialect the original vector length restriction (2, 3, 4, 8, 16) still exists in the validation, even when the necessary OpCapability is used, which makes it difficult to use the 'spv' dialect to represent code/instructions that require vectors longer than 16 elements. Since the SPIR-V specification allows us to extend the vector length using OpCapability, in the 'spv' dialect we can:
1. either remove the length restriction (or make it optional), or
2. allow a way to express the OpCapability-specific changes to the validation rule.

Remove broken Azure CI

The Azure CI has been broken for a while. There are no plans to revive it, and the team is moving to a better CI solution. Can the CI checks be removed?

InsertGpuAlloc pass cannot handle memrefs produced from a call op

In the InsertGpuAlloc pass, if a memref accessed inside the device kernel is produced by an mlir::func::CallOp, we emit an error and do not handle that case.

Something like this is not handled as of today: %8 below accesses %0, whose origin is a CallOp.

// declarations for the callees used below (added so the example is self-contained)
func.func private @addf(f32, f32) -> f32
func.func private @cast(memref<8xf32>) -> memref<?xf32>

func.func @alloc_buffer() -> memref<8xf32> {
  %buf = memref.alloc() : memref<8xf32>
  return %buf : memref<8xf32>
}

func.func @main() {
  %c8 = arith.constant 8 : index
  %c1 = arith.constant 1 : index
  %0 = func.call @alloc_buffer() : () -> memref<8xf32>
  %1 = memref.alloc() : memref<8xf32>
  %2 = memref.alloc() : memref<8xf32>
  gpu.launch blocks(%arg0, %arg1, %arg2) in (%arg6 = %c8, %arg7 = %c1, %arg8 = %c1) threads(%arg3, %arg4, %arg5) in (%arg9 = %c1, %arg10 = %c1, %arg11 = %c1) {
    %7 = gpu.block_id  x
    %8 = memref.load %0[%7] : memref<8xf32>
    %9 = memref.load %1[%7] : memref<8xf32>
    %10 = func.call @addf(%8, %9) : (f32, f32) -> f32
    memref.store %10, %2[%7] : memref<8xf32>
    %11 = func.call @cast(%2) : (memref<8xf32>) -> memref<?xf32>
    gpu.terminator
  }
  return
}

[L0 runner]: Dealloc issue

GPU dealloc causes a segfault and is currently disabled as a workaround.
This leads to a memory leak and needs to be fixed.

New execution engine for Python E2E pipeline

Historically we have relied on the Numba execution engine (MCJIT-based) for executing our MLIR code. The main issue is that Numba uses a different LLVM version (much older than ours); to handle this, we serialize the LLVM module to bitcode on our side using our LLVM and deserialize it on the Numba side using theirs. This approach has worked reasonably well up until now (although we already have a hack on our side to remove unsupported attributes), but it is going to break as upstream LLVM fully switches to opaque pointers, breaking bitcode compatibility.

A better approach would be to have a full execution engine on our side and use it to compile code ourselves. After that we can pass a function pointer to the compiled code to the Numba side (instead of passing the entire IR) and construct a simple wrapper there using llvmlite. This will allow us to keep most of the existing Numba integration as is.

Change the README to properly reflect the scope of the repo

The current README is tailored towards the prototype MLIR back end for the Numba compiler. The document should be updated to reflect the expanded scope of the project, i.e. to be a repo for all Intel-developed MLIR dialects and tools.

InsertGPUAlloc pass fails when it encounters memref.get_global

%39 = memref.get_global @__constant_500xf64 : memref<500xf64>
%40 = memref.alloc() : memref<500xf64>
memref.dealloc %40 : memref<500xf64>
%41 = memref.alloc() : memref<500xf64>
%c1_20 = arith.constant 1 : index
%c500_21 = arith.constant 500 : index
gpu.launch blocks(%arg2, %arg3, %arg4) in (%arg8 = %c500_21, %arg9 = %c1_20, %arg10 = %c1_20) threads(%arg5, %arg6, %arg7) in (%arg11 = %c1_20, %arg12 = %c1_20, %arg13 = %c1_20) {
  %60 = arith.muli %arg2, %c1 : index
  %61 = arith.addi %60, %c0 : index
  %62 = memref.load %38[%61] : memref<500xf64>
  %63 = memref.load %39[%61] : memref<500xf64>

The issue hits at %63 = memref.load %39[%61] : memref<500xf64>, when memref.load accesses the result of a memref.get_global. This needs to be handled in the InsertGPUAlloc pass, which currently assumes that all such aliases come from a memref::AllocOp.

IMEX needs a CLI tool similar to mlir-opt

The mlir-opt tool makes it easy to run any MLIR pass using a default global registry. The tool is useful to design FileCheck test cases.
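
For comparison, a typical mlir-opt invocation looks like the following (the --canonicalize pass is an upstream example used purely for illustration):

mlir-opt input.mlir --canonicalize -o output.mlir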

IMEX needs a similar tool. The following pieces need to be put in place before the tool can be put together:

  • Add a Passes.td and Passes.h to our Conversion folder. These two files are provided by the MLIR repo and are used to define the registration functions for all Conversion passes.
  • Add Passes.td and Passes.h to all IMEX Dialect that define passes.
  • Add an InitIMEXDialects.h header similar to the InitAllDialects.h in MLIR.
  • Add an InitIMEXPasses.h header similar to InitAllPasses.h in MLIR.
  • Add an imex-opt.cpp source similar to mlir-opt.cpp

Explore automation possibilities to auto-generate these files as most of the files will need mechanical maintenance.

PS: The automation scripts in our repo are meta^2 programming. Use Python to generate TableGen that then generates C++. Ergo, Python rules them all!!

Refactoring L0 runner and gpu dialect

Writing down requirements for refactoring the igpu dialect and L0 runner. The goal of this document is to keep track of tasks for refactoring the code base and to avoid duplicated effort.

  • Level Zero runner and igpu work will continue on the main branch until most of the following requirements/features are ready on the refactor branch.
  • Some of these may be applied to main branch during that period.
  • Any task with a checkmark against it indicates it is completed in the refactor branch.
  • The following list assumes "dpcomp" project-specific passes and dialects will be redone in the refactor branch.

Also ran some experiments with https://github.com/silee2/mlir-extensions/blob/cleanup/mlir/CMakeLists.txt to filter out files not needed for the Level Zero runner use case.

  • #236
  • #237
  • Remove dpcomp-opt and add imex-opt by using proper dialect and pass registration mechanism #233
  • #238
  • #239
  • #240
  • #241
  • #242
  • #243
  • Remove local FileCheck util and use -DLLVM_INSTALL_UTILS flag when building LLVM
  • Add IMEXConfig
  • #244
  • #245
  • #246
  • #247
  • #248
  • #249
  • Add support for various build situations: Ex) as an LLVM external project (LLVM_EXTERNAL_${project}_SOURCE_DIR), as part of an external build, standalone build with local LLVM/MLIR. See mlir-hlo project and #211 for reference.
  • #250
  • #251
  • #252
  • #253
  • #254

bf16 not supported in SPIR-V and 'spv' dialect

The SPIR-V specification, and consequently the 'spv' dialect, does not support the bf16 datatype.

Possible Solutions:

  • Solution 1 [Long-term goal]: Make bf16 part of the SPIR-V specification and the 'spv' dialect

  • Solution 2 [Current fix]: In the MLIR toolchain, use bf16 up until the 'spv' dialect, then convert 'bf16' to 'f16' (just change the type from 'bf16' to 'f16'). Since both datatypes use 16 bits to represent a number, if the underlying instruction accesses 'f16' data but treats it as 'bf16', it should still work.

Assumption/Caveat: This solution only works if data converted from 'bf16' to 'f16' is consumed only by instructions that expect 'bf16'; in other words, those instructions will reinterpret the passed 'f16' data as 'bf16'. We have to enforce this condition. Otherwise, we may incorrectly cast a 'bf16' to 'f16', which can have serious consequences ('bf16' has a much wider range of representable values, so casting 'bf16' to 'f16' is essentially a narrowing cast and may overflow).

Debugging support in Python E2E pipeline

Tasks:

  • Translate numba debug info to mlir location info
  • Make sure location info is correctly propagated through all passes (both ours and upstream)
  • Make sure the llvm dialect has proper support for llvm debug metadata
  • Translate location info into llvm debug metadata
  • Make sure the spirv dialect has proper support for debug metadata
  • Translate location info into spirv debug metadata
  • Make sure the execution engine properly installs debug hooks on win/lin

Build system improvements

Currently, IMEX is not installed as an SDK and the build system is integrated with numba-dpcomp. This issue tracks the changes that are needed to clean up the build system.

  • Move the installation of dpcomp_runtime and dpcomp_gpu_runtime into the top-level CMakeLists.txt. Currently, they are inside numba-dpcomp/numba-dpcomp/mlir-compiler/CMakeLists.txt.
  • Install all libraries and headers for IMEX. Only tools are installed right now.
  • Add an IMEXConfig.cmake module so that finding IMEX and including it in other projects becomes easier.
  • Convert numba-dpcomp into a separate CMake project.

[ptensor] Better IR for elementwise ops

Right now the actual operation of an elementwise op is represented as an integer value, which is not very readable.
Parsing/printing of elementwise ops should use a string representation instead.

cmake config needs fixing

Currently, on the refactor branch, the following things don't work correctly with cmake:

  • lib/cmake/imex/IMEXTargets.cmake is not being installed
    • Ideally we'd do this with LLVM's machinery, but that requires more than just set_property(GLOBAL PROPERTY IMEX_HAS_EXPORTS TRUE).
  • include headers are nested in a double 'include' dir (e.g. imex/include/include/imex/...)

Candidates for refactor

There are some passes in current main which can be moved to the refactor branch with low effort and/or upstreamed. Most of these passes are missing lit tests and docs and are only tested in the end-to-end Python pipeline.

  • PlierToScfPass - CFG to scf uplifting. It doesn't depend on our python dialect (there is currently an 'ArgOpLowering' python pattern here, but it should be moved to a separate pass). This is also a good candidate for upstreaming
  • PlierUtil dialect and llvm lowering. PlierUtil llvm lowering is currently done as part of the numba-specific llvm lowering. It should be split into a separate pass (this will also help with getting rid of the IMEX_ENABLE_NUMBA_HOTFIX workaround).
  • ParallelToTbbPass pass and related TBB runtime. Converts scf.parallel to util.parallel, which is not strictly tied to TBB specifically.
  • UntuplePass - expands tuples to individual values.
  • MakeSignlessPass - Converts all integer types to signless (we have signed/unsigned integers in the early pipeline)
  • MemorySSA analysis - store-to-load forwarding, dead store removal, and other optimizations on the memref dialect - requires a lot of preparation work and an RFC to upstream
  • TBD

[L0 runner]: Accept linalg dialect as input.

The L0 runner currently assumes the input to be a combination of the GPU, arith, and memref dialects. This setup requires a separate pass pipeline for handling inputs with the linalg dialect. The L0 runner should be extended to include a pass pipeline for lowering linalg to the GPU dialect.

Multiple GPU modules

The current code produces one gpu module for every function/kernel. We need one gpu module for all kernels under a given module, rather than creating a new gpu module for every new kernel.

Code refactoring

This Intel MLIR extensions (IMEX) repo was created to help reuse MLIR infrastructure and promote collaboration, so we would like to keep all dialects inside the "mlir" directory and follow a structure similar to the upstream llvm/mlir directory structure.

Please consider the following code refactoring.

  1. The Plier dialect should fully move to the "mlir" directory. The Plier conversion to Linalg/scf should move to the "mlir" directory. In general, mlir/lib/converter should be organized as upstream – each converter lives inside a subdirectory.
  2. The Plier conversion to linalg/std should not know about Numba constructs (PyFunc). The goal is that IMEX dialects should not know about upper-level SW constructs.
  3. Dpcomp_gpu_runtime should merge into the upcoming mlir/lib/dialect/intel_gpu.
  4. Dpcomp_runtime needs some more analysis.
  5. Tests and tools should move inside mlir; Tools/Dpcomp-opt should be renamed to imex-opt.

GPU Kernel Fusion

Currently too many GPU kernels are generated (e.g., one kernel per linalg op), and it would be good for performance to merge them together.

[ptensor] add missing elementwise binary operations

The following elementwise binary operations need an implementation in PTensorToLinalg(.cpp) (see array-API spec for the expected behavior of the operations):

  • ptensor::ATAN2
  • ptensor::LOGADDEXP
  • ptensor::LSHIFT
  • ptensor::MATMUL
  • ptensor::TRUE_DIVIDE
  • ptensor::BITWISE_AND
  • ptensor::BITWISE_LEFT_SHIFT
  • ptensor::BITWISE_OR
  • ptensor::BITWISE_RIGHT_SHIFT
  • ptensor::BITWISE_XOR
  • ptensor::EQUAL
  • ptensor::GREATER
  • ptensor::GREATER_EQUAL
  • ptensor::LESS
  • ptensor::LESS_EQUAL
  • ptensor::LOGICAL_AND
  • ptensor::LOGICAL_OR
  • ptensor::LOGICAL_XOR
  • ptensor::NOT_EQUAL

See also Operation Details->Elementwise Operations in the RFC.

Reference implementations can be found in the TOSA dialect and/or on main (numba_dpcomp/numba_dpcomp/mlir/numpy/funcs.py [Python]).

It is ok to let initial implementations operate on default PTensorTypes only, e.g. ignore device and distribution attributes of input tensors.

[ptensor] Add elementwise unary ops

  • Generally follow the implementation of elementwise binary ops
    • New struct EWUnaryOpId in PTensorOps.h (see below)
    • New PTensor Operations in PTensorOps.td as defined in Operation Details->Elementwise Operations in the RFC.
    • New conversion patterns in PTensorToLinalg.cpp (new classes and adding patterns to the RewritePatternSet in ConvertPTensorToLinalgPass)
  • See array-API spec for the expected behavior of the operations:
    • ABS
    • ACOS
    • ACOSH
    • ASIN
    • ASINH
    • ATAN
    • ATANH
    • BITWISE_INVERT
    • CEIL
    • COS
    • COSH
    • EXP
    • EXPM1
    • FLOOR
    • ISFINITE
    • ISINF
    • ISNAN
    • LOGICAL_NOT
    • LOG
    • LOG1P
    • LOG2
    • LOG10
    • NEGATIVE
    • POSITIVE
    • ROUND
    • SIGN
    • SIN
    • SINH
    • SQUARE
    • SQRT
    • TAN
    • TANH
    • TRUNC
    • ERF

Reference implementations can be found in the TOSA dialect and/or on main (numba_dpcomp/numba_dpcomp/mlir/numpy/funcs.py [Python]).

It is ok to let initial implementations operate on default PTensorTypes only, e.g. ignore device and distribution attributes of input tensors.

enum EWUnaryOpId : int {
    ABS,
    ACOS,
    ACOSH,
    ASIN,
    ASINH,
    ATAN,
    ATANH,
    BITWISE_INVERT,
    CEIL,
    COS,
    COSH,
    EXP,
    EXPM1,
    FLOOR,
    ISFINITE,
    ISINF,
    ISNAN,
    LOGICAL_NOT,
    LOG,
    LOG1P,
    LOG2,
    LOG10,
    NEGATIVE,
    POSITIVE,
    ROUND,
    SIGN,
    SIN,
    SINH,
    SQUARE,
    SQRT,
    TAN,
    TANH,
    TRUNC,
    ERF,
    EWUNARYOP_LAST
};

Jax test case failure on GPU pipeline

The test case under test/qoc/jit_matmul.338_linalg.mlir gives incorrect results on the GPU pipeline.

Output on GPU:
[[[-10,    0], 
  [-20,    0]]]
  
 Expected Output:
 [[[-10,    -10], 
  [-20,    -20]]]
