
Polymer: bridging polyhedral tools to MLIR


Bridging polyhedral analysis tools to the MLIR framework.

Polymer is a component of the Polygeist framework. Please read on to find out how to install and use Polymer.

Related Publications/Talks

[bibtex]

Papers

Polymer is an essential component of the following two papers:

Talks

Polymer appears in the following talks:

Install Polymer

[legacy installation method] [submodule problem]

The recommended way of installing Polymer is to have it as a component of Polygeist. Please find the detailed instructions here.

You can also install Polymer as an individual, out-of-tree project.

Basic usage

Optimize MLIR code written in the Affine dialect using Pluto:

// File name: matmul.mlir
func @matmul() {
  %A = memref.alloc() : memref<64x64xf32>
  %B = memref.alloc() : memref<64x64xf32>
  %C = memref.alloc() : memref<64x64xf32>

  affine.for %i = 0 to 64 {
    affine.for %j = 0 to 64 {
      affine.for %k = 0 to 64 {
        %0 = affine.load %A[%i, %k] : memref<64x64xf32>
        %1 = affine.load %B[%k, %j] : memref<64x64xf32>
        %2 = arith.mulf %0, %1 : f32
        %3 = affine.load %C[%i, %j] : memref<64x64xf32>
        %4 = arith.addf %2, %3 : f32
        affine.store %4, %C[%i, %j] : memref<64x64xf32>
      }
    }
  }

  return
}

The following command optimizes this code:

# Go to the build/ directory.
./bin/polymer-opt -reg2mem -extract-scop-stmt -pluto-opt matmul.mlir 

Output:

#map0 = affine_map<(d0) -> (d0 * 32)>
#map1 = affine_map<(d0) -> (d0 * 32 + 32)>
module  {
  func private @S0(%arg0: index, %arg1: index, %arg2: memref<64x64xf32>, %arg3: index, %arg4: memref<64x64xf32>, %arg5: memref<64x64xf32>) attributes {scop.stmt} {
    %0 = affine.load %arg5[symbol(%arg0), symbol(%arg3)] : memref<64x64xf32>
    %1 = affine.load %arg4[symbol(%arg3), symbol(%arg1)] : memref<64x64xf32>
    %2 = arith.mulf %0, %1 : f32
    %3 = affine.load %arg2[symbol(%arg0), symbol(%arg1)] : memref<64x64xf32>
    %4 = arith.addf %2, %3 : f32
    affine.store %4, %arg2[symbol(%arg0), symbol(%arg1)] : memref<64x64xf32>
    return
  }
  
  func @matmul() {
    %0 = memref.alloc() : memref<64x64xf32>
    %1 = memref.alloc() : memref<64x64xf32>
    %2 = memref.alloc() : memref<64x64xf32>
    affine.for %arg0 = 0 to 2 {
      affine.for %arg1 = 0 to 2 {
        affine.for %arg2 = 0 to 2 {
          affine.for %arg3 = #map0(%arg0) to #map1(%arg0) {
            affine.for %arg4 = #map0(%arg2) to #map1(%arg2) {
              affine.for %arg5 = #map0(%arg1) to #map1(%arg1) {
                call @S0(%arg3, %arg5, %0, %arg4, %1, %2) : (index, index, memref<64x64xf32>, index, memref<64x64xf32>, memref<64x64xf32>) -> ()
              }
            }
          }
        }
      }
    }
    return
  }
}
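The transformed code above tiles the three matmul loops: the outer loops iterate over 2×2×2 tiles, and #map0/#map1 expand each tile index d0 into the point-loop range [d0 * 32, d0 * 32 + 32). The sketch below (plain Python, not Polymer output; the problem and tile sizes are shrunk from 64/32 to 8/2 for brevity) mirrors that loop structure and checks it computes the same result as the untiled nest.

```python
# Scaled-down sketch of the tiled loop structure Pluto produced above.
# Hypothetical small sizes: N plays the role of 64, T the role of 32.
N, T = 8, 2

def matmul_naive(A, B):
    C = [[0.0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            for k in range(N):
                C[i][j] += A[i][k] * B[k][j]
    return C

def matmul_tiled(A, B):
    C = [[0.0] * N for _ in range(N)]
    for ti in range(N // T):          # affine.for %arg0 = 0 to 2
        for tj in range(N // T):      # affine.for %arg1 = 0 to 2
            for tk in range(N // T):  # affine.for %arg2 = 0 to 2
                # point loops over [t * T, t * T + T), i.e. #map0/#map1
                for i in range(ti * T, ti * T + T):
                    for k in range(tk * T, tk * T + T):
                        for j in range(tj * T, tj * T + T):
                            C[i][j] += A[i][k] * B[k][j]  # body of @S0
    return C

A = [[float(i * N + j) for j in range(N)] for i in range(N)]
B = [[float((i + j) % 5) for j in range(N)] for i in range(N)]
assert matmul_tiled(A, B) == matmul_naive(A, B)
```

Since the statement body is a pure accumulation, any iteration order that visits every (i, j, k) exactly once yields the same C, which is why the tiled (and reordered) nest is legal.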

polymer's People

Contributors

chelini, groverkss, hanchenye, kumasento, yangwang92


polymer's Issues

Connection to PLUTO

Should create a pluto-opt transform pass.

  • Add PLUTO to the compilation flow (#10 ).
  • Create the compiler pass (#13 ).

C integration test

Should check the correctness of the optimized code and compare the performance before and after optimization.

For instance, I should create a matrix multiplication example that has three different implementations:

  1. A plain C implementation
  2. An implementation that is based on an unoptimised MLIR code
  3. An implementation that is based on an optimised MLIR code

All these tests will be under test/integration. Won't create more subdirectories at the moment. Will consider adding googletest.

Handling local variables

This issue was not properly dealt with in previous versions. Since I'm currently refactoring, we should have a comprehensive solution for it.

Background

A local variable will be introduced when dealing with semi-affine maps. For example, consider the following code:

#map = affine_map<(d0) -> (d0 floordiv 2)>

func @load_store_local_vars_floordiv(%A : memref<?xf32>) -> () {
  %c0 = constant 0 : index
  %N = dim %A, %c0 : memref<?xf32>
  %M = affine.apply #map(%N)

  affine.for %i = 0 to %M {
    %0 = affine.load %A[%i] : memref<?xf32>
    affine.store %0, %A[%i + %M] : memref<?xf32>
  }

  return
}

%M will be a local variable in the resolved domain of the inner affine.for. The constraints relating %N and %M can be derived from their floordiv 2 relation.

%i | %N | %M | 1 
 1    0    0   0 >= 0
 0    1   -2   0 >= 0
 0   -1    2   1 >= 0
-1    0    1  -1 >= 0

And these constraints can be translated into a proper OpenScop relation.

DOMAIN
4 5 1 0 1 1
# e/i| i1 | l1 | P1 |  1  
   1    1    0    0    0    ## i1 >= 0
   1    0   -2    1    0    ## -2*l1+P1 >= 0
   1    0    2   -1    1    ## 2*l1-P1+1 >= 0
   1   -1    1    0   -1    ## -i1+l1-1 >= 0

l1 is %M and P1 is %N.
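The middle two rows of the DOMAIN relation are what pin the local variable to its floordiv value: together, N - 2M >= 0 and -N + 2M + 1 >= 0 force M = floor(N / 2). A brute-force Python check (small hypothetical ranges, purely illustrative):

```python
# Check that the DOMAIN rows above encode 0 <= i < M with M = N floordiv 2.
def in_domain(i, N, M):
    return (i >= 0 and                # i1 >= 0
            N - 2 * M >= 0 and        # -2*l1 + P1 >= 0
            -N + 2 * M + 1 >= 0 and   #  2*l1 - P1 + 1 >= 0
            -i + M - 1 >= 0)          # -i1 + l1 - 1 >= 0

for N in range(0, 10):
    M = N // 2
    # only the true floordiv value of M satisfies the middle two rows
    assert [m for m in range(N + 1)
            if N - 2 * m >= 0 and -N + 2 * m + 1 >= 0] == [M]
    for i in range(-2, N + 2):
        assert in_domain(i, N, M) == (0 <= i < M)
```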

The problem is around the access relation.

Getting the access relation

The access relation is derived from an AffineValueMap that we can get from MemRefAccess (ref). An AffineValueMap is an AffineMap together with the list of operands it applies to. The issue with local variables concerns these operands: we need to make sure every operand can be traced back to the values held by the domain, but a local variable is not there.

Specifically, given the code above, we can get a polyhedral statement as follows:

  func @load_store_local_vars_floordiv(%arg0: memref<?xf32>) {
    %c0 = constant 0 : index
    %0 = dim %arg0, %c0 : memref<?xf32>
    %1 = affine.apply #map(%0)
    affine.for %arg1 = 0 to %1 {
      call @S0(%arg0, %arg1, %1) : (memref<?xf32>, index, index) -> ()
    }
    return
  }
  func @S0(%arg0: memref<?xf32>, %arg1: index, %arg2: index) attributes {scop.stmt} {
    %0 = affine.load %arg0[%arg1] : memref<?xf32>
    affine.store %0, %arg0[%arg1 + %arg2] : memref<?xf32>
    return
  }

Because %1 (%M) is a valid symbol (per isValidSymbol), we pass it directly into @S0 as %arg2. It will later become an operand of the AffineValueMap for the affine.store. That AffineValueMap would look like:

affine_map<()[s0, s1] -> (s0 + s1)>

where s0 and s1 are replaced by %arg1 (%i) and %arg2 (%M). However, %arg2, or more specifically %M, cannot be found in the domain we resolved. %M is a local variable, and a local variable is value-less (internal) in the FlatAffineConstraints. Going back to the domain representation above (which is of type FlatAffineConstraints):

%i | %N | %M | 1 
 1    0    0   0 >= 0
 0    1   -2   0 >= 0
 0   -1    2   1 >= 0
-1    0    1  -1 >= 0

%M is not actually stored in the domain.

Solutions

Before we dive into the solutions, there are several things from the Polymer design that cannot be broken:

  1. A domain is immutable. Once we figure out what a domain looks like, especially what its dims and symbols are, we should not further alter it.
  2. An access relation should not use values other than those in the domain, excluding local values. That is, if an access relation needs to use a value marked as local in the domain, it should re-instantiate that value as a local one inside the access relation.
  3. The statement interface should only include arrays, IVs, and parameters, e.g., including %M as an argument of @S0 in the previous example violates this rule. This is just a convention that can keep the design clean.

Our solution is:

  1. In -extract-scop-stmt, we stop searching only when we see an operand that is included in the current domain constraints (or the global parameter set), so that all local-variable computation is contained within each callee.
  2. When exporting OpenScop, we just build access constraints with all the affine.apply results flattened. This is done automatically; we just need to make sure that these affine constraints have the same columns as the domain (after merging with the context for their symbols), excluding all the local variables.
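The flattening in step 2 can be illustrated concretely: instead of referring to %M by value, the access %A[%i + %M] is expressed as constraints over the address, the domain values, and a fresh local variable q for the floordiv. A small brute-force sketch (hypothetical, just to show the constraint shape):

```python
# Flatten addr = i + (N floordiv 2) into affine constraints with a
# local variable q standing in for the floordiv:
#   addr - i - q == 0,  N - 2q >= 0,  -N + 2q + 1 >= 0
def flattened_address(i, N):
    # enumerate candidate (addr, q) pairs satisfying the constraints
    sols = [(addr, q)
            for q in range(N + 1)
            for addr in range(2 * N + 2)
            if addr - i - q == 0 and N - 2 * q >= 0 and -N + 2 * q + 1 >= 0]
    assert len(sols) == 1  # the constraints determine addr uniquely
    return sols[0][0]

for N in range(1, 8):
    for i in range(N // 2):
        assert flattened_address(i, N) == i + N // 2
```

The point is that the two floordiv rows pin q uniquely, so the access relation stays single-valued without ever naming %M.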

Result

Given input code:

#map = affine_map<(d0) -> (d0 floordiv 2)>

func @load_store_local_vars_floordiv(%A : memref<?xf32>) -> () {
  %c0 = constant 0 : index
  %N = dim %A, %c0 : memref<?xf32>
  %M = affine.apply #map(%N)

  affine.for %i = 0 to %M {
    %0 = affine.load %A[%i] : memref<?xf32>
    %j = affine.apply #map(%i)
    %1 = addf %0, %0 : f32
    affine.store %0, %A[%i + %M] : memref<?xf32>
    affine.store %1, %A[%j + %M] : memref<?xf32>
  }

  return
}

We will get:

#map = affine_map<(d0) -> (d0 floordiv 2)>
#set0 = affine_set<()[s0] : (s0 floordiv 2 - 1 >= 0, s0 floordiv 2 - 1 >= 0, s0 mod 2 >= 0, -s0 + (s0 floordiv 2) * 2 + 1 >= 0, s0 mod 2 >= 0, -s0 + (s0 floordiv 2) * 2 + 1 >= 0)>
#set1 = affine_set<(d0)[s0] : (d0 >= 0, s0 mod 2 >= 0, -s0 + (s0 floordiv 2) * 2 + 1 >= 0, -d0 + s0 floordiv 2 - 1 >= 0)>
#set2 = affine_set<(d0)[s0] : (d0 >= 0, s0 mod 2 >= 0, -s0 + (s0 floordiv 2) * 2 + 1 >= 0, -d0 + s0 floordiv 2 - 1 >= 0)>
#set3 = affine_set<(d0, d1, d2)[s0] : (d1 - d2 == 0, d0 - 1 == 0)>
#set4 = affine_set<(d0, d1, d2)[s0] : (d1 - d2 - s0 floordiv 2 == 0, d0 - 1 == 0, s0 mod 2 >= 0, -s0 + (s0 floordiv 2) * 2 + 1 >= 0)>
#set5 = affine_set<(d0, d1, d2)[s0] : (d1 - d2 floordiv 2 - s0 floordiv 2 == 0, d0 - 1 == 0, d2 mod 2 >= 0, -d2 + (d2 floordiv 2) * 2 + 1 >= 0, s0 mod 2 >= 0, -s0 + (s0 floordiv 2) * 2 + 1 >= 0)>
module  {
  func @load_store_local_vars_floordiv(%arg0: memref<?xf32>) attributes {scop.arg_names = ["A1"], scop.ctx = #set0, scop.ctx_params = ["P1"]} {
    %c0 = constant 0 : index
    %0 = dim {scop.param_names = ["P1"]} %arg0, %c0 : memref<?xf32>
    %1 = affine.apply #map(%0)
    affine.for %arg1 = 0 to %1 {
      call @S0(%arg0, %arg1, %0) {scop.domain = #set1, scop.domain_symbols = ["i1", "P1"], scop.scats = [0, 0]} : (memref<?xf32>, index, index) -> ()
      call @S1(%arg0, %0, %arg1) {scop.domain = #set2, scop.domain_symbols = ["i1", "P1"], scop.scats = [0, 1]} : (memref<?xf32>, index, index) -> ()
    } {scop.iv_name = "i1"}
    return
  }
  func @S0(%arg0: memref<?xf32>, %arg1: index, %arg2: index) attributes {scop.stmt} {
    %0 = affine.load %arg0[%arg1] {scop.access = #set3, scop.access_symbols = ["A1", "", "i1", "P1"]} : memref<?xf32>
    %1 = affine.apply #map(%arg2)
    affine.store %0, %arg0[%arg1 + %1] {scop.access = #set4, scop.access_symbols = ["A1", "", "i1", "P1"]} : memref<?xf32>
    return
  }
  func @S1(%arg0: memref<?xf32>, %arg1: index, %arg2: index) attributes {scop.stmt} {
    %0 = affine.load %arg0[%arg2] {scop.access = #set3, scop.access_symbols = ["A1", "", "i1", "P1"]} : memref<?xf32>
    %1 = addf %0, %0 : f32
    %2 = affine.apply #map(%arg2)
    %3 = affine.apply #map(%arg1)
    affine.store %1, %arg0[%2 + %3] {scop.access = #set5, scop.access_symbols = ["A1", "", "i1", "P1"]} : memref<?xf32>
    return
  }
}

No Pluto optimization in matrix multiplication example

I built Polymer from the impact branch and tried to optimize matmul.mlir, but there is no difference between the output MLIR and the input MLIR.
The command I used is polymer-opt -pluto-opt ../matmul.mlir.
The output MLIR is:

module  {
  func @matmul() {
    %0 = alloc() : memref<64x64xf32>
    %1 = alloc() : memref<64x64xf32>
    %2 = alloc() : memref<64x64xf32>
    affine.for %arg0 = 0 to 64 {
      affine.for %arg1 = 0 to 64 {
        affine.for %arg2 = 0 to 64 {
          %3 = affine.load %0[%arg0, %arg2] : memref<64x64xf32>
          %4 = affine.load %1[%arg2, %arg1] : memref<64x64xf32>
          %5 = mulf %3, %4 : f32
          %6 = affine.load %2[%arg0, %arg1] : memref<64x64xf32>
          %7 = addf %5, %6 : f32
          affine.store %7, %2[%arg0, %arg1] : memref<64x64xf32>
        }
      }
    }
    return
  }
}

When I try to build this project with the main branch and use the command ./scripts/build-with-polygeist.sh, I get a compile error:

configure: error: clang header file not found
checking for pet/Makefile... no
configure: error: configure in pet/ failed
tools/mlir/tools/polymer/CMakeFiles/pluto.dir/build.make:124: recipe for target 'tools/mlir/tools/polymer/pluto/src/pluto-stamp/pluto-configure' failed
make[3]: *** [tools/mlir/tools/polymer/pluto/src/pluto-stamp/pluto-configure] Error 1
CMakeFiles/Makefile2:109388: recipe for target 'tools/mlir/tools/polymer/CMakeFiles/pluto.dir/all' failed
make[2]: *** [tools/mlir/tools/polymer/CMakeFiles/pluto.dir/all] Error 2
CMakeFiles/Makefile2:110540: recipe for target 'tools/mlir/tools/polymer/test/CMakeFiles/check-polymer.dir/rule' failed
make[1]: *** [tools/mlir/tools/polymer/test/CMakeFiles/check-polymer.dir/rule] Error 2
Makefile:27063: recipe for target 'check-polymer' failed
make: *** [check-polymer] Error 2

Are there any solutions to these problems? Thanks.

<scatnames> are wrongly generated

The total number of scatnames is not correctly calculated: it is currently computed as 1 (root) + n (IVs) + 1 (leaf), while what we want is 2n (IVs interleaved with their scalar positions) + 1 (leaf).
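In other words, for n nested loops the scattering dimensions should interleave the n IVs with n + 1 scalar position dimensions, giving 2n + 1 names in total. A one-line sketch of the corrected count:

```python
# Corrected scatnames count: n IVs interleaved with n + 1 scalar
# positions (root, between-loop slots, leaf) gives 2n + 1 names,
# e.g. c0 i1 c1 i2 c2 i3 c3 for a 3-deep nest.
def num_scatnames(n_loops):
    return 2 * n_loops + 1

assert num_scatnames(3) == 7  # the matmul triple nest
assert num_scatnames(1) == 3
```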

Fix the failed test of matmul

func @matmul() {
  %A = alloc() : memref<64x64xf32>
  %B = alloc() : memref<64x64xf32>
  %C = alloc() : memref<64x64xf32>

  affine.for %i = 0 to 64 {
    affine.for %j = 0 to 64 {
      affine.for %k = 0 to 64 {
        %0 = affine.load %A[%i, %k] : memref<64x64xf32>
        %1 = affine.load %B[%k, %j] : memref<64x64xf32>
        %2 = mulf %0, %1 : f32
        %3 = affine.load %C[%i, %j] : memref<64x64xf32>
        %4 = addf %2, %3 : f32
        affine.store %4, %C[%i, %j] : memref<64x64xf32>
      }
    }
  }

  return
}

Add the reg2mem pass to break dependences in imperfectly nested loops

affine.for %i ... {
  %0 = affine.load %A[%i]
  affine.store %something, %A[%i]
  affine.for %j {
    %1 = mulf %0, %0
    affine.store %1, %B[%i, %j]
  }
}

This code is hard to handle because:

  1. The loops are imperfectly nested, and in our current implementation of PlutoTransform, we would need to create a statement that contains a loop structure inside it. It is unclear whether Pluto can handle this, or whether it may cause problems;
  2. There is a WAR dependence between the first load/store pair. So we cannot reorder them.

One solution is to add a reg2mem pass, which breaks the dependences between nested loops by storing to memory the SSA values (%0 in the given example) that are still live when we enter a nested block. This way we don't have to create statements that span multiple loops, and we can model the dependences explicitly.

For instance, if we transform the previously given example in the following way:

affine.for %i ... {
  // S1(i)
  %0 = affine.load %A[%i]
  affine.store %0, %scratchpad[%i]

  // S2(i)
  affine.store %something, %A[%i]
  affine.for %j {
    // S3(i, j)
    %1 = affine.load %scratchpad[%i]
    %2 = mulf %1, %1
    affine.store %2, %B[%i, %j]
  }
}

We may create clearly separated statements just within each loop nest level. The %scratchpad is the memory we create to store those live-out SSA values.

We could implement this pass based on the basic liveness analysis algorithm. The size of the allocated memory depends on the iteration space of the loop nest level it resides in.
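The effect of the pass can be sketched in plain Python (hypothetical sizes, not Polymer's implementation): the live-out scalar is spilled to a scratchpad indexed by the surrounding IV, so the inner loop reloads it instead of capturing the SSA value across loop levels.

```python
# Sketch of the reg2mem idea: spill a value that is live into a nested
# loop to a scratchpad sized by the surrounding iteration space, so
# each loop level becomes an independent statement.
NI, NJ = 4, 3                     # hypothetical trip counts for %i, %j
A = [float(i) for i in range(NI)]
B = [[0.0] * NJ for _ in range(NI)]
scratchpad = [0.0] * NI           # one slot per %i iteration

for i in range(NI):
    # S1(i): load A[i] and spill it to the scratchpad
    scratchpad[i] = A[i]
    # S2(i): the WAR-conflicting store to A[i] is now a separate statement
    A[i] = 42.0
    for j in range(NJ):
        # S3(i, j): reload the spilled value instead of using the SSA name
        v = scratchpad[i]
        B[i][j] = v * v

# each B[i][j] sees the value of A[i] from before the S2 overwrite
assert all(B[i][j] == float(i) ** 2 for i in range(NI) for j in range(NJ))
```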

It would be helpful for the following issues: #5 #42 #48

Support extracting point loops into standalone functions

This will be useful for further optimisation concentrating on the point loop bands.

cd build
./bin/polymer-opt ../example/polybench/EXTRALARGE/gemm/gemm.mlir \
  -reg2mem \
  -extract-scop-stmt \
  -pluto-opt \
  -canonicalize \
  -annotate-point-loops \
  -extract-point-loops
#map0 = affine_map<()[s0] -> ((s0 - 1) floordiv 32 + 1)>
#map1 = affine_map<(d0) -> (d0 * 32)>
#map2 = affine_map<(d0)[s0] -> (s0, d0 * 32 + 32)>
#set0 = affine_set<()[s0, s1] : (s0 - 1 >= 0, s1 - 1 >= 0)>
#set1 = affine_set<()[s0] : (s0 - 1 >= 0)>
module attributes {llvm.data_layout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128", llvm.target_triple = "x86_64-unknown-linux-gnu"}  {
  llvm.mlir.global internal constant @str7("==END   DUMP_ARRAYS==\0A\00")
  llvm.mlir.global internal constant @str6("\0Aend   dump: %s\0A\00")
  llvm.mlir.global internal constant @str5("%0.2lf \00")
  llvm.mlir.global internal constant @str4("\0A\00")
  llvm.mlir.global internal constant @str3("C\00")
  llvm.mlir.global internal constant @str2("begin dump: %s\00")
  llvm.mlir.global internal constant @str1("==BEGIN DUMP_ARRAYS==\0A\00")
  llvm.mlir.global external @stderr() : !llvm.ptr<struct<"struct._IO_FILE", (i32, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<struct<"struct._IO_marker", opaque>>, ptr<struct<"struct._IO_FILE">>, i32, i32, i64, i16, i8, array<1 x i8>, ptr<i8>, i64, ptr<struct<"struct._IO_codecvt", opaque>>, ptr<struct<"struct._IO_wide_data", opaque>>, ptr<struct<"struct._IO_FILE">>, ptr<i8>, i64, i32, array<20 x i8>)>>
  llvm.func @fprintf(!llvm.ptr<struct<"struct._IO_FILE", (i32, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<struct<"struct._IO_marker", opaque>>, ptr<struct<"struct._IO_FILE">>, i32, i32, i64, i16, i8, array<1 x i8>, ptr<i8>, i64, ptr<struct<"struct._IO_codecvt", opaque>>, ptr<struct<"struct._IO_wide_data", opaque>>, ptr<struct<"struct._IO_FILE">>, ptr<i8>, i64, i32, array<20 x i8>)>>, !llvm.ptr<i8>, ...) -> !llvm.i32
  llvm.mlir.global internal constant @str0("\00")
  llvm.func @strcmp(!llvm.ptr<i8>, !llvm.ptr<i8>) -> !llvm.i32
  func @main(%arg0: i32, %arg1: !llvm.ptr<ptr<i8>>) -> i32 {
    %c0 = constant 0 : index
    %c2000_i32 = constant 2000 : i32
    %c2300_i32 = constant 2300 : i32
    %c2600_i32 = constant 2600 : i32
    %c42_i32 = constant 42 : i32
    %true = constant true
    %false = constant false
    %c0_i32 = constant 0 : i32
    %0 = alloca() : memref<1xf64>
    %1 = alloca() : memref<1xf64>
    %2 = alloc() : memref<2000x2300xf64>
    %3 = alloc() : memref<2000x2600xf64>
    %4 = alloc() : memref<2600x2300xf64>
    %5 = memref_cast %0 : memref<1xf64> to memref<?xf64>
    %6 = memref_cast %1 : memref<1xf64> to memref<?xf64>
    call @init_array(%c2000_i32, %c2300_i32, %c2600_i32, %5, %6, %2, %3, %4) : (i32, i32, i32, memref<?xf64>, memref<?xf64>, memref<2000x2300xf64>, memref<2000x2600xf64>, memref<2600x2300xf64>) -> ()
    call @polybench_timer_start() : () -> ()
    %7 = load %0[%c0] : memref<1xf64>
    %8 = load %1[%c0] : memref<1xf64>
    call @kernel_gemm_new(%c2000_i32, %c2300_i32, %c2600_i32, %7, %8, %2, %3, %4) : (i32, i32, i32, f64, f64, memref<2000x2300xf64>, memref<2000x2600xf64>, memref<2600x2300xf64>) -> ()
    call @polybench_timer_stop() : () -> ()
    call @polybench_timer_print() : () -> ()
    %9 = cmpi "sgt", %arg0, %c42_i32 : i32
    %10 = scf.if %9 -> (i1) {
      %11 = llvm.load %arg1 : !llvm.ptr<ptr<i8>>
      %12 = llvm.mlir.addressof @str0 : !llvm.ptr<array<1 x i8>>
      %13 = llvm.mlir.constant(0 : index) : !llvm.i64
      %14 = llvm.getelementptr %12[%13, %13] : (!llvm.ptr<array<1 x i8>>, !llvm.i64, !llvm.i64) -> !llvm.ptr<i8>
      %15 = llvm.call @strcmp(%11, %14) : (!llvm.ptr<i8>, !llvm.ptr<i8>) -> !llvm.i32
      %16 = llvm.mlir.cast %15 : !llvm.i32 to i32
      %17 = trunci %16 : i32 to i1
      %18 = xor %17, %true : i1
      scf.yield %18 : i1
    } else {
      scf.yield %false : i1
    }
    scf.if %10 {
      call @print_array(%c2000_i32, %c2300_i32, %2) : (i32, i32, memref<2000x2300xf64>) -> ()
    }
    return %c0_i32 : i32
  }
  func private @init_array(%arg0: i32, %arg1: i32, %arg2: i32, %arg3: memref<?xf64>, %arg4: memref<?xf64>, %arg5: memref<2000x2300xf64>, %arg6: memref<2000x2600xf64>, %arg7: memref<2600x2300xf64>) {
    %c0 = constant 0 : index
    %cst = constant 1.500000e+00 : f64
    %cst_0 = constant 1.200000e+00 : f64
    %c0_i32 = constant 0 : i32
    %c2_i32 = constant 2 : i32
    %c1_i32 = constant 1 : i32
    store %cst, %arg3[%c0] : memref<?xf64>
    store %cst_0, %arg4[%c0] : memref<?xf64>
    br ^bb1(%c0_i32 : i32)
  ^bb1(%0: i32):  // 2 preds: ^bb0, ^bb4
    %1 = cmpi "slt", %0, %arg0 : i32
    %2 = index_cast %0 : i32 to index
    cond_br %1, ^bb2(%c0_i32 : i32), ^bb5(%c0_i32 : i32)
  ^bb2(%3: i32):  // 2 preds: ^bb1, ^bb3
    %4 = cmpi "slt", %3, %arg1 : i32
    %5 = index_cast %3 : i32 to index
    cond_br %4, ^bb3, ^bb4
  ^bb3:  // pred: ^bb2
    %6 = muli %0, %3 : i32
    %7 = addi %6, %c1_i32 : i32
    %8 = remi_signed %7, %arg0 : i32
    %9 = sitofp %8 : i32 to f64
    %10 = sitofp %arg0 : i32 to f64
    %11 = divf %9, %10 : f64
    store %11, %arg5[%2, %5] : memref<2000x2300xf64>
    %12 = addi %3, %c1_i32 : i32
    br ^bb2(%12 : i32)
  ^bb4:  // pred: ^bb2
    %13 = addi %0, %c1_i32 : i32
    br ^bb1(%13 : i32)
  ^bb5(%14: i32):  // 2 preds: ^bb1, ^bb8
    %15 = cmpi "slt", %14, %arg0 : i32
    %16 = index_cast %14 : i32 to index
    cond_br %15, ^bb6(%c0_i32 : i32), ^bb9(%c0_i32 : i32)
  ^bb6(%17: i32):  // 2 preds: ^bb5, ^bb7
    %18 = cmpi "slt", %17, %arg2 : i32
    %19 = index_cast %17 : i32 to index
    cond_br %18, ^bb7, ^bb8
  ^bb7:  // pred: ^bb6
    %20 = addi %17, %c1_i32 : i32
    %21 = muli %14, %20 : i32
    %22 = remi_signed %21, %arg2 : i32
    %23 = sitofp %22 : i32 to f64
    %24 = sitofp %arg2 : i32 to f64
    %25 = divf %23, %24 : f64
    store %25, %arg6[%16, %19] : memref<2000x2600xf64>
    br ^bb6(%20 : i32)
  ^bb8:  // pred: ^bb6
    %26 = addi %14, %c1_i32 : i32
    br ^bb5(%26 : i32)
  ^bb9(%27: i32):  // 2 preds: ^bb5, ^bb13
    %28 = cmpi "slt", %27, %arg2 : i32
    %29 = index_cast %27 : i32 to index
    cond_br %28, ^bb11(%c0_i32 : i32), ^bb10
  ^bb10:  // pred: ^bb9
    return
  ^bb11(%30: i32):  // 2 preds: ^bb9, ^bb12
    %31 = cmpi "slt", %30, %arg1 : i32
    %32 = index_cast %30 : i32 to index
    cond_br %31, ^bb12, ^bb13
  ^bb12:  // pred: ^bb11
    %33 = addi %30, %c2_i32 : i32
    %34 = muli %27, %33 : i32
    %35 = remi_signed %34, %arg1 : i32
    %36 = sitofp %35 : i32 to f64
    %37 = sitofp %arg1 : i32 to f64
    %38 = divf %36, %37 : f64
    store %38, %arg7[%29, %32] : memref<2600x2300xf64>
    %39 = addi %30, %c1_i32 : i32
    br ^bb11(%39 : i32)
  ^bb13:  // pred: ^bb11
    %40 = addi %27, %c1_i32 : i32
    br ^bb9(%40 : i32)
  }
  func private @polybench_timer_start()
  func private @kernel_gemm(%arg0: i32, %arg1: i32, %arg2: i32, %arg3: f64, %arg4: f64, %arg5: memref<2000x2300xf64>, %arg6: memref<2000x2600xf64>, %arg7: memref<2600x2300xf64>) {
    %0 = index_cast %arg0 : i32 to index
    %1 = index_cast %arg1 : i32 to index
    %2 = index_cast %arg2 : i32 to index
    affine.for %arg8 = 0 to %0 {
      affine.for %arg9 = 0 to %1 {
        call @S0(%arg5, %arg8, %arg9, %arg4) : (memref<2000x2300xf64>, index, index, f64) -> ()
      } {scop.point_loop}
      affine.for %arg9 = 0 to %2 {
        affine.for %arg10 = 0 to %1 {
          call @S1(%arg5, %arg8, %arg10, %arg7, %arg9, %arg3, %arg6) : (memref<2000x2300xf64>, index, index, memref<2600x2300xf64>, index, f64, memref<2000x2600xf64>) -> ()
        } {scop.point_loop}
      } {scop.point_loop}
    } {scop.point_loop}
    return
  }
  func private @polybench_timer_stop()
  func private @polybench_timer_print()
  func private @print_array(%arg0: i32, %arg1: i32, %arg2: memref<2000x2300xf64>) {
    %c0_i32 = constant 0 : i32
    %c20_i32 = constant 20 : i32
    %c1_i32 = constant 1 : i32
    %0 = llvm.mlir.addressof @stderr : !llvm.ptr<ptr<struct<"struct._IO_FILE", (i32, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<struct<"struct._IO_marker", opaque>>, ptr<struct<"struct._IO_FILE">>, i32, i32, i64, i16, i8, array<1 x i8>, ptr<i8>, i64, ptr<struct<"struct._IO_codecvt", opaque>>, ptr<struct<"struct._IO_wide_data", opaque>>, ptr<struct<"struct._IO_FILE">>, ptr<i8>, i64, i32, array<20 x i8>)>>>
    %1 = llvm.load %0 : !llvm.ptr<ptr<struct<"struct._IO_FILE", (i32, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<struct<"struct._IO_marker", opaque>>, ptr<struct<"struct._IO_FILE">>, i32, i32, i64, i16, i8, array<1 x i8>, ptr<i8>, i64, ptr<struct<"struct._IO_codecvt", opaque>>, ptr<struct<"struct._IO_wide_data", opaque>>, ptr<struct<"struct._IO_FILE">>, ptr<i8>, i64, i32, array<20 x i8>)>>>
    %2 = llvm.mlir.addressof @str1 : !llvm.ptr<array<23 x i8>>
    %3 = llvm.mlir.constant(0 : index) : !llvm.i64
    %4 = llvm.getelementptr %2[%3, %3] : (!llvm.ptr<array<23 x i8>>, !llvm.i64, !llvm.i64) -> !llvm.ptr<i8>
    %5 = llvm.call @fprintf(%1, %4) : (!llvm.ptr<struct<"struct._IO_FILE", (i32, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<struct<"struct._IO_marker", opaque>>, ptr<struct<"struct._IO_FILE">>, i32, i32, i64, i16, i8, array<1 x i8>, ptr<i8>, i64, ptr<struct<"struct._IO_codecvt", opaque>>, ptr<struct<"struct._IO_wide_data", opaque>>, ptr<struct<"struct._IO_FILE">>, ptr<i8>, i64, i32, array<20 x i8>)>>, !llvm.ptr<i8>) -> !llvm.i32
    %6 = llvm.mlir.addressof @stderr : !llvm.ptr<ptr<struct<"struct._IO_FILE", (i32, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<struct<"struct._IO_marker", opaque>>, ptr<struct<"struct._IO_FILE">>, i32, i32, i64, i16, i8, array<1 x i8>, ptr<i8>, i64, ptr<struct<"struct._IO_codecvt", opaque>>, ptr<struct<"struct._IO_wide_data", opaque>>, ptr<struct<"struct._IO_FILE">>, ptr<i8>, i64, i32, array<20 x i8>)>>>
    %7 = llvm.load %6 : !llvm.ptr<ptr<struct<"struct._IO_FILE", (i32, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<struct<"struct._IO_marker", opaque>>, ptr<struct<"struct._IO_FILE">>, i32, i32, i64, i16, i8, array<1 x i8>, ptr<i8>, i64, ptr<struct<"struct._IO_codecvt", opaque>>, ptr<struct<"struct._IO_wide_data", opaque>>, ptr<struct<"struct._IO_FILE">>, ptr<i8>, i64, i32, array<20 x i8>)>>>
    %8 = llvm.mlir.addressof @str2 : !llvm.ptr<array<15 x i8>>
    %9 = llvm.getelementptr %8[%3, %3] : (!llvm.ptr<array<15 x i8>>, !llvm.i64, !llvm.i64) -> !llvm.ptr<i8>
    %10 = llvm.mlir.addressof @str3 : !llvm.ptr<array<2 x i8>>
    %11 = llvm.getelementptr %10[%3, %3] : (!llvm.ptr<array<2 x i8>>, !llvm.i64, !llvm.i64) -> !llvm.ptr<i8>
    %12 = llvm.call @fprintf(%7, %9, %11) : (!llvm.ptr<struct<"struct._IO_FILE", (i32, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<struct<"struct._IO_marker", opaque>>, ptr<struct<"struct._IO_FILE">>, i32, i32, i64, i16, i8, array<1 x i8>, ptr<i8>, i64, ptr<struct<"struct._IO_codecvt", opaque>>, ptr<struct<"struct._IO_wide_data", opaque>>, ptr<struct<"struct._IO_FILE">>, ptr<i8>, i64, i32, array<20 x i8>)>>, !llvm.ptr<i8>, !llvm.ptr<i8>) -> !llvm.i32
    br ^bb1(%c0_i32 : i32)
  ^bb1(%13: i32):  // 2 preds: ^bb0, ^bb5
    %14 = cmpi "slt", %13, %arg0 : i32
    %15 = index_cast %13 : i32 to index
    cond_br %14, ^bb3(%c0_i32 : i32), ^bb2
  ^bb2:  // pred: ^bb1
    %16 = llvm.mlir.addressof @stderr : !llvm.ptr<ptr<struct<"struct._IO_FILE", (i32, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<struct<"struct._IO_marker", opaque>>, ptr<struct<"struct._IO_FILE">>, i32, i32, i64, i16, i8, array<1 x i8>, ptr<i8>, i64, ptr<struct<"struct._IO_codecvt", opaque>>, ptr<struct<"struct._IO_wide_data", opaque>>, ptr<struct<"struct._IO_FILE">>, ptr<i8>, i64, i32, array<20 x i8>)>>>
    %17 = llvm.load %16 : !llvm.ptr<ptr<struct<"struct._IO_FILE", (i32, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<struct<"struct._IO_marker", opaque>>, ptr<struct<"struct._IO_FILE">>, i32, i32, i64, i16, i8, array<1 x i8>, ptr<i8>, i64, ptr<struct<"struct._IO_codecvt", opaque>>, ptr<struct<"struct._IO_wide_data", opaque>>, ptr<struct<"struct._IO_FILE">>, ptr<i8>, i64, i32, array<20 x i8>)>>>
    %18 = llvm.mlir.addressof @str6 : !llvm.ptr<array<17 x i8>>
    %19 = llvm.getelementptr %18[%3, %3] : (!llvm.ptr<array<17 x i8>>, !llvm.i64, !llvm.i64) -> !llvm.ptr<i8>
    %20 = llvm.mlir.addressof @str3 : !llvm.ptr<array<2 x i8>>
    %21 = llvm.getelementptr %20[%3, %3] : (!llvm.ptr<array<2 x i8>>, !llvm.i64, !llvm.i64) -> !llvm.ptr<i8>
    %22 = llvm.call @fprintf(%17, %19, %21) : (!llvm.ptr<struct<"struct._IO_FILE", (i32, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<struct<"struct._IO_marker", opaque>>, ptr<struct<"struct._IO_FILE">>, i32, i32, i64, i16, i8, array<1 x i8>, ptr<i8>, i64, ptr<struct<"struct._IO_codecvt", opaque>>, ptr<struct<"struct._IO_wide_data", opaque>>, ptr<struct<"struct._IO_FILE">>, ptr<i8>, i64, i32, array<20 x i8>)>>, !llvm.ptr<i8>, !llvm.ptr<i8>) -> !llvm.i32
    %23 = llvm.mlir.addressof @stderr : !llvm.ptr<ptr<struct<"struct._IO_FILE", (i32, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<struct<"struct._IO_marker", opaque>>, ptr<struct<"struct._IO_FILE">>, i32, i32, i64, i16, i8, array<1 x i8>, ptr<i8>, i64, ptr<struct<"struct._IO_codecvt", opaque>>, ptr<struct<"struct._IO_wide_data", opaque>>, ptr<struct<"struct._IO_FILE">>, ptr<i8>, i64, i32, array<20 x i8>)>>>
    %24 = llvm.load %23 : !llvm.ptr<ptr<struct<"struct._IO_FILE", (i32, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<struct<"struct._IO_marker", opaque>>, ptr<struct<"struct._IO_FILE">>, i32, i32, i64, i16, i8, array<1 x i8>, ptr<i8>, i64, ptr<struct<"struct._IO_codecvt", opaque>>, ptr<struct<"struct._IO_wide_data", opaque>>, ptr<struct<"struct._IO_FILE">>, ptr<i8>, i64, i32, array<20 x i8>)>>>
    %25 = llvm.mlir.addressof @str7 : !llvm.ptr<array<23 x i8>>
    %26 = llvm.getelementptr %25[%3, %3] : (!llvm.ptr<array<23 x i8>>, !llvm.i64, !llvm.i64) -> !llvm.ptr<i8>
    %27 = llvm.call @fprintf(%24, %26) : (!llvm.ptr<struct<"struct._IO_FILE", (i32, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<struct<"struct._IO_marker", opaque>>, ptr<struct<"struct._IO_FILE">>, i32, i32, i64, i16, i8, array<1 x i8>, ptr<i8>, i64, ptr<struct<"struct._IO_codecvt", opaque>>, ptr<struct<"struct._IO_wide_data", opaque>>, ptr<struct<"struct._IO_FILE">>, ptr<i8>, i64, i32, array<20 x i8>)>>, !llvm.ptr<i8>) -> !llvm.i32
    return
  ^bb3(%28: i32):  // 2 preds: ^bb1, ^bb4
    %29 = cmpi "slt", %28, %arg1 : i32
    %30 = index_cast %28 : i32 to index
    cond_br %29, ^bb4, ^bb5
  ^bb4:  // pred: ^bb3
    %31 = muli %13, %arg0 : i32
    %32 = addi %31, %28 : i32
    %33 = remi_signed %32, %c20_i32 : i32
    %34 = cmpi "eq", %33, %c0_i32 : i32
    scf.if %34 {
      %44 = llvm.mlir.addressof @stderr : !llvm.ptr<ptr<struct<"struct._IO_FILE", (i32, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<struct<"struct._IO_marker", opaque>>, ptr<struct<"struct._IO_FILE">>, i32, i32, i64, i16, i8, array<1 x i8>, ptr<i8>, i64, ptr<struct<"struct._IO_codecvt", opaque>>, ptr<struct<"struct._IO_wide_data", opaque>>, ptr<struct<"struct._IO_FILE">>, ptr<i8>, i64, i32, array<20 x i8>)>>>
      %45 = llvm.load %44 : !llvm.ptr<ptr<struct<"struct._IO_FILE", (i32, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<struct<"struct._IO_marker", opaque>>, ptr<struct<"struct._IO_FILE">>, i32, i32, i64, i16, i8, array<1 x i8>, ptr<i8>, i64, ptr<struct<"struct._IO_codecvt", opaque>>, ptr<struct<"struct._IO_wide_data", opaque>>, ptr<struct<"struct._IO_FILE">>, ptr<i8>, i64, i32, array<20 x i8>)>>>
      %46 = llvm.mlir.addressof @str4 : !llvm.ptr<array<2 x i8>>
      %47 = llvm.getelementptr %46[%3, %3] : (!llvm.ptr<array<2 x i8>>, !llvm.i64, !llvm.i64) -> !llvm.ptr<i8>
      %48 = llvm.call @fprintf(%45, %47) : (!llvm.ptr<struct<"struct._IO_FILE", (i32, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<struct<"struct._IO_marker", opaque>>, ptr<struct<"struct._IO_FILE">>, i32, i32, i64, i16, i8, array<1 x i8>, ptr<i8>, i64, ptr<struct<"struct._IO_codecvt", opaque>>, ptr<struct<"struct._IO_wide_data", opaque>>, ptr<struct<"struct._IO_FILE">>, ptr<i8>, i64, i32, array<20 x i8>)>>, !llvm.ptr<i8>) -> !llvm.i32
    }
    %35 = llvm.mlir.addressof @stderr : !llvm.ptr<ptr<struct<"struct._IO_FILE", (i32, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<struct<"struct._IO_marker", opaque>>, ptr<struct<"struct._IO_FILE">>, i32, i32, i64, i16, i8, array<1 x i8>, ptr<i8>, i64, ptr<struct<"struct._IO_codecvt", opaque>>, ptr<struct<"struct._IO_wide_data", opaque>>, ptr<struct<"struct._IO_FILE">>, ptr<i8>, i64, i32, array<20 x i8>)>>>
    %36 = llvm.load %35 : !llvm.ptr<ptr<struct<"struct._IO_FILE", (i32, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<struct<"struct._IO_marker", opaque>>, ptr<struct<"struct._IO_FILE">>, i32, i32, i64, i16, i8, array<1 x i8>, ptr<i8>, i64, ptr<struct<"struct._IO_codecvt", opaque>>, ptr<struct<"struct._IO_wide_data", opaque>>, ptr<struct<"struct._IO_FILE">>, ptr<i8>, i64, i32, array<20 x i8>)>>>
    %37 = llvm.mlir.addressof @str5 : !llvm.ptr<array<8 x i8>>
    %38 = llvm.getelementptr %37[%3, %3] : (!llvm.ptr<array<8 x i8>>, !llvm.i64, !llvm.i64) -> !llvm.ptr<i8>
    %39 = load %arg2[%15, %30] : memref<2000x2300xf64>
    %40 = llvm.mlir.cast %39 : f64 to !llvm.double
    %41 = llvm.call @fprintf(%36, %38, %40) : (!llvm.ptr<struct<"struct._IO_FILE", (i32, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<struct<"struct._IO_marker", opaque>>, ptr<struct<"struct._IO_FILE">>, i32, i32, i64, i16, i8, array<1 x i8>, ptr<i8>, i64, ptr<struct<"struct._IO_codecvt", opaque>>, ptr<struct<"struct._IO_wide_data", opaque>>, ptr<struct<"struct._IO_FILE">>, ptr<i8>, i64, i32, array<20 x i8>)>>, !llvm.ptr<i8>, !llvm.double) -> !llvm.i32
    %42 = addi %28, %c1_i32 : i32
    br ^bb3(%42 : i32)
  ^bb5:  // pred: ^bb3
    %43 = addi %13, %c1_i32 : i32
    br ^bb1(%43 : i32)
  }
  func private @S0(%arg0: memref<2000x2300xf64>, %arg1: index, %arg2: index, %arg3: f64) attributes {scop.stmt} {
    %0 = affine.load %arg0[symbol(%arg1), symbol(%arg2)] : memref<2000x2300xf64>
    %1 = mulf %0, %arg3 : f64
    affine.store %1, %arg0[symbol(%arg1), symbol(%arg2)] : memref<2000x2300xf64>
    return
  }
  func private @S1(%arg0: memref<2000x2300xf64>, %arg1: index, %arg2: index, %arg3: memref<2600x2300xf64>, %arg4: index, %arg5: f64, %arg6: memref<2000x2600xf64>) attributes {scop.stmt} {
    %0 = affine.load %arg0[symbol(%arg1), symbol(%arg2)] : memref<2000x2300xf64>
    %1 = affine.load %arg6[symbol(%arg1), symbol(%arg4)] : memref<2000x2600xf64>
    %2 = mulf %arg5, %1 : f64
    %3 = affine.load %arg3[symbol(%arg4), symbol(%arg2)] : memref<2600x2300xf64>
    %4 = mulf %2, %3 : f64
    %5 = addf %0, %4 : f64
    affine.store %5, %arg0[symbol(%arg1), symbol(%arg2)] : memref<2000x2300xf64>
    return
  }
  func private @kernel_gemm_new(%arg0: i32, %arg1: i32, %arg2: i32, %arg3: f64, %arg4: f64, %arg5: memref<2000x2300xf64>, %arg6: memref<2000x2600xf64>, %arg7: memref<2600x2300xf64>) {
    %0 = index_cast %arg0 : i32 to index
    %1 = index_cast %arg1 : i32 to index
    %2 = index_cast %arg2 : i32 to index
    affine.if #set0()[%0, %1] {
      affine.for %arg8 = 0 to #map0()[%0] {
        affine.for %arg9 = 0 to #map0()[%1] {
          affine.for %arg10 = #map1(%arg8) to min #map2(%arg8)[%0] {
            affine.for %arg11 = #map1(%arg9) to min #map2(%arg9)[%1] {
              call @S0(%arg5, %arg10, %arg11, %arg4) : (memref<2000x2300xf64>, index, index, f64) -> ()
            } {scop.point_loop}
          } {scop.point_loop}
        }
      }
      affine.if #set1()[%2] {
        affine.for %arg8 = 0 to #map0()[%0] {
          affine.for %arg9 = 0 to #map0()[%1] {
            affine.for %arg10 = 0 to #map0()[%2] {
              affine.for %arg11 = #map1(%arg8) to min #map2(%arg8)[%0] {
                affine.for %arg12 = #map1(%arg10) to min #map2(%arg10)[%2] {
                  affine.for %arg13 = #map1(%arg9) to min #map2(%arg9)[%1] {
                    call @S1(%arg5, %arg11, %arg13, %arg7, %arg12, %arg3, %arg6) : (memref<2000x2300xf64>, index, index, memref<2600x2300xf64>, index, f64, memref<2000x2600xf64>) -> ()
                  } {scop.point_loop}
                } {scop.point_loop}
              } {scop.point_loop}
            }
          }
        }
      }
    }
    return
  }
  func private @kernel_gemm__PE0(%arg0: memref<2000x2300xf64>, %arg1: f64, %arg2: index, %arg3: memref<2600x2300xf64>, %arg4: f64, %arg5: memref<2000x2600xf64>, %arg6: index, %arg7: index) {
    affine.for %arg8 = 0 to %arg7 {
      affine.for %arg9 = 0 to %arg2 {
        call @S0(%arg0, %arg8, %arg9, %arg1) : (memref<2000x2300xf64>, index, index, f64) -> ()
      } {scop.point_loop}
      affine.for %arg9 = 0 to %arg6 {
        affine.for %arg10 = 0 to %arg2 {
          call @S1(%arg0, %arg8, %arg10, %arg3, %arg9, %arg4, %arg5) : (memref<2000x2300xf64>, index, index, memref<2600x2300xf64>, index, f64, memref<2000x2600xf64>) -> ()
        } {scop.point_loop}
      } {scop.point_loop}
    } {scop.point_loop}
    return
  }
  func private @kernel_gemm_new__PE1(%arg0: memref<2000x2300xf64>, %arg1: f64, %arg2: index, %arg3: index, %arg4: index, %arg5: index) {
    affine.for %arg6 = #map1(%arg4) to min #map2(%arg4)[%arg5] {
      affine.for %arg7 = #map1(%arg2) to min #map2(%arg2)[%arg3] {
        call @S0(%arg0, %arg6, %arg7, %arg1) : (memref<2000x2300xf64>, index, index, f64) -> ()
      } {scop.point_loop}
    } {scop.point_loop}
    return
  }
  func private @kernel_gemm_new__PE2(%arg0: memref<2000x2300xf64>, %arg1: memref<2600x2300xf64>, %arg2: f64, %arg3: memref<2000x2600xf64>, %arg4: index, %arg5: index, %arg6: index, %arg7: index, %arg8: index, %arg9: index) {
    affine.for %arg10 = #map1(%arg8) to min #map2(%arg8)[%arg9] {
      affine.for %arg11 = #map1(%arg6) to min #map2(%arg6)[%arg7] {
        affine.for %arg12 = #map1(%arg4) to min #map2(%arg4)[%arg5] {
          call @S1(%arg0, %arg10, %arg12, %arg1, %arg11, %arg2, %arg3) : (memref<2000x2300xf64>, index, index, memref<2600x2300xf64>, index, f64, memref<2000x2600xf64>) -> ()
        } {scop.point_loop}
      } {scop.point_loop}
    } {scop.point_loop}
    return
  }
}

Test tiling dependent statements in PLUTO.

func @load_store_dep_tiling() {
  %A = alloc() : memref<64x64xf32>

  affine.for %i0 = 1 to 64 {
    affine.for %j0 = 1 to 64 {
      %i1 = affine.apply affine_map<(d0) -> (d0 - 1)>(%i0)
      %j1 = affine.apply affine_map<(d0) -> (d0 - 1)>(%j0)

      %0 = affine.load %A[%i0, %j1] : memref<64x64xf32>
      %1 = affine.load %A[%i1, %j0] : memref<64x64xf32>
      %2 = addf %0, %1 : f32
      affine.store %2, %A[%i0, %j0] : memref<64x64xf32>
    }
  }

  return
}

Pluto seems unable to process this type of input: https://groups.google.com/g/pluto-development/c/9mTvz5j5VxE
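For reference, the nest above writes A[i][j] from A[i][j-1] and A[i-1][j], so each iteration depends on its left and upper neighbours (dependence distance vectors (0, 1) and (1, 0)). A plain Python rendering of the same recurrence — a hypothetical illustration, just to make the access pattern explicit:

```python
# Python rendering of the loop nest above: each (i, j) reads the
# values written at (i, j-1) and (i-1, j), giving dependence
# distance vectors (0, 1) and (1, 0).
def load_store_dep(n=64):
    A = [[1.0] * n for _ in range(n)]
    for i in range(1, n):
        for j in range(1, n):
            A[i][j] = A[i][j - 1] + A[i - 1][j]
    return A
```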

Code cleanup on the IMPACT branch

We've got some new updates on the impact branch, but they contain loads of hacky things. We should gradually migrate them into master.

TODO:

  • Handle div/mod #8
  • New ScatTreeNode design.
  • Improve OslScop implementation
  • Directly generate OpenScop relation from FlatAffineConstraints
  • Update how we call Pluto (use pluto_schedule_prog)
  • Merge the latest reg2mem
  • Merge the latest extract-scop-stmt
  • Improve the ConvertToOpenScop pass and add more tests
  • Improve the ConvertFromOpenScop pass and add more tests
  • Refactor the symbol table passing before and after Pluto optimization.
  • Integrate Polybench test cases #53

Handle max/min reduction in clast_expr

Min and max are not among the operations that AffineExpr supports. We should return all of the min/max operands, create an AffineMap from them, and wrap its results in an affine.min or affine.max operation.

This means we should create a separate branch of the process method in AffineExprBuilder that can handle multiple return results.
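The intended lowering can be sketched in plain Python (hypothetical names, not Polymer's actual API): a min/max reduction becomes a map with one result per operand, and affine.min/affine.max selects among the results.

```python
# Sketch: a clast min/max reduction becomes an affine map with one
# result per operand; affine.min / affine.max then selects among them.
def lower_reduction(kind, operand_exprs):
    # operand_exprs: functions of the loop IVs, one per map result.
    select = min if kind == "min" else max
    def apply(*ivs):
        results = [e(*ivs) for e in operand_exprs]  # multi-result map
        return select(results)                      # affine.min / affine.max
    return apply

# e.g. min(2 * i + 3, 32)
bound = lower_reduction("min", [lambda i: 2 * i + 3, lambda i: 32])
```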

Handle duplication of for-loop structures

Sometimes Pluto may generate a Scop like this:

<OpenScop>

# =============================================== Global
# Language
C

# Context
CONTEXT
2 4 0 0 0 2
# e/i| P0   P1 |  1  
   1    1    0   -1    ## P0-1 >= 0
   1    0    1   -1    ## P1-1 >= 0

# Parameters are provided
1
<strings>
P0 P1
</strings>

# Number of statements
2

# =============================================== Statement 1
# Number of relations describing the statement:
4

# ----------------------------------------------  1.1 Domain
DOMAIN
6 7 3 0 0 2
# e/i| fk0  fk1  i0 | P0   P1 |  1  
   1    0    0    1    0    0    0    ## i0 >= 0
   1    0    0   -1    1    0   -1    ## -i0+P0-1 >= 0
   1  -32    0    1    0    0    0    ## -32*fk0+i0 >= 0
   1   32    0   -1    0    0   31    ## 32*fk0-i0+31 >= 0
   1    0  -32    0    0    0    0    ## -32*fk1 >= 0
   1    0   32    0    0    0   31    ## 32*fk1+31 >= 0

# ----------------------------------------------  1.2 Scattering
SCATTERING
8 15 8 3 0 2
# e/i| c1   c2   c3   c4   c5   c6   c7   c8 | fk0  fk1  i0 | P0   P1 |  1  
   0   -1    0    0    0    0    0    0    0    0    0    0    0    0    0    ## c1 == 0
   0    0   -1    0    0    0    0    0    0    1    0    0    0    0    0    ## c2 == fk0
   0    0    0   -1    0    0    0    0    0    0    1    0    0    0    0    ## c3 == fk1
   0    0    0    0   -1    0    0    0    0    0    0    0    0    0    0    ## c4 == 0
   0    0    0    0    0   -1    0    0    0    0    0    1    0    0    0    ## c5 == i0
   0    0    0    0    0    0   -1    0    0    0    0    0    0    0    0    ## c6 == 0
   0    0    0    0    0    0    0   -1    0    0    0    0    0    0    0    ## c7 == 0
   0    0    0    0    0    0    0    0   -1    0    0    0    0    0    0    ## c8 == 0

# ----------------------------------------------  1.3 Access
WRITE
2 9 2 3 0 2
# e/i| Arr  [1]| fk0  fk1  i0 | P0   P1 |  1  
   0   -1    0    0    0    0    0    0    1    ## Arr == A1
   0    0   -1    0    0    1    0    0    0    ## [1] == i0

READ
2 9 2 3 0 2
# e/i| Arr  [1]| fk0  fk1  i0 | P0   P1 |  1  
   0   -1    0    0    0    0    0    0    2    ## Arr == A2
   0    0   -1    0    0    1    0    0    0    ## [1] == i0

# ----------------------------------------------  1.4 Statement Extensions
# Number of Statement Extensions
1
<body>
# Number of original iterators
3
# List of original iterators
fk0 fk1 i0
# Statement body expression
S0(A1, 1, A2, 1, i0)
</body>

# =============================================== Statement 2
# Number of relations describing the statement:
4

# ----------------------------------------------  2.1 Domain
DOMAIN
10 9 5 0 0 2
# e/i| fk0  fk1  fk2  i0   i1 | P0   P1 |  1  
   1    0    0    0    1    0    0    0    0    ## i0 >= 0
   1    0    0    0   -1    0    1    0   -1    ## -i0+P0-1 >= 0
   1    0    0    0    0    1    0    0    0    ## i1 >= 0
   1    0    0    0    0   -1    0    1   -1    ## -i1+P1-1 >= 0
   1  -32    0    0    1    0    0    0    0    ## -32*fk0+i0 >= 0
   1   32    0    0   -1    0    0    0   31    ## 32*fk0-i0+31 >= 0
   1    0  -32    0    0    0    0    0    1    ## -32*fk1+1 >= 0
   1    0   32    0    0    0    0    0   30    ## 32*fk1+30 >= 0
   1    0    0  -32    0    1    0    0    0    ## -32*fk2+i1 >= 0
   1    0    0   32    0   -1    0    0   31    ## 32*fk2-i1+31 >= 0

# ----------------------------------------------  2.2 Scattering
SCATTERING
8 17 8 5 0 2
# e/i| c1   c2   c3   c4   c5   c6   c7   c8 | fk0  fk1  fk2  i0   i1 | P0   P1 |  1  
   0   -1    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    ## c1 == 0
   0    0   -1    0    0    0    0    0    0    1    0    0    0    0    0    0    0    ## c2 == fk0
   0    0    0   -1    0    0    0    0    0    0    1    0    0    0    0    0    0    ## c3 == fk1
   0    0    0    0   -1    0    0    0    0    0    0    1    0    0    0    0    0    ## c4 == fk2
   0    0    0    0    0   -1    0    0    0    0    0    0    1    0    0    0    0    ## c5 == i0
   0    0    0    0    0    0   -1    0    0    0    0    0    0    0    0    0    1    ## c6 == 1
   0    0    0    0    0    0    0   -1    0    0    0    0    0    1    0    0    0    ## c7 == i1
   0    0    0    0    0    0    0    0   -1    0    0    0    0    0    0    0    0    ## c8 == 0

# ----------------------------------------------  2.3 Access
WRITE
3 12 3 5 0 2
# e/i| Arr  [1]  [2]| fk0  fk1  fk2  i0   i1 | P0   P1 |  1  
   0   -1    0    0    0    0    0    0    0    0    0    3    ## Arr == A3
   0    0   -1    0    0    0    0    1    0    0    0    0    ## [1] == i0
   0    0    0   -1    0    0    0    0    1    0    0    0    ## [2] == i1

READ
2 11 2 5 0 2
# e/i| Arr  [1]| fk0  fk1  fk2  i0   i1 | P0   P1 |  1  
   0   -1    0    0    0    0    0    0    0    0    1    ## Arr == A1
   0    0   -1    0    0    0    1    0    0    0    0    ## [1] == i0

# ----------------------------------------------  2.4 Statement Extensions
# Number of Statement Extensions
1
<body>
# Number of original iterators
5
# List of original iterators
fk0 fk1 fk2 i0 i1
# Statement body expression
S1(A3, 2, A1, 1, i0, i1)
</body>

# =============================================== Extensions
<arrays>
# Number of arrays
3
# Mapping array-identifiers/array-names
3 A3
2 A2
1 A1
</arrays>

<scatnames>
t1 t2 t3 t4 t5 t6 t7 t8
</scatnames>

<loop>
# Number of loops
2
# ===========================================
# Loop number 1 
# Iterator name
t2
# Number of stmts
2
# Statement identifiers
1
2
# Private variables
lbv, ubvt3,t4,t5,t6,t7,t8
# Directive
1
# ===========================================
# Loop number 2 
# Iterator name
t7
# Number of stmts
1
# Statement identifiers
2
# Private variables
(null)
# Directive
4
</loop>

</OpenScop>
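For readers unfamiliar with the format: every constraint row above is `[e/i flag, coefficients..., constant]`, where flag 0 means the expression equals 0 and flag 1 means it is >= 0 — exactly what the `##` comments spell out. A small decoder (a hypothetical helper, just for illustration):

```python
def row_to_str(row, names):
    # OpenScop constraint row: [e/i flag, coefficients..., constant].
    # Flag 1 -> inequality (expr >= 0), flag 0 -> equality (expr == 0).
    flag, coeffs, const = row[0], row[1:-1], row[-1]
    terms = [f"{c}*{n}" for c, n in zip(coeffs, names) if c != 0]
    expr = " + ".join(terms) if terms else "0"
    if const != 0:
        expr += f" + {const}"
    return f"{expr} {'>= 0' if flag == 1 else '== 0'}"
```

For instance, the third DOMAIN row of statement 1 decodes back to `-32*fk0 + 1*i0 >= 0`, matching its `##` comment up to the printed coefficient of 1.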

When this is handled by CLooG, duplicated loop structures that iterate over the polyhedra may be generated. In that case, the same statement can appear at different places in the code, and different placements may use exactly the same names for their loop IVs. If so, our code that generates MLIR from the Scop cannot handle it well, since it relies on IV names being unique.

We might need to fundamentally change the design of the symbol table. Or maybe we need to add a new pass to make things easier to handle.

Also note that there are conflicts in the naming of the statements as well. For instance, the same S1 can appear at different places in the code due to Pluto and CLooG.
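One low-effort direction (a sketch, not something Polymer implements today) is to uniquify IV and statement names as they enter the symbol table, e.g. by appending a counter to repeated names:

```python
# Sketch: uniquify names before inserting them into the symbol table,
# so two CLooG loops that both call their IV "t2" (or two copies of
# statement "S1") get distinct keys.
def make_uniquer():
    seen = {}
    def unique(name):
        n = seen.get(name, 0)
        seen[name] = n + 1
        return name if n == 0 else f"{name}_{n}"
    return unique

unique = make_uniquer()
```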

Create a CI script

Besides doing the regular testing, this CI script will be the reference for installation steps on a clean system.

Allow partially optimising a function

Currently we need to optimise a function as a whole, but the desired way is to split the function into multiple parts and replace each part with an optimised version where possible. Otherwise, as in the example in #42, we cannot copy the code that gets the dimensionality.

Compiled Polybench code

SMALL refers to the dataset size.

SMALL.O0: without Polymer

SMALL.O0.tar.gz

SMALL.O1: with Polymer

SMALL.O1.tar.gz

Compilation command:

cd example/polybench
./compile -f SMALL -O0
./compile -f SMALL -O1

You can find SMALL.O0 and SMALL.O1 under example/polybench/tmp.

Generate runnable MLIR from polymer-opt

Currently we generate a single function that takes in block arguments. In order to make it compatible with polybench code in general, we should change the codegen behaviour.

The workflow should work as follows:

  1. The input MLIR code should contain a @main function that acts as the entry point. It should also have a set of affine functions that should be optimised by polymer-opt.
  2. Given the names of the entry point and the affine functions to be transformed (passed as command-line args), polymer-opt optimises each of the given affine functions and replaces the calls to them with the optimised versions.
  3. We can then perform some post optimisation and lower it down to LLVM.

For example, command like this should work properly:

# In the build/ directory
./bin/polymer-opt ../test/polymer-opt/PlutoTransform/matmul.mlir --pluto-opt=matmul | ../llvm/build/bin/mlir-opt --lower-affine --convert-scf-to-std --convert-std-to-llvm | ../llvm/build/bin/mlir-cpu-runner --entry-point-result=void -O3

TODO -

Properly handle loop with arbitrary bounds

Loops with arbitrary bounds are pretty common in real-world test cases.

Test code -

func @matmul(%A: memref<?x?xf32>, %B: memref<?x?xf32>, %C: memref<?x?xf32>) {
  %c0 = constant 0 : index
  %c1 = constant 1 : index

  %M = dim %A, %c0 : memref<?x?xf32>
  %N = dim %B, %c0 : memref<?x?xf32>
  %K = dim %A, %c1 : memref<?x?xf32>

  affine.for %i = 0 to %M {
    affine.for %j = 0 to %N {
      affine.for %k = 0 to %K {
        %0 = affine.load %A[%i, %k] : memref<?x?xf32>
        %1 = affine.load %B[%k, %j] : memref<?x?xf32>
        %2 = mulf %0, %1 : f32
        %3 = affine.load %C[%i, %j] : memref<?x?xf32>
        %4 = addf %2, %3 : f32
        affine.store %4, %C[%i, %j] : memref<?x?xf32>
      }
    }
  }

  return 
}

func @main() {
  %M = constant 64 : index
  %N = constant 64 : index
  %K = constant 64 : index

  %A = alloc(%M, %K) : memref<?x?xf32>
  %B = alloc(%N, %K) : memref<?x?xf32>
  %C = alloc(%M, %N) : memref<?x?xf32>

  call @matmul(%A, %B, %C)
   : (memref<?x?xf32>, memref<?x?xf32>, memref<?x?xf32>) -> ()

  return
}

Better compilation flow with PLUTO

It is difficult to link PLUTO as a static library. When it is linked as a shared library, we should update the LD_LIBRARY_PATH when running any polymer tool, which is troublesome.

Resolving context

Constants

If a parameter is given as a constant, we should be able to put the corresponding equation in the context relation.

%N = constant 32 : index
affine.for %i = 0 to %N {
}
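In OpenScop terms, a parameter known to be a constant contributes an equality row to the CONTEXT relation, encoding P - value == 0. A hypothetical helper sketching the row layout (e/i flag, parameter coefficients, constant):

```python
# Sketch: a parameter with a known constant value contributes an
# equality row [0, coeffs..., constant] to the CONTEXT relation,
# encoding  P - value == 0.
def context_row_for_constant(param_index, num_params, value):
    row = [0] * (num_params + 2)   # e/i flag, params..., constant
    row[0] = 0                     # 0 -> equality
    row[1 + param_index] = 1       # coefficient of the parameter
    row[-1] = -value               # P - value == 0
    return row
```

For the `%N = constant 32` example above, the first parameter would get the row `[0, 1, 0, -32]` (assuming two parameters), i.e. N - 32 == 0.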

Integrate polybench

Should add polybench into the source tree and add instructions for experimentation.

[BUG][Pluto] Failed to process modulo access relation

func private @S0(%A: memref<40x50xf32>, %t: index, %i: index) attributes {scop.stmt} {
  %cst = constant 0.25: f32
  %cst2 = constant 2.0 : f32

  %0 = affine.load %A[%t mod 2, %i + 1]: memref<40x50xf32>
  %1 = affine.load %A[%t mod 2, %i] : memref<40x50xf32>
  %2 = mulf %cst2, %1 : f32
  %3 = addf %0, %2 : f32
  %4 = affine.load %A[%t mod 2, %i - 1]: memref<40x50xf32>
  %5 = addf %3, %4 : f32
  %6 = mulf %cst, %5 : f32
  affine.store %6, %A[(%t + 1) mod 2, %i] : memref<40x50xf32>

  return
}


func @jacobi_diamond(%A: memref<40x50xf32>) {
  affine.for %t = 0 to 38 {
    affine.for %i = 1 to 47 {
      call @S0(%A, %t, %i): (memref<40x50xf32>, index, index) -> ()
    }
  }
  return
}
ninja && ./bin/polymer-opt ../test/polymer-opt/PlutoTransforms/jacobi-diamond.mlir -pluto-opt
ninja: no work to do.
polymer-opt: /mnt/ccnas2/bdp/rz3515/projects/phism/polygeist/llvm-project/llvm/include/llvm/ADT/ArrayRef.h:257: const T &llvm::ArrayRef<long>::operator[](size_t) const [T = long]: Assertion `Index < Length && "Invalid index!"' failed.
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace.
Stack dump:
0.      Program arguments: ./bin/polymer-opt ../test/polymer-opt/PlutoTransforms/jacobi-diamond.mlir -pluto-opt
 #0 0x0000000000b7f48a llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) /mnt/ccnas2/bdp/rz3515/projects/phism/polygeist/llvm-project/llvm/lib/Support/Unix/Signals.inc:565:11
 #1 0x0000000000b7f65b PrintStackTraceSignalHandler(void*) /mnt/ccnas2/bdp/rz3515/projects/phism/polygeist/llvm-project/llvm/lib/Support/Unix/Signals.inc:632:1
 #2 0x0000000000b7dc2b llvm::sys::RunSignalHandlers() /mnt/ccnas2/bdp/rz3515/projects/phism/polygeist/llvm-project/llvm/lib/Support/Signals.cpp:96:5
 #3 0x0000000000b7fda1 SignalHandler(int) /mnt/ccnas2/bdp/rz3515/projects/phism/polygeist/llvm-project/llvm/lib/Support/Unix/Signals.inc:407:1
 #4 0x00007fb558686630 __restore_rt (/lib64/libpthread.so.0+0xf630)
 #5 0x00007fb555e0e3d7 raise (/lib64/libc.so.6+0x363d7)
 #6 0x00007fb555e0fac8 abort (/lib64/libc.so.6+0x37ac8)
 #7 0x00007fb555e071a6 __assert_fail_base (/lib64/libc.so.6+0x2f1a6)
 #8 0x00007fb555e07252 (/lib64/libc.so.6+0x2f252)
 #9 0x0000000000f7d4ba llvm::ArrayRef<long>::operator[](unsigned long) const /mnt/ccnas2/bdp/rz3515/projects/phism/polygeist/llvm-project/llvm/include/llvm/ADT/ArrayRef.h:0:7
#10 0x00000000011c45ff polymer::OslScop::addRelation(int, int, int, int, int, int, int, int, llvm::ArrayRef<long>, llvm::ArrayRef<long>) /mnt/ccnas2/bdp/rz3515/projects/phism/polymer/build/../lib/Support/OslScop.cc:128:43
#11 0x00000000011c588b polymer::OslScop::addAccessRelation(int, bool, mlir::Value, mlir::AffineValueMap&, mlir::FlatAffineValueConstraints&) /mnt/ccnas2/bdp/rz3515/projects/phism/polymer/build/../lib/Support/OslScop.cc:273:1
#12 0x00000000011c20e0 (anonymous namespace)::OslScopBuilder::build(mlir::FuncOp)::$_0::operator()(mlir::Operation*) const /mnt/ccnas2/bdp/rz3515/projects/phism/polymer/build/../lib/Target/OpenScop/ConvertToOpenScop.cc:132:7
#13 0x00000000011c2000 void llvm::function_ref<void (mlir::Operation*)>::callback_fn<(anonymous namespace)::OslScopBuilder::build(mlir::FuncOp)::$_0>(long, mlir::Operation*) /mnt/ccnas2/bdp/rz3515/projects/phism/polygeist/llvm-project/llvm/include/llvm/ADT/STLExtras.h:177:5
#14 0x0000000000f299bc llvm::function_ref<void (mlir::Operation*)>::operator()(mlir::Operation*) const /mnt/ccnas2/bdp/rz3515/projects/phism/polygeist/llvm-project/llvm/include/llvm/ADT/STLExtras.h:200:5
#15 0x000000000111ba26 mlir::detail::walk(mlir::Operation*, llvm::function_ref<void (mlir::Operation*)>, mlir::WalkOrder) /mnt/ccnas2/bdp/rz3515/projects/phism/polygeist/llvm-project/mlir/lib/IR/Visitors.cpp:68:1
#16 0x000000000111b9d3 mlir::detail::walk(mlir::Operation*, llvm::function_ref<void (mlir::Operation*)>, mlir::WalkOrder) /mnt/ccnas2/bdp/rz3515/projects/phism/polygeist/llvm-project/mlir/lib/IR/Visitors.cpp:61:27
#17 0x00000000011c1f95 std::enable_if<llvm::is_one_of<mlir::Operation*, mlir::Operation*, mlir::Region*, mlir::Block*>::value, void>::type mlir::detail::walk<(mlir::WalkOrder)1, (anonymous namespace)::OslScopBuilder::build(mlir::FuncOp)::$_0, mlir::Operation*, void>(mlir::Operation*, (anonymous namespace)::OslScopBuilder::build(mlir::FuncOp)::$_0&&) /mnt/ccnas2/bdp/rz3515/projects/phism/polygeist/llvm-project/mlir/include/mlir/IR/Visitors.h:134:3
#18 0x00000000011c1f30 void mlir::Operation::walk<(mlir::WalkOrder)1, (anonymous namespace)::OslScopBuilder::build(mlir::FuncOp)::$_0, void>((anonymous namespace)::OslScopBuilder::build(mlir::FuncOp)::$_0&&) /mnt/ccnas2/bdp/rz3515/projects/phism/polygeist/llvm-project/mlir/include/mlir/IR/Operation.h:522:5
#19 0x00000000011c1bc3 void mlir::OpState::walk<(mlir::WalkOrder)1, (anonymous namespace)::OslScopBuilder::build(mlir::FuncOp)::$_0, void>((anonymous namespace)::OslScopBuilder::build(mlir::FuncOp)::$_0&&) /mnt/ccnas2/bdp/rz3515/projects/phism/polygeist/llvm-project/mlir/include/mlir/IR/OpDefinition.h:161:5
#20 0x00000000011c10b8 (anonymous namespace)::OslScopBuilder::build(mlir::FuncOp) /mnt/ccnas2/bdp/rz3515/projects/phism/polymer/build/../lib/Target/OpenScop/ConvertToOpenScop.cc:135:11
#21 0x00000000011c0d57 polymer::createOpenScopFromFuncOp(mlir::FuncOp, polymer::OslSymbolTable&) /mnt/ccnas2/bdp/rz3515/projects/phism/polymer/build/../lib/Target/OpenScop/ConvertToOpenScop.cc:225:27
#22 0x000000000111d3e5 plutoTransform(mlir::FuncOp, mlir::OpBuilder&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, bool, int, int, bool) /mnt/ccnas2/bdp/rz3515/projects/phism/polymer/build/../lib/Transforms/PlutoTransform.cc:80:8
#23 0x000000000111cee8 (anonymous namespace)::PlutoTransformPass::runOnOperation() /mnt/ccnas2/bdp/rz3515/projects/phism/polymer/build/../lib/Transforms/PlutoTransform.cc:164:28
#24 0x0000000000e8eb48 mlir::detail::OpToOpPassAdaptor::run(mlir::Pass*, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int) /mnt/ccnas2/bdp/rz3515/projects/phism/polygeist/llvm-project/mlir/lib/Pass/Pass.cpp:386:21
#25 0x0000000000e8f10d mlir::detail::OpToOpPassAdaptor::runPipeline(llvm::iterator_range<llvm::pointee_iterator<std::unique_ptr<mlir::Pass, std::default_delete<mlir::Pass> >*, mlir::Pass> >, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int, mlir::PassInstrumentor*, mlir::PassInstrumentation::PipelineParentInfo const*) /mnt/ccnas2/bdp/rz3515/projects/phism/polygeist/llvm-project/mlir/lib/Pass/Pass.cpp:445:16
#26 0x0000000000e907be mlir::PassManager::runPasses(mlir::Operation*, mlir::AnalysisManager) /mnt/ccnas2/bdp/rz3515/projects/phism/polygeist/llvm-project/mlir/lib/Pass/Pass.cpp:689:10
#27 0x0000000000e906a0 mlir::PassManager::run(mlir::Operation*) /mnt/ccnas2/bdp/rz3515/projects/phism/polygeist/llvm-project/mlir/lib/Pass/Pass.cpp:669:60
#28 0x0000000000e35c49 performActions(llvm::raw_ostream&, bool, bool, llvm::SourceMgr&, mlir::MLIRContext*, mlir::PassPipelineCLParser const&) /mnt/ccnas2/bdp/rz3515/projects/phism/polygeist/llvm-project/mlir/lib/Support/MlirOptMain.cpp:84:17
#29 0x0000000000e34b78 processBuffer(llvm::raw_ostream&, std::unique_ptr<llvm::MemoryBuffer, std::default_delete<llvm::MemoryBuffer> >, bool, bool, bool, bool, mlir::PassPipelineCLParser const&, mlir::DialectRegistry&) /mnt/ccnas2/bdp/rz3515/projects/phism/polygeist/llvm-project/mlir/lib/Support/MlirOptMain.cpp:120:12
#30 0x0000000000e34996 mlir::MlirOptMain(llvm::raw_ostream&, std::unique_ptr<llvm::MemoryBuffer, std::default_delete<llvm::MemoryBuffer> >, mlir::PassPipelineCLParser const&, mlir::DialectRegistry&, bool, bool, bool, bool, bool) /mnt/ccnas2/bdp/rz3515/projects/phism/polygeist/llvm-project/mlir/lib/Support/MlirOptMain.cpp:157:10
#31 0x0000000000a5549e main /mnt/ccnas2/bdp/rz3515/projects/phism/polymer/build/../tools/polymer-opt/polymer-opt.cc:117:17
#32 0x00007fb555dfa555 __libc_start_main (/lib64/libc.so.6+0x22555)
#33 0x0000000000a55029 _start (./bin/polymer-opt+0xa55029)
[1]    160465 abort (core dumped)  ./bin/polymer-opt ../test/polymer-opt/PlutoTransforms/jacobi-diamond.mlir 

Issues when sum reduction is the 2nd operand of min/max

This is a tricky bug: if a sum reduction appears as the 2nd operand, this assertion will fail

assert(affExprs.size() == 1 &&
       "A single affine expr should be returned "
       "after processing an expr in reduction.");

Instead of checking that there is exactly one expr, we should check that exactly one expr was newly added.

We should change which affExpr to update as well here -

// TODO: deal with negative terms.
affExprs[0] = affExprs[0] + currExprs[0];
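The proposed fix can be sketched in plain Python with hypothetical names (the real code lives in AffineExprBuilder): record the size before processing an operand, assert exactly one expr was appended, and fold the new term into that freshly added slot rather than always into index 0.

```python
# Sketch of the fix: record the size before processing an operand,
# assert exactly one expr was appended, and fold the sum into the
# newly added expr instead of always into affExprs[0].
def process_sum_operand(aff_exprs, process):
    size_before = len(aff_exprs)
    process(aff_exprs)  # may append new exprs to aff_exprs
    added = len(aff_exprs) - size_before
    assert added == 1, "a single affine expr should be newly added"
    if size_before >= 1:
        # combine the previous partial sum with the new term, in place
        new_term = aff_exprs.pop()
        aff_exprs[-1] = aff_exprs[-1] + new_term
```

Here ints stand in for AffineExprs: with `aff_exprs = [100, 5]` (100 being the untouched 1st min/max operand), appending a term 7 folds into the 2nd operand only, leaving `[100, 12]`.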

Min/max in loop bounds

If a clast min/max expr is used in for-loop bounds, Polymer will first create an affine.min/max op, which results in values defined by affine.min/max being used as induction variables. This is prohibited.

A workaround is, in this function:

LogicalResult
Importer::getAffineLoopBound(clast_expr *expr,
                             llvm::SmallVectorImpl<mlir::Value> &operands,
                             AffineMap &affMap, bool isUpper) {
  AffineExprBuilder builder(context, symTable, &symbolTable, scop, options);
  SmallVector<AffineExpr, 4> boundExprs;
  // Build the AffineExpr for the loop bound.
  if (failed(builder.process(expr, boundExprs)))
    return failure();
  // If looking at the upper bound, we should add 1 to all of them.
  if (isUpper)
    for (auto &expr : boundExprs)
      expr = expr + b.getAffineConstantExpr(1);
  // Insert dim operands.
  unsigned numDims = builder.dimNames.size();
  unsigned numSymbols = builder.symbolNames.size();
  operands.resize(numDims + numSymbols);
  for (const auto &it : builder.dimNames) {
    if (auto iv = symbolTable[it.first()]) {
      operands[it.second] = iv;
    } else {
      llvm::errs() << "Dim " << it.first()
                   << " cannot be recognized as a value.\n";
      return failure();
    }
  }
  // Create or get BlockArgument for the symbols. We assume all symbols come
  // from the BlockArgument of the generated function.
  for (const auto &it : builder.symbolNames) {
    mlir::Value operand = symbolTable[it.first()];
    assert(operand != nullptr);
    operands[it.second + numDims] = operand;
  }
  // Create the AffineMap for loop bound.
  affMap = AffineMap::get(numDims, numSymbols, boundExprs, context);
  return success();
}

Before we process the expr through the builder, we should first check whether it is a min/max reduction; if so, we process its operands individually and use all of them as affine map results.

This might be tricky if there are nested min/max exprs, but we haven't seen that in practice, so we will ignore it for now.
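The workaround can be sketched as follows (Python pseudocode mirroring getAffineLoopBound above; the tuple encoding of clast exprs is a stand-in, not the real clast API):

```python
def get_affine_loop_bound(expr, is_upper=False):
    # expr is either a plain affine expr (an int stands in for an
    # AffineExpr here) or a ("min" | "max", [operands]) reduction node.
    if isinstance(expr, tuple) and expr[0] in ("min", "max"):
        # Process each operand individually; every one becomes a
        # result of the same multi-result AffineMap, so no affine.min
        # or affine.max value needs to be materialised first.
        bound_exprs = list(expr[1])
    else:
        bound_exprs = [expr]
    if is_upper:
        # affine.for upper bounds are exclusive, so add 1 to each.
        bound_exprs = [e + 1 for e in bound_exprs]
    return bound_exprs  # becomes the result list of a single AffineMap
```

An affine.for op then takes the max (for lower bounds) or min (for upper bounds) over the map's results directly, which keeps the induction variable legal.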

(P.S., this only appears after we propagate constants to the loop domain. Our previous successful results are still valid - in those cases the constants were still undecided.)

Abandon InvariantScopTransform

It won't be very necessary to have it once the PlutoTransform is stable and easy to debug. Maintaining InvariantScopTransform would be a burden.

The instructions in the README seem to mismatch the source code.

Hi, after building polymer-opt at the following commit, I try to execute (matmul.mlir is from /test/archive/polymer-translate/export-scop/matmul.mlir):

./bin/polymer-opt -pluto-opt matmul.mlir

It produces the following error:

./matmul.mlir:4:8: error: custom op 'alloc' is unknown
%A = alloc() : memref<64x64xf32>
^

Then I change all "alloc" to "memref.alloc".

Then, executing the command as the README says, it doesn't take effect and the output is the same as the source.

Only after executing the command like:

polymer-opt -debug -extract-scop-stmt -pluto-opt matmul.mlir

does the Pluto transform seem to be applied:

#map0 = affine_map<(d0) -> (d0 * 32)>
#map1 = affine_map<(d0) -> (d0 * 32 + 32)>
module {
  func private @s0(%arg0: index, %arg1: index, %arg2: memref<64x64xf32>, %arg3: index, %arg4: memref<64x64xf32>, %arg5: memref<64x64xf32>) attributes {scop.stmt} {
    %0 = affine.load %arg5[symbol(%arg0), symbol(%arg3)] : memref<64x64xf32>
    %1 = affine.load %arg4[symbol(%arg3), symbol(%arg1)] : memref<64x64xf32>
    %2 = mulf %0, %1 : f32
    %3 = affine.load %arg2[symbol(%arg0), symbol(%arg1)] : memref<64x64xf32>
    %4 = addf %2, %3 : f32
    affine.store %4, %arg2[symbol(%arg0), symbol(%arg1)] : memref<64x64xf32>
    return
  }
  func @Matmul() {
    %0 = memref.alloc() : memref<64x64xf32>
    %1 = memref.alloc() : memref<64x64xf32>
    %2 = memref.alloc() : memref<64x64xf32>
    affine.for %arg0 = 0 to 2 {
      affine.for %arg1 = 0 to 2 {
        affine.for %arg2 = 0 to 2 {
          affine.for %arg3 = #map0(%arg0) to #map1(%arg0) {
            affine.for %arg4 = #map0(%arg2) to #map1(%arg2) {
              affine.for %arg5 = #map0(%arg1) to #map1(%arg1) {
                call @s0(%arg3, %arg5, %0, %arg4, %1, %2) : (index, index, memref<64x64xf32>, index, memref<64x64xf32>, memref<64x64xf32>) -> ()
              }
            }
          }
        }
      }
    }
    return
  }
}

There may be some mismatch between the source code and the test, so I am filing an issue.

commit 76d9c2c (origin/main, origin/HEAD)
Merge: 4b8e998 7756df4
Author: Ruizhe Zhao [email protected]
Date: Tue May 25 16:32:48 2021 +0000

Merge branch 'main' of github.com:kumasento/polymer into main

[Benchmark][AES] handle a mixture of affine and non-affine code

Input code -

func @encrypt(%arg0: memref<?x16xi32>, %arg1: memref<?xi32>) attributes {llvm.linkage = #llvm.linkage<external>} {
  %c1_i32 = arith.constant 1 : i32
  %c4_i32 = arith.constant 4 : i32
  %c15_i32 = arith.constant 15 : i32
  %c8_i32 = arith.constant 8 : i32
  %c283_i32 = arith.constant 283 : i32
  %0 = memref.alloca() : memref<1024xi32>
  affine.for %arg2 = 1 to 5 {
    affine.for %arg3 = 0 to 16 {
      %1 = affine.load %arg1[%arg3 * 4] : memref<?xi32>
      %2 = arith.shrsi %1, %c4_i32 : i32
      %3 = arith.index_cast %2 : i32 to index
      %4 = arith.andi %1, %c15_i32 : i32
      %5 = arith.index_cast %4 : i32 to index
      %6 = memref.load %arg0[%3, %5] : memref<?x16xi32>
      affine.store %6, %arg1[%arg3 * 4] : memref<?xi32>
    }
    affine.for %arg3 = 0 to 1023 {
      %1 = affine.load %arg1[%arg3] : memref<?xi32>
      %2 = arith.shli %1, %c1_i32 : i32
      affine.store %2, %0[%arg3] : memref<1024xi32>
      %3 = arith.shrsi %2, %c8_i32 : i32
      %4 = arith.cmpi eq, %3, %c1_i32 : i32
      scf.if %4 {
        %10 = arith.xori %2, %c283_i32 : i32
        affine.store %10, %0[%arg3] : memref<1024xi32>
      }
      %5 = affine.load %arg1[%arg3 + 1] : memref<?xi32>
      %6 = arith.shli %5, %c1_i32 : i32
      %7 = arith.xori %5, %6 : i32
      %8 = arith.shrsi %7, %c8_i32 : i32
      %9 = arith.cmpi eq, %8, %c1_i32 : i32
      scf.if %9 {
        %10 = arith.xori %7, %c283_i32 : i32
        %11 = affine.load %0[%arg3] : memref<1024xi32>
        %12 = arith.xori %11, %10 : i32
        affine.store %12, %0[%arg3] : memref<1024xi32>
      } else {
        %10 = affine.load %0[%arg3] : memref<1024xi32>
        %11 = arith.xori %10, %7 : i32
        affine.store %11, %0[%arg3] : memref<1024xi32>
      }
    }
    affine.for %arg3 = 0 to 1024 {
      %1 = affine.load %0[%arg3] : memref<1024xi32>
      affine.store %1, %arg1[%arg3] : memref<?xi32>
    }
  }
  return
}

Investigating the performance difference between Pluto and Polymer on seidel-2d.

UPDATE: There is a tentative conclusion on this matter: to make the Pluto version achieve the same performance as the Polymer version, it should use i64 for its induction variables and boundaries (which removes loads of sext and trunc instructions), and -O3 should be applied both for LLVM-IR emission and for final executable compilation (previously we only used -O3 when emitting the LLVM IR).

Numbers can be found at: #71 (comment)


This issue is an informal report on the whole investigation procedure.

Setup

To reproduce:

git checkout debug-seidel-2d

cd example/polybench
./eval-perf -f ./seidel-2d-debug/seidel-2d/seidel-2d.c

seidel-2d-debug contains the EXTRALARGE version of the seidel-2d benchmark code. Running eval-perf compiles and runs the given code with both Pluto and Polymer; it outputs the overall run time and keeps the important intermediate results.

The Pluto CLI (polycc) and Polymer use the same options, i.e., no parallelization, vectorization, or unroll-and-jam.

Vectorization is disabled at the clang level using -fno-vectorize (you may investigate its effect by checking the LLVM IR code).

Only one -O3 is applied while generating the LLVM IR.

The host machine doesn't have any of its multi-core configurations disabled, e.g., hyper-threading config stays the same.

Notice the difference

After running the eval-perf script given above, we have:

Pluto run time:   186.504164
Polymer run time: 134.634325

The performance gap is huge.

Check the CLooG code

We would normally check the schedules first, but since the CLooG code from both tools is pretty compact, we skipped the schedule comparison and looked directly at the generated CLooG code.

From Polymer

if ((P0 >= 1) && (P1 >= 3)) {
  for (t1=0;t1<=floord(P0-1,32);t1++) {
    for (t2=t1;t2<=min(floord(P0+P1-3,32),floord(32*t1+P1+29,32));t2++) {
      for (t3=max(ceild(64*t2-P1-28,32),t1+t2);t3<=min(min(min(min(floord(P0+P1-3,16),floord(32*t1+P1+29,16)),floord(64*t2+P1+59,32)),floord(32*t1+32*t2+P1+60,32)),floord(32*t2+P0+P1+28,32));t3++) {
        for (t4=max(max(max(32*t1,32*t2-P1+2),16*t3-P1+2),-32*t2+32*t3-P1-29);t4<=min(min(min(min(P0-1,32*t1+31),32*t2+30),16*t3+14),-32*t2+32*t3+30);t4++) {
          for (t5=max(max(32*t2,t4+1),32*t3-t4-P1+2);t5<=min(min(32*t2+31,32*t3-t4+30),t4+P1-2);t5++) {
            for (t6=max(32*t3,t4+t5+1);t6<=min(32*t3+31,t4+t5+P1-2);t6++) {
              S0(-t4+t5, -t4-t5+t6, t4)
            }
          }
        }
      }
    }
  }
}

From Pluto

if ((_PB_N >= 3) && (_PB_TSTEPS >= 1)) {
  for (t1=0;t1<=floord(_PB_TSTEPS-1,32);t1++) {
    for (t2=t1;t2<=min(floord(_PB_TSTEPS+_PB_N-3,32),floord(32*t1+_PB_N+29,32));t2++) {
      for (t3=max(ceild(64*t2-_PB_N-28,32),t1+t2);t3<=min(min(min(min(floord(_PB_TSTEPS+_PB_N-3,16),floord(32*t1+_PB_N+29,16)),floord(64*t2+_PB_N+59,32)),floord(32*t1+32*t2+_PB_N+60,32)),floord(32*t2+_PB_TSTEPS+_PB_N+28,32));t3++) {
        for (t4=max(max(max(32*t1,32*t2-_PB_N+2),16*t3-_PB_N+2),-32*t2+32*t3-_PB_N-29);t4<=min(min(min(min(_PB_TSTEPS-1,32*t1+31),32*t2+30),16*t3+14),-32*t2+32*t3+30);t4++) {
          for (t5=max(max(32*t2,t4+1),32*t3-t4-_PB_N+2);t5<=min(min(32*t2+31,32*t3-t4+30),t4+_PB_N-2);t5++) {
            for (t6=max(32*t3,t4+t5+1);t6<=min(32*t3+31,t4+t5+_PB_N-2);t6++) {
              A[(-t4+t5)][(-t4-t5+t6)] = (A[(-t4+t5)-1][(-t4-t5+t6)-1] + A[(-t4+t5)-1][(-t4-t5+t6)] + A[(-t4+t5)-1][(-t4-t5+t6)+1] + A[(-t4+t5)][(-t4-t5+t6)-1] + A[(-t4+t5)][(-t4-t5+t6)] + A[(-t4+t5)][(-t4-t5+t6)+1] + A[(-t4+t5)+1][(-t4-t5+t6)-1] + A[(-t4+t5)+1][(-t4-t5+t6)] + A[(-t4+t5)+1][(-t4-t5+t6)+1])/SCALAR_VAL(9.0);
            }
          }
        }
      }
    }
  }
}

I won't say I'm very focused when comparing these two, but as far as I can tell, they are identical regarding the nested loops.

Check the LLVM IR code

Now we move into the dark domain.

We check two different things: the complicated loop boundary calculation and the loop body.

Loop body

First of all, without additional flags, the Pluto result uses vectorization (i.e., -fno-vectorize is not applied), and its run time is 200.756374, about 7.6% slower than the version without vectorization (shown above).

Now let's have a closer look at the loop body:

Polymer:

define void @S0(double* %0, double* %1, i64 %2, i64 %3, i64 %4, i64 %5, i64 %6, i64 %7, i64 %8) !dbg !170 {
%10 = insertvalue { double*, double*, i64, [2 x i64], [2 x i64] } undef, double* %0, 0, !dbg !171
%11 = insertvalue { double*, double*, i64, [2 x i64], [2 x i64] } %10, double* %1, 1, !dbg !173
%12 = insertvalue { double*, double*, i64, [2 x i64], [2 x i64] } %11, i64 %2, 2, !dbg !174
%13 = insertvalue { double*, double*, i64, [2 x i64], [2 x i64] } %12, i64 %3, 3, 0, !dbg !175
%14 = insertvalue { double*, double*, i64, [2 x i64], [2 x i64] } %13, i64 %5, 4, 0, !dbg !176
%15 = insertvalue { double*, double*, i64, [2 x i64], [2 x i64] } %14, i64 %4, 3, 1, !dbg !177
%16 = insertvalue { double*, double*, i64, [2 x i64], [2 x i64] } %15, i64 %6, 4, 1, !dbg !178
%17 = add i64 %7, -1, !dbg !179
%18 = add i64 %8, -1, !dbg !180
%19 = extractvalue { double*, double*, i64, [2 x i64], [2 x i64] } %16, 1, !dbg !181
%20 = mul i64 %17, 4000, !dbg !182
%21 = add i64 %20, %18, !dbg !183
%22 = getelementptr double, double* %19, i64 %21, !dbg !184
%23 = load double, double* %22, align 8, !dbg !185
%24 = add i64 %7, -1, !dbg !186
%25 = extractvalue { double*, double*, i64, [2 x i64], [2 x i64] } %16, 1, !dbg !187
%26 = mul i64 %24, 4000, !dbg !188
%27 = add i64 %26, %8, !dbg !189
%28 = getelementptr double, double* %25, i64 %27, !dbg !190
%29 = load double, double* %28, align 8, !dbg !191
%30 = fadd double %23, %29, !dbg !192
%31 = add i64 %7, -1, !dbg !193
%32 = add i64 %8, 1, !dbg !194
%33 = extractvalue { double*, double*, i64, [2 x i64], [2 x i64] } %16, 1, !dbg !195
%34 = mul i64 %31, 4000, !dbg !196
%35 = add i64 %34, %32, !dbg !197
%36 = getelementptr double, double* %33, i64 %35, !dbg !198
%37 = load double, double* %36, align 8, !dbg !199
%38 = fadd double %30, %37, !dbg !200
%39 = add i64 %8, -1, !dbg !201
%40 = extractvalue { double*, double*, i64, [2 x i64], [2 x i64] } %16, 1, !dbg !202
%41 = mul i64 %7, 4000, !dbg !203
%42 = add i64 %41, %39, !dbg !204
%43 = getelementptr double, double* %40, i64 %42, !dbg !205
%44 = load double, double* %43, align 8, !dbg !206
%45 = fadd double %38, %44, !dbg !207
%46 = extractvalue { double*, double*, i64, [2 x i64], [2 x i64] } %16, 1, !dbg !208
%47 = mul i64 %7, 4000, !dbg !209
%48 = add i64 %47, %8, !dbg !210
%49 = getelementptr double, double* %46, i64 %48, !dbg !211
%50 = load double, double* %49, align 8, !dbg !212
%51 = fadd double %45, %50, !dbg !213
%52 = add i64 %8, 1, !dbg !214
%53 = extractvalue { double*, double*, i64, [2 x i64], [2 x i64] } %16, 1, !dbg !215
%54 = mul i64 %7, 4000, !dbg !216
%55 = add i64 %54, %52, !dbg !217
%56 = getelementptr double, double* %53, i64 %55, !dbg !218
%57 = load double, double* %56, align 8, !dbg !219
%58 = fadd double %51, %57, !dbg !220
%59 = add i64 %7, 1, !dbg !221
%60 = add i64 %8, -1, !dbg !222
%61 = extractvalue { double*, double*, i64, [2 x i64], [2 x i64] } %16, 1, !dbg !223
%62 = mul i64 %59, 4000, !dbg !224
%63 = add i64 %62, %60, !dbg !225
%64 = getelementptr double, double* %61, i64 %63, !dbg !226
%65 = load double, double* %64, align 8, !dbg !227
%66 = fadd double %58, %65, !dbg !228
%67 = add i64 %7, 1, !dbg !229
%68 = extractvalue { double*, double*, i64, [2 x i64], [2 x i64] } %16, 1, !dbg !230
%69 = mul i64 %67, 4000, !dbg !231
%70 = add i64 %69, %8, !dbg !232
%71 = getelementptr double, double* %68, i64 %70, !dbg !233
%72 = load double, double* %71, align 8, !dbg !234
%73 = fadd double %66, %72, !dbg !235
%74 = add i64 %7, 1, !dbg !236
%75 = add i64 %8, 1, !dbg !237
%76 = extractvalue { double*, double*, i64, [2 x i64], [2 x i64] } %16, 1, !dbg !238
%77 = mul i64 %74, 4000, !dbg !239
%78 = add i64 %77, %75, !dbg !240
%79 = getelementptr double, double* %76, i64 %78, !dbg !241
%80 = load double, double* %79, align 8, !dbg !242
%81 = fadd double %73, %80, !dbg !243
%82 = fdiv double %81, 9.000000e+00, !dbg !244
%83 = extractvalue { double*, double*, i64, [2 x i64], [2 x i64] } %16, 1, !dbg !245
%84 = mul i64 %7, 4000, !dbg !246
%85 = add i64 %84, %8, !dbg !247
%86 = getelementptr double, double* %83, i64 %85, !dbg !248
store double %82, double* %86, align 8, !dbg !249
ret void, !dbg !250
}

Pluto:

for.body1466.i: ; preds = %for.body1466.i, %for.body1466.lr.ph.i
%indvars.iv69.i = phi i64 [ %85, %for.body1466.lr.ph.i ], [ %indvars.iv.next70.i, %for.body1466.i ]
%92 = trunc i64 %indvars.iv69.i to i32
%add1472.i = sub i32 %92, %91
%sub1473.i = add nsw i32 %add1472.i, -1
%idxprom1474.i = sext i32 %sub1473.i to i64
%arrayidx1475.i = getelementptr inbounds [4000 x double], [4000 x double]* %arraydecay, i64 %87, i64 %idxprom1474.i
%93 = load double, double* %arrayidx1475.i, align 8, !tbaa !2
%idxprom1484.i = sext i32 %add1472.i to i64
%arrayidx1485.i = getelementptr inbounds [4000 x double], [4000 x double]* %arraydecay, i64 %87, i64 %idxprom1484.i
%94 = load double, double* %arrayidx1485.i, align 8, !tbaa !2
%add1486.i = fadd double %93, %94
%add1495.i = add nsw i32 %add1472.i, 1
%idxprom1496.i = sext i32 %add1495.i to i64
%arrayidx1497.i = getelementptr inbounds [4000 x double], [4000 x double]* %arraydecay, i64 %87, i64 %idxprom1496.i
%95 = load double, double* %arrayidx1497.i, align 8, !tbaa !2
%add1498.i = fadd double %add1486.i, %95
%arrayidx1508.i = getelementptr inbounds [4000 x double], [4000 x double]* %arraydecay, i64 %86, i64 %idxprom1474.i
%96 = load double, double* %arrayidx1508.i, align 8, !tbaa !2
%add1509.i = fadd double %add1498.i, %96
%arrayidx1518.i = getelementptr inbounds [4000 x double], [4000 x double]* %arraydecay, i64 %86, i64 %idxprom1484.i
%97 = load double, double* %arrayidx1518.i, align 8, !tbaa !2
%add1519.i = fadd double %add1509.i, %97
%arrayidx1529.i = getelementptr inbounds [4000 x double], [4000 x double]* %arraydecay, i64 %86, i64 %idxprom1496.i
%98 = load double, double* %arrayidx1529.i, align 8, !tbaa !2
%add1530.i = fadd double %add1519.i, %98
%arrayidx1541.i = getelementptr inbounds [4000 x double], [4000 x double]* %arraydecay, i64 %89, i64 %idxprom1474.i
%99 = load double, double* %arrayidx1541.i, align 8, !tbaa !2
%add1542.i = fadd double %add1530.i, %99
%arrayidx1552.i = getelementptr inbounds [4000 x double], [4000 x double]* %arraydecay, i64 %89, i64 %idxprom1484.i
%100 = load double, double* %arrayidx1552.i, align 8, !tbaa !2
%add1553.i = fadd double %add1542.i, %100
%arrayidx1564.i = getelementptr inbounds [4000 x double], [4000 x double]* %arraydecay, i64 %89, i64 %idxprom1496.i
%101 = load double, double* %arrayidx1564.i, align 8, !tbaa !2
%add1565.i = fadd double %add1553.i, %101
%div1566.i = fdiv double %add1565.i, 9.000000e+00
store double %div1566.i, double* %arrayidx1518.i, align 8, !tbaa !2
%indvars.iv.next70.i = add nsw i64 %indvars.iv69.i, 1
%cmp1465.not.not.i = icmp slt i64 %indvars.iv69.i, %90
br i1 %cmp1465.not.not.i, label %for.body1466.i, label %for.inc1576.i, !llvm.loop !10

Note that Polymer's loop body is not inlined, so we should look at the body of S0 instead.

Address Calculation

The major difference between the two is the usage of getelementptr. Polymer explicitly calculates the flattened 1-d address, e.g.,

%20 = mul i64 %17, 4000, !dbg !182
%21 = add i64 %20, %18, !dbg !183
%22 = getelementptr double, double* %19, i64 %21, !dbg !184

while Pluto delegates the task to getelementptr:

%arrayidx1475.i = getelementptr inbounds [4000 x double], [4000 x double]* %arraydecay, i64 %87, i64 %idxprom1474.i

Since getelementptr in both cases should provide the same address, the performance difference from this alone shouldn't be large (not empirically verified).

Also, since Pluto's loop IVs are i32-typed, sext is needed to extend them to i64 before they are passed to getelementptr.

For this case, we manually changed the Pluto-generated C code by updating the loop IV type from int to long long, and the sext instructions are eliminated (as well as the truncs):

%sub1740.i = add nsw i64 %add1739.i, -1
%arrayidx1741.i = getelementptr inbounds [4000 x double], [4000 x double]* %arraydecay, i64 %sub1736.i, i64 %sub1740.i

However, the overall run time increases from around 187 to 195.4 seconds, which may suggest that using int is not the main reason for the gap between Polymer and Pluto.

Finally, Pluto adds inbounds while Polymer doesn't. In theory, this shouldn't affect performance much. We also manually removed inbounds and ran again; the result is roughly the same.

Computation Sequence

Pluto and Polymer compute the seidel-2d kernel in the same order, from top-left to bottom-right.

Conclusion

The loop body may not be the major source of the performance gap.

Boundary calculation

Boundary calculation is a bit harder to look into. So far, there are two differences I found that might affect performance.

nsw and nuw

The Pluto-generated LLVM IR adds nsw and/or nuw flags to the address-calculation-related binary operators, e.g.:

%indvars.iv.next85.i = add nuw nsw i64 %indvars.iv84.i, 1
%cmp1367.i = icmp ugt i64 %32, %indvars.iv.next85.i
%cond1373.v.i = select i1 %cmp1367.i, i64 %32, i64 %indvars.iv.next85.i
%69 = sub nsw i64 %64, %indvars.iv84.i
%70 = add nsw i64 %69, -3998
%sext.i = shl i64 %cond1373.v.i, 32
%71 = ashr exact i64 %sext.i, 32
%cmp1378.i = icmp sgt i64 %71, %70
%cond1395.v.i = select i1 %cmp1378.i, i64 %cond1373.v.i, i64 %70

It is unknown whether this has a significant effect on performance. Manually removing these flags does not produce any better result.

Direct translation vs optimized

(Sorry if I use the wrong term)

There are plenty of floordiv and ceildiv operations in the loop bounds. The Polymer version has them directly compiled to the corresponding LLVM IR operators, e.g., sdiv, mul, etc., while Pluto extensively optimizes them through constant folding and other strategies.

Take the outermost loop as an example:

for (t1=0;t1<=floord(_PB_TSTEPS-1,32);t1++) {

In Polymer,

24: ; preds = %9
%25 = add i64 %17, -1, !dbg !268
%26 = icmp slt i64 %25, 0, !dbg !269
%27 = sub i64 -1, %25, !dbg !270
%28 = select i1 %26, i64 %27, i64 %25, !dbg !271
%29 = sdiv i64 %28, 32, !dbg !272
%30 = sub i64 -1, %29, !dbg !273
%31 = select i1 %26, i64 %30, i64 %29, !dbg !274
%32 = add i64 %31, 1, !dbg !275
br label %33, !dbg !276
33: ; preds = %250, %24
%34 = phi i64 [ %251, %250 ], [ 0, %24 ]
%35 = icmp slt i64 %34, %32, !dbg !277
br i1 %35, label %36, label %252, !dbg !278

You can see that %32 is floord(_PB_TSTEPS - 1, 32) + 1, calculated mainly via sdiv, and %34 is the loop IV t1.

In Pluto:

for.inc1588.i: ; preds = %for.inc1585.i
%indvars.iv.next93.i = add nuw nsw i64 %indvars.iv92.i, 1
%indvars.iv.next41.i = add nuw nsw i32 %indvars.iv40.i, 32
%indvars.iv.next47.i = add nuw nsw i32 %indvars.iv46.i, 32
%indvars.iv.next51.i = add nsw i32 %indvars.iv50.i, -32
%indvars.iv.next117.i = add nuw nsw i64 %indvars.iv116.i, 1
%exitcond124.not.i = icmp eq i64 %indvars.iv.next93.i, 32
%indvars.iv.next = add nsw i32 %indvars.iv, -32
%indvars.iv.next191 = add nuw nsw i64 %indvars.iv190, 32
%indvars.iv.next193 = add nuw nsw i64 %indvars.iv192, 32
br i1 %exitcond124.not.i, label %kernel_seidel_2d.exit, label %for.body90.lr.ph.i, !llvm.loop !14

The condition, %exitcond124.not.i = icmp eq i64 %indvars.iv.next93.i, 32, basically compares t1 + 1 (%indvars.iv.next93.i) against 32, which is the constant-folded value of floord(_PB_TSTEPS - 1, 32) + 1 given _PB_TSTEPS = 1000.

You may also notice that in Polymer you can find explicit mul by 64, while in Pluto these power-of-two multiplications and divisions are strength-reduced into shifts by 6.

Conclusion

We still cannot draw a solid conclusion on which part dominates the performance gap. In particular, Pluto's version seems to be more optimized than the Polymer version regarding address calculation, so it is very likely that the difference in LLVM IR is not the major cause of the performance difference.

Building the project.

Hi, the Pluto submodule is probably broken, as I am not able to clone it.
Should I use the official Pluto repo instead?
