
Polymer: bridging polyhedral tools to MLIR


Bridging polyhedral analysis tools to the MLIR framework.

Polymer is a component of the Polygeist framework. Please read on to find out how to install and use Polymer.

Related Publications/Talks

[bibtex]

Papers

Polymer is an essential component of the following two papers:

Talks

Polymer appears in the following talks:

Install Polymer

[legacy installation method] [submodule problem]

The recommended way of installing Polymer is to have it as a component of Polygeist. Please find the detailed instructions here.

You can also install Polymer as an individual, out-of-tree project.

Basic usage

Optimize MLIR code written in the Affine dialect using Pluto:

// File name: matmul.mlir
func @matmul() {
  %A = memref.alloc() : memref<64x64xf32>
  %B = memref.alloc() : memref<64x64xf32>
  %C = memref.alloc() : memref<64x64xf32>

  affine.for %i = 0 to 64 {
    affine.for %j = 0 to 64 {
      affine.for %k = 0 to 64 {
        %0 = affine.load %A[%i, %k] : memref<64x64xf32>
        %1 = affine.load %B[%k, %j] : memref<64x64xf32>
        %2 = arith.mulf %0, %1 : f32
        %3 = affine.load %C[%i, %j] : memref<64x64xf32>
        %4 = arith.addf %2, %3 : f32
        affine.store %4, %C[%i, %j] : memref<64x64xf32>
      }
    }
  }

  return
}

The following command optimizes this code:

# Go to the build/ directory.
./bin/polymer-opt -reg2mem -extract-scop-stmt -pluto-opt matmul.mlir 

Output:

#map0 = affine_map<(d0) -> (d0 * 32)>
#map1 = affine_map<(d0) -> (d0 * 32 + 32)>
module  {
  func private @S0(%arg0: index, %arg1: index, %arg2: memref<64x64xf32>, %arg3: index, %arg4: memref<64x64xf32>, %arg5: memref<64x64xf32>) attributes {scop.stmt} {
    %0 = affine.load %arg5[symbol(%arg0), symbol(%arg3)] : memref<64x64xf32>
    %1 = affine.load %arg4[symbol(%arg3), symbol(%arg1)] : memref<64x64xf32>
    %2 = arith.mulf %0, %1 : f32
    %3 = affine.load %arg2[symbol(%arg0), symbol(%arg1)] : memref<64x64xf32>
    %4 = arith.addf %2, %3 : f32
    affine.store %4, %arg2[symbol(%arg0), symbol(%arg1)] : memref<64x64xf32>
    return
  }
  
  func @matmul() {
    %0 = memref.alloc() : memref<64x64xf32>
    %1 = memref.alloc() : memref<64x64xf32>
    %2 = memref.alloc() : memref<64x64xf32>
    affine.for %arg0 = 0 to 2 {
      affine.for %arg1 = 0 to 2 {
        affine.for %arg2 = 0 to 2 {
          affine.for %arg3 = #map0(%arg0) to #map1(%arg0) {
            affine.for %arg4 = #map0(%arg2) to #map1(%arg2) {
              affine.for %arg5 = #map0(%arg1) to #map1(%arg1) {
                call @S0(%arg3, %arg5, %0, %arg4, %1, %2) : (index, index, memref<64x64xf32>, index, memref<64x64xf32>, memref<64x64xf32>) -> ()
              }
            }
          }
        }
      }
    }
    return
  }
}
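The transformed code above tiles the three matmul loops: the outer loops iterate over 2×2×2 tiles, and #map0/#map1 expand each tile index d0 into the point-loop range [d0 * 32, d0 * 32 + 32). The sketch below (plain Python, not Polymer output; the problem and tile sizes are shrunk from 64/32 to 8/2 for brevity) mirrors that loop structure and checks it computes the same result as the untiled nest.

```python
# Scaled-down sketch of the tiled loop structure Pluto produced above.
# Hypothetical small sizes: N plays the role of 64, T the role of 32.
N, T = 8, 2

def matmul_naive(A, B):
    C = [[0.0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            for k in range(N):
                C[i][j] += A[i][k] * B[k][j]
    return C

def matmul_tiled(A, B):
    C = [[0.0] * N for _ in range(N)]
    for ti in range(N // T):          # affine.for %arg0 = 0 to 2
        for tj in range(N // T):      # affine.for %arg1 = 0 to 2
            for tk in range(N // T):  # affine.for %arg2 = 0 to 2
                # point loops over [t * T, t * T + T), i.e. #map0/#map1
                for i in range(ti * T, ti * T + T):
                    for k in range(tk * T, tk * T + T):
                        for j in range(tj * T, tj * T + T):
                            C[i][j] += A[i][k] * B[k][j]  # body of @S0
    return C

A = [[float(i * N + j) for j in range(N)] for i in range(N)]
B = [[float((i + j) % 5) for j in range(N)] for i in range(N)]
assert matmul_tiled(A, B) == matmul_naive(A, B)
```

Since the statement body is a pure accumulation, any iteration order that visits every (i, j, k) exactly once yields the same C, which is why the tiled (and reordered) nest is legal.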

polymer's People

Contributors

chelini, groverkss, hanchenye, kumasento, yangwang92


polymer's Issues

Connection to PLUTO

Should create a pluto-opt transform pass.

  • Add PLUTO to the compilation flow (#10 ).
  • Create the compiler pass (#13 ).

C integration test

Should check the correctness of the optimized code and compare the performance before and after optimization.

For instance, I should create a matrix multiplication example that has three different implementations:

  1. A plain C implementation
  2. An implementation that is based on an unoptimised MLIR code
  3. An implementation that is based on an optimised MLIR code

All these tests will be under test/integration. Won't create more subdirectories at the moment. Will consider adding googletest.

Handling local variables

This issue was not properly dealt with in previous versions. Since I'm currently refactoring, we should have a comprehensive solution for it.

Background

A local variable will be introduced when dealing with semi-affine maps. For example, consider the following code:

#map = affine_map<(d0) -> (d0 floordiv 2)>

func @load_store_local_vars_floordiv(%A : memref<?xf32>) -> () {
  %c0 = constant 0 : index
  %N = dim %A, %c0 : memref<?xf32>
  %M = affine.apply #map(%N)

  affine.for %i = 0 to %M {
    %0 = affine.load %A[%i] : memref<?xf32>
    affine.store %0, %A[%i + %M] : memref<?xf32>
  }

  return
}

%M will be a local variable in the resolved domain of the inner affine.for. The constraints relating %N and %M can be derived from their floordiv 2 relation.

%i | %N | %M | 1 
 1    0    0   0 >= 0
 0    1   -2   0 >= 0
 0   -1    2   1 >= 0
-1    0    1  -1 >= 0

And these constraints can be translated into a proper OpenScop relation.

DOMAIN
4 5 1 0 1 1
# e/i| i1 | l1 | P1 |  1  
   1    1    0    0    0    ## i1 >= 0
   1    0   -2    1    0    ## -2*l1+P1 >= 0
   1    0    2   -1    1    ## 2*l1-P1+1 >= 0
   1   -1    1    0   -1    ## -i1+l1-1 >= 0

l1 is %M and P1 is %N.
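The middle two rows of the DOMAIN relation are what pin the local variable to its floordiv value: together, N - 2M >= 0 and -N + 2M + 1 >= 0 force M = floor(N / 2). A brute-force Python check (small hypothetical ranges, purely illustrative):

```python
# Check that the DOMAIN rows above encode 0 <= i < M with M = N floordiv 2.
def in_domain(i, N, M):
    return (i >= 0 and                # i1 >= 0
            N - 2 * M >= 0 and        # -2*l1 + P1 >= 0
            -N + 2 * M + 1 >= 0 and   #  2*l1 - P1 + 1 >= 0
            -i + M - 1 >= 0)          # -i1 + l1 - 1 >= 0

for N in range(0, 10):
    M = N // 2
    # only the true floordiv value of M satisfies the middle two rows
    assert [m for m in range(N + 1)
            if N - 2 * m >= 0 and -N + 2 * m + 1 >= 0] == [M]
    for i in range(-2, N + 2):
        assert in_domain(i, N, M) == (0 <= i < M)
```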

The problem is around the access relation.

Getting the access relation

The access relation is derived from an AffineValueMap that we can get from MemRefAccess (ref). An AffineValueMap is an AffineMap together with the list of operands it applies to. The issue with local variables concerns these operands: we need to make sure every operand can be traced back to the values held by the domain, but a local variable is not there.

Specifically, given the code above, we can get a polyhedral statement as follows:

  func @load_store_local_vars_floordiv(%arg0: memref<?xf32>) {
    %c0 = constant 0 : index
    %0 = dim %arg0, %c0 : memref<?xf32>
    %1 = affine.apply #map(%0)
    affine.for %arg1 = 0 to %1 {
      call @S0(%arg0, %arg1, %1) : (memref<?xf32>, index, index) -> ()
    }
    return
  }
  func @S0(%arg0: memref<?xf32>, %arg1: index, %arg2: index) attributes {scop.stmt} {
    %0 = affine.load %arg0[%arg1] : memref<?xf32>
    affine.store %0, %arg0[%arg1 + %arg2] : memref<?xf32>
    return
  }

Because %1 (%M) is a valid symbol (per isValidSymbol), we pass it directly into @S0 as %arg2. It will later become an operand of the AffineValueMap for the affine.store. That AffineValueMap would look like:

affine_map<()[s0, s1] -> (s0 + s1)>

where s0 and s1 are replaced by %arg1 (%i) and %arg2 (%M). However, %arg2, or more specifically %M, cannot be found in the domain we resolved. %M is a local variable, and a local variable is value-less (internal) in the FlatAffineConstraints. Going back to the domain representation above (which is of type FlatAffineConstraints):

%i | %N | %M | 1 
 1    0    0   0 >= 0
 0    1   -2   0 >= 0
 0   -1    2   1 >= 0
-1    0    1  -1 >= 0

%M is not actually stored in the domain.

Solutions

Before we dive into the solutions, there are several things from the Polymer design that cannot be broken:

  1. A domain is immutable. Once we figure out what a domain looks like, especially what its dims and symbols are, we should not further alter it.
  2. An access relation should not use values other than those in the domain, excluding local values. That is, if an access relation needs to use a value marked as local in the domain, it should re-instantiate that value as a local one inside the access relation.
  3. The statement interface should only include arrays, IVs, and parameters, e.g., including %M as an argument of @S0 in the previous example violates this rule. This is just a convention that can keep the design clean.

Our solution is:

  1. In -extract-scop-stmt, we stop searching only when we see an operand that is included in the current domain constraints (or the global parameter set), so that all local-variable computation is contained within each callee.
  2. When exporting OpenScop, we just build access constraints with all the affine.apply results flattened. This is done automatically; we just need to make sure that these affine constraints have the same columns as the domain (after merging with the context for their symbols), excluding all the local variables.
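The flattening in step 2 can be illustrated concretely: instead of referring to %M by value, the access %A[%i + %M] is expressed as constraints over the address, the domain values, and a fresh local variable q for the floordiv. A small brute-force sketch (hypothetical, just to show the constraint shape):

```python
# Flatten addr = i + (N floordiv 2) into affine constraints with a
# local variable q standing in for the floordiv:
#   addr - i - q == 0,  N - 2q >= 0,  -N + 2q + 1 >= 0
def flattened_address(i, N):
    # enumerate candidate (addr, q) pairs satisfying the constraints
    sols = [(addr, q)
            for q in range(N + 1)
            for addr in range(2 * N + 2)
            if addr - i - q == 0 and N - 2 * q >= 0 and -N + 2 * q + 1 >= 0]
    assert len(sols) == 1  # the constraints determine addr uniquely
    return sols[0][0]

for N in range(1, 8):
    for i in range(N // 2):
        assert flattened_address(i, N) == i + N // 2
```

The point is that the two floordiv rows pin q uniquely, so the access relation stays single-valued without ever naming %M.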

Result

Given input code:

#map = affine_map<(d0) -> (d0 floordiv 2)>

func @load_store_local_vars_floordiv(%A : memref<?xf32>) -> () {
  %c0 = constant 0 : index
  %N = dim %A, %c0 : memref<?xf32>
  %M = affine.apply #map(%N)

  affine.for %i = 0 to %M {
    %0 = affine.load %A[%i] : memref<?xf32>
    %j = affine.apply #map(%i)
    %1 = addf %0, %0 : f32
    affine.store %0, %A[%i + %M] : memref<?xf32>
    affine.store %1, %A[%j + %M] : memref<?xf32>
  }

  return
}

We will get:

#map = affine_map<(d0) -> (d0 floordiv 2)>
#set0 = affine_set<()[s0] : (s0 floordiv 2 - 1 >= 0, s0 floordiv 2 - 1 >= 0, s0 mod 2 >= 0, -s0 + (s0 floordiv 2) * 2 + 1 >= 0, s0 mod 2 >= 0, -s0 + (s0 floordiv 2) * 2 + 1 >= 0)>
#set1 = affine_set<(d0)[s0] : (d0 >= 0, s0 mod 2 >= 0, -s0 + (s0 floordiv 2) * 2 + 1 >= 0, -d0 + s0 floordiv 2 - 1 >= 0)>
#set2 = affine_set<(d0)[s0] : (d0 >= 0, s0 mod 2 >= 0, -s0 + (s0 floordiv 2) * 2 + 1 >= 0, -d0 + s0 floordiv 2 - 1 >= 0)>
#set3 = affine_set<(d0, d1, d2)[s0] : (d1 - d2 == 0, d0 - 1 == 0)>
#set4 = affine_set<(d0, d1, d2)[s0] : (d1 - d2 - s0 floordiv 2 == 0, d0 - 1 == 0, s0 mod 2 >= 0, -s0 + (s0 floordiv 2) * 2 + 1 >= 0)>
#set5 = affine_set<(d0, d1, d2)[s0] : (d1 - d2 floordiv 2 - s0 floordiv 2 == 0, d0 - 1 == 0, d2 mod 2 >= 0, -d2 + (d2 floordiv 2) * 2 + 1 >= 0, s0 mod 2 >= 0, -s0 + (s0 floordiv 2) * 2 + 1 >= 0)>
module  {
  func @load_store_local_vars_floordiv(%arg0: memref<?xf32>) attributes {scop.arg_names = ["A1"], scop.ctx = #set0, scop.ctx_params = ["P1"]} {
    %c0 = constant 0 : index
    %0 = dim {scop.param_names = ["P1"]} %arg0, %c0 : memref<?xf32>
    %1 = affine.apply #map(%0)
    affine.for %arg1 = 0 to %1 {
      call @S0(%arg0, %arg1, %0) {scop.domain = #set1, scop.domain_symbols = ["i1", "P1"], scop.scats = [0, 0]} : (memref<?xf32>, index, index) -> ()
      call @S1(%arg0, %0, %arg1) {scop.domain = #set2, scop.domain_symbols = ["i1", "P1"], scop.scats = [0, 1]} : (memref<?xf32>, index, index) -> ()
    } {scop.iv_name = "i1"}
    return
  }
  func @S0(%arg0: memref<?xf32>, %arg1: index, %arg2: index) attributes {scop.stmt} {
    %0 = affine.load %arg0[%arg1] {scop.access = #set3, scop.access_symbols = ["A1", "", "i1", "P1"]} : memref<?xf32>
    %1 = affine.apply #map(%arg2)
    affine.store %0, %arg0[%arg1 + %1] {scop.access = #set4, scop.access_symbols = ["A1", "", "i1", "P1"]} : memref<?xf32>
    return
  }
  func @S1(%arg0: memref<?xf32>, %arg1: index, %arg2: index) attributes {scop.stmt} {
    %0 = affine.load %arg0[%arg2] {scop.access = #set3, scop.access_symbols = ["A1", "", "i1", "P1"]} : memref<?xf32>
    %1 = addf %0, %0 : f32
    %2 = affine.apply #map(%arg2)
    %3 = affine.apply #map(%arg1)
    affine.store %1, %arg0[%2 + %3] {scop.access = #set5, scop.access_symbols = ["A1", "", "i1", "P1"]} : memref<?xf32>
    return
  }
}

No Pluto optimization in matrix multiplication example

I built Polymer from the impact branch and tried to optimize matmul.mlir, but there is no difference between the output MLIR and the input MLIR.
The command I used is polymer-opt -pluto-opt ../matmul.mlir.
The output MLIR is:

module  {
  func @matmul() {
    %0 = alloc() : memref<64x64xf32>
    %1 = alloc() : memref<64x64xf32>
    %2 = alloc() : memref<64x64xf32>
    affine.for %arg0 = 0 to 64 {
      affine.for %arg1 = 0 to 64 {
        affine.for %arg2 = 0 to 64 {
          %3 = affine.load %0[%arg0, %arg2] : memref<64x64xf32>
          %4 = affine.load %1[%arg2, %arg1] : memref<64x64xf32>
          %5 = mulf %3, %4 : f32
          %6 = affine.load %2[%arg0, %arg1] : memref<64x64xf32>
          %7 = addf %5, %6 : f32
          affine.store %7, %2[%arg0, %arg1] : memref<64x64xf32>
        }
      }
    }
    return
  }
}

When I try to build this project with the main branch and use the command ./scripts/build-with-polygeist.sh, I get a compile error:

configure: error: clang header file not found
checking for pet/Makefile... no
configure: error: configure in pet/ failed
tools/mlir/tools/polymer/CMakeFiles/pluto.dir/build.make:124: recipe for target 'tools/mlir/tools/polymer/pluto/src/pluto-stamp/pluto-configure' failed
make[3]: *** [tools/mlir/tools/polymer/pluto/src/pluto-stamp/pluto-configure] Error 1
CMakeFiles/Makefile2:109388: recipe for target 'tools/mlir/tools/polymer/CMakeFiles/pluto.dir/all' failed
make[2]: *** [tools/mlir/tools/polymer/CMakeFiles/pluto.dir/all] Error 2
CMakeFiles/Makefile2:110540: recipe for target 'tools/mlir/tools/polymer/test/CMakeFiles/check-polymer.dir/rule' failed
make[1]: *** [tools/mlir/tools/polymer/test/CMakeFiles/check-polymer.dir/rule] Error 2
Makefile:27063: recipe for target 'check-polymer' failed
make: *** [check-polymer] Error 2

Are there any solutions to these problems? Thanks.

<scatnames> are wrongly generated

The total number of scatnames is not correctly calculated: it is currently computed as 1 (root) + n (IVs) + 1 (leaf), while what we want is 2n (IVs interleaved with their scalar positions) + 1 (leaf).
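In other words, for n nested loops the scattering dimensions should interleave the n IVs with n + 1 scalar position dimensions, giving 2n + 1 names in total. A one-line sketch of the corrected count:

```python
# Corrected scatnames count: n IVs interleaved with n + 1 scalar
# positions (root, between-loop slots, leaf) gives 2n + 1 names,
# e.g. c0 i1 c1 i2 c2 i3 c3 for a 3-deep nest.
def num_scatnames(n_loops):
    return 2 * n_loops + 1

assert num_scatnames(3) == 7  # the matmul triple nest
assert num_scatnames(1) == 3
```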

Fix the failed test of matmul

func @matmul() {
  %A = alloc() : memref<64x64xf32>
  %B = alloc() : memref<64x64xf32>
  %C = alloc() : memref<64x64xf32>

  affine.for %i = 0 to 64 {
    affine.for %j = 0 to 64 {
      affine.for %k = 0 to 64 {
        %0 = affine.load %A[%i, %k] : memref<64x64xf32>
        %1 = affine.load %B[%k, %j] : memref<64x64xf32>
        %2 = mulf %0, %1 : f32
        %3 = affine.load %C[%i, %j] : memref<64x64xf32>
        %4 = addf %2, %3 : f32
        affine.store %4, %C[%i, %j] : memref<64x64xf32>
      }
    }
  }

  return
}

Add the reg2mem pass to break dependences in imperfectly nested loops

affine.for %i ... {
  %0 = affine.load %A[%i]
  affine.store %something, %A[%i]
  affine.for %j {
    %1 = mulf %0, %0
    affine.store %1, %B[%i, %j]
  }
}

This code is hard to handle because:

  1. The loops are imperfectly nested, and in our current implementation of PlutoTransform, we would need to create a statement that contains a loop structure inside it. It is unclear whether Pluto can handle this, or whether it may cause problems;
  2. There is a WAR dependence between the first load/store pair. So we cannot reorder them.

One solution is to add a reg2mem pass, which breaks the dependences between nested loops by storing to memory the SSA values (%0 in the given example) that are still live when we enter a nested block. This way we don't have to create statements that span multiple loops, and we can model the dependences explicitly.

For instance, if we transform the previously given example in the following way:

affine.for %i ... {
  // S1(i)
  %0 = affine.load %A[%i]
  affine.store %0, %scratchpad[%i]

  // S2(i)
  affine.store %something, %A[%i]
  affine.for %j {
    // S3(i, j)
    %1 = affine.load %scratchpad[%i]
    %2 = mulf %1, %1
    affine.store %2, %B[%i, %j]
  }
}

We may create clearly separated statements just within each loop nest level. The %scratchpad is the memory we create to store those live-out SSA values.

We could implement this pass based on the basic liveness analysis algorithm. The size of the allocated memory depends on the iteration space of the loop nest level it resides in.
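The effect of the pass can be sketched in plain Python (hypothetical sizes, not Polymer's implementation): the live-out scalar is spilled to a scratchpad indexed by the surrounding IV, so the inner loop reloads it instead of capturing the SSA value across loop levels.

```python
# Sketch of the reg2mem idea: spill a value that is live into a nested
# loop to a scratchpad sized by the surrounding iteration space, so
# each loop level becomes an independent statement.
NI, NJ = 4, 3                     # hypothetical trip counts for %i, %j
A = [float(i) for i in range(NI)]
B = [[0.0] * NJ for _ in range(NI)]
scratchpad = [0.0] * NI           # one slot per %i iteration

for i in range(NI):
    # S1(i): load A[i] and spill it to the scratchpad
    scratchpad[i] = A[i]
    # S2(i): the WAR-conflicting store to A[i] is now a separate statement
    A[i] = 42.0
    for j in range(NJ):
        # S3(i, j): reload the spilled value instead of using the SSA name
        v = scratchpad[i]
        B[i][j] = v * v

# each B[i][j] sees the value of A[i] from before the S2 overwrite
assert all(B[i][j] == float(i) ** 2 for i in range(NI) for j in range(NJ))
```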

It would be helpful for the following issues: #5 #42 #48

Support extracting point loops into standalone functions

This will be useful for further optimisation concentrating on the point loop bands.

cd build
./bin/polymer-opt ../example/polybench/EXTRALARGE/gemm/gemm.mlir \
  -reg2mem \
  -extract-scop-stmt \
  -pluto-opt \
  -canonicalize \
  -annotate-point-loops \
  -extract-point-loops
#map0 = affine_map<()[s0] -> ((s0 - 1) floordiv 32 + 1)>
#map1 = affine_map<(d0) -> (d0 * 32)>
#map2 = affine_map<(d0)[s0] -> (s0, d0 * 32 + 32)>
#set0 = affine_set<()[s0, s1] : (s0 - 1 >= 0, s1 - 1 >= 0)>
#set1 = affine_set<()[s0] : (s0 - 1 >= 0)>
module attributes {llvm.data_layout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128", llvm.target_triple = "x86_64-unknown-linux-gnu"}  {
  llvm.mlir.global internal constant @str7("==END   DUMP_ARRAYS==\0A\00")
  llvm.mlir.global internal constant @str6("\0Aend   dump: %s\0A\00")
  llvm.mlir.global internal constant @str5("%0.2lf \00")
  llvm.mlir.global internal constant @str4("\0A\00")
  llvm.mlir.global internal constant @str3("C\00")
  llvm.mlir.global internal constant @str2("begin dump: %s\00")
  llvm.mlir.global internal constant @str1("==BEGIN DUMP_ARRAYS==\0A\00")
  llvm.mlir.global external @stderr() : !llvm.ptr<struct<"struct._IO_FILE", (i32, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<struct<"struct._IO_marker", opaque>>, ptr<struct<"struct._IO_FILE">>, i32, i32, i64, i16, i8, array<1 x i8>, ptr<i8>, i64, ptr<struct<"struct._IO_codecvt", opaque>>, ptr<struct<"struct._IO_wide_data", opaque>>, ptr<struct<"struct._IO_FILE">>, ptr<i8>, i64, i32, array<20 x i8>)>>
  llvm.func @fprintf(!llvm.ptr<struct<"struct._IO_FILE", (i32, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<struct<"struct._IO_marker", opaque>>, ptr<struct<"struct._IO_FILE">>, i32, i32, i64, i16, i8, array<1 x i8>, ptr<i8>, i64, ptr<struct<"struct._IO_codecvt", opaque>>, ptr<struct<"struct._IO_wide_data", opaque>>, ptr<struct<"struct._IO_FILE">>, ptr<i8>, i64, i32, array<20 x i8>)>>, !llvm.ptr<i8>, ...) -> !llvm.i32
  llvm.mlir.global internal constant @str0("\00")
  llvm.func @strcmp(!llvm.ptr<i8>, !llvm.ptr<i8>) -> !llvm.i32
  func @main(%arg0: i32, %arg1: !llvm.ptr<ptr<i8>>) -> i32 {
    %c0 = constant 0 : index
    %c2000_i32 = constant 2000 : i32
    %c2300_i32 = constant 2300 : i32
    %c2600_i32 = constant 2600 : i32
    %c42_i32 = constant 42 : i32
    %true = constant true
    %false = constant false
    %c0_i32 = constant 0 : i32
    %0 = alloca() : memref<1xf64>
    %1 = alloca() : memref<1xf64>
    %2 = alloc() : memref<2000x2300xf64>
    %3 = alloc() : memref<2000x2600xf64>
    %4 = alloc() : memref<2600x2300xf64>
    %5 = memref_cast %0 : memref<1xf64> to memref<?xf64>
    %6 = memref_cast %1 : memref<1xf64> to memref<?xf64>
    call @init_array(%c2000_i32, %c2300_i32, %c2600_i32, %5, %6, %2, %3, %4) : (i32, i32, i32, memref<?xf64>, memref<?xf64>, memref<2000x2300xf64>, memref<2000x2600xf64>, memref<2600x2300xf64>) -> ()
    call @polybench_timer_start() : () -> ()
    %7 = load %0[%c0] : memref<1xf64>
    %8 = load %1[%c0] : memref<1xf64>
    call @kernel_gemm_new(%c2000_i32, %c2300_i32, %c2600_i32, %7, %8, %2, %3, %4) : (i32, i32, i32, f64, f64, memref<2000x2300xf64>, memref<2000x2600xf64>, memref<2600x2300xf64>) -> ()
    call @polybench_timer_stop() : () -> ()
    call @polybench_timer_print() : () -> ()
    %9 = cmpi "sgt", %arg0, %c42_i32 : i32
    %10 = scf.if %9 -> (i1) {
      %11 = llvm.load %arg1 : !llvm.ptr<ptr<i8>>
      %12 = llvm.mlir.addressof @str0 : !llvm.ptr<array<1 x i8>>
      %13 = llvm.mlir.constant(0 : index) : !llvm.i64
      %14 = llvm.getelementptr %12[%13, %13] : (!llvm.ptr<array<1 x i8>>, !llvm.i64, !llvm.i64) -> !llvm.ptr<i8>
      %15 = llvm.call @strcmp(%11, %14) : (!llvm.ptr<i8>, !llvm.ptr<i8>) -> !llvm.i32
      %16 = llvm.mlir.cast %15 : !llvm.i32 to i32
      %17 = trunci %16 : i32 to i1
      %18 = xor %17, %true : i1
      scf.yield %18 : i1
    } else {
      scf.yield %false : i1
    }
    scf.if %10 {
      call @print_array(%c2000_i32, %c2300_i32, %2) : (i32, i32, memref<2000x2300xf64>) -> ()
    }
    return %c0_i32 : i32
  }
  func private @init_array(%arg0: i32, %arg1: i32, %arg2: i32, %arg3: memref<?xf64>, %arg4: memref<?xf64>, %arg5: memref<2000x2300xf64>, %arg6: memref<2000x2600xf64>, %arg7: memref<2600x2300xf64>) {
    %c0 = constant 0 : index
    %cst = constant 1.500000e+00 : f64
    %cst_0 = constant 1.200000e+00 : f64
    %c0_i32 = constant 0 : i32
    %c2_i32 = constant 2 : i32
    %c1_i32 = constant 1 : i32
    store %cst, %arg3[%c0] : memref<?xf64>
    store %cst_0, %arg4[%c0] : memref<?xf64>
    br ^bb1(%c0_i32 : i32)
  ^bb1(%0: i32):  // 2 preds: ^bb0, ^bb4
    %1 = cmpi "slt", %0, %arg0 : i32
    %2 = index_cast %0 : i32 to index
    cond_br %1, ^bb2(%c0_i32 : i32), ^bb5(%c0_i32 : i32)
  ^bb2(%3: i32):  // 2 preds: ^bb1, ^bb3
    %4 = cmpi "slt", %3, %arg1 : i32
    %5 = index_cast %3 : i32 to index
    cond_br %4, ^bb3, ^bb4
  ^bb3:  // pred: ^bb2
    %6 = muli %0, %3 : i32
    %7 = addi %6, %c1_i32 : i32
    %8 = remi_signed %7, %arg0 : i32
    %9 = sitofp %8 : i32 to f64
    %10 = sitofp %arg0 : i32 to f64
    %11 = divf %9, %10 : f64
    store %11, %arg5[%2, %5] : memref<2000x2300xf64>
    %12 = addi %3, %c1_i32 : i32
    br ^bb2(%12 : i32)
  ^bb4:  // pred: ^bb2
    %13 = addi %0, %c1_i32 : i32
    br ^bb1(%13 : i32)
  ^bb5(%14: i32):  // 2 preds: ^bb1, ^bb8
    %15 = cmpi "slt", %14, %arg0 : i32
    %16 = index_cast %14 : i32 to index
    cond_br %15, ^bb6(%c0_i32 : i32), ^bb9(%c0_i32 : i32)
  ^bb6(%17: i32):  // 2 preds: ^bb5, ^bb7
    %18 = cmpi "slt", %17, %arg2 : i32
    %19 = index_cast %17 : i32 to index
    cond_br %18, ^bb7, ^bb8
  ^bb7:  // pred: ^bb6
    %20 = addi %17, %c1_i32 : i32
    %21 = muli %14, %20 : i32
    %22 = remi_signed %21, %arg2 : i32
    %23 = sitofp %22 : i32 to f64
    %24 = sitofp %arg2 : i32 to f64
    %25 = divf %23, %24 : f64
    store %25, %arg6[%16, %19] : memref<2000x2600xf64>
    br ^bb6(%20 : i32)
  ^bb8:  // pred: ^bb6
    %26 = addi %14, %c1_i32 : i32
    br ^bb5(%26 : i32)
  ^bb9(%27: i32):  // 2 preds: ^bb5, ^bb13
    %28 = cmpi "slt", %27, %arg2 : i32
    %29 = index_cast %27 : i32 to index
    cond_br %28, ^bb11(%c0_i32 : i32), ^bb10
  ^bb10:  // pred: ^bb9
    return
  ^bb11(%30: i32):  // 2 preds: ^bb9, ^bb12
    %31 = cmpi "slt", %30, %arg1 : i32
    %32 = index_cast %30 : i32 to index
    cond_br %31, ^bb12, ^bb13
  ^bb12:  // pred: ^bb11
    %33 = addi %30, %c2_i32 : i32
    %34 = muli %27, %33 : i32
    %35 = remi_signed %34, %arg1 : i32
    %36 = sitofp %35 : i32 to f64
    %37 = sitofp %arg1 : i32 to f64
    %38 = divf %36, %37 : f64
    store %38, %arg7[%29, %32] : memref<2600x2300xf64>
    %39 = addi %30, %c1_i32 : i32
    br ^bb11(%39 : i32)
  ^bb13:  // pred: ^bb11
    %40 = addi %27, %c1_i32 : i32
    br ^bb9(%40 : i32)
  }
  func private @polybench_timer_start()
  func private @kernel_gemm(%arg0: i32, %arg1: i32, %arg2: i32, %arg3: f64, %arg4: f64, %arg5: memref<2000x2300xf64>, %arg6: memref<2000x2600xf64>, %arg7: memref<2600x2300xf64>) {
    %0 = index_cast %arg0 : i32 to index
    %1 = index_cast %arg1 : i32 to index
    %2 = index_cast %arg2 : i32 to index
    affine.for %arg8 = 0 to %0 {
      affine.for %arg9 = 0 to %1 {
        call @S0(%arg5, %arg8, %arg9, %arg4) : (memref<2000x2300xf64>, index, index, f64) -> ()
      } {scop.point_loop}
      affine.for %arg9 = 0 to %2 {
        affine.for %arg10 = 0 to %1 {
          call @S1(%arg5, %arg8, %arg10, %arg7, %arg9, %arg3, %arg6) : (memref<2000x2300xf64>, index, index, memref<2600x2300xf64>, index, f64, memref<2000x2600xf64>) -> ()
        } {scop.point_loop}
      } {scop.point_loop}
    } {scop.point_loop}
    return
  }
  func private @polybench_timer_stop()
  func private @polybench_timer_print()
  func private @print_array(%arg0: i32, %arg1: i32, %arg2: memref<2000x2300xf64>) {
    %c0_i32 = constant 0 : i32
    %c20_i32 = constant 20 : i32
    %c1_i32 = constant 1 : i32
    %0 = llvm.mlir.addressof @stderr : !llvm.ptr<ptr<struct<"struct._IO_FILE", (i32, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<struct<"struct._IO_marker", opaque>>, ptr<struct<"struct._IO_FILE">>, i32, i32, i64, i16, i8, array<1 x i8>, ptr<i8>, i64, ptr<struct<"struct._IO_codecvt", opaque>>, ptr<struct<"struct._IO_wide_data", opaque>>, ptr<struct<"struct._IO_FILE">>, ptr<i8>, i64, i32, array<20 x i8>)>>>
    %1 = llvm.load %0 : !llvm.ptr<ptr<struct<"struct._IO_FILE", (i32, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<struct<"struct._IO_marker", opaque>>, ptr<struct<"struct._IO_FILE">>, i32, i32, i64, i16, i8, array<1 x i8>, ptr<i8>, i64, ptr<struct<"struct._IO_codecvt", opaque>>, ptr<struct<"struct._IO_wide_data", opaque>>, ptr<struct<"struct._IO_FILE">>, ptr<i8>, i64, i32, array<20 x i8>)>>>
    %2 = llvm.mlir.addressof @str1 : !llvm.ptr<array<23 x i8>>
    %3 = llvm.mlir.constant(0 : index) : !llvm.i64
    %4 = llvm.getelementptr %2[%3, %3] : (!llvm.ptr<array<23 x i8>>, !llvm.i64, !llvm.i64) -> !llvm.ptr<i8>
    %5 = llvm.call @fprintf(%1, %4) : (!llvm.ptr<struct<"struct._IO_FILE", (i32, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<struct<"struct._IO_marker", opaque>>, ptr<struct<"struct._IO_FILE">>, i32, i32, i64, i16, i8, array<1 x i8>, ptr<i8>, i64, ptr<struct<"struct._IO_codecvt", opaque>>, ptr<struct<"struct._IO_wide_data", opaque>>, ptr<struct<"struct._IO_FILE">>, ptr<i8>, i64, i32, array<20 x i8>)>>, !llvm.ptr<i8>) -> !llvm.i32
    %6 = llvm.mlir.addressof @stderr : !llvm.ptr<ptr<struct<"struct._IO_FILE", (i32, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<struct<"struct._IO_marker", opaque>>, ptr<struct<"struct._IO_FILE">>, i32, i32, i64, i16, i8, array<1 x i8>, ptr<i8>, i64, ptr<struct<"struct._IO_codecvt", opaque>>, ptr<struct<"struct._IO_wide_data", opaque>>, ptr<struct<"struct._IO_FILE">>, ptr<i8>, i64, i32, array<20 x i8>)>>>
    %7 = llvm.load %6 : !llvm.ptr<ptr<struct<"struct._IO_FILE", (i32, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<struct<"struct._IO_marker", opaque>>, ptr<struct<"struct._IO_FILE">>, i32, i32, i64, i16, i8, array<1 x i8>, ptr<i8>, i64, ptr<struct<"struct._IO_codecvt", opaque>>, ptr<struct<"struct._IO_wide_data", opaque>>, ptr<struct<"struct._IO_FILE">>, ptr<i8>, i64, i32, array<20 x i8>)>>>
    %8 = llvm.mlir.addressof @str2 : !llvm.ptr<array<15 x i8>>
    %9 = llvm.getelementptr %8[%3, %3] : (!llvm.ptr<array<15 x i8>>, !llvm.i64, !llvm.i64) -> !llvm.ptr<i8>
    %10 = llvm.mlir.addressof @str3 : !llvm.ptr<array<2 x i8>>
    %11 = llvm.getelementptr %10[%3, %3] : (!llvm.ptr<array<2 x i8>>, !llvm.i64, !llvm.i64) -> !llvm.ptr<i8>
    %12 = llvm.call @fprintf(%7, %9, %11) : (!llvm.ptr<struct<"struct._IO_FILE", (i32, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<struct<"struct._IO_marker", opaque>>, ptr<struct<"struct._IO_FILE">>, i32, i32, i64, i16, i8, array<1 x i8>, ptr<i8>, i64, ptr<struct<"struct._IO_codecvt", opaque>>, ptr<struct<"struct._IO_wide_data", opaque>>, ptr<struct<"struct._IO_FILE">>, ptr<i8>, i64, i32, array<20 x i8>)>>, !llvm.ptr<i8>, !llvm.ptr<i8>) -> !llvm.i32
    br ^bb1(%c0_i32 : i32)
  ^bb1(%13: i32):  // 2 preds: ^bb0, ^bb5
    %14 = cmpi "slt", %13, %arg0 : i32
    %15 = index_cast %13 : i32 to index
    cond_br %14, ^bb3(%c0_i32 : i32), ^bb2
  ^bb2:  // pred: ^bb1
    %16 = llvm.mlir.addressof @stderr : !llvm.ptr<ptr<struct<"struct._IO_FILE", (i32, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<struct<"struct._IO_marker", opaque>>, ptr<struct<"struct._IO_FILE">>, i32, i32, i64, i16, i8, array<1 x i8>, ptr<i8>, i64, ptr<struct<"struct._IO_codecvt", opaque>>, ptr<struct<"struct._IO_wide_data", opaque>>, ptr<struct<"struct._IO_FILE">>, ptr<i8>, i64, i32, array<20 x i8>)>>>
    %17 = llvm.load %16 : !llvm.ptr<ptr<struct<"struct._IO_FILE", (i32, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<struct<"struct._IO_marker", opaque>>, ptr<struct<"struct._IO_FILE">>, i32, i32, i64, i16, i8, array<1 x i8>, ptr<i8>, i64, ptr<struct<"struct._IO_codecvt", opaque>>, ptr<struct<"struct._IO_wide_data", opaque>>, ptr<struct<"struct._IO_FILE">>, ptr<i8>, i64, i32, array<20 x i8>)>>>
    %18 = llvm.mlir.addressof @str6 : !llvm.ptr<array<17 x i8>>
    %19 = llvm.getelementptr %18[%3, %3] : (!llvm.ptr<array<17 x i8>>, !llvm.i64, !llvm.i64) -> !llvm.ptr<i8>
    %20 = llvm.mlir.addressof @str3 : !llvm.ptr<array<2 x i8>>
    %21 = llvm.getelementptr %20[%3, %3] : (!llvm.ptr<array<2 x i8>>, !llvm.i64, !llvm.i64) -> !llvm.ptr<i8>
    %22 = llvm.call @fprintf(%17, %19, %21) : (!llvm.ptr<struct<"struct._IO_FILE", (i32, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<struct<"struct._IO_marker", opaque>>, ptr<struct<"struct._IO_FILE">>, i32, i32, i64, i16, i8, array<1 x i8>, ptr<i8>, i64, ptr<struct<"struct._IO_codecvt", opaque>>, ptr<struct<"struct._IO_wide_data", opaque>>, ptr<struct<"struct._IO_FILE">>, ptr<i8>, i64, i32, array<20 x i8>)>>, !llvm.ptr<i8>, !llvm.ptr<i8>) -> !llvm.i32
    %23 = llvm.mlir.addressof @stderr : !llvm.ptr<ptr<struct<"struct._IO_FILE", (i32, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<struct<"struct._IO_marker", opaque>>, ptr<struct<"struct._IO_FILE">>, i32, i32, i64, i16, i8, array<1 x i8>, ptr<i8>, i64, ptr<struct<"struct._IO_codecvt", opaque>>, ptr<struct<"struct._IO_wide_data", opaque>>, ptr<struct<"struct._IO_FILE">>, ptr<i8>, i64, i32, array<20 x i8>)>>>
    %24 = llvm.load %23 : !llvm.ptr<ptr<struct<"struct._IO_FILE", (i32, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<struct<"struct._IO_marker", opaque>>, ptr<struct<"struct._IO_FILE">>, i32, i32, i64, i16, i8, array<1 x i8>, ptr<i8>, i64, ptr<struct<"struct._IO_codecvt", opaque>>, ptr<struct<"struct._IO_wide_data", opaque>>, ptr<struct<"struct._IO_FILE">>, ptr<i8>, i64, i32, array<20 x i8>)>>>
    %25 = llvm.mlir.addressof @str7 : !llvm.ptr<array<23 x i8>>
    %26 = llvm.getelementptr %25[%3, %3] : (!llvm.ptr<array<23 x i8>>, !llvm.i64, !llvm.i64) -> !llvm.ptr<i8>
    %27 = llvm.call @fprintf(%24, %26) : (!llvm.ptr<struct<"struct._IO_FILE", (i32, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<struct<"struct._IO_marker", opaque>>, ptr<struct<"struct._IO_FILE">>, i32, i32, i64, i16, i8, array<1 x i8>, ptr<i8>, i64, ptr<struct<"struct._IO_codecvt", opaque>>, ptr<struct<"struct._IO_wide_data", opaque>>, ptr<struct<"struct._IO_FILE">>, ptr<i8>, i64, i32, array<20 x i8>)>>, !llvm.ptr<i8>) -> !llvm.i32
    return
  ^bb3(%28: i32):  // 2 preds: ^bb1, ^bb4
    %29 = cmpi "slt", %28, %arg1 : i32
    %30 = index_cast %28 : i32 to index
    cond_br %29, ^bb4, ^bb5
  ^bb4:  // pred: ^bb3
    %31 = muli %13, %arg0 : i32
    %32 = addi %31, %28 : i32
    %33 = remi_signed %32, %c20_i32 : i32
    %34 = cmpi "eq", %33, %c0_i32 : i32
    scf.if %34 {
      %44 = llvm.mlir.addressof @stderr : !llvm.ptr<ptr<struct<"struct._IO_FILE", (i32, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<struct<"struct._IO_marker", opaque>>, ptr<struct<"struct._IO_FILE">>, i32, i32, i64, i16, i8, array<1 x i8>, ptr<i8>, i64, ptr<struct<"struct._IO_codecvt", opaque>>, ptr<struct<"struct._IO_wide_data", opaque>>, ptr<struct<"struct._IO_FILE">>, ptr<i8>, i64, i32, array<20 x i8>)>>>
      %45 = llvm.load %44 : !llvm.ptr<ptr<struct<"struct._IO_FILE", (i32, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<struct<"struct._IO_marker", opaque>>, ptr<struct<"struct._IO_FILE">>, i32, i32, i64, i16, i8, array<1 x i8>, ptr<i8>, i64, ptr<struct<"struct._IO_codecvt", opaque>>, ptr<struct<"struct._IO_wide_data", opaque>>, ptr<struct<"struct._IO_FILE">>, ptr<i8>, i64, i32, array<20 x i8>)>>>
      %46 = llvm.mlir.addressof @str4 : !llvm.ptr<array<2 x i8>>
      %47 = llvm.getelementptr %46[%3, %3] : (!llvm.ptr<array<2 x i8>>, !llvm.i64, !llvm.i64) -> !llvm.ptr<i8>
      %48 = llvm.call @fprintf(%45, %47) : (!llvm.ptr<struct<"struct._IO_FILE", (i32, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<struct<"struct._IO_marker", opaque>>, ptr<struct<"struct._IO_FILE">>, i32, i32, i64, i16, i8, array<1 x i8>, ptr<i8>, i64, ptr<struct<"struct._IO_codecvt", opaque>>, ptr<struct<"struct._IO_wide_data", opaque>>, ptr<struct<"struct._IO_FILE">>, ptr<i8>, i64, i32, array<20 x i8>)>>, !llvm.ptr<i8>) -> !llvm.i32
    }
    %35 = llvm.mlir.addressof @stderr : !llvm.ptr<ptr<struct<"struct._IO_FILE", (i32, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<struct<"struct._IO_marker", opaque>>, ptr<struct<"struct._IO_FILE">>, i32, i32, i64, i16, i8, array<1 x i8>, ptr<i8>, i64, ptr<struct<"struct._IO_codecvt", opaque>>, ptr<struct<"struct._IO_wide_data", opaque>>, ptr<struct<"struct._IO_FILE">>, ptr<i8>, i64, i32, array<20 x i8>)>>>
    %36 = llvm.load %35 : !llvm.ptr<ptr<struct<"struct._IO_FILE", (i32, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<struct<"struct._IO_marker", opaque>>, ptr<struct<"struct._IO_FILE">>, i32, i32, i64, i16, i8, array<1 x i8>, ptr<i8>, i64, ptr<struct<"struct._IO_codecvt", opaque>>, ptr<struct<"struct._IO_wide_data", opaque>>, ptr<struct<"struct._IO_FILE">>, ptr<i8>, i64, i32, array<20 x i8>)>>>
    %37 = llvm.mlir.addressof @str5 : !llvm.ptr<array<8 x i8>>
    %38 = llvm.getelementptr %37[%3, %3] : (!llvm.ptr<array<8 x i8>>, !llvm.i64, !llvm.i64) -> !llvm.ptr<i8>
    %39 = load %arg2[%15, %30] : memref<2000x2300xf64>
    %40 = llvm.mlir.cast %39 : f64 to !llvm.double
    %41 = llvm.call @fprintf(%36, %38, %40) : (!llvm.ptr<struct<"struct._IO_FILE", (i32, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<i8>, ptr<struct<"struct._IO_marker", opaque>>, ptr<struct<"struct._IO_FILE">>, i32, i32, i64, i16, i8, array<1 x i8>, ptr<i8>, i64, ptr<struct<"struct._IO_codecvt", opaque>>, ptr<struct<"struct._IO_wide_data", opaque>>, ptr<struct<"struct._IO_FILE">>, ptr<i8>, i64, i32, array<20 x i8>)>>, !llvm.ptr<i8>, !llvm.double) -> !llvm.i32
    %42 = addi %28, %c1_i32 : i32
    br ^bb3(%42 : i32)
  ^bb5:  // pred: ^bb3
    %43 = addi %13, %c1_i32 : i32
    br ^bb1(%43 : i32)
  }
  func private @S0(%arg0: memref<2000x2300xf64>, %arg1: index, %arg2: index, %arg3: f64) attributes {scop.stmt} {
    %0 = affine.load %arg0[symbol(%arg1), symbol(%arg2)] : memref<2000x2300xf64>
    %1 = mulf %0, %arg3 : f64
    affine.store %1, %arg0[symbol(%arg1), symbol(%arg2)] : memref<2000x2300xf64>
    return
  }
  func private @S1(%arg0: memref<2000x2300xf64>, %arg1: index, %arg2: index, %arg3: memref<2600x2300xf64>, %arg4: index, %arg5: f64, %arg6: memref<2000x2600xf64>) attributes {scop.stmt} {
    %0 = affine.load %arg0[symbol(%arg1), symbol(%arg2)] : memref<2000x2300xf64>
    %1 = affine.load %arg6[symbol(%arg1), symbol(%arg4)] : memref<2000x2600xf64>
    %2 = mulf %arg5, %1 : f64
    %3 = affine.load %arg3[symbol(%arg4), symbol(%arg2)] : memref<2600x2300xf64>
    %4 = mulf %2, %3 : f64
    %5 = addf %0, %4 : f64
    affine.store %5, %arg0[symbol(%arg1), symbol(%arg2)] : memref<2000x2300xf64>
    return
  }
  func private @kernel_gemm_new(%arg0: i32, %arg1: i32, %arg2: i32, %arg3: f64, %arg4: f64, %arg5: memref<2000x2300xf64>, %arg6: memref<2000x2600xf64>, %arg7: memref<2600x2300xf64>) {
    %0 = index_cast %arg0 : i32 to index
    %1 = index_cast %arg1 : i32 to index
    %2 = index_cast %arg2 : i32 to index
    affine.if #set0()[%0, %1] {
      affine.for %arg8 = 0 to #map0()[%0] {
        affine.for %arg9 = 0 to #map0()[%1] {
          affine.for %arg10 = #map1(%arg8) to min #map2(%arg8)[%0] {
            affine.for %arg11 = #map1(%arg9) to min #map2(%arg9)[%1] {
              call @S0(%arg5, %arg10, %arg11, %arg4) : (memref<2000x2300xf64>, index, index, f64) -> ()
            } {scop.point_loop}
          } {scop.point_loop}
        }
      }
      affine.if #set1()[%2] {
        affine.for %arg8 = 0 to #map0()[%0] {
          affine.for %arg9 = 0 to #map0()[%1] {
            affine.for %arg10 = 0 to #map0()[%2] {
              affine.for %arg11 = #map1(%arg8) to min #map2(%arg8)[%0] {
                affine.for %arg12 = #map1(%arg10) to min #map2(%arg10)[%2] {
                  affine.for %arg13 = #map1(%arg9) to min #map2(%arg9)[%1] {
                    call @S1(%arg5, %arg11, %arg13, %arg7, %arg12, %arg3, %arg6) : (memref<2000x2300xf64>, index, index, memref<2600x2300xf64>, index, f64, memref<2000x2600xf64>) -> ()
                  } {scop.point_loop}
                } {scop.point_loop}
              } {scop.point_loop}
            }
          }
        }
      }
    }
    return
  }
  func private @kernel_gemm__PE0(%arg0: memref<2000x2300xf64>, %arg1: f64, %arg2: index, %arg3: memref<2600x2300xf64>, %arg4: f64, %arg5: memref<2000x2600xf64>, %arg6: index, %arg7: index) {
    affine.for %arg8 = 0 to %arg7 {
      affine.for %arg9 = 0 to %arg2 {
        call @S0(%arg0, %arg8, %arg9, %arg1) : (memref<2000x2300xf64>, index, index, f64) -> ()
      } {scop.point_loop}
      affine.for %arg9 = 0 to %arg6 {
        affine.for %arg10 = 0 to %arg2 {
          call @S1(%arg0, %arg8, %arg10, %arg3, %arg9, %arg4, %arg5) : (memref<2000x2300xf64>, index, index, memref<2600x2300xf64>, index, f64, memref<2000x2600xf64>) -> ()
        } {scop.point_loop}
      } {scop.point_loop}
    } {scop.point_loop}
    return
  }
  func private @kernel_gemm_new__PE1(%arg0: memref<2000x2300xf64>, %arg1: f64, %arg2: index, %arg3: index, %arg4: index, %arg5: index) {
    affine.for %arg6 = #map1(%arg4) to min #map2(%arg4)[%arg5] {
      affine.for %arg7 = #map1(%arg2) to min #map2(%arg2)[%arg3] {
        call @S0(%arg0, %arg6, %arg7, %arg1) : (memref<2000x2300xf64>, index, index, f64) -> ()
      } {scop.point_loop}
    } {scop.point_loop}
    return
  }
  func private @kernel_gemm_new__PE2(%arg0: memref<2000x2300xf64>, %arg1: memref<2600x2300xf64>, %arg2: f64, %arg3: memref<2000x2600xf64>, %arg4: index, %arg5: index, %arg6: index, %arg7: index, %arg8: index, %arg9: index) {
    affine.for %arg10 = #map1(%arg8) to min #map2(%arg8)[%arg9] {
      affine.for %arg11 = #map1(%arg6) to min #map2(%arg6)[%arg7] {
        affine.for %arg12 = #map1(%arg4) to min #map2(%arg4)[%arg5] {
          call @S1(%arg0, %arg10, %arg12, %arg1, %arg11, %arg2, %arg3) : (memref<2000x2300xf64>, index, index, memref<2600x2300xf64>, index, f64, memref<2000x2600xf64>) -> ()
        } {scop.point_loop}
      } {scop.point_loop}
    } {scop.point_loop}
    return
  }
}

Test tiling dependent statements in PLUTO.

func @load_store_dep_tiling() {
  %A = alloc() : memref<64x64xf32>

  affine.for %i0 = 1 to 64 {
    affine.for %j0 = 1 to 64 {
      %i1 = affine.apply affine_map<(d0) -> (d0 - 1)>(%i0)
      %j1 = affine.apply affine_map<(d0) -> (d0 - 1)>(%j0)

      %0 = affine.load %A[%i0, %j1] : memref<64x64xf32>
      %1 = affine.load %A[%i1, %j0] : memref<64x64xf32>
      %2 = addf %0, %1 : f32
      affine.store %2, %A[%i0, %j0] : memref<64x64xf32>
    }
  }

  return
}

Pluto seems unable to process this type of input: https://groups.google.com/g/pluto-development/c/9mTvz5j5VxE
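For reference, the nest above writes A[i][j] from A[i][j-1] and A[i-1][j], so each iteration depends on its left and upper neighbours (dependence distance vectors (0, 1) and (1, 0)). A plain Python rendering of the same recurrence — a hypothetical illustration, just to make the access pattern explicit:

```python
# Python rendering of the loop nest above: each (i, j) reads the
# values written at (i, j-1) and (i-1, j), giving dependence
# distance vectors (0, 1) and (1, 0).
def load_store_dep(n=64):
    A = [[1.0] * n for _ in range(n)]
    for i in range(1, n):
        for j in range(1, n):
            A[i][j] = A[i][j - 1] + A[i - 1][j]
    return A
```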

Code cleanup on the IMPACT branch

We've got some new updates on the impact branch, but they contain loads of hacky things. We should gradually migrate them into master.

TODO:

  • Handle div/mod #8
  • New ScatTreeNode design.
  • Improve OslScop implementation
  • Directly generate OpenScop relation from FlatAffineConstraints
  • Update how we call Pluto (use pluto_schedule_prog)
  • Merge the latest reg2mem
  • Merge the latest extract-scop-stmt
  • Improve the ConvertToOpenScop pass and add more tests
  • Improve the ConvertFromOpenScop pass and add more tests
  • Refactor the symbol table passing before and after Pluto optimization.
  • Integrate Polybench test cases #53

Handle max/min reduction in clast_expr

Min and max are not among the operations that AffineExpr supports. We should return all of the min/max operands, create an AffineMap from them, and wrap its results in an affine.min or affine.max operation.

This means we should create a separate branch of the process method in AffineExprBuilder that can handle multiple return results.
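The intended lowering can be sketched in plain Python (hypothetical names, not Polymer's actual API): a min/max reduction becomes a map with one result per operand, and affine.min/affine.max selects among the results.

```python
# Sketch: a clast min/max reduction becomes an affine map with one
# result per operand; affine.min / affine.max then selects among them.
def lower_reduction(kind, operand_exprs):
    # operand_exprs: functions of the loop IVs, one per map result.
    select = min if kind == "min" else max
    def apply(*ivs):
        results = [e(*ivs) for e in operand_exprs]  # multi-result map
        return select(results)                      # affine.min / affine.max
    return apply

# e.g. min(2 * i + 3, 32)
bound = lower_reduction("min", [lambda i: 2 * i + 3, lambda i: 32])
```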

Handle duplication of for-loop structures

Sometimes Pluto may generate a Scop like this:

<OpenScop>

# =============================================== Global
# Language
C

# Context
CONTEXT
2 4 0 0 0 2
# e/i| P0   P1 |  1  
   1    1    0   -1    ## P0-1 >= 0
   1    0    1   -1    ## P1-1 >= 0

# Parameters are provided
1
<strings>
P0 P1
</strings>

# Number of statements
2

# =============================================== Statement 1
# Number of relations describing the statement:
4

# ----------------------------------------------  1.1 Domain
DOMAIN
6 7 3 0 0 2
# e/i| fk0  fk1  i0 | P0   P1 |  1  
   1    0    0    1    0    0    0    ## i0 >= 0
   1    0    0   -1    1    0   -1    ## -i0+P0-1 >= 0
   1  -32    0    1    0    0    0    ## -32*fk0+i0 >= 0
   1   32    0   -1    0    0   31    ## 32*fk0-i0+31 >= 0
   1    0  -32    0    0    0    0    ## -32*fk1 >= 0
   1    0   32    0    0    0   31    ## 32*fk1+31 >= 0

# ----------------------------------------------  1.2 Scattering
SCATTERING
8 15 8 3 0 2
# e/i| c1   c2   c3   c4   c5   c6   c7   c8 | fk0  fk1  i0 | P0   P1 |  1  
   0   -1    0    0    0    0    0    0    0    0    0    0    0    0    0    ## c1 == 0
   0    0   -1    0    0    0    0    0    0    1    0    0    0    0    0    ## c2 == fk0
   0    0    0   -1    0    0    0    0    0    0    1    0    0    0    0    ## c3 == fk1
   0    0    0    0   -1    0    0    0    0    0    0    0    0    0    0    ## c4 == 0
   0    0    0    0    0   -1    0    0    0    0    0    1    0    0    0    ## c5 == i0
   0    0    0    0    0    0   -1    0    0    0    0    0    0    0    0    ## c6 == 0
   0    0    0    0    0    0    0   -1    0    0    0    0    0    0    0    ## c7 == 0
   0    0    0    0    0    0    0    0   -1    0    0    0    0    0    0    ## c8 == 0

# ----------------------------------------------  1.3 Access
WRITE
2 9 2 3 0 2
# e/i| Arr  [1]| fk0  fk1  i0 | P0   P1 |  1  
   0   -1    0    0    0    0    0    0    1    ## Arr == A1
   0    0   -1    0    0    1    0    0    0    ## [1] == i0

READ
2 9 2 3 0 2
# e/i| Arr  [1]| fk0  fk1  i0 | P0   P1 |  1  
   0   -1    0    0    0    0    0    0    2    ## Arr == A2
   0    0   -1    0    0    1    0    0    0    ## [1] == i0

# ----------------------------------------------  1.4 Statement Extensions
# Number of Statement Extensions
1
<body>
# Number of original iterators
3
# List of original iterators
fk0 fk1 i0
# Statement body expression
S0(A1, 1, A2, 1, i0)
</body>

# =============================================== Statement 2
# Number of relations describing the statement:
4

# ----------------------------------------------  2.1 Domain
DOMAIN
10 9 5 0 0 2
# e/i| fk0  fk1  fk2  i0   i1 | P0   P1 |  1  
   1    0    0    0    1    0    0    0    0    ## i0 >= 0
   1    0    0    0   -1    0    1    0   -1    ## -i0+P0-1 >= 0
   1    0    0    0    0    1    0    0    0    ## i1 >= 0
   1    0    0    0    0   -1    0    1   -1    ## -i1+P1-1 >= 0
   1  -32    0    0    1    0    0    0    0    ## -32*fk0+i0 >= 0
   1   32    0    0   -1    0    0    0   31    ## 32*fk0-i0+31 >= 0
   1    0  -32    0    0    0    0    0    1    ## -32*fk1+1 >= 0
   1    0   32    0    0    0    0    0   30    ## 32*fk1+30 >= 0
   1    0    0  -32    0    1    0    0    0    ## -32*fk2+i1 >= 0
   1    0    0   32    0   -1    0    0   31    ## 32*fk2-i1+31 >= 0

# ----------------------------------------------  2.2 Scattering
SCATTERING
8 17 8 5 0 2
# e/i| c1   c2   c3   c4   c5   c6   c7   c8 | fk0  fk1  fk2  i0   i1 | P0   P1 |  1  
   0   -1    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    ## c1 == 0
   0    0   -1    0    0    0    0    0    0    1    0    0    0    0    0    0    0    ## c2 == fk0
   0    0    0   -1    0    0    0    0    0    0    1    0    0    0    0    0    0    ## c3 == fk1
   0    0    0    0   -1    0    0    0    0    0    0    1    0    0    0    0    0    ## c4 == fk2
   0    0    0    0    0   -1    0    0    0    0    0    0    1    0    0    0    0    ## c5 == i0
   0    0    0    0    0    0   -1    0    0    0    0    0    0    0    0    0    1    ## c6 == 1
   0    0    0    0    0    0    0   -1    0    0    0    0    0    1    0    0    0    ## c7 == i1
   0    0    0    0    0    0    0    0   -1    0    0    0    0    0    0    0    0    ## c8 == 0

# ----------------------------------------------  2.3 Access
WRITE
3 12 3 5 0 2
# e/i| Arr  [1]  [2]| fk0  fk1  fk2  i0   i1 | P0   P1 |  1  
   0   -1    0    0    0    0    0    0    0    0    0    3    ## Arr == A3
   0    0   -1    0    0    0    0    1    0    0    0    0    ## [1] == i0
   0    0    0   -1    0    0    0    0    1    0    0    0    ## [2] == i1

READ
2 11 2 5 0 2
# e/i| Arr  [1]| fk0  fk1  fk2  i0   i1 | P0   P1 |  1  
   0   -1    0    0    0    0    0    0    0    0    1    ## Arr == A1
   0    0   -1    0    0    0    1    0    0    0    0    ## [1] == i0

# ----------------------------------------------  2.4 Statement Extensions
# Number of Statement Extensions
1
<body>
# Number of original iterators
5
# List of original iterators
fk0 fk1 fk2 i0 i1
# Statement body expression
S1(A3, 2, A1, 1, i0, i1)
</body>

# =============================================== Extensions
<arrays>
# Number of arrays
3
# Mapping array-identifiers/array-names
3 A3
2 A2
1 A1
</arrays>

<scatnames>
t1 t2 t3 t4 t5 t6 t7 t8
</scatnames>

<loop>
# Number of loops
2
# ===========================================
# Loop number 1 
# Iterator name
t2
# Number of stmts
2
# Statement identifiers
1
2
# Private variables
lbv, ubvt3,t4,t5,t6,t7,t8
# Directive
1
# ===========================================
# Loop number 2 
# Iterator name
t7
# Number of stmts
1
# Statement identifiers
2
# Private variables
(null)
# Directive
4
</loop>

</OpenScop>
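For readers unfamiliar with the format: every constraint row above is `[e/i flag, coefficients..., constant]`, where flag 0 means the expression equals 0 and flag 1 means it is >= 0 — exactly what the `##` comments spell out. A small decoder (a hypothetical helper, just for illustration):

```python
def row_to_str(row, names):
    # OpenScop constraint row: [e/i flag, coefficients..., constant].
    # Flag 1 -> inequality (expr >= 0), flag 0 -> equality (expr == 0).
    flag, coeffs, const = row[0], row[1:-1], row[-1]
    terms = [f"{c}*{n}" for c, n in zip(coeffs, names) if c != 0]
    expr = " + ".join(terms) if terms else "0"
    if const != 0:
        expr += f" + {const}"
    return f"{expr} {'>= 0' if flag == 1 else '== 0'}"
```

For instance, the third DOMAIN row of statement 1 decodes back to `-32*fk0 + 1*i0 >= 0`, matching its `##` comment up to the printed coefficient of 1.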

When this is handled by CLooG, duplicated loop structures that iterate over the polyhedra may be generated. In that case, the same statement can appear at different places in the code, and different placements may use exactly the same names for their loop IVs. If so, our code that generates MLIR from the Scop cannot handle it well, since it relies on IV names being unique.

We might need to fundamentally change the design of the symbol table. Or maybe we need to add a new pass to make things easier to handle.

Also note that there are conflicts in the naming of the statements as well. For instance, the same S1 can appear at different places in the code due to Pluto and CLooG.
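One low-effort direction (a sketch, not something Polymer implements today) is to uniquify IV and statement names as they enter the symbol table, e.g. by appending a counter to repeated names:

```python
# Sketch: uniquify names before inserting them into the symbol table,
# so two CLooG loops that both call their IV "t2" (or two copies of
# statement "S1") get distinct keys.
def make_uniquer():
    seen = {}
    def unique(name):
        n = seen.get(name, 0)
        seen[name] = n + 1
        return name if n == 0 else f"{name}_{n}"
    return unique

unique = make_uniquer()
```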

Create a CI script

Besides doing the regular testing, this CI script will be the reference for installation steps on a clean system.

Allow partially optimising a function

Currently we need to optimise a function as a whole, but the desired way is to split the function into multiple parts and replace each part with an optimised version where possible. Otherwise, as in the example in #42, we cannot copy the code that gets the dimensionality.

Compiled Polybench code

SMALL refers to the dataset size.

SMALL.O0: without Polymer

SMALL.O0.tar.gz

SMALL.O1: with Polymer

SMALL.O1.tar.gz

Compilation command:

cd example/polybench
./compile -f SMALL -O0
./compile -f SMALL -O1

You can find SMALL.O0 and SMALL.O1 under example/polybench/tmp.

Generate runnable MLIR from polymer-opt

Currently we generate a single function that takes in block arguments. In order to make it compatible with polybench code in general, we should change the codegen behaviour.

The workflow should work as follows:

  1. The input MLIR code should contain a @main function that acts as the entry point. It should also have a set of affine functions that should be optimised by polymer-opt.
  2. Given the names of the entry point and the affine functions to be transformed (passed as command-line args), polymer-opt optimises each of the given affine functions and replaces the calls to them with the optimised versions.
  3. We can then perform some post optimisation and lower it down to LLVM.

For example, command like this should work properly:

# In the build/ directory
./bin/polymer-opt ../test/polymer-opt/PlutoTransform/matmul.mlir --pluto-opt=matmul | ../llvm/build/bin/mlir-opt --lower-affine --convert-scf-to-std --convert-std-to-llvm | ../llvm/build/bin/mlir-cpu-runner --entry-point-result=void -O3

TODO -

Properly handle loop with arbitrary bounds

Loops with arbitrary bounds are pretty common in real-world test cases.

Test code -

func @matmul(%A: memref<?x?xf32>, %B: memref<?x?xf32>, %C: memref<?x?xf32>) {
  %c0 = constant 0 : index
  %c1 = constant 1 : index

  %M = dim %A, %c0 : memref<?x?xf32>
  %N = dim %B, %c0 : memref<?x?xf32>
  %K = dim %A, %c1 : memref<?x?xf32>

  affine.for %i = 0 to %M {
    affine.for %j = 0 to %N {
      affine.for %k = 0 to %K {
        %0 = affine.load %A[%i, %k] : memref<?x?xf32>
        %1 = affine.load %B[%k, %j] : memref<?x?xf32>
        %2 = mulf %0, %1 : f32
        %3 = affine.load %C[%i, %j] : memref<?x?xf32>
        %4 = addf %2, %3 : f32
        affine.store %4, %C[%i, %j] : memref<?x?xf32>
      }
    }
  }

  return 
}

func @main() {
  %M = constant 64 : index
  %N = constant 64 : index
  %K = constant 64 : index

  %A = alloc(%M, %K) : memref<?x?xf32>
  %B = alloc(%N, %K) : memref<?x?xf32>
  %C = alloc(%M, %N) : memref<?x?xf32>

  call @matmul(%A, %B, %C)
   : (memref<?x?xf32>, memref<?x?xf32>, memref<?x?xf32>) -> ()

  return
}

Better compilation flow with PLUTO

It is difficult to link PLUTO as a static library. When it is linked as a shared library, we should update the LD_LIBRARY_PATH when running any polymer tool, which is troublesome.

Resolving context

Constants

If a parameter is given as a constant, we should be able to put the corresponding equation in the context relation.

%N = constant 32 : index
affine.for %i = 0 to %N {
}
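In OpenScop terms, a parameter known to be a constant contributes an equality row to the CONTEXT relation, encoding P - value == 0. A hypothetical helper sketching the row layout (e/i flag, parameter coefficients, constant):

```python
# Sketch: a parameter with a known constant value contributes an
# equality row [0, coeffs..., constant] to the CONTEXT relation,
# encoding  P - value == 0.
def context_row_for_constant(param_index, num_params, value):
    row = [0] * (num_params + 2)   # e/i flag, params..., constant
    row[0] = 0                     # 0 -> equality
    row[1 + param_index] = 1       # coefficient of the parameter
    row[-1] = -value               # P - value == 0
    return row
```

For the `%N = constant 32` example above, the first parameter would get the row `[0, 1, 0, -32]` (assuming two parameters), i.e. N - 32 == 0.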

Integrate polybench

Should add polybench into the source tree and add instructions for experimentation.

[BUG][Pluto] Failed to process modulo access relation

func private @S0(%A: memref<40x50xf32>, %t: index, %i: index) attributes {scop.stmt} {
  %cst = constant 0.25: f32
  %cst2 = constant 2.0 : f32

  %0 = affine.load %A[%t mod 2, %i + 1]: memref<40x50xf32>
  %1 = affine.load %A[%t mod 2, %i] : memref<40x50xf32>
  %2 = mulf %cst2, %1 : f32
  %3 = addf %0, %2 : f32
  %4 = affine.load %A[%t mod 2, %i - 1]: memref<40x50xf32>
  %5 = addf %3, %4 : f32
  %6 = mulf %cst, %5 : f32
  affine.store %6, %A[(%t + 1) mod 2, %i] : memref<40x50xf32>

  return
}


func @jacobi_diamond(%A: memref<40x50xf32>) {
  affine.for %t = 0 to 38 {
    affine.for %i = 1 to 47 {
      call @S0(%A, %t, %i): (memref<40x50xf32>, index, index) -> ()
    }
  }
  return
}
ninja && ./bin/polymer-opt ../test/polymer-opt/PlutoTransforms/jacobi-diamond.mlir -pluto-opt
ninja: no work to do.
polymer-opt: /mnt/ccnas2/bdp/rz3515/projects/phism/polygeist/llvm-project/llvm/include/llvm/ADT/ArrayRef.h:257: const T &llvm::ArrayRef<long>::operator[](size_t) const [T = long]: Assertion `Index < Length && "Invalid index!"' failed.
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace.
Stack dump:
0.      Program arguments: ./bin/polymer-opt ../test/polymer-opt/PlutoTransforms/jacobi-diamond.mlir -pluto-opt
 #0 0x0000000000b7f48a llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) /mnt/ccnas2/bdp/rz3515/projects/phism/polygeist/llvm-project/llvm/lib/Support/Unix/Signals.inc:565:11
 #1 0x0000000000b7f65b PrintStackTraceSignalHandler(void*) /mnt/ccnas2/bdp/rz3515/projects/phism/polygeist/llvm-project/llvm/lib/Support/Unix/Signals.inc:632:1
 #2 0x0000000000b7dc2b llvm::sys::RunSignalHandlers() /mnt/ccnas2/bdp/rz3515/projects/phism/polygeist/llvm-project/llvm/lib/Support/Signals.cpp:96:5
 #3 0x0000000000b7fda1 SignalHandler(int) /mnt/ccnas2/bdp/rz3515/projects/phism/polygeist/llvm-project/llvm/lib/Support/Unix/Signals.inc:407:1
 #4 0x00007fb558686630 __restore_rt (/lib64/libpthread.so.0+0xf630)
 #5 0x00007fb555e0e3d7 raise (/lib64/libc.so.6+0x363d7)
 #6 0x00007fb555e0fac8 abort (/lib64/libc.so.6+0x37ac8)
 #7 0x00007fb555e071a6 __assert_fail_base (/lib64/libc.so.6+0x2f1a6)
 #8 0x00007fb555e07252 (/lib64/libc.so.6+0x2f252)
 #9 0x0000000000f7d4ba llvm::ArrayRef<long>::operator[](unsigned long) const /mnt/ccnas2/bdp/rz3515/projects/phism/polygeist/llvm-project/llvm/include/llvm/ADT/ArrayRef.h:0:7
#10 0x00000000011c45ff polymer::OslScop::addRelation(int, int, int, int, int, int, int, int, llvm::ArrayRef<long>, llvm::ArrayRef<long>) /mnt/ccnas2/bdp/rz3515/projects/phism/polymer/build/../lib/Support/OslScop.cc:128:43
#11 0x00000000011c588b polymer::OslScop::addAccessRelation(int, bool, mlir::Value, mlir::AffineValueMap&, mlir::FlatAffineValueConstraints&) /mnt/ccnas2/bdp/rz3515/projects/phism/polymer/build/../lib/Support/OslScop.cc:273:1
#12 0x00000000011c20e0 (anonymous namespace)::OslScopBuilder::build(mlir::FuncOp)::$_0::operator()(mlir::Operation*) const /mnt/ccnas2/bdp/rz3515/projects/phism/polymer/build/../lib/Target/OpenScop/ConvertToOpenScop.cc:132:7
#13 0x00000000011c2000 void llvm::function_ref<void (mlir::Operation*)>::callback_fn<(anonymous namespace)::OslScopBuilder::build(mlir::FuncOp)::$_0>(long, mlir::Operation*) /mnt/ccnas2/bdp/rz3515/projects/phism/polygeist/llvm-project/llvm/include/llvm/ADT/STLExtras.h:177:5
#14 0x0000000000f299bc llvm::function_ref<void (mlir::Operation*)>::operator()(mlir::Operation*) const /mnt/ccnas2/bdp/rz3515/projects/phism/polygeist/llvm-project/llvm/include/llvm/ADT/STLExtras.h:200:5
#15 0x000000000111ba26 mlir::detail::walk(mlir::Operation*, llvm::function_ref<void (mlir::Operation*)>, mlir::WalkOrder) /mnt/ccnas2/bdp/rz3515/projects/phism/polygeist/llvm-project/mlir/lib/IR/Visitors.cpp:68:1
#16 0x000000000111b9d3 mlir::detail::walk(mlir::Operation*, llvm::function_ref<void (mlir::Operation*)>, mlir::WalkOrder) /mnt/ccnas2/bdp/rz3515/projects/phism/polygeist/llvm-project/mlir/lib/IR/Visitors.cpp:61:27
#17 0x00000000011c1f95 std::enable_if<llvm::is_one_of<mlir::Operation*, mlir::Operation*, mlir::Region*, mlir::Block*>::value, void>::type mlir::detail::walk<(mlir::WalkOrder)1, (anonymous namespace)::OslScopBuilder::build(mlir::FuncOp)::$_0, mlir::Operation*, void>(mlir::Operation*, (anonymous namespace)::OslScopBuilder::build(mlir::FuncOp)::$_0&&) /mnt/ccnas2/bdp/rz3515/projects/phism/polygeist/llvm-project/mlir/include/mlir/IR/Visitors.h:134:3
#18 0x00000000011c1f30 void mlir::Operation::walk<(mlir::WalkOrder)1, (anonymous namespace)::OslScopBuilder::build(mlir::FuncOp)::$_0, void>((anonymous namespace)::OslScopBuilder::build(mlir::FuncOp)::$_0&&) /mnt/ccnas2/bdp/rz3515/projects/phism/polygeist/llvm-project/mlir/include/mlir/IR/Operation.h:522:5
#19 0x00000000011c1bc3 void mlir::OpState::walk<(mlir::WalkOrder)1, (anonymous namespace)::OslScopBuilder::build(mlir::FuncOp)::$_0, void>((anonymous namespace)::OslScopBuilder::build(mlir::FuncOp)::$_0&&) /mnt/ccnas2/bdp/rz3515/projects/phism/polygeist/llvm-project/mlir/include/mlir/IR/OpDefinition.h:161:5
#20 0x00000000011c10b8 (anonymous namespace)::OslScopBuilder::build(mlir::FuncOp) /mnt/ccnas2/bdp/rz3515/projects/phism/polymer/build/../lib/Target/OpenScop/ConvertToOpenScop.cc:135:11
#21 0x00000000011c0d57 polymer::createOpenScopFromFuncOp(mlir::FuncOp, polymer::OslSymbolTable&) /mnt/ccnas2/bdp/rz3515/projects/phism/polymer/build/../lib/Target/OpenScop/ConvertToOpenScop.cc:225:27
#22 0x000000000111d3e5 plutoTransform(mlir::FuncOp, mlir::OpBuilder&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, bool, int, int, bool) /mnt/ccnas2/bdp/rz3515/projects/phism/polymer/build/../lib/Transforms/PlutoTransform.cc:80:8
#23 0x000000000111cee8 (anonymous namespace)::PlutoTransformPass::runOnOperation() /mnt/ccnas2/bdp/rz3515/projects/phism/polymer/build/../lib/Transforms/PlutoTransform.cc:164:28
#24 0x0000000000e8eb48 mlir::detail::OpToOpPassAdaptor::run(mlir::Pass*, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int) /mnt/ccnas2/bdp/rz3515/projects/phism/polygeist/llvm-project/mlir/lib/Pass/Pass.cpp:386:21
#25 0x0000000000e8f10d mlir::detail::OpToOpPassAdaptor::runPipeline(llvm::iterator_range<llvm::pointee_iterator<std::unique_ptr<mlir::Pass, std::default_delete<mlir::Pass> >*, mlir::Pass> >, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int, mlir::PassInstrumentor*, mlir::PassInstrumentation::PipelineParentInfo const*) /mnt/ccnas2/bdp/rz3515/projects/phism/polygeist/llvm-project/mlir/lib/Pass/Pass.cpp:445:16
#26 0x0000000000e907be mlir::PassManager::runPasses(mlir::Operation*, mlir::AnalysisManager) /mnt/ccnas2/bdp/rz3515/projects/phism/polygeist/llvm-project/mlir/lib/Pass/Pass.cpp:689:10
#27 0x0000000000e906a0 mlir::PassManager::run(mlir::Operation*) /mnt/ccnas2/bdp/rz3515/projects/phism/polygeist/llvm-project/mlir/lib/Pass/Pass.cpp:669:60
#28 0x0000000000e35c49 performActions(llvm::raw_ostream&, bool, bool, llvm::SourceMgr&, mlir::MLIRContext*, mlir::PassPipelineCLParser const&) /mnt/ccnas2/bdp/rz3515/projects/phism/polygeist/llvm-project/mlir/lib/Support/MlirOptMain.cpp:84:17
#29 0x0000000000e34b78 processBuffer(llvm::raw_ostream&, std::unique_ptr<llvm::MemoryBuffer, std::default_delete<llvm::MemoryBuffer> >, bool, bool, bool, bool, mlir::PassPipelineCLParser const&, mlir::DialectRegistry&) /mnt/ccnas2/bdp/rz3515/projects/phism/polygeist/llvm-project/mlir/lib/Support/MlirOptMain.cpp:120:12
#30 0x0000000000e34996 mlir::MlirOptMain(llvm::raw_ostream&, std::unique_ptr<llvm::MemoryBuffer, std::default_delete<llvm::MemoryBuffer> >, mlir::PassPipelineCLParser const&, mlir::DialectRegistry&, bool, bool, bool, bool, bool) /mnt/ccnas2/bdp/rz3515/projects/phism/polygeist/llvm-project/mlir/lib/Support/MlirOptMain.cpp:157:10
#31 0x0000000000a5549e main /mnt/ccnas2/bdp/rz3515/projects/phism/polymer/build/../tools/polymer-opt/polymer-opt.cc:117:17
#32 0x00007fb555dfa555 __libc_start_main (/lib64/libc.so.6+0x22555)
#33 0x0000000000a55029 _start (./bin/polymer-opt+0xa55029)
[1]    160465 abort (core dumped)  ./bin/polymer-opt ../test/polymer-opt/PlutoTransforms/jacobi-diamond.mlir 

Issues when sum reduction is the 2nd operand of min/max

This is a tricky bug: if a sum reduction appears as the 2nd operand, this assertion will fail

assert(affExprs.size() == 1 &&
       "A single affine expr should be returned "
       "after processing an expr in reduction.");

Instead of checking that there is exactly one expr, we should check that exactly one expr was newly added.

We should change which affExpr to update as well here -

// TODO: deal with negative terms.
affExprs[0] = affExprs[0] + currExprs[0];
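The proposed fix can be sketched in plain Python with hypothetical names (the real code lives in AffineExprBuilder): record the size before processing an operand, assert exactly one expr was appended, and fold the new term into that freshly added slot rather than always into index 0.

```python
# Sketch of the fix: record the size before processing an operand,
# assert exactly one expr was appended, and fold the sum into the
# newly added expr instead of always into affExprs[0].
def process_sum_operand(aff_exprs, process):
    size_before = len(aff_exprs)
    process(aff_exprs)  # may append new exprs to aff_exprs
    added = len(aff_exprs) - size_before
    assert added == 1, "a single affine expr should be newly added"
    if size_before >= 1:
        # combine the previous partial sum with the new term, in place
        new_term = aff_exprs.pop()
        aff_exprs[-1] = aff_exprs[-1] + new_term
```

Here ints stand in for AffineExprs: with `aff_exprs = [100, 5]` (100 being the untouched 1st min/max operand), appending a term 7 folds into the 2nd operand only, leaving `[100, 12]`.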

Min/max in loop bounds

If a clast min/max expr is used in for-loop bounds, Polymer will first create an affine.min/max op, which results in values defined by affine.min/max being used as induction variables. This is prohibited.

A workaround is, in this function:

LogicalResult
Importer::getAffineLoopBound(clast_expr *expr,
                             llvm::SmallVectorImpl<mlir::Value> &operands,
                             AffineMap &affMap, bool isUpper) {
  AffineExprBuilder builder(context, symTable, &symbolTable, scop, options);
  SmallVector<AffineExpr, 4> boundExprs;
  // Build the AffineExpr for the loop bound.
  if (failed(builder.process(expr, boundExprs)))
    return failure();
  // If looking at the upper bound, we should add 1 to all of them.
  if (isUpper)
    for (auto &expr : boundExprs)
      expr = expr + b.getAffineConstantExpr(1);
  // Insert dim operands.
  unsigned numDims = builder.dimNames.size();
  unsigned numSymbols = builder.symbolNames.size();
  operands.resize(numDims + numSymbols);
  for (const auto &it : builder.dimNames) {
    if (auto iv = symbolTable[it.first()]) {
      operands[it.second] = iv;
    } else {
      llvm::errs() << "Dim " << it.first()
                   << " cannot be recognized as a value.\n";
      return failure();
    }
  }
  // Create or get BlockArgument for the symbols. We assume all symbols come
  // from the BlockArgument of the generated function.
  for (const auto &it : builder.symbolNames) {
    mlir::Value operand = symbolTable[it.first()];
    assert(operand != nullptr);
    operands[it.second + numDims] = operand;
  }
  // Create the AffineMap for loop bound.
  affMap = AffineMap::get(numDims, numSymbols, boundExprs, context);
  return success();
}

Before we process the expr through the builder, we should first check whether it is a min/max reduction; if so, we process its operands individually and use all of them as affine map results.

This might be tricky if there are nested min/max exprs, but we haven't seen that in practice, so we will ignore it for now.
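The workaround can be sketched as follows (Python pseudocode mirroring getAffineLoopBound above; the tuple encoding of clast exprs is a stand-in, not the real clast API):

```python
def get_affine_loop_bound(expr, is_upper=False):
    # expr is either a plain affine expr (an int stands in for an
    # AffineExpr here) or a ("min" | "max", [operands]) reduction node.
    if isinstance(expr, tuple) and expr[0] in ("min", "max"):
        # Process each operand individually; every one becomes a
        # result of the same multi-result AffineMap, so no affine.min
        # or affine.max value needs to be materialised first.
        bound_exprs = list(expr[1])
    else:
        bound_exprs = [expr]
    if is_upper:
        # affine.for upper bounds are exclusive, so add 1 to each.
        bound_exprs = [e + 1 for e in bound_exprs]
    return bound_exprs  # becomes the result list of a single AffineMap
```

An affine.for op then takes the max (for lower bounds) or min (for upper bounds) over the map's results directly, which keeps the induction variable legal.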

(P.S., this only appears after we propagate constants to the loop domain. Our previous successful results are still valid - in those cases the constants were still undecided.)

Abandon InvariantScopTransform

It won't be very necessary to have it once the PlutoTransform is stable and easy to debug. Maintaining InvariantScopTransform would be a burden.

The instructions in the README seem to mismatch the source code.

Hi, after building polymer-opt at the following commit, I try to execute (matmul.mlir is from /test/archive/polymer-translate/export-scop/matmul.mlir):

./bin/polymer-opt -pluto-opt matmul.mlir

It produces the following error:

./matmul.mlir:4:8: error: custom op 'alloc' is unknown
%A = alloc() : memref<64x64xf32>
^

Then I change all "alloc" to "memref.alloc".

Then, executing the command as the README says, it doesn't take effect and the output is the same as the source.

Only after executing the command like:

polymer-opt -debug -extract-scop-stmt -pluto-opt matmul.mlir

does the Pluto transform seem to be applied:

#map0 = affine_map<(d0) -> (d0 * 32)>
#map1 = affine_map<(d0) -> (d0 * 32 + 32)>
module {
  func private @s0(%arg0: index, %arg1: index, %arg2: memref<64x64xf32>, %arg3: index, %arg4: memref<64x64xf32>, %arg5: memref<64x64xf32>) attributes {scop.stmt} {
    %0 = affine.load %arg5[symbol(%arg0), symbol(%arg3)] : memref<64x64xf32>
    %1 = affine.load %arg4[symbol(%arg3), symbol(%arg1)] : memref<64x64xf32>
    %2 = mulf %0, %1 : f32
    %3 = affine.load %arg2[symbol(%arg0), symbol(%arg1)] : memref<64x64xf32>
    %4 = addf %2, %3 : f32
    affine.store %4, %arg2[symbol(%arg0), symbol(%arg1)] : memref<64x64xf32>
    return
  }
  func @Matmul() {
    %0 = memref.alloc() : memref<64x64xf32>
    %1 = memref.alloc() : memref<64x64xf32>
    %2 = memref.alloc() : memref<64x64xf32>
    affine.for %arg0 = 0 to 2 {
      affine.for %arg1 = 0 to 2 {
        affine.for %arg2 = 0 to 2 {
          affine.for %arg3 = #map0(%arg0) to #map1(%arg0) {
            affine.for %arg4 = #map0(%arg2) to #map1(%arg2) {
              affine.for %arg5 = #map0(%arg1) to #map1(%arg1) {
                call @s0(%arg3, %arg5, %0, %arg4, %1, %2) : (index, index, memref<64x64xf32>, index, memref<64x64xf32>, memref<64x64xf32>) -> ()
              }
            }
          }
        }
      }
    }
    return
  }
}

There may be some mismatch between the source code and the test, so I am filing an issue.

commit 76d9c2c (origin/main, origin/HEAD)
Merge: 4b8e998 7756df4
Author: Ruizhe Zhao [email protected]
Date: Tue May 25 16:32:48 2021 +0000

Merge branch 'main' of github.com:kumasento/polymer into main

[Benchmark][AES] handle a mixture of affine and non-affine code

Input code -

func @encrypt(%arg0: memref<?x16xi32>, %arg1: memref<?xi32>) attributes {llvm.linkage = #llvm.linkage<external>} {
  %c1_i32 = arith.constant 1 : i32
  %c4_i32 = arith.constant 4 : i32
  %c15_i32 = arith.constant 15 : i32
  %c8_i32 = arith.constant 8 : i32
  %c283_i32 = arith.constant 283 : i32
  %0 = memref.alloca() : memref<1024xi32>
  affine.for %arg2 = 1 to 5 {
    affine.for %arg3 = 0 to 16 {
      %1 = affine.load %arg1[%arg3 * 4] : memref<?xi32>
      %2 = arith.shrsi %1, %c4_i32 : i32
      %3 = arith.index_cast %2 : i32 to index
      %4 = arith.andi %1, %c15_i32 : i32
      %5 = arith.index_cast %4 : i32 to index
      %6 = memref.load %arg0[%3, %5] : memref<?x16xi32>
      affine.store %6, %arg1[%arg3 * 4] : memref<?xi32>
    }
    affine.for %arg3 = 0 to 1023 {
      %1 = affine.load %arg1[%arg3] : memref<?xi32>
      %2 = arith.shli %1, %c1_i32 : i32
      affine.store %2, %0[%arg3] : memref<1024xi32>
      %3 = arith.shrsi %2, %c8_i32 : i32
      %4 = arith.cmpi eq, %3, %c1_i32 : i32
      scf.if %4 {
        %10 = arith.xori %2, %c283_i32 : i32
        affine.store %10, %0[%arg3] : memref<1024xi32>
      }
      %5 = affine.load %arg1[%arg3 + 1] : memref<?xi32>
      %6 = arith.shli %5, %c1_i32 : i32
      %7 = arith.xori %5, %6 : i32
      %8 = arith.shrsi %7, %c8_i32 : i32
      %9 = arith.cmpi eq, %8, %c1_i32 : i32
      scf.if %9 {
        %10 = arith.xori %7, %c283_i32 : i32
        %11 = affine.load %0[%arg3] : memref<1024xi32>
        %12 = arith.xori %11, %10 : i32
        affine.store %12, %0[%arg3] : memref<1024xi32>
      } else {
        %10 = affine.load %0[%arg3] : memref<1024xi32>
        %11 = arith.xori %10, %7 : i32
        affine.store %11, %0[%arg3] : memref<1024xi32>
      }
    }
    affine.for %arg3 = 0 to 1024 {
      %1 = affine.load %0[%arg3] : memref<1024xi32>
      affine.store %1, %arg1[%arg3] : memref<?xi32>
    }
  }
  return
}

Investigating the performance difference between Pluto and Polymer on seidel-2d.

UPDATE: There is a tentative conclusion on this matter: to make the Pluto version achieve the same performance as the Polymer version, it should use i64 for its induction variables and boundaries (which removes loads of sext and trunc instructions), and -O3 should be applied both for LLVM-IR emission and for final executable compilation (previously we only used -O3 when emitting the LLVM IR).

Numbers can be found at: #71 (comment)


This issue is an informal report on the whole investigation procedure.

Setup

To reproduce:

git checkout debug-seidel-2d

cd example/polybench
./eval-perf -f ./seidel-2d-debug/seidel-2d/seidel-2d.c

seidel-2d-debug contains the EXTRALARGE version of the seidel-2d benchmark code. Running eval-perf compiles and runs the given code with both Pluto and Polymer; it outputs the overall run time and keeps the important intermediate results.

The Pluto CLI (polycc) and Polymer use the same options, i.e., no parallelization, vectorization, or unroll-and-jam.

Vectorization is disabled at the clang level using -fno-vectorize (you may investigate its effect by checking the LLVM IR code).

Only one -O3 is applied while generating the LLVM IR.

The host machine doesn't have any of its multi-core configurations disabled, e.g., hyper-threading config stays the same.

Notice the difference

After running the eval-perf script given above, we have:

Pluto run time:   186.504164
Polymer run time: 134.634325

The performance gap is huge.

Check the CLooG code

We would normally check the schedules first, but since the CLooG code from both tools is pretty compact, we skipped the schedule comparison and looked directly at the generated CLooG code.

From Polymer

if ((P0 >= 1) && (P1 >= 3)) {
  for (t1=0;t1<=floord(P0-1,32);t1++) {
    for (t2=t1;t2<=min(floord(P0+P1-3,32),floord(32*t1+P1+29,32));t2++) {
      for (t3=max(ceild(64*t2-P1-28,32),t1+t2);t3<=min(min(min(min(floord(P0+P1-3,16),floord(32*t1+P1+29,16)),floord(64*t2+P1+59,32)),floord(32*t1+32*t2+P1+60,32)),floord(32*t2+P0+P1+28,32));t3++) {
        for (t4=max(max(max(32*t1,32*t2-P1+2),16*t3-P1+2),-32*t2+32*t3-P1-29);t4<=min(min(min(min(P0-1,32*t1+31),32*t2+30),16*t3+14),-32*t2+32*t3+30);t4++) {
          for (t5=max(max(32*t2,t4+1),32*t3-t4-P1+2);t5<=min(min(32*t2+31,32*t3-t4+30),t4+P1-2);t5++) {
            for (t6=max(32*t3,t4+t5+1);t6<=min(32*t3+31,t4+t5+P1-2);t6++) {
              S0(-t4+t5, -t4-t5+t6, t4)
            }
          }
        }
      }
    }
  }
}

From Pluto

if ((_PB_N >= 3) && (_PB_TSTEPS >= 1)) {
  for (t1=0;t1<=floord(_PB_TSTEPS-1,32);t1++) {
    for (t2=t1;t2<=min(floord(_PB_TSTEPS+_PB_N-3,32),floord(32*t1+_PB_N+29,32));t2++) {
      for (t3=max(ceild(64*t2-_PB_N-28,32),t1+t2);t3<=min(min(min(min(floord(_PB_TSTEPS+_PB_N-3,16),floord(32*t1+_PB_N+29,16)),floord(64*t2+_PB_N+59,32)),floord(32*t1+32*t2+_PB_N+60,32)),floord(32*t2+_PB_TSTEPS+_PB_N+28,32));t3++) {
        for (t4=max(max(max(32*t1,32*t2-_PB_N+2),16*t3-_PB_N+2),-32*t2+32*t3-_PB_N-29);t4<=min(min(min(min(_PB_TSTEPS-1,32*t1+31),32*t2+30),16*t3+14),-32*t2+32*t3+30);t4++) {
          for (t5=max(max(32*t2,t4+1),32*t3-t4-_PB_N+2);t5<=min(min(32*t2+31,32*t3-t4+30),t4+_PB_N-2);t5++) {
            for (t6=max(32*t3,t4+t5+1);t6<=min(32*t3+31,t4+t5+_PB_N-2);t6++) {
              A[(-t4+t5)][(-t4-t5+t6)] = (A[(-t4+t5)-1][(-t4-t5+t6)-1] + A[(-t4+t5)-1][(-t4-t5+t6)] + A[(-t4+t5)-1][(-t4-t5+t6)+1] + A[(-t4+t5)][(-t4-t5+t6)-1] + A[(-t4+t5)][(-t4-t5+t6)] + A[(-t4+t5)][(-t4-t5+t6)+1] + A[(-t4+t5)+1][(-t4-t5+t6)-1] + A[(-t4+t5)+1][(-t4-t5+t6)] + A[(-t4+t5)+1][(-t4-t5+t6)+1])/SCALAR_VAL(9.0);
            }
          }
        }
      }
    }
  }
}

I won't say I'm very focused when comparing these two, but as far as I can tell, they are identical regarding the nested loops.

Check the LLVM IR code

Now we move into the dark domain.

We check two different things: the complicated loop boundary calculation and the loop body.

Loop body

First of all, without additional flags, the Pluto result uses vectorization (i.e., -fno-vectorize is not applied), and its run time is 200.756374, about 7.6% slower than the version without vectorization (shown above).

Now let's have a closer look at the loop body:

Polymer:

define void @S0(double* %0, double* %1, i64 %2, i64 %3, i64 %4, i64 %5, i64 %6, i64 %7, i64 %8) !dbg !170 {
%10 = insertvalue { double*, double*, i64, [2 x i64], [2 x i64] } undef, double* %0, 0, !dbg !171
%11 = insertvalue { double*, double*, i64, [2 x i64], [2 x i64] } %10, double* %1, 1, !dbg !173
%12 = insertvalue { double*, double*, i64, [2 x i64], [2 x i64] } %11, i64 %2, 2, !dbg !174
%13 = insertvalue { double*, double*, i64, [2 x i64], [2 x i64] } %12, i64 %3, 3, 0, !dbg !175
%14 = insertvalue { double*, double*, i64, [2 x i64], [2 x i64] } %13, i64 %5, 4, 0, !dbg !176
%15 = insertvalue { double*, double*, i64, [2 x i64], [2 x i64] } %14, i64 %4, 3, 1, !dbg !177
%16 = insertvalue { double*, double*, i64, [2 x i64], [2 x i64] } %15, i64 %6, 4, 1, !dbg !178
%17 = add i64 %7, -1, !dbg !179
%18 = add i64 %8, -1, !dbg !180
%19 = extractvalue { double*, double*, i64, [2 x i64], [2 x i64] } %16, 1, !dbg !181
%20 = mul i64 %17, 4000, !dbg !182
%21 = add i64 %20, %18, !dbg !183
%22 = getelementptr double, double* %19, i64 %21, !dbg !184
%23 = load double, double* %22, align 8, !dbg !185
%24 = add i64 %7, -1, !dbg !186
%25 = extractvalue { double*, double*, i64, [2 x i64], [2 x i64] } %16, 1, !dbg !187
%26 = mul i64 %24, 4000, !dbg !188
%27 = add i64 %26, %8, !dbg !189
%28 = getelementptr double, double* %25, i64 %27, !dbg !190
%29 = load double, double* %28, align 8, !dbg !191
%30 = fadd double %23, %29, !dbg !192
%31 = add i64 %7, -1, !dbg !193
%32 = add i64 %8, 1, !dbg !194
%33 = extractvalue { double*, double*, i64, [2 x i64], [2 x i64] } %16, 1, !dbg !195
%34 = mul i64 %31, 4000, !dbg !196
%35 = add i64 %34, %32, !dbg !197
%36 = getelementptr double, double* %33, i64 %35, !dbg !198
%37 = load double, double* %36, align 8, !dbg !199
%38 = fadd double %30, %37, !dbg !200
%39 = add i64 %8, -1, !dbg !201
%40 = extractvalue { double*, double*, i64, [2 x i64], [2 x i64] } %16, 1, !dbg !202
%41 = mul i64 %7, 4000, !dbg !203
%42 = add i64 %41, %39, !dbg !204
%43 = getelementptr double, double* %40, i64 %42, !dbg !205
%44 = load double, double* %43, align 8, !dbg !206
%45 = fadd double %38, %44, !dbg !207
%46 = extractvalue { double*, double*, i64, [2 x i64], [2 x i64] } %16, 1, !dbg !208
%47 = mul i64 %7, 4000, !dbg !209
%48 = add i64 %47, %8, !dbg !210
%49 = getelementptr double, double* %46, i64 %48, !dbg !211
%50 = load double, double* %49, align 8, !dbg !212
%51 = fadd double %45, %50, !dbg !213
%52 = add i64 %8, 1, !dbg !214
%53 = extractvalue { double*, double*, i64, [2 x i64], [2 x i64] } %16, 1, !dbg !215
%54 = mul i64 %7, 4000, !dbg !216
%55 = add i64 %54, %52, !dbg !217
%56 = getelementptr double, double* %53, i64 %55, !dbg !218
%57 = load double, double* %56, align 8, !dbg !219
%58 = fadd double %51, %57, !dbg !220
%59 = add i64 %7, 1, !dbg !221
%60 = add i64 %8, -1, !dbg !222
%61 = extractvalue { double*, double*, i64, [2 x i64], [2 x i64] } %16, 1, !dbg !223
%62 = mul i64 %59, 4000, !dbg !224
%63 = add i64 %62, %60, !dbg !225
%64 = getelementptr double, double* %61, i64 %63, !dbg !226
%65 = load double, double* %64, align 8, !dbg !227
%66 = fadd double %58, %65, !dbg !228
%67 = add i64 %7, 1, !dbg !229
%68 = extractvalue { double*, double*, i64, [2 x i64], [2 x i64] } %16, 1, !dbg !230
%69 = mul i64 %67, 4000, !dbg !231
%70 = add i64 %69, %8, !dbg !232
%71 = getelementptr double, double* %68, i64 %70, !dbg !233
%72 = load double, double* %71, align 8, !dbg !234
%73 = fadd double %66, %72, !dbg !235
%74 = add i64 %7, 1, !dbg !236
%75 = add i64 %8, 1, !dbg !237
%76 = extractvalue { double*, double*, i64, [2 x i64], [2 x i64] } %16, 1, !dbg !238
%77 = mul i64 %74, 4000, !dbg !239
%78 = add i64 %77, %75, !dbg !240
%79 = getelementptr double, double* %76, i64 %78, !dbg !241
%80 = load double, double* %79, align 8, !dbg !242
%81 = fadd double %73, %80, !dbg !243
%82 = fdiv double %81, 9.000000e+00, !dbg !244
%83 = extractvalue { double*, double*, i64, [2 x i64], [2 x i64] } %16, 1, !dbg !245
%84 = mul i64 %7, 4000, !dbg !246
%85 = add i64 %84, %8, !dbg !247
%86 = getelementptr double, double* %83, i64 %85, !dbg !248
store double %82, double* %86, align 8, !dbg !249
ret void, !dbg !250
}

Pluto:

for.body1466.i: ; preds = %for.body1466.i, %for.body1466.lr.ph.i
%indvars.iv69.i = phi i64 [ %85, %for.body1466.lr.ph.i ], [ %indvars.iv.next70.i, %for.body1466.i ]
%92 = trunc i64 %indvars.iv69.i to i32
%add1472.i = sub i32 %92, %91
%sub1473.i = add nsw i32 %add1472.i, -1
%idxprom1474.i = sext i32 %sub1473.i to i64
%arrayidx1475.i = getelementptr inbounds [4000 x double], [4000 x double]* %arraydecay, i64 %87, i64 %idxprom1474.i
%93 = load double, double* %arrayidx1475.i, align 8, !tbaa !2
%idxprom1484.i = sext i32 %add1472.i to i64
%arrayidx1485.i = getelementptr inbounds [4000 x double], [4000 x double]* %arraydecay, i64 %87, i64 %idxprom1484.i
%94 = load double, double* %arrayidx1485.i, align 8, !tbaa !2
%add1486.i = fadd double %93, %94
%add1495.i = add nsw i32 %add1472.i, 1
%idxprom1496.i = sext i32 %add1495.i to i64
%arrayidx1497.i = getelementptr inbounds [4000 x double], [4000 x double]* %arraydecay, i64 %87, i64 %idxprom1496.i
%95 = load double, double* %arrayidx1497.i, align 8, !tbaa !2
%add1498.i = fadd double %add1486.i, %95
%arrayidx1508.i = getelementptr inbounds [4000 x double], [4000 x double]* %arraydecay, i64 %86, i64 %idxprom1474.i
%96 = load double, double* %arrayidx1508.i, align 8, !tbaa !2
%add1509.i = fadd double %add1498.i, %96
%arrayidx1518.i = getelementptr inbounds [4000 x double], [4000 x double]* %arraydecay, i64 %86, i64 %idxprom1484.i
%97 = load double, double* %arrayidx1518.i, align 8, !tbaa !2
%add1519.i = fadd double %add1509.i, %97
%arrayidx1529.i = getelementptr inbounds [4000 x double], [4000 x double]* %arraydecay, i64 %86, i64 %idxprom1496.i
%98 = load double, double* %arrayidx1529.i, align 8, !tbaa !2
%add1530.i = fadd double %add1519.i, %98
%arrayidx1541.i = getelementptr inbounds [4000 x double], [4000 x double]* %arraydecay, i64 %89, i64 %idxprom1474.i
%99 = load double, double* %arrayidx1541.i, align 8, !tbaa !2
%add1542.i = fadd double %add1530.i, %99
%arrayidx1552.i = getelementptr inbounds [4000 x double], [4000 x double]* %arraydecay, i64 %89, i64 %idxprom1484.i
%100 = load double, double* %arrayidx1552.i, align 8, !tbaa !2
%add1553.i = fadd double %add1542.i, %100
%arrayidx1564.i = getelementptr inbounds [4000 x double], [4000 x double]* %arraydecay, i64 %89, i64 %idxprom1496.i
%101 = load double, double* %arrayidx1564.i, align 8, !tbaa !2
%add1565.i = fadd double %add1553.i, %101
%div1566.i = fdiv double %add1565.i, 9.000000e+00
store double %div1566.i, double* %arrayidx1518.i, align 8, !tbaa !2
%indvars.iv.next70.i = add nsw i64 %indvars.iv69.i, 1
%cmp1465.not.not.i = icmp slt i64 %indvars.iv69.i, %90
br i1 %cmp1465.not.not.i, label %for.body1466.i, label %for.inc1576.i, !llvm.loop !10

Note that Polymer's loop body is not inlined, so we should look at the body of S0 instead.

Address Calculation

The major difference between the two is the usage of getelementptr. Polymer explicitly calculates the flattened 1-d address, e.g.,

%20 = mul i64 %17, 4000, !dbg !182
%21 = add i64 %20, %18, !dbg !183
%22 = getelementptr double, double* %19, i64 %21, !dbg !184

while Pluto delegates the task to getelementptr:

%arrayidx1475.i = getelementptr inbounds [4000 x double], [4000 x double]* %arraydecay, i64 %87, i64 %idxprom1474.i

Since getelementptr in both cases should provide the same address, the performance difference from this alone shouldn't be large (not empirically verified).

Also, since Pluto's loop IVs are i32-typed, sext is needed to extend them to i64 before they are passed to getelementptr.

For this case, we manually changed the Pluto-generated C code by updating the loop IV type from int to long long, and the sext instructions are eliminated (as well as the truncs):

%sub1740.i = add nsw i64 %add1739.i, -1
%arrayidx1741.i = getelementptr inbounds [4000 x double], [4000 x double]* %arraydecay, i64 %sub1736.i, i64 %sub1740.i

However, the overall run time increases from around 187 to 195.4 seconds, which may suggest that using int is not the main reason for the gap between Polymer and Pluto.

Finally, Pluto adds inbounds while Polymer doesn't. In theory, this shouldn't affect performance much. We also manually removed inbounds and ran again; the result is roughly the same.

Computation Sequence

Pluto and Polymer compute the seidel-2d kernel in the same order, from top-left to bottom-right.

Conclusion

The loop body may not be the major source of the performance gap.

Boundary calculation

Boundary calculation is a bit harder to look into. So far, there are two differences I found that might affect performance.

nsw and nuw

The Pluto-generated LLVM IR adds nsw and/or nuw flags to the address-calculation-related binary operators, e.g.:

%indvars.iv.next85.i = add nuw nsw i64 %indvars.iv84.i, 1
%cmp1367.i = icmp ugt i64 %32, %indvars.iv.next85.i
%cond1373.v.i = select i1 %cmp1367.i, i64 %32, i64 %indvars.iv.next85.i
%69 = sub nsw i64 %64, %indvars.iv84.i
%70 = add nsw i64 %69, -3998
%sext.i = shl i64 %cond1373.v.i, 32
%71 = ashr exact i64 %sext.i, 32
%cmp1378.i = icmp sgt i64 %71, %70
%cond1395.v.i = select i1 %cmp1378.i, i64 %cond1373.v.i, i64 %70

It is unknown whether this has a significant effect on performance. Manually removing these flags does not produce any better result.

Direct translation vs optimized

(Sorry if I use the wrong term)

There are plenty of floordiv and ceildiv operations in the loop bounds. The Polymer version has them directly compiled to the corresponding LLVM IR operators, e.g., sdiv, mul, etc., while Pluto extensively optimizes them through constant folding and other strategies.

Take the outermost loop as an example:

for (t1=0;t1<=floord(_PB_TSTEPS-1,32);t1++) {

In Polymer,

24: ; preds = %9
%25 = add i64 %17, -1, !dbg !268
%26 = icmp slt i64 %25, 0, !dbg !269
%27 = sub i64 -1, %25, !dbg !270
%28 = select i1 %26, i64 %27, i64 %25, !dbg !271
%29 = sdiv i64 %28, 32, !dbg !272
%30 = sub i64 -1, %29, !dbg !273
%31 = select i1 %26, i64 %30, i64 %29, !dbg !274
%32 = add i64 %31, 1, !dbg !275
br label %33, !dbg !276
33: ; preds = %250, %24
%34 = phi i64 [ %251, %250 ], [ 0, %24 ]
%35 = icmp slt i64 %34, %32, !dbg !277
br i1 %35, label %36, label %252, !dbg !278

You can see that %32 is floord(_PB_TSTEPS - 1, 32) + 1, calculated mainly via sdiv, and %34 is the loop IV t1.

In Pluto:

for.inc1588.i: ; preds = %for.inc1585.i
%indvars.iv.next93.i = add nuw nsw i64 %indvars.iv92.i, 1
%indvars.iv.next41.i = add nuw nsw i32 %indvars.iv40.i, 32
%indvars.iv.next47.i = add nuw nsw i32 %indvars.iv46.i, 32
%indvars.iv.next51.i = add nsw i32 %indvars.iv50.i, -32
%indvars.iv.next117.i = add nuw nsw i64 %indvars.iv116.i, 1
%exitcond124.not.i = icmp eq i64 %indvars.iv.next93.i, 32
%indvars.iv.next = add nsw i32 %indvars.iv, -32
%indvars.iv.next191 = add nuw nsw i64 %indvars.iv190, 32
%indvars.iv.next193 = add nuw nsw i64 %indvars.iv192, 32
br i1 %exitcond124.not.i, label %kernel_seidel_2d.exit, label %for.body90.lr.ph.i, !llvm.loop !14

The condition, %exitcond124.not.i = icmp eq i64 %indvars.iv.next93.i, 32, basically compares t1 + 1 (%indvars.iv.next93.i) against 32, which is the constant-folded value of floord(_PB_TSTEPS - 1, 32) + 1 given _PB_TSTEPS = 1000.

You may also notice that in Polymer you can find explicit mul by 64, while in Pluto these power-of-two multiplications and divisions are strength-reduced into shifts by 6.

Conclusion

We still cannot draw a solid conclusion on which part dominates the performance gap. In particular, Pluto's version seems to be more optimized than the Polymer version regarding address calculation, so it is very likely that the difference in LLVM IR is not the major cause of the performance difference.

Building the project.

Hi, the Pluto submodule is probably broken, as I am not able to clone it.
Should I use the official Pluto repo instead?
