
mmperf's Introduction

Single CPU Core Matrix Multiplication Benchmarks

This repository aims to benchmark single-precision matrix multiply (SGEMM) across hand-tuned libraries and code-generation stacks, running on a single thread on one CPU core. The focus is on machine learning workloads, so on FP32 or smaller data types and irregular matrix sizes. The goal is to expose high-performance atomic kernels that can then be used to build highly efficient higher-level implementations spanning multiple cores or distributed across systems.
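For reference, the operation under test is a plain single-precision matrix multiply. A naive Python sketch of its semantics (illustrative only: it ignores the alpha/beta scaling of full BLAS SGEMM, and every benchmarked backend replaces this loop nest with tiled, vectorized kernels):

import numpy as np

def sgemm_ref(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    # C = A @ B with A: MxK, B: KxN, C: MxN, all FP32.
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N), dtype=np.float32)
    for i in range(M):
        for j in range(N):
            for p in range(K):
                C[i, j] += A[i, p] * B[p, j]
    return C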

Results

Results on Nvidia A100 (cuBLAS vs SHARK)

Results on Intel Alder Lake 12900K (AVX2)

Results on Intel Xeon Skylake (iMac Pro, AVX-512)

Results on Xeon Cascade Lake (GCP C2 instance, AVX-512)

Results on Xeon Cascade Lake with codegen backends TVM, Halide, and MLIR (GCP C2 instance, AVX-512)

Results on AMD Ryzen 5950X (Zen 3, compared to AMD's BLIS and OpenBLAS on ResNet-50 sizes)

Results on Intel Xeon E-2276M Coffee Lake (ThinkPad P53, AVX2)

Results on Apple M1 (NEON, no AMX2)

Note: the 8GB Mac Mini runs roughly 25% slower than the 16GB version in other tests.

Installation

Clone the repo along with submodules.

git clone --recurse-submodules https://github.com/mmperf/mmperf.git

Create a virtual environment and install requirements. Python 3.8 is required.

cd mmperf
python3 -m venv ./mmperf_env
source mmperf_env/bin/activate
pip install -r requirements.txt
pip install -r ./external/llvm-project/mlir/python/requirements.txt

Building the code

Building specified backends on CPU

Build the project, specifying the backend(s) that should run the matmul. Below is a command to build mmperf with the MLIR backend.

cmake -GNinja \
    -DCMAKE_CXX_COMPILER=clang++-11 \
    -DCMAKE_C_COMPILER=clang-11 \
    -DUSE_MLIR=ON \
    -B build .

cmake --build build

Here is another example that builds all available backends. It assumes you have MKL, OpenBLAS, and Halide installed (see below for installation details).

HALIDE_DIR=/home/foo/lokal/halide/ MKL_DIR=/opt/intel/oneapi/mkl/latest/ cmake -GNinja \
    -DCMAKE_CXX_COMPILER=clang++-11 \
    -DCMAKE_C_COMPILER=clang-11 \
    -DMKL_DIR=/opt/intel/oneapi/mkl/latest/ \
    -DUSE_MLIR=ON \
    -DUSE_MKL=ON \
    -DUSE_RUY=ON \
    -DUSE_HALIDE=ON \
    -DUSE_OPENBLAS=ON \
    -DUSE_IREE=ON \
    -DIREE_LLVMCPU=ON \
    -B build .

cmake --build build

Building specified backends on GPU

To benchmark any GPU backend, NVIDIA CUDA 11 must be installed on your system. To enable the CUDA compiler, set -DCMAKE_CUDA_COMPILER=nvcc on the command line. For example, to build IREE-CUDA and compare it with cuBLAS and TVM-CUDA (the TVM auto-scheduler, a.k.a. Ansor), run:

cmake -GNinja \
    -DCMAKE_CXX_COMPILER=clang++-11 \
    -DCMAKE_C_COMPILER=clang-11 \
    -DCMAKE_CUDA_COMPILER=nvcc \
    -DUSE_IREE=ON \
    -DIREE_CUDA=ON \
    -DUSE_CUBLAS=ON \
    -DUSE_TVM_CUDA=ON \
    -DTVM_ENABLE_CUDA=ON \
    -DUSE_TVM_TUNED=ON \
    -DTVM_LIB_DIR=/path/to/tvm-tuner \
    -DSIZE_FILE=benchmark_sizes/bert_large_matmul.txt \
    -B build .

Note: -DTVM_LIB_DIR should be set to the absolute path where the TVM binaries are located. For how to run the TVM auto-scheduler, please refer to this README.

Building with a standalone llvm (optional)

Building the external/llvm-project submodule can be space- and time-consuming. If you already have a standalone LLVM and don't want to fetch and compile this submodule, you can point at the LLVM on your system with the PREBUILT_LLVM_PATH flag:

cmake -GNinja \
    -DCMAKE_CXX_COMPILER=clang++-11 \
    -DCMAKE_C_COMPILER=clang-11 \
    -DPREBUILT_LLVM_PATH=$HOME/opt/llvm \
    -DUSE_MLIR=ON \
    -B build .

cmake --build build

To compile LLVM from scratch, you might want all of these as well:

echo "deb http://apt.llvm.org/DISTRO_NAME/ llvm-toolchain-DISTRO_NAME main" >> /etc/apt/sources.list
wget -O - https://apt.llvm.org/llvm-snapshot.gpg.key | apt-key add -
apt-get update && apt-get upgrade -y
apt-get install -y clang-11 clang-tools-11 libc++1-11 libc++-11-dev \
    libc++abi1-11 libc++abi-11-dev libclang1-11 libclang-11-dev \
    libclang-common-11-dev libclang-cpp11 libclang-cpp11-dev liblld-11 \
    liblld-11-dev liblldb-11 liblldb-11-dev libllvm11 libomp-11-dev \
    libomp5-11 lld-11 lldb-11 llvm-11 llvm-11-dev llvm-11-runtime \
    llvm-11-tools libfuzzer-11-dev

Running and generating results

We use AOT compilation to generate matrix multiplication binaries for the specified backends and run them to produce the benchmark numbers. To generate performance numbers and a comparison plot, run:

python3 mmperf.py ./build/matmul/ results

A results folder will be created in the mmperf top-level directory containing GFLOPS numbers for every matmul size and every backend. A plot comparing the backends' performance will also be generated in matmul.png.
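Each GFLOPS figure follows the usual 2·M·N·K FLOP count for an MxNxK matmul. A minimal sketch of that arithmetic (the formula is the conventional one, assumed here rather than quoted from the mmperf sources):

def gflops(m, n, k, seconds):
    # 2*M*N*K: one multiply and one add per inner-product term
    return 2 * m * n * k / seconds / 1e9

# e.g. a 24x64x512 matmul finishing in 20 microseconds:
print(gflops(24, 64, 512, 20e-6))  # ~78.6 GFLOPS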

Each generated binary can also be executed individually. To run a specific matrix size (say 24x64x512) for a backend run:

./build/matmul/matmul_<LIBRARY>_24x64x512

Program configuration

Matrix sizes: the benchmark_sizes folder contains text files listing the matrix sizes that mmperf runs on. You can change the matrix size input file by editing the SIZE_FILE option in cmake/common.cmake. The default is benchmark_all_sizes.txt.
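As a hypothetical illustration, a size file can be read as one size per line, assuming an MxNxK line format that mirrors the matmul_<LIBRARY>_MxNxK binary naming (check benchmark_sizes/benchmark_all_sizes.txt for the authoritative format):

def read_sizes(path):
    # Parse lines such as "24x64x512" into (M, N, K) tuples,
    # skipping blank lines.
    with open(path) as f:
        return [tuple(int(d) for d in line.strip().split("x"))
                for line in f if line.strip()]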

Precision: The default precision for all backends is FP32. An FP16 benchmark has been added to the cuBLAS, IREE, and Triton backends. To enable FP16, use the flag -DUSE_FP16=ON.

Run triton with FP16

To test the performance of OpenAI Triton with half precision, first pip install triton, then run the following command:

python mmperf.py ./build/matmul results -triton -dtype f16 -benchmark_path=./benchmark_sizes/bert_large_matmul.txt

Run nodai-shark-cuda example on Nvidia A100

We have uploaded sample configuration files produced by the Nod-ai SHARK auto-tuner. To test the performance of nodai_shark_cuda on an Nvidia A100, build IREE-CUDA (as shown in the previous section) and then run:

python mmperf.py ./build/matmul/ results -nodai_shark_cuda -nodai_shark_configs=official_results/best_configs_minilm/

Note: the configuration is specifically optimized for the Nvidia A100 GPU and may not work well on other GPU architectures.

Related Projects

IREE: Intermediate Representation Execution Environment

Nod-ai SHARK

Installing optional dependencies: Halide, OpenBLAS, BLIS, MKL

Halide

git clone https://github.com/halide/Halide.git --recurse-submodules
cd Halide/
sudo apt install libclang-11-dev clang-11 liblld-11-dev
LLD_DIR=/usr/lib/llvm-11/lib/cmake/lld cmake . -GNinja \
    -DCMAKE_BUILD_TYPE=Release \
    -DTARGET_WEBASSEMBLY=OFF \
    -DCMAKE_INSTALL_PREFIX=/home/<foo>/lokal/
ninja
ninja install
export HALIDE_DIR=/home/<foo>/lokal/halide

OpenBLAS

sudo apt install libopenblas-dev

BLIS

git clone https://github.com/flame/blis
cd blis
./configure --prefix=/home/foo/lokal/ --enable-cblas -c amd64
make -j 16
make install

Intel MKL

Download and install from https://software.intel.com/content/www/us/en/develop/articles/installation-guide-for-intel-oneapi-toolkits.html

Theoretical Max FLOPS

This benchmark was run on an Intel Xeon CPU running at 3.1 GHz. The machine has 256 KB of L1 cache, 8 MB of L2 cache, and 24.8 MB of L3 cache. It supports AVX-512 instructions. The peak performance of the machine is 3.1 (GHz) x 8 (FP64 lanes per AVX-512 vector) x 2 (FMA units) x 2 (FLOPs per FMA) = 99.2 GFLOPS for double precision, and 198.4 GFLOPS for single precision (16 FP32 lanes).
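The same arithmetic in a short sketch; the lane/FMA-unit decomposition is the standard reading of the formula above for an AVX-512 Skylake core, stated here as an assumption rather than a measurement:

def peak_gflops(freq_ghz, vector_bits, elem_bits, fma_units=2):
    lanes = vector_bits // elem_bits           # SIMD lanes per vector register
    return freq_ghz * lanes * fma_units * 2    # an FMA counts as 2 FLOPs

print(peak_gflops(3.1, 512, 64))  # FP64: 99.2 GFLOPS
print(peak_gflops(3.1, 512, 32))  # FP32: 198.4 GFLOPS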

mmperf's People

Contributors

chudur-budur, dan-garvey, harsh-nod, mariecwhite, monorimet, nirvedhmeshram, nithinsubbiah, powderluv, raikonenfnu, sogartar, spaceotter, sploving1, srcarroll, yzhang93


mmperf's Issues

CMake Error when installing mmperf

I tried to run mmperf on my server. I used a git shallow clone to pull down just the latest commits, not the entire repo. Then I executed the following instructions on the command line, as shown in the README.md:

python3 -m venv ./mmperf_env
source mmperf_env/bin/activate
pip install -r requirements.txt
pip install -r ./external/llvm-project/mlir/python/requirements.txt
cmake -GNinja     -DCMAKE_CXX_COMPILER=clang++-11     -DCMAKE_C_COMPILER=clang-11     -DUSE_MLIR=ON     -B build .
sudo cmake --build build

But when I executed sudo cmake --build build, I got a CMake error:

CMake Error at docs/cmake_install.cmake:46 (file):
  file INSTALL cannot find
  "/home/cmy/mmperf/build/mlir/docs/ocamldoc/html/.": No such file or
  directory.
Call Stack (most recent call first):
  cmake_install.cmake:85 (include)

iree-llvm-sandbox build failed

Hi, thanks for your nice work.
When I tried to build mlir with iree-llvm-sandbox-cuda using the flag -DUSE_IREE_LLVM_SANDBOX_CUDA=ON, I got this error:

CMake Error: File /home/mmperf/matmul/matmul_llvmsandbox_MxNxK.mlir does not exist.
CMake Error at CMakeLists.txt:184 (configure_file):
configure_file Problem configuring file
Call Stack (most recent call first):
CMakeLists.txt:311 (compile_llvm_sandbox_mlir)

It seems the file matmul_llvmsandbox_MxNxK.mlir is missing.

Runtime Error with TVM CUDA

Getting this error when running with TVM CUDA:

Running /home/mariewhite/github/mmperf/build/matmul/matmul_tvmcuda_512x1024x1024
Run on (12 X 2200.2 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 1024 KiB (x6)
  L3 Unified 39424 KiB (x1)
Load Average: 8.03, 10.70, 7.81
terminate called after throwing an instance of 'tvm::runtime::InternalError'
  what():  [16:05:12] /home/mariewhite/github/mmperf/external/tvm/src/tir/analysis/verify_memory.cc:214: RuntimeError: Memory verification failed with the following errors:
    Variable `B` is directly accessed by host memory (it is not contained in a thread environment or in the function arguments.
    Variable `A` is directly accessed by host memory (it is not contained in a thread environment or in the function arguments.
    Variable `A` is directly accessed by host memory (it is not contained in a thread environment or in the function arguments.
    Variable `A` is directly accessed by host memory (it is not contained in a thread environment or in the function arguments.
    Variable `A` is directly accessed by host memory (it is not contained in a thread environment or in the function arguments.
    Variable `C` is directly accessed by host memory (it is not contained in a thread environment or in the function arguments.
  Did you forget to bind?
PrimFunc([A, B, C]) attrs={"from_legacy_te_schedule": (bool)1, "global_symbol": "matmul", "tir.noalias": (bool)1, "target": cuda -keys=cuda,gpu -arch=sm_80 -max_num_threads=1024 -thread_warp_size=32} {
  allocate packedB[float32x32 * 32768], storage_scope = global
  parallel (ax0, 0, 32) {
    for (ax1, 0, 1024) {
      packedB[((ax0*1024) + ax1)] = B[ramp(((ax1*1024) + (ax0*32)), 1, 32)]
    }
  }
  parallel (ax0.outer, 0, 16) {
    allocate C.global[float32 * 1024], storage_scope = global
    for (ax1.outer, 0, 32) {
      for (ax0.c.init, 0, 32) {
        C.global[ramp((ax0.c.init*32), 1, 32)] = x32(0f)
      }
      for (k.outer, 0, 256) {
        for (ax0.c, 0, 32) {
          let cse_var_4 = (k.outer*4)
          let cse_var_3 = (ax0.c*32)
          let cse_var_2 = ((ax1.outer*1024) + cse_var_4)
          let cse_var_1 = (((ax0.outer*32768) + (ax0.c*1024)) + cse_var_4)
          C.global[ramp(cse_var_3, 1, 32)] = (C.global[ramp(cse_var_3, 1, 32)] + (x32(A[cse_var_1])*packedB[cse_var_2]))
          C.global[ramp(cse_var_3, 1, 32)] = (C.global[ramp(cse_var_3, 1, 32)] + (x32(A[(cse_var_1 + 1)])*packedB[(cse_var_2 + 1)]))
          C.global[ramp(cse_var_3, 1, 32)] = (C.global[ramp(cse_var_3, 1, 32)] + (x32(A[(cse_var_1 + 2)])*packedB[(cse_var_2 + 2)]))
          C.global[ramp(cse_var_3, 1, 32)] = (C.global[ramp(cse_var_3, 1, 32)] + (x32(A[(cse_var_1 + 3)])*packedB[(cse_var_2 + 3)]))
        }
      }
      for (ax0.inner, 0, 32) {
        for (ax1.inner, 0, 32) {
          C[((((ax0.outer*32768) + (ax0.inner*1024)) + (ax1.outer*32)) + ax1.inner)] = C.global[((ax0.inner*32) + ax1.inner)]
        }
      }
    }
  }
}

Incorrect inputs and perf calculation for batch matmul

For batch matmul, the input dimensions are [b, m, n, k]. But with the current implementation, only the first three dimensions are taken as m, n, k.

Not sure how to handle batch_matmul for the other backends, but for the mlir backend we should probably use linalg.batch_matmul.

sandbox search execute failed

Hi there, when I tried to run iree-llvm-sandbox, I got the following error:

python3 mmperf.py build/matmul results/haswell -sandbox -benchmark_path=/home/mmperf/benchmark_sizes/bert_large_matmul.txt
Latest symlink path is: /home/mmperf/results/haswell/latest
Latest results path is: /home/mmperf/results/haswell/2022-05-16_11-47-53-448597
Linux System Detected.. looking for /proc/cpuinfo

###############################################################
Compile-time problem size {'m': 512, 'n': 1024, 'k': 1024}
Runtime problem size {'m': 512, 'n': 1024, 'k': 1024}
Problem types [<class 'numpy.float32'>, <class 'numpy.float32'>, <class 'numpy.float32'>]

Compilation expert SingleTiling2DPeel
xxxxxxxxxx: Dialect:
Einsum spec: km,kn->mn
Compile time 0.06579804420471191
cannot dump ExecutionEngine object code to file: object cache is disabled
xxxxxxxxxx : 100 iters time on 1 threads

 slowest          p1         p10         p25         p50         p75         p90         p99     fastest        unit

 2.7e-02     2.7e-02     2.6e-02     2.6e-02     2.5e-02     2.5e-02     2.5e-02     2.5e-02     2.5e-02     seconds
   40.15       40.42       41.60       42.08       42.28       42.40       42.62       43.06       43.06    GFlops/s
    0.39        0.39        0.41        0.41        0.41        0.41        0.42        0.42        0.42       GBs/s

Run time 2.809998035430908
/home/mmperf/external/iree-llvm-sandbox/python/examples/core/harness.py:94: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
self.data = self.data.append(product)

Compilation expert SingleTiling3DPeel
xxxxxxxxxx: Dialect:
Einsum spec: km,kn->mn
Compile time 0.16719818115234375
cannot dump ExecutionEngine object code to file: object cache is disabled
xxxxxxxxxx : 100 iters time on 1 threads

 slowest          p1         p10         p25         p50         p75         p90         p99     fastest        unit

 2.9e-02     2.8e-02     2.7e-02     2.7e-02     2.7e-02     2.7e-02     2.5e-02     2.5e-02     2.5e-02     seconds
   36.62       38.25       39.40       39.94       40.10       40.48       42.35       42.93       42.93    GFlops/s
    0.36        0.37        0.38        0.39        0.39        0.40        0.41        0.42        0.42       GBs/s

Run time 3.0525176525115967
/home/mmperf/external/iree-llvm-sandbox/python/examples/core/harness.py:94: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
self.data = self.data.append(product)

Compilation expert SingleTiling3DPad
xxxxxxxxxx: Dialect:
Einsum spec: km,kn->mn
Compile time 0.19397234916687012
Segmentation fault (core dumped)
Traceback (most recent call last):
  File "mmperf.py", line 520, in <module>
    sys.exit(main(sys.argv))
  File "mmperf.py", line 368, in main
    sandbox_sizes, sandbox_speeds = sandbox_perf(file_path, args.num_iters)
  File "mmperf.py", line 266, in sandbox_perf
    result = subprocess.run(cmd, shell=True, check=True, cwd=dst_f)
  File "/usr/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'python -m python.examples.matmul.iree_sandbox_matmul -matrix_path /home/mmperf/benchmark_sizes/bert_large_matmul.txt -n_iters 100' returned non-zero exit status 139.

Cmake building MLIR backend on CPU failed

In file included from /home/shenghao/mmperf/external/benchmark/test/benchmark_gtest.cc:6:
In file included from /home/shenghao/mmperf/build/benchmark/src/benchmark-build/third_party/googletest/src/googlemock/include/gmock/gmock.h:56:
In file included from /home/shenghao/mmperf/build/benchmark/src/benchmark-build/third_party/googletest/src/googlemock/include/gmock/gmock-actions.h:145:
In file included from /home/shenghao/mmperf/build/benchmark/src/benchmark-build/third_party/googletest/src/googlemock/include/gmock/internal/gmock-internal-utils.h:50:
In file included from /home/shenghao/mmperf/build/benchmark/src/benchmark-build/third_party/googletest/src/googletest/include/gtest/gtest.h:60:
In file included from /home/shenghao/mmperf/build/benchmark/src/benchmark-build/third_party/googletest/src/googletest/include/gtest/gtest-death-test.h:43:
In file included from /home/shenghao/mmperf/build/benchmark/src/benchmark-build/third_party/googletest/src/googletest/include/gtest/internal/gtest-death-test-internal.h:46:
In file included from /home/shenghao/mmperf/build/benchmark/src/benchmark-build/third_party/googletest/src/googletest/include/gtest/gtest-matchers.h:48:
In file included from /home/shenghao/mmperf/build/benchmark/src/benchmark-build/third_party/googletest/src/googletest/include/gtest/gtest-printers.h:114:
/home/shenghao/mmperf/build/benchmark/src/benchmark-build/third_party/googletest/src/googletest/include/gtest/internal/gtest-internal.h:635:54: error: too few template arguments for class template 'less'
  typedef ::std::map<std::string, CodeLocation, std::less<>> RegisteredTestsMap;
                                                     ^
/usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/bits/stl_function.h:381:12: note: template is declared here
    struct less : public binary_function<_Tp, _Tp, bool>
           ^
In file included from /home/shenghao/mmperf/external/benchmark/test/benchmark_gtest.cc:6:
In file included from /home/shenghao/mmperf/build/benchmark/src/benchmark-build/third_party/googletest/src/googlemock/include/gmock/gmock.h:56:
In file included from /home/shenghao/mmperf/build/benchmark/src/benchmark-build/third_party/googletest/src/googlemock/include/gmock/gmock-actions.h:145:
In file included from /home/shenghao/mmperf/build/benchmark/src/benchmark-build/third_party/googletest/src/googlemock/include/gmock/internal/gmock-internal-utils.h:50:
In file included from /home/shenghao/mmperf/build/benchmark/src/benchmark-build/third_party/googletest/src/googletest/include/gtest/gtest.h:60:
In file included from /home/shenghao/mmperf/build/benchmark/src/benchmark-build/third_party/googletest/src/googletest/include/gtest/gtest-death-test.h:43:
In file included from /home/shenghao/mmperf/build/benchmark/src/benchmark-build/third_party/googletest/src/googletest/include/gtest/internal/gtest-death-test-internal.h:46:
In file included from /home/shenghao/mmperf/build/benchmark/src/benchmark-build/third_party/googletest/src/googletest/include/gtest/gtest-matchers.h:48:
In file included from /home/shenghao/mmperf/build/benchmark/src/benchmark-build/third_party/googletest/src/googletest/include/gtest/gtest-printers.h:114:
/home/shenghao/mmperf/build/benchmark/src/benchmark-build/third_party/googletest/src/googletest/include/gtest/internal/gtest-internal.h:612:22: error: member reference base type 'testing::internal::TypedTestSuitePState::RegisteredTestsMap' (aka 'int') is not a structure or union
    registered_tests_.insert(
    ~~~~~~~~~~~~~~~~~^~~~~~~
/home/shenghao/mmperf/build/benchmark/src/benchmark-build/third_party/googletest/src/googletest/include/gtest/internal/gtest-internal.h:618:29: error: member reference base type 'const testing::internal::TypedTestSuitePState::RegisteredTestsMap' (aka 'const int') is not a structure or union
    return registered_tests_.count(test_name) > 0;
           ~~~~~~~~~~~~~~~~~^~~~~~
/home/shenghao/mmperf/build/benchmark/src/benchmark-build/third_party/googletest/src/googletest/include/gtest/internal/gtest-internal.h:622:5: error: 'testing::internal::TypedTestSuitePState::RegisteredTestsMap' (aka 'int') is not a class, namespace, or enumeration
    RegisteredTestsMap::const_iterator it = registered_tests_.find(test_name);
    ^
/home/shenghao/mmperf/build/benchmark/src/benchmark-build/third_party/googletest/src/googletest/include/gtest/internal/gtest-internal.h:622:62: error: member reference base type 'const testing::internal::TypedTestSuitePState::RegisteredTestsMap' (aka 'const int') is not a structure or union
    RegisteredTestsMap::const_iterator it = registered_tests_.find(test_name);
                                            ~~~~~~~~~~~~~~~~~^~~~~
/home/shenghao/mmperf/build/benchmark/src/benchmark-build/third_party/googletest/src/googletest/include/gtest/internal/gtest-internal.h:623:41: error: member reference base type 'const testing::internal::TypedTestSuitePState::RegisteredTestsMap' (aka 'const int') is not a structure or union
    GTEST_CHECK_(it != registered_tests_.end());
                       ~~~~~~~~~~~~~~~~~^~~~
/home/shenghao/mmperf/build/benchmark/src/benchmark-build/third_party/googletest/src/googletest/include/gtest/internal/gtest-port.h:1005:35: note: expanded from macro 'GTEST_CHECK_'
  if (::testing::internal::IsTrue(condition)) \
                                  ^~~~~~~~~
6 errors generated.
[65/106] Building CXX object test/CMakeFiles/reporter_output_test.dir/reporter_output_test.cc.o
ninja: build stopped: subcommand failed.
FAILED: benchmark/src/benchmark-stamp/benchmark-build /home/shenghao/mmperf/build/benchmark/src/benchmark-stamp/benchmark-build
cd /home/shenghao/mmperf/build/benchmark/src/benchmark-build && /usr/local/bin/cmake --build .
ninja: build stopped: subcommand failed.

fatal: repository 'https://github.com/NodLabs/SHARK/tree/main/iree/' not found

Hi, I get an error when cloning iree. How can I solve it?

Cloning into '/root/mmperf/external/flatbuffers'...
remote: Enumerating objects: 25292, done.
remote: Counting objects: 100% (2168/2168), done.
remote: Compressing objects: 100% (890/890), done.
remote: Total 25292 (delta 1329), reused 1818 (delta 1174), pack-reused 23124
Receiving objects: 100% (25292/25292), 15.85 MiB | 13.83 MiB/s, done.
Resolving deltas: 100% (17583/17583), done.
Cloning into '/root/mmperf/external/iree'...
fatal: repository 'https://github.com/NodLabs/SHARK/tree/main/iree/' not found
fatal: clone of 'https://github.com/NodLabs/SHARK/tree/main/iree' into submodule path '/root/mmperf/external/iree' failed

Failed to clone 'external/iree'. Retry scheduled
Cloning into '/root/mmperf/external/iree-llvm-sandbox'...
remote: Enumerating objects: 7480, done.
remote: Counting objects: 100% (7480/7480), done.
remote: Compressing objects: 100% (2161/2161), done.
remote: Total 7480 (delta 5395), reused 7135 (delta 5153), pack-reused 0
Receiving objects: 100% (7480/7480), 2.98 MiB | 4.22 MiB/s, done.
Resolving deltas: 100% (5395/5395), done.

FAILED: src/runtime/initmod.hvx_128.bc /root/mmperf/build/halide/src/runtime/initmod.hvx_128.bc

Hi, external/llvm-project uses Clang 15. Halide supports LLVM versions 12.0 and 13.0, but not 11.0, 14.0, or 15.0.

-- The C compiler identification is Clang 15.0.0
-- The CXX compiler identification is Clang 15.0.0
CMake Warning at dependencies/llvm/CMakeLists.txt:31 (message):
Halide is not tested on LLVM versions beyond 14.0

[26/40] Performing build step for 'halide'
[1/3264] Building CXX object test/generator/CMakeFiles/cxx_mangling_externs.dir/cxx_mangling_externs.cpp.o
[2/3264] Linking CXX static library test/generator/libcxx_mangling_externs.a
[3/3264] Building CXX object test/CMakeFiles/Halide_expect_abort.dir/common/expect_abort.cpp.o
[4/3264] Building CXX object tools/CMakeFiles/binary2cpp.dir/binary2cpp.cpp.o
[5/3264] Linking CXX executable tools/binary2cpp
[6/3264] Generating _initmod_inlined_c.cpp
[7/3264] Generating _initmod_HalidePyTorchCudaHelpers_h.cpp
[8/3264] Generating _initmod_HalidePyTorchHelpers_h.cpp
[9/3264] Generating _initmod_HalideRuntimeCuda_h.cpp
[10/3264] Generating _initmod_HalideRuntimeD3D12Compute_h.cpp
[11/3264] Generating _initmod_HalideRuntimeHexagonDma_h.cpp
[12/3264] Generating _initmod_HalideRuntimeHexagonHost_h.cpp
[13/3264] Generating _initmod_HalideRuntimeMetal_h.cpp
[14/3264] Generating _initmod_HalideRuntimeOpenGLCompute_h.cpp
[15/3264] Generating initmod.arm_no_neon.bc
[16/3264] Generating _initmod_HalideRuntimeOpenCL_h.cpp
[17/3264] Generating _initmod_HalideRuntimeQurt_h.cpp
[18/3264] Generating initmod.arm.bc
[19/3264] Generating initmod.mips.bc
[20/3264] Generating initmod.powerpc.bc
[21/3264] Generating initmod.aarch64.bc
[22/3264] Generating initmod.hvx_128.bc
FAILED: src/runtime/initmod.hvx_128.bc /root/mmperf/build/halide/src/runtime/initmod.hvx_128.bc
cd /root/mmperf/build/halide/src/runtime && /usr/local/clang/bin/llvm-as /root/mmperf/external/Halide/src/runtime/hvx_128.ll -o initmod.hvx_128.bc
/usr/local/clang/bin/llvm-as: assembly parsed, but does not verify as correct!
Operand for indirect constraint must have elementtype attribute
call void asm sideeffect "vmem($0 + #0):scatter_release\0A; v1 = vmem($0 + #0)\0A", "=m,m,~{v1}"(i8 %ptr, i8 %ptr)
[23/3264] Generating initmod.posix_math.bc
[24/3264] Generating initmod.wasm_math.bc
[25/3264] Generating initmod.x86_avx.bc
[26/3264] Generating initmod.x86_avx2.bc
[27/3264] Generating initmod.ptx_dev.bc
[28/3264] Generating initmod.win32_math.bc
[29/3264] Generating initmod.x86.bc
cd /root/mmperf/build/halide/src/runtime && /usr/local/clang/bin/llvm-as /root/mmperf/external/Halide/src/runtime/x86.ll -o initmod.x86.bc
/usr/local/clang/bin/llvm-as: assembly parsed, but does not verify as correct!
Operand for indirect constraint must have elementtype attribute
call void asm sideeffect inteldialect "xchg ebx, esi\0A\09mov eax, dword ptr $$0 $0\0A\09mov ecx, dword ptr $$4 $0\0A\09cpuid\0A\09mov dword ptr $$0 $0, eax\0A\09mov dword ptr $$4 $0, ebx\0A\09mov dword ptr $$8 $0, ecx\0A\09mov dword ptr $$12 $0, edx\0A\09xchg ebx, esi", "=m,{eax},{ebx},{ecx},{edx},{esi},{dirflag},{fpsr},{flags}"(i32 %info)
Operand for indirect constraint must have elementtype attribute
call void asm sideeffect inteldialect "xchg rbx, rsi\0A\09mov eax, dword ptr $$0 $0\0A\09mov ecx, dword ptr $$4 $0\0A\09cpuid\0A\09mov dword ptr $$0 $0, eax\0A\09mov dword ptr $$4 $0, ebx\0A\09mov dword ptr $$8 $0, ecx\0A\09mov dword ptr $$12 $0, edx\0A\09xchg rbx, rsi", "=m,{eax},{ebx},{ecx},{edx},{esi},{dirflag},{fpsr},{flags}"(i32 %info)

Questions about packing and LICM

Hello,
First of all, thanks for the inspiring repository. Following the steps listed here, I was able to get ~73% of the theoretical peak of the hardware I am testing against.

I have two questions:

  • Array packing. Which transformation does the packing of the arrays A and B? By array packing I mean something similar to BLIS: https://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf
  • LICM. In the docs you mention LICM and loop fusion as possible improvements. I wasn't able to immediately understand from the code what sort of transformations you are using to obtain these.

Thanks in advance for any clarification,
Giuseppe

Consider skipping MKL's JIT codegen overhead for benchmarking

Thanks for open-sourcing this nice project.
Given that it takes tens of milliseconds for Intel MKL to generate just-in-time kernels the first time it is invoked, you may want to explicitly call MKL multiple times in a main function and skip the first call to obtain a more meaningful result.
OpenBLAS and BLIS do not have such an issue since both of them are implemented with hand-tuned assembly.
Please correct me if I misunderstood anything. Thank you.
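A minimal sketch of the suggested warm-up pattern, using numpy as a stand-in BLAS caller (numpy may or may not be backed by MKL on a given system; the point is only to discard the first call):

import time
import numpy as np

a = np.random.rand(1024, 1024).astype(np.float32)
b = np.random.rand(1024, 1024).astype(np.float32)

a @ b  # warm-up call: absorbs any one-time JIT/codegen cost
t0 = time.perf_counter()
a @ b  # timed call reflects steady-state kernel performance
print(time.perf_counter() - t0, "seconds")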

Definition of SGEMM

Hi. I am interested in testing the performance of my machine, so I wonder about the definition of SGEMM mentioned in the README. I know SGEMM is defined as C ← αAB + βC. Are α and β both set to 1 in testing, or do they take other values?

Build instructions for Apple M1

Hi,

Thanks for the great work.
This is very helpful for testing the current hardware-aware libraries that implement the most time-consuming operations in neural networks.

I am interested in Arm devices. Could you share the CMake build instructions for "Results on Apple M1 (NEON, no AMX2)"? Thanks in advance.

FAILED: matmul-compile/CMakeFiles/matmul-compile.dir/matmul-compile.cpp.o

The system environment is as follows:
Rocky Linux release 8.5 (Green Obsidian)
cmake version 3.21.0
clang-11
Python 3.8.12

cmake . -GNinja -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_C_COMPILER=clang -DUSE_MLIR=ON -B build
cmake --build build

Hi, the build fails with the following error:

-- Compiling linalg.matmul ...
-- Compiling matmul_mlir_384x512x2
-- Using default tile sizes
-- Tile sizes = 12 32 16
-- Register Tile sizes =
-- Copy Fill Tile sizes = 4 16
-- Compiling linalg.matmul ...
-- Compiling matmul_mlir_384x384x32
-- Using default tile sizes
-- Tile sizes = 12 32 16
-- Register Tile sizes =
-- Copy Fill Tile sizes = 4 16
-- Compiling linalg.matmul ...
-- Configuring done
-- Generating done
CMake Warning:
Manually-specified variables were not used by the project:

BLASFEO_DIR
BLIS_DIR
HALIDE_DIR
IREE_CUDA
IREE_DYLIB
IREE_VMVX
MKL_DIR
MLIR_BUILD
PREBUILT_LLVM_PATH
RUY_SOURCE
SANDBOX_TILE_FILE
TVM_ENABLE_CUDA
TVM_ENABLE_METAL
TVM_ENABLE_ROCM
TVM_LIB_DIR

-- Build files have been written to: /root/mmperf/build/matmul
[37/40] Performing build step for 'matmul'
[1/185] Generating tiling_for_small.bin
[2/185] Generating tiling_for_large.bin
[3/185] Generating fbs/compile_options_generated.h
[4/185] Building CXX object matmul-compile/CMakeFiles/matmul-compile.dir/matmul-compile.cpp.o
FAILED: matmul-compile/CMakeFiles/matmul-compile.dir/matmul-compile.cpp.o
/usr/local/clang-11/bin/clang++ -DGTEST_HAS_RTTI=0 -I/root/mmperf/build/matmul/../flatbuffers-install/include -I/root/mmperf/build/matmul/matmul-compile/fbs -I/usr/local/clang-11/include -I/root/mmperf/build/mlir-install/include -O2 -g -DNDEBUG -fno-exceptions -fno-rtti -fno-exceptions -fno-rtti -std=gnu++17 -MD -MT matmul-compile/CMakeFiles/matmul-compile.dir/matmul-compile.cpp.o -MF matmul-compile/CMakeFiles/matmul-compile.dir/matmul-compile.cpp.o.d -o matmul-compile/CMakeFiles/matmul-compile.dir/matmul-compile.cpp.o -c /root/mmperf/matmul/matmul-compile/matmul-compile.cpp
In file included from /root/mmperf/matmul/matmul-compile/matmul-compile.cpp:12:
In file included from /root/mmperf/build/mlir-install/include/mlir/Conversion/Passes.h:18:
In file included from /root/mmperf/build/mlir-install/include/mlir/Conversion/ComplexToLLVM/ComplexToLLVM.h:11:
/root/mmperf/build/mlir-install/include/mlir/Conversion/LLVMCommon/StructBuilder.h:26:7: error: redefinition of 'StructBuilder'
class StructBuilder {
^
/usr/local/clang-11/include/mlir/Conversion/StandardToLLVM/ConvertStandardToLLVM.h:188:7: note: previous definition is here
class StructBuilder {
^
In file included from /root/mmperf/matmul/matmul-compile/matmul-compile.cpp:12:
In file included from /root/mmperf/build/mlir-install/include/mlir/Conversion/Passes.h:18:
/root/mmperf/build/mlir-install/include/mlir/Conversion/ComplexToLLVM/ComplexToLLVM.h:18:7: error: redefinition of 'ComplexStructBuilder'
class ComplexStructBuilder : public StructBuilder {
^
/usr/local/clang-11/include/mlir/Conversion/StandardToLLVM/ConvertStandardToLLVM.h:211:7: note: previous definition is here
class ComplexStructBuilder : public StructBuilder {
^
In file included from /root/mmperf/matmul/matmul-compile/matmul-compile.cpp:12:
In file included from /root/mmperf/build/mlir-install/include/mlir/Conversion/Passes.h:63:
/usr/local/clang-11/include/mlir/Conversion/Passes.h.inc:709:9: warning: extra qualification on member 'registerPass' [-Wextra-qualification]
::mlir::registerPass("convert-avx512-to-llvm", "Convert the operations from the avx512 dialect into the LLVM dialect", -> std::unique_ptr<::mlir::Pass> { return mlir::createConvertAVX512ToLLVMPass(); });
^
/usr/local/clang-11/include/mlir/Conversion/Passes.h.inc:709:9: error: C++ requires a type specifier for all declarations
/usr/local/clang-11/include/mlir/Conversion/Passes.h.inc:709:173: error: no member named 'createConvertAVX512ToLLVMPass' in namespace 'mlir'; did you mean 'createConvertAsyncToLLVMPass'?
::mlir::registerPass("convert-avx512-to-llvm", "Convert the operations from the avx512 dialect into the LLVM dialect", -> std::unique_ptr<::mlir::Pass> { return mlir::createConvertAVX512ToLLVMPass(); });
~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
createConvertAsyncToLLVMPass
/root/mmperf/build/mlir-install/include/mlir/Conversion/AsyncToLLVM/AsyncToLLVM.h:25:42: note: 'createConvertAsyncToLLVMPass' declared here
std::unique_ptr<OperationPass> createConvertAsyncToLLVMPass();
^
In file included from /root/mmperf/matmul/matmul-compile/matmul-compile.cpp:12:
In file included from /root/mmperf/build/mlir-install/include/mlir/Conversion/Passes.h:63:
/usr/local/clang-11/include/mlir/Conversion/Passes.h.inc:713:9: warning: extra qualification on member 'registerPass' [-Wextra-qualification]
::mlir::registerPass("convert-affine-for-to-gpu", "Convert top-level AffineFor Ops to GPU kernels", -> std::unique_ptr<::mlir::Pass> { return mlir::createAffineForToGPUPass(); });
^
/usr/local/clang-11/include/mlir/Conversion/Passes.h.inc:713:9: error: C++ requires a type specifier for all declarations
/usr/local/clang-11/include/mlir/Conversion/Passes.h.inc:717:9: warning: extra qualification on member 'registerPass' [-Wextra-qualification]
::mlir::registerPass("lower-affine", "Lower Affine operations to a combination of Standard and SCF operations", -> std::unique_ptr<::mlir::Pass> { return mlir::createLowerAffinePass(); });
^
/usr/local/clang-11/include/mlir/Conversion/Passes.h.inc:717:9: error: C++ requires a type specifier for all declarations
/usr/local/clang-11/include/mlir/Conversion/Passes.h.inc:717:166: error: no member named 'createLowerAffinePass' in namespace 'mlir'; did you mean 'createLowerToLLVMPass'?
::mlir::registerPass("lower-affine", "Lower Affine operations to a combination of Standard and SCF operations", -> std::unique_ptr<::mlir::Pass> { return mlir::createLowerAffinePass(); });
~~~~~~^~~~~~~~~~~~~~~~~~~~~
createLowerToLLVMPass
/usr/local/clang-11/include/mlir/Conversion/StandardToLLVM/ConvertStandardToLLVMPass.h:75:1: note: 'createLowerToLLVMPass' declared here
createLowerToLLVMPass(const LowerToLLVMOptions &options =
^
In file included from /root/mmperf/matmul/matmul-compile/matmul-compile.cpp:12:
In file included from /root/mmperf/build/mlir-install/include/mlir/Conversion/Passes.h:63:
/usr/local/clang-11/include/mlir/Conversion/Passes.h.inc:721:9: warning: extra qualification on member 'registerPass' [-Wextra-qualification]
::mlir::registerPass("convert-gpu-to-spirv", "Convert GPU dialect to SPIR-V dialect", -> std::unique_ptr<::mlir::Pass> { return mlir::createConvertGPUToSPIRVPass(); });
^
/usr/local/clang-11/include/mlir/Conversion/Passes.h.inc:721:9: error: C++ requires a type specifier for all declarations
/usr/local/clang-11/include/mlir/Conversion/Passes.h.inc:725:9: warning: extra qualification on member 'registerPass' [-Wextra-qualification]
::mlir::registerPass("launch-func-to-gpu-runtime", "Convert all launch_func ops to GPU runtime calls", -> std::unique_ptr<::mlir::Pass> { return mlir::createConvertGpuLaunchFuncToGpuRuntimeCallsPass(); });
^
/usr/local/clang-11/include/mlir/Conversion/Passes.h.inc:725:9: error: C++ requires a type specifier for all declarations
/usr/local/clang-11/include/mlir/Conversion/Passes.h.inc:729:9: warning: extra qualification on member 'registerPass' [-Wextra-qualification]
::mlir::registerPass("convert-gpu-launch-to-vulkan-launch", "Convert gpu.launch_func to vulkanLaunch external call", -> std::unique_ptr<::mlir::Pass> { return mlir::createConvertGpuLaunchFuncToVulkanLaunchFuncPass(); });
^
/usr/local/clang-11/include/mlir/Conversion/Passes.h.inc:729:9: error: C++ requires a type specifier for all declarations
/usr/local/clang-11/include/mlir/Conversion/Passes.h.inc:733:9: warning: extra qualification on member 'registerPass' [-Wextra-qualification]
::mlir::registerPass("convert-gpu-to-nvvm", "Generate NVVM operations for gpu operations", -> std::unique_ptr<::mlir::Pass> { return mlir::createLowerGpuOpsToNVVMOpsPass(); });
^
/usr/local/clang-11/include/mlir/Conversion/Passes.h.inc:733:9: error: C++ requires a type specifier for all declarations
/usr/local/clang-11/include/mlir/Conversion/Passes.h.inc:737:9: warning: extra qualification on member 'registerPass' [-Wextra-qualification]
::mlir::registerPass("convert-gpu-to-rocdl", "Generate ROCDL operations for gpu operations", -> std::unique_ptr<::mlir::Pass> { return mlir::createLowerGpuOpsToROCDLOpsPass(); });
^
/usr/local/clang-11/include/mlir/Conversion/Passes.h.inc:737:9: error: C++ requires a type specifier for all declarations
/usr/local/clang-11/include/mlir/Conversion/Passes.h.inc:741:9: warning: extra qualification on member 'registerPass' [-Wextra-qualification]
::mlir::registerPass("convert-linalg-to-llvm", "Convert the operations from the linalg dialect into the LLVM dialect", -> std::unique_ptr<::mlir::Pass> { return mlir::createConvertLinalgToLLVMPass(); });
^
/usr/local/clang-11/include/mlir/Conversion/Passes.h.inc:741:9: error: C++ requires a type specifier for all declarations
/usr/local/clang-11/include/mlir/Conversion/Passes.h.inc:745:9: warning: extra qualification on member 'registerPass' [-Wextra-qualification]
::mlir::registerPass("convert-linalg-to-spirv", "Convert Linalg ops to SPIR-V ops", -> std::unique_ptr<::mlir::Pass> { return mlir::createLinalgToSPIRVPass(); });
^
/usr/local/clang-11/include/mlir/Conversion/Passes.h.inc:745:9: error: C++ requires a type specifier for all declarations
/usr/local/clang-11/include/mlir/Conversion/Passes.h.inc:749:9: warning: extra qualification on member 'registerPass' [-Wextra-qualification]
::mlir::registerPass("convert-linalg-to-std", "Convert the operations from the linalg dialect into the Standard dialect", -> std::unique_ptr<::mlir::Pass> { return mlir::createConvertLinalgToStandardPass(); });
^
/usr/local/clang-11/include/mlir/Conversion/Passes.h.inc:749:9: error: C++ requires a type specifier for all declarations
/usr/local/clang-11/include/mlir/Conversion/Passes.h.inc:753:9: warning: extra qualification on member 'registerPass' [-Wextra-qualification]
::mlir::registerPass("convert-parallel-loops-to-gpu", "Convert mapped scf.parallel ops to gpu launch operations", -> std::unique_ptr<::mlir::Pass> { return mlir::createParallelLoopToGpuPass(); });
^
/usr/local/clang-11/include/mlir/Conversion/Passes.h.inc:753:9: error: C++ requires a type specifier for all declarations
/usr/local/clang-11/include/mlir/Conversion/Passes.h.inc:757:9: warning: extra qualification on member 'registerPass' [-Wextra-qualification]
::mlir::registerPass("convert-spirv-to-llvm", "Convert SPIR-V dialect to LLVM dialect", -> std::unique_ptr<::mlir::Pass> { return mlir::createConvertSPIRVToLLVMPass(); });
^
/usr/local/clang-11/include/mlir/Conversion/Passes.h.inc:757:9: error: C++ requires a type specifier for all declarations
/usr/local/clang-11/include/mlir/Conversion/Passes.h.inc:761:9: warning: extra qualification on member 'registerPass' [-Wextra-qualification]
::mlir::registerPass("convert-shape-to-scf", "Convert operations from the shape dialect to the SCF dialect", -> std::unique_ptr<::mlir::Pass> { return mlir::createConvertShapeToSCFPass(); });
^
/usr/local/clang-11/include/mlir/Conversion/Passes.h.inc:761:9: error: C++ requires a type specifier for all declarations
/usr/local/clang-11/include/mlir/Conversion/Passes.h.inc:761:163: error: no member named 'createConvertShapeToSCFPass' in namespace 'mlir'; did you mean 'createConvertSCFToCFPass'?
::mlir::registerPass("convert-shape-to-scf", "Convert operations from the shape dialect to the SCF dialect", -> std::unique_ptr<::mlir::Pass> { return mlir::createConvertShapeToSCFPass(); });
~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~
createConvertSCFToCFPass
/root/mmperf/build/mlir-install/include/mlir/Conversion/SCFToControlFlow/SCFToControlFlow.h:24:23: note: 'createConvertSCFToCFPass' declared here
std::unique_ptr createConvertSCFToCFPass();
^
In file included from /root/mmperf/matmul/matmul-compile/matmul-compile.cpp:12:
In file included from /root/mmperf/build/mlir-install/include/mlir/Conversion/Passes.h:63:
/usr/local/clang-11/include/mlir/Conversion/Passes.h.inc:765:9: warning: extra qualification on member 'registerPass' [-Wextra-qualification]
::mlir::registerPass("convert-shape-to-std", "Convert operations from the shape dialect into the standard dialect", -> std::unique_ptr<::mlir::Pass> { return mlir::createConvertShapeToStandardPass(); });
^
fatal error: too many errors emitted, stopping now [-ferror-limit=]
15 warnings and 20 errors generated.
ninja: build stopped: subcommand failed.
FAILED: matmul/src/matmul-stamp/matmul-build /root/mmperf/build/matmul/src/matmul-stamp/matmul-build
cd /root/mmperf/build/matmul && /usr/local/cmake/bin/cmake --build .

How to enable nodai-shark-cuda mmperf?

Hi, your work on mmperf has been very helpful to me. I wonder if mmperf now supports testing nodai-shark or nodai-shark-cuda? If so, can you provide a guide similar to the one for running the TVM auto-scheduler?

In addition, when I build with the flags -DUSE_IREE=ON -DIREE_CUDA=ON, I get this error:

/usr/bin/ld: src/CMakeFiles/matmul_ireecuda_512x1024x1024.dir/device_cuda.c.o: in function `create_sample_device': device_cuda.c:(.text+0x4c): undefined reference to `iree_hal_driver_registry_try_create_by_name'
clang: error: linker command failed with exit code 1 (use -v to see invocation)

MLIR performance is very poor on Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz

Hi all,

I followed the doc to build the repo and tested it on my machine, which has a Skylake Intel CPU. Below is my cpuinfo:

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 96
On-line CPU(s) list: 0-95
Thread(s) per core: 2
Core(s) per socket: 24
Socket(s): 2
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz
Stepping: 4
CPU MHz: 2501.000
CPU max MHz: 2501.0000
CPU min MHz: 1000.0000
BogoMIPS: 4998.90
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 33792K
NUMA node0 CPU(s): 0-95
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ida arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx avx512f rdseed adx smap clflushopt avx512cd xsaveopt xsavec xgetbv1 xsaves cqm_llc

And the following are my test results (attached as an image, not reproduced here).

Is there anything wrong with the configuration? The MLIR performance is far from the numbers reported in the official results dir.

Thanks!

MLIR crashes occasionally for automatic benchmark iterations

debian:~/github/mmperf$ lldb ./build/matmul/matmul_mlir_192x256x256
(lldb) target create "./build/matmul/matmul_mlir_192x256x256"
Current executable set to '/home/foo/github/mmperf/build/matmul/matmul_mlir_192x256x256' (x86_64).
(lldb) r
Process 4339 launched: '/home/foo/github/mmperf/build/matmul/matmul_mlir_192x256x256' (x86_64)
Matrix-format: Row-Major
Benchmarking MLIR 192 x 256 x 256
2021-10-17T05:05:46+00:00
Running /home/anush/github/mmperf/build/matmul/matmul_mlir_192x256x256
Run on (12 X 2200.12 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x6)
L1 Instruction 32 KiB (x6)
L2 Unified 1024 KiB (x6)
L3 Unified 39424 KiB (x1)
Load Average: 0.00, 1.03, 9.84
Process 4339 stopped

  • thread #1, name = 'matmul_mlir_192', stop reason = signal SIGSEGV: invalid address (fault address: 0x7ffff79e2010)
    frame #0: 0x000000000040c064 matmul_mlir_192x256x256`benchmark::internal::LambdaBenchmark<main::$_0>::Run(benchmark::State&) at main.cc:366:20
    363 for (size_t i = 0; i < MDIM; i++) {
    364 for (size_t j = 0; j < NDIM; j++) {
    365 size_t ci = i + j*MDIM;
    -> 366 if (std::abs(C[ci] - C2[ci]) > 0.01f) {
    367 fprintf(stderr, "Incorrect result at index %ld,%ld: C=%0.2f C2=%0.2f\n", i, j, C[ci], C2[ci]);
    368 errors++;
    369 }
    (lldb)

Intermittent Segfaulting

Sometimes, a run will randomly crash. It happens with relatively low frequency, so I haven't been able to identify the exact cause. I think it may happen more often on MLIR programs with certain matmul sizes.

cmake with USE_MLIR_CUDA failed

When I build with the flag -DUSE_MLIR_CUDA=ON, I get this error:

[228/228] Linking CXX executable /home/mmperf/build/matmul/matmul_ireecuda_16x1024x512
[15/24] Performing build step for 'matmul'
[1/24] Linking CXX executable matmul_mlircuda_16x1024x512
FAILED: matmul_mlircuda_16x1024x512
: && /usr/bin/clang++-12 -O2 -g -DNDEBUG mlir-objs/matmul_mlircuda_16x1024x512.o CMakeFiles/matmul_mlircuda_16x1024x512.dir/main.cc.o -o matmul_mlircuda_16x1024x512 -L/home/mmperf/build/matmul/../benchmark-install/lib -Wl,-rpath,/home/mmperf/build/matmul/../benchmark-install/lib:/home/mmperf/build/mlir/lib /usr/local/cuda/lib64/libcudart_static.a -ldl /usr/lib/x86_64-linux-gnu/librt.so /home/mmperf/build/mlir/lib/libmlir_cuda_runtime.so /home/mmperf/build/mlir/lib/libmlir_runner_utils.so -lpthread -lbenchmark -lpthread && :
/usr/bin/ld: CMakeFiles/matmul_mlircuda_16x1024x512.dir/main.cc.o: in function `BenchmarkFunction(benchmark::State&)': /home/mmperf/matmul/main.cc:413: undefined reference to `matmul'
clang: error: linker command failed with exit code 1 (use -v to see invocation)
