
Matrix Multiplication Optimization

Project done as part of LLNL HPC Cluster Engineer Academy

This project analyzed matrix multiplication algorithms in an HPC environment. Performance analysis used Intel model-specific registers (MSRs), accessed through the LLNL msr-safe kernel module and interface. Intel SSE 4.1 intrinsics were used to optimize the algorithms.

Environment

The experiment was developed for and executed on an Intel Xeon E5-2670 @ 2.6 GHz. The node had two sockets, each with an 8-core CPU running 2 threads per core.

Generating Test Matrices

Standard test matrices are generated and written to files by an R script; they serve as inputs and as known-correct answers for checking the correctness of the algorithms. To generate test matrices, type:

Rscript generate_test_cases.R 256

where 256 is the size of the matrix.

Naive Matrix Multiply

To make matrix multiplication without the Intel Intrinsics libraries, type:

make naive

To run matrix multiplication without Intel Intrinsics, type:

make runnaive N=256 THREADS=8

where N is the size of the matrix NxN and THREADS is the number of threads to run with
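The actual kernel lives in the repository; as a rough illustration only, a naive multiply is presumably an i-j-k triple loop along these lines (the function name is hypothetical, and OpenMP threading is assumed from the THREADS variable):

```c
/* Naive O(n^3) multiply: C = A * B for row-major n x n double matrices.
 * A build with OpenMP would presumably parallelize the outer loop. */
void matmul_naive(int n, const double *A, const double *B, double *C) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}
```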

To make the naive algorithm with the Intel Intrinsics libraries, type:

make naive_intrinsics

To run Intel Intrinsics naive algorithm, type:

make runnaive_intrinsics N=256 THREADS=8

where N is the size of the matrix NxN and THREADS is the number of threads to run with
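The project's intrinsics kernels are generated by generate_code.py and are not reproduced here. Purely as an illustration of the approach, an inner loop vectorized with packed-double intrinsics might look like the sketch below (this uses only the SSE2 subset of the SSE 4.1 ops the project used; the function name is hypothetical):

```c
#include <emmintrin.h> /* SSE2 packed-double intrinsics */

/* Sketch: C += A * B with the j-loop processing two doubles per step.
 * Assumes n is even, matrices are row-major, and C starts zeroed. */
void matmul_sse(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            __m128d a = _mm_set1_pd(A[i * n + k]); /* broadcast A[i][k] */
            for (int j = 0; j < n; j += 2) {
                __m128d b = _mm_loadu_pd(&B[k * n + j]);
                __m128d c = _mm_loadu_pd(&C[i * n + j]);
                _mm_storeu_pd(&C[i * n + j],
                              _mm_add_pd(c, _mm_mul_pd(a, b)));
            }
        }
}
```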

To make the naive algorithm with single-precision floating-point numbers instead of double-precision, type:

make naive_float

To run the naive algorithm with single-precision floating-point numbers, type:

make runnaive_float N=256 THREADS=8

where N is the size of the matrix NxN and THREADS is the number of threads to run with. This may result in correctness errors due to the reduced precision.
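The precision caveat is easy to demonstrate in isolation: a single-precision accumulator loses low-order bits once it grows large, which is why results can disagree with the double-precision reference matrices. A standalone illustration (not project code):

```c
#include <math.h>

/* Accumulate `reps` copies of 0.1 in the given precision and return
 * the absolute error versus the exact value reps / 10. The float
 * accumulator drifts badly once the sum is large relative to 0.1. */
double float_sum_error(int reps) {
    float s = 0.0f;
    for (int i = 0; i < reps; i++) s += 0.1f;
    return fabs((double)s - reps / 10.0);
}

double double_sum_error(int reps) {
    double s = 0.0;
    for (int i = 0; i < reps; i++) s += 0.1;
    return fabs(s - reps / 10.0);
}
```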

Strassen Matrix Multiply

To make Strassen multiplication, type:

make strassen

To run the Strassen multiplication algorithm, type:

make runstrassen N=256

where N is the size of the matrix NxN.

Naive Analysis

On the initial unoptimized matrix multiplication, as the matrix size increases, the instructions per cycle decrease dramatically.

Image of Instructions Graph

At the same time, as the matrix size increases, the cache misses per instruction increase.

Image of Cache Misses Graph

Therefore, it seems that as the matrix size increases, the function becomes increasingly memory bound.

To address this, the code reads and saves 6 values ahead, pulling them into cache so that the inner loop makes fewer, faster reads to RAM. This dramatically sped up execution.

Image of Naive Better Graph
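The exact change is in the repository; the idea is essentially register blocking: hold several entries of A in locals so the inner loop touches memory less. A sketch holding 6 values, mirroring the read-ahead described above (the function name is hypothetical):

```c
/* C += A * B with the k-loop unrolled by 6: six entries of A's row
 * stay in registers while the j-loop streams over six rows of B.
 * Assumes row-major matrices, n a multiple of 6, and C zeroed. */
void matmul_block6(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k += 6) {
            const double a0 = A[i*n+k],   a1 = A[i*n+k+1], a2 = A[i*n+k+2];
            const double a3 = A[i*n+k+3], a4 = A[i*n+k+4], a5 = A[i*n+k+5];
            for (int j = 0; j < n; j++)
                C[i*n+j] += a0 * B[(k+0)*n+j] + a1 * B[(k+1)*n+j]
                          + a2 * B[(k+2)*n+j] + a3 * B[(k+3)*n+j]
                          + a4 * B[(k+4)*n+j] + a5 * B[(k+5)*n+j];
        }
}
```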

Additionally, I experimented with Intel SSE intrinsics. Given the size of the cache, it seemed that more than just 6 values could be saved for fast access. To sidestep the question of whether or not the compiler would unroll the loops, the generate_code.py file was used to generate the multiply functions.

Image of Intrinsics Graph

Focusing on the beginning of the graph, we can see that the most consistently fast number of saved entries is 8, which is significantly smaller than the size of the cache; this merits further investigation.

Image of Focused Intrinsics Graph

Strassen Analysis

Strassen's matrix multiplication algorithm is a recursive algorithm that trades fewer multiplications for more addition and subtraction operations.
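For reference, the recursion can be sketched as follows: split each matrix into quadrants, form Strassen's seven products M1–M7, and combine them into the four quadrants of C. This is a minimal sketch assuming n is a power of two (base case N=1, as in the initial version described below), not the repository's implementation:

```c
#include <stdlib.h>

static void madd(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n * n; i++) C[i] = A[i] + B[i];
}
static void msub(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n * n; i++) C[i] = A[i] - B[i];
}

/* Strassen multiply, C = A * B, n a power of two, base case n == 1. */
void strassen(int n, const double *A, const double *B, double *C) {
    if (n == 1) { C[0] = A[0] * B[0]; return; }
    int h = n / 2, sz = h * h;
    /* scratch: 8 quadrants + 7 products + 2 temporaries */
    double *buf = malloc(sizeof(double) * sz * 17);
    double *a11 = buf,      *a12 = a11 + sz, *a21 = a12 + sz, *a22 = a21 + sz;
    double *b11 = a22 + sz, *b12 = b11 + sz, *b21 = b12 + sz, *b22 = b21 + sz;
    double *m = b22 + sz, *t1 = m + 7 * sz, *t2 = t1 + sz;
    for (int i = 0; i < h; i++)
        for (int j = 0; j < h; j++) {
            a11[i*h+j] = A[i*n+j];     a12[i*h+j] = A[i*n+j+h];
            a21[i*h+j] = A[(i+h)*n+j]; a22[i*h+j] = A[(i+h)*n+j+h];
            b11[i*h+j] = B[i*n+j];     b12[i*h+j] = B[i*n+j+h];
            b21[i*h+j] = B[(i+h)*n+j]; b22[i*h+j] = B[(i+h)*n+j+h];
        }
    madd(h,a11,a22,t1); madd(h,b11,b22,t2); strassen(h,t1,t2,&m[0*sz]); /* M1 */
    madd(h,a21,a22,t1);                     strassen(h,t1,b11,&m[1*sz]); /* M2 */
    msub(h,b12,b22,t2);                     strassen(h,a11,t2,&m[2*sz]); /* M3 */
    msub(h,b21,b11,t2);                     strassen(h,a22,t2,&m[3*sz]); /* M4 */
    madd(h,a11,a12,t1);                     strassen(h,t1,b22,&m[4*sz]); /* M5 */
    msub(h,a21,a11,t1); madd(h,b11,b12,t2); strassen(h,t1,t2,&m[5*sz]); /* M6 */
    msub(h,a12,a22,t1); madd(h,b21,b22,t2); strassen(h,t1,t2,&m[6*sz]); /* M7 */
    for (int i = 0; i < h; i++)
        for (int j = 0; j < h; j++) {
            int k = i * h + j;
            C[i*n+j]       = m[0*sz+k] + m[3*sz+k] - m[4*sz+k] + m[6*sz+k];
            C[i*n+j+h]     = m[2*sz+k] + m[4*sz+k];
            C[(i+h)*n+j]   = m[1*sz+k] + m[3*sz+k];
            C[(i+h)*n+j+h] = m[0*sz+k] - m[1*sz+k] + m[2*sz+k] + m[5*sz+k];
        }
    free(buf);
}
```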

Initially, I had the base case at N=1. The Strassen implementation lagged behind the naive algorithm until the matrices grew larger than N=4096, at which point the unoptimized single-threaded Strassen was faster than the single-threaded naive version.

Image of initial comparison

With both axes on a log base-2 scale:

Image of initial comparison log

To optimize, I moved the base case out to N=2 and used Intel Intrinsics both for the base-case multiplication and for building the intermediate matrices, which dramatically lowered the execution time.

Image of strassen optimization
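A 2x2 base case maps nicely onto packed-double registers: each row of B fits in one register, and each row of C is a linear combination of B's rows. A hedged sketch of such a base case (again SSE2-only and hypothetical, not the project's generated code):

```c
#include <emmintrin.h>

/* 2x2 base-case multiply, C = A * B, row-major doubles.
 * Each row of B is one __m128d; row i of C is
 * A[i][0] * B_row0 + A[i][1] * B_row1. */
void mul2x2(const double *A, const double *B, double *C) {
    __m128d b0 = _mm_loadu_pd(B);     /* B row 0 */
    __m128d b1 = _mm_loadu_pd(B + 2); /* B row 1 */
    for (int i = 0; i < 2; i++) {
        __m128d r = _mm_add_pd(
            _mm_mul_pd(_mm_set1_pd(A[2 * i]),     b0),
            _mm_mul_pd(_mm_set1_pd(A[2 * i + 1]), b1));
        _mm_storeu_pd(C + 2 * i, r);
    }
}
```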

With the time axis on a natural-log scale:

Image of strassen log
