Comments (2)

corsix commented on September 23, 2024

AMX vector performance on M1/M2 is disappointing, especially in comparison to M1/M2 performance-core NEON. An M1 performance core using NEON has a theoretical maximum f32 throughput of 102.4 GFLOPS (4 FMA instructions dispatched per cycle, each FMA 4 lanes wide, each FMA counting as two ops, at 3.2 GHz). Perfect four-way multithreading would get you up to 409.6 GFLOPS, and perfect eight-way would get you to 819.2 GFLOPS. M2 should be slightly higher.
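The arithmetic behind those NEON figures can be reproduced with a short sketch. The function name and parameter names are illustrative; the values are the ones quoted above, and the thread scaling assumes perfect scaling as stated.

```python
def neon_peak_gflops(threads=1, fma_per_cycle=4, lanes=4, ops_per_fma=2, clock_ghz=3.2):
    """Theoretical f32 peak: threads x FMAs/cycle x lanes x ops/FMA x GHz."""
    return threads * fma_per_cycle * lanes * ops_per_fma * clock_ghz

for threads in (1, 4, 8):
    # Prints 102.4, 409.6 and 819.2 GFLOPS respectively.
    print(f"{threads} thread(s): {neon_peak_gflops(threads):.1f} GFLOPS")
```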

For comparison, AMX f32 vector FMAs on an M1 Max look like:

| Z accumulators per thread | 1 thread (GFLOPS) | 2 threads (GFLOPS) | 3 threads (GFLOPS) | 4 threads (GFLOPS) | 5 threads (GFLOPS) | 6 threads (GFLOPS) |
|---|---|---|---|---|---|---|
| 1 (64 bytes) | 23.2 | 46.4 | 53.3 | 81.1 | 89.0 | 104.1 |
| 2 (128 bytes) | 46.4 | 92.7 | 106.5 | 141.3 | 176.8 | 206.5 |
| 3 (192 bytes) | 69.6 | 139.1 | 160.1 | 213.3 | 250.6 | 244.9 |
| 4 (256 bytes) | 92.7 | 185.4 | 214.0 | 277.6 | 325.5 | 298.0 |
| 5 (320 bytes) | 115.8 | 231.7 | 241.0 | 321.3 | 355.1 | 347.7 |
| 6 (384 bytes) | 139.0 | 277.7 | 271.2 | 361.7 | 387.1 | 386.2 |
| 7 (448 bytes) | 162.2 | 324.2 | 299.9 | 383.4 | 394.0 | 400.9 |
| 8 (512 bytes) | 185.5 | 369.9 | 335.8 | 392.9 | 405.8 | 416.0 |
| 9 (576 bytes) | 178.0 | 353.4 | 325.5 | 396.9 | 398.0 | 409.2 |
| 10 (640 bytes) | 183.1 | 360.6 | 335.3 | 402.4 | 401.2 | 417.2 |
| 11 (704 bytes) | 183.1 | 363.0 | 334.2 | 403.2 | 400.6 | 415.8 |
| 12 (768 bytes) | 185.2 | 370.6 | 335.5 | 378.5 | 397.7 | 419.0 |
| 13 (832 bytes) | 185.2 | 369.4 | 336.0 | 404.2 | 400.9 | 414.1 |
| 14 (896 bytes) | 185.5 | 370.5 | 336.4 | 406.0 | 402.9 | 416.4 |
| 15 (960 bytes) | 185.5 | 370.0 | 336.8 | 405.7 | 402.6 | 409.6 |
| 16 (1024 bytes) | 185.4 | 370.4 | 336.3 | 406.0 | 399.7 | 405.3 |

A single thread can exceed 102.4 GFLOPS, peaking at 185.5 GFLOPS, but only when it uses at least 512 bytes of Z registers (eight accumulators). Four threads top out at 406.0 GFLOPS, which is less than the theoretical 409.6 GFLOPS achievable with four threads of NEON. Going beyond four threads barely helps AMX (six threads peak at 419.0 GFLOPS), but will help NEON.
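The single-thread M1 Max column is consistent with a simple saturating model: throughput grows linearly with the number of Z accumulators and flattens at eight. The sketch below checks that reading against the measured column. The knee at eight is read off the table rather than derived from any hardware documentation, and the names are illustrative.

```python
# Single-thread M1 Max column from the table above (GFLOPS), rows 1..16.
measured = [23.2, 46.4, 69.6, 92.7, 115.8, 139.0, 162.2, 185.5,
            178.0, 183.1, 183.1, 185.2, 185.2, 185.5, 185.5, 185.4]

PER_ACCUMULATOR = measured[0]  # GFLOPS obtained from one 64-byte Z accumulator
SATURATION = 8                 # assumed knee, read off the table

def modelled_gflops(accumulators):
    """Linear in the accumulator count, capped at the assumed saturation point."""
    return PER_ACCUMULATOR * min(accumulators, SATURATION)

# Largest relative gap between the model and the measurements.
worst = max(abs(modelled_gflops(n) - m) / m
            for n, m in enumerate(measured, start=1))
print(f"worst relative error: {worst:.1%}")
```

The model stays within about 5% of every measured row, with the largest gap at nine accumulators.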

The same thing on M2 looks like:

| Z accumulators per thread | 1 thread (GFLOPS) | 2 threads (GFLOPS) | 3 threads (GFLOPS) | 4 threads (GFLOPS) | 5 threads (GFLOPS) | 6 threads (GFLOPS) |
|---|---|---|---|---|---|---|
| 1 (64 bytes) | 25.6 | 41.2 | 61.7 | 78.7 | 98.4 | 117.7 |
| 2 (128 bytes) | 51.2 | 82.3 | 123.5 | 157.7 | 174.1 | 170.4 |
| 3 (192 bytes) | 76.7 | 123.4 | 179.5 | 191.0 | 216.9 | 215.1 |
| 4 (256 bytes) | 102.2 | 164.6 | 237.1 | 231.8 | 258.3 | 263.3 |
| 5 (320 bytes) | 127.8 | 205.7 | 279.1 | 264.8 | 285.7 | 289.5 |
| 6 (384 bytes) | 153.5 | 226.0 | 299.5 | 286.6 | 300.5 | 308.3 |
| 7 (448 bytes) | 179.0 | 246.6 | 300.6 | 291.4 | 302.4 | 306.2 |
| 8 (512 bytes) | 204.4 | 269.7 | 301.6 | 299.4 | 309.2 | 310.4 |
| 9 (576 bytes) | 204.6 | 270.5 | 302.9 | 297.9 | 304.7 | 307.3 |
| 10 (640 bytes) | 204.7 | 270.3 | 303.0 | 300.2 | 306.9 | 308.9 |
| 11 (704 bytes) | 204.6 | 276.5 | 308.4 | 302.1 | 305.8 | 307.5 |
| 12 (768 bytes) | 204.5 | 270.5 | 302.9 | 299.9 | 304.2 | 307.5 |
| 13 (832 bytes) | 204.6 | 275.3 | 307.9 | 299.8 | 306.4 | 307.4 |
| 14 (896 bytes) | 204.2 | 270.5 | 302.9 | 299.6 | 306.9 | 310.6 |
| 15 (960 bytes) | 204.5 | 275.7 | 308.5 | 299.5 | 305.5 | 307.4 |
| 16 (1024 bytes) | 204.6 | 270.5 | 302.8 | 299.8 | 306.9 | 307.4 |

The single-threaded numbers are about 10% higher than on M1 Max, which is consistent with clock speeds being about 10% higher. M1 Max gets approximately double the performance from two threads, and then only marginal improvement from subsequent threads. M2 gets only small improvements from additional threads, which is consistent with it having only one P-core cluster (versus two on M1 Max).
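Those ratios can be checked directly from the 16-accumulator rows of the two tables above. This is only a sketch over the quoted measurements; the variable and function names are illustrative.

```python
# 16-accumulator rows (GFLOPS) for 1..6 threads, copied from the two tables above.
m1_max = [185.4, 370.4, 336.3, 406.0, 399.7, 405.3]
m2 = [204.6, 270.5, 302.8, 299.8, 306.9, 307.4]

def speedups(row):
    """Throughput of each thread count relative to the single-thread figure."""
    return [round(value / row[0], 2) for value in row]

print("M2 / M1 Max, single thread:", round(m2[0] / m1_max[0], 2))
print("M1 Max scaling by thread count:", speedups(m1_max))
print("M2 scaling by thread count:", speedups(m2))
```

On these numbers, two threads roughly double throughput on M1 Max, while on M2 the second thread adds only about a third.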

The four-at-a-time mode on M2 doesn't look much better:

| Z accumulators per thread | 1 thread (GFLOPS) | 2 threads (GFLOPS) | 3 threads (GFLOPS) | 4 threads (GFLOPS) | 5 threads (GFLOPS) | 6 threads (GFLOPS) |
|---|---|---|---|---|---|---|
| 1 (256 bytes) | 102.3 | 164.9 | 247.0 | 187.5 | 250.9 | 305.7 |
| 2 (512 bytes) | 204.5 | 326.9 | 351.4 | 208.2 | 326.6 | 323.5 |
| 3 (768 bytes) | 204.6 | 326.8 | 351.4 | 211.6 | 320.5 | 324.9 |
| 4 (1024 bytes) | 204.6 | 329.3 | 351.3 | 205.7 | 326.6 | 325.6 |
| 5 (1280 bytes) | 204.6 | 329.1 | 351.2 | 205.4 | 326.5 | 322.4 |
| 6 (1536 bytes) | 204.6 | 328.9 | 351.4 | 208.7 | 318.2 | 322.7 |
| 7 (1792 bytes) | 204.6 | 329.2 | 351.4 | 205.9 | 326.4 | 324.0 |
| 8 (2048 bytes) | 204.5 | 329.1 | 351.4 | 208.1 | 326.5 | 321.3 |
| 9 (2304 bytes) | 204.6 | 329.2 | 351.4 | 207.3 | 323.6 | 326.9 |
| 10 (2560 bytes) | 204.6 | 329.2 | 351.4 | 206.2 | 320.8 | 326.7 |
| 11 (2816 bytes) | 204.5 | 326.9 | 346.5 | 208.2 | 326.3 | 321.5 |
| 12 (3072 bytes) | 204.6 | 329.1 | 351.1 | 205.9 | 326.5 | 326.4 |
| 13 (3328 bytes) | 204.5 | 329.3 | 351.4 | 206.6 | 323.5 | 323.4 |
| 14 (3584 bytes) | 204.6 | 329.2 | 351.4 | 205.5 | 326.4 | 323.2 |
| 15 (3840 bytes) | 204.6 | 329.2 | 351.0 | 205.8 | 326.5 | 322.5 |
| 16 (4096 bytes) | 204.6 | 327.1 | 351.2 | 208.1 | 324.0 | 321.7 |

Furthermore, in order to hit these numbers, two things need to be noted:

  1. For vector operations, the Z accumulators you choose want to be maximally spread out over the permissible range, rather than contiguous (which is why the two-at-a-time / four-at-a-time on M2 use a Z step of 32 / 16).
  2. There's limited-to-no register renaming going on.
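As an illustration of what "maximally spread out" in point (1) could mean, the sketch below spaces a given number of accumulators evenly across Z. It assumes Z has 64 rows of 64 bytes (4096 bytes in total, matching the largest figure in the table above); `spread_rows` is a hypothetical helper and not part of any AMX API.

```python
Z_ROWS = 64  # assumption: 64 Z rows of 64 bytes each (4096 bytes in total)

def spread_rows(count, base=0):
    """Row indices for `count` accumulators spaced evenly across Z.

    Hypothetical helper: `count` must divide the number of Z rows.
    """
    if count <= 0 or Z_ROWS % count:
        raise ValueError("count must be a positive divisor of the Z row count")
    step = Z_ROWS // count
    return [(base + i * step) % Z_ROWS for i in range(count)]

print(spread_rows(2))  # [0, 32]
print(spread_rows(4))  # [0, 16, 32, 48]
```

With two and four accumulators the resulting spacing is 32 and 16 rows, the same Z steps quoted in point (1).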

Because of point (2), you're in the "1 (256 bytes) per thread" row of the above table. Changing from "ceil(H_width / 32) iterations" to "ceil(H_width / 64) iterations" would help here, as each iteration could then address two different Z ranges, getting you to "2 (512 bytes) per thread".
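The effect of that change on the iteration count is a ceiling division. In the sketch below, `H_width` comes from the code under discussion, which is not shown here, so the value 200 is purely hypothetical, as are the function and parameter names.

```python
from math import ceil

def iterations(h_width, width_per_iteration):
    """Loop iterations needed to cover h_width at the given width per iteration."""
    return ceil(h_width / width_per_iteration)

h_width = 200  # hypothetical value, chosen only for illustration

print(iterations(h_width, 32))  # 7 iterations, each touching one Z range
print(iterations(h_width, 64))  # 4 iterations, each able to touch two Z ranges
```

Roughly halving the iteration count lets each iteration issue work against two independent Z ranges, which is what moves it from the first row of the table to the second.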

from amx.

sfjohnson commented on September 23, 2024

Ok interesting, thanks for the data! I think I've got a better understanding of what's a good fit for AMX acceleration and how to get the utilisation as high as possible.

