AMX vector performance on M1/M2 is disappointing, especially compared to M1/M2 performance-core NEON. An M1 performance core using NEON has a theoretical maximum f32 throughput of 102.4 GFLOPS (dispatch 4 FMA instructions per cycle, each FMA 4 lanes wide, each lane counting as two ops, at 3.2 GHz). Perfect four-way multithreading would get you up to 409.6 GFLOPS, and perfect eight-way to 819.2 GFLOPS. M2 should be slightly higher.
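The NEON peak figures above follow directly from the dispatch width, vector width, and clock; a minimal back-of-envelope check (the pipe count and clock are as stated above):

```python
# Theoretical f32 NEON peak for an M1 performance core.
FMA_PIPES = 4      # FMA instructions dispatched per cycle
F32_LANES = 4      # 128-bit NEON vector holds 4 x f32
OPS_PER_FMA = 2    # each lane's fused multiply-add counts as 2 FLOPs
CLOCK_GHZ = 3.2

one_core = FMA_PIPES * F32_LANES * OPS_PER_FMA * CLOCK_GHZ  # GFLOPS
print(one_core)      # 102.4
print(one_core * 4)  # 409.6 with perfect four-way multithreading
print(one_core * 8)  # 819.2 with perfect eight-way multithreading
```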
For comparison, AMX f32 vector FMAs on an M1 Max look like:
Z Accumulators | 1 Thread | 2 Threads | 3 Threads | 4 Threads | 5 Threads | 6 Threads |
---|---|---|---|---|---|---|
1 (64 bytes) per thread | 23.2 GFLOPS | 46.4 GFLOPS | 53.3 GFLOPS | 81.1 GFLOPS | 89.0 GFLOPS | 104.1 GFLOPS |
2 (128 bytes) per thread | 46.4 GFLOPS | 92.7 GFLOPS | 106.5 GFLOPS | 141.3 GFLOPS | 176.8 GFLOPS | 206.5 GFLOPS |
3 (192 bytes) per thread | 69.6 GFLOPS | 139.1 GFLOPS | 160.1 GFLOPS | 213.3 GFLOPS | 250.6 GFLOPS | 244.9 GFLOPS |
4 (256 bytes) per thread | 92.7 GFLOPS | 185.4 GFLOPS | 214.0 GFLOPS | 277.6 GFLOPS | 325.5 GFLOPS | 298.0 GFLOPS |
5 (320 bytes) per thread | 115.8 GFLOPS | 231.7 GFLOPS | 241.0 GFLOPS | 321.3 GFLOPS | 355.1 GFLOPS | 347.7 GFLOPS |
6 (384 bytes) per thread | 139.0 GFLOPS | 277.7 GFLOPS | 271.2 GFLOPS | 361.7 GFLOPS | 387.1 GFLOPS | 386.2 GFLOPS |
7 (448 bytes) per thread | 162.2 GFLOPS | 324.2 GFLOPS | 299.9 GFLOPS | 383.4 GFLOPS | 394.0 GFLOPS | 400.9 GFLOPS |
8 (512 bytes) per thread | 185.5 GFLOPS | 369.9 GFLOPS | 335.8 GFLOPS | 392.9 GFLOPS | 405.8 GFLOPS | 416.0 GFLOPS |
9 (576 bytes) per thread | 178.0 GFLOPS | 353.4 GFLOPS | 325.5 GFLOPS | 396.9 GFLOPS | 398.0 GFLOPS | 409.2 GFLOPS |
10 (640 bytes) per thread | 183.1 GFLOPS | 360.6 GFLOPS | 335.3 GFLOPS | 402.4 GFLOPS | 401.2 GFLOPS | 417.2 GFLOPS |
11 (704 bytes) per thread | 183.1 GFLOPS | 363.0 GFLOPS | 334.2 GFLOPS | 403.2 GFLOPS | 400.6 GFLOPS | 415.8 GFLOPS |
12 (768 bytes) per thread | 185.2 GFLOPS | 370.6 GFLOPS | 335.5 GFLOPS | 378.5 GFLOPS | 397.7 GFLOPS | 419.0 GFLOPS |
13 (832 bytes) per thread | 185.2 GFLOPS | 369.4 GFLOPS | 336.0 GFLOPS | 404.2 GFLOPS | 400.9 GFLOPS | 414.1 GFLOPS |
14 (896 bytes) per thread | 185.5 GFLOPS | 370.5 GFLOPS | 336.4 GFLOPS | 406.0 GFLOPS | 402.9 GFLOPS | 416.4 GFLOPS |
15 (960 bytes) per thread | 185.5 GFLOPS | 370.0 GFLOPS | 336.8 GFLOPS | 405.7 GFLOPS | 402.6 GFLOPS | 409.6 GFLOPS |
16 (1024 bytes) per thread | 185.4 GFLOPS | 370.4 GFLOPS | 336.3 GFLOPS | 406.0 GFLOPS | 399.7 GFLOPS | 405.3 GFLOPS |
A single thread can exceed 102.4 GFLOPS, potentially hitting 185.5 GFLOPS, but you need to be using 512 bytes of Z registers to get there. Four threads only get to 406.0 GFLOPS, which is just below the theoretical 409.6 GFLOPS achievable with four NEON threads. More than four threads don't help AMX, but would help NEON.
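One way to read the single-thread column: throughput scales roughly linearly with the number of Z accumulators up to 8, then flattens, which is the classic signature of a dependency chain limited by FMA latency. A rough sketch of that inference (the clock value is an assumption, and the derived latency is inferred from the table, not documented anywhere):

```python
# Each 64-byte f32 vector FMA touches 16 lanes at 2 FLOPs each.
FLOPS_PER_FMA = 16 * 2
CLOCK_GHZ = 3.228        # approximate M1 Max P-core clock (assumption)

single_acc = 23.2        # GFLOPS with 1 Z accumulator (from the table)
saturated = 185.5        # GFLOPS with 8+ Z accumulators (from the table)

# With one accumulator, every FMA depends on the previous one, so the
# measured rate implies the FMA latency in cycles:
latency = CLOCK_GHZ * FLOPS_PER_FMA / single_acc

# Independent accumulators hide that latency; the ratio suggests how
# many independent chains are needed to saturate the unit:
chains_needed = saturated / single_acc

print(round(latency, 1), round(chains_needed, 1))  # → 4.5 8.0
```

This is consistent with the table saturating at 8 Z accumulators per thread, and with why using fewer, contiguous accumulators leaves throughput on the table.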
The same thing on M2 looks like:
Z Accumulators | 1 Thread | 2 Threads | 3 Threads | 4 Threads | 5 Threads | 6 Threads |
---|---|---|---|---|---|---|
1 (64 bytes) per thread | 25.6 GFLOPS | 41.2 GFLOPS | 61.7 GFLOPS | 78.7 GFLOPS | 98.4 GFLOPS | 117.7 GFLOPS |
2 (128 bytes) per thread | 51.2 GFLOPS | 82.3 GFLOPS | 123.5 GFLOPS | 157.7 GFLOPS | 174.1 GFLOPS | 170.4 GFLOPS |
3 (192 bytes) per thread | 76.7 GFLOPS | 123.4 GFLOPS | 179.5 GFLOPS | 191.0 GFLOPS | 216.9 GFLOPS | 215.1 GFLOPS |
4 (256 bytes) per thread | 102.2 GFLOPS | 164.6 GFLOPS | 237.1 GFLOPS | 231.8 GFLOPS | 258.3 GFLOPS | 263.3 GFLOPS |
5 (320 bytes) per thread | 127.8 GFLOPS | 205.7 GFLOPS | 279.1 GFLOPS | 264.8 GFLOPS | 285.7 GFLOPS | 289.5 GFLOPS |
6 (384 bytes) per thread | 153.5 GFLOPS | 226.0 GFLOPS | 299.5 GFLOPS | 286.6 GFLOPS | 300.5 GFLOPS | 308.3 GFLOPS |
7 (448 bytes) per thread | 179.0 GFLOPS | 246.6 GFLOPS | 300.6 GFLOPS | 291.4 GFLOPS | 302.4 GFLOPS | 306.2 GFLOPS |
8 (512 bytes) per thread | 204.4 GFLOPS | 269.7 GFLOPS | 301.6 GFLOPS | 299.4 GFLOPS | 309.2 GFLOPS | 310.4 GFLOPS |
9 (576 bytes) per thread | 204.6 GFLOPS | 270.5 GFLOPS | 302.9 GFLOPS | 297.9 GFLOPS | 304.7 GFLOPS | 307.3 GFLOPS |
10 (640 bytes) per thread | 204.7 GFLOPS | 270.3 GFLOPS | 303.0 GFLOPS | 300.2 GFLOPS | 306.9 GFLOPS | 308.9 GFLOPS |
11 (704 bytes) per thread | 204.6 GFLOPS | 276.5 GFLOPS | 308.4 GFLOPS | 302.1 GFLOPS | 305.8 GFLOPS | 307.5 GFLOPS |
12 (768 bytes) per thread | 204.5 GFLOPS | 270.5 GFLOPS | 302.9 GFLOPS | 299.9 GFLOPS | 304.2 GFLOPS | 307.5 GFLOPS |
13 (832 bytes) per thread | 204.6 GFLOPS | 275.3 GFLOPS | 307.9 GFLOPS | 299.8 GFLOPS | 306.4 GFLOPS | 307.4 GFLOPS |
14 (896 bytes) per thread | 204.2 GFLOPS | 270.5 GFLOPS | 302.9 GFLOPS | 299.6 GFLOPS | 306.9 GFLOPS | 310.6 GFLOPS |
15 (960 bytes) per thread | 204.5 GFLOPS | 275.7 GFLOPS | 308.5 GFLOPS | 299.5 GFLOPS | 305.5 GFLOPS | 307.4 GFLOPS |
16 (1024 bytes) per thread | 204.6 GFLOPS | 270.5 GFLOPS | 302.8 GFLOPS | 299.8 GFLOPS | 306.9 GFLOPS | 307.4 GFLOPS |
The single-threaded numbers are about 10% higher than on M1 Max, which is consistent with clock speeds being about 10% higher. M1 Max gets approximately double the performance from 2 threads, and then only marginal improvement from subsequent threads. M2 gets only small performance improvements from additional threads, which is consistent with it having a single P-core cluster (versus two on M1 Max).
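The ~10% figure can be checked directly against the saturated single-thread rows of the two tables (the clock values are assumptions, roughly 3.23 GHz for M1 Max P-cores and 3.50 GHz for M2):

```python
m1_single_peak = 185.5  # GFLOPS, M1 Max, 1 thread, 8 Z accumulators
m2_single_peak = 204.6  # GFLOPS, M2, 1 thread, 8+ Z accumulators

speedup = m2_single_peak / m1_single_peak
print(round(speedup, 3))  # → 1.103, i.e. about 10% faster

# Roughly in line with the assumed P-core clock ratio:
clock_ratio = 3.504 / 3.228
print(round(clock_ratio, 3))  # → 1.086
```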
The four-at-a-time mode on M2 doesn't look much better:
Z Accumulators | 1 Thread | 2 Threads | 3 Threads | 4 Threads | 5 Threads | 6 Threads |
---|---|---|---|---|---|---|
1 (256 bytes) per thread | 102.3 GFLOPS | 164.9 GFLOPS | 247.0 GFLOPS | 187.5 GFLOPS | 250.9 GFLOPS | 305.7 GFLOPS |
2 (512 bytes) per thread | 204.5 GFLOPS | 326.9 GFLOPS | 351.4 GFLOPS | 208.2 GFLOPS | 326.6 GFLOPS | 323.5 GFLOPS |
3 (768 bytes) per thread | 204.6 GFLOPS | 326.8 GFLOPS | 351.4 GFLOPS | 211.6 GFLOPS | 320.5 GFLOPS | 324.9 GFLOPS |
4 (1024 bytes) per thread | 204.6 GFLOPS | 329.3 GFLOPS | 351.3 GFLOPS | 205.7 GFLOPS | 326.6 GFLOPS | 325.6 GFLOPS |
5 (1280 bytes) per thread | 204.6 GFLOPS | 329.1 GFLOPS | 351.2 GFLOPS | 205.4 GFLOPS | 326.5 GFLOPS | 322.4 GFLOPS |
6 (1536 bytes) per thread | 204.6 GFLOPS | 328.9 GFLOPS | 351.4 GFLOPS | 208.7 GFLOPS | 318.2 GFLOPS | 322.7 GFLOPS |
7 (1792 bytes) per thread | 204.6 GFLOPS | 329.2 GFLOPS | 351.4 GFLOPS | 205.9 GFLOPS | 326.4 GFLOPS | 324.0 GFLOPS |
8 (2048 bytes) per thread | 204.5 GFLOPS | 329.1 GFLOPS | 351.4 GFLOPS | 208.1 GFLOPS | 326.5 GFLOPS | 321.3 GFLOPS |
9 (2304 bytes) per thread | 204.6 GFLOPS | 329.2 GFLOPS | 351.4 GFLOPS | 207.3 GFLOPS | 323.6 GFLOPS | 326.9 GFLOPS |
10 (2560 bytes) per thread | 204.6 GFLOPS | 329.2 GFLOPS | 351.4 GFLOPS | 206.2 GFLOPS | 320.8 GFLOPS | 326.7 GFLOPS |
11 (2816 bytes) per thread | 204.5 GFLOPS | 326.9 GFLOPS | 346.5 GFLOPS | 208.2 GFLOPS | 326.3 GFLOPS | 321.5 GFLOPS |
12 (3072 bytes) per thread | 204.6 GFLOPS | 329.1 GFLOPS | 351.1 GFLOPS | 205.9 GFLOPS | 326.5 GFLOPS | 326.4 GFLOPS |
13 (3328 bytes) per thread | 204.5 GFLOPS | 329.3 GFLOPS | 351.4 GFLOPS | 206.6 GFLOPS | 323.5 GFLOPS | 323.4 GFLOPS |
14 (3584 bytes) per thread | 204.6 GFLOPS | 329.2 GFLOPS | 351.4 GFLOPS | 205.5 GFLOPS | 326.4 GFLOPS | 323.2 GFLOPS |
15 (3840 bytes) per thread | 204.6 GFLOPS | 329.2 GFLOPS | 351.0 GFLOPS | 205.8 GFLOPS | 326.5 GFLOPS | 322.5 GFLOPS |
16 (4096 bytes) per thread | 204.6 GFLOPS | 327.1 GFLOPS | 351.2 GFLOPS | 208.1 GFLOPS | 324.0 GFLOPS | 321.7 GFLOPS |
Furthermore, to hit these numbers, two things need to be noted:
1. For vector operations, the Z accumulators you choose want to be maximally spread out over the permissible range, rather than contiguous (which is why the two-at-a-time / four-at-a-time modes on M2 use a Z step of 32 / 16).
2. There's limited-to-no register renaming going on.
Because of point (2), you're in the "1 (256 bytes) per thread" row of the above table. Changing from "ceil(H_width / 32) iterations" to "ceil(H_width / 64) iterations" would help here, as each iteration could then address two different Z ranges, getting you to "2 (512 bytes) per thread".
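The suggested restructuring can be sketched abstractly: instead of one 32-float column block per iteration (one Z range, serialized on FMA latency), each iteration handles two blocks accumulating into two disjoint Z ranges. The function name and block sizes here are hypothetical, purely to illustrate the iteration shape:

```python
import math

def iterations(h_width, z_ranges=2, block=32):
    """Yield (column_offset, z_range_index) pairs covering h_width columns.

    With z_ranges=2 each loop iteration addresses two independent Z
    accumulator ranges, hiding FMA latency between the two chains.
    """
    step = block * z_ranges  # 64 columns per iteration when z_ranges=2
    for it in range(math.ceil(h_width / step)):
        for z in range(z_ranges):
            col = it * step + z * block
            if col < h_width:
                yield col, z  # accumulate this block into Z range z

# Example: 100 columns -> two iterations, alternating Z ranges:
print(list(iterations(100)))  # → [(0, 0), (32, 1), (64, 0), (96, 1)]
```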
Ok interesting, thanks for the data! I think I've got a better understanding of what's a good fit for AMX acceleration and how to get the utilisation as high as possible.