AMX vector performance on M1/M2 is disappointing, especially compared to M1/M2 performance-core NEON. An M1 performance core using NEON has a theoretical maximum f32 throughput of 102.4 GFLOPS (dispatch 4 FMA instructions per cycle, each FMA 4 lanes wide, each lane counting as two ops, at 3.2 GHz). Perfect four-way multithreading would get you up to 409.6 GFLOPS, and perfect eight-way to 819.2 GFLOPS. M2 should be slightly higher.
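The NEON peak figures above follow directly from the dispatch width, vector width, and clock; a minimal back-of-envelope check (the pipe count and clock are as stated above):

```python
# Theoretical f32 NEON peak for an M1 performance core.
FMA_PIPES = 4      # FMA instructions dispatched per cycle
F32_LANES = 4      # 128-bit NEON vector holds 4 x f32
OPS_PER_FMA = 2    # each lane's fused multiply-add counts as 2 FLOPs
CLOCK_GHZ = 3.2

one_core = FMA_PIPES * F32_LANES * OPS_PER_FMA * CLOCK_GHZ  # GFLOPS
print(one_core)      # 102.4
print(one_core * 4)  # 409.6 with perfect four-way multithreading
print(one_core * 8)  # 819.2 with perfect eight-way multithreading
```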
For comparison, AMX f32 vector FMAs on an M1 Max look like:
Z Accumulators | 1 Thread | 2 Threads | 3 Threads | 4 Threads | 5 Threads | 6 Threads |
---|---|---|---|---|---|---|
1 (64 bytes) per thread | 23.2 GFLOPS | 46.4 GFLOPS | 53.3 GFLOPS | 81.1 GFLOPS | 89.0 GFLOPS | 104.1 GFLOPS |
2 (128 bytes) per thread | 46.4 GFLOPS | 92.7 GFLOPS | 106.5 GFLOPS | 141.3 GFLOPS | 176.8 GFLOPS | 206.5 GFLOPS |
3 (192 bytes) per thread | 69.6 GFLOPS | 139.1 GFLOPS | 160.1 GFLOPS | 213.3 GFLOPS | 250.6 GFLOPS | 244.9 GFLOPS |
4 (256 bytes) per thread | 92.7 GFLOPS | 185.4 GFLOPS | 214.0 GFLOPS | 277.6 GFLOPS | 325.5 GFLOPS | 298.0 GFLOPS |
5 (320 bytes) per thread | 115.8 GFLOPS | 231.7 GFLOPS | 241.0 GFLOPS | 321.3 GFLOPS | 355.1 GFLOPS | 347.7 GFLOPS |
6 (384 bytes) per thread | 139.0 GFLOPS | 277.7 GFLOPS | 271.2 GFLOPS | 361.7 GFLOPS | 387.1 GFLOPS | 386.2 GFLOPS |
7 (448 bytes) per thread | 162.2 GFLOPS | 324.2 GFLOPS | 299.9 GFLOPS | 383.4 GFLOPS | 394.0 GFLOPS | 400.9 GFLOPS |
8 (512 bytes) per thread | 185.5 GFLOPS | 369.9 GFLOPS | 335.8 GFLOPS | 392.9 GFLOPS | 405.8 GFLOPS | 416.0 GFLOPS |
9 (576 bytes) per thread | 178.0 GFLOPS | 353.4 GFLOPS | 325.5 GFLOPS | 396.9 GFLOPS | 398.0 GFLOPS | 409.2 GFLOPS |
10 (640 bytes) per thread | 183.1 GFLOPS | 360.6 GFLOPS | 335.3 GFLOPS | 402.4 GFLOPS | 401.2 GFLOPS | 417.2 GFLOPS |
11 (704 bytes) per thread | 183.1 GFLOPS | 363.0 GFLOPS | 334.2 GFLOPS | 403.2 GFLOPS | 400.6 GFLOPS | 415.8 GFLOPS |
12 (768 bytes) per thread | 185.2 GFLOPS | 370.6 GFLOPS | 335.5 GFLOPS | 378.5 GFLOPS | 397.7 GFLOPS | 419.0 GFLOPS |
13 (832 bytes) per thread | 185.2 GFLOPS | 369.4 GFLOPS | 336.0 GFLOPS | 404.2 GFLOPS | 400.9 GFLOPS | 414.1 GFLOPS |
14 (896 bytes) per thread | 185.5 GFLOPS | 370.5 GFLOPS | 336.4 GFLOPS | 406.0 GFLOPS | 402.9 GFLOPS | 416.4 GFLOPS |
15 (960 bytes) per thread | 185.5 GFLOPS | 370.0 GFLOPS | 336.8 GFLOPS | 405.7 GFLOPS | 402.6 GFLOPS | 409.6 GFLOPS |
16 (1024 bytes) per thread | 185.4 GFLOPS | 370.4 GFLOPS | 336.3 GFLOPS | 406.0 GFLOPS | 399.7 GFLOPS | 405.3 GFLOPS |
A single thread can exceed 102.4 GFLOPS, potentially hitting 185.5 GFLOPS, but you need to be using 512 bytes of Z registers to get there. Four threads only get to 406.0 GFLOPS, which is just below the theoretical 409.6 GFLOPS achievable with four NEON threads. More than four threads don't help AMX, but would help NEON.
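One way to read the single-thread column: throughput scales roughly linearly with the number of Z accumulators up to 8, then flattens, which is the classic signature of a dependency chain limited by FMA latency. A rough sketch of that inference (the clock value is an assumption, and the derived latency is inferred from the table, not documented anywhere):

```python
# Each 64-byte f32 vector FMA touches 16 lanes at 2 FLOPs each.
FLOPS_PER_FMA = 16 * 2
CLOCK_GHZ = 3.228        # approximate M1 Max P-core clock (assumption)

single_acc = 23.2        # GFLOPS with 1 Z accumulator (from the table)
saturated = 185.5        # GFLOPS with 8+ Z accumulators (from the table)

# With one accumulator, every FMA depends on the previous one, so the
# measured rate implies the FMA latency in cycles:
latency = CLOCK_GHZ * FLOPS_PER_FMA / single_acc

# Independent accumulators hide that latency; the ratio suggests how
# many independent chains are needed to saturate the unit:
chains_needed = saturated / single_acc

print(round(latency, 1), round(chains_needed, 1))  # → 4.5 8.0
```

This is consistent with the table saturating at 8 Z accumulators per thread, and with why using fewer, contiguous accumulators leaves throughput on the table.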
The same thing on M2 looks like:
Z Accumulators | 1 Thread | 2 Threads | 3 Threads | 4 Threads | 5 Threads | 6 Threads |
---|---|---|---|---|---|---|
1 (64 bytes) per thread | 25.6 GFLOPS | 41.2 GFLOPS | 61.7 GFLOPS | 78.7 GFLOPS | 98.4 GFLOPS | 117.7 GFLOPS |
2 (128 bytes) per thread | 51.2 GFLOPS | 82.3 GFLOPS | 123.5 GFLOPS | 157.7 GFLOPS | 174.1 GFLOPS | 170.4 GFLOPS |
3 (192 bytes) per thread | 76.7 GFLOPS | 123.4 GFLOPS | 179.5 GFLOPS | 191.0 GFLOPS | 216.9 GFLOPS | 215.1 GFLOPS |
4 (256 bytes) per thread | 102.2 GFLOPS | 164.6 GFLOPS | 237.1 GFLOPS | 231.8 GFLOPS | 258.3 GFLOPS | 263.3 GFLOPS |
5 (320 bytes) per thread | 127.8 GFLOPS | 205.7 GFLOPS | 279.1 GFLOPS | 264.8 GFLOPS | 285.7 GFLOPS | 289.5 GFLOPS |
6 (384 bytes) per thread | 153.5 GFLOPS | 226.0 GFLOPS | 299.5 GFLOPS | 286.6 GFLOPS | 300.5 GFLOPS | 308.3 GFLOPS |
7 (448 bytes) per thread | 179.0 GFLOPS | 246.6 GFLOPS | 300.6 GFLOPS | 291.4 GFLOPS | 302.4 GFLOPS | 306.2 GFLOPS |
8 (512 bytes) per thread | 204.4 GFLOPS | 269.7 GFLOPS | 301.6 GFLOPS | 299.4 GFLOPS | 309.2 GFLOPS | 310.4 GFLOPS |
9 (576 bytes) per thread | 204.6 GFLOPS | 270.5 GFLOPS | 302.9 GFLOPS | 297.9 GFLOPS | 304.7 GFLOPS | 307.3 GFLOPS |
10 (640 bytes) per thread | 204.7 GFLOPS | 270.3 GFLOPS | 303.0 GFLOPS | 300.2 GFLOPS | 306.9 GFLOPS | 308.9 GFLOPS |
11 (704 bytes) per thread | 204.6 GFLOPS | 276.5 GFLOPS | 308.4 GFLOPS | 302.1 GFLOPS | 305.8 GFLOPS | 307.5 GFLOPS |
12 (768 bytes) per thread | 204.5 GFLOPS | 270.5 GFLOPS | 302.9 GFLOPS | 299.9 GFLOPS | 304.2 GFLOPS | 307.5 GFLOPS |
13 (832 bytes) per thread | 204.6 GFLOPS | 275.3 GFLOPS | 307.9 GFLOPS | 299.8 GFLOPS | 306.4 GFLOPS | 307.4 GFLOPS |
14 (896 bytes) per thread | 204.2 GFLOPS | 270.5 GFLOPS | 302.9 GFLOPS | 299.6 GFLOPS | 306.9 GFLOPS | 310.6 GFLOPS |
15 (960 bytes) per thread | 204.5 GFLOPS | 275.7 GFLOPS | 308.5 GFLOPS | 299.5 GFLOPS | 305.5 GFLOPS | 307.4 GFLOPS |
16 (1024 bytes) per thread | 204.6 GFLOPS | 270.5 GFLOPS | 302.8 GFLOPS | 299.8 GFLOPS | 306.9 GFLOPS | 307.4 GFLOPS |
The single-threaded numbers are about 10% higher than on M1 Max, which is consistent with clock speeds being about 10% higher. M1 Max gets approximately double the performance from 2 threads, and then only marginal improvement from subsequent threads. M2 gets only small performance improvements from additional threads, which is consistent with it having a single P-core cluster (versus two on M1 Max).
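The ~10% figure can be checked directly against the saturated single-thread rows of the two tables (the clock values are assumptions, roughly 3.23 GHz for M1 Max P-cores and 3.50 GHz for M2):

```python
m1_single_peak = 185.5  # GFLOPS, M1 Max, 1 thread, 8 Z accumulators
m2_single_peak = 204.6  # GFLOPS, M2, 1 thread, 8+ Z accumulators

speedup = m2_single_peak / m1_single_peak
print(round(speedup, 3))  # → 1.103, i.e. about 10% faster

# Roughly in line with the assumed P-core clock ratio:
clock_ratio = 3.504 / 3.228
print(round(clock_ratio, 3))  # → 1.086
```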
The four-at-a-time mode on M2 doesn't look much better:
Z Accumulators | 1 Thread | 2 Threads | 3 Threads | 4 Threads | 5 Threads | 6 Threads |
---|---|---|---|---|---|---|
1 (256 bytes) per thread | 102.3 GFLOPS | 164.9 GFLOPS | 247.0 GFLOPS | 187.5 GFLOPS | 250.9 GFLOPS | 305.7 GFLOPS |
2 (512 bytes) per thread | 204.5 GFLOPS | 326.9 GFLOPS | 351.4 GFLOPS | 208.2 GFLOPS | 326.6 GFLOPS | 323.5 GFLOPS |
3 (768 bytes) per thread | 204.6 GFLOPS | 326.8 GFLOPS | 351.4 GFLOPS | 211.6 GFLOPS | 320.5 GFLOPS | 324.9 GFLOPS |
4 (1024 bytes) per thread | 204.6 GFLOPS | 329.3 GFLOPS | 351.3 GFLOPS | 205.7 GFLOPS | 326.6 GFLOPS | 325.6 GFLOPS |
5 (1280 bytes) per thread | 204.6 GFLOPS | 329.1 GFLOPS | 351.2 GFLOPS | 205.4 GFLOPS | 326.5 GFLOPS | 322.4 GFLOPS |
6 (1536 bytes) per thread | 204.6 GFLOPS | 328.9 GFLOPS | 351.4 GFLOPS | 208.7 GFLOPS | 318.2 GFLOPS | 322.7 GFLOPS |
7 (1792 bytes) per thread | 204.6 GFLOPS | 329.2 GFLOPS | 351.4 GFLOPS | 205.9 GFLOPS | 326.4 GFLOPS | 324.0 GFLOPS |
8 (2048 bytes) per thread | 204.5 GFLOPS | 329.1 GFLOPS | 351.4 GFLOPS | 208.1 GFLOPS | 326.5 GFLOPS | 321.3 GFLOPS |
9 (2304 bytes) per thread | 204.6 GFLOPS | 329.2 GFLOPS | 351.4 GFLOPS | 207.3 GFLOPS | 323.6 GFLOPS | 326.9 GFLOPS |
10 (2560 bytes) per thread | 204.6 GFLOPS | 329.2 GFLOPS | 351.4 GFLOPS | 206.2 GFLOPS | 320.8 GFLOPS | 326.7 GFLOPS |
11 (2816 bytes) per thread | 204.5 GFLOPS | 326.9 GFLOPS | 346.5 GFLOPS | 208.2 GFLOPS | 326.3 GFLOPS | 321.5 GFLOPS |
12 (3072 bytes) per thread | 204.6 GFLOPS | 329.1 GFLOPS | 351.1 GFLOPS | 205.9 GFLOPS | 326.5 GFLOPS | 326.4 GFLOPS |
13 (3328 bytes) per thread | 204.5 GFLOPS | 329.3 GFLOPS | 351.4 GFLOPS | 206.6 GFLOPS | 323.5 GFLOPS | 323.4 GFLOPS |
14 (3584 bytes) per thread | 204.6 GFLOPS | 329.2 GFLOPS | 351.4 GFLOPS | 205.5 GFLOPS | 326.4 GFLOPS | 323.2 GFLOPS |
15 (3840 bytes) per thread | 204.6 GFLOPS | 329.2 GFLOPS | 351.0 GFLOPS | 205.8 GFLOPS | 326.5 GFLOPS | 322.5 GFLOPS |
16 (4096 bytes) per thread | 204.6 GFLOPS | 327.1 GFLOPS | 351.2 GFLOPS | 208.1 GFLOPS | 324.0 GFLOPS | 321.7 GFLOPS |
Furthermore, to hit these numbers, two things need to be noted:
1. For vector operations, the Z accumulators you choose want to be maximally spread out over the permissible range, rather than contiguous (which is why the two-at-a-time / four-at-a-time modes on M2 use a Z step of 32 / 16).
2. There's limited-to-no register renaming going on.
Because of point (2), you're in the "1 (256 bytes) per thread" row of the above table. Changing from "ceil(H_width / 32) iterations" to "ceil(H_width / 64) iterations" would help here, as each iteration could then address two different Z ranges, getting you to "2 (512 bytes) per thread".
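The suggested restructuring can be sketched abstractly: instead of one 32-float column block per iteration (one Z range, serialized on FMA latency), each iteration handles two blocks accumulating into two disjoint Z ranges. The function name and block sizes here are hypothetical, purely to illustrate the iteration shape:

```python
import math

def iterations(h_width, z_ranges=2, block=32):
    """Yield (column_offset, z_range_index) pairs covering h_width columns.

    With z_ranges=2 each loop iteration addresses two independent Z
    accumulator ranges, hiding FMA latency between the two chains.
    """
    step = block * z_ranges  # 64 columns per iteration when z_ranges=2
    for it in range(math.ceil(h_width / step)):
        for z in range(z_ranges):
            col = it * step + z * block
            if col < h_width:
                yield col, z  # accumulate this block into Z range z

# Example: 100 columns -> two iterations, alternating Z ranges:
print(list(iterations(100)))  # → [(0, 0), (32, 1), (64, 0), (96, 1)]
```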
Ok interesting, thanks for the data! I think I've got a better understanding of what's a good fit for AMX acceleration and how to get the utilisation as high as possible.