
Comments (2)

ash3D commented on August 22, 2024

Some thoughts and comparison with newer NVIDIA architectures

  • The results show that Volta and Turing use the read-write LSU pipeline (typically used for UAVs and shared memory) for untyped loads (both raw and structured buffers), which is beneficial thanks to the higher LSU count compared with TMUs and the shorter latency. The LSUs are now backed by a read/write L1$ (similar to Fermi), so the read-only TMU pipeline no longer has any advantage.
    Typed loads seem to still use the TMU even though Maxwell introduced format-conversion hardware for UAV loads. Probably not all formats are supported by the LSU (e.g. sRGB or shared-exponent), so the driver has to stick with the TMU for typed loads, since it can't know upfront which format will be used and whether the LSU supports it. On the other hand, using the TMU can be beneficial in some situations: since NVIDIA GPUs have separate LSU and TMU pipelines, they can presumably be utilized simultaneously, so in UAV- or shared-memory-intensive workloads, using the otherwise dormant TMU for read-only SRV access can be effectively free. (See the HLSL sketch after this list for the load paths being compared.)

  • Maxwell and Pascal don't have a R/W L1$, so the texture L1$ gives the TMU an advantage over the LSU. Apparently the TMU is used for structured buffer loads, while the LSU is now used for raw buffers (results from a previous test version with older drivers suggested the TMU for raw buffers too). Maybe this discrepancy is due to NVIDIA hardware being unable to perform full-speed 64-bit unaligned fetches (alignment is not guaranteed for raw buffers). Accesses to structured buffers, on the other hand, are aligned and 64-bit loads run at full speed, so the driver decided to take advantage of the texture L1$. It seems the decision to use the LSU without an L1$ for raw buffers gives better performance than could be expected from the TMU based on unit count, except for random Load4, which is slower than the expected TMU theoretical rate of 1/4 (apparently the lack of an L1$ shows up in this case).

  • Kepler uses the TMU for all SRV loads. Maybe the NVIDIA driver team decided that the read-only texture L1$ gives more of an advantage than the larger LSU count on this architecture, but another explanation for the difference from Maxwell and Pascal, which have a similar cache hierarchy, is that Kepler has a fixed 64 slots for UAVs (it supports unlimited descriptor tables for SRVs only). Maxwell extended the descriptor-table approach to UAVs, so it can use the LSU for untyped SRV access if desired.
    My own experiments showed that UAV loads can be faster than SRV loads on Kepler in some cases (it strongly depends on various factors, including driver version), but generally SRVs offer more predictable and consistent performance.
    It's notable that the uniform load optimization works for 1d and 2d raw buffer loads but is disabled for 3d and 4d ones. Maybe due to GPR pressure?
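
A minimal HLSL sketch of the load paths discussed above, assuming a simple 256-thread compute shader; the resource names, register assignments, and the output buffer are illustrative assumptions, not code taken from perftest:

```hlsl
ByteAddressBuffer          rawSRV        : register(t0); // raw buffer SRV (untyped)
StructuredBuffer<float4>   structuredSRV : register(t1); // structured buffer SRV (untyped)
Buffer<float4>             typedSRV      : register(t2); // typed buffer SRV (needs format conversion)
RWByteAddressBuffer        rawUAV        : register(u0); // raw buffer UAV (read-write path)
RWStructuredBuffer<float4> outputUAV     : register(u1);

[numthreads(256, 1, 1)]
void main(uint3 tid : SV_DispatchThreadID)
{
    uint addr = tid.x * 16;                   // byte address (raw views only require 4-byte alignment in general)

    float4 r = asfloat(rawSRV.Load4(addr));   // untyped raw load
    float4 s = structuredSRV[tid.x];          // untyped structured load (naturally aligned to its stride)
    float4 t = typedSRV[tid.x];               // typed load (hardware format conversion)
    float4 u = asfloat(rawUAV.Load4(addr));   // UAV load: always the read-write path

    outputUAV[tid.x] = r + s + t + u;         // keep every load alive so the compiler can't drop them
}
```

The raw and structured views are the "untyped" loads discussed above, the `Buffer<float4>` view is the "typed" load that needs format conversion, and the UAV view always goes through the read-write path regardless of architecture.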

RGB32 loads

I've also tried the RGB32 (float3) format for typed buffer loads and textures. The results were different from my previous experiments. The current test configuration shows a 2/3 rate (somewhat strange) for buffers and linear texture access, and 1/3 for random texture reads. My previous experiments gave a 1/3 rate for linear buffer loads (similar to raw buffer Load3) and somewhat slower for random loads. The 1/3 rate seemed reasonable: it is consistent with the assumption that NVIDIA TMU hardware falls back to 32-bit fetches on unaligned access (otherwise a 1/2 rate would be expected, as with RGBA32). I played with the test configuration a little, tuning the thread group size and loop iteration count, and got different results: near 1/3, or much slower in some cases. Maybe cache bank conflicts start to appear.
This dependence on the test configuration prompted the idea of a driver optimization (maybe similar to the uniform load optimization): 64- or 128-bit fetches combined to assemble a 96-bit result while sharing data with other threads or loop iterations. It's remarkable that the 2/3 rate is achieved when both the thread group size and the loop iteration count are 256. But this assumption may be wrong, and the TMU hardware may be able to offer a 2/3 rate for the RGB32 format, with the drop to 1/3 or slower under some conditions caused by cache bank conflicts or other reasons.
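
A hedged sketch of the two 96-bit load variants being compared: a typed Buffer<float3> view over an RGB32 (DXGI_FORMAT_R32G32B32_FLOAT) buffer versus a raw Load3 of the same data. The names, the 256-thread group, and the 256-iteration loop are assumptions chosen only to mirror the configuration described above, not perftest's actual code:

```hlsl
Buffer<float3>             typedRGB32 : register(t0); // RGB32 typed SRV (DXGI_FORMAT_R32G32B32_FLOAT)
ByteAddressBuffer          rawBuffer  : register(t1); // same data viewed as a raw buffer
RWStructuredBuffer<float3> outBuffer  : register(u0);

[numthreads(256, 1, 1)]                                // group size 256, matching the configuration above
void main(uint3 tid : SV_DispatchThreadID)
{
    float3 acc = float3(0.0, 0.0, 0.0);

    [loop]
    for (uint i = 0; i < 256; ++i)                     // 256 iterations, matching the configuration above
    {
        uint index = tid.x + i;                        // "linear" access; use a randomized index for the random case
        acc += typedRGB32[index];                      // 96-bit typed fetch
        acc += asfloat(rawBuffer.Load3(index * 12));   // 96-bit raw fetch; 12-byte stride, only 4-byte alignment guaranteed
    }

    outBuffer[tid.x] = acc;                            // keep the results alive
}
```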


sebbbi commented on August 22, 2024

Thanks! Added Kepler results. This confirms that Nvidia's uniform load driver optimization affects Kepler too.
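
For reference, a hypothetical sketch of the access pattern that the uniform load optimization targets: the load address depends only on the group ID, so it is identical for every thread in the group. Resource names and the group size are assumptions for illustration; per the earlier comment, the optimization reportedly applies to 1d and 2d raw loads but not to 3d/4d ones:

```hlsl
ByteAddressBuffer         srcBuffer : register(t0);
RWStructuredBuffer<float> outBuffer : register(u0);

[numthreads(256, 1, 1)]
void main(uint3 tid : SV_DispatchThreadID, uint3 gid : SV_GroupID)
{
    // Uniform address: depends only on the group ID, so every thread loads the same
    // 32 bits. This is the pattern the driver can reportedly replace with a single
    // scalar-like load whose result is broadcast to the whole group.
    float uniformValue = asfloat(srcBuffer.Load(gid.x * 4));

    // Divergent address: varies per thread, so the optimization does not apply.
    float divergentValue = asfloat(srcBuffer.Load(tid.x * 4));

    outBuffer[tid.x] = uniformValue + divergentValue;
}
```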

