
Comments (2)

ash3D commented on August 22, 2024

Some thoughts and comparison with newer NVIDIA architectures

  • The results show that Volta and Turing use the read-write LSU pipeline (typically used for UAVs and shared memory) for untyped loads (both raw and structured buffers), which is beneficial thanks to the higher LSU count compared with TMUs and the shorter latency. The LSUs are now backed by a read/write L1$ (similar to Fermi), so the read-only TMU pipeline no longer has any advantage.
    Typed loads seem to still use the TMU even though Maxwell introduced format-conversion hardware for UAV loads. Probably not all formats are supported by the LSU (e.g. sRGB or shared-exponent), so the driver has to stick with the TMU for typed loads, since it can't know upfront which format will be used and whether the LSU supports it. On the other hand, using the TMU can be beneficial in some situations: since NVIDIA GPUs have separate LSU and TMU pipelines, they can presumably be utilized simultaneously, so in UAV- or shared-memory-intensive workloads, using the otherwise dormant TMU for read-only SRV access can be effectively free. (See the HLSL sketch after this list for the load paths being compared.)

  • Maxwell and Pascal don't have a R/W L1$, so the texture L1$ gives the TMU an advantage over the LSU. Apparently the TMU is used for structured buffer loads, while the LSU is now used for raw buffers (results from a previous test version with older drivers suggested the TMU for raw buffers too). Maybe this discrepancy is due to NVIDIA hardware being unable to perform full-speed 64-bit unaligned fetches (alignment is not guaranteed for raw buffers). Accesses to structured buffers, on the other hand, are aligned and 64-bit loads run at full speed, so the driver decided to take advantage of the texture L1$. It seems the decision to use the LSU without an L1$ for raw buffers gives better performance than could be expected from the TMU based on unit count, except for random Load4, which is slower than the expected TMU theoretical rate of 1/4 (apparently the lack of an L1$ shows up in this case).

  • Kepler uses the TMU for all SRV loads. Maybe the NVIDIA driver team decided that the read-only texture L1$ gives more of an advantage than the larger LSU count on this architecture, but another explanation for the difference from Maxwell and Pascal, which have a similar cache hierarchy, is that Kepler has a fixed 64 slots for UAVs (it supports unlimited descriptor tables for SRVs only). Maxwell extended the descriptor-table approach to UAVs, so it can use the LSU for untyped SRV access if desired.
    My own experiments showed that UAV loads can be faster than SRV loads on Kepler in some cases (it strongly depends on various factors, including driver version), but generally SRVs offer more predictable and consistent performance.
    It's notable that the uniform load optimization works for 1d and 2d raw buffer loads but is disabled for 3d and 4d ones. Maybe due to GPR pressure?
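
A minimal HLSL sketch of the load paths discussed above, assuming a simple 256-thread compute shader; the resource names, register assignments, and the output buffer are illustrative assumptions, not code taken from perftest:

```hlsl
ByteAddressBuffer          rawSRV        : register(t0); // raw buffer SRV (untyped)
StructuredBuffer<float4>   structuredSRV : register(t1); // structured buffer SRV (untyped)
Buffer<float4>             typedSRV      : register(t2); // typed buffer SRV (needs format conversion)
RWByteAddressBuffer        rawUAV        : register(u0); // raw buffer UAV (read-write path)
RWStructuredBuffer<float4> outputUAV     : register(u1);

[numthreads(256, 1, 1)]
void main(uint3 tid : SV_DispatchThreadID)
{
    uint addr = tid.x * 16;                   // byte address (raw views only require 4-byte alignment in general)

    float4 r = asfloat(rawSRV.Load4(addr));   // untyped raw load
    float4 s = structuredSRV[tid.x];          // untyped structured load (naturally aligned to its stride)
    float4 t = typedSRV[tid.x];               // typed load (hardware format conversion)
    float4 u = asfloat(rawUAV.Load4(addr));   // UAV load: always the read-write path

    outputUAV[tid.x] = r + s + t + u;         // keep every load alive so the compiler can't drop them
}
```

The raw and structured views are the "untyped" loads discussed above, the `Buffer<float4>` view is the "typed" load that needs format conversion, and the UAV view always goes through the read-write path regardless of architecture.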

RGB32 loads

I've also tried the RGB32 (float3) format for typed buffer loads and textures. The results were different from my previous experiments. The current test configuration shows a 2/3 rate (somewhat strange) for buffers and linear texture access, and 1/3 for random texture reads. My previous experiments gave a 1/3 rate for linear buffer loads (similar to raw buffer Load3) and somewhat slower for random loads. The 1/3 rate seemed reasonable: it is consistent with the assumption that NVIDIA TMU hardware falls back to 32-bit fetches on unaligned access (otherwise a 1/2 rate would be expected, as with RGBA32). I played with the test configuration a little, tuning the thread group size and loop iteration count, and got different results: near 1/3, or much slower in some cases. Maybe cache bank conflicts start to appear.
This dependence on the test configuration prompted the idea of a driver optimization (maybe similar to the uniform load optimization): 64- or 128-bit fetches combined to assemble a 96-bit result while sharing data with other threads or loop iterations. It's remarkable that the 2/3 rate is achieved when both the thread group size and the loop iteration count are 256. But this assumption may be wrong, and the TMU hardware may be able to offer a 2/3 rate for the RGB32 format, with the drop to 1/3 or slower under some conditions caused by cache bank conflicts or other reasons.
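
A hedged sketch of the two 96-bit load variants being compared: a typed Buffer<float3> view over an RGB32 (DXGI_FORMAT_R32G32B32_FLOAT) buffer versus a raw Load3 of the same data. The names, the 256-thread group, and the 256-iteration loop are assumptions chosen only to mirror the configuration described above, not perftest's actual code:

```hlsl
Buffer<float3>             typedRGB32 : register(t0); // RGB32 typed SRV (DXGI_FORMAT_R32G32B32_FLOAT)
ByteAddressBuffer          rawBuffer  : register(t1); // same data viewed as a raw buffer
RWStructuredBuffer<float3> outBuffer  : register(u0);

[numthreads(256, 1, 1)]                                // group size 256, matching the configuration above
void main(uint3 tid : SV_DispatchThreadID)
{
    float3 acc = float3(0.0, 0.0, 0.0);

    [loop]
    for (uint i = 0; i < 256; ++i)                     // 256 iterations, matching the configuration above
    {
        uint index = tid.x + i;                        // "linear" access; use a randomized index for the random case
        acc += typedRGB32[index];                      // 96-bit typed fetch
        acc += asfloat(rawBuffer.Load3(index * 12));   // 96-bit raw fetch; 12-byte stride, only 4-byte alignment guaranteed
    }

    outBuffer[tid.x] = acc;                            // keep the results alive
}
```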


sebbbi commented on August 22, 2024

Thanks! Added Kepler results. This confirms that Nvidia's uniform load driver optimization affects Kepler too.
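
For reference, a hypothetical sketch of the access pattern that the uniform load optimization targets: the load address depends only on the group ID, so it is identical for every thread in the group. Resource names and the group size are assumptions for illustration; per the earlier comment, the optimization reportedly applies to 1d and 2d raw loads but not to 3d/4d ones:

```hlsl
ByteAddressBuffer         srcBuffer : register(t0);
RWStructuredBuffer<float> outBuffer : register(u0);

[numthreads(256, 1, 1)]
void main(uint3 tid : SV_DispatchThreadID, uint3 gid : SV_GroupID)
{
    // Uniform address: depends only on the group ID, so every thread loads the same
    // 32 bits. This is the pattern the driver can reportedly replace with a single
    // scalar-like load whose result is broadcast to the whole group.
    float uniformValue = asfloat(srcBuffer.Load(gid.x * 4));

    // Divergent address: varies per thread, so the optimization does not apply.
    float divergentValue = asfloat(srcBuffer.Load(tid.x * 4));

    outBuffer[tid.x] = uniformValue + divergentValue;
}
```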

