Saturday, March 16, 2013

GCN's 25 Performance Tips from Nick Thibieroz

Update: Tips 26 to 50 were published.

Nicolas Thibieroz from AMD has been posting a daily series of "Performance Tips" through his Twitter account.

These performance tips refer to GCN ("Graphics Core Next"), the architecture found in the AMD Radeon HD 7000 series.

I do NOT work at AMD, and I thought it would be a shame if all these tips were lost in the Twitterverse, as it is not the most reliable place to keep long-term documentation. They would just get lost forever or become scattered.

So, I've gathered all of them and posted here. I will try to keep this page up to date as he keeps posting more of them.
Before you start reading: if you're really new to how GPUs work, or some of these tips leave you with "wth is he talking about??", I highly recommend Emil Persson's ATI Radeon HD 2000 programming guide & Depth In Depth. Reading NVIDIA's old GPU Programming Guide for the GeForce 8 & 9 Series is also very enlightening and helps contrast the differences.
It's noteworthy that many old tips still apply, and that some of the new tips are also general advice (they apply to many other architectures as well, including old ones).

Ok, here you go:
#1: Issues with Z-Fighting? Use D32_FLOAT_S8X24_UINT format with no performance or memory impact compared to D24S8.

#2: Binding a depth buffer as a texture will decompress it, making subsequent Z ops more expensive.

#3: Invest in DirectCompute R&D to unlock new performance levels in your games.

#4: On current GCN DX11 drivers the maximum recommended size for NO_OVERWRITE dynamic buffers is 4Mb.

#5: Limit Vertex and Domain Shader output size to 4 float4/int4 attributes for best performance.

#6: RGBA16 and RGBA16F are fast export, use those to pack G-Buffer data and avoid ROP bottlenecks.

#7: Design your game engine with geometry instancing support from an early stage.

#8: Pure Vertex Shader-based solutions can be faster than using the GS or HS/DS.

#9: Use a ring of STAGING resources to update textures. UpdateSubresource is slow unless texture size is <4Kb.

#10: DX11 supports free-threaded resource creation, use it to reduce shader compilation and texture loading times.

#11: Use the smallest Input Layout necessary for a given VS; this is especially important for depth-only rendering.
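For example, a depth-only pass needs nothing but position, so the Input Layout (and matching VS input signature) can be stripped down to a single element. A minimal sketch; the constant buffer and struct names are illustrative, not from the tip:

```hlsl
cbuffer PerObject
{
    float4x4 g_WorldViewProj;
};

// Depth-only rendering: the input carries position only, so the Input
// Layout has a single element and the vertex fetch is as cheap as it gets.
struct VSIn_Depth
{
    float3 pos : POSITION;
};

float4 VS_Depth(VSIn_Depth v) : SV_Position
{
    return mul(float4(v.pos, 1.0f), g_WorldViewProj);
}
```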

#12: Don't forget to optimize geometry for index locality and sequential read access - including procedural geometry.

#13: Implement backface culling in Hull Shader if tessellation factors are on the high side.
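One common way to do this is in the patch-constant function: a tessellation factor of 0 makes the tessellator cull the patch entirely, so back-facing patches never reach the Domain Shader. A rough sketch; the structs, the g_EyePos constant, and the averaged-normal facing test are illustrative assumptions, not from the tip:

```hlsl
struct HullIn    { float3 posW : POSITION; float3 normalW : NORMAL; };
struct PatchTess { float Edge[3] : SV_TessFactor; float Inside : SV_InsideTessFactor; };

cbuffer PerFrame { float3 g_EyePos; };

PatchTess PatchConstantFunc(InputPatch<HullIn, 3> patch)
{
    PatchTess pt;

    // Average the control-point normals as a cheap facing test.
    float3 n     = normalize(patch[0].normalW + patch[1].normalW + patch[2].normalW);
    float3 toEye = normalize(g_EyePos - patch[0].posW);

    // A tessellation factor of 0 culls the patch before the tessellator runs.
    float factor = (dot(n, toEye) < 0.0f) ? 0.0f : 15.0f;

    pt.Edge[0] = pt.Edge[1] = pt.Edge[2] = factor;
    pt.Inside  = factor;
    return pt;
}
```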

#14: Use flow control in shaders but watch out for GPR pressure caused by deep nested branches.

#15: Some shader instructions are costly; pre-compute constants and store them in constant buffers (e.g. reciprocals).
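A trivial illustration of the reciprocal case (names made up): the division is done once on the CPU, and the shader only multiplies.

```hlsl
cbuffer PerFrame
{
    float2 g_ScreenSize;     // e.g. (1920, 1080)
    float2 g_InvScreenSize;  // 1.0 / g_ScreenSize, precomputed on the CPU
};

float2 PixelToUV(float2 pixelPos)
{
    // Multiply by the precomputed reciprocal instead of dividing per pixel.
    return pixelPos * g_InvScreenSize;
}
```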

#16: Use [maxtessfactor(X)] in Hull Shader declaration to control tessellation costs. Max recommended value is 15.
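The attribute goes on the Hull Shader entry point. A sketch with illustrative struct names (the patch-constant function is assumed to be defined elsewhere), capping the factor at the recommended 15:

```hlsl
struct HullIn  { float3 posW : POSITION; float3 normalW : NORMAL; };
struct HullOut { float3 posW : POSITION; float3 normalW : NORMAL; };

[domain("tri")]
[partitioning("fractional_odd")]
[outputtopology("triangle_cw")]
[outputcontrolpoints(3)]
[patchconstantfunc("PatchConstantFunc")]
[maxtessfactor(15.0)]   // cap tessellation costs; 15 is the recommended maximum
HullOut HS(InputPatch<HullIn, 3> patch, uint i : SV_OutputControlPointID)
{
    HullOut o;
    o.posW    = patch[i].posW;     // pass-through control point
    o.normalW = patch[i].normalW;
    return o;
}
```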

#17: Filtering 64-bit texture formats is half-rate on current GCN architectures, only use if needed.

#18: clip, discard, alpha-to-mask and writing to oMask or oDepth disable Early-Z when depth writes are on.

#19: Writing to a UAV or oDepth disables both Early-Z and Hi-Z unless conservative oDepth is used.

#20: DispatchIndirect() and Draw[Indexed]InstancedIndirect() can be used to implement some form of conditional rendering.

#21: Use a multiple of 64 in Compute Shader threadgroup declaration. 256 is often a good choice.
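Wavefronts on GCN are 64 threads wide, so a threadgroup size that is a multiple of 64 avoids partially filled wavefronts. A minimal declaration sketch (the buffer and the per-thread work are placeholders):

```hlsl
RWStructuredBuffer<float> g_Output : register(u0);

// 256 threads = 4 full wavefronts of 64 on GCN.
[numthreads(256, 1, 1)]
void CS(uint3 dtid : SV_DispatchThreadID)
{
    g_Output[dtid.x] *= 2.0f;  // trivial stand-in for real per-thread work
}
```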

#22: Occlusion queries will stall the CPU if not used correctly.

#23: GetDimensions() is a TEX instruction; prefer storing texture dimensions in a Constant Buffer if TEX-bound.
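A sketch of the alternative; g_TexSize would be uploaded by the application, and all names here are illustrative:

```hlsl
Texture2D    g_Tex;
SamplerState g_Sampler;

// Hypothetical per-texture constants uploaded by the application.
cbuffer PerTexture { float2 g_TexSize; };

float4 SampleOffset(float2 uv)
{
    // float2 size; g_Tex.GetDimensions(size.x, size.y);  // TEX instruction - avoid if TEX-bound
    float2 texel = 1.0f / g_TexSize;  // from the constant buffer instead
    return g_Tex.Sample(g_Sampler, uv + texel);
}
```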

#24: Avoid indexing into arrays of shader variables - this has a high performance impact.

#25: Pack Vertex Shader outputs to a float4 vector to optimize attributes storage.
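As a simple illustration (names made up), two float2 texture coordinates can share one float4 interpolator instead of occupying two:

```hlsl
// Instead of two half-used float2 interpolators...
struct VSOut_Unpacked
{
    float4 pos : SV_Position;
    float2 uv0 : TEXCOORD0;
    float2 uv1 : TEXCOORD1;
};

// ...pack both into a single full float4 attribute:
struct VSOut_Packed
{
    float4 pos  : SV_Position;
    float4 uv01 : TEXCOORD0;   // xy = uv0, zw = uv1
};
```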

Personal notes

These are my personal notes on the tips. Remember, I do not work at AMD, and I'm human; I could be wrong:
#1: Issues with Z-Fighting? Use D32_FLOAT_S8X24_UINT format with no performance or memory impact compared to D24S8.
Interestingly, AMD was not recommending DXGI_FORMAT_D24_UNORM_S8_UINT for shadow maps in 2008. They recommended using DXGI_FORMAT_D16_UNORM (better) or DXGI_FORMAT_D32_FLOAT (slower) instead.
NVIDIA, on the other hand, recommended DXGI_FORMAT_D24_UNORM_S8_UINT and noted that DXGI_FORMAT_D32_FLOAT has lower ZCULL efficiency. And unlike AMD, they completely disregarded DXGI_FORMAT_D16_UNORM, as it would not save memory or increase performance.

Source: GDC 08 DirectX 10 Performance

#6: RGBA16 and RGBA16F are fast export, use those to pack G-Buffer data and avoid ROP bottlenecks.

According to Thibieroz in 2011, export costs should be calculated as follows:
AMD: Total Export Cost = ( Num RTs ) * ( Slowest RT )
NVIDIA: Total Export Cost = Cost( RT0 ) + Cost( RT1 ) + Cost( RT2 ) +...

I don't know if the same formula still applies for GCN architecture.

AMD was discouraging the use of RGBA16 back then, so GCN has probably improved in this aspect.
NVIDIA said cost is proportional to bit depth except:
  • <32bpp same speed as 32bpp
  • sRGB formats are slower
  • 1010102 & 111110 are slower than 8888
Source: GDC 2011 Deferred Shading Optimizations

#12: Don't forget to optimize geometry for index locality and sequential read access - including procedural geometry.
AMD Tootle is an excellent tool for that. There is also a paper that Tootle was inspired by (or was it the other way around?), so look it up if you want to do your own implementation.
It's worth noting this tip is even more important for Mobile GPUs.

#17: Filtering 64-bit texture formats is half-rate on current GCN architectures, only use if needed.
When asked further about it by Doug Binks, Thibieroz clarified he meant that 64-bit bilinear filtering is half the rate of 64-bit point filtering. I had the same doubt, so I thought it was worth mentioning.

#18: clip, discard, alpha-to-mask and writing to oMask or oDepth disable Early-Z when depth writes are on.
Won Chun asked "Oh, only clip/mask/discarded fragments are affected, not subsequent fragments that land on the same pixel. Cool." to which Thibieroz replied "Correct :)"
This is important. IIRC, on some old hardware, using clip/discard would not only disable Early-Z for that draw call, but also for any subsequent draw call, even if the later passes didn't use discard at all.

As a further note, it looks like GCN is a step backwards compared to the ATI Radeon HD 2000. According to Persson's Depth In Depth, pages 2 and 3, Early Z was enabled in the discard/clip cases. Maybe the documentation was incorrect. Maybe it's a step backwards. Bummer.

#24: Avoid indexing into arrays of shader variables - this has a high performance impact.

I asked whether he was referring to constant waterfalling. For those who don't know, constant waterfalling happens when indexing constant variables as an array. When the many vertices being processed together each index a different portion of the constant registers, the operations have to be serialized (e.g. HW skinning & some forms of instancing). Therefore, you should arrange/sort the vertices so that they all read the same index sequentially, to reduce serialization. Constant waterfalling doesn't happen if all your vertices access the same indices in the same order (e.g. likely when doing lighting calculations).
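To illustrate the waterfalling case just described, here is a typical skinning-style dynamic index into a constant array (all names are illustrative):

```hlsl
// Each vertex reads different elements of g_BoneMatrices. When vertices in
// the same wavefront use different indices, the hardware serializes the
// constant reads ("constant waterfalling").
cbuffer Bones { float4x4 g_BoneMatrices[128]; };

float4 SkinPosition(float4 pos, uint4 boneIndices, float4 boneWeights)
{
    float4 result = 0;
    [unroll]
    for (uint i = 0; i < 4; ++i)
        result += boneWeights[i] * mul(pos, g_BoneMatrices[boneIndices[i]]);
    return result;
}
```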

Anyway, he wasn't referring to that; he answered: "Referring to declaring (and indexing into) arrays of temp variables. Those increase GPR usage and affect latency hiding".
Well, that one's new for me!


Last Updated:  2013-03-16

6 comments:

  1. Thanks for gathering those tips. Here are a few clarifications.

    Regarding #1: Issues with Z-Fighting? Use D32_FLOAT_S8X24_UINT format with no performance or memory impact compared to D24S8.
    Preferred shadow map format is D16. Then D32F. D24X8 is same performance as D32F for GCN.
    The recommended use of D32_FLOAT_S8X24_UINT is for the depth buffer of the main render (not shadow maps) since a lot of developers assume D24S8 is their only choice when a stencil buffer is needed.
    And for even more precision the 1-Z trick will help at longer distances.

    Regarding #6: RGBA16 and RGBA16F are fast export, use those to pack G-Buffer data and avoid ROP bottlenecks.
    On GCN architectures (7x00 and other HW using the same architecture) the export cost calculation is simply additive i.e:
    Total Export Cost = Cost( RT0 ) + Cost( RT1 ) + Cost( RT2 ) etc.
    With RGBA16 and RGB16f being fast export it can therefore be beneficial to pack data in, say, a single RGBA16(F) render target rather than two RGBA8 render targets. It really depends whether the PS is export-limited or not (the shorter the PS the more likely it is to be).

    Regarding #17: Filtering 64-bit texture formats is half-rate on current GCN architectures, only use if needed.
    This means don't use bilinear/trilinear/aniso filtering on such formats unless you have to as this will be more expensive.

    #18: clip, discard, alpha-to-mask and writing to oMask or oDepth disable Early-Z when depth writes are on.
    GCN is definitely not a step backward in this area. Make sure to differentiate between HiZ (coarse, tile-based acceptance/rejection test) and Early-Z (per-pixel depth test).

    Regarding #24: Avoid indexing into arrays of shader variables - this has a high performance impact.
    The GCN architecture is more susceptible to register pressure than previous architectures hence the need to ensure shaders are designed with both VGPR and SGPR in mind.

  2. Thanks for taking the time for the further clarifications.

    Good clarification on tip #1: the difference lies in that shadow maps don't need stencil, and D16 is still faster (with D32F as the better-quality option). Good to know.

    Regarding tip #18 (clip vs Early-Z), the Depth In Depth article mentions that "On the Radeon HD 2000 series, Early Z works in all cases." under the "Early-Z" section, which is why I brought it up.
    It is a bit confusing, though; maybe the doc was saying that the Radeon X1000 series never did Early-Z if there is alpha test, while the Radeon HD 2000 series could do it if depth/stencil writes were off.
    I was under the impression, though, that the X1000 would only do Early-Z with depth writes off, while the HD 2000 could always do it.

    Thanks

  3. I am dreaming of a tool that would process my shaders and output performance warnings (#14, #15, #18, #24, etc..) :)

  4. #1
    According to MSDN
    DXGI_FORMAT_D24_UNORM_S8_UINT
    A 32-bit z-buffer format that supports 24 bits for depth and 8 bits for stencil.

    DXGI_FORMAT_D32_FLOAT_S8X24_UINT
    A 32-bit floating-point component, and two unsigned-integer components (with an additional 32 bits). This format supports 32-bit depth, 8-bit stencil, and 24 bits are unused.

    My English isn't great, but doesn't all that mean that DXGI_FORMAT_D24_UNORM_S8_UINT uses 32 bits per pixel and DXGI_FORMAT_D32_FLOAT_S8X24_UINT uses 64 bits? So it doubles memory usage?

    Replies
    1. Hi,

      "In theory", yes. However, one thing is what the DirectX docs say; another thing is what you actually get. The HW only has to respect the behavior and rendering quality. How it internally works is up to each GPU. If they wanted to use 512 bits per pixel (for example), they could do it, as long as it behaves as if it were D24_S8.

      GCN hardware splits D24_S8 into two buffers. The first is a 32-bit Z buffer where the first 24 bits are used and the other 8 bits are left unused; the second is the stencil buffer.
      So even if you request D24_S8, GCN will use two buffers, just like it would with D32_FLOAT_S8X24_UINT; therefore there is no performance difference between the two formats ON GCN HARDWARE.
