
Sunday, May 5, 2013

GCN's Performance Tips 26 to 50 from Nick Thibieroz

This is a follow-up to GCN's 25 Performance Tips from Nick Thibieroz

I intended to update the original post, but changing the post name would invalidate links, and the post was already huge, so I decided it was wiser to split it.

#26: Coalesce reads and writes to Unordered Access Views to help with memory accesses.

#27: Render your skybox last and your first-person geometry first (should be a given by today's standards :)).

#28: Using D16 shadow maps will provide a modest performance boost and a large memory saving.

#29: Avoid unnecessary DISCARD when Map()ping non-CB resources; some apps still do this at least once a frame.

#30: Minimize GS input and output size or consider Vertex Shader-instancing solutions instead.

#31: A dedicated thread solely responsible for making D3D calls is usually the best way to drive the API.

#32: Avoid sparse shader resource slot assignments, e.g. binding resource slot #0 and #127 is a bad idea.

#33: Thread Group Shared Memory accesses are subject to bank conflicts on addresses that are multiples of 32 DWORD.

#34: The D3DXSHADER_IEEE_STRICTNESS shader compiler flag is likely to produce longer shader code.

#35: Use D3D11_USAGE_IMMUTABLE on read-only resources. A surprising number of games don’t!

#36: Avoid calling Map() on DYNAMIC textures as this may require a conversion from tiled to linear memory.

#37: Only use UAV BIND flag when needed. Leaving it on for non-UAV usage will cause perf issues due to extra sync.

#38: Don’t pass SV_POSITION into a pixel shader unless you are going to use it.
#38v2: Passing interpolated screenpos can be better than declaring SV_POSITION in pixel shader especially if PS is short.

#39: Ensure proxy and predicated geometry are spaced by a few draws when using predicated rendering.

#40: Fetch indirections increase execution latency; keep it under control especially for VS and DS stages.

#41: Dynamic indexing into a Constant Buffer counts as fetch indirection and should be avoided.

#42: Always clear MSAA render targets before rendering.

#43: With cascaded shadow maps use area culling to exclude geometry already rendered in finer shadow cascades.

#44: Avoid over-tessellating geometry that produces small triangles in screen space; in general avoid tiny triangles.

#45: Create shaders before textures to give the driver enough time to convert the D3D ASM to GCN ASM.

#46: Atomic operations on a single TGSM location from all threads will serialize TGSM access.
(TGSM = Thread group shared memory).

#47: Improve motion blur performance by POINT sampling along motion vectors if your source image is RGBA16F.

#48: MIPMapping is underrated - don't forget to use it on displacement maps and volume textures too.

#49: Trilinear is up to 2x the cost of bilinear. Bilinear on 3D textures is 2x the cost of 2D. Aniso cost depends on taps
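Those ratios can be folded into a quick back-of-the-envelope cost helper (illustrative only; anisotropic filtering is left out since, as the tip says, its cost depends on the number of taps):

```python
def filter_cost(trilinear=False, volume=False):
    """Relative filtering cost per the ratios quoted in tip #49:
    trilinear is ~2x bilinear, and 3D textures are ~2x their 2D
    equivalent. Baseline 1.0 = 2D bilinear."""
    cost = 1.0
    if trilinear:
        cost *= 2
    if volume:
        cost *= 2
    return cost

assert filter_cost() == 1.0                           # 2D bilinear
assert filter_cost(trilinear=True) == 2.0             # 2D trilinear
assert filter_cost(trilinear=True, volume=True) == 4.0  # 3D trilinear
```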

#50: Avoid heavy switching between compute and rendering jobs. Jobs of the same type should be done consecutively.

Personal notes

These are my personal notes on the tips. Remember, I do not work at AMD and I'm only human, so I could be wrong:
#31: A dedicated thread solely responsible for making D3D calls is usually the best way to drive the API.
There's A LOT of information regarding this tip in his GDC 2013 slides "DirectX 11 Performance Reloaded"
(slides 12 & 13 regarding this tip in particular)

#38: Don’t pass SV_POSITION into a pixel shader unless you are going to use it.
#38v2: Passing interpolated screenpos can be better than declaring SV_POSITION in pixel shader especially if PS is short.
Tip #38 appears twice because it was rephrased (the original wording was misleading, so tip #38 v2 replaced it).
Actual quote: "SV_POSITION will actually get removed from pixel shader stage if not used. Will rephrase this tip."

#43: With cascaded shadow maps use area culling to exclude geometry already rendered in finer shadow cascades.

While not entirely related, Andrew Lauritzen suggested his older SDSM paper (Sample Distribution Shadow Maps), which, long story short, is a variant of PSSM that dynamically adjusts the frustum corners (i.e. the "limits") of each split instead of using fixed ones.
If memory serves, it uses Compute Shaders to analyze the ideal split limits, which is the main reason I never ported it to Ogre :(


Last but not least, something from an older tip #23:
#23: GetDimensions() is a TEX instruction; prefer storing texture dimensions in a Constant Buffer if TEX-bound.
Pope Kim was interested in this function and asked whether, since it's a TEX instruction, the result would be cached. Nick Thibieroz made an important clarification:
"TEX instructions will be grouped together but no caching since no texcoords are provided in GetDimensions()"
I thought this was really worth mentioning.

The End

Well, Nick Thibieroz said tip #50 would conclude the series of Twitter tips, and that he was planning to wrap them up into a single document. I'll be looking forward to seeing it.
I enjoyed reading these tips, as I learned a lot from the insights into the new architecture.
It may do some good to start thinking of this generation of GPUs onwards as CPU-like architectures (i.e. a texture unit is just an address pointer to RAM, with a header in RAM describing the filtering method and texture bpp) rather than the old-fashioned fixed-function state machines they used to be.

If you're a hardcore HW fan who wants to know more about GCN's GPU, you may want to check out this slide.

Saturday, March 16, 2013

GCN's 25 Performance Tips from Nick Thibieroz

Update: Tips 26 to 50 were published.

Nicolas Thibieroz from AMD has been posting a daily series of "Performance Tips" through his Twitter account.

These performance tips refer to the GCN architecture, which stands for "Graphics Core Next" and can be found in the ATI Radeon HD 7000 series.

I do NOT work at AMD, and I thought it would be a shame if all these tips were lost in the Twitterverse, as it is not the most reliable place to keep long-term documentation or tips. They would just get lost forever or become scattered.

So, I've gathered all of them and posted here. I will try to keep this page up to date as he keeps posting more of them.
Before you start reading, if you're really new to how GPUs work, or some of these tips leave you thinking "wth is he talking about??", I highly recommend Emil Persson's ATI Radeon HD 2000 programming guide & Depth In Depth. Reading NVIDIA's old GPU Programming Guide for the GeForce 8 & 9 Series is also very enlightening and can contrast the differences.
It's noteworthy that many old tips still apply, and that some of the new tips are also general advice (they apply to many other archs as well, including old ones).

Ok, here you go:
#1: Issues with Z-Fighting? Use D32_FLOAT_S8X24_UINT format with no performance or memory impact compared to D24S8.

#2: Binding a depth buffer as a texture will decompress it, making subsequent Z ops more expensive.

#3: Invest in DirectCompute R&D to unlock new performance levels in your games.

#4: On current GCN DX11 drivers the maximum recommended size for NO_OVERWRITE dynamic buffers is 4Mb.

#5: Limit Vertex and Domain Shader output size to 4 float4/int4 attributes for best performance.

#6: RGBA16 and RGBA16F are fast export, use those to pack G-Buffer data and avoid ROP bottlenecks.

#7: Design your game engine with geometry instancing support from an early stage.

#8: Pure Vertex Shader-based solutions can be faster than using the GS or HS/DS.

#9: Use a ring of STAGING resources to update textures. UpdateSubresource is slow unless texture size is <4Kb.
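A minimal sketch of the "ring of STAGING resources" idea: rotate through several staging resources round-robin so the CPU never touches one the GPU may still be copying from (the ring size of three here is an illustrative assumption; use enough entries to cover your frames in flight):

```python
class StagingRing:
    """Round-robin over N staging resources so the CPU never maps a
    resource the GPU may still be reading for a pending copy."""
    def __init__(self, resources):
        self.resources = list(resources)
        self.index = 0

    def next(self):
        res = self.resources[self.index]
        self.index = (self.index + 1) % len(self.resources)
        return res

# Three staging buffers in flight; frame N reuses the buffer from frame N-3.
ring = StagingRing(["staging0", "staging1", "staging2"])
frames = [ring.next() for _ in range(4)]
assert frames == ["staging0", "staging1", "staging2", "staging0"]
```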

#10: DX11 supports free-threaded resource creation, use it to reduce shader compilation and texture loading times.

#11: Use the smallest Input Layout necessary for a given VS; this is especially important for depth-only rendering.

#12: Don't forget to optimize geometry for index locality and sequential read access - including procedural geometry.

#13: Implement backface culling in Hull Shader if tessellation factors are on the high side.

#14: Use flow control in shaders but watch out for GPR pressure caused by deep nested branches.

#15: Some shader instructions are costly; pre-compute constants and store them in constant buffers (e.g. reciprocals).

#16: Use [maxtessfactor(X)] in Hull Shader declaration to control tessellation costs. Max recommended value is 15.

#17: Filtering 64-bit texture formats is half-rate on current GCN architectures, only use if needed.

#18: clip, discard, alpha-to-mask and writing to oMask or oDepth disable Early-Z when depth writes are on.

#19: Writing to a UAV or oDepth disables both Early-Z and Hi-Z unless conservative oDepth is used.

#20: DispatchIndirect() and Draw[Indexed]InstancedIndirect() can be used to implement some form of conditional rendering.

#21: Use a multiple of 64 in Compute Shader threadgroup declaration. 256 is often a good choice.
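GCN wavefronts are 64 threads wide, so a group size that is not a multiple of 64 leaves hardware lanes idle. A quick way to see the waste (hypothetical helper, not an API):

```python
import math

WAVEFRONT = 64  # GCN wavefront width

def lane_utilization(group_size):
    """Fraction of hardware lanes doing useful work for one group:
    groups are padded up to whole wavefronts."""
    waves = math.ceil(group_size / WAVEFRONT)
    return group_size / (waves * WAVEFRONT)

assert lane_utilization(256) == 1.0   # multiple of 64: no wasted lanes
assert lane_utilization(96) == 0.75   # 96 threads still occupy 2 full waves
```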

#22: Occlusion queries will stall the CPU if not used correctly.

#23: GetDimensions() is a TEX instruction; prefer storing texture dimensions in a Constant Buffer if TEX-bound.

#24: Avoid indexing into arrays of shader variables - this has a high performance impact.

#25: Pack Vertex Shader outputs to a float4 vector to optimize attributes storage.

Personal notes

These are my personal notes on the tips. Remember, I do not work at AMD and I'm only human, so I could be wrong:
#1: Issues with Z-Fighting? Use D32_FLOAT_S8X24_UINT format with no performance or memory impact compared to D24S8.
Interestingly, AMD was not recommending DXGI_FORMAT_D24_UNORM_S8_UINT for shadow maps in 2008. They instead recommended using DXGI_FORMAT_D16_UNORM (better) or DXGI_FORMAT_D32_FLOAT (slower).
NVIDIA, on the other hand, recommended DXGI_FORMAT_D24_UNORM_S8_UINT and noted that DXGI_FORMAT_D32_FLOAT has lower ZCULL efficiency. And unlike AMD, they completely disregarded DXGI_FORMAT_D16_UNORM, as it would not save memory or increase performance.

Source: GDC 08 DirectX 10 Performance

#6: RGBA16 and RGBA16F are fast export, use those to pack G-Buffer data and avoid ROP bottlenecks.

According to Thibieroz in 2011, export costs should be calculated as follows:
AMD: Total Export Cost = ( Num RTs ) * ( Slowest RT )
NVIDIA: Total Export Cost = Cost( RT0 ) + Cost( RT1 ) + Cost( RT2 ) +...
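Written out as code, the two 2011 formulas look like this (a straight transcription; costs are in arbitrary relative units):

```python
def amd_export_cost(rt_costs):
    """AMD (2011): every render target exports at the speed of the
    slowest one, so the total is num RTs * slowest RT."""
    return len(rt_costs) * max(rt_costs)

def nvidia_export_cost(rt_costs):
    """NVIDIA (2011): per-RT costs simply add up."""
    return sum(rt_costs)

# Example G-Buffer: three cheap RTs plus one twice as expensive.
rts = [1, 1, 1, 2]
assert amd_export_cost(rts) == 8     # 4 RTs * slowest (2)
assert nvidia_export_cost(rts) == 5  # 1 + 1 + 1 + 2
```

Note how one slow render target drags down every other RT under the AMD formula, which is why the tip pushes uniform, fast-export formats for the whole G-Buffer.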

I don't know if the same formula still applies for GCN architecture.

AMD was discouraging the use of RGBA16 back then, so probably GCN improved in this aspect.
NVIDIA said cost is proportional to bit depth except:
  • <32bpp same speed as 32bpp
  • sRGB formats are slower
  • 1010102 & 111110 are slower than 8888
Source: GDC 2011 Deferred Shading Optimizations

#12: Don't forget to optimize geometry for index locality and sequential read access - including procedural geometry.
AMD Tootle is an excellent tool for that. There was also a paper that Tootle was inspired by (or was it the other way around?), so look for it if you want to do your own implementation.
It's worth noting this tip is even more important for Mobile GPUs.
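To make "index locality" concrete, it can be measured with a simulated post-transform vertex cache (an illustrative FIFO model; real cache sizes and replacement policies vary by GPU):

```python
def acmr(indices, cache_size=16):
    """Average Cache Miss Ratio: vertex-shader invocations (cache
    misses) per triangle for a simulated FIFO post-transform cache.
    Lower is better; the floor is 0.5 for a perfect strip order."""
    cache, misses = [], 0
    for idx in indices:
        if idx not in cache:
            misses += 1
            cache.append(idx)
            if len(cache) > cache_size:
                cache.pop(0)  # FIFO eviction
    return misses / (len(indices) / 3)

# A strip-ordered quad pair reuses recently transformed vertices:
good = [0, 1, 2,  2, 1, 3,  2, 3, 4,  4, 3, 5]
assert acmr(good) == 1.5  # 6 unique vertices / 4 triangles
```

Tools like Tootle reorder the index buffer to drive this ratio down.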

#17: Filtering 64-bit texture formats is half-rate on current GCN architectures, only use if needed.
When asked further about it by Doug Binks, Thibieroz clarified he meant that 64-bit bilinear filtering is half the rate of 64-bit point filtering. I had the same doubt, so I thought it was worth mentioning.

#18: clip, discard, alpha-to-mask and writing to oMask or oDepth disable Early-Z when depth writes are on.
Won Chun asked "Oh, only clip/mask/discarded fragments are affected, not subsequent fragments that land on the same pixel. Cool." to which Thibieroz replied "Correct :)"
This is important. IIRC on some old hardware, using clip/discard would not only disable Early-Z for that draw call, but also for subsequent draw calls, even if those later passes didn't use discard at all.

As a further note, it looks like GCN is a step backwards compared to the ATI Radeon HD 2000. According to Persson's Depth In Depth, pages 2 and 3, Early-Z was enabled for the discard/clip cases. Maybe the documentation was incorrect. Maybe it's a step backwards. Bummer.

#24: Avoid indexing into arrays of shader variables - this has a high performance impact.

I asked whether he was referring to constant waterfalling. For those who don't know, Constant Waterfalling happens when indexing constant variables as an array: when many vertices being processed together each index a different portion of the constant registers, the operations have to be serialized (e.g. HW skinning and some forms of instancing). Therefore, you should arrange/sort the vertices so that they all read the same index sequentially, to reduce serialization. Constant Waterfalling doesn't happen if all your vertices access the same indices in the same order (as is likely when doing lighting calculations).
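The serialization can be sketched with a rough model (illustrative only; real hardware groups threads per wavefront, simplified here to a single group):

```python
def waterfall_passes(indices_per_thread):
    """Each distinct constant-register index accessed within a thread
    group costs one serialized fetch pass, so the group takes as many
    passes as there are distinct indices."""
    return len(set(indices_per_thread))

# HW skinning: every vertex indexes a different bone matrix -> serialized
assert waterfall_passes([0, 1, 2, 3]) == 4
# Lighting: all vertices read the same light index -> a single pass
assert waterfall_passes([5, 5, 5, 5]) == 1
```

Sorting vertices so neighbors share indices shrinks the distinct-index count per group, which is exactly what the note above recommends.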

Anyway, he wasn't referring to that, he answered: "Referring to declaring (and indexing into) arrays of temp variables. Those increase GPR usage and affect latency hiding".
Well, that one's new to me!


Last Updated: 2013-03-16