I intended to update the original post, but changing the post name would invalidate links, and the post was already huge, so I decided it was wiser to split it.
#26: Coalesce reads and writes to Unordered Access Views to help with memory accesses.
#27: Render your skybox last and your first-person geometry first (should be a given by today's standards :)).
#28: Using D16 shadow maps will provide a modest performance boost and a large memory saving.
#29: Avoid unnecessary DISCARD when Map()ping non-CB resources; some apps still do this at least once a frame.
#30: Minimize GS input and output size or consider Vertex Shader-instancing solutions instead.
#31: A dedicated thread solely responsible for making D3D calls is usually the best way to drive the API.
#32: Avoid sparse shader resource slot assignments, e.g. binding resource slot #0 and #127 is a bad idea.
#33: Thread Group Shared Memory accesses are subject to bank conflicts on addresses that are multiples of 32 DWORD.
#34: The D3DXSHADER_IEEE_STRICTNESS shader compiler flag is likely to produce longer shader code.
#35: Use D3D11_USAGE_IMMUTABLE on read-only resources. A surprising number of games don’t!
#36: Avoid calling Map() on DYNAMIC textures as this may require a conversion from tiled to linear memory.
#37: Only use UAV BIND flag when needed. Leaving it on for non-UAV usage will cause perf issues due to extra sync.
#38: Don’t pass SV_POSITION into a pixel shader unless you are going to use it.
#38v2: Passing interpolated screenpos can be better than declaring SV_POSITION in pixel shader especially if PS is short.
#39: Ensure proxy and predicated geometry are spaced by a few draws when using predicated rendering.
#40: Fetch indirections increase execution latency; keep it under control especially for VS and DS stages.
#41: Dynamic indexing into a Constant Buffer counts as fetch indirection and should be avoided.
#42: Always clear MSAA render targets before rendering.
#43: With cascaded shadow maps use area culling to exclude geometry already rendered in finer shadow cascades
#44: Avoid over-tessellating geometry that produces small triangles in screen space; in general avoid tiny triangles.
#45: Create shaders before textures to give the driver enough time to convert the D3D ASM to GCN ASM.
#46: Atomic operations on a single TGSM location from all threads will serialize TGSM access.
(TGSM = Thread group shared memory).
#47: Improve motion blur performance by POINT sampling along motion vectors if your source image is RGBA16F.
#48: MIPMapping is underrated - don't forget to use it on displacement maps and volume textures too.
#49: Trilinear is up to 2x the cost of bilinear. Bilinear on 3D textures is 2x the cost of 2D. Aniso cost depends on taps
#50: Avoid heavy switching between compute and rendering jobs. Jobs of the same type should be done consecutively.
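Before moving on to my notes, here are a couple of quick sketches I wrote to illustrate two of the simpler tips (these are my own examples, not from the tips' author). For tip #29, the usual pattern on dynamic vertex/index buffers is to Map() with NO_OVERWRITE and only DISCARD when the buffer wraps around; all names here are made up:

#include <d3d11.h>
#include <cstring>

// Assumed to be the app's dynamic vertex buffer state (illustrative names).
extern ID3D11DeviceContext *context;
extern ID3D11Buffer        *dynamicVB;
extern UINT                 bufferOffset;   // current write cursor, in bytes
extern UINT                 bufferCapacity; // total size of dynamicVB, in bytes

void appendVertices( const void *srcVerts, UINT bytesNeeded )
{
    D3D11_MAP mapType = D3D11_MAP_WRITE_NO_OVERWRITE; // default: don't discard
    if( bufferOffset + bytesNeeded > bufferCapacity )
    {
        mapType      = D3D11_MAP_WRITE_DISCARD; // only discard when we wrap around
        bufferOffset = 0;
    }

    D3D11_MAPPED_SUBRESOURCE mapped;
    context->Map( dynamicVB, 0, mapType, 0, &mapped );
    memcpy( (unsigned char*)mapped.pData + bufferOffset, srcVerts, bytesNeeded );
    context->Unmap( dynamicVB, 0 );
    bufferOffset += bytesNeeded;
}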
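And a minimal sketch for tip #35: marking resources that never change as IMMUTABLE only requires passing the initial data at creation time (again, names are illustrative):

#include <d3d11.h>

struct Vertex { float pos[3]; float uv[2]; }; // example layout

// 'device', 'vertices' and 'vertexCount' are assumed to come from the app.
ID3D11Buffer* createImmutableVB( ID3D11Device *device,
                                 const Vertex *vertices, UINT vertexCount )
{
    D3D11_BUFFER_DESC desc = {};
    desc.ByteWidth      = sizeof( Vertex ) * vertexCount;
    desc.Usage          = D3D11_USAGE_IMMUTABLE;   // never updated after creation
    desc.BindFlags      = D3D11_BIND_VERTEX_BUFFER;
    desc.CPUAccessFlags = 0;                       // no CPU access on IMMUTABLE

    D3D11_SUBRESOURCE_DATA initData = {};
    initData.pSysMem = vertices;                   // initial data is mandatory here

    ID3D11Buffer *vb = nullptr;
    device->CreateBuffer( &desc, &initData, &vb );
    return vb;
}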
Personal notes
These are my personal notes on the tips. Remember, I do not work at AMD and I'm only human, so I could be wrong:

#31: A dedicated thread solely responsible for making D3D calls is usually the best way to drive the API.
There's a LOT of information regarding this tip in Nick Thibieroz's GDC 2013 slides "DirectX 11 Performance Reloaded" (slides 12 & 13 cover this tip in particular).
#38: Don’t pass SV_POSITION into a pixel shader unless you are going to use it.
Tip #38 appears twice because it was rephrased (the original wording was misleading, so tip #38 v2 replaced it):
#38v2: Passing interpolated screenpos can be better than declaring SV_POSITION in pixel shader especially if PS is short.
Actual quote: "SV_POSITION will actually get removed from pixel shader stage if not used. Will rephrase this tip."
#43: With cascaded shadow maps use area culling to exclude geometry already rendered in finer shadow cascades
While not entirely related, Andrew Lauritzen suggested his older SDSM paper (Sample Distribution Shadow Maps), which, long story short, is a variant of PSSM that dynamically adjusts the frustum corners (i.e. the "limits") of each split, instead of using manually chosen ones.
If memory serves well, it uses Compute Shaders to analyze the ideal split limits, which was the main reason I never ported it to Ogre :(
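For reference, the core idea as I understand it is simple: once a reduction over the depth buffer gives you the min/max depth actually visible this frame, you recompute the split distances between those two values instead of between fixed near/far planes. A toy CPU-side sketch of that last step (my own, using the usual log/uniform blend from PSSM, not code from the paper):

#include <cmath>
#include <vector>

// zMin/zMax come from the per-frame depth reduction; zMin must be > 0.
std::vector<float> computeSplits( float zMin, float zMax, int numSplits,
                                  float lambda /* 0 = uniform, 1 = logarithmic */ )
{
    std::vector<float> splits( numSplits + 1 );
    for( int i = 0; i <= numSplits; ++i )
    {
        const float t   = float( i ) / float( numSplits );
        const float log = zMin * std::pow( zMax / zMin, t );
        const float uni = zMin + ( zMax - zMin ) * t;
        splits[i] = lambda * log + ( 1.0f - lambda ) * uni;
    }
    return splits;
}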
Last but not least, something from an older tip #23:
#23: GetDimensions() is a TEX instruction; prefer storing texture dimensions in a Constant Buffer if TEX-bound.
Pope Kim was interested in this function and asked whether, since it's a TEX instruction, the result would be cached. Nick Thibieroz made an important clarification:
"TEX instructions will be grouped together but no caching since no texcoords are provided in GetDimensions()"
I thought this was really worth mentioning.
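As a side note, the CPU-side part of tip #23 is trivial: query the texture size once at load time and stick it in a constant buffer so the shader never needs GetDimensions(). A quick sketch (the struct and names are just an example):

#include <d3d11.h>

// Illustrative layout; 4 floats = 16 bytes, which satisfies the CB size requirement.
struct TextureInfoCB
{
    float width;
    float height;
    float invWidth;
    float invHeight;
};

TextureInfoCB buildTextureInfo( ID3D11Texture2D *texture )
{
    D3D11_TEXTURE2D_DESC texDesc;
    texture->GetDesc( &texDesc ); // queried once on the CPU, at load time

    TextureInfoCB cb;
    cb.width     = float( texDesc.Width );
    cb.height    = float( texDesc.Height );
    cb.invWidth  = 1.0f / cb.width;
    cb.invHeight = 1.0f / cb.height;
    return cb; // ...then upload into a constant buffer via UpdateSubresource() or Map()
}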
The End
Well, Nick Thibieroz said tip #50 would conclude the series of Twitter tips, and that he was planning to wrap them up into a single document. I'm looking forward to seeing it.
I enjoyed reading these tips, as I learned a lot from the insights into the new architecture.
It may do us good to start thinking of this generation of GPUs onwards as CPU-like architectures (e.g. a texture unit is just an address pointer to RAM, with a header, also in RAM, describing the filtering method and texture bpp) rather than the old-fashioned fixed-function state machines they used to be.
If you're a hardcore HW fan who wants to know more about the GCN architecture, you may want to check out this slide.
> If memory serves well, it uses Compute Shaders to analyze the ideal split limits, which was the main reason I never ported it to Ogre :(
The sample does for convenience, but it works just as well with pixel shaders. It's a few simple reductions on the depth buffer.
Thanks for the insight! I should definitely revisit the paper in the near future.
We never did proper INTZ support in Ogre; in other words, we have no proper depth textures. The design for it is there though and isn't hard to code. It's a pending issue.
We're now focusing on Ogre 2.0 and the Compositor is getting a review, which should support depth textures when it's done.
Thanks again!