I've had it for a while (mainly just to store royalty-free music), but now I've decided to join everything in one place:
Ogre Meshy, Royalty Free Music, Distant Souls & the blog.
It's easier for me to manage by keeping everything together. I'll be updating that blog more often (particularly during GSoC) from now on.
I wish I had the time to upload more music. There are around 100 pieces on my HDD waiting for mastering & uploading.
This blog won't be closed; it will be kept as an archive of the old stuff.
Saturday, June 15, 2013
Sunday, May 5, 2013
GCN's Performance Tips 26 to 50 from Nick Thibieroz
This is a follow-up to GCN's 25 Performance Tips from Nick Thibieroz.
I intended to update the original post, but changing the post name would invalidate links, and the post was already huge, so I decided it was wiser to split it.
#26: Coalesce reads and writes to Unordered Access Views to help with memory accesses.
#27: Render your skybox last and your first-person geometry first (should be a given by today's standards :)).
#28: Using D16 shadow maps will provide a modest performance boost and a large memory saving.
#29: Avoid unnecessary DISCARD when Map()ping non-CB resources; some apps still do this at least once a frame.
#30: Minimize GS input and output size or consider Vertex Shader-instancing solutions instead.
#31: A dedicated thread solely responsible for making D3D calls is usually the best way to drive the API.
#32: Avoid sparse shader resource slot assignments, e.g. binding resource slot #0 and #127 is a bad idea.
#33: Thread Group Shared Memory accesses are subject to bank conflicts on addresses that are multiples of 32 DWORD.
#34: The D3DXSHADER_IEEE_STRICTNESS shader compiler flag is likely to produce longer shader code.
#35: Use D3D11_USAGE_IMMUTABLE on read-only resources. A surprising number of games don’t! (see the sketch after this list)
#36: Avoid calling Map() on DYNAMIC textures as this may require a conversion from tiled to linear memory.
#37: Only use UAV BIND flag when needed. Leaving it on for non-UAV usage will cause perf issues due to extra sync.
#38: Don’t pass SV_POSITION into a pixel shader unless you are going to use it.
#38v2: Passing interpolated screenpos can be better than declaring SV_POSITION in pixel shader especially if PS is short.
#39: Ensure proxy and predicated geometry are spaced by a few draws when using predicated rendering.
#40: Fetch indirections increase execution latency; keep it under control especially for VS and DS stages.
#41: Dynamic indexing into a Constant Buffer counts as fetch indirection and should be avoided.
#42: Always clear MSAA render targets before rendering.
#43: With cascaded shadow maps use area culling to exclude geometry already rendered in finer shadow cascades
#44: Avoid over-tessellating geometry that produces small triangles in screen space; in general avoid tiny triangles.
#45: Create shaders before textures to give the driver enough time to convert the D3D ASM to GCN ASM.
#46: Atomic operations on a single TGSM location from all threads will serialize TGSM access.
(TGSM = Thread group shared memory).
#47: Improve motion blur performance by POINT sampling along motion vectors if your source image is RGBA16F.
#48: MIPMapping is underrated - don't forget to use it on displacement maps and volume textures too.
#49: Trilinear is up to 2x the cost of bilinear. Bilinear on 3D textures is 2x the cost of 2D. Aniso cost depends on taps
#50: Avoid heavy switching between compute and rendering jobs. Jobs of the same type should be done consecutively.
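Tips #35 and #37 are easy to show in code. Here's a minimal sketch (mine, not AMD's; the function name and parameters are invented for illustration) of creating a read-only vertex buffer the way these tips suggest:

#include <d3d11.h>

// Sketch for tips #35 & #37: read-only data goes in an IMMUTABLE buffer,
// with only the bind flags it actually needs (no UAV flag "just in case").
ID3D11Buffer* createImmutableVertexBuffer( ID3D11Device *device,
                                           const void *vertices, UINT sizeBytes )
{
    D3D11_BUFFER_DESC desc = {};
    desc.ByteWidth = sizeBytes;
    desc.Usage     = D3D11_USAGE_IMMUTABLE;    // never mapped or updated again (#35)
    desc.BindFlags = D3D11_BIND_VERTEX_BUFFER; // no D3D11_BIND_UNORDERED_ACCESS (#37)

    D3D11_SUBRESOURCE_DATA initData = {};
    initData.pSysMem = vertices;               // IMMUTABLE requires initial data

    ID3D11Buffer *buffer = 0;
    device->CreateBuffer( &desc, &initData, &buffer );
    return buffer;
}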
Personal notes
These are my personal notes on the tips. Remember, I do not work at AMD and I'm human; I could be wrong.

#31: A dedicated thread solely responsible for making D3D calls is usually the best way to drive the API.
There's A LOT of information regarding this tip in his GDC 2013 slides "DirectX 11 Performance Reloaded" (slides 12 & 13 cover this tip in particular). A minimal sketch of the idea is below.
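To make the idea concrete, here's a minimal sketch assuming a simple locking queue; everything here (class name, queue type) is my own illustration, not anything from the slides:

#include <thread>
#include <mutex>
#include <queue>
#include <functional>
#include <condition_variable>

// One dedicated thread is the only one that ever touches the D3D context;
// other threads just enqueue closures containing the D3D calls.
class RenderThread
{
    std::queue< std::function<void()> > mQueue;
    std::mutex              mMutex;
    std::condition_variable mCond;
    bool                    mQuit = false;
    std::thread             mThread;

public:
    RenderThread() : mThread( &RenderThread::run, this ) {}
    ~RenderThread()
    {
        { std::lock_guard<std::mutex> lock( mMutex ); mQuit = true; }
        mCond.notify_one();
        mThread.join();
    }

    // Called from any thread; the closure should contain the D3D calls.
    void enqueue( std::function<void()> job )
    {
        { std::lock_guard<std::mutex> lock( mMutex ); mQueue.push( std::move( job ) ); }
        mCond.notify_one();
    }

private:
    void run()  // the only thread that issues D3D calls
    {
        for( ;; )
        {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock( mMutex );
                mCond.wait( lock, [this]{ return mQuit || !mQueue.empty(); } );
                if( mQuit && mQueue.empty() )
                    return;
                job = std::move( mQueue.front() );
                mQueue.pop();
            }
            job();  // execute outside the lock so producers aren't blocked
        }
    }
};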
#38: Don’t pass SV_POSITION into a pixel shader unless you are going to use it.
Tip #38 appears twice because it was rephrased (the original tip #38 was misleading, so tip #38 v2 replaced it):
#38v2: Passing interpolated screenpos can be better than declaring SV_POSITION in pixel shader especially if PS is short.
Actual quote: "SV_POSITION will actually get removed from pixel shader stage if not used. Will rephrase this tip."
#43: With cascaded shadow maps use area culling to exclude geometry already rendered in finer shadow cascades
While not entirely related, Andrew Lauritzen suggested his old SDSM paper (Sample Distribution Shadow Maps), which, long story short, is a variant of PSSM that dynamically adjusts the frustum corners (i.e. the "limits") of each split, instead of using fixed, hand-tuned ones.
If memory serves well, it uses Compute Shaders to analyze the ideal split limits, which was the main reason I never ported it to Ogre :(
Last but not least, something from an older tip #23:
#23: GetDimensions() is a TEX instruction; prefer storing texture dimensions in a Constant Buffer if TEX-bound.
Pope Kim was interested in this function and asked whether, since it's a TEX instruction, the result would be cached. Nick Thibieroz made an important clarification:
"TEX instructions will be grouped together but no caching since no texcoords are provided in GetDimensions()"
I thought this was really worth mentioning.
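Since tip #23 amounts to replacing GetDimensions() with data you already know on the CPU, here's a minimal sketch of the constant-buffer route; the struct and function names are mine, invented for illustration:

#include <d3d11.h>

// Sketch for tip #23: upload texture dimensions once via a constant buffer
// so the shader doesn't need a GetDimensions() TEX instruction.
// Mirrored by a 16-byte cbuffer on the HLSL side.
struct TexInfoCB
{
    float width, height, invWidth, invHeight;
};

void uploadTexInfo( ID3D11DeviceContext *context, ID3D11Buffer *cb,
                    UINT texWidth, UINT texHeight )
{
    TexInfoCB data;
    data.width     = (float)texWidth;
    data.height    = (float)texHeight;
    data.invWidth  = 1.0f / texWidth;
    data.invHeight = 1.0f / texHeight;
    // Assumes 'cb' was created with D3D11_USAGE_DEFAULT.
    context->UpdateSubresource( cb, 0, nullptr, &data, 0, 0 );
}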
The End
Well, Nick Thibieroz said tip #50 would conclude the series of Twitter tips, and that he was planning to wrap them up into a single document. I'm looking forward to seeing it.
I enjoyed reading those tips, as I learned a lot from the insights into the new architecture.
It may do us good to think of this generation of GPUs onwards as CPU-like architectures (i.e. a TextureUnit is just an address pointer to RAM, with a header in RAM describing the filtering method and texture bpp) rather than the old-fashioned fixed-function state machines they used to be.
If you're a hardcore HW fan who wants to know more about GCN's GPU, you may want to check out this slide.
Saturday, March 30, 2013
Adventures in branchless min-max with VS2012
So, I'm researching the fastest ways to perform min & max, since they're heavily used in Ogre.
I decided to perform a test between the following versions:
- std::min & std::max
- inline min & max using pure C
- inline min & max using SSE intrinsics
The pure C version:

inline const float& min( const float &a, const float &b )
{
    return a < b ? a : b;
}

The SSE version:
inline const float min( const float &left, const float &right )
{
    float retVal;
    _mm_store_ss( &retVal, _mm_min_ss( _mm_set_ss( left ), _mm_set_ss( right ) ) );
    return retVal;
}
Before I continue, a couple notes:
- Everything compiled with /arch:SSE2. Not setting it produced the same code. Setting /arch:IA32 produced ugly & long x87 code.
- /O2 was enabled.
- /LTCG was disabled.
- The same tests were applied to "max" versions, as well as their double counterparts. I got the same results.
- VS2012 Express Edition
- This was a win32 build, not x64
The Test
int main( int argc, char **argv )
{
    float a = *(float*)( &argv[0][0] );
    float b = *(float*)( &argv[0][1] );
    float c = *(float*)( &argv[0][2] );
    float result = min( (-1.0f + 20.5f), a * b + c );
    result = max( result, c );
    std::cout << result;
    return 0;
}

I cout the result to ensure the compiler doesn't wipe the whole code. I cast argv to ensure the compiler doesn't do constant propagation. I could use atof, but that would only clutter the assembly output. I'm not interested in whether the floats are NaNs or the pointers are valid.
Additionally, I even do a constant operation to ensure constant propagation still works as expected (it does).
Using std::min
VS2012 uses comiss + cmovbe to perform the conditional move. Definitely an improvement over previous compiler versions, which generated jumps.
_main PROC ; COMDAT
    push    ebp
    mov     ebp, esp
    sub     esp, 8
    mov     eax, DWORD PTR _argv$[ebp]
    movss   xmm0, DWORD PTR __real@419c0000
    mov     eax, DWORD PTR [eax]
    lea     ecx, DWORD PTR $T1[ebp]
    movss   xmm1, DWORD PTR [eax+1]
    mulss   xmm1, DWORD PTR [eax]
    movss   xmm2, DWORD PTR [eax+2]
    lea     eax, DWORD PTR $T2[ebp]
    addss   xmm1, xmm2
    mov     DWORD PTR $T1[ebp], 1100742656 ; 419c0000H
    movss   DWORD PTR _c$[ebp], xmm2
    comiss  xmm0, xmm1
    movss   DWORD PTR $T2[ebp], xmm1
    cmovbe  eax, ecx
    lea     ecx, DWORD PTR _result$[ebp]
    movss   xmm0, DWORD PTR [eax]
    comiss  xmm2, xmm0
    lea     eax, DWORD PTR _c$[ebp]
    cmovbe  eax, ecx
    push    ecx
    mov     ecx, DWORD PTR __imp_?cout@std@@3V?$basic_ostream@DU?$char_traits@D@std@@@1@A
    movss   DWORD PTR _result$[ebp], xmm0
    movss   xmm0, DWORD PTR [eax]
    movss   DWORD PTR [esp], xmm0
    call    DWORD PTR __imp_??6?$basic_ostream@DU?$char_traits@D@std@@@std@@QAEAAV01@M@Z
    xor     eax, eax
    mov     esp, ebp
    pop     ebp
    ret     0
_main ENDP
The good:
- Code is branchless
The bad:
- There's an interleaving of xmm & gpr registers in the process, and the data is also copied to two locations in RAM. Interleaving like this, plus moving to memory back and forth, doesn't look good at all.
The inline C version
The generated assembly is almost identical to the std::min case, except that it saves one instruction by executing comiss directly on memory instead of caching the value first in an xmm register. This value, by the way, is 19.5, which is the constant arithmetic we did in the C++ code (-1.0f + 20.5f).

_main PROC ; COMDAT
    push    ebp
    mov     ebp, esp
    sub     esp, 8
    mov     eax, DWORD PTR _argv$[ebp]
    lea     ecx, DWORD PTR $T1[ebp]
    mov     eax, DWORD PTR [eax]
    mov     DWORD PTR $T2[ebp], 1100742656 ; 419c0000H
    movss   xmm0, DWORD PTR [eax+1]
    mulss   xmm0, DWORD PTR [eax]
    movss   xmm1, DWORD PTR [eax+2]
    lea     eax, DWORD PTR $T2[ebp]
    addss   xmm0, xmm1
    movss   DWORD PTR _c$[ebp], xmm1
    comiss  xmm0, DWORD PTR __real@419c0000
    movss   DWORD PTR $T1[ebp], xmm0
    cmovbe  eax, ecx
    lea     ecx, DWORD PTR _c$[ebp]
    movss   xmm0, DWORD PTR [eax]
    comiss  xmm0, xmm1
    lea     eax, DWORD PTR _result$[ebp]
    cmovbe  eax, ecx
    push    ecx
    mov     ecx, DWORD PTR __imp_?cout@std@@3V?$basic_ostream@DU?$char_traits@D@std@@@1@A
    movss   DWORD PTR _result$[ebp], xmm0
    movss   xmm0, DWORD PTR [eax]
    movss   DWORD PTR [esp], xmm0
    call    DWORD PTR __imp_??6?$basic_ostream@DU?$char_traits@D@std@@@std@@QAEAAV01@M@Z
    xor     eax, eax
    mov     esp, ebp
    pop     ebp
    ret     0
_main ENDP
I haven't checked whether the actual binary size of the instructions is equal. Also note that some instructions follow a different order, which may pipeline better or worse.
Important remarks:
If the code is changed so that the arguments are not references but rather copies by value, a jump is generated:

inline const float min( const float a, const float b )
{
    return a < b ? a : b;
}

Note the lack of '&'. If compiled with that definition...
_main PROC ; COMDAT
    push    ebp
    mov     ebp, esp
    mov     eax, DWORD PTR _argv$[ebp]
    movss   xmm0, DWORD PTR __real@419c0000
    mov     eax, DWORD PTR [eax]
    movss   xmm1, DWORD PTR [eax+1]
    mulss   xmm1, DWORD PTR [eax]
    movss   xmm2, DWORD PTR [eax+2]
    addss   xmm1, xmm2
    comiss  xmm1, xmm0
    ja      SHORT $LN6@main
    movaps  xmm0, xmm1
$LN6@main:
    comiss  xmm0, xmm2
    ja      SHORT $LN10@main
    movaps  xmm0, xmm2
$LN10@main:
    push    ecx
    mov     ecx, DWORD PTR __imp_?cout@std@@3V?$basic_ostream@DU?$char_traits@D@std@@@1@A
    movss   DWORD PTR [esp], xmm0
    call    DWORD PTR __imp_??6?$basic_ostream@DU?$char_traits@D@std@@@std@@QAEAAV01@M@Z
    xor     eax, eax
    pop     ebp
    ret     0
_main ENDP
The good
- No mixing between general purpose & SSE registers. Everything stays within xmm regs. That's much better.
The bad
- Code contains jumps (branches)
Using SSE2 intrinsics
This is the assembly output from using _mm_min_ss & _mm_max_ss:

_main PROC ; COMDAT
    push    ebp
    mov     ebp, esp
    mov     eax, DWORD PTR _argv$[ebp]
    push    ecx
    mov     eax, DWORD PTR [eax]
    mov     ecx, DWORD PTR __imp_?cout@std@@3V?$basic_ostream@DU?$char_traits@D@std@@@1@A
    movss   xmm0, DWORD PTR [eax+1]
    mulss   xmm0, DWORD PTR [eax]
    movss   xmm2, DWORD PTR [eax+2]
    addss   xmm0, xmm2
    movaps  xmm1, xmm0
    movss   xmm0, DWORD PTR __real@419c0000
    minss   xmm0, xmm1
    movaps  xmm1, xmm0
    movaps  xmm0, xmm2
    maxss   xmm1, xmm0
    movss   DWORD PTR [esp], xmm1
    call    DWORD PTR __imp_??6?$basic_ostream@DU?$char_traits@D@std@@@std@@QAEAAV01@M@Z
    xor     eax, eax
    pop     ebp
    ret     0
_main ENDP

Sweet! That's exactly what we were aiming for. The code translates to minss & maxss instructions.
I was also afraid that the compiler wouldn't do this efficiently (i.e. unnecessarily store xmm0's result from the first minss to the stack and then load it back), but note that it only writes to memory after the second instruction, maxss, is executed!
Not only that, the code is also smaller. It couldn't have gone better. All advantages, no disadvantage.
Completely branchless, no exchange with gpr registers, no conditional moves either!
Doubles
I won't post the results from using doubles, because it's the same: instead of comiss, comisd is used; instead of minss, minsd is generated.
Specializing std::min & std::max
I tried specializing the std functions for float & double to use the SSE2 intrinsics. This actually worked and produced the same optimal assembly with minss & maxss. However, std::min returns by reference, not by value. The only way to implement it is to use a local variable, which is where _mm_store_ss will store the result.
But then the compiler warns that we're returning the address of a local variable, which can be very dangerous, since "const float &r = std::min( a, b )" is valid and would cause undefined behavior.
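To make the danger concrete, here's a sketch of such a specialization (hypothetical name; same body as the SSE version above) and why the returned reference dangles:

#include <xmmintrin.h>

// Hypothetical illustration of the pitfall: keeping std::min's
// return-by-reference signature forces the result into a local.
inline const float& badMin( const float &a, const float &b )
{
    float retVal;   // local storage for the scalar SSE result
    _mm_store_ss( &retVal, _mm_min_ss( _mm_set_ss( a ), _mm_set_ss( b ) ) );
    return retVal;  // WARNING: returning the address of a local variable!
}

// This compiles (with a warning), but the reference dangles:
//     const float &r = badMin( x, y );   // undefined behavior when r is read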
Conclusion
Given what I found, I will be using Ogre::min & Ogre::max to replace std::min & std::max (where floats & doubles are involved).
On x86 architectures, it will use the SSE2 intrinsics, as they translate to optimal assembly. On non-x86 architectures, it will default to the inline C version (or an architecture-specific variant, e.g. NEON intrinsics).
We still have to measure what happens with GCC though.
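For reference, here's a sketch of how such a dispatch could look, assuming common compiler macros; the names are illustrative, not the actual Ogre source:

// Hypothetical sketch of an Ogre::min-style dispatch.
#if defined( _M_IX86 ) || defined( _M_X64 ) || defined( __SSE2__ )
    #include <xmmintrin.h>
    inline float minF( float a, float b )
    {
        float r;
        _mm_store_ss( &r, _mm_min_ss( _mm_set_ss( a ), _mm_set_ss( b ) ) );
        return r;       // by value: avoids the dangling-reference problem
    }
#else
    // Portable fallback; a NEON variant could slot in here on ARM.
    inline float minF( float a, float b ) { return a < b ? a : b; }
#endif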