Saturday, March 30, 2013

Adventures in branchless min-max with VS2012

So, I'm researching the fastest ways to perform min & max because they're used heavily in Ogre.

I decided to perform a test between the following versions:
  • std::min & std::max
  • inline min & max using pure C
  • inline min & max using SSE intrinsics
The C version:
inline const float& min( const float &a, const float &b )
{
    return a < b ? a : b;
}
The SSE version:
inline const float min( const float &left, const float &right )
{
    float retVal;
    _mm_store_ss( &retVal, _mm_min_ss( _mm_set_ss( left ), _mm_set_ss( right ) ) );
    return retVal;
}
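For completeness, the matching max version follows the same pattern with `_mm_max_ss` (a sketch of the shape the max tests mentioned below would take):

```cpp
#include <xmmintrin.h> // SSE: _mm_max_ss, _mm_set_ss, _mm_store_ss

inline const float max( const float &left, const float &right )
{
    float retVal;
    // maxss operates on the lowest lane only; the upper lanes are ignored
    _mm_store_ss( &retVal, _mm_max_ss( _mm_set_ss( left ), _mm_set_ss( right ) ) );
    return retVal;
}
```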

Before I continue, a couple notes:
  • Everything compiled with /arch:SSE2. Not setting it produced the same code. Setting /arch:IA32 produced ugly & long x87 code.
  • /O2 was enabled.
  • /LTCG (link-time code generation) was disabled.
  • The same tests were applied to "max" versions, as well as their double counterparts. I got the same results.
  • VS2012 Express Edition
  • This was a win32 build, not x64

The Test

int main( int argc, char **argv )
{
    float a = *(float*)( &argv[0][0] );
    float b = *(float*)( &argv[0][1] );
    float c = *(float*)( &argv[0][2] );
    float result = min( (-1.0f + 20.5f), a * b + c );
    result = max( result, c );

    std::cout << result;

    return 0;
}
I cout the result to ensure the compiler doesn't optimize away the whole thing. I cast argv to ensure the compiler can't do constant propagation on the inputs. I could use atof, but that would only clutter the assembly output. I'm not interested in whether the floats are NaNs or the pointers are valid.
Additionally, I also include a constant operation to verify that constant propagation still works as expected (it does).


Using std::min

 
_main PROC ; COMDAT
push ebp
mov ebp, esp
sub esp, 8
mov eax, DWORD PTR _argv$[ebp]

movss xmm0, DWORD PTR __real@419c0000
mov eax, DWORD PTR [eax]
lea ecx, DWORD PTR $T1[ebp]
movss xmm1, DWORD PTR [eax+1]
mulss xmm1, DWORD PTR [eax]
movss xmm2, DWORD PTR [eax+2]

lea eax, DWORD PTR $T2[ebp]
addss xmm1, xmm2
mov DWORD PTR $T1[ebp], 1100742656 ; 419c0000H
movss DWORD PTR _c$[ebp], xmm2

comiss xmm0, xmm1
movss DWORD PTR $T2[ebp], xmm1
cmovbe eax, ecx

lea ecx, DWORD PTR _result$[ebp]
movss xmm0, DWORD PTR [eax]
comiss xmm2, xmm0
lea eax, DWORD PTR _c$[ebp]
cmovbe eax, ecx

push ecx
mov ecx, DWORD PTR __imp_?cout@std@@3V?$basic_ostream@DU?$char_traits@D@std@@@1@A
movss DWORD PTR _result$[ebp], xmm0
movss xmm0, DWORD PTR [eax]
movss DWORD PTR [esp], xmm0
call DWORD PTR __imp_??6?$basic_ostream@DU?$char_traits@D@std@@@std@@QAEAAV01@M@Z

xor eax, eax

mov esp, ebp
pop ebp
ret 0
_main ENDP
VS2012 is using comiss + cmovbe to perform the conditional move. Definitely an improvement over previous compiler versions, which generated jumps.

The good:

  • Code is branchless

The bad:

  • There's interleaving between xmm & GPR registers in the process, and the data is also copied to two locations in RAM. Mixing register files like this, plus moving to memory and back, doesn't look good at all.

The inline C version

_main PROC ; COMDAT
push ebp
mov ebp, esp
sub esp, 8
mov eax, DWORD PTR _argv$[ebp]
lea ecx, DWORD PTR $T1[ebp]
mov eax, DWORD PTR [eax]
mov DWORD PTR $T2[ebp], 1100742656 ; 419c0000H
movss xmm0, DWORD PTR [eax+1]
mulss xmm0, DWORD PTR [eax]
movss xmm1, DWORD PTR [eax+2]
lea eax, DWORD PTR $T2[ebp]
addss xmm0, xmm1
movss DWORD PTR _c$[ebp], xmm1
comiss xmm0, DWORD PTR __real@419c0000
movss DWORD PTR $T1[ebp], xmm0
cmovbe eax, ecx
lea ecx, DWORD PTR _c$[ebp]
movss xmm0, DWORD PTR [eax]
comiss xmm0, xmm1
lea eax, DWORD PTR _result$[ebp]
cmovbe eax, ecx
push ecx
mov ecx, DWORD PTR __imp_?cout@std@@3V?$basic_ostream@DU?$char_traits@D@std@@@1@A
movss DWORD PTR _result$[ebp], xmm0
movss xmm0, DWORD PTR [eax]
movss DWORD PTR [esp], xmm0
call DWORD PTR __imp_??6?$basic_ostream@DU?$char_traits@D@std@@@std@@QAEAAV01@M@Z
xor eax, eax
mov esp, ebp
pop ebp
ret 0
_main ENDP
The generated assembly is almost identical to the std::min case, except that it saves one instruction by executing comiss directly on memory instead of first caching the value in an xmm register. That value, by the way, is 19.5 (0x419C0000), the result of the constant arithmetic we did in the C++ code.
I haven't checked whether the actual binary size of the instructions is equal. Also note that some instructions follow a different order, which may pipeline better or worse.

Important remarks:

If the code is changed so that the arguments are passed by value instead of by reference, a jump is generated:
inline const float min( const float a, const float b )
{
    return a < b ? a : b;
}
Note the lack of '&'. If compiled with that definition...
_main PROC ; COMDAT
push ebp
mov ebp, esp
mov eax, DWORD PTR _argv$[ebp]
movss xmm0, DWORD PTR __real@419c0000
mov eax, DWORD PTR [eax]

movss xmm1, DWORD PTR [eax+1]
mulss  xmm1, DWORD PTR [eax]
movss xmm2, DWORD PTR [eax+2]
addss  xmm1, xmm2

comiss xmm1, xmm0
ja SHORT $LN6@main
movaps xmm0, xmm1
$LN6@main:
comiss xmm0, xmm2
ja SHORT $LN10@main
movaps xmm0, xmm2
$LN10@main:
push ecx
mov ecx, DWORD PTR __imp_?cout@std@@3V?$basic_ostream@DU?$char_traits@D@std@@@1@A
movss DWORD PTR [esp], xmm0
call DWORD PTR __imp_??6?$basic_ostream@DU?$char_traits@D@std@@@std@@QAEAAV01@M@Z

xor eax, eax
pop ebp
ret 0
_main ENDP

The good

  • No mixing between general purpose & SSE registers. Everything stays within xmm regs. That's much better.

The bad

  • Code contains jumps (branches)
On a personal note, I think it's counterintuitive that passing by reference and passing by value produce such different results.

Using SSE2 intrinsics

This is the assembly output from using _mm_min_ss & _mm_max_ss:
_main PROC ; COMDAT
push ebp
mov ebp, esp

mov eax, DWORD PTR _argv$[ebp]

push ecx
mov eax, DWORD PTR [eax]
mov ecx, DWORD PTR __imp_?cout@std@@3V?$basic_ostream@DU?$char_traits@D@std@@@1@A
movss  xmm0, DWORD PTR [eax+1]
mulss   xmm0, DWORD PTR [eax]
movss  xmm2, DWORD PTR [eax+2]
addss   xmm0, xmm2

movaps  xmm1, xmm0
movss    xmm0, DWORD PTR __real@419c0000
minss     xmm0, xmm1

movaps xmm1, xmm0
movaps xmm0, xmm2
maxss   xmm1, xmm0


movss DWORD PTR [esp], xmm1
call DWORD PTR __imp_??6?$basic_ostream@DU?$char_traits@D@std@@@std@@QAEAAV01@M@Z

xor eax, eax

pop ebp
ret 0
_main ENDP
Sweet! That's exactly what we were aiming for. The code translates to minss & maxss instructions.
I was also afraid the compiler wouldn't do this efficiently (i.e. unnecessarily store xmm0's result from the first minss on the stack and then load it back), but note that it only writes to memory after the second one, maxss, has executed!

Not only that, the code is also smaller. It couldn't have gone better: all advantages, no disadvantages.
Completely branchless, no exchange with GPR registers, and no conditional moves either!

Doubles

I won't post the results from using doubles, because they're the same: instead of comiss, comisd is used; instead of minss, minsd is generated.
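For reference, the double version is the same pattern built on the _sd intrinsics, which live in emmintrin.h (SSE2) rather than xmmintrin.h (a sketch):

```cpp
#include <emmintrin.h> // SSE2: _mm_min_sd, _mm_set_sd, _mm_store_sd

inline const double min( const double &left, const double &right )
{
    double retVal;
    // Compiles down to a single minsd on the lowest lane
    _mm_store_sd( &retVal, _mm_min_sd( _mm_set_sd( left ), _mm_set_sd( right ) ) );
    return retVal;
}
```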


Specializing std::min & std::max

I tried specializing the std functions for float & double to use the SSE2 intrinsics. This actually worked and produced the same optimal assembly with minss & maxss. However, std::min returns by reference, not by value.

The only way to implement it is to use a local variable, which is where _mm_store_ss will place the result.
But then the compiler warns that we're returning the address of a local variable. That can be very dangerous, since "const float &r = std::min( a, b )" is valid and would cause undefined behavior.
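A by-value wrapper (hypothetical name sse_min, not actual Ogre code) sidesteps the dangling-reference problem, but then it can no longer be a specialization of std::min, whose signature returns const T&:

```cpp
#include <xmmintrin.h> // SSE: _mm_min_ss, _mm_set_ss, _mm_store_ss

// Returning by value is safe: the result is copied out of the local.
// A std::min specialization would instead have to return a reference
// to retVal, which dangles as soon as the function returns.
inline float sse_min( float a, float b )
{
    float retVal;
    _mm_store_ss( &retVal, _mm_min_ss( _mm_set_ss( a ), _mm_set_ss( b ) ) );
    return retVal;
}
```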

Conclusion

Given what I found, I will be using Ogre::min & Ogre::max to replace std::min & std::max (where floats & doubles are involved).
On x86 architectures, they will use the SSE2 intrinsics, since those translate to optimal assembly. On non-x86 architectures, they will default to the inline C version (or an architecture-specific variant, e.g. NEON intrinsics).
We still have to measure what happens with GCC, though.
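A minimal sketch of how such a drop-in helper could dispatch per architecture (the name ogreMin and the preprocessor checks are illustrative, not the actual Ogre code):

```cpp
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <xmmintrin.h>

// SSE path: compiles down to a single minss
inline float ogreMin( float a, float b )
{
    float r;
    _mm_store_ss( &r, _mm_min_ss( _mm_set_ss( a ), _mm_set_ss( b ) ) );
    return r;
}
#else
// Portable fallback: the inline C version
inline float ogreMin( float a, float b )
{
    return a < b ? a : b;
}
#endif
```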

3 comments:

  1. GCC is pretty smart about

    (left < right) ? left : right

    it does the minss trick!

    0000000000000670 :
    670: f3 0f 5d c1 minss %xmm1,%xmm0
    674: c3 retq
    675: 90 nop
    676: 90 nop
    677: 90 nop

    gcc (Debian 4.7.2-5) 4.7.2

    on an amd64 system.

    Haven't inlined it, for sake of asm-generation just a mini-so containing the function.

    1. Forgot to mention, that is with -O1 and up.

  2. Thanks for mentioning it!

    I'll try to check it out for myself when I get the time. I'm not surprised: GCC is better at this, and there are multiple reports of GCC outperforming MSVC in SSE code generation (whether packed or scalar).

    To be honest, this is an obvious optimization; MSVC should always be producing minss/maxss, unless I'm missing something.
