The Battle of the Lean and the Inlined Bone Functions

July 12, 2016

Inline Functionality

Using inline functions can make your program faster because they eliminate the overhead associated with function calls. Functions expanded inline are subject to code optimizations not available to normal functions. (MSDN)

Ah, thank the heavens for inline functions…

In the future, compilers may be able to do a better job of making inline decisions than programmers.
(Randy Meyers, Dr Dobbs, July 1st 2002)

Flash-forward to 2016 and let’s revisit Randy’s thinking. He was absolutely correct: compilers have become way more intelligent since 2002. It makes sense – things got a whole lot more complicated. Developers, particularly game developers, need to think about multiple target platforms with potentially HUGE differences in how their CPUs work and perform. In most cases these days, the compiler is the best placed to decide whether or not a function should be inlined. __inline is really just a lightweight hint to the compiler, though. It’s at the compiler’s discretion whether or not a function is really inlined – if it determines that inlining would be a bad idea, for example because the function is long and likely to hurt instruction cache performance, then it will just ignore the programmer’s recommendation.

With that, __forceinline comes into play of course. This is a much stronger hint to the compiler… a “please, I really want to inline this”. It’s a nice little Microsoft-specific extension to C++ (GCC and Clang offer the always_inline attribute for the same job) for the cases where you have a tiny piece of code that definitely should be inlined, just in case the compiler doesn’t agree otherwise.

Here’s what MSDN has to say about this:-

The __forceinline keyword overrides the cost/benefit analysis and relies on the judgment of the programmer instead. Exercise caution when using __forceinline. Indiscriminate use of __forceinline can result in larger code with only marginal performance gains or, in some cases, even performance losses (due to increased paging of a larger executable, for example).

So basically, we can use it – but we shouldn’t use it too much.
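
For reference, here is a minimal sketch of how a codebase might wrap these compiler-specific keywords in portable macros. The MY_FORCEINLINE and MY_FORCENOINLINE names and the two small functions are made up for illustration and are not taken from UE4; the underlying keywords and attributes (__forceinline, __declspec(noinline), always_inline, noinline) are the compilers' own.

```cpp
#include <cassert>

// Hypothetical wrapper macros; the MY_ prefix marks them as illustrative only.
#if defined(_MSC_VER)
    #define MY_FORCEINLINE   __forceinline
    #define MY_FORCENOINLINE __declspec(noinline)
#elif defined(__GNUC__) || defined(__clang__)
    #define MY_FORCEINLINE   inline __attribute__((always_inline))
    #define MY_FORCENOINLINE __attribute__((noinline))
#else
    // Unknown compiler: fall back to an ordinary hint and no restriction.
    #define MY_FORCEINLINE   inline
    #define MY_FORCENOINLINE
#endif

// Tiny function: a sensible candidate for a strong inlining request.
MY_FORCEINLINE float Square(float X)
{
    return X * X;
}

// Larger or rarely called work can be kept out of line instead.
MY_FORCENOINLINE float SumOfSquares(const float* Values, int Count)
{
    assert(Values != nullptr || Count == 0);
    float Sum = 0.0f;
    for (int Index = 0; Index < Count; ++Index)
    {
        Sum += Square(Values[Index]);
    }
    return Sum;
}
```

Even with the strong form, the compiler can still decline in some situations, so the only way to know what really happened is to look at the generated code.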

Inlining In UE4

Why is this relevant to UE4? Well, looking at the source code as of today, here’re some stats for you relating to the code available through GitHub (nb. I’m excluding the thirdparty folder here as there’s not much we’d want to do with that):-

There are 7856 references to FORCEINLINE (which maps to __forceinline under Microsoft’s compiler and, as far as I can tell, to an always_inline equivalent on other toolchains);
There’s 1 __forceinline (a mistake – this should of course be FORCEINLINE) in D3D12Resources.h;
There’re 136 FORCENOINLINE references.

Hmm… “Exercise caution” and “7856 references” don’t exactly roll together.

This is a common issue with large code bases. Inlining of functions can definitely help performance – but it shouldn’t just be applied everywhere. Here’s an example of something that I -have- seen, recently, and which led to a very poorly performing piece of code:-

A() is approx 30 lines of fairly complex code… within it are 3 separate calls to B().
B() is set to be FORCEINLINE’d and runs to around 40 lines. B() has 3 calls to C().
C() is also FORCEINLINE’d with 35 lines of code, 3 calls to D() and 5 calls to E().
D() is FORCEINLINE’d and has 8 lines of code.
E() is also FORCEINLINE’d with 5 lines of code.
Totalling up, we’ll have 30 + 3 * ( 40 + 3 * ( 35 + 3 * 8 + 5 * 5 ) ) = 906 lines of code.
If none of that was inlined at all, we’d have just 30 + 40 + 35 + 8 + 5 = 118 lines.
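
The totals above can be checked with a short sketch. The FFunctionInfo type and both function names are hypothetical, invented purely to reproduce the arithmetic; source lines are used here only as a rough proxy for the amount of machine code generated.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical description of a function: its own line count plus the
// callees it force-inlines and how many times each one is called.
struct FFunctionInfo
{
    int OwnLines;
    std::vector<std::pair<const FFunctionInfo*, int>> InlinedCalls;
};

// Lines of code once every force-inlined call has been expanded in place.
int ExpandedLines(const FFunctionInfo& Function)
{
    int Total = Function.OwnLines;
    for (const auto& Call : Function.InlinedCalls)
    {
        assert(Call.first != nullptr);
        Total += Call.second * ExpandedLines(*Call.first);
    }
    return Total;
}

// The A() to E() example from the text.
int ExampleExpandedLines()
{
    const FFunctionInfo E{5, {}};
    const FFunctionInfo D{8, {}};
    const FFunctionInfo C{35, {{&D, 3}, {&E, 5}}};
    const FFunctionInfo B{40, {{&C, 3}}};
    const FFunctionInfo A{30, {{&B, 3}}};
    return ExpandedLines(A);
}
```

ExampleExpandedLines() comes out at 906, against 118 lines with no inlining at all, so A() ends up well over seven times the size of all five functions put together.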

To see whether inlining helps or not, you really need to profile the code. Sometimes, as in the example that follows, it’s immediately apparent that it’s not (quite) working… it doesn’t mean that we should just disable inlining completely – but that we really need to investigate carefully to decide.

Investigation of Some Animation Code

During a recent profiling run of our server standalone, we were seeing large amounts of time spent in PopulateFromAnimation(). Let’s investigate how that function works…

// Populates this pose from the supplied animation and track data
void PopulateFromAnimation(
  const UAnimSequence& Seq,
  const BoneTrackArray& RotationTracks,
  const BoneTrackArray& TranslationTracks,
  const BoneTrackArray& ScaleTracks,
  float Time)
{
  // @todo fixme 
  FTransformArray LocalBones;
  LocalBones = this->Bones;

  AnimationFormat_GetAnimationPose(
    LocalBones, //@TODO:@ANIMATION: Nasty hack
    RotationTracks,
    TranslationTracks,
    ScaleTracks,
    Seq,
    Time);
  this->Bones = LocalBones;
}

“@todo fixme” and “@TODO:@ANIMATION: Nasty hack” are warning signs – but changing those would be a bigger task and I believe Epic’s programmers may already be looking into them. The issue they refer to, I believe, is that this->Bones is copied to LocalBones, only to be copied back after the call to AnimationFormat_GetAnimationPose(). It definitely seems like there should be a better way to do that which would avoid the copies… but, anyway, that’s not our concern today. Let’s look at the function called from the above:-

void AnimationFormat_GetAnimationPose(  
  FTransformArray& Atoms, 
  const BoneTrackArray& RotationPairs,
  const BoneTrackArray& TranslationPairs,
  const BoneTrackArray& ScalePairs,
  const UAnimSequence& Seq,
  float Time)
{
  // decompress the translation component using the proper method
  checkSlow(Seq.TranslationCodec != NULL);
  if (TranslationPairs.Num() > 0)
  {
    ((AnimEncoding*)Seq.TranslationCodec)->GetPoseTranslations(Atoms, TranslationPairs, Seq, Time);
  }

  // decompress the rotation component using the proper method
  checkSlow(Seq.RotationCodec != NULL);
  ((AnimEncoding*)Seq.RotationCodec)->GetPoseRotations(Atoms, RotationPairs, Seq, Time);

  checkSlow(Seq.ScaleCodec != NULL);
  // we allow scale key to be empty
  if (Seq.CompressedScaleOffsets.IsValid())
  {
    ((AnimEncoding*)Seq.ScaleCodec)->GetPoseScales(Atoms, ScalePairs, Seq, Time);
  }
}

Pretty simple function, again… including 3 calls… GetPoseTranslations(), GetPoseRotations() and GetPoseScales(). In our profiling, we were actually seeing the most time spent in GetPoseRotations(). There are 3 versions of this – the first in AnimEncoding.cpp, the next in AnimEncoding_ConstantKeyLerp.h and the last in AnimEncoding_VariableKeyLerp.h. It’s actually only the latter 2 that we’re concerned with here. Let’s take a look at one of these:-

template<int32 FORMAT>
FORCEINLINE_DEBUGGABLE void AEFVariableKeyLerp<FORMAT>::GetPoseRotations(  
  FTransformArray& Atoms, 
  const BoneTrackArray& DesiredPairs,
  const UAnimSequence& Seq,
  float Time)
{
  const int32 PairCount = DesiredPairs.Num();
  const float RelativePos = Time / (float)Seq.SequenceLength;

  for (int32 PairIndex=0; PairIndex<PairCount; ++PairIndex)
  {
    const BoneTrackPair& Pair = DesiredPairs[PairIndex];
    const int32 TrackIndex = Pair.TrackIndex;
    const int32 AtomIndex = Pair.AtomIndex;
    FTransform& BoneAtom = Atoms[AtomIndex];

    const int32* RESTRICT TrackData = Seq.CompressedTrackOffsets.GetData() + (TrackIndex*4);
    const int32 RotKeysOffset  = *(TrackData+2);
    const int32 NumRotKeys  = *(TrackData+3);
    const uint8* RESTRICT RotStream    = Seq.CompressedByteStream.GetData()+RotKeysOffset;

    // call the decoder directly (not through the vtable)
    AEFVariableKeyLerp<FORMAT>::GetBoneAtomRotation(BoneAtom, Seq, RotStream, NumRotKeys, Time, RelativePos);
  }
}

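So, we have our first FORCEINLINE here (strictly speaking it’s FORCEINLINE_DEBUGGABLE which, as I understand it, only relaxes to a plain inline in debug builds). Note the call to GetBoneAtomRotation() – let’s take a look at that:-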

template<int32 FORMAT>
FORCEINLINE_DEBUGGABLE void AEFVariableKeyLerp<FORMAT>::GetBoneAtomRotation(  
  FTransform& OutAtom,
  const UAnimSequence& Seq,
  const uint8* RESTRICT RotStream,
  int32 NumRotKeys,
  float Time,
  float RelativePos)
{
  if (NumRotKeys == 1)
  {
    // For a rotation track of n=1 keys, the single key is packed as an FQuatFloat96NoW.
    FQuat R0;
    DecompressRotation<ACF_Float96NoW>( R0 , RotStream, RotStream );
    OutAtom.SetRotation(R0);
  }
  else
  {
... lots of complex code here...

  }
}

This is a fairly long function and another with FORCEINLINE. All of this will be merged with the code in GetPoseRotations() and then inlined into AnimationFormat_GetAnimationPose().

Firing up VTune, we were seeing a large chunk of time, ~60%, spent on a single line of code… oddly, it was possibly the simplest line in the whole thing:-

if (NumRotKeys == 1)

It would be nice to show you the disassembly for GetPoseRotations() here … but it runs to 1,514 bytes of code across 444 lines… that’s not a good sign. Let me just show you an important part:-

0x1410d3675  Block 5:  
0x1410d3675  xorps xmm0, xmm0  
0x1410d3678  sqrtss xmm0, xmm5  
0x1410d367c  movss dword ptr [rsp+0xc], xmm0  // 0.00142252
0x1410d3682  movaps xmm0, xmmword ptr [rsp]  
0x1410d3686  jmp 0x1410d3ab7 <Block 62>  // 0.0255999
0x1410d368b  Block 6:  
0x1410d368b  mov dword ptr [rsp+0xc], 0x0  
0x1410d3693  movaps xmm0, xmmword ptr [rsp]

The two values on the right are the time measured against each of those instructions. Note the time on the JMP… this is our 60%: 0.0255999 out of the 0.04267214 measured for the entire function.

Analyzing the data coming through this function, something struck me: 99% of the time, we were seeing NumRotKeys come through as “1”. This may be something particular to our application – but I believe it’s likely to be a common case with video games, at least. Regardless, with 444 lines of disassembly, something needed to be done.
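
One way to gather that kind of figure is a throwaway counter called from inside the loop. The sketch below is an assumption about how such instrumentation could look, not the code we actually used; FRotKeyStats and its members are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical instrumentation: tallies how often a track has exactly one
// rotation key versus any other key count.
struct FRotKeyStats
{
    std::uint64_t SingleKeyCount = 0;
    std::uint64_t OtherCount = 0;

    void Record(int NumRotKeys)
    {
        assert(NumRotKeys >= 0);  // a negative count would mean corrupt data
        if (NumRotKeys == 1)
        {
            ++SingleKeyCount;
        }
        else
        {
            ++OtherCount;
        }
    }

    // Fraction of recorded tracks with exactly one key, or 0 if nothing
    // has been recorded yet.
    double SingleKeyFraction() const
    {
        const std::uint64_t Total = SingleKeyCount + OtherCount;
        return Total == 0
            ? 0.0
            : static_cast<double>(SingleKeyCount) / static_cast<double>(Total);
    }
};
```

Calling Record(NumRotKeys) once per track and reading SingleKeyFraction() at the end of a run is enough to tell whether the single-key case dominates for your own content.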

Our Optimization of the Animation System

First up, let’s tackle the expensive line of code that you see above. To do this, we need to understand, just a little, how modern processors work… they’re very clever, see, and will often start work on instructions that are quite far ahead of the current instruction point – using branch prediction, speculative execution, etc. The processor might do some of the math on either side of the NumRotKeys==1 test before knowing whether the test passes or fails. And that’s where our bottleneck comes in: the processor is running ahead doing some pretty complicated math (for the case where NumRotKeys is larger than 1) only for that math not to be needed. There’s no real way that we can hint to the processor that it doesn’t need to do this… so let’s, instead, fix it another way… let’s separate that code so that we have each case defined in different functions.
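
Stripped of the engine types, the shape of that split looks something like the sketch below. Everything in it is hypothetical (FTrack, SampleMultiKey, SampleTracks, the SKETCH_NOINLINE macro, and the use of plain floats instead of quaternions); it only illustrates keeping the cheap, common case in the loop and pushing the rare, expensive case into a function that is never inlined. Whether it helps in any given case still has to be confirmed with a profiler.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical "never inline" macro for this sketch.
#if defined(_MSC_VER)
    #define SKETCH_NOINLINE __declspec(noinline)
#elif defined(__GNUC__) || defined(__clang__)
    #define SKETCH_NOINLINE __attribute__((noinline))
#else
    #define SKETCH_NOINLINE
#endif

// Hypothetical stand-in for a compressed track: one float per key.
struct FTrack
{
    std::vector<float> Keys;
};

// Rare, expensive path: interpolate between the two nearest keys.
// Kept out of line so its code stays away from the hot loop.
SKETCH_NOINLINE float SampleMultiKey(const FTrack& Track, float RelativePos)
{
    assert(Track.Keys.size() >= 2);
    assert(RelativePos >= 0.0f && RelativePos <= 1.0f);
    const std::size_t LastIndex = Track.Keys.size() - 1;
    const float KeyPos = RelativePos * static_cast<float>(LastIndex);
    const std::size_t Index0 = static_cast<std::size_t>(KeyPos);
    if (Index0 >= LastIndex)
    {
        return Track.Keys[LastIndex];
    }
    const float Alpha = KeyPos - static_cast<float>(Index0);
    return Track.Keys[Index0] + Alpha * (Track.Keys[Index0 + 1] - Track.Keys[Index0]);
}

// Hot loop: the common single-key case is handled right here.
void SampleTracks(const std::vector<FTrack>& Tracks, float RelativePos, std::vector<float>& Out)
{
    Out.resize(Tracks.size());
    for (std::size_t Index = 0; Index < Tracks.size(); ++Index)
    {
        const FTrack& Track = Tracks[Index];
        assert(!Track.Keys.empty());  // assumption: every track has at least one key
        if (Track.Keys.size() == 1)
        {
            Out[Index] = Track.Keys[0];
        }
        else
        {
            Out[Index] = SampleMultiKey(Track, RelativePos);
        }
    }
}
```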

Since the calling function, GetPoseRotations(), is so simple and contains a neat loop, let’s just move our most common case into there – that way, we’re effectively inlining the common case and leaving the rarer case in a function. At the same time, let’s stop force-inlining GetPoseRotations() – it’s much better to let the compiler decide whether or not this should be done:-

template<int32 FORMAT>
void AEFVariableKeyLerp<FORMAT>::GetPoseRotations(
  FTransformArray& Atoms, 
  const BoneTrackArray& DesiredPairs,
  const UAnimSequence& Seq,
  float Time)
{
  const int32 PairCount = DesiredPairs.Num();
  const float RelativePos = Time / (float)Seq.SequenceLength;

  for (int32 PairIndex=0; PairIndex<PairCount; ++PairIndex)
  {
    const BoneTrackPair& Pair = DesiredPairs[PairIndex];
    const int32 TrackIndex = Pair.TrackIndex;
    const int32 AtomIndex = Pair.AtomIndex;
    FTransform& BoneAtom = Atoms[AtomIndex];

    const int32* RESTRICT TrackData = Seq.CompressedTrackOffsets.GetData() + (TrackIndex*4);
    const int32 RotKeysOffset  = *(TrackData+2);
    const int32 NumRotKeys  = *(TrackData+3);
    const uint8* RESTRICT RotStream    = Seq.CompressedByteStream.GetData()+RotKeysOffset;

    if (NumRotKeys == 1)
    {
      FQuat R0;
      DecompressRotation<ACF_Float96NoW>(R0, RotStream, RotStream);
      BoneAtom.SetRotation(R0);
    }
    else
    {
      // call the decoder directly (not through the vtable)
      AEFVariableKeyLerp<FORMAT>::GetBoneAtomRotation(BoneAtom, Seq, RotStream, NumRotKeys, Time, RelativePos);
    }
  }
}

And then we can go ahead and comment out that part within GetBoneAtomRotation(). In this case, to stop the performance problem rearing its head again, we make sure that the function won’t be inlined this time by using FORCENOINLINE:-

template<int32 FORMAT>
FORCENOINLINE void AEFVariableKeyLerp<FORMAT>::GetBoneAtomRotation(  
  FTransform& OutAtom,
  const UAnimSequence& Seq,
  const uint8* RESTRICT RotStream,
  int32 NumRotKeys,
  float Time,
  float RelativePos)
{
//  if (NumRotKeys == 1)
//  {
//    // For a rotation track of n=1 keys, the single key is packed as an FQuatFloat96NoW.
//    FQuat R0;
//    DecompressRotation<ACF_Float96NoW>( R0 , RotStream, RotStream );
//    OutAtom.SetRotation(R0);
//  }
//  else
  {

... lots of code

  }
}

Additionally, we took FORCEINLINE off GetPoseTranslations() and GetPoseScales().

And we go ahead and make these same changes to AnimEncoding_ConstantKeyLerp.h.

That’s all there is to it… if we now look at the disassembly of GetPoseRotations(), it comes to a respectable 100 lines of assembly language (382 bytes):-

0x1410d1130  Block 1:
0x1410d1130  mov rax, rsp
0x1410d1133  mov qword ptr [rax+0x8], rbx
0x1410d1137  mov qword ptr [rax+0x10], rbp
0x1410d113b  mov qword ptr [rax+0x18], rsi
0x1410d113f  push rdi
0x1410d1140  push r14
0x1410d1142  push r15
0x1410d1144  sub rsp, 0x90
0x1410d114b  movsxd rsi, dword ptr [r8+0x8]
0x1410d114f  movaps xmmword ptr [rax-0x28], xmm6
0x1410d1153  movaps xmmword ptr [rax-0x38], xmm7
0x1410d1157  movss xmm7, dword ptr [rsp+0xd0]
0x1410d1160  xor ebx, ebx
0x1410d1162  mov rdi, r9
0x1410d1165  mov r14, r8
0x1410d1168  mov r15, rdx
0x1410d116b  mov rbp, rcx
0x1410d116e  movaps xmm6, xmm7
0x1410d1171  divss xmm6, dword ptr [r9+0x68]
0x1410d1177  test rsi, rsi
0x1410d117a  jle 0x1410d1285 <Block 10>
0x1410d1180  Block 2:
0x1410d1180  movaps xmmword ptr [rax-0x48], xmm8
0x1410d1185  movss xmm8, dword ptr [rip+0xfb1d4e]
0x1410d118e  movaps xmmword ptr [rax-0x58], xmm9
0x1410d1193  xorps xmm9, xmm9
0x1410d1197  nop word ptr [rax+rax*1], ax
0x1410d11a0  Block 3:
0x1410d11a0  mov rcx, qword ptr [r14]
0x1410d11a3  movsxd rax, dword ptr [rcx+rbx*8]
0x1410d11a7  lea rdx, ptr [rax+rax*2]
0x1410d11ab  mov eax, dword ptr [rcx+rbx*8+0x4]
0x1410d11af  shl eax, 0x2
0x1410d11b2  shl rdx, 0x4
0x1410d11b6  add rdx, qword ptr [r15]
0x1410d11b9  movsxd rcx, eax
0x1410d11bc  mov rax, qword ptr [rdi+0xe0]
0x1410d11c3  movsxd r9, dword ptr [rax+rcx*4+0x8]
0x1410d11c8  mov r8d, dword ptr [rax+rcx*4+0xc]
0x1410d11cd  add r9, qword ptr [rdi+0x108]
0x1410d11d4  cmp r8d, 0x1
0x1410d11d8  jnz 0x1410d1251 <Block 7>
0x1410d11da  Block 4:
0x1410d11da  movss xmm2, dword ptr [r9]
0x1410d11df  movss xmm3, dword ptr [r9+0x4]
0x1410d11e5  movss xmm4, dword ptr [r9+0x8]
0x1410d11eb  movaps xmm5, xmm8
0x1410d11ef  movaps xmm0, xmm2
0x1410d11f2  movaps xmm1, xmm3
0x1410d11f5  mulss xmm0, xmm2
0x1410d11f9  mulss xmm1, xmm3
0x1410d11fd  subss xmm5, xmm0
0x1410d1201  movss dword ptr [rsp+0x40], xmm2
0x1410d1207  movss dword ptr [rsp+0x44], xmm3
0x1410d120d  movaps xmm0, xmm4
0x1410d1210  mulss xmm0, xmm4
0x1410d1214  subss xmm5, xmm1
0x1410d1218  movss dword ptr [rsp+0x48], xmm4
0x1410d121e  subss xmm5, xmm0
0x1410d1222  comiss xmm5, xmm9
0x1410d1226  jbe 0x1410d123f <Block 6>
0x1410d1228  Block 5:
0x1410d1228  xorps xmm0, xmm0
0x1410d122b  sqrtss xmm0, xmm5
0x1410d122f  movss dword ptr [rsp+0x4c], xmm0
0x1410d1235  movaps xmm0, xmmword ptr [rsp+0x40]
0x1410d123a  movaps xmmword ptr [rdx], xmm0
0x1410d123d  jmp 0x1410d126d <Block 8>
0x1410d123f  Block 6:
0x1410d123f  mov dword ptr [rsp+0x4c], 0x0
0x1410d1247  movaps xmm0, xmmword ptr [rsp+0x40]
0x1410d124c  movaps xmmword ptr [rdx], xmm0
0x1410d124f  jmp 0x1410d126d <Block 8>
0x1410d1251  Block 7:
0x1410d1251  movss dword ptr [rsp+0x30], xmm6
0x1410d1257  movss dword ptr [rsp+0x28], xmm7
0x1410d125d  mov dword ptr [rsp+0x20], r8d
0x1410d1262  mov r8, rdi
0x1410d1265  mov rcx, rbp
0x1410d1268  call 0x1410c7fd0 <?GetBoneAtomRotation@?$AEFVariableKeyLerp@$00@@UEAAXAEAUFTransform@@AEBVUAnimSequence@@PEIBEHMM@Z>
0x1410d126d  Block 8:
0x1410d126d  inc rbx
0x1410d1270  cmp rbx, rsi
0x1410d1273  jl 0x1410d11a0 <Block 3>
0x1410d1279  Block 9:
0x1410d1279  movaps xmm9, xmmword ptr [rsp+0x50]
0x1410d127f  movaps xmm8, xmmword ptr [rsp+0x60]
0x1410d1285  Block 10:
0x1410d1285  movaps xmm7, xmmword ptr [rsp+0x70]
0x1410d128a  lea r11, ptr [rsp+0x90]
0x1410d1292  movaps xmm6, xmmword ptr [r11-0x10]
0x1410d1297  mov rbx, qword ptr [r11+0x20]
0x1410d129b  mov rbp, qword ptr [r11+0x28]
0x1410d129f  mov rsi, qword ptr [r11+0x30]
0x1410d12a3  mov rsp, r11
0x1410d12a6  pop r15
0x1410d12a8  pop r14
0x1410d12aa  pop rdi
0x1410d12ab  ret

And, of course, performance is much improved, as you might expect. VTune no longer highlights the code above as a bottleneck for us… though the memory copies that I mentioned earlier have now been brought to the fore. Hopefully those will be addressed in a future revision of UE4 – if not soon, though, I’m sure we’ll get to that ourselves.

Credit(s): Robert Troughton (Coconut Lizard)
Status: Currently unimplemented in 4.12