For Developers Working With Unreal Engine

The Battle of the Lean and the Inlined Bone Functions

Inline Functionality

Using inline functions can make your program faster because they eliminate the overhead associated with function calls. Functions expanded inline are subject to code optimizations not available to normal functions. (MSDN)

Ah, thank the heavens for inline functions…

In the future, compilers may be able to do a better job of making inline decisions than programmers.
(Randy Meyers, Dr Dobbs, July 1st 2002)

Flash-forward to 2016 and let’s revisit Randy’s thinking. He was absolutely correct: compilers have become far more intelligent since 2002. It makes sense – things got a whole lot more complicated. Developers, particularly game developers, need to think about multiple target platforms with potentially HUGE differences in how their CPUs work and perform. In most cases, the compiler is now the best placed to decide whether or not a function should be inlined. __inline is really just a lightweight hint to the compiler, though. It’s at the compiler’s discretion whether or not a function is really inlined – if it determines that inlining would be a bad idea, for example because the function is long and likely to hurt instruction cache performance, then it will just ignore the programmer’s recommendation.

With that, __forceinline comes into play of course. This is a much stronger hint to the compiler… a “please, I really want to inline this”. It’s a nice little Microsoft-specific extension (not part of standard C++) for the cases where you have a tiny piece of code that definitely should be inlined, just in case the compiler doesn’t otherwise agree.

Here’s what MSDN has to say about this:-

The __forceinline keyword overrides the cost/benefit analysis and relies on the judgment of the programmer instead. Exercise caution when using __forceinline. Indiscriminate use of __forceinline can result in larger code with only marginal performance gains or, in some cases, even performance losses (due to increased paging of a larger executable, for example).

So basically, we can use it – but we shouldn’t use it too much.
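
To make the difference between the two kinds of hint concrete, here’s a minimal sketch. MY_FORCEINLINE and MY_NOINLINE are hypothetical macro names made up for this example, the GCC/Clang branch is an assumption about how you’d ask for the same thing away from Microsoft’s compiler, and Square() and LengthSquared() are invented purely for illustration:-

```cpp
// Hypothetical macro names, for illustration only.
#if defined(_MSC_VER)
  #define MY_FORCEINLINE __forceinline
  #define MY_NOINLINE    __declspec(noinline)
#elif defined(__GNUC__) || defined(__clang__)
  #define MY_FORCEINLINE inline __attribute__((always_inline))
  #define MY_NOINLINE    __attribute__((noinline))
#else
  // Unknown compiler: fall back to the plain hint / no attribute.
  #define MY_FORCEINLINE inline
  #define MY_NOINLINE
#endif

// Tiny and likely to be called in hot loops: a reasonable candidate for forcing.
MY_FORCEINLINE float Square(float X)
{
  return X * X;
}

// Plain 'inline': only a hint, the compiler keeps the final say.
inline float LengthSquared(float X, float Y, float Z)
{
  return Square(X) + Square(Y) + Square(Z);
}
```

Calling LengthSquared(1.0f, 2.0f, 2.0f) gives 9.0f either way; the keywords only change how the compiler lays out the code, never what it computes.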


Inlining In UE4

Why is this relevant to UE4? Well, looking at the source code as of today, here’re some stats for you relating to the code available through GitHub (nb. I’m excluding the thirdparty folder here as there’s not much we’d want to do with that):-

  • There are 7856 references to FORCEINLINE (defined as __forceinline on Microsoft compilers, with equivalent always-inline attributes on other platforms);
  • There’s 1 __forceinline (a mistake – this should of course be FORCEINLINE) in D3D12Resources.h;
  • There’re 136 FORCENOINLINE references.

Hmm… “Exercise caution” and “7856 references” don’t exactly roll together.

This is a common issue with large code bases. Inlining of functions can definitely help performance – but it shouldn’t just be applied everywhere. Here’s an example of something that I -have- seen, recently, and which leads to some very poorly performing code:-

A() is approx 30 lines of fairly complex code… within it are 3 separate calls to B().

B() is set to be FORCEINLINE’d and runs to around 40 lines. B() has 3 calls to C().

C() is also FORCEINLINE’d with 35 lines of code, 3 calls to D() and 5 calls to E().

D() is FORCEINLINE’d and has 8 lines of code.

E() is also FORCEINLINE’d with 5 lines of code.

Totalling up, we’ll have 30 + 3 * ( 40 + 3 * ( 35 + 3 * 8 + 5 * 5 ) ) = 906 lines of code.

If none of that was inlined at all, we’d have just 30 + 40 + 35 + 8 + 5 = 118 lines.
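
Sums like this are easy to get wrong by hand, so here’s the same calculation spelled out as compile-time constants. It’s only a sketch: the constant names are made up for this example, and source lines are just a rough stand-in for generated code size:-

```cpp
// Line counts and call counts taken from the A()..E() example above.
constexpr int LinesD = 8;
constexpr int LinesE = 5;

// Each function's size once everything it calls has been inlined into it.
constexpr int ExpandedC = 35 + 3 * LinesD + 5 * LinesE;    // 84
constexpr int ExpandedB = 40 + 3 * ExpandedC;              // 292
constexpr int ExpandedA = 30 + 3 * ExpandedB;              // 906

// With no inlining at all, each body exists exactly once.
constexpr int NotInlined = 30 + 40 + 35 + LinesD + LinesE; // 118

static_assert(ExpandedA == 906, "fully inlined size of A()");
static_assert(NotInlined == 118, "size with no inlining");
```

The thing to notice is that the call counts multiply as you go down the chain: the body of C() ends up in A() nine times over, and D() twenty-seven times.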

To see whether inlining helps or not, you really need to profile the code. Sometimes, as in the example that follows, it’s immediately apparent that it’s not (quite) working… that doesn’t mean that we should just disable inlining completely – but we do need to investigate carefully before deciding.
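
If you don’t have a profiler to hand, even a crude timing harness is better than guessing. The sketch below is an assumption about how you might do that, not something from the UE4 code base: TimeMs() is a hypothetical helper that runs a piece of work repeatedly and returns the elapsed milliseconds, accumulating a checksum so that the optimizer can’t throw the work away:-

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical helper: runs Work(I) for I in [0, Iterations) and returns the
// elapsed time in milliseconds. OutChecksum accumulates the results so that
// the compiler cannot discard the work as unused.
template <typename WorkFn>
double TimeMs(WorkFn Work, int Iterations, std::uint64_t& OutChecksum)
{
  const auto Start = std::chrono::steady_clock::now();
  for (int I = 0; I < Iterations; ++I)
  {
    OutChecksum += Work(I);
  }
  const auto End = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(End - Start).count();
}
```

You’d build the same workload twice in an optimized configuration, once with the forced inlining and once without, and compare the two timings over several runs. It only tells you which build is faster overall; for line-by-line detail you still need a sampling profiler.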


Investigation of Some Animation Code

During a recent profiling run of our server standalone, we were seeing large amounts of time spent in PopulateFromAnimation(). Let’s investigate how that function works…

// Populates this pose from the supplied animation and track data
void PopulateFromAnimation(
  const UAnimSequence& Seq,
  const BoneTrackArray& RotationTracks,
  const BoneTrackArray& TranslationTracks,
  const BoneTrackArray& ScaleTracks,
  float Time)
{
  // @todo fixme 
  FTransformArray LocalBones;
  LocalBones = this->Bones;

  AnimationFormat_GetAnimationPose(
    LocalBones, //@TODO:@ANIMATION: Nasty hack
    RotationTracks,
    TranslationTracks,
    ScaleTracks,
    Seq,
    Time);
  this->Bones = LocalBones;
}

“@todo fixme” and “@TODO:@ANIMATION: Nasty hack” are warning signs – but changing those would be a bigger task and I believe Epic’s programmers may already be looking into them. The issue they refer to, I believe, is that this->Bones is copied to LocalBones, only to be copied back after the call to AnimationFormat_GetAnimationPose()… it definitely seems like there should be a way to avoid those copies. Anyway, that’s not our concern today… let’s look at the function called from the above:-

void AnimationFormat_GetAnimationPose(	
  FTransformArray& Atoms, 
  const BoneTrackArray& RotationPairs,
  const BoneTrackArray& TranslationPairs,
  const BoneTrackArray& ScalePairs,
  const UAnimSequence& Seq,
  float Time)
{
  // decompress the translation component using the proper method
  checkSlow(Seq.TranslationCodec != NULL);
  if (TranslationPairs.Num() > 0)
  {
    ((AnimEncoding*)Seq.TranslationCodec)->GetPoseTranslations(Atoms, TranslationPairs, Seq, Time);
  }

  // decompress the rotation component using the proper method
  checkSlow(Seq.RotationCodec != NULL);
  ((AnimEncoding*)Seq.RotationCodec)->GetPoseRotations(Atoms, RotationPairs, Seq, Time);

  checkSlow(Seq.ScaleCodec != NULL);
  // we allow scale key to be empty
  if (Seq.CompressedScaleOffsets.IsValid())
  {
    ((AnimEncoding*)Seq.ScaleCodec)->GetPoseScales(Atoms, ScalePairs, Seq, Time);
  }
}

Pretty simple function, again… including 3 calls… GetPoseTranslations(), GetPoseRotations() and GetPoseScales(). In our profiling, we were actually seeing the most time spent in GetPoseRotations(). There are 3 versions of this – the first in AnimEncoding.cpp, the next in AnimEncoding_ConstantKeyLerp.h and the last in AnimEncoding_VariableKeyLerp.h. It’s actually only the latter 2 that we’re concerned with here. Let’s take a look at one of these:-

template<int32 FORMAT>
FORCEINLINE_DEBUGGABLE void AEFVariableKeyLerp<FORMAT>::GetPoseRotations(	
  FTransformArray& Atoms, 
  const BoneTrackArray& DesiredPairs,
  const UAnimSequence& Seq,
  float Time)
{
  const int32 PairCount = DesiredPairs.Num();
  const float RelativePos = Time / (float)Seq.SequenceLength;

  for (int32 PairIndex=0; PairIndex<PairCount; ++PairIndex)
  {
    const BoneTrackPair& Pair = DesiredPairs[PairIndex];
    const int32 TrackIndex = Pair.TrackIndex;
    const int32 AtomIndex = Pair.AtomIndex;
    FTransform& BoneAtom = Atoms[AtomIndex];

    const int32* RESTRICT TrackData = Seq.CompressedTrackOffsets.GetData() + (TrackIndex*4);
    const int32 RotKeysOffset	= *(TrackData+2);
    const int32 NumRotKeys	= *(TrackData+3);
    const uint8* RESTRICT RotStream		= Seq.CompressedByteStream.GetData()+RotKeysOffset;

    // call the decoder directly (not through the vtable)
    AEFVariableKeyLerp<FORMAT>::GetBoneAtomRotation(BoneAtom, Seq, RotStream, NumRotKeys, Time, RelativePos);
  }
}

So, we have our first FORCEINLINE here (FORCEINLINE_DEBUGGABLE only drops the forced inlining in debug builds). Note the call to GetBoneAtomRotation() – let’s take a look at that:-

template<int32 FORMAT>
FORCEINLINE_DEBUGGABLE void AEFVariableKeyLerp<FORMAT>::GetBoneAtomRotation(	
  FTransform& OutAtom,
  const UAnimSequence& Seq,
  const uint8* RESTRICT RotStream,
  int32 NumRotKeys,
  float Time,
  float RelativePos)
{
  if (NumRotKeys == 1)
  {
    // For a rotation track of n=1 keys, the single key is packed as an FQuatFloat96NoW.
    FQuat R0;
    DecompressRotation<ACF_Float96NoW>( R0 , RotStream, RotStream );
    OutAtom.SetRotation(R0);
  }
  else
  {
... lots of complex code here...

  }
}

This is a fairly long function and another with FORCEINLINE. All of this will be merged with the code in GetPoseRotations() and then inlined into AnimationFormat_GetAnimationPose().

Firing up VTune, we saw a large chunk of time, ~60%, attributed to a single line of code… oddly, it was possibly the simplest line in the whole thing:-

if (NumRotKeys == 1)

It would be nice to show you the disassembly for GetPoseRotations() here … but it runs to 1,514 bytes of code across 444 lines… that’s not a good sign. Let me just show you an important part:-

0x1410d3675	Block 5:	
0x1410d3675	xorps xmm0, xmm0	
0x1410d3678	sqrtss xmm0, xmm5	
0x1410d367c	movss dword ptr [rsp+0xc], xmm0	// 0.00142252
0x1410d3682	movaps xmm0, xmmword ptr [rsp]	
0x1410d3686	jmp 0x1410d3ab7 <Block 62>	// 0.0255999
0x1410d368b	Block 6:	
0x1410d368b	mov dword ptr [rsp+0xc], 0x0	
0x1410d3693	movaps xmm0, xmmword ptr [rsp]

The two values you can see to the right give a way of measuring the time spent on each line. Note the time on the JMP… this is our 60% (the total measurement for the entire function was 0.04267214).

Analyzing the data coming through this function, something struck me: 99% of the time, we were seeing NumRotKeys come through as “1”. This may be something particular to our application – but I believe it’s likely to be a common case with video games, at least. Regardless, with 444 lines of disassembly, something needed to be done.
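
If you want to check whether the same holds for your own data, one simple approach is to log the key count of every track that comes through and tally them up afterwards. The helpers below are hypothetical and exist only for this example; they assume you’ve already collected the NumRotKeys values into a std::vector<int>:-

```cpp
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical helper, not part of UE4: tallies how often each key count
// occurs so that the dominant case stands out.
std::map<int, std::size_t> CountKeyHistogram(const std::vector<int>& KeyCounts)
{
  std::map<int, std::size_t> Histogram;
  for (int NumKeys : KeyCounts)
  {
    ++Histogram[NumKeys];
  }
  return Histogram;
}

// Fraction (0..1) of tracks that have exactly one key.
double SingleKeyFraction(const std::vector<int>& KeyCounts)
{
  if (KeyCounts.empty())
  {
    return 0.0;
  }
  const std::map<int, std::size_t> Histogram = CountKeyHistogram(KeyCounts);
  const auto It = Histogram.find(1);
  const std::size_t Singles = (It != Histogram.end()) ? It->second : 0;
  return static_cast<double>(Singles) / static_cast<double>(KeyCounts.size());
}
```

A result close to 1.0 tells you that the single-key path is the one worth keeping as tight as possible.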


Our Optimization of the Animation System

First up, let’s tackle the expensive line of code that you see above. To do this, we need to understand, just a little, how modern processors work… they’re very clever, see, and will often start work on instructions quite far ahead of the current instruction pointer – using branch prediction, speculative execution, etc. The processor might do some of the math on either side of the NumRotKeys==1 test before knowing whether the test passes or fails. And that’s where our bottleneck comes in: the processor is running ahead doing some pretty complicated math (for the case where NumRotKeys is larger than 1) only for that math not to be needed. There’s no real way that we can hint to the processor that it doesn’t need to do this… so let’s, instead, fix it another way: let’s separate the code so that each case lives in a different function.

Since the calling function, GetPoseRotations(), is so simple and contains a neat loop, let’s just move our most common case into there – that way, we’re effectively inlining the common case and leaving the rarer case in a function. At the same time, let’s stop force-inlining GetPoseRotations() – it’s much better to let the compiler decide whether or not this should be done:-

template<int32 FORMAT>
void AEFVariableKeyLerp<FORMAT>::GetPoseRotations(
  FTransformArray& Atoms, 
  const BoneTrackArray& DesiredPairs,
  const UAnimSequence& Seq,
  float Time)
{
  const int32 PairCount = DesiredPairs.Num();
  const float RelativePos = Time / (float)Seq.SequenceLength;

  for (int32 PairIndex=0; PairIndex<PairCount; ++PairIndex)
  {
    const BoneTrackPair& Pair = DesiredPairs[PairIndex];
    const int32 TrackIndex = Pair.TrackIndex;
    const int32 AtomIndex = Pair.AtomIndex;
    FTransform& BoneAtom = Atoms[AtomIndex];

    const int32* RESTRICT TrackData = Seq.CompressedTrackOffsets.GetData() + (TrackIndex*4);
    const int32 RotKeysOffset	= *(TrackData+2);
    const int32 NumRotKeys	= *(TrackData+3);
    const uint8* RESTRICT RotStream		= Seq.CompressedByteStream.GetData()+RotKeysOffset;

    if (NumRotKeys == 1)
    {
      FQuat R0;
      DecompressRotation<ACF_Float96NoW>(R0, RotStream, RotStream);
      BoneAtom.SetRotation(R0);
    }
    else
    {
      // call the decoder directly (not through the vtable)
      AEFVariableKeyLerp<FORMAT>::GetBoneAtomRotation(BoneAtom, Seq, RotStream, NumRotKeys, Time, RelativePos);
    }
  }
}

And then we can go ahead and comment out that part within GetBoneAtomRotation(). In this case, to stop the performance problem rearing its head again, we make sure that the function won’t be inlined this time by using FORCENOINLINE:-

template<int32 FORMAT>
FORCENOINLINE void AEFVariableKeyLerp<FORMAT>::GetBoneAtomRotation(	
  FTransform& OutAtom,
  const UAnimSequence& Seq,
  const uint8* RESTRICT RotStream,
  int32 NumRotKeys,
  float Time,
  float RelativePos)
{
//	if (NumRotKeys == 1)
//	{
//		// For a rotation track of n=1 keys, the single key is packed as an FQuatFloat96NoW.
//		FQuat R0;
//		DecompressRotation<ACF_Float96NoW>( R0 , RotStream, RotStream );
//		OutAtom.SetRotation(R0);
//	}
//	else
  {

... lots of code

  }
}

Additionally, we took FORCEINLINE off GetPoseTranslations() and GetPoseScales().

And we go ahead and make these same changes to AnimEncoding_ConstantKeyLerp.h.
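
Stripped of the UE4 specifics, the pattern is “handle the common case right there in the loop, push the rare case out into a function that is never inlined”. Here’s a self-contained sketch of that idea. Everything in it is hypothetical: the Track struct, the MY_NOINLINE macro and the interpolation are placeholders made up for this example, not the real animation decompression:-

```cpp
#include <cstddef>
#include <vector>

// Hypothetical macro name, for illustration only.
#if defined(_MSC_VER)
  #define MY_NOINLINE __declspec(noinline)
#elif defined(__GNUC__) || defined(__clang__)
  #define MY_NOINLINE __attribute__((noinline))
#else
  #define MY_NOINLINE
#endif

// Hypothetical stand-in for a compressed track.
struct Track
{
  std::vector<float> Keys;
};

// Rare, heavier path: kept out of line so it doesn't bloat the loop below.
// Placeholder maths (linear interpolation between the first and last key).
MY_NOINLINE float EvaluateMultiKey(const Track& T, float Alpha)
{
  return T.Keys.front() + (T.Keys.back() - T.Keys.front()) * Alpha;
}

float SumTracks(const std::vector<Track>& Tracks, float Alpha)
{
  float Total = 0.0f;
  for (const Track& T : Tracks)
  {
    if (T.Keys.size() == 1)
    {
      // Common case: handled directly in the loop.
      Total += T.Keys[0];
    }
    else if (!T.Keys.empty())
    {
      // Rare case: a real function call, with the heavy code kept elsewhere.
      Total += EvaluateMultiKey(T, Alpha);
    }
  }
  return Total;
}
```

The loop body stays small, and the price of a function call is only paid on the rare path. Whether that trade is worth it depends entirely on how skewed your data is, so measure before and after.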

That’s all there is to it… if we now look at the disassembly of GetPoseRotations(), it comes to a respectable 100 lines of assembly language (382 bytes):-

0x1410d1130	Block 1:
0x1410d1130	mov rax, rsp
0x1410d1133	mov qword ptr [rax+0x8], rbx
0x1410d1137	mov qword ptr [rax+0x10], rbp
0x1410d113b	mov qword ptr [rax+0x18], rsi
0x1410d113f	push rdi
0x1410d1140	push r14
0x1410d1142	push r15
0x1410d1144	sub rsp, 0x90
0x1410d114b	movsxd rsi, dword ptr [r8+0x8]
0x1410d114f	movaps xmmword ptr [rax-0x28], xmm6
0x1410d1153	movaps xmmword ptr [rax-0x38], xmm7
0x1410d1157	movss xmm7, dword ptr [rsp+0xd0]
0x1410d1160	xor ebx, ebx
0x1410d1162	mov rdi, r9
0x1410d1165	mov r14, r8
0x1410d1168	mov r15, rdx
0x1410d116b	mov rbp, rcx
0x1410d116e	movaps xmm6, xmm7
0x1410d1171	divss xmm6, dword ptr [r9+0x68]
0x1410d1177	test rsi, rsi
0x1410d117a	jle 0x1410d1285 <Block 10>
0x1410d1180	Block 2:
0x1410d1180	movaps xmmword ptr [rax-0x48], xmm8
0x1410d1185	movss xmm8, dword ptr [rip+0xfb1d4e]
0x1410d118e	movaps xmmword ptr [rax-0x58], xmm9
0x1410d1193	xorps xmm9, xmm9
0x1410d1197	nop word ptr [rax+rax*1], ax
0x1410d11a0	Block 3:
0x1410d11a0	mov rcx, qword ptr [r14]
0x1410d11a3	movsxd rax, dword ptr [rcx+rbx*8]
0x1410d11a7	lea rdx, ptr [rax+rax*2]
0x1410d11ab	mov eax, dword ptr [rcx+rbx*8+0x4]
0x1410d11af	shl eax, 0x2
0x1410d11b2	shl rdx, 0x4
0x1410d11b6	add rdx, qword ptr [r15]
0x1410d11b9	movsxd rcx, eax
0x1410d11bc	mov rax, qword ptr [rdi+0xe0]
0x1410d11c3	movsxd r9, dword ptr [rax+rcx*4+0x8]
0x1410d11c8	mov r8d, dword ptr [rax+rcx*4+0xc]
0x1410d11cd	add r9, qword ptr [rdi+0x108]
0x1410d11d4	cmp r8d, 0x1
0x1410d11d8	jnz 0x1410d1251 <Block 7>
0x1410d11da	Block 4:
0x1410d11da	movss xmm2, dword ptr [r9]
0x1410d11df	movss xmm3, dword ptr [r9+0x4]
0x1410d11e5	movss xmm4, dword ptr [r9+0x8]
0x1410d11eb	movaps xmm5, xmm8
0x1410d11ef	movaps xmm0, xmm2
0x1410d11f2	movaps xmm1, xmm3
0x1410d11f5	mulss xmm0, xmm2
0x1410d11f9	mulss xmm1, xmm3
0x1410d11fd	subss xmm5, xmm0
0x1410d1201	movss dword ptr [rsp+0x40], xmm2
0x1410d1207	movss dword ptr [rsp+0x44], xmm3
0x1410d120d	movaps xmm0, xmm4
0x1410d1210	mulss xmm0, xmm4
0x1410d1214	subss xmm5, xmm1
0x1410d1218	movss dword ptr [rsp+0x48], xmm4
0x1410d121e	subss xmm5, xmm0
0x1410d1222	comiss xmm5, xmm9
0x1410d1226	jbe 0x1410d123f <Block 6>
0x1410d1228	Block 5:
0x1410d1228	xorps xmm0, xmm0
0x1410d122b	sqrtss xmm0, xmm5
0x1410d122f	movss dword ptr [rsp+0x4c], xmm0
0x1410d1235	movaps xmm0, xmmword ptr [rsp+0x40]
0x1410d123a	movaps xmmword ptr [rdx], xmm0
0x1410d123d	jmp 0x1410d126d <Block 8>
0x1410d123f	Block 6:
0x1410d123f	mov dword ptr [rsp+0x4c], 0x0
0x1410d1247	movaps xmm0, xmmword ptr [rsp+0x40]
0x1410d124c	movaps xmmword ptr [rdx], xmm0
0x1410d124f	jmp 0x1410d126d <Block 8>
0x1410d1251	Block 7:
0x1410d1251	movss dword ptr [rsp+0x30], xmm6
0x1410d1257	movss dword ptr [rsp+0x28], xmm7
0x1410d125d	mov dword ptr [rsp+0x20], r8d
0x1410d1262	mov r8, rdi
0x1410d1265	mov rcx, rbp
0x1410d1268	call 0x1410c7fd0 <AEFVariableKeyLerp<FORMAT>::GetBoneAtomRotation>
0x1410d126d	Block 8:
0x1410d126d	inc rbx
0x1410d1270	cmp rbx, rsi
0x1410d1273	jl 0x1410d11a0 <Block 3>
0x1410d1279	Block 9:
0x1410d1279	movaps xmm9, xmmword ptr [rsp+0x50]
0x1410d127f	movaps xmm8, xmmword ptr [rsp+0x60]
0x1410d1285	Block 10:
0x1410d1285	movaps xmm7, xmmword ptr [rsp+0x70]
0x1410d128a	lea r11, ptr [rsp+0x90]
0x1410d1292	movaps xmm6, xmmword ptr [r11-0x10]
0x1410d1297	mov rbx, qword ptr [r11+0x20]
0x1410d129b	mov rbp, qword ptr [r11+0x28]
0x1410d129f	mov rsi, qword ptr [r11+0x30]
0x1410d12a3	mov rsp, r11
0x1410d12a6	pop r15
0x1410d12a8	pop r14
0x1410d12aa	pop rdi
0x1410d12ab	ret

And, of course, performance is much improved, as you might expect. VTune no longer highlights the code above as being a bottleneck for us… though the memory copies that I mentioned earlier have now been brought to the fore. Hopefully those will be addressed in a future revision of UE4 – if that doesn’t happen soon, though, I’m sure we’ll get to it ourselves.


Credit(s): Robert Troughton (Coconut Lizard)
Status: Currently unimplemented in 4.12

