The Battle of the Lean and the Inlined Bone Functions
July 12, 2016
Inline Functionality
Using inline functions can make your program faster because they eliminate the overhead associated with function calls. Functions expanded inline are subject to code optimizations not available to normal functions. (MSDN)
Ah, thank the heavens for inline functions…
In the future, compilers may be able to do a better job of making inline decisions than programmers.
(Randy Meyers, Dr Dobbs, July 1st 2002)
Flash-forward to 2016 and let’s revisit Randy’s thinking. He was absolutely correct: compilers have become far more intelligent since 2002. It makes sense – things got a whole lot more complicated. Developers, particularly game developers, need to think about multiple target platforms with potentially HUGE differences in how their CPUs work and perform. These days the compiler is, in most cases, the best placed to decide whether or not a function should be inlined. __inline is really just a lightweight hint to the compiler, though. It’s at the compiler’s discretion whether or not a function is really inlined – if it determines that inlining would be a bad idea, for example because the function is long and likely to hurt instruction cache performance, then it will just ignore the programmer’s recommendation.
With that, __forceinline comes into play of course. This is a much stronger hint to the compiler… a “please, I really want to inline this”. It’s a nice little compiler extension (Microsoft-specific, rather than part of standard C++) for the cases where you have a tiny piece of code that definitely should be inlined, just in case the compiler would otherwise disagree.
Here’s what MSDN has to say about this:-
The __forceinline keyword overrides the cost/benefit analysis and relies on the judgment of the programmer instead. Exercise caution when using __forceinline. Indiscriminate use of __forceinline can result in larger code with only marginal performance gains or, in some cases, even performance losses (due to increased paging of a larger executable, for example).
So basically, we can use it – but we shouldn’t use it too much.
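To make the difference between the two hints concrete, here is a minimal sketch. SKETCH_FORCEINLINE, Lerp(), Square() and CheckInlineSketch() are names invented for this example (they are not UE4's), and the non-Microsoft branch assumes a GCC- or Clang-style always_inline attribute:

```cpp
#include <cassert>

// SKETCH_FORCEINLINE is a made-up macro for this example, not UE4's FORCEINLINE.
#if defined(_MSC_VER)
    #define SKETCH_FORCEINLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__)
    #define SKETCH_FORCEINLINE inline __attribute__((always_inline))
#else
    #define SKETCH_FORCEINLINE inline   // unknown compiler: fall back to a plain hint
#endif

// Plain hint: the compiler weighs cost against benefit and may decline.
inline float Lerp(float A, float B, float Alpha)
{
    return A + (B - A) * Alpha;
}

// Strong request: asks the compiler to skip its own cost/benefit analysis.
SKETCH_FORCEINLINE float Square(float X)
{
    return X * X;
}

// Small usage check: both functions behave identically whether inlined or not.
inline void CheckInlineSketch()
{
    assert(Lerp(0.0f, 10.0f, 0.5f) == 5.0f);
    assert(Square(3.0f) == 9.0f);
}
```

Neither keyword changes what the function computes; the only difference is how much say the compiler keeps over the generated code.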
Inlining In UE4
Why is this relevant to UE4? Well, looking at the source code as of today, here’re some stats for you relating to the code available through GitHub (nb. I’m excluding the thirdparty folder here as there’s not much we’d want to do with that):-
- There are 7856 references to FORCEINLINE (defined as __forceinline when building with Visual C++; other compilers get their own always-inline equivalent);
- There’s 1 __forceinline (a mistake – this should of course be FORCEINLINE) in D3D12Resources.h;
- There’re 136 FORCENOINLINE references.
Hmm… “Exercise caution” and “7856 references” don’t exactly roll together.
This is a common issue with large code bases. Inlining of functions can definitely help performance – but it shouldn’t just be applied everywhere. Here’s an example of something that I -have- seen, recently, and which leads to a very poorly performing piece of code:-
- A() is approx 30 lines of fairly complex code… within it are 3 separate calls to B().
- B() is set to be FORCEINLINE’d and runs to around 40 lines. B() has 3 calls to C().
- C() is also FORCEINLINE’d with 35 lines of code, 3 calls to D() and 5 calls to E().
- D() is FORCEINLINE’d and has 8 lines of code.
- E() is also FORCEINLINE’d with 5 lines of code.
- Totalling up, we’ll have 30 + 3 * ( 40 + 3 * ( 35 + 3 * 8 + 5 * 5 ) ) = 906 lines of code.
- If none of that was inlined at all, we’d have just 30 + 40 + 35 + 8 + 5 = 118 lines.
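The totals in the list above can be double-checked at compile time. This is only a sketch of the arithmetic; the constant names are invented for the example and "lines" stands in loosely for generated code size:

```cpp
// Lines of code in each function body, taken from the example above.
constexpr int LinesA = 30;
constexpr int LinesB = 40;
constexpr int LinesC = 35;
constexpr int LinesD = 8;
constexpr int LinesE = 5;

// Size of each function once every callee is force-inlined at every call site.
constexpr int ExpandedC = LinesC + 3 * LinesD + 5 * LinesE;  // C calls D 3 times and E 5 times
constexpr int ExpandedB = LinesB + 3 * ExpandedC;            // B calls C 3 times
constexpr int ExpandedA = LinesA + 3 * ExpandedB;            // A calls B 3 times

// Size if nothing is inlined: each body exists exactly once.
constexpr int NotInlined = LinesA + LinesB + LinesC + LinesD + LinesE;

static_assert(ExpandedA == 906, "fully inlined total");
static_assert(NotInlined == 118, "non-inlined total");
```

The multipliers compound at each level of nesting, so three levels of "only 3 calls" already gives well over seven times as much code.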
To see whether inlining helps or not, you really need to profile the code. Sometimes, as in the example that follows, it’s immediately apparent that it’s not (quite) working… it doesn’t mean that we should just disable inlining completely – but that we really need to investigate carefully to decide.
Investigation of Some Animation Code
During a recent profiling run of our server standalone, we were seeing large amounts of time spent in PopulateFromAnimation(). Let’s investigate how that function works…
// Populates this pose from the supplied animation and track data
void PopulateFromAnimation(
    const UAnimSequence& Seq,
    const BoneTrackArray& RotationTracks,
    const BoneTrackArray& TranslationTracks,
    const BoneTrackArray& ScaleTracks,
    float Time)
{
    // @todo fixme
    FTransformArray LocalBones;
    LocalBones = this->Bones;

    AnimationFormat_GetAnimationPose(
        LocalBones, //@TODO:@ANIMATION: Nasty hack
        RotationTracks,
        TranslationTracks,
        ScaleTracks,
        Seq,
        Time);
    this->Bones = LocalBones;
}
“@todo fixme” and “@TODO:@ANIMATION: Nasty hack” gave some warning signs – but changing those would be a bigger task and I believe Epic’s programmers may already be looking into those… the issue they’re referring to, I believe, being that this->Bones is being copied to LocalBones, only to be reinstated after the call to AnimationFormat_GetAnimationPose() … it definitely seems like there should be a better way to do that which would prevent the copies… but, anyway, that’s not our concern today… let’s look at the called function within the above:-
void AnimationFormat_GetAnimationPose(
    FTransformArray& Atoms,
    const BoneTrackArray& RotationPairs,
    const BoneTrackArray& TranslationPairs,
    const BoneTrackArray& ScalePairs,
    const UAnimSequence& Seq,
    float Time)
{
    // decompress the translation component using the proper method
    checkSlow(Seq.TranslationCodec != NULL);
    if (TranslationPairs.Num() > 0)
    {
        ((AnimEncoding*)Seq.TranslationCodec)->GetPoseTranslations(Atoms, TranslationPairs, Seq, Time);
    }

    // decompress the rotation component using the proper method
    checkSlow(Seq.RotationCodec != NULL);
    ((AnimEncoding*)Seq.RotationCodec)->GetPoseRotations(Atoms, RotationPairs, Seq, Time);

    checkSlow(Seq.ScaleCodec != NULL);
    // we allow scale key to be empty
    if (Seq.CompressedScaleOffsets.IsValid())
    {
        ((AnimEncoding*)Seq.ScaleCodec)->GetPoseScales(Atoms, ScalePairs, Seq, Time);
    }
}
Pretty simple function, again… including 3 calls… GetPoseTranslations(), GetPoseRotations() and GetPoseScales(). In our profiling, we were actually seeing the most time spent in GetPoseRotations(). There are 3 versions of this – the first in AnimEncoding.cpp, the next in AnimEncoding_ConstantKeyLerp.h and the last in AnimEncoding_VariableKeyLerp.h. It’s actually only the latter 2 that we’re concerned with here. Let’s take a look at one of these:-
template<int32 FORMAT>
FORCEINLINE_DEBUGGABLE void AEFVariableKeyLerp<FORMAT>::GetPoseRotations(
    FTransformArray& Atoms,
    const BoneTrackArray& DesiredPairs,
    const UAnimSequence& Seq,
    float Time)
{
    const int32 PairCount = DesiredPairs.Num();
    const float RelativePos = Time / (float)Seq.SequenceLength;

    for (int32 PairIndex=0; PairIndex<PairCount; ++PairIndex)
    {
        const BoneTrackPair& Pair = DesiredPairs[PairIndex];
        const int32 TrackIndex = Pair.TrackIndex;
        const int32 AtomIndex = Pair.AtomIndex;
        FTransform& BoneAtom = Atoms[AtomIndex];

        const int32* RESTRICT TrackData = Seq.CompressedTrackOffsets.GetData() + (TrackIndex*4);
        const int32 RotKeysOffset = *(TrackData+2);
        const int32 NumRotKeys = *(TrackData+3);
        const uint8* RESTRICT RotStream = Seq.CompressedByteStream.GetData()+RotKeysOffset;

        // call the decoder directly (not through the vtable)
        AEFVariableKeyLerp<FORMAT>::GetBoneAtomRotation(BoneAtom, Seq, RotStream, NumRotKeys, Time, RelativePos);
    }
}
So, we have our first FORCEINLINE here (in the form of FORCEINLINE_DEBUGGABLE, which behaves as FORCEINLINE outside of debug builds). Note the call to GetBoneAtomRotation() – let’s take a look at that:-
template<int32 FORMAT>
FORCEINLINE_DEBUGGABLE void AEFVariableKeyLerp<FORMAT>::GetBoneAtomRotation(
    FTransform& OutAtom,
    const UAnimSequence& Seq,
    const uint8* RESTRICT RotStream,
    int32 NumRotKeys,
    float Time,
    float RelativePos)
{
    if (NumRotKeys == 1)
    {
        // For a rotation track of n=1 keys, the single key is packed as an FQuatFloat96NoW.
        FQuat R0;
        DecompressRotation<ACF_Float96NoW>( R0 , RotStream, RotStream );
        OutAtom.SetRotation(R0);
    }
    else
    {
        ... lots of complex code here...
    }
}
This is a fairly long function and another with FORCEINLINE. All of this will be merged with the code in GetPoseRotations() and then inlined into AnimationFormat_GetAnimationPose().
Firing up VTune, we were seeing a large chunk of time, ~60%, spent on a single line of code… oddly, it was possibly the simplest line in the whole thing:-
if (NumRotKeys == 1)
It would be nice to show you the disassembly for GetPoseRotations() here … but it runs to 1,514 bytes of code across 444 lines… that’s not a good sign. Let me just show you an important part:-
0x1410d3675 Block 5:
0x1410d3675 xorps xmm0, xmm0
0x1410d3678 sqrtss xmm0, xmm5
0x1410d367c movss dword ptr [rsp+0xc], xmm0    // 0.00142252
0x1410d3682 movaps xmm0, xmmword ptr [rsp]
0x1410d3686 jmp 0x1410d3ab7 <Block 62>         // 0.0255999
0x1410d368b Block 6:
0x1410d368b mov dword ptr [rsp+0xc], 0x0
0x1410d3693 movaps xmm0, xmmword ptr [rsp]
The two values you can see to the right give a way of measuring the time spent on each line. Note the time on the JMP… this is our 60% (the total measurement for the entire function was 0.04267214).
Analyzing the data coming through this function, something struck me: 99% of the time, we were seeing NumRotKeys come through as “1”. This may be something particular to our application – but I believe it’s likely to be a common case with video games, at least. Regardless, with 444 lines of disassembly, something needed to be done.
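One way to gather that kind of figure is to tally the key counts in an instrumented build. The sketch below is only an illustration of the idea: FKeyCountStats and its members are invented names, not part of UE4, and it is not necessarily how the 99% above was measured.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of UE4): tallies how often each branch would be taken.
struct FKeyCountStats
{
    std::uint64_t SingleKey = 0; // tracks seen with exactly one rotation key
    std::uint64_t MultiKey  = 0; // tracks seen with any other key count

    void Record(std::int32_t NumRotKeys)
    {
        assert(NumRotKeys >= 0); // a negative key count would mean corrupt data
        if (NumRotKeys == 1)
        {
            ++SingleKey;
        }
        else
        {
            ++MultiKey;
        }
    }

    // Fraction of recorded tracks that took the single-key path (0 if nothing recorded).
    double SingleKeyFraction() const
    {
        const std::uint64_t Total = SingleKey + MultiKey;
        return Total == 0 ? 0.0 : static_cast<double>(SingleKey) / static_cast<double>(Total);
    }
};
```

Calling Record(NumRotKeys) once per loop iteration in GetPoseRotations(), then logging SingleKeyFraction() at the end of a run, would show whether a single-key fast path is worth having for a particular game's data.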
Our Optimization of the Animation System
First up, let’s tackle the expensive line of code that you see above. To do this, we need to understand, just a little, how modern processors work… they’re very clever, see, and will often start work on instructions that are quite far ahead of the current instruction pointer – using branch prediction, speculative execution, etc. The processor might do some of the math on either side of the NumRotKeys==1 test before knowing whether the test passes or fails. And that’s where our bottleneck appears to come in: the processor is running ahead doing some pretty complicated math (for the case where NumRotKeys is larger than 1) only for that math not to be needed. There’s no real way that we can hint to the processor that it doesn’t need to do this… so let’s, instead, fix it another way… let’s separate that code so that each case lives in a different function.
Since the calling function, GetPoseRotations(), is so simple and contains a neat loop, let’s just move our most common case into there – that way, we’re effectively inlining the common case and leaving the rarer case in a function. At the same time, let’s stop force-inlining GetPoseRotations() – it’s much better to let the compiler decide whether or not this should be done:-
template<int32 FORMAT>
void AEFVariableKeyLerp<FORMAT>::GetPoseRotations(
    FTransformArray& Atoms,
    const BoneTrackArray& DesiredPairs,
    const UAnimSequence& Seq,
    float Time)
{
    const int32 PairCount = DesiredPairs.Num();
    const float RelativePos = Time / (float)Seq.SequenceLength;

    for (int32 PairIndex=0; PairIndex<PairCount; ++PairIndex)
    {
        const BoneTrackPair& Pair = DesiredPairs[PairIndex];
        const int32 TrackIndex = Pair.TrackIndex;
        const int32 AtomIndex = Pair.AtomIndex;
        FTransform& BoneAtom = Atoms[AtomIndex];

        const int32* RESTRICT TrackData = Seq.CompressedTrackOffsets.GetData() + (TrackIndex*4);
        const int32 RotKeysOffset = *(TrackData+2);
        const int32 NumRotKeys = *(TrackData+3);
        const uint8* RESTRICT RotStream = Seq.CompressedByteStream.GetData()+RotKeysOffset;

        if (NumRotKeys == 1)
        {
            FQuat R0;
            DecompressRotation<ACF_Float96NoW>(R0, RotStream, RotStream);
            BoneAtom.SetRotation(R0);
        }
        else
        {
            // call the decoder directly (not through the vtable)
            AEFVariableKeyLerp<FORMAT>::GetBoneAtomRotation(BoneAtom, Seq, RotStream, NumRotKeys, Time, RelativePos);
        }
    }
}
And then we can go ahead and comment out that part within GetBoneAtomRotation(). In this case, to stop the performance problem rearing its head again, we make sure that the function won’t be inlined this time by using FORCENOINLINE:-
template<int32 FORMAT>
FORCENOINLINE void AEFVariableKeyLerp<FORMAT>::GetBoneAtomRotation(
    FTransform& OutAtom,
    const UAnimSequence& Seq,
    const uint8* RESTRICT RotStream,
    int32 NumRotKeys,
    float Time,
    float RelativePos)
{
    // if (NumRotKeys == 1)
    // {
    //     // For a rotation track of n=1 keys, the single key is packed as an FQuatFloat96NoW.
    //     FQuat R0;
    //     DecompressRotation<ACF_Float96NoW>( R0 , RotStream, RotStream );
    //     OutAtom.SetRotation(R0);
    // }
    // else
    {
        ... lots of code
    }
}
Additionally, we took FORCEINLINE off GetPoseTranslations() and GetPoseScales().
And we go ahead and make these same changes to AnimEncoding_ConstantKeyLerp.h.
That’s all there is to it… if we now look at the disassembly of GetPoseRotations(), it comes to a respectable 100 lines of assembly language (382 bytes):-
0x1410d1130 Block 1:
0x1410d1130 mov rax, rsp
0x1410d1133 mov qword ptr [rax+0x8], rbx
0x1410d1137 mov qword ptr [rax+0x10], rbp
0x1410d113b mov qword ptr [rax+0x18], rsi
0x1410d113f push rdi
0x1410d1140 push r14
0x1410d1142 push r15
0x1410d1144 sub rsp, 0x90
0x1410d114b movsxd rsi, dword ptr [r8+0x8]
0x1410d114f movaps xmmword ptr [rax-0x28], xmm6
0x1410d1153 movaps xmmword ptr [rax-0x38], xmm7
0x1410d1157 movss xmm7, dword ptr [rsp+0xd0]
0x1410d1160 xor ebx, ebx
0x1410d1162 mov rdi, r9
0x1410d1165 mov r14, r8
0x1410d1168 mov r15, rdx
0x1410d116b mov rbp, rcx
0x1410d116e movaps xmm6, xmm7
0x1410d1171 divss xmm6, dword ptr [r9+0x68]
0x1410d1177 test rsi, rsi
0x1410d117a jle 0x1410d1285 <Block 10>
0x1410d1180 Block 2:
0x1410d1180 movaps xmmword ptr [rax-0x48], xmm8
0x1410d1185 movss xmm8, dword ptr [rip+0xfb1d4e]
0x1410d118e movaps xmmword ptr [rax-0x58], xmm9
0x1410d1193 xorps xmm9, xmm9
0x1410d1197 nop word ptr [rax+rax*1], ax
0x1410d11a0 Block 3:
0x1410d11a0 mov rcx, qword ptr [r14]
0x1410d11a3 movsxd rax, dword ptr [rcx+rbx*8]
0x1410d11a7 lea rdx, ptr [rax+rax*2]
0x1410d11ab mov eax, dword ptr [rcx+rbx*8+0x4]
0x1410d11af shl eax, 0x2
0x1410d11b2 shl rdx, 0x4
0x1410d11b6 add rdx, qword ptr [r15]
0x1410d11b9 movsxd rcx, eax
0x1410d11bc mov rax, qword ptr [rdi+0xe0]
0x1410d11c3 movsxd r9, dword ptr [rax+rcx*4+0x8]
0x1410d11c8 mov r8d, dword ptr [rax+rcx*4+0xc]
0x1410d11cd add r9, qword ptr [rdi+0x108]
0x1410d11d4 cmp r8d, 0x1
0x1410d11d8 jnz 0x1410d1251 <Block 7>
0x1410d11da Block 4:
0x1410d11da movss xmm2, dword ptr [r9]
0x1410d11df movss xmm3, dword ptr [r9+0x4]
0x1410d11e5 movss xmm4, dword ptr [r9+0x8]
0x1410d11eb movaps xmm5, xmm8
0x1410d11ef movaps xmm0, xmm2
0x1410d11f2 movaps xmm1, xmm3
0x1410d11f5 mulss xmm0, xmm2
0x1410d11f9 mulss xmm1, xmm3
0x1410d11fd subss xmm5, xmm0
0x1410d1201 movss dword ptr [rsp+0x40], xmm2
0x1410d1207 movss dword ptr [rsp+0x44], xmm3
0x1410d120d movaps xmm0, xmm4
0x1410d1210 mulss xmm0, xmm4
0x1410d1214 subss xmm5, xmm1
0x1410d1218 movss dword ptr [rsp+0x48], xmm4
0x1410d121e subss xmm5, xmm0
0x1410d1222 comiss xmm5, xmm9
0x1410d1226 jbe 0x1410d123f <Block 6>
0x1410d1228 Block 5:
0x1410d1228 xorps xmm0, xmm0
0x1410d122b sqrtss xmm0, xmm5
0x1410d122f movss dword ptr [rsp+0x4c], xmm0
0x1410d1235 movaps xmm0, xmmword ptr [rsp+0x40]
0x1410d123a movaps xmmword ptr [rdx], xmm0
0x1410d123d jmp 0x1410d126d <Block 8>
0x1410d123f Block 6:
0x1410d123f mov dword ptr [rsp+0x4c], 0x0
0x1410d1247 movaps xmm0, xmmword ptr [rsp+0x40]
0x1410d124c movaps xmmword ptr [rdx], xmm0
0x1410d124f jmp 0x1410d126d <Block 8>
0x1410d1251 Block 7:
0x1410d1251 movss dword ptr [rsp+0x30], xmm6
0x1410d1257 movss dword ptr [rsp+0x28], xmm7
0x1410d125d mov dword ptr [rsp+0x20], r8d
0x1410d1262 mov r8, rdi
0x1410d1265 mov rcx, rbp
0x1410d1268 call 0x1410c7fd0 <?GetBoneAtomRotation@?$AEFVariableKeyLerp@$00@@UEAAXAEAUFTransform@@AEBVUAnimSequence@@PEIBEHMM@Z>
0x1410d126d Block 8:
0x1410d126d inc rbx
0x1410d1270 cmp rbx, rsi
0x1410d1273 jl 0x1410d11a0 <Block 3>
0x1410d1279 Block 9:
0x1410d1279 movaps xmm9, xmmword ptr [rsp+0x50]
0x1410d127f movaps xmm8, xmmword ptr [rsp+0x60]
0x1410d1285 Block 10:
0x1410d1285 movaps xmm7, xmmword ptr [rsp+0x70]
0x1410d128a lea r11, ptr [rsp+0x90]
0x1410d1292 movaps xmm6, xmmword ptr [r11-0x10]
0x1410d1297 mov rbx, qword ptr [r11+0x20]
0x1410d129b mov rbp, qword ptr [r11+0x28]
0x1410d129f mov rsi, qword ptr [r11+0x30]
0x1410d12a3 mov rsp, r11
0x1410d12a6 pop r15
0x1410d12a8 pop r14
0x1410d12aa pop rdi
0x1410d12ab ret
And, of course, performance is much improved, as you might expect. VTune no longer highlights the code above as being a bottleneck for us… though the memory copies that I mentioned earlier have now been brought to the fore – hopefully those will be addressed in a future revision of UE4 – if not soon, though, I’m sure we’ll get to that ourselves.
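Stripped of the UE4 specifics, the change to GetPoseRotations() is a hot-path/cold-path split: the common case stays in the loop and the rare case is pushed out of line. Here is a standalone sketch of that pattern. FTrack, SampleMultiKey(), SampleAll() and SKETCH_NOINLINE are all invented for the example, and plain floats with a simple lerp stand in for the real quaternion decompression. It assumes every track has at least one key and that RelativePos lies in [0, 1].

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// SKETCH_NOINLINE is a made-up macro for this example, not UE4's FORCENOINLINE.
#if defined(_MSC_VER)
    #define SKETCH_NOINLINE __declspec(noinline)
#elif defined(__GNUC__) || defined(__clang__)
    #define SKETCH_NOINLINE __attribute__((noinline))
#else
    #define SKETCH_NOINLINE
#endif

// Hypothetical stand-in for a compressed rotation track.
struct FTrack
{
    std::vector<float> Keys;
};

// Rare, heavier path: kept out of line so it doesn't bloat the loop below.
SKETCH_NOINLINE float SampleMultiKey(const FTrack& Track, float RelativePos)
{
    assert(Track.Keys.size() > 1);
    assert(RelativePos >= 0.0f && RelativePos <= 1.0f);

    const float Scaled = RelativePos * static_cast<float>(Track.Keys.size() - 1);
    const std::size_t Index0 = static_cast<std::size_t>(Scaled);
    const std::size_t Index1 = std::min(Index0 + 1, Track.Keys.size() - 1);
    const float Alpha = Scaled - static_cast<float>(Index0);
    return Track.Keys[Index0] + (Track.Keys[Index1] - Track.Keys[Index0]) * Alpha;
}

// The loop handles the common single-key case itself and calls out for the rest.
void SampleAll(const std::vector<FTrack>& Tracks, float RelativePos, std::vector<float>& Out)
{
    Out.resize(Tracks.size());
    for (std::size_t Index = 0; Index < Tracks.size(); ++Index)
    {
        const FTrack& Track = Tracks[Index];
        if (Track.Keys.size() == 1)
        {
            Out[Index] = Track.Keys[0];                        // common case, handled in place
        }
        else
        {
            Out[Index] = SampleMultiKey(Track, RelativePos);   // rare case, real function call
        }
    }
}
```

Whether a split like this pays off depends entirely on the data; it only makes sense once profiling has shown that one branch dominates.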
Credit(s): Robert Troughton (Coconut Lizard)
Status: Currently unimplemented in 4.12