Speeding Up The Time Limit Controls
July 13, 2016
Analyzing IsTimeLimitExceeded()
When you have a process that takes a long time to complete, and that process is running on the main thread, it’s common to use some sort of time-limiting code to make sure that other threads aren’t completely stalled. UE4 has a few of these, usually using a version of IsTimeLimitExceeded(). The FLinkerLoad class is one – and that’s the one we’re going to look at here.
During a vTune run of our server build, we noticed that IsTimeLimitExceeded() itself was taking up quite a bit of CPU time… so we delved in to see whether it could be optimized. Here’s what we saw in vTune:-
COREUOBJECT_API bool IsTimeLimitExceeded( const TCHAR* CurrentTask, int32 Granularity = 1 );
The final thing I want to show you is an example instance from where the function is called:-
while (Status == LINKER_TimedOut && (!bUseTimeLimit || (bUseFullTimeLimit && !IsTimeLimitExceeded(TEXT("Checking Full Timer")))) );
Now, let’s have a look at the resultant assembly code for this…
First, the caller:-
00007FF61932E789 cmp ebx,2 00007FF61932E78C jne FLinkerLoad::Tick+223h (07FF61932E7C3h) 00007FF61932E78E cmp byte ptr [rdi+6C6h],bpl 00007FF61932E795 je FLinkerLoad::Tick+83h (07FF61932E623h) 00007FF61932E79B cmp byte ptr [rdi+6C7h],bpl 00007FF61932E7A2 je FLinkerLoad::Tick+21Fh (07FF61932E7BFh) 00007FF61932E7A4 lea r8d,[rbx-1] 00007FF61932E7A8 lea rdx,[p+3FBCh (07FF61AED0D48h)] 00007FF61932E7AF mov rcx,rdi 00007FF61932E7B2 call FLinkerLoad::IsTimeLimitExceeded (07FF61931B6D0h) 00007FF61932E7B7 test al,al 00007FF61932E7B9 je FLinkerLoad::Tick+83h (07FF61932E623h)
Now FLinkerLoad::IsTimeLimitExceeded() itself:-
00007FF61931B6D0 mov qword ptr [rsp+10h],rbx 00007FF61931B6D5 push rdi 00007FF61931B6D6 sub rsp,40h 00007FF61931B6DA inc dword ptr [rcx+6C8h] 00007FF61931B6E0 cmp byte ptr [rcx+6C5h],0 00007FF61931B6E7 mov rdi,rdx 00007FF61931B6EA mov rbx,rcx 00007FF61931B6ED jne FLinkerLoad::IsTimeLimitExceeded+0FAh (07FF61931B7CAh) 00007FF61931B6F3 cmp byte ptr [rcx+6C6h],0 00007FF61931B6FA je FLinkerLoad::IsTimeLimitExceeded+0FAh (07FF61931B7CAh) 00007FF61931B700 mov eax,dword ptr [rcx+6C8h] 00007FF61931B706 cdq 00007FF61931B707 idiv eax,r8d 00007FF61931B70A test edx,edx 00007FF61931B70C jne FLinkerLoad::IsTimeLimitExceeded+0FAh (07FF61931B7CAh) 00007FF61931B712 lea rcx,[rsp+50h] 00007FF61931B717 call qword ptr [__imp_QueryPerformanceCounter (07FF61ACA96E8h)] 00007FF61931B71D movss xmm1,dword ptr [rbx+6CCh] 00007FF61931B725 xorps xmm2,xmm2 00007FF61931B728 cvtsi2sd xmm2,qword ptr [rsp+50h] 00007FF61931B72F cvtps2pd xmm0,xmm1 00007FF61931B732 mulsd xmm2,mmword ptr [FGenericPlatformTime::SecondsPerCycle (07FF61B83BC10h)] 00007FF61931B73A addsd xmm2,mmword ptr [__real@4170000000000000 (07FF61ACD2EF8h)] 00007FF61931B742 subsd xmm2,mmword ptr [rbx+6D0h] 00007FF61931B74A comisd xmm2,xmm0 00007FF61931B74E cvtps2pd xmm0,xmm1 00007FF61931B751 mulsd xmm0,mmword ptr [__real@4004000000000000 (07FF61AEE1150h)] 00007FF61931B759 seta al 00007FF61931B75C mov byte ptr [rbx+6C5h],al 00007FF61931B762 comisd xmm2,xmm0 00007FF61931B766 jbe FLinkerLoad::IsTimeLimitExceeded+0E8h (07FF61931B7B8h) 00007FF61931B768 cmp byte ptr [LogStreaming (07FF61B963190h)],5 00007FF61931B76F jb FLinkerLoad::IsTimeLimitExceeded+0E8h (07FF61931B7B8h) 00007FF61931B771 mov rax,qword ptr [LogStreaming+8h (07FF61B963198h)] 00007FF61931B778 mulsd xmm2,mmword ptr [__real@408f400000000000 (07FF61ACD2EE8h)] 00007FF61931B780 lea r8,[rsp+50h] 00007FF61931B785 mov qword ptr [rsp+50h],rax 00007FF61931B78A lea rax,[p+7DA4h (07FF61AED4B30h)] 00007FF61931B791 lea rcx,[p+7F94h (07FF61AED4D20h)] 00007FF61931B798 movsd mmword ptr [rsp+30h],xmm2 00007FF61931B79E mov r9d,5 00007FF61931B7A4 mov edx,328h 00007FF61931B7A9 mov qword ptr [rsp+28h],rdi 00007FF61931B7AE mov qword ptr [rsp+20h],rax 00007FF61931B7B3 call FMsg::Logf_Internal__VA (07FF61921EA60h) 00007FF61931B7B8 movzx eax,byte ptr [rbx+6C5h] 00007FF61931B7BF mov rbx,qword ptr [rsp+58h] 00007FF61931B7C4 add rsp,40h 00007FF61931B7C8 pop rdi 00007FF61931B7C9 ret 00007FF61931B7CA movzx eax,byte ptr [rcx+6C5h] 00007FF61931B7D1 mov rbx,qword ptr [rsp+58h] 00007FF61931B7D6 add rsp,40h 00007FF61931B7DA pop rdi 00007FF61931B7DB ret
That’s not an insignificant amount of code, really, for such a simple function… all of this is just to decide, on an interval (set by Granularity), whether we’ve spent enough time on the current operation. If Granularity is 1, which it is by default and is in most cases, including the case I’ve cited above, then we’ll check our time limit every time through.
So … looking at IsTimeLimitExceeded()’s disassembly, here’s a breakdown:-
- Lines 1-3: initial setup
- Line 4: IsTimeLimitExceededCallCount++
- Lines 5-15: the first if() block
- Lines 16-25: double CurrentTime = FPlatformTime::Seconds(); and bTimeLimitExceeded = CurrentTime – TickStartTime > TimeLimit;
- Lines 26-31: if( (CurrentTime – TickStartTime) > (2.5 * TimeLimit) )
- Lines 32-45: UE_LOG(LogStreaming, Log, TEXT(“FLinkerLoad: %s took (less than) %5.2f ms”), CurrentTask, (CurrentTime – TickStartTime) * 1000);
- Lines 46-55: return bTimeLimitExceeded (nb. I really don’t know why this code is duplicated like this..).
Note in particular line 13 – this line is doing a divide just for the purpose of working out “(IsTimeLimitExceededCallCount % Granularity) == 0 )”. As mentioned before, though, Granularity is -usually- 1 … if only there was a way of getting this information to the compiler?
Optimizing IsTimeLimitExceeded()
There are a couple of things I identified as being optimization possibilities here:-
- We can get rid of the divide by eliminating the Granularity calculation – if Granularity is <= 1 then we don’t need to bother with the call-count calculation at all;
- The logging code, and checks around it, should be eliminated completely on test/release builds. Note that, as I mentioned earlier, we’re only seeing this at all because we set USE_LOGGING_IN_SHIPPING to be “1” rather than UE4’s default of “0”.
Anyway, here’s how I did it…
Firstly, I added a new define to the top of LinkerLoad.cpp:-
#define LOG_SINGLE_SLOW_LOADS_IN_ISTIMELIMITEDEXCEEDED !(UE_BUILD_TEST||UE_BUILD_SHIPPING)
Then I changed IsTimeLimitExceeded() to be:-
bool FLinkerLoad::IsTimeLimitExceeded( const TCHAR* CurrentTask, int32 Granularity ) { if ((!bTimeLimitExceeded) && (bUseTimeLimit) && ((Granularity <= 1) || ((++IsTimeLimitExceededCallCount % Granularity)== 0)) ) { double CurrentTime = FPlatformTime::Seconds(); bTimeLimitExceeded = CurrentTime - TickStartTime > TimeLimit; #if LOG_SINGLE_SLOW_LOADS_IN_ISTIMELIMITEDEXCEEDED if (!FPlatformProperties::HasEditorOnlyData()) { // Log single operations that take longer than timelimit. if( (CurrentTime - TickStartTime) > (2.5 * TimeLimit) ) { UE_LOG(LogStreaming, Log, TEXT("FLinkerLoad: %s took (less than) %5.2f ms"), CurrentTask, (CurrentTime - TickStartTime) * 1000); } } #endif // LOG_SINGLE_SLOW_LOADS_IN_ISTIMELIMITEDEXCEEDED } return bTimeLimitExceeded; }
Upon compiling, I was very happy to see that the compiler decided to inline all of this, making the final disassembly much cleaner. Here it is inlined into the caller:-
00007FF7797FCE7A cmp ebx,2 00007FF7797FCE7D jne FLinkerLoad::Tick+21Ah (07FF7797FCF0Ah) 00007FF7797FCE83 cmp byte ptr [rdi+6C6h],sil 00007FF7797FCE8A je FLinkerLoad::Tick+83h (07FF7797FCD73h) 00007FF7797FCE90 cmp byte ptr [rdi+6C7h],sil 00007FF7797FCE97 je FLinkerLoad::Tick+1FBh (07FF7797FCEEBh) 00007FF7797FCE99 cmp byte ptr [rdi+6C5h],sil 00007FF7797FCEA0 jne FLinkerLoad::Tick+1FBh (07FF7797FCEEBh) 00007FF7797FCEA2 lea rcx,[rsp+28h] 00007FF7797FCEA7 call qword ptr [__imp_QueryPerformanceCounter (07FF77B1796E8h)] 00007FF7797FCEAD movss xmm0,dword ptr [rdi+6CCh] 00007FF7797FCEB5 xorps xmm1,xmm1 00007FF7797FCEB8 cvtsi2sd xmm1,qword ptr [rsp+28h] 00007FF7797FCEBF cvtps2pd xmm0,xmm0 00007FF7797FCEC2 mulsd xmm1,mmword ptr [FGenericPlatformTime::SecondsPerCycle (07FF77BD0BC10h)] 00007FF7797FCECA addsd xmm1,xmm7 00007FF7797FCECE subsd xmm1,mmword ptr [rdi+6D0h] 00007FF7797FCED6 comisd xmm1,xmm0 00007FF7797FCEDA seta al 00007FF7797FCEDD mov byte ptr [rdi+6C5h],al 00007FF7797FCEE3 test al,al 00007FF7797FCEE5 je FLinkerLoad::Tick+83h (07FF7797FCD73h)
This is, of course, significantly faster. The same happened in the other parts of code calling FLinkerLoad::IsTimeLimitExceeded(), resulting in significantly improved load performance all round.
Credit(s): Robert Troughton (Coconut Lizard)
Status: Currently unimplemented in 4.12