A Huge PhysX Memory Churn Reduction

April 18, 2016

UPDATE: the fixes below were added to the code base in UE4.13

TLDR: A single PhysX function call was churning through 2.5 GB of temporary memory per minute – and the fix was easy!

It’s funny, really, that most of the big-hitting fixes/optimizations to UE4 seem to be so incredibly simple. It’s the smaller fixes which take time.

Memory churn is the allocation of temporary data structures that are discarded immediately, or almost immediately. There are often several ways to improve things:-

(1) remove the churn completely if it’s not needed – either by removing the need for the memory altogether or by putting a static buffer in place;
(2) reduce the churn by reducing how often the memory is needed or by using a resizable static buffer;
(3) other methods…?

For the code I’m going to show you here, we were able to go for option (1) with a static buffer.
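
To make the difference concrete, here is a minimal standalone sketch (plain C++, not UE4 or PhysX code) of the same temporary workload done with and without churn. The function names and the workload itself are made up for illustration:-

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical workload: sum the squares of 0..count-1 via a temporary buffer.

// Churning version: allocates and frees the temporary buffer on every call.
long long SumSquaresChurn(std::size_t count)
{
    std::vector<long long> temp(count); // heap allocation on every call
    for (std::size_t i = 0; i < count; ++i)
    {
        temp[i] = static_cast<long long>(i * i);
    }
    long long sum = 0;
    for (long long v : temp)
    {
        sum += v;
    }
    return sum; // temp is freed here
}

// Reusing version: the caller owns the buffer, which only grows when too small.
long long SumSquaresReuse(std::size_t count, std::vector<long long>& scratch)
{
    if (scratch.size() < count)
    {
        scratch.resize(count); // allocates only when the buffer is too small
    }
    assert(scratch.size() >= count);
    long long sum = 0;
    for (std::size_t i = 0; i < count; ++i)
    {
        scratch[i] = static_cast<long long>(i * i);
        sum += scratch[i];
    }
    return sum;
}
```

Called every frame with the same buffer, SumSquaresReuse() allocates once, on the first call; SumSquaresChurn() allocates and frees on every call.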

Profiling showed us that a lot of time was being spent doing memory allocations deep inside PhysX. The UE4 code triggering this was the set of calls to simulate(), such as this one:-

#if WITH_APEX
  PAScene->simulate(DeltaTime, bLastSubstep, SubstepTask);
#else
  PAScene->lockWrite();
  PAScene->simulate(DeltaTime, SubstepTask);
  PAScene->unlockWrite();
#endif

At first glance, we weren’t sure that we could do much… the “fix” would surely be deep within PhysX. Finding the right solution took us longer than it probably should’ve, along with a few chats with NVIDIA employees (shout-out to Gordon Yeoman, Pierre Terdiman, Phil Scott and Mike Skolones for their help). The declaration of simulate() makes it glaringly obvious:-

virtual void simulate(PxReal elapsedTime, physx::PxBaseTask* completionTask = NULL, void* scratchMemBlock = 0, PxU32 scratchMemBlockSize = 0, bool controlSimulation = true) = 0;

So let’s not drag this out… if a good size can be determined for it, we should of course be using that scratchMemBlock functionality. We create a static buffer, preallocated, and pass that through to simulate(). Passing the size of the buffer through ensures that nothing breaks if it’s not big enough – PhysX will then create its own temporary buffer.
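
The contract is roughly “use the caller’s block if it fits, otherwise fall back to allocating”. A minimal sketch of that pattern in plain C++ follows. RunStepWithScratch() is a hypothetical stand-in and says nothing about how PhysX implements this internally:-

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a library call that needs temporary working memory.
// Returns true if the caller's scratch block was used, false if it had to
// fall back to its own temporary allocation.
bool RunStepWithScratch(std::size_t bytesNeeded, void* scratchBlock, std::size_t scratchSize)
{
    // Precondition for this sketch: a null block must be passed with size zero.
    assert(scratchBlock != nullptr || scratchSize == 0);

    std::uint8_t* work = static_cast<std::uint8_t*>(scratchBlock);
    const bool usedScratch = (work != nullptr) && (scratchSize >= bytesNeeded);

    std::vector<std::uint8_t> fallback;
    if (!usedScratch)
    {
        fallback.resize(bytesNeeded); // this is the churn we want to avoid
        work = fallback.data();
    }

    if (bytesNeeded > 0)
    {
        std::memset(work, 0xAB, bytesNeeded); // stand-in for the real work
    }
    return usedScratch;
}
```

The point of the design is that an undersized block costs performance rather than correctness, which is what makes a “good enough” size guess safe to ship.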

Our solution, if you want to just go with this…

Above the body of FPhysScene::FPhysScene(), we add our static buffer:-

uint32 GSimulateScratchMemorySize = 524288; // 512k is more than enough (for our test cases)
uint8* GSimulateScratchMemory = nullptr; // allocated once, the first time an FPhysScene is constructed

We set it up within FPhysScene::FPhysScene():-

if (!GSimulateScratchMemory)
{
  GSimulateScratchMemorySize = PhysSetting->SimulateScratchMemorySize;
  GSimulateScratchMemory = (uint8*)FMemory::Malloc(GSimulateScratchMemorySize, 16); // PhysX needs 16-byte aligned memory
}
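
Since the alignment requirement is easy to break by accident, a cheap check before handing the block over can help. The sketch below is standalone C++; IsAligned() and GExampleScratch are hypothetical names, and alignas storage stands in for FMemory::Malloc so that the example compiles outside the engine:-

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kScratchAlignment = 16;  // alignment mentioned in the article
constexpr std::size_t kScratchSize = 524288;   // 512k, as in the article

// Hypothetical stand-in for the engine allocation: static, explicitly aligned storage.
alignas(16) static std::uint8_t GExampleScratch[kScratchSize];

// Returns true if ptr sits on a multiple of alignment (which must be a power of two).
bool IsAligned(const void* ptr, std::size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(ptr) & (alignment - 1)) == 0;
}
```

A debug-only assertion along these lines, placed next to the allocation, would catch a misaligned block long before it shows up as odd behaviour inside the simulation.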

And we change the calls to simulate() to use it. There are four of these:-

// OLD PScene->simulate(AveragedFrameTime[SceneType], Task);
PScene->simulate(AveragedFrameTime[SceneType], Task, GSimulateScratchMemory, GSimulateScratchMemorySize);

// OLD ApexScene->simulate(AveragedFrameTime[SceneType], true, Task);
ApexScene->simulate(AveragedFrameTime[SceneType], true, Task, GSimulateScratchMemory, GSimulateScratchMemorySize);

// OLD PAScene->simulate(DeltaTime, bLastSubstep, SubstepTask);
PAScene->simulate(DeltaTime, bLastSubstep, SubstepTask, GSimulateScratchMemory, GSimulateScratchMemorySize);

// OLD PAScene->simulate(DeltaTime, SubstepTask);
PAScene->simulate(DeltaTime, SubstepTask, GSimulateScratchMemory, GSimulateScratchMemorySize);

Along with externs in PhysSubstepTask.cpp:-

extern uint32 GSimulateScratchMemorySize;
extern uint8* GSimulateScratchMemory;

Finally, to make the 512k “guess” acceptable, we allow it to be overridden through the INIs by adding this to PhysicsSettings.h within the UPhysicsSettings class:-

  /** Amount of memory to reserve for PhysX simulate() */
  UPROPERTY(config, EditAnywhere, Category = Constants)
  int32 SimulateScratchMemorySize;
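
One thing to watch: the constructor code above overwrites the 512k default with whatever the config holds, and the property is a signed int32 while the global is a uint32. A small sanitizing helper is one way to guard against an unset or odd value. This is a sketch under assumptions: SanitizeScratchSize() is a hypothetical helper, an unset property is assumed to read as zero, and the round-up to a 16k multiple follows the requirement stated in the PxScene::simulate() documentation, which is worth checking against your PhysX version:-

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kDefaultScratchSize = 524288; // 512k default from the article
constexpr std::uint32_t kScratchGranularity = 16384;  // 16k multiple (assumption, see lead-in)

// Hypothetical helper: turns a signed config value into a usable buffer size.
std::uint32_t SanitizeScratchSize(std::int32_t configured)
{
    if (configured <= 0)
    {
        return kDefaultScratchSize; // unset or nonsensical value: use the default
    }
    const std::uint32_t size = static_cast<std::uint32_t>(configured);
    // Round up to the next multiple of the granularity.
    const std::uint32_t rounded =
        ((size + kScratchGranularity - 1) / kScratchGranularity) * kScratchGranularity;
    assert(rounded % kScratchGranularity == 0);
    return rounded;
}
```

With something like this in place, the assignment in the constructor could become GSimulateScratchMemorySize = SanitizeScratchSize(PhysSetting->SimulateScratchMemorySize);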

That’s it. For us, it took 2.5 GB/minute off the memory allocation system. It’s not what I’d call a “clean” solution, and if Epic implement this I’d expect it to be done in a slightly different way, but it works, and it shouldn’t cause our integration engineers any serious headaches.

Credit(s): Robert Troughton (Coconut Lizard),
Gordon Yeoman, Pierre Terdiman, Phil Scott and Mike Skolones (all NVIDIA)
Status: Implemented in 4.13