A Huge PhysX Memory Churn Reduction


UPDATE: the fixes below were added to the code base in UE4.13

TLDR: A single PhysX function call was churning through 2.5GB of temporary memory per minute – and the fix was easy!

It’s funny, really, that most of the big-hitting fixes/optimizations to UE4 seem to be so incredibly simple. It’s the smaller fixes which take time.

With memory churn, the allocation of temporary data structures that will be immediately or almost immediately discarded, there are often several ways to improve things:-

  1. remove the churn completely if it’s not needed – either by removing the need for the memory at all or by putting a static buffer in place;
  2. reduce the churn – by reducing the frequency with which the memory is needed or by using a resizable static buffer;
  3. other methods…?

For the code I’m going to show you here, we were able to go for option (1) with a static buffer.
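The general shape of option (1) is worth spelling out before the PhysX-specific version. The sketch below is standalone illustrative code, not UE4 code – the function names and the 64KB buffer size are made up for the example – but it shows the before/after: a per-call heap allocation that is immediately discarded, versus one static buffer reused on every call, with a fallback when the buffer turns out to be too small.

```cpp
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Churny version: allocates and frees a scratch buffer on every invocation.
int SumWithChurn(const int* Values, int Count)
{
	int* Scratch = (int*)malloc(Count * sizeof(int)); // fresh allocation per call
	memcpy(Scratch, Values, Count * sizeof(int));
	int Total = 0;
	for (int i = 0; i < Count; ++i)
		Total += Scratch[i];
	free(Scratch); // immediately discarded: this is memory churn
	return Total;
}

// Option (1): one static buffer, allocated once, reused on every call.
static unsigned char GScratch[64 * 1024]; // fixed upper bound, chosen up front

int SumWithStaticScratch(const int* Values, int Count)
{
	if ((size_t)Count * sizeof(int) > sizeof(GScratch))
		return SumWithChurn(Values, Count); // fall back if the guess was too small
	int* Scratch = (int*)GScratch; // no allocation: reuse the static block
	memcpy(Scratch, Values, Count * sizeof(int));
	int Total = 0;
	for (int i = 0; i < Count; ++i)
		Total += Scratch[i];
	return Total;
}
```

Both paths return the same result; only the allocation behaviour differs – which is exactly the property the PhysX fix below relies on.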

Profiling showed us that a lot of time was being spent doing memory allocations deep inside PhysX. The UE4 code triggering this was the set of calls to simulate(), such as these:-

  PAScene->simulate(DeltaTime, bLastSubstep, SubstepTask);
  PAScene->simulate(DeltaTime, SubstepTask);

The same was showing up in performance tests within vTune – 12.4% of total time was being spent within simulate():-


At first glance, we weren’t sure that we could do much… the “fix” would surely be deep within PhysX. It actually took us longer than it probably should have, along with a few chats with nVidia employees (call out to Gordon Yeoman, Pierre Terdiman, Phil Scott and Mike Skolones for their help), to find the right solution. Note the declaration for simulate(), which makes the answer glaringly obvious:-

virtual void simulate(PxReal elapsedTime, physx::PxBaseTask* completionTask = NULL, void* scratchMemBlock = 0, PxU32 scratchMemBlockSize = 0, bool controlSimulation = true) = 0;

So let’s not drag this out… if a good size can be determined for it, we should of course be using that scratchMemBlock functionality. We create a static buffer, preallocated, and pass that through to simulate(). Passing the size of the buffer through ensures that everything won’t just break if it’s not big enough – if the scratch block runs out, PhysX will simply fall back to allocating its own temporary memory.
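One constraint worth knowing: as I understand the PhysX 3.x documentation for PxScene::simulate(), the scratch block must be 16-byte aligned and its size must be a multiple of 16KB (the 512K default below satisfies both). A small hedged helper – not part of the actual fix, just a sketch – that rounds an arbitrary requested size up to a valid one:

```cpp
#include <cstdint>

// PhysX 3.x documents that the scratch block size passed to simulate() must be
// a multiple of 16KB (and the memory 16-byte aligned). This helper rounds a
// requested size up to the next valid multiple, with a one-block minimum.
static const uint32_t kScratchBlockGranularity = 16 * 1024; // 16KB

uint32_t RoundUpScratchSize(uint32_t Requested)
{
	if (Requested == 0)
		return kScratchBlockGranularity;
	return ((Requested + kScratchBlockGranularity - 1) / kScratchBlockGranularity)
	       * kScratchBlockGranularity;
}
```

For example, 524288 (512K) is already 32 × 16K and passes through unchanged, while an odd request like 500000 would be bumped up to 507904.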

Our solution, if you want to just go with this…

Above the body for FPhysScene::FPhysScene(), we add our static buffer:-

uint32 GSimulateScratchMemorySize = 524288; // 512k is more than enough (for our test cases)
uint8* GSimulateScratchMemory;

We set it up within FPhysScene::FPhysScene():-

if (!GSimulateScratchMemory)
{
  GSimulateScratchMemorySize = PhysSetting->SimulateScratchMemorySize;
  GSimulateScratchMemory = (uint8*)FMemory::Malloc(GSimulateScratchMemorySize, 16); // PhysX needs 16-byte aligned memory
}

And we change the calls to simulate() to benefit… there are four of these:-

// OLD PScene->simulate(AveragedFrameTime[SceneType], Task);
PScene->simulate(AveragedFrameTime[SceneType], Task, GSimulateScratchMemory, GSimulateScratchMemorySize);
// OLD ApexScene->simulate(AveragedFrameTime[SceneType], true, Task);
ApexScene->simulate(AveragedFrameTime[SceneType], true, Task, GSimulateScratchMemory, GSimulateScratchMemorySize);
// OLD PAScene->simulate(DeltaTime, bLastSubstep, SubstepTask);
PAScene->simulate(DeltaTime, bLastSubstep, SubstepTask, GSimulateScratchMemory, GSimulateScratchMemorySize);
// OLD PAScene->simulate(DeltaTime, SubstepTask);
PAScene->simulate(DeltaTime, SubstepTask, GSimulateScratchMemory, GSimulateScratchMemorySize);

Along with externs in PhysSubstepTask.cpp:-

extern uint32 GSimulateScratchMemorySize;
extern uint8* GSimulateScratchMemory;

Finally, to make the 512k “guess” acceptable, we allow it to be overridden through the INIs by adding this to PhysicsSettings.h within the UPhysicsSettings class:-

  /** Amount of memory to reserve for PhysX simulate() */
  UPROPERTY(config, EditAnywhere, Category = Constants)
  int32 SimulateScratchMemorySize;
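With that property in place, projects can then override the default in their config. The section name below is an assumption based on how UPhysicsSettings classes are normally exposed to the INI system – check your generated config to confirm it:

```ini
; DefaultEngine.ini -- assumed section name for UPhysicsSettings
[/Script/Engine.PhysicsSettings]
SimulateScratchMemorySize=262144
```

A smaller value like the 256K above may be more appropriate on memory-constrained platforms; profile before settling on a number.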

That’s it. For us, it took 2.5GB/minute off the memory allocation system. As I say, it’s not what I’d call a “clean” solution – if Epic implement this, I’d expect it to be done in a slightly different way – but… it works… and it shouldn’t cause our integration engineers any serious headaches.

Credit(s): Robert Troughton (Coconut Lizard),
Gordon Yeoman, Pierre Terdiman, Phil Scott and Mike Skolones (all nvidia)
Status: Implemented in 4.13


  1. I’ve created a pull request for this now. I don’t use Git/GitHub a lot, so apologies if I’ve complicated it by not forking from master – it looks like my change has conflicts to resolve. Future PRs will be perfect……..

    • Hi Mig, the memory profile obviously looked much better (memory allocations were reduced to zero within the simulate() functions). CPU Performance of the app also improved – though I can’t remember exactly how much. It should be very easy to test before/after, either with vTune or by building scoped timers into the relevant systems. I may do that later and add the stats to the article.

  2. Hi Robert,
    Thanks for such a valuable post. Just a quick question, will this fix also benefit Android or iOS game’s too? If so, I might have to pick a way smaller buffer size then.
