For Developers Working With Unreal Engine

Sweeping Up And Disassembling A Physics Macro

by

Macros… they still have their uses in modern code, arguably, but I almost always find that it’s better to avoid them.

One macro that I came across just today, while profiling some physics “sweep” code, was the following in PhysXCollision.h:-

/** 
 *	Macro for obtaining a specific geometry from the shape passed in
 *	NB. this macro is bad in terms of hiding the variable names, so made variable name to use Macro, so it does not overlap with the ones used outside 
 **/
#define GET_GEOMETRY_FROM_SHAPE( OutGeom, Shape ) \
  PxGeometry*				OutGeom = NULL; \
  PxSphereGeometry		PMacroSphereGeom;	\
  PxBoxGeometry			PMacroBoxGeom;		\
  PxCapsuleGeometry		PMacroCapsuleGeom;	\
  PxConvexMeshGeometry	PMacroConvexGeom;	\
  \
  if(Shape->getGeometryType() == PxGeometryType::eSPHERE)\
  {\
    Shape->getSphereGeometry(PMacroSphereGeom);\
    OutGeom = &PMacroSphereGeom;\
  }\
  else if(Shape->getGeometryType() == PxGeometryType::eBOX)\
  {\
    Shape->getBoxGeometry(PMacroBoxGeom);\
    OutGeom = &PMacroBoxGeom;\
  }\
  else if(Shape->getGeometryType() == PxGeometryType::eCAPSULE)\
  {\
    Shape->getCapsuleGeometry(PMacroCapsuleGeom);\
    OutGeom = &PMacroCapsuleGeom;\
  }\
  else if(Shape->getGeometryType() == PxGeometryType::eCONVEXMESH)\
  {\
    Shape->getConvexMeshGeometry(PMacroConvexGeom);\
    OutGeom = &PMacroConvexGeom;\
  }\
  else \
  {\
    OutGeom = NULL; \
  }

Some of the things that I really didn’t like about this one:-

  • it’s long. There’s rarely a good reason for having a function declared in this way…
  • the Px…Geometry items at the top of the macro
    • they’re all going onto the stack
    • they’re all going to use their default initializers
    • we’re only actually using one of them ultimately anyway
  • the if/elseif/../else block – why isn’t this switch() instead?

Immediately following this macro, we have:-

//TODO: update places that use macro with this function
struct GeometryFromShapeStorage
{
  PxSphereGeometry SphereGeom;
  PxBoxGeometry BoxGeom;
  PxCapsuleGeometry CapsuleGeom;
  PxConvexMeshGeometry ConvexGeom;
  PxTriangleMeshGeometry TriangleGeom;
  PxHeightFieldGeometry HeightFieldGeom;
};

PxGeometry * GetGeometryFromShape(GeometryFromShapeStorage & LocalStorage, const PxShape * PShape, bool bTriangleMeshAllowed = false);

The matching CPP code for this is found in PhysXCollision.cpp:-

PxGeometry * GetGeometryFromShape(GeometryFromShapeStorage & LocalStorage, const PxShape * PShape, bool bTriangleMeshAllowed /*= false*/)
{
  switch (PShape->getGeometryType())
  {
  case PxGeometryType::eSPHERE:
    PShape->getSphereGeometry(LocalStorage.SphereGeom);
    return &LocalStorage.SphereGeom;
  case PxGeometryType::eBOX:
    PShape->getBoxGeometry(LocalStorage.BoxGeom);
    return &LocalStorage.BoxGeom;
  case PxGeometryType::eCAPSULE:
    PShape->getCapsuleGeometry(LocalStorage.CapsuleGeom);
    return &LocalStorage.CapsuleGeom;
  case PxGeometryType::eCONVEXMESH:
    PShape->getConvexMeshGeometry(LocalStorage.ConvexGeom);
    return &LocalStorage.ConvexGeom;
  case PxGeometryType::eTRIANGLEMESH:
    if (bTriangleMeshAllowed)
    {
      PShape->getTriangleMeshGeometry(LocalStorage.TriangleGeom);
      return &LocalStorage.TriangleGeom;
    }
  case PxGeometryType::eHEIGHTFIELD:
    PShape->getHeightFieldGeometry(LocalStorage.HeightFieldGeom);
    return &LocalStorage.HeightFieldGeom;
  default:
    return NULL;
  }
}

So it looks like moves were made to improve things … but, then, the new code was never fully hooked up for one reason or another. Probably because the new code was slower than the macro (I’m guessing here). GetGeometryFromShape() is a simple conversion of the aforementioned macro into a proper function – and one which does indeed use switch(). But… we can do better. Much, much better.

Let’s take a look at the macro first of all… here’s how that ends up compiled on PC:-

0x0B4 (180) bytes of code

00007FF702F623F9  mov         qword ptr [PMacroSphereGeom],r12  
00007FF702F623FE  mov         qword ptr [PMacroBoxGeom],3  
00007FF702F62407  movups      xmmword ptr [rbp-1Ch],xmm0  
00007FF702F6240B  mov         qword ptr [rbp-80h],r12  
00007FF702F6240F  mov         qword ptr [PMacroCapsuleGeom],2  
00007FF702F62418  mov         dword ptr [rsp+70h],r12d  
00007FF702F6241D  mov         dword ptr [rbp-20h],4  
00007FF702F62424  mov         qword ptr [rbp-0Ch],r12  
00007FF702F62428  mov         dword ptr [rbp-4],3F800000h  
00007FF702F6242F  mov         qword ptr [rbp],r12  
00007FF702F62433  mov         rax,qword ptr [rbx]  
00007FF702F62436  mov         rcx,rbx  
00007FF702F62439  call        qword ptr [rax+28h]  
00007FF702F6243C  mov         rcx,rbx  
00007FF702F6243F  test        eax,eax  
00007FF702F62441  mov         rax,qword ptr [rbx]  
00007FF702F62444  jne         <lambda_ba77acd23315091b7f3fb6973978162a>::operator()+255h (07FF702F62455h)  
00007FF702F62446  lea         rdx,[PMacroSphereGeom]  
00007FF702F6244B  call        qword ptr [rax+48h]  
00007FF702F6244E  lea         rdi,[PMacroSphereGeom]  
00007FF702F62453  jmp         <lambda_ba77acd23315091b7f3fb6973978162a>::operator()+2ADh (07FF702F624ADh)  
00007FF702F62455  call        qword ptr [rax+28h]  
00007FF702F62458  mov         rcx,rbx  
00007FF702F6245B  cmp         eax,3  
00007FF702F6245E  mov         rax,qword ptr [rbx]  
00007FF702F62461  jne         <lambda_ba77acd23315091b7f3fb6973978162a>::operator()+272h (07FF702F62472h)  
00007FF702F62463  lea         rdx,[PMacroBoxGeom]  
00007FF702F62468  call        qword ptr [rax+40h]  
00007FF702F6246B  lea         rdi,[PMacroBoxGeom]  
00007FF702F62470  jmp         <lambda_ba77acd23315091b7f3fb6973978162a>::operator()+2ADh (07FF702F624ADh)  
00007FF702F62472  call        qword ptr [rax+28h]  
00007FF702F62475  mov         rcx,rbx  
00007FF702F62478  cmp         eax,2  
00007FF702F6247B  mov         rax,qword ptr [rbx]  
00007FF702F6247E  jne         <lambda_ba77acd23315091b7f3fb6973978162a>::operator()+28Fh (07FF702F6248Fh)  
00007FF702F62480  lea         rdx,[PMacroCapsuleGeom]  
00007FF702F62485  call        qword ptr [rax+50h]  
00007FF702F62488  lea         rdi,[PMacroCapsuleGeom]  
00007FF702F6248D  jmp         <lambda_ba77acd23315091b7f3fb6973978162a>::operator()+2ADh (07FF702F624ADh)  
00007FF702F6248F  call        qword ptr [rax+28h]  
00007FF702F62492  cmp         eax,4  
00007FF702F62495  jne         <lambda_ba77acd23315091b7f3fb6973978162a>::operator()+2AAh (07FF702F624AAh)  
00007FF702F62497  mov         rax,qword ptr [rbx]  
00007FF702F6249A  lea         rdx,[rbp-20h]  
00007FF702F6249E  mov         rcx,rbx  
00007FF702F624A1  call        qword ptr [rax+60h]  
00007FF702F624A4  lea         rdi,[rbp-20h]  
00007FF702F624A8  jmp         <lambda_ba77acd23315091b7f3fb6973978162a>::operator()+2ADh (07FF702F624ADh)  
00007FF702F624AA  mov         rdi,r12
00007FF702F624AD

So, wow, yeah… that’s a pretty hefty chunk of code.

Let me break this up for you:-

  • lines 1-10: these are initializing all the Px..Geoms from the top of the macro with default values
  • lines 11-17, 22-26, 31-35 and 40-42 .. these are the if/else if/else if/else parts
  • 18-21 is the Sphere Geom part
  • etc…

Note as well with lines 18-21… line 20 is reading the same pointer into rdi that line 18 previously read into rdx. The compiler has done this because it really has no idea what’s happening inside getSphereGeometry(), called on line 19.

Ok, so the macro is BAD… how about the function, though..? Let’s see how that compiles:-

0x128 (296) bytes of code

00007FF702F6FC50  mov         qword ptr [rsp+8],rbx  
00007FF702F6FC55  mov         qword ptr [rsp+10h],rsi  
00007FF702F6FC5A  push        rdi  
00007FF702F6FC5B  sub         rsp,20h  
00007FF702F6FC5F  mov         rax,qword ptr [rdx]  
00007FF702F6FC62  mov         rbx,rcx  
00007FF702F6FC65  mov         rcx,rdx  
00007FF702F6FC68  movzx       esi,r8b  
00007FF702F6FC6C  mov         rdi,rdx  
00007FF702F6FC6F  call        qword ptr [rax+28h]  
00007FF702F6FC72  cmp         eax,6  
00007FF702F6FC75  ja          $LN8+0D4h (07FF702F6FD66h)  
00007FF702F6FC7B  lea         rcx,[__ImageBase (07FF701BE0000h)]  
00007FF702F6FC82  cdqe  
00007FF702F6FC84  mov         r9d,dword ptr [rcx+rax*4+138FD78h]  
00007FF702F6FC8C  add         r9,rcx  
00007FF702F6FC8F  jmp         r9  
00007FF702F6FC92  mov         rax,qword ptr [rdi]  
00007FF702F6FC95  mov         rdx,rbx  
00007FF702F6FC98  mov         rcx,rdi  
00007FF702F6FC9B  call        qword ptr [rax+48h]  
00007FF702F6FC9E  mov         rax,rbx  
00007FF702F6FCA1  mov         rbx,qword ptr [rsp+30h]  
00007FF702F6FCA6  mov         rsi,qword ptr [rsp+38h]  
00007FF702F6FCAB  add         rsp,20h  
00007FF702F6FCAF  pop         rdi  
00007FF702F6FCB0  ret  
00007FF702F6FCB1  mov         r8,qword ptr [rdi]  
00007FF702F6FCB4  lea         rdx,[rbx+8]  
00007FF702F6FCB8  mov         rcx,rdi  
00007FF702F6FCBB  call        qword ptr [r8+40h]  
00007FF702F6FCBF  lea         rax,[rbx+8]  
00007FF702F6FCC3  mov         rbx,qword ptr [rsp+30h]  
00007FF702F6FCC8  mov         rsi,qword ptr [rsp+38h]  
00007FF702F6FCCD  add         rsp,20h  
00007FF702F6FCD1  pop         rdi  
00007FF702F6FCD2  ret  
00007FF702F6FCD3  mov         r8,qword ptr [rdi]  
00007FF702F6FCD6  lea         rdx,[rbx+18h]  
00007FF702F6FCDA  mov         rcx,rdi  
00007FF702F6FCDD  call        qword ptr [r8+50h]  
00007FF702F6FCE1  lea         rax,[rbx+18h]  
00007FF702F6FCE5  mov         rbx,qword ptr [rsp+30h]  
00007FF702F6FCEA  mov         rsi,qword ptr [rsp+38h]  
00007FF702F6FCEF  add         rsp,20h  
00007FF702F6FCF3  pop         rdi  
00007FF702F6FCF4  ret  
00007FF702F6FCF5  mov         r8,qword ptr [rdi]  
00007FF702F6FCF8  lea         rdx,[rbx+28h]  
00007FF702F6FCFC  mov         rcx,rdi  
00007FF702F6FCFF  call        qword ptr [r8+60h]  
00007FF702F6FD03  lea         rax,[rbx+28h]  
00007FF702F6FD07  mov         rbx,qword ptr [rsp+30h]  
00007FF702F6FD0C  mov         rsi,qword ptr [rsp+38h]  
00007FF702F6FD11  add         rsp,20h  
00007FF702F6FD15  pop         rdi  
00007FF702F6FD16  ret  
00007FF702F6FD17  test        sil,sil  
00007FF702F6FD1A  je          $LN8+0ACh (07FF702F6FD3Eh)  
00007FF702F6FD1C  mov         r8,qword ptr [rdi]  
00007FF702F6FD1F  lea         rdx,[rbx+50h]  
00007FF702F6FD23  mov         rcx,rdi  
00007FF702F6FD26  call        qword ptr [r8+68h]  
00007FF702F6FD2A  lea         rax,[rbx+50h]  
00007FF702F6FD2E  mov         rbx,qword ptr [rsp+30h]  
00007FF702F6FD33  mov         rsi,qword ptr [rsp+38h]  
00007FF702F6FD38  add         rsp,20h  
00007FF702F6FD3C  pop         rdi  
00007FF702F6FD3D  ret  
00007FF702F6FD3E  mov         r8,qword ptr [rdi]  
00007FF702F6FD41  lea         rdx,[rbx+80h]  
00007FF702F6FD48  mov         rcx,rdi  
00007FF702F6FD4B  call        qword ptr [r8+70h]  
00007FF702F6FD4F  lea         rax,[rbx+80h]  
00007FF702F6FD56  mov         rbx,qword ptr [rsp+30h]  
00007FF702F6FD5B  mov         rsi,qword ptr [rsp+38h]  
00007FF702F6FD60  add         rsp,20h  
00007FF702F6FD64  pop         rdi  
00007FF702F6FD65  ret  
00007FF702F6FD66  mov         rbx,qword ptr [rsp+30h]  
00007FF702F6FD6B  mov         rsi,qword ptr [rsp+38h]  
00007FF702F6FD70  xor         eax,eax  
00007FF702F6FD72  add         rsp,20h  
00007FF702F6FD76  pop         rdi  
00007FF702F6FD77  ret  
00007FF702F6FD78

This is actually longer than the assembly generated from the macro – but some of this can be excused from that the function is able to do more. It adds heightfields and triangle meshes as shape options that the macro didn’t support.

Still… we can do better. My first consideration was that we fairly easily change the Px..Geom group to be a union… this would reduce memory usage. It would need some trickery – but it was doable. That was actually my first solution… but then, wondering whether or not nVidia had actually considered this scenario, I started to hunt through their header files. That’s when I found PxGeometryHolder (http://docs.nvidia.com/gameworks/content/gameworkslibrary/physx/apireference/files/classPxGeometryHolder.html). Amazingly, this seemed to be EXACTLY what we wanted.

Additionally, our original macro didn’t actually support height fields.  It didn’t support triangle meshes either – but those were covered in the new function with bTriangleMeshAllowed. So we’d need to add a similar check using bTriangleMeshAllowed so that we wouldn’t be potentially breaking anything.

Bringing PxGeometryHolder into play, here’s the final code that I ended up with, along with the changes needed to call into this:-

// PhysXCollision.cpp:-

PxGeometry* GetGeometryFromShape(PxGeometryHolder& GeomHolder, const PxShape * PShape, bool bTriangleMeshAllowed /*= false*/, bool bHeightfieldAllowed /*= false*/)
{
  ;
  switch (PShape->getGeometryType())
  {
  case PxGeometryType::eSPHERE:
    PShape->getSphereGeometry(GeomHolder.sphere());
    break;
  case PxGeometryType::eBOX:
    PShape->getBoxGeometry(GeomHolder.box());
    break;
  case PxGeometryType::eCAPSULE:
    PShape->getCapsuleGeometry(GeomHolder.capsule());
    break;
  case PxGeometryType::eCONVEXMESH:
    PShape->getConvexMeshGeometry(GeomHolder.convexMesh());
    break;
  case PxGeometryType::eTRIANGLEMESH:
    if (bTriangleMeshAllowed)
    {
      PShape->getTriangleMeshGeometry(GeomHolder.triangleMesh());
    }
    else
    {
      return NULL;
    }
    break;
  case PxGeometryType::eHEIGHTFIELD:
    if (bHeightfieldAllowed)
    {
      PShape->getHeightFieldGeometry(GeomHolder.heightField());
    }
    else
    {
      return NULL;
    }
    break;
  }
  return &GeomHolder.any();
}

PhysXCollision.h (nb. this replaces the macro and the previous declaration of GetGeometryFromShape():-

PxGeometry* GetGeometryFromShape(PxGeometryHolder& GeomHolder, const PxShape* PShape, bool bTriangleMeshAllowed = false, bool bHeightfieldAllowed = false);

BodyInstance.cpp has 3 changes needed…

Firstly,

GeometryFromShapeStorage GeomStorage;
PxGeometry* PGeom = GetGeometryFromShape(GeomStorage, PShape, true);
PxTransform PGlobalPose = PCompTM.transform(PShape->getLocalPose());
if (PGeom)
{

changes to:-

PxGeometryHolder GeomHolder;
PxGeometry* PGeom = GetGeometryFromShape(GeomHolder, PShape, true, true);
if (PGeom)
{
  PxTransform PGlobalPose = PCompTM.transform(PShape->getLocalPose());

(note here that I’ve also moved PGlobalPose to be within the if() block – no point calculating it if we’re not going to be using it…

Secondly,

PxTransform PShapeGlobalPose = PTestGlobalPose.transform(PTargetShape->getLocalPose());
GET_GEOMETRY_FROM_SHAPE(PGeom, PTargetShape);
if (PGeom)
{

changes to:-

PxGeometryHolder GeomHolder;
PxGeometry* PGeom = GetGeometryFromShape(GeomHolder, PTargetShape);
if (PGeom)
{
  PxTransform PShapeGlobalPose = PTestGlobalPose.transform(PTargetShape->getLocalPose());

And,

const PxTransform PLocalPose = PShape->getLocalPose();
const PxTransform PShapeGlobalPose = PBodyInstanceSpaceToTestSpace.transform(PLocalPose);
GET_GEOMETRY_FROM_SHAPE(PGeom, PShape);
if (PGeom != NULL)
{

changes to:-

PxGeometryHolder GeomHolder;
PxGeometry* PGeom = GetGeometryFromShape(GeomHolder, PShape);
if (PGeom != NULL)
{
  const PxTransform PLocalPose = PShape->getLocalPose();
  const PxTransform PShapeGlobalPose = PBodyInstanceSpaceToTestSpace.transform(PLocalPose);

Finally, WorldCollision.cpp needs the following:-

GET_GEOMETRY_FROM_SHAPE(PGeom, PShape);

changing to:-

PxGeometryHolder GeomHolder;
PxGeometry* PGeom = GetGeometryFromShape(GeomHolder, PShape);

Actually, one further optimization that you can do is, for each of the above, move the location of the GeomHolder variable to be outside of each of the loops that they’re in.

That concludes our work with this for this article… but before we wrap up, let’s take a look at the disassembly for our new GetGeometryFromShape():-

0x0BF (191) bytes of code

00007FF7F3F41C30  mov         qword ptr [rsp+8],rbx  
00007FF7F3F41C35  mov         qword ptr [rsp+10h],rbp  
00007FF7F3F41C3A  mov         qword ptr [rsp+18h],rsi  
00007FF7F3F41C3F  push        rdi  
00007FF7F3F41C40  sub         rsp,20h  
00007FF7F3F41C44  mov         rax,qword ptr [rdx]  
00007FF7F3F41C47  mov         rdi,rcx  
00007FF7F3F41C4A  mov         rcx,rdx  
00007FF7F3F41C4D  movzx       esi,r9b  
00007FF7F3F41C51  movzx       ebp,r8b  
00007FF7F3F41C55  mov         rbx,rdx  
00007FF7F3F41C58  call        qword ptr [rax+28h]  
00007FF7F3F41C5B  cmp         eax,6  
00007FF7F3F41C5E  ja          $LN10+60h (07FF7F3F41CD7h)  
00007FF7F3F41C60  lea         rcx,[__ImageBase (07FF7F2BB0000h)]  
00007FF7F3F41C67  cdqe  
00007FF7F3F41C69  mov         r10d,dword ptr [rcx+rax*4+1391CF0h]  
00007FF7F3F41C71  add         r10,rcx  
00007FF7F3F41C74  jmp         r10  
00007FF7F3F41C77  mov         rax,qword ptr [rbx]  
00007FF7F3F41C7A  mov         rdx,rdi  
00007FF7F3F41C7D  mov         rcx,rbx  
00007FF7F3F41C80  call        qword ptr [rax+48h]  
00007FF7F3F41C83  jmp         $LN10+60h (07FF7F3F41CD7h)  
00007FF7F3F41C85  mov         rax,qword ptr [rbx]  
00007FF7F3F41C88  mov         rdx,rdi  
00007FF7F3F41C8B  mov         rcx,rbx  
00007FF7F3F41C8E  call        qword ptr [rax+40h]  
00007FF7F3F41C91  jmp         $LN10+60h (07FF7F3F41CD7h)  
00007FF7F3F41C93  mov         rax,qword ptr [rbx]  
00007FF7F3F41C96  mov         rdx,rdi  
00007FF7F3F41C99  mov         rcx,rbx  
00007FF7F3F41C9C  call        qword ptr [rax+50h]  
00007FF7F3F41C9F  jmp         $LN10+60h (07FF7F3F41CD7h)  
00007FF7F3F41CA1  mov         rax,qword ptr [rbx]  
00007FF7F3F41CA4  mov         rdx,rdi  
00007FF7F3F41CA7  mov         rcx,rbx  
00007FF7F3F41CAA  call        qword ptr [rax+60h]  
00007FF7F3F41CAD  jmp         $LN10+60h (07FF7F3F41CD7h)  
00007FF7F3F41CAF  test        bpl,bpl  
00007FF7F3F41CB2  je          $LN10+4Bh (07FF7F3F41CC2h)  
00007FF7F3F41CB4  mov         rax,qword ptr [rbx]  
00007FF7F3F41CB7  mov         rdx,rdi  
00007FF7F3F41CBA  mov         rcx,rbx  
00007FF7F3F41CBD  call        qword ptr [rax+68h]  
00007FF7F3F41CC0  jmp         $LN10+60h (07FF7F3F41CD7h)  
00007FF7F3F41CC2  xor         eax,eax  
00007FF7F3F41CC4  jmp         $LN10+63h (07FF7F3F41CDAh)  
00007FF7F3F41CC6  test        sil,sil  
00007FF7F3F41CC9  je          $LN10+4Bh (07FF7F3F41CC2h)  
00007FF7F3F41CCB  mov         rax,qword ptr [rbx]  
00007FF7F3F41CCE  mov         rdx,rdi  
00007FF7F3F41CD1  mov         rcx,rbx  
00007FF7F3F41CD4  call        qword ptr [rax+70h]  
00007FF7F3F41CD7  mov         rax,rdi  
00007FF7F3F41CDA  mov         rbx,qword ptr [rsp+30h]  
00007FF7F3F41CDF  mov         rbp,qword ptr [rsp+38h]  
00007FF7F3F41CE4  mov         rsi,qword ptr [rsp+40h]  
00007FF7F3F41CE9  add         rsp,20h  
00007FF7F3F41CED  pop         rdi  
00007FF7F3F41CEE  ret  
00007FF7F3F41CEF

That’s 33% smaller than the original function… and slightly longer than the original macro – but significantly faster. Note in particular lines 17-21. This shows how the compiler can be clever when it’s given the chance – it’s figured out a calculation to jump straight to the code needed, rather than using nasty comparisons! This is way, way better than the first example we saw.

See you next time!


Credit(s): Robert Troughton (Coconut Lizard)
Status: Currently unimplemented in 4.12


3 Comments

  1. Hi Robert,

    Nice find, by looking into UE4 source code I found that GET_GEOMETRY_FROM_SHAPE is used from within Sweet and Overlap checks, i.e.: performance-heavy code already. I was wondering if you would know how much gain could you get by your changes, because they look promising.

    Thanks!

    • Thanks. I’ll try to get some timings later this week to add to the article. It’s totally subject to the content, of course, as it will come down to how many shapes, what sort of shapes, etc… But I’ll give the results from our own game project.

Leave a Reply

Your email address will not be published.

*

Latest from ALL

Placating The Natives

In this article we delve into Blueprint Nativization, a relatively new feature
Go to Top
%d bloggers like this: