Sweeping Up And Disassembling A Physics Macro
June 20, 2016
Macros… they still have their uses in modern code, arguably, but I almost always find that it’s better to avoid them. One macro that I came across just today, while profiling some physics “sweep” code, was the following in PhysXCollision.h:-
/** * Macro for obtaining a specific geometry from the shape passed in * NB. this macro is bad in terms of hiding the variable names, so made variable name to use Macro, so it does not overlap with the ones used outside **/ #define GET_GEOMETRY_FROM_SHAPE( OutGeom, Shape ) \ PxGeometry* OutGeom = NULL; \ PxSphereGeometry PMacroSphereGeom; \ PxBoxGeometry PMacroBoxGeom; \ PxCapsuleGeometry PMacroCapsuleGeom; \ PxConvexMeshGeometry PMacroConvexGeom; \ \ if(Shape->getGeometryType() == PxGeometryType::eSPHERE)\ {\ Shape->getSphereGeometry(PMacroSphereGeom);\ OutGeom = &PMacroSphereGeom;\ }\ else if(Shape->getGeometryType() == PxGeometryType::eBOX)\ {\ Shape->getBoxGeometry(PMacroBoxGeom);\ OutGeom = &PMacroBoxGeom;\ }\ else if(Shape->getGeometryType() == PxGeometryType::eCAPSULE)\ {\ Shape->getCapsuleGeometry(PMacroCapsuleGeom);\ OutGeom = &PMacroCapsuleGeom;\ }\ else if(Shape->getGeometryType() == PxGeometryType::eCONVEXMESH)\ {\ Shape->getConvexMeshGeometry(PMacroConvexGeom);\ OutGeom = &PMacroConvexGeom;\ }\ else \ {\ OutGeom = NULL; \ }
Some of the things that I really didn’t like about this one:-
- it’s long. There’s rarely a good reason for having a function declared in this way…
- the Px…Geometry items at the top of the macro
- they’re all going onto the stack
- they’re all going to use their default initializers
- we’re only actually using one of them ultimately anyway
- the if/elseif/../else block – why isn’t this switch() instead?
Immediately following this macro, we have:-
//TODO: update places that use macro with this function struct GeometryFromShapeStorage { PxSphereGeometry SphereGeom; PxBoxGeometry BoxGeom; PxCapsuleGeometry CapsuleGeom; PxConvexMeshGeometry ConvexGeom; PxTriangleMeshGeometry TriangleGeom; PxHeightFieldGeometry HeightFieldGeom; }; PxGeometry * GetGeometryFromShape(GeometryFromShapeStorage & LocalStorage, const PxShape * PShape, bool bTriangleMeshAllowed = false);
The matching CPP code for this is found in PhysXCollision.cpp:-
PxGeometry * GetGeometryFromShape(GeometryFromShapeStorage & LocalStorage, const PxShape * PShape, bool bTriangleMeshAllowed /*= false*/) { switch (PShape->getGeometryType()) { case PxGeometryType::eSPHERE: PShape->getSphereGeometry(LocalStorage.SphereGeom); return &LocalStorage.SphereGeom; case PxGeometryType::eBOX: PShape->getBoxGeometry(LocalStorage.BoxGeom); return &LocalStorage.BoxGeom; case PxGeometryType::eCAPSULE: PShape->getCapsuleGeometry(LocalStorage.CapsuleGeom); return &LocalStorage.CapsuleGeom; case PxGeometryType::eCONVEXMESH: PShape->getConvexMeshGeometry(LocalStorage.ConvexGeom); return &LocalStorage.ConvexGeom; case PxGeometryType::eTRIANGLEMESH: if (bTriangleMeshAllowed) { PShape->getTriangleMeshGeometry(LocalStorage.TriangleGeom); return &LocalStorage.TriangleGeom; } case PxGeometryType::eHEIGHTFIELD: PShape->getHeightFieldGeometry(LocalStorage.HeightFieldGeom); return &LocalStorage.HeightFieldGeom; default: return NULL; } }
So it looks like moves were made to improve things … but, then, the new code was never fully hooked up for one reason or another. Probably because the new code was slower than the macro (I’m guessing here). GetGeometryFromShape() is a simple conversion of the aforementioned macro into a proper function – and one which does indeed use switch(). But… we can do better. Much, much better. Let’s take a look at the macro first of all… here’s how that ends up compiled on PC:-
0x0B4 (180) bytes of code 00007FF702F623F9 mov qword ptr [PMacroSphereGeom],r12 00007FF702F623FE mov qword ptr [PMacroBoxGeom],3 00007FF702F62407 movups xmmword ptr [rbp-1Ch],xmm0 00007FF702F6240B mov qword ptr [rbp-80h],r12 00007FF702F6240F mov qword ptr [PMacroCapsuleGeom],2 00007FF702F62418 mov dword ptr [rsp+70h],r12d 00007FF702F6241D mov dword ptr [rbp-20h],4 00007FF702F62424 mov qword ptr [rbp-0Ch],r12 00007FF702F62428 mov dword ptr [rbp-4],3F800000h 00007FF702F6242F mov qword ptr [rbp],r12 00007FF702F62433 mov rax,qword ptr [rbx] 00007FF702F62436 mov rcx,rbx 00007FF702F62439 call qword ptr [rax+28h] 00007FF702F6243C mov rcx,rbx 00007FF702F6243F test eax,eax 00007FF702F62441 mov rax,qword ptr [rbx] 00007FF702F62444 jne <lambda_ba77acd23315091b7f3fb6973978162a>::operator()+255h (07FF702F62455h) 00007FF702F62446 lea rdx,[PMacroSphereGeom] 00007FF702F6244B call qword ptr [rax+48h] 00007FF702F6244E lea rdi,[PMacroSphereGeom] 00007FF702F62453 jmp <lambda_ba77acd23315091b7f3fb6973978162a>::operator()+2ADh (07FF702F624ADh) 00007FF702F62455 call qword ptr [rax+28h] 00007FF702F62458 mov rcx,rbx 00007FF702F6245B cmp eax,3 00007FF702F6245E mov rax,qword ptr [rbx] 00007FF702F62461 jne <lambda_ba77acd23315091b7f3fb6973978162a>::operator()+272h (07FF702F62472h) 00007FF702F62463 lea rdx,[PMacroBoxGeom] 00007FF702F62468 call qword ptr [rax+40h] 00007FF702F6246B lea rdi,[PMacroBoxGeom] 00007FF702F62470 jmp <lambda_ba77acd23315091b7f3fb6973978162a>::operator()+2ADh (07FF702F624ADh) 00007FF702F62472 call qword ptr [rax+28h] 00007FF702F62475 mov rcx,rbx 00007FF702F62478 cmp eax,2 00007FF702F6247B mov rax,qword ptr [rbx] 00007FF702F6247E jne <lambda_ba77acd23315091b7f3fb6973978162a>::operator()+28Fh (07FF702F6248Fh) 00007FF702F62480 lea rdx,[PMacroCapsuleGeom] 00007FF702F62485 call qword ptr [rax+50h] 00007FF702F62488 lea rdi,[PMacroCapsuleGeom] 00007FF702F6248D jmp <lambda_ba77acd23315091b7f3fb6973978162a>::operator()+2ADh (07FF702F624ADh) 00007FF702F6248F call qword ptr [rax+28h] 00007FF702F62492 cmp eax,4 00007FF702F62495 jne <lambda_ba77acd23315091b7f3fb6973978162a>::operator()+2AAh (07FF702F624AAh) 00007FF702F62497 mov rax,qword ptr [rbx] 00007FF702F6249A lea rdx,[rbp-20h] 00007FF702F6249E mov rcx,rbx 00007FF702F624A1 call qword ptr [rax+60h] 00007FF702F624A4 lea rdi,[rbp-20h] 00007FF702F624A8 jmp <lambda_ba77acd23315091b7f3fb6973978162a>::operator()+2ADh (07FF702F624ADh) 00007FF702F624AA mov rdi,r12 00007FF702F624AD
So, wow, yeah… that’s a pretty hefty chunk of code.
Let me break this up for you:-
- lines 1-10: these are initializing all the Px..Geoms from the top of the macro with default values
- lines 11-17, 22-26, 31-35 and 40-42 .. these are the if/else if/else if/else parts
- 18-21 is the Sphere Geom part
- etc…
Note as well with lines 18-21… line 20 is reading the same pointer into rdi that line 18 previously read into rdx. The compiler has done this because it really has no idea what’s happening inside getSphereGeometry(), called on line 19.
Ok, so the macro is BAD… how about the function, though..? Let’s see how that compiles:-
0x128 (296) bytes of code 00007FF702F6FC50 mov qword ptr [rsp+8],rbx 00007FF702F6FC55 mov qword ptr [rsp+10h],rsi 00007FF702F6FC5A push rdi 00007FF702F6FC5B sub rsp,20h 00007FF702F6FC5F mov rax,qword ptr [rdx] 00007FF702F6FC62 mov rbx,rcx 00007FF702F6FC65 mov rcx,rdx 00007FF702F6FC68 movzx esi,r8b 00007FF702F6FC6C mov rdi,rdx 00007FF702F6FC6F call qword ptr [rax+28h] 00007FF702F6FC72 cmp eax,6 00007FF702F6FC75 ja $LN8+0D4h (07FF702F6FD66h) 00007FF702F6FC7B lea rcx,[__ImageBase (07FF701BE0000h)] 00007FF702F6FC82 cdqe 00007FF702F6FC84 mov r9d,dword ptr [rcx+rax*4+138FD78h] 00007FF702F6FC8C add r9,rcx 00007FF702F6FC8F jmp r9 00007FF702F6FC92 mov rax,qword ptr [rdi] 00007FF702F6FC95 mov rdx,rbx 00007FF702F6FC98 mov rcx,rdi 00007FF702F6FC9B call qword ptr [rax+48h] 00007FF702F6FC9E mov rax,rbx 00007FF702F6FCA1 mov rbx,qword ptr [rsp+30h] 00007FF702F6FCA6 mov rsi,qword ptr [rsp+38h] 00007FF702F6FCAB add rsp,20h 00007FF702F6FCAF pop rdi 00007FF702F6FCB0 ret 00007FF702F6FCB1 mov r8,qword ptr [rdi] 00007FF702F6FCB4 lea rdx,[rbx+8] 00007FF702F6FCB8 mov rcx,rdi 00007FF702F6FCBB call qword ptr [r8+40h] 00007FF702F6FCBF lea rax,[rbx+8] 00007FF702F6FCC3 mov rbx,qword ptr [rsp+30h] 00007FF702F6FCC8 mov rsi,qword ptr [rsp+38h] 00007FF702F6FCCD add rsp,20h 00007FF702F6FCD1 pop rdi 00007FF702F6FCD2 ret 00007FF702F6FCD3 mov r8,qword ptr [rdi] 00007FF702F6FCD6 lea rdx,[rbx+18h] 00007FF702F6FCDA mov rcx,rdi 00007FF702F6FCDD call qword ptr [r8+50h] 00007FF702F6FCE1 lea rax,[rbx+18h] 00007FF702F6FCE5 mov rbx,qword ptr [rsp+30h] 00007FF702F6FCEA mov rsi,qword ptr [rsp+38h] 00007FF702F6FCEF add rsp,20h 00007FF702F6FCF3 pop rdi 00007FF702F6FCF4 ret 00007FF702F6FCF5 mov r8,qword ptr [rdi] 00007FF702F6FCF8 lea rdx,[rbx+28h] 00007FF702F6FCFC mov rcx,rdi 00007FF702F6FCFF call qword ptr [r8+60h] 00007FF702F6FD03 lea rax,[rbx+28h] 00007FF702F6FD07 mov rbx,qword ptr [rsp+30h] 00007FF702F6FD0C mov rsi,qword ptr [rsp+38h] 00007FF702F6FD11 add rsp,20h 00007FF702F6FD15 pop rdi 00007FF702F6FD16 ret 00007FF702F6FD17 test sil,sil 00007FF702F6FD1A je $LN8+0ACh (07FF702F6FD3Eh) 00007FF702F6FD1C mov r8,qword ptr [rdi] 00007FF702F6FD1F lea rdx,[rbx+50h] 00007FF702F6FD23 mov rcx,rdi 00007FF702F6FD26 call qword ptr [r8+68h] 00007FF702F6FD2A lea rax,[rbx+50h] 00007FF702F6FD2E mov rbx,qword ptr [rsp+30h] 00007FF702F6FD33 mov rsi,qword ptr [rsp+38h] 00007FF702F6FD38 add rsp,20h 00007FF702F6FD3C pop rdi 00007FF702F6FD3D ret 00007FF702F6FD3E mov r8,qword ptr [rdi] 00007FF702F6FD41 lea rdx,[rbx+80h] 00007FF702F6FD48 mov rcx,rdi 00007FF702F6FD4B call qword ptr [r8+70h] 00007FF702F6FD4F lea rax,[rbx+80h] 00007FF702F6FD56 mov rbx,qword ptr [rsp+30h] 00007FF702F6FD5B mov rsi,qword ptr [rsp+38h] 00007FF702F6FD60 add rsp,20h 00007FF702F6FD64 pop rdi 00007FF702F6FD65 ret 00007FF702F6FD66 mov rbx,qword ptr [rsp+30h] 00007FF702F6FD6B mov rsi,qword ptr [rsp+38h] 00007FF702F6FD70 xor eax,eax 00007FF702F6FD72 add rsp,20h 00007FF702F6FD76 pop rdi 00007FF702F6FD77 ret 00007FF702F6FD78
This is actually longer than the assembly generated from the macro – but some of this can be excused from that the function is able to do more. It adds heightfields and triangle meshes as shape options that the macro didn’t support.
Still… we can do better. My first consideration was that we fairly easily change the Px..Geom group to be a union… this would reduce memory usage. It would need some trickery – but it was doable. That was actually my first solution… but then, wondering whether or not nVidia had actually considered this scenario, I started to hunt through their header files. That’s when I found PxGeometryHolder (http://docs.nvidia.com/gameworks/content/gameworkslibrary/physx/apireference/files/classPxGeometryHolder.html). Amazingly, this seemed to be EXACTLY what we wanted.
Additionally, our original macro didn’t actually support height fields. It didn’t support triangle meshes either – but those were covered in the new function with bTriangleMeshAllowed. So we’d need to add a similar check using bTriangleMeshAllowed so that we wouldn’t be potentially breaking anything.
Bringing PxGeometryHolder into play, here’s the final code that I ended up with, along with the changes needed to call into this:-
// PhysXCollision.cpp:-
PxGeometry* GetGeometryFromShape(PxGeometryHolder& GeomHolder, const PxShape * PShape, bool bTriangleMeshAllowed /*= false*/, bool bHeightfieldAllowed /*= false*/) { ; switch (PShape->getGeometryType()) { case PxGeometryType::eSPHERE: PShape->getSphereGeometry(GeomHolder.sphere()); break; case PxGeometryType::eBOX: PShape->getBoxGeometry(GeomHolder.box()); break; case PxGeometryType::eCAPSULE: PShape->getCapsuleGeometry(GeomHolder.capsule()); break; case PxGeometryType::eCONVEXMESH: PShape->getConvexMeshGeometry(GeomHolder.convexMesh()); break; case PxGeometryType::eTRIANGLEMESH: if (bTriangleMeshAllowed) { PShape->getTriangleMeshGeometry(GeomHolder.triangleMesh()); } else { return NULL; } break; case PxGeometryType::eHEIGHTFIELD: if (bHeightfieldAllowed) { PShape->getHeightFieldGeometry(GeomHolder.heightField()); } else { return NULL; } break; } return &GeomHolder.any(); }
PhysXCollision.h (nb. this replaces the macro and the previous declaration of GetGeometryFromShape():-
PxGeometry* GetGeometryFromShape(PxGeometryHolder& GeomHolder, const PxShape* PShape, bool bTriangleMeshAllowed = false, bool bHeightfieldAllowed = false);
BodyInstance.cpp has 3 changes needed…
Firstly,
GeometryFromShapeStorage GeomStorage; PxGeometry* PGeom = GetGeometryFromShape(GeomStorage, PShape, true); PxTransform PGlobalPose = PCompTM.transform(PShape->getLocalPose()); if (PGeom) {
changes to:-
PxGeometryHolder GeomHolder; PxGeometry* PGeom = GetGeometryFromShape(GeomHolder, PShape, true, true); if (PGeom) { PxTransform PGlobalPose = PCompTM.transform(PShape->getLocalPose());
(note here that I’ve also moved PGlobalPose to be within the if() block – no point calculating it if we’re not going to be using it…
Secondly,
PxTransform PShapeGlobalPose = PTestGlobalPose.transform(PTargetShape->getLocalPose()); GET_GEOMETRY_FROM_SHAPE(PGeom, PTargetShape); if (PGeom) {
changes to:-
PxGeometryHolder GeomHolder; PxGeometry* PGeom = GetGeometryFromShape(GeomHolder, PTargetShape); if (PGeom) { PxTransform PShapeGlobalPose = PTestGlobalPose.transform(PTargetShape->getLocalPose());
And,
const PxTransform PLocalPose = PShape->getLocalPose(); const PxTransform PShapeGlobalPose = PBodyInstanceSpaceToTestSpace.transform(PLocalPose); GET_GEOMETRY_FROM_SHAPE(PGeom, PShape); if (PGeom != NULL) {
changes to:-
PxGeometryHolder GeomHolder; PxGeometry* PGeom = GetGeometryFromShape(GeomHolder, PShape); if (PGeom != NULL) { const PxTransform PLocalPose = PShape->getLocalPose(); const PxTransform PShapeGlobalPose = PBodyInstanceSpaceToTestSpace.transform(PLocalPose);
Finally, WorldCollision.cpp needs the following:-
GET_GEOMETRY_FROM_SHAPE(PGeom, PShape);
changing to:-
PxGeometryHolder GeomHolder; PxGeometry* PGeom = GetGeometryFromShape(GeomHolder, PShape);
Actually, one further optimization that you can do is, for each of the above, move the location of the GeomHolder variable to be outside of each of the loops that they’re in.
That concludes our work with this for this article… but before we wrap up, let’s take a look at the disassembly for our new GetGeometryFromShape():-
0x0BF (191) bytes of code 00007FF7F3F41C30 mov qword ptr [rsp+8],rbx 00007FF7F3F41C35 mov qword ptr [rsp+10h],rbp 00007FF7F3F41C3A mov qword ptr [rsp+18h],rsi 00007FF7F3F41C3F push rdi 00007FF7F3F41C40 sub rsp,20h 00007FF7F3F41C44 mov rax,qword ptr [rdx] 00007FF7F3F41C47 mov rdi,rcx 00007FF7F3F41C4A mov rcx,rdx 00007FF7F3F41C4D movzx esi,r9b 00007FF7F3F41C51 movzx ebp,r8b 00007FF7F3F41C55 mov rbx,rdx 00007FF7F3F41C58 call qword ptr [rax+28h] 00007FF7F3F41C5B cmp eax,6 00007FF7F3F41C5E ja $LN10+60h (07FF7F3F41CD7h) 00007FF7F3F41C60 lea rcx,[__ImageBase (07FF7F2BB0000h)] 00007FF7F3F41C67 cdqe 00007FF7F3F41C69 mov r10d,dword ptr [rcx+rax*4+1391CF0h] 00007FF7F3F41C71 add r10,rcx 00007FF7F3F41C74 jmp r10 00007FF7F3F41C77 mov rax,qword ptr [rbx] 00007FF7F3F41C7A mov rdx,rdi 00007FF7F3F41C7D mov rcx,rbx 00007FF7F3F41C80 call qword ptr [rax+48h] 00007FF7F3F41C83 jmp $LN10+60h (07FF7F3F41CD7h) 00007FF7F3F41C85 mov rax,qword ptr [rbx] 00007FF7F3F41C88 mov rdx,rdi 00007FF7F3F41C8B mov rcx,rbx 00007FF7F3F41C8E call qword ptr [rax+40h] 00007FF7F3F41C91 jmp $LN10+60h (07FF7F3F41CD7h) 00007FF7F3F41C93 mov rax,qword ptr [rbx] 00007FF7F3F41C96 mov rdx,rdi 00007FF7F3F41C99 mov rcx,rbx 00007FF7F3F41C9C call qword ptr [rax+50h] 00007FF7F3F41C9F jmp $LN10+60h (07FF7F3F41CD7h) 00007FF7F3F41CA1 mov rax,qword ptr [rbx] 00007FF7F3F41CA4 mov rdx,rdi 00007FF7F3F41CA7 mov rcx,rbx 00007FF7F3F41CAA call qword ptr [rax+60h] 00007FF7F3F41CAD jmp $LN10+60h (07FF7F3F41CD7h) 00007FF7F3F41CAF test bpl,bpl 00007FF7F3F41CB2 je $LN10+4Bh (07FF7F3F41CC2h) 00007FF7F3F41CB4 mov rax,qword ptr [rbx] 00007FF7F3F41CB7 mov rdx,rdi 00007FF7F3F41CBA mov rcx,rbx 00007FF7F3F41CBD call qword ptr [rax+68h] 00007FF7F3F41CC0 jmp $LN10+60h (07FF7F3F41CD7h) 00007FF7F3F41CC2 xor eax,eax 00007FF7F3F41CC4 jmp $LN10+63h (07FF7F3F41CDAh) 00007FF7F3F41CC6 test sil,sil 00007FF7F3F41CC9 je $LN10+4Bh (07FF7F3F41CC2h) 00007FF7F3F41CCB mov rax,qword ptr [rbx] 00007FF7F3F41CCE mov rdx,rdi 00007FF7F3F41CD1 mov rcx,rbx 00007FF7F3F41CD4 call qword ptr [rax+70h] 00007FF7F3F41CD7 mov rax,rdi 00007FF7F3F41CDA mov rbx,qword ptr [rsp+30h] 00007FF7F3F41CDF mov rbp,qword ptr [rsp+38h] 00007FF7F3F41CE4 mov rsi,qword ptr [rsp+40h] 00007FF7F3F41CE9 add rsp,20h 00007FF7F3F41CED pop rdi 00007FF7F3F41CEE ret 00007FF7F3F41CEF
That’s 33% smaller than the original function… and slightly longer than the original macro – but significantly faster. Note in particular lines 17-21. This shows how the compiler can be clever when it’s given the chance – it’s figured out a calculation to jump straight to the code needed, rather than using nasty comparisons! This is way, way better than the first example we saw.
See you next time!
Credit(s): Robert Troughton (Coconut Lizard)
Status: Currently unimplemented in 4.12