enums - Fastest implementation of simple, virtual, observer-sort of, pattern in c++?

Question

Welcome To Ask or Share your Answers For Others

enums - Fastest implementation of simple, virtual, observer-sort of, pattern in c++?

asked Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

enums - Fastest implementation of simple, virtual, observer-sort of, pattern in c++?

I'm working my arse off trying to implement an alternative for vtables using enums and a ton of macro magic that's really starting to mess with my brain. I'm starting to think i'm not walking the right path since the code is getting uglier and uglier, and will not be fit for production by any means.

How can the pattern of the following code be implemented with the least amount of redirection/operations?

It has to be done in standard c++, up to 17.

class A{
    virtual void Update() = 0; // A is so pure *?*
};

class B: public A
{
    override void Update() final
    {
        // DO B STUFF
    }
}

class C: public A
{
    override void Update() final
    {
        // DO C STUFF
    }
}

// class...

int main()
{
    std::vector<A*> vecA{};

    // Insert instances of B, C, ..., into vecA

    for(auto a: vecA) // This for will be inside a main loop
        a->Update(); // Ridiculous amount of calls per unit of time

    // Free memory
}

PS: If enum, switch and macros are indeed the best option, I guess I'll simply try to freshen up my caches and come up with a better design.

PSS: I know this is micro-optimization... Heck, I need to nano or even pico optimize this (figuratively speaking), so I will simply ignore any utilitarian responses that might come up.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Answer

深蓝 · Answer 1 · 2021-10-23T19:18:22+0000

As the first comment said, you have an XY problem here. Sorting / reordering is ok, and you have many objects, not a huge number of different classes, and no need to support types that your code doesn't know about at compile time. Polymorphism + virtual inheritance is the wrong choice.

Instead, use N different containers, one for each type of object, with no indirection. Letting the compiler inline B::Update() into a loop over all the B objects is much better. (For the trivial example below of incrementing one member int, my static performance analysis from looking at the asm puts it at about 24x faster on Skylake with the data hot in L1D cache. AVX2 auto-vectorization vs. call in a loop is really that huge.)

If there was some required order between the objects, including between different types of objects, then some kind of polymorphism or manual dispatching would be appropriate. (e.g. if it mattered what order you processed vecA in, keeping all the B objects separate from all the C objects wouldn't be equivalent.)

If you care about performance, you have to realize that making the source larger might simplify things for the compiler / in the asm output. Checking / dispatching based on the type of each object inside the inner loop is expensive. Using any kind of function pointer or enum to dispatch on a per-object basis can easily suffer from branch mispredicts when you have a mix of different objects.

Looping separately over multiple containers effectively hoists that type check out of the inner loop and lets the compiler devirtualize. (Or better, shrinks each object to not need a vtable pointer, enum, or function pointer in the first place, because its type is statically known.)

Writing out a separate loop for each container with a different type is sort of like fully unrolling a loop over different types after hoisting the type dispatching out of the inner loop. This is necessary for the compiler to inline the calls, which you want if there are a lot of objects of each type. Inlining lets it keep constants in registers across objects, enables SIMD auto-vectorization across multiple objects, and simply avoids the overhead of an actual function call. (Both the call itself and spill/reload of registers.)

You were right that if you did need per-object dispatch, C++ virtual functions are an expensive way to get it when you're using final overrides. You're paying the same runtime cost that would let your code support new derived classes of arbitrary size that it didn't know about at compile time, but not gaining any benefit from that.

Virtual dispatch only works with a level of indirection (e.g. a vector of pointers like you're using), which means you need to manage the pointed-to objects somehow, e.g. by allocating them from vector<B> poolB and vector<C> poolC. Although I'm not sure most implementations of vector<> use realloc() when they need to grow; the new/delete API doesn't have a realloc, so vector may actually copy every time it grows, instead of trying to extend the existing allocation in place. Check what your C++ implementation does, since it might suck compared to what you can do with malloc/realloc.

And BTW, it should be possible to do the new/delete with RAII with no extra overhead for allocation/deallocation, as long as all your classes are trivially destructible. (But note that unique_ptr may defeat other optimizations for using the vector of pointers). std::unique_ptr warns that it's UB to destroy it via a pointer to the base class, so you might have to roll your own. Still, on gcc on x86-64, sizeof(unique_ptr<class C>) is only 8, so it only has a single pointer member. But whatever, separately allocating zillions of tiny objects sucks so don't do it that way in the first place.

If you did need some kind of dispatch like the title asks for

If the objects are all similar sizes, then you really want to loop over objects, not pointers-to-objects. That would avoid the extra cache footprint of a vector of pointers, and it avoids the extra pointer-chasing latency that out-of-order execution has to hide to keep the execution units busy. But C++ virtual inheritance doesn't provide any standards-compliant way to get polymorphism for union upoly { B b; C c; } poly_array[1024]; You can hack this up yourself with reinterpret_cast<> in a way that probably works on x86-64 gcc, but you probably shouldn't. See @BeeOnRope's followup: Contiguous storage of polymorphic types. (Also an older Q&A: C++ polymorphism of a object in an array).

If you need that, the highest-performance way would probably be to build it yourself with an enum to index a table of function pointers (or use a switch() if your functions can inline). If your functions don't inline, switch() to a bunch of function-call cases doesn't usually optimize down to a table of function pointers even if they all have the same args (or no args). You usually get a jump table to a block of call instructions, rather than actually doing an indirect call. So there's an extra jump in every dispatch.

C++17 std::visit with std::variant<B, C> (using non-virtual inheritance for B and C) seems to give you dispatch based on an internal enum. std::visit uses its own jump table to dispatch, even with only 2 possible types, instead of inlining them both and using a conditional branch. It also has to check for the "uninitialized" state all the time. You can get good code if you manually work around that with B *tmp = std::get_if<B>(&my_variant), and a __builtin_unreachable() to tell gcc that nullptr isn't a possibility. But at that point you might as well just roll your own struct polymorph { enum type; union { B b; C c; }; }; (with non-virtual functions) if you don't need an "uninitialized" state. Related: C++ polymorphism of a object in an array.

In this case you only have one function, so you can put a function pointer inside each object as a member. Like void (*m_update)(A* this_object). In the caller, pass a pointer to the object as a void* or A*, since it's a non-member function. The implementation of the function will reinterpret_cast<C*>(this_object). (Not dynamic_cast: we're doing our own dispatchiing, not using C++'s).

If you want to use B and C in other contexts where the function-pointer member would be taking up space for no benefit, you can keep the function pointers in a separate container instead of in the base class. So it would be for(i=0..n) funcptrs[i]( &objects[i] );. As long as your containers don't get out of sync, you're always passing a pointer to a function that knows what to do with it. Use that with union {B b; C c} objects[] (or a vector<union>).

You can use void* if you want, especially if you make a separate array of function pointers. Then the union members don't need to inherit from a common base.

You could use std::function<> to store pointers to instance member functions, but on x86-64 gcc that's a 32-byte object. It's better for your cache footprint to only use 8-byte regular function pointers and write code that knows to pass an explicit pointer equivalent to the this pointer.

Putting a function pointer in each object may take more space than an enum or uint8_t, depending on current size/alignment. A small integer index into a table of function pointers might reduce the size of each instance of your objects vs. a pointer member, especially for 64-bit targets. Smaller objects could easily be worth the couple extra instructions to index an array of function pointers, and the possibly higher mispredict penalty from an extra pointer dereference. Memory / cache misses are often a bottleneck.

I'm assuming you do have some per-instance state, even though you don't show any. If not, then a vector of ordinary function pointers to non-member functions will be much cheaper!

Overhead comparison:

I had a look at the compiler-generated asm (gcc and clang targeting x86-64) for a few ways of doing this.

<a href="https://gcc.godbolt.org/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAKxAEZSBnVAV2OUxAHIBSAJgGY8AO2QAbZlgDU3fgGEAbpmRFiM7NwAMAQU1axAQwYNJOgOwAhXZOuSADswBGovMhBWb8vMQLN9oyfKoeOiSAKq26PoEmBAAlNL8ACKSGjLmkgD0GSaSeMZMdqyYkgBU3LK83FqyZdrcpolpuroGRpLpIIVOLjn1lto2uUIEkgC2APoOCcmp/P1ag/bdru7WgcFhEVEx8aiKxMTBxQBmwn6r0hZjk3yWvJZzl411DU3aDATEzMqSspKdS2cyF6FguwhGE2BMhmbwWNkBLjcAw8QRC4Ui0Tikj2mAOR0kpyE52R1j61kht0pjyezVec104MkAGU/tDJAw8AAvTCoY4QWSxNLWLKSWgANkkDgAntFjA4lPpmAxirzJAQEMV5AR9E5irYgsNcc1tH9gKJUA4/ONkELMtkiEMPvphnhtpJ9JIsITCHhUEICSRfiAQOjtnFYcatCKhKgRsINYcCMJgO6Rn5/MA8IpjJgAI7MLN%2BTDDd0MUZq1BjfQAa2TmwxxRjQgAtJ5vL5RLoPl8fk22wRxuksgDHECTLT5oNGRMpmzZpOURtQ5j4lkcXipITiXDSVdrg5KfdbfVnmZGvTtIymek2RzubyIH2vAPzILqSK/aIpZIACySmWYMYti4tiDhUEoIwAO6EAgkgxgE2q6nYBrRKo2jrCEyq2PoxDbNafrasIuIMBAHzoMGijKCQMiyE%2B3iDmo0i8GKlHmKQ7IEORICUSoNGyIxfAsUoAoThcxwkBAioOoJDidKx8SDDYDgAHTLjssKDOJxCScw0nMa4ATCQpimSMgqlbCutrSC8p7NAIeDHKKugiuqxSWhywL8Ue9ySJB%2BhSuxxB5gWhxCCmzqSJgAAenwejhuHfqq%2BrgsRkqAUcxiuaB4HKJlGqjMqojZspXacRREHUXIWhlPw2CGcgZiWHS8wYfV4z9h24zJYaxAkfEfRiRJUmVvoclKFoq7ZAAKggeQBsQvl4KI/jyo6%2BIeqM%2BjCJI5qoLYFyDPozZqGp4aPCKABKwQuMw5pKu6owsCWqrIOmQEgcwQiENijlJqMmC0rZ2gZLUeiiIYQGoF%2Bj3ELYSBdGObGsqCzXORkZHlVRqhyPq0MkHDjGUV1UNSgNqPaLjUow3DyFfuM8X%2BdwACs5i0BovA/sz54taibWU9TCBYgNJJ2lpOkOqNfMk5NkgzXNWmLctaVrVIG1bf6u37SLwrZO65kNmdQ7ZFd%2BDILdLDGPoj2fSML1vXYH1fbbv14P9Fz8/jsElLY0xMWKlP0wcjMs6kTPc%2B7JMC6UxboD7bKCQHDOkyzbMc1zGk2ArEAJOkPtgGAbIx3Hjz7d5r6XAuJkLCU7nFGy%2B0Xjuim18dtWnW%2Blc0jZuglBkkYipEYW4hb7pCCEtjEBaOpfnBsbupIwCoBguRYPoJXaJ9vr%2BswlMVxcO8k0LKOnop6QHo3il/DajcnrTUqBwlzOs%2BznNhxG6G85vfrE3jsOC/1oIRYHxnt7X2CcSYP2DuYUO4cgG7xKEXMBzFE5B2Ts/NOb8L6Z0DNnIUecC5JEimPYuudDzl2Fk3QY1da6%2ByCilWGQUByvQ%2BDRaqagIC2A7gdJShhMCt2wO3DOu5bKvDqHUeyIhxBSBoscT6ygt7nFqmjOwjCCDfmQLhWwtgpSdGQBqZA1ZjAK3lHWfURg8BOG/J9ZUIwHTKmKA5XIIwGAIBYKIEI6pJ6QTiOvLQGNuIVWxrIfxsiRBJj9DRDCZ11C1XqgAMTJrArQrUiahPkX6PwR9O5i2GgSMayA4nGWsHyLh3cxE6HKYyBgrRjC3i5DyPk3ZvgjBEo3KplEtDjEwKITA5Y6n3j5CUDpT8YGlK0FUtJ4T/T9IaaRMqIBJlb0

Categories

enums - Fastest implementation of simple, virtual, observer-sort of, pattern in c++?

enums - Fastest implementation of simple, virtual, observer-sort of, pattern in c++?

Please log in or register to add a comment.

Please log in or register to answer this question.

1 Answer

If you did need some kind of dispatch like the title asks for

Overhead comparison:

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags