|Author:||“No Bugs” Hare Follow:|
|Job Title:||Sarcastic Architect|
|Hobbies:|| Thinking Aloud , Arguing with Managers , Annoying HRs ,
Calling a Spade a Spade , Keeping Tongue in Cheek
[[ This is Chapter XIII(b) from “beta” Volume 2 of the upcoming book “Development&Deployment of Multiplayer Online Games”, which is currently being beta-tested. Beta-testing is intended to improve the quality of the book, and provides free e-copy of the “release” book to those who help with improving; for further details see “Book Beta Testing“. All the content published during Beta Testing, is subject to change before the book is published.
To navigate through the book, you may want to use Development&Deployment of MOG: Table of Contents . ]]
The opposite of a fact is falsehood,
but the opposite of one profound truth may very well be another profound truth.
There are quite a few common wisdoms when it comes to C++ and games. As it always the case when facing a bunch of common wisdoms, some of them have their merits, some are obsolete-beyond-belief, and some are just taken from a very different context and are not really applicable. Let’s take a look at the most popular ones.
“No C++, just plain C” – not for app-level programming
There is a school which teaches that C++ is no good, for performance and reliability reasons.
Many proponents of this point of view quotewho has said that “C++ is a horrible language”. However, it should be noted, that while Linus MAY have a point against using C++ for operating systems,his rant should not be easily generalized to such a different area as games.
“ over(ab)using C++ features is a different story, we’ll discuss these features below on case-by-case basis In short – code in OS’s is changed some orders of magnitude less frequently than for games (or any other business app), so the problem of code maintenance becomes of much more importance in a game world. On the other hand, things such as “we need to handle even the most extreme corner cases, such as being unable to allocate 20 bytes of memory” are of much less importance for a usual game app (whether client or server). As a result, I am firmly going against the camp which says that “C++ is no good for game development” (over(ab)using C++ features is a different story, we’ll discuss these features below on case-by-case basis).
1 Note that this point is arguable, as quite a few OS’s are currently heavily relying on C++, including such a serious thing as OS/400. Well, fortunately we don’t need to answer the question “whether we want to write an OS in C or in C++”
On “C++ as C-with-Classes”
A bit less extreme point of view is that C++ is good, but only as long as we’re using it only as “C with classes”. Well, often I myself am guilty of using C++ along these lines (with a few notable exceptions, pun intended) ;-).
Interpretation of the term “C-with-classes” varies between different developers,so when discussing it, I strongly prefer to separate those-C++-features-going-beyond-plain-C, and discuss them one-by-one, to see whether these features are worth using in the context of game development.
In any case, there is more or less a consensus that “classes” in “C with Classes” is a Good Thing™ (in particular, as it encourages encapsulation and decoupling). Now let’s discuss more controversial features of C++.
2 I’ve seen people who were saying that anything with C-style array in it is not real C++, so it is “C-with-classes”
C++ Exceptions – (Almost-)Zero-Cost until Thrown
There is a “common wisdom” that C++ exceptions are Bad, because they’re damn expensive. Another school of thought says that modern table-driven C++ exceptions are zero cost until they’re thrown. Well, while I am not 100% convinced that exception handling is indeed exactly “zero-cost” until the exception is thrown, but it is certainly very very close. The most common exception-related thing which may affect performance of your app, is an additional cost of calling non-inlined function (and in ancient times there was an additional cost to it due to exceptions, in the range of single-digit CPU clocks). However:
- “ It seems that this per-function-call penalty is gone now (though I’m not 100% convinced whether it is 100% gone, and that it is gone on 100% of the widely-used compilers) It seems that this per-function-call penalty is gone now (though I’m not 100% convinced whether it is 100% gone, and that it is gone on 100% of the widely-used compilers). On how it is achieved – see, for example,.
- Moreover, there are claims that C++ exceptions are less expensive (until exception is thrown) compared to C-style error-code checking after the function returns. While I didn’t run my own tests to validate this claim, I can tell that the performance difference (if any) between error checking and C++ exception in non-error case is certainly negligible
- to be honest, if you have a function which is that performance-critical, you should force-inline it anyway (more on inlines and force-inlines below), and then even for older compilers the problem will go away
On the other hand, the cost of exception when it is thrown, is pretty large; by the very definition of exception handling, it has to perform RTTI or RTTI-like things (and usually it is direct RTTI, see, for example,). In one experiment, the cost of exception thrown-and-caught has been measured to be of the order of 1000-2000 CPU clocks.
Practical conclusions with regards to exceptions:
- DO use exceptions; error codes are usually much more error-prone in the context of games.
- “ Cost of exception handling until the exception is thrown, is negligible Cost of exception handling until the exception is thrown, is negligible
- Cost of exception when it is thrown, is significant. Therefore:
- DON’T use exceptions to return from the inner loop
- However, using exceptions to perform not-so-normal return from a top-level event handler, is generally ok (as top-level event handlers usually take at least hundreds of thousands of clocks anyway).
3 except, maybe, a very few extreme corner cases
4 and this “negligible difference” is that small, that I have no idea whether it is positive or negative
5 for nuclear reactors and operating systems the balance of pros and cons can be different, but for games and business apps the choice in favor of exceptions is very clear
Smart Pointers and STL: Allocations are Evil
“ In other words, if you replace an std::unique_ptr<> with a simple manually-handled new – well, you will get nothing except for an additional headache. Smart pointers are often considered a Big No-No for games. However, I contend that it is not smart pointers per se , but rather related allocations which will hurt your performance (see more on allocations and related lack of spatial locality in “Allocations are Evil. Even More So Because of Locality” section above). In other words, if you replace an std::unique_ptr<> with a simple manually-handled new (or, Stroustrup forbid, with (YourClass*)malloc(sizeof(YourClass)) ) – well, you will get nothing except for an additional headache. Note though that std::shared_ptr<> is different performance-wise (and usually you CAN beat it as long as you’re sharing only within the same thread).
When it comes to using STL in the context of games, I should admit that it is quite controversial. Overall, I love STL, but it does have its shortcomings when it comes to high-performance stuff, games included. My main concern about STL is that it uses way too many allocations for my taste (and more importantly, for poor CPU cache too), and as it was discussed in “Allocations are Evil. Even More So Because of Locality” section above, allocations are one of the most nasty things performance-wise out there .
As a result, I tend to advise along the following lines (YMMV, batteries not included):
- DO use STL sparingly.
- DON’T use STL just for the sake of using it.
- DO use arrays when size is known or very limited (that is, properly wrapped with bound checks) where applicable. Even better, DO use eastl::fixed_*<> containers (see more on EASTL below).
- The same applies to char arrays in lieu of std::string (though for std::string it is usually mitigated these days by small string optimisation (SSO), see [[TODO]] section above). Once again, proper wrapping (such as in eastl::fixed_string ) is necessary.
- “ As a rule of thumb, DO favour std::vector<> over std::list<> (as it has MUCH better locality). Use cases for std::list<> do exist, but they’re rather few and far between. As a rule of thumb, DO favour std::vector<> over std::list<> (as it has MUCH better locality). Use cases for std::list<> do exist, but they’re rather few and far between.
- BTW, DO remember that std::list<>::size() is O(N). I’ve run into this problem myself once, was Really Surprised when it hit . If this is a concern – make your own wrapper around std::list<> with size() being O(1)
- As a rule of thumb (and unless you need ordering), DO favour std::unordered_map<> over std::map<> (the same applies to all tree-based containers, namely to multimap<> , set<> , and multiset<> ). std::unordered_*<> containers exhibit MUCH better locality, and tend to be MUCH more efficient.
- Caveat: hash table collisions can be a Quite a Bad Thing for std::unordered_*<> containers, including potentially opening a door for DoS attacks (if your keys come from Clients) . Be careful to make sure your hash function is good for your keys, have run-time reporting on number of collisions for critical collections, and see Vol.3 for discussion on DoS.
- If you do need ordering, DO consider sorted arrays/vectors instead of std::map<> (or even better, eastl::vector_map<> etc., more on EASTL below). While not always possible (due to MUCH higher addition/deletion costs), quite often it can make wonders.
- DO use optimized STL versions such as
- DO use eastl::fixed_*<> containers (they have “overflow feature” too).
- DO use eastl::intrusive_*<> containers where applicable
- DO use vector::map<> etc.
- “ Most importantly, DO understand underlying implementation of STL containers Most importantly, DO understand underlying implementation of STL containers. In particular:
- std::vector<> is a single allocated block with some reserve (reallocated when you go over reserved portion). Which can be translated to “very good data locality”
- std::list<> is a double-linked list. Which means “locality is poor”
- std::map<> , std::multimap<> , std::set<> , and std::multiset<> are all binary trees. Means “lots of allocations, very poor data locality”
- std::unordered_*<> containers are hash maps. Can be translated into “risk of collisions, but few allocations and pretty good data locality”
Using boost:: – Case by Case and Only If You Do Need It
You may ask: “ok, you’ve said about STL, but what about boost::?” My position on boost:: usually goes as follows:
- all the MUST-HAVE stuff from boost:: is already in std::, phew (and THANKS A LOT to boost:: guys to make it happen)
- boost:: as a whole is a Damn Large set of libraries
- as a result, while there CAN be things in boost:: which you MIGHT need, I do NOT recommend to say “we’re using all the boost::”. Instead, I suggest an approach of:
- waiting until a need for a specific boost:: library arises
- discussing it (and implications, including performance ones) by the team
- and deciding that “yes, this specific boost:: library is a good fit for our purposes”
Polymorphism – Two-Fold Impact
“ in practice, there are usually two different costs associated with polymorphism Yet another performance-related “common wisdom” says that polymorphism is a Bad Thing performance-wise and should be avoided. Well, I need to say that there in practice, there are usually two different costs associated with polymorphism. The first (explicit) cost is the cost of taking a VTABLE pointer from the body of your polymorphic class, and going into VTABLE (which MAY cause a cache miss, though it is unlikely to cause any real issues as VTABLE is going to be cached if it is accessed anywhere close to often). In addition, there is usually an additional cost similar in nature to that of branch misprediction (see below).
However, there is a second (implicit) cost of polymorphism as-it-is-traditionally-used. Usually, to exploit polymorphism, we’re creating our polymorphed objects via new . And, as I noted above quite a few times, new is a Bad Thing performance-wise (in particular, because in practice it causes that dreaded cache miss much more frequently than VTABLE access). On the positive side, this second part of polymorphism-related costs CAN be avoided (see discussion on it in “Mitigation: Reducing number of Allocations” section above).
Overall, unless you’re using polymorphism in some Really Crazy manner, I didn’t see it to cause anywhere significant performance degradation (that is, beyond the costs of allocation+poor locality).
RTTI – Can be Easily Avoided Most of the Time
RTTI (a.k.a. Run-Time Type Information) is quite expensive when it’s used. Honestly, I haven’t seen any real-world uses for RTTI beyond dynamic_cast<> (while such uses MIGHT exist, I certainly missed out on them).
And when it comes to dynamic_cast<> , my position goes as follows:
- “ most of the time, it is trivial to replace dynamic_cast<> (together with related code) with your own virtual function most of the time, it is trivial to replace dynamic_cast<> (together with related code) with your own virtual function. If your dynamic_cast<> happens within performance-critical code – just DO it.
- If you CANNOT replace dynamic_cast<> with your own virtual function (and the only case I know for it, is when you’re not allowed to modify base class) – then it MAY be a legitimate use for dynamic_cast<> .
And that’s about it on RTTI.
On Branch (mis)-Predictions and Pipeline Stalls
One further topic which is routinely raised in the context of performance, is branch predictions and pipeline stalls. I won’t go into a lengthy discussion about it (see, for example,[Wikipedia.BranchPredictor]for Branch Predictions 101), but what really matters from our perspective, is an observation that:
If there is an “if” statement in the program,and execution goes along the “wrong” path– it is going to cost us
While the cost of misprediction is not that high (10-20 CPU clocks), in some of the Very Inner Loops it might start playing its role.
Dealing with mispredictions can be done in one of the two following ways.
The first way is to remove conditional processing completely ( NB: do NOT do this until you are sure that your bottleneck is within this piece of code ). For example,
a += (b ? 0 : c);//all operands including b are uint32_t // b is guaranteed to be either 0 or 1
can be rewritten into:
a += (b-1)&c;//no conditional processing here assert( ((b-1)&c) == (b ? 0 : c) );
Here (b-1) can take values of 0 or (uint32_t)(-1) (the latter equal to 0xFFFFFFFF). Then, bitwise-anding it with c, we’ll get either 0 or c , which is exactly what we need. Note that operand types are Really Important here (and DON’T try to use such tricks until (a) you’re sure that your bottleneck is here, (b) you know EXACTLY what you’re doing).
In theory, compiler should be able to optimize these things for us. In practice, however, more often than not it doesn’t (things MAY improve in the future though). So, for Really Performance-Critical pieces of code we MAY try doing it ourselves ( NB: after making such a change, we need to measure performance, and if the trick didn’t help – rollback the change to keep the code more readable ).
“ The second way of dealing with branch misprediction, is to tell the compiler which of the branches is the “right” (more likely) way of the code execution going The second way of dealing with branch misprediction, is to tell the compiler which of the branches is the “right” (more likely) way of the code execution going. If your code gets into each branch with 50% probability, you can’t really define “right” and “wrong” (and you cannot do much about branch misprediction besides replacing it with arithmetics), but if probability is not that even, you DO have a chance to deal with it.
To tell GCC/Clang which branch is “right”, you can use __builtin_expect() intrinsic pseudo-function (better still, use likely()/unlikely() macros as described here:). For MSVC, there are no such intrinsics, but there is an observation (which stands in most, but not in all cases [TODO]) that the “if” branch of the “if” statement is usually treated by MSVC compiler as “likely” (and “else” branch as “unlikely”).
To re-iterate the most important thing about handling branch misprediction:
DON’T do it until you’re sure that It Is Your Bottleneck!
Otherwise, you’re risking to end up with lots of convoluted and barely-readable code (or with code which makes too many assumptions about branch direction – and developers are notoriously bad with it), and from my experience, a convoluted unreadable code tends to inhibit further (often MUCH more significant) optimizations.
6 or equivalent; in particular, ?: operator is often guilty of being compiled into equivalent of if
7 we don’t define what is a “right” path and what’s a “wrong” path – yet.
On inline asm and SSE
There are cases where you may want to write inline asm. However, they should be REALLY few and far between. Contrary to one popular belief, inline asm still CAN improve your program. And contrary to another popular belief, it is Very Rarely worth the trouble. If you decide to go against my advice and go inline asm for app-level code, keep in mind that:
- “ Correct operation of inline asm MAY easily depend on the context where you use it Correct operation of inline asm MAY easily depend on the context where you use it.
- In particular, if you have your inline asm as a part of inline function, and make a mistake with a clobbered register list for GCC – all kinds of things can happen (the code may work for one particular invocation of this inline, and when you call it from a different place – a dragon can eat your code, or even worse – an “undefined behavior” can happen).
- Inline asm is Very Different for GCC and MSVC (and on MSVC it doesn’t even exist for x64 and ARM)
At some point in time, x86 SSE was an excuse to use inline asm; these days, fortunately enough, SSE intrinsics were added, so you can program SSE code without the hassle of inline asm (and in mostly compatible-between-GCC-and-MSVC-manner, though obviously still restricted to x86/x64). In short – for SSE, DO use intrinsics rather than inline asm (well, as long as it is possible).
On inlines and force-inlines
One thing which is often underestimated in performance department, is the cost of function calls.gives the cost of function call at 25-250 clocks (depending on number of arguments), and my personal experience for your-usual-function-with-one-or-three-arguments is about 20-30 clocks, so we’re pretty much on the same page here.
If it happens in the innermost loop of your algorithm, such a cost can be very considerable. And worse than that,
From my experience, as of 2016 compilers are still notoriously bad with deciding whether to ignore you inline specification or not .
“ Compiler will use all its Next-to-Divine Wisdom to show you that it is smarter than you are, and to ignore most of those inline specifications you carefully wrote. Yes, you read it correctly: if you designated your function as an inline, it doesn’t mean that it will be inlined. Compiler will use all its Next-to-Divine Wisdom to show you that it is smarter than you are, and to ignore most of those inline specifications you carefully wrote. In such cases, forcing inline usually helps; it can be done with __forceinline in MSVC, and __attribute__((always_inline)) in GCC (and BTW, make sure to make a cross-platfrom #define FORCEINLINE that works everywhere).
However, as with other performance tricks, I need to re-empasize:
DON’T overdo it. Use force-inline ONLY after you have identified the bottleneck. Indiscriminate use of force-inline can easily hurt performance.
8 I’ve seen code which has performance improved over 1.5x after placing two or three of force-inlines – BOTH on MSVC and GCC
One thing which was a long-running scare of early-days C++, is “template bloat”. It refers to the compiler creating numerous instances of pretty-much-the-same code when it instantiates templates. In practice, I didn’t see it as a real problem (that is, unless you’re doing Really Crazy Things), and these days C++ templates are even used on MCUs which have very limited amount of program memory. As we’re not on MCUs (phew), we should be generally be fine with templates (that is, unless we’re into deeply recursive templates – these DO have a potential to get ugly ).
If code bloat is still a concern, a technique similar to the one incan be used, though I’ve never seen it to be Desperately Needed in practice.
On compiler flags
“ One of the first things novice developers are trying to do when optimizing for performance, is trying to play with compiler flags. One of the first things novice developers are trying to do when optimizing for performance, is trying to play with compiler flags. Well, usually it helps; OTOH, it helps only a little bit compared to what a reasonable code optimization can achieve (and even less compared to what removing of the outright-crazy-stuff can achieve). The main problem with compiler flags is that usually they’re applied indiscriminately to the whole project, and most of the optimizations are not that universally useful as we’d like. Better results can often be achieved by applying flags on per-subproject (or even per-file) basis.
For GCC, the most popular performance-affecting options are:
- -O2 or -O3 (note that O3 often produces worse performing code than O2(!))
- -march (well, at least for Server-Side you know your architecture exactly, and for Client-Side finding something less than -march=corei7, will be quite difficult these days).
- -fomit-frame-pointer (not much gain, but why not?)
- -funroll-loops (results vary, but certainly worth trying for critical code)
- and don’t forget about -fprofile-generate; it is going to take a bit of effort, but is often worth it (that is, if you’re concerned about performance)
Note that I’m not saying that these options will necessarily help; what I’m saying is that you MAY want to try them, run your compiled app, and see whether there is any difference.
On Manual Loop Unrolling
Using compiler option such as -funroll-loops, can make your app run faster – or slower, if you’re not that lucky. However, it doesn’t mean that loop unrolling is useless – it just means that indiscriminate loop unrolling can have their problems.
As a result, I’ve seen manual unrolling of the innermost most-critical loop to help quite a bit (like “up to 2x improvement”). It is a Big Headache though, so once again –
NEVER EVER DO IT until you’re sure that it is your bottleneck.
Putting Things into Perspective: Rough Costs in Clocks
We should forget about small efficiencies, say about 97% of the time:
premature optimization is the root of all evil.
Yet we should not pass up our opportunities in that critical 3%
The following table tries to summarize some of the CPU clock costs which typically occur in the program:
|Operation||Cost in CPU clocks|
|“Simple” register-register op (ADD,OR,etc.)||<1|
|“right” branch of “if”||~1|
|“wrong” branch of “if” (branch misprediction)||10-20|
|Additional cost of the function call being polymorphic (NOT accounting for cache misses due to poor locality if allocation is used)|
|Main RAM read||100-150|
|NUMA: different-socket L3 read||100-200|
|NUMA: different-socket main RAM read||200-300|
|User-to-kernel switch and back (i.e. minimal cost of kernel function)||300-500|
|Context switch (direct costs)||2000|
|Context switch (total costs, including cache invalidation)||10K-1M|
And a practical suggestion based on the numbers above:
Three things Really Matter for performance. The first one is Algorithm, the second one is your code being Non-Blocking, and the third one is Data Locality. The rest of the tricks, while potentially useful, should be used only on case-by-case basis, when a bottleneck is identified. Otherwise it will be a Premature Optimisation.
9 for modern x64 CPU, YMMV, batteries not included
11 less in certain special cases
“For-free” Optimizations a.k.a. Avoiding Premature Pessimization
One exception to the “DON’T optimize too early” rule above is “optimizations” which do NOT cause any trouble when you’re developing (in particular, do NOT reduce readability and compatibility). In some sources (see, for example,), not doing such “for-free” optimizations is even referred to as “premature pessimization”.
“ One example of such a ‘for-free’ optimization is making sure that all your loops have standalone increments written as ++it instead of it++ . One example of such a “for-free” optimization is making sure that all your loops have standalone increments written as “++it” instead of “it++”. The point here is that any kind of postfix operation on a non-integer type (such as iterator), strictly speaking, requires a temporary copy. While compiler MAY be able to eliminate this temporary as a part of optimization, well – why not save it the trouble (and not to avoid any chance that it will be suboptimal on some-other-compiler) –
Especially as it costs you pretty much nothing?
Other tricks of the similar “cost-nothing” nature include:
- passing structs and classes by const reference (rather than by value). If you are not doing it yet – chances are that you’re doing something wrong.
- Note that returning values via non-const reference parameters (essentially output parameters), while still popular, gets out of favour and with move constructors is rarely really necessary
- using initialization over assignment
- a special case: using “: str(s)” instead of “str=s;” in constructors to initialize string and other allocated data members
- specifying “move constructors” and “move assignment operators” where applicable
- specifying them as noexcept wherever applicable
[[To Be Continued…
This concludes beta Chapter XIII(b) from the upcoming book “Development and Deployment of Multiplayer Online Games (from social games to MMOFPS, with social games in between)”. Stay tuned for beta Chapter XIII(c), dedicated to some of C++ “best practices” (well, at least as I understand them ;-)).]]