Category Archives: programming

Creeping inefficiency

July 23, 2014 | Filed under: programming

Here is why I reckon that triple-A game runs slow on your PC. The real reason :D

Step 1: Geniuses at Intel / AMD / ARM design an unbelievable processor capable of a bazillion operations per second. Efficiency 100%

Step 2: Someone writes a compiler that converts C++ into assembly language / processor specific stuff, makes a lot of assumptions and loses a big chunk of efficiency. Efficiency 80%

Step 3: A coder like me waltzes in and writes some code that is as optimized as he can possibly manage, but has deadlines etc. and knowledge gaps, meaning it’s slightly less efficient than optimal. Efficiency 75%

Step 4: He then writes it to run on a single core, because the headache of smoothly spreading tasks over all the cores is unbelievable, plus game code doesn’t multithread easily so… Efficiency 25%

Step 5: Because writing a new engine for each game is un-trendy these days, the coder decides to use an off the shelf engine that makes even more assumptions and compromises… Efficiency 15%

Step 6: Coder #2, not knowing the assumptions Coder #1 made when he wrote those handy functions, calls them every frame instead of once… Efficiency 5%

Step 7: The game gets run on a typical desktop PC, with 30 different apps fighting for CPU and RAM, IM clients, P2P stuff, web browsers, email, all that crapware that shipped with the PC, anti-virus scanners, cool desktop widgets that tell you the weather, music streaming as you play… Final Efficiency 3%.

eff

My numbers are wild guesses, but I reckon there is some truth to it all. For inexperienced coders, using off the shelf engines probably boosts efficiency. Maybe with some engines, under some circumstances, on some hardware, multithreading is more feasible. I can’t help fantasizing about a PC that absolutely locked everything down in a big way when you launched a fullscreen game: turned off everything that could possibly use some CPU or RAM and let the game run like an Xbox. Maybe that is what steambox will become?

That’s more likely than many programmers learning how to optimize, that’s for sure :(

 

Here is a big battle on GSB2 running at 1920×1200 res, on a GTX 670, quad core Windows 7 PC. This was taken using the Visual C++ concurrency visualizer. 3732 is the main game thread. Green is busy, red is idle, light blue is sleeping (end of frame, waiting for flip).

multi

1284 seems to be the thread where DirectX or the Nvidia driver does its stuff (not sure which).

7596, 2692 and 2788 are my additional GSB2 threads doing processing. Each of those colored bubbles represents one or more tasks that a thread has grabbed and is working through. The big red stretches are obviously gaps I could potentially fill as I find ways to break apart dependencies between tasks and push more of the main thread’s work onto the other cores. It’s obviously already been worthwhile, as I reckon I’m currently doubling the framerate (just about) thanks to multithreading. Almost all the grey blobs are transformation of particles within particle emitters, packed into arrays. These are too numerous and cause too much thread-scheduling overhead right now, so I might make those arrays bigger, or even dynamically sized.
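To make the 'bigger arrays' idea concrete, here is a minimal sketch of splitting a pool of particles into larger batches, so each worker task covers more particles and fewer tasks need scheduling. The function name `MakeParticleBatches` and the batch sizes are assumptions for illustration, not code from GSB2.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: split `count` particles into [begin, end) index ranges
// of at most `batchSize` particles each. One worker task would process one range.
std::vector<std::pair<std::size_t, std::size_t>>
MakeParticleBatches(std::size_t count, std::size_t batchSize)
{
    assert(batchSize > 0);  // a zero batch size would never advance the loop

    std::vector<std::pair<std::size_t, std::size_t>> batches;
    batches.reserve((count + batchSize - 1) / batchSize);  // ceiling division

    for (std::size_t begin = 0; begin < count; begin += batchSize) {
        // The last batch may be shorter than batchSize.
        batches.emplace_back(begin, std::min(begin + batchSize, count));
    }
    return batches;
}
```

With 1,000 particles, a batch size of 64 gives 16 tasks while a batch size of 256 gives only 4, so the batch size is the knob that trades scheduling overhead against how evenly work spreads over the cores.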

 


My big slowdown function

May 04, 2014 | Filed under: gsb2 | programming

Well I feared as much when I wrote this: http://positech.co.uk/cliffsblog/2014/04/14/reading-back-from-gpu-memory-in-directx9 but it turns out that, yup, that function is the slowest in the entire engine, at least until battle is joined. I really need to fix it.

Essentially what I’m doing there is maintaining a depth buffer for objects in the game. I then do some fancy processing on that buffer (all taking place in video card memory). I then really badly need to know the values of the depth buffer for about 100 different points, and based on the outcome of that, I either don’t draw, draw some stuff quite small, or draw it really big. In short, I’m scaling an object based on specific values of the depth buffer.

Right now, my engine does not use vertex shaders at all. I just use pixel shaders, and have vertex shaders as NULL. I’m pretty sure the solution to my problem is easy if I go ahead and draw all of these objects, then write a vertex shader that can scale each object accordingly. The thing is, I’m using DirectX9 and therefore I really do have separate vertex and pixel shaders. This is going to involve me reading up on the most undocumented stuff ever, which is how vertex shaders and pixel shaders can be used in a 2D game under DirectX9.

Bah.

So today I added the first few new fighters to my GSB 2 engine. As a result, for the very first time I noticed the release mode build looking like it wasn’t running at a full 60 FPS at 1920×1200. My target is a healthy 60 FPS at 2560×1440, so this will mean some proper optimizing. I thought I’d keep a diary here of my investigations. First step is to do a ‘releasesymbol’ build (release build with debug symbols) and run aqtime pro, my optimizer of choice, to look for CPU slowdowns. This is an instrumenting profiler so it will be slooow…..

My initial investigation shows that 50.68% of the time is drawing the battle, 31.62% is processing the frame. I decide to concentrate on the frame processing first, as this is easier to potentially multithread…

f1

So it’s pretty clear I need to work on the debris processing. This is already being multithreaded though… Digging deeper, it seems that a sorting function takes up 99.95% of that time, and 99.73% of *that* time is spent in GUI_DebrisCloud::Clear(). Ouch. It’s immediately obvious that per-frame sorting is mega overkill anyway, but why is Clear() so slow? Aha, because it involves removing the object from another, less optimised list. Ouch… Some digging shows I only add it back to that list later in the frame anyway, so this is entirely redundant, so that’s a very easy win! Next let’s look at GUI_3DManager::PreDraw().

f2

Yikes. Looks like STL is not being my friend at all here. A bit of investigation is called for… Again, this is an STL sort algorithm. I suspect I may have an unusually long list of items to sort there, and some digging suggests that this list has about 400 items in it. Not a lot, so maybe the actual sort comparison is slow? No, it’s a simple float comparison. What can be going on? Is the STL list sort() really that inefficient for so small a list? Apparently so. My options are to replace it with a vector (which makes removes/inserts a bit slower), or sort less often, or find a way to reduce the list size. The sorting is already being done while other threads are busy. I’ll experiment with a switch to vectors. While I’m at it, let’s take a look at the slowest stuff inside the battle drawing…
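For illustration, the list-to-vector switch could look something like the sketch below. `DrawItem` and `SortByDepth` are hypothetical stand-ins (the real GUI_3DManager types aren't shown here). The idea is that a vector keeps its elements contiguous and can use `std::sort`, whereas `std::list::sort()` has to chase node pointers.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for whatever the 3D manager actually sorts.
struct DrawItem {
    float depth;  // the simple float key used for the comparison
    int id;
};

// Sort items in place by ascending depth. std::sort needs random-access
// iterators, which std::vector provides and std::list does not.
void SortByDepth(std::vector<DrawItem>& items)
{
    auto byDepth = [](const DrawItem& a, const DrawItem& b) {
        return a.depth < b.depth;
    };
    std::sort(items.begin(), items.end(), byDepth);
    assert(std::is_sorted(items.begin(), items.end(), byDepth));  // debug-only sanity check
}
```

Whether this actually wins for a given list is something only the profiler can say; it is a sketch of the shape of the change, not a measured result.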

f3

Looks pretty clear that my post processing is slow, but as I recall, that’s actually where most of the real drawing takes place… Some digging shows me that the main culprits are my lighting compositing and my lens flare streaks. This is the dreaded code where I read back from the rendertarget, which I knew would be hellishly slow. I *could* do it every other frame, and ‘lag’ very slightly. I *could* reduce the amount of data I need to read back, but I suspect this isn’t the problem. That’s a tricky problem, whereas looking into the lightmap stuff I find a whole bunch of STL list iteration going on. I have mused before about using some fixed size (but big) arrays instead in this area… I also think I’m simply doing *too many* single sprite draw calls here, multiple ones for each fighter (don’t ask!). I’m pretty sure I can make some safe assumptions that compress those fighter layers into one, meaning an instant 50% fewer ship draw calls, so I’ll try that too… In fact some of them had THREE layers. Ouch. Right, that’s three changes so let’s go through the old sloooow profiling again. (Results not 100% the same as I’m not doing a scripted playback…)

Right then, first thing is that DrawFighting() is now taking 58ms vs 74ms, so already a nice phat boost. (is this microseconds? I think so, it doesn’t matter :D). ProcessFrame takes 10.61 vs 46.19. Oh yeah! The new frame processing looks like this:

f4

Weirdly my reduced layers have made no difference. Previously the ‘drawunlit’ function took 4.48ms, the new version takes 4.53. I’m now making 360 draw calls per frame, previously it was 442, but that’s had zero impact. Maybe the draw call count is harmless at this point? Interesting. (I make many other draw calls per frame, this is just a certain function). The 3DManager sort time went from 10.5ms to 1.628, so a massive win. So far I am at 2 victories, 1 damp squib. I can see some asteroid related slowdowns, but they aren’t in all maps and I’m looking for broad wins here… I just spotted 484 calls per frame to ship::IsOnScreen(). This is possibly an inefficient function, as it cycles through layers and checks for each layer being onscreen, without any ‘quick win’ bounding box clauses… I’ll add a fast sanity check: ‘if ship center offscreen by twice our hull size, then quit’. That *must* be faster… I can check this fast without a whole long profiling battle so…

Cool, that function is now 10x faster. Yay! 3 out of 4.
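A minimal sketch of that kind of early-out test, assuming screen-space coordinates with the origin at the top-left. The `Screen` struct and `DefinitelyOffScreen` name are made up for this example and are not the actual ship::IsOnScreen() code.

```cpp
#include <cassert>

// Hypothetical screen description, in pixels.
struct Screen {
    float width;
    float height;
};

// Cheap reject: returns true when the ship centre is more than twice the hull
// size outside the screen rectangle on either axis. Only when this returns
// false would the slower per-layer onscreen checks need to run.
bool DefinitelyOffScreen(float centerX, float centerY, float hullSize, const Screen& screen)
{
    assert(hullSize >= 0.0f);
    const float margin = 2.0f * hullSize;
    return centerX < -margin || centerX > screen.width + margin ||
           centerY < -margin || centerY > screen.height + margin;
}
```

The margin is deliberately generous, so a false result only means 'maybe visible' and the exact per-layer test still decides.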

In my browsing I now spot this beast:

f5

What the hell? I was only flying along, there were no distortion waves, let alone roughly 1,000 new ones per frame. This sounds like a complete balls-up! Actually this looks like the profiler getting confused, possibly by some release build optimisation. It does draw my attention to a list I fill each frame… There are 3 slowdowns in this: creating a new object for the list (tiny struct), the function call to add it, and the actual list push_back. This is all slow. It looks like I am already caching the objects, but clearly it’s not enough. Luckily AQTime lets me profile line-by-line…

c1

Hmm, so as suspected lists just suck for this purpose. I need to sort out my caching (I actually suspect the max cached objects is just too low…) but I’m going to have to switch to vectors or ideally just an array for this stuff. Less convenient, but it clearly will boost performance.
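As a rough sketch of the vector approach: reserve once, then clear and refill every frame. `clear()` on a `std::vector` keeps the allocated capacity, so refilling up to that size does not allocate again. The `DistortionWave` struct, `FrameWaveList` class and capacity figure are assumptions for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tiny struct standing in for the per-frame list entries.
struct DistortionWave {
    float x;
    float y;
    float strength;
};

// A per-frame list that allocates once up front and is reused every frame.
class FrameWaveList {
public:
    explicit FrameWaveList(std::size_t expectedMax)
    {
        assert(expectedMax > 0);
        waves_.reserve(expectedMax);  // one allocation, done before the battle
    }

    // Empties the list but keeps the storage for the next frame.
    void BeginFrame() { waves_.clear(); }

    // Stores the struct by value; no per-entry heap node as with std::list.
    void Add(float x, float y, float strength) { waves_.push_back({x, y, strength}); }

    std::size_t Size() const { return waves_.size(); }
    std::size_t Capacity() const { return waves_.capacity(); }

private:
    std::vector<DistortionWave> waves_;
};
```

If a frame ever needs more entries than `expectedMax`, the vector still grows as normal, so the reserve figure is a performance hint and not a hard limit.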

Anyway… I’ll keep plugging away. I love this stuff. I fully expect the game frame rate to double by tomorrow. This is early days easy win stuff.

For a while Gratuitous Space Battles 2 had a separate ‘lighting’ layer for ships that was used to pull off a few effects. This meant a ship hull might come with a color layer, a normal map, a specular layer, an illumination layer and a hulk layer. And those ship graphics are 4x the size of before, so they are 32bit color 1024square dds files. Those get pretty big, and with at least 40-50 new ships, suddenly the game is getting awkward to throw builds about with on my crappy rural broadband…

Luckily, after a bit of chin-stroking and going to-and-fro with the ship artist, we now don’t need that specular layer at all. It was fairly redundant, it was effectively being used as a separate ‘foreground lighting’ layer. It turns out I don’t really need that, I can just re-use the color layer. It’s a bit of a complex engine because it does a normal-mapped 3d thing, but also has lights that can be controlled separately to the foreground (ambient) lighting and also the normal-mapped directional lighting. This allows me to have all kinds of cool effects such as ‘dark’ battles where you can only see lasers and lights from ships, and also to have bright nebulas with everything glowing.

Of course, the minute I changed the code, everything else stopped working. First shadows stopped rendering entirely, then they worked, but I lost control of illumination brightness for ship lights. A whole lot of head scratching and debugging later and I am back where I started, but now the specular layer is irrelevant. That simplifies and speeds up the toolchain too, which is a very welcome bonus.

I’m only a few fighter-hulls away from being able to put together some battles with all-new in-game graphics and taking some screenshots for real, although the planets are still placeholder, and the explosions and particle effects all need re-doing. Still, progress is progress.

Unrelated, I’m off to see the nice people at Valve tomorrow for a UK meeting with developers. Should be interesting.