Game Design, Programming and running a one-man games business…

CPU/GPU concurrency in video games

I’m no graphics programming expert, and not really any kind of programming expert, unless you want a strategy game coded with its own engine, in C++ for the Windows platform, in which case *cracks knuckles* I’m pretty experienced. (Actually I don’t know how to crack my knuckles).

What I do know is what to look out for when you are worried about performance. One of the things I learned was learned REALLY early on, when I made a game called Kombat Kars (probably in DirectX 5) and was working on particle systems. To make it clear just how many aeons ago this was, let’s take a look at an epic image of the rear box art (yes! retail!)

Kombat Kars (2001) Windows box cover art - MobyGames

Yup, it’s not the Frostbite engine.

Anyway, I was working on optimizing the drawing of vertex buffers full of particles, or asteroids or whatever, and I was depressed to discover after doing some cunning batching of my draw calls, that the performance went DOWN. Yup. Making the game more efficient in how few draw calls it made, made the game run SLOWER.

How can that be?

Actually super-easy, barely an inconvenience, but to understand why, you need to conceptually understand what’s going on in the box when you run a PC game under Windows. You basically have two CPUs. One of which is on the motherboard and is general purpose, the other is on the video card and specialized for processing vertexes and shaders and so on. It used to be 95% CPU work, and 5% GPU work. These days the GPU is often the most expensive, and powerful component in the box. On a lot of setups, the capabilities are fairly equal.

It’s that equality of power that can actually cause problems. The peak performance of the machine is when the CPU is 100% busy (all threads!) and AT THE SAME TIME the GPU is 100% busy (multiple streams at once etc…). This is almost impossible to achieve, but it’s possible to actually make things worse than they should be, when you get too obsessed with batching.

If you don’t care about performance you code like this:

PrepareAMesh();
RenderAMesh();
PrepareAMesh();
RenderAMesh();
PrepareA..

Then one day you read some articles saying the reason your frankly low-poly indie game runs at 20fps is that you have WAY too many draw calls. You read about batching, and your new code looks like this:

for(int n = 0; n < lots; n++)
    PrepareAMesh();
RenderAllThoseMeshes();
for(int n = 0; n < lots; n++)
    PrepareAMesh();
RenderAllThoseMeshes();

And all is good in the world, because suddenly you are not flushing the queues on the video card every nanosecond, and it’s doing what it likes to do, what it was BORN to do, which is to stream through a whole ton of data like a sieve and throw polygons at the screen fast! But hold on…things can go wrong…

for(int n=0; n < eleventybillion; n++)
  PrepareAMesh();
RenderTheWholeDarnedGame();

This can actually be a REALLY BAD IDEA. Why? Surely batches are good, right…? Well…to an extent. It really depends how you structure the code. It *might* be that during all those bazillion PrepareAMesh() calls, the GPU has run out of things to do. Maybe it hasn’t done ANYTHING yet this frame. It finished the last frame, and now it’s basically watching Netflix waiting to hear from you some day…

…and once the CPU calls the GPU to render all bazillion polygons, depending how you structure the code, the CPU may be doing nothing. Maybe this is the frame end, and the CPU has to sit on its ass waiting for a Flip() or present() call from the GPU to get back to it some time maybe next week after the rendering is finished, when it can start thinking about the next frame?

This is the CPU/GPU concurrency issue. You can be TOO BATCHY. You can inadvertently set things up so that the GPU is always waiting for the CPU and the CPU is always waiting for the GPU. This is BAD for performance.
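To make that concrete, here is a toy model of one frame, a minimal sketch rather than anything measured. It assumes the CPU prepares a batch, hands it to the GPU, and immediately starts on the next batch while the GPU renders. The `Costs` struct, the `SimulateFrame` name and every number fed into it are invented for illustration.

```cpp
#include <algorithm>
#include <cassert>

// Toy cost model. All values are in arbitrary time units and are made up.
struct Costs {
    double cpuPerMesh;  // CPU time to prepare one mesh
    double gpuPerMesh;  // GPU time to render one mesh
    double gpuPerDraw;  // fixed GPU overhead paid once per draw call
};

// Returns the time at which the GPU finishes the frame when meshes are
// prepared and submitted in batches of batchSize.
double SimulateFrame(int meshCount, int batchSize, const Costs& c) {
    assert(meshCount >= 0 && batchSize > 0);
    double cpuClock = 0.0;  // when the CPU finishes preparing the current batch
    double gpuClock = 0.0;  // when the GPU finishes everything submitted so far
    for (int done = 0; done < meshCount; done += batchSize) {
        const int n = std::min(batchSize, meshCount - done);
        cpuClock += n * c.cpuPerMesh;
        // The GPU can only start once the batch exists AND it is free.
        const double start = std::max(cpuClock, gpuClock);
        gpuClock = start + c.gpuPerDraw + n * c.gpuPerMesh;
    }
    return gpuClock;
}
```

With 10,000 meshes and costs of 1, 1 and 20, one mesh per draw call finishes at 210,001 units (drowning in per-draw overhead), one giant batch finishes at 20,020 (the GPU idles while the CPU prepares everything, then the CPU idles while the GPU renders), and batches of 500 finish at 10,900 because both chips stay busy.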

Luckily, free apps like VTune let you analyze this. FWIW Democracy 4 has no such problems, but to show you how it looks, here is the output of a very brief snippet of the VTune CPU/GPU concurrency analyzer:

You can see near the bottom how busy the GPU and CPU are. Luckily for me, they both keep pretty busy, even if I zoom in a lot to see the span of individual frames, but if your zoomed in CPU/GPU concurrency stuff shows big empty blocks within a frame, you have some optimizing to do.

The reason this catches out so many experienced coders is that it *sounds wrong*. Surely batching is good, right? It is… but you have to remember that if the GPU would otherwise be sat on its ass eating crisps, even doing a bunch of small inefficient batches of 50-100 verts each is MORE efficient than just letting it sit idle.

Think of the CPU/GPU as a team trying to do the dishes. The CPU is washing em, the GPU is drying them. Don’t let either of them stand idle.

How I address a tricky, user-interface layout / physics code challenge

Democracy 4 has one aspect of its GUI that still screams IMPERFECT at me, and needs fixing, but it’s at least 10x harder than it sounds. It’s the sizing and positioning of the icons on the main UI:

This looks like just a bunch of different sized circles on a screen. How is that tricky? Let me count the ways:

  • The icons have to be in specific zones, which are not rectangles, but could be any polygon. These change size and shape over time
  • There has to be consistency of size. A radius 10px icon in zone A has to represent the same value as a radius 10 icon in zone D
  • We could have ANY screen resolution or aspect ratio.
  • There could be ANY number of icons.
  • There is finite time, on perhaps a CPU-limited laptop to do all the calculations.
  • The icons cannot touch the center icon, or each other, or the zone boundaries.

Now it turns out…coding that is a real pain in the neck. The system that you *think* will solve it all, is to have each zone boundary and icon project a sort of repulsive force on to all the others, then step through an iterative system of applying forces and moving stuff around until an equilibrium is reached. Well kinda… but no. Look again:

Those icons with black arrows are the pain. They are kind of stuck next to corners and other icons. It’s the shortest distance between two icons ANYWHERE on the screen that limits the size of all of the others.

My initial solution to fix this, DID make it better, but its not good enough. What was it? Basically I do the physics-repulsing-forces thing as usual, then I ‘jiggle’ each icon in turn, randomly kicking it a few pixels in different directions, then checking to see if the overall separation of the icons got better or worse with each jiggle. I keep the jiggles that improved the situation.

Actually I improved that algorithm by first finding the icon that was the closest to the others and giving that 32 initial jiggles. This is the result, (less force is better, it means the separation is better)

Pre Jiggle. Total force:[0.39] Strongest: [0.14]
Post Jiggle. Total force:[0.03] Strongest: [0.03]
Pre Jiggle. Total force:[1.31] Strongest: [0.45]
Post Jiggle. Total force:[0.23] Strongest: [0.23]
Pre Jiggle. Total force:[1.67] Strongest: [0.32]
Post Jiggle. Total force:[0.19] Strongest: [0.19]
Pre Jiggle. Total force:[1.23] Strongest: [1.19]
Post Jiggle. Total force:[0.06] Strongest: [0.06]
Pre Jiggle. Total force:[1.14] Strongest: [0.13]
Post Jiggle. Total force:[0.17] Strongest: [0.17]
Pre Jiggle. Total force:[2.17] Strongest: [0.68]
Post Jiggle. Total force:[0.31] Strongest: [0.31]
Pre Jiggle. Total force:[0.53] Strongest: [0.40]
Post Jiggle. Total force:[0.26] Strongest: [0.26]

Which is what brings me to the current state of affairs. However, this is just one of those tasks that humans excel at and machines suck at. I bet you can see locations where the icons should be shuffled really easily.
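The basic jiggle pass can be sketched roughly like this. It is a minimal sketch under loud assumptions: the zone is a plain rectangle rather than an arbitrary polygon, the "force" is a made-up inverse-square repulsion between every pair of icons, and the names `Icon`, `TotalForce` and `JigglePass` are hypothetical. It also leaves out the "start with the worst icon" refinement.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

struct Icon { double x, y; };

// Hypothetical badness measure: every pair of icons repels with 1/distance^2,
// so a lower total means better separation.
double TotalForce(const std::vector<Icon>& icons) {
    double total = 0.0;
    for (std::size_t a = 0; a < icons.size(); ++a) {
        for (std::size_t b = a + 1; b < icons.size(); ++b) {
            const double dx = icons[a].x - icons[b].x;
            const double dy = icons[a].y - icons[b].y;
            total += 1.0 / (dx * dx + dy * dy + 1e-9);  // epsilon avoids divide by zero
        }
    }
    return total;
}

// Kicks each icon `tries` times by up to maxKick pixels per axis and keeps
// only the kicks that lower the total force. Returns the final total force.
double JigglePass(std::vector<Icon>& icons, double width, double height,
                  int tries, double maxKick, std::mt19937& rng) {
    assert(tries >= 0 && maxKick > 0.0);
    std::uniform_real_distribution<double> kick(-maxKick, maxKick);
    double best = TotalForce(icons);
    for (Icon& icon : icons) {
        for (int t = 0; t < tries; ++t) {
            const Icon saved = icon;
            icon.x += kick(rng);
            icon.y += kick(rng);
            const bool inside = icon.x >= 0.0 && icon.x <= width &&
                                icon.y >= 0.0 && icon.y <= height;
            const double force = inside ? TotalForce(icons) : best;
            if (inside && force < best) {
                best = force;  // improvement: keep the kick
            } else {
                icon = saved;  // worse, or outside the zone: undo it
            }
        }
    }
    return best;
}
```

Because a kick is only kept when it lowers the total, the force can never go up, which is also exactly why icons get trapped: the pass never accepts a move that makes things temporarily worse.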

Something I intend to experiment with is a more focused approach to the jiggling! I can tell now which icons (in this case two of them) seem ‘trapped’ and are causing the biggest problems, so rather than just going through all of the point containers and jiggling everything, I should now focus my attention just on the containers with the problem.

Maybe I can even focus my attention just on the half-dozen icons per container that are closest to the problematic ones, but that may not actually be the solution. Check out this section in more detail:

This row of icons has basically got trapped. Moving any of them upwards and to the right is going to be tricky, unless the icons *above* them can get out of the way. The trouble is, we need some super-clever algorithm that could move an icon above them higher (thus making *that* icon worse off…) so that later we can jiggle these others… and everyone will be better off.

It might be that the initial force algorithm is too linear. A linear force would not concentrate minds enough on the dire plight of the icon in the bottom left. If the squeeze here was not seen as just *slightly* worse than the squeeze on others, but exponentially worse… that might fix it.

…and also this icon is receiving some strong forces from its location in the corner. Unlike other icons, boundaries cannot move, so maybe we should prioritize their plight more? Surely no coincidence that both worst-case icons are in corners?
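The linear-versus-exponential idea can be sketched like this. It is purely illustrative: the `comfortable` gap, the `sharpness` constant and both function names are assumptions, not anything from the game’s code.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// gap: distance in pixels from an icon to its nearest neighbour or boundary.
// comfortable: the gap at which no force is applied at all (assumed value).
// Returns 0 when the icon has enough room, rising to 1 when the gap is zero.
double LinearSqueeze(double gap, double comfortable) {
    assert(comfortable > 0.0);
    return std::max(0.0, comfortable - gap) / comfortable;
}

// Same input, but the penalty grows exponentially with the squeeze, so the
// most cramped icon dominates the total instead of being slightly worse.
double ExponentialSqueeze(double gap, double comfortable, double sharpness) {
    return std::exp(sharpness * LinearSqueeze(gap, comfortable)) - 1.0;
}
```

With a comfortable gap of 10 pixels, an icon with a 2 pixel gap is about 1.3 times worse off than one with a 4 pixel gap under the linear measure, but about 2.8 times worse off under the exponential one with a sharpness of 5.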

This all strikes me as something that would be easy if I’d learned more physics and maths, but hey… working it out alone from basic principles is kinda fun. I have plenty of ideas to tweak my algorithm.

The war on needless draw calls with GPU profilers

Democracy 4 (my latest game) is quite evidently a 2D game. There is an assumption by players and developers alike that 2D games always run fine, that even the oldest, crappiest PC will run them, and that you only need to worry about optimisation when you do some hardcore 3D game with a lot of shaders and so forth.

Nothing could be further from the truth! There are a TON of 2D games that run like ass on any laptop not designed for gaming. I think there are a bunch of factors contributing to this phenomenon:

  • The growth in sandbox building/strategy games with huge sprawl and map sizes that can result in a lot of objects onscreen
  • An emphasis on mobile-first computing has resulted in a lot of low-power low-spec but affordable CPUs/GPUs in laptops
  • An emphasis on ‘retina’ display aspiration has meant high screen resolutions being paired with low-spec GPUs
  • A lot of game developers starting with Unity, and never being exposed to decent profilers, or ground-up programming, without understanding how video cards work

So my contention is that there is a bunch of people out there with mom or dad’s hand-me-down laptop that was bought for surfing the internet and using Word, and is now being pressed into service reluctantly as a games machine. These laptops aren’t about to run Battlefield V any time soon, but they can run an optimized 2D game very well indeed… IF the developer pays attention.

It’s also worth noting that markets like Africa/India/China are definitely good targets for future market growth, and a lot of these new gamers are not necessarily going to be rocking up to your game with an RTX 2070 video card…

I recently did a lot of optimizations on Democracy 4 and I’d like to talk through the basic principles and the difference it can make. Specifically how to more than quintuple the frame rate (from 19 fps to 109 fps) on a single screen. The laptop spec in question: Intel Core i7-6500U @ 2.50GHz CPU, Intel HD Graphics 520 GPU, screen res 1920×1080. Here is the screen in question:

This is the ‘new policy’ screen in Democracy 4. TBH this screen looks a lot more complex to render than it actually is, because of some trickery I’d already done to get that lofty 19 fps. That background image that is all washed out and greyscaled is not actually rendered as a bunch of tiny icons and text. It’s drawn once, just as we transition to this screen, and saved in an offscreen buffer, so it can be blapped in one go as the non-interactive background to this screen. I use a shader to greyscale and lighten the original image so this is all done in a single call.

So here we are at 19 fps, but why? To answer that I needed to use the excellent (and free) Intel GPA performance tools for Intel GPUs. It’s easy to install and use, and to capture a trace of a specific frame. With the settings I chose, I end up with a screen like this:

The top strip shows time in this frame, and how long each draw call took (X axis) and how many pixels were affected (Y axis). A total of 158 draw calls happened in this frame, and that’s why this puny GPU could only manage things at 19 fps.

It’s worth setting up the frame analyzer so that you have GPU elapsed time on the X axis, otherwise all draw calls look as bad as each other, whereas in fact you can see that the first bunch of them took AGES. These are the ones where I draw that big full-screen background image, and where I fill the current dialog with the big white and blue background regions. It seems like even fill-rate can be limiting on this card.

By stepping through each draw call I can see the real problem. I am making a lot of really tiny changes in tiny draw calls rendering hardly anything. Here is draw call 77:

This draw call is basically just the text in that strip in the middle of the screen (highlighted in purple). Now this is not as inefficient as a lot of early-years gamedev methods, because the text has actually been pre-rendered and cached, so that entire string is rendered in a single strip, not character-by-character, but even so it’s a single draw call, as is each icon, each little circular indicator, and each horizontal line under each strip.

So one of these strips is actually requiring 4 draw calls. And we have 18 of them on the screen. So 72 draw calls. Can we fix this?

Actually yes, very easily. I actually do have code in the game that should handle this automatically, by batching various calls, but it can become problematic as limits in buffers can be reached, and textures can be swapped, requiring an early ‘flush’ of such batches, which if they happen in the wrong order, can mean missing text or icons. As such, the simplest and easiest approach is to change the code that roughly goes like this:

for each strip
  Draw horizontal line
  Draw Icon
  Draw little circle icon
  Draw Text

…to something more like this…

for each strip
  Batch horizontal line
Draw all lines
for each strip
  Batch icon
  Batch little circle icon
Draw all icons
for each strip
  Batch text
Draw all text

Which on the face of it looks like more hassle. Isn’t it bad to iterate through the list of strips 3 times instead of 1? Well kinda…but I’m GPU limited here. The CPU has time to do that stuff, and actually there is likely NO TIME LOST at all, because of CPU/GPU concurrency. Something people forget is you are coding 2 chips at once, the CPU and the GPU. It’s very common for one to just be waiting for the other. While the GPU is doing ‘Draw All Lines’ my CPU can keep busy by building up the next batched draw call. Concurrency is great!
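To put numbers on the two loop structures, here is a minimal sketch with a stand-in renderer that does nothing except count submissions. `CountingRenderer` and both function names are hypothetical, not the game’s real API, and batching the icon and the little circle together assumes they can share one texture.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-in renderer: queues quads and counts how many draw calls are issued.
struct CountingRenderer {
    int drawCalls = 0;
    std::vector<std::string> batch;  // quads queued for the next draw call

    void Batch(const std::string& quad) { batch.push_back(quad); }

    // Submits everything queued so far as ONE draw call.
    void Draw() {
        if (batch.empty()) return;
        ++drawCalls;
        batch.clear();
    }
};

// Original structure: every element of every strip is its own draw call.
int DrawStripsNaive(int stripCount) {
    assert(stripCount >= 0);
    CountingRenderer r;
    for (int i = 0; i < stripCount; ++i) {
        r.Batch("line");   r.Draw();
        r.Batch("icon");   r.Draw();
        r.Batch("circle"); r.Draw();
        r.Batch("text");   r.Draw();
    }
    return r.drawCalls;
}

// Restructured: three passes over the strips, one draw call per pass.
int DrawStripsBatched(int stripCount) {
    assert(stripCount >= 0);
    CountingRenderer r;
    for (int i = 0; i < stripCount; ++i) r.Batch("line");
    r.Draw();
    for (int i = 0; i < stripCount; ++i) { r.Batch("icon"); r.Batch("circle"); }
    r.Draw();
    for (int i = 0; i < stripCount; ++i) r.Batch("text");
    r.Draw();
    return r.drawCalls;
}
```

For the 18 strips on this screen, the first version issues 72 draw calls and the second issues 3, at the cost of walking the strip list three times on the CPU.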

I guess we need to take a moment here to ask WHY does this work? Basically video cards are designed to do things in parallel, and at scale. A GPU has a LOT of settings for each time it does anything: different render states, different blend modes, all the various fancy shader stuff, culling, z-buffer and render target settings etc. Think of it as a MASSIVE checklist of settings that the GPU has to consider before it does ANYTHING.

When the GPU has finished its checklist, it can then pump a bunch of data through its system. Ideally everything has been set up with huge great buckets of vertices and texels, and they can all blast through the GPU like a firehose of rendered pixels. GPUs are happiest when they are rendering a huge ton of things, all with the same settings. The last thing it wants is to have to turn off the firehose, and go through the darned checklist again…

But with a new draw call, that’s pretty much what you do. Unless you are doing fancy multi-texturing, every time you switch from rendering from texture A to texture B, or change the render states in some other way, then you are ‘stalling’ the video card and asking it to reset. Now a GPU is super-quick, but if you do this 150 times per frame… you need to have a GPU that can handle this 9,000 times a second (150 × 60) to get the coveted 60fps. That’s without the actual rendering, flipping or anything else…
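As a back-of-envelope check on 150 draw calls per frame at 60fps, here is the arithmetic as a tiny sketch. The function names are made up, and the per-call budget assumes, unrealistically, that the whole frame is spent on nothing but draw call overhead.

```cpp
#include <cassert>

// How many draw calls (and so potential state resets) the GPU sees per second.
constexpr int DrawCallsPerSecond(int drawCallsPerFrame, int framesPerSecond) {
    return drawCallsPerFrame * framesPerSecond;
}

// Average time available per draw call, in microseconds, if the frame were
// spent on nothing else.
constexpr double MicrosecondsPerDrawCall(int drawCallsPerFrame, int framesPerSecond) {
    return 1000000.0 / DrawCallsPerSecond(drawCallsPerFrame, framesPerSecond);
}

static_assert(DrawCallsPerSecond(150, 60) == 9000, "150 calls x 60 frames");
```

That leaves roughly 111 microseconds per draw call, before any actual pixels get rendered.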

So the less resetting and stalling of the card the better, which means BATCH BABY BATCH!

Now with the re-jig and those strips being done with the new algorithm (plus a few other tweaks), the GPA analysis looks like this:

Let me hear you shout ‘109 FPS with only 52 draw calls!’. It’s also worth noting that this means less processing in general, and thus less power-draw, so laptop battery life will be longer for games that have fewer draw calls per frame. It’s a massive win.

There is actually a lot more I can do here. Frankly until I get to that tooltip right at the end, ALL of the text on this screen can be done in a single draw call near the end. The same is true of a bunch of other icons. Essentially all I have done here is optimize the left hand side of the screen, and there is scope to go further (and when I have time I will).

I thoroughly recommend you keep (or get) an old laptop with a low spec video card, and slap the Intel GPA tools on there. Other such suites of profilers exist for GPUs, the AMD one is pretty good, and Nvidia have Nsight as well. It all depends on your GPU make obviously.

I do wish more people were aware of the basics of stuff like this. It’s hugely rewarding in terms of making your game more playable. BTW these tools work on the final .exe, you don’t need a debug build, and the engine is irrelevant, so yes, you can use them on your Unity games no problem.

Hope this was interesting :D

Pak files

I just added pak file support to Democracy 4. It’s something I already had coded for an earlier game, but I had to do a bit of fussing to get it to work properly with Democracy 4.

Pak files are basically big phat files that contain other files inside them. If you are a new developer, you probably have no idea they exist, because you probably use Unity and AFAIK they handle it for you. Pak files are pretty old school, as I recall Doom and Quake used them (Doom called them WAD files), anyway the principle is pretty simple:

A pak file contains two sections: an index that tells you where all the other files are inside the main section, and a big phat list of data that is the contents of those files. All of this gets stuck in a single flat blob of binary data. A class exists that lets you grab the memory address of the file you want if you pass in the name of the file, so hopefully to anybody who didn’t write the pak file code, using it is easy. You can read the contents of a file just like it’s on disk, except you have to use functions that read from memory, not ones that explicitly read disk files.
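The index-plus-blob layout can be sketched as below. This is a minimal in-memory illustration, not the game’s actual PakFile class: the `PakEntry` fields, the `Add` and `Find` methods and the use of `std::unordered_map` are all assumptions, and a real pak would be built offline and loaded from disk rather than assembled at runtime.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// One index record: where a file's bytes start inside the blob, and how many.
struct PakEntry {
    std::size_t startOffset;
    std::size_t size;
};

class PakFile {
public:
    // Appends a file's bytes to the blob and records where they live.
    void Add(const std::string& name, const std::vector<std::uint8_t>& bytes) {
        index_[name] = PakEntry{blob_.size(), bytes.size()};
        blob_.insert(blob_.end(), bytes.begin(), bytes.end());
    }

    // Returns the index entry for name, or nullptr if the pak lacks that file.
    const PakEntry* Find(const std::string& name) const {
        const auto it = index_.find(name);
        return it == index_.end() ? nullptr : &it->second;
    }

    // Address of the byte at `offset` inside the in-memory blob.
    // Only valid until the next call to Add(), which may reallocate the blob.
    const std::uint8_t* GetData(std::size_t offset) const {
        assert(offset <= blob_.size());
        return blob_.data() + offset;
    }

private:
    std::unordered_map<std::string, PakEntry> index_;  // the index section
    std::vector<std::uint8_t> blob_;                   // the data section
};
```

A caller looks up the entry by name, then copies `size` bytes starting at `GetData(startOffset)`, which is a plain memory copy with no file system involved.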

In my case, that meant stepping into the engine code, and the OpenGL stuff that reads in graphics files (we are only using a pak right now for dds and png files), and just changing the contents of one function. Instead of using this code to load the data ready for creating a dds:

fread(filedata, filesize, 1, fp);

I now use this

memcpy(filedata, GetPakFiles()->GetData(pentry->StartOffset), filesize);

Big deal :D. Similar changes happen for pngs. To keep things super-simple, the old code for loading a file will run if the pak file reports it can’t find that file, so you can stick a ‘loose’ png or dds file into the bitmaps folder structure and it will still get found, and the rest of the code doesn’t even know the difference, which is perfect for mod support.
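That fallback behaviour might look something like this. It is a hedged sketch: the `Pak` alias is just a map standing in for a real pak file, and `LoadAsset` is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <iterator>
#include <string>
#include <unordered_map>
#include <vector>

using Bytes = std::vector<std::uint8_t>;
using Pak = std::unordered_map<std::string, Bytes>;  // stand-in for a real pak

// Loads `name` from the pak if it is in there, otherwise falls back to a
// loose file on disk. Returns an empty buffer if neither exists.
Bytes LoadAsset(const Pak& pak, const std::string& name) {
    assert(!name.empty());
    const auto it = pak.find(name);
    if (it != pak.end()) return it->second;        // fast path: already in RAM

    std::ifstream file(name, std::ios::binary);    // old path: loose file
    if (!file) return {};
    return Bytes(std::istreambuf_iterator<char>(file),
                 std::istreambuf_iterator<char>());
}
```

The caller gets a buffer of bytes either way, so nothing downstream needs to know where the file came from.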

Most of the hassle in getting this to work was just writing some code to enumerate (that means list really…) the contents of a folder from within the pak file. I had to support this for stuff like minister profile pics, because the code previously would ask ‘what files are in this folder, I need to know so I can select a random one’, and now that code gets handled by the index at the start of the pak file instead.
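Listing a folder from the index could be sketched like this, assuming the index is kept sorted by full path. `PakIndex`, `ListFolder` and the example paths are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Path -> offset into the blob. std::map keeps keys sorted, so all files
// under one folder sit next to each other.
using PakIndex = std::map<std::string, std::size_t>;

// Returns the files directly inside `folder` (which must end in '/'),
// skipping anything in deeper subfolders.
std::vector<std::string> ListFolder(const PakIndex& index, const std::string& folder) {
    assert(!folder.empty() && folder.back() == '/');
    std::vector<std::string> names;
    for (auto it = index.lower_bound(folder); it != index.end(); ++it) {
        const std::string& path = it->first;
        if (path.compare(0, folder.size(), folder) != 0) break;  // left the folder
        if (path.find('/', folder.size()) != std::string::npos) continue;  // subfolder
        names.push_back(path);
    }
    return names;
}
```

Picking a random minister picture is then just a random index into the returned list.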

So why bother with this?

Basically speed. Do you know what the read speed of your hard drive is? Checking a random new one on amazon.com shows 6Gb/s. I assume that’s bits not bytes, so that’s 750MB per second. My hard drive is a little bit older, so let’s say 500MB/second. I’ll just copy a big chunk of the source and obj files from D3 to another disk, brb…

Ok…copy speed is between a low of 1MB/second up to 100MB/second. Wow. That’s so much slower. Why?

A HUGE amount of bullshit happens on a PC when you access files. To simplify it, it goes something like this: *deep breath* You ask to read in a file. The OS then looks up the file table to check that file exists. It then asks the security system if the current user has permission to access that file. When it gets a yes, it then opens that file, and sets attributes so other processes will know that file is in use. The antivirus software then kicks in, and hooks into the file read so that it can check to see if that file is excluded from scans or not, and gets ready to analyze its contents. The OS then has to use the file table to work out where all the various scattered chunks of that file are, and start reading in each block. This means talking to the driver, and ultimately to the hardware, which may also have to check its cache to see what blocks have been cached and whether or not it has to start the glacial process of spinning an old physical drive or not (faster with SSDs obviously).

THEN! When the file read is complete, we can close that file again, and notify the system that it’s not in use by our process any more. We can then start the process of opening the next file in our list…

You do that bullshit for EVERY DARNED FILE. But the good news is… if you have a big phat pak file…you do it once. Just once. The rest is free.

So Democracy 4 goes the extra mile, because our pak file is small (only a few hundred MB). We don’t just open the file on startup, we stream all 200MB into RAM. That should take way under a second. We then have the entire file system of dds and png files in memory already, and able to be loaded almost-instantly into our engine. (RAM->VRAM is mega fast)
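At the 500MB/second guessed at earlier, 200MB works out at roughly 0.4 seconds. Pulling a whole file into RAM with one open and one read can be sketched as follows; `ReadWholeFile` is a hypothetical helper using only standard C++ streams, not the game’s loader.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Reads an entire file into memory with a single open and a single read.
// Returns an empty vector if the file is missing, empty or unreadable.
std::vector<char> ReadWholeFile(const std::string& path) {
    assert(!path.empty());
    // Opening with `ate` puts the read position at the end, which gives the size.
    std::ifstream file(path, std::ios::binary | std::ios::ate);
    if (!file) return {};
    const std::streamsize size = file.tellg();
    if (size <= 0) return {};

    std::vector<char> data(static_cast<std::size_t>(size));
    file.seekg(0, std::ios::beg);
    if (!file.read(data.data(), size)) return {};
    return data;
}
```

All the per-file open, permission check and antivirus overhead is paid once for the pak, and every later "file read" is just a lookup into this buffer.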

99% of players will not notice the speed difference. But if you have especially shit anti-virus running on a laptop, in low-battery mode, with an extremely fragmented hard drive, running Democracy 4 on a train, you will be glad I bothered. It’s really easy to code. My PakFile code has 263 lines in it. Many of them are whitespace or comments.

Democracy 4: The fixed income rewrite

About a week ago I had this mad idea that it would be cool to plot every single Democracy 4 voter’s wealth on a graph, so that you could see where they were clustering in a nice easy-to-understand way. Within an hour it worked, within a day, a new rewritten version that looked much nicer with blue dots on it was done, and I was tweeting, and people were saying ‘yay’, and then everything went fucking mad.

By hovering the mouse over one of those blue dots, you could see a breakdown of how that person’s income was affected by every government policy or situation, indirectly, through their membership of voter groups. This data already existed in Democracy 3, it was easy, it was just GUI code, and done really quick… and that showed me what an absolute mess the simulation was…

Democracy 1, 2, 3 and 4 are all coded as a homebrew neural network. Every neuron has a value either capped from 0->1 or -1->+1. Everything in the game is a neuron, a voter, a voter group, a policy, a minister, an event…everything. Voters also have an ‘income’ neuron which tracks how much money those voters have. So…in super-simplistic terms if you want to know why Bob’s income is 0.78, you look at the 21 weighted inputs from all the voter group income neurons, and you see all the +0.2, -0.1, +0.32 etc, that adds up to 0.78. If you want to go one stage further up the hierarchy you can track the origins of those effects to policies etc.
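In sketch form, a capped neuron fed by weighted inputs might look like this. This is only a guess at the shape of the idea: the `Input` struct and `NeuronValue` function are hypothetical, and the game’s real neurons no doubt carry more than this.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// One incoming connection: the value of the source neuron and its weight.
struct Input {
    double sourceValue;
    double weight;
};

// Sums the weighted inputs and caps the result to [lo, hi], for example
// 0->1 for incomes or -1->+1 for signed neurons.
double NeuronValue(const std::vector<Input>& inputs, double lo, double hi) {
    assert(lo <= hi);
    double sum = 0.0;
    for (const Input& in : inputs) sum += in.sourceValue * in.weight;
    return std::min(hi, std::max(lo, sum));
}
```

Feeding in contributions of +0.2, -0.1, +0.32 and an invented +0.36 gives the 0.78 from the example, and anything that sums past 1 simply gets capped.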

That’s worked for 3 games perfectly. But it’s a crap system.

The trouble is, people’s incomes are on a 0->1 scale. And all effects are percentages. So for example if free bus passes give retired people a 0.05 income boost, that increases the income of all retired people by 5%. Fine?

NO

Because working class ex-street sweeper Mavis just got a bus pass worth $500, but retired hedge fund manager Boris just got a bus pass worth $15,000. WTF? Why does Democracy hate poor people? The problem is that we have only ever been able to use effects to apply percentages to incomes. That means EVERY benefit, or tax, or effect is proportional to your income. That means Lamborghini drivers pay more in car tax than Skoda owners (maybe intentional), but it also means the state pension depends on how wealthy you already are.

It’s fixed. It was hard.

Basically I have had to code an entirely new shadow system of fixed-income neurons that can cope with values beyond 1, and then (this is the hard bit) written code that stitches it all back together internally so that we can still use the same mechanisms to move people between middle-income and poor etc, and still display everything in the UI as though nothing has changed.

This was hellish, because it also means restitching together lists of totally different UI items on the fly with different calculation methods to come up with a result that looks the same as it used to. It’s taken a week of fixing edge cases, checking, altering UI, and writing lots and lots and lots of code which mostly will go underappreciated :D.

But thankfully it now works, and it means we can have effects in the game which are +10% income of retired people, and also effects that are +$10,000 income of retired people, as the designer or modder sees fit. This means helicopter money can actually be fixed for everyone, free school meals no longer serve foie gras to rich students, and so on.
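The difference between the two kinds of effect can be sketched like this. It is a deliberately simplified illustration that works directly in dollars and ignores the 0->1 neuron scale and the shadow-neuron stitching. `EffectKind`, `IncomeEffect` and `IncomeChange` are hypothetical names, and the $10,000 and $300,000 incomes are simply what a 5% effect implies for a $500 and a $15,000 bus pass.

```cpp
#include <cassert>

enum class EffectKind { Percent, Fixed };

struct IncomeEffect {
    EffectKind kind;
    double value;  // fraction of income for Percent, dollars for Fixed
};

// How many dollars an effect adds to (or removes from) a given income.
double IncomeChange(double income, const IncomeEffect& effect) {
    assert(income >= 0.0);
    return effect.kind == EffectKind::Percent ? income * effect.value
                                              : effect.value;
}
```

A Percent effect of 0.05 hands Mavis (on $10,000) about $500 and Boris (on $300,000) about $15,000, while a Fixed effect of 500 gives both of them the same $500 bus pass.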

You won’t notice it immediately but it’s a massive improvement in the underlying simulation code in the game. It stressed me and tired me out so much to do it that I’m forcing myself not to code today so I can recover.