Am I the only person doing this? Probably. I often am. Most people have moved on from DX9 (I know it so well that there is a big opportunity cost to updating), or they use OpenGL, and very few people are doing 2D games where performance is an issue. I am taking early steps with Gratuitous Space Battles 2, and my aim is to have it run at 60 FPS on average hardware with two 1920×1080 monitors. I also intend to get it running OK on bigger setups too. That’s a lot of pixels, and due to all sorts of fanciness I’m adding to GSB 2.0, it means a lot of processing. A REAL lot.

So… multithreading! It’s about time I ventured forth. To date, my only multithreading efforts have been the async server communication in GSB 1.0 for challenge uploads etc., and the loading screens for GTB and Democracy 3. Actual mid-game multithreading has scared me until now.

I hate middleware, so I’m not using any libraries, just raw calls to CreateThread, TerminateThread and so on… This might make it more complex, but it means I have complete control over everything. My first experiments were not exactly encouraging. I attempted to speed up the position calculations of asteroids. To cut a long story short, I use D3DTLVERTEX-style vertices (not hardware transform and lighting), for good reasons I won’t bore you with. The upshot is, I have a lot of non-DirectX transform work to do for anything drawn on the screen.
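To illustrate the kind of CPU-side work this implies, here is a minimal sketch — not the actual GSB code; the struct fields and camera model are assumptions — of a D3DTLVERTEX-style pre-transformed vertex, where the CPU computes final screen coordinates before anything is handed to Direct3D:

```cpp
#include <cassert>

// Sketch of a D3DTLVERTEX-style pre-transformed vertex: the CPU does the
// world-to-screen transform, so the vertex handed to Direct3D already
// carries final screen coordinates plus rhw. Field names are assumptions.
struct TLVertex
{
    float x, y, z, rhw;   // screen-space position, already transformed
    unsigned long colour; // diffuse colour
    float u, v;           // texture coordinates
};

// Hypothetical CPU-side transform: apply a 2D camera (offset + zoom).
// This is the sort of per-vertex work that piles up for everything drawn.
TLVertex TransformToScreen(float worldX, float worldY,
                           float camX, float camY, float zoom)
{
    TLVertex out = {};
    out.x = (worldX - camX) * zoom;
    out.y = (worldY - camY) * zoom;
    out.z = 0.5f;   // arbitrary fixed depth for a 2D sprite
    out.rhw = 1.0f; // marks the vertex as pre-transformed
    out.colour = 0xFFFFFFFF;
    out.u = 0.0f;
    out.v = 0.0f;
    return out;
}
```

Multiply that by every sprite corner on two 1920×1080 monitors per frame and the appeal of spreading it over several cores is obvious.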

An ideal case for multithreading!


So I wrote code to split the asteroids into 8 chunks (my test case being an 8-core chip), and gave each core a list of asteroids to process. Result? SLOWER. Actually quite a bit slower. Some fiddling with AQTime (my profiler) let me analyze cache misses for each thread, and I also profiled it as a single thread. The cache-miss rate went through the roof. Basically, my transform code was relying on some global camera data, and I suspect that either:

a) Referencing the camera data was a bottleneck, with the threads blocking each other to get at it, or…

b) The memory locations of the asteroid transform data were laid out in such a way that the different threads kept fighting over the same cache lines and generally getting in each other’s way.
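If (b) is the culprit, the usual fix is to pad or align each thread’s hot data to a cache-line boundary, so no two threads ever write to the same line. A minimal sketch of the idea, using std::thread and a stand-in workload rather than the actual CreateThread-based asteroid code, and assuming C++17 for over-aligned allocation:

```cpp
#include <thread>
#include <vector>
#include <cassert>

// Each thread gets its own accumulator. alignas(64) pads each slot out to
// a full cache line (64 bytes on x86), so two threads never write to the
// same line: the classic fix for false sharing. The summing loop is just a
// stand-in for real per-thread work like an asteroid transform.
struct alignas(64) PerThreadSum
{
    float value = 0.0f;
};

float ParallelSum(int numThreads, int itemsPerThread)
{
    std::vector<PerThreadSum> sums(numThreads);
    std::vector<std::thread> workers;
    for (int t = 0; t < numThreads; ++t)
    {
        workers.emplace_back([&sums, t, itemsPerThread]
        {
            // The hot loop writes only this thread's padded slot.
            for (int i = 0; i < itemsPerThread; ++i)
                sums[t].value += 1.0f;
        });
    }
    for (auto& w : workers)
        w.join();

    float total = 0.0f;
    for (const auto& s : sums)
        total += s.value;
    return total;
}
```

Without the alignas, adjacent slots can share one 64-byte line, and every write by one thread invalidates the line in the other cores’ caches, which matches the “cache-miss rate went through the roof” symptom.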

I spent a lot of time reading and fiddling and decided that it wasn’t working (although I did manage a few decent speedups in other ways). I then decided that if lots of threads sharing the same job wasn’t going to help, maybe lots of threads doing different (unrelated) jobs would…?

And this is more of a success. I have a function called ProcessFrame() which does a lot of non-DirectX work, such as the aforementioned asteroid transforming, updating engine glows, updating explosion plumes, particle effects and distortion waves, blah blah… Until recently, it just did them one after the other. I then realized that although a lot of them accessed the same data (camera position stuff, mostly), none of them altered it, and the tasks were quite discrete. So I packaged them up, sent them to different threads, and spun in the main thread waiting for them to finish. Result? 21% faster. Yay? Not bad, but not the 800% faster which would have been theoretically doable (not really, but…)

Of course, the missing link was that I was then left waiting for the slowest thread. Plus, if I have more than 8 tasks, I run out of CPUs. So I re-coded it to have a queue of tasks: when a thread finished a task it checked the queue, and only reported it was done when the queue was empty. This was far more efficient, and easier to scale to the available cores. Result? 41% faster!
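A minimal sketch of that scheme — using std::thread and std::function for brevity, where the post uses raw CreateThread, and with all names hypothetical. An atomic counter stands in for a locked queue: each worker claims the next unclaimed task and only stops when none remain, so a fast thread naturally soaks up more work than a slow one:

```cpp
#include <atomic>
#include <functional>
#include <thread>
#include <vector>

// A queue of independent tasks drained by a pool of worker threads.
// Workers pull tasks until the queue is empty, then the main thread's
// join() call corresponds to the "spin until all report done" step.
class TaskQueue
{
public:
    void Add(std::function<void()> task)
    {
        mTasks.push_back(std::move(task));
    }

    void RunAll(int numThreads)
    {
        mNext = 0;
        std::vector<std::thread> workers;
        for (int t = 0; t < numThreads; ++t)
        {
            workers.emplace_back([this]
            {
                for (;;)
                {
                    // Atomically claim the next unclaimed task index.
                    std::size_t i = mNext.fetch_add(1);
                    if (i >= mTasks.size())
                        return; // queue drained: this worker is done
                    mTasks[i]();
                }
            });
        }
        for (auto& w : workers)
            w.join(); // main thread waits for the whole pool
    }

private:
    std::vector<std::function<void()>> mTasks; // filled before RunAll
    std::atomic<std::size_t> mNext{0};
};
```

Because work is claimed task-by-task rather than pre-split into fixed chunks, the scheme load-balances itself and scales to however many cores are available.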

Now obviously a 41% processing speedup is good (although this is pre-render, not render, so probably only a 20% FPS boost), but I can’t help thinking that if not 800%, a 200% speedup of that bit of code must be possible. Debugging cache misses is hard, as even AQTime will occasionally bluescreen Windows 7 when profiling. I’m pretty sure it’s some cache false-sharing issue going on.

In the meantime, GSB is now faster as a result, even if I spend no more time attempting to multithread it (and I will… I’ve only just got going). Anyone else attempting this sort of thing?

 

8 Responses to “Thoughts on multi-threaded 2D game development in directx9”

  1. You might be seeing false sharing. Fabian Giesen has a good series of articles about optimizing a software renderer (with code on github), with the part on shared memory issues here: http://fgiesen.wordpress.com/2013/01/31/cores-dont-like-to-share/

    I can highly recommend using a good multithreaded profiler to help, with Intel’s VTune Amplifier XE being fantastic but pricey. I frequently use Intel’s free GPA platform analyzer with code markup, which helps with thread scheduling issues though not with things like false sharing.

    For your transform problems, you might actually find that SIMD parallelism on a single core gives you a better performance boost than multithreading, or a significant boost on top. For that you’d need to write the transform code in a more structure-of-arrays style (SOA vs. AOS), and fortunately the code Fabian discusses has a lot of that for very similar operations, so it’s worth reading the series and glancing at the code.
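    To make the SOA-vs-AOS point concrete, here is a small sketch (field names and the camera transform are assumptions, not the code the commenter references) of the same transform written over a structure-of-arrays layout, which a compiler can auto-vectorize far more easily than the interleaved AOS version:

```cpp
#include <vector>
#include <cstddef>

// AOS: one struct per vertex, x and y interleaved in memory.
struct VertexAOS { float x, y; };

// SOA: each field in its own contiguous array — exactly what SIMD loads
// want, e.g. 4 or 8 consecutive x-values in one register.
struct VerticesSOA
{
    std::vector<float> x, y;
};

// Hypothetical camera transform over the SOA layout. Consecutive loop
// iterations touch consecutive floats with no interleaved y's in the way,
// so the compiler can vectorize this loop with no manual intrinsics.
void TransformSOA(VerticesSOA& v, float camX, float camY, float zoom)
{
    for (std::size_t i = 0; i < v.x.size(); ++i)
    {
        v.x[i] = (v.x[i] - camX) * zoom;
        v.y[i] = (v.y[i] - camY) * zoom;
    }
}
```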

  2. Klaim - Joel Lamotte says:

    Hi Cliff, you are not alone in trying this kind of thing; I’m doing something similar too (for a 3D game, but it has characteristics similar to yours). The main difference is that I started from an empty project and built up from there, while you have an already-finished project that you want to optimize.
    Before starting my project I read “C++ Concurrency In Action” and played a lot with several libraries and test projects to try different architectures and understand how to organize concurrency as flows of tasks. I really wanted to understand the basics, learn what experts think are the best ways to work, and try to make sure my code doesn’t become impossible to follow.

    I think not trusting libraries is a sane reaction, but also quite an expensive one, and I really feel I can’t avoid libraries on this subject, at least for the low-level constructs, which are insanely hard to get right. However, I do use my own high-level constructs, like the ones I describe here.

    As I started by setting up the architecture first, and my game isn’t finished, I can’t compare performance with anything else. The major goal in my case was ensuring responsiveness more than raw performance, though making performance scale with the available cores improves both.
    So far what I measure is how much CPU power the overall system uses on my 4-core computer. I have managed to make it use very few resources so far, well distributed across the cores.

    It would take a while to explain how I organized everything in my specific case, but so far I’m actually surprised that it’s not slow, even in Debug mode. I ended up making the different systems communicate almost exclusively through work queues (implemented using TBB’s concurrent queue), which makes passing tasks and other messages between systems very simple.

    Also, one thing I’m not sure is usual: I have a special Task class which is designed as an augmentation of a callable. Basically it’s like a std::function, but with knowledge of whether it should be called on each cycle of the executor system or just executed once. It also keeps information about when to execute, so I don’t really have a timer system (though it’s not perfect).
    What happens in my case is that each system (input, graphics, game representation, networking) runs a “loop”, lock-stepped or not depending on the kind of system, which is executed either by a unique thread (graphics, because OpenGL doesn’t allow access from separate threads except in some specific cases) or by a thread pool (a TBB implementation, used by input, game representation and networking, though there is a separate dedicated thread for network acquisition). In the thread-pool case, the loop cycle of the system is represented as a Task configured to reschedule itself, and my TaskScheduler implementation (which manages the thread pool) does the scheduling work. The TaskScheduler is then shared by some systems, while other systems use their own thread, depending on their needs.
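    The Task-plus-rescheduling idea described above might look roughly like this sketch — the names and details are guesses for illustration, not the commenter’s actual code:

```cpp
#include <functional>
#include <vector>

// A callable augmented with scheduling metadata: run once, or rerun on
// every cycle of the executing system. A real version would also carry
// timing information (when / how often to run).
class Task
{
public:
    enum class Mode { Once, EveryCycle };

    Task(std::function<void()> work, Mode mode)
        : mWork(std::move(work)), mMode(mode) {}

    // Runs the work; returns true if the task should stay scheduled.
    bool Run()
    {
        mWork();
        return mMode == Mode::EveryCycle;
    }

private:
    std::function<void()> mWork;
    Mode mMode;
};

// One cycle of a system's loop: run every task, keep only the repeating
// ones. Tasks pushed in by other systems get picked up on the next cycle.
void RunCycle(std::vector<Task>& tasks)
{
    std::vector<Task> keep;
    for (auto& t : tasks)
        if (t.Run())
            keep.push_back(std::move(t));
    tasks.swap(keep);
}
```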

    Now, to communicate, these systems mainly “push” Tasks into other systems, either to be executed in the host system’s loop until a condition is reached, or to be executed once (for example, that’s how I change the graphical state of a game object). They can also be executed only after some delay, or in a loop with a time interval.

    I totally avoid sharing any data between the systems (with a few required exceptions), so I have almost no mutexes, except during construction and destruction of systems, because their inter-dependencies impose a construction/destruction order.

    By the way, it’s a client/server game even in single-player, so what I’m describing here is true for both the client and server parts, even if they don’t have the same systems at all. They only share the networking system, I think.

    As I am not in an optimisation phase yet, I think my current code is not nearly as efficient as it could be. Also, I haven’t tried to overload the system with tons of work yet; that will come soon. Anyway, if you want to take a look at the code, maybe to get ideas or out of curiosity, I’d gladly send it to you. I don’t think it’s interesting yet, but maybe in a few months, when I can finally show a more game-y version, it would be interesting for you to look at.

    • Klaim - Joel Lamotte says:

      Oh, also: I haven’t checked yet whether I have a lot of false sharing, but I think it might be limited a lot by the way I organise most data: most data gravitates around one system, so there is no sharing except in a few unavoidable cases.

      About tools, the VS debugger is insanely useful. Their concurrency performance analyzer helps a lot too. These are the reasons why I continue using VS instead of switching to more advanced compilers.
      However, I focus on the high-level organization of the concurrency constructs, so I don’t have any data on, for example, parallelisation/vectorization. Also, I use Ogre as the graphics engine, and not the new version, which is clearly faster, so the graphics code is not expected to be insanely fast until I upgrade; but that’s OK in my case.

      • cliffski says:

        My Visual Studio copy is old enough that, after reading your post, I’ve given in and ordered Visual Studio 2013, which supports the concurrency stuff that my ancient (2005) copy does not… can’t wait to get my teeth into it :D

        • Klaim - Joel Lamotte says:

          Well, for performance analysis, concurrency performance analysis, whole-program optimization and profile-guided optimization, I guess that’s the best choice.
          That being said, I agree with this recent post summarizing the problems with Visual Studio 2010 to 2013: http://yosoygames.com.ar/wp/2013/12/microsoft-we-need-to-talk-about-visual-studio/

          As you’ve bought VS2013 already, it’s too late to warn you, though VS2013 is more bearable than VS2010, for example. I’ll just quote here the reasons why I, and others like the author, continue to use VS even with this massive list of flaws:

          If it’s so bad, why don’t you move?
          Oh, I AM trying to move. And I cannot wait for Clang’s MSVC frontend to be finished. That’s the whole point of this article: If the VS team doesn’t improve these serious pitfalls, on the long run more and more developers will walk away.

          But there are three things in mind:

          Visual Studio IS the default and standard compiler for the Windows platform.
          MSVC 2008 is probably one of the best IDEs ever made (including the compiler). I still use it on a daily basis. However as time goes on, less libraries ship precompiled for 2008, and MS is dropping support for its latest platforms (i.e. Win 8 and Co.)
          Truth to be told, MSVC has one the best, if not the best, debugger out there. It’s also truth that MSVC’s debugging performance has also gotten slower (evaluating an expression and single stepping keeps taking longer and longer on every iteration); so watch out for that too. But it is its strongest selling point. If Qt Creator or another IDE had the powerful debugging UI that MSVC has without all the performance pitfalls, VS days would be numbered.
          Overall VS 2013 is a big step forward over the horrible VS2012 and VS2010; so there’s still hope; but they still have a long way to run to recover the competitiveness they once had. Mainly in the areas of compiler performance, which is still horrible and the game industry demands fast iteration and very low compile times; providing 64-bit version, and better refactoring facilities (how smart intellisense adapt to code changes, even if that includes analyzing older data).

  3. Arowx says:

    Have you considered the new Mantle API with multi-core rendering and nine times the draw calls?

  4. Xietanu says:

    I know this isn’t the main focus of the post, but I am curious (and pleased) about the dual monitor support. How are you thinking of using the two monitors? Would it just be drawing the game across both, and do you see any potential issues with the ‘gap’? Also, will this work with setups like mine, where I have two different monitors that are different sizes/resolutions?

    • cliffski says:

      Currently what I am doing is just creating an extra full-screen window for each additional monitor (when enabled, and in fullscreen mode), which extends the view of the battle over all the screens. They are separate windows, but they’re powered by the same driver and DirectX device, so in theory it should work with any driver and any combination of monitors. I’ve tested it with my own two monitors, which are different resolutions, and it’s fine :D