very techy post…

My latest game uses the same multithreading code I’ve used before. This is C++ code on Windows 10, using Visual Studio 2013 Update 5. My multithreading code basically does this:

Create 8 threads (on an 8-core system). Create 8 events, one tracked by each of the 8 threads. At the start of the app all of the threads are running and all of the events are unset. When I get to a point where I want to do some multithreaded work, I add tasks to my thread manager; as I add them, I see which of my threads does not have a task scheduled, and I set that thread’s event:

PCurrentTasks[t] = pitem;
SetEvent(StartEvent[t]);

That’s done inside a CriticalSection, which I initialized with a spin count of 0 (although a 2,000 spin count makes no difference).

Each of the threads has a thread procedure that’s essentially this:

WaitForSingleObject(SIM_GetThreadManager()->StartEvent[thread], INFINITE);

When I get past that wait, I enter the critical section, grab the task assigned to me, do the actual thread work, then immediately check to see if there is another task (again, this happens in a critical section). If so, I process that task; if not, I’m back in my loop of WaitForSingleObject().

Everything works just fine… BUT!

If I have a big queue of things to do, my concurrency charts from VTune show stuff like this:

Seriously, WTF is happening there? Those gaps are huge: according to the scale they are maybe 0.2ms long. What the hell is taking so long? Is there some fundamental problem with how I am multithreading using events? Does SetEvent() take a billion years to actually process? It feels like my code should be way more bunched up than it is.

13 Responses to “Getting threads to rapidly schedule tasks in C++, Windows”

  1. James says:

    Long shot… but mutexes are pretty slow (if it’s an OS mutex then that requires a context switch and all sorts)

    Would a lockless queue be a better way to distribute jobs to worker threads? To achieve the effect of a lockless queue, you can implement a ring buffer using nothing but atomic operations… which are pretty fast. You can go turbo geeky and get into this: https://mikeash.com/pyblog/friday-qa-2012-02-03-ring-buffers-and-mirrored-memory-part-i.html

    • cliffski says:

      Interesting, thanks for the link. I think I’ve just made some big progress actually, mostly by staying within the thread, and ignoring the event system as long as more items are queued up.

  2. Sam says:

    I’m a bit of a lurker here, I find your posts super interesting, so thank you for sharing.

    I’m curious why you are still on VS 2013, and haven’t moved up to 2015 or 2017 yet?

  3. Sam Swain says:

    Sounds very similar to mine, except I don’t bother working out which is free, I just trigger them all and let them fight it out to pick up tasks. The logic being that jobs typically come along in groups. Cost of them all checking for work is negligible. Could try that. :)

  4. gradbot says:

    It sounds like the critical section is the problem. Threads calling EnterCriticalSection or WaitForSingleObject can be blocked waiting on the Windows scheduler even though the critical section is not taken or the event has been signaled.

    TryEnterCriticalSection could solve your problem. If you know there is more work but can’t take any yet because of the lock then you could spin your thread instead of giving up control to the OS.

    • CdrJameson says:

      Sounds familiar to some problems I ran into a few years back.
      SetEvent/WaitForSingleObject aren’t like calling a callback, where the waiting function is called immediately; it’s more that SetEvent says “blockage removed”, so the now-unblocked thread becomes a candidate for some processor time the next time the scheduler gets round to having a look.

      Possibly if you can do something after the SetEvent() (eg. a Sleep(0)) to wake up the scheduler you might get an improvement, but that might not work on modern Windows versions.

      Someone on Stack Overflow seemed to be having this problem over 7 years ago, so it’s not new: https://stackoverflow.com/questions/1306118/overhead-due-to-use-of-events

      Also, some interesting stuff on Windows thread scheduling in this sample chapter from Windows Internals: http://download.microsoft.com/download/1/4/0/14045A9E-C978-47D1-954B-92B9FD877995/97807356648739_SampleChapters.pdf

      • cliffski says:

        Yup, that’s exactly the issue. I have found that when I complete a task, if I just check (inside a critical section) for another task *within the thread body*, rather than relying on the event, I get much better speeds, although that only helps when a queue of tasks builds up. Still… it’s much better.

  5. Ben Collier says:

    Probably nonsense but…

    I almost wonder if you could be the victim of an emergent behaviour. If all of your calc_route procedures are of almost exactly the same length, and you have some holdoff time if you discover that all the threads are busy, then you could be holding off for those long delays on all your threads.

    If there is a holdoff time, try randomising it instead of saying “wait for x microseconds before rechecking if there’s a thread free”.

  6. Shad Sterling says:

    The thing that strikes me is that in each visible place where most of the threads have a gap, there is exactly one thread that does not. Is that the case generally (not just in this screenshot)? Could one of the paths be doing too much in a critical section?