400% speedup in a pesky transform thing

August 22, 2011 cliffski

This code was slow*:

    D3DTLVERTEX* pvert = LocalMem;
    for(int c = 0; c < CopiedIn; c++)
    {
        pvert[c].dvSX -= TransformX;
        pvert[c].dvSY -= TransformY;
        pvert[c].dvSX *= TransformZoom;
        pvert[c].dvSY *= TransformZoom;
    }

This code runs in one quarter of the time:

    D3DTLVERTEX* pvert = LocalMem;
    for(int c = 0; c < CopiedIn; c++)
    {
        pvert->dvSX -= TransformX;
        pvert->dvSY -= TransformY;
        pvert->dvSX *= TransformZoom;
        pvert->dvSY *= TransformZoom;
        pvert++;
    }

Pointers FTW!

I’m doing this sort of stuff now, which isn’t as fast as actually using hardware T&L, but is better than my older, hacky software transforms, which happened on individual sprites rather than at the VB (vertex buffer) level. Yeah, I know… everyone uses world matrices and hardware T&L. I won’t bore you with the reasons I’m not, but there ya go. It works! (GSB uses a software world -> screen transform per object.)

EDIT: These measurements may be a glitch. I’ve run and re-run, and re-run the profiler on both versions and now cannot get as big a discrepancy (although there is still a speed difference). Getting accurate measurements on a multi-core PC that has a live internet connection and various background services running is hell. Now I know why people like console dev :D

*relatively speaking.
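
Given how noisy the measurements turned out to be, here is a minimal sketch of one way to compare the two loops outside the game. It is not the code from the game: `Vertex` is an assumed stand-in for `D3DTLVERTEX` so it compiles without the DirectX headers, and `TransformIndexed`, `TransformPointer`, `Timing` and `BestOf` are hypothetical names made up for the illustration. It takes the fastest of several runs, which tends to be less affected by background activity than a single run.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <vector>

// Stand-in for D3DTLVERTEX (assumed layout; only dvSX and dvSY matter here).
struct Vertex {
    float dvSX, dvSY, dvSZ, dvRHW;
    std::uint32_t dcColor, dcSpecular;
    float dvTU, dvTV;
};

// Indexed form, as in the first listing.
void TransformIndexed(Vertex* pvert, int count, float tx, float ty, float zoom)
{
    for (int c = 0; c < count; c++)
    {
        pvert[c].dvSX -= tx;
        pvert[c].dvSY -= ty;
        pvert[c].dvSX *= zoom;
        pvert[c].dvSY *= zoom;
    }
}

// Pointer-increment form, as in the second listing.
void TransformPointer(Vertex* pvert, int count, float tx, float ty, float zoom)
{
    for (int c = 0; c < count; c++)
    {
        pvert->dvSX -= tx;
        pvert->dvSY -= ty;
        pvert->dvSX *= zoom;
        pvert->dvSY *= zoom;
        pvert++;
    }
}

struct Timing {
    long long bestNs;  // fastest timed run, in nanoseconds
    float checksum;    // sum of transformed dvSX values from the last run
};

// Hypothetical helper: times fn over a fresh copy of the input on every run
// and keeps the fastest time. The transform values are arbitrary test inputs.
template <typename Fn>
Timing BestOf(Fn fn, const std::vector<Vertex>& input, int runs)
{
    assert(runs > 0);
    Timing result{-1, 0.0f};
    for (int r = 0; r < runs; r++)
    {
        std::vector<Vertex> work = input;  // same starting data for every run
        const auto start = std::chrono::steady_clock::now();
        fn(work.data(), static_cast<int>(work.size()), 10.0f, 20.0f, 1.5f);
        const auto stop = std::chrono::steady_clock::now();
        const long long ns =
            std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
        if (result.bestNs < 0 || ns < result.bestNs)
            result.bestNs = ns;
        // Read the results back so the optimizer cannot discard the loop.
        result.checksum = 0.0f;
        for (const Vertex& v : work)
            result.checksum += v.dvSX;
    }
    return result;
}
```

Calling `BestOf(TransformIndexed, verts, 20)` and `BestOf(TransformPointer, verts, 20)` on the same vector gives two times to compare, and the two checksums should match since both loops do the same arithmetic. Any numbers only mean something for an optimized build.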

9 thoughts on 400% speedup in a pesky transform thing

LtJax says:

August 23, 2011 at 10:26 am

So it seems that the code is slower because it re-evaluates pvert[c] four times. That alone shouldn’t even double the runtime tho – since I can’t imagine that FLOPS are the limiting factor here. My guess is that the explicit pvert[c] somehow prevents the optimizer from doing a good job.

I wonder how

    pvert[c].dvSX = (pvert[c].dvSX - TransformX) * TransformZoom;
    pvert[c].dvSY = (pvert[c].dvSY - TransformY) * TransformZoom;

would perform…

Andrew says:

August 23, 2011 at 12:36 pm

This is why I prefer writing games in Java – simple, slow and steady wins the race. :)

I’m very rusty with C/C++ – are you not just accessing the array using a different method? Why is one so much faster than the other?

Robert says:

August 23, 2011 at 12:58 pm

Is it possible you wouldn’t have had to spend the time on this if you were using a framework like Unity?

But then again, I think you probably enjoy this part of coding too much to let it go :)

cliffski says:

August 23, 2011 at 6:20 pm

It’s precisely because of things like this that I do NOT use Unity or Java. Neither Unity nor Java has any idea how I am designing my code and data. They just make an educated guess. In this case, one guess is 4 times faster than another, so I’d rather be in charge and not leave it to guesswork :D
That’s just me though…
I’m happy now the game runs at 60 FPS at 1920x1200 res all the time. Oh yes :D

Kyle says:

August 24, 2011 at 3:51 am

It makes sense to get a 4x speedup since you are doing only one offset calculation (the pointer increment) instead of 4 per loop. For this code to even show up on the radar, a lot of vertices must be getting pushed through it. Looking forward to seeing those vertices in action ;-)

Mike says:

August 24, 2011 at 4:49 am

These are doing the same thing. These things will get optimized by the compiler. I suspect you are benchmarking Debug builds, which is just wrong (I guess unless you release Debug builds in your releases…)

I really didn’t believe this, so I coded up a quick benchmark.

http://pastebin.com/jdMyr5D1

My build in *debug mode* is showing only 2x speed difference.

When I build in release, the results are identical.

Using Code::Blocks with GNU GCC on 64-bit Windows 7.

Release build:

Array refs took 207736181 (an average of 24.9591 cycles per vertex)
Pointers took 204688496 (an average of 24.5929 cycles per vertex)

Process returned 0 (0x0) execution time : 0.235 s
Press any key to continue.

Will follow with the asm of both versions when I figure out how; I’m used to using gdb’s layout split for that, or objdump.

Mike says:

August 24, 2011 at 5:39 am

Yeah, it looks like you get an extra dereference with the un-optimized asm.

Optimizing with -O3 produces effectively identical asm for both versions:
http://pastebin.com/ELAajuuL

cliffski says:

August 24, 2011 at 7:54 am

Definitely not profiling a debug build here :D I’m profiling it with AQtime.

Keith LaMothe says:

August 24, 2011 at 4:45 pm

Assuming that the code Cliff posted is somewhat abstracted from the actual code, it’s possible that in the actual GTB code there’s something that’s causing the optimizer to believe that the value of either pvert or c can change during the execution of the loop body (not the loop increment of c), and thus it’s backing off from trying to optimize it into the second loop.
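
One way that kind of situation can arise, sketched under assumptions: if the pointer, the count and the transform values live in an object (or in globals) rather than in locals, the compiler has to prove that the stores into the vertex array cannot overwrite them before it can keep them in registers. The `Batch` struct and both method names below are hypothetical, and `Vertex` is a reduced stand-in for `D3DTLVERTEX`; nothing here is taken from the actual game code. Whether a given compiler really reloads the members depends on the compiler and its settings.

```cpp
#include <cassert>

// Stand-in for D3DTLVERTEX, reduced to the two fields the loop touches.
struct Vertex { float dvSX; float dvSY; };

// Hypothetical holder for the state the loop reads. The member names follow
// the post's listing; the struct itself is an assumption.
struct Batch
{
    Vertex* LocalMem = nullptr;
    int CopiedIn = 0;
    float TransformX = 0.0f;
    float TransformY = 0.0f;
    float TransformZoom = 1.0f;

    // Reads every value through `this` inside the loop. If the compiler
    // cannot prove the float stores leave these members untouched, it may
    // reload them on every pass.
    void TransformViaMembers()
    {
        for (int c = 0; c < CopiedIn; c++)
        {
            LocalMem[c].dvSX = (LocalMem[c].dvSX - TransformX) * TransformZoom;
            LocalMem[c].dvSY = (LocalMem[c].dvSY - TransformY) * TransformZoom;
        }
    }

    // Copies everything into locals first. Locals whose address is never
    // taken cannot be changed by the stores, so they can stay in registers.
    void TransformViaLocals()
    {
        assert(LocalMem != nullptr || CopiedIn == 0);
        Vertex* pvert = LocalMem;
        const int count = CopiedIn;
        const float tx = TransformX;
        const float ty = TransformY;
        const float zoom = TransformZoom;
        for (int c = 0; c < count; c++, pvert++)
        {
            pvert->dvSX = (pvert->dvSX - tx) * zoom;
            pvert->dvSY = (pvert->dvSY - ty) * zoom;
        }
    }
};
```

Both methods produce the same vertices; the only difference is how much the optimizer has to prove before it can turn the first into the second.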
