Micro optimizations & emulation
Today I finally had some time to get back to reicast / dreamcast emulation / open source stuff.
I was playing around with inolen’s redream project. Performance was bad, so I dived into profiling. It turned out redream was miscalculating the fastmem compile flag.
Of course, I couldn’t resist and started doing differential profiling. Comparing redream with reicast, reicast’s TA processing code took 18% of the main emulation thread’s time, while redream’s TA processing took far less.
The original code had a few “warning” signs (the use of F64 and %), but I assumed a “modern, smart compiler” would be able to optimize these. The generated assembly was quite bad: with no guarantee that the dst/src pointers don’t overlap, the compiler was forced to spill to the stack and use x87 instructions. Also, pcw.obj_ctrl % 32 compiled to a signed modulo operation even though the value can never be negative.
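To show why a compiler can't simply turn a signed % 32 into a single AND, here is a minimal sketch in C. The function names and the plain int parameter are made up for illustration and are not reicast's actual code. The two expressions agree for non-negative inputs and differ for negative ones, so unless the compiler can prove the value is never negative it has to emit extra sign-fixup instructions.

```c
#include <stdint.h>

/* Signed remainder: C truncates toward zero, so a negative input
   gives a negative (or zero) result. The compiler must preserve that. */
static int mod32_signed(int x)
{
    return x % 32;
}

/* Explicit mask: the result is always in 0..31.
   (Assumes two's complement, which every mainstream target uses.) */
static int mod32_mask(int x)
{
    return x & 31;
}

/* Unsigned remainder by a power of two: no sign to worry about,
   so this is equivalent to masking the low five bits. */
static uint32_t mod32_unsigned(uint32_t x)
{
    return x % 32u;
}
```

For a value that is known to be non-negative, writing the mask explicitly (or making the operand unsigned) states that knowledge in the code instead of hoping the optimizer can deduce it.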
I changed the copy to memcpy, replaced % 32 with an explicit & 31, and then tried SSE/AVX intrinsics:
- memcpy + &31: 2x faster, +12% overall
- AVX intrinsics: another +13% overall
- SSE + branch reorder: another +5% overall
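The first and last steps in that list can be sketched roughly as follows. This is an illustration under assumptions, not the actual reicast change: the copy32_* names are hypothetical, and the fixed 32-byte size is only a guess based on the function name ta_vtx_data32. The SSE2 path uses the standard unaligned load/store intrinsics from emmintrin.h and falls back to memcpy on targets without SSE2.

```c
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h> /* SSE2 intrinsics */
#endif

/* Fixed-size memcpy: the size is a compile-time constant and memcpy
   requires non-overlapping buffers, so compilers typically inline
   this into a handful of vector moves. */
static void copy32_memcpy(void *dst, const void *src)
{
    memcpy(dst, src, 32);
}

/* Explicit SSE2 version: two unaligned 16-byte loads followed by
   two unaligned 16-byte stores. */
static void copy32_sse2(void *dst, const void *src)
{
#if defined(__SSE2__)
    __m128i lo = _mm_loadu_si128((const __m128i *)src);
    __m128i hi = _mm_loadu_si128((const __m128i *)src + 1);
    _mm_storeu_si128((__m128i *)dst, lo);
    _mm_storeu_si128((__m128i *)dst + 1, hi);
#else
    memcpy(dst, src, 32); /* portable fallback */
#endif
}
```

An AVX variant would do the same with a single 32-byte load and store (_mm256_loadu_si256 / _mm256_storeu_si256 from immintrin.h), but it needs AVX enabled at compile time and available on the CPU at run time.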
ta_vtx_data32 went from 18% at 170 fps to 0.72% at 225 fps. A 32% performance increase just from editing a few lines.
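As a rough consistency check, and assuming each step's gain is measured relative to the step before it, the three gains compound to about the same figure as the frame-rate ratio:

```latex
\[
1.12 \times 1.13 \times 1.05 \approx 1.33,
\qquad
\frac{225\ \text{fps}}{170\ \text{fps}} \approx 1.32
\]
```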
The original code was micro-optimized for cortex-a8, without considering x86 performance. That micro-optimization also had a fatal mistake, forcing a float -> integer move.
So, do micro optimizations matter? Based on this example, only if you don’t make things worse while doing them. Looking at the generated assembly and benchmarking on all relevant platforms helps.