What follows is a chart with the new performance numbers:The benchmark is the same as last time but run on a slightly slower computer using OS X v10.6 instead of 10.5. As the chart shows, the new qcms code is more than twice as fast as the previous code. In addition to the performance improvement, the new code includes a version that only uses SSE1 instructions. This will be especially helpful for those with older computers where the time spent doing color correction isn't as negligible as it is on faster computers.
When running this benchmark again, I noticed that the performance of lcms had drastically improved since the last time I had run the benchmark. Why was lcms so much faster on 10.6? What had changed? The default architecture target: in OS X 10.6, the compiler builds 64 bit binaries by default. Still, it was surprising that compiling for 64 bit could nearly double the performance.
The large difference, it turns out, can largely be attributed to the
MAT3evalW1 function. This function multiplies a 1×3 matrix with a 3×3 one using 9 32×32→64 multiplications. GCC can usually optimize these multiplications by using the 32×32→64 multiply instructions, however that wasn't happening in 32 bit mode. Instead of the expected 9 multiplies, we get 18 multiplies and a bunch of housekeeping work, likely caused by the 64 bit additions and additional register pressure. In 64 bit mode, however, we get the code that you'd expect. This only takes 38 instructions versus the 169 instructions the 32 bit build uses. With a difference like that in the inner loop, it's easy to see why the 64 bit build is so much faster.
MAT3evalWhas a handwritten assembly version that should be faster than the one that GCC generates, unfortunately it is MSVC only.