Jeff Muizelaar: qcms

Friday, October 2, 2009

qcms — now faster

Thanks to some optimization work by Steve Snyder, qcms is even faster than before.

What follows is a chart with the new performance numbers:

The benchmark is the same as last time but run on a slightly slower computer using OS X v10.6 instead of 10.5. As the chart shows, the new qcms code is more than twice as fast as the previous code. In addition to the performance improvement, the new code includes a version that only uses SSE1 instructions. This will be especially helpful for those with older computers where the time spent doing color correction isn't as negligible as it is on faster computers.

When I reran this benchmark, I noticed that lcms performed drastically better than the last time I had run it. Why was lcms so much faster on 10.6? What had changed? The default architecture target: on OS X 10.6, the compiler builds 64-bit binaries by default. Still, it was surprising that compiling for 64-bit could nearly double the performance.

Most of the difference, it turns out, can be attributed to the MAT3evalW¹ function. This function multiplies a 1×3 matrix by a 3×3 one using nine 32×32→64 multiplications. GCC can usually compile these into 32×32→64 multiply instructions, but that wasn't happening in 32-bit mode. Instead of the expected 9 multiplies, we get 18 multiplies and a bunch of housekeeping work, likely caused by the 64-bit additions and additional register pressure. In 64-bit mode, however, we get the code that you'd expect: it takes only 38 instructions versus the 169 instructions the 32-bit build uses. With a difference like that in the inner loop, it's easy to see why the 64-bit build is so much faster.
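
To make the shape of that computation concrete, here is a minimal sketch of a fixed-point 3×3 multiply that needs nine widening 32×32→64 multiplications and accumulates in 64 bits. It is not the lcms source: the names fixed32 and mat3_eval, the 16.16 fixed-point format, and the rounding step are all assumptions chosen for illustration.

```c
#include <stdint.h>

/* Hypothetical 16.16 fixed-point type; an assumption for this sketch,
 * not necessarily the representation lcms uses. */
typedef int32_t fixed32;

/* r = m * v, where m is 3x3 and v has 3 elements.
 * Each output element needs three 32x32->64 multiplies (nine in total)
 * and 64-bit additions to accumulate them. */
static void mat3_eval(fixed32 r[3], fixed32 m[3][3], fixed32 v[3])
{
    for (int i = 0; i < 3; i++) {
        int64_t acc = 0;
        for (int j = 0; j < 3; j++) {
            /* Casting one operand to int64_t requests a widening multiply. */
            acc += (int64_t)m[i][j] * v[j];
        }
        /* Round to nearest and shift back down to 16.16.
         * Right-shifting a negative value is implementation-defined in C;
         * common compilers perform an arithmetic shift. */
        r[i] = (fixed32)((acc + 0x8000) >> 16);
    }
}
```

On a 64-bit target each product fits in a single register, while a 32-bit target has to split every 64-bit product and sum across register pairs, which is the kind of extra work described above. Comparing the assembly a compiler emits for a function like this in 32-bit and 64-bit mode is one way to look for the same effect.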

1. MAT3evalW has a handwritten assembly version that should be faster than the one that GCC generates. Unfortunately, it is MSVC-only.

6 comments:

Zack Weinberg said...: What does MAT3evalW correspond to in QCMS?; October 7, 2009 at 3:49 PM
Jeff Muizelaar said...: There's no direct correspondence in qcms, but code for doing the matrix multiplication exists in all of the qcms_transform_data_*() specializations.; October 7, 2009 at 4:11 PM
Caspy7 said...: Curious, what version of Firefox will this most likely find itself in? 3.7?; October 8, 2009 at 7:14 PM
Jeff Muizelaar said...: @Mark: Certainly in 3.7 and, depending on how things go, we may be able to get it in for 3.6.; October 8, 2009 at 8:44 PM
Dave said...: Very nice optimizations here. I'm interested in comparing it against what we're using (KodakCMS). However, it doesn't look like these optimizations are included in the Git repository found here:

http://cgit.freedesktop.org/~jrmuizel/qcms/

Any chance of seeing these optimizations there? If not, where is qcms being maintained that I can find the latest code?

Thanks for sharing some great technology.; February 17, 2010 at 1:48 AM
Jeff Muizelaar said...: @Dave: The newer optimizations are in the git repository now, though there might be some build issues.; February 27, 2010 at 1:41 PM