Drawing arcs is a common task in computer graphics. Arcs are typically drawn using an approximation by set of cubic Bezier splines. These splines are passed through the regular rendering pipeline where they are usually subdivided into a set of line segments and then rasterized. A circle, for example, can be represented by four cubic splines: one for each 90 degree arc.
An arc of up to 90 degrees can be approximated by a single cubic Bezier spline reasonably well. To find an appropriate approximating spline we set the start and end points of the spline to match the arc and then can find the middle two points using the following formula, where A is the start angle and B is the end angle:
h = 4/3 * tan ((B - A)/4)
pt1 = {center.x + r*cos(A), center.y + r*sin(A)}
pt2 = {center.x + r*cos(A) - h*r*sin(A), center.y + r*sin(A) + h*r*cos(A)}
pt3 = {center.x + r*cos(A) + h*r*sin(B), center.y + r*sin(A) - h*r*cos(B)}
pt4 = {center.x + r*cos(B), center.y + r*sin(B)}
This approach works quite well and is currently used by cairo. However, the problem with this approach is that it requires converting to polar coordinates and then back to Cartesian ones. This conversion is slow because it requires the use of trigonometric functions.
Fortunately, this paper gives a solution that avoids the conversion to polar coordinates:
ax = pt1.x - center.x
ay = pt1.y - center.y
bx = pt4.x - center.x
by = pt4.y - center.y
q1 = ax*ax + ay*ay
q2 = q1 + ax*bx + ay*by
k2 = 4/3 (sqrt(2*q1*q2) - q2) / (ax*by - ay*bx)
pt2.x = pt1.x - k2*ay
pt2.y = pt1.y + k2*ax
pt3.x = pt4.x + k2*bx
pt3.y = pt4.y - k2*by
Unfortunately, there is no explanation, derivation or proof for this formula. Even more problematic, the formula for k2 becomes less stable as pt1 approaches pt4, becoming NaN when pt1 equals pt4.
Therefore, the homework problem has two parts:
1. Give an explanation or derivation for the formula for k2 provided above.
2. Provide a similar formulation for pt2 and pt3 that doesn't degenerate as pt1 and pt4 become close or prove that one doesn't exist.
The best answer will be credited in the new cairo stroking code.
Wednesday, October 7, 2009
Friday, October 2, 2009
qcms — now faster
Thanks to some optimization work by Steve Snyder, qcms is even faster than before.
What follows is a chart with the new performance numbers:The benchmark is the same as last time but run on a slightly slower computer using OS X v10.6 instead of 10.5. As the chart shows, the new qcms code is more than twice as fast as the previous code. In addition to the performance improvement, the new code includes a version that only uses SSE1 instructions. This will be especially helpful for those with older computers where the time spent doing color correction isn't as negligible as it is on faster computers.
When running this benchmark again, I noticed that the performance of lcms had drastically improved since the last time I had run the benchmark. Why was lcms so much faster on 10.6? What had changed? The default architecture target: in OS X 10.6, the compiler builds 64 bit binaries by default. Still, it was surprising that compiling for 64 bit could nearly double the performance.
The large difference, it turns out, can largely be attributed to the
1.
What follows is a chart with the new performance numbers:The benchmark is the same as last time but run on a slightly slower computer using OS X v10.6 instead of 10.5. As the chart shows, the new qcms code is more than twice as fast as the previous code. In addition to the performance improvement, the new code includes a version that only uses SSE1 instructions. This will be especially helpful for those with older computers where the time spent doing color correction isn't as negligible as it is on faster computers.
When running this benchmark again, I noticed that the performance of lcms had drastically improved since the last time I had run the benchmark. Why was lcms so much faster on 10.6? What had changed? The default architecture target: in OS X 10.6, the compiler builds 64 bit binaries by default. Still, it was surprising that compiling for 64 bit could nearly double the performance.
The large difference, it turns out, can largely be attributed to the
MAT3evalW
1 function. This function multiplies a 1×3 matrix with a 3×3 one using 9 32×32→64 multiplications. GCC can usually optimize these multiplications by using the 32×32→64 multiply instructions, however that wasn't happening in 32 bit mode. Instead of the expected 9 multiplies, we get 18 multiplies and a bunch of housekeeping work, likely caused by the 64 bit additions and additional register pressure. In 64 bit mode, however, we get the code that you'd expect. This only takes 38 instructions versus the 169 instructions the 32 bit build uses. With a difference like that in the inner loop, it's easy to see why the 64 bit build is so much faster.1.
MAT3evalW
has a handwritten assembly version that should be faster than the one that GCC generates, unfortunately it is MSVC only.
Subscribe to:
Posts (Atom)