addps r0, r1 # (r0 = r0 + r1)
vs.
vaddps r0, r1, r2 # (r0 = r1 + r2)
This new encoding is used not only for the new 256-bit instructions, but also for the 128-bit AVX versions of all the old SSE instructions. This means that existing SSE code can be improved without requiring a switch to 256-bit registers. Finally, AVX introduces some new data movement instructions, which should help improve code efficiency.
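As a rough illustration of the encoding difference, here is a minimal sketch using SSE intrinsics. The helper name add4 is made up for this example and does not come from qcms. The source is identical in both cases; a compiler will typically emit the two-operand addps for a plain SSE build and the three-operand vaddps when the file is built with -mavx.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper (not from qcms): adds two groups of four floats.
 * Without -mavx this typically compiles to the destructive two-operand
 * addps; with -mavx the same source typically becomes the non-destructive
 * three-operand vaddps. */
static void add4(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);            /* unaligned load of a[0..3] */
    __m128 vb = _mm_loadu_ps(b);            /* unaligned load of b[0..3] */
    _mm_storeu_ps(out, _mm_add_ps(va, vb)); /* out[i] = a[i] + b[i] */
}
```

Comparing the disassembly of the two builds (for example with objdump -d) is an easy way to check which encoding the compiler actually chose.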
I decided to see what kind of performance difference using AVX could make in qcms with minimal effort. If you use SSE compiler intrinsics, like qcms does, switching to AVX is very easy; simply recompile with -mavx. In addition to using -mavx, I also took advantage of some of the new data movement instructions by replacing the following:
vec_r = _mm_load_ss(r);
vec_r = _mm_shuffle_ps(vec_r, vec_r, 0);
with the new vbroadcastss instruction:
vec_r = _mm_broadcast_ss(r);
Overall, this change reduces the inner loop by 3 instructions.
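For anyone who wants to try the two forms side by side, here is a small self-contained sketch. The helper names splat_sse, splat_avx and splats_agree are invented for this example and are not part of qcms. The target attribute is a GCC/Clang extension that lets the AVX helper compile even when the file is not built with -mavx, so the AVX path should only be called on a CPU that supports AVX.

```c
#include <immintrin.h>  /* SSE and AVX intrinsics */

/* SSE form: load one float into lane 0, then shuffle it into all four lanes. */
static __m128 splat_sse(const float *p)
{
    __m128 v = _mm_load_ss(p);
    return _mm_shuffle_ps(v, v, 0);
}

/* AVX form: a single broadcast from memory (vbroadcastss). */
__attribute__((target("avx")))
static __m128 splat_avx(const float *p)
{
    return _mm_broadcast_ss(p);
}

/* Returns 1 if both helpers fill all four lanes with *p, 0 otherwise.
 * Executes an AVX instruction, so call it only on AVX-capable hardware. */
__attribute__((target("avx")))
static int splats_agree(const float *p)
{
    float a[4], b[4];
    _mm_storeu_ps(a, splat_sse(p));
    _mm_storeu_ps(b, splat_avx(p));
    for (int i = 0; i < 4; i++) {
        if (a[i] != *p || b[i] != *p)
            return 0;
    }
    return 1;
}
```

Both helpers produce the same vector, so any timing difference between them comes from how the instructions are executed rather than from what they compute.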
The performance results were positive, but not what I expected. Here's what the timings were:
SSE2: 75798 usecs
AVX (-mavx): 69687 usecs
AVX w/ vbroadcastss: 72917 usecs
Simply recompiling with -mavx makes the code about 8% faster. Using the vbroadcastss instruction, in addition to the AVX encoding, not only doesn't help, but actually makes things worse. I tried analyzing the code with the Intel Architecture Code Analyzer, but the analyzer also thought that using vbroadcastss should be faster. If anyone has any ideas why vbroadcastss would be slower, I'd love to hear them.
Despite this weird performance problem, AVX seems like a good step forward and should provide good opportunities for improving performance beyond what's possible with SSE. For more information, check out this presentation, which gives a good overview of how to take advantage of AVX.
5 comments:
Jeff,
Can you fix the link to the presentation? It needs an http:// in front.
You might also have mentioned that AVX can access unaligned data. This should make it much easier to use.
Jeff, these fancy new insns are only going to improve performance if it isn't limited by some other factor, particularly by the performance of the memory system. Do you have any feel, for the SSE code with the workloads you're using, to what extent performance is limited by the rate at which the processor can dispatch SSE insns vs by cache misses?
jlebar: fixed.
jseward: The working set is processing about 50MB of linear data. This means the best data rate we're currently getting is about 690 MB/s which I expect is lower than the rate the machine can sustain. I'll rip out some of the computation to get a better idea of what the memory performance of the workload is when I get a chance.
jseward: ripping out the computation makes the loop run in 30889 usecs.