tag:blogger.com,1999:blog-13869480373844354412024-03-14T03:42:33.830-07:00Jeff MuizelaarJeff Muizelaarhttp://www.blogger.com/profile/17483047845050494642noreply@blogger.comBlogger30125tag:blogger.com,1999:blog-1386948037384435441.post-32667912144117178982016-07-28T11:57:00.000-07:002016-07-28T11:58:01.145-07:00Counting function calls per secondSay you want to know how often you're allocating tiles in Firefox or the rate of some other thing. There's an easy way to do this using dtrace. The following dtrace script counts calls to any functions matching the pattern '*SharedMemoryBasic*Create*' in XUL in the target process.<br />
<br />
<pre>#pragma D option quiet
dtrace:::BEGIN
{
rate = 0;
}
profile:::tick-1sec
{
printf("%d/sec\n", rate);
rate = 0;
}
pid$target:XUL:*SharedMemoryBasic*Create*:entry
{
rate++;
} </pre>
<br />
You can run this script with the following command:
<br />
<pre>$ dtrace -s $SCRIPT_NAME -p $PID
</pre>
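For what it's worth, the tick-and-reset logic of the script is simple enough to reproduce in-process on platforms without dtrace. Here's a rough Python sketch of the same pattern (illustrative only; unlike dtrace it requires modifying the program being measured):

```python
import time

class CallRateCounter:
    """In-process version of the dtrace script's tick-and-reset pattern:
    bump a counter on every call, print and reset it once per second."""

    def __init__(self):
        self.rate = 0
        self.last_tick = time.monotonic()

    def count(self):
        self.rate += 1
        now = time.monotonic()
        if now - self.last_tick >= 1.0:
            print("%d/sec" % self.rate)
            self.rate = 0
            self.last_tick = now
```

You'd call count() at the top of the function you care about. Tools like perf or SystemTap on Linux can get closer to dtrace's no-source-changes probing.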
I'd be interested in knowing if anyone else has a similar technique for OSs that don't have dtrace.Jeff Muizelaarhttp://www.blogger.com/profile/17483047845050494642noreply@blogger.com3tag:blogger.com,1999:blog-1386948037384435441.post-15201569091655239762015-12-29T09:35:00.001-08:002015-12-29T09:35:14.065-08:00WebGL2 enabled in Firefox NightlyA couple of weeks ago we enabled <a href="https://www.khronos.org/registry/webgl/specs/latest/2.0/">WebGL2</a> in Nightly. The implementation is still missing some functionality like
PBOs, MSRBs (multisample renderbuffers), and sampler objects, but it seems to work well enough with the WebGL2 content that we've tried.<br />
<br />
WebGL2 is based on OpenGL ES 3 and adds occlusion queries, transform feedback, a large amount of texturing functionality, and a bunch of new capabilities in the shading language, including integer operations.<br />
<br />
You can test out the implementation here <a href="http://toji.github.io/webgl2-particles/">http://toji.github.io/webgl2-particles/</a>. If it says WebGL2, it's working with WebGL2. We look forward to seeing the graphical enhancements enabled by WebGL2 and encourage developers to start trying it out.Jeff Muizelaarhttp://www.blogger.com/profile/17483047845050494642noreply@blogger.com1tag:blogger.com,1999:blog-1386948037384435441.post-22242221147460689222015-11-11T11:48:00.002-08:002015-11-11T11:48:48.174-08:00Debugging reftests with RR<p>When debugging reftests it's common to want to trace back the contents of a pixel to see where they came from. I wrote a tool called <a href="https://github.com/jrmuizel/rr-dataflow/">rr-dataflow</a> to help with this.</p>
<p>What follows is a log of an rr session where I use this tool to trace back the contents of a pixel to the code responsible for setting it.
In this case I'm using the softpipe Mesa driver, which is a simple software implementation of OpenGL. This means that I can trace through
the entire graphics pipeline as needed.</p>
<pre><code>Breakpoint 1, mozilla::WebGLContext::ReadPixels (this=0x7fc064bc7000, x=0, y=0, width=64, height=64, format=6408,
type=5121, pixels=..., rv=...) at /home/jrmuizel/src/gecko/dom/canvas/WebGLContextGL.cpp:1411
1411 {
(gdb) c
Continuing.
Breakpoint 11, mozilla::ReadPixelsAndConvert (gl=0x7fc05c4e7000, x=0, y=0, width=64, height=64, readFormat=6408,
readType=5121, pixelStorePackAlignment=4, destFormat=6408, destType=5121, destBytes=0x7fc05c9d5000)
at /home/jrmuizel/src/gecko/dom/canvas/WebGLContextGL.cpp:1310
1310 {
(gdb) list
1305
1306 static void
1307 ReadPixelsAndConvert(gl::GLContext* gl, GLint x, GLint y, GLsizei width, GLsizei height,
1308 GLenum readFormat, GLenum readType, size_t pixelStorePackAlignment,
1309 GLenum destFormat, GLenum destType, void* destBytes)
1310 {
1311 if (readFormat == destFormat && readType == destType) {
1312 gl->fReadPixels(x, y, width, height, destFormat, destType, destBytes);
1313 return;
1314 }
(gdb) n
1311 if (readFormat == destFormat && readType == destType) {
(gdb)
1312 gl->fReadPixels(x, y, width, height, destFormat, destType, destBytes);
(gdb)
1313 return;
</code></pre>
<p>Let's disable the two breakpoints and set a watchpoint on the first pixel of the destination.</p>
<pre><code>(gdb) dis 1
(gdb) dis 11
(gdb) watch -location *(int*)destBytes
Hardware watchpoint 12: -location *(int*)destBytes
</code></pre>
<p>Then reverse-continue back to where the first pixel was set.</p>
<pre><code>(gdb) rc
Continuing.
Hardware watchpoint 12: -location *(int*)destBytes
Old value = -16711936
New value = 0
__memcpy_avx_unaligned () at ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S:213
213 ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: No such file or directory.
</code></pre>
<p>We end up at a memcpy inside of _mesa_readpixels that copies into the destination buffer.</p>
<pre><code>(gdb) bt 9
#0 0x00007fc0ad9ed955 in __memcpy_avx_unaligned () at ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S:213
#1 0x00007fc075c5080f in _mesa_readpixels (__len=<optimized out>, __src=<optimized out>, __dest=<optimized out>)
at /usr/include/x86_64-linux-gnu/bits/string3.h:53
#2 0x00007fc075c5080f in _mesa_readpixels (packing=0x7fc05c3e92e8, pixels=<optimized out>, type=5121, format=766008576, height=64, width=64, y=0, x=0, ctx=0x7fc05c3ce000) at main/readpix.c:245
#3 0x00007fc075c5080f in _mesa_readpixels (ctx=ctx@entry=0x7fc05c3ce000, x=x@entry=0, y=y@entry=0, width=width@entry=64, height=height@entry=64, format=format@entry=6408, type=5121, packing=0x7fc05c3e92e8, pixels=<optimized out>)
at main/readpix.c:873
#4 0x00007fc075ce0985 in st_readpixels (ctx=0x7fc05c3ce000, x=0, y=0, width=64, height=64, format=6408, type=5121, pack=0x7fc05c3e92e8, pixels=0x7fc05c9d5000) at state_tracker/st_cb_readpixels.c:255
#5 0x00007fc075c519d4 in _mesa_ReadnPixelsARB (x=0, y=0, width=64, height=64, format=6408, type=5121, bufSize=2147483647, pixels=0x7fc05c9d5000) at main/readpix.c:1120
#6 0x00007fc075c51c82 in _mesa_ReadPixels (x=<optimized out>, y=<optimized out>, width=<optimized out>, height=<optimized out>, format=<optimized out>, type=<optimized out>, pixels=0x7fc05c9d5000) at main/readpix.c:1128
#7 0x00007fc09c3a9b3b in mozilla::gl::GLContext::raw_fReadPixels(int, int, int, int, unsigned int, unsigned int, void*) (this=0x7fc05c4e7000, x=0, y=0, width=64, height=64, format=6408, type=5121, pixels=0x7fc05c9d5000)
at /home/jrmuizel/src/gecko/gfx/gl/GLContext.h:1511
#8 0x00007fc09c39abe1 in mozilla::gl::GLContext::fReadPixels(int, int, int, int, unsigned int, unsigned int, void*) (this=0x7fc05c4e7000, x=0, y=0, width=64, height=64, format=6408, type=5121, pixels=0x7fc05c9d5000)
at /home/jrmuizel/src/gecko/gfx/gl/GLContext.cpp:2873
#9 0x00007fc09d78696d in mozilla::ReadPixelsAndConvert(mozilla::gl::GLContext*, GLint, GLint, GLsizei, GLsizei, GLenum, GLenum, size_t, GLenum, GLenum, void*) (gl=0x7fc05c4e7000, x=0, y=0, width=64, height=64, readFormat=6408, readType=5121, pixelStorePackAlignment=4, destFormat=6408, destType=5121, destBytes=0x7fc05c9d5000)
at /home/jrmuizel/src/gecko/dom/canvas/WebGLContextGL.cpp:1312
</code></pre>
<p>From here we can see that the memcpy is storing the value of ymm4 into [r10].
We use the <code>origin</code> command to step back to the place where ymm4 is loaded.</p>
<pre><code>(gdb) origin
0x1000: vmovdqu ymmword ptr [r10], ymm4
1
reg used ymm4
212 in ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S
0x1000: add rdx, rdi
211 in ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S
0x1000: add edx, eax
210 in ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S
0x1000: jae 0xfd2
209 in ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S
0x1000: sub edx, eax
208 in ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S
0x1000: add rdi, rax
207 in ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S
0x1000: vmovdqa ymmword ptr [rdi + 0x60], ymm3
206 in ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S
0x1000: vmovdqa ymmword ptr [rdi + 0x40], ymm2
205 in ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S
0x1000: vmovdqa ymmword ptr [rdi + 0x20], ymm1
204 in ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S
0x1000: vmovdqa ymmword ptr [rdi], ymm0
203 in ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S
0x1000: add rsi, rax
202 in ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S
0x1000: vmovdqu ymm3, ymmword ptr [rsi + 0x60]
201 in ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S
0x1000: vmovdqu ymm2, ymmword ptr [rsi + 0x40]
200 in ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S
0x1000: vmovdqu ymm1, ymmword ptr [rsi + 0x20]
199 in ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S
0x1000: vmovdqu ymm0, ymmword ptr [rsi]
197 in ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S
0x1000: sub edx, eax
196 in ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S
0x1000: add rsi, r11
195 in ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S
0x1000: vmovdqu ymm4, ymmword ptr [rsi]
</code></pre>
<p>We end up at the instruction that loads ymm4 from [rsi]. Origin
does this by single-stepping backwards, looking for writes to
the ymm4 register. From here we want to continue tracking
the origin, so we use the <code>origin</code> command again. This time
it sets a hardware watchpoint on the address in rsi.</p>
<pre><code>(gdb) origin
0x1000: vmovdqu ymm4, ymmword ptr [rsi]
3
mem used *(int*)(0x7fc05c6c5000)
Hardware watchpoint 13: *(int*)(0x7fc05c6c5000)
Hardware watchpoint 13: *(int*)(0x7fc05c6c5000)
Old value = -16711936
New value = -452919552
__memcpy_avx_unaligned () at ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S:238
238 in ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S
</code></pre>
<p>We end up in another memcpy. This memcpy is flushing the tile buffer which
is used for rendering to the backbuffer that ReadPixels is reading from.</p>
<pre><code>(gdb) bt 9
#0 0x00007fc0ad9ed9a6 in __memcpy_avx_unaligned () at ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S:238
#1 0x00007fc075e89753 in pipe_put_tile_raw (pt=pt@entry=0x7fc05c4f8740, dst=dst@entry=0x7fc05c6c5000, x=x@entry=0, y=y@entry=0, w=<optimized out>, w@entry=64, h=<optimized out>, h@entry=64, src=0x7fc05c9d9000, src_stride=<optimized out>)
at util/u_tile.c:80
#2 0x00007fc075e8a268 in pipe_put_tile_rgba_format (pt=0x7fc05c4f8740, dst=0x7fc05c6c5000, x=0, y=0, w=w@entry=64, h=h@entry=64, format=PIPE_FORMAT_R8G8B8A8_UNORM, p=0x7fc05c9c5000) at util/u_tile.c:524
#3 0x00007fc0760034aa in sp_flush_tile (tc=tc@entry=0x7fc06c1e9400, pos=pos@entry=0) at sp_tile_cache.c:427
#4 0x00007fc076003c05 in sp_flush_tile_cache (tc=0x7fc06c1e9400) at sp_tile_cache.c:457
#5 0x00007fc075fe6a0e in softpipe_flush (pipe=pipe@entry=0x7fc0650d5000, flags=flags@entry=0, fence=fence@entry=0x7fff2da85b40) at sp_flush.c:72
#6 0x00007fc075fe6b0d in softpipe_flush_resource (pipe=0x7fc0650d5000, texture=texture@entry=0x7fc07fad4380, level=level@entry=0, layer=<optimized out>, flush_flags=flush_flags@entry=0, read_only=<optimized out>, cpu_access=1 '\001', do_not_block=0 '\000') at sp_flush.c:148
#7 0x00007fc07600304f in softpipe_transfer_map (pipe=<optimized out>, resource=0x7fc07fad4380, level=0, usage=1, box=0x7fff2da85bf0, transfer=0x7fc0665e0a58) at sp_texture.c:387
#8 0x00007fc075cdd6bd in st_MapRenderbuffer (transfer=0x7fc0665e0a58, h=64, w=<optimized out>, y=0, x=0, access=<optimized out>, layer=<optimized out>, level=<optimized out>, resource=<optimized out>, context=<optimized out>)
at ../../src/gallium/auxiliary/util/u_inlines.h:457
#9 0x00007fc075cdd6bd in st_MapRenderbuffer (ctx=<optimized out>, rb=0x7fc0665e09d0, x=0, y=<optimized out>, w=<optimized out>, h=64, mode=1, mapOut=0x7fff2da85cf8, rowStrideOut=0x7fff2da85cf0) at state_tracker/st_cb_fbo.c:772
</code></pre>
<p>We use the <code>origin</code> command again. This time we have a rep movsb operation that reads from memory as its source. Origin again uses
a hardware watchpoint on that address.</p>
<pre><code>(gdb) origin
0x1000: rep movsb byte ptr [rdi], byte ptr [rsi]
3
mem used *(int*)(0x7fc05c9d9003)
Hardware watchpoint 14: *(int*)(0x7fc05c9d9003)
Hardware watchpoint 14: *(int*)(0x7fc05c9d9003)
Old value = 16711935
New value = -437918209
util_format_r8g8b8a8_unorm_pack_rgba_float (dst_row=0x7fc05c9d9000 "", dst_stride=256, src_row=0x7fc05c9c5000,
src_stride=<optimized out>, width=64, height=64) at util/u_format_table.c:15204
15204 *(uint32_t *)dst = value;
</code></pre>
<p>This watchpoint takes us back to the function that converts from the floating point output of the graphics pipeline
to the byte value that goes in the destination tile.</p>
<pre><code>(gdb) bt 9
#0 0x00007fc075eabe44 in util_format_r8g8b8a8_unorm_pack_rgba_float (dst_row=0x7fc05c9d9000 "", dst_stride=256, src_row=0x7fc05c9c5000, src_stride=<optimized out>, width=64, height=64) at util/u_format_table.c:15204
#1 0x00007fc075e8a23b in pipe_put_tile_rgba_format (pt=0x7fc05c4f8740, dst=0x7fc05c6c5000, x=0, y=0, w=w@entry=64, h=h@entry=64, format=PIPE_FORMAT_R8G8B8A8_UNORM, p=0x7fc05c9c5000) at util/u_tile.c:518
#2 0x00007fc0760034aa in sp_flush_tile (tc=tc@entry=0x7fc06c1e9400, pos=pos@entry=0) at sp_tile_cache.c:427
#3 0x00007fc076003c05 in sp_flush_tile_cache (tc=0x7fc06c1e9400) at sp_tile_cache.c:457
#4 0x00007fc075fe6a0e in softpipe_flush (pipe=pipe@entry=0x7fc0650d5000, flags=flags@entry=0, fence=fence@entry=0x7fff2da85b40) at sp_flush.c:72
#5 0x00007fc075fe6b0d in softpipe_flush_resource (pipe=0x7fc0650d5000, texture=texture@entry=0x7fc07fad4380, level=level@entry=0, layer=<optimized out>, flush_flags=flush_flags@entry=0, read_only=<optimized out>, cpu_access=1 '\001', do_not_block=0 '\000') at sp_flush.c:148
#6 0x00007fc07600304f in softpipe_transfer_map (pipe=<optimized out>, resource=0x7fc07fad4380, level=0, usage=1, box=0x7fff2da85bf0, transfer=0x7fc0665e0a58) at sp_texture.c:387
#7 0x00007fc075cdd6bd in st_MapRenderbuffer (transfer=0x7fc0665e0a58, h=64, w=<optimized out>, y=0, x=0, access=<optimized out>, layer=<optimized out>, level=<optimized out>, resource=<optimized out>, context=<optimized out>)
at ../../src/gallium/auxiliary/util/u_inlines.h:457
#8 0x00007fc075cdd6bd in st_MapRenderbuffer (ctx=<optimized out>, rb=0x7fc0665e09d0, x=0, y=<optimized out>, w=<optimized out>, h=64, mode=1, mapOut=0x7fff2da85cf8, rowStrideOut=0x7fff2da85cf0) at state_tracker/st_cb_fbo.c:772
#9 0x00007fc075c507b2 in _mesa_readpixels (packing=0x7fc05c3e92e8, pixels=0x7fc05c9d5000, type=5121, format=766008576, height=64, width=64, y=0, x=0, ctx=0x7fc05c3ce000) at main/readpix.c:234
</code></pre>
<p>We see that ecx is being stored into [r10 - 4]. We use <code>origin</code> to track back to the source.</p>
<pre><code>(gdb) origin
0x1000: mov dword ptr [r10 - 4], ecx
1
reg used ecx
15206 src += 4;
0x1000: add rax, 0x10
15207 dst += 4;
0x1000: add r10, 4
15204 *(uint32_t *)dst = value;
0x1000: or ecx, esi
</code></pre>
<p>We end at an or instruction. Looking at the source below, we see that each of the floating point channels is being converted to a byte.
We'll manually set a watchpoint on the channel that we're interested in, to avoid getting lost in the conversion code.</p>
<pre><code>(gdb) list
15199 uint32_t value = 0;
15200 value |= (float_to_ubyte(src[0])) & 0xff;
15201 value |= ((float_to_ubyte(src[1])) & 0xff) << 8;
15202 value |= ((float_to_ubyte(src[2])) & 0xff) << 16;
15203 value |= (float_to_ubyte(src[3])) << 24;
15204 *(uint32_t *)dst = value;
15205 #endif
15206 src += 4;
15207 dst += 4;
15208 }
(gdb) watch -location src[1]
Hardware watchpoint 15: -location src[1]
(gdb) rc
Continuing.
Hardware watchpoint 15: -location src[1]
Old value = 1
New value = -1.35707841e+23
0x00007fc076003783 in clear_tile_rgba (tile=0x7fc05c9c5000, format=PIPE_FORMAT_R8G8B8A8_UNORM, clear_value=0x7fc06c1e968c)
at sp_tile_cache.c:272
272 tile->data.color[i][j][1] = clear_value->f[1];
</code></pre>
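<p>As a sanity check, the watchpoint values in this log line up with the packing code in the listing. A small Python model of that packing (with float_to_ubyte approximated as clamp-and-round, which is an assumption about Mesa's exact rounding) reproduces the old value we saw: packing opaque green gives 0xFF00FF00, which gdb prints as the signed value -16711936.</p>

```python
def float_to_ubyte(f):
    # Assumed approximation of Mesa's float_to_ubyte: clamp to [0, 1],
    # then scale to 0..255 and round.
    return int(round(max(0.0, min(1.0, f)) * 255))

def pack_r8g8b8a8_unorm(r, g, b, a):
    """Mirror the value |= ... shifts in
    util_format_r8g8b8a8_unorm_pack_rgba_float."""
    value = float_to_ubyte(r) & 0xff
    value |= (float_to_ubyte(g) & 0xff) << 8
    value |= (float_to_ubyte(b) & 0xff) << 16
    value |= float_to_ubyte(a) << 24
    return value

def as_int32(u):
    # Reinterpret the unsigned 32-bit word as the signed value gdb prints.
    return u - (1 << 32) if u & 0x80000000 else u

print(as_int32(pack_r8g8b8a8_unorm(0.0, 1.0, 0.0, 1.0)))  # prints -16711936
```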
<p>We end up at the clear_tile_rgba function which is setting the data in the buffer from the clear value.</p>
<pre><code>(gdb) bt 9
#0 0x00007fc076003783 in clear_tile_rgba (tile=0x7fc05c9c5000, format=
PIPE_FORMAT_R8G8B8A8_UNORM, clear_value=0x7fc06c1e968c) at sp_tile_cache.c:272
#1 0x00007fc0760040e5 in sp_find_cached_tile (tc=0x7fc06c1e9400, addr=...) at sp_tile_cache.c:579
#2 0x00007fc075febee9 in single_output_color (layer=<optimized out>, y=<optimized out>, x=<optimized out>, tc=<optimized out>) at sp_tile_cache.h:155
#3 0x00007fc075febee9 in single_output_color (qs=0x7fc07fb6c780, quads=0x7fc0665f3500, nr=1) at sp_quad_blend.c:1179
#4 0x00007fc075fefc9f in flush_spans (setup=setup@entry=0x7fc0665f1000) at sp_setup.c:251
#5 0x00007fc075ff0112 in subtriangle (setup=setup@entry=0x7fc0665f1000, eleft=eleft@entry=0x7fc0665f1058, eright=eright@entry=0x7fc0665f1028, lines=64) at sp_setup.c:759
#6 0x00007fc075ff0af2 in sp_setup_tri (setup=setup@entry=0x7fc0665f1000, v0=v0@entry=0x7fc089aea7c0, v1=v1@entry=0x7fc089aea7d0, v2=v2@entry=0x7fc089aea7e0) at sp_setup.c:853
#7 0x00007fc075fe71a2 in sp_vbuf_draw_arrays (vbr=<optimized out>, start=<optimized out>, nr=6) at sp_prim_vbuf.c:422
#8 0x00007fc075e2b704 in draw_pt_emit_linear (emit=<optimized out>, vert_info=<optimized out>, prim_info=0x7fff2da85f80)
at draw/draw_pt_emit.c:261
#9 0x00007fc075e2d025 in fetch_pipeline_generic (prim_info=0x7fff2da85f80, vert_info=0x7fff2da85e40, emit=<optimized out>) at draw/draw_pt_fetch_shade_pipeline.c:196
</code></pre>
<p>We use <code>origin</code> again twice to track through the store and the load.</p>
<pre><code>(gdb) origin
0x1000: movss dword ptr [rax - 0xc], xmm0
272 tile->data.color[i][j][1] = clear_value->f[1];
0x1000: movss xmm0, dword ptr [rbp + 4]
(gdb) origin
0x1000: movss xmm0, dword ptr [rbp + 4]
3
mem used *(int*)(0x7fc06c1e9690)
Hardware watchpoint 16: *(int*)(0x7fc06c1e9690)
Hardware watchpoint 16: *(int*)(0x7fc06c1e9690)
Old value = 1065353216
New value = 0
sp_tile_cache_clear (tc=0x7fc06c1e9400, color=color@entry=0x7fc05c3cfa4c, clearValue=clearValue@entry=0)
at sp_tile_cache.c:640
640 tc->clear_color = *color;
</code></pre>
<p>We end up in sp_tile_cache_clear which is setting up the clear color.</p>
<pre><code>(gdb) bt 9
#0 0x00007fc076004376 in sp_tile_cache_clear (tc=0x7fc06c1e9400, color=color@entry=0x7fc05c3cfa4c, clearValue=clearValue@entry=0) at sp_tile_cache.c:640
#1 0x00007fc075fe5c84 in softpipe_clear (pipe=0x7fc0650d5000, buffers=5, color=0x7fc05c3cfa4c, depth=1, stencil=0)
at sp_clear.c:71
#2 0x00007fc075cd8181 in st_Clear (ctx=0x7fc05c3ce000, mask=272) at state_tracker/st_cb_clear.c:539
#3 0x00007fc09c3a9427 in mozilla::gl::GLContext::raw_fClear(unsigned int) (this=0x7fc05c4e7000, mask=16640)
at /home/jrmuizel/src/gecko/gfx/gl/GLContext.h:952
#4 0x00007fc09c3a9456 in mozilla::gl::GLContext::fClear(unsigned int) (this=0x7fc05c4e7000, mask=16640)
at /home/jrmuizel/src/gecko/gfx/gl/GLContext.h:959
#5 0x00007fc09d7830bf in mozilla::WebGLContext::Clear(unsigned int) (this=0x7fc064bc7000, mask=16640)
at /home/jrmuizel/src/gecko/dom/canvas/WebGLContextFramebufferOperations.cpp:46
#6 0x00007fc09d202460 in mozilla::dom::WebGLRenderingContextBinding::clear(JSContext*, JS::Handle<JSObject*>, mozilla::WebGLContext*, JSJitMethodCallArgs const&) (cx=0x7fc078086400, obj=..., self=0x7fc064bc7000, args=...)
at /home/jrmuizel/src/gecko/obj-x86_64-unknown-linux-gnu/dom/bindings/WebGLRenderingContextBinding.cpp:11027
#7 0x00007fc09d6de3fa in mozilla::dom::GenericBindingMethod(JSContext*, unsigned int, JS::Value*) (cx=0x7fc078086400, argc=1, vp=0x7fc08f230210) at /home/jrmuizel/src/gecko/dom/bindings/BindingUtils.cpp:2644
#8 0x00007fc0a03af188 in js::Invoke(JSContext*, JS::CallArgs const&, js::MaybeConstruct) (args=..., native=0x7fc09d6de0bd <mozilla::dom::GenericBindingMethod(JSContext*, unsigned int, JS::Value*)>, cx=0x7fc078086400)
at /home/jrmuizel/src/gecko/js/src/jscntxtinlines.h:235
#9 0x00007fc0a03af188 in js::Invoke(JSContext*, JS::CallArgs const&, js::MaybeConstruct) (cx=0x7fc078086400, args=..., construct=js::NO_CONSTRUCT) at /home/jrmuizel/src/gecko/js/src/vm/Interpreter.cpp:489
(gdb)
</code></pre>
<p>Running a backtrace we see that this goes back to a call to WebGLContext::Clear. This is the actual clear call that triggers the code
that eventually sets the pixel to the value that we see when we call glReadPixels. At this point we've travelled through the entire pipeline, and we've done it
with minimal effort through the magic of <a href="https://github.com/mozilla/rr">rr</a>.</p>
Jeff Muizelaarhttp://www.blogger.com/profile/17483047845050494642noreply@blogger.com1tag:blogger.com,1999:blog-1386948037384435441.post-21490884958420352872015-06-08T12:51:00.001-07:002015-06-08T13:03:01.707-07:00Intel driver crash of the dayIn <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1170143">bug 1170143</a> we ran into an Intel driver crash when trying to share DXGI_FORMAT_A8_UNORM surfaces. Older Intel drivers crash while opening the texture using OpenShareHandle. The driver successfully opens BGRA surfaces, but not alpha surfaces, which we want to use for video playback. Who knows why... Here's a <a href="https://github.com/jrmuizel/d3d-tests/blob/master/alpha-texture-sharing.cc">test case</a>.Jeff Muizelaarhttp://www.blogger.com/profile/17483047845050494642noreply@blogger.com0tag:blogger.com,1999:blog-1386948037384435441.post-12504273665148651282015-06-01T12:33:00.000-07:002015-06-01T12:43:23.972-07:00Direct2D on top of WARPIn Firefox 38 we introduced the use of <a href="https://msdn.microsoft.com/en-us/library/windows/desktop/gg615082%28v=vs.85%29.aspx">WARP</a> for software rasterization on Windows 7. Early in the release we ran into an issue where using WARP on top of the builtin VGA driver was <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1165732">ridiculously slow</a>. We fixed this by disabling WARP when the VGA driver was being used, but I was curious how Internet Explorer avoids this issue. One big difference between Firefox and Internet Explorer is that we're not currently using Direct2D on top of WARP whereas they are. It turns out the WARP driver has a private API that is used by Direct2D to avoid having to use the regular D3D11 API.<br />
<br />
Profiling Internet Explorer shows the following private APIs used by Direct2D:<br />
d3d10warp.dll!UMDevice::DrawGlyphRun<br />
d3d10warp.dll!UMDevice::AlphaBlt2<br />
d3d10warp.dll!UMDevice::InternalGetDC<br />
d3d10warp.dll!UMDevice::CreateGeometry<br />
d3d10warp.dll!UMDevice::DrawGeometryInternal<br />
<br />
It looks like these turn into fairly traditional 2D graphics operations, as shown in the following call stack snippets:<br />
<br />
RasterizationStage::Rasterize_TEXT<br />
DrawGlyphRun6x1_B8G8R8A8_SSE<br />
DrawGlyphRun4x4_B8G8R8A8_SSE<br />
<br />
RasterizationStage::Rasterize_GEOMETRY<br />
PixelJITRasterizeGeometry<br />
PixelJITGeometryRasterizer::Rasterize<br />
WarpGeometry::Rasterize<br />
CAntialiasedFiller::RasterizeEdges<br />
CAntialiasedFiller::FillEdges<br />
CAntialiasedFiller::GenerateOutput<br />
PixelJITGeometryRasterizer::RasterizeComplexScan<br />
PixelJITGeometryRasterizer::BeginSpan<br />
InitializeEdges<br />
InitializeInactiveArray<br />
QuickSortEdges<br />
<br />
This suggests that using Direct2D on top of WARP is more efficient than expected and might actually make more sense than our current strategy of only using WARP for composition.Jeff Muizelaarhttp://www.blogger.com/profile/17483047845050494642noreply@blogger.com0tag:blogger.com,1999:blog-1386948037384435441.post-37437419017547088712015-03-16T12:40:00.000-07:002015-03-16T13:00:59.118-07:00Performance and feature improvements in Firefox 37 WebGL with D3D11 ANGLEFirefox 37 adds support for WebGL rendering using D3D11 on Windows. Up till now we were using D3D9, which has very limited support for cross-device synchronization. Without proper synchronization we were forced to wait on the main thread for WebGL content to finish rendering before we could continue script execution. The result of this is that the total frame rendering time would be the sum of the script time and the remaining GPU time. D3D11 allows us to use a <a href="https://msdn.microsoft.com/en-us/library/windows/desktop/ff471338%28v=vs.85%29.aspx">GPU-side read barrier</a> between the main thread and compositing thread. This lets Firefox avoid waiting on the main thread, giving improved responsiveness and more time for script execution.<br />
<br />
Here's a <a href="http://people.mozilla.org/~bgirard/webgl-tweak.html">test program</a> that lets you adjust the GPU and CPU execution times to see how the browser responds. D3D11 WebGL lets you adjust the CPU time up to nearly 15ms without dropping below 60fps.<br />
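A toy model makes the difference concrete (the functions and numbers below are illustrative, not measurements of the actual scheduler):

```python
FRAME_BUDGET_MS = 1000 / 60  # ~16.7ms per frame at 60fps

def synchronous_frame_ms(script_ms, gpu_ms):
    # D3D9-style: the main thread blocks until WebGL rendering finishes,
    # so each frame costs the script time plus the remaining GPU time.
    return script_ms + gpu_ms

def pipelined_frame_ms(script_ms, gpu_ms):
    # D3D11-style: a GPU-side read barrier lets the next frame's script
    # overlap GPU work, so the slower of the two stages sets the frame time.
    return max(script_ms, gpu_ms)

# Hypothetical frame with 15ms of script and 10ms of GPU work:
print(synchronous_frame_ms(15, 10) <= FRAME_BUDGET_MS)  # False: misses 60fps
print(pipelined_frame_ms(15, 10) <= FRAME_BUDGET_MS)    # True: holds 60fps
```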
<br />
D3D11 support also lets us expose the <a href="https://www.khronos.org/registry/webgl/extensions/WEBGL_draw_buffers/">WEBGL_draw_buffers</a> extension which allows drawing to multiple output buffers at the same time, functionality that's very helpful for implementing deferred renderers.<br />
<br />
Give D3D11 WebGL support a try in <a href="https://www.mozilla.org/en-US/firefox/channel/">Firefox Beta</a> today and let us know how it works.<br />
<br />
<br />Jeff Muizelaarhttp://www.blogger.com/profile/17483047845050494642noreply@blogger.com0tag:blogger.com,1999:blog-1386948037384435441.post-83367849046274484042012-08-23T17:14:00.000-07:002012-08-23T17:14:08.211-07:00The system worksOver the last few months a number of us at Mozilla have been working on a <a href="https://developer.mozilla.org/en-US/docs/Performance/Profiling_with_the_Built-in_Profiler">profiler built into Firefox</a>. One of the goals of this profiler is to make it as easy as possible to profile anywhere. Yesterday we had a satisfying realization of this goal.<br />
<br />
It all started with Taras' <a href="https://blog.mozilla.org/tglek/2012/08/16/snappy-36/">Snappy #36</a> update. In a comment, a user going by kamulos reported <a href="https://blog.mozilla.org/tglek/2012/08/16/snappy-36/#comment-35935">a recent problem</a> with laggy tab switches. kamulos posted the <a href="http://people.mozilla.com/~bgirard/cleopatra/?report=542efa04d60977a067a3623a00765d837cb952ba">profile</a> and Benoit Girard filed <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=784756">bug 784756</a>. The profile showed us spending a bunch of time in TimeStamp::Now() during image decode. I wasn't particularly surprised by this because our TimeStamp::Now() implementation on Windows is not particularly fast. Ehsan and I went away and put some effort into improving the performance and have some good candidates in <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=784859">bug 784859</a>. In the meantime, Robert Lickenbrock discovered that the problem was recently introduced by <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=685516">bug 685516</a>, which unintentionally caused a fixed time delay where we called TimeStamp::Now() in a loop. He has since posted a patch that fixes the problem.<br />
<br />
Here we have two community members helping uncover a problem within a week of it landing, a problem that could have otherwise gone undetected for a long time. This is a great example of an open source community working beautifully.Jeff Muizelaarhttp://www.blogger.com/profile/17483047845050494642noreply@blogger.com7tag:blogger.com,1999:blog-1386948037384435441.post-86893397465834956272012-07-13T14:55:00.001-07:002012-07-13T19:01:01.378-07:00What happens when you switch to a Gmail tab on OS XWhat follows is a brief walk-through of what happens when you switch to a Gmail tab. You can follow along in the <a href="http://people.mozilla.com/~bgirard/cleopatra/?report=b5fdedb10d1a09f9a536ab564f44a92398f28054">profile</a>.<br />
<br />
The process starts with [GeckoNSApplication sendEvent:] for the mouse event. This travels on down to nsXBLEventHandler::HandleEvent(). From there, we call into the JS, specifically onxblmousedown() in tabbox.xml. This eventually calls into set__selected() and set_selectedPanel(). set_selectedIndex() calls onselect() in browser.xul which ends up taking about 14ms. During onselect() we spend 4ms decoding an image, 3ms in callProgressListeners() and 3ms in GetBoundingClientRect(). The whole process of handling the click event takes about 15ms.<br />
<br />
After that we spend 6ms handling some events. Among these are a RefreshDriver tick and a toolkit paint. Afterwards we wait for 12ms.<br />
<br />
33ms after the original click we start the painting process. First we do a [NSView viewWillDraw] which calls into PresShell::WillPaint() and takes 3ms. Finally we start the actual 85ms paint in PresShell::Paint().<br />
<br />
Here's the breakdown of what we're doing during paint. Of the 85ms, 81ms is in LayerManagerOGL::Render(). 5ms of that is clearing the surface in BasicBufferOGL::BeginPaint(), and 11ms is texture upload which does a useless format conversion (<a href="https://bugzilla.mozilla.org/show_bug.cgi?id=613046">bug 613046</a>). In between these two is 58ms of DrawThebesLayer() of which 39ms is BasicLayerManager::EndTransactionInternal doing composition. A lot of this seems to be VM badness caused by cairo/CoreGraphics and its weird copy-on-write semantics. The rest of the time is 6ms in nsDisplayText::Paint, 5ms in nsDisplayBackground::Paint, 4ms in nsDisplayBorderBackground::Paint, and 3ms in nsDisplayBorder::Paint(). Unfortunately, of the 85ms only 18ms is painting display items, and of that 18ms less than half is actual painting operations inside of CoreGraphics.<br />
<br />
Shortly after PresShell::Paint() the new content is displayed on the screen and we run a couple more events and a garbage collection. And that's what happens during the 130ms switch to a Gmail tab.Jeff Muizelaarhttp://www.blogger.com/profile/17483047845050494642noreply@blogger.com7tag:blogger.com,1999:blog-1386948037384435441.post-78074738359580172012-06-22T13:21:00.002-07:002012-06-22T13:21:45.308-07:00Resizable windows in UbuntuBy default, Ubuntu ships with window resizers that are very small and difficult to hit exactly with the mouse. This is made worse by the fact that the resize cursor jumps to a different location. You can fix this by switching to the High Contrast theme. This adds a visible resizer to some windows. It does make the rest of the UI look terrible, but that's a price I'm willing to pay to be able to resize my terminals.Jeff Muizelaarhttp://www.blogger.com/profile/17483047845050494642noreply@blogger.com2tag:blogger.com,1999:blog-1386948037384435441.post-67532565430618748072012-04-24T13:57:00.000-07:002012-04-24T13:57:07.993-07:00Azure canvas on OS XFirefox 12 is the first release in which we use the new CoreGraphics backend for canvas. This brings a host of performance improvements that largely come from removing overhead and semantic mismatches between HTML canvas and CoreGraphics. <br />
<br />
Here are some examples:<br />
GUIMark2 Vector: from 6.29 fps to 6.63 fps<br />
GUIMark2 Bitmap: from 17.62 fps to 22.9 fps<br />
Fish IE goes from a high quality but embarrassing 7 fps with 10 fish to 48 fps with 250 fish.<br />
td:first-child {text-align:right}
</style>
<br />
The graphics team has been spending most of its time working on "off-main thread compositing" (OMTC) on the maple project branch. By separating Gecko into two threads, a content thread and a composition thread, we hope to make interacting with Firefox on Android more pleasant, because panning around pages won't have to wait on content.<br />
<br />
When you pan to an area that hasn't been drawn yet, we display a "checkerboard" indicating that the content will be shown soon. Obviously, we want to minimize the time we display the checkerboard.<br />
<br />
Here is the current breakdown of where we spend our time while panning around on cnn.com:<br />
<br />
<table><tbody>
<tr><td>34.2%</td><td>painting</td></tr>
<tr><td>14.7%</td><td>waiting for texture upload to finish</td></tr>
<tr><td>12.2%</td><td>sleeping</td></tr>
<tr><td>8.2%</td><td>building display lists</td></tr>
<tr><td>3.4%</td><td>servicing timers</td></tr>
<tr><td>27.3%</td><td>other</td></tr>
</tbody></table>Jeff Muizelaarhttp://www.blogger.com/profile/17483047845050494642noreply@blogger.com0tag:blogger.com,1999:blog-1386948037384435441.post-16076667917955677252011-10-19T13:17:00.000-07:002011-10-19T13:17:11.717-07:00Moving patches between git and hgMoving patches between git and hg is currently not very easy. I found a script that converts in one direction and added a script that goes in the other direction. The scripts are available here: <a href="https://github.com/jrmuizel/patch-converter">https://github.com/jrmuizel/patch-converter</a>. Hopefully, this will make it a bit easier.Jeff Muizelaarhttp://www.blogger.com/profile/17483047845050494642noreply@blogger.com4tag:blogger.com,1999:blog-1386948037384435441.post-12214860194820539952011-06-16T14:04:00.000-07:002011-06-16T14:54:29.238-07:00WebGL considered harmful?Today Microsoft posted an article titled "WebGL considered harmful". It seems like a lot of their arguments against WebGL also apply to Silverlight 5's XNA 3D graphics support. It, like WebGL, allows authors to write shaders using HLSL. I wonder, if you reframe their article by replacing WebGL with Silverlight 5, is anything untrue? If so, how does Microsoft solve these problems?<blockquote><h2>Silverlight XNA 3D considered harmful</h2>Microsoft's Silverlight 5 XNA 3D technology is a low-level 3D graphics API for the web.<br /><br />One of the functions of MSRC Engineering is to analyze various technologies in order to understand how they can potentially affect Microsoft products and customers. As part of this charter, we recently took a look at XNA 3D. Our analysis has led us to conclude that Microsoft products supporting XNA 3D would have difficulty passing Microsoft’s <a href="http://www.microsoft.com/security/sdl/default.aspx">Security Development Lifecycle</a> requirements. 
Some key concerns include:<br /><ul><li><span style="font-weight: bold;">Browser support for Silverlight 5 directly exposes hardware functionality to the web in a way that we consider to be overly permissive</span><br />The security of Silverlight 5 as a whole depends on lower levels of the system, including OEM drivers, upholding security guarantees they never really needed to worry about before. Attacks that may have previously resulted only in local elevation of privilege may now result in remote compromise. While it may be possible to mitigate these risks to some extent, the large attack surface exposed by Silverlight 5 remains a concern. We expect to see bugs that exist only on certain platforms or with certain video cards, potentially facilitating targeted attacks.<br /></li><br /> <li> <span style="font-weight: bold;">Browser support for Silverlight 5 security servicing responsibility relies too heavily on third parties to secure the web experience</span><br />As Silverlight 5 vulnerabilities are uncovered, they will not always manifest in the Silverlight 5 API itself. The problems may exist in the various OEM and system components delivered by IHVs. While it has been suggested that Silverlight 5 implementations may block the use of affected hardware configurations, this strategy does not seem to have been successfully put into use to address existing vulnerabilities. It is our belief that as configurations are blocked, increasing levels of customer disruption may occur. Without an efficient security servicing model for video card drivers (e.g. Windows Update), users may either choose to override the protection in order to use Silverlight 5 on their hardware, or remain insecure if a vulnerable configuration is not properly disabled. Users are not accustomed to ensuring they are up-to-date on the latest graphics card drivers, as would be required for them to have a secure web experience. 
In some cases where OEM graphics products are included with PCs, retail drivers are blocked from installing. OEMs often only update their drivers once per year, a reality that is just not compatible with the needs of a security update process.</li><br /><li><span style="font-weight: bold;">Problematic system DoS scenarios<br /></span> Modern operating systems and graphics infrastructure were never designed to fully defend against attacker-supplied shaders and geometry. Although mitigations such as Direct3D 10 may help, they have not proven themselves capable of comprehensively addressing the DoS threat. While traditionally client-side DoS is not a high severity threat, if this problem is not addressed holistically it will be possible for any web site to freeze or reboot systems at will. This is an issue for some important usage scenarios such as in critical infrastructure.</li></ul><br />We believe that Silverlight 5 will likely become an ongoing source of hard-to-fix vulnerabilities. In its current form, XNA 3D in Silverlight 5 is not a technology Microsoft can endorse from a security perspective.<br /><br />We recognize the need to provide solutions in this space; however, it is our goal that all such solutions are secure by design, secure by default, and secure in deployment.<br /></blockquote><br />The problems Microsoft is worried about are real, and they don't have any easy solutions. At the same time, I don't think we need to wait for perfect answers before trying. With Silverlight 5's 3D support, it looks like Microsoft feels the same way.Jeff Muizelaarhttp://www.blogger.com/profile/17483047845050494642noreply@blogger.com10tag:blogger.com,1999:blog-1386948037384435441.post-7216243929107017592011-04-20T07:26:00.000-07:002011-05-24T18:20:55.019-07:00WebP<p>Overall the reception to <a href="http://code.google.com/speed/webp/faq.html#whatis">WebP</a> that I've seen so far has been pretty negative. 
Jason Garrett-Glaser wrote a <a href="http://x264dev.multimedia.cx/archives/541">popular review</a>, but there have been similar responses from others like <a href="http://cbloomrants.blogspot.com/2010/10/10-02-10-webp.html">Charles Bloom</a>. Since these reviews, the WebP encoder has improved on the example used by Jason (<a href="http://x264.nl/developers/Dark_Shikari/imagecoding/vp8.png">old</a> vs. <a href="http://people.mozilla.org/%7Ejmuizelaar/webp/parkjoy-webp.png">new</a>) but it's still not a lot better than a decent <a href="http://people.mozilla.org/%7Ejmuizelaar/webp/parkjoy.jpg">JPEG encoding</a>. I also have a couple of thoughts on the format that I'd like to share.</p> <p> Google <a href="http://code.google.com/speed/webp/docs/c_study.html">claims it's better</a> than JPEG, but this study has some problems and, as a result, isn't very convincing (<span style="font-weight: bold;">Update: </span>Google has a <a href="http://code.google.com/speed/webp/docs/webp_study.html">new study</a> that's better). First, they recompress existing JPEGs. This is unconventional. Perhaps recompressing JPEGs is their target market, but I find that a little weird and it should at least be explained in the study. Second, they use PSNR as a comparison metric. This is even more confusing. PSNR has, for a while now, been accepted as a poor measure of visual quality and I can't understand why Google continues to use it. I think it would help the format's credibility if Google did a study that used uncompressed source images, SSIM as a metric, and provided enough information about the methodology so that others could reproduce their results. </p> <p> WebP also comes across as half-baked. Currently, it only supports a subset of the features that JPEG has. It lacks support for any color representation other than 4:2:0 YCrCb. JPEG supports 4:4:4 as well as other color representations like CMYK. 
WebP also seems to lack support for EXIF data and ICC color profiles, both of which have become quite important for photography. Further, it has yet to add any features JPEG lacks, like alpha channel support. These features can still be added, but the longer they remain unspecified, the more difficult it will be to adopt. </p> <p> <a href="http://en.wikipedia.org/wiki/JPEG_XR">JPEG XR</a> provides a good example of what features you'd want from a replacement for JPEG. It has support for an alpha channel and <a href="http://en.wikipedia.org/wiki/High_dynamic_range_imaging">HDR</a> among <a href="http://en.wikipedia.org/wiki/JPEG_XR#Capabilities">others</a>. Microsoft has also put in the effort to have it formally standardized. However, it too is not without problems. The compression improvements it claims haven't matched evaluations other parties have done. I don't know enough about JPEG XR to say whether this is because the <a href="http://x264dev.multimedia.cx/archives/164">encoders are bad</a> or because the format is not really that great. </p> <p> Every image format that becomes “part of the Web platform” exacts a cost for all time: all clients have to support that format forever, and there's also a cost for authors having to choose which format is best for them. This cost is no less for WebP than any other format because progressive decoding requires using a separate library instead of reusing the existing WebM decoder. This adds security risk and also eliminates much of the benefit of having bitstream compatibility with WebM. It makes me wonder, why not just change the bitstream so that it's more suitable for a still image codec? Given every format has a cost, if we're going to have a new image format for the Web we really need to make it the best we can make it with today's (royalty-free) technology.</p> <p> Where does that leave us? 
WebP gives a subset of JPEG's functionality with more modern compression techniques and no additional IP risk to those already shipping WebM. I'm really not sure it's worth adding a new image format for that. Even if WebP was a clear winner in compression, large image hosts don't seem to care that much about image size. Flickr compresses their images at libjpeg quality of 96 and Facebook at 85: both quite a bit higher than the recommended 75 for <a href="http://google.com/codesearch/p?hl=en#M3EzZdztQo0/pub/graphics/packages/jpeg/jpegsrc.v6.tar.gz%7Cu6QbQHjGtGQ/jpeg-6/jcparam.c&l=70">“very good quality”</a>. Neither of them optimizes the Huffman tables, which gives a lossless 4–7% improvement in size. Further, switching to progressive JPEG gives an even larger improvement of 8–20%.</p> <p> History has shown that adoption of image formats on the internet is slow. JPEG 2000 has mostly failed on the internet. PNG took a very long time, despite having large advantages. I expect that adoption may even be slower now than it was in the past, because there is no driving force. I would also be surprised if Microsoft adopted WebP because of their stance on WebM and their involvement in JPEG XR. Can WebP succeed without being adopted by all of the major web browsers? It's hard to say, but it wouldn't be easy. Personally, I'd rather the effort being spent on WebP be spent on an <a href="http://cbloomrants.blogspot.com/2010/10/10-08-10-optimal-baseline-jpeg.html">improved JPEG encoder</a> or even an improved JPEG XR encoder. </p> <p> Is JPEG still great? No. Is there a great replacement for it? It doesn't feel like we're there yet. 
</p>Jeff Muizelaarhttp://www.blogger.com/profile/17483047845050494642noreply@blogger.com31tag:blogger.com,1999:blog-1386948037384435441.post-15885017440364033642011-03-02T21:07:00.000-08:002011-03-02T21:51:57.008-08:00Drawing Sprites: Minimizing draw callsOne reason OpenGL is so fast is that it allows applications to provide large chunks of work to be done in parallel. When drawing sprites with WebGL, it's important to make an effort to take advantage of this by minimizing the number of <a href="http://www.khronos.org/registry/webgl/specs/latest/#5.13.11">draw calls</a>. This is true with OpenGL, but even more so with WebGL because each draw call requires extra validation.<br /><br />Unfortunately, minimizing draw calls isn't always easy. It's often impractical or impossible to draw all your geometry at once because the geometry must share the same texture(s). FishIE used a single <a href="http://blog.mozilla.com/webdev/2009/03/27/css-spriting-tips/">sprite</a> from the beginning, which made it easy to draw everything at once. Move as many sprites as possible into the same texture, and sort or group sprites that share a texture into a single draw call. It may also be possible to use multi-texturing, but depending on the GPU architecture, this can cause all textures to be read for each sprite, which will have a dramatic impact on performance because of limitations on texture bandwidth.<br /><br />The performance difference between drawing sprites individually versus all at once can be pretty big. I made another version of the <a href="http://people.mozilla.org/%7Ejmuizelaar/fishie/fishie-gl-individual.html">FishIE demo that draws each sprite individually</a>. This version draws 2000 fish at 10fps on my test system, while the original <a href="http://people.mozilla.org/%7Ejmuizelaar/fishie/fishie-gl.html">WebGL FishIE</a> can do 4000 fish at 60fps on the same system. 
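The batching approach can be sketched in plain Javascript. This is only a sketch with hypothetical names (groupByTexture, packQuads, and the sprite fields), not FishIE's actual code: sprites are grouped so every batch shares one texture, and each batch is packed into a single vertex array so the whole group can be submitted with one draw call.

```javascript
// Sketch: batch sprites so each texture costs one draw call instead of one
// call per sprite. Names and the sprite shape are hypothetical.

// Group sprites so that every batch shares a single texture.
function groupByTexture(sprites) {
  const batches = new Map();
  for (const s of sprites) {
    if (!batches.has(s.texture)) {
      batches.set(s.texture, []);
    }
    batches.get(s.texture).push(s);
  }
  return batches;
}

// Pack one batch into a Float32Array of (x, y) positions: two triangles
// (six vertices) per sprite, ready to upload once and draw with a single call.
function packQuads(batch) {
  const verts = new Float32Array(batch.length * 12);
  let o = 0;
  for (const s of batch) {
    const x0 = s.x, y0 = s.y, x1 = s.x + s.w, y1 = s.y + s.h;
    verts.set([x0, y0, x1, y0, x0, y1], o);     // first triangle
    verts.set([x1, y0, x1, y1, x0, y1], o + 6); // second triangle
    o += 12;
  }
  return verts;
}
```

Each batch then costs one buffer upload and one drawArrays call; in a real WebGL port the texture coordinates would be packed alongside the positions.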
Since the same texture is used for all sprites, I did not have to rebind the texture for each sprite; doing so would likely decrease performance further.<br /><br />Designing an application around these limitations can be tricky, but often the application is in a better position to make compromises or take shortcuts than a more general Canvas 2D implementation would be.Jeff Muizelaarhttp://www.blogger.com/profile/17483047845050494642noreply@blogger.com1tag:blogger.com,1999:blog-1386948037384435441.post-35439479082052383122011-02-28T13:29:00.000-08:002011-03-01T18:03:24.598-08:00Drawing Sprites: Canvas 2D vs. WebGLLately I've seen a lot of graphics benchmarks that basically just test image blitting/sprite performance. These include <a href="http://ie.microsoft.com/testdrive/Performance/FlyingImages/">Flying Images</a>, <a href="http://ie.microsoft.com/testdrive/Performance/FishIETank/Default.html">FishIE</a>, <a href="http://ie.microsoft.com/testdrive/Performance/SpeedReading/Default.html">Speed Reading</a> and <a href="https://developers.facebook.com/blog/post/460">JSGameBench</a>(<b>Update:</b> I just saw the blog post for the <a href="http://developers.facebook.com/blog/post/468">WebGL JSGameBench</a>. This further confirms my claim that WebGL is a better way to do sprites). They all try to draw a bunch of images in a short amount of time. They mostly use two techniques: positioned images or canvas' drawImage. Neither of these methods is particularly well suited to this task. Positioned images have typically been used for document layout, and the Canvas 2D API was designed as a JavaScript binding to CoreGraphics, which owes most of its design to PostScript. Neither was designed for high-performance interactive graphics. However, OpenGL, and its web counterpart WebGL, was designed for exactly this.<br /><br />To show off some of the potential performance difference available, I ported the FishIE benchmark to WebGL. 
Along the way I discovered some different problems and ways to solve them.<br /><br />The problem, once the overhead of Canvas 2D is removed, is that FishIE very quickly becomes texture read bound. I noticed that the FishIE sprites have a lot of horizontal padding. This padding was included in the drawImage calls which causes us to do a bunch of texture reads for transparent pixels. Trimming this down a little gave a noticeable framerate boost.<br /><br />An even bigger cause of texture bandwidth waste is that the demo uses a large sprite to draw a small fish. Fortunately, OpenGL has a great solution to this problem: <a href="http://en.wikipedia.org/wiki/Mipmap">mipmaps</a>. <a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhoVPnuKK8rRxr5lMmpg9YzGnYvAUz8bojr8YgwnXNke4bNt_knwygvYo4ysURlxWH2_hPaEtNzIlup53HJPqP6PeZjDlQH2yWPLN8cF_yVkT_uPCMoPR3iLAmUlu9tDi65D6EE2zT4o1A/s1600/mipmap-out.png"><img style="float:right; margin:0 0 10px 10px;cursor:pointer; cursor:hand;width: 80px; height: 80px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhoVPnuKK8rRxr5lMmpg9YzGnYvAUz8bojr8YgwnXNke4bNt_knwygvYo4ysURlxWH2_hPaEtNzIlup53HJPqP6PeZjDlQH2yWPLN8cF_yVkT_uPCMoPR3iLAmUlu9tDi65D6EE2zT4o1A/s400/mipmap-out.png" title="without mipmaps" alt="without mipmaps" id="BLOGGER_PHOTO_ID_5578865905397666146" border="0" /></a><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj0v6Ipp4KbCU5Z5ZiPfP3IgZax3xq7BvnPbbz9QznCar1G3xTUCIGd-PO0aLOMt85M80HhQBcJtuZR8XsGJW5XCCcMYM3vDim_vhr7zO4iPwNTrTh7SwbfwyxjFKz0gJpPhJqROvHURBE/s1600/alias-out.png"><img style="float:right; margin:0 0 10px 10px;cursor:pointer; cursor:hand;width: 80px; height: 80px;" 
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj0v6Ipp4KbCU5Z5ZiPfP3IgZax3xq7BvnPbbz9QznCar1G3xTUCIGd-PO0aLOMt85M80HhQBcJtuZR8XsGJW5XCCcMYM3vDim_vhr7zO4iPwNTrTh7SwbfwyxjFKz0gJpPhJqROvHURBE/s400/alias-out.png" alt="" id="BLOGGER_PHOTO_ID_5578865816967083602" border="0" /></a>Mipmaps let the GPU use smaller textures when drawing smaller fish, which can dramatically reduce the texture bandwidth required. They also improve the quality of small fish by eliminating the aliasing that occurs when downscaling by large amounts.<br /><br />Mipmapping is a good example of the flexibility that WebGL allows. Canvas 2D aims to be an easy-to-use API for drawing pictures, but this ease of use comes at some cost. First, the Canvas 2D implementation has to guess the intent of the author. For example, drawImage on OS X does a high-quality Lanczos downscaling of the image. Direct2D just does a quick bilinear downscale. This makes it difficult for authors to know how fast drawImage will be. Further, because the design of Canvas 2D is inspired by an API for describing print jobs, it's not well suited to reusing data between paints.<br /><br />Try out the difference with these two modified versions of FishIE:<ol><li><a href="http://people.mozilla.org/%7Ejmuizelaar/fishie/fishie.html">The original FishIE modified only to allow more fish</a>.</li><li><a href="http://people.mozilla.org/%7Ejmuizelaar/fishie/fishie-gl.html">FishIE ported to WebGL</a>.</li></ol> The method I used to port FishIE to WebGL is pretty straightforward, so I expect that any of the other benchmarks listed above could also be easily ported to WebGL.<br /><h2>Pushing the limits</h2>Once the number of fish becomes high enough, we run into Javascript performance problems. FishIE has some Javascript problems that make things worse than they need to be. First, it loops over the fish with "for (var fishie in fish) {". This can end up using 10% of the total CPU time. 
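This pitfall is easy to demonstrate with a small sketch (not the demo's actual code): for...in walks enumerable property keys as strings, including any extra properties added to the array, while a plain indexed loop touches only the real elements.

```javascript
// Sketch of the for...in pitfall (not FishIE's actual code).
const fish = ["a", "b", "c"];
fish.tankName = "FishIE"; // an extra property, as can happen accidentally

const forInKeys = [];
for (const fishie in fish) {
  forInKeys.push(fishie); // collects string keys: "0", "1", "2", "tankName"
}

const indexed = [];
for (let i = 0; i < fish.length; i++) {
  indexed.push(fish[i]); // numeric indexing, real elements only
}
```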
The problem with this code is that it converts all of the array indices to strings and then uses those strings to index into the array. It also has the problem that any additional properties added to the array will also show up as index values, which is likely not the intent of the author.<br /><br />Second, each fish object includes a swim() method. Unfortunately, in the FishIE source swim() is a closure inside the Fish() object. This means that the swim() method is different for each Fish, which makes things worse for Javascript engines.<br /><br />Fixing both of these problems and making the fish really small lets us get an idea of how many sprites we can actually push around. Here's a <a href="http://people.mozilla.org/%7Ejmuizelaar/fishie/fishie-fast.html">final version</a>. If I disable the method jit (<a href="https://bugzilla.mozilla.org/show_bug.cgi?id=637878">bug 637878</a>) and run at an even window size (<a href="https://bugzilla.mozilla.org/show_bug.cgi?id=637894">bug 637894</a>) I can do 60000 fish at 30fps, which I think is pretty impressive compared to the 1000 that the original Microsoft demo does.Jeff Muizelaarhttp://www.blogger.com/profile/17483047845050494642noreply@blogger.com18tag:blogger.com,1999:blog-1386948037384435441.post-86228978346889618182011-02-18T12:55:00.000-08:002011-02-18T13:03:49.210-08:00Updated mozilla-cvs-history git repoI recently ran git gc --aggressive on the cvs history git repository mentioned <a href="http://muizelaar.blogspot.com/2010/02/historical-mozilla-central-git.html">here</a>. It's now 543M, down from 986M. 
I've also uploaded a copy to <a href="https://github.com/jrmuizel/mozilla-cvs-history">github</a>.Jeff Muizelaarhttp://www.blogger.com/profile/17483047845050494642noreply@blogger.com2tag:blogger.com,1999:blog-1386948037384435441.post-82293650476542314612011-02-10T10:43:00.000-08:002011-02-10T11:33:07.235-08:00Clone timingsChris Atlee was wondering how clone times differ between git and mercurial so I ran a quick test on a fast linux machine.<br /><br />$ time git clone git://github.com/doublec/mozilla-central.git<br />real 1m33.478s<br /><br />$ time git clone mozilla-central moz2<br />real 0m2.559s<br /><br /><br />$ time hg clone http://hg.mozilla.org/mozilla-central/<br />real 3m22.510s<br /><br />$ time hg clone mozilla-central moz2<br />real 0m20.660sJeff Muizelaarhttp://www.blogger.com/profile/17483047845050494642noreply@blogger.com6tag:blogger.com,1999:blog-1386948037384435441.post-3414772750389115802011-01-12T16:37:00.000-08:002011-01-12T13:38:40.531-08:00historical mozilla-central git repositoryA number of people use git to work with the mozilla hg tree. 
In the past I've wanted the entire history as a git repo, so I converted the old CVS repository to git and put it up on people.mozilla.org.<br /><br />You can set it up as follows:<br /><code style="font-size: small"><br />git clone http://people.mozilla.org/~jmuizelaar/mozilla-cvs-history.git<br />git clone git://bluishcoder.co.nz/git/mozilla-central.git<br /><br />cd mozilla-central/.git/objects/pack<br /># set up symbolic links to the cvs-history pack files<br />ln -s ../../../../mozilla-cvs-history/.git/objects/pack/pack-5b5d604ab48cf7bc2a6b4495292fa8700a987c5f.pack .<br />ln -s ../../../../mozilla-cvs-history/.git/objects/pack/pack-5b5d604ab48cf7bc2a6b4495292fa8700a987c5f.idx .<br />cd ../../<br /><br /># add a graft from the last revision in the mozilla-central repo<br /># to the first revision in the cvs-history<br />echo 2514a423aca5d1273a842918589e44038d046a51 3229d5d8b7f8376cfb7936e7be810635a14a486b > info/grafts<br /></code><br />Now you have a git repository containing all of the history. You can update the mozilla-central repository as you normally would. The conversion isn't perfect, but it's been good enough to have working blame back into cvs time.Jeff Muizelaarhttp://www.blogger.com/profile/17483047845050494642noreply@blogger.com6tag:blogger.com,1999:blog-1386948037384435441.post-42543437431149257042011-01-11T14:05:00.000-08:002011-01-11T15:24:54.234-08:00Firefox acceleration prefs changingI just landed a changeset that changes the names of the layer acceleration prefs in Firefox.<br /><br />The old prefs were:<br /> layers.accelerate-all<br /> layers.accelerate-none<br /><br />The new prefs are:<br /> layers.acceleration.disabled<br /> layers.acceleration.force-enabled<br /><br />layers.accelerate-all previously defaulted to 'true' on Windows and OS X, which meant that there was no easy way to force layer acceleration on if your card had been blacklisted for some reason. The new prefs allow the blacklist to be overridden. 
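For example, forcing acceleration on can be expressed as a user.js fragment; this is a sketch, with boolean values assumed from the pref names above:

```javascript
// Sketch of a user.js fragment using the new pref names
// (boolean values assumed).
user_pref("layers.acceleration.force-enabled", true);
user_pref("layers.acceleration.disabled", false);
```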
The old prefs are not being migrated over to the new names. If you have a problem with the defaults, please file bugs.Jeff Muizelaarhttp://www.blogger.com/profile/17483047845050494642noreply@blogger.com1tag:blogger.com,1999:blog-1386948037384435441.post-89003922121684967082011-01-08T18:42:00.001-08:002011-01-09T18:44:17.659-08:00Trying out AVXIntel's new <a href="http://www.realworldtech.com/page.cfm?ArticleID=RWT091810191937&p=1">Sandy Bridge</a> CPUs came out this week and they support a new set of instructions called <a href="http://software.intel.com/en-us/avx/">AVX</a>. The AVX instructions are a much bigger change than the usual SSE revisions in the past few micro-architectures. First of all, they double the 128 bit SSE registers to 256 bits. Second, they introduce an entirely new instruction <a href="http://en.wikipedia.org/wiki/VEX_prefix">encoding</a>. The new encoding switches from 2 operand instructions to 3 operand instructions, allowing the destination register to be different from the source registers. For example:<br /><code> addps r0, r1 # (r0 = r0 + r1)</code><br /> vs.<br /><code> vaddps r0, r1, r2 # (r0 = r1 + r2)</code><br />This new encoding is not only used for the new 256 bit instructions, but also for the 128 bit AVX versions of all the old SSE instructions. This means that existing SSE code can be improved without requiring a switch to 256 bit registers. Finally, AVX introduces some new data movement instructions, which should help improve code efficiency.<br /><br />I decided to see what kind of performance difference using AVX could make in <a href="http://mxr.mozilla.org/mozilla-central/source/gfx/qcms/transform-sse2.c#13">qcms</a> with minimal effort. If you use SSE compiler intrinsics, like qcms does, switching to AVX is very easy; simply recompile with -mavx. 
In addition to using -mavx, I also took advantage of some of the new data movement instructions by replacing the following:<br /><code> vec_r = _mm_load_ss(r);<br /> vec_r = _mm_shuffle_ps(vec_r, vec_r, 0);</code><br />with the new vbroadcastss instruction:<br /><code> vec_r = _mm_broadcast_ss(r);</code><br />Overall, this change reduces the inner loop by 3 instructions.<br /><br />The performance results were positive, but not what I expected. Here's what the timings were:<table style="width: auto; margin-left: 1em; -moz-font-feature-settings: "tnum=1";"><tbody><tr><td style="line-height: auto">SSE2:</td><td style="line-height: auto; color: #444444">75798 usecs</td></tr><tr><td style="line-height: auto">AVX (-mavx):</td><td style="line-height: auto">69687 usecs</td></tr><tr><td>AVX w/ vbroadcastss:</td><td style="line-height: auto;">72917 usecs</td></tr></tbody></table>Switching to the AVX encoding improves performance by more than I expected: about 8%. But adding the new <code>vbroadcastss</code> instruction, in addition to the AVX encoding, not only doesn't help, but actually makes things worse. I tried analyzing the code with the <a href="http://software.intel.com/en-us/articles/intel-architecture-code-analyzer/">Intel Architecture Code Analyzer</a>, but the analyzer also thought that using <code>vbroadcastss</code> should be faster. If anyone has any ideas why <code>vbroadcastss</code> would be slower, I'd love to hear them.<br /><br />Despite this weird performance problem, AVX seems like a good step forward and should provide good opportunities for improving performance beyond what's possible with SSE. 
For more information, check out this <a href="http://software.intel.com/file/24742">presentation</a>, which gives a good overview of how to take advantage of AVX.Jeff Muizelaarhttp://www.blogger.com/profile/17483047845050494642noreply@blogger.com5tag:blogger.com,1999:blog-1386948037384435441.post-31801004755427701452010-12-18T21:53:00.000-08:002010-12-19T18:57:06.523-08:00Improved Hardware Acceleration in FennecOn Thursday night, after the all-hands party, Matt Woodrow landed a beautiful <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=604101">refactoring</a> of our texture upload code. This should give a noticeable improvement in scrolling performance when accelerated layers are enabled and hopefully fixes some of the problems people were seeing there. It also improves texture upload performance on OS X.<br /><br />Unfortunately, there are still two bugs that are keeping us from enabling accelerated layers by default:<br /><a href="https://bugzilla.mozilla.org/show_bug.cgi?id=619615">Bug 619615</a> - Hangs on Nexus One<br /><a href="https://bugzilla.mozilla.org/show_bug.cgi?id=619539">Bug 619539</a> - Startup crashes on Droid<br /><br />Any help debugging these problems would be greatly appreciated.Jeff Muizelaarhttp://www.blogger.com/profile/17483047845050494642noreply@blogger.com5tag:blogger.com,1999:blog-1386948037384435441.post-24182270555635387432010-12-15T12:24:00.000-08:002010-12-15T13:23:34.965-08:00Hardware Acceleration on FennecIt's now possible with current nightlies to use OpenGL for compositing in Fennec on Android. To turn it on, go to about:config and set "layers.accelerate-all" to "true" and restart. 
If it's working, you can go to about:support and the Graphics section will say "1/1 OpenGL".<br /><br />It would be great if people could test it and let me know how it goes.Jeff Muizelaarhttp://www.blogger.com/profile/17483047845050494642noreply@blogger.com12tag:blogger.com,1999:blog-1386948037384435441.post-88776139208109256062010-11-08T11:54:00.000-08:002010-11-10T11:16:03.710-08:00Dealing with mach_kernel in SharkSometimes when profiling, a bunch of time ends up in mach_kernel. Figuring out why isn't always easy, but here are two tips that should help a bit:<br /><br /><ul><li>You can get better symbols for mach_kernel by downloading a <a href="http://developer.apple.com/hardwaredrivers/download/kerneldebugkits.html">KernelDebugKit</a><br />This can help a bit when trying to figure out what's happening in the kernel. For example, _dtrace_get_cpu_int_stack_t becomes _mach_call_munger.<br /></li><br /><li>Shark has a System Trace profiling mode. This can show you what code is causing the kernel to do work. It can break down time by system call or by vm fault, which should account for most things.<br /><br />While trying this out, I noticed we were spending a fair amount of time in ChildViewMouseTracker::WindowForEvent(NSEvent*). This gave me the idea that the reason that Firefox causes the WindowServer process to start using a huge amount of CPU is because we tell the WindowServer to give us all of the mouse events instead of the ones only targeted at our window. Presumably this causes the WindowServer to build up a very large queue of events when the Firefox process is stopped and thus use lots of CPU. This turns out to be the case. nsToolkit::RegisterForAllProcessMouseEvents causes us to listen to all mouse events, and disabling the code there fixes the problem. 
Bug <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=611068">611068</a> tracks the problem.<br /></li><br /></ul>Jeff Muizelaarhttp://www.blogger.com/profile/17483047845050494642noreply@blogger.com3tag:blogger.com,1999:blog-1386948037384435441.post-57444252886487372842010-05-28T14:22:00.000-07:002010-05-28T14:36:03.693-07:00Reviewing in vimBugzilla's review interface is poor. I find a mild improvement is possible by copying the review text into an editor and reviewing it there. One of the things that makes this experience better is syntax highlighting. <a href="http://people.mozilla.org/%7Ejmuizelaar/vim/review.vim">Here</a>'s a modification of vim's diff highlighting script that works with quoted patches. Adding the following to one's .vimrc will get it used for .review files:<code><br />au BufNewFile,BufRead *.review setf review<br /></code>Jeff Muizelaarhttp://www.blogger.com/profile/17483047845050494642noreply@blogger.com4