Thursday, July 28, 2016

Counting function calls per second

Say you want to know how often you're allocating tiles in Firefox or the rate of some other thing. There's an easy way to do this using dtrace. The following dtrace script counts calls to any functions matching the pattern '*SharedMemoryBasic*Create*' in XUL in the target process.

#pragma D option quiet
dtrace:::BEGIN
{
    rate = 0;
}

profile:::tick-1sec
{
    printf("%d/sec\n", rate);
    rate = 0;
}

pid$target:XUL:*SharedMemoryBasic*Create*:entry
{
    rate++;
} 

You can run this script with following command:
$ dtrace -s $SCRIPT_NAME -p $PID
I'd be interested in knowing if anyone else has a similar technique for OSs that don't have dtrace.

Tuesday, December 29, 2015

WebGL2 enabled in Firefox Nightly

A couple of weeks ago we enabled WebGL2 in Nightly. The implementation is still missing some functionality like PBOs, MSRBs and sampler objects, but it seems to work well enough with the WebGL2 content that we've tried.

WebGL2 is based on OpenGL ES 3 and adds occlusion queries, transform feedback, a large amount of texturing functionality and bunch of new capabilities to the shading language including integer operations.

You can test out the implementation here http://toji.github.io/webgl2-particles/. If it says WebGL2 it's working with WebGL2. We look forward to seeing the graphical enhancements enabled by WebGL2 and encourage developers to start trying it out.

Wednesday, November 11, 2015

Debugging reftests with RR

When debugging reftests it's common to want to trace back the contents of pixel to see where they came from. I wrote a tool called rr-dataflow to help with this.

What follows is a log of a rr session where I use this tool to trace back the contents of a pixel to the code responsible for it being set. In this case I'm using the softpipe mesa driver which is a simple software implementation of OpenGL. This means that I can trace through the entire graphics pipeline as needed.

Breakpoint 1, mozilla::WebGLContext::ReadPixels (this=0x7fc064bc7000, x=0, y=0, width=64, height=64, format=6408, 
    type=5121, pixels=..., rv=...) at /home/jrmuizel/src/gecko/dom/canvas/WebGLContextGL.cpp:1411
1411    {
(gdb) c
Continuing.

Breakpoint 11, mozilla::ReadPixelsAndConvert (gl=0x7fc05c4e7000, x=0, y=0, width=64, height=64, readFormat=6408, 
    readType=5121, pixelStorePackAlignment=4, destFormat=6408, destType=5121, destBytes=0x7fc05c9d5000)
    at /home/jrmuizel/src/gecko/dom/canvas/WebGLContextGL.cpp:1310
1310    {
(gdb) list
1305    
1306    static void
1307    ReadPixelsAndConvert(gl::GLContext* gl, GLint x, GLint y, GLsizei width, GLsizei height,
1308                         GLenum readFormat, GLenum readType, size_t pixelStorePackAlignment,
1309                         GLenum destFormat, GLenum destType, void* destBytes)
1310    {
1311        if (readFormat == destFormat && readType == destType) {
1312            gl->fReadPixels(x, y, width, height, destFormat, destType, destBytes);
1313            return;
1314        }
(gdb) n
1311        if (readFormat == destFormat && readType == destType) {
(gdb) 
1312            gl->fReadPixels(x, y, width, height, destFormat, destType, destBytes);
(gdb) 
1313            return;

Let's disable the two breakpoints and set a watch point on the first pixel of the destination

(gdb) dis 1
(gdb) dis 11
(gdb) watch -location *(int*)destBytes
Hardware watchpoint 12: -location *(int*)destBytes

Then reverse-continue back to where the first pixel was set.

(gdb) rc
Continuing.
Hardware watchpoint 12: -location *(int*)destBytes

Old value = -16711936
New value = 0
__memcpy_avx_unaligned () at ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S:213
213 ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: No such file or directory.

We end up at a memcpy inside of readpixels that copies into the destination buffer.

(gdb) bt 9
#0  0x00007fc0ad9ed955 in __memcpy_avx_unaligned () at ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S:213
#1  0x00007fc075c5080f in _mesa_readpixels (__len=<optimized out>, __src=<optimized out>, __dest=<optimized out>)
    at /usr/include/x86_64-linux-gnu/bits/string3.h:53
#2  0x00007fc075c5080f in _mesa_readpixels (packing=0x7fc05c3e92e8, pixels=<optimized out>, type=5121, format=766008576, height=64, width=64, y=0, x=0, ctx=0x7fc05c3ce000) at main/readpix.c:245
#3  0x00007fc075c5080f in _mesa_readpixels (ctx=ctx@entry=0x7fc05c3ce000, x=x@entry=0, y=y@entry=0, width=width@entry=64, height=height@entry=64, format=format@entry=6408, type=5121, packing=0x7fc05c3e92e8, pixels=<optimized out>)
    at main/readpix.c:873
#4  0x00007fc075ce0985 in st_readpixels (ctx=0x7fc05c3ce000, x=0, y=0, width=64, height=64, format=6408, type=5121, pack=0x7fc05c3e92e8, pixels=0x7fc05c9d5000) at state_tracker/st_cb_readpixels.c:255
#5  0x00007fc075c519d4 in _mesa_ReadnPixelsARB (x=0, y=0, width=64, height=64, format=6408, type=5121, bufSize=2147483647, pixels=0x7fc05c9d5000) at main/readpix.c:1120
#6  0x00007fc075c51c82 in _mesa_ReadPixels (x=<optimized out>, y=<optimized out>, width=<optimized out>, height=<optimized out>, format=<optimized out>, type=<optimized out>, pixels=0x7fc05c9d5000) at main/readpix.c:1128
#7  0x00007fc09c3a9b3b in mozilla::gl::GLContext::raw_fReadPixels(int, int, int, int, unsigned int, unsigned int, void*) (this=0x7fc05c4e7000, x=0, y=0, width=64, height=64, format=6408, type=5121, pixels=0x7fc05c9d5000)
    at /home/jrmuizel/src/gecko/gfx/gl/GLContext.h:1511
#8  0x00007fc09c39abe1 in mozilla::gl::GLContext::fReadPixels(int, int, int, int, unsigned int, unsigned int, void*) (this=0x7fc05c4e7000, x=0, y=0, width=64, height=64, format=6408, type=5121, pixels=0x7fc05c9d5000)
    at /home/jrmuizel/src/gecko/gfx/gl/GLContext.cpp:2873
#9  0x00007fc09d78696d in mozilla::ReadPixelsAndConvert(mozilla::gl::GLContext*, GLint, GLint, GLsizei, GLsizei, GLenum, GLenum, size_t, GLenum, GLenum, void*) (gl=0x7fc05c4e7000, x=0, y=0, width=64, height=64, readFormat=6408, readType=5121, pixelStorePackAlignment=4, destFormat=6408, destType=5121, destBytes=0x7fc05c9d5000)
    at /home/jrmuizel/src/gecko/dom/canvas/WebGLContextGL.cpp:1312

From here we can see that the memcpy is storing the value of ymm4 into [r10] We use the origin command to step back to the place where ymm4 is loaded.

(gdb) origin
0x1000: vmovdqu ymmword ptr [r10], ymm4
1
reg used ymm4 

212 in ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S
0x1000: add rdx, rdi
211 in ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S
0x1000: add edx, eax
210 in ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S
0x1000: jae 0xfd2
209 in ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S
0x1000: sub edx, eax
208 in ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S
0x1000: add rdi, rax
207 in ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S
0x1000: vmovdqa ymmword ptr [rdi + 0x60], ymm3
206 in ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S
0x1000: vmovdqa ymmword ptr [rdi + 0x40], ymm2
205 in ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S
0x1000: vmovdqa ymmword ptr [rdi + 0x20], ymm1
204 in ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S
0x1000: vmovdqa ymmword ptr [rdi], ymm0
203 in ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S
0x1000: add rsi, rax
202 in ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S
0x1000: vmovdqu ymm3, ymmword ptr [rsi + 0x60]
201 in ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S
0x1000: vmovdqu ymm2, ymmword ptr [rsi + 0x40]
200 in ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S
0x1000: vmovdqu ymm1, ymmword ptr [rsi + 0x20]
199 in ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S
0x1000: vmovdqu ymm0, ymmword ptr [rsi]
197 in ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S
0x1000: sub edx, eax
196 in ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S
0x1000: add rsi, r11
195 in ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S
0x1000: vmovdqu ymm4, ymmword ptr [rsi]

We end up the instruction that loads ymm4 from [rsi]. Origin does this by single stepping backwards looking for writes to the ymm4 register. From here we want to continue tracking the origin. We use the origin command again. This time it sets a hardware watchpoint on the address in rsi.

(gdb) origin
0x1000: vmovdqu ymm4, ymmword ptr [rsi]
3
mem used *(int*)(0x7fc05c6c5000)
Hardware watchpoint 13: *(int*)(0x7fc05c6c5000)
Hardware watchpoint 13: *(int*)(0x7fc05c6c5000)

Old value = -16711936
New value = -452919552
__memcpy_avx_unaligned () at ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S:238
238 in ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S

We end up in another memcpy. This memcpy is flushing the tile buffer which
is used for rendering to the backbuffer that ReadPixels is reading from.

(gdb) bt 9
#0  0x00007fc0ad9ed9a6 in __memcpy_avx_unaligned () at ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S:238
#1  0x00007fc075e89753 in pipe_put_tile_raw (pt=pt@entry=0x7fc05c4f8740, dst=dst@entry=0x7fc05c6c5000, x=x@entry=0, y=y@entry=0, w=<optimized out>, w@entry=64, h=<optimized out>, h@entry=64, src=0x7fc05c9d9000, src_stride=<optimized out>)
    at util/u_tile.c:80
#2  0x00007fc075e8a268 in pipe_put_tile_rgba_format (pt=0x7fc05c4f8740, dst=0x7fc05c6c5000, x=0, y=0, w=w@entry=64, h=h@entry=64, format=PIPE_FORMAT_R8G8B8A8_UNORM, p=0x7fc05c9c5000) at util/u_tile.c:524
#3  0x00007fc0760034aa in sp_flush_tile (tc=tc@entry=0x7fc06c1e9400, pos=pos@entry=0) at sp_tile_cache.c:427
#4  0x00007fc076003c05 in sp_flush_tile_cache (tc=0x7fc06c1e9400) at sp_tile_cache.c:457
#5  0x00007fc075fe6a0e in softpipe_flush (pipe=pipe@entry=0x7fc0650d5000, flags=flags@entry=0, fence=fence@entry=0x7fff2da85b40) at sp_flush.c:72
#6  0x00007fc075fe6b0d in softpipe_flush_resource (pipe=0x7fc0650d5000, texture=texture@entry=0x7fc07fad4380, level=level@entry=0, layer=<optimized out>, flush_flags=flush_flags@entry=0, read_only=<optimized out>, cpu_access=1 '\001', do_not_block=0 '\000') at sp_flush.c:148
#7  0x00007fc07600304f in softpipe_transfer_map (pipe=<optimized out>, resource=0x7fc07fad4380, level=0, usage=1, box=0x7fff2da85bf0, transfer=0x7fc0665e0a58) at sp_texture.c:387
#8  0x00007fc075cdd6bd in st_MapRenderbuffer (transfer=0x7fc0665e0a58, h=64, w=<optimized out>, y=0, x=0, access=<optimized out>, layer=<optimized out>, level=<optimized out>, resource=<optimized out>, context=<optimized out>)
    at ../../src/gallium/auxiliary/util/u_inlines.h:457
#9  0x00007fc075cdd6bd in st_MapRenderbuffer (ctx=<optimized out>, rb=0x7fc0665e09d0, x=0, y=<optimized out>, w=<optimized out>, h=64, mode=1, mapOut=0x7fff2da85cf8, rowStrideOut=0x7fff2da85cf0) at state_tracker/st_cb_fbo.c:772

We use the origin command again. This time we have rep movsb operation that reads from memory as a source. Origin uses a hardware watchpoint on that address again.

(gdb) origin
0x1000: rep movsb   byte ptr [rdi], byte ptr [rsi]
3
mem used *(int*)(0x7fc05c9d9003)
Hardware watchpoint 14: *(int*)(0x7fc05c9d9003)
Hardware watchpoint 14: *(int*)(0x7fc05c9d9003)

Old value = 16711935
New value = -437918209
util_format_r8g8b8a8_unorm_pack_rgba_float (dst_row=0x7fc05c9d9000 "", dst_stride=256, src_row=0x7fc05c9c5000, 
    src_stride=<optimized out>, width=64, height=64) at util/u_format_table.c:15204
15204            *(uint32_t *)dst = value;

This watchpoint takes us back to the function that converts from the floating point output of the graphics pipeline to the byte value that goes in the destination tile.

(gdb) bt 9
#0  0x00007fc075eabe44 in util_format_r8g8b8a8_unorm_pack_rgba_float (dst_row=0x7fc05c9d9000 "", dst_stride=256, src_row=0x7fc05c9c5000, src_stride=<optimized out>, width=64, height=64) at util/u_format_table.c:15204
#1  0x00007fc075e8a23b in pipe_put_tile_rgba_format (pt=0x7fc05c4f8740, dst=0x7fc05c6c5000, x=0, y=0, w=w@entry=64, h=h@entry=64, format=PIPE_FORMAT_R8G8B8A8_UNORM, p=0x7fc05c9c5000) at util/u_tile.c:518
#2  0x00007fc0760034aa in sp_flush_tile (tc=tc@entry=0x7fc06c1e9400, pos=pos@entry=0) at sp_tile_cache.c:427
#3  0x00007fc076003c05 in sp_flush_tile_cache (tc=0x7fc06c1e9400) at sp_tile_cache.c:457
#4  0x00007fc075fe6a0e in softpipe_flush (pipe=pipe@entry=0x7fc0650d5000, flags=flags@entry=0, fence=fence@entry=0x7fff2da85b40) at sp_flush.c:72
#5  0x00007fc075fe6b0d in softpipe_flush_resource (pipe=0x7fc0650d5000, texture=texture@entry=0x7fc07fad4380, level=level@entry=0, layer=<optimized out>, flush_flags=flush_flags@entry=0, read_only=<optimized out>, cpu_access=1 '\001', do_not_block=0 '\000') at sp_flush.c:148
#6  0x00007fc07600304f in softpipe_transfer_map (pipe=<optimized out>, resource=0x7fc07fad4380, level=0, usage=1, box=0x7fff2da85bf0, transfer=0x7fc0665e0a58) at sp_texture.c:387
#7  0x00007fc075cdd6bd in st_MapRenderbuffer (transfer=0x7fc0665e0a58, h=64, w=<optimized out>, y=0, x=0, access=<optimized out>, layer=<optimized out>, level=<optimized out>, resource=<optimized out>, context=<optimized out>)
    at ../../src/gallium/auxiliary/util/u_inlines.h:457
#8  0x00007fc075cdd6bd in st_MapRenderbuffer (ctx=<optimized out>, rb=0x7fc0665e09d0, x=0, y=<optimized out>, w=<optimized out>, h=64, mode=1, mapOut=0x7fff2da85cf8, rowStrideOut=0x7fff2da85cf0) at state_tracker/st_cb_fbo.c:772
#9  0x00007fc075c507b2 in _mesa_readpixels (packing=0x7fc05c3e92e8, pixels=0x7fc05c9d5000, type=5121, format=766008576, height=64, width=64, y=0, x=0, ctx=0x7fc05c3ce000) at main/readpix.c:234

We see that ecx is being stored into [r10 - 4]. We use origin to track back to the source.

(gdb) origin
0x1000: mov dword ptr [r10 - 4], ecx
1
reg used ecx 

15206            src += 4;
0x1000: add rax, 0x10
15207            dst += 4;
0x1000: add r10, 4
15204            *(uint32_t *)dst = value;
0x1000: or  ecx, esi

We end at an or instruction. Looking at the source below we see that each of the floating point channels is being converted to a byte. We'll manually set a watchpoint on the channel that we're interested to avoid getting lost in the conversion code.

(gdb) list
15199            uint32_t value = 0;
15200            value |= (float_to_ubyte(src[0])) & 0xff;
15201            value |= ((float_to_ubyte(src[1])) & 0xff) << 8;
15202            value |= ((float_to_ubyte(src[2])) & 0xff) << 16;
15203            value |= (float_to_ubyte(src[3])) << 24;
15204            *(uint32_t *)dst = value;
15205   #endif
15206            src += 4;
15207            dst += 4;
15208         }
(gdb) watch -location src[1]
Hardware watchpoint 15: -location src[1]
(gdb) rc
Continuing.
Hardware watchpoint 15: -location src[1]

Old value = 1
New value = -1.35707841e+23
0x00007fc076003783 in clear_tile_rgba (tile=0x7fc05c9c5000, format=PIPE_FORMAT_R8G8B8A8_UNORM, clear_value=0x7fc06c1e968c)
    at sp_tile_cache.c:272
272                tile->data.color[i][j][1] = clear_value->f[1];

We end up at the clear_tile_rgba function which is settting the data in the buffer from the clear value.

(gdb) bt 9
#0  0x00007fc076003783 in clear_tile_rgba (tile=0x7fc05c9c5000, format=
    PIPE_FORMAT_R8G8B8A8_UNORM, clear_value=0x7fc06c1e968c) at sp_tile_cache.c:272
#1  0x00007fc0760040e5 in sp_find_cached_tile (tc=0x7fc06c1e9400, addr=...) at sp_tile_cache.c:579
#2  0x00007fc075febee9 in single_output_color (layer=<optimized out>, y=<optimized out>, x=<optimized out>, tc=<optimized out>) at sp_tile_cache.h:155
#3  0x00007fc075febee9 in single_output_color (qs=0x7fc07fb6c780, quads=0x7fc0665f3500, nr=1) at sp_quad_blend.c:1179
#4  0x00007fc075fefc9f in flush_spans (setup=setup@entry=0x7fc0665f1000) at sp_setup.c:251
#5  0x00007fc075ff0112 in subtriangle (setup=setup@entry=0x7fc0665f1000, eleft=eleft@entry=0x7fc0665f1058, eright=eright@entry=0x7fc0665f1028, lines=64) at sp_setup.c:759
#6  0x00007fc075ff0af2 in sp_setup_tri (setup=setup@entry=0x7fc0665f1000, v0=v0@entry=0x7fc089aea7c0, v1=v1@entry=0x7fc089aea7d0, v2=v2@entry=0x7fc089aea7e0) at sp_setup.c:853
#7  0x00007fc075fe71a2 in sp_vbuf_draw_arrays (vbr=<optimized out>, start=<optimized out>, nr=6) at sp_prim_vbuf.c:422
#8  0x00007fc075e2b704 in draw_pt_emit_linear (emit=<optimized out>, vert_info=<optimized out>, prim_info=0x7fff2da85f80)
    at draw/draw_pt_emit.c:261
#9  0x00007fc075e2d025 in fetch_pipeline_generic (prim_info=0x7fff2da85f80, vert_info=0x7fff2da85e40, emit=<optimized out>) at draw/draw_pt_fetch_shade_pipeline.c:196

We use origin again twice to track through the store and the load.

(gdb) origin
0x1000: movss   dword ptr [rax - 0xc], xmm0
272                tile->data.color[i][j][1] = clear_value->f[1];
0x1000: movss   xmm0, dword ptr [rbp + 4]
(gdb) origin
0x1000: movss   xmm0, dword ptr [rbp + 4]
3
mem used *(int*)(0x7fc06c1e9690)
Hardware watchpoint 16: *(int*)(0x7fc06c1e9690)
Hardware watchpoint 16: *(int*)(0x7fc06c1e9690)

Old value = 1065353216
New value = 0
sp_tile_cache_clear (tc=0x7fc06c1e9400, color=color@entry=0x7fc05c3cfa4c, clearValue=clearValue@entry=0)
    at sp_tile_cache.c:640
640    tc->clear_color = *color;

We end up in sp_tile_cache_clear which is setting up the clear color.

(gdb) bt 9
#0  0x00007fc076004376 in sp_tile_cache_clear (tc=0x7fc06c1e9400, color=color@entry=0x7fc05c3cfa4c, clearValue=clearValue@entry=0) at sp_tile_cache.c:640
#1  0x00007fc075fe5c84 in softpipe_clear (pipe=0x7fc0650d5000, buffers=5, color=0x7fc05c3cfa4c, depth=1, stencil=0)
    at sp_clear.c:71
#2  0x00007fc075cd8181 in st_Clear (ctx=0x7fc05c3ce000, mask=272) at state_tracker/st_cb_clear.c:539
#3  0x00007fc09c3a9427 in mozilla::gl::GLContext::raw_fClear(unsigned int) (this=0x7fc05c4e7000, mask=16640)
    at /home/jrmuizel/src/gecko/gfx/gl/GLContext.h:952
#4  0x00007fc09c3a9456 in mozilla::gl::GLContext::fClear(unsigned int) (this=0x7fc05c4e7000, mask=16640)
    at /home/jrmuizel/src/gecko/gfx/gl/GLContext.h:959
#5  0x00007fc09d7830bf in mozilla::WebGLContext::Clear(unsigned int) (this=0x7fc064bc7000, mask=16640)
    at /home/jrmuizel/src/gecko/dom/canvas/WebGLContextFramebufferOperations.cpp:46
#6  0x00007fc09d202460 in mozilla::dom::WebGLRenderingContextBinding::clear(JSContext*, JS::Handle<JSObject*>, mozilla::WebGLContext*, JSJitMethodCallArgs const&) (cx=0x7fc078086400, obj=..., self=0x7fc064bc7000, args=...)
    at /home/jrmuizel/src/gecko/obj-x86_64-unknown-linux-gnu/dom/bindings/WebGLRenderingContextBinding.cpp:11027
#7  0x00007fc09d6de3fa in mozilla::dom::GenericBindingMethod(JSContext*, unsigned int, JS::Value*) (cx=0x7fc078086400, argc=1, vp=0x7fc08f230210) at /home/jrmuizel/src/gecko/dom/bindings/BindingUtils.cpp:2644
#8  0x00007fc0a03af188 in js::Invoke(JSContext*, JS::CallArgs const&, js::MaybeConstruct) (args=..., native=0x7fc09d6de0bd <mozilla::dom::GenericBindingMethod(JSContext*, unsigned int, JS::Value*)>, cx=0x7fc078086400)
    at /home/jrmuizel/src/gecko/js/src/jscntxtinlines.h:235
#9  0x00007fc0a03af188 in js::Invoke(JSContext*, JS::CallArgs const&, js::MaybeConstruct) (cx=0x7fc078086400, args=..., construct=js::NO_CONSTRUCT) at /home/jrmuizel/src/gecko/js/src/vm/Interpreter.cpp:489
(gdb) 

Running a backtrace we see that this goes back to a call to WebGLContext::Clear. This is the actual clear call that triggers the code that eventually sets the pixel to the value that we see when we call glReadPixels. At this point we've travelled through the entire pipeline, and we've done it with minimal effort through the magic of rr.

Monday, June 8, 2015

Intel driver crash of the day

In bug 1170143 we ran into an Intel driver crash when trying to share DXGI_FORMAT_A8_UNORM surfaces. Older Intel drivers crash while opening the texture using OpenShareHandle. The driver successfully opens BGRA surfaces, but not alpha surfaces which we want to use for video playback. Who knows why... Here's a test case.

Monday, June 1, 2015

Direct2D on top of WARP

In Firefox 38 we introduced the use of WARP for software rasterization on Window 7. Early in the release we ran into an issue where using WARP on top of the builtin VGA driver was ridiculously slow. We fixed this by disabling WARP when the the VGA driver was being used, but I was curious how Internet Explorer avoids this issue. One big difference between Firefox and Internet Explorer is that we're not currently using Direct2D on top of WARP where as they are. It turns out the WARP driver has a private API that is used by Direct2D to avoid having to use the regular D3D11 API.

Profiling Internet Explorer shows the following private API's used by Direct2D
d3d10warp.dll!UMDevice::DrawGlyphRun
d3d10warp.dll!UMDevice::AlphaBlt2
d3d10warp.dll!UMDevice::InternalGetDC
d3d10warp.dll!UMDevice::CreateGeometry
d3d10warp.dll!UMDevice::DrawGeometryInternal

It looks like these turn into fairly traditional 2D graphics operations as shown with the following call stack snippets below:

RasterizationStage::Rasterize_TEXT
DrawGlyphRun6x1_B8G8R8A8_SSE<1>
DrawGlyphRun4x4_B8G8R8A8_SSE

RasterizationStage::Rasterize_GEOMETRY
PixelJITRasterizeGeometry
PixelJITGeometryRasterizer::Rasterize
WarpGeometry::Rasterize
CAntialiasedFiller::RasterizeEdges
CAntialiasedFiller::FillEdges
CAntialiasedFiller::GenerateOutput
PixelJITGeometryRasterizer::RasterizeComplexScan
PixelJITGeometryRasterizer::BeginSpan
InitializeEdges
InitializeInactiveArray
QuickSortEdges

This suggests that using Direct2D on top of WARP is more efficient than expected and might actually make more sense than our current strategy of only using WARP for composition.

Monday, March 16, 2015

Performance and feature improvements in Firefox 37 WebGL with D3D11 ANGLE

Firefox 37 adds support for WebGL rendering using D3D11 on Windows. Up till now we were using D3D9 which has very limited support for cross-device synchronization. Without proper synchronization we were forced to wait on the main thread for WebGL content to finish rendering before we could continue script execution. The result of this is that the total frame rendering time would be the sum of the script time and the remaining gpu time.  D3D11 allows us to use a GPU-side read barrier between the main thread and compositing thread. This lets Firefox avoid waiting on the main thread giving improved responsiveness and more time for script execution.

Here's a test program that lets you adjust the GPU and CPU execution times to see how the browser responds. D3D11 WebGL let's you adjust the CPU time up to nearly 15ms without dropping below 60fps.

D3D11 support also lets us expose the WEBGL_draw_buffers extension which allows drawing to multiple output buffers at the same time, functionality that's very helpful for implementing deferred renderers.

Give D3D11 WebGL support a try in Firefox Beta today and let us know how it works.


Thursday, August 23, 2012

The system works

Over the last few months a number of us at Mozilla having been working on a profiler built into Firefox. One of the goals of this profiler is to make it as easy as possible to profile anywhere. Yesterday we had a satisfying realization of this goal.

It all started with Taras' Snappy #36 update. In a comment, a user going by kamulos reported a recent problem about laggy tab switches. kamulos posted the profile and Benoit Girard filed bug 784756. The profile showed us spending a bunch of time in TimeStamp::Now() during image decode. I wasn't particularly surprised by this because our TimeStamp::Now() implementation on Windows is not particularly fast. Ehsan and I went away and put some effort into improving the performance and have some good candidates in bug 784859. In the mean time, Robert Lickenbrock discovered that problem was recently introduced by bug 685516, which unintentionally caused a fixed time delay where we called TimeStamp::Now() in a loop. He has since posted a patch that fixes the problem.

Here we have two community members helping uncover a problem within a week of it landing, a problem that could have otherwise gone undetected for a long time. This is a great example of an open source community working beautifully.