Improving Performance of a WebKit Port MIPS Platform (ELC 2014)
IMPROVING $PORT PERFORMANCE ON $ARCH
PLATFORM-BASED PERFORMANCE TUNING OF WEBKIT (PORT=QT ARCH=MIPS74KF)
Embedded Linux Conference
April 29 - May 1, 2014
Adrián Pérez de Castro
THE CHALLENGE
MAKE A QTWEBKIT-BASED BROWSER USEABLE ON LIMITED HARDWARE
MIPS 74Kf @ 500 MHz
RAM: 256 MB
No GPU
MIPS74KF
“Classic” MIPS32
+FPU
+MMU
+DSP
DSP?
No. Not really a DSP.
SIMD instructions suitable for signal processing.
CAN WE USE THIS TO IMPROVE PERFORMANCE?
CHALLENGE ACCEPTED
THE PLAN
PROFILE → OPTIMIZE → VALIDATE
WHAT TO OPTIMIZE
Video/audio decoding.
Image operations.
WHERE TO OPTIMIZE?
Can we improve the platform overall,
not just WebKit?
Yes!
QtWebKit uses the Qt drawing functions.
A/V decoding uses GStreamer, which uses Orc.
Good candidates for SIMD code.
LIMITATIONS
No Valgrind.
No GDB.
No perf.
No performance counters.
↓
qemu + gdbserver.
gperftools.
CLOCK_PROCESS_CPUTIME_ID
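With no perf and no hardware counters, CLOCK_PROCESS_CPUTIME_ID is what remains for timing code in-process. A minimal sketch of how that could look, assuming a POSIX system; the helper names processCpuSeconds() and cpuSecondsFor() are made up for illustration and do not come from the talk:

```cpp
#include <cassert>
#include <time.h>

// CPU time consumed so far by this process, in seconds.
// Returns a negative value when the clock is not available.
double processCpuSeconds() {
    timespec ts;
    if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts) != 0)
        return -1.0;
    return static_cast<double>(ts.tv_sec)
         + static_cast<double>(ts.tv_nsec) * 1e-9;
}

// Hypothetical helper: CPU seconds spent calling `work` `iterations` times.
// Repeating the call helps when a single run is shorter than the clock step.
template <typename Work>
double cpuSecondsFor(Work work, int iterations) {
    assert(iterations > 0);
    const double start = processCpuSeconds();
    for (int i = 0; i < iterations; ++i)
        work();
    return processCpuSeconds() - start;
}
```

Dividing the result by the iteration count gives a per-call estimate, which is only meaningful when the total is well above the clock's real resolution.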
ROLL YOUR OWN TOOLS
(WITH HELP FROM EXISTING ONES)
GNU HAMMER^WTIME!
    # Use the full path to avoid using the shell's time builtin.
    # One line per run with user/system time and page faults.
    /usr/bin/time -a -o timings.txt \
        -f '%U %S %F %x %C' $COMMAND

    # For example, measuring the qtdemux GStreamer component:
    /usr/bin/time -a -o timings.txt \
        -f '%U %S %F %x %C' gst-launch -q \
        filesrc location=file.mp4 ! qtdemux ! video/x-h264 ! fakesink
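Each line appended to timings.txt holds user seconds (%U), system seconds (%S), major page faults (%F), exit status (%x) and the command (%C). A rough sketch of reading those lines back, assuming exactly that format string; TimingSample, parseTimingLine() and meanUserSeconds() are illustrative names, not tools from the talk:

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// One line written by: /usr/bin/time -f '%U %S %F %x %C'
struct TimingSample {
    double userSeconds;      // %U
    double systemSeconds;    // %S
    long majorPageFaults;    // %F
    int exitStatus;          // %x
    std::string command;     // %C
};

// Hypothetical parser; returns false when the line does not match the format
// (for example, the extra message GNU time prints on a non-zero exit status).
bool parseTimingLine(const std::string& line, TimingSample& out) {
    std::istringstream in(line);
    out.command.clear();
    if (!(in >> out.userSeconds >> out.systemSeconds
             >> out.majorPageFaults >> out.exitStatus))
        return false;
    std::getline(in >> std::ws, out.command);  // rest of the line is the command
    return !out.command.empty();
}

// Average user-mode CPU time over several runs of the same command.
double meanUserSeconds(const std::vector<TimingSample>& samples) {
    assert(!samples.empty());
    double total = 0.0;
    for (std::size_t i = 0; i < samples.size(); ++i)
        total += samples[i].userSeconds;
    return total / static_cast<double>(samples.size());
}
```

Averaging several runs smooths out noise from caches and other processes; comparing the mean before and after a change is the kind of measurement the later slides rely on.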
TIMING
Beware of CLOCK_PROCESS_CPUTIME_ID's resolution!
    #define CLOCK_MAX_RESOLUTION_DELTA (10000.0 * 1e-9)

    bool usePosixClock() {
        static bool checked = false;
        static bool useposix;
        if (!checked) {
            if (posixClockAvailable()) {
                double res_theorical = posixClockTheoricalResolution();
                double res_empirical = posixClockEmpiricalResolution();
                useposix = fabs(res_theorical - res_empirical)
                           <= CLOCK_MAX_RESOLUTION_DELTA;
            } else {
                useposix = false;
            }
            checked = true;
        }
        return useposix;
    }
clock.cc
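The slide does not show how the two resolutions are obtained. One plausible way, sketched here under the assumption of a POSIX system: ask clock_getres() for the advertised value, and spin on clock_gettime() until the reading changes for the empirical one. advertisedResolution() and empiricalResolution() are stand-in names, not the functions from clock.cc:

```cpp
#include <cassert>
#include <time.h>

// Resolution the system advertises for the process CPU-time clock, in seconds.
// Returns a negative value when the clock is not available.
double advertisedResolution() {
    timespec res;
    if (clock_getres(CLOCK_PROCESS_CPUTIME_ID, &res) != 0)
        return -1.0;
    return static_cast<double>(res.tv_sec)
         + static_cast<double>(res.tv_nsec) * 1e-9;
}

// Smallest non-zero step seen between two consecutive readings, in seconds.
// The busy loop burns CPU time, so the clock is guaranteed to advance.
double empiricalResolution(int samples) {
    assert(samples > 0);
    double best = -1.0;
    for (int i = 0; i < samples; ++i) {
        timespec a, b;
        if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &a) != 0)
            return -1.0;
        do {
            if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &b) != 0)
                return -1.0;
        } while (a.tv_sec == b.tv_sec && a.tv_nsec == b.tv_nsec);
        const double delta = static_cast<double>(b.tv_sec - a.tv_sec)
                           + static_cast<double>(b.tv_nsec - a.tv_nsec) * 1e-9;
        if (best < 0.0 || delta < best)
            best = delta;
    }
    return best;
}
```

When the two values differ by more than a chosen tolerance, the advertised figure cannot be trusted and another time source is the safer choice, which is the decision usePosixClock() makes.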
WEBSNAP
    % g++ -DMAIN -o clock clock.cc
    % ./clock
    CLOCK_PROCESS_CPUTIME_ID is supported
    Resolution (advertised/empirical): 0.0000000010/0.0000002460s
    Sampled resolution: 0.0000005470s
    Printing the lines above took 0.0000483550s

    % LD_PRELOAD=/usr/lib/libprofiler.so \
        ./websnap http://igalia.com 1000 pprof
    Loading 100%
    Layout completed
    Load successful
    libprofile.so detected (0x7f77468e8f90, 0x7f77468e8fd0), output 'pprof'
    Profiling started, code: 0x1, timeout: 0
    PROFILE: interrupts/evictions/bytes = 634/537/22168
    http://igalia.com 1000 6.2709987870s

    % mkdir out && ./runtests 1000 < urls.txt
github.com/aperezdc/websnap
...AND BEYOND
Ad-hoc Python/Bash scripts:
Fix library paths in profiler output.
Data munging.
Measurements comparison.
Generate CSV files.
Report generation.
…
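The comparison and CSV steps boil down to very little code. A sketch of one way to do it (the talk used Python/Bash scripts, which are not shown); speedupPercent() and csvRow() are hypothetical names, and "speedup" is assumed here to mean the fraction of time saved relative to the baseline:

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Percentage of time saved: 25.0 means the optimized run took 25% less time.
// This definition is an assumption; the slides do not spell one out.
double speedupPercent(double baselineSeconds, double optimizedSeconds) {
    assert(baselineSeconds > 0.0);
    return (baselineSeconds - optimizedSeconds) / baselineSeconds * 100.0;
}

// One CSV row: url,baseline,optimized,speedup.
// Assumes the URL contains no commas or quotes, so no escaping is done.
std::string csvRow(const std::string& url,
                   double baselineSeconds, double optimizedSeconds) {
    char numbers[96];
    std::snprintf(numbers, sizeof numbers, "%.3f,%.3f,%.1f",
                  baselineSeconds, optimizedSeconds,
                  speedupPercent(baselineSeconds, optimizedSeconds));
    return url + "," + numbers;
}
```

For example, csvRow("http://igalia.com", 8.0, 6.0) yields "http://igalia.com,8.000,6.000,25.0".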
SOME RESULTS
(DETAILED)
LATIN-1→UTF16: V0
    // "dst" array (uint16_t*)
    // "src" array (uint8_t*)
    while (len--)
        *dst++ = (uchar) *src++;
LATIN-1→UTF16: V1
    ; a0: "dst" array (uint16_t*)
    ; a1: "src" array (uint8_t *)
    ; a2: "len"
    1:  lbu   t1, 0 (a1)
        addiu a2, a2, -1       ; len--
        sh    t1, 0 (a0)
        addiu a0, a0, 2        ; dst++
        bnez  a2, 1b
        addiu a1, a1, 1        ; src++
LATIN-1→UTF16: V2
    1:  lw    t1, (a1)         ; t1 = ABCD
        ;
        ; TODO: extract bytes from t1 to t2/t3, padding
        ; them with zeroes: t2 = 0A0B, t3 = 0C0D
        ;
        addiu a1, a1, 4        ; src += 4
        addiu a2, a2, -4       ; len -= 4
        sw    t2, 0 (a0)
        sw    t3, 4 (a0)
        bnez  a2, 1b
        addiu a0, a0, 8        ; dst += 4
LATIN-1→UTF16: V3
    1:  lw    t1, (a1)         ; t1 = ABCD
        srl   t2, t1, 24       ; t2 = 000A
        sll   t2, t2, 16       ; t2 = 0A00
        sll   t1, t1, 8        ; t1 = BCD0
        srl   t4, t1, 24       ; t4 = 000B
        or    t2, t2, t4       ; t2 = 0A0B
        sll   t1, t1, 8        ; t1 = CD00
        srl   t1, t1, 16       ; t1 = 00CD
        andi  t4, t1, 0xFF00   ; t4 = 00C0
        andi  t1, t1, 0x00FF   ; t1 = 000D
        sll   t4, t4, 8        ; t4 = 0C00
        or    t3, t1, t4       ; t3 = 0C0D
        addiu a1, a1, 4        ; src += 4
        addiu a2, a2, -4       ; len -= 4
        sw    t2, 0 (a0)
        sw    t3, 4 (a0)
        bnez  a2, 1b
        addiu a0, a0, 8        ; dst += 4
LATIN-1→UTF16: V4
    ; DSP instructions can unpack bytes directly :-)
    1:  lw    t1, (a1)             ; t1 = ABCD
        preceu.ph.qbl t2, t1       ; t2 = 0A0B
        preceu.ph.qbr t3, t1       ; t3 = 0C0D
        addiu a1, a1, 4            ; src += 4
        addiu a2, a2, -4           ; len -= 4
        sw    t2, 0 (a0)
        sw    t3, 4 (a0)
        bnez  a2, 1b
        addiu a0, a0, 8            ; dst += 4
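For readers without a MIPS DSP toolchain at hand, the effect of the two unpack instructions can be mimicked in portable code. This is only an emulation to show the data movement, not the code that went into Qt; unpackLeftBytes(), unpackRightBytes() and latin1ToUtf16() are illustrative names. Unlike the assembly loops, which assume a length that is a non-zero multiple of four, the sketch also handles the leftover bytes:

```cpp
#include <cassert>
#include <stddef.h>
#include <stdint.h>

// Portable stand-in for preceu.ph.qbl: 0xAABBCCDD -> 0x00AA00BB.
inline uint32_t unpackLeftBytes(uint32_t w) {
    return ((w >> 8) & 0x00FF0000u) | ((w >> 16) & 0x000000FFu);
}

// Portable stand-in for preceu.ph.qbr: 0xAABBCCDD -> 0x00CC00DD.
inline uint32_t unpackRightBytes(uint32_t w) {
    return ((w << 8) & 0x00FF0000u) | (w & 0x000000FFu);
}

// Latin-1 to UTF-16, four characters per iteration like V4.
void latin1ToUtf16(uint16_t* dst, const uint8_t* src, size_t len) {
    assert(len == 0 || (dst != 0 && src != 0));
    while (len >= 4) {
        // Build the word with src[0] in the top byte, matching "t1 = ABCD".
        const uint32_t w = (uint32_t(src[0]) << 24) | (uint32_t(src[1]) << 16)
                         | (uint32_t(src[2]) << 8)  |  uint32_t(src[3]);
        const uint32_t left  = unpackLeftBytes(w);   // 0A0B
        const uint32_t right = unpackRightBytes(w);  // 0C0D
        dst[0] = uint16_t(left >> 16);
        dst[1] = uint16_t(left & 0xFFFFu);
        dst[2] = uint16_t(right >> 16);
        dst[3] = uint16_t(right & 0xFFFFu);
        src += 4;
        dst += 4;
        len -= 4;
    }
    while (len--)          // tail: fall back to the V0 loop
        *dst++ = *src++;
}
```

The emulation spends several shifts and masks per unpack, much like V3; the point of V4 is that each of those sequences collapses into a single instruction on the 74Kf.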
LATIN-1 → UTF-16
ALPHA BLENDING
UTF-16 STRICMP()
RESULTS
Speedup histogram
UP TO 30% FASTER RENDERING
Thanks to:
Orc backend using MIPS DSP instructions
QImage composition operations
Color conversion (RGB16/888→ARGB32)
Alpha premultiplication and blending
String conversions and comparisons
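To make two of the listed operations concrete, here is a scalar sketch of what an RGB16 to ARGB32 conversion and an alpha premultiplication compute per pixel. It is a plain reference written for illustration, not Qt's implementation nor the DSP-optimized code; RGB16 is assumed to mean RGB565, and rgb16ToArgb32() and premultiply() are made-up names:

```cpp
#include <cassert>
#include <stdint.h>

// RGB565 -> opaque ARGB32. The top bits of each channel are replicated into
// the low bits so that full-scale 5/6-bit values map to 0xFF.
uint32_t rgb16ToArgb32(uint16_t c) {
    uint32_t r = (c >> 11) & 0x1Fu;
    uint32_t g = (c >> 5) & 0x3Fu;
    uint32_t b = c & 0x1Fu;
    r = (r << 3) | (r >> 2);
    g = (g << 2) | (g >> 4);
    b = (b << 3) | (b >> 2);
    return 0xFF000000u | (r << 16) | (g << 8) | b;
}

// Scale each color channel by alpha/255, rounding to nearest.
uint32_t premultiply(uint32_t argb) {
    const uint32_t a = argb >> 24;
    const uint32_t r = (((argb >> 16) & 0xFFu) * a + 127u) / 255u;
    const uint32_t g = (((argb >> 8) & 0xFFu) * a + 127u) / 255u;
    const uint32_t b = ((argb & 0xFFu) * a + 127u) / 255u;
    // Premultiplied invariant: no channel exceeds the alpha value.
    assert(r <= a && g <= a && b <= a);
    return (a << 24) | (r << 16) | (g << 8) | b;
}
```

Both functions apply the same arithmetic to every channel of every pixel, which is the shape of work that SIMD-style instructions handle well.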
UPSTREAM STATUS
Orc backend complete upstream
Initial work based on Qt 4.8
Most of the code is already in Qt 5.2
Rest in the next release
No backport to Qt 4.8
THANK YOU
FOR YOUR ATTENTION
perezdecastro.org
+AdrianPerezDeCastro
@aperezdc