Page 1: Android Platform Optimizations

Ruud Derwig
ELC-Europe, Prague, October 2011
© Synopsys 2011

Page 2: Helping Design the Chips Inside

• Data Center & Networking
• Digital Home
• Computing & Peripherals
• Mobile Multimedia
• Industrial
• Military / Aerospace
• Automotive
• Medical
• Other

Page 3: Agenda

• Market & value drivers
• What to optimize?
• How to optimize?
• Results & conclusion

Page 4: Android Markets

• Smartphones
• Tablets
• TV
• STB / multimedia
• Others / new

Page 7: Key Value Drivers & System Architecture Choices

• Power consumption
  → optimize performance / mW
• Product cost
  → optimize performance / area
  → optimize development efficiency
• Hardware – Software trade-offs
  – Maximum flexibility & developer efficiency: “virtual everything”
    • PC model, multi-GHz SMP processor centric designs
  – Minimal power & optimal performance: “dedicated hardware”
    • dedicated, fixed function device
  – Sweetspot: “heterogeneous, HW accelerated multi-core”
    • Mix of CPU, DSP, and dedicated HW

Page 8: Agenda

• Market & value drivers
• What to optimize?
• How to optimize?
• Results & conclusion

[Figure: SoC block diagram. Blocks shown: ARC Host processor; ARC Audio processor; ARC Video processor; Audio Codec; ADCs / DACs; Signal processing; Video Front End; AMBA 3 AXI & AMBA 2.0 AHB; AMBA APB; USB controller and PHY; PCIe controller and PHY; DDR controller and PHY; HDMI controller and PHY; SATA controller and PHY; Ethernet controller; XAUI PHY; SD/MMC controller; MIPI DigRF, CSI, DSI controller; MIPI D-PHY, M-PHY; UART; GPIO; I2C; Embedded Memories (NVMs & SRAMs); Datapath; Logic Libraries]

Page 9: Linux Kernel & Library Optimizations

• Important,
• … but not Android specific
• Optimization options
  – Optimize hotspots
    • compiler
    • handwritten assembler
  – CPU hardware optimizations
    • MMU
    • special instructions

Page 10: Dalvik Virtual Machine

• “Java”* virtual machine
  – Register-based architecture (Java VMs are stack machines). Dalvik registers are typically stored in memory (on the stack, like local variables in C).
  – Own bytecode
• Three virtual machines
  – Portable: completely C-based, in fact one large switch{} statement with a case x: for every Dalvik opcode.
  – Fast (a.k.a. MTERP): assembly-coded handlers for every Dalvik opcode, aligned on 64-byte addresses so that the address of a handler can be calculated directly from the opcode, saving a lookup.
  – JIT: just-in-time compiler; it starts as the fast/mterp interpreter, but identifies ‘hot’ traces and passes these to the compiler thread.

*Dalvik is a clean-room implementation of Java for copyright reasons. The syntax is similar.

Page 11: Android Media Player Architecture

[Figure: layered block diagram.
 JAVA: Media Player App; Media Player App Framework
 Linux User Space: Media Player Service; AudioFlinger; …
 Linux Kernel Space: Alsa Kernel Driver; Other Audio Driver]

Page 12: Android Media Player Architecture

• Google’s player of choice is Stagefright
  – multi-format A/V player, newly developed for Android
  – Simple fixed graph: selects demuxer and decoder based on file extension
• Alternatives exist
  – GStreamer based
  – proprietary / legacy

[Figure: layered block diagram.
 JAVA: Media Player App; Media Player App Framework
 Linux User Space: Media Player Service, hosting StageFrightPlayer, GStreamerPlayer, OpenCorePlayer, VorbisPlayer, MIDIPlayer; AudioFlinger
 Linux Kernel Space: Alsa Kernel Driver; Other Audio Driver
 Player pipeline: Reader (Read from Source) → Demuxer (Parse container format) → Video Decoding → Video Rendering (→ SurfaceFlinger), and Audio Decoding → Audio Rendering (→ AudioFlinger)]

Page 13: Audio Optimization Option: off-load audio processing to DSP

[Figure: player pipeline split across two processors.
 GP-CPU: Reader (Read from Source); Demuxer (Parse container format); Audio Decoding; Audio Rendering (→ AudioFlinger); Control API
 Audio DSP (ARC Sound Processor): DSP Audio Decoding; Post Processing (SonicFocus); Control API]

Page 14: Android Graphics - Architecture

• 2D
  – Canvas/Skia
  – OpenVG
• 3D
  – OpenGL-ES 1.x
  – OpenGL-ES 2
• Renderscript
  – Expose native GPU/SMP to (portable) applications
  – C99 -> LLVM intermediate bitcode -> machine code

[Figure: Application on top of Canvas, Skia, RenderScript and OpenGL, drawing to a Surface]

Page 15: Android Graphics - Compositor

[Figure: two Applications, each with Canvas, Skia, RenderScript, OpenGL and a Surface, feeding SurfaceFlinger. SurfaceFlinger composes through PixelFlinger or OpenGL, backed by a 2D blitter or GPU]

Page 16: Graphics Optimization Options

• Graphics drawing/rendering
  – Software/assembler optimization
    • Skia, PixelFlinger
  – Hardware acceleration
    • GPU (OpenGL-ES 2)
    • 2D accelerator (OpenVG compatible or other)
    • Memory architecture, caching
  – Renderscript
• Surface Composition
  – Scaling, colorspace conversion
    • Custom instructions
    • GPU
    • Dedicated hardware acceleration (bitblit)

Page 17: Agenda

• Market & value drivers
• What to optimize?
• How to optimize?
• Results & conclusion

Page 18: Optimized DesignWare ARC Android

• Full port of the Android Froyo/Gingerbread release to the ARC processor architecture and build environment
• Including NDK and SDK to support Android application building/porting
• Google/OHA Compatibility Test Suite tested

[Figure: software stack blocks: ARC Optimized bionic C library; ARC Optimized Dalvik VM; ARC Optimized pixelflinger; ARC Optimized V8 JavaScript engine (work in progress); ARC Sound Audio DSP (work in progress); ARC Linux kernel and Drivers]

Page 19: Differences between VM Implementations

Portable:
    switch (opcode) {
    case add: a = b + c;
        break;
    case sub: a = b - c;
        break;
    ...

    ld    r0, [next_opcode]
    <pipeline stall>
    ld.as r1, [jump_table, r0]
    <pipeline stall>
    j     [r1]

MTerp:
    ld    r0, [b]
    ld    r1, [c]
    add   r0, r0, r1
    st    r0, [a]
    ld    r0, [next_opcode]
    asl   r0, r0, 6
    add   r0, r13, r0
    j     [r0]

JIT:
    ld    r0, [b]
    ld    r1, [c]
    add   r0, r0, r1
    st    r0, [a]
  OR
    add   r20, r20, r21
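The MTerp listing reaches the next handler with a shift and an add (`asl r0, r0, 6` multiplies the opcode by 64), while the jump-table variant needs an extra dependent load. The following is a minimal C sketch of that difference, not Dalvik source: the toy opcodes, `mterp_handler_address` and `portable_step` are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toy opcodes; real Dalvik opcode numbers differ. */
enum { OP_ADD = 0, OP_SUB = 1, NUM_OPCODES = 2 };

/* Slot size implied by "asl r0, r0, 6": a shift by 6 multiplies by 64. */
#define HANDLER_SLOT_SHIFT 6u

/* MTerp-style dispatch: the handler address is computed from the opcode,
 * so no jump-table load is needed. */
uintptr_t mterp_handler_address(uintptr_t handler_base, unsigned opcode)
{
    return handler_base + ((uintptr_t)opcode << HANDLER_SLOT_SHIFT);
}

/* Portable-style interpreter step: one switch with a case per opcode.
 * A compiler typically lowers such a switch to a jump-table lookup. */
int portable_step(unsigned opcode, int b, int c)
{
    assert(opcode < NUM_OPCODES);
    switch (opcode) {
    case OP_ADD: return b + c;
    case OP_SUB: return b - c;
    default:     return 0; /* not reached for valid opcodes */
    }
}
```

With a handler base of 0x1000, opcode 3 would map to 0x1000 + 3 * 64 = 0x10C0 in this sketch.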

Page 20: [Figure reused from Google I/O presentation]

Page 21: Register- and Stack-based VMs

Example: a = b + c

Java:
    iload b
    iload c
    iadd
    istore a

    ld    r0, [b]
    push  r0
    ld    r0, [c]
    push  r0
    pop   r0
    pop   r1
    add   r0, r0, r1
    push  r0
    pop   r0
    st    r0, [a]

Dalvik:
    add-int a, b, c

    ld    r0, [b]
    ld    r1, [c]
    add   r0, r0, r1
    st    r0, [a]

Dalvik for ARC:
    add-int a, b, c

    add   r20, r21, r22

    Registers are only saved/restored when changing stack frames or when moving to interpreter
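To make the bytecode-count difference for a = b + c concrete, here is a small C sketch of a toy stack machine and a toy register machine. It is an illustration under stated assumptions, not Dalvik or JVM code: `StackVm`, `RegisterVm` and all function names are hypothetical, and the `dispatches` counter simply counts how many bytecodes each style executes.

```c
#include <assert.h>

/* Toy stack machine state (hypothetical, for illustration only). */
typedef struct {
    int stack[8];
    int sp;          /* number of values currently on the stack */
    int dispatches;  /* bytecodes executed so far */
} StackVm;

void stack_push(StackVm *vm, int value)
{
    assert(vm->sp < 8);
    vm->stack[vm->sp++] = value;
}

int stack_pop(StackVm *vm)
{
    assert(vm->sp > 0);
    return vm->stack[--vm->sp];
}

/* Stack style: iload b; iload c; iadd; istore a (four bytecodes). */
int stack_vm_add(StackVm *vm, int b, int c)
{
    stack_push(vm, b);          vm->dispatches++;  /* iload b  */
    stack_push(vm, c);          vm->dispatches++;  /* iload c  */
    int rhs = stack_pop(vm);
    int lhs = stack_pop(vm);
    stack_push(vm, lhs + rhs);  vm->dispatches++;  /* iadd     */
    int a = stack_pop(vm);      vm->dispatches++;  /* istore a */
    return a;
}

/* Toy register machine state (hypothetical). */
typedef struct {
    int regs[4];
    int dispatches;
} RegisterVm;

/* Register style: add-int dst, src1, src2 (one bytecode, operands named). */
void register_vm_add(RegisterVm *vm, int dst, int src1, int src2)
{
    vm->regs[dst] = vm->regs[src1] + vm->regs[src2];
    vm->dispatches++;
}
```

The stack version dispatches four bytecodes and moves every operand through the stack, while the register version dispatches once, which is the effect the three columns above show at the machine-instruction level.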

Page 22: Audio Processing on DSP

• Audio decoding and post-processing off-loaded to ARC Sound Processor
• Special host Audio Decoder implementation that takes care of off-loading
  – with standard host decoder interfaces, so seamless integration
• Post-processing control through Renderer on host (special Renderer or Renderer plug-in component)

[Figure: player pipeline split across two processors.
 GP-CPU: Reader (Read from Source); Demuxer (Parse container format); Audio Decoding; Audio Rendering (→ AudioFlinger); Control API
 Audio DSP (ARC Sound Processor): DSP Audio Decoding; Post Processing (SonicFocus); Control API]

Page 23: [Figure: Stagefright audio off-loading to ARC Audio Processor]

• MSF = Media Streaming Framework: ARC DSP optimized, lightweight streaming framework
• MQX = Real-time operating system
• RPC/IPC = Remote Procedure Call / Inter Processor Communication

[Figure: two processors connected by an “Audio Data, Control” link.
 Android Host Processor: Media Player App; Media Player App Framework; StageFright Player (Control API); Stagefright with Reader, Demuxer, Decoder Wrapper, Renderer and Post Processing (Control API); ARC Sound Remote-d MSF; Remote-d Control API; RPC/IPC; Android (Linux) OS
 ARC Audio Processor(s) - AS2xx: MSF Framework with Decoder and SonicFocus Processing “plug-in”; Control API; RPC/IPC; MQX OS]

Page 24: Android & Audio APIs

• Stagefright supports 2 types of interfaces
  – OpenMax-IL: for re-use of OMX components
  – Stagefright codec interface: for native Stagefright codecs
• AudioFlinger uses dedicated interfaces
  – standard implementation using “ALSA” exists
  – developments ongoing (?) to support OpenSL-ES Khronos standard (like OMX)
• SNPS API choice not yet made
  – OMX-IL pro: open standard
  – OMX-IL con: efficiency, complexity: standard by committee…
  – Stagefright pro: efficient integration with Stagefright
  – Stagefright con: not an open standard, no deep tunneling

[Figure: Stagefright connected over IPC to codecs, shown in variants with OMX2 and MSF layers]

Page 25: Alternative: GStreamer

• GStreamer Android Player
  – see e.g. ELC-E 2010 presentation
• “The goal of the project is to both allow hardware makers to standardize on GStreamer across their software platforms, but also to make the advanced functionality of GStreamer available on the Android platform, like video editing, DLNA Support and Video conferencing.”

Page 26: GStreamer DSP Off-loading with “Deep Tunneling”

• GStreamer-MSF integration makes heterogeneous multi-core SW development transparent to user
• Instantiation of GStreamer element → instantiation of module on one of the ARC cores
• Creation of link → local connection or core-crossing connection between modules

[Figure: dataflow graph instantiated on Audio Subsystem. Host CPU with GStreamer elements (G) and a Host-ARC driver; ARC core #0 and ARC core #1 with MSF modules (M), src/sink endpoints and FIFOs. Connection types: Local Streaming, ARC-ARC Streaming, Host-ARC Streaming]
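The slide distinguishes three connection types depending on where the two linked modules run. A minimal C sketch of that decision follows; it is purely illustrative and assumes nothing about the real GStreamer or MSF APIs. `HOST_CPU`, `LinkKind` and `classify_link` are hypothetical names.

```c
#include <assert.h>

/* Hypothetical placement encoding: ARC cores are numbered 0, 1, ...;
 * the host CPU is represented by -1. */
#define HOST_CPU (-1)

/* The three streaming types named on the slide. */
typedef enum {
    LINK_LOCAL,     /* both endpoints on the same processor */
    LINK_ARC_ARC,   /* endpoints on two different ARC cores */
    LINK_HOST_ARC   /* one endpoint on the host, one on an ARC core */
} LinkKind;

/* Pick the connection type for a link from where its endpoints run. */
LinkKind classify_link(int src_location, int dst_location)
{
    assert(src_location >= HOST_CPU && dst_location >= HOST_CPU);
    if (src_location == dst_location) {
        return LINK_LOCAL;
    }
    if (src_location == HOST_CPU || dst_location == HOST_CPU) {
        return LINK_HOST_ARC;
    }
    return LINK_ARC_ARC;
}
```

In this sketch the caller only states where each module lives; the choice between a local FIFO and a core-crossing connection is made for it, which is the transparency the first bullet describes.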

Page 27: GStreamer Deep Tunneling

Page 28: ARC HW Extensions

ARC EIA (Extension Interface Automation):
• supports user defined custom instructions
• accelerates typical Dalvik (Java VM) and pixelflinger (2D GUI) instruction sequences

Prefetcher:
• Eliminates pipeline stalls in high latency memory environments
• L2$ not required in this case
• Configurable depending on application

[Figure: ARC750D with I$, D$, extension instructions and prefetch unit, connected over the AXI bus to DDR memory (including frame buffer) and a VGA / HDMI / DSI display driver]

[Chart: CPU area, excluding memories. Segments: CPU + FPU; prefetcher; extra instructions]

Page 29: Leveraging the ARC EIA Capabilities. Example: Colour Space Conversion

8 operations are required for a conversion from ABGR8888 to RGB565.

This can be combined into one single EIA instruction.

[Figure: bit-level diagram of the ABGR8888 to RGB565 conversion, showing the mask, shift left, shift right and bitwise or steps per colour channel]

Without EIA instruction:

    Instruction  Operands
    ld.ab        r1, [r4, 0x4]
    and          r2, r1, 0xf8
    asl          r2, r2, 8
    and          r3, r1, 0xfc00
    lsr          r11, r3, 5
    or           r2, r2, r11
    and          r3, r1, 0xf80000
    lsr          r11, r3, 19
    or           r2, r2, r11
    stw.ab       r2, [r5, 0x2]

With EIA instruction:

    Instruction  Operands
    ld.ab        r1, [r4, 0x4]
    upk8         r2, r1, r6
    stw.ab       r2, [r5, 0x2]
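The eight operations between the load and the store (three masks, three shifts, two ors) can be written out in C as a sketch. The bit layout is inferred from the masks in the listing: R in bits 0-7, G in bits 8-15, B in bits 16-23. The function names are hypothetical, and the exact semantics of `upk8`, including its third operand, are not given in the source, so this only mirrors the longer sequence.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert one ABGR8888 pixel to RGB565 with the same masks and shifts
 * as the 8-operation instruction sequence. Alpha is discarded. */
uint16_t abgr8888_to_rgb565(uint32_t pixel)
{
    uint32_t r = (pixel & 0x0000f8u) << 8;   /* top 5 bits of R -> bits 11-15 */
    uint32_t g = (pixel & 0x00fc00u) >> 5;   /* top 6 bits of G -> bits 5-10  */
    uint32_t b = (pixel & 0xf80000u) >> 19;  /* top 5 bits of B -> bits 0-4   */
    return (uint16_t)(r | g | b);
}

/* Convert a buffer of pixels; each iteration corresponds to one pass
 * through the load, convert, store sequence in the listing. */
void convert_abgr8888_to_rgb565(const uint32_t *src, uint16_t *dst, size_t count)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < count; i++) {
        dst[i] = abgr8888_to_rgb565(src[i]);
    }
}
```

For example, a pure red pixel 0xFF0000FF maps to 0xF800 and a pure blue pixel 0xFFFF0000 maps to 0x001F.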

Page 30: Agenda

• Market & value drivers
• What to optimize?
• How to optimize?
• Results & conclusion

Page 31: Optimizing Dalvik VM

Page 32: Optimizing Dalvik VM: Dalvik JIT Optimization

[Chart: Relative Performance Compared to Interpreter Performance, for CoreMark and CaffeineMark. Annotations: “x 4.9 … and going”, “x 11.3 … and going”, “~20%”, “Without L2 cache”]

    1.9    4.9    /MHz
    37     90     /mW
    14     35     /MHz/mm²

• measurements are done on 50 MHz FPGA
• results are without performance gains from hardware extensions

Page 33: Optimizing Hardware: Custom Instructions & Prefetching

[Chart: Execution time (ARC cycles, axis from 0 to 70000) versus Memory latency (cycles: 0, 1, 2, 4, 8, 16, 32, 64, 128). Series: Original code - No prefetching; Code with Android extensions - No prefetching; Original code - Prefetching; Code with Android extensions - Prefetching]

Page 34: Linux kernel + ARC HW optimizations

Page 35: Conclusions

• There are more markets for Android than high-end smartphones
• There are more optimizations possible than relying on Moore’s law for GHz multi-cores
• Optimize performance / mW & performance / area
• Sweetspot: “heterogeneous, HW accelerated multi-core”
  – Mix of CPU, DSP, and dedicated HW
  – Highly optimized platform infrastructure SW hides heterogeneous complexities
• ‘Simple’ ARC processor with SW optimized Dalvik VM performs as well as or better than others, thanks to careful SW optimizations and the use of simple HW acceleration
  – Custom instructions tailored for specific tasks
  – Prefetcher instead of a general-purpose 2nd-level cache
  – DSP more efficient in audio processing than CPU

Page 36: Fast Forward to Predictable Success