Ph.D. Comprehensive Examination José A. Baiocchi Paredes Department of Computer Science University of Pittsburgh Towards Virtualization of Embedded Systems with Scratchpad Memory

Page 1:

Ph.D. Comprehensive Examination

José A. Baiocchi Paredes
Department of Computer Science
University of Pittsburgh

Towards Virtualization of Embedded Systems with Scratchpad Memory

Page 2: Overview

[Figure: concept map centered on System Virtualization. Nodes: Paravirtualization (OS Assisted), Full System Virtualization, Trap-And-Emulate (Classic), Hardware Assisted Virtualization, Memory Resource Management. Edge labels: needs, approaches.]

Page 3: System Virtualization

• Allows multiple operating systems to share the hardware
• Uses: server consolidation, co-located hosting, distributed web services, application mobility, secure computing platforms, etc.

[Figure: two Virtual Machine Monitor organizations, each hosting three virtual machines (Guest OS 1 to 3, each with its user apps). Type I, "Bare Metal": the VMM runs directly on the hardware. Type II, "Hosted": the VMM runs on top of a host OS.]

Page 4: Classical VMM

• Instruction behavior
  • Sensitive instructions (S)
    • control-sensitive: change resource configuration
    • behavior-sensitive: depend on resource configuration
  • Privileged instructions (P)
    • trap in user mode
    • don't trap in supervisor mode
• A VMM can be built if S ⊆ P
• Trap-And-Emulate
  • Deprivileging: guest OS in user mode, VMM in supervisor mode
  • Impossible for x86!
• VMM properties: Efficiency, Resource Control, Equivalence

[Figure: instructions classified as innocuous or sensitive and as nonprivileged or privileged, with S contained in P. User applications and the guest OS run on the ISA exported by the VMM; a trap enters the VMM, which consists of a dispatcher, an allocator and emulation routines on top of the hardware ISA.]

Popek & Goldberg, Formal Req. for Virt. 3rd Gen. Architectures, CACM'74
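The subset condition and the trap-and-emulate loop can be sketched in a few lines. This is a minimal illustration, assuming a made-up accumulator machine: the instruction names, the `ToyVMM` class and the single virtual control register are hypothetical and are not taken from Popek & Goldberg or from any real ISA (except `popf`, used in the check below as a well-known x86 instruction that is sensitive but not privileged).

```python
# Toy instruction classes; names and membership are illustrative assumptions.
SENSITIVE = {"load_cr", "store_cr", "halt"}
PRIVILEGED = {"load_cr", "store_cr", "halt", "io_out"}


def classically_virtualizable(sensitive, privileged):
    """Popek & Goldberg condition: every sensitive instruction is privileged."""
    return sensitive <= privileged


class ToyVMM:
    """Runs a deprivileged guest; privileged instructions trap and are emulated."""

    def __init__(self):
        self.virtual_cr = 0   # the guest's view of a control register
        self.traps = 0

    def run(self, program):
        acc = 0
        for op, arg in program:
            if op in PRIVILEGED:
                self.traps += 1              # real hardware would trap to the VMM
                acc = self.emulate(op, arg, acc)
            elif op == "add":                # innocuous: executes directly
                acc += arg
            else:
                raise ValueError(f"unknown instruction {op!r}")
        return acc

    def emulate(self, op, arg, acc):
        """Emulation routine: act on virtual state, never on the real register."""
        if op == "store_cr":
            self.virtual_cr = arg
        elif op == "load_cr":
            acc = self.virtual_cr
        return acc
```

Adding an instruction such as `popf` to the sensitive set without adding it to the privileged set makes `classically_virtualizable` return `False`, which is the situation the slide flags for x86.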

Page 5: x86 Virtualization Challenges

• Protection (segmentation)
  • 4 privilege levels (rings)
  • Segment access by privilege level (PL)
  • Deprivileging: 0/1/3, 0/3/3
• Sensitive structures
  • On-chip: control registers, table registers, etc.
  • Off-chip: segment descriptor tables, page tables, interrupt tables, etc.
  • Shadow structures
  • Tracing: write-protected primary structures
• Sensitive unprivileged instructions

[Figure: privilege rings 3 to 0, with the OS in ring 0 and applications in ring 3. Segmentation: a logical address is mapped to the linear address space through a segment descriptor (with its DPL) found in the GDT or LDT, located by %gdtr and %ldtr; the CPL is held in %cs. Paging: %cr3 points to the page directory, which leads to a page table and a page in the physical address space; translations are cached in the TLBs.]

Page 6: Paravirtualization

• Guest OS modifications (0/1/3)
  • Paravirtualized x86 interface
  • OS can't evolve independently!
• CPU
  • Xen-validated exception handlers
  • 'Fast' handler for system calls
  • Timer: real, virtual, wall-clock
• Memory
  • Xen in top 64MB of address space
  • Validated updates to segment descriptor tables and page tables
• I/O
  • Buffer-descriptor rings
  • HW interrupts replaced by events
• Domain0 runs control software
• Properties considered: Efficiency, Resource Control, Equivalence

[Figure: the Xen hypervisor runs on the x86 hardware and exports a Domain0 control interface, a virtual x86 CPU, virtual physical memory, a virtual network interface and a virtual block device. Domain0 runs the control plane software; paravirtualized guest OSes with Xen device drivers (XDD) run user apps over an unchanged ABI.]

Barham et al., Xen and the Art of Virtualization, SOSP'03

Page 7: Hardware Assisted VMM

• x86 extensions
  • 1st gen: AMD-V™, Intel® VT-x enable trap-and-emulate
• Guest OS runs in new guest mode, VMM in host mode
  • 4 privilege rings in both modes
• Host to guest: vmrun
• Virtual Machine Control Block (VMCB)
  • Host state + guest state + control fields
• Guest to host: exit conditions
  • Diagnostic fields to aid VMM
• Properties considered: Efficiency, Resource Control, Equivalence

[Figure: user applications and the guest OS run on the x86 interface; an exit transfers control to the VMM (dispatcher, allocator, emulation routines), which runs on the extended hardware (x86+) and shares the VMCB with it.]

Adams & Agesen, HW & SW Techniques for x86 Virtualization, ASPLOS'06

Page 8: Full System Virtualization

• Direct execution of ring 3 code
• Binary translation of ring 0 code
• Dynamic Binary Translator (DBT)
  • Input: any x86 code, no ABI assumptions
  • Output: subset of x86 code, stored in the Code Cache (CC), runs in ring 3
• Privileged instruction replacement
  • Simple: in-CC sequences
  • Complex: callout-and-emulate
• Adaptive BT
  • Frequent traps replaced by callouts
  • Reverted when trapping is infrequent
• Properties considered: Efficiency, Resource Control, Equivalence

[Figure: user applications and the guest OS see the x86 interface; the VMM contains the DBT, the code cache (CC) and emulation routines, and runs on x86 hardware.]

Adams & Agesen, HW & SW Techniques for x86 Virtualization, ASPLOS'06
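The adaptive binary translation policy can be pictured as a small per-instruction state machine. The sketch below is an assumption-laden illustration, not VMware's implementation: the class `AdaptiveBT`, the `hot` and `cold` thresholds and the notion of a monitoring "window" are all hypothetical. An instruction starts out translated normally, is retranslated into a callout once it has trapped often enough, and is reverted when the accesses that would have trapped become rare.

```python
class AdaptiveBT:
    """Per-instruction choice between letting an access trap and using a callout."""

    def __init__(self, hot=3, cold=1):
        self.hot = hot      # traps per window that trigger retranslation
        self.cold = cold    # at or below this count, a callout is reverted
        self.mode = {}      # guest PC -> "callout"; absent means plain translation
        self.hits = {}      # guest PC -> accesses this window that would trap

    def execute(self, pc, would_trap):
        """Report one execution of the instruction at pc; return how it was handled."""
        if would_trap:
            self.hits[pc] = self.hits.get(pc, 0) + 1
        if self.mode.get(pc) == "callout":
            return "callout"                 # no trap: the VMM is called directly
        if not would_trap:
            return "direct"
        if self.hits[pc] >= self.hot:
            self.mode[pc] = "callout"        # retranslate for future executions
        return "trap"

    def end_window(self):
        """Revert callouts whose instructions rarely needed the VMM this window."""
        for pc in list(self.mode):
            if self.hits.get(pc, 0) <= self.cold:
                del self.mode[pc]
        self.hits.clear()
```

The design choice worth noting is that the callout path keeps counting the accesses that would have trapped; without that, the translator would have no evidence for reverting.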

Page 9: Memory Resource Mgmt.

• Virtual physical memory
  • physical addr. → machine addr.
  • VM config.: min, max, shares
• Content-based page sharing
  • Reduces memory pressure
  • Identical pages: copy-on-write
• Share-based allocation
  • Min-funding revocation
  • Idle memory tax: adjusted shares-per-page ratio $\rho = \frac{S}{P \cdot (f + k \cdot (1 - f))}$, with S shares, P allocated pages, f the fraction of active pages and k the idle page cost
• Reclamation
  • Ballooning forces the guest OS to make paging decisions
  • Fallback to demand paging

[Figure: VMware ESX runs on the hardware and maps each VM's physical memory onto machine memory. Inside each VM, user apps see linear memory and the guest OS sees physical memory; a balloon in the guest's physical memory is used for reclamation.]

Waldspurger, Memory Res. Mgmt. in VMware ESX Server, OSDI'02
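Min-funding revocation with the idle memory tax can be made concrete with a small calculation. The sketch below assumes the adjusted shares-per-page ratio from Waldspurger's paper, rho = S / (P * (f + k * (1 - f))) with k = 1 / (1 - tax_rate); the function names, the dictionary layout and the example VMs are hypothetical, and the 0.75 default is only a parameter choice for the example.

```python
def shares_per_page(shares, pages, active_fraction, tax_rate=0.75):
    """Adjusted shares-per-page ratio: rho = S / (P * (f + k * (1 - f)))."""
    k = 1.0 / (1.0 - tax_rate)   # idle page cost; tax_rate must be below 1
    f = active_fraction
    return shares / (pages * (f + k * (1.0 - f)))


def pick_victim(vms, tax_rate=0.75):
    """Min-funding revocation: reclaim from the VM with the lowest ratio."""
    return min(vms, key=lambda name: shares_per_page(tax_rate=tax_rate, **vms[name]))


# Two hypothetical VMs with equal shares and allocations but different activity.
vms = {
    "busy": dict(shares=1000, pages=100, active_fraction=1.0),
    "idle": dict(shares=1000, pages=100, active_fraction=0.0),
}
```

With a tax rate of 0 the idle page cost k is 1 and the ratio reduces to plain shares per page, so activity no longer influences which VM loses memory.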

Page 10: Overview

[Figure: concept map extended with Dynamic Binary Translation, Dynamic Binary Optimization and Code Cache Management, next to System Virtualization and its nodes: Paravirtualization (OS Assisted), Full System Virtualization, Trap-And-Emulate (Classic), Hardware Assisted Virtualization, Memory Resource Management. Edge labels: needs, based on, approaches, enables.]

Page 11: Dynamic Binary Translation

• Modifies the binary instructions of a running program before they execute on the host platform
• Uses: emulation, virtualization, dynamic optimization, code security (shepherding), dynamic instrumentation, software I-caching, etc.

[Figure: alternative placements of the DBT in the software stack, drawn with the layers App, Guest OS, DBT, Host OS and HW.]

Page 12: Generic DBT operation

• Conditional branch: stop (the fragment ends and exit stubs are emitted)

[Figure: translator flowchart. Context Save; New PC; Cached? Y: Context Restore and run from the code cache. N: New Fragment, then Fetch, Decode, Translate, Next PC, repeated until End of fragment? is Y. The binary is a control-flow graph of blocks A to E and G to J with a call and a return. The code cache holds the fragment for A, followed by fragment exit stubs to B and to C.]

Scott et al., Retarget. & reconfig. SDT, CGO'03
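The dispatch loop in the flowchart (look up the PC, build a fragment on a miss, run it from the code cache, repeat) can be sketched on a toy machine. Everything here is an assumption made for illustration: the three-instruction ISA (`add`, `bnz`, `jmp`, plus `halt`), the tuple encoding and the function names are hypothetical and do not come from Scott et al. For brevity the sketch ends a fragment at every control transfer, whereas the slides elide unconditional branches.

```python
def build_fragment(program, pc):
    """Fetch, decode and translate from pc until a control transfer ends the fragment."""
    body = []
    while True:
        op, arg = program[pc]
        if op in ("jmp", "bnz", "halt"):
            # The exit records the branch kind, its target and the fall-through PC.
            return body, (op, arg, pc + 1)
        if op != "add":
            raise ValueError(f"unknown instruction {op!r}")
        body.append(arg)
        pc += 1


def run(program, entry=0):
    """Execute program through a code cache; return (accumulator, translations)."""
    code_cache = {}                      # guest PC -> translated fragment
    acc, pc, translations = 0, entry, 0
    while pc is not None:
        if pc not in code_cache:         # "Cached?" answered N: translate
            code_cache[pc] = build_fragment(program, pc)
            translations += 1
        body, (op, target, fallthrough) = code_cache[pc]
        for n in body:                   # run the fragment body
            acc += n
        if op == "halt":
            pc = None
        elif op == "jmp":
            pc = target
        else:                            # bnz: branch if the accumulator is non-zero
            pc = target if acc != 0 else fallthrough
    return acc, translations


# Countdown loop: acc = 3; repeat acc -= 1 until acc == 0.
countdown = [("add", 3), ("add", -1), ("bnz", 1), ("halt", None)]
```

Running `run(countdown)` translates each distinct fragment entry point once, however many times the loop body executes afterwards.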

Page 13: Generic DBT operation

• Branch and link: emulate side effects and elide
• Unconditional branch: elide
• Indirect branch: indirect exit stub

[Figure: same translator flowchart and binary as on the previous page. The code cache now holds fragments built from blocks A, C, D, G, H, J and E, with exit stubs to B, C, H, I and A and an indirect (ind) exit stub.]

Scott et al., Retarget. & reconfig. SDT, CGO'03

Page 14: Generic DBT operation

• Reducing context switches
  • Fragment linking for direct targets
  • Indirect branch target cache (IBTC) for indirects

[Figure: same translator flowchart, binary and code cache as on the previous page. The indirect exit now performs an IBTC lookup that maps the computed target to the translated target.]

Kumar et al., Compile-time planning overhead reduc. SDT, IJPP'05
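An indirect branch target cache can be sketched as a small direct-mapped table from computed (guest) targets to translated targets. The class below is a hypothetical illustration, not the design from Kumar et al.: the table size, the modulo indexing and the `translate` callback that stands in for a context switch to the translator are all assumptions.

```python
class IBTC:
    """Direct-mapped cache from guest branch targets to code cache addresses."""

    def __init__(self, size=4):
        self.size = size
        self.entries = [None] * size   # slot -> (guest_target, translated_target)
        self.misses = 0

    def lookup(self, guest_target, translate):
        slot = guest_target % self.size
        entry = self.entries[slot]
        if entry is not None and entry[0] == guest_target:
            return entry[1]            # hit: execution stays in the code cache
        self.misses += 1               # miss: context switch to the translator
        translated = translate(guest_target)
        self.entries[slot] = (guest_target, translated)
        return translated


def toy_translate(guest_pc):
    """Stand-in for the translator: pretend fragments live at 0x1000 + guest PC."""
    return 0x1000 + guest_pc
```

Two targets that map to the same slot evict each other, so the table size trades memory for the number of context switches on indirect branches.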

Page 15: Dynamic Optimization

Trace selection

[Figure: Dynamo flowchart. Interpret until taken branch, then look up the branch target address (BTA) in the code cache. Cached? Y: context restore and run from the code cache, with a context save on exit. N: Start of trace? Y: increment counter; Hot? Y: interpret + codegen until taken branch, filling the hot trace buffer, until End of trace? is Y. The DBO then optimizes the trace, forms fragments and links fragments (trace selector). The binary is a control-flow graph of blocks A to E and G to J with a call and a return.]

Bala et al., Dynamo, PLDI'00

Page 16: Dynamic Optimization

Trace formation: Most Recently Executed Tail (MRET)

[Figure: same Dynamo flowchart and binary as on the previous page. The hot trace buffer now holds the trace A, C, D, E, G, H, J.]

Bala et al., Dynamo, PLDI'00
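Most Recently Executed Tail selection can be approximated in a few lines: count how often each backward-branch target is reached, and once a target is hot, record the blocks that execute next as the trace. The sketch below is a simplification under stated assumptions: blocks are identified by their start addresses, a backward branch is detected when the next block's address is not greater than the previous one, and a trace ends at the next backward branch. The function name and the threshold are hypothetical; Dynamo's actual start-of-trace and end-of-trace conditions are richer.

```python
def select_traces(executed_blocks, hot_threshold=3):
    """Return {trace head: [blocks]} selected from a stream of block addresses."""
    counters, traces = {}, {}
    recording, head, prev = None, None, None
    for block in executed_blocks:
        backward = prev is not None and block <= prev
        if recording is not None:
            if backward:                       # end-of-trace condition
                traces[head] = recording
                recording = None
            else:
                recording.append(block)        # still on the most recent tail
                prev = block
                continue
        if backward and block not in traces:   # start-of-trace candidate
            counters[block] = counters.get(block, 0) + 1
            if counters[block] >= hot_threshold:
                recording, head = [block], block
        prev = block
    return traces
```

For a loop over blocks at addresses 10, 30 and 40, the head at 10 becomes hot on its third backward-branch arrival and the following iteration is captured as the trace.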

Page 17: Dynamic Optimization

Trace optimization: IR, 2 passes (forward + backward)

• Branch fixup
• Redundancy elimination
• Compensation blocks
• Copy propagation
• Loop unrolling
• etc.

[Figure: same Dynamo flowchart and binary as on the previous page. The selected trace is laid out as the straight-line sequence A, C, D, G, H, J, E.]

Bala et al., Dynamo, PLDI'00

Page 18: Dynamic Optimization

Fragment formation and linking

[Figure: same Dynamo flowchart and binary as on the previous page. The code cache holds the fragment A, C, D, G, H, J, E with exits to B and to I, and a second fragment B, D, G, I, J, E with an exit to H.]

Bala et al., Dynamo, PLDI'00

Page 19: Code Cache Management

• Handle CC overflows
• Overhead sources
  • miss rate
  • eviction frequency
  • unlinking cost
• Strategies: FLUSH, FIFO, mid-grained, generational

[Figure: the DBT looks up the PC in the map. Hit: context restore and run from the code cache. Miss: region formation, then the code cache manager checks Room in CC? N: evict code. Y: update map and insert. For the mid-grained strategy the code cache is divided into cache units. For the generational strategy a nursery cache (FIFO) feeds a probation cache (instrumented); HOT code moves to a persistent cache and COLD code is discarded.]

Hazelwood & Smith, Managing Bounded Code Caches, TACO'03
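The "Room in CC?" decision with FIFO eviction can be sketched as follows. This is a minimal model under assumptions, not the implementation evaluated by Hazelwood & Smith: the class name, the byte-count capacity and the choice to track only fragment sizes are hypothetical, and unlinking costs are ignored.

```python
from collections import OrderedDict


class FifoCodeCache:
    """Bounded code cache that evicts the oldest fragments to make room."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0
        self.fragments = OrderedDict()   # guest PC -> fragment size, oldest first
        self.evictions = 0

    def lookup(self, pc):
        return pc in self.fragments

    def insert(self, pc, size):
        if size > self.capacity:
            raise ValueError("fragment larger than the code cache")
        if pc in self.fragments:
            return                                   # already cached
        while self.used + size > self.capacity:      # "Room in CC?" answered N
            _, old_size = self.fragments.popitem(last=False)   # evict oldest
            self.used -= old_size
            self.evictions += 1
        self.fragments[pc] = size                    # update map and insert
        self.used += size
```

FLUSH corresponds to evicting everything on overflow, and a mid-grained policy would evict one cache unit (a group of fragments) at a time; both can be modeled by changing how many entries the eviction loop removes.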

Page 20: Overview

[Figure: concept map extended with Embedded Systems, Scratchpad Memory, Software-based Instruction Cache and Compiler-generated Overlays, next to Dynamic Binary Translation, Dynamic Binary Optimization, Code Cache Management, System Virtualization, Paravirtualization (OS Assisted), Full System Virtualization, Trap-And-Emulate (Classic), Hardware Assisted Virtualization and Memory Resource Management. Edge labels: needs, based on, approaches, have, enables.]

Page 21: Scratchpad Memory (SPM)

• Software-controlled SRAM
• Replaces or complements caches
• Advantages: fast, smaller than cache, energy-efficient, better timing predictability
• How to manage SPM?
  • Static partitioning
  • Software caching
  • Overlays

[Figure: three System-on-Chip (SoC) organizations, each with a CPU, ROM and SPM on chip and main memory in DRAM: SPM only, SPM with an L1 instruction cache (I-L1), and SPM with L1 instruction and data caches (I-L1, D-L1).]

Page 22: SW-Based I-Cache

• Binary rewriter
  • Basic block formation: splitting & padding
• Almost a DBT!!! (offline region formation)

[Figure: the binary rewriter turns the binary (blocks A to E and G to I, with a call and a return) into blocks A, B, C1, C2, D, E, G, H, I and a destinations table. At run time (RUN), the runtime in the SPM uses the entry points EP1, EP2 and IndEP to bring blocks such as A and C1 from memory into the SPM.]

Miller & Agarwal, Software-based Instruction Caching, ASPLOS'06
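The run-time half of a software instruction cache can be sketched as a table lookup plus a copy on a miss. The model below is an illustration under assumptions and is not Miller & Agarwal's mechanism: equally sized blocks are represented by opaque payloads, the destinations table is a dictionary from block name to SPM slot, and the round-robin replacement policy is a hypothetical choice.

```python
class SoftwareICache:
    """Keeps a few code blocks in SPM slots and copies blocks in on a miss."""

    def __init__(self, memory_blocks, spm_slots):
        self.memory = memory_blocks       # block name -> code payload in main memory
        self.spm = [None] * spm_slots     # slot -> (block name, code payload)
        self.dest = {}                    # destinations table: block name -> slot
        self.next_victim = 0
        self.misses = 0

    def fetch(self, name):
        slot = self.dest.get(name)
        if slot is None:                  # miss: the runtime copies the block in
            self.misses += 1
            slot = self.next_victim
            self.next_victim = (slot + 1) % len(self.spm)
            if self.spm[slot] is not None:
                del self.dest[self.spm[slot][0]]   # forget the evicted block
            self.spm[slot] = (name, self.memory[name])
            self.dest[name] = slot
        return self.spm[slot][1]
```

Because every transfer goes through the table, the structure closely resembles a DBT's code cache map, which is the similarity the slide points out.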

Page 23: Compiler-generated SPM Overlays

• Compiler introduces code to copy objects from memory to SPM and back at selected program points
• Questions:
  • Which objects to promote/demote?
  • At what (profitable) program points?
• Needs to know:
  • Profile information
  • SPM size

Page 24: Concomitance + SMI

• Concomitance measures the temporal distance of block(s) execution
  • Large self-concomitance → SPM
  • Large concomitance (2 blocks) → can't overlay
• Program graph partitioning
  • Nodes: blocks with large self-concomitance
  • Partition into overlays
• Insert SMI in CFG edges
  • Special instruction to copy code from memory to scratchpad
  • Supported by SPM controller

[Figure: SPM controller placed between the CPU and the instruction memories (from/to CPU; to I-MEM and I-SPM). It contains control logic, a memory controller and a Basic Block Table (BBT) whose entries hold the address in DRAM, the size and the address in SPM. An SMI consists of the SMI opcode and one operand, the BBT address.]

Janapsatya et al., Expl. Statistical Info. for Implem. Instr. SPM, TVLSI'06

Page 25: Data-Program Rel. Graph

• For globals, stack variables and code (procedures)
• Program points based on control flow
• DPRG represents program regions and their time order
• Code inserted to promote/demote objects
• Usage information from profile
• Liveness analysis to eliminate unnecessary transfers
• Problems: pointers, join nodes, gotos

Udayakumaran et al., Dynamic Allocation for SPM, TECS'06

Page 26: Optimal Scratchpad Overlay

• For globals, non-scalar locals and code traces
• Based on live ranges (profile for variables, static analysis for traces)
• Memory assignment: NP-complete, reduces to register allocation
• Solutions:
  • Optimal: ILP formulation (16 sec.)
  • Near optimal: heuristic

Steps:
1. Memory Object Determination
2. Liveness Analysis
3. Memory Assignment
4. Onchip Address Assignment
5. Code Generation

Verma & Marwedel, Overlay Techniques for SPM, TVLSI'06
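The memory assignment step can be illustrated with a simple greedy pass over live ranges. This sketch is not the ILP formulation or the heuristic of Verma & Marwedel; it is a hypothetical stand-in that only conveys the idea that objects with disjoint live ranges may share SPM space. The function name, the (size, benefit, live range) tuples and the benefit-per-byte ordering are assumptions, sizes are assumed positive, and on-chip address assignment (and hence fragmentation) is ignored.

```python
def greedy_overlay(objects, spm_size):
    """Pick objects for the SPM so that live objects never exceed spm_size.

    objects maps a name to (size, benefit, (start, end)), where the live
    range is half-open: the object is live for start <= t < end.
    """
    chosen = []
    by_density = sorted(objects, key=lambda n: objects[n][1] / objects[n][0],
                        reverse=True)
    for name in by_density:
        size, _, (start, end) = objects[name]
        # SPM usage only rises where a live range starts, so it is enough to
        # check the candidate's own start and the starts that fall inside it.
        points = {start} | {objects[c][2][0] for c in chosen
                            if start <= objects[c][2][0] < end}
        fits = all(
            size + sum(objects[c][0] for c in chosen
                       if objects[c][2][0] <= t < objects[c][2][1]) <= spm_size
            for t in points
        )
        if fits:
            chosen.append(name)
    return sorted(chosen)


# Hypothetical objects: "a" and "b" never overlap in time, "c" overlaps both.
example = {
    "a": (60, 600, (0, 10)),
    "b": (60, 300, (10, 20)),
    "c": (60, 100, (5, 15)),
}
```

In the example, `a` and `b` together exceed a 100-byte SPM in size but can overlay each other because their live ranges are disjoint, while `c` is rejected because it is live at the same time as both.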

Page 27: Conclusions

• DBT-based virtualization transparently virtualizes general-purpose architectures (x86)
  • Paravirtualization sacrifices OS-independence
  • Hardware-assisted virtualization is not yet as efficient and increases HW cost
• Software i-caching manages SPM for code at runtime
  • DBT can provide it (CC in SPM)
  • Compiler-generated overlays already use profile information, but need to know the SPM size
  • DBO ideas (trace selection) can be adapted to exploit SPM for code
• DBT for embedded systems: exploit SPM and enable virtualization