Post on 19-Aug-2018
Jakob Engblom, PhD
Uppsala University & Virtutech Inc.
jakob.engblom@it.uu.se
jakob@virtutech.com
Embedded Systems Computer Architecture
14 Nov 2003 Embedded Computer Architecture 2
Embedded Systems
Embedded Systems
  It is a snake!
  No, a wall!
  No, a pillar!
  No, it is a tree trunk!
  You're all wrong, it is a fan!
  Now what is this elephant thing?
Embedded Systems
  "A computer that doesn't look like a computer"
  Interacts with the world
  Primitive or no user interface
  Part of other products
Embedded Systems
  Single purpose products
    Not general purpose like desktop PCs
    Do one thing very efficiently
  Software very important:
    Gives character to product
    Used to differentiate inside a "platform"
    Can be changed late
    Processor cheaper than special HW
    Today, dominates dev cost
Processor Market
  Embedded = most processors!
    200 million PC and server
    8000 million embedded
  [Pie chart: "Desktop" 2%, "Embedded" 98%]
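The split in the pie chart can be checked with a little arithmetic. The sketch below uses only the two unit counts quoted on the slide; the function name `market_share` is made up for illustration.

```python
def market_share(units_by_segment):
    """Return each segment's share of total units, as a fraction."""
    total = sum(units_by_segment.values())
    return {name: units / total for name, units in units_by_segment.items()}


# Unit counts from the slide, in millions of processors.
shares = market_share({"desktop": 200, "embedded": 8000})
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The computed shares land close to the rounded 2% and 98% shown in the chart.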
Processor Market
  Processors: 50% of all semiconductor revenue
    Explains why everyone wants to do processors
  32-bit dominant
    30% of total semiconductors
  PC processors: 50% of CPU revenue
    15% of total semiconductors
    AMD and Intel share it
  [Chart: share of units and of money by processor type (4-bit, 8-bit, 16-bit, 32-bit, DSP), 0% to 100%]
Real-Time System
  Timing as important as result
  Hard real-time:
    Hard deadlines
    Dead if missed deadline
    Worst-case
  Soft real-time:
    Fuzzier deadlines
    Can miss some deadlines
    Average-case
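The difference between the two classes can be sketched as two acceptance checks over measured response times. This is only an illustration: the sample values, the function names, and the idea of expressing "can miss some deadlines" as a maximum miss ratio are assumptions, not something the slides define.

```python
def meets_hard_deadline(response_times_ms, deadline_ms):
    """Hard real-time: the worst case must meet the deadline."""
    return max(response_times_ms) <= deadline_ms


def meets_soft_deadline(response_times_ms, deadline_ms, max_miss_ratio):
    """Soft real-time: a bounded fraction of deadlines may be missed."""
    misses = sum(1 for t in response_times_ms if t > deadline_ms)
    return misses / len(response_times_ms) <= max_miss_ratio


# Hypothetical measured response times in milliseconds.
samples = [4.0, 5.0, 4.5, 12.0, 4.2, 4.8, 5.1, 4.9, 4.4, 4.6]
print(meets_hard_deadline(samples, deadline_ms=10.0))
print(meets_soft_deadline(samples, deadline_ms=10.0, max_miss_ratio=0.1))
```

A single slow sample is enough to fail the hard check, while the soft check tolerates it as long as the miss ratio stays within the chosen bound.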
Real-Time Systems
  Embedded and Real-Time: Synonymous?
    Most embedded systems are real-time
    Most real-time systems are embedded
  [Venn diagram: "embedded" and "real-time" overlapping in "embedded real-time"]
Simple Embedded Systems
  8-bit Hitachi H8/300, 32 kB ROM, 32 kB RAM
    Standard microcontroller chip
    Byte-code machine, sensor drivers, …
  8-bit Intel 8051, standard microcontroller
    Behavior, talk, IR communications
Fun App: Smart Beer Glass
  8-bit, 8-pin PIC processor
  Capacitive sensor for fluid level
  Inductive coil for RF ID activation & power
  CPU and reading coil in the table. Reports the level of fluid in the glass, alerts servers when close to empty
  Contactless transmission of power and readings
No Upgrades Possible
  Once a product ships…
    …it often cannot be serviced
  No download ability
    No writable persistent storage
    No disks
    No loader
  Software is write-once
    (There are exceptions)
Consumer Electronics
  Heterogeneous multiprocessor
    8-bit Atmel AVR for UI, games, …
    16-bit fixed-point TI C54 DSP for GSM coding, radio interface, …
    32-bit ARM7 in Bluetooth module
    + maybe ARM7 in IrDA interface
  All in custom chips
  Software is large:
    16 MB of code in control part
    Plus signal processing code
Automotive
  Multiple networks
    CAN for body electronics: 30+ nodes
    CAN for engine control: few nodes
    LIN for instruments
  Many processors
    Up to 100
  Large diversity in processor types:
    8-bit CPUs (PIC, HC08) for door locks, lights, etc.
    16-bit CPUs (C167, HC11, HC12) for most functions
    32-bit CPUs (PPC, V850) for engine control, airbags
  Total amount of code: 40-50 MB
Automotive
  Form follows function
    Processing where the action is
    Architecture given by application
    Sensors and actuators distributed
  Heterogeneous systems
    Many different makes of CPUs
    Standardized at the network/bus
Timing Aspects
  Interrupt latency
    Important criterion for embedded
    A few clock cycles at most
    Measure of RTOS performance
  Real-Time = predictability
    In-order pipelines
    SRAM instead of caches
    Lockable caches
    Several small CPUs instead of one big
Military Shipboard
  Standard multiprocessor UltraSparc servers for radar, target tracking, combat control, …
  Many CPUs in missiles, gun controls, engines, …
Mobile Phone Base Station
  Handle signals
    Data streams to and from phones
  Massively parallel system
    Thousands of DSP tasks
    Perfect parallel scalability
  Custom or standard DSPs
    Up to 8 DSPs on a single chip
Trends
  Hardware to software
    Increase flexibility, lower cost
    Software on fast processor can equal HW
  Software to hardware
    Better power consumption & performance
    Design custom hardware for application
  Hardware-software codesign
    Delay division HW/SW to late in project
    Obtain "optimal" HW/SW division
System-on-a-chip
  Integration extreme
    Thanks to modern semiconductors
  Entire product on a chip
    One or more processors, accelerators, …
  [Block diagram: CPU, DSP, LCD driver, Bluetooth, GSM radio, code memory, and data memory connected by an on-chip bus]
Embedded Processing
Microcontrollers
  Classic embedded hardware
  Standard parts
    Quite broad application domains
    Sold in large series
    Defined by hardware vendors
    As cheap as a single dollar
  Single processor + devices
  Huge number of variants
  Usually intended for control plane
Microcontroller
  A single chip:
    CPU core
    Integrated memory
    Integrated peripherals
    Integrated services
  Goal:
    System on one chip
    No external HW
    Fit application "perfectly"
  [Block diagram: CPU core, RAM (small), ROM (big), UART, A/D, timer, and LCD, connected to the outside world]
Microcontroller
  CPU bitness: 4 to 64 bits
    Most common: 8 bit (4G units)
    32-bit growing fastest
    32/64-bit outnumbers desktop
  Frequency: DC to 2 GHz
  Memory on-chip: 0.5 kB to 5 MB
  Power: mW (and up)
  1/30 to 10 instructions per cycle
Example: PIC 12CE674
  Memory arch: Harvard
  Program memory: 2048 x 14 (OTP/Flash)
  EEPROM: 16 bytes
  RAM: 128 bytes
  ADC channels: 4 (8 bits)
  I/O ports: 6
  Timers: One 8-bit, one WDT
  Clock: on-chip crystal, 10 MHz
  Package: 8 pins (Pentium 4: 700 pins)
  Cost: <$1.00 (Pentium 4: >$200.00)
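The "2048 x 14" figure describes program memory in 14-bit words rather than bytes, which a Harvard design allows because instructions and data live in separate memories. A quick sketch of what that amounts to, using only the numbers on the slide (the constant names are illustrative):

```python
PROGRAM_WORDS = 2048   # program memory depth from the slide
WORD_BITS = 14         # program word width from the slide
RAM_BYTES = 128        # data RAM from the slide

program_bits = PROGRAM_WORDS * WORD_BITS
program_bytes = program_bits // 8
code_to_ram_ratio = program_bytes // RAM_BYTES

print(program_bits, program_bytes, code_to_ram_ratio)
```

Program storage works out to roughly 3.5 kB, many times larger than the data RAM, which matches the "ROM (big), RAM (small)" picture of a microcontroller.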
Example: AT91M42800A
  ARM7TDMI 32-bit core
    Static design: 0 to 33 MHz
  Memory
    8 kB SRAM on chip
    External memory interface, 8/16 bit interface
  Devices
    6 timers
    2 serial ports
  JTAG debug interface
  About 0.5 W power
  About 18 USD
  144 pin package
  One of 13 AT91 variants
Devices on the Chip
  Interface with the world
    Digital I/O
    Analog/Digital conversion
    Digital/Analog conversion
  Communications
    CAN networks
    Ethernet networks
    Radio
    Serial ports (UART, USART)
    USB, FireWire, ...
Devices on the Chip
  Timers
    Trigger interrupts
    Watchdogs
  Graphics
    LCD drivers
    2D/3D graphics acceleration
  Buses
    On-chip: between devices: AMBA, …
    Off-chip: PCI, HyperTransport, RapidIO, …
ASIPs / ASSPs
  Application-specific integrated/standard processor
  Targeting a particular niche market
    More targeted than microcontroller
    Domain-specific accelerators
  Usually more upscale
    32-bit processors
    Multiprocessors
    Expensive peripherals
    External memory assumed
    Higher performance, includes data-plane
Example: PowerQUICC III
  Motorola
  Target market
    Communications
  Processing
    PowerPC e500
    666-1000 MHz
    256 kB L2 cache
  Networking
    CPM module, RISC-based microcode
  About 160 USD
  Features and capabilities (count, item):
    256  Multichannel HDLC (from MCC2)
    2    Utopia II ATM (from FCC)
    2    Ethernet 10/100/1000
    3    Ethernet, 10/100 (from FCC)
    4    Ethernet, 10 (from SCC)
    2    Ethernet 10/100/1000 controller
    1    RapidIO controller
    1    PCI-X/PCI controller
    1    DDR Memory controller
    1    I2C controller
    1    Serial Peripheral Interface (SPI)
    2    Serial Management Controller (SMC)
    2    Multi-Channel Controller (MCC2)
    3    Fast Communications Controller (FCC)
    4    Serial Communications Controller (SCC)
Example: C167CS
  Infineon
  Target market
    Automotive control
  Processing
    16-bit C16x core
    4-stage simple pipeline
    40 MHz operation
    16 MB memory space, including ROM, RAM, devices
  144 pin package
  Tolerates -40 C to +125 C
  About 25 USD
  Memory:
    32 kB  ROM
    3 kB   Fast General Internal RAM (IRAM)
    8 kB   Extension Internal RAM (XRAM)
  External Ports:
    1      16-bit ports from devices
    8      8-bit ports from devices
    2      CAN interfaces
  Devices:
    2      CAN 2.0b controllers
    5      General-Purpose Timers (GPT)
    1      Watch-Dog Timer (WDT)
    1      Pulse-Width Modulator (PWM)
    24+8   Analog-Digital Converter Channels
    1      USART
    2x16   Capture/Compare Channels
    1      Synchronous Serial Comms (SSC)
Example: Cisco Toaster3
  8 clusters of 2 processors each
  Each TMC is a VLIW machine with 74 bit instructions, 2k instructions in local memory
  Total capacity: about 5 GOps, at around 160 MHz
  Two 32-bit ALUs and three control/data movement units per TMC
  Image from Microprocessor Report, Oct 2002
Example: Cisco Toaster3
  Massive multiprocessing
    16 cores on a chip
    4 chips in serial
  Routing:
    10 Gbps @ 20 Mpackets/s
    1000 ops per packet passing through
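The Toaster3 figures hang together if one assumes that the "5 GOps" counts one operation per ALU per cycle and that all four chips in the chain contribute to each packet. Both are assumptions made for this sketch; the slides do not say how the totals were derived.

```python
CORES_PER_CHIP = 16      # from the slide
ALUS_PER_CORE = 2        # "Two 32-bit ALUs" per TMC
CLOCK_HZ = 160e6         # "around 160 MHz"
CHIPS = 4                # "4 chips in serial"
PACKETS_PER_S = 20e6     # "20 Mpackets/s"
LINE_RATE_BPS = 10e9     # "10 Gbps"

# Assumption: each ALU completes one operation per clock cycle.
ops_per_chip = CORES_PER_CHIP * ALUS_PER_CORE * CLOCK_HZ

# Assumption: every packet passes through all four chips.
ops_per_packet = CHIPS * ops_per_chip / PACKETS_PER_S

# Average packet size implied by the line rate and packet rate.
bits_per_packet = LINE_RATE_BPS / PACKETS_PER_S

print(ops_per_chip / 1e9, ops_per_packet, bits_per_packet)
```

Under those assumptions the per-chip capacity comes out near 5 GOps and the budget per packet near 1000 operations, in line with the quoted numbers.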
FPGA
  Field Programmable Gate Array
  Reconfigurable hardware: "soft logic"
    "Program" is circuit layout
    Can be changed after initial load
    Kilos to Megs of "gates" available
  Competitor to ASICs
    More expensive per unit, but no start-up cost for manufacturing
    Less flexible, slightly slower
    Perfect for low-volume products
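The "more expensive per unit, but no start-up cost" trade-off implies a break-even volume below which the FPGA is the cheaper choice. The sketch below shows that calculation; all cost figures are hypothetical and chosen only to make the arithmetic easy to follow, and `break_even_volume` is an illustrative name.

```python
import math


def break_even_volume(asic_nre, asic_unit_cost, fpga_unit_cost):
    """Smallest volume at which the ASIC's total cost is no higher than the FPGA's."""
    if fpga_unit_cost <= asic_unit_cost:
        raise ValueError("FPGA must cost more per unit for a break-even point to exist")
    return math.ceil(asic_nre / (fpga_unit_cost - asic_unit_cost))


# Hypothetical numbers, not from the slides.
volume = break_even_volume(asic_nre=1_000_000, asic_unit_cost=10, fpga_unit_cost=60)
print(volume)
```

Below the computed volume the one-time ASIC cost dominates, which is why low-volume products favor FPGAs.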
FPGA Architecture
  Computation cells
    Programmable function
      Adder, logic funcs, ...
      Memory, registers, ...
  Input/Output cells
  Interconnect
    Reconfigurable
    Programmable
FPGA Architecture
  Computation cells
    Look-Up Table
      Arbitrary 4-input, 1-output function
  Coarse-grained
    Lots of functionality
    Several LUTs
    Plus flip-flops etc.
  Fine-grained
    Little functionality
  [Diagram: config RAM feeding a LUT]
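A 4-input, 1-output look-up table is just a 16-entry truth table: the four inputs form an index, and the configuration RAM holds the output bit for each index. A minimal software model, with the class name `LUT4` and its methods invented for this sketch:

```python
class LUT4:
    """A 4-input, 1-output look-up table driven by 16 configuration bits."""

    def __init__(self, config_bits):
        if not 0 <= config_bits < (1 << 16):
            raise ValueError("config_bits must fit in 16 bits")
        self.config_bits = config_bits

    @classmethod
    def from_function(cls, func):
        """Build the configuration by evaluating func on all 16 input combinations."""
        bits = 0
        for index in range(16):
            a, b, c, d = ((index >> shift) & 1 for shift in range(4))
            if func(a, b, c, d):
                bits |= 1 << index
        return cls(bits)

    def evaluate(self, a, b, c, d):
        """Select the configuration bit addressed by the four inputs."""
        index = a | (b << 1) | (c << 2) | (d << 3)
        return (self.config_bits >> index) & 1


and4 = LUT4.from_function(lambda a, b, c, d: a & b & c & d)
parity = LUT4.from_function(lambda a, b, c, d: a ^ b ^ c ^ d)
print(hex(and4.config_bits), hex(parity.config_bits))
```

Changing the 16 configuration bits changes the logic function without touching the circuit, which is the sense in which the FPGA "program" is the circuit layout.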
FPGA with CPU Cores
  CPU on-board FPGA
    HW accelerate critical tasks in FPGA fabric
    Data pumps in FPGA
    Control in CPU
  Cool new possibilities
    Reconfigure FPGA online
    Adapt to workloads
Soft CPUs in FPGAs
  Processor in the FPGA fabric
    "Soft" processor
    Special design considerations
  Examples
    Altera Nios
    Xilinx MicroBlaze
    Research projects
      Västerås ARM clone
      Leon processor also prototyped
Examples
  Altera Apex 20kC: "Volume"
    30k to 1.5M gates
  Xilinx Virtex II: "High-end"
    1-4 PPC405 cores (optional)
    10M gates
    Price at about $1000
  Altera Stratix: "Advanced"
    10 Mbit RAM
    28 DSP elements
    100000 LE
    1300 user I/O pins
    Optimized for Nios
  ATMEL FPSLIC: "Low-end"
    AVR 8-bit CPU
    50k gates
Case Study: ARM 1026EJ-S
Overview
The Basics: ARM1026EJ-S
  Not a stand-alone processor
    For integration in your own chips
  Processor package:
    CPU core
    Caches, configurable in size
    Tightly-coupled memories, configurable in size
    Bus interface
    MMU (supports WinCE, Symbian, etc.)
Business Model
  Sold as an IP Core
    IP = "Intellectual Property"
    Not a physical chip, just a design
    "Source code component"
    Similar in scope to classic processor
  For integration in ASICs
    ASIC = Application-specific integrated circuit
ASICs
  Fully custom chips
    Custom for your application
    As small or large as necessary
  Characteristics
    Expensive to develop
      10s of engineers, often 100s
    Large series necessary to pay off
      At least 100 000 units necessary on average
    Mostly for large companies
  To streamline: build from IP blocks
IP Blocks
  IP
    Hardware components
    Integrated on chip by customer
  Examples:
    CPU cores
    Memory
    Buses
    Network interfaces
    Accelerator circuits
  [Block diagram: CPU, DSP, LCD driver, Bluetooth, GSM radio, code memory, and data memory connected by an on-chip bus]
CPU Cores
  The biggest "IP" business
  "Fabless" chip companies
  Biggest players:
    ARM (best-selling 32-bit architecture)
    MIPS (and its licensees)
  Crowded field
    New companies appear monthly
    Niche components can find a market
Component Styles
  Hard IP:
    Tied to a particular fab process
      Like IBM 0.13u Cu, TSMC 0.18, etc.
    Black box to customer
  Synthesizable IP:
    Source code for compilation by customer
    Offers configuration options like cache sizes, TCMs
    MIPS 24k, ARM 9S, 1026S, 1136S
  Soft IP:
    Get full source code for the component
    Purpose is to customize heavily
    ARC ARCtangent 5, Tensilica Xtensa V
Synthesizable Vs Hard IP
  Synthesizable
    + Use any process
    + Use any fab
    + Customize details
    + Customize chips
    + Add instructions
    - Slower memories
    - Higher power
    - Lower performance
  Hard IP
    + Optimized layout
    + Small area
    + Low power
    + Best performance
    - No flexibility
  For best results, cores need to be redesigned to be synthesizable
1026EJ-S Core
  6-stage pipeline:
    Max clock, best case: 475 MHz
      Depends on process, synthesis used
    Optimized for synthesis of core
    Integer-only
  Power:
    Depends on process & configuration
    Quoted numbers: 0.5 mW/MHz
      With 16kB+16kB L1 caches
      130 nm process at TSMC
      (Pentium 4: >35 mW/MHz)
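To put the mW/MHz figure in perspective, the sketch below multiplies it out. It assumes, purely for illustration, that the quoted 0.5 mW/MHz also holds at the best-case 475 MHz clock; the slides give the two numbers separately and note that both depend on process and configuration.

```python
ARM_MW_PER_MHZ = 0.5     # quoted figure from the slide
ARM_MAX_MHZ = 475        # best-case max clock from the slide
P4_MW_PER_MHZ = 35       # slide says "more than 35", so this is a lower bound

# Assumption: power scales linearly with clock up to the maximum frequency.
arm_power_mw = ARM_MW_PER_MHZ * ARM_MAX_MHZ

# Lower bound on how much more power per MHz the Pentium 4 draws.
efficiency_ratio = P4_MW_PER_MHZ / ARM_MW_PER_MHZ

print(arm_power_mw, efficiency_ratio)
```

Even at full speed the core would stay within a fraction of a watt under this assumption, while the desktop part needs at least 70 times more power per MHz.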
ARM1026EJ-S Pipeline
  [Pipeline diagram: Fetch, Issue, Decode; then Shift/ALU, Sat, Write; MAC1, MAC2; LS1, LS2, LS write]
  Static branch prediction (75% accurate): uses less power than dynamic
  Return stack (single entry). Simple but effective
ARM1026EJ-S Pipeline
  [Pipeline diagram: Fetch, Issue, Decode; then Shift/ALU, Sat, Write; MAC1, MAC2; LS1, LS2, LS write]
  ARM/Thumb/Java decode
  Access to coprocessors
ARM1026EJ-S Pipeline
  [Pipeline diagram: Fetch, Issue, Decode; then Shift/ALU, Sat, Write; MAC1, MAC2; LS1, LS2, LS write]
  Register read, initialize memory accesses
  Evaluate immediates
ARM1026EJ-S Pipeline
  [Pipeline diagram: Fetch, Issue, Decode; then Shift/ALU, Sat, Write; MAC1, MAC2; LS1, LS2, LS write]
  Execution pipeline for most integer instructions
  Handle saturated arithmetic
ARM1026EJ-S Pipeline
  [Pipeline diagram: Fetch, Issue, Decode; then Shift/ALU, Sat, Write; MAC1, MAC2; LS1, LS2, LS write]
  Execution pipeline for multiply-accumulate instructions
ARM1026EJ-S Pipeline
  [Pipeline diagram: Fetch, Issue, Decode; then Shift/ALU, Sat, Write; MAC1, MAC2; LS1, LS2, LS write]
  Decoupled pipeline for loads and stores
  2 stage memory access to support slow synthesized memory
Rounding Out
  Configurable caches
    Typically 16kB/16kB
  Optional TCMs
  Memory interface
    2 x 64 bit AMBA AHB links
  Optional vector FP coprocessor
  Optional vector interrupt controller
ARM1026EJ-S System
  [Block diagram: ARM1026EJ-S core with I$ and D$, I-TCM and D-TCM, VFP10 FP coprocessor, VIC10 interrupt coprocessor, ETM10RV trace/debug with debug port connection, and a BIU with one 64-bit AMBA/AHB data bus for I and one for D, leading to RAM and FLASH]
TCM
  Tightly-Coupled Memories
  Alternative to caches
    As fast as caches
    Programmer-controlled
    No automatic management
    Cheaper to implement
    More predictable in behavior
  Programming:
    In memory map
    Tagged like caches
Instruction Sets for ARM
  Base: ARM v5
    32-bit integer-only instruction set
  T: Thumb instruction set
    16-bit, for smaller code size
  J: Jazelle extensions
    Java support in hardware
    Implements 140 out of 228 JVM byte codes
  E: DSP extensions
    Done in regular registers
    Saturation, some more MACs
The ARM Instruction Set
  Continuous evolution
    Add features required by market
    RISC? Not anymore, if ever
  Now at v6, in the ARM11 family
    v5, v5E in ARM9 and ARM10
    v4 in old ARM7
    Backwards compatibility!
T: Thumb
  Compressed instruction set
    16-bit encoding of (parts of) 32-bit instruction set
  Limitations in ARM/Thumb:
    Only access to 8 registers (16 in ARM mode)
    No system operations
  Effect:
    More but smaller instructions
      30% more, at half size
    Usually some performance loss
      (Perform better on narrow buses)
T: Thumb
  Thumb shrinks the code (second row of each benchmark: size relative to ARM):
              Thumb     ARM      386     8088    68020    SPARC
    eqntott   10608   16768    17640    19106    20542    22256
               0.63    1.00     1.05     1.14     1.23     1.33
    xlisp     26388   40768    28097    29401    46746    44648
               0.65    1.00     0.69     0.72     1.15     1.10
    espresso  72596  109923   125686   137194   131854   142752
               0.66    1.00     1.14     1.25     1.20     1.30
  Source: Microprocessor Report, March 1995
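The measured ratios line up with the earlier rule of thumb of "30% more instructions, at half size". The sketch below recomputes the Thumb/ARM ratios from the table and compares them with that estimate; the variable names are illustrative.

```python
# Code sizes from the table (Microprocessor Report, March 1995).
sizes = {
    "eqntott": {"Thumb": 10608, "ARM": 16768},
    "xlisp": {"Thumb": 26388, "ARM": 40768},
    "espresso": {"Thumb": 72596, "ARM": 109923},
}

# Rule of thumb from the previous slide: 30% more instructions, each half the size.
predicted_ratio = 1.30 * 0.5

ratios = {name: row["Thumb"] / row["ARM"] for name, row in sizes.items()}
for name, measured in ratios.items():
    print(f"{name}: measured {measured:.2f}, predicted {predicted_ratio:.2f}")
```

All three benchmarks fall within a few percent of the predicted ratio.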
T2: Doing a Better Thumb
  ARM Thumb: fixed 16-bit size
    Saves 28% space compared to 32-bit ARM
    Runs 20% slower than 32-bit ARM
  ARM Thumb 2: mixed 16/32
    Brand new, arrives with ARM1156
    Saves 26% space compared to 32-bit ARM
    Runs 2% slower than 32-bit ARM
    (Introduces some new instructions)
  Conclusion: mixed length good!
  Source: Microprocessor Report, June 2003
Why T?
  Pushed by mobile phones
    More memory = more expensive
    More memory = bigger package
    More memory = higher power
  More features in same memory!
  Performance is not critical
T: Competitors
  Compressed instruction sets
    MIPS16e, shrunk MIPS32 ISA
    ARC
    Tensilica
  All-small instruction sets
    SH family
  Compressed code
    IBM PowerPC 405 GX
    Decompress when loaded into cache
J: Jazelle
  Hardware Java acceleration
    Pushed by mobile phones
  Why?
    To fix Java performance problems
  SW JVM problems:
    Minimal clock frequency = low interpreter performance
    JIT requires more memory
E: DSP Extensions
  A few new instructions
    Saturated arithmetic
      Add, Sub
    Signed multiply, MAC
      2 16-bit values in one register
      16x16
      32x16
    Count leading zeroes
    Load/store pairs of registers
  Fairly typical "DSP" additions
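Two of the additions listed here, saturated arithmetic and count leading zeroes, are easy to model in software. The sketch below shows the behavior for 32-bit values; the function names are illustrative and this models the arithmetic only, not any particular instruction encoding.

```python
INT32_MAX = 2**31 - 1
INT32_MIN = -(2**31)


def saturating_add(a, b):
    """Add two signed 32-bit values, clamping instead of wrapping on overflow."""
    return max(INT32_MIN, min(INT32_MAX, a + b))


def wrapping_add(a, b):
    """Ordinary two's-complement 32-bit add, for comparison."""
    result = (a + b) & 0xFFFFFFFF
    return result - 2**32 if result >= 2**31 else result


def count_leading_zeros(value):
    """Number of zero bits above the highest set bit of an unsigned 32-bit value."""
    return 32 - (value & 0xFFFFFFFF).bit_length()


print(saturating_add(INT32_MAX, 1), wrapping_add(INT32_MAX, 1))
print(count_leading_zeros(1), count_leading_zeros(0))
```

Saturation matters in signal processing because a clipped sample is a small error, while a wrapped sample flips sign and becomes a large one.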
Why E?
  Enhance DSP performance
    Of stand-alone ARM core
    Avoid multiprocessor solution
      Hard disk controllers, for example
E: Competition
  DSP-in-processor
    "MAC=DSP"
    Almost all embedded processors have it
    No revolution in performance
  DSP/processor hybrids
    Infineon TriCore
    Microchip dsPIC
    Hard to get it right, not a big success so far
  SIMD extensions
    More extensive additions than v5E
    Requires new functional units
    Major performance gain possible
SIMD Extensions
  Heavy-weight addition
    New functional units, registers
    Small vector computers
  Examples:
    ARM SIMD extensions (in v6)
    Motorola AltiVec
    MIPS
    x86 MMX-SSE-SSE2-3Dnow!
    SPARC VIS
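The core idea behind these extensions is to treat one wide register as several narrow lanes that are operated on independently. A minimal sketch of a packed 16-bit add inside a 32-bit word follows; the function name `packed_add16` is made up here and does not correspond to a specific instruction in any of the listed extensions.

```python
def packed_add16(x, y):
    """Add two 32-bit words as two independent 16-bit lanes (wrapping per lane)."""
    low = ((x & 0xFFFF) + (y & 0xFFFF)) & 0xFFFF
    high = (((x >> 16) & 0xFFFF) + ((y >> 16) & 0xFFFF)) & 0xFFFF
    return (high << 16) | low


a = 0x0001FFFF   # lanes: high = 0x0001, low = 0xFFFF
b = 0x00020001   # lanes: high = 0x0002, low = 0x0001

packed = packed_add16(a, b)          # carry from the low lane is discarded
plain = (a + b) & 0xFFFFFFFF         # carry from the low half leaks into the high half
print(hex(packed), hex(plain))
```

The difference between the two results is the carry that a plain 32-bit add lets cross the lane boundary, which is why SIMD needs its own functional units rather than reusing the ordinary adder unchanged.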
SIMD Extensions
  Target
    Motorola PPC 7455 (G4+), 1 GHz
  EEMBC Telemark suite
    Networking suite
  OOTB: Out-of-the-box
  OPT: Manually tuned to use AltiVec
  Overall/Average: 3-4 times speed up can be expected
  [Bar chart: OOTB vs OPT, axis 0 to 10 with one off-scale bar labeled 35,1, for Autocorr 1, Convolution 1, Bit alloc 1, FFT 1, Viterbi 1, OSPF 1, Route 1, Packet 512]
14 Nov 2003 Embedded Computer Architecture 72
ARM vs DSP
Despite "E" and "SIMD"...
Standard solution:
  Dual-core setup
  ARM core
  DSP core
Control vs data

Control vs Data
Control plane:
  Standard processor tasks
  Decision-making
  "Integer applications"
  UI of a phone, packet routing, ...
Data plane:
  Move or process data
  Performance is key
  Signal processing, multimedia, ...
  Floating/fixed point

ARM-DSP: TI OMAP 5910
Texas Instruments
Target market
  Data-intense real-time
  Audio, biometrics, etc.
Processing
  Dual-core chip
  ARM925T 150 MHz
  TI C55 DSP 150 MHz
Power 230 mW
Price 32 USD
USB 1.1
LCD controller
MMC/SD card interface
Camera interface
Keyboard interface
RTC
I2C
8 serial ports
3 UARTs
14 GPIO pins
[Block diagram: C55x DSP core (24k I$, 64k data SRAM, 96k instr SRAM); ARM925 CPU core (16k I$, 8k D$, MMU); 192k shared SRAM; MemCtrl; 75 MHz; LCD Ctrl; ARM shared devices; ARM private devices; system devices; DSP shared devices; DSP private devices.]

ARM Family: ARM Cores
[Diagram: performance over time]
ARM7   1994  3-stage pipe, unified cache, low power
ARM9   1998  5-stage pipe, I/D caches
ARM9E  2000  5-stage pipe, I/D caches, Java, DSP
ARM10  2000  6-stage pipe, static BP, 64-bit BIU, FP
ARM11  2002  8-stage pipe, dynamic BP, OOO-completion, 550 MHz

ARM Family: Intel Chips
[Diagram: performance over time, with StrongARM and XScale placed among ARM7, ARM9, ARM9E, ARM10 and ARM11]
StrongARM  1995  5-stage pipe, legendary performer
  Intel got this from Digital in 1998. A single variant, big in PDAs.
XScale     2001  7-10-stage pipe, dynamic BP, 800 MHz
  Intel makes chips based on the XScale; does not license the core to 3rd parties.

Configurable Instruction Sets

Instruction Sets: Configure
Configurable instruction sets
  Adapt to needs of application
  User can specialize the processor
  Less waste on generality
  Fast evolution of instruction sets
Traditionally:
  Chip manufacturers determine instruction sets aimed at some niche
  Slow evolution of instruction sets

Instruction Sets: Configure
Subsetting
  There is a limited and predefined set of instructions available
  Easy to compile for: restrict code gen
  Remove instructions to simplify core
Addition
  Freedom to invent instructions
  Tool chain: assembly, C compilers
  Genuine development of ISAs

Configurable Instruction Sets
Tight integration:
  Add to regular pipeline
  Additional functional units
  Adding fine-grained instructions
Loose integration:
  Coprocessor interface
  Slower communication
  Offloading of macro-scale tasks
  Method to invoke accelerator circuits

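The difference between the two styles can be modeled in C. This is a simplified software model, not code for any real core: the bit-counting task, the `coproc_t` register block and all function names are hypothetical. In the tight case a fine-grained operation sits inside an ordinary loop; in the loose case the whole job is handed to an accelerator through a setup, wait and read-back sequence.

```c
#include <stddef.h>
#include <stdint.h>

/* The operation being accelerated: number of set bits in a word. */
static uint32_t popcount32(uint32_t v)
{
    uint32_t n = 0;
    while (v) {
        v &= v - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Tight integration: popcount32 stands in for one added instruction
   that the regular pipeline issues once per word. */
static uint32_t count_bits_tight(const uint32_t *buf, size_t len)
{
    uint32_t total = 0;
    for (size_t i = 0; i < len; i++)
        total += popcount32(buf[i]);
    return total;
}

/* Loose integration: a hypothetical coprocessor register block. */
typedef struct {
    const uint32_t *src;  /* input buffer */
    size_t len;           /* number of words */
    uint32_t result;      /* filled in when the job completes */
    int busy;             /* nonzero while the job is running */
} coproc_t;

/* Models the accelerator hardware processing the whole buffer. */
static void coproc_run(coproc_t *cp)
{
    uint32_t total = 0;
    for (size_t i = 0; i < cp->len; i++)
        total += popcount32(cp->src[i]);
    cp->result = total;
    cp->busy = 0;
}

static uint32_t count_bits_loose(coproc_t *cp, const uint32_t *buf, size_t len)
{
    cp->src = buf;        /* set up the macro-scale task */
    cp->len = len;
    cp->busy = 1;         /* start */
    while (cp->busy)      /* real hardware: poll or wait for an interrupt */
        coproc_run(cp);
    return cp->result;    /* read back the result */
}
```

Both paths return the same total for the same buffer. The model only shows where the communication overhead sits: per job in the loose case, none beyond normal register access in the tight case.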
Configurability Trend
Pioneers
  Tensilica Xtensa
  ARC ARCtangent
  Configurability as key selling point
Added to general architectures
  MIPS: "CorExtend"
  PowerPC: "BookE ASU"
  Usually less tight integration

Benefit of Configurability
Target
  Xtensa III
  200 MHz
EEMBC Telemark suite
Networking suite
OOTB: Out-of-the-box
  25k gate core
OPT: Tuned code
  25k base core gates
  18k extra instr gates
  100k DSP coproc
  37k config gates
Speedups
Benchmark         OOTB   OPT
Telemark overall     1    37
Autocorr             1     9
Convolution          1  1249
Bit alloc            1    34
FFT                  1    24
Viterbi GSM          1    14

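To put the speedup next to its hardware cost, the gate figures can be combined in a few lines of C. This assumes that the four OPT entries (base core, extra instructions, DSP coprocessor, config gates) are additive, which the slide does not state; the function names are made up for the example.

```c
/* Gate counts in thousands, as listed for the OOTB and OPT configurations. */
enum {
    BASE_CORE_K   = 25,
    EXTRA_INSTR_K = 18,
    DSP_COPROC_K  = 100,
    CONFIG_K      = 37
};

/* Total OPT gate count, ASSUMING the four entries simply add up. */
static unsigned opt_total_gates_k(void)
{
    return BASE_CORE_K + EXTRA_INSTR_K + DSP_COPROC_K + CONFIG_K;
}

/* Size of the OPT configuration relative to the 25k gate base core. */
static double area_ratio(void)
{
    return (double)opt_total_gates_k() / (double)BASE_CORE_K;
}

/* Speedup divided by the relative size, a crude efficiency figure. */
static double speedup_per_area(double speedup)
{
    return speedup / area_ratio();
}
```

Under that assumption the OPT configuration totals 180k gates, 7.2 times the base core, so the overall Telemark speedup of 37 corresponds to roughly a factor of 5 per unit of added size.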
Configuration Tools
instruction set choices
Gate and memory size
counters