Putting MPSOC to Work in Multimedia · © 2007 Tensilica Inc. Sample of HiFi2 Audio Software ......
Transcript of Putting MPSOC to Work in Multimedia · © 2007 Tensilica Inc. Sample of HiFi2 Audio Software ......
1 © 2007 Tensilica Inc.
Six billion people want live multimedia entertainment and information anywhere and anytime at the lowest cost
1. Multimedia subsystems appear everywhere – big market
2. Technical demands require new tools, architectures and business models
3. Diversity within multimedia mirrors diversity of entire MPSOC challenge
Putting MPSOC to Work in MultimediaPutting MPSOC to Work in MultimediaPutting MPSOC to Work in Multimedia
2 © 2007 Tensilica Inc.
MPSOC Becomes Mainstream StyleThe Transformation of the Modern Mobile Device
Key complementary processing functions of mobile devices covered by Tensilica configurable processors
Audio Codec
GPSBluetooth
Satellite Radio
HD Radio
User Authentication
Camera Image Processing
Mobile WiFi
Noise Cancellation
Video Codec
DTV Reception
Baseband DSP
3 © 2007 Tensilica Inc.
The designer never has to manually add these instructioComplete
MultimediaSolutions
Software RulesNetwork of Partners and Customers Defines Strength
Video Imaging
Audio
AFA TECHNOLOGY
NUFRONTPnpNetwork
Technologies
MobileMedia
ReceiverConcept Embedded
Systems
4 © 2007 Tensilica Inc.
Design Teams Need Different Entry PointsKey Question: Who Specifies Processor?
Hard-wired logic
RISC CPUμController
DSP
Xtensa LX
Dat
a P
roce
ssin
g N
eeds
Complex Computing Needs
• Xtensa configurable processors widely used in multimedia
• Diamond Standard cores introduced February 2006:
• Broad line of compatible controller, CPU and DSP cores
• High performance general-purpose DSP
• Complete low-power HiFi2 audio solution
• Diamond Video spans wide range of mobile media needs
Diamond388VDOVideo Codec
Diamond381 VDO
VideoDecode
Diamond383VDO
Video Codec
Diamond385VDOVideo
Decode
Xtensa
Diamond 570T
High-end CPU
Diamond 232L
Linux CPU
Diamond 212GP
Mid-rangeController/
DSP
Diamond 108 Mini
Low-powerController
Diamond330HiFi
Audio
Diamond 545CK
High-end DSPHiFi2Audio
5 © 2007 Tensilica Inc.
Automation Enables Broad Product LineC
apab
i lit y
time
Xtensa 7
XtensaXtensa IIXtensa IIIXtensa IVXtensa VXtensa 6
Xtensa –next
Mainstream
extensible processors
XtensaLX
XtensaLX2
XtensaLX-next
High-performance
extensible processorsXPRES
Compiler
Processor Synthesis
- next
Automatic
processor synthesis
XTMPSystem Simulation
API
XTSCSystemC
Simulation
TurboXimEnergy
Modeling
Processor and
system modelingNext ModelingMP SW tools
CoWare
Seamless
xcc C/C++ Code
Development
Xtensa XplorerDevelopmentEnvironment
Next Xtensa Xplorer
Software development
environment
gccGNU tools
Xtensa XplorerDevelopmentEnvironment
DiamondCtrlrs
DiamondCtrlr - next
Standard controllers
DiamondAudio
DiamondVideo
nextDSP
next Video
next Audio
Application-
specific DSPs
DiamondDSP
6 © 2007 Tensilica Inc.
Base CPU
AppsDatapaths
OCD
Timer
FPUExtended Registers
Cache
Multimedia Design is Energy-Based DesignExample: Xenergy Energy Explorer
Processor Description
Xenergy Energy
Estimator
Complete DynamicEnergy/Power Profile
Processor + Memory
Xtensa xccC/C++ compiler
int main( ) {int i;short c[100];for (i=0;i<N/2;i++){
int main( ) {int i;short c[100];for (i=0;i<N/2;i++){
Application Code
Tune processorinstruction set and
memories
Tune softwarealgorithms and coding
7 © 2007 Tensilica Inc.
Application-Specific ProcessorsExample: HiFi 2 Audio Engine
• Pre-configured VLIW audio DSP with dual 24-bit MACs for higher quality audio• Already in use in dozens of chip designs, licensed by many top semiconductor
suppliers• Available 2 ways:
• Xtensa HiFi 2 Audio Engine for further customization• Diamond Standard 330HiFi audio core
• Delivers both low power and scalability• <75K gates for full-feature processor• Example: MP3 decode at 5.7 MHz (core + memory <0.75mW in 65nm)• Emerging trend: multiple instances of HiFi2 on same chip, tuned for different mW vs. MHz
Fetch/Decode
I-RAM
Reg File
ALU
Branch Unit
External I/F
Load/Store D-RAM 2
Brid
ge (o
pt.)
VLCAccel.
I-Cache
Audio ALU
Extended
Audio MAC 24b
Misc. AudioFunctionsQ
ueue
I/F
SLOT 1 SLOT 0
Audio Reg File
D-RAM 1D-CACHE
8 © 2007 Tensilica Inc.
Sample of HiFi2 Audio Software
Currently available:MP3 decodeMP3 encodeWMA decodeWMA encodeAAC-LC decodeAAC-LC encodeaacPlus v2 decodeaacPlus v2 encodeaacPlus v1 decodeaacPlus v1 encodeOgg Vorbis DecoderDolby AC-3 5.1 ch decodeDolby AC-3 2 ch encodeDolby Digital Plus 5.1 ch
decodeDolby Digital Plus 7.1 ch
decode AM3D Diesel 3D
AM3D Zirene effectsQSound microQ MIDI, 3D, effectsSONiVOX AudioiNSIDEMIDISRS WOW XT audio expansionSRS Xspace 3DSRS TruSurround HDAMR NB codecAMR WB codecSchedule for release:BSAC decoderDolby TrueHD Dolby Digital Prologic II / IIxDolby Digital Consumer Encoder 2 channel encoder
Dolby Digital Compatible Output 5.1 channel encoder
aacPlus 5.1/7.1 channel decode
SBC Bluetooth stereo codec
AMR-WB+G.711 codecG.723.1 codecG.726 codec DTMF aacPlus 5.1/7.1 channel
decodeDolby TrueHD Dolby Digital
Prologic II / IIxG.167 (AEC)G.168 (LEC)WMA 10 Pro Decoder
9 © 2007 Tensilica Inc.
Growth in Video Resolution and Complexity
Computational Complexity of Digital Video
1
10
100
1000
H.264MPEG-4MPEG-2
Video Standard
Rel
ativ
e C
ompu
teC
ycle
s (M
PEG
-2 C
IF =
1)
HD 1080pHD 720pD1 (576i)CIF
Source: “A Performance Characterization of High Definition Digital Video Decoding using H.264/AVC”, scaled for CIF
10 © 2007 Tensilica Inc.
Diamond VDO Design Goals
• Video encode & decode up to720 x 480 @30 fps (NTSC)720 x 576 @25 fps (PAL)
• Support multiple coding standards @ main profileH.264 main profile @ L3 decoderMPEG-4 ASP [no GMC] @ L5 decoderMPEG-2 MP @ ML decoderWMV9/VC1 main profile decoderMPEG-4 ASP encoder
• Programmable solution • Less than 100MB/s memory decode bandwidth requirement• Operating frequency <200Mhz
Can be implemented in 130nm G technology
• Compatible subsets for decode only and Base Profile only
11 © 2007 Tensilica Inc.
Top Level Block Diagram
12 © 2007 Tensilica Inc.
Diamond VDO Product Family
Baseline Profiles – D1
Main / Advanced Profiles – D1
MPEG-4 Advanced Simple Profile
H.264 Main Profile MPEG-4 Advanced
Simple Profile VC-1/WMV9 Main ProfileMPEG-2 Main Profile
Diamond 388VDO
NoneH.264 Main Profile MPEG-4 Advanced
Simple Profile VC-1/WMV9 Main ProfileMPEG-2 Main Profile
Diamond 385VDO
MPEG-4 Simple Profile
H.264 Baseline ProfileMPEG-4 Simple ProfileVC-1/WMV9 Simple
Profile MPEG-2 Main Profile
Diamond 383VDO
NoneH.264 Baseline Profile MPEG-4 Simple ProfileVC-1/WMV9 Simple
ProfileMPEG-2 Main Profile
Diamond 381VDO
EncoderDecodersTensilica Engine
150 MHz30D1MPEG-2 Main Profile Decode
185 MHz30D1MPEG-4 Advanced Simple Profile Encode
189 MHz30D1VC-1/WMV9 Main Profile Decode
156 MHz30D1MPEG-4 Advanced Simple Profile Decode
172 MHz30D1H.264 Main Profile Decode
Max. Required
Clock Rate
Frames-per-
Second
ResolutionStandard
13 © 2007 Tensilica Inc.
Optimizing EnergyISA Extension Example: CABAC
• Context-adaptive binary arithmetic coding (CABAC) achieves higher compression in H.264
Required for Main Profile H.264 decodingOur solution: Xtensa ISA extensionsPeak decode performance is one bin per cycle
13
Millions of cycles per second in ISA extended core
710CABAC decode cycles
Millions of cycles per second in unaugmented core
5mJ164mJEnergy for 1 second
Area cost: 20K gates for CABAC ISA extensions
Unaugmented core assumes base core plus 16b MUL, LOOP ops, debug, PIF at max MHz
14 © 2007 Tensilica Inc.
Optimizing EnergyExample: Motion Estimation for MPEG4 Encode
14.216598pt SAD26.6204Control76.24268Motion Estimation
10.71143Half Pel
8.4735Diamond
16.352716x16 SAD
Millions of cycles required on ISA extended core
Millions of cycles required on unaugmented core
Area Cost: Less than 40Kgates for all Motion Estimation extensions
29mJ988mJEnergy for 1 second
Unaugmented core assumes base core plus 16b MUL, LOOP ops, debug, PIF at max MHz
15 © 2007 Tensilica Inc.
System Issues in Mobile VideoExample: Decode Memory Bandwidth Drives Power
• Off-chip power is a major component of solution power
• Early, accurate modeling gives insight and enables novel optimization
• Non-trivial interactions:Core power increases with bit stream data rateMemory power decreases with bit stream data rate
• Design of video architecture, DMA and software reduces bandwidth by about one-third (saves 30mW)
• Additional power savings in on-chip interconnect and memory controller 0
2040
60
80
100120
140
160
60 70 80 90 100 110 120 130 140Delivered MB/s
Mob
ile D
DR p
ower
(mW
)
Estimated power for Micron MT46H16M16LF MobileDDR at 133MHz: 50% 16B block read, 50% 16B writes
0
20
40
60
80
100
120
0 2 4 6 8 10H.264 MP bitstream (Mbps)
Mem
ory
Band
wid
th (M
B/s)
Memory Bandwidth Needs
Memory Bandwidth Power Cost
16 © 2007 Tensilica Inc.
Move Up to Fully Programmable HD Approach: five small video task engines + audio
Off-chip DRAM controller
Processor InterFace (PIF) connect
Off-chip DDR
Bridge to legacy system bus
AHB, AXI, OCP, CoreConnect etc.
BitstreamGP scalar with
CABAC/CAVLC accel
InstCache Data
RAM
MotionVector
GP FLIX
InstCache
DataRAM
1K motionvectors
PredictionPixel SIMD
InstCache
DataRAM
ResidualSmall SIMD
InstCache
DataCache
DeblockingPixel SIMD
InstCache
DataRAM
Hifi2 AudioEngine
InstCache
DataRAM
3-4 computed macroblocks:
<1.5KBcomputedresiduals<100B
Residual data on 6-8 macroblocks
2-4KB20-30 motion
vectors
~32B
bits
tream
data
Bits
tream
addr
Blo
ck d
ata
Boundary strength data on5-7 macroblocks
Pixel positions: 3-5 macroblocks
Intra
fram
ead
dr
Intra
fram
eda
ta
DataCache
3KB
DataCache
Optional Audio
Res
ult
addr
ess
& d
ata
17 © 2007 Tensilica Inc.
Baseband DSP Needs Extensible Processors
Optimized baseband architecture with DSP-optimized task-engines
• Explosion of standards and multiple radios dictates agile-but-efficient baseband design
• Each processor extensible to handle specific algorithm kernels with RTL-like power and throughput
• Easy composition of engines into efficient pipeline
• Unified hardware and software development model
Baseband Processing Engine
MA
C
proc
esso
r
Enc
rypt
enco
de
proc
esso
r
Mod
ulat
ion
Dem
odul
atio
npr
oces
sor
Ana
log
I/FA
nalo
g I/F
Ana
log
I/F
Xtensa DSP Foundation
Ker
nel A
ccel
Pack
age
Ker
nel A
ccel
Pack
age
Ker
nel A
ccel
Pack
age
Ker
nel A
ccel
Pack
age
StreamingI/O
Interfaces
Special FunctionMemories
~100xTI C64xLDPC
2xRISCUpper layer processing4xRISCSSL
300xRISCAES43xRISC3-DESSecurity
18xTI C62xTurbo decoding
3xTI C64xViterbiError Coding
3xTI C64xFFT4xTI C64xFIRModulation
(e.g. OFDM)
Efficiency (cycle reduction)
VersusFunctionDomain
18 © 2007 Tensilica Inc.
System Level Design MattersExample: Audio System Optimization
System-level design tools make a big difference• Tensilica supports wealth of system modeling:
ISS, TurboXim, XTMP C modeling, XTSC System C modeling, CoWare, Seamless
• Tensilica Xenergy Energy Estimator provides rapid, accurate configuration-specific application power analysis
• Example: Audio system organization analysis:
HiFi2 coreInst
cache
On-chip SRAM
Mobile DDR (32M x 16b)
Data cache
PIF
AudioDACs
DDR/Flash controller
Other sub-systems
PLLs & clock trees
0.250.3616K/32K DM + DDR
0.190.5616K/32K + 128K on-chip + DDR0.150.6516K/32K DM + 192K on-chip
0.160.654K/8K DM + 192K on-chip0.941.004K/8K DM + DDR
90G high-vt90GSystem Organization Power includes dynamic and static at minimum WMA decode MHzPower includes HiFi2 core, I cache, D cache, On-chip SRAM, DDR (no refresh)I/O through queues
Relative power
19 © 2007 Tensilica Inc.
MPSOC System Design is CollaborativeExample: Xtensa in CoWare Platform Architect
Configured subsystems with rich interfaces:
• AMBA bus• Local memories• Caches• Special lookup
memories• Input queues• Output queues• Input wires• Exported state
wires• Interrupts
• Wrapper automatically generated from processor configuration
20 © 2007 Tensilica Inc.
Putting It TogetherThe principles
• Multimedia SOC is not “rocket science”: hundreds of designs underway around the world
• Pay attention to best practices:1.Leverage existing design
standards, tools, interconnect and subsystem IP
2.Model early and often. Cost, performance, power largely set in architectural phase
3.Application standards explode. Build in programmability everywhere
4.Look at total bandwidth and total power, especially in off-chip communications
• Differentiation, time-to-market and high efficiency are compatible!
On-chip interconnect
Video
DSPI/O
inte
rface
s System CPU
Audio
Mem
ory
cont
rolle
r
RF
AudioDAC
MA
C
proc
esso
r
Enc
rypt
/enc
ode
proc
esso
r
Mod
ulat
ion
proc
esso
r
Fetch/Decode
I-RAM
Reg File
ALU
Branch Unit
External I/F
Load/Store D-RAM 2
Brid
ge
(opt
.)
VLC
Accel.
I-Cache
Audio ALU
Extended
Audio MAC 24bMisc.
AudioFunctio
ns
Que
ue
I/F
SLOT 1 SLOT 0
Audio Reg File
D-RAM 1D-CACHE
Fetch/Decode MMU
Reg File
ALU
Branch Unit
External I/F
Load/Store MMU
32
PIF
Brid
ge
(opt
.)
32
AMBA AHB lite
MAC 16 DSP
I-Cache
D-Cache
21 © 2007 Tensilica Inc.Thank you!