4.4 Implementation Structures in FPGAs and DSPs
Presented by Lee Pucker, President, ForwardLink Consulting
Agenda
Case Study on Implementation Structures
Synchronization in a GSM Network
Option 1: DSP Implementation of a GSM Sync Burst Matched Correlator
Option 2: FPGA Implementation of a GSM Sync Burst Matched Correlator
Discussion on Trade-offs in Device Selection
Conclusions
Typical problem – Synchronization on a GSM network
The Mobile Station and Base Station Terminals in a GSM network contain multiple clocks that run asynchronously
For these radios to communicate, they need to synchronize their clocks
This Problem is Addressed Through the GSM Logical Channel and Frame Structure
GSM Signals Are Transmitted in “Bursts”
1 Burst per Time Slot
Each time slot contains 156.25 bits at 270.833 kbits/sec
8 Time Slots per Frame
One time slot in each downlink frame is used by the base station to transfer synchronization and control information
Referred to as logical channels
Every 10 frames, the base station transmits a Sync Channel (SCH) to facilitate synchronization by the mobile station with the base station
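Frames 0-9:   FCH SCH BCCH BCCH BCCH BCCH CCCH CCCH CCCH CCCH
Frames 10-19: FCH SCH CCCH CCCH CCCH CCCH CCCH CCCH CCCH CCCH
Frames 20-29: FCH SCH CCCH CCCH CCCH CCCH CCCH CCCH CCCH CCCH
Frames 30-39: FCH SCH CCCH CCCH CCCH CCCH CCCH CCCH CCCH CCCH
Frames 40-49: FCH SCH CCCH CCCH CCCH CCCH CCCH CCCH CCCH CCCH
Frame 50:     IDLE
Source: ETSI 3GPP TS 45.002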
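The timing implied by this structure can be worked through numerically. The sketch below is illustrative only: it assumes a bit rate of 270.833 kbit/s (matching the 270.833 kHz baud clock quoted elsewhere in this presentation), and the names `MULTIFRAME` and `sch_frames` are invented for the example. The channel order is copied from the multiframe layout above.

```python
# Burst and frame timing, assuming a bit rate of 270.833 kbit/s.
BIT_RATE_HZ = 270.833e3
BITS_PER_SLOT = 156.25
SLOTS_PER_FRAME = 8

slot_s = BITS_PER_SLOT / BIT_RATE_HZ    # duration of one time slot
frame_s = SLOTS_PER_FRAME * slot_s      # duration of one frame

# Logical channel carried in each frame of the multiframe,
# in the order shown in the layout above.
MULTIFRAME = (
    ["FCH", "SCH"] + ["BCCH"] * 4 + ["CCCH"] * 4
    + (["FCH", "SCH"] + ["CCCH"] * 8) * 4
    + ["IDLE"]
)

# Frame numbers that carry the synchronization channel.
sch_frames = [i for i, ch in enumerate(MULTIFRAME) if ch == "SCH"]
sch_spacing_s = (sch_frames[1] - sch_frames[0]) * frame_s

print(f"slot  = {slot_s * 1e6:.1f} us")
print(f"frame = {frame_s * 1e3:.3f} ms")
print(f"SCH frames = {sch_frames}")
print(f"SCH spacing = {sch_spacing_s * 1e3:.2f} ms")
```

Under these assumptions a time slot lasts roughly 577 microseconds and a frame roughly 4.6 ms, so the mobile station gets a fresh synchronization burst about every 46 ms.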
The GSM Synchronization Channel (SCH)
The GSM Synchronization Burst contains a “training sequence” of 64 bits that is used to facilitate synchronization
Sync bits are known by both the base station and the mobile
Detecting where the synchronization burst occurs in time allows the mobile station to synchronize with the base station’s baud clock and frame clock
Tracking the drift from SCH burst to SCH burst can be used by the mobile station to fine tune the receiver clock
More on this when you study Synchronization later in this course
Process for detecting the GSM SCH
Oversample the received complex baseband GSM signal by 4X
This allows us to synchronize to the base station with an accuracy of 0.25 bits
Sometimes referred to as a qbit
Baud rate = 270.833 kSymbols per second, so the sample rate is 1083.333 kSamples per second (about 1.083 MSamples per second)
GMSK modulate the training sequence of the SCH to create the matched filter and “upsample” by 4
Creates 64 X 4 samples = 256 samples
Slide the received GSM sample data against the matched filter to find the correlation peak
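The numbers behind these steps follow from simple arithmetic. The sketch below assumes a baud rate of 270.833 kHz (the baud clock figure used in this presentation); the variable names are invented for illustration.

```python
# Oversampling arithmetic for the SCH detector.
BAUD_RATE_HZ = 270.833e3   # assumed GSM baud rate
OVERSAMPLING = 4           # samples per bit
TRAINING_BITS = 64         # length of the SCH training sequence

sample_rate_hz = OVERSAMPLING * BAUD_RATE_HZ        # receiver sample rate
qbit_s = 1.0 / sample_rate_hz                       # one quarter-bit period
matched_filter_len = TRAINING_BITS * OVERSAMPLING   # taps in the matched filter
filter_span_s = matched_filter_len * qbit_s         # time covered by the filter

print(f"sample rate   = {sample_rate_hz / 1e6:.4f} MSPS")
print(f"qbit          = {qbit_s * 1e6:.3f} us")
print(f"filter length = {matched_filter_len} samples")
print(f"filter span   = {filter_span_s * 1e6:.1f} us")
```

The quarter-bit period is the timing resolution of the correlation peak, which is why 4X oversampling yields synchronization to within 0.25 bits.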
The Correlation Function
$$C_n = \sum_{m=0}^{255} MF(m)\, S^{*}(n+m)$$
Let MF be the Matched Filter Sequence (Length 256, m = 0 to 255)
Let S be the Signal Sequence
Let C be the Correlation Sequence
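A minimal sketch of this correlation, with m running from 0 to 255, is shown below. It is not a GSM receiver: the reference `mf` is a random unit-magnitude stand-in rather than the GMSK-modulated training sequence, and the noise level, `TRUE_OFFSET`, and function names are all assumptions chosen for illustration.

```python
import cmath
import random

def correlate(mf, s):
    """C[n] = sum over m of mf[m] * conj(s[n + m]), for every full-overlap lag n."""
    taps = len(mf)
    return [
        sum(mf[m] * s[n + m].conjugate() for m in range(taps))
        for n in range(len(s) - taps + 1)
    ]

rng = random.Random(0)
TAPS = 256

# Stand-in reference waveform (hypothetical): random phases, unit magnitude.
mf = [cmath.exp(1j * rng.uniform(0.0, 2.0 * cmath.pi)) for _ in range(TAPS)]

def noise():
    # Small complex Gaussian noise sample (assumed level).
    return complex(rng.gauss(0.0, 0.1), rng.gauss(0.0, 0.1))

# Received stream: noise, then the reference waveform plus noise, then noise.
TRUE_OFFSET = 37
received = [noise() for _ in range(TRUE_OFFSET)]
received += [x + noise() for x in mf]
received += [noise() for _ in range(50)]

c = correlate(mf, received)
peak_index = max(range(len(c)), key=lambda n: abs(c[n]))
print("peak at lag", peak_index, "with magnitude", round(abs(c[peak_index]), 1))
```

The lag with the largest correlation magnitude marks where the known sequence starts in the received stream, which is the timing estimate the mobile station needs.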
A Typical DSP – the TMS320C6416T
Single C64x fixed point DSP core
Independent L1 Program and L1 Data Cache
Built-in 8 Mbit L2 Cache
Dual external memory interfaces
Viterbi and Turbo coprocessors
3 independent timers
Source: TMS320C6414T, TMS320C6415T, TMS320C6416T Fixed Point Digital Signal Processors Data Sheet (SPRF226J)
The C64x VLIW Architecture
The C64x CPU hosts dual sets of functional units each with dedicated register files
.L and .S functional units perform arithmetic, logical and branch operations
.M functional units perform two 16 bit x 16 bit multiplies per clock
.D data addressing units are responsible for data transfers between register files and memory
32 registers, each with 32 bits
Source: TMS320C64x Technical Overview
Key issue – the need to support fixed point math
Most DSPs utilize fixed point rather than floating point math
Allows for a significant reduction in the power and cost associated with the device
Fixed point math requires scaling to occur
16 bit * 16 bit = 32 bit to maintain full precision
16 bit + 16 bit = 17 bit to maintain full precision
If A, B, C, D are 16 bit values, then A*B + C*D requires 33 bits
Can’t be supported on a processor that only does 32 bit arithmetic
Usually handled by right shifting the products to make them 31 bit numbers
This is simplified if using unsigned or sign/magnitude values versus 2’s complement
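The bit growth described above can be checked with a worst-case example. This sketch uses Python's unbounded integers to model signed 16-bit operands in 2's complement; the helper names are invented for illustration, and a shift of one bit per product is assumed as the scaling step.

```python
# Worst-case bit growth for A*B + C*D with signed 16-bit operands.
INT16_MIN = -(1 << 15)
INT32_MIN, INT32_MAX = -(1 << 31), (1 << 31) - 1

def fits_int32(x):
    """True if x is representable as a signed 32-bit value."""
    return INT32_MIN <= x <= INT32_MAX

def mac_full(a, b, c, d):
    """Full-precision sum of two products (may need 33 bits)."""
    return a * b + c * d

def mac_scaled(a, b, c, d):
    """Right shift each product by one bit before adding."""
    return ((a * b) >> 1) + ((c * d) >> 1)

# All four operands at the most negative 16-bit value is the worst case.
worst = (INT16_MIN, INT16_MIN, INT16_MIN, INT16_MIN)
full = mac_full(*worst)
scaled = mac_scaled(*worst)

print("full precision fits in 32 bits:", fits_int32(full))
print("scaled sum fits in 32 bits:    ", fits_int32(scaled))
```

Each product alone fits in 32 bits, but their sum reaches 2^31, one past the largest signed 32-bit value. Shifting each product right by one bit costs the least significant bit of precision and keeps the sum in range.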
GSM Synchronization Channel Matched Filter Implementation on a TMS320C6416T DSP
During each clock cycle (order is important, and optimizations are required):
Add/subtract the previous complex products to accommodate conjugation, and add to the previous sum with appropriate scaling, using the .L and .S units
C_Real = W + X + C_Real
C_Imag = Y - Z + C_Imag
Compute new complex products using the .M units
W = MF_Real * S_Real
X = MF_Imag * S_Imag
Y = MF_Imag * S_Real
Z = MF_Real * S_Imag
Decrement m (m = m - 1) using the .S unit
If m = 0, branch using the .S unit
Save C to memory as Cn using the .D unit
Increment n
Set m = 256 and reset C to 0
Load the next S_Real(n + m), S_Imag(n + m), MF_Real(m), MF_Imag(m) into the A and B registers using the .D units
$$C_n = \sum_{m=0}^{255} MF(m)\, S^{*}(n+m)$$
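The per-tap steps above can be sketched as a sequential loop that uses only real multiplies and adds, which is how a fixed point core would evaluate one lag of the correlation. This is a behavioral sketch, not C64x code: the function name is hypothetical, the scaling step is omitted because Python integers do not overflow, and the loop is arranged so that m counts down through every tap from the last to tap 0.

```python
def correlate_one_lag(mf_re, mf_im, s_re, s_im, n):
    """Return (C_Real, C_Imag) for lag n using four real multiplies per tap."""
    c_real = 0
    c_imag = 0
    m = len(mf_re)
    while m > 0:
        m -= 1                                # decrement the tap index
        a_re, a_im = mf_re[m], mf_im[m]       # load matched filter tap
        b_re, b_im = s_re[n + m], s_im[n + m] # load signal sample
        w = a_re * b_re                       # the four real products
        x = a_im * b_im
        y = a_im * b_re
        z = a_re * b_im
        c_real = w + x + c_real               # real part of MF * conj(S)
        c_imag = y - z + c_imag               # imaginary part of MF * conj(S)
    return c_real, c_imag

# Tiny worked example with a 2-tap filter and integer samples.
mf = [1 + 2j, 3 - 1j]
s = [2 + 1j, 0 + 1j, 1 - 1j]
mf_re = [int(v.real) for v in mf]
mf_im = [int(v.imag) for v in mf]
s_re = [int(v.real) for v in s]
s_im = [int(v.imag) for v in s]

lag0 = correlate_one_lag(mf_re, mf_im, s_re, s_im, 0)
lag1 = correlate_one_lag(mf_re, mf_im, s_re, s_im, 1)
print(lag0, lag1)
```

Every tap costs four multiplies plus the adds, loads, and loop control, so a 256-tap filter needs at least 256 passes through this loop body for each output lag.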
Performance
Training sequence = 64/156.25, or 0.4096, of the bits in a sync burst
For a 1 GHz clock, the filter implementation shown on the previous slide, operating on input data at 1 MSPS, will consume more than 0.4 of the cycles for 1 pass
Only 2 passes (n=0 and n=1) and the processor is exhausted
Solution is to:
Reduce precision to allow more operations per cycle, or
Go to a faster processor
A typical FPGA for wireless signal processing
The Xilinx Virtex 5 SXT
Virtex 5 SXT FPGA has a number of “features” supporting wireless signal processing
DSP48E Slice
Embedded “Block RAM”
Configurable Logic Blocks
Source: Virtex-5 Family Overview LX, LXT, and SXT Platforms
Virtex 5 SXT Block RAM
Up to 244 36 Kbit dual port blocks
Built in address sequencing to support FIFO and shift register functions
Source: Virtex-5 Users Guide
Virtex 5 Configurable Logic Blocks
Primary logic resources provided by the FPGA
Each CLB has 2 “slices”
Each “slice” consists of multiple look up tables, registers, and combinatorial logic elements
GSM Synchronization Channel Matched Filter Implementation on a Virtex 5 SXT FPGA
[Figure: parallel matched filter structure]
Signal In feeds a 256 tap shift register (S0 ... S255) created using BlockRAM
Filter taps (MF0 ... MF255) stored in BlockRAM
Each shift register tap Sk is multiplied by the corresponding filter tap MFk, and the products are summed to produce C Out
Multiply and Accumulate functions supported via DSP48 slices
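The structure in this figure can be modeled behaviorally: a shift register holds the most recent samples, every tap has its own multiplier, and the products are summed, so each clock produces one output. The sketch below is a software model under assumptions, not HDL. The class name is invented, the conjugation follows the correlation function given earlier, and samples are assumed to shift so that the oldest sample lines up with MF0.

```python
from collections import deque

class ParallelCorrelator:
    """Behavioral model: one correlation output per clock() call."""

    def __init__(self, taps):
        self.taps = list(taps)
        # Shift register, oldest sample at index 0 (assumed shift direction).
        self.shift_reg = deque([0j] * len(self.taps), maxlen=len(self.taps))

    def clock(self, sample):
        # Shift in one new sample; the oldest sample falls off the other end.
        self.shift_reg.append(sample)
        # In hardware all of these multiplies happen at once, one per tap.
        products = [mf * s.conjugate() for mf, s in zip(self.taps, self.shift_reg)]
        # Adder tree: sum every product into a single output.
        return sum(products)

# Small demonstration with a 4-tap filter instead of 256 taps.
taps = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]
corr = ParallelCorrelator(taps)
stream = [0j, 0j] + taps + [0j]
outputs = [corr.clock(sample) for sample in stream]
peak_clock = max(range(len(outputs)), key=lambda k: abs(outputs[k]))
print("peak on clock", peak_clock, "value", outputs[peak_clock])
```

The loop inside `clock` is sequential only because this is software. In the FPGA each multiplier is a separate piece of hardware, which is what lets the whole sum finish in a single clock.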
Performance
Because the FPGA is an inherently parallel processor, the entire 256 tap correlation can be done in 1 clock cycle, giving one new output C for every input sample
The Virtex 5 supports this at clock rates up to 550 MHz, i.e. input rates up to 550 MSPS
So why wouldn’t you always use an FPGA???
Device Selection Process
[Flowchart. Recoverable process boxes:]
Develop Block Diagram Architectures Supporting Each Mode of Target Air Interfaces
Modify Block Diagrams to Better Align Algorithms From Different Air Interfaces
Identify Algorithms for Implementing Each Defined Block. Establish Functional Requirements for Algorithms
Map Algorithms to Candidate Processing Devices and Select Devices
Modify Algorithms Based on Device Constraints
Modify Block Diagrams Based on Device Constraints
[Recoverable inputs and outputs:]
Air Interface Specifications From Concept Definition
Block Diagrams
Block Diagrams with Functional Requirements
Selection Criteria for Processing Devices
Block Diagram with Functional Requirements Mapped to Selected Devices
This step may also include a “make versus buy” decision
Selection Criteria for Signal Processing Devices
Performance: Can it do the job?
Programmability: Life cycle/maintenance costs
Level of Integration: Cost per unit
Development Cycle: Cost of development
Power: Battery life
Example 4: Device Mapping for 8-PSK Demod
Figure 7a: Base Architecture
[Block diagram. Recoverable blocks and labels:]
Input: IF Signal From RF Subsystem (65 MSPS)
Decimate by 30 Digital Down Converter (< 65 ops/sample)
Decimate by 2 +/- ε Resampler (< 62 ops/sample)
Quadrature Phase Detector (< 100 ops/sample)
Symbol Rate Detector (< 200 ops/sample), with Symbol Rate Adjust
Carrier Rate Detector (< 100 ops/sample), with Carrier Adjust
Integrate and Dump
Symbol Mapping
Baud Clock (270.833 kHz)
Output: Recovered Bits To Channel Decoder
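The ops/sample figures in this architecture can be turned into a rough processing budget, which shows why the front end is a candidate for dedicated hardware. The sketch below is a back-of-envelope estimate under stated assumptions: each ops/sample bound is applied to an assumed input rate for that block (the figure does not say which rate each bound refers to), the resampler is treated as an exact decimate by 2, and blocks without an ops figure are left out.

```python
# Rough ops/second budget for the 8-PSK demodulator blocks.
IF_RATE_SPS = 65e6           # IF sample rate from the figure
DDC_DECIMATION = 30
RESAMPLER_DECIMATION = 2     # nominal; the figure shows 2 +/- epsilon

ddc_out_sps = IF_RATE_SPS / DDC_DECIMATION
resampler_out_sps = ddc_out_sps / RESAMPLER_DECIMATION

# (block name, ops/sample upper bound from the figure, ASSUMED input rate)
blocks = [
    ("Digital Down Converter", 65, IF_RATE_SPS),
    ("Resampler", 62, ddc_out_sps),
    ("Quadrature Phase Detector", 100, resampler_out_sps),
    ("Symbol Rate Detector", 200, resampler_out_sps),
    ("Carrier Rate Detector", 100, resampler_out_sps),
]

load = {name: ops * rate for name, ops, rate in blocks}
channelizer_ops = load["Digital Down Converter"]
channel_processor_ops = sum(load.values()) - channelizer_ops

for name, ops_per_s in load.items():
    print(f"{name:28s} {ops_per_s / 1e6:8.1f} Mops/s")
print(f"front end: {channelizer_ops / 1e9:.3f} Gops/s, "
      f"remaining blocks: {channel_processor_ops / 1e6:.0f} Mops/s")
```

Under these assumptions the digital down converter alone needs several billion operations per second, while the remaining blocks together stay well under one billion. The resampler output rate also works out to about four times the 270.833 kHz baud clock.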
Example 4 Continued: Mapping 1
Figure 7b: Device Mapping 1
[Block diagram: the blocks of Figure 7a partitioned between an ASIC (Channelizer) and a DSP or GPP (Channel Processor)]
Example 4 Continued: Mapping 2
Figure 7c: Device Mapping 2
[Block diagram: the blocks of Figure 7a partitioned between an FPGA (Channelizer) and a DSP or GPP (Channel Processor)]
Example 4 Continued: Mapping 3
Figure 7d: Device Mapping 3
[Block diagram: the blocks of Figure 7a partitioned between an FPGA (Channelizer) and a DSP or GPP (Channel Processor)]
What Generally Goes Where
ASIC/FPGA DSP GPP
Digital Down Conversion/ Digital Up Conversion
Resampling Resampling
Equalization and Pre-emphasis Filtering
Carrier Synchronization Carrier Synchronization
Chip Rate/Code Synchronization
Symbol Synchronization Symbol Synchronization
Spread/Despread Modulation/Demodulation Modulation/Demodulation
Carrier Synchronization Interleaving/ De-Interleaving
Interleaving/ De-Interleaving
Symbol Synchronization Packet Framing Packet Framing
Diversity Combining Error Correction Coding/Decoding
Error Correction Coding/Decoding
Resampling Link Layer Processing
Note: System on Chip (SoC) technology may integrate all of the above
Real world example (Source: EE Times)
DSP for baseband signal processing
GPP for higher levels of the protocol stack
Hardware coprocessor (ASIC) for computationally expensive functions