CA-RAM: A High-Performance Memory Substrate for Search-Intensive Applications
Sangyeun Cho, J. R. Martin, R. Xu, M. H. Hammoud, and R. Melhem
Dept. of Computer Science, University of Pittsburgh
ISPASS 2007
Search ops in applications
Search (or lookup) operations represent an important common function
Network packet processing
• For each arriving packet, determine the output port
• Given packet information, find a matching classification rule
• Each lookup can incur many memory accesses
Speech recognition
• Searching (e.g., dictionary lookup) takes up ~24% of CPU cycles
Forthcoming RMS (Recognition, Mining, and Synthesis) apps
Search performance and power
Search performance must match increasing line speeds
• For OC-768, up to 104M packets must be processed per second
• Network traffic has doubled every year [McKeown03]
• Routing tables (~200K prefixes in a core router) are growing [RIS]
• IPv6
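To put the 104M packets/s figure in perspective, here is a back-of-the-envelope calculation in Python. It is not from the talk. The OC-768 line rate (39.813 Gb/s) is the standard SONET figure, and the implied packet size is simply what that rate divided by 104M packets/s works out to.

```python
OC768_LINE_RATE_BPS = 39.813e9   # SONET OC-768 line rate, bits per second
PACKETS_PER_SECOND = 104e6       # worst-case packet rate quoted on the slide

# Time between packet arrivals, i.e., how often a new lookup must start
budget_ns = 1e9 / PACKETS_PER_SECOND

# Packet size (bytes, including any framing) implied by the two numbers above
implied_bytes = OC768_LINE_RATE_BPS / PACKETS_PER_SECOND / 8

print(f"per-packet budget: {budget_ns:.2f} ns")          # per-packet budget: 9.62 ns
print(f"implied packet size: {implied_bytes:.1f} bytes")  # implied packet size: 47.9 bytes
```

A new lookup must start roughly every 9.6 ns. A method that needs many dependent memory accesses per lookup therefore has to overlap many lookups to keep up.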
Power and thermal issues are already critical limiting factors in network-processing device design [McKeown03]
Search in battery-operated devices should be energy-efficient
Conventional search solutions
• Software methods (tries, hash tables, …)
• Hardware methods (CAM, TCAM, …)
IP lookup using a trie
Consider an IP address: 0 1 0 0 0 1 1 0
Software approach is “flexible”, but:
• High memory capacity requirement
• High memory bandwidth requirement
• Not SCALABLE
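The trie figure did not survive extraction. As a rough stand-in, here is a minimal Python sketch of a binary (one bit per level) trie that performs longest-prefix match on the slide's 8-bit address 01000110. It uses the prefix set listed on the TCAM slide; we assume both slides use the same table. Each trie level visited counts as one memory access, which is where the bandwidth cost comes from. The class and function names are illustrative and not from the talk.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # next bit ('0' or '1') -> TrieNode
        self.prefix = None   # set when a stored prefix ends at this node


def insert(root, prefix):
    node = root
    for bit in prefix:
        node = node.children.setdefault(bit, TrieNode())
    node.prefix = prefix


def longest_prefix_match(root, address):
    """Walk the trie bit by bit, remembering the deepest stored prefix seen."""
    node, best, accesses = root, None, 0
    for bit in address:
        node = node.children.get(bit)
        if node is None:
            break
        accesses += 1            # each level visited = one (dependent) memory access
        if node.prefix is not None:
            best = node.prefix
    return best, accesses


PREFIXES = ["110100", "110101", "110111", "01000", "01100", "01101",
            "11011", "0100", "0110", "1101", "10", "0"]

root = TrieNode()
for p in PREFIXES:
    insert(root, p)

print(longest_prefix_match(root, "01000110"))   # ('01000', 5)
```

This 8-bit lookup already costs five dependent accesses. For 32-bit IPv4 addresses, a trie with one bit per level can need up to 32.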
IP lookup using TCAM
Consider an IP address: 0 1 0 0 0 1 1 0
Stored prefixes: 110100*, 110101*, 110111*, 01000*, 01100*, 01101*, 11011*, 0100*, 0110*, 1101*, 10*, 0*
• Sort the prefixes (longest first) before storing
• Among the matching entries, choose the first
High-bandwidth, constant-time lookup, but:
• TCAMs are relatively small and expensive
• Power consumption is very high
• Not SCALABLE
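A software model can mimic what the TCAM does in a single step. The Python sketch below is an illustration, not a hardware description. It stores each prefix as a value/care-mask pair and compares the 8-bit key against every entry, which a real TCAM does all at once. A priority encoder then picks the lowest-index hit. Because the entries are sorted longest-first, that first hit is the longest matching prefix.

```python
WIDTH = 8  # key width used on the slide


def to_ternary(prefix, width=WIDTH):
    """Encode a prefix like '0100*' as (value, care_mask) bit vectors."""
    bits = prefix.rstrip("*")
    value = (int(bits, 2) << (width - len(bits))) if bits else 0
    care = ((1 << len(bits)) - 1) << (width - len(bits))   # 1 = bit must match
    return value, care


def tcam_lookup(entries, key):
    """Compare key with every entry, then priority-encode the first match."""
    hits = [(key & care) == value for value, care in entries]  # parallel in hardware
    for index, hit in enumerate(hits):
        if hit:
            return index
    return None


PREFIXES = ["110100*", "110101*", "110111*", "01000*", "01100*", "01101*",
            "11011*", "0100*", "0110*", "1101*", "10*", "0*"]
# Sort before storing: longest prefixes first (stable sort keeps the slide order)
PREFIXES.sort(key=lambda p: len(p.rstrip("*")), reverse=True)
ENTRIES = [to_ternary(p) for p in PREFIXES]

match = tcam_lookup(ENTRIES, 0b01000110)
print(match, PREFIXES[match])   # 3 01000*
```

The cost is that every entry is compared on every lookup, which is why TCAM power consumption is so high.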
CA-RAM – a hybrid approach
Can we do better than the existing conventional schemes?
• CAM-like search performance
• RAM-like cost and power
CA-RAM combines hashing w/ hardware parallel matching
CA-RAM design goals
• High lookup performance
• Low power consumption
• Smaller chip area per stored datum
• Straightforward system-level integration
Talk roadmap
What is CA-RAM?
Prototype design
Case study 1: IP lookup
Case study 2: Trigram lookup for speech recognition
CA-RAM – Content Addressable RAM
Separate match logic and memory
• Match logic for a single row, not every row
• Allows the use of dense RAM technology
• Enables highly reconfigurable match logic
Keep keys sorted in each row, not in the entire array
[Figure: conventional CAM/TCAM vs. CA-RAM, showing where the match logic sits relative to the memory cells]
Very simple, yet efficient
• Use hashing to store keys in a particular row
• To look up, hash the search key and retrieve one row
• Perform matching on the entire row in parallel
• Achieve full content addressability w/o paying the overhead!
[Figure: the search key feeds an index generator that selects one row of stored keys (Key_i1, Key_i2, …); match processors 1, 2, … compare each key in that row with the search key]
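The basic lookup flow fits in a few lines. This Python sketch is a toy model, not the prototype. Python's built-in `hash` (or any function passed in) stands in for the index generator, and each row is a short list of (key, data) slots. The list comprehension plays the role of the match processors, which compare all slots of the fetched row at once. The class and method names are hypothetical.

```python
from typing import Callable, Hashable, List, Optional, Tuple


class CARAMSketch:
    """Toy CA-RAM: hash a key to one row, then match every slot of that row."""

    def __init__(self, num_rows: int, slots_per_row: int,
                 hash_fn: Callable[[Hashable], int] = hash):
        self.num_rows = num_rows
        self.slots_per_row = slots_per_row
        self.hash_fn = hash_fn
        self.rows: List[List[Tuple[Hashable, object]]] = [[] for _ in range(num_rows)]

    def _index(self, key) -> int:
        # Index generator: the row is chosen by hashing the key.
        return self.hash_fn(key) % self.num_rows

    def insert(self, key, data) -> bool:
        row = self.rows[self._index(key)]
        if len(row) >= self.slots_per_row:
            return False              # bucket overflow: not handled in this sketch
        row.append((key, data))
        return True

    def lookup(self, key) -> Optional[object]:
        row = self.rows[self._index(key)]      # one row read from dense RAM
        # Match processors: in hardware, all slots are compared in parallel.
        hits = [data for stored, data in row if stored == key]
        return hits[0] if hits else None


table = CARAMSketch(num_rows=4, slots_per_row=2, hash_fn=lambda k: k)
table.insert(5, "port A")
table.insert(9, "port B")         # 5 and 9 both land in row 1
print(table.lookup(9))            # port B
print(table.insert(13, "port C")) # False: row 1 is already full
```

A lookup touches exactly one row. Its cost is therefore one RAM access plus one row-wide comparison, no matter how many rows the table has.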
Pipelined CA-RAM operation
[Figure: pipelined CA-RAM operation; while match processors compare one search key against its row, the next key's row is being read and the index of a newer key is being generated]
Step 1: Index generation
Step 2: Memory access
Step 3: Key matching
Step 4: Result forwarding
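The four steps above can overlap across successive search keys. The Python sketch below shows which lookup occupies which stage in each cycle. It assumes an idealized pipeline with one cycle per stage and no stalls; a real memory access may span several cycles. Stage names follow the slide, and the function is hypothetical.

```python
STAGES = ["index generation", "memory access", "key matching", "result forwarding"]


def pipeline_schedule(num_lookups):
    """Per cycle, map each busy stage to the lookup it is working on."""
    schedule = []
    for cycle in range(num_lookups + len(STAGES) - 1):
        busy = {}
        for depth, stage in enumerate(STAGES):
            lookup = cycle - depth     # lookup k reaches stage `depth` at cycle k + depth
            if 0 <= lookup < num_lookups:
                busy[stage] = lookup
        schedule.append(busy)
    return schedule


for cycle, busy in enumerate(pipeline_schedule(3)):
    print(cycle, busy)
```

Once the pipeline fills, one lookup completes per cycle. N lookups take N + 3 cycles instead of 4N.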
Dealing w/ bucket overflows
Careful design of the hash function
Increase bucket size
• Reduce the load factor (α); α = # of occupied entries / # of total entries
Use “chaining”: store overflows in subsequent rows
• Multiple accesses per lookup
Use a small overflow CAM, accessed in parallel
• Similar to the popular “victim caching”
Use two-level hashing and employ multiple CA-RAM banks
……
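The following Python sketch makes the overflow-CAM option concrete under simplifying assumptions. Keys are integers, `key % num_rows` stands in for the hash function, and Python dicts play the role of both a row's match logic and the small overflow CAM. The class name and its parameters are hypothetical.

```python
class OverflowCARAM:
    """Toy CA-RAM with a small overflow CAM searched alongside the row (victim-cache style)."""

    def __init__(self, num_rows, slots_per_row, cam_capacity):
        self.num_rows = num_rows
        self.slots_per_row = slots_per_row
        self.cam_capacity = cam_capacity
        self.rows = [dict() for _ in range(num_rows)]
        self.overflow_cam = {}

    def _index(self, key):
        return key % self.num_rows       # stand-in hash: integer key modulo row count

    def insert(self, key, data):
        row = self.rows[self._index(key)]
        if len(row) < self.slots_per_row:
            row[key] = data
        elif len(self.overflow_cam) < self.cam_capacity:
            self.overflow_cam[key] = data    # bucket full: spill into the CAM
        else:
            raise OverflowError(f"no room for key {key}")

    def lookup(self, key):
        # In hardware, the row read and the CAM search happen at the same time,
        # so a spilled key costs no extra sequential access.
        row = self.rows[self._index(key)]
        if key in row:
            return row[key]
        return self.overflow_cam.get(key)

    def load_factor(self):
        """alpha = occupied row entries / total row entries."""
        stored = sum(len(r) for r in self.rows)
        return stored / (self.num_rows * self.slots_per_row)


t = OverflowCARAM(num_rows=4, slots_per_row=2, cam_capacity=4)
for k in range(10):
    t.insert(k, k * 10)       # keys 8 and 9 find their rows full and spill
print(sorted(t.overflow_cam), t.lookup(9), t.load_factor())   # [8, 9] 90 1.0
```

Keeping α low makes overflows rare, so a small CAM can absorb the remaining spills. Chaining would instead require a second row read for each spilled key.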
CA-RAM reconfig. opportunities
Reconfigurable match logic allows:
Adapting key size to apps
• Same hardware to support multiple apps or standards
……
Adapting key size
[Figure: reconfigurable match logic selects which bits of each stored key (Key_i1, Key_i2, …) take part in matching and produces the match information]
Select key bits for matching
• Adapting key size is straightforward
• Helps support multiple apps/standards
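Here is a small software analogue of selecting key bits. The same 16-bit row is read as four 4-bit keys or two 8-bit keys, depending only on `key_width`. The function name and the packing order (slot 0 in the most significant bits) are assumptions for illustration.

```python
def match_row(row_bits, row_width, key_width, search_key):
    """Compare search_key with every key_width-bit slot packed into one row."""
    if row_width % key_width:
        raise ValueError("row width must be a multiple of the key width")
    mask = (1 << key_width) - 1
    hits = []
    for slot in range(row_width // key_width):
        shift = row_width - (slot + 1) * key_width   # slot 0 = most-significant bits
        if ((row_bits >> shift) & mask) == search_key:
            hits.append(slot)
    return hits


ROW = 0xABCD  # one 16-bit row of stored bits
print(match_row(ROW, 16, 4, 0xC))    # [2]  (four 4-bit keys: A, B, C, D)
print(match_row(ROW, 16, 8, 0xCD))   # [1]  (two 8-bit keys: AB, CD)
```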
CA-RAM reconfig. opportunities
Reconfigurable match logic allows:
Adapting key size to apps
• Same hardware to support multiple apps or standards
Binary and ternary matching
• Some apps require ternary matching, some don’t
……
Supporting binary/ternary matching
[Figure: reconfigurable match logic compares the search key with each stored key (Key_i1, Key_i2, …) under its mask (Mask_i1, Mask_j1, …) to produce the match information]
• Developed a configurable comparator
• Ternary matching requires 2 bits per symbol
• Supporting different types of matching in different bit positions is feasible
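[Figure: configurable comparator; for each bit i, key bit K_i is XORed with the search-key bit, and a 2:1 mux controlled by the ternary-mode bit TM_i selects whether the mask bit M_i is considered; the per-bit signals MATCH_0 … MATCH_N-1 combine into MATCH]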
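The comparator figure can be paraphrased in software. The Python sketch below works on all bits at once. Each ternary symbol is stored as a value bit plus a mask bit (the 2 bits per symbol noted above), and a per-bit mode word decides which positions honor the mask. The encoding (mask bit 1 = "don't care") and all names are our assumptions, not taken from the prototype.

```python
def configurable_match(stored_key, search_key, mask, ternary_mode, width):
    """Per-bit configurable comparator (bit-parallel emulation).

    A bit position matches if the stored and search bits are equal, or if that
    position is in ternary mode and its mask bit marks it as "don't care".
    """
    all_ones = (1 << width) - 1
    differs = (stored_key ^ search_key) & all_ones   # XOR: 1 where bits differ
    dont_care = mask & ternary_mode & all_ones       # mask honored only in ternary positions
    match_bits = (~differs & all_ones) | dont_care   # MATCH_i for every bit i
    return match_bits == all_ones                    # MATCH = AND of all MATCH_i


# Stored ternary entry 101X: value bits 1010, mask bit 0 set (assumed: 1 = don't care)
print(configurable_match(0b1010, 0b1011, 0b0001, 0b0000, 4))  # False: binary mode
print(configurable_match(0b1010, 0b1011, 0b0001, 0b1111, 4))  # True: ternary mode
print(configurable_match(0b1010, 0b1011, 0b0001, 0b0001, 4))  # True: only bit 0 ternary
```

Because the mode is a per-bit word, one entry can mix exact-match fields and wildcard fields, as the slide notes.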
CA-RAM reconfig. opportunities
Reconfigurable match logic allows:
Adapting key size to apps
• Same hardware to support multiple apps or standards
Binary and ternary matching
• Some apps require ternary matching, some don’t
Storing data and keys in a CA-RAM module
• Cuts # of memory accesses for a lookup by half
……
Simult. key matching & data access
[Figure: stored keys (Key_i1, Key_j1, …) share rows with their data (Data_i1, Data_j1, …); the reconfigurable match logic matches the key, bypasses the corresponding data, and returns the match result together with the data]
• With a TCAM, the data access follows the TCAM lookup
• CA-RAM supports data embedding
• Cuts memory traffic & latency by half
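The toy Python model below counts memory accesses for each approach. With a TCAM, the match index drives a second read from a separate SRAM that holds the associated data. A CA-RAM row carries keys and data together. Exact-match string keys stand in for the ternary prefixes, and the function names are hypothetical.

```python
def tcam_then_sram(tcam_keys, sram_data, key):
    """TCAM search yields a match index; the data comes from a second memory."""
    accesses = 1                                   # access 1: TCAM search
    if key not in tcam_keys:
        return None, accesses
    index = tcam_keys.index(key)                   # priority encoder output
    accesses += 1                                  # access 2: SRAM read at that index
    return sram_data[index], accesses


def caram_embedded(row, key):
    """Keys and data share one CA-RAM row, so a single row read returns both."""
    accesses = 1                                   # the only access: read the row
    for stored_key, data in row:                   # compared in parallel in hardware
        if stored_key == key:
            return data, accesses
    return None, accesses


keys, ports = ["1101*", "10*", "0*"], [3, 5, 7]
print(tcam_then_sram(keys, ports, "10*"))                # (5, 2)
print(caram_embedded(list(zip(keys, ports)), "10*"))     # (5, 1)
```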
CA-RAM reconfig. opportunities
Reconfigurable match logic allows:
Adapting key size to apps
• Same hardware to support multiple apps or standards
Binary and ternary matching
• Some apps require ternary matching, some don’t
Storing data and keys in a CA-RAM module
• Cuts # of memory accesses for IP lookup by half
Providing range checking capabilities
• Beneficial for rule-based packet filtering
……
Supporting range checking
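[Figure: each stored key (Key_i1, Key_j1, …) is paired with a range field (Range_i1, Range_j1, …); the reconfigurable match logic matches the key and checks the range against the search key, producing the match information]
(In a TCAM, range checking causes trouble: entries must be expanded)
CA-RAM can support range checking efficiently: match the key & check the range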
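The parenthetical note refers to the well-known range-expansion problem. A TCAM can store only prefixes, so one arbitrary range must be split into several prefix entries, up to 2W - 2 for a W-bit field. The Python sketch below contrasts that expansion with a CA-RAM-style entry that stores the range bounds directly and compares against them. Both functions and the example rule are illustrative assumptions, not the prototype's logic.

```python
def range_to_prefixes(lo, hi, width):
    """Split [lo, hi] into the minimal set of ternary prefixes a TCAM would need."""
    prefixes = []
    while lo <= hi:
        # Largest power-of-two block that starts at lo (aligned) and stays within hi
        size = (lo & -lo) if lo else (1 << width)
        while lo + size - 1 > hi:
            size >>= 1
        wild = size.bit_length() - 1              # number of trailing "don't care" bits
        fixed = format(lo >> wild, f"0{width - wild}b") if width > wild else ""
        prefixes.append(fixed + "*" * wild)
        lo += size
    return prefixes


def caram_range_match(entry, protocol, port):
    """CA-RAM style: one stored entry = exact-match key + [low, high] port range."""
    stored_protocol, low, high = entry
    return stored_protocol == protocol and low <= port <= high


print(range_to_prefixes(1, 14, 4))
# ['0001', '001*', '01**', '10**', '110*', '1110']  (six TCAM entries)
print(caram_range_match(("tcp", 1, 14), "tcp", 9))   # True, from a single entry
```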
CA-RAM-based memory subsystem
[Figure: CA-RAM-based memory subsystem; requests and configuration commands enter a request queue and an input controller, which dispatches them to an array of CA-RAM slices; an output controller collects the slices' results into a result queue]
Prototype implementation
We implemented a prototype CA-RAM slice design (w/ a degree of reconfigurability) and evaluated its power and area advantages over state-of-the-art TCAMs
We used a standard-cell (0.16 µm) ASIC design flow
Step                     # cells   Area (µm²)   Delay (ns)
Expand search key          3,804       66,228       (0.89)
Calculate match vector     5,252       10,591         0.95
Decode match vector          899        1,970         1.91
Extract result             6,037       21,775         1.99
Total                     15,992      100,564         4.85
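As a quick consistency check on the table, the Python snippet below re-adds the columns. The totals match the slide if the parenthesized 0.89 ns of the first step is left out of the delay sum. Our reading, which is an assumption, is that this step is not on the critical path. The clock figure is simply the reciprocal of the total delay, i.e., the rate an unpipelined slice would allow. The talk does not report it.

```python
# (step, cells, area in um^2, delay in ns); None marks the parenthesized delay
STEPS = [
    ("Expand search key",      3_804, 66_228, None),   # slide lists (0.89) ns
    ("Calculate match vector", 5_252, 10_591, 0.95),
    ("Decode match vector",      899,  1_970, 1.91),
    ("Extract result",         6_037, 21_775, 1.99),
]

total_cells = sum(cells for _, cells, _, _ in STEPS)
total_area = sum(area for _, _, area, _ in STEPS)
total_delay = sum(d for _, _, _, d in STEPS if d is not None)

print(total_cells, total_area, round(total_delay, 2))   # 15992 100564 4.85
print(f"unpipelined clock limit: {1e3 / total_delay:.0f} MHz")
```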
Area and power: CA-RAM vs. TCAM
[Figure: two bar charts comparing a 16T SRAM-based TCAM, an 8T DRAM-based TCAM, a 6T DRAM-based TCAM, and a DRAM-based ternary CA-RAM. Left: cell area (µm²) at 130 nm CMOS, annotated with CA-RAM advantages of 4.5x and 11x. Right: power (W) of a 4.5 Mb array at 143 MHz, annotated with CA-RAM advantages of 4x and 14x.]
CA-RAM area advantage 4.5x~11x
CA-RAM power advantage 4x~14x
Performance: CA-RAM vs. (T)CAM
Case study 1: IP lookup
Problem description
Given:
• A set of prefixes (each associated with an output port number)
• An IP address
Find a prefix that matches the input IP address and return its associated output port number
• If multiple prefixes match, choose the longest
Procedure:
• Find a good hash function to distribute the prefixes
• Determine the CA-RAM organization
Data set and hashing method
An IP core router’s table with 186,760 entries
Bit selection scheme [Zane et al. ’03]
• 98% of prefixes are at least 16 bits long
• Select hash bits from the first 16 bits (low-order bits)
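A rough sketch of bit-selection indexing follows. The Python below forms a row index by concatenating chosen bits of the IPv4 address. The specific positions (bits 5 to 15, counted from the most significant bit, giving 11 bits for a 2,048-row array) are a hypothetical choice; the talk says only that the hash bits come from the first 16 bits. The sketch also shows one possible way, an assumption on our part, to place a prefix shorter than the selected bits: replicate it into every row that its wildcard bits could select.

```python
import ipaddress


def bit_select_index(addr, positions):
    """Concatenate the chosen bits of a 32-bit IPv4 address (bit 0 = MSB)."""
    value = int(ipaddress.IPv4Address(addr))
    index = 0
    for pos in positions:
        index = (index << 1) | ((value >> (31 - pos)) & 1)
    return index


def rows_for_prefix(prefix_bits, positions):
    """Rows a prefix must occupy: selected bits beyond its length are wildcards."""
    rows = [0]
    for pos in positions:
        if pos < len(prefix_bits):
            rows = [(r << 1) | int(prefix_bits[pos]) for r in rows]
        else:  # wildcard bit: replicate the prefix into both halves
            rows = [(r << 1) | b for r in rows for b in (0, 1)]
    return rows


HASH_BITS = list(range(5, 16))   # hypothetical: 11 bits -> 2,048 rows

print(bit_select_index("10.0.0.0", HASH_BITS))            # 512
print(rows_for_prefix("0000101000000000", HASH_BITS))     # [512]: a /16 needs one row
print(len(rows_for_prefix("0" * 14, HASH_BITS)))          # 4: a /14 is replicated 4 times
```

Under this assumption, replication would affect only the roughly 2% of prefixes shorter than 16 bits.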
Shaping CA-RAM
Consider multiple design points (Designs A to F), built from CA-RAM arrays of 2,048 rows (32 entries/row) and 4,096 rows (64 entries/row)
Load factors: Design A α = 0.47, Design B α = 0.40, Design C α = 0.36, Design D α = 0.36, Design E α = 0.24, Design F α = 0.36
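To show how α relates to table size, the snippet below computes α = stored entries / total entries for the 186,760-entry table. The slide does not say which design uses which arrays or how many banks, so the configurations here are hypothetical and built only from the 4,096-row, 64-entry shape. Any prefix replication introduced by bit-selection hashing is ignored.

```python
TABLE_ENTRIES = 186_760   # prefixes in the core-router table (from the slide)


def load_factor(stored, rows, entries_per_row, banks=1):
    """alpha = occupied entries / total entries."""
    return stored / (rows * entries_per_row * banks)


# Hypothetical configurations: 1, 2, or 3 banks of 4,096 rows x 64 entries
for banks in (1, 2, 3):
    alpha = load_factor(TABLE_ENTRIES, 4_096, 64, banks)
    print(f"{banks} bank(s): alpha = {alpha:.2f}")   # 0.71, 0.36, 0.24
```

Two of these hypothetical configurations happen to land on α values that appear on the slide (0.36 and 0.24). We cannot confirm that they are the actual design parameters.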
[Figure: three bar charts for Designs A to F (α = 0.47, 0.40, 0.36, 0.36, 0.24, 0.36): lookup performance, percentage of spilled entries (0 to 40%), and average memory access latency (AMAL), under “uniform” and “skewed” traffic]
With a properly chosen α, CA-RAM achieves a near-constant AMAL
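The next example shows why AMAL stays close to one row access, and why traffic skew matters. The Python sketch assumes, hypothetically, that a key in its home row costs one row access and a spilled (chained) key costs two. The keys and traffic weights are made up.

```python
def amal(access_cost, traffic):
    """Average memory access latency (in row accesses), weighted by lookup frequency."""
    total = sum(traffic.values())
    return sum(access_cost[key] * weight for key, weight in traffic.items()) / total


# Hypothetical table: key "d" spilled out of its home row and needs a second access
access_cost = {"a": 1, "b": 1, "c": 1, "d": 2}

uniform = {"a": 1, "b": 1, "c": 1, "d": 1}
skewed_hot_home = {"a": 10, "b": 1, "c": 1, "d": 1}      # hot key sits in its home row
skewed_hot_spilled = {"a": 1, "b": 1, "c": 1, "d": 10}   # hot key was spilled

print(amal(access_cost, uniform))             # 1.25
print(amal(access_cost, skewed_hot_home))
print(amal(access_cost, skewed_hot_spilled))
```

A lower α shrinks the set of spilled entries, so fewer lookups pay the extra access, whichever keys happen to be hot.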
Area and power
[Figure: relative area and power of TCAM vs. CA-RAM (Design B)]
CA-RAM is advantageous over TCAM
Case study 2: Trigram lookup in speech recognition
Problem, data set, and hashing
Problem
• Look up a trigram in the trigram database
Data set
• A subset of the Sphinx trigram database
• We picked entries with 13~16 characters
• Still 5,385,231 entries, or 86 MB
Hashing
• DJB, an efficient string hash function
• (Used in Sphinx)
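The sketch below shows a DJB-style hash used to pick a CA-RAM row for a trigram. The talk does not say which DJB variant Sphinx uses. We assume the common additive djb2 form (start at 5381, multiply by 33, add each byte), truncated to 32 bits, and modulo indexing into a hypothetical 4,096-row array.

```python
def djb_hash(text: str) -> int:
    """Classic djb2 string hash: h = h * 33 + c, starting from 5381 (kept to 32 bits)."""
    h = 5381
    for byte in text.encode("utf-8"):
        h = (h * 33 + byte) & 0xFFFFFFFF
    return h


def trigram_row(trigram: str, num_rows: int) -> int:
    """Row of a CA-RAM that a trigram maps to (hypothetical modulo indexing)."""
    return djb_hash(trigram) % num_rows


print(djb_hash(""), djb_hash("a"))          # 5381 177670
print(trigram_row("one two three", 4096))   # a 13-character trigram

# If each entry is padded to 16 bytes (an assumption), the data set needs:
print(5_385_231 * 16 / 1e6, "MB")           # about 86 MB, in line with the slide
```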
Result
Data distribution
Area comparison
[Figure: relative area of CAM vs. CA-RAM]
CA-RAM conclusions
Compared w/ software methods
• Fewer memory accesses; higher lookup performance
Compared w/ CAM or TCAM
• Higher density, matching that of DRAM, allowing large lookup tables
• Competitive performance
• Low power – a critical advantage for cost-effective system design
• Reconfigurable
Can accommodate apps with different key/record sizes, binary vs. ternary searching requirements, range checking, …
Can adopt new standards much more easily, e.g., IPv6
Two case studies show the efficacy of the CA-RAM approach
• 3~5× improvement in area and power, compared with CAM/TCAM
Questions?