A
PROJECT REPORT
ON
PRE-COMPUTATION BASED CONTENT
ADDRESSABLE MEMORY
Submitted in partial fulfillment of the requirements for the award of
“Bachelor of Technology”
IN
Electronics &
Communication
MAHARANA PRATAP ENGINEERING COLLEGE
KANPUR
Submitted to:
Mr. Anjaneya Nigam
(Asst. Professor)
Department of Electronics & Communication Engineering

Submitted by:
Neharika Mishra
Rahul Kr. Gupta
Rituraj Anand
Wasiuddin
Certificate
This is to certify that the project report entitled “PB-CAM”, which is
submitted by Neharika Mishra, Rahul Kr. Gupta, Rituraj Anand, and
Wasiuddin in partial fulfillment of the requirements for the award of the degree
of B.Tech in the Department of Electronics & Communication Engineering,
M.P.E.C., Kanpur, is a record of the candidates’ work carried out by them under
my supervision. The matter embodied in this thesis is original and has not
been submitted for the award of any other degree.
May 2011
Anjaneya Nigam
(Asst. Professor)
Department of Electronics Engineering
M.P.E.C., Kanpur
ACKNOWLEDGEMENT
It gives us a great sense of pleasure to present the report of the B.Tech project
undertaken during our B.Tech final year. We owe a special debt of gratitude to
Mr. Anjaneya Nigam (Asst. Professor), Department of Electronics &
Communication Engineering, Maharana Pratap Engineering College, Kanpur,
for his constant support and guidance throughout the course of this work; his
sincerity, thoroughness, and perseverance have been a constant source of
inspiration for us. It is only through his cognizant efforts that our endeavors
have seen the light of day.
We also take the opportunity to acknowledge the contribution of Mr. Mohit
Srivastava, Head, Department of Electronics and Communication Engineering,
Maharana Pratap Engineering College, Kanpur, for his full support and assistance
during the development of the project. We also do not want to miss the opportunity
to acknowledge the contribution of all faculty members of the department for their
kind co-operation during the development of our project.
Last but not least, we acknowledge our friends for their contribution to the
completion of the project.
May, 2011
Neharika Mishra
Rahul Kr. Gupta
Rituraj Anand
Wasiuddin
ABSTRACT
Content-addressable memory (CAM) is a special type of computer memory
used in certain very high speed searching applications. It is also known as
associative memory, associative storage, or associative array.
CAM is frequently used in applications, such as lookup tables, databases,
associative computing, and networking, that require high-speed searches,
owing to its ability to improve application performance by using parallel
comparison to reduce search time. Although parallel comparison reduces
search time, it also significantly increases power consumption. In this report,
we propose a Block-XOR approach to improve the efficiency of the low-power
pre-computation-based CAM (PB-CAM). Compared with the ones-count
PB-CAM system, experimental results show that our proposed approach
achieves on average 30% in power reduction and 32% in power-performance
reduction. The major contribution of this work is that it presents practical
proofs to verify that the proposed Block-XOR PB-CAM system can achieve
greater power reduction without the need for a special CAM cell design. This
implies that our approach is more flexible and adaptive for general designs.
CHAPTER 1
1.1 Introduction:
Most memory devices store and retrieve data by addressing specific
memory locations. As a result, this path often becomes the limiting factor for
systems that rely on fast memory accesses. The time required to find an
item stored in memory can be reduced considerably if the item can be
identified for access by its content rather than by its address. A memory that
is accessed in this way is called content-addressable memory or CAM.
It provides a performance advantage over other memory search algorithms,
such as binary or tree-based searches or look-aside tag buffers, by
comparing the desired information against the entire list of pre-stored entries
simultaneously, often resulting in an order-of-magnitude reduction in the
search time.
Classically a CAM is defined as a functional memory with a large
amount of stored data that simultaneously compares the input search data
with the stored data. Once matching data are found, their addresses are
returned as output as shown in Fig. 1.1.
Fig. 1.1: Conventional CAM architecture
Unlike standard computer memory (Random access memory or RAM) in
which the user supplies a memory address and the RAM returns the data
word stored at that address, a CAM is designed such that the user supplies
a data word and the CAM searches its entire memory to see if that data
word is stored anywhere in it. If the data word is found, the CAM returns a
list of one or more storage addresses where the word was found (and in
some architectures, it also returns the data word, or other associated pieces
of data). Thus, a CAM is the hardware embodiment of what in software
terms would be called an associative array.
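To make the analogy concrete, the following minimal Python sketch (our own
illustration, not part of any hardware implementation) models the two access
styles: a RAM lookup goes from address to data, while a CAM search goes from
data to the list of matching addresses.

    # Behavioral sketch only: a real CAM compares every word in parallel in hardware.
    class SimpleCAM:
        def __init__(self, words):
            self.words = list(words)  # stored data words, indexed by address

        def search(self, key):
            """Return every address whose stored word equals the search key."""
            return [addr for addr, word in enumerate(self.words) if word == key]

    cam = SimpleCAM([0b1010, 0b0110, 0b1010, 0b1111])
    print(cam.words[1])        # RAM-style access: address 1 -> 6 (0b0110)
    print(cam.search(0b1010))  # CAM-style access: data -> addresses [0, 2]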
Content-addressable memories (CAMs) are hardware search engines that
are much faster than algorithmic approaches for search-intensive
applications. CAMs are composed of conventional semiconductor memory
(usually SRAM) with added comparison circuitry that enables a search
operation to complete in a single clock cycle. A CAM can also be used as a
lookup table suited to applications such as AES.
A content-addressable memory (CAM) is a critical device for applications
involving asynchronous transfer mode (ATM), communication networks,
databases, lookup tables, tag directories, data compression, pattern-
recognition and security or encryption information on a packet-by-packet
basis for high-performance data switches, firewalls, bridges and routers due
to its high-speed data search capability. The vast number of comparison
operations required by CAMs consumes a large amount of power, so
pre-computation techniques have evolved to reduce it.
1.2 Literature Survey
In the past decade, much research on energy reduction has focused on the
circuit and technology domains (the literature provides comprehensive
surveys of CAM designs from the circuit level to the architectural level).
Several works on reducing CAM power consumption have focused on
reducing match-line power. Although there has been progress in this area in
recent years, the power consumption of CAMs is still high compared with
RAMs of similar size.
At the same time, research in associative cache system design for power
efficiency at the architectural level continues to increase. The filter cache
and location cache techniques can effectively reduce power dissipation by
adding a very small cache. However, the use of these caches requires major
modifications to the memory structure and hierarchy to fit the design.
Pagiamtzis et al. proposed a cache-CAM (C-CAM) that reduces power
consumption relative to the cache hit rate [8]. Lin et al. presented a
ones-count pre-computation-based CAM (PB-CAM) that achieves low-power,
low-cost, low-voltage, and high-reliability features.
Although Chang further improved the efficiency of PB-CAMs, the approach
proposed requires considerable modification of the memory architecture to
achieve high performance, and is therefore beyond the scope of general CAM
design. Moreover, the disadvantage of the ones-count PB-CAM system is
that it adopts a special memory cell design to reduce power consumption,
which is only applicable to the ones-count parameter extractor.
1.3 Existing System:
A CAM is a functional memory with a large amount of stored data that
compares the input search data with the stored data. Once matching data
are found, their addresses are returned as output. The vast number of
comparison operations required by CAMs consumes a large amount of
power.
1.4 Proposed System:
The proposed approach can reduce comparison operations by a minimum of
909 and a maximum of 2339 compared with the ones-count approach. We
propose a new parameter extractor, called Block-XOR, which achieves this
requirement.
1.5 Objective of the Project:
The main objective of the project is to implement the low-power
pre-computation-based content-addressable memory (PB-CAM) along with
the Block-XOR parameter extractor and the GATE block selection algorithm.
CHAPTER 2
CAM OVERVIEW
2.1 About CAM:
Content addressable memory (CAM) compares input search data against a
table of stored data, and returns the address of the matching data. CAMs
have a single clock cycle throughput making them faster than other
hardware- and software-based search systems. CAMs can be used in a
wide variety of applications requiring high search speeds. A CAM is a good
choice for implementing this lookup operation due to its fast search
capability.
However, the speed of a CAM comes at the cost of increased silicon area
and power consumption, two design parameters that designers strive to
reduce. As CAM applications grow, demanding larger CAM sizes, the power
problem is further exacerbated. Reducing power consumption, without
sacrificing speed or area, is the main thread of recent research in
large-capacity CAMs. Developments in the CAM area are surveyed at two
levels: circuits and architectures.
We can compare CAM to the inverse of RAM. When read, RAM produces
the data for a given address. Conversely, CAM produces an address for a
given data word. When searching for data within a RAM block, the search is
performed serially. Thus, finding a particular data word can take many
cycles. CAM searches all addresses in parallel and produces the address
storing a particular word.
CAM supports writing "don't care" bits into words of the memory. The don't
care bit can be used as a mask for CAM comparisons; any bit set to don't
care has no effect on matches.
The output of the CAM can be encoded or unencoded. The encoded output
is better suited for designs that ensure duplicate data is not written into the
CAM. If duplicate data is written into two locations, the CAM's output will not
be correct. If the CAM contains duplicate data, the unencoded output is a
better solution; a CAM with unencoded outputs can distinguish multiple data
locations.
CAM can be pre-loaded with data during configuration, or written during
system operation. In most cases, two clock cycles are required to write each
word into CAM. When don't care bits are used, a third clock cycle is
required.
2.1.1 Operation of CAM:
Fig. 2.1: Conceptual view of a content-addressable memory containing w
words.
Fig. 2.1 shows a simplified block diagram of a CAM. The input to the system
is the search word that is broadcast onto the search lines to the table of
stored data. The number of bits in a CAM word is usually large, with existing
implementations ranging from 36 to 144 bits. A typical CAM employs a table
size ranging from a few hundred entries to 32K entries, corresponding to
an address space of 7 to 15 bits.
Each stored word has a match line that indicates whether the search word
and stored word are identical (the match case) or are different (a mismatch
case, or miss). The match lines are fed to an encoder that generates a
binary match location corresponding to the match line that is in the match
state. An encoder is used in systems where only a single match is expected.
In addition, there is often a hit signal (not shown in the figure) that flags the
case in which there is no matching location in the CAM. The overall function
of a CAM is to take a search word and return the matching memory location.
One can think of this operation as a fully programmable arbitrary mapping of
the large space of the input search word to the smaller space of the output
match location.
The operation of a CAM is like that of the tag portion of a fully associative
cache. The tag portion of a cache compares its input, which is an address, to
all addresses stored in the tag memory. In the case of a match, a single match
line goes high, indicating the location of the match. Many circuits are common
to both CAMs and caches; however, we focus here on large-capacity CAMs
rather than on fully associative caches, which target smaller capacity and
higher speed.
Today’s largest commercially available single-chip CAMs are 18 Megabit
implementations, although the largest CAMs reported in the literature are 9
Megabit in size. As a rule of thumb, the largest available CAM chip is usually
about half the size of the largest available SRAM chip. This rule of thumb
comes from the fact that a typical CAM cell consists of two SRAM cells.
2.1.2 Simple CAM architecture:
Content Addressable Memories (CAMs) are fully associative storage
devices. Fixed-length binary words can be stored in any location in the
device. The memory can be queried to determine if a particular word, or key,
is stored, and if so, the address at which it is stored. This search operation is
performed in a single clock cycle by a parallel bitwise comparison of the key
against all stored words.
Fig. 2.2: Simple schematic of a model CAM with 4 words of 3 bits each.
We now take a more detailed look at CAM architecture. A small model is
shown in Fig. 2.2. The figure shows a CAM consisting of 4 words, with each
word containing 3 bits arranged horizontally (corresponding to 3 CAM cells).
There is a match line corresponding to each word (ML0, ML1, etc.) feeding
into match line sense amplifiers (MLSAs), and there is a differential search
line pair corresponding to each bit of the search word (SL0/SL̄0, SL1/SL̄1,
etc.). A CAM search operation begins with loading the search-data word into
the search-data registers, followed by pre-charging all match lines high, putting
them all temporarily in the match state. Next, the search line drivers
broadcast the search word onto the differential search lines, and each CAM
core cell compares its stored bit against the bit on its corresponding search
lines. Match lines on which all bits match remain in the pre-charged-high
state. Match lines that have at least one bit that misses, discharge to ground.
The MLSA then detects whether its match line has a matching condition or
miss condition. Finally, the encoder maps the match line of the matching
location to its encoded address.
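The three-phase cycle described above can be mimicked in software. The
sketch below is a behavioral model under our own naming, not a circuit-level
description: it precharges every match line high and then discharges any line
whose word has at least one mismatching bit.

    def cam_search_cycle(stored_words, search_word, width):
        """Model one search cycle: precharge MLs high, broadcast the search
        word, and discharge every ML with at least one mismatching bit."""
        match_lines = [True] * len(stored_words)      # precharge phase: all 'match'
        for addr, word in enumerate(stored_words):    # evaluation phase
            for bit in range(width):
                sl = (search_word >> bit) & 1         # broadcast search-line value
                if sl != ((word >> bit) & 1):         # a pull-down path activates
                    match_lines[addr] = False         # ML discharges to ground
                    break
        return match_lines

    print(cam_search_cycle([0b101, 0b011, 0b101, 0b110], 0b101, 3))
    # [True, False, True, False]: MLs 0 and 2 stay precharged (matches)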
2.2 Implementation types:
2.2.1 Semiconductor implementations:
Because a CAM is designed to search its entire memory in a single
operation, it is much faster than RAM in virtually all search applications.
There are cost disadvantages to CAM, however. Unlike a RAM chip, which
has simple storage cells, each individual memory bit in a fully parallel CAM
must have its own associated comparison circuit to detect a match between
the stored bit and the input bit. Additionally, match outputs from each cell in
the data word must be combined to yield a complete data word match signal.
The additional circuitry increases the physical size of the CAM chip which
increases manufacturing cost. The extra circuitry also increases power
dissipation since every comparison circuit is active on every clock cycle.
2.2.2 Alternative implementations:
To achieve a different balance between speed, memory size and cost, some
implementations emulate the function of CAM by using standard tree search
or hashing designs in hardware, using hardware tricks like replication or
pipelining to speed up effective performance. These designs are often used
in routers. Two types of CAMs are discussed below. They are Binary CAM
(BCAM) and Ternary CAM (TCAM).
2.3 CAM types:
2.3.1 Binary CAM:
Binary CAM is the simplest type of CAM; it uses data search words composed
entirely of 1s and 0s. Binary CAMs perform well for exact-match operations
and can be used for route lookups in strictly hierarchical addressing
schemes.
2.3.2 Ternary CAM:
Ternary CAM allows a third matching state of "X" or "don't care" for one or more bits in
the stored data word, thus adding flexibility to the search. For example, a
ternary CAM might have a stored word of "10XX0" which will match any of
the four search words "10000", "10010", "10100", or "10110". The added
search flexibility comes at an additional cost over binary CAM as the internal
memory cell must now encode three possible states instead of the two of
binary CAM. This additional state is typically implemented by adding a mask
bit ("care" or "don't care" bit) to every memory cell.
• Standard Ternary Mode. Bit X matches either 1, 0, or X (1010 = 1X1X =
10XX) and is referred to as a don’t care bit.
• Enhanced Ternary Mode. Bit X also matches either 1, 0, or X (1010 =
1X1X = 10XX), also referred to as a don’t care bit. Bit U does not match any
of the possible bit values: 1, 0, X, or U, and is referred to as an unmatchable
bit in this document.
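A value/mask encoding is one common way to model the basic don't-care
behavior in software. The sketch below is our own illustration (it covers the
"X" state only, not the enhanced-mode "U" bit); each masked bit matches any
key bit.

    def ternary_match(value, mask, key):
        """Bits set in 'mask' are 'X' (don't care); all other bits must match."""
        return (value & ~mask) == (key & ~mask)

    # Stored word "10XX0": bits 4..0 = 1,0,X,X,0  ->  value 0b10000, mask 0b00110
    value, mask = 0b10000, 0b00110
    for key in (0b10000, 0b10010, 0b10100, 0b10110, 0b00000):
        print(f"{key:05b}", ternary_match(value, mask, key))
    # the four "10XX0" patterns match; 00000 misses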
2.4 Core Cells and Match line Structure:
A CAM cell serves two basic functions: bit storage (as in RAM) and bit
comparison (unique to CAM). Fig. 2.3 shows a NOR-type CAM cell [Fig.
2.3(a)] and the NAND-type CAM cell [Fig. 2.3(b)]. The bit storage in both
cases is an SRAM cell where cross-coupled inverters implement the bit-
storage nodes D and D̄. To simplify the schematic, we omit the n-MOS
access transistors and bit lines which are used to read and write the SRAM
storage bit. Although some CAM cell implementations use lower area DRAM
cells, CAM cells typically use SRAM storage. The bit comparison, which is
logically equivalent to an XOR of the stored bit and the search bit, is
implemented in a somewhat different fashion in the NOR and the NAND
cells.
Fig. 2.3: CAM core cells for (a) NOR-type CAM and (b) NAND-type CAM
2.4.1 NOR Cell:
The NOR cell implements the comparison between the complementary
stored bit, D (and D̄), and the complementary search data on the
complementary search lines, SL (and SL̄), using four comparison transistors,
M1 through M4, which are all typically minimum-size to maintain high cell
density. These transistors implement the pull down path of a dynamic XNOR
logic gate with inputs SL and D. Each pair of transistors, M1/M3 and M2/M4,
forms a pull down path from the match line, ML, such that a mismatch of SL
and D activates at least one of the pull down paths, connecting ML to ground.
A match of SL and D disables both pull down paths, disconnecting ML from
ground. The NOR nature of this cell becomes clear when multiple cells are
connected in parallel to form a CAM word by shorting the ML of each cell to
the ML of adjacent cells. The pull down paths connect in parallel resembling
the pull down path of a CMOS NOR logic gate. There is a match condition
on a given ML only if every individual cell in the word has a match.
2.4.2 NAND Cell:
The NAND cell implements the comparison between the stored bit, D, and
the corresponding search data on the corresponding search lines, (SL, SL̄),
using the three comparison transistors M1, MD, and MD̄, which are all
typically minimum-size to maintain high cell density. We illustrate the
bit-comparison operation of a NAND cell through an example.
Consider the case of a match when SL=1 and D=1. Pass transistor MD is
ON and passes the logic “1” on the SL to node B. Node B is the bit-match
node, which is logic “1” if there is a match in the cell. The logic “1” on node B
turns ON transistor M1. Note that M1 is also turned ON in the other match
case, when SL=0 and D=0. In this case, the transistor MD̄ passes a logic high
to raise node B. The remaining cases, where SL ≠ D, result in a miss
condition, and accordingly node B is logic “0” and the transistor M1 is OFF.
Node B is thus a pass-transistor implementation of the XNOR function of SL and D.
The NAND nature of this cell becomes clear when multiple NAND cells are
serially connected. In this case, the MLn and MLn+1 nodes of adjacent cells
are joined to form a word. The serial n-MOS chain of all the M1 transistors
resembles the pull down path of a CMOS NAND logic gate. A match condition
for the entire word occurs only if every cell in the word is in the match condition.
An important property of the NOR cell is that it provides a full rail voltage at
the gates of all comparison transistors. On the other hand, a deficiency of
the NAND cell is that it provides only a reduced logic “1” voltage at node B,
which can reach only VDD-Vtn when the search lines are driven to VDD (where
VDD is the supply voltage and Vtn is the n-MOS threshold voltage).
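At the logic level, both cells compute the same bit comparison and differ only
in how the result drives the match line. A small truth-table sketch (function
names are ours, a behavioral abstraction of the transistor paths described
above) captures this:

    def nor_cell_pulldown(d, sl):
        # M1/M3 path: SL high with D̄ high; M2/M4 path: SL̄ high with D high.
        # Returns True when a pull-down path would discharge the match line.
        return bool((sl and not d) or (not sl and d))

    def nand_cell_bit_match(d, sl):
        # MD passes SL when D=1; MD̄ passes SL̄ when D=0 (node B = XNOR of SL, D).
        return bool((sl and d) or (not sl and not d))

    for d in (0, 1):
        for sl in (0, 1):
            print(d, sl, nor_cell_pulldown(d, sl), nand_cell_bit_match(d, sl))
    # The two functions are complementary: the pull-down fires exactly on a miss.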
2.4.3 Cell Variants:
Fig. 2.4 shows a variant of the NOR cell [Fig. 2.4(a)] and a variant of the
NAND cell [Fig. 2.4(b)]. The NOR cell variant uses only 9 transistors
compared to the previous 10-T NOR cell. The bit comparison uses pass
transistors (as in the 9-T NAND cell); however, the NOR property of
this cell is apparent when multiple cells are connected in parallel to form a
CAM word by shorting the ML of each cell to the ML of adjacent cells. The
pull down paths are in parallel just as in a CMOS NOR logic gate.
Fig. 2.4: CAM core cell variations for (a) 9-T NOR-type CAM and (b) 10-T
NAND-type CAM.
The 10-T NAND cell [Fig. 2.4(b)] is a variant of the previous 9-T NAND cell.
When the bit comparison succeeds in this cell, one of the transistor paths
between MLn and MLn+1 is ON. Thus, when multiple cells are shorted
together these transistor paths appear in series just as in the pull down
network of a CMOS NAND gate. Since this NAND cell doubles the number
of transistors in series, the 9-T NAND cell is usually preferred. For the remainder
of this report we discuss only the 9-T NAND cell and the 10-T NOR cell, as
they are in predominant use today.
2.4.4 Ternary Cells:
The NOR and NAND cells that have been presented are binary CAM cells.
Such cells store either a logic “0” or a logic “1”. Ternary cells, in addition,
store an “X” value. The “X” value is a don't care that represents both “0” and
“1”, allowing a wildcard operation. Wildcard operation means that an “X”
value stored in a cell causes a match regardless of the input bit. As
discussed earlier, this is a feature used in packet forwarding in Internet
routers.
A ternary symbol can be encoded into two bits according to Table I. We
represent these two bits as D and D̄. Note that although D and D̄ are not
necessarily complementary, we maintain the complementary notation for
consistency with the binary CAM cell. Since two bits can represent four
possible states, but ternary storage requires only three, we disallow the
state where D and D̄ are both zero. To store a ternary value in a NOR
cell, we add a second SRAM cell, as shown in Fig. 2.5(a). One bit, D, connects
to the left pull down path and the other bit, D̄, connects to the right pull down
path, making the pull down paths independently controlled. We store an “X”
by setting both D and D̄ to logic “1”, which disables both pull down paths and
forces the cell to match regardless of the inputs. We store a logic “1” by
setting D=1 and D̄=0, and a logic “0” by setting D=0 and D̄=1. In addition to
storing an “X”, the cell allows searching for an “X” by setting both SL and SL̄
to logic “0”. This is an external don't care that forces a match of a bit
regardless of the stored bit. Although storing an “X” is possible only in
ternary CAMs, an external “X” symbol is possible in both binary and ternary
CAMs. In cases where ternary operation is needed but only binary CAMs are
available, it is possible to emulate ternary operation using two binary cells
per ternary symbol.
TABLE I
Ternary encoding for the NOR cell
Stored value    D    D̄
“0”             0    1
“1”             1    0
“X”             1    1
(disallowed)    0    0
As a modification to the ternary NOR cell of Fig. 2.5(a), Roth et al. propose
implementing the pull down transistors M1—M4 using p-MOS devices and
complementing the logic levels of the search lines and match lines
accordingly. Using p-MOS transistors (instead of n-MOS transistors) for the
comparison circuitry allows for a more compact layout, due to reducing the
required spacing between p-diffusions and n-diffusions in the cell. In addition to
increased density, the smaller area of the cell reduces wiring capacitance
and therefore reduces power consumption. The tradeoff that results from
using minimum-size p-MOS transistors, rather than minimum-size n-MOS
transistors, is that the pull down path will have a higher equivalent
resistance, slowing down the search operation.
A NAND cell can be modified for ternary storage by adding storage for a
mask bit at node M, as depicted in Fig. 2.5(b). When storing an “X”, we set
this mask bit to “1”. This forces the mask transistor ON, regardless of the value
of D, ensuring that the cell always matches. In addition to storing an “X”, the
cell allows searching for an “X” by setting both SL and SL̄ to logic “1”. Table
II lists the stored encoding and search-bit encoding for the ternary NAND
cell.
TABLE II
Further minor modifications to CAM cells include mixing parts of the NAND
and NOR cells, using dynamic-threshold techniques in silicon-on-insulator
(SOI) processes, and alternating the logic level of the pull down path to
ground in the NOR cell.
Fig. 2.5: Ternary core cells for (a) NOR-type CAM and (b) NAND-type CAM
Currently, the NOR cell and the NAND cell are the prevalent core cells for
providing storage and comparison circuitry in CMOS CAMs.
2.4.5 Match line Structures:
We now demonstrate the NAND cell and NOR cell in constructing a CAM
matchline. The matchline is one of the two key structures in CAMs.
NOR Matchline: Fig. 2.6 depicts, in schematic form, how NOR cells are
connected in parallel to form a NOR matchline, ML. While we show binary
cells in the figure, the description of matchline operation applies to both
binary and ternary CAM.
Fig. 2.6: Structure of a NOR match line with n cells.
Transistor Mpre pre-charges the match line and the MLSA evaluates the
state of the match line, generating the match result.
A typical NOR search cycle operates in three phases: search line pre-
charge, match line pre-charge, and match line evaluation. First, the search
lines are pre-charged low to disconnect the match lines from ground by
disabling the pull down paths in each CAM cell. Second, with the pull down
paths disconnected, the Mpre transistor pre-charges the match lines high.
Finally, the search lines are driven to the search word values, triggering the
match line evaluation phase. In the case of a match, the ML voltage, VML ,
stays high as there is no discharge path to ground. In the case of a miss,
there is at least one path to ground that discharges the match line. The
match line sense amplifier (MLSA) senses the voltage on ML, and generates
a corresponding full-rail output match result.
The main feature of the NOR match line is its high speed of operation. In the
slowest case of a one-bit miss in a word, the critical evaluation path is
through the two series transistors in the cell that form the pull down path.
Even in this worst case, NOR-cell evaluation is faster than the NAND case,
where between 8 and 16 transistors form the evaluation path.
NAND Match line: Fig. 2.7 shows the structure of the NAND match line. A
number of cells, n, are cascaded to form the match line (this is, in fact, a
match node, but for consistency we will refer to it as ML). For the purpose of
explanation, we use the binary version of the NAND cell, but the same
description applies to the case of a ternary cell.
Fig. 2.7 : NAND match line structure with pre-charge and evaluate
transistors.
On the right of the figure, the pre-charge p-MOS transistor, Mpre, sets the
initial voltage of the match line, ML, to the supply voltage, VDD. Next, the
evaluation n-MOS transistor, Meval, turns ON. In the case of a match, all n-
MOS transistors M1 through Mn are ON, effectively creating a path to
ground from the ML node, hence discharging ML to ground. In the case of a
miss, at least one of the series n-MOS transistors, M1 through, Mn is OFF,
leaving the ML voltage high. A sense amplifier, MLSA, detects the difference
between the match (low) voltage and the miss (high) voltage. The NAND
match line has an explicit evaluation transistor, Meval, unlike the NOR match
line, where the CAM cells themselves perform the evaluation.
There is a potential charge-sharing problem in the NAND match line. Charge
sharing can occur between the ML node and the intermediate MLi nodes.
For example, in the case where all bits match except for the leftmost bit in
Fig. 2.7, during evaluation there is charge sharing between the ML node and
nodes ML1 through MLn-1. This charge sharing may cause the ML node
voltage to drop sufficiently low such that the MLSA detects a false match. A
technique that eliminates charge sharing is to pre-charge high, in addition to
ML, the intermediate match nodes ML1 through MLn-1. This is accomplished
by setting the search lines, SL1 – SLn, and their complements, SL̄1 – SL̄n,
to VDD, which forces all transistors in the chain, M1–Mn, to turn ON and
pre-charge the intermediate nodes. When this pre-charge of the intermediate
match nodes is complete, the search lines are set to the data values
corresponding to the incoming search word. This procedure eliminates
charge sharing, since the intermediate match nodes and the ML node are
initially shorted.
A feature of the NAND match line is that a miss stops signal propagation
such that there is no consumption of power past the final matching transistor
in the serial n-MOS chain. Typically, only one match line is in the match
state, consequently most match lines have only a small number of
transistors in the chain that are ON and thus only a small amount of power is
consumed. Two drawbacks of the NAND match line are a quadratic delay
dependence on the number of cells, and a low noise margin. The quadratic
delay-dependence comes from the fact that adding a NAND cell to a NAND
match line adds both a series resistance due to the series n-MOS transistor
and a capacitance to ground due to the n-MOS diffusion capacitance. These
elements form an RC ladder structure whose overall time constant has a
quadratic dependence on the number of NAND cells. Most implementations
limit the number of cells on a NAND match line to 8 to 16 in order to limit the
quadratic degradation in speed.
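The quadratic dependence can be made explicit with a first-order Elmore
(RC-ladder) estimate. Assuming, purely for illustration, that each of the n
cells contributes an identical series resistance R and node capacitance C:

    t_ML ≈ Σ (i = 1..n) C · (i·R) = R·C · n(n+1)/2, i.e., O(n²)

Doubling the number of cells on the match line thus roughly quadruples the
evaluation delay, consistent with the 8-to-16 cell limit noted above.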
The low noise margin is caused by the use of n-MOS pass transistors for the
comparison circuitry. Since the gate voltage of the NAND match line
transistors (M1 through Mn) when conducting, in Fig. 2.7, is VDD − Vtn, the
highest voltage that is passed on the match line is VDD − 2Vtn (where Vtn
is the threshold voltage of the n-MOS transistor, augmented by the body
effect).
all CAM cell transistors when conducting. One implementation of a NAND-
based CAM reclaims some noise margin by employing the bootstrap effect
by reversing the polarity of the match line pre-charge and evaluate.
2.5 A Reconfigurable Content Addressable Memory:
Content Addressable Memories or CAMs are popular parallel matching
circuits. They provide the capability, in hardware, to search a table of data
for a matching entry. This functionality is a high performance alternative to
popular software-based searching schemes. CAMs are typically found in
embedded circuitry where fast matching is essential.
Content Addressable Memories or CAMs are a class of parallel pattern
matching circuits. In one mode, these circuits operate like standard memory
circuits and may be used to store binary data. Unlike standard memory
circuits, however, a powerful match mode is also available. This match mode
permits all of the data in the CAM device to be searched in parallel. While
CAM hardware has been available for decades, its use has typically been in
niche applications, embedded in custom designs. Perhaps the most popular
application has been in cache controllers for central processing units. Here
CAMs are often used to search cache tags in parallel to determine if a cache
“hit" or “miss" has occurred. Clearly in this application performance is crucial
and parallel search hardware such as a CAM can be used to good effect.
A second and more recent use of CAM hardware is in the networking area.
As data packets arrive into a network router, processing of these packets
typically depends on the network destination address of the packet. Because
of the large number of potential addresses, and the increasing performance
demands, CAMs are beginning to become popular in processing network
address information.
2.5.1 A Standard CAM Implementation:
CAM circuits are similar in structure to traditional Random Access Memory
(RAM) circuits, in that data may be written to and read from the device. In
addition to functioning as a standard memory device, CAMs have an
additional parallel search or match mode. The entire memory array can be
searched in parallel using hardware. In this match mode, each memory cell
in the array is accessed in parallel and compared to some value. If this value
is found in any of the memory locations, a match signal is generated.
In some implementations, all that is significant is that a match for the data is
found. In other cases, it is desirable to know exactly where in the memory
address space this data was located. Rather than producing a simple
“match" signal, some CAM implementations also supply the address of the
matching data. In some sense, this provides a functionality opposite of a
standard RAM. In a standard RAM, addresses are supplied to hardware and
data at that address is returned. In a CAM, data is presented to the
hardware and an address returned.
At a lower level, the actual transistor implementation of a CAM circuit is very
similar to a standard static RAM. Figure 2.8 shows transistor level diagrams
of both
CMOS RAM and CAM circuits. The circuits are almost identical, except for
the addition of the match transistors to provide the parallel search capability.
Fig. 2.8. RAM and CAM transistor level circuits.
In a CMOS static RAM circuit, as well as in the CAM cell, data is accessed
via the BIT lines and the cells selected via the WORD lines. In the CAM cell,
however, the match mode is somewhat different. Inverted data is placed on
the BIT lines. If any cell contains data which does not match, the MATCH
line is pulled low, indicating that no match has occurred in the array.
Clearly this transistor level implementation is efficient and may be used to
produce CAM circuits which are nearly as dense as comparable static RAM
circuits. Unfortunately, such transistor level circuits cannot be implemented
using standard programmable logic devices.
2.5.2 An FPGA CAM Implementation:
Of course, a content addressable memory is just a digital circuit, and as
such may be implemented in an FPGA. The general approach is to provide
an array of registers to hold the data, and then use some collection of
comparators to see if a match has occurred. While this is a viable solution, it
suffers from the same sort of inefficiencies that plague FPGA-based RAM
implementations. Like RAM, the CAM is efficiently implemented at the
transistor level. Using gate level logic, particularly programmable or
reconfigurable logic, often results in a substantial penalty, primarily in size.
Because the FPGA CAM implementation relies on flip-flops as the data
storage elements, the size of the circuit is restricted by the number of
flip-flops in the device. While this is adequate for smaller CAM designs, larger
CAMs quickly deplete the resources of even the largest available FPGA.
2.5.3 The Reconfigurable Content Addressable Memory
(RCAM):
The Reconfigurable Content Addressable Memory or RCAM makes use of
runtime reconfiguration to efficiently implement a CAM circuit. Rather than
using the FPGA flip-flops to store the data to be matched, the RCAM uses
the FPGA Look Up Tables or LUTs. Using LUTs rather than flip-flops results
in a smaller, faster CAM.
The approach uses the LUT to provide a small piece of CAM functionality. In
Figure 2.9, a LUT is loaded with data which provides a “match 5"
functionality.
Fig. 2.9. Using a LUT to match 5.
Because a LUT can be used to implement any function of N variables, it is
also possible to provide more flexible matching schemes than the simple
match described in the circuit in Figure 2.9. In Figure 2.10, the LUT is loaded
with values which produce a match on any value but binary “4". This circuit
demonstrates the ability to embed a mask in the configuration of a LUT,
permitting arbitrary disjoint sets of values to be matched, within the LUT.
This function is important in many matching applications, particularly
networking.
Fig. 2.10. Using a LUT to match all inputs except 4.
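The LUT contents for such matchers are just truth tables. A short Python
sketch (our model of a generic 4-input LUT, not any vendor API) shows both
the “match 5" LUT of Fig. 2.9 and the “match all but 4" LUT of Fig. 2.10:

    def make_match_lut(accepted):
        """Build the 16-entry truth table of a 4-input LUT that outputs 1
        for every accepted 4-bit input value."""
        return [1 if v in accepted else 0 for v in range(16)]

    match5 = make_match_lut({5})                       # Fig. 2.9: match only 5
    all_but_4 = make_match_lut(set(range(16)) - {4})   # Fig. 2.10: match all but 4
    print(match5[5], match5[4])          # 1 0
    print(all_but_4[4], all_but_4[7])    # 0 1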
This approach can be used to provide matching circuits such as match all or
match none or any combination of possible LUT values. Note again, that this
arbitrary masking only applies to a single LUT. When combining LUTs to
make larger CAMs, the ability to perform such masking becomes more
restricted.
While using LUTs to perform matching is a powerful approach, it is
somewhat limited when used with traditional design tools. With schematics
and HDLs, the LUT contents may be specified, albeit with some difficulty.
And once specified, modifying these LUTs is difficult or impossible.
However, modification of FPGA circuitry at run-time is possible using a run-
time reconfiguration tool such as JBits. JBits permits LUT values, as well as
other parts of the FPGA circuit, to be modified arbitrarily at run time and in-
system. An Application Program Interface (API) into the FPGA configuration
permits LUTs, for instance, to be modified with a single function call. This,
combined with the partial reconfiguration capabilities of new FPGA devices
such as Virtex (tm), permits the LUTs used to build the RCAM to be easily
modified under software control, without disturbing the rest of the circuit.
Finally, using run-time reconfiguration software such as JBits, RCAM
circuits may be dynamically sized, even at run-time. This opens the
possibility of not only changing the contents of the RCAM during operation,
but actually changing the size and shape of the RCAM circuit itself. This
results in a situation analogous to dynamic memory allocation in RAM. It is
possible to “allocate" and “free" CAM resources as needed by the
application.
2.5.4 An RCAM Example:
One currently popular use for CAMs is in networking. Here data must be
processed under demanding real-time constraints. As packets arrive, their
routing information must be processed. In particular, destination addresses,
typically in the form of 32-bit Internet Protocol (IP) addresses must be
classified. This typically involves some type of search.
Current software based approaches rely on standard search schemes such
as hashing. While effective, this approach requires a powerful processor to
keep up with the real-time demands of the network. Offloading the
computationally demanding matching portion of the algorithms to external
hardware permits less powerful processors to be used in the system. This
results in savings not only in the cost of the processor itself, but in other
areas such as power consumption and overall system cost.
In addition, an external CAM provides networking hardware with the ability to
achieve packet processing in essentially constant time. Provided all
elements to be matched fit in the CAM circuit, the time taken to match is
independent of the number of items being matched. This provides good
scalability properties.
Other software-based matching schemes such as hashing are data-
dependent and may not meet real-time constraints, depending on complex
interactions between the hashing algorithm and the data being processed.
CAMs suffer no such limitations and permit easy analysis and verification.
Fig. 2.11. Matching a 32-bit IP header.
Figure 2.11 shows an example of an IP Match circuit constructed using the
RCAM approach. Note that this example assumes a basic 4-input LUT
structure for simplicity. Other optimizations, including using special-purpose
hardware such as carry chains are possible and may result in substantial
circuit area savings and clock speed increases.
This circuit requires one LUT input per matched bit. In the case of a 32-bit IP
address, this circuit requires 8 LUTs to provide the matching, and three
additional 4-input LUTs to provide the AND-ing for the MATCH signal. This
basic 32-bit matching block may be replicated in an array to produce the
CAM circuit. Again, note that other non-LUT implementations for
generating the MATCH circuit are possible.
Since the LUTs can be used to mask the matching data, it is possible to put
in “match all" conditions by setting the LUTs to all ones. Other more
complicated masking is possible, but typically only using groups of four
inputs. While this does not provide for the most general case, it appears to
cover the popular modes of matching.
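The decomposition described above can be sketched as follows (helper names
are hypothetical; a real design would map this onto FPGA LUTs via
reconfiguration rather than Python lists):

    def make_ip_matcher(ip, wildcard_nibbles=()):
        """Split a 32-bit address into eight 4-bit nibbles, one LUT each.
        A nibble listed in wildcard_nibbles gets an all-ones LUT ('match all')."""
        luts = []
        for n in range(8):
            nibble = (ip >> (4 * n)) & 0xF
            if n in wildcard_nibbles:
                luts.append([1] * 16)                  # match-all mask
            else:
                luts.append([1 if v == nibble else 0 for v in range(16)])
        return luts

    def ip_match(luts, key):
        # AND together the eight LUT outputs (three more 4-input LUTs in hardware)
        return all(lut[(key >> (4 * n)) & 0xF] for n, lut in enumerate(luts))

    luts = make_ip_matcher(0xC0A80001)                 # 192.168.0.1
    print(ip_match(luts, 0xC0A80001))                  # True
    print(ip_match(luts, 0xC0A80002))                  # False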
2.5.5 System Issues:
The use of run-time reconfiguration to construct, program and reprogram the
RCAM results in some significant overall system savings. In general, both
the hardware and the software are greatly simplified. Most of the savings
accrue from being able to directly reconfigure the LUTs, rather than having
to write them directly as in standard RAM circuits. Reconfiguration rather
than direct access to the stored CAM data first eliminates the entire read /
write access circuitry. This includes the decode logic to decode each
address, the wiring necessary to broadcast these addresses, the data
busses for reading and writing the data, and the IOBs used to communicate
with external hardware.
It should be pointed out that this interface portion of the circuitry is
substantial, in both its size and complexity. Busses typically consume tri-state
lines, which are often scarce. Depending on the addressing scheme, tens of
IOBs will necessarily be consumed. These also tend to be valuable
resources. The address decoders are also somewhat problematic circuits
and often require special purpose logic to be implemented efficiently. In
addition, the bus interface is typically the most timing sensitive portion of the
circuit and requires careful design and simulation. This is eliminated with the
use of run-time reconfiguration.
Finally, the system software is simplified. In a standard bus interface
approach, device drivers and libraries must be written, debugged and
maintained to access the CAM. And when the system software or processor
changes, this software must be ported to the new platform. With the RCAM,
all interfacing is performed through the existing configuration port, at no
additional overhead.
The cost of using the configuration port rather than direct hardware access is
primarily one of setup speed. Direct writes can typically be done in some
small number of system cycles. Reconfiguration of the RCAM to update
table entries may take substantially longer, depending on the
implementation. Partial reconfiguration in devices such as Virtex permits
changes to be made more rapidly than in older bulk-configuration devices, but
the speed may be orders of magnitude slower than direct hardware
approaches. Clearly the RCAM approach favors applications with slowly
changing data sets. Fortunately, many applications appear to fit into this
category.
2.5.6 Associative Processing:
Today, advances in circuit technology permit large CAM circuits to be built.
However, uses for CAM circuits are not necessarily limited to niche
applications like cache controllers or network routers. Any application which
relies on the searching of data can benefit from a CAM-based approach. A
short list of potential application areas that can benefit from fast
matching includes Artificial Intelligence, Database Search, Computer Aided
Design, Graphics Acceleration, and Computer Vision.
Much of the work in using parallel matching hardware to accelerate
algorithms was carried out in the 1960s and 1970s, when several large
parallel matching machines were constructed. With the rapid growth both in
size and speed of traditional processors in the intervening years, much of
the interest in CAMs has faded. However, as real-time constraints in areas
such as networking become impossible to meet with traditional processors,
solutions such as CAM-based parallel search will almost certainly become
more prevalent.
In addition, the use of parallel matching hardware in the form of CAMs can
provide another more practical benefit. For many applications, the use of
CAM-based parallel search can offload much of the work done by the
system processor. This should permit smaller, cheaper and lower power
processors to be used in embedded applications which can make use of
CAM-based parallel search.
The RCAM is a flexible, cost-effective alternative to existing CAMs. By using
FPGA technology and run-time reconfiguration, fast, dense CAM circuits can
be easily constructed, even at run-time.
In addition, the size of the RCAM may be tailored to a particular hardware
design, or even temporary changes in the system. This flexibility is not
available in other CAM solutions. In addition, the RCAM need not be a
stand-alone implementation. Because the RCAM is entirely a software
solution using state-of-the-art FPGA hardware, it is quite easy to embed
RCAM functionality in larger FPGA designs.
Finally, we believe that existing applications, primarily in the field of network
routing, are just the beginning of RCAM usage. Once other applications
realize that simple, fast, flexible parallel matching is available, it is likely that
other applications and algorithms will be accelerated using this approach.
2.6 Difference between CAM and RAM:
Since CAM is an outgrowth of Random Access Memory (RAM) technology,
in order to understand CAM, it helps to contrast it with RAM. A RAM is an
integrated circuit that stores data temporarily. Data is stored in a RAM at a
particular location, called an address. In a RAM, the user supplies the
address, and gets back the data. The number of address lines limits the depth
of a memory using RAM, but the width of the memory can be extended as
far as desired. With CAM, the user supplies the data and gets back the
address. The CAM searches through the memory in one clock cycle and
returns the address where the data is found. The CAM can be preloaded at
device startup and also be rewritten during device operation. Because the
CAM does not need address lines to find data, the depth of a memory
system using CAM can be extended as far as desired, but the width is
limited by the physical size of the memory. Thus, a CAM is the hardware
embodiment of what in software terms would be called an associative array.
CAM can be used to accelerate any application requiring fast searches of
databases, lists, or patterns, such as in image or voice recognition, or
computer and communication designs.
For this reason, CAM is used in applications where search time is very
critical and must be very short. For example, the search key could be the IP
address of a network user, and the associated information could be user’s
access privileges and his location on the network. If the search key
presented to the CAM is present in the CAM’s table, the CAM indicates a
‘match’ and returns the associated information, which is the user’s privileges.
A CAM can thus operate as a data-parallel or Single Instruction/Multiple
Data (SIMD) processor.
Read operation in traditional memory:
o Input is the address of the content we are interested in.
o Output is the content of that address.
o Depth is limited; width can be extended.
In CAM it is the reverse:
o Input is data associated with something stored in the memory.
o Output is the location where the associated content is stored.
o Width is limited; depth can be extended.
2.7 Applications:
Content addressable memory (CAM) is frequently used in applications that
require high-speed searches. A few examples are:
LAN bridges/switches and routers.
Asynchronous transfer mode (ATM).
Communication networks.
Lookup tables.
Tag directories.
Database engines.
Data compression hardware.
Artificial neural networks.
CPU cache controllers.
Translation lookaside buffers (TLBs).
Single Instruction/Multiple Data (SIMD) processors.
CHAPTER 3
LOW POWER PB-CAM
WHAT IS PB-CAM?
A general CAM architecture usually consists of data memory with a valid bit
field, an address decoder, a bit line pre-charger, word match circuits, and an
address priority encoder. The memory organization of the traditional CAM
consists of the data memory and the valid bit field, where the valid bit field
indicates the availability of the stored data. In the data searching operation,
the input data is sent to the CAM to be compared with all valid data stored
in the CAM simultaneously, and the address of a matching entry is sent to the
output. In this architecture, the CAM circuit performs a large number of
comparison operations over all valid data stored in the CAM during each
data searching operation, and these comparisons consume most of the total
CAM power. To minimize the power consumed during comparison, one of the
best approaches is to eliminate most of the comparison operations. Based on
this idea, a novel architecture for low-power CAM circuit design, called
PB-CAM, was developed.
To motivate the proposed low-power PB-CAM architecture, we first introduce
its design concept. The memory organization of the PB-CAM architecture,
shown in Fig. 3.1, is composed of the data memory, the parameter memory,
and the parameter extractor. In the data writing operation, the parameter
extractor extracts the parameter of the input data, and the input data and its
parameter are then stored in the data memory and the parameter memory,
respectively. In the data searching operation, in order to reduce the large
number of comparison operations, the operation is separated into two
comparison processes. In the first comparison process, the parameter
extractor extracts the parameter of the input data, and the parameter
comparison circuits then compare this parameter with all parameters stored
in the parameter memory in parallel. If a stored parameter mismatches the
parameter of the input data, the data related to that stored parameter
necessarily mismatches the input data as well. Otherwise, the data related to
that stored parameter has yet to be identified. Using the results of the first
comparison process, the input data is compared only with the unidentified
data in the second comparison process to identify any match. With these two
comparison processes, if the majority of the stored parameters mismatch the
parameter of the input data, the number of comparisons in the second
comparison process is greatly reduced. The parameter comparison process
thus acts like a filter: it filters out the majority of unmatched data in the first
comparison process and thereby eliminates most of the comparisons in the
second comparison process.
Fig: General Scheme for CAM Architecture
In this report, the parameter comparison process is also referred to as the
pre-computation process. Although the data searching operation uses two
comparison processes to identify a match, both comparison processes are
performed in parallel to improve the data searching speed.
Content-addressable memory is frequently used in applications that require
high-speed searches because its parallel comparison reduces search time;
however, that parallelism also significantly increases power consumption.
The main CAM design challenge is therefore to reduce the power
consumption associated with the large amount of parallel active circuitry,
without sacrificing speed or memory density.
3.1 Power saving CAM architecture:
An architectural technique for saving power, applicable to binary CAM, is
pre-computation. Pre-computation stores some extra information along with
each word, and this information is used in the search operation to save
power. The extra bits are derived from the stored word and used in an initial
search before searching the main word. If this initial search fails, the CAM
aborts the subsequent search, thus saving power.
3.2 PB-CAM Architecture:
Fig. 3.1 shows the memory organization of the PB-CAM architecture, which
consists of the data memory, the parameter memory, and the parameter
extractor, where the parameter length k is much smaller than the data length
n (k << n). To reduce the massive number of comparison operations required
by data searches, the operation is divided into two parts. In the first part, the
parameter extractor extracts a parameter from the input data, which is then
compared in parallel with the parameters stored in the parameter memory. If
no match is returned for a stored parameter in the first part, the input data
cannot match the data related to that parameter. Otherwise, the data related
to the matching parameters must be compared in the second part. It should
be noted that although the first part must access the entire parameter
memory, the parameter memory is far smaller than the data memory of the
CAM. Moreover, since the comparisons made in the first part have already
filtered out the unmatched data, the second part only needs to compare the
data that matched in the first part.
Fig. 3.1: Memory organization of the PB-CAM architecture
The PB-CAM exploits this characteristic to reduce comparison operations,
thereby saving power. The parameter extractor is therefore critical, since it
determines the number of comparison operations required in the second
part. Its design goal is to filter out as many unmatched data as possible,
minimizing the number of comparison operations in the second part. Two
parameter extractors are discussed in the following sections: the ones-count
parameter extractor and the Block-XOR parameter extractor.
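The two-part search can be summarized in a few lines of Python (a behavioral
sketch with our own names; the extractor is passed in as a function so that
either variant can be plugged in):

    def pb_cam_search(data_mem, param_mem, key, extract):
        """Two-phase PB-CAM search: compare k-bit parameters first, then
        compare the full n-bit key only against the surviving candidates."""
        key_param = extract(key)
        candidates = [a for a, p in enumerate(param_mem)
                      if p == key_param]                       # first comparison
        return [a for a in candidates if data_mem[a] == key]   # second comparison

    def ones_count(w):
        return bin(w).count("1")

    data = [0b01001101, 0b11110000, 0b00000001]
    params = [ones_count(w) for w in data]
    print(pb_cam_search(data, params, 0b01001101, ones_count))  # [0]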
3.3 One’s count approach:
For ones count approach, with an n-bit data length, there are n+1 types of
one’s count (from 0 ones to n ones count). Further, it is necessary to add an
extra type of one’s count to indicate the availability of stored data. Therefore,
the minimal bit length of the parameter is equal to log (n+ 2). The below fig 5
shows the conceptual view of one’s count approach. The extra information
holds the number of ones in the stored word.
Fig. 3.2: Conceptual view of the ones-count approach
For example, in Fig. 3.2, when searching for the data word 01001101, the
pre-computation circuit counts the number of ones (four in this case). The
count of four is compared on the left-hand side with the stored ones counts.
Only match lines PML5 and PML7 match, since only they have a ones count
of four. In the data-memory stage in Fig. 3.2, only two comparisons actively
consume power, and only match line PML5 results in a match. The 14-bit
ones-count parameter extractor is implemented with full adders, as shown in
Fig. 3.3.
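Behaviorally, the full-adder tree of Fig. 3.3 simply counts set bits. A
one-function sketch (our model, not the circuit itself):

    def ones_count_param(word, n_bits=14):
        """Ones-count parameter extractor: the parameter is the number of
        '1' bits, computed by a full-adder tree in the actual circuit."""
        return sum((word >> i) & 1 for i in range(n_bits))

    print(ones_count_param(0b01001101))  # 4, matching the example above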
3.3.1 Mathematical Analysis:
Fig. 3.3: 14-bit ones-count parameter extractor
For 14-bit input data there are 2^14 possible input words, and the number of
inputs mapping to the same parameter under the ones-count approach is
C(14, n), where n is the ones count (from 0 to 14). The average probability
that a parameter occurs can then be determined by

P(n) = C(14, n) / 2^14 × 100%.
TABLE III
Number of data related to the same parameter and average probabilities for
the ones-count approach
Parameter (ones count)   Number of data C(14, n)   Average probability
0                        1                         0.006%
1                        14                        0.09%
2                        91                        0.56%
3                        364                       2.22%
4                        1001                      6.11%
5                        2002                      12.22%
6                        3003                      18.33%
7                        3432                      20.95%
8                        3003                      18.33%
9                        2002                      12.22%
10                       1001                      6.11%
11                       364                       2.22%
12                       91                        0.56%
13                       14                        0.09%
14                       1                         0.006%
Table III lists the number of data related to the same parameter and their
average probabilities for 14-bit input data. For example, if a match occurs in
the first part of the comparison with the parameter 2, the maximum number
of required comparison operations in the second part is C(14, 2) = 91. With
conventional CAMs, the comparison circuit must compare all stored data,
whereas with the ones-count PB-CAM a large amount of unmatched data is
filtered out up front, reducing comparison operations and power consumption
in some cases. However, the average probabilities of some parameters, such
as 0, 1, 2, 12, 13, and 14, are less than 1%.
In Table III, the parameters requiring over 2000 comparison operations lie between 5 and 9, and the sum of the average probabilities for these parameters is close to 82%. Thus, although the ones-count PB-CAM requires fewer comparison operations than a conventional CAM, it fails to reduce the number of second-part comparisons precisely when the parameter value is between 5 and 9, and therefore still consumes a large amount of power in most searches. From Table III we can also see that random input patterns give the ones-count parameter a Gaussian (binomial) distribution, and this distribution limits any further reduction of the comparison operations in PB-CAMs.
3.4 Block-XOR Approach:
The key idea behind this method is to reduce the number of comparison
operations by eliminating the Gaussian distribution. For 14-bit input data, if the input words could be distributed uniformly over the parameters, then the number of input words related to each parameter, and hence the maximum number of comparison operations required in the second part, would be ⌈2^14/15⌉ = 1093. Compared with the ones-count approach, this would reduce the comparison operations by between 909 and 2339 (i.e., for parameter values from 5 to 9) in 82% of the cases. Based on these
observations, a new parameter extractor called Block-XOR, which is shown
in Fig.3.4, is used to achieve the previous requirement.
In this approach, we first partition the input data bits into several blocks, and for each block an output bit is computed using an XOR operation. The output bits are then combined to form the parameter for the second part of the comparison process. For a fair comparison with the ones-count approach, we set the bit length of the parameter to ⌈log₂(n+2)⌉, where n is the bit length of the input data.
Fig. 3.4: Concept of the n-bit Block-XOR block diagram.
Therefore, the number of blocks in this approach is ⌈n/⌈log₂(n+2)⌉⌉. Taking the 14-bit input length as an example, the bit length of the parameter is ⌈log₂(14+2)⌉ = 4 bits, and the number of blocks is ⌈14/4⌉ = 4. Accordingly, every block contains 4 bits except the last one, which contains the remaining 2 bits, as shown in the upper part of Fig. 3.5.
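The block arithmetic for the 14-bit case can be summarized as

    \lceil \log_2(14+2) \rceil = 4 \text{ parameter bits}, \qquad \lceil 14/4 \rceil = 4 \text{ blocks } (4 + 4 + 4 + 2 = 14 \text{ bits}).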
Fig. 3.5: Structure of the Block-XOR approach with valid bit.
The select signal is defined as

    S = A_3 \cdot A_2 \cdot A_1 \cdot A_0        (2)
According to (2), if the parameter is in the range "0000" to "1110" (S = '0'), the multiplexer transmits the original parameter unchanged. Otherwise (A3A2A1A0 = "1111", S = '1'), the first block of the input data becomes the new parameter, and "1111" can then be used as the valid bit. The case where the first block is itself "1111" need not be considered, because the XOR of a "1111" block produces a '0' in the corresponding parameter bit, so the raw parameter can never be "1111" in that case.
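The xor1 entity listed in Chapter 5 computes only the raw Block-XOR parameter; the valid-bit multiplexer of Fig. 3.5 can be appended as in the following sketch (the entity and signal names are ours, added only to make the selection logic of (2) concrete):

-- Illustrative sketch: Block-XOR parameter with valid-bit substitution.
library ieee;
use ieee.std_logic_1164.all;

entity xor1_vb is
  port (
    i : in  std_logic_vector(13 downto 0);
    s : out std_logic_vector(3 downto 0)
  );
end xor1_vb;

architecture beh of xor1_vb is
  signal p  : std_logic_vector(3 downto 0);  -- raw Block-XOR parameter
  signal se : std_logic;                     -- select signal S of (2)
begin
  -- raw parameter, block widths 4/4/4/2 as in Fig. 3.5
  p(3) <= i(13) xor i(12) xor i(11) xor i(10);
  p(2) <= i(9)  xor i(8)  xor i(7)  xor i(6);
  p(1) <= i(5)  xor i(4)  xor i(3)  xor i(2);
  p(0) <= i(1)  xor i(0);
  -- "1111" is reserved as the valid code; substitute the first block.
  -- If the first block were "1111", its XOR would force p(3) = '0',
  -- so the substitution can never reproduce the reserved code.
  se <= p(3) and p(2) and p(1) and p(0);
  s  <= i(13 downto 10) when se = '1' else p;
end beh;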
3.4.1 Mathematical Analysis:
The concept of the Block-XOR approach is to distribute the parameters uniformly over the input data. By the rule of product, the number of input words that produce the same parameter (without the valid bit) is 8 × 8 × 8 × 2 = 1024. Consequently, the average probability is 1024/(1024 × 16) × 100% = 6.25%, and the maximum number of comparison operations in the second part is 1024 for each parameter. The Block-XOR approach thus clearly reduces the comparison operations and hence the power consumption.
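The rule-of-product count follows because, for any fixed output bit, exactly half of each block's input combinations produce it:

    \frac{2^4}{2} \times \frac{2^4}{2} \times \frac{2^4}{2} \times \frac{2^2}{2} = 8 \times 8 \times 8 \times 2 = 1024 = \frac{2^{14}}{16}.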
Table IV lists the number of input words that produce the same parameter for the proposed Block-XOR PB-CAM (i.e., with the valid bit). When the parameter is "1111", the new parameter is provided by the first block with an output bit of '1', so the number of input words for those parameters is 1024 + (1024/8) = 1152, and the average probability is (1152/(1024 × 7 + 1152 × 8)) × 100% = 7.03%. As can be seen from Tables III and IV, the Block-XOR PB-CAM results in at least 850 fewer comparison operations in 82% of the cases.
TABLE IV
NUMBER OF DATA RELATED TO THE SAME PARAMETER AND AVERAGE PROBABILITIES FOR THE BLOCK-XOR APPROACH
In other words, in most cases the Block-XOR PB-CAM requires far fewer comparison operations than the ones-count approach for parameter values between 5 and 9. For example, when the parameter is 7, the proposed Block-XOR PB-CAM requires 2284 fewer comparison operations than the ones-count approach.
3.5 Comparison Between the Two Approaches:
To eliminate the Gaussian distribution, we distribute the parameters uniformly over the input data. However, as can be seen from Tables III and IV, when the parameter is 0, 1, 2, 3, 4, 10, 11, 12, 13, or 14, the ones-count approach requires fewer comparison operations than the Block-XOR PB-CAM. Although the Block-XOR PB-CAM is better than the ones-count PB-CAM only for parameters between 5 and 9, the probability that these parameters occur is 82%.
For example, when the parameter is 7, there is a 20.95% chance that the Block-XOR PB-CAM saves more than 2280 comparison operations compared with the ones-count approach. Overall, the number of comparison operations is reduced by more than 1000 in most cases; in other words, the ones-count approach is better than the Block-XOR approach in only 18% of the cases.
The number of comparison operations required for input bit lengths of 4, 8, 14, 16, and 32 bits is shown in Fig. 3.6. As can be seen, the Block-XOR PB-CAM becomes more effective at reducing the number of comparison operations as the input bit length increases: the longer the input, the fewer the comparison operations required, and hence the greater the power reduction.
Fig. 3.6: Comparison operations for different input bit lengths.
3.6 Gate-Block Selection Algorithm:
To make the parameter extractor of the Block-XOR PB-CAM more useful for specific data types, we take the different characteristics of logic gates into account and synthesize parameter extractors tailored to different data types. As can be seen in Fig. 3.5, if the number of input bits of each partition block is set to l, the bit length of the parameter (i.e., the number of blocks) is ⌈n/l⌉, where n is the bit length of the input data, and the number of levels in each partition block is ⌈log₂ l⌉. We observe that as the number of input bits per partition block decreases, the mismatch rate and the number of comparison operations in each data comparison decrease, because the number of possible parameter values increases. However, although a longer parameter decreases the mismatch rate and the number of comparison operations, it also enlarges the parameter memory and therefore increases the power the parameter memory consumes. As stated earlier, when the PB-CAM performs a search operation it must compare the entire parameter memory. To avoid wasting a large amount of power in the parameter memory, we set the input of each partition block to 8 bits. Fig. 3.7 shows the proposed parameter extractor architecture. We first partition the input data bits into several blocks; G0~G6 in each block stand for different logic gates, from which an output bit is computed using the synthesized logic operation for each block. The output bits are then combined to form the parameter for the data comparison process.
The objective of our work is to select the proper logic gates in Fig. 3.7 so that the resulting parameter (Pk−1, Pk−2, …, P0) reduces the number of data comparison operations as much as possible.
Fig. 3.7: n-bit block diagram of the proposed parameter extractor
architecture.
In our proposed parameter extractor, the bit length of the parameter is set to ⌈n/8⌉, and the number of levels in each partition block is ⌈log₂ 8⌉ = 3. Suppose that we use the six basic logic gates (AND, OR, XOR, NAND, NOR, and XNOR) to synthesize a parameter extractor for a specific data type. Since each 8-bit block contains seven 2-input gate positions, there are (6^7)^⌈n/8⌉ different logic combinations of the proposed parameter extractor. Obviously, the optimal combination of the parameter extractor cannot be found in polynomial time.
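To see the scale of this search space, consider a 16-bit input, where both partition blocks are full 8-bit blocks:

    (6^7)^{\lceil 16/8 \rceil} = 6^{14} \approx 7.8 \times 10^{10} \text{ combinations},

far too many to enumerate exhaustively.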
To synthesize a proper parameter extractor in polynomial time for a specific data type, we propose a gate-block selection algorithm that finds an approximately optimal combination. The mathematical analysis below illustrates how to select proper logic gates to synthesize a parameter extractor for a specific data type.
3.6.1 Mathematical Analysis:
For a 2-input logic gate, let p be the probability that the output signal Y is in the one state. The probability mass function of the output signal Y is then given by

    P(Y = y) = p^{y} (1 - p)^{1-y}, \quad y \in \{0, 1\}        (3)
Assuming the inputs are independent, if we use any 2-input logic gate as a parameter extractor to generate the parameter for 2-bit data, then the average number of comparison operations the PB-CAM requires in each data search operation can be formulated as

    C_{avg} = \frac{N_0}{N_0 + N_1} \times N_0 + \frac{N_1}{N_0 + N_1} \times N_1        (4)
where N0 is the number of zero entries and N1 the number of one entries among the generated parameters. To illustrate, Table V gives an example.
TABLE V
Suppose that a 2-input AND gate is used to generate the parameter. With the Table V data, the AND gate maps five of the six entries to '0' (N0 = 5, N1 = 1), so the average number of comparison operations in each data search operation for the PB-CAM can be derived as

    C_{avg} = \frac{5}{6} \times 5 + \frac{1}{6} \times 1 = \frac{26}{6} \approx 4.33
In other words, when a 2-input AND gate is used to generate the parameter for this 2-bit data, the average number of comparison operations required for each data search operation in the PB-CAM is 4.33. Using Equation (4), Table V lists the average number of comparison operations for all six basic logic gates. Clearly, OR and NOR are the best selections in this case, because they require the smallest average number of comparison operations (3, corresponding to N0 = N1 = 3). Moreover, complementary gate pairs (AND/NAND, OR/NOR, and XOR/XNOR) produce parameters that require the same average number of comparison operations per search. To reduce the complexity of the proposed algorithm without sacrificing the performance of the parameter extractor, our approach therefore selects only NAND, NOR, and XOR gates when synthesizing the parameter extractor, since NAND and NOR are better than AND and OR in terms of area, power, and speed. Based on this mathematical analysis, Fig. 3.8 shows our proposed gate-block selection algorithm.
Algorithm to Select Proper Logic Gates For Specific Data:
Fig. 3.8: Gate-Block Selection Algorithm.
Note that when the input is random, the synthesized result is the same as the Block-XOR approach. In other words, the Block-XOR approach is a subset of our proposed algorithm.
To better understand the proposed approach, consider the simple example illustrated in Fig. 3.9, where 4-bit data are used as input. Because the input is only 4 bits, we set the number of input bits of each partition block to 4, so the number of levels in each partition block is ⌈log₂ 4⌉ = 2. First, we use the different logic gates (NAND, NOR, and XOR) to generate the parameter for D1D0, and record the generated parameter for each pattern as shown in Fig. 3.9(a).
Fig. 3.9: An example for synthesis of the parameter extractor.
Then, according to Equation (4), we calculate the average number of comparison operations Cavg for each logic gate. The NAND gate is clearly the best selection for D1D0, so it is chosen as part of the parameter extractor. Similarly, a NOR gate is selected to generate the parameter for D3D2 (see Fig. 3.9(b)). At this point the parameter is 2 bits wide, which is greater than ⌈4/4⌉ = 1 (the expected parameter width), so, following the proposed algorithm, we repeat Step 1 to Step 3 to determine the parameter bits for the next level. For Y1Y0, as shown in Fig. 3.9(c), the XOR gate is the best choice. The generated parameter is now only 1 bit, which is no longer greater than ⌈4/4⌉, so the synthesis of the parameter extractor is complete. Finally, Fig. 3.9(d) shows the synthesized parameter extractor for this input data.
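The synthesized extractor of Fig. 3.9(d) is small enough to write out directly; the following sketch (entity name ours) captures the NAND and NOR first-level gates and the XOR second-level gate selected by the algorithm:

-- Parameter extractor synthesized in the Fig. 3.9 example:
-- NAND for D1D0, NOR for D3D2, XOR at the second level.
library ieee;
use ieee.std_logic_1164.all;

entity pe_example is
  port (
    d : in  std_logic_vector(3 downto 0);  -- 4-bit input data D3..D0
    p : out std_logic                      -- 1-bit parameter
  );
end pe_example;

architecture beh of pe_example is
  signal y0, y1 : std_logic;
begin
  y0 <= d(1) nand d(0);  -- level 1: NAND selected for D1D0
  y1 <= d(3) nor  d(2);  -- level 1: NOR selected for D3D2
  p  <= y1 xor y0;       -- level 2: XOR selected for Y1Y0
end beh;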
Chapter 4
TOOLS UTILIZED
4.1 Xilinx ISE:
There are several EDA (Electronic Design Automation) tools available for circuit synthesis, implementation, and simulation using VHDL. Some tools (place and route, for example) are offered as part of a vendor's design suite (e.g., Altera's Quartus II, which allows the synthesis of VHDL code onto Altera's CPLD/FPGA chips, or Xilinx's ISE suite, for Xilinx's CPLD/FPGA chips). Other tools (synthesizers, for example), besides being offered as part of the design suites, can also be provided by specialized EDA companies (Mentor Graphics, Synopsys, Synplicity, etc.). Examples of the latter group are Leonardo Spectrum (a synthesizer from Mentor Graphics), Synplify (a synthesizer from Synplicity), and ModelSim (a simulator from Model Technology, a Mentor Graphics company). The designs presented here were synthesized onto CPLD/FPGA devices from either Altera or Xilinx. The tools used were either ISE combined with ModelSim (for Xilinx chips), MaxPlus II combined with Advanced Synthesis Software, or Quartus II. Leonardo Spectrum was also used occasionally.
Although different EDA tools were used to implement and test the examples presented in the design, we decided to standardize the visual presentation of all simulation graphs. Due to its clean appearance, the waveform editor of MaxPlus II was employed. However, newer simulators like ISE + ModelSim and Quartus II offer a much broader set of features, which allow, for example, a more refined timing analysis. For that reason, those tools were adopted when examining the fine details of each design.
The Xilinx Integrated Software Environment (ISE) is a powerful and complex
set of tools. First, the HDL files are synthesized. Synthesis is the process of
converting behavioral HDL descriptions into a network of logic gates. The
synthesis engine takes as input the HDL design files and a library of
primitives. Primitives are not necessarily just simple logic gates like AND and
OR gates and D-registers, but can also include more complicated things
such as shift registers and arithmetic units. Primitives also include
specialized circuits such as DLLs that cannot be inferred by behavioral HDL
code and must be explicitly instantiated. The libraries guide in the Xilinx documentation provides a complete description of every primitive available in the Xilinx library. (Note that, while there are occasions when it is helpful or even necessary to explicitly instantiate primitives, it is much better design practice to write behavioral code whenever possible, as the short sketch below illustrates.)
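As a minimal illustration of behavioral inference, the following generic description (ours, not part of the project code) is mapped by the synthesis engine to a single D flip-flop without any explicit primitive instantiation:

-- Behavioral description from which the synthesizer infers a D-register.
library ieee;
use ieee.std_logic_1164.all;

entity dreg is
  port (
    clk, d : in  std_logic;
    q      : out std_logic
  );
end dreg;

architecture beh of dreg is
begin
  process (clk)
  begin
    if rising_edge(clk) then  -- clocked assignment infers a flip-flop
      q <= d;
    end if;
  end process;
end beh;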
We will be using the Xilinx-supplied synthesis engine known as XST. XST takes as input a Verilog (.v) or VHDL (.vhd) file and generates a .ngc file. A synthesis report file (.srp) is also generated, which describes the logic inferred for each part of the HDL file and often includes helpful warning messages.
The .ngc file is then converted to an .ngd file. (This step mostly exists to accommodate different design entry methods, such as third-party synthesis tools or direct schematic entry; whatever the entry method, the result is an .ngd file.)
The .ngd file is essentially a netlist of primitive gates, which could be implemented on any one of a number of types of FPGA devices Xilinx manufactures. The next step is to map the primitives onto the types of resources (logic cells, I/O cells, etc.) available in the specific FPGA being targeted. The output of the Xilinx map tool is an .ncd file.
The design is then placed and routed, meaning that the resources described
in the .ncd file are then assigned specific locations on the FPGA, and the
connections between the resources are mapped into the FPGA's interconnect network. The delays associated with interconnect on a large
FPGA can be quite significant, so the place and route process has a large
impact on the speed of the design. The place and route engine attempts to
honor timing constraints that have been added to the design, but if the
constraints are too tight, the engine will give up and generate an
implementation that is functional, but not capable of operating as fast as
desired. Be careful not to assume that just because a design was successfully placed and routed, it will operate at the desired clock rate.
The output of the place and route engine is an updated .ncd file, which
contains all the information necessary to implement the design on the
chosen FPGA. All that remains is to translate the .ncd file into a
configuration bit stream in the format recognized by the FPGA programming
tools. Then the programmer is used to download the design into the FPGA,
or write the appropriate files to a compact flash card, which is then used to
configure the FPGA.
By itself, a Verilog model seldom captures all of the important attributes of a
complete design. Details such as i/o pin mappings and timing constraints
can't be expressed in Verilog, but are nonetheless important considerations
when implementing the model on real hardware. The Xilinx tools allow these constraints to be defined in several places, the two most notable being a separate user constraints file (.ucf) and special comments within the HDL source.
Xilinx has two main FPGA families: the high-performance Virtex series and the high-volume Spartan series, with a lower-cost EasyPath option for ramping to volume production. It also manufactures two CPLD lines, the CoolRunner and the 9500 series. Each model series has been released in multiple generations since its launch.
The latest Virtex-6 and Spartan-6 FPGA families are said to consume 50
percent less power, cost 20 percent less, and have up to twice the logic
capacity of previous generations of FPGAs.
4.1.1 Spartan Family:
The Spartan series targets high-volume applications with a low-power footprint and extreme cost sensitivity, such as displays, set-top boxes, wireless routers, and similar equipment.
The Spartan-6 family is built on a 45-nanometer (nm), 9-metal layer, dual-
oxide process technology. The Spartan-6 was marketed in 2009 as a low-
cost solution for automotive, wireless communications, flat-panel display and
video surveillance applications.
The Spartan-3A consumes 70-90 percent less power in suspend mode and 40-50 percent less static power compared to standard devices. In addition, the dedicated DSP circuitry integrated into the Spartan series has an inherent power advantage of approximately 25 percent over competing low-power FPGAs.
4.1.2 Virtex Family:
The Virtex series of FPGAs targets applications such as wired and wireless infrastructure equipment, advanced medical equipment, test and measurement, and defense systems. In addition to FPGA logic, the Virtex series includes embedded fixed-function hardware for commonly used functions such as multipliers, memories, serial transceivers, and microprocessor cores.
The Virtex-6 family is built on a 40-nm process for compute-intensive electronic systems, and the company claims it consumes 15 percent less power and has 15 percent better performance than competing 40-nm FPGAs. Older-generation devices such as the Virtex, Virtex-II, and Virtex-II Pro are also still available, although their functionality is largely superseded by the Virtex-4 and Virtex-5 FPGA families. The Virtex-II Pro family was the first to combine PowerPC embedded technology (including single and multiple PowerPC 405 processor cores) and integrated serial transceivers (up to 3.125 Gbit/s in Virtex-II Pro and up to 10.3125 Gbit/s in Virtex-II Pro X). The Virtex-4 series was introduced in 2004 and was manufactured on a 1.2-V, 90-nm, triple-oxide process technology. It introduced the Advanced Silicon Modular Block (ASMBL) architecture, enabling FPGA platforms with combinations of features to support logic (LX), embedded processing and connectivity (FX), and digital signal processing (SX).
Chapter 5
VHDL Code for the different parameter extracting
approaches
1).Coding For One’s Count Approach:
CAM BLOCKONES CODING
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;
entity camblockones is
port (
data : in std_logic_vector(13 downto 0);
clk : in std_logic;
we : in std_logic;
wei : in std_logic;
addr : out std_logic_vector(3 downto 0)
);
End camblockones;
Architecture beh of camblockones is
component datamem
port (
clk : in std_logic;
ad : in std_logic_vector(3 downto 0);
--data : inout std_logic_vector(13 downto 0);
data : in std_logic_vector(13 downto 0);
data_out : out std_logic_vector(13 downto 0);
we : in std_logic
);
End component;
component onescnt
port (
i : in std_logic_vector(13 downto 0);
s : out std_logic_vector(3 downto 0)
);
End component;
component cammem
port (
clk : in std_logic; -- Clock Input
-- address : inout std_logic_vector (3 downto 0);
address : in std_logic_vector (3 downto 0);
address_out : out std_logic_vector (3 downto 0);
data : in std_logic_vector (13 downto 0);
we : in std_logic
);
end component;
Signal ex,exi : std_logic_vector(3 downto 0);
Signal dat : std_logic_vector(13 downto 0);
begin
m0 : onescnt port map(data,ex);
m1 : datamem port map(clk,ex,data,dat,we);
m2 : cammem port map (clk,ex,exi,data,wei);
addr <= exi;
end beh;
CAM DATA MEMORY CODING
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;
entity datamem is
port (
clk : in std_logic;
ad : in std_logic_vector(3 downto 0);
--data : inout std_logic_vector(13 downto 0);
data : in std_logic_vector(13 downto 0);
data_out : out std_logic_vector(13 downto 0);
we : in std_logic
);
End datamem;
Architecture beh of datamem is
----------------Internal variables----------------
constant DEPTH :integer := 16;
--signal data_out :std_logic_vector (13 downto 0):=(others=>'0');
type cmem is array (integer range <>)of std_logic_vector (13 downto 0);
signal mem : cmem (0 to 15);
begin
-- data <= data_out when (we = '0') else (others=>'Z');
-- Memory Write Block
-- Write Operation : When we = 1, cs = 1
MEM_WRITE:
process (clk) begin
if (rising_edge(clk)) then
if we = '1' then
mem(conv_integer(ad)) <= data;
end if;
end if;
end process;
-- Memory Read Block
-- Read Operation : When we = 0, oe = 1, cs = 1
MEM_READ:
process (clk) begin
if (rising_edge(clk)) then
if we = '0' then
data_out <= mem(conv_integer(ad));
end if;
end if;
end process;
end beh;
CAM MEMORY CODING
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;
entity cammem is
port (
clk :in std_logic; -- Clock Input
-- address :inout std_logic_vector (3 downto 0);
address :in std_logic_vector (3 downto 0);
address_out :out std_logic_vector (3 downto 0);
data :in std_logic_vector (13 downto 0);
we : in std_logic
);
end cammem;
Architecture beh of cammem is
----------------Internal variables----------------
constant CAM_DEPTH :integer := 2**14;
-- signal address_out :std_logic_vector (3 downto 0):=(others=>'0');
type CAM is array (integer range <>)of std_logic_vector (3 downto 0);
signal mem : CAM (0 to CAM_DEPTH-1);
begin
-- address <= address_out when ( we = '0') else (others=>'Z');
MEM_WRITE:
process (clk) begin
if (rising_edge(clk)) then
if we = '1' then
mem(conv_integer(data)) <= address; -- conversion: data word used as integer index
end if;
end if;
end process;
MEM_READ:
process (clk) begin
if (rising_edge(clk)) then
if we = '0' then
address_out <= mem(conv_integer(data));
end if;
end if;
end process;
end beh;
CAM ONES COUNT CODING
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;
entity onescnt is
port (
i: in std_logic_vector(13 downto 0);
s: out std_logic_vector(3 downto 0)
);
End onescnt;
Architecture beh of onescnt is
component fa
port (
a : in std_logic;
b : in std_logic;
c : in std_logic;
sum : out std_logic;
cout : out std_logic
);
End component;
Signal w : std_logic_vector(29 downto 0);
begin
m0 : fa port map(i(2),i(1),i(0),w(0),w(1));
m1 : fa port map(i(5),i(4),i(3),w(2),w(3));
m2 : fa port map(i(8),i(7),i(6),w(4),w(5));
m3 : fa port map(i(11),i(10),i(9),w(6),w(7));
m4 : fa port map('0',i(13),i(12),w(8),w(9));
m5 : fa port map(w(4),w(2),w(0),w(10),w(11));
m6 : fa port map('0',w(8),w(6),w(12),w(13));
m7 : fa port map(w(3),w(1),'0',w(14),w(15));
m8 : fa port map(w(9),w(7),w(5),w(16),w(17));
m9 : fa port map('0',w(12),w(10),s(0),w(18));
m10 : fa port map(w(13),w(11),'0',w(19),w(20));
m11 : fa port map('0',w(16),w(14),w(21),w(22));
m12 : fa port map(w(17),w(15),'0',w(23),w(24));
m13 : fa port map(w(18),w(21),w(19),s(1),w(25));
m14 : fa port map(w(22),w(23),w(20),w(26),w(27));
m15 : fa port map(w(26),w(25),'0',s(2),w(28));
w(29)<= w(24) or w(27);
s(3)<= w(29) or w(28);
end beh;
FULL ADDER CODING
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_unsigned.all;
entity fa is
port (
a : in std_logic;
b : in std_logic;
c : in std_logic;
sum : out std_logic;
cout : out std_logic);
end fa;
architecture Behavioral of fa is
begin
sum <= (a xor b xor c);
cout <= ((a and b)or (b and c)or(a and c));
end Behavioral;
2).Coding For XOR Approach:
CAM BLOCKXOR CODING
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;
entity camblockxor is
port (
data : in std_logic_vector(13 downto 0);
clk : in std_logic;
we : in std_logic;
wei : in std_logic;
addr : out std_logic_vector(3 downto 0)
);
End camblockxor;
Architecture beh of camblockxor is
component datamem
port (
clk : in std_logic;
ad : in std_logic_vector(3 downto 0);
--data : inout std_logic_vector(13 downto 0);
data : in std_logic_vector(13 downto 0);
data_out : out std_logic_vector(13 downto 0);
we : in std_logic
);
End component;
component xor1
port (
i : in std_logic_vector(13 downto 0);
s : out std_logic_vector(3 downto 0)
);
End component;
component cammem
port (
clk : in std_logic; -- Clock Input
-- address : inout std_logic_vector (3 downto 0);
address : in std_logic_vector (3 downto 0);
address_out : out std_logic_vector (3 downto 0);
data : in std_logic_vector (13 downto 0);
we : in std_logic
);
end component;
Signal ex,exi : std_logic_vector(3 downto 0);
Signal dat : std_logic_vector(13 downto 0);
begin
m0 : xor1 port map(data,ex);
m1 : datamem port map(clk,ex,data,dat,we);
m2 : cammem port map (clk,ex,exi,data,wei);
addr <= exi;
end beh;
CAM DATA MEMORY CODING
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;
entity datamem is
port (
clk : in std_logic;
ad : in std_logic_vector(3 downto 0);
--data : inout std_logic_vector(13 downto 0);
data : in std_logic_vector(13 downto 0);
data_out : out std_logic_vector(13 downto 0);
we : in std_logic
);
End datamem;
Architecture beh of datamem is
----------------Internal variables----------------
constant DEPTH :integer := 16;
--signal data_out :std_logic_vector (13 downto 0):=(others=>'0');
type cmem is array (integer range <>)of std_logic_vector (13 downto 0);
signal mem : cmem (0 to 15);
begin
-- data <= data_out when (we = '0') else (others=>'Z');
-- Memory Write Block
-- Write Operation : When we = 1, cs = 1
MEM_WRITE:
process (clk) begin
if (rising_edge(clk)) then
if we = '1' then
mem(conv_integer(ad)) <= data;
end if;
end if;
end process;
-- Memory Read Block
-- Read Operation : When we = 0, oe = 1, cs = 1
MEM_READ:
process (clk) begin
if (rising_edge(clk)) then
if we = '0' then
data_out <= mem(conv_integer(ad));
end if;
end if;
end process;
end beh;
CAM MEMORY CODING
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;
entity cammem is
port (
clk :in std_logic; -- Clock Input
-- address :inout std_logic_vector (3 downto 0);
address :in std_logic_vector (3 downto 0);
address_out :out std_logic_vector (3 downto 0);
data :in std_logic_vector (13 downto 0);
we : in std_logic
);
end cammem;
Architecture beh of cammem is
----------------Internal variables----------------
constant CAM_DEPTH :integer := 2**14;
-- signal address_out :std_logic_vector (3 downto 0):=(others=>'0');
type CAM is array (integer range <>)of std_logic_vector (3 downto 0);
signal mem : CAM (0 to CAM_DEPTH-1);
begin
-- address <= address_out when ( we = '0') else (others=>'Z');
MEM_WRITE:
process (clk) begin
if (rising_edge(clk)) then
if we = '1' then
mem(conv_integer(data)) <= address; -- conversion: data word used as integer index
end if;
end if;
end process;
MEM_READ:
process (clk) begin
if (rising_edge(clk)) then
if we = '0' then
address_out <= mem(conv_integer(data));
end if;
end if;
end process;
end beh;
CAM XOR CODING
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_unsigned.all;
entity xor1 is
port (
i : in std_logic_vector(13 downto 0);
s: out std_logic_vector(3 downto 0));
end xor1;
architecture behavioral of xor1 is
signal a : std_logic_vector (5 downto 0);
begin
a(5) <= i(13)xor i(12);
a(4) <= i(11)xor i(10);
a(3) <= i(9)xor i(8);
a(2) <= i(7)xor i(6);
a(1) <= i(5)xor i(4);
a(0) <= i(3)xor i(2);
s(3) <= a(5)xor a(4);
s(2) <= a(3)xor a(2);
s(1) <= a(1)xor a(0);
s(0) <= i(1)xor i(0);
end behavioral;
3).Coding For Gate-Block Selection Approach :
CAM GATE BLOCK SELECTION
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;
entity camblockgb is
port (
data : in std_logic_vector(13 downto 0);
clk : in std_logic;
we : in std_logic;
wei : in std_logic;
addr : out std_logic_vector(3 downto 0)
);
End camblockgb;
Architecture beh of camblockgb is
component datamem
port (
clk : in std_logic;
ad : in std_logic_vector(3 downto 0);
--data : inout std_logic_vector(13 downto 0);
data : in std_logic_vector(13 downto 0);
data_out : out std_logic_vector(13 downto 0);
we : in std_logic
);
End component;
component gbextractor
port (
i : in std_logic_vector(13 downto 0);
ex : out std_logic_vector(3 downto 0)
);
End component;
component cammem
port (
clk : in std_logic; -- Clock Input
-- address : inout std_logic_vector (3 downto 0);
address : in std_logic_vector (3 downto 0);
address_out : out std_logic_vector (3 downto 0);
data : in std_logic_vector (13 downto 0);
we : in std_logic
);
end component;
Signal ex,exi : std_logic_vector(3 downto 0);
Signal dat : std_logic_vector(13 downto 0);
begin
m0 : gbextractor port map(data,ex);
m1 : datamem port map(clk,ex,data,dat,we);
m2 : cammem port map (clk,ex,exi,data,wei);
addr <= exi;
end beh;
GATE BLOCK CAM MEMORY CODING
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;
entity cammem is
port (
clk :in std_logic; -- Clock Input
-- address :inout std_logic_vector (3 downto 0);
address :in std_logic_vector (3 downto 0);
address_out :out std_logic_vector (3 downto 0);
data :in std_logic_vector (13 downto 0);
we : in std_logic
);
end cammem;
Architecture beh of cammem is
----------------Internal variables----------------
constant CAM_DEPTH :integer := 2**14;
-- signal address_out :std_logic_vector (3 downto 0):=(others=>'0');
type CAM is array (integer range <>)of std_logic_vector (3 downto 0);
signal mem : CAM (0 to CAM_DEPTH-1);
begin
-- address <= address_out when ( we = '0') else (others=>'Z');
MEM_WRITE:
process (clk) begin
if (rising_edge(clk)) then
if we = '1' then
mem(conv_integer(data)) <= address; -- conversion: data word used as integer index
end if;
end if;
end process;
MEM_READ:
process (clk) begin
if (rising_edge(clk)) then
if we = '0' then
address_out <= mem(conv_integer(data));
end if;
end if;
end process;
end beh;
GATE BLOCK CAM DATA MEMORY CODING
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;
entity datamem is
port (
clk : in std_logic;
ad : in std_logic_vector(3 downto 0);
--data : inout std_logic_vector(13 downto 0);
data : in std_logic_vector(13 downto 0);
data_out : out std_logic_vector(13 downto 0);
we : in std_logic
);
End datamem;
Architecture beh of datamem is
----------------Internal variables----------------
constant DEPTH :integer := 16;
--signal data_out :std_logic_vector (13 downto 0):=(others=>'0');
type cmem is array (integer range <>)of std_logic_vector (13 downto 0);
signal mem : cmem (0 to 15);
begin
-- data <= data_out when (we = '0') else (others=>'Z');
-- Memory Write Block
-- Write Operation : When we = 1, cs = 1
MEM_WRITE:
process (clk) begin
if (rising_edge(clk)) then
if we = '1' then
mem(conv_integer(ad)) <= data;
end if;
end if;
end process;
-- Memory Read Block
-- Read Operation : When we = 0, oe = 1, cs = 1
MEM_READ:
process (clk) begin
if (rising_edge(clk)) then
if we = '0' then
data_out <= mem(conv_integer(ad));
end if;
end if;
end process;
end beh;
GATE BLOCK CAM EXTRACTOR CODING
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;
entity gbextractor is
port (
i : in std_logic_vector(13 downto 0);
ex : out std_logic_vector(3 downto 0)
);
End gbextractor;
Architecture beh of gbextractor is
signal s : std_logic_vector(5 downto 0);
signal a : std_logic_vector(3 downto 0);
signal se : std_logic;
Begin
a(0) <= i(0) nand i(1);
s(0) <= i(2) nor i(3);
s(1) <= i(4) nor i(5);
a(1) <= s(0) xor s(1);
s(2) <= i(6) nand i(7);
s(3) <= i(8) nand i(9);
a(2) <= s(2) xor s(3);
s(4) <= i(10) nor i(11);
s(5) <= i(12) nor i(13);
a(3) <= s(4) xor s(5);
se <= a(0) and a(1) and a(2) and a(3);
process (se, a, i)
begin
if se = '1' then
ex <= a(3 downto 0);
else
ex <= i(13 downto 10);
end if;
end process;
end beh;
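None of the listings above includes a testbench. A minimal simulation driver along the following lines (our own illustrative sketch; the stimulus word and timing are arbitrary) can be used to exercise the write and read sequences described in Chapter 6:

TESTBENCH SKETCH FOR THE GATE-BLOCK CAM
library ieee;
use ieee.std_logic_1164.all;

entity tb_camblockgb is
end tb_camblockgb;

architecture beh of tb_camblockgb is
  signal data : std_logic_vector(13 downto 0) := (others => '0');
  signal clk  : std_logic := '0';
  signal we   : std_logic := '1';
  signal wei  : std_logic := '1';
  signal addr : std_logic_vector(3 downto 0);
begin
  -- unit under test: the gate-block selection PB-CAM
  uut : entity work.camblockgb
    port map (data => data, clk => clk, we => we, wei => wei, addr => addr);

  clk <= not clk after 10 ns;  -- free-running clock, 20 ns period

  stim : process
  begin
    -- write phase: we = wei = '1' stores the word (builds the database)
    data <= "01001101000011";  -- arbitrary 14-bit test word
    wait for 20 ns;
    -- search phase: we = wei = '0' returns the matching address on addr
    we  <= '0';
    wei <= '0';
    wait for 40 ns;
    wait;                      -- end of stimulus
  end process;
end beh;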
Chapter 6
SIMULATION RESULTS
In this chapter the simulation results of the implemented modules are discussed along with their RTL schematics and output waveforms. This work is mainly concerned with the implementation of the low-power pre-computation-based CAM in VHDL. The developed VHDL code was also synthesizable, meaning it can be implemented in hardware (in an FPGA device). Every VHDL module in this work was synthesized using Xilinx ISE 9.2i, targeting the Xilinx Spartan-3 technology.
5.1 Results of the Ones-Count Parameter Extractor:
Fig. 5.1 shows the simulation output for the 14-bit ones-count parameter extractor, which counts the number of ones in the given input sequence and gives that count as its output. For a 14-bit input, the extracted output is ⌈log₂(14+2)⌉ = 4 bits wide. This output is termed the parameter.
RTL Schematics:
Results of CAM:
Fig. 5.8 shows the RTL schematic of the content addressable memory. It takes data as input and gives an address as output. Two control signals, we and wei, are used to enable the write and read operations. In this implementation the gate-block selection parameter extractor is used, as it consumes the least power for parameter extraction, which is the critical part of the search operation.
Fig. 5.8: RTL schematic of the CAM.
Fig. 5.9: Detailed RTL schematic of the CAM with parameter extractor.
Fig. 5.10: Detailed RTL schematic of the content addressable memory.
Initially the control signals are set to '1' (i.e., we = '1' and wei = '1') to write the data into the CAM, in other words to create the database. After writing the data, we can read the address corresponding to the data by setting the control signals to '0'. We can also read the data one clock cycle later by setting we = '0' and wei = '1'. The output results for writing data and for reading the address as well as the data are illustrated in Figs. 5.11 to 5.13.
Ones-Count Approach
Signals and Structure
The structure and signals of the simulated model for the ones-count approach, obtained after the complete simulation, are shown below.
CAM Memory
1) Data Memory
Waveforms After Simulation
The waveforms shown below illustrate the status of, and changes in, the memory locations after each run.
Fig. 5.11: VHDL output showing the data written into the CAM.
Fig. 5.12: VHDL output showing the data read from the CAM.
Fig. 5.13: VHDL output showing the address read from the CAM.
Device Utilization Summary:

Logic Utilization          Used    Available    Utilization
Number of Slices            14        768            1%
Number of 4-input LUTs      26       1536            1%
Number of bonded IOBs       19        124           15%
Number of BRAMs              4          4          100%
Number of GCLKs              1          8           12%

Table 6
Table 7
The RTL schematic in Fig. 5.3 shows that full adders are used in the architecture, and these consume a large area, as reflected in Table 7.
Simulation Results of the Block-XOR Approach
Fig. 5.5 shows the RTL schematic for the 14-bit Block-XOR parameter extractor, which XORs the blocks of the given input sequence and gives the combined result as its output. For a 14-bit input, the extracted output is ⌈log₂(14+2)⌉ = 4 bits wide. This output is termed the parameter.
RTL Schematic
Fig. 5.5: Detailed RTL schematic of the Block-XOR parameter extractor.
Signals and Structure
The structure and signals of the simulated model for the Block-XOR approach, obtained after the complete simulation, are shown below:
Data Memory
CAM Memory
Block-XOR Approach
Waveforms After Simulation
Results of the Block-XOR Parameter Extractor:
Fig. 5.4 shows the simulation output for the 14-bit Block-XOR parameter extractor. Here the extraction is done by independent blocks. Compared to the ones-count parameter extractor it has less delay, takes less area, and is adaptive for general designs. The input is 14 bits and the extracted output (parameter) is 4 bits. If the parameter is "0000" to "1110" (Se = '0'), the parameter does not change. Otherwise, if the parameter is "1111" (Se = '1'), the first block of the input data becomes the new parameter, and "1111" can then be used as the valid bit.
The waveforms shown below illustrate the status of, and changes in, the memory locations after each run.
Fig. 5.4: VHDL output for the Block-XOR parameter extractor.
Device Utilization Summary:

Logic Utilization          Used    Available    Utilization
Number of Slices             2       1920            0%
Number of 4-input LUTs       4       3840            0%
Number of bonded IOBs       19         97           19%
Number of BRAMs              4          6           66%
Number of GCLKs              1          8           12%

Table 8
The RTL schematic in Fig. 5.5 shows that independent blocks are used for parameter extraction, and Table 8 shows that fewer devices are used for the implementation. The Block-XOR parameter extractor therefore takes less area and hence less power.
Simulation Results of the Gate-Block Selection Algorithm
Fig. 5.7 shows the RTL schematic for the 14-bit gate-selected parameter extractor, which computes the output sequence from the given input sequence by the predefined algorithm, passing the input through the selected gates. For a 14-bit input, the extracted output is ⌈log₂(14+2)⌉ = 4 bits wide. This output is termed the parameter.
Fig. 5.7: Detailed RTL schematic of the gate-block selection parameter extractor.
The RTL schematic in Fig. 5.7 shows that independent blocks are used for parameter extraction, and Table 10 shows that fewer devices are used for the implementation. The gate-block selection parameter extractor therefore takes less area and hence less power.
Waveforms After Simulation
Results of the Gate-Block Selection Algorithm:
Fig. 5.6 shows the simulation output for the 14-bit gate-block selection parameter extractor. Here the extraction is done by independent blocks. Compared to the ones-count parameter extractor it has less delay, takes less area, and is adaptive for general designs. The input is 14 bits and the extracted output (parameter) is 4 bits. If the parameter is "0000" to "1110" (Se = '0'), the parameter does not change. Otherwise, if the parameter is "1111" (Se = '1'), the first block of the input data becomes the new parameter, and "1111" can then be used as the valid bit.
Fig. 5.6: VHDL output for the gate-block selection parameter extractor.
Signals and Structure
The structure and signals of the simulated model for the gate-block selection approach, obtained after the complete simulation, are shown below:
Gate-Block Extractor
Data Memory
CAM Memory
Device Utilization Summary of CAM
Table 9
Table 9 gives the device utilization summary of the CAM with the gate-block selection parameter extractor. From the waveforms it is observed that initially the 14-bit data is loaded into the CAM. If a data word stored in the CAM is then applied as input, the CAM gives the address pointing to that data as output exactly one clock cycle later.
Table 10
Device Utilization Summary of the gate-block selection algorithm
Chapter 7
CONCLUSION
In this thesis a 14-bit low-power pre-computation-based content addressable memory (PB-CAM) is simulated in VHDL. Mathematical analysis and simulation results confirmed that the Block-XOR PB-CAM can effectively save power by reducing the number of comparison operations in the second part of the comparison process. In addition, it takes less area compared with the ones-count parameter extractor. This PB-CAM takes data as input and returns the corresponding address exactly one clock cycle later, so it is flexible and well suited to low-power, high-speed search applications.
In this thesis a gate-block selection algorithm was also proposed. The proposed algorithm can synthesize a proper parameter extractor of the PB-CAM for a specific data type. Mathematical analysis and simulation results confirmed that the proposed PB-CAM effectively saves power by reducing the number of comparison operations in the data comparison process. In addition, the proposed parameter extractor computes the parameter bits in parallel with only three logic-gate delays for any input bit length (i.e., a constant search-operation delay).