A
PROJECT REPORT
ON
PRE-COMPUTATION BASED CONTENT
ADDRESSABLE MEMORY
Submitted in partial fulfillment of the requirements for the award of
“Bachelor of Technology”
IN
Electronics &
Communication
MAHARANA PRATAP ENGINEERING COLLEGE
KANPUR
Submitted to:
Mr. Anjaneya Nigam
(Asst. Professor)
Department of Electronics & Communication Engineering

Submitted by:
Neharika Mishra
Rahul Kr. Gupta
Rituraj Anand
Wasiuddin
Certificate
This is to certify that the project report entitled “PB-CAM”, which is
submitted by Neharika Mishra, Rahul Kr. Gupta, Rituraj Anand, and
Wasiuddin in partial fulfillment of the requirements for the award of the degree
of B.Tech in the Department of Electronics & Communication Engineering,
M.P.E.C., Kanpur, is a record of the candidates’ work carried out by them under
my supervision. The matter embodied in this thesis is original and has not
been submitted for the award of any other degree.
May 2011
Anjaneya Nigam
(Asst. Professor)
Department of Electronics Engineering
M.P.E.C., Kanpur
ACKNOWLEDGEMENT
It gives us a great sense of pleasure to present the report of the B.Tech project
undertaken during our B.Tech final year. We owe a special debt of gratitude to
Mr. Anjaneya Nigam (Asst. Professor), Department of Electronics &
Communication Engineering, Maharana Pratap Engineering College, Kanpur,
for his constant support and guidance throughout the course of this work; his
sincerity, thoroughness, and perseverance have been a constant source of
inspiration for us. It is only through his cognizant efforts that our endeavors
have seen the light of day.
We also take the opportunity to acknowledge the contribution of Mr. Mohit
Srivastava, Head, Department of Electronics and Communication Engineering,
Maharana Pratap Engineering College, Kanpur, for his full support and assistance
during the development of the project. We also do not want to miss the opportunity
to acknowledge the contribution of all faculty members of the department for their
kind co-operation during the development of our project.
Last but not least, we acknowledge our friends for their contribution to the
completion of the project.
May, 2011
Neharika Mishra
Rahul Kr. Gupta
Rituraj Anand
Wasiuddin
ABSTRACT
Content-addressable memory (CAM) is a special type of computer memory
used in certain very high speed searching applications. It is also known as
associative memory, associative storage, or associative array.
CAM is frequently used in applications, such as lookup tables, databases,
associative computing, and networking, that require high-speed searches,
owing to its ability to improve application performance by using parallel
comparison to reduce search time. Although parallel comparison reduces
search time, it also significantly increases power consumption. In this report,
we propose a Block-XOR approach to improve the efficiency of the low-power
pre-computation-based CAM (PB-CAM). Compared with the ones-count
PB-CAM system, experimental results show that our proposed approach
achieves on average 30% in power reduction and 32% in power-performance
reduction. The major contribution of this work is that it presents practical
proofs to verify that the proposed Block-XOR PB-CAM system can achieve
greater power reduction without the need for a special CAM cell design. This
implies that our approach is more flexible and adaptive for general designs.
CHAPTER 1
1.1 Introduction:
Most memory devices store and retrieve data by addressing specific
memory locations. As a result, this path often becomes the limiting factor for
systems that rely on fast memory accesses. The time required to find an
item stored in memory can be reduced considerably if the item can be
identified for access by its content rather than by its address. A memory that
is accessed in this way is called content-addressable memory or CAM.
It provides a performance advantage over other memory search algorithms,
such as binary or tree-based searches or look-aside tag buffers, by
comparing the desired information against the entire list of pre-stored entries
simultaneously, often resulting in an order-of-magnitude reduction in the
search time.
Classically a CAM is defined as a functional memory with a large
amount of stored data that simultaneously compares the input search data
with the stored data. Once matching data are found, their addresses are
returned as output as shown in Fig. 1.1.
Fig. 1.1: Conventional CAM architecture
Unlike standard computer memory (Random access memory or RAM) in
which the user supplies a memory address and the RAM returns the data
word stored at that address, a CAM is designed such that the user supplies
a data word and the CAM searches its entire memory to see if that data
word is stored anywhere in it. If the data word is found, the CAM returns a
list of one or more storage addresses where the word was found (and in
some architectures, it also returns the data word, or other associated pieces
of data). Thus, a CAM is the hardware embodiment of what in software
terms would be called an associative array.
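To make the analogy concrete, the following minimal Python sketch (our own
illustration, not part of any hardware implementation) models the two access
styles: a RAM lookup goes from address to data, while a CAM search goes from
data to the list of matching addresses.

    # Behavioral sketch only: a real CAM compares every word in parallel in hardware.
    class SimpleCAM:
        def __init__(self, words):
            self.words = list(words)  # stored data words, indexed by address

        def search(self, key):
            """Return every address whose stored word equals the search key."""
            return [addr for addr, word in enumerate(self.words) if word == key]

    cam = SimpleCAM([0b1010, 0b0110, 0b1010, 0b1111])
    print(cam.words[1])        # RAM-style access: address 1 -> 6 (0b0110)
    print(cam.search(0b1010))  # CAM-style access: data -> addresses [0, 2]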
Content-addressable memories (CAMs) are hardware search engines that
are much faster than algorithmic approaches for search-intensive
applications. CAMs are composed of conventional semiconductor memory
(usually SRAM) with added comparison circuitry that enables a search
operation to complete in a single clock cycle. A CAM can also be used as a
lookup table suited to applications such as AES.
A content-addressable memory (CAM) is a critical device for applications
involving asynchronous transfer mode (ATM), communication networks,
databases, lookup tables, tag directories, data compression, pattern-
recognition and security or encryption information on a packet-by-packet
basis for high-performance data switches, firewalls, bridges and routers due
to its high-speed data search capability. The vast number of comparison
operations required by CAMs consumes a large amount of power, so
pre-computation techniques have evolved to reduce it.
1.2 Literature Survey
In the past decade, much research on energy reduction has focused on the
circuit and technology domains (the literature provides comprehensive
surveys of CAM designs from the circuit level to the architectural level).
Several works on reducing CAM power consumption have focused on
reducing match-line power. Although there has been progress in this area in
recent years, the power consumption of CAMs is still high compared with
RAMs of similar size.
At the same time, research in associative cache system design for power
efficiency at the architectural level continues to increase. The filter cache
and location cache techniques can effectively reduce power dissipation by
adding a very small cache. However, the use of these caches requires major
modifications to the memory structure and hierarchy to fit the design.
Pagiamtzis et al. proposed a cache-CAM (C-CAM) that reduces power
consumption relative to the cache hit rate [8]. Lin et al. presented a
ones-count pre-computation-based CAM (PB-CAM) that achieves low-power,
low-cost, low-voltage, and high-reliability features.
Although Chang further improved the efficiency of PB-CAMs, the approach
proposed requires considerable modification of the memory architecture to
achieve high performance, and is therefore beyond the scope of general CAM
design. Moreover, the disadvantage of the ones-count PB-CAM system is
that it adopts a special memory cell design to reduce power consumption,
which is only applicable to the ones-count parameter extractor.
1.3 Existing System:
A CAM is a functional memory with a large amount of stored data that
compares the input search data with the stored data. Once matching data
are found, their addresses are returned as output. The vast number of
comparison operations required by CAMs consumes a large amount of
power.
1.4 Proposed System:
The proposed approach can reduce comparison operations by a minimum of
909 and a maximum of 2339 compared with the ones-count approach. We
propose a new parameter extractor, called Block-XOR, which achieves this
requirement.
1.5 Objective of the Project:
The main objective of the project is to implement the low-power
pre-computation-based content-addressable memory (PB-CAM) along with
the Block-XOR parameter extractor and the GATE block selection algorithm.
CHAPTER 2
CAM OVERVIEW
2.1 About CAM:
Content addressable memory (CAM) compares input search data against a
table of stored data, and returns the address of the matching data. CAMs
have a single clock cycle throughput making them faster than other
hardware- and software-based search systems. CAMs can be used in a
wide variety of applications requiring high search speeds. A CAM is a good
choice for implementing this lookup operation due to its fast search
capability.
However, the speed of a CAM comes at the cost of increased silicon area
and power consumption, two design parameters that designers strive to
reduce. As CAM applications grow, demanding larger CAM sizes, the power
problem is further exacerbated. Reducing power consumption, without
sacrificing speed or area, is the main thread of recent research in
large-capacity CAMs. Developments in the CAM area are surveyed at two
levels: circuits and architectures.
We can compare CAM to the inverse of RAM. When read, RAM produces
the data for a given address. Conversely, CAM produces an address for a
given data word. When searching for data within a RAM block, the search is
performed serially. Thus, finding a particular data word can take many
cycles. CAM searches all addresses in parallel and produces the address
storing a particular word.
CAM supports writing "don't care" bits into words of the memory. The don't
care bit can be used as a mask for CAM comparisons; any bit set to don't
care has no effect on matches.
The output of the CAM can be encoded or unencoded. The encoded output
is better suited for designs that ensure duplicate data is not written into the
CAM. If duplicate data is written into two locations, the CAM's output will not
be correct. If the CAM contains duplicate data, the unencoded output is a
better solution; a CAM with unencoded outputs can distinguish multiple data
locations.
CAM can be pre-loaded with data during configuration, or written during
system operation. In most cases, two clock cycles are required to write each
word into CAM. When don't care bits are used, a third clock cycle is
required.
2.1.1 Operation of CAM:
Fig. 2.1: Conceptual view of a content-addressable memory containing w
words.
Fig. 2.1 shows a simplified block diagram of a CAM. The input to the system
is the search word that is broadcast onto the search lines to the table of
stored data. The number of bits in a CAM word is usually large, with existing
implementations ranging from 36 to 144 bits. A typical CAM employs a table
size ranging from a few hundred entries to 32K entries, corresponding to
an address space of 7 to 15 bits.
Each stored word has a match line that indicates whether the search word
and stored word are identical (the match case) or are different (a mismatch
case, or miss). The match lines are fed to an encoder that generates a
binary match location corresponding to the match line that is in the match
state. An encoder is used in systems where only a single match is expected.
In addition, there is often a hit signal (not shown in the figure) that flags the
case in which there is no matching location in the CAM. The overall function
of a CAM is to take a search word and return the matching memory location.
One can think of this operation as a fully programmable arbitrary mapping of
the large space of the input search word to the smaller space of the output
match location.
The operation of a CAM is like that of the tag portion of a fully associative
cache. The tag portion of a cache compares its input, which is an address, to
all addresses stored in the tag memory. In the case of a match, a single match
line goes high, indicating the location of the match. Many circuits are common
to both CAMs and caches; however, we focus here on large-capacity CAMs
rather than on fully associative caches, which target smaller capacity and
higher speed.
Today’s largest commercially available single-chip CAMs are 18 Megabit
implementations, although the largest CAMs reported in the literature are 9
Megabit in size. As a rule of thumb, the largest available CAM chip is usually
about half the size of the largest available SRAM chip. This rule of thumb
comes from the fact that a typical CAM cell consists of two SRAM cells.
2.1.2 Simple CAM architecture:
Content Addressable Memories (CAMs) are fully associative storage
devices. Fixed-length binary words can be stored in any location in the
device. The memory can be queried to determine if a particular word, or key,
is stored, and if so, the address at which it is stored. This search operation is
performed in a single clock cycle by a parallel bitwise comparison of the key
against all stored words.
Fig. 2.2: Simple schematic of a model CAM with 4 words of 3 bits each.
We now take a more detailed look at CAM architecture. A small model is
shown in Fig. 2.2. The figure shows a CAM consisting of 4 words, with each
word containing 3 bits arranged horizontally (corresponding to 3 CAM cells).
There is a match line corresponding to each word (ML0, ML1, etc.) feeding
into match line sense amplifiers (MLSAs), and there is a differential search
line pair corresponding to each bit of the search word (SL0/SL̄0, SL1/SL̄1,
etc.). A CAM search operation begins with loading the search-data word into
the search-data registers, followed by pre-charging all match lines high, putting
them all temporarily in the match state. Next, the search line drivers
broadcast the search word onto the differential search lines, and each CAM
core cell compares its stored bit against the bit on its corresponding search
lines. Match lines on which all bits match remain in the pre-charged-high
state. Match lines that have at least one bit that misses, discharge to ground.
The MLSA then detects whether its match line has a matching condition or
miss condition. Finally, the encoder maps the match line of the matching
location to its encoded address.
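The three-phase cycle described above can be mimicked in software. The
sketch below is a behavioral model under our own naming, not a circuit-level
description: it precharges every match line high and then discharges any line
whose word has at least one mismatching bit.

    def cam_search_cycle(stored_words, search_word, width):
        """Model one search cycle: precharge MLs high, broadcast the search
        word, and discharge every ML with at least one mismatching bit."""
        match_lines = [True] * len(stored_words)      # precharge phase: all 'match'
        for addr, word in enumerate(stored_words):    # evaluation phase
            for bit in range(width):
                sl = (search_word >> bit) & 1         # broadcast search-line value
                if sl != ((word >> bit) & 1):         # a pull-down path activates
                    match_lines[addr] = False         # ML discharges to ground
                    break
        return match_lines

    print(cam_search_cycle([0b101, 0b011, 0b101, 0b110], 0b101, 3))
    # [True, False, True, False]: MLs 0 and 2 stay precharged (matches)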
2.2 Implementation types:
2.2.1 Semiconductor implementations:
Because a CAM is designed to search its entire memory in a single
operation, it is much faster than RAM in virtually all search applications.
There are cost disadvantages to CAM, however. Unlike a RAM chip, which
has simple storage cells, each individual memory bit in a fully parallel CAM
must have its own associated comparison circuit to detect a match between
the stored bit and the input bit. Additionally, match outputs from each cell in
the data word must be combined to yield a complete data word match signal.
The additional circuitry increases the physical size of the CAM chip which
increases manufacturing cost. The extra circuitry also increases power
dissipation since every comparison circuit is active on every clock cycle.
2.2.2 Alternative implementations:
To achieve a different balance between speed, memory size and cost, some
implementations emulate the function of CAM by using standard tree search
or hashing designs in hardware, using hardware tricks like replication or
pipelining to speed up effective performance. These designs are often used
in routers. Two types of CAMs are discussed below. They are Binary CAM
(BCAM) and Ternary CAM (TCAM).
2.3 CAM types:
2.3.1 Binary CAM:
Binary CAM is the simplest type of CAM; it uses data search words composed
entirely of 1s and 0s. Binary CAMs perform well for exact-match operations
and can be used for route lookups in strictly hierarchical addressing
schemes.
2.3.2 Ternary CAM:
Ternary CAM allows a third matching state of "X" or "don't care" for one or more bits in
the stored data word, thus adding flexibility to the search. For example, a
ternary CAM might have a stored word of "10XX0" which will match any of
the four search words "10000", "10010", "10100", or "10110". The added
search flexibility comes at an additional cost over binary CAM as the internal
memory cell must now encode three possible states instead of the two of
binary CAM. This additional state is typically implemented by adding a mask
bit ("care" or "don't care" bit) to every memory cell.
• Standard Ternary Mode. Bit X matches either 1, 0, or X (1010 = 1X1X =
10XX) and is referred to as a don’t care bit.
• Enhanced Ternary Mode. Bit X also matches either 1, 0, or X (1010 =
1X1X = 10XX), also referred to as a don’t care bit. Bit U does not match any
of the possible bit values: 1, 0, X, or U, and is referred to as an unmatchable
bit in this document.
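A value/mask encoding is one common way to model the basic don't-care
behavior in software. The sketch below is our own illustration (it covers the
"X" state only, not the enhanced-mode "U" bit); each masked bit matches any
key bit.

    def ternary_match(value, mask, key):
        """Bits set in 'mask' are 'X' (don't care); all other bits must match."""
        return (value & ~mask) == (key & ~mask)

    # Stored word "10XX0": bits 4..0 = 1,0,X,X,0  ->  value 0b10000, mask 0b00110
    value, mask = 0b10000, 0b00110
    for key in (0b10000, 0b10010, 0b10100, 0b10110, 0b00000):
        print(f"{key:05b}", ternary_match(value, mask, key))
    # the four "10XX0" patterns match; 00000 misses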
2.4 Core Cells and Match line Structure:
A CAM cell serves two basic functions: bit storage (as in RAM) and bit
comparison (unique to CAM). Fig. 2.3 shows a NOR-type CAM cell [Fig.
2.3(a)] and the NAND-type CAM cell [Fig. 2.3(b)]. The bit storage in both
cases is an SRAM cell where cross-coupled inverters implement the bit-
storage nodes D and D̄. To simplify the schematic, we omit the n-MOS
access transistors and bit lines which are used to read and write the SRAM
storage bit. Although some CAM cell implementations use lower area DRAM
cells, CAM cells typically use SRAM storage. The bit comparison, which is
logically equivalent to an XOR of the stored bit and the search bit, is
implemented in a somewhat different fashion in the NOR and the NAND
cells.
Fig. 2.3: CAM core cells for (a) NOR-type CAM and (b) NAND-type CAM
2.4.1 NOR Cell:
The NOR cell implements the comparison between the complementary
stored bit, D (and D̄), and the complementary search data on the
complementary search lines, SL (and SL̄), using four comparison transistors,
M1 through M4, which are all typically minimum-size to maintain high cell
density. These transistors implement the pull down path of a dynamic XNOR
logic gate with inputs SL and D. Each pair of transistors, M1/M3 and M2/M4,
forms a pull down path from the match line, ML, such that a mismatch of SL
and D activates at least one of the pull down paths, connecting ML to ground.
A match of SL and D disables both pull down paths, disconnecting ML from
ground. The NOR nature of this cell becomes clear when multiple cells are
connected in parallel to form a CAM word by shorting the ML of each cell to
the ML of adjacent cells. The pull down paths connect in parallel resembling
the pull down path of a CMOS NOR logic gate. There is a match condition
on a given ML only if every individual cell in the word has a match.
2.4.2 NAND Cell:
The NAND cell implements the comparison between the stored bit, D, and
the corresponding search data on the corresponding search lines, (SL, SL̄),
using the three comparison transistors M1, MD, and MD̄, which are all
typically minimum-size to maintain high cell density. We illustrate the
bit-comparison operation of a NAND cell through an example.
Consider the case of a match when SL=1 and D=1. Pass transistor MD is
ON and passes the logic “1” on the SL to node B. Node B is the bit-match
node, which is logic “1” if there is a match in the cell. The logic “1” on node B
turns ON transistor M1. Note that M1 is also turned ON in the other match
case, when SL=0 and D=0. In this case, the transistor MD̄ passes a logic high
to raise node B. The remaining cases, where SL ≠ D, result in a miss
condition, and accordingly node B is logic “0” and the transistor M1 is OFF.
Node B is thus a pass-transistor implementation of the XNOR function of SL and D.
The NAND nature of this cell becomes clear when multiple NAND cells are
serially connected. In this case, the MLn and MLn+1 nodes of adjacent cells
are joined to form a word. The serial n-MOS chain of all the M1 transistors
resembles the pull down path of a CMOS NAND logic gate. A match condition
for the entire word occurs only if every cell in the word is in the match condition.
An important property of the NOR cell is that it provides a full rail voltage at
the gates of all comparison transistors. On the other hand, a deficiency of
the NAND cell is that it provides only a reduced logic “1” voltage at node B,
which can reach only VDD-Vtn when the search lines are driven to VDD (where
VDD is the supply voltage and Vtn is the n-MOS threshold voltage).
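At the logic level, both cells compute the same bit comparison and differ only
in how the result drives the match line. A small truth-table sketch (function
names are ours, a behavioral abstraction of the transistor paths described
above) captures this:

    def nor_cell_pulldown(d, sl):
        # M1/M3 path: SL high with D̄ high; M2/M4 path: SL̄ high with D high.
        # Returns True when a pull-down path would discharge the match line.
        return bool((sl and not d) or (not sl and d))

    def nand_cell_bit_match(d, sl):
        # MD passes SL when D=1; MD̄ passes SL̄ when D=0 (node B = XNOR of SL, D).
        return bool((sl and d) or (not sl and not d))

    for d in (0, 1):
        for sl in (0, 1):
            print(d, sl, nor_cell_pulldown(d, sl), nand_cell_bit_match(d, sl))
    # The two functions are complementary: the pull-down fires exactly on a miss.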
2.4.3 Cell Variants:
Fig. 2.4 shows a variant of the NOR cell [Fig. 2.4(a)] and a variant of the
NAND cell [Fig. 2.4(b)]. The NOR cell variant uses only 9 transistors
compared to the previous 10-T NOR cell. The bit comparison uses pass
transistors (as in the 9-T NAND cell); however, the NOR property of
this cell is apparent when multiple cells are connected in parallel to form a
CAM word by shorting the ML of each cell to the ML of adjacent cells. The
pull down paths are in parallel just as in a CMOS NOR logic gate.
Fig. 2.4: CAM core cell variations for (a) 9-T NOR-type CAM and (b) 10-T
NAND-type CAM.
The 10-T NAND cell [Fig. 2.4(b)] is a variant of the previous 9-T NAND cell.
When the bit comparison succeeds in this cell, one of the transistor paths
between MLn and MLn+1 is ON. Thus, when multiple cells are shorted
together these transistor paths appear in series just as in the pull down
network of a CMOS NAND gate. Since this NAND cell doubles the number
of transistors in series, the 9-T NAND cell is usually preferred. For the remainder
of this report we discuss only the 9-T NAND cell and the 10-T NOR cell, as
they are in predominant use today.
2.4.4 Ternary Cells:
The NOR and NAND cells that have been presented are binary CAM cells.
Such cells store either a logic “0” or a logic “1”. Ternary cells, in addition,
store an “X” value. The “X” value is a don't care that represents both “0” and
“1”, allowing a wildcard operation. Wildcard operation means that an “X”
value stored in a cell causes a match regardless of the input bit. As
discussed earlier, this is a feature used in packet forwarding in Internet
routers.
A ternary symbol can be encoded into two bits according to Table I. We
represent these two bits as D and D̄. Note that although D and D̄ are not
necessarily complementary, we maintain the complementary notation for
consistency with the binary CAM cell. Since two bits can represent four
possible states, but ternary storage requires only three, we disallow the
state where D and D̄ are both zero. To store a ternary value in a NOR
cell, we add a second SRAM cell, as shown in Fig. 2.5(a). One bit, D, connects
to the left pull down path and the other bit, D̄, connects to the right pull down
path, making the pull down paths independently controlled. We store an “X”
by setting both D and D̄ to logic “1”, which disables both pull down paths and
forces the cell to match regardless of the inputs. We store a logic “1” by
setting D=1 and D̄=0, and a logic “0” by setting D=0 and D̄=1. In addition to
storing an “X”, the cell allows searching for an “X” by setting both SL and SL̄
to logic “0”. This is an external don't care that forces a match of a bit
regardless of the stored bit. Although storing an “X” is possible only in
ternary CAMs, an external “X” symbol is possible in both binary and ternary
CAMs. In cases where ternary operation is needed but only binary CAMs are
available, it is possible to emulate ternary operation using two binary cells
per ternary symbol.
TABLE I
Ternary encoding for the NOR cell
Stored value    D    D̄
“0”             0    1
“1”             1    0
“X”             1    1
(disallowed)    0    0
As a modification to the ternary NOR cell of Fig. 2.5(a), Roth et al. propose
implementing the pull down transistors M1—M4 using p-MOS devices and
complementing the logic levels of the search lines and match lines
accordingly. Using p-MOS transistors (instead of n-MOS transistors) for the
comparison circuitry allows for a more compact layout, due to reducing the
required spacing between p-diffusions and n-diffusions in the cell. In addition to
increased density, the smaller area of the cell reduces wiring capacitance
and therefore reduces power consumption. The tradeoff that results from
using minimum-size p-MOS transistors, rather than minimum-size n-MOS
transistors, is that the pull down path will have a higher equivalent
resistance, slowing down the search operation.
A NAND cell can be modified for ternary storage by adding storage for a
mask bit at node M, as depicted in Fig. 2.5(b). When storing an “X”, we set
this mask bit to “1”. This forces the mask transistor ON, regardless of the value
of D, ensuring that the cell always matches. In addition to storing an “X”, the
cell allows searching for an “X” by setting both SL and SL̄ to logic “1”. Table
II lists the stored encoding and search-bit encoding for the ternary NAND
cell.
TABLE II
Further minor modifications to CAM cells include mixing parts of the NAND
and NOR cells, using dynamic-threshold techniques in silicon-on-insulator
(SOI) processes, and alternating the logic level of the pull down path to
ground in the NOR cell.
Fig. 2.5: Ternary core cells for (a) NOR-type CAM and (b) NAND-type CAM
Currently, the NOR cell and the NAND cell are the prevalent core cells for
providing storage and comparison circuitry in CMOS CAMs.
2.4.5 Match line Structures:
We now demonstrate the NAND cell and NOR cell in constructing a CAM
matchline. The matchline is one of the two key structures in CAMs.
NOR Matchline: Fig. 2.6 depicts, in schematic form, how NOR cells are
connected in parallel to form a NOR matchline, ML. While we show binary
cells in the figure, the description of matchline operation applies to both
binary and ternary CAM.
Fig. 2.6: Structure of a NOR match line with n cells.
Transistor Mpre pre-charges the match line and the MLSA evaluates the
state of the match line, generating the match result.
A typical NOR search cycle operates in three phases: search line pre-
charge, match line pre-charge, and match line evaluation. First, the search
lines are pre-charged low to disconnect the match lines from ground by
disabling the pull down paths in each CAM cell. Second, with the pull down
paths disconnected, the Mpre transistor pre-charges the match lines high.
Finally, the search lines are driven to the search word values, triggering the
match line evaluation phase. In the case of a match, the ML voltage, VML ,
stays high as there is no discharge path to ground. In the case of a miss,
there is at least one path to ground that discharges the match line. The
match line sense amplifier (MLSA) senses the voltage on ML, and generates
a corresponding full-rail output match result.
The main feature of the NOR match line is its high speed of operation. In the
slowest case of a one-bit miss in a word, the critical evaluation path is
through the two series transistors in the cell that form the pull down path.
Even in this worst case, NOR-cell evaluation is faster than the NAND case,
where between 8 and 16 transistors form the evaluation path.
NAND Match line: Fig. 2.7 shows the structure of the NAND match line. A
number of cells, n, are cascaded to form the match line (this is, in fact, a
match node, but for consistency we will refer to it as ML). For the purpose of
explanation, we use the binary version of the NAND cell, but the same
description applies to the case of a ternary cell.
Fig. 2.7 : NAND match line structure with pre-charge and evaluate
transistors.
On the right of the figure, the pre-charge p-MOS transistor, Mpre, sets the
initial voltage of the match line, ML, to the supply voltage, VDD. Next, the
evaluation n-MOS transistor, Meval, turns ON. In the case of a match, all n-
MOS transistors M1 through Mn are ON, effectively creating a path to
ground from the ML node, hence discharging ML to ground. In the case of a
miss, at least one of the series n-MOS transistors, M1 through, Mn is OFF,
leaving the ML voltage high. A sense amplifier, MLSA, detects the difference
between the match (low) voltage and the miss (high) voltage. The NAND
match line has an explicit evaluation transistor, Meval, unlike the NOR match
line, where the CAM cells themselves perform the evaluation.
There is a potential charge-sharing problem in the NAND match line. Charge
sharing can occur between the ML node and the intermediate MLi nodes.
For example, in the case where all bits match except for the leftmost bit in
Fig. 2.7, during evaluation there is charge sharing between the ML node and
nodes ML1 through MLn-1. This charge sharing may cause the ML node
voltage to drop sufficiently low such that the MLSA detects a false match. A
technique that eliminates charge sharing is to pre-charge high, in addition to
ML, the intermediate match nodes ML1 through MLn-1. This is accomplished
by setting the search lines, SL1 – SLn, and their complements, SL̄1 – SL̄n,
to VDD, which forces all transistors in the chain, M1–Mn, to turn ON and
pre-charge the intermediate nodes. When this pre-charge of the intermediate
match nodes is complete, the search lines are set to the data values
corresponding to the incoming search word. This procedure eliminates
charge sharing, since the intermediate match nodes and the ML node are
initially shorted.
A feature of the NAND match line is that a miss stops signal propagation
such that there is no consumption of power past the final matching transistor
in the serial n-MOS chain. Typically, only one match line is in the match
state, consequently most match lines have only a small number of
transistors in the chain that are ON and thus only a small amount of power is
consumed. Two drawbacks of the NAND match line are a quadratic delay
dependence on the number of cells, and a low noise margin. The quadratic
delay-dependence comes from the fact that adding a NAND cell to a NAND
match line adds both a series resistance due to the series n-MOS transistor
and a capacitance to ground due to the n-MOS diffusion capacitance. These
elements form an RC ladder structure whose overall time constant has a
quadratic dependence on the number of NAND cells. Most implementations
limit the number of cells on a NAND match line to 8 to 16 in order to limit the
quadratic degradation in speed.
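The quadratic dependence can be made explicit with a first-order Elmore
(RC-ladder) estimate. Assuming, purely for illustration, that each of the n
cells contributes an identical series resistance R and node capacitance C:

    t_ML ≈ Σ (i = 1..n) C · (i·R) = R·C · n(n+1)/2, i.e., O(n²)

Doubling the number of cells on the match line thus roughly quadruples the
evaluation delay, consistent with the 8-to-16 cell limit noted above.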
The low noise margin is caused by the use of n-MOS pass transistors for the
comparison circuitry. Since the gate voltage of the NAND match line
transistors (M1 through Mn) when conducting, in Fig. 2.7, is VDD − Vtn, the
highest voltage that is passed on the match line is VDD − 2Vtn (where Vtn
is the threshold voltage of the n-MOS transistor, augmented by the body
effect).
all CAM cell transistors when conducting. One implementation of a NAND-
based CAM reclaims some noise margin by employing the bootstrap effect
by reversing the polarity of the match line pre-charge and evaluate.
2.5 A Reconfigurable Content Addressable Memory:
Content Addressable Memories or CAMs are popular parallel matching
circuits. They provide the capability, in hardware, to search a table of data
for a matching entry. This functionality is a high performance alternative to
popular software-based searching schemes. CAMs are typically found in
embedded circuitry where fast matching is essential.
Content Addressable Memories or CAMs are a class of parallel pattern
matching circuits. In one mode, these circuits operate like standard memory
circuits and may be used to store binary data. Unlike standard memory
circuits, however, a powerful match mode is also available. This match mode
permits all of the data in the CAM device to be searched in parallel. While
CAM hardware has been available for decades, its use has typically been in
niche applications, embedded in custom designs. Perhaps the most popular
application has been in cache controllers for central processing units. Here
CAMs are often used to search cache tags in parallel to determine if a cache
“hit" or “miss" has occurred. Clearly in this application performance is crucial
and parallel search hardware such as a CAM can be used to good effect.
A second and more recent use of CAM hardware is in the networking area.
As data packets arrive into a network router, processing of these packets
typically depends on the network destination address of the packet. Because
of the large number of potential addresses, and the increasing performance
demands, CAMs are beginning to become popular in processing network
address information.
2.5.1 A Standard CAM Implementation:
CAM circuits are similar in structure to traditional Random Access Memory
(RAM) circuits, in that data may be written to and read from the device. In
addition to functioning as a standard memory device, CAMs have an
additional parallel search or match mode. The entire memory array can be
searched in parallel using hardware. In this match mode, each memory cell
in the array is accessed in parallel and compared to some value. If this value
is found in any of the memory locations, a match signal is generated.
In some implementations, all that is significant is that a match for the data is
found. In other cases, it is desirable to know exactly where in the memory
address space this data was located. Rather than producing a simple
“match" signal, some CAM implementations also supply the address of the
matching data. In some sense, this provides a functionality opposite of a
standard RAM. In a standard RAM, addresses are supplied to hardware and
data at that address is returned. In a CAM, data is presented to the
hardware and an address returned.
At a lower level, the actual transistor implementation of a CAM circuit is very
similar to a standard static RAM. Figure 2.8 shows transistor level diagrams
of both
CMOS RAM and CAM circuits. The circuits are almost identical, except for
the addition of the match transistors to provide the parallel search capability.
Fig. 2.8. RAM and CAM transistor level circuits.
In a CMOS static RAM circuit, as well as in the CAM cell, data is accessed
via the BIT lines and the cells selected via the WORD lines. In the CAM cell,
however, the match mode is somewhat different. Inverted data is placed on
the BIT lines. If any cell contains data which does not match, the MATCH
line is pulled low, indicating that no match has occurred in the array.
Clearly this transistor level implementation is efficient and may be used to
produce CAM circuits which are nearly as dense as comparable static RAM
circuits. Unfortunately, such transistor level circuits cannot be implemented
using standard programmable logic devices.
2.5.2 An FPGA CAM Implementation:
Of course, a content addressable memory is just a digital circuit, and as
such may be implemented in an FPGA. The general approach is to provide
an array of registers to hold the data, and then use some collection of
comparators to see if a match has occurred. While this is a viable solution, it
suffers from the same sort of inefficiencies that plague FPGA-based RAM
implementations. Like RAM, the CAM is efficiently implemented at the
transistor level. Using gate level logic, particularly programmable or
reconfigurable logic, often results in a substantial penalty, primarily in size.
Because the FPGA CAM implementation relies on flip-flops as the data
storage elements, the size of the circuit is restricted by the number of
flip-flops in the device. While this is adequate for smaller CAM designs, larger
CAMs quickly deplete the resources of even the largest available FPGA.
2.5.3 The Reconfigurable Content Addressable Memory
(RCAM):
The Reconfigurable Content Addressable Memory or RCAM makes use of
runtime reconfiguration to efficiently implement a CAM circuit. Rather than
using the FPGA flip-flops to store the data to be matched, the RCAM uses
the FPGA Look Up Tables or LUTs. Using LUTs rather than flip-flops results
in a smaller, faster CAM.
The approach uses the LUT to provide a small piece of CAM functionality. In
Figure 2.9, a LUT is loaded with data which provides a “match 5"
functionality.
Fig. 2.9. Using a LUT to match 5.
Because a LUT can be used to implement any function of N variables, it is
also possible to provide more flexible matching schemes than the simple
match described in the circuit in Figure 2.9. In Figure 2.10, the LUT is loaded
with values which produce a match on any value but binary “4". This circuit
demonstrates the ability to embed a mask in the configuration of a LUT,
permitting arbitrary disjoint sets of values to be matched, within the LUT.
This function is important in many matching applications, particularly
networking.
Fig. 2.10. Using a LUT to match all inputs except 4.
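The LUT contents for such matchers are just truth tables. A short Python
sketch (our model of a generic 4-input LUT, not any vendor API) shows both
the “match 5" LUT of Fig. 2.9 and the “match all but 4" LUT of Fig. 2.10:

    def make_match_lut(accepted):
        """Build the 16-entry truth table of a 4-input LUT that outputs 1
        for every accepted 4-bit input value."""
        return [1 if v in accepted else 0 for v in range(16)]

    match5 = make_match_lut({5})                       # Fig. 2.9: match only 5
    all_but_4 = make_match_lut(set(range(16)) - {4})   # Fig. 2.10: match all but 4
    print(match5[5], match5[4])          # 1 0
    print(all_but_4[4], all_but_4[7])    # 0 1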
This approach can be used to provide matching circuits such as match all or
match none or any combination of possible LUT values. Note again, that this
arbitrary masking only applies to a single LUT. When combining LUTs to
make larger CAMs, the ability to perform such masking becomes more
restricted.
While using LUTs to perform matching is a powerful approach, it is
somewhat limited when used with traditional design tools. With schematics
and HDLs, the LUT contents may be specified, albeit with some difficulty.
And once specified, modifying these LUTs is difficult or impossible.
However, modification of FPGA circuitry at run-time is possible using a run-
time reconfiguration tool such as JBits. JBits permits LUT values, as well as
other parts of the FPGA circuit, to be modified arbitrarily at run time and in-
system. An Application Program Interface (API) into the FPGA configuration
permits LUTs, for instance, to be modified with a single function call. This,
combined with the partial reconfiguration capabilities of new FPGA devices
such as Virtex (tm), permits the LUTs used to build the RCAM to be easily
modified under software control, without disturbing the rest of the circuit.
Finally, using run-time reconfiguration software such as JBits, RCAM
circuits may be dynamically sized, even at run-time. This opens the
possibility of not only changing the contents of the RCAM during operation,
but actually changing the size and shape of the RCAM circuit itself. This
results in a situation analogous to dynamic memory allocation in RAM. It is
possible to “allocate" and “free" CAM resources as needed by the
application.
2.5.4 An RCAM Example:
One currently popular use for CAMs is in networking. Here data must be
processed under demanding real-time constraints. As packets arrive, their
routing information must be processed. In particular, destination addresses,
typically in the form of 32-bit Internet Protocol (IP) addresses must be
classified. This typically involves some type of search.
Current software based approaches rely on standard search schemes such
as hashing. While effective, this approach requires a powerful processor to
keep up with the real-time demands of the network. Offloading the
computationally demanding matching portion of the algorithms to external
hardware permits less powerful processors to be used in the system. This
results in savings not only in the cost of the processor itself, but in other
areas such as power consumption and overall system cost.
In addition, an external CAM provides networking hardware with the ability to
achieve packet processing in essentially constant time. Provided all
elements to be matched fit in the CAM circuit, the time taken to match is
independent of the number of items being matched. This provides good
scalability properties.
Other software-based matching schemes such as hashing are data-
dependent and may not meet real-time constraints, depending on complex
interactions between the hashing algorithm and the data being processed.
CAMs suffer no such limitations and permit easy analysis and verification.
Fig. 2.11. Matching a 32-bit IP header.
Figure 2.11 shows an example of an IP Match circuit constructed using the
RCAM approach. Note that this example assumes a basic 4-input LUT
structure for simplicity. Other optimizations, including using special-purpose
hardware such as carry chains are possible and may result in substantial
circuit area savings and clock speed increases.
This circuit requires one LUT input per matched bit. In the case of a 32-bit IP
address, this circuit requires 8 LUTs to provide the matching, and three
additional 4-input LUTs to provide the AND-ing for the MATCH signal. This
basic 32-bit matching block may be replicated in an array to produce the
CAM circuit. Again, note that other non-LUT implementations for
generating the MATCH circuit are possible.
Since the LUTs can be used to mask the matching data, it is possible to put
in “match all" conditions by setting the LUTs to all ones. Other more
complicated masking is possible, but typically only using groups of four
inputs. While this does not provide for the most general case, it appears to
cover the popular modes of matching.
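The decomposition described above can be sketched as follows (helper names
are hypothetical; a real design would map this onto FPGA LUTs via
reconfiguration rather than Python lists):

    def make_ip_matcher(ip, wildcard_nibbles=()):
        """Split a 32-bit address into eight 4-bit nibbles, one LUT each.
        A nibble listed in wildcard_nibbles gets an all-ones LUT ('match all')."""
        luts = []
        for n in range(8):
            nibble = (ip >> (4 * n)) & 0xF
            if n in wildcard_nibbles:
                luts.append([1] * 16)                  # match-all mask
            else:
                luts.append([1 if v == nibble else 0 for v in range(16)])
        return luts

    def ip_match(luts, key):
        # AND together the eight LUT outputs (three more 4-input LUTs in hardware)
        return all(lut[(key >> (4 * n)) & 0xF] for n, lut in enumerate(luts))

    luts = make_ip_matcher(0xC0A80001)                 # 192.168.0.1
    print(ip_match(luts, 0xC0A80001))                  # True
    print(ip_match(luts, 0xC0A80002))                  # False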
2.5.5 System Issues:
The use of run-time reconfiguration to construct, program and reprogram the
RCAM results in some significant overall system savings. In general, both
the hardware and the software are greatly simplified. Most of the savings
accrue from being able to directly reconfigure the LUTs, rather than having
to write them directly as in standard RAM circuits. Reconfiguration rather
than direct access to the stored CAM data first eliminates the entire read /
write access circuitry. This includes the decode logic to decode each
address, the wiring necessary to broadcast these addresses, the data
busses for reading and writing the data, and the IOBs used to communicate
with external hardware.
It should be pointed out that this interface portion of the circuitry is
substantial, in both its size and complexity. Busses typically consume tri-state
lines, which are often scarce. Depending on the addressing scheme, tens of
IOBs will necessarily be consumed. These also tend to be valuable
resources. The address decoders are also somewhat problematic circuits
and often require special purpose logic to be implemented efficiently. In
addition, the bus interface is typically the most timing sensitive portion of the
circuit and requires careful design and simulation. This is eliminated with the
use of run-time reconfiguration.
Finally, the system software is simplified. In a standard bus interface
approach, device drivers and libraries must be written, debugged and
maintained to access the CAM. And when the system software or processor
changes, this software must be ported to the new platform. With the RCAM,
all interfacing is performed through the existing configuration port, at no
additional overhead.
The cost of using the configuration port rather than direct hardware access is
primarily one of setup speed. Direct writes can typically be done in some
small number of system cycles. Reconfiguration of the RCAM to update
table entries may take substantially longer, depending on the
implementation. Partial reconfiguration in devices such as Virtex permits
changes to be made more rapidly than in older bulk-configuration devices, but
the speed may be orders of magnitude slower than direct hardware
approaches. Clearly the RCAM approach favors applications with slowly
changing data sets. Fortunately, many applications appear to fit into this
category.
2.5.6 Associative Processing:
Today, advances in circuit technology permit large CAM circuits to be built.
However, uses for CAM circuits are not necessarily limited to niche
applications like cache controllers or network routers. Any application which
relies on the searching of data can benefit from a CAM-based approach. A
short list of potential application areas that can benefit from fast
matching includes Artificial Intelligence, Database Search, Computer Aided
Design, Graphics Acceleration, and Computer Vision.
Much of the work in using parallel matching hardware to accelerate
algorithms was carried out in the 1960s and 1970s, when several large
parallel matching machines were constructed. With the rapid growth both in
size and speed of traditional processors in the intervening years, much of
the interest in CAMs has faded. However, as real-time constraints in areas
such as networking become impossible to meet with traditional processors,
solutions such as CAM-based parallel search will almost certainly become
more prevalent.
In addition, the use of parallel matching hardware in the form of CAMs can
provide another more practical benefit. For many applications, the use of
CAM-based parallel search can offload much of the work done by the
system processor. This should permit smaller, cheaper and lower power
processors to be used in embedded applications which can make use of
CAM-based parallel search.
The RCAM is a flexible, cost-effective alternative to existing CAMs. By using
FPGA technology and run-time reconfiguration, fast, dense CAM circuits can
be easily constructed, even at run-time.
In addition, the size of the RCAM may be tailored to a particular hardware
design, or even temporary changes in the system. This flexibility is not
available in other CAM solutions. In addition, the RCAM need not be a
stand-alone implementation. Because the RCAM is entirely a software
solution using state-of-the-art FPGA hardware, it is quite easy to embed
RCAM functionality in larger FPGA designs.
Finally, we believe that existing applications, primarily in the field of network
routing, are just the beginning of RCAM usage. Once other applications
realize that simple, fast, flexible parallel matching is available, it is likely that
other applications and algorithms will be accelerated using this approach.
2.6 Difference between CAM and RAM:
Since CAM is an outgrowth of Random Access Memory (RAM) technology,
in order to understand CAM, it helps to contrast it with RAM. A RAM is an
integrated circuit that stores data temporarily. Data is stored in a RAM at a
particular location, called an address. In a RAM, the user supplies the
address, and gets back the data. The number of address lines limits the depth
of a memory using RAM, but the width of the memory can be extended as
far as desired. With CAM, the user supplies the data and gets back the
address. The CAM searches through the memory in one clock cycle and
returns the address where the data is found. The CAM can be preloaded at
device startup and also be rewritten during device operation. Because the
CAM does not need address lines to find data, the depth of a memory
system using CAM can be extended as far as desired, but the width is
limited by the physical size of the memory. Thus, a CAM is the hardware
embodiment of what in software terms would be called an associative array.
CAM can be used to accelerate any application requiring fast searches of
databases, lists, or patterns, such as in image or voice recognition, or
computer and communication designs.
For this reason, CAM is used in applications where search time is very
critical and must be very short. For example, the search key could be the IP
address of a network user, and the associated information could be user’s
access privileges and his location on the network. If the search key
presented to the CAM is present in the CAM’s table, the CAM indicates a
‘match’ and returns the associated information, which is the user’s privileges.
A CAM can thus operate as a data-parallel or Single Instruction/Multiple
Data (SIMD) processor.
Read operation in traditional memory:
o Input is the address of the content we are interested in.
o Output is the content of that address.
o Depth is limited; width can be extended.
In CAM it is the reverse:
o Input is data associated with something stored in the memory.
o Output is the location where the associated content is stored.
o Width is limited; depth can be extended.
2.7 Applications:
Content addressable memory (CAM) is frequently used in applications that
require high-speed searches. A few examples are:
LAN bridges/switches and routers.
Asynchronous transfer mode (ATM).
Communication networks.
Lookup tables.
Tag directories.
Database engines.
Data compression hardware.
Artificial neural networks.
CPU cache controllers.
Translation lookaside buffers (TLBs).
Single Instruction/Multiple Data (SIMD) processors.
CHAPTER 3
LOW POWER PB-CAM
WHAT IS PB-CAM?
A general CAM architecture usually consists of data memory with a valid bit
field, an address decoder, a bit line pre-charger, word match circuits, and an
address priority encoder. The memory organization of the traditional CAM
consists of the data memory and the valid bit field, where the valid bit field
indicates the availability of the stored data. In the data searching operation,
the input data is sent to the CAM to be compared with all valid data stored
in the CAM simultaneously, and the address of a matching entry is sent to the
output. In this architecture, the CAM circuit performs a large number of
comparison operations over all valid data stored in the CAM during each
data searching operation, and these comparisons consume most of the total
CAM power. To minimize the power consumed during comparison, one of the
best approaches is to eliminate most of the comparison operations. Based on
this idea, a novel architecture for low-power CAM circuit design, called
PB-CAM, was developed.
To motivate the proposed low-power PB-CAM architecture, we first introduce
its design concept. The memory organization of the PB-CAM architecture,
shown in Fig. 3.1, is composed of the data memory, the parameter memory,
and the parameter extractor. In the data writing operation, the parameter
extractor extracts the parameter of the input data, and the input data and its
parameter are then stored in the data memory and the parameter memory,
respectively. In the data searching operation, in order to reduce the large
number of comparison operations, the operation is separated into two
comparison processes. In the first comparison process, the parameter
extractor extracts the parameter of the input data, and the parameter
comparison circuits then compare this parameter with all parameters stored
in the parameter memory in parallel. If a stored parameter mismatches the
parameter of the input data, the data related to that stored parameter
necessarily mismatches the input data as well. Otherwise, the data related to
that stored parameter has yet to be identified. Using the results of the first
comparison process, the input data is compared only with the unidentified
data in the second comparison process to identify any match. With these two
comparison processes, if the majority of the stored parameters mismatch the
parameter of the input data, the number of comparisons in the second
comparison process is greatly reduced. The parameter comparison process
thus acts like a filter: it filters out the majority of unmatched data in the first
comparison process and thereby eliminates most of the comparisons in the
second comparison process.
Fig: General Scheme for CAM Architecture
In this report, the parameter comparison process is also referred to as the
pre-computation process. Although the data searching operation uses two
comparison processes to identify a match, both comparison processes are
performed in parallel to improve the data searching speed.
Content-addressable memory is frequently used in applications that require
high-speed searches because its parallel comparison reduces search time;
however, that parallelism also significantly increases power consumption.
The main CAM design challenge is therefore to reduce the power
consumption associated with the large amount of parallel active circuitry,
without sacrificing speed or memory density.
3.1 Power saving CAM architecture:
An architectural technique for saving power, applicable to binary CAM, is
pre-computation. Pre-computation stores some extra information along with
each word, and this information is used in the search operation to save
power. The extra bits are derived from the stored word and used in an initial
search before searching the main word. If this initial search fails, the CAM
aborts the subsequent search, thus saving power.
3.2 PB-CAM Architecture:
Fig. 3.1 shows the memory organization of the PB-CAM architecture, which
consists of the data memory, the parameter memory, and the parameter
extractor, where the parameter length k is much smaller than the data length
n (k << n). To reduce the massive number of comparison operations required
by data searches, the operation is divided into two parts. In the first part, the
parameter extractor extracts a parameter from the input data, which is then
compared in parallel with the parameters stored in the parameter memory. If
no match is returned for a stored parameter in the first part, the input data
cannot match the data related to that parameter. Otherwise, the data related
to the matching parameters must be compared in the second part. It should
be noted that although the first part must access the entire parameter
memory, the parameter memory is far smaller than the data memory of the
CAM. Moreover, since the comparisons made in the first part have already
filtered out the unmatched data, the second part only needs to compare the
data that matched in the first part.
Fig. 3.1: Memory organization of the PB-CAM architecture
The PB-CAM exploits this characteristic to reduce comparison operations,
thereby saving power. The parameter extractor is therefore critical, since it
determines the number of comparison operations required in the second
part. Its design goal is to filter out as many unmatched data as possible,
minimizing the number of comparison operations in the second part. Two
parameter extractors are discussed in the following sections: the ones-count
parameter extractor and the Block-XOR parameter extractor.
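The two-part search can be summarized in a few lines of Python (a behavioral
sketch with our own names; the extractor is passed in as a function so that
either variant can be plugged in):

    def pb_cam_search(data_mem, param_mem, key, extract):
        """Two-phase PB-CAM search: compare k-bit parameters first, then
        compare the full n-bit key only against the surviving candidates."""
        key_param = extract(key)
        candidates = [a for a, p in enumerate(param_mem)
                      if p == key_param]                       # first comparison
        return [a for a in candidates if data_mem[a] == key]   # second comparison

    def ones_count(w):
        return bin(w).count("1")

    data = [0b01001101, 0b11110000, 0b00000001]
    params = [ones_count(w) for w in data]
    print(pb_cam_search(data, params, 0b01001101, ones_count))  # [0]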
3.3 One’s count approach:
For ones count approach, with an n-bit data length, there are n+1 types of
one’s count (from 0 ones to n ones count). Further, it is necessary to add an
extra type of one’s count to indicate the availability of stored data. Therefore,
the minimal bit length of the parameter is equal to log (n+ 2). The below fig 5
shows the conceptual view of one’s count approach. The extra information
holds the number of ones in the stored word.
Fig. 3.2: Conceptual view of the ones-count approach
For example, in Fig. 3.2, when searching for the data word 01001101, the
pre-computation circuit counts the number of ones (four in this case). The
count of four is compared on the left-hand side with the stored ones counts.
Only match lines PML5 and PML7 match, since only they have a ones count
of four. In the data-memory stage in Fig. 3.2, only two comparisons actively
consume power, and only match line PML5 results in a match. The 14-bit
ones-count parameter extractor is implemented with full adders, as shown in
Fig. 3.3.
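Behaviorally, the full-adder tree of Fig. 3.3 simply counts set bits. A
one-function sketch (our model, not the circuit itself):

    def ones_count_param(word, n_bits=14):
        """Ones-count parameter extractor: the parameter is the number of
        '1' bits, computed by a full-adder tree in the actual circuit."""
        return sum((word >> i) & 1 for i in range(n_bits))

    print(ones_count_param(0b01001101))  # 4, matching the example above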
3.3.1 Mathematical Analysis:
Fig. 3.3: 14-bit ones-count parameter extractor
For 14-bit input data there are 2^14 possible input words, and the number of
inputs mapping to the same parameter under the ones-count approach is
C(14, n), where n is the ones count (from 0 to 14). The average probability
that a parameter occurs can then be determined by

P(n) = C(14, n) / 2^14 × 100%.
TABLE III
Number of data related to the same parameter and average probabilities for
the ones-count approach
Parameter (ones count)   Number of data C(14, n)   Average probability
0                        1                         0.006%
1                        14                        0.09%
2                        91                        0.56%
3                        364                       2.22%
4                        1001                      6.11%
5                        2002                      12.22%
6                        3003                      18.33%
7                        3432                      20.95%
8                        3003                      18.33%
9                        2002                      12.22%
10                       1001                      6.11%
11                       364                       2.22%
12                       91                        0.56%
13                       14                        0.09%
14                       1                         0.006%
Table III lists the number of data related to the same parameter and their
average probabilities for 14-bit input data. For example, if a match occurs in
the first part of the comparison with the parameter 2, the maximum number
of required comparison operations in the second part is C(14, 2) = 91. With
conventional CAMs, the comparison circuit must compare all stored data,
whereas with the ones-count PB-CAM a large amount of unmatched data is
filtered out up front, reducing comparison operations and power consumption
in some cases. However, the average probabilities of some parameters, such
as 0, 1, 2, 12, 13, and 14, are less than 1%.
In Table III, the parameters requiring over 2000 comparison operations lie between 5 and 9, and the sum of the average probabilities for these parameters is close to 82%. Thus, although the ones-count PB-CAM requires fewer comparison operations than a conventional CAM, it fails to reduce the number of second-part comparisons precisely when the parameter value is between 5 and 9, and therefore still consumes a large amount of power in most searches. From Table III we can also see that random input patterns give the ones-count parameter a Gaussian (binomial) distribution, and this distribution limits any further reduction of the comparison operations in PB-CAMs.
3.4 Block-XOR Approach:
The key idea behind this method is to reduce the number of comparison
operations by eliminating the Gaussian distribution. For 14-bit input data, if the input words could be distributed uniformly over the parameters, then the number of input words related to each parameter, and hence the maximum number of comparison operations required in the second part, would be ⌈2^14/15⌉ = 1093. Compared with the ones-count approach, this would reduce the comparison operations by between 909 and 2339 (i.e., for parameter values from 5 to 9) in 82% of the cases. Based on these
observations, a new parameter extractor called Block-XOR, which is shown
in Fig.3.4, is used to achieve the previous requirement.
In this approach, we first partition the input data bits into several blocks, and for each block an output bit is computed using an XOR operation. The output bits are then combined to form the parameter for the second part of the comparison process. For a fair comparison with the ones-count approach, we set the bit length of the parameter to ⌈log₂(n+2)⌉, where n is the bit length of the input data.
Fig. 3.4: Concept of the n-bit Block-XOR block diagram.
Therefore, the number of blocks in this approach is ⌈n/⌈log₂(n+2)⌉⌉. Taking the 14-bit input length as an example, the bit length of the parameter is ⌈log₂(14+2)⌉ = 4 bits, and the number of blocks is ⌈14/4⌉ = 4. Accordingly, every block contains 4 bits except the last one, which contains the remaining 2 bits, as shown in the upper part of Fig. 3.5.
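The block arithmetic for the 14-bit case can be summarized as

    \lceil \log_2(14+2) \rceil = 4 \text{ parameter bits}, \qquad \lceil 14/4 \rceil = 4 \text{ blocks } (4 + 4 + 4 + 2 = 14 \text{ bits}).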
Fig. 3.5: Structure of the Block-XOR approach with valid bit.
The select signal is defined as

    S = A_3 \cdot A_2 \cdot A_1 \cdot A_0        (2)
According to (2), if the parameter is in the range "0000" to "1110" (S = '0'), the multiplexer transmits the original parameter unchanged. Otherwise (A3A2A1A0 = "1111", S = '1'), the first block of the input data becomes the new parameter, and "1111" can then be used as the valid bit. The case where the first block is itself "1111" need not be considered, because the XOR of a "1111" block produces a '0' in the corresponding parameter bit, so the raw parameter can never be "1111" in that case.
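The xor1 entity listed in Chapter 5 computes only the raw Block-XOR parameter; the valid-bit multiplexer of Fig. 3.5 can be appended as in the following sketch (the entity and signal names are ours, added only to make the selection logic of (2) concrete):

-- Illustrative sketch: Block-XOR parameter with valid-bit substitution.
library ieee;
use ieee.std_logic_1164.all;

entity xor1_vb is
  port (
    i : in  std_logic_vector(13 downto 0);
    s : out std_logic_vector(3 downto 0)
  );
end xor1_vb;

architecture beh of xor1_vb is
  signal p  : std_logic_vector(3 downto 0);  -- raw Block-XOR parameter
  signal se : std_logic;                     -- select signal S of (2)
begin
  -- raw parameter, block widths 4/4/4/2 as in Fig. 3.5
  p(3) <= i(13) xor i(12) xor i(11) xor i(10);
  p(2) <= i(9)  xor i(8)  xor i(7)  xor i(6);
  p(1) <= i(5)  xor i(4)  xor i(3)  xor i(2);
  p(0) <= i(1)  xor i(0);
  -- "1111" is reserved as the valid code; substitute the first block.
  -- If the first block were "1111", its XOR would force p(3) = '0',
  -- so the substitution can never reproduce the reserved code.
  se <= p(3) and p(2) and p(1) and p(0);
  s  <= i(13 downto 10) when se = '1' else p;
end beh;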
3.4.1 Mathematical Analysis:
The concept of the Block-XOR approach is to distribute the parameters uniformly over the input data. By the rule of product, the number of input words that produce the same parameter (without the valid bit) is 8 × 8 × 8 × 2 = 1024. Consequently, the average probability is 1024/(1024 × 16) × 100% = 6.25%, and the maximum number of comparison operations in the second part is 1024 for each parameter. The Block-XOR approach thus clearly reduces the comparison operations and hence the power consumption.
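The rule-of-product count follows because, for any fixed output bit, exactly half of each block's input combinations produce it:

    \frac{2^4}{2} \times \frac{2^4}{2} \times \frac{2^4}{2} \times \frac{2^2}{2} = 8 \times 8 \times 8 \times 2 = 1024 = \frac{2^{14}}{16}.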
Table IV lists the number of input words that produce the same parameter for the proposed Block-XOR PB-CAM (i.e., with the valid bit). When the parameter is "1111", the new parameter is provided by the first block with an output bit of '1', so the number of input words for those parameters is 1024 + (1024/8) = 1152, and the average probability is (1152/(1024 × 7 + 1152 × 8)) × 100% = 7.03%. As can be seen from Tables III and IV, the Block-XOR PB-CAM results in at least 850 fewer comparison operations in 82% of the cases.
TABLE IV
NUMBER OF DATA RELATED TO THE SAME PARAMETER AND AVERAGE PROBABILITIES FOR THE BLOCK-XOR APPROACH
In other words, in most cases the Block-XOR PB-CAM requires far fewer comparison operations than the ones-count approach for parameter values between 5 and 9. For example, when the parameter is 7, the proposed Block-XOR PB-CAM requires 2284 fewer comparison operations than the ones-count approach.
3.5 Comparison Between the Two Approaches:
To eliminate the Gaussian distribution, we distribute the parameters uniformly over the input data. However, as can be seen from Tables III and IV, when the parameter is 0, 1, 2, 3, 4, 10, 11, 12, 13, or 14, the ones-count approach requires fewer comparison operations than the Block-XOR PB-CAM. Although the Block-XOR PB-CAM is better than the ones-count PB-CAM only for parameters between 5 and 9, the probability that these parameters occur is 82%.
For example, when the parameter is 7, there is a 20.95% chance that the Block-XOR PB-CAM saves more than 2280 comparison operations compared with the ones-count approach. Overall, the number of comparison operations is reduced by more than 1000 in most cases; in other words, the ones-count approach is better than the Block-XOR approach in only 18% of the cases.
The number of comparison operations required for input bit lengths of 4, 8, 14, 16, and 32 bits is shown in Fig. 3.6. As can be seen, the Block-XOR PB-CAM becomes more effective at reducing the number of comparison operations as the input bit length increases: the longer the input, the fewer the comparison operations required, and hence the greater the power reduction.
Fig. 3.6: Comparison operations for different input bit lengths.
3.6 Gate-Block Selection Algorithm:
To make the parameter extractor of the Block-XOR PB-CAM more useful for specific data types, we take the different characteristics of logic gates into account and synthesize parameter extractors tailored to different data types. As can be seen in Fig. 3.5, if the number of input bits of each partition block is set to l, the bit length of the parameter (i.e., the number of blocks) is ⌈n/l⌉, where n is the bit length of the input data, and the number of levels in each partition block is ⌈log₂ l⌉. We observe that as the number of input bits per partition block decreases, the mismatch rate and the number of comparison operations in each data comparison decrease, because the number of possible parameter values increases. However, although a longer parameter decreases the mismatch rate and the number of comparison operations, it also enlarges the parameter memory and therefore increases the power the parameter memory consumes. As stated earlier, when the PB-CAM performs a search operation it must compare the entire parameter memory. To avoid wasting a large amount of power in the parameter memory, we set the input of each partition block to 8 bits. Fig. 3.7 shows the proposed parameter extractor architecture. We first partition the input data bits into several blocks; G0~G6 in each block stand for different logic gates, from which an output bit is computed using the synthesized logic operation for each block. The output bits are then combined to form the parameter for the data comparison process.
The objective of our work is to select the proper logic gates in Fig. 3.7 so that the resulting parameter (Pk−1, Pk−2, …, P0) reduces the number of data comparison operations as much as possible.
Fig. 3.7: n-bit block diagram of the proposed parameter extractor
architecture.
In our proposed parameter extractor, the bit length of the parameter is set to ⌈n/8⌉, and the number of levels in each partition block is ⌈log₂ 8⌉ = 3. Suppose that we use the six basic logic gates (AND, OR, XOR, NAND, NOR, and XNOR) to synthesize a parameter extractor for a specific data type. Since each 8-bit block contains seven 2-input gate positions, there are (6^7)^⌈n/8⌉ different logic combinations of the proposed parameter extractor. Obviously, the optimal combination of the parameter extractor cannot be found in polynomial time.
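To see the scale of this search space, consider a 16-bit input, where both partition blocks are full 8-bit blocks:

    (6^7)^{\lceil 16/8 \rceil} = 6^{14} \approx 7.8 \times 10^{10} \text{ combinations},

far too many to enumerate exhaustively.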
To synthesize a proper parameter extractor in polynomial time for a specific data type, we propose a gate-block selection algorithm that finds an approximately optimal combination. The mathematical analysis below illustrates how to select proper logic gates to synthesize a parameter extractor for a specific data type.
3.6.1 Mathematical Analysis:
For a 2-input logic gate, let p be the probability that the output signal Y is in the one state. The probability mass function of the output signal Y is then given by

    P(Y = y) = p^{y} (1 - p)^{1-y}, \quad y \in \{0, 1\}        (3)
Assuming the inputs are independent, if we use any 2-input logic gate as a parameter extractor to generate the parameter for 2-bit data, then the average number of comparison operations the PB-CAM requires in each data search operation can be formulated as

    C_{avg} = \frac{N_0}{N_0 + N_1} \times N_0 + \frac{N_1}{N_0 + N_1} \times N_1        (4)
where N0 is the number of zero entries and N1 the number of one entries among the generated parameters. To illustrate, Table V gives an example.
TABLE V
Suppose that a 2-input AND gate is used to generate the parameter. With the Table V data, the AND gate maps five of the six entries to '0' (N0 = 5, N1 = 1), so the average number of comparison operations in each data search operation for the PB-CAM can be derived as

    C_{avg} = \frac{5}{6} \times 5 + \frac{1}{6} \times 1 = \frac{26}{6} \approx 4.33
In other words, when a 2-input AND gate is used to generate the parameter for this 2-bit data, the average number of comparison operations required for each data search operation in the PB-CAM is 4.33. Using Equation (4), Table V lists the average number of comparison operations for all six basic logic gates. Clearly, OR and NOR are the best selections in this case, because they require the smallest average number of comparison operations (3, corresponding to N0 = N1 = 3). Moreover, complementary gate pairs (AND/NAND, OR/NOR, and XOR/XNOR) produce parameters that require the same average number of comparison operations per search. To reduce the complexity of the proposed algorithm without sacrificing the performance of the parameter extractor, our approach therefore selects only NAND, NOR, and XOR gates when synthesizing the parameter extractor, since NAND and NOR are better than AND and OR in terms of area, power, and speed. Based on this mathematical analysis, Fig. 3.8 shows our proposed gate-block selection algorithm.
Algorithm to Select Proper Logic Gates For Specific Data:
Fig. 3.8: Gate-Block Selection Algorithm.
Note that when the input is random, the synthesized result is the same as the Block-XOR approach. In other words, the Block-XOR approach is a subset of our proposed algorithm.
To better understand the proposed approach, consider the simple example illustrated in Fig. 3.9, where 4-bit data are used as input. Because the input is only 4 bits, we set the number of input bits of each partition block to 4, so the number of levels in each partition block is ⌈log₂ 4⌉ = 2. First, we use the different logic gates (NAND, NOR, and XOR) to generate the parameter for D1D0, and record the generated parameter for each pattern as shown in Fig. 3.9(a).
Fig. 3.9: An example for synthesis of the parameter extractor.
Then, according to Equation (4), we calculate the average number of comparison operations Cavg for each logic gate. The NAND gate is clearly the best selection for D1D0, so it is chosen as part of the parameter extractor. Similarly, a NOR gate is selected to generate the parameter for D3D2 (see Fig. 3.9(b)). At this point the parameter is 2 bits wide, which is greater than ⌈4/4⌉ = 1 (the expected parameter width), so, following the proposed algorithm, we repeat Step 1 to Step 3 to determine the parameter bits for the next level. For Y1Y0, as shown in Fig. 3.9(c), the XOR gate is the best choice. The generated parameter is now only 1 bit, which is no longer greater than ⌈4/4⌉, so the synthesis of the parameter extractor is complete. Finally, Fig. 3.9(d) shows the synthesized parameter extractor for this input data.
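The synthesized extractor of Fig. 3.9(d) is small enough to write out directly; the following sketch (entity name ours) captures the NAND and NOR first-level gates and the XOR second-level gate selected by the algorithm:

-- Parameter extractor synthesized in the Fig. 3.9 example:
-- NAND for D1D0, NOR for D3D2, XOR at the second level.
library ieee;
use ieee.std_logic_1164.all;

entity pe_example is
  port (
    d : in  std_logic_vector(3 downto 0);  -- 4-bit input data D3..D0
    p : out std_logic                      -- 1-bit parameter
  );
end pe_example;

architecture beh of pe_example is
  signal y0, y1 : std_logic;
begin
  y0 <= d(1) nand d(0);  -- level 1: NAND selected for D1D0
  y1 <= d(3) nor  d(2);  -- level 1: NOR selected for D3D2
  p  <= y1 xor y0;       -- level 2: XOR selected for Y1Y0
end beh;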
Chapter 4
TOOLS UTILIZED
4.1 Xilinx ISE:
There are several EDA (Electronic Design Automation) tools available for circuit synthesis, implementation, and simulation using VHDL. Some tools (place and route, for example) are offered as part of a vendor's design suite (e.g., Altera's Quartus II, which allows the synthesis of VHDL code onto Altera's CPLD/FPGA chips, or Xilinx's ISE suite, for Xilinx's CPLD/FPGA chips). Other tools (synthesizers, for example), besides being offered as part of the design suites, can also be provided by specialized EDA companies (Mentor Graphics, Synopsys, Synplicity, etc.). Examples of the latter group are Leonardo Spectrum (a synthesizer from Mentor Graphics), Synplify (a synthesizer from Synplicity), and ModelSim (a simulator from Model Technology, a Mentor Graphics company). The designs presented here were synthesized onto CPLD/FPGA devices from either Altera or Xilinx. The tools used were either ISE combined with ModelSim (for Xilinx chips), MaxPlus II combined with Advanced Synthesis Software, or Quartus II. Leonardo Spectrum was also used occasionally.
Although different EDA tools were used to implement and test the examples presented in the design, we decided to standardize the visual presentation of all simulation graphs. Due to its clean appearance, the waveform editor of MaxPlus II was employed. However, newer simulators like ISE + ModelSim and Quartus II offer a much broader set of features, which allow, for example, a more refined timing analysis. For that reason, those tools were adopted when examining the fine details of each design.
The Xilinx Integrated Software Environment (ISE) is a powerful and complex
set of tools. First, the HDL files are synthesized. Synthesis is the process of
converting behavioral HDL descriptions into a network of logic gates. The
synthesis engine takes as input the HDL design files and a library of
primitives. Primitives are not necessarily just simple logic gates like AND and
OR gates and D-registers, but can also include more complicated things
such as shift registers and arithmetic units. Primitives also include
specialized circuits such as DLLs that cannot be inferred by behavioral HDL
code and must be explicitly instantiated. The libraries guide in the Xilinx documentation provides a complete description of every primitive available in the Xilinx library. (Note that, while there are occasions when it is helpful or even necessary to explicitly instantiate primitives, it is much better design practice to write behavioral code whenever possible, as the short sketch below illustrates.)
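As a minimal illustration of behavioral inference, the following generic description (ours, not part of the project code) is mapped by the synthesis engine to a single D flip-flop without any explicit primitive instantiation:

-- Behavioral description from which the synthesizer infers a D-register.
library ieee;
use ieee.std_logic_1164.all;

entity dreg is
  port (
    clk, d : in  std_logic;
    q      : out std_logic
  );
end dreg;

architecture beh of dreg is
begin
  process (clk)
  begin
    if rising_edge(clk) then  -- clocked assignment infers a flip-flop
      q <= d;
    end if;
  end process;
end beh;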
We will be using the Xilinx-supplied synthesis engine known as XST. XST takes as input a Verilog (.v) or VHDL (.vhd) file and generates a .ngc file. A synthesis report file (.srp) is also generated, which describes the logic inferred for each part of the HDL file and often includes helpful warning messages.
The .ngc file is then converted to an .ngd file. (This step mostly exists to accommodate different design entry methods, such as third-party synthesis tools or direct schematic entry; whatever the entry method, the result is an .ngd file.)
The .ngd file is essentially a netlist of primitive gates, which could be implemented on any one of a number of types of FPGA devices Xilinx manufactures. The next step is to map the primitives onto the types of resources (logic cells, I/O cells, etc.) available in the specific FPGA being targeted. The output of the Xilinx map tool is an .ncd file.
The design is then placed and routed, meaning that the resources described
in the .ncd file are then assigned specific locations on the FPGA, and the
connections between the resources are mapped into the FPGA's interconnect network. The delays associated with interconnect on a large
FPGA can be quite significant, so the place and route process has a large
impact on the speed of the design. The place and route engine attempts to
honor timing constraints that have been added to the design, but if the
constraints are too tight, the engine will give up and generate an
implementation that is functional, but not capable of operating as fast as
desired. Be careful not to assume that just because a design was successfully placed and routed, it will operate at the desired clock rate.
The output of the place and route engine is an updated .ncd file, which
contains all the information necessary to implement the design on the
chosen FPGA. All that remains is to translate the .ncd file into a
configuration bit stream in the format recognized by the FPGA programming
tools. Then the programmer is used to download the design into the FPGA,
or write the appropriate files to a compact flash card, which is then used to
configure the FPGA.
By itself, a Verilog model seldom captures all of the important attributes of a
complete design. Details such as i/o pin mappings and timing constraints
can't be expressed in Verilog, but are nonetheless important considerations
when implementing the model on real hardware. The Xilinx tools allow these constraints to be defined in several places, the two most notable being a separate user constraints file (.ucf) and special comments within the HDL source.
Xilinx has two main FPGA families: the high-performance Virtex series and the high-volume Spartan series, with a lower-cost EasyPath option for ramping to volume production. It also manufactures two CPLD lines, the CoolRunner and the 9500 series. Each model series has been released in multiple generations since its launch.
The latest Virtex-6 and Spartan-6 FPGA families are said to consume 50
percent less power, cost 20 percent less, and have up to twice the logic
capacity of previous generations of FPGAs.
4.1.1 Spartan Family:
The Spartan series targets high-volume applications with a low-power footprint and extreme cost sensitivity, such as displays, set-top boxes, wireless routers, and similar equipment.
The Spartan-6 family is built on a 45-nanometer (nm), 9-metal layer, dual-
oxide process technology. The Spartan-6 was marketed in 2009 as a low-
cost solution for automotive, wireless communications, flat-panel display and
video surveillance applications.
The Spartan-3A consumes 70-90 percent less power in suspend mode and 40-50 percent less static power compared to standard devices. In addition, the dedicated DSP circuitry integrated into the Spartan series has an inherent power advantage of approximately 25 percent over competing low-power FPGAs.
4.1.2 Virtex Family:
The Virtex series of FPGAs targets applications such as wired and wireless infrastructure equipment, advanced medical equipment, test and measurement, and defense systems. In addition to FPGA logic, the Virtex series includes embedded fixed-function hardware for commonly used functions such as multipliers, memories, serial transceivers, and microprocessor cores.
The Virtex-6 family is built on a 40-nm process for compute-intensive electronic systems, and the company claims it consumes 15 percent less power and has 15 percent better performance than competing 40-nm FPGAs. Older-generation devices such as the Virtex, Virtex-II, and Virtex-II Pro are also still available, although their functionality is largely superseded by the Virtex-4 and Virtex-5 FPGA families. The Virtex-II Pro family was the first to combine PowerPC embedded technology (including single and multiple PowerPC 405 processor cores) and integrated serial transceivers (up to 3.125 Gbit/s in Virtex-II Pro and up to 10.3125 Gbit/s in Virtex-II Pro X). The Virtex-4 series was introduced in 2004 and was manufactured on a 1.2-V, 90-nm, triple-oxide process technology. It introduced the Advanced Silicon Modular Block (ASMBL) architecture, enabling FPGA platforms with combinations of features to support logic (LX), embedded processing and connectivity (FX), and digital signal processing (SX).
Chapter 5
VHDL Code for the different parameter extracting
approaches
1).Coding For One’s Count Approach:
CAM BLOCKONES CODING
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;
entity camblockones is
port (
data : in std_logic_vector(13 downto 0);
clk : in std_logic;
we : in std_logic;
wei : in std_logic;
addr : out std_logic_vector(3 downto 0)
);
End camblockones;
Architecture beh of camblockones is
component datamem
port (
clk : in std_logic;
ad : in std_logic_vector(3 downto 0);
--data : inout std_logic_vector(13 downto 0);
data : in std_logic_vector(13 downto 0);
data_out : out std_logic_vector(13 downto 0);
we : in std_logic
);
End component;
component onescnt
port (
i : in std_logic_vector(13 downto 0);
s : out std_logic_vector(3 downto 0)
);
End component;
component cammem
port (
clk : in std_logic; -- Clock Input
-- address : inout std_logic_vector (3 downto 0);
address : in std_logic_vector (3 downto 0);
address_out : out std_logic_vector (3 downto 0);
data : in std_logic_vector (13 downto 0);
we : in std_logic
);
end component;
Signal ex,exi : std_logic_vector(3 downto 0);
Signal dat : std_logic_vector(13 downto 0);
begin
m0 : onescnt port map(data,ex);
m1 : datamem port map(clk,ex,data,dat,we);
m2 : cammem port map (clk,ex,exi,data,wei);
addr <= exi;
end beh;
CAM DATA MEMORY CODING
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;
entity datamem is
port (
clk : in std_logic;
ad : in std_logic_vector(3 downto 0);
--data : inout std_logic_vector(13 downto 0);
data : in std_logic_vector(13 downto 0);
data_out : out std_logic_vector(13 downto 0);
we : in std_logic
);
End datamem;
Architecture beh of datamem is
----------------Internal variables----------------
constant DEPTH :integer := 16;
--signal data_out :std_logic_vector (13 downto 0):=(others=>'0');
type cmem is array (integer range <>)of std_logic_vector (13 downto 0);
signal mem : cmem (0 to 15);
begin
-- data <= data_out when (we = '0') else (others=>'Z');
-- Memory Write Block
-- Write Operation : When we = 1, cs = 1
MEM_WRITE:
process (clk) begin
if (rising_edge(clk)) then
if we = '1' then
mem(conv_integer(ad)) <= data;
end if;
end if;
end process;
-- Memory Read Block
-- Read Operation : When we = 0, oe = 1, cs = 1
MEM_READ:
process (clk) begin
if (rising_edge(clk)) then
if we = '0' then
data_out <= mem(conv_integer(ad));
end if;
end if;
end process;
end beh;
CAM MEMORY CODING
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;
entity cammem is
port (
clk :in std_logic; -- Clock Input
-- address :inout std_logic_vector (3 downto 0);
address :in std_logic_vector (3 downto 0);
address_out :out std_logic_vector (3 downto 0);
data :in std_logic_vector (13 downto 0);
we : in std_logic
);
end cammem;
Architecture beh of cammem is
----------------Internal variables----------------
constant CAM_DEPTH :integer := 2**14;
-- signal address_out :std_logic_vector (3 downto 0):=(others=>'0');
type CAM is array (integer range <>)of std_logic_vector (3 downto 0);
signal mem : CAM (0 to CAM_DEPTH-1);
begin
-- address <= address_out when ( we = '0') else (others=>'Z');
MEM_WRITE:
process (clk) begin
if (rising_edge(clk)) then
if we = '1' then
mem(conv_integer(data)) <= address; -- conversion: data word used as integer index
end if;
end if;
end process;
MEM_READ:
process (clk) begin
if (rising_edge(clk)) then
if we = '0' then
address_out <= mem(conv_integer(data));
end if;
end if;
end process;
end beh;
CAM ONES COUNT CODING
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;
entity onescnt is
port (
i: in std_logic_vector(13 downto 0);
s: out std_logic_vector(3 downto 0)
);
End onescnt;
Architecture beh of onescnt is
component fa
port (
a : in std_logic;
b : in std_logic;
c : in std_logic;
sum : out std_logic;
cout : out std_logic
);
End component;
Signal w : std_logic_vector(29 downto 0);
begin
m0 : fa port map(i(2),i(1),i(0),w(0),w(1));
m1 : fa port map(i(5),i(4),i(3),w(2),w(3));
m2 : fa port map(i(8),i(7),i(6),w(4),w(5));
m3 : fa port map(i(11),i(10),i(9),w(6),w(7));
m4 : fa port map('0',i(13),i(12),w(8),w(9));
m5 : fa port map(w(4),w(2),w(0),w(10),w(11));
m6 : fa port map('0',w(8),w(6),w(12),w(13));
m7 : fa port map(w(3),w(1),'0',w(14),w(15));
m8 : fa port map(w(9),w(7),w(5),w(16),w(17));
m9 : fa port map('0',w(12),w(10),s(0),w(18));
m10 : fa port map(w(13),w(11),'0',w(19),w(20));
m11 : fa port map('0',w(16),w(14),w(21),w(22));
m12 : fa port map(w(17),w(15),'0',w(23),w(24));
m13 : fa port map(w(18),w(21),w(19),s(1),w(25));
m14 : fa port map(w(22),w(23),w(20),w(26),w(27));
m15 : fa port map(w(26),w(25),'0',s(2),w(28));
w(29)<= w(24) or w(27);
s(3)<= w(29) or w(28);
end beh;
FULL ADDER CODING
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_unsigned.all;
entity fa is
port (
a : in std_logic;
b : in std_logic;
c : in std_logic;
sum : out std_logic;
cout : out std_logic);
end fa;
architecture Behavioral of fa is
begin
sum <= (a xor b xor c);
cout <= ((a and b)or (b and c)or(a and c));
end Behavioral;
2).Coding For XOR Approach:
CAM BLOCKXOR CODING
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;
entity camblockxor is
port (
data : in std_logic_vector(13 downto 0);
clk : in std_logic;
we : in std_logic;
wei : in std_logic;
addr : out std_logic_vector(3 downto 0)
);
End camblockxor;
Architecture beh of camblockxor is
component datamem
port (
clk : in std_logic;
ad : in std_logic_vector(3 downto 0);
--data : inout std_logic_vector(13 downto 0);
data : in std_logic_vector(13 downto 0);
data_out : out std_logic_vector(13 downto 0);
we : in std_logic
);
End component;
component xor1
port (
i : in std_logic_vector(13 downto 0);
s : out std_logic_vector(3 downto 0)
);
End component;
component cammem
port (
clk : in std_logic; -- Clock Input
-- address : inout std_logic_vector (3 downto 0);
address : in std_logic_vector (3 downto 0);
address_out : out std_logic_vector (3 downto 0);
data : in std_logic_vector (13 downto 0);
we : in std_logic
);
end component;
Signal ex,exi : std_logic_vector(3 downto 0);
Signal dat : std_logic_vector(13 downto 0);
begin
m0 : xor1 port map(data,ex);
m1 : datamem port map(clk,ex,data,dat,we);
m2 : cammem port map (clk,ex,exi,data,wei);
addr <= exi;
end beh;
CAM DATA MEMORY CODING
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;
entity datamem is
port (
clk : in std_logic;
ad : in std_logic_vector(3 downto 0);
--data : inout std_logic_vector(13 downto 0);
data : in std_logic_vector(13 downto 0);
data_out : out std_logic_vector(13 downto 0);
we : in std_logic
);
End datamem;
Architecture beh of datamem is
----------------Internal variables----------------
constant DEPTH :integer := 16;
--signal data_out :std_logic_vector (13 downto 0):=(others=>'0');
type cmem is array (integer range <>)of std_logic_vector (13 downto 0);
signal mem : cmem (0 to 15);
begin
-- data <= data_out when (we = '0') else (others=>'Z');
-- Memory Write Block
-- Write Operation : When we = 1, cs = 1
MEM_WRITE:
process (clk) begin
if (rising_edge(clk)) then
if we = '1' then
mem(conv_integer(ad)) <= data;
end if;
end if;
end process;
-- Memory Read Block
-- Read Operation : When we = 0, oe = 1, cs = 1
MEM_READ:
process (clk) begin
if (rising_edge(clk)) then
if we = '0' then
data_out <= mem(conv_integer(ad));
end if;
end if;
end process;
end beh;
CAM MEMORY CODING
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;
entity cammem is
port (
clk :in std_logic; -- Clock Input
-- address :inout std_logic_vector (3 downto 0);
address :in std_logic_vector (3 downto 0);
address_out :out std_logic_vector (3 downto 0);
data :in std_logic_vector (13 downto 0);
we : in std_logic
);
end cammem;
Architecture beh of cammem is
----------------Internal variables----------------
constant CAM_DEPTH :integer := 2**14;
-- signal address_out :std_logic_vector (3 downto 0):=(others=>'0');
type CAM is array (integer range <>)of std_logic_vector (3 downto 0);
signal mem : CAM (0 to CAM_DEPTH-1);
begin
-- address <= address_out when ( we = '0') else (others=>'Z');
MEM_WRITE:
process (clk) begin
if (rising_edge(clk)) then
if we = '1' then
mem(conv_integer(data)) <= address; -- conversion: data word used as integer index
end if;
end if;
end process;
MEM_READ:
process (clk) begin
if (rising_edge(clk)) then
if we = '0' then
address_out <= mem(conv_integer(data));
end if;
end if;
end process;
end beh;
CAM XOR CODING
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_unsigned.all;
entity xor1 is
port (
i : in std_logic_vector(13 downto 0);
s: out std_logic_vector(3 downto 0));
end xor1;
architecture behavioral of xor1 is
signal a : std_logic_vector (5 downto 0);
begin
a(5) <= i(13)xor i(12);
a(4) <= i(11)xor i(10);
a(3) <= i(9)xor i(8);
a(2) <= i(7)xor i(6);
a(1) <= i(5)xor i(4);
a(0) <= i(3)xor i(2);
s(3) <= a(5)xor a(4);
s(2) <= a(3)xor a(2);
s(1) <= a(1)xor a(0);
s(0) <= i(1)xor i(0);
end behavioral;
3).Coding For Gate-Block Selection Approach :
CAM GATE BLOCK SELECTION
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;
entity camblockgb is
port (
data : in std_logic_vector(13 downto 0);
clk : in std_logic;
we : in std_logic;
wei : in std_logic;
addr : out std_logic_vector(3 downto 0)
);
End camblockgb;
Architecture beh of camblockgb is
component datamem
port (
clk : in std_logic;
ad : in std_logic_vector(3 downto 0);
--data : inout std_logic_vector(13 downto 0);
data : in std_logic_vector(13 downto 0);
data_out : out std_logic_vector(13 downto 0);
we : in std_logic
);
End component;
component gbextractor
port (
i : in std_logic_vector(13 downto 0);
ex : out std_logic_vector(3 downto 0)
);
End component;
component cammem
port (
clk : in std_logic; -- Clock Input
-- address : inout std_logic_vector (3 downto 0);
address : in std_logic_vector (3 downto 0);
address_out : out std_logic_vector (3 downto 0);
data : in std_logic_vector (13 downto 0);
we : in std_logic
);
end component;
Signal ex,exi : std_logic_vector(3 downto 0);
Signal dat : std_logic_vector(13 downto 0);
begin
m0 : gbextractor port map(data,ex);
m1 : datamem port map(clk,ex,data,dat,we);
m2 : cammem port map (clk,ex,exi,data,wei);
addr <= exi;
end beh;
GATE BLOCK CAM MEMORY CODING
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;
entity cammem is
port (
clk :in std_logic; -- Clock Input
-- address :inout std_logic_vector (3 downto 0);
address :in std_logic_vector (3 downto 0);
address_out :out std_logic_vector (3 downto 0);
data :in std_logic_vector (13 downto 0);
we : in std_logic
);
end cammem;
Architecture beh of cammem is
----------------Internal variables----------------
constant CAM_DEPTH :integer := 2**14;
-- signal address_out :std_logic_vector (3 downto 0):=(others=>'0');
type CAM is array (integer range <>)of std_logic_vector (3 downto 0);
signal mem : CAM (0 to CAM_DEPTH-1);
begin
-- address <= address_out when ( we = '0') else (others=>'Z');
MEM_WRITE:
process (clk) begin
if (rising_edge(clk)) then
if we = '1' then
mem(conv_integer(data)) <= address; -- conversion: data word used as integer index
end if;
end if;
end process;
MEM_READ:
process (clk) begin
if (rising_edge(clk)) then
if we = '0' then
address_out <= mem(conv_integer(data));
end if;
end if;
end process;
end beh;
GATE BLOCK CAM DATA MEMORY CODING
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;
entity datamem is
port (
clk : in std_logic;
ad : in std_logic_vector(3 downto 0);
--data : inout std_logic_vector(13 downto 0);
data : in std_logic_vector(13 downto 0);
data_out : out std_logic_vector(13 downto 0);
we : in std_logic
);
End datamem;
Architecture beh of datamem is
----------------Internal variables----------------
constant DEPTH :integer := 16;
--signal data_out :std_logic_vector (13 downto 0):=(others=>'0');
type cmem is array (integer range <>)of std_logic_vector (13 downto 0);
signal mem : cmem (0 to 15);
begin
-- data <= data_out when (we = '0') else (others=>'Z');
-- Memory Write Block
-- Write Operation : When we = 1, cs = 1
MEM_WRITE:
process (clk) begin
if (rising_edge(clk)) then
if we = '1' then
mem(conv_integer(ad)) <= data;
end if;
end if;
end process;
-- Memory Read Block
-- Read Operation : When we = 0, oe = 1, cs = 1
MEM_READ:
process (clk) begin
if (rising_edge(clk)) then
if we = '0' then
data_out <= mem(conv_integer(ad));
end if;
end if;
end process;
end beh;
GATE BLOCK CAM EXTRACTOR CODING
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;
entity gbextractor is
port (
i : in std_logic_vector(13 downto 0);
ex : out std_logic_vector(3 downto 0)
);
End gbextractor;
Architecture beh of gbextractor is
signal s : std_logic_vector(5 downto 0);
signal a : std_logic_vector(3 downto 0);
signal se : std_logic;
Begin
a(0) <= i(0) nand i(1);
s(0) <= i(2) nor i(3);
s(1) <= i(4) nor i(5);
a(1) <= s(0) xor s(1);
s(2) <= i(6) nand i(7);
s(3) <= i(8) nand i(9);
a(2) <= s(2) xor s(3);
s(4) <= i(10) nor i(11);
s(5) <= i(12) nor i(13);
a(3) <= s(4) xor s(5);
se <= a(0) and a(1) and a(2) and a(3);
process (se, a, i)
begin
if se = '1' then
ex <= a(3 downto 0);
else
ex <= i(13 downto 10);
end if;
end process;
end beh;
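None of the listings above includes a testbench. A minimal simulation driver along the following lines (our own illustrative sketch; the stimulus word and timing are arbitrary) can be used to exercise the write and read sequences described in Chapter 6:

TESTBENCH SKETCH FOR THE GATE-BLOCK CAM
library ieee;
use ieee.std_logic_1164.all;

entity tb_camblockgb is
end tb_camblockgb;

architecture beh of tb_camblockgb is
  signal data : std_logic_vector(13 downto 0) := (others => '0');
  signal clk  : std_logic := '0';
  signal we   : std_logic := '1';
  signal wei  : std_logic := '1';
  signal addr : std_logic_vector(3 downto 0);
begin
  -- unit under test: the gate-block selection PB-CAM
  uut : entity work.camblockgb
    port map (data => data, clk => clk, we => we, wei => wei, addr => addr);

  clk <= not clk after 10 ns;  -- free-running clock, 20 ns period

  stim : process
  begin
    -- write phase: we = wei = '1' stores the word (builds the database)
    data <= "01001101000011";  -- arbitrary 14-bit test word
    wait for 20 ns;
    -- search phase: we = wei = '0' returns the matching address on addr
    we  <= '0';
    wei <= '0';
    wait for 40 ns;
    wait;                      -- end of stimulus
  end process;
end beh;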
Chapter 6
SIMULATION RESULTS
In this chapter the simulation results of the implemented modules are discussed along with their RTL schematics and output waveforms. This work is mainly concerned with the implementation of the low-power pre-computation-based CAM in VHDL. The developed VHDL code was also synthesizable, meaning it can be implemented in hardware (in an FPGA device). Every VHDL module in this work was synthesized using Xilinx ISE 9.2i, targeting the Xilinx Spartan-3 technology.
5.1 Results of the Ones-Count Parameter Extractor:
Fig. 5.1 shows the simulation output for the 14-bit ones-count parameter extractor, which counts the number of ones in the given input sequence and gives that count as its output. For a 14-bit input, the extracted output is ⌈log₂(14+2)⌉ = 4 bits wide. This output is termed the parameter.
RTL Schematics:
Results of CAM:
Fig. 5.8 shows the RTL schematic of the content addressable memory. It takes data as input and gives an address as output. Two control signals, we and wei, are used to enable the write and read operations. In this implementation the gate-block selection parameter extractor is used, as it consumes the least power for parameter extraction, which is the critical part of the search operation.
Fig. 5.8: RTL schematic of the CAM.
Fig. 5.9: Detailed RTL schematic of the CAM with parameter extractor.
Fig. 5.10: Detailed RTL schematic of the content addressable memory.
Initially the control signals are set to '1' (i.e., we = '1' and wei = '1') to write the data into the CAM, in other words to create the database. After writing the data, we can read the address corresponding to the data by setting the control signals to '0'. We can also read the data one clock cycle later by setting we = '0' and wei = '1'. The output results for writing data and for reading the address as well as the data are illustrated in Figs. 5.11 to 5.13.
Ones-Count Approach
Signals and Structure
The structure and signals of the simulated model for the ones-count approach, obtained after the complete simulation, are shown below.
CAM Memory
1) Data Memory
Waveforms After Simulation
The waveforms shown below illustrate the status of, and changes in, the memory locations after each run.
Fig. 5.11: VHDL output showing the data written into the CAM.
Fig. 5.12: VHDL output showing the data read from the CAM.
Fig. 5.13: VHDL output showing the address read from the CAM.
Device Utilization Summary:

Logic Utilization          Used    Available    Utilization
Number of Slices            14        768            1%
Number of 4-input LUTs      26       1536            1%
Number of bonded IOBs       19        124           15%
Number of BRAMs              4          4          100%
Number of GCLKs              1          8           12%

Table 6
Table 7
The RTL schematic in Fig. 5.3 shows that full adders are used in the architecture, and these consume a large area, as reflected in Table 7.
Simulation Results of the Block-XOR Approach
Fig. 5.5 shows the RTL schematic for the 14-bit Block-XOR parameter extractor, which XORs the blocks of the given input sequence and gives the combined result as its output. For a 14-bit input, the extracted output is ⌈log₂(14+2)⌉ = 4 bits wide. This output is termed the parameter.
RTL Schematic
Fig. 5.5: Detailed RTL schematic of the Block-XOR parameter extractor.
Signals and Structure
The structure and signals of the simulated model for the Block-XOR approach, obtained after the complete simulation, are shown below:
Data Memory
CAM Memory
Block-XOR Approach
Waveforms After Simulation
Results of the Block-XOR Parameter Extractor:
Fig. 5.4 shows the simulation output for the 14-bit Block-XOR parameter extractor. Here the extraction is done by independent blocks. Compared to the ones-count parameter extractor it has less delay, takes less area, and is adaptive for general designs. The input is 14 bits and the extracted output (parameter) is 4 bits. If the parameter is "0000" to "1110" (Se = '0'), the parameter does not change. Otherwise, if the parameter is "1111" (Se = '1'), the first block of the input data becomes the new parameter, and "1111" can then be used as the valid bit.
The waveforms shown below illustrate the status of, and changes in, the memory locations after each run.
Fig. 5.4: VHDL output for the Block-XOR parameter extractor.
Device Utilization Summary:

Logic Utilization          Used    Available    Utilization
Number of Slices             2       1920            0%
Number of 4-input LUTs       4       3840            0%
Number of bonded IOBs       19         97           19%
Number of BRAMs              4          6           66%
Number of GCLKs              1          8           12%

Table 8
The RTL schematic in Fig. 5.5 shows that independent blocks are used for parameter extraction, and Table 8 shows that fewer devices are used for the implementation. The Block-XOR parameter extractor therefore takes less area and hence less power.
Simulation Results of the Gate-Block Selection Algorithm
Fig. 5.7 shows the RTL schematic for the 14-bit gate-selected parameter extractor, which computes the output sequence from the given input sequence by the predefined algorithm, passing the input through the selected gates. For a 14-bit input, the extracted output is ⌈log₂(14+2)⌉ = 4 bits wide. This output is termed the parameter.
Fig. 5.7: Detailed RTL schematic of the gate-block selection parameter extractor.
The RTL schematic in Fig. 5.7 shows that independent blocks are used for parameter extraction, and Table 10 shows that fewer devices are used for the implementation. The gate-block selection parameter extractor therefore takes less area and hence less power.
Waveforms After Simulation
Results of the Gate-Block Selection Algorithm:
Fig. 5.6 shows the simulation output for the 14-bit gate-block selection parameter extractor. Here the extraction is done by independent blocks. Compared to the ones-count parameter extractor it has less delay, takes less area, and is adaptive for general designs. The input is 14 bits and the extracted output (parameter) is 4 bits. If the parameter is "0000" to "1110" (Se = '0'), the parameter does not change. Otherwise, if the parameter is "1111" (Se = '1'), the first block of the input data becomes the new parameter, and "1111" can then be used as the valid bit.
Fig. 5.6: VHDL output for the gate-block selection parameter extractor.
Signals and Structure
The structure and signals of the simulated model for the gate-block selection approach, obtained after the complete simulation, are shown below:
Gate-Block Extractor
Data Memory
CAM Memory
Device Utilization Summary of CAM
Table 9
Table 9 gives the device utilization summary of the CAM with the gate-block selection parameter extractor. From the waveforms it is observed that initially the 14-bit data is loaded into the CAM. If a data word stored in the CAM is then applied as input, the CAM gives the address pointing to that data as output exactly one clock cycle later.
Table 10
Device Utilization Summary of the gate-block selection algorithm
Chapter 7
CONCLUSION
In this thesis a 14-bit low-power pre-computation-based content addressable memory (PB-CAM) is simulated in VHDL. Mathematical analysis and simulation results confirmed that the Block-XOR PB-CAM can effectively save power by reducing the number of comparison operations in the second part of the comparison process. In addition, it takes less area compared with the ones-count parameter extractor. This PB-CAM takes data as input and returns the corresponding address exactly one clock cycle later, so it is flexible and well suited to low-power, high-speed search applications.
In this thesis a gate-block selection algorithm was also proposed. The proposed algorithm can synthesize a proper parameter extractor of the PB-CAM for a specific data type. Mathematical analysis and simulation results confirmed that the proposed PB-CAM effectively saves power by reducing the number of comparison operations in the data comparison process. In addition, the proposed parameter extractor computes the parameter bits in parallel with only three logic-gate delays for any input bit length (i.e., a constant search-operation delay).