Transcript of "A High-Speed and Large-Scale Dictionary Matching Engine for Information Extraction Systems"
IBM Austin Research Lab
A High-Speed and Large-Scale Dictionary Matching Engine for Information Extraction Systems
Kanak Agarwal, Raphael Polig
IBM Corporation
Business Unit or Product Name
© 2003 IBM Corporation
Information Extraction (IE)
Task of extracting structured information from semi-structured and unstructured text
Tasks include finding entities, relationships between entities, attributes describing entities, etc.
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.
"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."
Richard Stallman, founder of the Free Software Foundation, countered saying…
NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..
Example from IE tutorial (Cohen)
Information Extraction Systems
IE is undergoing rapid growth in its enterprise use-cases
– Business intelligence, semantic search, brand management, customer sentiment analysis, news tracking, event tracking, etc.
Emerging as one of the key technologies in several Big Data and business analytics solutions
Enterprise adoption coupled with very large data volumes is forcing new constraints on the performance of IE systems
– Extraction throughput constraints
– Accuracy requirements
– System scalability
Pattern Matching in IE Systems
Dictionary matching and regular expression matching are the most important kernels in IE tasks
Dictionary matching involves matching strings extracted from text documents against a dictionary of known patterns
Example: use of dictionary matching in Named-Entity Recognition (NER)
– NER is the task of finding and classifying names (person, location, organization, etc.) in text
Text document:
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.
Richard Stallman countered saying…
[Figure: NER annotates the entity names in the text document above using gazetteers of Person, Location, and Organization names]
Dictionary Matching Kernel in IE
Gazetteers (or dictionaries) used in IE tasks are often extremely large
– May easily contain tens to hundreds of thousands of patterns
Dictionary matching in IE is a highly compute-intensive kernel for large dictionaries that limits the overall extraction throughput
This work: a hardware accelerator for off-loading dictionary matching operations to an FPGA-based co-processor
Design point chosen to allow matching of ~100K patterns at a throughput of ~10 Gbps
String Matching in Non-IE Environments
Dictionary (string) matching is also a commonly used operation in a variety of other applications
– Network Intrusion Detection Systems (NIDS), virus scanners, spam filters, etc.
Several approaches have been proposed for hardware acceleration of string matching
– Usually employ Deterministic Finite Automata (DFA) based algorithms such as the Aho-Corasick algorithm and its variants
– Implementations are generally constrained either by dictionary size or by maximum achievable throughput
Existing Hardware Solutions for String Matching
Ref: Scarpazza, D.P., et al., "Exact Multi-Pattern String Matching on Cell/B.E. Processor", 2008
Most of these solutions parse one character every cycle; they typically achieve higher bandwidth by processing multiple input streams in parallel
Dictionary Matching Use-Case in IE
Matching requirements in IE environments differ from generic string matching
– Based on tokenizing a text document (at whitespace, punctuation marks, etc.) and matching only at word token boundaries
Simplified use-model, as no partial matches have to be detected by the accelerator
– DFA-based approaches are generally overkill
DFA accelerators typically process a single character per cycle and usually require large state-transition tables
Our approach: a hashing-based solution
– Process a word token with multiple characters per cycle
– Key challenge: limit the RAM resource overhead of mapping the dictionary into an on-chip hash table
Generic Hashing
Generic hashing scheme for storing a dictionary in on-chip block RAMs
– Consider a dictionary with L_D = 2^N words
– Consider a hashing scheme which maps a dictionary word into an M-bit hash
  • The hash value is used as an index (M-bit RAM address) to store a dictionary word in a particular RAM location
  • For an M-bit hash, a RAM with L_M = 2^M slots (or words) is used to store the 2^N words
– Multiple dictionary words can map to the same hash value (and hence the same RAM slot), leading to hash collisions
  • The expected number of collisions depends on the relative sizes of the dictionary and the RAM
[Figure: dictionary words ("let's", "build", "a", "smarter", "planet") passed through a hash function into RAM slots, with two words landing in the same slot (hash collision)]
Hash Collisions
Expected number of hash collisions for hashing L_D words in L_M RAM slots
– Assuming a perfectly uniform random hash distribution
Hash collisions can be reduced by increasing the RAM size
– This approach incurs significant RAM overhead
– Limited on-chip RAM caps the size of the dictionary that can be hashed on-chip
C_D ≈ L_D^2 / (2 · L_M)
RAM Size         Expected # of Collisions
L_M = L_D        L_D/2
L_M = 2 · L_D    L_D/4
L_M = 3 · L_D    L_D/6
L_M = 4 · L_D    L_D/8
L_M = 5 · L_D    L_D/10
L_M = 6 · L_D    L_D/12
L_M = 7 · L_D    L_D/14
L_M = 8 · L_D    L_D/16
L_M = 9 · L_D    L_D/18
L_M = 10 · L_D   L_D/20
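The table values follow directly from the approximation above; a quick check in Python (a sketch — the function name is ours, and the formula is the uniform-random approximation, not an exact count):

```python
def expected_collisions(l_d: int, l_m: int) -> float:
    """Approximate expected collisions when hashing l_d words into
    l_m slots under a uniform random hash: C ~ l_d^2 / (2 * l_m)."""
    return l_d * l_d / (2 * l_m)

l_d = 64 * 1024  # the deck's 64K-word dictionary
for k in range(1, 11):
    c = expected_collisions(l_d, k * l_d)
    print(f"L_M = {k:2d}*L_D -> ~{c:7.1f} collisions (= L_D/{2 * k})")
```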
[Plot: Hash Collisions (64K Dictionary) — collisions vs. RAM size; 2M RAM slots are needed to get down to 1000 collisions, i.e. greater than 10X RAM overhead for less than 5% collisions]
Proposed Modified Hashing Technique
For a dictionary of size 2^N words:
– Create an N-bit hash: hash[N-1:0]
– Use N RAM arrays
  • The first RAM has the same size as the dictionary (2^N slots) and is addressed by the full hash[N-1:0]
  • The second RAM is half the size of the dictionary (2^(N-1) slots) and is indexed by hash[N-2:0], and so on
– Each dictionary word is stored in the first free RAM at its corresponding hash position
  • If a word causes hash collisions in all N RAMs, it is stored in a final unindexed register
[Figure: a dictionary of 2^N words is hashed with an N-bit hash; RAM 1 (2^N slots) is indexed by hash[N-1:0], RAM 2 (2^(N-1) slots) by hash[N-2:0], ..., RAM N (2 slots) by hash[0], followed by a final 1-slot register]
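A minimal software model of this cascade (a Python sketch; the class name and the MD5-derived hash are our illustrative choices, and the sequential loops only model lookups that happen in parallel in the hardware):

```python
import hashlib

def n_bit_hash(word: str, n: int) -> int:
    """Derive an n-bit hash from a word (MD5 here is illustrative only)."""
    digest = hashlib.md5(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") & ((1 << n) - 1)

class CascadeHashTable:
    """N RAMs of 2^N, 2^(N-1), ..., 2 slots plus one unindexed register.
    RAM i is indexed by the low (N - i) bits of the same N-bit hash."""

    def __init__(self, n: int):
        self.n = n
        self.rams = [[None] * (1 << (n - i)) for i in range(n)]
        self.overflow = None  # final 1-slot register

    def insert(self, word: str) -> bool:
        h = n_bit_hash(word, self.n)
        for ram in self.rams:
            slot = h & (len(ram) - 1)  # hash[N-1-i:0]
            if ram[slot] is None:      # first free RAM wins
                ram[slot] = word
                return True
        if self.overflow is None:      # collided in all N RAMs
            self.overflow = word
            return True
        return False                   # table genuinely full on this path

    def lookup(self, word: str) -> bool:
        h = n_bit_hash(word, self.n)
        for ram in self.rams:
            slot = h & (len(ram) - 1)
            if ram[slot] == word:
                return True
        return self.overflow == word

table = CascadeHashTable(8)
for w in ["let's", "build", "a", "smarter", "planet"]:
    table.insert(w)
print(table.lookup("planet"), table.lookup("moon"))  # True False
```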
Hash Collisions in Modified Hashing
Consider hash collisions in chain hashing for a 64K-word dictionary
– The expected number of collisions in the first RAM (64K words in 64K slots) should be 32K
– The colliding 32K words are sent to the second RAM (32K slots), resulting in an average of 16K collisions in RAM 2
– The progression continues, with the last RAM stage hashing on average 2 words in 2 slots, with potentially 1 collision that can be stored in the final unindexed register
The modified hashing technique achieves the following useful properties
– The probability of non-zero collisions is very low
– RAM overhead is limited to twice the size of the dictionary
– For a dictionary of size L_D, log(L_D) RAMs are used, so only a logarithmic number of RAM accesses / comparisons is required for dictionary matching
– All memory lookups / searches are performed in parallel in a single cycle and do not require any sequential search
[Figure: collision progression — the 2^N-word dictionary fills RAM 1 (2^N slots) with an expected 2^(N-1) colliding words, RAM 2 (2^(N-1) slots) with 2^(N-2) collisions, ..., RAM N (2 slots) with 2 words and 1 expected collision; the final 1-slot register holds 1 word with no further collision]
Hash Collision Experiments
Randomly generated one hundred 1000-word dictionaries
Store each dictionary in RAMs using the proposed hashing technique
– Compare modified hashing against standard hashing using the same total RAM (2K RAM slots for each 1000-word dictionary), with standard hashing using a single large RAM to store each dictionary
Modified (chain) hashing stores all dictionaries without ANY collisions, while standard hashing shows on average 200+ collisions per dictionary
[Figure: Modified Hashing (total RAM: 2K slots) — RAM 1 (1K slots), RAM 2 (512 slots), RAM 3 (256 slots), RAM 4 (128 slots), ..., RAM 10 (2 slots), with a histogram of slots occupied per RAM; Standard Hashing (total RAM: 2K slots) — histogram of collisions]
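This experiment is easy to reproduce in software; the sketch below uses random 32-bit integers as a stand-in for real word hashes, and counts a cascade "collision" as a word that falls through all 10 RAMs (i.e. what the final 1-slot register would have to absorb):

```python
import random

def collisions_standard(hashes, slots):
    """Words landing in an already-occupied slot of one large RAM."""
    occupied, c = set(), 0
    for h in hashes:
        s = h % slots
        if s in occupied:
            c += 1
        else:
            occupied.add(s)
    return c

def collisions_cascade(hashes, top_bits):
    """Words that fall through RAMs of 2^top_bits, ..., 2 slots."""
    rams = [set() for _ in range(top_bits)]
    c = 0
    for h in hashes:
        for i in range(top_bits):
            s = h & ((1 << (top_bits - i)) - 1)
            if s not in rams[i]:
                rams[i].add(s)
                break
        else:
            c += 1  # collided in every RAM of the cascade
    return c

random.seed(0)
trials, std_total, casc_total = 100, 0, 0
for _ in range(trials):
    hashes = [random.getrandbits(32) for _ in range(1000)]
    std_total += collisions_standard(hashes, 2048)  # one 2K-slot RAM
    casc_total += collisions_cascade(hashes, 10)    # 1K+512+...+2 slots
print(f"standard: ~{std_total / trials:.0f} collisions/dictionary")
print(f"cascade : ~{casc_total / trials:.1f} collisions/dictionary")
```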
Loading Dictionary in RAMs
[Figure: load datapath — hash logic computes hash[N-1:0] for the incoming dictionary word; 1-D status arrays hold data-valid bits for each of the N RAMs; the status bits read at the hash locations are priority-encoded into write enables wr_en[N-1:0], and the word (wr_data) is written into the first free RAM at its hashed address]
Maintain a 1-D status array (initialized to zero) corresponding to each of the N RAMs
When a dictionary word is loaded, first compute the hash of the word and read the status bits from the status arrays at their respective hash locations
Use the status bits to generate write-enable signals for the RAMs (priority encoding ensures only one active signal)
Write the dictionary word into the first free RAM at its hashed address
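The write-enable generation is just a priority encoder over the status bits read at the hash locations; a tiny model (Python, illustrative naming):

```python
def write_enables(status_bits):
    """One-hot write enables from per-RAM status bits (1 = slot
    occupied): only the first RAM whose addressed slot is free gets
    its wr_en asserted; if none is free, the word goes to the final
    unindexed register instead."""
    wr_en = [0] * len(status_bits)
    for i, occupied in enumerate(status_bits):
        if not occupied:
            wr_en[i] = 1
            break
    return wr_en

print(write_enables([1, 1, 0, 0]))  # [0, 0, 1, 0]
print(write_enables([1, 1, 1, 1]))  # [0, 0, 0, 0]: overflow-register case
```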
Matching Dictionary from RAMs
Compute the hash of the input data word
For a valid input word, read the status bits from the status arrays and the dictionary words from the dictionary RAMs at their respective hash locations
Compare each RAM output with the input data word
– A match occurs if the data word matches any of the dictionary RAM outputs and the corresponding status signal from the status array is valid
For a dictionary of size 2^N, N RAMs are used, hence only a logarithmic number of comparisons is required for full dictionary matching
[Figure: match datapath — hash logic computes hash[N-1:0] for the input data word; each RAM is read at its hash index, its dictionary-word output feeds a comparator along with the input word, and the per-RAM hits (gated by the valid bits) are ORed into the final match signal]
Overall Hardware Architecture
[Figure: overall architecture — the input stream fills an input FIFO (rd_ptr/wr_ptr); a tokenizer (whitespace, punctuation) extracts the next word token from the FIFO and feeds it to the hash logic, dictionary RAMs, and status arrays, producing an input-token match]
Process one word (token) every cycle from the input stream
– For an average word length of 5-6 characters (+1 char for whitespace) and a clock frequency of 250 MHz, the theoretical processing bandwidth is 12-14 Gbps
– The architecture can be used to indicate the start and end offsets of matching tokens, count total matches per dictionary, keep individual counts for each word, and filter stopwords from the input stream
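The bandwidth figure is simple arithmetic on tokens per cycle; a quick check (Python):

```python
clock_hz = 250e6  # FPGA clock frequency
for word_chars in (5, 6):
    bytes_per_cycle = word_chars + 1  # +1 for the whitespace separator
    gbps = bytes_per_cycle * 8 * clock_hz / 1e9
    print(f"{word_chars}-char words -> {gbps:.0f} Gbps")  # 12 and 14 Gbps
```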
Experimental Setup
System consisting of a general-purpose processor attached to an FPGA device
FPGA device: Altera Stratix IV (~425K logic elements, 1280 M9K and 64 M144K block RAMs)
FPGA clock frequency: 250 MHz
FPGA attached on a DMA-capable bus with a peak transfer bandwidth of 2.5 GB/s
FPGA logic designed to process an input of 8 bytes/cycle: the theoretical processing throughput limit (@ 250 MHz) is 2 GB/s
Experiments vary the average document size and the average token length (ATL) in the documents
Hardware Measurements (Single Stream)
Hardware Measurements (Two Streams)
Resource Usage
Single instance consisting of a 1K-word dictionary organized as a 12-byte-wide RAM
Memory Efficiency
The architecture's limit is 2 bytes per character of dictionary, but there is additional overhead due to the fixed memory word width in the implementation
– Experiments performed with a 12-byte-wide RAM
– Memory efficiency can be improved by binning words based on word size and using multiple logical RAMs with different word widths
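A toy illustration of the binning idea (Python; the bin widths and example words are our own, not the paper's configuration):

```python
def bytes_used(words, widths):
    """Total RAM bytes when each word is stored in the narrowest
    available RAM word width that fits it; a single fixed-width RAM
    is just widths=[w]."""
    total = 0
    for word in words:
        fits = [b for b in sorted(widths) if b >= len(word)]
        if not fits:
            raise ValueError(f"word longer than widest bin: {word!r}")
        total += fits[0]
    return total

words = ["let's", "build", "a", "smarter", "planet"]
print(bytes_used(words, [12]))        # 60 bytes: one 12-byte-wide RAM
print(bytes_used(words, [4, 8, 12]))  # 36 bytes: binned by word length
```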
Conclusion
Proposed a hardware accelerator for off-loading the dictionary matching kernel in Information Extraction systems
Novel hashing-based scheme that allows processing of multiple characters per cycle with bounded RAM overhead
Hardware measurement results show an extraction throughput of ~12 Gbps for typical documents
Efficient resource usage allows ~100K-pattern dictionaries to fit in the on-chip block RAM of modern FPGA devices
Differentiation from Existing Software Hash-Collision Reduction Solutions
Bucket hashing – a set of values can be stored in the same slot
Open hashing – uses linked lists; values for keys that hash to a particular slot are placed on that slot's linked list
Closed hashing – a record can be stored in more than one location within the same hash table; some sequential search (a probe sequence) is used to identify the free slot among the possible locations
Double hashing – uses a secondary hash function to map a record to its alternate overflow location
None of these directly satisfy our hardware requirements, which are:
– The expected number of collisions should be zero
– Memory overhead should be limited, including overhead due to multi-port memory (which essentially multiplies overhead by the number of ports)
– All memory lookups / searches should be performed in parallel in a single cycle