
IBM Austin Research Lab

A High-Speed and Large-Scale Dictionary Matching Engine for Information Extraction Systems

Kanak Agarwal, Raphael Polig

IBM Corporation


Business Unit or Product Name

© 2003 IBM Corporation

Information Extraction (IE)

Task of extracting structured information from semi-structured and unstructured text

Tasks include finding entities, relationships between entities, attributes describing entities, etc.


October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."

Richard Stallman, founder of the Free Software Foundation, countered saying…

IE output:

NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..

Example from Tutorial (Cohen)


Information Extraction Systems

IE use cases in enterprise applications are growing rapidly
– Business intelligence, semantic search, brand management, customer sentiment analysis, news tracking, event tracking, etc.

Emerging as one of the key technologies in several Big Data and business analytics solutions

Enterprise adoption coupled with very large data volumes is imposing new constraints on the performance of IE systems
– Extraction throughput constraints
– Accuracy requirements
– System scalability


Pattern Matching in IE Systems

Dictionary matching and regular expression matching are the most important kernels in IE tasks

Dictionary matching involves matching strings extracted from text documents against a dictionary of known patterns

Example: dictionary matching in Named-Entity Recognition (NER)
– NER is the task of finding and classifying names (person, location, organization, etc.) in text

Text document:

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Richard Stallman countered saying…

[Figure: the text document is passed through Named-Entity Recognition, which uses gazetteers of Person, Location, and Organization names, and is shown again as the NER output.]
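As a rough software illustration of gazetteer lookup for NER, the sketch below tokenizes a sentence and checks short word sequences against per-type gazetteers. The gazetteer contents, the function name `find_entities`, and the choice to test sequences of up to three words are assumptions made for this example; the slides only describe matching word tokens against dictionaries.

```python
import re

# Hypothetical, tiny gazetteers; real ones may hold tens of thousands of entries.
GAZETTEERS = {
    "Person": {"Bill Gates", "Richard Stallman"},
    "Organization": {"Microsoft Corporation", "Free Software Foundation"},
}

def find_entities(text, gazetteers, max_words=3):
    """Return (entity, type) pairs for word sequences found in a gazetteer."""
    tokens = re.findall(r"\w+", text)  # split on whitespace and punctuation
    found = []
    for start in range(len(tokens)):
        longest = min(max_words, len(tokens) - start)
        for length in range(1, longest + 1):
            candidate = " ".join(tokens[start:start + length])
            for entity_type, names in gazetteers.items():
                if candidate in names:
                    found.append((candidate, entity_type))
    return found

text = "For years, Microsoft Corporation CEO Bill Gates railed against open-source software."
print(find_entities(text, GAZETTEERS))
# [('Microsoft Corporation', 'Organization'), ('Bill Gates', 'Person')]
```

Each membership test here is a set lookup, which is the operation the accelerator described later performs in hardware.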


Dictionary Matching Kernel in IE

Gazetteers (or dictionaries) used in IE tasks are often extremely large
– May easily contain tens to hundreds of thousands of patterns

Dictionary matching in IE is a highly compute-intensive kernel for large dictionaries and limits the overall extraction throughput

This work: hardware accelerator for off-loading dictionary matching operations to an FPGA-based co-processor

Design point chosen to allow matching of ~100K patterns at a throughput of ~10 Gbps


String Matching in Non-IE Environments

Dictionary (string) matching is also a commonly used operation in a variety of other applications
– Network Intrusion Detection Systems (NIDS), virus scanners, spam filters, etc.

Several approaches have been proposed for hardware acceleration of string matching
– Usually employ Deterministic Finite Automata (DFA) based algorithms such as the Aho-Corasick algorithm and its variants
– Implementations are generally constrained by either dictionary size or maximum achievable throughput


Existing Hardware Solutions for String Matching

Ref: Scarpazza, D.P., et al, “Exact Multi-Pattern String Matching on Cell/B.E. Processor”, 2008

Most of these solutions parse one character per cycle and typically achieve higher bandwidth by processing multiple input streams in parallel


Dictionary Matching Use-Case in IE

Matching requirements in IE environments differ from those of generic string matching
– Based on tokenizing a text document (whitespace, punctuation marks, etc.) and only matching at the word token boundaries

Simplified use-model, as no partial matches have to be detected by the accelerator
– DFA-based approaches are generally overkill

DFA accelerators typically process a single character per cycle and usually require large state transition tables

Our approach: hashing-based solution
– Process a word token with multiple characters per cycle
– Key challenge: limit the RAM resource overhead of mapping the dictionary into an on-chip hash table
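The difference between generic substring matching and matching only at token boundaries can be seen in a few lines. This is a minimal sketch: the `\w+` tokenizer and the function name `token_matches` are assumptions for illustration and are not taken from the slides.

```python
import re

def token_matches(text, dictionary):
    """Return (start, end, token) for whole tokens that appear in the dictionary."""
    return [
        (m.start(), m.end(), m.group())
        for m in re.finditer(r"\w+", text)   # tokens end at whitespace/punctuation
        if m.group() in dictionary           # one whole-token lookup, no partial match
    ]

dictionary = {"open", "source"}
text = "They opened an open source lab."

print("open" in "opened")                 # True: substring search reports a partial match
print(token_matches(text, dictionary))    # [(15, 19, 'open'), (20, 26, 'source')]
```

Because "opened" is looked up as a whole token, it never matches "open", so no automaton state is needed to track partial matches.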


Generic Hashing

Generic hashing scheme for storing a dictionary in on-chip block RAMs
– Consider a dictionary with L_D = 2^N words
– Consider a hashing scheme which maps a dictionary word into an M-bit hash
• The hash value is used as an index (M-bit array address) to store a dictionary word in a particular RAM location
• For an M-bit hash, a RAM with L_M = 2^M slots (or words) is used to store 2^N words
– Multiple dictionary words can map to the same hash value (and hence the same RAM slot), leading to hash collisions
• The expected number of collisions depends on the relative sizes of the dictionary and the RAM

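[Figure: dictionary words ("let's", "build", "a", "smarter", "planet") are mapped by a hash function to RAM slots; two words mapping to the same slot illustrate a hash collision.]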
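The generic scheme can be modeled in software as below. This is only a sketch: BLAKE2b stands in for the unspecified hardware hash function, and the names `hash_index` and `store` are hypothetical.

```python
import hashlib

def hash_index(word, m_bits):
    """Map a word to an M-bit RAM address (BLAKE2b is an arbitrary stand-in hash)."""
    digest = hashlib.blake2b(word.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big") & ((1 << m_bits) - 1)

def store(words, m_bits):
    """Store words in a single RAM of 2**m_bits slots; return (ram, collisions)."""
    ram = [None] * (1 << m_bits)
    collisions = []
    for word in words:
        slot = hash_index(word, m_bits)
        if ram[slot] is None:
            ram[slot] = word
        else:
            collisions.append(word)  # slot already taken by another word
    return ram, collisions

words = ["let's", "build", "a", "smarter", "planet"]
ram, collisions = store(words, m_bits=3)   # 5 words, 2**3 = 8 slots
print(sum(s is not None for s in ram), "stored,", len(collisions), "collided")
```

Raising `m_bits` makes collisions less likely at the cost of a larger, mostly empty RAM.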


Hash Collisions

Expected number of hash collisions for hashing L_D words in L_M RAM slots
– Assuming perfect uniform random distribution

Hash collisions can be reduced by increasing RAM size
– Approach incurs significant RAM overhead
– Limited on-chip RAM limits the size of the dictionary that can be hashed in on-chip RAM

Expected number of collisions: $C \approx \frac{L_D^2}{2 L_M}$

RAM Size          Expected # of Collisions
L_M = L_D         L_D/2
L_M = 2*L_D       L_D/4
L_M = 3*L_D       L_D/6
L_M = 4*L_D       L_D/8
L_M = 5*L_D       L_D/10
L_M = 6*L_D       L_D/12
L_M = 7*L_D       L_D/14
L_M = 8*L_D       L_D/16
L_M = 9*L_D       L_D/18
L_M = 10*L_D      L_D/20

[Figure: Hash Collisions (64K Dictionary). Annotations: about 2M RAM slots are needed to get down to 1000 collisions; more than 10X RAM overhead is needed for fewer than 5% collisions.]
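The table entries are consistent with the estimate L_D^2 / (2 * L_M), which counts colliding pairs and is tight when the RAM is much larger than the dictionary. The sketch below compares that estimate with the expected number of words that land in an already occupied slot under independent uniform hashing; the second formula is standard occupancy arithmetic and is not taken from the slides, and both function names are hypothetical.

```python
def slide_estimate(l_d, l_m):
    """Pairwise estimate consistent with the table: L_D^2 / (2 * L_M)."""
    return l_d * l_d / (2 * l_m)

def uniform_expectation(l_d, l_m):
    """Expected number of words that land in an already occupied slot,
    for L_D independent uniform hashes into L_M slots."""
    occupied = l_m * (1 - (1 - 1 / l_m) ** l_d)
    return l_d - occupied

l_d = 64 * 1024
for factor in (1, 2, 10, 32):
    l_m = factor * l_d
    print(factor, round(slide_estimate(l_d, l_m)), round(uniform_expectation(l_d, l_m)))
```

For the 64K dictionary, a factor of 32 (2M slots) gives an estimate of 1024 collisions, in line with the annotation on the chart. At small factors the occupancy formula gives a somewhat lower count than the pairwise estimate.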


Proposed Modified Hashing Technique

For a dictionary of size 2^N words
– Create an N-bit hash: hash[N-1:0]
– Use N RAM arrays
• The first RAM has the same size as the dictionary (2^N slots) and is addressed by the full hash[N-1:0]
• The second RAM is half the size of the dictionary (2^(N-1) slots) and is indexed by hash[N-2:0], and so on
– Each dictionary word is stored in the first free RAM at its corresponding hash position
• If a word causes hash collisions in all N RAMs, then it is stored in the final unindexed register

[Figure: Dictionary (2^N words) → N-bit hash. Hash[N-1:0] addresses RAM 1 (2^N slots), Hash[N-2:0] addresses RAM 2 (2^(N-1) slots), ..., Hash[0] addresses RAM N (2 slots), followed by a final 1-slot register.]
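The RAM sizes and the address each RAM sees follow directly from slicing the low-order bits of one N-bit hash. A small sketch, with hypothetical helper names `ram_layout` and `ram_addresses`:

```python
def ram_layout(n_bits):
    """Slots per RAM: RAM 1 has 2**N slots, RAM 2 has 2**(N-1), ..., RAM N has 2."""
    return [1 << (n_bits - k) for k in range(n_bits)]

def ram_addresses(hash_value, n_bits):
    """Address used in each RAM: hash[N-1:0], hash[N-2:0], ..., hash[0]."""
    return [hash_value & ((1 << (n_bits - k)) - 1) for k in range(n_bits)]

N = 4
print(ram_layout(N))              # [16, 8, 4, 2]
print(ram_addresses(0b1011, N))   # [11, 3, 3, 1]
print(sum(ram_layout(N)) + 1)     # 31 slots in total, including the final register
```

Since every address is a bit slice of the same hash, the hash is computed once per word and no second hash function is needed.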


Hash Collisions in Modified Hashing

Consider hash collisions in chain hashing for a 64K-word dictionary
– The expected number of collisions in the first RAM (64K words in 64K slots) should be 32K
– The colliding 32K words will be sent to the second RAM (32K slots), resulting in an average of 16K collisions in RAM 2
– The progression should continue, with the last RAM stage hashing on average 2 words into 2 slots with potentially 1 collision, which can be stored in the final unindexed register

The modified hashing technique achieves the following useful properties
– Probability of non-zero collisions is very low
– RAM overhead is limited to twice the size of the dictionary
– For a dictionary of size L_D, log2(L_D) RAMs are used, so only a logarithmic number of RAM accesses / comparisons is required for dictionary matching
– All memory lookups / searches are performed in parallel in a single cycle and do not require any sequential search

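[Figure: cascade of RAMs with expected collisions. Dictionary (2^N words) → RAM 1 (2^N slots), expected # of collisions $\frac{2^N \cdot 2^N}{2 \cdot 2^N} = 2^{N-1}$ words → RAM 2 (2^(N-1) slots), 2^(N-2) collisions → ... → RAM N (2 slots), 2 words in, 1 collision → final register (1 slot), 1 word, no collision.]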
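The per-stage collision counts and the 2x RAM bound can be checked with a short calculation. It applies the estimate implied by the earlier collision table, $C \approx L_D^2 / (2 L_M)$, to each stage; the stage index $k$ and the symbols $C_k$ and $S$ are notation introduced here for illustration.

```latex
% Stage k receives about 2^{N-k+1} words and has 2^{N-k+1} slots (k = 1, ..., N)
C_k \approx \frac{2^{N-k+1} \cdot 2^{N-k+1}}{2 \cdot 2^{N-k+1}} = 2^{N-k},
\qquad C_1 = 2^{N-1},\; C_2 = 2^{N-2},\; \dots,\; C_N = 1

% Total slots: N RAMs plus the final one-slot register
S = \sum_{k=1}^{N} 2^{N-k+1} + 1 = \left(2^{N+1} - 2\right) + 1 = 2^{N+1} - 1 < 2 \cdot 2^{N}
```

For N = 16 (a 64K-word dictionary) this reproduces the 32K, 16K, ..., 1 progression, with a total of fewer than 128K slots.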


Hash Collision Experiments

One hundred randomly generated 1000-word dictionaries

Each dictionary is stored in RAMs using the proposed hashing technique
– Modified hashing is compared against standard hashing, which uses the same amount of RAM (2K RAM slots for each 1000-word dictionary) but a single large RAM to store each dictionary

Chain hashing stores all dictionaries without ANY collisions, while standard hashing shows on average 200+ collisions per dictionary

[Figure: two sets of panels. Modified Hashing (total RAM: 2K slots): RAM 1 (1K slots), RAM 2 (512 slots), RAM 3 (256 slots), RAM 4 (128 slots), ..., RAM 10 (2 slots). Standard Hashing (total RAM: 2K slots). Plot labels: Histogram (Slots occupied), Histogram (Collisions).]
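An experiment of this kind can be approximated in software. The sketch below is not the authors' setup: it draws uniform random hash values directly instead of hashing generated words, the seed and function names are arbitrary, and exact counts will differ from the slides. It should still show standard hashing colliding a couple of hundred times per dictionary while the cascaded scheme leaves very few words, if any, unplaced.

```python
import random

def modified_unplaced(hashes, n_bits):
    """Cascade of N RAMs plus one register; return the number of words left unplaced."""
    rams = [[False] * (1 << (n_bits - k)) for k in range(n_bits)]
    register_used = False
    unplaced = 0
    for h in hashes:
        for k, ram in enumerate(rams):
            addr = h & ((1 << (n_bits - k)) - 1)   # low-order slice of the hash
            if not ram[addr]:
                ram[addr] = True
                break
        else:  # collided in every RAM
            if register_used:
                unplaced += 1
            else:
                register_used = True
    return unplaced

def standard_collisions(hashes, slots):
    """Single RAM: count words whose slot is already occupied."""
    used = [False] * slots
    collisions = 0
    for h in hashes:
        if used[h]:
            collisions += 1
        else:
            used[h] = True
    return collisions

rng = random.Random(0)  # fixed seed so runs are repeatable
N, WORDS, TRIALS = 10, 1000, 100
mod = [modified_unplaced([rng.getrandbits(N) for _ in range(WORDS)], N) for _ in range(TRIALS)]
std = [standard_collisions([rng.randrange(2048) for _ in range(WORDS)], 2048) for _ in range(TRIALS)]
print("modified, mean unplaced words:", sum(mod) / TRIALS)
print("standard, mean collisions:", sum(std) / TRIALS)
```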


Loading Dictionary in RAMs

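[Figure: dictionary loading datapath. The dictionary word enters the hash logic (hash[N-1:0]). With Load word asserted (rd_en), the slices hash[N-1:0], hash[N-2:0], ..., hash[0] form the rd_addr of N 1-D arrays storing the data valid bits. The valid outputs drive the write enable logic (wr_en[N-1:0]). The same hash slices form the wr_addr of the N RAMs storing dictionary words, which receive wr_en[N-1], wr_en[N-2], ..., wr_en[0] and the word as wr_data.]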

Maintain a 1-D status array (initialized to zero) corresponding to each of the N RAMs

When a dictionary word is loaded, first compute the hash of the word and read the status bits from the status arrays at their respective hash locations

Use the status bits to generate write enable signals for the RAMs (priority encoding to generate only one active signal)

Write the dictionary word in the first free RAM at its hashed address
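The priority-encoding step can be sketched as a small function. The name `write_enables` and the list-of-booleans representation are assumptions for illustration; in hardware this would be combinational logic.

```python
def write_enables(status_bits):
    """Priority-encode status bits into a one-hot list of write enables.

    status_bits[k] is True when RAM k+1 already holds a word at the hashed
    address. Only the first free RAM gets an active write enable; an all-False
    result means the word collided in every RAM.
    """
    enables = [False] * len(status_bits)
    for k, occupied in enumerate(status_bits):
        if not occupied:
            enables[k] = True
            break
    return enables

print(write_enables([True, True, False, False]))  # [False, False, True, False]
print(write_enables([True, True, True, True]))    # [False, False, False, False]
```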


Matching Dictionary from RAMs

Compute the hash of the input data word

For a valid input word, read the status bits from the status arrays and the dictionary words from the dictionary RAMs at their respective hash locations

Compare each RAM output with the input data word
– A match occurs if the data word matches any of the dictionary RAM outputs and the corresponding status signal from the status array is valid

For a dictionary of size 2^N, N RAMs are used, and hence only a logarithmic number of comparisons is required for full dictionary matching

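[Figure: matching datapath. The data word enters the hash logic (hash[N-1:0]). With Data valid asserted (rd_en), the slices hash[N-1:0], hash[N-2:0], ..., hash[0] form the rd_addr of the status arrays and dictionary RAMs. Each RAM's valid bit and dictionary word feed a comparator against the data word, and the comparator outputs are combined into the match signal.]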
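Loading and matching can be modeled together in software. This is a sketch under stated assumptions: BLAKE2b stands in for the unspecified hardware hash, the class name `ChainedDictionary` is hypothetical, the final unindexed register is left out, and the N comparisons run sequentially here whereas the hardware performs them in parallel.

```python
import hashlib

class ChainedDictionary:
    """Software model of N dictionary RAMs with per-slot status (valid) bits."""

    def __init__(self, n_bits):
        self.n_bits = n_bits
        self.rams = [[None] * (1 << (n_bits - k)) for k in range(n_bits)]
        self.valid = [[False] * (1 << (n_bits - k)) for k in range(n_bits)]

    def _addresses(self, word):
        # BLAKE2b is a stand-in; the slides do not specify the hash function.
        digest = hashlib.blake2b(word.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big") & ((1 << self.n_bits) - 1)
        return [h & ((1 << (self.n_bits - k)) - 1) for k in range(self.n_bits)]

    def load(self, word):
        """Write the word into the first free RAM; return False if all N collide."""
        for k, addr in enumerate(self._addresses(word)):
            if not self.valid[k][addr]:
                self.rams[k][addr] = word
                self.valid[k][addr] = True
                return True
        return False  # the final unindexed register is omitted in this sketch

    def match(self, word):
        """Compare the word against the output of every RAM at its hashed address."""
        return any(
            self.valid[k][addr] and self.rams[k][addr] == word
            for k, addr in enumerate(self._addresses(word))
        )

d = ChainedDictionary(n_bits=4)
loaded = [w for w in ["microsoft", "gates", "stallman"] if d.load(w)]
print([d.match(w) for w in loaded], d.match("veghte"))  # [True, True, True] False
```

The stored word itself is compared, not just the hash, so a non-dictionary token that happens to share a hash address is still rejected.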


Overall Hardware Architecture

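[Figure: overall hardware architecture. The input stream is written to an input FIFO (rd_ptr, wr_ptr). A tokenizer (whitespace, punctuation) extracts the next word from the FIFO. A multiplexer controlled by Load word selects between the dictionary word and the input token as the input to the hash logic (hash[N-1:0]), which drives the dictionary RAMs and status arrays and produces the match output.]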

Process one word (token) every cycle from the input stream
– For an average word length of 5-6 characters (+1 character for whitespace) and a clock frequency of 250 MHz, the theoretical processing bandwidth is 12-14 Gbps
– The architecture can be used to indicate the start and end offsets of matching tokens, count total matches per dictionary, keep individual counts for each word, and filter stopwords from the input stream
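The 12-14 Gbps figure follows from the numbers on the slide, assuming one byte per character:

```latex
\begin{aligned}
(5 + 1)\ \text{bytes/cycle} \times 8\ \text{bits/byte} \times 250 \times 10^{6}\ \text{cycles/s} &= 12\ \text{Gbps} \\
(6 + 1)\ \text{bytes/cycle} \times 8\ \text{bits/byte} \times 250 \times 10^{6}\ \text{cycles/s} &= 14\ \text{Gbps}
\end{aligned}
```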


Experimental Setup

System consisting of a general-purpose processor attached to an FPGA device

FPGA device: Altera Stratix IV device (~425K logic elements, 1280 M9K, 64 M144K)

FPGA running at a frequency of 250 MHz

FPGA attached on a DMA-capable bus with a peak transfer bandwidth of 2.5 GB/s

FPGA logic designed to process an input of 8 bytes/cycle: the theoretical processing throughput limit (@ 250 MHz) is 2 GB/s

Experiments vary the average document size and the average token length (ATL) in the documents
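One way to see how ATL could bound throughput is a simple back-of-the-envelope model. The model itself is an assumption, not something stated on the slides: it supposes one token plus one delimiter byte is consumed per cycle, capped by the 8-byte input width and by the bus. Only the three constants come from the setup above.

```python
CLOCK_HZ = 250e6          # FPGA clock, from the slide
BYTES_PER_CYCLE = 8       # input width, from the slide
BUS_BYTES_PER_S = 2.5e9   # peak DMA bandwidth, from the slide

def throughput_gbps(avg_token_length):
    """Assumed model: one token (plus one delimiter byte) per cycle,
    capped by the 8-byte datapath and by the bus bandwidth."""
    bytes_per_cycle = min(avg_token_length + 1, BYTES_PER_CYCLE)
    bytes_per_s = min(bytes_per_cycle * CLOCK_HZ, BUS_BYTES_PER_S)
    return bytes_per_s * 8 / 1e9

for atl in (3, 5, 7, 10):
    print(atl, throughput_gbps(atl))
```

Under this model an ATL of 5 gives 12 Gbps and the rate saturates at 16 Gbps (2 GB/s) once tokens fill the 8-byte input; measured numbers will be lower wherever DMA or software overheads dominate.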


Hardware Measurements (Single Stream)


Hardware Measurements (Two Streams)


Resource Usage

Single instance consisting of a 1K-word dictionary organized as 12-byte-wide RAM


Memory Efficiency

Architecture limit is 2 bytes/char of dictionary, but there is additional overhead due to the fixed memory word width in the implementation
– Experiments performed with 12-byte-wide RAM
– Memory efficiency can be improved by binning words based on word size and using multiple logical RAMs with different word widths
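The effect of binning can be estimated with a few lines of arithmetic. The word list and the bin widths of 4, 8, and 12 bytes are made-up values for illustration, and the function name `ram_bytes` is hypothetical. The roughly 2x slot overhead of the hashing scheme applies equally to both layouts, so it is left out.

```python
def ram_bytes(words, widths):
    """Bytes of RAM word storage when each word goes to the narrowest bin that fits.

    widths: available RAM word widths in bytes (assumed values, not from the slides).
    Counts one slot per dictionary word; the 2x slot overhead is not included.
    """
    total = 0
    for word in words:
        size = len(word.encode("utf-8"))
        total += min(w for w in widths if w >= size)  # ValueError if no bin fits
    return total

words = ["ibm", "gates", "microsoft", "stallman", "austin"]
chars = sum(len(w) for w in words)
for widths in ([12], [4, 8, 12]):
    used = ram_bytes(words, widths)
    print(widths, used, "bytes,", round(chars / used, 2), "chars per byte")
```

For these five words the single 12-byte RAM uses 60 bytes while the three bins use 40, because short words no longer pad out to the widest slot.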


Conclusion

Proposed a hardware accelerator for off-loading the dictionary matching kernel in Information Extraction systems

Novel hashing-based scheme that allows processing of multiple characters per cycle with bounded RAM overhead

Hardware measurement results show extraction throughput of ~12 Gbps for typical documents

Efficient resource usage makes it possible to fit ~100K-pattern dictionaries in the on-chip block RAM of modern FPGA devices


Differentiation from Existing Software Hash Collision Reduction Solutions

Bucket hashing – A set of values can be stored in the same slot

Open hashing – Uses linked lists. Values for keys that hash to a particular slot are placed on that slot's linked list

Closed hashing – A record can be stored in more than one location within the same hash table. Some sequential search (probe sequence) is used to identify the free slot in the possible set of locations

Double hashing – Uses a secondary hashing function to map a record to its alternate overflow location

None of these directly satisfies our hardware requirements, which are:
– Expected number of collisions should be zero
– Memory overhead should be limited, including overhead due to multi-port memory (which essentially increases overhead by the number of ports)
– All memory lookups / searches should be performed in parallel in a single cycle