Transcript of "A High-Speed and Large-Scale Dictionary Matching Engine for Information Extraction Systems"
IBM Austin Research Lab
A High-Speed and Large-Scale Dictionary Matching Engine for Information Extraction Systems
Kanak Agarwal, Raphael Polig
IBM Corporation
Business Unit or Product Name
© 2003 IBM Corporation
Information Extraction (IE)
Task of extracting structured information from semi-structured and unstructured text
Tasks include finding entities, relationships between entities, attributes describing entities, etc.
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.
"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."
Richard Stallman, founder of the Free Software Foundation, countered saying…
NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..
Example from IE tutorial (Cohen)
Information Extraction Systems
IE is undergoing rapid growth in its enterprise use-cases
– Business intelligence, semantic search, brand management, customer sentiment analysis, news tracking, event tracking, etc.
Emerging as one of the key technologies in several Big Data and business analytics solutions
Enterprise adoption coupled with very large data volumes is forcing new constraints on the performance of IE systems
– Extraction throughput constraints
– Accuracy requirements
– System scalability
Pattern Matching in IE Systems
Dictionary matching and regular expression matching are the most important kernels in IE tasks
Dictionary matching involves matching strings extracted from text documents against a dictionary of known patterns
Example: use of dictionary matching in Named-Entity Recognition (NER)
– NER is the task of finding and classifying names (person, location, organization, etc.) in text
Text document:
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.
Richard Stallman countered saying…
[Figure: NER annotates the entity names in the text document above using gazetteers of Person, Location, and Organization names]
Dictionary Matching Kernel in IE
Gazetteers (or dictionaries) used in IE tasks are often extremely large
– May easily contain tens to hundreds of thousands of patterns
Dictionary matching in IE is a highly compute-intensive kernel for large dictionaries that limits the overall extraction throughput
This work: a hardware accelerator for off-loading dictionary matching operations to an FPGA-based co-processor
Design point chosen to allow matching of ~100K patterns at a throughput of ~10 Gbps
String Matching in Non-IE Environments
Dictionary (string) matching is also a commonly used operation in a variety of other applications
– Network Intrusion Detection Systems (NIDS), virus scanners, spam filters, etc.
Several approaches have been proposed for hardware acceleration of string matching
– Usually employ Deterministic Finite Automata (DFA) based algorithms such as the Aho-Corasick algorithm and its variants
– Implementations are generally constrained either by dictionary size or by maximum achievable throughput
Existing Hardware Solutions for String Matching
Ref: Scarpazza, D.P., et al., "Exact Multi-Pattern String Matching on Cell/B.E. Processor", 2008
Most of these solutions parse one character every cycle; they typically achieve higher bandwidth by processing multiple input streams in parallel
Dictionary Matching Use-Case in IE
Matching requirements in IE environments differ from generic string matching
– Based on tokenizing a text document (at whitespace, punctuation marks, etc.) and matching only at word token boundaries
Simplified use-model, as no partial matches have to be detected by the accelerator
– DFA-based approaches are generally overkill
DFA accelerators typically process a single character per cycle and usually require large state-transition tables
Our approach: a hashing-based solution
– Process a word token with multiple characters per cycle
– Key challenge: limit the RAM resource overhead of mapping the dictionary into an on-chip hash table
Generic Hashing
Generic hashing scheme for storing a dictionary in on-chip block RAMs
– Consider a dictionary with L_D = 2^N words
– Consider a hashing scheme which maps a dictionary word into an M-bit hash
  • The hash value is used as an index (M-bit RAM address) to store a dictionary word in a particular RAM location
  • For an M-bit hash, a RAM with L_M = 2^M slots (or words) is used to store the 2^N words
– Multiple dictionary words can map to the same hash value (and hence the same RAM slot), leading to hash collisions
  • The expected number of collisions depends on the relative sizes of the dictionary and the RAM
[Figure: dictionary words ("let's", "build", "a", "smarter", "planet") passed through a hash function into RAM slots, with two words landing in the same slot (hash collision)]
Hash Collisions
Expected number of hash collisions for hashing L_D words in L_M RAM slots
– Assuming a perfectly uniform random hash distribution
Hash collisions can be reduced by increasing the RAM size
– This approach incurs significant RAM overhead
– Limited on-chip RAM caps the size of the dictionary that can be hashed on-chip
C_D ≈ L_D^2 / (2 · L_M)
RAM Size         Expected # of Collisions
L_M = L_D        L_D/2
L_M = 2 · L_D    L_D/4
L_M = 3 · L_D    L_D/6
L_M = 4 · L_D    L_D/8
L_M = 5 · L_D    L_D/10
L_M = 6 · L_D    L_D/12
L_M = 7 · L_D    L_D/14
L_M = 8 · L_D    L_D/16
L_M = 9 · L_D    L_D/18
L_M = 10 · L_D   L_D/20
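The table values follow directly from the approximation above; a quick check in Python (a sketch — the function name is ours, and the formula is the uniform-random approximation, not an exact count):

```python
def expected_collisions(l_d: int, l_m: int) -> float:
    """Approximate expected collisions when hashing l_d words into
    l_m slots under a uniform random hash: C ~ l_d^2 / (2 * l_m)."""
    return l_d * l_d / (2 * l_m)

l_d = 64 * 1024  # the deck's 64K-word dictionary
for k in range(1, 11):
    c = expected_collisions(l_d, k * l_d)
    print(f"L_M = {k:2d}*L_D -> ~{c:7.1f} collisions (= L_D/{2 * k})")
```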
[Plot: Hash Collisions (64K Dictionary) — collisions vs. RAM size; 2M RAM slots are needed to get down to 1000 collisions, i.e. greater than 10X RAM overhead for less than 5% collisions]
Proposed Modified Hashing Technique
For a dictionary of size 2^N words:
– Create an N-bit hash: hash[N-1:0]
– Use N RAM arrays
  • The first RAM has the same size as the dictionary (2^N slots) and is addressed by the full hash[N-1:0]
  • The second RAM is half the size of the dictionary (2^(N-1) slots) and is indexed by hash[N-2:0], and so on
– Each dictionary word is stored in the first free RAM at its corresponding hash position
  • If a word causes hash collisions in all N RAMs, it is stored in a final unindexed register
[Figure: a dictionary of 2^N words is hashed with an N-bit hash; RAM 1 (2^N slots) is indexed by hash[N-1:0], RAM 2 (2^(N-1) slots) by hash[N-2:0], ..., RAM N (2 slots) by hash[0], followed by a final 1-slot register]
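A minimal software model of this cascade (a Python sketch; the class name and the MD5-derived hash are our illustrative choices, and the sequential loops only model lookups that happen in parallel in the hardware):

```python
import hashlib

def n_bit_hash(word: str, n: int) -> int:
    """Derive an n-bit hash from a word (MD5 here is illustrative only)."""
    digest = hashlib.md5(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") & ((1 << n) - 1)

class CascadeHashTable:
    """N RAMs of 2^N, 2^(N-1), ..., 2 slots plus one unindexed register.
    RAM i is indexed by the low (N - i) bits of the same N-bit hash."""

    def __init__(self, n: int):
        self.n = n
        self.rams = [[None] * (1 << (n - i)) for i in range(n)]
        self.overflow = None  # final 1-slot register

    def insert(self, word: str) -> bool:
        h = n_bit_hash(word, self.n)
        for ram in self.rams:
            slot = h & (len(ram) - 1)  # hash[N-1-i:0]
            if ram[slot] is None:      # first free RAM wins
                ram[slot] = word
                return True
        if self.overflow is None:      # collided in all N RAMs
            self.overflow = word
            return True
        return False                   # table genuinely full on this path

    def lookup(self, word: str) -> bool:
        h = n_bit_hash(word, self.n)
        for ram in self.rams:
            slot = h & (len(ram) - 1)
            if ram[slot] == word:
                return True
        return self.overflow == word

table = CascadeHashTable(8)
for w in ["let's", "build", "a", "smarter", "planet"]:
    table.insert(w)
print(table.lookup("planet"), table.lookup("moon"))  # True False
```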
Hash Collisions in Modified Hashing
Consider hash collisions in chain hashing for a 64K-word dictionary
– The expected number of collisions in the first RAM (64K words in 64K slots) should be 32K
– The colliding 32K words are sent to the second RAM (32K slots), resulting in an average of 16K collisions in RAM 2
– The progression continues, with the last RAM stage hashing on average 2 words in 2 slots, with potentially 1 collision that can be stored in the final unindexed register
The modified hashing technique achieves the following useful properties
– The probability of non-zero collisions is very low
– RAM overhead is limited to twice the size of the dictionary
– For a dictionary of size L_D, log(L_D) RAMs are used, so only a logarithmic number of RAM accesses / comparisons is required for dictionary matching
– All memory lookups / searches are performed in parallel in a single cycle and do not require any sequential search
[Figure: collision progression — the 2^N-word dictionary fills RAM 1 (2^N slots) with an expected 2^(N-1) colliding words, RAM 2 (2^(N-1) slots) with 2^(N-2) collisions, ..., RAM N (2 slots) with 2 words and 1 expected collision; the final 1-slot register holds 1 word with no further collision]
Hash Collision Experiments
Randomly generated one hundred 1000-word dictionaries
Store each dictionary in RAMs using the proposed hashing technique
– Compare modified hashing against standard hashing using the same total RAM (2K RAM slots for each 1000-word dictionary), with standard hashing using a single large RAM to store each dictionary
Modified (chain) hashing stores all dictionaries without ANY collisions, while standard hashing shows on average 200+ collisions per dictionary
[Figure: Modified Hashing (total RAM: 2K slots) — RAM 1 (1K slots), RAM 2 (512 slots), RAM 3 (256 slots), RAM 4 (128 slots), ..., RAM 10 (2 slots), with a histogram of slots occupied per RAM; Standard Hashing (total RAM: 2K slots) — histogram of collisions]
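This experiment is easy to reproduce in software; the sketch below uses random 32-bit integers as a stand-in for real word hashes, and counts a cascade "collision" as a word that falls through all 10 RAMs (i.e. what the final 1-slot register would have to absorb):

```python
import random

def collisions_standard(hashes, slots):
    """Words landing in an already-occupied slot of one large RAM."""
    occupied, c = set(), 0
    for h in hashes:
        s = h % slots
        if s in occupied:
            c += 1
        else:
            occupied.add(s)
    return c

def collisions_cascade(hashes, top_bits):
    """Words that fall through RAMs of 2^top_bits, ..., 2 slots."""
    rams = [set() for _ in range(top_bits)]
    c = 0
    for h in hashes:
        for i in range(top_bits):
            s = h & ((1 << (top_bits - i)) - 1)
            if s not in rams[i]:
                rams[i].add(s)
                break
        else:
            c += 1  # collided in every RAM of the cascade
    return c

random.seed(0)
trials, std_total, casc_total = 100, 0, 0
for _ in range(trials):
    hashes = [random.getrandbits(32) for _ in range(1000)]
    std_total += collisions_standard(hashes, 2048)  # one 2K-slot RAM
    casc_total += collisions_cascade(hashes, 10)    # 1K+512+...+2 slots
print(f"standard: ~{std_total / trials:.0f} collisions/dictionary")
print(f"cascade : ~{casc_total / trials:.1f} collisions/dictionary")
```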
Loading Dictionary in RAMs
[Figure: load datapath — hash logic computes hash[N-1:0] for the incoming dictionary word; 1-D status arrays hold data-valid bits for each of the N RAMs; the status bits read at the hash locations are priority-encoded into write enables wr_en[N-1:0], and the word (wr_data) is written into the first free RAM at its hashed address]
Maintain a 1-D status array (initialized to zero) corresponding to each of the N RAMs
When a dictionary word is loaded, first compute the hash of the word and read the status bits from the status arrays at their respective hash locations
Use the status bits to generate write-enable signals for the RAMs (priority encoding ensures only one active signal)
Write the dictionary word into the first free RAM at its hashed address
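The write-enable generation is just a priority encoder over the status bits read at the hash locations; a tiny model (Python, illustrative naming):

```python
def write_enables(status_bits):
    """One-hot write enables from per-RAM status bits (1 = slot
    occupied): only the first RAM whose addressed slot is free gets
    its wr_en asserted; if none is free, the word goes to the final
    unindexed register instead."""
    wr_en = [0] * len(status_bits)
    for i, occupied in enumerate(status_bits):
        if not occupied:
            wr_en[i] = 1
            break
    return wr_en

print(write_enables([1, 1, 0, 0]))  # [0, 0, 1, 0]
print(write_enables([1, 1, 1, 1]))  # [0, 0, 0, 0]: overflow-register case
```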
Matching Dictionary from RAMs
Compute the hash of the input data word
For a valid input word, read the status bits from the status arrays and the dictionary words from the dictionary RAMs at their respective hash locations
Compare each RAM output with the input data word
– A match occurs if the data word matches any of the dictionary RAM outputs and the corresponding status signal from the status array is valid
For a dictionary of size 2^N, N RAMs are used, hence only a logarithmic number of comparisons is required for full dictionary matching
[Figure: match datapath — hash logic computes hash[N-1:0] for the input data word; each RAM is read at its hash index, its dictionary-word output feeds a comparator along with the input word, and the per-RAM hits (gated by the valid bits) are ORed into the final match signal]
Overall Hardware Architecture
[Figure: overall architecture — the input stream fills an input FIFO (rd_ptr/wr_ptr); a tokenizer (whitespace, punctuation) extracts the next word token from the FIFO and feeds it to the hash logic, dictionary RAMs, and status arrays, producing an input-token match]
Process one word (token) every cycle from the input stream
– For an average word length of 5-6 characters (+1 char for whitespace) and a clock frequency of 250 MHz, the theoretical processing bandwidth is 12-14 Gbps
– The architecture can be used to indicate the start and end offsets of matching tokens, count total matches per dictionary, keep individual counts for each word, and filter stopwords from the input stream
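The bandwidth figure is simple arithmetic on tokens per cycle; a quick check (Python):

```python
clock_hz = 250e6  # FPGA clock frequency
for word_chars in (5, 6):
    bytes_per_cycle = word_chars + 1  # +1 for the whitespace separator
    gbps = bytes_per_cycle * 8 * clock_hz / 1e9
    print(f"{word_chars}-char words -> {gbps:.0f} Gbps")  # 12 and 14 Gbps
```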
Experimental Setup
System consisting of a general-purpose processor attached to an FPGA device
FPGA device: Altera Stratix IV (~425K logic elements, 1280 M9K and 64 M144K block RAMs)
FPGA clock frequency: 250 MHz
FPGA attached on a DMA-capable bus with a peak transfer bandwidth of 2.5 GB/s
FPGA logic designed to process an input of 8 bytes/cycle: the theoretical processing throughput limit (@ 250 MHz) is 2 GB/s
Experiments vary the average document size and the average token length (ATL) in the documents
Hardware Measurements (Single Stream)
Hardware Measurements (Two Streams)
Resource Usage
Single instance consisting of a 1K-word dictionary organized as a 12-byte-wide RAM
Memory Efficiency
The architecture's limit is 2 bytes per character of dictionary, but there is additional overhead due to the fixed memory word width in the implementation
– Experiments performed with a 12-byte-wide RAM
– Memory efficiency can be improved by binning words based on word size and using multiple logical RAMs with different word widths
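A toy illustration of the binning idea (Python; the bin widths and example words are our own, not the paper's configuration):

```python
def bytes_used(words, widths):
    """Total RAM bytes when each word is stored in the narrowest
    available RAM word width that fits it; a single fixed-width RAM
    is just widths=[w]."""
    total = 0
    for word in words:
        fits = [b for b in sorted(widths) if b >= len(word)]
        if not fits:
            raise ValueError(f"word longer than widest bin: {word!r}")
        total += fits[0]
    return total

words = ["let's", "build", "a", "smarter", "planet"]
print(bytes_used(words, [12]))        # 60 bytes: one 12-byte-wide RAM
print(bytes_used(words, [4, 8, 12]))  # 36 bytes: binned by word length
```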
Conclusion
Proposed a hardware accelerator for off-loading the dictionary matching kernel in Information Extraction systems
Novel hashing-based scheme that allows processing of multiple characters per cycle with bounded RAM overhead
Hardware measurement results show an extraction throughput of ~12 Gbps for typical documents
Efficient resource usage allows ~100K-pattern dictionaries to fit in the on-chip block RAM of modern FPGA devices
Differentiation from Existing Software Hash-Collision Reduction Solutions
Bucket hashing – a set of values can be stored in the same slot
Open hashing – uses linked lists; values for keys that hash to a particular slot are placed on that slot's linked list
Closed hashing – a record can be stored in more than one location within the same hash table; some sequential search (a probe sequence) is used to identify the free slot among the possible locations
Double hashing – uses a secondary hash function to map a record to its alternate overflow location
None of these directly satisfy our hardware requirements, which are:
– The expected number of collisions should be zero
– Memory overhead should be limited, including overhead due to multi-port memory (which essentially multiplies overhead by the number of ports)
– All memory lookups / searches should be performed in parallel in a single cycle