Efficient Approximate Entity Extraction with Edit Distance Constraints

Presented by: Aneeta Kolhe

Introduction:

• Named Entity Recognition finds approximate matches in text.

• Important task for information extraction and integration, text mining and also for web search.

Problem: approximate dictionary matching.

Previous solution – Token-based similarity constraints

Proposed solution – Neighborhood generation method

The token-based approach uses Jaccard coefficient similarity.

It may miss some matches.

It may result in too many matches.

For example, given the entity “al-qaida”: “al-qaeda” or “al-qa’ida” won’t be matched unless we use a low Jaccard similarity threshold of 0.33.

At that threshold, “al-qaeda” will also match “al gore” as well as “al pacino”.

Hence we use edit distance
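The 0.33 figure can be reproduced with a small sketch. It assumes a tokenizer that splits on hyphens and whitespace (the slides do not say how tokens are formed), so `tokens` and `jaccard` are illustrative names only.

```python
import re

def tokens(s):
    """Split on hyphens and whitespace (an assumed tokenizer)."""
    return {t for t in re.split(r"[-\s]+", s.lower()) if t}

def jaccard(a, b):
    """Jaccard coefficient of the two token sets."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0

entity = "al-qaida"
for candidate in ["al-qaeda", "al-qa’ida", "al gore", "al pacino"]:
    # Every candidate shares only the token "al" with the entity: 1 of 3 tokens.
    print(f"{candidate!r}: {jaccard(entity, candidate):.2f}")
```

Under this tokenizer the two spelling variants and the two unrelated names all score the same, which is the weakness the slide points out.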

Problem Definition:

Given: a document D, a dictionary E of entities, and an edit distance threshold τ

To find: all substrings in D that are within edit distance τ of one of the entities in E

Solution: Iterate through all the valid substrings of the document D

Issue a similarity selection query to the dictionary to retrieve the set of entities that satisfy the constraint.

Consider each substring as a query segment.
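As a point of reference, the brute-force version of this idea might look like the sketch below. It is only a baseline under assumed names (`edit_distance`, `naive_extract`): it scans the dictionary directly instead of issuing an indexed similarity query, which is exactly the cost the rest of the talk tries to avoid.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]

def naive_extract(doc, entities, tau):
    """Return (start, substring, entity) for every substring within tau edits."""
    results = []
    for e in entities:
        # Only substrings whose length differs from |e| by at most tau can match.
        for length in range(max(1, len(e) - tau), len(e) + tau + 1):
            for start in range(len(doc) - length + 1):
                seg = doc[start:start + length]
                if edit_distance(seg, e) <= tau:
                    results.append((start, seg, e))
    return results

print(naive_extract("the al-qaeda group", ["al-qaida"], 1))
```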

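Partition each entity into kτ pieces: if the entity and a query segment are within edit distance τ, there exists at least one partition with at most one edit error.

Select kτ = ⌈(τ + 1)/2⌉. With that many partitions, τ errors are too few to put two or more errors into every partition.

Example: s = [ abcdefghijkl ], s’ = [ axxbcdefghxijkl ], τ = 3, kτ = 2

s = [ abcdef ], [ ghijkl ]

s’ = [ axxbcde ], [ fghxijkl ]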
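A minimal sketch of the partitioning step, for illustration. The function names are hypothetical, and giving any leftover characters to the leftmost partitions is an assumption; the slides only show an evenly divisible example.

```python
def num_partitions(tau):
    """k_tau = ceil((tau + 1) / 2), written with integer arithmetic."""
    return (tau + 2) // 2

def partition(s, k):
    """Split s into k contiguous pieces of near-equal length."""
    base, extra = divmod(len(s), k)
    parts, start = [], 0
    for i in range(k):
        end = start + base + (1 if i < extra else 0)
        parts.append(s[start:end])
        start = end
    return parts

k = num_partitions(3)
print(k, partition("abcdefghijkl", k))
```

For τ = 3 this gives two partitions, [ abcdef ] and [ ghijkl ], as in the example.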

Example: shifting the first partition [ abcdef ] by 2 => [ cdefgh ]; scaling it by −1 => [ cdefg ]

Transformation rules

• First partition: we only need to consider scaling within the range [−2, 2].

• Last partition: we only need to consider the combination of the same amount of shifting and scaling within the range [−τ, τ] (so that the last character is always included in the resulting substring).

• For the rest of the partitions: we need to consider shifting within the range [−τ, τ] and scaling within the range [−2, 2].
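The three rules can be sketched as follows. This reading assumes that shifting by x moves both ends of a partition by x and that scaling by y then moves only the right end by y; for the last partition a shift of x is paired with a scale of −x so the right end stays on the last character. The function names are hypothetical, and the sketch is limited to τ ≥ 2 so that there are at least two partitions.

```python
def partition_bounds(m, k):
    """Half-open (start, end) bounds of k near-equal partitions of range(m)."""
    base, extra = divmod(m, k)
    bounds, start = [], 0
    for i in range(k):
        end = start + base + (1 if i < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds

def partition_variations(s, tau):
    """List (substring, partition id) pairs produced by the three rules."""
    if tau < 2:
        raise ValueError("this sketch assumes tau >= 2 (at least two partitions)")
    k, m = (tau + 2) // 2, len(s)
    variations = []
    for pid, (b, e) in enumerate(partition_bounds(m, k), start=1):
        if pid == 1:        # first partition: scaling only
            moves = [(0, scale) for scale in range(-2, 3)]
        elif pid == k:      # last partition: shift x combined with scale -x
            moves = [(shift, -shift) for shift in range(-tau, tau + 1)]
        else:               # intermediate partitions: every shift with every scale
            moves = [(shift, scale)
                     for shift in range(-tau, tau + 1)
                     for scale in range(-2, 3)]
        for shift, scale in moves:
            nb, ne = b + shift, e + shift + scale
            if 0 <= nb < ne <= m:   # keep only variations that lie inside s
                variations.append((s[nb:ne], pid))
    return variations

for variation in partition_variations("abcdefghijkl", 3):
    print(variation)
```

For s = [ abcdefghijkl ] and τ = 3 this yields 5 variations of the first partition and 7 of the last.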

Number of variations per entity:

• 1st partition: 5 variations

• intermediate partitions: 5·(2τ + 1) variations each

• last partition: (2τ + 1) variations

Total number of 1-variants generated = O(mτ²).
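The per-partition counts add up as in the short sketch below. The helper name is hypothetical, the result is an upper bound (variations that would fall outside the entity are not generated), and it assumes at least two partitions, i.e. τ ≥ 2.

```python
def max_partition_variations(tau):
    """Upper bound on partition variations per entity, assuming tau >= 2."""
    k = (tau + 2) // 2                    # k_tau = ceil((tau + 1) / 2)
    first = 5                             # scaling in [-2, 2]
    middle = (k - 2) * 5 * (2 * tau + 1)  # shifting in [-tau, tau] times scaling in [-2, 2]
    last = 2 * tau + 1                    # paired shift and scale in [-tau, tau]
    return first + middle + last

print(max_partition_variations(3))  # tau = 3: 5 + 0 + 7
print(max_partition_variations(5))  # tau = 5: 5 + 55 + 11
```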

s = [ abcdef ], [ ghijkl ]

s’ = [ axxbcde ], [ fghxijkl ]

Partition variations of s:

<[ abcd ], 1> <[ abcde ], 1> <[ abcdef ], 1> <[ abcdefg ], 1> <[ abcdefgh ], 1>

<[ jkl ], 2> <[ ijkl ], 2> <[ hijkl ], 2> <[ ghijkl ], 2> <[ fghijkl ], 2> <[ efghijkl ], 2> <[ defghijkl ], 2>

The second partition of s’, [ fghxijkl ], has a 1-variant match with the partition variation [ fghijkl ] generated from the second partition of s.
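To see why these two strings count as a 1-variant match, the sketch below assumes that the k-variant family of a string is the set of strings obtained by deleting at most k of its characters (a deletion neighborhood). That definition is an assumption here; the slides use the term without spelling it out. `gen_variants` mirrors the GenVariants name used later in the deck.

```python
from itertools import combinations

def gen_variants(s, k):
    """All strings obtained from s by deleting at most k characters."""
    variants = set()
    for d in range(min(k, len(s)) + 1):
        for drop in combinations(range(len(s)), d):
            variants.add("".join(c for i, c in enumerate(s) if i not in drop))
    return variants

query_partition = "fghxijkl"    # second partition of s'
entity_variation = "fghijkl"    # variation of the second partition of s
shared = gen_variants(query_partition, 1) & gen_variants(entity_variation, 1)
print(shared)  # {'fghijkl'}
```

Deleting the x from the query partition yields the entity variation itself, so the two families intersect.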

If a partition (variation) is longer than a prefix length lp, we only use its lp-prefix to generate its 1-variants.

Assume lp is set to 3. Then 1-variants are generated from only the following prefixes:

<[ abc ], 1> <[ ghi ], 2> <[ hij ], 2> <[ fgh ], 2>

By setting lp ≤ m/kτ − 2, the total number of 1-variants generated is further reduced to O(lp τ²).

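Index short and long entities in the dictionary separately, and store them in two inverted indexes, Ishort and Ilong.

For each entity whose length is smaller than kτ·lp + τ, the τ-variant family of its lp-prefix is generated and indexed in Ishort.

For each entity whose length is at least kτ·lp, the lp-prefix of each partition variation is used to generate its 1-variant family, which will be indexed in Ilong.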

Algorithm: BuildIndex (E, τ, lp)
for each e ∈ E do
    if |e| < kτ·lp + τ then
        V <- GenVariants(e[1 .. min(lp, |e|)], τ); /* The GenVariants(s, k) function generates the k-variant family of string s */
        for each v ∈ V do
            Ishort[v] <- Ishort[v] ∪ { e };
    if |e| ≥ kτ·lp then
        P <- the set of kτ partitions of e;
        for each i-th partition p ∈ P do
            PT <- TransformPartition(p); /* according to the three transformation rules in Section 3.1 */
            for each partition variation pT ∈ PT do
                V <- GenVariants(pT[1 .. lp], 1);
                for each v ∈ V do
                    Ilong[v] <- Ilong[v] ∪ { <e, i> };
return (Ishort, Ilong)
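A runnable approximation of BuildIndex is sketched below. It rests on several assumptions that the slides do not confirm: variant families are deletion neighborhoods, shifting moves both ends of a partition while scaling moves only the right end, leftover characters go to the leftmost partitions, and τ ≥ 2. All Python names are illustrative.

```python
from collections import defaultdict
from itertools import combinations

def gen_variants(s, k):
    """Assumed k-variant family: s with at most k characters deleted."""
    out = set()
    for d in range(min(k, len(s)) + 1):
        for drop in combinations(range(len(s)), d):
            out.add("".join(c for i, c in enumerate(s) if i not in drop))
    return out

def partition_variations(e, tau, k):
    """Yield (pid, variation) following the three transformation rules."""
    m = len(e)
    base, extra = divmod(m, k)
    start = 0
    for pid in range(1, k + 1):
        end = start + base + (1 if pid <= extra else 0)
        if pid == 1:        # first partition: scaling only
            moves = [(0, sc) for sc in range(-2, 3)]
        elif pid == k:      # last partition: right end stays fixed
            moves = [(sh, -sh) for sh in range(-tau, tau + 1)]
        else:               # intermediate partitions
            moves = [(sh, sc) for sh in range(-tau, tau + 1) for sc in range(-2, 3)]
        for sh, sc in moves:
            nb, ne = start + sh, end + sh + sc
            if 0 <= nb < ne <= m:
                yield pid, e[nb:ne]
        start = end

def build_index(entities, tau, lp):
    """Return (i_short, i_long): dictionaries mapping a variant to its postings."""
    if tau < 2:
        raise ValueError("this sketch assumes tau >= 2")
    k = (tau + 2) // 2
    i_short, i_long = defaultdict(set), defaultdict(set)
    for e in entities:
        if len(e) < k * lp + tau:        # short entity: tau-variants of its prefix
            for v in gen_variants(e[:min(lp, len(e))], tau):
                i_short[v].add(e)
        if len(e) >= k * lp:             # long entity: 1-variants of variation prefixes
            for pid, variation in partition_variations(e, tau, k):
                for v in gen_variants(variation[:lp], 1):
                    i_long[v].add((e, pid))
    return i_short, i_long

i_short, i_long = build_index(["abcdefghijkl", "abcd"], tau=3, lp=3)
print(sorted(i_long["fgh"]), sorted(i_short["abc"]))
```

With τ = 3 and lp = 3 the 12-character entity goes only into the long index and "abcd" only into the short one.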

Algorithm: MatchDocument (D, E, τ)
for each starting position p ∈ [1, |D| − Lmin + τ + 1] do
    SearchLong (D[p .. p + lp − 1], E, τ); /* matching entities no shorter than kτ·lp */
    SearchShort (D[p .. p + lp − 1], E, τ); /* matching entities of length in [Lmin, kτ·lp) */

Search long(s)
R <- ∅; /* holds results */
C <- ∅; /* holds candidates */
V <- GenVariants(s, 1); /* gen 1-variant family */
for each v ∈ V do
    for each <e, pid> ∈ Ilong[v] do
        C <- C ∪ { <e, pid> }; /* duplicates removed */
for each <e, pid> ∈ C do
    S <- QuerySegmentInstantiation(e, pid); /* returns the set of query segment candidates for e */
    for each seg ∈ S do
        if Verify(seg, e) = true then
            R <- R ∪ { <seg, e> };
return R
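The candidate-collection half of the long-entity search can be sketched in a few lines. The index here is a hand-built toy dictionary, the 1-variant family is again assumed to be the string plus its single-character deletions, and the instantiation and verification steps are left out because the slides do not describe them in detail.

```python
def gen_one_variants(s):
    """Assumed 1-variant family: s itself plus every single-character deletion."""
    return {s} | {s[:i] + s[i + 1:] for i in range(len(s))}

def long_candidates(segment, i_long):
    """Union of the postings of every 1-variant of the query segment."""
    candidates = set()   # a set, so duplicate <entity, pid> pairs collapse
    for v in gen_one_variants(segment):
        candidates |= i_long.get(v, set())
    return candidates

# Hypothetical index content for the running example entity.
toy_index = {
    "abc": {("abcdefghijkl", 1)},
    "fgh": {("abcdefghijkl", 2)},
}
print(long_candidates("fgxh", toy_index))  # {('abcdefghijkl', 2)}
```

Each returned pair would still have to be expanded into query segments and verified against the entity with an edit distance check.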

Search short(s)

• We need to generate the τ-variant families for each possible length l between Lmin − τ and lp.

• If the current query segment is shorter than lp, every candidate pair formed by probing the index needs to be verified.

• Otherwise, we need to perform verification for 2τ + 1 possible query segments.

For example, enumerate the 1-variants of the string [ abcdef ] from left to right.

Suppose no variant in the index starts with abc. The algorithm still enumerates the other three 1-variants containing abc.

To avoid this, set a parameter lpp to lp/2.

Consider 4 possible cases:

Prefix Match | Suffix Match | Action
True | True | enumerate all 1-variants of q[1 .. lp]
False | False | discard q as there is no match
False | True | enumerate all 1-variants of q[1 .. lpp]
True | False | enumerate all 1-variants of q[(lpp + 1) .. lp]
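One way to read the four cases is sketched below: if both halves match, enumerate everything; if only the prefix half q[1 .. lpp] matches, the single edit must sit in the suffix half q[(lpp + 1) .. lp]; if only the suffix half matches, the edit must sit in the prefix half; otherwise discard q. The function name and the deletion-based notion of a 1-variant are assumptions, not something the slides define.

```python
def pruned_one_variants(q, lpp, prefix_match, suffix_match):
    """Return the 1-variants of q that still need to be probed in the index."""
    def delete_in(lo, hi):
        # Variants of q with exactly one character deleted at a position in [lo, hi).
        return [q[:i] + q[i + 1:] for i in range(lo, hi)]

    if prefix_match and suffix_match:
        return [q] + delete_in(0, len(q))   # enumerate all 1-variants
    if prefix_match:
        return delete_in(lpp, len(q))       # edit can only be in the suffix half
    if suffix_match:
        return delete_in(0, lpp)            # edit can only be in the prefix half
    return []                               # no match possible: discard q

print(pruned_one_variants("abcdef", 3, prefix_match=False, suffix_match=True))
```

With q = [ abcdef ] and lpp = 3, a failed prefix match skips the three variants that keep abc intact, which is the saving described in the example.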

Successfully reduced the size of neighborhood

Proposed an efficient query processing algorithm

Optimized the algorithm to share computation

Avoided unnecessary variant enumeration