Trie-Join: Efficient Trie-based String Similarity Joins with Edit Distance Constraints

Jiannan Wang (Tsinghua, China), Guoliang Li (Tsinghua, China), Jianhua Feng (Tsinghua, China)

Outline
Motivation
Preliminaries
Trie-based Framework
Trie-based Algorithms
Pruning Techniques
Support Data Update
Experiment
Conclusion

Trie-Join @ VLDB 2010, 2010-9-19

Real-world Data is Rather Dirty: Microsoft Academic Search
Kenneth De Jong (http://academic.research.microsoft.com/Author/2037349.aspx)
vs.
Kenneth Dejong (http://academic.research.microsoft.com/Author/3054641.aspx)

Real-world Data is Rather Dirty: DBLP Complete Search
Typo in author: Argyrios Zymnis vs. Argyris Zymnis
Typo in title: relaxed vs. related

Similarity Joins
The similarity join is an essential operation for data integration and cleaning.
Perform a similarity join on the Name attribute (find all record pairs whose Name attributes are similar).

Table R
Id       Name             Univ.
2037349  Kenneth De Jong  George
3054641  Kenneth Dejong   George

Output: (2037349, 3054641), ...

Similarity Evaluation
We use edit distance to quantify the string similarity.
ED(r, s) is the minimum number of single-character edit operations (i.e., insertion, deletion, and substitution) needed to transform r to s.
[Figure labels: Compute ED(r, s); Verify ED(r, s) ≤ 1]
For example: ED(kobe, ebey) = 3
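The definition above can be checked with a standard dynamic-programming sketch. This is the textbook Levenshtein recurrence, not code from the talk; the names edit_distance and similar are chosen here for illustration only.

```python
def edit_distance(r: str, s: str) -> int:
    """Unit-cost edit distance (insertion, deletion, substitution)."""
    prev = list(range(len(s) + 1))           # ED("", s[:j]) = j
    for i, a in enumerate(r, start=1):
        cur = [i]                            # ED(r[:i], "") = i
        for j, b in enumerate(s, start=1):
            cur.append(min(prev[j] + 1,              # delete a from r
                           cur[j - 1] + 1,           # insert b into r
                           prev[j - 1] + (a != b)))  # match or substitute
        prev = cur
    return prev[-1]


def similar(r: str, s: str, tau: int) -> bool:
    """r and s are similar if their edit distance is at most tau."""
    return edit_distance(r, s) <= tau


print(edit_distance("kobe", "ebey"))  # 3
```

Only two rows of the matrix are kept, so the sketch uses O(|s|) memory and O(|r| × |s|) time per pair.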

1. kobe → obe (delete 'k' at the beginning)
2. obe → ebe (substitute 'o' with 'e')
3. ebe → ebey (insert 'y' at the end)
Given an edit-distance threshold τ, we say r and s are similar if ED(r, s) ≤ τ.
ED(r, s) can be computed by dynamic programming.

Problem Formulation
String Similarity Join
Input: two sets of strings R and S, and an edit-distance threshold τ
Output: all pairs ⟨r, s⟩ ∈ R × S s.t. ED(r, s) ≤ τ
For example, suppose R = S.
Input: S = {bag, ebay, bay, kobe, koby, beagy}, τ = 1
Output: {⟨bag, bay⟩, ⟨bay, ebay⟩, ⟨kobe, koby⟩}

Naive Solution
Enumerate all pairs ⟨r, s⟩ ∈ R × S and verify ED(r, s) ≤ τ.
|R| × |S| verifications are rather expensive! For |R| = |S| = 1M, it needs 10^12 verifications.

Prior Work
Signature-based methods
Basic idea: ED(r, s) ≤ τ ⇒ Sig(r) ∩ Sig(s) ≠ ∅
Framework
  Filter out the pairs ⟨r, s⟩ s.t. Sig(r) ∩ Sig(s) = ∅
  Verify the surviving pairs
Disadvantages
  Large index: the inverted index for signatures is large
  Low efficiency for short strings: cannot select high-quality signatures for short strings
  At least one parameter needs to be tuned, e.g., the parameter q for q-gram signatures

Trie Index
A trie is a tree structure in which
  each path from the root to a leaf represents a string in the data set;
  every node on the path has a label of a character in the string.

[Figure: example trie. Node 2 is the node "ba", which represents the string "ba".]
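A minimal sketch of such a trie index follows. The class and function names (TrieNode, build_trie) are hypothetical, and numbering nodes in pre-order with children visited alphabetically is an assumption made here so that node 2 comes out as "ba", as in the figure.

```python
class TrieNode:
    def __init__(self, label="", parent=None):
        self.label = label        # character on the edge into this node
        self.parent = parent
        self.children = {}        # character -> TrieNode
        self.is_string = False    # True if a string of the data set ends here
        self.node_id = None

    def prefix(self):
        """String spelled by the path from the root to this node."""
        chars = []
        node = self
        while node.parent is not None:
            chars.append(node.label)
            node = node.parent
        return "".join(reversed(chars))


def build_trie(strings):
    root = TrieNode()
    for s in strings:
        node = root
        for ch in s:
            if ch not in node.children:
                node.children[ch] = TrieNode(ch, node)
            node = node.children[ch]
        node.is_string = True
    # Assign IDs in pre-order, children in alphabetical order (assumption).
    nodes, stack = [], [root]
    while stack:
        node = stack.pop()
        node.node_id = len(nodes)
        nodes.append(node)
        for ch in sorted(node.children, reverse=True):
            stack.append(node.children[ch])
    return root, nodes


root, nodes = build_trie(["bag", "ebay", "bay", "kobe", "koby", "beagy"])
print(nodes[2].prefix())  # ba
```

Strings with a common prefix share the nodes of that prefix, which is what later makes it possible to prune or process many strings at once.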

Subtrie Pruning
[Figure label: Verify ED(r, s) ≤ 1]
Whatever * is, ED(ko*, ebay) ≥ 2, which is larger than τ = 1.
Given τ = 1, all strings starting with "ko" cannot be similar to "ebay".
Observation I (Subtrie Pruning): given τ = 1, for the string "ebay", we can prune the subtrie rooted at "ko".

Computing Active-Node Set
Active Node: node u is an active node of string s if ED(u, s) ≤ τ. E.g., node "bay" is an active node of string "ebay", but node "ko" is not.
Active-Node Set: for a string s, A_s is the set consisting of all the active nodes in the trie.
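One way to compute an active-node set with subtrie pruning is to walk the trie while carrying one row of the edit-distance matrix per node. The sketch below is a simple illustration under that assumption, not the incremental algorithm used in the talk; build_trie and active_nodes are hypothetical names, and nodes are identified by their prefix strings rather than by IDs.

```python
def build_trie(strings):
    """Trie as nested dicts: character -> child trie."""
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
    return root


def active_nodes(trie, s, tau):
    """Return (active prefixes of s, roots of the pruned subtries)."""
    active, pruned = [], []

    def visit(node, prefix, row):
        # row[j] = ED(prefix, s[:j]); the last entry is ED(prefix, s).
        if row[-1] <= tau:
            active.append(prefix)
        for ch, child in sorted(node.items()):
            new = [row[0] + 1]
            for j, b in enumerate(s, start=1):
                new.append(min(row[j] + 1, new[j - 1] + 1,
                               row[j - 1] + (ch != b)))
            if min(new) > tau:
                # Row minima never decrease further down the trie,
                # so no descendant can be active: prune the subtrie.
                pruned.append(prefix + ch)
            else:
                visit(child, prefix + ch, new)

    visit(trie, "", list(range(len(s) + 1)))
    return active, pruned


trie = build_trie(["bag", "ebay", "bay", "kobe", "koby", "beagy"])
active, pruned = active_nodes(trie, "ebay", 1)
print(active)           # ['bay', 'eba', 'ebay']
print("ko" in pruned)   # True
```

For the string "ebay" and τ = 1 the walk stops at "ko", which corresponds to Observation I.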

[Figure: trie with node IDs 0-17; A_ko = {13, 14, 15}, A_kob = {14, 15, 16, 17}]
Incremental algorithm (www09)

Trie-Search Algorithm
S: ebags, ebay, ebey, ebooks
R (indexed by trie T): bag, bay, beagy, ebay, kobe, koby
A_ebags = {}, A_ebay = {4, 11, 12}, A_ebey = {12}, A_ebooks = {}
Observation II: the strings in S share the prefix "eb". Should we also do subtrie pruning on S?

Dual Subtrie Pruning
Construct a trie for strings in both R and S.
Do subtrie pruning for strings in both R and S.
For example: given τ = 1, all strings starting with "ko" cannot be similar to the strings starting with "eb".

Trie-Traverse
We focus on the self-join, that is, R = S.
Algorithm Description
  Construct a trie index T for all strings in S (R).
  Traverse the trie T in pre-order.
  Compute the active-node set of each visited node incrementally.
  When reaching a leaf node, find the leaf nodes in its active-node set and output the similar string pairs.
Benefits of pre-order trie traversal
  Compute active-node sets incrementally.
  Discard the active-node sets of the nodes whose descendants have been visited.

Illustration of Trie-Traverse
Consider τ = 1 and S = {bag, ebay, bay, kobe, koby, beagy}.
[Figure: pre-order traversal of the trie (node IDs 0-17); active-node sets by depth: {0, 1, 9, 13}, {0, 1, 2, 5, 9, 10, 13}, {1, 2, 3, 4, 5, 6, 11}, {2, 3, 4, 7}, {2, 3, 4, 12}; output pairs]
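The traversal illustrated above can be sketched as follows. This is an illustrative reading of Trie-Traverse under stated assumptions, not the authors' implementation: active-node sets are kept as dictionaries from node to distance, a child's set is derived from its parent's set by the usual edit-distance recurrence, and all names (Node, extend, close_under_insertion, trie_traverse) are hypothetical. Nodes where a string ends are marked with a flag instead of relying on leaves.

```python
class Node:
    def __init__(self, text=""):
        self.text = text          # prefix spelled by the path from the root
        self.children = {}        # character -> Node
        self.is_string = False    # True if a string of S ends here


def build_trie(strings):
    root = Node()
    for s in strings:
        node = root
        for ch in s:
            if ch not in node.children:
                node.children[ch] = Node(node.text + ch)
            node = node.children[ch]
        node.is_string = True
    return root


def close_under_insertion(active, tau):
    """A child of an active node at distance d is reachable at distance d + 1."""
    stack = list(active)
    while stack:
        node = stack.pop()
        d = active[node] + 1
        if d > tau:
            continue
        for child in node.children.values():
            if d < active.get(child, tau + 1):
                active[child] = d
                stack.append(child)
    return active


def extend(parent_active, ch, tau):
    """Active-node set of a node, given its parent's set and its label ch."""
    new = {}

    def relax(node, d):
        if d <= tau and d < new.get(node, tau + 1):
            new[node] = d

    for node, d in parent_active.items():
        relax(node, d + 1)                               # ch left unmatched
        for label, child in node.children.items():
            relax(child, d if label == ch else d + 1)    # match / substitute
    return close_under_insertion(new, tau)


def trie_traverse(strings, tau):
    root = build_trie(strings)
    results = set()

    def visit(node, active):
        if node.is_string:
            for other in active:
                if other.is_string and other is not node:
                    results.add(tuple(sorted((node.text, other.text))))
        for ch in sorted(node.children):                 # pre-order
            visit(node.children[ch], extend(active, ch, tau))
        # 'active' goes out of scope here, so it can be discarded.

    visit(root, close_under_insertion({root: 0}, tau))
    return results


S = ["bag", "ebay", "bay", "kobe", "koby", "beagy"]
print(sorted(trie_traverse(S, 1)))
# [('bag', 'bay'), ('bay', 'ebay'), ('kobe', 'koby')]
```

Because each recursive call owns only the set of its own node, the memory in use at any time is bounded by the sets along the current root-to-node path.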

Trie-Dynamic
Basic idea: utilize the active-node symmetry property.
Illustration of Trie-Dynamic
Consider τ = 1 and S = {bag, ebay, bay, kobe, koby, beagy}.
[Figure: trie nodes annotated with active-node sets {0, 1}, {1, 2, 3, 6, 8}, {1, 2, 3, 6}, {2, 3, 8}, {2, 3}, {6, 7}, {2, 3, 7, 8}, {6, 7, 8}]
Trie-Dynamic maintains the active-node sets of all trie nodes, which requires large space!

Trie-PathStack
Motivation
Trie-Traverse uses little memory space but involves unnecessary active-node computation.
Trie-Dynamic avoids repeated active-node computation but requires large memory space.
Basic Idea
Runtime stack from the current node to the root node (τ = 1)
[Figure: trie with node IDs 0-17 and a virtual partial subtrie]

Bi-Trie-PathStack
Motivation
It is expensive to compute active-node sets for large edit-distance thresholds.
Basic Idea
Consider a string r = "arnold schwarzeneger" and τ = 5.
If a string s is similar to r within τ = 5 (i.e., ED(r, s) ≤ 5), then either r's first part "arnold sch" is similar to a prefix of s within ⌊5/2⌋ = 2, or r's second part "warzeneger" is similar to a suffix of s within ⌊5/2⌋ = 2.

Algorithm
Perform Trie-PathStack twice within the half threshold.
Verify the surviving pairs.
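The half-threshold condition can be illustrated at the level of a single string pair. The sketch below only checks the filter condition with plain dynamic programming and does not use a trie; the function names and the choice of splitting r in the middle are assumptions made for illustration.

```python
def min_prefix_distance(part: str, s: str) -> int:
    """Smallest edit distance between part and any prefix of s."""
    row = list(range(len(s) + 1))            # ED("", s[:j]) = j
    for i, a in enumerate(part, start=1):
        new = [i]
        for j, b in enumerate(s, start=1):
            new.append(min(row[j] + 1, new[j - 1] + 1,
                           row[j - 1] + (a != b)))
        row = new
    return min(row)                          # best over all prefixes of s


def passes_half_filter(r: str, s: str, tau: int) -> bool:
    """Necessary condition for ED(r, s) <= tau; survivors still need verification."""
    half = tau // 2
    mid = len(r) // 2                        # assumed split point
    first, second = r[:mid], r[mid:]
    # A suffix match is a prefix match on the reversed strings.
    return (min_prefix_distance(first, s) <= half
            or min_prefix_distance(second[::-1], s[::-1]) <= half)


r = "arnold schwarzeneger"
print(passes_half_filter(r, "arnold schwarzenegger", 5))  # True
print(passes_half_filter(r, "sylvester stallone", 5))     # False
```

A pair that fails the filter cannot be within τ, because any alignment with at most τ edits spends at most ⌊τ/2⌋ of them on one of the two parts of r.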

Pruning Techniques
Length pruning: prune v from A_u if the difference between v's range and u's range is larger than τ.
Single-branch pruning: prune v from A_u if u is the only child node of v.
Count pruning: prune v from A_u if there is only one string that has both u and v as prefixes.
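Length pruning can be sketched under the assumption that a node's "range" is the interval [minimum, maximum] of the lengths of the strings below that node; the slide does not spell this out, so treat it as an interpretation. Nodes are identified by their prefix strings here, and both function names are hypothetical.

```python
def length_ranges(strings):
    """Map every prefix (trie node) to (min_len, max_len) of the strings below it."""
    ranges = {}
    for s in strings:
        n = len(s)
        for i in range(n + 1):
            lo, hi = ranges.get(s[:i], (n, n))
            ranges[s[:i]] = (min(lo, n), max(hi, n))
    return ranges


def length_prunable(range_u, range_v, tau):
    """True if every string under u differs in length by more than tau
    from every string under v, so v can be dropped from A_u."""
    gap = max(range_u[0] - range_v[1], range_v[0] - range_u[1])
    return gap > tau


ranges = length_ranges(["bag", "ebay", "bay", "kobe", "koby", "beagy"])
print(ranges["b"])                                          # (3, 5)
print(length_prunable(ranges["bea"], ranges["bag"], 1))     # True
```

The check relies only on the fact that two strings whose lengths differ by more than τ cannot be within edit distance τ.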


Support Data Update
Incremental Similarity Join
Input: a set of strings S, a set of strings ΔS, and an edit-distance threshold τ
Output: pairs ⟨r, s⟩ s.t. ED(r, s) ≤ τ
Illustration of incremental similarity join
Consider τ = 1, S = {bag, ebay, bay, kobe, koby, beagy}, and ΔS = {eby}.
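To see what the illustration should produce, here is a brute-force sketch. It assumes the output consists of the similar pairs that involve at least one string of ΔS, which the transcript does not state explicitly, and it verifies every candidate pair directly instead of updating a trie; incremental_join is a hypothetical name.

```python
def edit_distance(r: str, s: str) -> int:
    """Unit-cost edit distance by dynamic programming."""
    prev = list(range(len(s) + 1))
    for i, a in enumerate(r, start=1):
        cur = [i]
        for j, b in enumerate(s, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1]


def incremental_join(existing, delta, tau):
    """Pairs (new string, other string) within tau; brute-force baseline."""
    pairs = set()
    seen = list(existing)
    for r in delta:
        for s in seen:
            if edit_distance(r, s) <= tau:
                pairs.add((r, s))
        seen.append(r)        # later new strings are also compared with r
    return pairs


S = ["bag", "ebay", "bay", "kobe", "koby", "beagy"]
print(incremental_join(S, ["eby"], 1))  # {('eby', 'ebay')}
```

Only the new strings are compared against the data set, so pairs already reported for S are not recomputed.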

Experiment Setup
Data sets
  English Dict: English words from the Aspell spellchecker for Cygwin
  DBLP Author: author names from the DBLP dataset
  AOL Query Log: queries from the AOL dataset
Existing algorithms
  All-Pairs-Ed [www07]
  Ed-Join [vldb08]
  Part-Enum [vldb06]
Environment
  C++, GCC 4.2.3, Ubuntu
  Intel Core 2 Quad X5450 3.00GHz processor and 4 GB memory

Comparison of Four Trie-Based Algorithms
Our trie-join algorithms outperform Trie-Search by 1~2 orders of magnitude.
Trie-PathStack performs the best.

Trie-PathStack vs. Ed-Join, All-Pairs-Ed, Part-Enum
Index Size (MB): 3~5 times smaller

Trie-PathStack vs. Ed-Join, All-Pairs-Ed, Part-Enum
Efficiency (Log-Scale)
Trie-PathStack performs the best.
Existing methods need to tune parameters.

Algorithm Selection
Trie-based algorithms vs. Ed-Join
Trie-based algorithms outperform Ed-Join for all string lengths (τ ≤ 3).
Trie-based algorithms outperform Ed-Join for short strings (avg. len ≤ 30 and τ > 3).
Ed-Join outperforms trie-based algorithms for long strings (avg. len > 30 and τ > 3).

Conclusion
Trie-based similarity-join framework
  Trie-based algorithms
  Pruning techniques
Trie-based algorithms have many advantages
  small index
  no need to tune parameters
  efficient for short strings
  support dynamic data update
Trie-based algorithms significantly outperform state-of-the-art methods on data sets with short strings (avg. length ≤ 30).

Reference
[1] Arasu et al. Efficient exact set-similarity joins. VLDB 2006.
[2] Bayardo et al. Scaling up all pairs similarity search. WWW 2007.
[3] Chaudhuri et al. A primitive operator for similarity joins in data cleaning. ICDE 2006.
[4] Gravano et al. Approximate string joins in a database (almost) for free. VLDB 2001.
[5] Sarawagi et al. Efficient set joins on similarity predicates. SIGMOD 2004.
[6] Xiao et al. Ed-join: an efficient algorithm for similarity joins with edit distance constraints. PVLDB 2008.
[7] Ji et al. Efficient interactive fuzzy keyword search. WWW 2009.
[8] Chaudhuri et al. Extending autocompletion to tolerate errors. SIGMOD 2009.

Thanks! Q&A
http://dbgroup.cs.tsinghua.edu.cn/wangjn/projects/triejoin/
