Minimum Edit Distance Definition of Minimum Edit Distance

Date posted: 04-Jan-2016

Information Extraction

Minimum Edit Distance: Definition of Minimum Edit Distance

Dan Jurafsky

How similar are two strings?

  • Spell correction
      • The user typed "graffe"
      • Which is closest? graf, graft, grail, giraffe
  • Computational Biology
      • Align two sequences of nucleotides:

            AGGCTATCACCTGACCTCCAGGCCGATGCCC
            TAGCTATCACGACCGCGGTCGATTTGCCCGAC

      • Resulting alignment:

            -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
            TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

  • Also for Machine Translation, Information Extraction, Speech Recognition

Edit Distance

  • The minimum edit distance between two strings is the minimum number of editing operations
      • Insertion
      • Deletion
      • Substitution
    needed to transform one into the other.

Minimum Edit Distance

  • Two strings and their alignment (INTENTION and EXECUTION, with * marking a gap):

        I N T E * N T I O N
        | | | | | | | | | |
        * E X E C U T I O N

  • If each operation has a cost of 1, the distance between these is 5
  • If substitutions cost 2 (Levenshtein), the distance between them is 8

Alignment in Computational Biology

  • Given a sequence of bases:

        AGGCTATCACCTGACCTCCAGGCCGATGCCC
        TAGCTATCACGACCGCGGTCGATTTGCCCGAC

  • An alignment:

        -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
        TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

  • Given two sequences, align each letter to a letter or gap

Other uses of Edit Distance in NLP

  • Evaluating Machine Translation and speech recognition

        R  Spokesman confirms     senior government adviser was shot
        H  Spokesman said     the senior            adviser was shot dead
                     S        I          D                           I

  • Named Entity Extraction and Entity Coreference
      • "IBM Inc. announced today" and "IBM profits"
      • "Stanford President John Hennessy announced yesterday" and "for Stanford University President John Hennessy"

How to find the Min Edit Distance?

  • Searching for a path (sequence of edits) from the start string to the final string:
      • Initial state: the word we're transforming
      • Operators: insert, delete, substitute
      • Goal state: the word we're trying to get to
      • Path cost: what we want to minimize: the number of edits

Minimum Edit as Search

  • But the space of all edit sequences is huge!
      • We can't afford to navigate naïvely
      • Lots of distinct paths wind up at the same state.
          • We don't have to keep track of all of them
          • Just the shortest path to each of those revisited states.

Defining Min Edit Distance

  • For two strings
      • X of length n
      • Y of length m
  • We define D(i,j)
      • the edit distance between X[1..i] and Y[1..j]
      • i.e., the first i characters of X and the first j characters of Y
      • The edit distance between X and Y is thus D(n,m)
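The definition of D(i,j) can be turned almost directly into a memoized recursion, which also illustrates the point about revisited states: each state (i, j) is solved once and its best cost is reused. The sketch below is illustrative rather than taken from the slides. The name `edit_distance_topdown` and the `sub_cost` parameter are assumptions, with `sub_cost=2` matching the Levenshtein variant described above. It relies on the observation that the last step of any edit path is a deletion, an insertion, or a match/substitution.

```python
from functools import lru_cache


def edit_distance_topdown(x: str, y: str, sub_cost: int = 2) -> int:
    """Return D(n, m), the minimum edit distance between x and y."""

    @lru_cache(maxsize=None)
    def D(i: int, j: int) -> int:
        # D(i, j): cost of turning the first i chars of x into the first j chars of y.
        if i == 0:
            return j  # insert the first j characters of y
        if j == 0:
            return i  # delete the first i characters of x
        same = x[i - 1] == y[j - 1]
        return min(
            D(i - 1, j) + 1,                              # delete x[i]
            D(i, j - 1) + 1,                              # insert y[j]
            D(i - 1, j - 1) + (0 if same else sub_cost),  # match or substitute
        )

    return D(len(x), len(y))


print(edit_distance_topdown("intention", "execution"))              # 8
print(edit_distance_topdown("intention", "execution", sub_cost=1))  # 5
```

Python's recursion limit makes this form suitable only for short strings.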

Minimum Edit Distance: Computing Minimum Edit Distance

Dynamic Programming for Minimum Edit Distance

  • Dynamic programming: a tabular computation of D(n,m)
  • Solving problems by combining solutions to subproblems.
  • Bottom-up
      • We compute D(i,j) for small i,j
      • And compute larger D(i,j) based on previously computed smaller values
      • i.e., compute D(i,j) for all i (0 ≤ i ≤ n) and j (0 ≤ j ≤ m)

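Defining Min Edit Distance (Levenshtein)

  • Initialization:

$$D(i,0) = i, \qquad D(0,j) = j$$

  • Recurrence relation: for each i = 1…N and each j = 1…M (N is the length of X, M the length of Y),

$$
D(i,j) = \min
\begin{cases}
D(i-1,j) + 1 \\
D(i,j-1) + 1 \\
D(i-1,j-1) +
  \begin{cases}
  2 & \text{if } X(i) \neq Y(j) \\
  0 & \text{if } X(i) = Y(j)
  \end{cases}
\end{cases}
$$

  • Termination: D(N,M) is the distance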
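The initialization, recurrence, and termination steps above can be sketched as a bottom-up table fill. This is a minimal illustration under the slide's cost scheme (insertion and deletion cost 1, substitution cost 2). The function name `min_edit_distance_table` and the `sub_cost` parameter are assumptions, not names from the slides.

```python
def min_edit_distance_table(x: str, y: str, sub_cost: int = 2) -> list[list[int]]:
    """Fill and return the full table D, where D[i][j] is the distance
    between the first i characters of x and the first j characters of y."""
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]

    # Initialization: D(i,0) = i and D(0,j) = j.
    for i in range(1, n + 1):
        D[i][0] = i
    for j in range(1, m + 1):
        D[0][j] = j

    # Recurrence: each cell depends only on its three previously filled neighbours.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if x[i - 1] == y[j - 1] else sub_cost
            D[i][j] = min(
                D[i - 1][j] + 1,        # deletion
                D[i][j - 1] + 1,        # insertion
                D[i - 1][j - 1] + sub,  # substitution or match
            )
    return D


table = min_edit_distance_table("intention", "execution")
print(table[-1][-1])  # termination: D(N,M) = 8
```

Returning the whole table rather than only the final cell keeps the intermediate values available for inspection.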

The Edit Distance Table

Initialized table (INTENTION on the vertical axis, EXECUTION on the horizontal axis):

    N   9
    O   8
    I   7
    T   6
    N   5
    E   4
    T   3
    N   2
    I   1
    #   0  1  2  3  4  5  6  7  8  9
        #  E  X  E  C  U  T  I  O  N

Completed table:

    N   9  8  9 10 11 12 11 10  9  8
    O   8  7  8  9 10 11 10  9  8  9
    I   7  6  7  8  9 10  9  8  9 10
    T   6  5  6  7  8  9  8  9 10 11
    N   5  4  5  6  7  8  9 10 11 10
    E   4  3  4  5  6  7  8  9 10  9
    T   3  4  5  6  7  8  7  8  9  8
    N   2  3  4  5  6  7  8  7  8  7
    I   1  2  3  4  5  6  7  6  7  8
    #   0  1  2  3  4  5  6  7  8  9
        #  E  X  E  C  U  T  I  O  N

Minimum Edit Distance: Backtrace for Computing Alignments

Computing alignments

  • Edit distance isn't sufficient
      • We often need to align each character of the two strings to each other
  • We do this by keeping a "backtrace"
  • Every time we enter a cell, remember where we came from
  • When we reach the end,
      • Trace back the path from the upper right corner to read off the alignment

MinEdit with Backtrace

Adding Backtrace to Minimum Edit Distance

  • Base conditions: D(i,0) = i, D(0,j) = j
  • Termination: D(N,M) is the distance
  • Recurrence relation: for each i = 1…N and each j = 1…M,

$$
D(i,j) = \min
\begin{cases}
D(i-1,j) + 1 & \text{(deletion)} \\
D(i,j-1) + 1 & \text{(insertion)} \\
D(i-1,j-1) +
  \begin{cases}
  2 & \text{if } X(i) \neq Y(j) \\
  0 & \text{if } X(i) = Y(j)
  \end{cases} & \text{(substitution)}
\end{cases}
$$

$$
\mathrm{ptr}(i,j) =
\begin{cases}
\text{LEFT} & \text{(insertion)} \\
\text{DOWN} & \text{(deletion)} \\
\text{DIAG} & \text{(substitution)}
\end{cases}
$$

The Distance Matrix (slide adapted from Serafim Batzoglou)

  • Axes: x_0 … x_N and y_0 … y_M
  • Every non-decreasing path from (0,0) to (M,N) corresponds to an alignment of the two sequences
  • An optimal alignment is composed of optimal subalignments

Result of Backtrace

  • Two strings and their alignment:

        I N T E * N T I O N
        | | | | | | | | | |
        * E X E C U T I O N

Performance

  • Time: O(nm)
  • Space: O(nm)
  • Backtrace: O(n+m)
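The backtrace procedure can be sketched as follows: fill the table while recording a pointer per cell, then walk the pointers back from the final cell to read off the alignment. This is an illustrative sketch, not the slides' own code. The name `min_edit_alignment`, the `*` gap symbol, and the tie-breaking order (DIAG, then DOWN, then LEFT) are assumptions, so other equally cheap alignments may exist.

```python
def min_edit_alignment(x: str, y: str, sub_cost: int = 2):
    """Return (distance, aligned_x, aligned_y), with '*' marking a gap."""
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]

    # Base conditions: the first column is all deletions, the first row all insertions.
    for i in range(1, n + 1):
        D[i][0], ptr[i][0] = i, "DOWN"
    for j in range(1, m + 1):
        D[0][j], ptr[0][j] = j, "LEFT"

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = D[i - 1][j - 1] + (0 if x[i - 1] == y[j - 1] else sub_cost)
            down = D[i - 1][j] + 1   # deletion
            left = D[i][j - 1] + 1   # insertion
            best = min(diag, down, left)
            D[i][j] = best
            # Remember where we came from (ties broken DIAG, DOWN, LEFT).
            ptr[i][j] = "DIAG" if best == diag else ("DOWN" if best == down else "LEFT")

    # Trace back from (n, m) to (0, 0); this takes at most n + m steps.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "DIAG":
            top.append(x[i - 1]); bottom.append(y[j - 1]); i -= 1; j -= 1
        elif move == "DOWN":
            top.append(x[i - 1]); bottom.append("*"); i -= 1
        else:  # LEFT
            top.append("*"); bottom.append(y[j - 1]); j -= 1
    return D[n][m], "".join(reversed(top)), "".join(reversed(bottom))


distance, top_row, bottom_row = min_edit_alignment("intention", "execution")
print(distance)
print(top_row)
print(bottom_row)
```

The pointer table is what accounts for the O(nm) space, while the trace-back loop itself visits at most n + m cells.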

Minimum Edit Distance: Weighted Minimum Edit Distance

Weighted Edit Distance

  • Why would we add weights to the computation?
      • Spell correction: some letters are more likely to be mistyped than others
      • Biology: certain kinds of deletions or insertions are more likely than others

Confusion matrix for spelling errors

Weighted Min Edit Distance

  • Initialization:

$$D(0,0) = 0$$
$$D(i,0) = D(i-1,0) + \mathrm{del}[x(i)], \qquad 1 \le i \le N$$
$$D(0,j) = D(0,j-1) + \mathrm{ins}[y(j)], \qquad 1 \le j \le M$$

  • Recurrence relation:

$$
D(i,j) = \min
\begin{cases}
D(i-1,j) + \mathrm{del}[x(i)] \\
D(i,j-1) + \mathrm{ins}[y(j)] \\
D(i-1,j-1) + \mathrm{sub}[x(i),y(j)]
\end{cases}
$$

  • Termination: D(N,M) is the distance
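A minimal sketch of the weighted recurrence, with the del, ins, and sub tables passed in as functions. The function names and the toy cost scheme (vowel-for-vowel substitutions cost 0.5, everything else uses the Levenshtein costs) are hypothetical and only illustrate how weights plug in; real weights would come from something like a confusion matrix.

```python
from typing import Callable


def weighted_edit_distance(
    x: str,
    y: str,
    del_cost: Callable[[str], float],
    ins_cost: Callable[[str], float],
    sub_cost: Callable[[str, str], float],
) -> float:
    """Weighted minimum edit distance D(N, M) between x and y."""
    n, m = len(x), len(y)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]

    # Initialization: accumulate deletion costs down the first column
    # and insertion costs along the first row.
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + del_cost(x[i - 1])
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + ins_cost(y[j - 1])

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j] + del_cost(x[i - 1]),
                D[i][j - 1] + ins_cost(y[j - 1]),
                D[i - 1][j - 1] + sub_cost(x[i - 1], y[j - 1]),
            )
    return D[n][m]


VOWELS = set("aeiou")


def toy_sub(a: str, b: str) -> float:
    # Hypothetical weights: confusing one vowel for another is cheap.
    if a == b:
        return 0.0
    return 0.5 if a in VOWELS and b in VOWELS else 2.0


def unit(_: str) -> float:
    # Every insertion or deletion costs 1.
    return 1.0


print(weighted_edit_distance("cat", "cut", unit, unit, toy_sub))  # 0.5
```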

Where did the name, dynamic programming, come from?

"The 1950s were not good years for mathematical research. [the] Secretary of Defense … had a pathological fear and hatred of the word, research …

I decided therefore to use the word, 'programming'.

I wanted to get across the idea that this was dynamic, this was multistage … I thought, let's take a word that has an absolutely precise meaning, namely dynamic … it's impossible to use the word, dynamic, in a pejorative sense. Try thinking of some combination that will possibly give it a pejorative meaning. It's impossible.

Thus, I thought dynamic programming was a good name. It was something not even a Congressman could object to."

Richard Bellman, "Eye of the Hurricane: an autobiography", 1984.

Minimum Edit Distance: Minimum Edit Distance in Computational Biology

Sequence Alignment

    AGGCTATCACCTGACCTCCAGGCCGATGCCC
    TAGCTATCACGACCGCGGTCGATTTGCCCGAC

    -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
    TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Why sequence alignment?

  • Comparing genes or regions from different species
      • to find important regions
      • determine function
      • uncover evolutionary forces
  • Assembling fragments to sequence DNA
  • Comparing individuals to look for mutations

Alignments in two fields

  • In Natural Language Processing
      • We generally talk about distance (minimized)
      • And weights
  • In Computational Biology
      • We generally talk about similarity (maximized)
      • And scores

The Needleman-Wunsch Algorithm

  • Initialization:

$$D(i,0) = -i \cdot d, \qquad D(0,j) = -j \cdot d$$

  • Recurrence relation (a maximization, since the scores measure similarity):

$$
D(i,j) = \max
\begin{cases}
D(i-1,j) - d \\
D(i,j-1) - d \\
D(i-1,j-1) + s[x(i),y(j)]
\end{cases}
$$

  • Termination: D(N,M) is the optimal alignment score
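The Needleman-Wunsch steps above can be sketched as below. The gap penalty `d` and the scoring function `s` follow the slide's symbols. The function name and the default scores (+1 for a match, -1 for a mismatch, d = 1) are assumptions chosen for illustration, not values from the slides.

```python
from typing import Callable


def simple_score(a: str, b: str) -> int:
    # Hypothetical substitution scores: +1 for a match, -1 for a mismatch.
    return 1 if a == b else -1


def needleman_wunsch_score(
    x: str,
    y: str,
    s: Callable[[str, str], int] = simple_score,
    d: int = 1,
) -> int:
    """Return the optimal global alignment score of x and y."""
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]

    # Initialization: leading gaps are penalized linearly.
    for i in range(1, n + 1):
        D[i][0] = -i * d
    for j in range(1, m + 1):
        D[0][j] = -j * d

    # Recurrence: maximize similarity instead of minimizing distance.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = max(
                D[i - 1][j] - d,                            # gap in y
                D[i][j - 1] - d,                            # gap in x
                D[i - 1][j - 1] + s(x[i - 1], y[j - 1]),    # align x[i] with y[j]
            )
    return D[n][m]


print(needleman_wunsch_score("ACGT", "AGT"))  # 2: three matches minus one gap
```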

The Needleman-Wunsch Matrix (slide adapted from Serafim Batzoglou)

  • Axes: x_1 … x_M and y_1 … y_N
  • (Note that the origin is at the upper left.)

A variant of the basic algorithm (slide from Serafim Batzoglou)

  • Maybe it is OK to have an unlimited # of gaps in the beginning and end:

        ----------CTATCACCTGACCTCCAGGCCGATGCCCCTTCCGGC
        GCGAGTTCATCTATCAC--GACCGC--GGTCG--------------

  • If so, we don't want to penalize gaps at the ends

Different types of overlaps (slide from Serafim Batzoglou)

  • Example: 2 overlapping "reads" from a sequencing project
  • Example: search for a mouse gene within a human chromosome

The Overlap Detection variant (slide from Serafim Batzoglou)

Changes:

  • Initialization: for all i, j,

$$F(i,0) = 0, \qquad F(0,j) = 0$$

  • Termination:

$$
F_{OPT} = \max
\begin{cases}
\max_i F(i,N) \\
\max_j F(M,j)
\end{cases}
$$

The Local Alignment Problem (slide from Serafim Batzoglou)

  • Given two strings x = x_1…x_M, y = y_1…y_N
  • Find substrings x', y' whose similarity (optimal global alignment value) is maximum

        x = aaaacccccggggtta
        y = ttcccgggaaccaacc

The Smith-Waterman algorithm

  • Idea: Ignore badly aligning regions

Modifications to Needleman-Wunsch (slide from Serafim Batzoglou):

  • Initialization:

$$F(0,j) = 0, \qquad F(i,0) = 0$$

  • Iteration:

$$
F(i,j) = \max
\begin{cases}
0 \\
F(i-1,j) - d \\
F(i,j-1) - d \\
F(i-1,j-1) + s(x_i, y_j)
\end{cases}
$$

  • Termination:
      • If we want the best local alignment: F_OPT = max_{i,j} F(i,j); find F_OPT and trace back
      • If we want all local alignments scoring > t: for all i, j, find F(i,j) > t, and trace back?
        Complicated by overlapping local alignments
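Both the overlap detection variant and Smith-Waterman reuse the Needleman-Wunsch recurrence and change only the initialization and termination (Smith-Waterman also adds the 0 floor). The sketch below puts them side by side. The function names, the +1/-1 match/mismatch scores, and d = 1 are assumptions for illustration. In `overlap_score` the maxima include row 0 and column 0, so an empty overlap scores 0, which is one reading of the termination rule.

```python
def _s(a: str, b: str) -> int:
    # Hypothetical scores: +1 for a match, -1 for a mismatch.
    return 1 if a == b else -1


def overlap_score(x: str, y: str, d: int = 1) -> int:
    """Overlap detection: gaps at either end of the alignment are free."""
    M, N = len(x), len(y)
    # Initialization: F(i,0) = 0 and F(0,j) = 0, so leading gaps cost nothing.
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            F[i][j] = max(
                F[i - 1][j] - d,
                F[i][j - 1] - d,
                F[i - 1][j - 1] + _s(x[i - 1], y[j - 1]),
            )
    # Termination: best cell in the last column or the last row.
    last_col = max(F[i][N] for i in range(M + 1))
    last_row = max(F[M][j] for j in range(N + 1))
    return max(last_col, last_row)


def smith_waterman_score(x: str, y: str, d: int = 1):
    """Local alignment: return (F_OPT, (i, j)) for the best-scoring cell."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    best, best_cell = 0, (0, 0)
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            F[i][j] = max(
                0,  # restart: ignore a badly aligning prefix
                F[i - 1][j] - d,
                F[i][j - 1] - d,
                F[i - 1][j - 1] + _s(x[i - 1], y[j - 1]),
            )
            if F[i][j] > best:
                best, best_cell = F[i][j], (i, j)
    return best, best_cell


print(overlap_score("AAACCC", "CCCGGG"))        # 3: suffix CCC overlaps prefix CCC
print(smith_waterman_score("ATCAT", "ATTATC"))  # (3, (3, 6))
```

A trace-back from the returned cell, stopping at the first 0, would recover the aligned substrings. It is omitted here to keep the sketch short.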

Local alignment example (slide from Serafim Batzoglou)

    X = ATCAT
    Y = ATTATC

Let: m = 1 (1 point for match), d = 1 (-1 point for del/ins/sub)

           A  T  T  A  T  C
        0  0  0  0  0  0  0
     A  0  1  0  0  1  0  0
     T  0  0  2  1  0  2  0
     C  0  0  1  1  0  1  3
     A  0  1  0  0  2  1  2
     T  0  0  2  0  1  3  2

Local alignment example

ATTATC0000000A