Longest Common Sub Sequence

download Longest Common Sub Sequence

of 12

Transcript of Longest Common Sub Sequence

  • 8/3/2019 Longest Common Sub Sequence

    1/12

    Algorithms for the Longest Common Subsequence ProblemD A N I E L S . H IR S C H B E R GPrinceton U ntverslty, Princeton, N ew JerseyAaS~ACT Tw o algorithms are presented that solve the longest common subsequence problem Th e firstalgorithm is applicable in the general case and requires O ( p n + n log n) tim e where p is the length o f thelongest comm on subsequence The second algorithm requires time bounded b y O ( p ( m + 1 - p) log n) In thecomm on spe oal case where p is close to m, this algorithm takes m uch less time than n ~KEY WORDSAND PHRASES' subsequence, com mon subsequen ce, algorithmCR CATEOORIES 3 73 , 3 7 9, 5 25 , 5 3 9

    I n t r o d u c t i o nW e s t a r t b y d e f i n i n g c o n v e n t i o n s a n d t e r m i n o l o g y t h a t w il l b e u s e d t h r o u g h o u t t h isp a p e r .

    S t r i n g C = c l c ~ . . . c p is a s u b s e q u e n c e o f s tr i n g A = a j a 2 " ' " a m i f t h e r e i s a m a p p i n gF : { 1 , 2 . . . . , p } ~ { 1 , 2 , . . . , m } s u c h t h a t F ( i ) = k o n l y i f c~ = a k a n d F i s a m o n o t o n es t r i c t l y i n c r e a s i n g f u n c t i o n ( i . e . F ( i ) = u , F ( ] ) = v , a n d i < j i m p l y t h a t u < v ) . C c a n b ef o r m e d b y d e le t in g m - p ( n o t n e c e s sa r i l y a d j a c e n t ) s y m b o l s f r o m A . F o r e x a m p l e ," c o u r s e " is a s u b s e q u e n c e o f " c o m p u t e r s c i e n c e . "

    S t r i n g C i s a c o m m o n s u b s e q u e n c e o f s t r i n g s A a n d B i f C i s a s u b s e q u e n c e o f A a n da l so a s u b s e q u e n c e o f B .

    S t r i n g C i s a l o ng e s t c o m m o n s u b s e q u e n c e ( a b b r e v i a t e d L C S ) o f s tr i n g A a n d B i f C isa c o m m o n s u b s e q u e n c e o f A a n d B o f m a x i m a l l e n g t h , i . e . t h e r e is n o c o m m o n s u b s e-q u e n c e o f A a n d B t h a t h a s g r e a t e r l e n g t h .

    T h r o u g h o u t t h is p a p e r , w e a s s u m e t h a t A a n d B a r e s t r i n g s o f l e n g t h s m a n d n , m _< n ,t h a t h a v e a n L C S C o f (u n k n o w n ) l e n g t h p .

    W e a s s u m e t h a t th e s y m b o l s t h a t m a y a p p e a r i n th e s e s tr in g s c o m e f r o m s o m e a l p h a b e to f s iz e t . A s y m b o l c a n b e s t o r e d i n m e m o r y b y u s i n g l o g t b i t s , w h i c h w e a s s u m e w i l l f it i no n e w o r d o f m e m o r y . S y m b o l s c a n b e c o m p a r e d ( a -< b ? ) i n o n e t i m e u n i t .

    T h e n u m b e r o f d i f f e r e n t s y m b o l s t h a t a c t u a l l y a p p e a r i n s t ri n g B is d e f i n e d t o b e s( w h i c h m u s t b e l e s s t h a n n a n d t ) .

    T h e l o n g e s t c o m m o n s u b s e q u e n c e p r o b l e m h a s b e e n s o l v e d b y u s in g a r e c u r s i o nr e l a t i o n s h i p o n t h e l e n g t h o f t h e s o l u t i o n [ 7 , 1 2 , 1 6 , 2 1] . T h e s e a r e g e n e r a l l y a p p l i c a b l ea l g o r i t h m s t h a t t a k e O ( m n ) t i m e f o r a n y i n p u t s t r in g s o f l e n g t h s m a n d n e v e n t h o u g ht h e l o w e r b o u n d o n t i m e o f O ( m n ) n e e d n o t a p p l y t o a ll i n p u t s [ 2] . W e p r e s e n ta l g o r i t h m s t h a t , d e p e n d i n g o n t h e n a t u r e o f th e I n p u t , m a y n o t r e q u i r e q u a d r a t i c t i m et o r e c o v e r a n L C S . T h e f i r st a l g o r i th m is a p p l i c a b l e i n th e g e n e r a l c a s e a n d r e q u i r e sO ( p n + n l o g n ) t i m e . T h e s e c o n d a l g o r i t h m r e q u i r e s t i m e b o u n d e d b y O ( ( m + 1 - p ) pl o g n ) . I n t h e c o m m o n s p e c i a l c a s e w h e r e p i s c l o s e t o m , t h is a l g o r i t h m t a k e s t i m eCopyright 1977, Associatton for Com puting M achinery, Inc Ge nera l permission to republish, but not forprofit , all or part o f this material ts granted provtded that A CM 's copyright notice is gl~,en and that refe rence ismade to the publication, to Its date of ~ssue, and to the fact that reprinting privileges we re granted bypermission of the Association for Computing Machinery.This research was supported by a National Science Foundation graduate fellowship and by the National ScienceFoundat ion under Grant GJ-35570.Au tho r 's present address' De partm ent of Electrical Engineering, R ice Umv erslty, Ho uston , TX 77001Journalof the Assoclauon or ComputingMachinery.Vol 24. No 4. October 1977. pp 664-675

  • 8/3/2019 Longest Common Sub Sequence

    2/12

    A l g o r i t h m s f o r t h e L o n g e s t C o m m o n S u b s e q u e n c e P r o b l e m 6 6 5m u c h l e ss t h a n n z. W e c o n c l u d e w i th r e f e r e n c e s t o o t h e r a l g o r i t h m s f o r t h e L C Sp r o b l e m t h a t m a y b e o f in t e r e st .p n A l g o r i t h mW e p r e s e n t i n t h i s s e c t i o n a l g o r i t h m A L G D , w h i c h w i l l f i n d a n L C S i n t i m e O ( p n +n l o g n ) w h e r e p is t h e l e n g t h o f t h e L C S . T h u s t h i s a l g o r i t h m m a y b e p r e f e r r e d f o ra p p l i c a t i o n s w h e r e t h e e x p e c t e d l e n g t h o f a n L C S i s s m a l l r e l a t i v e t o t h e l e n g t h s o f th ei n p u t s t r i n g s .

    S o m e p r e l i m i n a r y d e f i n i t io n s a r e a s f o ll o w s :W e r e p r e s e n t t h e c o n c a t e n a t i o n o f s t r in g s X a n d Y b y X I I Y .A i , r e p r e s e n t s t h e s t r i n g a la2 " " a , ( e l e m e n t s 1 t h r o u g h i o f s t ri n g A ) . S i m i l a r l y , t h e

    p r e f ix o f l e n g t h j o f s t ri n g B is r e p r e s e n t e d b y B u .W e d e f in e L ( i , j ) t o b e t h e l e n g t h o f th e L C S o f p r e f i x e s o f le n g t h s i a n d j o f s tr i n g s Aa n d B , i e . t h e le n g t h o f t h e L C S o r A l , a n d B o .

    ( t, j ) r e p r e s e n t s t h e p o s i t i o n s o f a , a n d b , t h e i th e l e m e n t o f s t r i n g A a n d t h e j t he l e m e n t o f s t ri n g B . W e r e f e r t o i ( / ) a s t h e t - v a l u e ( j - v a l u e ) o f (1 , j ) .W e d e f i n e { (0 , 0 )} t o b e t h e s e t o f O - c a n d i d a t e s , a n d w e d e f i n e ( i, j } to b e a k

    c a n d t d a t e ( f o r k - > 1 ) i f a , = b e a n d t h e r e e x i s t i ' a n d j ' s u c h t h a t i ' < i , j ' < j , a n d( i ' , j ' ) is a ( k - 1 ) - c a n d i d a t e . W e s a y t h a t ( i ' , j ' ) g e n e r a t e s ( i , j ) .D e f i n e a0 = b 0 = $ w h e r e $ is s o m e s y m b o l t h a t d o e s n o t a p p e a r i n s t r m g s A o r B .LEM~A 1 . F o r k -> 1 , ( t , j ) i s a k - c a n d i d a t e i f f L ( i , j ) > - k a n d a~ = b j . T h u s t h e r e i s ac o m m o n s u b s e q u e n c e o f l e ng t h k o f A l~ a n d B wP RO OF . B y i n d u c t i o n o n k . ( i , j ) I s a 1 - c a n d i d a t e i f f a , = b e ( b y d e f i n i t i o n ) , i n w h i c hc a s e L ( i , j ) n e c e s s a r i l y i s a t l e a s t 1 . T h u s t h e l e m m a i s t r u e f o r k = 1 . A s s u m e i t i s t r u ef o r k - 1 . C o n s i d e r k . I f ( t, j ) i s a k - c a n d i d a t e t h e n t h e r e e x i s t i ' < i a n d j ' < j s u c ht h a t ( i ' , j ' ) is a ( k - 1 ) - c a n d i d a t e . B y a s s u m p t i o n , t h e r e is a c o m m o n s u b s e q u e n c e D ' =d l d z " " d k - l o f A l e , a n d B ~ j,. S i n c e a ~ = b j ( ( i , j ) i s a k - c a n d i d a t e ) , D = D ' I l a ~ Is ac o m m o n s u b s e q u e n c e o f le n g t h k o f A im a n d B ~ . T h u s L( i , 1 ) -> k .C o n v e r s e l y , i f L ( i , j ) > - k a n d a~ = b j , t h e n t h e r e e x i s t i ' < i a n d j ' < j s u c h t h a t a e =b j , a n d L ( i ' , 1 ' ) = L ( i , j ) - 1 -> k - 1 . ( i ' , j ' ) is a ( k - D - c a n d i d a t e ( b y i n d u c t i v eh y p o t h e s i s ) a n d t h u s (i , j ) is a k - c a n d i d a t e . [ ]

    T h e l e n g t h o f a n L C S i s p , t h e m a x i m u m v a l u e o f k s u c h t h a t t h e r e e x i s t s a k -c a n d i d a t e . A s w e s h a l l s e e , t o r e c o v e r a n L C S , i t s u f fi c es to m a i n t a i n t h e s e q u e n c e o f a 0 -c a n d i d a t e , 1 - c a n d i d a t e . . . . , ( p - 1 ) - c a n d i d a t e , a n d a p - c a n d i d a t e s u c h th a t in th i ss e q u e n c e e a c h / - c a n d i d a t e c a n g e n e r a t e t h e ( i + 1 ) - c a n d i d a t e fo r 0 --< i < p .

    R u l e . L e t x = (x ~ ,x 2 ) a n d y = ( y a , y ~ ) b e t w o k - c a n d i d a t e s . I f x ~ > - y ~ a n d x 2 ~ - Y 2,t h e n w e s a y t h a t y r u l e s o u t x ( x ~s a s u p e r f l u o u s k - c a n d i d a t e ) s i n c e a n y ( k + 1 )-c a n d i d a t e t h a t c o u l d b e g e n e r a t e d b y x c a n a ls o b e g e n e r a t e d b y y . T h u s , f r o m t h e s e to f k - c a n d i d a t e s , w e n e e d c o n s i d e r o n l y t h o s e t h a t a r e m i n i m a l u n d e r t h e u s u a l v e c to ro r d e r i n g . N o t e t h a t l f x a n d y a r e m i n i m a l e l e m e n t s t h e n x~ < y~ i f fx 2 > Y2.

    LEMraA 2 . L e t t h e s e t of f k - c a n d t d a t e s b e {(i~, 1~)} ( r = 1 , 2 . . . . ) . W e c a n r u l e o u tc a n d i d a t e s s o t h a t ( a f t e r r e n u m b e r i n g ) t~ < i 2 < "'" a n d j ~ > j 2 > " " .P R o o v. A n y t w o k - c a n d i d a t e s ( l, j ) a n d ( t ' , j ' ) s a t i sf y o n e o f t h e f o l l o w i n g ( w i t h o u tl o s s o f g e n e r a l i t y , i - < i ' ) :

    ( 1 ) i < t ' , j - < j ' .( 2 ) i < i ' . j > j ' .( 3 ) t = i ' , j - - < j ' .( 4 ) i = t ' , j > j ' .

    I n c a s e s ( 1 ) a n d ( 3 ) ( t ' , j ' ) c a n b e r u l e d o u t ; m c a s e ( 4 ) ( i , j ) c a n b e r u l e d o u t ; a n d c a s e ( 2 )s a ti sf ie s t h e s t a t e m e n t o f t h e l e m m a . T h u s a n y s e t o f k - c a n d i d a t e s w h tc h c a n n o t b er e d u c e d b y f u r t h e r a p p l i c a t i o n o f t h e r u l e w i ll s a ti s fy th e c o n d i t i o n s t a t e d i n t h el e m m a ' . [ ]T h e s e t o f k - c a n d i d a t e s , r e d u c e d b y a p p h c a t m n o f th e r u l e s o a s t o s at is f y t h es t a t e m e n t o f L e m m a 2 , a r e t h e m i n i m a l e le m e n t s o f th e s e t o f k - c a n d i d a t e s ( s in c e n o

  • 8/3/2019 Longest Common Sub Sequence

    3/12

    6 6 6 D A N I E L $ . H I I S C H l t E I I Ge l e m e n t c a n r u l e o u t a m i n i m a l e l e m e n t ) a n d w i l l b e c a l l e d t h e s e t o f m i n i m a l k -cand i da t e s . B y L e m m a 2 , t h e r e i s a t m o s t o n e m i n i m a l k - c a n d i d a t e f o r e a c h i - va l u e .

    W e n o t e t h a t i f ( i, j ) i s a m i n i m a l k - c a n d i d a t e t h e n L ( / , j ) = k a n d ( i, ] ) is t h e k -c a n d i d a t e w i t h / - v a l u e i h a v i n g s m a ll e st j - v a l u e ] s u c h t h a t L ( i , j ) = k .LEMMA 3. For k -~ 1 , ( i, ] ) i s a m i n t m al k . ca nd t da t e i f f ] is t h e m i n i m um va l ue such

    t ha t b~ ffi a , an d l ow < ] < h i gh , w here h i gh i s t h e m i n i m um ] . va l ue o f a ll k - cand i da t e sw h ose i - va l ue i s l es s t han i ( no u pper l i m i t i f t h e re are no such k - cand i da t e s ) a nd l o w i s t h em m i m u m ] - va lue o fa l l ( k - 1 ) - cand i da te s w h ose i - va lue is l e ss t han i .P RO O F. A s s u m e t h a t ( i, ] ) is a m i n i m a l k - c a n d i d a t e . I f ] -> h i g h t h e n t h e r e i s a k -c a n d i d a t e ~ /', j ' ) s u c h t h a t i ' < i a n d ] ' = h ig h _< j. lo w s i n c e( i ' , j ' > m u s t b e g e n e r a t e d f r o m a (k - 1 ) - c a n d i d a t e a n d b ~, = a , s i n c e < i ', j ) i s a k -c a n d id a t e. A l s o # < j < h i g h . T h u s j ' s a ti sf ie s a ll t h e c o n s t r a i n t s a n d ] is n o t t h e m i n i m u mv a l u e t h a t d o e s s o , a c o n t r a d i c t i o n . [ ]W e p r e s e n t a l g o r i t h m A L G D , w h i c h , u si n g th e r e s ul ts o f L e m m a 3 , o b t a in s a n L C SC o f l e n g t h p o f i n p u t s t r in g s A a n d B i n t i m e O ( p n + n l o g n ) .T h e a l g o r i t h m i s b a s e d o n a n e f f ic i e n t r e p r e s e n t a t i o n o f t h e L m a t r ix . S i n c e L isn o n d e c r e a s i n g i n b o t h a r g u m e n t s , w e m a y d r a w c o n t o u r s i n i t s m a t r i x a s s h o w n i n t h ef o l l o w i n g e x a m p l e :

    Bc b o c b o a b o

    A

    0 , , , , ,b O0 ~,,.L) I ~ . 2 2 2 2c i ~ 1 1 1~ 1 2 2 2 2 2d 2 2 2 2 2b Ill 2 ~ 2 2 ~ 3 3 3 3b 2 3 3 ~ 4 ~ -

    T h e e n t i r e m a t r i x i s s p e c i f i e d b y i t s c o n t o u r s . T h e c o n t o u r s a r e d e s c r i b e d b y s e t s o fm i n i m a l k - c a n d i d a t e s. T h e c o n t o u r b e t w e e n L - v a l u e s o f k - 1 a n d k i s d e f i n e d b y t h e s eto f m i n im a l k - c a n d i d a t e s w h o s e e l e m e n t s a r e p o s i t io n e d a t th e c o n v e x c o r n e r s o f t h ec o n t o u r .T o k e e p t r a ck o f t h e m i n i m a l k - c a n d i d a t e s , w e u s e th e m a t ri x D . D [ k , i ] is t h e j - v a l u eo f th e u n i q u e m i n i m a l k - c a n d i d a t e h a v i n g / - v a l u e o f i o r 0 i f t h e r e i s n o s u c h m i n i m a l k -c a n d i d a t e . T h u s D [ k , i ] d e s c ri b e s t h e c o n t o u r s b y g i v in g th e n u m b e r o f t h e f ir st c o l u m no f r o w i t h a t i s i n r e g i o n k ( if t h a t n u m b e r is d i ff e r e n t f r o m D [ k , ~ - 1]) .

  • 8/3/2019 Longest Common Sub Sequence

    4/12

    A l g o r i t h m s f o r t h e L o n g e s t C o m m o n S u b s e q u e n c e P r o b l e m 6 6 7l o w c h e c k i s t h e s m a l l e s t / - v a l u e o f a ( k - 1 ) - c a n d i d a t e . F L A G h a s v a l u e 1 if f t h e r e a r e

    a n y k - c a n d i d a t e s .N B [ O ] i s t h e n u m b e r o f t i m e s s y m b o l 0 o c c u r s i n s t r i n g B . P B [ O , 1 ] . . . . .

    P B [ 0, N B [0 ] ] i s t h e o r d e r e d l is t , s m a l l e s t f ir s t , o f p o s i t i o n s i n B i n w h i c h s y m b o l 0 o c c u r s .I f t , t h e s iz e o f t h e s y m b o l a l p h a b e t , i s n o t l a r g e c o m p a r e d t o n , t h e n w e m a y i n d e x a na r r a y b y t h e b i t r e p r e s e n t a t i o n o f a s y m b o l . O t h e r w i s e , i f t > > n , t h e n w e c o n s t r u c t ab a l a n c e d b i n a r y s e a r c h t re e w h i c h p r o v i d e s a m a p p i n g f r o m s y m b o l s th a t a p p e a r in s t r in g

    B t o t h e i n te g e r s 1 th r o u g h s ( t h e r e a r e s d i f f e r e n t s y m b o l s t h a t a p p e a r i n B ) . W h e n e v e rs t ri n g e l e m e n t a, a p p e a r s a s a n a r r a y s u b s c r i p t ( a s i n N [ a , ] ) , i t s h o u l d b e u n d e r s t o o d t h a tw e a r e i n d e x i n g N b y t h e i n t e g e r s , w h i ch h a s b e e n o b t a i n e d ( d u r i n g i n i t i a l i z a t io n f o rA L G D ) f r o m t r a v e r s i n g th e s e a r c h t r e e j u s t d e s c r i b e d . I f a , d o e s n o t a p p e a r i n B , t h e nt h e i n t e g e r s , is z e r o . A n e q u i v a l e n t a s s u m p t i o n i s f o l l o w e d f o r s u b s c r i p t b~ i n s t e p 1 .A L G D ( m , n , A , B , C , p )

    1 . N B [ O ] , ~ - O f o r O = l , . , sP B [ O , O ] ~ - - O f o rO = l , . , sen[0 , 0] ~ 0; PB [0, 1] ~ 0f o r j * - 1 step 1 u n t i l n d o~ i n

    N B [ b j ] ~ N B [ b j ] + 1eB [bj , NB[b~]] ~ j

    e n d2 D [ O , i ] ~ . - O f o r z = O , . . , m

    low ch eck ~-- 03 . f o r k ~ l s t e p l d o~ g i n4 . N [ O ] ~ NB[O] for 0 = 1, , sN[0] ~ 1F L A G ~ 0

    l o w ~ D [ k - 1 , l o w c h e c k ]h i g h ~ n + 1

    5 . f o r t ~ l o w c h e c k + 1 s t ep 1 u n t i l m d obegin6 . w h i l e PB[ai , N[at] - 1] > l o w d o N [ a ~ ] ~ N [ a , ] - 17 i f h tgh > PB[a~ , N[a~]] > l o w

    t h e n beginh t g h ~ P B [ a , , N[a,]]D [ k , t ] < - - h i g hi f F L A G = 0 t h e n { l o w c h e c k ~ -- t , F L A G ~ -- 1}

    e n de l s e D [ k , , ] * - 0

    8 i f D [ k - 1 , t ] > 0 the n l o w ~ D [ k - l , t ]e n d loop of step 5

    9 . i f F L A G = 0 then go to step 10e n d loop of step 310 p c - - k - 1k ~ - . - pf o r t ~ m + 1 s t e p - l u n t i l O d oi f D [ k , ~] > 0 t h e nI~gin

    Ck ~- a ik , , - - k - 1e n d

    T h e l o o p o f s te p 3 e v a l u a t e s t h e se t o f m i n i m a l k - c a n d i d a t e s f o r k = 1 , 2 , . . . . T h e l o o po f s te p 5 e v a l u a t e s t h e s e t o f m i n i m a l k - c a n d i d a t e s , s m a l l e s t / - v a l u e f i r st , a n d f il ls i n th eD a r r a y a c c o r d i n g l y ( in t h e e x a m p l e g i v e n p r e v i o u s l y t h is is l e f t - to - r i g h t ) w h i l e s c a n n i n gt h e c h a i n s o f o c c u r r e n c e s o f a g iv e n c h a r a c t e r i n B w i t h l a r g e s t j - v a l u e f ir s t (r i g h t - to - l e f t ) .F o r e a c h i , i ca n b e t h e / - v a l u e o f a m i n i m a l k - c a n d i d a t e i f t h e r e i s a ] s a t is f y i n g t h ec o n s t ra i n ts o f L e m m a 3 . T h is i s t e s te d b y d e t e r m i n i n g t h e m i n i m u m ] - v a l u e o f s y m b o l a ,t h a t i s g r e a t e r t h a n l o w , I f t h a t v a l u e i s l e s s t h a n h t g h , t h e n ( i , ] ) i s a m i n i m a l k - c a n d i d a t e .

  • 8/3/2019 Longest Common Sub Sequence

    5/12

    6 6 8 DANIEL S. HIRSCHBERGT h e r e c a n b e n o k - c a n d i d a t e w i t h / - v a l u e l e s s t h a n o r eq u a l t o l o w c h e c k , s o th e l o o p o fs t e p 5 b e g i n s a t l o w c he c k + 1 . l o w c he c k i s s e t , i n s t e p 7 , w h e n t h e f i r s t m i n i m a l k -

    c a n d i d a t e ( t h a t h a v i n g s m a l l e s t / - v a l u e o f a ll k - c a n d i d a t e s ) is d e t e r m i n e d .LEMMA 4. A L G D eva lua tes t he correct va lues o f h igh and low (as de f ined mL e m m a 3 ) fo r de t e rmin ing wh e ther each k -cand ida te ( i , j ) i s min imal .P R o o v . high is s u p p o s e d t o b e t h e m i n i m u m ] - v a lu e o f all k - c a n d i d a t e s w i t h / - v a l u e

    l e s s t h a n i high i s i m t i a l i z e d a t n + 1 ( i . e . d o e s n o t l i m i t ) i n s t e p 4 , b e f o r e a n y k -c a n d i d a t e s h a v e b e e n g e n e r a t e d . T h e r e a f t e r , i f a n y k - c a n d i d a t e s a r e f o u n d t o b e m i n i m a l( in s t e p 7 ) , t h e n , s i n ce t h e ] - v a l u e s o f m i n i m a l k - c a n d i d a t e s d e c r e a s e a s t h e t -v a l u e si n c r e a s e , t h e m i m m u m j - v a l u e o f a ll m i n i m a l k - c a n d i d a t e s w i t h t - v a l u e le s s t h a n i w i ll b et h e ] - v a l u e o f t h e m i n i m a l k - c a n d i d a t e w i t h g r e a t e s t / - v a l u e l e ss t h a n i ( i. e . t h e l a st o n ef o u n d , s in c e w e g e n e r a t e m i n i m a l k -c a n d i d a t e s i n o r d e r o f in c r e a s i n g / - v a l u e ) . T h e j-v a l u es o f r u l ed - o u t ( n o n m i n i m a l ) k - c a n d i d a t e s c a n n o t b e s m a l l e r t h a n t h e ] - v a l u e o f th el a s t m i n i m a l k - c a n & d a t e high is u p d a t e d t o th e m o s t r e c e n t / - v a l u e e a c h t im e a n e wm i n i m a l k - c a n d i d a t e is f o u n d i n s te p 7 . T h u s high h a s v al u e a s d e f i n ed i n L e m m a 3 .lo w is s u p p o s e d t o b e th e m i n i m u m ] - v a l u e o f a ll ( k - 1 ) - c a n d i d a t e s w h o s e / - v a l u e isl e s s t h a n i . A g a i n , s i n c e j - v a l u e s d e c r e a s e a s / - v a l u e s i n c r e a s e , lo w s h o u l d b e t h e j - v a l u eo f t h e ( k - 1 ) - c a n d i d a t e w h o s e / - v a l u e is a s g r e a t a s p o s s ib l e b u t le s s t h a n i . l ow isi n i ti a l iz e d i n s t e p 4 t o b e t h e ] - v a l u e o f t h e f ir s t ( l o w e s t / - v a l u e ) ( k - 1 ) - c a n d i d a t e . A s ii n c r e a s e s , i f t h e r e w a s a m i n i m a l ( k - 1 ) - c a n d i d a t e w i t h / - v a l u e o f i , t h e n t h e m i m m u mp e r m i s s i b le j - v a l u e w i ll d e c r e a s e a n d lo w is u p d a t e d ( in s t e p 8 ) f o r t h e n e x t i t e r a ti o n . [ ]LEMMA 5 . A L G D correc tl y de t ermines t he se t o f min im al k -cand tda tes.

    P a o oF . B y L e m m a 4 , high a n d lo w a r e c o m p u t e d c o r r e c t l y . W e m u s t s h o w t h a t i nt h e l o o p o f s t e p s 5 - 8 D [ k , i] g e t s t h e m l n i m u m j - v a l u e ( 0 i f n o n e ) s u c h t h a t b~ = a~ a n dl o w < j < h i g h .

    T h e ] - v a l u e s o f s u c c e s s iv e m i n i m a l k - c a n d i d a t e s d e c r e a s e i n v a l u e s i n c e t h e i r / - v a l u e si n c r e a s e . I n l o o k i n g f o r D [ k , i ] w e l o o k f o r a m a t c h f o r s y m b o l a , in s t ri n g B , a n d w ec a n r e s t ri c t o u r a t t e n t i o n t o o c c u r r e n c e s ( j - v a l u e s ) o f s y m b o l as i n s tr i n g B t h a t a r eb e f o r e ( l e s s t h a n ) t h e l a s t o c c u r r e n c e ( ] - v a l u e ) t h a t w a s e x a m i n e d . S t e p 6 d o e s t h a t .P B [ a , o] i s t h e o r d e r e d l is t o f j - v a l u e s o f s y m b o l a , a n d N[a,] p o i n t s t o t h e s m a l l e s t] - v a l u e s ( i n P B ) o f s y m b o l at t h a t h a s b e e n e x a m i n e d . I n i t i a l l y , m s t e p 4 , N[a,] p o i n t s t ot h e l a s t o c c u r r e n c e o f s y m b o l a t . I f t h e l a s t - e x a m i n e d ] - v a l u e o f a , i s g r e a t e r t h a n l o w ,s t ep 6 s e t s N[a,] t o p o i n t t o t h e l o w e s t j - v a l u e o f a t t h a t is g r e a t e r t h a n l o w . I f t h e l a s t-e x a m i n e d ] - v a l u e o f a , is n o t g r e a t e r t h a n l o w , t h e n t h e r e c a n b e n o m i n im a l k -c a n d i d a t e f o r t h is v a l u e o f i s i n c e t h e m i n i m u m ] - v a l u e t h a t is g r e a t e r t h a n lo w e i t h e rv i o l a t e s t h e h i g h c o n s t r a i n t o r r e s u l t s i n a c a n d i d a t e t h a t c a n b e r u l e d o u t . I n t h i s c a s es t e p 6 d o e s n o t h i n g , t h e t e s t i n s t e p 7 f a i l s , a n d D [ k , i ] i s s e t t o z e r o . [ ]

    THEOREm 1. A L G D c or re ctly c o m p u t es th e L C S o f s tr in gs A a n d B .PR OO F. B y L e m m a 5 , A L G D c o r r e c t l y d e t e r m i n e s t h e s e t o f m i n i m a l k - c a n d i d a t e s .T h u s , i f t h e r e a r e a n y k - c a n d i d a t e s , a t l e as t o n e i s m i n i m a l . I f ( t, j ) i s t h e p t h m a t c h i na n L C S w h i c h is o f l e n g t h p , t h e n , b y L e m m a 1 , ( /, j ) is a p - c a n d i d a t e . T h u s t h e r e i s a tl ea s t o n e m i n im a l p - c a n & d a t e ( a n d t h e r e a r e n o ( p + D - c a n d i d a t e s ) . S t e p 1 0 o fA L G D r e c o v e r s a c o m m o n s u b s e q u e n c e o f l e n g t h p b y r e c o v e r in g a s e q u e n c e o f( /- v a lu e s o 0 m i n i m a l c a n d i d a t e s s u c h t h a t th e m i n i m a l k - c a n d i d a t e g e n e r a t e d t h em i n i m a l ( k + D - c a n d i d a t e . [ ]THEOREI~ 2. A ssu m i n g t ha t s y mb o l s c a n b e c o mp a re d i n o n e t ime u n it , A L G Drequ ires t ime o f O (p n + n log s ) , where s is the num ber o f d if f e ren t sym bol s t ha t appearin string B.P R O O F. S t e p 1 c a n b e d o n e i n t i m e O (n l o g s ) . S t e p 2 c a n b e d o n e i n ti m e O ( m ) .S t e p 3 e x e c u t e s s t e p s 4 - 9 p t i m e s . S t e p 4 t a k e s t i m e O( s ) p e r e x e c u t i o n , s - < n , f o rt o t a l t i m e l e s s t h a n o r e q u a l t o O ( p n ) . S t e p 5 e x e c u t e s s te p s 6 - 8 a t m o s t m t i m e s , at o t a l o f a t m o s t p m t i m e s . T h e w h i l e l o o p i n s t e p 6 is e x e c u t e d a t m o s t n t i m e s w i t h int h e l o o p o f s te p 5 s in c e t h e N[O] a r e n o t i n c r e a s e d w i t h in t h i s l o o p ( e a c h p o s i t i o n o f B ise x a m i n e d a t m o s t o n c e f o r e a c h v a l u e o f k ) . T h e t o t a l t i m e i n s t e p 6 is t h e r e f o r e O ( p n )

  • 8/3/2019 Longest Common Sub Sequence

    6/12

    A l g o r i t h m s f o r t h e L o n g e s t C o m m o n S u b s e q u e nc e P r o b l e m 66 9Steps 7 and 8 are done in constant time. Total time is O ( p m ) . Step 9 is done inconstant time. Total time is O ( p ) . Step 10 is done in time O ( m ) . Total execution time isthus as stated above. []

    Not e t hat for p _> O(log s ) , A L G D requires time O ( p n ) .p e l o g n A l g o r i t h mWe now c onsider a special case that often occurs in applications such as deter mini ngthe discrepancies between two files, one of which was obtained by making minoralterations to the other (and we wish to recover those alterations). We assume thatthere is an LCS of length at least m - ~ (for some given ~).

    If C is an LCS of A and B, th ere will be at most ~ eleme nts of A that do not a ppear inC. The position of each such elem ent will be called a s k i p p e d p o s i t i o n . Thus there are atmost E skipped posit ions. We define e to be ~ + 1.

    If (t,j ) is a mini mal k-ca ndid ate that can be an el emen t in an LCS (that is, a, = bj is thekth ele ment of an LCS), then k -< i -< k + (otherwise more than E positions i nA wouldbe skipped). We shall call such candidates f eas ib l e k - cand ida te s . Let h = i - k. The n 0 -

  • 8/3/2019 Longest Common Sub Sequence

    7/12

    670 DANIEL S. HIRSCHBERGI f a ll o f th e s k i p p e d A - p o s i t i o n s h a v e b e e n r e c o v e r e d , t h e n p t [ x ] i s z e r o .O t h e r w i s e , p t [ x ] i s t h e i n d e x o f K E E P t h a t h a s i n f o r m a t i o n e n a b l i n g r e c o v e r y o f t h e

    s k i p p e d A - p o s i t i o n s h a v i n g i n d i c e s s m a l l e r t h a n a a [ x ] - n s k i p [ x ] + 1 .E x a m p l e . I f p o s i t io n s 2 , 5 , 6 , 7 , 9 , 1 0 i n s t ri n g A c o r r e s p o n d t o a c o m m o n

    s u b s e q u e n c e o f l e n g t h 6 ( o f A 1 , ~ 0), t h e n h = 4 a n d K E E P [ P [ 4 ] ] w i l l e n a b l e r e c o v e r y o fp o s i t i o n s 1 , 3 , 4 , 8 : a a [ P [ 4 ] ] = 8 , n s k i p [ P [ 4 ] ] = 1 , p t [ P [ 4 ] ] = y ( a n o t h e r i n d e x o fK E E P ) . a n [ y ] = 4 , n s k i p [ y ] = 2 ( p o s i t io n s 3 a n d 4 h a v e b e e n s k i p p e d ) , p t [ y ] =z . a n [ z ] = 1 , n s k l p [ z ] = 1 , p t [ z ] = 0 ( al l s k i p p e d p o s i t io n s h a v e b e e n r e c o v e r e d ) .

    R e f e r e n c e c o u n t s a r e k e p t f o r e a c h e l e m e n t o f K E E P . S p a c e s i n t h e K E E P a r r a y a r em a i n t a i n e d b y g a r b a g e c o l l e c t i o n fu n c t i o n s G E T S P A C E w h i c h p r o v i d e s a n a v a i l a b l es p a c e a n d P U T S P A C E w h i c h p l a c e s a n e w l y a v a i l a b l e s p a c e ( i . e . o n e w h o s e r e f e r e n c ec o u n t d r o p s t o z e r o ) o n t h e g a r b a g e l i n k e d l is t . S e e [ 1 0 ] f o r i m p l e m e n t a t i o n t e c h n i q u e s .

    W e n o w p r e s e n t A L G E , w h i c h u s e s L e m m a 6 i n o r d e r t o s o lv e t h e L C S p r o b l e m i nt i m e O ( p e l o g n ) :A L G E ( m , n , A , B , C , p , e)

    1 F [ h ] , G [ h ] ~ 0 f o r h = 0 . . . P[O ] ~-- O; P [ h ] ~ - - - l f o r h = 1 , , 2 f or k ~ 1 s t ep 1 wh i le there were candidates found m the last pass d obegin3 lmax ~-- 0

    j r m n ~ n + 14 for h ~ 0 s tep 1 unt i l e do

    b e g i n5 t ~ - - h + k

    J ~ N E X T B ( a . G[h])f f l -> I m m

    6 t h e n b e g i nF [ h ] ~ l m a xG [ h ] ~ I m mNE W P[ h ]

  • 8/3/2019 Longest Common Sub Sequence

    8/12

    A l g o r t t h m s f o r t h e L o n g e s t C o m m o n S u b s e q u e n c e P r o b l e m 6 7 12 w h i l e y ~ do

    b e g i ncou nt ~ nsk~p[y]posttton ~ aa[y]3 w h i l e c oun t > 0 dob e g i n

    SKIP[x] ~ post t lonx ~ - x - 1positzon ~-- posttton - 1cou nt ~-- cou nt - 1e n d loop of step 3y ~- p t [y]

    e n d loop of step 24. x ~-- 1k ~ - Ifor t ~ - 1 s t ep 1 u n t il lastmatch d oif ~ = S K I P [ x ] t h e n x ~--x + 1e l s e b e g i n

    Ck 4.-- azk ~ - k + le n d

    5 E N D O F R E C O V E RT h e l o o p o f s t e p 2 e v a l u a t e s s e t s o f f e a s i b l e k - c a n d i d a t e s f o r k = 1 , 2 , . . . . T h e l o o p

    o f s t e p 4 e v a l u a t e s w h e t h e r t h e r e ~s a t e a s i b l e k - c a n d i d a t e h a v i n g p r e c i s e l y h s k i p p e dp o s i t i o n s , f o r h = 0 , 1 , . .. , e , b y u s i n g L e m m a 6 to d e t e r m i n e t h e j - v a l u e f o r ap a r t i c u l a r t - v a lu e a n d t h e n c h e c k i n g , b y u s i n g L e m m a 2 , w h e t h e r ( i, j} i s m i n i m a l , i m a xis t h e m a x i m u m t - v a l u e o f f e a s tb l e k - c a n d i d a t e s g e n e r a t e d t h u s f ar ( i .e . w t t h / - v a l u e sl es s t h a n t h e c u r r e n t v a l u e o f i ) ; j m m is t h e c o r r e s p o n d i n g j - v a l u e ( w h i c h i s th em i n i m u m j - v a l u e o f f e a s ib l e k - c a n d i d a t e s g e n e r a t e d t h u s f a r ) . I f ( i, j } is a f e a s ib l e k -c a n d i d a t e , t h e n i t is s t o r e d i n th e F a n d G a r r a y s a n d i n f o r m a t i o n w t l l b e s t o r e d i n P [ h ] ,e n a b h n g r e c o v e r y o f a n y a d d i t i o n a l s k i p p e d p o s i t i o n s th a t o c c u r b e t w e e n i a n d F [ h ] a sw e l l a s th e s k i p p e d p o s i t i o n s o c c u r r i n g b e f o r e F [ h ] ( ( F [ h ] , G [ h ] ) i s a ( k - 1 ) - c a n d i d a t et h a t c a n g e n e r a t e ( i, ] ) ) . T h e h s k i p p e d p o s i t io n s c o r r e s p o n d i n g t o ( F [ h ] , G [ h ] ) a r er e c o v e r a b l e b y a c c e s s in g K E E P [ P [ h ] ] . I n g e n e r a l t h e r e m a y b e m o r e t h a n o n e f e a s i b lek - c a n d i d a t e t h a t w i ll b e g e n e r a t e d b y ( F [ h ] , G [ h ] ) . T h u s w e m u s t n o t d e s t r o y P [ h ] u n t i la l l r e q u i r e d r e f e r e n c e s t o K E E P [ P [ h ] ] a r e m a d e . F o r t h is r e a s o n , n e w v a l u e s fo r th e Pa r r a y a r e s t o r e d i n t h e N E W P a r r a y . W h e n w e n o l o n g e r n e e d t h e o l d v a l u e s o f P ( a f te rt h e i n n e r l o o p o f s t ep s 4 - 9 ) , w e c a n t h e n r e p l a c e t h e m w i t h th e n e w v a l u e s , b e i n gc a r e fu l to d e c r e m e n t r e f e r e n c e c o u n t s o f K E E P e l e m e n t s t h a t w e r e p o i n t e d t o b y th eo l d P a r r a y

    F u n c t i o n R E M O V E ( x ) d e c r e m e n t s t h e r e f e re n c e c o u n t o f K E E P [ x ] ( u n l e s s x --< 0 , i nw h i c h c a s e n o t h i n g is d o n e ) , a n d , i f K E E P [ x ] n o w h a s re f e r e n c e c o u n t z e r o , t h e n a ca l lw i ll b e m a d e t o R E M O V E ( p t [ x ] ) a f t e r K E E P [ x ] h a s b e e n p u t o n t h e g a r b a g e l i n k e d l i stb y u s i n g P U T S P A C E .I m p l e m e n t a t i o n o f N E X T BT h e f o l lo w i n g s h o u l d b e d o n e b e f o r e u s i n g A L G E :1 Sort the symbols m A and the n construct a balanced bina ry search tree o f symbols that appear in str ing A

    Let there be ss such symbols (ss -< m).2 . f or k ~ 1 s t ep 1 u n t i l ss d o L A S ' I ~ k ] , - - 03 for i ~- - 1 s tep 1 unt i l n do

    b e g i nfind out that bt = 0k1 ~ L A S T [ k ]L A S T [ k ] ~ tif I ~ 0 t h e n NE X T [ i ] ~ -- telse FIR ST [ k ] ~ -- t

    e n d loop of step 3

  • 8/3/2019 Longest Common Sub Sequence

    9/12

    67 2 DANIEL S . HIRSCHBERG4. start z.- 1

    for k ~ 1 s tep 1 unt i l s s d obet~JnPlace the pos i t ions j o f B such that b j = 0~ in to N [ s t a r t ] t h r o u g h N [ s t a r t + n n - 1 ] w h e r e 0~ occurs n ntimes in string B. The first posmo n in B at w hich 0k occurs Is at FIRST[k]. If 0~ occurs at position , thenthe nex t occurrence of 0~ in B w dl be a t posinon N E X T [ i ] unles s LAST [k] = j , in w hich case t h e r e a r c nomore occurrences of 0~ in B.S[k] ~ startstart ~ start + n n

    en dW e c a n f i n d o u t th a t a , = 0 k m t i m e O ( l o g s ) . N [ S [ 1 ] : S ~ k + 1 ] - 1 ] h o l d s t h e b l o c k

    o f p o s i t i o n s j w i t h b e = 0 ~. T h i s b l o c k o f c e ll s c a n b e s e a r c h e d b y u s i n g b i n a r y s e a r c h o fa h n e a r l y o r d e r e d a r r a y [ 1 1 , S e c . 6 . 2 . 1 ] . N E X T ( a , j ) c a n t h u s b e e x e c u t e d i n t i m eO ( l o g n ) .

    I f s is v e r y s m a l l , t h e n t h e f o l l o w i n g a l t e r n a t e w a y o f c o m p u t i n g N E X T B ( O , j ) m a y b ep r e f e r r e d : I n s t e a d o f c o n s t r u c t i n g a c o m p r e s s e d a r r a y in s te p 4 , c o n s t r u c t a N E X T Bm a t r i x w h i le i n s t e p 3 . F o r e a c h l , s e t N E X T B [ k , t ] = i f o r j _< t < i . T h i s w i l l r e s u l t i nt im e a n d s p a c e c o m p l e x i ty ( o f t h e s e t u p ) o f O ( s n ) . T h e f u n c t i o n N E X T B ( O , j ) c a n b ee v a l u a t e d b y d e t e r m i n i n g t h a t 0 = 0 k i n ti m e O ( l o g S ) a n d b y d o i n g a s i m p l e ta b l e l o o k -u p .

    A L G E r e t a in s k - c a n d i d a t e s , a s d i d A L G D , e x c e p t f o r t h o s e c a n d i d a t e s t h a t c a n n o tl e a d t o a s u f fi c ie n t l y l o n g c o m m o n s u b s e q u e n c e b e c a u s e t o o m a n y A - p o s i t i o n s h a v ea l r e a d y b e e n s k i p p e d . T h e ( k + D - c a n d i d a t e s th a t c a n b e g e n e r a t e d b y t h e d r o p p e d k -c a n d i d a t e s a l so s k i p t o o m a n y A - p o s i t io n s .

    L E M M A 7 . A L G E r e ta i ns a ll f e a s ib l e k - c a n d id a t e s.P RO O F. B y i n d u c t i o n o n k . I t is t r i v i a l l y t r u e f o r k = 0 ( t h e F a n d G a r r a y s a r e

    i n i t i a l iz e d t o z e r o in s t e p 1 ) . A s s u m e t h a t t h e s e t o f f e a s i b l e ( k - 1 ) - c a n d i d a t e s h a s b e e ne v a l u a t e d a n d s t o r e d i n a r r a y s F a n d G . A L G E g e n e r a t e s t h e s e t o f f e a s i b l e k -c a n d i d a t e s in o r d e r o f i n c r e a s i n g / - v a l u e . F [ h ] i s t o h o l d i = h + k i f i i s a n / - v a l u e o f af e a s ib l e k - c a n d i d a t e ; o t h e r w i s e F [ h ] I s t o h o l d t h e m a x i m u m i ' < i s u c h th a t i ' is af e a s ib l e k - c a n d i d a t e . G [ h ] i s t o h o l d t h e c o r r e s p o n d i n g j - v a l u e , i m a x a n d j m i n h o l d t h el a s t - g e n e r a t e d f e a s ib l e k - c a n d i d a t e , w h i ch , b y L e m m a 2 , h a s t h e m a x i m u m / - v a l u e a n dm i n i m u m j - v a l u e g e n e r a t e d t h u s f a r . S t e p 3 i n i ti a li z es t h e m t o c o r r e c t ly i n d i c a te t h a t n ok - c a n d i d a t e s h a v e y e t b e e n g e n e r a t e d . S t e p 5 e v a l u a t e s th e j - v a l u e f o r a g iv e n p o t e n t i a lk - c a n d i d a t e b y u sin g L e m m a 6 . I f j _ > j m i n t h e n , e v e n t h o u g h t h e n e c e s s a r y c o n d i t i o n f o rf e a s i b i li t y h a s b e e n m e t , ( i , ] ) i s n o t m i n i m a l s i n c e it w o u l d b e r u l e d o u t b y ( i m a x , j m i n ) .I n t h i s c a s e s t e p 6 s e t s F [ h ] a n d G [ h ] t o i m a x a n d j m i n . I f j < j m i n , t h e n ( i , j ) is m i n i m a ls in c e i t c a n n o t b e r u l e d o u t b y a n y p r e v i o u s l y g e n e r a t e d k - c a n d i d a t e ( / < j m m ) a n d i tc a n n o t b e r u l e d o u t b y a n y f u t u r e g e n e r a t e d k - c a n d i d a t e ( a ll f u t u r e i ' > i ) . I n t h i s c a s es t e p 8 s e t s F [ h ] a n d G [ h ] a n d a l s o u p d a t e s t m a x a n d j m i n . [ ]

    THEOREM 3. A L G E c or re ctl y c o m p u t e s th e L C S o f s tr in gs A a n d B i f t h e L C S is o fl eng th a t l eas t m - ~ .P RO OF . B y L e m m a 7 , A L G E c o r r e c tl y k e e p s m i n i m a l k - c a n d i d a t e s . T h u s , i f t h e r ei s a c o m m o n s u b s e q u e n c e o f l e n g t h p .~ m - , t h e n t h e r e i s a m i n i m a l p - c a n d i d a t e

    w h i c h w i l l b e f e a s i b l e . T h e d a t a s t r u c t u r e o f A L G E k e e p s t r a c k , f o r e a c h f e a s i b l e k -c a n d i d a t e ( t, j ) , o f t h e h = i - k p o s i t i o n s in s t r i n g A t h a t h a v e b e e n s k i p p e d in t h ec o m m o n s u b s e q u e n c e o f l e n g th k o f A t , a n d B l s. P [ h ] p o in t s t o t h e e l e m e n t o f K E E Pt h a t c o n t a i n s t h e n e c e s s a r y i n f o r m a t i o n . P i s u p d a t e d i n s t e p 7 w h e n a f e a s i b l e k -c a n d i d a t e i s g e n e r a t e d . I f a n y a d d i t i o n a l p o s it io n s a r e s k i p p e d ( b e t w e e n t h e k - c a n d i d a t e( i, j ) a n d t h e ( k - 1 ) - c a n d i d a t e ( i ' , j ' ) t h a t g e n e r a t e d ( i , ] ) ) , t h e n t h a t i n f o r m a t i o n i sr e c o r d e d i n a n e l e m e n t o f K E E P a s w e l l a s a p o i n t e r , e n a b l i n g r e c o v e r y o f t h e h -n s k i p p r e v i o u s ly s k i p p e d A - p o s i t i o n s ( o f ( i ' , j ' ) ) . S u b r o u t i n e R E C O V E R r e c o v e r s t h es k i p p e d p o s i U o n s o f a f e a s ib l e p - c a n d i d a t e b y r e v e r s i n g t h e p r o c e s s i n w h i c h t h e y w e r es t o r e d a n d t h e n c o m p u t e s t h e L C S b y d e le t i n g t h e s k i p p e d p o s i t io n s f r o m s t r i n g A . [ ]THEOREM 4. For ~ _< O ( n l t 2 ) , A L G E r e qu ir e s s p a c e l in e ar m n .

  • 8/3/2019 Longest Common Sub Sequence

    10/12

    A lg o r i t hm s fo r t he L o n g e s t C o m m o n Su b se q ue n ce P r o b lem 6 7 3P R O O F . T he K E E P a r r a y r e q u i r e s O ( e 2) s p a c e : T h e c o m m o n s u b s e q u e n c e i m p l i e d b y

    k - c a n d i d a t e (h + k , j~ h a s h s k i p p e d A - p o s i t i o n s , h < - ~ , a n d t h u s c a n u s e a t m o s t h s p a c e si n t h e K E E P a r r a y . T h e t o t a l n u m b e r o f s p a c e s r e f e r r e d t o b y a l l f e a s i b l e k - c a n d i d a t e s ist h u s a t m o s t ~ ( e + 1 ) /2 . A d d i n g t o t h a t t h e ( e x a c t l y ) E r e f e r e n c e s t o g e t th e s e t o f f e a s i b l e( k + 1 ) - c a n d i d a t e s g i v e s a t o t a l o f n o m o r e t h a n ( e 2 + e ) / 2 . E a c h e l e m e n t o f a r r a y K E E Pr e q u i r e s f o u r w o r d s (aa, nsk ip , p t , a n d a r e f e r e n c e c o u n t e r ) .

    T h e a r r a y s a n d s p a c e t h a t t h e y u s e a r e a s f o l lo w s : Fie] , G[e] , C [ p ] , P i e ] , N E W P [ e ] ,K EE P [2 e 2 + 2 e ] , F IR ST [s s] , NE X T [ n ] , L A ST [s s ] , SKI P [e] , S [ s s] , N[n ].T h e N E X T B f u n c t io n r e q u i r e s a t m o s t 2 n l o c a t i o n s to s t o r e t h e v a r i o u s b a l a n c e db i n a r y s e a r c h t r e e s .T h u s a t o t a l o f a t m o s t 2e ~ + 7 e + 4 n + p + 3 ss l o c a t i o n s is u s e d . F o r e -< 0(nl /2) , s p a c er e q u i r e m e n t s a r e l i n e a r in n . [ ]THEOREm 5. A L G E r eq u ir es t im e O ( p e l o g n ) .P RO O F. P r e p r o c e s s i n g f o r t h e N E X T B f u n c t i o n r e q u i r e s t i m e O (n l o g m ) . S t e p 1t a k e s t im e O ( e ) . S t e p 2 e x e c u t e s s te p s 3 - 1 2 p t i m e s . S t e p 3 t a k e s c o n s t a n t t i m e f o r a

    t o t a l t i m e o f O ( p ) . S t e p 4 e x e c u t e s s t e p s 5 - 9 a t m o s t e t i m e s . S t e p 5 t a k e s t im e O ( l o g n )f o r a t o t a l t i m e o f O ( p e l o g n ) . S t e p s 6 - 9 t a k e c o n s t a n t t i m e f o r a to t a l t i m e o f O ( p e ) .S t e p s 1 0 - 1 2 , e x c l u d i n g t i m e s p e n t in f u n c t io n R E M O V E , t a k e t i m e O ( e ) f o r a t o t a l t i m eo f O ( p e ) .

    S u b r o u t i n e R E C O V E R r e c o v e r s a t m o s t c s k i p p e d p o s i t r o n s ( t a k i n g t i m e O ( e ) ) a n dt h e n d e l e t e s t h e m f r o m s t ri n g A ( t a k i n g t i m e O ( m ) ) f o r a t o t a l t im e o f O ( m ) .T h e n u m b e r o f r e f e r e n c e s ( to a r r a y K E E P ) r e m o v e d is a t m o s t t h e n u m b e r o fr e f e r e n c e s i n s e r t e d . T h e r e a r e a t m o s t p e r e f e r e n c e s i n s e r te d ( o n e p e r e x e c u t i o n o f s t e p7 ) , a n d t h e a m o u n t o f ti m e ( p e r r e f e r e n c e r e m o v a l ) s p e n t in fu n c t io n R E M O V E i sc o n s t a n t . T h e r e f o r e t h e t o t a l t i m e s p e n t i n f u n c t i o n R E M O V E i s O ( p e ) .

    T h e r e f o r e t h e t o t a l t i m e o f e x e c u t i o n o f A L G E is O ( p e l o g n ) . [ ]I t is n o t e d t h a t s t e p 5 , r e q m r i n g O ( l o g n ) t i m e , i s t h e b o t t l e n e c k , c a u s in g t o t a l t i m er e q u i r e m e n t s o f O ( p e l o g n ) . P . v a n E m d e B o a s ' s r e c e n t a l g o r i t h m f o r p r i o r i ty q u e u e s[ 1 9] a p p e a r s c a p a b l e o f s o lv i n g th e p o s i t i o n - f i n d i n g p r o b l e m i n t im e O ( l o g l o g n ) . I f s o ,t h is w o u l d r e d u c e t h e t i m e b o u n d o f t h is p r o b l e m t o O ( p e l o g l o g n ) .A L G E a s s u m e s t h a t c i s k n o w n . I f ~ i s n o t k n o w n , t h e n s e t E ~ -- 2 a n d p r o c e e dt h r o u g h t h e a l g o r i t h m . I f t h a t v a l u e o f ~ i s i n s u f fi c ie n t ( i . e . t h e r e i s n o c o m m o ns u b s e q u e n c e o f l e n g t h m - e ) , th e n d o u b l e t h e g u e s s f o r e a n d c o n t i n u e i t e r a t i v e l y u n t i la c o m m o n s u b s e q u e n c e is f o u n d .T o t a l t i m e s p e n t w i l l b e ( l e tt i n g k b e t h e m u l t i p li c a t iv e c o e f f i c i e n t o f t h e t i m er e q u i r e m e n t )

    2 p k l o g n + 4 p k l o g n + . . . + epk l o g n ,w h i c h i s l e s s t h a n 2 p e k l o g n . S i n c e e < 2 ( m + 1 - p ) , w e c a n r e c o v e r a n L C S i n t i m eO ( p ( m + 1 - p ) l o g n ) .Other A lgor i thmsT h e o n l y k n o w n a l g o r i t h m f o r th e L C S p r o b l e m w i t h w o r s t - c a s e b e h a v i o r l e ss t h a nq u a d r a t i c is d u e t o P a t e r s o n [ 1 4 ]. T h e a l g o r i t h m h a s c o m p l e x i t y O ( n ~ l o g l o g n / l o g n ) . I tu s e s a " F o u r R u s s i a n s " a p p r o a c h ( s e e [ 3] o r [ 1 , p p . 2 4 4 - 2 4 7 ] ) . E s s e n t i a l ly , i n s t e a d o fm a t r i x L ( w h e r e L [ t , j ] i s t h e l e n g t h o f a n L C S o f A 1 , a n d B t j ) b e i n g c a l c u l a t e d o n ee l e m e n t a t a ti m e ( s e e [ 7 ] ), th e m a t r i x is b r o k e n u p i n t o b o x e s o f s o m e a p p r o p r i a t e s iz ek . T h e high s i d es o f a b o x ( t h e 2 k - 1 e l e m e n t s o f L o n t h e e d g e s o f t h e b o x w i t hl a rg e s t i n d ic e s ) a re c o m p u t e d f r o m L - v a l u e s k n o w n f o r b o x e s a d j a c e n t t o i t o n t h e lo ws i d e a n d f r o m t h e r e l e v a n t s y m b o l s o f A a n d B b y u s in g a l o o k - u p t a b l e w h i c h w a sp r e c o m p u t e d .

    T h e a l g o r i t h m a s s u m e s a f i x e d a l p h a b e t s i ze a lt h o u g h m o d i f i c a t io n s t o t h e a l g o r i t h mm a y b e a b l e to g e t a r o u n d t h a t c o n d i t io n .

  • 8/3/2019 Longest Common Sub Sequence

    11/12

    6 7 4 D A N I E L S . H I R SC H B E R GT h e r e a r e 2 k + 1 e l e m e n t s o f L a d j a c e n t t o a b o x o n t h e lo w s i d e . T w o a d j a c e n t L -

    e l em ent s ca n d i f f er by e i t her z ero or one . There a re t hus 2 2 k p os s i b i l i t i e s i n t h i s re s p ec t .T h e A - a n d B - v a l u e s r a n g e o v e r a n a lp h a b e t o f s iz e s f o r e a c h o f 2 k e l e m e n t s , y i e l d in g am u l t i p li c a t iv e f a c to r o f s zk , a n d t h e t o ta l n u m b e r o f b o x e s t o b e p r e c o m p u t e d i s t h e r e f o r e2 ~kt~+l gs~ . Ea ch s uch box ca n be p r eco m p ut ed i n t i m e O ( k ~) f or a t o t a l p recom p ut i ngt ime of O(k~22k~+lg ~) .

    T h e r e a r e ( n / k ) z b o x e s t o b e l o o k e d u p , e a c h o f w h i c h w i l l r e q u i r e O ( k l o g k ) t i m e t obe rea d , f or a t o t a l t i m e o f O ( n2 1 og k / k ) .

    T he tota l ex ec ut io n t im e w i l l the refo re be O (k~ 2 ~k~+~g s~ + n21og k / k ) . I f w e l e t k = l ogn / 2 ( 1 + l o g s ) , w e s e e t h a t t h e t o ta l e x e c u t i o n t i m e w i ll b e O ( n 21 o g l o g n / l o g n ) .Restrictions on the LCS ProblemS z y m a n s k i [ 1 7 ] sh o w s t h a t f f w e c o n s i d e r t h e L C S p r o b l e m w i t h t h e r e s t ri c t io n t h a t n os y m b o l a p p e a r s m o r e t h a n o n c e w i t h i n e i t h e r i n p u t s t r i n g , t h e n t h i s p r o b l e m c a n b es o l v e d i n t i m e O(n l o g n ) .I n a d d i t i o n i f o n e o f t h e i n p u t s t r in g s is t h e s t r in g o f in t e g e r s 1 - n , t h is p r o b l e m ise q u i v a l e n t t o fi n d i n g t h e l o n g e s t a s c e n d i n g s u b s e q u e n c e i n a s tr i n g o f d i st in c t i n t e g e r s. I fw e a s s u m e t h a t a c o m p a r i s o n b e t w e e n i n te g e r s c a n b e d o n e i n u n it t im e , t hi s p ro b l e m c a nb e s o l v e d i n t i m e O(n l o g l o g n ) b y u s in g t h e t e c h n i q u e s o f v a n E m d e B o a s [ 18 } .ACKNOW LEDGM ENT. I w o u l d l i k e t o t h a n k t h e ( a n o n y m o u s ) r e f e r e e s f o r t h e i r m a n yh e l p f u l s u g g e s t i o n s w h i c h h a v e l e d t o a m a t e r i a l i m p r o v e m e n t i n t h e r e a d a b i l i t y o f th isp a p e r .REFERENCES(Note References [4-6, 8, 9, 13, 15, 20, 22, 23] are not ot ed m the text. )

    1 AHO, A V , HOPCROFr, J E , AND ULLMAN, J D The Design and Analysis of Computer AlgortthmsAddison-Wesley, Reading, Mass, 1974

    2 Ario, A V, HmSCHBER6, D.S , AND ULLUAN, J D Bounds on the complextty of the longest commonsubsequenee problem J ACM 23, 1 (Jan 1976), 1-12.

    3 ARLAZAROV, V L, , DINIC, E A , KRONROD, M A, AND FARADZEV, I A On economic construction ofthe trans~ttve closure of a dtrected graph Dokl Akad Nauk SSSR 194 (1970), 487-488 (m Russian)Enghsh transi Jn Sovtet Math Dokl 11, 5 (1970), 1209-12104 CnVATAL, V , KLARNER, D A , AND KNUTrI, D E Selected combinatoria l research problems STAN-CS-72-292, Stanford U , Stanford, Ca hf , 1972, p 26

    5 CItVATAL, W , AND SANKOFF, D Longest common subsequences of two random sequences STAN-CS-75-477, Stanford U , Stanford, Ca hf , Jan 19756 HIRSCnBER6, D S On finding maximal common subsequences TR-156, Comptr So Lab , PrincetonU , Princeton, N.J , Aug 1974

    7. HIRSCHEERG, D S A hnear space algor ithm for comput ing maximal common subsequences CommACM 18, 6 (June 1975), 341-343

    8 HIESCr~EERG, D S The longest common subsequence problem Ph D Tb , Princeton U, Princeton,NJ,Aug 1975

    9 HUN1", J W , AND SZYMANSKI, T G A fast algori thm for computing longest common suhsequencesComm ACM 20, 5 (May 1977), 350-353.10 KNUTn, D E The Art of Computer Programming, Vol 1. Fundamental Algorithms Addison-Wesley,

    Readmg, Mass., sec. ed , 197311 KNUTTn, D. E The Art o f Computer Programming, Vol 3" Sorting and Searching. Addison-Wesley,Reading, Mass., 19731 2 L O W R A N C E , R , AND WAGNER, R A An extension of the string-to-string correction problem J ACM22, 2 (April 1975), 177-183

    13 NEEDLEMAN, S B , AND WUNSCH, C D A general method applicable to the search for slmdantles mthe amino acid sequence of two proteins J. Mol Biology 48 (1970), 443-453

    14 PATERSON,M.S Unpubh shed manuscript U of Warwick, Coventry, England, 197415 SANKOFF,D Matching sequences under deletion/insertion constraints Proc Nat Acad Sct USA 69, 1(Jan 1974), 4-616 SELLERS, P H An algor ithm for the distance between two fmlte sequences J. Combmatortal Theory,Ser A, 16 (1974), 253-258

  • 8/3/2019 Longest Common Sub Sequence

    12/12

    A l g o r i th m s f o r th e L o n g e s t C o m m o n S u b s e q u en c e P r o b l e m 6 7 517 SZYMANSKI,T G A s p e ci a l c a se o f t h e m a x i m a l c o m m o n s u b s e q u e n c e p r o b l e m T R - 1 7 0 , C o m p t r S c l

    L a b , P r i n c e t o n U , P r i n c e t o n , N J , J a n 1 9 7 5 .18 VAN E M DE B OAS , P A n O(n l o g l o g n ) o n - h n e a l g o r i t h m f o r t h e i n s e rt - e x t r a c t r a in p r o b l e m T R 7 4 -

    2 2 1 , D e p t C o m p t r S o l , C o r n e U U , I t h a c a , N Y , D e c 1 9 7 41 9 VAN E M D E B O A S , P . P r e s e r v i n g o r d e r in a f o r e s t m le s s t h a n l o g a r i t h m i c t i m e C o n f R e c 1 6 t h A n n u a l

    S y m p o n t h e F o u n d a t i o n s o f C o m p t r SO l , O c t 1 9 7 5 , p p 7 5 - 8 42 0 W A G N E R , R A O n t h e c o m p l e x i t y o f t h e e x t e n d e d s t ri n g - to - s t ri n g c o r r e c t io n p r o b l e m P r o c S e v e n t h

    A n n u a l A C M S y r up o n T h e o r y o f C o m p i n g , 1 9 7 5 , p p 2 1 8 - 2 2 321 W A G N ER , R A , AND FISCHER, M J T h e s t r in g - t o - st r in g c o r r e c t io n p r o b l e m J . ACM 21, 1 ( J a n

    1 9 7 4 ) , 1 6 8 - 1 7 32 2 W o N o , C K . . AN D C R Atq DR A , A . K b o u n d s f o r t h e s t r i n g e d i t i n g p r o b l e m .Jr A C M 2 3, 1 ( J a n 1 9 7 6 ) ,

    1 3 - 1 62 3 Y A O, C C , A ND Y A O, F C O n c o m p u t i n g t h e r a n k f u n c t i o n f o r a s e t o f v e c t o r s U I U C D C S - R - 7 5 - 6 9 9 ,

    D e p t C o m p t r Sc~ , U o f l l li n o l s a t U r b a n a - C h a m p a l g n , U r b a n a , I11 , F e b 1 9 7 5 .RECEIVED JUN E 19 75 , REVISED FEBRUARY 19 77

    Journal of the Association or Com putingMachinery,Vol 24, No 4, October 1977