An Efficient Algorithm for Finding Similar Short Substrings from Large Scale String Data

May/23/2008 PAKDD 2008

Takeaki Uno

National Institute of Informatics, Japan &

The Graduate University for Advanced Studies

Motivation: Analyzing Huge Data

•• Recent information technology has given us many huge databases

-- Web, genome, POS, log, …

•• "Construction" and "keyword search" can be done efficiently

•• The next step is analysis: capturing features of the data

-- statistics, such as size, #rows, density, attributes, distribution, …

•• Can we get more?

Look at (simple) local structures, while keeping the analysis simple and basic

[Figure: examples of huge data: a genome sequence (ATGCGCCGTATAGCGGG…), a database, and results of experiments 1 to 4 shown with ● and ▲ marks]

Our Focus

•• Find all pairs of similar objects (or structures)

(or binary relation, instead)

•• This may be a very basic and fundamental task, so there should be many applications

-- detecting global similar structures,

-- constructing neighbor graphs,

-- detecting locally dense structures (groups of related objects)

In this talk, we look at strings

Existing Studies

•• There are many studies on similarity search (homology search)

Given a database, construct a data structure that lets us quickly find the objects similar to a given query object

-- strings with Hamming distance, edit distance

-- points in the plane (k-d trees), Euclidean space

-- sets

-- constructing neighbor graphs (for smaller dimensions)

-- genome sequence comparison (heuristics)

•• Both exact and approximate approaches

•• All-pairs comparison does not work for large scale data

Approach from Algorithm Theory

•• Parallel computation is a popular way to speed up computation, but its high cost, including the difficulty of programming, is a disadvantage

•• Algorithmic improvement slows the growth of computation time against the database size by redesigning the way the computation is done

Efficiency gain grows as the database size increases:

size = 100: 2-3 times

size = 1,000,000: 10,000 times

We approach the problem from the algorithmic point of view

Our Problem

•• We address databases whose records are short strings

Problem: given a database composed of n strings of the same fixed length l, and a threshold d, find all pairs of strings whose Hamming distance is at most d.

•• We propose an efficient algorithm SACHICA (Scalable Algorithm for Characteristic/Homogenous Interval Calculation), and a method to detect long similar substrings of input strings (especially efficient for genomic data)

ATGCCGCGGCGTGTACGCCTCTATTGCGTTTCTGTAATGA ...

・・ ATGCCGCG , AAGCCGCC

・・ GCCTCTAT , GCTTCTAA

・・ TGTAATGA , GGTAATGG

...

Approaching Long-string Similarity

•• When two strings S1 and S2 are similar, they must have several pairs of similar short substrings

•• "Having several similar substrings" is a necessary condition for two strings to be similar

Ex) Two strings of length 3000 with Hamming distance 290 (about 10%) have at least 3 pairs of substrings of length 30 with Hamming distance at most 2. If we allow deletions and insertions, the positions of these substrings differ by at most 30.

•• This gives a condition: substrings of length β are similar only if k pairs of their short substrings are similar and the start positions of each pair differ by at most α
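The count in the length-3000 example follows from a pigeonhole argument over aligned, non-overlapping blocks. The sketch below is only an illustration of that argument: the function name `min_similar_blocks` and its parameters are hypothetical, and it assumes plain Hamming distance (no insertions or deletions).

```python
def min_similar_blocks(length, block, total_dist, block_dist):
    """Lower bound on the number of aligned, non-overlapping blocks whose
    Hamming distance is at most block_dist, when the two whole strings
    differ in at most total_dist positions."""
    n_blocks = length // block
    # A "bad" block needs at least block_dist + 1 mismatches, so the
    # mismatch budget can pay for at most this many bad blocks.
    max_bad = total_dist // (block_dist + 1)
    return max(0, n_blocks - max_bad)


# Numbers from the slide: length 3000, blocks of 30, distance 290, threshold 2.
bound = min_similar_blocks(3000, 30, 290, 2)
print(bound)
```

For the slide's numbers the bound evaluates to 4 (100 blocks, at most 96 of which can hold 3 or more mismatches), which is consistent with the stated "at least 3".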

Detecting Long Similar Substrings

•• Consider finding long similar substrings of given strings S1 and S2

•• Comparing all substrings of length β takes quadratic time and produces redundant overlapping pairs, so we use our similarity condition:

(1) find all pairs of similar short substrings

(2) scan a diagonal belt of width 2α to find an interval of length β including k pairs

(3) shift the diagonal belt by α, and repeat

•• We can always find the substrings of length β satisfying the condition, so the approach from similar short substrings is possible
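Steps (2) and (3) of the belt scan can be sketched as follows. This is an illustration under stated assumptions, not the talk's implementation: the name `scan_belts` is hypothetical, each similar pair is assumed to be given as start positions (i, j) in S1 and S2, a belt is taken to be a range of diagonals j - i of width 2α, and the interval of length β is measured along S1.

```python
from collections import defaultdict


def scan_belts(pairs, alpha, beta, k):
    """Return (belt_start_diagonal, first_i, last_i) for every window of
    S1-length < beta inside a belt that contains at least k pairs."""
    belts = defaultdict(list)
    for i, j in pairs:
        t = (j - i) // alpha          # floor division, also for negatives
        # Belt t covers diagonals [t*alpha, t*alpha + 2*alpha), and belts
        # are shifted by alpha, so each point lies in belts t and t - 1.
        belts[t].append(i)
        belts[t - 1].append(i)

    hits = []
    for t, xs in belts.items():
        xs.sort()
        lo = 0
        for hi in range(len(xs)):
            # Shrink the window until it spans less than beta positions.
            while xs[hi] - xs[lo] >= beta:
                lo += 1
            if hi - lo + 1 >= k:
                hits.append((t * alpha, xs[lo], xs[hi]))
    return sorted(hits)


# Three pairs near the main diagonal and one far away from it.
example = [(0, 5), (100, 110), (200, 190), (5000, 100)]
print(scan_belts(example, alpha=30, beta=3000, k=3))
```

Sorting each belt by position and sliding a window keeps the scan close to linear in the number of pairs, apart from the sort.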

Related Works

•• Computing the edit/Hamming distance is done in quadratic/linear time

the whole strings have to be similar

cannot detect local exchanges

•• Heuristic homology search such as BLAST, Pattern Hunter

usually finds exact matches of short substrings (11 letters), and extends them

must find an enormous number of pairs when the input strings are huge

lengthening the 11 letters loses accuracy

relies on heuristics: ignoring frequent substrings, dealing only with gene areas

•• Similarity search

involves a huge number of queries, taking much longer than exact search

Trivial Bound of the Complexity

•• If all the strings are exactly the same, we have to output all the pairs, which takes Θ(n^2) time

simple all-pairs comparison in O(l n^2) time is optimal, if l is a fixed constant

is there no improvement?

•• In practice, we would analyze only when the output is small; otherwise the analysis is meaningless

so consider the complexity in terms of the output size

We propose an O(2^l (n + lM)) time algorithm

M: #outputs
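For reference, the simple all-pairs comparison that this slide takes as the baseline can be written in a few lines. The function names are hypothetical; the input strings are the seven length-5 strings used in the example later in these slides.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def all_pairs_naive(strings, d):
    """O(l n^2) baseline: compare every pair of strings directly."""
    return [(i, j)
            for (i, s), (j, t) in combinations(enumerate(strings), 2)
            if hamming(s, t) <= d]


strings = ["ABCDE", "ABDDE", "ADCDE", "CDEFG", "CDEFF", "CDEGG", "AAGAB"]
print(all_pairs_naive(strings, 1))
```

Every one of the n(n-1)/2 pairs is inspected here even when only a handful are output, which is the cost the output-sensitive bound avoids.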

Basic Idea: Solve Subproblem

•• Consider the partition of strings into k blocks, and a subproblem

Subproblem: for given k-d block positions, find all pairs of strings with distance at most d such that "the given blocks are the same"

Ex) the 2nd, 4th, and 5th blocks of S1 and S2 (length 30) are the same

much fewer comparisons!

•• We can solve it by "radix sort" on the combined blocks, in O(l n) time.
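A minimal sketch of one subproblem is given below. It is an illustration, not the SACHICA code: the helper names are hypothetical, blocks are assumed to be contiguous and of near-equal length, and a hash table keyed by the chosen blocks stands in for the radix sort named on the slide (both bring strings with identical chosen blocks together).

```python
from collections import defaultdict
from itertools import combinations


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def block_bounds(l, k):
    """Split positions 0..l-1 into k contiguous blocks of near-equal length."""
    return [(b * l // k, (b + 1) * l // k) for b in range(k)]


def solve_subproblem(strings, bounds, chosen, d):
    """Pairs within distance d whose blocks at the chosen indices are identical."""
    groups = defaultdict(list)
    for idx, s in enumerate(strings):
        key = tuple(s[bounds[b][0]:bounds[b][1]] for b in chosen)
        groups[key].append(idx)
    out = []
    for members in groups.values():
        # Only strings that share the chosen blocks are ever compared.
        for i, j in combinations(members, 2):
            if hamming(strings[i], strings[j]) <= d:
                out.append((i, j))
    return out


strings = ["ABCDE", "ABDDE", "ADCDE", "CDEFG", "CDEFF", "CDEGG", "AAGAB"]
bounds = block_bounds(5, 3)            # blocks of lengths 1, 2, 2
print(solve_subproblem(strings, bounds, (0, 2), 1))
```

With k = 3 and d = 1, the call above fixes the 1st and 3rd blocks, so only the three strings of the form A??DE are compared with each other.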

Examine All Cases

•• Solve the subproblem for all combinations of the positions

if the distance of two strings S1 and S2 is at most 2, the letters in at least k-2 blocks are the same

so in at least one combination of blocks, the pair "S1 and S2" is found (in the subproblem of that combination P)

•• The number of combinations is kCd = C(k, d). When k=5 and d=2, it is 10

the computation is "radix sorts +α": O(C(k,d) l n) time for sorting

recursive radix sort reduces this to O(C(k,d) n)
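Examining all cases can be sketched by looping over every choice of k-d blocks. The sketch is hedged in two ways: hash-based grouping is used in place of the (recursive) radix sort, and duplicates are removed with a set purely for simplicity, which is not how the talk avoids them. All function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from math import comb


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def block_bounds(l, k):
    return [(b * l // k, (b + 1) * l // k) for b in range(k)]


def similar_pairs(strings, k, d):
    """All pairs within Hamming distance d, via every choice of k-d blocks."""
    bounds = block_bounds(len(strings[0]), k)
    found = set()
    for chosen in combinations(range(k), k - d):
        groups = defaultdict(list)
        for idx, s in enumerate(strings):
            key = tuple(s[bounds[b][0]:bounds[b][1]] for b in chosen)
            groups[key].append(idx)
        for members in groups.values():
            for i, j in combinations(members, 2):
                if hamming(strings[i], strings[j]) <= d:
                    found.add((i, j))   # set removes repeated finds
    return sorted(found)


strings = ["ABCDE", "ABDDE", "ADCDE", "CDEFG", "CDEFF", "CDEGG", "AAGAB"]
print(comb(5, 2))                      # number of combinations for k=5, d=2
print(similar_pairs(strings, k=3, d=1))
```

A pair within distance d can differ in at most d blocks, so it agrees on at least k-d blocks and is therefore found in at least one iteration of the outer loop.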

Example

・・ Find all pairs of strings with Hamming distance at most 1

Input strings: ABCDE, ABDDE, ADCDE, CDEFG, CDEFF, CDEGG, AAGAB

Split into three blocks: A BC DE, A BD DE, A DC DE, C DE FG, C DE FF, C DE GG, A AG AB

Blocks 2 and 3 combined: A BCDE, A BDDE, A DCDE, C DEFG, C DEFF, C DEGG, A AGAB

Blocks 1 and 2 combined: ABC DE, ABD DE, ADC DE, CDE FG, CDE FF, CDE GG, AAG AB

Figure out Intuition

•• Finding pairs of similar records is like finding all cells of a certain kind in a matrix

•• All pairs comparison sweeps and looks at all cells

•• Our multi-classification algorithm recursively reduces the areas to be checked in many ways, thus the search route forms a tree whose leaves correspond to groups of strings to be compared

Avoid Duplications by Canonical Positions

•• For two strings S1 and S2, their canonical positions are the first l-d positions at which they have the same letters

•• We output the pair S1 and S2 only in the subproblem of their canonical positions

•• Computation of the canonical positions takes O(l) time, so "+α" needs O(M l C(k,d)) time

Avoid duplications without keeping the solutions in memory

O(C(l,d) (n + dM)) = O(2^l (n + lM)) time in total (if we set k=l)
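The canonical-position rule can be sketched as follows. The slide defines canonical positions on letters (the case k=l); the sketch generalizes this to blocks, which is an assumption, and it again uses hash-based grouping rather than radix sort. All names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def block_bounds(l, k):
    return [(b * l // k, (b + 1) * l // k) for b in range(k)]


def canonical_blocks(a, b, bounds, need):
    """Indices of the first `need` blocks on which a and b agree."""
    same = [t for t, (lo, hi) in enumerate(bounds) if a[lo:hi] == b[lo:hi]]
    return tuple(same[:need])


def similar_pairs_once(strings, k, d):
    """Report each pair exactly once, without storing earlier outputs."""
    bounds = block_bounds(len(strings[0]), k)
    out = []
    for chosen in combinations(range(k), k - d):
        groups = defaultdict(list)
        for idx, s in enumerate(strings):
            key = tuple(s[bounds[b][0]:bounds[b][1]] for b in chosen)
            groups[key].append(idx)
        for members in groups.values():
            for i, j in combinations(members, 2):
                a, b = strings[i], strings[j]
                # Output only in the subproblem of the canonical blocks.
                if (hamming(a, b) <= d
                        and canonical_blocks(a, b, bounds, k - d) == chosen):
                    out.append((i, j))
    return out


# Two identical strings match in all 4 subproblems but are output once.
print(similar_pairs_once(["AAAA", "AAAA"], k=4, d=1))
```

Because the check depends only on the two strings and the current combination, no table of already reported pairs has to be kept in memory.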

Difference from BLAST

•• The original "BLAST" algorithm finds pairs of identical intervals of 11 letters

roughly, classifies into 4^11 ≈ 4 million groups

may take a long time for 100 million letters

•• Our method for length 30 with Hamming distance 3 (quality equal to finding identical intervals of 7 letters), dividing into 6 blocks

roughly, classifies into 4^15 ≈ 1,000 million groups (20 times)

may take a long time for 2000 million letters, but we can increase the number of blocks

But it is not good at searching for a given short string (no difference in time between many strings and one string)
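The group counts on this slide can be checked with a few lines of arithmetic. The reading that 6 blocks of a length-30 string have 5 letters each, and that 3 of them are fixed when d = 3, is an assumption that happens to match the 4^15 on the slide; the variable names are hypothetical.

```python
from math import ceil, comb

alphabet = 4                       # A, C, G, T

blast_groups = alphabet ** 11      # identical intervals of 11 letters
print(blast_groups)                # 4194304, roughly 4 million

block_len = 30 // 6                # 6 blocks of a length-30 string
fixed_blocks = 6 - 3               # k - d blocks must be identical
our_groups = alphabet ** (fixed_blocks * block_len)
print(our_groups)                  # 1073741824, roughly 1,000 million

# Number of ways to choose which 3 of the 6 blocks are fixed.
print(comb(6, 3))                  # 20

# 3 mismatches cut the 27 matching letters into at most 4 runs,
# so the longest identical run has at least this many letters.
print(ceil((30 - 3) / (3 + 1)))    # 7
```

The last line is the pigeonhole reason why length 30 with distance 3 guarantees an identical interval of 7 letters.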

Experiments: l = 20 and d = 0,1,2,3

[Figure: CPU time (sec., log scale from 0.1 to 10000) against length (1000 bases; ticks 200, 700, 2000, 7000, 22953), with one curve each for d=0, d=1, d=2, d=3]

Data: prefixes of the human Y chromosome. Machine: notebook PC with Pentium M 1.1GHz, 256MB RAM

Comparison of Chromosomes

Human 21st and chimpanzee 22nd chromosomes

•• Take strings of 30 letters from both, with overlaps

•• Intensity is given by # pairs

•• White: possibly similar

•• Black: never similar

•• Grid lines detect "repetitions of similar structures"

[Figure: dot plot with the human 21st chr. on one axis and the chimpanzee 22nd chr. on the other]

20 min. by PC

Homology Search on Mouse X Chr.

Human X and mouse X chromosomes (150M strings for each)

•• Take strings of 30 letters beginning at every position

・・ For human X, without overlaps

・・ d=2, k=7

・・ dots if 3 points are in an area of width 300 and length 3000

1 hour by PC

[Figure: dot plot with the human X chr. on one axis and the mouse X chr. on the other]
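Extracting the fixed-length strings that feed the comparison can be sketched as below. `windows` is a hypothetical helper, not part of the talk: a step of 1 corresponds to "beginning at every position", and a step equal to the length corresponds to "without overlaps".

```python
def windows(seq, length, step=1):
    """Return (start, substring) for each window of the given length."""
    return [(i, seq[i:i + length])
            for i in range(0, len(seq) - length + 1, step)]


genome = "ATGCCGCGGCGT"

# Overlapping windows, one starting at every position.
print(windows(genome, 8))

# Non-overlapping windows.
print(windows(genome, 4, step=4))
```

Keeping the start position next to each substring is what later allows similar pairs to be drawn as points of a dot plot.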

Comparison of Many Bacteria

Comparison of the genomes of 30 bacteria

•• The genomes are concatenated and compared in the same way

1 hour by PC

Comparison of BAC clones

•• Sequencing a genome is done by detecting overlaps of fragments

•• When genome has complex repeating structures, detection is hard

•• We detected the overlaps in the mouse genome, and completed some undetermined complex repeating parts (joint research with Koide, Umemori of National Institute of Genetics, Japan)

1 sec. by PC for a pair

Extensions ???

•• Can we solve the problem for other objects?

(sets, sequences, graphs,…)

•• For graphs, maybe yes, but the practical performance is uncertain

•• For sets, Hamming distance is not preferable: for large sets, many differences should be allowed

•• For continuous objects, such as points in Euclidean space, we can hardly bound the complexity in the same way (in the discrete version, the neighbors are finite, and actually classified into a constant number of groups)

Conclusion

•• Output sensitive algorithm for finding pairs of similar strings (in terms of Hamming distance)

•• Multi-classification for speeding up

•• Application to genome sequence comparison

Future works

•• Models and algorithms for natural language text

•• Extension to other objects (sets, sequences, graphs)

•• Extension to continuous objects (points in Euclidean space)

•• Efficient spin-out heuristics for practice

•• Genome analysis tools and systems
