Geometric Matching on Sequential Data

Veli Mäkinen

AG Genominformatik

Technical Fakultät

Bielefeld Universität

Stringology Haifa 2005 Geometric matching on sequential data 2

Introduction

Motivation: To study problems in the intersection of geometry and stringology.

Applications to time-series data.

Three problems

1D point set matching under translations (Akutsu, COCOON’04).

1D point set matching under translations, scaling and noise (Böcker & Mäkinen, EuroCG’05)

2D point set matching under translations (Ukkonen & Lemström & Mäkinen, 2003 + Cieliebak & Mäkinen, 2005).

1D point set matching under translations

Two point sets A and B of sizes m and n.

Problem 1a: Find the largest common point set of f(A) and B over translations f.

Problem 1b: Find the largest common point set of f(A) and a continuous subset of B.

Let k be the number of unmatched points.

Example

[Figure: point sets B and A, and the translated set f(A).]

Problem 1a: k=3

Problem 1b: k=1

Solutions

Trivial in $O(m^2 n \log n)$ time.

Easy in $O(mn \log m)$ time.

Akutsu gives an $O(k^3 + n \log n)$ time solution.

Akutsu’s solution

Use differential encoding for A and B:

$A' = a_2 - a_1,\ a_3 - a_2,\ \dots,\ a_m - a_{m-1}$,

$B' = b_2 - b_1,\ b_3 - b_2,\ \dots,\ b_n - b_{n-1}$.

Construct suffix tree T of A’#B’$. Preprocess T for LCA queries.
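A minimal sketch of the differential encoding, assuming the points are given as plain numbers. The function name `differential_encoding` is hypothetical. It illustrates why the encoding helps: translating a point set leaves its gap sequence unchanged.

```python
def differential_encoding(points):
    """Return the consecutive gaps of a 1D point set (points are sorted first)."""
    pts = sorted(points)
    return [pts[i + 1] - pts[i] for i in range(len(pts) - 1)]


A = [2, 5, 9]
shifted = [a + 10 for a in A]          # A translated by 10

enc_a = differential_encoding(A)       # gaps of A
enc_shifted = differential_encoding(shifted)
print(enc_a == enc_shifted)            # translation does not change the gaps
```

The suffix tree is then built over these gap sequences rather than over the raw coordinates.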

Akutsu’s solution...

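Let Jump(a_i, b_j) = h, where h is the largest integer such that the points a_i, ..., a_{i+h-1} and b_j, ..., b_{j+h-1} have identical consecutive gaps, i.e. $a_{i+\ell} - a_i = b_{j+\ell} - b_j$ for all $0 \le \ell < h$.

Jump(a_i, b_j) can be computed in O(1) time.

[Figure: b_j ... b_{j+h-1} aligned with a_i ... a_{i+h-1}.]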
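A naive sketch of a Jump query, under the assumption (read off the figure labels) that Jump counts how many consecutive points starting at a_i and b_j stay aligned under one translation. The name `jump_naive` and the 0-based indexing are illustrative. This version scans linearly; the O(1) bound on the slide comes from the suffix tree with LCA preprocessing, which is not reproduced here.

```python
def jump_naive(A, B, i, j):
    """Largest h with A[i+l] - A[i] == B[j+l] - B[j] for all 0 <= l < h.

    A and B are sorted lists of numbers; i and j are 0-based indices.
    """
    h = 0
    while (i + h < len(A) and j + h < len(B)
           and A[i + h] - A[i] == B[j + h] - B[j]):
        h += 1
    return h


A = [0, 3, 7, 9]
B = [5, 10, 13, 17, 20]
print(jump_naive(A, B, 0, 1))  # A[0..2] aligns with B[1..3]
```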

Akutsu’s solution...

Observation: One of the first k+1 points in both A and B must match.

Each match defines a translation.

For each translation, one needs at most k+1 queries to Jump() to find out whether there is large enough overlap.
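The observation can be sketched as a decision procedure, assuming integer coordinates, distinct points within each set, non-empty inputs, and k counting unmatched points in A and B together. The function name is hypothetical. For simplicity each candidate translation is verified with a full scan of A instead of k+1 Jump() queries, so this sketch does not reach Akutsu's bound.

```python
def translation_with_at_most_k_unmatched(A, B, k):
    """Return a translation t leaving at most k points of A+t and B
    unmatched in total, or None if there is no such translation."""
    A, B = sorted(A), sorted(B)
    b_set = set(B)
    # The first matched pair must lie among the first k+1 points of each
    # set; otherwise more than k points would already be unmatched.
    for a in A[:k + 1]:
        for b in B[:k + 1]:
            t = b - a
            matched = sum(1 for x in A if x + t in b_set)
            if len(A) + len(B) - 2 * matched <= k:
                return t
    return None


A = [0, 3, 7, 12]
B = [10, 13, 17, 22, 30]
print(translation_with_at_most_k_unmatched(A, B, 1))
```

Only (k+1)^2 candidate translations are tried, which is where the dependence on k rather than on m and n comes from.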

Akutsu’s solution...

Theorem 1: Problem 1a can be solved in $O(k^3 + n \log n)$ time and Problem 1b in $O(k^2 n + n \log n)$ time.

Akutsu also gives reductions from 2D/3D problems to 1D achieving good bounds.

Linear 1D point set matching

Let us consider a generalization where we also allow scaling and noise.

We search for the best linear mapping from point set A to point set B:
- the maximum number of points of A should move close to points of B.

Example

[Figure: point sets A and B.]

Example...

[Figure: point sets A and B, and the mapped set f(A).]

Linear 1D point set matching...

There is an optimum mapping such that two points of A are mapped exactly at ε-distance from some points of B.

One of these two point mappings fixes the translation, the other fixes the scale around the new origin defined by the translation.

Example

[Figure: A, B and f(A), with a tolerance window of width 2ε.]

Degenerate solution!

[Figure: B with a tolerance window of width 2ε, A, and f(A).]

One-to-one mapping

To avoid the degenerate solution, one needs a better definition for the mapping searched for.

Hence, we search for a mapping producing maximum size one-to-one matching between the points (Problem 2).

[Figure: f(A) and B, with tolerance windows of width 2ε.]

Solving one-to-one case

Consider a fixed translation and scale.

Construct a bipartite graph having edges between points of f(A) and B that are within ε-distance of each other.

Solve the maximum matching problem on this graph.

[Figure: f(A) and B, with tolerance windows of width 2ε.]
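A sketch of the fixed-mapping step, assuming the noise tolerance is a number `eps` and the points of f(A) are already computed. The function name is hypothetical, and the matching routine is the textbook augmenting-path method, chosen here for brevity; the slides do not say which maximum matching algorithm is meant.

```python
def max_matching_within_eps(fA, B, eps):
    """Size of a maximum one-to-one matching between fA and B, where a pair
    may be matched only if the two points are within eps of each other."""
    # adj[i] lists the indices of points of B that fA[i] may be matched to.
    adj = [[j for j, b in enumerate(B) if abs(a - b) <= eps] for a in fA]
    match_of_b = [-1] * len(B)  # match_of_b[j] = index in fA, or -1 if free

    def try_augment(i, seen):
        """Try to find an augmenting path starting from fA[i]."""
        for j in adj[i]:
            if j in seen:
                continue
            seen.add(j)
            if match_of_b[j] == -1 or try_augment(match_of_b[j], seen):
                match_of_b[j] = i
                return True
        return False

    return sum(try_augment(i, set()) for i in range(len(fA)))


print(max_matching_within_eps([10, 12, 50], [11, 53, 90], 5))
```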

Solving one-to-one case...

Repeating the algorithm on each relevant translation and scale gives the optimum solution.

The overall time complexity is $O((mn)^2\, g(mn))$, where g(x) is the complexity of the maximum matching algorithm on a graph with x edges.

Solving one-to-one case faster

Consider a fixed translation, and sort the relevant scales from smallest to largest.

Observation [Alt et al. 88]: The graph $G_i$ corresponding to the $i$th scale differs from the graph $G_{i-1}$ of the $(i-1)$th scale by one edge.

The maximum matching on $G_i$ can be found by searching for an augmenting path in $G_{i-1}$ after adding or deleting one edge.

Solving one-to-one case faster...

Incremental computation gives an $O((mn)^3)$ time solution.

Theorem 2: Problem 2 can be solved in $O((mn)^2 (m+n))$ time.

To obtain the result, we exploit the monotonicity of the match graph.

Staircase property

[Figure: $f_i(A)$ and B.]

Greedy algorithm is enough

[Figure: B and $f_i(A)$.]
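One plausible reading of "greedy is enough" for a fixed mapping, as a hedged sketch: with both point sets sorted, a single left-to-right scan finds a maximum one-to-one matching in O(m+n) time after sorting. This is the standard greedy for points on a line with a common tolerance; the name `greedy_matching` and the parameter `eps` are assumptions, and the slides' own formulation on the staircase matrix may differ in detail.

```python
def greedy_matching(fA, B, eps):
    """Size of a maximum one-to-one matching between 1D point sets fA and B,
    where matched points must be within eps of each other."""
    fA, B = sorted(fA), sorted(B)
    i = j = matched = 0
    while i < len(fA) and j < len(B):
        if abs(fA[i] - B[j]) <= eps:
            # Leftmost remaining points are compatible: match them.
            matched += 1
            i += 1
            j += 1
        elif fA[i] < B[j]:
            # fA[i] is too far left for B[j] and for every later point of B.
            i += 1
        else:
            # B[j] is too far left for fA[i] and for every later point of fA.
            j += 1
    return matched


print(greedy_matching([10, 12, 50], [11, 53, 90], 5))
```

Compared with a general bipartite matching, the scan never backtracks, which is what the monotone (staircase) structure of the match graph allows.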

scale i => scale i+1

[Figure: B and $f_{i+1}(A)$.]

scale i+1

[Figure: B and $f_{i+1}(A)$.]

scale i+1 => scale i+2

[Figure: B and $f_{i+2}(A)$.]

scale i+2

[Figure: B and $f_{i+2}(A)$.]

Observation - open question

Observation: With only translations and noise, we obtain O(mn(m+n)) time.

The staircase matrix changes only by one cell when moving from one scale to another.

Question: Can one update the greedy path incrementally?

An O(1) solution for the above would imply that adding noise does not make the problem any harder.

2D point set matching

[Figure: 2D point sets B and A, and the translated set f(A).]

Solutions

Easy in O(mn log m) time by constructing the set of mn translation vectors, sorting it, and finding the most frequent element.

Also possible in O(mn) time by using a naive string matching type algorithm.
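The sort-and-count idea can be sketched as follows, assuming integer 2D coordinates, distinct points within each set, and non-empty inputs. The function name is hypothetical. Sorting all mn vectors at once costs O(mn log(mn)); the O(mn log m) bound on the slide presumably relies on a more careful merge, which this sketch does not attempt.

```python
from itertools import groupby


def best_translation(A, B):
    """Return (t, count): the translation vector t = b - a shared by the
    largest number of pairs (a, b), and how many points of A it matches."""
    vectors = sorted((bx - ax, by - ay) for ax, ay in A for bx, by in B)
    # After sorting, equal vectors are adjacent, so each run can be counted.
    runs = ((t, len(list(run))) for t, run in groupby(vectors))
    return max(runs, key=lambda pair: pair[1])


A = [(0, 0), (1, 2), (3, 1)]
B = [(5, 5), (6, 7), (8, 6), (0, 9)]
print(best_translation(A, B))
```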

Naive point set matching

[Figure: point sets A and B.]

Remark: This is the fastest known algorithm for this problem!!

Restricted case?

Would the problem become easier if there were no other points inside the area of matches?

[Figure: f(A).]

Restricted case?

Restricted 1D case is extremely easy:
- Exact string matching on the differentially encoded sequences.
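A sketch of the restricted 1D case, assuming integer coordinates so that gaps can be compared exactly. It searches for the gap sequence of A as a contiguous block of the gap sequence of B, using the Knuth-Morris-Pratt algorithm as one possible exact string matcher; the function name is hypothetical.

```python
def find_translated_occurrences(A, B):
    """Start indices s such that sorted A, after some translation, equals
    the consecutive points B[s], ..., B[s + len(A) - 1] of sorted B."""
    A, B = sorted(A), sorted(B)
    pat = [A[i + 1] - A[i] for i in range(len(A) - 1)]
    txt = [B[i + 1] - B[i] for i in range(len(B) - 1)]
    if not pat:
        # A single point matches every point of B; an empty A matches nothing.
        return list(range(len(B))) if A else []

    # KMP failure function over the gap sequence of A.
    fail = [0] * len(pat)
    q = 0
    for i in range(1, len(pat)):
        while q and pat[i] != pat[q]:
            q = fail[q - 1]
        if pat[i] == pat[q]:
            q += 1
        fail[i] = q

    # Scan the gap sequence of B.
    hits, q = [], 0
    for i, gap in enumerate(txt):
        while q and gap != pat[q]:
            q = fail[q - 1]
        if gap == pat[q]:
            q += 1
        if q == len(pat):
            hits.append(i - len(pat) + 1)
            q = fail[q - 1]
    return hits


print(find_translated_occurrences([0, 2, 5], [1, 4, 6, 9, 11, 14]))
```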

Easier on grid points

Easier on grid points...

The problem becomes a special case of two-dimensional exact string matching.

Can be solved in $O(N^2)$ time on a text grid of size $N \times N$ and pattern grid of size $M \times M$.

Notice that the run-length encoded representation of the rows of the matrix is of size O(n).

Easier on grid points...

The algorithm of Amir & Landau & Sokol, 2002, for run-length compressed 2D search can be applied:
- Time complexity $O(M^2 + n)$. (Can be reduced to $O(m^2 + n)$?)

What about Bird-Baker?

Our idea to solve the problem is to modify the Bird-Baker algorithm to work directly on point sets.

As a preliminary tool, we need an Aho-Corasick automaton that recognizes run-length encoded binary strings.

Run-length encoding

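[Figure: rows of points with gaps 5.7, 12.2, ... and 3.1, 9.3, ... written as run-length encoded binary strings such as $0^{5.7}\,1\,0^{12.2}\,\dots$]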

Modified Aho-Corasick automaton

Proposition: There is an automaton accepting a set of run-length encoded binary strings with the following properties:
- O(m log m) construction time, where m is the number of 1-bits in the set.
- Reading a fail-link in O(log m) time.
- Scanning a string with n 1-bits in O(n log m) time.

Bird-Baker on point sets

Now we can build our automaton on the rows of set A and scan the rows of set B with it.

Let R be the set of positions where a row of A was accepted inside the rows of B.

After sorting R by columns, we can test in O(|R|) time if any column of R contains the correct sequence of accepting states.

Bird-Baker on point sets

The overall running time is O(n log m +|R| log |R|).

Unfortunately, there are examples where $|R| = \Omega(mn)$ :-(

Hence, it is still open whether (even) the restricted case has an o(mn) solution.