Similarity based Retrieval from Sequence Databases using Automata as Queries

Authors: A. Prasad Sistla, Tao Hu, Vikas Chowdhry

Source: CIKM 2002, ACM

Advisor: Prof. 郭煌政

Student: 林奕森

Outline

Introduction
Related work
Definitions and Examples
Algorithms for Infinite Norm Distance
Algorithms for Average Block Distance
Experimental Results
Conclusion and Discussion

Introduction (1/4)

Sequence databases occur in many areas of research in database management systems. Temporal databases, time-series databases and video databases are some examples of sequence databases.

In this paper we consider similarity based retrieval from sequence databases.

Introduction (2/4)

Similarity based retrieval consists of retrieving those subsequences that closely satisfy the query based on a similarity measure.

In this paper, we consider a language based on finite state automata for specifying queries on sequences, and develop similarity based methods for retrieval.

Introduction (3/4)

We consider the following problems for a given database sequence d and a specification automaton A:

(i) retrieval of the k closest subsequences of d with respect to the automaton A (called a "nearest neighbor query")

(ii) retrieval of all subsequences of d within a given distance from A (called a "range query")

Introduction (4/4)

We have implemented the proposed methods on top of Sequel Server.

We also consider a restricted class of automata, called cycle-restricted automata. We present more efficient algorithms for these automata.

Related work (1/3)

There has been much work done on querying from time-series and other sequence databases.

For example, methods for similarity based retrieval from such databases have been proposed in [11, 2, 3, 5, 15, 14].

Related work (2/3)

The paper [1] presents a language called SDL. The retrieval there is done based on exact match; it is not similarity based retrieval using a global distance measure, as ours is.

There has also been much work done on data mining over time series data [4, 12, 6] and other databases. Among these works, [6] uses automata.

Related work (3/3)

All these works mostly consider discovery of patterns that have a given minimum level of support. They do not consider similarity based retrieval.

A temporal query language and efficient algorithms for similarity based retrieval have been presented in [18].

Definitions and Examples Basic Automata and Similarity values

Automata

1 An automaton A is a 5-tuple (Q, Σ, δ, I, F), where Q is a finite set of states, Σ is a finite set of symbols called the input alphabet, δ is the set of transitions, and I, F ⊆ Q are the sets of initial and final states, respectively.

2 Each input symbol represents an atomic predicate (also called an atomic query in some places) on a single database state.

Definitions and Examples Automata example

1 Each transition of A, i.e. each member of δ, is a triple of the form (q, a, q’) where q, q’ ∈ Q and a ∈ Σ; this triple denotes that the automaton makes a transition from state q to q’ on input a; we also represent such a transition as q →a q’.

2 For example, in a stock market database, price(ibm) = 100 represents an atomic predicate.

Definitions and Examples Automata example

Consider the automaton B defined as follows. It has three states 1, 2, 3. Its input symbols are the atomic queries time = 10AM, time = 4PM and price(IBM) < 100. States 1 and 3 are the start and final states respectively. The automaton has the following transitions: from state 1 to 2 on the input symbol time = 10AM, from state 2 back to 2 on the symbol price(IBM) < 100, and from state 2 to 3 on the symbol time = 4PM.
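As an illustration, automaton B can be written down directly as a 5-tuple. The sketch below is only one possible encoding: the `Automaton` type and the `accepts` helper are names made up for this example, and `accepts` checks exact acceptance of a symbol sequence, not the similarity based matching discussed later.

```python
from typing import NamedTuple


class Automaton(NamedTuple):
    """Hypothetical encoding of the 5-tuple (Q, Sigma, delta, I, F)."""
    states: frozenset
    alphabet: frozenset
    transitions: frozenset  # set of (q, a, q') triples
    initial: frozenset
    final: frozenset


# Automaton B from the slide: 1 --time = 10AM--> 2, 2 --price(IBM) < 100--> 2,
# 2 --time = 4PM--> 3, with start state 1 and final state 3.
B = Automaton(
    states=frozenset({1, 2, 3}),
    alphabet=frozenset({"time = 10AM", "time = 4PM", "price(IBM) < 100"}),
    transitions=frozenset({
        (1, "time = 10AM", 2),
        (2, "price(IBM) < 100", 2),
        (2, "time = 4PM", 3),
    }),
    initial=frozenset({1}),
    final=frozenset({3}),
)


def accepts(aut, symbols):
    """Return True if `symbols` labels a path from an initial to a final state."""
    current = set(aut.initial)
    for a in symbols:
        # Follow every transition on symbol `a` out of the current state set.
        current = {q2 for (q1, sym, q2) in aut.transitions
                   if q1 in current and sym == a}
    return bool(current & aut.final)
```

The symbol price(IBM) < 100 may repeat any number of times between the opening and closing symbols, which is what the self-loop on state 2 expresses.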

Definitions and Examples Similarity Measure

A database sequence d is a finite sequence of database states.

A database state represents an image (in the case of video databases) or a document (in the case of textual databases).

For a database state c and an atomic query c’, we let sim (c’, c) denote the similarity value with which c satisfies the query c’.

Definitions and Examples Similarity Measure

We let dist(c, c') = 1 - sim(c', c) represent the distance between the database state c and the atomic query c'.

To define the similarity of a database sequence d = (d_0, ..., d_{n-1}) with respect to an automaton A, we first define a distance measure dist(d, a) between d and an input sequence a = (a_0, ..., a_{n-1}) of equal length.

Definitions and Examples Similarity Measure

Let sim_vec(d, a) be the sequence (s_0, ..., s_{n-1}) where, for each i = 0, ..., n-1, s_i = sim(a_i, d_i). We assume that all similarity values and distances are normalized, i.e. they lie in the interval [0, 1].

Let F be a vector distance function which, given two vectors x, y as arguments, associates with them a real number lying in the interval [0, 1].

Definitions and Examples Similarity Measure

We define dist(d, a) = F(sim_vec(d, a), 1), where 1 denotes the vector of all ones of the same length as d. Now, we define a distance measure dist(d, C) between the database sequence d and a set C ⊆ Σ* of input sequences. dist(d, C) is the minimum of dist(d, α), where the minimum is taken over all α ∈ C such that |α| = |d|; if there is no sequence α ∈ C such that |α| = |d|, then we take dist(d, C) to be equal to 1.
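The definitions above translate almost literally into code when C is a finite set. The following is a minimal sketch under that assumption; the function names `dist_seq` and `dist_set` are chosen for this example, and both the atomic similarity `sim` and the vector distance `F` are passed in as parameters because the definition leaves them open.

```python
def sim_vec(d, a, sim):
    """Similarity vector (s_0, ..., s_{n-1}) with s_i = sim(a_i, d_i)."""
    return [sim(a_i, d_i) for a_i, d_i in zip(a, d)]


def dist_seq(d, a, sim, F):
    """dist(d, a) = F(sim_vec(d, a), 1), where 1 is the all-ones vector."""
    s = sim_vec(d, a, sim)
    return F(s, [1.0] * len(s))


def dist_set(d, C, sim, F):
    """dist(d, C): minimum of dist(d, alpha) over alpha in C with |alpha| = |d|.

    C is assumed here to be a finite collection of sequences. If no sequence
    in C has the same length as d, the distance is 1 by definition.
    """
    candidates = [dist_seq(d, alpha, sim, F) for alpha in C if len(alpha) == len(d)]
    return min(candidates) if candidates else 1.0
```

For an automaton A the set L(A) is generally infinite, so this enumeration only illustrates the definition; it is not a retrieval algorithm.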

Definitions and Examples Similarity Measure

We define the distance of d with respect to A, denoted by dist(d, A), to be dist(d, L(A)), where L(A) is the set of input sequences accepted by A. We define the similarity of d with respect to the automaton A, denoted by sim(d, A), to be 1 - dist(d, L(A)).

Definitions and Examples Similarity Measure

Note that F_1 is the average block distance function, F_2 is the mean square distance function, and so on.

We call F_∞ the infinite norm distance function.
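The slide that defines the family F_p did not survive in this transcript, so the formula used below is an assumption: F_p(x, y) = ((1/n) Σ |x_i - y_i|^p)^(1/p), with F_∞(x, y) = max |x_i - y_i|. It is chosen only because it matches the names given above for p = 1 and p = ∞; the paper's exact normalization may differ.

```python
def F_p(x, y, p):
    """Assumed form of the vector distance F_p for finite p >= 1.

    Computes ((1/n) * sum |x_i - y_i|**p) ** (1/p) for non-empty vectors of
    equal length. With components in [0, 1] the result also lies in [0, 1].
    """
    n = len(x)
    return (sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) / n) ** (1.0 / p)


def F_inf(x, y):
    """Assumed form of the infinite norm distance: the largest component gap."""
    return max(abs(xi - yi) for xi, yi in zip(x, y))
```

Under this assumption F_1 averages the per-state distances, while F_∞ is determined by the single worst-matching state.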

Definitions and Examples Wild Card Symbol

We assume that there is a special input symbol φ which denotes a wild card symbol, i.e. it denotes an atomic query which is always satisfied.

Cycle-Restricted Automata

Let A = (Q, Σ, δ, I, F) be an automaton. A path of the automaton is a sequence of transitions of the following form: q_0 →a_0 q_1, q_1 →a_1 q_2, ..., q_{n-1} →a_{n-1} q_n. We call such a sequence a path from q_0 to q_n.

Definitions and Examples Cycle-Restricted Automata

We call the path a φ-path if all input symbols appearing in it are wild cards, i.e., for each i = 0, ..., n-1, a_i = φ. The above path is called a cycle if q_n = q_0 and q_0, q_1, ..., q_{n-1} are all distinct. A φ-path which is also a cycle is called a φ-cycle. We say that an automaton is cycle-restricted if it has no φ-cycles of length greater than 1.
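One way to test the cycle-restricted property is to look only at the wild card transitions, drop the self-loops (φ-cycles of length 1 are permitted), and check that what remains has no directed cycle. The sketch below does this with a depth-first search; the function name and the use of the string "φ" as the wild card marker are assumptions of this example.

```python
WILD = "φ"  # hypothetical marker for the wild card symbol


def is_cycle_restricted(transitions, wild=WILD):
    """Return True if there is no φ-cycle of length greater than 1.

    `transitions` is an iterable of (q, a, q') triples.
    """
    # Graph of wild card transitions, ignoring self-loops.
    graph = {}
    for q, a, q2 in transitions:
        if a == wild and q != q2:
            graph.setdefault(q, set()).add(q2)

    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}

    def has_cycle(q):
        # Standard DFS: reaching a GRAY state means a back edge, hence a cycle.
        color[q] = GRAY
        for nxt in graph.get(q, ()):
            c = color.get(nxt, WHITE)
            if c == GRAY or (c == WHITE and has_cycle(nxt)):
                return True
        color[q] = BLACK
        return False

    return not any(color.get(q, WHITE) == WHITE and has_cycle(q)
                   for q in list(graph))
```

The recursion depth is bounded by the number of states, which is adequate for the small query automata considered here.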

Definitions and Examples Nearest Neighbor and Range Queries

In this paper, we consider the evaluation of two types of queries, assuming that we are given a query automaton A and a database sequence d.

Definitions and Examples Nearest Neighbor and Range Queries

The first type of queries are called nearest neighbor queries. Here we have to retrieve the k subsequences of d having the lowest distances with respect to A, where k is an additional input which is a positive integer.

Definitions and Examples Nearest Neighbor and Range Queries

The second type of queries are called range queries. Here we have to retrieve all subsequences of d whose distance with respect to A is less than or equal to ε, where ε is an additional input which is a positive fraction.
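Once a distance to A is known for each candidate subsequence, the two query types differ only in how the candidates are selected. The sketch below assumes the distances have already been computed and are supplied as hypothetical (start, end, distance) triples; it says nothing about how the paper's algorithms obtain them.

```python
import heapq


def nearest_neighbors(scored, k):
    """Nearest neighbor query: the k entries with the lowest distance.

    `scored` is an iterable of (start, end, distance) triples, one per
    candidate subsequence; the result is ordered by increasing distance.
    """
    return heapq.nsmallest(k, scored, key=lambda entry: entry[2])


def range_query(scored, eps):
    """Range query: all entries whose distance is at most eps."""
    return [entry for entry in scored if entry[2] <= eps]
```

A nearest neighbor query always returns k results when at least k candidates exist, whereas a range query may return none or all of them depending on the threshold.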

ALGORITHMS FOR INFINITE NORM DISTANCE

Definitions and lemma

Lemma 4.1

Let q be any state in Q and i be an integer such that 1 ≤ i ≤ n. Further, let q_1, ..., q_m be the successor states of q on input symbols a_1, ..., a_m respectively.

ALGORITHMS FOR INFINITE NORM DISTANCE

Employing Indices for fast retrieval

For each i = 1, ..., m, we can retrieve a list L_i of entries of the form (I, val), where I is an interval of the form [u, v] such that 1 ≤ u ≤ v ≤ n and 0 ≤ val < 1. The entry ([u, v], val) on the list L_i denotes that the distance, with respect to a_i, of all database states whose indices fall within the range [u, v] is val; that is, for all j such that u ≤ j ≤ v, dist(d_j, a_i) = val.
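A list L_i of this shape can answer the question "what is dist(d_j, a_i)?" with a binary search. The sketch below makes two assumptions that the slide does not state: the entries are sorted by u with disjoint intervals, and a state covered by no entry has distance 1 (suggested by the condition val < 1 on stored entries). The function name `lookup` is made up for this example.

```python
from bisect import bisect_right


def lookup(entries, j, default=1.0):
    """Return dist(d_j, a_i) from an interval list L_i.

    `entries` is a list of ((u, v), val) pairs, assumed sorted by u with
    disjoint intervals and 1-based indices. States not covered by any
    interval are assumed to have distance `default`.
    """
    # Rebuilt on every call for brevity; a real index would keep this array.
    starts = [u for (u, _v), _val in entries]
    pos = bisect_right(starts, j) - 1  # last interval starting at or before j
    if pos >= 0:
        (u, v), val = entries[pos]
        if u <= j <= v:
            return val
    return default
```

Storing runs of equal distance as intervals keeps the list short when long stretches of the database sequence satisfy an atomic query equally well.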

Algorithms for Average Block Distance

For any subsequence σ = (d_i, ..., d_{i+l-1}) of d and any string a = (a_1, ..., a_l) ∈ Σ* of the same length, let bd(σ, a) be the sum Σ_{j=0}^{l-1} dist(d_{i+j}, a_{j+1}); it denotes the block distance between σ and a.
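The block distance is a plain sum of per-position distances, as the short sketch below shows. It uses 0-based indexing (the formula above indexes the string a from 1), and the atomic distance function `dist` is supplied by the caller; both choices are conventions of this example.

```python
def bd(d, i, a, dist):
    """Block distance between the subsequence of d starting at index i and string a.

    `d` is the database sequence as a list, `i` a 0-based start index, `a` the
    sequence of input symbols, and `dist(state, symbol)` the atomic distance.
    The compared subsequence is d[i], ..., d[i + len(a) - 1].
    """
    if i < 0 or i + len(a) > len(d):
        raise ValueError("subsequence does not fit inside the database sequence")
    return sum(dist(d[i + j], a[j]) for j in range(len(a)))
```

Dividing the result by the length l gives an average per-state distance, which is what the name "average block distance" suggests.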

Algorithms for Average Block Distance

Let val(q, i, r) = min{ bd(σ, a) : σ is a subsequence of d starting from d_i, and a is any string in T(q) of the same length as σ whose pseudo length is r }.

T(q) is the set of strings accepted by A starting from the state q.
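To give a feel for this kind of minimization, the sketch below computes a simplified quantity: the minimum block distance between a fixed-length subsequence starting at d_i and any string in T(q) of that length. It is not the paper's AVG-DIST algorithm. It drops the pseudo length parameter r, which these slides do not define, and it fixes the length l rather than minimizing over all lengths. All names are chosen for this example.

```python
import math


def min_block_distance(transitions, final, q, d, i, l, dist):
    """Minimum of bd(sigma, a) over strings a in T(q) with |a| = l.

    sigma is d[i], ..., d[i + l - 1] (0-based). `transitions` is an iterable
    of (q1, symbol, q2) triples, `final` the set of final states, and
    `dist(state, symbol)` the atomic distance. Returns math.inf when T(q)
    contains no string of length l.
    """
    transitions = list(transitions)
    # cost[s] = cheapest total distance of any path from q to s consumed so far.
    cost = {q: 0.0}
    for j in range(l):
        nxt = {}
        for q1, a, q2 in transitions:
            if q1 in cost:
                c = cost[q1] + dist(d[i + j], a)
                if c < nxt.get(q2, math.inf):
                    nxt[q2] = c
        cost = nxt
    return min((c for s, c in cost.items() if s in final), default=math.inf)
```

The table `cost` plays the role of a dynamic programming row indexed by automaton state, so the running time is proportional to l times the number of transitions.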

Algorithms for Average Block Distance

AVG-DIST: computes the minimum of the distances of all the subsequences of the database sequence with respect to the automaton A.

AVG-DIST-RESTR-AUT: the corresponding algorithm for cycle-restricted automata.

Experimental Results

We have implemented all the algorithms: INF-NORM, INFNORM-INDX, AVG-DIST, INF-NORM-RESTR-AUT and AVG-DIST-RESTR-AUT.

The algorithms are run, using SQL, on a stock market database.

Experimental Results

The database stored the end-of-day Dow-Jones Industrial averages over the last 98 years, giving a database sequence of length 26,716 (the length is the total number of trading days during that period).

The query is specified by an automaton that accepts the language given by the regular expression ab*c.

Conclusion and Discussion

We introduced a powerful formalism based on automata for expressing queries on sequence databases.

We have also given efficient algorithms for similarity based retrieval that employ indices.

We implemented the algorithms for time-series databases on a PC using Sequel server.

Conclusion and Discussion

Experimental results showing the effectiveness of our methods are presented.

It will also be interesting to see if and how the techniques of the paper can be extended for data mining over sequences.