
Sequence Data Mining: Techniques and Applications

Sunita Sarawagi, IIT Bombay, http://www.it.iitb.ac.in/~sunita
Mark Craven, University of Wisconsin, http://www.biostat.wisc.edu/~craven


What is a sequence?

Ordered set of elements: s = a1, a2, ..., an
Each element ai could be:
Numerical
Categorical: domain Σ, a finite set of symbols, |Σ| = m
Multiple attributes
The length n of a sequence is not fixed
Order is determined by time or position, and could be regular or irregular


Real-life sequences

Classical applications:
Speech: sequence of phonemes
Language: sequence of words and delimiters
Handwriting: sequence of strokes

Newer applications:
Bioinformatics:
Genes: sequence of 4 possible nucleotides, |Σ| = 4. Example: AACTGACCTGGGCCCAATCC
Proteins: sequence of 20 possible amino acids, |Σ| = 20
Telecommunications: alarms, data packets
Retail data mining: customer behavior, sessions in an e-store (e.g., Amazon)
Intrusion detection


Intrusion detection

Intrusions could be detected at:
Network level (denial-of-service attacks, port scans, etc.) [KDD Cup 99]: sequence of TCP dumps
Host level (attacks on privileged programs like lpr, sendmail): sequence of system calls
Σ = the set of all possible system calls, |Σ| ~ 100
Example trace: open lseek lstat mmap execve ioctl ioctl close execve close unlink


Outline

Traditional mining operations on sequences:
Classification
Clustering
Finding repeated patterns
Primitives for handling sequence data
Sequence-specific mining operations:
Partial sequence classification (tagging)
Segmenting a sequence
Predicting next symbol of a sequence
Applications in biological sequence mining


Classification of whole sequences

Given: a set of classes C and a number of example sequences in each class,
train a model so that for an unseen sequence we can say to which class it belongs.

Examples:
Given a set of protein families, find the family of a new protein
Given a sequence of packets, label the session as intrusion or normal
Given several utterances of a set of words, classify a new utterance as the right word


Conventional classification methods

Conventional classification methods assume record data:
fixed number of attributes
single record per instance to be classified (no order)
Sequences: variable length, order important.

Adapting conventional classifiers to sequences:
Generative classifiers
Boundary-based classifiers
Distance-based classifiers
Kernel-based classifiers


Generative methods

For each class i, train a generative model Mi to maximize the likelihood over all training sequences in class i
Estimate Pr(ci) as the fraction of training instances in class i
For a new sequence x:
find Pr(x|ci)*Pr(ci) for each i using Mi
choose the i with the largest value of Pr(x|ci)*Pr(ci)

(Figure: sequence x scored against each class model, yielding Pr(x|c1)*Pr(c1), Pr(x|c2)*Pr(c2), Pr(x|c3)*Pr(c3))

Need a generative model for sequence data
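A minimal sketch of this recipe in Python, assuming the simplest generative model (the independent/unigram model listed later among the probabilistic models); any model exposing a log-likelihood could be dropped in, and the names here are illustrative:

```python
# Generative sequence classification: one model per class, pick the class
# maximizing Pr(x | c_i) * Pr(c_i) (computed in log space for stability).
import math
from collections import Counter

def train_independent_model(sequences):
    """Estimate Pr(symbol) from all symbols in one class's training sequences."""
    counts = Counter(sym for seq in sequences for sym in seq)
    total = sum(counts.values())
    return {sym: c / total for sym, c in counts.items()}

def log_likelihood(model, seq, eps=1e-9):
    # eps stands in for unseen symbols; a real system would smooth properly
    return sum(math.log(model.get(sym, eps)) for sym in seq)

def train(classes):
    """classes: dict mapping class name -> list of training sequences."""
    n = sum(len(seqs) for seqs in classes.values())
    return {c: (train_independent_model(seqs), len(seqs) / n)  # (M_i, Pr(c_i))
            for c, seqs in classes.items()}

def classify(models, x):
    return max(models, key=lambda c: log_likelihood(models[c][0], x)
                                     + math.log(models[c][1]))

models = train({"normal": ["AACA", "AAC"], "abnormal": ["CCGC", "CGC"]})
print(classify(models, "AACC"))  # -> "normal"
```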


Boundary-based methods

Data: points in a fixed multi-dimensional space
Output of training: boundaries that define regions within which the same class is predicted
Application: tests on boundaries to find the region
Examples: decision trees, neural networks, linear discriminants

Need to embed sequence data in a fixed coordinate space


Kernel-based classifiers

Define a function K(xi, x) that intuitively captures the similarity between two sequences and satisfies two properties:
K is symmetric, i.e., K(xi, x) = K(x, xi)
K is positive definite

Training: learn, for each class c,
wic for each training sequence xi
bc

Application: predict the class of x
For each class c, find f(x,c) = Σi wic K(xi, x) + bc
The predicted class is the c with the highest value of f(x,c)

Well-known kernel classifiers:
Nearest-neighbor classifier
Support Vector Machines
Radial basis functions

Need a kernel/similarity function
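A hedged sketch of this decision rule; the weights wic and biases bc below are illustrative placeholders (real values would come from, e.g., SVM training), and K is a toy shared-2-gram kernel, not one from the tutorial:

```python
# Kernel decision rule: f(x, c) = sum_i w_ic * K(x_i, x) + b_c,
# predicting the class with the highest score.
def kgrams(s, k=2):
    return {s[i:i+k] for i in range(len(s) - k + 1)}

def K(a, b):
    """Toy kernel: number of shared 2-grams (symmetric by construction)."""
    return len(kgrams(a) & kgrams(b))

def predict(x, train_seqs, weights, biases):
    # weights[c][i] plays the role of w_ic; biases[c] is b_c
    scores = {c: sum(w * K(xi, x) for xi, w in zip(train_seqs, weights[c]))
                 + biases[c]
              for c in biases}
    return max(scores, key=scores.get)

train_seqs = ["AACA", "CCGC"]
weights = {"normal": [1.0, -1.0], "abnormal": [-1.0, 1.0]}  # placeholder w_ic
biases = {"normal": 0.0, "abnormal": 0.0}                   # placeholder b_c
print(predict("AACC", train_seqs, weights, biases))         # -> "normal"
```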


Sequence clustering

Given a set of sequences, create groups such that similar sequences are in the same group

Three kinds of clustering algorithms:
Distance-based: k-means, various hierarchical algorithms (need a similarity function)
Model-based: Expectation Maximization algorithm (need generative models)
Density-based: CLIQUE (need a dimensional embedding)


Outline

Traditional mining on sequences
Primitives for handling sequence data:
Embed sequence in a fixed-dimensional space
All conventional record mining techniques will apply
Generative models for sequences:
Sequence classification: generative methods
Clustering sequences: model-based approach
Distance between two sequences:
Sequence classification: SVM and NN
Clustering sequences: distance-based approach
Sequence-specific mining operations
Applications


Embedding sequences in a fixed-dimensional space

Ignore order; each symbol maps to a dimension:
extensively used in text classification and clustering
Extract aggregate features:
Real-valued elements: Fourier coefficients, wavelet coefficients, auto-regressive coefficients
Categorical data: number of symbol changes
Sliding-window techniques (k: window size); see the examples on the next slide and the sketch that follows them:
Define a coordinate for each possible k-gram (m^k coordinates):
each k-gram's coordinate is the number of times it appears in the sequence
(k,b) mismatch score: each k-gram's coordinate is the number of k-grams in the sequence within b mismatches of it
Define a coordinate for each of the k positions: multiple rows per sequence


Sliding window examples

Example trace: open lseek ioctl mmap execve ioctl ioctl open execve close mmap
(symbols abbreviated by first letter: o, l, i, m, e, c)

One symbol per column (symbol counts per trace):
sid   o   c   l   i   e   m
1     2   1   1   3   2   1
2     ..  ..  ..  ..  ..  ..
3     ..  ..  ..  ..  ..  ..

Sliding window, window size 3 (one row per trace):
sid   ioe  cli  oli  lie  lim  ...
1     1    0    1    0    1
2     ..   ..   ..   ..   ..
3     ..   ..   ..   ..   ..

Mismatch scores, b = 1 (one row per trace):
sid   ioe  cli  oli  lie  lim  ...
1     2    1    1    0    1
2     ..   ..   ..   ..   ..
3     ..   ..   ..   ..   ..

Multiple rows per trace (one row per window position):
sid   A1  A2  A3
1     o   l   i
1     l   i   m
1     i   m   e
1     ..  ..  ..
1     e   c   m
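A small sketch of the k-gram embedding illustrated above; the helper name is an assumption, and the resulting count vectors can feed any record-oriented classifier or clustering method:

```python
# Map each trace to a vector of k-gram counts: one coordinate per
# observed k-gram, value = number of times that window appears.
from collections import Counter

def kgram_counts(seq, k=3):
    """Count every length-k window; seq may be a string or a list of symbols."""
    return Counter(tuple(seq[i:i+k]) for i in range(len(seq) - k + 1))

trace = ["open", "lseek", "ioctl", "mmap", "execve",
         "ioctl", "ioctl", "open", "execve", "close", "mmap"]
vec = kgram_counts(trace, k=3)
print(vec[("ioctl", "ioctl", "open")])  # -> 1
```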


Detecting attacks on privileged programs

Short sequences of system calls made during normal executions of a program are very consistent, yet different from the sequences of its abnormal executions

Two approaches:
STIDE (Warrender 1999): create a dictionary of the unique k-windows in normal traces, count what fraction of a new trace's windows occur in it, and threshold (a sketch follows)
RIPPER-based (Lee 1998): next...
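A minimal sketch of the STIDE idea just described; the function names and the default threshold are assumptions, not the published implementation:

```python
# STIDE-style anomaly detection: dictionary of k-windows seen in normal
# traces, then flag a trace whose fraction of unseen windows is too high.
def windows(trace, k):
    return {tuple(trace[i:i+k]) for i in range(len(trace) - k + 1)}

def train_stide(normal_traces, k=6):
    normal = set()
    for t in normal_traces:
        normal |= windows(t, k)
    return normal

def is_abnormal(trace, normal, k=6, threshold=0.1):
    wins = [tuple(trace[i:i+k]) for i in range(len(trace) - k + 1)]
    mismatches = sum(1 for w in wins if w not in normal)
    return mismatches / max(len(wins), 1) > threshold

normal = train_stide([["open", "lseek", "lstat", "mmap", "execve", "ioctl",
                       "ioctl", "close", "execve", "close", "unlink"]])
print(is_abnormal(["open", "lseek", "lstat", "mmap", "execve", "ioctl",
                   "ioctl", "close", "execve", "close", "unlink"], normal))
```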


Classification models on k-gram trace data

When both normal and abnormal data are available, class label = normal/abnormal:

7-grams                                      class
vtimes open seek read read read seek         normal
lstat lstat lstat bind open close vtimes     abnormal

When only normal traces are available, class label = k-th system call:

6 attributes                                 class
vtimes open seek read read read              seek
lstat lstat lstat bind open close            vtimes

Learn rules to predict the class label [RIPPER]


Examples of output RIPPER rules

Both traces:
if the 2nd system call is vtimes and the 7th is vtrace, then the sequence is normal
if the 6th system call is lseek and the 7th is sigvec, then the sequence is normal
if none of the above, then the sequence is abnormal

Only normal:
if the 3rd system call is lstat and the 4th is write, then the 7th is stat
if the 1st system call is sigblock and the 4th is bind, then the 7th is setsockopt
if none of the above, then the 7th is open


Experimental results on sendmail traces

Percent of mismatching traces:

trace              Only-normal   BOTH
sscp-1             13.5          32.2
sscp-2             13.6          30.4
sscp-3             13.6          30.4
syslog-remote-1    11.5          21.2
syslog-remote-2     8.4          15.6
syslog-local-1      6.1          11.1
syslog-local-2      8.0          15.9
decode-1            3.9           2.1
decode-2            4.2           2.0
sm565a              8.1           8.0
sm5x                8.2           6.5
sendmail            0.6           0.1

The output rule sets contain ~250 rules, each with 2 or 3 attribute tests
Score each trace by counting the fraction of mismatches and thresholding
Summary: only normal traces suffice to detect intrusions


More realistic experiments [from Warrender 99]

Different programs need different thresholds
Simple methods (e.g., STIDE) work as well
Results are sensitive to window size
Is it possible to do better with sequence-specific methods?

                 STIDE                    RIPPER
                 threshold  %false-pos    threshold  %false-pos
Site-1 lpr       12         0.0           3          0.0016
Site-2 lpr       12         0.0013        4          0.0265
named            20         0.0019        10         0.0
xlock            20         0.00008       10         0.0


Outline

Traditional mining on sequences
Primitives for handling sequence data:
Embed sequence in a fixed-dimensional space
All conventional record mining techniques will apply
Distance between two sequences:
Sequence classification: SVM and NN
Clustering sequences: distance-based approach
Generative models for sequences:
Sequence classification: whole and partial
Clustering sequences: model-based approach
Sequence-specific mining operations
Applications in biological sequence mining


Probabilistic models for sequences

Independent model
One-level dependence (Markov chains)
Fixed memory (order-l Markov chains)
Variable memory models
More complicated models: Hidden Markov Models


First-order Markov chains

Model structure:
A state for each symbol in Σ
Edges between states, labeled with transition probabilities

(Figure: two states A and C, with start probabilities 0.5/0.5 and transitions Pr(A|A) = 0.1, Pr(C|A) = 0.9, Pr(A|C) = 0.4, Pr(C|C) = 0.6)

Probability of a sequence s being generated from the model. Example:
Pr(AACA) = Pr(A) Pr(A|A) Pr(C|A) Pr(A|C) = 0.5 * 0.1 * 0.9 * 0.4

Training the transition probability between states:
Pr(b|a) = Count(ab) / Count(a)
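A sketch of training and scoring such a first-order chain; smoothing is omitted to keep it short, with a small eps standing in for unseen transitions:

```python
# First-order Markov chain: Pr(b | a) = Count(ab) / Count(a), with start
# probabilities estimated from the first symbol of each training sequence.
import math
from collections import Counter, defaultdict

def train_markov(sequences):
    start, trans = Counter(), defaultdict(Counter)
    for seq in sequences:
        start[seq[0]] += 1
        for a, b in zip(seq, seq[1:]):
            trans[a][b] += 1
    start_p = {s: c / sum(start.values()) for s, c in start.items()}
    trans_p = {a: {b: c / sum(nxt.values()) for b, c in nxt.items()}
               for a, nxt in trans.items()}
    return start_p, trans_p

def log_prob(seq, start_p, trans_p, eps=1e-9):
    lp = math.log(start_p.get(seq[0], eps))
    for a, b in zip(seq, seq[1:]):
        lp += math.log(trans_p.get(a, {}).get(b, eps))
    return lp

start_p, trans_p = train_markov(["AACA", "ACCA", "CACA"])
print(math.exp(log_prob("AACA", start_p, trans_p)))
```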


Higher-order Markov chains

l = memory of the sequence
Model:
A state for each possible suffix of length l: |Σ|^l states
Edges between states, labeled with transition probabilities

(Figure, l = 2: states AA, AC, CA, CC, with start probability Pr(AA) = 0.5 and transitions such as Pr(C|AA) = 0.9 and Pr(A|AC) = 0.7)

Pr(AACA) = Pr(AA) Pr(C|AA) Pr(A|AC) = 0.5 * 0.9 * 0.7

Training the model:
Pr(b|s) = count(sb) / count(s)
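The same idea generalized to order l, one state per length-l suffix; the probability of the first l symbols is passed in as an assumption (the slide's example uses Pr(AA) = 0.5):

```python
# Order-l Markov chain: Pr(b | s) = count(sb) / count(s), where s is the
# length-l suffix preceding b. Not tuned for large alphabets, where the
# |Sigma|^l state table grows quickly.
import math
from collections import Counter, defaultdict

def train_order_l(sequences, l=2):
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(l, len(seq)):
            counts[seq[i-l:i]][seq[i]] += 1
    return {s: {b: c / sum(nxt.values()) for b, c in nxt.items()}
            for s, nxt in counts.items()}

def log_prob(seq, model, start_lp=math.log(0.5), l=2, eps=1e-9):
    lp = start_lp  # stands in for Pr(first l symbols), e.g. Pr(AA) = 0.5
    for i in range(l, len(seq)):
        lp += math.log(model.get(seq[i-l:i], {}).get(seq[i], eps))
    return lp

model = train_order_l(["AACAACA", "ACAACC"], l=2)
print(math.exp(log_prob("AACA", model)))
```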


Variable memory models

Probabilistic Suffix Automata (PSA)
Model states: substrings of size no greater than l, where no string is a suffix of another

(Figure: a PSA with states such as A, AC, and CC, with transitions including Pr(A|A) = 0.3, Pr(C|A) = 0.7, and Pr(A|AC) = 0.1)

Calculating Pr(AACA):
Pr(AACA) = Pr(A) Pr(A|A) Pr(C|A) Pr(A|AC) = 0.5 * 0.3 * 0.7 * 0.1

Training: not straightforward
Eased by Prediction Suffix Trees; PSTs can be converted to PSAs after training


Prediction Suffix Trees (PSTs)

Suffix trees with emission probabilities of observations attached to each tree node
Linear-time algorithms exist for constructing such PSTs from training data [Apostolico 2000]

(Figure: a PST over {A, C} with root e and nodes A, C, AC, CC; each node carries a next-symbol distribution (Pr(A), Pr(C)), e.g. root (0.28, 0.72), A (0.3, 0.7), C (0.25, 0.75), AC (0.1, 0.9), CC (0.4, 0.6))

Pr(AACA) = 0.28 * 0.3 * 0.7 * 0.1
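A sketch of scoring a sequence with a PST: at each position, use the next-symbol distribution at the deepest node matching a suffix of the history. The tree below mirrors the example figure; the assignment of probability pairs to nodes is inferred from the Pr(AACA) computation on the slide:

```python
# PST scoring: the context for position i is the longest suffix of the
# already-seen prefix that exists as a node in the tree (root "" always does).
def pst_prob(seq, tree, max_depth=2):
    p = 1.0
    for i, sym in enumerate(seq):
        for d in range(min(i, max_depth), -1, -1):
            ctx = seq[i-d:i]
            if ctx in tree:
                break
        p *= tree[ctx][sym]
    return p

tree = {
    "":   {"A": 0.28, "C": 0.72},
    "A":  {"A": 0.3,  "C": 0.7},
    "C":  {"A": 0.25, "C": 0.75},
    "AC": {"A": 0.1,  "C": 0.9},
    "CC": {"A": 0.4,  "C": 0.6},
}
print(pst_prob("AACA", tree))  # 0.28 * 0.3 * 0.7 * 0.1, as on the slide
```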


Hidden Markov Models

Doubly stochastic models
Efficient dynamic programming algorithms exist for:
Finding Pr(S)
The highest-probability path P that maximizes Pr(S|P) (Viterbi)
Training the model (Baum-Welch algorithm)

(Figure: a four-state HMM, S1-S4, with transition probabilities such as 0.9, 0.5, 0.5, 0.8, 0.2, 0.1, and a per-state emission distribution over {A, C}, e.g. A 0.6 / C 0.4, A 0.3 / C 0.7, A 0.5 / C 0.5, A 0.9 / C 0.1)


HMMs for profiling system calls

Training:
Initial number of states = 40 (roughly the number of distinct system calls)
Train on normal traces

Testing: need to handle variable-length and online data
For each call, find the total probability of outputting it given all calls before it
If that probability is below a threshold, call it abnormal
A trace is abnormal if the fraction of abnormal calls is high


ROC curves comparing different methods

(Figure: ROC curves, from Warrender 99)


    Outline

    Traditional mining on sequences

    Primitives for handling sequence data

    Sequence-specific mining operations

    Partial sequence classification (Tagging)

    Segmenting a sequence

    Predicting next symbol of a sequence

    Applications in biological sequence mining


Partial sequence classification (tagging)

The tagging problem. Given:
A set of tags L
Training examples of sequences showing the breakup of the sequence into the set of tags
Learn to break up a sequence into tags (classification of parts of sequences)

Examples:
Text segmentation: break a sequence of words forming an address string into subparts like Road, City name, etc.
Continuous speech recognition: identify words in continuous speech


Text sequence segmentation

Example: addresses, bib records

4089 Whispering Pines Nobel Drive San Diego CA 92122
(labels: House number | Building | Road | City | State | Zip)

P.P. Wangikar, T.P. Graycar, D.A. Estell, D.S. Clark, J.S. Dordick (1993) Protein and Solvent Engineering of Subtilisin BPN' in Nearly Anhydrous Organic Media. J. Amer. Chem. Soc. 115, 12231-12237.
(labels: Author | Year | Title | Journal | Volume | Page)


Approaches used for tagging

Learn to identify start and end boundaries of each label:
Method: for each label, build two classifiers, one for each of its two boundaries
Any conventional classifier would do; rule-based classifiers are most common

k-windows approach:
For each label, train a classifier on k-windows
During testing, classify each k-window

Adapt state-based generative models like HMMs


State-based models for sequence tagging

Two approaches:
Separate model per tag, with special prefix and suffix states to capture the start and end of a tag
Combined model over all tags (next slide)

(Figure: two four-state models (S1-S4), one for "Road name" and one for "Building name", each flanked by prefix and suffix states)


Combined model over all tags

Example: IE (information extraction)
Mahatma Gandhi Road Near Parkland ...
[Mahatma Gandhi Road: Road name] [Near Parkland: Landmark] ...

Naïve model: one state per element
Nested model: each element is modeled by another HMM


Other approaches

Disadvantages of generative models (HMMs):
Maximizing the joint probability of sequence and labels may not maximize accuracy
Conditional independence of features is a restrictive assumption

Alternative: Conditional Random Fields
Maximize the conditional probability of labels given the sequence
Arbitrary overlapping features allowed


    Outline

    Traditional mining on sequences

    Primitives for handling sequence data

    Sequence-specific mining operations

    Partial sequence classification (Tagging)

    Segmenting a sequence

    Predicting next symbol of a sequence

    Applications in biological sequence mining


Simpler problem: segmenting a 0/1 sequence

Players A and B
A has a set of coins with different biases
A repeatedly:
picks an arbitrary coin
tosses it an arbitrary number of times
B observes the H/T sequence and guesses the transition points and the biases

(Figure: A picks, tosses, and returns coins; B sees only the outcome sequence)

0 0 1 0 1 1 0 1 0 1 1 0 1 0 0 0 1


An MDL-based approach

Given n head/tail observations:
Can assume n different coins, each with bias 0 or 1:
the data fits perfectly (with probability one), but many coins are needed
Or assume one coin: may fit the data poorly
The best segmentation is a compromise between model length and data fit

0 0 1 0 | 1 1 0 1 0 1 1 | 0 1 0 0 0 1
(segment head fractions: 1/4, 5/7, 1/3)


MDL

For each segment i:
L(Mi): cost of the model parameters (log(Pr(head))) plus the segment boundary (log(sequence length))
L(Di|Mi): cost of describing the data in segment Si given model Mi:
-log( h^H * (1-h)^T ), where H = #heads, T = #tails, h = Pr(head)

Goal: find the segmentation that leads to the smallest total cost:
Σi [ L(Mi) + L(Di|Mi) ]


How to find optimal segments

Sequence of 17 tosses: 0 0 1 0 1 1 0 1 0 1 1 0 1 0 0 0 1

Derived graph with 18 nodes and all possible edges; an edge (i, j) represents the segment from position i to position j
Edge cost = model cost + data cost
The optimal segmentation is the shortest path from the first node to the last
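A sketch of this shortest-path formulation as a dynamic program over segment end points; the exact bit budget for the model parameters is an assumption (32 bits per bias here), so treat the costs as illustrative:

```python
# MDL segmentation of a 0/1 sequence: cost[j] is the cheapest description
# of seq[:j]; each edge (i, j) is a candidate segment with
# cost = model bits (boundary + parameter) + data bits (-log likelihood).
import math

def segment_cost(seq, i, j, n):
    H = sum(seq[i:j]); T = (j - i) - H
    h = (H + 1) / (j - i + 2)           # smoothed bias estimate, avoids log(0)
    model = math.log2(n) + 32           # boundary bits + parameter bits (assumed)
    data = -(H * math.log2(h) + T * math.log2(1 - h))
    return model + data

def best_segmentation(seq):
    n = len(seq)
    cost = [0.0] + [float("inf")] * n
    back = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            c = cost[i] + segment_cost(seq, i, j, n)
            if c < cost[j]:
                cost[j], back[j] = c, i
    cuts, j = [], n                      # recover segment end points
    while j > 0:
        cuts.append(j); j = back[j]
    return sorted(cuts)

seq = [0,0,1,0,1,1,0,1,0,1,1,0,1,0,0,0,1]
print(best_segmentation(seq))
```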


Non-independent models

In the previous method each segment is assumed to be independent of the others; this does not allow model reuse of the form:
0 0 1 0 1 1 0 1 0 1 1 0 0 0 0 0 1 0
(e.g., the same coin reappearing in different segments)

The (k,h) segmentation problem:
Assume a fixed number h of models is to be used for segmenting an n-element sequence into k parts (k > h)
(k,k) segmentation is solvable by dynamic programming
The general (k,h) problem is NP-hard


Approximations for (k,h) segmentation

1. Solve (k,k) to get k segments
2. Solve (n,h) to get h models (example: 2-medians)
3. Assign each segment to the best of the h models

A second variant (replace step 2 above):
find summary statistics for each of the k segments and cluster them into h clusters

(Figure: sequence A C T G G T T T A C C C C T G T G split into segments S1-S4, which share models M1 and M2)


Results of segmenting genetic sequences

(Figure: from Gionis & Mannila 2003)


    Two sequence mining applications


(Image from the DOE Human Genome Program, http://www.ornl.gov/hgmis)


The roles of proteins

(Figure from the DOE Human Genome Program, http://www.ornl.gov/hgmis)


The roles of proteins

The amino-acid sequence of a protein determines its structure
The structure of a protein determines its function
Proteins play many key roles in cells:
structural support
storage of amino acids
transport of other substances
coordination of an organism's activities
response of cells to chemical stimuli
movement
protection against disease
selective acceleration of chemical reactions


Protein taxonomies

The SCOP and CATH databases provide hierarchical taxonomies of protein families


An alignment of globin family proteins

(Figure from www-cryst.bioc.cam.ac.uk/~max/res_globin.html)

The sequences in a family may vary in length
Some positions are more conserved than others


Profile HMMs

Profile HMMs are commonly used to model families of sequences

(Figure: a profile HMM with match states m1-m3, insert states i0-i3, and delete states d1-d3 between start and end)

Match states represent highly conserved positions
Insert states account for extra characters in some sequences
Delete states are silent; they account for characters missing from some sequences
Insert and match states have emission distributions over sequence characters (e.g., a match state emitting A 0.01, R 0.12, D 0.04, N 0.29, C 0.01, E 0.03, Q 0.02, G 0.01, ...)


Profile HMMs

To classify sequences according to family, we can train a profile HMM to model the proteins of each family of interest (e.g., α-amylase, β-amylase, β-glucanase, β-galactosidase)
Given a sequence x, use Bayes' rule to make the classification:

Pr(ci | x) = Pr(x | ci) Pr(ci) / Σj Pr(x | cj) Pr(cj)


How likely is a given sequence?

The joint probability of a sequence x1...xL and a state path π0...πN factors into transition probabilities (a) and emission probabilities (e):

Pr(x1...xL, π0...πN) = a(0,π1) * Π(i=1..L) [ e(πi)(xi) * a(πi,πi+1) ]

This is the building block for computing Pr(x | ci).


How likely is a given sequence?

(Figure: a five-state HMM with a begin state (0), emitting states 1-4 with emission distributions over {A, C, G, T} such as A 0.4, C 0.1, G 0.1, T 0.4, an end state (5), and transition probabilities 0.5, 0.5, 0.2, 0.8, 0.4, 0.6, 0.1, 0.9, 0.2, 0.8)

Pr(AAC, π) = a01 * e1(A) * a11 * e1(A) * a13 * e3(C) * a35
           = 0.5 * 0.4 * 0.2 * 0.4 * 0.8 * 0.3 * 0.6


How likely is a given sequence?

The probability over all paths is:
Pr(x1...xL) = Σπ Pr(x1...xL, π0...πN)
but the number of paths can be exponential in the length of the sequence...
The Forward algorithm enables us to compute this efficiently using dynamic programming


How likely is a given sequence: the Forward algorithm

Define fk(i) to be the probability of being in state k having observed the first i characters of x
We want to compute fN(L), the probability of being in the end state having observed all of x
fk(i) can be defined recursively; for an emitting state l:
fl(i) = el(xi) * Σk fk(i-1) * akl
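A sketch of the Forward algorithm for a small, fully connected HMM without silent states; the state names and probabilities are made up for illustration:

```python
# Forward algorithm: f_l(i) = e_l(x_i) * sum_k f_k(i-1) * a_kl,
# terminating with Pr(x) = sum_k f_k(L) * a_(k,end).
def forward(x, states, start, trans, emit, end):
    f = {k: start[k] * emit[k][x[0]] for k in states}
    for sym in x[1:]:
        f = {l: emit[l][sym] * sum(f[k] * trans[k][l] for k in states)
             for l in states}
    return sum(f[k] * end[k] for k in states)

states = ["s1", "s2"]
start = {"s1": 0.5, "s2": 0.5}                       # a_0k
trans = {"s1": {"s1": 0.6, "s2": 0.3},
         "s2": {"s1": 0.3, "s2": 0.6}}
end   = {"s1": 0.1, "s2": 0.1}                       # a_(k,end)
emit  = {"s1": {"A": 0.8, "C": 0.2},
         "s2": {"A": 0.2, "C": 0.8}}
print(forward("AAC", states, start, trans, emit, end))
```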


Training a profile HMM

The parameters in profile HMMs are typically trained using the Baum-Welch method (an EM algorithm)
Heuristic methods are used to determine the length of the model:
Initialize using an alignment of the training sequences (as in the globin example)
Iteratively make local adjustments to the length if delete or insert states are used too much for the training sequences


The Fisher kernel method for protein classification

Standard HMMs are generative models:
training involves estimating Pr(x | ci)
predictions based on Pr(ci | x) are made using Bayes' rule
Sometimes we can get more accurate predictions using discriminative methods, which try to optimize Pr(ci | x) directly
One example: the Fisher kernel method [Jaakola et al. 00]


The Fisher kernel method for protein classification

Consider learning to discriminate proteins in class c1 from proteins in other families:
1. Train an HMM for c1
2. Use the HMM to map each protein sequence x into a fixed-length vector
3. Train a support vector machine (SVM) whose kernel function K is based on the Euclidean distance between these vectors

The resulting discriminative model is given by:
D(x) = Σ(i: ci = c1) λi K(x, xi) - Σ(i: ci ≠ c1) λi K(x, xi)


Profile HMM accuracy

(Figure from Jaakola et al., ISMB 1999: classifying 2447 proteins into 33 families; curves for BLAST-based methods, profile HMMs, and profile HMMs with the Fisher kernel SVM. The x-axis represents the median fraction of negative sequences that score as high as a positive sequence for a given family's model.)


The gene finding task

Given: an uncharacterized DNA sequence
Do: locate the genes in the sequence, including the coordinates of individual exons and introns

(Image from the UCSC Genome Browser, http://genome.ucsc.edu/)


(Image from the DOE Human Genome Program, http://www.ornl.gov/hgmis)


The structure of genes

Genes consist of alternating sequences of exons and introns
Introns are spliced out before the gene is translated into protein

(Figure: intergenic region | exon | intron | exon | intron | exon | intergenic region, with sequence fragments such as ATG GAA ACC CGA TCG GGC AC ... G TAA AGT CTA)

Exons consist of three-letter words, called codons
Each codon encodes a single amino acid (a character in a protein sequence)


The GENSCAN HMM for gene finding [Burge & Karlin 97]

(Figure: the GENSCAN state diagram)
Each shape represents a functional unit of a gene or genomic region
Pairs of intron/exon units represent the different ways an intron can interrupt a coding sequence (after the 1st base in a codon, after the 2nd base, or after the 3rd base)
A complementary submodel (not shown) detects genes on the opposite DNA strand


The GENSCAN HMM

For each sequence type, GENSCAN models:
the length distribution
the sequence composition

Length distribution models vary depending on sequence type:
nonparametric (using histograms)
parametric (using geometric distributions)
fixed-length

Sequence composition models vary depending on type:
5th-order inhomogeneous
5th-order homogeneous
independent and 1st-order inhomogeneous
tree-structured variable memory


Representing exons in GENSCAN

For exons, GENSCAN uses:
histograms to represent exon lengths
5th-order, inhomogeneous Markov models to represent exon sequences
5th-order, inhomogeneous models can represent statistics about pairs of neighboring codons


A 5th-order Markov model for DNA

(Figure: each state is a 5-mer such as GCTAC; transitions to CTACA, CTACC, CTACG, CTACT carry the probabilities Pr(A | GCTAC), etc., and the start state carries Pr(GCTAC))

Pr(GCTACA) = Pr(GCTAC) * Pr(A | GCTAC)


Markov models for exons

For each word we evaluate, we'll want to consider its position with respect to the assumed codon framing
Thus we'll want to use an inhomogeneous model to represent genes

Example: G C T A C G G A G C T T C G G A G C
GCTACG: G is in the 3rd codon position
CTACGG: G is in the 1st position
TACGGA: A is in the 2nd position


A 5th-order inhomogeneous model

(Figure: three copies of the 5-mer state set, one per codon position; states in position 1 transition to states in position 2, position 2 to position 3, and transitions from position 3 go back to states in position 1)

A sketch of scoring with such a position-cycling model follows.
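This sketch uses order 1 instead of order 5 to keep the example tiny, and all probabilities are made up; it is the position-cycling idea only, not GENSCAN's actual tables:

```python
# Inhomogeneous Markov model: one conditional distribution per codon
# position, cycling 0 -> 1 -> 2 -> 0 along the sequence.
import math

def score_exon(seq, models, eps=1e-9):
    """models[p] maps previous symbol -> {next symbol: Pr}, for frame offset p."""
    lp = 0.0
    for i in range(1, len(seq)):
        pos = i % 3  # frame offset of the predicted symbol (assumed framing)
        lp += math.log(models[pos].get(seq[i-1], {}).get(seq[i], eps))
    return lp

# illustrative, made-up conditional probabilities for previous symbol "A"
models = [
    {"A": {"A": 0.4,  "C": 0.2,  "G": 0.2,  "T": 0.2}},   # offset 0
    {"A": {"A": 0.1,  "C": 0.3,  "G": 0.3,  "T": 0.3}},   # offset 1
    {"A": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}},  # offset 2
]
print(score_exon("AAAA", models))
```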

    Inference with the gene finding HMM


Inference with the gene-finding HMM

Given: an uncharacterized DNA sequence
Do: find the most probable path through the model for the sequence
This path will specify the coordinates of the predicted genes (including intron and exon boundaries)
The Viterbi algorithm is used to compute this path


Finding the most probable path: the Viterbi algorithm

Define vk(i) to be the probability of the most probable path accounting for the first i characters of x and ending in state k
We want to compute vN(L), the probability of the most probable path accounting for all of the sequence and ending in the end state
vN(L) can be defined recursively, and dynamic programming finds it efficiently


Finding the most probable path: the Viterbi algorithm

Initialization:
v0(0) = 1
vk(0) = 0 for all states k that are not silent


The Viterbi algorithm

Recursion for emitting states (i = 1...L):
vl(i) = el(xi) * maxk [ vk(i-1) * akl ]
ptrl(i) = argmaxk [ vk(i-1) * akl ]

Recursion for silent states:
vl(i) = maxk [ vk(i) * akl ]
ptrl(i) = argmaxk [ vk(i) * akl ]

(the ptr values keep track of the most probable path)


The Viterbi algorithm

Termination:
Pr(x, π) = maxk ( vk(L) * akN )
πL = argmaxk ( vk(L) * akN )

To recover the most probable path, follow the pointers back starting at πL
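A sketch of this recursion for an HMM without silent states; the two-state "exon"/"intron" model is a toy stand-in for the real gene-finding HMM, with made-up probabilities:

```python
# Viterbi: v_l(i) = e_l(x_i) * max_k v_k(i-1) * a_kl, with back-pointers
# kept per position so the most probable path can be recovered.
def viterbi(x, states, start, trans, emit, end):
    v = {k: start[k] * emit[k][x[0]] for k in states}
    ptr = []
    for sym in x[1:]:
        nv, step = {}, {}
        for l in states:
            best = max(states, key=lambda k: v[k] * trans[k][l])
            step[l] = best
            nv[l] = emit[l][sym] * v[best] * trans[best][l]
        ptr.append(step)
        v = nv
    last = max(states, key=lambda k: v[k] * end[k])
    path = [last]
    for step in reversed(ptr):      # follow pointers back from the last state
        path.append(step[path[-1]])
    return v[last] * end[last], path[::-1]

states = ["exon", "intron"]
start = {"exon": 0.5, "intron": 0.5}
trans = {"exon":   {"exon": 0.8, "intron": 0.1},
         "intron": {"exon": 0.1, "intron": 0.8}}
end   = {"exon": 0.1, "intron": 0.1}
emit  = {"exon":   {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
         "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}
print(viterbi("ACGT", states, start, trans, emit, end))
```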


Parsing a DNA sequence

CGTTACGTGTCATTCTACGTGATCATCGGATCCTAGAATCATCGATCCGTGCGATCGATCGGATTAGCTAGCTTAGCTAGGAGAGCATCGATCGGATCGAGGAGGAGCCTATATAAATCAA

The Viterbi path represents a parse of a given sequence, predicting exons, introns, etc.


Some lessons from these biological applications

HMMs provide state-of-the-art performance in protein classification and gene finding
HMMs can naturally classify and parse sequences of variable length
Much domain knowledge can be incorporated into the structure of the models:
types, lengths, and ordering of sequence features
appropriate amount of memory to represent various sequence features
Models can vary their representation across sequence features
Discriminative methods often provide superior predictive accuracy to generative methods


References

S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25:3389-3402, 1997.
A. Apostolico and G. Bejerano. Optimal amnesic probabilistic automata, or how to learn and classify proteins in linear time and space. In Proceedings of RECOMB 2000. http://citeseer.nj.nec.com/apostolico00optimal.html
V. R. Borkar, K. Deshmukh, and S. Sarawagi. Automatic text segmentation for extracting structured records. SIGMOD 2001.
C. Burge and S. Karlin. Prediction of complete gene structures in human genomic DNA. Journal of Molecular Biology, 268:78-94, 1997.
M. E. Califf and R. J. Mooney. Relational learning of pattern-match rules for information extraction. AAAI 1999.
S. Chakrabarti, S. Sarawagi, and B. Dom. Mining surprising patterns using temporal description length. VLDB, 1998.
M. Collins. Discriminative training methods for Hidden Markov Models: theory and experiments with perceptron algorithms. EMNLP 2002.
R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1998.
E. Eskin, W. Lee, and S. J. Stolfo. Modeling system calls for intrusion detection with dynamic window sizes. Proceedings of DISCEX II, June 2001. (IDS publications: http://www.cs.columbia.edu/ids/publications/)
D. Freitag and A. McCallum. Information extraction with HMM structures learned by stochastic optimization. AAAI 2000.

A. Gionis and H. Mannila. Finding recurrent sources in sequences. ACM RECOMB 2003. http://www.cs.helsinki.fi/u/mannila/postscripts/p115-gionis.ps
M. T. Goodrich. Efficient piecewise-approximation using the uniform metric. Symposium on Computational Geometry, 1994.
D. Haussler. Convolution kernels on discrete structures. Technical report, UC Santa Cruz, 1999.
K. Karplus, C. Barrett, and R. Hughey. Hidden Markov models for detecting remote protein homologies. Bioinformatics, 14(10):846-856, 1998.
L. Lo Conte, S. Brenner, T. Hubbard, C. Chothia, and A. Murzin. SCOP database in 2002: refinements accommodate structural genomics. Nucleic Acids Research, 30:264-267, 2002.
W. Lee and S. Stolfo. Data mining approaches for intrusion detection. In Proceedings of the Seventh USENIX Security Symposium (SECURITY '98), San Antonio, TX, January 1998.
A. McCallum, D. Freitag, and F. Pereira. Maximum entropy Markov models for information extraction and segmentation. ICML 2000.
L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. In A. Weibel and K.-F. Lee (eds.), Readings in Speech Recognition. Los Altos, CA: Morgan Kaufmann, pages 267-296, 1990.
D. Ron, Y. Singer, and N. Tishby. The power of amnesia: learning probabilistic automata with variable memory length. Machine Learning, 25:117-149, 1996.
C. Warrender, S. Forrest, and B. Pearlmutter. Detecting intrusions using system calls: alternative data models. 1999 IEEE Symposium on Security and Privacy, 1999.