
Sequence Data Mining: Techniques and Applications

Sunita Sarawagi, IIT Bombay, http://www.it.iitb.ac.in/~sunita
Mark Craven, University of Wisconsin, http://www.biostat.wisc.edu/~craven


What is a sequence?

Ordered set of elements: s = a1, a2, ..., an
Each element ai could be:
Numerical
Categorical: domain Σ, a finite set of symbols, |Σ| = m
Multiple attributes
The length n of a sequence is not fixed
Order is determined by time or position, and could be regular or irregular


Real-life sequences

Classical applications:
Speech: sequence of phonemes
Language: sequence of words and delimiters
Handwriting: sequence of strokes

Newer applications:
Bioinformatics:
Genes: sequence of 4 possible nucleotides, |Σ| = 4. Example: AACTGACCTGGGCCCAATCC
Proteins: sequence of 20 possible amino acids, |Σ| = 20
Telecommunications: alarms, data packets
Retail data mining: customer behavior, sessions in an e-store (e.g., Amazon)
Intrusion detection


Intrusion detection

Intrusions could be detected at:
Network level (denial-of-service attacks, port scans, etc.) [KDD Cup 99]: sequence of TCP dumps
Host level (attacks on privileged programs like lpr, sendmail): sequence of system calls
Σ = the set of all possible system calls, |Σ| ~ 100
Example trace: open lseek lstat mmap execve ioctl ioctl close execve close unlink


Outline

Traditional mining operations on sequences:
Classification
Clustering
Finding repeated patterns
Primitives for handling sequence data
Sequence-specific mining operations:
Partial sequence classification (tagging)
Segmenting a sequence
Predicting next symbol of a sequence
Applications in biological sequence mining


Classification of whole sequences

Given: a set of classes C and a number of example sequences in each class,
train a model so that for an unseen sequence we can say to which class it belongs.

Examples:
Given a set of protein families, find the family of a new protein
Given a sequence of packets, label the session as intrusion or normal
Given several utterances of a set of words, classify a new utterance as the right word


Conventional classification methods

Conventional classification methods assume record data:
fixed number of attributes
single record per instance to be classified (no order)
Sequences: variable length, order important.

Adapting conventional classifiers to sequences:
Generative classifiers
Boundary-based classifiers
Distance-based classifiers
Kernel-based classifiers


Generative methods

For each class i, train a generative model Mi to maximize the likelihood over all training sequences in class i
Estimate Pr(ci) as the fraction of training instances in class i
For a new sequence x:
find Pr(x|ci)*Pr(ci) for each i using Mi
choose the i with the largest value of Pr(x|ci)*Pr(ci)

(Figure: sequence x scored against each class model, yielding Pr(x|c1)*Pr(c1), Pr(x|c2)*Pr(c2), Pr(x|c3)*Pr(c3))

Need a generative model for sequence data
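A minimal sketch of this recipe in Python, assuming the simplest generative model (the independent/unigram model listed later among the probabilistic models); any model exposing a log-likelihood could be dropped in, and the names here are illustrative:

```python
# Generative sequence classification: one model per class, pick the class
# maximizing Pr(x | c_i) * Pr(c_i) (computed in log space for stability).
import math
from collections import Counter

def train_independent_model(sequences):
    """Estimate Pr(symbol) from all symbols in one class's training sequences."""
    counts = Counter(sym for seq in sequences for sym in seq)
    total = sum(counts.values())
    return {sym: c / total for sym, c in counts.items()}

def log_likelihood(model, seq, eps=1e-9):
    # eps stands in for unseen symbols; a real system would smooth properly
    return sum(math.log(model.get(sym, eps)) for sym in seq)

def train(classes):
    """classes: dict mapping class name -> list of training sequences."""
    n = sum(len(seqs) for seqs in classes.values())
    return {c: (train_independent_model(seqs), len(seqs) / n)  # (M_i, Pr(c_i))
            for c, seqs in classes.items()}

def classify(models, x):
    return max(models, key=lambda c: log_likelihood(models[c][0], x)
                                     + math.log(models[c][1]))

models = train({"normal": ["AACA", "AAC"], "abnormal": ["CCGC", "CGC"]})
print(classify(models, "AACC"))  # -> "normal"
```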


Boundary-based methods

Data: points in a fixed multi-dimensional space
Output of training: boundaries that define regions within which the same class is predicted
Application: tests on boundaries to find the region
Examples: decision trees, neural networks, linear discriminants

Need to embed sequence data in a fixed coordinate space


Kernel-based classifiers

Define a function K(xi, x) that intuitively captures the similarity between two sequences and satisfies two properties:
K is symmetric, i.e., K(xi, x) = K(x, xi)
K is positive definite

Training: learn, for each class c,
wic for each training sequence xi
bc

Application: predict the class of x
For each class c, find f(x,c) = Σi wic K(xi, x) + bc
The predicted class is the c with the highest value of f(x,c)

Well-known kernel classifiers:
Nearest-neighbor classifier
Support Vector Machines
Radial basis functions

Need a kernel/similarity function
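A hedged sketch of this decision rule; the weights wic and biases bc below are illustrative placeholders (real values would come from, e.g., SVM training), and K is a toy shared-2-gram kernel, not one from the tutorial:

```python
# Kernel decision rule: f(x, c) = sum_i w_ic * K(x_i, x) + b_c,
# predicting the class with the highest score.
def kgrams(s, k=2):
    return {s[i:i+k] for i in range(len(s) - k + 1)}

def K(a, b):
    """Toy kernel: number of shared 2-grams (symmetric by construction)."""
    return len(kgrams(a) & kgrams(b))

def predict(x, train_seqs, weights, biases):
    # weights[c][i] plays the role of w_ic; biases[c] is b_c
    scores = {c: sum(w * K(xi, x) for xi, w in zip(train_seqs, weights[c]))
                 + biases[c]
              for c in biases}
    return max(scores, key=scores.get)

train_seqs = ["AACA", "CCGC"]
weights = {"normal": [1.0, -1.0], "abnormal": [-1.0, 1.0]}  # placeholder w_ic
biases = {"normal": 0.0, "abnormal": 0.0}                   # placeholder b_c
print(predict("AACC", train_seqs, weights, biases))         # -> "normal"
```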


Sequence clustering

Given a set of sequences, create groups such that similar sequences are in the same group

Three kinds of clustering algorithms:
Distance-based: k-means, various hierarchical algorithms (need a similarity function)
Model-based: Expectation Maximization algorithm (need generative models)
Density-based: CLIQUE (need a dimensional embedding)


Outline

Traditional mining on sequences
Primitives for handling sequence data:
Embed sequence in a fixed-dimensional space
All conventional record mining techniques will apply
Generative models for sequences:
Sequence classification: generative methods
Clustering sequences: model-based approach
Distance between two sequences:
Sequence classification: SVM and NN
Clustering sequences: distance-based approach
Sequence-specific mining operations
Applications


Embedding sequences in a fixed-dimensional space

Ignore order; each symbol maps to a dimension:
extensively used in text classification and clustering
Extract aggregate features:
Real-valued elements: Fourier coefficients, wavelet coefficients, auto-regressive coefficients
Categorical data: number of symbol changes
Sliding-window techniques (k: window size); see the examples on the next slide and the sketch that follows them:
Define a coordinate for each possible k-gram (m^k coordinates):
each k-gram's coordinate is the number of times it appears in the sequence
(k,b) mismatch score: each k-gram's coordinate is the number of k-grams in the sequence within b mismatches of it
Define a coordinate for each of the k positions: multiple rows per sequence


Sliding window examples

Example trace: open lseek ioctl mmap execve ioctl ioctl open execve close mmap
(symbols abbreviated by first letter: o, l, i, m, e, c)

One symbol per column (symbol counts per trace):
sid   o   c   l   i   e   m
1     2   1   1   3   2   1
2     ..  ..  ..  ..  ..  ..
3     ..  ..  ..  ..  ..  ..

Sliding window, window size 3 (one row per trace):
sid   ioe  cli  oli  lie  lim  ...
1     1    0    1    0    1
2     ..   ..   ..   ..   ..
3     ..   ..   ..   ..   ..

Mismatch scores, b = 1 (one row per trace):
sid   ioe  cli  oli  lie  lim  ...
1     2    1    1    0    1
2     ..   ..   ..   ..   ..
3     ..   ..   ..   ..   ..

Multiple rows per trace (one row per window position):
sid   A1  A2  A3
1     o   l   i
1     l   i   m
1     i   m   e
1     ..  ..  ..
1     e   c   m
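A small sketch of the k-gram embedding illustrated above; the helper name is an assumption, and the resulting count vectors can feed any record-oriented classifier or clustering method:

```python
# Map each trace to a vector of k-gram counts: one coordinate per
# observed k-gram, value = number of times that window appears.
from collections import Counter

def kgram_counts(seq, k=3):
    """Count every length-k window; seq may be a string or a list of symbols."""
    return Counter(tuple(seq[i:i+k]) for i in range(len(seq) - k + 1))

trace = ["open", "lseek", "ioctl", "mmap", "execve",
         "ioctl", "ioctl", "open", "execve", "close", "mmap"]
vec = kgram_counts(trace, k=3)
print(vec[("ioctl", "ioctl", "open")])  # -> 1
```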


Detecting attacks on privileged programs

Short sequences of system calls made during normal executions of a program are very consistent, yet different from the sequences of its abnormal executions

Two approaches:
STIDE (Warrender 1999): create a dictionary of the unique k-windows in normal traces, count what fraction of a new trace's windows occur in it, and threshold (a sketch follows)
RIPPER-based (Lee 1998): next...
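A minimal sketch of the STIDE idea just described; the function names and the default threshold are assumptions, not the published implementation:

```python
# STIDE-style anomaly detection: dictionary of k-windows seen in normal
# traces, then flag a trace whose fraction of unseen windows is too high.
def windows(trace, k):
    return {tuple(trace[i:i+k]) for i in range(len(trace) - k + 1)}

def train_stide(normal_traces, k=6):
    normal = set()
    for t in normal_traces:
        normal |= windows(t, k)
    return normal

def is_abnormal(trace, normal, k=6, threshold=0.1):
    wins = [tuple(trace[i:i+k]) for i in range(len(trace) - k + 1)]
    mismatches = sum(1 for w in wins if w not in normal)
    return mismatches / max(len(wins), 1) > threshold

normal = train_stide([["open", "lseek", "lstat", "mmap", "execve", "ioctl",
                       "ioctl", "close", "execve", "close", "unlink"]])
print(is_abnormal(["open", "lseek", "lstat", "mmap", "execve", "ioctl",
                   "ioctl", "close", "execve", "close", "unlink"], normal))
```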


Classification models on k-gram trace data

When both normal and abnormal data are available, class label = normal/abnormal:

7-grams                                      class
vtimes open seek read read read seek         normal
lstat lstat lstat bind open close vtimes     abnormal

When only normal traces are available, class label = k-th system call:

6 attributes                                 class
vtimes open seek read read read              seek
lstat lstat lstat bind open close            vtimes

Learn rules to predict the class label [RIPPER]


Examples of output RIPPER rules

Both traces:
if the 2nd system call is vtimes and the 7th is vtrace, then the sequence is normal
if the 6th system call is lseek and the 7th is sigvec, then the sequence is normal
if none of the above, then the sequence is abnormal

Only normal:
if the 3rd system call is lstat and the 4th is write, then the 7th is stat
if the 1st system call is sigblock and the 4th is bind, then the 7th is setsockopt
if none of the above, then the 7th is open


Experimental results on sendmail traces

Percent of mismatching traces:

trace              Only-normal   BOTH
sscp-1             13.5          32.2
sscp-2             13.6          30.4
sscp-3             13.6          30.4
syslog-remote-1    11.5          21.2
syslog-remote-2     8.4          15.6
syslog-local-1      6.1          11.1
syslog-local-2      8.0          15.9
decode-1            3.9           2.1
decode-2            4.2           2.0
sm565a              8.1           8.0
sm5x                8.2           6.5
sendmail            0.6           0.1

The output rule sets contain ~250 rules, each with 2 or 3 attribute tests
Score each trace by counting the fraction of mismatches and thresholding
Summary: only normal traces suffice to detect intrusions


More realistic experiments [from Warrender 99]

Different programs need different thresholds
Simple methods (e.g., STIDE) work as well
Results are sensitive to window size
Is it possible to do better with sequence-specific methods?

                 STIDE                    RIPPER
                 threshold  %false-pos    threshold  %false-pos
Site-1 lpr       12         0.0           3          0.0016
Site-2 lpr       12         0.0013        4          0.0265
named            20         0.0019        10         0.0
xlock            20         0.00008       10         0.0


Outline

Traditional mining on sequences
Primitives for handling sequence data:
Embed sequence in a fixed-dimensional space
All conventional record mining techniques will apply
Distance between two sequences:
Sequence classification: SVM and NN
Clustering sequences: distance-based approach
Generative models for sequences:
Sequence classification: whole and partial
Clustering sequences: model-based approach
Sequence-specific mining operations
Applications in biological sequence mining


Probabilistic models for sequences

Independent model
One-level dependence (Markov chains)
Fixed memory (order-l Markov chains)
Variable memory models
More complicated models: Hidden Markov Models


First-order Markov chains

Model structure:
A state for each symbol in Σ
Edges between states, labeled with transition probabilities

(Figure: two states A and C, with start probabilities 0.5/0.5 and transitions Pr(A|A) = 0.1, Pr(C|A) = 0.9, Pr(A|C) = 0.4, Pr(C|C) = 0.6)

Probability of a sequence s being generated from the model. Example:
Pr(AACA) = Pr(A) Pr(A|A) Pr(C|A) Pr(A|C) = 0.5 * 0.1 * 0.9 * 0.4

Training the transition probability between states:
Pr(b|a) = Count(ab) / Count(a)
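A sketch of training and scoring such a first-order chain; smoothing is omitted to keep it short, with a small eps standing in for unseen transitions:

```python
# First-order Markov chain: Pr(b | a) = Count(ab) / Count(a), with start
# probabilities estimated from the first symbol of each training sequence.
import math
from collections import Counter, defaultdict

def train_markov(sequences):
    start, trans = Counter(), defaultdict(Counter)
    for seq in sequences:
        start[seq[0]] += 1
        for a, b in zip(seq, seq[1:]):
            trans[a][b] += 1
    start_p = {s: c / sum(start.values()) for s, c in start.items()}
    trans_p = {a: {b: c / sum(nxt.values()) for b, c in nxt.items()}
               for a, nxt in trans.items()}
    return start_p, trans_p

def log_prob(seq, start_p, trans_p, eps=1e-9):
    lp = math.log(start_p.get(seq[0], eps))
    for a, b in zip(seq, seq[1:]):
        lp += math.log(trans_p.get(a, {}).get(b, eps))
    return lp

start_p, trans_p = train_markov(["AACA", "ACCA", "CACA"])
print(math.exp(log_prob("AACA", start_p, trans_p)))
```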


Higher-order Markov chains

l = memory of the sequence
Model:
A state for each possible suffix of length l: |Σ|^l states
Edges between states, labeled with transition probabilities

(Figure, l = 2: states AA, AC, CA, CC, with start probability Pr(AA) = 0.5 and transitions such as Pr(C|AA) = 0.9 and Pr(A|AC) = 0.7)

Pr(AACA) = Pr(AA) Pr(C|AA) Pr(A|AC) = 0.5 * 0.9 * 0.7

Training the model:
Pr(b|s) = count(sb) / count(s)
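The same idea generalized to order l, one state per length-l suffix; the probability of the first l symbols is passed in as an assumption (the slide's example uses Pr(AA) = 0.5):

```python
# Order-l Markov chain: Pr(b | s) = count(sb) / count(s), where s is the
# length-l suffix preceding b. Not tuned for large alphabets, where the
# |Sigma|^l state table grows quickly.
import math
from collections import Counter, defaultdict

def train_order_l(sequences, l=2):
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(l, len(seq)):
            counts[seq[i-l:i]][seq[i]] += 1
    return {s: {b: c / sum(nxt.values()) for b, c in nxt.items()}
            for s, nxt in counts.items()}

def log_prob(seq, model, start_lp=math.log(0.5), l=2, eps=1e-9):
    lp = start_lp  # stands in for Pr(first l symbols), e.g. Pr(AA) = 0.5
    for i in range(l, len(seq)):
        lp += math.log(model.get(seq[i-l:i], {}).get(seq[i], eps))
    return lp

model = train_order_l(["AACAACA", "ACAACC"], l=2)
print(math.exp(log_prob("AACA", model)))
```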


Variable memory models

Probabilistic Suffix Automata (PSA)
Model states: substrings of size no greater than l, where no string is a suffix of another

(Figure: a PSA with states such as A, AC, and CC, with transitions including Pr(A|A) = 0.3, Pr(C|A) = 0.7, and Pr(A|AC) = 0.1)

Calculating Pr(AACA):
Pr(AACA) = Pr(A) Pr(A|A) Pr(C|A) Pr(A|AC) = 0.5 * 0.3 * 0.7 * 0.1

Training: not straightforward
Eased by Prediction Suffix Trees; PSTs can be converted to PSAs after training


Prediction Suffix Trees (PSTs)

Suffix trees with emission probabilities of observations attached to each tree node
Linear-time algorithms exist for constructing such PSTs from training data [Apostolico 2000]

(Figure: a PST over {A, C} with root e and nodes A, C, AC, CC; each node carries a next-symbol distribution (Pr(A), Pr(C)), e.g. root (0.28, 0.72), A (0.3, 0.7), C (0.25, 0.75), AC (0.1, 0.9), CC (0.4, 0.6))

Pr(AACA) = 0.28 * 0.3 * 0.7 * 0.1
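A sketch of scoring a sequence with a PST: at each position, use the next-symbol distribution at the deepest node matching a suffix of the history. The tree below mirrors the example figure; the assignment of probability pairs to nodes is inferred from the Pr(AACA) computation on the slide:

```python
# PST scoring: the context for position i is the longest suffix of the
# already-seen prefix that exists as a node in the tree (root "" always does).
def pst_prob(seq, tree, max_depth=2):
    p = 1.0
    for i, sym in enumerate(seq):
        for d in range(min(i, max_depth), -1, -1):
            ctx = seq[i-d:i]
            if ctx in tree:
                break
        p *= tree[ctx][sym]
    return p

tree = {
    "":   {"A": 0.28, "C": 0.72},
    "A":  {"A": 0.3,  "C": 0.7},
    "C":  {"A": 0.25, "C": 0.75},
    "AC": {"A": 0.1,  "C": 0.9},
    "CC": {"A": 0.4,  "C": 0.6},
}
print(pst_prob("AACA", tree))  # 0.28 * 0.3 * 0.7 * 0.1, as on the slide
```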


Hidden Markov Models

Doubly stochastic models
Efficient dynamic programming algorithms exist for:
Finding Pr(S)
The highest-probability path P that maximizes Pr(S|P) (Viterbi)
Training the model (Baum-Welch algorithm)

(Figure: a four-state HMM, S1-S4, with transition probabilities such as 0.9, 0.5, 0.5, 0.8, 0.2, 0.1, and a per-state emission distribution over {A, C}, e.g. A 0.6 / C 0.4, A 0.3 / C 0.7, A 0.5 / C 0.5, A 0.9 / C 0.1)


HMMs for profiling system calls

Training:
Initial number of states = 40 (roughly the number of distinct system calls)
Train on normal traces

Testing: need to handle variable-length and online data
For each call, find the total probability of outputting it given all calls before it
If that probability is below a threshold, call it abnormal
A trace is abnormal if the fraction of abnormal calls is high


ROC curves comparing different methods

(Figure: ROC curves, from Warrender 99)


    Outline

    Traditional mining on sequences

    Primitives for handling sequence data

    Sequence-specific mining operations

    Partial sequence classification (Tagging)

    Segmenting a sequence

    Predicting next symbol of a sequence

    Applications in biological sequence mining


Partial sequence classification (tagging)

The tagging problem. Given:
A set of tags L
Training examples of sequences showing the breakup of the sequence into the set of tags
Learn to break up a sequence into tags (classification of parts of sequences)

Examples:
Text segmentation: break a sequence of words forming an address string into subparts like Road, City name, etc.
Continuous speech recognition: identify words in continuous speech


Text sequence segmentation

Example: addresses, bib records

4089 Whispering Pines Nobel Drive San Diego CA 92122
(labels: House number | Building | Road | City | State | Zip)

P.P. Wangikar, T.P. Graycar, D.A. Estell, D.S. Clark, J.S. Dordick (1993) Protein and Solvent Engineering of Subtilisin BPN' in Nearly Anhydrous Organic Media. J. Amer. Chem. Soc. 115, 12231-12237.
(labels: Author | Year | Title | Journal | Volume | Page)


Approaches used for tagging

Learn to identify start and end boundaries of each label:
Method: for each label, build two classifiers, one for each of its two boundaries
Any conventional classifier would do; rule-based classifiers are most common

k-windows approach:
For each label, train a classifier on k-windows
During testing, classify each k-window

Adapt state-based generative models like HMMs


State-based models for sequence tagging

Two approaches:
Separate model per tag, with special prefix and suffix states to capture the start and end of a tag
Combined model over all tags (next slide)

(Figure: two four-state models (S1-S4), one for "Road name" and one for "Building name", each flanked by prefix and suffix states)


Combined model over all tags

Example: IE (information extraction)
Mahatma Gandhi Road Near Parkland ...
[Mahatma Gandhi Road: Road name] [Near Parkland: Landmark] ...

Naïve model: one state per element
Nested model: each element is modeled by another HMM


Other approaches

Disadvantages of generative models (HMMs):
Maximizing the joint probability of sequence and labels may not maximize accuracy
Conditional independence of features is a restrictive assumption

Alternative: Conditional Random Fields
Maximize the conditional probability of labels given the sequence
Arbitrary overlapping features allowed


    Outline

    Traditional mining on sequences

    Primitives for handling sequence data

    Sequence-specific mining operations

    Partial sequence classification (Tagging)

    Segmenting a sequence

    Predicting next symbol of a sequence

    Applications in biological sequence mining


Simpler problem: segmenting a 0/1 sequence

Players A and B
A has a set of coins with different biases
A repeatedly:
picks an arbitrary coin
tosses it an arbitrary number of times
B observes the H/T sequence and guesses the transition points and the biases

(Figure: A picks, tosses, and returns coins; B sees only the outcome sequence)

0 0 1 0 1 1 0 1 0 1 1 0 1 0 0 0 1


An MDL-based approach

Given n head/tail observations:
Can assume n different coins, each with bias 0 or 1:
the data fits perfectly (with probability one), but many coins are needed
Or assume one coin: may fit the data poorly
The best segmentation is a compromise between model length and data fit

0 0 1 0 | 1 1 0 1 0 1 1 | 0 1 0 0 0 1
(segment head fractions: 1/4, 5/7, 1/3)


MDL

For each segment i:
L(Mi): cost of the model parameters (log(Pr(head))) plus the segment boundary (log(sequence length))
L(Di|Mi): cost of describing the data in segment Si given model Mi:
-log( h^H * (1-h)^T ), where H = #heads, T = #tails, h = Pr(head)

Goal: find the segmentation that leads to the smallest total cost:
Σi [ L(Mi) + L(Di|Mi) ]


How to find optimal segments

Sequence of 17 tosses: 0 0 1 0 1 1 0 1 0 1 1 0 1 0 0 0 1

Derived graph with 18 nodes and all possible edges; an edge (i, j) represents the segment from position i to position j
Edge cost = model cost + data cost
The optimal segmentation is the shortest path from the first node to the last
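A sketch of this shortest-path formulation as a dynamic program over segment end points; the exact bit budget for the model parameters is an assumption (32 bits per bias here), so treat the costs as illustrative:

```python
# MDL segmentation of a 0/1 sequence: cost[j] is the cheapest description
# of seq[:j]; each edge (i, j) is a candidate segment with
# cost = model bits (boundary + parameter) + data bits (-log likelihood).
import math

def segment_cost(seq, i, j, n):
    H = sum(seq[i:j]); T = (j - i) - H
    h = (H + 1) / (j - i + 2)           # smoothed bias estimate, avoids log(0)
    model = math.log2(n) + 32           # boundary bits + parameter bits (assumed)
    data = -(H * math.log2(h) + T * math.log2(1 - h))
    return model + data

def best_segmentation(seq):
    n = len(seq)
    cost = [0.0] + [float("inf")] * n
    back = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            c = cost[i] + segment_cost(seq, i, j, n)
            if c < cost[j]:
                cost[j], back[j] = c, i
    cuts, j = [], n                      # recover segment end points
    while j > 0:
        cuts.append(j); j = back[j]
    return sorted(cuts)

seq = [0,0,1,0,1,1,0,1,0,1,1,0,1,0,0,0,1]
print(best_segmentation(seq))
```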


Non-independent models

In the previous method each segment is assumed to be independent of the others; this does not allow model reuse of the form:
0 0 1 0 1 1 0 1 0 1 1 0 0 0 0 0 1 0
(e.g., the same coin reappearing in different segments)

The (k,h) segmentation problem:
Assume a fixed number h of models is to be used for segmenting an n-element sequence into k parts (k > h)
(k,k) segmentation is solvable by dynamic programming
The general (k,h) problem is NP-hard


Approximations for (k,h) segmentation

1. Solve (k,k) to get k segments
2. Solve (n,h) to get h models (example: 2-medians)
3. Assign each segment to the best of the h models

A second variant (replace step 2 above):
find summary statistics for each of the k segments and cluster them into h clusters

(Figure: sequence A C T G G T T T A C C C C T G T G split into segments S1-S4, which share models M1 and M2)


Results of segmenting genetic sequences

(Figure: from Gionis & Mannila 2003)


    Two sequence mining applications


(Image from the DOE Human Genome Program, http://www.ornl.gov/hgmis)


The roles of proteins

(Figure from the DOE Human Genome Program, http://www.ornl.gov/hgmis)


The roles of proteins

The amino-acid sequence of a protein determines its structure
The structure of a protein determines its function
Proteins play many key roles in cells:
structural support
storage of amino acids
transport of other substances
coordination of an organism's activities
response of cells to chemical stimuli
movement
protection against disease
selective acceleration of chemical reactions


Protein taxonomies

The SCOP and CATH databases provide hierarchical taxonomies of protein families


An alignment of globin family proteins

(Figure from www-cryst.bioc.cam.ac.uk/~max/res_globin.html)

The sequences in a family may vary in length
Some positions are more conserved than others


Profile HMMs

Profile HMMs are commonly used to model families of sequences

(Figure: a profile HMM with match states m1-m3, insert states i0-i3, and delete states d1-d3 between start and end)

Match states represent highly conserved positions
Insert states account for extra characters in some sequences
Delete states are silent; they account for characters missing from some sequences
Insert and match states have emission distributions over sequence characters (e.g., a match state emitting A 0.01, R 0.12, D 0.04, N 0.29, C 0.01, E 0.03, Q 0.02, G 0.01, ...)


Profile HMMs

To classify sequences according to family, we can train a profile HMM to model the proteins of each family of interest (e.g., α-amylase, β-amylase, β-glucanase, β-galactosidase)
Given a sequence x, use Bayes' rule to make the classification:

Pr(ci | x) = Pr(x | ci) Pr(ci) / Σj Pr(x | cj) Pr(cj)


How likely is a given sequence?

The joint probability of a sequence x1...xL and a state path π0...πN factors into transition probabilities (a) and emission probabilities (e):

Pr(x1...xL, π0...πN) = a(0,π1) * Π(i=1..L) [ e(πi)(xi) * a(πi,πi+1) ]

This is the building block for computing Pr(x | ci).


How likely is a given sequence?

(Figure: a five-state HMM with a begin state (0), emitting states 1-4 with emission distributions over {A, C, G, T} such as A 0.4, C 0.1, G 0.1, T 0.4, an end state (5), and transition probabilities 0.5, 0.5, 0.2, 0.8, 0.4, 0.6, 0.1, 0.9, 0.2, 0.8)

Pr(AAC, π) = a01 * e1(A) * a11 * e1(A) * a13 * e3(C) * a35
           = 0.5 * 0.4 * 0.2 * 0.4 * 0.8 * 0.3 * 0.6


How likely is a given sequence?

The probability over all paths is:
Pr(x1...xL) = Σπ Pr(x1...xL, π0...πN)
but the number of paths can be exponential in the length of the sequence...
The Forward algorithm enables us to compute this efficiently using dynamic programming


How likely is a given sequence: the Forward algorithm

Define fk(i) to be the probability of being in state k having observed the first i characters of x
We want to compute fN(L), the probability of being in the end state having observed all of x
fk(i) can be defined recursively; for an emitting state l:
fl(i) = el(xi) * Σk fk(i-1) * akl
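A sketch of the Forward algorithm for a small, fully connected HMM without silent states; the state names and probabilities are made up for illustration:

```python
# Forward algorithm: f_l(i) = e_l(x_i) * sum_k f_k(i-1) * a_kl,
# terminating with Pr(x) = sum_k f_k(L) * a_(k,end).
def forward(x, states, start, trans, emit, end):
    f = {k: start[k] * emit[k][x[0]] for k in states}
    for sym in x[1:]:
        f = {l: emit[l][sym] * sum(f[k] * trans[k][l] for k in states)
             for l in states}
    return sum(f[k] * end[k] for k in states)

states = ["s1", "s2"]
start = {"s1": 0.5, "s2": 0.5}                       # a_0k
trans = {"s1": {"s1": 0.6, "s2": 0.3},
         "s2": {"s1": 0.3, "s2": 0.6}}
end   = {"s1": 0.1, "s2": 0.1}                       # a_(k,end)
emit  = {"s1": {"A": 0.8, "C": 0.2},
         "s2": {"A": 0.2, "C": 0.8}}
print(forward("AAC", states, start, trans, emit, end))
```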


Training a profile HMM

The parameters in profile HMMs are typically trained using the Baum-Welch method (an EM algorithm)
Heuristic methods are used to determine the length of the model:
Initialize using an alignment of the training sequences (as in the globin example)
Iteratively make local adjustments to the length if delete or insert states are used too much for the training sequences


The Fisher kernel method for protein classification

Standard HMMs are generative models:
training involves estimating Pr(x | ci)
predictions based on Pr(ci | x) are made using Bayes' rule
Sometimes we can get more accurate predictions using discriminative methods, which try to optimize Pr(ci | x) directly
One example: the Fisher kernel method [Jaakola et al. 00]


The Fisher kernel method for protein classification

Consider learning to discriminate proteins in class c1 from proteins in other families:
1. Train an HMM for c1
2. Use the HMM to map each protein sequence x into a fixed-length vector
3. Train a support vector machine (SVM) whose kernel function K is based on the Euclidean distance between these vectors

The resulting discriminative model is given by:
D(x) = Σ(i: ci = c1) λi K(x, xi) - Σ(i: ci ≠ c1) λi K(x, xi)


Profile HMM accuracy

(Figure from Jaakola et al., ISMB 1999: classifying 2447 proteins into 33 families; curves for BLAST-based methods, profile HMMs, and profile HMMs with the Fisher kernel SVM. The x-axis represents the median fraction of negative sequences that score as high as a positive sequence for a given family's model.)


The gene finding task

Given: an uncharacterized DNA sequence
Do: locate the genes in the sequence, including the coordinates of individual exons and introns

(Image from the UCSC Genome Browser, http://genome.ucsc.edu/)


(Image from the DOE Human Genome Program, http://www.ornl.gov/hgmis)


The structure of genes

Genes consist of alternating sequences of exons and introns
Introns are spliced out before the gene is translated into protein

(Figure: intergenic region | exon | intron | exon | intron | exon | intergenic region, with sequence fragments such as ATG GAA ACC CGA TCG GGC AC ... G TAA AGT CTA)

Exons consist of three-letter words, called codons
Each codon encodes a single amino acid (a character in a protein sequence)


The GENSCAN HMM for gene finding [Burge & Karlin 97]

(Figure: the GENSCAN state diagram)
Each shape represents a functional unit of a gene or genomic region
Pairs of intron/exon units represent the different ways an intron can interrupt a coding sequence (after the 1st base in a codon, after the 2nd base, or after the 3rd base)
A complementary submodel (not shown) detects genes on the opposite DNA strand


The GENSCAN HMM

For each sequence type, GENSCAN models:
the length distribution
the sequence composition

Length distribution models vary depending on sequence type:
nonparametric (using histograms)
parametric (using geometric distributions)
fixed-length

Sequence composition models vary depending on type:
5th-order inhomogeneous
5th-order homogeneous
independent and 1st-order inhomogeneous
tree-structured variable memory


Representing exons in GENSCAN

For exons, GENSCAN uses:
histograms to represent exon lengths
5th-order, inhomogeneous Markov models to represent exon sequences
5th-order, inhomogeneous models can represent statistics about pairs of neighboring codons


A 5th-order Markov model for DNA

(Figure: each state is a 5-mer such as GCTAC; transitions to CTACA, CTACC, CTACG, CTACT carry the probabilities Pr(A | GCTAC), etc., and the start state carries Pr(GCTAC))

Pr(GCTACA) = Pr(GCTAC) * Pr(A | GCTAC)


Markov models for exons

For each word we evaluate, we'll want to consider its position with respect to the assumed codon framing
Thus we'll want to use an inhomogeneous model to represent genes

Example: G C T A C G G A G C T T C G G A G C
GCTACG: G is in the 3rd codon position
CTACGG: G is in the 1st position
TACGGA: A is in the 2nd position


A 5th-order inhomogeneous model

(Figure: three copies of the 5-mer state set, one per codon position; states in position 1 transition to states in position 2, position 2 to position 3, and transitions from position 3 go back to states in position 1)

A sketch of scoring with such a position-cycling model follows.
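This sketch uses order 1 instead of order 5 to keep the example tiny, and all probabilities are made up; it is the position-cycling idea only, not GENSCAN's actual tables:

```python
# Inhomogeneous Markov model: one conditional distribution per codon
# position, cycling 0 -> 1 -> 2 -> 0 along the sequence.
import math

def score_exon(seq, models, eps=1e-9):
    """models[p] maps previous symbol -> {next symbol: Pr}, for frame offset p."""
    lp = 0.0
    for i in range(1, len(seq)):
        pos = i % 3  # frame offset of the predicted symbol (assumed framing)
        lp += math.log(models[pos].get(seq[i-1], {}).get(seq[i], eps))
    return lp

# illustrative, made-up conditional probabilities for previous symbol "A"
models = [
    {"A": {"A": 0.4,  "C": 0.2,  "G": 0.2,  "T": 0.2}},   # offset 0
    {"A": {"A": 0.1,  "C": 0.3,  "G": 0.3,  "T": 0.3}},   # offset 1
    {"A": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}},  # offset 2
]
print(score_exon("AAAA", models))
```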

    Inference with the gene finding HMM


Inference with the gene-finding HMM

Given: an uncharacterized DNA sequence
Do: find the most probable path through the model for the sequence
This path will specify the coordinates of the predicted genes (including intron and exon boundaries)
The Viterbi algorithm is used to compute this path


Finding the most probable path: the Viterbi algorithm

Define vk(i) to be the probability of the most probable path accounting for the first i characters of x and ending in state k
We want to compute vN(L), the probability of the most probable path accounting for all of the sequence and ending in the end state
vN(L) can be defined recursively, and dynamic programming finds it efficiently


Finding the most probable path: the Viterbi algorithm

Initialization:
v0(0) = 1
vk(0) = 0 for all states k that are not silent


The Viterbi algorithm

Recursion for emitting states (i = 1...L):
vl(i) = el(xi) * maxk [ vk(i-1) * akl ]
ptrl(i) = argmaxk [ vk(i-1) * akl ]

Recursion for silent states:
vl(i) = maxk [ vk(i) * akl ]
ptrl(i) = argmaxk [ vk(i) * akl ]

(the ptr values keep track of the most probable path)


The Viterbi algorithm

Termination:
Pr(x, π) = maxk ( vk(L) * akN )
πL = argmaxk ( vk(L) * akN )

To recover the most probable path, follow the pointers back starting at πL
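A sketch of this recursion for an HMM without silent states; the two-state "exon"/"intron" model is a toy stand-in for the real gene-finding HMM, with made-up probabilities:

```python
# Viterbi: v_l(i) = e_l(x_i) * max_k v_k(i-1) * a_kl, with back-pointers
# kept per position so the most probable path can be recovered.
def viterbi(x, states, start, trans, emit, end):
    v = {k: start[k] * emit[k][x[0]] for k in states}
    ptr = []
    for sym in x[1:]:
        nv, step = {}, {}
        for l in states:
            best = max(states, key=lambda k: v[k] * trans[k][l])
            step[l] = best
            nv[l] = emit[l][sym] * v[best] * trans[best][l]
        ptr.append(step)
        v = nv
    last = max(states, key=lambda k: v[k] * end[k])
    path = [last]
    for step in reversed(ptr):      # follow pointers back from the last state
        path.append(step[path[-1]])
    return v[last] * end[last], path[::-1]

states = ["exon", "intron"]
start = {"exon": 0.5, "intron": 0.5}
trans = {"exon":   {"exon": 0.8, "intron": 0.1},
         "intron": {"exon": 0.1, "intron": 0.8}}
end   = {"exon": 0.1, "intron": 0.1}
emit  = {"exon":   {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
         "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}
print(viterbi("ACGT", states, start, trans, emit, end))
```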


Parsing a DNA sequence

CGTTACGTGTCATTCTACGTGATCATCGGATCCTAGAATCATCGATCCGTGCGATCGATCGGATTAGCTAGCTTAGCTAGGAGAGCATCGATCGGATCGAGGAGGAGCCTATATAAATCAA

The Viterbi path represents a parse of a given sequence, predicting exons, introns, etc.


Some lessons from these biological applications

HMMs provide state-of-the-art performance in protein classification and gene finding
HMMs can naturally classify and parse sequences of variable length
Much domain knowledge can be incorporated into the structure of the models:
types, lengths, and ordering of sequence features
appropriate amount of memory to represent various sequence features
Models can vary their representation across sequence features
Discriminative methods often provide superior predictive accuracy to generative methods


References

S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25:3389-3402, 1997.
A. Apostolico and G. Bejerano. Optimal amnesic probabilistic automata, or how to learn and classify proteins in linear time and space. In Proceedings of RECOMB 2000. http://citeseer.nj.nec.com/apostolico00optimal.html
V. R. Borkar, K. Deshmukh, and S. Sarawagi. Automatic text segmentation for extracting structured records. SIGMOD 2001.
C. Burge and S. Karlin. Prediction of complete gene structures in human genomic DNA. Journal of Molecular Biology, 268:78-94, 1997.
M. E. Califf and R. J. Mooney. Relational learning of pattern-match rules for information extraction. AAAI 1999.
S. Chakrabarti, S. Sarawagi, and B. Dom. Mining surprising patterns using temporal description length. VLDB, 1998.
M. Collins. Discriminative training methods for Hidden Markov Models: theory and experiments with perceptron algorithms. EMNLP 2002.
R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1998.
E. Eskin, W. Lee, and S. J. Stolfo. Modeling system calls for intrusion detection with dynamic window sizes. Proceedings of DISCEX II, June 2001. (IDS publications: http://www.cs.columbia.edu/ids/publications/)
D. Freitag and A. McCallum. Information extraction with HMM structures learned by stochastic optimization. AAAI 2000.

A. Gionis and H. Mannila. Finding recurrent sources in sequences. ACM RECOMB 2003. http://www.cs.helsinki.fi/u/mannila/postscripts/p115-gionis.ps
M. T. Goodrich. Efficient piecewise-approximation using the uniform metric. Symposium on Computational Geometry, 1994.
D. Haussler. Convolution kernels on discrete structures. Technical report, UC Santa Cruz, 1999.
K. Karplus, C. Barrett, and R. Hughey. Hidden Markov models for detecting remote protein homologies. Bioinformatics, 14(10):846-856, 1998.
L. Lo Conte, S. Brenner, T. Hubbard, C. Chothia, and A. Murzin. SCOP database in 2002: refinements accommodate structural genomics. Nucleic Acids Research, 30:264-267, 2002.
W. Lee and S. Stolfo. Data mining approaches for intrusion detection. In Proceedings of the Seventh USENIX Security Symposium (SECURITY '98), San Antonio, TX, January 1998.
A. McCallum, D. Freitag, and F. Pereira. Maximum entropy Markov models for information extraction and segmentation. ICML 2000.
L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. In A. Weibel and K.-F. Lee (eds.), Readings in Speech Recognition. Los Altos, CA: Morgan Kaufmann, pages 267-296, 1990.
D. Ron, Y. Singer, and N. Tishby. The power of amnesia: learning probabilistic automata with variable memory length. Machine Learning, 25:117-149, 1996.
C. Warrender, S. Forrest, and B. Pearlmutter. Detecting intrusions using system calls: alternative data models. 1999 IEEE Symposium on Security and Privacy, 1999.