Algorithms to Search Position Specific Scoring Matrices in Biosequences Cinzia Pizzi Dipartimento di...

Algorithms to Search Position Specific Scoring Matrices in Biosequences

Cinzia Pizzi, Dipartimento di Ingegneria dell’Informazione

Università degli Studi di Padova

C.Pizzi, DEI – Univ. Of Padova (Italy) 2

Outline
• Weighted patterns in Biology
• The problem of profile matching
• The look-ahead method
• Suffix based Algorithms
• Aho-Corasick Extension (ACE)
• Look-ahead Filtration Algorithm (LFA)
• Superalphabet (NS)
• Some experimental results

What are Motifs?
• Motifs are biologically significant elements that are responsible for common structures or functions
• Motifs are statistically significant substrings in bio-sequences
• Assumption: if two entities share the same function or structure, common over-represented elements might be responsible for the observed similarity

Motif Discovery
• Take a set of co-expressed genes
• Compare their promoter regions
• Common over-represented substrings are good candidates for TFBS (transcription factor binding sites)
• Need counted/expected frequency
[Figure: promoters of co-expressed genes]

Motif Discovery
• TFBS, DNA motifs
• Motifs = binding sites = substrings
• Intrinsic variability of biological sequences
• Mismatches, indels, wildcards, superalphabets...
[Figure: promoters of co-expressed genes]

Motif Representation
• Binding sites of the same factor are not exactly the same in all sequences

ACATACCCGAATATGCATGCCTACTCCAAATTCGAAACGGACTCCTATGCCCACTCGGAA

Profile -> matrix representation
[Figure: profile matrix with columns 1 to 6 and one row per symbol]


Motif Representation
• Protein classification: each family is modeled by a matrix

ACDEHNPVACCCDEGAMMATATHCATVVST

[Figure: one profile matrix per protein family, each with columns 1 to 6]

...WVDEHNPVAC

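Profile
• Weighted pattern p of length m defined over alphabet Σ
• |Σ| x m matrix defines scores

        1     2     3     4     5     6
A      0.3   0.0   0.1   0.2   1.0   0.3
C      0.1   0.8   0.5   0.2   0.0   0.4
G      0.2   0.0   0.4   0.3   0.0   0.0
T      0.4   0.2   0.0   0.3   0.0   0.3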

Segment Score

S = s_1 s_2 … s_m

\[ \mathrm{Score}(S) = \sum_{i=1}^{m} M[s_i, i] \]

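Meaning of the score

With f(s_i | i) the frequency of symbol s_i at position i, each entry is M[s_i, i] = \ln f(s_i \mid i), so that

\[ \mathrm{Score}(S) = \sum_{i=1}^{m} M[s_i, i] = \sum_{i=1}^{m} \ln f(s_i \mid i) = \ln \prod_{i=1}^{m} f(s_i \mid i) \]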

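Segment Score Example

        1     2     3     4     5     6
A      0.3   0.0   0.1   0.2   1.0   0.3
C      0.1   0.8   0.5   0.2   0.0   0.4
G      0.2   0.0   0.4   0.3   0.0   0.0
T      0.4   0.2   0.0   0.3   0.0   0.3

Segment: G T A C A C

Score = 0.2 + 0.2 + 0.1 + 0.2 + 1.0 + 0.4 = 2.1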
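The segment score can be reproduced with a few lines of Python. This is only a sketch: the dictionary layout of the matrix and the name `segment_score` are choices made here for illustration and do not come from the slides.

```python
# Example matrix from the slides: one row per symbol, one column per position 1..6.
M = {
    "A": [0.3, 0.0, 0.1, 0.2, 1.0, 0.3],
    "C": [0.1, 0.8, 0.5, 0.2, 0.0, 0.4],
    "G": [0.2, 0.0, 0.4, 0.3, 0.0, 0.0],
    "T": [0.4, 0.2, 0.0, 0.3, 0.0, 0.3],
}


def segment_score(segment, matrix=M):
    """Add up the matrix entry of each symbol at its position in the segment."""
    return sum(matrix[symbol][i] for i, symbol in enumerate(segment))


print(round(segment_score("GTACAC"), 1))  # 2.1
```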

Profile Matching Problem
• Text T of length n defined over Σ
• Profile p (|Σ| x m)
• Score threshold th
• Score S_i of the segment of length m starting at position i
• Find all positions i in T where S_i ≥ th

Example: th = 2, T = CGTACACTCGGTA

pos   segment   score   result
1     CGTACA    0.6     Not a match!
2     GTACAC    2.1     Match at pos 2!
3     TACACT    1.4     Not a match!
4     ACACTC    1.8     Not a match!
5     CACTCG    0.9     Not a match!
6     ACTCGG    1.3     Not a match!
7     CTCGGT    1.4     Not a match!
8     TCGGTA    2.2     Match at pos 8!
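The sliding-window example above corresponds to the naive algorithm, which scores every segment in full. A minimal sketch, assuming the example matrix and 1-based positions as in the slides (the function name `naive_search` is made up for this illustration):

```python
# Example matrix from the slides: one row per symbol, one column per position 1..6.
M = {
    "A": [0.3, 0.0, 0.1, 0.2, 1.0, 0.3],
    "C": [0.1, 0.8, 0.5, 0.2, 0.0, 0.4],
    "G": [0.2, 0.0, 0.4, 0.3, 0.0, 0.0],
    "T": [0.4, 0.2, 0.0, 0.3, 0.0, 0.3],
}


def naive_search(text, matrix, th):
    """Return the 1-based start positions of all segments with score >= th."""
    m = len(next(iter(matrix.values())))  # pattern length
    hits = []
    for start in range(len(text) - m + 1):
        score = sum(matrix[text[start + j]][j] for j in range(m))
        if score >= th:
            hits.append(start + 1)
    return hits


print(naive_search("CGTACACTCGGTA", M, 2.0))  # [2, 8]
```

Every window costs m table lookups, so the running time is O(nm); the methods in the rest of the deck try to avoid part of that work.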

Scenarios of applications
• Online algorithms (no indexing)
    Database of profile matrices (e.g. TRANSFAC, JASPAR for TFBS)
    Input sequence to be searched
• Offline algorithms (indexing)
    Sequence or set of sequences
    Input matrix to search for matches

Summary of current methods
• Look-ahead method LA (Wu et al., 00)
• Offline methods based on LA:
    Suffix tree (Dorohonceanu et al., 00)
    Suffix array (Beckstette et al., 04, 06)
    Truncated Suffix Tree (Pizzi and Favaretto, 10)
• Online methods based on LA:
    Aho-Corasick, Filtering (Pizzi et al., 07, 09)

Summary of current methods
• Pattern Matching
    Shift-Add (Salmela and Tarhio, 08)
    KMP (Liefooghe et al., 09)
• Matrix partitioning (Liefooghe et al., 06; Pizzi et al., 07, 09)
• FFT based (Rajasekaran et al., 02)
• Compression based (Freschi et al., 05)

The look-ahead approach

\[ \max[i] = \sum_{j=i}^{m} \max_{a \in \Sigma} M[a, j] \]

\[ Pth[i] = th - \max[i+1] \]

In the table, the max row lists under column i the best score still obtainable from positions i+1 to m, that is max[i+1]; here th = 2.

        1     2     3     4     5     6
A      0.3   0.0   0.1   0.2   1.0   0.3
C      0.1   0.8   0.5   0.2   0.0   0.4
G      0.2   0.0   0.4   0.3   0.0   0.0
T      0.4   0.2   0.0   0.3   0.0   0.3
max    3.0   2.2   1.7   1.4   0.4   0.0
Pth   -1.0  -0.2   0.3   0.6   1.6   2.0

Segment: C G T A C A

prefix C     score 0.1 ≥ Pth[1] = -1.0   continue
prefix CG    score 0.1 ≥ Pth[2] = -0.2   continue
prefix CGT   score 0.1 < Pth[3] =  0.3   stop

Don’t need to compare the remaining symbols A C A!
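The partial thresholds and the early stop can be sketched as follows. The helper names and the small tolerance `EPS` (used to keep floating-point sums from flipping a comparison) are assumptions of this sketch, not part of the slides.

```python
# Example matrix from the slides: one row per symbol, one column per position 1..6.
M = {
    "A": [0.3, 0.0, 0.1, 0.2, 1.0, 0.3],
    "C": [0.1, 0.8, 0.5, 0.2, 0.0, 0.4],
    "G": [0.2, 0.0, 0.4, 0.3, 0.0, 0.0],
    "T": [0.4, 0.2, 0.0, 0.3, 0.0, 0.3],
}
EPS = 1e-9  # tolerance for floating-point sums (assumption of this sketch)


def partial_thresholds(matrix, th):
    """pth[i] = th minus the best score obtainable from the positions after i (0-based)."""
    m = len(next(iter(matrix.values())))
    col_max = [max(row[j] for row in matrix.values()) for j in range(m)]
    return [th - sum(col_max[i + 1:]) for i in range(m)]


def lookahead_score(segment, matrix, pth):
    """Score a segment left to right; stop as soon as the threshold is out of reach.

    Returns (is_match, number_of_symbols_compared).
    """
    score = 0.0
    for i, symbol in enumerate(segment):
        score += matrix[symbol][i]
        if score < pth[i] - EPS:
            return False, i + 1
    return True, len(segment)


pth = partial_thresholds(M, 2.0)
print([round(x, 1) for x in pth])          # [-1.0, -0.2, 0.3, 0.6, 1.6, 2.0]
print(lookahead_score("CGTACA", M, pth))   # (False, 3)
print(lookahead_score("GTACAC", M, pth))   # (True, 6)
```

Since the last partial threshold equals th, a segment that survives all m comparisons is a match.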

The suffix tree of T
• data structure suffix tree, Tree(T), is a compacted trie that represents all the suffixes of string T
• linear size: |Tree(T)| = O(|T|)
• can be constructed in linear time O(|T|)

Suffix trie and suffix tree
[Figure: Trie(abaab) and Tree(abaab)]

Tree(T) is of linear size
• only the internal branching nodes and the leaves are represented explicitly
• edges are labeled by substrings of T
• v = node(α) if the path from root to v spells α
• one-to-one correspondence of leaves and suffixes
• |T| leaves, hence < |T| internal nodes

Tree(hattivatti)
[Figure: suffix tree of hattivatti, with edges labeled by substrings]

Tree(hattivatti)
[Figure: the same suffix tree, with edge labels stored as pairs of text positions such as 6,10 and 2,5]

Tree(T) is a full text index
• P occurs in T if and only if P is a prefix of some suffix of T, if and only if a path for P exists in Tree(T)
• All occurrences of P in time O(|P| + #occ)
[Figure: Tree(T); P occurs in T at locations 8, 31, …]

LA over a Suffix Tree

Score(CG) = 0.2 > -0.2 = Th(2)
Score(CGT) = 0.2 < 0.3 = Th(3): skip the subtree

LA over a Suffix Tree

Score(TCC) = 1.9 > 0.3 = Th(3)
Score(TCCG) = 2.2 > 2 = Th(6): match, report all the subtree
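The pruning idea can be tried out on a much simpler index than a suffix tree. The sketch below swaps in a plain, uncompacted trie of all length-m windows of the text (quadratic space in the worst case, so it is a stand-in only) and reuses the example matrix with th = 2; all names are invented for this illustration.

```python
# Example matrix from the slides: one row per symbol, one column per position 1..6.
M = {
    "A": [0.3, 0.0, 0.1, 0.2, 1.0, 0.3],
    "C": [0.1, 0.8, 0.5, 0.2, 0.0, 0.4],
    "G": [0.2, 0.0, 0.4, 0.3, 0.0, 0.0],
    "T": [0.4, 0.2, 0.0, 0.3, 0.0, 0.3],
}
EPS = 1e-9  # tolerance for floating-point sums (assumption of this sketch)


def partial_thresholds(matrix, th):
    m = len(next(iter(matrix.values())))
    col_max = [max(row[j] for row in matrix.values()) for j in range(m)]
    return [th - sum(col_max[i + 1:]) for i in range(m)]


def build_window_trie(text, m):
    """Trie of all length-m windows; nodes at depth m store 1-based start positions."""
    root = {}
    for start in range(len(text) - m + 1):
        node = root
        for symbol in text[start:start + m]:
            node = node.setdefault(symbol, {})
        node.setdefault("positions", []).append(start + 1)
    return root


def trie_search(node, matrix, pth, depth=0, score=0.0, hits=None):
    """Depth-first traversal; a branch below its partial threshold is skipped entirely."""
    if hits is None:
        hits = []
    if depth == len(pth):
        hits.extend(node["positions"])
        return hits
    for symbol, child in node.items():
        new_score = score + matrix[symbol][depth]
        if new_score < pth[depth] - EPS:
            continue  # skip the whole subtree below this edge
        trie_search(child, matrix, pth, depth + 1, new_score, hits)
    return hits


trie = build_window_trie("CGTACACTCGGTA", 6)
print(sorted(trie_search(trie, M, partial_thresholds(M, 2.0))))  # [2, 8]
```

The gain over the online scan is that a shared prefix is scored once for all the text positions below it.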

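Suffix array: example

suffix array = lexicographic order of the suffixes

Suffixes of hattivatti in text order:
hattivatti, attivatti, ttivatti, tivatti, ivatti, vatti, atti, tti, ti, i, ε

Suffixes in lexicographic order:
ε
atti
attivatti
hattivatti
i
ivatti
ti
tivatti
tti
ttivatti
vatti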
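For a string this short the suffix array can be obtained by simply sorting the suffixes. The sketch below does exactly that; it is far from the efficient construction algorithms used in practice, and the 1-based positions and the inclusion of the empty suffix are conventions chosen here to mirror the example.

```python
def suffix_array(text):
    """1-based start positions of all suffixes (including the empty one), sorted lexicographically."""
    return sorted(range(1, len(text) + 2), key=lambda p: text[p - 1:])


T = "hattivatti"
sa = suffix_array(T)
print(sa)  # [11, 7, 2, 1, 10, 5, 9, 4, 8, 3, 6]
for p in sa:
    print(p, T[p - 1:] or "ε")
```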

Suffix array
• suffix array SA(T) = an array giving the lexicographic order of the suffixes of T
• practitioners like suffix arrays (simplicity, space efficiency)
• theoreticians like suffix trees (explicit structure)

LA over a Suffix Array

In terms of suffix trees, skp[i] is the lexicographically next leaf that does not occur in the subtree below the branching node corresponding to the longest common prefix of S_{suf[i-1]} and S_{suf[i]}.

\[ skp[i] = \min\left( \{n+1\} \cup \{\, j \in [i+1, n] \mid lcp[i] > lcp[j] \,\} \right) \]
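The definition of skp can be evaluated directly from the lcp table. The sketch below only builds the tables, naively, and does not implement the search itself. It uses 0-based indices over the non-empty suffixes, so the sentinel is n rather than n + 1; these conventions and the function names are assumptions made for the illustration.

```python
def suffix_array(text):
    """0-based start positions of the non-empty suffixes in lexicographic order."""
    return sorted(range(len(text)), key=lambda p: text[p:])


def lcp_table(text, suf):
    """lcp[i] = length of the longest common prefix of suffixes suf[i-1] and suf[i]; lcp[0] = 0."""
    def common(a, b):
        k = 0
        while a + k < len(text) and b + k < len(text) and text[a + k] == text[b + k]:
            k += 1
        return k
    return [0] + [common(suf[i - 1], suf[i]) for i in range(1, len(suf))]


def skip_table(lcp):
    """skp[i] = smallest j > i with lcp[i] > lcp[j], or n if there is none."""
    n = len(lcp)
    skp = []
    for i in range(n):
        target = n  # sentinel: nothing left to jump to
        for j in range(i + 1, n):
            if lcp[i] > lcp[j]:
                target = j
                break
        skp.append(target)
    return skp


suf = suffix_array("abaab")
lcp = lcp_table("abaab", suf)
print(suf, lcp, skip_table(lcp))  # [2, 3, 0, 4, 1] [0, 1, 2, 0, 1] [5, 3, 3, 5, 5]
```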

LA over Truncated ST
• Build TST with truncation factor h
• L = max length of a matrix in the DB
• if h = L, simply work as ST
• if h < L, filtering:
    if a leaf is reached, take the corresponding positions (p1, p2, …, pt)
    for each pi, check positions pi+j, h < j <= m, with look-ahead
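The filtering step can be illustrated with a dictionary that maps every h-mer of the text to its positions. The dictionary is only a stand-in for the leaves of a truncated suffix tree, and the names below are hypothetical; the example matrix and th = 2 are reused, with h = 3.

```python
from collections import defaultdict

# Example matrix from the slides: one row per symbol, one column per position 1..6.
M = {
    "A": [0.3, 0.0, 0.1, 0.2, 1.0, 0.3],
    "C": [0.1, 0.8, 0.5, 0.2, 0.0, 0.4],
    "G": [0.2, 0.0, 0.4, 0.3, 0.0, 0.0],
    "T": [0.4, 0.2, 0.0, 0.3, 0.0, 0.3],
}
EPS = 1e-9  # tolerance for floating-point sums (assumption of this sketch)


def partial_thresholds(matrix, th):
    m = len(next(iter(matrix.values())))
    col_max = [max(row[j] for row in matrix.values()) for j in range(m)]
    return [th - sum(col_max[i + 1:]) for i in range(m)]


def build_hmer_index(text, h):
    """Map each h-mer to the 0-based positions where it starts (stand-in for TST leaves)."""
    index = defaultdict(list)
    for start in range(len(text) - h + 1):
        index[text[start:start + h]].append(start)
    return index


def truncated_search(text, index, h, matrix, pth):
    m = len(pth)
    hits = []
    for hmer, positions in index.items():
        # Look-ahead on the shared prefix, done once per distinct h-mer.
        score, alive = 0.0, True
        for j, symbol in enumerate(hmer):
            score += matrix[symbol][j]
            if score < pth[j] - EPS:
                alive = False
                break
        if not alive:
            continue
        # Verify the remaining m - h symbols at every position of this h-mer.
        for p in positions:
            if p + m > len(text):
                continue
            s = score
            for j in range(h, m):
                s += matrix[text[p + j]][j]
                if s < pth[j] - EPS:
                    break
            else:
                hits.append(p + 1)  # 1-based position
    return sorted(hits)


text = "CGTACACTCGGTA"
print(truncated_search(text, build_hmer_index(text, 3), 3, M, partial_thresholds(M, 2.0)))  # [2, 8]
```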

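LA over Truncated ST
[Figure: positions p1, p2, p3 in the text; after each pi + h, a further L-h symbols are checked]

Space Occupation TruST
[Plot not recoverable]

Running Time TruST
[Plot not recoverable]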

Online Profile Matching
• Aho-Corasick Expansion (ACE): pattern matching + LA
• Lookahead Filtration Algorithm (LFA): score for fixed length prefix as a filter + LA
• Naive Superalphabet (NS): encode k-mers in superalphabet symbol

The Aho-Corasick Algorithm
• A trie for D = {he, she, his, hers}

The Aho-Corasick algorithm
• Add failure links
• Time O(n+m), where m = sum of word lengths
• Space depends on D
[Figure: trie for D with failure links]
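A compact version of the construction for D = {he, she, his, hers} is sketched below: a trie, failure links computed breadth-first, and a scan of the text. It is a textbook-style illustration with made-up names, not the implementation used in the experiments.

```python
from collections import deque


def build_automaton(words):
    """Build the trie (goto), the failure links (fail) and the output lists (out)."""
    goto, fail, out = [{}], [0], [[]]
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)
    # Breadth-first traversal: the failure link of a node is known before its children need it.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def search(text, automaton):
    """Return (1-based start position, word) for every occurrence of a dictionary word."""
    goto, fail, out = automaton
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            hits.append((i - len(word) + 2, word))
    return hits


automaton = build_automaton(["he", "she", "his", "hers"])
print(search("ushers", automaton))  # [(2, 'she'), (3, 'he'), (3, 'hers')]
```

The failure link of the node for "she" points to the node for "he", which is why both words are reported when the scan reaches that state.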

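The Fast Aho-Corasick
[Figure: automaton with states 0, 1, 2, 8, 9 along the path h e r s, with explicit transitions on h, e, i, r, s replacing the failure links]
• Time O(n)
• Space depends on D and Σ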

AC and profile matching
• Build AC automaton for all the words that are a match for the matrix
• LA partial threshold limits the number of words to those that actually match
• O(|D||Σ|m + m|Σ|) pre-processing
• |D| ≤ |Σ|^m depends on matrix and threshold
• Search the text with AC automaton: O(n) search

AC-Extension by LA

        1     2     3     4     5     6
A      0.3   0.0   0.1   0.2   1.0   0.3
C      0.1   0.8   0.5   0.2   0.0   0.4
G      0.2   0.0   0.4   0.3   0.0   0.0
T      0.4   0.2   0.0   0.3   0.0   0.3
Pth   -1.0  -0.2   0.3   0.6   1.6   2.0

Each node is labeled [symbol, prefix score]; a node is created only if its prefix score reaches the partial threshold.

First position (children of the root):
[C,0.1] [G,0.2] [A,0.3] [T,0.4]

Second position (children of [C,0.1]):
[A,0.1] [G,0.1] [T,0.3] [C,0.9]

Third position (children of a node with prefix score 0.1; A and T would fall below Pth[3] = 0.3):
[G,0.5] [C,0.6]
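The expansion step, enumerating every word that reaches the threshold while pruning prefixes with the partial thresholds, can be sketched as below. To keep the example short, the resulting set of words is matched against the text with a set lookup on each window rather than with an Aho-Corasick automaton; that substitution and all names are choices of this sketch.

```python
# Example matrix from the slides: one row per symbol, one column per position 1..6.
M = {
    "A": [0.3, 0.0, 0.1, 0.2, 1.0, 0.3],
    "C": [0.1, 0.8, 0.5, 0.2, 0.0, 0.4],
    "G": [0.2, 0.0, 0.4, 0.3, 0.0, 0.0],
    "T": [0.4, 0.2, 0.0, 0.3, 0.0, 0.3],
}
EPS = 1e-9  # tolerance for floating-point sums (assumption of this sketch)


def partial_thresholds(matrix, th):
    m = len(next(iter(matrix.values())))
    col_max = [max(row[j] for row in matrix.values()) for j in range(m)]
    return [th - sum(col_max[i + 1:]) for i in range(m)]


def matching_words(matrix, th):
    """Enumerate all words of length m with score >= th, pruning hopeless prefixes."""
    alphabet = sorted(matrix)
    pth = partial_thresholds(matrix, th)
    m = len(pth)
    words = []

    def extend(prefix, score):
        depth = len(prefix)
        if depth == m:
            words.append(prefix)
            return
        for a in alphabet:
            new_score = score + matrix[a][depth]
            if new_score >= pth[depth] - EPS:  # otherwise the node is never created
                extend(prefix + a, new_score)

    extend("", 0.0)
    return words


def search_with_dictionary(text, words, m):
    """Stand-in for the automaton scan: report windows that belong to the dictionary."""
    dictionary = set(words)
    return [s + 1 for s in range(len(text) - m + 1) if text[s:s + m] in dictionary]


D = matching_words(M, 2.0)
print("GTACAC" in D, "CGTACA" in D)                  # True False
print(search_with_dictionary("CGTACACTCGGTA", D, 6))  # [2, 8]
```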

ACE Example

Text: CGTACACTCGGTA

[Figure: the automaton reads the text one symbol at a time, following the path g t a c ... until a final state is reached]

Match at p-m+1 = 7-6+1 = 2

Minimum Gain for ACE
• Dual concept of look-ahead
• Compute for every prefix the minimum contribution of the remaining positions in the pattern
• If current_score(i) + min_gain(i) ≥ Th, report a match
• Adv: in the automaton save a full subtree of height m-i

Example: M0003, MSS=0.85

[Figure: automaton path [G,18500] -> [C,37000] -> [C,55500]]

• GCC is sufficient to detect a match
• Save 5464 nodes out of 5468
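The minimum-gain test mirrors the look-ahead test, with column minima in place of column maxima. The sketch below uses the small example matrix from the earlier slides with th = 2 (not the M0003 matrix, whose entries are not given); `min_gains` and `sure_match_length` are hypothetical names.

```python
# Example matrix from the slides: one row per symbol, one column per position 1..6.
M = {
    "A": [0.3, 0.0, 0.1, 0.2, 1.0, 0.3],
    "C": [0.1, 0.8, 0.5, 0.2, 0.0, 0.4],
    "G": [0.2, 0.0, 0.4, 0.3, 0.0, 0.0],
    "T": [0.4, 0.2, 0.0, 0.3, 0.0, 0.3],
}
EPS = 1e-9  # tolerance for floating-point sums (assumption of this sketch)


def min_gains(matrix):
    """gains[i] = smallest total contribution of the positions after i (0-based)."""
    m = len(next(iter(matrix.values())))
    col_min = [min(row[j] for row in matrix.values()) for j in range(m)]
    return [sum(col_min[i + 1:]) for i in range(m)]


def sure_match_length(word, matrix, th):
    """Length of the shortest prefix that already guarantees a match, or None."""
    gains = min_gains(matrix)
    score = 0.0
    for i, symbol in enumerate(word):
        score += matrix[symbol][i]
        if score + gains[i] >= th - EPS:
            return i + 1
    return None


print(sure_match_length("TCCGAA", M, 2.0))  # 4
print(sure_match_length("GTACAC", M, 2.0))  # 6
print(sure_match_length("CGTACA", M, 2.0))  # None
```

With this matrix every word starting with TCCG is a match, so in an automaton the node for TCCG could be made final and the subtree below it left out.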

Minimum Gain ACE

Look-ahead Filtration
• Compute the scores for all words of fixed length k and store them: O(|Σ|^k) pre-processing
• Sliding window of size k
• When score ≥ Pth[k], check remaining symbols with LA (up to m-k)
• O(n + (m-k)r) search; k is the prefix length, r is avg number of full scoring

Look-ahead Filtration Example (k = 3)

Text: CGTACACTCGGTA, Pth[3] = 0.3

Table entries (excerpt):

3-mer   score
...     ...
ATT     0.5
CAA     0.2
...     ...
CGT     0.1
CTT     0.3
...     ...
GTA     0.5
...     ...
GTT     0.4
TAA     0.5
...     ...
TTT     0.6

Window 1: Score(CGT) = 0.1 < Pth[3]
Shift and concatenate to obtain the next 3-mer

Window 2: Score(GTA) = 0.5 > Pth[3]
Check at most m-k remaining symbols
Score(GTAC) = 0.7 > Pth[4]
Score(GTACA) = 1.7 > Pth[5]
Score(GTACAC) = 2.1 > th
Match!
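The filtration scheme can be sketched as follows, with k = 3 and the example matrix. The table is a plain dictionary keyed by the k-mer string, and each window is sliced from the text instead of being updated by shift-and-concatenate, which is a simplification made here; names are invented for the illustration.

```python
from itertools import product

# Example matrix from the slides: one row per symbol, one column per position 1..6.
M = {
    "A": [0.3, 0.0, 0.1, 0.2, 1.0, 0.3],
    "C": [0.1, 0.8, 0.5, 0.2, 0.0, 0.4],
    "G": [0.2, 0.0, 0.4, 0.3, 0.0, 0.0],
    "T": [0.4, 0.2, 0.0, 0.3, 0.0, 0.3],
}
EPS = 1e-9  # tolerance for floating-point sums (assumption of this sketch)


def partial_thresholds(matrix, th):
    m = len(next(iter(matrix.values())))
    col_max = [max(row[j] for row in matrix.values()) for j in range(m)]
    return [th - sum(col_max[i + 1:]) for i in range(m)]


def kmer_table(matrix, k):
    """Score of every k-mer against the first k columns of the matrix."""
    alphabet = sorted(matrix)
    return {
        "".join(w): sum(matrix[a][j] for j, a in enumerate(w))
        for w in product(alphabet, repeat=k)
    }


def lfa_search(text, matrix, th, k):
    pth = partial_thresholds(matrix, th)
    m = len(pth)
    table = kmer_table(matrix, k)
    hits = []
    for start in range(len(text) - m + 1):
        score = table[text[start:start + k]]      # one lookup for the prefix
        if score < pth[k - 1] - EPS:
            continue                               # filtered out
        for j in range(k, m):                      # look-ahead on the rest
            score += matrix[text[start + j]][j]
            if score < pth[j] - EPS:
                break
        else:
            hits.append(start + 1)
    return hits


table = kmer_table(M, 3)
print(round(table["CGT"], 1), round(table["GTA"], 1))  # 0.1 0.5
print(lfa_search("CGTACACTCGGTA", M, 2.0, 3))          # [2, 8]
```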

More on ACE and LF
• It is possible to combine both methods
• Automaton built on qualifying prefixes only
• Multi-matrix version

Super-Alphabet
• Code words of length k to super-alphabet symbols
• |Σ|^k symbols are needed
• Code the matrix M into matrix M’ (|Σ|^k x m/k)
• Run the naive algorithm on the sequence: O(nm/k)

SuperAlphabet Example (k = 2)

2-mer   SCORE 1-2   SCORE 3-4   SCORE 5-6
AA      0.3         0.3         1.3
AC      1.1         0.3         1.4
AG      0.3         0.4         1.0
AT      0.3         0.4         1.3
CA      0.1         0.7         0.3
CC      0.9         0.7         0.4
CG      0.1         0.8         0.0
CT      0.3         0.8         0.3
GA      0.2         0.6         0.3
GC      1.0         0.6         0.4
GG      0.2         0.7         0.0
GT      0.4         0.7         0.3
TA      0.4         0.2         0.3
TC      1.2         0.2         0.4
TG      0.4         0.3         0.0
TT      0.6         0.3         0.3

Text: CGTACACTCGGTA

Position 1: CG TA CA, Score = 0.1 + 0.2 + 0.3 = 0.6 < Th
Position 2: GT AC AC, Score = 0.4 + 0.3 + 1.4 = 2.1, match!
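The recoding of M into M’ and the shortened scoring loop can be sketched as below, with k = 2 and the example matrix. The sketch assumes that k divides m and uses strings as super-alphabet symbols, where a real implementation would more likely use integer codes; names are made up for the illustration.

```python
from itertools import product

# Example matrix from the slides: one row per symbol, one column per position 1..6.
M = {
    "A": [0.3, 0.0, 0.1, 0.2, 1.0, 0.3],
    "C": [0.1, 0.8, 0.5, 0.2, 0.0, 0.4],
    "G": [0.2, 0.0, 0.4, 0.3, 0.0, 0.0],
    "T": [0.4, 0.2, 0.0, 0.3, 0.0, 0.3],
}
EPS = 1e-9  # tolerance for floating-point sums (assumption of this sketch)


def superalphabet_matrix(matrix, k):
    """One score table per block of k columns; each table has |alphabet|**k entries."""
    m = len(next(iter(matrix.values())))
    if m % k != 0:
        raise ValueError("this sketch assumes that k divides m")
    alphabet = sorted(matrix)
    blocks = []
    for b in range(m // k):
        blocks.append({
            "".join(w): sum(matrix[a][b * k + j] for j, a in enumerate(w))
            for w in product(alphabet, repeat=k)
        })
    return blocks


def superalphabet_search(text, blocks, k, th):
    """Naive scan that needs m/k lookups per window instead of m."""
    m = k * len(blocks)
    hits = []
    for start in range(len(text) - m + 1):
        score = sum(
            block[text[start + b * k:start + (b + 1) * k]]
            for b, block in enumerate(blocks)
        )
        if score >= th - EPS:
            hits.append(start + 1)
    return hits


blocks = superalphabet_matrix(M, 2)
print(round(blocks[0]["AC"], 1), round(blocks[1]["AC"], 1), round(blocks[2]["AC"], 1))  # 1.1 0.3 1.4
print(superalphabet_search("CGTACACTCGGTA", blocks, 2, 2.0))                            # [2, 8]
```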

Experiments
• Jaspar Database: 123 TFBS matrices (DNA), PRINTS database (proteins)
• Test sequence about 50M bases
• P-value defines threshold
• 3 GHz Intel Pentium IV processor with 2 gigabytes of main memory, running under Linux

DNA – avg running times per matrix

DNA – matrix length

DNA – window width

Proteins, avg time per matrix

Proteins - matrix length

MOODS – Motif Occurrence Detection Suite

Conclusions
• Searching matrices is a core step for many bioinformatics applications (searching, discovery, classification…)
• Several approaches have been developed in recent years
• Online methods based on filtering are currently the most efficient

References

C. Pizzi, P. Rastas, E. Ukkonen. Fast Search Algorithms for Position Specific Scoring Matrices. In Proc. of the 1st Conference on Bioinformatics Research and Development (BIRD 07), Berlin, Germany, March 2007, LNCS/LNBI 4414, pp. 239-250.

C. Pizzi, E. Ukkonen. Fast Profile Matching Algorithms - a survey. Theoretical Computer Science, 395(2-3), 2008, pp. 137-157, Special Issue SAIL: String Algorithms, Information and Learning.

C. Pizzi, P. Rastas, E. Ukkonen. Finding significant matches of position weight matrices in linear time. Accepted for publication by IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2009.

J. Korhonen, P. Martinmaki, C. Pizzi, P. Rastas, E. Ukkonen. MOODS: fast search for position weight matrix matches in DNA sequences. Bioinformatics, 2009, 25(23):3181-3182.

Thanks

Acknowledgements
• Esko Ukkonen, Pasi Rastas, Janne Korhonen, P. Martinmaki
• Academy of Finland grant “From Data to knowledge”
• EU Project “Regulatory Networks”
• Premio di Ricerca `Avere Trent’Anni’
• Univ. Padova, Parco Scientifico Galileo, Il Mattino, Giovani Confindustria, Scuola Galileiana di Studi Superiori

Length 100

NA = Naïve Algorithm
LSA = Look-ahead Search Algorithm
LFA = Look-ahead Filter Algorithm (k=7)
NS = Naïve Superalphabet (k=7)

• 13 patterns obtained by concatenating Jaspar matrices

• MSS: Matrix Similarity Score (% of maximal score)

Multiple Matrices Search

Running Time per matrix

Length 0 to 15 (108 matrices)

NA = Naïve Algorithm
LSA = Look-ahead Search Algorithm
ACE = Aho-Corasick Expansion
LFA = Look-ahead Filtration Algorithm (k=7)
NS = Naïve Super-alphabet (k=7)

Running Time per matrix

Length 16 to 30 (15 matrices)

NA = Naïve Algorithm
LSA = Look-ahead Search Algorithm
LFA = Look-ahead Filtration Algorithm
NS = Naïve Super-alphabet

Length 100: running times for the 13 patterns obtained by concatenating Jaspar matrices

        P=10^-5   P=10^-4   P=10^-3   P=10^-2
NA      10.234    10.244    10.434    11.080
LSA     11.835    12.675    13.335    15.118
LFA      9.955    10.347    11.096    12.965
NS       3.576     3.677     4.593     9.918

Motif Representation
• Instances of a biological signal are different

ACATACCCGAATATGCATGCCTACTCCAAATTCGAAACGGACTCCTATGCCCACTCGGAA

Consensus -> pattern representation: TCC(G|T)AC

Profile -> matrix representation
[Figure: profile matrix with columns 1 to 6 and one row per symbol]
