Data and Knowledge Engineering Laboratory Clustered Segment Indexing for Pattern Searching on the...

Data and Knowledge Engineering Laboratory

Clustered Segment Indexing for Pattern Searching on the Secondary Structure

of Protein Sequences

Minkoo Seo Sanghyun Park JungIm Won

{mkseo, sanghyun, jiwon}@cs.yonsei.ac.kr

Department of Computer Science, Yonsei University

http://www.yonsei.ac.kr/


Outline• Introduction• Related Works• Basic Concept

– Segment– Cluster and Look Ahead

• Selectivity Analysis• Algorithm Description

– Exact Match– Range Match

• Experiments• Conclusion and Future Work



Our target

Introduction• Similarities in Protein Structure

– Functional properties of proteins usually depend on structures of the proteins.

– There are many proteins which are structurally similar, but their sequences are not similar at the level of amino acids.

Amino Acids (AA)

Loop RegionsSecondary Structure Element (SSE):

Helices and Sheets

Polymer Chain

Protein

Polymer Chain Polymer Chain…

<Protein Structure>



Introduction (cont’)

• Secondary Structure of Protein Sequences– Linear sequence of amino acids folds into three dimensional stru

ctures.

– There are three basic types of folds:• h : Alpha Helices

• e : Beta Sheets

• l : Turns of Loops

– Normally, these types occur in groups.• For example, ‘lhhheeeelll’ is more likely to occur than ‘helhehhll.’

Primary :GQISDSIEEKRGFFSTKR..Secondary:HLLLLLLLLLLHHHEEEE..Probability:855577763445449476..



Related Works• BLAST and FASTA

– These exhaustive search techniques are often adapted to run on specialized high-end hardware or parallelized.

– They do not solve the fundamental problem of needing to compare a query to each sequence in the collection.



Related Works (cont’)

• Segment Based Indexing

L. Hammel and J. M. Patel, "Searching on the Secondary Structure of Protein Sequence," In Proc. VLDB Conference, 2002

– Indexing scheme is based on the idea of the segmentation of secondary protein sequences.

– Index selectivity is not uniform.

– Gap can be defined only as segments.



Segment

ID Type LengthStart Loc.

1 H 1 0

1 L 10 1

1 H 3 11

1 E 4 14

Segmentation Process

H LLLLLLLLLL HHH EEEE

H,1 L,10 H,3 E,4

0 1234567890 123 4567

Type, Len:

Loc :

Protein ID :1Primary :GQISDSIEEKRGFFSTKR..Secondary :HLLLLLLLLLLHHHEEEE..Probability:855577763445449476..

Sample Protein Sequence

<Segments>



Segment (cont’)

• Selectivity of Type+Length– Index selectivity is not uniform.

0

50

100

150

200

1 11 21 31 41 51 61

Segment Length

Nu

mb

er

of

se

gm

en

ts i

nT

ho

us

an

ds

l

h

e



Cluster and Look Ahead

IDType

StrLength

StartLoc.

LookAhead

1 h 1 0 lhe

1 l 10 1 he

1 h 3 11 e

1 e 4 14

ID Type LengthStart Loc.

1 h 1 0

1 l 10 1

1 h 3 11

1 e 4 14

<Cluster table><Segments>

IDType

StrLength

StartLoc.

LookAhead

1 hl 11 0 he

1 lh 13 1 e

1 he 7 11

IDType

StrLength

StartLoc.

LookAhead

1 hlhe 18 0

k=0

k=1

k=2

• 2k segments are combined and put into Cluster table. By doing this, we index several characters instead of exactly one character.

• ‘Look Ahead’ fields contain ‘Type’ fields of next n segments.

• Maximum value of k can be limited according to query length.



Index Selectivity

Cluster table is searched byTypeStr+TypeLen+LookAhead.

Segment table is searched byType+TypeLen.

Cluster indices are 30~3,000 times more selective than the segment index, though the exact length information is lost.

02468

1012141618

(ID

, LO

C)

in T

ho

us

an

ds

0.000

0.020

0.040

0.060

0.080

0.100

0.120

0.140Number of (ID, LOC) Found

Selectivity

Number of Proteins = 320,000



LookAhead Selectivity

0

2

4

6

8

10

12

14

16

K0 K1 K2 K3 K4

(ID

, LO

C)

in T

ho

usa

nd

s

Num of (ID, LOC) Found w/o LookAhead

Num of (ID, LOC) Found w/ LookAhead

LookAhead improves selectivity 2~34 times.

Number of Proteins = 320,000



Exact Match

<h 1 1><l 10 10><h 3 3><e 4 4><l 3 3>

Query

Compute k

2)4,5logmin()4,||logmin( 22 Qk

<h 1 1><l 10 10><h 3 3><e 4 4><l 3 3>

Clustering

① k=2 ② k=2

Sort Merge• ID must be equal.• Loc difference must be 1(=<h 1 1>).

Post Processing

Validate result using actual sequences.

Searching

① TypeStr=hlhe, TypeLen=1+10+3+4, LookAhead=l② TypeStr=lhel, TypeLen=10+3+4+3



Range Match

<Range Match>

<e 50 60><l 3 3><h 5 6><e 6 8>

Clustering

TypeStr=elhe, TypeLen=64(50+3+5+6)~ 77(60+3+6+8)

Searching

Range of TypeLen is increased too much.

<Selective Clustering Range Match (SCRM)>

<e 50 60><l 3 3><h 5 6><e 6 8>

Clustering

Comparing Histograms

<e 50 60><l 3 3><h 5 6><e 6 8>①

② ③ ④

Search the cluster which seems to returnthe smallest number of rows.

① k=2, TypeLen=64~77 ② k=1, TypeLen=53~63 ③ k=1, TypeLen=8~9 ④ k=1, TypeLen=11~14



Environment• Oracle Database 8.1.7 on Linux

– To store cluster table and cluster indices.

• JDK 1.4.2_04 on Windows XP– To run searching program.

• Protein Data Used– 80,000 amino acid sequences of PDB(Protein Data Bank) were

downloaded.– Secondary structure of protein sequences were predicted using

PREDATOR.– 320,000 sequences were populated by duplicating 80,000 seque

nces 4 times.



Exact Match

0

10

20

30

40

3 5 10 15 20

|Q|

Tim

e i

n S

ec

.

Ours

ISS

SSS

M ISS(2)

0

5

3 5 10 15 20

|Q|

Tim

e in

Sec

.

Ours

MISS(2)

Exact Match Exact Match

Our method is 3~20 times faster than MISS(2) and 8~49 times faster than SSS and ISS.



Range Match

0

5

10

15

3 5 10 15 20

|Q|

Tim

e i

n S

ec

. SCRM

M ISS(2)

0

5

10

15

20

3 5 10 15 20

|Q|

Tim

e i

n S

ec

.

SCRM

M ISS(2)

0

5

10

15

20

3 5 10 15 20

|Q|

Tim

e in

Sec

.

SCRM

MISS(2)

Range Match, First Range Match, Middle

Range Match, End

SCRM is 1.2~6 times faster than MISS(2).



Scalability

0

5

10

15

80,000 160,000 320,000Number of proteins

Tim

e in

Sec

.

Ours

MISS(2)

05

101520253035404550

80,000 160,000 320,000Number of proteins

Tim

e in

Sec

.

Ours(SCRM)

MISS(2)

Exact Match, |Q|=5 Range Match (Middle), |Q|=5

Our methods are 2~4 times faster than MISS(2).



Index Size

Index size increases linearly.

Maximum K value can be set according to expected length of queries.In most cases, maximum K value of 2 will suffice.

0

500

1,000

1,500

2,000

2,500

3,000

0 1 2 3 4

Maximum K value

Siz

e in

MB

Cluster Index

Segment Index

Number of proteins = 320,000



Conclusion and Future Work

• We have proposed the concept of clustered indexing, which indexes fixed number of characters, and look ahead, which enhances index selectivity.

• Our experiments show that proposed techniques are more efficient than the previous algorithms.

• In the future, we would like to study and propose approximate searching algorithms for 3-D structures of proteins.


Data and Knowledge Engineering Laboratory Clustered Segment Indexing for Pattern Searching on the...

Documents

Transcript of Data and Knowledge Engineering Laboratory Clustered Segment Indexing for Pattern Searching on the...