Chapter 2 Data Manipulation Yonsei University 1 st Semester, 2015 Sanghyun Park.
Data and Knowledge Engineering Laboratory Clustered Segment Indexing for Pattern Searching on the...
-
Upload
tracy-norman -
Category
Documents
-
view
217 -
download
0
Transcript of Data and Knowledge Engineering Laboratory Clustered Segment Indexing for Pattern Searching on the...
Data and Knowledge Engineering Laboratory
Clustered Segment Indexing for Pattern Searching on the Secondary Structure
of Protein Sequences
Minkoo Seo Sanghyun Park JungIm Won
{mkseo, sanghyun, jiwon}@cs.yonsei.ac.kr
Department of Computer Science, Yonsei University
Data and Knowledge Engineering Laboratory
Outline• Introduction• Related Works• Basic Concept
– Segment– Cluster and Look Ahead
• Selectivity Analysis• Algorithm Description
– Exact Match– Range Match
• Experiments• Conclusion and Future Work
Data and Knowledge Engineering Laboratory
Our target
Introduction• Similarities in Protein Structure
– Functional properties of proteins usually depend on structures of the proteins.
– There are many proteins which are structurally similar, but their sequences are not similar at the level of amino acids.
Amino Acids (AA)
Loop RegionsSecondary Structure Element (SSE):
Helices and Sheets
Polymer Chain
Protein
Polymer Chain Polymer Chain…
<Protein Structure>
Data and Knowledge Engineering Laboratory
Introduction (cont’)
• Secondary Structure of Protein Sequences– Linear sequence of amino acids folds into three dimensional stru
ctures.
– There are three basic types of folds:• h : Alpha Helices
• e : Beta Sheets
• l : Turns of Loops
– Normally, these types occur in groups.• For example, ‘lhhheeeelll’ is more likely to occur than ‘helhehhll.’
Primary :GQISDSIEEKRGFFSTKR..Secondary:HLLLLLLLLLLHHHEEEE..Probability:855577763445449476..
Data and Knowledge Engineering Laboratory
Outline• Introduction• Related Works• Basic Concept
– Segment– Cluster and Look Ahead
• Selectivity Analysis• Algorithm Description
– Exact Match– Range Match
• Experiments• Conclusion and Future Work
Data and Knowledge Engineering Laboratory
Related Works• BLAST and FASTA
– These exhaustive search techniques are often adapted to run on specialized high-end hardware or parallelized.
– They do not solve the fundamental problem of needing to compare a query to each sequence in the collection.
Data and Knowledge Engineering Laboratory
Related Works (cont’)
• Segment Based Indexing
L. Hammel and J. M. Patel, "Searching on the Secondary Structure of Protein Sequence," In Proc. VLDB Conference, 2002
– Indexing scheme is based on the idea of the segmentation of secondary protein sequences.
– Index selectivity is not uniform.
– Gap can be defined only as segments.
Data and Knowledge Engineering Laboratory
Outline• Introduction• Related Works• Basic Concept
– Segment– Cluster and Look Ahead
• Selectivity Analysis• Algorithm Description
– Exact Match– Range Match
• Experiments• Conclusion and Future Work
Data and Knowledge Engineering Laboratory
Segment
ID Type LengthStart Loc.
1 H 1 0
1 L 10 1
1 H 3 11
1 E 4 14
Segmentation Process
H LLLLLLLLLL HHH EEEE
H,1 L,10 H,3 E,4
0 1234567890 123 4567
Type, Len:
Loc :
Protein ID :1Primary :GQISDSIEEKRGFFSTKR..Secondary :HLLLLLLLLLLHHHEEEE..Probability:855577763445449476..
Sample Protein Sequence
<Segments>
Data and Knowledge Engineering Laboratory
Segment (cont’)
• Selectivity of Type+Length– Index selectivity is not uniform.
0
50
100
150
200
1 11 21 31 41 51 61
Segment Length
Nu
mb
er
of
se
gm
en
ts i
nT
ho
us
an
ds
l
h
e
Data and Knowledge Engineering Laboratory
Cluster and Look Ahead
IDType
StrLength
StartLoc.
LookAhead
1 h 1 0 lhe
1 l 10 1 he
1 h 3 11 e
1 e 4 14
ID Type LengthStart Loc.
1 h 1 0
1 l 10 1
1 h 3 11
1 e 4 14
<Cluster table><Segments>
IDType
StrLength
StartLoc.
LookAhead
1 hl 11 0 he
1 lh 13 1 e
1 he 7 11
IDType
StrLength
StartLoc.
LookAhead
1 hlhe 18 0
k=0
k=1
k=2
• 2k segments are combined and put into Cluster table. By doing this, we index several characters instead of exactly one character.
• ‘Look Ahead’ fields contain ‘Type’ fields of next n segments.
• Maximum value of k can be limited according to query length.
Data and Knowledge Engineering Laboratory
Outline• Introduction• Related Works• Basic Concept
– Segment– Cluster and Look Ahead
• Selectivity Analysis• Algorithm Description
– Exact Match– Range Match
• Experiments• Conclusion and Future Work
Data and Knowledge Engineering Laboratory
Index Selectivity
Cluster table is searched byTypeStr+TypeLen+LookAhead.
Segment table is searched byType+TypeLen.
Cluster indices are 30~3,000 times more selective than the segment index, though the exact length information is lost.
02468
1012141618
(ID
, LO
C)
in T
ho
us
an
ds
0.000
0.020
0.040
0.060
0.080
0.100
0.120
0.140Number of (ID, LOC) Found
Selectivity
Number of Proteins = 320,000
Data and Knowledge Engineering Laboratory
LookAhead Selectivity
0
2
4
6
8
10
12
14
16
K0 K1 K2 K3 K4
(ID
, LO
C)
in T
ho
usa
nd
s
Num of (ID, LOC) Found w/o LookAhead
Num of (ID, LOC) Found w/ LookAhead
LookAhead improves selectivity 2~34 times.
Number of Proteins = 320,000
Data and Knowledge Engineering Laboratory
Outline• Introduction• Related Works• Basic Concept
– Segment– Cluster and Look Ahead
• Selectivity Analysis• Algorithm Description
– Exact Match– Range Match
• Experiments• Conclusion and Future Work
Data and Knowledge Engineering Laboratory
Exact Match
<h 1 1><l 10 10><h 3 3><e 4 4><l 3 3>
Query
Compute k
2)4,5logmin()4,||logmin( 22 Qk
<h 1 1><l 10 10><h 3 3><e 4 4><l 3 3>
Clustering
① k=2 ② k=2
Sort Merge• ID must be equal.• Loc difference must be 1(=<h 1 1>).
Post Processing
Validate result using actual sequences.
Searching
① TypeStr=hlhe, TypeLen=1+10+3+4, LookAhead=l② TypeStr=lhel, TypeLen=10+3+4+3
Data and Knowledge Engineering Laboratory
Range Match
<Range Match>
<e 50 60><l 3 3><h 5 6><e 6 8>
Clustering
TypeStr=elhe, TypeLen=64(50+3+5+6)~ 77(60+3+6+8)
Searching
Range of TypeLen is increased too much.
<Selective Clustering Range Match (SCRM)>
<e 50 60><l 3 3><h 5 6><e 6 8>
Clustering
Comparing Histograms
<e 50 60><l 3 3><h 5 6><e 6 8>①
② ③ ④
Search the cluster which seems to returnthe smallest number of rows.
① k=2, TypeLen=64~77 ② k=1, TypeLen=53~63 ③ k=1, TypeLen=8~9 ④ k=1, TypeLen=11~14
Data and Knowledge Engineering Laboratory
Outline• Introduction• Related Works• Basic Concept
– Segment– Cluster and Look Ahead
• Selectivity Analysis• Algorithm Description
– Exact Match– Range Match
• Experiments• Conclusion and Future Work
Data and Knowledge Engineering Laboratory
Environment• Oracle Database 8.1.7 on Linux
– To store cluster table and cluster indices.
• JDK 1.4.2_04 on Windows XP– To run searching program.
• Protein Data Used– 80,000 amino acid sequences of PDB(Protein Data Bank) were
downloaded.– Secondary structure of protein sequences were predicted using
PREDATOR.– 320,000 sequences were populated by duplicating 80,000 seque
nces 4 times.
Data and Knowledge Engineering Laboratory
Exact Match
0
10
20
30
40
3 5 10 15 20
|Q|
Tim
e i
n S
ec
.
Ours
ISS
SSS
M ISS(2)
0
5
3 5 10 15 20
|Q|
Tim
e in
Sec
.
Ours
MISS(2)
Exact Match Exact Match
Our method is 3~20 times faster than MISS(2) and 8~49 times faster than SSS and ISS.
Data and Knowledge Engineering Laboratory
Range Match
0
5
10
15
3 5 10 15 20
|Q|
Tim
e i
n S
ec
. SCRM
M ISS(2)
0
5
10
15
20
3 5 10 15 20
|Q|
Tim
e i
n S
ec
.
SCRM
M ISS(2)
0
5
10
15
20
3 5 10 15 20
|Q|
Tim
e in
Sec
.
SCRM
MISS(2)
Range Match, First Range Match, Middle
Range Match, End
SCRM is 1.2~6 times faster than MISS(2).
Data and Knowledge Engineering Laboratory
Scalability
0
5
10
15
80,000 160,000 320,000Number of proteins
Tim
e in
Sec
.
Ours
MISS(2)
05
101520253035404550
80,000 160,000 320,000Number of proteins
Tim
e in
Sec
.
Ours(SCRM)
MISS(2)
Exact Match, |Q|=5 Range Match (Middle), |Q|=5
Our methods are 2~4 times faster than MISS(2).
Data and Knowledge Engineering Laboratory
Index Size
Index size increases linearly.
Maximum K value can be set according to expected length of queries.In most cases, maximum K value of 2 will suffice.
0
500
1,000
1,500
2,000
2,500
3,000
0 1 2 3 4
Maximum K value
Siz
e in
MB
Cluster Index
Segment Index
Number of proteins = 320,000
Data and Knowledge Engineering Laboratory
Outline• Introduction• Related Works• Basic Concept
– Segment– Cluster and Look Ahead
• Selectivity Analysis• Algorithm Description
– Exact Match– Range Match
• Experiments• Conclusion and Future Work
Data and Knowledge Engineering Laboratory
Conclusion and Future Work
• We have proposed the concept of clustered indexing, which indexes fixed number of characters, and look ahead, which enhances index selectivity.
• Our experiments show that proposed techniques are more efficient than the previous algorithms.
• In the future, we would like to study and propose approximate searching algorithms for 3-D structures of proteins.
Q & A