LEGO Bricks - Monash University · LEGO Bricks? Can a canonical set (dictionary) of fragments be de...
Introduction · Inference · MML · Language of communication · Search · Results
Statistical Inference of protein “LEGO® Bricks”
Arun Konagurthu, Monash University
Building blocks
Molecular Biology as an information processing system
General flow of information in a cell (a.k.a. Crick’s Central Dogma)
DNA
⇓
RNA
⇓
Protein
Protein Sequence
⇓
Protein 3D Structure
⇓
Biological Function
X-Ray Crystallography to elucidate protein 3D structure
Atomic resolution protein coordinates
ATOM 1070 N VAL A 132 -2.482 8.914 -0.561 1.00 11.79 N
ATOM 1071 CA VAL A 132 -1.569 9.061 -1.694 1.00 13.21 C
ATOM 1072 C VAL A 132 -0.723 7.794 -1.791 1.00 13.23 C
ATOM 1073 O VAL A 132 -0.102 7.399 -0.812 1.00 13.37 O
ATOM 1074 CB VAL A 132 -0.640 10.276 -1.458 1.00 12.98 C
ATOM 1075 CG1 VAL A 132 0.503 10.351 -2.539 1.00 14.72 C
ATOM 1076 CG2 VAL A 132 -1.459 11.555 -1.436 1.00 14.45 C
ATOM 1077 N ILE A 133 -0.710 7.168 -2.970 1.00 13.79 N
ATOM 1078 CA ILE A 133 0.030 5.915 -3.215 1.00 14.07 C
ATOM 1079 C ILE A 133 0.913 6.209 -4.436 1.00 14.71 C
ATOM 1080 O ILE A 133 0.394 6.656 -5.456 1.00 14.40 O
ATOM 1081 CB ILE A 133 -0.947 4.735 -3.513 1.00 14.18 C
ATOM 1082 CG1 ILE A 133 -1.850 4.473 -2.279 1.00 14.77 C
ATOM 1083 CG2 ILE A 133 -0.196 3.421 -3.848 1.00 14.82 C
ATOM 1084 CD1 ILE A 133 -3.032 3.568 -2.590 1.00 14.21 C
ATOM 1085 N GLY A 134 2.214 5.949 -4.294 1.00 14.24 N
ATOM 1086 CA GLY A 134 3.235 6.269 -5.311 1.00 15.35 C
ATOM 1087 C GLY A 134 3.228 5.299 -6.492 1.00 14.28 C
ATOM 1088 O GLY A 134 2.253 4.603 -6.749 1.00 13.67 O
ATOM 1089 N HIS A 135 4.333 5.317 -7.235 1.00 15.70 N
ATOM 1090 CA HIS A 135 4.524 4.515 -8.431 1.00 15.51 C
ATOM 1091 C HIS A 135 5.078 3.129 -8.074 1.00 14.80 C
ATOM 1092 O HIS A 135 5.852 3.020 -7.132 1.00 16.15 O
ATOM 1093 CB HIS A 135 5.515 5.236 -9.354 1.00 15.95 C
ATOM 1094 CG HIS A 135 5.041 6.585 -9.787 1.00 16.20 C
ATOM 1095 ND1 HIS A 135 3.928 6.759 -10.577 1.00 19.61 N
ATOM 1096 CD2 HIS A 135 5.520 7.821 -9.533 1.00 22.36 C
ATOM 1097 CE1 HIS A 135 3.739 8.049 -10.795 1.00 20.33 C
ATOM 1098 NE2 HIS A 135 4.689 8.714 -10.167 1.00 21.46 N
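The records above can be read programmatically. The following is a minimal sketch, assuming the whitespace-separated layout shown on this slide (real PDB files use fixed columns, so a production parser should slice by column instead). The names `Atom`, `parse_atom_line` and `c_alpha_trace` are hypothetical helpers introduced only for illustration; the sketch keeps the Cα atom of each residue.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain: str
    res_seq: int
    x: float
    y: float
    z: float


def parse_atom_line(line: str) -> Atom:
    """Parse one whitespace-separated ATOM record as printed on the slide."""
    fields = line.split()
    if len(fields) < 9 or fields[0] != "ATOM":
        raise ValueError(f"not an ATOM record: {line!r}")
    return Atom(int(fields[1]), fields[2], fields[3], fields[4], int(fields[5]),
                float(fields[6]), float(fields[7]), float(fields[8]))


def c_alpha_trace(lines):
    """Return (residue number, residue name, (x, y, z)) for every CA atom."""
    atoms = (parse_atom_line(l) for l in lines if l.startswith("ATOM"))
    return [(a.res_seq, a.res_name, (a.x, a.y, a.z)) for a in atoms if a.name == "CA"]


records = [
    "ATOM 1070 N VAL A 132 -2.482 8.914 -0.561 1.00 11.79 N",
    "ATOM 1071 CA VAL A 132 -1.569 9.061 -1.694 1.00 13.21 C",
    "ATOM 1078 CA ILE A 133 0.030 5.915 -3.215 1.00 14.07 C",
]
trace = c_alpha_trace(records)
```

One point per residue is a common simplification when comparing backbone shapes; whether the talk's pipeline reads coordinates this way is not stated on the slides.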
Structures of proteins are...
... irregular and complex at the atomic level...
... but their folding patterns are simple and elegant.
What defines a protein folding pattern?
Secondary structures = helices and strands of sheet
Assembly = contact + relative geometry
Many proteins have similar folding patterns
They differ in detail but have general topological properties in common.
Recurrent themes in protein 3D structures
Recurrent structural themes emerge within the great variety of protein structures.
LEGO Bricks?
Can a canonical set (dictionary) of fragments be defined of which ALL known protein structures are made?
Consider this problem
Problem
A card is picked at random and plonked on a table. Let’s say you see a card which is RED facing up. What is the probability that you have RED on the other side of the card?
(a) 2/3
(b) 1/2
(c) too early in the morning
(d) something else or don’t know
Answer
Bayes’s theorem
$\Pr(R_1 \,\&\, R_2) = \Pr(R_1) \times \Pr(R_2 \mid R_1)$
$\tfrac{1}{3} = \tfrac{1}{2} \times \Pr(R_2 \mid R_1)$
Ans: $\Pr(R_2 \mid R_1) = \tfrac{2}{3}$
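The answer can be checked by exhaustive enumeration. This sketch assumes the classic three-card deck (one card red on both sides, one red on one side and blue on the other, one blue on both sides); the transcript does not describe the deck, but this setup reproduces the values 1/2 and 1/3 used in the answer. Here R1 means "red faces up" and R2 means "red is on the other side".

```python
from fractions import Fraction

# Assumed deck (not stated in the transcript): red/red, red/blue, blue/blue.
CARDS = [("R", "R"), ("R", "B"), ("B", "B")]


def outcomes():
    """Yield (face_up, face_down, probability) for every equally likely outcome."""
    p = Fraction(1, 2 * len(CARDS))  # pick a card, then pick which side faces up
    for a, b in CARDS:
        yield a, b, p
        yield b, a, p


pr_r1 = sum(p for up, _, p in outcomes() if up == "R")
pr_r1_and_r2 = sum(p for up, down, p in outcomes() if up == "R" and down == "R")
pr_r2_given_r1 = pr_r1_and_r2 / pr_r1

print(pr_r1, pr_r1_and_r2, pr_r2_given_r1)  # 1/2 1/3 2/3
```

The tempting answer 1/2 comes from counting cards rather than faces: of the three red faces that could be showing, two belong to the red/red card.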
LEGO Bricks?
Can a canonical set (dictionary) of fragments be defined of which all known protein structures are made?
This is an inference problem!
Hypothesis = Dictionary of fragments
Data = 3D coordinates of protein structures
Information theory and Bayesian inference
According to Bayes
$P(H \,\&\, D) = P(H) \times P(D \mid H) = P(D) \times P(H \mid D)$
According to Shannon
$I(E) = -\log(P(E))$
Applying Shannon’s insight to Bayes
$I(H \,\&\, D) = I(H) + I(D \mid H) = I(D) + I(H \mid D)$
For two competing hypotheses $H$ and $H'$:
$I(H \mid D) - I(H' \mid D) = I(H) + I(D \mid H) - I(H') - I(D \mid H')$
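A small numeric check of the last identity may help. The priors and likelihoods below are hypothetical numbers chosen only for illustration, and logarithms are taken to base 2 so that lengths are in bits. The sketch shows that the difference in posterior information between two hypotheses equals the difference of their two-part lengths, so $I(D)$ never needs to be computed to compare them.

```python
import math


def info_bits(p: float) -> float:
    """Shannon information I(E) = -log2 P(E), in bits."""
    return -math.log2(p)


# Hypothetical numbers, for illustration only.
prior = {"H": 0.25, "H'": 0.75}        # P(H), P(H')
likelihood = {"H": 0.5, "H'": 0.125}   # P(D|H), P(D|H')

# Two-part length I(H) + I(D|H) for each hypothesis.
joint_len = {h: info_bits(prior[h]) + info_bits(likelihood[h]) for h in prior}

# Posterior via Bayes, using P(D) = sum over hypotheses of P(H) * P(D|H).
p_data = sum(prior[h] * likelihood[h] for h in prior)
posterior = {h: prior[h] * likelihood[h] / p_data for h in prior}

lhs = info_bits(posterior["H"]) - info_bits(posterior["H'"])
rhs = joint_len["H"] - joint_len["H'"]
print(joint_len, lhs, rhs)
```

In this toy case H has the shorter two-part length, which is the same as saying it has the higher posterior probability.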
Minimum Message Length Criterion
First practical demonstration of Bayesian inference using information theory.
Introduced by the late Professor Chris Wallace in 1968 (Computer Journal, 11(2), 185-194).
Links lossless compression with statistical inference and model selection.
MML best understood as a communication process
Two-part message
1 Assertion: transmit hypothesis H, taking I(H) bits.
2 Detail: transmit the observed data D given H, taking I(D|H) bits.
Objective
Find the hypothesis H on the given data D which minimizes the two-part message length.
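The two-part message can be illustrated on a toy problem unrelated to proteins: coin flips explained by a Bernoulli parameter p. This is only a sketch under simplifying assumptions: the candidate values of p form a small grid, each equally likely a priori, so the assertion costs log2(grid size) bits. It is not the encoding used in the talk, and full MML treatments also optimise the precision to which parameters are stated.

```python
import math


def detail_bits(data, p):
    """I(D|H): bits to state the flips given the Bernoulli parameter p."""
    return -sum(math.log2(p if x == 1 else 1.0 - p) for x in data)


def message_length(data, p, n_candidates):
    """I(H) + I(D|H), with I(H) = log2(n_candidates) under a uniform prior."""
    return math.log2(n_candidates) + detail_bits(data, p)


data = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1]  # 13 ones, 3 zeros

grid = [0.2, 0.4, 0.6, 0.8]  # assumed candidate hypotheses
lengths = {p: message_length(data, p, len(grid)) for p in grid}
best_p = min(lengths, key=lengths.get)

# A "fair coin" message has nothing to assert: 0 bits + 1 bit per flip.
fair_coin = message_length(data, 0.5, 1)
print(best_p, round(lengths[best_p], 2), fair_coin)
```

Stating p costs 2 extra bits here, but it shortens the detail enough that the total beats the fair-coin message; with less skewed data the simpler hypothesis would win.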
Overview (Alice’s point of view)
Overview (Bob’s point of view)
How does Alice compress any protein structure using a dictionary of fragments?
Alice dissects the protein structure into non-overlapping regions.
Each region is assigned to a dictionary fragment.
The coordinates in each region of the protein structure are transmitted with respect to the assigned dictionary fragment.
What Alice needs to send Bob ...
the number of regions in the dissection
the indexes of start and end points of each region
the index of the fragment in the dictionary assigned to encode each region in the dissection.
the details of the coordinates in the region USING its assigned dictionary fragment.
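The four items above can be tallied as a message length. The sketch below assumes uniform codes for the counts and indexes (log2 of the number of choices), which the slides do not specify, and takes the cost of the coordinate details from a hypothetical callable `detail_bits`. It illustrates only the bookkeeping, not the talk's actual encoding.

```python
import math


def uniform_bits(n_choices: int) -> float:
    """Bits to state one of n equally likely choices."""
    return math.log2(n_choices)


def dissection_message_bits(n_residues, regions, dict_size, detail_bits):
    """Total bits Alice sends for one protein.

    regions: list of (start, end, fragment_index) tuples.
    detail_bits: hypothetical callable (start, end, fragment_index) -> bits
        needed to state the region's coordinates given its fragment.
    """
    total = uniform_bits(n_residues)            # the number of regions
    for start, end, frag in regions:
        total += 2 * uniform_bits(n_residues)   # start and end indexes
        total += uniform_bits(dict_size)        # index of the assigned fragment
        total += detail_bits(start, end, frag)  # coordinates given the fragment
    return total


# Toy usage: 16 residues, 2 regions, a 4-fragment dictionary,
# and a made-up detail cost of 10 bits per residue step.
total = dissection_message_bits(
    n_residues=16,
    regions=[(0, 7, 3), (7, 15, 0)],
    dict_size=4,
    detail_bits=lambda s, e, f: 10.0 * (e - s),
)
print(total)  # 174.0
```

Every extra region adds overhead for its end points and fragment index, so a finer dissection pays off only when it reduces the detail cost by more than that overhead.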
Explain a protein region using some dictionary element
Implicit null model (not a part of the dictionary)
Coil model
Explain the coordinates $P_i, \ldots, P_j$ as is! (That is, using the null model message.)
The search for the best dissection of a particular protein structure
Given protein coordinates $P_1, \ldots, P_n$, a general dissection is of the form:
$P_1, \ldots, P_i$
$P_i, \ldots, P_j$
$P_j, \ldots, P_k$
$\vdots$
$P_m, \ldots, P_n$
Given the coordinates of a particular protein structure, find the dissection and corresponding dictionary fragment assignment that minimizes the total message length required to transmit its coordinates.
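The slides do not say how this minimization is carried out. One standard option, offered here purely as an assumption, is dynamic programming over region boundaries. The callable `region_bits(i, j)` is hypothetical: it stands for the bits needed to encode the region from point i to point j, already minimised over all dictionary fragments and the null model.

```python
def best_dissection(n_points, region_bits):
    """Minimise total bits over dissections of points 0..n_points-1.

    Adjacent regions share their boundary point, following the form
    P1..Pi, Pi..Pj, ... used in the talk.
    Returns (total_bits, list of (i, j) regions in order).
    """
    inf = float("inf")
    best = [inf] * n_points   # best[j]: cheapest encoding of points 0..j
    back = [None] * n_points  # back[j]: start of the last region ending at j
    best[0] = 0.0
    for j in range(1, n_points):
        for i in range(j):
            cand = best[i] + region_bits(i, j)
            if cand < best[j]:
                best[j], back[j] = cand, i
    # Walk the back-pointers from the last point to recover the regions.
    regions, j = [], n_points - 1
    while j > 0:
        regions.append((back[j], j))
        j = back[j]
    return best[-1], regions[::-1]


# Toy cost: 5 bits of overhead per region plus a penalty growing with its span.
total, regions = best_dissection(7, lambda i, j: 5.0 + (j - i) ** 2)
print(total, regions)  # 27.0 [(0, 2), (2, 4), (4, 6)]
```

This runs in time quadratic in the number of points, times the cost of each `region_bits` call; capping the maximum region length would reduce that further.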
Defining the best dictionary
1 Given a dictionary of fragments, an optimal encoding of a particular protein is the combination of assignments of regions to dictionary fragments, and statements of deviations relative to such a segmentation, that is of minimal length.
2 An optimal encoding of a collection of protein structures, all using the same given dictionary, involves the one-off cost of stating the dictionary, followed by the encodings of the individual proteins.
3 An optimal dictionary for any collection of protein structures is one for which the sum of the optimal encodings of all the proteins in the set has the shortest length.
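The three definitions can be written compactly. The notation is an assumption introduced here for illustration and does not appear on the slides: $\mathcal{F}$ is a dictionary, $\mathcal{S}$ the collection of structures, and $\sigma$ a dissection of structure $s$ together with its fragment assignments.

```latex
\begin{align*}
I^{*}(s \mid \mathcal{F}) &= \min_{\sigma} \bigl[ I(\sigma) + I(s \mid \mathcal{F}, \sigma) \bigr]
  && \text{(1: optimal encoding of one protein)} \\
I(\mathcal{F}, \mathcal{S}) &= I(\mathcal{F}) + \sum_{s \in \mathcal{S}} I^{*}(s \mid \mathcal{F})
  && \text{(2: dictionary stated once, then each protein)} \\
\mathcal{F}^{*} &= \arg\min_{\mathcal{F}} I(\mathcal{F}, \mathcal{S})
  && \text{(3: optimal dictionary)}
\end{align*}
```

The term $I(\mathcal{F})$ is what stops the dictionary from growing without bound: a new fragment is worth adding only if it saves more bits across the collection than it costs to state.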
Search for the optimal dictionary
Simulated Annealing
Add: Append to the current dictionary a new element, being a randomly chosen fragment of randomly chosen length sampled from the collection.
Provided the current dictionary is non-empty:
Remove: Remove a randomly chosen element.
Swap: Replace a randomly chosen element in the current dictionary with another fragment. This is equivalent to a ‘Remove’ followed by an ‘Add’.
Perturb length: Grow or shrink a randomly chosen element from the current dictionary by one residue at either end.
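The four moves can be sketched as a generic simulated-annealing loop. Everything beyond the moves themselves is an assumption made for illustration: the representation of a fragment as a (protein index, start, length) triple, the length bounds, the Metropolis acceptance rule, the geometric cooling schedule, and the objective `total_bits`, which stands for the total message length of the collection under a given dictionary.

```python
import math
import random


def propose(dictionary, protein_lengths, rng, min_len=3, max_len=10):
    """Return a modified copy of the dictionary using one of the four moves."""
    def random_fragment():
        k = rng.randrange(len(protein_lengths))
        length = rng.randint(min_len, max_len)
        start = rng.randrange(protein_lengths[k] - length + 1)
        return (k, start, length)

    new = list(dictionary)
    # Remove, swap and perturb are only available on a non-empty dictionary.
    move = rng.choice(["add", "remove", "swap", "perturb"]) if new else "add"
    if move == "add":
        new.append(random_fragment())
    elif move == "remove":
        new.pop(rng.randrange(len(new)))
    elif move == "swap":
        new[rng.randrange(len(new))] = random_fragment()
    else:  # grow or shrink by one residue at either end
        idx = rng.randrange(len(new))
        k, start, length = new[idx]
        delta = rng.choice([1, -1])
        if rng.choice([True, False]):  # change the start end
            start, length = start - delta, length + delta
        else:                          # change the far end
            length += delta
        if start >= 0 and length >= 1 and start + length <= protein_lengths[k]:
            new[idx] = (k, start, length)
    return new


def anneal(total_bits, protein_lengths, steps=2000, t0=10.0, cooling=0.995, seed=0):
    """Minimise total_bits(dictionary) by simulated annealing."""
    rng = random.Random(seed)
    current, cur_len = [], total_bits([])
    best, best_len = current, cur_len
    temp = t0
    for _ in range(steps):
        cand = propose(current, protein_lengths, rng)
        cand_len = total_bits(cand)
        # Always accept improvements; accept worse states with Metropolis probability.
        if cand_len <= cur_len or rng.random() < math.exp((cur_len - cand_len) / temp):
            current, cur_len = cand, cand_len
            if cur_len < best_len:
                best, best_len = current, cur_len
        temp *= cooling
    return best, best_len


def toy_total_bits(dictionary):
    """Stand-in objective: 5 bits per fragment plus a data cost that shrinks with size."""
    return 5.0 * len(dictionary) + 200.0 / (1 + len(dictionary))


best, best_len = anneal(toy_total_bits, protein_lengths=[50, 80, 120])
print(len(best), round(best_len, 2))
```

In a real run each call to `total_bits` would re-encode the whole collection, so the objective dominates the running time.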
Results – Dictionary
For a fragment to be included in the dictionary, our MML formulation provides a natural trade-off between
length
fidelity
frequency
The optimal dictionary contains 1711 fragments.
They vary in length between 4 and 31 residues.
Rounds up the usual suspects: helices, strands, hairpins.
Many more beyond those (many of them still yet to be analyzed)
Some fragments from the dictionary
Example segmentation of Flavodoxin
How accurate are the reported protein coordinate data?
[Figure “AOM vs BITS”: number of bits per Cα coordinate (y-axis) against accuracy of measurement of coordinates (x-axis, 10^-3 to 10^0), with curves for the null model and the building-blocks dictionary.]
This has opened up new directions of research
1 What is the grammar of these dictionary building blocks?
2 One-dimensional characterization of 3D structures?
3 Ab initio structure prediction from amino acid sequence?
Outstanding Collaborators that made the work possible!
Lloyd Allison (Monash)
Arthur Lesk (Penn State, USA)
David Abramson (Univ. Queensland)
Peter Stuckey (Univ. Melbourne)
QUESTIONS?