LEGO Bricks - Monash University · LEGO Bricks? Can a canonical set (dictionary) of fragments be de...

Statistical Inference of protein “LEGO® Bricks”, Arun Konagurthu, Monash University

Outline: Introduction, Inference, MML, Language of communication, Search, Results

  • Statistical Inference of protein “LEGO® Bricks”

    Arun Konagurthu, Monash University

  • Molecular Biology as an information processing system

    General flow of information in a cell (a.k.a. Crick’s Central Dogma)

    DNA → RNA → Protein

    Protein Sequence → Protein 3D Structure → Biological Function

  • X-Ray Crystallography to elucidate protein 3D structure

  • Atomic resolution protein coordinates

    ATOM   1070  N   VAL A 132      -2.482   8.914  -0.561  1.00 11.79           N
    ATOM   1071  CA  VAL A 132      -1.569   9.061  -1.694  1.00 13.21           C
    ATOM   1072  C   VAL A 132      -0.723   7.794  -1.791  1.00 13.23           C
    ATOM   1073  O   VAL A 132      -0.102   7.399  -0.812  1.00 13.37           O
    ATOM   1074  CB  VAL A 132      -0.640  10.276  -1.458  1.00 12.98           C
    ATOM   1075  CG1 VAL A 132       0.503  10.351  -2.539  1.00 14.72           C
    ATOM   1076  CG2 VAL A 132      -1.459  11.555  -1.436  1.00 14.45           C
    ATOM   1077  N   ILE A 133      -0.710   7.168  -2.970  1.00 13.79           N
    ATOM   1078  CA  ILE A 133       0.030   5.915  -3.215  1.00 14.07           C
    ATOM   1079  C   ILE A 133       0.913   6.209  -4.436  1.00 14.71           C
    ATOM   1080  O   ILE A 133       0.394   6.656  -5.456  1.00 14.40           O
    ATOM   1081  CB  ILE A 133      -0.947   4.735  -3.513  1.00 14.18           C
    ATOM   1082  CG1 ILE A 133      -1.850   4.473  -2.279  1.00 14.77           C
    ATOM   1083  CG2 ILE A 133      -0.196   3.421  -3.848  1.00 14.82           C
    ATOM   1084  CD1 ILE A 133      -3.032   3.568  -2.590  1.00 14.21           C
    ATOM   1085  N   GLY A 134       2.214   5.949  -4.294  1.00 14.24           N
    ATOM   1086  CA  GLY A 134       3.235   6.269  -5.311  1.00 15.35           C
    ATOM   1087  C   GLY A 134       3.228   5.299  -6.492  1.00 14.28           C
    ATOM   1088  O   GLY A 134       2.253   4.603  -6.749  1.00 13.67           O
    ATOM   1089  N   HIS A 135       4.333   5.317  -7.235  1.00 15.70           N
    ATOM   1090  CA  HIS A 135       4.524   4.515  -8.431  1.00 15.51           C
    ATOM   1091  C   HIS A 135       5.078   3.129  -8.074  1.00 14.80           C
    ATOM   1092  O   HIS A 135       5.852   3.020  -7.132  1.00 16.15           O
    ATOM   1093  CB  HIS A 135       5.515   5.236  -9.354  1.00 15.95           C
    ATOM   1094  CG  HIS A 135       5.041   6.585  -9.787  1.00 16.20           C
    ATOM   1095  ND1 HIS A 135       3.928   6.759 -10.577  1.00 19.61           N
    ATOM   1096  CD2 HIS A 135       5.520   7.821  -9.533  1.00 22.36           C
    ATOM   1097  CE1 HIS A 135       3.739   8.049 -10.795  1.00 20.33           C
    ATOM   1098  NE2 HIS A 135       4.689   8.714 -10.167  1.00 21.46           N
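
    The records above follow the PDB ATOM layout. As a minimal sketch, assuming standard fixed-column PDB lines, the Cα trace that the rest of the talk works with could be extracted as follows. The names `Atom`, `parse_atom_line`, `c_alpha_trace` and `SAMPLE` are illustrative and do not come from the slides.

```python
from typing import List, NamedTuple, Tuple


class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain: str
    res_seq: int
    x: float
    y: float
    z: float


def parse_atom_line(line: str) -> Atom:
    """Parse one fixed-column PDB ATOM record (0-based slices of the 80-column line)."""
    return Atom(
        serial=int(line[6:11]),
        name=line[12:16].strip(),
        res_name=line[17:20].strip(),
        chain=line[21],
        res_seq=int(line[22:26]),
        x=float(line[30:38]),
        y=float(line[38:46]),
        z=float(line[46:54]),
    )


def c_alpha_trace(lines: List[str]) -> List[Tuple[float, float, float]]:
    """Return the (x, y, z) coordinates of every C-alpha atom, in file order."""
    atoms = [parse_atom_line(l) for l in lines if l.startswith("ATOM")]
    return [(a.x, a.y, a.z) for a in atoms if a.name == "CA"]


# Three of the records shown on the slide, in fixed-column form.
SAMPLE = [
    "ATOM   1070  N   VAL A 132      -2.482   8.914  -0.561  1.00 11.79           N",
    "ATOM   1071  CA  VAL A 132      -1.569   9.061  -1.694  1.00 13.21           C",
    "ATOM   1078  CA  ILE A 133       0.030   5.915  -3.215  1.00 14.07           C",
]

print(c_alpha_trace(SAMPLE))
```

    Slicing by column rather than splitting on whitespace is deliberate: in real PDB files adjacent fields can touch when a value fills its whole column.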

  • Structures of proteins are...

    ... irregular and complex at the atomic level...

    ...but their folding patterns are simple and elegant.

  • What defines a protein folding pattern?

    Secondary structures = helices and strands of sheet

    Assembly = contact + relative geometry

  • Many proteins have similar folding patterns

    They differ in detail but have general topological properties in common.

  • Recurrent themes in protein 3D structures

    Recurrent structural themes emerge within the great variety of protein structures.

    LEGO Bricks?

    Can a canonical set (dictionary) of fragments be defined of which ALL known protein structures are made?

  • Consider this problem

    Problem

    A card is picked at random and plonked on a table. Let’s say you see a card which is RED facing up. What is the probability that you have RED on the other side of the card?

    (a) 2/3
    (b) 1/2
    (c) too early in the morning
    (d) something else or don’t know

  • Answer

    Bayes’s theorem

    $\Pr(R_1 \,\&\, R_2) = \Pr(R_1) \times \Pr(R_2 \mid R_1)$

    $\tfrac{1}{3} = \tfrac{1}{2} \times \Pr(R_2 \mid R_1)$

    Ans:

    $\Pr(R_2 \mid R_1) = \tfrac{2}{3}$
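
    The transcript does not list the cards, but the values 1/3 and 1/2 in the answer are what one gets from three two-sided cards: red/red, red/blue and blue/blue. Under that assumption (and with the second colour arbitrarily called blue), the answer can be checked by enumerating the six equally likely (face-up, face-down) outcomes. Here R1 is read as "red is facing up" and R2 as "red is on the other side"; `CARDS` and `card_probabilities` are illustrative names.

```python
from fractions import Fraction

# Assumed deck (not stated in the transcript): three two-sided cards.
CARDS = [("R", "R"), ("R", "B"), ("B", "B")]


def card_probabilities(cards=CARDS):
    """Return (Pr(R1), Pr(R1 & R2), Pr(R2 | R1)) by exhaustive enumeration."""
    # Each card is equally likely, and so is each of its two orientations.
    outcomes = [(up, down) for a, b in cards for up, down in ((a, b), (b, a))]
    n = len(outcomes)
    p_r1 = Fraction(sum(up == "R" for up, _ in outcomes), n)
    p_r1_and_r2 = Fraction(
        sum(up == "R" and down == "R" for up, down in outcomes), n
    )
    return p_r1, p_r1_and_r2, p_r1_and_r2 / p_r1


print(card_probabilities())
```

    The intuitive answer 1/2 counts cards; the enumeration counts faces, and two of the three red faces belong to the red/red card.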

  • LEGO Bricks?

    Can a canonical set (dictionary) of fragments be defined of which all known protein structures are made?

    This is an inference problem!

    Hypothesis = Dictionary of fragments
    Data = 3D coordinates of protein structures

  • Information theory and Bayesian inference

    According to Bayes

    $P(H \,\&\, D) = P(H) \times P(D \mid H) = P(D) \times P(H \mid D)$

    According to Shannon

    $I(E) = -\log(P(E))$

    Applying Shannon’s insight to Bayes

    $I(H \,\&\, D) = I(H) + I(D \mid H) = I(D) + I(H \mid D)$

    For two competing hypotheses $H$ and $H'$:

    $I(H \mid D) - I(H' \mid D) = I(H) + I(D \mid H) - I(H') - I(D \mid H')$
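
    A small numeric sketch of the last identity, assuming base-2 logarithms so that lengths come out in bits. The priors and likelihoods below are made-up numbers chosen only for illustration, and `info_bits` and `two_part_length` are hypothetical helper names.

```python
import math


def info_bits(p: float) -> float:
    """Shannon information I(E) = -log2 P(E), in bits."""
    return -math.log2(p)


def two_part_length(prior_h: float, likelihood_d_given_h: float) -> float:
    """I(H) + I(D|H): the length of stating H and then D given H."""
    return info_bits(prior_h) + info_bits(likelihood_d_given_h)


# Hypothetical values: equal priors, H' explains the data four times better.
len_h = two_part_length(prior_h=0.5, likelihood_d_given_h=0.01)
len_h_prime = two_part_length(prior_h=0.5, likelihood_d_given_h=0.04)

# I(D) cancels in the difference, so this equals I(H|D) - I(H'|D),
# which is log2 of the posterior odds in favour of H'.
print(len_h - len_h_prime)
```

    Because I(D) is the same for both hypotheses, it never has to be computed: comparing two-part lengths is enough to rank them.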

  • Minimum Message Length Criterion

    First practical demonstration of Bayesian inference using information theory.

    Introduced by the late Professor Chris Wallace in 1968 (Computer Journal, 11(2), 185-194).

    Links lossless compression with statistical inference and model selection.

  • MML best understood as a communication process

    Two part message

    1 Assertion: transmit hypothesis H, taking I(H) bits.

    2 Detail: transmit the observed data D given H, taking I(D|H) bits.

    Objective

    Find the hypothesis H that, for the given data D, minimizes the two-part message length I(H) + I(D|H).
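
    A toy illustration of the two-part trade-off, not Wallace's actual formulation: Alice sends coin flips to Bob. The assertion names one of k candidate biases at an assumed cost of log2(k) bits (uniform code, cost of stating k itself ignored), and the detail encodes the flips given that bias. The function names and the grid of candidate biases are assumptions made for this sketch.

```python
import math


def detail_bits(heads: int, tails: int, p: float) -> float:
    """I(D|H): bits needed to send the flips given the stated bias p."""
    return -(heads * math.log2(p) + tails * math.log2(1 - p))


def best_two_part_message(heads: int, tails: int, max_levels: int = 64):
    """Search grids of k candidate biases; return (total_bits, k, p) of the shortest message."""
    best = None
    for k in range(1, max_levels + 1):
        assertion = math.log2(k)  # I(H): which of the k grid cells was chosen
        for j in range(k):
            p = (j + 0.5) / k  # midpoint of cell j, so p is never 0 or 1
            total = assertion + detail_bits(heads, tails, p)
            if best is None or total < best[0]:
                best = (total, k, p)
    return best


print(best_two_part_message(50, 50))  # balanced data: no need to state a bias precisely
print(best_two_part_message(90, 10))  # skewed data: paying for a stated bias is worthwhile
```

    Stating the bias more precisely lengthens the assertion but shortens the detail; the minimum of the sum decides how much precision the data justify.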

  • Overview (Alice’s point of view)

  • Overview (Bob’s point of view)

  • How does Alice compress any protein structure using a dictionary of fragments?

    Alice dissects the protein structure into non-overlapping regions.

    Each region is assigned to a dictionary fragment.

    The coordinates in each region of the protein structure are transmitted with respect to the assigned dictionary fragment.

  • What Alice needs to send Bob ...

    the number of regions in the dissection

    the indexes of start and end points of each region

    the index of the fragment in the dictionary assigned to encode each region in the dissection.

    the details of the coordinates in the region USING its assigned dictionary fragment.
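
    The four items above can be tallied as a message length. The slides do not give the codes used for each item, so the sketch below assumes simple uniform codes: log2(n) bits for the region count and for each boundary index, and log2(|dictionary| + 1) bits for the fragment choice, with the extra slot standing for the null model. The per-region detail cost is passed in as a number; `dissection_bits` is a hypothetical name.

```python
import math
from typing import List, Tuple


def dissection_bits(
    n_residues: int,
    dictionary_size: int,
    regions: List[Tuple[int, int, float]],
) -> float:
    """Total bits for one dissection.

    regions: (start_index, end_index, detail_bits) per region, where detail_bits
    is the cost of the region's coordinates given its assigned fragment.
    All header codes are assumed uniform; this is an illustration only.
    """
    index_bits = math.log2(n_residues)  # one residue index
    fragment_bits = math.log2(dictionary_size + 1)  # +1 for the null model
    total = index_bits  # the number of regions
    for _start, _end, detail in regions:
        total += 2 * index_bits  # start and end point of the region
        total += fragment_bits  # which dictionary fragment encodes it
        total += detail  # coordinates given that fragment
    return total


# 16 residues, 3 dictionary fragments, two regions sharing residue 8.
print(dissection_bits(16, 3, [(1, 8, 10.0), (8, 16, 20.0)]))
```

    Every extra region pays the header cost again, which is what discourages needlessly fine dissections.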

  • Explain a protein region using some dictionary element

  • Implicit null model (not a part of the dictionary)

    Coil model

    Explain the coordinates $P_i, \dots, P_j$ as is! (That is, using the null model message.)

  • The search for the best dissection of a particular protein structure

    Given protein coordinates $P_1, \dots, P_n$, a general dissection is of the form:

    $P_1, \dots, P_i \;\mid\; P_i, \dots, P_j \;\mid\; P_j, \dots, P_k \;\mid\; \cdots \;\mid\; P_m, \dots, P_n$

  • The search for the best dissection of a particular protein structure

    Given the coordinates of a particular protein structure, find the dissection and corresponding dictionary fragment assignment that minimizes the total message length required to transmit its coordinates.
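
    The slides do not say how this search is carried out. One standard way to minimize an additive cost over contiguous dissections is dynamic programming, sketched below under the assumption that the cost of a region depends only on its two end points. `region_cost(i, j)` is a hypothetical callback that would return the cheapest encoding of points i..j, using the best dictionary fragment or the null model. Consecutive regions share an end point, as in the dissection notation of the previous slide.

```python
from typing import Callable, List, Tuple


def best_dissection(
    n: int, region_cost: Callable[[int, int], float]
) -> Tuple[float, List[Tuple[int, int]]]:
    """Minimize the summed region cost over all dissections of points 0..n-1.

    best[j] is the cheapest way to encode points 0..j with a region ending at j.
    """
    inf = float("inf")
    best = [inf] * n
    back = [0] * n
    best[0] = 0.0
    for j in range(1, n):
        for i in range(j):
            cost = best[i] + region_cost(i, j)
            if cost < best[j]:
                best[j], back[j] = cost, i
    # Walk the back-pointers to recover the regions.
    regions = []
    j = n - 1
    while j > 0:
        regions.append((back[j], j))
        j = back[j]
    return best[n - 1], regions[::-1]


# Toy costs: a fixed per-region header plus a penalty that grows with length.
print(best_dissection(5, lambda i, j: 1 + (j - i) ** 2))
print(best_dissection(5, lambda i, j: 5 + (j - i)))
```

    The table has n entries and each is filled by scanning all earlier cut points, so the sketch makes O(n²) calls to `region_cost`.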

  • Defining the best dictionary

    1 Given a dictionary of fragments, an optimal encoding of a particular protein is the combination of assignments of regions to dictionary fragments, and statements of deviations relative to such a segmentation, that is of minimal length.

    2 An optimal encoding of a collection of protein structures, all using the same given dictionary, involves the one-off cost of stating the dictionary, followed by the encodings of the individual proteins.

    3 An optimal dictionary for any collection of protein structures is one for which the sum of the optimal encodings of all the proteins in the set has the shortest length.

  • Search for the optimal dictionary

    Simulated Annealing

    Add: Append to the current dictionary a new element, namely a randomly chosen fragment of randomly chosen length sampled from the collection.

    Provided the current dictionary is non-empty:

    Remove: Remove a randomly chosen element.

    Swap: Replace a randomly chosen element in the current dictionary with another fragment. This is equivalent to a ‘Remove’ followed by an ‘Add’.

    Perturb length: Grow or shrink a randomly chosen element from the current dictionary by one residue at either end.
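
    A runnable sketch of the four moves inside a simulated-annealing loop. To stay self-contained it uses strings as a stand-in for protein structures and a toy cost (8 bits per literal symbol, uniform fragment codes); the cooling schedule, the acceptance rule and every name below are assumptions, not details taken from the slides.

```python
import math
import random


def random_fragment(collection, rng, min_len=2, max_len=6):
    """A fragment is (string index, start, length); strings need >= min_len symbols."""
    si = rng.randrange(len(collection))
    length = rng.randint(min_len, min(max_len, len(collection[si])))
    return (si, rng.randrange(len(collection[si]) - length + 1), length)


def propose(dictionary, collection, rng):
    """Apply one randomly chosen move (Add, Remove, Swap, Perturb length) to a copy."""
    new = list(dictionary)
    move = rng.choice(["add", "remove", "swap", "perturb"] if new else ["add"])
    if move == "add":
        new.append(random_fragment(collection, rng))
    elif move == "remove":
        new.pop(rng.randrange(len(new)))
    elif move == "swap":
        new[rng.randrange(len(new))] = random_fragment(collection, rng)
    else:  # grow or shrink by one symbol at either end
        k = rng.randrange(len(new))
        si, st, ln = new[k]
        delta = rng.choice([1, -1])
        if rng.choice([True, False]):  # left end moves
            st, ln = st - delta, ln + delta
        else:  # right end moves
            ln = ln + delta
        if st >= 0 and ln >= 1 and st + ln <= len(collection[si]):
            new[k] = (si, st, ln)
    return new


def total_bits(dictionary, collection, literal_bits=8.0):
    """Toy two-part cost: state the dictionary once, then encode every string."""
    frags = {collection[si][st:st + ln] for si, st, ln in dictionary}
    total = sum(literal_bits * len(f) for f in frags)  # one-off dictionary cost
    code_bits = math.log2(len(frags) + 1)  # fragment choice or literal fallback
    for s in collection:
        best = [0.0] + [math.inf] * len(s)
        for j in range(1, len(s) + 1):
            best[j] = best[j - 1] + code_bits + literal_bits  # literal symbol
            for f in frags:
                if len(f) <= j and s[j - len(f):j] == f:
                    best[j] = min(best[j], best[j - len(f)] + code_bits)
        total += best[len(s)]
    return total


def anneal(collection, steps=2000, t0=50.0, seed=0):
    """Return (best dictionary, its cost) found from an empty starting dictionary."""
    rng = random.Random(seed)
    current, cur_cost = [], total_bits([], collection)
    best, best_cost = current, cur_cost
    for step in range(steps):
        temp = t0 * (1 - step / steps) + 1e-9  # linear cooling (an assumption)
        cand = propose(current, collection, rng)
        cost = total_bits(cand, collection)
        if cost <= cur_cost or rng.random() < math.exp((cur_cost - cost) / temp):
            current, cur_cost = cand, cost
            if cost < best_cost:
                best, best_cost = cand, cost
    return best, best_cost


TOY = ["abcabcabcabc", "xxabcabcxx"]
print(total_bits([], TOY), anneal(TOY, steps=500))
```

    Worse dictionaries are sometimes accepted while the temperature is high, which lets the search escape dictionaries that no single move can improve.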

  • Results – Dictionary

    For a fragment to be included in the dictionary, our MML formulation provides a natural trade-off between

    length

    fidelity

    frequency

  • Results – Dictionary

    The optimal dictionary contains 1711 fragments.

    They vary in length between 4 and 31 residues.

    Rounds up the usual suspects: helices, strands, hairpins.

    Many more beyond those (many of them yet to be analyzed)

  • Some fragments from the dictionary

  • Example segmentation of Flavodoxin

  • How accurate are the reported protein coordinate data?

    [Figure: “AOM vs BITS”. x-axis: accuracy of measurement of coordinates, log scale from 10^-3 to 10^0; y-axis: number of bits per Cα coordinate, from 10 to 40; curves: Null model and Building Blocks Dictionary.]

  • This has opened up new directions of research

    1 What is the grammar of these dictionary building blocks?

    2 One-dimensional characterization of 3D structures?

    3 Ab initio structure prediction from amino acid sequence?

  • Outstanding Collaborators that made the work possible!

    Lloyd Allison (Monash), Arthur Lesk (Penn State, USA), David Abramson (Univ. Queensland), Peter Stuckey (Univ. Melbourne)

  • QUESTIONS?
