Protein structure prediction using metagenome sequence data Alex Chu CS 371 - Prof. Ron Dror January 23, 2018

Page 1

Protein structure prediction using metagenome sequence data

Alex Chu

CS 371 - Prof. Ron Dror

January 23, 2018

Page 2

4,752 contain at least one experimentally solved 3D structure

4,886 can be modeled to some extent using comparative homology modeling

5,211 with no known structure and no structurally characterized homologs...

14,849 Pfam protein families
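
As a quick arithmetic check, the three categories above add up to the 14,849 Pfam families, and each covers roughly a third of them. A minimal sketch (the dictionary labels are shorthand introduced here, not the authors' terms):

```python
# Counts are taken from the slide; the labels are hypothetical shorthand.
pfam_counts = {
    "experimentally solved structure": 4752,
    "comparative (homology) modeling": 4886,
    "no structure, no characterized homolog": 5211,
}


def coverage_fractions(counts):
    """Return each category's share of the total number of families."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


total_families = sum(pfam_counts.values())
for label, share in coverage_fractions(pfam_counts).items():
    print(f"{label}: {share:.1%}")
print("total families:", total_families)
```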

Page 3

Ab initio/de novo structure prediction is not very good… so use evolutionary couplings

Balakrishnan et al, Proteins 2011

Page 4

Approach

Earlier methods calculated a simple covariance matrix, but this produced too many false positives (positions that covary but are not structurally linked).

GREMLIN is one technique that mitigates this error (learns a probabilistic graphical model from a multiple sequence alignment)
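
To illustrate why a naive pairwise score overcalls contacts, here is a minimal sketch that scores column pairs of a toy alignment by mutual information. Mutual information is used as a stand-in for the "simple covariance" idea; the toy alignment and function names are assumptions made for illustration, and this is not GREMLIN, which fits a global model over all positions at once.

```python
from collections import Counter
from math import log2


def column(msa, i):
    """Return the residues found at alignment position i."""
    return [seq[i] for seq in msa]


def mutual_information(msa, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(msa)
    p_i = Counter(column(msa, i))
    p_j = Counter(column(msa, j))
    p_ij = Counter(zip(column(msa, i), column(msa, j)))
    mi = 0.0
    for (a, b), count in p_ij.items():
        p_ab = count / n
        mi += p_ab * log2(p_ab / ((p_i[a] / n) * (p_j[b] / n)))
    return mi


# Hypothetical toy alignment: columns 0, 1 and 2 vary together, column 3 is conserved.
toy_msa = ["AKLE", "AKLE", "GRIE", "GRIE"]

for i, j in [(0, 1), (1, 2), (0, 2), (0, 3)]:
    print((i, j), round(mutual_information(toy_msa, i, j), 3))
```

If positions 0 and 1 touch and positions 1 and 2 touch, positions 0 and 2 score just as high here even though they need not be in contact. That chained, indirect signal is the kind of false positive the slide describes.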

Page 5

Approach

Combine GREMLIN with existing de novo prediction software from Rosetta

… but still limited by the amount of sequence data available.

Page 6

~2 billion partial and full-length proteins from ~5000 metagenomes from the Integrated Microbial Genomes database

Use metagenomics data!

Wooley et al, PLOS Comput Biol 2010

Page 7

Nf: a metric that describes how amenable a protein family is to this method, by relating the length of the protein, the number of sequences in the family, and the diversity of the sequences
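
The slide does not give the formula for Nf. The sketch below assumes a commonly described form: the number of non-redundant sequences (after greedy clustering at 80% identity) divided by the square root of the alignment length. The threshold, the clustering procedure, and all function names are assumptions for illustration.

```python
from math import sqrt


def identity(a, b):
    """Fraction of identical positions between two aligned, equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def count_nonredundant(msa, threshold=0.8):
    """Greedy clustering: keep a sequence only if it is below the identity
    threshold against every representative kept so far."""
    representatives = []
    for seq in msa:
        if all(identity(seq, rep) < threshold for rep in representatives):
            representatives.append(seq)
    return len(representatives)


def nf(msa, threshold=0.8):
    """Assumed form of Nf: non-redundant sequence count / sqrt(alignment length)."""
    return count_nonredundant(msa, threshold) / sqrt(len(msa[0]))


# Hypothetical toy alignment of length 4; the second sequence is redundant.
toy_family = ["AAAA", "AAAA", "AAAC", "CCCC"]
print(count_nonredundant(toy_family), nf(toy_family))
```

Under this assumed form, a family gains Nf by adding diverse sequences, not by adding near-duplicates, and longer proteins need more sequences to reach the same value.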

How useful was this extra data?

Page 8

How do we assess the quality of a predicted structure?

It turns out that convergence of the structure prediction calculations is a good measure of the quality of the prediction
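
One way to put a number on convergence is the average pairwise difference among the predicted models: if independent runs land on nearly the same structure, the average is small. The slide does not say which similarity measure was used, so the sketch below uses distance-matrix RMSD (dRMSD) as a stand-in because it needs no superposition. The coordinates and function names are hypothetical.

```python
from itertools import combinations
from math import dist, sqrt


def drmsd(model_a, model_b):
    """Distance-matrix RMSD between two models with the same number of atoms."""
    index_pairs = list(combinations(range(len(model_a)), 2))
    squared = 0.0
    for i, j in index_pairs:
        d_a = dist(model_a[i], model_a[j])
        d_b = dist(model_b[i], model_b[j])
        squared += (d_a - d_b) ** 2
    return sqrt(squared / len(index_pairs))


def mean_pairwise_drmsd(models):
    """Average dRMSD over all pairs of models; lower means tighter convergence."""
    model_pairs = list(combinations(models, 2))
    return sum(drmsd(a, b) for a, b in model_pairs) / len(model_pairs)


# Hypothetical three-atom models: a bent chain, the same chain translated, and a straight chain.
bent = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]
shifted = [(x + 1.0, y, z) for x, y, z in bent]
straight = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]

print(mean_pairwise_drmsd([bent, shifted]))            # converged ensemble
print(mean_pairwise_drmsd([bent, shifted, straight]))  # less converged ensemble
```

A structure-aware score such as TM-score could be swapped in for `drmsd` without changing the averaging step.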

Page 9

Findings

Metagenomic data allowed prediction of 33% of unmodeled Pfams (compared with 16% without metagenome data)

612 new Pfams predicted

137 of these are novel folds

Page 10

Strengths

Leveraged advances over the last ~10 years in high-throughput sequencing (especially in metagenomics) and in evolutionary coupling analysis.

This method has generated one of the largest advances ever in structural genomics (quality predictions generated for >600 new families and >100 novel folds discovered), with the promise of more as more sequence data becomes available.

Page 11

Limitations

The algorithm doesn’t explicitly model the presence of membrane bilayers or endogenous ligands.

How generalizable is this method? Even for families with Nf > 64, prediction calculations converge just over half of the time.

Bacterial genomes and proteins are highly overrepresented in metagenomics studies.

Page 12

Limitations

We still can’t predict structure for over half of the families with unknown structure. (But is this even a fair criticism?)

Page 13

Potential Next Steps

Can this be used at all to improve current homology modeling methods?

Can we improve the method to predict more structures with fewer sequences (i.e. at lower Nf values)?

Can we predict or incorporate functional sites into the predictions?

Page 14

Questions?

Page 15

Accurate De Novo Prediction of Protein Contact Map by Ultra-Deep Learning Model

Sheng Wang, Siqi Sun, Zhen Li, Renyu Zhang, Jinbo Xu

Ankit Baghel

CS 371, 01/23/18

Page 16

Creating Protein Contact Maps

Page 17

Protein Contact Maps

http://www.bioinformatics.org/cmview/screenshots.html
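
A contact map is a symmetric L x L matrix that marks which residue pairs are spatially close. The slide gives no distance threshold, so the sketch below assumes a commonly used convention: one representative atom per residue and a cutoff of 8 Å. The coordinates and the function name are hypothetical.

```python
from math import dist


def contact_map(coords, cutoff=8.0, min_separation=1):
    """Return an L x L matrix of 0/1 flags; 1 means the two residues lie
    closer than `cutoff` (assumed to be in angstroms)."""
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + min_separation, n):
            if dist(coords[i], coords[j]) < cutoff:
                cmap[i][j] = cmap[j][i] = 1  # keep the map symmetric
    return cmap


# Hypothetical residues spaced 5 angstroms apart along a straight line.
toy_coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0), (15.0, 0.0, 0.0)]
for row in contact_map(toy_coords):
    print(row)
```

Only neighbouring residues are flagged in this straight-line example. In a folded protein, the interesting entries are the off-diagonal ones between residues that are far apart in sequence.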

Page 18

Existing Contact Prediction Methods

• Evolutionary Coupling Analysis (ECA)
  • PSICOV
  • plmDCA
  • GREMLIN
  • CCMpred

• Supervised Machine Learning
  • MetaPSICOV
  • SVMSEQ
  • CMAPpro
  • PconsC2
  • PhyCMAP
  • CoinDCA-NN

**Not an exhaustive list.

Page 19

ECA Driven Contact Prediction Uses Correlations Between Residues

Page 20

Supervised Machine Learning Incorporates More Context

PconsC2 Contact Prediction

Page 21

Creating Protein Contact Maps

Deep Residual Learning

Page 22

Page 23

Deep Residual Learning for RaptorX Contact Prediction
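
The core idea of residual learning is that each block outputs its input plus a learned correction, y = x + F(x), which makes very deep stacks easier to train. The sketch below is a deliberately tiny, single-channel illustration with hand-set weights. It is not the RaptorX architecture, and the kernels and function names are assumptions.

```python
def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def conv1d_same(values, kernel):
    """1D convolution (cross-correlation form, as in deep learning libraries)
    with zero padding so the output length equals the input length.
    Assumes an odd-length kernel."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(values) + [0.0] * half
    return [
        sum(k * padded[i + t] for t, k in enumerate(kernel))
        for i in range(len(values))
    ]


def residual_block(x, kernel_1, kernel_2):
    """y = x + F(x), where F is conv -> ReLU -> conv."""
    hidden = relu(conv1d_same(x, kernel_1))
    correction = conv1d_same(hidden, kernel_2)
    return [xi + fi for xi, fi in zip(x, correction)]


signal = [1.0, -2.0, 3.0]
# With all-zero weights, F(x) = 0 and the block passes its input through unchanged.
print(residual_block(signal, [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]))
# With identity kernels, F(x) = relu(x), so the block returns x + relu(x).
print(residual_block(signal, [0.0, 1.0, 0.0], [0.0, 1.0, 0.0]))
```

The skip connection means a block can fall back to the identity mapping when its weights are near zero, which is the property that lets many blocks be stacked.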

Page 24

Protein Features for the First Residual Network

Position-Specific Scoring Matrix (PSSM) (L x 20 matrix)

+ Predicted Secondary Structure (L x 3 matrix)

+ Predicted Solvent Accessibility (L x 3 matrix)

Concatenated to create an L x 26 matrix for use in the first residual network.

Matrix Concatenation → 1D Residual Network → Learned Sequential Features (L x n matrix)
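
The concatenation step can be sketched as joining per-residue rows. This assumes the secondary structure and solvent accessibility predictions each contribute 3 values per residue, which is what the stated L x 26 total implies (20 + 3 + 3). The placeholder values and the function name are hypothetical.

```python
def concat_features(*feature_matrices):
    """Join several L x k per-residue feature matrices into one L x sum(k) matrix."""
    lengths = {len(m) for m in feature_matrices}
    if len(lengths) != 1:
        raise ValueError("all feature matrices must cover the same number of residues")
    n_residues = lengths.pop()
    return [
        [value for matrix in feature_matrices for value in matrix[i]]
        for i in range(n_residues)
    ]


n_residues = 5  # hypothetical protein length L
pssm = [[0.0] * 20 for _ in range(n_residues)]                  # L x 20
secondary_structure = [[1.0, 0.0, 0.0] for _ in range(n_residues)]  # L x 3 (assumed)
solvent_access = [[0.0, 1.0, 0.0] for _ in range(n_residues)]       # L x 3 (assumed)

features = concat_features(pssm, secondary_structure, solvent_access)
print(len(features), len(features[0]))
```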

Page 25

Protein Features for Second Residual Network

Learned Sequential Features (L x n matrix) → Conversion to 2D → Pairwise Features (L x L x 3n)

+ Mutual information

+ Evolutionary coupling information provided by CCMpred

+ Pair-wise contact potential

Merge → 2D Residual Network → Logistic Regression → Predicted Contact Map (L x L matrix)
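
The slide gives only the shape of the converted tensor, L x L x 3n. One way to obtain 3n values per pair, assumed here for illustration, is to concatenate the feature vectors of residue i, the midpoint residue, and residue j. The function name and toy features are hypothetical.

```python
def to_pairwise(seq_features):
    """Build an L x L x 3n tensor from L x n features: for each pair (i, j),
    join the features of i, of the midpoint residue, and of j."""
    length = len(seq_features)
    pairwise = []
    for i in range(length):
        row = []
        for j in range(length):
            mid = (i + j) // 2
            row.append(
                list(seq_features[i]) + list(seq_features[mid]) + list(seq_features[j])
            )
        pairwise.append(row)
    return pairwise


# Hypothetical learned features with L = 4 residues and n = 2 values each.
toy_features = [[float(i), float(i) * 10.0] for i in range(4)]
pairwise = to_pairwise(toy_features)
print(len(pairwise), len(pairwise[0]), len(pairwise[0][0]))
print(pairwise[0][3])
```

The extra pairwise inputs listed on the slide (mutual information, CCMpred couplings, contact potential) would then be appended along the last axis before the 2D network.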

Page 26

Deep Residual Learning

Accuracy of Predicted Contact Maps

Page 27

Training Set

• Subset of PDB25
• All proteins have less than 25% sequence identity with any other protein
• 6767 proteins
• Contains only ~100 membrane proteins

Page 28

Test Set

• 150 Pfam families
• 105 CASP11 test proteins
• 76 hard CAMEO test proteins from 2015
• 398 membrane proteins

• 400 residues at most
• At most 40% sequence identity

Page 29

Contact Prediction via Deep Residual Learning Has High Accuracy

• Short: sequence distance between two residues is in the range [6, 11]
• Medium: sequence distance between two residues is in the range [12, 23]
• Long: sequence distance between two residues is in the range ≥ 24

Accuracy is defined as the percent of the top L/k predicted contacts that correspond to native contacts, where L is the length of the protein. RaptorX Contact Prediction (bottom row) has higher accuracy than the compared methods for all sequence distance ranges.
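
The accuracy definition above can be sketched as follows: rank the candidate residue pairs in a given sequence-distance range by predicted score, keep the top L/k, and report the fraction that are native contacts. The function name, argument names, and toy data are assumptions for illustration.

```python
def top_l_over_k_accuracy(scores, native, k=5, min_sep=24, max_sep=None):
    """Fraction of the top L/k scored pairs, restricted to a sequence-separation
    range, that are native contacts. `scores` and `native` are L x L matrices."""
    length = len(native)
    candidates = []
    for i in range(length):
        for j in range(i + min_sep, length):
            if max_sep is not None and j - i > max_sep:
                break
            candidates.append((scores[i][j], i, j))
    candidates.sort(reverse=True)  # highest predicted score first
    top = candidates[: max(1, length // k)]
    if not top:
        return 0.0
    return sum(native[i][j] for _, i, j in top) / len(top)


# Hypothetical 6-residue example; a small min_sep is used so the toy has candidates.
size = 6
native = [[0] * size for _ in range(size)]
scores = [[0.0] * size for _ in range(size)]
for i, j in [(0, 2), (0, 5), (1, 4)]:
    native[i][j] = native[j][i] = 1
for (i, j), s in {(0, 2): 0.9, (1, 4): 0.8, (2, 5): 0.7, (0, 5): 0.1}.items():
    scores[i][j] = scores[j][i] = s

accuracy = top_l_over_k_accuracy(scores, native, k=2, min_sep=2)
print(accuracy)
```

For the ranges on the slide, short-range accuracy would use `min_sep=6, max_sep=11`, medium-range `min_sep=12, max_sep=23`, and long-range `min_sep=24` with no upper bound.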

Page 33

Increased Performance Due to Deep Residual Learning is Independent of Available Homologous Information

Page 34

Accuracy of Predicted Contact Maps

Contact-Assisted Protein Folding Results

Page 35

Contact-Assisted Protein Folding Benefits from Deep Residual Learning

Page 36

Learned Features Are Not Template-Based

Page 37

Contact-Assisted Protein Folding Results

CAMEO Blind Tests

Page 38

CAMEO Blind Tests of Contact Prediction

Page 39

CAMEO Blind Tests of Contact Prediction

Page 40

CAMEO Blind Tests

Strengths/Weaknesses

Page 41

Strengths

• Very thorough in comparing its predictions against different types of proteins and prediction approaches.

• Uses a non-redundant training set.

• Considers all residue pairs for contact simultaneously.

• Blind testing through CAMEO.

• Performs surprisingly well on membrane proteins.

Page 42

Weaknesses

• Use of extensive hidden layers makes learned features difficult to describe.

• Does not quantify its false-positive rate.

• Is not as unique an approach as implied (see PconsC2).

• Does not compare its method of contact map prediction to that of PconsC2.

• Tested membrane proteins were constrained
