Training a Neural Network to Recognize Phage Major Capsid Proteins Author: Michael Arnoult, San...

Date posted: 19-Dec-2015

Training a Neural Network to Recognize Phage Major Capsid Proteins

Author: Michael Arnoult, San Diego State University

Mentors: Victor Seguritan, Anca Segall, and Peter Salamon, Department of Biology, San Diego State University

Background:

Bacteriophages are the most abundant biological entities on Earth and influence every environment in which bacteria exist. No current algorithm reliably analyzes phage structural protein sequences and predicts their function.

•This research enables the classification of phage structural proteins using Artificial Neural Networks (ANNs), a computational method of analysis inspired by biological neurons.

•Features of phage protein sequences with known classifications were used to train neural networks, which then predict the specified function of an unknown sequence.

•Analysis of the predictions will allow biologists to decide, with some accuracy, which proteins are the most appropriate candidates for their research needs.

Methods:

Sequence Manipulation And Analysis

•Phage Major Capsid Protein Sequence Collection

•Annotation Data Filtering

•Conversion of Sequences to Quantitative Features

•Training of ANNs

•Testing of ANNs using Separate Test Sets

Phage MCPs were obtained from the NCBI database using the keywords Phage, Major, Proteins, and: Head, Shell, Coat, Capsid, Procapsid, Prohead.

Sequences with non-MCP annotations were removed from the positive data set, and sequences with MCP annotations were removed from the negative data set.

Sequences were dereplicated above 97% similarity. Because positive MCP sequences shorter than 300 amino acids were scarce, only sequences longer than 300 amino acids were used.
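The length cutoff and dereplication step can be sketched as follows. This is a minimal illustration, not the pipeline used for the poster: the function names are hypothetical, and `difflib.SequenceMatcher` is only a stand-in for whatever similarity measure the authors actually applied. Only the two thresholds (300 amino acids, 97%) come from the text.

```python
from difflib import SequenceMatcher

MIN_LENGTH = 300       # from the poster: only sequences longer than 300 aa were used
MAX_SIMILARITY = 0.97  # from the poster: dereplication threshold


def similarity(a: str, b: str) -> float:
    """Stand-in similarity score in [0, 1]; an assumption, not the original measure."""
    # autojunk=False matters here: with the default heuristic, every residue
    # in a long protein sequence would be treated as "junk" and ignored.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def filter_sequences(sequences: list[str]) -> list[str]:
    """Drop short sequences, then greedily drop near-duplicates of kept ones."""
    kept: list[str] = []
    for seq in sequences:
        if len(seq) <= MIN_LENGTH:
            continue  # too short
        if any(similarity(seq, other) > MAX_SIMILARITY for other in kept):
            continue  # near-duplicate of a sequence already kept
        kept.append(seq)
    return kept
```

The greedy pass keeps the first representative of each group of near-identical sequences, so the result depends on input order; a dedicated clustering tool would normally be used for large data sets.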

Five features of protein sequences were translated into quantitative representations: Amino Acid Percent Compositions, Masses, Isoelectric Points, Hydrophobicity Ratings, and Volumes.
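As a rough illustration of turning a sequence into numbers, the sketch below computes two of the five feature types: amino acid percent composition and a mean hydrophobicity value. The Kyte-Doolittle scale is an assumption, since the poster does not say which hydrophobicity rating was used, and the function names are hypothetical. Mass, isoelectric point, and volume would follow the same pattern with their own lookup tables, which are not reproduced here.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, fixed order

# Kyte-Doolittle hydropathy values (assumed scale; not named in the poster).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def _standard_residues(seq: str) -> list[str]:
    """Keep only the 20 standard residues; reject sequences with none."""
    residues = [aa for aa in seq.upper() if aa in KYTE_DOOLITTLE]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    return residues


def percent_composition(seq: str) -> list[float]:
    """Percentage of each standard amino acid, in AMINO_ACIDS order."""
    residues = _standard_residues(seq)
    counts = Counter(residues)
    return [100.0 * counts[aa] / len(residues) for aa in AMINO_ACIDS]


def mean_hydropathy(seq: str) -> float:
    """Average hydropathy over the sequence."""
    residues = _standard_residues(seq)
    return sum(KYTE_DOOLITTLE[aa] for aa in residues) / len(residues)


def feature_vector(seq: str) -> list[float]:
    """Concatenate the features into one input vector for a network."""
    return percent_composition(seq) + [mean_hydropathy(seq)]
```

Each sequence, whatever its length, maps to a vector of fixed size (here 21 values), which is what allows a network with a fixed number of input neurons to process it.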

ANNs were trained according to the five features. Architecture included:

•One input layer containing neurons equal to the number of individual features

•A hidden layer with an equal number of neurons

•A decision layer with one neuron.
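The three-layer architecture described above can be sketched as a forward pass in plain Python. The class name, the sigmoid activation, and the random weight initialization are all assumptions made for illustration; the poster specifies only the layer sizes. Training (adjusting the weights from labeled examples) is omitted.

```python
import math
import random


def sigmoid(x: float) -> float:
    """Numerically stable logistic function (assumed activation)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


class TinyANN:
    """Input layer -> hidden layer of equal size -> one decision neuron."""

    def __init__(self, n_features: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        # One weight row per hidden neuron; hidden size equals input size.
        self.w_hidden = [
            [rng.uniform(-0.5, 0.5) for _ in range(n_features)]
            for _ in range(n_features)
        ]
        self.b_hidden = [0.0] * n_features
        # The single decision neuron reads every hidden neuron.
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_features)]
        self.b_out = 0.0

    def forward(self, features: list[float]) -> float:
        """Return a score in (0, 1); higher means more MCP-like."""
        hidden = [
            sigmoid(sum(w * x for w, x in zip(row, features)) + b)
            for row, b in zip(self.w_hidden, self.b_hidden)
        ]
        return sigmoid(sum(w * h for w, h in zip(self.w_out, hidden)) + self.b_out)

    def classify(self, features: list[float], threshold: float = 0.5) -> bool:
        """True if the sequence is called a major capsid protein."""
        return self.forward(features) >= threshold
```

For a network fed by a 20-value percent composition, `TinyANN(20)` would have 20 input values, 20 hidden neurons, and one output; the 0.5 decision threshold is likewise an illustrative choice.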

Classification was executed on a test set containing a randomly selected 20% portion of positive and negative protein sequences, not included in training.
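A held-out split of this kind might look like the sketch below. It assumes the 20% is drawn separately from the positive and negative sets, which the poster does not state explicitly; the function name and the fixed seed are hypothetical.

```python
import random


def split_train_test(positives, negatives, test_fraction=0.2, seed=0):
    """Randomly hold out a fraction of each class as a test set.

    Returns (train, test), each a list of (sequence, label) pairs,
    with label 1 for MCP and 0 for non-MCP.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for label, group in ((1, positives), (0, negatives)):
        shuffled = list(group)
        rng.shuffle(shuffled)
        n_test = round(test_fraction * len(shuffled))
        test += [(seq, label) for seq in shuffled[:n_test]]
        train += [(seq, label) for seq in shuffled[n_test:]]
    return train, test
```

Because the test sequences are removed before training, accuracy on them estimates how the network behaves on proteins it has never seen.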

Figure: Single Feature-Trained ANN Classification of Phage Proteins. Percent of test set correctly classified for each protein sequence feature (Primary Test Set, Curated Test Set):

•pct: 84.02, 90.48

•mass: 84.48, 95.24

•iso-electric value: 83.97, 96.83

•hydrophobicity rating: 56.9, 1.59

•volume: 84.92, 88.89
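The quantity on the vertical axis, percent of test set correctly classified, is a simple ratio. A minimal sketch, with a hypothetical function name and made-up labels in the test, is:

```python
def percent_correct(predicted, actual):
    """Percentage of test examples whose predicted label matches the true label."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * hits / len(actual)
```

For a test set made up only of known MCPs, this number is the fraction of true capsid proteins that the network recovers.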

Additional NCBI search keywords:

•Head

•Shell

•Coat

•Capsid

•Procapsid

•Prohead

Non-MCP sequences were also downloaded as negative examples.

Figure: ANN architecture diagram (labels: Input Layer, Hidden Layer).

Conclusions:

•Phage Major Capsid Proteins are distinguishable from other phage proteins by trained Artificial Neural Networks.

•Classification of the test sets reveals the ANNs' ability to distinguish phage proteins most accurately when multiple significant features are combined during training.

•Artificial Neural Networks trained on amino acid characteristics may achieve similar classification abilities and may share commonalities in classification styles.

This research was funded in part by the NSF 0827278 UBM Interdisciplinary Training in Biology and Mathematics grant to AMS and PS.

A set of Phage MCPs, obtained from an independent database, was curated using:

•Annotation Analysis

•Clustal alignment

•Phylogenetic Evaluation (Rizkallah 2010).

These proteins were used as a secondary test set of protein sequences not included in training.

Figure: Multi-Feature-Trained ANN Classification of Phage Proteins. Percent of test set correctly classified for each combination of protein sequence features (Primary Test Set, Curated Test Set):

•pct, mass, iso, hyd, vol: 86.03, 96.16

•pct, mass, iso, vol: 87.2, 95.24

•pct, mass, iso, hyd: 83.41, 93.65

•pct, mass, iso: 84.66, 95.24