Algorithm for Analysis of Amino Acid Bond Lengths of Proteins
Final Project Report
CS 487 Introduction to Cluster Computing
Old Dominion University
Tim Dugan, Computer Engineering Department
Gordon Bland, Computer Science Department
Tyler Wood, Computer Science Department
Adrian Ostolski, Computer Science Department
Research Advisor: Professor Jay Morris
ABSTRACT
Bioinformatics is a growing field of study which supplies us with endless computing problems
to solve. Although the definition of the term itself is somewhat arguable, the generally accepted idea
is that bioinformatics is the use of computers to solve biological problems or answer biological
questions. One such problem is to develop an algorithm for comparing bond lengths of proteins in
order to search for protein keys. A protein key is a protein which sends signals to other cells by
means of a chemical reaction at the site where binding occurs. Our team has chosen to develop an
algorithm for analysis of amino acid bond lengths of proteins because this analysis will assist in
identifying protein keys.
Table of Contents
1 Problem Description
  1.1 Introduction
  1.2 Background
  1.3 Need for Solution
2 Solution Design
  2.1 Module Design
  2.2 Classes
  2.3 Functions
3 Results
  3.1 Generated Protein Files
  3.2 Output
4 Conclusions and Recommended Further Research
5 Acknowledgements
6 Appendices
  A References
  B Source Code and Documentation
  C Runtime Instructions
List of Figures
Figure 1. Amino Acid Structure
Figure 2. Peptide Bond
Figure 3. Function Call in Main
Figure 4. Modify Function
Figure 5. Load Function
Figure 6. Algorithm Function
Figure 7. Generate Function in Main
Figure 8. Generate Function
Figure 9. Protein1_clean
Figure 10. Protein2_clean
Figure 11. Protein1_raw
Figure 12. Protein2_raw
Figure 13. Protein_matchs
List of Tables
Table 1. Amino Acids by Name
1 PROBLEM DESCRIPTION
Bioinformatics is a growing field of study which supplies us with endless
computing problems to solve. Although the definition of the term itself is somewhat
arguable, the generally accepted idea is that bioinformatics is the use of computers to
solve biological problems or answer biological questions. One such problem is to develop
an algorithm for comparing bond lengths of proteins in order to search for protein keys.
A protein key is a protein which sends signals to other cells by means of a chemical
reaction at the site where binding occurs.
For example, Dutch scientists found a protein produced by glial cells in the central
nervous system that transmits messages between brain cells which control the release of
chemicals that affect memory, attention, and addiction. Acetylcholine affects memory,
and dopamine affects addiction, to name two. Scientists anticipate using this
protein key to develop drugs which will influence certain neuronal functions while
leaving others unaffected. Many more protein keys remain to be identified,
however.
1.1 INTRODUCTION
The purpose of this Beowulf class is to demonstrate how cluster computing can be
used to solve large problems which could not otherwise be solved. Our team has
chosen to develop an algorithm for analysis of amino acid bond lengths of proteins
because this analysis will assist in identifying protein keys.
1.2 BACKGROUND
A protein is made up of amino acids, and therefore amino acids lie at the heart of
bioinformatics. There are 20 standard amino acids found in the proteins of the human body.
Each amino acid has unique properties and can be represented by a full name or a
three-letter or one-letter code, as shown in Table 1.
Amino acid      3-letter  1-letter  Hydropathy   Charge
Alanine         Ala       A         Hydrophobic  Neutral
Cysteine        Cys       C         Hydrophobic  Neutral
Aspartic acid   Asp       D         Hydrophilic  Negative
Glutamic acid   Glu       E         Hydrophilic  Negative
Phenylalanine   Phe       F         Hydrophobic  Neutral
Glycine         Gly       G         Hydrophobic  Neutral
Histidine       His       H         Hydrophilic  Neutral/Positive/Negative
Isoleucine      Ile       I         Hydrophobic  Neutral
Lysine          Lys       K         Hydrophilic  Positive
Leucine         Leu       L         Hydrophobic  Neutral
Methionine      Met       M         Hydrophobic  Neutral
Asparagine      Asn       N         Hydrophilic  Neutral
Proline         Pro       P         Hydrophobic  Neutral
Glutamine       Gln       Q         Hydrophilic  Neutral
Arginine        Arg       R         Hydrophilic  Positive
Serine          Ser       S         Hydrophilic  Neutral
Threonine       Thr       T         Hydrophilic  Neutral
Valine          Val       V         Hydrophobic  Neutral
Tryptophan      Trp       W         Hydrophobic  Neutral
Tyrosine        Tyr       Y         Hydrophobic  Neutral
Table 1. Amino Acids by Name.
All amino acids share the same basic structure, built around a central carbon
atom known as the C-alpha. Bonded to this carbon atom are a hydrogen atom, an amino
group, a carboxylic acid group, and a fourth group known as the variable sidechain.
Sidechains are what differentiate one amino acid from another. Amino acids are
connected by a peptide bond between the carboxyl group of the first amino acid and
the amino group of the second amino acid. Figure 1 shows the general structure of an
amino acid, and Figure 2 shows a peptide bond.
Figure 1. Amino Acid Structure.
Figure 2. Peptide Bond.
1.3 NEED FOR SOLUTION
This solution has the potential to greatly increase our ability to overcome diseases
and other conditions. With each discovery of a protein key, scientists are able to make
progress toward curing or treating many causes of mankind's suffering. In the example
from the problem description, that particular protein key could be used to prevent
Alzheimer's disease or schizophrenia, or to help people quit smoking or overcome other
drug addictions. This solution could save millions of lives!
2 SOLUTION DESIGN
Our team has designed a solution at Old Dominion University's Beowulf Laboratory,
where our supercomputer is housed. This solution accepts input in two ways: the user
either gives the program an actual protein file or first generates protein files. If the
latter is chosen, the user is prompted to input the number of protein nodes and the
maximum number of possible connections between nodes. The user also inputs the sigma
value, that is, the maximum allowed deviation between two distances being compared.
The code then generates random proteins based on these criteria and outputs the
comparison to a text file.
2.1 MODULE DESIGN
It is important to note at the outset that the modules are independent of one another.
The intent is that each function can be run as a standalone function, so that the
program designer can implement full user input and feed it into the functions.
2.2 CLASSES
There are two classes used in this solution: the point class and the edge class.
The point class contains the node id and the x and y coordinate values for the node
object. The edge class contains two points and the distance between them.
These two classes can be found in Appendix B.
2.3 FUNCTIONS
There are four major functions in our code: the generate, modify, load, and
algorithm functions. The generate function is discussed in Section 3.1. Figure 3
shows each of these functions being called in Main.cpp and the parameters each
takes.
Figure 3. Function Call in Main.cpp.
The modify function, shown in Figure 4, requires an input file name (s) and an
output file name (d). It reads data from file s and outputs it to file d in a format
that is acceptable for use in the load function.
Figure 4. Modify Function.
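The cleaning step performed by the modify function can be sketched as follows. This is a minimal illustration rather than the project's own code: the function name cleanRawLine is hypothetical, and it assumes, based on the ASCII checks in the Appendix B listing, that the characters A, (, ) and , are dropped while all other characters pass through unchanged.

```cpp
#include <string>

// Hypothetical helper: strips the decoration characters from one raw node
// line so that only whitespace-separated numbers remain.
std::string cleanRawLine(const std::string& raw) {
    std::string clean;
    for (char ch : raw) {
        // Assumption: these four characters are the only ones removed.
        if (ch == 'A' || ch == '(' || ch == ')' || ch == ',') continue;
        clean += ch;
    }
    return clean;
}
```

Under these assumptions, a raw line such as A(4) 10, 11 becomes 4 10 11, which can then be read with ordinary stream extraction.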
The load function, shown in Figure 5, requires an input file name (s) and a list
of edges (d). Its purpose is to load data from file s into the edge list d, which is
passed to the function by reference.
Figure 5. Load Function.
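A compact sketch of the loading step is given here for illustration. The names PointSketch, EdgeSketch and loadEdges are hypothetical and are not the classes in Appendix B; the sketch assumes a clean file in which each edge is stored as six whitespace-separated numbers (label, x, y for the first node, then label, x, y for the connected node) and computes the edge length as the Euclidean distance between the two nodes.

```cpp
#include <cmath>
#include <istream>
#include <sstream>
#include <vector>

struct PointSketch { int id; double x; double y; };
struct EdgeSketch { int index; PointSketch a; PointSketch b; double length; };

// Reads node pairs until the stream is exhausted or malformed and
// returns one edge per pair, indexed in reading order.
std::vector<EdgeSketch> loadEdges(std::istream& in) {
    std::vector<EdgeSketch> edges;
    PointSketch a, b;
    while (in >> a.id >> a.x >> a.y >> b.id >> b.x >> b.y) {
        // Euclidean distance between the two nodes.
        double length = std::hypot(b.x - a.x, b.y - a.y);
        edges.push_back({static_cast<int>(edges.size()), a, b, length});
    }
    return edges;
}
```

Taking a std::istream instead of a file name lets the same function read from a std::ifstream or, for testing, from a std::istringstream.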
The algorithm function, shown in Figure 6, is at the heart of our solution. This
function takes an output file name (s), a list of edges (d), a second list of edges (f),
and a delta (x). It compares every edge length in list d with every edge length in
list f; if the difference between two lengths is within the delta x value, the program
outputs the matching edges from d and f to the file s.
Figure 6. Algorithm Function.
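The comparison described above can be sketched as a pair of nested loops. This is a simplified illustration, not the Appendix B function: the name findMatches is hypothetical, the edges are reduced to plain lengths, and matches are returned as index pairs instead of being written to a file.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Returns every (i, j) such that the i-th length in d and the j-th length
// in f differ by at most sigma.
std::vector<std::pair<std::size_t, std::size_t>>
findMatches(const std::vector<double>& d, const std::vector<double>& f, double sigma) {
    assert(sigma >= 0.0);  // a negative tolerance could never match
    std::vector<std::pair<std::size_t, std::size_t>> matches;
    for (std::size_t i = 0; i < d.size(); ++i) {
        for (std::size_t j = 0; j < f.size(); ++j) {
            if (std::fabs(d[i] - f[j]) <= sigma) {
                matches.emplace_back(i, j);
            }
        }
    }
    return matches;
}
```

Because every length in d is compared with every length in f, the work grows with the product of the two list sizes, which is why large proteins make this comparison a natural candidate for a cluster.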
3 RESULTS
Our team has successfully generated two separate protein files (protein1 and
protein2) which mimic actual protein files. We then used these protein files to
compare bond lengths. Our current solution can perform this comparison on any
protein file supplied to it, with no alterations.
3.1 GENERATED PROTEIN FILES
The generated protein files are created by the generate function. It requires an
output file name (s), an amino acid length (x), and a number of bonds (y), and it
generates a random protein with x amino acids. Each amino acid is connected to
anywhere from 1 to y nodes, and the function outputs the data in a specified format to
a file named s.
Figure 7. Generate Function in Main.
Figure 8. Generate Function.
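The generation step can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's generate function: the names NodeSketch, BondSketch, ProteinSketch and generateProtein are hypothetical, coordinates are assumed to lie in the range 0 to 99, a node is assumed never to link to itself, and the standard <random> engine is used in place of rand() so that a fixed seed gives repeatable output.

```cpp
#include <cassert>
#include <random>
#include <vector>

struct NodeSketch { int id; int x; int y; };
struct BondSketch { int from; int to; };

struct ProteinSketch {
    std::vector<NodeSketch> nodes;
    std::vector<BondSketch> bonds;
};

// Builds nodeCount nodes with random coordinates and gives each node
// between 1 and maxBonds connections to other nodes.
ProteinSketch generateProtein(int nodeCount, int maxBonds, unsigned seed) {
    assert(nodeCount >= 2 && maxBonds >= 1);
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> coord(0, 99);  // assumed range
    std::uniform_int_distribution<int> bondCount(1, maxBonds);
    std::uniform_int_distribution<int> offset(1, nodeCount - 1);

    ProteinSketch p;
    for (int id = 1; id <= nodeCount; ++id) {
        p.nodes.push_back({id, coord(rng), coord(rng)});
    }
    for (int id = 1; id <= nodeCount; ++id) {
        int k = bondCount(rng);
        for (int j = 0; j < k; ++j) {
            // A nonzero offset guarantees the target differs from id.
            int to = (id - 1 + offset(rng)) % nodeCount + 1;
            p.bonds.push_back({id, to});
        }
    }
    return p;
}
```

The sketch may connect the same pair of nodes more than once; whether duplicates are acceptable depends on how the generated data is meant to mimic real proteins.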
When the protein files are generated by the generate function, the files list the
node, its coordinates, the connecting node number, and the connecting node's
coordinates. For example, the beginnings of our two generated protein files, Protein1
and Protein2, are shown in Figures 9 and 10.
Figure 9. Protein1_clean.
Figure 10. Protein2_raw.
3.2 OUTPUT
The output for this project is the comparison of lengths performed on our generated
protein test data. Both the raw and the clean output files list node data in pairs.
In the raw file, the first entry of each pair is the node itself and the second entry
is the node it is connected to.
A(1) 5, 4
A(4) 10, 11
A(1) 5, 4
A(5) 13, 15
This means that node 1 is connected to node 4 and node 5.
Figure 11. Protein1_raw.
Figure 12. Protein2_clean.
The match file lists all matching edges and is shown in Figure 13.
Figure 13. Protein_matchs.
4 CONCLUSIONS AND RECOMMENDED FURTHER RESEARCH
Our current solution finds all matching bond lengths between two given protein
files, but it does not yet actively search for protein keys. Further development
would therefore implement an algorithm enabling this module to identify protein keys.
This would, however, require considerable additional research in order to produce a
precise method capable of making such identifications.
5 ACKNOWLEDGEMENTS
Our team would like to thank Professor Jay Morris for the concept of this project
and his invaluable assistance throughout the development of this module. We would
also like to thank the Computer Science Department for providing the supercomputer for
us to work with during this course. Special thanks also to Tihomir Hristov for the
training and initial setup which he provided.
6 APPENDICES
A REFERENCES
[1] Carter, J.S. (2004, November 02). Amino acids and proteins. Retrieved from
http://biology.clc.uc.edu/courses/bio104/protein.htm
[2] CMBI. (2010, February 12). Amino acid. Retrieved from
http://wiki.cmbi.ru.nl/index.php/Amino_acid
[3] Vriend, G., & Gelder, C.V. (n.d.). Intro bioinformatics. Retrieved from
http://swift.cmbi.ru.nl/teach/B1M/
[4] Yahoo Stories. (2001, May 16). Protein key to new smoking, Alzheimer's drugs.
Retrieved from http://cmbi.bjmu.edu.cn/news/0105/97.htm
B SOURCE CODE AND DOCUMENTATION
Main.cpp
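//header names inferred from the calls used below
#include <iostream>
#include <fstream>
#include <list>
#include <cmath>
#include <cstdlib>
#include <ctime>
#include <cctype>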
using namespace std;
#include "point.h"
#include "Edge.h"
#include "function.h"
int main()
{
list<Edge> protein1;
list<Edge> protein2;
list<Edge>::iterator pitr;
int a,b,c,d;
double sigma;
char str;
srand(time(NULL));
cout>a>>b;
cout>c>>d;
coutsigma;
cout
LoadProteinFile("protein2_clean.txt",protein2);
algorithm("protein_matchs.txt",protein1,protein2,sigma);
cout> str;
str = toupper(str);
if(str=='Y'){
cout
Function.h
struct node{
int label,nlabel;
int x,nx;
int y,ny;
};
//////////////////////////////
//functions
//////////////////////////////
void GenerateProteinFile(char str[256],int array_size,int node_connection){
int a=0;
int b=0;
int c=0;
int d=0;
int e=0;
int f=0;
node A[array_size];
//USER INPUT
fstream fout(str,ios::out);
array_size++;
//generate array of nodes
for(int z=0; z
fout
while (fin.good())
{
c = &b;
fin.get(*c);
a=*c;
if(fin.good())
{
if(a!=65 && a!=40 && a!=41 && a!=44)//ascii values for A ( ) , #
{
if(a==32)//ascii value for space
{
fout
fin.close();
fout.close();
}
/////////////////////////////////////////////////////////////////////
void LoadProteinFile(char infileName[256], list<Edge> &protein){
    //reads a clean protein file (label x y label x y per edge) and
    //appends one Edge per node pair to the protein list
    ifstream fin;
    fin.open(infileName);
    int i,index=0;
    double x1,y1,x2,y2,distance;
    //read the first node label; the loop ends when no further label can be read
    fin>>i;
    while(fin.good()){
        fin>>x1;
        fin>>y1;
        Point a(i,x1,y1);
        fin>>i;
        fin>>x2;
        fin>>y2;
        Point b(i,x2,y2);
        //calculate the Euclidean distance between the two nodes
        distance = sqrt(pow((x2-x1),2)+pow((y2-y1),2));
        //Edge copies both points, so stack objects avoid the leak caused by new
        Edge amino(index,distance,&a,&b);
        index++;
        protein.push_back(amino);
        fin>>i;
    }
    fin.close();
}
//////////////////////////////////////////////////////
void DisplayProtein(list<Edge> &protein){
    //print every edge in the list using Edge::display
    list<Edge>::iterator pitr;
    for(pitr=protein.begin(); pitr!=protein.end();pitr++)
    {
        pitr->display();
    }
}
////////////////////////////////////////////////////////
void algorithm(char outfileName[256],list<Edge> &protein1,list<Edge> &protein2, double sigma){
double delta;
fstream fout(outfileName,ios::out);
list<Edge>::iterator protein1_itr;
list<Edge>::iterator protein2_itr;
for(protein1_itr=protein1.begin(); protein1_itr!=protein1.end();protein1_itr++) {
for(protein2_itr=protein2.begin(); protein2_itr!=protein2.end();protein2_itr++) {
delta=fabs(protein1_itr->getDistance()-protein2_itr->getDistance());
if(sigma>=delta){
fout
fout
//Operator=
private:
int name;
double x;
double y;
};
Edge.h
class Edge {
public:
Edge();
Edge(int i,double dist, Point *x, Point *y){
index = i;
distance = dist;
a.setX(x->getX());
a.setY(x->getY());
a.setName(x->getName());
b.setX(y->getX());
b.setY(y->getY());
b.setName(y->getName());
}
int getAname(){return a.getName();}
double getAx(){return a.getX();}
double getAy(){return a.getY();}
int getBname(){return b.getName();}
double getBx(){return b.getX();}
double getBy(){return b.getY();}
double getDistance(){return distance;}
void display(){cout
asked if they would like the output printed to the screen. If no is selected, the output
may be found in the text file.