ADLD: A Novel Graphical Representation of Protein ...
Transcript of ADLD: A Novel Graphical Representation of Protein ...
Research ArticleADLD A Novel Graphical Representation of Protein Sequencesand Its Application
Lei Wang12 Hui Peng12 and Jinhua Zheng12
1 Key Laboratory of Intelligent Computing amp Information Processing Ministry of Education Xiangtan UniversityXiangtan 411105 China
2 College of Information Engineering Xiangtan University Xiangtan 411105 China
Correspondence should be addressed to Lei Wang phdleiwanggmailcom
Received 14 August 2014 Accepted 25 September 2014 Published 30 October 2014
Academic Editor Qi Dai
Copyright copy 2014 Lei Wang et alThis is an open access article distributed under theCreative CommonsAttribution License whichpermits unrestricted use distribution and reproduction in any medium provided the original work is properly cited
To facilitate the intuitional analysis of protein sequences a novel graphical representation of protein sequences called ADLD(Alignment Diagonal Line Diagram) is introduced in this paper first and then a new ADLD based method is proposed and utilizedto analyze the similaritydissimilarity of protein sequences Comparing with existing methods our ADLD based method is provedto be effective in the similaritydissimilarity analysis of protein sequences and have the merits of good intuition visuality andsimplicity The examinations of the similaritiesdissimilarities for both the 16 different ND5 proteins and the 29 different spikeproteins illustrate the utility of our ADLD based approach
1 Introduction
Homology analysis is one of the hot topics in the areaof protein sequences analysis Up to now lots of methodshave been proposed for the homology analysis of proteinsequences [1ndash3] and among themauseful one is the graphicalrepresentation of protein sequences which is proved to be apowerful tool for visual comparison of protein sequences
At first graphical representation methods were intro-duced for representation of DNA sequences on the basis ofmultiple dimension space [4ndash7] After obtaining the sequenceinvariants from the graphics one can compare the sequencesbased on comparison of sequence invariants Graphicalrepresentation methods were proposed as an alternativeapproach of direct comparison of DNA sequences which arecomputational intensive (even those of a restricted length)[8] Protein sequences are to some degree similar to DNAsequences which are composed of different units Thusthe graphical representation methods can be extended todescribe protein sequences obviously
Currently many researchers have proposed differentmethods for the graphical representation of proteinsequences [9ndash24] For example Feng and Zhang [25]suggested Zp-curve based on the hydrophobicity and
charged properties of amino acid residues along theprimary sequence Randic et al [26] introduced a graphicalrepresentation of protein sequences based on a graphicalrepresentation of triplets of DNA in which the interiorof a square or a tetrahedron is utilized to accommodate64 sites for the 64 codons Bai and Wang [27] derived a2D graphical representation of protein sequences basedon nucleotide triplet codons Yao et al [28] outlined a2D graphical representation of protein sequences basedon two classifications of amino acids Abo el Maaty et al[29] proposed a novel unique 3D graphical representation ofprotein sequences based on three physicochemical propertiesof amino acid side chains Abo-Elkhier introduced a 3Dgraphical representation of protein sequence based on a rightcone of a unit base and unit height on protein sequencesinterfaces [30] El-Lakkani and El-Sherif [31] proposeda graphical representation of protein sequence to helpsimilarity analysis of protein sequences based on 2D and 3Damino acid adjacency matrices Ma et al [32] introduced afamily of Iterated Function Systems (IFS) to outline a 2Dgraphical representation of protein sequences
In most of these existing methods the main draw-backs are that the higher the dimension of the proteinsequence graphs the heavier the computation complexity of
Hindawi Publishing CorporationComputational and Mathematical Methods in MedicineVolume 2014 Article ID 959753 15 pageshttpdxdoiorg1011552014959753
2 Computational and Mathematical Methods in Medicine
the methods or the lower the recognition degree of the pro-tein sequence graphs For example in the methods proposedin [26 28] the main drawback is that the lines will cross eachother which will decrease the visibility of the graphics In themethods proposed in [29ndash31] the main drawbacks are thatthe 3D graphics seem to be more complex and have lowervisibility than the 2D graphics and in addition to obtain thesequence invariants from the graphics complex matrixes arerequired to be constructed which need much computationand storage
Sequence alignment is a way of arranging the sequencesof DNA RNA or protein to identify regions of similaritythat may be a consequence of functional structural orevolutionary relationships between the sequences [33] Upto now there are many kinds of algorithms having beenimplemented for sequence alignment [34ndash37] These meth-ods are usually efficient but complex and time consumingComparing with the alignment methods existing graphicalrepresentation methods can also display the inner structureof the protein sequences and can be utilized to find the sim-ilaritydissimilarity more visible according to their graphicsIn this paper we proposed a novel method for analyzing thesimilaritydissimilarity by combining the idea of the sequencealignment and the graphical representation methods to somedegree avoid the weakness of both of these two methods
Principal components analysis (PCA) is a standard toolin multivariate data analysis to reduce the number of dimen-sions which has been proved to be effective in the processof protein sequence analysis [38ndash40] Therefore in order toovercome the main drawbacks of existing methods in thispaper a novel graphical representation of protein sequencescalled ADLD (Alignment Diagonal Line Diagram) is intro-duced based on PCA and then a newADLD basedmethod isproposed and utilized to analyze the similaritydissimilarityof protein sequences And in addition to validate theeffectiveness of our ADLD based method we adopt it toanalyze the similaritydissimilarity of both the 16 differentND5 proteins and the 29 different spike proteins respectivelywhich are widely used as the test data [16ndash26] The analysisresults show that our method is not only visual intuitionaland effective in the similaritydissimilarity analysis of proteinsequences but also quite simple since there are no highdimensional matrixes required to be constructed
2 Materials and Methods
21 Procedure of OurMethod for Analysis of Protein SequencesIn this section we will illustrate the overall procedures of ourmethod for analyzing protein sequences as follows at first
(1) Select the same 9 different properties for each aminoacid and construct a 20 times 9 matrix as the input dataof the PCA algorithm on the basis of total 20 differentamino acids
(2) According to the PCA algorithm we can obtain aunique feature for each amino acid
(3) For each protein sequence in the test data we willreplace each amino acid in the protein sequence
with its corresponding unique feature and then wecan transform the protein sequence into a numericalsequence
(4) For any two numerical sequences we can drawa graph named ADLD and then abstract somenumerical characteristics of it which can be utilizedto analyze the similaritydissimilarity of these twosequences
Next in Sections 22ndash26 we will introduce the detailsof constructing the ADLDs and obtaining some of thenumerical characteristics of them In Section 31 we will givethe method for constructing the similaritydissimilarity ofour test sequence groups
22 AminoAcids andTheir Properties Proteins are composedof 20 different amino acids and these amino acids have manydifferent physicochemical and biological properties such asthe molecular weight (mW) hydropathy index (hI) the pKavalue for terminal amino acid groups COOH (pK1) the pKavalue for terminal amino acid groupsNH
3
+ (pK2) isoelectricpoint (pI) solubility (119878) the number of triplet codons (cN)frequency of human proteins (119865) and van der Waals radiusof side chains (vR) The names and symbols of the 20 aminoacids and the value of their 9 major properties are illustratedin Table 1
23 Principal Components Analysis Principal componentsanalysis (PCA) is a common technique for dimensionalityreduction and pattern recognition in datasets of high dimen-sion [41] The main purposes of PCA are the analysis ofdata to identify patterns and finding patterns to reduce thedimensions of the dataset with minimal loss of informationThe general steps of conducting PCA are as follows
Step 1 For119898 samples 1198831 1198832 119883
119898 suppose that each119883
119894
has 119899 components 1199091198941 1199091198941 119909
119894119899 let119883
119894= (1199091198941 1199091198941 119909
119894119899)
for 119894 isin 1 2 119898 and then construct an 119898 times 119899 matrix Xaccording to the following formula first
X =
[[[[
[
1198831
1198832
119883119898
]]]]
]
=
[[[[
[
11990911
11990912
sdot sdot sdot 1199091119899
11990921
11990922
sdot sdot sdot 1199092119899
1199091198981
1199091198982
sdot sdot sdot 119909119898119899
]]]]
]
(1)
Next based on thematrixX construct the corresponding119898 times 119899 standardized matrix Xlowast according to the followingformula
Xlowast =
[[[[[[
[
119883lowast
1
119883lowast
2
119883lowast
119898
]]]]]]
]
=
[[[[[[
[
119909lowast
11119909lowast
12sdot sdot sdot 119909
lowast
1119899
119909lowast
21119909lowast
22sdot sdot sdot 119909
lowast
2119899
119909lowast
1198981119909lowast
1198982sdot sdot sdot 119909lowast
119898119899
]]]]]]
]
(2)
where 119883lowast
119894= (119909lowast
1198941 119909lowast
1198942 119909
lowast
119894119899) 119909lowast119894119895
= (119909119894119895minus 119909119895)radicvar(119909
119895) 119909119895=
(1119899)sum119899
119894=1119909119894119895 and var(119909
119895) = (1(119899 minus 1))sum
119899
119894=1(119909119894119895
minus 119909119895)2 for
119894 isin 1 2 119898 and 119895 isin 1 2 119899
Computational and Mathematical Methods in Medicine 3
Table 1 The full list of 20 amino acids and the value of their 9 different properties
Amino acid Symbol mW hI pK1 pK2 pI 119878 cN 119865 () vRAlanine A 89079 18 234 969 601 1672 4 78 67Cysteine C 121145 25 196 1028 507 0 2 19 86Aspartic acid D 133089 minus35 188 96 277 5 2 53 91Glutamic acid E 147116 minus35 219 967 322 85 2 63 109Phenylalanine F 165177 28 183 913 548 276 2 39 135Glycine G 75052 minus04 234 96 597 2499 4 72 48Histidine H 155141 minus32 182 917 759 0 2 23 118Isoleucine I 13116 45 236 968 602 345 3 53 124Lysine K 14617 minus39 218 895 974 739 2 59 135Leucine L 13116 38 236 96 598 217 6 91 124Methionine M 149199 19 228 921 574 562 1 23 124Asparagine N 132104 minus35 202 88 541 285 2 43 96Proline P 115117 16 199 1096 648 1620 4 52 90Glutamine Q 146131 minus35 217 913 565 72 2 42 114Arginine R 174188 minus45 217 904 1076 8556 6 51 148Serine S 105078 minus08 221 915 568 422 6 68 73Tyrosine T 119105 minus07 211 962 587 132 4 59 93Valine V 117133 42 232 962 597 581 4 66 105Tryptophan W 204213 minus09 238 939 589 136 1 14 163Threonine Y 181176 minus13 22 911 566 04 2 32 141
Step 2 Based on thematrixXlowast construct the 119899times119899 correlationmatrix R according to the following formula
R =
[[[[
[
1198771
1198772
119877119899
]]]]
]
=
[[[[
[
11990311
11990312
sdot sdot sdot 1199031119899
11990321
11990322
sdot sdot sdot 1199032119899
1199031198991
1199031198992
sdot sdot sdot 119903119899119899
]]]]
]
(3)
where we can find that 119903119894119895
= sum119898
119896=1(119909lowast
119896119894minus 119909lowast
119894)(119909lowast
119896119895minus 119909lowast
119895)
radicsum119898
119896=1(119909lowast
119896119894minus 119909lowast
119894)2sum119898
119896=1(119909lowast
119896119895minus 119909lowast
119895)2 for 119894 isin 1 2 119899 and
119895 isin 1 2 119899
Step 3 From the correlationmatrixR obtain its 119899 eigenvalues1205821ge 1205822ge sdot sdot sdot ge 120582
119899gt 0 and the corresponding 119899 eigenvectors
1198861=
[[[[
[
11988611
11988621
1198861198991
]]]]
]
1198862=
[[[[
[
11988612
11988622
1198861198992
]]]]
]
119886119899=
[[[[
[
1198861119899
1198862119899
119886119899119899
]]]]
]
(4)
respectively And from now on we can obtain 119899 principalcomponents 119865
119894for 119894 isin 1 2 119899 as follows
119865119894= 1198861119894119883lowast
1+ 1198862119894119883lowast
2+ sdot sdot sdot + 119886
119899119894119883lowast
119899 (5)
Step 4 For each principal component 119865119894for 119894 isin 1 2 119899
obtain its contribution rate CR119894and accumulated contribution
rate ACR119894according to the following formulas respectively
CR119894=
120582119894
sum119899
119896=1120582119896
(6)
ACR119894=
119894
sum
119896=1
CR119896 (7)
Generally in order to lower the computation complexitywe can keep only the first 119905 (119905 le 119899) principal components1198651 1198652 119865
119905 where the accumulated contribution rate of
the 119905th principal component 119865119905shall satisfy the fact that
ACR119905ge 85
Step 5 For 119895 isin 1 2 119905 let
119865119895=
[[[[[[[
[
119865119895
1
119865119895
2
119865119895
119898
]]]]]]]
]
(8)
Then for each 119894 isin 1 2 119898 we can obtain the total scoreof the 119894th sample as follows
TotalScore (119894) =
119905
sum
119896=1
119865119896
119894times CR119896 (9)
4 Computational and Mathematical Methods in Medicine
Table 2 The 9 eigenvalues (120582) of R and the contribution rates (CR)and the accumulative contribution rates (ACR) of the 9 principalcomponents obtained by conducting PCA of the 20 amino acids
Number 120582 CR ACR1 32237 03582 035822 19132 02126 057083 14048 01561 072694 11876 01320 085885 04959 00551 091396 04467 00496 096357 01992 00221 098578 01218 00135 099929 00071 00008 10000
Table 3The 4 eigenvectors 1198861 1198862 1198863 1198864 corresponding to the first
4 eigenvalues in Table 2
1198861
1198862
1198863
1198864
05036 01436 00571 02158minus02454 minus01875 02304 06547minus01634 01820 06298 02288minus03101 minus01883 minus03964 0507100702 06464 minus00786 00532minus01665 04465 minus05280 01877minus03872 03931 01003 minus00532minus04377 01844 02544 minus0227304349 02643 01738 03495
24 PCA of the Amino Acids Observing Table 1 if weconsider the 20 amino acids as 20 different samples and the9 properties of each amino acid as its 9 components thenaccording to the general steps of conducting PCA illustratedin Section 23 we can obtain a 20 times 9 matrix X and itsstandardized matrix Xlowast a 9 times 9 correlation matrix R and9 principal components 119865
1 1198652 119865
9 And therefore as
illustrated in Table 2 we can obtain the 9 eigenvalues of Rand the contribution rates and the accumulative contributionrates of the 9 principal components 119865
1 1198652 119865
9 respec-
tivelyFrom Table 2 we can see that the accumulative con-
tribution rate of the first 4 principal components amountsto 08588 (=8588) which is already bigger than 85Therefore we can keep the first 4 principal components onlyLet 120582
1 1205822 1205823 1205824 be the 4 eigenvalues corresponding to the
first 4 principal components respectively then as illustratedin Table 3 we can obtain the 4 eigenvectors 119886
1 1198862 1198863 1198864
corresponding to the 4 eigenvalues 1205821 1205822 1205823 1205824 separately
Based on Table 3 we can obtain the first 4 principalcomponents 119865
1 1198652 1198653 1198654 as follows
1198651= 05036119883
lowast
1minus 02454119883
lowast
2minus 01634119883
lowast
3
minus 03101119883lowast
4+ 00702119883
lowast
5minus 01665119883
lowast
6
minus 03872119883lowast
7minus 04377119883
lowast
8+ 04349119883
lowast
9
Table 4 The total scores of the 20 amino acids
Symbols of amino acids Total scoresA minus09324C minus05985D minus06709E minus02296F 04298G minus11780H 04476I 01435K 07868L minus01205M 05735N minus00242P minus09822Q 02848R 11169S minus07077T minus04525V minus02643W 14729Y 09050
1198652= 01436119883
lowast
1minus 01875119883
lowast
2+ 01820119883
lowast
3
minus 01883119883lowast
4+ 06464119883
lowast
5+ 04465119883
lowast
6
+ 03931119883lowast
7+ 01844119883
lowast
8+ 02643119883
lowast
9
1198653= 00571119883
lowast
1+ 02304119883
lowast
2+ 06298119883
lowast
3
minus 03964119883lowast
4minus 00786119883
lowast
5minus 05280119883
lowast
6
+ 01003119883lowast
7+ 02544119883
lowast
8+ 01738119883
lowast
9
1198654= 02158119883
lowast
1+ 06547119883
lowast
2+ 02288119883
lowast
3
+ 05071119883lowast
4+ 00532119883
lowast
5+ 01877119883
lowast
6
minus 00532119883lowast
7minus 02273119883
lowast
8+ 03495119883
lowast
9
(10)
Observing the above 4 formulas it is easy to find thatthere are three big coefficients in the first formula whichare 05036 (corresponding to mW) 04377 (corresponding to119865) and 04349 (corresponding to vR) respectivelyThereforeit means that the three properties such as mW 119865 andvR will have a major role in the first principal component1198651 Similarly we can also know that the three properties
such as pI 119878 and cN will have a major role in the secondprincipal component 119865
2 the third principal component 119865
3is
mainly determined by pK1 and 119878 and the fourth principalcomponent 119865
4is closely linked with hI and pK2 and so forth
Hence we can obtain the total scores of the 20 amino acids asillustrated in Table 4 according to formula (9)
Computational and Mathematical Methods in Medicine 5
25 Numerical Sequences of Protein Sequences LetΩ = AC
DE FGH IK LMNPQR STVWY and supposethat Ψ = 119901
111990121199013 119901119873represents a protein sequence with
119873 amino acids where 119901119894
isin Ω for 119894 isin 1 2 119873 thenwe can obtain a numerical sequence 119878
Ψ= (1199051 1199052 119905
119873)
corresponding to the protein sequence Ψ through replacingeach amino acid 119901
119894in Ψ with its corresponding value of
TotalScore(119894) for 119894 isin 1 2 119873For example consider the following 3 abbreviated protein
sequences
Hu = MTMHTTMTTLGor = MTMYATMTTLOpo = MKVINISNTM
According to the above descriptions and Table 4 thenwe can obtain their corresponding numerical sequences asfollows119878Hu = 05735 minus04525 05735 04476 minus04525 minus04525
05735 minus04525 minus04525 minus01205
119878Gor = 05735 minus04525 05735 09050 minus09324 minus04525
05735 minus04525 minus04525 minus01205
119878Opo = 05735 07868 minus02643 01435 minus00242 01435
minus07077 minus00242 minus04525 05735
(11)
26 ASDs and ADLDs of Protein Sequence Pairs For agiven protein sequence pair (119904
1 1199042) suppose that the protein
sequence 1199041includes 119873
1amino acids 119904
2includes 119873
2amino
acids and 1198731
⩾ 1198732 then in order to measure the
similaritydissimilarity between them in this section we willpresent a new method called Alignment Scatter Diagram(ASD) to plot the two sequences into a scatter diagram firstAnd for convenience we call the points in the ASD thealignment-plots (APs) The ASD of the protein sequence pair(1199041 1199042) can be obtained through the following steps
Step 1 According to the method given in Section 25 trans-late the protein sequence pair (119904
1 1199042) into two numerical
sequences with the same length as follows
1198781= 1199051 1199052 119905
1198731
1198782= 1199051 1199052 119905
1198732
0 0 0⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟
1198731minus1198732
(12)
Step 2 Let 119908 be the alignment width (AW) of the proteinsequence pair (119904
1 1199042) that is let 119904
1= 1199011 1199012 1199013 119901
1198731
1199042=
1199021 1199022 1199023 119902
1198732
then for any amino acid 119901119894in the protein
sequence 1199041 we will compare it with these 2119908 + 1 amino
acids 119902119894minus119908
119902119894minus1
119902119894 119902119894+1
119902119894+119908
in the protein sequence1199042 and then 119908 can be simply defined as follows
119908 =
120585 if 1198731minus 1198732le 120585
1198731minus 1198732 else
(13)
where 120585 gt 0 is a given threshold to guarantee that the AWof the protein sequence pair (119904
1 1199042) will not be too small to
expose the association of the inner structures of the proteinsequence pair (119904
1 1199042) In actual applications we suggest that
120585 shall be no less than 10
Step 3 Let 120576 gt 0 be the dissimilarity degree (DD) of twoamino acids that is if 120576 = 0 then it means that the two aminoacids are the same otherwise it means that the two aminoacids are different from each other to some degree and thenthe APs in the ASD of the protein sequence pair (119904
1 1199042) can
be briefly defined as follows
119860120576
119894119895= Θ (
10038161003816100381610038161003816119905119894minus 119905119895
10038161003816100381610038161003816minus 120576) (14)
where 119894 isin 1 2 1198731 119895 isin 1 2 119873
1 and Θ is a
Heaviside function which can be defined as follows
Θ (119909) =
1 if 119909 le 0
0 else(15)
Thereafter we can obtain an 1198731times 1198731alignment matrix
(AM) as follows
AM = (119860120576
119894119895)1198731times1198731
(16)
Step 4 For the1198731times1198731elements in the alignmentmatrix AM
we can plot points on 119894-119895 plane for these elements in the AMwith 119860
120576
119894119895= 1 and |119894 minus 119895| le 119908 And for convenience we call
the obtained graph the Alignment Scatter Diagram (ASD) ofthe protein sequence pair (119904
1 1199042)
For example considering the three 120573-globin pro-tein sequences of chimpanzee [GenBank AAA163341]human [GenBank CAA262041] and gorilla [GenBankCAA434211] obtained from the GenBank respectively weillustrate the ASDs of the 120573-globin protein sequence pair(chimpanzee human) and the 120573-globin protein sequencepair (human gorilla) in Figures 1(a) and 1(b) separately whileletting 120576 = 0
From Figure 1 it is easy to see that there are lots of disor-dered points in these ASDs which will lower the visuality ofthe ASDs remarkably and obstruct us fromdistinguishing thesimilaritydissimilarity between the protein sequence pairsintuitively while observing these ASDsTherefore in order toimprove the intuition of theASD wewill propose a simplifiedvariant diagram of the ASD which is called the AlignmentDiagonal Line Diagram (ADLD)
For convenience in an ASD we call its main diagonalline the artery tracks (ATs) and the lines parallelling to itsmain diagonal line the by-path tracks (BTs) respectively Andin addition we define a set consisting with no less than 120575
consecutive APs on the AT or BTs as a CAPS where 120575 ge 1
is a given thresholdFor a given CAPS caps
1 if there is no CAPS caps
2
satisfying caps1
sub caps2 then we call the caps
1a maximum
CAPS And for convenience we call the line formed byconnecting all of the APs in a maximum CAPS a similarfragment (SF) and simultaneously we call all of the APs onthe AT but not on any SFs the free points (FPs)
6 Computational and Mathematical Methods in Medicine
Obviously in an ASD if keeping all of the SFs and FPsonly and omitting all those other APs then we will obtain asimplified variant diagram of the ASD and for conveniencewe call it the Alignment Diagonal Line Diagram (ADLD)Apparently if 120575 = 1 then an ADLD will degenerate into anASD Therefore in actual applications we suggest that 120575 willbe no less than 2 And particularly in order to find moreaccurate SFs in the ADLD of a protein sequence pair thelonger the protein sequences in the protein sequence pair arethe bigger the value of 120575 shall be
For convenience of analysis in an ADLD suppose thatthere are 119870
1different SFs and 119870
2different FPs on its AT
119870 different BTs locating above its AT and 119870 different BTslocating below its AT then we get the following
(1) For these 1198701different SFs and 119870
2different FPs
on the AT of the ADLD we will number these1198701SFs and 119870
2FPs from left to right and utilize
ASF1ASF2 ASF1198701 and FP1 FP2 FP1198702 torepresent these 119870
1SFs and 119870
2FPs separately And
in addition we would also call these SFs on the AT ofthe ADLD the ASFs
(2) For these 119870 different BTs locating above the AT wewill number these BTs from down to up and utilizeBT1BT2 BT
119870 to represent these BTs separately
and for these 119870 different BTs locating below theAT we will number these BTs from up to down andutilize BT
minus1BTminus2
BTminus119870
to represent these BTsseparately
(3) For each BT119897 where 119897 isin 1 2 119870 suppose
that there are 1198703different SFs on the BT
119897 then we
will number these 1198703SFs from left to right and
utilize BSF1119897BSF2119897 BSF1198703
119897 to represent these SFs
separately And in addition we would also call theseSFs on the BTs of the ADLD the BSFs
According to the above assumptions in Figure 2 we showthe two ADLDs corresponding to the ASDs illustrated inFigures 1(a) and 1(b) while letting 120575 = 3 And in additionto make the ADLDs more visual and intuitional in Figure 2we use the red ldquolowastrdquo to represent the FPs on the AT and the bluelines to represent the SFs on the AT or BTs
From Figure 2(a) it is easy to see that there are two SFsin the ADLD of the sequence pair (chimpanzee human)one is ASF1 that is the line segment from the point (1 1)
to the point (32 32) and the other is BSF1minus4 that is the
line segment from the point (35 31) to the point (125 121)And in addition there are totally 6 FPs in the ADLD whichare FP1(46 46) FP2(66 66) FP3(111 111) FP4(114 114)FP5(115 115) and FP6(123 123) respectively
Observing Figure 2(b) we can easily find that there arealso two SFs in the ADLD of the sequence pair (humangorilla) But different from that in Figure 2(a) the two SFsin Figure 2(b) are both ASFs one is ASF1 that is the linesegment from the point (1 1) to the point (104 104) andthe other is ASF2 that is the line segment from the point(106 106) to the point (121 121) And in addition the twoASFs in Figure 2(b) are separated by one gap and there existno FPs or BSFs on the AT or BTs
Through analysis we can know that for a given proteinsequence pair if there exist some deletions or insertions ofamino acid segments between the two protein sequencesthen there will exist some misalignments of SFs in theirADLD that is some ASFs on the AT will be transformedinto BSFs on some BTs And in addition if there exist somesubstitutions of the amino acids between the two proteinsequences then in their ADLD there will exist some gapsbetween two neighboring SFs or FPs on the AT Furthermoreif there exist some insertions deletions or substitutions of theamino acid segments at the end of the two protein sequencesthen in their ADLD there will exist no SFs or FPs on the ATor BTs
From the above descriptions it is easy to know thatthe ADLD of any given protein sequence pair obtainedby our above proposed method reflects some inner andspecific differences between these two protein sequences inthe given protein sequence pair which may be useful in thesimilaritydissimilarity analysis of protein sequence pairs
3 Results and Discussion
31 Method for SimilarityDissimilarity Analysis of ProteinSequences Based on the ADLDs According to the aboveanalysis we have known that the ADLDs may be useful inanalyzing the differences of the inner structures of proteinsequence pairs In this section we will show how to utilizethe ADLDs to analyze the similaritydissimilarity of a groupof protein sequences
Generally suppose that there are 119873 protein sequencesΨ1 Ψ2 Ψ
119873 then while applying the ADLDs to analyze
the similaritydissimilarity of these119873 sequences the similar-itydissimilarity matrix of these119873 sequences can be obtainedthrough the following steps
Step 1 According to the method given in Section 25 trans-form these 119873 protein sequences into119873 numerical sequences1198781 1198782 119878
119873
Step 2 For a given protein sequence pair Ψ119886 Ψ119887 119886 isin
1 2 119873 119887 isin 1 2 119873 we can obtain their ADLDthrough adopting the method proposed in Section 26 andthen we can obtain all of the SFs (including ASFs and BSFs)and FPs in the ADLD Hence we can obtain the lengths oftheseASFs the lengths of these BSFs and the number of theseFPs respectively
Step 3 Suppose that there are totally 1198711different ASFs
such as ASF1ASF2 ASF1198711 1198712different BSFs such
as BSF11198971
BSF21198972
BSF11987121198971198712
and 1198713different FPs such as
FP1 FP2 FP1198713 in the ADLD And in addition for eachASF119894 and BSF119895
119897119895
let their length be length119894 and length119895
respectively where 119894 isin 1 2 1198711 and 119895 isin 1 2 119871
2
then we can define the similarity degree (SD) of Ψ119886 Ψ119887 as
follows
SD (Ψ119886 Ψ119887) =
1198711
sum
119894=1
length119894 +1198712
sum
119895=1
length119895+ 1198713 (17)
Computational and Mathematical Methods in Medicine 7
150
100
50
0150100500
(a)
150
100
50
0150100500
(b)
Figure 1 (a)TheASD of the 120573-globin protein sequence pair (chimpanzee human) with 120585 = 12 (b) the ASD of the 120573-globin protein sequencepair (human gorilla) with 120585 = 16
150
100
50
0150100500
(a)
150
100
50
0
150100500
(b)
Figure 2 (a) The ADLD of the protein sequence pair (chimpanzee human) (b) the ADLD of the protein sequence pair (human gorilla)
And therefore according to these 119873 protein sequencesΨ1 Ψ2 Ψ
119873 we can obtain an 119873 times 119873 matching matrix
(MM) as follows
MM = [119889119894119895]119873times119873
(18)
where
119889119894119895
=
SD (Ψ119894 Ψ119895) if 119894 ge 119895
0 else
for 119894 isin 1 2 119873 119895 isin 1 2 119873
(19)
Step 4 Based on the matching matrixMM and all of its com-ponents 119889
119894119895 where 119894 isin 1 2 119873 and 119895 isin 1 2 119873 then
we can obtain an 119873 times 119873 similaritydissimilarity matrix (SM)of these 119873 protein sequences Ψ
1 Ψ2 Ψ
119873 as follows
SM = [119904119894119895]119873times119873
(20)
where
119904119894119895
=
1 minus Λ(
119889119894119895
119889119894119894
) if 119894 ge 119895
0 else
Λ(
119889119894119895
119889119894119894
) =
1 if119889119894119895
119889119894119894
ge 1
119889119894119895
119889119894119894
else
for 119894 isin 1 2 119873 119895 isin 1 2 119873
(21)
According to the above steps we present an examplethrough implementing the ADLDs to analyze the similar-itydissimilarity of 16 ND5 proteins (illustrated in Table 5)while letting 120575 = 3 and illustrate the results of similar-itydissimilarity matrix in Table 6
Observing Table 6 it is easy to find that there are somesimilar pairs such as (c-chim pi-chim) with the distance00510 (human c-chim) with the distance 00814 (human
8 Computational and Mathematical Methods in Medicine
Table 5 The basic information of 16 ND5 protein sequences
Number Name Abbreviation Access number Length1 Human Human ADT804301 6032 Gorilla Gorilla NP 008222 6033 Pigmy chimpanzee Pi-chim NP 008209 6034 Common chimpanzee C-chim NP 008196 6035 Fin-whale Fin-whale NP 006899 6066 Blue-whale Blue-whale NP 007066 6067 Rat Rat AP 0049021 6108 Mouse Mouse NP 904338 6079 Opossum Opossum NP 007105 60210 Sheep Sheep ABW229031 60611 Goat Goat BAN592581 60612 Lemur Lemur CAD134311 60313 Cattle Cattle ADN119021 60614 Hare Hare CAD132911 60315 Gallus Gallus BAE160361 60516 Rabbit Rabbit NP 0075591 603
Sheep
Goat
Cattle
Fin-whale
Blue-whale
Lemur
Hare
Rabbit
Gorilla
Human
Pi-chim
C-chim
Rat
Mouse
Opossum
Gallus
00383
00468
00255
00255
00162
00162
00942
00942
02169
00356
00356
01084
00540
00419
02311
00419
00855
00128
00184
00085
00665
01080
00477
00770
00243
00176
00288
00163
00458
00142
000005010015020
Figure 3 The phylogenetic tree of the 16 species based on the ADLDs based method
pi-chim) with the distance 00720 (gorilla c-chim) with thedistance 00865 (gorilla pi-chim) with the distance 00833and (fin-whale blue-whale) with the distance 00324 Andamong them the opossum seems to be a peculiar mammalsince the shortest distance between it and the remaining
mammals is more than 04023 Obviously the result isconsistent with the fact that opossum is the most remotespecies from the remaining mammals
Additionally gallus seems to be more peculiar thanopossum since the shortest distance between it and
Computational and Mathematical Methods in Medicine 9
Table6Th
esim
ilaritydissim
ilaritymatrix
forthe
16ND5proteins
basedon
theA
DLD
sbased
metho
d
Hum
anGorilla
Pi-chim
C-chim
Fin-whale
Blue-w
hale
rat
mou
seop
ossum
sheep
goat
lemur
cattle
hare
gallu
srabbit
Hum
an000
00Gorilla
01111
00000
Pi-chim
007
20008
33000
00C-
chim
008
14008
65005
10000
00Fin-whale
03396
03285
03222
03301
00000
Blue-w
hale
03474
03333
03285
03301
00324
000
00Ra
t03693
03622
03636
03716
03333
03381
000
00Mou
se03740
03686
03716
03748
03317
03333
01883
000
00Opo
ssum
044
7604551
04290
044
1804515
04519
04513
044
79000
00Sheep
03020
02933
02871
02951
02023
02067
03149
03219
04121
00000
Goat
03036
02901
02871
02919
01958
02147
03166
03468
04202
007
12000
00Lemur
02989
02708
02839
02967
02557
02724
03166
03670
040
5502087
02317
000
00Ca
ttle
03114
03045
03046
03062
01958
02051
03149
03173
04235
009
0601254
02184
000
00Hare
03146
03157
03062
03046
02832
02788
03166
03421
04023
02217
02508
02053
02532
000
00Gallus
04726
04920
04737
05008
044
50044
2304903
04743
04691
04239
04524
046
8004183
046
6000000
Rabb
it03255
03189
03222
03142
02896
02756
03084
03390
04332
02184
02603
02282
02612
008
37044
34000
00
10 Computational and Mathematical Methods in Medicine
the remaining animals is more than 04423 which is biggerthan 04023 (the shortest distance between Opossum andthe remaining mammals) Obviously the result is consistentwith the fact that gallus is not a kind of mammal
Therefore it is apparent that the results illustrated inTable 6 are wholly consistent with the results of the knownfact of evolution That is to say our ADLDs based methodcan be utilized as an effective way to analyze the similari-tiesdissimilarities of protein sequences
32 The Phylogenetic Tree of the Protein Sequences Based onthe ADLDs A phylogenetic tree is a diagram that is used torepresent the evolutionary relationships of organisms that arethought to have a common ancestry and it is a commonlyused tool for researchers in some fields to help them analyzethe clustering of different species
Obviously only through observing the similaritydissimilaritymatrix illustrated inTable 6 wewill find that it isnot very convenient to distinguish the similaritydissimilarityof protein sequences Therefore in order to show thesimilaritydissimilarity of the protein sequences more vividlyand intuitively according to the similaritydissimilaritymatrix illustrated in Table 6 then we will construct thephylogenetic tree of the above 16 ND5 proteins throughadopting the software MEGA 606 that is provided byTamura et al [41] and the result is illustrated in Figure 3
From Figure 3 it is obvious that we can not only findout the evolutionary relationships of these 16 ND5 proteinsequences visually and intuitively but also know easily thatthe constructed phylogenetic tree is consistent with theresults of the known fact of evolution to some degree
To further validate the performance of our ADLDs basedmethod we applied our method to analyze the similar-itydissimilarity of another group of proteins including 29spike proteins of coronavirus and compared ourmethodwiththe method proposed by Wen and Zhang [17] based on theabove given 16 ND5 proteins and the following 29 spikeproteins respectively The basic information of the 29 spikeproteins is illustrated in Table 7
For the 29 spike proteins illustrated in Table 7 weconstruct the phylogenetic tree in Figure 4 Since the spikeprotein sequences are very long (with more than 1100 aminoacids) therefore during simulation we set 120575 = 5 to avoid theeffect of noise points
Generally coronavirus can always be classified into fourclasses such as the Group I the Group II the Group IIIand the SARS-CoVs (Severe Acute Respiratory SyndromeCoronaviruses) And among these four classes the GroupI includes the Canine coronavirus (CCoV) the Feline coron-avirus (FCoV) the Human coronavirus 229E (HCoV-229E)the Porcine epidemic diarrhea virus (PEDV) and the Trans-missible gastroenteritis virus (TGEV) The Group II includesthe Bovine coronavirus (BCoV) Human coronavirus OC43(HCoV-OC43) the Murine coronavirus Mouse hepatitisvirus (MHV) the Porcine hemagglutinating encephalomyeli-tis virus (HEV) and the Rat coronavirus (RtCoV)TheGroupIII contains theAvian infectious bronchitis virus (IBV) and theTurkey coronavirus (TCoV)
Table 7 The basic information of 29 spike proteins
Number Access number Abbreviation Length1 CAB91145 TGEVG 14472 NP 058424 TGEV 14473 AAK38656 PEDVC 13834 NP 598310 PEDV 13835 NP 937950 HCoVOC43 13616 AAK83356 BCoVE 13637 AAL57308 BCoVL 13638 AAA66399 BCoVM 13639 AAL40400 BCoVQ 136310 AAB86819 MHVA 132411 YP 209233 MHVJHM 137612 AAF69334 MHVP 132113 AAF69344 MHVM 132414 AAP92675 IBVBJ 116915 AAS00080 IBVC 116916 NP 040831 IBV 116217 AAS10463 GD03T0013 125518 AAU93318 PC4127 125519 AAV49720 PC4137 125520 AAU93319 PC4205 125521 AAU04646 civet007 125522 AAU04649 civet010 125523 AAV91631 A022 125524 AAP51227 GD01 125525 AAS00003 GZ02 125526 AAP30030 BJ01 125527 AAP50485 FRA 125528 AAP41037 TOR2 125529 AAQ01597 TaiwanTC1 1255
From observing Figure 4 it is easy to know that the 29spike proteins of coronavirus can be perfectly classified intothe above four classes by our ADLDs based method
Finally for the convenience of comparison we illustratethe phylogenetic trees of the above given 29 spike proteins ofcoronavirus and 16ND5proteins constructed by adopting themethod proposed byWen and Zhang [17] in Figures 5 and 6respectively
Comparing Figure 3 with Figure 6 and Figure 4 withFigure 5 respectively it is obvious that the phylogenetic treesobtained by the method proposed by Wen and Zhang arequite unreasonable and not consistent with the known factsof evolution at all But on the contrary the phylogenetictrees obtained by our ADLDs based method are not onlyquite reasonable but also consistent with the known facts ofevolution to some degree Therefore there is no doubt thatthe performance of our method is much better than that ofthe method proposed by Wen and Zhang
33 The Analysis of Intuition and Visuality of the ADLDsIn Section 26 we have stated that the ADLDs of proteinsequence pairs are intuitional and visual In this section we
Computational and Mathematical Methods in Medicine 11
BCoVE
BCoVL
BCoVM
BCoVQ
HCoVOC43
MHVJHM
MHVP
MHVA
MHVM
TGEVG
TGEV
PEDVC
PEDV
TOR2
TaiwanTC1
FRA
BJ01
GD01
GZ02
civet007
A022
civet010
PC4137
GD03T0013
PC4127
PC4205
IBV
IBVBJ
IBVC
00000
00000
00000
00000
00369
00026
00026
00022
00022
00019
00559
00221
00019
00950
00950
01221
00010
00004
00015
00004
00004
00006
00004
00012
00012
00011
00006
00004
00004
04350
04350
00002
00002
00006
00005
00020
00005
00019
00018
00011
00202
00116
00112
00044
00040
04560
00232
00337
03498
03309
0027203461
00713
00230
00049
00052
Group II
Group I
Group III
SARS CoVs
Figure 4 The phylogenetic tree of the 29 spike proteins of coronavirus constructed by adopting the ADLDs based method with 120575 = 5
will further discuss the intuition and visuality of the ADLDsin detail
From Table 6 we can obtain some similar pairs suchas (fin-whale blue-whale) (pi-chim c-chim) (Human c-chim) (cheep goat) (human pi-chim) and (hare rabbit)and some dissimilar pairs such as (human opossum) and(human gallus) among the above given 16 ND5 proteinsFrom these similardissimilar pairs wewill choose three pairsincluding (human gorilla) (human opossum) and (humangallus) as examples to further show the intuition and visualityof the ADLDs of these three protein sequence pairs TheADLDs of these three similardissimilar pairs are illustratedin Figure 7 while letting 120575 = 3
Observing Figure 7 we can clearly find that the totallength of all of the SFs in each of these three ADLDs satisfiesthe total length of all of the SFs in the ADLD of Figure 7(a) gtthe total length of all of the SFs in the ADLD of Figure 7(b) gtthe total length of all of the SFs in the ADLD of Figure 7(c)Therefore we can intuitively identify that the similarity of theproteins in each of these three protein sequence pairs satisfiesthe similarity of the proteins in the pair (human gorilla) gt thesimilarity of the proteins in the pair (human opossum) gt thesimilarity of the proteins in the pair (human gallus)
Moreover from Figure 7 we can also intuitively identifythat the two protein sequences in the protein sequence pair(human gorilla) are very similar to each other since the total
12 Computational and Mathematical Methods in Medicine
PEDVC
PEDV
GD03T0013
BJ01
GD01
MHVA
PC4205
GZ02
TOR2
MHVM
PC4137
FRA
PC4127
TaiwanTC1
MHVP
A022
civet007
civet010
TGEVG
TGEV
BCoVM
BCoVQ
HCoVOC43
IBVC
MHVJHM
BCoVE
BCoVL
IBVBJ
IBV
00000
00000
00000
00000
00013
00006
00006
00001
00001
00000
00012
00002
00001
00010
00013
00010
00000
00000
00000
00000
00000
00000
00001
00001
00000
00000
00000
00001
0000000000
00000
00000
00006
00000
00000
00001
00000
00000
00001
00002
00001
00000
00002
00004
00002
00003
00003
00008
00006
00008
00067
00022
00013
00013
00008
00042
Figure 5 The phylogenetic tree of the 29 spike proteins of coronavirus constructed by adopting the method proposed by Wen and Zhang
length of all of the SFs in the ADLD of Figure 7(a) looksvery long But on the contrary we can intuitively identifythat the two protein sequences in either the protein sequencepair (human opossum) or the protein sequence pair (humangallus) are apparently dissimilar to each other since both thetotal length of all of the SFs in the ADLD of Figure 7(b) andthat in the ADLD of Figure 7(c) look very short
And through statistic we can know that the actual totallengths of all of the SFs in the ADLDs of these three proteinsequence pairs (human gorilla) (human opossum) and(human gallus) are 556 288 and 248 respectively
Additionally observing Figures 2(a) and 2(b) hardly canwe distinguish the total length of all of the SFs (includingASFs and BSFs) in the ADLD of Figure 2(a) and that inthe ADLD of Figure 2(b) since the total lengths of all ofthe SFs in these two ADLDs look nearly the same Andthrough statistic we can know that the actual total lengthsof all of the SFs in the ADLDs of Figures 2(a) and 2(b)are 123 and 120 respectively and are really close to eachother But through comparing Figure 2(a) with Figure 2(b)more carefully we can further discover that different fromFigure 2(b) except for the SFs there are also 6 different FPs
Computational and Mathematical Methods in Medicine 13
Fin-whale
Rabbit
Blue-whale
Sheep
Goat
Hare
Rat
Lemur
Mouse
Cattle
Opossum
Gorilla
Human
Pi-chim
C-chim
Gallus
00006
00037
00004
00004
00001
00002
00000
00001
00074
00002
00015
00000
00001
00011
00183
00001
00006
00006
00004
00005
00002
00005
00031
00008
00024
00020
00039
00060
00024
00086
Figure 6 The phylogenetic tree of the 16 ND5 proteins constructed by adopting the method proposed by Wen and Zhang
700
600
500
400
300
200
100
07006005004003002001000
(a)
700
600
500
400
300
200
100
07006005004003002001000
(b)
700
600
500
400
300
200
100
07006005004003002001000
(c)
Figure 7 (a) The ADLD of the similar pair (human gorilla) (b) the ADLD of the dissimilar pair (human opossum) (c) the ADLD of thedissimilar pair (human gallus)
14 Computational and Mathematical Methods in Medicine
in the ADLD of Figure 2(a) while there are no FPs in theADLD of Figure 2(b) therefore we can intuitively identifythat the two protein sequences in the protein sequence pair(chimpanzee human) are more similar to the two proteinsequences in the protein sequence pair (human gorilla)
Hence from the above descriptions we can know that theADLDs obtained by our newly proposed method are quitevisual and intuitional andmay be a powerful and effective toolfor visual comparison of protein sequences and numericalsequences in other research fields
4 Conclusions
In this paper a novel ADLDs based graphical representationof protein sequences is proposed which is utilized to analyzethe similaritydissimilarity of protein sequences To validatethe performances of the newmethod we select two groups ofwell-known protein sequences as examples and additionallyin order to observe the similaritydissimilarity of proteinsequences more intuitively we construct the phylogenetictrees of protein sequences The results show that our ADLDsbased method not only has good performances and effects inthe similaritydissimilarity analysis of protein sequences butalso does not require complex computation since there areno high dimensional matrixes required Therefore it meansthat our ADLDs based method can work well in the analysisof protein sequences
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgments
The authors thank the anonymous referees for suggestionsthat helped in improving the paper substantially And theproject is partly sponsored by the Colleges and UniversitiesOpen Innovation Platform Fund of Hunan Province (no13K041) the Hunan Provincial Natural Science Foundationof China (no 14JJ2070) the Construct Program of the KeyDiscipline in Hunan province the State Education MinistryScientific Research Foundation for the Returned OverseasChinese Scholars and the Introduced Talent Start-up FundProject of Xiangtan University (no 11QDZ45)
References
[1] K Bucka-Lassen O Caprani and J Hein ldquoCombining manymultiple alignments in one improved alignmentrdquo Bioinformat-ics vol 15 no 2 pp 122ndash130 1999
[2] L Wang and T Jiang ldquoOn the complexity of multiple sequencealignmentrdquo Journal of Computational Biology vol 1 no 4 pp337ndash348 1994
[3] C Shyu L Sheneman and J A Foster ldquoMultiple sequencealignment with evolutionary computationrdquo Genetic Program-ming and Evolvable Machines vol 5 no 2 pp 121ndash144 2004
[4] M Randic ldquoGraphical representations of DNA as 2-D maprdquoChemical Physics Letters vol 386 no 4ndash6 pp 468ndash471 2004
[5] D Bielinska-Waz T Clark P Waz W Nowak and A Nandyldquo2D-dynamic representation of DNA sequencesrdquo ChemicalPhysics Letters vol 442 no 1ndash3 pp 140ndash144 2007
[6] Z Qi and X Qi ldquoNovel 2D graphical representation of DNAsequence based on dual nucleotidesrdquo Chemical Physics Lettersvol 440 no 1ndash3 pp 139ndash144 2007
[7] W Chen B Liao Y Liu and Z Su ldquoA numerical representationof DNA sequences and its applicationsrdquoMATCH Communica-tions in Mathematical and in Computer Chemistry vol 60 no2 pp 291ndash300 2008
[8] M Randic X Guo and S C Basak ldquoOn the characterization ofDNAprimary sequences by triplet of nucleic acid basesrdquo Journalof Chemical Information and Computer Sciences vol 41 no 3pp 619ndash626 2001
[9] M Randic M Vracko M Novic and D Plavsic ldquoSpectralrepresentation of reduced protein modelsrdquo SAR and QSAR inEnvironmental Research vol 20 no 5-6 pp 415ndash427 2009
[10] M Randic K Mehulic D Vukicevic T Pisanski D Vikic-Topic and D Plavsic ldquoGraphical representation of proteins asfour-color maps and the irnumerical characterizationrdquo Journalof Molecular Graphics and Modelling vol 27 no 5 pp 637ndash6412009
[11] F Bai and TWang ldquoOn graphical and numerical representationof protein sequencesrdquo Journal of Biomolecular Structure andDynamics vol 23 no 5 pp 537ndash545 2006
[12] M Randic ldquo2-D Graphical representation of proteins based onphysico-chemical properties of amino acidsrdquo Chemical PhysicsLetters vol 440 no 4ndash6 pp 291ndash295 2007
[13] A Ghosh and A Nandy ldquoGraphical representation and mathe-matical characterization of protein sequences and applicationsto viral proteinsrdquo Advances in Protein Chemistry and StructuralBiology vol 83 pp 1ndash42 2011
[14] C Li X Yu L Yang X Zheng and Z Wang ldquo3-D maps andcoupling numbers for protein sequencesrdquo Physica A StatisticalMechanics and its Applications vol 388 no 9 pp 1967ndash19722009
[15] M Randic J Zupan and D Vikic-Topic ldquoOn representation ofproteins by star-like graphsrdquo Journal of Molecular Graphics andModelling vol 26 no 1 pp 290ndash305 2007
[16] C Li L Xing and X Wang ldquo2-D graphical representation ofprotein sequences and its application to coronavirus phylogenyrdquoJournal of Biochemistry and Molecular Biology vol 41 no 3 pp217ndash222 2008
[17] J Wen and Y Zhang ldquoA 2D graphical representation of proteinsequence and its numerical characterizationrdquo Chemical PhysicsLetters vol 476 no 4ndash6 pp 281ndash286 2009
[18] Z-C Wu X Xiao and K-C Chou ldquo2D-MH a web-server forgenerating graphic representation of protein sequences basedon the physicochemical properties of their constituent aminoacidsrdquo Journal of Theoretical Biology vol 267 no 1 pp 29ndash342010
[19] B Liao X Sun and Q Zeng ldquoA novel method for similarityanalysis and protein sub-cellular localization predictionrdquo Bioin-formatics vol 26 no 21 pp 2678ndash2683 2010
[20] M Novic and M Randic ldquoRepresentation of proteins as walksin 20-D spacerdquo SAR and QSAR in Environmental Research vol19 no 3-4 pp 317ndash337 2008
[21] Z-H Qi J Feng X-Q Qi and L Li ldquoApplication of 2Dgraphic representation of protein sequence based on Huffmantree methodrdquo Computers in Biology and Medicine vol 42 no 5pp 556ndash563 2012
Computational and Mathematical Methods in Medicine 15
[22] H-J Yu and D-S Huang ldquoNovel 20-D descriptors of proteinsequences and itrsquos applications in similarity analysisrdquo ChemicalPhysics Letters vol 531 pp 261ndash266 2012
[23] P-A He J Wei Y Yao and Z Tie ldquoA novel graphical repre-sentation of proteins and its applicationrdquo Physica A StatisticalMechanics and its Applications vol 391 no 1-2 pp 93ndash99 2012
[24] M Randic M Novic andM Vracko ldquoOn novel representationof proteins based on amino acid adjacency matrixrdquo SAR andQSAR in Environmental Research vol 19 no 3-4 pp 339ndash3492008
[25] Z-P Feng andC-T Zhang ldquoA graphic representation of proteinsequence andpredicting the subcellular locations of prokaryoticproteinsrdquo International Journal of Biochemistry and Cell Biologyvol 34 no 3 pp 298ndash307 2002
[26] M Randic J Zupan and A T Balaban ldquoUnique graphicalrepresentation of protein sequences based on nucleotide tripletcodonsrdquo Chemical Physics Letters vol 397 no 1ndash3 pp 247ndash2522004
[27] F Bai and T Wang ldquoA 2-D graphical representation of proteinsequences based on nucleotide triplet codonsrdquoChemical PhysicsLetters vol 413 no 4ndash6 pp 458ndash462 2005
[28] Y-h Yao F Kong Q Dai and P-a He ldquoA sequence-segmentedmethod applied to the similarity analysis of long proteinsequencerdquo MATCH Communications in Mathematical and inComputer Chemistry vol 70 no 1 pp 431ndash450 2013
[29] M I Abo el Maaty M M Abo-Elkhier and M A AbdElwahaab ldquo3D graphical representation of protein sequencesand their statistical characterizationrdquo Physica A StatisticalMechanics and Its Applications vol 389 no 21 pp 4668ndash46762010
[30] M M Abo-Elkhier ldquoSimilaritydissimilarity analysis of proteinsequences using the spatial median as a descriptorrdquo Journal ofBiophysical Chemistry vol 3 pp 142ndash148 2012
[31] A El-Lakkani and S El-Sherif ldquoSimilarity analysis of proteinsequences based on 2D and 3D amino acid adjacencymatricesrdquoChemical Physics Letters vol 590 pp 192ndash195 2013
[32] T Ma Y Liu Q Dai Y Yao and P-A He ldquoA graphicalrepresentation of protein based on a novel iterated functionsystemrdquo Physica A Statistical Mechanics and its Applicationsvol 403 pp 21ndash28 2014
[33] D M Mount Bioinformatics Sequence and Genome AnalysisCold Spring Harbor Laboratory Press Cold Spring Harbor NYUSA 2nd edition 2004
[34] D G Higgins and P M Sharp ldquoCLUSTAL a package forperforming multiple sequence alignment on a microcomputerrdquoGene vol 73 no 1 pp 237ndash244 1988
[35] R C Edgar ldquoMUSCLE multiple sequence alignment with highaccuracy and high throughputrdquo Nucleic Acids Research vol 32no 5 pp 1792ndash1797 2004
[36] K Katoh K Misawa K-I Kuma and T Miyata ldquoMAFFT anovel method for rapid multiple sequence alignment based onfast Fourier transformrdquo Nucleic Acids Research vol 30 no 14pp 3059ndash3066 2002
[37] C Notredame D G Higgins and J Heringa ldquoT-coffee a novelmethod for fast and accurate multiple sequence alignmentrdquoJournal of Molecular Biology vol 302 no 1 pp 205ndash217 2000
[38] MA BalseraWWriggers YOono andK Schulten ldquoPrincipalcomponent analysis and long time protein dynamicsrdquo Journal ofPhysical Chemistry vol 100 no 7 pp 2567ndash2572 1996
[39] B Hess ldquoSimilarities between principal components of proteindynamics and random diffusionrdquo Physical Review EmdashStatistical
Physics Plasmas Fluids and Related Interdisciplinary Topicsvol 62 no 6B pp 8438ndash8448 2000
[40] A L Tournier and J C Smith ldquoPrincipal components of theprotein dynamical transitionrdquo Physical Review Letters vol 91no 20 Article ID 208106 2003
[41] K Tamura G Stecher D Peterson A Filipski and S KumarldquoMEGA6molecular evolutionary genetics analysis version 60rdquoMolecular Biology and Evolution vol 30 no 12 pp 2725ndash27292013
2 Computational and Mathematical Methods in Medicine
the methods or the lower the recognition degree of the pro-tein sequence graphs For example in the methods proposedin [26 28] the main drawback is that the lines will cross eachother which will decrease the visibility of the graphics In themethods proposed in [29ndash31] the main drawbacks are thatthe 3D graphics seem to be more complex and have lowervisibility than the 2D graphics and in addition to obtain thesequence invariants from the graphics complex matrixes arerequired to be constructed which need much computationand storage
Sequence alignment is a way of arranging the sequencesof DNA RNA or protein to identify regions of similaritythat may be a consequence of functional structural orevolutionary relationships between the sequences [33] Upto now there are many kinds of algorithms having beenimplemented for sequence alignment [34ndash37] These meth-ods are usually efficient but complex and time consumingComparing with the alignment methods existing graphicalrepresentation methods can also display the inner structureof the protein sequences and can be utilized to find the sim-ilaritydissimilarity more visible according to their graphicsIn this paper we proposed a novel method for analyzing thesimilaritydissimilarity by combining the idea of the sequencealignment and the graphical representation methods to somedegree avoid the weakness of both of these two methods
Principal components analysis (PCA) is a standard toolin multivariate data analysis to reduce the number of dimen-sions which has been proved to be effective in the processof protein sequence analysis [38ndash40] Therefore in order toovercome the main drawbacks of existing methods in thispaper a novel graphical representation of protein sequencescalled ADLD (Alignment Diagonal Line Diagram) is intro-duced based on PCA and then a newADLD basedmethod isproposed and utilized to analyze the similaritydissimilarityof protein sequences And in addition to validate theeffectiveness of our ADLD based method we adopt it toanalyze the similaritydissimilarity of both the 16 differentND5 proteins and the 29 different spike proteins respectivelywhich are widely used as the test data [16ndash26] The analysisresults show that our method is not only visual intuitionaland effective in the similaritydissimilarity analysis of proteinsequences but also quite simple since there are no highdimensional matrixes required to be constructed
2 Materials and Methods
21 Procedure of OurMethod for Analysis of Protein SequencesIn this section we will illustrate the overall procedures of ourmethod for analyzing protein sequences as follows at first
(1) Select the same 9 different properties for each aminoacid and construct a 20 times 9 matrix as the input dataof the PCA algorithm on the basis of total 20 differentamino acids
(2) According to the PCA algorithm we can obtain aunique feature for each amino acid
(3) For each protein sequence in the test data we willreplace each amino acid in the protein sequence
with its corresponding unique feature and then wecan transform the protein sequence into a numericalsequence
(4) For any two numerical sequences we can drawa graph named ADLD and then abstract somenumerical characteristics of it which can be utilizedto analyze the similaritydissimilarity of these twosequences
Next in Sections 22ndash26 we will introduce the detailsof constructing the ADLDs and obtaining some of thenumerical characteristics of them In Section 31 we will givethe method for constructing the similaritydissimilarity ofour test sequence groups
22 AminoAcids andTheir Properties Proteins are composedof 20 different amino acids and these amino acids have manydifferent physicochemical and biological properties such asthe molecular weight (mW) hydropathy index (hI) the pKavalue for terminal amino acid groups COOH (pK1) the pKavalue for terminal amino acid groupsNH
3
+ (pK2) isoelectricpoint (pI) solubility (119878) the number of triplet codons (cN)frequency of human proteins (119865) and van der Waals radiusof side chains (vR) The names and symbols of the 20 aminoacids and the value of their 9 major properties are illustratedin Table 1
23 Principal Components Analysis Principal componentsanalysis (PCA) is a common technique for dimensionalityreduction and pattern recognition in datasets of high dimen-sion [41] The main purposes of PCA are the analysis ofdata to identify patterns and finding patterns to reduce thedimensions of the dataset with minimal loss of informationThe general steps of conducting PCA are as follows
Step 1 For119898 samples 1198831 1198832 119883
119898 suppose that each119883
119894
has 119899 components 1199091198941 1199091198941 119909
119894119899 let119883
119894= (1199091198941 1199091198941 119909
119894119899)
for 119894 isin 1 2 119898 and then construct an 119898 times 119899 matrix Xaccording to the following formula first
X =
[[[[
[
1198831
1198832
119883119898
]]]]
]
=
[[[[
[
11990911
11990912
sdot sdot sdot 1199091119899
11990921
11990922
sdot sdot sdot 1199092119899
1199091198981
1199091198982
sdot sdot sdot 119909119898119899
]]]]
]
(1)
Next based on thematrixX construct the corresponding119898 times 119899 standardized matrix Xlowast according to the followingformula
Xlowast =
[[[[[[
[
119883lowast
1
119883lowast
2
119883lowast
119898
]]]]]]
]
=
[[[[[[
[
119909lowast
11119909lowast
12sdot sdot sdot 119909
lowast
1119899
119909lowast
21119909lowast
22sdot sdot sdot 119909
lowast
2119899
119909lowast
1198981119909lowast
1198982sdot sdot sdot 119909lowast
119898119899
]]]]]]
]
(2)
where 119883lowast
119894= (119909lowast
1198941 119909lowast
1198942 119909
lowast
119894119899) 119909lowast119894119895
= (119909119894119895minus 119909119895)radicvar(119909
119895) 119909119895=
(1119899)sum119899
119894=1119909119894119895 and var(119909
119895) = (1(119899 minus 1))sum
119899
119894=1(119909119894119895
minus 119909119895)2 for
119894 isin 1 2 119898 and 119895 isin 1 2 119899
Computational and Mathematical Methods in Medicine 3
Table 1 The full list of 20 amino acids and the value of their 9 different properties
Amino acid Symbol mW hI pK1 pK2 pI 119878 cN 119865 () vRAlanine A 89079 18 234 969 601 1672 4 78 67Cysteine C 121145 25 196 1028 507 0 2 19 86Aspartic acid D 133089 minus35 188 96 277 5 2 53 91Glutamic acid E 147116 minus35 219 967 322 85 2 63 109Phenylalanine F 165177 28 183 913 548 276 2 39 135Glycine G 75052 minus04 234 96 597 2499 4 72 48Histidine H 155141 minus32 182 917 759 0 2 23 118Isoleucine I 13116 45 236 968 602 345 3 53 124Lysine K 14617 minus39 218 895 974 739 2 59 135Leucine L 13116 38 236 96 598 217 6 91 124Methionine M 149199 19 228 921 574 562 1 23 124Asparagine N 132104 minus35 202 88 541 285 2 43 96Proline P 115117 16 199 1096 648 1620 4 52 90Glutamine Q 146131 minus35 217 913 565 72 2 42 114Arginine R 174188 minus45 217 904 1076 8556 6 51 148Serine S 105078 minus08 221 915 568 422 6 68 73Tyrosine T 119105 minus07 211 962 587 132 4 59 93Valine V 117133 42 232 962 597 581 4 66 105Tryptophan W 204213 minus09 238 939 589 136 1 14 163Threonine Y 181176 minus13 22 911 566 04 2 32 141
Step 2 Based on thematrixXlowast construct the 119899times119899 correlationmatrix R according to the following formula
R =
[[[[
[
1198771
1198772
119877119899
]]]]
]
=
[[[[
[
11990311
11990312
sdot sdot sdot 1199031119899
11990321
11990322
sdot sdot sdot 1199032119899
1199031198991
1199031198992
sdot sdot sdot 119903119899119899
]]]]
]
(3)
where we can find that 119903119894119895
= sum119898
119896=1(119909lowast
119896119894minus 119909lowast
119894)(119909lowast
119896119895minus 119909lowast
119895)
radicsum119898
119896=1(119909lowast
119896119894minus 119909lowast
119894)2sum119898
119896=1(119909lowast
119896119895minus 119909lowast
119895)2 for 119894 isin 1 2 119899 and
119895 isin 1 2 119899
Step 3 From the correlationmatrixR obtain its 119899 eigenvalues1205821ge 1205822ge sdot sdot sdot ge 120582
119899gt 0 and the corresponding 119899 eigenvectors
1198861=
[[[[
[
11988611
11988621
1198861198991
]]]]
]
1198862=
[[[[
[
11988612
11988622
1198861198992
]]]]
]
119886119899=
[[[[
[
1198861119899
1198862119899
119886119899119899
]]]]
]
(4)
respectively And from now on we can obtain 119899 principalcomponents 119865
119894for 119894 isin 1 2 119899 as follows
119865119894= 1198861119894119883lowast
1+ 1198862119894119883lowast
2+ sdot sdot sdot + 119886
119899119894119883lowast
119899 (5)
Step 4 For each principal component 119865119894for 119894 isin 1 2 119899
obtain its contribution rate CR119894and accumulated contribution
rate ACR119894according to the following formulas respectively
CR119894=
120582119894
sum119899
119896=1120582119896
(6)
ACR119894=
119894
sum
119896=1
CR119896 (7)
Generally in order to lower the computation complexitywe can keep only the first 119905 (119905 le 119899) principal components1198651 1198652 119865
119905 where the accumulated contribution rate of
the 119905th principal component 119865119905shall satisfy the fact that
ACR119905ge 85
Step 5 For 119895 isin 1 2 119905 let
119865119895=
[[[[[[[
[
119865119895
1
119865119895
2
119865119895
119898
]]]]]]]
]
(8)
Then for each 119894 isin 1 2 119898 we can obtain the total scoreof the 119894th sample as follows
TotalScore (119894) =
119905
sum
119896=1
119865119896
119894times CR119896 (9)
4 Computational and Mathematical Methods in Medicine
Table 2 The 9 eigenvalues (120582) of R and the contribution rates (CR)and the accumulative contribution rates (ACR) of the 9 principalcomponents obtained by conducting PCA of the 20 amino acids
Number 120582 CR ACR1 32237 03582 035822 19132 02126 057083 14048 01561 072694 11876 01320 085885 04959 00551 091396 04467 00496 096357 01992 00221 098578 01218 00135 099929 00071 00008 10000
Table 3The 4 eigenvectors 1198861 1198862 1198863 1198864 corresponding to the first
4 eigenvalues in Table 2
1198861
1198862
1198863
1198864
05036 01436 00571 02158minus02454 minus01875 02304 06547minus01634 01820 06298 02288minus03101 minus01883 minus03964 0507100702 06464 minus00786 00532minus01665 04465 minus05280 01877minus03872 03931 01003 minus00532minus04377 01844 02544 minus0227304349 02643 01738 03495
24 PCA of the Amino Acids Observing Table 1 if weconsider the 20 amino acids as 20 different samples and the9 properties of each amino acid as its 9 components thenaccording to the general steps of conducting PCA illustratedin Section 23 we can obtain a 20 times 9 matrix X and itsstandardized matrix Xlowast a 9 times 9 correlation matrix R and9 principal components 119865
1 1198652 119865
9 And therefore as
illustrated in Table 2 we can obtain the 9 eigenvalues of Rand the contribution rates and the accumulative contributionrates of the 9 principal components 119865
1 1198652 119865
9 respec-
tivelyFrom Table 2 we can see that the accumulative con-
tribution rate of the first 4 principal components amountsto 08588 (=8588) which is already bigger than 85Therefore we can keep the first 4 principal components onlyLet 120582
1 1205822 1205823 1205824 be the 4 eigenvalues corresponding to the
first 4 principal components respectively then as illustratedin Table 3 we can obtain the 4 eigenvectors 119886
1 1198862 1198863 1198864
corresponding to the 4 eigenvalues 1205821 1205822 1205823 1205824 separately
Based on Table 3 we can obtain the first 4 principalcomponents 119865
1 1198652 1198653 1198654 as follows
1198651= 05036119883
lowast
1minus 02454119883
lowast
2minus 01634119883
lowast
3
minus 03101119883lowast
4+ 00702119883
lowast
5minus 01665119883
lowast
6
minus 03872119883lowast
7minus 04377119883
lowast
8+ 04349119883
lowast
9
Table 4 The total scores of the 20 amino acids
Symbols of amino acids Total scoresA minus09324C minus05985D minus06709E minus02296F 04298G minus11780H 04476I 01435K 07868L minus01205M 05735N minus00242P minus09822Q 02848R 11169S minus07077T minus04525V minus02643W 14729Y 09050
1198652= 01436119883
lowast
1minus 01875119883
lowast
2+ 01820119883
lowast
3
minus 01883119883lowast
4+ 06464119883
lowast
5+ 04465119883
lowast
6
+ 03931119883lowast
7+ 01844119883
lowast
8+ 02643119883
lowast
9
1198653= 00571119883
lowast
1+ 02304119883
lowast
2+ 06298119883
lowast
3
minus 03964119883lowast
4minus 00786119883
lowast
5minus 05280119883
lowast
6
+ 01003119883lowast
7+ 02544119883
lowast
8+ 01738119883
lowast
9
1198654= 02158119883
lowast
1+ 06547119883
lowast
2+ 02288119883
lowast
3
+ 05071119883lowast
4+ 00532119883
lowast
5+ 01877119883
lowast
6
minus 00532119883lowast
7minus 02273119883
lowast
8+ 03495119883
lowast
9
(10)
Observing the above 4 formulas it is easy to find thatthere are three big coefficients in the first formula whichare 05036 (corresponding to mW) 04377 (corresponding to119865) and 04349 (corresponding to vR) respectivelyThereforeit means that the three properties such as mW 119865 andvR will have a major role in the first principal component1198651 Similarly we can also know that the three properties
such as pI 119878 and cN will have a major role in the secondprincipal component 119865
2 the third principal component 119865
3is
mainly determined by pK1 and 119878 and the fourth principalcomponent 119865
4is closely linked with hI and pK2 and so forth
Hence we can obtain the total scores of the 20 amino acids asillustrated in Table 4 according to formula (9)
Computational and Mathematical Methods in Medicine 5
25 Numerical Sequences of Protein Sequences LetΩ = AC
DE FGH IK LMNPQR STVWY and supposethat Ψ = 119901
111990121199013 119901119873represents a protein sequence with
119873 amino acids where 119901119894
isin Ω for 119894 isin 1 2 119873 thenwe can obtain a numerical sequence 119878
Ψ= (1199051 1199052 119905
119873)
corresponding to the protein sequence Ψ through replacingeach amino acid 119901
119894in Ψ with its corresponding value of
TotalScore(119894) for 119894 isin 1 2 119873For example consider the following 3 abbreviated protein
sequences
Hu = MTMHTTMTTLGor = MTMYATMTTLOpo = MKVINISNTM
According to the above descriptions and Table 4 thenwe can obtain their corresponding numerical sequences asfollows119878Hu = 05735 minus04525 05735 04476 minus04525 minus04525
05735 minus04525 minus04525 minus01205
119878Gor = 05735 minus04525 05735 09050 minus09324 minus04525
05735 minus04525 minus04525 minus01205
119878Opo = 05735 07868 minus02643 01435 minus00242 01435
minus07077 minus00242 minus04525 05735
(11)
26 ASDs and ADLDs of Protein Sequence Pairs For agiven protein sequence pair (119904
1 1199042) suppose that the protein
sequence 1199041includes 119873
1amino acids 119904
2includes 119873
2amino
acids and 1198731
⩾ 1198732 then in order to measure the
similaritydissimilarity between them in this section we willpresent a new method called Alignment Scatter Diagram(ASD) to plot the two sequences into a scatter diagram firstAnd for convenience we call the points in the ASD thealignment-plots (APs) The ASD of the protein sequence pair(1199041 1199042) can be obtained through the following steps
Step 1 According to the method given in Section 25 trans-late the protein sequence pair (119904
1 1199042) into two numerical
sequences with the same length as follows
1198781= 1199051 1199052 119905
1198731
1198782= 1199051 1199052 119905
1198732
0 0 0⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟
1198731minus1198732
(12)
Step 2 Let 119908 be the alignment width (AW) of the proteinsequence pair (119904
1 1199042) that is let 119904
1= 1199011 1199012 1199013 119901
1198731
1199042=
1199021 1199022 1199023 119902
1198732
then for any amino acid 119901119894in the protein
sequence 1199041 we will compare it with these 2119908 + 1 amino
acids 119902119894minus119908
119902119894minus1
119902119894 119902119894+1
119902119894+119908
in the protein sequence1199042 and then 119908 can be simply defined as follows
119908 =
120585 if 1198731minus 1198732le 120585
1198731minus 1198732 else
(13)
where 120585 gt 0 is a given threshold to guarantee that the AWof the protein sequence pair (119904
1 1199042) will not be too small to
expose the association of the inner structures of the proteinsequence pair (119904
1 1199042) In actual applications we suggest that
120585 shall be no less than 10
Step 3 Let 120576 gt 0 be the dissimilarity degree (DD) of twoamino acids that is if 120576 = 0 then it means that the two aminoacids are the same otherwise it means that the two aminoacids are different from each other to some degree and thenthe APs in the ASD of the protein sequence pair (119904
1 1199042) can
be briefly defined as follows
119860120576
119894119895= Θ (
10038161003816100381610038161003816119905119894minus 119905119895
10038161003816100381610038161003816minus 120576) (14)
where 119894 isin 1 2 1198731 119895 isin 1 2 119873
1 and Θ is a
Heaviside function which can be defined as follows
Θ (119909) =
1 if 119909 le 0
0 else(15)
Thereafter we can obtain an 1198731times 1198731alignment matrix
(AM) as follows
AM = (119860120576
119894119895)1198731times1198731
(16)
Step 4 For the1198731times1198731elements in the alignmentmatrix AM
we can plot points on 119894-119895 plane for these elements in the AMwith 119860
120576
119894119895= 1 and |119894 minus 119895| le 119908 And for convenience we call
the obtained graph the Alignment Scatter Diagram (ASD) ofthe protein sequence pair (119904
1 1199042)
For example considering the three 120573-globin pro-tein sequences of chimpanzee [GenBank AAA163341]human [GenBank CAA262041] and gorilla [GenBankCAA434211] obtained from the GenBank respectively weillustrate the ASDs of the 120573-globin protein sequence pair(chimpanzee human) and the 120573-globin protein sequencepair (human gorilla) in Figures 1(a) and 1(b) separately whileletting 120576 = 0
From Figure 1 it is easy to see that there are lots of disor-dered points in these ASDs which will lower the visuality ofthe ASDs remarkably and obstruct us fromdistinguishing thesimilaritydissimilarity between the protein sequence pairsintuitively while observing these ASDsTherefore in order toimprove the intuition of theASD wewill propose a simplifiedvariant diagram of the ASD which is called the AlignmentDiagonal Line Diagram (ADLD)
For convenience in an ASD we call its main diagonalline the artery tracks (ATs) and the lines parallelling to itsmain diagonal line the by-path tracks (BTs) respectively Andin addition we define a set consisting with no less than 120575
consecutive APs on the AT or BTs as a CAPS where 120575 ge 1
is a given thresholdFor a given CAPS caps
1 if there is no CAPS caps
2
satisfying caps1
sub caps2 then we call the caps
1a maximum
CAPS And for convenience we call the line formed byconnecting all of the APs in a maximum CAPS a similarfragment (SF) and simultaneously we call all of the APs onthe AT but not on any SFs the free points (FPs)
6 Computational and Mathematical Methods in Medicine
Obviously in an ASD if keeping all of the SFs and FPsonly and omitting all those other APs then we will obtain asimplified variant diagram of the ASD and for conveniencewe call it the Alignment Diagonal Line Diagram (ADLD)Apparently if 120575 = 1 then an ADLD will degenerate into anASD Therefore in actual applications we suggest that 120575 willbe no less than 2 And particularly in order to find moreaccurate SFs in the ADLD of a protein sequence pair thelonger the protein sequences in the protein sequence pair arethe bigger the value of 120575 shall be
For convenience of analysis in an ADLD suppose thatthere are 119870
1different SFs and 119870
2different FPs on its AT
119870 different BTs locating above its AT and 119870 different BTslocating below its AT then we get the following
(1) For these 1198701different SFs and 119870
2different FPs
on the AT of the ADLD we will number these1198701SFs and 119870
2FPs from left to right and utilize
ASF1ASF2 ASF1198701 and FP1 FP2 FP1198702 torepresent these 119870
1SFs and 119870
2FPs separately And
in addition we would also call these SFs on the AT ofthe ADLD the ASFs
(2) For these 119870 different BTs locating above the AT wewill number these BTs from down to up and utilizeBT1BT2 BT
119870 to represent these BTs separately
and for these 119870 different BTs locating below theAT we will number these BTs from up to down andutilize BT
minus1BTminus2
BTminus119870
to represent these BTsseparately
(3) For each BT119897 where 119897 isin 1 2 119870 suppose
that there are 1198703different SFs on the BT
119897 then we
will number these 1198703SFs from left to right and
utilize BSF1119897BSF2119897 BSF1198703
119897 to represent these SFs
separately And in addition we would also call theseSFs on the BTs of the ADLD the BSFs
According to the above assumptions in Figure 2 we showthe two ADLDs corresponding to the ASDs illustrated inFigures 1(a) and 1(b) while letting 120575 = 3 And in additionto make the ADLDs more visual and intuitional in Figure 2we use the red ldquolowastrdquo to represent the FPs on the AT and the bluelines to represent the SFs on the AT or BTs
From Figure 2(a) it is easy to see that there are two SFsin the ADLD of the sequence pair (chimpanzee human)one is ASF1 that is the line segment from the point (1 1)
to the point (32 32) and the other is BSF1minus4 that is the
line segment from the point (35 31) to the point (125 121)And in addition there are totally 6 FPs in the ADLD whichare FP1(46 46) FP2(66 66) FP3(111 111) FP4(114 114)FP5(115 115) and FP6(123 123) respectively
Observing Figure 2(b) we can easily find that there arealso two SFs in the ADLD of the sequence pair (humangorilla) But different from that in Figure 2(a) the two SFsin Figure 2(b) are both ASFs one is ASF1 that is the linesegment from the point (1 1) to the point (104 104) andthe other is ASF2 that is the line segment from the point(106 106) to the point (121 121) And in addition the twoASFs in Figure 2(b) are separated by one gap and there existno FPs or BSFs on the AT or BTs
Through analysis we can know that for a given proteinsequence pair if there exist some deletions or insertions ofamino acid segments between the two protein sequencesthen there will exist some misalignments of SFs in theirADLD that is some ASFs on the AT will be transformedinto BSFs on some BTs And in addition if there exist somesubstitutions of the amino acids between the two proteinsequences then in their ADLD there will exist some gapsbetween two neighboring SFs or FPs on the AT Furthermoreif there exist some insertions deletions or substitutions of theamino acid segments at the end of the two protein sequencesthen in their ADLD there will exist no SFs or FPs on the ATor BTs
From the above descriptions it is easy to know thatthe ADLD of any given protein sequence pair obtainedby our above proposed method reflects some inner andspecific differences between these two protein sequences inthe given protein sequence pair which may be useful in thesimilaritydissimilarity analysis of protein sequence pairs
3 Results and Discussion
31 Method for SimilarityDissimilarity Analysis of ProteinSequences Based on the ADLDs According to the aboveanalysis we have known that the ADLDs may be useful inanalyzing the differences of the inner structures of proteinsequence pairs In this section we will show how to utilizethe ADLDs to analyze the similaritydissimilarity of a groupof protein sequences
Generally suppose that there are 119873 protein sequencesΨ1 Ψ2 Ψ
119873 then while applying the ADLDs to analyze
the similaritydissimilarity of these119873 sequences the similar-itydissimilarity matrix of these119873 sequences can be obtainedthrough the following steps
Step 1 According to the method given in Section 25 trans-form these 119873 protein sequences into119873 numerical sequences1198781 1198782 119878
119873
Step 2 For a given protein sequence pair Ψ119886 Ψ119887 119886 isin
1 2 119873 119887 isin 1 2 119873 we can obtain their ADLDthrough adopting the method proposed in Section 26 andthen we can obtain all of the SFs (including ASFs and BSFs)and FPs in the ADLD Hence we can obtain the lengths oftheseASFs the lengths of these BSFs and the number of theseFPs respectively
Step 3 Suppose that there are totally 1198711different ASFs
such as ASF1ASF2 ASF1198711 1198712different BSFs such
as BSF11198971
BSF21198972
BSF11987121198971198712
and 1198713different FPs such as
FP1 FP2 FP1198713 in the ADLD And in addition for eachASF119894 and BSF119895
119897119895
let their length be length119894 and length119895
respectively where 119894 isin 1 2 1198711 and 119895 isin 1 2 119871
2
then we can define the similarity degree (SD) of Ψ119886 Ψ119887 as
follows
SD (Ψ119886 Ψ119887) =
1198711
sum
119894=1
length119894 +1198712
sum
119895=1
length119895+ 1198713 (17)
Computational and Mathematical Methods in Medicine 7
150
100
50
0150100500
(a)
150
100
50
0150100500
(b)
Figure 1 (a)TheASD of the 120573-globin protein sequence pair (chimpanzee human) with 120585 = 12 (b) the ASD of the 120573-globin protein sequencepair (human gorilla) with 120585 = 16
150
100
50
0150100500
(a)
150
100
50
0
150100500
(b)
Figure 2 (a) The ADLD of the protein sequence pair (chimpanzee human) (b) the ADLD of the protein sequence pair (human gorilla)
And therefore according to these 119873 protein sequencesΨ1 Ψ2 Ψ
119873 we can obtain an 119873 times 119873 matching matrix
(MM) as follows
MM = [119889119894119895]119873times119873
(18)
where
119889119894119895
=
SD (Ψ119894 Ψ119895) if 119894 ge 119895
0 else
for 119894 isin 1 2 119873 119895 isin 1 2 119873
(19)
Step 4 Based on the matching matrixMM and all of its com-ponents 119889
119894119895 where 119894 isin 1 2 119873 and 119895 isin 1 2 119873 then
we can obtain an 119873 times 119873 similaritydissimilarity matrix (SM)of these 119873 protein sequences Ψ
1 Ψ2 Ψ
119873 as follows
SM = [119904119894119895]119873times119873
(20)
where
119904119894119895
=
1 minus Λ(
119889119894119895
119889119894119894
) if 119894 ge 119895
0 else
Λ(
119889119894119895
119889119894119894
) =
1 if119889119894119895
119889119894119894
ge 1
119889119894119895
119889119894119894
else
for 119894 isin 1 2 119873 119895 isin 1 2 119873
(21)
According to the above steps we present an examplethrough implementing the ADLDs to analyze the similar-itydissimilarity of 16 ND5 proteins (illustrated in Table 5)while letting 120575 = 3 and illustrate the results of similar-itydissimilarity matrix in Table 6
Observing Table 6 it is easy to find that there are somesimilar pairs such as (c-chim pi-chim) with the distance00510 (human c-chim) with the distance 00814 (human
8 Computational and Mathematical Methods in Medicine
Table 5 The basic information of 16 ND5 protein sequences
Number Name Abbreviation Access number Length1 Human Human ADT804301 6032 Gorilla Gorilla NP 008222 6033 Pigmy chimpanzee Pi-chim NP 008209 6034 Common chimpanzee C-chim NP 008196 6035 Fin-whale Fin-whale NP 006899 6066 Blue-whale Blue-whale NP 007066 6067 Rat Rat AP 0049021 6108 Mouse Mouse NP 904338 6079 Opossum Opossum NP 007105 60210 Sheep Sheep ABW229031 60611 Goat Goat BAN592581 60612 Lemur Lemur CAD134311 60313 Cattle Cattle ADN119021 60614 Hare Hare CAD132911 60315 Gallus Gallus BAE160361 60516 Rabbit Rabbit NP 0075591 603
Sheep
Goat
Cattle
Fin-whale
Blue-whale
Lemur
Hare
Rabbit
Gorilla
Human
Pi-chim
C-chim
Rat
Mouse
Opossum
Gallus
00383
00468
00255
00255
00162
00162
00942
00942
02169
00356
00356
01084
00540
00419
02311
00419
00855
00128
00184
00085
00665
01080
00477
00770
00243
00176
00288
00163
00458
00142
000005010015020
Figure 3 The phylogenetic tree of the 16 species based on the ADLDs based method
pi-chim) with the distance 00720 (gorilla c-chim) with thedistance 00865 (gorilla pi-chim) with the distance 00833and (fin-whale blue-whale) with the distance 00324 Andamong them the opossum seems to be a peculiar mammalsince the shortest distance between it and the remaining
mammals is more than 04023 Obviously the result isconsistent with the fact that opossum is the most remotespecies from the remaining mammals
Additionally gallus seems to be more peculiar thanopossum since the shortest distance between it and
Computational and Mathematical Methods in Medicine 9
Table6Th
esim
ilaritydissim
ilaritymatrix
forthe
16ND5proteins
basedon
theA
DLD
sbased
metho
d
Hum
anGorilla
Pi-chim
C-chim
Fin-whale
Blue-w
hale
rat
mou
seop
ossum
sheep
goat
lemur
cattle
hare
gallu
srabbit
Hum
an000
00Gorilla
01111
00000
Pi-chim
007
20008
33000
00C-
chim
008
14008
65005
10000
00Fin-whale
03396
03285
03222
03301
00000
Blue-w
hale
03474
03333
03285
03301
00324
000
00Ra
t03693
03622
03636
03716
03333
03381
000
00Mou
se03740
03686
03716
03748
03317
03333
01883
000
00Opo
ssum
044
7604551
04290
044
1804515
04519
04513
044
79000
00Sheep
03020
02933
02871
02951
02023
02067
03149
03219
04121
00000
Goat
03036
02901
02871
02919
01958
02147
03166
03468
04202
007
12000
00Lemur
02989
02708
02839
02967
02557
02724
03166
03670
040
5502087
02317
000
00Ca
ttle
03114
03045
03046
03062
01958
02051
03149
03173
04235
009
0601254
02184
000
00Hare
03146
03157
03062
03046
02832
02788
03166
03421
04023
02217
02508
02053
02532
000
00Gallus
04726
04920
04737
05008
044
50044
2304903
04743
04691
04239
04524
046
8004183
046
6000000
Rabb
it03255
03189
03222
03142
02896
02756
03084
03390
04332
02184
02603
02282
02612
008
37044
34000
00
10 Computational and Mathematical Methods in Medicine
the remaining animals is more than 04423 which is biggerthan 04023 (the shortest distance between Opossum andthe remaining mammals) Obviously the result is consistentwith the fact that gallus is not a kind of mammal
Therefore it is apparent that the results illustrated inTable 6 are wholly consistent with the results of the knownfact of evolution That is to say our ADLDs based methodcan be utilized as an effective way to analyze the similari-tiesdissimilarities of protein sequences
32 The Phylogenetic Tree of the Protein Sequences Based onthe ADLDs A phylogenetic tree is a diagram that is used torepresent the evolutionary relationships of organisms that arethought to have a common ancestry and it is a commonlyused tool for researchers in some fields to help them analyzethe clustering of different species
Obviously only through observing the similaritydissimilaritymatrix illustrated inTable 6 wewill find that it isnot very convenient to distinguish the similaritydissimilarityof protein sequences Therefore in order to show thesimilaritydissimilarity of the protein sequences more vividlyand intuitively according to the similaritydissimilaritymatrix illustrated in Table 6 then we will construct thephylogenetic tree of the above 16 ND5 proteins throughadopting the software MEGA 606 that is provided byTamura et al [41] and the result is illustrated in Figure 3
From Figure 3 it is obvious that we can not only findout the evolutionary relationships of these 16 ND5 proteinsequences visually and intuitively but also know easily thatthe constructed phylogenetic tree is consistent with theresults of the known fact of evolution to some degree
To further validate the performance of our ADLDs basedmethod we applied our method to analyze the similar-itydissimilarity of another group of proteins including 29spike proteins of coronavirus and compared ourmethodwiththe method proposed by Wen and Zhang [17] based on theabove given 16 ND5 proteins and the following 29 spikeproteins respectively The basic information of the 29 spikeproteins is illustrated in Table 7
For the 29 spike proteins illustrated in Table 7 weconstruct the phylogenetic tree in Figure 4 Since the spikeprotein sequences are very long (with more than 1100 aminoacids) therefore during simulation we set 120575 = 5 to avoid theeffect of noise points
Generally coronavirus can always be classified into fourclasses such as the Group I the Group II the Group IIIand the SARS-CoVs (Severe Acute Respiratory SyndromeCoronaviruses) And among these four classes the GroupI includes the Canine coronavirus (CCoV) the Feline coron-avirus (FCoV) the Human coronavirus 229E (HCoV-229E)the Porcine epidemic diarrhea virus (PEDV) and the Trans-missible gastroenteritis virus (TGEV) The Group II includesthe Bovine coronavirus (BCoV) Human coronavirus OC43(HCoV-OC43) the Murine coronavirus Mouse hepatitisvirus (MHV) the Porcine hemagglutinating encephalomyeli-tis virus (HEV) and the Rat coronavirus (RtCoV)TheGroupIII contains theAvian infectious bronchitis virus (IBV) and theTurkey coronavirus (TCoV)
Table 7 The basic information of 29 spike proteins
Number Access number Abbreviation Length1 CAB91145 TGEVG 14472 NP 058424 TGEV 14473 AAK38656 PEDVC 13834 NP 598310 PEDV 13835 NP 937950 HCoVOC43 13616 AAK83356 BCoVE 13637 AAL57308 BCoVL 13638 AAA66399 BCoVM 13639 AAL40400 BCoVQ 136310 AAB86819 MHVA 132411 YP 209233 MHVJHM 137612 AAF69334 MHVP 132113 AAF69344 MHVM 132414 AAP92675 IBVBJ 116915 AAS00080 IBVC 116916 NP 040831 IBV 116217 AAS10463 GD03T0013 125518 AAU93318 PC4127 125519 AAV49720 PC4137 125520 AAU93319 PC4205 125521 AAU04646 civet007 125522 AAU04649 civet010 125523 AAV91631 A022 125524 AAP51227 GD01 125525 AAS00003 GZ02 125526 AAP30030 BJ01 125527 AAP50485 FRA 125528 AAP41037 TOR2 125529 AAQ01597 TaiwanTC1 1255
From observing Figure 4 it is easy to know that the 29spike proteins of coronavirus can be perfectly classified intothe above four classes by our ADLDs based method
Finally for the convenience of comparison we illustratethe phylogenetic trees of the above given 29 spike proteins ofcoronavirus and 16ND5proteins constructed by adopting themethod proposed byWen and Zhang [17] in Figures 5 and 6respectively
Comparing Figure 3 with Figure 6 and Figure 4 withFigure 5 respectively it is obvious that the phylogenetic treesobtained by the method proposed by Wen and Zhang arequite unreasonable and not consistent with the known factsof evolution at all But on the contrary the phylogenetictrees obtained by our ADLDs based method are not onlyquite reasonable but also consistent with the known facts ofevolution to some degree Therefore there is no doubt thatthe performance of our method is much better than that ofthe method proposed by Wen and Zhang
33 The Analysis of Intuition and Visuality of the ADLDsIn Section 26 we have stated that the ADLDs of proteinsequence pairs are intuitional and visual In this section we
Computational and Mathematical Methods in Medicine 11
BCoVE
BCoVL
BCoVM
BCoVQ
HCoVOC43
MHVJHM
MHVP
MHVA
MHVM
TGEVG
TGEV
PEDVC
PEDV
TOR2
TaiwanTC1
FRA
BJ01
GD01
GZ02
civet007
A022
civet010
PC4137
GD03T0013
PC4127
PC4205
IBV
IBVBJ
IBVC
00000
00000
00000
00000
00369
00026
00026
00022
00022
00019
00559
00221
00019
00950
00950
01221
00010
00004
00015
00004
00004
00006
00004
00012
00012
00011
00006
00004
00004
04350
04350
00002
00002
00006
00005
00020
00005
00019
00018
00011
00202
00116
00112
00044
00040
04560
00232
00337
03498
03309
0027203461
00713
00230
00049
00052
Group II
Group I
Group III
SARS CoVs
Figure 4 The phylogenetic tree of the 29 spike proteins of coronavirus constructed by adopting the ADLDs based method with 120575 = 5
will further discuss the intuition and visuality of the ADLDsin detail
From Table 6 we can obtain some similar pairs suchas (fin-whale blue-whale) (pi-chim c-chim) (Human c-chim) (cheep goat) (human pi-chim) and (hare rabbit)and some dissimilar pairs such as (human opossum) and(human gallus) among the above given 16 ND5 proteinsFrom these similardissimilar pairs wewill choose three pairsincluding (human gorilla) (human opossum) and (humangallus) as examples to further show the intuition and visualityof the ADLDs of these three protein sequence pairs TheADLDs of these three similardissimilar pairs are illustratedin Figure 7 while letting 120575 = 3
Observing Figure 7 we can clearly find that the totallength of all of the SFs in each of these three ADLDs satisfiesthe total length of all of the SFs in the ADLD of Figure 7(a) gtthe total length of all of the SFs in the ADLD of Figure 7(b) gtthe total length of all of the SFs in the ADLD of Figure 7(c)Therefore we can intuitively identify that the similarity of theproteins in each of these three protein sequence pairs satisfiesthe similarity of the proteins in the pair (human gorilla) gt thesimilarity of the proteins in the pair (human opossum) gt thesimilarity of the proteins in the pair (human gallus)
Moreover from Figure 7 we can also intuitively identifythat the two protein sequences in the protein sequence pair(human gorilla) are very similar to each other since the total
12 Computational and Mathematical Methods in Medicine
PEDVC
PEDV
GD03T0013
BJ01
GD01
MHVA
PC4205
GZ02
TOR2
MHVM
PC4137
FRA
PC4127
TaiwanTC1
MHVP
A022
civet007
civet010
TGEVG
TGEV
BCoVM
BCoVQ
HCoVOC43
IBVC
MHVJHM
BCoVE
BCoVL
IBVBJ
IBV
00000
00000
00000
00000
00013
00006
00006
00001
00001
00000
00012
00002
00001
00010
00013
00010
00000
00000
00000
00000
00000
00000
00001
00001
00000
00000
00000
00001
0000000000
00000
00000
00006
00000
00000
00001
00000
00000
00001
00002
00001
00000
00002
00004
00002
00003
00003
00008
00006
00008
00067
00022
00013
00013
00008
00042
Figure 5 The phylogenetic tree of the 29 spike proteins of coronavirus constructed by adopting the method proposed by Wen and Zhang
length of all of the SFs in the ADLD of Figure 7(a) looksvery long But on the contrary we can intuitively identifythat the two protein sequences in either the protein sequencepair (human opossum) or the protein sequence pair (humangallus) are apparently dissimilar to each other since both thetotal length of all of the SFs in the ADLD of Figure 7(b) andthat in the ADLD of Figure 7(c) look very short
And through statistic we can know that the actual totallengths of all of the SFs in the ADLDs of these three proteinsequence pairs (human gorilla) (human opossum) and(human gallus) are 556 288 and 248 respectively
Additionally observing Figures 2(a) and 2(b) hardly canwe distinguish the total length of all of the SFs (includingASFs and BSFs) in the ADLD of Figure 2(a) and that inthe ADLD of Figure 2(b) since the total lengths of all ofthe SFs in these two ADLDs look nearly the same Andthrough statistic we can know that the actual total lengthsof all of the SFs in the ADLDs of Figures 2(a) and 2(b)are 123 and 120 respectively and are really close to eachother But through comparing Figure 2(a) with Figure 2(b)more carefully we can further discover that different fromFigure 2(b) except for the SFs there are also 6 different FPs
Computational and Mathematical Methods in Medicine 13
Fin-whale
Rabbit
Blue-whale
Sheep
Goat
Hare
Rat
Lemur
Mouse
Cattle
Opossum
Gorilla
Human
Pi-chim
C-chim
Gallus
00006
00037
00004
00004
00001
00002
00000
00001
00074
00002
00015
00000
00001
00011
00183
00001
00006
00006
00004
00005
00002
00005
00031
00008
00024
00020
00039
00060
00024
00086
Figure 6 The phylogenetic tree of the 16 ND5 proteins constructed by adopting the method proposed by Wen and Zhang
700
600
500
400
300
200
100
07006005004003002001000
(a)
700
600
500
400
300
200
100
07006005004003002001000
(b)
700
600
500
400
300
200
100
07006005004003002001000
(c)
Figure 7 (a) The ADLD of the similar pair (human gorilla) (b) the ADLD of the dissimilar pair (human opossum) (c) the ADLD of thedissimilar pair (human gallus)
14 Computational and Mathematical Methods in Medicine
in the ADLD of Figure 2(a) while there are no FPs in theADLD of Figure 2(b) therefore we can intuitively identifythat the two protein sequences in the protein sequence pair(chimpanzee human) are more similar to the two proteinsequences in the protein sequence pair (human gorilla)
Hence from the above descriptions we can know that theADLDs obtained by our newly proposed method are quitevisual and intuitional andmay be a powerful and effective toolfor visual comparison of protein sequences and numericalsequences in other research fields
4 Conclusions
In this paper a novel ADLDs based graphical representationof protein sequences is proposed which is utilized to analyzethe similaritydissimilarity of protein sequences To validatethe performances of the newmethod we select two groups ofwell-known protein sequences as examples and additionallyin order to observe the similaritydissimilarity of proteinsequences more intuitively we construct the phylogenetictrees of protein sequences The results show that our ADLDsbased method not only has good performances and effects inthe similaritydissimilarity analysis of protein sequences butalso does not require complex computation since there areno high dimensional matrixes required Therefore it meansthat our ADLDs based method can work well in the analysisof protein sequences
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgments
The authors thank the anonymous referees for suggestionsthat helped in improving the paper substantially And theproject is partly sponsored by the Colleges and UniversitiesOpen Innovation Platform Fund of Hunan Province (no13K041) the Hunan Provincial Natural Science Foundationof China (no 14JJ2070) the Construct Program of the KeyDiscipline in Hunan province the State Education MinistryScientific Research Foundation for the Returned OverseasChinese Scholars and the Introduced Talent Start-up FundProject of Xiangtan University (no 11QDZ45)
References
[1] K Bucka-Lassen O Caprani and J Hein ldquoCombining manymultiple alignments in one improved alignmentrdquo Bioinformat-ics vol 15 no 2 pp 122ndash130 1999
[2] L Wang and T Jiang ldquoOn the complexity of multiple sequencealignmentrdquo Journal of Computational Biology vol 1 no 4 pp337ndash348 1994
[3] C Shyu L Sheneman and J A Foster ldquoMultiple sequencealignment with evolutionary computationrdquo Genetic Program-ming and Evolvable Machines vol 5 no 2 pp 121ndash144 2004
[4] M Randic ldquoGraphical representations of DNA as 2-D maprdquoChemical Physics Letters vol 386 no 4ndash6 pp 468ndash471 2004
[5] D Bielinska-Waz T Clark P Waz W Nowak and A Nandyldquo2D-dynamic representation of DNA sequencesrdquo ChemicalPhysics Letters vol 442 no 1ndash3 pp 140ndash144 2007
[6] Z Qi and X Qi ldquoNovel 2D graphical representation of DNAsequence based on dual nucleotidesrdquo Chemical Physics Lettersvol 440 no 1ndash3 pp 139ndash144 2007
[7] W Chen B Liao Y Liu and Z Su ldquoA numerical representationof DNA sequences and its applicationsrdquoMATCH Communica-tions in Mathematical and in Computer Chemistry vol 60 no2 pp 291ndash300 2008
[8] M Randic X Guo and S C Basak ldquoOn the characterization ofDNAprimary sequences by triplet of nucleic acid basesrdquo Journalof Chemical Information and Computer Sciences vol 41 no 3pp 619ndash626 2001
[9] M Randic M Vracko M Novic and D Plavsic ldquoSpectralrepresentation of reduced protein modelsrdquo SAR and QSAR inEnvironmental Research vol 20 no 5-6 pp 415ndash427 2009
[10] M Randic K Mehulic D Vukicevic T Pisanski D Vikic-Topic and D Plavsic ldquoGraphical representation of proteins asfour-color maps and the irnumerical characterizationrdquo Journalof Molecular Graphics and Modelling vol 27 no 5 pp 637ndash6412009
[11] F Bai and TWang ldquoOn graphical and numerical representationof protein sequencesrdquo Journal of Biomolecular Structure andDynamics vol 23 no 5 pp 537ndash545 2006
[12] M Randic ldquo2-D Graphical representation of proteins based onphysico-chemical properties of amino acidsrdquo Chemical PhysicsLetters vol 440 no 4ndash6 pp 291ndash295 2007
[13] A Ghosh and A Nandy ldquoGraphical representation and mathe-matical characterization of protein sequences and applicationsto viral proteinsrdquo Advances in Protein Chemistry and StructuralBiology vol 83 pp 1ndash42 2011
[14] C Li X Yu L Yang X Zheng and Z Wang ldquo3-D maps andcoupling numbers for protein sequencesrdquo Physica A StatisticalMechanics and its Applications vol 388 no 9 pp 1967ndash19722009
[15] M Randic J Zupan and D Vikic-Topic ldquoOn representation ofproteins by star-like graphsrdquo Journal of Molecular Graphics andModelling vol 26 no 1 pp 290ndash305 2007
[16] C Li L Xing and X Wang ldquo2-D graphical representation ofprotein sequences and its application to coronavirus phylogenyrdquoJournal of Biochemistry and Molecular Biology vol 41 no 3 pp217ndash222 2008
[17] J Wen and Y Zhang ldquoA 2D graphical representation of proteinsequence and its numerical characterizationrdquo Chemical PhysicsLetters vol 476 no 4ndash6 pp 281ndash286 2009
[18] Z-C Wu X Xiao and K-C Chou ldquo2D-MH a web-server forgenerating graphic representation of protein sequences basedon the physicochemical properties of their constituent aminoacidsrdquo Journal of Theoretical Biology vol 267 no 1 pp 29ndash342010
[19] B Liao X Sun and Q Zeng ldquoA novel method for similarityanalysis and protein sub-cellular localization predictionrdquo Bioin-formatics vol 26 no 21 pp 2678ndash2683 2010
[20] M Novic and M Randic ldquoRepresentation of proteins as walksin 20-D spacerdquo SAR and QSAR in Environmental Research vol19 no 3-4 pp 317ndash337 2008
[21] Z-H Qi J Feng X-Q Qi and L Li ldquoApplication of 2Dgraphic representation of protein sequence based on Huffmantree methodrdquo Computers in Biology and Medicine vol 42 no 5pp 556ndash563 2012
Computational and Mathematical Methods in Medicine 15
[22] H-J Yu and D-S Huang ldquoNovel 20-D descriptors of proteinsequences and itrsquos applications in similarity analysisrdquo ChemicalPhysics Letters vol 531 pp 261ndash266 2012
[23] P-A He J Wei Y Yao and Z Tie ldquoA novel graphical repre-sentation of proteins and its applicationrdquo Physica A StatisticalMechanics and its Applications vol 391 no 1-2 pp 93ndash99 2012
[24] M Randic M Novic andM Vracko ldquoOn novel representationof proteins based on amino acid adjacency matrixrdquo SAR andQSAR in Environmental Research vol 19 no 3-4 pp 339ndash3492008
[25] Z-P Feng andC-T Zhang ldquoA graphic representation of proteinsequence andpredicting the subcellular locations of prokaryoticproteinsrdquo International Journal of Biochemistry and Cell Biologyvol 34 no 3 pp 298ndash307 2002
[26] M Randic J Zupan and A T Balaban ldquoUnique graphicalrepresentation of protein sequences based on nucleotide tripletcodonsrdquo Chemical Physics Letters vol 397 no 1ndash3 pp 247ndash2522004
[27] F Bai and T Wang ldquoA 2-D graphical representation of proteinsequences based on nucleotide triplet codonsrdquoChemical PhysicsLetters vol 413 no 4ndash6 pp 458ndash462 2005
[28] Y-h Yao F Kong Q Dai and P-a He ldquoA sequence-segmentedmethod applied to the similarity analysis of long proteinsequencerdquo MATCH Communications in Mathematical and inComputer Chemistry vol 70 no 1 pp 431ndash450 2013
[29] M I Abo el Maaty M M Abo-Elkhier and M A AbdElwahaab ldquo3D graphical representation of protein sequencesand their statistical characterizationrdquo Physica A StatisticalMechanics and Its Applications vol 389 no 21 pp 4668ndash46762010
[30] M M Abo-Elkhier ldquoSimilaritydissimilarity analysis of proteinsequences using the spatial median as a descriptorrdquo Journal ofBiophysical Chemistry vol 3 pp 142ndash148 2012
[31] A El-Lakkani and S El-Sherif ldquoSimilarity analysis of proteinsequences based on 2D and 3D amino acid adjacencymatricesrdquoChemical Physics Letters vol 590 pp 192ndash195 2013
[32] T Ma Y Liu Q Dai Y Yao and P-A He ldquoA graphicalrepresentation of protein based on a novel iterated functionsystemrdquo Physica A Statistical Mechanics and its Applicationsvol 403 pp 21ndash28 2014
[33] D M Mount Bioinformatics Sequence and Genome AnalysisCold Spring Harbor Laboratory Press Cold Spring Harbor NYUSA 2nd edition 2004
[34] D G Higgins and P M Sharp ldquoCLUSTAL a package forperforming multiple sequence alignment on a microcomputerrdquoGene vol 73 no 1 pp 237ndash244 1988
[35] R C Edgar ldquoMUSCLE multiple sequence alignment with highaccuracy and high throughputrdquo Nucleic Acids Research vol 32no 5 pp 1792ndash1797 2004
[36] K Katoh K Misawa K-I Kuma and T Miyata ldquoMAFFT anovel method for rapid multiple sequence alignment based onfast Fourier transformrdquo Nucleic Acids Research vol 30 no 14pp 3059ndash3066 2002
[37] C Notredame D G Higgins and J Heringa ldquoT-coffee a novelmethod for fast and accurate multiple sequence alignmentrdquoJournal of Molecular Biology vol 302 no 1 pp 205ndash217 2000
[38] MA BalseraWWriggers YOono andK Schulten ldquoPrincipalcomponent analysis and long time protein dynamicsrdquo Journal ofPhysical Chemistry vol 100 no 7 pp 2567ndash2572 1996
[39] B Hess ldquoSimilarities between principal components of proteindynamics and random diffusionrdquo Physical Review EmdashStatistical
Physics Plasmas Fluids and Related Interdisciplinary Topicsvol 62 no 6B pp 8438ndash8448 2000
[40] A L Tournier and J C Smith ldquoPrincipal components of theprotein dynamical transitionrdquo Physical Review Letters vol 91no 20 Article ID 208106 2003
[41] K Tamura G Stecher D Peterson A Filipski and S KumarldquoMEGA6molecular evolutionary genetics analysis version 60rdquoMolecular Biology and Evolution vol 30 no 12 pp 2725ndash27292013
Computational and Mathematical Methods in Medicine 3
Table 1 The full list of 20 amino acids and the value of their 9 different properties
Amino acid Symbol mW hI pK1 pK2 pI 119878 cN 119865 () vRAlanine A 89079 18 234 969 601 1672 4 78 67Cysteine C 121145 25 196 1028 507 0 2 19 86Aspartic acid D 133089 minus35 188 96 277 5 2 53 91Glutamic acid E 147116 minus35 219 967 322 85 2 63 109Phenylalanine F 165177 28 183 913 548 276 2 39 135Glycine G 75052 minus04 234 96 597 2499 4 72 48Histidine H 155141 minus32 182 917 759 0 2 23 118Isoleucine I 13116 45 236 968 602 345 3 53 124Lysine K 14617 minus39 218 895 974 739 2 59 135Leucine L 13116 38 236 96 598 217 6 91 124Methionine M 149199 19 228 921 574 562 1 23 124Asparagine N 132104 minus35 202 88 541 285 2 43 96Proline P 115117 16 199 1096 648 1620 4 52 90Glutamine Q 146131 minus35 217 913 565 72 2 42 114Arginine R 174188 minus45 217 904 1076 8556 6 51 148Serine S 105078 minus08 221 915 568 422 6 68 73Tyrosine T 119105 minus07 211 962 587 132 4 59 93Valine V 117133 42 232 962 597 581 4 66 105Tryptophan W 204213 minus09 238 939 589 136 1 14 163Threonine Y 181176 minus13 22 911 566 04 2 32 141
Step 2 Based on thematrixXlowast construct the 119899times119899 correlationmatrix R according to the following formula
R =
[[[[
[
1198771
1198772
119877119899
]]]]
]
=
[[[[
[
11990311
11990312
sdot sdot sdot 1199031119899
11990321
11990322
sdot sdot sdot 1199032119899
1199031198991
1199031198992
sdot sdot sdot 119903119899119899
]]]]
]
(3)
where we can find that 119903119894119895
= sum119898
119896=1(119909lowast
119896119894minus 119909lowast
119894)(119909lowast
119896119895minus 119909lowast
119895)
radicsum119898
119896=1(119909lowast
119896119894minus 119909lowast
119894)2sum119898
119896=1(119909lowast
119896119895minus 119909lowast
119895)2 for 119894 isin 1 2 119899 and
119895 isin 1 2 119899
Step 3 From the correlationmatrixR obtain its 119899 eigenvalues1205821ge 1205822ge sdot sdot sdot ge 120582
119899gt 0 and the corresponding 119899 eigenvectors
1198861=
[[[[
[
11988611
11988621
1198861198991
]]]]
]
1198862=
[[[[
[
11988612
11988622
1198861198992
]]]]
]
119886119899=
[[[[
[
1198861119899
1198862119899
119886119899119899
]]]]
]
(4)
respectively And from now on we can obtain 119899 principalcomponents 119865
119894for 119894 isin 1 2 119899 as follows
119865119894= 1198861119894119883lowast
1+ 1198862119894119883lowast
2+ sdot sdot sdot + 119886
119899119894119883lowast
119899 (5)
Step 4 For each principal component 119865119894for 119894 isin 1 2 119899
obtain its contribution rate CR119894and accumulated contribution
rate ACR119894according to the following formulas respectively
CR119894=
120582119894
sum119899
119896=1120582119896
(6)
ACR119894=
119894
sum
119896=1
CR119896 (7)
Generally in order to lower the computation complexitywe can keep only the first 119905 (119905 le 119899) principal components1198651 1198652 119865
119905 where the accumulated contribution rate of
the 119905th principal component 119865119905shall satisfy the fact that
ACR119905ge 85
Step 5 For 119895 isin 1 2 119905 let
119865119895=
[[[[[[[
[
119865119895
1
119865119895
2
119865119895
119898
]]]]]]]
]
(8)
Then for each 119894 isin 1 2 119898 we can obtain the total scoreof the 119894th sample as follows
TotalScore (119894) =
119905
sum
119896=1
119865119896
119894times CR119896 (9)
4 Computational and Mathematical Methods in Medicine
Table 2 The 9 eigenvalues (120582) of R and the contribution rates (CR)and the accumulative contribution rates (ACR) of the 9 principalcomponents obtained by conducting PCA of the 20 amino acids
Number 120582 CR ACR1 32237 03582 035822 19132 02126 057083 14048 01561 072694 11876 01320 085885 04959 00551 091396 04467 00496 096357 01992 00221 098578 01218 00135 099929 00071 00008 10000
Table 3The 4 eigenvectors 1198861 1198862 1198863 1198864 corresponding to the first
4 eigenvalues in Table 2
1198861
1198862
1198863
1198864
05036 01436 00571 02158minus02454 minus01875 02304 06547minus01634 01820 06298 02288minus03101 minus01883 minus03964 0507100702 06464 minus00786 00532minus01665 04465 minus05280 01877minus03872 03931 01003 minus00532minus04377 01844 02544 minus0227304349 02643 01738 03495
24 PCA of the Amino Acids Observing Table 1 if weconsider the 20 amino acids as 20 different samples and the9 properties of each amino acid as its 9 components thenaccording to the general steps of conducting PCA illustratedin Section 23 we can obtain a 20 times 9 matrix X and itsstandardized matrix Xlowast a 9 times 9 correlation matrix R and9 principal components 119865
1 1198652 119865
9 And therefore as
illustrated in Table 2 we can obtain the 9 eigenvalues of Rand the contribution rates and the accumulative contributionrates of the 9 principal components 119865
1 1198652 119865
9 respec-
tivelyFrom Table 2 we can see that the accumulative con-
tribution rate of the first 4 principal components amountsto 08588 (=8588) which is already bigger than 85Therefore we can keep the first 4 principal components onlyLet 120582
1 1205822 1205823 1205824 be the 4 eigenvalues corresponding to the
first 4 principal components respectively then as illustratedin Table 3 we can obtain the 4 eigenvectors 119886
1 1198862 1198863 1198864
corresponding to the 4 eigenvalues 1205821 1205822 1205823 1205824 separately
Based on Table 3 we can obtain the first 4 principalcomponents 119865
1 1198652 1198653 1198654 as follows
1198651= 05036119883
lowast
1minus 02454119883
lowast
2minus 01634119883
lowast
3
minus 03101119883lowast
4+ 00702119883
lowast
5minus 01665119883
lowast
6
minus 03872119883lowast
7minus 04377119883
lowast
8+ 04349119883
lowast
9
Table 4 The total scores of the 20 amino acids
Symbols of amino acids Total scoresA minus09324C minus05985D minus06709E minus02296F 04298G minus11780H 04476I 01435K 07868L minus01205M 05735N minus00242P minus09822Q 02848R 11169S minus07077T minus04525V minus02643W 14729Y 09050
1198652= 01436119883
lowast
1minus 01875119883
lowast
2+ 01820119883
lowast
3
minus 01883119883lowast
4+ 06464119883
lowast
5+ 04465119883
lowast
6
+ 03931119883lowast
7+ 01844119883
lowast
8+ 02643119883
lowast
9
1198653= 00571119883
lowast
1+ 02304119883
lowast
2+ 06298119883
lowast
3
minus 03964119883lowast
4minus 00786119883
lowast
5minus 05280119883
lowast
6
+ 01003119883lowast
7+ 02544119883
lowast
8+ 01738119883
lowast
9
1198654= 02158119883
lowast
1+ 06547119883
lowast
2+ 02288119883
lowast
3
+ 05071119883lowast
4+ 00532119883
lowast
5+ 01877119883
lowast
6
minus 00532119883lowast
7minus 02273119883
lowast
8+ 03495119883
lowast
9
(10)
Observing the above 4 formulas it is easy to find thatthere are three big coefficients in the first formula whichare 05036 (corresponding to mW) 04377 (corresponding to119865) and 04349 (corresponding to vR) respectivelyThereforeit means that the three properties such as mW 119865 andvR will have a major role in the first principal component1198651 Similarly we can also know that the three properties
such as pI 119878 and cN will have a major role in the secondprincipal component 119865
2 the third principal component 119865
3is
mainly determined by pK1 and 119878 and the fourth principalcomponent 119865
4is closely linked with hI and pK2 and so forth
Hence we can obtain the total scores of the 20 amino acids asillustrated in Table 4 according to formula (9)
Computational and Mathematical Methods in Medicine 5
25 Numerical Sequences of Protein Sequences LetΩ = AC
DE FGH IK LMNPQR STVWY and supposethat Ψ = 119901
111990121199013 119901119873represents a protein sequence with
119873 amino acids where 119901119894
isin Ω for 119894 isin 1 2 119873 thenwe can obtain a numerical sequence 119878
Ψ= (1199051 1199052 119905
119873)
corresponding to the protein sequence Ψ through replacingeach amino acid 119901
119894in Ψ with its corresponding value of
TotalScore(119894) for 119894 isin 1 2 119873For example consider the following 3 abbreviated protein
sequences
Hu = MTMHTTMTTLGor = MTMYATMTTLOpo = MKVINISNTM
According to the above descriptions and Table 4 thenwe can obtain their corresponding numerical sequences asfollows119878Hu = 05735 minus04525 05735 04476 minus04525 minus04525
05735 minus04525 minus04525 minus01205
119878Gor = 05735 minus04525 05735 09050 minus09324 minus04525
05735 minus04525 minus04525 minus01205
119878Opo = 05735 07868 minus02643 01435 minus00242 01435
minus07077 minus00242 minus04525 05735
(11)
26 ASDs and ADLDs of Protein Sequence Pairs For agiven protein sequence pair (119904
1 1199042) suppose that the protein
sequence 1199041includes 119873
1amino acids 119904
2includes 119873
2amino
acids and 1198731
⩾ 1198732 then in order to measure the
similaritydissimilarity between them in this section we willpresent a new method called Alignment Scatter Diagram(ASD) to plot the two sequences into a scatter diagram firstAnd for convenience we call the points in the ASD thealignment-plots (APs) The ASD of the protein sequence pair(1199041 1199042) can be obtained through the following steps
Step 1 According to the method given in Section 25 trans-late the protein sequence pair (119904
1 1199042) into two numerical
sequences with the same length as follows
1198781= 1199051 1199052 119905
1198731
1198782= 1199051 1199052 119905
1198732
0 0 0⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟
1198731minus1198732
(12)
Step 2 Let 119908 be the alignment width (AW) of the proteinsequence pair (119904
1 1199042) that is let 119904
1= 1199011 1199012 1199013 119901
1198731
1199042=
1199021 1199022 1199023 119902
1198732
then for any amino acid 119901119894in the protein
sequence 1199041 we will compare it with these 2119908 + 1 amino
acids 119902119894minus119908
119902119894minus1
119902119894 119902119894+1
119902119894+119908
in the protein sequence1199042 and then 119908 can be simply defined as follows
119908 =
120585 if 1198731minus 1198732le 120585
1198731minus 1198732 else
(13)
where 120585 gt 0 is a given threshold to guarantee that the AWof the protein sequence pair (119904
1 1199042) will not be too small to
expose the association of the inner structures of the proteinsequence pair (119904
1 1199042) In actual applications we suggest that
120585 shall be no less than 10
Step 3 Let 120576 gt 0 be the dissimilarity degree (DD) of twoamino acids that is if 120576 = 0 then it means that the two aminoacids are the same otherwise it means that the two aminoacids are different from each other to some degree and thenthe APs in the ASD of the protein sequence pair (119904
1 1199042) can
be briefly defined as follows
119860120576
119894119895= Θ (
10038161003816100381610038161003816119905119894minus 119905119895
10038161003816100381610038161003816minus 120576) (14)
where 119894 isin 1 2 1198731 119895 isin 1 2 119873
1 and Θ is a
Heaviside function which can be defined as follows
Θ (119909) =
1 if 119909 le 0
0 else(15)
Thereafter we can obtain an 1198731times 1198731alignment matrix
(AM) as follows
AM = (119860120576
119894119895)1198731times1198731
(16)
Step 4 For the1198731times1198731elements in the alignmentmatrix AM
we can plot points on 119894-119895 plane for these elements in the AMwith 119860
120576
119894119895= 1 and |119894 minus 119895| le 119908 And for convenience we call
the obtained graph the Alignment Scatter Diagram (ASD) ofthe protein sequence pair (119904
1 1199042)
For example considering the three 120573-globin pro-tein sequences of chimpanzee [GenBank AAA163341]human [GenBank CAA262041] and gorilla [GenBankCAA434211] obtained from the GenBank respectively weillustrate the ASDs of the 120573-globin protein sequence pair(chimpanzee human) and the 120573-globin protein sequencepair (human gorilla) in Figures 1(a) and 1(b) separately whileletting 120576 = 0
From Figure 1 it is easy to see that there are lots of disor-dered points in these ASDs which will lower the visuality ofthe ASDs remarkably and obstruct us fromdistinguishing thesimilaritydissimilarity between the protein sequence pairsintuitively while observing these ASDsTherefore in order toimprove the intuition of theASD wewill propose a simplifiedvariant diagram of the ASD which is called the AlignmentDiagonal Line Diagram (ADLD)
For convenience in an ASD we call its main diagonalline the artery tracks (ATs) and the lines parallelling to itsmain diagonal line the by-path tracks (BTs) respectively Andin addition we define a set consisting with no less than 120575
consecutive APs on the AT or BTs as a CAPS where 120575 ge 1
is a given thresholdFor a given CAPS caps
1 if there is no CAPS caps
2
satisfying caps1
sub caps2 then we call the caps
1a maximum
CAPS And for convenience we call the line formed byconnecting all of the APs in a maximum CAPS a similarfragment (SF) and simultaneously we call all of the APs onthe AT but not on any SFs the free points (FPs)
6 Computational and Mathematical Methods in Medicine
Obviously in an ASD if keeping all of the SFs and FPsonly and omitting all those other APs then we will obtain asimplified variant diagram of the ASD and for conveniencewe call it the Alignment Diagonal Line Diagram (ADLD)Apparently if 120575 = 1 then an ADLD will degenerate into anASD Therefore in actual applications we suggest that 120575 willbe no less than 2 And particularly in order to find moreaccurate SFs in the ADLD of a protein sequence pair thelonger the protein sequences in the protein sequence pair arethe bigger the value of 120575 shall be
For convenience of analysis in an ADLD suppose thatthere are 119870
1different SFs and 119870
2different FPs on its AT
119870 different BTs locating above its AT and 119870 different BTslocating below its AT then we get the following
(1) For these 1198701different SFs and 119870
2different FPs
on the AT of the ADLD we will number these1198701SFs and 119870
2FPs from left to right and utilize
ASF1ASF2 ASF1198701 and FP1 FP2 FP1198702 torepresent these 119870
1SFs and 119870
2FPs separately And
in addition we would also call these SFs on the AT ofthe ADLD the ASFs
(2) For these 119870 different BTs locating above the AT wewill number these BTs from down to up and utilizeBT1BT2 BT
119870 to represent these BTs separately
and for these 119870 different BTs locating below theAT we will number these BTs from up to down andutilize BT
minus1BTminus2
BTminus119870
to represent these BTsseparately
(3) For each BT119897 where 119897 isin 1 2 119870 suppose
that there are 1198703different SFs on the BT
119897 then we
will number these 1198703SFs from left to right and
utilize BSF1119897BSF2119897 BSF1198703
119897 to represent these SFs
separately And in addition we would also call theseSFs on the BTs of the ADLD the BSFs
According to the above assumptions in Figure 2 we showthe two ADLDs corresponding to the ASDs illustrated inFigures 1(a) and 1(b) while letting 120575 = 3 And in additionto make the ADLDs more visual and intuitional in Figure 2we use the red ldquolowastrdquo to represent the FPs on the AT and the bluelines to represent the SFs on the AT or BTs
From Figure 2(a) it is easy to see that there are two SFsin the ADLD of the sequence pair (chimpanzee human)one is ASF1 that is the line segment from the point (1 1)
to the point (32 32) and the other is BSF1minus4 that is the
line segment from the point (35 31) to the point (125 121)And in addition there are totally 6 FPs in the ADLD whichare FP1(46 46) FP2(66 66) FP3(111 111) FP4(114 114)FP5(115 115) and FP6(123 123) respectively
Observing Figure 2(b) we can easily find that there arealso two SFs in the ADLD of the sequence pair (humangorilla) But different from that in Figure 2(a) the two SFsin Figure 2(b) are both ASFs one is ASF1 that is the linesegment from the point (1 1) to the point (104 104) andthe other is ASF2 that is the line segment from the point(106 106) to the point (121 121) And in addition the twoASFs in Figure 2(b) are separated by one gap and there existno FPs or BSFs on the AT or BTs
Through analysis we can know that for a given proteinsequence pair if there exist some deletions or insertions ofamino acid segments between the two protein sequencesthen there will exist some misalignments of SFs in theirADLD that is some ASFs on the AT will be transformedinto BSFs on some BTs And in addition if there exist somesubstitutions of the amino acids between the two proteinsequences then in their ADLD there will exist some gapsbetween two neighboring SFs or FPs on the AT Furthermoreif there exist some insertions deletions or substitutions of theamino acid segments at the end of the two protein sequencesthen in their ADLD there will exist no SFs or FPs on the ATor BTs
From the above descriptions it is easy to know thatthe ADLD of any given protein sequence pair obtainedby our above proposed method reflects some inner andspecific differences between these two protein sequences inthe given protein sequence pair which may be useful in thesimilaritydissimilarity analysis of protein sequence pairs
3 Results and Discussion
31 Method for SimilarityDissimilarity Analysis of ProteinSequences Based on the ADLDs According to the aboveanalysis we have known that the ADLDs may be useful inanalyzing the differences of the inner structures of proteinsequence pairs In this section we will show how to utilizethe ADLDs to analyze the similaritydissimilarity of a groupof protein sequences
Generally suppose that there are 119873 protein sequencesΨ1 Ψ2 Ψ
119873 then while applying the ADLDs to analyze
the similaritydissimilarity of these119873 sequences the similar-itydissimilarity matrix of these119873 sequences can be obtainedthrough the following steps
Step 1 According to the method given in Section 25 trans-form these 119873 protein sequences into119873 numerical sequences1198781 1198782 119878
119873
Step 2 For a given protein sequence pair Ψ119886 Ψ119887 119886 isin
1 2 119873 119887 isin 1 2 119873 we can obtain their ADLDthrough adopting the method proposed in Section 26 andthen we can obtain all of the SFs (including ASFs and BSFs)and FPs in the ADLD Hence we can obtain the lengths oftheseASFs the lengths of these BSFs and the number of theseFPs respectively
Step 3 Suppose that there are totally 1198711different ASFs
such as ASF1ASF2 ASF1198711 1198712different BSFs such
as BSF11198971
BSF21198972
BSF11987121198971198712
and 1198713different FPs such as
FP1 FP2 FP1198713 in the ADLD And in addition for eachASF119894 and BSF119895
119897119895
let their length be length119894 and length119895
respectively where 119894 isin 1 2 1198711 and 119895 isin 1 2 119871
2
then we can define the similarity degree (SD) of Ψ119886 Ψ119887 as
follows
SD (Ψ119886 Ψ119887) =
1198711
sum
119894=1
length119894 +1198712
sum
119895=1
length119895+ 1198713 (17)
Computational and Mathematical Methods in Medicine 7
150
100
50
0150100500
(a)
150
100
50
0150100500
(b)
Figure 1 (a)TheASD of the 120573-globin protein sequence pair (chimpanzee human) with 120585 = 12 (b) the ASD of the 120573-globin protein sequencepair (human gorilla) with 120585 = 16
150
100
50
0150100500
(a)
150
100
50
0
150100500
(b)
Figure 2 (a) The ADLD of the protein sequence pair (chimpanzee human) (b) the ADLD of the protein sequence pair (human gorilla)
And therefore according to these 119873 protein sequencesΨ1 Ψ2 Ψ
119873 we can obtain an 119873 times 119873 matching matrix
(MM) as follows
MM = [119889119894119895]119873times119873
(18)
where
119889119894119895
=
SD (Ψ119894 Ψ119895) if 119894 ge 119895
0 else
for 119894 isin 1 2 119873 119895 isin 1 2 119873
(19)
Step 4 Based on the matching matrixMM and all of its com-ponents 119889
119894119895 where 119894 isin 1 2 119873 and 119895 isin 1 2 119873 then
we can obtain an 119873 times 119873 similaritydissimilarity matrix (SM)of these 119873 protein sequences Ψ
1 Ψ2 Ψ
119873 as follows
SM = [119904119894119895]119873times119873
(20)
where
119904119894119895
=
1 minus Λ(
119889119894119895
119889119894119894
) if 119894 ge 119895
0 else
Λ(
119889119894119895
119889119894119894
) =
1 if119889119894119895
119889119894119894
ge 1
119889119894119895
119889119894119894
else
for 119894 isin 1 2 119873 119895 isin 1 2 119873
(21)
According to the above steps we present an examplethrough implementing the ADLDs to analyze the similar-itydissimilarity of 16 ND5 proteins (illustrated in Table 5)while letting 120575 = 3 and illustrate the results of similar-itydissimilarity matrix in Table 6
Observing Table 6 it is easy to find that there are somesimilar pairs such as (c-chim pi-chim) with the distance00510 (human c-chim) with the distance 00814 (human
8 Computational and Mathematical Methods in Medicine
Table 5 The basic information of 16 ND5 protein sequences
Number Name Abbreviation Access number Length1 Human Human ADT804301 6032 Gorilla Gorilla NP 008222 6033 Pigmy chimpanzee Pi-chim NP 008209 6034 Common chimpanzee C-chim NP 008196 6035 Fin-whale Fin-whale NP 006899 6066 Blue-whale Blue-whale NP 007066 6067 Rat Rat AP 0049021 6108 Mouse Mouse NP 904338 6079 Opossum Opossum NP 007105 60210 Sheep Sheep ABW229031 60611 Goat Goat BAN592581 60612 Lemur Lemur CAD134311 60313 Cattle Cattle ADN119021 60614 Hare Hare CAD132911 60315 Gallus Gallus BAE160361 60516 Rabbit Rabbit NP 0075591 603
Sheep
Goat
Cattle
Fin-whale
Blue-whale
Lemur
Hare
Rabbit
Gorilla
Human
Pi-chim
C-chim
Rat
Mouse
Opossum
Gallus
00383
00468
00255
00255
00162
00162
00942
00942
02169
00356
00356
01084
00540
00419
02311
00419
00855
00128
00184
00085
00665
01080
00477
00770
00243
00176
00288
00163
00458
00142
000005010015020
Figure 3 The phylogenetic tree of the 16 species based on the ADLDs based method
pi-chim) with the distance 00720 (gorilla c-chim) with thedistance 00865 (gorilla pi-chim) with the distance 00833and (fin-whale blue-whale) with the distance 00324 Andamong them the opossum seems to be a peculiar mammalsince the shortest distance between it and the remaining
mammals is more than 04023 Obviously the result isconsistent with the fact that opossum is the most remotespecies from the remaining mammals
Additionally gallus seems to be more peculiar thanopossum since the shortest distance between it and
Computational and Mathematical Methods in Medicine 9
Table6Th
esim
ilaritydissim
ilaritymatrix
forthe
16ND5proteins
basedon
theA
DLD
sbased
metho
d
Hum
anGorilla
Pi-chim
C-chim
Fin-whale
Blue-w
hale
rat
mou
seop
ossum
sheep
goat
lemur
cattle
hare
gallu
srabbit
Hum
an000
00Gorilla
01111
00000
Pi-chim
007
20008
33000
00C-
chim
008
14008
65005
10000
00Fin-whale
03396
03285
03222
03301
00000
Blue-w
hale
03474
03333
03285
03301
00324
000
00Ra
t03693
03622
03636
03716
03333
03381
000
00Mou
se03740
03686
03716
03748
03317
03333
01883
000
00Opo
ssum
044
7604551
04290
044
1804515
04519
04513
044
79000
00Sheep
03020
02933
02871
02951
02023
02067
03149
03219
04121
00000
Goat
03036
02901
02871
02919
01958
02147
03166
03468
04202
007
12000
00Lemur
02989
02708
02839
02967
02557
02724
03166
03670
040
5502087
02317
000
00Ca
ttle
03114
03045
03046
03062
01958
02051
03149
03173
04235
009
0601254
02184
000
00Hare
03146
03157
03062
03046
02832
02788
03166
03421
04023
02217
02508
02053
02532
000
00Gallus
04726
04920
04737
05008
044
50044
2304903
04743
04691
04239
04524
046
8004183
046
6000000
Rabb
it03255
03189
03222
03142
02896
02756
03084
03390
04332
02184
02603
02282
02612
008
37044
34000
00
10 Computational and Mathematical Methods in Medicine
the remaining animals is more than 04423 which is biggerthan 04023 (the shortest distance between Opossum andthe remaining mammals) Obviously the result is consistentwith the fact that gallus is not a kind of mammal
Therefore it is apparent that the results illustrated inTable 6 are wholly consistent with the results of the knownfact of evolution That is to say our ADLDs based methodcan be utilized as an effective way to analyze the similari-tiesdissimilarities of protein sequences
32 The Phylogenetic Tree of the Protein Sequences Based onthe ADLDs A phylogenetic tree is a diagram that is used torepresent the evolutionary relationships of organisms that arethought to have a common ancestry and it is a commonlyused tool for researchers in some fields to help them analyzethe clustering of different species
Obviously only through observing the similaritydissimilaritymatrix illustrated inTable 6 wewill find that it isnot very convenient to distinguish the similaritydissimilarityof protein sequences Therefore in order to show thesimilaritydissimilarity of the protein sequences more vividlyand intuitively according to the similaritydissimilaritymatrix illustrated in Table 6 then we will construct thephylogenetic tree of the above 16 ND5 proteins throughadopting the software MEGA 606 that is provided byTamura et al [41] and the result is illustrated in Figure 3
From Figure 3 it is obvious that we can not only findout the evolutionary relationships of these 16 ND5 proteinsequences visually and intuitively but also know easily thatthe constructed phylogenetic tree is consistent with theresults of the known fact of evolution to some degree
To further validate the performance of our ADLDs basedmethod we applied our method to analyze the similar-itydissimilarity of another group of proteins including 29spike proteins of coronavirus and compared ourmethodwiththe method proposed by Wen and Zhang [17] based on theabove given 16 ND5 proteins and the following 29 spikeproteins respectively The basic information of the 29 spikeproteins is illustrated in Table 7
For the 29 spike proteins illustrated in Table 7 weconstruct the phylogenetic tree in Figure 4 Since the spikeprotein sequences are very long (with more than 1100 aminoacids) therefore during simulation we set 120575 = 5 to avoid theeffect of noise points
Generally coronavirus can always be classified into fourclasses such as the Group I the Group II the Group IIIand the SARS-CoVs (Severe Acute Respiratory SyndromeCoronaviruses) And among these four classes the GroupI includes the Canine coronavirus (CCoV) the Feline coron-avirus (FCoV) the Human coronavirus 229E (HCoV-229E)the Porcine epidemic diarrhea virus (PEDV) and the Trans-missible gastroenteritis virus (TGEV) The Group II includesthe Bovine coronavirus (BCoV) Human coronavirus OC43(HCoV-OC43) the Murine coronavirus Mouse hepatitisvirus (MHV) the Porcine hemagglutinating encephalomyeli-tis virus (HEV) and the Rat coronavirus (RtCoV)TheGroupIII contains theAvian infectious bronchitis virus (IBV) and theTurkey coronavirus (TCoV)
Table 7 The basic information of 29 spike proteins
Number Access number Abbreviation Length1 CAB91145 TGEVG 14472 NP 058424 TGEV 14473 AAK38656 PEDVC 13834 NP 598310 PEDV 13835 NP 937950 HCoVOC43 13616 AAK83356 BCoVE 13637 AAL57308 BCoVL 13638 AAA66399 BCoVM 13639 AAL40400 BCoVQ 136310 AAB86819 MHVA 132411 YP 209233 MHVJHM 137612 AAF69334 MHVP 132113 AAF69344 MHVM 132414 AAP92675 IBVBJ 116915 AAS00080 IBVC 116916 NP 040831 IBV 116217 AAS10463 GD03T0013 125518 AAU93318 PC4127 125519 AAV49720 PC4137 125520 AAU93319 PC4205 125521 AAU04646 civet007 125522 AAU04649 civet010 125523 AAV91631 A022 125524 AAP51227 GD01 125525 AAS00003 GZ02 125526 AAP30030 BJ01 125527 AAP50485 FRA 125528 AAP41037 TOR2 125529 AAQ01597 TaiwanTC1 1255
From observing Figure 4 it is easy to know that the 29spike proteins of coronavirus can be perfectly classified intothe above four classes by our ADLDs based method
Finally for the convenience of comparison we illustratethe phylogenetic trees of the above given 29 spike proteins ofcoronavirus and 16ND5proteins constructed by adopting themethod proposed byWen and Zhang [17] in Figures 5 and 6respectively
Comparing Figure 3 with Figure 6 and Figure 4 withFigure 5 respectively it is obvious that the phylogenetic treesobtained by the method proposed by Wen and Zhang arequite unreasonable and not consistent with the known factsof evolution at all But on the contrary the phylogenetictrees obtained by our ADLDs based method are not onlyquite reasonable but also consistent with the known facts ofevolution to some degree Therefore there is no doubt thatthe performance of our method is much better than that ofthe method proposed by Wen and Zhang
33 The Analysis of Intuition and Visuality of the ADLDsIn Section 26 we have stated that the ADLDs of proteinsequence pairs are intuitional and visual In this section we
Computational and Mathematical Methods in Medicine 11
BCoVE
BCoVL
BCoVM
BCoVQ
HCoVOC43
MHVJHM
MHVP
MHVA
MHVM
TGEVG
TGEV
PEDVC
PEDV
TOR2
TaiwanTC1
FRA
BJ01
GD01
GZ02
civet007
A022
civet010
PC4137
GD03T0013
PC4127
PC4205
IBV
IBVBJ
IBVC
00000
00000
00000
00000
00369
00026
00026
00022
00022
00019
00559
00221
00019
00950
00950
01221
00010
00004
00015
00004
00004
00006
00004
00012
00012
00011
00006
00004
00004
04350
04350
00002
00002
00006
00005
00020
00005
00019
00018
00011
00202
00116
00112
00044
00040
04560
00232
00337
03498
03309
0027203461
00713
00230
00049
00052
Group II
Group I
Group III
SARS CoVs
Figure 4 The phylogenetic tree of the 29 spike proteins of coronavirus constructed by adopting the ADLDs based method with 120575 = 5
will further discuss the intuition and visuality of the ADLDsin detail
From Table 6 we can obtain some similar pairs suchas (fin-whale blue-whale) (pi-chim c-chim) (Human c-chim) (cheep goat) (human pi-chim) and (hare rabbit)and some dissimilar pairs such as (human opossum) and(human gallus) among the above given 16 ND5 proteinsFrom these similardissimilar pairs wewill choose three pairsincluding (human gorilla) (human opossum) and (humangallus) as examples to further show the intuition and visualityof the ADLDs of these three protein sequence pairs TheADLDs of these three similardissimilar pairs are illustratedin Figure 7 while letting 120575 = 3
Observing Figure 7 we can clearly find that the totallength of all of the SFs in each of these three ADLDs satisfiesthe total length of all of the SFs in the ADLD of Figure 7(a) gtthe total length of all of the SFs in the ADLD of Figure 7(b) gtthe total length of all of the SFs in the ADLD of Figure 7(c)Therefore we can intuitively identify that the similarity of theproteins in each of these three protein sequence pairs satisfiesthe similarity of the proteins in the pair (human gorilla) gt thesimilarity of the proteins in the pair (human opossum) gt thesimilarity of the proteins in the pair (human gallus)
Moreover from Figure 7 we can also intuitively identifythat the two protein sequences in the protein sequence pair(human gorilla) are very similar to each other since the total
12 Computational and Mathematical Methods in Medicine
PEDVC
PEDV
GD03T0013
BJ01
GD01
MHVA
PC4205
GZ02
TOR2
MHVM
PC4137
FRA
PC4127
TaiwanTC1
MHVP
A022
civet007
civet010
TGEVG
TGEV
BCoVM
BCoVQ
HCoVOC43
IBVC
MHVJHM
BCoVE
BCoVL
IBVBJ
IBV
00000
00000
00000
00000
00013
00006
00006
00001
00001
00000
00012
00002
00001
00010
00013
00010
00000
00000
00000
00000
00000
00000
00001
00001
00000
00000
00000
00001
0000000000
00000
00000
00006
00000
00000
00001
00000
00000
00001
00002
00001
00000
00002
00004
00002
00003
00003
00008
00006
00008
00067
00022
00013
00013
00008
00042
Figure 5 The phylogenetic tree of the 29 spike proteins of coronavirus constructed by adopting the method proposed by Wen and Zhang
length of all of the SFs in the ADLD of Figure 7(a) looksvery long But on the contrary we can intuitively identifythat the two protein sequences in either the protein sequencepair (human opossum) or the protein sequence pair (humangallus) are apparently dissimilar to each other since both thetotal length of all of the SFs in the ADLD of Figure 7(b) andthat in the ADLD of Figure 7(c) look very short
And through statistic we can know that the actual totallengths of all of the SFs in the ADLDs of these three proteinsequence pairs (human gorilla) (human opossum) and(human gallus) are 556 288 and 248 respectively
Additionally observing Figures 2(a) and 2(b) hardly canwe distinguish the total length of all of the SFs (includingASFs and BSFs) in the ADLD of Figure 2(a) and that inthe ADLD of Figure 2(b) since the total lengths of all ofthe SFs in these two ADLDs look nearly the same Andthrough statistic we can know that the actual total lengthsof all of the SFs in the ADLDs of Figures 2(a) and 2(b)are 123 and 120 respectively and are really close to eachother But through comparing Figure 2(a) with Figure 2(b)more carefully we can further discover that different fromFigure 2(b) except for the SFs there are also 6 different FPs
Computational and Mathematical Methods in Medicine 13
Fin-whale
Rabbit
Blue-whale
Sheep
Goat
Hare
Rat
Lemur
Mouse
Cattle
Opossum
Gorilla
Human
Pi-chim
C-chim
Gallus
00006
00037
00004
00004
00001
00002
00000
00001
00074
00002
00015
00000
00001
00011
00183
00001
00006
00006
00004
00005
00002
00005
00031
00008
00024
00020
00039
00060
00024
00086
Figure 6 The phylogenetic tree of the 16 ND5 proteins constructed by adopting the method proposed by Wen and Zhang
700
600
500
400
300
200
100
07006005004003002001000
(a)
700
600
500
400
300
200
100
07006005004003002001000
(b)
700
600
500
400
300
200
100
07006005004003002001000
(c)
Figure 7 (a) The ADLD of the similar pair (human gorilla) (b) the ADLD of the dissimilar pair (human opossum) (c) the ADLD of thedissimilar pair (human gallus)
14 Computational and Mathematical Methods in Medicine
in the ADLD of Figure 2(a) while there are no FPs in theADLD of Figure 2(b) therefore we can intuitively identifythat the two protein sequences in the protein sequence pair(chimpanzee human) are more similar to the two proteinsequences in the protein sequence pair (human gorilla)
Hence from the above descriptions we can know that theADLDs obtained by our newly proposed method are quitevisual and intuitional andmay be a powerful and effective toolfor visual comparison of protein sequences and numericalsequences in other research fields
4 Conclusions
In this paper a novel ADLDs based graphical representationof protein sequences is proposed which is utilized to analyzethe similaritydissimilarity of protein sequences To validatethe performances of the newmethod we select two groups ofwell-known protein sequences as examples and additionallyin order to observe the similaritydissimilarity of proteinsequences more intuitively we construct the phylogenetictrees of protein sequences The results show that our ADLDsbased method not only has good performances and effects inthe similaritydissimilarity analysis of protein sequences butalso does not require complex computation since there areno high dimensional matrixes required Therefore it meansthat our ADLDs based method can work well in the analysisof protein sequences
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgments
The authors thank the anonymous referees for suggestionsthat helped in improving the paper substantially And theproject is partly sponsored by the Colleges and UniversitiesOpen Innovation Platform Fund of Hunan Province (no13K041) the Hunan Provincial Natural Science Foundationof China (no 14JJ2070) the Construct Program of the KeyDiscipline in Hunan province the State Education MinistryScientific Research Foundation for the Returned OverseasChinese Scholars and the Introduced Talent Start-up FundProject of Xiangtan University (no 11QDZ45)
References
[1] K Bucka-Lassen O Caprani and J Hein ldquoCombining manymultiple alignments in one improved alignmentrdquo Bioinformat-ics vol 15 no 2 pp 122ndash130 1999
[2] L Wang and T Jiang ldquoOn the complexity of multiple sequencealignmentrdquo Journal of Computational Biology vol 1 no 4 pp337ndash348 1994
[3] C Shyu L Sheneman and J A Foster ldquoMultiple sequencealignment with evolutionary computationrdquo Genetic Program-ming and Evolvable Machines vol 5 no 2 pp 121ndash144 2004
[4] M Randic ldquoGraphical representations of DNA as 2-D maprdquoChemical Physics Letters vol 386 no 4ndash6 pp 468ndash471 2004
[5] D Bielinska-Waz T Clark P Waz W Nowak and A Nandyldquo2D-dynamic representation of DNA sequencesrdquo ChemicalPhysics Letters vol 442 no 1ndash3 pp 140ndash144 2007
[6] Z Qi and X Qi ldquoNovel 2D graphical representation of DNAsequence based on dual nucleotidesrdquo Chemical Physics Lettersvol 440 no 1ndash3 pp 139ndash144 2007
[7] W Chen B Liao Y Liu and Z Su ldquoA numerical representationof DNA sequences and its applicationsrdquoMATCH Communica-tions in Mathematical and in Computer Chemistry vol 60 no2 pp 291ndash300 2008
[8] M Randic X Guo and S C Basak ldquoOn the characterization ofDNAprimary sequences by triplet of nucleic acid basesrdquo Journalof Chemical Information and Computer Sciences vol 41 no 3pp 619ndash626 2001
[9] M Randic M Vracko M Novic and D Plavsic ldquoSpectralrepresentation of reduced protein modelsrdquo SAR and QSAR inEnvironmental Research vol 20 no 5-6 pp 415ndash427 2009
[10] M Randic K Mehulic D Vukicevic T Pisanski D Vikic-Topic and D Plavsic ldquoGraphical representation of proteins asfour-color maps and the irnumerical characterizationrdquo Journalof Molecular Graphics and Modelling vol 27 no 5 pp 637ndash6412009
[11] F Bai and TWang ldquoOn graphical and numerical representationof protein sequencesrdquo Journal of Biomolecular Structure andDynamics vol 23 no 5 pp 537ndash545 2006
[12] M Randic ldquo2-D Graphical representation of proteins based onphysico-chemical properties of amino acidsrdquo Chemical PhysicsLetters vol 440 no 4ndash6 pp 291ndash295 2007
[13] A Ghosh and A Nandy ldquoGraphical representation and mathe-matical characterization of protein sequences and applicationsto viral proteinsrdquo Advances in Protein Chemistry and StructuralBiology vol 83 pp 1ndash42 2011
[14] C Li X Yu L Yang X Zheng and Z Wang ldquo3-D maps andcoupling numbers for protein sequencesrdquo Physica A StatisticalMechanics and its Applications vol 388 no 9 pp 1967ndash19722009
[15] M Randic J Zupan and D Vikic-Topic ldquoOn representation ofproteins by star-like graphsrdquo Journal of Molecular Graphics andModelling vol 26 no 1 pp 290ndash305 2007
[16] C Li L Xing and X Wang ldquo2-D graphical representation ofprotein sequences and its application to coronavirus phylogenyrdquoJournal of Biochemistry and Molecular Biology vol 41 no 3 pp217ndash222 2008
[17] J Wen and Y Zhang ldquoA 2D graphical representation of proteinsequence and its numerical characterizationrdquo Chemical PhysicsLetters vol 476 no 4ndash6 pp 281ndash286 2009
[18] Z-C Wu X Xiao and K-C Chou ldquo2D-MH a web-server forgenerating graphic representation of protein sequences basedon the physicochemical properties of their constituent aminoacidsrdquo Journal of Theoretical Biology vol 267 no 1 pp 29ndash342010
[19] B Liao X Sun and Q Zeng ldquoA novel method for similarityanalysis and protein sub-cellular localization predictionrdquo Bioin-formatics vol 26 no 21 pp 2678ndash2683 2010
[20] M Novic and M Randic ldquoRepresentation of proteins as walksin 20-D spacerdquo SAR and QSAR in Environmental Research vol19 no 3-4 pp 317ndash337 2008
[21] Z-H Qi J Feng X-Q Qi and L Li ldquoApplication of 2Dgraphic representation of protein sequence based on Huffmantree methodrdquo Computers in Biology and Medicine vol 42 no 5pp 556ndash563 2012
Computational and Mathematical Methods in Medicine 15
[22] H-J Yu and D-S Huang ldquoNovel 20-D descriptors of proteinsequences and itrsquos applications in similarity analysisrdquo ChemicalPhysics Letters vol 531 pp 261ndash266 2012
[23] P-A He J Wei Y Yao and Z Tie ldquoA novel graphical repre-sentation of proteins and its applicationrdquo Physica A StatisticalMechanics and its Applications vol 391 no 1-2 pp 93ndash99 2012
[24] M Randic M Novic andM Vracko ldquoOn novel representationof proteins based on amino acid adjacency matrixrdquo SAR andQSAR in Environmental Research vol 19 no 3-4 pp 339ndash3492008
[25] Z-P Feng andC-T Zhang ldquoA graphic representation of proteinsequence andpredicting the subcellular locations of prokaryoticproteinsrdquo International Journal of Biochemistry and Cell Biologyvol 34 no 3 pp 298ndash307 2002
[26] M Randic J Zupan and A T Balaban ldquoUnique graphicalrepresentation of protein sequences based on nucleotide tripletcodonsrdquo Chemical Physics Letters vol 397 no 1ndash3 pp 247ndash2522004
[27] F Bai and T Wang ldquoA 2-D graphical representation of proteinsequences based on nucleotide triplet codonsrdquoChemical PhysicsLetters vol 413 no 4ndash6 pp 458ndash462 2005
[28] Y-h Yao F Kong Q Dai and P-a He ldquoA sequence-segmentedmethod applied to the similarity analysis of long proteinsequencerdquo MATCH Communications in Mathematical and inComputer Chemistry vol 70 no 1 pp 431ndash450 2013
[29] M I Abo el Maaty M M Abo-Elkhier and M A AbdElwahaab ldquo3D graphical representation of protein sequencesand their statistical characterizationrdquo Physica A StatisticalMechanics and Its Applications vol 389 no 21 pp 4668ndash46762010
[30] M M Abo-Elkhier ldquoSimilaritydissimilarity analysis of proteinsequences using the spatial median as a descriptorrdquo Journal ofBiophysical Chemistry vol 3 pp 142ndash148 2012
[31] A El-Lakkani and S El-Sherif ldquoSimilarity analysis of proteinsequences based on 2D and 3D amino acid adjacencymatricesrdquoChemical Physics Letters vol 590 pp 192ndash195 2013
[32] T Ma Y Liu Q Dai Y Yao and P-A He ldquoA graphicalrepresentation of protein based on a novel iterated functionsystemrdquo Physica A Statistical Mechanics and its Applicationsvol 403 pp 21ndash28 2014
[33] D M Mount Bioinformatics Sequence and Genome AnalysisCold Spring Harbor Laboratory Press Cold Spring Harbor NYUSA 2nd edition 2004
[34] D G Higgins and P M Sharp ldquoCLUSTAL a package forperforming multiple sequence alignment on a microcomputerrdquoGene vol 73 no 1 pp 237ndash244 1988
[35] R C Edgar ldquoMUSCLE multiple sequence alignment with highaccuracy and high throughputrdquo Nucleic Acids Research vol 32no 5 pp 1792ndash1797 2004
[36] K Katoh K Misawa K-I Kuma and T Miyata ldquoMAFFT anovel method for rapid multiple sequence alignment based onfast Fourier transformrdquo Nucleic Acids Research vol 30 no 14pp 3059ndash3066 2002
[37] C Notredame D G Higgins and J Heringa ldquoT-coffee a novelmethod for fast and accurate multiple sequence alignmentrdquoJournal of Molecular Biology vol 302 no 1 pp 205ndash217 2000
[38] MA BalseraWWriggers YOono andK Schulten ldquoPrincipalcomponent analysis and long time protein dynamicsrdquo Journal ofPhysical Chemistry vol 100 no 7 pp 2567ndash2572 1996
[39] B Hess ldquoSimilarities between principal components of proteindynamics and random diffusionrdquo Physical Review EmdashStatistical
Physics Plasmas Fluids and Related Interdisciplinary Topicsvol 62 no 6B pp 8438ndash8448 2000
[40] A L Tournier and J C Smith ldquoPrincipal components of theprotein dynamical transitionrdquo Physical Review Letters vol 91no 20 Article ID 208106 2003
[41] K Tamura G Stecher D Peterson A Filipski and S KumarldquoMEGA6molecular evolutionary genetics analysis version 60rdquoMolecular Biology and Evolution vol 30 no 12 pp 2725ndash27292013
4 Computational and Mathematical Methods in Medicine
Table 2 The 9 eigenvalues (120582) of R and the contribution rates (CR)and the accumulative contribution rates (ACR) of the 9 principalcomponents obtained by conducting PCA of the 20 amino acids
Number 120582 CR ACR1 32237 03582 035822 19132 02126 057083 14048 01561 072694 11876 01320 085885 04959 00551 091396 04467 00496 096357 01992 00221 098578 01218 00135 099929 00071 00008 10000
Table 3The 4 eigenvectors 1198861 1198862 1198863 1198864 corresponding to the first
4 eigenvalues in Table 2
1198861
1198862
1198863
1198864
05036 01436 00571 02158minus02454 minus01875 02304 06547minus01634 01820 06298 02288minus03101 minus01883 minus03964 0507100702 06464 minus00786 00532minus01665 04465 minus05280 01877minus03872 03931 01003 minus00532minus04377 01844 02544 minus0227304349 02643 01738 03495
24 PCA of the Amino Acids Observing Table 1 if weconsider the 20 amino acids as 20 different samples and the9 properties of each amino acid as its 9 components thenaccording to the general steps of conducting PCA illustratedin Section 23 we can obtain a 20 times 9 matrix X and itsstandardized matrix Xlowast a 9 times 9 correlation matrix R and9 principal components 119865
1 1198652 119865
9 And therefore as
illustrated in Table 2 we can obtain the 9 eigenvalues of Rand the contribution rates and the accumulative contributionrates of the 9 principal components 119865
1 1198652 119865
9 respec-
tivelyFrom Table 2 we can see that the accumulative con-
tribution rate of the first 4 principal components amountsto 08588 (=8588) which is already bigger than 85Therefore we can keep the first 4 principal components onlyLet 120582
1 1205822 1205823 1205824 be the 4 eigenvalues corresponding to the
first 4 principal components respectively then as illustratedin Table 3 we can obtain the 4 eigenvectors 119886
1 1198862 1198863 1198864
corresponding to the 4 eigenvalues 1205821 1205822 1205823 1205824 separately
Based on Table 3 we can obtain the first 4 principalcomponents 119865
1 1198652 1198653 1198654 as follows
1198651= 05036119883
lowast
1minus 02454119883
lowast
2minus 01634119883
lowast
3
minus 03101119883lowast
4+ 00702119883
lowast
5minus 01665119883
lowast
6
minus 03872119883lowast
7minus 04377119883
lowast
8+ 04349119883
lowast
9
Table 4 The total scores of the 20 amino acids
Symbols of amino acids Total scoresA minus09324C minus05985D minus06709E minus02296F 04298G minus11780H 04476I 01435K 07868L minus01205M 05735N minus00242P minus09822Q 02848R 11169S minus07077T minus04525V minus02643W 14729Y 09050
1198652= 01436119883
lowast
1minus 01875119883
lowast
2+ 01820119883
lowast
3
minus 01883119883lowast
4+ 06464119883
lowast
5+ 04465119883
lowast
6
+ 03931119883lowast
7+ 01844119883
lowast
8+ 02643119883
lowast
9
1198653= 00571119883
lowast
1+ 02304119883
lowast
2+ 06298119883
lowast
3
minus 03964119883lowast
4minus 00786119883
lowast
5minus 05280119883
lowast
6
+ 01003119883lowast
7+ 02544119883
lowast
8+ 01738119883
lowast
9
1198654= 02158119883
lowast
1+ 06547119883
lowast
2+ 02288119883
lowast
3
+ 05071119883lowast
4+ 00532119883
lowast
5+ 01877119883
lowast
6
minus 00532119883lowast
7minus 02273119883
lowast
8+ 03495119883
lowast
9
(10)
Observing the above 4 formulas it is easy to find thatthere are three big coefficients in the first formula whichare 05036 (corresponding to mW) 04377 (corresponding to119865) and 04349 (corresponding to vR) respectivelyThereforeit means that the three properties such as mW 119865 andvR will have a major role in the first principal component1198651 Similarly we can also know that the three properties
such as pI 119878 and cN will have a major role in the secondprincipal component 119865
2 the third principal component 119865
3is
mainly determined by pK1 and 119878 and the fourth principalcomponent 119865
4is closely linked with hI and pK2 and so forth
Hence we can obtain the total scores of the 20 amino acids asillustrated in Table 4 according to formula (9)
Computational and Mathematical Methods in Medicine 5
25 Numerical Sequences of Protein Sequences LetΩ = AC
DE FGH IK LMNPQR STVWY and supposethat Ψ = 119901
111990121199013 119901119873represents a protein sequence with
119873 amino acids where 119901119894
isin Ω for 119894 isin 1 2 119873 thenwe can obtain a numerical sequence 119878
Ψ= (1199051 1199052 119905
119873)
corresponding to the protein sequence Ψ through replacingeach amino acid 119901
119894in Ψ with its corresponding value of
TotalScore(119894) for 119894 isin 1 2 119873For example consider the following 3 abbreviated protein
sequences
Hu = MTMHTTMTTLGor = MTMYATMTTLOpo = MKVINISNTM
According to the above descriptions and Table 4 thenwe can obtain their corresponding numerical sequences asfollows119878Hu = 05735 minus04525 05735 04476 minus04525 minus04525
05735 minus04525 minus04525 minus01205
119878Gor = 05735 minus04525 05735 09050 minus09324 minus04525
05735 minus04525 minus04525 minus01205
119878Opo = 05735 07868 minus02643 01435 minus00242 01435
minus07077 minus00242 minus04525 05735
(11)
26 ASDs and ADLDs of Protein Sequence Pairs For agiven protein sequence pair (119904
1 1199042) suppose that the protein
sequence 1199041includes 119873
1amino acids 119904
2includes 119873
2amino
acids and 1198731
⩾ 1198732 then in order to measure the
similaritydissimilarity between them in this section we willpresent a new method called Alignment Scatter Diagram(ASD) to plot the two sequences into a scatter diagram firstAnd for convenience we call the points in the ASD thealignment-plots (APs) The ASD of the protein sequence pair(1199041 1199042) can be obtained through the following steps
Step 1 According to the method given in Section 25 trans-late the protein sequence pair (119904
1 1199042) into two numerical
sequences with the same length as follows
1198781= 1199051 1199052 119905
1198731
1198782= 1199051 1199052 119905
1198732
0 0 0⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟
1198731minus1198732
(12)
Step 2 Let 119908 be the alignment width (AW) of the proteinsequence pair (119904
1 1199042) that is let 119904
1= 1199011 1199012 1199013 119901
1198731
1199042=
1199021 1199022 1199023 119902
1198732
then for any amino acid 119901119894in the protein
sequence 1199041 we will compare it with these 2119908 + 1 amino
acids 119902119894minus119908
119902119894minus1
119902119894 119902119894+1
119902119894+119908
in the protein sequence1199042 and then 119908 can be simply defined as follows
119908 =
120585 if 1198731minus 1198732le 120585
1198731minus 1198732 else
(13)
where 120585 gt 0 is a given threshold to guarantee that the AWof the protein sequence pair (119904
1 1199042) will not be too small to
expose the association of the inner structures of the proteinsequence pair (119904
1 1199042) In actual applications we suggest that
120585 shall be no less than 10
Step 3 Let 120576 gt 0 be the dissimilarity degree (DD) of twoamino acids that is if 120576 = 0 then it means that the two aminoacids are the same otherwise it means that the two aminoacids are different from each other to some degree and thenthe APs in the ASD of the protein sequence pair (119904
1 1199042) can
be briefly defined as follows
119860120576
119894119895= Θ (
10038161003816100381610038161003816119905119894minus 119905119895
10038161003816100381610038161003816minus 120576) (14)
where 119894 isin 1 2 1198731 119895 isin 1 2 119873
1 and Θ is a
Heaviside function which can be defined as follows
Θ (119909) =
1 if 119909 le 0
0 else(15)
Thereafter we can obtain an 1198731times 1198731alignment matrix
(AM) as follows
AM = (119860120576
119894119895)1198731times1198731
(16)
Step 4 For the1198731times1198731elements in the alignmentmatrix AM
we can plot points on 119894-119895 plane for these elements in the AMwith 119860
120576
119894119895= 1 and |119894 minus 119895| le 119908 And for convenience we call
the obtained graph the Alignment Scatter Diagram (ASD) ofthe protein sequence pair (119904
1 1199042)
For example considering the three 120573-globin pro-tein sequences of chimpanzee [GenBank AAA163341]human [GenBank CAA262041] and gorilla [GenBankCAA434211] obtained from the GenBank respectively weillustrate the ASDs of the 120573-globin protein sequence pair(chimpanzee human) and the 120573-globin protein sequencepair (human gorilla) in Figures 1(a) and 1(b) separately whileletting 120576 = 0
From Figure 1 it is easy to see that there are lots of disor-dered points in these ASDs which will lower the visuality ofthe ASDs remarkably and obstruct us fromdistinguishing thesimilaritydissimilarity between the protein sequence pairsintuitively while observing these ASDsTherefore in order toimprove the intuition of theASD wewill propose a simplifiedvariant diagram of the ASD which is called the AlignmentDiagonal Line Diagram (ADLD)
For convenience in an ASD we call its main diagonalline the artery tracks (ATs) and the lines parallelling to itsmain diagonal line the by-path tracks (BTs) respectively Andin addition we define a set consisting with no less than 120575
consecutive APs on the AT or BTs as a CAPS where 120575 ge 1
is a given thresholdFor a given CAPS caps
1 if there is no CAPS caps
2
satisfying caps1
sub caps2 then we call the caps
1a maximum
CAPS And for convenience we call the line formed byconnecting all of the APs in a maximum CAPS a similarfragment (SF) and simultaneously we call all of the APs onthe AT but not on any SFs the free points (FPs)
6 Computational and Mathematical Methods in Medicine
Obviously in an ASD if keeping all of the SFs and FPsonly and omitting all those other APs then we will obtain asimplified variant diagram of the ASD and for conveniencewe call it the Alignment Diagonal Line Diagram (ADLD)Apparently if 120575 = 1 then an ADLD will degenerate into anASD Therefore in actual applications we suggest that 120575 willbe no less than 2 And particularly in order to find moreaccurate SFs in the ADLD of a protein sequence pair thelonger the protein sequences in the protein sequence pair arethe bigger the value of 120575 shall be
For convenience of analysis in an ADLD suppose thatthere are 119870
1different SFs and 119870
2different FPs on its AT
119870 different BTs locating above its AT and 119870 different BTslocating below its AT then we get the following
(1) For these 1198701different SFs and 119870
2different FPs
on the AT of the ADLD we will number these1198701SFs and 119870
2FPs from left to right and utilize
ASF1ASF2 ASF1198701 and FP1 FP2 FP1198702 torepresent these 119870
1SFs and 119870
2FPs separately And
in addition we would also call these SFs on the AT ofthe ADLD the ASFs
(2) For these 119870 different BTs locating above the AT wewill number these BTs from down to up and utilizeBT1BT2 BT
119870 to represent these BTs separately
and for these 119870 different BTs locating below theAT we will number these BTs from up to down andutilize BT
minus1BTminus2
BTminus119870
to represent these BTsseparately
(3) For each BT119897 where 119897 isin 1 2 119870 suppose
that there are 1198703different SFs on the BT
119897 then we
will number these 1198703SFs from left to right and
utilize BSF1119897BSF2119897 BSF1198703
119897 to represent these SFs
separately And in addition we would also call theseSFs on the BTs of the ADLD the BSFs
According to the above assumptions in Figure 2 we showthe two ADLDs corresponding to the ASDs illustrated inFigures 1(a) and 1(b) while letting 120575 = 3 And in additionto make the ADLDs more visual and intuitional in Figure 2we use the red ldquolowastrdquo to represent the FPs on the AT and the bluelines to represent the SFs on the AT or BTs
From Figure 2(a) it is easy to see that there are two SFsin the ADLD of the sequence pair (chimpanzee human)one is ASF1 that is the line segment from the point (1 1)
to the point (32 32) and the other is BSF1minus4 that is the
line segment from the point (35 31) to the point (125 121)And in addition there are totally 6 FPs in the ADLD whichare FP1(46 46) FP2(66 66) FP3(111 111) FP4(114 114)FP5(115 115) and FP6(123 123) respectively
Observing Figure 2(b) we can easily find that there arealso two SFs in the ADLD of the sequence pair (humangorilla) But different from that in Figure 2(a) the two SFsin Figure 2(b) are both ASFs one is ASF1 that is the linesegment from the point (1 1) to the point (104 104) andthe other is ASF2 that is the line segment from the point(106 106) to the point (121 121) And in addition the twoASFs in Figure 2(b) are separated by one gap and there existno FPs or BSFs on the AT or BTs
Through analysis we can know that for a given proteinsequence pair if there exist some deletions or insertions ofamino acid segments between the two protein sequencesthen there will exist some misalignments of SFs in theirADLD that is some ASFs on the AT will be transformedinto BSFs on some BTs And in addition if there exist somesubstitutions of the amino acids between the two proteinsequences then in their ADLD there will exist some gapsbetween two neighboring SFs or FPs on the AT Furthermoreif there exist some insertions deletions or substitutions of theamino acid segments at the end of the two protein sequencesthen in their ADLD there will exist no SFs or FPs on the ATor BTs
From the above descriptions it is easy to know thatthe ADLD of any given protein sequence pair obtainedby our above proposed method reflects some inner andspecific differences between these two protein sequences inthe given protein sequence pair which may be useful in thesimilaritydissimilarity analysis of protein sequence pairs
3 Results and Discussion
31 Method for SimilarityDissimilarity Analysis of ProteinSequences Based on the ADLDs According to the aboveanalysis we have known that the ADLDs may be useful inanalyzing the differences of the inner structures of proteinsequence pairs In this section we will show how to utilizethe ADLDs to analyze the similaritydissimilarity of a groupof protein sequences
Generally suppose that there are 119873 protein sequencesΨ1 Ψ2 Ψ
119873 then while applying the ADLDs to analyze
the similaritydissimilarity of these119873 sequences the similar-itydissimilarity matrix of these119873 sequences can be obtainedthrough the following steps
Step 1 According to the method given in Section 25 trans-form these 119873 protein sequences into119873 numerical sequences1198781 1198782 119878
119873
Step 2 For a given protein sequence pair Ψ119886 Ψ119887 119886 isin
1 2 119873 119887 isin 1 2 119873 we can obtain their ADLDthrough adopting the method proposed in Section 26 andthen we can obtain all of the SFs (including ASFs and BSFs)and FPs in the ADLD Hence we can obtain the lengths oftheseASFs the lengths of these BSFs and the number of theseFPs respectively
Step 3 Suppose that there are totally 1198711different ASFs
such as ASF1ASF2 ASF1198711 1198712different BSFs such
as BSF11198971
BSF21198972
BSF11987121198971198712
and 1198713different FPs such as
FP1 FP2 FP1198713 in the ADLD And in addition for eachASF119894 and BSF119895
119897119895
let their length be length119894 and length119895
respectively where 119894 isin 1 2 1198711 and 119895 isin 1 2 119871
2
then we can define the similarity degree (SD) of Ψ119886 Ψ119887 as
follows
SD (Ψ119886 Ψ119887) =
1198711
sum
119894=1
length119894 +1198712
sum
119895=1
length119895+ 1198713 (17)
Computational and Mathematical Methods in Medicine 7
150
100
50
0150100500
(a)
150
100
50
0150100500
(b)
Figure 1 (a)TheASD of the 120573-globin protein sequence pair (chimpanzee human) with 120585 = 12 (b) the ASD of the 120573-globin protein sequencepair (human gorilla) with 120585 = 16
150
100
50
0150100500
(a)
150
100
50
0
150100500
(b)
Figure 2 (a) The ADLD of the protein sequence pair (chimpanzee human) (b) the ADLD of the protein sequence pair (human gorilla)
And therefore according to these 119873 protein sequencesΨ1 Ψ2 Ψ
119873 we can obtain an 119873 times 119873 matching matrix
(MM) as follows
MM = [119889119894119895]119873times119873
(18)
where
119889119894119895
=
SD (Ψ119894 Ψ119895) if 119894 ge 119895
0 else
for 119894 isin 1 2 119873 119895 isin 1 2 119873
(19)
Step 4 Based on the matching matrixMM and all of its com-ponents 119889
119894119895 where 119894 isin 1 2 119873 and 119895 isin 1 2 119873 then
we can obtain an 119873 times 119873 similaritydissimilarity matrix (SM)of these 119873 protein sequences Ψ
1 Ψ2 Ψ
119873 as follows
SM = [119904119894119895]119873times119873
(20)
where
119904119894119895
=
1 minus Λ(
119889119894119895
119889119894119894
) if 119894 ge 119895
0 else
Λ(
119889119894119895
119889119894119894
) =
1 if119889119894119895
119889119894119894
ge 1
119889119894119895
119889119894119894
else
for 119894 isin 1 2 119873 119895 isin 1 2 119873
(21)
According to the above steps we present an examplethrough implementing the ADLDs to analyze the similar-itydissimilarity of 16 ND5 proteins (illustrated in Table 5)while letting 120575 = 3 and illustrate the results of similar-itydissimilarity matrix in Table 6
Observing Table 6 it is easy to find that there are somesimilar pairs such as (c-chim pi-chim) with the distance00510 (human c-chim) with the distance 00814 (human
8 Computational and Mathematical Methods in Medicine
Table 5 The basic information of 16 ND5 protein sequences
Number Name Abbreviation Access number Length1 Human Human ADT804301 6032 Gorilla Gorilla NP 008222 6033 Pigmy chimpanzee Pi-chim NP 008209 6034 Common chimpanzee C-chim NP 008196 6035 Fin-whale Fin-whale NP 006899 6066 Blue-whale Blue-whale NP 007066 6067 Rat Rat AP 0049021 6108 Mouse Mouse NP 904338 6079 Opossum Opossum NP 007105 60210 Sheep Sheep ABW229031 60611 Goat Goat BAN592581 60612 Lemur Lemur CAD134311 60313 Cattle Cattle ADN119021 60614 Hare Hare CAD132911 60315 Gallus Gallus BAE160361 60516 Rabbit Rabbit NP 0075591 603
Sheep
Goat
Cattle
Fin-whale
Blue-whale
Lemur
Hare
Rabbit
Gorilla
Human
Pi-chim
C-chim
Rat
Mouse
Opossum
Gallus
00383
00468
00255
00255
00162
00162
00942
00942
02169
00356
00356
01084
00540
00419
02311
00419
00855
00128
00184
00085
00665
01080
00477
00770
00243
00176
00288
00163
00458
00142
000005010015020
Figure 3 The phylogenetic tree of the 16 species based on the ADLDs based method
pi-chim) with the distance 00720 (gorilla c-chim) with thedistance 00865 (gorilla pi-chim) with the distance 00833and (fin-whale blue-whale) with the distance 00324 Andamong them the opossum seems to be a peculiar mammalsince the shortest distance between it and the remaining
mammals is more than 04023 Obviously the result isconsistent with the fact that opossum is the most remotespecies from the remaining mammals
Additionally gallus seems to be more peculiar thanopossum since the shortest distance between it and
Computational and Mathematical Methods in Medicine 9
Table6Th
esim
ilaritydissim
ilaritymatrix
forthe
16ND5proteins
basedon
theA
DLD
sbased
metho
d
Hum
anGorilla
Pi-chim
C-chim
Fin-whale
Blue-w
hale
rat
mou
seop
ossum
sheep
goat
lemur
cattle
hare
gallu
srabbit
Hum
an000
00Gorilla
01111
00000
Pi-chim
007
20008
33000
00C-
chim
008
14008
65005
10000
00Fin-whale
03396
03285
03222
03301
00000
Blue-w
hale
03474
03333
03285
03301
00324
000
00Ra
t03693
03622
03636
03716
03333
03381
000
00Mou
se03740
03686
03716
03748
03317
03333
01883
000
00Opo
ssum
044
7604551
04290
044
1804515
04519
04513
044
79000
00Sheep
03020
02933
02871
02951
02023
02067
03149
03219
04121
00000
Goat
03036
02901
02871
02919
01958
02147
03166
03468
04202
007
12000
00Lemur
02989
02708
02839
02967
02557
02724
03166
03670
040
5502087
02317
000
00Ca
ttle
03114
03045
03046
03062
01958
02051
03149
03173
04235
009
0601254
02184
000
00Hare
03146
03157
03062
03046
02832
02788
03166
03421
04023
02217
02508
02053
02532
000
00Gallus
04726
04920
04737
05008
044
50044
2304903
04743
04691
04239
04524
046
8004183
046
6000000
Rabb
it03255
03189
03222
03142
02896
02756
03084
03390
04332
02184
02603
02282
02612
008
37044
34000
00
10 Computational and Mathematical Methods in Medicine
the remaining animals is more than 04423 which is biggerthan 04023 (the shortest distance between Opossum andthe remaining mammals) Obviously the result is consistentwith the fact that gallus is not a kind of mammal
Therefore it is apparent that the results illustrated inTable 6 are wholly consistent with the results of the knownfact of evolution That is to say our ADLDs based methodcan be utilized as an effective way to analyze the similari-tiesdissimilarities of protein sequences
32 The Phylogenetic Tree of the Protein Sequences Based onthe ADLDs A phylogenetic tree is a diagram that is used torepresent the evolutionary relationships of organisms that arethought to have a common ancestry and it is a commonlyused tool for researchers in some fields to help them analyzethe clustering of different species
Obviously only through observing the similaritydissimilaritymatrix illustrated inTable 6 wewill find that it isnot very convenient to distinguish the similaritydissimilarityof protein sequences Therefore in order to show thesimilaritydissimilarity of the protein sequences more vividlyand intuitively according to the similaritydissimilaritymatrix illustrated in Table 6 then we will construct thephylogenetic tree of the above 16 ND5 proteins throughadopting the software MEGA 606 that is provided byTamura et al [41] and the result is illustrated in Figure 3
From Figure 3 it is obvious that we can not only findout the evolutionary relationships of these 16 ND5 proteinsequences visually and intuitively but also know easily thatthe constructed phylogenetic tree is consistent with theresults of the known fact of evolution to some degree
To further validate the performance of our ADLDs basedmethod we applied our method to analyze the similar-itydissimilarity of another group of proteins including 29spike proteins of coronavirus and compared ourmethodwiththe method proposed by Wen and Zhang [17] based on theabove given 16 ND5 proteins and the following 29 spikeproteins respectively The basic information of the 29 spikeproteins is illustrated in Table 7
For the 29 spike proteins illustrated in Table 7 weconstruct the phylogenetic tree in Figure 4 Since the spikeprotein sequences are very long (with more than 1100 aminoacids) therefore during simulation we set 120575 = 5 to avoid theeffect of noise points
Generally coronavirus can always be classified into fourclasses such as the Group I the Group II the Group IIIand the SARS-CoVs (Severe Acute Respiratory SyndromeCoronaviruses) And among these four classes the GroupI includes the Canine coronavirus (CCoV) the Feline coron-avirus (FCoV) the Human coronavirus 229E (HCoV-229E)the Porcine epidemic diarrhea virus (PEDV) and the Trans-missible gastroenteritis virus (TGEV) The Group II includesthe Bovine coronavirus (BCoV) Human coronavirus OC43(HCoV-OC43) the Murine coronavirus Mouse hepatitisvirus (MHV) the Porcine hemagglutinating encephalomyeli-tis virus (HEV) and the Rat coronavirus (RtCoV)TheGroupIII contains theAvian infectious bronchitis virus (IBV) and theTurkey coronavirus (TCoV)
Table 7 The basic information of 29 spike proteins
Number Access number Abbreviation Length1 CAB91145 TGEVG 14472 NP 058424 TGEV 14473 AAK38656 PEDVC 13834 NP 598310 PEDV 13835 NP 937950 HCoVOC43 13616 AAK83356 BCoVE 13637 AAL57308 BCoVL 13638 AAA66399 BCoVM 13639 AAL40400 BCoVQ 136310 AAB86819 MHVA 132411 YP 209233 MHVJHM 137612 AAF69334 MHVP 132113 AAF69344 MHVM 132414 AAP92675 IBVBJ 116915 AAS00080 IBVC 116916 NP 040831 IBV 116217 AAS10463 GD03T0013 125518 AAU93318 PC4127 125519 AAV49720 PC4137 125520 AAU93319 PC4205 125521 AAU04646 civet007 125522 AAU04649 civet010 125523 AAV91631 A022 125524 AAP51227 GD01 125525 AAS00003 GZ02 125526 AAP30030 BJ01 125527 AAP50485 FRA 125528 AAP41037 TOR2 125529 AAQ01597 TaiwanTC1 1255
From observing Figure 4 it is easy to know that the 29spike proteins of coronavirus can be perfectly classified intothe above four classes by our ADLDs based method
Finally for the convenience of comparison we illustratethe phylogenetic trees of the above given 29 spike proteins ofcoronavirus and 16ND5proteins constructed by adopting themethod proposed byWen and Zhang [17] in Figures 5 and 6respectively
Comparing Figure 3 with Figure 6 and Figure 4 withFigure 5 respectively it is obvious that the phylogenetic treesobtained by the method proposed by Wen and Zhang arequite unreasonable and not consistent with the known factsof evolution at all But on the contrary the phylogenetictrees obtained by our ADLDs based method are not onlyquite reasonable but also consistent with the known facts ofevolution to some degree Therefore there is no doubt thatthe performance of our method is much better than that ofthe method proposed by Wen and Zhang
33 The Analysis of Intuition and Visuality of the ADLDsIn Section 26 we have stated that the ADLDs of proteinsequence pairs are intuitional and visual In this section we
Computational and Mathematical Methods in Medicine 11
BCoVE
BCoVL
BCoVM
BCoVQ
HCoVOC43
MHVJHM
MHVP
MHVA
MHVM
TGEVG
TGEV
PEDVC
PEDV
TOR2
TaiwanTC1
FRA
BJ01
GD01
GZ02
civet007
A022
civet010
PC4137
GD03T0013
PC4127
PC4205
IBV
IBVBJ
IBVC
00000
00000
00000
00000
00369
00026
00026
00022
00022
00019
00559
00221
00019
00950
00950
01221
00010
00004
00015
00004
00004
00006
00004
00012
00012
00011
00006
00004
00004
04350
04350
00002
00002
00006
00005
00020
00005
00019
00018
00011
00202
00116
00112
00044
00040
04560
00232
00337
03498
03309
0027203461
00713
00230
00049
00052
Group II
Group I
Group III
SARS CoVs
Figure 4 The phylogenetic tree of the 29 spike proteins of coronavirus constructed by adopting the ADLDs based method with 120575 = 5
will further discuss the intuition and visuality of the ADLDsin detail
From Table 6 we can obtain some similar pairs suchas (fin-whale blue-whale) (pi-chim c-chim) (Human c-chim) (cheep goat) (human pi-chim) and (hare rabbit)and some dissimilar pairs such as (human opossum) and(human gallus) among the above given 16 ND5 proteinsFrom these similardissimilar pairs wewill choose three pairsincluding (human gorilla) (human opossum) and (humangallus) as examples to further show the intuition and visualityof the ADLDs of these three protein sequence pairs TheADLDs of these three similardissimilar pairs are illustratedin Figure 7 while letting 120575 = 3
Observing Figure 7 we can clearly find that the totallength of all of the SFs in each of these three ADLDs satisfiesthe total length of all of the SFs in the ADLD of Figure 7(a) gtthe total length of all of the SFs in the ADLD of Figure 7(b) gtthe total length of all of the SFs in the ADLD of Figure 7(c)Therefore we can intuitively identify that the similarity of theproteins in each of these three protein sequence pairs satisfiesthe similarity of the proteins in the pair (human gorilla) gt thesimilarity of the proteins in the pair (human opossum) gt thesimilarity of the proteins in the pair (human gallus)
Moreover from Figure 7 we can also intuitively identifythat the two protein sequences in the protein sequence pair(human gorilla) are very similar to each other since the total
12 Computational and Mathematical Methods in Medicine
PEDVC
PEDV
GD03T0013
BJ01
GD01
MHVA
PC4205
GZ02
TOR2
MHVM
PC4137
FRA
PC4127
TaiwanTC1
MHVP
A022
civet007
civet010
TGEVG
TGEV
BCoVM
BCoVQ
HCoVOC43
IBVC
MHVJHM
BCoVE
BCoVL
IBVBJ
IBV
00000
00000
00000
00000
00013
00006
00006
00001
00001
00000
00012
00002
00001
00010
00013
00010
00000
00000
00000
00000
00000
00000
00001
00001
00000
00000
00000
00001
0000000000
00000
00000
00006
00000
00000
00001
00000
00000
00001
00002
00001
00000
00002
00004
00002
00003
00003
00008
00006
00008
00067
00022
00013
00013
00008
00042
Figure 5 The phylogenetic tree of the 29 spike proteins of coronavirus constructed by adopting the method proposed by Wen and Zhang
length of all of the SFs in the ADLD of Figure 7(a) looksvery long But on the contrary we can intuitively identifythat the two protein sequences in either the protein sequencepair (human opossum) or the protein sequence pair (humangallus) are apparently dissimilar to each other since both thetotal length of all of the SFs in the ADLD of Figure 7(b) andthat in the ADLD of Figure 7(c) look very short
And through statistic we can know that the actual totallengths of all of the SFs in the ADLDs of these three proteinsequence pairs (human gorilla) (human opossum) and(human gallus) are 556 288 and 248 respectively
Additionally observing Figures 2(a) and 2(b) hardly canwe distinguish the total length of all of the SFs (includingASFs and BSFs) in the ADLD of Figure 2(a) and that inthe ADLD of Figure 2(b) since the total lengths of all ofthe SFs in these two ADLDs look nearly the same Andthrough statistic we can know that the actual total lengthsof all of the SFs in the ADLDs of Figures 2(a) and 2(b)are 123 and 120 respectively and are really close to eachother But through comparing Figure 2(a) with Figure 2(b)more carefully we can further discover that different fromFigure 2(b) except for the SFs there are also 6 different FPs
Computational and Mathematical Methods in Medicine 13
Fin-whale
Rabbit
Blue-whale
Sheep
Goat
Hare
Rat
Lemur
Mouse
Cattle
Opossum
Gorilla
Human
Pi-chim
C-chim
Gallus
00006
00037
00004
00004
00001
00002
00000
00001
00074
00002
00015
00000
00001
00011
00183
00001
00006
00006
00004
00005
00002
00005
00031
00008
00024
00020
00039
00060
00024
00086
Figure 6 The phylogenetic tree of the 16 ND5 proteins constructed by adopting the method proposed by Wen and Zhang
700
600
500
400
300
200
100
07006005004003002001000
(a)
700
600
500
400
300
200
100
07006005004003002001000
(b)
700
600
500
400
300
200
100
07006005004003002001000
(c)
Figure 7 (a) The ADLD of the similar pair (human gorilla) (b) the ADLD of the dissimilar pair (human opossum) (c) the ADLD of thedissimilar pair (human gallus)
14 Computational and Mathematical Methods in Medicine
in the ADLD of Figure 2(a) while there are no FPs in theADLD of Figure 2(b) therefore we can intuitively identifythat the two protein sequences in the protein sequence pair(chimpanzee human) are more similar to the two proteinsequences in the protein sequence pair (human gorilla)
Hence from the above descriptions we can know that theADLDs obtained by our newly proposed method are quitevisual and intuitional andmay be a powerful and effective toolfor visual comparison of protein sequences and numericalsequences in other research fields
4 Conclusions
In this paper a novel ADLDs based graphical representationof protein sequences is proposed which is utilized to analyzethe similaritydissimilarity of protein sequences To validatethe performances of the newmethod we select two groups ofwell-known protein sequences as examples and additionallyin order to observe the similaritydissimilarity of proteinsequences more intuitively we construct the phylogenetictrees of protein sequences The results show that our ADLDsbased method not only has good performances and effects inthe similaritydissimilarity analysis of protein sequences butalso does not require complex computation since there areno high dimensional matrixes required Therefore it meansthat our ADLDs based method can work well in the analysisof protein sequences
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgments
The authors thank the anonymous referees for suggestionsthat helped in improving the paper substantially And theproject is partly sponsored by the Colleges and UniversitiesOpen Innovation Platform Fund of Hunan Province (no13K041) the Hunan Provincial Natural Science Foundationof China (no 14JJ2070) the Construct Program of the KeyDiscipline in Hunan province the State Education MinistryScientific Research Foundation for the Returned OverseasChinese Scholars and the Introduced Talent Start-up FundProject of Xiangtan University (no 11QDZ45)
References
[1] K Bucka-Lassen O Caprani and J Hein ldquoCombining manymultiple alignments in one improved alignmentrdquo Bioinformat-ics vol 15 no 2 pp 122ndash130 1999
[2] L Wang and T Jiang ldquoOn the complexity of multiple sequencealignmentrdquo Journal of Computational Biology vol 1 no 4 pp337ndash348 1994
[3] C Shyu L Sheneman and J A Foster ldquoMultiple sequencealignment with evolutionary computationrdquo Genetic Program-ming and Evolvable Machines vol 5 no 2 pp 121ndash144 2004
[4] M Randic ldquoGraphical representations of DNA as 2-D maprdquoChemical Physics Letters vol 386 no 4ndash6 pp 468ndash471 2004
[5] D Bielinska-Waz T Clark P Waz W Nowak and A Nandyldquo2D-dynamic representation of DNA sequencesrdquo ChemicalPhysics Letters vol 442 no 1ndash3 pp 140ndash144 2007
[6] Z Qi and X Qi ldquoNovel 2D graphical representation of DNAsequence based on dual nucleotidesrdquo Chemical Physics Lettersvol 440 no 1ndash3 pp 139ndash144 2007
[7] W Chen B Liao Y Liu and Z Su ldquoA numerical representationof DNA sequences and its applicationsrdquoMATCH Communica-tions in Mathematical and in Computer Chemistry vol 60 no2 pp 291ndash300 2008
[8] M Randic X Guo and S C Basak ldquoOn the characterization ofDNAprimary sequences by triplet of nucleic acid basesrdquo Journalof Chemical Information and Computer Sciences vol 41 no 3pp 619ndash626 2001
[9] M Randic M Vracko M Novic and D Plavsic ldquoSpectralrepresentation of reduced protein modelsrdquo SAR and QSAR inEnvironmental Research vol 20 no 5-6 pp 415ndash427 2009
[10] M Randic K Mehulic D Vukicevic T Pisanski D Vikic-Topic and D Plavsic ldquoGraphical representation of proteins asfour-color maps and the irnumerical characterizationrdquo Journalof Molecular Graphics and Modelling vol 27 no 5 pp 637ndash6412009
[11] F Bai and TWang ldquoOn graphical and numerical representationof protein sequencesrdquo Journal of Biomolecular Structure andDynamics vol 23 no 5 pp 537ndash545 2006
[12] M Randic ldquo2-D Graphical representation of proteins based onphysico-chemical properties of amino acidsrdquo Chemical PhysicsLetters vol 440 no 4ndash6 pp 291ndash295 2007
[13] A Ghosh and A Nandy ldquoGraphical representation and mathe-matical characterization of protein sequences and applicationsto viral proteinsrdquo Advances in Protein Chemistry and StructuralBiology vol 83 pp 1ndash42 2011
[14] C Li X Yu L Yang X Zheng and Z Wang ldquo3-D maps andcoupling numbers for protein sequencesrdquo Physica A StatisticalMechanics and its Applications vol 388 no 9 pp 1967ndash19722009
[15] M Randic J Zupan and D Vikic-Topic ldquoOn representation ofproteins by star-like graphsrdquo Journal of Molecular Graphics andModelling vol 26 no 1 pp 290ndash305 2007
[16] C Li L Xing and X Wang ldquo2-D graphical representation ofprotein sequences and its application to coronavirus phylogenyrdquoJournal of Biochemistry and Molecular Biology vol 41 no 3 pp217ndash222 2008
[17] J Wen and Y Zhang ldquoA 2D graphical representation of proteinsequence and its numerical characterizationrdquo Chemical PhysicsLetters vol 476 no 4ndash6 pp 281ndash286 2009
[18] Z-C Wu X Xiao and K-C Chou ldquo2D-MH a web-server forgenerating graphic representation of protein sequences basedon the physicochemical properties of their constituent aminoacidsrdquo Journal of Theoretical Biology vol 267 no 1 pp 29ndash342010
[19] B Liao X Sun and Q Zeng ldquoA novel method for similarityanalysis and protein sub-cellular localization predictionrdquo Bioin-formatics vol 26 no 21 pp 2678ndash2683 2010
[20] M Novic and M Randic ldquoRepresentation of proteins as walksin 20-D spacerdquo SAR and QSAR in Environmental Research vol19 no 3-4 pp 317ndash337 2008
[21] Z-H Qi J Feng X-Q Qi and L Li ldquoApplication of 2Dgraphic representation of protein sequence based on Huffmantree methodrdquo Computers in Biology and Medicine vol 42 no 5pp 556ndash563 2012
Computational and Mathematical Methods in Medicine 15
[22] H-J Yu and D-S Huang ldquoNovel 20-D descriptors of proteinsequences and itrsquos applications in similarity analysisrdquo ChemicalPhysics Letters vol 531 pp 261ndash266 2012
[23] P-A He J Wei Y Yao and Z Tie ldquoA novel graphical repre-sentation of proteins and its applicationrdquo Physica A StatisticalMechanics and its Applications vol 391 no 1-2 pp 93ndash99 2012
[24] M Randic M Novic andM Vracko ldquoOn novel representationof proteins based on amino acid adjacency matrixrdquo SAR andQSAR in Environmental Research vol 19 no 3-4 pp 339ndash3492008
[25] Z-P Feng andC-T Zhang ldquoA graphic representation of proteinsequence andpredicting the subcellular locations of prokaryoticproteinsrdquo International Journal of Biochemistry and Cell Biologyvol 34 no 3 pp 298ndash307 2002
[26] M Randic J Zupan and A T Balaban ldquoUnique graphicalrepresentation of protein sequences based on nucleotide tripletcodonsrdquo Chemical Physics Letters vol 397 no 1ndash3 pp 247ndash2522004
[27] F Bai and T Wang ldquoA 2-D graphical representation of proteinsequences based on nucleotide triplet codonsrdquoChemical PhysicsLetters vol 413 no 4ndash6 pp 458ndash462 2005
[28] Y-h Yao F Kong Q Dai and P-a He ldquoA sequence-segmentedmethod applied to the similarity analysis of long proteinsequencerdquo MATCH Communications in Mathematical and inComputer Chemistry vol 70 no 1 pp 431ndash450 2013
[29] M I Abo el Maaty M M Abo-Elkhier and M A AbdElwahaab ldquo3D graphical representation of protein sequencesand their statistical characterizationrdquo Physica A StatisticalMechanics and Its Applications vol 389 no 21 pp 4668ndash46762010
[30] M M Abo-Elkhier ldquoSimilaritydissimilarity analysis of proteinsequences using the spatial median as a descriptorrdquo Journal ofBiophysical Chemistry vol 3 pp 142ndash148 2012
[31] A El-Lakkani and S El-Sherif ldquoSimilarity analysis of proteinsequences based on 2D and 3D amino acid adjacencymatricesrdquoChemical Physics Letters vol 590 pp 192ndash195 2013
[32] T Ma Y Liu Q Dai Y Yao and P-A He ldquoA graphicalrepresentation of protein based on a novel iterated functionsystemrdquo Physica A Statistical Mechanics and its Applicationsvol 403 pp 21ndash28 2014
[33] D M Mount Bioinformatics Sequence and Genome AnalysisCold Spring Harbor Laboratory Press Cold Spring Harbor NYUSA 2nd edition 2004
[34] D G Higgins and P M Sharp ldquoCLUSTAL a package forperforming multiple sequence alignment on a microcomputerrdquoGene vol 73 no 1 pp 237ndash244 1988
[35] R C Edgar ldquoMUSCLE multiple sequence alignment with highaccuracy and high throughputrdquo Nucleic Acids Research vol 32no 5 pp 1792ndash1797 2004
[36] K Katoh K Misawa K-I Kuma and T Miyata ldquoMAFFT anovel method for rapid multiple sequence alignment based onfast Fourier transformrdquo Nucleic Acids Research vol 30 no 14pp 3059ndash3066 2002
[37] C Notredame D G Higgins and J Heringa ldquoT-coffee a novelmethod for fast and accurate multiple sequence alignmentrdquoJournal of Molecular Biology vol 302 no 1 pp 205ndash217 2000
[38] MA BalseraWWriggers YOono andK Schulten ldquoPrincipalcomponent analysis and long time protein dynamicsrdquo Journal ofPhysical Chemistry vol 100 no 7 pp 2567ndash2572 1996
[39] B Hess ldquoSimilarities between principal components of proteindynamics and random diffusionrdquo Physical Review EmdashStatistical
Physics Plasmas Fluids and Related Interdisciplinary Topicsvol 62 no 6B pp 8438ndash8448 2000
[40] A L Tournier and J C Smith ldquoPrincipal components of theprotein dynamical transitionrdquo Physical Review Letters vol 91no 20 Article ID 208106 2003
[41] K Tamura G Stecher D Peterson A Filipski and S KumarldquoMEGA6molecular evolutionary genetics analysis version 60rdquoMolecular Biology and Evolution vol 30 no 12 pp 2725ndash27292013
Computational and Mathematical Methods in Medicine 5
25 Numerical Sequences of Protein Sequences LetΩ = AC
DE FGH IK LMNPQR STVWY and supposethat Ψ = 119901
111990121199013 119901119873represents a protein sequence with
119873 amino acids where 119901119894
isin Ω for 119894 isin 1 2 119873 thenwe can obtain a numerical sequence 119878
Ψ= (1199051 1199052 119905
119873)
corresponding to the protein sequence Ψ through replacingeach amino acid 119901
119894in Ψ with its corresponding value of
TotalScore(119894) for 119894 isin 1 2 119873For example consider the following 3 abbreviated protein
sequences
Hu = MTMHTTMTTLGor = MTMYATMTTLOpo = MKVINISNTM
According to the above descriptions and Table 4 thenwe can obtain their corresponding numerical sequences asfollows119878Hu = 05735 minus04525 05735 04476 minus04525 minus04525
05735 minus04525 minus04525 minus01205
119878Gor = 05735 minus04525 05735 09050 minus09324 minus04525
05735 minus04525 minus04525 minus01205
119878Opo = 05735 07868 minus02643 01435 minus00242 01435
minus07077 minus00242 minus04525 05735
(11)
26 ASDs and ADLDs of Protein Sequence Pairs For agiven protein sequence pair (119904
1 1199042) suppose that the protein
sequence 1199041includes 119873
1amino acids 119904
2includes 119873
2amino
acids and 1198731
⩾ 1198732 then in order to measure the
similaritydissimilarity between them in this section we willpresent a new method called Alignment Scatter Diagram(ASD) to plot the two sequences into a scatter diagram firstAnd for convenience we call the points in the ASD thealignment-plots (APs) The ASD of the protein sequence pair(1199041 1199042) can be obtained through the following steps
Step 1 According to the method given in Section 25 trans-late the protein sequence pair (119904
1 1199042) into two numerical
sequences with the same length as follows
1198781= 1199051 1199052 119905
1198731
1198782= 1199051 1199052 119905
1198732
0 0 0⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟
1198731minus1198732
(12)
Step 2 Let 119908 be the alignment width (AW) of the proteinsequence pair (119904
1 1199042) that is let 119904
1= 1199011 1199012 1199013 119901
1198731
1199042=
1199021 1199022 1199023 119902
1198732
then for any amino acid 119901119894in the protein
sequence 1199041 we will compare it with these 2119908 + 1 amino
acids 119902119894minus119908
119902119894minus1
119902119894 119902119894+1
119902119894+119908
in the protein sequence1199042 and then 119908 can be simply defined as follows
119908 =
120585 if 1198731minus 1198732le 120585
1198731minus 1198732 else
(13)
where 120585 gt 0 is a given threshold to guarantee that the AWof the protein sequence pair (119904
1 1199042) will not be too small to
expose the association of the inner structures of the proteinsequence pair (119904
1 1199042) In actual applications we suggest that
120585 shall be no less than 10
Step 3 Let 120576 gt 0 be the dissimilarity degree (DD) of twoamino acids that is if 120576 = 0 then it means that the two aminoacids are the same otherwise it means that the two aminoacids are different from each other to some degree and thenthe APs in the ASD of the protein sequence pair (119904
1 1199042) can
be briefly defined as follows
119860120576
119894119895= Θ (
10038161003816100381610038161003816119905119894minus 119905119895
10038161003816100381610038161003816minus 120576) (14)
where 119894 isin 1 2 1198731 119895 isin 1 2 119873
1 and Θ is a
Heaviside function which can be defined as follows
Θ (119909) =
1 if 119909 le 0
0 else(15)
Thereafter we can obtain an 1198731times 1198731alignment matrix
(AM) as follows
AM = (119860120576
119894119895)1198731times1198731
(16)
Step 4 For the1198731times1198731elements in the alignmentmatrix AM
we can plot points on 119894-119895 plane for these elements in the AMwith 119860
120576
119894119895= 1 and |119894 minus 119895| le 119908 And for convenience we call
the obtained graph the Alignment Scatter Diagram (ASD) ofthe protein sequence pair (119904
1 1199042)
For example considering the three 120573-globin pro-tein sequences of chimpanzee [GenBank AAA163341]human [GenBank CAA262041] and gorilla [GenBankCAA434211] obtained from the GenBank respectively weillustrate the ASDs of the 120573-globin protein sequence pair(chimpanzee human) and the 120573-globin protein sequencepair (human gorilla) in Figures 1(a) and 1(b) separately whileletting 120576 = 0
From Figure 1 it is easy to see that there are lots of disor-dered points in these ASDs which will lower the visuality ofthe ASDs remarkably and obstruct us fromdistinguishing thesimilaritydissimilarity between the protein sequence pairsintuitively while observing these ASDsTherefore in order toimprove the intuition of theASD wewill propose a simplifiedvariant diagram of the ASD which is called the AlignmentDiagonal Line Diagram (ADLD)
For convenience in an ASD we call its main diagonalline the artery tracks (ATs) and the lines parallelling to itsmain diagonal line the by-path tracks (BTs) respectively Andin addition we define a set consisting with no less than 120575
consecutive APs on the AT or BTs as a CAPS where 120575 ge 1
is a given thresholdFor a given CAPS caps
1 if there is no CAPS caps
2
satisfying caps1
sub caps2 then we call the caps
1a maximum
CAPS And for convenience we call the line formed byconnecting all of the APs in a maximum CAPS a similarfragment (SF) and simultaneously we call all of the APs onthe AT but not on any SFs the free points (FPs)
6 Computational and Mathematical Methods in Medicine
Obviously in an ASD if keeping all of the SFs and FPsonly and omitting all those other APs then we will obtain asimplified variant diagram of the ASD and for conveniencewe call it the Alignment Diagonal Line Diagram (ADLD)Apparently if 120575 = 1 then an ADLD will degenerate into anASD Therefore in actual applications we suggest that 120575 willbe no less than 2 And particularly in order to find moreaccurate SFs in the ADLD of a protein sequence pair thelonger the protein sequences in the protein sequence pair arethe bigger the value of 120575 shall be
For convenience of analysis in an ADLD suppose thatthere are 119870
1different SFs and 119870
2different FPs on its AT
119870 different BTs locating above its AT and 119870 different BTslocating below its AT then we get the following
(1) For these 1198701different SFs and 119870
2different FPs
on the AT of the ADLD we will number these1198701SFs and 119870
2FPs from left to right and utilize
ASF1ASF2 ASF1198701 and FP1 FP2 FP1198702 torepresent these 119870
1SFs and 119870
2FPs separately And
in addition we would also call these SFs on the AT ofthe ADLD the ASFs
(2) For these 119870 different BTs locating above the AT wewill number these BTs from down to up and utilizeBT1BT2 BT
119870 to represent these BTs separately
and for these 119870 different BTs locating below theAT we will number these BTs from up to down andutilize BT
minus1BTminus2
BTminus119870
to represent these BTsseparately
(3) For each BT119897 where 119897 isin 1 2 119870 suppose
that there are 1198703different SFs on the BT
119897 then we
will number these 1198703SFs from left to right and
utilize BSF1119897BSF2119897 BSF1198703
119897 to represent these SFs
separately And in addition we would also call theseSFs on the BTs of the ADLD the BSFs
According to the above assumptions in Figure 2 we showthe two ADLDs corresponding to the ASDs illustrated inFigures 1(a) and 1(b) while letting 120575 = 3 And in additionto make the ADLDs more visual and intuitional in Figure 2we use the red ldquolowastrdquo to represent the FPs on the AT and the bluelines to represent the SFs on the AT or BTs
From Figure 2(a) it is easy to see that there are two SFsin the ADLD of the sequence pair (chimpanzee human)one is ASF1 that is the line segment from the point (1 1)
to the point (32 32) and the other is BSF1minus4 that is the
line segment from the point (35 31) to the point (125 121)And in addition there are totally 6 FPs in the ADLD whichare FP1(46 46) FP2(66 66) FP3(111 111) FP4(114 114)FP5(115 115) and FP6(123 123) respectively
Observing Figure 2(b) we can easily find that there arealso two SFs in the ADLD of the sequence pair (humangorilla) But different from that in Figure 2(a) the two SFsin Figure 2(b) are both ASFs one is ASF1 that is the linesegment from the point (1 1) to the point (104 104) andthe other is ASF2 that is the line segment from the point(106 106) to the point (121 121) And in addition the twoASFs in Figure 2(b) are separated by one gap and there existno FPs or BSFs on the AT or BTs
Through analysis we can know that for a given proteinsequence pair if there exist some deletions or insertions ofamino acid segments between the two protein sequencesthen there will exist some misalignments of SFs in theirADLD that is some ASFs on the AT will be transformedinto BSFs on some BTs And in addition if there exist somesubstitutions of the amino acids between the two proteinsequences then in their ADLD there will exist some gapsbetween two neighboring SFs or FPs on the AT Furthermoreif there exist some insertions deletions or substitutions of theamino acid segments at the end of the two protein sequencesthen in their ADLD there will exist no SFs or FPs on the ATor BTs
From the above descriptions it is easy to know thatthe ADLD of any given protein sequence pair obtainedby our above proposed method reflects some inner andspecific differences between these two protein sequences inthe given protein sequence pair which may be useful in thesimilaritydissimilarity analysis of protein sequence pairs
3 Results and Discussion
31 Method for SimilarityDissimilarity Analysis of ProteinSequences Based on the ADLDs According to the aboveanalysis we have known that the ADLDs may be useful inanalyzing the differences of the inner structures of proteinsequence pairs In this section we will show how to utilizethe ADLDs to analyze the similaritydissimilarity of a groupof protein sequences
Generally suppose that there are 119873 protein sequencesΨ1 Ψ2 Ψ
119873 then while applying the ADLDs to analyze
the similaritydissimilarity of these119873 sequences the similar-itydissimilarity matrix of these119873 sequences can be obtainedthrough the following steps
Step 1 According to the method given in Section 25 trans-form these 119873 protein sequences into119873 numerical sequences1198781 1198782 119878
119873
Step 2 For a given protein sequence pair Ψ119886 Ψ119887 119886 isin
1 2 119873 119887 isin 1 2 119873 we can obtain their ADLDthrough adopting the method proposed in Section 26 andthen we can obtain all of the SFs (including ASFs and BSFs)and FPs in the ADLD Hence we can obtain the lengths oftheseASFs the lengths of these BSFs and the number of theseFPs respectively
Step 3 Suppose that there are totally 1198711different ASFs
such as ASF1ASF2 ASF1198711 1198712different BSFs such
as BSF11198971
BSF21198972
BSF11987121198971198712
and 1198713different FPs such as
FP1 FP2 FP1198713 in the ADLD And in addition for eachASF119894 and BSF119895
119897119895
let their length be length119894 and length119895
respectively where 119894 isin 1 2 1198711 and 119895 isin 1 2 119871
2
then we can define the similarity degree (SD) of Ψ119886 Ψ119887 as
follows
SD (Ψ119886 Ψ119887) =
1198711
sum
119894=1
length119894 +1198712
sum
119895=1
length119895+ 1198713 (17)
Computational and Mathematical Methods in Medicine 7
150
100
50
0150100500
(a)
150
100
50
0150100500
(b)
Figure 1 (a)TheASD of the 120573-globin protein sequence pair (chimpanzee human) with 120585 = 12 (b) the ASD of the 120573-globin protein sequencepair (human gorilla) with 120585 = 16
150
100
50
0150100500
(a)
150
100
50
0
150100500
(b)
Figure 2 (a) The ADLD of the protein sequence pair (chimpanzee human) (b) the ADLD of the protein sequence pair (human gorilla)
And therefore according to these 119873 protein sequencesΨ1 Ψ2 Ψ
119873 we can obtain an 119873 times 119873 matching matrix
(MM) as follows
MM = [119889119894119895]119873times119873
(18)
where
119889119894119895
=
SD (Ψ119894 Ψ119895) if 119894 ge 119895
0 else
for 119894 isin 1 2 119873 119895 isin 1 2 119873
(19)
Step 4 Based on the matching matrixMM and all of its com-ponents 119889
119894119895 where 119894 isin 1 2 119873 and 119895 isin 1 2 119873 then
we can obtain an 119873 times 119873 similaritydissimilarity matrix (SM)of these 119873 protein sequences Ψ
1 Ψ2 Ψ
119873 as follows
SM = [119904119894119895]119873times119873
(20)
where
119904119894119895
=
1 minus Λ(
119889119894119895
119889119894119894
) if 119894 ge 119895
0 else
Λ(
119889119894119895
119889119894119894
) =
1 if119889119894119895
119889119894119894
ge 1
119889119894119895
119889119894119894
else
for 119894 isin 1 2 119873 119895 isin 1 2 119873
(21)
According to the above steps we present an examplethrough implementing the ADLDs to analyze the similar-itydissimilarity of 16 ND5 proteins (illustrated in Table 5)while letting 120575 = 3 and illustrate the results of similar-itydissimilarity matrix in Table 6
Observing Table 6 it is easy to find that there are somesimilar pairs such as (c-chim pi-chim) with the distance00510 (human c-chim) with the distance 00814 (human
8 Computational and Mathematical Methods in Medicine
Table 5 The basic information of 16 ND5 protein sequences
Number Name Abbreviation Access number Length1 Human Human ADT804301 6032 Gorilla Gorilla NP 008222 6033 Pigmy chimpanzee Pi-chim NP 008209 6034 Common chimpanzee C-chim NP 008196 6035 Fin-whale Fin-whale NP 006899 6066 Blue-whale Blue-whale NP 007066 6067 Rat Rat AP 0049021 6108 Mouse Mouse NP 904338 6079 Opossum Opossum NP 007105 60210 Sheep Sheep ABW229031 60611 Goat Goat BAN592581 60612 Lemur Lemur CAD134311 60313 Cattle Cattle ADN119021 60614 Hare Hare CAD132911 60315 Gallus Gallus BAE160361 60516 Rabbit Rabbit NP 0075591 603
Sheep
Goat
Cattle
Fin-whale
Blue-whale
Lemur
Hare
Rabbit
Gorilla
Human
Pi-chim
C-chim
Rat
Mouse
Opossum
Gallus
00383
00468
00255
00255
00162
00162
00942
00942
02169
00356
00356
01084
00540
00419
02311
00419
00855
00128
00184
00085
00665
01080
00477
00770
00243
00176
00288
00163
00458
00142
000005010015020
Figure 3 The phylogenetic tree of the 16 species based on the ADLDs based method
pi-chim) with the distance 00720 (gorilla c-chim) with thedistance 00865 (gorilla pi-chim) with the distance 00833and (fin-whale blue-whale) with the distance 00324 Andamong them the opossum seems to be a peculiar mammalsince the shortest distance between it and the remaining
mammals is more than 04023 Obviously the result isconsistent with the fact that opossum is the most remotespecies from the remaining mammals
Additionally gallus seems to be more peculiar thanopossum since the shortest distance between it and
Computational and Mathematical Methods in Medicine 9
Table6Th
esim
ilaritydissim
ilaritymatrix
forthe
16ND5proteins
basedon
theA
DLD
sbased
metho
d
Hum
anGorilla
Pi-chim
C-chim
Fin-whale
Blue-w
hale
rat
mou
seop
ossum
sheep
goat
lemur
cattle
hare
gallu
srabbit
Hum
an000
00Gorilla
01111
00000
Pi-chim
007
20008
33000
00C-
chim
008
14008
65005
10000
00Fin-whale
03396
03285
03222
03301
00000
Blue-w
hale
03474
03333
03285
03301
00324
000
00Ra
t03693
03622
03636
03716
03333
03381
000
00Mou
se03740
03686
03716
03748
03317
03333
01883
000
00Opo
ssum
044
7604551
04290
044
1804515
04519
04513
044
79000
00Sheep
03020
02933
02871
02951
02023
02067
03149
03219
04121
00000
Goat
03036
02901
02871
02919
01958
02147
03166
03468
04202
007
12000
00Lemur
02989
02708
02839
02967
02557
02724
03166
03670
040
5502087
02317
000
00Ca
ttle
03114
03045
03046
03062
01958
02051
03149
03173
04235
009
0601254
02184
000
00Hare
03146
03157
03062
03046
02832
02788
03166
03421
04023
02217
02508
02053
02532
000
00Gallus
04726
04920
04737
05008
044
50044
2304903
04743
04691
04239
04524
046
8004183
046
6000000
Rabb
it03255
03189
03222
03142
02896
02756
03084
03390
04332
02184
02603
02282
02612
008
37044
34000
00
10 Computational and Mathematical Methods in Medicine
the remaining animals is more than 04423 which is biggerthan 04023 (the shortest distance between Opossum andthe remaining mammals) Obviously the result is consistentwith the fact that gallus is not a kind of mammal
Therefore it is apparent that the results illustrated inTable 6 are wholly consistent with the results of the knownfact of evolution That is to say our ADLDs based methodcan be utilized as an effective way to analyze the similari-tiesdissimilarities of protein sequences
32 The Phylogenetic Tree of the Protein Sequences Based onthe ADLDs A phylogenetic tree is a diagram that is used torepresent the evolutionary relationships of organisms that arethought to have a common ancestry and it is a commonlyused tool for researchers in some fields to help them analyzethe clustering of different species
Obviously only through observing the similaritydissimilaritymatrix illustrated inTable 6 wewill find that it isnot very convenient to distinguish the similaritydissimilarityof protein sequences Therefore in order to show thesimilaritydissimilarity of the protein sequences more vividlyand intuitively according to the similaritydissimilaritymatrix illustrated in Table 6 then we will construct thephylogenetic tree of the above 16 ND5 proteins throughadopting the software MEGA 606 that is provided byTamura et al [41] and the result is illustrated in Figure 3
From Figure 3 it is obvious that we can not only findout the evolutionary relationships of these 16 ND5 proteinsequences visually and intuitively but also know easily thatthe constructed phylogenetic tree is consistent with theresults of the known fact of evolution to some degree
To further validate the performance of our ADLDs basedmethod we applied our method to analyze the similar-itydissimilarity of another group of proteins including 29spike proteins of coronavirus and compared ourmethodwiththe method proposed by Wen and Zhang [17] based on theabove given 16 ND5 proteins and the following 29 spikeproteins respectively The basic information of the 29 spikeproteins is illustrated in Table 7
For the 29 spike proteins illustrated in Table 7 weconstruct the phylogenetic tree in Figure 4 Since the spikeprotein sequences are very long (with more than 1100 aminoacids) therefore during simulation we set 120575 = 5 to avoid theeffect of noise points
Generally coronavirus can always be classified into fourclasses such as the Group I the Group II the Group IIIand the SARS-CoVs (Severe Acute Respiratory SyndromeCoronaviruses) And among these four classes the GroupI includes the Canine coronavirus (CCoV) the Feline coron-avirus (FCoV) the Human coronavirus 229E (HCoV-229E)the Porcine epidemic diarrhea virus (PEDV) and the Trans-missible gastroenteritis virus (TGEV) The Group II includesthe Bovine coronavirus (BCoV) Human coronavirus OC43(HCoV-OC43) the Murine coronavirus Mouse hepatitisvirus (MHV) the Porcine hemagglutinating encephalomyeli-tis virus (HEV) and the Rat coronavirus (RtCoV)TheGroupIII contains theAvian infectious bronchitis virus (IBV) and theTurkey coronavirus (TCoV)
Table 7 The basic information of 29 spike proteins
Number Access number Abbreviation Length1 CAB91145 TGEVG 14472 NP 058424 TGEV 14473 AAK38656 PEDVC 13834 NP 598310 PEDV 13835 NP 937950 HCoVOC43 13616 AAK83356 BCoVE 13637 AAL57308 BCoVL 13638 AAA66399 BCoVM 13639 AAL40400 BCoVQ 136310 AAB86819 MHVA 132411 YP 209233 MHVJHM 137612 AAF69334 MHVP 132113 AAF69344 MHVM 132414 AAP92675 IBVBJ 116915 AAS00080 IBVC 116916 NP 040831 IBV 116217 AAS10463 GD03T0013 125518 AAU93318 PC4127 125519 AAV49720 PC4137 125520 AAU93319 PC4205 125521 AAU04646 civet007 125522 AAU04649 civet010 125523 AAV91631 A022 125524 AAP51227 GD01 125525 AAS00003 GZ02 125526 AAP30030 BJ01 125527 AAP50485 FRA 125528 AAP41037 TOR2 125529 AAQ01597 TaiwanTC1 1255
From observing Figure 4 it is easy to know that the 29spike proteins of coronavirus can be perfectly classified intothe above four classes by our ADLDs based method
Finally for the convenience of comparison we illustratethe phylogenetic trees of the above given 29 spike proteins ofcoronavirus and 16ND5proteins constructed by adopting themethod proposed byWen and Zhang [17] in Figures 5 and 6respectively
Comparing Figure 3 with Figure 6 and Figure 4 withFigure 5 respectively it is obvious that the phylogenetic treesobtained by the method proposed by Wen and Zhang arequite unreasonable and not consistent with the known factsof evolution at all But on the contrary the phylogenetictrees obtained by our ADLDs based method are not onlyquite reasonable but also consistent with the known facts ofevolution to some degree Therefore there is no doubt thatthe performance of our method is much better than that ofthe method proposed by Wen and Zhang
33 The Analysis of Intuition and Visuality of the ADLDsIn Section 26 we have stated that the ADLDs of proteinsequence pairs are intuitional and visual In this section we
Computational and Mathematical Methods in Medicine 11
BCoVE
BCoVL
BCoVM
BCoVQ
HCoVOC43
MHVJHM
MHVP
MHVA
MHVM
TGEVG
TGEV
PEDVC
PEDV
TOR2
TaiwanTC1
FRA
BJ01
GD01
GZ02
civet007
A022
civet010
PC4137
GD03T0013
PC4127
PC4205
IBV
IBVBJ
IBVC
00000
00000
00000
00000
00369
00026
00026
00022
00022
00019
00559
00221
00019
00950
00950
01221
00010
00004
00015
00004
00004
00006
00004
00012
00012
00011
00006
00004
00004
04350
04350
00002
00002
00006
00005
00020
00005
00019
00018
00011
00202
00116
00112
00044
00040
04560
00232
00337
03498
03309
0027203461
00713
00230
00049
00052
Group II
Group I
Group III
SARS CoVs
Figure 4 The phylogenetic tree of the 29 spike proteins of coronavirus constructed by adopting the ADLDs based method with 120575 = 5
will further discuss the intuition and visuality of the ADLDsin detail
From Table 6 we can obtain some similar pairs suchas (fin-whale blue-whale) (pi-chim c-chim) (Human c-chim) (cheep goat) (human pi-chim) and (hare rabbit)and some dissimilar pairs such as (human opossum) and(human gallus) among the above given 16 ND5 proteinsFrom these similardissimilar pairs wewill choose three pairsincluding (human gorilla) (human opossum) and (humangallus) as examples to further show the intuition and visualityof the ADLDs of these three protein sequence pairs TheADLDs of these three similardissimilar pairs are illustratedin Figure 7 while letting 120575 = 3
Observing Figure 7 we can clearly find that the totallength of all of the SFs in each of these three ADLDs satisfiesthe total length of all of the SFs in the ADLD of Figure 7(a) gtthe total length of all of the SFs in the ADLD of Figure 7(b) gtthe total length of all of the SFs in the ADLD of Figure 7(c)Therefore we can intuitively identify that the similarity of theproteins in each of these three protein sequence pairs satisfiesthe similarity of the proteins in the pair (human gorilla) gt thesimilarity of the proteins in the pair (human opossum) gt thesimilarity of the proteins in the pair (human gallus)
Moreover from Figure 7 we can also intuitively identifythat the two protein sequences in the protein sequence pair(human gorilla) are very similar to each other since the total
12 Computational and Mathematical Methods in Medicine
PEDVC
PEDV
GD03T0013
BJ01
GD01
MHVA
PC4205
GZ02
TOR2
MHVM
PC4137
FRA
PC4127
TaiwanTC1
MHVP
A022
civet007
civet010
TGEVG
TGEV
BCoVM
BCoVQ
HCoVOC43
IBVC
MHVJHM
BCoVE
BCoVL
IBVBJ
IBV
00000
00000
00000
00000
00013
00006
00006
00001
00001
00000
00012
00002
00001
00010
00013
00010
00000
00000
00000
00000
00000
00000
00001
00001
00000
00000
00000
00001
0000000000
00000
00000
00006
00000
00000
00001
00000
00000
00001
00002
00001
00000
00002
00004
00002
00003
00003
00008
00006
00008
00067
00022
00013
00013
00008
00042
Figure 5 The phylogenetic tree of the 29 spike proteins of coronavirus constructed by adopting the method proposed by Wen and Zhang
length of all of the SFs in the ADLD of Figure 7(a) looksvery long But on the contrary we can intuitively identifythat the two protein sequences in either the protein sequencepair (human opossum) or the protein sequence pair (humangallus) are apparently dissimilar to each other since both thetotal length of all of the SFs in the ADLD of Figure 7(b) andthat in the ADLD of Figure 7(c) look very short
And through statistic we can know that the actual totallengths of all of the SFs in the ADLDs of these three proteinsequence pairs (human gorilla) (human opossum) and(human gallus) are 556 288 and 248 respectively
Additionally observing Figures 2(a) and 2(b) hardly canwe distinguish the total length of all of the SFs (includingASFs and BSFs) in the ADLD of Figure 2(a) and that inthe ADLD of Figure 2(b) since the total lengths of all ofthe SFs in these two ADLDs look nearly the same Andthrough statistic we can know that the actual total lengthsof all of the SFs in the ADLDs of Figures 2(a) and 2(b)are 123 and 120 respectively and are really close to eachother But through comparing Figure 2(a) with Figure 2(b)more carefully we can further discover that different fromFigure 2(b) except for the SFs there are also 6 different FPs
Computational and Mathematical Methods in Medicine 13
Fin-whale
Rabbit
Blue-whale
Sheep
Goat
Hare
Rat
Lemur
Mouse
Cattle
Opossum
Gorilla
Human
Pi-chim
C-chim
Gallus
00006
00037
00004
00004
00001
00002
00000
00001
00074
00002
00015
00000
00001
00011
00183
00001
00006
00006
00004
00005
00002
00005
00031
00008
00024
00020
00039
00060
00024
00086
Figure 6 The phylogenetic tree of the 16 ND5 proteins constructed by adopting the method proposed by Wen and Zhang
700
600
500
400
300
200
100
07006005004003002001000
(a)
700
600
500
400
300
200
100
07006005004003002001000
(b)
700
600
500
400
300
200
100
07006005004003002001000
(c)
Figure 7 (a) The ADLD of the similar pair (human gorilla) (b) the ADLD of the dissimilar pair (human opossum) (c) the ADLD of thedissimilar pair (human gallus)
14 Computational and Mathematical Methods in Medicine
in the ADLD of Figure 2(a) while there are no FPs in theADLD of Figure 2(b) therefore we can intuitively identifythat the two protein sequences in the protein sequence pair(chimpanzee human) are more similar to the two proteinsequences in the protein sequence pair (human gorilla)
Hence from the above descriptions we can know that theADLDs obtained by our newly proposed method are quitevisual and intuitional andmay be a powerful and effective toolfor visual comparison of protein sequences and numericalsequences in other research fields
4 Conclusions
In this paper a novel ADLDs based graphical representationof protein sequences is proposed which is utilized to analyzethe similaritydissimilarity of protein sequences To validatethe performances of the newmethod we select two groups ofwell-known protein sequences as examples and additionallyin order to observe the similaritydissimilarity of proteinsequences more intuitively we construct the phylogenetictrees of protein sequences The results show that our ADLDsbased method not only has good performances and effects inthe similaritydissimilarity analysis of protein sequences butalso does not require complex computation since there areno high dimensional matrixes required Therefore it meansthat our ADLDs based method can work well in the analysisof protein sequences
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgments
The authors thank the anonymous referees for suggestionsthat helped in improving the paper substantially And theproject is partly sponsored by the Colleges and UniversitiesOpen Innovation Platform Fund of Hunan Province (no13K041) the Hunan Provincial Natural Science Foundationof China (no 14JJ2070) the Construct Program of the KeyDiscipline in Hunan province the State Education MinistryScientific Research Foundation for the Returned OverseasChinese Scholars and the Introduced Talent Start-up FundProject of Xiangtan University (no 11QDZ45)
References
[1] K Bucka-Lassen O Caprani and J Hein ldquoCombining manymultiple alignments in one improved alignmentrdquo Bioinformat-ics vol 15 no 2 pp 122ndash130 1999
[2] L Wang and T Jiang ldquoOn the complexity of multiple sequencealignmentrdquo Journal of Computational Biology vol 1 no 4 pp337ndash348 1994
[3] C Shyu L Sheneman and J A Foster ldquoMultiple sequencealignment with evolutionary computationrdquo Genetic Program-ming and Evolvable Machines vol 5 no 2 pp 121ndash144 2004
[4] M Randic ldquoGraphical representations of DNA as 2-D maprdquoChemical Physics Letters vol 386 no 4ndash6 pp 468ndash471 2004
[5] D Bielinska-Waz T Clark P Waz W Nowak and A Nandyldquo2D-dynamic representation of DNA sequencesrdquo ChemicalPhysics Letters vol 442 no 1ndash3 pp 140ndash144 2007
[6] Z Qi and X Qi ldquoNovel 2D graphical representation of DNAsequence based on dual nucleotidesrdquo Chemical Physics Lettersvol 440 no 1ndash3 pp 139ndash144 2007
[7] W Chen B Liao Y Liu and Z Su ldquoA numerical representationof DNA sequences and its applicationsrdquoMATCH Communica-tions in Mathematical and in Computer Chemistry vol 60 no2 pp 291ndash300 2008
[8] M Randic X Guo and S C Basak ldquoOn the characterization ofDNAprimary sequences by triplet of nucleic acid basesrdquo Journalof Chemical Information and Computer Sciences vol 41 no 3pp 619ndash626 2001
[9] M Randic M Vracko M Novic and D Plavsic ldquoSpectralrepresentation of reduced protein modelsrdquo SAR and QSAR inEnvironmental Research vol 20 no 5-6 pp 415ndash427 2009
[10] M Randic K Mehulic D Vukicevic T Pisanski D Vikic-Topic and D Plavsic ldquoGraphical representation of proteins asfour-color maps and the irnumerical characterizationrdquo Journalof Molecular Graphics and Modelling vol 27 no 5 pp 637ndash6412009
[11] F Bai and TWang ldquoOn graphical and numerical representationof protein sequencesrdquo Journal of Biomolecular Structure andDynamics vol 23 no 5 pp 537ndash545 2006
[12] M Randic ldquo2-D Graphical representation of proteins based onphysico-chemical properties of amino acidsrdquo Chemical PhysicsLetters vol 440 no 4ndash6 pp 291ndash295 2007
[13] A Ghosh and A Nandy ldquoGraphical representation and mathe-matical characterization of protein sequences and applicationsto viral proteinsrdquo Advances in Protein Chemistry and StructuralBiology vol 83 pp 1ndash42 2011
[14] C Li X Yu L Yang X Zheng and Z Wang ldquo3-D maps andcoupling numbers for protein sequencesrdquo Physica A StatisticalMechanics and its Applications vol 388 no 9 pp 1967ndash19722009
[15] M Randic J Zupan and D Vikic-Topic ldquoOn representation ofproteins by star-like graphsrdquo Journal of Molecular Graphics andModelling vol 26 no 1 pp 290ndash305 2007
[16] C Li L Xing and X Wang ldquo2-D graphical representation ofprotein sequences and its application to coronavirus phylogenyrdquoJournal of Biochemistry and Molecular Biology vol 41 no 3 pp217ndash222 2008
[17] J Wen and Y Zhang ldquoA 2D graphical representation of proteinsequence and its numerical characterizationrdquo Chemical PhysicsLetters vol 476 no 4ndash6 pp 281ndash286 2009
[18] Z-C Wu X Xiao and K-C Chou ldquo2D-MH a web-server forgenerating graphic representation of protein sequences basedon the physicochemical properties of their constituent aminoacidsrdquo Journal of Theoretical Biology vol 267 no 1 pp 29ndash342010
[19] B Liao X Sun and Q Zeng ldquoA novel method for similarityanalysis and protein sub-cellular localization predictionrdquo Bioin-formatics vol 26 no 21 pp 2678ndash2683 2010
[20] M Novic and M Randic ldquoRepresentation of proteins as walksin 20-D spacerdquo SAR and QSAR in Environmental Research vol19 no 3-4 pp 317ndash337 2008
[21] Z-H Qi J Feng X-Q Qi and L Li ldquoApplication of 2Dgraphic representation of protein sequence based on Huffmantree methodrdquo Computers in Biology and Medicine vol 42 no 5pp 556ndash563 2012
Computational and Mathematical Methods in Medicine 15
[22] H-J Yu and D-S Huang ldquoNovel 20-D descriptors of proteinsequences and itrsquos applications in similarity analysisrdquo ChemicalPhysics Letters vol 531 pp 261ndash266 2012
[23] P-A He J Wei Y Yao and Z Tie ldquoA novel graphical repre-sentation of proteins and its applicationrdquo Physica A StatisticalMechanics and its Applications vol 391 no 1-2 pp 93ndash99 2012
[24] M Randic M Novic andM Vracko ldquoOn novel representationof proteins based on amino acid adjacency matrixrdquo SAR andQSAR in Environmental Research vol 19 no 3-4 pp 339ndash3492008
[25] Z-P Feng andC-T Zhang ldquoA graphic representation of proteinsequence andpredicting the subcellular locations of prokaryoticproteinsrdquo International Journal of Biochemistry and Cell Biologyvol 34 no 3 pp 298ndash307 2002
[26] M Randic J Zupan and A T Balaban ldquoUnique graphicalrepresentation of protein sequences based on nucleotide tripletcodonsrdquo Chemical Physics Letters vol 397 no 1ndash3 pp 247ndash2522004
[27] F Bai and T Wang ldquoA 2-D graphical representation of proteinsequences based on nucleotide triplet codonsrdquoChemical PhysicsLetters vol 413 no 4ndash6 pp 458ndash462 2005
[28] Y-h Yao F Kong Q Dai and P-a He ldquoA sequence-segmentedmethod applied to the similarity analysis of long proteinsequencerdquo MATCH Communications in Mathematical and inComputer Chemistry vol 70 no 1 pp 431ndash450 2013
[29] M I Abo el Maaty M M Abo-Elkhier and M A AbdElwahaab ldquo3D graphical representation of protein sequencesand their statistical characterizationrdquo Physica A StatisticalMechanics and Its Applications vol 389 no 21 pp 4668ndash46762010
[30] M M Abo-Elkhier ldquoSimilaritydissimilarity analysis of proteinsequences using the spatial median as a descriptorrdquo Journal ofBiophysical Chemistry vol 3 pp 142ndash148 2012
[31] A El-Lakkani and S El-Sherif ldquoSimilarity analysis of proteinsequences based on 2D and 3D amino acid adjacencymatricesrdquoChemical Physics Letters vol 590 pp 192ndash195 2013
[32] T Ma Y Liu Q Dai Y Yao and P-A He ldquoA graphicalrepresentation of protein based on a novel iterated functionsystemrdquo Physica A Statistical Mechanics and its Applicationsvol 403 pp 21ndash28 2014
[33] D M Mount Bioinformatics Sequence and Genome AnalysisCold Spring Harbor Laboratory Press Cold Spring Harbor NYUSA 2nd edition 2004
[34] D G Higgins and P M Sharp ldquoCLUSTAL a package forperforming multiple sequence alignment on a microcomputerrdquoGene vol 73 no 1 pp 237ndash244 1988
[35] R C Edgar ldquoMUSCLE multiple sequence alignment with highaccuracy and high throughputrdquo Nucleic Acids Research vol 32no 5 pp 1792ndash1797 2004
[36] K Katoh K Misawa K-I Kuma and T Miyata ldquoMAFFT anovel method for rapid multiple sequence alignment based onfast Fourier transformrdquo Nucleic Acids Research vol 30 no 14pp 3059ndash3066 2002
[37] C Notredame D G Higgins and J Heringa ldquoT-coffee a novelmethod for fast and accurate multiple sequence alignmentrdquoJournal of Molecular Biology vol 302 no 1 pp 205ndash217 2000
[38] MA BalseraWWriggers YOono andK Schulten ldquoPrincipalcomponent analysis and long time protein dynamicsrdquo Journal ofPhysical Chemistry vol 100 no 7 pp 2567ndash2572 1996
[39] B Hess ldquoSimilarities between principal components of proteindynamics and random diffusionrdquo Physical Review EmdashStatistical
Physics Plasmas Fluids and Related Interdisciplinary Topicsvol 62 no 6B pp 8438ndash8448 2000
[40] A L Tournier and J C Smith ldquoPrincipal components of theprotein dynamical transitionrdquo Physical Review Letters vol 91no 20 Article ID 208106 2003
[41] K Tamura G Stecher D Peterson A Filipski and S KumarldquoMEGA6molecular evolutionary genetics analysis version 60rdquoMolecular Biology and Evolution vol 30 no 12 pp 2725ndash27292013
6 Computational and Mathematical Methods in Medicine
Obviously in an ASD if keeping all of the SFs and FPsonly and omitting all those other APs then we will obtain asimplified variant diagram of the ASD and for conveniencewe call it the Alignment Diagonal Line Diagram (ADLD)Apparently if 120575 = 1 then an ADLD will degenerate into anASD Therefore in actual applications we suggest that 120575 willbe no less than 2 And particularly in order to find moreaccurate SFs in the ADLD of a protein sequence pair thelonger the protein sequences in the protein sequence pair arethe bigger the value of 120575 shall be
For convenience of analysis in an ADLD suppose thatthere are 119870
1different SFs and 119870
2different FPs on its AT
119870 different BTs locating above its AT and 119870 different BTslocating below its AT then we get the following
(1) For these 1198701different SFs and 119870
2different FPs
on the AT of the ADLD we will number these1198701SFs and 119870
2FPs from left to right and utilize
ASF1ASF2 ASF1198701 and FP1 FP2 FP1198702 torepresent these 119870
1SFs and 119870
2FPs separately And
in addition we would also call these SFs on the AT ofthe ADLD the ASFs
(2) For these 119870 different BTs locating above the AT wewill number these BTs from down to up and utilizeBT1BT2 BT
119870 to represent these BTs separately
and for these 119870 different BTs locating below theAT we will number these BTs from up to down andutilize BT
minus1BTminus2
BTminus119870
to represent these BTsseparately
(3) For each BT119897 where 119897 isin 1 2 119870 suppose
that there are 1198703different SFs on the BT
119897 then we
will number these 1198703SFs from left to right and
utilize BSF1119897BSF2119897 BSF1198703
119897 to represent these SFs
separately And in addition we would also call theseSFs on the BTs of the ADLD the BSFs
According to the above assumptions in Figure 2 we showthe two ADLDs corresponding to the ASDs illustrated inFigures 1(a) and 1(b) while letting 120575 = 3 And in additionto make the ADLDs more visual and intuitional in Figure 2we use the red ldquolowastrdquo to represent the FPs on the AT and the bluelines to represent the SFs on the AT or BTs
From Figure 2(a) it is easy to see that there are two SFsin the ADLD of the sequence pair (chimpanzee human)one is ASF1 that is the line segment from the point (1 1)
to the point (32 32) and the other is BSF1minus4 that is the
line segment from the point (35 31) to the point (125 121)And in addition there are totally 6 FPs in the ADLD whichare FP1(46 46) FP2(66 66) FP3(111 111) FP4(114 114)FP5(115 115) and FP6(123 123) respectively
Observing Figure 2(b) we can easily find that there arealso two SFs in the ADLD of the sequence pair (humangorilla) But different from that in Figure 2(a) the two SFsin Figure 2(b) are both ASFs one is ASF1 that is the linesegment from the point (1 1) to the point (104 104) andthe other is ASF2 that is the line segment from the point(106 106) to the point (121 121) And in addition the twoASFs in Figure 2(b) are separated by one gap and there existno FPs or BSFs on the AT or BTs
Through analysis we can know that for a given proteinsequence pair if there exist some deletions or insertions ofamino acid segments between the two protein sequencesthen there will exist some misalignments of SFs in theirADLD that is some ASFs on the AT will be transformedinto BSFs on some BTs And in addition if there exist somesubstitutions of the amino acids between the two proteinsequences then in their ADLD there will exist some gapsbetween two neighboring SFs or FPs on the AT Furthermoreif there exist some insertions deletions or substitutions of theamino acid segments at the end of the two protein sequencesthen in their ADLD there will exist no SFs or FPs on the ATor BTs
From the above descriptions it is easy to know thatthe ADLD of any given protein sequence pair obtainedby our above proposed method reflects some inner andspecific differences between these two protein sequences inthe given protein sequence pair which may be useful in thesimilaritydissimilarity analysis of protein sequence pairs
3 Results and Discussion
31 Method for SimilarityDissimilarity Analysis of ProteinSequences Based on the ADLDs According to the aboveanalysis we have known that the ADLDs may be useful inanalyzing the differences of the inner structures of proteinsequence pairs In this section we will show how to utilizethe ADLDs to analyze the similaritydissimilarity of a groupof protein sequences
Generally suppose that there are 119873 protein sequencesΨ1 Ψ2 Ψ
119873 then while applying the ADLDs to analyze
the similaritydissimilarity of these119873 sequences the similar-itydissimilarity matrix of these119873 sequences can be obtainedthrough the following steps
Step 1 According to the method given in Section 25 trans-form these 119873 protein sequences into119873 numerical sequences1198781 1198782 119878
119873
Step 2 For a given protein sequence pair Ψ119886 Ψ119887 119886 isin
1 2 119873 119887 isin 1 2 119873 we can obtain their ADLDthrough adopting the method proposed in Section 26 andthen we can obtain all of the SFs (including ASFs and BSFs)and FPs in the ADLD Hence we can obtain the lengths oftheseASFs the lengths of these BSFs and the number of theseFPs respectively
Step 3 Suppose that there are totally 1198711different ASFs
such as ASF1ASF2 ASF1198711 1198712different BSFs such
as BSF11198971
BSF21198972
BSF11987121198971198712
and 1198713different FPs such as
FP1 FP2 FP1198713 in the ADLD And in addition for eachASF119894 and BSF119895
119897119895
let their length be length119894 and length119895
respectively where 119894 isin 1 2 1198711 and 119895 isin 1 2 119871
2
then we can define the similarity degree (SD) of Ψ119886 Ψ119887 as
follows
SD (Ψ119886 Ψ119887) =
1198711
sum
119894=1
length119894 +1198712
sum
119895=1
length119895+ 1198713 (17)
Computational and Mathematical Methods in Medicine 7
150
100
50
0150100500
(a)
150
100
50
0150100500
(b)
Figure 1 (a)TheASD of the 120573-globin protein sequence pair (chimpanzee human) with 120585 = 12 (b) the ASD of the 120573-globin protein sequencepair (human gorilla) with 120585 = 16
150
100
50
0150100500
(a)
150
100
50
0
150100500
(b)
Figure 2 (a) The ADLD of the protein sequence pair (chimpanzee human) (b) the ADLD of the protein sequence pair (human gorilla)
And therefore according to these 119873 protein sequencesΨ1 Ψ2 Ψ
119873 we can obtain an 119873 times 119873 matching matrix
(MM) as follows
MM = [119889119894119895]119873times119873
(18)
where
119889119894119895
=
SD (Ψ119894 Ψ119895) if 119894 ge 119895
0 else
for 119894 isin 1 2 119873 119895 isin 1 2 119873
(19)
Step 4 Based on the matching matrixMM and all of its com-ponents 119889
119894119895 where 119894 isin 1 2 119873 and 119895 isin 1 2 119873 then
we can obtain an 119873 times 119873 similaritydissimilarity matrix (SM)of these 119873 protein sequences Ψ
1 Ψ2 Ψ
119873 as follows
SM = [119904119894119895]119873times119873
(20)
where
119904119894119895
=
1 minus Λ(
119889119894119895
119889119894119894
) if 119894 ge 119895
0 else
Λ(
119889119894119895
119889119894119894
) =
1 if119889119894119895
119889119894119894
ge 1
119889119894119895
119889119894119894
else
for 119894 isin 1 2 119873 119895 isin 1 2 119873
(21)
According to the above steps we present an examplethrough implementing the ADLDs to analyze the similar-itydissimilarity of 16 ND5 proteins (illustrated in Table 5)while letting 120575 = 3 and illustrate the results of similar-itydissimilarity matrix in Table 6
Observing Table 6 it is easy to find that there are somesimilar pairs such as (c-chim pi-chim) with the distance00510 (human c-chim) with the distance 00814 (human
8 Computational and Mathematical Methods in Medicine
Table 5 The basic information of 16 ND5 protein sequences
Number Name Abbreviation Access number Length1 Human Human ADT804301 6032 Gorilla Gorilla NP 008222 6033 Pigmy chimpanzee Pi-chim NP 008209 6034 Common chimpanzee C-chim NP 008196 6035 Fin-whale Fin-whale NP 006899 6066 Blue-whale Blue-whale NP 007066 6067 Rat Rat AP 0049021 6108 Mouse Mouse NP 904338 6079 Opossum Opossum NP 007105 60210 Sheep Sheep ABW229031 60611 Goat Goat BAN592581 60612 Lemur Lemur CAD134311 60313 Cattle Cattle ADN119021 60614 Hare Hare CAD132911 60315 Gallus Gallus BAE160361 60516 Rabbit Rabbit NP 0075591 603
Sheep
Goat
Cattle
Fin-whale
Blue-whale
Lemur
Hare
Rabbit
Gorilla
Human
Pi-chim
C-chim
Rat
Mouse
Opossum
Gallus
00383
00468
00255
00255
00162
00162
00942
00942
02169
00356
00356
01084
00540
00419
02311
00419
00855
00128
00184
00085
00665
01080
00477
00770
00243
00176
00288
00163
00458
00142
000005010015020
Figure 3 The phylogenetic tree of the 16 species based on the ADLDs based method
pi-chim) with the distance 00720 (gorilla c-chim) with thedistance 00865 (gorilla pi-chim) with the distance 00833and (fin-whale blue-whale) with the distance 00324 Andamong them the opossum seems to be a peculiar mammalsince the shortest distance between it and the remaining
mammals is more than 04023 Obviously the result isconsistent with the fact that opossum is the most remotespecies from the remaining mammals
Additionally gallus seems to be more peculiar thanopossum since the shortest distance between it and
Computational and Mathematical Methods in Medicine 9
Table6Th
esim
ilaritydissim
ilaritymatrix
forthe
16ND5proteins
basedon
theA
DLD
sbased
metho
d
Hum
anGorilla
Pi-chim
C-chim
Fin-whale
Blue-w
hale
rat
mou
seop
ossum
sheep
goat
lemur
cattle
hare
gallu
srabbit
Hum
an000
00Gorilla
01111
00000
Pi-chim
007
20008
33000
00C-
chim
008
14008
65005
10000
00Fin-whale
03396
03285
03222
03301
00000
Blue-w
hale
03474
03333
03285
03301
00324
000
00Ra
t03693
03622
03636
03716
03333
03381
000
00Mou
se03740
03686
03716
03748
03317
03333
01883
000
00Opo
ssum
044
7604551
04290
044
1804515
04519
04513
044
79000
00Sheep
03020
02933
02871
02951
02023
02067
03149
03219
04121
00000
Goat
03036
02901
02871
02919
01958
02147
03166
03468
04202
007
12000
00Lemur
02989
02708
02839
02967
02557
02724
03166
03670
040
5502087
02317
000
00Ca
ttle
03114
03045
03046
03062
01958
02051
03149
03173
04235
009
0601254
02184
000
00Hare
03146
03157
03062
03046
02832
02788
03166
03421
04023
02217
02508
02053
02532
000
00Gallus
04726
04920
04737
05008
044
50044
2304903
04743
04691
04239
04524
046
8004183
046
6000000
Rabb
it03255
03189
03222
03142
02896
02756
03084
03390
04332
02184
02603
02282
02612
008
37044
34000
00
10 Computational and Mathematical Methods in Medicine
the remaining animals is more than 04423 which is biggerthan 04023 (the shortest distance between Opossum andthe remaining mammals) Obviously the result is consistentwith the fact that gallus is not a kind of mammal
Therefore it is apparent that the results illustrated inTable 6 are wholly consistent with the results of the knownfact of evolution That is to say our ADLDs based methodcan be utilized as an effective way to analyze the similari-tiesdissimilarities of protein sequences
32 The Phylogenetic Tree of the Protein Sequences Based onthe ADLDs A phylogenetic tree is a diagram that is used torepresent the evolutionary relationships of organisms that arethought to have a common ancestry and it is a commonlyused tool for researchers in some fields to help them analyzethe clustering of different species
Obviously only through observing the similaritydissimilaritymatrix illustrated inTable 6 wewill find that it isnot very convenient to distinguish the similaritydissimilarityof protein sequences Therefore in order to show thesimilaritydissimilarity of the protein sequences more vividlyand intuitively according to the similaritydissimilaritymatrix illustrated in Table 6 then we will construct thephylogenetic tree of the above 16 ND5 proteins throughadopting the software MEGA 606 that is provided byTamura et al [41] and the result is illustrated in Figure 3
From Figure 3 it is obvious that we can not only findout the evolutionary relationships of these 16 ND5 proteinsequences visually and intuitively but also know easily thatthe constructed phylogenetic tree is consistent with theresults of the known fact of evolution to some degree
To further validate the performance of our ADLDs basedmethod we applied our method to analyze the similar-itydissimilarity of another group of proteins including 29spike proteins of coronavirus and compared ourmethodwiththe method proposed by Wen and Zhang [17] based on theabove given 16 ND5 proteins and the following 29 spikeproteins respectively The basic information of the 29 spikeproteins is illustrated in Table 7
For the 29 spike proteins illustrated in Table 7 weconstruct the phylogenetic tree in Figure 4 Since the spikeprotein sequences are very long (with more than 1100 aminoacids) therefore during simulation we set 120575 = 5 to avoid theeffect of noise points
Generally coronavirus can always be classified into fourclasses such as the Group I the Group II the Group IIIand the SARS-CoVs (Severe Acute Respiratory SyndromeCoronaviruses) And among these four classes the GroupI includes the Canine coronavirus (CCoV) the Feline coron-avirus (FCoV) the Human coronavirus 229E (HCoV-229E)the Porcine epidemic diarrhea virus (PEDV) and the Trans-missible gastroenteritis virus (TGEV) The Group II includesthe Bovine coronavirus (BCoV) Human coronavirus OC43(HCoV-OC43) the Murine coronavirus Mouse hepatitisvirus (MHV) the Porcine hemagglutinating encephalomyeli-tis virus (HEV) and the Rat coronavirus (RtCoV)TheGroupIII contains theAvian infectious bronchitis virus (IBV) and theTurkey coronavirus (TCoV)
Table 7 The basic information of 29 spike proteins
Number Access number Abbreviation Length1 CAB91145 TGEVG 14472 NP 058424 TGEV 14473 AAK38656 PEDVC 13834 NP 598310 PEDV 13835 NP 937950 HCoVOC43 13616 AAK83356 BCoVE 13637 AAL57308 BCoVL 13638 AAA66399 BCoVM 13639 AAL40400 BCoVQ 136310 AAB86819 MHVA 132411 YP 209233 MHVJHM 137612 AAF69334 MHVP 132113 AAF69344 MHVM 132414 AAP92675 IBVBJ 116915 AAS00080 IBVC 116916 NP 040831 IBV 116217 AAS10463 GD03T0013 125518 AAU93318 PC4127 125519 AAV49720 PC4137 125520 AAU93319 PC4205 125521 AAU04646 civet007 125522 AAU04649 civet010 125523 AAV91631 A022 125524 AAP51227 GD01 125525 AAS00003 GZ02 125526 AAP30030 BJ01 125527 AAP50485 FRA 125528 AAP41037 TOR2 125529 AAQ01597 TaiwanTC1 1255
From observing Figure 4 it is easy to know that the 29spike proteins of coronavirus can be perfectly classified intothe above four classes by our ADLDs based method
Finally for the convenience of comparison we illustratethe phylogenetic trees of the above given 29 spike proteins ofcoronavirus and 16ND5proteins constructed by adopting themethod proposed byWen and Zhang [17] in Figures 5 and 6respectively
Comparing Figure 3 with Figure 6 and Figure 4 withFigure 5 respectively it is obvious that the phylogenetic treesobtained by the method proposed by Wen and Zhang arequite unreasonable and not consistent with the known factsof evolution at all But on the contrary the phylogenetictrees obtained by our ADLDs based method are not onlyquite reasonable but also consistent with the known facts ofevolution to some degree Therefore there is no doubt thatthe performance of our method is much better than that ofthe method proposed by Wen and Zhang
33 The Analysis of Intuition and Visuality of the ADLDsIn Section 26 we have stated that the ADLDs of proteinsequence pairs are intuitional and visual In this section we
Computational and Mathematical Methods in Medicine 11
BCoVE
BCoVL
BCoVM
BCoVQ
HCoVOC43
MHVJHM
MHVP
MHVA
MHVM
TGEVG
TGEV
PEDVC
PEDV
TOR2
TaiwanTC1
FRA
BJ01
GD01
GZ02
civet007
A022
civet010
PC4137
GD03T0013
PC4127
PC4205
IBV
IBVBJ
IBVC
00000
00000
00000
00000
00369
00026
00026
00022
00022
00019
00559
00221
00019
00950
00950
01221
00010
00004
00015
00004
00004
00006
00004
00012
00012
00011
00006
00004
00004
04350
04350
00002
00002
00006
00005
00020
00005
00019
00018
00011
00202
00116
00112
00044
00040
04560
00232
00337
03498
03309
0027203461
00713
00230
00049
00052
Group II
Group I
Group III
SARS CoVs
Figure 4 The phylogenetic tree of the 29 spike proteins of coronavirus constructed by adopting the ADLDs based method with 120575 = 5
will further discuss the intuition and visuality of the ADLDsin detail
From Table 6 we can obtain some similar pairs suchas (fin-whale blue-whale) (pi-chim c-chim) (Human c-chim) (cheep goat) (human pi-chim) and (hare rabbit)and some dissimilar pairs such as (human opossum) and(human gallus) among the above given 16 ND5 proteinsFrom these similardissimilar pairs wewill choose three pairsincluding (human gorilla) (human opossum) and (humangallus) as examples to further show the intuition and visualityof the ADLDs of these three protein sequence pairs TheADLDs of these three similardissimilar pairs are illustratedin Figure 7 while letting 120575 = 3
Observing Figure 7 we can clearly find that the totallength of all of the SFs in each of these three ADLDs satisfiesthe total length of all of the SFs in the ADLD of Figure 7(a) gtthe total length of all of the SFs in the ADLD of Figure 7(b) gtthe total length of all of the SFs in the ADLD of Figure 7(c)Therefore we can intuitively identify that the similarity of theproteins in each of these three protein sequence pairs satisfiesthe similarity of the proteins in the pair (human gorilla) gt thesimilarity of the proteins in the pair (human opossum) gt thesimilarity of the proteins in the pair (human gallus)
Moreover from Figure 7 we can also intuitively identifythat the two protein sequences in the protein sequence pair(human gorilla) are very similar to each other since the total
12 Computational and Mathematical Methods in Medicine
PEDVC
PEDV
GD03T0013
BJ01
GD01
MHVA
PC4205
GZ02
TOR2
MHVM
PC4137
FRA
PC4127
TaiwanTC1
MHVP
A022
civet007
civet010
TGEVG
TGEV
BCoVM
BCoVQ
HCoVOC43
IBVC
MHVJHM
BCoVE
BCoVL
IBVBJ
IBV
00000
00000
00000
00000
00013
00006
00006
00001
00001
00000
00012
00002
00001
00010
00013
00010
00000
00000
00000
00000
00000
00000
00001
00001
00000
00000
00000
00001
0000000000
00000
00000
00006
00000
00000
00001
00000
00000
00001
00002
00001
00000
00002
00004
00002
00003
00003
00008
00006
00008
00067
00022
00013
00013
00008
00042
Figure 5 The phylogenetic tree of the 29 spike proteins of coronavirus constructed by adopting the method proposed by Wen and Zhang
length of all of the SFs in the ADLD of Figure 7(a) looksvery long But on the contrary we can intuitively identifythat the two protein sequences in either the protein sequencepair (human opossum) or the protein sequence pair (humangallus) are apparently dissimilar to each other since both thetotal length of all of the SFs in the ADLD of Figure 7(b) andthat in the ADLD of Figure 7(c) look very short
And through statistic we can know that the actual totallengths of all of the SFs in the ADLDs of these three proteinsequence pairs (human gorilla) (human opossum) and(human gallus) are 556 288 and 248 respectively
Additionally observing Figures 2(a) and 2(b) hardly canwe distinguish the total length of all of the SFs (includingASFs and BSFs) in the ADLD of Figure 2(a) and that inthe ADLD of Figure 2(b) since the total lengths of all ofthe SFs in these two ADLDs look nearly the same Andthrough statistic we can know that the actual total lengthsof all of the SFs in the ADLDs of Figures 2(a) and 2(b)are 123 and 120 respectively and are really close to eachother But through comparing Figure 2(a) with Figure 2(b)more carefully we can further discover that different fromFigure 2(b) except for the SFs there are also 6 different FPs
Computational and Mathematical Methods in Medicine 13
Fin-whale
Rabbit
Blue-whale
Sheep
Goat
Hare
Rat
Lemur
Mouse
Cattle
Opossum
Gorilla
Human
Pi-chim
C-chim
Gallus
00006
00037
00004
00004
00001
00002
00000
00001
00074
00002
00015
00000
00001
00011
00183
00001
00006
00006
00004
00005
00002
00005
00031
00008
00024
00020
00039
00060
00024
00086
Figure 6 The phylogenetic tree of the 16 ND5 proteins constructed by adopting the method proposed by Wen and Zhang
700
600
500
400
300
200
100
07006005004003002001000
(a)
700
600
500
400
300
200
100
07006005004003002001000
(b)
700
600
500
400
300
200
100
07006005004003002001000
(c)
Figure 7 (a) The ADLD of the similar pair (human gorilla) (b) the ADLD of the dissimilar pair (human opossum) (c) the ADLD of thedissimilar pair (human gallus)
14 Computational and Mathematical Methods in Medicine
in the ADLD of Figure 2(a) while there are no FPs in theADLD of Figure 2(b) therefore we can intuitively identifythat the two protein sequences in the protein sequence pair(chimpanzee human) are more similar to the two proteinsequences in the protein sequence pair (human gorilla)
Hence from the above descriptions we can know that theADLDs obtained by our newly proposed method are quitevisual and intuitional andmay be a powerful and effective toolfor visual comparison of protein sequences and numericalsequences in other research fields
4 Conclusions
In this paper a novel ADLDs based graphical representationof protein sequences is proposed which is utilized to analyzethe similaritydissimilarity of protein sequences To validatethe performances of the newmethod we select two groups ofwell-known protein sequences as examples and additionallyin order to observe the similaritydissimilarity of proteinsequences more intuitively we construct the phylogenetictrees of protein sequences The results show that our ADLDsbased method not only has good performances and effects inthe similaritydissimilarity analysis of protein sequences butalso does not require complex computation since there areno high dimensional matrixes required Therefore it meansthat our ADLDs based method can work well in the analysisof protein sequences
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgments
The authors thank the anonymous referees for suggestionsthat helped in improving the paper substantially And theproject is partly sponsored by the Colleges and UniversitiesOpen Innovation Platform Fund of Hunan Province (no13K041) the Hunan Provincial Natural Science Foundationof China (no 14JJ2070) the Construct Program of the KeyDiscipline in Hunan province the State Education MinistryScientific Research Foundation for the Returned OverseasChinese Scholars and the Introduced Talent Start-up FundProject of Xiangtan University (no 11QDZ45)
References
[1] K Bucka-Lassen O Caprani and J Hein ldquoCombining manymultiple alignments in one improved alignmentrdquo Bioinformat-ics vol 15 no 2 pp 122ndash130 1999
[2] L Wang and T Jiang ldquoOn the complexity of multiple sequencealignmentrdquo Journal of Computational Biology vol 1 no 4 pp337ndash348 1994
[3] C Shyu L Sheneman and J A Foster ldquoMultiple sequencealignment with evolutionary computationrdquo Genetic Program-ming and Evolvable Machines vol 5 no 2 pp 121ndash144 2004
[4] M Randic ldquoGraphical representations of DNA as 2-D maprdquoChemical Physics Letters vol 386 no 4ndash6 pp 468ndash471 2004
[5] D Bielinska-Waz T Clark P Waz W Nowak and A Nandyldquo2D-dynamic representation of DNA sequencesrdquo ChemicalPhysics Letters vol 442 no 1ndash3 pp 140ndash144 2007
[6] Z Qi and X Qi ldquoNovel 2D graphical representation of DNAsequence based on dual nucleotidesrdquo Chemical Physics Lettersvol 440 no 1ndash3 pp 139ndash144 2007
[7] W Chen B Liao Y Liu and Z Su ldquoA numerical representationof DNA sequences and its applicationsrdquoMATCH Communica-tions in Mathematical and in Computer Chemistry vol 60 no2 pp 291ndash300 2008
[8] M Randic X Guo and S C Basak ldquoOn the characterization ofDNAprimary sequences by triplet of nucleic acid basesrdquo Journalof Chemical Information and Computer Sciences vol 41 no 3pp 619ndash626 2001
[9] M Randic M Vracko M Novic and D Plavsic ldquoSpectralrepresentation of reduced protein modelsrdquo SAR and QSAR inEnvironmental Research vol 20 no 5-6 pp 415ndash427 2009
[10] M Randic K Mehulic D Vukicevic T Pisanski D Vikic-Topic and D Plavsic ldquoGraphical representation of proteins asfour-color maps and the irnumerical characterizationrdquo Journalof Molecular Graphics and Modelling vol 27 no 5 pp 637ndash6412009
[11] F Bai and TWang ldquoOn graphical and numerical representationof protein sequencesrdquo Journal of Biomolecular Structure andDynamics vol 23 no 5 pp 537ndash545 2006
[12] M Randic ldquo2-D Graphical representation of proteins based onphysico-chemical properties of amino acidsrdquo Chemical PhysicsLetters vol 440 no 4ndash6 pp 291ndash295 2007
[13] A Ghosh and A Nandy ldquoGraphical representation and mathe-matical characterization of protein sequences and applicationsto viral proteinsrdquo Advances in Protein Chemistry and StructuralBiology vol 83 pp 1ndash42 2011
[14] C Li X Yu L Yang X Zheng and Z Wang ldquo3-D maps andcoupling numbers for protein sequencesrdquo Physica A StatisticalMechanics and its Applications vol 388 no 9 pp 1967ndash19722009
[15] M Randic J Zupan and D Vikic-Topic ldquoOn representation ofproteins by star-like graphsrdquo Journal of Molecular Graphics andModelling vol 26 no 1 pp 290ndash305 2007
[16] C Li L Xing and X Wang ldquo2-D graphical representation ofprotein sequences and its application to coronavirus phylogenyrdquoJournal of Biochemistry and Molecular Biology vol 41 no 3 pp217ndash222 2008
[17] J Wen and Y Zhang ldquoA 2D graphical representation of proteinsequence and its numerical characterizationrdquo Chemical PhysicsLetters vol 476 no 4ndash6 pp 281ndash286 2009
[18] Z-C Wu X Xiao and K-C Chou ldquo2D-MH a web-server forgenerating graphic representation of protein sequences basedon the physicochemical properties of their constituent aminoacidsrdquo Journal of Theoretical Biology vol 267 no 1 pp 29ndash342010
[19] B Liao X Sun and Q Zeng ldquoA novel method for similarityanalysis and protein sub-cellular localization predictionrdquo Bioin-formatics vol 26 no 21 pp 2678ndash2683 2010
[20] M Novic and M Randic ldquoRepresentation of proteins as walksin 20-D spacerdquo SAR and QSAR in Environmental Research vol19 no 3-4 pp 317ndash337 2008
[21] Z-H Qi J Feng X-Q Qi and L Li ldquoApplication of 2Dgraphic representation of protein sequence based on Huffmantree methodrdquo Computers in Biology and Medicine vol 42 no 5pp 556ndash563 2012
Computational and Mathematical Methods in Medicine 15
[22] H-J Yu and D-S Huang ldquoNovel 20-D descriptors of proteinsequences and itrsquos applications in similarity analysisrdquo ChemicalPhysics Letters vol 531 pp 261ndash266 2012
[23] P-A He J Wei Y Yao and Z Tie ldquoA novel graphical repre-sentation of proteins and its applicationrdquo Physica A StatisticalMechanics and its Applications vol 391 no 1-2 pp 93ndash99 2012
[24] M Randic M Novic andM Vracko ldquoOn novel representationof proteins based on amino acid adjacency matrixrdquo SAR andQSAR in Environmental Research vol 19 no 3-4 pp 339ndash3492008
[25] Z-P Feng andC-T Zhang ldquoA graphic representation of proteinsequence andpredicting the subcellular locations of prokaryoticproteinsrdquo International Journal of Biochemistry and Cell Biologyvol 34 no 3 pp 298ndash307 2002
[26] M Randic J Zupan and A T Balaban ldquoUnique graphicalrepresentation of protein sequences based on nucleotide tripletcodonsrdquo Chemical Physics Letters vol 397 no 1ndash3 pp 247ndash2522004
[27] F Bai and T Wang ldquoA 2-D graphical representation of proteinsequences based on nucleotide triplet codonsrdquoChemical PhysicsLetters vol 413 no 4ndash6 pp 458ndash462 2005
[28] Y-h Yao F Kong Q Dai and P-a He ldquoA sequence-segmentedmethod applied to the similarity analysis of long proteinsequencerdquo MATCH Communications in Mathematical and inComputer Chemistry vol 70 no 1 pp 431ndash450 2013
[29] M I Abo el Maaty M M Abo-Elkhier and M A AbdElwahaab ldquo3D graphical representation of protein sequencesand their statistical characterizationrdquo Physica A StatisticalMechanics and Its Applications vol 389 no 21 pp 4668ndash46762010
[30] M M Abo-Elkhier ldquoSimilaritydissimilarity analysis of proteinsequences using the spatial median as a descriptorrdquo Journal ofBiophysical Chemistry vol 3 pp 142ndash148 2012
[31] A El-Lakkani and S El-Sherif ldquoSimilarity analysis of proteinsequences based on 2D and 3D amino acid adjacencymatricesrdquoChemical Physics Letters vol 590 pp 192ndash195 2013
[32] T Ma Y Liu Q Dai Y Yao and P-A He ldquoA graphicalrepresentation of protein based on a novel iterated functionsystemrdquo Physica A Statistical Mechanics and its Applicationsvol 403 pp 21ndash28 2014
[33] D M Mount Bioinformatics Sequence and Genome AnalysisCold Spring Harbor Laboratory Press Cold Spring Harbor NYUSA 2nd edition 2004
[34] D G Higgins and P M Sharp ldquoCLUSTAL a package forperforming multiple sequence alignment on a microcomputerrdquoGene vol 73 no 1 pp 237ndash244 1988
[35] R C Edgar ldquoMUSCLE multiple sequence alignment with highaccuracy and high throughputrdquo Nucleic Acids Research vol 32no 5 pp 1792ndash1797 2004
[36] K Katoh K Misawa K-I Kuma and T Miyata ldquoMAFFT anovel method for rapid multiple sequence alignment based onfast Fourier transformrdquo Nucleic Acids Research vol 30 no 14pp 3059ndash3066 2002
[37] C Notredame D G Higgins and J Heringa ldquoT-coffee a novelmethod for fast and accurate multiple sequence alignmentrdquoJournal of Molecular Biology vol 302 no 1 pp 205ndash217 2000
[38] MA BalseraWWriggers YOono andK Schulten ldquoPrincipalcomponent analysis and long time protein dynamicsrdquo Journal ofPhysical Chemistry vol 100 no 7 pp 2567ndash2572 1996
[39] B Hess ldquoSimilarities between principal components of proteindynamics and random diffusionrdquo Physical Review EmdashStatistical
Physics Plasmas Fluids and Related Interdisciplinary Topicsvol 62 no 6B pp 8438ndash8448 2000
[40] A L Tournier and J C Smith ldquoPrincipal components of theprotein dynamical transitionrdquo Physical Review Letters vol 91no 20 Article ID 208106 2003
[41] K Tamura G Stecher D Peterson A Filipski and S KumarldquoMEGA6molecular evolutionary genetics analysis version 60rdquoMolecular Biology and Evolution vol 30 no 12 pp 2725ndash27292013
Computational and Mathematical Methods in Medicine 7
150
100
50
0150100500
(a)
150
100
50
0150100500
(b)
Figure 1 (a)TheASD of the 120573-globin protein sequence pair (chimpanzee human) with 120585 = 12 (b) the ASD of the 120573-globin protein sequencepair (human gorilla) with 120585 = 16
150
100
50
0150100500
(a)
150
100
50
0
150100500
(b)
Figure 2 (a) The ADLD of the protein sequence pair (chimpanzee human) (b) the ADLD of the protein sequence pair (human gorilla)
And therefore according to these 119873 protein sequencesΨ1 Ψ2 Ψ
119873 we can obtain an 119873 times 119873 matching matrix
(MM) as follows
MM = [119889119894119895]119873times119873
(18)
where
119889119894119895
=
SD (Ψ119894 Ψ119895) if 119894 ge 119895
0 else
for 119894 isin 1 2 119873 119895 isin 1 2 119873
(19)
Step 4 Based on the matching matrixMM and all of its com-ponents 119889
119894119895 where 119894 isin 1 2 119873 and 119895 isin 1 2 119873 then
we can obtain an 119873 times 119873 similaritydissimilarity matrix (SM)of these 119873 protein sequences Ψ
1 Ψ2 Ψ
119873 as follows
SM = [119904119894119895]119873times119873
(20)
where
119904119894119895
=
1 minus Λ(
119889119894119895
119889119894119894
) if 119894 ge 119895
0 else
Λ(
119889119894119895
119889119894119894
) =
1 if119889119894119895
119889119894119894
ge 1
119889119894119895
119889119894119894
else
for 119894 isin 1 2 119873 119895 isin 1 2 119873
(21)
According to the above steps we present an examplethrough implementing the ADLDs to analyze the similar-itydissimilarity of 16 ND5 proteins (illustrated in Table 5)while letting 120575 = 3 and illustrate the results of similar-itydissimilarity matrix in Table 6
Observing Table 6 it is easy to find that there are somesimilar pairs such as (c-chim pi-chim) with the distance00510 (human c-chim) with the distance 00814 (human
8 Computational and Mathematical Methods in Medicine
Table 5 The basic information of 16 ND5 protein sequences
Number Name Abbreviation Access number Length1 Human Human ADT804301 6032 Gorilla Gorilla NP 008222 6033 Pigmy chimpanzee Pi-chim NP 008209 6034 Common chimpanzee C-chim NP 008196 6035 Fin-whale Fin-whale NP 006899 6066 Blue-whale Blue-whale NP 007066 6067 Rat Rat AP 0049021 6108 Mouse Mouse NP 904338 6079 Opossum Opossum NP 007105 60210 Sheep Sheep ABW229031 60611 Goat Goat BAN592581 60612 Lemur Lemur CAD134311 60313 Cattle Cattle ADN119021 60614 Hare Hare CAD132911 60315 Gallus Gallus BAE160361 60516 Rabbit Rabbit NP 0075591 603
Sheep
Goat
Cattle
Fin-whale
Blue-whale
Lemur
Hare
Rabbit
Gorilla
Human
Pi-chim
C-chim
Rat
Mouse
Opossum
Gallus
00383
00468
00255
00255
00162
00162
00942
00942
02169
00356
00356
01084
00540
00419
02311
00419
00855
00128
00184
00085
00665
01080
00477
00770
00243
00176
00288
00163
00458
00142
000005010015020
Figure 3 The phylogenetic tree of the 16 species based on the ADLDs based method
pi-chim) with the distance 00720 (gorilla c-chim) with thedistance 00865 (gorilla pi-chim) with the distance 00833and (fin-whale blue-whale) with the distance 00324 Andamong them the opossum seems to be a peculiar mammalsince the shortest distance between it and the remaining
mammals is more than 04023 Obviously the result isconsistent with the fact that opossum is the most remotespecies from the remaining mammals
Additionally gallus seems to be more peculiar thanopossum since the shortest distance between it and
Computational and Mathematical Methods in Medicine 9
Table6Th
esim
ilaritydissim
ilaritymatrix
forthe
16ND5proteins
basedon
theA
DLD
sbased
metho
d
Hum
anGorilla
Pi-chim
C-chim
Fin-whale
Blue-w
hale
rat
mou
seop
ossum
sheep
goat
lemur
cattle
hare
gallu
srabbit
Hum
an000
00Gorilla
01111
00000
Pi-chim
007
20008
33000
00C-
chim
008
14008
65005
10000
00Fin-whale
03396
03285
03222
03301
00000
Blue-w
hale
03474
03333
03285
03301
00324
000
00Ra
t03693
03622
03636
03716
03333
03381
000
00Mou
se03740
03686
03716
03748
03317
03333
01883
000
00Opo
ssum
044
7604551
04290
044
1804515
04519
04513
044
79000
00Sheep
03020
02933
02871
02951
02023
02067
03149
03219
04121
00000
Goat
03036
02901
02871
02919
01958
02147
03166
03468
04202
007
12000
00Lemur
02989
02708
02839
02967
02557
02724
03166
03670
040
5502087
02317
000
00Ca
ttle
03114
03045
03046
03062
01958
02051
03149
03173
04235
009
0601254
02184
000
00Hare
03146
03157
03062
03046
02832
02788
03166
03421
04023
02217
02508
02053
02532
000
00Gallus
04726
04920
04737
05008
044
50044
2304903
04743
04691
04239
04524
046
8004183
046
6000000
Rabb
it03255
03189
03222
03142
02896
02756
03084
03390
04332
02184
02603
02282
02612
008
37044
34000
00
10 Computational and Mathematical Methods in Medicine
the remaining animals is more than 04423 which is biggerthan 04023 (the shortest distance between Opossum andthe remaining mammals) Obviously the result is consistentwith the fact that gallus is not a kind of mammal
Therefore it is apparent that the results illustrated inTable 6 are wholly consistent with the results of the knownfact of evolution That is to say our ADLDs based methodcan be utilized as an effective way to analyze the similari-tiesdissimilarities of protein sequences
32 The Phylogenetic Tree of the Protein Sequences Based onthe ADLDs A phylogenetic tree is a diagram that is used torepresent the evolutionary relationships of organisms that arethought to have a common ancestry and it is a commonlyused tool for researchers in some fields to help them analyzethe clustering of different species
Obviously only through observing the similaritydissimilaritymatrix illustrated inTable 6 wewill find that it isnot very convenient to distinguish the similaritydissimilarityof protein sequences Therefore in order to show thesimilaritydissimilarity of the protein sequences more vividlyand intuitively according to the similaritydissimilaritymatrix illustrated in Table 6 then we will construct thephylogenetic tree of the above 16 ND5 proteins throughadopting the software MEGA 606 that is provided byTamura et al [41] and the result is illustrated in Figure 3
From Figure 3 it is obvious that we can not only findout the evolutionary relationships of these 16 ND5 proteinsequences visually and intuitively but also know easily thatthe constructed phylogenetic tree is consistent with theresults of the known fact of evolution to some degree
To further validate the performance of our ADLDs basedmethod we applied our method to analyze the similar-itydissimilarity of another group of proteins including 29spike proteins of coronavirus and compared ourmethodwiththe method proposed by Wen and Zhang [17] based on theabove given 16 ND5 proteins and the following 29 spikeproteins respectively The basic information of the 29 spikeproteins is illustrated in Table 7
For the 29 spike proteins illustrated in Table 7 weconstruct the phylogenetic tree in Figure 4 Since the spikeprotein sequences are very long (with more than 1100 aminoacids) therefore during simulation we set 120575 = 5 to avoid theeffect of noise points
Generally coronavirus can always be classified into fourclasses such as the Group I the Group II the Group IIIand the SARS-CoVs (Severe Acute Respiratory SyndromeCoronaviruses) And among these four classes the GroupI includes the Canine coronavirus (CCoV) the Feline coron-avirus (FCoV) the Human coronavirus 229E (HCoV-229E)the Porcine epidemic diarrhea virus (PEDV) and the Trans-missible gastroenteritis virus (TGEV) The Group II includesthe Bovine coronavirus (BCoV) Human coronavirus OC43(HCoV-OC43) the Murine coronavirus Mouse hepatitisvirus (MHV) the Porcine hemagglutinating encephalomyeli-tis virus (HEV) and the Rat coronavirus (RtCoV)TheGroupIII contains theAvian infectious bronchitis virus (IBV) and theTurkey coronavirus (TCoV)
Table 7 The basic information of 29 spike proteins
Number Access number Abbreviation Length1 CAB91145 TGEVG 14472 NP 058424 TGEV 14473 AAK38656 PEDVC 13834 NP 598310 PEDV 13835 NP 937950 HCoVOC43 13616 AAK83356 BCoVE 13637 AAL57308 BCoVL 13638 AAA66399 BCoVM 13639 AAL40400 BCoVQ 136310 AAB86819 MHVA 132411 YP 209233 MHVJHM 137612 AAF69334 MHVP 132113 AAF69344 MHVM 132414 AAP92675 IBVBJ 116915 AAS00080 IBVC 116916 NP 040831 IBV 116217 AAS10463 GD03T0013 125518 AAU93318 PC4127 125519 AAV49720 PC4137 125520 AAU93319 PC4205 125521 AAU04646 civet007 125522 AAU04649 civet010 125523 AAV91631 A022 125524 AAP51227 GD01 125525 AAS00003 GZ02 125526 AAP30030 BJ01 125527 AAP50485 FRA 125528 AAP41037 TOR2 125529 AAQ01597 TaiwanTC1 1255
From observing Figure 4 it is easy to know that the 29spike proteins of coronavirus can be perfectly classified intothe above four classes by our ADLDs based method
Finally for the convenience of comparison we illustratethe phylogenetic trees of the above given 29 spike proteins ofcoronavirus and 16ND5proteins constructed by adopting themethod proposed byWen and Zhang [17] in Figures 5 and 6respectively
Comparing Figure 3 with Figure 6 and Figure 4 withFigure 5 respectively it is obvious that the phylogenetic treesobtained by the method proposed by Wen and Zhang arequite unreasonable and not consistent with the known factsof evolution at all But on the contrary the phylogenetictrees obtained by our ADLDs based method are not onlyquite reasonable but also consistent with the known facts ofevolution to some degree Therefore there is no doubt thatthe performance of our method is much better than that ofthe method proposed by Wen and Zhang
33 The Analysis of Intuition and Visuality of the ADLDsIn Section 26 we have stated that the ADLDs of proteinsequence pairs are intuitional and visual In this section we
Computational and Mathematical Methods in Medicine 11
BCoVE
BCoVL
BCoVM
BCoVQ
HCoVOC43
MHVJHM
MHVP
MHVA
MHVM
TGEVG
TGEV
PEDVC
PEDV
TOR2
TaiwanTC1
FRA
BJ01
GD01
GZ02
civet007
A022
civet010
PC4137
GD03T0013
PC4127
PC4205
IBV
IBVBJ
IBVC
00000
00000
00000
00000
00369
00026
00026
00022
00022
00019
00559
00221
00019
00950
00950
01221
00010
00004
00015
00004
00004
00006
00004
00012
00012
00011
00006
00004
00004
04350
04350
00002
00002
00006
00005
00020
00005
00019
00018
00011
00202
00116
00112
00044
00040
04560
00232
00337
03498
03309
0027203461
00713
00230
00049
00052
Group II
Group I
Group III
SARS CoVs
Figure 4 The phylogenetic tree of the 29 spike proteins of coronavirus constructed by adopting the ADLDs based method with 120575 = 5
will further discuss the intuition and visuality of the ADLDsin detail
From Table 6 we can obtain some similar pairs suchas (fin-whale blue-whale) (pi-chim c-chim) (Human c-chim) (cheep goat) (human pi-chim) and (hare rabbit)and some dissimilar pairs such as (human opossum) and(human gallus) among the above given 16 ND5 proteinsFrom these similardissimilar pairs wewill choose three pairsincluding (human gorilla) (human opossum) and (humangallus) as examples to further show the intuition and visualityof the ADLDs of these three protein sequence pairs TheADLDs of these three similardissimilar pairs are illustratedin Figure 7 while letting 120575 = 3
Observing Figure 7 we can clearly find that the totallength of all of the SFs in each of these three ADLDs satisfiesthe total length of all of the SFs in the ADLD of Figure 7(a) gtthe total length of all of the SFs in the ADLD of Figure 7(b) gtthe total length of all of the SFs in the ADLD of Figure 7(c)Therefore we can intuitively identify that the similarity of theproteins in each of these three protein sequence pairs satisfiesthe similarity of the proteins in the pair (human gorilla) gt thesimilarity of the proteins in the pair (human opossum) gt thesimilarity of the proteins in the pair (human gallus)
Moreover from Figure 7 we can also intuitively identifythat the two protein sequences in the protein sequence pair(human gorilla) are very similar to each other since the total
12 Computational and Mathematical Methods in Medicine
PEDVC
PEDV
GD03T0013
BJ01
GD01
MHVA
PC4205
GZ02
TOR2
MHVM
PC4137
FRA
PC4127
TaiwanTC1
MHVP
A022
civet007
civet010
TGEVG
TGEV
BCoVM
BCoVQ
HCoVOC43
IBVC
MHVJHM
BCoVE
BCoVL
IBVBJ
IBV
00000
00000
00000
00000
00013
00006
00006
00001
00001
00000
00012
00002
00001
00010
00013
00010
00000
00000
00000
00000
00000
00000
00001
00001
00000
00000
00000
00001
0000000000
00000
00000
00006
00000
00000
00001
00000
00000
00001
00002
00001
00000
00002
00004
00002
00003
00003
00008
00006
00008
00067
00022
00013
00013
00008
00042
Figure 5 The phylogenetic tree of the 29 spike proteins of coronavirus constructed by adopting the method proposed by Wen and Zhang
length of all of the SFs in the ADLD of Figure 7(a) looksvery long But on the contrary we can intuitively identifythat the two protein sequences in either the protein sequencepair (human opossum) or the protein sequence pair (humangallus) are apparently dissimilar to each other since both thetotal length of all of the SFs in the ADLD of Figure 7(b) andthat in the ADLD of Figure 7(c) look very short
And through statistic we can know that the actual totallengths of all of the SFs in the ADLDs of these three proteinsequence pairs (human gorilla) (human opossum) and(human gallus) are 556 288 and 248 respectively
Additionally observing Figures 2(a) and 2(b) hardly canwe distinguish the total length of all of the SFs (includingASFs and BSFs) in the ADLD of Figure 2(a) and that inthe ADLD of Figure 2(b) since the total lengths of all ofthe SFs in these two ADLDs look nearly the same Andthrough statistic we can know that the actual total lengthsof all of the SFs in the ADLDs of Figures 2(a) and 2(b)are 123 and 120 respectively and are really close to eachother But through comparing Figure 2(a) with Figure 2(b)more carefully we can further discover that different fromFigure 2(b) except for the SFs there are also 6 different FPs
Computational and Mathematical Methods in Medicine 13
Fin-whale
Rabbit
Blue-whale
Sheep
Goat
Hare
Rat
Lemur
Mouse
Cattle
Opossum
Gorilla
Human
Pi-chim
C-chim
Gallus
00006
00037
00004
00004
00001
00002
00000
00001
00074
00002
00015
00000
00001
00011
00183
00001
00006
00006
00004
00005
00002
00005
00031
00008
00024
00020
00039
00060
00024
00086
Figure 6 The phylogenetic tree of the 16 ND5 proteins constructed by adopting the method proposed by Wen and Zhang
700
600
500
400
300
200
100
07006005004003002001000
(a)
700
600
500
400
300
200
100
07006005004003002001000
(b)
700
600
500
400
300
200
100
07006005004003002001000
(c)
Figure 7 (a) The ADLD of the similar pair (human gorilla) (b) the ADLD of the dissimilar pair (human opossum) (c) the ADLD of thedissimilar pair (human gallus)
14 Computational and Mathematical Methods in Medicine
in the ADLD of Figure 2(a) while there are no FPs in theADLD of Figure 2(b) therefore we can intuitively identifythat the two protein sequences in the protein sequence pair(chimpanzee human) are more similar to the two proteinsequences in the protein sequence pair (human gorilla)
Hence from the above descriptions we can know that theADLDs obtained by our newly proposed method are quitevisual and intuitional andmay be a powerful and effective toolfor visual comparison of protein sequences and numericalsequences in other research fields
4 Conclusions
In this paper a novel ADLDs based graphical representationof protein sequences is proposed which is utilized to analyzethe similaritydissimilarity of protein sequences To validatethe performances of the newmethod we select two groups ofwell-known protein sequences as examples and additionallyin order to observe the similaritydissimilarity of proteinsequences more intuitively we construct the phylogenetictrees of protein sequences The results show that our ADLDsbased method not only has good performances and effects inthe similaritydissimilarity analysis of protein sequences butalso does not require complex computation since there areno high dimensional matrixes required Therefore it meansthat our ADLDs based method can work well in the analysisof protein sequences
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgments
The authors thank the anonymous referees for suggestionsthat helped in improving the paper substantially And theproject is partly sponsored by the Colleges and UniversitiesOpen Innovation Platform Fund of Hunan Province (no13K041) the Hunan Provincial Natural Science Foundationof China (no 14JJ2070) the Construct Program of the KeyDiscipline in Hunan province the State Education MinistryScientific Research Foundation for the Returned OverseasChinese Scholars and the Introduced Talent Start-up FundProject of Xiangtan University (no 11QDZ45)
References
[1] K Bucka-Lassen O Caprani and J Hein ldquoCombining manymultiple alignments in one improved alignmentrdquo Bioinformat-ics vol 15 no 2 pp 122ndash130 1999
[2] L Wang and T Jiang ldquoOn the complexity of multiple sequencealignmentrdquo Journal of Computational Biology vol 1 no 4 pp337ndash348 1994
[3] C Shyu L Sheneman and J A Foster ldquoMultiple sequencealignment with evolutionary computationrdquo Genetic Program-ming and Evolvable Machines vol 5 no 2 pp 121ndash144 2004
[4] M Randic ldquoGraphical representations of DNA as 2-D maprdquoChemical Physics Letters vol 386 no 4ndash6 pp 468ndash471 2004
[5] D Bielinska-Waz T Clark P Waz W Nowak and A Nandyldquo2D-dynamic representation of DNA sequencesrdquo ChemicalPhysics Letters vol 442 no 1ndash3 pp 140ndash144 2007
[6] Z Qi and X Qi ldquoNovel 2D graphical representation of DNAsequence based on dual nucleotidesrdquo Chemical Physics Lettersvol 440 no 1ndash3 pp 139ndash144 2007
[7] W Chen B Liao Y Liu and Z Su ldquoA numerical representationof DNA sequences and its applicationsrdquoMATCH Communica-tions in Mathematical and in Computer Chemistry vol 60 no2 pp 291ndash300 2008
[8] M Randic X Guo and S C Basak ldquoOn the characterization ofDNAprimary sequences by triplet of nucleic acid basesrdquo Journalof Chemical Information and Computer Sciences vol 41 no 3pp 619ndash626 2001
[9] M Randic M Vracko M Novic and D Plavsic ldquoSpectralrepresentation of reduced protein modelsrdquo SAR and QSAR inEnvironmental Research vol 20 no 5-6 pp 415ndash427 2009
[10] M Randic K Mehulic D Vukicevic T Pisanski D Vikic-Topic and D Plavsic ldquoGraphical representation of proteins asfour-color maps and the irnumerical characterizationrdquo Journalof Molecular Graphics and Modelling vol 27 no 5 pp 637ndash6412009
[11] F Bai and TWang ldquoOn graphical and numerical representationof protein sequencesrdquo Journal of Biomolecular Structure andDynamics vol 23 no 5 pp 537ndash545 2006
[12] M Randic ldquo2-D Graphical representation of proteins based onphysico-chemical properties of amino acidsrdquo Chemical PhysicsLetters vol 440 no 4ndash6 pp 291ndash295 2007
[13] A Ghosh and A Nandy ldquoGraphical representation and mathe-matical characterization of protein sequences and applicationsto viral proteinsrdquo Advances in Protein Chemistry and StructuralBiology vol 83 pp 1ndash42 2011
[14] C Li X Yu L Yang X Zheng and Z Wang ldquo3-D maps andcoupling numbers for protein sequencesrdquo Physica A StatisticalMechanics and its Applications vol 388 no 9 pp 1967ndash19722009
[15] M Randic J Zupan and D Vikic-Topic ldquoOn representation ofproteins by star-like graphsrdquo Journal of Molecular Graphics andModelling vol 26 no 1 pp 290ndash305 2007
[16] C Li L Xing and X Wang ldquo2-D graphical representation ofprotein sequences and its application to coronavirus phylogenyrdquoJournal of Biochemistry and Molecular Biology vol 41 no 3 pp217ndash222 2008
[17] J Wen and Y Zhang ldquoA 2D graphical representation of proteinsequence and its numerical characterizationrdquo Chemical PhysicsLetters vol 476 no 4ndash6 pp 281ndash286 2009
[18] Z-C Wu X Xiao and K-C Chou ldquo2D-MH a web-server forgenerating graphic representation of protein sequences basedon the physicochemical properties of their constituent aminoacidsrdquo Journal of Theoretical Biology vol 267 no 1 pp 29ndash342010
[19] B Liao X Sun and Q Zeng ldquoA novel method for similarityanalysis and protein sub-cellular localization predictionrdquo Bioin-formatics vol 26 no 21 pp 2678ndash2683 2010
[20] M Novic and M Randic ldquoRepresentation of proteins as walksin 20-D spacerdquo SAR and QSAR in Environmental Research vol19 no 3-4 pp 317ndash337 2008
[21] Z-H Qi J Feng X-Q Qi and L Li ldquoApplication of 2Dgraphic representation of protein sequence based on Huffmantree methodrdquo Computers in Biology and Medicine vol 42 no 5pp 556ndash563 2012
Computational and Mathematical Methods in Medicine 15
[22] H-J Yu and D-S Huang ldquoNovel 20-D descriptors of proteinsequences and itrsquos applications in similarity analysisrdquo ChemicalPhysics Letters vol 531 pp 261ndash266 2012
[23] P-A He J Wei Y Yao and Z Tie ldquoA novel graphical repre-sentation of proteins and its applicationrdquo Physica A StatisticalMechanics and its Applications vol 391 no 1-2 pp 93ndash99 2012
[24] M Randic M Novic andM Vracko ldquoOn novel representationof proteins based on amino acid adjacency matrixrdquo SAR andQSAR in Environmental Research vol 19 no 3-4 pp 339ndash3492008
[25] Z-P Feng andC-T Zhang ldquoA graphic representation of proteinsequence andpredicting the subcellular locations of prokaryoticproteinsrdquo International Journal of Biochemistry and Cell Biologyvol 34 no 3 pp 298ndash307 2002
[26] M Randic J Zupan and A T Balaban ldquoUnique graphicalrepresentation of protein sequences based on nucleotide tripletcodonsrdquo Chemical Physics Letters vol 397 no 1ndash3 pp 247ndash2522004
[27] F Bai and T Wang ldquoA 2-D graphical representation of proteinsequences based on nucleotide triplet codonsrdquoChemical PhysicsLetters vol 413 no 4ndash6 pp 458ndash462 2005
[28] Y-h Yao F Kong Q Dai and P-a He ldquoA sequence-segmentedmethod applied to the similarity analysis of long proteinsequencerdquo MATCH Communications in Mathematical and inComputer Chemistry vol 70 no 1 pp 431ndash450 2013
[29] M I Abo el Maaty M M Abo-Elkhier and M A AbdElwahaab ldquo3D graphical representation of protein sequencesand their statistical characterizationrdquo Physica A StatisticalMechanics and Its Applications vol 389 no 21 pp 4668ndash46762010
[30] M M Abo-Elkhier ldquoSimilaritydissimilarity analysis of proteinsequences using the spatial median as a descriptorrdquo Journal ofBiophysical Chemistry vol 3 pp 142ndash148 2012
[31] A El-Lakkani and S El-Sherif ldquoSimilarity analysis of proteinsequences based on 2D and 3D amino acid adjacencymatricesrdquoChemical Physics Letters vol 590 pp 192ndash195 2013
[32] T Ma Y Liu Q Dai Y Yao and P-A He ldquoA graphicalrepresentation of protein based on a novel iterated functionsystemrdquo Physica A Statistical Mechanics and its Applicationsvol 403 pp 21ndash28 2014
[33] D M Mount Bioinformatics Sequence and Genome AnalysisCold Spring Harbor Laboratory Press Cold Spring Harbor NYUSA 2nd edition 2004
[34] D G Higgins and P M Sharp ldquoCLUSTAL a package forperforming multiple sequence alignment on a microcomputerrdquoGene vol 73 no 1 pp 237ndash244 1988
[35] R C Edgar ldquoMUSCLE multiple sequence alignment with highaccuracy and high throughputrdquo Nucleic Acids Research vol 32no 5 pp 1792ndash1797 2004
[36] K Katoh K Misawa K-I Kuma and T Miyata ldquoMAFFT anovel method for rapid multiple sequence alignment based onfast Fourier transformrdquo Nucleic Acids Research vol 30 no 14pp 3059ndash3066 2002
[37] C Notredame D G Higgins and J Heringa ldquoT-coffee a novelmethod for fast and accurate multiple sequence alignmentrdquoJournal of Molecular Biology vol 302 no 1 pp 205ndash217 2000
[38] MA BalseraWWriggers YOono andK Schulten ldquoPrincipalcomponent analysis and long time protein dynamicsrdquo Journal ofPhysical Chemistry vol 100 no 7 pp 2567ndash2572 1996
[39] B Hess ldquoSimilarities between principal components of proteindynamics and random diffusionrdquo Physical Review EmdashStatistical
Physics Plasmas Fluids and Related Interdisciplinary Topicsvol 62 no 6B pp 8438ndash8448 2000
[40] A L Tournier and J C Smith ldquoPrincipal components of theprotein dynamical transitionrdquo Physical Review Letters vol 91no 20 Article ID 208106 2003
[41] K Tamura G Stecher D Peterson A Filipski and S KumarldquoMEGA6molecular evolutionary genetics analysis version 60rdquoMolecular Biology and Evolution vol 30 no 12 pp 2725ndash27292013
8 Computational and Mathematical Methods in Medicine
Table 5 The basic information of 16 ND5 protein sequences
Number Name Abbreviation Access number Length1 Human Human ADT804301 6032 Gorilla Gorilla NP 008222 6033 Pigmy chimpanzee Pi-chim NP 008209 6034 Common chimpanzee C-chim NP 008196 6035 Fin-whale Fin-whale NP 006899 6066 Blue-whale Blue-whale NP 007066 6067 Rat Rat AP 0049021 6108 Mouse Mouse NP 904338 6079 Opossum Opossum NP 007105 60210 Sheep Sheep ABW229031 60611 Goat Goat BAN592581 60612 Lemur Lemur CAD134311 60313 Cattle Cattle ADN119021 60614 Hare Hare CAD132911 60315 Gallus Gallus BAE160361 60516 Rabbit Rabbit NP 0075591 603
Sheep
Goat
Cattle
Fin-whale
Blue-whale
Lemur
Hare
Rabbit
Gorilla
Human
Pi-chim
C-chim
Rat
Mouse
Opossum
Gallus
00383
00468
00255
00255
00162
00162
00942
00942
02169
00356
00356
01084
00540
00419
02311
00419
00855
00128
00184
00085
00665
01080
00477
00770
00243
00176
00288
00163
00458
00142
000005010015020
Figure 3 The phylogenetic tree of the 16 species based on the ADLDs based method
pi-chim) with the distance 00720 (gorilla c-chim) with thedistance 00865 (gorilla pi-chim) with the distance 00833and (fin-whale blue-whale) with the distance 00324 Andamong them the opossum seems to be a peculiar mammalsince the shortest distance between it and the remaining
mammals is more than 04023 Obviously the result isconsistent with the fact that opossum is the most remotespecies from the remaining mammals
Additionally gallus seems to be more peculiar thanopossum since the shortest distance between it and
Computational and Mathematical Methods in Medicine 9
Table6Th
esim
ilaritydissim
ilaritymatrix
forthe
16ND5proteins
basedon
theA
DLD
sbased
metho
d
Hum
anGorilla
Pi-chim
C-chim
Fin-whale
Blue-w
hale
rat
mou
seop
ossum
sheep
goat
lemur
cattle
hare
gallu
srabbit
Hum
an000
00Gorilla
01111
00000
Pi-chim
007
20008
33000
00C-
chim
008
14008
65005
10000
00Fin-whale
03396
03285
03222
03301
00000
Blue-w
hale
03474
03333
03285
03301
00324
000
00Ra
t03693
03622
03636
03716
03333
03381
000
00Mou
se03740
03686
03716
03748
03317
03333
01883
000
00Opo
ssum
044
7604551
04290
044
1804515
04519
04513
044
79000
00Sheep
03020
02933
02871
02951
02023
02067
03149
03219
04121
00000
Goat
03036
02901
02871
02919
01958
02147
03166
03468
04202
007
12000
00Lemur
02989
02708
02839
02967
02557
02724
03166
03670
040
5502087
02317
000
00Ca
ttle
03114
03045
03046
03062
01958
02051
03149
03173
04235
009
0601254
02184
000
00Hare
03146
03157
03062
03046
02832
02788
03166
03421
04023
02217
02508
02053
02532
000
00Gallus
04726
04920
04737
05008
044
50044
2304903
04743
04691
04239
04524
046
8004183
046
6000000
Rabb
it03255
03189
03222
03142
02896
02756
03084
03390
04332
02184
02603
02282
02612
008
37044
34000
00
10 Computational and Mathematical Methods in Medicine
the remaining animals is more than 04423 which is biggerthan 04023 (the shortest distance between Opossum andthe remaining mammals) Obviously the result is consistentwith the fact that gallus is not a kind of mammal
Therefore it is apparent that the results illustrated inTable 6 are wholly consistent with the results of the knownfact of evolution That is to say our ADLDs based methodcan be utilized as an effective way to analyze the similari-tiesdissimilarities of protein sequences
32 The Phylogenetic Tree of the Protein Sequences Based onthe ADLDs A phylogenetic tree is a diagram that is used torepresent the evolutionary relationships of organisms that arethought to have a common ancestry and it is a commonlyused tool for researchers in some fields to help them analyzethe clustering of different species
Obviously only through observing the similaritydissimilaritymatrix illustrated inTable 6 wewill find that it isnot very convenient to distinguish the similaritydissimilarityof protein sequences Therefore in order to show thesimilaritydissimilarity of the protein sequences more vividlyand intuitively according to the similaritydissimilaritymatrix illustrated in Table 6 then we will construct thephylogenetic tree of the above 16 ND5 proteins throughadopting the software MEGA 606 that is provided byTamura et al [41] and the result is illustrated in Figure 3
From Figure 3 it is obvious that we can not only findout the evolutionary relationships of these 16 ND5 proteinsequences visually and intuitively but also know easily thatthe constructed phylogenetic tree is consistent with theresults of the known fact of evolution to some degree
To further validate the performance of our ADLDs basedmethod we applied our method to analyze the similar-itydissimilarity of another group of proteins including 29spike proteins of coronavirus and compared ourmethodwiththe method proposed by Wen and Zhang [17] based on theabove given 16 ND5 proteins and the following 29 spikeproteins respectively The basic information of the 29 spikeproteins is illustrated in Table 7
For the 29 spike proteins illustrated in Table 7 weconstruct the phylogenetic tree in Figure 4 Since the spikeprotein sequences are very long (with more than 1100 aminoacids) therefore during simulation we set 120575 = 5 to avoid theeffect of noise points
Generally coronavirus can always be classified into fourclasses such as the Group I the Group II the Group IIIand the SARS-CoVs (Severe Acute Respiratory SyndromeCoronaviruses) And among these four classes the GroupI includes the Canine coronavirus (CCoV) the Feline coron-avirus (FCoV) the Human coronavirus 229E (HCoV-229E)the Porcine epidemic diarrhea virus (PEDV) and the Trans-missible gastroenteritis virus (TGEV) The Group II includesthe Bovine coronavirus (BCoV) Human coronavirus OC43(HCoV-OC43) the Murine coronavirus Mouse hepatitisvirus (MHV) the Porcine hemagglutinating encephalomyeli-tis virus (HEV) and the Rat coronavirus (RtCoV)TheGroupIII contains theAvian infectious bronchitis virus (IBV) and theTurkey coronavirus (TCoV)
Table 7 The basic information of 29 spike proteins
Number Access number Abbreviation Length1 CAB91145 TGEVG 14472 NP 058424 TGEV 14473 AAK38656 PEDVC 13834 NP 598310 PEDV 13835 NP 937950 HCoVOC43 13616 AAK83356 BCoVE 13637 AAL57308 BCoVL 13638 AAA66399 BCoVM 13639 AAL40400 BCoVQ 136310 AAB86819 MHVA 132411 YP 209233 MHVJHM 137612 AAF69334 MHVP 132113 AAF69344 MHVM 132414 AAP92675 IBVBJ 116915 AAS00080 IBVC 116916 NP 040831 IBV 116217 AAS10463 GD03T0013 125518 AAU93318 PC4127 125519 AAV49720 PC4137 125520 AAU93319 PC4205 125521 AAU04646 civet007 125522 AAU04649 civet010 125523 AAV91631 A022 125524 AAP51227 GD01 125525 AAS00003 GZ02 125526 AAP30030 BJ01 125527 AAP50485 FRA 125528 AAP41037 TOR2 125529 AAQ01597 TaiwanTC1 1255
From observing Figure 4 it is easy to know that the 29spike proteins of coronavirus can be perfectly classified intothe above four classes by our ADLDs based method
Finally for the convenience of comparison we illustratethe phylogenetic trees of the above given 29 spike proteins ofcoronavirus and 16ND5proteins constructed by adopting themethod proposed byWen and Zhang [17] in Figures 5 and 6respectively
Comparing Figure 3 with Figure 6 and Figure 4 withFigure 5 respectively it is obvious that the phylogenetic treesobtained by the method proposed by Wen and Zhang arequite unreasonable and not consistent with the known factsof evolution at all But on the contrary the phylogenetictrees obtained by our ADLDs based method are not onlyquite reasonable but also consistent with the known facts ofevolution to some degree Therefore there is no doubt thatthe performance of our method is much better than that ofthe method proposed by Wen and Zhang
33 The Analysis of Intuition and Visuality of the ADLDsIn Section 26 we have stated that the ADLDs of proteinsequence pairs are intuitional and visual In this section we
Computational and Mathematical Methods in Medicine 11
BCoVE
BCoVL
BCoVM
BCoVQ
HCoVOC43
MHVJHM
MHVP
MHVA
MHVM
TGEVG
TGEV
PEDVC
PEDV
TOR2
TaiwanTC1
FRA
BJ01
GD01
GZ02
civet007
A022
civet010
PC4137
GD03T0013
PC4127
PC4205
IBV
IBVBJ
IBVC
00000
00000
00000
00000
00369
00026
00026
00022
00022
00019
00559
00221
00019
00950
00950
01221
00010
00004
00015
00004
00004
00006
00004
00012
00012
00011
00006
00004
00004
04350
04350
00002
00002
00006
00005
00020
00005
00019
00018
00011
00202
00116
00112
00044
00040
04560
00232
00337
03498
03309
0027203461
00713
00230
00049
00052
Group II
Group I
Group III
SARS CoVs
Figure 4 The phylogenetic tree of the 29 spike proteins of coronavirus constructed by adopting the ADLDs based method with 120575 = 5
will further discuss the intuition and visuality of the ADLDsin detail
From Table 6 we can obtain some similar pairs suchas (fin-whale blue-whale) (pi-chim c-chim) (Human c-chim) (cheep goat) (human pi-chim) and (hare rabbit)and some dissimilar pairs such as (human opossum) and(human gallus) among the above given 16 ND5 proteinsFrom these similardissimilar pairs wewill choose three pairsincluding (human gorilla) (human opossum) and (humangallus) as examples to further show the intuition and visualityof the ADLDs of these three protein sequence pairs TheADLDs of these three similardissimilar pairs are illustratedin Figure 7 while letting 120575 = 3
Observing Figure 7 we can clearly find that the totallength of all of the SFs in each of these three ADLDs satisfiesthe total length of all of the SFs in the ADLD of Figure 7(a) gtthe total length of all of the SFs in the ADLD of Figure 7(b) gtthe total length of all of the SFs in the ADLD of Figure 7(c)Therefore we can intuitively identify that the similarity of theproteins in each of these three protein sequence pairs satisfiesthe similarity of the proteins in the pair (human gorilla) gt thesimilarity of the proteins in the pair (human opossum) gt thesimilarity of the proteins in the pair (human gallus)
Moreover from Figure 7 we can also intuitively identifythat the two protein sequences in the protein sequence pair(human gorilla) are very similar to each other since the total
12 Computational and Mathematical Methods in Medicine
PEDVC
PEDV
GD03T0013
BJ01
GD01
MHVA
PC4205
GZ02
TOR2
MHVM
PC4137
FRA
PC4127
TaiwanTC1
MHVP
A022
civet007
civet010
TGEVG
TGEV
BCoVM
BCoVQ
HCoVOC43
IBVC
MHVJHM
BCoVE
BCoVL
IBVBJ
IBV
00000
00000
00000
00000
00013
00006
00006
00001
00001
00000
00012
00002
00001
00010
00013
00010
00000
00000
00000
00000
00000
00000
00001
00001
00000
00000
00000
00001
0000000000
00000
00000
00006
00000
00000
00001
00000
00000
00001
00002
00001
00000
00002
00004
00002
00003
00003
00008
00006
00008
00067
00022
00013
00013
00008
00042
Figure 5 The phylogenetic tree of the 29 spike proteins of coronavirus constructed by adopting the method proposed by Wen and Zhang
length of all of the SFs in the ADLD of Figure 7(a) looksvery long But on the contrary we can intuitively identifythat the two protein sequences in either the protein sequencepair (human opossum) or the protein sequence pair (humangallus) are apparently dissimilar to each other since both thetotal length of all of the SFs in the ADLD of Figure 7(b) andthat in the ADLD of Figure 7(c) look very short
And through statistic we can know that the actual totallengths of all of the SFs in the ADLDs of these three proteinsequence pairs (human gorilla) (human opossum) and(human gallus) are 556 288 and 248 respectively
Additionally observing Figures 2(a) and 2(b) hardly canwe distinguish the total length of all of the SFs (includingASFs and BSFs) in the ADLD of Figure 2(a) and that inthe ADLD of Figure 2(b) since the total lengths of all ofthe SFs in these two ADLDs look nearly the same Andthrough statistic we can know that the actual total lengthsof all of the SFs in the ADLDs of Figures 2(a) and 2(b)are 123 and 120 respectively and are really close to eachother But through comparing Figure 2(a) with Figure 2(b)more carefully we can further discover that different fromFigure 2(b) except for the SFs there are also 6 different FPs
Computational and Mathematical Methods in Medicine 13
Fin-whale
Rabbit
Blue-whale
Sheep
Goat
Hare
Rat
Lemur
Mouse
Cattle
Opossum
Gorilla
Human
Pi-chim
C-chim
Gallus
00006
00037
00004
00004
00001
00002
00000
00001
00074
00002
00015
00000
00001
00011
00183
00001
00006
00006
00004
00005
00002
00005
00031
00008
00024
00020
00039
00060
00024
00086
Figure 6 The phylogenetic tree of the 16 ND5 proteins constructed by adopting the method proposed by Wen and Zhang
700
600
500
400
300
200
100
07006005004003002001000
(a)
700
600
500
400
300
200
100
07006005004003002001000
(b)
700
600
500
400
300
200
100
07006005004003002001000
(c)
Figure 7 (a) The ADLD of the similar pair (human gorilla) (b) the ADLD of the dissimilar pair (human opossum) (c) the ADLD of thedissimilar pair (human gallus)
14 Computational and Mathematical Methods in Medicine
in the ADLD of Figure 2(a) while there are no FPs in theADLD of Figure 2(b) therefore we can intuitively identifythat the two protein sequences in the protein sequence pair(chimpanzee human) are more similar to the two proteinsequences in the protein sequence pair (human gorilla)
Hence from the above descriptions we can know that theADLDs obtained by our newly proposed method are quitevisual and intuitional andmay be a powerful and effective toolfor visual comparison of protein sequences and numericalsequences in other research fields
4 Conclusions
In this paper a novel ADLDs based graphical representationof protein sequences is proposed which is utilized to analyzethe similaritydissimilarity of protein sequences To validatethe performances of the newmethod we select two groups ofwell-known protein sequences as examples and additionallyin order to observe the similaritydissimilarity of proteinsequences more intuitively we construct the phylogenetictrees of protein sequences The results show that our ADLDsbased method not only has good performances and effects inthe similaritydissimilarity analysis of protein sequences butalso does not require complex computation since there areno high dimensional matrixes required Therefore it meansthat our ADLDs based method can work well in the analysisof protein sequences
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgments
The authors thank the anonymous referees for suggestionsthat helped in improving the paper substantially And theproject is partly sponsored by the Colleges and UniversitiesOpen Innovation Platform Fund of Hunan Province (no13K041) the Hunan Provincial Natural Science Foundationof China (no 14JJ2070) the Construct Program of the KeyDiscipline in Hunan province the State Education MinistryScientific Research Foundation for the Returned OverseasChinese Scholars and the Introduced Talent Start-up FundProject of Xiangtan University (no 11QDZ45)
References
[1] K Bucka-Lassen O Caprani and J Hein ldquoCombining manymultiple alignments in one improved alignmentrdquo Bioinformat-ics vol 15 no 2 pp 122ndash130 1999
[2] L Wang and T Jiang ldquoOn the complexity of multiple sequencealignmentrdquo Journal of Computational Biology vol 1 no 4 pp337ndash348 1994
[3] C Shyu L Sheneman and J A Foster ldquoMultiple sequencealignment with evolutionary computationrdquo Genetic Program-ming and Evolvable Machines vol 5 no 2 pp 121ndash144 2004
[4] M Randic ldquoGraphical representations of DNA as 2-D maprdquoChemical Physics Letters vol 386 no 4ndash6 pp 468ndash471 2004
[5] D Bielinska-Waz T Clark P Waz W Nowak and A Nandyldquo2D-dynamic representation of DNA sequencesrdquo ChemicalPhysics Letters vol 442 no 1ndash3 pp 140ndash144 2007
[6] Z Qi and X Qi ldquoNovel 2D graphical representation of DNAsequence based on dual nucleotidesrdquo Chemical Physics Lettersvol 440 no 1ndash3 pp 139ndash144 2007
[7] W Chen B Liao Y Liu and Z Su ldquoA numerical representationof DNA sequences and its applicationsrdquoMATCH Communica-tions in Mathematical and in Computer Chemistry vol 60 no2 pp 291ndash300 2008
[8] M Randic X Guo and S C Basak ldquoOn the characterization ofDNAprimary sequences by triplet of nucleic acid basesrdquo Journalof Chemical Information and Computer Sciences vol 41 no 3pp 619ndash626 2001
[9] M Randic M Vracko M Novic and D Plavsic ldquoSpectralrepresentation of reduced protein modelsrdquo SAR and QSAR inEnvironmental Research vol 20 no 5-6 pp 415ndash427 2009
[10] M Randic K Mehulic D Vukicevic T Pisanski D Vikic-Topic and D Plavsic ldquoGraphical representation of proteins asfour-color maps and the irnumerical characterizationrdquo Journalof Molecular Graphics and Modelling vol 27 no 5 pp 637ndash6412009
[11] F Bai and TWang ldquoOn graphical and numerical representationof protein sequencesrdquo Journal of Biomolecular Structure andDynamics vol 23 no 5 pp 537ndash545 2006
[12] M Randic ldquo2-D Graphical representation of proteins based onphysico-chemical properties of amino acidsrdquo Chemical PhysicsLetters vol 440 no 4ndash6 pp 291ndash295 2007
[13] A Ghosh and A Nandy ldquoGraphical representation and mathe-matical characterization of protein sequences and applicationsto viral proteinsrdquo Advances in Protein Chemistry and StructuralBiology vol 83 pp 1ndash42 2011
[14] C Li X Yu L Yang X Zheng and Z Wang ldquo3-D maps andcoupling numbers for protein sequencesrdquo Physica A StatisticalMechanics and its Applications vol 388 no 9 pp 1967ndash19722009
[15] M Randic J Zupan and D Vikic-Topic ldquoOn representation ofproteins by star-like graphsrdquo Journal of Molecular Graphics andModelling vol 26 no 1 pp 290ndash305 2007
[16] C Li L Xing and X Wang ldquo2-D graphical representation ofprotein sequences and its application to coronavirus phylogenyrdquoJournal of Biochemistry and Molecular Biology vol 41 no 3 pp217ndash222 2008
[17] J Wen and Y Zhang ldquoA 2D graphical representation of proteinsequence and its numerical characterizationrdquo Chemical PhysicsLetters vol 476 no 4ndash6 pp 281ndash286 2009
[18] Z-C Wu X Xiao and K-C Chou ldquo2D-MH a web-server forgenerating graphic representation of protein sequences basedon the physicochemical properties of their constituent aminoacidsrdquo Journal of Theoretical Biology vol 267 no 1 pp 29ndash342010
[19] B Liao X Sun and Q Zeng ldquoA novel method for similarityanalysis and protein sub-cellular localization predictionrdquo Bioin-formatics vol 26 no 21 pp 2678ndash2683 2010
[20] M Novic and M Randic ldquoRepresentation of proteins as walksin 20-D spacerdquo SAR and QSAR in Environmental Research vol19 no 3-4 pp 317ndash337 2008
[21] Z-H Qi J Feng X-Q Qi and L Li ldquoApplication of 2Dgraphic representation of protein sequence based on Huffmantree methodrdquo Computers in Biology and Medicine vol 42 no 5pp 556ndash563 2012
Computational and Mathematical Methods in Medicine 15
[22] H-J Yu and D-S Huang ldquoNovel 20-D descriptors of proteinsequences and itrsquos applications in similarity analysisrdquo ChemicalPhysics Letters vol 531 pp 261ndash266 2012
[23] P-A He J Wei Y Yao and Z Tie ldquoA novel graphical repre-sentation of proteins and its applicationrdquo Physica A StatisticalMechanics and its Applications vol 391 no 1-2 pp 93ndash99 2012
[24] M Randic M Novic andM Vracko ldquoOn novel representationof proteins based on amino acid adjacency matrixrdquo SAR andQSAR in Environmental Research vol 19 no 3-4 pp 339ndash3492008
[25] Z-P Feng andC-T Zhang ldquoA graphic representation of proteinsequence andpredicting the subcellular locations of prokaryoticproteinsrdquo International Journal of Biochemistry and Cell Biologyvol 34 no 3 pp 298ndash307 2002
[26] M Randic J Zupan and A T Balaban ldquoUnique graphicalrepresentation of protein sequences based on nucleotide tripletcodonsrdquo Chemical Physics Letters vol 397 no 1ndash3 pp 247ndash2522004
[27] F Bai and T Wang ldquoA 2-D graphical representation of proteinsequences based on nucleotide triplet codonsrdquoChemical PhysicsLetters vol 413 no 4ndash6 pp 458ndash462 2005
[28] Y-h Yao F Kong Q Dai and P-a He ldquoA sequence-segmentedmethod applied to the similarity analysis of long proteinsequencerdquo MATCH Communications in Mathematical and inComputer Chemistry vol 70 no 1 pp 431ndash450 2013
[29] M I Abo el Maaty M M Abo-Elkhier and M A AbdElwahaab ldquo3D graphical representation of protein sequencesand their statistical characterizationrdquo Physica A StatisticalMechanics and Its Applications vol 389 no 21 pp 4668ndash46762010
[30] M M Abo-Elkhier ldquoSimilaritydissimilarity analysis of proteinsequences using the spatial median as a descriptorrdquo Journal ofBiophysical Chemistry vol 3 pp 142ndash148 2012
[31] A El-Lakkani and S El-Sherif ldquoSimilarity analysis of proteinsequences based on 2D and 3D amino acid adjacencymatricesrdquoChemical Physics Letters vol 590 pp 192ndash195 2013
[32] T Ma Y Liu Q Dai Y Yao and P-A He ldquoA graphicalrepresentation of protein based on a novel iterated functionsystemrdquo Physica A Statistical Mechanics and its Applicationsvol 403 pp 21ndash28 2014
[33] D M Mount Bioinformatics Sequence and Genome AnalysisCold Spring Harbor Laboratory Press Cold Spring Harbor NYUSA 2nd edition 2004
[34] D G Higgins and P M Sharp ldquoCLUSTAL a package forperforming multiple sequence alignment on a microcomputerrdquoGene vol 73 no 1 pp 237ndash244 1988
[35] R C Edgar ldquoMUSCLE multiple sequence alignment with highaccuracy and high throughputrdquo Nucleic Acids Research vol 32no 5 pp 1792ndash1797 2004
[36] K Katoh K Misawa K-I Kuma and T Miyata ldquoMAFFT anovel method for rapid multiple sequence alignment based onfast Fourier transformrdquo Nucleic Acids Research vol 30 no 14pp 3059ndash3066 2002
[37] C Notredame D G Higgins and J Heringa ldquoT-coffee a novelmethod for fast and accurate multiple sequence alignmentrdquoJournal of Molecular Biology vol 302 no 1 pp 205ndash217 2000
[38] MA BalseraWWriggers YOono andK Schulten ldquoPrincipalcomponent analysis and long time protein dynamicsrdquo Journal ofPhysical Chemistry vol 100 no 7 pp 2567ndash2572 1996
[39] B Hess ldquoSimilarities between principal components of proteindynamics and random diffusionrdquo Physical Review EmdashStatistical
Physics Plasmas Fluids and Related Interdisciplinary Topicsvol 62 no 6B pp 8438ndash8448 2000
[40] A L Tournier and J C Smith ldquoPrincipal components of theprotein dynamical transitionrdquo Physical Review Letters vol 91no 20 Article ID 208106 2003
[41] K Tamura G Stecher D Peterson A Filipski and S KumarldquoMEGA6molecular evolutionary genetics analysis version 60rdquoMolecular Biology and Evolution vol 30 no 12 pp 2725ndash27292013
Computational and Mathematical Methods in Medicine 9
Table6Th
esim
ilaritydissim
ilaritymatrix
forthe
16ND5proteins
basedon
theA
DLD
sbased
metho
d
Hum
anGorilla
Pi-chim
C-chim
Fin-whale
Blue-w
hale
rat
mou
seop
ossum
sheep
goat
lemur
cattle
hare
gallu
srabbit
Hum
an000
00Gorilla
01111
00000
Pi-chim
007
20008
33000
00C-
chim
008
14008
65005
10000
00Fin-whale
03396
03285
03222
03301
00000
Blue-w
hale
03474
03333
03285
03301
00324
000
00Ra
t03693
03622
03636
03716
03333
03381
000
00Mou
se03740
03686
03716
03748
03317
03333
01883
000
00Opo
ssum
044
7604551
04290
044
1804515
04519
04513
044
79000
00Sheep
03020
02933
02871
02951
02023
02067
03149
03219
04121
00000
Goat
03036
02901
02871
02919
01958
02147
03166
03468
04202
007
12000
00Lemur
02989
02708
02839
02967
02557
02724
03166
03670
040
5502087
02317
000
00Ca
ttle
03114
03045
03046
03062
01958
02051
03149
03173
04235
009
0601254
02184
000
00Hare
03146
03157
03062
03046
02832
02788
03166
03421
04023
02217
02508
02053
02532
000
00Gallus
04726
04920
04737
05008
044
50044
2304903
04743
04691
04239
04524
046
8004183
046
6000000
Rabb
it03255
03189
03222
03142
02896
02756
03084
03390
04332
02184
02603
02282
02612
008
37044
34000
00
10 Computational and Mathematical Methods in Medicine
the remaining animals is more than 04423 which is biggerthan 04023 (the shortest distance between Opossum andthe remaining mammals) Obviously the result is consistentwith the fact that gallus is not a kind of mammal
Therefore it is apparent that the results illustrated inTable 6 are wholly consistent with the results of the knownfact of evolution That is to say our ADLDs based methodcan be utilized as an effective way to analyze the similari-tiesdissimilarities of protein sequences
32 The Phylogenetic Tree of the Protein Sequences Based onthe ADLDs A phylogenetic tree is a diagram that is used torepresent the evolutionary relationships of organisms that arethought to have a common ancestry and it is a commonlyused tool for researchers in some fields to help them analyzethe clustering of different species
Obviously only through observing the similaritydissimilaritymatrix illustrated inTable 6 wewill find that it isnot very convenient to distinguish the similaritydissimilarityof protein sequences Therefore in order to show thesimilaritydissimilarity of the protein sequences more vividlyand intuitively according to the similaritydissimilaritymatrix illustrated in Table 6 then we will construct thephylogenetic tree of the above 16 ND5 proteins throughadopting the software MEGA 606 that is provided byTamura et al [41] and the result is illustrated in Figure 3
From Figure 3 it is obvious that we can not only findout the evolutionary relationships of these 16 ND5 proteinsequences visually and intuitively but also know easily thatthe constructed phylogenetic tree is consistent with theresults of the known fact of evolution to some degree
To further validate the performance of our ADLDs basedmethod we applied our method to analyze the similar-itydissimilarity of another group of proteins including 29spike proteins of coronavirus and compared ourmethodwiththe method proposed by Wen and Zhang [17] based on theabove given 16 ND5 proteins and the following 29 spikeproteins respectively The basic information of the 29 spikeproteins is illustrated in Table 7
For the 29 spike proteins illustrated in Table 7 weconstruct the phylogenetic tree in Figure 4 Since the spikeprotein sequences are very long (with more than 1100 aminoacids) therefore during simulation we set 120575 = 5 to avoid theeffect of noise points
Generally coronavirus can always be classified into fourclasses such as the Group I the Group II the Group IIIand the SARS-CoVs (Severe Acute Respiratory SyndromeCoronaviruses) And among these four classes the GroupI includes the Canine coronavirus (CCoV) the Feline coron-avirus (FCoV) the Human coronavirus 229E (HCoV-229E)the Porcine epidemic diarrhea virus (PEDV) and the Trans-missible gastroenteritis virus (TGEV) The Group II includesthe Bovine coronavirus (BCoV) Human coronavirus OC43(HCoV-OC43) the Murine coronavirus Mouse hepatitisvirus (MHV) the Porcine hemagglutinating encephalomyeli-tis virus (HEV) and the Rat coronavirus (RtCoV)TheGroupIII contains theAvian infectious bronchitis virus (IBV) and theTurkey coronavirus (TCoV)
Table 7 The basic information of 29 spike proteins
Number Access number Abbreviation Length1 CAB91145 TGEVG 14472 NP 058424 TGEV 14473 AAK38656 PEDVC 13834 NP 598310 PEDV 13835 NP 937950 HCoVOC43 13616 AAK83356 BCoVE 13637 AAL57308 BCoVL 13638 AAA66399 BCoVM 13639 AAL40400 BCoVQ 136310 AAB86819 MHVA 132411 YP 209233 MHVJHM 137612 AAF69334 MHVP 132113 AAF69344 MHVM 132414 AAP92675 IBVBJ 116915 AAS00080 IBVC 116916 NP 040831 IBV 116217 AAS10463 GD03T0013 125518 AAU93318 PC4127 125519 AAV49720 PC4137 125520 AAU93319 PC4205 125521 AAU04646 civet007 125522 AAU04649 civet010 125523 AAV91631 A022 125524 AAP51227 GD01 125525 AAS00003 GZ02 125526 AAP30030 BJ01 125527 AAP50485 FRA 125528 AAP41037 TOR2 125529 AAQ01597 TaiwanTC1 1255
From observing Figure 4 it is easy to know that the 29spike proteins of coronavirus can be perfectly classified intothe above four classes by our ADLDs based method
Finally for the convenience of comparison we illustratethe phylogenetic trees of the above given 29 spike proteins ofcoronavirus and 16ND5proteins constructed by adopting themethod proposed byWen and Zhang [17] in Figures 5 and 6respectively
Comparing Figure 3 with Figure 6 and Figure 4 withFigure 5 respectively it is obvious that the phylogenetic treesobtained by the method proposed by Wen and Zhang arequite unreasonable and not consistent with the known factsof evolution at all But on the contrary the phylogenetictrees obtained by our ADLDs based method are not onlyquite reasonable but also consistent with the known facts ofevolution to some degree Therefore there is no doubt thatthe performance of our method is much better than that ofthe method proposed by Wen and Zhang
33 The Analysis of Intuition and Visuality of the ADLDsIn Section 26 we have stated that the ADLDs of proteinsequence pairs are intuitional and visual In this section we
Computational and Mathematical Methods in Medicine 11
BCoVE
BCoVL
BCoVM
BCoVQ
HCoVOC43
MHVJHM
MHVP
MHVA
MHVM
TGEVG
TGEV
PEDVC
PEDV
TOR2
TaiwanTC1
FRA
BJ01
GD01
GZ02
civet007
A022
civet010
PC4137
GD03T0013
PC4127
PC4205
IBV
IBVBJ
IBVC
00000
00000
00000
00000
00369
00026
00026
00022
00022
00019
00559
00221
00019
00950
00950
01221
00010
00004
00015
00004
00004
00006
00004
00012
00012
00011
00006
00004
00004
04350
04350
00002
00002
00006
00005
00020
00005
00019
00018
00011
00202
00116
00112
00044
00040
04560
00232
00337
03498
03309
0027203461
00713
00230
00049
00052
Group II
Group I
Group III
SARS CoVs
Figure 4 The phylogenetic tree of the 29 spike proteins of coronavirus constructed by adopting the ADLDs based method with 120575 = 5
will further discuss the intuition and visuality of the ADLDsin detail
From Table 6 we can obtain some similar pairs suchas (fin-whale blue-whale) (pi-chim c-chim) (Human c-chim) (cheep goat) (human pi-chim) and (hare rabbit)and some dissimilar pairs such as (human opossum) and(human gallus) among the above given 16 ND5 proteinsFrom these similardissimilar pairs wewill choose three pairsincluding (human gorilla) (human opossum) and (humangallus) as examples to further show the intuition and visualityof the ADLDs of these three protein sequence pairs TheADLDs of these three similardissimilar pairs are illustratedin Figure 7 while letting 120575 = 3
Observing Figure 7 we can clearly find that the totallength of all of the SFs in each of these three ADLDs satisfiesthe total length of all of the SFs in the ADLD of Figure 7(a) gtthe total length of all of the SFs in the ADLD of Figure 7(b) gtthe total length of all of the SFs in the ADLD of Figure 7(c)Therefore we can intuitively identify that the similarity of theproteins in each of these three protein sequence pairs satisfiesthe similarity of the proteins in the pair (human gorilla) gt thesimilarity of the proteins in the pair (human opossum) gt thesimilarity of the proteins in the pair (human gallus)
Moreover from Figure 7 we can also intuitively identifythat the two protein sequences in the protein sequence pair(human gorilla) are very similar to each other since the total
12 Computational and Mathematical Methods in Medicine
PEDVC
PEDV
GD03T0013
BJ01
GD01
MHVA
PC4205
GZ02
TOR2
MHVM
PC4137
FRA
PC4127
TaiwanTC1
MHVP
A022
civet007
civet010
TGEVG
TGEV
BCoVM
BCoVQ
HCoVOC43
IBVC
MHVJHM
BCoVE
BCoVL
IBVBJ
IBV
00000
00000
00000
00000
00013
00006
00006
00001
00001
00000
00012
00002
00001
00010
00013
00010
00000
00000
00000
00000
00000
00000
00001
00001
00000
00000
00000
00001
0000000000
00000
00000
00006
00000
00000
00001
00000
00000
00001
00002
00001
00000
00002
00004
00002
00003
00003
00008
00006
00008
00067
00022
00013
00013
00008
00042
Figure 5 The phylogenetic tree of the 29 spike proteins of coronavirus constructed by adopting the method proposed by Wen and Zhang
length of all of the SFs in the ADLD of Figure 7(a) looksvery long But on the contrary we can intuitively identifythat the two protein sequences in either the protein sequencepair (human opossum) or the protein sequence pair (humangallus) are apparently dissimilar to each other since both thetotal length of all of the SFs in the ADLD of Figure 7(b) andthat in the ADLD of Figure 7(c) look very short
And through statistic we can know that the actual totallengths of all of the SFs in the ADLDs of these three proteinsequence pairs (human gorilla) (human opossum) and(human gallus) are 556 288 and 248 respectively
Additionally observing Figures 2(a) and 2(b) hardly canwe distinguish the total length of all of the SFs (includingASFs and BSFs) in the ADLD of Figure 2(a) and that inthe ADLD of Figure 2(b) since the total lengths of all ofthe SFs in these two ADLDs look nearly the same Andthrough statistic we can know that the actual total lengthsof all of the SFs in the ADLDs of Figures 2(a) and 2(b)are 123 and 120 respectively and are really close to eachother But through comparing Figure 2(a) with Figure 2(b)more carefully we can further discover that different fromFigure 2(b) except for the SFs there are also 6 different FPs
Computational and Mathematical Methods in Medicine 13
Fin-whale
Rabbit
Blue-whale
Sheep
Goat
Hare
Rat
Lemur
Mouse
Cattle
Opossum
Gorilla
Human
Pi-chim
C-chim
Gallus
00006
00037
00004
00004
00001
00002
00000
00001
00074
00002
00015
00000
00001
00011
00183
00001
00006
00006
00004
00005
00002
00005
00031
00008
00024
00020
00039
00060
00024
00086
Figure 6 The phylogenetic tree of the 16 ND5 proteins constructed by adopting the method proposed by Wen and Zhang
700
600
500
400
300
200
100
07006005004003002001000
(a)
700
600
500
400
300
200
100
07006005004003002001000
(b)
700
600
500
400
300
200
100
07006005004003002001000
(c)
Figure 7 (a) The ADLD of the similar pair (human gorilla) (b) the ADLD of the dissimilar pair (human opossum) (c) the ADLD of thedissimilar pair (human gallus)
14 Computational and Mathematical Methods in Medicine
in the ADLD of Figure 2(a) while there are no FPs in theADLD of Figure 2(b) therefore we can intuitively identifythat the two protein sequences in the protein sequence pair(chimpanzee human) are more similar to the two proteinsequences in the protein sequence pair (human gorilla)
Hence from the above descriptions we can know that theADLDs obtained by our newly proposed method are quitevisual and intuitional andmay be a powerful and effective toolfor visual comparison of protein sequences and numericalsequences in other research fields
4 Conclusions
In this paper a novel ADLDs based graphical representationof protein sequences is proposed which is utilized to analyzethe similaritydissimilarity of protein sequences To validatethe performances of the newmethod we select two groups ofwell-known protein sequences as examples and additionallyin order to observe the similaritydissimilarity of proteinsequences more intuitively we construct the phylogenetictrees of protein sequences The results show that our ADLDsbased method not only has good performances and effects inthe similaritydissimilarity analysis of protein sequences butalso does not require complex computation since there areno high dimensional matrixes required Therefore it meansthat our ADLDs based method can work well in the analysisof protein sequences
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgments
The authors thank the anonymous referees for suggestionsthat helped in improving the paper substantially And theproject is partly sponsored by the Colleges and UniversitiesOpen Innovation Platform Fund of Hunan Province (no13K041) the Hunan Provincial Natural Science Foundationof China (no 14JJ2070) the Construct Program of the KeyDiscipline in Hunan province the State Education MinistryScientific Research Foundation for the Returned OverseasChinese Scholars and the Introduced Talent Start-up FundProject of Xiangtan University (no 11QDZ45)
References
[1] K Bucka-Lassen O Caprani and J Hein ldquoCombining manymultiple alignments in one improved alignmentrdquo Bioinformat-ics vol 15 no 2 pp 122ndash130 1999
[2] L Wang and T Jiang ldquoOn the complexity of multiple sequencealignmentrdquo Journal of Computational Biology vol 1 no 4 pp337ndash348 1994
[3] C Shyu L Sheneman and J A Foster ldquoMultiple sequencealignment with evolutionary computationrdquo Genetic Program-ming and Evolvable Machines vol 5 no 2 pp 121ndash144 2004
[4] M Randic ldquoGraphical representations of DNA as 2-D maprdquoChemical Physics Letters vol 386 no 4ndash6 pp 468ndash471 2004
[5] D Bielinska-Waz T Clark P Waz W Nowak and A Nandyldquo2D-dynamic representation of DNA sequencesrdquo ChemicalPhysics Letters vol 442 no 1ndash3 pp 140ndash144 2007
[6] Z Qi and X Qi ldquoNovel 2D graphical representation of DNAsequence based on dual nucleotidesrdquo Chemical Physics Lettersvol 440 no 1ndash3 pp 139ndash144 2007
[7] W Chen B Liao Y Liu and Z Su ldquoA numerical representationof DNA sequences and its applicationsrdquoMATCH Communica-tions in Mathematical and in Computer Chemistry vol 60 no2 pp 291ndash300 2008
[8] M Randic X Guo and S C Basak ldquoOn the characterization ofDNAprimary sequences by triplet of nucleic acid basesrdquo Journalof Chemical Information and Computer Sciences vol 41 no 3pp 619ndash626 2001
[9] M Randic M Vracko M Novic and D Plavsic ldquoSpectralrepresentation of reduced protein modelsrdquo SAR and QSAR inEnvironmental Research vol 20 no 5-6 pp 415ndash427 2009
[10] M Randic K Mehulic D Vukicevic T Pisanski D Vikic-Topic and D Plavsic ldquoGraphical representation of proteins asfour-color maps and the irnumerical characterizationrdquo Journalof Molecular Graphics and Modelling vol 27 no 5 pp 637ndash6412009
[11] F Bai and TWang ldquoOn graphical and numerical representationof protein sequencesrdquo Journal of Biomolecular Structure andDynamics vol 23 no 5 pp 537ndash545 2006
[12] M Randic ldquo2-D Graphical representation of proteins based onphysico-chemical properties of amino acidsrdquo Chemical PhysicsLetters vol 440 no 4ndash6 pp 291ndash295 2007
[13] A Ghosh and A Nandy ldquoGraphical representation and mathe-matical characterization of protein sequences and applicationsto viral proteinsrdquo Advances in Protein Chemistry and StructuralBiology vol 83 pp 1ndash42 2011
[14] C Li X Yu L Yang X Zheng and Z Wang ldquo3-D maps andcoupling numbers for protein sequencesrdquo Physica A StatisticalMechanics and its Applications vol 388 no 9 pp 1967ndash19722009
[15] M Randic J Zupan and D Vikic-Topic ldquoOn representation ofproteins by star-like graphsrdquo Journal of Molecular Graphics andModelling vol 26 no 1 pp 290ndash305 2007
[16] C Li L Xing and X Wang ldquo2-D graphical representation ofprotein sequences and its application to coronavirus phylogenyrdquoJournal of Biochemistry and Molecular Biology vol 41 no 3 pp217ndash222 2008
[17] J Wen and Y Zhang ldquoA 2D graphical representation of proteinsequence and its numerical characterizationrdquo Chemical PhysicsLetters vol 476 no 4ndash6 pp 281ndash286 2009
[18] Z-C Wu X Xiao and K-C Chou ldquo2D-MH a web-server forgenerating graphic representation of protein sequences basedon the physicochemical properties of their constituent aminoacidsrdquo Journal of Theoretical Biology vol 267 no 1 pp 29ndash342010
[19] B Liao X Sun and Q Zeng ldquoA novel method for similarityanalysis and protein sub-cellular localization predictionrdquo Bioin-formatics vol 26 no 21 pp 2678ndash2683 2010
[20] M Novic and M Randic ldquoRepresentation of proteins as walksin 20-D spacerdquo SAR and QSAR in Environmental Research vol19 no 3-4 pp 317ndash337 2008
[21] Z-H Qi J Feng X-Q Qi and L Li ldquoApplication of 2Dgraphic representation of protein sequence based on Huffmantree methodrdquo Computers in Biology and Medicine vol 42 no 5pp 556ndash563 2012
Computational and Mathematical Methods in Medicine 15
[22] H-J Yu and D-S Huang ldquoNovel 20-D descriptors of proteinsequences and itrsquos applications in similarity analysisrdquo ChemicalPhysics Letters vol 531 pp 261ndash266 2012
[23] P-A He J Wei Y Yao and Z Tie ldquoA novel graphical repre-sentation of proteins and its applicationrdquo Physica A StatisticalMechanics and its Applications vol 391 no 1-2 pp 93ndash99 2012
[24] M Randic M Novic andM Vracko ldquoOn novel representationof proteins based on amino acid adjacency matrixrdquo SAR andQSAR in Environmental Research vol 19 no 3-4 pp 339ndash3492008
[25] Z-P Feng andC-T Zhang ldquoA graphic representation of proteinsequence andpredicting the subcellular locations of prokaryoticproteinsrdquo International Journal of Biochemistry and Cell Biologyvol 34 no 3 pp 298ndash307 2002
[26] M Randic J Zupan and A T Balaban ldquoUnique graphicalrepresentation of protein sequences based on nucleotide tripletcodonsrdquo Chemical Physics Letters vol 397 no 1ndash3 pp 247ndash2522004
[27] F Bai and T Wang ldquoA 2-D graphical representation of proteinsequences based on nucleotide triplet codonsrdquoChemical PhysicsLetters vol 413 no 4ndash6 pp 458ndash462 2005
[28] Y-h Yao F Kong Q Dai and P-a He ldquoA sequence-segmentedmethod applied to the similarity analysis of long proteinsequencerdquo MATCH Communications in Mathematical and inComputer Chemistry vol 70 no 1 pp 431ndash450 2013
[29] M I Abo el Maaty M M Abo-Elkhier and M A AbdElwahaab ldquo3D graphical representation of protein sequencesand their statistical characterizationrdquo Physica A StatisticalMechanics and Its Applications vol 389 no 21 pp 4668ndash46762010
[30] M M Abo-Elkhier ldquoSimilaritydissimilarity analysis of proteinsequences using the spatial median as a descriptorrdquo Journal ofBiophysical Chemistry vol 3 pp 142ndash148 2012
[31] A El-Lakkani and S El-Sherif ldquoSimilarity analysis of proteinsequences based on 2D and 3D amino acid adjacencymatricesrdquoChemical Physics Letters vol 590 pp 192ndash195 2013
[32] T Ma Y Liu Q Dai Y Yao and P-A He ldquoA graphicalrepresentation of protein based on a novel iterated functionsystemrdquo Physica A Statistical Mechanics and its Applicationsvol 403 pp 21ndash28 2014
[33] D M Mount Bioinformatics Sequence and Genome AnalysisCold Spring Harbor Laboratory Press Cold Spring Harbor NYUSA 2nd edition 2004
[34] D G Higgins and P M Sharp ldquoCLUSTAL a package forperforming multiple sequence alignment on a microcomputerrdquoGene vol 73 no 1 pp 237ndash244 1988
[35] R C Edgar ldquoMUSCLE multiple sequence alignment with highaccuracy and high throughputrdquo Nucleic Acids Research vol 32no 5 pp 1792ndash1797 2004
[36] K Katoh K Misawa K-I Kuma and T Miyata ldquoMAFFT anovel method for rapid multiple sequence alignment based onfast Fourier transformrdquo Nucleic Acids Research vol 30 no 14pp 3059ndash3066 2002
[37] C Notredame D G Higgins and J Heringa ldquoT-coffee a novelmethod for fast and accurate multiple sequence alignmentrdquoJournal of Molecular Biology vol 302 no 1 pp 205ndash217 2000
[38] MA BalseraWWriggers YOono andK Schulten ldquoPrincipalcomponent analysis and long time protein dynamicsrdquo Journal ofPhysical Chemistry vol 100 no 7 pp 2567ndash2572 1996
[39] B Hess ldquoSimilarities between principal components of proteindynamics and random diffusionrdquo Physical Review EmdashStatistical
Physics Plasmas Fluids and Related Interdisciplinary Topicsvol 62 no 6B pp 8438ndash8448 2000
[40] A L Tournier and J C Smith ldquoPrincipal components of theprotein dynamical transitionrdquo Physical Review Letters vol 91no 20 Article ID 208106 2003
[41] K Tamura G Stecher D Peterson A Filipski and S KumarldquoMEGA6molecular evolutionary genetics analysis version 60rdquoMolecular Biology and Evolution vol 30 no 12 pp 2725ndash27292013
10 Computational and Mathematical Methods in Medicine
the remaining animals is more than 04423 which is biggerthan 04023 (the shortest distance between Opossum andthe remaining mammals) Obviously the result is consistentwith the fact that gallus is not a kind of mammal
Therefore it is apparent that the results illustrated inTable 6 are wholly consistent with the results of the knownfact of evolution That is to say our ADLDs based methodcan be utilized as an effective way to analyze the similari-tiesdissimilarities of protein sequences
32 The Phylogenetic Tree of the Protein Sequences Based onthe ADLDs A phylogenetic tree is a diagram that is used torepresent the evolutionary relationships of organisms that arethought to have a common ancestry and it is a commonlyused tool for researchers in some fields to help them analyzethe clustering of different species
Obviously only through observing the similaritydissimilaritymatrix illustrated inTable 6 wewill find that it isnot very convenient to distinguish the similaritydissimilarityof protein sequences Therefore in order to show thesimilaritydissimilarity of the protein sequences more vividlyand intuitively according to the similaritydissimilaritymatrix illustrated in Table 6 then we will construct thephylogenetic tree of the above 16 ND5 proteins throughadopting the software MEGA 606 that is provided byTamura et al [41] and the result is illustrated in Figure 3
From Figure 3 it is obvious that we can not only findout the evolutionary relationships of these 16 ND5 proteinsequences visually and intuitively but also know easily thatthe constructed phylogenetic tree is consistent with theresults of the known fact of evolution to some degree
To further validate the performance of our ADLDs basedmethod we applied our method to analyze the similar-itydissimilarity of another group of proteins including 29spike proteins of coronavirus and compared ourmethodwiththe method proposed by Wen and Zhang [17] based on theabove given 16 ND5 proteins and the following 29 spikeproteins respectively The basic information of the 29 spikeproteins is illustrated in Table 7
For the 29 spike proteins illustrated in Table 7 weconstruct the phylogenetic tree in Figure 4 Since the spikeprotein sequences are very long (with more than 1100 aminoacids) therefore during simulation we set 120575 = 5 to avoid theeffect of noise points
Generally coronavirus can always be classified into fourclasses such as the Group I the Group II the Group IIIand the SARS-CoVs (Severe Acute Respiratory SyndromeCoronaviruses) And among these four classes the GroupI includes the Canine coronavirus (CCoV) the Feline coron-avirus (FCoV) the Human coronavirus 229E (HCoV-229E)the Porcine epidemic diarrhea virus (PEDV) and the Trans-missible gastroenteritis virus (TGEV) The Group II includesthe Bovine coronavirus (BCoV) Human coronavirus OC43(HCoV-OC43) the Murine coronavirus Mouse hepatitisvirus (MHV) the Porcine hemagglutinating encephalomyeli-tis virus (HEV) and the Rat coronavirus (RtCoV)TheGroupIII contains theAvian infectious bronchitis virus (IBV) and theTurkey coronavirus (TCoV)
Table 7 The basic information of 29 spike proteins
Number Access number Abbreviation Length1 CAB91145 TGEVG 14472 NP 058424 TGEV 14473 AAK38656 PEDVC 13834 NP 598310 PEDV 13835 NP 937950 HCoVOC43 13616 AAK83356 BCoVE 13637 AAL57308 BCoVL 13638 AAA66399 BCoVM 13639 AAL40400 BCoVQ 136310 AAB86819 MHVA 132411 YP 209233 MHVJHM 137612 AAF69334 MHVP 132113 AAF69344 MHVM 132414 AAP92675 IBVBJ 116915 AAS00080 IBVC 116916 NP 040831 IBV 116217 AAS10463 GD03T0013 125518 AAU93318 PC4127 125519 AAV49720 PC4137 125520 AAU93319 PC4205 125521 AAU04646 civet007 125522 AAU04649 civet010 125523 AAV91631 A022 125524 AAP51227 GD01 125525 AAS00003 GZ02 125526 AAP30030 BJ01 125527 AAP50485 FRA 125528 AAP41037 TOR2 125529 AAQ01597 TaiwanTC1 1255
From observing Figure 4 it is easy to know that the 29spike proteins of coronavirus can be perfectly classified intothe above four classes by our ADLDs based method
Finally for the convenience of comparison we illustratethe phylogenetic trees of the above given 29 spike proteins ofcoronavirus and 16ND5proteins constructed by adopting themethod proposed byWen and Zhang [17] in Figures 5 and 6respectively
Comparing Figure 3 with Figure 6 and Figure 4 withFigure 5 respectively it is obvious that the phylogenetic treesobtained by the method proposed by Wen and Zhang arequite unreasonable and not consistent with the known factsof evolution at all But on the contrary the phylogenetictrees obtained by our ADLDs based method are not onlyquite reasonable but also consistent with the known facts ofevolution to some degree Therefore there is no doubt thatthe performance of our method is much better than that ofthe method proposed by Wen and Zhang
33 The Analysis of Intuition and Visuality of the ADLDsIn Section 26 we have stated that the ADLDs of proteinsequence pairs are intuitional and visual In this section we
Computational and Mathematical Methods in Medicine 11
BCoVE
BCoVL
BCoVM
BCoVQ
HCoVOC43
MHVJHM
MHVP
MHVA
MHVM
TGEVG
TGEV
PEDVC
PEDV
TOR2
TaiwanTC1
FRA
BJ01
GD01
GZ02
civet007
A022
civet010
PC4137
GD03T0013
PC4127
PC4205
IBV
IBVBJ
IBVC
00000
00000
00000
00000
00369
00026
00026
00022
00022
00019
00559
00221
00019
00950
00950
01221
00010
00004
00015
00004
00004
00006
00004
00012
00012
00011
00006
00004
00004
04350
04350
00002
00002
00006
00005
00020
00005
00019
00018
00011
00202
00116
00112
00044
00040
04560
00232
00337
03498
03309
0027203461
00713
00230
00049
00052
Group II
Group I
Group III
SARS CoVs
Figure 4 The phylogenetic tree of the 29 spike proteins of coronavirus constructed by adopting the ADLDs based method with 120575 = 5
will further discuss the intuition and visuality of the ADLDsin detail
From Table 6 we can obtain some similar pairs suchas (fin-whale blue-whale) (pi-chim c-chim) (Human c-chim) (cheep goat) (human pi-chim) and (hare rabbit)and some dissimilar pairs such as (human opossum) and(human gallus) among the above given 16 ND5 proteinsFrom these similardissimilar pairs wewill choose three pairsincluding (human gorilla) (human opossum) and (humangallus) as examples to further show the intuition and visualityof the ADLDs of these three protein sequence pairs TheADLDs of these three similardissimilar pairs are illustratedin Figure 7 while letting 120575 = 3
Observing Figure 7 we can clearly find that the totallength of all of the SFs in each of these three ADLDs satisfiesthe total length of all of the SFs in the ADLD of Figure 7(a) gtthe total length of all of the SFs in the ADLD of Figure 7(b) gtthe total length of all of the SFs in the ADLD of Figure 7(c)Therefore we can intuitively identify that the similarity of theproteins in each of these three protein sequence pairs satisfiesthe similarity of the proteins in the pair (human gorilla) gt thesimilarity of the proteins in the pair (human opossum) gt thesimilarity of the proteins in the pair (human gallus)
Moreover from Figure 7 we can also intuitively identifythat the two protein sequences in the protein sequence pair(human gorilla) are very similar to each other since the total
12 Computational and Mathematical Methods in Medicine
PEDVC
PEDV
GD03T0013
BJ01
GD01
MHVA
PC4205
GZ02
TOR2
MHVM
PC4137
FRA
PC4127
TaiwanTC1
MHVP
A022
civet007
civet010
TGEVG
TGEV
BCoVM
BCoVQ
HCoVOC43
IBVC
MHVJHM
BCoVE
BCoVL
IBVBJ
IBV
00000
00000
00000
00000
00013
00006
00006
00001
00001
00000
00012
00002
00001
00010
00013
00010
00000
00000
00000
00000
00000
00000
00001
00001
00000
00000
00000
00001
0000000000
00000
00000
00006
00000
00000
00001
00000
00000
00001
00002
00001
00000
00002
00004
00002
00003
00003
00008
00006
00008
00067
00022
00013
00013
00008
00042
Figure 5 The phylogenetic tree of the 29 spike proteins of coronavirus constructed by adopting the method proposed by Wen and Zhang
length of all of the SFs in the ADLD of Figure 7(a) looksvery long But on the contrary we can intuitively identifythat the two protein sequences in either the protein sequencepair (human opossum) or the protein sequence pair (humangallus) are apparently dissimilar to each other since both thetotal length of all of the SFs in the ADLD of Figure 7(b) andthat in the ADLD of Figure 7(c) look very short
And through statistic we can know that the actual totallengths of all of the SFs in the ADLDs of these three proteinsequence pairs (human gorilla) (human opossum) and(human gallus) are 556 288 and 248 respectively
Additionally observing Figures 2(a) and 2(b) hardly canwe distinguish the total length of all of the SFs (includingASFs and BSFs) in the ADLD of Figure 2(a) and that inthe ADLD of Figure 2(b) since the total lengths of all ofthe SFs in these two ADLDs look nearly the same Andthrough statistic we can know that the actual total lengthsof all of the SFs in the ADLDs of Figures 2(a) and 2(b)are 123 and 120 respectively and are really close to eachother But through comparing Figure 2(a) with Figure 2(b)more carefully we can further discover that different fromFigure 2(b) except for the SFs there are also 6 different FPs
Computational and Mathematical Methods in Medicine 13
Fin-whale
Rabbit
Blue-whale
Sheep
Goat
Hare
Rat
Lemur
Mouse
Cattle
Opossum
Gorilla
Human
Pi-chim
C-chim
Gallus
00006
00037
00004
00004
00001
00002
00000
00001
00074
00002
00015
00000
00001
00011
00183
00001
00006
00006
00004
00005
00002
00005
00031
00008
00024
00020
00039
00060
00024
00086
Figure 6 The phylogenetic tree of the 16 ND5 proteins constructed by adopting the method proposed by Wen and Zhang
700
600
500
400
300
200
100
07006005004003002001000
(a)
700
600
500
400
300
200
100
07006005004003002001000
(b)
700
600
500
400
300
200
100
07006005004003002001000
(c)
Figure 7 (a) The ADLD of the similar pair (human gorilla) (b) the ADLD of the dissimilar pair (human opossum) (c) the ADLD of thedissimilar pair (human gallus)
14 Computational and Mathematical Methods in Medicine
in the ADLD of Figure 2(a) while there are no FPs in theADLD of Figure 2(b) therefore we can intuitively identifythat the two protein sequences in the protein sequence pair(chimpanzee human) are more similar to the two proteinsequences in the protein sequence pair (human gorilla)
Hence from the above descriptions we can know that theADLDs obtained by our newly proposed method are quitevisual and intuitional andmay be a powerful and effective toolfor visual comparison of protein sequences and numericalsequences in other research fields
4 Conclusions
In this paper a novel ADLDs based graphical representationof protein sequences is proposed which is utilized to analyzethe similaritydissimilarity of protein sequences To validatethe performances of the newmethod we select two groups ofwell-known protein sequences as examples and additionallyin order to observe the similaritydissimilarity of proteinsequences more intuitively we construct the phylogenetictrees of protein sequences The results show that our ADLDsbased method not only has good performances and effects inthe similaritydissimilarity analysis of protein sequences butalso does not require complex computation since there areno high dimensional matrixes required Therefore it meansthat our ADLDs based method can work well in the analysisof protein sequences
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgments
The authors thank the anonymous referees for suggestionsthat helped in improving the paper substantially And theproject is partly sponsored by the Colleges and UniversitiesOpen Innovation Platform Fund of Hunan Province (no13K041) the Hunan Provincial Natural Science Foundationof China (no 14JJ2070) the Construct Program of the KeyDiscipline in Hunan province the State Education MinistryScientific Research Foundation for the Returned OverseasChinese Scholars and the Introduced Talent Start-up FundProject of Xiangtan University (no 11QDZ45)
References
[1] K Bucka-Lassen O Caprani and J Hein ldquoCombining manymultiple alignments in one improved alignmentrdquo Bioinformat-ics vol 15 no 2 pp 122ndash130 1999
[2] L Wang and T Jiang ldquoOn the complexity of multiple sequencealignmentrdquo Journal of Computational Biology vol 1 no 4 pp337ndash348 1994
[3] C Shyu L Sheneman and J A Foster ldquoMultiple sequencealignment with evolutionary computationrdquo Genetic Program-ming and Evolvable Machines vol 5 no 2 pp 121ndash144 2004
[4] M Randic ldquoGraphical representations of DNA as 2-D maprdquoChemical Physics Letters vol 386 no 4ndash6 pp 468ndash471 2004
[5] D Bielinska-Waz T Clark P Waz W Nowak and A Nandyldquo2D-dynamic representation of DNA sequencesrdquo ChemicalPhysics Letters vol 442 no 1ndash3 pp 140ndash144 2007
[6] Z Qi and X Qi ldquoNovel 2D graphical representation of DNAsequence based on dual nucleotidesrdquo Chemical Physics Lettersvol 440 no 1ndash3 pp 139ndash144 2007
[7] W Chen B Liao Y Liu and Z Su ldquoA numerical representationof DNA sequences and its applicationsrdquoMATCH Communica-tions in Mathematical and in Computer Chemistry vol 60 no2 pp 291ndash300 2008
[8] M Randic X Guo and S C Basak ldquoOn the characterization ofDNAprimary sequences by triplet of nucleic acid basesrdquo Journalof Chemical Information and Computer Sciences vol 41 no 3pp 619ndash626 2001
[9] M Randic M Vracko M Novic and D Plavsic ldquoSpectralrepresentation of reduced protein modelsrdquo SAR and QSAR inEnvironmental Research vol 20 no 5-6 pp 415ndash427 2009
[10] M Randic K Mehulic D Vukicevic T Pisanski D Vikic-Topic and D Plavsic ldquoGraphical representation of proteins asfour-color maps and the irnumerical characterizationrdquo Journalof Molecular Graphics and Modelling vol 27 no 5 pp 637ndash6412009
[11] F Bai and TWang ldquoOn graphical and numerical representationof protein sequencesrdquo Journal of Biomolecular Structure andDynamics vol 23 no 5 pp 537ndash545 2006
[12] M Randic ldquo2-D Graphical representation of proteins based onphysico-chemical properties of amino acidsrdquo Chemical PhysicsLetters vol 440 no 4ndash6 pp 291ndash295 2007
[13] A Ghosh and A Nandy ldquoGraphical representation and mathe-matical characterization of protein sequences and applicationsto viral proteinsrdquo Advances in Protein Chemistry and StructuralBiology vol 83 pp 1ndash42 2011
[14] C Li X Yu L Yang X Zheng and Z Wang ldquo3-D maps andcoupling numbers for protein sequencesrdquo Physica A StatisticalMechanics and its Applications vol 388 no 9 pp 1967ndash19722009
[15] M Randic J Zupan and D Vikic-Topic ldquoOn representation ofproteins by star-like graphsrdquo Journal of Molecular Graphics andModelling vol 26 no 1 pp 290ndash305 2007
[16] C Li L Xing and X Wang ldquo2-D graphical representation ofprotein sequences and its application to coronavirus phylogenyrdquoJournal of Biochemistry and Molecular Biology vol 41 no 3 pp217ndash222 2008
[17] J Wen and Y Zhang ldquoA 2D graphical representation of proteinsequence and its numerical characterizationrdquo Chemical PhysicsLetters vol 476 no 4ndash6 pp 281ndash286 2009
[18] Z-C Wu X Xiao and K-C Chou ldquo2D-MH a web-server forgenerating graphic representation of protein sequences basedon the physicochemical properties of their constituent aminoacidsrdquo Journal of Theoretical Biology vol 267 no 1 pp 29ndash342010
[19] B Liao X Sun and Q Zeng ldquoA novel method for similarityanalysis and protein sub-cellular localization predictionrdquo Bioin-formatics vol 26 no 21 pp 2678ndash2683 2010
[20] M Novic and M Randic ldquoRepresentation of proteins as walksin 20-D spacerdquo SAR and QSAR in Environmental Research vol19 no 3-4 pp 317ndash337 2008
[21] Z-H Qi J Feng X-Q Qi and L Li ldquoApplication of 2Dgraphic representation of protein sequence based on Huffmantree methodrdquo Computers in Biology and Medicine vol 42 no 5pp 556ndash563 2012
Computational and Mathematical Methods in Medicine 15
[22] H-J Yu and D-S Huang ldquoNovel 20-D descriptors of proteinsequences and itrsquos applications in similarity analysisrdquo ChemicalPhysics Letters vol 531 pp 261ndash266 2012
[23] P-A He J Wei Y Yao and Z Tie ldquoA novel graphical repre-sentation of proteins and its applicationrdquo Physica A StatisticalMechanics and its Applications vol 391 no 1-2 pp 93ndash99 2012
[24] M Randic M Novic andM Vracko ldquoOn novel representationof proteins based on amino acid adjacency matrixrdquo SAR andQSAR in Environmental Research vol 19 no 3-4 pp 339ndash3492008
[25] Z-P Feng andC-T Zhang ldquoA graphic representation of proteinsequence andpredicting the subcellular locations of prokaryoticproteinsrdquo International Journal of Biochemistry and Cell Biologyvol 34 no 3 pp 298ndash307 2002
[26] M Randic J Zupan and A T Balaban ldquoUnique graphicalrepresentation of protein sequences based on nucleotide tripletcodonsrdquo Chemical Physics Letters vol 397 no 1ndash3 pp 247ndash2522004
[27] F Bai and T Wang ldquoA 2-D graphical representation of proteinsequences based on nucleotide triplet codonsrdquoChemical PhysicsLetters vol 413 no 4ndash6 pp 458ndash462 2005
[28] Y-h Yao F Kong Q Dai and P-a He ldquoA sequence-segmentedmethod applied to the similarity analysis of long proteinsequencerdquo MATCH Communications in Mathematical and inComputer Chemistry vol 70 no 1 pp 431ndash450 2013
[29] M I Abo el Maaty M M Abo-Elkhier and M A AbdElwahaab ldquo3D graphical representation of protein sequencesand their statistical characterizationrdquo Physica A StatisticalMechanics and Its Applications vol 389 no 21 pp 4668ndash46762010
[30] M M Abo-Elkhier ldquoSimilaritydissimilarity analysis of proteinsequences using the spatial median as a descriptorrdquo Journal ofBiophysical Chemistry vol 3 pp 142ndash148 2012
[31] A El-Lakkani and S El-Sherif ldquoSimilarity analysis of proteinsequences based on 2D and 3D amino acid adjacencymatricesrdquoChemical Physics Letters vol 590 pp 192ndash195 2013
[32] T Ma Y Liu Q Dai Y Yao and P-A He ldquoA graphicalrepresentation of protein based on a novel iterated functionsystemrdquo Physica A Statistical Mechanics and its Applicationsvol 403 pp 21ndash28 2014
[33] D M Mount Bioinformatics Sequence and Genome AnalysisCold Spring Harbor Laboratory Press Cold Spring Harbor NYUSA 2nd edition 2004
[34] D G Higgins and P M Sharp ldquoCLUSTAL a package forperforming multiple sequence alignment on a microcomputerrdquoGene vol 73 no 1 pp 237ndash244 1988
[35] R C Edgar ldquoMUSCLE multiple sequence alignment with highaccuracy and high throughputrdquo Nucleic Acids Research vol 32no 5 pp 1792ndash1797 2004
[36] K Katoh K Misawa K-I Kuma and T Miyata ldquoMAFFT anovel method for rapid multiple sequence alignment based onfast Fourier transformrdquo Nucleic Acids Research vol 30 no 14pp 3059ndash3066 2002
[37] C Notredame D G Higgins and J Heringa ldquoT-coffee a novelmethod for fast and accurate multiple sequence alignmentrdquoJournal of Molecular Biology vol 302 no 1 pp 205ndash217 2000
[38] MA BalseraWWriggers YOono andK Schulten ldquoPrincipalcomponent analysis and long time protein dynamicsrdquo Journal ofPhysical Chemistry vol 100 no 7 pp 2567ndash2572 1996
[39] B Hess ldquoSimilarities between principal components of proteindynamics and random diffusionrdquo Physical Review EmdashStatistical
Physics Plasmas Fluids and Related Interdisciplinary Topicsvol 62 no 6B pp 8438ndash8448 2000
[40] A L Tournier and J C Smith ldquoPrincipal components of theprotein dynamical transitionrdquo Physical Review Letters vol 91no 20 Article ID 208106 2003
[41] K Tamura G Stecher D Peterson A Filipski and S KumarldquoMEGA6molecular evolutionary genetics analysis version 60rdquoMolecular Biology and Evolution vol 30 no 12 pp 2725ndash27292013
Computational and Mathematical Methods in Medicine 11
BCoVE
BCoVL
BCoVM
BCoVQ
HCoVOC43
MHVJHM
MHVP
MHVA
MHVM
TGEVG
TGEV
PEDVC
PEDV
TOR2
TaiwanTC1
FRA
BJ01
GD01
GZ02
civet007
A022
civet010
PC4137
GD03T0013
PC4127
PC4205
IBV
IBVBJ
IBVC
00000
00000
00000
00000
00369
00026
00026
00022
00022
00019
00559
00221
00019
00950
00950
01221
00010
00004
00015
00004
00004
00006
00004
00012
00012
00011
00006
00004
00004
04350
04350
00002
00002
00006
00005
00020
00005
00019
00018
00011
00202
00116
00112
00044
00040
04560
00232
00337
03498
03309
0027203461
00713
00230
00049
00052
Group II
Group I
Group III
SARS CoVs
Figure 4 The phylogenetic tree of the 29 spike proteins of coronavirus constructed by adopting the ADLDs based method with 120575 = 5
will further discuss the intuition and visuality of the ADLDsin detail
From Table 6 we can obtain some similar pairs suchas (fin-whale blue-whale) (pi-chim c-chim) (Human c-chim) (cheep goat) (human pi-chim) and (hare rabbit)and some dissimilar pairs such as (human opossum) and(human gallus) among the above given 16 ND5 proteinsFrom these similardissimilar pairs wewill choose three pairsincluding (human gorilla) (human opossum) and (humangallus) as examples to further show the intuition and visualityof the ADLDs of these three protein sequence pairs TheADLDs of these three similardissimilar pairs are illustratedin Figure 7 while letting 120575 = 3
Observing Figure 7 we can clearly find that the totallength of all of the SFs in each of these three ADLDs satisfiesthe total length of all of the SFs in the ADLD of Figure 7(a) gtthe total length of all of the SFs in the ADLD of Figure 7(b) gtthe total length of all of the SFs in the ADLD of Figure 7(c)Therefore we can intuitively identify that the similarity of theproteins in each of these three protein sequence pairs satisfiesthe similarity of the proteins in the pair (human gorilla) gt thesimilarity of the proteins in the pair (human opossum) gt thesimilarity of the proteins in the pair (human gallus)
Moreover from Figure 7 we can also intuitively identifythat the two protein sequences in the protein sequence pair(human gorilla) are very similar to each other since the total
12 Computational and Mathematical Methods in Medicine
PEDVC
PEDV
GD03T0013
BJ01
GD01
MHVA
PC4205
GZ02
TOR2
MHVM
PC4137
FRA
PC4127
TaiwanTC1
MHVP
A022
civet007
civet010
TGEVG
TGEV
BCoVM
BCoVQ
HCoVOC43
IBVC
MHVJHM
BCoVE
BCoVL
IBVBJ
IBV
00000
00000
00000
00000
00013
00006
00006
00001
00001
00000
00012
00002
00001
00010
00013
00010
00000
00000
00000
00000
00000
00000
00001
00001
00000
00000
00000
00001
0000000000
00000
00000
00006
00000
00000
00001
00000
00000
00001
00002
00001
00000
00002
00004
00002
00003
00003
00008
00006
00008
00067
00022
00013
00013
00008
00042
Figure 5 The phylogenetic tree of the 29 spike proteins of coronavirus constructed by adopting the method proposed by Wen and Zhang
length of all of the SFs in the ADLD of Figure 7(a) looksvery long But on the contrary we can intuitively identifythat the two protein sequences in either the protein sequencepair (human opossum) or the protein sequence pair (humangallus) are apparently dissimilar to each other since both thetotal length of all of the SFs in the ADLD of Figure 7(b) andthat in the ADLD of Figure 7(c) look very short
And through statistic we can know that the actual totallengths of all of the SFs in the ADLDs of these three proteinsequence pairs (human gorilla) (human opossum) and(human gallus) are 556 288 and 248 respectively
Additionally observing Figures 2(a) and 2(b) hardly canwe distinguish the total length of all of the SFs (includingASFs and BSFs) in the ADLD of Figure 2(a) and that inthe ADLD of Figure 2(b) since the total lengths of all ofthe SFs in these two ADLDs look nearly the same Andthrough statistic we can know that the actual total lengthsof all of the SFs in the ADLDs of Figures 2(a) and 2(b)are 123 and 120 respectively and are really close to eachother But through comparing Figure 2(a) with Figure 2(b)more carefully we can further discover that different fromFigure 2(b) except for the SFs there are also 6 different FPs
Computational and Mathematical Methods in Medicine 13
Fin-whale
Rabbit
Blue-whale
Sheep
Goat
Hare
Rat
Lemur
Mouse
Cattle
Opossum
Gorilla
Human
Pi-chim
C-chim
Gallus
00006
00037
00004
00004
00001
00002
00000
00001
00074
00002
00015
00000
00001
00011
00183
00001
00006
00006
00004
00005
00002
00005
00031
00008
00024
00020
00039
00060
00024
00086
Figure 6 The phylogenetic tree of the 16 ND5 proteins constructed by adopting the method proposed by Wen and Zhang
700
600
500
400
300
200
100
07006005004003002001000
(a)
700
600
500
400
300
200
100
07006005004003002001000
(b)
700
600
500
400
300
200
100
07006005004003002001000
(c)
Figure 7 (a) The ADLD of the similar pair (human gorilla) (b) the ADLD of the dissimilar pair (human opossum) (c) the ADLD of thedissimilar pair (human gallus)
14 Computational and Mathematical Methods in Medicine
in the ADLD of Figure 2(a) while there are no FPs in theADLD of Figure 2(b) therefore we can intuitively identifythat the two protein sequences in the protein sequence pair(chimpanzee human) are more similar to the two proteinsequences in the protein sequence pair (human gorilla)
Hence from the above descriptions we can know that theADLDs obtained by our newly proposed method are quitevisual and intuitional andmay be a powerful and effective toolfor visual comparison of protein sequences and numericalsequences in other research fields
4 Conclusions
In this paper a novel ADLDs based graphical representationof protein sequences is proposed which is utilized to analyzethe similaritydissimilarity of protein sequences To validatethe performances of the newmethod we select two groups ofwell-known protein sequences as examples and additionallyin order to observe the similaritydissimilarity of proteinsequences more intuitively we construct the phylogenetictrees of protein sequences The results show that our ADLDsbased method not only has good performances and effects inthe similaritydissimilarity analysis of protein sequences butalso does not require complex computation since there areno high dimensional matrixes required Therefore it meansthat our ADLDs based method can work well in the analysisof protein sequences
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgments
The authors thank the anonymous referees for suggestionsthat helped in improving the paper substantially And theproject is partly sponsored by the Colleges and UniversitiesOpen Innovation Platform Fund of Hunan Province (no13K041) the Hunan Provincial Natural Science Foundationof China (no 14JJ2070) the Construct Program of the KeyDiscipline in Hunan province the State Education MinistryScientific Research Foundation for the Returned OverseasChinese Scholars and the Introduced Talent Start-up FundProject of Xiangtan University (no 11QDZ45)
References
[1] K Bucka-Lassen O Caprani and J Hein ldquoCombining manymultiple alignments in one improved alignmentrdquo Bioinformat-ics vol 15 no 2 pp 122ndash130 1999
[2] L Wang and T Jiang ldquoOn the complexity of multiple sequencealignmentrdquo Journal of Computational Biology vol 1 no 4 pp337ndash348 1994
[3] C Shyu L Sheneman and J A Foster ldquoMultiple sequencealignment with evolutionary computationrdquo Genetic Program-ming and Evolvable Machines vol 5 no 2 pp 121ndash144 2004
[4] M Randic ldquoGraphical representations of DNA as 2-D maprdquoChemical Physics Letters vol 386 no 4ndash6 pp 468ndash471 2004
[5] D Bielinska-Waz T Clark P Waz W Nowak and A Nandyldquo2D-dynamic representation of DNA sequencesrdquo ChemicalPhysics Letters vol 442 no 1ndash3 pp 140ndash144 2007
[6] Z Qi and X Qi ldquoNovel 2D graphical representation of DNAsequence based on dual nucleotidesrdquo Chemical Physics Lettersvol 440 no 1ndash3 pp 139ndash144 2007
[7] W Chen B Liao Y Liu and Z Su ldquoA numerical representationof DNA sequences and its applicationsrdquoMATCH Communica-tions in Mathematical and in Computer Chemistry vol 60 no2 pp 291ndash300 2008
[8] M Randic X Guo and S C Basak ldquoOn the characterization ofDNAprimary sequences by triplet of nucleic acid basesrdquo Journalof Chemical Information and Computer Sciences vol 41 no 3pp 619ndash626 2001
[9] M Randic M Vracko M Novic and D Plavsic ldquoSpectralrepresentation of reduced protein modelsrdquo SAR and QSAR inEnvironmental Research vol 20 no 5-6 pp 415ndash427 2009
[10] M Randic K Mehulic D Vukicevic T Pisanski D Vikic-Topic and D Plavsic ldquoGraphical representation of proteins asfour-color maps and the irnumerical characterizationrdquo Journalof Molecular Graphics and Modelling vol 27 no 5 pp 637ndash6412009
[11] F Bai and TWang ldquoOn graphical and numerical representationof protein sequencesrdquo Journal of Biomolecular Structure andDynamics vol 23 no 5 pp 537ndash545 2006
[12] M Randic ldquo2-D Graphical representation of proteins based onphysico-chemical properties of amino acidsrdquo Chemical PhysicsLetters vol 440 no 4ndash6 pp 291ndash295 2007
[13] A Ghosh and A Nandy ldquoGraphical representation and mathe-matical characterization of protein sequences and applicationsto viral proteinsrdquo Advances in Protein Chemistry and StructuralBiology vol 83 pp 1ndash42 2011
[14] C Li X Yu L Yang X Zheng and Z Wang ldquo3-D maps andcoupling numbers for protein sequencesrdquo Physica A StatisticalMechanics and its Applications vol 388 no 9 pp 1967ndash19722009
[15] M Randic J Zupan and D Vikic-Topic ldquoOn representation ofproteins by star-like graphsrdquo Journal of Molecular Graphics andModelling vol 26 no 1 pp 290ndash305 2007
[16] C Li L Xing and X Wang ldquo2-D graphical representation ofprotein sequences and its application to coronavirus phylogenyrdquoJournal of Biochemistry and Molecular Biology vol 41 no 3 pp217ndash222 2008
[17] J Wen and Y Zhang ldquoA 2D graphical representation of proteinsequence and its numerical characterizationrdquo Chemical PhysicsLetters vol 476 no 4ndash6 pp 281ndash286 2009
[18] Z-C Wu X Xiao and K-C Chou ldquo2D-MH a web-server forgenerating graphic representation of protein sequences basedon the physicochemical properties of their constituent aminoacidsrdquo Journal of Theoretical Biology vol 267 no 1 pp 29ndash342010
[19] B Liao X Sun and Q Zeng ldquoA novel method for similarityanalysis and protein sub-cellular localization predictionrdquo Bioin-formatics vol 26 no 21 pp 2678ndash2683 2010
[20] M Novic and M Randic ldquoRepresentation of proteins as walksin 20-D spacerdquo SAR and QSAR in Environmental Research vol19 no 3-4 pp 317ndash337 2008
[21] Z-H Qi J Feng X-Q Qi and L Li ldquoApplication of 2Dgraphic representation of protein sequence based on Huffmantree methodrdquo Computers in Biology and Medicine vol 42 no 5pp 556ndash563 2012
Computational and Mathematical Methods in Medicine 15
[22] H-J Yu and D-S Huang ldquoNovel 20-D descriptors of proteinsequences and itrsquos applications in similarity analysisrdquo ChemicalPhysics Letters vol 531 pp 261ndash266 2012
[23] P-A He J Wei Y Yao and Z Tie ldquoA novel graphical repre-sentation of proteins and its applicationrdquo Physica A StatisticalMechanics and its Applications vol 391 no 1-2 pp 93ndash99 2012
[24] M Randic M Novic andM Vracko ldquoOn novel representationof proteins based on amino acid adjacency matrixrdquo SAR andQSAR in Environmental Research vol 19 no 3-4 pp 339ndash3492008
[25] Z-P Feng andC-T Zhang ldquoA graphic representation of proteinsequence andpredicting the subcellular locations of prokaryoticproteinsrdquo International Journal of Biochemistry and Cell Biologyvol 34 no 3 pp 298ndash307 2002
[26] M Randic J Zupan and A T Balaban ldquoUnique graphicalrepresentation of protein sequences based on nucleotide tripletcodonsrdquo Chemical Physics Letters vol 397 no 1ndash3 pp 247ndash2522004
[27] F Bai and T Wang ldquoA 2-D graphical representation of proteinsequences based on nucleotide triplet codonsrdquoChemical PhysicsLetters vol 413 no 4ndash6 pp 458ndash462 2005
[28] Y-h Yao F Kong Q Dai and P-a He ldquoA sequence-segmentedmethod applied to the similarity analysis of long proteinsequencerdquo MATCH Communications in Mathematical and inComputer Chemistry vol 70 no 1 pp 431ndash450 2013
[29] M I Abo el Maaty M M Abo-Elkhier and M A AbdElwahaab ldquo3D graphical representation of protein sequencesand their statistical characterizationrdquo Physica A StatisticalMechanics and Its Applications vol 389 no 21 pp 4668ndash46762010
[30] M M Abo-Elkhier ldquoSimilaritydissimilarity analysis of proteinsequences using the spatial median as a descriptorrdquo Journal ofBiophysical Chemistry vol 3 pp 142ndash148 2012
[31] A El-Lakkani and S El-Sherif ldquoSimilarity analysis of proteinsequences based on 2D and 3D amino acid adjacencymatricesrdquoChemical Physics Letters vol 590 pp 192ndash195 2013
[32] T Ma Y Liu Q Dai Y Yao and P-A He ldquoA graphicalrepresentation of protein based on a novel iterated functionsystemrdquo Physica A Statistical Mechanics and its Applicationsvol 403 pp 21ndash28 2014
[33] D M Mount Bioinformatics Sequence and Genome AnalysisCold Spring Harbor Laboratory Press Cold Spring Harbor NYUSA 2nd edition 2004
[34] D G Higgins and P M Sharp ldquoCLUSTAL a package forperforming multiple sequence alignment on a microcomputerrdquoGene vol 73 no 1 pp 237ndash244 1988
[35] R C Edgar ldquoMUSCLE multiple sequence alignment with highaccuracy and high throughputrdquo Nucleic Acids Research vol 32no 5 pp 1792ndash1797 2004
[36] K Katoh K Misawa K-I Kuma and T Miyata ldquoMAFFT anovel method for rapid multiple sequence alignment based onfast Fourier transformrdquo Nucleic Acids Research vol 30 no 14pp 3059ndash3066 2002
[37] C Notredame D G Higgins and J Heringa ldquoT-coffee a novelmethod for fast and accurate multiple sequence alignmentrdquoJournal of Molecular Biology vol 302 no 1 pp 205ndash217 2000
[38] MA BalseraWWriggers YOono andK Schulten ldquoPrincipalcomponent analysis and long time protein dynamicsrdquo Journal ofPhysical Chemistry vol 100 no 7 pp 2567ndash2572 1996
[39] B Hess ldquoSimilarities between principal components of proteindynamics and random diffusionrdquo Physical Review EmdashStatistical
Physics Plasmas Fluids and Related Interdisciplinary Topicsvol 62 no 6B pp 8438ndash8448 2000
[40] A L Tournier and J C Smith ldquoPrincipal components of theprotein dynamical transitionrdquo Physical Review Letters vol 91no 20 Article ID 208106 2003
[41] K Tamura G Stecher D Peterson A Filipski and S KumarldquoMEGA6molecular evolutionary genetics analysis version 60rdquoMolecular Biology and Evolution vol 30 no 12 pp 2725ndash27292013
12 Computational and Mathematical Methods in Medicine
PEDVC
PEDV
GD03T0013
BJ01
GD01
MHVA
PC4205
GZ02
TOR2
MHVM
PC4137
FRA
PC4127
TaiwanTC1
MHVP
A022
civet007
civet010
TGEVG
TGEV
BCoVM
BCoVQ
HCoVOC43
IBVC
MHVJHM
BCoVE
BCoVL
IBVBJ
IBV
00000
00000
00000
00000
00013
00006
00006
00001
00001
00000
00012
00002
00001
00010
00013
00010
00000
00000
00000
00000
00000
00000
00001
00001
00000
00000
00000
00001
0000000000
00000
00000
00006
00000
00000
00001
00000
00000
00001
00002
00001
00000
00002
00004
00002
00003
00003
00008
00006
00008
00067
00022
00013
00013
00008
00042
Figure 5 The phylogenetic tree of the 29 spike proteins of coronavirus constructed by adopting the method proposed by Wen and Zhang
length of all of the SFs in the ADLD of Figure 7(a) looksvery long But on the contrary we can intuitively identifythat the two protein sequences in either the protein sequencepair (human opossum) or the protein sequence pair (humangallus) are apparently dissimilar to each other since both thetotal length of all of the SFs in the ADLD of Figure 7(b) andthat in the ADLD of Figure 7(c) look very short
And through statistic we can know that the actual totallengths of all of the SFs in the ADLDs of these three proteinsequence pairs (human gorilla) (human opossum) and(human gallus) are 556 288 and 248 respectively
Additionally observing Figures 2(a) and 2(b) hardly canwe distinguish the total length of all of the SFs (includingASFs and BSFs) in the ADLD of Figure 2(a) and that inthe ADLD of Figure 2(b) since the total lengths of all ofthe SFs in these two ADLDs look nearly the same Andthrough statistic we can know that the actual total lengthsof all of the SFs in the ADLDs of Figures 2(a) and 2(b)are 123 and 120 respectively and are really close to eachother But through comparing Figure 2(a) with Figure 2(b)more carefully we can further discover that different fromFigure 2(b) except for the SFs there are also 6 different FPs
Computational and Mathematical Methods in Medicine 13
Fin-whale
Rabbit
Blue-whale
Sheep
Goat
Hare
Rat
Lemur
Mouse
Cattle
Opossum
Gorilla
Human
Pi-chim
C-chim
Gallus
00006
00037
00004
00004
00001
00002
00000
00001
00074
00002
00015
00000
00001
00011
00183
00001
00006
00006
00004
00005
00002
00005
00031
00008
00024
00020
00039
00060
00024
00086
Figure 6 The phylogenetic tree of the 16 ND5 proteins constructed by adopting the method proposed by Wen and Zhang
700
600
500
400
300
200
100
07006005004003002001000
(a)
700
600
500
400
300
200
100
07006005004003002001000
(b)
700
600
500
400
300
200
100
07006005004003002001000
(c)
Figure 7 (a) The ADLD of the similar pair (human gorilla) (b) the ADLD of the dissimilar pair (human opossum) (c) the ADLD of thedissimilar pair (human gallus)
14 Computational and Mathematical Methods in Medicine
in the ADLD of Figure 2(a) while there are no FPs in theADLD of Figure 2(b) therefore we can intuitively identifythat the two protein sequences in the protein sequence pair(chimpanzee human) are more similar to the two proteinsequences in the protein sequence pair (human gorilla)
Hence from the above descriptions we can know that theADLDs obtained by our newly proposed method are quitevisual and intuitional andmay be a powerful and effective toolfor visual comparison of protein sequences and numericalsequences in other research fields
4 Conclusions
In this paper a novel ADLDs based graphical representationof protein sequences is proposed which is utilized to analyzethe similaritydissimilarity of protein sequences To validatethe performances of the newmethod we select two groups ofwell-known protein sequences as examples and additionallyin order to observe the similaritydissimilarity of proteinsequences more intuitively we construct the phylogenetictrees of protein sequences The results show that our ADLDsbased method not only has good performances and effects inthe similaritydissimilarity analysis of protein sequences butalso does not require complex computation since there areno high dimensional matrixes required Therefore it meansthat our ADLDs based method can work well in the analysisof protein sequences
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgments
The authors thank the anonymous referees for suggestionsthat helped in improving the paper substantially And theproject is partly sponsored by the Colleges and UniversitiesOpen Innovation Platform Fund of Hunan Province (no13K041) the Hunan Provincial Natural Science Foundationof China (no 14JJ2070) the Construct Program of the KeyDiscipline in Hunan province the State Education MinistryScientific Research Foundation for the Returned OverseasChinese Scholars and the Introduced Talent Start-up FundProject of Xiangtan University (no 11QDZ45)
References
[1] K Bucka-Lassen O Caprani and J Hein ldquoCombining manymultiple alignments in one improved alignmentrdquo Bioinformat-ics vol 15 no 2 pp 122ndash130 1999
[2] L Wang and T Jiang ldquoOn the complexity of multiple sequencealignmentrdquo Journal of Computational Biology vol 1 no 4 pp337ndash348 1994
[3] C Shyu L Sheneman and J A Foster ldquoMultiple sequencealignment with evolutionary computationrdquo Genetic Program-ming and Evolvable Machines vol 5 no 2 pp 121ndash144 2004
[4] M Randic ldquoGraphical representations of DNA as 2-D maprdquoChemical Physics Letters vol 386 no 4ndash6 pp 468ndash471 2004
[5] D Bielinska-Waz T Clark P Waz W Nowak and A Nandyldquo2D-dynamic representation of DNA sequencesrdquo ChemicalPhysics Letters vol 442 no 1ndash3 pp 140ndash144 2007
[6] Z Qi and X Qi ldquoNovel 2D graphical representation of DNAsequence based on dual nucleotidesrdquo Chemical Physics Lettersvol 440 no 1ndash3 pp 139ndash144 2007
[7] W Chen B Liao Y Liu and Z Su ldquoA numerical representationof DNA sequences and its applicationsrdquoMATCH Communica-tions in Mathematical and in Computer Chemistry vol 60 no2 pp 291ndash300 2008
[8] M Randic X Guo and S C Basak ldquoOn the characterization ofDNAprimary sequences by triplet of nucleic acid basesrdquo Journalof Chemical Information and Computer Sciences vol 41 no 3pp 619ndash626 2001
[9] M Randic M Vracko M Novic and D Plavsic ldquoSpectralrepresentation of reduced protein modelsrdquo SAR and QSAR inEnvironmental Research vol 20 no 5-6 pp 415ndash427 2009
[10] M Randic K Mehulic D Vukicevic T Pisanski D Vikic-Topic and D Plavsic ldquoGraphical representation of proteins asfour-color maps and the irnumerical characterizationrdquo Journalof Molecular Graphics and Modelling vol 27 no 5 pp 637ndash6412009
[11] F Bai and TWang ldquoOn graphical and numerical representationof protein sequencesrdquo Journal of Biomolecular Structure andDynamics vol 23 no 5 pp 537ndash545 2006
[12] M Randic ldquo2-D Graphical representation of proteins based onphysico-chemical properties of amino acidsrdquo Chemical PhysicsLetters vol 440 no 4ndash6 pp 291ndash295 2007
[13] A Ghosh and A Nandy ldquoGraphical representation and mathe-matical characterization of protein sequences and applicationsto viral proteinsrdquo Advances in Protein Chemistry and StructuralBiology vol 83 pp 1ndash42 2011
[14] C Li X Yu L Yang X Zheng and Z Wang ldquo3-D maps andcoupling numbers for protein sequencesrdquo Physica A StatisticalMechanics and its Applications vol 388 no 9 pp 1967ndash19722009
[15] M Randic J Zupan and D Vikic-Topic ldquoOn representation ofproteins by star-like graphsrdquo Journal of Molecular Graphics andModelling vol 26 no 1 pp 290ndash305 2007
[16] C Li L Xing and X Wang ldquo2-D graphical representation ofprotein sequences and its application to coronavirus phylogenyrdquoJournal of Biochemistry and Molecular Biology vol 41 no 3 pp217ndash222 2008
[17] J Wen and Y Zhang ldquoA 2D graphical representation of proteinsequence and its numerical characterizationrdquo Chemical PhysicsLetters vol 476 no 4ndash6 pp 281ndash286 2009
[18] Z-C Wu X Xiao and K-C Chou ldquo2D-MH a web-server forgenerating graphic representation of protein sequences basedon the physicochemical properties of their constituent aminoacidsrdquo Journal of Theoretical Biology vol 267 no 1 pp 29ndash342010
[19] B Liao X Sun and Q Zeng ldquoA novel method for similarityanalysis and protein sub-cellular localization predictionrdquo Bioin-formatics vol 26 no 21 pp 2678ndash2683 2010
[20] M Novic and M Randic ldquoRepresentation of proteins as walksin 20-D spacerdquo SAR and QSAR in Environmental Research vol19 no 3-4 pp 317ndash337 2008
[21] Z-H Qi J Feng X-Q Qi and L Li ldquoApplication of 2Dgraphic representation of protein sequence based on Huffmantree methodrdquo Computers in Biology and Medicine vol 42 no 5pp 556ndash563 2012
Computational and Mathematical Methods in Medicine 15
[22] H-J Yu and D-S Huang ldquoNovel 20-D descriptors of proteinsequences and itrsquos applications in similarity analysisrdquo ChemicalPhysics Letters vol 531 pp 261ndash266 2012
[23] P-A He J Wei Y Yao and Z Tie ldquoA novel graphical repre-sentation of proteins and its applicationrdquo Physica A StatisticalMechanics and its Applications vol 391 no 1-2 pp 93ndash99 2012
[24] M Randic M Novic andM Vracko ldquoOn novel representationof proteins based on amino acid adjacency matrixrdquo SAR andQSAR in Environmental Research vol 19 no 3-4 pp 339ndash3492008
[25] Z-P Feng andC-T Zhang ldquoA graphic representation of proteinsequence andpredicting the subcellular locations of prokaryoticproteinsrdquo International Journal of Biochemistry and Cell Biologyvol 34 no 3 pp 298ndash307 2002
[26] M Randic J Zupan and A T Balaban ldquoUnique graphicalrepresentation of protein sequences based on nucleotide tripletcodonsrdquo Chemical Physics Letters vol 397 no 1ndash3 pp 247ndash2522004
[27] F Bai and T Wang ldquoA 2-D graphical representation of proteinsequences based on nucleotide triplet codonsrdquoChemical PhysicsLetters vol 413 no 4ndash6 pp 458ndash462 2005
[28] Y-h Yao F Kong Q Dai and P-a He ldquoA sequence-segmentedmethod applied to the similarity analysis of long proteinsequencerdquo MATCH Communications in Mathematical and inComputer Chemistry vol 70 no 1 pp 431ndash450 2013
[29] M I Abo el Maaty M M Abo-Elkhier and M A AbdElwahaab ldquo3D graphical representation of protein sequencesand their statistical characterizationrdquo Physica A StatisticalMechanics and Its Applications vol 389 no 21 pp 4668ndash46762010
[30] M M Abo-Elkhier ldquoSimilaritydissimilarity analysis of proteinsequences using the spatial median as a descriptorrdquo Journal ofBiophysical Chemistry vol 3 pp 142ndash148 2012
[31] A El-Lakkani and S El-Sherif ldquoSimilarity analysis of proteinsequences based on 2D and 3D amino acid adjacencymatricesrdquoChemical Physics Letters vol 590 pp 192ndash195 2013
[32] T Ma Y Liu Q Dai Y Yao and P-A He ldquoA graphicalrepresentation of protein based on a novel iterated functionsystemrdquo Physica A Statistical Mechanics and its Applicationsvol 403 pp 21ndash28 2014
[33] D M Mount Bioinformatics Sequence and Genome AnalysisCold Spring Harbor Laboratory Press Cold Spring Harbor NYUSA 2nd edition 2004
[34] D G Higgins and P M Sharp ldquoCLUSTAL a package forperforming multiple sequence alignment on a microcomputerrdquoGene vol 73 no 1 pp 237ndash244 1988
[35] R C Edgar ldquoMUSCLE multiple sequence alignment with highaccuracy and high throughputrdquo Nucleic Acids Research vol 32no 5 pp 1792ndash1797 2004
[36] K Katoh K Misawa K-I Kuma and T Miyata ldquoMAFFT anovel method for rapid multiple sequence alignment based onfast Fourier transformrdquo Nucleic Acids Research vol 30 no 14pp 3059ndash3066 2002
[37] C Notredame D G Higgins and J Heringa ldquoT-coffee a novelmethod for fast and accurate multiple sequence alignmentrdquoJournal of Molecular Biology vol 302 no 1 pp 205ndash217 2000
[38] MA BalseraWWriggers YOono andK Schulten ldquoPrincipalcomponent analysis and long time protein dynamicsrdquo Journal ofPhysical Chemistry vol 100 no 7 pp 2567ndash2572 1996
[39] B Hess ldquoSimilarities between principal components of proteindynamics and random diffusionrdquo Physical Review EmdashStatistical
Physics Plasmas Fluids and Related Interdisciplinary Topicsvol 62 no 6B pp 8438ndash8448 2000
[40] A L Tournier and J C Smith ldquoPrincipal components of theprotein dynamical transitionrdquo Physical Review Letters vol 91no 20 Article ID 208106 2003
[41] K Tamura G Stecher D Peterson A Filipski and S KumarldquoMEGA6molecular evolutionary genetics analysis version 60rdquoMolecular Biology and Evolution vol 30 no 12 pp 2725ndash27292013
Computational and Mathematical Methods in Medicine 13
Fin-whale
Rabbit
Blue-whale
Sheep
Goat
Hare
Rat
Lemur
Mouse
Cattle
Opossum
Gorilla
Human
Pi-chim
C-chim
Gallus
00006
00037
00004
00004
00001
00002
00000
00001
00074
00002
00015
00000
00001
00011
00183
00001
00006
00006
00004
00005
00002
00005
00031
00008
00024
00020
00039
00060
00024
00086
Figure 6 The phylogenetic tree of the 16 ND5 proteins constructed by adopting the method proposed by Wen and Zhang
700
600
500
400
300
200
100
07006005004003002001000
(a)
700
600
500
400
300
200
100
07006005004003002001000
(b)
700
600
500
400
300
200
100
07006005004003002001000
(c)
Figure 7 (a) The ADLD of the similar pair (human gorilla) (b) the ADLD of the dissimilar pair (human opossum) (c) the ADLD of thedissimilar pair (human gallus)
14 Computational and Mathematical Methods in Medicine
in the ADLD of Figure 2(a) while there are no FPs in theADLD of Figure 2(b) therefore we can intuitively identifythat the two protein sequences in the protein sequence pair(chimpanzee human) are more similar to the two proteinsequences in the protein sequence pair (human gorilla)
Hence from the above descriptions we can know that theADLDs obtained by our newly proposed method are quitevisual and intuitional andmay be a powerful and effective toolfor visual comparison of protein sequences and numericalsequences in other research fields
4 Conclusions
In this paper a novel ADLDs based graphical representationof protein sequences is proposed which is utilized to analyzethe similaritydissimilarity of protein sequences To validatethe performances of the newmethod we select two groups ofwell-known protein sequences as examples and additionallyin order to observe the similaritydissimilarity of proteinsequences more intuitively we construct the phylogenetictrees of protein sequences The results show that our ADLDsbased method not only has good performances and effects inthe similaritydissimilarity analysis of protein sequences butalso does not require complex computation since there areno high dimensional matrixes required Therefore it meansthat our ADLDs based method can work well in the analysisof protein sequences
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgments
The authors thank the anonymous referees for suggestionsthat helped in improving the paper substantially And theproject is partly sponsored by the Colleges and UniversitiesOpen Innovation Platform Fund of Hunan Province (no13K041) the Hunan Provincial Natural Science Foundationof China (no 14JJ2070) the Construct Program of the KeyDiscipline in Hunan province the State Education MinistryScientific Research Foundation for the Returned OverseasChinese Scholars and the Introduced Talent Start-up FundProject of Xiangtan University (no 11QDZ45)
References
[1] K Bucka-Lassen O Caprani and J Hein ldquoCombining manymultiple alignments in one improved alignmentrdquo Bioinformat-ics vol 15 no 2 pp 122ndash130 1999
[2] L Wang and T Jiang ldquoOn the complexity of multiple sequencealignmentrdquo Journal of Computational Biology vol 1 no 4 pp337ndash348 1994
[3] C Shyu L Sheneman and J A Foster ldquoMultiple sequencealignment with evolutionary computationrdquo Genetic Program-ming and Evolvable Machines vol 5 no 2 pp 121ndash144 2004
[4] M Randic ldquoGraphical representations of DNA as 2-D maprdquoChemical Physics Letters vol 386 no 4ndash6 pp 468ndash471 2004
[5] D Bielinska-Waz T Clark P Waz W Nowak and A Nandyldquo2D-dynamic representation of DNA sequencesrdquo ChemicalPhysics Letters vol 442 no 1ndash3 pp 140ndash144 2007
[6] Z Qi and X Qi ldquoNovel 2D graphical representation of DNAsequence based on dual nucleotidesrdquo Chemical Physics Lettersvol 440 no 1ndash3 pp 139ndash144 2007
[7] W Chen B Liao Y Liu and Z Su ldquoA numerical representationof DNA sequences and its applicationsrdquoMATCH Communica-tions in Mathematical and in Computer Chemistry vol 60 no2 pp 291ndash300 2008
[8] M Randic X Guo and S C Basak ldquoOn the characterization ofDNAprimary sequences by triplet of nucleic acid basesrdquo Journalof Chemical Information and Computer Sciences vol 41 no 3pp 619ndash626 2001
[9] M Randic M Vracko M Novic and D Plavsic ldquoSpectralrepresentation of reduced protein modelsrdquo SAR and QSAR inEnvironmental Research vol 20 no 5-6 pp 415ndash427 2009
[10] M Randic K Mehulic D Vukicevic T Pisanski D Vikic-Topic and D Plavsic ldquoGraphical representation of proteins asfour-color maps and the irnumerical characterizationrdquo Journalof Molecular Graphics and Modelling vol 27 no 5 pp 637ndash6412009
[11] F Bai and TWang ldquoOn graphical and numerical representationof protein sequencesrdquo Journal of Biomolecular Structure andDynamics vol 23 no 5 pp 537ndash545 2006
[12] M Randic ldquo2-D Graphical representation of proteins based onphysico-chemical properties of amino acidsrdquo Chemical PhysicsLetters vol 440 no 4ndash6 pp 291ndash295 2007
[13] A Ghosh and A Nandy ldquoGraphical representation and mathe-matical characterization of protein sequences and applicationsto viral proteinsrdquo Advances in Protein Chemistry and StructuralBiology vol 83 pp 1ndash42 2011
[14] C Li X Yu L Yang X Zheng and Z Wang ldquo3-D maps andcoupling numbers for protein sequencesrdquo Physica A StatisticalMechanics and its Applications vol 388 no 9 pp 1967ndash19722009
[15] M Randic J Zupan and D Vikic-Topic ldquoOn representation ofproteins by star-like graphsrdquo Journal of Molecular Graphics andModelling vol 26 no 1 pp 290ndash305 2007
[16] C Li L Xing and X Wang ldquo2-D graphical representation ofprotein sequences and its application to coronavirus phylogenyrdquoJournal of Biochemistry and Molecular Biology vol 41 no 3 pp217ndash222 2008
[17] J Wen and Y Zhang ldquoA 2D graphical representation of proteinsequence and its numerical characterizationrdquo Chemical PhysicsLetters vol 476 no 4ndash6 pp 281ndash286 2009
[18] Z-C Wu X Xiao and K-C Chou ldquo2D-MH a web-server forgenerating graphic representation of protein sequences basedon the physicochemical properties of their constituent aminoacidsrdquo Journal of Theoretical Biology vol 267 no 1 pp 29ndash342010
[19] B Liao X Sun and Q Zeng ldquoA novel method for similarityanalysis and protein sub-cellular localization predictionrdquo Bioin-formatics vol 26 no 21 pp 2678ndash2683 2010
[20] M Novic and M Randic ldquoRepresentation of proteins as walksin 20-D spacerdquo SAR and QSAR in Environmental Research vol19 no 3-4 pp 317ndash337 2008
[21] Z-H Qi J Feng X-Q Qi and L Li ldquoApplication of 2Dgraphic representation of protein sequence based on Huffmantree methodrdquo Computers in Biology and Medicine vol 42 no 5pp 556ndash563 2012
Computational and Mathematical Methods in Medicine 15
[22] H-J Yu and D-S Huang ldquoNovel 20-D descriptors of proteinsequences and itrsquos applications in similarity analysisrdquo ChemicalPhysics Letters vol 531 pp 261ndash266 2012
[23] P-A He J Wei Y Yao and Z Tie ldquoA novel graphical repre-sentation of proteins and its applicationrdquo Physica A StatisticalMechanics and its Applications vol 391 no 1-2 pp 93ndash99 2012
[24] M Randic M Novic andM Vracko ldquoOn novel representationof proteins based on amino acid adjacency matrixrdquo SAR andQSAR in Environmental Research vol 19 no 3-4 pp 339ndash3492008
[25] Z-P Feng andC-T Zhang ldquoA graphic representation of proteinsequence andpredicting the subcellular locations of prokaryoticproteinsrdquo International Journal of Biochemistry and Cell Biologyvol 34 no 3 pp 298ndash307 2002
[26] M Randic J Zupan and A T Balaban ldquoUnique graphicalrepresentation of protein sequences based on nucleotide tripletcodonsrdquo Chemical Physics Letters vol 397 no 1ndash3 pp 247ndash2522004
[27] F Bai and T Wang ldquoA 2-D graphical representation of proteinsequences based on nucleotide triplet codonsrdquoChemical PhysicsLetters vol 413 no 4ndash6 pp 458ndash462 2005
[28] Y-h Yao F Kong Q Dai and P-a He ldquoA sequence-segmentedmethod applied to the similarity analysis of long proteinsequencerdquo MATCH Communications in Mathematical and inComputer Chemistry vol 70 no 1 pp 431ndash450 2013
[29] M I Abo el Maaty M M Abo-Elkhier and M A AbdElwahaab ldquo3D graphical representation of protein sequencesand their statistical characterizationrdquo Physica A StatisticalMechanics and Its Applications vol 389 no 21 pp 4668ndash46762010
[30] M M Abo-Elkhier ldquoSimilaritydissimilarity analysis of proteinsequences using the spatial median as a descriptorrdquo Journal ofBiophysical Chemistry vol 3 pp 142ndash148 2012
[31] A El-Lakkani and S El-Sherif ldquoSimilarity analysis of proteinsequences based on 2D and 3D amino acid adjacencymatricesrdquoChemical Physics Letters vol 590 pp 192ndash195 2013
[32] T Ma Y Liu Q Dai Y Yao and P-A He ldquoA graphicalrepresentation of protein based on a novel iterated functionsystemrdquo Physica A Statistical Mechanics and its Applicationsvol 403 pp 21ndash28 2014
[33] D M Mount Bioinformatics Sequence and Genome AnalysisCold Spring Harbor Laboratory Press Cold Spring Harbor NYUSA 2nd edition 2004
[34] D G Higgins and P M Sharp ldquoCLUSTAL a package forperforming multiple sequence alignment on a microcomputerrdquoGene vol 73 no 1 pp 237ndash244 1988
[35] R C Edgar ldquoMUSCLE multiple sequence alignment with highaccuracy and high throughputrdquo Nucleic Acids Research vol 32no 5 pp 1792ndash1797 2004
[36] K Katoh K Misawa K-I Kuma and T Miyata ldquoMAFFT anovel method for rapid multiple sequence alignment based onfast Fourier transformrdquo Nucleic Acids Research vol 30 no 14pp 3059ndash3066 2002
[37] C Notredame D G Higgins and J Heringa ldquoT-coffee a novelmethod for fast and accurate multiple sequence alignmentrdquoJournal of Molecular Biology vol 302 no 1 pp 205ndash217 2000
[38] MA BalseraWWriggers YOono andK Schulten ldquoPrincipalcomponent analysis and long time protein dynamicsrdquo Journal ofPhysical Chemistry vol 100 no 7 pp 2567ndash2572 1996
[39] B Hess ldquoSimilarities between principal components of proteindynamics and random diffusionrdquo Physical Review EmdashStatistical
Physics Plasmas Fluids and Related Interdisciplinary Topicsvol 62 no 6B pp 8438ndash8448 2000
[40] A L Tournier and J C Smith ldquoPrincipal components of theprotein dynamical transitionrdquo Physical Review Letters vol 91no 20 Article ID 208106 2003
[41] K Tamura G Stecher D Peterson A Filipski and S KumarldquoMEGA6molecular evolutionary genetics analysis version 60rdquoMolecular Biology and Evolution vol 30 no 12 pp 2725ndash27292013
14 Computational and Mathematical Methods in Medicine
in the ADLD of Figure 2(a) while there are no FPs in theADLD of Figure 2(b) therefore we can intuitively identifythat the two protein sequences in the protein sequence pair(chimpanzee human) are more similar to the two proteinsequences in the protein sequence pair (human gorilla)
Hence from the above descriptions we can know that theADLDs obtained by our newly proposed method are quitevisual and intuitional andmay be a powerful and effective toolfor visual comparison of protein sequences and numericalsequences in other research fields
4 Conclusions
In this paper a novel ADLDs based graphical representationof protein sequences is proposed which is utilized to analyzethe similaritydissimilarity of protein sequences To validatethe performances of the newmethod we select two groups ofwell-known protein sequences as examples and additionallyin order to observe the similaritydissimilarity of proteinsequences more intuitively we construct the phylogenetictrees of protein sequences The results show that our ADLDsbased method not only has good performances and effects inthe similaritydissimilarity analysis of protein sequences butalso does not require complex computation since there areno high dimensional matrixes required Therefore it meansthat our ADLDs based method can work well in the analysisof protein sequences
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgments
The authors thank the anonymous referees for suggestionsthat helped in improving the paper substantially And theproject is partly sponsored by the Colleges and UniversitiesOpen Innovation Platform Fund of Hunan Province (no13K041) the Hunan Provincial Natural Science Foundationof China (no 14JJ2070) the Construct Program of the KeyDiscipline in Hunan province the State Education MinistryScientific Research Foundation for the Returned OverseasChinese Scholars and the Introduced Talent Start-up FundProject of Xiangtan University (no 11QDZ45)
References
[1] K Bucka-Lassen O Caprani and J Hein ldquoCombining manymultiple alignments in one improved alignmentrdquo Bioinformat-ics vol 15 no 2 pp 122ndash130 1999
[2] L Wang and T Jiang ldquoOn the complexity of multiple sequencealignmentrdquo Journal of Computational Biology vol 1 no 4 pp337ndash348 1994
[3] C Shyu L Sheneman and J A Foster ldquoMultiple sequencealignment with evolutionary computationrdquo Genetic Program-ming and Evolvable Machines vol 5 no 2 pp 121ndash144 2004
[4] M Randic ldquoGraphical representations of DNA as 2-D maprdquoChemical Physics Letters vol 386 no 4ndash6 pp 468ndash471 2004
[5] D Bielinska-Waz T Clark P Waz W Nowak and A Nandyldquo2D-dynamic representation of DNA sequencesrdquo ChemicalPhysics Letters vol 442 no 1ndash3 pp 140ndash144 2007
[6] Z Qi and X Qi ldquoNovel 2D graphical representation of DNAsequence based on dual nucleotidesrdquo Chemical Physics Lettersvol 440 no 1ndash3 pp 139ndash144 2007
[7] W Chen B Liao Y Liu and Z Su ldquoA numerical representationof DNA sequences and its applicationsrdquoMATCH Communica-tions in Mathematical and in Computer Chemistry vol 60 no2 pp 291ndash300 2008
[8] M Randic X Guo and S C Basak ldquoOn the characterization ofDNAprimary sequences by triplet of nucleic acid basesrdquo Journalof Chemical Information and Computer Sciences vol 41 no 3pp 619ndash626 2001
[9] M Randic M Vracko M Novic and D Plavsic ldquoSpectralrepresentation of reduced protein modelsrdquo SAR and QSAR inEnvironmental Research vol 20 no 5-6 pp 415ndash427 2009
[10] M Randic K Mehulic D Vukicevic T Pisanski D Vikic-Topic and D Plavsic ldquoGraphical representation of proteins asfour-color maps and the irnumerical characterizationrdquo Journalof Molecular Graphics and Modelling vol 27 no 5 pp 637ndash6412009
[11] F Bai and TWang ldquoOn graphical and numerical representationof protein sequencesrdquo Journal of Biomolecular Structure andDynamics vol 23 no 5 pp 537ndash545 2006
[12] M Randic ldquo2-D Graphical representation of proteins based onphysico-chemical properties of amino acidsrdquo Chemical PhysicsLetters vol 440 no 4ndash6 pp 291ndash295 2007
[13] A Ghosh and A Nandy ldquoGraphical representation and mathe-matical characterization of protein sequences and applicationsto viral proteinsrdquo Advances in Protein Chemistry and StructuralBiology vol 83 pp 1ndash42 2011
[14] C Li X Yu L Yang X Zheng and Z Wang ldquo3-D maps andcoupling numbers for protein sequencesrdquo Physica A StatisticalMechanics and its Applications vol 388 no 9 pp 1967ndash19722009
[15] M Randic J Zupan and D Vikic-Topic ldquoOn representation ofproteins by star-like graphsrdquo Journal of Molecular Graphics andModelling vol 26 no 1 pp 290ndash305 2007
[16] C Li L Xing and X Wang ldquo2-D graphical representation ofprotein sequences and its application to coronavirus phylogenyrdquoJournal of Biochemistry and Molecular Biology vol 41 no 3 pp217ndash222 2008
[17] J Wen and Y Zhang ldquoA 2D graphical representation of proteinsequence and its numerical characterizationrdquo Chemical PhysicsLetters vol 476 no 4ndash6 pp 281ndash286 2009
[18] Z-C Wu X Xiao and K-C Chou ldquo2D-MH a web-server forgenerating graphic representation of protein sequences basedon the physicochemical properties of their constituent aminoacidsrdquo Journal of Theoretical Biology vol 267 no 1 pp 29ndash342010
[19] B Liao X Sun and Q Zeng ldquoA novel method for similarityanalysis and protein sub-cellular localization predictionrdquo Bioin-formatics vol 26 no 21 pp 2678ndash2683 2010
[20] M Novic and M Randic ldquoRepresentation of proteins as walksin 20-D spacerdquo SAR and QSAR in Environmental Research vol19 no 3-4 pp 317ndash337 2008
[21] Z-H Qi J Feng X-Q Qi and L Li ldquoApplication of 2Dgraphic representation of protein sequence based on Huffmantree methodrdquo Computers in Biology and Medicine vol 42 no 5pp 556ndash563 2012
Computational and Mathematical Methods in Medicine 15
[22] H-J Yu and D-S Huang ldquoNovel 20-D descriptors of proteinsequences and itrsquos applications in similarity analysisrdquo ChemicalPhysics Letters vol 531 pp 261ndash266 2012
[23] P-A He J Wei Y Yao and Z Tie ldquoA novel graphical repre-sentation of proteins and its applicationrdquo Physica A StatisticalMechanics and its Applications vol 391 no 1-2 pp 93ndash99 2012
[24] M Randic M Novic andM Vracko ldquoOn novel representationof proteins based on amino acid adjacency matrixrdquo SAR andQSAR in Environmental Research vol 19 no 3-4 pp 339ndash3492008
[25] Z-P Feng andC-T Zhang ldquoA graphic representation of proteinsequence andpredicting the subcellular locations of prokaryoticproteinsrdquo International Journal of Biochemistry and Cell Biologyvol 34 no 3 pp 298ndash307 2002
[26] M Randic J Zupan and A T Balaban ldquoUnique graphicalrepresentation of protein sequences based on nucleotide tripletcodonsrdquo Chemical Physics Letters vol 397 no 1ndash3 pp 247ndash2522004
[27] F Bai and T Wang ldquoA 2-D graphical representation of proteinsequences based on nucleotide triplet codonsrdquoChemical PhysicsLetters vol 413 no 4ndash6 pp 458ndash462 2005
[28] Y-h Yao F Kong Q Dai and P-a He ldquoA sequence-segmentedmethod applied to the similarity analysis of long proteinsequencerdquo MATCH Communications in Mathematical and inComputer Chemistry vol 70 no 1 pp 431ndash450 2013
[29] M I Abo el Maaty M M Abo-Elkhier and M A AbdElwahaab ldquo3D graphical representation of protein sequencesand their statistical characterizationrdquo Physica A StatisticalMechanics and Its Applications vol 389 no 21 pp 4668ndash46762010
[30] M M Abo-Elkhier ldquoSimilaritydissimilarity analysis of proteinsequences using the spatial median as a descriptorrdquo Journal ofBiophysical Chemistry vol 3 pp 142ndash148 2012
[31] A El-Lakkani and S El-Sherif ldquoSimilarity analysis of proteinsequences based on 2D and 3D amino acid adjacencymatricesrdquoChemical Physics Letters vol 590 pp 192ndash195 2013
[32] T Ma Y Liu Q Dai Y Yao and P-A He ldquoA graphicalrepresentation of protein based on a novel iterated functionsystemrdquo Physica A Statistical Mechanics and its Applicationsvol 403 pp 21ndash28 2014
[33] D M Mount Bioinformatics Sequence and Genome AnalysisCold Spring Harbor Laboratory Press Cold Spring Harbor NYUSA 2nd edition 2004
[34] D G Higgins and P M Sharp ldquoCLUSTAL a package forperforming multiple sequence alignment on a microcomputerrdquoGene vol 73 no 1 pp 237ndash244 1988
[35] R C Edgar ldquoMUSCLE multiple sequence alignment with highaccuracy and high throughputrdquo Nucleic Acids Research vol 32no 5 pp 1792ndash1797 2004
[36] K Katoh K Misawa K-I Kuma and T Miyata ldquoMAFFT anovel method for rapid multiple sequence alignment based onfast Fourier transformrdquo Nucleic Acids Research vol 30 no 14pp 3059ndash3066 2002
[37] C Notredame D G Higgins and J Heringa ldquoT-coffee a novelmethod for fast and accurate multiple sequence alignmentrdquoJournal of Molecular Biology vol 302 no 1 pp 205ndash217 2000
[38] MA BalseraWWriggers YOono andK Schulten ldquoPrincipalcomponent analysis and long time protein dynamicsrdquo Journal ofPhysical Chemistry vol 100 no 7 pp 2567ndash2572 1996
[39] B Hess ldquoSimilarities between principal components of proteindynamics and random diffusionrdquo Physical Review EmdashStatistical
Physics Plasmas Fluids and Related Interdisciplinary Topicsvol 62 no 6B pp 8438ndash8448 2000
[40] A L Tournier and J C Smith ldquoPrincipal components of theprotein dynamical transitionrdquo Physical Review Letters vol 91no 20 Article ID 208106 2003
[41] K Tamura G Stecher D Peterson A Filipski and S KumarldquoMEGA6molecular evolutionary genetics analysis version 60rdquoMolecular Biology and Evolution vol 30 no 12 pp 2725ndash27292013
Computational and Mathematical Methods in Medicine 15
[22] H-J Yu and D-S Huang ldquoNovel 20-D descriptors of proteinsequences and itrsquos applications in similarity analysisrdquo ChemicalPhysics Letters vol 531 pp 261ndash266 2012
[23] P-A He J Wei Y Yao and Z Tie ldquoA novel graphical repre-sentation of proteins and its applicationrdquo Physica A StatisticalMechanics and its Applications vol 391 no 1-2 pp 93ndash99 2012
[24] M Randic M Novic andM Vracko ldquoOn novel representationof proteins based on amino acid adjacency matrixrdquo SAR andQSAR in Environmental Research vol 19 no 3-4 pp 339ndash3492008
[25] Z-P Feng andC-T Zhang ldquoA graphic representation of proteinsequence andpredicting the subcellular locations of prokaryoticproteinsrdquo International Journal of Biochemistry and Cell Biologyvol 34 no 3 pp 298ndash307 2002
[26] M Randic J Zupan and A T Balaban ldquoUnique graphicalrepresentation of protein sequences based on nucleotide tripletcodonsrdquo Chemical Physics Letters vol 397 no 1ndash3 pp 247ndash2522004
[27] F Bai and T Wang ldquoA 2-D graphical representation of proteinsequences based on nucleotide triplet codonsrdquoChemical PhysicsLetters vol 413 no 4ndash6 pp 458ndash462 2005
[28] Y-h Yao F Kong Q Dai and P-a He ldquoA sequence-segmentedmethod applied to the similarity analysis of long proteinsequencerdquo MATCH Communications in Mathematical and inComputer Chemistry vol 70 no 1 pp 431ndash450 2013
[29] M I Abo el Maaty M M Abo-Elkhier and M A AbdElwahaab ldquo3D graphical representation of protein sequencesand their statistical characterizationrdquo Physica A StatisticalMechanics and Its Applications vol 389 no 21 pp 4668ndash46762010
[30] M M Abo-Elkhier ldquoSimilaritydissimilarity analysis of proteinsequences using the spatial median as a descriptorrdquo Journal ofBiophysical Chemistry vol 3 pp 142ndash148 2012
[31] A El-Lakkani and S El-Sherif ldquoSimilarity analysis of proteinsequences based on 2D and 3D amino acid adjacencymatricesrdquoChemical Physics Letters vol 590 pp 192ndash195 2013
[32] T Ma Y Liu Q Dai Y Yao and P-A He ldquoA graphicalrepresentation of protein based on a novel iterated functionsystemrdquo Physica A Statistical Mechanics and its Applicationsvol 403 pp 21ndash28 2014
[33] D M Mount Bioinformatics Sequence and Genome AnalysisCold Spring Harbor Laboratory Press Cold Spring Harbor NYUSA 2nd edition 2004
[34] D G Higgins and P M Sharp ldquoCLUSTAL a package forperforming multiple sequence alignment on a microcomputerrdquoGene vol 73 no 1 pp 237ndash244 1988
[35] R C Edgar ldquoMUSCLE multiple sequence alignment with highaccuracy and high throughputrdquo Nucleic Acids Research vol 32no 5 pp 1792ndash1797 2004
[36] K Katoh K Misawa K-I Kuma and T Miyata ldquoMAFFT anovel method for rapid multiple sequence alignment based onfast Fourier transformrdquo Nucleic Acids Research vol 30 no 14pp 3059ndash3066 2002
[37] C Notredame D G Higgins and J Heringa ldquoT-coffee a novelmethod for fast and accurate multiple sequence alignmentrdquoJournal of Molecular Biology vol 302 no 1 pp 205ndash217 2000
[38] MA BalseraWWriggers YOono andK Schulten ldquoPrincipalcomponent analysis and long time protein dynamicsrdquo Journal ofPhysical Chemistry vol 100 no 7 pp 2567ndash2572 1996
[39] B Hess ldquoSimilarities between principal components of proteindynamics and random diffusionrdquo Physical Review EmdashStatistical
Physics Plasmas Fluids and Related Interdisciplinary Topicsvol 62 no 6B pp 8438ndash8448 2000
[40] A L Tournier and J C Smith ldquoPrincipal components of theprotein dynamical transitionrdquo Physical Review Letters vol 91no 20 Article ID 208106 2003
[41] K Tamura G Stecher D Peterson A Filipski and S KumarldquoMEGA6molecular evolutionary genetics analysis version 60rdquoMolecular Biology and Evolution vol 30 no 12 pp 2725ndash27292013