
    World Applied Programming, Vol (1), No (3), August 2011. 215-227


    On Pseudorandom Number Generation from

    Programmable and Computable Biomolecules:

    Deoxyribonucleic (DNA) as a Novel Pseudorandom Number Generator

    Okunoye Babatunde O.

    Department of Pure and Applied Biology,

    Ladoke Akintola University of Technology,

    P.M.B. 4000, Ogbomoso, Nigeria


    Abstract: Deoxyribonucleic acid (DNA) computing has extended the frontiers of computer science. The field has made possible the creation of programmable DNA molecules that perform complex calculations and of autonomous DNA machines. Random numbers are a sequence of numbers that lacks any pattern. A random number generator (RNG) is a computational device designed to generate random numbers. The many applications of random numbers include cryptography, statistical sampling, numerical simulation of physical and biological systems, and lotteries. Random numbers are usually generated by sampling entropy in physical phenomena and processing it through a computer. Examples of such phenomena include a radioactive source, atmospheric noise, and quantum mechanics. Random numbers are hard to characterize mathematically: although several statistical tests exist to verify the absence of certain patterns in a stream of numbers, no finite set of tests can characterize randomness, as there may be patterns not considered by such tests. This paper discusses the generation of pseudorandom numbers from DNA Watson-Crick units. 1,000 numbers from an experimental DNA segment passed two statistical tests for randomness. This work also reports what might be a breakthrough in DNA structural analysis: the Poisson distribution of DNA bases and amino acids.

    Key words: Pseudorandom number generator, DNA computing, Poisson distribution

    I. INTRODUCTION

    Alan Turing [1] was the first to describe a computer model in computability theory, known as the Turing machine. Deoxyribonucleic acid (DNA) has been used in the implementation of the Turing machine [2] [3] and as such DNA is now accepted as a standard computer model. This has resulted in the development of DNA computing, which has been employed in the resolution of numerous computations, especially certain NP-hard problems [4] [5] [6] [7] [8]. Of special importance also is the creation of programmable and autonomous DNA computers [9] [10] [11] [12] [13] that have shown the promise and power of molecular computing. Here, DNA molecules are used to demonstrate the generation of a sequence of pseudorandom numbers.

    Random numbers are a sequence of numbers that lack any pattern and have applications in cryptography, gambling and numerical simulation of physical and biological systems [14]. A random number generator is a computational device designed to generate random numbers. Random numbers are hard to characterize mathematically [15]. Statistical tests exist for verifying the absence of certain patterns in a sequence of numbers [16] [17], but there is no complete set of tests for randomness, as there may be patterns not covered by such tests. Consequently, it is practically useful to distinguish true random number generators from pseudorandom number generators. A true random number generator employs entropy in physical phenomena, especially quantum mechanics [14], to generate numbers that are totally unpredictable, while a pseudorandom number generator generates numbers that are computed using a mathematical algorithm or taken from a previously calculated list. Random number generators are often imperfect [18] [19] [20] [21] [22]; for example, certain pseudorandom number generators are deterministic in nature, yet produce results that satisfy all the randomness tests [23]. This paper discusses DNA as a pseudorandom number generator, where 1,000 numbers generated from DNA strands passed two statistical tests for randomness. The paper also reports the Poisson distribution of DNA nucleotide bases and amino acids, which might be an important secondary finding.

    ISSN: 2222-2510

    2011 WAP journal. www.waprogramming.com

    II. MATERIALS AND METHODS

    A. Number Generation

    The bacteriophage T4 DNA [24] sequence was obtained from GenBank, the international institutional microbial genome depository, with accession number AF158101. The base sequence employed in the experiment is the complement 168,900 to 167,900 (in the 5′ to 3′ direction). A total of 1,000 DNA segments comprising ten bases each of T4 phage DNA were used to generate a total of 1,000 numbers. For example, the first ten bases of the related complement gave the following digits:

    Number of Adenine Bases: 4 Number of Guanine Bases: 1

    Number of Thymine Bases: 3 Number of Cytosine Bases: 2

    The next ten bases in the complement gave similarly:

    Number of Adenine Bases: 0 Number of Guanine Bases: 2

    Number of Thymine Bases: 5 Number of Cytosine Bases: 2

    Therefore the first numbers in our sequence of 1,000 numbers are 41 and 52: 41 comes from the purine counts of the first segment (4 adenine, 1 guanine) and 52 from the pyrimidine counts of the second segment (5 thymine, 2 cytosine). This process is repeated for the rest of the complement. The numbers of purine and pyrimidine bases in the related complement 168,900 to 167,900 were used alternately per base segment (10 bases) in order to preserve independence of the numbers, a quality of random numbers. The purine and pyrimidine numbers were not generated from the same segment (10 bases) of the sequence, as this would compromise the independence of the number sequence: it would then be possible to predict something about the next number in the sequence from the preceding number. This is because the numeric digits within a segment always sum to ten, the number of bases in the experimental DNA segments used.
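The procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `dna_to_numbers` is hypothetical, the two counts are combined as `10 * first + second`, and the toy sequence is made up so that its counts reproduce the digits 41 and 52 from the example; it is not the actual T4 phage sequence.

```python
from typing import List


def dna_to_numbers(sequence: str, segment_len: int = 10) -> List[int]:
    """Turn a DNA string into numbers, one per segment of `segment_len` bases.

    Even-indexed segments contribute their purine counts (adenine, guanine);
    odd-indexed segments contribute their pyrimidine counts (thymine, cytosine).
    Combining the two counts as 10 * first + second is an assumption.
    """
    seq = sequence.upper()
    usable = len(seq) - len(seq) % segment_len  # ignore a trailing partial segment
    numbers = []
    for index, start in enumerate(range(0, usable, segment_len)):
        segment = seq[start:start + segment_len]
        if index % 2 == 0:
            first, second = segment.count("A"), segment.count("G")
        else:
            first, second = segment.count("T"), segment.count("C")
        numbers.append(10 * first + second)
    return numbers


# Made-up toy sequence (not T4 phage DNA): two segments of ten bases each.
toy_sequence = "AAAAGTTTCC" + "GGGTTTTTCC"
print(dna_to_numbers(toy_sequence))  # [41, 52]
```

Alternating the base class between segments is what keeps the two digits of one number from constraining the digits of the next.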

    B. Statistical Tests

    The properties of random numbers include:

    Uniformity: The numbers should be distributed uniformly.

    Independence: It should be impossible to predict anything about the next value in the sequence from the previous value(s).

    Summation: The sum of two consecutive numbers in the sequence is equally likely to be even or odd; the probability of each ought to be 0.5.

    Duplication: Some numbers will be duplicated in the sequence, while others will be omitted.

    However, the properties of summation and duplication are more binding than those of uniformity and independence. A sequence of numbers that has uniformity and independence is not random unless it also displays the properties of summation and duplication.

    There exist statistical tests to evaluate random numbers and random number generators [16] [17]. Three tests were chosen to evaluate the numbers generated from T4 phage DNA: the chi-squared test, the reverse arrangements test and the test of runs above and below the median.

    The chi-squared test: The chi-squared test is a test of distributional accuracy. It is a parametric test, which makes assumptions about the distribution of the number set. It measures how closely a set of numbers follows the uniform distribution. The chi-squared test is a common statistical test used in the evaluation of random numbers.


    The test of runs above and below the median: The test of runs is a non-parametric test, as it does not make assumptions about the distribution of the number set. The test is especially useful in detecting fluctuating trends in data, the existence of which indicates non-randomness. The test examines the order in which the numbers are generated, checking for trends and cyclical patterns which may indicate a bias in the software producing the sequence of numbers, hence non-randomness.

    Observations in the number sequence greater than the median of the sequence are assigned the letter a, while observations less than the median are assigned the letter b. Observations equal to the median are omitted. Accordingly, $u$, the total number of runs, $n_1$, the number of a's, and $n_2$, the number of b's, are calculated.

    If the sequence shows mostly a's at first and then mostly b's, this suggests a downward trend in the data. Similarly, a sequence showing mostly b's at first and then mostly a's suggests an upward trend in the data.

    When the test of runs above and below the median is applied to a set of numbers, the mean $\mu_u$, the standard deviation $\sigma_u$ and the z-score are calculated.

    The reverse arrangements test: The reverse arrangements test is also useful in detecting bias in the software generating the random number sequence. This particular test detects monotonic (gradual and continuous) trends in the sequence, which also indicate non-randomness. Restrictions in the reverse arrangements distribution tables necessitated the use of a sample size of 100 for each of five experiments in order to determine the average.

    C. Summary Statistics

    The objective of summary statistics is to provide an overview of the main parameters of the data: the count (quantity of numbers), the mean, the median, and the maximum and minimum values in the data set. It ensures familiarity with the data set.

    III. SUMMARY STATISTICS

    A. Summary Statistics for Numbers Generated from T4 Phage DNA

    Count   Mean   Median   Max   Min   P(Even)   Number of duplications   Number of non-occurring values
    1000    33.8   36       82    0     0.496     40                       40

    The P(Even) value shows that the summation property of the generated numbers holds: the probability of the sum of two numbers being even is 0.496 (approximately 0.5). The number of duplications and non-occurring values, considering the minimum and maximum values of 0 and 82 respectively, suggests randomness.
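As an illustration of how such a summary could be computed, here is a minimal Python sketch. The paper does not define the quantities precisely, so several choices are assumptions: P(Even) is taken over overlapping pairs of adjacent numbers, a duplication is a distinct value that occurs more than once, and non-occurring values are counted over the integer range from the minimum to the maximum. The function name `summarize` and the sample list are hypothetical.

```python
from collections import Counter
from statistics import mean, median


def summarize(numbers):
    """Return summary statistics for a list of integers (at least two values)."""
    # Summation property: fraction of adjacent pairs whose sum is even.
    pairs = list(zip(numbers, numbers[1:]))
    p_even = sum((a + b) % 2 == 0 for a, b in pairs) / len(pairs)

    counts = Counter(numbers)
    lowest, highest = min(numbers), max(numbers)
    return {
        "count": len(numbers),
        "mean": mean(numbers),
        "median": median(numbers),
        "max": highest,
        "min": lowest,
        "p_even": p_even,
        # Distinct values that appear more than once (assumed definition).
        "duplicated_values": sum(1 for c in counts.values() if c > 1),
        # Integers between min and max that never appear (assumed definition).
        "non_occurring_values": sum(
            1 for v in range(lowest, highest + 1) if v not in counts
        ),
    }


sample = [1, 3, 4, 4, 7]  # made-up data, not the T4 phage numbers
print(summarize(sample))
```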

    B. The Chi-Squared Test

    Observed and Expected Frequencies for the Chi-Squared Test

    Category   Observed frequency ($O_i$)   Expected frequency ($E_i$)
    0.100      31                           111.1
    0.200      107                          111.1
    0.300      240                          111.1
    0.400      241                          111.1
    0.500      205                          111.1
    0.600      113                          111.1
    0.700      47                           111.1
    0.800      13                           111.1
    0.900      3                            111.1
    Total      1000                         999.9


    Ho: The numbers follow a uniform distribution.

    HA: The numbers follow another distribution.

    Test statistic: $\chi^2 = \sum_{i=1}^{K} \frac{(O_i - E_i)^2}{E_i}$

    $\chi^2 = 74.17$

    Level of significance: $\alpha = 0.05$

    Critical value: $\chi^2_{0.05,\,8} = 15.51$

    The test statistic exceeds the critical value, so the null hypothesis is rejected at the 5% level of significance. The numbers do not follow a uniform distribution, which is one of the properties of random numbers; instead they approximate a Poisson distribution.
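The chi-squared comparison can be sketched as follows, assuming the standard Pearson statistic (the sum of squared differences, each divided by its expected frequency) and the observed and expected frequencies tabulated above. The function name `chi_squared` is hypothetical, and the critical value is copied from the text rather than computed. Evaluated term by term on the tabulated frequencies, this sketch returns roughly 667, which is larger than the 74.17 quoted in the text; both values exceed the critical value of 15.51, so the decision to reject uniformity is the same.

```python
def chi_squared(observed, expected):
    """Pearson chi-squared statistic: sum of (O_i - E_i)^2 / E_i."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Frequencies from the table of observed and expected values.
observed = [31, 107, 240, 241, 205, 113, 47, 13, 3]
expected = [111.1] * len(observed)

# Critical value for alpha = 0.05 and 8 degrees of freedom, taken from the text.
critical_value = 15.51

statistic = chi_squared(observed, expected)
reject_uniformity = statistic > critical_value
print(round(statistic, 1), reject_uniformity)
```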

    C. The Reverse Arrangements Test

    Ho: The numbers generated do not exhibit monotonic trends

    HA: The numbers generated exhibit monotonic trends

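    Test statistic: $A = \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} h_{ij}$, where $h_{ij} = 1$ if $X_i > X_j$ and $h_{ij} = 0$ otherwise.

    For $N = 100$: $A = \sum_{i=1}^{99} \sum_{j=i+1}^{100} h_{ij}$

    $A = 2308$

    Level of significance: $\alpha = 0.05$

    Critical values: $A_{N;\,1-\alpha/2} < 2308 \le A_{N;\,\alpha/2}$

    $A_{100;\,0.975} < 2308 \le A_{100;\,0.025}$

    $2145 < 2308 \le 2804$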

    The value of A in this and the other four experiments lies between 2145 and 2804. The null hypothesis is therefore accepted at the 5% level of significance: there is no evidence that the data have monotonic trends.
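A minimal Python sketch of the reverse arrangements count is given below. It counts the pairs in which an earlier value exceeds a later one, and then checks the result against the bounds 2145 and 2804 quoted above for a sample of 100 numbers. The function names are hypothetical, and the bounds are copied from the text rather than derived.

```python
def reverse_arrangements(values):
    """Count pairs (i, j) with i < j and values[i] > values[j]."""
    n = len(values)
    return sum(
        1
        for i in range(n - 1)
        for j in range(i + 1, n)
        if values[i] > values[j]
    )


def within_acceptance_region(a, lower=2145, upper=2804):
    """Acceptance region for N = 100 at alpha = 0.05, as quoted in the text."""
    return lower < a <= upper


# A strictly increasing sequence has no reverse arrangements;
# a strictly decreasing one has the maximum, N(N - 1) / 2.
print(reverse_arrangements(list(range(100))))         # 0
print(reverse_arrangements(list(range(100, 0, -1))))  # 4950
print(within_acceptance_region(2308))                 # True
```

Both extremes fall outside the acceptance region, which is how the test flags a monotonic trend in either direction.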


    Test Values For the Five Experiments in the Reverse Arrangements Test

    A_1 A_2 A_3 A_4 A_5 Avg A

    2308 2262 2336 2267 2167 2268

    D. Test of Runs Above and Below the Median

    Summary Statistics for the Runs Test

    $u$     $n_1$   $n_2$   $\mu_u$   $\sigma_u$   $z$
    473     415     570     481.30    15.30        -0.51

    Ho: The numbers are generated in a random order

    HA: The numbers are not generated in a random order

    The mean of the distribution of $u$:

    $\mu_u = \frac{2 n_1 n_2}{n_1 + n_2} + 1$

    The standard deviation of $u$:

    $\sigma_u = \sqrt{\frac{2 n_1 n_2 (2 n_1 n_2 - n_1 - n_2)}{(n_1 + n_2)^2 (n_1 + n_2 - 1)}}$

    The test statistic: $z = \frac{(u \pm 0.5) - \mu_u}{\sigma_u}$

    $z = -0.51$

    Level of significance: $\alpha = 0.05$

    Critical value: $|z| < z_{\alpha/2}$

    $-z_{\alpha/2} < -0.51 < z_{\alpha/2}$

    $-z_{0.025} < -0.51 < z_{0.025}$

    $-1.96 < -0.51 < +1.96$

    The value of z lies between -1.96 and +1.96. The null hypothesis is accepted at the 5% level of significance. There is no evidence to suggest a bias in the number generation and sequence.
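The runs test can be sketched in Python as shown below. The function names are hypothetical. The sign of the 0.5 continuity correction is an assumption: 0.5 is added when the observed number of runs is below the mean and subtracted otherwise, a convention that reproduces the reported z of -0.51 from the tabulated values of u, n1 and n2.

```python
from math import sqrt
from statistics import median


def count_runs(numbers):
    """Return (u, n1, n2): runs, values above the median, values below it."""
    med = median(numbers)
    # Values equal to the median are omitted, as described in the text.
    labels = ["a" if x > med else "b" for x in numbers if x != med]
    if not labels:
        return 0, 0, 0
    changes = sum(1 for prev, cur in zip(labels, labels[1:]) if prev != cur)
    return changes + 1, labels.count("a"), labels.count("b")


def runs_z_score(u, n1, n2):
    """Return (mean, standard deviation, z) for the runs test."""
    n = n1 + n2
    mu = 2 * n1 * n2 / n + 1
    sigma = sqrt(2 * n1 * n2 * (2 * n1 * n2 - n1 - n2) / (n ** 2 * (n - 1)))
    correction = 0.5 if u < mu else -0.5  # assumed continuity correction
    return mu, sigma, (u + correction - mu) / sigma


# Values taken from the summary table for the runs test.
mu, sigma, z = runs_z_score(u=473, n1=415, n2=570)
print(round(mu, 2), round(sigma, 2), round(z, 2))  # 481.3 15.3 -0.51
print(abs(z) < 1.96)                               # True
```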


    IV. DISCUSSION

    The prospects for DNA computation are considerable. Programmable DNA computers have found particular application in medical therapeutics [25]. The potential of DNA computing seems likely to be realized not in competing with silicon-based electronic computers, but in complementary fields such as the one cited above. The numbers generated from T4 phage DNA passed the non-parametric, distribution-free tests. Given that the properties of summation and duplication are more binding than those of uniformity and independence, the evidence suggests that while T4 phage DNA certainly does not generate truly random numbers, it does produce pseudorandom numbers and hence could be seen as a possible pseudorandom number generator. The chi-squared test reveals that the numbers do not follow a uniform distribution but rather approximate a Poisson distribution. The finding that the distributions of DNA nucleotides and amino acids follow a Poisson distribution is a novel experimental result which, alongside Chargaff's rule and the triplet coding mechanism of DNA nucleotides, contributes to our fundamental knowledge of DNA structure. Some approximation-induced errors were recorded in the Poisson distribution tables, but they do not detract from the fundamental findings.

    The investigation reveals that the DNA of T4 bacteriophage is a pseudorandom number generator, with the distribution of nucleotides and amino acids following a Poisson distribution.

    This work is open to future development, especially as regards a larger sample size and more tests for randomness. Nevertheless, it is doubtful that this would alter the fundamental findings contained in the work.

    V. TABLES SHOWING POISSON DISTRIBUTION OF PURINE BASES, PYRIMIDINE BASES AND AMINO ACIDS IN 3,500 BASES (COMPLEMENT 168,900 TO 165,400) IN THE 5′ TO 3′ DIRECTION IN BACTERIOPHAGE T4 DNA

    A. Purine and Pyrimidine Bases

    Adenine

    $\lambda = 3.238$

    x       P(x)     Theoretical frequency ($f_e$)   Actual frequency
    0       0.0392   13.3                            11
    1       0.1269   43.2                            35
    2       0.2055   69.9                            59
    3       0.2218   75.4                            92
    4       0.1795   61.0                            75
    5       0.1162   39.5                            41
    6       0.0628   21.3                            22
    7       0.0290   9.7                             5
    Total   0.8592   333.3                           340
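The P(x) and theoretical frequency columns in these tables are consistent with the Poisson probability mass function, P(x) = e^(-λ) λ^x / x!, multiplied by the total actual frequency. A minimal Python sketch under that assumption is shown below; the function names are hypothetical, and the total of 340 is the actual-frequency total from the adenine table. Small differences from the tabulated values in the third or fourth decimal place are possible.

```python
from math import exp, factorial


def poisson_pmf(x, lam):
    """Poisson probability of observing exactly x events with mean lam."""
    return exp(-lam) * lam ** x / factorial(x)


def theoretical_frequencies(lam, total, max_x):
    """Expected counts total * P(x) for x = 0 .. max_x."""
    return [total * poisson_pmf(x, lam) for x in range(max_x + 1)]


# Adenine: mean 3.238, 340 segments in total, x from 0 to 7.
adenine = theoretical_frequencies(lam=3.238, total=340, max_x=7)
for x, f_e in enumerate(adenine):
    print(x, round(poisson_pmf(x, 3.238), 4), round(f_e, 1))
```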


    Thymine

    $\lambda = 3.1706$

    x       P(x)     Theoretical frequency ($f_e$)   Actual frequency
    0       0.0419   14.2                            9
    1       0.1328   45.2                            36
    2       0.2106   71.6                            81
    3       0.2226   75.7                            76
    4       0.1764   59.9                            68
    5       0.1119   38.0                            46
    6       0.0591   20.1                            18
    7       0.0268   9.1                             6
    Total   0.8768   333.8                           340

    Guanine

    $\lambda = 1.5612$

    x       P(x)     Theoretical frequency ($f_e$)   Actual frequency
    0       0.2099   70.3                            62
    1       0.3277   109.8                           110
    2       0.2558   85.7                            98
    3       0.1331   44.6                            46
    4       0.0520   17.4                            16
    5       0.0162   5.4                             3
    Total   0.9947   333.2                           335

    Cytosine

    $\lambda = 2.0264$

    x       P(x)     Theoretical frequency ($f_e$)   Actual frequency
    0       0.1318   44.9                            39
    1       0.2671   91.1                            78
    2       0.2706   92.3                            111
    3       0.1828   62.3                            74
    4       0.0926   31.6                            29
    5       0.0375   12.8                            8
    6       0.0127   4.3                             1
    7       0.0037   1.3                             1
    Total   0.9988   340.6                           341


    B. Amino Acids

    Glycine

    $\lambda = 1.1020$

    x       P(x)     Theoretical frequency ($f_e$)   Actual frequency
    0       0.3322   16.3                            19
    1       0.3661   17.9                            13
    2       0.2017   9.9                             11
    3       0.0741   3.6                             5
    4       0.0204   1.0                             1
    Total   0.9945   48.7                            49

    Alanine

    $\lambda = 1.1224$

    x       P(x)     Theoretical frequency ($f_e$)   Actual frequency
    0       0.3255   15.9                            17
    1       0.3653   17.9                            15
    2       0.2050   10.0                            13
    3       0.0767   3.8                             2
    4       0.0215   1.1                             2
    Total   0.9945   48.7                            49

    Valine

    $\lambda = 1.8367$

    x       P(x)     Theoretical frequency ($f_e$)   Actual frequency
    0       0.1593   7.8                             7
    1       0.2925   14.3                            16
    2       0.2687   13.2                            12
    3       0.1645   8.1                             9
    4       0.0755   3.7                             3
    5       0.0277   1.4                             1
    6       0.0085   0.4                             1
    Total   0.9967   48.9                            49

    Leucine

    $\lambda = 1.4694$

    x       P(x)     Theoretical frequency ($f_e$)   Actual frequency
    0       0.2301   11.2                            9
    1       0.3381   16.6                            14
    2       0.2484   12.2                            14
    3       0.1217   5.9                             8
    4       0.0447   2.2                             4
    Total   0.9830   48.1                            49


    Isoleucine

    $\lambda = 1.4490$

    x       P(x)     Theoretical frequency ($f_e$)   Actual frequency
    0       0.2348   11.5                            11
    1       0.3402   16.7                            19
    2       0.2465   12.1                            11
    3       0.1191   5.8                             5
    4       0.0431   2.1                             1
    5       0.0125   0.6                             1
    6       0.0030   0.1                             1
    Total   0.9992   48.9                            49

    Methionine

    $\lambda = 0.6531$

    x       P(x)     Theoretical frequency ($f_e$)   Actual frequency
    0       0.5204   25.5                            27
    1       0.3399   16.7                            14
    2       0.1110   5.4                             6
    3       0.0242   1.2                             2
    Total   0.9955   48.8                            49

    Phenylalanine

    $\lambda = 1.1429$

    x       P(x)     Theoretical frequency ($f_e$)   Actual frequency
    0       0.3189   15.6                            18
    1       0.3645   17.9                            15
    2       0.2083   10.2                            8
    3       0.0793   3.9                             7
    4       0.0227   1.1                             1
    Total   0.9937   48.7                            49

    Tyrosine

    $\lambda = 0.9184$

    x       P(x)     Theoretical frequency ($f_e$)   Actual frequency
    0       0.3992   19.6                            17
    1       0.3666   17.9                            23
    2       0.1684   8.2                             5
    3       0.0515   2.5                             4
    Total   0.9857   48.3                            49


    Tryptophan

    $\lambda = 0.3673$

    x       P(x)     Theoretical frequency ($f_e$)   Actual frequency
    0       0.6926   33.9                            34
    1       0.2544   12.5                            12
    2       0.0467   2.3                             3
    Total   0.9937   48.7                            49

    Serine

    $\lambda = 1.4082$

    x       P(x)     Theoretical frequency ($f_e$)   Actual frequency
    0       0.2446   11.9                            12
    1       0.3444   16.9                            17
    2       0.2425   11.9                            11
    3       0.1138   5.6                             8
    4       0.0401   1.9                             0
    5       0.0113   0.6                             0
    6       0.0026   0.1                             1
    Total   0.9993   48.9                            49

    Proline

    $\lambda = 0.6122$

    x       P(x)     Theoretical frequency ($f_e$)   Actual frequency
    0       0.5422   26.6                            29
    1       0.3319   16.3                            10
    2       0.1016   4.9                             10
    Total   0.9757   47.8                            49

    Threonine

    $\lambda = 1.0408$

    x       P(x)     Theoretical frequency ($f_e$)   Actual frequency
    0       0.3532   17.3                            22
    1       0.3676   18.0                            12
    2       0.1913   9.4                             8
    3       0.0664   3.3                             5
    4       0.0173   0.8                             2
    Total   0.9926   48.8                            49


    Cysteine

    $\lambda = 0.3469$

    x       P(x)     Theoretical frequency ($f_e$)   Actual frequency
    0       0.7069   34.6                            32
    1       0.2452   12.0                            17
    Total   0.9521   46.6                            49

    Asparagine

    $\lambda = 1.1633$

    x       P(x)     Theoretical frequency ($f_e$)   Actual frequency
    0       0.3125   15.3                            12
    1       0.3635   17.8                            24
    2       0.2114   10.4                            8
    3       0.0820   4.0                             3
    4       0.0238   1.2                             2
    Total   0.9932   48.7                            49

    Glutamate

    $\lambda = 0.7755$

    x       P(x)     Theoretical frequency ($f_e$)   Actual frequency
    0       0.4605   22.6                            22
    1       0.3571   17.5                            18
    2       0.1385   6.8                             7
    3       0.0360   1.8                             2
    Total   0.9921   48.7                            49

    Lysine

    $\lambda = 1.3061$

    x       P(x)     Theoretical frequency ($f_e$)   Actual frequency
    0       0.2709   13.3                            12
    1       0.3538   17.3                            20
    2       0.2311   11.3                            10
    3       0.1006   4.9                             4
    4       0.0328   1.6                             3
    Total   0.9892   48.4                            49

    Histidine

    $\lambda = 0.3673$

    x       P(x)     Theoretical frequency ($f_e$)   Actual frequency
    0       0.6926   33.9                            33
    1       0.2544   12.5                            14
    2       0.0469   2.3                             2
    Total   1.0404   48.7                            49
