
World Applied Programming, Vol (1), No (3), August 2011. 215-227


On Pseudorandom Number Generation from

Programmable and Computable Biomolecules:

Deoxyribonucleic (DNA) as a Novel Pseudorandom Number Generator

Okunoye Babatunde O.

Department of Pure and Applied Biology,

Ladoke Akintola University of Technology,

P.M.B. 4000, Ogbomoso, Nigeria


Abstract: Deoxyribonucleic acid (DNA) computing has extended the frontiers of computer science. The field has made possible the creation of programmable DNA molecules that perform complex calculations and of autonomous DNA machines. Random numbers are a sequence of numbers that lack any pattern. A random number generator (RNG) is a computational device designed to generate random numbers. The many applications of random numbers include cryptography, statistical sampling, numerical simulation of physical and biological systems, and lotteries. Random numbers are usually generated by sampling entropy in physical phenomena and processing it through a computer. Examples of such phenomena include a radioactive source, atmospheric noise, and quantum mechanics. Random numbers are hard to characterize mathematically: although several statistical tests exist to verify the absence of certain patterns in a stream of numbers, no finite set of tests can fully characterize randomness, as there may be patterns not considered by such tests. This paper discusses the generation of pseudorandom numbers from DNA Watson-Crick units. 1,000 numbers from an experimental DNA segment passed two statistical tests for randomness. This work also reports what might be a breakthrough in DNA structural analysis: the Poisson distribution of DNA bases and amino acids.

Keywords: Pseudorandom number generator; DNA computing; Poisson distribution

I. INTRODUCTION

Alan Turing [1] was the first to describe a model of a computer in computability theory, known as the Turing machine. Deoxyribonucleic acid (DNA) has been used in the implementation of the Turing machine [2] [3], and as such DNA is now accepted as a standard computer model. This has resulted in the development of DNA computing, which has been employed in the resolution of numerous computations, especially certain NP-hard problems [4] [5] [6] [7] [8]. Of special importance also is the creation of programmable and autonomous DNA computers [9] [10] [11] [12] [13] that have shown the promise and power of molecular computing. Here, DNA molecules are used to demonstrate the generation of a sequence of pseudorandom numbers.

Random numbers are a sequence of numbers that lack any pattern and have applications in cryptography, gambling and numerical simulation of physical and biological systems [14]. A random number generator is a computational device designed to generate random numbers. Random numbers are hard to characterize mathematically [15]. Statistical tests exist for verifying the absence of certain patterns in a sequence of numbers [16] [17], but there is no complete set of tests for randomness, as there may be patterns not covered by such tests. Consequently, it is practically useful to distinguish true random number generators from pseudorandom number generators. A true random number generator employs entropy in physical phenomena, especially quantum mechanics [14], to generate numbers that are totally unpredictable, while a pseudorandom number generator generates numbers that are computed using a mathematical algorithm or taken from a previously calculated list. Random number generators are often imperfect [18] [19] [20] [21] [22]; for example, certain pseudorandom number generators are deterministic in nature, yet produce results that satisfy all the randomness tests [23]. This paper discusses DNA as a pseudorandom number generator, where 1,000 numbers generated from DNA strands passed two statistical tests for randomness. The paper also reports the Poisson distribution of DNA nucleotide bases and amino acids, which might be an important secondary finding.

ISSN: 2222-2510

2011 WAP journal. www.waprogramming.com

II. MATERIALS AND METHODS

A. Number Generation

The bacteriophage T4 DNA [24] sequence was obtained from GenBank, the international institutional microbial genome depository, with accession number AF158101. The base sequence employed in the experiment is the complement 168,900 to 167,900 (in the 5′ to 3′ direction). A total of 1,000 DNA segments of ten bases each of T4 phage DNA were used to generate a total of 1,000 numbers. For example, the first ten bases of the related complement gave the following digits:

Number of Adenine Bases: 4    Number of Guanine Bases: 1
Number of Thymine Bases: 3    Number of Cytosine Bases: 2

The next ten bases in the complement gave similarly:

Number of Adenine Bases: 0    Number of Guanine Bases: 2
Number of Thymine Bases: 5    Number of Cytosine Bases: 2

Therefore the first numbers in our sequence of 1,000 numbers are 41 and 52: the purine counts (adenine, guanine) of the first segment, followed by the pyrimidine counts (thymine, cytosine) of the second. This process is repeated for the rest of the complement. The numbers of the purine and pyrimidine bases in the related complement 168,900 to 167,900 were used alternately per base segment (10 bases) in order to preserve independence of the numbers, a quality of random numbers. The purine and pyrimidine numbers were not generated from the same segment (10 bases) of the sequence, as this would compromise the independence of the number sequence: it would then be possible to predict something about the next number in the sequence from the preceding number. This is because the numeric digits within a segment sum to ten, the number of bases in the experimental DNA segments used.
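The segment-counting scheme described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the author's own program: the function name `segment_numbers` and the 20-base toy sequence are invented for the example, and the two counts are combined as tens and units digits, which matches the reported values 41 and 52 but is only one reading of how the digits were joined.

```python
from collections import Counter

SEGMENT_LENGTH = 10  # bases per segment, as stated in the text


def segment_numbers(sequence, segment_length=SEGMENT_LENGTH):
    """Turn a DNA string into numbers, one per segment.

    Even-indexed segments (0, 2, ...) contribute their purine counts
    (adenine, guanine); odd-indexed segments contribute their pyrimidine
    counts (thymine, cytosine). Assumption: the two counts are joined as
    tens and units digits.
    """
    numbers = []
    # Ignore any trailing bases that do not fill a whole segment.
    usable = len(sequence) - len(sequence) % segment_length
    for index, start in enumerate(range(0, usable, segment_length)):
        counts = Counter(sequence[start:start + segment_length].upper())
        if index % 2 == 0:
            first, second = counts["A"], counts["G"]
        else:
            first, second = counts["T"], counts["C"]
        numbers.append(10 * first + second)
    return numbers


# Hypothetical toy sequence (not taken from GenBank AF158101).
toy_sequence = "AAAAGTTTCC" + "TTTTTCCGGA"
print(segment_numbers(toy_sequence))
```

Alternating between purine and pyrimidine counts means that consecutive numbers never come from the same ten bases, which is the independence argument made in the text.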

B. Statistical Tests

The properties of random numbers include:

Uniformity: The numbers should be distributed uniformly.

Independence: It should be impossible to predict something about the next value in the sequence from the previous value(s).

Summation: The sum of two consecutive numbers in the sequence is equally likely to be even or odd. The probability ought to be 0.5.

Duplication: Some numbers will be duplicated in the sequence, while others will be omitted.

However, the properties of summation and duplication are more binding than those of uniformity and independence. A sequence of numbers that has uniformity and independence is not random unless it also displays the properties of summation and duplication.

There exist statistical tests to evaluate random numbers and random number generators [16] [17]. Three tests were chosen to evaluate the numbers generated from T4 phage DNA: the chi-squared test, the reverse arrangements test and the test of runs above and below the median.

The chi-squared test: The chi-squared test is a test of distributional accuracy. It is a parametric test, which makes assumptions about the distribution of the number set. It measures how closely a set of numbers follows the uniform distribution. The chi-squared test is a common statistical test used in the evaluation of random numbers.



The test of runs above and below the median: The test of runs is a non-parametric test, as it does not make assumptions about the distribution of the number set. The test is especially useful in detecting fluctuating trends in data, the existence of which indicates non-randomness. The test examines the order in which the numbers are generated, checking for trends and cyclical patterns which may indicate a bias in the software producing the sequence of numbers, and hence non-randomness.

Observations in the number sequence greater than the median of the sequence are assigned the letter a, while observations less than the median are assigned the letter b. Observations equal to the median are omitted. Accordingly, $u$, the total number of runs, $n_1$, the number of a's, and $n_2$, the number of b's, are calculated.

If the sequence shows mostly a's at first and then mostly b's, this suggests a downward trend in the data. Similarly, a sequence showing mostly b's at first and then mostly a's suggests an upward trend in the data.

When the test of runs above and below the median is implemented on a set of numbers, the mean $\mu_u$, the standard deviation $\sigma_u$ and the z-score are calculated.
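The labelling step just described can be sketched as follows. This is an illustrative reading of the procedure, not the software used in the study; the function name `count_runs` and the seven sample values are assumptions made for the example.

```python
from statistics import median


def count_runs(values):
    """Return (u, n1, n2) for the test of runs above and below the median.

    u  = total number of runs
    n1 = number of observations above the median (letter 'a')
    n2 = number of observations below the median (letter 'b')
    """
    med = median(values)
    # Observations equal to the median are omitted, as in the text.
    labels = ["a" if v > med else "b" for v in values if v != med]
    n1 = labels.count("a")
    n2 = labels.count("b")
    # A run starts at the first label and at every change of label.
    u = sum(
        1
        for i, label in enumerate(labels)
        if i == 0 or label != labels[i - 1]
    )
    return u, n1, n2


# Hypothetical sample: the median is 36, so the labels are a a b b a b.
sample = [41, 52, 30, 12, 60, 36, 25]
print(count_runs(sample))
```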

The reverse arrangements test: The reverse arrangements test is also useful in detecting bias in the software generating the random number sequence. This particular test detects monotonic (gradual and continuous) trends in the sequence, which also indicate non-randomness. Restrictions in the reverse arrangements distribution tables necessitated the use of a sample size of 100 for each of five experiments in order to determine the average.

C. Summary Statistics

The objective of summary statistics is to provide an overview of the main parameters of the data: the count (quantity of numbers), the mean, the median, and the maximum and minimum values in the data set. It ensures familiarity with the data set.

III. SUMMARY STATISTICS

A. Summary Statistics For Numbers Generated From T4 Phage DNA

Count   Mean   Median   Max   Min   P(Even)   Number of Duplications   Number of non-occurring values
1000    33.8   36       82    0     0.496     40                       40

The P(even) value shows that the summation property of the numbers generated holds: the probability of the sum of two numbers being even is 0.496 (approximately 0.5). The number of duplications and non-occurring values, considering the minimum and maximum values of 0 and 82 respectively, suggests randomness.
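A short Python sketch of how such summary statistics might be computed is given here. Several definitions are assumptions, since the paper does not spell them out: P(even) is taken over sums of adjacent pairs, a "duplication" is counted as a distinct value that occurs more than once, and a "non-occurring value" is an integer between the minimum and maximum that never appears. The function name and the five-element sample are hypothetical.

```python
from collections import Counter
from statistics import mean, median


def summary_statistics(numbers):
    """Summarise a sequence of at least two integers."""
    counts = Counter(numbers)
    # Assumption: the summation property is checked on adjacent pairs.
    pair_sums = [a + b for a, b in zip(numbers, numbers[1:])]
    even_sums = sum(1 for s in pair_sums if s % 2 == 0)
    return {
        "count": len(numbers),
        "mean": mean(numbers),
        "median": median(numbers),
        "max": max(numbers),
        "min": min(numbers),
        "p_even": even_sums / len(pair_sums),
        # Assumption: distinct values that appear more than once.
        "duplicated_values": sum(1 for c in counts.values() if c > 1),
        # Assumption: integers in [min, max] that never appear.
        "non_occurring_values": sum(
            1
            for v in range(min(numbers), max(numbers) + 1)
            if v not in counts
        ),
    }


# Hypothetical sample, not data from the study.
stats = summary_statistics([4, 6, 3, 3, 8])
print(stats)
```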

B. The Chi-Squared Test

Observed and Expected Frequencies for the Chi-Squared Test

Category   Observed Frequency (O_i)   Expected Frequency (E_i)
0.100      31                         111.1
0.200      107                        111.1
0.300      240                        111.1
0.400      241                        111.1
0.500      205                        111.1
0.600      113                        111.1
0.700      47                         111.1
0.800      13                         111.1
0.900      3                          111.1
Total      1000                       999.9



$H_0$: The numbers follow a uniform distribution.

$H_A$: The numbers follow another distribution.

Test statistic:

$$\chi^2 = \sum_{i=1}^{K} \frac{(O_i - E_i)^2}{E_i}$$

$$\chi^2 = 74.17$$

Level of significance: $\alpha = 0.05$

Critical value: $\chi^2_{0.05,\,8} = 15.51$

The test statistic exceeds the critical value, so the null hypothesis is rejected at the 5% level of significance. The numbers do not follow a uniform distribution, which is one of the properties of random numbers. Instead, the numbers approximate a Poisson distribution.
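As a cross-check, the Pearson chi-squared statistic can be recomputed from the nine observed frequencies in the table. The sketch below assumes the standard form of the statistic, the sum of (O_i − E_i)²/E_i with E_i equal to the total divided by the number of categories; the function name is hypothetical. Computed this way from the tabulated frequencies, the statistic comes to roughly 667.4, about nine times the reported 74.17. Either value is far above the critical value of 15.51, so the rejection of uniformity is unaffected.

```python
def chi_squared_uniform(observed):
    """Pearson chi-squared statistic against a uniform expectation.

    Returns (statistic, expected frequency per category).
    """
    total = sum(observed)
    expected = total / len(observed)
    statistic = sum((o - expected) ** 2 / expected for o in observed)
    return statistic, expected


# Observed frequencies from the table; critical value as given in the text.
OBSERVED = [31, 107, 240, 241, 205, 113, 47, 13, 3]
CRITICAL_VALUE = 15.51  # chi-squared, alpha = 0.05, 8 degrees of freedom

statistic, expected = chi_squared_uniform(OBSERVED)
reject_uniformity = statistic > CRITICAL_VALUE
print(f"chi-squared = {statistic:.2f}, expected = {expected:.1f}")
print("reject H0" if reject_uniformity else "fail to reject H0")
```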

C. The Reverse Arrangements Test

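$H_0$: The numbers generated do not exhibit monotonic trends.

$H_A$: The numbers generated exhibit monotonic trends.

Test statistic:

$$A = \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} h_{ij}, \qquad h_{ij} = \begin{cases} 1 & \text{if } X_i > X_j \\ 0 & \text{otherwise} \end{cases}$$

For $N = 100$:

$$A = \sum_{i=1}^{99} \sum_{j=i+1}^{100} h_{ij} = 2308$$

Level of significance: $\alpha = 0.05$

Critical value:

$$A_{N;\,1-\alpha/2} < 2308 \le A_{N;\,\alpha/2}$$

$$A_{100;\,0.975} < 2308 \le A_{100;\,0.025}$$

$$2145 < 2308 \le 2804$$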

The value of A in this and the other four experiments lies between 2145 and 2804. The null hypothesis is accepted at the 5% level of significance. There is no evidence that the data have monotonic trends.



Test Values For the Five Experiments in the Reverse Arrangements Test

A_1 A_2 A_3 A_4 A_5 Avg A

2308 2262 2336 2267 2167 2268
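The reverse arrangements statistic counts the pairs in which an earlier observation exceeds a later one. A minimal sketch follows; the function names and the five-element sample are illustrative assumptions, while the bounds 2145 and 2804 and the five reported values of A are taken from the text.

```python
def reverse_arrangements(values):
    """Count pairs (i, j) with i < j and values[i] > values[j]."""
    n = len(values)
    return sum(
        1
        for i in range(n - 1)
        for j in range(i + 1, n)
        if values[i] > values[j]
    )


def within_acceptance_region(a, lower, upper):
    """True when lower < a <= upper, the acceptance region for H0."""
    return lower < a <= upper


# Hypothetical five-element sample: 5 exceeds four later values and
# 4 exceeds two later values, giving six reverse arrangements.
print(reverse_arrangements([5, 1, 4, 2, 3]))

# Values of A reported for the five experiments (N = 100 each).
REPORTED_A = [2308, 2262, 2336, 2267, 2167]
LOWER, UPPER = 2145, 2804  # bounds quoted in the text for alpha = 0.05
print(all(within_acceptance_region(a, LOWER, UPPER) for a in REPORTED_A))
print(sum(REPORTED_A) / len(REPORTED_A))
```

The double loop is quadratic in the sample size, which is adequate for samples of 100 numbers.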

D. Test of Runs Above and Below the Median

Summary Statistics for the Runs Test

$u$    $n_1$   $n_2$   $\mu_u$   $\sigma_u$   $z$
473    415     570     481.30    15.30        -0.51

$H_0$: The numbers are generated in a random order.

$H_A$: The numbers are not generated in a random order.

The mean of the distribution of $u$:

$$\mu_u = \frac{2 n_1 n_2}{n_1 + n_2} + 1$$

The standard deviation of $u$:

$$\sigma_u = \sqrt{\frac{2 n_1 n_2 (2 n_1 n_2 - n_1 - n_2)}{(n_1 + n_2)^2 (n_1 + n_2 - 1)}}$$

The test statistic:

$$z = \frac{(u \pm 0.5) - \mu_u}{\sigma_u}$$

$$z = -0.51$$

Level of significance: $\alpha = 0.05$

Critical value: $|z| < z_{\alpha/2}$

$$-z_{\alpha/2} < -0.51 < z_{\alpha/2}$$

$$-z_{0.025} < -0.51 < z_{0.025}$$

$$-1.96 < -0.51 < +1.96$$

The value of z lies between -1.96 and +1.96. The null hypothesis is accepted at the 5% level of significance. There is no evidence to suggest a bias in the number generation and sequence.
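The reported mean, standard deviation and z-score can be reproduced from u = 473, n1 = 415 and n2 = 570. The sketch below assumes the usual continuity correction, which moves u half a unit toward the mean (+0.5 when u is below the mean, -0.5 when it is above); that choice reproduces the reported z of -0.51. The function name is hypothetical.

```python
import math


def runs_test_z(u, n1, n2):
    """Return (mean, standard deviation, z) for the runs test."""
    mean_u = 2 * n1 * n2 / (n1 + n2) + 1
    variance_u = (2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)) / (
        (n1 + n2) ** 2 * (n1 + n2 - 1)
    )
    sd_u = math.sqrt(variance_u)
    # Continuity correction (assumed): shift u half a unit toward the mean.
    if u < mean_u:
        correction = 0.5
    elif u > mean_u:
        correction = -0.5
    else:
        correction = 0.0
    z = (u + correction - mean_u) / sd_u
    return mean_u, sd_u, z


mean_u, sd_u, z = runs_test_z(u=473, n1=415, n2=570)
print(f"mean = {mean_u:.2f}, sd = {sd_u:.2f}, z = {z:.2f}")
print("within +/-1.96" if abs(z) < 1.96 else "outside +/-1.96")
```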



IV. DISCUSSION

The future of DNA computation is promising. Programmable DNA computers have found particular application in medical therapeutics [25]. The potential of DNA computing seems likely to be realized not in competing with silicon-based electronic computers, but in complementary fields such as the one cited above. The numbers generated from T4 phage DNA passed the non-parametric, distribution-free tests. Given that the properties of summation and duplication are more binding than those of uniformity and independence, the evidence suggests that while T4 phage DNA certainly does not generate truly random numbers, it does produce pseudorandom numbers and hence could be seen as a possible pseudorandom number generator. The chi-squared test reveals that the numbers do not follow a uniform distribution but rather a Poisson distribution. The finding that the distributions of DNA nucleotides and amino acids follow a Poisson distribution is a novel experimental result which, alongside Chargaff's rule and the triplet coding mechanism of DNA nucleotides, contributes to our fundamental knowledge of DNA structure. Some approximation-induced errors were recorded in the Poisson distribution tables, but they do not detract from the fundamental findings.

The investigation reveals that the DNA of bacteriophage T4 is a pseudorandom number generator, with the distribution of nucleotides and amino acids following a Poisson distribution.

This work is open to future development, especially as regards a larger sample size and more tests for randomness. Nevertheless, it is doubtful that this would alter the fundamental findings contained in the work.

V. TABLES SHOWING POISSON DISTRIBUTION OF PURINE BASES, PYRIMIDINE BASES AND AMINO ACIDS IN 3,500 BASES (COMPLEMENT 168,900 TO 165,400) IN THE 5′ TO 3′ DIRECTION IN BACTERIOPHAGE T4 DNA

A. Purine and Pyrimidine Bases

Adenine
Mean λ = 3.238

X       P(x)     Theoretical frequency (f_e)   Actual frequency
0       0.0392   13.3                          11
1       0.1269   43.2                          35
2       0.2055   69.9                          59
3       0.2218   75.4                          92
4       0.1795   61.0                          75
5       0.1162   39.5                          41
6       0.0628   21.3                          22
7       0.0290   9.7                           5
Total   0.8592   333.3                         340
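The theoretical columns of these tables can be approximated from the actual frequencies alone. The sketch below assumes that the Poisson mean is estimated as the sample mean of the counts, which reproduces the value 3.238 given for adenine, and that the theoretical frequency is the total number of segments multiplied by P(x). The function name is hypothetical, and small differences from the tabulated values are to be expected because of rounding.

```python
import math


def poisson_table(actual_frequencies):
    """Fit a Poisson distribution to counts indexed by X = 0, 1, 2, ...

    Returns (lam, rows), where each row is
    (x, P(x), theoretical frequency, actual frequency).
    """
    n = sum(actual_frequencies)
    # Assumption: the Poisson mean is the sample mean of X.
    lam = sum(x * f for x, f in enumerate(actual_frequencies)) / n
    rows = []
    for x, actual in enumerate(actual_frequencies):
        p = math.exp(-lam) * lam ** x / math.factorial(x)
        rows.append((x, p, n * p, actual))
    return lam, rows


# Actual frequencies for adenine, X = 0..7 (340 segments in total).
ADENINE_ACTUAL = [11, 35, 59, 92, 75, 41, 22, 5]

lam, rows = poisson_table(ADENINE_ACTUAL)
print(f"mean = {lam:.3f}")
for x, p, theoretical, actual in rows:
    print(f"{x}  {p:.4f}  {theoretical:6.1f}  {actual}")
```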



Thymine
Mean λ = 3.1706

X       P(x)     Theoretical frequency (f_e)   Actual frequency
0       0.0419   14.2                          9
1       0.1328   45.2                          36
2       0.2106   71.6                          81
3       0.2226   75.7                          76
4       0.1764   59.9                          68
5       0.1119   38.0                          46
6       0.0591   20.1                          18
7       0.0268   9.1                           6
Total   0.8768   333.8                         340

Guanine

X       P(x)     Theoretical frequency (f_e)   Actual frequency
0       0.2099   70.3                          62
1       0.3277   109.8                         110
2       0.2558   85.7                          98
3       0.1331   44.6                          46
4       0.0520   17.4                          16
5       0.0162   5.4                           3
Total   0.9947   333.2                         335

Cytosine

X       P(x)     Theoretical frequency (f_e)   Actual frequency
0       0.1318   44.9                          39
1       0.2671   91.1                          78
2       0.2706   92.3                          111
3       0.1828   62.3                          74
4       0.0926   31.6                          29
5       0.0375   12.8                          8
6       0.0127   4.3                           1
7       0.0037   1.3                           1
Total   0.9988   340.6                         341



B. Amino Acids

Glycine
Mean λ = 1.1020

X       P(x)     Theoretical frequency (f_e)   Actual frequency
0       0.3322   16.3                          19
1       0.3661   17.9                          13
2       0.2017   9.9                           11
3       0.0741   3.6                           5
4       0.0204   1.0                           1
Total   0.9945   48.7                          49

Alanine

X       P(x)     Theoretical frequency (f_e)   Actual frequency
0       0.3255   15.9                          17
1       0.3653   17.9                          15
2       0.2050   10.0                          13
3       0.0767   3.8                           2
4       0.0215   1.1                           2
Total   0.9945   48.7                          49

Valine

X       P(x)     Theoretical frequency (f_e)   Actual frequency
0       0.1593   7.8                           7
1       0.2925   14.3                          16
2       0.2687   13.2                          12
3       0.1645   8.1                           9
4       0.0755   3.7                           3
5       0.0277   1.4                           1
6       0.0085   0.4                           1
Total   0.9967   48.9                          49

Leucine

X       P(x)     Theoretical frequency (f_e)   Actual frequency
0       0.2301   11.2                          9
1       0.3381   16.6                          14
2       0.2484   12.2                          14
3       0.1217   5.9                           8
4       0.0447   2.2                           4
Total   0.9830   48.1                          49



Isoleucine
Mean λ = 1.4490

X       P(x)     Theoretical frequency (f_e)   Actual frequency
0       0.2348   11.5                          11
1       0.3402   16.7                          19
2       0.2465   12.1                          11
3       0.1191   5.8                           5
4       0.0431   2.1                           1
5       0.0125   0.6                           1
6       0.0030   0.1                           1
Total   0.9992   48.9                          49

Methionine

X       P(x)     Theoretical frequency (f_e)   Actual frequency
0       0.5204   25.5                          27
1       0.3399   16.7                          14
2       0.1110   5.4                           6
3       0.0242   1.2                           2
Total   0.9955   48.8                          49

Phenylalanine
Mean λ = 1.1429

X       P(x)     Theoretical frequency (f_e)   Actual frequency
0       0.3189   15.6                          18
1       0.3645   17.9                          15
2       0.2083   10.2                          8
3       0.0793   3.9                           7
4       0.0227   1.1                           1
Total   0.9937   48.7                          49

Tyrosine

X       P(x)     Theoretical frequency (f_e)   Actual frequency
0       0.3992   19.6                          17
1       0.3666   17.9                          23
2       0.1684   8.2                           5
3       0.0515   2.5                           4
Total   0.9857   48.3                          49



Tryptophan

X       P(x)     Theoretical frequency (f_e)   Actual frequency
0       0.6926   33.9                          34
1       0.2544   12.5                          12
2       0.0467   2.3                           3
Total   0.9937   48.7                          49

Serine

X       P(x)     Theoretical frequency (f_e)   Actual frequency
0       0.2446   11.9                          12
1       0.3444   16.9                          17
2       0.2425   11.9                          11
3       0.1138   5.6                           8
4       0.0401   1.9                           0
5       0.0113   0.6                           0
6       0.0026   0.1                           1
Total   0.9993   48.9                          49

Proline
Mean λ = 0.6122

X       P(x)     Theoretical frequency (f_e)   Actual frequency
0       0.5422   26.6                          29
1       0.3319   16.3                          10
2       0.1016   4.9                           10
Total   0.9757   47.8                          49

Threonine

X       P(x)     Theoretical frequency (f_e)   Actual frequency
0       0.3532   17.3                          22
1       0.3676   18.0                          12
2       0.1913   9.4                           8
3       0.0664   3.3                           5
4       0.0173   0.8                           2
Total   0.9926   48.8                          49



Cysteine

X       P(x)     Theoretical frequency (f_e)   Actual frequency
0       0.7069   34.6                          32
1       0.2452   12.0                          17
Total   0.9521   46.6                          49

Asparagine
Mean λ = 1.1633

X       P(x)     Theoretical frequency (f_e)   Actual frequency
0       0.3125   15.3                          12
1       0.3635   17.8                          24
2       0.2114   10.4                          8
3       0.0820   4.0                           3
4       0.0238   1.2                           2
Total   0.9932   48.7                          49

Glutamate
Mean λ = 0.7755

X       P(x)     Theoretical frequency (f_e)   Actual frequency
0       0.4605   22.6                          22
1       0.3571   17.5                          18
2       0.1385   6.8                           7
3       0.0360   1.8                           2
Total   0.9921   48.7                          49

Lysine
Mean λ = 1.3061

X       P(x)     Theoretical frequency (f_e)   Actual frequency
0       0.2709   13.3                          12
1       0.3538   17.3                          20
2       0.2311   11.3                          10
3       0.1006   4.9                           4
4       0.0328   1.6                           3
Total   0.9892   48.4                          49

Histidine
Mean λ = 0.3673

X       P(x)     Theoretical frequency (f_e)   Actual frequency
0       0.6926   33.9                          33
1       0.2544   12.5                          14
2       0.0469   2.3                           2
Total   1.0404   48.7                          49

