Research Article

DNASynth: A Computer Program for Assembly of Artificial Gene Parts in Decreasing Temperature

Robert M. Nowak,1 Anna Wojtowicz-Krawiec,2 and Andrzej Plucienniczak2

1Institute of Electronic Systems, Warsaw University of Technology, Nowowiejska 15/19, 00-665 Warsaw, Poland
2Institute of Biotechnology and Antibiotics, Staroscinska 5, 02-512 Warsaw, Poland

Correspondence should be addressed to Robert M. Nowak; rbmnowak@gmail.com

Received 30 May 2014; Revised 8 October 2014; Accepted 11 October 2014

Academic Editor: Jiangke Yang

Copyright © 2015 Robert M. Nowak et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Artificial gene synthesis requires consideration of nucleotide sequence development as well as long DNA molecule assembly protocols. The nucleotide sequence of the molecule must meet many conditions, including particular preferences of the host organism for certain codons, avoidance of specific regulatory subsequences, and a lack of secondary structures that inhibit expression. The chemical synthesis of DNA molecules has limitations in terms of strand length; thus, the creation of artificial genes requires the assembly of long DNA molecules from shorter fragments. In the approach presented, the algorithm and the computer program address both tasks: developing the optimal nucleotide sequence to encode a given peptide for a given host organism and determining the long DNA assembly protocol. These tasks are closely connected: a change in codon usage may lead to changes in the optimal assembly protocol, and the lack of a simple assembly protocol may be addressed by changing the nucleotide sequence. The computer program presented in this study was tested with real data from an experiment in a wet biological laboratory to synthesize a peptide. The benefit of the presented algorithm and its application is the shorter time, compared to polymerase cycling assembly, needed to produce a ready synthetic gene.

1. Introduction

An artificial gene is a DNA strand of a given sequence synthesized in vitro. De novo synthesized DNA molecules, which need no initial DNA template, offer higher cost efficiency and greater flexibility than redesigning molecules produced by molecular cloning [1].

The chemical synthesis method for a DNA strand uses the repeated addition of nucleotides [2]. In practice it limits the strand length to about 80 nucleotides (nt) [3]. The creation of artificial genes, usually longer than 300 nt, requires the assembly of long DNA molecules from shorter, chemically synthesized fragments. The assembly methods typically mix the fragments together in one tube. Thus, successful synthesis requires removing from the fragments any regions having secondary structures with high energy, such as repetitive structures, inverted repeats, and regions of extraordinarily high or low GC content. If these conditions cannot be met, the fragments of a particular gene can only be synthesized by splitting the procedure into several consecutive steps and a final assembly of shorter subsequences, which in turn leads to a significant increase in the time and labor needed for its production.

Polymerase cycling assembly (PCA) is the most widely used technique [4, 5] to produce long (e.g., 1000 nt) DNA strands from shorter fragments. This method uses the same reactions and reagents as the polymerase chain reaction (PCR). Chemically synthesized shorter fragments are mixed together in one tube, and then polymerase builds the complementary strand using the fragments as primers. PCR-specific starters are added, and finally the long DNA molecule is amplified. PCA requires the melting temperatures of the overlapping regions to be similar for all fragments. The necessary primer optimization should be performed using specialized oligonucleotide design programs; several solutions for automated primer design for gene synthesis using PCA have been reported [6]. Correct syntheses of an 800 nt strand have also been reported [7].

Hindawi Publishing Corporation, BioMed Research International, Volume 2015, Article ID 413262, 8 pages, http://dx.doi.org/10.1155/2015/413262


In the present procedure, we do not use PCA, because long molecule synthesis using PCA requires several steps, where each step extends the output molecule by hybridization, polymerization, and denaturation. We extend the assembly method using ligase chain reaction (LCR) [8, 9]. This method uses exactly one sequence of denaturation, hybridization, polymerization, and ligation. The chemically synthesized DNA strands are mixed together; then the T4 ligase enzyme connects the fragments. Next, polymerase creates the missing pieces, and finally PCR is used to amplify the output molecule. The proposed modification of the assembly method by ligation is to carry out the reaction at decreasing temperatures, which helps in providing a deterministic assembly of fragments.

The artificial gene synthesis process uses reverse translation methods for developing the gene DNA sequence. The 20 amino acids are encoded by $4^3 = 64$ nucleotide triplets (codons); some amino acids are encoded by several codons. Genetic code redundancy allows modifying the DNA sequence without changing the encoded protein sequence. The reverse translation method considers different choices from a suite of synonymous codons, taking into account requirements such as the absence of regulatory, repetitive, and extraordinarily high GC content subsequences, and the preferences of the host organism for particular codons. The brute-force approach to reverse translation, screening all possible alternative sequences, rapidly becomes impractical because the number of such sequences grows exponentially with the length of the sequence. Thus, some empirical rules (heuristics) are used in all 37 reported computer programs [10].
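To make the exponential growth concrete, the following minimal Python sketch counts how many DNA sequences encode a given peptide under the standard genetic code. The function and variable names are illustrative and not part of DNASynth; the peptide is assumed to be written in one-letter amino acid codes.

```python
from itertools import product
from math import prod

# Standard genetic code, codons enumerated in TCAG order
# (first base varies slowest, third base fastest); "*" marks stop codons.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Invert the table: amino acid -> list of synonymous codons.
SYNONYMS = {}
for codon, aa in CODON_TO_AA.items():
    SYNONYMS.setdefault(aa, []).append(codon)


def count_encodings(peptide):
    """Number of distinct DNA sequences that encode the peptide."""
    return prod(len(SYNONYMS[aa]) for aa in peptide)


if __name__ == "__main__":
    for peptide in ("MW", "MKL", "MKLSR"):
        print(peptide, count_encodings(peptide))
```

Each residue multiplies the count by its number of synonymous codons (between 1 and 6), which is why exhaustive screening is feasible only for very short peptides.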

In the present paper, we describe a computer program that designs the whole artificial gene synthesis process: developing the optimal nucleotide sequence encoding a given peptide for a given host organism and determining the best long DNA assembly protocol. These tasks are closely connected: synonymous codon substitutions may lead to changes in the optimal assembly protocol, and the lack of a simple assembly protocol may be addressed by changing the nucleotide sequence. We also describe an assembly method where the fragments are designed for carrying out the reaction at a decreasing temperature.

Existing programs for gene synthesis that automatically design oligonucleotides are developed mainly for the PCA method [11–15]. An exception is TmPrime [16], which supports the LCR method but requires synthesizing oligonucleotides that fully cover both the coding and the noncoding strands. The approach proposed in this paper allows LCR assembly with gaps in the noncoding strand, which leads to a significant reduction in the cost of gene synthesis.

The calculation is a single multicriteria optimization task. The algorithm output is a set of DNA fragments that are short enough to be chemically synthesized and a protocol for their assembly. The melting temperatures of the overlapping regions are slightly different for each pair. If the reaction cannot be performed in one mixture, the possibility of performing the reactions in separate tubes and then joining the results is considered.

Figure 1: Long DNA base synthesis protocol. Shorter fragments are synthesized chemically; then, during hybridization with a slowly decreasing temperature, they form the proper molecule. Next, ligation joins the parts into a longer molecule. Finally, PCR with specific primers amplifies the correct DNA strands. If a correct molecule cannot be created because the fragments fold in an abnormal way, the reaction is performed in separate tubes (complex protocol).
[Diagram labels: main chains A, B, C, and D; helper chains X′, Y′, and Z′; 3′ and 5′ ends; steps: DNA ligation, PCR.]

2. Materials and Methods

The method of producing a long DNA molecule, encoding a given peptide for a given host organism, from shorter chemically synthesized fragments is called the synthesis protocol here. The protocol includes the sequences of the shorter fragments and a method for assembling them. A single-mixture synthesis is called the base protocol; a multitube synthesis is called a complex protocol. If a given DNA molecule cannot be created correctly by a base protocol (one-tube reaction), for example, because the fragments fold in an incorrect way, a complex protocol is considered (reaction in separate tubes).

2.1. DNA Assembly Method. The base synthesis protocol is shown in Figure 1; it is essentially a modified assembly by ligation [8]. Shorter fragments are synthesized chemically, and then the longer molecule is formed by hybridization and ligation. Finally, PCR with specific starters (primers) is used to amplify the correct DNA strands. The mixture consists of fragments of the desired DNA strand, called main chains (the molecules labeled A, B, C, and D in Figure 1), and helper chains, each complementary to two adjacent fragments and used for joining them (molecules X, Y, and Z in Figure 1).
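The relation between main and helper chains can be sketched in a few lines of Python. The helper is taken as the reverse complement of the target region spanning a junction between two main chains. The function names and the 20 nt overlap on each side are illustrative assumptions; the three sequences are the first three fragments listed in Table 1, and treating the second one as the helper for the other two is our reading of that table.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]


def helper_chain(target, junction, left, right):
    """Helper strand pairing with `left` nt before and `right` nt after a junction."""
    return reverse_complement(target[junction - left:junction + right])


# First three fragments of Table 1: two main chains and the strand listed between them.
main_a = "ATGTGCTACATGAATTGTATAATACAGTGTATCCTAGGAA"
main_b = "CGCATGAGCTGACCCAGATTTTTCTCGATGACTCCTACGCTA"
listed_helper = "AATCTGGGTCAGCTCATGCGTTCCTAGGATACACTGTATT"

target = main_a + main_b
helper = helper_chain(target, junction=len(main_a), left=20, right=20)

if __name__ == "__main__":
    print(helper)
    print(helper == listed_helper)
```

In this reading, the helper holds the 3′ end of one main chain next to the 5′ end of the following one so that ligase can join them.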

In the assembly protocol presented, the temperature is changed slowly from a temperature high enough to unfold all the fragments to a temperature where all fragments will hybridize. The temperature management thus differs from other assembly protocols, where the folding temperature is similar for all fragments. The change helps to form the correct molecule because fragments fold in the desired order. Using different temperatures for different fragments requires more advanced fragment sequence design. The number of sets of fragments considered in synthesizing a given DNA molecule is much bigger, leading to an increase in algorithm time complexity.

Algorithm 1: Optimization algorithm used.

    P ← {i_0, i_1, ..., i_n}           ▷ initialization
    repeat
        P′ ← P
        for all i ∈ P do
            i′ ← mutate(i)             ▷ no crossover, only mutation
            P′ ← P′ ∪ {i′}             ▷ succession
        end for
        P ← selection(P′)              ▷ proportional, ranking, or tournament
    until stop                         ▷ number of generations

Of course, there is no guarantee that every set of fragments will assemble in the desired order. We examine every possible pair of strands from the solution (main and helper chains), including pairs consisting of two identical molecules (a fragment with itself). These examinations use DNA secondary structure prediction algorithms, which predict how the strands will fold and what temperature is needed to unfold them. The results, ordered by descending temperature, define the sequence of strand foldings. This order depends strongly on the sequences of the fragments.
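The pair examination can be outlined as follows. This is only a sketch: the scoring function `longest_duplex` is a deliberately crude placeholder (the length of the longest perfectly complementary stretch) standing in for the secondary structure prediction and melting temperature estimate, and all names are hypothetical.

```python
from itertools import combinations_with_replacement

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def longest_duplex(a, b):
    """Placeholder stability score: longest perfectly complementary stretch of a and b."""
    rc = reverse_complement(b)
    best = 0
    prev = [0] * (len(rc) + 1)
    for ca in a:                       # longest common substring of a and rc
        cur = [0] * (len(rc) + 1)
        for j, cb in enumerate(rc, start=1):
            if ca == cb:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def folding_order(strands, score=longest_duplex):
    """Score every pair, including each strand with itself, most stable first."""
    pairs = combinations_with_replacement(sorted(strands), 2)
    scored = [(score(strands[x], strands[y]), x, y) for x, y in pairs]
    return sorted(scored, key=lambda t: -t[0])


if __name__ == "__main__":
    demo = {"A": "AAAAACCCCC", "X": "GGGGGTTTTT"}
    for entry in folding_order(demo):
        print(entry)
```

For m strands this examines m(m + 1)/2 pairs, so the cost of each pair evaluation dominates the running time.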

It is desirable that, at each moment, the fragments in solution are either correctly folded or completely unfolded. A correctly folded fragment is created by a pair that folds into a coherent double-stranded structure with dangling single-stranded ends. If the pair contains a strand that is already folded, then the double-stranded structure is extended either by a main chain or by a helper chain. Ultimately, the whole long DNA strand is composed and can be subjected to further treatment, such as completing the bonds between the chains by DNA ligase, as shown in Figure 1.

Computer simulations for the base protocol with random sequences of length 60 to 80 nt have shown that about 1% of possible mixtures of three strands (two main chains and one helper chain) lead to correct long molecule creation. In 10000 simulations, the main chains were calculated by drawing, with uniform distribution, the position on the random target sequence that divided it into two subsequences of length 20 to 40 nt. The helper chain was complementary to a region near this position. Most mixtures were unsuitable for synthesis by the base protocol due to incorrect folding. Fortunately, the number of good solutions can be increased by extending the base protocol to the complex one.

Long strand assembly can be achieved in steps, as shown in Figure 2. In each step, conflicting strands (strands forming improper pairs) are separated into different tubes. Then the base protocol can be used in each tube. Chains formed by the base protocol become the input strands for the next base protocol, and the procedure is repeated until the full-length strand is obtained.

Figure 2: Long DNA complex protocol; multiple tubes are used, and the base protocol is applied in each tube.

Obviously, the complex synthesis protocol is more expensive than the base one. The execution of each step in the laboratory takes about 24 h, multiple probes are needed, and the number of required nucleotides is noticeably greater.

2.2. Computer Program Implementation. The application searches for the artificial gene synthesis process in the space of possible DNA representations of a given protein and possible DNA assembly protocols described above (base or complex). Each synthesis process is defined by a set of DNA sequences (fragment sequences) and a protocol for their assembly (which fragments go in which tube). Each synthesis process has a quality measure assigned; this quality is used to optimize in the space of DNA representations and assembly protocols using a combination of an evolutionary algorithm for global optimization and a hill-climbing algorithm for fine tuning. The algorithm is depicted in Algorithm 1.

The synthesis process representation is a solution candidate (individual) for the evolutionary algorithm. It is represented by the target DNA strand (in the form of a sequence of nucleotides), the collection of positions in this strand describing the places of main chain separation, and another collection of positions describing the helper chains. Each fragment (main or helper chain) meets the constraints on minimum and maximum strand length (chemical synthesis limitations). Individuals differ in three aspects: the nucleotide sequence encoding a given peptide (different codons), the main chain fragmentation, and the helper chain selection. The number of individuals in a population is an algorithm parameter and can be modified by the user.

The initial set of individuals (initial population) is created as follows. The target DNA sequence is created from a given peptide sequence by choosing codons; the codon encoding a given amino acid is chosen randomly with equal probability. The places of fragment separation are also chosen randomly with uniform distribution, respecting the minimum and maximum length of a fragment. The length of a helper chain is random, with a uniform distribution between the minimal and maximal length of a fragment. Its sequence is complementary to the sequence around a location of main chain separation.
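A minimal sketch of this initialization is given below, under stated assumptions: the standard genetic code is used, the cut positions are drawn by simple rejection sampling (which respects the length limits but is not guaranteed to reproduce the exact distribution used by DNASynth), and helper chains are omitted for brevity. All names are illustrative.

```python
import random
from itertools import product

# Standard genetic code in TCAG order; "*" marks stop codons.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
SYNONYMS = {}
for codon, aa in zip(product(BASES, repeat=3), AMINO):
    SYNONYMS.setdefault(aa, []).append("".join(codon))


def random_reverse_translation(peptide, rng):
    """Pick one synonymous codon per residue with equal probability."""
    return "".join(rng.choice(SYNONYMS[aa]) for aa in peptide)


def random_cuts(length, min_len, max_len, rng, tries=1000):
    """Cut positions such that every main chain has min_len..max_len nt."""
    for _ in range(tries):
        cuts, pos = [], 0
        while length - pos > max_len:          # keep cutting until the rest fits
            pos += rng.randint(min_len, max_len)
            cuts.append(pos)
        if length - pos >= min_len:            # last fragment must not be too short
            return cuts
    raise ValueError("no valid fragmentation found")


if __name__ == "__main__":
    rng = random.Random(0)
    dna = random_reverse_translation("MCYMNCIIQ", rng)
    print(dna, random_cuts(300, 20, 40, rng))
```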

All individuals are chosen for reproduction. Every parent produces a single mutated child by applying, with equal probability, one of the following transformations:

(i) replacement of a random codon (nucleotide triplet) in the sequence;

(ii) increment or decrement of a random position describing the separation of the target DNA into main chains, so that one fragment is shortened and the other is extended;

(iii) shortening, extension, or a shift to the left or right of one randomly chosen helper chain.
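The three transformations can be sketched as below. The representation of an individual as a dictionary with keys `dna`, `cuts`, and `helpers`, the one-nucleotide step size, the assumption that a replaced codon is synonymous, and the default length limits (20 nt is assumed; 80 nt follows the synthesis limit quoted in the Introduction) are illustrative choices, not the DNASynth data structures.

```python
import random
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def swap_codon(ind, rng):
    """(i) Replace one random codon by a codon for the same amino acid."""
    dna = ind["dna"]
    i = 3 * rng.randrange(len(dna) // 3)
    aa = CODON_TO_AA[dna[i:i + 3]]
    choices = [c for c, a in CODON_TO_AA.items() if a == aa]
    return {**ind, "dna": dna[:i] + rng.choice(choices) + dna[i + 3:]}


def shift_cut(ind, rng):
    """(ii) Move one cut by one nt: one main chain shrinks, its neighbour grows."""
    cuts = list(ind["cuts"])
    if not cuts:
        return ind
    i = rng.randrange(len(cuts))
    cuts[i] += rng.choice((-1, 1))
    return {**ind, "cuts": cuts}


def change_helper(ind, rng):
    """(iii) Shorten, extend, or move one helper chain, stored as (start, end)."""
    helpers = list(ind["helpers"])
    if not helpers:
        return ind
    i = rng.randrange(len(helpers))
    start, end = helpers[i]
    d_start, d_end = rng.choice([(1, 0), (-1, 0), (0, -1), (0, 1), (-1, -1), (1, 1)])
    helpers[i] = (start + d_start, end + d_end)
    return {**ind, "helpers": helpers}


def is_valid(ind, min_len, max_len):
    """Check the chemical synthesis length limits on every fragment."""
    size = len(ind["dna"])
    bounds = [0, *ind["cuts"], size]
    lengths = [b - a for a, b in zip(bounds, bounds[1:])]
    lengths += [e - s for s, e in ind["helpers"]]
    inside = all(0 <= s and e <= size for s, e in ind["helpers"])
    return inside and all(min_len <= n <= max_len for n in lengths)


def mutate(ind, rng, min_len=20, max_len=80):
    """Apply one operator chosen with equal probability; retry if limits are violated."""
    while True:  # terminates for a valid parent, since swap_codon keeps it valid
        child = rng.choice((swap_codon, shift_cut, change_helper))(ind, rng)
        if is_valid(child, min_len, max_len):
            return child
```

The sketch does not check that a helper chain still spans its junction; a full implementation would need that constraint as well.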

The fitness of an individual depends on the codon frequencies used, the total number of nucleotides used for the artificial gene synthesis, and the number of steps required for assembly (number of required base protocols).

The individual's fitness can be represented by formula (1), where $F_C$, $F_N$, and $F_S$ represent the codon frequency component, the sum of the lengths of molecules used for assembly, and the number of base assembly protocols, respectively. $F_C$ is the root mean square of the differences between required and obtained codon frequencies, scaled by weight $w_C$. For codon $j$, $C_{rj}$ denotes the optimal (required) codon frequency, for example, as available at [17]; $C_{ij}$ is the individual's codon frequency; $n_j$ is the number of occurrences of codon $j$; and $n = \sum_j n_j$ is the number of codons. The calculation of $F_N$ uses $N$, the number of nucleotides used for assembly (the sum of all fragments' lengths); $N_{\max} = 2N_{\text{main}}$ is the maximum number of nucleotides that can be used (the helper chains have maximum length); $N_{\text{main}}$ is the length of the target DNA sequence; $N_{\min} = N_{\text{main}} + (k - 1)L_{\min}$ is the minimum number of nucleotides that can be used (all helper chains have the minimal length $L_{\min}$, an algorithm parameter and a property of chemical synthesis); $k$ is the number of main chains; and $w_N$ is the weight of this component. The calculation of $F_S$ uses weight $w_S$ and $S$, the number of base protocols involved in the assembly. Consider

$$F = \frac{1}{1 + F_C + F_N + F_S}, \qquad (1)$$

where

$$F_C = w_C \cdot \sum_j \sqrt{\frac{n_j \left(C_{rj} - C_{ij}\right)^2}{n}},$$

$$F_N = w_N \cdot \frac{N - N_{\min}}{N_{\max} - N_{\min}},$$

$$F_S = w_S \cdot \frac{S - 1}{2}.$$
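Formula (1) translates directly into code. The sketch below reads the codon component with the square root taken per codon and then summed, as the extracted formula suggests; the argument names and the default weights of 1.0 are assumptions made for illustration.

```python
from math import sqrt


def fitness(required, observed, counts, N, N_main, k, L_min, S,
            w_C=1.0, w_N=1.0, w_S=1.0):
    """Fitness F = 1 / (1 + F_C + F_N + F_S) following formula (1).

    required, observed: codon -> frequency (required vs. the individual's)
    counts:             codon -> number of occurrences n_j in the sequence
    N:                  total nucleotides in all fragments
    N_main:             length of the target DNA sequence
    k:                  number of main chains
    L_min:              minimal fragment length
    S:                  number of base protocols in the assembly
    """
    n = sum(counts.values())
    F_C = w_C * sum(
        sqrt(counts[j] * (required[j] - observed[j]) ** 2 / n) for j in counts
    )
    N_max = 2 * N_main                      # all helper chains at maximum length
    N_min = N_main + (k - 1) * L_min        # all helper chains at minimum length
    F_N = w_N * (N - N_min) / (N_max - N_min)
    F_S = w_S * (S - 1) / 2
    return 1 / (1 + F_C + F_N + F_S)


if __name__ == "__main__":
    freq = {"CTG": 0.5, "AAA": 0.5}
    counts = {"CTG": 46, "AAA": 46}
    print(fitness(freq, freq, counts, N=376, N_main=276, k=6, L_min=20, S=1))
```

With matching codon frequencies, the minimum number of nucleotides, and a single base protocol, all three penalty terms vanish and the fitness reaches its maximum of 1.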

The individual fitness calculations include determining the optimal assembly protocol. Each pair of fragments, including a pair created from a fragment with itself, is considered in finding the folding order of the pairs. This order depends on each pair's melting temperature. A modified Zuker algorithm [18] with energy parameters for DNA [19] is used; it is a dynamic programming algorithm that calculates the secondary structure free energy using thermodynamic data. Our modification allows the algorithm to be used for a pair of strands. Because the folding temperature depends directly on the secondary structure free energy returned by the algorithm, it is used to examine the fragment joining order. If the joining order is improper, the conflicting pair is placed in separate tubes and the base assembly protocol is changed to a complex one.
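The dynamic programming structure can be illustrated with a much simpler relative of the Zuker algorithm. The sketch below is not the modified Zuker algorithm used in DNASynth: it implements Nussinov-style base-pair maximization, which counts Watson-Crick pairs instead of minimizing free energy, and it handles two strands by joining them with an inert linker, which is an assumption made here for illustration.

```python
PAIRS = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G")}


def max_pairs(seq, min_loop=3):
    """Maximum number of nested Watson-Crick pairs (Nussinov-style recursion)."""
    n = len(seq)
    dp = [[0] * n for _ in range(n)]           # dp[i][j]: best count for seq[i..j]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]                # base j left unpaired
            for k in range(i, j - min_loop):   # base j paired with base k
                if (seq[k], seq[j]) in PAIRS:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1] if n else 0


def max_pairs_two_strands(a, b, linker="NNN"):
    """Treat two strands as one by inserting a linker that pairs with nothing."""
    return max_pairs(a + linker + b)


if __name__ == "__main__":
    print(max_pairs("GGGAAATCCC"))
    print(max_pairs_two_strands("GGGG", "CCCC"))
```

The triple loop gives cubic running time in the strand length; a thermodynamic model with loop energies adds further inner loops, which is where the higher complexity of the real algorithm comes from.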

In every generation, parents and offspring are selected for succession. There are three methods of selection: proportional, ranking, or tournament ($k = 2$). The method is chosen by the user (settings file), as are the number of generations (stop condition) and some other parameters.
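The overall loop of Algorithm 1 with tournament selection can be sketched as follows. The individuals, the `mutate` and `fitness` callables, and the toy problem in the usage example are placeholders; only the loop shape (mutation without crossover, parents and offspring competing, binary tournament) follows the description above.

```python
import random


def tournament(pool, fitness, size, rng, k=2):
    """Fill the next population: each slot goes to the fittest of k random entrants."""
    return [max(rng.sample(pool, k), key=fitness) for _ in range(size)]


def evolve(population, mutate, fitness, generations, rng):
    """Mutation-only evolutionary loop following the shape of Algorithm 1."""
    for _ in range(generations):
        offspring = [mutate(ind, rng) for ind in population]  # every parent reproduces
        pool = population + offspring                         # parents and offspring compete
        population = tournament(pool, fitness, len(population), rng)
    return max(population, key=fitness)


if __name__ == "__main__":
    # Toy problem: integers that should move towards 42.
    rng = random.Random(0)
    best = evolve(
        population=[0] * 20,
        mutate=lambda x, r: x + r.choice((-1, 1)),
        fitness=lambda x: -(x - 42) ** 2,
        generations=200,
        rng=rng,
    )
    print(best)
```

Tournament selection without elitism can lose the current best individual, so a production implementation might keep a separate record of the best solution seen.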

2.3. Software Architecture. The application was implemented in a client-server, three-layered software architecture, where the presentation layer was deployed on the client machine and the data processing and data storage layers were deployed on the server. We developed the software based on the bioweb [20] framework; that is, we use C++, Python, Apache Flex, and JavaScript. The C++ language was chosen for the algorithm implementation because it provides a good trade-off between the availability of optimization techniques, portability, and extensibility. Python was used to connect all server modules; the server is set up on the Django framework. The user interface uses Apache Flex; therefore only a web browser is required on the client machine.

The tests show that a single individual's fitness calculation for real-life input takes seconds or even minutes. For example, on a typical 3 GHz CPU core it takes an average of 90 s to compute a 1000 nt gene synthesis simulation. This was too long for finding the best assembly protocol. Thus, we decided to include additional heuristics to speed up calculations, to use distributed calculations (many computers), and to take advantage of GPU power.

For distributed calculations, we used the Common Object Request Broker Architecture (CORBA). Each individual's fitness is evaluated independently, so the calculations can be delegated to different cores, processors, and computers.

Table 1: Fragments used to synthesize the Ubp4′ gene. Ubp4′1–Ubp4′11 are used in the assembly protocol and Ubp4′For and Ubp4′Rev for PCR.

Name       Sequence
Ubp4′1     ATGTGCTACATGAATTGTATAATACAGTGTATCCTAGGAA
Ubp4′2     AATCTGGGTCAGCTCATGCGTTCCTAGGATACACTGTATT
Ubp4′3     CGCATGAGCTGACCCAGATTTTTCTCGATGACTCCTACGCTA
Ubp4′4     TAGAGTTAATATTGATGTGCTTAGCGTAGGAGTCATCGAGA
Ubp4′5     AGCACATCAATATTAACTCTAAACTAGGCTCTAAGGGCATCTTGG
Ubp4′6     GCACGAGTCGAGCGAAATACTTTGCCAAGATGCCCTTAGAGCCT
Ubp4′7     CAAAGTATTTCGCTCGACTCGTGCACATGATGTATAAAGAGCAAGTA
Ubp4′8     GGGGAAATGGATATTTTCTTAGAACCGTCTACTTGCTCTTTATACATCAT
Ubp4′9     GACGGTTCTAAGAAAATATCCATTTCCCCAATAAAATTTAAATTAGCATGTGGATCCGT
Ubp4′10    GCGGTTTTGAACAACGAGTTTACGGATCCACATGCTAATTTAA
Ubp4′11    AAACTCGTTGTTCAAAACCGCCTCACAACAGGATTGTCAGTAA
Ubp4′For   GGGGGAATTCGATATGTGCTACATGAATTGTATAATAC
Ubp4′Rev   GGGGGGATCCTTACTGACAATCCTGTTGTGAG

The most time-consuming task (99% of the whole application execution time) was the calculation of free energy using the modified Zuker algorithm. The time complexity of this task was $\Theta(n^4)$, and with the use of heuristics this could be reduced to $\Theta(n^3 + 30^2 n^2)$. We noticed that a mutation can affect only 1–3 strands in a probe; thus best efforts were made to minimize the number of such operation calls by storing the results and making them inheritable by mutated individuals. As a result, once the 1000 nt gene synthesis protocol had been evaluated, its derivatives were calculated 10 times faster.
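The idea of storing pair results so that mutated individuals inherit them can be sketched with a small cache keyed by the two strand sequences. The class name, the symmetric treatment of a pair, and the stand-in folding function in the usage example are assumptions for illustration only.

```python
class PairEnergyCache:
    """Memoize an expensive pair-folding routine by the sequences of the two strands."""

    def __init__(self, fold_pair):
        self.fold_pair = fold_pair      # expensive function (a, b) -> energy
        self.cache = {}
        self.calls = 0                  # number of real evaluations performed

    def energy(self, a, b):
        key = (a, b) if a <= b else (b, a)   # assumes the pair order does not matter
        if key not in self.cache:
            self.calls += 1
            self.cache[key] = self.fold_pair(*key)
        return self.cache[key]

    def evaluate(self, strands):
        """Energies for all pairs of an individual, self-pairs included."""
        return {(a, b): self.energy(a, b)
                for i, a in enumerate(strands) for b in strands[i:]}


if __name__ == "__main__":
    cache = PairEnergyCache(lambda a, b: -(len(a) + len(b)))  # stand-in for folding
    cache.evaluate(["AAAA", "CCCC", "GGGG"])                   # parent: 6 evaluations
    cache.evaluate(["AAAA", "CCCC", "GGGT"])                   # child: only 3 new ones
    print(cache.calls)
```

When a mutation changes one strand out of m, only the m pairs involving that strand need to be recomputed instead of all m(m + 1)/2.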

To increase the speed of the DNA secondary structure prediction calculations, a module was implemented in Compute Unified Device Architecture (CUDA) to take advantage of GPU power. Every subchain is considered independently; thus it is possible to make these calculations in parallel and then combine the results. The calculation speedup was greater for longer strands. For example, a 384 nt sequence was calculated 12 times faster on a GeForce GTX 460 than on a 3 GHz CPU core. The speed increase of the whole application was also significant: the 1000 nt gene synthesis simulation with GPU support took 40 s on average and was 2.5 times faster than with the CPU. These results allowed designing a synthesis of a typical protein in a few hours; however, we are considering other algorithms to speed up the calculations, for example, using known structures stored in a database [21] and using the information from homologous sequences [22].

3. Results

The DNASynth application was used for the synthesis of an analogue of the yeast Saccharomyces cerevisiae Ubp4 protease gene. Protease Ubp4p is an enzyme that cleaves ubiquitin from proteins fused to its C-terminus. This gene sequence was described in 1995 (GenBank CAA89098). The whole gene is 2778 bp; however, shorter analogues, 276 bp (Ubp4′) and 1083 bp (Ubp4′′), were designed.

The Ubp4′ synthesis involves the 11 fragments designed by the present computer program and listed in Table 1. These fragments could be mixed in one tube without risk of incorrect connections, so a one-step synthesis could be used. An evolutionary algorithm with 200 individuals in the population, 466 generations, and tournament selection found this assembly protocol in 100 h on a PC with a 3 GHz CPU and an NVIDIA GeForce GTX 460 GPU.

The Ubp4′1, Ubp4′3, Ubp4′5, Ubp4′7, Ubp4′9, and Ubp4′11 fragments had to be phosphorylated. We recommend using commercially phosphorylated primers. Then 20 pmol of each phosphorylated (Ubp4′1, Ubp4′3, Ubp4′5, Ubp4′7, Ubp4′9, and Ubp4′11) and unphosphorylated primer (Ubp4′2, Ubp4′4, Ubp4′6, Ubp4′8, and Ubp4′10) was placed in a tube with ligation buffer in a volume of 50 μL and heated to 94°C. Then the reaction mixture was cooled slowly to room temperature over about 1 h. Next, we added 1 μL of ligase and left the mixture for 10 min at 22°C.

The PCR was performed in a 50 μL reaction volume with a buffer containing 50 mM KCl, 2 mM MgCl2, 0.02 mM of each dNTP, 75 mM Tris-HCl (pH 8.9), 20 mM (NH4)2SO4, 25 pM of each primer, Ubp4′Rev and Ubp4′For (with additional sequences for the restriction enzymes NdeI, EcoRI, and BamHI), Biotools DNA polymerase (Biotools B&M Labs S.A.) or Taq DNA polymerase with standard Taq buffer (New England Biolabs Inc.), and 1 μL as a template, for 23 and 29 cycles, using an Eppendorf 5330 thermocycler. Each of the first 4 cycles consisted of 40 s at 94°C, 40 s at 47°C, and 1 min at 72°C; each of the following 19 or 25 cycles consisted of 40 s at 94°C, 40 s at 52°C, and 1 min at 72°C.

The annealing temperature was higher in the later cycles to improve the specificity of the PCR reaction. The amplified 302 bp long DNA fragment was isolated by 1% agarose gel electrophoresis (Figure 3). It was cloned into the pBlue vector and next into the expression vector pT7RSNHU at the NdeI and BamHI sites. The nucleotide sequences of the inserts were verified by automatic sequencing; 20 plasmids were analyzed in this manner. The DNA sequence was assembled correctly in 80% of the sequenced plasmids. The results of the sequencing indicate high efficacy and usefulness of DNASynth for obtaining synthetic genes.

Figure 3: PCR products of UBP4′. M: marker (bp), GeneRuler DNA Ladder Mix (Fermentas Life Sciences); lanes 1–8: UBP4′ gene (276 bp UBP4′ length and 26 bp for restriction enzyme sequences = 302 bp) PCR product; lanes 1, 2: PCR reaction with Biotools DNA polymerase and 23 cycles; lanes 3, 4: PCR reaction with Biotools DNA polymerase and 29 cycles; lanes 5, 6: PCR reaction with Biotools DNA polymerase and 29 cycles; and lanes 7, 8: PCR reaction with Taq DNA polymerase and 29 cycles.

Table 2: Fragments used to synthesize the Ubp4′′a gene. Ubp4′′a1–Ubp4′′a17 are used in the assembly protocol and Ubp4′′aFor and Ubp4′′aRev for PCR.

Name         Sequence
Ubp4′′a1     TTCGCGGTGGGCCTCGAGAATCTAGGAAATTCCTGCTATATGAACTGCATTATCC
Ubp4′′a2     CAGCTCATGCGTCCCTAAAATACACTGGATAATGCAGTTCATATAGCAGGAATTTCCTAG
Ubp4′′a3     AGTGTATTTTAGGGACGCATGAGCTGACGCAGATCTTCCTTGACGATTCGTATGCGAA
Ubp4′′a4     CCCTAGTTTAGAATTGATATTGATGTGCTTCGCATACGAATCGTCAAGGAAGA
Ubp4′′a5     GCACATCAATATCAATTCTAAACTAGGGTCGAAAGGTATTTTAGCTAAATATTTCGCTCGTCTTGTAC
Ubp4′′a6     TTGGAGCCATCTACCTGTTCCTTATACATCATATGTACAAGACGAGCGAAATATTTAGCT
Ubp4′′a7     ATATGATGTATAAGGAACAGGTAGATGGCTCCAAAAAAATTTCGATTAGCCCAATCAAATTTAAGCT
Ubp4′′a8     AACAGTGAGTTGACGGAACCACAGGCGAGCTTAAATTTGATTGGGCTAATCG
Ubp4′′a9     CGCCTGTGGTTCCGTCAACTCACTGTTTAAAACAGCTTCACAGCAAGATTGTCAGGAATTTTGCC
Ubp4′′a10    GATCTTCATGCAAACCGTCAAGCAAGAATTGGCAAAATTCCTGACAATCTTGCTGTGA
Ubp4′′a11    AATTCTTGCTTGACGGTTTGCATGAAGATCTGAACCAGTGCGGTTCTAACCCC
Ubp4′′a12    CCTCTTGGCTCAGTTCTTTCAACGGGGGGTTAGAACCGCACTGGTTCA
Ubp4′′a13    CCGTTGAAAGAACTGAGCCAAGAGGCAGAAGCGAGGCGGGAAAAGCTGAGCCTC
Ubp4′′a14    GTTCCCATTCAATACTACTGGCTATGCGGAGGCTCAGCTTTTCCCGCCTCGC
Ubp4′′a15    CGCATAGCCAGTAGTATTGAATGGGAACGCTTTCTGACCACGGATTTTAGCGTCATTGTG
Ubp4′′a16    AGCGAGAGGCATATTGCCCTTGAAAGAGGTCCACAATGACGCTAAAATCCGTGG
Ubp4′′a17    GACCTCTTTCAAGGGCAATATGCCTCTCGCTTAAAATGCAAAGTATGCAGCCACACTTCA
Ubp4′′aFor   GGGGGAATTCGATATGTTCGCGGTGGGCCTCGAGAATC
Ubp4′′aRev   GGGGGGATCCTTATGAAGTGTGGCTGCATACTTTGCATTTTAAG

The longer Ubp4′′ gene was prepared from 34 fragments. The DNASynth program suggested a complex protocol with two tubes: Ubp4′′a for the part of the DNA indexed 1–540 and Ubp4′′b for the part indexed 541–1083. The fragments used for Ubp4′′a synthesis are listed in Table 2 and the fragments for Ubp4′′b in Table 3. The PCR products were obtained as described previously. We obtained PCR products 566 bp and 569 bp long (Figures 4 and 5).

Ubp4′′ is the product of the ligation of Ubp4′′a and Ubp4′′b. It was cloned into the pBlue 3 vector and next into the expression vector pT7RSNHU. The nucleotide sequences of the inserts were verified by automatic sequencing. For the 20 plasmids analyzed in this manner, the DNA sequence was assembled correctly in at least 80% of the sequenced plasmids. The results of sequencing indicate high efficacy and usefulness of DNASynth for obtaining synthetic genes with length exceeding 1000 bp.

Figure 4: 1% agarose gel electrophoresis of PCR products of UBP4′′a. M: marker (bp), GeneRuler DNA Ladder Mix (Fermentas Life Sciences); lane 1: UBP4′′a gene (540 bp UBP4′′a length and 26 bp for restriction enzyme sequences = 566 bp), PCR reaction with Biotools DNA polymerase and 23 cycles.

4. Discussion and Conclusion

DNA molecules of gene size are routinely created using PCA by research laboratories and specialized companies (Blue Heron Technology, DNA 2.0, GENEART, and others). These molecules are assembled into a synthetic chromosome or a synthetic genome using the isothermal method [23, 24] or other techniques [25, 26]. The presented protocol could speed up these tasks.


Table 3: Fragments used to synthesize the Ubp4′′b gene. Ubp4′′b1–Ubp4′′b17 are used in the assembly protocol and Ubp4′′bFor and Ubp4′′bRev for PCR.

Name         Sequence
Ubp4′′b1     ACCACATATCAGCCCTTTACGGTACTGTCTATCCCTATACCCAAAAAGAATA
Ubp4′′b2     AGCAATCCTCGATAGTAATGTTATTGCGACTATTCTTTTTGGGTATAGGGATAGACAGT
Ubp4′′b3     GTCGCAATAACATTACTATCGAGGATTGCTTCCGCGAATTCACCAAATGCGAAAATCTCGAGGTAG
Ubp4′′b4     TTCACAGTGCGGACAGAGCCATTGCTCATCTACCTCGAGATTTTCGCATTTGGT
Ubp4′′b5     ATGAGCAATGGCTCTGTCCGCACTGTGAAAAACGTCAACCGTCAACTAAACAGCTGACCATTACC
Ubp4′′b6     TTTTAGATGGACAATGAGGTTCCGCGGCAAGCGGGTAATGGTCAGCTGTTTAGTTGACGGT
Ubp4′′b7     CGCTTGCCGCGGAACCTCATTGTCCATCTAAAACGTTTTGATAACCTGCTTAACAAAAACA
Ubp4′′b8     AGCAGAAATGGATAAATCACGAAGTCATTGTTTTTGTTAAGCAGGTTATCA
Ubp4′′b9     ATGACTTCGTGATTTATCCATTTCTGCTGGATTTAACGCCTTTTTGGGCAAACGACTT
Ubp4′′b10    TTCACCCCCGGCGGGAACACGCCATCAAAGTCGTTTGCCCAAAAAGGCGTTA
Ubp4′′b11    TGATGGCGTGTTCCCGCCGGGGGTGAACGACGATGAATTACCAATCCGAGGTCAGATCCCGCC
Ubp4′′b12    CACGCGACACCATACAATTCATATTTAAATGGCGGGATCTGACCTCGGATTGGTAATT
Ubp4′′b13    ATTTAAATATGAATTGTATGGTGTCGCGTGCCATTTCGGGACACTGTATGGCGGCCAC
Ubp4′′b14    TTAAACCTTTTTTCACATAAGCCGTATAGTGGCCGCCATACAGTGTCCC
Ubp4′′b15    TATACGGCTTATGTGAAAAAAGGTTTAAAAAAGGGATGGCTTTATTTTGATGA
Ubp4′′b16    GCGTCGGCTTTGTTTTTAACCGGCTTATACTTGGTATCATCAAAATAAAGCCATCC
Ubp4′′b17    TACCAAGTATAAGCCGGTTAAAAACAAAGCCGACGCAATTAATAGCAATGCATATGTTCTGTTCTAT
Ubp4′′bFor   GGGGGGATCCACCACATATCAGCCCTTTACGGTAC
Ubp4′′bRev   GGGGTCTAGATTAATGGTGATGGTGATGGTGATAGAACAGAACATATGCATTGCTATTAATTGCG

Figure 5: Analysis of the products was performed by 8% acrylamide gel electrophoresis. PCR products of UBP4′′b. M: marker (bp), GeneRuler DNA Ladder Mix (Fermentas Life Sciences); lane 1: UBP4′′b gene (543 bp UBP4′′b length and 26 bp for restriction enzyme sequences = 569 bp), PCR reaction with Biotools DNA polymerase and 23 cycles.

The proposed software designs a synthetic gene sequencefor the purpose of gene expression The software would helpin obtaining the relevant planned sequence of a specific DNAwithout the DNA template It could also be useful at the timewhen the sequence is received based on for example an assayor modification of this gene by in silico analysis The softwarewould be an alternative for chemical gene synthesis

In relation to the expression of a synthetic gene anadditional modification namely substitution of rare codonswith frequent codons in a relevant host has been planned inorder to increase the usefulness of the presented algorithmand at the same time to provide an opportunity for obtainingthe best possible results in an experimental laboratory

The computer programpresented considers amuch largernumber of assembly variants for the proposed assemblymethod to synthesize a given peptide than previous algo-rithms It connects the optimal assembly procedure withartificial gene DNA sequence design The assembly methoduses a decreasing mixture temperature to increase the lengthand reliability of synthesis

The software and user manual are freely available at http://dnasynth.sourceforge.net under the GNU LGPL license.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported by the statutory research of the Institute of Electronic Systems of Warsaw University of Technology and by European Union Project Innovative Economy POIG010102-00-00708-00. The authors would like to thank Maciej Michalak for participating in the development of an early version of the presented software, Ryszard Romaniuk for support, and the editors and the anonymous reviewer for their constructive comments.


References

[1] E. Andrianantoandro, S. Basu, D. Karig, and R. Weiss, "Synthetic biology: new engineering rules for an emerging discipline," Molecular Systems Biology, vol. 2, no. 1, 2006.

[2] S. L. Beaucage and R. P. Iyer, "Advances in the synthesis of oligonucleotides by the phosphoramidite approach," Tetrahedron, vol. 48, no. 12, pp. 2223–2311, 1992.

[3] M. Stryer, Biochemistry, W. H. Freeman, New York, NY, USA, 1995.

[4] J. D. Dillon and C. A. Rosen, "A rapid method for the construction of synthetic genes using the polymerase chain reaction," BioTechniques, vol. 9, no. 3, pp. 298–300, 1990.

[5] W. P. C. Stemmer, A. Crameri, K. D. Ha, T. M. Brennan, and H. L. Heyneker, "Single-step assembly of a gene and entire plasmid from large numbers of oligodeoxyribonucleotides," Gene, vol. 164, no. 1, pp. 49–53, 1995.

[6] F. Burpo, "A critical review of PCR primer design algorithms and crosshybridization case study," Biochemistry, vol. 218, pp. 1–10, 2001.

[7] S. J. Kodumal, K. G. Patel, R. Reid, H. G. Menzella, M. Welch, and D. V. Santi, "Total synthesis of long DNA sequences: synthesis of a contiguous 32-kb polyketide synthase gene cluster," Proceedings of the National Academy of Sciences of the United States of America, vol. 101, no. 44, pp. 15573–15578, 2004.

[8] D. Goeddel, D. Kleid, F. Bolivar et al., "Expression in Escherichia coli of chemically synthesized genes for human insulin," Proceedings of the National Academy of Sciences of the United States of America, vol. 76, no. 1, pp. 106–110, 1979.

[9] L.-C. Au, F.-Y. Yang, W.-J. Yang, S.-H. Lo, and C.-F. Kao, "Gene synthesis by a LCR-based approach: high-level production of leptin-L54 using synthetic gene in Escherichia coli," Biochemical and Biophysical Research Communications, vol. 248, no. 1, pp. 200–203, 1998.

[10] G. Wu, L. Dress, and S. J. Freeland, "Optimal encoding rules for synthetic genes: the need for a community effort," Molecular Systems Biology, vol. 3, no. 1, article 134, 2007.

[11] A. Villalobos, J. E. Ness, C. Gustafsson, J. Minshull, and S. Govindarajan, "Gene Designer: a synthetic biology tool for constructing artificial DNA segments," BMC Bioinformatics, vol. 7, article 285, 2006.

[12] J.-M. Rouillard, W. Lee, G. Truan, X. Gao, X. Zhou, and E. Gulari, "Gene2Oligo: oligonucleotide design for in vitro gene synthesis," Nucleic Acids Research, vol. 32, pp. W176–W180, 2004.

[13] S. Jayaraj, R. Reid, and D. V. Santi, "GeMS: an advanced software package for designing synthetic genes," Nucleic Acids Research, vol. 33, no. 9, pp. 3011–3016, 2005.

[14] R. Rydzanicz, X. S. Zhao, and P. E. Johnson, "Assembly PCR oligo maker: a tool for designing oligodeoxynucleotides for constructing long DNA molecules for RNA production," Nucleic Acids Research, vol. 33, no. 2, pp. W521–W525, 2005.

[15] D. M. Hoover and J. Lubkowski, "DNAWorks: an automated method for designing oligonucleotides for PCR-based gene synthesis," Nucleic Acids Research, vol. 30, no. 10, article e43, 2002.

[16] M. Bode, S. Khor, H. Ye, M.-H. Li, and J. Y. Ying, "TmPrime: fast, flexible oligonucleotide design software for gene synthesis," Nucleic Acids Research, vol. 37, no. 2, pp. W214–W221, 2009.

[17] Y. Nakamura, T. Gojobori, and T. Ikemura, "Codon usage tabulated from international DNA sequence databases: status for the year 2000," Nucleic Acids Research, vol. 28, no. 1, p. 292, 2000.

[18] M. Zuker and P. Stiegler, "Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information," Nucleic Acids Research, vol. 9, no. 1, pp. 133–148, 1981.

[19] J. SantaLucia Jr., "A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbor thermodynamics," Proceedings of the National Academy of Sciences of the United States of America, vol. 95, no. 4, pp. 1460–1465, 1998.

[20] R. M. Nowak, "Polyglot programming the applications to analyze genetic data," BioMed Research International, vol. 2014, Article ID 253013, 7 pages, 2014.

[21] S. Bellaousov and D. H. Mathews, "ProbKnot: fast prediction of RNA secondary structure including pseudoknots," RNA, vol. 16, no. 10, pp. 1870–1880, 2010.

[22] M. Hamada, K. Sato, H. Kiryu, T. Mituyama, and K. Asai, "Predictions of RNA secondary structure by combining homologous sequence information," Bioinformatics, vol. 25, no. 12, pp. i330–i338, 2009.

[23] D. G. Gibson, G. A. Benders, C. Andrews-Pfannkoch et al., "Complete chemical synthesis, assembly, and cloning of a Mycoplasma genitalium genome," Science, vol. 319, no. 5867, pp. 1215–1220, 2008.

[24] D. G. Gibson, L. Young, R.-Y. Chuang, J. C. Venter, C. A. Hutchison, and H. O. Smith, "Enzymatic assembly of DNA molecules up to several hundred kilobases," Nature Methods, vol. 6, no. 5, pp. 343–345, 2009.

[25] R. P. Shetty, D. Endy, and T. F. Knight Jr., "Engineering BioBrick vectors from BioBrick parts," Journal of Biological Engineering, vol. 2, no. 1, pp. 1–12, 2008.

[26] A. Casini, J. T. Macdonald, J. de Jonghe et al., "One-pot DNA construction for synthetic biology: the Modular Overlap-Directed Assembly with Linkers (MODAL) strategy," Nucleic Acids Research, vol. 42, no. 1, article e7, 2014.



In the present procedure, we do not use PCA because long molecule synthesis using PCA requires several steps, where each step extends the output molecule by hybridization, polymerization, and denaturation. We extend the assembly method using ligase chain reaction (LCR) [8, 9]. This method uses exactly one sequence of denaturation, hybridization, polymerization, and ligation. The chemically synthesized DNA strands are mixed together; then the T4 ligase enzyme connects the fragments. Next, polymerase creates the missing pieces, and finally PCR is used to amplify the output molecule. The proposed modification of assembly by ligation is to carry out the reaction at a decreasing temperature, which helps provide deterministic assembly of fragments.

The artificial gene synthesis process uses reverse translation methods to develop the gene DNA sequence. There are $4^3 = 64$ nucleotide triplets (codons), 61 of which encode the 20 amino acids, so some amino acids are encoded by several codons. Genetic code redundancy allows modifying the DNA sequence without changing the encoded protein sequence. The reverse translation method considers different choices from a set of synonymous codons, taking into account requirements such as the absence of regulatory, repetitive, and extraordinarily high GC content subsequences and the host organism's preferences for particular codons. The brute-force approach to reverse translation, screening all possible alternative sequences, rapidly becomes impractical because the number of such sequences grows exponentially with the length of the sequence. Thus, some empirical rules (heuristics) are used in all 37 reported computer programs [10].
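The exponential growth in the number of candidate sequences can be illustrated with a short sketch. It assumes the standard genetic code (NCBI translation table 1); the function name and the example peptides are illustrative only and are not part of DNASynth.

```python
from itertools import product
from math import prod

# Standard genetic code (NCBI translation table 1). Codons are enumerated
# with bases in the order T, C, A, G at each position; '*' is a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Number of synonymous codons available for each amino acid.
DEGENERACY = {}
for aa in CODON_TABLE.values():
    DEGENERACY[aa] = DEGENERACY.get(aa, 0) + 1


def count_reverse_translations(peptide: str) -> int:
    """Return how many distinct DNA sequences encode the given peptide."""
    return prod(DEGENERACY[aa] for aa in peptide)


if __name__ == "__main__":
    # Hypothetical peptides of growing length.
    for peptide in ("MW", "MKL", "MKLSR" * 4):
        print(len(peptide), count_reverse_translations(peptide))
```

Because the count is a product of per-residue degeneracies, every added residue with more than one synonymous codon multiplies the search space, which is why exhaustive screening is replaced by heuristics.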

In the present paper, we describe a computer program that designs the whole artificial gene synthesis process: developing the optimal nucleotide sequence encoding a given peptide for a given host organism and determining the best long DNA assembly protocol. These tasks are closely connected: synonymous codon substitutions may lead to changes in the optimal assembly protocol, and the lack of a simple assembly protocol may be addressed by changing the nucleotide sequence. We also describe an assembly method where the fragments are designed for carrying out the reaction at a decreasing temperature.

Existing programs for gene synthesis that automatically design oligonucleotides are developed mainly for the PCA method [11–15]. An exception is TmPrime [16], which supports the LCR method but requires synthesizing oligonucleotides that fully cover both the coding and the noncoding strands. The approach proposed in this paper allows LCR assembly with gaps in the noncoding strand, which leads to a significant reduction in the cost of gene synthesis.

The calculation is a single multicriteria optimization task. The algorithm output is a set of DNA fragments that are short enough to be chemically synthesized and a protocol for their assembly. The melting temperatures of the overlapping regions are slightly different for each pair. If the reaction cannot be performed in one mixture, the possibility of performing the reactions in separate tubes and then joining the results is considered.
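How distinct melting temperatures can give an ordered, stepwise hybridization may be sketched as follows. The sketch uses the Wallace rule (2 °C per A/T, 4 °C per G/C), a rough estimate for short oligonucleotides that is swapped in here for brevity; it is not the thermodynamic model used by DNASynth. The overlap sequences, the function names, and the 2 °C minimum gap are hypothetical.

```python
def wallace_tm(seq):
    """Rough melting temperature in deg C: 2 * (A + T) + 4 * (G + C)."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


def hybridization_order(overlaps):
    """Sort overlap regions from the highest to the lowest estimated Tm.

    While the mixture cools, regions with a higher Tm are expected to
    anneal first, so this is the expected order of hybridization.
    """
    estimated = [(name, wallace_tm(seq)) for name, seq in overlaps.items()]
    return sorted(estimated, key=lambda item: item[1], reverse=True)


def distinct_enough(order, min_gap=2):
    """Check that consecutive Tm values differ by at least min_gap degrees."""
    tms = [tm for _, tm in order]
    return all(a - b >= min_gap for a, b in zip(tms, tms[1:]))


if __name__ == "__main__":
    # Hypothetical overlap regions between adjacent fragments.
    overlaps = {
        "A/B": "GCGTGCCATTTCGG",
        "B/C": "ATACAGTGTCCC",
        "C/D": "TTAAACCTTTTT",
    }
    order = hybridization_order(overlaps)
    print(order)
    print(distinct_enough(order))
```

A design whose overlaps fail the spacing check would be a candidate for either a changed nucleotide sequence or a reaction split across separate tubes.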

Figure 1: Long DNA base synthesis protocol (diagram labels: fragments A, B, C, and D; fragments X′, Y′, and Z′; 5′ and 3′ ends; steps "DNA ligation" and "PCR"). Shorter fragments are synthesized chemically; then, during hybridization with a slowly decreasing temperature, they form the proper molecule. Next, ligation joins the parts into a longer molecule. Finally, PCR with specific primers amplifies the correct DNA strands. If a correct molecule cannot be created because the fragments fold in an abnormal way, the reaction is performed in separate tubes (complex protocol).

2. Materials and Methods

The method of producing a long DNA molecule, encoding a given peptide for a given host organism, from shorter fragments synthesized chemically is called the synthesis protocol here. The protocol includes the sequences of the shorter fragments and a method for assembling them. A single mixture synthesis is called the base protocol; a multitube synthesis is called a complex protocol. If a given DNA molecule cannot be created correctly by a base protocol (one-tube reaction), for example, because the fragments fold in an incorrect way, a complex protocol is considered (reaction in separate tubes).

2.1. DNA Assembly Method. The base synthesis protocol is shown in Figure 1; it is essentially a modified assembly by ligation [8]. Shorter fragments are synthesized chemically, and then the longer molecule is formed by hybridization and ligation. Finally, PCR with specific starters (primers) is used to amplify the correct DNA strands. The mixture consists of fragments of the desired DNA strand, called main chains (the molecules labeled A, B, C, and D in Figure 1), and helper chains, each complementary to two adjacent fragments and used for joining them (molecules X, Y, and Z in Figure 1).
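The relationship between main chains and helper chains can be sketched in a few lines. The sketch assumes that a helper chain is the reverse complement of the region spanning one junction between two adjacent main chains; the target sequence, the fragment length, the arm length, and all function names are hypothetical and are not DNASynth parameters.

```python
# Translation table mapping each base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def split_main_chains(target, length):
    """Cut the desired strand into consecutive main chains of a given length."""
    return [target[i:i + length] for i in range(0, len(target), length)]


def helper_chains(main_chains, arm):
    """Build one helper chain per junction between adjacent main chains.

    Each helper pairs with the last `arm` bases of the left chain and the
    first `arm` bases of the right chain, holding the two ends together
    so that ligase can join them.
    """
    helpers = []
    for left, right in zip(main_chains, main_chains[1:]):
        junction = left[-arm:] + right[:arm]
        helpers.append(reverse_complement(junction))
    return helpers


if __name__ == "__main__":
    target = "ATGGCCAAGCTTGGATCCGAATTCTAGA"  # hypothetical 28-nt strand
    mains = split_main_chains(target, 10)
    print(mains)
    print(helper_chains(mains, 4))
```

Since each helper covers only a junction, the complementary strand is left with gaps between helpers, which is the source of the reduction in synthesized bases compared with full coverage of both strands.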

In the assembly protocol presented, the temperature is changed slowly from a temperature high enough to unfold all the fragments to a temperature at which all fragments will hybridize. The temperature management thus differs from other assembly protocols where the folding temperature


P ← i0, i1, i

repeat
P′ ← P

























$$F = \frac{1}{1 + F_C + F_N + F_S}, \tag{1}$$

where

$$F_C = w_C \cdot \sum_j \sqrt{\frac{n_j \left(C_{rj} - C_{ij}\right)^2}{n}}, \qquad F_S = w_S \cdot \frac{S - 1}{2}.$$
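The way the terms of (1) combine can be sketched as follows, assuming nonnegative penalty terms, so that F equals 1 for a candidate without penalties and decreases toward 0 as the penalties grow. The function names and the default weight are hypothetical; F_C and F_N are taken as precomputed inputs, and only the form of F_S shown in (1) is computed.

```python
def structure_penalty(s, w_s=1.0):
    """Term F_S = w_S * (S - 1) / 2 from (1); w_s is an assumed weight."""
    return w_s * (s - 1) / 2


def fitness(f_c, f_n, f_s):
    """Overall score F = 1 / (1 + F_C + F_N + F_S) from (1)."""
    return 1.0 / (1.0 + f_c + f_n + f_s)


if __name__ == "__main__":
    # Hypothetical penalty values for two candidate designs.
    print(fitness(0.0, 0.0, structure_penalty(1)))
    print(fitness(1.0, 1.0, structure_penalty(5)))
```

Mapping the sum of penalties into the interval (0, 1] lets candidates be ranked by a single number, with a higher value indicating a better design.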
























3. Results




























