DNA sequence analysis
School B&I TCD Bioinformatics
May 2010
A, T/U, C, G
• Simple code, lots of sequence
• Sequence analysis
  – Computer intensive
    • BLAST homology searching
    • Gene/exon prediction
    • Multiple sequence alignment
    • Alignments in general
  – “Trivial”
Trivial
• Could be done by hand
  – Computers
    • Quicker
    • More reliable
• Examples
  – Translate DNA
  – Restriction sites
  – Synonymous codon usage
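To make the "trivial" category concrete, here is a minimal sketch of the synonymous codon usage example. The function names `codon_counts` and `relative_usage` are illustrative and do not come from the slides; the input is assumed to be an in-frame coding sequence.

```python
from collections import Counter


def codon_counts(cds):
    """Count codons in an in-frame coding sequence (DNA or RNA letters)."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


def relative_usage(counts, synonymous):
    """Fraction of each codon within one set of synonymous codons."""
    total = sum(counts[c] for c in synonymous)
    if total == 0:
        return {}
    return {c: counts[c] / total for c in synonymous}
```

For example, `relative_usage(codon_counts("ATGAAAAAGAAATAA"), ["AAA", "AAG"])` reports how the two lysine codons are split in that short sequence.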
Sequence formats
• Fasta Format
>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
• Phylip Format
4 131
IXI_234   TSPASIRPPA GPSSRPAMVS SRRTRPSPPG PRRPTGRPCC SAAPRRPQAT
IXI_235   TSPASIRPPA GPSSR----- ----RPSPPG PRRPTGRPCC SAAPRRPQAT
IXI_236   TSPASIRPPA GPSSRPAMVS SR--RPSPPP PRRPPGRPCC SAAPPRPQAT
IXI_237   TSPASLRPPA GPSSRPAMVS SRR-RPSPPG PRRPT----C SAAPRRPQAT
• CLUSTAL W(1.4) multiple sequence alignment
IXI_234   TSPASIRPPAGPSSRPAMVSSRRTRPSPPGPRRPTGRPCCSAAPRRPQAT
IXI_235   TSPASIRPPAGPSSR---------RPSPPGPRRPTGRPCCSAAPRRPQAT
IXI_236   TSPASIRPPAGPSSRPAMVSSR--RPSPPPPRRPPGRPCCSAAPPRPQAT
IXI_237   TSPASLRPPAGPSSRPAMVSSRR-RPSPPGPRRPT----CSAAPRRPQAT
• Interconvert: http://thr.cit.nih.gov/molbio/readseq/
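As a rough illustration of how simple the Fasta layout is to read, the sketch below parses it by hand. The function name `parse_fasta` is an assumption for this example; in practice a format converter such as the readseq tool linked above, or an established library, is the safer choice.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from Fasta-formatted text.

    A record starts at a line beginning with '>'; every following line
    up to the next '>' is joined into one sequence string.
    """
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)
```

Calling `list(parse_fasta(text))` gives one tuple per record, with the line breaks inside each sequence removed.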
DNA sequence analysis
• Google EMBOSS
  – A suite of programs with the same look & feel
  – Does pretty much everything you need
  – Can be installed locally
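Translation
• DNA is anti-parallel.
  – One strand 5’-3’ pairs with the complementary strand 3’-5’
  – Translation and transcription always run 5’-3’
• Six possible translations, three on each strand
• Frame 1: ATG CCC GCA TTT GAA TAA
• Frame 2: A TGC CCG CAT TTG AAT AA
• Frame 3: AT GCC CGC ATT TGA ATA A
• Stop codons (underlined on the original slide): TAA in frame 1, TGA in frame 3, none in frame 2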
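The six reading frames can be sketched as follows. This is a minimal illustration using the standard genetic code only; the names `translate` and `six_frames` are made up for this example, and EMBOSS provides equivalent tools.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the complementary strand, read 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna, frame=0):
    """Translate one frame (0, 1 or 2); '*' is a stop, 'X' an unknown codon."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(frame, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Translate three frames on each strand, keyed '+1'..'+3' and '-1'..'-3'."""
    frames = {}
    rc = reverse_complement(dna)
    for f in range(3):
        frames[f"+{f + 1}"] = translate(dna, f)
        frames[f"-{f + 1}"] = translate(rc, f)
    return frames
```

For the example sequence ATGCCCGCATTTGAATAA, frame +1 gives MPAFE* and frame +3 gives ARI*I, matching the stop codons TAA and TGA in those frames.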
Frameshift errors
Frameshift mutations
Genetic code
The “Universal” Genetic Code.

Phe UUU   Ser UCU   Tyr UAU   Cys UGU
    UUC       UCC       UAC       UGC
Leu UUA       UCA   ter UAA   ter UGA
    UUG       UCG   ter UAG   Trp UGG

Leu CUU   Pro CCU   His CAU   Arg CGU
    CUC       CCC       CAC       CGC
    CUA       CCA   Gln CAA       CGA
    CUG       CCG       CAG       CGG

Ile AUU   Thr ACU   Asn AAU   Ser AGU
    AUC       ACC       AAC       AGC
    AUA       ACA   Lys AAA   Arg AGA
Met AUG       ACG       AAG       AGG

Val GUU   Ala GCU   Asp GAU   Gly GGU
    GUC       GCC       GAC       GGC
    GUA       GCA   Glu GAA       GGA
    GUG       GCG       GAG       GGG
Exceptions to the code
• #1: Yeast Mitochondrial Code: CUN=T AUA=M UGA=W
• #2: Mitochondrial Code of Vertebrates: AGR=* AUA=M UGA=W
• #3: Mitochondrial Code of Filamentous fungi: UGA=W
• #4: Mitochondrial Code of Insects and platyhelminths: AUA=M UGA=W AGR=S
• #5: Nuclear Code of Candida cylindracea: CUG=S (*)
• #6: Nuclear Code of Ciliata: UAR=Q
• #7: Nuclear Code of Euplotes: UGA=C
• #8: Mitochondrial Code of Echinoderms: UGA=W AGR=S AAA=N
• #9: Mitochondrial Code of Ascidiacea: UGA=W AGR=G AUA=M
• #10: Mitochondrial Code of Platyhelminthes: UGA=W AGR=S UAA=Y AAA=N
• #11: Nuclear Code of Blepharisma: UAG=Q
(*) see Nature 341:164
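One way to handle these exceptions in code is to copy the standard table and override only the codons that differ. The sketch below does this for exception #2 (vertebrate mitochondria), expanding R to A or G; the names `STANDARD`, `VERTEBRATE_MITO` and `translate_rna` are illustrative only.

```python
from itertools import product

# Standard code on RNA letters, codons enumerated in UCAG order.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Exception #2 from the list above: AGR=* AUA=M UGA=W (R means A or G).
VERTEBRATE_MITO = {**STANDARD, "AGA": "*", "AGG": "*", "AUA": "M", "UGA": "W"}


def translate_rna(rna, table):
    """Translate an in-frame RNA sequence with the given codon table."""
    rna = rna.upper().replace("T", "U")
    return "".join(
        table.get(rna[i:i + 3], "X") for i in range(0, len(rna) - 2, 3)
    )
```

The same message reads differently under the two tables: AUGAUAUGAAGA translates to MI*R with the standard code and to MMW* with the vertebrate mitochondrial code.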
Start codons
• ATG the “universal” start codon … but
• 10% of E. coli genes start with GTG
• 1% start with TTG
• Bioinformaticians only make predictions
• Molecular biologists verify
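Alternative start codons matter for gene prediction because the set of allowed starts changes which open reading frames are reported. The following is a simplified sketch, forward strand only, with a hypothetical function name `candidate_orfs`; real gene finders use much more evidence, and the output is only a prediction to be verified.

```python
STOPS = {"TAA", "TAG", "TGA"}


def candidate_orfs(dna, starts=("ATG", "GTG", "TTG"), min_codons=2):
    """List (start, end, start_codon) for forward-strand ORF candidates.

    In each frame the first allowed start codon opens an ORF and the next
    in-frame stop codon closes it. Coordinates are 0-based, end exclusive,
    and the stop codon is included in the reported span.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon in starts:
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:start + 3]))
                start = None
    return sorted(orfs)
```

With the default starts, `candidate_orfs("CCGTGAAACCCTAGCC")` reports one GTG-initiated candidate; restricting `starts` to `("ATG",)` reports none.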
Restriction sites
• Essential for the construction of plasmids
• A key tool for molecular biology
• Hundreds available commercially
  – Need to decide which to order
  – Costs range from $3.80 to $500 per 1000 units
• http://tools.neb.com/NEBcutter2/index.php
• Usually need an enzyme that cuts once
AluI (blunt end)
  5' AG’CT
  3' TC’GA
EcoRI
  5' G’AATTC
  3' CTTAA’G
BamHI
  5' G’GATCC
  3' CCTAG’G
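Finding an enzyme that cuts once is a plain string search. The sketch below uses the three recognition sites shown above (AluI, EcoRI, BamHI) with the cut offset on the top strand; the dictionary layout and function names are assumptions made for this example. All three sites are palindromic, so scanning one strand is enough.

```python
# Recognition site and top-strand cut offset within the site (e.g. G'AATTC -> 1).
ENZYMES = {
    "AluI": ("AGCT", 2),
    "EcoRI": ("GAATTC", 1),
    "BamHI": ("GGATCC", 1),
}


def cut_positions(dna, site, offset):
    """Return 0-based top-strand cut positions, including overlapping sites."""
    dna = dna.upper()
    cuts = []
    i = dna.find(site)
    while i != -1:
        cuts.append(i + offset)
        i = dna.find(site, i + 1)
    return cuts


def single_cutters(dna, enzymes=ENZYMES):
    """Names of enzymes whose site occurs exactly once in the sequence."""
    return [
        name
        for name, (site, offset) in enzymes.items()
        if len(cut_positions(dna, site, offset)) == 1
    ]
```

For a circular plasmid a site spanning the start/end junction would be missed by this linear search, so treat it as a first pass and confirm with a tool such as NEBcutter.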
Promoter Prediction
• To find the start of the transcript (97% of the human genome is not coding)
• False positive rate too high
  – Predicted promoters: 1 per kb; gene density: 1 per 100 kb
• RNA pol II transcribes DNA to RNA
  – Needs general transcription factors (GTFs)
• Also specific (species, tissue, developmental stage) TFs
• TF binding sites are short and “fuzzy”
• 7% of vertebrate genes encode TFs
Promoters 2
NF-AT4 matrix (3 known sites) and consensus:
A  0 0 3 3 3 0 0 1
C  1 2 0 0 0 0 0 2
G  0 0 0 0 0 1 1 0
T  2 1 0 0 0 2 2 0
   T C A A A T T C
Consensus YYAAAKKM = [CT](2)AAA[GT](2)[AC]
Predicts five sites in 3 kb upstream of human IL-11:
Bp 007  TTAAAGGC
Bp 248  ACAAATTC
Bp 1959 GAGTTTGA
Bp 2154 TCAAAGGA
Bp 2181 GACTTTTA
Ask whether a TF site relevant to your cell type is present.
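A consensus written in IUPAC ambiguity codes maps directly onto a regular expression, which is one simple way to scan a sequence for candidate sites. The sketch below is an exact, forward-strand match only, with illustrative function names. The hit list above appears to include reverse-strand and near matches, which this sketch would not report.

```python
import re

# IUPAC nucleotide ambiguity codes used in consensus strings.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "K": "[GT]", "M": "[AC]",
    "S": "[CG]", "W": "[AT]", "N": "[ACGT]",
}


def consensus_to_regex(consensus):
    """Turn an IUPAC consensus such as YYAAAKKM into a regex pattern."""
    return "".join(IUPAC[c] for c in consensus.upper())


def find_sites(dna, consensus):
    """Return (1-based position, matched text) for every match, overlaps included."""
    # A lookahead lets overlapping matches be reported as well.
    pattern = re.compile("(?=(" + consensus_to_regex(consensus) + "))")
    return [(m.start() + 1, m.group(1)) for m in pattern.finditer(dna.upper())]
```

Because the pattern is short and fuzzy, it will match often by chance in any long sequence, which is the false positive problem described above.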
Primer design
• You will be asked to design primers for sequencing, PCR etc.
• Manual pages cover this
• Computationally trivial, so there is a wide choice of websites
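To show why primer checks count as computationally trivial, here is a small sketch of two quantities primer design tools commonly report. The Wallace rule used here (2 °C per A/T, 4 °C per G/C) is a rough rule of thumb for short oligos and is an assumption brought in for this example, not something taken from the slides; the function names are illustrative.

```python
def gc_fraction(primer):
    """Fraction of G and C bases in the primer (0.0 for an empty string)."""
    primer = primer.upper()
    if not primer:
        return 0.0
    return sum(primer.count(b) for b in "GC") / len(primer)


def wallace_tm(primer):
    """Approximate melting temperature in degrees C: 2*(A+T) + 4*(G+C)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc
```

Dedicated primer design websites use more detailed thermodynamic models, so these numbers are only a quick sanity check.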
Not-trivial
• Nucleic acid (NA) secondary structure
  – EMBOSS einverted for short palindromes
  – mFOLD
• Huge database of 16S rRNA structures
• miRNA sites
Secondary Structure
• DNA (and RNA) can form base-pairs.
• Not all of these are with complementary strands.
Bioinformatic view = a cartoon
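Base-pairing within a single strand can be illustrated by searching for a short stem whose reverse complement appears a few bases downstream, leaving a loop in between. This is a brute-force sketch with hypothetical names and parameters. It considers Watson-Crick pairs only and computes no folding energy, so it is far cruder than tools such as einverted or mFOLD.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the complementary strand, read 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def find_hairpins(seq, stem=4, min_loop=3, max_loop=8):
    """Return (stem start, loop length, left arm, right arm) for perfect stems.

    The right arm must equal the reverse complement of the left arm,
    so the two arms can pair and fold the strand back on itself.
    """
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - 2 * stem - min_loop + 1):
        left = seq[i:i + stem]
        target = reverse_complement(left)
        for loop in range(min_loop, max_loop + 1):
            j = i + stem + loop
            if j + stem > len(seq):
                break
            if seq[j:j + stem] == target:
                hits.append((i, loop, left, seq[j:j + stem]))
    return hits
```

In AAGGACTTTTGTCCAA, for instance, the arms GGAC and GTCC can pair around a four-base TTTT loop.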
Closer to reality
16S rRNA
Gram -ve
Gram +ve
Evolutionary consequences? Coordinated/dependent mutational change
RDP
• Ribosomal Database Project-II Release 9 Notes
• RDP Release 9.42 (Release 9, update 42) consists of 262,030 aligned and annotated 16S rRNA sequences, along with five online analysis tools.