Bioinformatics: Practical Application of Simulation and Data Mining Markov Modeling III
Practical Bioinformatics - HistoBase
Transcript of Practical Bioinformatics - HistoBase
![Page 1: Practical Bioinformatics - HistoBase](https://reader034.fdocuments.net/reader034/viewer/2022042405/625e473f6be89b159679cfb4/html5/thumbnails/1.jpg)
Practical Bioinformatics
Mark Voorhies
5/19/2015
Mark Voorhies Practical Bioinformatics
![Page 2: Practical Bioinformatics - HistoBase](https://reader034.fdocuments.net/reader034/viewer/2022042405/625e473f6be89b159679cfb4/html5/thumbnails/2.jpg)
Review – Documentation
A program should communicate its intent to both the computerand human programmers.
Comments
Docstrings
Code and inputs defining a complete protocol
Positive and negative controls
Mark Voorhies Practical Bioinformatics
![Page 3: Practical Bioinformatics - HistoBase](https://reader034.fdocuments.net/reader034/viewer/2022042405/625e473f6be89b159679cfb4/html5/thumbnails/3.jpg)
Review – Documentation
A program should communicate its intent to both the computerand human programmers.
Comments
Docstrings
Code and inputs defining a complete protocol
Positive and negative controls
Mark Voorhies Practical Bioinformatics
![Page 4: Practical Bioinformatics - HistoBase](https://reader034.fdocuments.net/reader034/viewer/2022042405/625e473f6be89b159679cfb4/html5/thumbnails/4.jpg)
Review – Documentation
A program should communicate its intent to both the computerand human programmers.
Comments
Docstrings
Code and inputs defining a complete protocol
Positive and negative controls
Mark Voorhies Practical Bioinformatics
![Page 5: Practical Bioinformatics - HistoBase](https://reader034.fdocuments.net/reader034/viewer/2022042405/625e473f6be89b159679cfb4/html5/thumbnails/5.jpg)
Review – “Top Down” design
Experiment in the shell
Factor working code into functions and modules
Refine from problem-specific to generally applicable functions
“As simple as possible, but no simpler”
Mark Voorhies Practical Bioinformatics
![Page 6: Practical Bioinformatics - HistoBase](https://reader034.fdocuments.net/reader034/viewer/2022042405/625e473f6be89b159679cfb4/html5/thumbnails/6.jpg)
Review – “Top Down” design
Experiment in the shell
Factor working code into functions and modules
Refine from problem-specific to generally applicable functions“As simple as possible, but no simpler”
Mark Voorhies Practical Bioinformatics
![Page 7: Practical Bioinformatics - HistoBase](https://reader034.fdocuments.net/reader034/viewer/2022042405/625e473f6be89b159679cfb4/html5/thumbnails/7.jpg)
Dictionaries
d i c t i o n a r y = {”A” : ”T” , ”T” : ”A” , ”G” : ”C” , ”C” : ”G”}d i c t i o n a r y [ ”G” ]d i c t i o n a r y [ ”N” ] = ”N”d i c t i o n a r y . h a s k e y ( ”C” )
Mark Voorhies Practical Bioinformatics
![Page 8: Practical Bioinformatics - HistoBase](https://reader034.fdocuments.net/reader034/viewer/2022042405/625e473f6be89b159679cfb4/html5/thumbnails/8.jpg)
Dictionaries
gene t i cCode = {”TTT” : ”F” , ”TTC” : ”F” , ”TTA” : ”L” , ”TTG” : ”L” ,”CTT” : ”L” , ”CTC” : ”L” , ”CTA” : ”L” , ”CTG” : ”L” ,”ATT” : ” I ” , ”ATC” : ” I ” , ”ATA” : ” I ” , ”ATG” : ”M” ,”GTT” : ”V” , ”GTC” : ”V” , ”GTA” : ”V” , ”GTG” : ”V” ,
”TCT” : ”S” , ”TCC” : ”S” , ”TCA” : ”S” , ”TCG” : ”S” ,”CCT” : ”P” , ”CCC” : ”P” , ”CCA” : ”P” , ”CCG” : ”P” ,”ACT” : ”T” , ”ACC” : ”T” , ”ACA” : ”T” , ”ACG” : ”T” ,”GCT” : ”A” , ”GCC” : ”A” , ”GCA” : ”A” , ”GCG” : ”A” ,
”TAT” : ”Y” , ”TAC” : ”Y” , ”TAA” : ”∗” , ”TAG” : ”∗” ,”CAT” : ”H” , ”CAC” : ”H” , ”CAA” : ”Q” , ”CAG” : ”Q” ,”AAT” : ”N” , ”AAC” : ”N” , ”AAA” : ”K” , ”AAG” : ”K” ,”GAT” : ”D” , ”GAC” : ”D” , ”GAA” : ”E” , ”GAG” : ”E” ,
”TGT” : ”C” , ”TGC” : ”C” , ”TGA” : ”∗” , ”TGG” : ”W” ,”CGT” : ”R” , ”CGC” : ”R” , ”CGA” : ”R” , ”CGG” : ”R” ,”AGT” : ”S” , ”AGC” : ”S” , ”AGA” : ”R” , ”AGG” : ”R” ,”GGT” : ”G” , ”GGC” : ”G” , ”GGA” : ”G” , ”GGG” : ”G”}
Mark Voorhies Practical Bioinformatics
![Page 9: Practical Bioinformatics - HistoBase](https://reader034.fdocuments.net/reader034/viewer/2022042405/625e473f6be89b159679cfb4/html5/thumbnails/9.jpg)
Whiteboard Image
Mark Voorhies Practical Bioinformatics
![Page 10: Practical Bioinformatics - HistoBase](https://reader034.fdocuments.net/reader034/viewer/2022042405/625e473f6be89b159679cfb4/html5/thumbnails/10.jpg)
Exercise: Transforming sequences
1 Write a function to return the antisense strand of a DNAsequence in 3’→5’ orientation.
2 Write a function to return the complement of a DNAsequence in 5’→3’ orientation.
3 Write a function to translate a DNA sequence
Mark Voorhies Practical Bioinformatics
![Page 11: Practical Bioinformatics - HistoBase](https://reader034.fdocuments.net/reader034/viewer/2022042405/625e473f6be89b159679cfb4/html5/thumbnails/11.jpg)
Why compare sequences?
To find genes with a common ancestor
To infer conserved molecular mechanism and biologicalfunction
To find short functional motifs
To find repetitive elements within a sequence
To predict cross-hybridizing sequences (e.g., in microarraydesign)
To find genomic origin of imperfectly sequenced fragments(e.g., in deep sequencing experiments)
To predict nucleotide secondary structure
Mark Voorhies Practical Bioinformatics
![Page 12: Practical Bioinformatics - HistoBase](https://reader034.fdocuments.net/reader034/viewer/2022042405/625e473f6be89b159679cfb4/html5/thumbnails/12.jpg)
Why compare sequences?
To find genes with a common ancestor
To infer conserved molecular mechanism and biologicalfunction
To find short functional motifs
To find repetitive elements within a sequence
To predict cross-hybridizing sequences (e.g., in microarraydesign)
To find genomic origin of imperfectly sequenced fragments(e.g., in deep sequencing experiments)
To predict nucleotide secondary structure
Mark Voorhies Practical Bioinformatics
![Page 13: Practical Bioinformatics - HistoBase](https://reader034.fdocuments.net/reader034/viewer/2022042405/625e473f6be89b159679cfb4/html5/thumbnails/13.jpg)
Whiteboard Image
Mark Voorhies Practical Bioinformatics
![Page 14: Practical Bioinformatics - HistoBase](https://reader034.fdocuments.net/reader034/viewer/2022042405/625e473f6be89b159679cfb4/html5/thumbnails/14.jpg)
Nomenclature
Homologs heritable elements with a common evolutionaryorigin.
Orthologs homologs arising from speciation.
Paralogs homologs arising from duplication and divergencewithin a single genome.
Xenologs homologs arising from horizontal transfer.
Onologs homologs arising from whole genome duplication.
Mark Voorhies Practical Bioinformatics
![Page 15: Practical Bioinformatics - HistoBase](https://reader034.fdocuments.net/reader034/viewer/2022042405/625e473f6be89b159679cfb4/html5/thumbnails/15.jpg)
Nomenclature
Homologs heritable elements with a common evolutionaryorigin.
Orthologs homologs arising from speciation.
Paralogs homologs arising from duplication and divergencewithin a single genome.
Xenologs homologs arising from horizontal transfer.
Onologs homologs arising from whole genome duplication.
Mark Voorhies Practical Bioinformatics
![Page 16: Practical Bioinformatics - HistoBase](https://reader034.fdocuments.net/reader034/viewer/2022042405/625e473f6be89b159679cfb4/html5/thumbnails/16.jpg)
Nomenclature
Homologs heritable elements with a common evolutionaryorigin.
Orthologs homologs arising from speciation.
Paralogs homologs arising from duplication and divergencewithin a single genome.
Xenologs homologs arising from horizontal transfer.
Onologs homologs arising from whole genome duplication.
Mark Voorhies Practical Bioinformatics
![Page 17: Practical Bioinformatics - HistoBase](https://reader034.fdocuments.net/reader034/viewer/2022042405/625e473f6be89b159679cfb4/html5/thumbnails/17.jpg)
Dotplots
1 Unbiased view of all ungappedalignments of two sequences
2 Noise can be filtered by applying asmoothing window to the diagonals.
Mark Voorhies Practical Bioinformatics
![Page 18: Practical Bioinformatics - HistoBase](https://reader034.fdocuments.net/reader034/viewer/2022042405/625e473f6be89b159679cfb4/html5/thumbnails/18.jpg)
Dotplots
1 Unbiased view of all ungappedalignments of two sequences
2 Noise can be filtered by applying asmoothing window to the diagonals.
Mark Voorhies Practical Bioinformatics
![Page 19: Practical Bioinformatics - HistoBase](https://reader034.fdocuments.net/reader034/viewer/2022042405/625e473f6be89b159679cfb4/html5/thumbnails/19.jpg)
Types of alignments
Global Alignment Each letter of each sequence is aligned to aletter or a gap (e.g., Needleman-Wunsch)
Local Alignment An optimal pair of subsequences is taken fromthe two sequences and globally aligned (e.g.,Smith-Waterman)
Mark Voorhies Practical Bioinformatics
![Page 20: Practical Bioinformatics - HistoBase](https://reader034.fdocuments.net/reader034/viewer/2022042405/625e473f6be89b159679cfb4/html5/thumbnails/20.jpg)
Exercise: Scoring an ungapped alignment
s ={”A” :{ ”A” : 1 . 0 , ”T” :−1.0 , ”G” :−1.0 , ”C” :−1.0} ,”T” :{ ”A” :−1.0 , ”T” : 1 . 0 , ”G” :−1.0 , ”C” :−1.0} ,”G” :{ ”A” :−1.0 , ”T” :−1.0 , ”G” : 1 . 0 , ”C” :−1.0} ,”C” :{ ”A” :−1.0 , ”T” :−1.0 , ”G” :−1.0 , ”C” : 1 . 0}}
S(x , y) =N∑i
s(xi , yi )
1 Given two equal length sequences and a scoring matrix, returnthe alignment score for a full length, ungapped alignment.
2 Given two sequences and a scoring matrix, find the offset thatyields the best scoring ungapped alignment.
Mark Voorhies Practical Bioinformatics
![Page 21: Practical Bioinformatics - HistoBase](https://reader034.fdocuments.net/reader034/viewer/2022042405/625e473f6be89b159679cfb4/html5/thumbnails/21.jpg)
Exercise: Scoring an ungapped alignment
s ={”A” :{ ”A” : 1 . 0 , ”T” :−1.0 , ”G” :−1.0 , ”C” :−1.0} ,”T” :{ ”A” :−1.0 , ”T” : 1 . 0 , ”G” :−1.0 , ”C” :−1.0} ,”G” :{ ”A” :−1.0 , ”T” :−1.0 , ”G” : 1 . 0 , ”C” :−1.0} ,”C” :{ ”A” :−1.0 , ”T” :−1.0 , ”G” :−1.0 , ”C” : 1 . 0}}
S(x , y) =N∑i
s(xi , yi )
1 Given two equal length sequences and a scoring matrix, returnthe alignment score for a full length, ungapped alignment.
2 Given two sequences and a scoring matrix, find the offset thatyields the best scoring ungapped alignment.
Mark Voorhies Practical Bioinformatics
![Page 22: Practical Bioinformatics - HistoBase](https://reader034.fdocuments.net/reader034/viewer/2022042405/625e473f6be89b159679cfb4/html5/thumbnails/22.jpg)
Exercise: Scoring an ungapped alignment
s ={”A” :{ ”A” : 1 . 0 , ”T” :−1.0 , ”G” :−1.0 , ”C” :−1.0} ,”T” :{ ”A” :−1.0 , ”T” : 1 . 0 , ”G” :−1.0 , ”C” :−1.0} ,”G” :{ ”A” :−1.0 , ”T” :−1.0 , ”G” : 1 . 0 , ”C” :−1.0} ,”C” :{ ”A” :−1.0 , ”T” :−1.0 , ”G” :−1.0 , ”C” : 1 . 0}}
S(x , y) =N∑i
s(xi , yi )
1 Given two equal length sequences and a scoring matrix, returnthe alignment score for a full length, ungapped alignment.
2 Given two sequences and a scoring matrix, find the offset thatyields the best scoring ungapped alignment.
Mark Voorhies Practical Bioinformatics
![Page 23: Practical Bioinformatics - HistoBase](https://reader034.fdocuments.net/reader034/viewer/2022042405/625e473f6be89b159679cfb4/html5/thumbnails/23.jpg)
Exercise: Scoring an ungapped alignment
s ={”A” :{ ”A” : 1 . 0 , ”T” :−1.0 , ”G” :−1.0 , ”C” :−1.0} ,”T” :{ ”A” :−1.0 , ”T” : 1 . 0 , ”G” :−1.0 , ”C” :−1.0} ,”G” :{ ”A” :−1.0 , ”T” :−1.0 , ”G” : 1 . 0 , ”C” :−1.0} ,”C” :{ ”A” :−1.0 , ”T” :−1.0 , ”G” :−1.0 , ”C” : 1 . 0}}
S(x , y) =N∑i
s(xi , yi )
1 Given two equal length sequences and a scoring matrix, returnthe alignment score for a full length, ungapped alignment.
2 Given two sequences and a scoring matrix, find the offset thatyields the best scoring ungapped alignment.
Mark Voorhies Practical Bioinformatics
![Page 24: Practical Bioinformatics - HistoBase](https://reader034.fdocuments.net/reader034/viewer/2022042405/625e473f6be89b159679cfb4/html5/thumbnails/24.jpg)
Whiteboard Image
Mark Voorhies Practical Bioinformatics
![Page 25: Practical Bioinformatics - HistoBase](https://reader034.fdocuments.net/reader034/viewer/2022042405/625e473f6be89b159679cfb4/html5/thumbnails/25.jpg)
Exercise: Scoring a gapped alignment
1 Given two equal length gapped sequences (where “-”represents a gap) and a scoring matrix, calculate an alignmentscore with a -1 penalty for each base aligned to a gap.
2 Write a new scoring function with separate penalties foropening a zero length gap (e.g., G = -11) and extending anopen gap by one base (e.g., E = -1).
Sgapped(x , y) = S(x , y) +
gaps∑i
(G + E ∗ len(i))
Mark Voorhies Practical Bioinformatics
![Page 26: Practical Bioinformatics - HistoBase](https://reader034.fdocuments.net/reader034/viewer/2022042405/625e473f6be89b159679cfb4/html5/thumbnails/26.jpg)
Exercise: Scoring a gapped alignment
1 Given two equal length gapped sequences (where “-”represents a gap) and a scoring matrix, calculate an alignmentscore with a -1 penalty for each base aligned to a gap.
2 Write a new scoring function with separate penalties foropening a zero length gap (e.g., G = -11) and extending anopen gap by one base (e.g., E = -1).
Sgapped(x , y) = S(x , y) +
gaps∑i
(G + E ∗ len(i))
Mark Voorhies Practical Bioinformatics