Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp....

49
Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp. TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAG TACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAA GATATACTGAGAGTCCTACGCGTTAGTTCAGGTCAGACAAGAGAGAACG AAATCAATTCTGAAACAATTATTTGACCATGGTAAGGAACATGAAGATG GAGTAATGAATGGTTATGGTTAGGGACTAAAATTATAAACGCCCATAAG Learn How to: ● Assemble a genome and predict its: - ORFs - Promoters Annotate genome: - Predict protein functions - Model them if possible - Re-design them if possible Predict functions by inference from a large amount of unrelated data Predict ncRNAs High-throughput methods and data interpretation Prepare the data for presentations & publications
  • date post

    15-Jan-2016
  • Category

    Documents

  • view

    214
  • download

    0

Transcript of Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp....

Page 1: Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp. TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAG TACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAA.

Advanced Bioinformatics (MB480/580)>Sulfolobus virus 1 complete genome 15465 bp.TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAGTACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAAGATATACTGAGAGTCCTACGCGTTAGTTCAGGTCAGACAAGAGAGAACGAAATCAATTCTGAAACAATTATTTGACCATGGTAAGGAACATGAAGATGGAGTAATGAATGGTTATGGTTAGGGACTAAAATTATAAACGCCCATAAG

Learn How to:● Assemble a genome and predict its:

- ORFs- Promoters

● Annotate genome:- Predict protein functions- Model them if possible- Re-design them if possible

● Predict functions by inference from a large amount of unrelated data● Predict ncRNAs● High-throughput methods and data interpretation● Prepare the data for presentations & publications

Page 2: Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp. TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAG TACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAA.

What is Bioinformatics?

• Choices:– The analysis of biological molecules

using computers and statistical techniques•TRUE

– The science of developing and utilizing computer databases and algorithms to accelerate and enhance biological research• also TRUE, but suits Computational Biology

better

Page 3: Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp. TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAG TACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAA.

More definitions

• The collection, organization and analysis of large amounts of biological data, using networks of computers and databases.

• The process of developing tools and processes to quantify and collect data to study biological systems logically.

• The science of informatics as applied to biological research.

Page 4: Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp. TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAG TACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAA.

Yet more definitions

• Mark Gerstein’s definition:– Bioinformatics is conceptualizing biology in terms of

macromolecules (in the sense of physical-chemistry) and then applying “informatics” techniques (derived from disciplines such as applied maths, computer science, and statistics) to understand and organize the information associated with these molecules, on a large-scale.

– The manuscript breaking down each part of the above statement will be e-mailed.

– http://wiki.bioinformatics.org/Bioinformatics_FAQ

Page 5: Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp. TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAG TACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAA.

The important stuff

• Bioinformatics brings together biological data from genome research with the theory and tools of mathematics, computer science and artificial intelligence.

• Bioinformatics includes any application of computer technology and information science to:– Gather, organize, store and handle data.– Analyze, interpret and spread data.– Predict biological structure and function.

Page 6: Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp. TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAG TACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAA.

What is the information in Molecular Biology?

• Central Dogmaof Molecular Biology

DNA -> RNA -> Protein -> Phenotype

• Molecules– Sequence, Structure, Function

• Processes– Mechanism, Specificity,

Regulation

• Central Paradigmfor Bioinformatics

Genomic Sequence Information -> mRNA (level) -> Protein Sequence -> Protein Structure -> Protein Function -> Phenotype

• Large Amounts of Information– Standardized– Statistical

•Genetic material •Information transfer (mRNA)•Protein synthesis (tRNA/mRNA)•Some catalytic activity

•Most cellular functions are performed or facilitated by proteins.

•Primary biocatalyst

•Cofactor transport/storage

•Mechanical motion/support

•Immune protection

•Control of growth/differentiation

This slide is courtesy of Mark Gerstein

Page 7: Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp. TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAG TACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAA.

Language of biology is not easy to understand

• Just like in spoken language, some words look very different but have the same meaning (car and automobile are synonyms; sequences of distantly related proteins are synonyms)

• Some words look or sound very similar yet have different meaning (complement and compliment; eminent and imminent; allude and elude; decent and descent are homophones; GAG and TAG codons are homophones)

• In spoken language, we came up with the rules which is why most of the time we can trace back their origins

• How do we trace the origins of Nature’s language?

Page 8: Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp. TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAG TACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAA.

Why is Bioinformatics important?

• Supports experimental work– In some cases, it provides complementary

data• More importantly, guides experimental

work– Predictions based on data– Extension of experiments in new directions

• To be believable, Bioinformatics predictions have to be verifiable– Statistical significance, or some other kind

of significance score

Page 9: Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp. TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAG TACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAA.

When did Bioinformatics begin?

• 10-15 years ago?– This is a common assumption

• Bioinformatics existed even back in 70s– It was called differently– It was underused because the amount of

biological sequence data was small

Page 10: Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp. TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAG TACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAA.

Bioinformatics and Genome Biology

• The revolution driving enormous development in Bioinformatics and experimental sciences came from whole genome sequencing

Fleischmann, R. D., Adams, M. D., White, O., Clayton, R. A., Kirkness, E. F., Kerlavage, A. R., Bult, C. J., Tomb, J. F., Dougherty, B. A., Merrick, J. M., McKenney, K., Sutton, G., Fitzhugh, W., Fields, C., Gocayne, J. D., Scott, J., Shirley, R., Liu, L. I., Glodek, A., Kelley, J. M., Weidman, J. F., Phillips, C. A., Spriggs, T., Hedblom, E., Cotton, M. D., Utterback, T. R., Hanna, M. C., Nguyen, D. T., Saudek, D. M., Brandon, R. C., Fine, L. D., Fritchman, J. L., Fuhrmann, J. L., Geoghagen, N. S. M., Gnehm, C. L., McDonald, L. A., Small, K. V., Fraser, C. M., Smith, H. O. & Venter, J. C. (1995). "Whole-genome random sequencing and assembly of Haemophilus influenzae rd." Science 269: 496-512.(Picture adapted from TIGR website, http://www.tigr.org)

• Integrative Data1995, HI (bacteria): 1.6 Mb & 1600 genes done1997, yeast: 13 Mb & ~6000 genes for yeast1998, worm: ~100Mb with 19 K genes1999: >30 completed genomes!2003, human: 3 Gb & 50 K genes...

Genome sequence now accumulate so quickly that, in less than a week, a single laboratory can produce more bits of data than Shakespeare managed in a lifetime, although the latter make better reading.

-- G A Petsko, Nature 401: 115-116 (1999)

Page 11: Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp. TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAG TACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAA.

What can we infer from sequence using Bioinformatics?

Expressed?

• cellular function • physiological function • substrate binding sites • protein-protein interfaces

• activity • specificity • docking • localisation

DNA

ORF

Protein

Active proteinDomains =smallest functional /structural subunits

3D structure

Function

Page 12: Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp. TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAG TACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAA.

Make sense of subtle differences

[Waterston et al. Nature 2002]

- About 90% of the mouse and human genomes are in syntenic blocks.

Page 13: Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp. TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAG TACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAA.

What’s in the genome?

• If we are so much alike in terms of genome, why are we so much different?– Large variation in human population– Similar genes and similar genome organization

between human and chimp (or even human and mouse), yet large phenotypic difference

• The importance of non-coding parts of our genome became more obvious– Non-coding, regulatory RNAs– Binding sites for regulatory proteins– Other possibilities that are not obvious right now

Page 14: Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp. TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAG TACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAA.

Complexity of biological information

1. Finding regulatorymotifs in DNA

2. Increasing the speedand reliability of functionalannotation from sequence

Page 15: Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp. TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAG TACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAA.

The more we know, the better?

Page 16: Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp. TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAG TACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAA.

So we have a genome sequence …>Sulfolobus virus 1 complete genome 15465 bp.TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAGCGGAATACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAAGGAGGGATATACTGAGAGTCCTACGCGTTAGTTCAGGTCAGACAAGAGAGAACGTAAACAAATCAATTCTGAAACAATTATTTGACCATGGTAAGGAACATGAAGATGAAGAAGAGTAATGAATGGTTATGGTTAGGGACTAAAATTATAAACGCCCATAAGACTAACGGCTTTGAAAGTGCGATTATTTTCGGGAAACAAGGTACGGGAAAGACTACTTACGCCCTTAAGGTGGCAAAAGAAGTTTACCAGAGATTAGGACATGAACCGGACAAGGCATGGGAACTGGCCCTTGACTCTTTATTCTTTGAGCTTAAAGATGCATTGAGGATAATGAAAATATTCAGGCAAAATGATAGGACAATACCAATAATAATTTTCGACGATGCTGGGATATGGCTTCAAAAATATTTATGGTATAAGGAAGAGATGATAAAGTTTTACCGTATATATAACATTATTAGGAATATAGTAAGCGGGGTGATCTTCACTACCCCTTCCCCTAACGATATAGCGTTTTATGTGAGGGAAAAGGGGTGGAAGCTGATAATGATAACGAGAAACGGAAGACAACCTGACGGTACGCCAAAGGCAGTAGCTAAAATAGCGGTGAATAAGATAACGATTATAAAAGGAAAAATAACAAATAAGATGAAATGGAGGACAGTAGACGATTATACGGTCAAGCTTCCGGATTGGGTATATAAAGAATATGTGGAAAGAAGAAAGGTTTATGAGGAAAAATTGTTGGAGGAGTTGGATGAGGTTTTAGATAGTGATAACAAAACGGAAAACCCGTCAAACCCATCACTACTAACGAAAATTGACGACGTAACAAGATAGTGATACGGGTAATGTCAGACCCCTTTTAGCCATTCCGCATACTTTTTATATTGCTCTTTCGCTATGCCGAAGAGCGATACGTAATGTTGCGTTAAAACGCGTGTCGGTTTACGCCCTTGAATAAAATCGATAATATCTAACGGTACGCTTAGCTCAGCCATCTTAGACGCTACGAATTTGCGGAAGTACTTTATCGCTATAGCGTCCTTATGACGTCGTTCAAAGTCCGCTATTGCCCACTTCGTCACCTCTACTCTCTTCAGAGGCGTTATGTGGAATACATAGAAGACGCCCTTATATCCCCTAGTCCAACTAAGCGGATAATAACAGACGTCGTTACCGCAAATGTCCCTTTCGGGTTCCTTCAGCACTTTCAGTATTTCGCTCAGCCTAACGCCCGACTCGAGAGCGATACGGTAGATGAAGTAGACGTTTTCGCTATAGTCTTTTGCTAATTGTAACGTCCTTTTTATCTCTTCCAACGTTGGAATGTAGATATCAGCGTTCGCCTTCTTCACCTTTACCGCTTTCAATATTTTATCCGCAAATTCATCATGTATGATATTGCGTGACGCTAAGAAACGTGCAAAGAGTCGGTAAGCCTTCTGTGCGTCTCTCGTCTCTTTATACGGCTTTGATATAGCATTGATGTAGTCCTTTGCAGTTTTTTCGCTTATCCCCCTTTCGTTCATGAGATAGTCGTAGAACGCCTTTATGTTGCCGTCCGTCGCGTATTGGCGCAAATTGGCAACCAACGCTATTTTACGTCGTTCAGTTCCCTCTTTTCCGCCTCCGGAGCCGGAGGTCCCGGGTTCAAATCCCGGCGGGTCCGCTTGTAGGGGAGTATCCCCTACGACCCCTAATTTCATTTTTAGATATGATTCAACGACGTCAGCTAAAGGACCCACGTAACGCTCTTTTACCTCACCGTTTTCATACTCTAGCTTGTAAACATAATACCGCCCTTTCCTCTCGCGTAAAATATAATCCCCGTATTTATAACGCGTCTTATCTTTCGTCATTTCGCCTCACAGTATTATGGTTGCCAAAACGGGCTTATAAGCATTGGCAACCCGTTAATTTTTGCCGTTAAAACACGTTGAATTGAAAGAAGACGGCAAAGAATCCACACAGGTAATACTAAAAAAGTAGTATTACTTACATTAGAAGGACTCATTTGTCCACCTTGTATTCTAGCCATGCTATCTCTGCCTTCAGCTCATCTAGCTTCCCCTTTATGTCTGTCAGGTCAAGGGGAACTCCTCTCATTAACCTGAGTTCGTTTTCGATTTTTTCAAGCTCCTTTTCCAACTCCTCTAGTTTCTCTAATTCCTTTAGTCGTTCTTCCAATTTCTTTTCCAATTTCCCCTTTGCGTCATTTATAATTATGCTTACTACCCAAACAATTCCTAAATCAGAAATAATTATTAACTCCTCTGAGTTGAATATCATTTTCCGCCCCTCGCTAAATACTCCTTAAAGCTCTGATAGAACCCCTTCAGACTAACCCGTAAGTCTGTTAGGTTCTTCCAGTATTGTAATGGGATTAAGTAATAGTAGCTTACTGCATCTCTCTCAAATTTGTCCTTCTTAATCTTTCCTTGCTTTTCTAAGTTGAGTATTTGCAGTGCTGAGATACATTTTAACTTGTCCTCAGCATCTGAATAGTGTATAAACCAAACCCTCCCCATAACCTCATTCTGCTTTGCAACTTCTACTTTAGTGCTTAATATTGCGTAAACGCTTTCGCCGTATCTTTCTTTGCTCTGTTCTTCAGTCCATGAACTTCCCGTAATATCTATCCAAATTAAAGGATAATATTCTGTCTTAGCCTTAACGTATAAAGTCAAATCGTATTTATCTTGCAGACCGCTATAGTATTGCTCATTTATTACATTAGTTAAAGTCCCCACGCCAGTTGGGCGGATATAAACATCAAAGTCTAACAAACCCTTAGCCCGCCACTTTGATAAAGAGATTAAGAGCTTTCCAAAAACTAGGTATTCTCGCCCTAAATAAGTTGAAGGGAGGATATAATCCTCAGCTTGATTACCCCAATACTTTAGCTTAAAATTAGTTTCAGCCATCTCACTCACCATATTGAAACGTGGGCTAGTATGTGAATCAGTACTGATGCTATTGCAAATAACACACTTGCAGTAGCAATTCCTATTACAATCCATTTACCATAATCCACCTTAGTTTGTTGGTCAATATACTCGTTGATGATCTTTAGTATTTCTGGCTTTAGTTCTGATAATGAAAGGAAGACAGAGGCATAAAGTACTAAGGAGGATGTGAACAGATTATCCGCCTTTTCTGAAAGTTTATAAAGCTCATATCTTGCTCTCTCATAATCTTCATAATTAATAATTTCATCAAACTTTTCTACTTGCTCTTCATATTCTTTCTTCAGAGAGTAAGGAGTTGTCTTTTCAATTACTCCTAATTTTATTAACTTCTTAACAGCTTCCTTAAATCCTTGTTTATTGCTAGCATACGCTAAAGGGTCTTTTCCTTCTTGAGAAGCTCTATAGATAACTATAGCACCATAAACAATATTTACAATATCGTATGGTAAGGAATACGCACCGATTTGGGCAATATCTTCAACTCTTCTTTGATCCATCTAGTTCACCTCTTTTTGATTTGTTTGTAGGTTTCTATCGCAGTTTTCAGCGATATCGCAAATAGCTTCCCCTTTTCCGTTAGGTATAGCCTCTTTTCGCCTCTTTCTTGACGCTCTTTCACGAAGCCCTCTTGTATTAGGAACTTTTTTGCATCATAAAAGGTGGCAGTGGACATGGGAAATTCTGCGTTTACTTTCTTGTATAGGTCATATGTTGCTATTCCTTCATTATCATATAGATAAGCCAATACTATGGCTTCGGGGTAGAAGAATGGTGTACTTTTCATATCCTCCTCACTCCTCAGCCTCTAATAGCTTAACTGCCTCCTCTATCAACTGTCCCATTGTCTTTCCAGTCTTTGCCTTAAGCCTCTGCAGAGTCTCATATGTTTCCTCACTTATTGAAATGTTAAGCCTTTTGACTATCCTATCTTTCCTCTTCTCTATCATTTAGGTCACCTTGTTTATTGTTATTTGAAATACGTATCCGTCTTCGTCACATCGAAGTATAATTTTGTATCCATTATTAGCATATTCTACGTCAAAGTTCCCACAACAATAATTCGGGTCTTCGGACTCGTTATAGACTTTGCTCCAACCATCTTTTTGTAGTGCCTCTTCTAAGTAGTCTACTCTGATGAAGCCTTCATCATATTCGTTCAGTACCCTAAAGCTTATACTATCAATGCCTAATACGTCTAATAGCTTCAACAGATCGAATATAGGAACTTGCACCATCATTTCAGCTCACCTTAATGAGCTGATATAATTCCGCTTCTATCTTTTGAACTTGGAAGTATGCCTTGCCTAGCTTTTGCTTATCCATATTGCCCGTTATTCTATCAATCTTAATCTCGTGGATTAATGATAATAGCTCTCTGACATCCTCATCAAGCATTTCAAATAATTCTTTCTCTAAGACTTCTTTACTCATTGTTTTTCACCTTAGCAAACTCATCTAACGTTGTTTGTCTCAGTTCTCTTTTCTTTATCAAATAAAATTCCGAATGTCCCTTCTTATTGTTATTACTGTACTTCATGTCAGTTCACTGCTTTGCCTTTATAAATCCTTGATCCGTTTGCTCAAAATTTGCGGGCTGGGCAT

Page 17: Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp. TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAG TACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAA.

Gene finding through learningatgccgtacgtacgtagagtgctagtctagtcgtagcgccgtagtcgatcgtgtgggtagtagctgatatgatgcgaggtaggggataggatagcaacagatgagcggatgctgagtgcagtggcatgcgatgtcgatgatagcggtaggtagacttcgcgcataaagctgcgcgagatgattgcaaagragttagatgagctgatgctagaggtcagtgactgatgatcgatgcatgtaa

gaggatgcagctgatcgatgtagatgcaataagtcgatgatcgatgatgatgctagatgatagctagatgtgatcgatggtaggtaggatggtaggtaaattgatagatgctagatcgtaggtagtagctagatgcagggataaacacacggaggcgagtgatcggtaccgggctgaggtgttagctaatgatgagtacgtatgaggcaggatgagtgacccgatgaggctagatgcgatggatggatcgatgatcgatgcatggtgatgcgatgctagatgatgtgtgtcagtaagatg

Gene

Non-gene

gcgatgcggctgctgagagcgtaggcccgagaggagagatgtaggaggaaggtttgatggtagttgtagatgattgtgtagttgtagctgatagtgatgatcgtag

Gene?

Page 18: Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp. TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAG TACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAA.

atg

tga

ggtgag

ggtgag

ggtgag

caggtg

cagatg

cagttg

caggccggtgag

Page 19: Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp. TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAG TACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAA.

Map looks better. Is this all?

Page 20: Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp. TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAG TACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAA.

OK, so we’ll predict protein functions …

Page 21: Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp. TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAG TACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAA.

… maybe do few experiments …

Page 22: Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp. TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAG TACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAA.

… and then enjoy glory (maybe money, too).

Trevor Douglas and Mark Young (2006) Science 312, 873 - 875.

Page 23: Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp. TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAG TACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAA.

What can we do with Molecular Biology information?

• Different levels of Molecular Biology information

• DNA– Coding or non-coding– Meaningful or junk DNA?

• RNA– Information transfer (mRNA, tRNA, rRNA)– Regulatory roles

• Protein– Structure and function– Modifications

Page 24: Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp. TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAG TACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAA.

Molecular Biology Information in DNA and RNA

• Raw DNA Sequence– 4 bases: AGCT– Coding or Not?– How do we parse

the sequence into genes?

– Because of introns, ~1 K in a gene could mean ~2 M in genome

• Raw RNA Sequence– 4 bases: AGCU– mRNA, tRNA, rRNA– Regulatory RNAs– Secondary

structure

atggcaattaaaattggtatcaatggttttggtcgtatcggccgtatcgtattccgtgcagcacaacaccgtgatgacattgaagttgtaggtattaacgacttaatcgacgttgaatacatggcttatatgttgaaatatgattcaactcacggtcgtttcgacggcactgttgaagtgaaagatggtaacttagtggttaatggtaaaactatccgtgtaactgcagaacgtgatccagcaaacttaaactggggtgcaatcggtgttgatatcgctgttgaagcgactggtttattcttaactgatgaaactgctcgtaaacatatcactgcaggcgcaaaaaaagttgtattaactggcccatctaaagatgcaacccctatgttcgttcgtggtgtaaacttcaacgcatacgcaggtcaagatatcgtttctaacgcatcttgtacaacaaactgtttagctcctttagcacgtgttgttcatgaaactttcggtatcaaagatggtttaatgaccactgttcacgcaacgactgcaactcaaaaaactgtggatggtccatcagctaaagactggcgcggcggccgcggtgcatcacaaaacatcattccatcttcaacaggtgcagcgaaagcagtaggtaaagtattacctgcattaaacggtaaattaactggtatggctttccgtgttccaacgccaaacgtatctgttgttgatttaacagttaatcttgaaaaaccagcttcttatgatgcaatcaaacaagcaatcaaagatgcagcggaaggtaaaacgttcaatggcgaattaaaaggcgtattaggttacactgaagatgctgttgtttctactgacttcaacggttgtgctttaacttctgtatttgatgcagacgctggtatcgcattaactgattctttcgttaaattggtatc . . .

. . . caaaaatagggttaatatgaatctcgatctccattttgttcatcgtattcaacaacaagccaaaactcgtacaaatatgaccgcacttcgctataaagaacacggcttgtggcgagatatctcttggaaaaactttcaagagcaactcaatcaactttctcgagcattgcttgctcacaatattgacgtacaagataaaatcgccatttttgcccataatatggaacgttgggttgttcatgaaactttcggtatcaaagatggtttaatgaccactgttcacgcaacgactacaatcgttgacattgcgaccttacaaattcgagcaatcacagtgcctatttacgcaaccaatacagcccagcaagcagaatttatcctaaatcacgccgatgtaaaaattctcttcgtcggcgatcaagagcaatacgatcaaacattggaaattgctcatcattgtccaaaattacaaaaaattgtagcaatgaaatccaccattcaattacaacaagatcctctttcttgcacttgg

Page 25: Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp. TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAG TACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAA.

Molecular Biology information in protein sequences

1. Finding regulatorymotifs in DNA

• 20 letter alphabet, more combinatorial variability than DNA (20AA-number)

– ACDEFGHIKLMNPQRSTVWY but not BJOUXZ

• Strings of ~300 aa in an average protein (in bacteria), ~200 aa in a domain

• More than 2 million unique protein sequences (more than 5.6 M of total sequences in the database)

• We must be able to “transfer” the function from characterized proteins to uncharacterized ones based on some measure of similarity

Page 26: Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp. TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAG TACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAA.

Molecular Biology information in macromolecular structures

• DNA/RNA/Protein– The majority of all structures

are of proteins– Proteins easier to crystallize

and were thought to be more important

Page 27: Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp. TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAG TACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAA.

Organizing information: Redundancy and multiplicity help

…• Fairly different sequences may have the same

structure and function– Bad news: If they are very different, how do we find this?– Good news: Once they are found, we learn something more

about structure and function

• An organism has many similar genes and non-coding RNAs– The redundancy present for essential genes and/or RNAs

(rRNA)

• Single gene may have multiple functions– Combining domains in eukaryotes produces large proteins

• Genes are grouped into pathways; this is good

Page 28: Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp. TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAG TACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAA.

… though sometimes the path is difficult

• Evolutionary distances do not help establish initial relationship– Large differences (large evolutionary distances) between

proteins are hard to identify and defend on statistical grounds without experiment

• Evolutionary distances do help once the relationship is established– If the relationship between distant proteins is

established, their conserved parts provide information about what is vital for function

– Less conserved parts of proteins are less important for function - scaffold

• Given all these difficulties, how do we find hidden similarities?

Page 29: Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp. TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAG TACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAA.

Some things we can do using just sequence

• Sequence (text string) comparisons– Sequence (text string) search– Sequence alignment– Finding short sequences in biological sequences– Significance statistics

• Databases– Building, Querying

• Learning patterns– Artificial Intelligence and Machine Learning– Mining for patterns and clustering them

• Secondary structure prediction– Where are helices, strands and loops in proteins?– Finding trans-membrane helices

• Tertiary structure prediction– Fold recognition and structure prediction– Active site identification

Page 30: Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp. TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAG TACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAA.

How are optimal alignments found?(Should we all pick the one we like?)

Page 31: Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp. TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAG TACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAA.

Aligning text strings …Which alignment is the best?

Raw Data ???T C A T G C A T T G

2 matches, 0 gaps

T C A T G | |C A T T G

3 matches (2 end gaps)

T C A T G . | | | . C A T T G

4 matches, 1 insertion

T C A - T G | | | | . C A T T G

4 matches, 1 insertion

T C A T - G | | | | . C A T T G

Page 32: Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp. TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAG TACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAA.

Dynamic Programming to the rescue1. Finding regulatory

motifs in DNA•What to do for Bigger String?SSDSEREEHVKRFRQALDDTGMKVPMATTNLFTHPVFKDGGFTANDRDVRRYALRKTIRNIDLAVELGAETYVAWGG

REGAESGGAKDVRDALDRMKEAFDLLGEYVTSQGYDIRFAIEPKPNEPRGDILLPTVGHALAFIERLERPELYGVNP

EVGHEQMAGLNFPHGIAQALWAGKLFHIDLNGQNGIKYDQDLRFGAGDLRAAFWLVDLLESAGYSGPRHFDFKPPRT

EDFDGVWAS

•Needleman-Wunsch (1970) provided first automatic method– Dynamic Programming to Find Global Alignment– Local Alignment is sometimes better than Global

•Needleman-Wunsch Test Data–ABCNYRQCLCRPMAYCYNRCKCRBP

Page 33: Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp. TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAG TACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAA.

Make a dot plot (Similarity matrix)

Put 1's where characters are identical.

A B C N Y R Q C L C R P M

A 1

Y 1

C 1 1 1

Y 1

N 1

R 1 1

C 1 1 1

K

C 1 1 1

R 1 1

B 1

P 1

Page 34: Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp. TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAG TACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAA.

Scoring the alignment

• The idea is to go through the matrix and find a shortest path to the bottom (it is actually done from the bottom backwards)

• Caveat 1: This path also needs to have the highest score

• Caveat 2: We have to score the gaps (insertions and deletions) since they do not exist in proteins

Page 35: Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp. TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAG TACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAA.

Global alignment by dynamic programming

Sequence X: MONTANASequence Y: MONTANAScoring system: 5 for match; -2 for mismatch; -6 for gap

Dynamic programming matrix: M O N T A N A 0 -6 -12 -18 -24 -30 -36 -42 M -6 5 -1 -7 -13 -19 -25 -31 O -12 -1 10 4 -2 -8 -14 -20 N -18 -7 4 15 9 3 -3 -9 T -24 -13 -2 9 20 14 8 2 A -30 -19 -8 3 14 25 19 13 N -36 -25 -14 -3 8 19 30 24 A -42 -31 -20 -9 2 13 24 35

Optimum alignment score: 35X: MONTANAY: MONTANA

Page 36: Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp. TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAG TACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAA.

What about gaps?

Sequence X: MONTTANASequence Y: MONTANAScoring system: 5 for match; -2 for mismatch; -6 for gap

Dynamic programming matrix: M O N T A N A 0 -6 -12 -18 -24 -30 -36 -42 M -6 5 -1 -7 -13 -19 -25 -31 O -12 -1 10 4 -2 -8 -14 -20 N -18 -7 4 15 9 3 -3 -9 T -24 -13 -2 9 20 14 8 2 T -30 -19 -8 3 14 18 12 6 A -36 -25 -14 -3 8 19 16 17 N -42 -31 -20 -9 2 13 24 18 A -48 -37 -26 -15 -4 7 18 29

Optimum alignment score: 29X: MONTTANAY: MON-TANA

Page 37: Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp. TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAG TACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAA.

Scoring “real-life” alignments

Sequence X: MONTANABOBCATSSequence Y: MONTANAGRIZZLIESScoring system: 5 for match; -2 for mismatch; -6 for gap

Dynamic programming matrix: M O N T A N A G R I Z Z L I E S 0 -6 -12 -18 -24 -30 -36 -42 -48 -54 -60 -66 -72 -78 -84 -90 -96 M -6 5 -1 -7 -13 -19 -25 -31 -37 -43 -49 -55 -61 -67 -73 -79 -85 O -12 -1 10 4 -2 -8 -14 -20 -26 -32 -38 -44 -50 -56 -62 -68 -74 N -18 -7 4 15 9 3 -3 -9 -15 -21 -27 -33 -39 -45 -51 -57 -63 T -24 -13 -2 9 20 14 8 2 -4 -10 -16 -22 -28 -34 -40 -46 -52 A -30 -19 -8 3 14 25 19 13 7 1 -5 -11 -17 -23 -29 -35 -41 N -36 -25 -14 -3 8 19 30 24 18 12 6 0 -6 -12 -18 -24 -30 A -42 -31 -20 -9 2 13 24 35 29 23 17 11 5 -1 -7 -13 -19 B -48 -37 -26 -15 -4 7 18 29 33 27 21 15 9 3 -3 -9 -15 O -54 -43 -32 -21 -10 1 12 23 27 31 25 19 13 7 1 -5 -11 B -60 -49 -38 -27 -16 -5 6 17 21 25 29 23 17 11 5 -1 -7 C -66 -55 -44 -33 -22 -11 0 11 15 19 23 27 21 15 9 3 -3 A -72 -61 -50 -39 -28 -17 -6 5 9 13 17 21 25 19 13 7 1 T -78 -67 -56 -45 -34 -23 -12 -1 3 7 11 15 19 23 17 11 5 S -84 -73 -62 -51 -40 -29 -18 -7 -3 1 5 9 13 17 21 15 16

Optimum alignment score: 16X: MONTANA--BOBCATSY: MONTANAGRIZZLIES

Page 38: Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp. TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAG TACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAA.

The scoring depends on our choice of parameters

Sequence X: MONTANABOBCATSSequence Y: MONTANAGRIZZLIESScoring system: 5 for match (B=G); -2 for mismatch; -6 for gap

Optimum alignment score: 23X: MONTANAB--OBCATSY: MONTANAGRIZZLIES

Sequence X: MONTANABOBCATSSequence Y: MONTANAGRIZZLIESScoring system: 5 for match; -2 for mismatch; -1 for gap

Optimum alignment score: 26X: MONTANA--------BOBCATSY: MONTANAGRIZZLIE------S

Page 39: Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp. TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAG TACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAA.

How do we choose good scoring parameters?

• A simple scoring scheme considers only sequence identity

• More realistic scoring schemes consider sequence similarity, which is taken from substitution matrices

• We measure the frequency of residue substitutions and normalize it by residue frequency in the database (LOG2 an/ad)

• Zero in substitution matrix means that the substitution occurs by chance

• Score less than zero means that the substitution is unlikely to occur by chance

• There is no universally good matrix

Page 40: Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp. TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAG TACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAA.

BLOSUM62 substitution matrix

A R N D C Q E G H I L K M F P S T W Y VA 4 -1 -2 -2 0 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -3 -2 0 R -1 5 0 -2 -3 1 0 -2 0 -3 -2 2 -1 -3 -2 -1 -1 -3 -2 -3 N -2 0 6 1 -3 0 0 0 1 -3 -3 0 -2 -3 -2 1 0 -4 -2 -3 D -2 -2 1 6 -3 0 2 -1 -1 -3 -4 -1 -3 -3 -1 0 -1 -4 -3 -3 C 0 -3 -3 -3 8 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 Q -1 1 0 0 -3 5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2 E -1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2 G 0 -2 0 -1 -3 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3 H -2 0 1 -1 -3 0 0 -2 7 -3 -3 -1 -2 -1 -2 -1 -2 -2 2 -3 I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 2 -3 1 0 -3 -2 -1 -3 -1 3 L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1 K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2 M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -1 -1 1 F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 -4 -2 -2 1 3 -1 P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 6 -1 -1 -4 -3 -2 S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2 T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 -2 -2 0 W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 10 2 -3 Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 6 -1 V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4

Page 41: Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp. TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAG TACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAA.

What is the limit of substitution matrices?

Page 42: Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp. TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAG TACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAA.

Why did substitution matrices fail?

• The proteins in question are very distantly related and their substitution patterns are not properly rewarded by general matrices

• Substitution matrices do not capture all families equally well because they are meant to be general

• Can we build protein family-specific substitution matrices?

• Yes, these are known as protein family profiles

Page 43: Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp. TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAG TACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAA.

How do we build protein family-specific matrix?

Search protein database using BLOSUM62 matrix

Build protein-family specific matrix (profile) and search protein database again

???

Page 44: Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp. TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAG TACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAA.

Detecting distant relationships using profiles (PSI-BLAST)

Page 45: Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp. TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAG TACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAA.

Position-specific scoring matrix (PSSM) A R N D C Q E G H I L K M F P S T W Y V T -1 0 1 2 -2 0 1 -1 0 -2 -2 0 -1 -2 -1 0 3 -1 -1 -1 M -1 -2 -3 -3 -1 -2 -3 -3 -2 1 1 -2 7 0 -3 -3 -2 -1 -1 0 D -1 2 0 2 -3 3 1 -1 0 -3 -3 1 -2 -3 -1 0 0 -2 -2 -2 V -1 -4 -4 -5 -1 -4 -4 -5 -3 4 1 -4 0 -1 -4 -4 -1 -2 -2 5 I -1 -4 -5 -6 -1 -4 -5 -6 -4 5 1 -5 0 -1 -5 -5 -2 -2 -2 4 S 0 -1 0 -1 -1 -1 -1 -1 -1 -3 -3 -1 -2 -3 -2 4 3 -3 -2 -2 F -2 -4 -4 -5 -2 -4 -5 -5 -2 1 2 -4 1 6 -4 -4 -3 2 2 0 K -1 3 -1 -1 -4 1 0 -2 0 -3 -3 5 -2 -4 -2 -1 -1 -3 -3 -3 L -2 -4 -5 -6 -1 -4 -5 -6 -3 3 4 -4 2 1 -4 -5 -3 -1 -1 1 P -1 -3 -3 -2 -4 -3 -2 -3 -3 -4 -4 -2 -4 -4 7 -2 -2 -4 -4 -3 P -1 -1 -1 2 -4 -1 0 -1 -1 -4 -4 -1 -4 -4 6 -1 -1 -3 -3 -3 E 0 0 0 2 -4 1 4 -2 -1 -3 -3 1 -2 -4 -1 -1 -1 -3 -3 -2 L -2 -3 -4 -5 -2 -3 -4 -5 -2 2 4 -4 6 1 -4 -4 -2 -1 -1 1 N -1 0 3 0 -2 0 2 -1 0 -2 0 1 -1 -2 -1 0 0 -1 -1 -2 A 2 1 -1 -1 -1 0 0 -2 0 0 0 0 0 -1 -1 -1 0 0 0 0 K -1 1 -1 -1 -3 1 0 -2 0 -2 0 4 -1 -2 -2 -1 -1 -2 -1 -1 L -2 -4 -5 -5 -2 -4 -5 -5 -3 1 5 -4 1 1 -4 -5 -3 -1 -1 1 E -1 0 2 4 -4 0 3 -1 0 -5 -4 0 -4 -4 -1 0 -1 -3 -3 -4 S 1 1 0 0 -3 3 1 -1 0 -2 -2 1 -2 -2 -1 1 0 -2 -1 -2 V -1 -1 -2 -3 -1 -2 -3 -3 0 2 1 -2 1 2 -2 -2 -1 1 3 3 A 5 -3 -3 -2 0 -2 -2 0 -2 -2 -2 -2 -2 -3 -2 0 -1 -3 -3 -1 L -1 0 -1 -2 -1 -1 -1 -2 0 2 1 0 1 0 -2 0 0 1 2 0 K -1 2 0 0 -3 1 2 -1 0 -3 -3 3 -2 -4 -1 1 0 -3 -2 -2 E -1 1 0 0 -2 2 2 -2 3 -2 0 1 -1 -2 -1 -1 0 -1 -1 -1 K -1 1 1 0 -3 0 0 2 0 -4 -3 3 -3 -3 -1 0 -1 -2 -2 -3 K -1 0 -1 -1 -2 0 0 -3 0 0 -1 3 3 -1 -2 -1 0 -1 0 1 S -1 -1 2 0 -2 0 0 -1 0 -3 -3 0 -2 -3 -1 3 2 -2 -2 -3

Page 46: Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp. TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAG TACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAA.

Combining sequence similarity with SS information

Page 47: Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp. TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAG TACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAA.

Can we quickly scan for common protein families?

• YES, many databases available• Instead of comparing our query to other

sequences, we compare it to the database of profiles (also called Hidden Markov Models or HMMs)

• Profiles (and HMMs) capture the average preference for residues at all positions; They are probabilistic representations of protein families

• Try these databases:– PFAM (http://pfam.wustl.edu)– SMART (http://smart.embl-heidelberg.de)

Page 48: Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp. TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAG TACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAA.

Profiles HMMs have other uses

• Profile HMMs represent a phylogenetic footprint of a given protein family– Also used for secondary structure

prediction– Predictions of trans-membrane proteins– Prediction of protein disorder

• Most of these predictors are based on machine-learning algorithms that are trained on known data and can extract subtle patterns

Page 49: Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp. TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAG TACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAA.

Questions?

Mensur DlakicDepartment of Microbiology

111 Lewis HallTel: 994-6576

[email protected] office: 109 Lewis Hall