Billions and Billions of Bases How does a biologist maintain a grip on reality?
kristian-bates -
Billions and Billions of Bases
How does a biologist maintain a grip on reality?
46 chromosomes
~3 billion nucleotides
The Human Genome Project
One millionth of total
The Human Genome ProjectTGAGACACATATTTTTGATATTCCAGTTGTTGCAATCGAATGTAAAACATATTTAGATCTTTAAATGTATGGTACATTCAAGATCCAACCTTCATTCTAGTGTTTAAAGAGAACTGATTTGTTTGCAGGGGCAGGAGGCTTTGGTTTAGGTTTTGAAATGGCAGGCTTCTCTGTACCTTTATCTGTTGAAATTGATACCTGGGCTTGTGATACACTACGCTACAACCGCCCTGATTCAACAGTTATTCAAAATGATATCGGTAACTTTAGTACAGAAAATGACGTTAAGAATATCTGCAACTTTAAACCTGATATTATTATTGGCGGGCCTCCATGCCAGGGATTTAGTATTGCTGGGCCAGCCCAAAAAGATCCTAAAGATCCTAGAAATGGTTTATTCATCAACTTTGCACAATGGATAAAATTTCTTGAACCTAAAGCGTTTGTCATGGAAAACGTAAAAGGATTGCTATCAAGGAAAAATGCAGAAGGTTTTAAAGTTATAGATATTATTAAGAAAACATTTGAAGAACTTGGTTATTTTGTCGAAGTATGGGTTTTAAATGCTGCGGAATATGGCATTCCGCAAATTAGAGAACGTATTTTTATTGTTGGCAATAAAAAAGGTAAAGTACTAGGTATTCCTAAAAAAACACATTCTCTGCAATTTTTAAATTTAAATAGGTCTCAATTATCGATCTTCGATGATATGAGTATTATACCTGCACTAACTTTGTGGGACGCAATATCAGACTTACCAGAACTTAATGCGCGTGAAGGAAGTGAAGAGCAACCCTATCATTTAAAACCTCAAAATACTTATCAGACTTGGGCTAGAAATGGTAGTGCTACGCTTTACAATCATGTTGCAATGG
AACATTCTGACCGTTTAGTAGAACGTTTCCGGCATATAAAATGGGGTGAATCCAGTTCGGATGTATCTAAAGAACATGGAGCTAGACGACGTAGTGGTAATGGTGAATTATCAAACAAATCATATGATCAGAATAATCGCCGTTTAAATCCTCATAAACCGTCTCACACTATTGCTGCGTCATTCTATGCTAATTTTGTCC
ATCCTTTTCAACATCGAAATTTAACAGCCCGTGAAGGAGCTAGAATCCAATCTTTTCCAGATAACTATAGATTTTTTGGAAAAAAAACTGTCGTATCTCATAAACTATTGCATCGA GAAGAAAGATTTGATGAAAAATTTCTTTGTCAATATAATCAAATCGGTAATGCTGTACCCCCTCTTCTCGCTAAAGTAATTGCACATCATCTTCTAGAGAAATTAGAGTTATGCCAACAACTGATAGAAATCCTCTAGTGCATGGATCAAATCTTGAACAAAAAGAGAATCATCGTACAAAATACAGAGATACTGAAAGCAGGACTTTCCTTAGAGAAATCAGAACTGAATATGACAAATGGCATAAAGCAAATATGAACCTGGTTGGACCAAAATCAGAAATTACTGACCAAGATGATTCAATTATTACTCAAAGAGTGGAACTTCTCACTAAATATAAAGATTTTTTAGATCAGCAGCATTATGCAGAAAAATTTGATTCAAGATCCAACCTTCATTCTAGTGTTTTAGAGACCATTTATAAAGTAAATCTTTAGACGACTAGACGACGTAGCATAATACGAGTCATAACGGCATATATG GCAGCCTCACTCATTTCTGGGAGACGCTCATAATCCTTACTGAGACGACGGTACTGGTTTAACCAGCCAAATGTTCTTTCTACTACCCACCGTTTGGGCAAAACCTGAAATTCTTGATTAGTACGCCGGATTACCTCAACATGAGCTTGAATCATCAGCCAAACAGAGAGCGCAAATTTATCACCGTCATAGCCGGAATCAACCCAGATGACTTCAACTTTTTCCAGTAATTCTGGACGCTCTTCTAACAGTTCCATCAAAGTATAGGCGGCAAGTAATCTTTCTCCAGCATTTGCTTCACTTACAACCACTTTTAACAAAAGTCCCAGACTATCAACCAAAGTTTGCCGCTTTCGTCCTTTTACCTTCTTGCCACCATCAAAACCGTACACATCCCCCTTTTTTCAGTCGTTTTTACCGACTGGCTGTCTGCCGCGATCGCCGTGGGTTGAGTTGACTTCCCCATTTTTTGACGAACTTGATCGCGCAAAGTATGATTCATTTCAGTTGAACTAGGAGGAAAATCCCCTGGAAGCATATCCCACTGACAACCTGTTTT
CAGATGGTAGTAGATAGCGTTGCATACTTCTCGCATATCAGTTGTTCGGGGATGCCCACCGCATTTAGCGGGTGGAATCAAAGGAGCTAAAATTGCCCATTCTGAGTCATTAAGGTCTGTAGAATAAGACTTTCGTCTCATTGTTTCCTATGTAAATACACTCTACAAACAGTATCTTATCGCTGCCTTTTTATCTTAGCTCTCCTTTAGATTTACTTTATAAATAGCCTCTTAGAAGAATTTCTTTATTATTTATTTAAAGATTTAGTACAAGATTTCGGGCAGAACGCTCTTATTGGTAAGTCACACACGTTCAAAGATATTTTCTTCGTACCACCAAAATATTCTGAAATGCTCAAGCGACCTTATGCGCGAATTGAGAGAAAAGATCATGATTTCGTAATTGGTGCAACTGTTCAAGCATCGCTTGAAGCAGCACCTCCTCCAGAACAAAACCATGCTTGAGGGATCTTCACGCGCAGCAGAGGATTTAAAAGCGAGAAATCCTAACAGTTTATACCTTGTGGTTATGGAATGGATAAAACTGACCAATGATGTAAATTTACGAAAATATAAAGTTGATCAAATTTATGTACTACGTCAGCAAAAAAATACTGATAGAGAGTTTAGGTATGAGTCAACTTACATAAAAAAT
The Human Genome ProjectAATAAAGCTTTACAAACCAAACTCTGGCTTCAATTGTGTAACCCAAGCTTTGATTCTTTCCTCTGTTAAATCGGATTGATTATCTTCATCAAGGGCAAGACCTACAAATTTACCATCACGAACAGCTTTAGACTCACTGAATTCATAACCTTCTGTAGGCCAATAGCCAACTGTTTCACCACCATTTTCTGAAATTTTTTCCTCTAGAATACCGAGGGCATCTTGAAATGTATCAGGATAACCAACCTGGTCTCCAGGAGCAAAATAAGCAACTTTTTTGCCGATGAAGTCAATGTTATCTAACTCATCATAAAAATTTTCCCAATCACTTTGCAATTCTCCAACATTCCAGGTAGGACAACCAACAACGATATAATCGTAGTTATTGAAATCACTTGGTTCAGCTTGTGAAATATCATATAAAGTTACAACACTATCACCACCAAACTCCTTCTGAATTATTTCTGATTCAGTTTGGGTATTGCCTGTTTGAGTACCAAAAAATAAACCAATATTAGACATTTTTACTCCTTTTATGTATTTGCAAAATTATTTCAATTAAAATATTTAGTAATAATTAATTGTTAGCTAGCTAATAATTAAATTTTTATTACAATCATTGTAAAAGGCATTGAAAAAGTAAATAAAAATTTTTATTCTACGTTATTTCAAAAATATTTACTTACATATACTTAACCTTTATAGTGATGTAATATACTCTAATTCCTATTTTACTTATAAATACCATCTCAGCTTAATGTAACGAATTTTTCTGTTTATCTTTAAATACAAAAAATTCAACAAAACTACAGAAAATTAATCTTAATAACACAAAACAAGTATCAATCTGTAATACAACTAAGCTTAAATAAATTAATAGAAAGCTTCATCTATCTAATAGGTTGAGAATAGTTTATGTCTAATGACATAAATTCATTCGTGTTGATTTCATTTGGGTATATTCATCTGATTTAGGATTTACTCCATTAAGTTTGTACTCATCAATGCCCGCCTGTTGGTATCCACAATTCTCATACAGTGCGCGAGCAAAGTAATCAATCGTTCGTCGCCATATCTAACTTTGAGTCAAACAAACCAGTTGGATTACCAACCCTCAACTAATCGCTTCTTTAAGGCGAGCGATCGCACATTTAACTGTTGGTTGTCACAAGAGAACTAATACTACAGCAGTATATTTAACAACTAAGGGTGGTTCAACTTTCGCTGCGACTCCTCCAACGCGCTGAAATACACAGGACTGATGCGATCGCAAACTCTTTGACTAAATTCCATACATTATCATGACCATCTCCCAAACAAACAAGTGGGTTAACCAGATGCTGACTATTAACATCCCCTGAGTTCGGAGTTGTAGGTCTATTTGACTGGTTCAAAGCGATGATGGAACGGCTTTGTTGCATGAATTAAAAAAAGACACACCATCACCTACTTCTAGGATAGACACATCAAACGTCCCACCGCCTAAGTCAAATACCAAGATAATTTCGTTAGTTTTCTTGTCAAGTCCGTAAGCGAGGGCCGCCGCCGTGGGCTAGTTGATAATTCGCAGAACTTTAATCCCGGCAATTCTACTGGCATCTTTGGTAGCCTGCCGTTGAGAGTCATTGAAATAGGCAGGGGTGGTAATTACCGCTTGCCTCACTGGTTCCCCCAGATATGTGCTGGCATCATCTATCAGCTTGCGGACTACCTCATACCATTTCACGAAAAACCTGATACACATGTAAACTCTGAAACCCTTGCTGTATCAAAGTTTTGTAATTACGAATTACGAATTACGAATTGATATCAGCCGAGATTTCTTCGGGTGAAAATTCCTTGTTCAGAGCGGGACAGTGTAGCTTGACATTGCCATTACTGTCACGTACCACTTTGTAAGTAACTTGTTTTGCCTCTTGCGTAACTTCATCATACCTGCGCCCGATGAACCGCTTCACAGAATAAAAAGTGTTTTC
TGGGTTCATTACACCCTGGCGCTT
The Human Genome Project
A Walk in the Forest
* Photo courtesy of www.webshots.com
Observation
* Photos courtesy of www.webshots.com and Peter Smallwood
Observation
Observation
Observation
Experiment
Filters: Information reducers
Squirrel filter
Filters: Information reducers
Molecule filter
Filters: Information reducers
Sequence filter
How organism is made
How organism works
TCTACTTATA TTCAATCCAC AGGGCTACAC CTAGTTCTTG AAGAGTCTGT TGAATGAACA CATACATGGT TTATCTGTTT TTCTGTCTGC TCTGACCTCT GGCAGCTTTC CACTAGTTTC TGGATTTCGG AACTCTAGCC TGCCCCACTC TTAGATAAAC GAACCTTAGT GACTTCTGCT ATACCAAAGT CTCCACGCCC CTCCGTAAAC CTCTAACATG ATGTCAGCAA ATATTAAAAA TGAATAAACT TTGTTAAAGG TACAAATGAA AATTAGCAAA AAGAGTTTAA AGTTAAAAAC GAATTGCAGT CATTCTAGGG AAACCTGTAT GGTTACATGA ACTGCCTAAA AAACAAGCTA TTATATATTT TAAGAAATTA ATTGCAATTA ATTTCCTGGG CCCCAGCTGT CATTAAAAAG AGGCAAATAC AGCCAAGGAC GACAGCACTG ACCCTCAAGA AGGCACCGGC TGACAGACAG GCTGAAATTC CGCTGAGAGC AGAGTGGTAC ATTGAACCCT CCCTGCACCA GGTCTTTCCT GTGGGCACTG AGTGCAGACA ATGAATGACT GAACGAACGA TTGAATGAAA AGAAATGAGA TATGAGGCAA TCACAGCATC AGGTGACCTT AGTATCTATT CTCGGGAGCG CACGGCTCTA AAGAGGCCCA TATCCAGGCA CCTTTAGATG CAAGAAGGAG GAAACAGCTC GAAATCCCTG AGGCCGGAGG GTCAAGAACT CTCCACCGGC GGCAGCGGCC CCCCGGCCTA AGGCTGCCTG TGCTATAAAT ACGCGGCCCA TTCCCTGGGC TCGGCGGGAC AGATAACATG AATGTGCCCT
CTCCGTAAAC CTCTAAC...
From Sequence to Organism
How does Nature do it?
ATGACTTATGATCAACGCACAGGGCTA Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu...
Genetic code Rules of folding
Active site
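The genetic-code step on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard genetic code; the function name `translate` and the table layout are choices made here, not something taken from the slides.

```python
# Standard genetic code, laid out in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
THREE_LETTER = {
    "A": "Ala", "C": "Cys", "D": "Asp", "E": "Glu", "F": "Phe",
    "G": "Gly", "H": "His", "I": "Ile", "K": "Lys", "L": "Leu",
    "M": "Met", "N": "Asn", "P": "Pro", "Q": "Gln", "R": "Arg",
    "S": "Ser", "T": "Thr", "V": "Val", "W": "Trp", "Y": "Tyr",
}


def translate(dna):
    """Translate a DNA coding sequence codon by codon, stopping at a stop codon."""
    residues = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # stop codon ends the protein
            break
        residues.append(THREE_LETTER[amino_acid])
    return "-".join(residues)


print(translate("ATGACTTATGATCAACGCACAGGGCTA"))
```

Run on the slide's sequence, this reproduces the peptide shown there. The folding step has no comparably simple rule, which is the point the slide goes on to make.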
From Sequence to Organism
How does Nature do it?
ATGACTTATGATCAACGCACAGGGCTA Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu...
Active site
Cell interaction
Metabolism, Architecture
Genetic code Rules of folding
From Sequence to Organism
How does Nature do it?
ATGACTTATGATCAACGCACAGGGCTA Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu...
Genetic code
Active site
Gives us:
• Custom antibiotics
Genetic code Rules of folding
From Sequence to Organism
How does Nature do it?
ATGACTTATGATCAACGCACAGGGCTA Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu...
Gives us:
• Custom antibiotics
• Custom antibodies
• Custom enzymes
• New materials
Genetic code Rules of folding
Active site
From Sequence to Organism
How does Nature do it?
ATGACTTATGATCAACGCACAGGGCTA Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu...
Genetic code
Rules of transcriptional and post-transcriptional control
• Begin transcription
• End transcription
• Splice transcript
• Begin translation
ATGACTTATGATCAACGCACAGGGCTA
3%
?
TCTACTTATA TTCAATCCAC AGGGCTACAC CTAGTTCTTG AAGAGTCTGT TGAATGAACA CATACATGGT TTATCTGTTT TTCTGTCTGC TCTGACCTCT GGCAGCTTTC CACTAGTTTC TGGATTTCGG AACTCTAGCC TGCCCCACTC TTAGATAAAC GAACCTTAGT GACTTCTGCT ATACCAAAGT CTCCACGCCC CTCCGTAAAC CTCTAACATG ATGTCAGCAA ATATTAAAAA TGAATAAACT TTGTTAAAGG TACAAATGAA AATTAGCAAA AAGAGTTTAA AGTTAAAAAC GAATTGCAGT CATTCTAGGG AAACCTGTAT GGTTACATGA ACTGCCTAAA AAACAAGCTA TTATATATTT TAAGAAATTA ATTGCAATTA ATTTCCTGGG CCCCAGCTGT CATTAAAAAG AGGCAAATAC AGCCAAGGAC GACAGCACTG ACCCTCAAGA AGGCACCGGC TGACAGACAG GCTGAAATTC CGCTGAGAGC AGAGTGGTAC ATTGAACCCT CCCTGCACCA GGTCTTTCCT GTGGGCACTG AGTGCAGACA ATGAATGACT GAACGAACGA TTGAATGAAA AGAAATGAGA
From Sequence to Organism
How does Nature do it?
ATGACTTATGATCAACGCACAGGGCTA Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu...
Genetic code
Rules of transcriptional and post-transcriptional control
TCTACTTATATTCAATCCACAGGGCTACACCTAGTTCTTGAAGAGTCTGTTGAATGAACACATACATGGTTTATCTGTTTTTCTGTCTGCTCTGACCTCTGGCAGCTT
TAGCCTGCCCCACTCTTAGATAAACGAACCTTAGTGACTTCTGCTATACCAAAGTCTCCACGCCCCTCCGTAAACCTCTAACATGATGTCAGCAAATATTAAAAATGA
97%
TCTACTTATA TTCAATCCAC AGGGCTACAC CTAGTTCTTG AAGAGTCTGT TGAATGAACA CATACATGGT TTATCTGTTT TTCTGTCTGC TCTGACCTCT GGCAGCTTTC CACTAGTTTC TGGATTTCGG AACTCTAGCC TGCCCCACTC TTAGATAAAC GAACCTTAGT GACTTCTGCT ATACCAAAGT CTCCACGCCC CTCCGTAAAC CTCTAACATG ATGTCAGCAA ATATTAAAAA TGAATAAACT TTGTTAAAGG TACAAATGAA AATTAGCAAA AAGAGTTTAA AGTTAAAAAC GAATTGCAGT CATTCTAGGG AAACCTGTAT GGTTACATGA ACTGCCTAAA AAACAAGCTA TTATATATTT TAAGAAATTA ATTGCAATTA ATTTCCTGGG CCCCAGCTGT CATTAAAAAG AGGCAAATAC AGCCAAGGAC GACAGCACTG ACCCTCAAGA AGGCACCGGC TGACAGACAG GCTGAAATTC CGCTGAGAGC AGAGTGGTAC ATTGAACCCT CCCTGCACCA GGTCTTTCCT GTGGGCACTG AGTGCAGACA ATGAATGACT GAACGAACGA TTGAATGAAA AGAAATGAGA
?
• Begin transcription
• End transcription
• Splice transcript
• Begin translation
From Sequence to Organism
How does Nature do it?
Natural filters/transformations
• Selective transcription
• Selective processing
• Translation
• Folding
DNA Functional protein
From Sequence to Organism
How does Nature do it?
Natural filters/transformations
DNA Functional protein
Simulation of Nature Surrogate Processes
From Sequence to Organism
How can WE do it?
From Sequence to Organism
How can WE do it?
Simulation of Nature
Utterance of W Shakespeare:
“Whether ’tis nobler in the mind to suffer the slings and arrows of outrageous fortune...”
Utterance of George W Bush:
“We must give our military every tool and weapon it needs to prevail...”
???
From Sequence to Organism
How can WE do it?
Surrogate Processes
Utterance of W Shakespeare:
“Whether ’tis nobler in the mind to suffer the slings and arrows of outrageous fortune...”
Utterance of George W Bush:
“We must give our military every tool and weapon it needs to prevail...”
Word frequency
From Sequence to Organism
How can WE do it?
Surrogate Processes
Utterance of W Shakespeare:
“Whether ’tis nobler in the mind to suffer the slings and arrows of outrageous fortune...”
Utterance of George W Bush:
“We must give our military every tool and weapon it needs to prevail...”
Word frequency, words/sentence…
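A surrogate process of this kind can be sketched as follows. The example is only an illustration under stated assumptions: the function names `profile` and `cosine_similarity` are invented here, the tokenizer is deliberately crude, and cosine similarity is one possible way to compare word-frequency profiles, not a method named on the slides.

```python
import math
import re
from collections import Counter


def profile(text):
    """Summarize a text by word counts and average words per sentence."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    return {
        "word_counts": Counter(words),
        "words_per_sentence": len(words) / max(len(sentences), 1),
    }


def cosine_similarity(counts_a, counts_b):
    """Cosine of the angle between two word-count vectors (0 = no shared words)."""
    dot = sum(counts_a[w] * counts_b[w] for w in set(counts_a) & set(counts_b))
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


shakespeare = profile("Whether 'tis nobler in the mind to suffer "
                      "the slings and arrows of outrageous fortune.")
bush = profile("We must give our military every tool and weapon "
               "it needs to prevail.")
print(cosine_similarity(shakespeare["word_counts"], bush["word_counts"]))
print(shakespeare["words_per_sentence"], bush["words_per_sentence"])
```

Nothing here models how either speaker thinks. The statistics merely stand in for the real process, which is the sense of "surrogate" used on these slides.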
From Sequence to Organism
How can WE do it?
Natural filters/transformations
• Selective transcription
• Selective processing
• Translation
• Folding/function
Surrogate filters
TCTACTTATA TTCAATCCAC AGGGCTACAC CTAGTTCTTG AAGAGTCTGT TGAATGAACA CATACATGGT TTATCTGTTT TTCTGTCTGC TCTGACCTCT GGCAGCTTTC CACTAGTTTC TGGATTTCGG AACTCTAGCC TGCCCCACTC
Characteristics of coding sequences/introns
• Gene finders
Predicted coding regions
My sequence
From Sequence to Organism
How can WE do it?
Natural filters/transformations
• Selective transcription
• Selective processing
• Translation
• Folding/function
Surrogate filters
• Gene finders
Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu...
Function?
From Sequence to Organism
How can WE do it?
Natural filters/transformations
• Selective transcription
• Selective processing
• Translation
• Folding/function
Surrogate filters
• Gene finders
• Similarity finders
My predicted gene
Sequence/motif databases
globin
globin?
Similar genes
Surrogate Filters
Gene finders
Start/Stop codon search
CTCCACGCCCCTCCGTACACCTCTAACATGATGTCAGCAAATATTAAAAATGAATAAACTTTGTGACATGTACAAATGGAAATATGCAA
CT CCA CGC CCC TCC GTA CAC CTC TAA CAT GAT CTC AGC AAA TAT TAA AAA TGA ATA AAC TTT GTG ACA TGT ACA AAT GGA AAT ATG CAA
CTC CAC GCC CCT CCG TAC ACC TCT AAC ATG ATC TCA GCA AAT ATT AAA AAT GAA TAA ACT TTG TGA CAT GTA CAA ATG GAA ATA TGC AA
C TCC ACG CCC CTC CGT ACA CCT CTA ACA TGA TCT CAG CAA ATA TTA AAA ATG AAT AAA CTT TGT GAC ATG TAC AAA TGG AAA TAT GCA A
Look for start codons (ATG; alternatively GTG, TTG)
Look for stop codons (TAA, TAG, TGA)
CTCCACGCCCCTCCGTACACCTCTAACATGATGTCAGCAAATATTAAAAATGAATAAACTTTGTGACATGTACAAATGGAAATATGCAA
TTGCATATTTCCATTTGTACATGTCACAAAGTTTATTCATTTTTAATATTTGCTGAGATCATGTTAGAGGTGTACGGAGGGGCGTGGAG
Surrogate Filters
Gene finders
Start/Stop codon search
Look for start codons (ATG; alternatively GTG, TTG)
Look for stop codons (TAA, TAG, TGA)
Highly inaccurate
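The start/stop codon search can be sketched as below, scanning the three forward reading frames and then the three frames of the reverse complement. This is a simplified illustration, not the slides' own tool: the function names, the `min_codons` cutoff, and the decision to accept only ATG as a start are assumptions made here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODONS = {"ATG"}  # assumption: GTG and TTG are ignored in this sketch
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=3):
    """Return (frame, start, end) for each start-to-stop stretch in the forward frames.

    `end` is exclusive and includes the stop codon.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((frame, start, i + 3))
                start = None
    return orfs


def six_frame_orfs(seq, min_codons=3):
    """Search both strands; coordinates on '-' refer to the reverse complement."""
    forward = [("+",) + orf for orf in find_orfs(seq, min_codons)]
    reverse = [("-",) + orf for orf in find_orfs(reverse_complement(seq), min_codons)]
    return forward + reverse


print(six_frame_orfs("CCATGAAATAAGG"))
```

Because any ATG followed in frame by a stop codon qualifies, short random stretches pass the test, which is consistent with the slide's verdict that the method is highly inaccurate.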
Surrogate Filters
Gene finders
Hidden Markov Model (HMM)-based recognition
Step 1: Create model through extensive training set
AAA AAC AAG AAT ACA . . . TTG TTT
Training set
AAGCTTGACCAAAAAGTTAAAACACTGACGGCAAATAATCAATGACTATCAGACAGAGAATCATCGTGCTGTCAGTAAAACCTCTGATTTCGATCTTTACCATAATTGTTATGTTGTAATGACTAACCAGACTATCTTTTACAGAGCTTCTGGTTAACACTTGTCTAATTAGACATTGATAATGTTTGTGGGGGTTGGTCATCAGGAATGGTAAATAGCAATTACCCTTCAGACTTTCCTATGAGACGCTCCGCCAACGAGCAGTGTCTCTTAAAGAACGTTATGAGCGCTCAGTTAACTTCAGAAATTCACGGCGGAAATCCATAGTTATTATTACTTATGACTAAAACAAAATTACTATGGCGGCTTGTTTAATATAGATTCTGTGTTCTGAGAAATGACTTTTAAAGTCCCACTAACTTTTTTCTCATCTATTGCTATATTTCGACTTTAAAACTTATAGTAGATGGCTTAATTCTCAAATAACAAACTCATTTTTAGTAGATATTTCATGCAAACTGAGGTTTTTAGTGATATTTTCCCCTTATTGAGTACAGCCACTCCACAAACCTTAGAATGGCTACTCAATATTGCAATTGATCATGAATATCCCACTGGTAGAGCAGTTTTAATGGAAGATGCCTGGGGTAATGCAGTTTATTTCGTTGTATCTGGATGGGTAAAAGTTCGGCGCACCTGTGGAGATGATTCGGTAGCTTT
Step 1: Create model through extensive training set
AAAA: 33%
AAAC: 25%
AAAG: 12%
AAAT: 30%
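Step 1 amounts to counting, for every three-nucleotide context, how often each nucleotide follows it in the training set. A minimal sketch follows, assuming a plain 3rd order Markov chain; the function name `train_markov` is invented here, and real gene finders add smoothing and separate models for coding and non-coding sequence.

```python
from collections import Counter, defaultdict


def train_markov(training_seqs, order=3):
    """Estimate P(next nucleotide | preceding `order` nucleotides) from known genes."""
    counts = defaultdict(Counter)
    for seq in training_seqs:
        for i in range(len(seq) - order):
            context = seq[i:i + order]
            counts[context][seq[i + order]] += 1

    model = {}
    for context, following in counts.items():
        total = sum(following.values())
        # One row of the table: relative frequency of A, C, G, T after this context.
        model[context] = {base: following[base] / total for base in "ACGT"}
    return model


# Toy training set (made up for illustration, far too small to be useful).
model = train_markov(["AAAAAAAC", "AAAGAAAT"])
print(model["AAA"])
```

Each row of the resulting dictionary corresponds to one group of four percentages on the slide, such as AAAA, AAAC, AAAG and AAAT.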
Surrogate Filters
Gene finders
Class 3: Hidden Markov Model (HMM)-based recognition
AAA AAC AAG AAT ACA . . . TTG TTT
Training set
AAGCTTGACCAAAAAGTTAAAACACTGACGGCAAATAATCAATGACTATCAGACAGAGAATCATCGTGCTGTCAGTAAAACCTCTGATTTCGATCTTTACCATAATTGTTATGTTGTAATGACTAACCAGACTATCTTTTACAGAGCTTCTGGTTAACACTTGTCTAATTAGACATTGATAATGTTTGTGGGGGTTGGTCATCAGGAATGGTAAATAGCAATTACCCTTCAGACTTTCCTATGAGACGCTCCGCCAACGAGCAGTGTCTCTTAAAGAACGTTATGAGCGCTCAGTTAACTTCAGAAATTCACGGCGGAAATCCATAGTTATTATTACTTATGACTAAAACAAAATTACTATGGCGGCTTGTTTAATATAGATTCTGTGTTCTGAGAAATGACTTTTAAAGTCCCACTAACTTTTTTCTCATCTATTGCTATATTTCGACTTTAAAACTTATAGTAGATGGCTTAATTCTCAAATAACAAACTCATTTTTAGTAGATATTTCATGCAAACTGAGGTTTTTAGTGATATTTTCCCCTTATTGAGTACAGCCACTCCACAAACCTTAGAATGGCTACTCAATATTGCAATTGATCATGAATATCCCACTGGTAGAGCAGTTTTAATGGAAGATGCCTGGGGTAATGCAGTTTATTTCGTTGTATCTGGATGGGTAAAAGTTCGGCGCACCTGTGGAGATGATTCGGTAGCTTT
Step 1: Create model through extensive training set
AACA: 30%
AACC: 20%
AACG: 15%
AACT: 35%
AAA AAC AAG AAT ACA . . . TTG TTT
Surrogate Filters
Gene finders
Class 3: Hidden Markov Model (HMM)-based recognition
Training set
AAGCTTGACCAAAAAGTTAAAACACTGACGGCAAATAATCAATGACTATCAGACAGAGAATCATCGTGCTGTCAGTAAAACCTCTGATTTCGATCTTTACCATAATTGTTATGTTGTAATGACTAACCAGACTATCTTTTACAGAGCTTCTGGTTAACACTTGTCTAATTAGACATTGATAATGTTTGTGGGGGTTGGTCATCAGGAATGGTAAATAGCAATTACCCTTCAGACTTTCCTATGAGACGCTCCGCCAACGAGCAGTGTCTCTTAAAGAACGTTATGAGCGCTCAGTTAACTTCAGAAATTCACGGCGGAAATCCATAGTTATTATTACTTATGACTAAAACAAAATTACTATGGCGGCTTGTTTAATATAGATTCTGTGTTCTGAGAAATGACTTTTAAAGTCCCACTAACTTTTTTCTCATCTATTGCTATATTTCGACTTTAAAACTTATAGTAGATGGCTTAATTCTCAAATAACAAACTCATTTTTAGTAGATATTTCATGCAAACTGAGGTTTTTAGTGATATTTTCCCCTTATTGAGTACAGCCACTCCACAAACCTTAGAATGGCTACTCAATATTGCAATTGATCATGAATATCCCACTGGTAGAGCAGTTTTAATGGAAGATGCCTGGGGTAATGCAGTTTATTTCGTTGTATCTGGATGGGTAAAAGTTCGGCGCACCTGTGGAGATGATTCGGTAGCTTT
Step 2: Assess candidate genes
0.12
       A     C     G     T
AAA   0.33  0.25  0.12  0.30
AAC   0.30  0.20  0.15  0.35
AAG   0.35  0.15  0.20  0.30
AAT   0.30  0.15  0.20  0.25
ACA   0.25  0.20  0.15  0.35
. . .
TTG   0.25  0.30  0.15  0.30
TTT   0.30  0.25  0.10  0.35
Candidate gene
AAAGCAA…
3rd order Markov model
Surrogate Filters
Gene finders
Class 3: Hidden Markov Model (HMM)-based recognition
Step 2: Assess candidate genes
AAAGCAA…
0.12 x 0.15
3rd order Markov model
Surrogate Filters
Gene finders
Class 3: Hidden Markov Model (HMM)-based recognition
       A     C     G     T
AAA   0.33  0.25  0.12  0.30
AAC   0.30  0.20  0.15  0.35
AAG   0.35  0.15  0.20  0.30
AAT   0.30  0.15  0.20  0.25
ACA   0.25  0.20  0.15  0.35
. . .
TTG   0.25  0.30  0.15  0.30
TTT   0.30  0.25  0.10  0.35
Candidate gene
Step 2: Assess candidate genes
AAAGCTA…
0.12 x 0.15 . . .
So far, not a good candidate!
3rd order Markov model
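Step 2 multiplies the table entries along the candidate sequence. The sketch below does this in log space to avoid underflow on long sequences, and compares the result with a uniform background of 0.25 per nucleotide. The background comparison, the `floor` value, and the function names are assumptions made for illustration; only the two table rows come from the slide.

```python
import math

# Two rows copied from the slide's table: P(next base | previous three bases).
MODEL = {
    "AAA": {"A": 0.33, "C": 0.25, "G": 0.12, "T": 0.30},
    "AAG": {"A": 0.35, "C": 0.15, "G": 0.20, "T": 0.30},
}


def log_score(candidate, model, order=3, floor=1e-6):
    """Sum of log P(base | preceding `order` bases) over the candidate."""
    total = 0.0
    for i in range(order, len(candidate)):
        context = candidate[i - order:i]
        # Unseen contexts or zero probabilities fall back to a small floor (assumption).
        p = model.get(context, {}).get(candidate[i], floor)
        total += math.log(max(p, floor))
    return total


def log_odds_vs_uniform(candidate, model, order=3):
    """Positive if the model explains the candidate better than random sequence."""
    steps = len(candidate) - order
    return log_score(candidate, model, order) - steps * math.log(0.25)


# First two steps of the slide's candidate: P(G|AAA) * P(C|AAG) = 0.12 x 0.15
print(math.exp(log_score("AAAGC", MODEL)))
print(log_odds_vs_uniform("AAAGC", MODEL))
```

The product 0.12 x 0.15 = 0.018 is smaller than the 0.0625 expected from two uniform steps, so the log-odds is negative, in line with the slide's "So far, not a good candidate!".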
Surrogate Filters
Gene finders
Class 3: Hidden Markov Model (HMM)-based recognition
       A     C     G     T
AAA   0.33  0.25  0.12  0.30
AAC   0.30  0.20  0.15  0.35
AAG   0.35  0.15  0.20  0.30
AAT   0.30  0.15  0.20  0.25
ACA   0.25  0.20  0.15  0.35
. . .
TTG   0.25  0.30  0.15  0.30
TTT   0.30  0.25  0.10  0.35
Candidate gene
Step 2: Assess candidate genes
3rd order Markov model
Surrogate Filters
Gene finders
Class 3: Hidden Markov Model (HMM)-based recognition
Candidate genes Predicted genes
Predicted genes
Step 2: Assess candidate genes
3rd order Markov model
Surrogate Filters
Gene finders
Class 3: Hidden Markov Model (HMM)-based recognition
Candidate genes Predicted genes
Conform to standard model
Challenge accepted beliefs
Computers are powerful
globin
Highly filtered output
• Easy to grasp
• High-level insights
Unfiltered output
• Confusing
• Basic insights
Computers are tempting
Globin
Computers are tempting
Crisis in Bioinformatics
1. Need high-level filters
2. Need access to raw phenomena
3. Need new tools for new phenomena
4. Need intuitive representation of results
5. Need ability to build new tools
Need a new generation
View of the Future
View of the Future
Integration of information
ATGACTTATGATCAACGCACAGGGCTA Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu...
Cell interaction
Metabolism, Architecture
Genetic code Rules of folding
Active site
Prochlorococcus MED4
Prochlorococcus MIT9313
How do cells control response to light?
What genes are related to the adaptation to high light?
Look for:
• Gene present in Prochlorococcus MED4 (MED4 is naturally adapted to grow in high light)
• Ortholog absent in Prochlorococcus MIT9313 (MIT9313 is naturally adapted to grow in low light)
• Ortholog present in Synechocystis PCC 6803 (reason will become apparent in a moment)
• Synechocystis PCC 6803 ortholog responds to high light (gene turns on by a factor > 2 in response to high light)
[Build set] [Display set] [HELP] [Set operation]
Click on Build set to begin finding orfs with the desired specifications.
All items in [Choose set type]
Choose set type: All open reading frames of | All amino acid sequences of | All intergenic regions of | Human-annotated orfs of | Private set | Public set
The goal is to find all open reading frames within Prochlorococcus MED4 that meet certain specifications, so click on All open reading frames of.
All items in All open reading frames of [Choose database]
Choose database: Arthrobacter platensis | Gloeobacter violaceus | Microcystis aeruginosa | Nostoc punctiforme | Nostoc PCC 7120 | Prochlorococcus MED4 | Prochlorococcus MIT9313 | Prochlorococcus S120 | Synechococcus PCC6301 | Synechococcus PCC7942 | Synechococcus WH | Synechocystis PCC 6803 | Thermosynechococcus | Trichodesmium | Unicellular | Filamentous | All
Click on Prochlorococcus MED4.
All items in All open reading frames of Prochlorococcus MED4 such that:
[Variable] [Data] [Operation] [Function] [Done]
You will ask that an ortholog of each desired MED4 gene exists in Synechocystis PCC 6803. It is convenient to define the ortholog now. Click the Variable button.
All items in All open reading frames of Prochlorococcus MED4 such that:
Variable: Item | New variable
Item refers to the MED4 orf under consideration. You want to define its ortholog in Synechocystis, so click on New variable.
All items in All open reading frames of Prochlorococcus MED4 such that:
6803 ortholog =
You can name the variable representing the ortholog anything you like. For this simulation, a name is provided. Press the Enter key.
All items in All open reading frames of Prochlorococcus MED4 such that:
6803 ortholog = [Choose function] (item
Choose function: Closest ortholog of | Protein product of | Upstream region of | Downstream region of | Ortholog of
One variable can be defined with respect to another in several ways. The relationship you want is Ortholog of.
All items in All open reading frames of Prochlorococcus MED4 such that:
6803 ortholog = Ortholog of (item in [Choose database])
Choose database: Arthrobacter platensis | Gloeobacter violaceus | Microcystis aeruginosa | Nostoc punctiforme | Nostoc PCC 7120 | Prochlorococcus MED4 | Prochlorococcus MIT9313 | Prochlorococcus S120 | Synechococcus PCC6301 | Synechococcus PCC7942 | Synechococcus WH | Synechocystis PCC 6803 | Thermosynechococcus | Trichodesmium
Clicking on Synechocystis PCC 6803 defines the variable 6803 ortholog as the ortholog in Synechocystis to a given orf of MED4.
All items in All open reading frames of Prochlorococcus MED4 such that:
6803 ortholog = Ortholog of (item in Synechocystis PCC 6803)
The first limitation on the MED4 orf is that no ortholog of it exists in MIT9313. To invoke the concept of ortholog, press the Function button.
All items in All open reading frames of Prochlorococcus MED4 such that:
6803 ortholog = Ortholog of (item in Synechocystis PCC 6803)
Choose function: Closest ortholog of | Protein product of | Upstream region of | Downstream region of | Ortholog of
Click on Ortholog of.
All items in All open reading frames of Prochlorococcus MED4 such that:
6803 ortholog = Ortholog of (item in Synechocystis PCC 6803)
Ortholog of ([Variable] in
Variable: Item | 6803 ortholog
As always, Item refers to the orf of MED4 that is being defined. You want to specify that an ortholog of it in MIT9313 doesn’t exist, so click on Item.
All items in All open reading frames of Prochlorococcus MED4 such that:
6803 ortholog = Ortholog of (item in Synechocystis PCC 6803)
Ortholog of (Item in [Choose database])
Choose database: Arthrobacter platensis | Gloeobacter violaceus | Microcystis aeruginosa | Nostoc punctiforme | Nostoc PCC 7120 | Prochlorococcus MED4 | Prochlorococcus MIT9313 | Prochlorococcus S120 | Synechococcus PCC6301 | Synechococcus PCC7942 | Synechococcus WH | Synechocystis PCC 6803 | Thermosynechococcus | Trichodesmium
Clicking on Prochlorococcus MIT9313 defines an ortholog of a MED4 gene in MIT9313 (if such an ortholog exists).
All items in All open reading frames of Prochlorococcus MED4 such that:
6803 ortholog = Ortholog of (item in Synechocystis PCC 6803)
Ortholog of (Item in Prochlorococcus MIT9313) [Op]
Op: exists | doesn’t exist
You want to keep only those MED4 genes where an ortholog in MIT9313 does NOT exist, so click on doesn’t exist.
All items in All open reading frames of Prochlorococcus MED4 such that:
6803 ortholog = Ortholog of (item in Synechocystis PCC 6803)
Ortholog of (Item in Prochlorococcus MIT9313) doesn’t exist
That completes one specification, but there are more. Click on the Operation button to connect one specification to the next.
All items in All open reading frames of Prochlorococcus MED4 such that:
6803 ortholog = Ortholog of (item in Synechocystis PCC 6803)
Ortholog of (Item in Prochlorococcus MIT9313) doesn’t exist [Op]
Op: AND | OR
You want both the first specification AND the second to be true, so click on AND.
All items in All open reading frames of Prochlorococcus MED4 such that:
6803 ortholog = Ortholog of (item in Synechocystis PCC 6803)
Ortholog of (Item in Prochlorococcus MIT9313) doesn’t exist AND [
The second specification is that microarray data for the 6803 ortholog meets a certain criterion. To get at that data, press the Data button.
All items in All open reading frames of Prochlorococcus MED4 such that:
6803 ortholog = Ortholog of (item in Synechocystis PCC 6803)
Ortholog of (Item in Prochlorococcus MIT9313) doesn’t exist AND [ data for ([Variable] in
Variable: Item | 6803 ortholog | New variable
The data you want is for the 6803 ortholog. Click on 6803 ortholog.
All items in All open reading frames of Prochlorococcus MED4 such that:
6803 ortholog = Ortholog of (item in Synechocystis PCC 6803)
Ortholog of (Item in Prochlorococcus MIT9313) doesn’t exist AND [ data for (6803 ortholog in [Choose data set])
Choose data set: Microarray:Hihara1(6803) | Microarray:Suzuki1(6803) | Microarray:Yoshimura1(6803) | Microarray:Meeks(Npun) | Microarray:Golden(7120)
Choose the Hihara experiment, which measured expression changes upon shift from low light to high light. If you didn’t know which experiment was appropriate, you could have clicked on Choose data set for a description of the choices.
High light vs low light experiment
All items in All open reading frames of Prochlorococcus MED4 such that:
6803 ortholog = Ortholog of (item in Synechocystis PCC 6803)
Ortholog of (Item in Prochlorococcus MIT9313) doesn’t exist AND [ data for (6803 ortholog in Microarray:Hihara1(6803)) [Op]
Op: < | < or = | = | > or = | >
You want the ratio of experimental condition to control to exceed a specified value. Click on >.
All items in All open reading frames of Prochlorococcus MED4 such that:
6803 ortholog = Ortholog of (item in Synechocystis PCC 6803)
Ortholog of (Item in Prochlorococcus MIT9313) doesn’t exist AND [ data for (6803 ortholog in Microarray:Hihara1(6803)) > +2 ]
You can type in the value you want. For this simulation a number is supplied. Press the Enter key.
All items in All open reading frames of Prochlorococcus MED4 such that:
6803 ortholog = Ortholog of (item in Synechocystis PCC 6803)
Ortholog of (Item in Prochlorococcus MIT9313) doesn’t exist AND [ data for (6803 ortholog in Microarray:Hihara1(6803)) > +2 ]
No more specifications. Press the Done button.
All items in All open reading frames of Prochlorococcus MED4 such that:
6803 ortholog = Ortholog of (item in Synechocystis PCC 6803)
Ortholog of (Item in Prochlorococcus MIT9313) doesn’t exist AND [ data for (6803 ortholog in Microarray:Hihara1(6803)) > +2 ]
Save results and script | Save only results
This was a complicated search. If you wanted to do it again, you could save the search description. In this case, just save the results by clicking on Save only results.
All items in All open reading frames of Prochlorococcus MED4 such that:
6803 ortholog = Ortholog of (item in Synechocystis PCC 6803)
Ortholog of (Item in Prochlorococcus MIT9313) doesn’t exist AND [ data for (6803 ortholog in Microarray:Hihara1(6803)) > +2 ]
Type name of set: Light-specific genes
All MED4 genes meeting the given specifications will be collected into a set. You can name the set anything you want. For this simulation, a name is provided. Press the Enter key.
Build set Display set
:all0687 hupL [NiFe] uptake hydrogenase large subunit, C terminus
:all0687 hupL [NiFe] uptake hydrogenase large subunit, N terminus
:all0688 hupS [NiFe] uptake hydrogenase small subunit
:alr0692 similar to nifU
:alr0874 nifH2 dinitrogenase reductase
:asr1309 similar to nifU
:alr1407 nifV1 homocitrate synthase
:asr1408 nifZ iron-sulfur cofactor synthesis
:asr1408 nifT
Set: Light-specific genes
ProcMed4:all0687 hupL [NiFe] uptake hydrogenase large subunit, C terminus
ProcMed4:all0687 hupL [NiFe] uptake hydrogenase large subunit, N terminus
ProcMed4:all0688 hupS [NiFe] uptake hydrogenase small subunit
ProcMed4:alr0692 similar to nifU
ProcMed4:alr0874 psbBX dinitrogenase reductase
ProcMed4:asr1309 similar to nifU
ProcMed4:alr1407 psbY1 homocitrate synthase
ProcMed4:asr1408 psbX iron-sulfur cofactor synthesis
ProcMed4:asr1408 nifT
• The results are displayed as a list of orfs. (Of course, the search capabilities do not now exist, and the results of the described search are unknown.)
• Clicking on the name of any orf brings you to its page (see Scenarios 1 and 2).
• Clicking on circles next to the orf names allows you to modify the set.
• The genetic neighborhood of each orf is shown to the right.
[WARNING: Fantasy filtration not in effect!]
Prochlorococcus MED4: pll1290
Replicon: Chromosome
Coordinates: 1533026 (stop) <- 1533931 (start-TTG) [Human]
Length = 301 amino acids
Strand: Complementary
Gene name(s): proXM
Function: Putative type II DNA cytosine methyltransferase (CAGCTG-specific) [Human]
Classification: Type II beta (N4) [Human]
Activity: Protects against: PvuII [Experiment]
In vivo activity: exists [Experiment]
Cyanobacterial orthologs: none
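The length on this page can be checked against its coordinates. The following is a minimal sketch, assuming the two coordinates are inclusive and that the span includes the stop codon; the slide states neither, and `protein_length` is a hypothetical helper name, not part of any described interface.

```python
def protein_length(start: int, stop: int, includes_stop_codon: bool = True) -> int:
    """Return the number of amino acids encoded between two inclusive coordinates.

    Assumption (not stated in the source): the span runs from the first base of
    the start codon to the last base of the stop codon, on either strand.
    """
    span = abs(start - stop) + 1          # inclusive span in nucleotides
    if span % 3 != 0:
        raise ValueError("span is not a whole number of codons")
    codons = span // 3
    # The stop codon does not encode an amino acid.
    return codons - 1 if includes_stop_codon else codons


# Coordinates taken from the pll1290 page above.
print(protein_length(1533931, 1533026))   # 301
```

Under these assumptions the 906-nucleotide span gives 302 codons, or 301 amino acids, which agrees with the length shown.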
ProcMED4
Proteus vulgaris
Salmonella paratyphi
Streptomyces spectabilis
Controls: Options, Annotate, Main Menu, History, More, HELP
[WARNING: Fantasy filtration not in effect!]
[Query-building form]
All items in All open reading frames of Prochlorococcus MED4 such that:
Menus: Choose set type, Choose database
Controls: Variable, Data, Build set, Display set
This was a complicated search. If you wanted to do it again, you could save the
search description. In this case, just save the results by clicking on Save only results.
[Query-building form repeated with the same variable and conditions, now offering two save options]
Save results and script
Save only results
Equivalent script that bypasses interface
FOR orf IN (orfs:ProcMED4) {
    6803ortholog = Ortholog(orf,orfs:Syny6803);
    WHEN (NOT Exists(Ortholog(orf,orfs:Proc9313))
          AND Data(6803ortholog,microarray:Hihara1) > +2) {
        COLLECT orf INTO light_specific_genes;
    }
}
DISPLAY (light_specific_genes, “BNC”);
or
MAIL (light_specific_genes,[email protected],“BNC”);
The same search could have been conducted through the script shown above. The script interface makes possible complex
searches beyond the scope of the graphical interface.
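The logic of the script can also be sketched in an ordinary programming language. The example below is only an illustration in Python. The tables `ORTHOLOGS` and `MICROARRAY`, the orf names, and the expression values are all invented stand-ins for the knowledge base, and skipping orfs that lack a 6803 ortholog is an assumption the slide does not address.

```python
from typing import Dict, List, Optional

# Hypothetical toy data; none of these identifiers or values are real.
# Each ProcMED4 orf maps to its ortholog in two other genomes (None = no ortholog).
ORTHOLOGS: Dict[str, Dict[str, Optional[str]]] = {
    "orfA": {"Syny6803": "s1", "Proc9313": None},
    "orfB": {"Syny6803": "s2", "Proc9313": "p2"},
    "orfC": {"Syny6803": "s3", "Proc9313": None},
    "orfD": {"Syny6803": None, "Proc9313": None},
}
# Hypothetical expression values for 6803 genes, standing in for the microarray data set.
MICROARRAY: Dict[str, float] = {"s1": 3.1, "s2": 4.0, "s3": 0.5}


def ortholog(orf: str, genome: str) -> Optional[str]:
    """Look up the ortholog of an orf in another genome, or None if absent."""
    return ORTHOLOGS[orf][genome]


def light_specific(orfs: List[str], threshold: float = 2.0) -> List[str]:
    """Collect orfs with no Proc9313 ortholog whose 6803 ortholog exceeds the threshold."""
    collected = []
    for orf in orfs:
        syn = ortholog(orf, "Syny6803")
        if ortholog(orf, "Proc9313") is not None:
            continue                      # NOT Exists(...) fails
        if syn is None or syn not in MICROARRAY:
            continue                      # assumption: no ortholog or no data means no match
        if MICROARRAY[syn] > threshold:
            collected.append(orf)         # COLLECT orf INTO the result set
    return collected


print(light_specific(list(ORTHOLOGS)))    # ['orfA']
```

Only orfA passes both conditions in this toy data: orfB has a 9313 ortholog, orfC falls below the threshold, and orfD has no 6803 ortholog.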
[Query-building form]
All items in All open reading frames of Prochlorococcus MED4 such that:
Controls: Operation, Function, Done, Cancel, Set operation, HELP
???
Cyanobacterial Knowledge Base
Virtual Help Desk
How to search for data?
How to build a new filter?
Cyanobacterial Knowledge Base
Virtual Help Desk
How to... I don’t know!
Virtual Help Desk Staff
HELP
Cyanobacterial Knowledge Base
Virtual Help Desk
Upper echelons Staff
You
Virtual Help Desk Staff
HELP
Billions and Billions of Bases
How does a biologist maintain a grip on sanity? reality?
View of the Future
Interplay of low- & high-level perception
ProcMED4
Proteus vulgaris
Salmonella paratyphi
Streptomyces spectabilis
View of the Future
Interplay of low- & high-level perception
Anab7120
Proteus vulgaris
Salmonella paratyphi
Streptomyces spectabilis
TCTACTTATATTCAATCCACAGGGCTACACCTAGTTCTTGAAGAGTCTGTTGAATGAACACATACATGGTTTATCTGTTTTTCTGTCTGCTCTGACCTCTGGCAGCTT
TAGCCTGCCCCACTCTTAGATAAACGAACCTTAGTGACTTCTGCTATACCAAAGTCTCCACGCCCCTCCGTAAACCTCTAACATGATGTCAGCAAATATTAAAAATGA
97%
TCTACTTATA TTCAATCCAC AGGGCTACAC CTAGTTCTTG AAGAGTCTGT TGAATGAACA CATACATGGT TTATCTGTTT TTCTGTCTGC TCTGACCTCT GGCAGCTTTC CACTAGTTTC TGGATTTCGG AACTCTAGCC TGCCCCACTC TTAGATAAAC GAACCTTAGT GACTTCTGCT ATACCAAAGT CTCCACGCCC CTCCGTAAAC CTCTAACATG ATGTCAGCAA ATATTAAAAA TGAATAAACT TTGTTAAAGG TACAAATGAA AATTAGCAAA AAGAGTTTAA AGTTAAAAAC GAATTGCAGT CATTCTAGGG AAACCTGTAT GGTTACATGA ACTGCCTAAA AAACAAGCTA TTATATATTT TAAGAAATTA ATTGCAATTA ATTTCCTGGG CCCCAGCTGT CATTAAAAAG AGGCAAATAC AGCCAAGGAC GACAGCACTG ACCCTCAAGA AGGCACCGGC TGACAGACAG GCTGAAATTC CGCTGAGAGC AGAGTGGTAC ATTGAACCCT CCCTGCACCA GGTCTTTCCT GTGGGCACTG AGTGCAGACA ATGAATGACT GAACGAACGA TTGAATGAAA AGAAATGAGA
Anabaena Chromosome (6413771 bp): 4001 to 5000
cgcccaacaataacaaatgtgtaatctagaccttctgccttgagttccttggcgcggttttcggcacgacggatgacgttggtattgtaaccgccgcacaaaccacgatcgccagaaataactagcaagcctactgatttaacttcccgttttttcagtagaggtaagtctacatcttcaaaccgtagacgagtttgcaaaccgtataatacttgtgccaaacggtcagcaaaaggacgagtagcgattacttgttcttgggcgcgacgtacacgcgccgccgctaccagccgcatggcttctgtgattttcttggtgtttttgaccgactgaatgcgatcgcgtattgatttgagattaggcataatatttgttgattgtcagttgtcagttgtcagttgtcagttgtcagtgtctattgctactgaccactgaccaatgactaatgactaattacgctgtagctttgaaggtctttttgtagtcttctaaagctgccttcaatgctttttcttcatcatcacccagtgctttcttcgattgtacgtcttggaagtaggggttaacgccggacttcaagtaatctctcaagcctttggtgaaggtggtgactttatcaacagggatatcatctaagtaaccgttgatacctgcgtacagaatggctacttgttcagctacggatagaggctgattttgggactgtttgaggagttcccgcaggcgttgacctcttgccaattggtcttgggtggctttatctaggtcggaagcaaattgcgcgaaggcttggaggtcgtcaaactgtgctagttcgagcttaatcttaccagcaacttttttcatcgctttggtttgtgccgcagaacccacacgggatacagagataccagggtttacagccggacgaataccagcgttaaataagtcagaagataagaatatctgaccgtctgtaatagaaattacgttggtaggaatgtaggcagaaacgtcacca
Typical output of current programs
Future: Sequence plus genetic context
Noncoding region
Future: Both filtered and raw data
Filters: Information reducers
Build filter to find repeated sequences
TCTACTTATA TTCAATCCAC AGGGCTACAC CTAGTTCTTG AAGAGTCTGT TGAATGAACA CATACATGGT TTATCTGTTT TTCTGTCTGC TCTGACCTCT GGCAGCTTTC CACTAGTTTC TGGATTTCGG AACTCTAGCC TGCCCCACTC TTAGATAAAC GAACCTTAGT GACTTCTGCT ATACCAAAGT CTCCACGCCC CTCCGTAAAC CTCTAACATG ATGTCAGCAA ATATTAAAAA TGAATAAACT TTGTTAAAGG TACAAATGAA AATTAGCAAA AAGAGTTTAA AGTTAAAAAC GAATTGCAGT CATTCTAGGG AAACCTGTAT GGTTACATGA ACTGCCTAAA AAACAAGCTA TTATATATTT TAAGAAATTA ATTGCAATTA ATTTCCTGGG CCCCAGCTGT CATTAAAAAG AGGCAAATAC AGCCAAGGAC GACAGCACTG ACCCTCAAGA AGGCACCGGC TGACAGACAG GCTGAAATTC CGCTGAGAGC AGAGTGGTAC ATTGAACCCT CCCTGCACCA GGTCTTTCCT GTGGGCACTG AGTGCAGACA ATGAATGACT GAACGAACGA TTGAATGAAA AGAAATGAGA TATGAGGCAA TCACAGCATC AGGTGACCTT AGTATCTATT CTCGGGAGCG CACGGCTCTA AAGAGGCCCA TATCCAGGCA CCTTTAGATG CAAGAAGGAG GAAACAGCTC GAAATCCCTG AGGCCGGAGG GTCAAGAACT CTCCACCGGC GGCAGCGGCC CCCCGGCCTA AGGCTGCCTG TGCTATAAAT ACGCGGCCCA TTCCCTGGGC TCGGCGGGAC AGATAACATG AATGTGCCCT
TGGTCTCCGACCGACCGTAGGTCATCGTGGTCTCCGACCGACCGTAGGTCATCG
CTTGTACTGAGCGAAGTCGAAGTACTTGTACTGAGCGAAGTCGAAGTACTTGTACTGAGCGTAGCCGAAGTAGTTCGACTGAGCGTAGTCGAAGTC
...
Repeat filter
Entire genome Repeated sequences
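One simple way to picture a repeat filter is as a count of every fixed-length subsequence, keeping only those seen more than once. The sketch below makes that concrete. It assumes exact matches of a single chosen length `k`, which is a simplification introduced here; the slides do not say how the actual filter works, and `repeated_kmers` is a hypothetical name.

```python
from collections import Counter
from typing import Dict


def repeated_kmers(genome: str, k: int, min_copies: int = 2) -> Dict[str, int]:
    """Return every length-k subsequence that occurs at least min_copies times.

    Assumption: exact, same-strand matches only; overlapping occurrences count.
    """
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    return {kmer: n for kmer, n in counts.items() if n >= min_copies}


# The first repeated-sequence example above is a 27-base unit occurring twice.
unit = "TGGTCTCCGACCGACCGTAGGTCATCG"
print(repeated_kmers(unit * 2, k=len(unit)))
```

Run on the doubled 27-base unit shown above, the filter reduces 54 bases to a single entry with a copy number of 2.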
Filters: Information reducers
Build repeats filter
NIS-1: repeat family
Alignment of NIS-1
(…271 more)
Filters: Information reducers
Build secondary repeats filter
Subfamily A (Copy number = 10): CTTGTACTGAGCGAAGTCGAAGTA
Subfamily B (Copy number = 2): CTTGTACTGAGCGTAGCCGAAGTA
Subfamily C (Copy number = 1): GTTCGACTGAGCGTAGTCGAAGTC
A vs B: Distance = 2
A vs C: Distance = 5
B vs C: Distance = 5
Do for all pairs of subfamilies
Diameter: Copies of exact repeats
Distance: Number of mismatches
Relationship between related repeats in genome (sequences within NIS-1 repeat family)
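The distances among the three subfamilies can be reproduced by counting mismatched positions. The sketch below assumes the sequences are compared position by position without gaps, which is possible here because all three are 24 bases long. The function names are illustrative only.

```python
from itertools import combinations
from typing import Dict, Tuple

# Subfamily sequences as shown on the slides.
SUBFAMILIES: Dict[str, str] = {
    "A": "CTTGTACTGAGCGAAGTCGAAGTA",
    "B": "CTTGTACTGAGCGTAGCCGAAGTA",
    "C": "GTTCGACTGAGCGTAGTCGAAGTC",
}


def mismatches(x: str, y: str) -> int:
    """Count positions at which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(x, y))


def pairwise_distances(seqs: Dict[str, str]) -> Dict[Tuple[str, str], int]:
    """Compute the mismatch distance for every pair of named sequences."""
    return {
        (m, n): mismatches(seqs[m], seqs[n])
        for m, n in combinations(sorted(seqs), 2)
    }


print(pairwise_distances(SUBFAMILIES))
# {('A', 'B'): 2, ('A', 'C'): 5, ('B', 'C'): 5}
```

The output matches the distances on the slides: 2 between A and B, and 5 for each pair involving C.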
Crisis in Bioinformatics
1. Need high-level filters
2. Need access to raw phenomena
Integrated knowledge base
3. Need new tools for new phenomena
4. Need intuitive representation of results
Tools that bridge levels of perception
5. Need ability to build new tools
Short term: Graphical programming, Human help
Long term: Need a new generation
Billions and Billions of Bases
How does a biologist maintain a grip on reality?
Filtering reality Raw reality
Real questions with real answers
Pre-genomic Molecular Biology
How do we figure out how cars are made?
Genetic approach Biochemical approach
Pre-genomic Molecular Biology
Geneticist’s Approach
Isolation of Defective Gene
Pre-genomic Molecular Biology
Biochemist’s Approach
Pre-genomic Molecular Biology
How do we figure out how cars are made?
Genetic approach Biochemical approach
• One component at a time
• Highly filtered perception
• Many local viewpoints
Pre-genomic Molecular Biology
How we viewed the world
Post-genomic Molecular Biology
Bioinformaticist’s Approach (long term)
Assemble the whole
Post-genomic Molecular Biology
Bioinformaticist’s Approach (short term)
Identify critical parts
Globin
Current Biology
AATAAAGCTTTACAAACCAAACTCTGGCTTCAATTGTGTAACCCAAGCTTTGATTCTTTCCTCTGTTAAATCGGATTGATTATCTTCATCAAGGGCAAGACCTACAAATTTACCATCACGAACAGCTTTAGACTCACTGAATTCATAACCTTCTGTAGGCCAATAGCCAACTGTTTCACCACCATTTTCTGAAATTTTTTCCTCTAGAATACCGAGGGCATCTTGAAATGTATCAGGATAACCAACCTGGTCTCCAGGAGCAAAATAAGCAACTTTTTTGCCGATGAAGTCAATGTTATCTAACTCATCATAAAAATTTTCCCAATCACTTTGCAATTCTCCAACATTCCAGGTAGGACAACCAACAACGATATAATCGTAGTTATTGAAATCACTTGGTTCAGCTTGTGAAATATCATATAAAGTTACAACACTATCACCACCAAACTCCTTCTGAATTATTTCTGATTCAGTTTGGGTATTGCCTGTTTGAGTACCAAAAAATAAACCAATATTAGACATTTTTACTCCTTTTATGTATTTGCAAAATTATTTCAATTAAAATATTTAGTAATAATTAATTGTTAGCTAGCTAATAATTAAATTTTTATTACAATCATTGTAAAAGGCATTGAAAAAGTAAATAAAAATTTTTATTCTACGTTATTTCAAAAATATTTACTTACATATACTTAACCTTTATAGTGATGTAATATACTCTAATTCCTATTTTACTTATAAATACCATCTCAGCTTAATGTAACGAATTTTTCTGTTTATCTTTAAATACAAAAAATTCAACAAAACTACAGAAAATTAATCTTAATAACACAAAACAAGTATCAATCTGTAATACAACTAAGCTTAAATAAATTAATAGAAAGCTTCATCTATCTAATAGGTTGAGAATAGTTTATGTCTAATGACATAAATTCATTCGTGTTGATTTCATTTGGGTATATTCATCTGATTTAGGATTTACTCCATTAAGTTTGTACTCATCAATGCCCGCCTGTTGGTATCCACAATTCTCATACAGTGCGCGAGCAAAGTAATCAATCGTTCGTCGCCATATCTAACTTTGAGTCAAACAAACCAGTTGGATTACCAACCCTCAACTAATCGCTTCTTTAAGGCGAGCGATCGCACATTTAACTGTTGGTTGTCACAAGAGAACTAATACTACAGCAGTATATTTAACAACTAAGGGTGGTTCAACTTTCGCTGCGACTCCTCCAACGCGCTGAAATACACAGGACTGATGCGATCGCAAACTCTTTGACTAAATTCCATACATTATCATGACCATCTCCCAAACAAACAAGTGGGTTAACCAGATGCTGACTATTAACATCCCCTGAGTTCGGAGTTGTAGGTCTATTTGACTGGTTCAAAGCGATGATGGAACGGCTTTGTTGCATGAATTAAAAAAAGACACACCATCACCTACTTCTAGGATAGACACATCAAACGTCCCACCGCCTAAGTCAAATACCAAGATAATTTCGTTAGTTTTCTTGTCAAGTCCGTAAGCGAGGGCCGCCGCCGTGGGCTAGTTGATAATTCGCAGAACTTTAATCCCGGCAATTCTACTGGCATCTTTGGTAGCCTGCCGTTGAGAGTCATTGAAATAGGCAGGGGTGGTAATTACCGCTTGCCTCACTGGTTCCCCCAGATATGTGCTGGCATCATCTATCAGCTTGCGGACTACCTCATACCATTTCACGAAAAACCTGATACACATGTAAACTCTGAAACCCTTGCTGTATCAAAGTTTTGTAATTACGAATTACGAATTACGAATTGATATCAGCCGAGATTTCTTCGGGTGAAAATTCCTTGTTCAGAGCGGGACAGTGTAGCTTGACATTGCCATTACTGTCACGTACCACTTTGTAAGTAACTTGTTTTGCCTCTTGCGTAACTTCATCATACCTGCGCCCGATGAACCGCTTCACAGAATAAAAAGTGTTTTCTGGGTTCATTACACCCTGGCGCTT
Future Biology
AATAAAGCTTTACAAACCAAACTCTGGCTTCAATTGTGTAACCCAAGCTTTGATTCTTTCCTCTGTTAAATCGGATTGATTATCTTCATCAAGGGCAAGACCTACAAATTTACCATCACGAACAGCTTTAGACTCACTGAATTCATAACCTTCTGTAGGCCAATAGCCAACTGTTTCACCACCATTTTCTGAAATTTTTTCCTCTAGAATACCGAGGGCATCTTGAAATGTATCAGGATAACCAACCTGGTCTCCAGGAGCAAAATAAGCAACTTTTTTGCCGATGAAGTCAATGTTATCTAACTCATCATAAAAATTTTCCCAATCACTTTGCAATTCTCCAACATTCCAGGTAGGACAACCAACAACGATATAATCGTAGTTATTGAAATCACTTGGTTCAGCTTGTGAAATATCATATAAAGTTACAACACTATCACCACCAAACTCCTTCTGAATTATTTCTGATTCAGTTTGGGTATTGCCTGTTTGAGTACCAAAAAATAAACCAATATTAGACATTTTTACTCCTTTTATGTATTTGCAAAATTATTTCAATTAAAATATTTAGTAATAATTAATTGTTAGCTAGCTAATAATTAAATTTTTATTACAATCATTGTAAAAGGCATTGAAAAAGTAAATAAAAATTTTTATTCTACGTTATTTCAAAAATATTTACTTACATATACTTAACCTTTATAGTGATGTAATATACTCTAATTCCTATTTTACTTATAAATACCATCTCAGCTTAATGTAACGAATTTTTCTGTTTATCTTTAAATACAAAAAATTCAACAAAACTACAGAAAATTAATCTTAATAACACAAAACAAGTATCAATCTGTAATACAACTAAGCTTAAATAAATTAATAGAAAGCTTCATCTATCTAATAGGTTGAGAATAGTTTATGTCTAATGACATAAATTCATTCGTGTTGATTTCATTTGGGTATATTCATCTGATTTAGGATTTACTCCATTAAGTTTGTACTCATCAATGCCCGCCTGTTGGTATCCACAATTCTCATACAGTGCGCGAGCAAAGTAATCAATCGTTCGTCGCCATATCTAACTTTGAGTCAAACAAACCAGTTGGATTACCAACCCTCAACTAATCGCTTCTTTAAGGCGAGCGATCGCACATTTAACTGTTGGTTGTCACAAGAGAACTAATACTACAGCAGTATATTTAACAACTAAGGGTGGTTCAACTTTCGCTGCGACTCCTCCAACGCGCTGAAATACACAGGACTGATGCGATCGCAAACTCTTTGACTAAATTCCATACATTATCATGACCATCTCCCAAACAAACAAGTGGGTTAACCAGATGCTGACTATTAACATCCCCTGAGTTCGGAGTTGTAGGTCTATTTGACTGGTTCAAAGCGATGATGGAACGGCTTTGTTGCATGAATTAAAAAAAGACACACCATCACCTACTTCTAGGATAGACACATCAAACGTCCCACCGCCTAAGTCAAATACCAAGATAATTTCGTTAGTTTTCTTGTCAAGTCCGTAAGCGAGGGCCGCCGCCGTGGGCTAGTTGATAATTCGCAGAACTTTAATCCCGGCAATTCTACTGGCATCTTTGGTAGCCTGCCGTTGAGAGTCATTGAAATAGGCAGGGGTGGTAATTACCGCTTGCCTCACTGGTTCCCCCAGATATGTGCTGGCATCATCTATCAGCTTGCGGACTACCTCATACCATTTCACGAAAAACCTGATACACATGTAAACTCTGAAACCCTTGCTGTATCAAAGTTTTGTAATTACGAATTACGAATTACGAATTGATATCAGCCGAGATTTCTTCGGGTGAAAATTCCTTGTTCAGAGCGGGACAGTGTAGCTTGACATTGCCATTACTGTCACGTACCACTTTGTAAGTAACTTGTTTTGCCTCTTGCGTAACTTCATCATACCTGCGCCCGATGAACCGCTTCACAGAATAAAAAGTGTTTTCTGGGTTCATTACACCCTGGCGCTT
Future Biology
Globin
TCTACTTATA TTCAATCCAC AGGGCTACAC CTAGTTCTTG AAGAGTCTGT TGAATGAACA CATACATGGT TTATCTGTTT TTCTGTCTGC TCTGACCTCT GGCAGCTTTC CACTAGTTTC TGGATTTCGG AACTCTAGCC TGCCCCACTC TTAGATAAAC GAACCTTAGT GACTTCTGCT ATACCAAAGT CTCCACGCCC CTCCGTAAAC CTCTAACATG ATGTCAGCAA ATATTAAAAA TGAATAAACT TTGTTAAAGG TACAAATGAA AATTAGCAAA AAGAGTTTAA AGTTAAAAAC GAATTGCAGT CATTCTAGGG AAACCTGTAT GGTTACATGA ACTGCCTAAA AAACAAGCTA TTATATATTT TAAGAAATTA ATTGCAATTA ATTTCCTGGG CCCCAGCTGT CATTAAAAAG AGGCAAATAC AGCCAAGGAC GACAGCACTG ACCCTCAAGA AGGCACCGGC TGACAGACAG GCTGAAATTC CGCTGAGAGC AGAGTGGTAC ATTGAACCCT CCCTGCACCA GGTCTTTCCT GTGGGCACTG AGTGCAGACA ATGAATGACT GAACGAACGA TTGAATGAAA AGAAATGAGA TATGAGGCAA TCACAGCATC AGGTGACCTT AGTATCTATT CTCGGGAGCG CACGGCTCTA AAGAGGCCCA TATCCAGGCA CCTTTAGATG CAAGAAGGAG GAAACAGCTC GAAATCCCTG AGGCCGGAGG GTCAAGAACT CTCCACCGGC GGCAGCGGCC CCCCGGCCTA AGGCTGCCTG TGCTATAAAT ACGCGGCCCA TTCCCTGGGC TCGGCGGGAC AGATAACATG AATGTGCCCT
Current Biology
Current Life
“Axis of Evil...”
Current Life
“No war for oil...”
Globin
Current Life
Contact Information
Jeff Elhai
Department of Biology
Virginia Commonwealth University
Richmond, VA
E-Mail: [email protected]: 804-828-0794
Web: www.people.vcu.edu/~elhaij/