What is Bioinformatics?
The Data
The Analysis
Comparison
Evolution
Long Distance: Comparative Genomics
Short Distance: Variation Analysis
Homology
Non-homology
Physical/Chemical/Statistical Mathematical Modelling
The Data & its growth.
1976/79 The first viral genome – MS2/ΦX174
1995 The first prokaryotic genome – H. influenzae
1996 The first unicellular eukaryotic genome - Yeast
1997 The first multicellular eukaryotic genome – C.elegans
2001 The human genome 3Gb
1.5.03: Known
>1000 viral genomes
96 prokaryotic genomes
16 Archaebacterial genomes
A series of multicellular genomes is coming.
A general increase in data involving higher structures and dynamics of biological systems
Genomes & Tree of Life
•3.5-3.8 Gyr Origin of Life
•3+ Gyr LUCA
•~1.4 Gyr Origin of Eukaryotes
•500-600 Myr Origin of Vertebrates
•200+ Myr Origin of Mammals
•80-100 Myr Mouse Mammalian Split
•5-7 Myr Chimp-Human Split
•100 Kyr – Myr Age of Polymorphisms
From Janssen, 2003
Comparison of Evolutionary Objects.
Sequences
ACTGT
ACTCCT
RNA (Secondary) Structure
Protein Structure
Renin
HIV proteinase
Gene Order/Orientation.
[Figure: gene order and orientation compared between Cabbage and Turnip]
Gene Structure
Interaction Networks
Any Graph.
General Theme.
Formal Model of Structure
Stochastic Model of Structure Evolution.
The Phylogeny for Evolutionary Objects
[Figure: a phylogeny with observable objects at the tips, an unobservable evolutionary path between them, and the MRCA (Most Recent Common Ancestor) marked "?"]
Parameters: time, rates, selection
3 Problems:
i. Test all possible relationships.
ii. Examine unknown internal states.
iii. Explore unknown paths between states at nodes.
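Problem i is demanding because the number of candidate relationships grows very quickly with the number of sequences. A minimal sketch, assuming the relationships are unrooted, fully resolved (binary) trees with labelled tips, for which the count is (2n-5)!!. The function name is illustrative and not taken from the slides.

```python
def num_unrooted_binary_trees(n_leaves: int) -> int:
    """Count unrooted binary tree topologies on n labelled leaves: (2n-5)!!.

    Assumption: trees are fully resolved and leaves are distinguishable.
    """
    if n_leaves < 3:
        return 1  # one or two leaves admit a single topology
    count = 1
    # Multiply the odd numbers 3 * 5 * ... * (2n - 5).
    for k in range(3, 2 * n_leaves - 4, 2):
        count *= k
    return count


if __name__ == "__main__":
    for n in (4, 5, 10, 20):
        print(n, num_unrooted_binary_trees(n))
```

Already at ten sequences there are about two million topologies, which is why exhaustive testing is replaced by heuristic or sampling-based search in practice.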
ATTGCGTATATAT….CAG ATTGCGTATATAT….CAG ATTGCGTATATAT….CAG
Time Direction
Gene and Genome Evolution
$P_t(C,G) = \tfrac{1}{4}\left(1 - e^{-4t}\right)$
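The substitution probability on this slide can be evaluated numerically. A minimal sketch, assuming the Jukes-Cantor model, in which every nucleotide changes to each of the three others at the same rate. The parameter `alpha` and the function name are assumptions introduced for illustration; `alpha = 1` gives the exponent -4t.

```python
import math

NUCLEOTIDES = "ACGT"


def jc_prob(a: str, b: str, t: float, alpha: float = 1.0) -> float:
    """Jukes-Cantor probability that nucleotide a is b after time t.

    alpha is a hypothetical per-pair substitution rate (not from the slides).
    """
    decay = math.exp(-4.0 * alpha * t)
    if a == b:
        return 0.25 + 0.75 * decay
    return 0.25 * (1.0 - decay)


if __name__ == "__main__":
    for t in (0.0, 0.1, 1.0, 10.0):
        print(t, jc_prob("C", "G", t))
```

The probability of seeing G where there was a C starts at 0 and approaches 1/4 as t grows, at which point the sequences carry no information about their common origin.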
Higher Cells
Chimp Mouse Fish E.coli
TGCGTATC TGTGTATA
Basic Events
• substitutions.
• insertion deletions.
• Chromosome Level events: inversions, duplications, transpositions,..
Average Number of Mitoses
•Per Male generation (15:35 .. 20:150)
•Per Female generation: ~24
• Single nucleotide substitutions: ~10^-7
• Microsatellites (~100.000): ~10^-2
• Small insertion deletions: ~10^-8
Principles of String Comparison: Alignment
ACTGT
ACTCCT
ACTGCT ACTCGT
ACTGT
ACTCCT
ACT-GT
ACTCCT
ACTG-T
ACTCCT
Cost 2 Probability: e^-16.47
.41 .41
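The cost of 2 for aligning ACTGT with ACTCCT can be checked with a standard dynamic-programming recursion. A minimal sketch, assuming unit costs for a substitution and for a single-letter insertion or deletion (the slide does not state its cost scheme); the function name is illustrative.

```python
def edit_distance(s: str, t: str) -> int:
    """Minimum number of substitutions and single-letter indels turning s into t.

    Assumption: every event costs 1.
    """
    # prev[j] holds the cost of aligning the processed prefix of s with t[:j].
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]  # aligning s[:i] with the empty string needs i deletions
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a
                curr[j - 1] + 1,         # insert b
                prev[j - 1] + (a != b),  # match or substitute
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("ACTGT", "ACTCCT"))
```

Both alignments shown on the slide (ACT-GT and ACTG-T against ACTCCT) use one gap and one substitution, so they tie at this minimum cost.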
Human alpha hemoglobin; Human beta hemoglobin; Human myoglobin; Bean leghemoglobin
Probability of data: e^-1560.138
Probability of data and alignment: e^-1593.223
Probability of alignment given data: 4.279 * 10^-15 = e^-33.085
Ratio of insertion-deletions to substitutions: 0.0334
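The three probabilities on this slide are linked by P(alignment | data) = P(data and alignment) / P(data), which becomes a subtraction on the log scale. A short check of that arithmetic, using only the numbers from the slide; the variable names are illustrative.

```python
import math

# Natural-log probabilities as given on the slide.
log_p_data = -1560.138
log_p_data_and_alignment = -1593.223

# Conditioning on the data is a division, i.e. a subtraction of logs.
log_p_alignment_given_data = log_p_data_and_alignment - log_p_data
p_alignment_given_data = math.exp(log_p_alignment_given_data)

if __name__ == "__main__":
    print(log_p_alignment_given_data, p_alignment_given_data)
```

Working in logs is necessary here: e^-1560 is far below the smallest positive double-precision number, so the raw probabilities cannot be represented directly.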
Maximum likelihood phylogeny and alignment
Gerton Lunter
Istvan Miklos
Alexei Drummond
Yun Song
Rooting using irreversibility (Lunter)
Lunter and Hein, ISMB2004
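Reversibility: P(a) * P(a -> b) = P(b) * P(b -> a)
The Pulley Principle: [Figure: under reversibility, trees that differ only in the position of the root have equal likelihood]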
Contagious Dependence
CG avoidance creates irreversibility
Comparison of Evolutionary Objects.
[Figure: an RNA secondary structure with its observable and unobservable components labelled]
$P(\text{Structure} \mid \text{Sequence}) = \dfrac{P(\text{Sequence} \mid \text{Structure})\,P(\text{Structure})}{P(\text{Sequence})}$
Goldman, Thorne & Jones, 96
Knudsen & Hein, 99
Eddy & co.
Meyer & Durbin, 02; Pedersen & Hein, 03; Siepel & Haussler, 03
The Rise of Comparative Genomics
Lander et al. (2001), Figure 25A
Recursive Definition of Strings
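Gene Grammar
[Figure: a gene with Exon 1, Exon 2 and Exon 3 (E) separated by introns (I); sequence labels ATG and GAG]
S -> E | I
E -> eE | eI
I -> iE | iI
RNA Grammar
[Figure: an RNA secondary structure with single-stranded positions (s) and paired positions (d)]
S -> sS | Ss | dSd | SS
Derivation: sSS -> ssSS -> ssdSdS -> ssddSddS -> ssddSddsS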
Stochastic Grammars
If there is a 1-1 derivation (creation) of a string, the probability of the string is the product of the probabilities of the applied rules.
S --> (0.3) aSa | (0.5) bSb | (0.1) aa | (0.1) bb
S -> aSa -> abSba -> abaaba, with rule probabilities 0.3, 0.5, 0.1: 0.3 * 0.5 * 0.1 = .015
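The product rule can be written as a small recursion for the example grammar on this slide. A minimal sketch, relying on the fact that each string this grammar generates has exactly one derivation; the dictionary and function names are illustrative and not from the slides.

```python
# Rule probabilities of the example grammar:
#   S -> aSa (0.3) | bSb (0.5) | aa (0.1) | bb (0.1)
WRAP = {"a": 0.3, "b": 0.5}        # S -> xSx
TERMINAL = {"aa": 0.1, "bb": 0.1}  # S -> xx


def string_probability(s: str) -> float:
    """Probability that the grammar generates s (0.0 if it cannot)."""
    if s in TERMINAL:
        return TERMINAL[s]
    if len(s) > 2 and s[0] == s[-1] and s[0] in WRAP:
        # The outermost letters fix which wrapping rule was applied.
        return WRAP[s[0]] * string_probability(s[1:-1])
    return 0.0


if __name__ == "__main__":
    print(string_probability("abaaba"))
```

For grammars in which a string can have several derivations, the probabilities of all derivations would have to be summed, which this sketch does not attempt.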
Grammars: Finite Set of Rules for Generating Strings
i. A starting symbol:
ii. A set of substitution rules applied to variables - - in the present string:
Regular
Context Free
Context Sensitive
General (also erasing)
finished – no variables
Structure Dependent Evolution: RNA
[Figure: the sequence U A C A C C G U folded into a secondary structure, repeated for related sequences, with positions numbered 1-7]
$\dfrac{P(\text{History}_{i,j} \mid \text{Paired})}{P(\text{History}_{i,j} \mid \text{Unpaired})}$
From Bjarne Knudsen
Knudsen & Hein, 2003
From Knudsen & Hein (1999)
RNA Structure Application
Observing Evolution has 2 parts
P(x):
P(Further history of x):
[Figure: an RNA secondary structure with position x marked]
Inter- and Intra-species Comparisons
At shorter time scales
•For sequences sampled within a population, their relationship is determined by population structure. There is no analogue of this for interspecies sequences.
•Is within species variation a short time slice of long term variation?
•Where do the species and population perspective meet?
Short Time Evolution: Population Genetics and History
[Figure: a Population of N individuals, each with gene copies 1 and 2, followed through Time]
Cardon
Donnelly
Griffiths
McVean
Wiuf
Song
Schierup
Three large areas of application:
Interpretation of Variation
Human Population History
Gene Mapping
Pathogen Evolution
Ancestral Recombination Graph
Time slices
[Figure: a Population of N individuals, each with gene copies 1 and 2, followed through Time]
All positions have found a common ancestor
All positions have found a common ancestor on one sequence
A randomly picked ancestor: (ancestral material comes in batteries!)
[Figure: ancestral material along the chromosome at three scales: 260 Mb (segments 0 to 52.000), 7.5 Mb (segments 6890 to 8360, *35), 30 kb (*250)]
4Ne 20.000; Segments 52.000; Ancestors 6.800
Applications to Human Genome (Chr 1) (Wiuf and Hein, 97)
The Origin of Variation
Show variation
[Figure: nucleotide variants (A, G, C, T) arising over Time in a population of N individuals]
Inter. SNP Consortium (2001): A map of human genome sequence variation containing 1.42 million SNPs. Nature 409: 928-33
Slice in Space
[Figure: a population of N individuals followed through Time]
a: (3,4)
b: (3,4)
c: (15,16)
d: (16,17)
e: (35,36)
f: (35,36)
g: (36,37)
Minimal ARGs and Haplotype Blocks (Song)
Yun Song, 2004
Genotype and Phenotype Covariation: Gene Mapping
Time
Reich et al. (2001)
Rafnar et al. (2004) – Morris et al. (2001) +
Finding Homologies
Database
New Sequence
P( ) P( ) / * P( )
R. Doolittle et al.(1983).
New Sequence: Simian Sarcoma Virus onc Gene
Similar Sequence: Platelet-Derived Growth Factor
Properties for the known sequence are transferred to the new sequence, immediately yielding biological hypotheses about the new sequence.
P28SIS 51 GGELESLARGSLGSLSVAEPAMIAECKTRTEVFEISAALIDATNANFLVWPPCVEVQACSGCCNNRN..
PDGF-1  1 ----------SLGSLTIAEPAMIAECKTREEVCFCIAAL?DA????????PPCVEVKACTGCCNNRN..
                    *****  ************ **    *** **        ****** ** *******
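The row of asterisks under the two sequences marks identical residues, and it can be recomputed directly from the aligned strings. A minimal sketch, assuming that gap (`-`) and unknown (`?`) positions never count as matches; the constant and function names are illustrative, and the sequences are the aligned segments exactly as printed on the slide, without the trailing "..".

```python
# Aligned segments as printed on the slide.
NEW_SEQ = "GGELESLARGSLGSLSVAEPAMIAECKTRTEVFEISAALIDATNANFLVWPPCVEVQACSGCCNNRN"
KNOWN_SEQ = "----------SLGSLTIAEPAMIAECKTREEVCFCIAAL?DA????????PPCVEVKACTGCCNNRN"


def match_line(a: str, b: str) -> str:
    """Return '*' under identical residues and ' ' elsewhere."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    return "".join(
        "*" if x == y and x not in "-?" else " "
        for x, y in zip(a, b)
    )


def identity_count(a: str, b: str) -> int:
    """Number of aligned positions with identical residues."""
    return match_line(a, b).count("*")


if __name__ == "__main__":
    print(NEW_SEQ)
    print(KNOWN_SEQ)
    print(match_line(NEW_SEQ, KNOWN_SEQ))
    print(identity_count(NEW_SEQ, KNOWN_SEQ), "identical positions")
```

A plain identity count like this is only a first look. Database searches score such matches against what would be expected by chance before properties are transferred from the known sequence.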
“Knowledge Based..”: The Products of Evolution - An Example (D.Baker)
Sequence Structure
Make a List:
Choose global structure that doesn’t create new local structures!
What is Bioinformatics?
The Data
The Analysis
Comparison
Evolution
Long Distance: Comparative Genomics
Short Distance: Variation Analysis
Homology
Non-homology
Physical/Chemical/Statistical Mathematical Modelling
Lizhong Hao, Ben Holtom, Stephen McCauley
Gerton Lunter, Rune Lyngsoe, Irmtraud Meyer, Yun Song, Jennifer Taylor
Jotun Hein, Alexei Drummond, Roald Forsberg, Bjarne Knudsen, Istvan Miklos, Jakob Skou Pedersen, Santiago Schnell, Carsten Wiuf….
Homepage:
http://www.stats.ox.ac.uk/mathgen/bioinformatics/
Methodology
•Evolutionary Models
•Alignment
•Expression Data
•Genome and Gene Evolution
•Sequence Variation Data & Recombination
•RNA Secondary Structure and Evolution
•…………
Collaborations
•William Cookson (WCHG)
•John Hancock (Harwell MRC)
•Peter Simmonds (Edinburgh)
•Bioinformatics Research Centre, Dk
•………
Funding:
MRC & EPSRC