Post on 23-Dec-2015
Quantifying contributions of mutations and homologous
recombination to E. coli genomic diversity
Sergei Maslov
Department of Biosciences
Brookhaven National Laboratory, New York
Bacterial genome evolution happens in cooperation with
phages
Variation between E. coli strains
M Touchon et al. PLoS Genetics (2009)
Pan-genome of E. coli
Copy and Insert Copy and Replace
FW Studier, P Daegelen, RE Lenski,
S Maslov, JF Kim, JMB (2009)
Comparison of B vs K-12 strains of E. coli
Usual suspects are there but do not explain heterogeneity
Negative correlation with protein abundance: 2.5% of variation, P-value = 10^-5
Positive correlation with distance from origin of replication: 0.4% of variation, P-value = 10^-2
High SNP numbers are clustered along the
chromosome
Recombined
Clonal
P. Dixit, T. Y. Pang, Studier FW, Maslov S, PNAS submitted (2013)
SNPs by recombination/SNPs by clonal mutations
r/μ=6±1
Clonal regions Recombined regions
Recombined regions
P. Dixit, T. Y. Pang, Studier FW, Maslov S, PNAS submitted (2013)
Strains: K-12 vs. ETEC-H10407, HS, O157-H7-Sakai
Neutral model: mutations and recombinations among 70 “genes”, population of 10^4
C. Fraser et al.(2007) and (2009)
P. Dixit, T. Y. Pang, Studier FW, Maslov S, PNAS submitted (2013)
Phase transition Δc=1.5%
P. Dixit, T. Y. Pang, Studier FW, Maslov S, PNAS submitted (2013)
Why exponential tail?
Time to coalescence: Prob(t) = (1/Ne)(1-1/Ne)^(t-1) ≈ (1/Ne) exp(-t/Ne)
Exponential slope = 1/(2μNe), or 1/θ
Population size Ne = (1 ± 0.1) × 10^9, consistent with earlier estimates
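A minimal numerical sketch of the coalescence argument above, assuming the standard reading that two lineages coalesce in any given generation with probability 1/Ne, so the waiting time is geometric and close to exponential for large Ne. The function names and the values of NE and MU are hypothetical and chosen only for illustration; they are not the estimates from the talk.

```python
import math


def coalescence_pmf(t: int, ne: float) -> float:
    """Geometric probability that two lineages coalesce exactly t generations back."""
    return (1.0 / ne) * (1.0 - 1.0 / ne) ** (t - 1)


def exponential_approx(t: float, ne: float) -> float:
    """Large-Ne approximation of the geometric law: (1/Ne) * exp(-t/Ne)."""
    return math.exp(-t / ne) / ne


def tail_slope(mu: float, ne: float) -> float:
    """Slope 1/theta of log-probability versus divergence, with theta = 2*mu*Ne."""
    return 1.0 / (2.0 * mu * ne)


# Illustrative (hypothetical) parameters, not values from the slides.
NE = 1_000
MU = 1e-5

if __name__ == "__main__":
    for t in (1, 500, 3000):
        print(t, coalescence_pmf(t, NE), exponential_approx(t, NE))
    print("slope 1/theta:", tail_slope(MU, NE))
```

Because divergence grows as 2μt, an exponential distribution of coalescence times with mean Ne translates into an exponential tail in divergence with mean θ = 2μNe, which is why the slope can be read as 1/θ.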
Why Ne << N ? Phages:
But: there are phages that cross species boundaries.
Also slope is similar for different species
Restriction modification system: Recombined segments are not continuous
[Milkman R, Bridges MM. Genetics 1990]
Recombination efficiency: 20-30 identical bases are needed to start recombination.
Our slope predicts 60 bases, which roughly matches 30 at the beginning and 30 at the end
Species are defined by recombination
Are our 30+ strains a representative sample?
Fully sequenced genomes: 1000s of genes (unbiased and complete) 10s of strains (biased)
MLST data: 10s of genes (biased) 1000s of strains (unbiased, I hope)
Database http://mlst.ucc.ie: ∼3000 E. coli strains, 7 short regions of ~500 base pairs each in housekeeping genes
MLST-- Genomes
Is it really phages?
Phage capacity: 20 kb
Other strains: up to 40 kb
K-12 to B comparison
1kb: gene length
Does neutral model explain everything?
At 3 standard deviations
• 19 1-kb regions supervariable
• 29 1-kb regions superconserved
Collaborators & funding
• Bill Studier (BNL)
• Purushottam Dixit (BNL)
• Tin Yau Pang (Stony Brook)
• Rich Lenski (Michigan State)
• Patrick Daegelen (France)
• Jinhyun Kim (Korea)
DOE Systems Biology Knowledgebase (KBase)
• Adam Arkin (Berkeley)
• Rick Stevens (Argonne)
• Bob Cottingham (Oak Ridge)
• Mark Gerstein (Yale)
• Doreen Ware (Cold Spring Harbor)
• Mike Schatz (Cold Spring Harbor)
• Dave Weston (ORNL)
• 60+ other collaborators
Thank you!
Genes encoded in bacterial genomes ~ Packages installed on Linux computers
Complex systems have many components
  Genes (Bacteria)
  Software packages (Linux OS)
Components do not work alone: they need to be assembled to work
In individual systems only a subset of components is used
  Genome (Bacteria) – bag of genes
  Computer (Linux OS) – installed packages
Components have vastly different frequencies of use
Justin Pollard, http://www.designboom.com
IKEA: has many components
They need to be assembled to work
Different frequencies of use
Common vs. Rare
What determines the frequency of use?
Popularity, a.k.a. preferential attachment
  Frequency ~ self-amplifying popularity
  Relevant for social systems: WWW links, Facebook friendships, scientific citations
Functional role
  Frequency ~ breadth or importance of the functional role
  Relevant for biological and technological systems, where selection adjusts undeserved popularity
Empirical data on component frequencies
Bacterial genomes (eggnog.embl.de): 500 sequenced prokaryotic genomes, 44,000 orthologous gene families
Linux packages (popcon.ubuntu.com): 200,000 Linux packages installed on 2,000,000 individual computers
Binary tables: component is either present or not in a given system
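A small sketch of how component frequencies could be read off such a binary table, assuming rows are systems (genomes or computers) and columns are components (gene families or packages). The toy table and the function name are hypothetical and are not taken from the eggNOG or popcon data.

```python
def component_frequencies(table):
    """Return, for each column, the fraction of systems (rows) that contain it."""
    n_systems = len(table)
    n_components = len(table[0])
    return [
        sum(row[j] for row in table) / n_systems
        for j in range(n_components)
    ]


# Hypothetical toy table: 4 systems (rows) x 4 components (columns).
# 1 means the component is present in that system, 0 means it is absent.
TOY_TABLE = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
]

if __name__ == "__main__":
    print(component_frequencies(TOY_TABLE))
```

In this toy case the first column plays the role of a "universal" component (present everywhere), while the last two are rare.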
Frequency distributions
P(f) ~ f^(-1.5), except for the top √N “universal” components with f ~ 1
Cloud
Shell
Core
ORFans
How to quantify functional importance?
Components do not work alone
  Breadth/Importance ~ component is needed for proper functioning of other components
Dependency network
  A → B means A depends on B for its function
  Formalized for Linux software packages
  For metabolic enzymes given by upstream-downstream positions in pathways
Frequency ~ dependency degree, Kdep
  Kdep = the total number of components that directly or indirectly depend on the selected one
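The definition of Kdep above can be sketched as a graph traversal: reverse every "A depends on B" edge and count everything reachable from the selected component. This is only an illustration; the package names in the toy network are hypothetical and the function name is not from the talk.

```python
from collections import defaultdict, deque


def dependency_degree(depends_on):
    """Compute Kdep for every component.

    depends_on maps each component A to the list of components B that A
    directly depends on. Kdep of B counts all components that depend on B
    directly or indirectly.
    """
    dependents = defaultdict(set)  # reversed edges: B -> {A, ...}
    nodes = set(depends_on)
    for a, targets in depends_on.items():
        for b in targets:
            dependents[b].add(a)
            nodes.add(b)

    kdep = {}
    for node in nodes:
        seen = set()
        queue = deque([node])
        while queue:  # breadth-first search along reversed edges
            current = queue.popleft()
            for parent in dependents[current]:
                if parent not in seen:
                    seen.add(parent)
                    queue.append(parent)
        seen.discard(node)  # do not count the node itself if the graph has a cycle
        kdep[node] = len(seen)
    return kdep


# Hypothetical toy dependency network.
TOY_NETWORK = {
    "editor": ["gui", "libc"],
    "gui": ["libc"],
    "shell": ["libc"],
    "libc": [],
}

if __name__ == "__main__":
    print(dependency_degree(TOY_NETWORK))
```

Here the base library has the largest Kdep because everything else needs it, which matches the intuition that high-Kdep components should be present in most systems.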
Frequency is positively correlated with functional importance
Correlation coefficient ~0.4 for both Linux and genes
Could be improved by using weighted dependency degree
Tree-like metabolic network
Kdep=5
Kdep=15
TCA cycle
Dependency degree distribution on a critical branching tree
P(K) ~ K^(-1.5) for a critical branching tree
Paradox: Kmax^(-0.5) ~ 1/N gives Kmax = N^2 > N
Answer: parent tree size imposes a cutoff: there will be √N “core” nodes with Kmax = N present in almost all systems (ribosomal genes or core metabolic enzymes)
Need a new model: in a tree D = 1, while in real systems D ~ 2 > 1
Dependency network evolution
New components are added gradually over time
Each new component depends on D existing components selected randomly
Kdep(t) ~ (t/N)^(-D)
P(Kdep(t) > K) = P(t/N < K^(-1/D)) = K^(-1/D)
P(Kdep) = Kdep^(-(1+1/D)) = Kdep^(-1.5) for D = 2
Nuniversal = N^((D-1)/D) = N^0.5 for D = 2
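The growth model described above can be simulated directly. The sketch below is one possible reading of it, under stated assumptions: components are added one at a time, each picks up to D distinct earlier components uniformly at random, and Kdep is counted over direct and indirect dependents. The function name, the seed, and the bitmask bookkeeping are illustrative choices, not the implementation behind the slides.

```python
import random


def simulate_kdep(n_components, d, seed=0):
    """Grow a random dependency network and return Kdep for every component.

    Component t (added at step t) depends on min(d, t) distinct components
    chosen uniformly among those added earlier.
    """
    rng = random.Random(seed)
    depends_on = [[]]  # component 0 has nothing to depend on
    for t in range(1, n_components):
        depends_on.append(rng.sample(range(t), min(d, t)))

    # dependents[i] is a bitmask of all components that need component i.
    # Dependents always have a larger index, so a single pass from the
    # newest component to the oldest accumulates them completely.
    dependents = [0] * n_components
    for j in range(n_components - 1, -1, -1):
        reach = dependents[j] | (1 << j)
        for parent in depends_on[j]:
            dependents[parent] |= reach
    return [bin(mask).count("1") for mask in dependents]


if __name__ == "__main__":
    kdep = simulate_kdep(2000, d=2, seed=1)
    n = len(kdep)
    for k in (10, 100, 1000):
        fraction = sum(1 for value in kdep if value > k) / n
        print(k, fraction)  # compare by eye with K^(-1/D)
```

Printing the fraction of components with Kdep above a threshold gives a rough, finite-size check against the K^(-1/D) tail quoted above; agreement is expected only approximately at this size.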
Kdep decreases with layer number
Linux Model with D=2
Zipf plot for Kdep distributions
Metabolic enzymes vs. Model
Linux vs. Model
Frequency distributions
P(f) ~ f^(-1.5), except for the top √N “universal” components with f ~ 1
Shell
Core
ORFans
Cloud
Why should we care about P(f)?
Metagenomes and pan-genomes
The Human Microbiome Project Consortium, Nature (2012)
For P(f) ~ f^(-1.5): (pan-genome size) ~ (# of samples)^0.5
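A rough numerical check of this square-root scaling, assuming each component with frequency f shows up in at least one of n independent samples with probability 1 - (1-f)^n, so the expected pan-genome size is proportional to the integral of f^(-1.5) [1 - (1-f)^n] over f. The lower cutoff f_min, the number of integration steps, and the function names are illustrative assumptions.

```python
import math


def expected_pangenome_size(n_samples, f_min=1e-10, steps=20000):
    """Integrate f^-1.5 * [1 - (1-f)^n] over f in [f_min, 1], up to a constant.

    Uses the trapezoid rule in u = ln(f), where df = f du, so the integrand
    becomes f^-0.5 * [1 - (1-f)^n].
    """
    u_min = math.log(f_min)
    du = -u_min / steps
    total = 0.0
    for i in range(steps + 1):
        f = math.exp(u_min + i * du)
        # Probability that a component of frequency f is seen at least once.
        if f >= 1.0:
            seen = 1.0
        else:
            seen = -math.expm1(n_samples * math.log1p(-f))
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * f ** -0.5 * seen * du
    return total


def scaling_exponent(n_small, n_large):
    """Effective exponent of pan-genome size versus number of samples."""
    ratio = expected_pangenome_size(n_large) / expected_pangenome_size(n_small)
    return math.log(ratio) / math.log(n_large / n_small)


if __name__ == "__main__":
    print(scaling_exponent(1000, 4000))
```

The effective exponent comes out close to 0.5 rather than exactly 0.5, because the frequency range is cut off at f = 1.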
Pan-genome of E. coli strains
M Touchon et al. PLoS Genetics (2009)
Genome evolution in E. coli
Studier FW, Daegelen P, Lenski RE, Maslov S, Kim JF, J. Mol Biol. (2009)
P. Dixit, T. Y. Pang, Studier FW, Maslov S, submitted (2013)
How many transcription factors does an organism need?
Regulator genes Worker genes
S. Maslov, TY Pang, K. Sneppen, S. Krishna, PNAS (2009)
TY Pang, S. Maslov, PLoS Comp Bio (2011)
Figure adapted from S. Maslov, TY Pang, K. Sneppen, S. Krishna, PNAS (2009)
NR ~ NG^2, so NR/NG ~ NG
Parkinson's Law
The report of the Royal Commission on the Civil Service was published on Thursday afternoon. Time has not permitted any comment in this week's issue of The Economist on the contents of the Report. But the startling discovery enunciated by a correspondent in the following article is certainly relevant to what should have been in it.
Nov 19th 1955 | From The Economist print edition
… bureaucracy grew by 5-7% per year "irrespective of any variation in the amount of work (if any) to be done."
Why?
1) "An official wants to multiply subordinates, not rivals"
2) "Officials make work for each other"
so that "Work expands so as to fill the time available for its completion"
Is this what happens in bacterial genomes? Probably not!
Cyril Northcote Parkinson (1909-1993)
Economies of scale in bacterial evolution
• NR = NG^2/80,000, so NG/NR = 80,000/NG
• Economies of scale: as the genome gets larger, new pathways get shorter
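A short worked example of the quadratic scaling quoted above, using the constant 80,000 from the slide. The genome sizes in the loop are hypothetical round numbers chosen for illustration, and the function names are not from the talk.

```python
def regulators(n_genes, scale=80_000):
    """Number of regulators from NR = NG^2 / 80,000."""
    return n_genes ** 2 / scale


def genes_per_regulator(n_genes, scale=80_000):
    """Genes per regulator, NG/NR = 80,000 / NG."""
    return scale / n_genes


if __name__ == "__main__":
    # Hypothetical genome sizes (number of genes), for illustration only.
    for n_genes in (1000, 2000, 4000, 8000):
        print(n_genes, regulators(n_genes), genes_per_regulator(n_genes))
```

Doubling the genome size quadruples the number of regulators and halves the number of genes per regulator, which is the "economies of scale" statement in numerical form.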
nutrient
Horizontal gene transfer: entire pathways could be added in one step
nutrient
Redundant enzymes are removed
Central metabolic core anabolic pathways biomass production
Adapted from “scope-expansion” algorithm by R. Heinrich et al.
Minimal metabolic pathways from reactions in KEGG database
(# of pathways or their regulators) ~ (# of enzymes)^2
NG
NR
What does it all mean for regulatory networks?
Scale-free regulatory networks with “hubs” due to power law distribution of branch sizes: P(S) ~ S^(-3)
Trends in complexity of regulation vs. genome size
NR<Kout> = NG<Kin> = number of regulatory interactions [E. van Nimwegen, TIG (2003)]
NR/NG = <Kin>/<Kout> increases with NG
Either <Kout> decreases with NG: functions become more specialized
Or <Kin> grows with NG: regulation gets more coordinated & interconnected
Most likely both trends at once
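The bookkeeping identity NR<Kout> = NG<Kin> can be checked on any edge list, since both sides count the same set of regulatory interactions. The sketch below assumes <Kout> is averaged over regulators and <Kin> over all genes; the toy network and the names in it are hypothetical.

```python
def regulation_stats(edges, n_genes):
    """Return (NR, <Kout>, <Kin>) for a list of (regulator, target) pairs.

    <Kout> is averaged over regulators only, <Kin> over all n_genes genes.
    """
    interactions = set(edges)  # ignore duplicated edges
    regulators = {tf for tf, _ in interactions}
    n_r = len(regulators)
    k_out = len(interactions) / n_r
    k_in = len(interactions) / n_genes
    return n_r, k_out, k_in


# Hypothetical toy network: 2 regulators (TF1, TF2) and 5 worker genes.
TOY_EDGES = [
    ("TF1", "g1"), ("TF1", "g2"), ("TF1", "TF2"),
    ("TF2", "g3"), ("TF2", "g4"), ("TF2", "g5"),
]
TOY_N_GENES = 7

if __name__ == "__main__":
    n_r, k_out, k_in = regulation_stats(TOY_EDGES, TOY_N_GENES)
    print(n_r * k_out, TOY_N_GENES * k_in)  # both equal the number of interactions
    print(n_r / TOY_N_GENES, k_in / k_out)  # NR/NG equals <Kin>/<Kout>
```

Since the identity holds for any network, an increase of NR/NG with genome size has to come from a falling <Kout>, a rising <Kin>, or both.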
nutrient
TF1
nutrient
TF2
Regulatory templates: one worker – one boss
<Kout>: <Kin>=1=const
nutrient
TF1
nutrient
Regulatory templates: long top-to-bottom regulation
<Kout>=const, <Kin>:
TF2
<Kout>: <Kin>:
nutrient
nutrient
TF1
TF2
Regulatory templates: hierarchy & middle management
TF3
Histogram of the # of SNPs in genes
FW Studier, P Daegelen, RE Lenski, S Maslov, JF Kim, JMB (2009)
Comparison of B vs K-12 strains of E. coli
• 50% of genes have very few SNPs
• 1253: 0 SNPs
• 445: 1 SNP
• 232: 2 SNPs
• The remaining 50% are in exponential tail up to 100 SNPs (10% divergence) and higher