WDCM and CODATA Joint Workshop
ICCC13, Beijing, China
ONE-GENE, ONE-ENZYME
Building a foundation for microbial metagenomics analysis
Kevin McCluskey (1) and Scott Baker (2)
(1) The Fungal Genetics Stock Center
(2) Department of Energy, Pacific Northwest National Laboratory
Outline
• Background
• Role of collections
• Whole genome survey programs
• Accessing genome databases
• Evaluating genome diversity
What is metagenomics?
• The direct genetic analysis of communities of microbes
– Environmental
– Intestinal/Rumen
• Growth of metagenomics publications (from PubMed)
Metagenomic studies are proliferating
• US NCBI SRA lists over 17,000 public metagenome studies
• 16,466 are private
Metagenome studies depend on robust reference materials
• 2005: Most taxa are unidentified or identified only to phylum
• Microbial “Dark Matter”
Resources in collections are essential for metagenomics
• 77 different species representing 23 different genera
• 59 species with 10 or fewer isolates
• > 70 strains from whole genome sequencing programs

# strains  Species
18744      Neurospora crassa
1188       Aspergillus nidulans
672        Neurospora intermedia
550        Fusarium sp.
299        Neurospora tetrasperma
274        Neurospora sitophila
253        Schizophyllum commune
241        Sordaria sp.
152        Neurospora discreta
134        Magnaporthe grisea
138        Aspergillus niger
75         Pichia pastoris
52         Neurospora sp.
28         Ascobolus sp.
26         Gelasinospora sp.
53         Aspergillus fumigatus
Increasing pressure on operations of culture collections in post-genomics era
Neurospora genome was published in 2004
[Figure: Numbers of items distributed by the FGSC in recent years. X-axis: years 2000 to 2011; y-axis: 0 to 2,500 items. Series: Neurospora, Aspergillus, Other, Plasmids, Plates.]
Taxonomic studies and WGS
• 1000 Fungal Genomes seeks to provide one complete genome sequence for every fungal FAMILY
• GEBA will fill gaps in bacterial and archaeal genome sequences
US DOE Joint Genome Institute programs
• Most are user programs
– community sequencing program
– emerging technologies opportunity program
– technology development pilot program
US DOE Joint Genome Institute programs
• Emphasizing relationships with living microbe collections
– Genomic Encyclopedia of Bacteria and Archaea (GEBA): http://www.jgi.doe.gov/programs/GEBA/
• DSMZ
– 1000 Fungal Genomes program: http://1000.fungalgenomes.org
• USDA, USFS, FGSC collections
US DOE JGI Data Release
• JGI supports open data
• Data published immediately
– Available immediately on JGI portal
• Registration required (Fall 2013)
– Archived at NCBI
• Fort Lauderdale (2003) and Toronto (2009) agreements
– Reserves right of first publication
• Not strongly enforced
• Analyses and time-of-publication data are the responsibility of the investigator
1000 Fungal Genomes
• PIs: Spatafora, Stajich, McCluskey, Crous, Turgeon, Lindner, O'Donnell, Ward, Rokas, Glass, Arnold, Martin, Grigoriev
• Sampling to completely sequence ≥ one species from each Family
Current Databases are Taxonomic
• JGI
• Genomes OnLine Database (GOLD)
BLAST at JGI
• Somewhat hidden
• Identifies SOME strain names
BLAST at NCBI
• Taxonomically dense
• Not all hits are to extant strains
Searching collection catalogs by gene ontology
• Search by traits
• Limited at JGI
Searching by GO term
• Could allow a taxonomically blind search
• Also does not require that the user have a DNA sequence for the trait
• Only as good as the annotation
• Alternatives: KEGG, KOG
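To make the idea of a taxonomically blind search concrete, here is a minimal sketch of filtering a catalog by GO term alone. Everything in it is an assumption for illustration: the `CatalogEntry` record, the `search_by_go_term` helper, the strain names, the taxa, and the GO assignments are placeholders and do not describe any real catalog, JGI interface, or annotation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    """One strain in a hypothetical collection catalog."""
    strain_id: str        # made-up identifier, not a real accession
    taxon: str            # kept for display only; never used for filtering
    go_terms: frozenset   # GO identifiers taken from the genome annotation


def search_by_go_term(catalog, go_id):
    """Return every entry annotated with go_id, whatever its taxonomy."""
    return [entry for entry in catalog if go_id in entry.go_terms]


# Toy catalog: strain names, taxa, and GO assignments are all placeholders.
catalog = [
    CatalogEntry("strain-001", "Genus one", frozenset({"GO:0000001", "GO:0000002"})),
    CatalogEntry("strain-002", "Genus two", frozenset({"GO:0000002"})),
    CatalogEntry("strain-003", "Genus three", frozenset({"GO:0000003"})),
]

hits = search_by_go_term(catalog, "GO:0000002")
print([entry.strain_id for entry in hits])
```

The query needs neither a taxon name nor a DNA sequence, only the term. A strain whose genome carries the trait but lacks the annotation would simply not be returned, which is the limitation the slide points out.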
How can you know if your gene is interesting?
• Resequencing characterizes within-species genetic diversity
• Enables allele finding
• Amenable to quantitative analysis
Resequencing reveals “Haplotypes”
• 3 Backgrounds:
– 7035 is in a “St. Lawrence” background
• 3831 and 2261 are intermediate
– 1363 is in a “Lindegren” background
– 821 is in an “Abbott” background
• This figure shows polymorphisms on chromosome 7 from five strains
• Other chromosomes show different amounts of Lindegren vs St. Lawrence sequence
[Figure legend: Shared SNP, Unique SNP, Indel]
Size distribution of Indels
• Indels of size +/- 4 are over-represented
• Indels that do not cause frameshift are greatly over-represented in coding sequences
[Figure: Size distribution of indels. Upper panel: number of indels (0 to 30,000) by indel size (1 to 18). Lower panel: % CDS indels (0 to 40) by indel size in number of bases (1 to 18).]
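The frameshift point above rests on simple arithmetic: an indel whose length is a multiple of three leaves the reading frame intact. A minimal sketch of that classification follows. The function names and the sample lengths are invented for illustration and are not the data behind the figure; signed lengths are assumed to mean insertions (positive) and deletions (negative).

```python
from collections import Counter


def is_frameshift(indel_length):
    """True if an indel of this signed length shifts the reading frame."""
    if indel_length == 0:
        raise ValueError("an indel must have a non-zero length")
    return abs(indel_length) % 3 != 0


def summarize_indels(indel_lengths):
    """Count in-frame versus frameshift indels in a list of signed lengths."""
    counts = Counter()
    for length in indel_lengths:
        counts["frameshift" if is_frameshift(length) else "in_frame"] += 1
    return counts


# Made-up lengths: positive = insertion, negative = deletion.
sample_lengths = [1, -1, 3, -3, 4, -4, 6, 2]
summary = summarize_indels(sample_lengths)
print(summary["in_frame"], summary["frameshift"])
```

Applied separately to indels inside and outside coding sequences, a tally like this is one way to express the over-representation of non-frameshift indels as a proportion.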
Mitochondrial polymorphisms among WGS strains
Strain  SNP  INDEL  CDS INDEL
106     2    14     0
305     4    65     5
309     3    23     0
322     7    83     5
821     6    133    8
1211    4    69     4
1303    4    53     4
1363    6    11     0
2261    2    10     0
3114    4    51     2
3246    3    28     1
3562    6    27     1
3564    3    70     4
3566    36   322    21
3831    16   153    10
3921    3    18     2
7022    11   92     3
7035    9    29     1
Griffiths, Collins and Nargang (1995)
• No two strains have the same mitochondrial genotype
• Some strains are heteroploid
How is the variability relevant?
• Resequencing reveals multiple alleles at many loci
• Evaluating generated or natural mutants requires a background of natural or neutral variability
Allele variability among resequenced strains
• Depends on organism
• Neurospora has 9,730 genes
– Sequenced 18 strains (pilot study)
[Figure: Potential range of allele variability, drawn as a scale from 9,730 (all identical to reference genome) to 175,000 (every allele unique), with the observed position marked “?”.]
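The two ends of that scale follow from the gene and strain counts. The sketch below assumes the upper bound is simply genes multiplied by strains, which reproduces the slide's rounded 175,000; the function name is hypothetical.

```python
def allele_count_bounds(n_genes, n_strains):
    """Return (lower, upper) bounds on the number of distinct alleles.

    Lower bound: every strain matches the reference at every gene,
    so there is one allele per gene.
    Upper bound: every strain carries its own allele at every gene
    (assumption: one allele per gene per strain).
    """
    lower = n_genes
    upper = n_genes * n_strains
    return lower, upper


lower, upper = allele_count_bounds(9_730, 18)
print(lower, upper)  # 9730 175140
```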
Allele variability: Neurospora example
• All strains had SNPs and indels
• >400 ORFs had no SNPs
• All strains had nonsense mutations
[Figure: Nonsense SNPs per strain, y-axis 0 to 150. X-axis strain labels recovered: 1363, 322, 3562, 7022, 3564, 3114, 1211, 3831, 7035.]
Strain Total SNPs
309 13274
7035 18487
1211 20493
3246 21533
3831 22961
106 23579
3566 37516
3114 41085
2261 44839
3564 47981
1303 59356
7022 78991
3921 80311
305 90195
3562 106533
322 142489
1363 146641
821 188346
[Figure: SNPs per strain, y-axis 0 to 80,000. X-axis strain labels recovered: 821, 1363, 7022, 1303, 3566, 3246, 1211, 7035, 106.]
Allele variability among resequenced strains
• 18 Laboratory mutant strains
– two lineages
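• 33,172 alleles among 18 strains
[Figure: The same scale of allele variability, from 10,000 (all identical to reference genome) to 180,000 (every allele unique), with the observed 33,172 alleles marked.]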
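One way to read the 33,172 figure is as a position on that scale. The sketch below uses the unrounded inputs given earlier in the talk (9,730 genes, 18 strains) rather than the rounded 10,000 and 180,000; the function name and the choice of a linear scale are assumptions made for illustration.

```python
def place_on_allele_scale(observed_alleles, n_genes, n_strains):
    """Express an observed allele count relative to its theoretical bounds.

    Returns (mean alleles per gene, fraction of the way from the lower
    bound, all strains identical, to the upper bound, every allele unique).
    """
    lower = n_genes
    upper = n_genes * n_strains
    mean_per_gene = observed_alleles / n_genes
    fraction_of_range = (observed_alleles - lower) / (upper - lower)
    return mean_per_gene, fraction_of_range


mean_per_gene, fraction = place_on_allele_scale(33_172, 9_730, 18)
print(f"{mean_per_gene:.2f} alleles per gene on average")
print(f"{fraction:.1%} of the way from 'all identical' to 'every allele unique'")
```

Under these assumptions the 18 strains carry between three and four alleles per gene on average, much closer to the "all identical" end of the scale than to the "every allele unique" end.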
Lineage dictates number of unique alleles
• Strains crossed with the reference genome strain have the fewest novel alleles
[Figure: Number of novel (non-reference) alleles per strain, y-axis 0 to 7,000. X-axis: Strain Number (FGSC), left to right: 1363, 821, 322, 305, 3562, 3921, 7022, 2261, 1303, 3564, 106, 3566, 3246, 3114, 3831, 7035, 309, 1211. Axis annotation: “Less related” (left) to “More related” (right).]
Key: 2489 = St. Lawrence/Oak Ridge; 1363 = Lindegren; 821 = Abbott
Second axis annotation for the novel-allele figure: “Fewer Backcrosses” (left) to “More Backcrosses” (right)
Culture Collections support metagenomics analyses
• Taxonomic diversity studies
– GEBA
– 1000 Fungal Genomes Project
– BGI
• Genetic diversity studies
– Pan genome
– Neurospora
– Dothideomycetes
– Eurotiomycetes
Acknowledgements
• US National Science Foundation grant 0235887 (FGSC)
– Mike Plamann, Aric Wiest
• US National Science Foundation grant 1203112 (RCN)
• US NIH award GM068087 (Jay Dunlap, PI)
– Knock-out collection
• DOE JGI Fungal Genomics Program
– Scott Baker, Igor Grigoriev, Wendy Schackwitz, Anna Lipzen, Joel Martin