Microbial Agrogenomics 4/2/2015, UK-MX Workshop

Microbial Agrogenomics Where can it lead us? Leighton Pritchard Information and Computational Sciences The James Hutton Institute


Microbial Agrogenomics: Where can it lead us?

Leighton Pritchard, Information and Computational Sciences, The James Hutton Institute

Acceptable Use Policy

Recording of this talk, taking photos, and discussing the content using email, Twitter, blogs, etc. is permitted (and encouraged), provided that distraction to others during the presentation is minimised.

These slides will be available on SlideShare.

Table of Contents

Introduction

Why Genomics?

2003-Now

Implications

Where Next?

Conclusions

The James Hutton Institute

Centres of Expertise

http://www.hutton.ac.uk

• Dundee Effector Consortium (DEC, with University of Dundee)

• Centre for Research on Potato and Other Solanaceous Plants (CRPS)

• Centre for Human and Animal Pathogens in the Environment (HAP-E)

Plant-Pathogen Interactions

Pathogens of barley (e.g. Rhynchosporium commune) and soft fruit (e.g. Raspberry Leaf Blotch Virus (RLBV))

Plant-Pathogen Interactions

Potato pathogens, pests, and vectors:

• soft-rot bacteria (Dickeya, Pectobacterium, Erwinia)

• blight (Phytophthora infestans)

• Potato Cyst Nematode (PCN) (Globodera)

• aphids (Myzus persicae)

Issue 1: Food Security

• Economic cost and burden of crop disease
  • P. infestans: €1bn Europe; $4bn global

• Societal impact (human health, commodity prices; farming)

• Emerging pathogens (JIT supply chain; climate change)

• Plant-associated human pathogens

• Food fraud

Issue 2: Environmental Sustainability

• Pesticide minimisation and withdrawal

• Durable resistance, soil-beneficial microbes, plant growth/nutritional enhancement

• Traditional breeding, GM, or engineering?

• Soils: rhizosphere interactions/soil diversity

• Farming practices (water run-off, rotation, equipment cleaning - EU sulphuric acid ban)


What Have Genomes Ever Done For Us?

• Catalogue features (genes, regulatory elements, etc.) in an organism.

Plant-microbe interactions

Gene products at the host-microbe interface

Dodds & Rathjen (2010) Nat. Rev. Genet. 11:539-548 doi:10.1038/nrg2812

Plant-Nematode Interactions

RNA-seq identification of 27 putative nematode effectors: small proteins, expressed in gland cells during feeding stage only.

Cotton et al. (2014) Genome Biol. 15:R43 doi:10.1186/gb-2014-15-3-r43

Plant Defence

Prediction of NB-LRR genes (sequence capture).

Jupe et al. (2013) Plant J. 76:530-544 doi:10.1111/tpj.12307

Jupe et al. (2012) BMC Genomics 13:75 doi:10.1186/1471-2164-13-75

What Have Genomes Ever Done For Us?

• Catalogue features (genes, regulatory elements, etc.) in an organism.

• If we have multiple genomes...
  • What common features associate with phenotype or environment?

Plant-microbe interactions

GWAS/QTLs/genotyping for plant breeding

http://ics.hutton.ac.uk/flapjack/

Milne et al. (2010) Bioinformatics 26:3133-3134 doi:10.1093/bioinformatics/btq580

Plant-microbe interactions

Structural changes to genomes: repeat-driven expansion

duplication, mutation, recombination, epigenetic control of effectors . . .

Haas et al. (2009) Nature 461:393-398 doi:10.1038/nature08358

Plant-microbe interactions

Structural changes to genomes: genome reductions

Buchnera, Serratia symbiotica - aphid symbionts, ‘random’ inactivation

Gil et al. (2002) Proc. Natl. Acad. Sci. USA 99:4454-4458 doi:10.1073/pnas.062067299

Burke & Moran (2011) Genome Biol. Evol. 3:195-208 doi:10.1093/gbe/evr002

Plant-microbe interactions

Lateral gene transfer (virulence-associated genes)

Bell et al. (2004) Proc. Natl. Acad. Sci. 101:11105-11110 doi:10.1073/pnas.0402424101

Plant-microbe interactions

Closely-related bacteria, different host/environmental preference.

Pectobacterium atrosepticum

Holden et al. (2009) FEMS Micro. Rev. 33:689-703 doi:10.1111/j.1574-6976.2008.00153.x

Toth et al. (2006) Annu. Rev. Phytopath. 44:305-336 doi:10.1146/annurev.phyto.44.070505.143444

What Have Genomes Ever Done For Us?

• Catalogue features (genes, regulatory elements, etc.) in an organism.

• If we have multiple genomes...
  • What common features associate with phenotype or environment?
  • Epidemiology: spread and transmission

Historical Origins

Retracing 19th-century P. infestans pandemics

Yoshida et al. (2014) PLoS Pathog. 10:e1004028 doi:10.1371/journal.ppat.1004028

International Emergence

Distribution of Dickeya spp. in Europe

Map legend: • D. dianthicola; ◦ D. solani; and a third marker for Dickeya spp. on potato

Toth et al. (2011) Plant Pathol. 60:385-399 doi:10.1111/j.1365-3059.2011.02427.x

Host Jumps

Movement of Dickeya from ornamental to crop plants

Parkinson et al. (2015) Eur. J. Plant Pathol. 141:63-70 doi:10.1007/s10658-014-0523-5

Diagnostic Tools

Quarantine and legislation require precise identification.

Genomes enable rapid, robust RT-PCR diagnostics.

[Figure: matrix of primer targets (I-V) against genomes (I-V)]

https://github.com/widdowquinn/find_differential_primers

Pritchard et al. (2013) Plant Pathol. 62:587-596 doi:10.1111/j.1365-3059.2012.02678.x

Pritchard et al. (2012) PLoS One 7:e34498 doi:10.1371/journal.pone.0034498
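The idea of a genome-derived diagnostic can be sketched in a few lines: a primer pair is useful if it would amplify every target genome and none of the non-target genomes. The sketch below is an illustration only, not the method used by the tool cited above. The function names, the exact-match rule, the forward-strand-only search, the 300 bp amplicon limit, and the toy sequences are all assumptions made for the example.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an unambiguous DNA sequence."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def amplifies(genome: str, fwd: str, rev: str, max_amplicon: int = 300) -> bool:
    """True if both primers match exactly, in order, within max_amplicon bases.

    Simplification (assumed): only the forward strand is searched and no
    mismatches are tolerated.
    """
    genome, fwd = genome.upper(), fwd.upper()
    rev_site = reverse_complement(rev)  # where the reverse primer binds
    start = genome.find(fwd)
    while start != -1:
        end = genome.find(rev_site, start + len(fwd))
        if end != -1 and end + len(rev_site) - start <= max_amplicon:
            return True
        start = genome.find(fwd, start + 1)
    return False


def is_diagnostic(fwd: str, rev: str, targets: list, non_targets: list) -> bool:
    """True if the pair amplifies all targets and no non-targets."""
    hits_all_targets = all(amplifies(g, fwd, rev) for g in targets)
    hits_no_others = not any(amplifies(g, fwd, rev) for g in non_targets)
    return hits_all_targets and hits_no_others


# Hypothetical toy sequences, for illustration only.
TARGET = "TTTTACGTACGGGGGGGGCCATGAAAAA"
NON_TARGET = "TTTTACGTACGGGGGGGGAAAA"
FWD, REV = "ACGTAC", "TCATGG"
print(is_diagnostic(FWD, REV, [TARGET], [NON_TARGET]))
```

A real pipeline would also need to handle both strands, mismatches, and primer thermodynamics, which this sketch deliberately leaves out.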


2003: E. carotovora subsp. atroseptica

• £250k collaboration between SCRI, University of Cambridge, WT Sanger Institute

• Single isolate: E. carotovora subsp. atroseptica SCRI1043

• First sequenced enterobacterial plant pathogen (32 authors!)

• Annotation: 6 people, for 6 months ≈ three person-years

• Result: single, complete 5Mbp circular chromosome (10.2X)

Bell et al. (2004) Proc. Natl. Acad. Sci. USA 101(30):11105-11110 doi:10.1073/pnas.0402424101

2003: E. carotovora subsp. atroseptica

Compared against all 142 then-available bacterial genomes

Bell et al. (2004) Proc. Natl. Acad. Sci. USA 101(30):11105-11110 doi:10.1073/pnas.0402424101

2013: Dickeya spp.

Sequenced and annotated 25 isolates of Dickeya over two years

• Multiple sequencing methods: 454, Illumina (SE, PE)

• Automated annotation, limited manual correction

• Results: 12-237 fragments; 4.2-5.1Mbp/genome (6-84X)

Pritchard et al. (2013) Genome Announc. 1(4) doi:10.1128/genomeA.00087-12

Pritchard et al. (2013) Genome Announc. 1(6) doi:10.1128/genomeA.00978-13

2013: Dickeya spp.

Whole genome-based species definitions: sp. nov. D. solani

van der Wolf et al. (2014) Int. J. Syst. Evol. Micr. 64:768-774 doi:10.1099/ijs.0.052944-0
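Whole-genome species definitions are often built on pairwise average nucleotide identity (ANI), with genomes grouped together when their identity exceeds a cut-off. The sketch below shows one simple way to do that grouping (single linkage with a union-find structure). It is an assumption that this resembles the analysis behind the slide; the 0.95 threshold is a commonly quoted rule of thumb rather than a value from this talk, and the isolate names and ANI values are invented.

```python
from itertools import combinations


def group_by_ani(names, ani, threshold=0.95):
    """Single-linkage grouping: join genomes whose pairwise ANI >= threshold.

    `ani` maps frozenset({a, b}) to the identity (0-1) for that pair.
    """
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the group representative (path halving).
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(names, 2):
        if ani[frozenset((a, b))] >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in groups.values())


# Hypothetical isolates and identities, for illustration only.
NAMES = ["iso1", "iso2", "iso3", "iso4"]
ANI = {
    frozenset(("iso1", "iso2")): 0.98,
    frozenset(("iso1", "iso3")): 0.91,
    frozenset(("iso1", "iso4")): 0.90,
    frozenset(("iso2", "iso3")): 0.92,
    frozenset(("iso2", "iso4")): 0.90,
    frozenset(("iso3", "iso4")): 0.97,
}
print(group_by_ani(NAMES, ANI))
```

Single linkage is the simplest choice and can chain distant genomes together through intermediates; a stricter linkage rule may be preferable in practice.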

2013: Dickeya spp.

Differences in metabolic capacity (but ≈ 20% orphan EC activities)

2014: E. coli

Sequenced and annotated ≈ 190 isolates of E. coli

All bacteria environmental, sampled from lysimeters

• Illumina PE sequencing, cost ≈£11k

• Automated annotation: PROKKA

(w/ Fiona Brennan, Florence Abram, NUI Galway)

2014: E. coli

Whole genome-based subspecies classification

[Figure: clustered heatmap of pairwise ANIm values between isolate assemblies (Lys, AW, DSM, and other named contig sets); colour key and histogram (count against value) spanning ANIm 0.90-0.98; group legend: A, B1, B2, C, D, E, F, U, X]

(w/ Fiona Brennan, Florence Abram, NUI Galway)

2014: Campylobacter spp.

≈1034 clinical, animal, food-associated Campylobacter isolates

• Illumina PE sequencing, cost ≈£60k

• Automated annotation: PRODIGAL

(w/ Ken Forbes, Norval Strachan, University of Aberdeen)

2014: Campylobacter spp.

• 15554 ‘gene families’ in 1034 isolates.

• Calculation: 4e12 pairwise protein comparisons!

(w/ Ken Forbes, Norval Strachan, University of Aberdeen)
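To get a feel for the 4e12 figure, the short calculation below works backwards from it. It assumes (this is not stated on the slide) that the count comes from an all-against-all comparison of every predicted protein with every other, so that the number of comparisons is the square of the total protein count.

```python
import math

N_ISOLATES = 1034    # from the slide
COMPARISONS = 4e12   # from the slide


def all_vs_all(n_proteins: int) -> int:
    """Ordered all-against-all comparisons, including self-comparisons."""
    return n_proteins * n_proteins


# Work backwards: total proteins implied by the comparison count,
# then the implied average number of proteins per isolate.
implied_total = math.sqrt(COMPARISONS)
implied_per_isolate = implied_total / N_ISOLATES

print(f"{implied_total:,.0f} proteins in total")
print(f"about {implied_per_isolate:,.0f} proteins per isolate")
```

Under that assumption the figure corresponds to roughly two million proteins overall, or a little under two thousand per isolate.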


So what’s changed?

Everything.

• Cost: £250k → £60 per genome. Now cheaper to sequence than analyse a genome! Offload work from people to software.

• Location: sequencing centre, to benchtop (Nanopore!)

• Speed: sequencing run time can be less than a day

• Data: massive volume increase
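The "£60 per genome" figure can be checked against the approximate project costs quoted earlier in the talk (≈£11k for ≈190 E. coli isolates, ≈£60k for ≈1034 Campylobacter isolates). The helper name below is made up for the example, and the inputs are the rounded values from the slides, so the results are indicative only.

```python
def cost_per_genome(total_cost_gbp: float, n_genomes: int) -> float:
    """Average sequencing cost per genome, in pounds."""
    return total_cost_gbp / n_genomes


ecoli_2014 = cost_per_genome(11_000, 190)     # E. coli project
campy_2014 = cost_per_genome(60_000, 1034)    # Campylobacter project
fold_reduction = 250_000 / 60                 # 2003 genome vs. ~£60 today

print(f"E. coli: £{ecoli_2014:.0f} per genome")
print(f"Campylobacter: £{campy_2014:.0f} per genome")
print(f"Roughly {fold_reduction:,.0f}-fold cheaper than in 2003")
```

Both projects come out just under £60 per genome, which covers sequencing only and not the analysis time the slide warns about.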

Predicting the future is hard...

Su et al. attempted to do it, though:

10,000 prokaryotes in 2015 was an underestimate.

http://sulab.org/2013/06/sequenced-genomes-per-year/

So what’s changed?

Everything.

• Cost: £250k → £60 per genome.

• Location: sequencing centre, to benchtop (Nanopore!)

• Speed: sequencing run time can be less than a day

• Data: massive volume increase. More data ≈ better, but also more challenging.

• Software: more (≠ better...) software for more things

• New experiments: genomes, exomes, variant calling, methylated sequences, STARR-seq, ...

• New applications: diagnostics, epidemic tracking, metagenomics, ...

Sequence first... ask questions later

• “Why?” has sometimes been replaced by “What?”

http://dilbert.com/strip/2000-01-03

“The thesis is not hypothesis driven. Add a hypothesis and refer to it in subsequent chapters.”

More isn’t always better

Deeper sequencing (more reads) ≠ more information or better assembly.

60-80X coverage is the ‘sweet spot’ for bacterial genomes. More reps > more reads!

Conway & Bromage (2011) Bioinformatics 27:479-486 doi:10.1093/bioinformatics/btq697
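Coverage is simply the total number of sequenced bases divided by the genome size, so the 60-80X range translates directly into a read budget. The sketch below assumes a 5 Mbp genome (as for the 2003 genome earlier in the talk) and 150 bp reads; the read length is an illustrative assumption, not a value from the slides.

```python
import math


def coverage(n_reads: int, read_len_bp: int, genome_bp: int) -> float:
    """Mean depth of coverage: sequenced bases / genome size."""
    return n_reads * read_len_bp / genome_bp


def reads_needed(target_coverage: float, genome_bp: int, read_len_bp: int) -> int:
    """Smallest number of reads giving at least target_coverage."""
    return math.ceil(target_coverage * genome_bp / read_len_bp)


GENOME_BP = 5_000_000   # ~5 Mbp bacterial genome
READ_LEN_BP = 150       # assumed read length

low = reads_needed(60, GENOME_BP, READ_LEN_BP)
high = reads_needed(80, GENOME_BP, READ_LEN_BP)
print(f"60X needs {low:,} reads; 80X needs {high:,} reads")
```

Reads beyond this budget are, following the slide's argument, better spent on additional replicates or isolates than on deeper coverage of the same sample.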

Are database annotations reliable?

Automated annotation is essential

The Critical Assessment of Function Annotation (CAFA) project.

Radivojac et al. (2013) Nat. Meth. 10:221-227 doi:10.1038/nmeth.2340

Do biased database annotations matter?

Experimental annotations of proteins are incomplete. Is that important? Tested by simulation, and following databases for three years.

• Yes. It matters.

• Current large scale annotations are meaningful and almost surprisingly reliable.

• The nature and level of data incompleteness, and type of classification model have an effect.

• “Low precision, high recall” (i.e. less discriminating) tools most significantly affected.

Molecular function prediction is usually more reliable than biological process prediction

Jiang et al. (2014) Bioinformatics 30:i609-i616 doi:10.1093/bioinformatics/btu472

Cozzetto et al. (2013) BMC Bioinf. 14(Suppl 3):S1 doi:10.1186/1471-2105-14-S3-S1

CAFA results

The Critical Assessment of Function Annotation (CAFA) 2013 results. (F-measure combines precision and recall)

• You can do better than BLAST.

• Best-performing methods do comparably well.

• Best methods used evolutionary relationships, structure, and expression data.

• Machine Learning methods work best.

Radivojac et al. (2013) Nat. Meth. 10:221-227 doi:10.1038/nmeth.2340
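For readers unfamiliar with the F-measure, it is the harmonic mean of precision and recall (optionally weighted by a factor beta). The sketch below gives the standard formula, plus a helper that takes the best score over a set of (precision, recall) pairs, as one would when sweeping a prediction threshold. The function names and the example values are illustrative and are not taken from the CAFA results.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def f_max(precision_recall_pairs) -> float:
    """Best F1 over several (precision, recall) operating points."""
    return max(f_measure(p, r) for p, r in precision_recall_pairs)


# Hypothetical operating points for one predictor at three thresholds.
POINTS = [(0.9, 0.1), (0.5, 0.5), (0.2, 0.8)]
print(f_max(POINTS))
```

Because it is a harmonic mean, the score is pulled towards the weaker of the two values: a predictor with high precision but very low recall scores poorly.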

More Isn’t Always Better

Statistical inference on large datasets requires extra care.

Hypothesis tests may incorrectly reject null hypotheses (B-H)
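Reading "B-H" on this slide as the Benjamini-Hochberg procedure (an interpretation, since the slide does not spell it out), the sketch below shows how that correction controls the false discovery rate when many tests are run at once. The p-values are invented for the example.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a list of booleans: True where the null is rejected at FDR alpha.

    Sort the p-values, find the largest rank k with p_(k) <= alpha * k / m,
    and reject the k smallest p-values.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= alpha * rank / m:
            max_rank = rank
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            reject[idx] = True
    return reject


# Hypothetical p-values from ten tests.
PVALUES = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

naive = sum(p < 0.05 for p in PVALUES)
corrected = sum(benjamini_hochberg(PVALUES))
print(f"uncorrected: {naive} 'significant'; after B-H: {corrected}")
```

With these toy values, five tests pass an uncorrected 0.05 threshold but only two survive the correction, which is the kind of gap the slide is warning about.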

More Isn’t Always Better

• More tests → random effect seems ‘real’

• May be considering a large set of inferences simultaneously (and yet not notice!): “p-hacking”, “Researcher Degrees of Freedom”

“good scientists are skilled at looking hard enough and subsequently coming up with good stories (plausible even to themselves, as well as to their colleagues and peer reviewers) to back up any statistically-significant comparisons they happen to come up with.” Gelman & Loken (2013) “The Garden of Forking Paths”

(“Data-dredging”)

True for all large data analyses: genomics, metabolomics, proteomics, health screening, finding terrorists, etc.

Xia et al. (2012) Metabolomics 9:280-299 doi:10.1007/s11306-012-0482-9

Broadhurst & Kell (2006) Metabolomics 2:171-196 doi:10.1007/s11306-006-0037-z

Genome-Scale Predictions

• Imagine a paper describing a predictor for protein functional class (e.g. pathogen effector)

• The paper reports sensitivity = 0.95, FPR = 0.01

• We run the predictor on 20,000 proteins in an organism

• It predicts 130 members of the class. How many of them are likely to be true positives?

• We need a baseline level of that class (fX) in the genome to determine this.

• Estimate ≈ 200 in gene complement, so fX = 200/20,000 = 0.01
  • fX = 0.01 ⇒ P(class|+ve) = (0.95 × 0.01) / (0.95 × 0.01 + 0.01 × 0.99) = 0.490 ≈ 0.5: 65 TP

Pritchard & Broadhurst (2014) Meth. Mol. Biol. 1127:53-64 doi:10.1007/978-1-62703-986-4_4
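The calculation on this slide is an application of Bayes' theorem, and can be sketched as below. The function name is made up for the example; the inputs for the first case are the slide's values. The second case reuses the same base rate with the sensitivity and FPR quoted for the literature example, which is an assumption made here purely for comparison.

```python
def positive_predictive_value(sensitivity: float, fpr: float, base_rate: float) -> float:
    """P(class | positive prediction) by Bayes' theorem."""
    true_pos = sensitivity * base_rate      # P(+ve and in class)
    false_pos = fpr * (1.0 - base_rate)     # P(+ve and not in class)
    return true_pos / (true_pos + false_pos)


BASE_RATE = 200 / 20_000   # estimated class members / proteins in the organism

ppv = positive_predictive_value(0.95, 0.01, BASE_RATE)
expected_true = 130 * ppv
print(f"P(class|+ve) = {ppv:.3f}; about {expected_true:.0f} of 130 are true positives")

# Same base rate (assumed), with sensitivity 0.71 and FPR 0.15.
ppv_weaker = positive_predictive_value(0.71, 0.15, BASE_RATE)
print(f"P(class|+ve) = {ppv_weaker:.3f}")
```

Even an apparently excellent predictor is right only about half the time when the class is rare, and a modest increase in false positive rate drives the proportion of correct calls down to a few percent.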


A Literature Example

Reported sensitivity ≈ 0.71, FPR ≈ 0.15

Arnold et al. (2009) PLoS Pathog. 5:e1000376 doi:10.1371/journal.ppat.1000376

Big Data: New Problems

• Lots of high throughput experiments, and large datasets (but even more small datasets)

• Historically ill-formed data (sequences in Word documents, BLAST results pasted into notebooks).

• How do we connect all this data in a productive way?

This section influenced heavily by C. Titus Brown and Philip Bourne

Big Data: New Problems

• Data management. Too often: “Goodbye to the student is goodbye to the data”

• Persistence of data resources (link rot, database entropy)

http://www.phdcomics.com/comics/archive.php?comicid=382

Big Data: New Problems

• How reproducible are computational results?

• Software/data versions prevent exact reproduction: approximately 280h to reproduce one paper - in the same lab!

Garijo et al. (2013) PLoS One doi:10.1371/journal.pone.0080278

http://www.slideshare.net/pebourne/sib0114

Big Data: New Problems

Maybe we can get away with all of this in a traditional model of science publishing...

http://www.slideshare.net/c.titus.brown/2015-baltiandbioinformatics

Big Data: New Problems

...but lots of biological data doesn’t make sense except in the light of other biological data.

http://www.slideshare.net/c.titus.brown/2015-baltiandbioinformatics

Big Data: New Solutions

Everyone could be better off with collaboration and data sharing.

What is winning: career progression, or feeding people?

(still competing, but on analysis and insight, not on who holds what data...)

http://www.slideshare.net/c.titus.brown/2015-baltiandbioinformatics

Big Data: New Solutions

Data quality ≈ data trust:

• Sustainable: storage, archiving, maintenance

• Findable: “where is the dataset?”, “is it available?”

• Queryable: “is X in the dataset?”

• Analysable: metadata, annotation

http://www.slideshare.net/pebourne/sib0114

Big Data: New Solutions

Interoperable digital assets: datasets, software, lab books, etc.

• Uniquely identified (DOI, PMID, etc.)

• Provenance (version and access control)

• Open standards - what data to keep, how to organise it: MINSEQE (sequencing), MIAME (microarray), MIASE (simulation), MIAPE (proteomics), MIARE (RNAi), SBML, GFF3, SAM/BAM/CRAM, etc.

• Sustainable infrastructure for biological information (ELIXIR, “The Commons” [US], RDF, Open Data)

http://www.slideshare.net/pebourne/sib0114

https://pebourne.wordpress.com/2014/10/07/the-commons/

Big Data: New Solutions

Too much software is difficult to use for experts, or unusable for non-experts.

Veretnik et al. (2008) PLoS Comp. Biol. doi:10.1371/journal.pcbi.1000136

Big Data: New Solutions

Workflows, pipelines, and service integrative frameworks

Cock et al. (2014) Methods Mol. Biol. 1127:3-15 doi:10.1007/978-1-62703-986-4_1

Cock et al. (2013) PeerJ 1:e167 doi:10.7717/peerj.167

http://galaxy-community.org.uk/

Big Data: New Solutions

Sometimes new software is needed. Writing good software is difficult, and expensive.

http://www.theregister.co.uk/2015/01/22/us_military_finds_f35_software_is_a_buggy_mess/

Big Data: New Solutions

Not enough software engineers to go round: train what we have.

Programming literacy, computational thinking: versioned, readable, maintainable code.

http://www.software.ac.uk/

http://software-carpentry.org/

http://datacarpentry.org/


Cheap Sequencing In The Field

Diagnostics and epidemic tracking by sequencing

Global Microbial Identifier (GMI, http://www.globalmicrobialidentifier.org): global system of databases for microbial/disease identification and diagnostics.

Quick et al. (2014) BMJ Open 4:e006278 doi:10.1136/bmjopen-2014-006278

Sequencing In The Field

Live prediction for epidemiology?

(Peter Skelsey, JHI)

Sequence Isn’t Everything

Organisms are dynamic, and multi-scale

• Context: epigenetics, tissue differentiation, mesoscale systems, symbiosis, etc.

• Phenotypic plasticity: responses to environment - stress, temperature, etc.

The Phytobiome

Phytobiome: the plant, and its associated microbial community

• American Phytopathological Society “Phytobiomes Initiative”

• “a complete systems approach that spans foundational to applied science focused on downstream application”

• We are not at war with all microbes. . .

https://www.apsnet.org/members/outreach/ppb/phytobiomes

Genomes Are Parts Lists

We know (some of) the bits that make up the machinery. . .

Flux Balance Analysis

Flux Balance Analysis: constraint-based static representation of metabolism (RNA/ChIP-seq adds dynamics to models)

• Set upper, lower bounds to reaction rate, define objective phenotype (biomass, target flux profile)

• in silico knockouts; viable states; nutrient usage

• A basis for synthetic biology and engineering

Flux Balance Analysis

Dickeya: 29 × FBA, host range ≈ nutrient-dependent growth

Also transposon mutant libraries

(w/ Sonia Humphris, Ian Toth, JHI)

Plant-Microbe Interactions Are Systems

Components, interactions, dynamics etc. = systems biology

Interaction creates a third system from host and microbe

Pritchard & Birch (2014) Mol. Plant. Pathol. 15:865-870 doi:10.1111/mpp.12210

Pritchard & Birch (2011) Plant Sci. 180:584-603 doi:10.1016/j.plantsci.2010.12.008

Plant-Microbe Interactions Are Systems

Components, interactions, dynamics etc. = systems biology

Interaction creates a third system from host and microbe

[Figure: schematic of the host-microbe interaction model (bulk and local microbe, PAMP, PRR/PRR*, external and internalised effector, R protein/R protein*, callose), with simulated pathogen and callose time courses by host type (No Response, PTI, PTI+ETS, PTI+ETS+ETI); axes: time (0-200) against arbitrary units (0-1)]

Pritchard & Birch (2014) Mol. Plant. Pathol. 15:865-870 doi:10.1111/mpp.12210

Pritchard & Birch (2011) Plant Sci. 180:584-603 doi:10.1016/j.plantsci.2010.12.008

Integrate Models and Data

Integration of models and datasets still a challenge

• Models at different scales

• Kinetic, metabolomic, proteomic, transcriptomic, genomic datasets

Hartmann & Schreiber (2014) Front. Bioeng. Biotechnol. 2:91 doi:10.3389/fbioe.2014.00091

Types of Model

• Combining data: models at different scales.

• Information required/produced depends on model type.

• Size/detail trade-off

Hartmann & Schreiber (2014) Front. Bioeng. Biotechnol. 2:91 doi:10.3389/fbioe.2014.00091

Synthetic Biology

Engineering new response modes into crops.

Gurr & Rushton (2005) Trends Biotech. 23:283-290 doi:10.1016/j.tibtech.2005.04.009

Genome Editing

TALENs and CRISPR/Cas9s

http://www.lifetechnologies.com/

http://www.umassmed.edu/xuelab

Trait Stacking

For resistance and other beneficial traits (yield, nutrients, biofuels)

Vanholme et al. (2010) Trends Biotechnol. 28:543-547 doi:10.1016/j.tibtech.2010.07.008

Engineering Soil-Beneficial Microbes

Refactoring of Klebsiella nitrogen fixation:

Temme et al. (2012) Proc. Natl. Acad. Sci. USA 109:7085-7090 doi:10.1073/pnas.1120788109

Engineering New Biology

dCas9 logic circuits, integrating with host regulation

Nielsen & Voigt (2014) Mol. Syst. Biol. 10:763 doi:10.15252/msb.20145735


Data

Sequencing is ever cheaper and more productive:

• Very large datasets

• More information (with good planning)

• Challenges for data storage and sharing

• Challenges for analysis (“why” vs. “what”)

• Challenges for software, accessibility (workflows, multidisciplinary training)

• Interdisciplinary collaboration and data integration will be essential

Systems/Synthetics

A parts list only gets us so far:

• Cells are dynamic biophysical systems

• Organisms are dynamic cellular systems

• ‘Real’ plant systems include the phytobiome

• Systems biology essential to understand plant-microbe interactions

• Synthetic biology promises to be a powerful tool to improve plant health, nutrition, etc.

• BUT: ethical issues around deployment of synthetic systems

Conclusions

Acknowledgements

James Hutton Institute: Paul Birch, Emma Campbell, Peter Cock, Ingo Hein, Nicola Holden, Sonia Humphris, Florian Jupe, Ian Toth

NUI Galway: Florence Abram, Fiona Brennan

University of Aberdeen: Ken Forbes, Norval Strachan

University of Alberta: David Broadhurst

SASA: Vincent Mulholland, Gerry Saddler

Fera: Valerie Bertrand, John Elphinstone, Rachel Glover, Neil Parkinson

University of Munster: Martina Bielaszewska, Helge Karch

University of Salford: Natalie Ferry, Ryan Joynson

And many others!