2014 mmg-talk

NON-MODEL ORGANISMS AND DATA-INTENSIVE BIOLOGY C. Titus Brown Assistant Professor MMG / CSE


Transcript of 2014 mmg-talk

Page 1: 2014 mmg-talk

NON-MODEL ORGANISMS AND DATA-INTENSIVE BIOLOGY

C. Titus Brown

Assistant Professor

MMG / CSE

Page 2: 2014 mmg-talk

Outline

• The Molgulid story: investigating non-model ascidians (this is the biology)

• Meditations on data analysis.

• Methods, methods, methods.

• Training, training, training.

• Concluding thoughts

Page 3: 2014 mmg-talk

The Molgula Story – an int’l collaboration

Elijah Lowe (MSU; Naples?)

Billie Swalla (UW, BEACON)

Lionel Christiaen (NYU)

Claudia Racioppi (Naples; NYU)

Page 4: 2014 mmg-talk

…to the urochordates we go!

Putnam et al., 2008, Nature. Modified from Swalla 2001

Page 5: 2014 mmg-talk

Filter feeding adults

Molgula oculata

Molgula occulta

Molgula oculata Ciona intestinalis

Elijah Lowe; collaboration w/Billie Swalla

Page 6: 2014 mmg-talk

Challenging organisms to work on!

Molgula occulta & M. oculata:

• Only spawn ~1 month out of the year

• Located off the northern coast of France

• Hybrids not found outside of lab conditions

• Species cannot be cultured

• Wet lab techniques are not fully developed for these species

• No genomic resources (as of 2008).

Page 7: 2014 mmg-talk

Billie Swalla, Nadine Peyrieras, Alberto Stolfi

Page 8: 2014 mmg-talk

Tail loss and notochord

a) M. oculata b) hybrid (occulta egg x oculata sperm) c) M. occulta

Notochord cells in orange

Swalla, B. et al. Science, Vol 274, Issue 5290, 1205-1208, 15 November 1996

Page 9: 2014 mmg-talk

Molgula clades – tail loss is derived

Page 10: 2014 mmg-talk

Solitary ascidians have determinate and invariant cleavage.

Some species have colored cytoplasms.

(Boltenia villosa)

The cell lineage is very similar in Ciona, Phallusia, Halocynthia roretzi & Molgula oculata.

Page 11: 2014 mmg-talk

Molgula occidentalis

Ciona intestinalis

Page 12: 2014 mmg-talk
Page 13: 2014 mmg-talk

Notochord formation (convergence & extension) in ascidians is highly conserved.

Jiang and Smith, 2007

Ciona savignyi

Page 14: 2014 mmg-talk

Molgula oculata notochord (40 cells, converged & extended)

Molgula occulta no notochord (20 cells, not converged & extended)

Hybrid notochord (20 cells, converged & extended)

Notochord Formation in Molgulids

Swalla and Jeffery, 1996

Page 15: 2014 mmg-talk

First we applied mRNAseq…

Lowe et al., in review (PeerJ). https://peerj.com/preprints/505/

Page 16: 2014 mmg-talk

…which gave us entire transcriptomes…

Lowe et al., in review (PeerJ). https://peerj.com/preprints/505/

Page 17: 2014 mmg-talk

…then we sequenced their genomes...

• 3 species:

Molgula occidentalis (tailed) – “MOXI”

Molgula oculata (tailed) – “MOCU”

Molgula occulta (tail-less) – “MOCC”

• 3 lanes: 300-400 bp; 650-750 bp; 900-1000 bp

• ≥ 200X coverage each genome

De novo assembly by Elijah Lowe (MSU)

Stolfi et al., eLife, 2014; http://dx.doi.org/10.7554/eLife.03728

Page 18: 2014 mmg-talk

…which gave us most of their genes (and regulatory elements?)

Stolfi et al., eLife, 2014; http://dx.doi.org/10.7554/eLife.03728

Genome assembly statistics:

Page 19: 2014 mmg-talk

Shift in differentially expressed genes from gastrulation to neurulation

M. ocu vs. M. occ gastrula M. ocu vs. M. occ neurula

Differentially expressed during neurulation in M. ocu vs M. occ

Elijah Lowe

Page 20: 2014 mmg-talk

Notochord gene expression similar to tailed species

Elijah Lowe

Page 21: 2014 mmg-talk

Heterochronic Shift in Molgulidae Development

*79 genes examined across six species

Page 22: 2014 mmg-talk

Transgenics of reporter constructs

(“Mutual intelligibility” across ~350 my)

Stolfi et al., eLife, 2014; http://dx.doi.org/10.7554/eLife.03728

Page 23: 2014 mmg-talk

Prickle is a key part of the notochord program.

Veeman, M., et al., 2007

• Planar cell polarity (PCP) pathway

• Involved in convergence and extension

Page 24: 2014 mmg-talk

Prickle expressed in notochord cells of tailless ascidians.

Mita et al., Zool. Sci., 2010

M. occulta gastrulation

Ciona intestinalis

Satoh, Nature Reviews Genetics 4, 2003

Figure labels: FGF, Bra, Pk

Elijah Lowe

Page 25: 2014 mmg-talk

(Re)booting the Molgula

• Determined conservation of cardiopharyngeal developmental program, despite shifts in cis-regulatory sequences (Stolfi et al., eLife, 2014).

• Examining heterochronic shifts in developmental timing (tail loss) (Maliska et al., in preparation).

• Connecting evolutionary shifts in developmental gene regulatory networks with conserved molecular profiles (Lowe et al, submitted; Lowe et al., in preparation).

Page 26: 2014 mmg-talk

More thoughts on Molgula

• One grad student, two transcriptomes, three genomes, four years…

• Genomic resources are enabling a sprawling international collaboration (UW/BEACON, MSU/BEACON, NYU, Naples, Paris)

• Methods development is key!

Page 27: 2014 mmg-talk

How Science Works

Page 28: 2014 mmg-talk

Luckily, data analysis is cheap and easy!

Page 29: 2014 mmg-talk

Err, well, actually…

http://www.pixelpog.com/ftpimages/GnomesAttack.jpg

Page 30: 2014 mmg-talk

It is now easy to generate sequencing data sets of such a size and scale that the first-round analysis cannot even be completed.

Page 31: 2014 mmg-talk

My research: theoretical => applied solutions to scale.

Page 32: 2014 mmg-talk

My research: three methods.

1. Adaptation of a suite of probabilistic data structures for representing set membership and counting (Bloom filters and the Count-Min Sketch). (Zhang et al., PLoS One, 2014.)

2. An online streaming approach to lossy compression of sequencing data. (Brown et al., arXiv, 2012; Howe et al., PNAS, 2014.)

3. Compressible de Bruijn graph representation for assembly. (Pell et al., PNAS, 2012.)
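As a rough illustration of the counting structure named in method 1, here is a minimal Count-Min Sketch applied to k-mers. This is a sketch under stated assumptions, not the implementation from the cited papers: the class name, the table sizes, the use of SHA-256 as the hash, and the toy sequence are all chosen here for illustration.

```python
import hashlib


class CountMinSketch:
    """Approximate counter: may overestimate a count, never underestimates."""

    def __init__(self, width=10007, depth=4):
        # width and depth are illustrative; real uses size them to the data.
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _slots(self, item):
        # One slot per table, derived from a salted hash of the item.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item):
        for row, col in self._slots(item):
            self.tables[row][col] += 1

    def count(self, item):
        # Collisions only inflate counters, so the minimum is the best estimate.
        return min(self.tables[row][col] for row, col in self._slots(item))


def kmers(sequence, k):
    """Yield every overlapping k-mer in a sequence."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


sketch = CountMinSketch()
for kmer in kmers("ATGGCATGGC", 4):
    sketch.add(kmer)
print(sketch.count("ATGG"))  # true count is 2; the estimate is at least that
```

Memory use is fixed by `width * depth` regardless of how many distinct k-mers are seen, which is the property that makes this kind of structure attractive for very large data sets. The price is that counts can be inflated by hash collisions.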

Page 33: 2014 mmg-talk

Method #2 - Digital normalization

(a computational version of library normalization)

Suppose you have a dilution factor of A (10) to B (1). To get 10x of B you need to get 100x of A! Overkill!!

This 100x will consume disk space and, because of errors, memory.

We can discard it for you…
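The arithmetic on this slide can be spelled out in a few lines. The function and variable names are made up for illustration, and the calculation assumes A and B have the same sequence length, so that coverage scales directly with abundance.

```python
def coverage_of_abundant(abundance_ratio, target_coverage_of_rare):
    """Coverage the abundant species reaches once the rare one hits its target."""
    return abundance_ratio * target_coverage_of_rare


ratio_a_to_b = 10   # A is 10 times as abundant as B
target_b = 10       # we want 10x coverage of B

coverage_a = coverage_of_abundant(ratio_a_to_b, target_b)
excess_a = coverage_a - target_b  # coverage of A beyond what was needed

print(coverage_a, excess_a)  # 100 90
```

In this toy case, 90 of the 100x sequenced for A is redundant, and that redundant portion is what normalization aims to discard.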

Page 34: 2014 mmg-talk

Digital normalization

Page 35: 2014 mmg-talk

Digital normalization

Page 36: 2014 mmg-talk

Digital normalization

Page 37: 2014 mmg-talk

Digital normalization

Page 38: 2014 mmg-talk

Digital normalization

Page 39: 2014 mmg-talk

Digital normalization

Page 40: 2014 mmg-talk

Digital normalization retains information, while discarding data and errors

Page 41: 2014 mmg-talk

Digital normalization approach

A digital analog to cDNA library normalization, diginorm:

• Streaming & single pass: looks at each read at most once;

• Does not “collect” the majority of errors;

• Keeps all low-coverage reads;

• Smooths out coverage of sequencing.

=>

Enables analyses that are otherwise completely impossible.
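The streaming idea above can be sketched as follows. This follows my understanding of the approach described in Brown et al. (arXiv, 2012): estimate each read's coverage from the median count of its k-mers, and keep the read only if that estimate is still below a cutoff. The function names, the tiny `k`, the cutoff, and the toy reads are assumptions for illustration. An exact `Counter` stands in for the probabilistic counting structure a real implementation would use.

```python
from collections import Counter
from statistics import median


def kmers(sequence, k):
    """Return every overlapping k-mer in a sequence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def digital_normalization(reads, k=4, cutoff=2):
    """Yield reads whose estimated coverage is still below `cutoff`."""
    counts = Counter()  # exact counts; a stand-in for a probabilistic counter
    for read in reads:  # single pass: each read is examined once
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k: no coverage estimate possible
        estimated_coverage = median(counts[km] for km in read_kmers)
        if estimated_coverage < cutoff:
            # Only kept reads update the counts, so discarded reads
            # (and their errors) never enter the table.
            for km in read_kmers:
                counts[km] += 1
            yield read


reads = ["ATGGCATT"] * 5 + ["CCGTAAGC"]
kept = list(digital_normalization(reads))
print(kept)  # ['ATGGCATT', 'ATGGCATT', 'CCGTAAGC']
```

The five copies of the abundant read are reduced to two, while the single low-coverage read is kept. One simplification here: reads shorter than `k` are dropped, because there is nothing to estimate their coverage from.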

Page 42: 2014 mmg-talk

Witness the power of this fully operational set of sequence analysis methods:

1. Assembling soil metagenomes.

Howe et al., PNAS, 2014 (w/Tiedje)

2. Understanding bone-eating worm symbionts.

Goffredi et al., ISME, 2014.

3. An ultra-deep look at the lamprey transcriptome.

Scott et al., in preparation (w/Li)

4. Understanding development in Molgulid ascidians. Stolfi et al, eLife 2014; etc.

Page 43: 2014 mmg-talk

Open science

Guiding principle: methods that aren’t broadly available aren’t very useful.

(=> Preprints, open source code, blog posts, Twitter, training, etc.)

Estimated ~1000 users of our software.

Diginorm now included in Trinity software from Broad Institute (~10,000 users)

Illumina TruSeq long-read technology now incorporates our approach (~100,000 users)

Page 44: 2014 mmg-talk

Current research:

Compressive algorithms for sequence analysis

Can we enable and accelerate sequence-based inquiry by making all basic analysis easier and some analyses possible?

Page 45: 2014 mmg-talk

The data challenge in biology

In 5-10 years, we will have nigh-infinite data. (Genomic, transcriptomic, proteomic, metabolomic, …?)

We currently have no good way of querying, exploring, investigating, or mining these data sets, especially across multiple locations.

Moreover, most data is unavailable until after publication…

…which, in practice, means it will be lost.

Page 46: 2014 mmg-talk

Infrastructure: distributed graph database server

Page 47: 2014 mmg-talk

“Data Intensive Biology”

• Increasingly, relevant data is out there or can be generated fairly inexpensively.

• But what does the data mean? How can we get it to yield putative answers? How can we integrate it with other people’s data?

• Virtually nobody in biology is trained to do this.

• Virtually nobody in biology is being trained in how to do this.

Page 48: 2014 mmg-talk

Summer NGS workshop (2010-2017)

Page 49: 2014 mmg-talk

Perspectives on training

• Prediction: The single biggest challenge facing biology over the next 20 years is the lack of data analysis training (see: NIH DIWG report)

• Data analysis is not turning the crank; it is an intellectual exercise on par with experimental design or paper writing.

• Training is systematically undervalued in academia (!?)

Page 50: 2014 mmg-talk

Training - looking forward

• NIH “Big Data to Knowledge” (BD2K) will be investing ~$20-40m in training each year (my estimate). Biomedical science increasingly depends on data analysis.

• Moore, Sloan Foundations are investing heavily in training (see: Software Carpentry)

• NSF BIO Centers have stated that “training is the second most important problem that all of us have”.

Page 51: 2014 mmg-talk

My training efforts – looking backwards

• Approximately $600k of my funding has been received for developing and implementing training.

• “Students” have included about a dozen associate & full professors; over 120 alumni of summer course in total.

• Invited talks, collaborations, problem discovery, networking, interaction with program managers, and volleyball.

• Strong pushback from every level of the administration at MSU!? But enthusiastic support from many research-active faculty.

(Investing in data science should be part of MMG’s vision for the future…)

Page 52: 2014 mmg-talk

About those STEM career paths…

Quote:

“…foisting graduates upon a carcass-strewn jobless dystopia.”

Dr. Rebecca Schuman, https://chroniclevitae.com/news/702-crimes-against-dissertation-humanity

Page 53: 2014 mmg-talk

Want a faculty job?

http://www.ascb.org/ascbpost/index.php/compass-points/item/285-where-will-a-biology-phd-take-you

Page 54: 2014 mmg-talk

Want a faculty job? Don’t count on it.

< 10% of entering PhD students will become tenure track faculty.*

53% rank research professorships as their desired career.*

(Optimism is great! But…)

Note: universities have little provision for permanent non-tenure-track positions.

* http://www.ascb.org/ascbpost/index.php/compass-points/item/285-where-will-a-biology-phd-take-you

Page 55: 2014 mmg-talk

(Sorry. I thought you should all know.)

Page 56: 2014 mmg-talk

Alternatives to tenure track.

PhD research prepares you marvelously for tackling an immense range of problems!!

Biotech, startups, research institutes, teaching, science communication…

(PhD advisors generally do not do such a good job of preparing you for non-tenure track positions.)

Papers are necessary to graduate but insufficient to get you a non-academic job afterwards.

Page 57: 2014 mmg-talk

Wrapping it all up

• There are great opportunities in our increasing ability to generate data!

• Data analysis is rapidly becoming a first class citizen in biology.

• We aren’t training people in data analysis approaches.

• …this would help them find jobs, too.

Page 58: 2014 mmg-talk

Funding

Page 59: 2014 mmg-talk

Students and postdocs

Former:

• Dr. Jason Pell (Google NYC)

• Asst Professor Adina Howe (Iowa State)

Current:

• Dr. Likit Preeyanon (MMG)

• Elijah Lowe (CSE)

• Qingpeng Zhang (CSE)

• Jaron Guo (MMG)

• Camille Scott (CSE)

• Michael Crusoe

• Luiz Irber (CSE)

• Dr. Sherine Awad (MMG)

Page 60: 2014 mmg-talk

Support network

Dr. Vivien Bonazzi, my fairy NIH program officer.

Dr. Jim Tiedje, He Who Comes with Sequence

Page 61: 2014 mmg-talk

Support network

Page 62: 2014 mmg-talk

Co-conspirators / family

Thanks!

(1994)