2013 bms-retreat-talk

Data-intensive approaches to investigating non-model organisms. C. Titus Brown [email protected] Assistant Professor, Microbiology and Molecular Genetics; Computer Science and Engineering; BEACON; Quantitative Biology Initiative

Transcript of 2013 bms-retreat-talk

Page 1: 2013 bms-retreat-talk

Data-intensive approaches to investigating non-model organisms
C. Titus Brown
[email protected]
Assistant Professor
Microbiology and Molecular Genetics; Computer Science and Engineering; BEACON; Quantitative Biology Initiative

Page 2: 2013 bms-retreat-talk

Outline
• My research!
• Opportunities for computational science training
• More unsolicited advice

Page 3: 2013 bms-retreat-talk

Acknowledgements

Lab members involved

• Adina Howe (w/Tiedje)
• Jason Pell
• Arend Hintze
• Rosangela Canino-Koning
• Qingpeng Zhang
• Elijah Lowe
• Likit Preeyanon
• Jiarong Guo
• Tim Brom
• Kanchan Pavangadkar
• Eric McDonald

Collaborators

• Jim Tiedje, MSU
• Erich Schwarz, Caltech / Cornell
• Paul Sternberg, Caltech
• Robin Gasser, U. Melbourne
• Weiming Li
• Hans Cheng

Funding

USDA NIFA; NSF IOS; BEACON; NIH.

Page 4: 2013 bms-retreat-talk

My interests

I work primarily on organisms of agricultural, evolutionary, or ecological importance, which tend to have poor reference genomes and transcriptomes. Focus on:

• Improving assembly sensitivity to better recover genomic/transcriptomic sequence, often from “weird” samples.

• Scaling sequence assembly approaches so that huge assemblies are possible and big assemblies are straightforward.

• “Better science through superior software”

Page 5: 2013 bms-retreat-talk

There is quite a bit of life left to sequence & assemble.

http://pacelab.colorado.edu/

Page 6: 2013 bms-retreat-talk

“Weird” biological samples:
• Single genome: hard to sequence DNA (e.g. GC/AT bias)
• Transcriptome: differential expression!
• High polymorphism data: multiple alleles
• Whole genome amplified: often extreme amplification bias
• Metagenome (mixed microbial community): differential abundance within community.

Page 7: 2013 bms-retreat-talk

Single genome assembly is already challenging --

Page 8: 2013 bms-retreat-talk

Once you start sequencing metagenomes…

Page 9: 2013 bms-retreat-talk

DNA sequencing
• Observation of actual DNA sequence
• Counting of molecules

Image: Werner Van Belle

Page 10: 2013 bms-retreat-talk

Fast, cheap, and easy to generate.

Image: Werner Van Belle

Page 11: 2013 bms-retreat-talk

New problem: data analysis & integration!
• Once you can generate virtually any data set you want…
• …the next problem becomes finding your answer in the data set!
• Think of it as a gigantic NSA treasure hunt: you know there are terrorists out there, but to find them you have to hunt through 1 bn phone calls a day…

Page 12: 2013 bms-retreat-talk

“Heuristics”
• What do computers do when the answer is either really, really hard to compute exactly, or actually impossible?

• They approximate! Or guess!

• The term “heuristic” refers to a guess, or shortcut procedure, that usually returns a pretty good answer.

Page 13: 2013 bms-retreat-talk

Often explicit or implicit tradeoffs between compute “amount” and quality of result

http://www.infernodevelopment.com/how-computer-chess-engines-think-minimax-tree
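
The tradeoff can be made concrete with a toy game instead of chess. The sketch below is an illustration only, not code from the talk: it runs a depth-limited minimax search over a simple stone-taking game. The game itself, the names `minimax` and `search`, and the choice to return 0 ("unknown") when the depth budget runs out are all assumptions made for brevity. A deeper search visits more positions and returns a more definite answer.

```python
def minimax(stones, depth, counter):
    """Score the position for the player to move: +1 win, -1 loss, 0 unknown.

    Hypothetical toy game: players alternate taking 1-3 stones, and whoever
    takes the last stone wins. `counter` is a one-element list used to count
    visited positions (the "amount" of compute).
    """
    counter[0] += 1
    if stones == 0:
        return -1  # the opponent just took the last stone
    if depth == 0:
        return 0   # out of budget: guess "unknown" rather than search on
    best = -1
    for take in (1, 2, 3):
        if take <= stones:
            # The opponent's score, negated, is our score for this move.
            best = max(best, -minimax(stones - take, depth - 1, counter))
    return best


def search(stones, depth):
    """Return (score, positions_visited) for a search limited to `depth` plies."""
    counter = [0]
    return minimax(stones, depth, counter), counter[0]


if __name__ == "__main__":
    for depth in (1, 2, 3, 5):
        score, visited = search(5, depth)
        print(f"depth={depth} score={score} positions visited={visited}")
```

With five stones, a one-ply search cannot tell who wins and reports 0, while a search deep enough to reach the end of the game finds the winning move at the cost of visiting more positions.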

Page 14: 2013 bms-retreat-talk

My actual research focus

What we do is think about ways to get computers to play chess better, by:
• Identifying better ways to guess;
• Speeding up the guessing process;
• Improving people’s ability to use the chess playing computer

Now, replace “play chess” with “analyze biological data”...

Page 15: 2013 bms-retreat-talk

My actual research focus…

We build tools that help experimental biologists work efficiently and correctly with large amounts of data, to help answer their scientific questions.

This touches on many problems, including:
• Computational and scientific correctness.
• Computational efficiency.
• Cultural divides between experimental biologists and computational scientists.
• Lack of training (biology and medical curricula devoid of math and computing).

Page 16: 2013 bms-retreat-talk

Not-so-secret sauce: “digital normalization”

• One primary step of one type of data analysis becomes 20-200x faster, 20-150x “cheaper”.

Page 17: 2013 bms-retreat-talk

Approach: Digital normalization (a computational version of library normalization)

Suppose you have a dilution factor of A (10) to B (1). To get 10x of B you need to get 100x of A! Overkill!!

This 100x will consume disk space and, because of errors, memory.

We can discard it for you…
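
The arithmetic behind that example can be written out. This is a small sketch with hypothetical function names (`required_coverage`, `excess_coverage`), assuming that sequencing depth scales linearly with relative abundance:

```python
def required_coverage(target_rare, abundance_ratio):
    """Coverage the abundant component (A) reaches by the time the rare
    component (B) reaches `target_rare`, if A is `abundance_ratio` times
    more abundant than B."""
    return target_rare * abundance_ratio


def excess_coverage(target_rare, abundance_ratio):
    """Coverage of A beyond the target, i.e. the redundant part that a
    normalization step could discard."""
    return required_coverage(target_rare, abundance_ratio) - target_rare


if __name__ == "__main__":
    # A:B = 10:1, and we want 10x coverage of B.
    print(required_coverage(10, 10))  # 100: coverage of A
    print(excess_coverage(10, 10))    # 90: redundant coverage of A
```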

Page 18: 2013 bms-retreat-talk

Digital normalization

Page 24: 2013 bms-retreat-talk

Digital normalization approach

A digital analog to cDNA library normalization, diginorm:

• Is single pass: looks at each read only once;

• Does not “collect” the majority of errors;

• Keeps all low-coverage reads;

• Smooths out coverage of regions.
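
The properties above can be illustrated with a minimal sketch. It assumes the median k-mer abundance rule: a read is kept only if the median count of its k-mers, among reads kept so far, is below a coverage cutoff. The function names, the default `k` and `cutoff`, and the exact `Counter` used for counting are assumptions of this sketch, not the production implementation; at real data sizes a memory-bounded approximate counter would be needed.

```python
from collections import Counter
from statistics import median


def kmers(read, k):
    """Yield every overlapping k-mer in a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def digital_normalization(reads, k=20, cutoff=20):
    """Single-pass sketch: keep a read only if its estimated coverage so far
    is below `cutoff`.

    Coverage is estimated as the median count of the read's k-mers among the
    reads kept so far. An exact Counter is used for clarity (an assumption of
    this sketch).
    """
    counts = Counter()
    kept = []
    for read in reads:                      # each read is examined once
        read_kmers = list(kmers(read, k))
        if not read_kmers:
            continue                        # read shorter than k
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)       # only kept reads add k-mers
    return kept


if __name__ == "__main__":
    abundant = "ACGTTGCAAGCTTAGGCTAACCGGTATCGA"   # seen 50 times
    rare = "T" * 30                               # seen once
    reads = [abundant] * 50 + [rare]
    kept = digital_normalization(reads, k=10, cutoff=5)
    print(len(reads), "reads in,", len(kept), "reads kept")
```

In the usage example, the highly covered read stops being kept once its estimated coverage reaches the cutoff, while the read seen only once is retained. Because discarded reads never add their k-mers to the counts, most of the errors carried by redundant reads are never stored.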

Page 25: 2013 bms-retreat-talk

http://en.wikipedia.org/wiki/JPEG

Lossy compression

Page 30: 2013 bms-retreat-talk

Restated:

Can we use lossy compression approaches to make downstream analysis faster and better? (Yes.)

~2 GB – 2 TB of single-chassis RAM

Page 31: 2013 bms-retreat-talk

Soil metagenome assembly
• Observation: 99% of microbes cannot easily be cultured in the lab. (“The great plate count anomaly”)
• Many reasons why you can’t or don’t want to culture:
  • Syntrophic relationships
  • Niche-specificity or unknown physiology
  • Dormant microbes
  • Abundance within communities

Single-cell sequencing & shotgun metagenomics are two common ways to investigate microbial communities.

Page 32: 2013 bms-retreat-talk

Investigating soil microbial ecology

• What ecosystem level functions are present, and how do microbes do them?
• How does agricultural soil differ from native soil?
• How does soil respond to climate perturbation?
• Questions that are not easy to answer without shotgun sequencing:
  • What kind of strain-level heterogeneity is present in the population?
  • What does the phage and viral population look like?
  • What species are where?

Page 33: 2013 bms-retreat-talk

SAMPLING LOCATIONS

Page 34: 2013 bms-retreat-talk

A “Grand Challenge” dataset (DOE/JGI)

Page 35: 2013 bms-retreat-talk

Assembly results for Iowa corn and prairie (2x ~300 Gbp soil metagenomes)

Total Assembly   Total Contigs (> 300 bp)   % Reads Assembled   Predicted protein coding
2.5 billion      4.5 million                19%                 5.3 million
3.5 billion      5.9 million                22%                 6.8 million

Putting it in perspective:
Total equivalent of ~1200 bacterial genomes
Human genome ~3 billion bp

Adina Howe

Page 36: 2013 bms-retreat-talk

Strain variation?

[Figure: top two allele frequencies (y-axis) vs. position within contig (x-axis)]

Of 5000 most abundant contigs, only 1 has a polymorphism rate > 5%

Can measure by read mapping.
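
One way such a measurement could look once reads are mapped back to a contig: count the bases observed at each position and take the two most frequent. The sketch below is illustrative only. The input format (one string of observed bases per contig position), the function names, and the definition of a polymorphic position as one whose second allele reaches 5% are assumptions, not necessarily the definitions used for the figure.

```python
from collections import Counter


def top_two_allele_frequencies(column):
    """Return (major, minor) allele frequencies for one pileup column.

    `column` is a string of the bases that mapped reads show at one contig
    position, e.g. "AAAAAAAG". The minor frequency is 0.0 if only one
    allele is seen.
    """
    counts = Counter(column)
    depth = sum(counts.values())
    if depth == 0:
        return (0.0, 0.0)
    ranked = [n / depth for _, n in counts.most_common(2)]
    ranked.append(0.0)               # pad in case only one allele is present
    return (ranked[0], ranked[1])


def polymorphism_rate(columns, min_minor_freq=0.05):
    """Fraction of positions whose second allele reaches `min_minor_freq`.

    Both this definition and the default threshold are assumptions made for
    the sketch.
    """
    if not columns:
        return 0.0
    polymorphic = sum(
        1 for col in columns
        if top_two_allele_frequencies(col)[1] >= min_minor_freq
    )
    return polymorphic / len(columns)


if __name__ == "__main__":
    pileup = ["AAAA", "AAAG", "CCCC", "GGGG"]   # four contig positions
    for position, column in enumerate(pileup):
        print(position, top_two_allele_frequencies(column))
    print("polymorphism rate:", polymorphism_rate(pileup))
```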

Page 37: 2013 bms-retreat-talk

Tentative observations from our soil samples:
• We need 100x as much data…
• Much of our sample may consist of phage.
• Phylogeny varies more than functional predictions.
• We see little to no strain variation within our samples
  • Not bulk soil --
  • Very small, localized, and low coverage samples
• We may be able to do selective really deep sequencing and then infer the rest from 16S.
  • Implications for soil aggregate assembly?

Page 38: 2013 bms-retreat-talk

I also work on…

• Genome assembly & analysis

• Transcriptome assembly and analysis

• Interpretation of annoying large data sets

Page 39: 2013 bms-retreat-talk

What are the tissue level changes in gene expression that support regeneration?

Transcriptome analysis of a regenerating vertebrate after SCI

brain, spinal cord

RNA-Seq to determine differential expression profile after injury

Sampling >weekly

-/+ Dex

Ona Bloom

Page 40: 2013 bms-retreat-talk

Training opportunities
• PLB/MMG 810 (Shiu; ??)
• CSE 801/Intro BEACON course (Brown; FS ’13): “Intro to Computational Science for Evolutionary Biologists”
• CSE 801 bootcamp (late Sep)
• Software Carpentry bootcamp(s) (late Sep)
• Workshops in Applied Bioinformatics (Buell; ’14?)
• Next-Gen Sequence Analysis Workshop (Brown; summer ’14)

+ a variety of genomics courses that I can’t keep track of!

Becky Mansel will have these slides.

Page 41: 2013 bms-retreat-talk

Unsolicited advice

Consider both faculty and non-faculty careers.

• It’s a bad time to be looking for faculty positions, and it’s a bad time to be looking for funding; maybe this will improve in 10 years, maybe not.

• A PhD qualifies you for many, many more things than we will (or can) tell you about!

• Specific advice:
  • Network with industry folk; think beyond your advisor’s career.
  • Write a blog: ivory.idyll.org/blog/advice-to-scientists-on-blogging.html