Understanding Biological Function in Times of High Throughput and Low Output


Chart data, first series (labeled "Column B"):

protein interactions: 0.631846654696
gene expression: 0.631846654696
literature: 0.536536474892
profile-profile alignments: 0.369482390229
ortholog: 0.369482390229
sequence properties: 0.344164582244
phylogeny: 0.16184302545
sequence alignments: -0.0613005258638
other functional information: -0.0613005258638
machine learning based method: -0.284444077178
sequence-profile alignments: -0.591928776926

Chart data, second series (labeled "Column B"):

profile-profile alignments: 0.565957694461
literature: 0.50986822781
ortholog: 0.342814143146
sequence properties: 0.317496335162
protein interactions: 0.317496335162
gene expression: 0.317496335162
phylogeny: 0.135174778368
sequence alignments: -0.087968772946
other functional information: -0.087968772946
machine learning based method: -0.087968772946
sequence-profile alignments: -0.2131319159

Iddo Friedberg

Iowa State University

http://iddo-friedberg.net

@iddux

Big Data in my lab

Gene block evolution

Images and Genomes

Host/Microbiome

Database error and bias

Critical Assessment of Protein Function Annotations

Understanding methods

Understanding the data

Understanding Methods: The Critical Assessment of protein Function Annotations

Pedja

Wyatt

Sean

Tal

Alex

Large Data Biology has a Bad Rap?

"So we now have a culture which is based on everything must be high-throughput. I like to call it low-input, high-throughput, no-output biology."

Sydney Brenner

Motivation: The Knowledge Gap

The gap between data and Information

Information

Data

Temperton & Giovannoni Curr. Opin. Microbiology (2012)

Errors Accumulate in Databases

Schnoes A et al (2009) PLoS Computational Biology, 5 (12)

Assigning Function to Proteins

Low-ish throughput

High throughput

Machine learning

Most Proteins are Annotated Electronically

Compiled from the GOA project, EBI, 6/2011

Problems

Most genes are annotated electronically

Databases have a high error rate which is growing

Homology transfer is less effective

Solutions?

Assess accuracy of annotation software

Write better software

Challenges in Picking Targets

Can't use databases: circularity problem

Experimental groups have a small sharing timeframe

Function description too vague for precise GO annotation

There are unknown unknowns

Choose an annotated protein

Prediction method uses said annotation to predict function

Circular logic... is circular

Choosing Assessment Benchmarks

[Figure: timeline with three time points: challenge opens, submission deadline, assessment time]

Case 1: function unknown, function still unknown, function still unknown

Case 2: function unknown, function still unknown, function known (benchmark?)
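The timeline above can be read as a selection rule: a target becomes a benchmark only if it had no experimental annotation when the challenge opened and gained one by assessment time. The following is a minimal sketch of that rule, not the CAFA organizers' actual pipeline. The function name, the dictionary layout, and the toy protein IDs are assumptions made for illustration.

```python
def select_benchmarks(targets, annotated_at_open, annotated_at_assessment):
    """Pick targets that gained experimental annotations during the challenge.

    targets: iterable of protein IDs released as prediction targets.
    annotated_at_open / annotated_at_assessment: hypothetical dicts mapping
    protein ID -> set of experimentally supported GO terms at each time point.
    """
    benchmarks = {}
    for protein in targets:
        before = annotated_at_open.get(protein, set())
        after = annotated_at_assessment.get(protein, set())
        # Function unknown at the start, known at assessment time.
        if not before and after:
            benchmarks[protein] = set(after)
    return benchmarks


# Toy data (illustrative only).
targets = ["P1", "P2", "P3"]
at_open = {"P3": {"GO:0005515"}}
at_assessment = {
    "P2": {"GO:0003677"},
    "P3": {"GO:0005515", "GO:0005524"},
}
benchmarks = select_benchmarks(targets, at_open, at_assessment)
print(benchmarks)
```

In this toy run, P1 never gains an annotation and P3 was already annotated when the challenge opened, so only P2 qualifies. Because predictions are submitted before the new annotations exist, this avoids the circularity problem described earlier.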

[Figure: Molecular Function precision/recall, with curves labeled BLAST and Naive]

[Figure: Biological Process precision/recall, with curves labeled BLAST and Naive]
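For readers unfamiliar with how precision and recall apply to function prediction, here is a minimal sketch under stated assumptions. Each method is assumed to assign a score between 0 and 1 to each predicted term, and terms at or above a threshold count as predicted. Averaging precision over proteins with at least one prediction, and recall over all benchmark proteins, is one common convention. The slides do not specify the exact procedure, and the function name and data layout are hypothetical.

```python
def precision_recall(predictions, truth, threshold):
    """Average precision and recall over benchmark proteins at one threshold.

    predictions: {protein: {term: score}} (hypothetical layout).
    truth: {protein: set of experimentally verified terms}.
    """
    precisions, recalls = [], []
    for protein, true_terms in truth.items():
        if not true_terms:
            continue  # a benchmark protein needs at least one true term
        scores = predictions.get(protein, {})
        predicted = {term for term, score in scores.items() if score >= threshold}
        hits = len(predicted & true_terms)
        if predicted:
            # Precision is only defined when something was predicted.
            precisions.append(hits / len(predicted))
        recalls.append(hits / len(true_terms))
    precision = sum(precisions) / len(precisions) if precisions else 0.0
    recall = sum(recalls) / len(recalls) if recalls else 0.0
    return precision, recall


# Toy data with placeholder term names.
truth = {"P1": {"A", "B"}, "P2": {"C"}}
predictions = {"P1": {"A": 0.9, "B": 0.4, "D": 0.6}, "P2": {"C": 0.3}}
for t in (0.5, 0.3):
    print(t, precision_recall(predictions, truth, t))
```

Sweeping the threshold from 1 down to 0 traces out a precision/recall curve like the ones on the slides. This sketch ignores the ontology structure. A real evaluation would typically propagate each term to its ancestors before comparing.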

Case Study: hPNPase

Gadi Schuster (Technion)

Successful Methods?

log(obs/exp)

Biological Process

Molecular Function
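The slide does not define "obs" and "exp". One plausible reading is that the chart shows keyword enrichment: how often a method keyword appears among the top-performing methods (observed) relative to how often it appears among all submitted methods (expected). The sketch below follows that reading. The function name, the keyword sets, and the choice of natural logarithm are all assumptions.

```python
import math


def log_enrichment(keyword, top_methods, all_methods):
    """log(obs/exp) for one keyword.

    top_methods, all_methods: lists of keyword sets, one set per method
    (hypothetical layout). obs is the fraction of top methods using the
    keyword; exp is the fraction of all methods using it.
    """
    obs = sum(keyword in m for m in top_methods) / len(top_methods)
    exp = sum(keyword in m for m in all_methods) / len(all_methods)
    if exp == 0:
        raise ValueError(f"keyword {keyword!r} is not used by any method")
    if obs == 0:
        return float("-inf")  # keyword absent from all top methods
    return math.log(obs / exp)


# Toy data (illustrative only, not the CAFA results).
all_methods = [
    {"sequence alignments", "literature"},
    {"sequence alignments"},
    {"literature", "gene expression"},
    {"sequence alignments"},
]
top_methods = [all_methods[0], all_methods[2]]
for kw in ("literature", "sequence alignments"):
    print(kw, round(log_enrichment(kw, top_methods, all_methods), 3))
```

Under this reading, a positive value means the keyword is over-represented among the top methods and a negative value means it is under-represented.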

CAFA2 vs. CAFA1

CAFA2 was held in 2014-2015

More targets (100,000 vs. 50,000)

More groups (56 vs. 29)

100,000 targets

147 participants

Methods have improved

CAFA Conclusions & What's Next

Homology transfer still rules.

Combined methods work best

Molecular Function is easier to predict than Biological Process

Generally, the field can use improvement

Comparison of metrics is very much needed

Why do methods perform differently under different metrics?

Is there a best metric? What is best?

Databases are biased

Leaf terms, Molecular Function (chart labels):

protein binding
protein homodimerization activity
zinc ion binding
transcription activator activity
chromatin binding
transcription repressor activity
transcription factor activity
two-component sensor activity
specific transcriptional repressor activity
DNA binding
calcium ion binding
identical protein binding
manganese ion binding
ATP binding
beta-galactoside alpha-2,3-sialyltransferase activity
magnesium ion binding
enzyme binding
electron carrier activity
structural constituent of ribosome
metal ion binding

David Ream (MU)

Alexander Thorman (MU)

Alexandra Schnoes (UCSF)

Protein BindingActivity

Annotations per article

Schnoes et al PLoS Comp Biol (2013)

Information is in an inverse relationship to the number of proteins annotated
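The slide does not say how "information" is measured. A common choice for ontology terms is information content, the negative log of a term's frequency among annotations, so that a term used everywhere carries little information. The sketch below assumes that definition. The function name and the toy annotations are hypothetical.

```python
import math
from collections import Counter


def information_content(annotations):
    """Return {term: -log2(p(term))} from a list of (protein, term) pairs.

    p(term) is the share of all annotations that use the term, so frequent
    terms get low information content and rare terms get high content.
    """
    counts = Counter(term for _, term in annotations)
    total = sum(counts.values())
    return {term: -math.log2(n / total) for term, n in counts.items()}


# Toy data (illustrative only): three of four annotations are "protein binding".
annotations = [
    ("P1", "protein binding"),
    ("P2", "protein binding"),
    ("P3", "protein binding"),
    ("P4", "ATP binding"),
]
ic = information_content(annotations)
for term, value in ic.items():
    print(term, round(value, 3))
```

In this toy set the widely assigned term "protein binding" scores much lower than "ATP binding". That is the kind of effect the slide points to: a term applied to many proteins tells us less about each one.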
