Functional Annotation of Proteins via the CAFA Challenge Lee Tien Duncan Renfrow-Symon Shilpa...

Post on 03-Jan-2016

Functional Annotation of Proteins via the CAFA Challenge
Lee Tien, Duncan Renfrow-Symon, Shilpa Nadimpalli, Mengfei Cao

COMP150PBT | Fall 2010

What’s the problem?
1. Huge bottleneck = finding a protein’s function when given a protein sequence
2. Incomplete, inaccurate, or inconsistent annotations are difficult to work with and can propagate
3. No good way to measure the accuracy of an annotation predictor

What is the CAFA Challenge?

What are Gene Ontology (GO) terms?
• GO = controlled vocabulary of “gene ontologies”
• Cover three domains:
  ▫ Cellular component
  ▫ Molecular function
  ▫ Biological process
• Hierarchy:
  ▫ Broad/general (e.g. “catalytic activity”)
  ▫ Specific (e.g. “leukotriene-C4-synthase activity”)
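To make the broad-to-specific hierarchy concrete, here is a minimal sketch that walks from a specific term up to its domain. The parent map is a toy built only from the term names on this slide; the real ontology has intermediate terms between them, and the `PARENTS` and `ancestors` names are illustrative assumptions.

```python
# Toy "is-a" hierarchy built from the term names on the slide.
# Assumption: intermediate GO terms between these levels are omitted.
PARENTS = {
    "leukotriene-C4-synthase activity": ["catalytic activity"],
    "catalytic activity": ["molecular function"],
    "molecular function": [],  # one of the three GO domains
}


def ancestors(term, parents=PARENTS):
    """Return every broader term reachable from `term` via parent links."""
    seen = []
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(parents.get(current, []))
    return seen


broader_terms = ancestors("leukotriene-C4-synthase activity")
```

Annotating a protein with a specific term implies every broader term above it, which is why a predictor can be partially right when it names only the general term.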

Outline of Our Approach

[Flow diagram: CAFA targets (FASTA sequences) → BLAST, PFAM, SMURF?, Betawrap Pro?, other secondary structure predictor? → GO ids for each CAFA target]
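Every step of the approach starts from the target sequences in FASTA format. The following is a minimal sketch of reading such input into memory; the function name, the sample identifiers, and the sample sequences are hypothetical and not taken from the CAFA target set.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a {identifier: sequence} dictionary.

    The identifier is the first whitespace-delimited token after '>'.
    """
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            tokens = line[1:].split()
            header = tokens[0] if tokens else ""
            records[header] = []
        elif header is not None:
            # Sequence lines may be wrapped; collect and join them later.
            records[header].append(line)
    return {name: "".join(chunks) for name, chunks in records.items()}


# Hypothetical two-record input, for illustration only.
example = """>T00001 hypothetical target
MKTAYIAK
QRQISFVK
>T00002
MDLDMNGG
"""
targets = parse_fasta(example)
```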

Pfam: Protein Family Database
• Collection of protein families represented by:
  ▫ Multiple sequence alignments
  ▫ Hidden Markov Models
• Two sections of Pfam:
  ▫ A: high-quality, manually-curated
  ▫ B: large, automatically-generated

Sample Multiple Sequence Alignment

Sample Hidden Markov Model
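As a rough illustration of how a multiple sequence alignment becomes a position-specific model, the sketch below computes per-column residue frequencies. This is a deliberate simplification: a real profile Hidden Markov Model also has insert and delete states and transition probabilities, none of which are modeled here. The toy alignment is hypothetical.

```python
from collections import Counter


def column_frequencies(alignment):
    """Return, per alignment column, the relative frequency of each residue.

    Gap characters ('-') are ignored; an all-gap column yields an empty dict.
    """
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res != "-")
        total = sum(counts.values())
        if total:
            profile.append({res: n / total for res, n in counts.items()})
        else:
            profile.append({})
    return profile


# Hypothetical three-sequence alignment, four columns wide.
toy_alignment = ["ACD-", "ACE-", "GCDK"]
profile = column_frequencies(toy_alignment)
```

A conserved column (here the second, always C) ends up with a single residue at frequency 1.0, while variable columns spread their probability over several residues.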

BLAST: Basic Local Alignment Search Tool
• Goal: find homologous (i.e. derived from a common ancestor) sequences from a database
• Various BLAST programs:
  ▫ blastp = query: protein, database: protein
  ▫ blastn = query: nucleotide, database: nucleotide
  ▫ blastx = query: translated nucleotide, database: protein
  ▫ tblastn = query: protein, database: translated nucleotide
  ▫ tblastx = query: translated nucleotide, database: translated nucleotide
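The program list above is effectively a lookup table keyed on query type and database type. A minimal sketch of that lookup follows; the dictionary and function names are illustrative and not part of BLAST itself.

```python
# Query/database combinations exactly as listed on the slide.
BLAST_PROGRAMS = {
    ("protein", "protein"): "blastp",
    ("nucleotide", "nucleotide"): "blastn",
    ("translated nucleotide", "protein"): "blastx",
    ("protein", "translated nucleotide"): "tblastn",
    ("translated nucleotide", "translated nucleotide"): "tblastx",
}


def choose_blast_program(query, database):
    """Return the BLAST program name for a query/database type pair."""
    try:
        return BLAST_PROGRAMS[(query, database)]
    except KeyError:
        raise ValueError(
            f"no BLAST program listed for query={query!r}, database={database!r}"
        )


# CAFA targets are protein sequences searched against protein databases.
program = choose_blast_program("protein", "protein")
```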

SMURF: Structural Motifs Using Random Fields
• Determines whether a protein sequence contains one of the following super secondary structures:
  ▫ 6-bladed propeller
  ▫ 7-bladed propeller
  ▫ 8-bladed propeller
  ▫ Double blades (i.e. 6-6, 6-7, 6-8…)
• Developed at Tufts!
• Some propeller functions:
  ▫ Often WD40 repeat: protein-protein interaction
  ▫ Signaling, transcription, cell cycle

Smurf!

[Figure: 7-bladed propeller]

Final Database Structure

INPUT
• cafa_targets: cafa_id, uniprot_id, gi_access_id

RESULTS
• blast_results: cafa_id, pdb_id, refseq_id, e_value_score
• pfam_results: cafa_id, pfam_id
• smurf_results: cafa_id, template_id, p_value_score

MAPPING
• pdb_id → go_id
• refseq_id → uniprot_id
• uniprot_id → go_id
• pfam_id → go_id
• template_id → go_id

OUTPUT
• go_results: cafa_id, go_id, source, confidence
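The INPUT → RESULTS → MAPPING → OUTPUT flow can be sketched with an in-memory SQLite database. Only the Pfam path is shown. Table and column names for cafa_targets, pfam_results, and go_results follow the slide; the mapping table name `pfam_to_go`, the column types, the placeholder identifiers, and the constant confidence value are assumptions, since the slides do not show them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cafa_targets (cafa_id TEXT PRIMARY KEY, uniprot_id TEXT, gi_access_id TEXT);
CREATE TABLE pfam_results (cafa_id TEXT, pfam_id TEXT);
CREATE TABLE pfam_to_go   (pfam_id TEXT, go_id TEXT);  -- hypothetical table name
CREATE TABLE go_results   (cafa_id TEXT, go_id TEXT, source TEXT, confidence REAL);
""")

# Placeholder rows; these identifiers are not real Pfam or GO accessions.
conn.execute("INSERT INTO cafa_targets VALUES ('T00001', NULL, NULL)")
conn.execute("INSERT INTO pfam_results VALUES ('T00001', 'PF_EXAMPLE')")
conn.executemany(
    "INSERT INTO pfam_to_go VALUES (?, ?)",
    [("PF_EXAMPLE", "GO:EXAMPLE1"), ("PF_EXAMPLE", "GO:EXAMPLE2")],
)

# Join each result row to its mapping rows to produce output annotations.
# The confidence of 1.0 is a stand-in; the slides do not say how it was set.
conn.execute("""
INSERT INTO go_results (cafa_id, go_id, source, confidence)
SELECT r.cafa_id, m.go_id, 'pfam', 1.0
FROM pfam_results AS r
JOIN pfam_to_go AS m ON m.pfam_id = r.pfam_id
""")

rows = conn.execute(
    "SELECT cafa_id, go_id, source FROM go_results ORDER BY go_id"
).fetchall()
```

The BLAST and SMURF paths would follow the same join pattern through their own mapping tables, with BLAST needing the extra refseq_id → uniprot_id hop before reaching go_id.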

Final Results Statistics

[Venn diagram: Distribution of sequence hits by method. Region counts: 789, 69, 12, 19, 4, 3,445, 1,356]

Of 8,904 unknown sequences…
• 4,265 had at least one hit in PDB BLAST
• 4,824 had at least one hit in Pfam
• 104 had at least one hit in SMURF

In total, 5,694 unique sequences had at least one hit, a 63.9% success rate
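The seven counts and the three per-method totals can be cross-checked against each other. The sketch below assumes the seven numbers are the regions of a three-set Venn diagram (BLAST, Pfam, SMURF) and searches for every assignment of counts to regions that reproduces the stated totals. The region layout and all variable names are assumptions, not something the slide states.

```python
from itertools import permutations

# Counts and totals as they appear on the slide.
region_counts = [789, 69, 12, 19, 4, 3445, 1356]
totals = {"blast": 4265, "pfam": 4824, "smurf": 104}
n_targets = 8904

# Assumed layout: the seven regions of a three-set Venn diagram.
regions = [
    ("blast",), ("pfam",), ("smurf",),
    ("blast", "pfam"), ("blast", "smurf"), ("pfam", "smurf"),
    ("blast", "pfam", "smurf"),
]


def consistent_assignments():
    """Return every count-to-region assignment matching the method totals."""
    found = []
    for perm in permutations(region_counts):
        sums = {method: 0 for method in totals}
        for region, count in zip(regions, perm):
            for method in region:
                sums[method] += count
        if sums == totals:
            found.append(dict(zip(regions, perm)))
    return found


solutions = consistent_assignments()
union = sum(region_counts)                     # sequences with any hit
coverage = round(100 * union / n_targets, 1)   # percent of all targets
```

Under that assumption exactly one assignment fits: 789 BLAST only, 1,356 Pfam only, 69 SMURF only, 3,445 BLAST and Pfam, 12 BLAST and SMURF, 4 Pfam and SMURF, and 19 in all three. The counts sum to 5,694, which is 63.9% of 8,904.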

Example Result

T38114

MDLDMNGGNKRVFQRLGGGSNRPTTDSNQKVCFHWRAGRCNRYPCPYLHRELPGPGSGPVAASSNKRVADESGFAGPSHR

RGPGFSGTANNWGRFGGNRTVTKTEKLCKFWVDGNCPYGDKCRYLHCWSKGDSFSLLTQLDGHQKVVTGIALPSGSDKLY

TASKDETVRIWDCASGQCTGVLNLGGEVGCIISEGPWLLVGMPNLVKAWNIQNNADLSLNGPVGQVYSLVVGTDLLFAGT

QDGSILVWRYNSTTSCFDPAASLLGHTLAVVSLYVGANRLYSGAMDNSIKVWSLDNLQCIQTLTEHTSVVMSLICWDQFL

LSCSLDNTVKIWAATEGGNLEVTYTHKEEYGVLALCGVHDAEAKPVLLCSCNDNSLHLYDLPSFTERGKILAKQEIRSIQ

IGPGGIFFTGDGSGQVKVWKWSTESTPILS

• BLAST: matches with PDB structures 2OVP, 3MKS, 2CNX, 1P22, 1NEX, 3N0E
  ▫ Transcription, mitosis, methylation, protein binding
• Pfam: match to family PF00642
  ▫ Zinc ion binding, nucleic acid binding
• SMURF: match to 7-bladed β-propeller template
  ▫ WD domain (protein binding)

Possible Future Directions
• Improving functional annotation for β-propellers identified by SMURF
  ▫ Analyze training set of propeller proteins with known function to build probabilistic model of protein function based on propeller type
• Addition of other structural prediction tools for motifs with known function
  ▫ G protein-coupled receptors, membrane-bound proteins
• Expansion of BLAST search to include full nr database
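One simple reading of the first direction is to estimate P(function | propeller type) by relative frequency over a labeled training set. The slides do not specify the intended model, so this frequency-based estimate is a stand-in, and the training pairs below are hypothetical.

```python
from collections import Counter, defaultdict


def function_given_propeller(training):
    """Estimate P(function | propeller type) by relative frequency.

    `training` is an iterable of (propeller_type, function) pairs.
    """
    counts = defaultdict(Counter)
    for propeller_type, function in training:
        counts[propeller_type][function] += 1
    return {
        ptype: {f: n / sum(c.values()) for f, n in c.items()}
        for ptype, c in counts.items()
    }


# Hypothetical training pairs, for illustration only.
training = [
    ("7-bladed", "protein binding"),
    ("7-bladed", "protein binding"),
    ("7-bladed", "signaling"),
    ("6-bladed", "catalytic activity"),
]
model = function_given_propeller(training)
```

With a table like this, a SMURF hit to a given template could carry a confidence per function instead of a single fixed annotation.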

Questions?