School B&I TCD Bioinformatics Proteins: structure, function, databases, formats.


Wot’s a protein, then? Hierarchical

• A collection of amino acids (0-D)

– AACompIdent can identify a protein from AA%s

• A sequence (string) of AAs (1-D)

• 2ndry structural elements: α-helix etc. (2-D)

• Domains – (independent) functional units

• Whole Protein (from single CDS) (3-D)

• Quaternary structure: dimers, ribosomes

• Interactome, pathways
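The "0-D" view above treats a protein as a bag of residues. As a rough sketch of the quantity a composition-based tool such as AACompIdent works from, the percentage of each amino acid can be counted as below. The function name `aa_composition` is made up for illustration and this is not AACompIdent's own method.

```python
from collections import Counter

def aa_composition(seq):
    """Return {one-letter code: percentage} for a protein sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    # Percentage of each residue, sorted by one-letter code for readability.
    return {aa: 100.0 * n / len(seq) for aa, n in sorted(counts.items())}

comp = aa_composition("LVMFWSIVGE")
for aa, pct in comp.items():
    print(f"{aa} {pct:.1f}%")
```

Order is thrown away here, which is why composition alone is the weakest level of the hierarchy.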

Protein functions

Amino acid properties again … and again and again

Amino acid groups

• KR (Lys Arg) NH3+ basic

• DE (Asp Glu) COO- acidic

• WYF (Trp Tyr Phe) large aromatic

• GP (Gly, Pro) α-helix breaking

• C (Cys) disulphide –S–S– bridges

– C also occurs outside disulphide bridges

• etc.

Secondary structure

• α-helix (no Pro Gly)

– 3.6 residues per turn

– Leucine zipper …LXXXXXXLXXXXXXL…

– Amphipathic helix (charged on one side)

– Transmembrane (α-helix, hydrophobic, ~21 AA long)

• β-sheet – 2-dimensional zigzag

• Coil, random

• Turn (kink)

Easy like exon prediction

Patterns to recognise (more reliable in MSA than in single seq)

• Alternating hydrophobic residues

– Surface β-sheet (zig-zag-zig-zag)

• Runs of hydrophobic residues

– Interior/buried β-sheet

• Residues with 3.5AA spacing (amphipathic)

– α-helix WNNWFNNFNNWNNNF

• Gaps/indels

– Probably surface not core

MSA improves 2ndary structure (α-helix, β-sheet) prediction by >6%

Conserved residues

• W,F,Y large hydrophobic, internal/core

– conserved WFY best signal for domains

• G,P turns, can mark end of α-helix, β-sheet

• C conserved with reliable spacing points to C-C disulphide bridges (e.g. defensins)

• H,S often catalytic sites in proteases (and other enzymes)

• KRDE charged: ligand binding or salt-bridge

• L very common AA but not conserved

– except in Leucine zipper L234567L234567L234567L
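The leucine zipper spacing above (L, six other residues, L, and so on) maps directly onto a regular expression. This is a minimal sketch: the function name and test sequence are invented for illustration. A pattern this loose will also hit many proteins by chance, so a match is a hint, not proof.

```python
import re

# Four leucines, each seven residues after the previous one: L234567L234567L234567L.
# The lookahead lets overlapping candidates be reported as well.
ZIPPER = re.compile(r"(?=L.{6}L.{6}L.{6}L)")

def find_leucine_zippers(seq):
    """Return 0-based start positions of candidate leucine-zipper repeats."""
    return [m.start() for m in ZIPPER.finditer(seq.upper())]

# Made-up test sequence, not a real protein.
example = "MK" + "LEDKVEE" * 3 + "LQ"
print(find_leucine_zippers(example))
```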

Basic information

How big is my protein?

Where are the beta-sheets?

Is there a signal peptide?

Is there a trypsin cleavage site?

• ProtParam tool (MWt etc.)

• Tmpred, TMHMM transmembrane helix inside/outside, external loops.

• JPRED for 2-D structure

• see practical manual for examples
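The trypsin question can be answered with a few lines of code. The sketch below assumes the common textbook rule (cut after K or R, but not when the next residue is P). Real digestion has more exceptions, and the function names are illustrative only.

```python
import re

def trypsin_sites(seq):
    """0-based indices of K/R residues after which trypsin is expected to cut."""
    seq = seq.upper()
    # K or R not followed by P; a K/R at the very C-terminus gives no new peptide.
    return [m.start() for m in re.finditer(r"[KR](?!P)", seq)
            if m.start() < len(seq) - 1]

def trypsin_digest(seq):
    """Split a sequence into the peptides expected from a complete digest."""
    seq = seq.upper()
    peptides, start = [], 0
    for i in trypsin_sites(seq):
        peptides.append(seq[start:i + 1])
        start = i + 1
    peptides.append(seq[start:])
    return peptides

# Made-up sequence: the R followed by P is not cut.
print(trypsin_digest("MKTAYRPLLKGR"))
```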

Tertiary structure

• The holy grail of bioinformatics

• 3-D orientation of known α, β

• Proteins made of functional units “domains”

– Tried and tested module

– Domain shuffling and exon boundaries

• Bioinf tries to make predictive calls on aspects of the 3-D structure

• Q. Why is 3-D important?

A. Structure = function

Difficult like gene prediction

What bioinf can do about 3-D

• Expressed/exported proteins have signal peptide

• Hydropathy plot, antigenicity index, amphipathicity give a handle on surface probability

• But homology to known 3-D structure (X-ray, NMR) is best predictor – threading.

• Plan to X-ray all “folds” in human genome.
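A hydropathy plot is just a sliding-window average of per-residue hydrophobicity. The sketch below uses the Kyte-Doolittle scale. The window size and function name are assumptions chosen for illustration (a window of about 19 is often used when looking for transmembrane helices).

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(seq, window=9):
    """Mean hydropathy of every window of the given length along the sequence."""
    scores = [KD[aa] for aa in seq.upper()]  # KeyError on non-standard residues
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]

# Made-up sequence: hydrophobic run followed by a charged run.
profile = hydropathy_profile("LLLLLKKKKK", window=5)
print([round(x, 2) for x in profile])
```

High windows suggest buried or membrane-spanning stretches and low windows suggest surface. It remains a probability, not a structure.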

SwissProt/UniProt

Some of the 194 lines of info in a SwissProt entry

ID RECA_ECOLI STANDARD; PRT; 352 AA.

AC P0A7G6; P03017; P26347; P78213;

RX MEDLINE=92114994; PubMed=1731246;

RA Story R.M., Weber I.T., Steitz T.A.;

RT "The structure of the E. coli recA protein";

RL Nature 355:318-325(1992).

DR EMBL; V00328; CAA23618.1; -; Genomic_DNA.

DR PDB; 2REB; X-ray; @=-.

DR PRINTS; PR00142; RECA.

DR ProDom; PD000229; RecA; 1.

DR SMART; SM00382; AAA; 1.

DR TIGRFAMs; TIGR02012; tigrfam_recA; 1.

DR PROSITE; PS00321; RECA_1; 1.

FT HELIX 72 85

FT TURN 86 87

FT STRAND 90 94

FT HELIX 101 106
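Every line of the entry starts with a two-letter code, so the format is easy to pull apart by hand. The sketch below groups a few of the lines above by code and extracts the accessions and secondary-structure features. The helper names are invented, and the column spacing in `ENTRY` is an assumption, since the parser only relies on whitespace.

```python
from collections import defaultdict

ENTRY = """\
ID   RECA_ECOLI     STANDARD;      PRT;   352 AA.
AC   P0A7G6; P03017; P26347; P78213;
DR   PROSITE; PS00321; RECA_1; 1.
FT   HELIX        72     85
FT   TURN         86     87
FT   STRAND       90     94
FT   HELIX       101    106
"""

def parse_flatfile(text):
    """Group the lines of one entry by their two-letter line code."""
    records = defaultdict(list)
    for line in text.splitlines():
        if line.strip():
            records[line[:2]].append(line[2:].strip())
    return dict(records)

def accessions(records):
    """All accession numbers from the AC line(s), primary accession first."""
    return [acc.strip() for line in records.get("AC", [])
            for acc in line.split(";") if acc.strip()]

def features(records):
    """(type, start, end) tuples from the FT lines."""
    out = []
    for ft in records.get("FT", []):
        kind, start, end = ft.split()[:3]
        out.append((kind, int(start), int(end)))
    return out

rec = parse_flatfile(ENTRY)
print(accessions(rec))
print(features(rec))
```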

UniProt is the key hub of Bioinformatics databases

Homology?

LVMFWSIVGE Known1

L   W   GE

LIVYWTVIGE Unknown 40% ID

ILVFYTVVGD Known2

  V  TV G

LIVYWTVIGE Unknown 40% ID

Is Unknown part of the same family?

Or is this just a 4/10 coincidence?
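The 40% figures above can be checked by counting identical columns in each ungapped pair. A minimal sketch with illustrative function names:

```python
def match_line(a, b):
    """Show identical residues, with a space where the two sequences differ."""
    return "".join(x if x == y else " " for x, y in zip(a, b))

def percent_identity(a, b):
    """Percentage of identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)

known1, known2, unknown = "LVMFWSIVGE", "ILVFYTVVGD", "LIVYWTVIGE"
for name, seq in (("Known1", known1), ("Known2", known2)):
    print(name, repr(match_line(seq, unknown)), percent_identity(seq, unknown))
```

Both pairs score 4/10, but on different columns, which is the reason a family-wide pattern says more than any single pairwise comparison.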

RegEx

LVMFWSIVGE Known1

ILVFYTVVGD Known2

[MILV](3)-[FYW](2)-[STA]-[MILV]-V-G-[DE]

LIVYWTVIGE Unknown

* ***** **

More convincing that it is same family?

How to modify the RegEx to include the 3rd sequence?
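The pattern above is written PROSITE-style, and a small translator turns it into an ordinary regular expression that can be tested against all three sequences. This sketch handles only the common syntax elements (residue sets, `x`, repeat counts, exclusions and anchors). The function name and the widened pattern at the end are illustrative, one possible answer rather than the only one.

```python
import re

def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a Python regex string."""
    body = pattern.strip().rstrip(".")
    body = body.replace("{", "[^").replace("}", "]")   # {AB} = anything but A or B
    body = body.replace("(", "{").replace(")", "}")    # (3) = repeat count
    body = body.replace("x", ".").replace("-", "")     # x = any residue
    body = body.replace("<", "^").replace(">", "$")    # N-/C-terminal anchors
    return body

seqs = {"Known1": "LVMFWSIVGE", "Known2": "ILVFYTVVGD", "Unknown": "LIVYWTVIGE"}

original = "[MILV](3)-[FYW](2)-[STA]-[MILV]-V-G-[DE]"
widened = "[MILV](3)-[FYW](2)-[STA]-[MILV]-[IV]-G-[DE]"  # allow I at position 8

for pat in (original, widened):
    rx = re.compile(prosite_to_regex(pat))
    print(rx.pattern, {name: bool(rx.fullmatch(s)) for name, s in seqs.items()})
```

With the original pattern the Unknown sequence fails only at position 8 (I where V is required), and widening that one position brings it in.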

RegEx

Family Databases

Three methods

Prosite

• Groups families by conserved motif.

Which is

• Present in all family members

• Absent in all other proteins

• No/few false positives (selectivity)

• All true positives (sensitivity)

• Motif defined with a Regular expression

What Prosite looks like

ID RECA_1; PATTERN.

AC PS00321;

DT APR-1990 (CREATED); NOV-1997

DE recA signature.

PA A-L-[KR]-[IF]-[FY]-[STA]-[STAD]-[LIVMQ]-R.

NR /RELEASE=49.0,207132;

NR /TOTAL=281(281); /POSITIVE=279(279); /UNKNOWN=0(0);

NR /FALSE_POS=2(2); /FALSE_NEG=11; /PARTIAL=10;

DR Q01840,RECA1_LACLA,T; P48291,RECA1_MYXXA,T;

DR P48292,RECA2_MYXXA,T; Q9ZUP2,RECA3_ARATH,T;

Etc for 70 lines

DR Q7UJJ0,RECA_RHOBA ,N; Q9EVV7,RECA_STRTR ,N;

DR Q4X0X6,EXO70_ASPFU,F; Q5AZS0,EXO70_EMENI,F;

3D 2REB; 2REC;

DO PDOC00131;

Key: N = false negatives; F = false positives; 3D = PDB structures; DO = documentation. Cf SwissProt.
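The NR lines of the entry above hold the counts needed to put numbers on "selectivity" and "sensitivity". A rough sketch follows: the parsing regex and function name are assumptions, and the values in parentheses on the NR lines are simply ignored.

```python
import re

NR_LINES = [
    "NR /TOTAL=281(281); /POSITIVE=279(279); /UNKNOWN=0(0);",
    "NR /FALSE_POS=2(2); /FALSE_NEG=11; /PARTIAL=10;",
]

def parse_counts(lines):
    """Collect the /KEY=number fields from PROSITE NR lines."""
    counts = {}
    for line in lines:
        for key, value in re.findall(r"/(\w+)=(\d+)", line):
            counts[key] = int(value)
    return counts

c = parse_counts(NR_LINES)
tp, fp, fn = c["POSITIVE"], c["FALSE_POS"], c["FALSE_NEG"]
selectivity = tp / (tp + fp)   # fraction of hits that are real recA
sensitivity = tp / (tp + fn)   # fraction of real recA that are hit
print(f"selectivity {selectivity:.3f}, sensitivity {sensitivity:.3f}")
```

On these counts the pattern is very selective but already misses 11 true family members.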

Prosite problems

• RegEx now breaking down as recAs increase so no longer defines the protein

• Database now huge so prob of finding any short motif is high.

• Many copies of ELVIS hiding in UniProt

• May be more than 1 motif defining a family

• A great first attempt and still useful but too crude

Prints

• A database of multiple domains/motifs.

• Multiple motifs abstracted to database

• Stored as probability matrix

• If two proteins have the same motifs in the same order they are likely to be homologous.

• More biological/real/sensitive than ProSite

ProDom

• A French DB

• All against all search of the nr protein DB.

• Includes domains with no known function

– cf synteny of non coding regions

• Great for determining the domain structure of a particular protein.

Pfam

• Moves up from the short, highly conserved, easily aligned bits of a protein family.

• Uses a PSSM (position-specific scoring matrix)

• … on complete aligned family members

PSSM

• Multiple sequence alignment:

1234567890

NSGTIVFLWP

DSGTAIFLKP

ESGTIIFLHN

DSDTVRSLKP

Posn1 50% D,N,E

Posn2 100% S

Posn3 75% G,D

Posn4 100% T

Posn5 50% I,A,V

Posn6 50% I,V,R

Posn7 75% F,S

Posn8 100% L

Posn9 50% K,H,W

Posn0 75% P,N
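The per-position percentages above come straight from counting each alignment column. The sketch below reproduces them and then turns the counts into a simple log-odds matrix for scoring a new sequence. The pseudocount, the uniform 1/20 background and the function names are assumptions for illustration, not how any particular database builds its matrices.

```python
import math
from collections import Counter

ALIGNMENT = ["NSGTIVFLWP", "DSGTAIFLKP", "ESGTIIFLHN", "DSDTVRSLKP"]
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def column_frequencies(alignment):
    """Observed residue frequencies for every column of the alignment."""
    n = len(alignment)
    return [{aa: c / n for aa, c in Counter(col).items()}
            for col in zip(*alignment)]

def log_odds_matrix(alignment, pseudocount=1.0):
    """Per-column log2(observed / background) scores with a small pseudocount."""
    n, bg = len(alignment), 1.0 / len(ALPHABET)
    matrix = []
    for col in zip(*alignment):
        counts = Counter(col)
        matrix.append({aa: math.log2(((counts[aa] + pseudocount * bg)
                                      / (n + pseudocount)) / bg)
                       for aa in ALPHABET})
    return matrix

def score(matrix, seq):
    """Sum of the column scores for an ungapped sequence of the same length."""
    return sum(col[aa] for col, aa in zip(matrix, seq))

freqs = column_frequencies(ALIGNMENT)
print(freqs[0], freqs[1])
pssm = log_odds_matrix(ALIGNMENT)
print(score(pssm, "DSGTIIFLKP"), score(pssm, "AAAAAAAAAA"))
```

Unlike a regular expression, a sequence with one unusual residue still gets a score; it just scores lower.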

Domain take home

• Run your protein against

– InterProScan

– CD server at NCBI

– Pfscan

• Likely that the crucial bit of info is only in one of the above.
