School B&I TCD Bioinformatics. Proteins: structure, function, databases, formats.
Uploaded by philippa-porter
Wot’s a protein, then? Hierarchical
• A collection of amino acids (0-D)
– AACompIdent can identify a protein from AA%s
• A sequence (string) of AAs (1-D)
• Secondary structural elements: α-helix etc. (2-D)
• Domains – (independent) functional units
• Whole protein (from single CDS) (3-D)
• Quaternary structure: dimers, ribosomes
• Interactome, pathways
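The 0-D view (a protein as a bag of amino acids) can be sketched in a few lines. This only illustrates computing AA percentages; it is not how AACompIdent itself works, and the function name and toy sequence are made up for the example.

```python
from collections import Counter

def aa_composition(seq):
    """Return {amino acid: percentage} for a one-letter protein sequence."""
    seq = seq.upper()
    counts = Counter(seq)
    return {aa: 100.0 * n / len(seq) for aa, n in counts.items()}

# Toy sequence (hypothetical, not a real protein)
toy = "LIVYWTVIGE"
comp = aa_composition(toy)

# Print residues from most to least frequent
for aa in sorted(comp, key=comp.get, reverse=True):
    print(f"{aa} {comp[aa]:.0f}%")
```

Note that the composition discards all order information, which is why it is only the bottom level of the hierarchy.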
Protein functions
Amino acid properties again … and again and again
Amino acid groups
• KR (Lys Arg) NH3+ basic
• DE (Asp Glu) COO- acidic
• WYF (Trp Tyr Phe) large aromatic
• GP (Gly, Pro) α-helix-breaking
• C (Cys) disulphide –S–S– bridges
– C also occurs outside disulphide bridges
• etc.
Secondary structure
• α-helix (no Pro, Gly)
– 3.6 residues per turn
– Leucine zipper …LXXXXXXLXXXXXXL…
– Amphipathic helix (charged on one side)
– Transmembrane (α-helix, hydrophobic, ~21 AA long)
• β-sheet – 2-dimensional zigzag
• Coil,random
• Turn (kink)
Easy, like exon prediction
Patterns to recognise (more reliable in MSA than in single seq)
• Alternate hydrophobic residues
– Surface β-sheet (zig-zag-zig-zag)
• Runs of hydrophobic residues
– Interior/buried β-sheet
• Residues with ~3.6 AA spacing (amphipathic)
– α-helix WNNWFNNFNNWNNNF
• Gaps/indels
– Probably surface not core
MSA improves secondary structure (α-helix, β-sheet) prediction by >6%
Conserved residues
• W, F, Y large hydrophobic, internal/core
– conserved WFY best signal for domains
• G, P turns, can mark end of α-helix or β-sheet
• C conserved with reliable spacing suggests C-C disulphide bridges – defensins
• H, S often catalytic sites in proteases (and other enzymes)
• KRDE charged: ligand binding or salt-bridge
• L very common AA but not conserved
– except in leucine zipper L234567L234567L234567L
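The heptad leucine repeat (L234567L234567L234567L) is simple enough to search for with a regular expression. A minimal sketch, assuming four leucines each followed by any six residues; the function name and toy sequence are hypothetical. Because L is so common, hits are only candidates, not evidence of a zipper.

```python
import re

# L, then any 6 residues, three times, then a final L.
# Wrapped in a lookahead so overlapping candidates are all reported.
HEPTAD = re.compile(r"(?=L.{6}L.{6}L.{6}L)")

def find_leucine_zippers(seq):
    """Return 0-based start positions of candidate heptad leucine repeats."""
    return [m.start() for m in HEPTAD.finditer(seq)]

# Toy sequence (hypothetical): leucines at positions 2, 9, 16, 23
toy = "AA" + "LKKKKKK" + "LEEEEEE" + "LRRRRRR" + "L" + "AA"
print(find_leucine_zippers(toy))
```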
Basic information
How big is my protein?
Where are the β-sheets?
Is there a signal peptide?
Is there a trypsin cleavage site?
• ProtParam tool (MWt etc.)
• Tmpred, TMHMM transmembrane helix inside/outside, external loops
• JPRED for 2-D structure
• see practical manual for examples
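The trypsin question can be answered directly from the sequence. A minimal sketch, assuming the commonly quoted rule that trypsin cuts after K or R unless the next residue is P; the function name and toy peptide are made up, and real digests have further exceptions this ignores.

```python
import re

def trypsin_digest(seq):
    """Split seq after every K or R that is not followed by P.

    Relies on re.split accepting zero-width matches (Python 3.7+).
    """
    pieces = re.split(r"(?<=[KR])(?!P)", seq)
    return [p for p in pieces if p]  # drop the empty tail after a C-terminal K/R

# Toy peptide (hypothetical): the R before P is not cleaved
print(trypsin_digest("AKRPLKGR"))
```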
Tertiary structure
• The holy grail of bioinformatics
• 3-D orientation of known α-helices and β-sheets
• Proteins made of functional units, “domains”
– Tried and tested module
– Domain shuffling and exon boundaries
• Bioinf tries to make predictive calls on aspects of the 3-D structure
• Q. Why is 3-D important? A. Structure = function
Difficult, like gene prediction
What binf can do about 3-D
• Expressed/exported proteins have signal peptide
• Hydropathy plot, antigenicity index, amphipathicity give a handle on surface probability
• But homology to a known 3-D structure (X-ray, NMR) is the best predictor – threading.
• Plan to X-ray all “folds” in human genome.
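A hydropathy plot is just a sliding-window average of per-residue hydrophobicity. The sketch below assumes the Kyte-Doolittle scale and a window of 7; both are choices made for the example (other scales and window sizes are in use), and the function name is made up.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic)
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy(seq, window=7):
    """Mean Kyte-Doolittle score for each full window along seq."""
    return [
        sum(KD[aa] for aa in seq[i:i + window]) / window
        for i in range(len(seq) - window + 1)
    ]

# Toy sequence (hypothetical): charged ends, hydrophobic middle
profile = hydropathy("KKDDEELLIIVVLLKKDDEE")
peak = max(range(len(profile)), key=profile.__getitem__)
print(f"most hydrophobic window starts at residue {peak + 1}")
```

Windows with low scores are more likely to be surface-exposed; long high-scoring runs are the transmembrane candidates mentioned earlier.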
recA
SwissProt/UniProt
Some of the 194 lines of info in a SwissProt entry
ID RECA_ECOLI STANDARD; PRT; 352 AA.
AC P0A7G6; P03017; P26347; P78213;
RX MEDLINE=92114994; PubMed=1731246;
RA Story R.M., Weber I.T., Steitz T.A.;
RT "The structure of the E. coli recA protein";
RL Nature 355:318-325(1992).
DR EMBL; V00328; CAA23618.1; -; Genomic_DNA.
DR PDB; 2REB; X-ray; @=-.
DR PRINTS; PR00142; RECA.
DR ProDom; PD000229; RecA; 1.
DR SMART; SM00382; AAA; 1.
DR TIGRFAMs; TIGR02012; tigrfam_recA; 1.
DR PROSITE; PS00321; RECA_1; 1.
FT HELIX 72 85
FT TURN 86 87
FT STRAND 90 94
FT HELIX 101 106
UniProt is the key hub of Bioinformatics databases
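Because every line starts with a two-letter code, the flat-file format is easy to pick apart. A minimal sketch that pulls accession numbers and secondary-structure features out of a few of the lines quoted on the slide; the function name is made up, it only understands the old-style line layout shown here, and for real work a maintained parser would be the better choice.

```python
# A few lines copied from the slide's RECA_ECOLI excerpt
ENTRY = """\
ID RECA_ECOLI STANDARD; PRT; 352 AA.
AC P0A7G6; P03017; P26347; P78213;
FT HELIX 72 85
FT TURN 86 87
FT STRAND 90 94
FT HELIX 101 106
"""

def parse_entry(text):
    """Return (accessions, features) from AC and FT lines."""
    accessions, features = [], []
    for line in text.splitlines():
        code, _, rest = line.partition(" ")  # two-letter line code, then content
        rest = rest.strip()
        if code == "AC":
            accessions += [a.strip() for a in rest.split(";") if a.strip()]
        elif code == "FT":
            kind, start, end = rest.split()[:3]
            features.append((kind, int(start), int(end)))
    return accessions, features

accessions, features = parse_entry(ENTRY)
print(accessions[0], features)
```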
Homology?
LVMFWSIVGE Known1
L   W   GE
LIVYWTVIGE Unknown 40% ID

ILVFYTVVGD Known2
  V  TV G
LIVYWTVIGE Unknown 40% ID
Is Unknown part of the same family?
Or is this just a 4/10 coincidence?
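The 40% figures can be checked by counting identical positions in the ungapped pairs. A minimal sketch using the three sequences from the slide; the function name is made up and no gaps are handled.

```python
def percent_identity(a, b):
    """Percentage of identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length (no gap handling)")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)

known1 = "LVMFWSIVGE"
known2 = "ILVFYTVVGD"
unknown = "LIVYWTVIGE"

print(percent_identity(known1, unknown))
print(percent_identity(known2, unknown))
print(percent_identity(known1, known2))
```

By this measure the two known family members are no more similar to each other than Unknown is to either of them, which is why plain identity is a weak test of family membership here.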
RegEx
LVMFWSIVGE Known1
ILVFYTVVGD Known2
[MILV](3)-[FYW](2)-[STA]-[MILV]-V-G-[DE]
LIVYWTVIGE Unknown
* ***** **
More convincing that it is the same family?
How to modify the RegEx to include the 3rd sequence?
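The pattern can be tested mechanically by translating it into an ordinary regular expression. The converter below is a deliberately small sketch (the function name is made up): it handles only plain residues, [..] sets, x wildcards and (n) or (n,m) repeats, not the full PROSITE syntax. The relaxed pattern is one possible answer to the question, not necessarily the intended one.

```python
import re

def prosite_to_regex(pattern):
    """Convert a simple PROSITE-style pattern to a Python regex string."""
    regex = pattern.rstrip(".").replace("-", "")  # drop terminator and separators
    regex = regex.replace("x", ".")               # x = any residue
    return re.sub(r"\((\d+(?:,\d+)?)\)", r"{\1}", regex)  # (3) -> {3}

slide = "[MILV](3)-[FYW](2)-[STA]-[MILV]-V-G-[DE]"
# One possible relaxation: allow I as well as V at position 8
relaxed = "[MILV](3)-[FYW](2)-[STA]-[MILV]-[IV]-G-[DE]"

seqs = {"Known1": "LVMFWSIVGE", "Known2": "ILVFYTVVGD", "Unknown": "LIVYWTVIGE"}
for name, seq in seqs.items():
    hit_slide = bool(re.fullmatch(prosite_to_regex(slide), seq))
    hit_relaxed = bool(re.fullmatch(prosite_to_regex(relaxed), seq))
    print(name, hit_slide, hit_relaxed)
```

Unknown fails the original pattern only at position 8 (I where the pattern demands V); every widening like this also raises the chance of false positives elsewhere in the database.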
Family Databases
Three methods
Prosite
• Groups families by conserved motif, which is:
• Present in all family members
• Absent in all other proteins
• No/few false positives (selectivity)
• All true positives (sensitivity)
• Motif defined with a Regular expression
What Prosite looks like
ID RECA_1; PATTERN.
AC PS00321;
DT APR-1990 (CREATED); NOV-1997
DE recA signature.
PA A-L-[KR]-[IF]-[FY]-[STA]-[STAD]-[LIVMQ]-R.
NR /RELEASE=49.0,207132;
NR /TOTAL=281(281); /POSITIVE=279(279); /UNKNOWN=0(0);
NR /FALSE_POS=2(2); /FALSE_NEG=11; /PARTIAL=10;
DR Q01840,RECA1_LACLA,T; P48291,RECA1_MYXXA,T;
DR P48292,RECA2_MYXXA,T; Q9ZUP2,RECA3_ARATH,T;
Etc for 70 lines
DR Q7UJJ0,RECA_RHOBA ,N; Q9EVV7,RECA_STRTR ,N;
DR Q4X0X6,EXO70_ASPFU,F; Q5AZS0,EXO70_EMENI,F;
3D 2REB; 2REC;
DO PDOC00131;
Labels: N = false negatives; F = false positives; 3D = PDB structures; DO = documentation
cf SwissProt
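The NR lines give enough numbers to put a figure on how well the pattern performs. A minimal sketch, assuming /POSITIVE counts true positives and /FALSE_POS and /FALSE_NEG count sequences as their names suggest; the function names are made up, and "precision" is used here for what the slides loosely call selectivity.

```python
def sensitivity(tp, fn):
    """Fraction of real family members the pattern finds."""
    return tp / (tp + fn)

def precision(tp, fp):
    """Fraction of pattern hits that are real family members."""
    return tp / (tp + fp)

# Values copied from the NR lines of PS00321 (release 49.0)
TP, FP, FN = 279, 2, 11

print(f"sensitivity {sensitivity(TP, FN):.1%}")
print(f"precision   {precision(TP, FP):.1%}")
```

The 11 false negatives are recA sequences the pattern misses.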
Prosite problems
• RegEx now breaking down as the number of recAs increases, so it no longer defines the protein family
• Database now huge, so the probability of finding any short motif by chance is high.
• Many copies of ELVIS hiding in UniProt
• May be more than 1 motif defining a family
• A great first attempt and still useful but too crude
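The ELVIS point can be made roughly quantitative. The sketch below assumes all 20 amino acids are equally likely and independent, which real proteins are not, so treat the result as an order-of-magnitude guide; the database size is a made-up round number, not a UniProt statistic.

```python
def expected_chance_hits(motif_len, db_residues, alphabet=20):
    """Expected number of exact matches to one fixed motif by chance alone.

    Assumes uniform, independent residues and ignores sequence ends.
    """
    return db_residues * (1.0 / alphabet) ** motif_len

# Hypothetical database of 100 million residues
for motif in ("ELVIS", "ELVISLIVES"):
    hits = expected_chance_hits(len(motif), 100_000_000)
    print(f"{motif}: about {hits:.3g} chance matches")
```

Each extra fixed position cuts the expected count twentyfold, while each position widened to a set such as [LIVM] raises it again.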
Prints
• A database of multiple domains/motifs.
• Multiple motifs abstracted to database
• Stored as probability matrix
• If two proteins have the same motifs in the same order they are likely to be homologous.
• More biological/real/sensitive than ProSite
ProDom
• A French DB
• All-against-all search of the nr protein DB.
• Includes domains with no known function
– cf synteny of non-coding regions
• Great for determining the domain structure of a particular protein.
Pfam
• Moves up from the short, highly conserved, easily aligned bits of a protein family.
• Uses a PSSM (position-specific scoring matrix)
• … on complete aligned family members
PSSM
• Multiple sequence alignment:
1234567890
NSGTIVFLWP
DSGTAIFLKP
ESGTIIFLHN
DSDTVRSLKP
Posn1 50% D,N,E
Posn2 100% S
Posn3 75% G,D
Posn4 100% T
Posn5 50% I,A,V
Posn6 50% I,V,R
Posn7 75% F,S
Posn8 100% L
Posn9 50% K,H,W
Posn0 75% P,N
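The per-position percentages are just column counts over the alignment. A minimal sketch that reproduces them from the four sequences on the slide; the function name is made up, and a real PSSM would go on to turn these frequencies into log-odds scores, usually with pseudocounts, which is not done here.

```python
from collections import Counter

# The four aligned sequences from the slide
MSA = ["NSGTIVFLWP", "DSGTAIFLKP", "ESGTIIFLHN", "DSDTVRSLKP"]

def column_frequencies(alignment):
    """Return one {residue: fraction} dict per alignment column."""
    n = len(alignment)
    return [
        {aa: count / n for aa, count in Counter(column).items()}
        for column in zip(*alignment)  # transpose rows into columns
    ]

freqs = column_frequencies(MSA)
for pos, col in enumerate(freqs, start=1):
    top = max(col, key=col.get)
    others = ",".join(aa for aa in col if aa != top)
    print(f"Posn{pos % 10} {col[top]:.0%} {top}" + (f",{others}" if others else ""))
```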
Domain take home
• Run your protein against
– InterProScan
– CD server at NCBI
– Pfscan
• Likely that the crucial bit of info is only in one of the above.