Towards the Management of Information Quality in Proteomics David Stead University of Aberdeen.
What is Proteomics?
The large-scale study of proteins of an organism, cell or tissue
Colony morphologies of Candida albicans wild-type and nrg1 mutant
Electron micrograph of a breast cancer cell (picture courtesy of the National Cancer Institute)
MALDI protein imaging of a human glioblastoma slice (Stoeckli et al. Nature Medicine 7, 493)
“Classical” proteomics
Separation (2-dimensional gel electrophoresis)
[Figure: 2-D gel; axes Mr (kDa), 10 to 100, and pI, 7 to 4]
Quantification (intensity of staining of protein spot)
[Figure: normalised spot volume]
Identification (peptide mass fingerprinting)
[Figure: mass spectrum (Voyager); % intensity vs. mass (m/z), 700 to 3150]
Biological function
Should we be concerned about information quality in proteomics?
• More, larger, datasets being generated
• Combine datasets from different labs
– Answer new biological or technical questions
• Quality of information may affect decisions on how the data is used
Steven Carr et al. (2004) Molecular & Cellular Proteomics 3, 531:
…a significant but undefined number of the proteins being reported as “identified” in proteomics articles are likely to be false positives.
Assessing the quality of protein identifications
Difficulties:
• Expert scrutiny of original MS data is not practical for large datasets
• No established minimum acceptance criteria for protein identifications by MS
Hypothesis: Any peptide mass fingerprinting search report contains information that enables a universal quality score to be calculated
Protein identification by peptide mass fingerprinting
[Diagram: protein chain (H2N to COOH) with K, R and KP residues marked, cleaved by tryptic digestion]
>Candida albicans|CA0001|IPF19501 unknown function
MYQTDHGVHNVDGRMSRYIIIPDRSTIRPLLTSNLIAGSLLPSLHCSVSLFLDRVRSSLSSVSVPARVSLPRCFWLSKCLSLGARVRSLFPSLSLSRSYSSSSGPALLYSSVVHSPFLFLLLHSSLFRLLSSPLSSCSLQHLLILNSQWTHRRWEGATQFSSVKGISAVFRPSRASMCPRGFFXCSVCVPLSFRVSIGPFMLFRVPIGFSCISGPLAICFPFNEFLSCLPFLLFRFLFHPLQFLSGLPLLHYSPVINPRPFGFPHPAQPSSYV
783.3858, 889.5141, 1089.5898, 1089.6163, 1106.6204, 1166.6390, 1239.6004, 1628.7234, 2733.4504, 3223.7871, 3398.7783
in silico digestion
Theoretical mass lists
[Figure: mass spectrum (Voyager); % intensity vs. mass (m/z), 700 to 3150; BP = 1823.0, 134350]
Experimental mass list (labelled peaks, m/z): 712.27, 756.47, 804.30, 832.33, 842.51, 1159.60, 1521.90, 1561.80, 1641.88, 1718.98, 1791.86, 1809.89, 1822.97, 1836.94, 1850.97, 1895.94, 1910.02, 2041.00, 2211.10, 2283.21, 2509.35, 3075.45, 3093.50
Search engine
MALDI-TOF
Protein
Protein sequence database
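The workflow on this slide (digest database sequences in silico, compute theoretical peptide masses, compare them with the experimental mass list) can be sketched in Python. This is a minimal illustration, not the algorithm of any particular search engine. The function names, the 0.2 Da tolerance and the cleavage rule (cut after K or R unless followed by P, no missed cleavages) are assumptions made for the example. The residue masses are standard monoisotopic values.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # one water per peptide (free N- and C-termini)
PROTON = 1.00728   # singly protonated ion [M+H]+


def tryptic_digest(sequence):
    """Cut after K or R unless the next residue is P (no missed cleavages)."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in ("K", "R") and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def peptide_mh(peptide):
    """Monoisotopic [M+H]+ mass, or None if a residue is non-standard (e.g. X)."""
    if any(aa not in RESIDUE_MASS for aa in peptide):
        return None
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER + PROTON


def match_masses(experimental, theoretical, tol_da=0.2):
    """Experimental masses lying within tol_da of any theoretical mass."""
    return [m for m in experimental
            if any(abs(m - t) <= tol_da for t in theoretical)]


# Hypothetical toy sequence, chosen only to show the cleavage rule.
print(tryptic_digest("AKRPGKC"))  # ['AK', 'RPGK', 'C']
```

A real search would also consider missed cleavages, modifications and a tolerance suited to the instrument; those are left out here to keep the sketch short.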
Protein identification quality indicators
Hit ratio (HR) – the number of masses matched divided by the number of masses submitted to the search
– Provides a measure of the signal-to-noise ratio in the mass spectrum
[Diagram: a mass spectrum (m/z) with ten peaks, m1 to m10, is converted by spectrum processing into the peptide mass fingerprint mass list (m1 to m10); after database searching, the highlighted peaks are those matched to the protein]
HR = 6/10 = 0.6
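The hit ratio definition above reduces to a single division. A minimal sketch, with a hypothetical function name, reproducing the slide's example of 6 matched masses out of 10 submitted:

```python
def hit_ratio(masses_matched, masses_submitted):
    """HR = number of masses matched / number of masses submitted to the search."""
    if masses_submitted <= 0:
        raise ValueError("at least one mass must be submitted")
    return masses_matched / masses_submitted


print(hit_ratio(6, 10))  # 0.6
```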
Protein identification quality indicators
Mass coverage (MC) – the percent sequence coverage multiplied by the protein mass in kDa
– Measures the amount of protein sequence matched
MC = (55752 / 1000) × (25 / 100) = 13.9 kDa
(protein mass 55752 Da, sequence coverage 25%)
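The mass coverage calculation can be written out as a short function. This sketch assumes, as the slide's numbers suggest, that the protein mass is given in Da (55752) and the sequence coverage in percent (25); the function name is hypothetical.

```python
def mass_coverage(protein_mass_da, coverage_percent):
    """MC in kDa: protein mass (converted to kDa) times the fraction of sequence covered."""
    return (protein_mass_da / 1000.0) * (coverage_percent / 100.0)


print(round(mass_coverage(55752, 25), 1))  # 13.9
```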
Protein identification quality indicators
Excess of limit-digested peptides (ELDP) – the number of matched peptides having no missed cleavages minus the number of matched peptides containing a missed cleavage site
– reflects the completeness of the digestion that precedes the peptide mass fingerprinting
ELDP = 5 – 3 = +2
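ELDP can be derived from the sequences of the matched peptides. The sketch below assumes a missed cleavage site is an internal K or R that is not followed by P; that rule, the function names and the example peptides are illustrative assumptions, not taken from the slides. The example list is built to give the slide's 5 – 3 = +2.

```python
def missed_cleavages(peptide):
    """Count internal K/R residues not followed by P (trypsin sites left uncut)."""
    return sum(1 for i in range(len(peptide) - 1)
               if peptide[i] in ("K", "R") and peptide[i + 1] != "P")


def eldp(matched_peptides):
    """Limit-digested peptides minus peptides with at least one missed cleavage."""
    limit = sum(1 for p in matched_peptides if missed_cleavages(p) == 0)
    return limit - (len(matched_peptides) - limit)


# Hypothetical matched peptides: five limit-digested, three with missed cleavages.
matched = ["AGK", "LSR", "TPEK", "VKPR", "GGC", "AKGR", "LRSK", "MKKA"]
print(eldp(matched))  # 2
```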
Protein identification quality indicators
David A. Stead, Alun Preece, and Alistair J. P. Brown. Universal metrics for quality assessment of protein identifications by mass spectrometry. MCP, published March 27, 2006
www.mcponline.org/cgi/reprint/M500426-MCP200v1
Streptomyces coelicolor Clostridium difficile Methanococcus jannaschii
ROC analysis shows that HR, MC, and ELDP can discriminate between correct and incorrect protein identifications
PMF score = (100 * HR) + MC + (10 * ELDP)
Data from 581 PMF experiments (protein identifications from 2-D gel spots)
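The PMF score formula translates directly into code. In this sketch the function name is hypothetical, and the input values are borrowed from the worked examples on the earlier slides (HR = 0.6, MC = 13.9 kDa, ELDP = +2). Those examples need not describe the same protein, so the result only illustrates the arithmetic.

```python
def pmf_score(hr, mc_kda, eldp):
    """PMF score = (100 * HR) + MC + (10 * ELDP)."""
    return 100 * hr + mc_kda + 10 * eldp


print(round(pmf_score(0.6, 13.9, 2), 1))  # 93.9
```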
[Figure: ROC curves (sensitivity vs. 1 - specificity, both 0 to 1) for MC, HR, Mascot, ELDP and PMF score, with a "No discrimination" reference line]
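For readers unfamiliar with ROC analysis, the sketch below shows how the points of such a curve can be computed from scores and known correct/incorrect labels, and how the area under the curve summarises discrimination. The scores and labels are invented toy values, not the 581 PMF experiments, and the function names are hypothetical.

```python
def roc_points(scores, labels):
    """(1 - specificity, sensitivity) pairs, sweeping the threshold from high to low."""
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one correct and one incorrect identification")
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Toy data: True marks a correct identification.
toy_scores = [90, 80, 70, 60, 50, 40]
toy_labels = [True, True, False, True, False, False]
print(round(auc(roc_points(toy_scores, toy_labels)), 3))  # 0.889
```

An area of 0.5 corresponds to the "No discrimination" diagonal; an area of 1.0 means the score separates correct from incorrect identifications perfectly.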
Qurator provides an information quality (IQ) framework
• Extend generic ontology of IQ concepts
– Allow scientists to define quality characteristics specific to their domain
  • HR, MC, ELDP
• Framework for managing IQ
– Allow scientists to use their own IQ definitions
– ... and reuse those created by others
• Annotate experimental data with quality characteristics
– Produce “quality-aware” information resources
– Allow user-scientists to access/select/filter data according to their quality preferences
www.qurator.org
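The idea of annotating data with quality characteristics and filtering by user preference can be illustrated with a small sketch. It does not use or reproduce any Qurator API; the record type, field names, thresholds and data below are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AnnotatedIdentification:
    """A protein identification annotated with quality indicators (hypothetical record)."""
    spot_id: str
    protein: str
    hr: float
    mc_kda: float
    eldp: int


def filter_by_quality(records, min_hr=0.0, min_mc_kda=0.0, min_eldp=None):
    """Keep records that meet every threshold the user has set."""
    kept = []
    for r in records:
        if r.hr < min_hr or r.mc_kda < min_mc_kda:
            continue
        if min_eldp is not None and r.eldp < min_eldp:
            continue
        kept.append(r)
    return kept


# Hypothetical annotated data, for illustration only.
records = [
    AnnotatedIdentification("spot-1", "protein A", hr=0.60, mc_kda=13.9, eldp=2),
    AnnotatedIdentification("spot-2", "protein B", hr=0.15, mc_kda=4.2, eldp=-3),
]
strict = filter_by_quality(records, min_hr=0.3, min_eldp=0)
print([r.spot_id for r in strict])  # ['spot-1']
```

Each user can pass different thresholds, so the same annotated resource serves different quality preferences.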
Making the Qurator framework useful
• A key aim of the Qurator project is to integrate IQ tools with existing standards
– IQ indicators should apply to common data formats
– Qurator functions should be plugged into tools already used by scientists
• For proteomics we have aligned Qurator with
– the PEDRo standard data model (and its XML serialisation)
– the Pedro data entry tool
sourceforge.net/projects/pedro
PEDRo: a standard format for proteomics data
Taylor CF et al. (2003) Nature Biotechnology 21, 247
PEDRo schema
Section of XML output from PEDRo data collator tool
Qurator Pedro Plugin
When a data model is selected, the Qurator Pedro plugin queries the IQ ontology to discover indicators relevant to the kind of data, e.g. for the PEDRo proteomics model, HR, MC and ELDP
Values for the calculated indicators for the selected data items are displayed along with basic provenance data (e.g. timestamp…)
Web services that calculate the IQ indicators can be invoked using the “Plugins” button
Conclusions & future work
• Numerical indicators (HR, MC, and ELDP) that describe the quality of protein identifications by peptide mass fingerprinting
– Useful for validation of protein identifications
– Can be computed from search reports (e.g. Mascot)
• The proteomics case is a proof-of-concept for the Qurator IQ framework
– We are working to embed Qurator services in a wider range of desktop tools (e.g. Taverna workflow environment)
– Further usability/usefulness trials of the tools are planned
Acknowledgements
Computing Science: Alun Preece, Binling Jin
Medical Sciences: Al Brown
Computer Science: Paulo Missier, Suzanne Embury
www.qurator.org www.abdn.ac.uk/proteomics