Towards the Management of Information Quality in Proteomics David Stead University of Aberdeen.

Page 1

Towards the Management of Information Quality in Proteomics

David Stead, University of Aberdeen

Page 2

What is Proteomics?

The large-scale study of proteins of an organism, cell or tissue

Colony morphologies of Candida albicans wild-type and nrg1 mutant

Electron micrograph of a breast cancer cell (picture courtesy of the National Cancer Institute)

MALDI protein imaging of a human glioblastoma slice (Stoeckli et al. Nature Medicine 7, 493)

Page 3

“Classical” proteomics

Separation (2-dimensional gel electrophoresis)

[Figure: 2-D gel image; axes pI (7 to 4) and Mr (10 to 100 kDa)]

Quantification (intensity of staining of protein spot)

[Figure: normalised spot volume]

Identification (peptide mass fingerprinting)

[Figure: peptide mass spectrum; x-axis Mass (m/z), 700 to 3150; y-axis % Intensity; base peak (BP) = 1823.0]

Biological function

Page 4

Should we be concerned about information quality in proteomics?

• More, larger, datasets being generated
• Combine datasets from different labs
– Answer new biological or technical questions
• Quality of information may affect decisions on how the data is used

Steven Carr et al. (2004) Molecular & Cellular Proteomics 3, 531

…a significant but undefined number of the proteins being reported as “identified” in proteomics articles are likely to be false positives.

Page 5

Assessing the quality of protein identifications

Difficulties:
• Expert scrutiny of original MS data is not practical for large datasets
• No established minimum acceptance criteria for protein identifications by MS

Hypothesis: Any peptide mass fingerprinting search report contains information that enables a universal quality score to be calculated

Page 6

Protein identification by peptide mass fingerprinting

[Diagram: protein chain (H2N to COOH) with labelled sites K, KR, R and KP, cleaved by tryptic digestion]

>Candida albicans|CA0001|IPF19501 unknown function

MYQTDHGVHNVDGRMSRYIIIPDRSTIRPLLTSNLIAGSLLPSLHCSVSLFLDRVRSSLSSVSVPARVSLPRCFWLSKCLSLGARVRSLFPSLSLSRSYSSSSGPALLYSSVVHSPFLFLLLHSSLFRLLSSPLSSCSLQHLLILNSQWTHRRWEGATQFSSVKGISAVFRPSRASMCPRGFFXCSVCVPLSFRVSIGPFMLFRVPIGFSCISGPLAICFPFNEFLSCLPFLLFRFLFHPLQFLSGLPLLHYSPVINPRPFGFPHPAQPSSYV

in silico digestion

Theoretical mass lists (example): 783.3858, 889.5141, 1089.5898, 1089.6163, 1106.6204, 1166.6390, 1239.6004, 1628.7234, 2733.4504, 3223.7871, 3398.7783

[Figure: MALDI-TOF peptide mass spectrum; x-axis Mass (m/z), 700 to 3150; y-axis % Intensity; base peak (BP) = 1823.0]

Experimental mass list

Search engine (compares the experimental mass list with the theoretical mass lists)

Protein sequence database
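The in silico digestion and matching steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the method of any particular search engine: the cleavage rule (after K or R, but not before P), the 0.1 Da tolerance, the function names and the toy sequence are all assumptions introduced here.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # H2O added to the summed residue masses
PROTON = 1.00728   # proton of a singly charged [M+H]+ ion


def cleave(sequence):
    """Split after K or R, except when the next residue is P (assumed rule)."""
    pieces, start = [], 0
    for i, aa in enumerate(sequence):
        next_aa = sequence[i + 1] if i + 1 < len(sequence) else ""
        if aa in "KR" and next_aa != "P":
            pieces.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        pieces.append(sequence[start:])
    return pieces


def digest(sequence, max_missed=1):
    """Return (peptide, missed_cleavages) pairs with up to max_missed missed sites."""
    pieces = cleave(sequence)
    peptides = []
    for i in range(len(pieces)):
        for missed in range(max_missed + 1):
            if i + missed < len(pieces):
                peptides.append(("".join(pieces[i:i + missed + 1]), missed))
    return peptides


def mh_plus(peptide):
    """Monoisotopic [M+H]+ mass; raises KeyError for non-standard codes such as X."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER + PROTON


def match_masses(experimental, theoretical, tol_da=0.1):
    """Experimental masses lying within tol_da (assumed tolerance) of a theoretical mass."""
    return [m for m in experimental
            if any(abs(m - t) <= tol_da for t in theoretical)]


# Toy sequence (hypothetical, not taken from the slides).
theoretical = [mh_plus(p) for p, _ in digest("MKAGRPLKSER")]
```

Real search engines also handle modifications, ambiguity codes and probabilistic scoring, none of which is attempted here.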

Page 7

Protein identification quality indicators

Hit ratio (HR) – the number of masses matched divided by the number of masses submitted to the search

– Provides a measure of the signal-to-noise ratio in the mass spectrum

[Figure: spectrum processing converts the mass spectrum (peaks m1 to m10 along the m/z axis) into a peptide mass fingerprint mass list (m1 to m10); after database searching, the peaks matched to the protein are highlighted]

HR = 6/10 = 0.6
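As a small sketch of the hit ratio defined above, the following assumes the submitted and matched masses are available as simple lists. The function name and the list representation are illustrative choices, not part of any search engine's report format.

```python
def hit_ratio(submitted_masses, matched_masses):
    """HR = number of masses matched / number of masses submitted to the search."""
    if not submitted_masses:
        raise ValueError("no masses were submitted to the search")
    return len(matched_masses) / len(submitted_masses)


# Slide example: 10 masses submitted (m1 to m10), 6 of them matched.
# Which six peaks matched is not recoverable here, so the choice is arbitrary.
submitted = [f"m{i}" for i in range(1, 11)]
matched = submitted[:6]
hr = hit_ratio(submitted, matched)  # 6/10
```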

Page 8

Protein identification quality indicators

Mass coverage (MC) – the percent sequence coverage multiplied by the protein mass in kDa

MC = (55752 / 1000) x (25 / 100) = 13.9 kDa

– Measures the amount of protein sequence matched
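The worked example above can be reproduced with a short function. This sketch assumes the protein mass is given in Da and the sequence coverage as a percentage, as in the slide's arithmetic; the function and parameter names are made up for illustration.

```python
def mass_coverage(protein_mass_da, sequence_coverage_percent):
    """MC in kDa: protein mass (converted to kDa) times fractional sequence coverage."""
    return (protein_mass_da / 1000.0) * (sequence_coverage_percent / 100.0)


# Slide example: a 55752 Da protein with 25% sequence coverage.
mc = mass_coverage(55752, 25)  # about 13.9 kDa
```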

Page 9

Protein identification quality indicators

Excess of limit-digested peptides (ELDP) – the number of matched peptides having no missed cleavages minus the number of matched peptides containing a missed cleavage site

– reflects the completeness of the digestion that precedes the peptide mass fingerprinting

ELDP = 5 - 3 = +2
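A minimal sketch of the ELDP calculation, assuming each matched peptide is described by its number of missed cleavage sites. A peptide with more than one missed site is counted once as a missed-cleavage peptide here, which is an interpretation of the definition above rather than something the slide states.

```python
def eldp(missed_cleavages):
    """ELDP = (peptides with no missed cleavages) - (peptides with at least one)."""
    limit_digested = sum(1 for n in missed_cleavages if n == 0)
    partially_digested = sum(1 for n in missed_cleavages if n > 0)
    return limit_digested - partially_digested


# Slide example: 5 limit-digested peptides and 3 with a missed cleavage site.
# The individual counts below are illustrative.
example = [0, 0, 0, 0, 0, 1, 1, 2]
score = eldp(example)  # 5 - 3
```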

Page 10

Protein identification quality indicators

David A. Stead, Alun Preece, and Alistair J. P. Brown. Universal metrics for quality assessment of protein identifications by mass spectrometry. MCP, published March 27, 2006

www.mcponline.org/cgi/reprint/M500426-MCP200v1

Streptomyces coelicolor, Clostridium difficile, Methanococcus jannaschii

Page 11

ROC analysis shows that HR, MC, and ELDP can discriminate between correct and incorrect protein identifications

PMF score = (100 * HR) + MC + (10 * ELDP)

Data from 581 PMF experiments (protein identifications from 2-D gel spots)

[Figure: ROC curves (Sensitivity against 1 - Specificity, both 0 to 1) for MC, HR, Mascot, ELDP and PMF score, with a "No discrimination" reference line]
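The combined PMF score can be written as a one-line function. The example call reuses the values from the earlier indicator slides (HR = 0.6, MC = 13.9 kDa, ELDP = +2) purely for illustration; the slides do not say that those three values belong to the same identification.

```python
def pmf_score(hr, mc_kda, eldp):
    """PMF score = (100 * HR) + MC + (10 * ELDP), with MC in kDa."""
    return 100.0 * hr + mc_kda + 10.0 * eldp


# Illustrative inputs taken from the separate HR, MC and ELDP examples.
example_score = pmf_score(0.6, 13.9, 2)  # 60 + 13.9 + 20, about 93.9
```

The weights 100 and 10 put the three indicators on broadly comparable scales, so no single term dominates the sum for typical values.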

Page 12

Qurator provides an information quality (IQ) framework

• Extend generic ontology of IQ concepts
– Allow scientists to define quality characteristics specific to their domain
  • HR, MC, ELDP
• Framework for managing IQ
– Allow scientists to use their own IQ definitions
– ... and reuse those created by others
• Annotate experimental data with quality characteristics
– Produce “quality-aware” information resources
– Allow user-scientists to access/select/filter data according to their quality preferences

www.qurator.org

Page 13

Making the Qurator framework useful

• A key aim of the Qurator project is to integrate IQ tools with existing standards
– IQ indicators should apply to common data formats
– Qurator functions should be plugged into tools already used by scientists
• For proteomics we have aligned Qurator with
– the PEDRo standard data model (and its XML serialisation)
– the Pedro data entry tool

sourceforge.net/projects/pedro

Page 14

PEDRo: a standard format for proteomics data

Taylor CF et al. (2003) Nature Biotechnology 21, 247

PEDRo schema

Section of XML output from PEDRo data collator tool

Page 15

Qurator Pedro Plugin

When a data model is selected, the Qurator Pedro plugin queries the IQ ontology to discover indicators relevant to the kind of data, e.g. for the PEDRo proteomics model, HR, MC and ELDP

Values for the calculated indicators for the selected data items are displayed along with basic provenance data (e.g. timestamp…)

Web services that calculate the IQ indicators can be invoked using the “Plugins” button

Page 16

Conclusions & future work

• Numerical indicators (HR, MC, and ELDP) that describe the quality of protein identifications by peptide mass fingerprinting
– Useful for validation of protein identifications
– Can be computed from search reports (e.g. Mascot)
• The proteomics case is a proof-of-concept for the Qurator IQ framework
– We are working to embed Qurator services in a wider range of desktop tools (e.g. Taverna workflow environment)
– Further usability/usefulness trials of the tools are planned

Page 17

Acknowledgements

Alun Preece, Binling Jin

Al Brown

Paulo Missier, Suzanne Embury

Computing Science

Medical Sciences

Computer Science

www.qurator.org www.abdn.ac.uk/proteomics